Every business will suffer an incident in one form or another during its life, and it doesn’t have to be specific to IT. How you identify, manage and recover from the incident is what matters. You need to learn from the incident, perform reviews and fix the underlying issues that caused it; otherwise the damage to business reputation and the cost impact (amongst others) will mount, which, if you are unlucky, could be the final nail in the coffin for the business. Having a standard such as ISO 27001 in place, for example, will help a lot with incident management.
Laying the foundations
Before you can start dealing with an incident within any business, you first need to establish what counts as an incident and who will be in the incident team. Depending upon the size of the business, this may be a single go-to person (not ideal but sometimes unavoidable), or it may be the department heads of multiple teams.
Once you have your team in place, you then need to elect someone who will manage the incident team. The incident manager can determine which severity rating an incident is given, such as P1, P2 or P3, by comparing it against an agreed, pre-defined definition. ITIL recommends using three criteria:
- Urgency: Effect on important business deadlines
- Impact: Impact to the business’s finances, reputation and viability
- Severity: Impact to end users, including employees and customers
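To make the idea of an agreed definition a little more concrete, here is a minimal sketch in Python of how the three criteria could be turned into a rating. The `Level` scale, the `Incident` fields and the "worst criterion wins" rule are all my own assumptions for illustration; your business should agree its own scale and rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale for each criterion."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    summary: str
    urgency: Level   # effect on important business deadlines
    impact: Level    # effect on finances, reputation and viability
    severity: Level  # effect on end users (employees and customers)


def classify(incident: Incident) -> str:
    """Return a priority label. Assumed rule: the worst criterion wins."""
    worst = max(incident.urgency, incident.impact, incident.severity)
    return {Level.HIGH: "P1", Level.MEDIUM: "P2", Level.LOW: "P3"}[worst]


# Hypothetical example: one HIGH criterion is enough to make this a P1.
checkout_down = Incident("Checkout unavailable", Level.MEDIUM, Level.HIGH, Level.HIGH)
print(classify(checkout_down))
```

Writing the rule down like this, even if it never runs in production, means the incident manager and the wider team are comparing every incident against the same definition.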
Once you have agreed the definition, you should share it with everyone who will be involved in an incident. For software / IT companies, this could be your operations, development, QA and customer support teams. To ensure that everyone is up to speed, the business should run table top exercises so that everyone is trained and knows what to expect when an incident happens.
When the unfortunate happens, the business needs to keep everyone informed. This should be handled by the incident response team and should go through a pre-defined process of triage and fix. Having people running around like headless chickens, trying to resolve the problem without communicating or letting anyone know what’s going on, can only make things worse, especially in larger teams. Communication and notification during an incident are crucial.
Utilising notification systems such as https://www.statuspage.io/ and https://status.io/ can help your business keep people in the know and open the door to transparency about platform issues. Whether they are used internally, externally or both, they can help everyone see whether the business is performing against its SLAs.
Allowing people to view your availability metrics in real time helps keep the business focused on reliability and availability.
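As a rough illustration of what an availability metric measured against an SLA looks like, here is a small Python sketch. The function names and the 99.9% target are assumptions for the example, not figures from any particular contract.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the period during which the service was available."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def meets_sla(availability: float, target_percent: float = 99.9) -> bool:
    """Compare measured availability with an assumed SLA target."""
    return availability >= target_percent


# A 30-day month has 30 * 24 * 60 = 43,200 minutes.
month = 30 * 24 * 60
for downtime in (30, 60):
    measured = availability_percent(month, downtime)
    print(f"{downtime} min down: {measured:.3f}% (meets SLA: {meets_sla(measured)})")
```

With a 99.9% target, a 30-day month leaves only about 43 minutes of downtime to play with, so a single hour-long outage is enough to miss it.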
However, simply having notifications in place doesn’t help if you are not going to commit 100% to the process. For example, setting up a status page and only updating it once the whole system has been down for 30 minutes will help no one. You should ensure it is kept up to date (through APIs if possible) so that the business can be as transparent as possible.
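One way to avoid the "only update after 30 minutes" trap is to publish an update whenever the health of your components changes. The Python sketch below shows the idea under some loud assumptions: the status labels, the `StatusReporter` class and the `publish` hook are all hypothetical, and `publish` stands in for whatever call your status page provider’s API actually offers.

```python
from typing import Callable, Dict, Optional, Tuple


def overall_status(components: Dict[str, bool]) -> str:
    """Map component health (True = healthy) to an assumed status label."""
    if not components or all(components.values()):
        return "operational"
    if not any(components.values()):
        return "major_outage"
    return "degraded"


class StatusReporter:
    """Publishes an update whenever the status or the failing components change."""

    def __init__(self, publish: Callable[[str, str], None]) -> None:
        self._publish = publish  # hypothetical hook wrapping your provider's API
        self._last: Optional[Tuple[str, Tuple[str, ...]]] = None

    def report(self, components: Dict[str, bool]) -> bool:
        """Return True if an update was published, False if nothing changed."""
        failing = tuple(sorted(name for name, ok in components.items() if not ok))
        state = (overall_status(components), failing)
        if state == self._last:
            return False  # nothing new to tell people
        message = "Issues with: " + ", ".join(failing) if failing else "All systems operational"
        self._publish(state[0], message)
        self._last = state
        return True


# Usage: printing stands in for the real API call.
reporter = StatusReporter(lambda status, message: print(status, "-", message))
reporter.report({"api": True, "web": True})
reporter.report({"api": False, "web": True})  # partial outage is reported straight away
```

The design choice here is that a partial problem is reported as soon as it is detected, rather than waiting for a full outage or a time threshold.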
How you handle an incident and keep people informed during it can be crucial. Hiding an incident from customers, suppliers and even internal staff could open a can of worms. In the worst cases, hiding an incident could end up like what happened to Uber: http://www.bbc.co.uk/news/technology-42075306, although this is the extreme.
Being transparent during incidents will not only help you reduce the number of customer support calls, as your customers can see you’re having issues, but it can also help limit the damage to your business reputation.
Incident management and transparency are only the beginning. Having a robust incident response plan in place, and ensuring that the business and team perform retrospectives and put tasks in place so that repeating incidents are resolved, is key. However, this is sometimes difficult to achieve if there is no buy-in from the business leaders.
I hope this post has helped you think about how your business operates and manages its own incident plans.