For the most part, incidents are unique.
Therefore, the lessons learned will vary from problem to problem. However, it’s helpful to spot trends in response efforts to identify both what is working and what needs improvement.
It’s also helpful for engineering teams to have a sense of how frequently problems arise and how quickly they are addressed and resolved.
When tracking incidents using Azure Boards, it’s quite simple to build reports and charts that provide a high-level snapshot of incident management efforts.
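To make the idea of trends concrete, here is a minimal sketch of the two numbers mentioned above: how often incidents occur and how quickly they are resolved. The record layout (`created` and `resolved` keys) and the sample data are assumptions for illustration only; they are not an Azure Boards export format.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical incident records; real data would come from the tracking tool.
incidents = [
    {"created": datetime(2024, 1, 3, 9, 0), "resolved": datetime(2024, 1, 3, 10, 30)},
    {"created": datetime(2024, 1, 20, 14, 0), "resolved": datetime(2024, 1, 20, 14, 45)},
    {"created": datetime(2024, 2, 11, 2, 15), "resolved": datetime(2024, 2, 11, 5, 15)},
]


def incidents_per_month(records):
    """Count incidents by the month in which they were created."""
    return Counter(r["created"].strftime("%Y-%m") for r in records)


def mean_minutes_to_resolve(records):
    """Average time from creation to resolution, in minutes."""
    return mean(
        (r["resolved"] - r["created"]).total_seconds() / 60 for r in records
    )


print(incidents_per_month(incidents))
print(mean_minutes_to_resolve(incidents))
```

Even a rough summary like this can show whether incidents are becoming more frequent or taking longer to resolve from month to month.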
Now that we have a tool to track the incident details, we need to ensure we are tracking all of the important aspects, such as when we first knew about the problem.
Let’s now take a look at how we can customize Azure Boards to track additional incident details.
When did we know? When a new record (or incident) is created in Azure Boards, we automatically have the date and time as well as a change log throughout the incident’s lifecycle…
Tracking incidents is as easy as setting up a datastore, like the table storage used for the on-call roster. However, why reinvent the wheel? Why not use something already available, customizable, extensible, and free?
Azure Boards is my tool of choice in this tutorial, but honestly this could probably be done with any popular project management tool with an API.
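As a rough illustration of what "a tool with an API" means here, the sketch below builds the request URL and JSON Patch body that the Azure DevOps work item REST API accepts for creating a record. It only constructs the request and does not send it; authentication and the `application/json-patch+json` content type are left out. The organization `contoso`, the project `incidents`, the `Issue` work item type, and the `Custom.DetectionSource` field are all assumptions for illustration and depend on how a given project is configured.

```python
import json


def build_create_url(organization, project, work_item_type):
    """Build the work item creation URL; the type is prefixed with '$'."""
    return (
        f"https://dev.azure.com/{organization}/{project}"
        f"/_apis/wit/workitems/${work_item_type}?api-version=7.0"
    )


def build_incident_patch(title, detection_source):
    """Build a JSON Patch document describing a new incident record."""
    return [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        # "Custom.DetectionSource" is a hypothetical custom field,
        # not one that exists by default.
        {
            "op": "add",
            "path": "/fields/Custom.DetectionSource",
            "value": detection_source,
        },
    ]


# Hypothetical organization, project, and work item type.
url = build_create_url("contoso", "incidents", "Issue")
body = build_incident_patch("Checkout errors", "monitoring")

print(url)
print(json.dumps(body, indent=2))
```

Any project management tool with a comparable API could be substituted; only the URL and payload shape would change.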
First, log in to or create a free Azure DevOps account.
Did our monitoring systems tell us, or did a customer inform us?
Capturing whether problems were detected through telemetry or another method means we can easily identify gaps in our monitoring tools and practices.
It also helps early responders to know where the problem is and what is affected. By communicating to the engineer where the problem was first detected, we provide valuable context in their early triaging efforts.
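One way to surface monitoring gaps is to tally how each incident was first detected. The sketch below assumes every incident carries a simple detection-source label; the labels `monitoring` and `customer` and the sample data are hypothetical.

```python
from collections import Counter

# Hypothetical detection sources recorded on past incidents.
detection_sources = ["monitoring", "customer", "monitoring", "customer", "customer"]


def detection_breakdown(sources):
    """Return each detection source's share of incidents as a fraction."""
    counts = Counter(sources)
    total = len(sources)
    return {source: count / total for source, count in counts.items()}


breakdown = detection_breakdown(detection_sources)
print(breakdown)
```

If a large share of incidents is reported by customers rather than by telemetry, that points to a gap in monitoring coverage worth investigating.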
Incidents are often chaotic and stressful.
How Bad is It?
We may not have any notion of severity or impact and there is no place for us to find out how bad the problem really is, and who is affected. These are tough questions to answer if nothing is tracked.
There are many people who care about how things are going during an incident. We call them stakeholders. From CEOs to sales teams, many more people outside of the engineering team want, need, and deserve to know “How bad is it?”
To recover from incidents effectively, teams need to communicate and collaborate well. Sharing relevant details of what is known about an incident, who is addressing it, and more requires a method of tracking incident details.
Basic details such as:
When did we know about the problem? How did we find out about the problem? Who is aware of the problem? What is being done? How bad is it?
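The questions above map naturally onto the fields of an incident record. The following is a minimal sketch of such a record; the class name, field names, and sample values are assumptions for illustration, not fields defined by Azure Boards.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IncidentRecord:
    title: str
    detected_at: datetime  # When did we know about the problem?
    detection_source: str  # How did we find out about the problem?
    aware: List[str] = field(default_factory=list)  # Who is aware?
    actions: List[str] = field(default_factory=list)  # What is being done?
    severity: Optional[str] = None  # How bad is it? Unknown until assessed.


# Hypothetical incident used only to show how the record fills in over time.
record = IncidentRecord(
    title="Checkout errors",
    detected_at=datetime(2024, 3, 1, 8, 5),
    detection_source="monitoring",
)
record.aware.append("primary on-call")
record.actions.append("Investigating recent deployment")

print(record)
```

Leaving severity empty at creation reflects that the impact is often unknown when the record is first opened.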
What (if anything) is being done?
Is everyone assuming someone else is looking into it?
These types of questions emerge as additional people join to assist.
When alerts are sent to distribution lists or general chat rooms, it’s easy for them to be lost or, at best, acted on late.
As first responders begin to assess the problem, conversations take place in persistent group chat tools such as Microsoft Teams or Slack.
Is this a new incident?
If we are trying to reduce the time it takes to recover from incidents, we will need to start capturing when we are aware of issues.
By examining incidents in phases, we can look for improvements in specific areas such as Detection. If we start capturing when we knew about a problem, patterns will emerge over time that show what could be done to know sooner.
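As a small worked example of looking at an incident in phases, the sketch below splits the early part of the timeline into a detection delay and an acknowledgment delay. The three timestamps are assumptions: in particular, the moment the problem actually started is usually only estimated after the fact.

```python
from datetime import datetime


def phase_durations(started_at, detected_at, acknowledged_at):
    """Return the length of the early incident phases, in minutes."""
    return {
        # Time between the problem starting and anyone knowing about it.
        "detection_minutes": (detected_at - started_at).total_seconds() / 60,
        # Time between knowing about it and a responder acknowledging it.
        "acknowledgment_minutes": (acknowledged_at - detected_at).total_seconds() / 60,
    }


# Hypothetical timeline for a single incident.
durations = phase_durations(
    started_at=datetime(2024, 3, 1, 10, 0),
    detected_at=datetime(2024, 3, 1, 10, 12),
    acknowledged_at=datetime(2024, 3, 1, 10, 17),
)
print(durations)
```

Collected across many incidents, a consistently long detection phase would suggest that the Detection area is where improvement effort should go first.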
Am I the First to Know?
When the primary responder acknowledges an incident, they are announcing their awareness of the alert.
They may not have additional information yet, but like an alarm clock, they have to press the stop button and take action. This not only stops any continued alarms but indicates to others that YES, someone is aware and looking into the problem.
Ok. Who else is aware? Do the right people know there is a problem?