For the most part, incidents are unique.
Therefore the lessons learned will vary from problem to problem. However, it’s helpful to spot trends in response efforts to both identify what is working and what needs improvement.
It’s also helpful for engineering teams to have a sense of how frequently problems arise and how quickly they are addressed and resolved.
When tracking incidents using Azure Boards, it’s quite simple to build reports and charts that provide a high-level snapshot of incident management efforts.
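To give a feel for the kind of snapshot such a report provides, here is a minimal sketch that computes mean time to resolve (MTTR) from incident timestamps. The list-of-dicts shape and the `created`/`resolved` field names are illustrative assumptions, not the actual Azure Boards reporting API.

```python
from datetime import datetime, timedelta

def mean_time_to_resolve(incidents):
    """Average resolution time across resolved incidents.

    `incidents` is a list of dicts with ISO-8601 'created' and
    'resolved' timestamps (hypothetical shape, for illustration).
    Incidents that are still open (no 'resolved' value) are skipped.
    """
    durations = [
        datetime.fromisoformat(i["resolved"]) - datetime.fromisoformat(i["created"])
        for i in incidents
        if i.get("resolved")
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)

incidents = [
    {"created": "2023-05-01T10:00:00", "resolved": "2023-05-01T12:30:00"},
    {"created": "2023-05-02T09:00:00", "resolved": "2023-05-02T10:00:00"},
    {"created": "2023-05-03T08:00:00", "resolved": None},  # still open
]
print(mean_time_to_resolve(incidents))  # average of 2.5 h and 1 h → 1:45:00
```

The same approach extends to other trend metrics, such as mean time to acknowledge or incident counts per week.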
Now that we have a tool to track incident details, we need to ensure we are capturing all of the important aspects, such as when we first knew about the problem.
Let’s now take a look at how we can customize Azure Boards to track additional incident details.
When did we know? When a new record (or incident) is created in Azure Boards, we automatically capture the date and time, as well as a change log throughout the incident’s lifecycle…
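That change log can be read back programmatically. The sketch below extracts state transitions from data shaped like the Azure DevOps work item updates response (`GET …/wit/workItems/{id}/updates`); the sample data is hypothetical, and only the fields needed here are shown.

```python
def state_change_history(updates):
    """Return (timestamp, old_state, new_state) tuples for each
    System.State change in a list of work item updates.

    `updates` mimics the shape of the Azure DevOps work item
    updates response; real responses contain many more fields.
    """
    history = []
    for u in updates:
        fields = u.get("fields", {})
        if "System.State" in fields:
            change = fields["System.State"]
            when = fields.get("System.ChangedDate", {}).get("newValue")
            history.append((when, change.get("oldValue"), change["newValue"]))
    return history

# Hypothetical two-update history: incident created, then resolved.
sample_updates = [
    {"fields": {
        "System.State": {"newValue": "New"},
        "System.ChangedDate": {"newValue": "2023-05-01T10:00:00Z"},
    }},
    {"fields": {
        "System.State": {"oldValue": "New", "newValue": "Resolved"},
        "System.ChangedDate": {"oldValue": "2023-05-01T10:00:00Z",
                               "newValue": "2023-05-01T12:30:00Z"},
    }},
]
print(state_change_history(sample_updates))
```

From a history like this you can answer “when did we know?” (the first timestamp) and “when was it resolved?” without tracking those dates by hand.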
Tracking incidents is as easy as setting up a datastore, like the table storage used for the on-call roster. However, why reinvent the wheel? Why not use something already available, customizable, extensible, and free?
Azure Boards is my tool of choice in this tutorial, but honestly this could probably be done with any popular project management tool with an API.
First, sign in to or create a free Azure DevOps account.
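As a preview of what “project management tool with an API” buys us, here is a sketch of the JSON Patch body Azure DevOps expects when creating a work item (`POST …/wit/workitems/$Issue?api-version=7.0`, sent with `Content-Type: application/json-patch+json`). `System.Title` is a real built-in field; `Custom.DetectedAt` and `Custom.Severity` are hypothetical custom fields standing in for whatever you add when customizing the board. The request itself is not sent here.

```python
import json

def incident_payload(title, detected_at, severity):
    """Build the JSON Patch document for creating an incident work item.

    Azure DevOps work item create/update endpoints take a list of
    JSON Patch operations; each 'add' op targets a field path.
    Custom.DetectedAt and Custom.Severity are illustrative only.
    """
    return [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/Custom.DetectedAt", "value": detected_at},
        {"op": "add", "path": "/fields/Custom.Severity", "value": severity},
    ]

body = incident_payload("API latency spike", "2023-05-01T10:00:00Z", "Sev2")
print(json.dumps(body, indent=2))
```

Any tool that can POST this payload (a monitoring webhook, a ChatOps bot, a script) can open an incident for you, which is the point of choosing a tracker with an API.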
If you search online for “Incident Response,” the majority of what you’ll find is information related to security threats and breaches. What doesn’t show up in the results is guidance on how to respond to threats of an entirely different kind.
How should a business respond to technical challenges and failures as they arise, the ones that affect reliability concerns such as availability, latency, and correctness? What happens when service level expectations are breached and it’s time for a human to get involved?
Site Reliability Engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products.
The key concepts to take away from this definition are:
Reliability. You learned in the introductory module that there are multiple aspects to reliability and later in this module, you’ll examine each in more detail. You also learned about the importance of reliability – why it matters.
To conduct a good post-incident review, you must first ensure that everyone is on the same page. Toward that end, remember that language matters; there are terms you should use and some you shouldn’t. Key points in conducting the review are who to include (everyone) and when to do it (within twenty-four to thirty-six hours). This will help you accomplish the purpose of the review: to learn and improve.
We learn from incidents by conducting a post-incident review, which happens during the analysis phase.
As we delve into our post-incident review, we need to be on guard against some human tendencies that can lead us to inaccurate or incomplete conclusions and distract us from accomplishing the core purpose of the review: learning about our systems so as to improve their reliability.
Now you have a roadmap to help you get started on the post-incident review process, but it would also be useful to know about some of the obstacles you might encounter on this journey.
In the field of psychology, counterfactual thinking is a concept that’s associated with the human tendency to invent possible alternatives to past events – how things might have turned out differently.
Counterfactual means “contrary to facts,” and counterfactual reasoning refers to telling a story about events that did not happen, in order to explain the events that did. You can identify counterfactual statements by key phrases:
Mechanistic reasoning refers to the concept that a particular outcome can be inferred from intervention. It’s sometimes called the meddling kids syndrome based on the premise that “Our system would have worked fine … if it hadn’t been for those meddling kids.”
When you use mechanistic reasoning in your post-incident review, you build your conclusions on the fallacy that the systems you work with and within are basically working correctly, and if only those “meddling kids” hadn’t done whatever they did, the failure would not have occurred.
Normativity relates to the human tendency to designate some actions and outcomes as desirable or good and other actions or outcomes as undesirable or bad. A norm is a standard of correctness agreed upon by a society.
When you use normative language in a post-incident review, you judge the decisions and actions of those responding to the incident with the benefit of hindsight. This language implies that there was an obviously correct course of action that the operator should have followed.