You now know what a post-incident review is, its role in the incident response process, and when it should be conducted. In this unit, you’ll dive a little deeper into the details of what makes a post-incident review most effective.
Because incidents differ, the exact makeup of post-incident reviews can be different, too. But there are some common characteristics and components of a good review that can provide you with a solid foundation for carrying out the process.
To conduct a good post-incident review, you must first ensure that everyone is on the same page. Toward that end, remember that language matters; there are terms you should use and some you shouldn’t. Key points in conducting the review are who to include (everyone) and when to do it (within twenty-four to thirty-six hours). This will help you accomplish the purpose of the review: to learn and improve.
We learn from incidents by conducting a post-incident review, which happens during the analysis phase.
You’ve probably heard the adage, attributed to Benjamin Franklin, that “an ounce of prevention is worth a pound of cure.” The accepted meaning is that it’s better to keep a problem from happening than to fix it after it has happened.
In your efforts to achieve a high level of reliability for your systems and services, you should do everything possible to prevent incidents from occurring. However, due to the complexity of those systems, as explained above, prevention isn’t always possible.
As we delve into our post-incident review, we need to be on guard against some human tendencies that can lead us to inaccurate or incomplete conclusions and distract us from accomplishing the core purpose of the review: learning about our systems so as to improve their reliability.
Now you have a roadmap to help you get started on the post-incident review process, but it would also be useful to know about some of the obstacles you might encounter on this journey.
In the field of psychology, counterfactual thinking is a concept that’s associated with the human tendency to invent possible alternatives to past events – how things might have turned out differently.
Counterfactual means “contrary to facts,” and counterfactual reasoning refers to telling a story about events that did not happen, in order to explain the events that did. You can identify counterfactual statements by key phrases:
Mechanistic reasoning refers to the concept that a particular outcome can be inferred from intervention. It’s sometimes called the meddling kids syndrome based on the premise that “Our system would have worked fine … if it hadn’t been for those meddling kids.”
When you use mechanistic reasoning in your post-incident review, you build your conclusions on the fallacy that the systems you work with and within are basically working correctly, and if only those “meddling kids” hadn’t done whatever they did, the failure would not have occurred.
Normativity relates to the human tendency to designate some actions and outcomes as desirable or good and other actions or outcomes as undesirable or bad. A norm is a standard of correctness agreed upon by a society.
When you use normative language in a post-incident review, you judge the decisions and actions of those responding to the incident with the benefit of hindsight. This language implies that there was an obviously correct course of action that the operator should have followed.
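One way to put the cautions about counterfactual and normative language into practice is to scan a draft review for hindsight phrasing before it's shared. The Python sketch below is only an illustration: the phrase list and the function name `flag_hindsight_language` are assumptions made for this example, not an official or exhaustive set, and a match is a prompt to reread the sentence, not proof that it's wrong.

```python
import re

# Illustrative phrase list (an assumption for this sketch, not an
# exhaustive or official set of counterfactual/normative markers).
HINDSIGHT_PHRASES = [
    "should have",
    "could have",
    "would have",
    "if only",
    "failed to",
]

# One case-insensitive pattern that matches any phrase on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in HINDSIGHT_PHRASES) + r")\b",
    re.IGNORECASE,
)


def flag_hindsight_language(text):
    """Return (line_number, phrase, line) for each flagged phrase in text."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            findings.append((number, match.group(1).lower(), line.strip()))
    return findings


# Hypothetical draft text, invented for this example.
draft = """The operator should have checked the dashboard first.
At 14:02 the on-call engineer restarted the queue worker.
If only the alert had fired, the outage would have been shorter."""

for number, phrase, line in flag_hindsight_language(draft):
    print(f"line {number}: '{phrase}' in: {line}")
```

The second line of the sample draft isn't flagged because it describes what happened rather than what someone thinks should have happened. That's the kind of wording a review aims for.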
Humans make mistakes. However, human error is not a diagnosis; it’s a symptom. When human error is deemed to be the reason for a failure, you may stop there instead of further analyzing the incident to determine the root cause – which is a deeper, systemic issue.
System design, organizational context, and personal context all affect when, how and with what impact people make mistakes. “Human error” is a label that causes you to quit investigating at precisely the moment when you’re about to discover something interesting about your system.
“There is also a subtler effect, which is that the fear of on-call is often enough by itself to radically change people’s behavior. Entire development teams reject outright the notion of going on call, because of the impact on their personal lives, family, and in-hours effectiveness.” - Niall Murphy, Microsoft, “Seeking SRE” (O’Reilly)
In some cases, we downplay the significance of an outage or, worse, intentionally mislabel or fail to report a disruption in service for fear of reprimand.
If you ask engineers across different organizations and industries, you will get many different answers about what an incident is.
For some, an event counts as an incident only when a customer is affected.
Others will label disruptions as incidents even if a customer never experienced a thing.
In many cases, subjectivity is an unfortunate property of incidents, and it extends even to how severity levels are assigned.
Exploring Fear of Incidents