As we delve into our post-incident review, we need to be on guard against some human tendencies that can lead us to inaccurate or incomplete conclusions and distract us from accomplishing the core purpose of the review: learning about our systems so as to improve their reliability.
Now that you have a roadmap to help you get started on the post-incident review process, it is also useful to know about some of the obstacles you might encounter along the way.
In the field of psychology, counterfactual thinking is a concept that’s associated with the human tendency to invent possible alternatives to past events – how things might have turned out differently.
Counterfactual means “contrary to facts,” and counterfactual reasoning refers to telling a story about events that did not happen in order to explain the events that did. You can identify counterfactual statements by key phrases such as “could have,” “should have,” “would have,” and “if only.”
Mechanistic reasoning refers to the notion that a particular outcome can be inferred from a particular intervention. It’s sometimes called the “meddling kids” syndrome, after the premise that “Our system would have worked fine … if it hadn’t been for those meddling kids.”
When you use mechanistic reasoning in your post-incident review, you build your conclusions on the fallacy that the systems you work with and within are basically working correctly, and if only those “meddling kids” hadn’t done whatever they did, the failure would not have occurred.
Normativity relates to the human tendency to designate some actions and outcomes as desirable or good and other actions or outcomes as undesirable or bad. A norm is a standard of correctness agreed upon by a society.
When you use normative language in a post-incident review, you judge the decisions and actions of those responding to the incident with the benefit of hindsight. This language implies that there was an obviously correct course of action that the operator should have followed.
Humans make mistakes. However, human error is not a diagnosis; it’s a symptom. When human error is deemed to be the reason for a failure, you may stop there instead of further analyzing the incident to determine the root cause – which is a deeper, systemic issue.
System design, organizational context, and personal context all affect when, how, and with what impact people make mistakes. “Human error” is a label that causes you to quit investigating at precisely the moment when you’re about to discover something interesting about your system.
“There is also a subtler effect, which is that the fear of on-call is often enough by itself to radically change people’s behavior. Entire development teams reject outright the notion of going on call, because of the impact on their personal lives, family, and in-hours effectiveness.” - Niall Murphy, Microsoft, in “Seeking SRE” (O’Reilly)
In some cases, we downplay the significance of an outage or, worse, intentionally mislabel or fail to report a disruption in service for fear of reprimand.
If you ask engineers across different organizations and industries what an incident is, you will get many different answers. Some count a disruption as an incident only when a customer is affected; others will label disruptions as incidents even if a customer never experienced a thing. In many cases, subjectivity is an unfortunate property of incidents, even when it comes to assigning severity levels.
Exploring Fear of Incidents