You’ve probably heard all your life the adage, attributed to Benjamin Franklin, that “an ounce of prevention is worth a pound of cure.” The accepted meaning is that it’s better to keep a problem from happening than to fix it after it’s happened.
In your efforts to achieve a high level of reliability for your systems and services, you should do everything possible to prevent incidents from occurring. However, due to the complexity of those systems, as explained above, prevention isn’t always possible.
The remediation phase is the blurriest of them all. A big reason is that there's often no clear difference between what takes place during the response and an action intended to improve the situation (that is, a remediation step).
Honestly, much of incident response is trial and error. We quickly think through what to do, do it, hope for fast feedback, check whether things improved, and iterate.
The methods used to determine when we have a problem have changed over the years.
Alerting a person to a spike in CPU usage isn't as valuable these days, especially for teams in the process of adopting the cloud. Instead, we want to know when our customers are experiencing a problem while using our system.
The problems will vary, but the methods used to determine when a human needs to get involved have evolved.
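To make the idea concrete, here is a minimal sketch of a symptom-based check that pages on what customers actually feel (failed requests) rather than on a resource metric like CPU. The function names and the 5% threshold are hypothetical, chosen only for illustration:

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
ERROR_RATE_THRESHOLD = 0.05  # page if more than 5% of recent requests fail

events = deque()  # (timestamp, succeeded) pairs for recent requests

def record_request(timestamp: datetime, succeeded: bool) -> None:
    """Record a request outcome and drop events older than the window.

    Assumes timestamps arrive in roughly increasing order.
    """
    events.append((timestamp, succeeded))
    while events and events[0][0] < timestamp - WINDOW:
        events.popleft()

def should_alert() -> bool:
    """Alert on the customer-facing symptom: the failure rate."""
    if not events:
        return False
    failures = sum(1 for _, ok in events if not ok)
    return failures / len(events) > ERROR_RATE_THRESHOLD
```

The point of the design is that the trigger is defined in terms of user experience; the underlying cause (CPU, disk, a bad deploy) is left for the responder to diagnose.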
During and after a post-incident review, many ideas will surface about how to improve not only each phase of the lifecycle but also how the team can improve in other areas, such as communication.
During the review, engineers might point out long gaps in the conversation timeline where nobody said anything. During an incident, it's helpful for responders to be verbose about what they're doing, thinking, and even feeling.
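As an illustration, here is a minimal sketch of how a review facilitator might surface those silent stretches from a timestamped chat transcript. The `silence_gaps` helper and the ten-minute threshold are hypothetical:

```python
from datetime import datetime, timedelta

def silence_gaps(messages, threshold=timedelta(minutes=10)):
    """Find stretches of silence in a list of (timestamp, author, text) tuples."""
    gaps = []
    ordered = sorted(messages, key=lambda m: m[0])
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[0] - prev[0]
        if gap >= threshold:
            gaps.append((prev[0], cur[0], gap))
    return gaps

# Example: a 25-minute silence between two messages would be flagged.
transcript = [
    (datetime(2024, 3, 1, 9, 0), "alice", "Seeing elevated 500s on checkout."),
    (datetime(2024, 3, 1, 9, 25), "bob", "Back. Rolling back the last deploy."),
]
for start, end, gap in silence_gaps(transcript):
    print(f"silence from {start} to {end} ({gap})")
```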
Once our detection efforts are configured to send actionable alerts to the people who build the systems, we need to make sure those alerts reach the right people.
Right People

How do you know who the right people are? In most cases, it's situational. One way to establish some formality and standards around responding to incidents is through the use of roles, rosters, and rotations.
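As a sketch of what a rotation looks like in practice, here is a minimal weekly on-call lookup. The roster names and the epoch date are hypothetical; real teams usually manage this in a paging tool rather than in code:

```python
from datetime import date, timedelta

ROSTER = ["alice", "bob", "carol"]  # incident commander rotation, in order
ROTATION_EPOCH = date(2024, 1, 1)   # a Monday; week 0 of the rotation starts here

def on_call(day: date) -> str:
    """Return who is on call for the week containing the given day."""
    weeks_elapsed = (day - ROTATION_EPOCH).days // 7
    return ROSTER[weeks_elapsed % len(ROSTER)]

print(on_call(date.today()))  # who gets paged this week
```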
Although thinking of incidents in terms of phases lets us shorten each in its own unique way, responding to an incident and remediating it often blur together, especially when actions taken to mitigate or improve the situation have the opposite result.
Now that we've covered the foundations of building a good incident response plan, let's talk about remediation efforts and how supplying context and guidance to on-call engineers, rather than step-by-step procedures, can dramatically reduce the impact of an incident.
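One way to picture the difference is an alert payload that carries context and guidance instead of a rigid checklist. This is only a hedged sketch; all field names and URLs here are hypothetical examples:

```python
# Context tells the responder where to look; guidance suggests, but does
# not dictate, what to try. The responder's judgment fills the gap.
alert = {
    "summary": "Checkout error rate above 5% for 10 minutes",
    "context": {
        "dashboard": "https://example.com/dashboards/checkout",
        "recent_deploys": "https://example.com/deploys?service=checkout",
        "owning_team": "payments",
    },
    "guidance": [
        "Check whether a recent deploy correlates with the error spike.",
        "If only one region is affected, consider shifting traffic.",
        "Escalate to the payments on-call if unclear after 15 minutes.",
    ],
}
```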
Azure Monitor includes a number of features and tools you can use to monitor and measure reliability, using data from many different sources.
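For example, here is a minimal sketch of pulling one such reliability signal (failed request counts) out of a Log Analytics workspace. It assumes the `azure-monitor-query` and `azure-identity` Python packages and workspace-based Application Insights data; the workspace ID is a placeholder:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Authenticate and query a Log Analytics workspace for failed requests,
# bucketed into 5-minute intervals over the last hour.
client = LogsQueryClient(DefaultAzureCredential())

query = """
AppRequests
| where Success == false
| summarize failures = count() by bin(TimeGenerated, 5m)
"""

response = client.query_workspace(
    workspace_id="<your-workspace-id>",
    query=query,
    timespan=timedelta(hours=1),
)

# For brevity, this assumes a fully successful query and ignores
# partial-result handling.
for table in response.tables:
    for row in table.rows:
        print(row)
```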
This article’s focus is to help you keep your “eyes on the prize” – the goal of improving reliability in your organization. You should now understand what reliability is and why it’s important. You know that operational awareness is an essential precursor to monitoring for reliability, and that you need to establish a baseline of “normal” behavior as a first step.
We want to develop a sensemaking approach to the present situation so that we can act on it better, both in the moment and in the future.
There are different ways to conduct a useful post-incident review, but there are some common practices that can make the process easier and more effective. If you run a facilitated review meeting, keep your review and planning meetings separate, ask better questions, and identify how things went right, you can learn more and use what you learn to improve the reliability of your systems, services, and processes.
If we start to think of incidents as a normal part of a system, then we can also build some formality around the patterns and practices that inevitably emerge when people instinctively respond to something going wrong.
From the beginning of a problem to analyzing what happened and how, we can measure each phase independently of the others. By doing so, we can look for improvements in each phase.
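Here is a minimal sketch of what that per-phase measurement can look like. The marker events (impact start, detection, response, resolution) and their timestamps are hypothetical examples of what a team might record:

```python
from datetime import datetime

# Hypothetical marker events recorded for one incident.
incident = {
    "impact_start": datetime(2024, 3, 1, 9, 0),
    "detected":     datetime(2024, 3, 1, 9, 12),
    "responded":    datetime(2024, 3, 1, 9, 20),
    "resolved":     datetime(2024, 3, 1, 10, 5),
}

# Each phase is measured on its own, so each can be improved on its own.
time_to_detect  = incident["detected"]  - incident["impact_start"]
time_to_respond = incident["responded"] - incident["detected"]
time_to_resolve = incident["resolved"]  - incident["responded"]

for name, value in [("detect", time_to_detect),
                    ("respond", time_to_respond),
                    ("resolve", time_to_resolve)]:
    print(f"time to {name}: {value}")
```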
Incident response doesn't stop when the incident is over. They say those who don't study history are doomed to repeat it. Likewise, those who don't study, analyze, and learn from the incidents they've resolved are doomed to repeat them.
Your most important means of learning from incidents is the post-incident review.
When an incident occurs, your first reaction probably isn’t, “Hurray – a learning opportunity!” Your immediate priority is figuring out what went wrong and fixing it as quickly as possible, to reduce the impact on your customers and end users – as it should be.