Distinguishing Prevention and Preparation

You’ve probably heard all your life the adage, attributed to Benjamin Franklin, that “an ounce of prevention is worth a pound of cure.” The accepted meaning is that it’s better to keep a problem from happening than to fix it after it’s happened. In your efforts to achieve a high level of reliability for your systems and services, you should do everything possible to prevent incidents from occurring. However, due to the complexity of those systems, as explained above, prevention isn’t always possible.

Identifying the Incident Remediation Phase

The remediation phase is the blurriest of them all. A big reason is that sometimes there’s no difference between what takes place during the response and an action intended to improve the situation (i.e., a remediation step). Quite honestly, much of incident response is trial and error: we quickly think through what to do, do it, hope for quick feedback, check whether things improved, and iterate.

Identifying the Incident Detection Phase

The methods used to determine when we have a problem have changed over the years. Alerting a person to a spike in CPU usage isn’t as valuable these days, especially for teams in the process of adopting the cloud. Instead, we want to know when our customer is experiencing a problem while using our system. The problems will vary, but the methods used to determine when a human needs to get involved have evolved.
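As a rough illustration of alerting on what the customer experiences rather than on a resource metric such as CPU, the sketch below decides whether to page a person based on the share of recent requests that failed or were too slow. The names (Request, customer_failure_ratio, should_page_human) and the thresholds (2000 ms, 5%) are assumptions made for this example, not values from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status_code: int   # HTTP status returned to the customer
    latency_ms: float  # how long the customer waited for a response


def customer_failure_ratio(requests, slow_ms=2000.0):
    """Return the fraction of requests that were bad from the customer's view.

    A request counts as bad if it returned a server error (5xx) or took
    longer than slow_ms. Both rules are illustrative assumptions.
    """
    if not requests:
        return 0.0
    bad = sum(
        1 for r in requests
        if r.status_code >= 500 or r.latency_ms > slow_ms
    )
    return bad / len(requests)


def should_page_human(requests, max_failure_ratio=0.05):
    """Page someone only when customer-visible failures exceed the threshold."""
    return customer_failure_ratio(requests) > max_failure_ratio
```

In this sketch a CPU spike on its own never pages anyone; a person is involved only once enough customers in the recent window have seen errors or slow responses.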

Identifying the Incident Readiness Phase

During and after a post-incident review, many ideas will surface about how to improve not only each phase of the lifecycle but also how the team works in other areas, such as communication. For example, during the review, engineers might point out that there were long gaps in the conversation timeline where nobody said anything. It’s helpful for engineers to be verbose about what they are doing, thinking, and even feeling.

Identifying the Response Phase of an Incident

Once our detection efforts have been configured to send actionable alerts to the people who build the systems, we need to make sure those alerts reach the right people. How do you know who the right people are? In most cases it is situational. Roles, rosters, and rotations can help establish some formality and a standard for responding to incidents.
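One way to picture how roles, rosters, and rotations fit together is the minimal sketch below. It assumes a simple round-robin rotation with fixed-length shifts and two hypothetical roles, "primary" and "secondary"; the function name on_call_for and the seven-day shift length are illustrative choices, not a prescribed standard.

```python
from datetime import date


def on_call_for(day, roster, rotation_start, shift_days=7):
    """Return who holds each on-call role on a given day.

    roster is an ordered list of engineer names. The rotation advances one
    position every shift_days, and the next person in the roster acts as
    the secondary (backup) responder.
    """
    if not roster:
        raise ValueError("roster must contain at least one engineer")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    shifts_elapsed = (day - rotation_start).days // shift_days
    primary = roster[shifts_elapsed % len(roster)]
    secondary = roster[(shifts_elapsed + 1) % len(roster)]
    return {"primary": primary, "secondary": secondary}
```

With a roster of three engineers and a rotation starting on a Monday, each person is primary for one week in three, and an alert can be routed by looking up the primary for today's date.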

Improving the Remediation of Incidents

Although thinking of incidents in terms of phases allows us to shorten each phase in its own way, responding to and remediating an incident often blur together, especially when actions meant to mitigate or improve the situation have the opposite result. Now that we’ve covered the foundations of building a good incident response plan, let’s talk about remediation efforts and how supplying context and guidance to on-call engineers, rather than step-by-step procedures, can dramatically reduce the impact of an incident.

Reviewing Azure's Monitoring Tools

Azure Monitor includes a number of features and tools you can use to monitor and measure reliability, using data from many different sources. This article’s focus is to help you keep your “eyes on the prize” – the goal of improving reliability in your organization. You should now understand what reliability is and why it’s important. You know that operational awareness is an essential precursor to monitoring for reliability, and that you need to establish a baseline of “normal” behavior as a first step.

Running a Post-incident Review

We want to develop a sense-making approach to the present situation in order to act upon it better in the future… in the moment. There are different ways to conduct a useful post-incident review, but there are some common practices that can make the process easier and more effective. If you run a facilitated review meeting, keep your review and planning meetings separate, ask better questions, and identify how things went right, you can learn more and use what you learn to improve the reliability of your systems, services, and processes.

Understanding the Full Lifecycle of an Incident

If we start to think of incidents as a normal part of a system, then we can also build some formality around the patterns and practices we inevitably see when people instinctively do what they do when something goes wrong. From the beginning of a problem to analyzing what happened and how, we can measure each phase independently of the others. By doing so, we can look for improvements in each phase.
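Measuring each phase independently can be as simple as subtracting timestamps. The sketch below assumes four recorded moments for an incident (when the problem started, when it was detected, when a responder acknowledged it, and when it was remediated); those timestamp names and the phase_durations function are assumptions made for this example.

```python
from datetime import datetime


def phase_durations(started, detected, acknowledged, remediated):
    """Return how long each phase lasted as timedelta values.

    detection:   problem begins  -> problem is noticed
    response:    problem noticed -> a responder acknowledges it
    remediation: acknowledged    -> impact on customers ends
    """
    return {
        "detection": detected - started,
        "response": acknowledged - detected,
        "remediation": remediated - acknowledged,
    }


incident = phase_durations(
    started=datetime(2024, 3, 1, 9, 0),
    detected=datetime(2024, 3, 1, 9, 12),
    acknowledged=datetime(2024, 3, 1, 9, 17),
    remediated=datetime(2024, 3, 1, 10, 2),
)
for phase, duration in incident.items():
    print(f"{phase}: {duration.total_seconds() / 60:.0f} minutes")
```

Tracking these durations separately across incidents makes it easier to see, for example, whether detection or remediation accounts for most of the time and so where improvement effort is best spent.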

Understanding Why We Learn From Incidents

Incident response doesn’t stop when the incident is over. They say those who don’t study history are doomed to repeat it. Likewise, those who don’t study, analyze, and learn from the incidents they resolved are doomed to keep repeating the process, as well. Your most important means of learning from incidents is the post-incident review. When an incident occurs, your first reaction probably isn’t, “Hurray – a learning opportunity!” Your immediate priority is figuring out what went wrong and fixing it as quickly as possible, to reduce the impact on your customers and end users – as it should be.