You now know what a post-incident review is, its role in the incident response process, and when it should be conducted. In this unit, you’ll dive a little deeper into the details of what makes a post-incident review most effective.
Because incidents differ, the exact makeup of post-incident reviews can be different, too. But there are some common characteristics and components of a good review that can provide you with a solid foundation for carrying out the process.
“An alert is something which requires a human to perform an action.” - PagerDuty, “Alerting Principles”
To understand why alerting can create a problem, you need to think about the purpose of alerts and how they differ from other monitoring components.
Actionable alerts are not:
Logs. Alerts are not records of events; that’s the role of logs.
Notifications. Alerts are not intended to announce non-critical occurrences such as the completion of a software build.
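The distinction above can be sketched as a small routing rule. This is only an illustration, not part of any monitoring product: the `Event` fields, the `Channel` names, and the `route` function are all hypothetical, and it assumes the monitoring rule already knows whether an event needs a human to act.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    ALERT = "alert"                # interrupts a human who must act
    NOTIFICATION = "notification"  # announces something, no action needed
    LOG = "log"                    # recorded for later reference only


@dataclass(frozen=True)
class Event:
    summary: str
    requires_human_action: bool    # hypothetical flag set by the monitoring rule
    worth_announcing: bool = False  # hypothetical flag for non-critical news


def route(event: Event) -> Channel:
    """Send an event to the least intrusive channel that fits it."""
    if event.requires_human_action:
        return Channel.ALERT
    if event.worth_announcing:
        return Channel.NOTIFICATION
    return Channel.LOG


# Example events (made up for illustration).
outage = Event("Checkout API error rate above threshold", requires_human_action=True)
build = Event("Software build completed", requires_human_action=False, worth_announcing=True)
sign_in = Event("User signed in", requires_human_action=False)
```

Only the first event would reach the on-call engineer; the build completion becomes a notification and the sign-in stays in the logs.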
To address some of the challenges around how we communicate, we also want to create a unique channel or space where engineers can discuss the details of the incident: a “conversation bridge” in our persistent group chat tool, which for Tailwind Traders is Microsoft Teams.
We want a channel that is unique to the incident. We don’t want conversations about other engineering efforts, and we don’t want conversations about what people are doing for lunch.
Creating a repeatable response plan means establishing who does what when something goes wrong. We don’t want there to be any question around who is supposed to be doing what.
Because of this, it is important to establish roles and the associated expectations. This isn’t a separation-of-duties exercise. In fact, we want to encourage less of that. It is, however, a way of establishing better coordination and communication. It prevents people from stepping on each other’s toes while enabling collaboration not only across on-call rosters but across the entire organization.
Rosters establish a framework around who is on-call at any given point. A roster, or team, is made up of multiple engineers. Rosters can also contain multiple rotations.
Creating an on-call roster using Azure Table Storage
People shouldn’t have to be on call for long periods of time. It’s detrimental to their health and, therefore, to the systems they create and look after.
To make sure people aren’t expected to respond to problems 24 hours a day, indefinitely, we use rotations, also referred to as scheduled shifts. That way, people can take turns.
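As a minimal sketch of how taking turns might work, the function below picks the on-call engineer for a given moment from a fixed-length, round-robin rotation. The function name, the engineer names, and the one-week shift length are assumptions made for illustration; real schedules usually also handle overrides, holidays, and time zones.

```python
from datetime import datetime, timedelta, timezone


def on_call_engineer(engineers, rotation_start, shift_length, at):
    """Return who is on call at time `at` in a simple round-robin rotation."""
    if not engineers:
        raise ValueError("a rotation needs at least one engineer")
    if at < rotation_start:
        raise ValueError("the rotation has not started yet")
    # Dividing two timedeltas with // gives the number of whole shifts elapsed.
    shifts_elapsed = (at - rotation_start) // shift_length
    return engineers[shifts_elapsed % len(engineers)]


# Hypothetical roster members and schedule.
engineers = ["Ana", "Ben", "Chi"]
start = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
one_week = timedelta(days=7)

first = on_call_engineer(engineers, start, one_week, start)
second = on_call_engineer(engineers, start, one_week, start + timedelta(days=8))
wrapped = on_call_engineer(engineers, start, one_week, start + timedelta(days=21))
```

Because the index wraps around with the modulo operator, the rotation returns to the first engineer once everyone has had a shift.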
Being on call carries a heavy cognitive load, which negatively impacts many aspects of a person’s life.
Alerts play an important role in your reliability monitoring strategy, but in order to be helpful, they must be properly constructed for situations that warrant immediate human attention, and they should be devised with simplicity, scope, and context in mind.
You can configure how alerts are delivered by using action groups in Azure Monitor.
You have learned how to monitor and interact with indicators of the reliability of your systems and create reliability goals, but there is also an important way by which reliability interacts with you.
To create effective actionable alerts, you must understand their components and characteristics. Actionable alerts have simplicity, scope, and context.
Simplicity is self-explanatory: make your alerts easy for you and others to understand, even if you’re reading them after being awakened at 2:00 a.m. Scope and context should be included in the content of the alert.
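One way to keep scope and context in every alert is to make them required fields of the alert itself. The sketch below is an assumption-laden illustration: the `Alert` class, its field names, and the example values are hypothetical and do not correspond to any Azure schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    title: str    # simplicity: one plain sentence, readable at 2:00 a.m.
    scope: str    # what is affected and how widely
    context: str  # what the responder needs to know to get started

    def render(self) -> str:
        """Format the alert as three short lines of plain text."""
        return f"{self.title}\nScope: {self.scope}\nContext: {self.context}"


# Hypothetical alert content for illustration.
alert = Alert(
    title="Checkout API is returning errors",
    scope="All customers in one region",
    context="Error rate rose shortly after the latest deployment",
)
message = alert.render()
```

Because all three fields are required, an alert without scope or context cannot be constructed at all.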
Let’s look at some elements that an actionable alert should always include:
The foundations of building reliable systems, including a good incident response plan, have to start with determining “Who is expected to respond to problems?” and “How do we let them know?”
The best place to start is to establish roles, rosters, and rotations.
Roles: Well-defined responsibilities and expectations of individuals on an on-call team (or roster). The Primary Responder, for example.
Rosters: A group of individuals, each with their own assigned role and understood responsibilities and expectations.
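The definitions above can be modeled with two small data types. This is a sketch under stated assumptions: the `Role` and `Roster` classes, the listed responsibilities, the “Secondary Responder” role, and the engineer names are all hypothetical; only the Primary Responder role name comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    name: str
    responsibilities: tuple[str, ...]  # expectations attached to the role


@dataclass
class Roster:
    name: str
    assignments: dict[str, Role] = field(default_factory=dict)  # engineer -> role

    def assign(self, engineer: str, role: Role) -> None:
        """Give an engineer exactly one role on this roster."""
        self.assignments[engineer] = role

    def who_has(self, role_name: str) -> list[str]:
        """List the engineers currently holding the named role."""
        return [e for e, r in self.assignments.items() if r.name == role_name]


# Hypothetical responsibilities and roster members.
primary = Role("Primary Responder", ("Acknowledge the alert", "Begin the investigation"))
secondary = Role("Secondary Responder", ("Back up the primary responder",))

roster = Roster("Checkout team")
roster.assign("Ana", primary)
roster.assign("Ben", secondary)
```

Keeping responsibilities on the role rather than on the person makes it possible to answer “who does what” by looking up the role, whoever currently holds it.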
If we start to think of incidents as a normal part of a system, then we can also build some formality around the patterns and practices we inevitably see when people instinctually do what they do when something goes wrong.
From the beginning of a problem to analyzing what happened and how, we can measure the phases independently of each other. By doing so, we can look for improvements in each phase.
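Measuring each phase on its own can be as simple as subtracting consecutive timestamps. In the sketch below, the milestone names (detected, acknowledged, mitigated, resolved) and the timestamps are assumptions chosen for illustration; the text does not prescribe a particular set of phases.

```python
from datetime import datetime, timedelta


def phase_durations(milestones):
    """Return the time spent between each consecutive pair of milestones."""
    durations = {}
    for (start_name, start), (end_name, end) in zip(milestones, milestones[1:]):
        if end < start:
            raise ValueError(f"{end_name} is earlier than {start_name}")
        durations[f"{start_name} -> {end_name}"] = end - start
    return durations


# Hypothetical incident timeline.
t0 = datetime(2024, 1, 1, 2, 0)
milestones = [
    ("detected", t0),
    ("acknowledged", t0 + timedelta(minutes=5)),
    ("mitigated", t0 + timedelta(minutes=45)),
    ("resolved", t0 + timedelta(hours=2)),
]
durations = phase_durations(milestones)
```

Comparing these per-phase durations across incidents shows which phase is the best candidate for improvement, rather than hiding everything inside one total.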