The remediation phase is the blurriest of them all. A big reason is that sometimes there’s no difference between an action taken during the response and an action intended to improve the situation (i.e., a remediation step).
Much of incident response is just trial and error, quite honestly. We quickly think through what to do, we do it, we hope for quick feedback, we examine if things improved, and we iterate.
The post-incident review is where the idea of incidents begins to shift from things that are feared and avoided to things that can provide valuable information to a team and business.
Rarely will you find a business today that doesn’t rely heavily on digital services to earn and keep customers. There are going to be problems along the way. Not only that, but customers expect improvements, technology changes, and competitors get smarter.
The methods used to determine when we have a problem have changed over the years.
Alerting a person to a spike in CPU usage isn’t as valuable these days, especially for teams in the process of adopting the cloud. Instead, we want to know when our customer is experiencing a problem while using our system.
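As a rough illustration of alerting on customer experience rather than resource usage, here is a minimal sketch. The `WindowStats` shape, the `should_page` helper, and the 2% threshold are all assumptions made up for this example; they do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Measurements for one monitoring window (hypothetical shape)."""
    total_requests: int
    failed_requests: int
    cpu_percent: float


# Assumed threshold for illustration only; tune it to your own service.
MAX_ERROR_RATE = 0.02


def customer_error_rate(stats: WindowStats) -> float:
    """Fraction of customer requests that failed in this window."""
    if stats.total_requests == 0:
        return 0.0
    return stats.failed_requests / stats.total_requests


def should_page(stats: WindowStats, max_error_rate: float = MAX_ERROR_RATE) -> bool:
    """Page a human only when customers are affected.

    CPU usage is carried along as context but never triggers a page by itself.
    """
    return customer_error_rate(stats) > max_error_rate


# A busy but healthy window: high CPU, almost no failed requests.
busy_but_healthy = WindowStats(total_requests=10_000, failed_requests=20, cpu_percent=97.0)

# A quiet but broken window: low CPU, many failed requests.
quiet_but_broken = WindowStats(total_requests=1_000, failed_requests=80, cpu_percent=15.0)
```

With these sample windows, `should_page(busy_but_healthy)` is false and `should_page(quiet_but_broken)` is true, which is the opposite of what a CPU-based alert would report.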
The problems will vary, but the methods used to determine when a human needs to get involved have evolved.
During and after a post-incident review, many ideas will surface about how to improve not only various aspects of each phase of the lifecycle but also how the team can improve in other areas, such as communication.
During the review, engineers might point out that there were long gaps in the conversation timeline where nobody said anything. It’s helpful for engineers to be verbose about what they are doing, thinking, and even feeling.
Once our detection efforts have been configured to send actionable alerts to the people who build the systems, we need to make sure those alerts reach the right people.
Right People
How do you know who the right people are? In most cases it is situational. Roles, rosters, and rotations can help establish some formality and a standard around responding to incidents.
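To make roles, rosters, and rotations concrete, here is a small sketch of one possible way to model them. The `Rotation` class, the weekly shift length, and the engineer names are hypothetical and chosen only for illustration; real on-call tooling usually also handles overrides, time zones, and escalation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class Rotation:
    """A role filled by a roster of people who take turns (hypothetical model)."""
    role: str                 # e.g. "primary responder"
    roster: Tuple[str, ...]   # people who share the role, in shift order
    start: date               # first day of the first shift
    shift_days: int = 7       # assumed weekly shifts

    def on_call(self, day: date) -> str:
        """Return the roster member holding this role on the given day."""
        if not self.roster:
            raise ValueError("roster is empty")
        if day < self.start:
            raise ValueError("day is before the rotation started")
        elapsed_shifts = (day - self.start).days // self.shift_days
        return self.roster[elapsed_shifts % len(self.roster)]


# Hypothetical roster of three engineers rotating weekly.
primary = Rotation(
    role="primary responder",
    roster=("ana", "bo", "cy"),
    start=date(2024, 1, 1),
)
```

Calling `primary.on_call(date(2024, 1, 8))` returns the second person on the roster, and after three weeks the rotation wraps back to the first.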
Although thinking of incidents in terms of phases allows us to shorten each phase in its own way, responding to and remediating an incident often blur together, especially when actions meant to mitigate or improve the situation have the opposite result.
Now that we’ve covered the foundations of building a good incident response plan, let’s talk about remediation efforts and how supplying context and guidance to on-call engineers, rather than step-by-step procedures, can dramatically reduce the impact of an incident.