In other words … unplanned work
Most of what we as engineers and technologists do is planned work.
We spend a lot of time and effort understanding the work in front of us.
We calculate story points. We plan sprints. We have a pretty good idea of what we are supposed to be working on.
So, when an incident occurs, it is disruptive. It is… unplanned work.
We often view this as a terrible thing, but in reality incidents are “investments” in the value we are trying to deliver to end users.
The first role we need to talk about is the “Primary Responder” – the Primary “On-call” engineer.
This person is expected to acknowledge their awareness of an incident once the alert has been received.
Identifying the Secondary Responder
Once our detection efforts have been configured to send actionable alerts to the people who build the systems, we need to make sure they are sending those alerts to the right people.
Right People
How do you know who the right people are? In most cases it is situational. A few things can help establish some formality and a standard around responding to incidents: roles, rosters, and rotations.
Then we have the Secondary Responder, who is there as backup: another engineer who can step in if the Primary Responder is unavailable or unreachable, or if they just need another pair of eyes.
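To make the hand-off from primary to secondary concrete, here is a minimal sketch of paging responders in order until someone acknowledges. The names `Responder`, `page_in_order`, and the `acknowledges` callback are hypothetical, chosen only for illustration; a real paging tool would handle the alerting and the acknowledgement timeout for you.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str
    role: str  # e.g. "primary" or "secondary" (illustrative labels)


def page_in_order(
    responders: Sequence[Responder],
    acknowledges: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in turn and return the first one who acknowledges.

    `acknowledges` is a stand-in (an assumption) for "sent the alert and
    got an acknowledgement back before the timeout".
    """
    for responder in responders:
        if acknowledges(responder):
            return responder
    return None  # nobody acknowledged; a real setup would escalate further


# Example: the primary is unreachable, so the secondary picks up the incident.
on_call = [Responder("alice", "primary"), Responder("bob", "secondary")]
owner = page_in_order(on_call, lambda r: r.name != "alice")
```

The ordering of the list is the whole policy here: whoever is listed first is paged first, and the secondary only hears about the incident if the primary does not acknowledge.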
Identifying the Incident Commander
The foundations of building reliable systems, including a good incident response plan, have to start with determining “Who is expected to respond to problems?” and “How do we let them know?”.
The best place to start is to establish roles, rosters, and rotations.
Roles: Well-defined responsibilities and expectations of individuals on an on-call team (or roster). The Primary Responder, for example.
Rosters: A group of individuals, each with their own assigned role and understood responsibilities and expectations.