Creating an On-call Roster Using Azure Table Storage

On-call rosters allow teams to identify who is responsible for acknowledging and addressing incidents as they occur. They are made up of the names and contact information of everyone expected to take part in the response and remediation of service disruptions.

On-call Roster

Name          Email          Service  On-call
Jason Hand    jason@xyz.com  API      Yes
Chris Smith   chris@xyz.com  API      No
Lauren Jones  lauren@xyz.
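The roster rows above can be sketched as table entities. This is a minimal, offline illustration rather than the unit's actual implementation: the `RosterEntry` class, the `on_call_for` helper, and the choice of `Service` as `PartitionKey` and `Email` as `RowKey` are all assumptions made for the example. Only the two complete rows from the excerpt are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterEntry:
    """One row of the on-call roster (hypothetical helper type)."""
    name: str
    email: str
    service: str
    on_call: bool

    def to_entity(self) -> dict:
        # Azure Table Storage entities require PartitionKey and RowKey.
        # Assumption: partition by service, use the email as the row key.
        return {
            "PartitionKey": self.service,
            "RowKey": self.email,
            "Name": self.name,
            "OnCall": self.on_call,
        }


# The two complete rows from the roster excerpt.
ROSTER = [
    RosterEntry("Jason Hand", "jason@xyz.com", "API", True),
    RosterEntry("Chris Smith", "chris@xyz.com", "API", False),
]


def on_call_for(service: str, roster: list[RosterEntry]) -> list[RosterEntry]:
    """Return the roster entries currently on call for a service."""
    return [entry for entry in roster if entry.service == service and entry.on_call]


if __name__ == "__main__":
    for entry in on_call_for("API", ROSTER):
        print(entry.name, entry.email)
```

Partitioning by service keeps everyone responsible for the same service in one partition, so a lookup of "who is on call for the API" only has to read that partition. Other key designs are possible.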

Breaking Down the Components of a Post-incident Review

You now know what a post-incident review is, its role in the incident response process, and when it should be conducted. In this unit, you’ll dive a little deeper into the details of what makes a post-incident review most effective. Because incidents differ, the exact makeup of post-incident reviews can be different, too. But there are some common characteristics and components of a good review that can provide you with a solid foundation for carrying out the process.

Defining Alerts

“An alert is something which requires a human to perform an action.” - PagerDuty, “Alerting Principles”

To understand why alerting can create a problem, you need to think about the purpose of alerts and how they differ from other monitoring components. Actionable alerts are not:

- Logs. Alerts are not records of events; that’s the role of logs.
- Notifications. Alerts are not intended to announce non-critical occurrences such as the completion of a software build.

Defining Incidents

If you search online for “Incident Response,” most of what you’ll find is information related to security threats and breaches. What doesn’t show up in the results is guidance on how to respond to a different kind of problem. How should a business respond to technical challenges and failures as they come up, the ones that affect reliability concerns such as availability, latency, and correctness? What happens when service level expectations are breached and it’s time for a human to get involved?

Defining Site Reliability Engineering

Site Reliability Engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products. The key concepts to take away from this definition are: Reliability. You learned in the introductory module that there are multiple aspects to reliability, and later in this module you’ll examine each in more detail. You also learned about the importance of reliability and why it matters.

Establishing Communication Channels

To address some of the challenges around how we communicate, we also want a unique channel or space for engineers to discuss the details of the incident: a “conversation bridge” in our persistent group chat tool, which for Tailwind Traders is Microsoft Teams. We want a channel that is unique to the incident. We do not want conversations about other engineering efforts. We don’t want conversations about what people are doing for lunch.

Establishing On-call Roles

Creating a repeatable response plan means establishing who does what when something goes wrong. We don’t want there to be any question about who is supposed to be doing what. Because of this, it is important to establish roles and the associated expectations. This isn’t a separation-of-duties exercise. In fact, we want to encourage less of that. It is, however, a way of establishing better coordination and communication. It prevents people from stepping on each other’s toes while enabling collaboration not only across on-call rosters, but across the entire organization.

Establishing On-call Rosters

Rosters establish a framework around who is on-call at any given point. A roster, or team, is made up of multiple engineers. Rosters can also contain multiple rotations. Next steps: Creating an on-call roster using Azure Table Storage

Establishing On-call Rotations

People shouldn’t have to be on-call for long periods of time. It’s detrimental to their health and, therefore, to the systems they create and look after. To make sure people aren’t expected to respond to problems 24 hours a day, indefinitely, we use rotations, also referred to as scheduled shifts. That way, people can take turns. Being on-call carries a heavy cognitive load, which negatively impacts many aspects of a person’s life.
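One way to picture a rotation is as a fixed-length shift that passes from one engineer to the next in order. The sketch below is only an illustration under stated assumptions: the `on_call_engineer` function, the seven-day shift length, and the 2024-01-01 start date are hypothetical and do not come from the unit itself.

```python
from datetime import date


def on_call_engineer(
    engineers: list[str],
    rotation_start: date,
    day: date,
    shift_days: int = 7,  # assumed shift length; pick whatever suits the team
) -> str:
    """Return who is on call on `day` in a simple round-robin rotation."""
    if not engineers:
        raise ValueError("a rotation needs at least one engineer")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")

    # Count how many whole shifts have elapsed, then wrap around the list.
    shifts_elapsed = (day - rotation_start).days // shift_days
    return engineers[shifts_elapsed % len(engineers)]


if __name__ == "__main__":
    team = ["Jason", "Chris", "Lauren"]
    start = date(2024, 1, 1)  # hypothetical start of the rotation
    print(on_call_engineer(team, start, date(2024, 1, 10)))
```

With three engineers and weekly shifts, each person is on call one week in three, which is the kind of turn-taking the paragraph describes.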

Identifying the Communication Coordinator

The Communication Coordinator works in conjunction with the incident commander to share information with people beyond those in the firefight who are actively working to recover from the incident. That could be customers, the sales and marketing teams, or customer support. Many people within an organization need to be made aware of what’s taking place and how things are progressing.