For the most part, incidents are unique.
Therefore, the lessons learned will vary from problem to problem. However, it’s helpful to spot trends in response efforts to identify both what is working and what needs improvement.
It’s also helpful for engineering teams to have a sense of how frequently problems arise and how quickly they are addressed and resolved.
When tracking incidents using Azure Boards, it’s quite simple to build reports and charts that provide a high-level snapshot of incident management efforts.
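To make "how frequently" and "how quickly" concrete, here is a minimal Python sketch of the kind of numbers such a report might summarize. It is not how Azure Boards builds its charts; the incident records, the field names (`created`, `acknowledged`, `resolved`), and both helper functions are hypothetical and stand in for whatever your tracking tool exports.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice these would be exported
# from the tracking tool rather than hard-coded.
incidents = [
    {"id": 1, "created": datetime(2024, 1, 1, 9, 0),
     "acknowledged": datetime(2024, 1, 1, 9, 5),
     "resolved": datetime(2024, 1, 1, 10, 0)},
    {"id": 2, "created": datetime(2024, 1, 3, 14, 0),
     "acknowledged": datetime(2024, 1, 3, 14, 15),
     "resolved": datetime(2024, 1, 3, 14, 30)},
    {"id": 3, "created": datetime(2024, 1, 10, 22, 0),
     "acknowledged": datetime(2024, 1, 10, 22, 10),
     "resolved": datetime(2024, 1, 10, 23, 30)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


def incidents_per_week(records):
    """Count incidents by (ISO year, ISO week) of their creation time."""
    counts = {}
    for r in records:
        iso = r["created"].isocalendar()
        key = (iso[0], iso[1])
        counts[key] = counts.get(key, 0) + 1
    return counts


print("Mean minutes to acknowledge:", mean_minutes(incidents, "created", "acknowledged"))
print("Mean minutes to resolve:", mean_minutes(incidents, "created", "resolved"))
print("Incidents per ISO week:", incidents_per_week(incidents))
```

Watching these values over several weeks is one simple way to see whether response efforts are improving or slipping.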
On-call rosters allow teams to identify who is responsible for acknowledging and addressing incidents as they occur.
They are made up of the names and contact information of everyone expected to take part in responding to and remediating service disruptions.
Name          Email                           Service  On-call
Jason Hand    firstname.lastname@example.org  API      Yes
Chris Smith   email@example.com               API      No
Lauren Jones  lauren@xyz.
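As a rough illustration of how a roster like this might be queried, the Python sketch below models each row as a small record and looks up who is on call for a service. The `RosterEntry` class and the `on_call_for` function are hypothetical names made up for this example, and only the two complete rows from the roster above are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosterEntry:
    """One row of the on-call roster (hypothetical structure)."""
    name: str
    email: str
    service: str
    on_call: bool


# Only the complete rows from the roster are reproduced here.
ROSTER = [
    RosterEntry("Jason Hand", "firstname.lastname@example.org", "API", True),
    RosterEntry("Chris Smith", "email@example.com", "API", False),
]


def on_call_for(service, roster=ROSTER):
    """Return the roster entries currently on call for the given service."""
    return [
        entry for entry in roster
        if entry.service.lower() == service.lower() and entry.on_call
    ]


for entry in on_call_for("API"):
    print(f"Notify {entry.name} at {entry.email}")
```

The same lookup works regardless of where the roster lives, as long as each row carries a service name and an on-call flag.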
Now that we have a tool to track the incident details, we need to ensure we are tracking all of the important aspects, such as when we first knew about the problem.
Let’s now take a look at how we can customize Azure Boards to track additional incident details.
When did we know? When a new record (or incident) is created in Azure Boards, we automatically get its creation date and time, as well as a change log that follows the incident throughout its lifecycle…
Tracking incidents is as easy as setting up a datastore, like the table storage used for the on-call roster. However, why reinvent the wheel? Why not use something already available, customizable, extensible, and free?
Azure Boards is my tool of choice in this tutorial, but honestly this could probably be done with any popular project management tool with an API.
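To give a sense of what "a tool with an API" means here, the following Python sketch builds (but does not send) a request that would create an incident record as a work item. It reflects my understanding of the Azure DevOps work item REST endpoint, which takes a JSON Patch body, so check it against the official reference before relying on it. The function name, the organization and project values, the `Issue` work item type, and the API version are all assumptions for illustration, and authentication is left out entirely.

```python
import json
from urllib.parse import quote


def build_create_incident_request(organization, project, title, description,
                                  work_item_type="Issue", api_version="7.1"):
    """Build the URL, headers, and body for creating a work item.

    Nothing is sent over the network. The work item type and API version
    are assumptions; adjust them to match your own project.
    """
    url = (
        f"https://dev.azure.com/{quote(organization)}/{quote(project)}"
        f"/_apis/wit/workitems/${quote(work_item_type)}"
        f"?api-version={api_version}"
    )
    # The endpoint expects a JSON Patch document: one "add" operation per field.
    body = [
        {"op": "add", "path": "/fields/System.Title", "value": title},
        {"op": "add", "path": "/fields/System.Description", "value": description},
    ]
    headers = {"Content-Type": "application/json-patch+json"}
    return url, headers, json.dumps(body)


url, headers, body = build_create_incident_request(
    "my-org", "my-project", "API latency spike", "Detected by an alert."
)
print(url)
print(headers)
print(body)
```

Any project management tool with a comparable "create item" endpoint could be swapped in by changing only this one function.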
First, log in to or create a free Azure DevOps account.
Azure Monitor includes a number of features and tools you can use to monitor and measure reliability, using data from many different sources.
This article’s focus is to help you keep your “eyes on the prize” – the goal of improving reliability in your organization. You should now understand what reliability is and why it’s important. You know that operational awareness is an essential precursor to monitoring for reliability, and that you need to establish a baseline of “normal” behavior as a first step.