Tech Used
The brains behind this solution is an Azure Function (running Node.js) that is triggered by an outgoing webhook from Microsoft Teams. The function modifies an index.html file on a “web server” that is served from a serverless SMB file share in Azure Storage.
Users can open, update, and close “status updates” by invoking them from within a chat channel.
The text that follows the command will be stored and displayed on the site below the colored (Red or Green) header.
The Communication Coordinator works in conjunction with the incident commander to share information with people beyond those who are in the firefight, actively working to recover from the incident itself. That could be customers. It could be the sales and marketing teams. Maybe your customer support. There are many people within an organization who need to be made aware of what’s taking place and how things are progressing.
Another key role to identify, in many cases, is the incident commander. An incident commander can be extremely helpful when you have a large-scale outage that affects many different components or requires coordination across many teams and systems. They make sure that engineers stay focused on their own remediation efforts, and that people are not stepping on each other or undoing each other’s work.
The scribe’s role is to document the conversation in as much detail as possible. Teams commonly use phone bridges, conference calls, or video chat to get everyone together and try to understand what is going on, which can certainly help create space for the conversation. However, it is difficult to go back and understand in detail what the engineers were saying and doing unless it is transcribed. The scribe is the person who documents as much as possible so that it can be reviewed later.
ChatOps is the practice of running operational tools from within a group chat application.
One great example of ChatOps is when engineers can share important information about an incident with a broader audience, possibly even with affected users.
Allowing people to quickly update stakeholders by typing a few quick commands from within the same environment as the related conversations has a number of clear benefits.
Let’s take a look at Building a Serverless Status Page Solution with Azure Functions, blob storage, and Microsoft Teams to illustrate.
Incident response doesn’t stop when the incident is over. They say those who don’t study history are doomed to repeat it. Likewise, those who don’t study, analyze, and learn from the incidents they resolve are doomed to keep repeating them.
Your most important means of learning from incidents is the post-incident review.
When an incident occurs, your first reaction probably isn’t, “Hurray – a learning opportunity!” Your immediate priority is figuring out what went wrong and fixing it as quickly as possible, to reduce the impact on your customers and end users – as it should be.
It’s important that internal teams are aware of what’s happening when an incident occurs. If we don’t provide consistent updates, stakeholders will start coming around and asking.
It’s also important to communicate awareness, status, and expectations to stakeholders beyond the internal groups of an organization. Customers, users, fans, and anyone relying on the availability of your service will want to know what’s going on when an incident begins to impact them.