collaboration

Building a Serverless Status Page Solution

Tech Used: The brains behind this solution is an Azure Function (running Node.js) that is triggered by an outgoing webhook from Microsoft Teams. The function modifies an index.html file stored in a “web server” that is served from a serverless SMB file share in Azure Storage. Users can open, update, and close “status updates” by invoking commands from within a chat channel. The text that follows the command is stored and displayed on the site below the colored (red or green) header.
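As a rough illustration of the page-generation step, the sketch below builds the contents of index.html from the current status. It is not the article's actual function: the `StatusState` shape, the `renderStatusPage` and `escapeHtml` names, the header wording, and the markup are all assumptions made for this example. Only the red or green header with the stored text below it comes from the description above.

```typescript
// Hypothetical shape of the stored status; not taken from the original solution.
interface StatusState {
  open: boolean;    // true while a status update (incident) is open
  message: string;  // text that followed the chat command
  updatedAt: Date;  // when the status was last changed
}

// Escape user-supplied chat text so it cannot inject markup into the page.
function escapeHtml(raw: string): string {
  return raw
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the full index.html body: red header while open, green once closed.
function renderStatusPage(state: StatusState): string {
  const color = state.open ? "red" : "green";
  // Assumed header wording; the source only specifies the colors.
  const title = state.open ? "Incident in progress" : "All systems operational";
  return [
    "<!DOCTYPE html>",
    "<html>",
    '<head><meta charset="utf-8"><title>Status</title></head>',
    "<body>",
    `  <h1 style="background:${color};color:white">${title}</h1>`,
    `  <p>${escapeHtml(state.message)}</p>`,
    `  <p><small>Last updated ${state.updatedAt.toISOString()}</small></p>`,
    "</body>",
    "</html>",
  ].join("\n");
}
```

In a real function the returned string would then be written over the existing index.html in the file share. Escaping the message matters here because the text comes straight from chat.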

Identifying the Subject Matter Experts

It’s quite common within on-call rosters to identify subject matter experts, so that early responders know who to escalate to quickly. These people should not be on call all the time, of course, but we do want to be able to identify them. Who is our database expert? Who is our front-end expert? Who are the people that we can reach out to if our primary and secondary responders are not able to diagnose and resolve the issue themselves?

Managing Tasks from Group Chat - ChatOps

ChatOps is the use of tools from within a group chat tool. One great example of ChatOps is when engineers can share important updates about an incident with a broader audience, possibly even with affected users. Allowing people to quickly update stakeholders by typing a few quick commands from within the same environment as the related conversations has a number of clear benefits. Let’s take a look at Building a Serverless Status Page Solution with Azure Functions, blob storage, and Microsoft Teams to illustrate.
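To make the idea of "typing a few quick commands" concrete, here is a minimal sketch of how a chat message might be turned into a status command. The command names (open, update, close), the `parseStatusCommand` function, and the assumption that the message text begins with a bot mention wrapped in `<at>...</at>` are illustrative choices, not details confirmed by the article.

```typescript
// Hypothetical command set for a status page bot.
type StatusAction = "open" | "update" | "close";

interface StatusCommand {
  action: StatusAction;
  message: string; // free text that follows the command word
}

const ACTIONS: readonly string[] = ["open", "update", "close"];

// Parse the raw chat text into a command, or return null if it is not recognized.
// Assumption: the text may contain a bot mention such as "<at>StatusBot</at>".
function parseStatusCommand(text: string): StatusCommand | null {
  const cleaned = text
    .replace(/<at>.*?<\/at>/g, "") // drop the bot mention, if present
    .replace(/&nbsp;/g, " ")
    .trim();

  // First word is the command; everything after it is the status message.
  const match = /^(\S+)\s*([\s\S]*)$/.exec(cleaned);
  if (match === null) return null;

  const action = match[1].toLowerCase();
  if (ACTIONS.indexOf(action) === -1) return null;

  return { action: action as StatusAction, message: match[2].trim() };
}
```

For example, a message such as `<at>StatusBot</at> open Database is unreachable` would parse to the `open` action with the message `Database is unreachable`, while unrecognized text yields `null` so the bot can reply with usage help instead of changing the page.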

Supplying Context and Guidance

During an active incident, it’s often difficult to recall information such as system names, IP addresses, admin logins, location of logs, and more. Remembering where to find dashboards, metrics, and reports that could be helpful is often difficult, too. To build a proactive response plan for managing incidents, it’s important to provide some useful context and guidance for the first responders to begin investigating, triaging, and escalating. Where should the first responder start?

Understanding Why We Learn From Incidents

Incident response doesn’t stop when the incident is over. They say those who don’t study history are doomed to repeat it. Likewise, those who don’t study, analyze, and learn from the incidents they resolve are doomed to keep repeating them. Your most important means of learning from incidents is the post-incident review. When an incident occurs, your first reaction probably isn’t, “Hurray – a learning opportunity!” Your immediate priority is figuring out what went wrong and fixing it as quickly as possible, to reduce the impact on your customers and end users – as it should be.

Updating Stakeholders

It’s important that internal teams are aware of what’s happening when an incident occurs. If we don’t provide consistent updates, stakeholders will start coming around and asking. It’s also important to acknowledge awareness, status, and expectations to stakeholders beyond the internal groups of an organization. Customers, users, fans, and anyone relying on the availability of your service will want to know what’s going on when an incident begins to impact them.