Answers To Your Questions
What Is This?
The systems we work in eventually have problems.
They are built, maintained, and supported by technologists such as yourself. And when an issue inevitably occurs, someone needs to take action to restore services.
Responding to those problems helps maintain functionality and operational abilities of an organization’s IT services, serving both internal and external users.
Many organizations don’t currently have an incident response plan in place. In fact, efforts to recover from service disruptions rarely follow any kind of repeatable and measured framework at all. Engineers react rather than respond.
With the increased reliance on digital services and their underlying technology, it’s more important than ever to establish an explicit response plan. There are small steps that you can take immediately so that when the next problem occurs, everyone knows what to do. The incident itself can be viewed not just as an outage but as an opportunity to learn.
On-call Life is dedicated to providing foundational concepts and information related to being on call, including monitoring, incident response, and the post-incident review process.
This is a live site with new information added regularly. Much of the content is syndicated from presentations created for and delivered during Microsoft Ignite The Tour.
Throughout these articles, demonstrations and resources specific to Azure will be used, but the foundations of monitoring, incident response, and retrospectives are agnostic to tooling. Demonstrations on Azure are meant to illustrate rather than to suggest “best practice” implementations.
How Do I Use This?
Begin by examining why the responsibilities of on-call have become so critical to nearly every business, group, and government.