To address some of the challenges around how we communicate we also want to find a way to create a unique channel or space for engineers to discuss the details of the incident - a “conversation bridge” in our persistent group chat tool -which for Tailwind Traders is Microsoft Teams.
We want a channel that is unique to the incident only. We do not want conversations about other engineering efforts.We don’t want conversations about what people are doing for lunch.
Creating a repeatable response plan means establishing who does what when something goes wrong. We don’t want there to be any question around who is supposed to be doing what.
Because of this, it is important to establish roles and the associated expectations. This isn’t a separation of duties exercise. In fact, we want to encourage less of that. It is however, a way of establishing better coordination and communication. It prevents people from stepping on each others toes while enabling cross-collaboration amongst not only on-call rosters, but an entire organization.
Rosters establish a framework around who is on-call at any given point. A roster, or team, is made up of multiple engineers. Rosters can also contain multiple rotations. I’m testing out how to edit a page.
Creating an on-call roster using Azure Table Storage
People shouldn’t have to be on-call for long periods of time. It’s detrimental to their health and therefore the systems they create and look after.
In order to make sure people aren’t expected to respond to problems 24-hours a day, indefinitely, we use rotations. Also referred to as scheduled shifts. That way people can take turns.
Being on-call requires a heavy cognitive task which then negatively impacts many aspects of a person’s life.
The Communication Coordinator is meant to be the person working in conjunction with the incident commander to share more information beyond those who are in the firefight actively working to recover from the incident itself. That could be customers. It could be the sales and marketing teams. Maybe your customer support. There are many people within an organization who need to be made aware of what’s taking place and the status around how things are progressing.
Another key role to identify, in many cases, is the incident commander. An incident commander can be extremely helpful when you have got a large-scale outage that effects a lot of different components or requires coordination across many teams and different systems. They are great for making sure that engineers stay focused and they are working on their own remediation efforts… Ensuring people are not stepping on each other or undoing each other’s work.