Site Reliability Engineering is an engineering discipline devoted to helping organizations sustainably achieve the appropriate level of reliability in their systems, services, and products.
The key concepts to take away from this definition are:
Reliability. You learned in the introductory module that there are multiple aspects to reliability, and later in this module you’ll examine each in more detail. You also learned why reliability matters.
When deciding what to monitor for reliability, it becomes clear that there are many approaches because there are many aspects to reliability. It’s important to examine the reliability of your service from the user’s perspective to determine what is important and how to prioritize.
What Does Reliability Mean to the User?
To determine the reliability of a system, service, application, or process, many look to a combination of the following eight aspects of reliability.
Can people reach a website or service?
This is typically the first thing people think of when talking about reliability. Can “the thing” be reached (and used) when expected?
This is important from the perspective of both external and internal users who depend on a service. Because of this, availability is a good place to start discussing reliability.
Availability
If a service can’t be reached at all, then it is of no value to its intended user and, in some cases, is detrimental.
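One common way to put a number on availability is the share of requests that succeeded in a given window. The sketch below is only an illustration: the function name, the request counts, and the choice to treat a window with no traffic as fully available are all assumptions, not something this module prescribes.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing to fail, so report 1.0.
        return 1.0
    return successful_requests / total_requests


# Hypothetical counts for one measurement window.
ratio = availability(successful_requests=9_990, total_requests=10_000)
print(f"Availability: {ratio:.2%}")  # Availability: 99.90%
```

Counting requests as users experience them (for example, at the load balancer rather than on the server itself) keeps the measure closer to the user’s perspective.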
Correctness asks whether the process that you ran on the data yielded the correct or expected result: for each input, did it produce the expected output?
This is an important factor to include in monitoring for reliability.
No matter how fast or “always available” your service or site is, if it returns incorrect results, it’s not reliable in the eyes of your customers.
Monitoring for correctness of results is an important part of monitoring for reliability.
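One way to monitor correctness is to run the service against inputs whose expected outputs are already known and count how many results match. The following is a minimal sketch under that assumption; `correctness_ratio`, `buggy_double`, and the test cases are hypothetical stand-ins for a real service and its known-good data.

```python
from typing import Callable, Dict


def correctness_ratio(process: Callable[[int], int],
                      known_cases: Dict[int, int]) -> float:
    """Return the fraction of known inputs for which process() is correct."""
    if not known_cases:
        raise ValueError("at least one known case is required")
    correct = sum(
        1 for given, expected in known_cases.items()
        if process(given) == expected
    )
    return correct / len(known_cases)


def buggy_double(x: int) -> int:
    """Hypothetical service under test: doubles x, but is wrong for x == 4."""
    return 7 if x == 4 else x * 2


# Known inputs mapped to the outputs a correct service should return.
known_cases = {1: 2, 2: 4, 3: 6, 4: 8}
print(correctness_ratio(buggy_double, known_cases))  # 0.75
```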
Coverage refers to how much of the data that you expected to process was actually processed.
Reliability means getting the whole job done, every time. How can we monitor for that in a way that indicates what it’s like as a user?
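Coverage, as defined above, lends itself to a simple ratio: records actually processed divided by records expected. The sketch below assumes both counts are available from the pipeline; the function name, the sample numbers, and the decision to cap the result at 1.0 are illustrative choices.

```python
def coverage(records_processed: int, records_expected: int) -> float:
    """Return the fraction of expected records that were actually processed."""
    if records_expected <= 0:
        raise ValueError("records_expected must be positive")
    # Assumption: cap at 1.0 so duplicate processing does not inflate the ratio.
    return min(records_processed / records_expected, 1.0)


# Hypothetical batch job: 1,000 records were expected, 950 were processed.
print(f"Coverage: {coverage(950, 1_000):.1%}")  # Coverage: 95.0%
```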
Another aspect of reliability that is closely tied to data is durability: data written to the service must be readable again later when desired.
Durability generally relates to longevity and resilience. It’s the ability to remain functional over time.
Durability is especially important in situations such as storage systems, where it is crucial that a bit written to the service can be read back out again later when desired.
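A rough way to check durability is to record a checksum when data is written and later verify that the data read back still matches it. The sketch below uses an in-memory dictionary as a stand-in for a real storage system; `durability_ratio`, the object keys, and the stored contents are hypothetical.

```python
import hashlib
from typing import Dict


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def durability_ratio(store: Dict[str, bytes], recorded: Dict[str, str]) -> float:
    """Return the fraction of written objects that can be read back intact."""
    if not recorded:
        raise ValueError("at least one recorded object is required")
    intact = sum(
        1 for key, digest in recorded.items()
        if key in store and checksum(store[key]) == digest
    )
    return intact / len(recorded)


# Write four hypothetical objects and record their checksums at write time.
store = {f"object-{i}": f"payload {i}".encode() for i in range(4)}
recorded = {key: checksum(value) for key, value in store.items()}

# Simulate the later loss of one object.
del store["object-2"]
print(durability_ratio(store, recorded))  # 0.75
```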
Exploring the Fidelity Aspect of Reliability
Fidelity pertains to the ability of your service to continue to provide a reduced or degraded experience when something goes wrong.
For example, if different parts of the home page on your website are provided by different microservices, and one of those microservices goes down, ideally you can still serve the home page with only that section missing or replaced with some static content.
Fidelity, then, is the measure of how often the page served that degraded or partial experience in comparison to serving the full page as intended with full fidelity.
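The comparison described above can be sketched as a ratio of full-fidelity responses to all responses served. This assumes each response is tagged as either "full" or "degraded" when it is served; the labels, the function name, and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def fidelity(responses: Iterable[str]) -> float:
    """Return the fraction of served pages that had full fidelity."""
    counts = Counter(responses)
    total = counts["full"] + counts["degraded"]
    if total == 0:
        raise ValueError("no served responses to measure")
    return counts["full"] / total


# Hypothetical window: 97 full pages and 3 pages with a section missing.
served = ["full"] * 97 + ["degraded"] * 3
print(f"Fidelity: {fidelity(served):.0%}")  # Fidelity: 97%
```

Requests that failed outright are deliberately left out of this sketch, since they count against availability rather than fidelity.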
Freshness refers to how up-to-date the information is in situations where timeliness matters to the customer.
For many users, the freshness of what is available to them is a big component of reliability. If my travel booking tool is out of sync with the airline’s inventory, it’s frustrating to have the process delayed because something wasn’t “actually” available. The tool’s information, and therefore its value to me, is unreliable.
Other examples include sports scores or election results, in which the data is constantly and quickly changing.
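Freshness can be checked by comparing the age of the data with the maximum age users will tolerate. The sketch below assumes each piece of data carries a last-updated timestamp; the five-minute threshold and the sample timestamps are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True if the data was updated within max_age of now."""
    return now - last_updated <= max_age


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
max_age = timedelta(minutes=5)  # hypothetical freshness threshold

score_updated = now - timedelta(seconds=20)      # e.g. a live sports score
inventory_updated = now - timedelta(minutes=45)  # e.g. cached seat inventory

print(is_fresh(score_updated, max_age, now))      # True
print(is_fresh(inventory_updated, max_age, now))  # False
```

The right threshold depends on the data: a live score may need to be seconds old, while other information can tolerate much longer delays.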