DevOps Institute

Site Reliability Engineering Key Concepts: SLO, Error Budget, TOIL and Observability


By: Niladri Choudhuri

“What happens when a software engineer is tasked with what used to be called operations” – Ben Treynor, Google.

Around 2003, well before DevOps came into existence, Google created Site Reliability Engineering (SRE). SRE is a discipline that applies software engineering principles to infrastructure and operations problems so that systems become more stable and reliable and can scale as the business requires. The goal of Site Reliability Engineering is to create ultra-scalable and highly reliable distributed software systems.

SREs spend up to 50% of their time on “ops” work such as issue resolution, on-call, and manual interventions, and at least 50% of their time on development tasks such as new features, scaling, or automation. Monitoring, alerting, and automation are a large part of SRE work.

The following are SRE Principles:

  • Operations is a software problem
  • SRE services are managed with Service Level Objectives (SLOs)
  • SRE practices aim at removing TOIL through automation
  • Automate as much as possible
  • SREs help reduce the cost of failure
  • SREs have skillsets of both Dev and Ops and share “Wisdom of Production” with the development team

What is an SLO?

A Service Level Objective (SLO) is a target for how well a product or service should operate, for example its availability. SLOs are closely tied to the user experience: when they are met, users should be satisfied with the service. Setting SLOs and monitoring them regularly is a key objective of SRE. Each product and service should have its own SLOs, and they are always defined from the customer’s point of view.

According to the Catchpoint SRE Survey Report 2019, the following are the most frequent SLOs:

  • Availability – 72%
  • Response Time – 47%
  • Latency – 46%
  • We do not have SLOs – 27%
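
As a rough illustration of how availability and latency SLOs like those listed above might be checked, here is a minimal Python sketch. The function names, the `SloCheck` type, and the example targets are assumptions made for illustration; they do not come from the survey or from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SloCheck:
    """Result of comparing a measured indicator against its SLO target."""
    name: str
    measured: float  # measured fraction, between 0.0 and 1.0
    target: float    # SLO target fraction, e.g. 0.999
    met: bool


def check_availability(good_requests: int, total_requests: int, target: float) -> SloCheck:
    """Availability: fraction of requests that were served successfully."""
    # With no traffic there is nothing to violate, so treat it as fully available.
    measured = 1.0 if total_requests == 0 else good_requests / total_requests
    return SloCheck("availability", measured, target, measured >= target)


def check_latency(latencies_ms: Sequence[float], threshold_ms: float, target: float) -> SloCheck:
    """Latency: fraction of requests that finished within threshold_ms."""
    if not latencies_ms:
        return SloCheck("latency", 1.0, target, True)
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    measured = fast / len(latencies_ms)
    return SloCheck("latency", measured, target, measured >= target)


if __name__ == "__main__":
    # Hypothetical numbers: 99,950 of 100,000 requests succeeded, target 99.9%.
    print(check_availability(99_950, 100_000, target=0.999))
    # Hypothetical latency SLO: 95% of requests within 500 ms.
    print(check_latency([100, 200, 300, 900], threshold_ms=500, target=0.95))
```

Both checks are expressed from the customer's side: they count requests the user actually experienced as good or fast, rather than internal machine metrics.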

What is TOIL?

“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” – Vivek Rau, Google.

Examples of toil are manual releases, physically connecting to infrastructure to check something, doing regular password resets, testing over and over, acknowledging the same alerts every day, creating users, manual resets, on-call response, extracting data, manual scaling of infrastructure, etc.

TOIL is bad because:

  • It slows down progress
  • Manual work reduces quality
  • It slows career progression
  • It creates a never-ending list of manual tasks
  • It leads to burnout

What is an Error Budget?

“100% is the wrong reliability target for basically everything” – Ben Treynor

An Error Budget is the amount of time, within a given period, during which the service is allowed to be degraded or unavailable without breaching its SLO. This is the budget that gets spent when bringing in new features or making architectural changes. If more than the budget is spent, there has to be a consequence. One such consequence is to stop releasing new features and stabilize the system, so that post-mortem-related backlog items are prioritized over new features. SRE encourages spending the Error Budget down toward zero rather than leaving it unused, and using it strategically to balance velocity (speed) and availability (stability). Changes should be lean and delivered in small batches, because big changes carry higher risk and burn through the error budget faster.

For example:

SLO – 99.9% Availability of the System

Error Budget – 43 minutes per month (0.1% of a 30-day month). All new feature releases, patches, and planned and unplanned downtime need to fit within these 43 minutes.

Consequence – If the Error Budget is used up, then the release of new features has to stop and user stories from the post-mortem-related backlogs need to be prioritized.
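
The arithmetic behind this example can be sketched in a few lines of Python. This is a minimal sketch that assumes a 30-day window and measures the budget purely in minutes of downtime; the function names are hypothetical and chosen only for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def remaining_budget_minutes(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def releases_allowed(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Apply the consequence described above: no new features once the budget is used up."""
    return remaining_budget_minutes(slo, downtime_minutes, window_days) > 0


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)  # roughly 43.2 minutes for 30 days
    print(f"Error budget: {budget:.1f} minutes")
    print("Releases allowed after 30 min of downtime:", releases_allowed(0.999, 30.0))
    print("Releases allowed after 50 min of downtime:", releases_allowed(0.999, 50.0))
```

Under these assumptions, a 99.9% SLO over 30 days gives roughly 43.2 minutes, which is where the 43-minute figure comes from. A real policy would likely also account for partial degradations and request-based budgets, which this sketch leaves out.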

What is Observability?

“Observability, as a noun, is a property of a system, it’s a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don’t adequately externalize their state, then even the best monitoring can fall short” – Peter Waterhouse, CA

Observability is about having enough data to answer questions that were not anticipated in advance. It requires architecting the system in such a way that it provides the information needed to understand its health.
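
One common way for a system to externalize its state is to emit a structured event for every operation, so that questions can be asked of the data later. The sketch below uses only the Python standard library; the field names (`service`, `operation`, `status`, `duration_ms`) and the `observe` helper are assumptions made for illustration, not a standard schema or a specific tool's API.

```python
import json
import time
from typing import Any, Callable


def build_event(service: str, operation: str, status: str,
                duration_ms: float, **context: Any) -> str:
    """Serialize one operation as a JSON event that can be filtered and aggregated later."""
    event = {
        "timestamp": time.time(),
        "service": service,
        "operation": operation,
        "status": status,
        "duration_ms": duration_ms,
    }
    event.update(context)  # free-form context, e.g. region or customer tier
    return json.dumps(event, sort_keys=True)


def observe(service: str, operation: str, func: Callable[[], Any],
            emit: Callable[[str], Any] = print, **context: Any) -> Any:
    """Run func and emit one event with its outcome and duration, even if it fails."""
    start = time.perf_counter()
    status = "ok"
    try:
        return func()
    except Exception as exc:
        status = "error"
        context["error_type"] = type(exc).__name__
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        emit(build_event(service, operation, status, duration_ms, **context))


if __name__ == "__main__":
    # Hypothetical service and operation names.
    observe("checkout", "charge_card", lambda: sum(range(1000)), region="eu-west")
```

Because each event carries arbitrary context fields, it can later be sliced by dimensions nobody planned for up front, which is the property the paragraph above describes.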

Observability is important because:

  • Service growth is happening at a very rapid pace
  • Architectures are dynamic in nature
  • The workload of containers is increasing
  • There are service dependencies
  • A high level of customer experience is very important

Many other SRE concepts also need to be considered in order to deliver the best service to the customer.
