DevOps Institute

Choosing the Right Service Level Indicators

SRE

By: Anurag Sharma

A well-known quote from the Google Site Reliability Engineering handbook describes Site Reliability Engineering as a discipline that requires “a lot of curiosity, humility and openness.” This is fantastic for those who work in such a culture, but we need to understand that not everyone can operate at Google scale and that everybody has their own challenges.

It has become crucial to understand the key SRE TLAs (three-letter acronyms): SLI, SLO and SLA.

Service-level Indicator (SLI):

  • A quantifiable measure of service reliability, such as throughput or latency
  • Directly measurable and observable by users
  • Represents the user’s experience
  • In simple words, it defines what exactly you are going to measure
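As a rough illustration, an SLI is often expressed as the share of "good" events out of all events. The sketch below assumes a hypothetical `Request` record and a hypothetical 300 ms latency threshold; neither comes from a particular monitoring tool, so adapt both to your own data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one user request (illustrative only)."""
    status_code: int
    latency_ms: float


def availability_sli(requests):
    """Share of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic, so nothing to measure
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Share of requests served within the (assumed) latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 90.0),
    Request(200, 200.0),
]
print(availability_sli(sample))  # 3 of 4 requests succeeded -> 0.75
print(latency_sli(sample))       # 3 of 4 requests were fast -> 0.75
```

Both indicators answer the question "what exactly are you going to measure" with a single number between 0 and 1.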

Service-level Objective (SLO):

  • Defines how the service should perform from the perspective of the user (measured via an SLI). In simple words: how good should the service be? It is a threshold beyond which the service needs improvement
  • The point at which users may consider opening a support ticket, the “pain threshold”, e.g., an Amazon product search taking longer than usual, problems with Google Search, or YouTube buffering
  • Driven by business requirements, not just current performance
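To make the relationship concrete, here is a minimal sketch of checking a measured SLI against an SLO target. The function name `evaluate_slo` and the 99.9% target are assumptions chosen for illustration; a real target should come from business requirements.

```python
def evaluate_slo(good_events, total_events, target):
    """Return (sli, met) for a ratio SLI compared against an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events
    return sli, sli >= target


# Assumed target: 99.9% of requests should be good.
sli, met = evaluate_slo(good_events=9_995, total_events=10_000, target=0.999)
print(sli, met)  # 0.9995 True

sli, met = evaluate_slo(good_events=9_980, total_events=10_000, target=0.999)
print(sli, met)  # 0.998 False
```

When `met` is false, the service has crossed the threshold at which improvement is required.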

Service-level Agreement (SLA):

  • A business contract that provides the customer with some form of compensation if the service does not meet expectations
  • In simple words, SLO + consequences
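The "SLO + consequences" idea can be sketched as a lookup from measured availability to compensation. The credit tiers below are entirely hypothetical and are not taken from any real contract; they only show the shape of such a rule.

```python
# Hypothetical credit schedule: (minimum availability, credit in percent).
# Tiers are listed from best to worst; the first match wins.
CREDIT_TIERS = [
    (0.999, 0),   # objective met: no compensation
    (0.99, 10),   # moderate miss: 10% service credit
    (0.0, 25),    # severe miss: 25% service credit
]


def service_credit_percent(measured_availability):
    """Return the credit owed for the measured availability (0.0 to 1.0)."""
    for minimum, credit in CREDIT_TIERS:
        if measured_availability >= minimum:
            return credit
    return CREDIT_TIERS[-1][1]


print(service_credit_percent(0.9995))  # 0
print(service_credit_percent(0.995))   # 10
print(service_credit_percent(0.95))    # 25
```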

The above is the simplest and quickest view of these TLAs. It also shows that the SLI (service-level indicator) is key and drives the rest, so it’s very important to choose it carefully.

Since the dawn of service management, organisations have tracked many of the measurements available in their service management systems. But you shouldn’t treat everything the system exposes as an SLI. If you choose too many SLIs, it becomes very hard to pay the right level of attention to each one. At the same time, if you choose too few, there is a high chance you will miss behaviours of your system that are crucial to delivering a first-class experience to your users.

Below are a few practices to consider when choosing SLIs:

1.   Understand your users: No matter how well you set up your servers and load balancers, what counts is the user’s experience. Users should be able to access features without delay, so availability, latency and throughput play crucial roles here. It’s very important to identify the user-facing features of your site and work accordingly

2.   SLIs should be directly observable and measurable by users. Don’t include internal metrics such as CPU utilization or disk latency in your SLIs, because the user cannot directly measure these values. Understand users’ behaviour: “what they really care about” and “how they quantify” their experience

3.   Understand your system well. Most organisations take utilisation as a measure, but it doesn’t actually tell you how usable the system is

4.   Correctness of information matters: data integrity is a key SLI and should be part of every user journey. You can’t neglect data

5.   You need to understand your storage well. It’s better for a database to be down for a while than to lose its data. Always be careful in figuring out those extra nines in durability, latency and availability when it comes to storage

6.   Once you understand initial user behaviour, start defining SLIs within the capabilities and boundaries of your system

7.   If you are going to define SLIs for availability and performance, you should also observe the factors that surround the service

8.   Never be overly mathematical. Understand the required measurement frequency and sampling interval, and use the SLI formula accordingly

9.   Don’t forget to share the documented SLIs and SLOs with your team

10. Always remain engaged; SLIs and SLOs evolve over time. Keep iterating and fine-tuning your SLIs
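The point about measurement frequency and sampling interval can be sketched as follows: group timestamped probe results into fixed windows and apply the simple ratio formula (good / total) per window. The 5-minute window and the `(timestamp, is_good)` sample format are assumptions for illustration, not a prescribed standard.

```python
from collections import defaultdict


def sli_per_window(samples, window_seconds=300):
    """Compute a ratio SLI for each fixed-size time window.

    samples: iterable of (unix_timestamp, is_good) pairs (assumed format).
    Returns a dict mapping window start time to good / total.
    """
    counts = defaultdict(lambda: [0, 0])  # window_start -> [good, total]
    for timestamp, is_good in samples:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start][1] += 1
        if is_good:
            counts[window_start][0] += 1
    return {
        start: good / total
        for start, (good, total) in sorted(counts.items())
    }


# One probe per minute; the third probe failed.
probes = [
    (0, True), (60, True), (120, False), (180, True),
    (300, True), (360, True),
]
print(sli_per_window(probes))  # {0: 0.75, 300: 1.0}
```

A shorter window reacts faster to problems but is noisier with few samples, which is why the sampling interval deserves a deliberate choice.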

It takes a long time to mature a good reliability practice, but no matter how much time and effort you put in, it’s hard to achieve a resilient and reliable architecture without a clear definition of basics like SLIs and SLOs. Remember, SLIs help your engineering team make better decisions. Make it right, keep it right and keep refining.
