By: Anurag Sharma
A well-known quote from the Google Site Reliability Engineering handbook is: “Site Reliability Engineering as discipline that requires a lot of curiosity, humility and openness.” This is fantastic for those who work in such a culture, but we need to understand that not everyone operates at Google scale, and every organisation has its own challenges.
It has become crucial to understand the core SRE TLAs (three-letter acronyms): SLI, SLO and SLA.
Service-level Indicator (SLI):
- A quantifiable measure of service reliability, such as throughput or latency
- Directly measurable and observable by users
- Represents the user’s experience
- In simple words, this is exactly what you are going to measure
Service-level Objective (SLO):
- It defines how the service should perform from the user’s perspective (measured via SLIs). In simple words: how good should the service be? It is a threshold beyond which the service needs improvement
- The point at which users may consider opening a support ticket, the “pain threshold”, e.g., a slow product search on Amazon, a problem with Google Search, or YouTube buffering
- Driven by business requirements, not just current performance
Service-level Agreement (SLA):
- This is a business contract that gives a customer some form of compensation if the service does not meet expectations.
- In simple words: SLO + consequences
The above is the simplest and quickest view of the TLAs. It also shows that the SLI (service-level indicator) is key and drives the rest, so it’s very important to choose it carefully.
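How the three terms build on each other can be shown with a minimal Python sketch. Everything in it is an illustrative assumption rather than something prescribed here: the `Window` data shape, the availability formula (good requests divided by total requests), and the example targets of 99.9% for the SLO and 99.5% for the SLA.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one measurement window (hypothetical data shape)."""
    good_requests: int   # requests the user experienced as successful
    total_requests: int  # all requests the user made


def availability_sli(window: Window) -> float:
    """SLI: what we measure, here the fraction of good requests."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic is treated as no failures
    return window.good_requests / window.total_requests


def evaluate(window: Window, slo_target: float, sla_threshold: float) -> dict:
    """Compare the measured SLI with the SLO (internal goal) and the SLA (contract)."""
    sli = availability_sli(window)
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,        # SLO: how good the service should be
        "sla_breached": sli < sla_threshold,  # SLA: SLO + consequences
    }


# Illustrative numbers only: 99,850 good requests out of 100,000.
report = evaluate(Window(99_850, 100_000), slo_target=0.999, sla_threshold=0.995)
print(report)
```

In this example the SLO is missed, which is a signal for the team to act, while the SLA is not yet breached, so no compensation would be owed. Setting the SLO tighter than the SLA is an assumption of the sketch; it leaves room to react before the contract is at risk.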
Since the dawn of service management, organisations have tracked many measurements simply because they are available in their service management systems. But you shouldn’t treat everything the system exposes as an SLI. If you choose too many SLIs, it becomes very hard to pay the right level of attention to each one, and there is a high chance you will overlook the behaviours of your system that are crucial to delivering a first-class experience to your users.
Below are a few practices to consider when choosing SLIs:
1. Understand your users: No matter how well you set up your servers and load balancers, what matters in the end is the user. Users should be able to access features without delay, and availability, latency and throughput play crucial roles here. It’s very important to identify the user-facing features of your site and work accordingly
2. SLIs should be directly observable and measurable by users. Don’t use internal metrics such as CPU utilization or disk latency as SLIs, because the user cannot directly measure these values. Understand the user’s behaviour: “what they really care about” and “how they quantify” their experience
3. Understand your system well. Most organisations take utilisation as a measure, but it does not actually tell you how usable the system is
4. Correctness of information: data integrity is a key SLI and should be part of every user journey. You can’t ignore your data
5. You need to understand your storage well. It’s better to have the database down for a while than to lose its data. Always be careful when working out those extra nines in durability, latency and availability when it comes to storage
6. Once you understand initial user behaviour, start defining SLIs within the capabilities and boundaries of your system
7. If you are going to define SLIs for availability and performance, you should also observe the factors that surround the service
8. Never be overly mathematical. Understand the required measurement frequency and sampling interval, and use the SLI formula accordingly
9. Don’t forget to share the documented SLIs and SLOs with your team
10. Always remain engaged; SLIs and SLOs evolve over time. Keep iterating and fine-tuning your SLIs
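As a rough illustration of practices 2 and 8, the sketch below computes a user-facing latency SLI per sampling window from request records rather than from internal metrics such as CPU utilization. The `Request` record, the 60-second window, the 300 ms threshold, and the "good requests / total requests" form of the SLI formula are all assumptions chosen for the example; pick values that match what your users actually care about.

```python
from collections import defaultdict
from typing import NamedTuple


class Request(NamedTuple):
    """One user-facing request (hypothetical record shape)."""
    timestamp_s: float  # when the request arrived, in seconds
    latency_ms: float   # how long the user waited
    succeeded: bool     # whether the user got a correct response


def latency_sli_per_window(requests, window_s=60, threshold_ms=300.0):
    """Return {window_start_s: fraction of requests that were fast and successful}."""
    good = defaultdict(int)
    total = defaultdict(int)
    for r in requests:
        # Bucket each request into its sampling window.
        start = int(r.timestamp_s // window_s) * window_s
        total[start] += 1
        if r.succeeded and r.latency_ms <= threshold_ms:
            good[start] += 1
    return {start: good[start] / total[start] for start in sorted(total)}


# Made-up sample data, for illustration only.
samples = [
    Request(0.0, 120.0, True),
    Request(10.0, 450.0, True),   # too slow for the user
    Request(59.0, 90.0, True),
    Request(61.0, 200.0, False),  # fast, but failed
    Request(75.0, 210.0, True),
]
slis = latency_sli_per_window(samples)
for start, sli in slis.items():
    print(f"window starting at {start}s: SLI = {sli:.2f}")
```

A shorter window reacts faster but is noisier when traffic is low; a longer one is smoother but slower to show a problem. That trade-off is the reason to choose the measurement frequency and sampling interval deliberately.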
It takes a long time to mature a good reliability practice, but no matter how much time and effort you put in, it’s hard to achieve a resilient and reliable architecture without a clear definition of basics like SLIs and SLOs. Remember, SLIs help your engineering team make better decisions. Make it right, keep it right and keep refining.