By: Niladri Choudhuri
“What happens when a software engineer is tasked with what used to be called operations” – Ben Treynor, Google.
Around 2003, well before DevOps came into existence, Google created Site Reliability Engineering (SRE). SRE is a discipline that applies software engineering principles to infrastructure and operations problems so that systems become more stable and reliable and can scale with business needs. Its goal is to create ultra-scalable and highly reliable distributed software systems.
SREs aim to spend up to 50% of their time on “ops” work such as issue resolution, on-call, and manual interventions, and the rest of their time on development tasks such as new features, scaling, or automation. Monitoring, alerting, and automation are a large part of SRE work.
The following are SRE Principles:
- Operations is a software problem
- SRE services are managed with Service Level Objectives (SLOs)
- SRE practices aim at removing TOIL through automation
- Automate as much as possible
- SREs help reduce the cost of failure
- SREs have skillsets of both Dev and Ops and share “Wisdom of Production” with the development team
What is an SLO?
An SLO, or Service Level Objective, is a target for how well a product or service should operate, for example its availability. SLOs are strongly tied to user experience: when they are met, users are likely to be satisfied. Because meeting SLOs is a key objective of SRE, they need to be set deliberately and monitored regularly. Each product and service should have its own SLOs, and they are always defined from the customer’s point of view.
According to the Catchpoint SRE Survey Report 2019, the following are the most frequent SLOs:
- Availability – 72%
- Response Time – 47%
- Latency – 46%
- We do not have SLOs – 27%
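To make the idea of an SLO concrete, here is a minimal sketch of checking a measured ratio of good events against a target. The `SLO` class, the function names, and the request counts are hypothetical and chosen only for illustration; real systems would pull these counts from their monitoring data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical SLO: a name plus the required fraction of good events."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def compliance(good_events: int, total_events: int) -> float:
    """Return the measured fraction of events that met the objective."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    return good_events / total_events


def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """Return True when the measured fraction reaches the SLO target."""
    return compliance(good_events, total_events) >= slo.target


availability = SLO(name="availability", target=0.999)

# Illustrative numbers only: successful requests out of all requests.
print(is_met(availability, good_events=999_500, total_events=1_000_000))  # True
print(is_met(availability, good_events=998_000, total_events=1_000_000))  # False
```

The same shape works for a response-time or latency SLO if a "good event" is defined as a request served within a chosen threshold.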
What is TOIL?
“TOIL is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” – Vivek Rau, Google.
Examples of toil are manual releases, physically connecting to infrastructure to check something, doing regular password resets, testing over and over, acknowledging the same alerts every day, creating users, manual resets, on-call response, extracting data, manual scaling of infrastructure, etc.
TOIL is bad because:
- It slows down progress
- Manual work reduces quality
- It slows down career progression
- The list of manual tasks never ends
- It burns people out
What is Error Budget?
“100% is the wrong reliability target for basically everything” – Ben Treynor
An error budget is the amount of time, within a given period, during which a service is allowed to fall short of its SLO. This budget is what gets spent when introducing new features or making architectural changes. If the team spends more than the budget, there has to be a consequence. One such consequence is to stop releasing new features and stabilize the system, so the post-mortem-related backlog is prioritized over new features. SRE encourages spending the error budget fully, down to zero, and using it strategically to balance velocity (speed) and availability (stability). Teams need to stay lean and ship in smaller batches, because big changes carry higher risk and burn through the error budget faster.
For Example:
SLO – 99.9% Availability of the System
Error Budget – about 43 minutes per 30-day month (0.1%). All downtime, planned or unplanned, including any caused by new feature releases and patches, needs to fit within these 43 minutes.
Consequence – If the Error Budget is used up, then the release of new features has to stop and user stories from the post-mortem-related backlogs need to be prioritized.
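The arithmetic behind this example can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the window is assumed to be a 30-day month, and the downtime figures are made up.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over the window."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def releases_allowed(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """Return True while recorded downtime is still below the error budget."""
    return downtime_minutes < error_budget_minutes(slo_target, window_days)


budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes")  # 43.2 minutes

# Illustrative downtime totals for the current month.
print(releases_allowed(0.999, downtime_minutes=30))  # True: budget remains
print(releases_allowed(0.999, downtime_minutes=50))  # False: budget is used up
```

A 99.9% target over 30 days leaves 0.1% of 43,200 minutes, which is 43.2 minutes. Once recorded downtime reaches that figure, the consequence described above applies.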
What is Observability?
“Observability, as a noun, is a property of a system, it’s a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Therefore, if our IT systems don’t adequately externalize their state, then even the best monitoring can fall short” – Peter Waterhouse, CA
Observability is about having enough data to answer questions that were not anticipated in advance. It requires architecting the system in such a way that it exposes the information needed to understand its health.
Observability is important because:
- Service growth is happening at a very rapid pace
- Architectures are dynamic in nature
- Container workloads are increasing
- There are service dependencies
- A high-quality customer experience is very important
Many other SRE concepts are also worth exploring in order to deliver the best possible service to the customer.