June 11, 2021
By Jennifer Petoff, Director of SRE Education at Google and Co-Author of Google’s SRE book
I am the global director of Site Reliability Engineering (SRE) Education at Google and I’m based in Dublin, Ireland. My proudest accomplishment as a member of the Google SRE team was program managing the publication of the original SRE book in 2016. I’ve been at Google for over 14 years now. I lead a training team that focuses on ramping up new SREs at Google, and on continuing education opportunities related to reliability principles, best practices, and SRE culture for the broader engineering community at Google.
The Origins of SRE
I think it’s important to first point out that necessity is the mother of invention, and that’s how SRE got its start at Google in 2003.
Google was managing operations more or less in the traditional way, which was the standard in the industry at the time. But, faced with unprecedented growth, throwing more people at the problem wasn’t feasible. It would have been cost-prohibitive and probably near impossible for our hiring teams to find enough people to manage operations traditionally. So that’s when Google VP Ben Treynor Sloss decided to engineer his way out of the problem.
We didn’t want to feed the machines with human toil, so we needed to look for engineering solutions. Ben has been quoted in many publications as saying, “SRE is what happens when you ask a software engineer to design an operations function.”
Fast forward 10 years, when we decided to write a book about SRE. We hoped that the industry as a whole might benefit from some of the lessons we learned, the principles and practices that we had developed, and the culture of reliability that we’d built within Google.
For example, if you ask a leader, “How much reliability do you need?”, their first instinct might be to say, “100%!” But no, it’s really about considering what your customers need and aiming for that appropriate level of reliability. Excess reliability beyond what’s required to keep customers happy increases costs and slows velocity. Applying SRE principles like SLOs and error budgets potentially translates to increased profits and success for businesses of all sizes.
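To make the cost of each extra "nine" concrete, here is a minimal sketch of the arithmetic behind a reliability target. It assumes a time-based availability target measured over a 30-day window; the function name `allowed_downtime` and the window length are illustrative choices, not something Google prescribes.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much downtime a given availability target tolerates.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for illustration.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # Whatever fraction of the window is not covered by the target
    # is the downtime the service can absorb while still meeting it.
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 1.0):
        print(f"{target:.4%} target -> {allowed_downtime(target)} allowed per 30 days")
```

A 100% target leaves zero allowed downtime, which is why any outage, however small, becomes an emergency under that goal.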
Other benefits to the industry could include happier customers – and thus a lower operational overhead for customer support – and happier engineers. Applying SRE principles helps teams manage services with lower stress and lower risk of burnout. If you target 100% reliability, every outage no matter how small must be treated as an emergency. The idea here is that SRE doesn’t feed the machines with human toil, as I mentioned above.
Why is SRE Gaining Attention?
I think SRE is gaining in popularity because SRE practices are built on some simple foundational principles that are, at their core, very rational. SRE is about aligning the objectives and interests of different functions, for instance development, operations, and the business. Instead of working at cross purposes, teams use service-level objectives (SLOs) tied to the customer experience as a common goal and a way to determine how priorities need to shift based on various circumstances.
SLOs allow an organization to move as fast as possible when things are going well. However, if the system is unstable and goes out of SLO, dropping below the agreed level of reliability, the focus shifts away from feature launches to increasing the stability of the system. So there’s this push and pull towards a common goal.
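The push and pull described above can be sketched as a simple check. This is a minimal illustration, assuming a request-based SLO (good requests divided by total requests); the names `error_budget_remaining` and `next_priority`, and the two priority labels, are hypothetical and not taken from any Google tooling.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return how many more failed requests the window can absorb.

    A negative result means the error budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    return allowed_failures - failed_requests


def next_priority(slo_target: float,
                  total_requests: int,
                  failed_requests: int) -> str:
    """Pick a team focus based on whether the service is within its SLO."""
    if total_requests == 0:
        # No traffic means no evidence of unreliability in this window.
        return "ship features"
    success_ratio = (total_requests - failed_requests) / total_requests
    if success_ratio >= slo_target:
        return "ship features"
    return "improve stability"


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% target.
    print(error_budget_remaining(0.999, 1_000_000, 400))
    print(next_priority(0.999, 1_000_000, 400))
    print(next_priority(0.999, 1_000_000, 1_500))
```

The point of the sketch is that the decision comes from an agreed number rather than from whoever argues loudest: the same measurement tells development, operations, and the business when to speed up and when to slow down.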
SLOs also really help to fight unrealistic expectations. For example, historically, if you ask leadership how reliable a service needs to be, what’s their first inclination? “100%! Give me all the reliability, and probably all the features, too.”
They want a pony, perhaps.
However, if you start with this expectation of 100% reliability, and you’re demanding perfection, you need to treat any failure, no matter how small, as an emergency. In truth, we know that the only perfectly reliable system out there is one that does nothing at all, and can never change or have anything change around it. So think of something that would have to exist in a vacuum, hermetically sealed from the world. Such a system is pretty much impossible to create or maintain, or at the very least prohibitively expensive, and these SRE principles help to expose that and accommodate it.
SRE as a practice is also empowering. SREs are empowered to proactively work on projects that improve the reliability of their service, instead of just reacting all the time, as might be the case with a more traditional operations approach.
What are Examples of How SRE Helps an Organization to Either Manage or Scale its Services?
As I mentioned earlier, SRE was born out of necessity because of the rate at which Google was growing, and we really couldn’t hire enough people to keep up with that demand of operating our services. SREs help to manage and scale services by applying an engineering mindset to their work, so rather than manually repeating the same task, SREs will engineer a solution so the computer can do the work instead. That frees up those engineers to work on new and more impactful, and frankly, probably more interesting problems.
In terms of how SRE helps an organization manage and scale its services, it’s important to touch on culture. SRE culture is important to scaling robust systems. We talk a lot about blamelessness — a critical component of an SRE culture. We assume that humans are smart and that they’re trying to do the right things. The reality is, “human errors” are actually system problems. You can’t fix people, but you can fix systems and processes to better support people in making the right choices.
You also want to make sure that people feel comfortable raising issues that exist in a system. Things are inevitably going to go wrong (Murphy’s Law tells us so), and if you blame people when things go wrong, it incentivizes people to cover up incidents and system faults that, if they were addressed, would make the system more robust and reliable. So the reality is you can’t run an effective organization based on the fear of being blamed or of losing your job.
SRE can help an organization manage and scale its services by consulting on software designs from the beginning. SRE often has a wider view of production compared to software engineers working on product development. Product development engineers tend to be more deeply focused on a narrower set of features.
What Advice Do You Have for People Interested in Exploring SRE?
At Google, SREs typically have a mix of software engineering and systems engineering backgrounds. However, there’s also a wider range of roles that are integral to the success of the organization, and we all work together towards a common goal.
For example, I’m a program manager, and my background is in chemistry. I’m technical, but in a very different sense than when you think about software engineering. I feel right at home on the SRE team at Google. There are places for people with varying levels of “technical abilities” within an SRE organization.
When thinking about upskilling, it can help to start by thinking about ways to develop a software engineering mindset, at both an individual and a team level. Think about questions like: Does your team use a defect tracking system like Jira? Does your team use a project planning model, and how can you apply it? Is your work done in a version control system like Git? Are solutions clearly articulated in a documented design before you start implementation? Does your team have a shared repository of tools and libraries?
If the answer to some or all of these questions is no, can you do some research, and learn by implementing one or more of these practices in your current work? There’s nothing better than learning by doing.
I think other ways to upskill, like learning a programming language such as Go or Python, can be helpful, or even just adopting a toil-vanquishing mindset. Ask yourself, “Why am I doing this task again and again? Can I write a simple script to eliminate the need for a human to do this?”
Why is Training so Important to Adopting and Managing an SRE Practice, and What are Some of the Best Practices There?
Training SREs is about building confidence, and more about fighting imposter syndrome than it is about teaching facts. In reality, there’s no one-size-fits-all approach. Rather, there’s a variety of training methods available that lie along a continuum from low to high effort.
Sink or swim (at the low end of the effort spectrum) means you’re just letting people figure things out on their own, with no real guidance or support. Self-study means you’re pointing people at relevant resources, but they’re ultimately in charge of consuming the material on their own. Continuing along the continuum, you might pair people up via the buddy system or provide new members of the team with a mentor to answer questions and to shadow and/or reverse shadow.
Moving up the effort spectrum, ad hoc classes or whiteboard sessions are often a good way to convey key activities. Then, at the highest effort end of the spectrum, you could consider a systematic training program. A training program has well-thought-out learning objectives, and the content to support those objectives. The curriculum is typically offered on a consistent and scheduled basis.
My first piece of advice would be to avoid sink or swim, especially if you value inclusivity in an organization. Letting people fend for themselves can breed stress, frustration, and even attrition. It can also contribute to imposter syndrome.
For the other options, choose wisely and consider the return on the effort that you’re going to invest. Consider the size of your organization, how fast it’s growing, and the mix of people you have. Also, consider where you are on your SRE journey. Are you pros with a well-established SRE practice, or are you just getting started?
If you’re small but rapidly growing, investing in an onboarding program to get your new people up to speed makes sense.
If you’re large but growing slowly, thinking about deeper topics like ongoing education is probably going to work better for you, so you can continue to invest in and grow the people that you have.
If you’re small, and you’re not growing very fast, formal classroom training doesn’t make sense, because you don’t have economies of scale. Maybe prepare some videos or self-study resources, or pair people up with a buddy.
Regardless of where you decide to focus, one thing to point out is that a great training program starts with learning objectives.
The Humans of DevOps Ep 16, “The Origins of SRE and Why It’s Important.” Listen here.