Jennifer Petoff, Head of SRE Education at Google and Co-Author of Google’s SRE book, discusses the origins of SRE, why it’s gaining attention, what problems it’s trying to solve, how it helps an organization either manage or scale their services, why training is so important to adopting and managing an SRE practice, and more.
The lightly edited transcript can be found below.
You’re listening to The Humans of DevOps Podcast, a podcast focused on advancing the humans of DevOps through skills, knowledge, ideas, and learning. Or, the S-K-I-L framework. Here’s your host, DevOps Institute CEO Jayne Groll.
Hi everyone, it’s Jayne Groll, CEO of the DevOps Institute, and welcome to another episode of The Humans of DevOps Podcast. I’m delighted today to be joined by Jennifer Petoff, head of SRE Education at Google. Hi, Jennifer.
Hi, Jayne. Great to be here today.
Thank you for joining us. So just to get started, Jennifer, why don’t you tell us a little bit about yourself, and your role at Google?
Sure, happy to do that, Jayne. As you say, I’m Jennifer Petoff, but my friends and colleagues actually call me Dr. J. I am head of SRE Education at Google, and a senior program manager on the site reliability engineering team, based in Dublin, Ireland. My one claim to fame is I’m one of the co-editors of the original SRE book that we published at Google in 2016. And, I’ve been at Google for a whopping 13 years, now. I can’t believe how the time flies.
Let’s see, like I said, I manage the SRE EDU team, so we’re a training team that focuses on ramping up new SREs at Google, and on continuing education opportunities on production, SRE best practices, for the broader engineering community at Google. When I do these things, I also think it’s fun to include a couple of fun facts, or other things about me, so things I might tell you if we were chatting over coffee, or in a hallway at a conference.
So fun fact number one, I have a PhD in Chemistry, synthetic chemistry, which is actually what gave rise to the nickname Dr. J. And, I love to travel, and I’m also a part time travel blogger at Sidewalk Safari, in my copious spare time.
Wow, I didn’t know that. I knew about your PhD, but I didn’t know about travel, so that’s really, really interesting.
You know, Jennifer, you referenced the SRE book, which really gained so much attention, and actually spawned an acceleration of SRE. Tell us a little bit about the origins of SRE, and the decision to write the book.
Sure, I’d be happy to talk about that. I think, first, it’s important to point out, really, necessity is the mother of invention, and that’s how site reliability engineering got its start at Google, back in 2003.
Back in the day, Google was managing operations in, more or less, the traditional way that was the standard in the industry at the time. But, faced with unprecedented growth, throwing more people at the problem really wasn’t feasible. It would have been cost prohibitive, and probably pretty impossible for our hiring teams to actually find enough people to manage operations in a traditional way. So that’s when our VP of 24x7 Engineering, Ben Treynor Sloss, actually decided to engineer his way out of the problem.
So if we didn’t want to feed the machines with human toil, as we sometimes say, we really needed to look for engineering solutions. Ben basically, he’s said this before, and has been quoted in a number of publications as saying, “SRE is what happens when you ask a software engineer to solve an operations problem.”
So we decided to write a book about SRE, fast forward a decade later, because we hoped that the industry as a whole might benefit from some of the lessons learned, and the principles and practices that we had developed at Google. Not to mention, the culture of reliability that we’d built within Google. We saw some of the potential benefits to the industry as aiming for a necessary and appropriate level of reliability. If you talk to someone in leadership, they might say, “How much reliability do you need? 100%.” But no, it’s really about what your customers need, and aiming for that appropriate level of reliability. And by doing that, that potentially translates to increased profits, and success for businesses of all sizes.
Other benefits to the industry could include happier customers, and thus a lower operational overhead for customer support. And, happier engineers. By applying SRE principles, if we could see adoption in the industry, this could mean that more services are managed with lower stress, and lower risk of burnout, so to speak. The idea here is SRE doesn’t feed the machines with human toil, as I mentioned above.
Yeah, it’s interesting. Even my team has adopted the term toil. “Hey, I don’t want to do that, that’s toil.” Which I love that, because it’s becoming very viral.
You were right, when you expected that the industry might be interested in these practices. As we saw in the recent Upskilling DevOps Enterprise Skill report from DevOps Institute, the rise in interest in hiring for site reliability engineers is increasing very, very rapidly. Why do you think SRE is gaining attention? And, the acceleration from the enterprise organizations, what problems are they trying to solve?
Sure, that’s a great question, Jayne. I think SRE is really gaining in popularity because an SRE practice is built on some simple foundational principles that are, at their core, very rational. Really what’s happening is that SRE is providing a way to align the incentives of different functions, so development, operations, and the business. Instead of working at cross purposes, SRE uses service level objectives tied to the customer experience as a common goal, and a way to determine how priorities need to shift based on various circumstances.
So SLOs, they allow an organization to move as fast as possible when things are going well. However, if the system’s unstable and goes out of SLO … Another way of saying this is if it burns its error budget, the so-called budget of unreliability, and drops below this agreed level of reliability, the focus would then shift away from feature launches to increasing the stability of the system. So there’s this push and pull, but towards a common goal.
I think SLOs also really help to fight unrealistic expectations. So for example, historically, I said this previously, if you ask leadership how reliable a service needs to be, what’s their first inclination? You know, 100%, give me all the reliability, and probably all the features, too. They want a pony, perhaps. However, if you start with this expectation of 100% reliability, and you’re demanding perfection, you really need to treat any failure, no matter how small, as an emergency. But in truth, we know that the only truly reliable system out there is one that does nothing at all, can never change, or have anything change around it. So think of something that would basically have to exist in a vacuum, hermetically sealed from the world. It’s pretty much impossible, and frankly, unnecessary, and these SRE principles help to expose that, and accommodate for that.
I wanted to call out that I think SRE as a practice is also empowering, so SREs are empowered to proactively work on projects that improve the reliability of their service, instead of just reacting all the time, as might be the case with a more traditional operations approach.
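The error budget arithmetic described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 99.9% availability target, the 30-day window, and the function names are hypothetical and do not come from the conversation or from any Google tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO allows over the window.

    `slo` is a fraction such as 0.999 (assumed example target).
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def should_prioritize_reliability(
    slo: float, downtime_minutes: float, window_days: int = 30
) -> bool:
    """True once the error budget is spent.

    At that point the focus shifts from feature launches to stability.
    """
    remaining = error_budget_minutes(slo, window_days) - downtime_minutes
    return remaining <= 0


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print("10 min of downtime, shift focus?", should_prioritize_reliability(0.999, 10))
    print("50 min of downtime, shift focus?", should_prioritize_reliability(0.999, 50))
    # A 100% target leaves no budget, so even a tiny failure is an emergency.
    print("100% target, 0.1 min, shift focus?", should_prioritize_reliability(1.0, 0.1))
```

With a 100% target the budget is zero, which is the point made above about demanding perfection: any failure, no matter how small, immediately exhausts it.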
You know what I find really interesting about that is the proactive nature of it, and I think that’s very attractive to those that are looking to solve enterprise level problems.
You know, the concept of service level objectives, I think, is really resonating. The key word there is service. You and your coauthors describe SRE as Google’s approach to service management. Can you give us examples on how SRE helps an organization to either manage or scale their services?
Sure thing. As I mentioned earlier, SRE really was born out of necessity because of the rate at which Google was growing, back a decade ago, and we really couldn’t hire enough people to keep up with that demand of operating our services. I think SREs help manage and scale services by applying an engineering mindset to their work, so rather than manually repeating the same task, over and over, SREs will engineer a solution so the computer can do the work instead, freeing up those engineers to work on new, and more impactful, and frankly, probably more interesting problems.
I think, also, in terms of how SRE helps an organization to manage and scale its services, I wanted to talk about culture for a minute. So SRE culture, I think, is also important to scaling robust systems. We talk a lot about blamelessness, so blamelessness is a critical component of an SRE culture. We assume that humans are smart, we assume they’re trying to do the right things, and that the reality is “human errors,” they’re really, really, systems problems. You can’t fix people, but you can fix systems and processes to better support people in making the right choices.
You know, you also want to make sure that people feel comfortable raising issues that exist in a system. Things are inevitably going to go wrong, Murphy’s Law tells us so, and if you blame people when things go wrong, it incentivizes people to cover up incidents, to cover up system faults that, if they were addressed, would actually make the system more robust and reliable. So the reality is you can’t run an effective organization on the basis of fear, a fear of being blamed, a fear of losing your job.
I think the other piece to this, too, is that SRE can really help an organization manage and scale its services by consulting on software designs, right from the start. So SRE often has a wider view of production, compared to software engineers working on product development. Product development SWEs tend to be more focused on a narrower set of features, but go deeper, while SREs can really advise on design patterns that scale well, and that are easier to productionize and maintain as usage grows, basically.
What’s really fascinating about that is I love that you’re talking about culture. Because a lot of the things that I think IT is facing today, call it under the umbrella of digital transformation, is really culture shift. I think SRE has brought that to the spotlight, particularly on the operations side. So much so, that it’s actually spawned some other reliability engineering practices at Google, right? I mean, I’ve been hearing a little bit more about customer reliability engineering, and network reliability engineering. Tell us a little bit about those.
Sure, happy to talk about those as well.
I think one thing we’ve learned is that the foundational principles of SRE, they do actually work well, and translate into a variety of contexts. CRE, for example, as you mentioned it’s Google’s customer reliability engineering team. Google CRE, it’s a team of CREs that actually help our customers achieve high levels of reliability, by partnering with them to implement these SRE operational best practices.
In addition to promoting reliability best practices with our customers, they actually get meta and use SRE principles in their engagements. So for example, you might expect a service like CRE to come at a cost, but in fact, CRE is a scarce but free resource, and this is by design. CRE is only going to engage with customers who demonstrate commitment to reliability, and to doing the work required to improve the reliability of their services. It’s not about throwing work over the fence, or outsourcing management of a customer’s service. In fact, CRE can actually walk away from the engagement if a customer’s not upholding their end of the deal, and putting in the necessary effort to improve the reliability of their systems.
You mentioned NRE as well, so network reliability engineering. I would say, similarly, network operations has traditionally scaled linearly with the size of the network, but by using principles like software defined networking, et cetera, some of the engineering and automation approaches that are foundational to SRE can be applied to the network space, so we’re no longer feeding that network with human toil, and freeing up our valuable human resources to do more value added and proactive work, to improve the performance of the network.
Do you want to jump in there? I’ve got one more thing that I will say.
Please, go ahead. Go ahead.
Oh, no problem at all. I also wanted to call out, even on something that seems very far removed from even engineering, so on my team the SRE EDU team, we also pride ourselves on applying SRE principles to our training program operations. So we actually take inspiration from the service reliability hierarchy that we included in the original SRE book.
So the hierarchy, for those who are less familiar, covers the elements that go into making a service reliable, from the most foundational to the most advanced. We’ve seen that we can adapt those elements of the service reliability hierarchy to a training context. So for example, my team, we do monitoring in the form of attendance tracking and survey feedback, and we address issues that surface via this monitoring. We’re occasionally writing postmortems when things go wrong, so we can proactively learn from failure. We do a lot of testing of new content and programs, and all the while, as we’re scaling our operations, we’re looking for opportunities to vanquish toil. Like you said, toil, it feels really good to get rid of it through automation, so we can make the most of those limited human resources that we have.
It’s only when we do these things that our program is actually going to be fully actualized, and we can realize that full potential of our curriculum design, and the program itself. As you can see, SRE principles, you can apply them in many and sundry ways.
But again, what I really love about this whole approach is it’s the marriage, in my mind, of engineering and culture, right? It’s funny, because when you look at some of the things that are written about SRE, there are many in IT Ops that go, “Hey wait, it’s an engineering approach, it’s self-regulating practices and principles.” Some are concerned that it’s too technical. I’ll tell you, honestly, my perspective of that is that we’re IT people, we’re supposed to be technical. It’s in our titles, right?
But, what advice do you have for upskilling, perhaps for an existing IT Ops professional? Everyone’s looking at either pivoting or growing their careers, certainly growing their knowledge and their experience. What advice do you have for upskilling, for those that are interested in exploring it more? Particularly those that may think it’s “too technical,” because it’s an engineering practice.
Sure. I think it’s interesting, at Google, SREs typically have a mix of software engineering and systems engineering backgrounds. However, there’s also a wider range of roles that are really integral to the success of the organization, and we all work together towards a common goal.
So for example, I’m a program manager, and my background’s in chemistry, and basically what that says is I’m technical, but in a very different sense than when you think about software engineering, but I’m right at home on the SRE team at Google, so there are places for people with varying levels of “technical abilities.”
I think, when you’re thinking about upskilling, it can help to start by thinking about ways to develop a software engineering mindset, at both an individual and a team level. So thinking about questions like, looking at your team, does your team use a defect tracking system, like Jira or Bugs? Does your team have a project planning model that they use, what can you apply there? Is your work done in a version control system, like Git? Are the solutions clearly articulated, and vetted by a design process, before you start implementation? Does your team have a shared repository of tools and libraries?
If the answer to some of these questions, or all of these questions is no, can you do some research, and learn by implementing one or more of these practices in your current work? There’s nothing better than learning by doing, basically.
I think other ways to upskill, like learning a programming language like Go or Python can be helpful, and even just adopting a vanquishing toil mindset can be helpful. So, questioning yourself, why am I doing this task, again and again? Can I write a simple script to eliminate the need for a human to do this? Learn by doing is generally a good approach.
Yeah, an immersive, experimentation approach, right?
I think you mentioned Python. A lot of folks these days, that might not have been coders before, are really dabbling with Python, and just trying to understand how to, really, engineer their way out of toil, and I think that’s really important. Particularly because, again, if you can’t write the script, you probably could find somebody who can, who can teach you how to write it for the next time. I think a lot of peer-to-peer is available, too.
Which segues us into, unfortunately, our last question, because we’re going to run out of time. But, you’re leading SRE education internally at Google, and training is so important. There are so many different ways that people learn, right? Some of it’s formal training, some of it is immersive training, some of it’s hands-on training. Why is training so important to adopting and managing an SRE practice, and what are some of the best practices there?
Sure, sure. This is a topic near and dear to my heart.
First of all, a lot of people think about training as being all about how much information can I cram into someone’s head. In reality, though, that’s not actually what you typically need to do, because the reality is, no matter how much you tell people, they’re only going to remember a small fraction of what you tell them. If you open up the fire hose, so to speak, they can only drink so much.
So training is, actually, more about building confidence, and more about fighting imposter syndrome than it is, actually, about teaching facts. So, in light of this, how should you go about training your SREs? Again, the reality is there’s no one size fits all approach, but rather, there’s a variety of training methods that you can access, that lie along a continuum from low to high effort.
If we explore this for a second, let’s unpack it. Sink or swim, that’s at the low end of the effort spectrum, that means you’re basically just letting people figure things out on their own, with no real guidance or support. Self study means you’re pointing people at relevant resources, but they’re ultimately in charge of consuming the material on their own. Continuing on that training continuum, pairing people up via the buddy system, or providing new members of the team with a mentor to answer questions, and to shadow, reverse shadow key things is another approach you can take.
Moving up the effort spectrum, ad hoc classes, whiteboard sessions are often a good way to convey key activities. Then, at the highest end of the spectrum, you could consider a systematic training program. A training program is really something that’s got well thought out learning objectives, and then the content to actually support those objectives. Then, the curriculum is typically offered on a consistent and scheduled basis.
So piece of advice number one would be to avoid sink or swim, especially if you value inclusivity in an organization. Because sink or swim, and just letting people fend for themselves, it can breed stress, frustration, even attrition, and can also contribute to imposter syndrome. But, for the other options, choose wisely and consider the return on investment of the effort that you’re going to be investing. Think about the size of your organization, think about how fast it’s growing, think about the mix of people you have. Also consider where you are on your SRE journey. Are you pros, with a well established SRE practice, or are you just getting started?
For example, if you’re small but growing rapidly, investing in an onboarding program makes a lot of sense, to get your new people up to speed. You’re going to get, probably, the biggest bang for the buck there. If you’re large but growing slowly, thinking about deeper topics, thinking about ongoing education, this is probably going to work better for you, so you can continue to invest in and grow the people that you have. If you’re small and you’re not growing very fast, formal classroom trainings don’t really make sense, because you don’t really have the common needs of scale, so maybe prepare some videos, or self study resources, or pair people up with a buddy, so think through this a little bit.
But regardless of where you decide to focus, I think one thing to point out is a great training program really starts out with learning objectives, or what we often call ASBATs. I know that sounds funny, what is an ASBAT? An ASBAT is, basically, a student should be able to, it’s a bit of an acronym there. So you want to figure out what you want people to do as a result of the training you’re developing, before you jump in and begin developing content.
So for example, “understand the $foo service,” that’s not really a great ASBAT. A better formulation of that learning objective could be something like “use $tool to identify how much memory a job is using,” “interpret a graph in $monitor tool to identify the health of the foo service,” or “move traffic away from a cluster using your particular drain tool, in a particular period of time.” If you start with well thought out ASBATs, the rest will follow.
Again, thank you for that, because I think that for organizations that are in differing places in the journey, and also for individuals, learning, training is an individual effort. I can’t force you to learn what I give you, what knowledge I give you, so it is a human effort that’s supported by, hopefully, enterprises, and resources from Google, from DevOps Institute, and from others.
I would think, understanding the practices and principles around SRE is critical, because it is as much of an engineering practice as it is a culture practice, and I don’t think people should be swayed in one direction or the other. Too much engineering, too much culture, somewhere in between the two is probably the right answer, although I don’t think you can have too much culture.
Anyhow, thank you. Really great insight into where we are, where SRE has come from, some of the things that you’re doing. But, more importantly, organizations may not be Google, they may not be Google, but everything that you and the Google team … And, Google very generously, sharing this with us. We can read the site reliability engineering books for free online, I think there’s a lot of really great resources there. On behalf of the community, thank you for the generosity in sharing that. And, it’ll be interesting to watch the rise, interesting to see, if you and I talk again next year, which I hope we will, I think we will, that-
We’re old friends at this point, it’s great.
That we get to really identify some of the patterns, because it is a journey.
It is, indeed.
So with that, thank you. This has been very insightful, it’s really nice to hear some of the things that you can read about, but it’s really nice to have somebody explain it to you, so I appreciate that.
Thanks for inviting me.
Yeah, you’re always welcome, you know that. But again, I do thank you for your time, and again, I just can’t say enough about the generosity of Google as an organization, in sharing these practices because they’re certainly gaining so much attention.
Again, Jennifer Petoff, head of SRE Education at Google has been my guest today. I’m Jayne Groll, CEO of the DevOps Institute, you’ve been listening to another episode of The Humans of DevOps Podcast. Stay well.
Thanks for listening to this episode of The Humans of DevOps Podcast. Don’t forget to join our global community to get access to even more great resources like this. Until next time, remember, you are part of something bigger than yourself, you belong.