September 14, 2022
Welcome to a new season of the Humans of DevOps Podcast with your host Eveline Oehrlich.
In this episode Eveline is joined by DevOps Institute CEO Jayne Groll to discuss Site Reliability Engineering (SRE). Jayne and Eveline discuss the findings of the 2022 Global SRE Pulse report, how SRE came to be, and the developments and frameworks that are leading SRE into the future.
Special thanks to our sponsor RANGE.
Enjoy the Humans of DevOps Podcast? We’re incredibly grateful to be voted one of the Best 25 DevOps Podcasts by Feedspot.
Want access to more DevOps-focused content and learning? When you join SKILup IT Learning you gain the tools, resources and knowledge to help your organization adapt and respond to the challenges of today. And if you’re looking for the answers to DevOps’ persistent questions, pop on in to SKILup Discussions, one of the fastest-growing DevOps communities around!
Have questions, feedback or just want to chat about the podcast? Send us an email at [email protected]
Eveline Oehrlich 0:01
Hello, welcome. This is my first Humans of DevOps Podcast, and I am excited to have the boss lady with me today, Jayne Groll. Hello, Jayne, how are you?
Jayne Groll 0:12
Hello, Eveline, how are you? And thank you so much for taking on this new role as host and facilitator for the Humans of DevOps Podcast. I’m really excited to be here with you today.
Eveline Oehrlich 0:25
And I’m excited to take it on because, as I already shared with you, there are tons of things we can talk about. But today, we’re actually here to talk about a most exciting topic, which is Site Reliability Engineering. We did a recent report where, you know, we had a ton of responses. So, first question: site reliability, why is that such a big topic today?
Jayne Groll 0:50
So, you know, you and I both come from IT ops. And so I think, and, of course, share your perspective on that, but it’s very gratifying to see Site Reliability Engineering really giving IT ops kind of a new lens, or a new perspective, or even a new role, right, an official role that organizations are hiring for and a set of practices that organizations can embrace. We talk a lot about shift left in DevOps; this really shifts IT ops so far left that, you know, according to Google’s definition, SRE is what happens when you take a software engineer to help design and be part of IT operations. Right. So it’s a really exciting set of practices. Is there a lot of new in SRE? Maybe, maybe not. I mean, I always think of it as, you know, kind of a new, modern approach to IT service management. But I think there’s just an underlying human excitement. And as I said, you and I have been watching it, you know, trending for a while with the upskilling report. And I think it’s extra exciting today that SRE has crossed the chasm, right? I mean, we saw that in the report, that it crossed the chasm. And now enterprises are starting to embrace it. What about your excitement? As I said, you know, we’ve known each other a long time, and we’ve kind of watched this rise together.
Eveline Oehrlich 2:16
Yeah. So as you said, I’ve been in IT ops also. I started out my career there in ’94, and on my first day on the job I was called a programmer by somebody in the business line. Of course, at that time, programmer was not a great term. But I have to say it was, at that time already, a fun job. I had to do and could do a lot of things. So maybe both of us already were SREs without even knowing it then. But again, exciting for what reasons? What are some of the things we found, if you want to dive into it? Why is it exciting? What’s changing? Why is everybody talking about it? And why did we do the research?
Jayne Groll 2:59
Well, I think the big change is looking at operations as an engineering discipline. And so when we look at software engineering, we look at DevOps engineering, right, we look at a lot of the things that kind of have this engineering mentality. IT operations was always on the sidelines, right? You know, you and I stood with the help desk and everyone else, you know, with our hands up against the fence waiting for something to be thrown over. But we really weren’t part of the solution. We were there to fix the problems. And so I think that SRE kind of respun that, right, respun that in a way, starting with Google, but then with new practices grafting on top of it. So that led us to think about this report. Because, you know, you and I have been doing this as part of DevOps Institute’s upskilling report for years. And one of the things we’ve seen is this, you know, this trend with SRE. And then given our experience with IT operations, and then, you know, kind of this new excitement about this role of site reliability engineer, it made sense for you and I to say, okay, let’s take an agnostic approach, and let’s go back to the community, like we do with the upskilling report, and say, tell us what’s happening in the real world with this set of practices. Is it the same, you know, the same as other frameworks? What are organizations doing? How do you feel about the role, and what kind of benefits or activities are you executing? And so I was excited because I came to you and said, hey, can we do this? And you went, sure. And then we were even fortunate to get, you know, Sumo Logic and StackState and Sedai to help underwrite it. So, yes, it’s a passion project, for sure, I think for you and for me, right?
Eveline Oehrlich 4:53
Yeah. So at first, you know, going down into the research and crafting the survey questions, I was really focusing more on the practices and the automation and, of course, the adoption and the different team topologies. And then when we talked it over with the sponsors, we realized there was also a change in behavior. And we wanted to tickle out that potential opportunity for people to get excited about something different. Maybe you can talk a little bit about that from a career perspective, because both of us have been, you know, coaches, we’ve been leaders, we’ve been in IT ops, you’ve done ITSM, I’ve been an analyst. I’ve had lots of conversations with people who are just kind of like, okay, yeah, this is my job, I go in in the morning and then come home in the evening. But it felt to me that there was something more. Tell us a little bit. I know you see it in the research, and I know you like that tickle as well. What’s your thought around that?
Jayne Groll 5:57
So the first thing that struck me about SRE, and I had the opportunity to go to a very early SREcon, right, where there was still discussion, and it was still very much something that was very Google specific. So technology companies were looking at, you know, how do we replicate what was in the books. But there’s an underlying excitement or a dynamic that is baked into SRE, because it looks at some things that maybe in the past have been a little bit heavier or more disconnected. Change management is a great example of that. You know, for years and years and years, organizations have struggled with how do we manage changes? How do we control changes? Same thing with service level objectives, right? What is the service level? Is it how we, you know, fix something when it’s broken? Or is it actually maintaining a level of reliability? And then you add on top of that the fact that in the early days of DevOps, there was a lot of talk about NoOps. Now, the intent of NoOps was more automation, which is also baked into SRE. But, you know, as an IT ops person, as a systems administrator, as perhaps a role that wasn’t considered part of the cool kids club, right, NoOps kind of spoke to no job, no respect. And now you spin around, and now you’ve got SRE, which is very much a cool job. But also, there’s a very human element that’s baked into SRE as well. It is IT service management. You know, I talk about this all the time. It is about incident management, change management, service level management, you know, observability, event management. And so it is kind of a cool version of ITSM, more lightweight than perhaps some of the other practices. But some of the core principles that were brought forth in SRE are about taking time to make tomorrow better than today.
So in a pure world, an SRE is supposed to spend about 50% of their time, you know, reacting to, you know, whatever’s happening, and 50% of their time reducing toil, right, so looking at technical debt, looking at automation, looking at some of the things that have just been dragging IT for a very, very long time. SRE says, hey, take some time, learn, right, shadow, share. And let’s look at ways to reduce that manual, redundant work that nobody likes to do, right, but there’s automation that will do it for you. Right, but you’ve got to be able to have that kind of innovation and the proactiveness. So there are a lot of principles baked into SRE that are outside just how do we manage your infrastructure, how do you handle level two, and that are very human. And I think that may be something that’s contributing. And we saw in the report that people said they were more excited about their roles, that they were being paid better, that they were more engaged in, you know, the practices. I mean, you’re more familiar with the responses than I am. What did you see in terms of, like, did you feel an excitement in the data?
Eveline Oehrlich 9:04
Yeah, a couple of things, when you were talking about service management and the excitement. One is the topic of collaboration. Everybody always talks about, you know, in DevOps, we need to be better collaborators, and collaboration and knowledge management. But what we found in the report is, first, knowledge management was one of those things where SREs are actually paid to capture their knowledge and shift it left so that others upstream, or even right, downstream, can actually leverage it to improve. So your job is becoming somebody who, at a proactive level, while you’re reacting, inserts your knowledge, so the next time somebody knows a better way of doing things. That is fantastic. So while we’re still reacting to some things, because we can’t fix it all, we can be proactive. I think that was most important to me, being that collaborator. And if somebody asks me, what do my skills need to be to be an SRE? Well, you have to be, like you said, an engineer. You have to know data modeling, process modeling, you need to be in analytics, you need to know the software development lifecycle, you have to have some knowledge about all these tech things. But you also need to be able to understand where you can actually allow the toil to be reduced. Where can you actually help to make the biggest difference to developers, to testers, to customers, to product line owners, to all these internal and external customers? And to me, as an extrovert, which I am, unfortunately, some people think I may be too extrovert, that’s the beauty of this job. You get to talk to people, you get to collaborate, you get to work on multiple projects, while you actually fix things, and you actually can see an outcome. And that was reflected in being valued as a team member. Because as I remember, back in my time in IT operations, the only time I think I was really valued was when I carried the pager for the factory and I could actually drive in and, in, you know, whatever, 10 minutes, I was at the data center. So that excitement, I think, is that collaboration gives excitement, and allows you to become much more of a valuable member than you were maybe in the past in an IT ops job.
Jayne Groll 11:40
You know, and it’s funny you say that, because I think one of the areas of SRE that really shows how important collaboration is, is incident management. So in your experience, in my experience, right, an incident occurs, everybody goes, not my fault, not my fault, right? You spend half your time trying to figure out what changed, because it hasn’t been captured or documented anywhere. And it becomes this, like, hot potato of a ticket in an ITSM system being passed around according to some escalation path, and you have 15 minutes to get it and two hours to fix it or whatever. And it becomes a follow the pointing finger. Right. So, like, you know, it’s very non-collaborative. In SRE, there are a couple of things, I think, that really showcase that collaboration. First of all, there’s an incident command system, right, where it isn’t just pass around a ticket, but there’s a collaborative opportunity to figure out what changed, because that’s usually the root cause of an incident. What changed, how do we fix it, who can we get involved? And instead of playing, you know, pass the ticket around, it becomes a collaborative event. The other side of it, which I thought was really interesting, is if there’s a breach, if the service level objectives are breached because of an incident, all innovative work stops, right, all new work stops until they figure out what caused the breach. Now, for a business that can be very disruptive. So who wants to breach? Nobody, right? One of the disappointing parts, I think, that we found in the survey was that service level objectives were not necessarily the first thing that organizations did. We saw that in ITIL as well, where service level management was later. Hopefully, we can flip the model on that. But it shows a collaborative spirit, that instead of just pointing out it wasn’t me, or it was you, and we follow that pointing finger, there is an effort to work together and to solve problems together.
And hopefully that, first of all, transfers knowledge, but also helps to reduce toil, because in order to reduce toil or technical debt, you first have to accept that it’s there. Yeah. And so, you know, I think there are some pieces of that. What do you think about that SLO piece? Why do you think that’s a struggle?
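The stop-new-work rule described above is commonly expressed as an error budget: the number of failures an SLO tolerates over a window. The following is a minimal sketch, assuming a request-based SLO. The class name, the 99.9% target, and the request counts are hypothetical and are not taken from the episode or the report.

```python
# Hypothetical sketch: names, the 99.9% target, and the request counts are
# illustrative assumptions, not values from the episode or the report.
from dataclasses import dataclass


@dataclass
class SloWindow:
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the measurement window
    failed_requests: int   # requests that failed in the same window

    def error_budget(self) -> float:
        """Number of failures the SLO tolerates in this window."""
        return self.total_requests * (1 - self.slo_target)

    def budget_remaining(self) -> float:
        """Failures still allowed before a breach (negative once breached)."""
        return self.error_budget() - self.failed_requests

    def should_pause_new_work(self) -> bool:
        """True when the budget is overspent, i.e. the SLO has been breached."""
        return self.budget_remaining() < 0


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_500)
    print("Pause new work:", window.should_pause_new_work())
```

Whether a breach freezes all new work or only some of it is a policy choice that each organization has to agree on; the sketch only shows the arithmetic behind that decision.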
Eveline Oehrlich 13:54
Yeah, I think the challenge of IT has always been really defining what it is they are delivering from an outcome perspective. And the conversation between those who are creating it, or those who are fixing it, and those who are operating it, and those who are actually consuming it, be it at the employee level of systems of record or systems of engagement at the customer level, those conversations are very, very difficult to have, and then to put that into some kind of a perspective. I remember one conversation at my job at Forrester, where a gentleman was asking me to help him move from four nines, right, 99.99% availability, to five nines. And so of course, as an analyst, my question was, why would you want to do that? And he had no answer. He really didn’t have an answer. He just thought that was the right thing to do. So we actually calculated for him what it would cost for him to actually do that. And he was like, oh my gosh, I don’t think my business will fund that. So, right, the conversation of having a service level objective, which is tied to an SLA and is defined through SLIs, becomes difficult to define. And as we are in this digital world, where customer experience and employee experience is like the utmost goal of everybody, how is it defined? What’s the benchmark? How do I capture that? And I think that’s the challenge. There’s plenty more work to do to understand how and what. And I think it’s really a matter of bringing people together across the value stream, looking at it from the outside in: what is it we want to do versus what we in IT think we should be doing?
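To make the arithmetic behind each additional nine concrete, here is a rough sketch of how much downtime per year a given availability target permits. The function name and the list of targets are illustrative assumptions; the sketch says nothing about what reaching a target costs, which depends on the organization.

```python
# Illustrative only: the availability targets below are common examples,
# not figures taken from the report.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Going from four nines to five nines shrinks the yearly allowance from roughly 53 minutes to roughly 5 minutes, which is why the question of whether the business will fund it matters.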
Jayne Groll 15:49
Absolutely. And you know, the other thing about SRE that I find particularly fascinating is, while there’s a series of books that were published by Google, that isn’t necessarily the, you know, the body of knowledge or the prescriptive guidance. It isn’t religious, right, and we say some of these frameworks became religious. And so now we’ve grafted other things into this umbrella of SRE, like observability or chaos engineering, right? Things that would have existed outside of it, and could have been part of framework wars, right. But now, you know, this whole concept of IT operations under site reliability, or systems reliability, engineering now grafts other practices. So what are you seeing in that direction? Because again, as a human, that gives me new opportunities to collaborate, gives me new things to learn. Observability in particular, I think, has just really risen among the practices that are considered part of SRE.
Eveline Oehrlich 16:52
Yeah. So if I think of it from people, process, technology: from a people side, I think, for individuals who are really looking for a new career, this is a great way to get into engineering. If you are an operations professional, you know, get yourself some upskilling and move. That’s one thing. From a leadership perspective, I think having the perspective to look at those people you have in your team and give them the chance to maybe shift to a role like that is another thing. And then maybe from an organizational perspective, that’s still around people, maybe we need to think about how we organize our teams a little bit differently. From a process perspective, all these things we mentioned, you know, knowledge management, incident management, configuration management, right, the evil CMDB, the CMDB yes or no, or change management, all these things will become more natural. And then from a technology perspective, everybody wants to go to the cloud. We know we have to go to the cloud, organizations are shifting, but there are opportunities as well to become kind of a stack person around the cloud. Now, one thing I want to make sure everybody knows is that SRE is not just for companies who are in the cloud or going to the cloud. We see a lot of organizations who have very mixed and hybrid environments who are applying SRE, even in the mainframe. There isn’t really a perfect technology stack for adopting SRE. SRE is more about how and what you do rather than about a specific technology. So let me ask you a question, Jayne. What about next year? What’s your prediction for next year? And I hope we will do the research again, and we look forward to, of course, doing it in the same cadence. We usually do our survey around the November, September timeframe, and then, sorry, in May, and then we publish around July. So any predictions for next year?
Jayne Groll 18:58
Yeah, I think SRE is going to continue to be adopted by organizations. I think that, you know, if we look at kind of the evolution of practices, starting with agile, which is, what, 20 years ago now, over 20 years ago, and, you know, kind of the scaling of agile, and then DevOps, which is 11 or 12 years ago, and looking at how do we shift left, and how do we, you know, adopt more automation and the human collaboration, it feels like SRE is that third piece of the puzzle. And I think that organizations will start to see that, and there’ll be more case studies that are released. Of course, our report we’ll do every year. And I think there’ll be a migration because of the digital-first world, which we’re moving into, and we got pushed into maybe a little faster than we thought we would. You’re right, you don’t have to be in the cloud to adopt SRE practices, right. It’s just about good practices in IT operations. But I think also it’s gonna break some of the barriers that we’ve seen with, say, IT service management. There’s some really great guidance in IT service management, but there’s always been a disconnect, say, between those that are pre-production and those that are post-production. And SRE seems to break down those walls. I think that with the job, we’re gonna see more and more people really start to move into those roles, regardless of what their history was, whether they came from ops or they came from dev, right. So I see this trending faster, because the definition of IT operations is no longer NoOps. We called it “new ops” five years ago. And I think SRE is really new ops, and it will build very, very rapidly, probably more rapidly than agile or DevOps. What do you think? What do you see next year?
Eveline Oehrlich 20:52
Yeah, I agree. Absolutely agree. And I’m a little bit disappointed in us in the analyst community at the time, that we did not come up with that term Site Reliability Engineering, and we talked about new ops. But yes, I think it gives a completely new career path and allows for an engineering path for young folks to get into. The millennials of today, as we know, they’re very different. I have two at home. They don’t have IT careers, but I do see how they click and how they work. So for them, I think that’s going to be a great opportunity. All right. Well, Jayne, thank you. This was fantastic. Appreciate your time. For anybody listening in, we have a report. If you have a fantastic case study on SRE, reach out. If you need some ideas on how to upskill, reach out. You know where we are. The report can be found on the devopsinstitute.com website. If you have questions or comments, we are happy to entertain them. Great, thank you. This was a very easy first podcast for me. Thank you for making it so easy to transition into this role. Have a great day, and thank you, everybody.
Jayne Groll 22:10
Thanks, Eveline. Thanks, everyone.