By Eveline Oehrlich, Chief Research Officer at DevOps Institute
Organizations today depend on information (business) technology to serve, support, and operate interactions with customers, patients, employees, partners, and others. Digital services and products have grown tremendously, and digital transformation has accelerated. The Site Reliability Engineering (SRE) team is key to enabling and providing the technology and services behind these critical interactions, which span the value chains of customer service, employee service, sales, operations, marketing, facilities, and so forth.
Introduced by Google, SRE applies a software engineering approach to IT operations and infrastructure in order to modernize them. As we see SRE adoption grow year over year, we are eager to research and identify patterns, antipatterns, best practices, trends, and results, and then share them. We believe these findings are essential for SRE success today and in the future. This is why DevOps Institute has launched the Global SRE Pulse Survey.
Shifting from Reactive Response to Proactive Planning
Organizations that adopt SRE as a best practice are shifting from reactive response to proactive planning and automation. Fighting fires is too stressful, and in a digital environment the business impact of an incident is greater than before. Infrastructure and operations (I&O) leaders recognize that people and culture must shift from chaos toward proactive or predictive operational excellence. In the 2021 Upskilling IT report, we found that SRE adoption grew to 22% in 2021 from 15% the previous year, and we expect this growth to continue.
The Key to Success in a Digital Transformation
SRE is the key to success in a digital transformation. Whether organizations have moved their computing environment to the cloud or operate in a hybrid or legacy environment, many have experienced some outage or loss of service. According to ITIC’s 2021 Hourly Cost of Downtime Survey, 44% of firms indicate that hourly downtime costs range from more than $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Because outages cannot be avoided entirely, the response to downtime becomes critical in a 24/7 digital environment.
Some existing IT architectures can absorb or adjust to performance impacts and adapt to new levels of demand. Infrastructure and operations (I&O) teams are working on this already. Gartner predicts that global public cloud spending will grow 21.7% to reach $482 billion in 2022, up from $396 billion, and IT organizations are embracing new serverless architectures at different levels of adoption.
Transformation matters within technology, but it must also happen in how humans work and think. SRE is changing how IT operations function and how products are built, developed, and released. The key goal is to establish a healthy and productive interaction between the development team and the SRE team. Best practices such as service level objectives (SLOs) and error budgets help balance fast delivery of new features against software reliability.
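To make the idea of an error budget concrete, here is a minimal Python sketch. It assumes an availability SLO measured over a rolling 30-day window, and the function names (`error_budget_minutes`, `budget_remaining`, `can_release`) are hypothetical, chosen only for illustration. Real teams often define SLOs on request success rates or latency instead, so treat this as one possible way to express the arithmetic, not a standard implementation.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def can_release(slo_target: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Simple policy sketch: ship new features only while budget remains."""
    return budget_remaining(slo_target, downtime_minutes, window_days) > 0.0


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about 75% of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
    print(can_release(0.999, 50.0))  # budget overspent, so False
```

In this sketch, development and SRE teams share one number: while budget remains, features can ship quickly; once it is spent, the focus shifts to reliability work.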
Getting a Pulse on the State of SRE
While self-healing infrastructure, workload balancing, and other capabilities are helpful, how we solve issues has a significant impact on preventing them in the first place. Additionally, SRE initiatives allow teams to increase their release velocity and innovation, because less time is spent fixing problems. SRE teams are composed of individuals who possess skills in software development, networking, and systems engineering.
According to Google, the goal is for this team to spend at least 50% of its time on automation that resolves incidents before they happen, instead of only fixing things. Incidents become a source of learning, and the knowledge gained is used to automate and improve, creating a more resilient ecosystem. Additionally, SRE teams follow the principle of blameless work. This key principle of SRE means that responsibility is shared, and the team avoids blame while continuously looking for ways to improve.
While uptime and 24/7 availability are critical expectations in today’s digital economy, time zones, availability requirements, IT infrastructure, application footprints, and existing skills vary greatly across IT teams.
The Global SRE Pulse Overview
The Global SRE Pulse survey will collect input from the global community on SRE practices and thought leadership. We are proud to welcome Sumo Logic as a platinum sponsor of our inaugural survey. When the Sumo Logic team heard about our research, they were keenly interested in sponsoring it to help the community understand the state of SRE by talking to the people who practice it.
We will:
- Anonymize and summarize the data across all responses
- Publish the results and share them back with the community
- Donate a combined $2 (DevOps Institute and Sumo Logic together) to the International Committee of the Red Cross for every completed survey, to help with the humanitarian crisis in Ukraine
We will NOT:
- Use your responses for any marketing purposes whatsoever
- Share your answers with any recruiters, partners, or vendors
Keep up to date about the Global SRE Pulse at devopsinstitute.com/global-sre-pulse