DevOps Institute

Site Reliability Engineering (SRE): Get Certified to Make a Difference

SRE, Upskilling

Updated January 19, 2023

By Eveline Oehrlich, Chief Research Officer, DevOps Institute

According to Pew Research, work is one of the most common sources of meaning for adults around the world: A median of 25% across the publics surveyed mention their job, career, or profession. In further research, I discovered that individuals making a difference are happier. Finding a problem you care about and solving it can have powerful outcomes.   

What the COVID-19 Pandemic Meant for Stamina, Agility and Resistance

Agility and resistance have been key success factors for organizations, leaders, and all of us working and living through the past two years. We have also seen tremendous growth of digital services and products and accelerated digital transformation.

According to industry experts and market research organizations, the digital transformation market size is projected to grow at a compound annual growth rate of 19.1%, from $521.5 billion in 2021 to $1,247.5 billion in 2026. IT teams have been a major player, and a big focus was the existing (or lacking) operational excellence. IT leaders had to think about optimizing various operational processes, maximizing value, and minimizing waste. Resources, velocity, quality, monitoring, and real-time data became top of mind.
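As a quick sanity check on the growth figures, the sketch below compounds the 2021 base at the stated rate for five years. The function name `project_value` is illustrative only and is not taken from any cited report.

```python
def project_value(start: float, annual_rate: float, years: int) -> float:
    """Compound a starting value at a fixed annual growth rate."""
    return start * (1 + annual_rate) ** years


start_2021 = 521.5  # USD billions, the 2021 market size quoted in the text
cagr = 0.191        # 19.1% compound annual growth rate quoted in the text

# Five compounding periods take the 2021 figure to 2026.
projected_2026 = project_value(start_2021, cagr, 5)
print(f"Projected 2026 market size: ${projected_2026:,.1f} billion")
```

Compounding at 19.1% for five years multiplies the base by roughly 2.4, which puts the 2026 value at about $1,250 billion.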

The Technology and Process Conundrum

Whether your organization has moved its computing environment to the cloud or operates in a hybrid or legacy environment, it has experienced some outage or loss of service. According to ITIC’s 12th annual 2021 Hourly Cost of Downtime Survey, about 44% of firms indicate that hourly downtime costs range from more than $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Outages cannot be avoided entirely, so in the 24/7 digital world the response to downtime becomes critical. What you should consider:

  • Technology for resilience and reliability. Some existing IT architectures can absorb or adjust to performance impacts and adapt to new levels of demand. Infrastructure and operations (I&O) teams are working on this already. Gartner predicts global public cloud spending to grow at a rate of 21.7% to reach $482 billion in 2022, up from $396 billion, and IT organizations are embracing new serverless architectures at different levels of adoption.
  • Continuously improving processes is essential. While self-healing infrastructure, workload balancing, and other capabilities are helpful, how we go about solving issues has a significant impact on preventing them in the first place. The shift from fixing problems to proactive planning and automation is happening with Site Reliability Engineering (SRE) as a best practice. The mentality of fighting fires is too stressful, and the impact on business in the digital environment is bigger than before. I&O leaders recognize that people and culture must shift towards proactive or predictive operational excellence instead of chaos. We found that the adoption of SRE grew to 22% in 2021, compared to 15% in the previous year, and we expect this adoption to continue.

Initiate Operational Excellence by Leveraging SRE Principles

An SRE team is typically composed of individuals who possess skills in software development, networking, and system engineering. The goal for this team is to spend about 50% (or more) of their time on automation that resolves issues before they happen, instead of only fixing things. Incidents that do happen become a source of learning, and the knowledge gained from them is used to automate and improve whatever failed, creating a more resilient ecosystem. SRE has many principles, but two of the most important are:

  • SRE means focusing on shared responsibility. Blame is typical, but within SRE, it is essential to avoid blame and look for the learning opportunity. This means the team must look at the current state and what caused the incident, and at the future state and what can be done to avoid the incident, without blaming teams or individuals. SREs must work across different teams such as infrastructure and applications, understand processes, and have the human skills to work this way. This blameless approach allows individuals to learn what went wrong and enables them to change things.
  • Deploying different SRE models. While uptime and 24/7 availability are key expectations for today’s digital economy, time zones, availability requirements, IT infrastructure, application footprints, and existing skills vary greatly across IT teams. Therefore, there is no single way to organize and structure an SRE team. Leadership teams should discuss which model to use. The skills for an SRE include prior knowledge of operations, infrastructure, networking, hardware, distributed systems, monitoring, stability, capacity planning, and software engineering, and SREs must also be knowledgeable about the architecture and implementation of technical infrastructure and supporting services.
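The goal described above, that an SRE team spends about 50% or more of its time on automation, can be tracked with a simple ratio. The following is a minimal sketch under stated assumptions: the `WorkLog` type, its field names, and the sample hours are hypothetical and not part of any SRE tool or DevOps Institute material.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hypothetical weekly time log for an SRE team (hours)."""
    reactive_hours: float    # incident response and manual fixes
    automation_hours: float  # proactive engineering and automation


def automation_share(log: WorkLog) -> float:
    """Return the fraction of logged time spent on automation."""
    total = log.reactive_hours + log.automation_hours
    if total == 0:
        raise ValueError("no hours logged")
    return log.automation_hours / total


def meets_goal(log: WorkLog, target: float = 0.5) -> bool:
    """Check the share against the roughly-50% goal from the text."""
    return automation_share(log) >= target


# Illustrative numbers only.
week = WorkLog(reactive_hours=18, automation_hours=22)
print(f"Automation share: {automation_share(week):.0%}")
print(f"Meets 50% goal: {meets_goal(week)}")
```

A team whose share stays below the target for several weeks may be stuck in the firefighting mode the article warns about, which is a prompt to review recent incidents for automation candidates.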

Further Your Understanding of SRE

In my experience, the companies that are most successful at extracting the full value from SRE commit to managing IT operations differently. They run an integrated SRE operating model in which teams drawn from development, infrastructure, and operations bring the essential skills and capabilities and adopt automation to sustain reliable and available services for their employees and customers.

To help organizations and individuals pursue SRE further, DevOps Institute offers Site Reliability Engineering (SRE) Foundation and Practitioner-level certifications to provide a deeper understanding of the practical implementation of SRE culture.

What You’ll Learn

  • SRE Principles and Practices
  • Service Level Objectives and Error Budgets
  • Reducing Toil
  • Monitoring and Service Level Indicators
  • SRE Tools and Automation
  • Anti-Fragility and Learning from Failure
  • Organizational Impact of SRE
  • SRE, Other Frameworks, The Future

Gain the skills required to identify, troubleshoot and solve complex problems in the world of SRE. Learn more at devopsinstitute.com/certifications

To access more SRE resources, subscribe to SKILup IT Learning

Eveline Oehrlich is Chief Research Officer at DevOps Institute. As former VP and Research Director at Forrester Research, Eveline led and conducted research around a variety of topics including DevOps, Digital Operational Excellence, Cognitive Intelligence, and Application Performance Management for 12 years. She is the author of many research papers and thought leadership pieces and a well-known presenter and speaker. She has more than 25 years of experience in IT. Her passion is to help companies transform their IT organization, processes, and tools towards high-performing teams enabling their business partners to achieve better business results. She has helped some of the largest enterprises across the world to adopt new strategies, workflows, and automation within their journey towards a digital business.

 
