Updated January 19, 2023
By Eveline Oehrlich, Chief Research Officer, DevOps Institute
According to Pew Research, work is one of the most common sources of meaning for adults around the world: A median of 25% across the publics surveyed mention their job, career, or profession. In further research, I discovered that individuals making a difference are happier. Finding a problem you care about and solving it can have powerful outcomes.
What the COVID-19 Pandemic Meant for Stamina, Agility and Resistance
Agility and resistance have been key success factors for organizations, leaders, and all of us working and living through the past two years. We have also seen tremendous growth of digital services and products and accelerated digital transformation.
According to industry experts and market research organizations, the digital transformation market is projected to grow at a compound annual growth rate of 19.1%, from $521.5 billion in 2021 to $1,247.5 billion in 2026. IT teams have been major players in that growth, and a big focus has been on existing (or lacking) operational excellence. IT leaders have had to think about optimizing operational processes, maximizing value, and minimizing waste. Resources, velocity, quality, monitoring, and real-time data became top of mind.
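As a quick sanity check on how a compound annual growth rate relates a starting figure to a projected one, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are my own, and the inputs are the starting value, rate, and years quoted above. Compounding $521.5 billion at 19.1% over five years lands at roughly $1.25 trillion; small differences from a published projection would come from rounding of the rate.

```python
def project_value(start: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a fixed annual growth rate."""
    return start * (1 + cagr) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Solve end = start * (1 + r) ** years for the annual rate r."""
    return (end / start) ** (1 / years) - 1


# Inputs quoted in the text: $521.5 billion in 2021, 19.1% CAGR, through 2026.
years = 2026 - 2021
projected_2026 = project_value(521.5, 0.191, years)
print(f"Projected 2026 market size: ${projected_2026:,.1f} billion")

# Round trip: recovering the rate from the start and end values.
print(f"Implied CAGR: {implied_cagr(521.5, projected_2026, years):.1%}")
```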
The Technology and Process Conundrum
Whether your organization has moved its computing environment to the cloud or operates in a hybrid or legacy environment, your company has experienced some outage or loss of service. According to ITIC’s 12th annual 2021 Hourly Cost of Downtime Survey, about 44% of firms indicate that hourly downtime costs range from more than $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Outages cannot be avoided entirely, so in a 24/7 digital world, how you respond to downtime is critical. What you should consider:
- Technology for resilience and reliability. Some existing IT architectures can absorb or adjust to performance impacts and adapt to new levels of demand. Infrastructure and operations (I&O) teams are working on this already. Gartner predicts global public cloud spending will grow 21.7% to reach $482 billion in 2022, up from $396 billion in 2021, and IT organizations are embracing new serverless architectures at different levels of adoption.
- Continuously improving processes is essential. While self-healing infrastructure, workload balancing, and other capabilities are helpful, how we go about solving issues has a significant impact on preventing them in the first place. The shift from fixing problems to proactive planning and automation is happening, with Site Reliability Engineering (SRE) as a best practice. The firefighting mentality is too stressful, and the business impact of outages in the digital environment is bigger than before. I&O leaders recognize that people and culture must shift towards proactive or predictive operational excellence instead of chaos. In 2021, we found that SRE adoption had grown to 22%, up from 15% the previous year, and we expect this growth to continue.
Initiate Operational Excellence by Leveraging SRE Principles
An SRE team is typically composed of individuals who possess skills in software development, networking, and system engineering. The goal for this team is to spend about 50% (or more) of their time automating to resolve issues before they happen, instead of only fixing them afterwards. Incidents that do happen become a source of learning, and the knowledge gained from them is used to automate and improve whatever failed, creating a more resilient ecosystem. SRE has many principles, but two of the most important are:
- SRE means focusing on shared responsibility. Blame is typical, but within SRE, it is essential to avoid blame and look for the learning opportunity. This means the team must look at the current state and what caused the incident, and at the future state and what can be done to avoid a repeat, without blaming teams or individuals. SREs must work across different teams, such as infrastructure and applications, understand processes, and have the human skills to work this way. This blameless approach allows individuals to learn what went wrong and enables them to change things.
- Deploying different SRE models. While uptime and 24/7 availability are key expectations for today’s digital economy, time zones, availability requirements, IT infrastructure, application footprints, and existing skills vary greatly across IT teams. Therefore, there is no single way to organize and structure an SRE team, and leadership teams should decide together which model to use. The skills for an SRE range across operations, infrastructure, networking, hardware, distributed systems, monitoring, stability, capacity planning, and software engineering. SREs must also be knowledgeable about the architecture and implementation of technical infrastructure and supporting services.
Further Your Understanding of SRE
In my experience, the companies that are most successful at extracting the full value from SRE commit to managing IT operations differently. They have an integrated SRE operating model built on teams of people (including those from development, infrastructure, and operations) with the essential skills and capabilities, and they adopt automation to sustain reliable and available services for their employees and customers.
To help organizations and individuals pursue SRE further, DevOps Institute offers Site Reliability Engineering (SRE) Foundation and Practitioner-level certifications that provide a deeper understanding of the practical implementation of SRE culture.
What You’ll Learn
- SRE Principles and Practices
- Service Level Objectives and Error Budgets
- Reducing Toil
- Monitoring and Service Level Indicators
- SRE Tools and Automation
- Anti-Fragility and Learning from Failure
- Organizational Impact of SRE
- SRE, Other Frameworks, The Future
Gain the skills required to identify, troubleshoot and solve complex problems in the world of SRE. Learn more at devopsinstitute.com/certifications
To access more SRE resources, subscribe to SKILup IT Learning