Eveline Oehrlich, Chief Research Officer, DevOps Institute
According to Pew Research, work is one of the most common sources of meaning for adults around the world: A median of 25% across the 17 publics mention their job, career, or profession. In 12 of the 17 publics surveyed, work is among the top three most mentioned topics – and in Spain, it ranks even higher than family and children. Wow! But it is not only work. In further research, I discovered that individuals who make a difference are happier. So, I need to find a problem I care about and start solving it. Obviously, I will not fix the global pandemic or any other world problem, but I can contribute to making a difference. And that feeling of making a difference, I now realize, has kept me going through my past years of working and has given me meaning.
The COVID-19 Pandemic Demanded Stamina, Agility, and Resistance
Agility and resistance have been key success factors for organizations, leaders, and all of us working and living through the past two years. We have also seen tremendous growth of digital services and products and accelerated digital transformation. For example, 58% of customer interactions were digital in 2020 vs. 36% in 2019. IT teams have played a major role, and a big focus has been on existing (or lacking) operational excellence. IT leaders had to think about optimizing various operational processes, maximizing value, and minimizing waste. Resources, velocity, quality, monitoring, and real-time data became top of mind.
The Technology and Process Conundrum
Whether your organization has moved its computing environment to the cloud or operates in a hybrid or legacy environment, your company has experienced some outage or loss of service. According to ITIC’s 12th annual 2021 Hourly Cost of Downtime Survey, about 44% of firms indicate that hourly downtime costs range from more than $1 million to over $5 million, exclusive of legal fees, fines, or penalties. Outages cannot be avoided entirely, so in the digital 24/7 world, how you respond to downtime is critical. What you should consider:
- Technology for resilience and reliability. Some existing IT architectures can absorb or adjust to performance impacts and adapt to new levels of demand. Infrastructure and operations (I&O) teams are working on this already. Gartner predicts global public cloud spending to grow at a rate of 21.7% to reach $482 billion in 2022, up from $396 billion, and IT organizations are embracing new serverless architectures at different levels of adoption.
- Continuously improving processes is essential. While self-healing infrastructure, workload balancing, and other capabilities are helpful, how we go about solving issues has a significant impact on preventing them in the first place. The shift from fixing problems to proactive planning and automation is happening with Site Reliability Engineering (SRE) as a best practice. The mentality of fighting fires is too stressful, and the impact on business in the digital environment is bigger than before. I&O leaders recognize that people and culture must shift towards proactive or predictive operational excellence instead of chaos. We found that the adoption of SRE grew to 22% in 2021 compared to 15% in the previous year, and we expect this adoption to continue.
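To make the downtime figures above concrete, here is a minimal Python sketch that converts an availability target into allowed downtime per year and an estimated annual cost. The availability targets and the function names are illustrative assumptions, not figures from the survey; the $1 million hourly cost is simply the lower bound cited above.

```python
# Illustrative sketch: availability targets below are hypothetical examples.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def annual_downtime_cost(availability_pct: float, hourly_cost: float) -> float:
    """Estimated yearly cost if all permitted downtime is actually used."""
    return annual_downtime_hours(availability_pct) * hourly_cost


# Assumption: $1 million per hour, the lower bound mentioned in the text.
HOURLY_COST = 1_000_000

for target in (99.0, 99.9, 99.99):
    hours = annual_downtime_hours(target)
    cost = annual_downtime_cost(target, HOURLY_COST)
    print(f"{target}% availability: {hours:.2f} h/year, about ${cost:,.0f}")
```

Even under these simplified assumptions, each additional "nine" of availability cuts the permitted downtime, and the associated cost, by a factor of ten.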
Initiate Operational Excellence by Leveraging SRE Principles
An SRE team is typically composed of individuals with skills in software development, networking, and systems engineering. The goal for this team is to spend about 50% (or more) of its time on automation that resolves issues before they happen, instead of only fixing things after the fact. Incidents that do happen become a source of learning, and the knowledge gained is used to automate and improve the affected systems, creating a more resilient ecosystem. SRE has many principles, but two of the most important are:
- SRE means focusing on shared responsibility. Blame is typical, but within SRE, it is essential to avoid blame and look for the learning opportunity. This means the team must look at the current state and what caused the incident, and at the future state and what can be done to avoid the incident, without blaming teams or individuals. SREs must work across different teams such as infrastructure and applications, understand processes, and have the human skills to work this way. This blameless approach allows individuals to learn what went wrong and enables them to change things.
- Deploying different SRE models. While uptime and 24/7 availability are key expectations for today’s digital economy, time zones, availability requirements, IT infrastructure, application footprints, and existing skills vary greatly across IT teams. Therefore, there is no single way to organize and structure an SRE team, and leadership teams should discuss which model to use. An SRE's skills range across operations, infrastructure, networking, hardware, distributed systems, monitoring, stability, capacity planning, and software engineering; SREs must also be knowledgeable about the architecture and implementation of technical infrastructure and supporting services.
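The goal mentioned earlier, spending about 50% (or more) of the team's time on automation, can be tracked with very simple arithmetic. The following Python sketch is one hypothetical way to do it; the `WorkLog` structure, its field names, and the sample hours are assumptions made for illustration, not part of any SRE tool or standard.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """Hypothetical summary of how an SRE team spent its hours in a period."""
    engineering_hours: float  # automation, tooling, reliability projects
    reactive_hours: float     # incidents, tickets, manual fixes


def engineering_share(log: WorkLog) -> float:
    """Fraction of total logged time spent on proactive engineering work."""
    total = log.engineering_hours + log.reactive_hours
    if total == 0:
        return 0.0  # nothing logged, so no share can be claimed
    return log.engineering_hours / total


def meets_automation_goal(log: WorkLog, goal: float = 0.5) -> bool:
    """True when the engineering share reaches the goal (50% by default)."""
    return engineering_share(log) >= goal


# Sample data (made up): a week with more firefighting than automation.
week = WorkLog(engineering_hours=18, reactive_hours=22)
print(f"Engineering share: {engineering_share(week):.0%}")
print(f"Meets 50% goal: {meets_automation_goal(week)}")
```

A team that repeatedly falls below the goal has a signal that reactive work is crowding out the automation meant to prevent future incidents.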
Further Your Understanding of SRE
To help organizations and individuals pursuing SRE further, DevOps Institute has created an eBook: “SKILbook Summary: Site Reliability Engineering.” In this abridged version of the full Site Reliability Engineering SKILbook, you’ll find guidance to get you started towards a sustainable SRE practice. The Summary SKILbook is written by a group of experienced ambassadors and practitioners and supported through the research and data of the DevOps Institute.
In my experience, the companies that are most successful at extracting the full value from SRE commit to managing IT operations differently. They have an integrated SRE operating model built on teams of people (including those from development, infrastructure, and operations) with essential skills and capabilities, and they adopt automation to sustain reliable and available services for their employees and customers.