By Eveline Oehrlich, Chief Research and Content Officer at DevOps Institute
One thing we learned during the pandemic is that balance is critical, whether that means maintaining a healthy work-life balance during the week, dividing time between family and the home office, managing the amount of time spent on virtual calls, or weighing our own fear, uncertainty, and doubt (FUD) against the positive outlook that this pandemic will not last forever. I was truly fortunate to be involved in many exciting research projects with excellent collaborators and partners over the last year, which helped me sustain a balance of happiness and progress both professionally and personally.
One such project was to design, develop, analyze, and co-author the 2021 SRE report in partnership with Catchpoint and VMware. This was the second study that we worked on together. While the 2020 report had already revealed great detail about how Site Reliability Engineering (SRE) is adopted, we dug deeper in this research to gain more insight into the patterns used (or not used) and to understand other aspects of the ecosystem and automation usage.
Here is a Summary of the Five Key Findings from the Report:
1. Good balancing on operations vs. development work.
I am excited to share that today's Site Reliability Engineers (SREs) also know how to balance. The 300 passionate survey respondents are balancing their operations and on-call work (a median of 20% of time spent on call) with development work (a median of 40% of time spent exclusively on development). It is a good sign that SREs across the globe are spending time on development activities. We can conclude that SREs are sharing their production wisdom with the development teams and thereby contributing to the reliability and health of applications and services.
2. Service Level Objectives (SLO) are continually refined.
SLOs define the specifics of a metric within a Service Level Agreement (SLA), for example, the uptime or response time of a service. These are goals that the team providing the service needs to meet. An SLO is typically a numerical target that is determined by site reliability engineering and set for the service described in the SLA. For example: the application will be available 99.95% of the time over any given 24-hour period. While SLOs are easier to determine than SLAs, the challenge still lies in making them specific enough. About 50% of our survey respondents told us that they are continually refining SLOs, while 30% publish them to their customers to set expectations.
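To make the 99.95% example concrete, here is a minimal Python sketch that converts an availability target into the downtime it permits over a window. The function name `allowed_downtime_seconds` is an illustrative assumption, not something taken from the report.

```python
def allowed_downtime_seconds(slo_percent: float, window_seconds: float) -> float:
    """Return the downtime an availability SLO permits within a time window.

    Hypothetical helper for illustration; not part of any SRE tool.
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    # The share of the window that may be unavailable is 1 minus the target.
    return (1.0 - slo_percent / 100.0) * window_seconds


if __name__ == "__main__":
    one_day = 24 * 60 * 60  # 86,400 seconds in a 24-hour period
    budget = allowed_downtime_seconds(99.95, one_day)
    print(f"99.95% over 24 hours allows about {budget:.1f} seconds of downtime")
```

For a 24-hour window, 0.05% of 86,400 seconds works out to roughly 43 seconds of permitted downtime, which shows how tight such a target is in practice.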
3. SRE topology depends on the organization’s ecosystem.
No two IT and business technology ecosystems are the same. Our survey participants had a multi-provider approach across the board for DNS, API, and CDN. Such environments improve the resilience of systems and allow organizations to tap into the strengths of the different providers. But they also bring complexity, a need for different skills, diverse automation challenges, and the never-ending story of different tools and rules to manage each provider's tech stack. The topologies used to manage a multi-provider environment vary from a centralized SRE team to a decentralized one. Digging into the survey data with further correlations, we found that as the number of SREs and employees within an organization increases, SREs become more decentralized. Perhaps this path is unavoidable, given the different expertise of SREs.
4. Performance monitoring is always focused on infrastructure, networks, and applications.
The three types of monitoring most often reported as "always used" are infrastructure performance (62%), application performance (49%), and network performance (42%), followed by digital experience monitoring (26%) and AIOps (12%). Benchmarking intelligence (9% always) and public sentiment/social media monitoring (8% always) are still rarely used.
5. IT operations thinking still drives the use of monitoring data.
The responses to the question, “What drives the use of monitoring data?” showed that augmenting troubleshooting and root cause analysis was the resounding leader, with 66% of responses. “Ensuring service level objectives (SLOs) are met” came in second. Additionally, 49% of SREs said that “enhance the customer experience” was a major driver for the use of monitoring data. This shows a nice pivot by SREs towards understanding the “outside-in” customer experience perspective.
Give Your SRE Efforts a Boost
There are a variety of other nuances and findings within the report which I have not highlighted here. There is still much for you to discover by reading the report. Look and explore for yourself what baselines exist, then look at your own SRE team and see where you can innovate further. But if you are eager to bring some suggestions to your team or want to take some actions to get things moving…
Here are Five Tips to Ramp up Your SRE Efforts:
1. Continue to focus on toil by automating the heck out of your environment.
While SREs at Google might spend 50% of their time on development, Google might not be the typical use case. SREs perform all types of activities including operational tasks, tactical implementations, and high-level strategic initiatives. The goal to “automate ALL the things!” is a core SRE tenet so that you and your team can focus on value-based activities. This increases the happiness factor, reduces stress and burnout, and provides balance within the SRE job.
2. Measure toil and use an error budget.
The past year might not have inspired you to measure toil, but as digital services grow and the push toward digital transformation continues, it is essential to measure toil and use an error budget to balance innovation with managing existing services and applications. The business will continue to demand increases in velocity, quality, and security, and SRE plays a critical role in meeting these demands.
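As a rough illustration of both measurements, the Python sketch below computes how much of an error budget is left and what share of working time went to toil. It assumes a request-based SLO, where the error budget is the fraction of requests allowed to fail (1 minus the SLO). The function names and all sample numbers are hypothetical and do not come from the report.

```python
def error_budget_remaining(slo_percent: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    Assumes a request-based SLO; names and inputs are illustrative only.
    """
    allowed_failures = (1.0 - slo_percent / 100.0) * total_requests
    if allowed_failures <= 0:
        raise ValueError("an SLO of 100% (or zero traffic) leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Return the share of working time spent on manual, repetitive work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


if __name__ == "__main__":
    # Hypothetical month: 99.9% SLO, 1,000,000 requests, 250 failures.
    remaining = error_budget_remaining(99.9, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")

    # Hypothetical week: 14 of 40 hours spent on toil.
    print(f"Toil share of the week: {toil_fraction(14, 40):.0%}")
```

A team could track both numbers over time: a shrinking error budget argues for slowing releases and investing in reliability, while a growing toil share points to work that is worth automating.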
3. Adopt a platform operation team.
The sheer complexity of the different platforms across the different providers within your ecosystem demands a different operating model. Adopting a platform operation team that allows for improved scalability, continuous learning, knowledge transfer, expertise, and focused collaboration is a must.
4. Expand your monitoring beyond the typical.
While infrastructure, network, and application performance are essential, so is feedback from the customer, patient, or client experience. This feedback allows you to adjust products, services, or applications to create happy, repeatable, and profitable experiences with your company’s products and/or brand.
5. Converge your IT metrics with business and employee metrics for happiness.
Adopt metrics that augment your existing IT metrics to avoid being stuck in IT operations mode. The key is to focus on the value and benefits you deliver and promise to your customers, as seen through their eyes (and on what they demand), and then look for the metrics that support those values and benefits.
The 2021 SRE Report is now available for download. Get your copy.