DevOps Institute

The Dawn of Site Reliability Engineering: SRE Fuelling the Journey Towards Digital Reinvention

SRE

May 22, 2020

By: Biswajit Mohapatra

The day-to-day responsibilities of IT development and operations teams are continuously evolving. Once upon a time, organizations had separate development and operations teams working in silos. Then came DevOps, which broke down those silos so that teams could build, deploy, run and manage systems together. This was a great improvement. However, many questions remained unanswered. How do we balance change velocity against stability, reliability and other operational attributes? How do we improve the performance of the system? How do we avoid incident response burnout? These questions gave rise to Site Reliability Engineering (SRE), with its increased focus on performance engineering, demand forecasting, capacity management, change management, incident management, setting service level indicators (SLIs), objectives (SLOs) and agreements (SLAs), risk budgeting, and proactive monitoring and tracking.

SRE is a specialized discipline that integrates software engineering practices and principles with the infrastructure and ongoing operation of systems. SRE is what happens when a software engineer is asked to address operational challenges such as those mentioned above. The main objective of SRE is to create scalable software systems by treating operations as a software engineering problem, designing reliable service architectures upfront and automating system administration tasks.

The ever-changing digital and cloud landscape brings an unprecedented need for collaboration, transparency, resiliency, stability, performance, reliability and correctness. SRE is a set of practices focused on reducing silos through shared ownership, planning for failures using error budgets, making small batch changes with a focus on stability, automating manual tasks and introducing a culture of measurement, monitoring and tracking. The fundamental goal of SRE is to provide a prescriptive approach to plan, build, implement, measure and achieve DevOps objectives, with a focus on reliability and automation at every opportunity.
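To make the idea of an error budget concrete, here is a minimal sketch in Python. It assumes a request-based availability SLO. The function names, the 99.9% target and the request counts are illustrative assumptions rather than anything prescribed by a particular SRE tool or by this article.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the measurement window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability objective.
    """
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no room for failure, so there is nothing to spend.
        return 0.0
    return (budget - failed_requests) / budget


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical release gate: allow risky changes only while budget remains."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    # Assumed figures: 1,000,000 requests in the window, 250 of which failed.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
    print(can_release(0.999, 1_000_000, 250))
```

In this sketch, a team that has spent its budget would pause feature releases and turn to stability work until the window resets, which is one common way of balancing change velocity against reliability.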

With the world becoming more intelligent, interconnected and instrumented, organizations are increasingly looking for new ways to improve the stability of their IT systems. This is leading them to focus on the Site Reliability Engineer as a role with both depth and breadth of understanding of how IT systems work, why they fail, what needs to be done to improve them, and how they can be better designed and monitored. Site Reliability Engineers are change agents within organizations who champion reliability best practices, design resilient systems, and implement processes, methods, tools and self-service solutions. They work with design, build and DevOps squads to establish elastic architecture and to bridge application and platform design from an operations point of view. The scope of SRE covers several critical areas of cloud platform architecture, such as orchestrated automation, responsive operation, optimized performance, just-in-time scaling, modernized environments and predictive event management.

Organizations planning to embrace SRE should take a staged approach: obtain stakeholder buy-in, establish a squad-based delivery team, identify the scope, define the processes, methods and tools for an integrated delivery pipeline, and finally iterate and evolve along the way in a factory delivery model. SRE is poised to galvanize strategy, business, technology and cost toward delivering the right outcomes. The mission is to define fast and flexible software engineering practices and principles that bring together the whole end-to-end digital reinvention journey for organizations.


 
