Image by Peshkova via Getty Images
A SKILup Day Event Recap
Recapped by Jaida Olvera
SKILup Days by DevOps Institute are back and better than ever for 2023! Each month, we’ll have an amazing line-up of speakers and IT-focused sessions to help you upskill and elevate your tech offering – all from the comfort of your own home or office.
At Site Reliability Engineering (SRE) SKILup Day on February 15, 2023, leading SRE experts and practitioners shared their stories of SRE implementations. They gave valuable insights into how SRE practices have accelerated their entry into the digital economy and how their investment has provided returns in business and customer outcomes.
If you missed the event, we’ve got you covered with a round-up of the top themes from the sessions and conversations around this incredibly important topic.
Why is SRE Critical for Digital Transformation?
SRE has become a must-have engineering practice for enterprises seeking to accelerate digital transformation or re-engineer their interfaces to be digital-first. According to our SRE research, 62% of survey respondents said that they are leveraging SRE. As enterprises implement SRE in their teams by developing and adapting the best practices introduced by Google, this operating model continues to gain attention from IT decision-makers.
Our research also highlighted a variety of benefits from adopting SRE: teams that have adopted SRE perceived their companies as leaders in customer experience and in the quality and speed of innovation across products, offerings, processes and services. For digital transformation or a digital-first strategy to succeed, customer experience and the speed of innovation need continuous improvement. But there are also challenges associated with the adoption of SRE, such as finding people with the skills and knowledge to step into this new role and operating model.
We’ll review key discussion points from the SKILup Day sessions in this post.
TL;DR: SRE SKILup Day speakers explored:
- Introducing: The New Reliability
- Why SRE is Fundamental to Outstanding Customer Experience
- Best SRE Practices to Help Developers Troubleshoot Kubernetes
- Observability vs. Monitoring – A Common Pattern in Operations
- Flow: Friend or Foe?
- Chaos Experiments Under the Lens of AIOps
- How to Create Saga-Free Distributed Transactions
- How to Assess SRE Status Quo Through Maturity Models
Introducing: The New Reliability
Emily Arnott, Community Manager at Blameless, joined this SKILup Day to introduce a new definition of reliability. Arnott highlighted that although reliability seems nebulous, pinning down a definition that your org can agree on has tremendous benefits. She presented reliability as something clear, measurable and concrete. She then shared what you need to present reliability to your organization in a way that’s true to your needs and resonates across the business overall.
Arnott’s session reintroduced humanity into the reliability equation and emphasized that reliability is not just about product health. It’s also about the humans using your app and the humans behind it. Reliability needs to be considered in terms of system health, the happiness of users, and the resilience of teams, which is often the biggest factor in the system’s success or failure in crucial moments.
Why SRE is Fundamental to Outstanding Customer Experience
Suresh GP, Managing Director at TaUB Solutions, presented pragmatic insights for building a business case to secure leadership funding for the SRE journey and demonstrated how SRE can drive an outstanding customer experience. In the session, GP said that he sees many DevOps professionals struggle to articulate SRE’s business value to an organization’s senior executives.
He explained that it is key to understand the driver for SRE adoption (user experience) and how to pitch its value to a board to garner the necessary funding and support.
GP further explored the value proposition of SRE for outstanding customer experience and shared SRE trends and their implications for business value. He also presented key tenets and principles for getting management buy-in and support.
Best SRE Practices to Help Developers Troubleshoot Kubernetes
Andreas Prins, VP of Product at StackState, introduced several best practices for enabling Site Reliability Engineers to take a leadership role in scaling Kubernetes skills by sharing knowledge across development teams.
Prins observed that with the rapid adoption of Kubernetes, many development teams struggle to build the right troubleshooting skills. Fast remediation of issues is critical to avoid customer disruption, yet Kubernetes is a complex technology. Prins then presented how to encode best practices in easy-to-understand ways to help novices learn Kubernetes skills. He then demonstrated how to implement the necessary process automation to ensure high-quality Kubernetes services.
Observability vs. Monitoring – A Common Pattern in Operations
Larry Sellers, Public Cloud Services Product Lead at NTT DATA Services, explored a common IT operations pattern in which operations intersects with application support, along with its causes and impact. Sellers gave an overview of how practicing SRE with observability can overcome the shortcomings of this pattern and drive true indicators of reliability.
Sellers also explained the difference between monitoring and observability. He highlighted how many organizations lean on pre-defined metrics monitoring to achieve availability. Finally, he demonstrated how to leverage SLOs, SLIs and normal product behavior to determine an observability strategy.
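For readers new to the vocabulary, the relationship between an SLI and an SLO can be shown with a little arithmetic. The sketch below is illustrative only and is not taken from Sellers’ session: the function name, the request counts and the 99.9% target are all hypothetical. It computes an availability SLI (good requests divided by total requests), checks it against an SLO target, and reports how much of the error budget is left.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                     # measured availability, 0.0 to 1.0
    slo_met: bool                  # True when the SLI meets or beats the target
    error_budget_remaining: float  # fraction of budget left; negative if overspent


def evaluate_availability_slo(good_requests: int,
                              total_requests: int,
                              slo_target: float = 0.999) -> SloReport:
    """Hypothetical helper: compare an availability SLI to an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")

    # SLI: the share of requests that were served successfully.
    sli = good_requests / total_requests

    # Error budget: the failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    remaining = 1.0 - actual_failures / allowed_failures

    return SloReport(sli=sli,
                     slo_met=sli >= slo_target,
                     error_budget_remaining=remaining)


if __name__ == "__main__":
    # Made-up numbers: 1,000,000 requests in the window, 500 of them failed.
    report = evaluate_availability_slo(999_500, 1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}")
    print(f"SLO met: {report.slo_met}")
    print(f"Error budget remaining: {report.error_budget_remaining:.0%}")
```

In practice the counts would come from whatever telemetry an organization already collects, and the target would be agreed with the business rather than hard-coded.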
Flow: Friend or Foe?
Josh Ether, Principal Agilist, Coach and Consultant at Fidelity Investments, declared that Flow is intimately tied to the very nature of the SRE job. Ether’s session defined the relationships between Flow and SRE. Ether emphasized that SREs are constantly working on complex projects that require a great deal of attention to detail and focus. To ensure the reliability of a system, one must be able to think and act quickly, seeking out and troubleshooting problems as they arise.
Ether indicated that the same level of concentration and focus is necessary for achieving Flow. It is no surprise that SREs benefit from, and often actively seek out, Flow. Building on this, Ether gave an overview of what makes up Flow, its pros and cons, and how to use it to develop and enhance the SRE experience.
Chaos Experiments Under the Lens of AIOps
Michele Dodic, SRE DevOps Specialist at Accenture, used case studies, reflections on the main principles and best practices, and a prepared live demo to discuss how a combination of Chaos Engineering and AIOps tools can significantly increase cyber resiliency while maintaining full end-to-end transparency and observability of your entire system.
Dodic shared this scenario – imagine you’re an SRE at a major tech giant and you are responsible for overall system health. Numerous alerts, server crashes, Jira tickets, incidents and an avalanche of responsibilities are just some of the daily struggles an average SRE goes through.
Dodic highlighted that thanks to AIOps and Chaos Engineering, things can be different.
He presented how to ensure your system’s resiliency and robustness, as well as how to apply Chaos Engineering to test your predictive AI solution. The session also provided an overview of how to bridge the gap between security and DevOps.
How to Create Saga-Free Distributed Transactions
Ilya Kaznacheev, Technical Lead at MTS Cloud, indicated that you can’t just cut a monolith into microservices. Simple database transactions become a stumbling block and a pain for developers once they turn into distributed transactions. Further, existing solutions are complex and unreliable: they are often unnecessary, difficult to implement, or unstable to operate.
Kaznacheev explored how to resolve this issue using state management in microservices. He also demonstrated how to achieve saga-free distributed transactions with a simple and transparent process, without breaking domain isolation.
How to Assess SRE Status Quo Through Maturity Models
Yury Niño Roa, Cloud Infrastructure Engineer at Google, shared her expertise on an SRE maturity model based on a set of questions that allow one to determine the level of adoption in an organization. Niño Roa provided key insights into the composition of an SRE maturity model and how to leverage an assessment and take the next steps after it.
She then presented a tool that collects the answers and provides a regressive, beginner or advanced score, and explored the next steps to progress to higher levels of adoption.
There are many events, webinars, and in-person opportunities for 2023. Stay up to date with the DevOps Institute event calendar: devopsinstitute.com/events
Sharpen your skills by subscribing to SKILup IT Learning. Watch SKILup Day and SKILup Hour content and earn DevOps Institute Continuing Education Units (CEUs) as part of our Continuing Education Program.
Learn more and subscribe: devopsinstitute.com/skilup-it-learning
Get SRE Certified
DevOps Institute empowers DevOps humans to advance career development and upskill for enterprise transformation by providing the resources, guidance, experts, and encouragement to learn. We’ve put together a suggested SRE Engineer Certification Path and offer essential core competencies and various certifications to help advance your DevOps career and grow professionally.
Get started at devopsinstitute.com/certifications