DevOps Institute

The Practice of Chaos Engineering Observability



Updated January 19, 2023
By: Biswajit Mohapatra

Applications and services are evolving rapidly with increasing complexity over time. At the same time, supporting architecture, infrastructure, and storage complexities are growing exponentially, making systems more prone to failure. Today’s modern distributed applications and services are associated with many unpredictable failure scenarios that are extremely difficult to monitor across all failure points.

Monitoring alone is no longer sufficient

Monitoring is the process of checking the behavior of an application, service, and supporting resources to ensure everything is functioning as expected. However, monitoring alone is not enough for modern digital applications and services, with their complex integration points and interfaces across operating systems, Kubernetes layers, and application stacks. This growth in complexity and dependencies gave rise to observability as a discipline built on three pillars: logs, metrics, and traces.

What is Observability?

Observability is the property of a system that makes it possible to understand what is going on inside it and to gather the information and insights needed to troubleshoot. It is about understanding a system's failure modes and how data and insights can be leveraged to iterate on and improve the system. Correlating logs, metrics, and traces, coupled with index-free log aggregation and data-driven insights, is poised to drive the success of observability solutions in the future. Observability determines the internal state of a system from knowledge of its external outputs. A good observability solution should be able to externalize data with additional learning embedded in it. Sometimes we don't even try to fix a problem because we don't know it exists.
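In practice, correlating logs, metrics, and traces often comes down to attaching a shared identifier to every signal a request emits, so an aggregation backend can join them later. Here is a minimal sketch in Python; the field names and the `trace_id` scheme are illustrative assumptions, not any specific vendor's format:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)

def handle_request(path: str) -> dict:
    """Emit structured log lines and a metric record that all carry
    the same trace_id, so they can be correlated downstream.
    (Hypothetical handler for illustration only.)"""
    trace_id = uuid.uuid4().hex  # shared correlation key (assumption)
    start = time.monotonic()

    logging.info(json.dumps(
        {"event": "request.start", "path": path, "trace_id": trace_id}))
    # ... real request handling would happen here ...
    latency_ms = (time.monotonic() - start) * 1000

    # A metric-style record carrying the same trace_id as the logs.
    metric = {"name": "request.latency_ms",
              "value": round(latency_ms, 2),
              "trace_id": trace_id}
    logging.info(json.dumps({"event": "request.end", **metric}))
    return metric

result = handle_request("/checkout")
```

Any backend that can filter on `trace_id` can now reconstruct the full story of this one request across all three pillars.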

Observability meets Chaos Engineering

Chaos engineering is the practice of running controlled experiments to uncover weaknesses in a system. Crash-testing your systems in a simulated environment helps identify failure modes and take corrective measures. The goal is to identify and address issues proactively before they reach your users. This is achieved by hypothesizing normal steady-state behavior and then creating failure modes that challenge that hypothesis: modeling system failures to improve resiliency, simulating production load, injecting faults, rolling out changes in a controlled way with canary deployments, and varying real-world scenarios by simulating hardware failures, malformed responses within ecosystems, and sudden traffic spikes to check scalability, all while reducing the blast radius to contain and minimize the impact of the experiments.

Here is a Chaos Engineering workflow that includes the following steps:

(1)  Plan the experiment, creating a hypothesis around steady-state behavior

(2)  Initiate an attack small enough to give information about how the system reacts

(3)  Measure the impact compared with steady-state metrics

(4)  If an issue (or issues) is detected, cut off the attack and fix the issue

(5)  If no issue is detected, scale the attack until issues are observed

(6)  Learn, make improvements and automate experiments to run continuously
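The six steps above can be sketched as a simple control loop. Everything in this sketch is a hypothetical harness; `inject_fault`, the error-rate threshold, and the intensity steps are assumptions for illustration, not the API of any real chaos tool:

```python
import random

# (1) Hypothesis: under fault injection, the error rate stays below 1%.
STEADY_STATE_ERROR_RATE = 0.01

def inject_fault(intensity: float) -> float:
    """Stand-in for a real attack (e.g. added latency, killed pods).
    Returns the observed error rate; here it is simulated."""
    return intensity * random.uniform(0.0, 0.05)

def run_experiment(max_intensity: float = 1.0, step: float = 0.2) -> dict:
    intensity = step
    observed = 0.0
    while intensity <= max_intensity:
        observed = inject_fault(intensity)         # (2) start with a small attack
        if observed > STEADY_STATE_ERROR_RATE:     # (3) compare against steady state
            # (4) weakness found: halt the attack and report it for fixing
            return {"failed_at": intensity, "error_rate": observed}
        intensity += step                          # (5) otherwise, scale the attack
    # (6) no weakness at this scale: record the result and automate the run
    return {"failed_at": None, "error_rate": observed}

result = run_experiment()
```

A real harness would replace the simulated fault with an actual injection and pull `observed` from the monitoring system, but the control flow, including the abort-on-violation step that bounds the blast radius, stays the same.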


Learn from your weaknesses before they become failures

Chaos engineering introduces real failures into systems to assess their fault tolerance, recoverability, resiliency, and high availability. By designing chaos engineering experiments, you can learn about weaknesses in the system that could potentially lead to failures. These weaknesses can then be addressed proactively, going beyond the reactive process that currently dominates most incident response models. However, it's important not to rush into the practice of chaos engineering without properly planning and designing the experiments.

Every chaos experiment should begin with a hypothesis

The test should be designed with a small scope, with a focused team working on it. Every organization should pursue controlled chaos, guided by observability, to improve system resiliency.

Leverage observability to discover and overcome system weaknesses: without observability, there is no chaos engineering. Organizations need to build a culture of observing systems. It is no longer just about writing code; it is about adopting processes that build resilient systems. Introducing continuous chaos into your DevOps CI/CD pipeline automates experiments and failure testing, enabling you to detect, debug, and fix issues more proactively. The practice of chaos engineering observability improves confidence in the system, enables faster deployments, prioritizes business KPIs, and drives the auto-healing of systems. AI/ML can help build observability patterns and antipatterns by closely monitoring system and user behavior over time. Hypotheses developed over these patterns and antipatterns can then help systems auto-heal.
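Continuous chaos in a pipeline can be as simple as a gated stage: run the experiment on every build and fail the pipeline when the steady-state hypothesis is violated. A hedged sketch follows; the thresholds and the `probe_health` helper are invented for illustration, and a real pipeline would query its observability backend instead:

```python
# Steady-state hypothesis expressed as CI gate thresholds (assumptions).
STEADY_STATE = {"availability": 0.999, "p95_latency_ms": 250}

def probe_health() -> dict:
    """Stand-in for querying the observability backend after a fault
    injection run; a real stage would call a metrics API here."""
    return {"availability": 0.9995, "p95_latency_ms": 180}

def chaos_gate() -> int:
    """Return 0 (pass) or 1 (fail), like a CI step's exit code."""
    observed = probe_health()
    ok = (observed["availability"] >= STEADY_STATE["availability"]
          and observed["p95_latency_ms"] <= STEADY_STATE["p95_latency_ms"])
    return 0 if ok else 1

exit_code = chaos_gate()
```

Wiring this into the pipeline means a resiliency regression blocks the deployment the same way a failing unit test would.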

The industry has started to recognize the differentiated value proposition of the practice of chaos engineering observability, which is helping to address many unknown facets of system unpredictability. Chaos engineering experiments, coupled with the cognitive observability study of complex systems using trend analysis, regression analysis, and time series analysis, will take systems to new heights in the near future.

