Updated January 19, 2023
By: Biswajit Mohapatra
Applications and services are evolving rapidly and growing more complex over time. At the same time, the supporting architecture, infrastructure and storage are growing exponentially in complexity, making systems more prone to failure. Today's distributed applications and services face many unpredictable failure scenarios that are extremely difficult to monitor across all failure points.
Monitoring alone is no longer sufficient
Monitoring is the process of checking the behavior of an application or service and its supporting resources to ensure everything is functioning as expected. However, monitoring alone is not enough for modern digital applications and services, with their complex integration points and interfaces spanning operating systems, Kubernetes layers and application stacks. This growth in complexity and interdependency gave rise to observability as a discipline built on three pillars: logs, metrics and traces.
What is Observability?
Observability is the property of a system that lets you understand what is going on inside it and gather the information and insights needed to troubleshoot. It is about understanding a system's failure modes and how data and insights can be leveraged to iterate on and improve the system. Correlation between logs, metrics and traces, coupled with index-free log aggregation and data-driven insights, is poised to drive the success of observability solutions in the future. Observability infers the internal state of a system from its external outputs. A good observability solution should be able to externalize that data, with additional learning embedded into it. Without it, some problems never get fixed simply because no one knows they exist.
Observability meets Chaos Engineering
Chaos engineering is the practice of running controlled experiments to uncover weaknesses in a system. Crash-testing your systems in a simulated environment helps you identify failure modes and take corrective measures; the goal is to find and address issues proactively, before they reach your users. In practice, this means hypothesizing normal steady-state behavior and then creating failure modes that test that hypothesis: modeling system failures to improve resiliency, simulating production load, injecting faults, rolling out changes gradually with canary deployments, and varying real-world scenarios by simulating hardware failures, malformed responses within the ecosystem, and sudden traffic spikes to check scalability. Throughout, the blast radius is kept small to contain and minimize the impact of the experiments.
Here is a Chaos Engineering workflow that includes the following steps:
(1) Plan the experiment, forming a hypothesis around steady-state behavior
(2) Initiate an attack small enough to reveal how the system reacts
(3) Measure the impact against steady-state metrics
(4) If an issue (or issues) is detected, cut off the attack and fix the issue
(5) If no issue is detected, scale up the attack until issues are observed
(6) Learn, make improvements and automate experiments to run continuously
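The middle of this workflow (steps 2 through 5) can be sketched as a simple loop. The function names, attack levels and error-rate threshold below are illustrative assumptions, not part of any specific chaos tool:

```python
def run_chaos_experiment(inject_fault, measure_error_rate,
                         levels=range(1, 11), threshold=0.05):
    """Steps 2-5: start with a small attack, measure the impact
    against the steady-state hypothesis (error rate stays below
    `threshold`), and either stop on an issue or scale the attack.
    Returns (level, error_rate) where the issue appeared, or
    (None, 0.0) if the system tolerated the full range."""
    for level in levels:
        inject_fault(level)              # (2) initiate a small attack
        rate = measure_error_rate()      # (3) measure impact vs. steady state
        if rate >= threshold:            # (4) issue detected: cut off the attack
            return level, rate           #     ...and go fix the issue
        # (5) no issue detected: continue scaling up the attack
    return None, 0.0
```

Wiring `inject_fault` and `measure_error_rate` to real tooling, and running the loop on a schedule, is what step 6's continuous automation amounts to.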
Learn from your weaknesses before they become failures
Chaos engineering introduces real failures into systems to assess their fault tolerance, recoverability, resiliency and high availability. By designing chaos engineering experiments, you can uncover weaknesses that could eventually lead to failures, then address them proactively, going beyond the reactive process that dominates most incident response models today. However, it's important not to rush into chaos engineering without properly planning and designing your experiments.
Every chaos experiment should begin with a hypothesis
Each test should be designed with a small scope and a focused team working on it. Every organization should pursue controlled chaos, guided by observability, to improve system resiliency.
Leverage observability to discover and overcome system weaknesses. Without observability, there is no chaos engineering. Organizations need to focus on building a culture of observing systems. It’s no longer about writing code. It’s all about adopting processes to build resilient systems. Introducing continuous chaos in your DevOps CI/CD pipeline helps automate experiments and failure testing, enabling detection, debugging and fixing issues more proactively. The practice of chaos engineering observability will improve confidence in the system, enable faster deployments, prioritize business KPIs and drive the auto-healing of systems. The use of AI/ML will aid in building observability patterns and antipatterns by close monitoring of system and user behavior over a period of time. The hypothesis developed over these patterns and antipatterns can help auto-heal the systems.
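One way to wire continuous chaos into a DevOps CI/CD pipeline is a gate step that runs the experiments and fails the build when any of them uncovers an issue. This is a hedged sketch; the result format and `chaos_gate` function are illustrative assumptions, not the interface of any real pipeline tool:

```python
def chaos_gate(results):
    """CI/CD gate: return a shell-style exit code -- 0 if every chaos
    experiment held its steady-state hypothesis, 1 if any experiment
    uncovered an issue (which should fail the pipeline stage)."""
    failures = [r for r in results if not r["passed"]]
    for failure in failures:
        print(f"chaos experiment failed: {failure['name']}")
    return 1 if failures else 0
```

A pipeline step would call this with the parsed experiment report and pass the return value to the shell as its exit code, so a broken hypothesis blocks the deployment rather than reaching users.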
The industry has started recognizing the differentiated value proposition of chaos engineering observability, which is helping to address many unknown facets of system unpredictability. Chaos engineering experiments, coupled with cognitive observability of complex systems using trend analysis, regression analysis and time-series analysis, will take systems to new heights in the near future.