By Mike Hicks, Principal Solutions Analyst, Cisco ThousandEyes
In today’s Internet-enabled environment, no application update or feature is introduced in isolation. Modern applications rely on a complex ecosystem of plug-ins and API dependencies, each of which has its own dependencies on other services in order to function, including the Internet, which is itself unpredictable. If any one element in the end-to-end application delivery chain breaks, everything upstream and downstream of it is affected.
A team may be responsible for only one part of a software application, but it needs to be aware of, and have immediate visibility into, how its actions affect the rest of the code packages that make up that application. It also needs to understand how the underlying Internet and network infrastructure are affecting the app’s performance.
Disruptions, incidents, and outages happen all the time and to everyone, no matter the size of your team or the resources in your tech stack. Let’s look at three examples from 2022 where interdependencies created unintended DevOps incidents.
Yes, to automation! But beware of unintended consequences
Automation is often positioned as a game-changer: an architectural future state capable of reducing human error, saving costs, and mitigating an untold number of potential outage scenarios.
To reduce downtime, for example, organizations construct as many failure scenarios as possible to break and test the resilience of their infrastructure under unknown operating conditions, and they inevitably employ a high degree of automation for testing and results collection. This is all fine, and the power of automating both processes and technologies cannot be overstated. If the automated responses don’t work as expected, though, they risk introducing more complexity and more challenges when troubleshooting incidents.
In early September, Microsoft experienced an issue that affected connections and services leveraging Azure Front Door (AFD), an Application Delivery Network (ADN) as-a-service that provides global load-balancing capabilities for Azure-based services.
AFD is designed to automatically load balance traffic between global edge sites. If any edge site fails or becomes overloaded, AFD moves the traffic to an alternate edge site to reduce the impact on end users as much as possible.
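That failover behavior can be sketched in a few lines. This is an illustrative simplification, not Azure Front Door’s actual algorithm: a policy that routes traffic away from unhealthy or overloaded edge sites to the lowest-latency healthy alternative. All site names and thresholds here are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class EdgeSite:
    name: str
    healthy: bool
    load: float        # current utilization, 0.0-1.0
    latency_ms: float  # latency from the requesting client

def pick_edge_site(sites, max_load=0.9):
    """Prefer the lowest-latency site that is healthy and under capacity."""
    candidates = [s for s in sites if s.healthy and s.load < max_load]
    if not candidates:
        raise RuntimeError("no healthy edge site available")
    return min(candidates, key=lambda s: s.latency_ms)

sites = [
    EdgeSite("us-east", healthy=True, load=0.95, latency_ms=20),   # overloaded
    EdgeSite("eu-west", healthy=False, load=0.40, latency_ms=90),  # offline
    EdgeSite("ap-south", healthy=True, load=0.55, latency_ms=140),
]
print(pick_edge_site(sites).name)  # traffic fails over to "ap-south"
```

The fragility the incident exposed lives in exactly this kind of logic: when the candidate list empties out faster than expected, the “automatic” response itself becomes the failure mode.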
In this case, however, the problems kicked in when the AFD service tried to balance out “an unusual spike in traffic,” only to wind up causing “multiple environments managing this traffic to go offline.”
The road to performance is paved with good intentions
As we strive to streamline processes and enhance the overall digital experience, there is a risk of making siloed “tuning” changes to the code or application without considering the overall service delivery chain, including its associated dependencies.
In late August, Microsoft 365 desktop users were unable to authenticate and sign into services due to a plug-in issue (the web and mobile versions of Microsoft 365 were spared). The authentication issues impacted a large number of users. Japan, in particular, was hard-hit, while others in the region also experienced some impact.
The root cause appeared to have had nothing to do with Microsoft, and everything to do with a third-party security plug-in from Tenable. According to Microsoft, a vulnerability scan apparently uninstalled the desktop authentication client used to connect to Microsoft 365 services. Microsoft and Tenable eventually patched the plug-in, but customers had already felt the impact.
Fortunately, the impact was limited to a specific enterprise configuration: desktop use of both Tenable and Microsoft 365. In addition, an article in Tenable’s online community advised that customers “with scanners on custom feed schedules or in ‘air-gapped’ environments were most at-risk.”
When things go bump in the night
In the age of DevOps and chaos engineering, “moving fast” means that not everything pushed to production will work perfectly all the time. One way to manage this risk is to schedule pushes outside of business hours. This is, of course, a sound idea, but only if the chosen window is genuinely out-of-hours for every user who could be impacted.
In early August, Google experienced an outage that affected the availability of several services, including Google Search, Google Maps, and associated services that use them, such as Gmail. The official explanation for the issue was a software update that went wrong.
Customer-facing impacts were first observed at around 9:15 PM EDT, as users were unable to access the service, although the application remained reachable from a network perspective. The disruption lasted a total of 41 minutes and cleared by around 10:10 PM EDT. The timing meant that, while the U.S. East Coast may have experienced some disruption, APAC users were likely hit far harder: for them, the outage fell right in the middle of typical business hours.
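For a globally used service, “outside business hours” has to hold in every region served, not just the deployer’s. A minimal sketch of that check, using Python’s standard `zoneinfo` module; the region list and the 09:00–18:00 working day are illustrative assumptions:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Illustrative region list; a real service would derive this from
# where its users actually are.
REGIONS = ["America/New_York", "Europe/London", "Asia/Tokyo"]

def safe_to_deploy(when_utc, start_hour=9, end_hour=18):
    """True only if `when_utc` falls outside business hours in ALL regions."""
    for tz in REGIONS:
        local = when_utc.astimezone(ZoneInfo(tz))
        if start_hour <= local.hour < end_hour:
            return False  # someone's working day would be disrupted
    return True

# 01:15 UTC: late evening in New York, night in London, but
# mid-morning in Tokyo, so this window is not safe.
print(safe_to_deploy(datetime(2022, 8, 9, 1, 15, tzinfo=timezone.utc)))
```

With enough regions in the list, the safe window can shrink to almost nothing, which is exactly why timing alone can’t substitute for fast detection and rollback.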
The moral of the story is to serve your DevOps (with a side of visibility)
While incidents are inevitable, cross-ecosystem visibility across the entire service delivery chain is critical to reducing their impact: it lets teams rapidly identify issues, see them relative to the end users’ experience, and pinpoint the responsible domain.
This may be easier said than done in large organizations, where it can be challenging to map out the full range of consequences that could flow from a single action. Even the best quality assurance engineers are limited by the visibility of the scripts, tools, and monitors they use to determine whether the environment is healthy.
Within large and complex environments, many horror stories can be avoided by giving the team that pushes a software update or new feature to production an immediate feedback mechanism: a way to monitor for, and quickly identify, cases where new code has unexpected impacts on overall service delivery.
Developers or teams may be accountable for only one part of an application or service, but they are still responsible for ensuring their actions don’t break or degrade overall delivery. Coding mistakes can always slip through. Complete end-to-end visibility into application and service delivery, including the network and Internet layers, provides an extra layer of assurance for fast-moving dev teams: when something unexpectedly breaks, it can be recognized and rolled back before it significantly impacts the user’s digital experience.
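The shape of that feedback loop can be sketched simply: after a release, poll a health signal for a while and trigger a rollback the moment it degrades. The function below is a hedged illustration only; the health probe, the 5% error-rate threshold, and the rollback hook are assumptions, not any specific product’s API.

```python
import time

def post_deploy_watch(check_health, rollback, checks=5, threshold=0.05,
                      interval_s=0.0):
    """Run `checks` health probes after a deploy; roll back if the observed
    error rate ever exceeds `threshold`. Returns True if the release is kept."""
    for _ in range(checks):
        error_rate = check_health()  # e.g. the 5xx ratio from a metrics backend
        if error_rate > threshold:
            rollback()
            return False
        time.sleep(interval_s)
    return True

# Simulated probes: the third sample spikes past the 5% threshold,
# so the watch rolls the release back and reports failure.
samples = iter([0.01, 0.02, 0.12])
kept = post_deploy_watch(lambda: next(samples),
                         lambda: print("rolling back"),
                         checks=3)
print(kept)  # False
```

The key property is that the team shipping the change owns the loop: the same people who pushed the code see the signal and can act on it immediately.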
Learn how to engineer DevOps solutions by getting DevOps Certified: devopsinstitute.com/certifications