Evolve your systems to be fireproof with automated fire drills
Distributed systems are becoming more complex as microservice architectures evolve, and more unpredictable as the velocity of development increases.
A distributed system may have distinct services that work just fine in isolation, but interactions with other system components can create unpredictable behavior — often only exposed and observed in production environments.
So how do we identify these weaknesses before they spread into our production systems and impact the customer experience?
Chaos Engineering is about introducing controlled disruptions into a distributed system, carefully studying the behavior, identifying the weak areas, and improving resiliency with automation.
By carefully designing experiments that introduce controlled disruptions, we can proactively identify and address weaknesses — and break away from the dysfunctional and reactive incident response model.
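As a rough illustration of what such an experiment might look like in code, here is a minimal sketch. The function name `run_experiment` and its three callables are hypothetical and not part of any particular Chaos Engineering tool. The sketch assumes you can express "the system is behaving" as a single health check.

```python
def run_experiment(is_healthy, inject, rollback):
    """Run one controlled disruption and report whether the system coped.

    is_healthy: callable returning True while the system behaves as expected.
    inject:     callable that starts the disruption.
    rollback:   callable that removes the disruption again.
    All three are hypothetical hooks supplied by the caller.
    """
    if not is_healthy():
        # Never add a disruption to a system that is already struggling.
        return "skipped"
    try:
        inject()
        survived = is_healthy()
    finally:
        # Rollback runs even if the injection or the check raises.
        rollback()
    return "resilient" if survived else "weakness found"
```

A result of "weakness found" is the useful outcome here: it points at an area to fix before a real incident exposes it.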
The disruptions can be as simple as killing a process on a Linux server, or as involved as inducing errors for a segment of live traffic serving customers in production.
Other examples of Chaos Engineering experiments that introduce controlled disruptions into a complex distributed system include:
- stopping, rebooting, and terminating virtual machines
- removing network services, routers, and load balancers
- simulating the failure of an entire region
- introducing latency between services, missing messaging topics, random errors, and Docker container crashes
- mimicking the unavailability of third-party APIs, or adding latency to them
- simulating operating system issues with I/O, CPU, and memory spikes
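To make a couple of these concrete, the sketch below wraps a function so that a configurable fraction of calls is delayed and then failed. This is an application-level illustration only. The names `inject_faults`, `InjectedFault`, and `fetch_price` are made up for this example, and real tooling usually injects faults at the network or infrastructure layer rather than inside the application.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure when a drill disrupts a call."""


def inject_faults(rate, latency_s=0.0, rng=None, sleep=time.sleep):
    """Disrupt roughly `rate` (0.0 to 1.0) of all calls to the wrapped function.

    A disrupted call first waits `latency_s` seconds, then raises InjectedFault.
    `rng` and `sleep` are injectable so the behaviour can be controlled in tests.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                if latency_s:
                    sleep(latency_s)
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical service call: about 10% of requests are slowed down and failed.
@inject_faults(rate=0.1, latency_s=0.2)
def fetch_price(sku):
    return {"sku": sku, "price": 9.99}
```

Keeping the rate small limits the impact to a small slice of traffic, which matters when the experiment runs against live customers.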
While Chaos Engineering techniques are still being learned and adopted by smaller organizations, the methods are already institutionalized across large organizations like Amazon and Netflix.
Failure as a Service
To support controlled disruption, the Failure as a Service (FaaS) architecture was proposed in a research study from the University of California, Berkeley.
The goal is analogous to that of fire drills. That is, before experiencing unexpected failure scenarios, a cloud service could perform failure drills from time to time to find out the real-deployment scenarios in which its recovery does not work.
The architecture is based on three important characteristics:
- Failure drills are based on the large-scale injection of disruption
- Failure drills are conducted against online production environments
- Failure drills are available to the organization as an easy-to-use service
The architecture includes the following components:
- The FaaS Controller is a fault-tolerant service and sends fire drill commands to the agents running in VMs.
- The FaaS Agent runs on the same VM as the target service, and receives the drill commands that invoke controlled disruption.
- The Monitoring Service collects data about the target services that is used for the failure-drill scenarios and specifications.
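The relationship between the three components can be sketched as a small in-process model. This is an illustration under simplifying assumptions, not the design from the study. The class names, the `DrillCommand` fields, and the VM and service names are all hypothetical. There is no networking and no fault tolerance in the controller, and the agent only records the disruption instead of performing it.

```python
from dataclasses import dataclass, field


@dataclass
class DrillCommand:
    """A single fire-drill instruction (hypothetical shape)."""
    target: str   # name of the service to disrupt
    action: str   # e.g. "kill_process" or "add_latency"


@dataclass
class Monitor:
    """Stands in for the Monitoring Service: records what happened to each target."""
    events: list = field(default_factory=list)

    def record(self, vm, command, outcome):
        self.events.append((vm, command.target, command.action, outcome))


class Agent:
    """Stands in for the FaaS Agent running on the same VM as the target service."""

    def __init__(self, vm, services, monitor):
        self.vm = vm
        self.services = set(services)
        self.monitor = monitor

    def handle(self, command):
        # Only disrupt services that actually run on this VM.
        if command.target not in self.services:
            return False
        # A real agent would invoke the disruption here; this sketch only logs it.
        self.monitor.record(self.vm, command, "injected")
        return True


class Controller:
    """Stands in for the FaaS Controller: sends drill commands to every agent."""

    def __init__(self, agents):
        self.agents = list(agents)

    def run_drill(self, command):
        # Return the VMs whose agent accepted the command.
        return [agent.vm for agent in self.agents if agent.handle(command)]


monitor = Monitor()
controller = Controller([
    Agent("vm-1", ["checkout"], monitor),
    Agent("vm-2", ["search"], monitor),
])
hit = controller.run_drill(DrillCommand(target="checkout", action="kill_process"))
```

Here only the agent on `vm-1` acts on the command, and the monitor ends up with one recorded event that later drills or reports could build on.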
The architecture provides a basic model for designing a system that introduces common failure modes into large-scale distributed systems.
"I guess it is now time for me to really go get this tattoo pic.twitter.com/QovpYPnIKh" (Werner Vogels, @Werner, March 2, 2017)
By applying the principles of Chaos Engineering to exercise large-scale online failures, your organization can build a more resilient culture, and your systems can become fireproof.
What’s your experience with Chaos Engineering, or the challenges in your organization with adopting the techniques? Drop a comment below or connect with me on Twitter!