Control failure with chaos engineering

A Cloud Guru News

Evolve your systems to be fireproof with automated fire drills

“Everything fails all the time. We lose whole data centers! Those things happen.”

Werner Vogels

Distributed systems are becoming more complex as microservice architectures evolve, and more unpredictable as the velocity of development increases.

A distributed system may have distinct services that work just fine in isolation, but interactions with other system components can create unpredictable behavior — often only exposed and observed in production environments.

So how do we identify these weaknesses before they spread into our production systems and impact the customer experience?

Controlled disruption

Chaos Engineering is about introducing controlled disruptions into a distributed system, carefully studying the behavior, identifying the weak areas, and improving resiliency with automation.

By carefully designing experiments that introduce controlled disruptions, we can proactively identify and address weaknesses — and break away from the dysfunctional and reactive incident response model.

The disruptions can be as simple as killing a process on a Linux server, or as involved as inducing errors for a segment of live traffic serving customers in production.
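
As a rough illustration of the second kind of disruption, the sketch below fails a fixed percentage of requests. It is a minimal sketch under stated assumptions, not a production fault injector: the names `InjectedFault`, `in_experiment`, and `handle_request` are hypothetical, and hashing the request ID is just one possible way to choose the segment.

```python
import hashlib


class InjectedFault(Exception):
    """Raised instead of the real response for requests in the experiment."""


def in_experiment(request_id: str, percent: float) -> bool:
    """Deterministically decide whether a request belongs to the segment.

    Hashing the ID (rather than rolling a random number) means the same
    request always gets the same decision, so results can be reproduced.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5.0 selects buckets 0..499


def handle_request(request_id: str, percent: float = 5.0) -> str:
    """Hypothetical request handler with the fault injected up front."""
    if in_experiment(request_id, percent):
        raise InjectedFault(f"injected failure for {request_id}")
    return f"ok:{request_id}"
```

Setting `percent` to 0 turns the experiment off entirely, which gives the experiment a simple stop switch if the disruption starts to hurt customers.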

Other examples of Chaos Engineering experiments that introduce controlled disruptions into a complex distributed system include:

  • stopping, rebooting, and terminating virtual machines
  • removing network services, routers, and load balancers
  • simulating the failure of an entire region
  • introducing latency between services, removing messaging topics, injecting random errors, and crashing Docker containers
  • mimicking the unavailability of third-party APIs or adding latency to their responses
  • simulating operating system issues with I/O, CPU, and memory spikes
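
Several of these experiments can be approximated in application code before reaching for infrastructure-level tooling. The sketch below wraps a call to a dependency so that it gains extra latency and fails at a configurable rate. The decorator name `chaos`, its parameters, and the `fetch_exchange_rate` function are all hypothetical names chosen for illustration.

```python
import random
import time
from functools import wraps


def chaos(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and may fail.

    latency_s  -- seconds of delay added before every call
    error_rate -- probability in [0, 1] that a call raises ConnectionError
    rng, sleep -- injectable so that experiments and tests stay controllable
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)
            if rng.random() < error_rate:
                raise ConnectionError("chaos: simulated dependency outage")
            return func(*args, **kwargs)
        return wrapper

    return decorator


def fetch_exchange_rate(currency):
    """Hypothetical stand-in for a call to a third-party API."""
    return {"currency": currency, "rate": 1.0}


# A degraded version of the dependency: 200 ms slower, failing about 10% of calls.
degraded_fetch = chaos(latency_s=0.2, error_rate=0.1)(fetch_exchange_rate)
```

Swapping `degraded_fetch` in for the real call during an experiment shows whether callers time out, retry, or fall back as intended.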

While Chaos Engineering techniques are still being learned and adopted by smaller organizations, the methods are already institutionalized across large organizations like Amazon and Netflix.

Failure as a Service

To support controlled disruption, the Failure as a Service (FaaS) architecture was proposed in a research study from the University of California, Berkeley.

The goal is analogous to that of fire drills. That is, before experiencing unexpected failure scenarios, a cloud service could perform failure drills from time to time to find out the real-deployment scenarios in which its recovery does not work.
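
The idea of a failure drill can be sketched in a few lines: inject a failure, let the recovery logic run, and record the scenarios in which the system does not return to health. Everything below is a toy model under stated assumptions. The function names and the two-node "system" dictionary are hypothetical and are not taken from the research study.

```python
from typing import Callable, Dict, List


def run_failure_drills(
    scenarios: Dict[str, Callable[[dict], None]],
    recover: Callable[[dict], None],
    is_healthy: Callable[[dict], bool],
    make_system: Callable[[], dict],
) -> List[str]:
    """Return the names of scenarios in which recovery did not restore health."""
    unrecovered = []
    for name, inject in scenarios.items():
        system = make_system()  # fresh state so drills do not affect each other
        inject(system)          # the controlled disruption
        recover(system)         # the recovery path under test
        if not is_healthy(system):
            unrecovered.append(name)
    return unrecovered


# Toy system: a primary node with one replica that can be promoted.
def make_system() -> dict:
    return {"primary": True, "replica": True}


def recover(system: dict) -> None:
    # Recovery promotes the replica when only the primary is lost.
    if not system["primary"] and system["replica"]:
        system["primary"] = True
        system["replica"] = False


def is_healthy(system: dict) -> bool:
    return system["primary"]


def lose_primary(system: dict) -> None:
    system["primary"] = False


def lose_both(system: dict) -> None:
    system["primary"] = False
    system["replica"] = False


SCENARIOS = {"lose_primary": lose_primary, "lose_both": lose_both}
```

Calling `run_failure_drills(SCENARIOS, recover, is_healthy, make_system)` reports `lose_both` as the scenario the recovery logic cannot handle, which is exactly the kind of finding a drill is meant to surface before a real outage does.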

The architecture is based on three important characteristics:

  1. Failure drills are based on the large-scale injection of disruption
  2. Failure drills are conducted against online production environments
  3. Failure drills are available to the organization as an easy-to-use service

The architecture includes the following components:

  • The FaaS Controller is a fault-tolerant service that sends fire drill commands to the agents running in VMs.
  • The FaaS Agent runs on the same VM as the target service, and receives the drill commands that invoke controlled disruption.
  • The Monitoring Service collects data about the target services, which is used for the failure-drill scenarios and specifications.
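
To make the relationship between the three components concrete, here is a small in-process model. It is a sketch only: the class and method names are hypothetical, a single `kill_process` action stands in for real disruption, and having the controller feed observations to the monitoring service is an assumption made for brevity rather than something the architecture prescribes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DrillCommand:
    target: str  # name of the service to disrupt
    action: str  # only "kill_process" is modeled in this sketch


@dataclass
class MonitoringService:
    """Collects observations about target services."""
    observations: List[Tuple[str, str, bool]] = field(default_factory=list)

    def record(self, vm: str, service: str, running: bool) -> None:
        self.observations.append((vm, service, running))

    def down(self) -> List[Tuple[str, str]]:
        return [(vm, svc) for vm, svc, running in self.observations if not running]


@dataclass
class Agent:
    """Runs on the same VM as the target service and applies drill commands."""
    vm: str
    services: Dict[str, bool]  # service name -> currently running?

    def execute(self, cmd: DrillCommand) -> bool:
        if cmd.target not in self.services or cmd.action != "kill_process":
            return False  # nothing to disrupt on this VM
        self.services[cmd.target] = False
        return True


class Controller:
    """Sends drill commands to every agent and triggers monitoring."""

    def __init__(self, agents: List[Agent], monitor: MonitoringService) -> None:
        self.agents = agents
        self.monitor = monitor

    def run_drill(self, cmd: DrillCommand) -> List[str]:
        disrupted = []
        for agent in self.agents:
            if agent.execute(cmd):
                disrupted.append(agent.vm)
            # Observe every service on the VM after the command was handled.
            for service, running in agent.services.items():
                self.monitor.record(agent.vm, service, running)
        return disrupted
```

In a real deployment the controller and agents would talk over the network and the controller itself would need to be replicated to be fault tolerant. The sketch leaves both out to keep the division of responsibilities visible.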

The architecture provides a basic model for designing a system that introduces common failure modes into large-scale distributed systems.

By applying the principles of Chaos Engineering to exercise large-scale online failures, your organization can evolve more resilient, fireproof systems and a culture to match.

What’s your experience with Chaos Engineering, or with the challenges of adopting its techniques in your organization? Drop a comment below or connect with me on Twitter!
