Resiliency through failure - Netflix's Approach to Extreme Availability in the Cloud

Grand Ballroom - Salon D

Netflix created a suite of tools, collectively called the Simian Army, to improve resiliency and maintain the cloud environment. Failure modes are typically corner cases that are tested poorly, if at all. It's only by failing often that we can ensure that we are resilient to failure. We look for ways to induce failure in our production environment to better prepare us for the inevitable failures that will occur. The main players in the Simian Army follow.

Chaos Monkey randomly terminates virtual machines to ensure that services are resilient to node failure.
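
As a rough illustration of the idea, and not Netflix's actual implementation, random termination can be sketched in a few lines of Python. The instance IDs, the `terminate_random_instance` helper, and the `min_healthy` threshold are all hypothetical; a real tool would call the cloud provider's API instead of editing an in-memory set.

```python
import random

def terminate_random_instance(instances, rng):
    """Remove and return one randomly chosen instance, simulating a node failure."""
    victim = rng.choice(sorted(instances))  # sort so the choice is reproducible for a given seed
    instances.remove(victim)
    return victim

def service_is_healthy(instances, min_healthy):
    """Hypothetical health rule: the service survives if enough nodes remain."""
    return len(instances) >= min_healthy

rng = random.Random(42)                # fixed seed only to make the sketch repeatable
fleet = {"i-a", "i-b", "i-c"}          # hypothetical instance IDs
victim = terminate_random_instance(fleet, rng)
survived = service_is_healthy(fleet, min_healthy=2)
```

The point of the exercise is the check at the end: if losing any single node makes `survived` false, the service is not yet resilient to node failure.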

Chaos Gorilla is a more powerful version of Chaos Monkey, terminating an entire AWS Availability Zone (data center) to ensure resiliency to a single zone failure.

Latency Monkey induces random network delays and errors to ensure that services are resilient to degradation in their dependencies.
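
One way to picture this kind of fault injection is a wrapper around a dependency call that adds a random delay and occasionally raises an error. The sketch below is an assumption-laden illustration, not Latency Monkey's code: `with_chaos`, `InjectedFailure`, `fetch_with_fallback`, and the parameter names are invented for this example.

```python
import random
import time

class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a failing dependency (hypothetical)."""

def with_chaos(call, rng, max_delay_s=0.01, error_rate=0.2, sleep=time.sleep):
    """Wrap a dependency call so it is randomly delayed and sometimes fails."""
    def wrapped(*args, **kwargs):
        sleep(rng.uniform(0, max_delay_s))      # injected network delay
        if rng.random() < error_rate:           # injected error
            raise InjectedFailure("injected dependency error")
        return call(*args, **kwargs)
    return wrapped

def fetch_with_fallback(fetch, fallback):
    """A resilient caller degrades to a default instead of propagating the failure."""
    try:
        return fetch()
    except InjectedFailure:
        return fallback

# error_rate=1.0 forces the failure path; sleep is stubbed out to keep the demo fast
flaky = with_chaos(lambda: "personalized", random.Random(0),
                   error_rate=1.0, sleep=lambda s: None)
result = fetch_with_fallback(flaky, "default")
```

A caller that handles the injected failure with a fallback, as `fetch_with_fallback` does, is the behavior this kind of testing is meant to verify.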

Janitor Monkey is the cloud cleaning crew. It prevents clutter by cleaning up old and unused resources.
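
A cleanup pass of this kind can be thought of as a rule applied to a resource inventory. The following is a minimal sketch under stated assumptions: the record fields (`id`, `in_use`, `last_used`), the `find_stale` helper, and the 30-day threshold are hypothetical and are not taken from Janitor Monkey itself.

```python
from datetime import datetime, timedelta

def find_stale(resources, now, max_age):
    """Return IDs of resources that are unused and older than max_age."""
    return [
        r["id"]
        for r in resources
        if not r["in_use"] and now - r["last_used"] > max_age
    ]

now = datetime(2013, 6, 1)
inventory = [  # hypothetical resource records
    {"id": "vol-1", "in_use": False, "last_used": datetime(2013, 1, 1)},
    {"id": "vol-2", "in_use": True,  "last_used": datetime(2013, 1, 1)},
    {"id": "vol-3", "in_use": False, "last_used": datetime(2013, 5, 25)},
]
stale = find_stale(inventory, now, max_age=timedelta(days=30))
```

Here only `vol-1` qualifies: `vol-2` is still in use and `vol-3` was used too recently.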

Chaos Monkey and Janitor Monkey have been open sourced in the past year and are freely available to the public.

Reference for further details:

Ariel Tseitlin manages the Netflix Cloud and is interested in all things cloudy. At Netflix, he is Director of Cloud Solutions, helping Netflix be successful in the Cloud, including cloud tooling, monitoring, performance and scalability, and cloud operations and reliability engineering. Ariel's team builds Asgard and the Simian Army, including the Chaos Monkey. Prior to Netflix, Ariel was VP of Technology and Products at Sungevity and before that was the Founder & CEO of CTOWorks.