Resiliency through failure - Netflix's Approach to Extreme Availability in the Cloud
Netflix created a suite of tools, collectively called the Simian Army, to improve resiliency and maintain the cloud environment. Failure modes are typically corner cases that are tested poorly, if at all. It's only by failing often that we can ensure that we are resilient to failure. We look for ways to induce failure in our production environment so that we are better prepared for the failures that will inevitably occur. The main players in the Simian Army follow.
Chaos Monkey randomly terminates virtual machines to ensure that services are resilient to node failure.
Chaos Gorilla is a more powerful version of Chaos Monkey, terminating an entire AWS Availability Zone (data center) to ensure resiliency to a single zone failure.
Latency Monkey induces random network delays and errors to ensure that services are resilient to degradation in their dependencies.
Janitor Monkey is the cloud cleaning crew. It prevents clutter by cleaning up old and unused resources.
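The idea behind Chaos Monkey and Chaos Gorilla can be illustrated with a toy, in-memory simulation. This is only a sketch under stated assumptions: it does not use the actual Simian Army code or any AWS API, and the names Instance, chaos_monkey, chaos_gorilla, service_available, and build_fleet are hypothetical, chosen for illustration. It models a small fleet spread over two zones and checks that the service survives both a random instance termination and the loss of a whole zone.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a virtual machine in one availability zone."""
    instance_id: str
    zone: str
    running: bool = True


def chaos_monkey(instances, rng):
    """Terminate one randomly chosen running instance; return it, or None."""
    candidates = [i for i in instances if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def chaos_gorilla(instances, zone):
    """Terminate every running instance in the given zone; return the victims."""
    victims = [i for i in instances if i.running and i.zone == zone]
    for victim in victims:
        victim.running = False
    return victims


def service_available(instances):
    """Assume the service is up as long as at least one instance is running."""
    return any(i.running for i in instances)


def build_fleet():
    """Build a four-instance fleet split evenly across two assumed zones."""
    zones = ["zone-a", "zone-a", "zone-b", "zone-b"]
    return [Instance(f"i-{n}", zone) for n, zone in enumerate(zones)]


if __name__ == "__main__":
    fleet = build_fleet()
    rng = random.Random()
    victim = chaos_monkey(fleet, rng)
    print("Chaos Monkey terminated:", victim.instance_id)
    print("Service available:", service_available(fleet))
    lost = chaos_gorilla(fleet, "zone-a")
    print("Chaos Gorilla terminated:", [i.instance_id for i in lost])
    print("Service available:", service_available(fleet))
```

Because the fleet is spread across two zones, the availability check still passes after an entire zone is removed, which is the property that inducing such failures regularly is meant to verify.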
Chaos Monkey and Janitor Monkey have been open sourced in the past year and are freely available to the public.
Reference for further details: