You are viewing content from a past/completed QCon

Presentation: Using Chaos To Build Resilient Systems

Track: Chaos, Complexity, and Resilience

Location: Soho Complex, 7th fl.

Duration: 1:40pm - 2:30pm

Day of week: Friday

Level: Intermediate

Persona: Architect, CTO/CIO/Leadership, Developer

This presentation is now available to view on InfoQ.com

What You’ll Learn

  • Learn how chaos engineering can help an organization build more resilient systems.
  • Understand strategies for getting a chaos engineering program started and which first steps are appropriate.
  • Hear first-hand accounts from a senior principal SRE of how chaos engineering has affected her systems.

Abstract

There are those of us who are motivated to build resilient systems, improve uptime, move fast and keep systems reliable. Then there are those of us who feel overwhelmed by our to-do lists and by the features or projects we feel we need to get out the door.

The world needs more resilient systems because the world needs engineers in this for the long haul. We can create a better future for ourselves, those who come after us, our customers and our wider teams by focusing on building resilient systems. How do we make it easier for everyone to build resilient systems? 

It is not easy to build resilient systems, but that doesn’t mean we shouldn’t try. Engineers love a technical challenge. In this talk I will explain how focusing on the detection, mitigation, resolution and prevention of incidents is a great place to start. I will share my experiences using chaos engineering to build resilient systems... even when you can’t build your systems from scratch.

Question: 

What do you want someone to leave your talk with?

Answer: 

Everyone who comes along to this talk will leave with an understanding of how they can start seeing massive benefits within 3 months of practicing Chaos Engineering. To me, Chaos Engineering is the fastest, most efficient way to take a giant leap forward in the resilience of your systems and team.

Question: 

Can you give me an example of a time Chaos Engineering really saved you?

Answer: 

Through practicing Chaos Engineering I have personally achieved a 10x reduction in incidents and the complete elimination of high severity (SEV 0) incidents for 12+ months. This giant leap was achieved within a 3-month window. That means less downtime and less pager pain for everyone.

Question: 

What is the level of experience someone attending this talk should have?

Answer: 

To get the most value from this talk, you should ideally have been on call and felt the pain of keeping the lights on.

Speaker: Tammy Butow

Principal Site Reliability Engineer @Gremlin

Tammy Butow is a Principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using its control plane and API. Tammy previously led SRE teams at Dropbox responsible for Databases and Storage systems used by over 500 million customers. Prior to this, Tammy worked at DigitalOcean and at one of Australia's largest banks in Security Engineering, Product Engineering and Infrastructure Engineering.
