Track: Chaos & Resilience

Location: Majestic Complex, 6th fl

Day of week: Tuesday

Failure isn’t a question of if, but when. Embracing a habit of introducing chaos on a regular basis strengthens systems. In this track we’ll hear from experts who have designed systems that became more resilient and reliable over time. Attendees will learn which architectural patterns and approaches did and didn’t work, with takeaways that can be applied to their own systems. They will also hear how chaos engineering, disaster recovery testing, and other tools are being used to create incredibly resilient systems.

Track Host:
Tammy Butow
SRE Manager @Dropbox

Tammy Butow is an Australian who relocated to the USA in 2014. She lives in San Francisco and is a Site Reliability Engineering Manager at Dropbox. Tammy leads the Databases & Magic Pocket SRE teams. She enjoys working on large-scale infrastructure systems, chaos engineering, resiliency, automation, durability engineering, Go, and Linux. Tammy previously worked in Security Engineering and Product Engineering. She likes to ride bikes, skateboard, and snowboard. Tammy is the Co-Founder of Girl Geek Academy, a global movement to teach 1 million women technical skills by 2025.

Trackhost Interview

QCon: What is the Chaos & Resilience track about?

Tammy: The main goal of the Chaos & Resilience track is to share with everyone who's coming along the idea that it's not really a question of if failure will happen but when, and to present things that we can embrace to strengthen our systems so that when they do fail it doesn't impact customers, and it doesn't impact all of the services that we have at our companies. The goal is to help everybody build more resilient systems by using different techniques, and we're going to share many techniques throughout the entire day from a number of companies: Netflix, Dropbox, Betterment, and Comcast.

And we also have Bruce Wong, who was the person who founded the concept of Chaos Engineering at Netflix. He's now doing R&D at Twilio, and to close out the day he's going to talk about the current state of Chaos Engineering and where it is heading in the next few years. It's going to be really nice to hear some practical things that I can do right now for my company, and some of the things that I should be thinking about doing in the next two, three, four, five years.

I want people to leave with the idea that even if they have excuses in their minds, like it's not going to be enough anyway, or I don't have enough people to do it, or I'm not sure if this is a top priority, these are techniques that they can use to get more time to build better tools, better features for customers, and a better product, to make customers happy and get new customers. This will speed up shipping velocity. The techniques that we're going to share are not complicated. They won't take a lot of time; you set them up once and leave them running for years. You don't need to be continuously changing them.

10:35am - 11:25am

by Leonid Movsesyan
Engineering Manager @Dropbox

In the modern world, tech companies build their products on extremely reliable servers that never break. They’re stacked in racks with highly reliable switches whose firmware is rock solid and guaranteed to have no bugs. These switches talk to each other over super low latency networks that have close to zero packet loss rates. And this whole thing is located in a building with an infinite and redundant power supply. Just kidding, it’ll all break.

Companies can buy the most...

11:50am - 12:40pm

by Bruce Wong
R&D Leadership @Twilio

“I don’t always test my resilience, but when I do, it’s at 3 a.m.”

“I don’t always test my resilience, but when I do, it’s in Prod.”

“I don’t always test my resilience, but when I do, it’s an outage!”

These were the days… the days before Chaos Engineering. More and more practitioners are on their way to discovering the benefits of Chaos Engineering. What started as an odd, bold, and even scary practice has been embraced by many in the pursuit of more nines. This talk...

1:40pm - 2:30pm

by John Mileham
VP Architecture @Betterment

Resilience in the face of chaos is a tall order. As a vertically integrated financial institution where rapidly delivered features with complete data consistency and scrupulous correctness are all non-negotiable, Betterment had its work cut out for it. So we moved the goalposts - inward. By eliminating complexity that many teams consider table stakes, we’ve built a distributed software ecosystem that empowers engineers to do their best work with a minimum of high-wire distributed systems...

2:55pm - 3:45pm

by Jearvon Dharrie
Senior Software Engineer @Comcast

When talking about resiliency and Elixir, the Open Telecom Platform (OTP) is usually the main topic discussed. In this talk we will discuss other factors that make Elixir a strong match for fault tolerance and resiliency. Topics include ease of deployment, operations and monitoring, typespecs, and the BEAM's forgiving nature.

4:10pm - 5:00pm

Open Space

5:25pm - 6:15pm

by Nora Jones
Senior Chaos Engineer @Netflix

Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly; however, all too often the road to cultural acceptance does not match our expectations as SREs, Chaos Engineers, and Productivity Engineers.

Choose Your Own Adventure is a series of children's gamebooks where each story is written from a...


Monday, 26 June

Tuesday, 27 June

Wednesday, 28 June