Chaos, Complexity, and Resilience

Reliability is created by people. The engineers who write the functionality, those who operate and maintain the system, and even the management that allocates resources toward it are all part of a complex system. We each have a role in creating that reliability, bringing best practices and focused attention to bear on this property of the system. Tools can help. Chaos Engineering and Resilience Engineering are tools that we can use to create reliability. As practitioners in this industry, our success relies not on removing the complexity from our systems, but on learning to live with it, navigate it, and optimize for other business-critical properties despite the underlying complexity.

Track Host:
Casey Rosenthal
CTO at Backplane

Wrote the book on Chaos Engineering; wrote the definition/manifesto; runs Chaos Community Day; managed the Chaos Team at Netflix for 3 years

10:35am - 11:25am

by Nathan Aschbacher
Chief Technology Officer @PolySync

Chaos Engineering makes up an essential component of our validation methods used in developing resilient, safety-critical autonomous vehicle software systems.
There's an adage in some functional-safety circles that goes something like, "The risk and danger live in the interfaces." This is, among other things, a succinct way of stating that it's in the integration points where things most commonly...

11:50am - 12:40pm

by Tanya Reilly
Principal Engineer at @squarespace

When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug, we usually have a contingency plan. We reduce damage, redirect traffic, page someone, drop low-priority requests, follow documented procedures. But why do many failures still come as a surprise? In this talk, we'll look at how fire safety in buildings parallels how we prevent and manage software failures. Fire partitions. Public safety campaigns. Smoke alarms. Sprinkler systems. Doors that say “This is...

1:40pm - 2:30pm

by Haley Tucker
Senior Software Engineer, Chaos Engineering @Netflix

How do you gain confidence that a system is behaving as designed and identify vulnerabilities before they become outages? You may have thought about using chaos engineering for this purpose, but it’s not always clear what that means or if it’s a good fit for your system and team.
My experience at Netflix has led me to embrace chaos engineering to build more resilient distributed systems. I will share...

2:55pm - 3:45pm

by Tammy Butow
Principal Site Reliability Engineer @Gremlin

by Ana Medina
Software Engineer @Gremlin

There are those of us who are motivated to build resilient systems, improve uptime, move fast, and keep systems reliable. Then there are those of us who feel overwhelmed by our to-do lists and the features or projects we feel we need to get out the door.

The world needs more resilient systems because the world needs engineers in this for the long haul. We can create a better future for ourselves, those who come after us, our customers and our wider teams by focusing on building...

4:10pm - 5:00pm

