Track: Chaos, Complexity, and Resilience

Location: Soho Complex, 7th fl.

Day of week: Friday

Reliability is created by people. The engineers who write the functionality, those who operate and maintain the system, and even the management that allocates resources toward it are all part of a complex system. We each have a role in creating that reliability, bringing best practices and focused attention to bear on this property of the system. Tools can help. Chaos Engineering and Resilience Engineering are tools that we can use to create reliability. As practitioners in this industry, our success relies not on removing the complexity from our systems, but on learning to live with it, navigate it, and optimize for other business-critical properties despite the underlying complexity.

Track Host: Casey Rosenthal

CTO @backplaneio

Wrote the book on Chaos Engineering; wrote the definition/manifesto; runs Chaos Community Day; managed the Chaos Team at Netflix for 3 years


10:35am - 11:25am

Properties of Chaos

Chaos Engineering makes up an essential component of our validation methods used in developing resilient, safety-critical autonomous vehicle software systems.

There's an adage in some functional-safety circles that goes something like, "The risk and danger live in the interfaces." Among other things, this is a succinct way of stating that things most commonly break down at the integration points. In a traditional safety-critical development process, this focus on danger at the interface level is partially born out of the assumption that the rigors of formalized safety-critical development processes (e.g. ISO 26262, IEC 61508) will have squeezed out serious issues in the design and in the various components that make up a system.

As it turns out, it is true that there's enormous opportunity for failure at the integration points between different systems and components, but it's also true that even some of the most rigorous SDLC processes available today leave room for unintended, undefined, or undesirable emergent behaviors elsewhere in the implementation. This problem is exacerbated significantly by the scale and complexity of the systems that are required to facilitate and operate an autonomous vehicle.

By automatically exploring the input space of chaos in a given system, we try to build stronger inductive proofs of our system's resilience semantics, so that we can augment the deductive proofs of correctness we derive from the use of formal methods in other facets of our solutions. The ultimate goal is to make assurances about safety properties and resiliency behaviors that would otherwise be impossible without the use of Chaos Engineering.

Nathan Aschbacher, Chief Technology Officer @PolySync

11:50am - 12:40pm

Heretical Resilience: To Repair is Human

Resilient architecture is often thought of solely in terms of its technical aspects - with the right distributed system or automated failover or fancy new orchestration software, we want to believe we can avoid the inevitability of failure. While it is certainly true that we can design our systems to be more robust, true resilience comes from humans. The humans in complex systems, and especially the human-computer interactions and interfaces, are what can really make or break the true resiliency of these systems. This human-centric approach requires a different mindset than a solely infrastructure-focused one, but it is no less rigorous, and it encourages changes that are equally, if not more, important.

In this talk, I will recount the “Apache SNAFU” described in the SNAFU Catchers’ Stella Report, sharing my experiences as the instigator of that snafu and walking through the lessons that can be learned from such an event. Takeaways will include ideas for how to design tools, processes, and systems in ways that maximize the resilience and responsiveness of humans throughout engineering organizations.

Ryn Daniels, Staff Infrastructure Engineer @travisci

1:40pm - 2:30pm

Using Chaos To Build Resilient Systems

There are those of us who are motivated to build resilient systems, improve uptime, move fast, and keep systems reliable. Then there are those of us who feel overwhelmed by our to-do lists and the features or projects we feel we need to get out the door.

The world needs more resilient systems because the world needs engineers in this for the long haul. We can create a better future for ourselves, those who come after us, our customers and our wider teams by focusing on building resilient systems. How do we make it easier for everyone to build resilient systems? 

It is not easy to build resilient systems, but that doesn’t mean we shouldn’t try. Engineers love a technical challenge. In this talk I will explain how focusing on the detection, mitigation, resolution and prevention of incidents is a great place to start. I will share my experiences using chaos engineering to build resilient systems... even when you can’t build your systems from scratch.

Tammy Butow, Principal Site Reliability Engineer @Gremlin

2:55pm - 3:45pm

UNBREAKABLE: Learning to Bend but Not Break at Netflix

How do you gain confidence that a system is behaving as designed and identify vulnerabilities before they become outages? You may have thought about using chaos engineering for this purpose, but it’s not always clear what that means or if it’s a good fit for your system and team.

My experience at Netflix has led me to embrace chaos engineering to build more resilient distributed systems. I will share examples of chaos experiments which identified problems and built confidence in our resilience mechanisms, as well as several challenges, lessons, and benefits we have encountered while scaling chaos engineering across Netflix.

Key Takeaways

  1. How chaos experiments complement other types of testing.
  2. How your perspective of chaos engineering changes as your role and services evolve.
  3. How chaos engineering can be used to gain confidence in platforms, configurations, and resiliency mechanisms.
  4. Lessons from automating chaos experiments safely and effectively.

Haley Tucker, Senior Software Engineer, Chaos Engineering @Netflix

4:10pm - 5:00pm

Have You Tried Turning It Off and On Again?

Would you jump on this train of thought for a moment and see if you agree? Let’s say you have some number of computers. It could be three, it could be kerjillions; the number probably doesn’t matter too much for this thought experiment. Now let’s say you have a number of people, probably closer to three than kerjillions, but find a number that works for you. And these people are tasked with making those computers function together in a resilient fashion in the real world. Can we agree that how the people operate the computers in production can have a significant impact on the resilience of the system? Almost obvious, no?

But much less obvious are the deeper questions like: what are the characteristics of an operations practice that actively influence a system towards greater resiliency? Which practices (let’s call them “operations theatre”) pretend to assist us in this goal but really work against us? In this talk, not only will we uncover the answers, but we’ll use concrete examples from the breadth of the Site Reliability Engineering discipline to illustrate just how they work.

David Blank-Edelman, Senior Cloud Ops Advocate @Microsoft