Track: Architecting for Failure



Complex systems fail in spectacular ways. Failure isn’t a question of if, but when. Resilient systems recover from failure; robust systems resist failure. In this track we’ll hear from experts who have designed systems that shifted from fragility to resilience and robustness in the face of failure. Attendees will learn which architectural patterns and approaches worked and which didn’t, with takeaways they can apply to their own systems.

10:35am - 11:25am

by Jon Moore
Senior Fellow, Comcast Cable

Comcast’s TV products serve tens of millions of customers and are powered by a suite of dozens of services that are continuously developed and operated by hundreds of technical staff. While we have enjoyed many of the touted benefits of a microservice architecture--looser coupling between teams, independent deployments--we have also encountered the corresponding reliability challenges. Delivering business value in this environment can seem like hacking your way through the wilderness at...

11:50am - 12:40pm

by Nori Heikkinen
Google Site Reliability Engineering Expert

Failure is a fact of life, so we design our system to be fault-tolerant at all levels. In practice, however, some components almost never fail. As the product grows, these components are increasingly stressed in new and different ways; when they ultimately do fail they create outages for which we are unprepared. We thought we were designing for failure, but the design didn't include failures at this level. At Google, some of our most exciting production snafus involve large and unpredictable...

1:40pm - 2:30pm

by Kolton Andrus
Chaos Engineer at Netflix

Netflix’s 57M members watch over 2 billion hours of content per month, and their streaming accounts for one third of Internet traffic in some parts of the world. The Edge platform, which thousands of devices rely on to access the streaming experience, guards the front door to Netflix, where any major issue results in a Twitter storm.

In order to harden our systems, we designed “Failure as a Service” to allow anyone to test and validate how our systems handle failure. Purposefully injecting...

2:55pm - 3:45pm

by Tom Limoncelli
Author, SRE @ Stack Exchange

Distributed or "cloud" computing involves many moving parts, any of which can break or fail. Succeeding in this environment requires embracing failure, not running or hiding from it. Doing so requires challenging our instincts with radical ideas. Tom will highlight some of the most radical advice from the new book “The Practice of Cloud System Administration”.

Topics will include: create resiliency at the most economic level, do risky procedures often, and create a blameless culture...

4:10pm - 5:00pm

Open Space

Architecting for Failure Open Space

5:25pm - 6:15pm

by Joe Stein
Founder, Principal Consultant at Big Data Open Source Security LLC

Building and deploying elastic, distributed, data-centric systems that can fail without losing data and without sacrificing elasticity has traditionally been challenging. With Apache Mesos, an open source project that is the kernel for your data center, we can now create fully elastic end-to-end compute environments. With Mesos, distributed, data-persistent services can run durably and elastically. Kafka, HDFS, Cassandra, MySQL, and other data-centric systems run on Mesos.

We will talk...

Host: Philip Fisher-Ogden, Director of Engineering at Netflix


Wednesday Jun 10

Thursday Jun 11

Friday Jun 12