Track: Better than Resilient: Antifragile



“Antifragility is beyond resilience or robustness. The resilient resists shocks and stays the same; the antifragile gets better.”

Failure and change are constants in Internet-scale companies. Uptime is a battle, with tales of glory and heartache. How can we do more than withstand failure, and instead improve with each step? Learn from industry leaders how they proactively prepare for the inevitable. Find out how these techniques help them weather production storms and have confidence in the behavior of their complex systems.

Track Host:
Kolton Andrus
Founder of Gremlin and previously a Chaos Engineer at Netflix
Kolton is the founder of Gremlin Inc, which helps companies build more robust services. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT, Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website. At both companies he has served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, primarily as it lets him break things for fun and profit.
10:35am - 11:25am

by Theo Schlossnagle
Founder and CEO @Circonus, Editorial board of ACM's ‘Queue’

In this presentation, I'll talk about lessons learned in building an always-on distributed time-series database with aggressive quality of service guarantees. As any distributed systems engineer knows, coping with a failed machine is an easy problem compared to coping with an underperforming one. When SLAs are tight, underperforming is effectively Byzantine behavior. I will talk about both macro and micro techniques used in our system to cope with bad machines, bad actors...

11:50am - 12:40pm

by Luke Kosewski
Founding Member of Netflix Chaos and Traffic Team

The Netflix control plane handles a third of peak Internet traffic. That's an awful lot of customers we need to keep safe from any service outages. Netflix developed "Flow" to wage war against these outages. Flow coordinates recovery from localized disruptions and enables periodic verification through production experimentation called “Chaos Kong.”

Flow endows all services within Netflix with the capability to withstand regional...

1:40pm - 2:30pm

by Michalis Zervos
Service Resilience Software Engineer @Microsoft

To run on the cloud, a company needs assurance that its workloads, services, and data will always be available and secure. To provide such guarantees, application developers and cloud providers need to perform extensive verification across a number of distributed services. Traditional testing tools were not designed to verify the resiliency of such systems.

At Microsoft, we actively develop and use fault...

2:55pm - 3:45pm

by Richard Kasperowski
Author of The Core Protocols: A Guide to Greatness

Open Space
4:10pm - 5:00pm

by Abel Mathew
Co-founder & CEO of Backtrace I/O

Resilience for many of us comes from our ability to restart applications in the face of failure. We as debuggers and operators are often forced to go back and analyze clues left behind to tease out the root cause from assets like logs, heap dumps, or even core dumps. As our systems grow and become more distributed, these one-off investigations become less tenable, and a scalable way to analyze incidents after the fact is needed. In this talk, we'll explore examples...

5:25pm - 6:15pm

by Thomissa Comellas
Technical Project Manager @Dropbox

by Tammy Butow
SRE Manager @Dropbox

Thomissa joined the Dropbox Infrastructure team 100 days ago. This presentation will share her experiences developing and rolling out new Disaster Recovery Testing techniques at Dropbox. Tammy will join Thomissa to share how her team runs DRTs and has implemented the techniques Thomissa has evangelized.

Dropbox was founded by engineers, and the ethos of technical innovation is fundamental to our culture. We’ve grown enormously...


Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June