Netflix
Presentations about Netflix
UNBREAKABLE: Learning to Bend but Not Break at Netflix
Large Scale Architectures Panel
Containers at Web Scale Panel
Scaling Push Messaging for Millions of Devices @Netflix
How Machines Help Humans Root Cause Issues @Netflix
Interviews
UNBREAKABLE: Learning to Bend but Not Break at Netflix
Tell me about your talk.
I’m going to share my personal journey at Netflix learning to build and operate distributed systems -- both as a service owner and as a Chaos engineer. As a service owner, I’ll provide examples of how I used Chaos engineering to build better systems, even for non-critical services. As a Chaos engineer, I’ll cover some of the lessons I’ve learned while building better tooling for safe experimentation.
Can you give me an example of one of the lessons?
When running Chaos experiments, we leverage a canary strategy. We have a control cluster and an experiment cluster, and we monitor KPI data during the experiment so we can shut it off quickly if things go awry. We've been adding more KPIs so that we can watch different dimensions, and one of the challenges we’ve encountered is how to monitor low-volume metrics to get a reliable signal for shutoff. False positives create a lot of noise. We don’t want our users to get alert fatigue from unreliable results, so we have to find the right balance between erring on the side of caution and minimizing noise.
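To make the low-volume problem concrete, here is a minimal sketch of a canary shutoff check. It is an illustration only, not Netflix's tooling: the names `ClusterStats` and `should_abort`, the choice of error rate as the KPI, the two-proportion z-test, and the `min_requests` and `z_threshold` values are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ClusterStats:
    """Request and error counts for one cluster over the current window (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def should_abort(control: ClusterStats, experiment: ClusterStats,
                 min_requests: int = 200, z_threshold: float = 3.0) -> bool:
    """Return True if the experiment cluster looks significantly worse than control.

    min_requests and z_threshold are illustrative defaults, not tuned values.
    """
    # Low-volume guard: with too few samples the comparison is mostly noise,
    # so hold off instead of raising a likely false positive.
    if control.requests < min_requests or experiment.requests < min_requests:
        return False

    # Two-proportion z-test using the pooled error rate.
    total = control.requests + experiment.requests
    pooled = (control.errors + experiment.errors) / total
    std_err = sqrt(pooled * (1 - pooled)
                   * (1 / control.requests + 1 / experiment.requests))
    if std_err == 0:
        return False  # both clusters all-success or all-failure: no signal

    z = (experiment.error_rate - control.error_rate) / std_err
    return z > z_threshold
```

The guard trades caution for quiet: a higher `min_requests` suppresses noisy aborts on sparse metrics but delays shutoff when something really is wrong, which is the balance described above.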