Presentation: Chaos Kong - Endowing Netflix with Antifragility

Duration: 11:50am - 12:40pm

Key Takeaways

  • Understand the Netflix approach to shifting traffic based on failure scenarios.
  • Learn challenges that occur at scale when moving traffic between data centers.
  • Hear actionable techniques that you can implement to handle failure more seamlessly in your enterprise.

Abstract

The Netflix control plane handles a third of peak Internet traffic. That's an awful lot of customers we need to keep safe from any service outages. Netflix developed "Flow" to wage war against these outages. Flow coordinates recovery from localized disruptions and enables periodic verification through production experimentation called “Chaos Kong.”

Flow endows all services within Netflix with the ability to withstand regional failure, without individual team intervention. This talk describes the internals and machinations of Flow, how a similar approach may add value to your microservice architecture, and what preconditions must be met - both technically and culturally - for such a recovery mechanism to succeed. We'll share a real-life war story from a Q4 2015 outage to demonstrate how this approach can turn the threat of a nasty disruption into a harmless, routine event.

Interview

Question: 
Let’s start with your role. You are at Netflix working on chaos; can you tell me more about that? What does it mean to work on the chaos and traffic teams at Netflix?
Answer: 
The chaos and traffic teams are two sides of the same coin. The traffic team is responsible for evacuating traffic from areas of failure. The chaos team is responsible for causing that failure, and for understanding whether or not our mechanisms are sufficiently resilient to deal with it, both at the service level and at the datacenter traffic level. I work mostly on the evacuation tools and the management tools that deal with traffic flow leaving datacenters.
Question: 
Can you give me some background for this talk?
Answer: 
When I came to Netflix, my original title was an SRE. There were only three of us in a company of 700-800 engineers. It was like playing a game of Whack-a-Mole.
We would go to one service that was having issues and we would tighten up everything, ratchet down the bolts, pat them on the back and say “You guys are good to withstand an issue.”
We would iterate through the services that way and, by the time we were done, they had deployed so many code changes that we had to go back and do it all over again.
It wasn’t a sustainable practice to get the uptime numbers that we wanted, so Flow was born. Flow is an implementation of this Chaos Kong idea, the idea that we can take down a datacenter and move that traffic somewhere else. Flow is the tool that does that: it moves the traffic.
From then, the paradigm for our team shifted, and chaos and traffic grew out of that.
We decided to isolate the pieces of software we run to various fault domains, so that in the case of an outage, caused with our own tools or happening in real life, we could evacuate traffic from the datacenter and re-balance it. And we provided guarantees that we were not going to topple over the other regions when we moved all that traffic.
That is the meat of what I am going to talk about: the motivation for it, the challenges Netflix faced when developing this tool (unique in some cases, not so unique in others), how it’s actually done, and how our secret sauce works for a regional failover.
Question: 
What is the persona that you see coming to this talk?
Answer: 
There are a number of people at various levels: an architect, a senior engineer, or a tech lead. I am going to enumerate many of the problems that Flow had to overcome, and the issues that existed within our tech stack.
Somebody had to go and reason them out. These people are in a unique position to do that. They will be able to look at the challenges that we tackled and ask “What would we face within our organization to solve the same problems?”
At the same time, I hope to see more director level and VP people, people who have the ability to change priorities across a large number of organizations within a company. These decision makers will be able to walk away with a value proposition.
Flow is a project that was heavily technically-driven, but there was a lot of executive support for the concept of regional failover. It was necessary because different teams have different priorities, and you have to get everybody on the same page if you are going to pick a launch date. These people will be able to see the value that Netflix got out of it, and if it makes sense to do something similar in their organization.
Question: 
Can you provide some more detail on what Flow is?
Answer: 
The applications in each region are decoupled in such a way that they can withstand the failure of any region. Even for something like a Cassandra ring, if we drop out a region, the other two regions still don’t fail. Netflix doesn’t scale to handle more than some buffer percentage above regular traffic.
If we suddenly have a datacenter outage, we actually need to start creating new instances of our applications elsewhere, and Flow coordinates that for us and does it correctly. We won’t be paged. We won’t need to be woken up in the case of an outage. As long as a service conforms to these constraints, Flow will just do the right thing for us.
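To make that orchestration concrete, here is a minimal, hypothetical sketch of the sequence described above: pre-scale the healthy regions, wait for the new capacity to come into service, then shift traffic away from the failed region. The region names and the scaling/DNS helpers are illustrative stand-ins, not Flow's actual internals or API.

```python
"""Minimal sketch of a regional failover coordinator in the spirit of Flow.

All names here (regions, scaling and DNS helpers) are hypothetical stand-ins;
the real Flow internals are not public in this form.
"""
import time

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # hypothetical fault domains
HEADROOM = 1.5  # with one of three regions gone, survivors take ~50% more load


def scale_up(region: str, factor: float) -> None:
    # Placeholder for an autoscaling call (e.g. raising desired fleet capacity).
    print(f"scaling {region} fleets to {factor:.0%} of normal")


def shift_dns_weight(region: str, weight: int) -> None:
    # Placeholder for a weighted-DNS update steering client traffic.
    print(f"setting DNS weight for {region} to {weight}")


def evacuate(failed_region: str) -> None:
    """Move traffic out of `failed_region` only after the savior regions are ready."""
    saviors = [r for r in REGIONS if r != failed_region]

    # 1. Pre-scale the healthy regions so they can absorb the extra load.
    for region in saviors:
        scale_up(region, HEADROOM)

    # 2. Wait for new instances to come into service before shifting traffic.
    time.sleep(1)  # stand-in for polling instance health

    # 3. Drain the failed region and rebalance clients onto the saviors.
    shift_dns_weight(failed_region, 0)
    for region in saviors:
        shift_dns_weight(region, 100)


if __name__ == "__main__":
    evacuate("us-east-1")
```

The ordering is the important part of the sketch: capacity has to exist in the savior regions before traffic lands on them, which is how the "we won't topple over other regions" guarantee gets kept.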
Question: 
So Flow is focused on service orchestration?
Answer: 
Yes, very much so.
Question: 
What are the actionable benefits that a tech lead or an architect is going to walk away with, ready to go implement back in their own shop?
Answer: 
In this talk I’ll enumerate a bunch of problems that we had to overcome at the scale of Netflix, and I’ll show a number of concrete solutions that Netflix used to solve them.
I think every organization, depending on what technology is already in place, is going to have slightly different challenges than Netflix has, but they will walk away with a very good starting point: “Here is a way I can sell this to people.” “Here is a success story where Kong and Flow saved the company’s bread and butter. What are the challenges that exist within my organization? Do these solutions work? Do we need to come up with different solutions?”
I think they are going to be able to walk away with a good starting point, if not a complete picture, for how they might implement a regional or fault-domain failover within their own organization, because I don’t think this is necessarily tied only to AWS.
Question: 
Can you give me an example of one of the problems, and its solution, that you will cover in your QCon talk?
Answer: 
One example is that we use level 7 proxying between our regions in order to flatten out small irregularities, and during a failover a lot of what we do is flip DNS. That is a really good starting point for having all of our client devices start moving over to another region.
But one of the things that has happened when we flip DNS is that the load balancers in the healthy regions get crushed under the load. What can you do? How can you work around that?
We do have mechanisms for that, because we have all of this level 7 proxying in place: if you can make assumptions about healthy proxies in one region, you can start doing a lot of interesting back-end proxying that bypasses the load balancers.
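As a rough illustration of that idea (the endpoint names and addresses below are made up, not Netflix's actual setup), a proxy tier that knows a failover is in progress can spread requests directly across back-end instances instead of funnelling them through a regional load balancer that is already buckling under the DNS flip:

```python
"""Sketch of the load-balancer bypass described above, with made-up names.

Once clients land on a savior region's level 7 proxies, those proxies can
route to back-end fleets directly over the private network instead of going
through the public load balancer that the DNS flip is crushing."""

# Hypothetical routing data for one savior region.
PUBLIC_VIP = "api-vip.us-west-2.example.com"                 # regional load balancer
BACKEND_FLEET = ["10.20.0.11", "10.20.0.12", "10.20.0.13"]   # instances behind it


def pick_upstream(failover_active: bool, request_id: int) -> str:
    """Choose where the proxy forwards a request."""
    if failover_active:
        # Bypass the load balancer: spread requests across the fleet directly.
        return BACKEND_FLEET[request_id % len(BACKEND_FLEET)]
    # Normal operation: let the regional load balancer do the spreading.
    return PUBLIC_VIP


if __name__ == "__main__":
    for rid in range(5):
        print(pick_upstream(failover_active=True, request_id=rid))
```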
Question: 
Are the lessons you plan to discuss only for the large-scale companies out there, or are they applicable to smaller shops as well?
Answer: 
I think it applies to anybody who has a multi-region deployment in AWS. I came to Netflix from PagerDuty, and they happen to run in multiple AWS regions. Companies like PagerDuty could use the same sort of template for what we do at Netflix on a larger scale. So smaller companies could definitely use the same template.
Will they worry as much about capacity? Perhaps not, because a 20 or 40% capacity overhead for them is not nearly the same amount of money as a 20 or 40% capacity overhead at Netflix. But many of the same concepts apply, just blown up to the scale of Netflix.
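As a back-of-the-envelope illustration of the capacity math behind those overhead figures (numbers are illustrative only, not Netflix's): with N regions and one evacuated, each surviving region has to absorb N/(N-1) of its normal load, which is why scaling up on demand at failover time matters so much at larger sizes.

```python
"""Illustrative failover headroom arithmetic; not Netflix's actual figures."""


def required_headroom(total_regions: int, failed_regions: int = 1) -> float:
    """Extra capacity each surviving region needs to absorb a failover."""
    surviving = total_regions - failed_regions
    return total_regions / surviving - 1.0


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(f"{n} regions, 1 down -> each survivor needs "
              f"+{required_headroom(n):.0%} over its normal peak")
    # 2 regions -> +100%, 3 regions -> +50%, 4 regions -> +33%
```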

Speaker: Luke Kosewski

Founding Member of Netflix Chaos and Traffic Team

Luke Kosewski has been involved in many aspects of computing for over 12 years - from Linux kernel drivers and embedded systems to distributed web applications. He currently works as a Sr. Software Engineer and is a founding member of the Traffic & Chaos team at Netflix. He built "Flow", Netflix's traffic management service, to solve reliability issues at scale and continues to innovate on it, making Netflix more resilient to new threats. In his spare time, Luke scuba dives and tinkers with motorcycles.
