Presentation: Scheduling a Fuller House: Container Management @Netflix

Duration: 10:35am - 11:25am

Key Takeaways

  • Hear lessons learned from building and implementing a container scheduling system.
  • Gain insights from real-world experience building and operating a container cloud.
  • Learn the challenges and solutions of operating at scale, and what to consider before adopting or building a cluster management solution.

Abstract

Customers from all over the world streamed 42 billion hours of Netflix content last year. Various Netflix batch jobs and an increasing number of service applications use containers for their processing. In this talk, Netflix will present a deep dive into the motivations for, and the technology powering, container deployment on top of the AWS EC2 service. The talk will cover our approach to cloud resource management and scheduling with the open source Fenzo library, along with details on the Docker execution engine that is part of project Titus. The talk will also share some of the results so far and lessons learned, and end with a brief look at the developer experience for containers.

Interview

Question: 
What are your backgrounds and what are your roles today?
Answer: 
ANDREW: We have three areas that we focus on as part of the Titus project; Sharma and I are both full time on Titus.
I focus more on the Docker execution side, on integration with the AWS environment, and on integration with all of our other Netflix infrastructure systems. Our experience is in moving from immutable VM infrastructure to containers, and they are sort of both immutable. We already had a wealth of systems around CI/CD, telemetry, and IPC.
Not only did we have to do the work to integrate really deeply with the AWS environment, we had to do the work to integrate deeply with those supporting systems, because our design center was that we weren't designing a new cloud platform around containers.
We made containers work within our existing ecosystem. You can look at my part of the equation as the execution layer on down, and the integration into all of those systems.
SHARMA: I come from a scheduling background. Before Titus got started, I was working on a stream processing system, where we were developing scheduling on top of Mesos, and that led me to writing the scheduling library Fenzo.
The success we had doing scheduling and resource management there led to collaboration, and now I am working full time on Titus. My perspective is how we do resource management and scheduling, how we bring multiple use cases, like batch and service, to co-mingle in a single cluster, and how we achieve the SLAs for all of them.
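To make that scheduling model concrete, here is a minimal sketch of the assign-and-launch loop following the pattern in Fenzo's documentation. The exact package paths and method signatures should be checked against the open source Fenzo repository; the pending-task list, lease list, and the declineOffer/launchTask helpers are assumed to come from your own framework code.

    import java.util.List;
    import java.util.Map;

    import com.netflix.fenzo.SchedulingResult;
    import com.netflix.fenzo.TaskAssignmentResult;
    import com.netflix.fenzo.TaskRequest;
    import com.netflix.fenzo.TaskScheduler;
    import com.netflix.fenzo.VMAssignmentResult;
    import com.netflix.fenzo.VirtualMachineLease;

    public class SchedulingLoop {
        // Build a scheduler that expires unused resource offers (leases) after 10s.
        private final TaskScheduler scheduler = new TaskScheduler.Builder()
                .withLeaseOfferExpirySecs(10)
                .withLeaseRejectAction(lease -> declineOffer(lease)) // return unused offers
                .build();

        // One scheduling iteration: match pending tasks against available leases.
        void runOnce(List<TaskRequest> pendingTasks, List<VirtualMachineLease> newLeases) {
            SchedulingResult result = scheduler.scheduleOnce(pendingTasks, newLeases);
            for (Map.Entry<String, VMAssignmentResult> e : result.getResultMap().entrySet()) {
                String hostname = e.getKey();
                for (TaskAssignmentResult assignment : e.getValue().getTasksAssigned()) {
                    // Tell Fenzo the assignment is taken, then launch via your framework.
                    scheduler.getTaskAssigner().call(assignment.getRequest(), hostname);
                    launchTask(assignment.getRequest(), hostname);
                }
            }
        }

        private void declineOffer(VirtualMachineLease lease) { /* framework-specific */ }
        private void launchTask(TaskRequest task, String hostname) { /* framework-specific */ }
    }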
Question: 
Why write your own? Why didn’t you use one of the existing scheduling libraries?
Answer: 
SHARMA: When we started, Mesos was reasonably immature. There were popular schedulers like Apache Aurora, from Twitter, and Marathon, from Mesosphere. However, when we started looking at them, although they were mature in terms of having an API and executing jobs, they were very limited and primitive in terms of scheduling capabilities. You could not do much advanced scheduling, and that is what we were after.
For stream processing, we were trying to do things like stream locality, sort of like data locality for batch jobs, and also balancing services across availability zones on EC2. Another primary factor was that we are running on an elastic cloud; we wanted to do auto scaling of the cluster. None of them autoscale the underlying cluster, even today. All of those schedulers are built with the datacenter in mind, whereas we need to be auto scaling.
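As an illustration of the kind of placement logic being described (a toy sketch, not Fenzo's actual API): balancing a service across EC2 availability zones can be as simple as always placing the next replica into the zone that currently holds the fewest replicas. The zone names below are only examples.

    import java.util.*;

    // Toy zone balancer: keeps replica counts per availability zone and
    // always picks the least-loaded zone for the next placement.
    public class ZoneBalancer {
        private final Map<String, Integer> countsByZone = new HashMap<>();

        public ZoneBalancer(Collection<String> zones) {
            for (String z : zones) countsByZone.put(z, 0);
        }

        // Returns the least-loaded zone and records the placement there.
        public String placeNextReplica() {
            String best = Collections.min(countsByZone.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            countsByZone.merge(best, 1, Integer::sum);
            return best;
        }

        public static void main(String[] args) {
            ZoneBalancer b = new ZoneBalancer(
                    List.of("us-east-1a", "us-east-1b", "us-east-1c"));
            for (int i = 0; i < 7; i++) System.out.println(b.placeNextReplica());
        }
    }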
Question: 
When you talk about containers on EC2, you are talking about deploying them in the AMIs, right? You are not talking about using ECS. Is there a reason why you chose that approach and not something like Mesos or ECS?
Answer: 
ANDREW: Correct. We are talking about taking larger VMs and scheduling containers into those with this Titus scheduler and framework technology.
Why choose Titus when there are a lot of other container management and scheduling systems out there? It comes down to a couple of points. Sharma already talked about the complex job scheduling requirements, and the scale and elasticity that we need around that, but I would add to that the deep integration with the existing Netflix systems.
We have done things like this: when you are on an Amazon instance, there is a metadata URL that tells you everything about the instance you are on. We use that metadata URL to inject information into service discovery to do IPC-based routing. We got in there and decided we needed to replace that metadata URL with one that is container-specific as opposed to VM-specific.
That's a case where not only did we extend the AWS support, but we also did it in a very direct way by integrating into our existing service discovery technologies. Continuous deployment is another example of this.
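For context, the metadata URL in question is the standard EC2 instance metadata service, a link-local endpoint reachable only from within an EC2 instance. A container-specific replacement, as described above, would answer the same well-known paths but with per-container values. A minimal probe of the standard endpoint (IMDSv1 style, no session token; this is an illustration, not Titus code):

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Queries the EC2 instance metadata service; only works from an EC2 instance.
    public class MetadataProbe {
        public static void main(String[] args) throws IOException, InterruptedException {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://169.254.169.254/latest/meta-data/instance-id"))
                    .build();
            HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println("instance-id: " + resp.body());
        }
    }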
I think if you look at Kubernetes, it starts to become a wider offering than what we needed. We already had CI/CD systems and approaches to IPC. If you look at our cloud platform and Spinnaker offerings, there is quite a bit of infrastructure already around us; we were coming from a starting point of pre-baked immutable infrastructure, microservices, and a strong emphasis on fault tolerance.
How do you make containers work in that environment, as opposed to bringing all of that along with the container cluster management so that people have to learn two different systems?
As for Mesos, we strongly believe in, and Titus leverages, its underlying resource scheduling with custom frameworks. ECS offers a similar concept, but has a few scalability gaps that we are actively working with the ECS team to resolve. Over time, it is likely that Titus will support both Mesos and ECS, given our strong support of the Amazon environment.
Question: 
What is the primary goal for this talk?
Answer: 
ANDREW: It’s really letting people know how hard this is.
Sharma has a great quote: "Building a good cluster manager is easy. Building one that is fault tolerant, scalable and performant is quite hard."
There are probably a bunch of other abilities to throw into that list. When we come at this at Netflix with a container management solution, it has to be on par with where the VM system is today, or it’s a non-starter.
The complexity of doing that is very different from starting with a solution that’s part of the way there and growing into the container space. Where we came from was trying to make sure it was as good as what we had before, now with the added benefits of container management.
Question: 
Not everybody out there can operate at the scale you are talking about. What type of lessons are you going to be sharing that are going to be applicable to the other 98% of folks that don’t run one third of the internet?
Answer: 
ANDREW: I put fault tolerance well before scale. For a lot of the Netflix infrastructure, Titus included, it’s far more important to get the fault tolerance part right.
We are expected to do scalability, but I think what people can take away from this the most is the work we’ve done around reconciliation, health checks, and all the other infrastructure we have built into the Titus system that makes it as reliable as it is at this point in time.
Not to say scale isn’t important, but I would definitely focus on fault tolerance before I would focus on scalability.
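As a rough illustration of the reconciliation idea Andrew mentions (a toy sketch, not Titus code): periodically compare the scheduler's desired task set against what the agents actually report as running, and emit corrective actions for any drift. Real systems run this continuously and rate-limit the fixes.

    import java.util.*;

    // Toy reconciliation pass: desired state vs. reported state.
    public class Reconciler {
        public static void reconcile(Set<String> desired, Set<String> reported) {
            for (String task : desired) {
                if (!reported.contains(task))
                    System.out.println("RESTART " + task + " (desired but not running)");
            }
            for (String task : reported) {
                if (!desired.contains(task))
                    System.out.println("KILL " + task + " (running but no longer desired)");
            }
        }

        public static void main(String[] args) {
            reconcile(new HashSet<>(List.of("api-1", "api-2", "batch-7")),
                      new HashSet<>(List.of("api-1", "batch-7", "orphan-3")));
        }
    }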
SHARMA: We are also going to talk about the diversity of our cloud. When people talk about scale, a lot of times they are thinking about how many instances you are running.
It is also important to see the variety of workloads you have. Running 100 microservice instances versus 10,000 is a scale problem, but not as big a one as running 1,000 microservices alongside 10, 20, or 100 different batch jobs at the same time.
I think scale also comes in variety, and that is one of the lessons we have learned. We are going to share how we built the architecture based on those lessons, to separate out the concerns of the different varieties of job loads from the scheduling concerns, and things like that.

Speaker: Andrew Spyker

Senior Software Engineer @Netflix

Andrew has worked on the Cloud Platform team at Netflix for the past two years. He joined Netflix to help with performance and scalability across the core building blocks of Netflix’s cloud platform. Since joining, he has helped not only with performance but also with architecture, open source, and container strategy. More recently, Andrew has been an engineer as well as product manager for the container cloud initiative (Project Titus).

Speaker: Sharma Podila

Software Engineer @Netflix & Creator of Fenzo Extensible Scheduler

Sharma works on the Edge Engineering team at Netflix. He is the author of the open source Fenzo scheduling library for Apache Mesos frameworks. His current work includes developing resource management and scheduling infrastructure for Netflix’s project Titus, a Docker-based application deployment platform, and project Mantis, a reactive stream processing platform.
