Presentation: A Series of Unfortunate Container Events @Netflix

Track: Immutable Infrastructures: Orchestration, Serverless, and More

Location: Broadway Ballroom South Center, 6th fl.

Day of week: Wednesday

Level: Intermediate

Persona: Architect, DevOps Engineer

Abstract

Project Titus is Netflix's container runtime on top of Amazon EC2. Titus powers algorithm research through massively parallel model training, media encoding, data research notebooks, ad hoc reporting, NodeJS UI services, stream processing, and general microservices. As an update to last year's talk, we will focus on the lessons learned operating one of the largest container runtimes on a public cloud. We'll cover the migration of applications and frameworks from VMs to containers that we've seen. We will cover the operational issues with containers that only appeared after we reached the large scale we currently support (thousands of container hosts, hundreds of thousands of containers launched weekly). We'll touch on the unique features we have added to help both batch and microservices run across a variety of runtimes (Java, R, NodeJS, Python, etc.) and how higher-level frameworks have taken advantage of Titus's scheduling capabilities.

Question: 

QCon: What’s the motivation for your talk?

Answer: 

Andrew & Amit: Last year we talked about why we chose to build Titus, our container management platform. We also talked about the architecture and implementation of Titus. This year, we wanted to talk about lessons learned running and continuing to evolve this critical system at Netflix. We wanted to share the lessons learned providing the availability Netflix requires while running this system at our unique levels of scale.

Question: 

QCon: What’s the level & core persona?

Answer: 

Andrew & Amit: We expect most attendees who are considering containers as part of their infrastructure to benefit from the lessons learned across the varied use cases we've seen benefit from containers. Additionally, those growing container environments from test and development will learn key lessons about what it takes to run containers in production. Finally, lessons learned only at Netflix scale will be presented. These final lessons of worldwide scale are usually academically interesting to most engineers.

Question: 

QCon: What 3 actionable things do you want the persona to walk away with?

Answer: 

Andrew & Amit:

1. What aspects to consider after deploying an off-the-shelf container management platform

2. How to think about reliability in the context of a large scale distributed container management platform

3. What levels of scale are possible, not through synthetic benchmarks, but real world container deployments

Question: 

QCon: Ask and answer an interesting question of your choice that you think an attendee might have after reading your abstract?

Answer: 

Andrew & Amit:

Q: Why is Netflix's container management platform different from other open source container management platforms?

A: Given our existing VM-based, cloud-native infrastructure, we approached container management as an addition to our existing cloud platform, not a replacement. This means a key to the success of Titus was deciding what Titus would not do, leveraging the full value that other infrastructure teams at Netflix provide. This also meant leveraging Amazon Web Services (AWS) deeply, seamlessly integrating VMs and containers while supporting existing operations and security models in AWS. Finally, it was important to consider how we chose workloads that benefited from containers, as compared to pushing all workloads to containers.

Speaker: Amit Joshi

Senior Software Engineer @Netflix

Speaker: Andrew Spyker

Manager, Netflix Container Cloud @Netflix

Previously worked within the development team to mature the technology base of our container cloud (Project Titus), including advanced scheduling and resource management, Docker container execution, and AWS & Netflix infrastructure integration. Recently moved into a product management role, collaborating with the teams behind Netflix infrastructure dependencies and supporting new container cloud usage scenarios, including user on-boarding, feature prioritization/delivery, and relationship management. Now managing the extended development team that will enable our container cloud to be a key aspect of Netflix's infrastructure. Still on-call, but now loving building the team as much as building the product.
