Abstract

Project Titus is Netflix's container runtime on top of Amazon EC2. Titus powers algorithm research through massively parallel model training, media encoding, data research notebooks, ad hoc reporting, NodeJS UI services, stream processing and general microservices. As an update from last year's talk, we will focus on the lessons learned operating one of the largest container runtimes on a public cloud. We'll cover the migration we've seen of applications and frameworks from VMs to containers. We will cover the operational issues with containers that only appeared once we reached the large scale we currently support (thousands of container hosts, hundreds of thousands of containers launched weekly). We'll touch on the unique features we have added to help both batch and microservices run across a variety of runtimes (Java, R, NodeJS, Python, etc.) and how higher-level frameworks have taken advantage of Titus's scheduling capabilities.

Interview

Question:

QCon: What’s the motivation for your talk?

Answer:

Andrew & Amit: Last year we talked about why we chose to build Titus, our container management platform. We also talked about the architecture and implementation of Titus. This year, we wanted to talk about lessons learned running and continuing to evolve this critical system at Netflix. We wanted to share the lessons learned providing the availability Netflix requires while running this system at our unique levels of scale.

Question:

QCon: What’s the level & core persona?

Answer:

Andrew & Amit: We expect most attendees who are considering containers as part of their infrastructure to benefit from the lessons learned across the varied use cases we've seen benefit from containers. Additionally, those growing container environments from test and development will learn key lessons about what it takes to run containers in production. Finally, we will present lessons learned only at Netflix scale. These final lessons of worldwide scale are usually academically interesting to most engineers.

Question:

QCon: What 3 actionable things do you want the persona to walk away with?

Answer:

Andrew & Amit:

1. What aspects to consider after deploying an off-the-shelf container management platform

2. How to think about reliability in the context of a large scale distributed container management platform

3. What levels of scale are possible, not through synthetic benchmarks, but real world container deployments

Question:

QCon: Ask and answer an interesting question of your choice that you think an attendee might have after reading your abstract?

Answer:

Andrew & Amit:

Q: Why is Netflix's container management platform different from other open source container management platforms?

A: Given our existing VM-based, cloud-native infrastructure, we approached container management as an addition to our existing cloud platform -- instead of a replacement. This means a key to the success of Titus was deciding what Titus would not do, leveraging the full value other infrastructure teams at Netflix provide. This also meant leveraging Amazon Web Services (AWS) deeply, seamlessly integrating VMs and containers while supporting existing operations and security models in AWS. Finally, it was important to consider how we choose workloads that benefit from containers, as opposed to pushing all workloads to containers.

Answer:

https://medium.com/netflix-techblog/the-evolution-of-container-usage-at-netflix-3abfc096781b

Speaker: Andrew Spyker

Open Source Coordinator @Netflix

Andrew has worked on the Cloud Platform team at Netflix for the past two years. He joined Netflix to help with performance and scalability across the core building blocks of Netflix’s cloud platform. Since joining Netflix, he has helped with not only performance but also architecture, open source, and container strategy. More recently Andrew has been an engineer as well as product manager for the container cloud initiative (Project Titus).

Find Andrew Spyker at

Speaker page

@aspyker

Speaker: Amit Joshi

Senior Software Engineer @Netflix

Find Amit Joshi at

Speaker page

Similar Talks

The Effective Remote Developer

Director of Engineering

David Copeland

Evaluating Machine Learning Models: A Case Study

Data Scientist @Opendoor

Nelson Ray

Multi-host, Multi-network Persistent Containers

CTO and Co-Founder @Aerospike

Brian Bulkowski

I Have A NoSQL toaster

Developer Advocate @Couchbase

Matthew Groves

Engineer Innovation Through Rapid Prototyping

Principal Software Engineer @ Vistaprint

Ramon Harrington

Nonconformist Resilience: DB-Backed Job Queues

VP Architecture @Betterment

John Mileham

Building Microservices @Squarespace

Director of Engineering @ Squarespace

Franklin Angulo

Refactor Frontend APIs & Accounting for Tech Debt

Software Engineer @Indiegogo

Julia Nguyen

Reasoning About Complex Distributed Systems

Software Engineer @Jet, previous CTO

Erich Ess

Tracks

Monday, 26 June

Microservices: Patterns & Practices

Practical experiences and lessons with Microservices.

Java - Propelling the Ecosystem Forward

Lessons from Java 8, prepping for Java 9, and looking ahead at Java 10. Innovators in Java.

High Velocity Dev Teams

Working Smarter as a team. Improving value delivery of engineers. Lean and Agile principles.

Modern Browser-Based Apps

Reactive, cross platform, progressive - webapp tech today.

Innovations in Fintech

Technology, tools and techniques supporting modern financial services.

Tuesday, 27 June

Architectures You've Always Wondered About

Case studies from the most relevant names in software.

Developer Experience: Level up Your Engineering Effectiveness

Trends, tools and projects that we're using to maximally empower your developers.

Chaos & Resilience

Failures, edge cases and how we're embracing them.

Stream Processing at Large

Rapidly moving data at scale.

Building Security Infrastructure

How our industry is being attacked and what you can do about it.

Wednesday, 28 June

Next Gen APIs: Designs, Protocols, and Evolution

Practical deep-dives into public and internal API design, tooling and techniques for evolving them, and binary and graph-based protocols.

Immutable Infrastructures: Orchestration, Serverless, and More

What's next in infrastructure. How cloud functions like Lambda are making their way into production.

Machine Learning 2.0

Machine Learning 2.0, Deep Learning & Deep Learning Datasets.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS.

Optimizing Yourself

Maximizing your impact as an engineer, as a leader, and as a person.

Ask Me Anything (AMA)

This Year's Schedule

Track: Immutable Infrastructures: Orchestration, Serverless, and More

Location: Broadway Ballroom South Center, 6th fl.

Duration: 10:35am - 11:25am

Day of week: Wednesday

Level: Intermediate

Persona: Architect, DevOps Engineer

Conference for Professional Software Developers

Presentation: A Series of Unfortunate Container Events @Netflix
