Using Traffic Modeling to Load-Balance Netflix Traffic at Global Scale

Netflix infrastructure supports a personalized UI and streaming experience for 230M+ members around the world. Because that infrastructure is spread across multiple locations, it is important to have predictability and control over how user traffic is distributed among them, in order to balance latency, infrastructure cost, and availability risk.

This talk tells the story of how Netflix shifted from geo-based DNS load balancing to a latency-based approach, relying on real-user measurements and building a global data model of Netflix traffic to reduce costs while also reducing latency and outage risk. We will cover the challenges of integrating the solution into the Cloud and CDN components of Netflix infrastructure, and the trade-offs between model accuracy and traffic model complexity. The talk also demonstrates how the data-driven approach was applied to influence future infrastructure decisions by simulating the impact of potential infrastructure changes with precision and minimal engineering effort.
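To make the contrast between geo-based and latency-based steering concrete, here is a minimal Python sketch. It is not Netflix's implementation: the client groups, region names, latency samples, and the geo mapping are all invented for illustration, and picking the region with the lowest median latency is just one simple way to turn real-user measurements into a steering decision.

```python
from statistics import median

# Hypothetical real-user measurements: latency samples (ms) from each
# client group to each candidate region. All names and numbers are made up.
MEASUREMENTS = {
    "isp-a": {"us-east": [40, 42, 45], "us-west": [80, 85, 90], "eu-west": [120, 118, 125]},
    "isp-b": {"us-east": [95, 100, 110], "us-west": [60, 62, 70], "eu-west": [150, 155, 160]},
    "isp-c": {"us-east": [90, 92, 94], "us-west": [160, 158, 170], "eu-west": [30, 35, 33]},
}

# Hypothetical geo-based mapping: the region a purely geographic rule would pick.
GEO_REGION = {"isp-a": "us-east", "isp-b": "us-east", "isp-c": "eu-west"}


def median_latency(samples_by_region):
    """Summarize raw samples as one median latency per region."""
    return {region: median(samples) for region, samples in samples_by_region.items()}


def latency_based_region(samples_by_region):
    """Pick the region with the lowest median measured latency."""
    medians = median_latency(samples_by_region)
    return min(medians, key=medians.get)


def steering_table(measurements):
    """Map every client group to its lowest-latency region."""
    return {group: latency_based_region(samples) for group, samples in measurements.items()}


def improvement_over_geo(measurements, geo_region):
    """Median latency saved (ms) per group when moving from the geo choice
    to the latency-based choice."""
    saved = {}
    for group, samples in measurements.items():
        medians = median_latency(samples)
        best = latency_based_region(samples)
        saved[group] = medians[geo_region[group]] - medians[best]
    return saved


if __name__ == "__main__":
    print(steering_table(MEASUREMENTS))
    print(improvement_over_geo(MEASUREMENTS, GEO_REGION))
```

In this toy data set only the hypothetical group "isp-b" changes region. A production model would also need to account for regional capacity, cost, and failover constraints, which is where the accuracy versus complexity trade-off mentioned in the abstract comes in.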


Speaker

Niosha Behnam

Staff Software Engineer @Netflix

Niosha is a Staff Software Engineer on the Compute Abstractions Team at Netflix. Over his tenure he was a founding member of the Traffic & Chaos Team, where he helped build the software that powers cloud traffic management, regional failover, and resilience. Most recently, in addition to exploring opportunities for expanding Netflix’s global cloud footprint, Niosha has been working on improving traffic steering visibility to minimize cloud cost while optimizing user experience.

Prior to Netflix, Niosha built custom IaaS offerings for specialized private clouds and contributed to R&D leveraging big data approaches to ingest, analyze, and visualize large volumes of relational data.

Speaker

Sergey Fedorov

Director of Engineering @Netflix

Sergey is a hands-on engineering leader working for the Content Delivery team at Netflix. An early member of the team that built the Open Connect CDN, which delivers 13% of the world's Internet traffic, he spent years building monitoring and data analysis systems for Netflix video streaming. As part of that work, he released FAST.com, one of the most popular Internet speed tests.

Today Sergey focuses on improving interactive requests from Netflix applications to achieve better latency, reliability, and control over client-server communications. Prior to Netflix, he worked on optimizing developer infrastructure at Microsoft and real-time photorealistic rendering at Intel.   

Sergey is a vocal advocate of an observability approach to engineering and making data-driven decisions. Finding actionable signals in loosely controlled environments is what keeps him awake, much better than caffeine. This might also explain why outside of work Sergey can be seen playing ice hockey, brewing beer, or exploring exotic travel destinations.

Date

Tuesday Jun 13 / 10:35AM EDT ( 50 minutes )

Location

Salon A-C (North Tower)

Topics

Architecture, Platform Engineering, Data Analytics, Traffic Management, Capacity Management

From the same track

Session Architecture

Global Capacity Management through Strategic Demand Allocation

Tuesday Jun 13 / 01:40PM EDT

Meta currently operates in more than 15 data center regions around the world. This rapidly expanding global datacenter footprint poses new challenges for service owners as well as our infrastructure management systems.

Ranjith Kumar S

Software Engineer @Meta

Session Architecture

From Open Source to SaaS: The Journey of ClickHouse

Tuesday Jun 13 / 05:25PM EDT

Have you ever wondered what it takes to go from an open-source project to a fully fledged SaaS product? How about doing that in only one year? If the answer is yes, then this talk is for you. You’ll hear straight from the experts who worked on the design and execution of this huge project.

Sichen Zhao

Senior Software Engineer @ClickHouse

Shane Andrade

Principal Software Engineer @ClickHouse

Session

Performance Pitfalls in Using Redux Store at Slack-Scale and How to Keep It in Check

Tuesday Jun 13 / 02:55PM EDT

Details coming soon.

Session Platform

Building Sub-Second Latency Video Infrastructure at Cloudflare

Tuesday Jun 13 / 04:10PM EDT

Cloudflare has deployed a sub-second latency live streaming system at scale over the last few years. In this talk, we’ll provide insight into how this works under the hood, focusing specifically on the protocols that Cloudflare Stream uses: HLS, DASH, RTMPS, SRT, and WebRTC.

Renan Dincer

Systems Engineer @Cloudflare

Session Architecture

Unconference: Architectures You've Always Wondered About

Tuesday Jun 13 / 11:50AM EDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.