What You’ll Learn

Assess whether a streaming solution is fit for your problem.
Learn how to design and architect a solution for replacing a batch system with a streaming one.
Discuss how Spark compares to Flink and how to decide which engine is best for your problem.

Abstract

Streaming applications have historically been complex to design and implement because of the significant infrastructure investment they require. However, recent active development in various streaming platforms provides an easier transition to stream processing and enables analytics applications and experiments to consume near real-time data without massive development cycles.

This talk will cover the experiences Netflix’s Personalization Data team had in stream processing unbounded datasets. The datasets included, but were not limited to, the stream of playback events (all of Netflix’s plays worldwide) that are used as feedback for recommendation algorithms. These datasets, when ultimately consumed by the team's machine learning models, directly affect the customer’s personalized experience. As such, the impact is high and the tolerance for failure is low.

This talk will provide guidance on how to determine whether a streaming solution is the right fit for your problem. It will compare microbatch versus event-based approaches, with Apache Spark and Apache Flink as examples. Finally, the talk will discuss the impact that moving to stream processing had on the team's customers and (most importantly) the challenges faced.

Interview

Question:

QCon: What are you doing at Netflix?

Answer:

Shriya: I work on the data engineering team for Personalization, which, among other things, delivers recommendations made for each user. We are responsible for the data that goes into training and scoring the various machine learning models that power the Netflix homepage.

Question:

QCon: What's the focus of your talk?

Answer:

Shriya: Today, the training of our machine learning models happens offline, at most once a day. As the Netflix user base and, with it, the volume of data being collected grow rapidly, and as researchers innovate with newer models, we are exploring whether we can train these models on a more frequently updated dataset. Going streaming also has technical advantages. As with most cloud solutions, our storage costs are much higher than our compute costs. If, instead of storing large amounts of raw data and waiting for batch processes to pick them up, we process, aggregate, and discard the data as it comes in, we make more efficient use of our cluster resources.
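The "process, aggregate, and discard" idea can be illustrated with a minimal Python sketch. This is not Netflix's pipeline: the event shape `(user_id, title_id)`, the function `aggregate_plays`, and the sample events are all hypothetical, chosen only to show that a running aggregate can be kept without retaining the raw events.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (user_id, title_id).
# Real playback events would carry many more fields.
PlayEvent = Tuple[str, str]


def aggregate_plays(events: Iterable[PlayEvent]) -> Counter:
    """Fold a stream of play events into per-(user, title) counts.

    Only the running counts are kept in memory; each raw event is
    dropped as soon as it has been folded into the aggregate.
    """
    counts: Counter = Counter()
    for user_id, title_id in events:
        counts[(user_id, title_id)] += 1
    return counts


def sample_stream():
    """Made-up events standing in for an unbounded source."""
    yield ("user-1", "title-a")
    yield ("user-2", "title-a")
    yield ("user-1", "title-a")


totals = aggregate_plays(sample_stream())
```

Because `sample_stream` is a generator, no list of raw events is ever materialized; only `totals` survives. A real streaming engine would also need to checkpoint this state so it can recover after a failure.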

Question:

QCon: What are some of the considerations that you have to take into account when attempting to get to real time?

Answer:

Shriya: One big thing is the accuracy of the data. Streaming data has an important temporal advantage: it's ready for access sooner. But is it as accurate and as reliable as the batched data? Batch systems tend to be very accurate because you have all the data that you need to process and all your sources have reconciled. Batch systems also deal with data recovery and repair far more easily than streaming systems. These are things you need to tackle in your streaming design.

Question:

QCon: Can you give me a taste of what you might go into regarding dealing with late arriving data?

Answer:

Shriya: At Netflix, all day long we receive data on what a user played and where (on the homepage) they found that content. It's possible that one of the upstream services sending this play information had a delay, and it is sending data from a play that happened four hours ago. I can't store the information based on the time it arrived. We have to find out when it was actually played. This is an example of late-arriving data. We can solve it either by working out the partitioning scheme of the output data or by maintaining windows in the streaming app.
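As a rough illustration of the partitioning approach, the Python sketch below files each play under the hour in which it happened rather than the hour in which it arrived. The `PlayEvent` fields, the hourly partition key, and the sample timestamps are assumptions made for this example; the interview does not describe the actual schema or partition granularity.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, NamedTuple


class PlayEvent(NamedTuple):
    """Hypothetical play event carrying both timestamps."""
    user_id: str
    title_id: str
    played_at: datetime    # event time: when the play actually happened
    received_at: datetime  # arrival time: when the pipeline saw the event


def hour_partition(ts: datetime) -> str:
    """Build an hourly partition key such as '2017-06-27-10'."""
    return ts.strftime("%Y-%m-%d-%H")


def partition_by_event_time(events: Iterable[PlayEvent]) -> Dict[str, List[PlayEvent]]:
    """Group events by the hour they were played, ignoring arrival time."""
    partitions: Dict[str, List[PlayEvent]] = defaultdict(list)
    for event in events:
        partitions[hour_partition(event.played_at)].append(event)
    return dict(partitions)


def utc(hour: int, minute: int) -> datetime:
    """Helper for made-up sample timestamps on a single day."""
    return datetime(2017, 6, 27, hour, minute, tzinfo=timezone.utc)


# One event arrives promptly; the other arrives four hours after the play.
on_time = PlayEvent("user-1", "title-a", played_at=utc(10, 5), received_at=utc(10, 6))
late = PlayEvent("user-2", "title-b", played_at=utc(10, 20), received_at=utc(14, 20))

partitions = partition_by_event_time([on_time, late])
```

Both events end up in the same 10:00 partition even though the second one arrived in the 14:00 hour. Keying on `received_at` instead would have split them. The windowing alternative mentioned above would instead keep the 10:00 window open inside the streaming app for some allowed lateness.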

Question:

QCon: How do you decide between using Spark or Flink to solve these problems you have at Netflix?

Answer:

Shriya: Well, different teams at Netflix use different streaming technologies, choosing the one that best fits their problem. In Personalization, we care a lot about the feature-richness of the engine. The data we produce today is produced once a day; unlike a lot of completely online systems, we are not sensitive to sub-second SLAs. Streaming data pipelines serve a variety of purposes: some are for pure event routing, where there isn’t a lot of business logic baked into the pipeline, and some are like ours, where the majority of data manipulation is written natively in the pipeline. That plays into our decision about which engine to use.

Question:

QCon: Have you chosen Flink over Spark?

Answer:

Shriya: We are moving forward with a proof of concept that solves one of our problems in Flink. Its success in production will determine future use cases.

Question:

QCon: What is the level of the talk, is it intermediate or advanced?

Answer:

Shriya: It is intermediate, not advanced, because it will not go super deep into the technical details of any one streaming engine. But it's not beginner either, as I am assuming the audience has already started thinking about this problem set. It covers how to design and architect a solution if you were to replace a batch system with a streaming one.

Question:

QCon: What is the persona you are addressing with this talk?

Answer:

Shriya: I'm talking to that person who has a batch system and is trying to do streaming. Since there are so many options out there, I'm trying to help people make an informed decision.

Speaker: Shriya Arora

Senior Data Engineer @Netflix

Shriya is a data engineer at Netflix. She has been working on a framework on top of Spark batch processing that provides a generic way of producing the various datasets required by the machine learning algorithms that enable recommendations on the service. She is now exploring streaming as an alternative to batch ETL for processing these datasets, so that the models serving the recommendations can be trained more frequently in order to improve the personalized experience of Netflix users.

Find Shriya Arora at

Speaker page

@shriyarora

Data Engineering at Netflix

Similar Talks

The Effective Remote Developer

Director of Engineering

David Copeland

Evaluating Machine Learning Models: A Case Study

Data Scientist @Opendoor

Nelson Ray

I Have A NoSQL toaster

Developer Advocate @Couchbase

Matthew Groves

Engineer Innovation Through Rapid Prototyping

Principal Software Engineer @ Vistaprint

Ramon Harrington

Nonconformist Resilience: DB-Backed Job Queues

VP Architecture @Betterment

John Mileham

Managing Millions of Data Services @Heroku

Senior Infrastructure Engineer @Heroku

Gabriel Enslein

Building Microservices @Squarespace

Director of Engineering @ Squarespace

Franklin Angulo

Refactor Frontend APIs & Accounting for Tech Debt

Software Engineer @Indiegogo

Julia Nguyen

Reasoning About Complex Distributed Systems

Software Engineer @Jet, previous CTO

Erich Ess

Tracks

Monday, 26 June

Microservices: Patterns & Practices

Practical experiences and lessons with Microservices.

Java - Propelling the Ecosystem Forward

Lessons from Java 8, prepping for Java 9, and looking ahead at Java 10. Innovators in Java.

High Velocity Dev Teams

Working Smarter as a team. Improving value delivery of engineers. Lean and Agile principles.

Modern Browser-Based Apps

Reactive, cross platform, progressive - webapp tech today.

Innovations in Fintech

Technology, tools and techniques supporting modern financial services.

Tuesday, 27 June

Architectures You've Always Wondered About

Case studies from the most relevant names in software.

Developer Experience: Level up Your Engineering Effectiveness

Trends, tools and projects that we're using to maximally empower your developers.

Chaos & Resilience

Failures, edge cases and how we're embracing them.

Stream Processing at Large

Rapidly moving data at scale.

Building Security Infrastructure

How our industry is being attacked and what you can do about it.

Wednesday, 28 June

Next Gen APIs: Designs, Protocols, and Evolution

Practical deep-dives into public and internal API design, tooling and techniques for evolving them, and binary and graph-based protocols.

Immutable Infrastructures: Orchestration, Serverless, and More

What's next in infrastructure. How cloud functions like Lambda are making their way into production.

Machine Learning 2.0

Machine Learning 2.0, Deep Learning & Deep Learning Datasets.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS.

Optimizing Yourself

Maximizing your impact as an engineer, as a leader, and as a person.

Ask Me Anything (AMA)

Track: Stream Processing at Large

Location: Majestic Complex, 6th fl

Duration: 4:10pm - 5:00pm

Day of week: Tuesday

Level: Intermediate - Advanced

Persona: Data Scientist

Conference for Professional Software Developers

Presentation: Personalizing Netflix with Streaming Datasets
