Presentation: Personalizing Netflix with Streaming Datasets
What You’ll Learn
- Assess whether a streaming solution is fit for your problem.
- Learn how to design and architect a solution for replacing a batch system with a streaming one.
- Discuss how Spark compares to Flink and how to decide which engine is best for your problem.
Abstract
Streaming applications have historically been complex to design and implement because of the significant infrastructure investment. However, recent active developments in various streaming platforms provide an easy transition to stream processing, and enable analytics applications/experiments to consume near real-time data without massive development cycles.
This talk will cover the experiences Netflix’s Personalization Data team had in stream processing unbounded datasets. The datasets consisted of, but were not limited to, the stream of playback events (all of Netflix’s plays worldwide) that are used as feedback for recommendation algorithms. These datasets, when ultimately consumed by the team's machine learning models, directly affect the customer’s personalized experience. As such, the impact is high and tolerance for failure is low.
This talk will provide guidance on how to determine whether a streaming solution is the right fit for your problem. It will compare microbatch versus event-based approaches with Apache Spark and Apache Flink as examples. Finally, the talk will discuss the impact moving to stream processing had on the team's customers, and (most importantly) the challenges faced.
Interview
Shriya: I work on the data engineering team for Personalization, which, among other things, delivers the recommendations made for each user. We are responsible for the data that goes into the training and scoring of the various machine learning models that power the Netflix homepage.
Shriya: Today, the training of our machine learning models happens offline, at most once a day. As the Netflix user base, and with it the data being collected, grows rapidly and researchers innovate with newer models, we are exploring whether we can train these models on a more frequently updated dataset. Going streaming also has technical advantages. As with most cloud solutions, our storage costs are much higher than our compute costs. If, instead of storing large amounts of raw data and waiting for batch processes to pick them up, we process, aggregate and discard events as they come in, we make more efficient use of our cluster resources.
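To make the "process, aggregate and discard" idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not Netflix's pipeline: the event shape `(user_id, title_id, seconds_played)` and the function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (user_id, title_id, seconds_played).
PlayEvent = Tuple[str, str, int]


def aggregate_stream(events: Iterable[PlayEvent]) -> Dict[Tuple[str, str], int]:
    """Fold each event into a running total; the raw event is never stored."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for user_id, title_id, seconds in events:
        # Only the small aggregate survives; the raw record is dropped here.
        totals[(user_id, title_id)] += seconds
    return dict(totals)


def sample_events() -> Iterator[PlayEvent]:
    """Stand-in for an unbounded source; yields events one at a time."""
    yield ("u1", "t1", 30)
    yield ("u2", "t1", 45)
    yield ("u1", "t1", 60)


totals = aggregate_stream(sample_events())
```

The state held in memory grows with the number of distinct keys rather than with the number of raw events, which is the storage saving the answer above describes.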
Shriya: One big thing is the accuracy of the data. Streaming data has an important temporal advantage: it's ready for access sooner. But is it as accurate and as reliable as batch data? Batch systems tend to be very accurate because you have all the data that you will need to process and all your sources have reconciled. Batch systems also deal with data recovery and repair far more easily than streaming systems. These are things you need to tackle in your streaming design.
Shriya: At Netflix, all day long we receive data on what a user played and where (on the homepage) they found that content. It's possible that one of the upstream services that was sending this play information had a delay, and it is sending data from a play that happened four hours ago. I can't store the information based on the time it arrived. We have to find out when it was actually played. This is an example of late-arriving data. We can solve it either through the partitioning scheme of the output data or by maintaining windows in the streaming app.
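The late-arrival case can be sketched as follows. This is a simplified illustration, not the team's implementation: the `Play` fields, the hourly window size, and the six-hour lateness limit are all assumptions, and lateness is measured per event as arrival time minus play time, where a real engine would typically track a watermark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

WINDOW_SECONDS = 3600        # assumed: hourly windows
ALLOWED_LATENESS = 6 * 3600  # assumed: accept plays up to six hours late


@dataclass
class Play:
    user_id: str
    played_at: int   # event time, epoch seconds
    arrived_at: int  # processing time, epoch seconds


def window_start(ts: int) -> int:
    """Start of the window (or output partition) that contains ts."""
    return ts - ts % WINDOW_SECONDS


class EventTimeWindows:
    """Counts plays per window, keyed by when the play happened."""

    def __init__(self) -> None:
        self.counts: Dict[int, int] = defaultdict(int)
        self.too_late: List[Play] = []

    def add(self, play: Play) -> None:
        if play.arrived_at - play.played_at > ALLOWED_LATENESS:
            # Beyond the lateness limit: set aside for separate repair.
            self.too_late.append(play)
            return
        # Bucket by event time, never by arrival time.
        self.counts[window_start(play.played_at)] += 1


windows = EventTimeWindows()
windows.add(Play("u1", played_at=1000, arrived_at=1005))
windows.add(Play("u2", played_at=1200, arrived_at=1200 + 4 * 3600))  # 4h late
windows.add(Play("u3", played_at=100, arrived_at=100 + 7 * 3600))    # 7h late
```

The play that arrives four hours late is still counted in the window where it was played. Using `window_start(play.played_at)` as an output partition key is the partitioning variant of the same idea.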
Shriya: Well, different teams at Netflix use different streaming technologies, choosing the one that best fits their problem. In personalization, we care a lot about the feature-richness of the engine. The data we're producing today is produced once a day; unlike a lot of completely online systems, we are not sensitive to sub-second SLAs. Streaming data pipelines serve a variety of purposes: some are for pure event routing, where there isn’t a lot of business logic baked into the pipeline; others, like ours, have a majority of the data manipulation written natively in the pipeline. That plays into our decision of which engine to use.
Shriya: We are moving forward with a proof of concept that solves one of our problems in Flink; its success in production will determine future use cases.
Shriya: It is intermediate, not advanced, because it would not go super deep into technical details of any one streaming engine. But it's not beginner either as I am assuming the audience has already started thinking about this problem set. It covers how to design and architect a solution, if you were to replace a batch system with a streaming one.
Shriya: I'm talking to that person who has a batch system and is trying to do streaming. Since there are so many options out there, I'm trying to help people make an informed decision.