Presentation: Handling Streaming Data in Spotify Using the Cloud

Duration: 5:25pm - 6:15pm

Key Takeaways

  • Hear how Spotify’s data delivery pipeline delivers events for 100 million monthly active users to a central location, where they are available to all the other developers in the company.
  • Hear practical lessons from Spotify around large-scale, global deployments of Kafka.
  • Learn about Spotify’s multi-datacenter implementation for streaming data, including how they handle failure, replication, and reliability.

Abstract

Spotify’s data is increasing at a rate of 60 billion events per day. The current event delivery system, which is based on Kafka 0.7, is slowly but surely reaching its limitations. To be able to seamlessly scale the event delivery system with Spotify’s growth, we decided to base the new event delivery system on Google Cloud Pub/Sub and Google Cloud Dataflow.

Spotify’s event delivery system is one of the foundational pieces of Spotify’s data infrastructure. Its key requirement is to deliver complete data with predictable latency and make it available to Spotify developers via a well-defined interface. The delivered data is then used to produce Discover Weekly, Spotify Party, Year in Music, and many other Spotify features.

This talk is going to cover the evolution of Spotify’s event delivery system, focusing on the lessons learned from operating it and the reasons behind the decision to move it into the cloud.

This talk will also cover Scio, which we developed to make the Dataflow SDK easier to use. Scio is a high-level Scala API for the Dataflow SDK which brings Dataflow closer to popular data processing frameworks.
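To give a feel for the API, here is a minimal word-count sketch in Scio (illustrative only, not code from the talk; the input and output arguments are placeholders, and the exact API may differ between Scio versions):

```scala
import com.spotify.scio._

// Minimal Scio pipeline sketch: read lines of text, count word
// frequencies, and write the results. "input" and "output" are
// placeholder arguments.
object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)
    sc.textFile(args("input"))
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, count) => s"$word: $count" }
      .saveAsTextFile(args("output"))
    sc.close()
  }
}
```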

Interview

Question: 
What is your role today?
Answer: 
I am a software engineer working on Spotify’s data infrastructure. For the past year, we have been working on moving event delivery from the legacy system to the promised land.
Question: 
What does the promised land's architecture look like?
Answer: 
The legacy architecture is built on top of Kafka 0.7, with a lot of additional services added on top of Kafka. Those services were added to cope with the limitations of the Kafka version we were using at the time.
Kafka 0.7 didn’t have replication or the reliability we needed, and it didn’t have a really good way of working across datacenters.
When we were adding services on top of Kafka, we didn’t really think about the future. In the end it turned out to be a mess, with a lot of operational legacy today.
Today, we are looking at simplifying the architecture. We’re designing all the services that we’re building today with an operational mindset, so that the operational burden is as small as possible.
When we deploy code, we want to know exactly which version is running. If there is a problem, we can pinpoint it immediately, because we have enough monitoring around it. We want to be the ones noticing the problem before we get paged by our users because of a large delay in the data delivery.
I guess that is the promised land for us.
Question: 
Your focus isn’t on code. It is on how you are building this promised land?
Answer: 
We focus on both. The code needs to be well tested before it goes into production. Right now, most of the focus goes into having all the components of the new system under continuous integration and continuous deployment.
Everything should be trackable: we need to know when something was deployed and what version is running, to ensure that everything is correct.
Question: 
Are you going to talk about this data delivery pipeline?
Answer: 
Half of the talk is going to be dedicated to it: where our event delivery system was, and where we want it to be. I am doing a joint talk with Neville. Neville is going to cover what happens to the data once it’s delivered, and the tooling his team is building to crunch the delivered data. They are basing their tooling on Google Cloud Dataflow.
Question: 
How many events are going through your event delivery system?
Answer: 
It’s 1 million events per second globally at peak time. Right now we have four datacenters, but if we also include Google Cloud as a new datacenter, then we can say five.
Question: 
Regarding datacenters, can you talk about failure across these datacenters and how you deal with replication?
Answer: 
In the current system there is no state replication between our datacenters. All the events are delivered to our Hadoop cluster, which is located in a single datacenter. To cope with network downtime between the Hadoop datacenter and all our other datacenters, we buffer all events in syslog files on service machines before sending them through our event delivery system. In the new event delivery system we’re introducing replicated state between all our service machines and Hadoop. To keep this replicated state we’re going to use Cloud Pub/Sub.
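For a sense of what the Pub/Sub side involves, here is a minimal sketch of publishing a single event with the Google Cloud Pub/Sub Java client, written in Scala. This is not Spotify’s actual producer code; the project name, topic name, and event payload are made up for illustration.

```scala
import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    // Hypothetical project and topic names, for illustration only.
    val topic = TopicName.of("my-gcp-project", "events")
    val publisher = Publisher.newBuilder(topic).build()
    try {
      val message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("""{"event":"track-played"}"""))
        .build()
      // publish() is asynchronous; block on the future here for simplicity.
      val messageId = publisher.publish(message).get()
      println(s"Published message $messageId")
    } finally {
      publisher.shutdown()
    }
  }
}
```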
Question: 
How would you rate this talk: Beginner, Intermediate, or Advanced?
Answer: 
It's intermediate. I try to make it accessible, even to people who don't know all the technologies involved. It helps if you have some experience with the problem domain, though.
Question: 
What do you feel is the most disruptive tech in IT right now?
Answer: 
I like containers, and I think they are a game changer. At Spotify, we started moving toward containers in the last two years, and I can see so many benefits in the whole continuous integration and deployment model that we have now. It has allowed us to iterate a lot faster.

Speaker: Neville Li

Software Engineer @Spotify

Neville is a software engineer at Spotify who works mainly on data infrastructure for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and old school distributed systems like MPI.

Speaker: Igor Maravić

Software Engineer @Spotify

As part of the band, he worked on developing and maintaining Spotify's gateways, migrating mobile clients from a custom TLV protocol to HTTP, designing and developing continuous delivery infrastructure, stress testing services, and more. Currently he's living and breathing event delivery.
