Presentation: Spotify Lessons: Learning to Let Go of Machines
What You’ll Learn
- Understand what should be present for an organization to migrate to an ephemeral/immutable infrastructure.
- Learn how to identify the parts of your infrastructure that cause issues for developers and how to reason about them.
- Develop ideas for managing and iteratively changing (through tooling) the parts of your system that cause the most issues.
Abstract
Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. At Spotify, a team of 6 engineers maintains the machine provisioning and capacity fleet for all 150+ Spotify teams. This talk tells the story of how Spotify’s infrastructure evolved from teams owning and doting on groups of long-running servers to a distinct separation of business code and value from the underlying ephemeral machines that all of Spotify's services actually run on. We'll examine how this evolution also changed the way Spotify developers write code and vastly increased iteration and shipping speed. This talk will also cover a potential endpoint of improving the provisioning and capacity experience for developers: a world where service developers don't need to handle or concern themselves with any of the infrastructure at all. We'll discuss why Spotify wants to move toward this state and how we're getting there.
Interview
James: My work today focuses on Phoenix. Phoenix is a service my team is creating that can orchestrate and carry out a full rolling reset of all machines in the GCP portion of our fleet. Many of our machines need a restart to pick up the latest security updates, so Phoenix ensures that all GCP machines in the fleet have the latest secure packages. Phoenix also enforces the concept of ephemeral infrastructure and forces devs to make their services stateless and robust if they aren't already. For any service that a developer creates, temporarily removing an instance of that service (via the rolling resets) should not affect the overall service's performance. Phoenix helps enforce this desired behavior.
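To make the rolling-reset idea above concrete, here is a minimal Python sketch. It is not Phoenix's implementation, which the interview does not describe. The function name `rolling_reset`, the `reset` and `is_healthy` callbacks, and the `max_unavailable` parameter are all hypothetical names chosen for illustration. The sketch assumes the fleet is reset in small batches and that the rollout halts as soon as a freshly reset machine fails its health check, so only a bounded number of instances is ever out of service.

```python
from typing import Callable, List


def rolling_reset(
    machines: List[str],
    reset: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_unavailable: int = 1,
) -> List[str]:
    """Reset machines batch by batch and return the ones that were reset.

    Hypothetical sketch: at most `max_unavailable` machines are reset per
    batch, and the rollout stops if any machine in a batch is unhealthy
    after its reset.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")

    completed: List[str] = []
    for start in range(0, len(machines), max_unavailable):
        batch = machines[start:start + max_unavailable]

        # Take the whole batch out of service and reset it.
        for machine in batch:
            reset(machine)

        # Verify the batch before touching any further machines.
        unhealthy = [m for m in batch if not is_healthy(m)]
        if unhealthy:
            raise RuntimeError(f"halting rollout, unhealthy after reset: {unhealthy}")

        completed.extend(batch)
    return completed


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(5)]  # made-up host names
    reset_log: List[str] = []
    rolling_reset(fleet, reset=reset_log.append, is_healthy=lambda m: True, max_unavailable=2)
    print(reset_log)
```

Keeping `max_unavailable` small relative to the number of instances behind a service is what lets a stateless service absorb the reset without a visible drop in performance; a service that cannot tolerate this is the kind of problem such resets would surface.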
James: It amazes me that Spotify has struck such a good balance: it embraces the Ops-in-Squads model while limiting how much infra/ops context and knowledge the squads actually need. In the Ops-in-Squads model at Spotify, individual teams take on all the operational and on-call responsibilities for their services.
The impetus for this was that a single dedicated ops team, or even several, did not scale well with the tens of teams and hundreds of services at Spotify. It can be difficult for even the best Ops engineers to handle incidents and ops for hundreds of services that they don't have much context on. This basic premise seems to imply that feature teams would need to take on and remember a huge amount of operational context and knowledge. However, with tooling from the Developer Platform Alliance, and more specifically from the Infrastructure and Operations (IO) tribe, feature teams are able to maintain their services without requiring too much additional context or time.
I want to share how Spotify maintains this balance and which infrastructure/ops concerns IO has removed from feature developers' responsibilities.
James: My initial instinct is to say: embrace ephemeral, immutable infrastructure. However, what really matters is finding out which development and operational pain points developers are going through (ideally through quantitative means based on operational data) and writing tooling that fixes or alleviates those pain points. Ephemeral, immutable infrastructure may not be for everyone. An early-stage startup with 5 engineers probably doesn't need one engineer or other resources dedicated to ensuring that their 2 servers use immutable deployments and can potentially auto-scale up to 100 servers. But then again, if it's observable that stateful deployments or servers are indeed a huge operational pain point, it might be the answer.