What You’ll Learn

Understand what an organization needs to have in place before migrating to ephemeral, immutable infrastructure.
Learn how to identify the parts of your infrastructure that cause issues for developers and how to reason about them.
Develop ideas for managing and iteratively changing (through tooling) the parts of your system that cause the most issues.

Abstract

Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. At Spotify, a team of 6 engineers maintains the machine provisioning and capacity fleet for all 150+ Spotify teams. This talk tells the story of how Spotify’s infrastructure evolved from teams owning and doting on groups of long-running servers to a distinct separation of business code and value from the underlying ephemeral machines that all of Spotify's services actually run on. We'll examine how this evolution also changed the way Spotify developers write code and how it vastly increased iteration and shipping speed. This talk will also cover a potential endpoint of improving the provisioning and capacity experience for developers: a world where service developers don't need to handle or concern themselves with any of the infrastructure at all. We'll discuss why Spotify wants to move toward this state and how we're getting there.

Interview

Question:

QCon: What is the focus of your work today?

Answer:

James: My focus today is on Phoenix, a service my team is building to orchestrate and carry out a full rolling reset of all machines in the GCP portion of our fleet. Many of our machines need a restart to pick up the latest security updates, so Phoenix ensures that all GCP machines in the fleet have the latest secure packages. Phoenix also enforces the concept of ephemeral infrastructure and pushes developers to make their services stateless and robust if they aren't already. For any service a developer creates, temporarily removing an instance of that service (via the rolling resets) should not affect the overall service's performance. Phoenix helps enforce this desired behavior.
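As an illustration of the rolling reset idea described above, here is a minimal sketch in Python. It is not Phoenix's implementation, which the interview does not detail. The function name `rolling_reset`, the `max_unavailable` parameter, and the `reset` and `is_healthy` callbacks are all hypothetical names chosen for this example. The sketch resets machines in small batches and halts as soon as a machine fails its health check, so a bad reset can take at most one batch out of service.

```python
from typing import Callable, List, Tuple


def rolling_reset(
    machines: List[str],
    reset: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_unavailable: int = 1,
) -> Tuple[List[str], List[str]]:
    """Reset machines in batches of at most `max_unavailable`.

    Returns (recovered, failed). The rollout halts after the first batch
    that contains an unhealthy machine; later machines are left untouched.
    All names here are hypothetical and only illustrate the technique.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")

    recovered: List[str] = []
    for start in range(0, len(machines), max_unavailable):
        batch = machines[start:start + max_unavailable]

        # Take the whole batch out of service and reset it.
        for machine in batch:
            reset(machine)

        # Check every machine in the batch before moving on.
        failed = [m for m in batch if not is_healthy(m)]
        recovered.extend(m for m in batch if m not in failed)
        if failed:
            # Stop here so the rest of the fleet keeps serving traffic.
            return recovered, failed

    return recovered, []
```

A stateless service with enough instances should keep serving while any single batch is down, which is the property the rolling resets are meant to exercise.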

Question:

QCon: What’s the motivation for your talk?

Answer:

James: It amazes me that Spotify has struck such a good balance: it embraces the Ops-in-Squads model while limiting how much infra/ops context the squads actually need. In the Ops-in-Squads model at Spotify, individual teams take on all the operational and on-call responsibilities for their services.

The impetus for this was that a single dedicated ops team, or even several, did not scale well with the tens of teams and hundreds of services at Spotify. It can be difficult for even the best ops engineers to handle incidents and operations for hundreds of services that they don't have much context on. This basic premise seems to imply that feature teams would need to take on and remember a huge amount of operational context and knowledge. However, with tooling from the Developer Platform Alliance, and more specifically from the Infrastructure and Operations (IO) tribe, feature teams are able to maintain their services without requiring too much additional context or time.

I want to share how Spotify maintains this balance and which infrastructure/ops concerns IO has removed from feature developers' responsibilities.

Question:

QCon: What do you feel is the most important thing/practice/tech/technique for a developer/leader in your space to be focused on today?

Answer:

James: My initial instinct is to say to embrace ephemeral, immutable infrastructure. However, what really matters is finding out which development and operational pain points developers are going through (ideally through quantitative means based on operational data) and writing tooling that fixes or alleviates those pain points. Ephemeral, immutable infrastructure may not be for everyone. An early-stage startup with 5 engineers probably doesn't need one engineer or other resources dedicated to ensuring that their 2 servers use immutable deployments and can potentially auto-scale up to 100 servers. But then again, if it's observable that stateful deployments or servers are indeed a huge operational pain point, it might be the answer.
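One way to make the "quantitative means" above concrete is to total the time lost per category of incident and rank the categories. The sketch below is only an illustration under stated assumptions: the function `rank_pain_points`, the record shape (category, minutes lost), and any category names passed to it are hypothetical and do not come from Spotify's tooling or data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_pain_points(
    incidents: Iterable[Tuple[str, float]],
) -> List[Tuple[str, float]]:
    """Sum minutes lost per category and sort categories, largest first.

    `incidents` is a hypothetical stream of (category, minutes_lost)
    records, for example exported from an incident tracker.
    """
    totals: Counter = Counter()
    for category, minutes_lost in incidents:
        totals[category] += minutes_lost
    # most_common() returns (category, total) pairs in descending order.
    return totals.most_common()
```

The category at the top of the ranking is a candidate for the next piece of tooling; whether immutable infrastructure is the right fix depends on what that category turns out to be.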

Speaker: James Wen

Site Reliability Engineer @Spotify

James Wen is currently a Site Reliability Engineer at Spotify. He's on the ALF squad at Spotify, maintaining and developing the tooling for capacity management + provisioning and internal DNS for 150+ teams at Spotify. He was formerly the Team Lead (Anchor) of the Cloud Foundry Buildpacks team at Pivotal and a core contributor and maintainer of Bundler. He graduated with a B.A. in Computer Science from Columbia University and is currently working toward his Master's in Computer Science with a specialization in Machine Learning from Georgia Tech via the OMSCS program. He is an avid proponent of technical domains like open source, reliability, continuous integration, collective ownership, highly accessible context/knowledge, automation, and clean, maintainable code. He absolutely loves to climb, whether on real rock or plastic, or bouldering or lead.

Similar Talks

The Effective Remote Developer

Director of Engineering

David Copeland

Evaluating Machine Learning Models: A Case Study

Data Scientist @Opendoor

Nelson Ray

Mixing in React

Software Engineer @Agrilyst

Rushaine McBean

Multi-host, Multi-network Persistent Containers

CTO and Co-Founder @Aerospike

Brian Bulkowski

I Have A NoSQL toaster

Developer Advocate @Couchbase

Matthew Groves

Engineer Innovation Through Rapid Prototyping

Principal Software Engineer @Vistaprint

Ramon Harrington

The Java Evolution of Eclipse Collections

Technology Associate @GoldmanSachs

Kristen O'Leary

Nonconformist Resilience: DB-Backed Job Queues

VP Architecture @Betterment

John Mileham

Managing Millions of Data Services @Heroku

Senior Infrastructure Engineer @Heroku

Gabriel Enslein

Tracks

Monday, 26 June

Microservices: Patterns & Practices

Practical experiences and lessons with Microservices.

Java - Propelling the Ecosystem Forward

Lessons from Java 8, prepping for Java 9, and looking ahead at Java 10. Innovators in Java.

High Velocity Dev Teams

Working Smarter as a team. Improving value delivery of engineers. Lean and Agile principles.

Modern Browser-Based Apps

Reactive, cross platform, progressive - webapp tech today.

Innovations in Fintech

Technology, tools and techniques supporting modern financial services.

Tuesday, 27 June

Architectures You've Always Wondered About

Case studies from the most relevant names in software.

Developer Experience: Level up Your Engineering Effectiveness

Trends, tools and projects that we're using to maximally empower your developers.

Chaos & Resilience

Failures, edge cases and how we're embracing them.

Stream Processing at Large

Rapidly moving data at scale.

Building Security Infrastructure

How our industry is being attacked and what you can do about it.

Wednesday, 28 June

Next Gen APIs: Designs, Protocols, and Evolution

Practical deep-dives into public and internal API design, tooling and techniques for evolving them, and binary and graph-based protocols.

Immutable Infrastructures: Orchestration, Serverless, and More

What's next in infrastructure. How cloud functions like Lambda are making their way into production.

Machine Learning 2.0

Machine Learning 2.0, Deep Learning & Deep Learning Datasets.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS.

Optimizing Yourself

Maximizing your impact as an engineer, as a leader, and as a person.

Ask Me Anything (AMA)

This Year's Schedule

Track: Developer Experience: Level up Your Engineering Effectiveness

Location: Broadway Ballroom South Center, 6th fl.

Duration: 2:55pm - 3:45pm

Day of week: Tuesday

Level: Intermediate

Persona: Architect, DevOps Engineer

Conference for Professional Software Developers

Presentation: Spotify Lessons: Learning to Let Go of Machines
