Presentation: Spotify Lessons: Learning to Let Go of Machines

Track: Developer Experience: Level up Your Engineering Effectiveness

Location: Broadway Ballroom South Center, 6th fl.

Duration: 2:55pm - 3:45pm

Day of week: Tuesday

Level: Intermediate

Persona: Architect, DevOps Engineer

What You’ll Learn

  • Understand what should be present for an organization to migrate to an ephemeral/immutable infrastructure.
  • Learn how to identify the parts of your infrastructure that cause issues for developers and how to reason about them.
  • Develop ideas for managing and iteratively changing (through tooling) the parts of your system that cause the most issues.

Abstract

Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. At Spotify, a team of 6 engineers maintains the machine provisioning and capacity fleet for all 150+ Spotify teams. This talk tells the story of how Spotify’s infrastructure evolved from teams owning and doting on groups of long-running servers to a distinct separation between business code and value on one side and, on the other, the underlying ephemeral machines that all of Spotify's services actually run on. We'll look at how this evolution also changed the way Spotify developers write code, and at the vast increase in iteration and shipping speed that came with it. The talk will also cover a potential endpoint of improving the provisioning and capacity experience for developers: a world where service developers don't need to handle or concern themselves with any of the infrastructure at all. We'll discuss why Spotify wants to move toward this state and how we're getting there.

Interview

Question: 
QCon: What is the focus of your work today?
Answer: 

James: My work today focuses on Phoenix, a service my team is creating that can orchestrate and carry out a full rolling reset of all machines in the GCP portion of our fleet. Many of our machines need a restart to pick up the latest security updates, so Phoenix ensures that all GCP machines in the fleet have the latest secure packages. Phoenix also enforces the concept of ephemeral infrastructure and forces devs to make their services stateless and robust if they aren't already. For any service a developer creates, temporarily removing an instance of that service (via the rolling resets) should not affect the overall service's performance. Phoenix helps enforce this desired behavior.
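
To illustrate the general idea of a rolling reset, here is a minimal Python sketch. It is not Phoenix's implementation, which the interview does not describe in detail. The function name `rolling_reset`, the `reset` and `is_healthy` callbacks, and the `batch_size` parameter are all hypothetical names chosen for this example. The sketch assumes that instances are reset a few at a time and that the rollout halts as soon as a reset instance fails its health check, so the rest of the fleet keeps serving traffic.

```python
from typing import Callable, Iterable, List


def rolling_reset(
    instances: Iterable[str],
    reset: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Reset instances in small batches (hypothetical sketch).

    Returns the names of the instances that were reset and passed
    their health check. Raises RuntimeError, halting the rollout,
    as soon as a reset instance is found to be unhealthy.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = list(instances)
    done: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Take only this batch out of service; the rest keep running.
        for name in batch:
            reset(name)
        # Verify the batch before touching any further instances.
        for name in batch:
            if not is_healthy(name):
                raise RuntimeError(f"{name} unhealthy after reset; halting rollout")
            done.append(name)
    return done


if __name__ == "__main__":
    # Stand-in callbacks: record each reset and report every instance healthy.
    fleet = ["vm-a", "vm-b", "vm-c", "vm-d", "vm-e"]
    reset_log: List[str] = []
    finished = rolling_reset(fleet, reset_log.append, lambda name: True, batch_size=2)
    print(finished)
```

A small `batch_size` relative to the fleet is what makes the reset "rolling": a stateless service with several instances should tolerate losing one batch at a time without a visible drop in performance.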

Question: 
What’s the motivation for your talk?
Answer: 

James: It amazes me that Spotify has achieved such a good balance: it embraces the Ops-in-Squads model while limiting how much context and infra/ops knowledge the squads actually need. The Ops-in-Squads model at Spotify involves individual teams taking on all the operational and on-call responsibilities for their services.

The impetus for this was that a single dedicated ops team, or even several, was not scaling well with the tens of teams and hundreds of services at Spotify. It can be difficult for even the best ops engineers to handle incidents and ops for hundreds of services that they don't have much context on. This basic premise seems to imply that feature teams would need to take on and remember a huge amount of operational context and knowledge. However, with tooling from the Developer Platform Alliance, and more specifically from the Infrastructure and Operations (IO) tribe, feature teams are able to maintain their services without requiring too much additional context or time.

I want to share how Spotify maintains this balance and what infrastructure/ops concerns IO has removed from feature developers' responsibilities.

Question: 
What do you feel is the most important thing/practice/tech/technique for a developer/leader in your space to be focused on today?
Answer: 

James: My initial instinct is to say: embrace ephemeral, immutable infrastructure. However, what really matters is finding out which development and operational pain points developers are going through (ideally through quantitative means based on operational data) and writing tooling that fixes or alleviates those pain points. Ephemeral, immutable infrastructure may not be for everyone. An early-stage startup with 5 engineers probably doesn't need to dedicate one engineer or other resources to ensuring that their 2 servers use immutable deployments and can potentially auto-scale up to 100 servers. But then again, if it's observable that stateful deployments or servers are indeed a huge operational pain point, it might be the answer.

Speaker: James Wen

Site Reliability Engineer @Spotify

James Wen is currently a Site Reliability Engineer at Spotify. He's on the ALF squad, maintaining and developing the tooling for capacity management and provisioning and for internal DNS for 150+ teams at Spotify. He was formerly the Team Lead (Anchor) of the Cloud Foundry Buildpacks team at Pivotal and a core contributor and maintainer of Bundler. He graduated with a B.A. in Computer Science from Columbia University and is currently working toward his Master's in Computer Science with a specialization in Machine Learning from Georgia Tech via the OMSCS program. He is an avid proponent of open source, reliability, continuous integration, collective ownership, highly accessible context/knowledge, automation, and clean, maintainable code. He absolutely loves to climb, whether on real rock or plastic, bouldering or lead.
