Presentation: The Human Side of Microservices



5:25pm - 6:15pm




Key Takeaways

  • Hear the commonly untold story of how you move a company culture to embrace Microservices.
  • Learn how Yelp educates teams on distributed systems during their evolution to Microservices.
  • Hear stories from the tech lead on how experimentation helped Yelp explore decomposition of the monolith and assess its impact before rolling it out across development teams.


At Yelp we value our ability to quickly ship code. One key factor in scaling our engineering process to over three hundred engineers and several million lines of Python has been our move to a microservices architecture; over the course of the past four years we've gone from zero to over one hundred production microservices. During this process we've had to solve many difficult technical issues, but some of the most interesting challenges have involved the human side of engineering.

In this presentation, John will start by discussing how to win over those in your organization who are skeptical about the benefits of microservices. The talk will also include tips on educating developers on the aspects of distributed systems that are inherent to a microservices architecture (caching, dealing with failures, performing backwards-compatible interface changes, etc.). John will then go into the potential problems with a centralized operations model in a microservices architecture, and how to smoothly transition to a world where developers share responsibility for site operations.
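As a rough illustration of one topic the abstract lists, a backwards-compatible interface change, here is a minimal Python sketch of a client that tolerates both older and newer versions of a JSON response. The function name `parse_business` and the fields `id`, `name`, `rating`, and `price` are hypothetical and not taken from Yelp's services.

```python
import json


def parse_business(payload: str) -> dict:
    """Parse a JSON response, tolerating missing optional and unknown extra fields.

    All field names here are hypothetical, chosen only for illustration.
    """
    data = json.loads(payload)
    return {
        "id": data["id"],      # assumed to be required in every version
        "name": data["name"],  # assumed to be required in every version
        # Optional field added later: default to None when an older server omits it.
        "rating": data.get("rating"),
        # Unknown fields (such as "price" below) are ignored rather than rejected.
    }


# An older server that does not know about "rating" yet.
old_response = '{"id": 1, "name": "Pizza Place"}'
# A newer server that added "rating" and "price".
new_response = '{"id": 1, "name": "Pizza Place", "rating": 4.5, "price": "$$"}'

old_parsed = parse_business(old_response)
new_parsed = parse_business(new_response)
```

Because the client neither requires the new field nor rejects fields it does not recognize, the server and its clients can, under these assumptions, be deployed in either order.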


You are a tech lead at Yelp. Tell me a bit about your role there and what you work on.
I have been at Yelp just over five years. When I joined, I was on the search team, and there were probably 70 or 80 developers in total at that time. I was working on the search infrastructure part, making sure the search engine stayed up. This is the thing that you hit when you type “pizza in New York.” It returns the search results. A couple of years later, we started down this services road in order to scale our development, and I spearheaded that part, working with a small team to craft the roadmap going forward. Now I am technical lead on our core infrastructure team, supporting all of those back-end services teams, finding what the common problems are, and developing solutions to solve them.
What does a service look like at Yelp?
Most of our services are written in Python. A minority are written in Java for high-performance areas like search or ads. We have a common Python service stack. By default, if you need a database, it’s MySQL, although we also support Cassandra and ElasticSearch. We have common tooling for metrics, logging, and identifying performance problems. We have services that do security that you can interface with. We use Swagger for defining our interfaces to services, so think HTTP REST. We use a lot of JSON. Those are the core technologies.
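To make "Swagger for defining our interfaces" concrete, here is a minimal sketch of a Swagger 2.0 document built as a Python dictionary and serialized to JSON. The service title, the `/hello` path, and the response schema are hypothetical placeholders, not one of Yelp's actual interfaces.

```python
import json


def build_spec() -> dict:
    """Return a minimal Swagger 2.0 description of one hypothetical endpoint."""
    return {
        "swagger": "2.0",
        "info": {"title": "Hello Service (hypothetical)", "version": "1.0.0"},
        "produces": ["application/json"],
        "paths": {
            "/hello": {
                "get": {
                    "summary": "Return a greeting",
                    "responses": {
                        "200": {
                            "description": "A greeting message",
                            "schema": {
                                "type": "object",
                                "required": ["message"],
                                "properties": {"message": {"type": "string"}},
                            },
                        }
                    },
                }
            }
        },
    }


spec = build_spec()
# The serialized form is what a client generator or validator would typically read.
spec_json = json.dumps(spec, indent=2)
```

Keeping the interface in a machine-readable document like this is what allows tooling to generate clients or validate requests and responses against the declared schema.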
Is it all RESTful or are you using other protocols between your services?
I think we are all RESTful. There may be a few dark corners of the world where we have Thrift but I haven’t come across them recently. So yeah, very much HTTP, REST, and JSON. Those are our common technologies.
There seems to be a big trend that I am hearing from a lot of people: a lot of Thrift, a lot of gRPC, a lot of interesting stories about moving away from REST, which is curious.
REST has worked well for us. It does have some performance issues when you start throwing around multi-MB JSON payloads. But it’s very easy for people to get started on. It’s easy to debug and we are happy with the direction it has taken us.
What is the scale? Are we talking tens of services, hundreds of services, thousands of services?
At my last count, we were somewhere above 100 production services. We have certainly crossed the scaling threshold beyond which you can no longer keep track of exactly what everybody is doing in each service. And we continue to grow; we are adding services every week or month.
Can you talk a bit about the genesis of your presentation?
I was there when we decided to try out this microservice thing, or service-oriented architecture, as we ended up calling it. I watched it from the very beginning. We tried pulling out a bit of code from our two- or three-million-line Python monolith, just to see whether we could do this. It was just a very small experiment, and we did this because we were seeing these initial scaling problems in our monolith.
You can’t have 300+ developers, all contributing to this one codebase, without encountering quite a lot of friction. So we did this initial experiment. Then we went about generalizing the lessons from this one service and started to build an increasing number of services. And I saw certain friction points, because the other side of microservices is that you end up having to distribute a lot of responsibilities across the organization. The operations team, the group that ran the monolith, no longer has enough context to actually run 100 production services.
There is no way they can understand what all those different services are doing and how they relate. As a result, almost inevitably you end up splitting out those responsibilities across development teams. But along the way, you encounter certain friction points, and it’s a very natural reaction to say, “Oh, we are going to recentralize some of these responsibilities,” and we had those discussions. But we found that the long-term way to scale your organization is to empower developers to learn the skills they need to have that distributed role in the organization. There are a bunch of different areas where we have worked to help our developers with those new roles. There is a lot of education that we have consciously done.
Most developers do not have a lot of experience with distributed systems, and we tried quite hard to develop a set of training materials to help developers get up to speed on that. Likewise, most developers do not have that much operations experience, so we have had to develop quite a lot of tooling so that our developers can just dive into performance problems or figure out how to monitor their services and how to respond. It’s quite a steep learning curve for the organization, and I feel what we have learned is very applicable to other organizations going through that transition.
What does the training consist of?
We have a few things. We have a tutorial which walks a developer through setting up a Hello World service. This is the basic template you start out with, and here is how you talk to a database. Here is how you monitor. Here is what health checks should look like. Here is how you define your interface. It is like a programming course. That is one building block.
Another thing that we do at Yelp is organize tech lead summits and unconferences. We get a bunch of people who are more senior in their teams together in one place, and we have a bunch of different talks which span quite a few of these topics, such as distributed systems and organizational issues. We also have unconferences where we get people together and we ask what problems they are seeing. And right then and there we split into groups and have those discussions. We also have training videos that we put together to focus on a particular area, like ElasticSearch.
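For readers who want a feel for what such a starting template might contain, here is a minimal sketch of a Hello World service with a health check, written against Python's standard-library WSGI server. It is not Yelp's tutorial or service stack; the `/hello` and `/status` paths, the port, and the response bodies are assumptions made for illustration.

```python
import json
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """A tiny WSGI application: one greeting endpoint and one health check."""
    path = environ.get("PATH_INFO", "/")
    if path == "/status":
        # Health check: a monitoring system can poll this to see if the service is up.
        status, body = "200 OK", {"status": "ok"}
    elif path == "/hello":
        status, body = "200 OK", {"message": "Hello, World!"}
    else:
        status, body = "404 Not Found", {"error": "not found"}
    payload = json.dumps(body).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ]
    start_response(status, headers)
    return [payload]


def serve(host="127.0.0.1", port=8000):
    """Run the service until interrupted (host and port are arbitrary choices)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the service locally; requesting `/status` then returns a small JSON document that a load balancer or monitoring check could use to decide whether the instance is healthy.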
We will get the person responsible for running the ElasticSearch platform, and we will sit down and record a Q&A session with them in front of a whiteboard, and other developers can learn from that.
What are some of the takeaways from your talk?
It depends on where you are in your organization’s move to microservices. If you are just starting, then hopefully I can present how to run an initial experiment, what people might respond to, what evidence you need to gather. If you are a little bit further down the road, I can present the pain points that we had and the discussions that we had that you probably wouldn’t normally see because quite often people tend to present that everything was rosy.
I am going to talk about some of those hard discussions that we had and how we decided to educate people as much as possible and trust them instead of going, “Oh, no. This is too scary. Let’s go back.” Hopefully, those sorts of insights will be very handy when people come across particular discussions or particular post-mortems, those sorts of things. They will say, “Aha! Yelp also had those problems. Here is what they did in response.”

Speaker: John Billings

Tech Lead @Yelp

John is a Technical Lead for Infrastructure at Yelp, where he's been working for the past five years. He loves building scalable backend systems. Prior to this, he received his PhD from the University of Cambridge by building compilers for Internet routing protocols.




Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June