Presentation: Scaling Uber to 1,000 Services

Duration: 11:50am - 12:40pm

Key Takeaways

  • Thoughts from someone leading engineering in a high-growth environment on what they might have done differently on the way to 1,000+ production services.
  • Hear a senior software leader “acknowledge the pain” of growing the architecture at an explosive-growth company, and how you might benefit from their pain points.
  • Learn from the successes and failures of one of the fastest growing software companies in the world.

Abstract

To keep up with Uber's growth, we've embraced microservices in a big way. This has led to an explosion of new services, crossing over 1,000 production services in early March 2016. Along the way, we've learned a lot. If we had to do it all over again, we'd do some things differently. If you are earlier along on your personal microservices journey than we are, then this talk may save you from having to learn some things the hard way.

Note: This is a portion of a full podcast with Matt Ranney. You can read the full show notes and listen to the podcast on InfoQ.com. You can also subscribe to all future podcasts by following our RSS feeds on InfoQ, SoundCloud, or iTunes.

Interview

Question: 
What’s the motivation for this talk?
Answer: 
I have been to a number of QCons, and I have been to a lot of other conferences as well, both with Uber and earlier in my career. I started to reflect on this: if I was going to pick a conference to attend, what would I want to get out of it? And I started to notice a bit of a mismatch in what I was getting from a number of the Architectures You've Always Wondered About tracks (which are all super interesting and super great): a lot of them left me feeling inadequate or bad.
Somehow other people had it all figured out but not me. And as I asked around, I realized that a lot of people feel the same way: “Wow, Google sure has it all figured out. I guess I will try and learn what they are good at, but, man, it’s too bad my stuff all sucks.” I want to question that, because as I have talked to a lot of people at these places, I have realized that the people who volunteer to give talks always talk about some awesome, new, shiny thing that they finally figured out, while the rest of their infrastructure is in various shades of legacy shambles, right?
What I wanted to reflect is the struggle. This is hard; let’s acknowledge that it is really hard. But there are ways to save yourself some pain, and I think we can make it less painful by sharing some learnings from people who have suffered so you don’t have to. That’s where I am coming from this time.
Question: 
I love the phrase “acknowledge the struggle and save yourself some pain.” That’s good. So can you give me an idea of the types of things you will be going into?
Answer: 
Sure. You know, of course, it’s a technology conference, so I am going to say the word microservices 40 or 50 times. It is just inevitable. That’s an example of one of those things. Before we had a lot of services, I was like, “Oh, all the cool kids have lots of services. That’s definitely something that you should go get.” And we did. That was good, but, like everything in engineering, it’s a tradeoff, and a lot of the tradeoffs were not obvious.
I will give you an example. Adopting microservices allows you to write your software in different programming languages. You could have some stuff written in Node.js, some stuff written in Python, some stuff written in Go, and some stuff written in Java. That is a very specific example of our exact infrastructure; we have precisely that situation. Oh, and some sprinklings of Scala and others. I think someone did an Elixir thing, I don’t know for sure. Anyway, you can do this, right? That is the benefit of putting things into microservices: people can own their release cycles and their own alerting and be responsible for their own uptime, and this is really cool.
But what’s the downside? At what cost? It was great that we had this framework to allow people to do things differently, but because they did things differently, the aggregate velocity in many cases was a lot slower, because now the Java people had to figure out how to talk to the metrics systems, and so do the Go people, and so do the Node people, and sometimes they do it differently. Then some hard-fought bug that gets fixed on one platform has a similar shape on another platform and has to be similarly battled there. I hadn’t expected the cost of multiple languages to be as high as it was.
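As a concrete illustration of that per-language metrics glue, here is a minimal sketch in Go (not Uber's actual client; the metrics daemon address and metric name are hypothetical) of what one stack has to implement just to emit a single StatsD-style counter. Every other language in the mix needs its own equivalent, each with its own chances to diverge and grow its own bugs.

    // Minimal sketch of per-language metrics glue: emit one StatsD-style
    // counter ("name:value|c") over UDP. Not Uber's actual tooling.
    package main

    import (
        "fmt"
        "net"
    )

    // emitCounter sends a single counter increment to a StatsD-compatible daemon.
    func emitCounter(addr, name string, value int) error {
        conn, err := net.Dial("udp", addr)
        if err != nil {
            return err
        }
        defer conn.Close()

        _, err = fmt.Fprintf(conn, "%s:%d|c", name, value)
        return err
    }

    func main() {
        // Hypothetical daemon address and metric name; metrics emission is
        // typically fire-and-forget, so a failure is logged rather than fatal.
        if err := emitCounter("127.0.0.1:8125", "trips.requested", 1); err != nil {
            fmt.Println("metrics emit failed:", err)
        }
    }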
Question: 
I remember in San Francisco you mentioned something like ten times growth in engineering headcount in the year and a half that you have been there. How do you manage that kind of environment with that type of growth? How do you manage the culture? How do you manage the velocity you mentioned before? How do you manage just not stepping on each other?
Answer: 
I have got to be honest, we don’t have it all figured out. We are lucky that we have a very successful product, everyone is very enthusiastic, and we are able to hire really, really good engineers. It is something that we are still actively trying to get better at. Luckily, because of all those things that we have going for us, even when we stumble around and make mistakes while trying to figure this stuff out, we are still able to keep making progress and making the system work. But stacking that many engineers up on the same sets of problems that quickly just would not be possible if we weren’t able to hire really good engineers.
It would be much more efficient if we did it more slowly, way more efficient. But the competition, especially if you consider the competitive landscape globally, is fierce, and we are working very, very hard to stay ahead, and that requires, you know, pushing hard up against a lot of conventional limits of what you might think is a reasonable growth rate. So it’s not efficient, it’s super not efficient, but we’re getting better. We are making it more efficient all the time, and, you know, we’re definitely learning as we go.
One of the interesting things about the cultural side of the story, an interesting evolution we have gone through because we were adding people so quickly, is that it would not have been possible if we weren’t building things as lots of tiny services. I don’t think there was any way we could have done it if we had to have some organized, top-down architecture that we could point to and say, “Isn’t it wonderful and elegant?”
We had to make this loose collection of services that, in aggregate, made Uber work. It was the only way to allow independent progress from lots of teams of varying levels of experience. That culture, for better or for worse, is not always cohesive. You might not even know which other teams are using your thing, and that’s weird.
That’s definitely something that we are working on: trying to give everybody a clear picture of where their system fits into the broader architecture and getting a little bit more consistency in the way that we do things. It’s easy enough to go read some other team’s source code, but it’s not a case of “Wow, look at how this team does it”; it’s hard to understand. That puts some artificial barriers on team cohesiveness. I don’t know if that answers your question, but it fits my general theme that this is a struggle and there are some interesting things about that.

Speaker: Matt Ranney

Chief Systems Architect @Uber, Co-Founder @Voxer

Matt is the Chief Systems Architect at Uber, where he's helping build and scale everything he can. Previously, Matt was a founder and CTO of Voxer, probably the largest and busiest deployment of Node.js. He has a computer science degree which has come in handy over a career of mostly network engineering, operations, and analytics.
