Presentation: Scaling Container Architectures with OSS & Mesos

Location:

Duration

Duration: 
1:40pm - 2:30pm

Day of week:

Level:

Persona:

Key Takeaways

  • Hear how Mesos and OSS was leveraged by Two Sigma to improve speed with their research environment.
  • Learn lessons involved with multi-tenancy and quality of service.
  • Hear how Two Sigma used metrics related to user happiness to drive their success and how they developed a fair job scheduler on Mesos for batch and realtime workloads

Abstract

Fluctuating prices. Changing weather. Increasing globalization. Two Sigma uses artificial intelligence and other advanced technologies to look for meaningful patterns in the world’s data. Then, we use those insights to create investment strategies.

In this talk, we’ll discuss how Two Sigma was able to scale up their research to harness tens of thousands of CPUs. First, we’ll look at the previous generation of Two Sigma’s compute resource scheduler, and the challenges a quota-based system brought with it. Then, we’ll learn about Cook, the Mesos-based scheduler Two Sigma developed and contributed to open source. We’ll also discuss some of the challenges faced along the road to increasing the number of machines by orders of magnitude, and the solutions we engineered to build a reliable, scalable platform. We’ll conclude by discussing the state of the infrastructure today, and how Two Sigma leveraged open source software to build their world-class platform.

Interview

Question: 
Can you explain your talk title and Two Sigma to me? Are there any key technologies we should expose?
Answer: 
Two Sigma, especially in New York, is best known as the platinum sponsor of Papers We Love. Besides hosting several meetups, they use many cutting edge technologies like Scala, Clojure and Mesos. They are also involved with Strange Loop, Clojure/West and a bunch of conferences.
Question: 
If I say research platform, what does that mean? When I say building a research platform, what does that mean?
Answer: 
In finance (and in all data driven businesses), there are, broadly, these two stages that the problem is broken down into. One of them is acquiring data sets, be it via batch or streaming. Then, you apply machine learning and statistics and you come up with models. The other side is deploying the models to a website, trading environment, or whatever.
Our research platform is about modelling data. This is the side where the big data, big compute containers, and all that stuff that needs to scale is -- it needs to be easy to interact with. This is not the production trading system -- this is the analytic side of the business.
Question: 
What’s the motivation for your talk?
Answer: 
There are multiple goals. Two Sigma has used Mesos and applied open source technologies with Mesos to solve infrastructure problems. They also developed their own new technologies, some of which they open sourced on www.github.com/twosigma.
All of these things--open source and proprietary--have created something that is mind-blowingly cool from a technical perspective. I think this talk is going to resonate with engineers who work at large organizations, because the whole talk is about is about dealing with multitenancy and quality of service.
We’ll be talking about a problem you have with big companies. For example, on reliability: how do you keep 1000 machines running? Because if you have an auto scaling group in AWS, six machines is one thing, but when you have data centers, you have a new set of challenges.
Question: 
When you say "mind blowingly cool", will you put that in context for me?
Answer: 
Mesos has an interesting multi-resource fair sharing algorithm: what we did was extend that. We did original research on scheduling algorithms to be able to create this new preemptive fair sharing scheduling system. We used the idea of optimizing for user happiness: complaints per week was the metric that we used to track and improve our techniques. Part of this talk is about this algorithm, which is part of the software named Cook.
I’ll also talk about the other parts including Satellite. Mesos is like a deployment infrastructure that is constantly running, and like a process that is always running and getting messages about the state of the cluster and able to react. How do we use that to automate manual operations tasks? How do we integrate our manual processes with new, automatic ones -- how do we make them not conflict, and be able to work together?
Question: 
Did you go straight Mesos or did you evaluate other alternatives?
Answer: 
We investigated Yarn, OpenStack and Mesos. Kubernetes didn’t exist at the time. It’s that Mesos is super flexible. Yarn is really good for batch, but not as much for services. Kubernetes is really great for services, but not so good for batch. Mesos has so many integrations with so many systems, and at the time and, in fact, it was already battle-tested and proven.
Question: 
What is the context of the Cook discussion? Are you talking about the product itself or the patterns/lessons used in Cook?
Answer: 
I will talk about Cook: what it is, what it does, the challenges of how we built it. Cook originally was almost going to have like a market of resources internally, but we ended up with a much simpler thing that is easier to understand than economy of compute units.
I’ll talk about this great feature of Cook: self-simulation. Cook itself could record traces of what it was doing, and then replay them so that we could experiment with it and the use of simulation to improve the scheduling of the cluster.
Question: 
QCon targets advanced architects and senior development leads, what do you feel will be the actionable that type of persona will walk away from your talk with?
Answer: 
I want them to understand that Mesos is an excellent technology for cluster design and for container based architectures, and to understand the value of containers is, in many cases, a way to avoid like a large pay down of legacy technical debt and instead a mitigation strategy that can be just as effective.
I want them to be aware of the open source technologies that Two Sigma released for doing this.
Question: 
What do you feel is the most disruptive tech in IT right now?
Answer: 
I will answer that two fold. For what is currently disrupting, versus in six months or a year, what it is going to be.
What is currently I think the key is the new use of things like PF and IP tables for containers. The thing that is really crazy now is that the key networking technology is such that we are able to almost turn into a container into a virtual machine.
The next thing - and this is also, I guess, my hope - is that Mesos now is shipping with maintenance integration primitives and disk management primitives.
Currently none of the distributed databases operate themselves. All of them require someone to run backups and compactions and monitor them, and you have to manually add nodes and deal with decommissioning old nodes. I think Mesos is now in a position where, in a year from now, there going to be production quality frameworks for things like Cassandra and Postgres.
I think this is a real game changer because it essentially brings Amazon RDS out of a proprietary hosted product. It’s democratizing database infrastructure.

Speaker: David Greenberg

VP Transforming Architecture @TwoSigma

David Greenberg loves learning new things. He is an independent consultant who previously worked at Two Sigma, where he led the effort to rebuild their computing infrastructure. His desire to learn has lead him to study Russian, and he enjoys practicing cooking techniques. He's interested in high performance software and distributed systems with Mesos. He's the author of the O'Reilly book "Building Applications on Mesos" and the designer of Cook, a Mesos framework written in Clojure and Datomic which coordinates containers to optimize task scheduling.

Find David Greenberg at

Similar Talks

Tracks

Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June