Presentation: Membership, Dissemination and Population Protocols

Location:

Duration

Duration: 
1:40pm - 2:30pm

Day of week:

Level:

Persona:

Key Takeaways

  • Hear practical applications of academic research with a large scale distributed system.
  • Discuss membership and dissemination protocols and understand their application in practice.
  • Learn practical lessons applying membership and dissemination protocols to a large scale distributed system.

Abstract

We are building an instrumentation platform that runs across dozens of datacenters to provide operational visibility for internal systems and applications. This platform must remain up as much as possible and allow support and operations staff to understand and diagnose problems quickly. They must be able to ask questions like "what machines and applications are publishing metrics?", "what systems appear to be offline?", "what order did these errors occur in?", all without consulting every datacenter. Furthermore, they must be able to change configuration quickly, with confidence that every affected system will receive and act upon it.

To help with these problems, we are implementing several recently developed protocols for cluster membership, epidemic broadcast, and monotonic time. Respectively, these protocols allow us to know what nodes are peers, to disseminate configuration and status information, and to agree on roughly relative orders of events. Best of all, they are all synchronization-free, meaning we can achieve our goals while remaining highly available. In this talk, we'll discuss the protocols we chose, challenges to implementing them, and some preliminary results from deploying the protocols across our infrastructure.

Interview

Question: 
You recently moved from Basho to Comcast. Can you tell me about your role at Comcast?
Answer: 
I am a software engineer at Comcast, and I am starting a new team as the tech lead on that team. For the last year, I have been working on an API management proxy that we wrote in-house, in Nginx and Lua. I am going to use some of my previous experience with distributed systems at Basho in this new project. It’s going to be on the Erlang OTP platform, and we are going to be using some of the technologies that I am covering in my talk.
Question: 
Lua, Erlang OTP... Sounds like quite a polyglot environment. Tell me more about what you’re doing there?
Answer: 
It is predominantly JVM stuff inside Comcast, but I also work on a team that is underneath Jon Moore. His teams are focusing on building services that are used by the rest of the company, the products behind the products. His focus there is to leverage open source, and build our own where we can’t use open source or where it is not sufficient, to try to reduce risk, increase value, and reduce costs.
For the most part, we are building managed services that are consumed by other teams within Comcast to build the products. We’ve got a DynamoDB clone. We have the API management proxy I mentioned. We have a DNS-as-a-service that is spinning up and then my new project.
Question: 
Will you talk us through the motivation for this talk?
Answer: 
The track is Modern CS in the Real World and I want to talk about the research that we are applying in my new project. Specifically, I will be covering getting information out from a big cluster spread across multiple data centers. There are three aspects to that, that I am intending to talk about.
The first aspect is membership, what machines are part of the cluster. That’s an interesting problem: because in very large clusters if you let every node know what every node is you will get a lot of churn, especially if they come and go.
As long as there is a path from one node to every other node, you don’t have to have a full view of membership. So we are looking at some protocols that use partial views of the entire membership, and they communicate only with their peers that are in their list, and those peers know other peers that they don’t know about. That’s how you get a fully routable mesh to the cluster.
Question: 
What do you consider to be the right persona for your talk?
Answer: 
I will be talking primarily to architects, people who are managing or building large scale services that span multiple data centers, especially where they need to have high availability requirements.
If you don’t need to have high availability, there are lots of other things that you can just pull off the shelf and use, rather than trying to implement one of these protocols.
Question: 
How would you class this, as an intermediate or advanced talk?
Answer: 
It’s intermediate to advanced, leaning toward advanced. I am going to make sure that it is approachable, but if you are not experienced, it might be pretty hard to motivate this sort of exploration of research. This is partly because the problems you try to solve haven’t been as big, or because you haven’t had to run a huge system before.
It’s targeted at a lot narrower group than maybe some of the other talks that are more general.
Question: 
Tell me about how you are going to approach the talk?
Answer: 
In each of the three areas, I am going to talk briefly about the research involved, without going into too much detail. Then I will talk about implementation and some of the pitfalls we ran into.
Question: 
Is there some open source code the people can use for this protocol?
Answer: 
There are some implementations in the wild that are open source. Recently, Mesosphere released their DCOS, a high level management tool for Mesos and containers. One of the components that they released is called Lashup, and it implements the membership protocol or a variation of the membership protocol that I am looking at.
I don’t think it implements the dissemination protocol, which is the second piece.
The dissemination protocol was implemented first at Basho, and then extracted by the folks at Helium, who have an IoT platform that is using it. It is called Plumtree. We are going to do a variation on that.
Question: 
What are the main takeaways from your talk?
Answer: 
The first is the usefulness of research, applying it to distributed systems. Then it’s a discussion on the membership and dissemination protocols, and the open solutions that exist out there. I’ll add some of the lessons we learned going through research, and applying these protocols in production.

Speaker: Sean Cribbs

Software Engineer @Comcast

Sean Cribbs is a distributed systems and web architecture enthusiast, currently building innovative cloud and infrastructure software at Comcast Cable. Previously, Sean spent five years with Basho Technologies contributing to nearly every part of Riak including client libraries, CRDTs and tools. In his free time, he has ported Basho’s Webmachine HTTP server toolkit from Erlang to Ruby, created a popular parser-generator for Erlang, and has contributed to many other open-source projects, including Chef, Homebrew, and Radiant CMS.

Find Sean Cribbs at

Tracks

Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June