Presentation: The Verification of a Distributed System

Location:

Duration

Duration: 
2:55pm - 3:45pm

Day of week:

Level:

Persona:

Key Takeaways

  • Understand various approaches to verification of a distributed system.
  • Hear real world, practical tips on ensuring a distributed systems correctness.
  • Dive into details on the use of model checkers, fault injection, and more.

Abstract

Distributed Systems are difficult to build and test for two main reasons: partial failure & asynchrony. These two realities of distributed systems must be addressed to create a correct system, and often times the resulting systems have a high degree of complexity. Because of this complexity, testing and verifying these systems is critically important. In this talk we will discuss strategies for proving a system is correct, like formal methods, and less strenuous methods of testing which can help increase our confidence that our systems are doing the right thing.

Interview

Question: 
What are you doing today at Twitter?
Answer: 
I am a distributed systems engineer at Twitter, and for the last year and a half I have been the tech lead of the observability team which provides all of the monitoring, alerting, visualization, matrix injection, distributed log aggregation, tracing software like Zipkin. Recently, I transferred to our engineering effectiveness group because I am really excited to work on distributed build systems. We work on developer tools, providing a better developer experience so we can run linters, code suggestion and code quality tools in the background in a cost effective way, so they are not going to eat all of the RAM and CPU on a machine. I am going to be applying some of my distributed systems knowledge to help make our developer tools at Twitter better.
Question: 
Your talk is “The Verification of a Distributed System.” Tell me a bit about the genesis of the talk.
Answer: 
I wrote an article that was published in “Communications of the ACM” and “ACM Queue” this past year. The idea for that article came from being very passionate about testing and producing good quality software. Testing distributed systems is really hard. How do we even ensure that the system is correct? A lot of what we hear is about formal methods and verification, but as a practitioner I can say that most people aren’t doing that.
I talk about the article and I will describe where I think it is good to do this. I will discuss some practical things that may not give you the guarantee of a correct system, but help increase our confidence greatly that the systems we are building are correct. And I’ll talk about some of the nasty bugs that we found.
Question: 
What does it mean for a distributed system to be correct?
Answer: 
Testing is used to prove that the system we are doing does what we want. There are a couple of principles that I’ll discuss in my talk such as safety and liveness. Does it do what we want and does it do what we want in a certain amount of time are two good ways to think about that. Then there are correctness guarantees. For example, if I say that this thing is durably storing data and if two of out of my five machines fail, did it actually get stored correctly? How do I know if a system does what I want even in the face of failure?
Question: 
Why is an integration test or a unit test not enough when it comes to a distributed system?
Answer: 
Those are key components and there has been an interesting research paper that came out showing that those are an important first step. Doing that is fundamental to getting distributed systems correct, but there is a lot of stuff that is hard to mimic in a single execution or on a single machine. There are asynchronous networks and partition failure, which are the two reasons distributed systems are so hard. It’s important to be able to inject faults over the network, testing when things aren’t reliable. You can mimic some of this on a single laptop but you want it tested in an environment closer to production which is way more hostile than just running code on a machine where generally everything is fine.
Question: 
What types of techniques are you going to cover that are down to earth? I mean ones that are not based mathematical verification of a distributed system?
Answer: 
There are obviously unit tests and integration tests that we should, and they are bare minimum. I will talk about model checkers which are useful for exploring the state space of failures. Then we will also talk about fault injection which is also a useful method for testing systems under failure. There is a variety of ways to do fault injection, from kill -9 on a node, which is a manual operation, to tools like Jepsen or Netflix’s Chaos Monkey and all the automated fault injection that you can run.
There is also the financial side, how much can you spend on this because, obviously, in industry we need to be realistic. We only have so much budget to ship a project and only so much time, so what is the most bang for our buck. That is what I want people to take away from my talk.
Question: 
How are you going to go through this?
Answer: 
I definitely want to go through and we will touch on formal verification briefly just so that people understand what it is. Then we will spend a lot more time on the benefits of unit tests and integration tests along with some case studies that have been done. Then we will go through world examples of things done in production and written case studies that you can dive deeper for if you want after the talk. Hopefully one of those resonates with you and you want to try it for your system. You either find a bug or the system survives the test and you become more confident about your system.
Then, at the end, I’ll discuss interesting things that are happening in the distributed systems research space. This is a really exciting area because it is a very hard problem that academics and industry folks are both looking at proving more correct systems, providing more certainty about the code that we are shipping and putting out there.
Question: 
What is the level of this talk, and what’s the persona that you are targeting?
Answer: 
It’s intermediate. The audience should have some level of code development experience. They need to have worked on a product that ended up in production, and they want to have an increased level of confidence that their system is correct. I will discuss about what has to be done through the lifecycle of a project to ensure correctness.
Question: 
How do you create repeatability during tests in a distributed system?
Answer: 
That’s super challenging in distributed systems obviously because a bunch of different things could fail. The way to do this is to inject a very specific failure case. For example, a specific node is going to fail, and we watch what happens. And we can definitely reproduce it. We do not wildly inject faults although it can be done with Netflix’s Simian Army. But I believe they record them so they can go figure out later what went wrong.
Repeatability is hard because there are some timing issues that will also come up. That’s why the state space you are exploring is not exhaustive with some of these techniques. You never know that your system is 100% correct because we haven’t exhausted the state space of failures.
Question: 
What do you want someone coming to this talk to walk away with that’s actionable, that they can then go back to their shop and implement?
Answer: 
The main takeaways is that there is hope in verifying a system. You can use formal verification but I don’t think that inspires a lot of hope. The goal is to provide different ways that one can test. I will also discuss about the time, effort and money aspect, the need to decide on the strategies to apply to get a high level of confidence that the system built is correct.
This is a talk mostly about real world practical tips for doing system correctness or increasing confidence that the system is correct.

Speaker: Caitie McCaffrey

Distributed Systems Engineer @Twitter

Caitie McCaffrey is a Backend Brat and Distributed Systems Diva at Twitter, where she is the Tech Lead of the Observability Team. Prior to that she spent the majority of her career building large scale services and systems that power the entertainment industry at 343 Industries, Microsoft Game Studios, and HBO. Caitie has a degree in Computer Science from Cornell University, and has worked on several video games including Gears of War 2, Gears of War 3, Halo 4, and Halo 5 She maintains a blog at CaitieM.com and frequently discusses technology on Twitter @Caitie

Find Caitie McCaffrey at

Similar Talks

Tracks

Monday, 13 June

Tuesday, 14 June

Wednesday, 15 June