Presentation: Building Confidence in a Distributed System



2:55pm - 3:45pm




Key Takeaways

  • Understand the need for verification of distributed systems.
  • Learn approaches and techniques for verifying distributed systems.
  • Understand some of the challenges, and their solutions, in verifying stream processing systems.


How Did I Get Here? Building Confidence in a Distributed Stream Processor

When we build a distributed application, how do we have confidence that our results are correct? We can test our business logic over and over but if the engine executing it isn't trustworthy, we can't trust our results.

How can we build trust in our execution engines? We need to test them. It's hard enough to test a stream processor that runs on a single machine; it gets even more complicated when you start talking about a distributed stream processor. As Kyle Kingsbury's Jepsen series has shown, we have a long way to go in creating tests that can provide confidence that our systems are trustworthy.

At Sendence, we're building a distributed streaming data analytics engine that we want to prove is trustworthy. This talk will focus on the various means we have come up with to create repeatable tests that allow us to start trusting that our system will give correct results. You’ll learn how to combine repeatable programmatic fault injection, message tracing, and auditing to create a trustworthy system. Together, we’ll move through the design process repeatedly answering the questions “What do we have to do to trust this result?” and “If we get the wrong result, how can we determine what went wrong so we can fix it?”. Hopefully you’ll leave this talk inspired to apply a similar process to your own projects.


Sean, you are the VP of engineering at Sendence. Can you tell us more about your company?
We are a newish start-up. The company started in February of last year. We are positioned in the Big Data space. At the moment, we are focusing on financial systems, and we are building a stream processing platform for that (which we are planning on open sourcing).
Would you say that this open source platform is a core part of what you are going to be speaking about?
Yes. The talk is centered around our experience in building that system. There are a number of key features we are working on that, if they don't work properly, mean we don't have a product to sell. Knowing that they work, and work under duress, is very important to us. In particular, we are saying that our platform is going to have idempotent data structures, which give you exactly-once semantics in an at-least-once system. To say that you can do that at high speed is a fairly bold claim to make.
The talk is in many ways the evolving story of how we are going about feeling confident in our own claims.
How do you (Sendence) prove confidence in a distributed system?
We started from a fairly simple idea. We have a stream processor. We have some data coming in. We want to know that our system is working correctly. What does it mean for this system to be correct? It starts with black box verification. We have an external testing system that sends data into the stream processor and gets data out. That external system has an idea about what “correct” means and can validate that our results match those expectations.
Imagine a simple stream processing example, we have a function that doubles the input: for every string A we send in, we get AA out. Or if we send in the number 10, we get 20 out.
Verification in this case is pretty trivial: we can look at what went in and what came out. If we sent in 1, 2, 3, we expect to have gotten 2, 4, 6; anything else is an error. The core principle is the same no matter what, but it is far more complex if we have business logic inside or multiple streams.
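As a minimal sketch of that black-box setup (hypothetical names throughout; `run_pipeline` merely stands in for the system under test), the external harness recomputes the expected output independently and diffs it against what the engine actually produced:

```python
# Black-box verification sketch: the harness knows nothing about the
# engine's internals; it only compares observed output to expectations.

def run_pipeline(inputs):
    """Stand-in for the stream processor under test: doubles every input."""
    return [x * 2 for x in inputs]

def verify(inputs, observed):
    """External oracle: recompute the expected output and report mismatches
    as (index, expected, observed) tuples."""
    expected = [x * 2 for x in inputs]
    return [(i, e, o)
            for i, (e, o) in enumerate(zip(expected, observed))
            if e != o]

inputs = [1, 2, 3]
mismatches = verify(inputs, run_pipeline(inputs))
assert mismatches == []  # anything else is an error
```

The key design point is that the oracle lives entirely outside the system under test, so a bug in the engine cannot silently corrupt the verdict.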
For example, one of the reference applications we are currently testing takes two streams of data, one being market data, the other being trades. It decides whether each trade should be accepted and allowed to go out to the market, based on variations within the market data. These streams will end up moving at different rates, but we still want to be able to verify that the results are correct.
Our strategies for verifying such a system are a core part of the talk.
Can you tell me about some of the concrete takeaways that an architect/tech lead might walk away with from your talk?
The big one is that if you want to have confidence in what you are building, then you have to build testability in from the beginning. You need to build a system where you can deterministically recreate the scenarios you find yourself in, so you can run those tests over and over again and get the exact same result each time, apart from whatever changes you have made to your code.
Being able to black box verify correctness can be fairly hard but probably doable with a single stream of data. Once you add in multiple streams, it gets a lot harder. I’m going to be talking a lot about how we are attacking that very hard problem and how we are combining those techniques with deterministic injection of faults and errors into the system.
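One common way to make fault injection deterministic, sketched here with hypothetical names (the source does not describe Sendence's actual mechanism), is to drive every fault decision from a seeded random source, so the exact same fault schedule can be replayed on every run:

```python
import random

# Deterministic fault injection sketch: the same seed yields the same
# sequence of injected faults, so a failing run can be replayed exactly.
class FaultInjector:
    def __init__(self, seed, drop_rate=0.1):
        self.rng = random.Random(seed)   # private RNG; same seed -> same schedule
        self.drop_rate = drop_rate

    def deliver(self, msg):
        """Return False to simulate dropping this message."""
        return self.rng.random() >= self.drop_rate

# Two runs with the same seed produce an identical fault schedule.
run_a = [FaultInjector(seed=42).deliver(0)] and \
        [inj.deliver(m) for inj in [FaultInjector(seed=42)] for m in range(100)]
run_b = [inj.deliver(m) for inj in [FaultInjector(seed=42)] for m in range(100)]
assert run_a == run_b
```

Recording only the seed (rather than every individual fault) keeps the audit trail small while still making the scenario fully reproducible.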
There was a talk from one of the FoundationDB team members a couple years ago about how they went about building for correctness. We at Sendence have taken the spirit of a lot of those ideas and applied them to a new problem space. Hopefully, attendees of my talk will be able to do the same: take our practical examples of how we have been solving the verification problem and adapt it to their own problem space.
One of the big things we realized early on was that in order to verify our system, we needed to be able to trace all the messages flowing through it, and we needed an audit log to understand how we ended up in an incorrect state. Those are two features that, while a boon to verification, are also great for convincing people to use your system.
What can I do to my own systems today to make them more traceable, more auditable?
First, you need to be able to observe and control state for any decision that you make.
Suppose we are doing time-based calculations and have to deal with the system clock. The clock is state, it's a dependency, and we need to record it.
When we are building up the audit, we have two types of things: we have derivable and non-derivable states. A derivable state for us is one where we can get back to that state simply by replaying the stream of incoming data. If we are doing word counts, for example, then we can get the final result by reprocessing whatever that corpus of words was.
But if we are doing things based on the time of day, for example, if on the first minute of every hour we throw away all data, then we need to know when we made those non-derivable decisions so we can repeat them.
Or, if we are deciding whether to allow a trade based on the state of the market, that market state is non-derivable data. To be able to recreate everything that is going on, we need to record all of that data, so that we can fix the bugs we are going to expose when we do fault injection. We then have a complete audit log that is usable in a production environment, for regulators, or for anybody else.
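The derivable/non-derivable split above can be sketched as follows (hypothetical names; this is an illustration, not Sendence's implementation). Derivable state comes back for free by replaying the input stream; non-derivable inputs, such as clock reads, are written to an audit log at the moment the decision is made, so a replay can feed the recorded values back in:

```python
import time

# Audit-log sketch: record every non-derivable input (here, a clock read)
# at decision time so the exact run can later be reconstructed.
class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, kind, value):
        self.entries.append({"kind": kind, "value": value})
        return value

def should_drop(log, now=None):
    """Throw away all data during the first minute of every hour.
    On replay, pass the recorded clock value instead of reading the clock."""
    now = log.record("clock", now if now is not None else time.time())
    return time.gmtime(now).tm_min == 0

log = AuditLog()
assert should_drop(log, now=3600.0) is True    # 01:00:00 UTC, first minute
assert should_drop(log, now=3720.0) is False   # 01:02:00 UTC
assert [e["kind"] for e in log.entries] == ["clock", "clock"]
```

A word count, by contrast, needs no such entries: rerunning it over the same corpus reproduces the result, which is exactly what makes it derivable.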
I remember a talk in London where William Hill discussed how they do something similar, I believe. They record the traffic during their busiest day of the year (the Grand National, if I remember correctly). Then, in their R&D environment, they iterate with different architectures to continually improve the system they're building.
That's very nice, definitely something I'd want the ability to do. I ran Storm in production for three years, and one of the most frustrating things was trying to debug when tuples would cross JVMs. It was really hard to trace; somewhere along the way, something was happening to a message. We'd have exceptions in our log and millions of messages flowing through, fifty or sixty thousand messages per second, along with a steady stream of errors. It's pretty much impossible to know which message is causing which error and, more importantly, even if we can pin it to a particular message, how it ended up in a state where it somehow causes an error we have never seen before.

Speaker: Sean T. Allen

VP Engineering @Sendence

Sean T. Allen is VP of Engineering at Sendence, a startup focused on high-speed data analytics. His turn-ons include programming languages, distributed computing, Hiwatt amplifiers, and Fender Telecasters. His turn-offs include mayonnaise, stirring yogurt, and sloppy code. He is one of the authors of Storm Applied; he wanted to call it “I wanna go fast” but, you know, publishers. ¯\_(ツ)_/¯
