Presentation: Adaptive Availability for Quality of Service



10:35am - 11:25am



Key Takeaways

  • Learn concrete approaches to discovering, analyzing, and optimizing poorly performing or latent nodes in a cluster.
  • Hear about tools and techniques that can be used to capture and model behavior.
  • Understand lessons from building applications to model performance on distributed systems.


In this presentation, I'll talk about lessons learned in building an always-on distributed time-series database with aggressive quality of service guarantees. As any distributed systems engineer knows, coping with a failed machine is an easy problem compared to coping with an underperforming one. When SLAs are tight, underperforming is effectively Byzantine behavior. I will talk about both macro and micro techniques used in our system to cope with bad machines, bad actors and other poorly qualified badness. Most are adaptive techniques backed by both local and cluster-wide statistical analysis of observed behavior.


What’s the motivation for your talk?
At Circonus, we have some innovative approaches to managing a distributed system’s performance by leveraging the new resiliency models available on those systems. They are somewhat novel, and they are pretty accessible to people who run general distributed systems (like Cassandra, Riak, etc.). I will be talking about our proprietary system, but the techniques apply to all distributed systems. I think people will come away with some interesting ideas on how they can manage their large distributed databases.
Can you describe one of the techniques you will go into?
There are some pretty traditional methods of doing active feedback between nodes for distributed system cluster performance. These are approaches you might use when you notice another node is slow, latent, or just not up to date. For example, you might take it out of rotation or change your parameters regarding node-to-node interaction. But the idea of measuring a node’s resource performance at a highly granular level (or being able to actually turn off nodes that have different performance profiles in an effort to understand if you have better performance across the whole cluster when you do) is the conclusion of the talk.
I will go over some standard techniques for getting performance characteristics off of replication systems, and I will also go on to talk about measuring per-transaction latency on low-level system resources. For example, you can measure the latency of every I/O operation on every spindle on every node in your cluster, model that, and then elect to immobilize machines based on bad behavior.
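As an editorial illustration of the idea in the answer above, here is a minimal sketch of flagging nodes whose tail I/O latency stands out from the rest of the cluster. It is not Circonus's implementation: the function names (`percentile`, `slow_nodes`), the choice of a median-absolute-deviation outlier test, and the `threshold` and `min_spread_ms` parameters are all assumptions made for the example.

```python
from statistics import median


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (len(ordered) * q + 99) // 100)  # integer ceiling of n*q/100
    return ordered[rank - 1]


def slow_nodes(latencies_ms, q=99, threshold=3.0, min_spread_ms=1.0):
    """Return the nodes whose q-th percentile latency is a slow-side outlier.

    latencies_ms maps a node name to a list of per-operation latencies (ms).
    The outlier test (hypothetical, chosen for illustration) compares each
    node's tail latency to the cluster median, scaled by the median absolute
    deviation (MAD). min_spread_ms keeps the scale from collapsing to zero
    when most nodes behave identically.
    """
    tails = {node: percentile(s, q) for node, s in latencies_ms.items() if s}
    if not tails:
        return []
    center = median(tails.values())
    mad = median(abs(v - center) for v in tails.values())
    spread = max(mad, min_spread_ms)
    # Only flag nodes that are slower than the cluster, never faster ones.
    return sorted(n for n, v in tails.items() if (v - center) / spread > threshold)


if __name__ == "__main__":
    observed = {
        "node-a": [4, 5, 5, 6, 7],
        "node-b": [4, 5, 6, 6, 8],
        "node-c": [5, 5, 6, 7, 7],
        "node-d": [5, 6, 40, 90, 120],
    }
    print(slow_nodes(observed))
```

A median-based test is used here because a single badly behaving node would inflate a mean and standard deviation enough to hide itself. What to do with a flagged node (take it out of rotation, change node-to-node parameters) is a separate policy decision that this sketch does not cover.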
When you say model latency on every IOPS on a spindle, are you going to be talking about a specific tool to help you model or general ideas and approaches? Assuming you’ll discuss tooling, is that tool open source and available for people to use?
I am going to talk about the general idea and the general outcome because I think it is applicable to a lot of people, but I will describe exactly how we do it. When I describe how we do it, I plan to discuss the tools that we use.
The tooling that we use to collect all of that information is open source, and the models that we use to detect the behavior are all open. But we built a monitoring tool, so we actually pump the data through our own product to do the actual modeling. With that said, the parts that are closed source are very small. They are actually rather simple concepts that others can build on.
How would you rate this talk: Beginner, Intermediate, or Advanced?
I think that beginners might be a little overwhelmed, but intermediate and advanced attendees will really understand the concepts and the approaches. I think advanced users will likely leave with enough information to implement something like this in their own environment.

Speaker: Theo Schlossnagle

Founder and CEO @Circonus, Editorial board of ACM's ‘Queue’

Theo founded Circonus in 2010, and continues to be its principal architect. After earning undergraduate and graduate degrees from Johns Hopkins University in computer science with a focus on graphics and randomized algorithms in distributed systems, he went on to research resource allocation techniques in distributed systems during four years of post-graduate work. A widely respected industry thought leader, Theo is the author of Scalable Internet Architectures (Sams) and a frequent speaker at worldwide IT conferences. Theo is a computer scientist in every respect. Theo is a member of the IEEE and a senior member of the ACM. He serves on the editorial board of the ACM's Queue Magazine.


