Presentation: Unorthodox Paths to High Performance

Duration: 11:50am - 12:40pm

Key Takeaways

  • Hear how rethinking assumptions can lead to breakthroughs in building high-performance data processing systems.
  • Gain insights into the tradeoffs and architecture of the record-breaking TritonSort sorting system from a member of the team that built it.
  • Understand some of the general principles that helped the team reframe the problem to achieve faster and faster benchmarks.

Abstract

Building high-performance data processing and storage systems that can both scale up and scale out is hard. In this talk, we'll examine some lessons my team and I learned while building record-setting sorting systems at UCSD, and how re-examining your assumptions, understanding your hardware, and actively avoiding work can make building high-performance systems easier.

Interview

Question: 
What is your role today at Trifacta?
Answer: 
I came to Trifacta pretty early; I was one of the first engineers we hired. I’ve been there for about 3 years, and we’ve grown from 6 or 7 people (when I joined) to over 100 people. I have done a bunch of different stuff in that time. A lot of different components have my fingerprints all over them.
I’m now a principal engineer, so my primary role is providing technical leadership on a handful of projects. If they need to drop somebody from engineering into a customer situation to fight fires, then they might drop somebody like me. There is also the day-to-day technical work: writing code, reviewing code, all that sort of stuff. I’m also involved in a lot of the higher level technical planning too, so I wear a lot of different hats.
Question: 
For those who aren’t familiar with Trifacta, can you briefly describe the business that Trifacta is in?
Answer: 
The goal with Trifacta is to create what we call a data transformation platform. The general idea is that if you are a data scientist or an analyst, you spend a lot of your time taking your raw data and massaging it (cleaning it up so that you can get it into a structured format for analysis). Most of the time, that’s done with a hacked together pile of Python scripts or R scripts or it’s done with Excel. You would be amazed at how many people use Excel as this giant hammer that they hit their data with until it looks structured.
What we are trying to do is bring a tool to market for people who don’t really know how to write code (or who do, but don’t want to maintain it), and provide them with a very intuitive user experience that assists them in cleaning data up for analysis and works just as well on really small data sets as on really large ones. The overarching idea is that we want to take the 80-90% of a data analyst’s time that they spend doing what we call “data wrangling” and turn it into something that maybe takes 10% of their time. The idea is to let them focus on actually doing their job.
Question: 
What is the motivation for your talk?
Answer: 
When I was in graduate school, we did some work on large scale data processing and trying to do it as efficiently as possible. We were operating at the 100 terabyte scale. For a 100 terabyte dataset (at least back then), you might have seen a multi-thousand node cluster being deployed.
There are two systems we built that I’m going to talk about. One of these systems is TritonSort (which was our sorting system), and the other is Themis (our MapReduce system, which was an extension of TritonSort).
I should mention why sort? If you look at the things that a data-intensive system has to do, it has to read data off secondary storage, and it has to write data out to secondary storage. It has to do some compute on that data once it’s got the data in RAM and then usually it has to move data around from place to place so that you get this nice data locality property for the compute that you’re doing. What’s nice about sort from a research perspective is that it has to do all of those things at the same time.
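As a rough, single-machine illustration of those phases (read from storage, move data so related keys end up together, compute, write back out), here is a minimal partitioned-sort sketch in Python. It is only a stand-in under simple assumptions, not how TritonSort works: the partition count, file layout, and helper names are hypothetical, and the "move" phase here is a local scatter to partition files rather than a network shuffle.

    import os
    import tempfile

    NUM_PARTITIONS = 8  # hypothetical; a real system sizes this to its disks and nodes

    def partition_of(key: bytes) -> int:
        # Range-partition on the first key byte so the partition files are globally ordered.
        return key[0] * NUM_PARTITIONS // 256

    def sort_file(input_path: str, output_path: str) -> None:
        tmpdir = tempfile.mkdtemp()
        parts = [open(os.path.join(tmpdir, "part-%d" % i), "wb")
                 for i in range(NUM_PARTITIONS)]

        # Read records off storage and scatter each one to its partition.
        # (In a distributed sort, this is where records cross the network to
        # the node that owns their key range.)
        with open(input_path, "rb") as f:
            for record in f:
                parts[partition_of(record)].write(record)
        for p in parts:
            p.close()

        # Sort each partition in memory and append it to the output, so every
        # record is read from and written to storage only a fixed number of times.
        with open(output_path, "wb") as out:
            for i in range(NUM_PARTITIONS):
                with open(os.path.join(tmpdir, "part-%d" % i), "rb") as p:
                    out.writelines(sorted(p))

The toy version is only there to show that the same job stresses storage, memory, and data movement at once, which is exactly why sort makes a good benchmark for data-intensive systems.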
There is this benchmark competition that was originally envisioned by Jim Gray (Turing Award winner and database elder god) who basically posited that we can measure the progress of these data intensive systems by seeing how fast we can do the sort operation. TritonSort was designed to see how much we could blow the doors off of that sort benchmark number.
Before we came onto the scene, the record was held by a 3000+ node cluster from Yahoo that was running Hadoop. We did the math on that cluster’s performance, and those machines weren’t being pushed very hard while that operation was going on. We asked ourselves, “How close to the edge can we take this?” We ended up doing the first record-setting sort run with 52 machines while substantially improving the bottom-line sorting rate at the same time, so overall it was pretty successful.
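To make that back-of-the-envelope math concrete (the actual run time isn’t quoted here, so the hours figure below is purely illustrative):

    per-node sort rate ≈ dataset size / (number of machines × elapsed time)
    e.g. 100 TB / (3000 machines × ~3 hours) ≈ 3 MB/s per machine

Even a single commodity spinning disk of that era could stream on the order of 100 MB/s sequentially, which is what suggested there was enormous headroom per machine.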
When we started, we had this vague understanding about minimizing the number of times that we wrote to the disk, but past that, we had no idea what we were doing. In the intervening 3 ½ years, we did a lot of experimentation. We ripped the thing down and rebuilt it a couple of times, and we took away some general principles that in retrospect helped us a whole lot.
This is an experience talk, in a way. If I were to explain my Ph.D. thesis in a sentence, it would be “reexamine your assumptions.” It’s okay to design for a particular hardware platform. Having a system that is architected to be easy to experiment on is hugely important if you want to approach, in a rigorous way, this idea of making something go faster. Another of the big things we had to do was rethink some of the fundamental assumptions that we had about sorting and MapReduce as problems. I’m hoping that people will be able to take some of those lessons we learned and apply them to the systems that they’re building.
Question: 
Are you going to dive into how you built TritonSort?
Answer: 
Yes, we are going to dive in from an architectural point of view. I’ll discuss some of the pieces of it that we got a lot of benefit out of (almost by accident) and also some of the previous iterations, what didn’t work about them, and how it informed the final architecture.
Question: 
I like the fact that you are taking two concrete examples and giving a retrospective on what you learned building them. As you were talking, I had the theory of constraints in my head: reorganizing things so that you can alleviate bottlenecks.
Answer: 
It’s a question of understanding your constraints and what those constraints mean in your context. There is this temptation to immediately go and micro-optimize like crazy - like, we have to make all the data structures lock-free, we have got to do zero copies, and DMA between the network cards on different machines, and all that sort of stuff. But in reality, we found that we didn’t have to do a lot of that. The fastest thing is the thing that you don’t do.

Speaker: Alex Rasmussen

Principal Software Engineer @Trifacta

Alex Rasmussen got his Ph.D. from UC San Diego in 2013. While at UCSD, he worked on really efficient large-scale data processing and set a few world sorting records, which makes him a hit at parties. He's currently working at Trifacta, helping build the industry's leading big data washing machine.
