Presentation: Unorthodox Paths to High Performance

Duration: 11:50am - 12:40pm

Key Takeaways

  • Hear how rethinking assumptions can lead to breakthroughs in building high-performance data processing systems.
  • Gain insights into the tradeoffs and architecture of the record-breaking TritonSort sorting system from a member of the team that built it.
  • Understand some of the general principles that helped the team reframe the problem to achieve faster and faster benchmarks.

Abstract

Building high-performance data processing and storage systems that can both scale up and scale out is hard. In this talk, we'll examine some lessons my team and I learned while building record-setting sorting systems at UCSD, and how re-examining your assumptions, understanding your hardware, and actively avoiding work can make building high-performance systems easier.

Interview

Question: 
What is your role today at Trifacta?
Answer: 
I came to Trifacta pretty early; I was one of the first engineers we hired. I’ve been there for about 3 years, and we’ve grown from 6 or 7 people (when I joined) to over 100 people. I have done a bunch of different stuff in that time. A lot of different components have my fingerprints all over them.
I’m now a principal engineer, so my primary role is providing technical leadership on a handful of projects. If they need to drop somebody from engineering into a customer situation to fight fires, then they might drop somebody like me. There is also the day-to-day technical work: writing code, reviewing code, all that sort of stuff. I’m also involved in a lot of the higher level technical planning too, so I wear a lot of different hats.
Question: 
For those who aren’t familiar with Trifacta, can you briefly describe the business that Trifacta is in?
Answer: 
The goal with Trifacta is to create what we call a data transformation platform. The general idea is that if you are a data scientist or an analyst, you spend a lot of your time taking your raw data and massaging it (cleaning it up so that you can get it into a structured format for analysis). Most of the time, that’s done with a hacked together pile of Python scripts or R scripts or it’s done with Excel. You would be amazed at how many people use Excel as this giant hammer that they hit their data with until it looks structured.
What we are trying to do is bring a tool to market for people who don’t really know how to write code (or who do, but don’t want to maintain it), and provide them with a very intuitive user experience that assists them in cleaning data up for analysis and works just as well on really small data sets as on really large ones. The overarching idea is that we want to take the 80-90% of a data analyst’s time that they spend doing what we call “data wrangling” and turn it into something that maybe takes 10% of their time. The idea is to let them focus on actually doing their job.
Question: 
What is the motivation for your talk?
Answer: 
When I was in graduate school, we did some work on large scale data processing and trying to do it as efficiently as possible. We were operating at the 100 terabyte scale. For a 100 terabyte dataset (at least back then), you might have seen a multi-thousand node cluster being deployed.
There are two systems we built that I’m going to talk about. One of these systems is TritonSort (which was our sorting system), and the other is Themis (our MapReduce system, which was an extension of TritonSort).
I should mention why sort? If you look at the things that a data-intensive system has to do, it has to read data off secondary storage, and it has to write data out to secondary storage. It has to do some compute on that data once it’s got the data in RAM and then usually it has to move data around from place to place so that you get this nice data locality property for the compute that you’re doing. What’s nice about sort from a research perspective is that it has to do all of those things at the same time.
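As a rough, single-machine illustration of those phases (read from storage, move data so related keys end up together, compute, write back out), here is a minimal partitioned-sort sketch in Python. It is only a stand-in under simple assumptions, not how TritonSort works: the partition count, file layout, and helper names are hypothetical, and the "move" phase here is a local scatter to partition files rather than a network shuffle.

    import os
    import tempfile

    NUM_PARTITIONS = 8  # hypothetical; a real system sizes this to its disks and nodes

    def partition_of(key: bytes) -> int:
        # Range-partition on the first key byte so the partition files are globally ordered.
        return key[0] * NUM_PARTITIONS // 256

    def sort_file(input_path: str, output_path: str) -> None:
        tmpdir = tempfile.mkdtemp()
        parts = [open(os.path.join(tmpdir, "part-%d" % i), "wb")
                 for i in range(NUM_PARTITIONS)]

        # Read records off storage and scatter each one to its partition.
        # (In a distributed sort, this is where records cross the network to
        # the node that owns their key range.)
        with open(input_path, "rb") as f:
            for record in f:
                parts[partition_of(record)].write(record)
        for p in parts:
            p.close()

        # Sort each partition in memory and append it to the output, so every
        # record is read from and written to storage only a fixed number of times.
        with open(output_path, "wb") as out:
            for i in range(NUM_PARTITIONS):
                with open(os.path.join(tmpdir, "part-%d" % i), "rb") as p:
                    out.writelines(sorted(p))

The toy version is only there to show that the same job stresses storage, memory, and data movement at once, which is exactly why sort makes a good benchmark for data-intensive systems.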
There is this benchmark competition that was originally envisioned by Jim Gray (Turing Award winner and database elder god) who basically posited that we can measure the progress of these data intensive systems by seeing how fast we can do the sort operation. TritonSort was designed to see how much we could blow the doors off of that sort benchmark number.
Before we came onto the scene, the record was held by a 3000+ node cluster from Yahoo that was running Hadoop. We did the math on that cluster’s performance, and those machines weren’t being pushed very hard while that operation was going on. We asked ourselves, “How close to the edge can we take this?” We ended up doing the first record-setting sort run with 52 machines while substantially improving the bottom-line sorting rate at the same time, so overall it was pretty successful.
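To make that back-of-the-envelope math concrete (the actual run time isn’t quoted here, so the hours figure below is purely illustrative):

    per-node sort rate ≈ dataset size / (number of machines × elapsed time)
    e.g. 100 TB / (3000 machines × ~3 hours) ≈ 3 MB/s per machine

Even a single commodity spinning disk of that era could stream on the order of 100 MB/s sequentially, which is what suggested there was enormous headroom per machine.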
When we started, we had this vague understanding about minimizing the number of times that we wrote to the disk, but past that, we had no idea what we were doing. In the intervening 3 ½ years, we did a lot of experimentation. We ripped the thing down and rebuilt it a couple of times, and we took away some general principles that in retrospect helped us a whole lot.
This is an experience talk, in a way. If I were to explain my Ph.D. thesis in a sentence, it would be “reexamine your assumptions.” It’s okay to design for a particular hardware platform. Having a system that is architected to be easy to experiment on is hugely important if you want to approach, in a rigorous way, this idea of making something go faster. Another of the big things we had to do was rethink some of the fundamental assumptions that we had about sorting and MapReduce as problems. I’m hoping that people will be able to take some of those lessons we learned and apply them to the systems that they’re building.
Question: 
Are you going to dive into how you built TritonSort?
Answer: 
Yes, we are going to dive in from an architectural point of view. I’ll discuss some of the pieces of it that we got a lot of benefit out of (almost by accident) and also some of the previous iterations, what didn’t work about them, and how it informed the final architecture.
Question: 
I like the fact that you are taking two concrete examples and giving a retrospective on what you learned building them. As you were talking, I had the theory of constraints in my head: reorganizing things so that you can alleviate bottlenecks.
Answer: 
It’s a question of understanding your constraints and what those constraints mean in your context. There is this temptation to immediately go and micro-optimize like crazy - like, we have to make all the data structures lock-free, we have got to do zero copies, and DMA between the network cards on different machines, and all that sort of stuff. But in reality, we found that we didn’t have to do a lot of that. The fastest thing is the thing that you don’t do.

Speaker: Alex Rasmussen

Principal Software Engineer @Trifacta

Alex Rasmussen got his Ph.D. from UC San Diego in 2013. While at UCSD, he worked on really efficient large-scale data processing and set a few world sorting records, which makes him a hit at parties. He's currently working at Trifacta, helping build the industry's leading big data washing machine.
