Track:

Location:

Salon D

Duration:

5:25pm - 6:15pm

Day of week:

Tuesday

Level:

Intermediate

Persona:

DevOps Engineer

Key Takeaways

Learn how Dropbox uses disaster recovery testing on extremely large scale systems.
Understand the benefits of establishing a culture that encourages and promotes active failure testing.
Hear about the principles Dropbox uses to let teams focus aggressive failure testing on the seams between systems.

Abstract

Thomissa joined the Dropbox Infrastructure team 100 days ago. This presentation will share her experiences developing and rolling out new Disaster Recovery Testing techniques at Dropbox. Tammy will join Thomissa to share how her team runs DRTs and has implemented the techniques Thomissa has evangelized.

Dropbox was founded by engineers, and the ethos of technical innovation is fundamental to our culture. We’ve grown enormously since launching in 2008, surpassing 500 million signups and 500 petabytes of user data. To give you a sense of our incredible growth, we had only about 40 petabytes of user data in 2012! With such rapid growth come more failures and operational challenges. We look forward to sharing with you how Dropbox focuses on reliability while scaling.

Interview

Question:

What is your role today?

Answer:

THOMISSA: I am a technical project manager in reliability on the infrastructure team.

TAMMY: I am the Site Reliability Engineering Manager for the databases team at Dropbox. The team has roughly 5,000 databases. So very big scale.

Question:

What kind of unique scalability problems are you dealing with at Dropbox?

Answer:

TAMMY: We’ve grown enormously since launching in 2008, surpassing 500 million signups and 500 petabytes of user data (in bytes, that is a 5 followed by 17 zeroes!). There are many engineers working on the codebase: around 500 engineers serving 500 million customers.

THOMISSA: Accordingly, there are constantly things that are changing, improving, failing, recovering, and hopefully, becoming less complex.

TAMMY: We definitely have a focus on sweating the details, so you can generally find out everything you need to find out about what we do at Dropbox.

Question:

What’s the motivation for your talk?

Answer:

THOMISSA: We are focused on DRTs (Disaster Recovery Tests). We intentionally break things to make them better. This is something that we are working on as a company, and we are constantly looking to improve. It’s an ongoing cycle of having things break, fixing them, testing the fixes, realizing we need to break other things to test the same thing, and then working on those things which generally breaks other things. Improvement from failure.

TAMMY: And for me, personally, I came from banking in Australia. And in banking, we did DRTs quite frequently, but not nearly as much as we do at Dropbox. So coming to Dropbox, we have a very good methodology which we can share in the talk. Thomissa is working on the methodology and it is something that anyone could use in their own organization.

Question:

What can you tell us about this framework?

Answer:

THOMISSA: We use the principle of "why, how, what" to ensure DRT value and buy-in. It determines why tests are important, how to test, and what the test will be. We identify and test internal, external, and interdependent failure modes, aka the known unknowns, unknown unknowns, etc.

TAMMY: We also have different measurements that we capture: What is the likelihood of failure? What is the impact of the failure? Then there is a matrix that Thomissa has created that actually helps you identify other places to test. So it guides you to think differently about your systems and what the possible failure areas could be.
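
As an illustration of the likelihood and impact measurements Tammy mentions, here is a minimal sketch of how failure modes could be ranked. It is not Dropbox's actual matrix: the 1 to 5 scales, the multiplication of the two scores, the `FailureMode` and `prioritize` names, and the sample failure modes are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One candidate failure to test (hypothetical structure)."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(modes):
    """Return failure modes sorted from highest to lowest score."""
    return sorted(modes, key=lambda m: m.score, reverse=True)


# Made-up example entries, not taken from the interview.
modes = [
    FailureMode("replica lag", likelihood=4, impact=2),
    FailureMode("primary database loss", likelihood=2, impact=5),
    FailureMode("stale config push", likelihood=3, impact=3),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.name}: {mode.score}")
```

Ranking by a single product is only one possible choice; a team might instead weight impact more heavily or keep the two axes separate when deciding what to test first.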

Question:

What’s something a developer might not expect he/she needs to test that the methodology exposes?

Answer:

THOMISSA: One of the things that we see is that there are a lot of inter-system dependencies.

You can have a system that is incredibly robust, led by a super talented team, interacting with an equally solid system and team. But even the most ideal systems, when interacting with one another, don’t always interact ideally. Things are constantly changing on either side, and new systems are coming up. The seams where systems and teams meet are where you can see some unusual failures that you need to make sure you test.

TAMMY: We look at the edges (the edges between systems, rather than looking at the systems just by themselves). Also, the biggest difference at Dropbox (compared to anywhere else that I have worked) is that teams are empowered to run their own DRTs on their own schedule. We are encouraged to run them as frequently as we can, measure, and put forward ideas like 'this is what we are going to run and this is why'. I think this approach is really important. It is a different way to look at it because the teams are empowered to run DRTs. Then the reliability team helps by giving this methodology to help navigate who is running jobs at what time so you don’t have clashes. That is really important too. If everyone runs on the same day and doesn’t know what the others are doing, it could be very dangerous.
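
To make the point about avoiding clashes concrete, here is a minimal sketch of checking a shared DRT calendar for overlapping test windows. The interview does not describe how Dropbox coordinates schedules, so the `PlannedDRT` record, the `find_clashes` helper, and the sample teams and times are all hypothetical.

```python
from datetime import datetime
from itertools import combinations
from typing import NamedTuple


class PlannedDRT(NamedTuple):
    """A planned test window for one team (hypothetical record)."""
    team: str
    start: datetime
    end: datetime


def find_clashes(tests):
    """Return pairs of team names whose planned windows overlap."""
    clashes = []
    for a, b in combinations(tests, 2):
        # Two windows overlap when each starts before the other ends.
        if a.start < b.end and b.start < a.end:
            clashes.append((a.team, b.team))
    return clashes


# Made-up schedule for illustration only.
plan = [
    PlannedDRT("databases", datetime(2016, 6, 14, 10, 0), datetime(2016, 6, 14, 12, 0)),
    PlannedDRT("storage", datetime(2016, 6, 14, 11, 0), datetime(2016, 6, 14, 13, 0)),
    PlannedDRT("networking", datetime(2016, 6, 15, 10, 0), datetime(2016, 6, 15, 11, 0)),
]

clashes = find_clashes(plan)
print(clashes)
```

A pairwise comparison is fine for a handful of teams; with many planned tests, sorting the windows by start time first would avoid comparing every pair.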

Question:

So it almost seems as though there is as much culture as technology that you are talking about. Is that true?

Answer:

THOMISSA: I would say so. If we cultivate the mindset of engineering for failure we can put in a light system that can be optimized by individual teams. This enables teams to move fast and feel empowered to test their systems, push them to their limits, and explore where boundaries are.

Question:

What am I going to get by coming to your talk? What am I going to learn about establishing a methodology for doing disaster recovery tests against my architecture?

Answer:

THOMISSA: Well, from a cultural standpoint: how can anyone on a team, SRE or otherwise, think about engineering for failure and why should they? How can these team members find their weaknesses and engage with other teams to make a more robust product overall? We hope to frame a way to build tests, break them into multiple pieces, prioritize them, and execute.

Question:

What are your key takeaways for this talk?

Answer:

TAMMY: I think it’s that teams should be empowered to run their own DRTs and that it’s important to give people tools to be able to do that. This talk is an opportunity to consider a variety of questions. How do you empower everyone? How do you make everyone feel confident that they can run DRTs in their teams? How do we all start asking why we should run DRTs on a more frequent basis? How are we going to run DRTs? What is going to happen when DRTs are run? What are the results going to be? What results do we even want to see?

THOMISSA: There needs to be understanding and ownership of systems and the failures inherent to them, so that teams have the confidence to break things, fix them, and repeat the process with bigger goals.

Speaker: Thomissa Comellas

Technical Project Manager @Dropbox

Thomissa Comellas is a Technical Project Manager on the Infrastructure Reliability Team. She will cross her 100th day on the infrastructure team during QCon. Thomissa has a B.S. in Atmosphere & Energy Engineering and M.S. in Management Science & Engineering from Stanford. She loves efficiency and reliability - in technology and process.

Find Thomissa Comellas at

Speaker page

@thomissa

Infrastructure Reliability at Dropbox

Speaker: Tammy Butow

SRE Manager @Dropbox

Tammy Butow is the SRE Manager of the Dropbox Databases team. She loves Team Traditions, databases, automation, storage, Linux, Go, open source, monitoring, metal and runs @ladieswholinux. Tammy is an Australian living in San Francisco.

Find Tammy Butow at

Speaker page

@tammybutow

Site Reliability Engineering (SRE) Manager, Dropbox

Similar Talks

Learnings from a Culture First Startup

CTO @Buffer

Sunil Sadasivan

Becoming an Outlier

Software Architect @VinSolutions, Author @pluralsight

Cory House

ESPN Next Generation APIs Powering Web, Mobile, TV

Senior Director of Distribution Platforms @ESPN

Manny Pelarinos

The Human Side of Microservices

Tech Lead @Yelp

John Billings

Lessons Learned on Uber's Journey into Microservices

Software Engineer @Uber

Emily Reinhold

What They Don’t Tell You About Microservices…

CTO @Yodle

Daniel Rolnick

Algorithms for Animation

Partner & Tech Lead @CarbonFive

Courtney Hemphill

Machine Learning Fast and Slow

Lead Data Scientist @betaworks

Suman Deb Roy

Vowpal Wabbit a Machine Learning System

Leading Machine Learning Researcher, Vowpal Wabbit Contributor

John Langford

Tracks

Monday, 13 June

Architectures You've Always Wondered About

Case studies from: Google, LinkedIn, Alibaba, Twitter, and more...

Stream Processing @ Scale

Technologies and techniques to handle ever increasing data streams

Culture As Differentiator

Stories of companies and teams for whom engineering culture is a differentiator - in delivering faster, in attracting better talent, and in making their businesses more successful.

Practical DevOps for Cloud Architectures

Real-world lessons and practices that enable the devops nirvana of operating what you build

Incredible Power of an Open-Sourced .NET

.NET is more than you may think. From Rx to C# 7 designed in the open, learn more about the power of open source .NET

Sponsored Solutions Track 1

Tuesday, 14 June

Better than Resilient: Antifragile

Failure is a constant in production systems; learn how to wield it to your advantage to build more robust systems.

Innovations in Java and the Java Ecosystem

Cutting Edge Java Innovations for the Real World

Modern CS in the Real World

Real-world Industry adoption of modern CS ideas

Containers: From Dev to Prod

Beyond the buzz and into the how and why of running containers in production

Security War Stories

Expert-level security track led by well known and respected leaders in the field

Sponsored Solutions Track 2

Wednesday, 15 June

Microservices and Monoliths

Practical lessons on services. Asks the question: when should you go with microservices, and when should you NOT?

Modern API Architecture - Tools, Methods, Tactics

API-based application development, and the tooling and techniques to support effectively working with APIs in the small or at scale. Using internal and external APIs

Commoditized Machine Learning

Barriers to entry for applied ML are lower than ever before; jumpstart your journey

Full Stack JavaScript

Browser, server, devices - JavaScript is everywhere

Optimizing Yourself

Keeping life in balance is always a challenge. Learning lifehacks

Sponsored Solutions Track 3

Conference for Professional Software Developers

Presentation: 0 to 100 days - Running DRTs at Dropbox