Presentation: 0 to 100 days - Running DRTs at Dropbox

Duration: 
5:25pm - 6:15pm

Key Takeaways

  • Learn how Dropbox uses disaster recovery testing on extremely large-scale systems.
  • Understand the benefits of establishing a culture that encourages and promotes active failure testing.
  • Hear about the principles Dropbox uses to let teams focus aggressive failure testing on the seams between systems.

Abstract

Thomissa joined the Dropbox Infrastructure team 100 days ago. This presentation will share her experiences developing and rolling out new Disaster Recovery Testing techniques at Dropbox. Tammy will join Thomissa to share how her team runs DRTs and has implemented the techniques Thomissa has evangelized.

Dropbox was founded by engineers, and the ethos of technical innovation is fundamental to our culture. We’ve grown enormously since launching in 2008, surpassing 500 million signups and 500 petabytes of user data. To give you a sense of our incredible growth, we had only about 40 petabytes of user data in 2012! With such rapid growth come more failures and operational challenges. We look forward to sharing with you how Dropbox focuses on reliability while scaling.

Interview

Question: 
What is your role today?
Answer: 
THOMISSA: I am a technical project manager in reliability on the infrastructure team.
TAMMY: I am the Site Reliability Engineering Manager for the databases team at Dropbox. The team has roughly 5,000 databases. So very big scale.
Question: 
What kind of unique scalability problems are you dealing with at Dropbox?
Answer: 
TAMMY: We’ve grown enormously since launching in 2008, surpassing 500 million signups and 500 petabytes of user data (in bytes, that is a 5 followed by 17 zeroes!). There are many engineers working on the codebase: around 500 engineers serving 500 million customers.
THOMISSA: Accordingly, there are constantly things that are changing, improving, failing, recovering, and hopefully, becoming less complex.
TAMMY: We definitely have a focus on sweating the details, so you can generally find out everything you need to find out about what we do at Dropbox.
Question: 
What’s the motivation for your talk?
Answer: 
THOMISSA: We are focused on DRTs (Disaster Recovery Tests). We intentionally break things to make them better. This is something that we are working on as a company, and we are constantly looking to improve. It’s an ongoing cycle of having things break, fixing them, testing the fixes, realizing we need to break other things to test the same thing, and then working on those things which generally breaks other things. Improvement from failure.
TAMMY: And for me, personally, I came from banking in Australia. And in banking, we did DRTs quite frequently, but not nearly as much as we do at Dropbox. So coming to Dropbox, we have a very good methodology which we can share in the talk. Thomissa is working on the methodology and it is something that anyone could use in their own organization.
Question: 
What can you tell us about this framework?
Answer: 
THOMISSA: We use the principle of "why, how, what" to ensure DRT value and buy-in. It determines why tests are important, how to test, and what the test will be. We identify and test internal, external, and interdependent failure modes, aka the known unknowns, unknown unknowns, etc.
TAMMY: We also capture different measurements: what is the likelihood of failure, and what is the impact of the failure? Then there is a matrix that Thomissa has created that actually helps you identify other places to test. So it guides you to think differently about your systems and what the possible failure areas could be.
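To make the likelihood and impact idea concrete, here is a minimal sketch of ranking candidate failure modes by a simple score. The interview does not describe how the actual matrix works, so the 1 to 5 scales, the multiplication, the `FailureMode` type, and the example failure modes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """A candidate failure to exercise in a DRT (hypothetical structure)."""
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Assumed scoring rule: likelihood times impact.
        return self.likelihood * self.impact


def prioritize(modes):
    """Return failure modes ordered from highest to lowest score."""
    # Ties are broken by name so the ordering is deterministic.
    return sorted(modes, key=lambda m: (-m.score, m.name))


# Made-up examples, not taken from the talk.
modes = [
    FailureMode("replica lag during failover", likelihood=3, impact=4),
    FailureMode("single host loss", likelihood=4, impact=2),
    FailureMode("cross-team API timeout", likelihood=2, impact=5),
]

ranked = prioritize(modes)
for mode in ranked:
    print(f"{mode.score:>2}  {mode.name}")
```

A team could use a ranking like this to decide which test to run first, then revisit the numbers after each DRT as it learns more about how the system actually fails.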
Question: 
What’s something a developer might not expect they need to test that the methodology exposes?
Answer: 
THOMISSA: One of the things that we see is that there are a lot of inter-system dependencies.
You can have a system that is incredibly robust, led by a super talented team, interacting with an equally solid system and team. But even the most ideal systems, when interacting with one another, don’t always interact ideally. Things are constantly changing on either side, and new systems are coming up. The seams where systems and teams meet are where you can see some unusual failures that you need to make sure you test.
TAMMY: We look at the edges between systems rather than looking at each system in isolation. Also, the biggest difference at Dropbox (compared to anywhere else that I have worked) is that teams are empowered to run their own DRTs on their own schedule. We are encouraged to run them as frequently as we can, measure, and put forward ideas like 'this is what we are going to run and this is why'. I think this approach is really important. It is a different way to look at it because the teams are empowered to run DRTs. Then the reliability team helps by giving teams this methodology and by helping navigate who is running tests at what time so you don’t have clashes. That is really important too. If everyone runs on the same day and doesn’t know what the others are doing, it could be very dangerous.
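The clash problem Tammy describes can be sketched as a simple overlap check. The interview does not say how Dropbox coordinates schedules, so the `PlannedTest` type, the `find_clashes` helper, and the team and system names below are hypothetical; the sketch only flags two tests that overlap in time and touch at least one common system.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class PlannedTest:
    """A scheduled DRT window (hypothetical structure)."""
    team: str
    start: datetime
    end: datetime
    systems: frozenset  # names of the systems the test will disturb


def find_clashes(tests):
    """Return (team_a, team_b, shared_systems) for each risky overlap."""
    clashes = []
    for a, b in combinations(tests, 2):
        # Two half-open windows overlap when each starts before the other ends.
        overlaps_in_time = a.start < b.end and b.start < a.end
        shared = a.systems & b.systems
        if overlaps_in_time and shared:
            clashes.append((a.team, b.team, sorted(shared)))
    return clashes


# Made-up schedule for illustration only.
day = datetime(2016, 6, 14)
tests = [
    PlannedTest("databases", day.replace(hour=10), day.replace(hour=12),
                frozenset({"metadata-db"})),
    PlannedTest("storage", day.replace(hour=11), day.replace(hour=13),
                frozenset({"metadata-db", "block-store"})),
    PlannedTest("networking", day.replace(hour=14), day.replace(hour=15),
                frozenset({"metadata-db"})),
]

clashes = find_clashes(tests)
for team_a, team_b, shared in clashes:
    print(f"{team_a} and {team_b} overlap on {', '.join(shared)}")
```

Here only the databases and storage windows are flagged; the networking test touches the same system but runs later, so it is not treated as a clash.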
Question: 
So it almost seems as though there is as much culture as technology that you are talking about. Is that true?
Answer: 
THOMISSA: I would say so. If we cultivate the mindset of engineering for failure, we can put in place a lightweight system that can be optimized by individual teams. This enables teams to move fast and feel empowered to test their systems, push them to their limits, and explore where boundaries are.
Question: 
What am I going to get by coming to your talk? What am I going to learn about establishing a methodology for doing disaster recovery tests against my architecture?
Answer: 
THOMISSA: Well, from a cultural standpoint: how can anyone on a team, SRE or otherwise, think about engineering for failure and why should they? How can these team members find their weaknesses and engage with other teams to make a more robust product overall? We hope to frame a way to build tests, break them into multiple pieces, prioritize them, and execute.
Question: 
What are your key takeaways for this talk?
Answer: 
TAMMY: I think it’s that teams should be empowered to run their own DRTs and that it’s important to give people tools to be able to do that. This talk is an opportunity to consider a variety of questions. How do you empower everyone? How do you make everyone feel confident that they can run DRTs in their teams? How do we all start asking why we should run DRTs on a more frequent basis? How are we going to run DRTs? What is going to happen when DRTs are run? What are the results going to be? What results do we even want to see?
THOMISSA: There needs to be understanding and ownership of systems and the failures inherent to them, so that teams have the confidence to break things, fix them, and repeat the process with bigger goals.

Speaker: Thomissa Comellas

Technical Project Manager @Dropbox

Thomissa Comellas is a Technical Project Manager on the Infrastructure Reliability Team. She will cross her 100th day on the infrastructure team during QCon. Thomissa has a B.S. in Atmosphere & Energy Engineering and M.S. in Management Science & Engineering from Stanford. She loves efficiency and reliability - in technology and process.

Speaker: Tammy Butow

SRE Manager @Dropbox

Tammy Butow is the SRE Manager of the Dropbox Databases team. She loves Team Traditions, databases, automation, storage, Linux, Go, open source, monitoring, metal and runs @ladieswholinux. Tammy is an Australian living in San Francisco.