Presentation: Too Big to Fail: Lessons from Google and



11:50am - 12:40pm

Failure is a fact of life, so we design our systems to be fault-tolerant at all levels. In practice, however, some components almost never fail. As the product grows, these components are increasingly stressed in new and different ways; when they ultimately do fail, they create outages for which we are unprepared. We thought we were designing for failure, but the design didn't include failures at this level. At Google, some of our most exciting production snafus involve large and unpredictable network-level failures; at in late 2013, just about every component fell into this category on a daily basis.

Through stories of large-scale Google outages and smaller-scale outages, we’ll illustrate situations in which we’re often flying blind and draw lessons from them about how to expose unknown weak points in our systems. We’ll discuss the importance of being able to model systems ahead of time and visualize solutions in real time (including during an outage). Attendees will learn a practical framework for anticipating potential large-scale outages and specific ways to increase systemic robustness, such as “practicing disaster”. Failure -- even large failure -- is a fact of life; outages don’t have to be.


Wednesday Jun 10

Thursday Jun 11

Friday Jun 12

Conference for Professional Software Developers