Keynote: The History of Fire Escapes

Location: Broadway Ballroom, 6th fl.

Duration: 9:00am - 10:10am

Day of week: Friday

What You’ll Learn

  1. Learn about resiliency, why it is important, and how it compares to firefighting.

  2. Find out what lessons from fire prevention can be applied to software development to increase resiliency.

  3. Hear about embedding resilience into software from the beginning.

Abstract

When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug, we usually have a contingency plan. We reduce damage, redirect traffic, page someone, drop low-priority requests, follow documented procedures. But why do many failures still come as a surprise? In this talk, we'll look at how fire safety in buildings parallels how we prevent and manage software failures. Fire partitions. Public safety campaigns. Smoke alarms. Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. We'll look at the journey fire safety took in New York City, from expecting our wooden buildings to catch fire at any moment, to designing in fire resistance from the start. And we'll look at what we can learn from real-world fire codes about expecting failure and designing for it.

Interview

Question: 

You recently left Google, you are now at Squarespace. What's the focus of the work that you're doing?

Answer: 

I'm working on various cross-team projects around Squarespace's infrastructure. One of my current projects is to look at how we define and measure our service level objectives: the measure of how available we expect each piece of infrastructure to be. If the expectations and measured results for each component are clear, it makes it easier for developers in other organizations to make decisions and pick the right infrastructure for their services. SLOs are the bread and butter of site reliability engineering. We are all about measuring.

Question: 

Your talk is called "The history of fire escapes." Why fire escapes?

Answer: 

Fire escapes are cool. I moved to New York 10 years ago, and they're one of the big things I kept noticing: the face of the city is different. Over the last few years, I worked a lot on disaster plans, resiliency, the contingency plans that get used when things fail. Maybe you recover from a backup, or you fail over to somewhere else, or you page someone, but it's almost always something that's not your regular process. And that made me start thinking about fire escapes again. They're contingency plans. The building is on fire; you drain load from the building, you get the people out. And I wondered, how did that happen? All these fire escapes look the same, pretty much. How did the city get to here? So I started reading about fire escapes, and it turns out it's a really rich history. There's a lot of things that happened to get fire escapes to where they are now. But I also discovered that fire escapes are no longer part of the building code. You're not allowed to build one now, even if you wanted to. And I thought that was fascinating because... what are the backup plans that we're using now? These buildings are 100 years old and fire escapes made sense at that time, but not now. What are we doing now that we'll come back to in a few years and say, "Don't do that, that's terrible"?

Question: 

So this talk is going to connect fire escapes to building resilient systems? What's the focus for this talk?

Answer: 

I'm looking at how the fire safety code in New York City evolved -- which is way more interesting than it sounds! In the beginning, when New York City wanted to reduce fire deaths, they assumed that a fire would inevitably start, then we'd have to figure out how to move people away from it and put it out. And over the years the thinking on that shifted to, "how do we stop the fire from happening in the first place? How do we stop it spreading?" We moved from wooden buildings to stone buildings. We moved from packing our factories wall to wall with flammable materials to putting in fireproof interior walls so that a fire couldn't go more than a room at a time. We looked at detection technology, finding out about the fire very quickly, putting it out with sprinklers. In your house, you've got a fire blanket or a fire extinguisher and you can put the fire out before it becomes a big problem, and you rarely get to the point where you have to call in firefighters. 2016 had the lowest number of fire deaths in New York City in a hundred years. I'm thinking about what we can learn from that.

Question: 

With chaos engineering, don't we start a fire to see if we can get people out safely?

Answer: 

I think we're now starting controlled fires, and that's cool. If you think about it, in every building we have fire drills, and everyone grumbles and they lock their screen, and they go out of the building and then they come back in again. And we're used to that. We haven't done that as much in software, and chaos engineering is how we get good at that, I think. We're getting used to things breaking, and either calmly standing up and solving the problem, or hopefully, the system failing over or otherwise recovering on its own. It's a success if we don't notice. Things break, and we just recover from that without reacting.

Question: 

Who is the core persona that you're talking to?

Answer: 

I think reliability is everyone's responsibility. We all have our part to play. A lot of companies have site reliability engineers, or production engineers or DevOps teams, but reliability is everyone's job. It's not something you can add at the end. So I want everybody to be thinking, how can we prevent fires -- well, outages -- from starting in our software systems, and if they do happen, how can we make sure the fire doesn't spread very fast or very far.

I'd like our whole industry to own this. We all know what our best practices are, but we don't always do them. We cut corners because we need to get a feature out and we say, well, it's good enough. I want us to take what "good enough" means and move it up a little. So the people I would really like to talk to are those who influence our organizations, who set the culture of what's good enough. But I want everybody to care. Software is increasingly used for life-critical systems, where outages are a very big deal. We need to take that responsibility seriously.

Question: 

How do we do that? Software developers are trying to move faster. How do we intentionally move slower?

Answer: 

Reliability is one of the best features that you can have. When you go to a site and it's there, that's worth more than any individual feature. I think about it like this: if I hire an electrician to rewire my home, I want to trust that they're not cutting corners, even if cutting corners would let them finish the job faster and move on to the next one. The good ones do it right, even when nobody's watching, because they know faulty wiring can be life or death. If software can be life or death, we have to do the same. Anyway, doing a clean and safe job the first time makes it easier to iterate. Nobody enjoys working on a system that's all technical debt. Having to work around poor structure and hacks decreases your velocity, and you'll also spend a ton of time reacting to outages. Writing fireproof software is worth it in the long run.

Speaker: Tanya Reilly

Principal Engineer @squarespace

Tanya Reilly is a principal engineer at Squarespace, working on infrastructure and reliability. 
