Keynote: The History of Fire Escapes
What You’ll Learn
-
Learn about resiliency, why it is important, and how it compares to firefighting.
-
Find out what lessons from fire prevention can be applied to software development to increase resiliency.
-
Hear about embedding resilience into software from the beginning.
Abstract
When a datacenter goes offline, a server gets overloaded, or a binary hits a crashing bug, we usually have a contingency plan. We reduce damage, redirect traffic, page someone, drop low-priority requests, follow documented procedures. But why do many failures still come as a surprise? In this talk, we'll look at how fire safety in buildings parallels how we prevent and manage software failures. Fire partitions. Public safety campaigns. Smoke alarms. Sprinkler systems. Doors that say “This is not an exit”. And fire escapes. We'll look at the journey fire safety took in New York City, from expecting our wooden buildings to catch fire at any moment, to designing in fire resistance from the start. And we'll look at what we can learn from real-world fire codes about expecting failure and designing for it.
Interview
You recently left Google, you are now at Squarespace. What's the focus of the work that you're doing?
I'm working on various cross-team projects around Squarespace's infrastructure. One of my current projects is to look at how we define and measure our service level objectives: the measure of how available we expect each piece of infrastructure to be. If the expectations and measured results for each component are clear, it makes it easier for developers in other organizations to make decisions and pick the right infrastructure for their services. SLOs are the bread and butter of site reliability engineering. We are all about measuring.
Your talk is called "The history of fire escapes." Why fire escapes?
Fire escapes are cool. I moved to New York 10 years ago, and they're one of the big things I kept noticing: the face of the city is different. Over the last few years, I worked a lot on disaster plans, resiliency, the contingency plans that get used when things fail. Maybe you recover from a backup, or you fail over to somewhere else, or you page someone, but it's almost always something that's not your regular process. And that made me start thinking about fire escapes again. They're contingency plans. The building is on fire; you drain load from the building, you get the people out. And I wondered, how did that happen? All these fire escapes look the same, pretty much. How did the city get to here? So I started reading about fire escapes, and it turns out it's a really rich history. There's a lot of things that happened to get fire escapes to where they are now. But I also discovered that fire escapes are no longer part of the building code. You're not allowed to build one now, even if you wanted to. And I thought that was fascinating because... what are the backup plans that we're using now? These buildings are 100 years old and fire escapes made sense at that time, but not now. What are we doing now that we'll come back to in a few years and say "Don't do that, that's terrible"?
So this talk is going to connect fire escapes to building resilient systems? What's the focus for this talk?
I'm looking at how the fire safety code in New York City evolved -- which is way more interesting than it sounds! In the beginning, when New York City wanted to reduce fire deaths, they assumed that a fire would inevitably start, then we'd have to figure out how to move people away from it and put it out. And over the years the thinking on that shifted to, "how do we stop the fire from happening in the first place? How do we stop it spreading?" We moved from wooden buildings to stone buildings. We moved from packing our factories wall to wall with flammable materials to putting in fireproof interior walls so that a fire couldn't go more than a room at a time. We looked at detection technology, finding out about the fire very quickly, putting it out with sprinklers. In your house, you've got a fire blanket or a fire extinguisher and you can put the fire out before it becomes a big problem, and you rarely get to the point where you have to call in firefighters. 2016 had the lowest number of fire deaths in New York City in a hundred years. I'm thinking about what we can learn from that.
With chaos engineering, don't we start a fire to see if we can get people out safely?
I think we're now starting controlled fires, and that's cool. If you think about it, in every building we have fire drills, and everyone grumbles and they lock their screen, and they go out of the building and then they come back in again. And we're used to that. We haven't done that as much in software, and chaos engineering is how we get good at that, I think. We're getting used to things breaking, and either calmly standing up and solving the problem, or hopefully, the system failing over or otherwise recovering on its own. It's a success if we don't notice. Things break, and we just recover from that without reacting.
Who is the core persona that you're talking to?
I think reliability is everyone's responsibility. We all have our part to play. A lot of companies have site reliability engineers, or production engineers or DevOps teams, but reliability is everyone's job. It's not something you can add at the end. So I want everybody to be thinking, how can we prevent fires -- well, outages -- from starting in our software systems, and if they do happen, how can we make sure the fire doesn't spread very fast or very far.
I'd like our whole industry to own this. We all know what our best practices are, but we don't always do them. We cut corners because we need to get a feature out and we say, well, it's good enough. I want us to take what "good enough" means and move it up a little. So the people I would really like to talk to are those who influence our organizations, who set the culture of what's good enough. But I want everybody to care. Software is increasingly used for life-critical systems, where outages are a very big deal. We need to take that responsibility seriously.
How do we do that? Software developers are trying to move faster. How do we intentionally move slower?
Reliability is one of the best features that you can have. When you go to a site and it's there, that's worth more than any individual feature. I think about it like this: if I hire an electrician to rewire my home, I want to trust that they're not cutting corners, even if it means they'd get the job finished faster and move on to the next job. The good ones do it right, even when nobody's watching, because they know faulty wiring can be life or death. If software can be life or death, we have to do the same. Anyway, doing a clean and safe job the first time makes it easier to iterate. Nobody enjoys working on a system that's all technical debt. Having to work around poor structure and hacks decreases your velocity, and you'll also spend a ton of time reacting to outages. Writing fireproof software is worth it in the long run.
Tracks
-
Microservices: Patterns & Practices
Evolving, observing, persisting, and building modern microservices
-
Developer Experience: Level up Your Engineering Effectiveness
Improving the end to end developer experience - design, dev, test, deploy, operate/understand. Tools, techniques, and trends.
-
Modern Java Reloaded
Modern, modular, fast, and effective Java. Pushing the boundaries of JDK 9 and beyond.
-
Modern User Interfaces: Screens and Beyond
Zero UI, voice, mobile: Interfaces pushing the boundary of what we consider to be the interface
-
Practical Machine Learning
Applied machine learning lessons for SWEs, including tech around TensorFlow, TPUs, Keras, Caffe, & more
-
Ethics in Computing
Inclusive technology, ethics and politics of technology. Considering bias. Societal relationship with tech. Also the privacy problems we have today (e.g., GDPR, right to be forgotten).
-
Architectures You've Always Wondered About
Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, Goldman Sachs
-
Modern CS in the Real World
Thoughts pushing software forward, including consensus, CRDTs, formal methods, & probabilistic programming
-
Container and Orchestration Platforms in Action
Runtime containers, libraries, and services that power microservices
-
Finding the Serverless Sweetspot
Stories about the pains and gains from migrating to Serverless.
-
Chaos, Complexity, and Resilience
Lessons building resilient systems and the war stories that drove their adoption
-
Real World Security
Practical lessons building, maintaining, and deploying secure systems
-
Blockchain Enabled
Exploring Smart contracts, oracles, sidechains, and what can/cannot be done with blockchain today.
-
21st Century Languages
Lessons learned from languages like Rust, Go-lang, Swift, Kotlin, and more.
-
Empowered Teams
Safely running inclusive teams that are autonomous and self-correcting