Resilience Hides in Plain Sight

Think of the most out-of-nowhere and surprising incident you've experienced. I mean the ones that you know will be told over and over for years because the story is so bananas. The stuff that made it possible for you and your colleagues to handle it...is resilience. 

In this talk I'm going to describe what that "stuff" is, and then I'll talk about how it's incredibly hard to recognize it. A well-known and contrarian adage in the Resilience Engineering community is "Murphy's Law is wrong. What could go wrong almost never does, but we don't notice that — we just call it 'normal work.'" I'd like to help you understand the relationships between resilience, resilience engineering, learning from incidents, incident analysis, and other topics in the hope that you can see what a small (but fast-growing) community already sees...and cannot unsee.

What's the focus of your work these days?

My colleagues and I are a small group that, for the last five years, has been bringing methods, approaches, techniques, and concepts from fields that study complex work and expertise to the world of software. My master's degree is in human factors and system safety; my background is in software. Prior to this job, I was CTO at a company in Brooklyn called Etsy. The fields that I'm referring to are human factors, cognitive systems engineering, and resilience engineering. In my talk, I try to explain what those fields, in particular resilience engineering, have come to understand, in as accessible a way as I can.

It's a field that is attracting growing interest in the software world. The field is a little over twenty years old, and only recently has the world of software come to it. As a result, a lot of the work that we do either focuses on or is adjacent to incident analysis and understanding how people handle complex and, in many cases, surprising and unanticipated situations.

 

What's the motivation for your talk at QCon New York 2023?

I firmly believe that the industry is at the beginning of understanding and exploring the contribution that people make to their work. People are the only adaptive element in your organization. It is really tempting to believe that you can build in some automation that can "think" or do some of your work for you, but that's not what research in real-world environments like power plants, aviation, medicine, and space travel says. When it comes to genuine adaptation, to handling things that cannot be anticipated and reacting to unforeseen events, people are the greatest strength. Things work because people are bridging a gap between what the software, the application, and the automation are designed to do and what reality challenges them to do, and we do it so well.

How would you describe your main persona and target audience for this session?

The only criterion for attendees to get something out of the session is hands-on practitioner experience with production systems, either currently or in the past.

Is there anything specific that you'd like people to walk away with after watching your session?

A couple of things. The first is that every moment of designing, modifying, or recalibrating the software that helps people do their work should include, or be fueled by, the actual work that they do. I don't mean this in a narrow user-experience way, but in a cognitive-work way, so I'm going to arm people with a handful of heuristics that I would want them to have in mind when they're approaching their work.

The other is, as I mentioned before, that I believe the industry is on a precipice, one that to me is incredibly similar to the 2008 to 2010 period, when continuous deployment and delivery and the idea of DevOps emerged. That shift was born out of practitioners' grassroots recognition that the way software is designed and operated could fundamentally change. Having been there, I believe that same paradigm shift is happening right now, and this talk is part of laying out the support that brings me to that conclusion.

 

What's something interesting that you've learned from the previous QCon?

What sticks out for me about QCon is certainly the interesting and insightful nuggets from the talks. However, what I remember just as much are the connections. Whenever I've been to the conference, whether speaking or attending, I've found connections between what would otherwise look, on the schedule, like parallel but separate topics. It's hard for me not to see a through line of connections between and across topics, and that's something that sticks out to me at QCon. You can't get that from just looking at the schedule.

When you're in the sessions, you go to one session and think, "Oh, this is about ABC." Then in the afternoon, "Oh, this is about apples and oranges." As you sit in the "apples and oranges" session, things from the "ABC" session connect in your mind with the "apples and oranges" one, and I think that's a strength that QCon has.

 


Speaker

John Allspaw

Founder and Principal @Adaptive Capacity Labs

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John’s publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to “The DevOps Handbook.” His 2009 Velocity talk with Paul Hammond, “10+ Deploys Per Day: Dev and Ops Cooperation,” helped start the DevOps movement. John served as CTO at Etsy and holds an MSc in Human Factors and Systems Safety from Lund University.

Date

Thursday Jun 15 / 01:40PM EDT (50 minutes)

Location

Williamsburg / Greenpoint

Topics

Resilience Engineering Incident Analysis Learning From Incidents

From the same track

Session Resilience

Comparing Apples and Volkswagens: The Problem With Aggregate Incident Metrics

Thursday Jun 15 / 11:50AM EDT

This talk presents data from the Verica Open Incident Database (VOID) to conclusively demonstrate how aggregate incident metrics (MTTR, severity, # of incidents/time) aren't representative of your systems' resilience.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Session Resilience Engineering

Embrace Complexity; Tighten Your Feedback Loops

Thursday Jun 15 / 02:55PM EDT

When dealing with an environment that feels chaotic and unreliable, a common tendency is to look for ways to reduce variability and bring things back under control through procedures, hierarchy, metrics, and standardization.

Fred Hebert

Staff SRE @Honeycombio

Session Resilience Engineering

5 Strategies to Resiliently Handle Uncertainty, Time Pressure & Change

Thursday Jun 15 / 04:10PM EDT

As an engineer tasked with keeping large-scale software systems running under changing priorities and time pressure, you need resilience capabilities that are both technical and organizational to successfully navigate modern software engineering work.

Dr. Laura Maguire

Cognitive Systems Engineer & Researcher

Session

Two Years of Incidents at 6 Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals

Thursday Jun 15 / 10:35AM EDT

Incidents and outages are expensive: they impact engineering productivity, business goals, and your company’s reputation. In this talk I will describe how we can apply resilience throughout the incident lifecycle in order to turn incidents into opportunities.

Vanessa Huerta Granda

Solutions Engineer @Jeli.io