Comparing Apples and Volkswagens: The Problem With Aggregate Incident Metrics

This talk presents data from the Verica Open Incident Database (VOID) to conclusively demonstrate that aggregate incident metrics (MTTR, severity, number of incidents over time) aren't representative of your systems' resilience. I then pair those data with observations from actual incident reports about what kinds of useful information can be gleaned from incident analysis, and suggest alternative things you can measure to demonstrate learning from incidents in your organization.


Speaker

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.

From the same track

Session Resilience Engineering

Resilience Hides in Plain Sight

Thursday Jun 15 / 01:40PM EDT

Think of the most out-of-nowhere and surprising incident you've experienced.

John Allspaw

Founder and Principal @Adaptive Capacity Labs

Session Resilience Engineering

Embrace Complexity; Tighten Your Feedback Loops

Thursday Jun 15 / 02:55PM EDT

When dealing with an environment that feels chaotic and unreliable, a common tendency is to look for ways to reduce variability and bring things back under control through procedures, hierarchy, metrics, and standardization.

Fred Hebert

Staff SRE @Honeycombio

Session Resilience Engineering

5 Strategies to Resiliently Handle Uncertainty, Time Pressure & Change

Thursday Jun 15 / 04:10PM EDT

As an engineer tasked with keeping large-scale software systems running under changing priorities and time pressure, you need resilience capabilities that are both technical and organizational to successfully navigate modern software engineering work.

Dr. Laura Maguire

Cognitive Systems Engineer & Researcher

Session

Two Years of Incidents at 6 Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals

Thursday Jun 15 / 10:35AM EDT

Incidents and outages are expensive: they impact engineering productivity, business goals, and your company’s reputation. In this talk I will describe how we can apply resilience throughout the incident lifecycle to turn incidents into opportunities.

Vanessa Huerta Granda

Solutions Engineer @Jeli.io