Embrace Complexity; Tighten Your Feedback Loops

When dealing with an environment that feels chaotic and unreliable, a common tendency is to look for ways to reduce variability and bring things back under control through procedures, hierarchy, metrics, and standardization. However, these attempts often fail because of the inherent complexity of these systems: they can't fit in anyone's head, and they remain unruly despite all efforts.

I suggest that we relax these ideas of control and increase our focus on flexibility and adaptability. These and other ideas from Resilience Engineering can help us create a toolkit to embrace surprise, and foster a richer view of systems that extends our ability to respond both to unforeseen challenges and to unexpected opportunities.

In this talk, I'll present various small approaches and patterns that slowly influence how teams deal with reliability, and highlight some of the key interactions and behaviors I keep finding work well in the organizations I've been part of. In the end you can't really cancel out the chaos, but you can embrace the complexity and deal with it a bit better.

What's the focus of your work these days?

I'm a staff member on Honeycomb's SRE team. A lot of my work is reactive, dealing with emergencies, but when there's more time, I focus on training people on our practices and fostering a good operational culture. Additionally, I work on bringing a systemic view of our organization to the organization itself, ensuring we invest in the right things and maintain the right behaviors. Of course, there's also daily operational support.

What's the motivation for your talk at QCon New York 2023?

First, I was invited to a track filled with interesting people, so I wanted to be a part of it. Second, the track focuses on resilience engineering, a topic I've been interested in for quite a few years. In my talk, I aim to provide practical insights and a different perspective beyond just learning from incidents. While learning from incidents is crucial, I believe there are many small influential things we can do in decision-making, addressing challenges, and managing goal conflicts. I want to share these practical experiences I've gained over the years.

How would you describe your main persona and target audience for this session?

While the talk can be relevant to a general audience, it would particularly benefit senior-level individuals who are involved in influence work. This includes interacting with other teams and departments and driving organizational change. If people find it challenging to navigate such situations or have struggled with it in the past, the content I present can be helpful and provide a useful perspective on making these efforts practical.

Is there anything specific that you'd like people to walk away with after watching your session?

I hope to convey the understanding that structuring an organization and implementing procedures doesn't guarantee adherence. Based on my experience, focusing on the actual emerging organization and the work people are doing, even if it's not openly reported or conforming to the intended structure, yields better results. Instead of imposing strict order, it's about daily small acts and adjusting to make people's lives easier. This approach tends to be more effective.


Speaker

Fred Hebert

Staff SRE @Honeycombio

Fred Hebert is a staff SRE at Honeycomb.io, caring for SLOs and error budgets, on-call health, alert hygiene, incident response, and operational readiness. He has previously worked as a software developer of all ranks for over a decade and ended up with a healthy dislike of computers and clumsy automation. He’s a published technical author who loves distributed systems and systems engineering, and has a strong interest in resilience engineering and human factors.

Date

Thursday Jun 15 / 02:55PM EDT ( 50 minutes )

Location

Salon E

Topics

Resilience Engineering, SRE, Incident Response, Systems

From the same track

Session Resilience

Comparing Apples and Volkswagens: The Problem With Aggregate Incident Metrics

Thursday Jun 15 / 11:50AM EDT

This talk presents data from the Verica Open Incident Database (VOID) to conclusively demonstrate how aggregate incident metrics (MTTR, severity, # of incidents/time) aren't representative of your systems' resilience.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Session Resilience Engineering

Resilience Hides in Plain Sight

Thursday Jun 15 / 01:40PM EDT

Think of the most out-of-nowhere and surprising incident you've experienced.

John Allspaw

Founder and Principal @Adaptive Capacity Labs

Session Resilience Engineering

5 Strategies to Resiliently Handle Uncertainty, Time Pressure & Change

Thursday Jun 15 / 04:10PM EDT

As an engineer tasked with keeping large-scale software systems running under changing priorities and time pressure, you need resilience capabilities that are both technical and organizational to successfully navigate modern software engineering work.

Dr. Laura Maguire

Cognitive Systems Engineer & Researcher

Session

Two Years of Incidents at 6 Different Companies: How a Culture of Resilience Can Help You Accomplish Your Goals

Thursday Jun 15 / 10:35AM EDT

Incidents and outages are expensive: they impact engineering productivity, business goals, and your company’s reputation. In this talk I will describe how we can apply resilience throughout the incident lifecycle in order to turn incidents into opportunities.

Vanessa Huerta Granda

Solutions Engineer @Jeli.io