Presentation: Have You Tried Turning It Off and On Again?

Track: Chaos, Complexity, and Resilience

Location: Soho Complex, 7th fl.

Duration: 4:10pm - 5:00pm

Day of week: Friday

Would you jump on this train of thought for a moment and see if you agree? Let’s say you have some number of computers. It could be three, it could be kerjillions; the number probably doesn’t matter too much for this thought experiment. Now let’s say you have a number of people, probably closer to three than kerjillions, but find a number that works for you. And these people are tasked with making those computers function together in a resilient fashion in the real world. Can we agree that how the people operate the computers in production can have a significant impact on the resilience of the system? Almost obvious, no?

But much less obvious are the deeper questions: what are the characteristics of an operations practice that actively influence a system towards greater resiliency? Which practices (let’s call them “operations theatre”) pretend to assist us in this goal but really work against us? In this talk, not only will we uncover the answers, but we’ll also use concrete examples from across the Site Reliability Engineering discipline to illustrate just how they work.

Speaker: David Blank-Edelman

Senior Cloud Ops Advocate @Microsoft

David has over thirty years of experience in the SRE/DevOps/systems administration field in large multiplatform environments. He is the author of the O'Reilly Otter book (Automating System Administration with Perl) and the editor/curator of "Seeking SRE: Conversations About Running Production Systems at Scale" (published by O'Reilly in August 2018). David is one of the co-founders of the now-global set of SREcon conferences.

