Track: Architecting For Failure

Location: Broadway Ballroom North Center, 6th fl.

Day of week: Wednesday

Successfully architecting for failure has to include systems and people that work toward preventing failure. However, today’s complex, distributed systems often fail in ways that require a specific combination of variables to sneak past all the preventative barriers we’ve built and create new and exciting failure modes. This track recognizes that reality and includes talks that follow a structured idea of how to build systems and organizations that are best architected for success, with the ability to respond quickly to failure.

Track Host: Dave Hahn

SRE in the Cloud Operations & Reliability Engineering organization @Netflix

Dave Hahn is an SRE in the Cloud Operations and Reliability Engineering organization at Netflix. He has many years of experience in distributed systems, failures, and mis-attribution of complex problems to human error. Will talk for applause. Bad jokes likely.

10:35am - 11:25am

Architecting for Failure Presentation

Presentation details to follow.

Jason Hand, Senior Cloud Advocate @Microsoft

11:50am - 12:40pm

How Did Things Go Right? Learning More From Incidents

Learning solely from failure isn't a fundamental; it's a limitation.

A look into the New View of Safety, Human & Organizational Performance, and Resilience Engineering shows us that safety, great performance, and sources of resilience do not come from the absence of failure, but rather the presence of adaptive capacity.

Navigating a perfect storm in a world where availability is made up and the 9's don't matter requires expertise. This talk will describe more rewarding ways to approach incident investigation without overly focusing on failure prevention.

  • What's going on when it seems like nothing is happening?
  • When failure does occur, what's going to keep it from being worse?
  • How do teams adapt successfully when preventative techniques fail?
  • How should we prioritize the effort to develop systems that help us safely manage the consequences of failure?

These questions cannot be answered by trying to explain causes of failure and fixing remediation items.

We will move the needle forward and increase our opportunity for learning from success with some fundamental and practical ways that get us from "Why did things go wrong?" to "How did things go right?"

Ryan Kitchens, Site Reliability Engineering @Netflix

1:40pm - 2:30pm

Architecting for Failure Presentation

Presentation details to follow.

Janna Brummel, IT Chapter Lead Site Reliability Engineering @ingnl
Robin van Zijll, Site Reliability Engineer & Product Owner on the SRE Team @ingnl

2:55pm - 3:45pm

Graceful Degradation as a Feature

The move from monolith to microservice has allowed pieces of functionality to be deployed individually and on demand. Having functionality isolated allows the opportunity for one microservice to fail without bringing down the whole system.

However, it also increases complexity with the number of API calls being made across all of these services. Each service has unique failure models, whether it’s a database, cache, queue, etc. How can you be sure that one single failure doesn’t cause an outage for your end users?

Landing the launches of new products and features and providing your users with a positive experience is crucial to your success. If something is to fail, you’d prefer they didn’t know. Or if they did, it shouldn’t interrupt their experience.

In this talk, we’ll cover graceful degradation as an engineering goal which can be confidently tested with Chaos Engineering. By purposely causing failure of one service at a time in a controlled environment, you can safely observe the effect on the end user, whether that’s on a laptop browser, a mobile app, or the result of an API call.

Lorne Kligerman, Director of Product @GremlinInc

4:10pm - 5:00pm

What Breaks Our Systems: A Taxonomy of Black Swans

Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.

By definition, you cannot predict true black swans. But black swans often fall into certain categories that we've seen before. This talk examines those categories and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, hidden system dependencies, and more.

Laura Nolan, Site Reliability Engineer


Wednesday, 26 June

  • Architecting For Failure

    More than just building software: building deployable, production-ready software in the face of guaranteed failure.

  • 21st Century Languages

    Lessons learned from building languages like Rust, Go-lang, Swift, Kotlin, and more.

  • Building High-Performing Teams

    What “high-performing team” means and how to build one effectively depends on context. This track will share different experiences of building high-performing teams in order to highlight how different contexts lead to different solutions, but also what typically stays the same because we’re still dealing with humans trying to work together. How do different forces affect the building of high-performing teams?

  • Software Defined Infrastructure: Kubernetes, Service Meshes, & Beyond

    Deploying, scaling, and managing your services is undifferentiated heavy lifting. Hear stories, learn techniques, and dive deep into software infrastructure.

  • High-Performance Computing: Lessons from FinTech & AdTech

    Killing latency and getting the most out of your hardware.