Presentation: Choose Your Own Adventure: Chaos Engineering

Track: Chaos & Resilience

Location: Majestic Complex, 6th fl

Duration: 5:25pm - 6:15pm

Day of week: Tuesday

Level: Intermediate

Persona: Developer, DevOps Engineer

What You’ll Learn

  • Learn what Chaos Engineering is and how Netflix uses it.
  • Discover how to introduce Chaos Engineering to your organization.
  • Discuss how to present Chaos Engineering convincingly to upper management.

Abstract

Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly. However, all too often the road to cultural acceptance may not match our expectations as SREs, Chaos Engineers, and Productivity Engineers.

Choose Your Own Adventure is a series of children's gamebooks where each story is written from a second-person point of view, with the reader assuming the role of the protagonist and making choices that determine the main character's actions and the plot's outcome.

This presentation will play on the book series and go over different "Chaos Adventures", including both successes and failures in introducing Chaos to an organization. Chaos Engineering can lead to better development processes and procedures and better preparedness for outages. These benefits are available to any company willing to invest in more resilient and antifragile systems.

Chaos tools can positively influence the development process, and audience members will leave this talk with a game plan for bringing Chaos practices to their organization. The "Chaos Adventure" will look a little different for everyone, depending on the type of organization, the size of the organization, and inter-team communication.

Interview

Question: 
QCon: What's the work you're focused on at Netflix today?
Answer: 

Nora: I'm on the Chaos team at Netflix. Our mission is to make sure that the system withstands the turbulent conditions that happen in production on a regular basis, and that it's resilient enough to do that. We use Chaos Engineering, which involves injecting purposeful failure into the system, running experiments at the different injection points that you can create between your services, and working with different microservice teams to do that. Our goal is to reveal failures before they become large-scale failures.
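
To make the idea of an injection point between services concrete, here is a minimal sketch in Python. It is an illustration only and not Netflix's tooling: the names `inject_failure`, `InjectedFailure`, `fetch_recommendations`, and `recommendations_with_fallback` are all hypothetical. It wraps a call to a downstream service so that a configurable fraction of invocations fail on purpose, which lets you check that the caller degrades gracefully.

```python
import functools
import random


class InjectedFailure(Exception):
    """Raised instead of making the real call when the experiment fails it."""


def inject_failure(failure_rate, rng=None):
    """Return a decorator that fails roughly `failure_rate` of all calls."""
    rng = rng or random.Random()

    def decorator(call):
        @functools.wraps(call)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so 1.0 always fails, 0.0 never does.
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {call.__name__}")
            return call(*args, **kwargs)

        return wrapper

    return decorator


def fetch_recommendations(user_id):
    # Hypothetical downstream call; stands in for a real service request.
    return [f"title-{user_id}-1", f"title-{user_id}-2"]


def recommendations_with_fallback(user_id, call):
    """Caller under test: falls back to a static list when the call fails."""
    try:
        return call(user_id)
    except InjectedFailure:
        return ["popular-title"]
```

Wrapping the call as `inject_failure(0.05)(fetch_recommendations)` would fail about one request in twenty. In a real experiment the failure rate and the affected traffic would be kept small and controlled.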

Question: 
QCon: Your talk is titled "Choose Your Own Adventure: Chaos Engineering." What does that mean?
Answer: 

Nora: From my experience so far, I have found that there is no one solution for chaos. There is no precise process that you can follow step by step, and there are many factors to weigh when designing chaos solutions for your teams. Culture is a big factor. Getting social and cultural acceptance around chaos engineering is so important. And it's different with every organization, whether it's a startup like Jet.com or a massive organization like Netflix. Based on the kinds of issues that would occur in each of those organizations, there are different routes of chaos that I recommend choosing. I'll go through a "choose your own adventure" story with the audience, where different scenarios will come up and we'll have to pick a path to go down and see what happens based on that.

Question: 
QCon: Can you give me an example of one of these paths?
Answer: 

Nora: Sure. Say, for example, that you were having a lot of issues with Kafka. Your organization relies on Kafka on a pretty regular basis. All of a sudden, topics were getting overloaded: there were too many services writing to the same topic at the same time, or reading from the same topic at the same time. How do you handle that? How do you control the chaos in that? One way to do that would be to arbitrarily increase reads or writes on topics on a semi-regular, semi-random basis, and see if your system can handle that. Many times microservice architectures get so big that you don't even realize you have a ton of different services listening to the same topic. That could be one form of chaos introduced with Kafka. There are a few other ways that you can handle that too. How you decide to handle it could reveal the actual problem, or it could reveal different problems in the system as well.
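
As a rough illustration of the kind of experiment described in this answer, the sketch below plans semi-random bursts of extra load on topics and reports which topics would be pushed past an assumed limit. It is a simulation under stated assumptions, not a Kafka client: the topic names, the per-topic capacity figures, and the functions `plan_load_bursts` and `run_experiment` are all hypothetical, and a real experiment would attach actual consumers or producers and observe the system instead of comparing numbers.

```python
import random


def plan_load_bursts(topics, rounds, max_extra_readers, seed=None):
    """Pick a random topic and a random number of extra readers per round."""
    rng = random.Random(seed)
    return [
        (rng.choice(topics), rng.randint(1, max_extra_readers))
        for _ in range(rounds)
    ]


def run_experiment(baseline_readers, capacity, plan):
    """Apply each burst to a simulated topic and list the overloaded ones.

    baseline_readers: topic -> number of readers already attached
    capacity: topic -> number of readers the topic is assumed to tolerate
    plan: list of (topic, extra_readers) bursts, applied one at a time
    """
    overloaded = []
    for topic, extra in plan:
        total = baseline_readers[topic] + extra
        if total > capacity[topic]:
            overloaded.append((topic, total))
    return overloaded
```

For example, `run_experiment({"orders": 4}, {"orders": 5}, plan_load_bursts(["orders"], 10, 3))` lists every burst that would take the hypothetical `orders` topic past five readers. Passing a `seed` makes a surprising run repeatable.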

Question: 
QCon: Who's the main audience persona you're addressing?
Answer: 

Nora: I would say that engineers, managers, and PMs can all take something away from this conversation. I have found in my experience that getting managers and PMs to understand what chaos engineering is, and what its goal is, is so important for the engineer who's actually doing it. I try to tailor it to both audiences, so it's a mixture of the business side and the technical side.

Question: 
QCon: Will your talk give engineers the information they need to convince their managers of this?
Answer: 

Nora: Yes. And I'll speak from firsthand experience.

Question: 
QCon: What do you want someone who comes to your talk to walk away with?
Answer: 

Nora: I would like people to come away with a cultural and technical plan to introduce chaos to their organization, an introduction to a language for building a failure injection library, and some real-life examples of actually bringing chaos to an organization and testing several different parts of a distributed system, from queues to databases to regional failures and beyond.

Speaker: Nora Jones

Senior Chaos Engineer @Netflix

Nora is a Senior Chaos Engineer at Netflix. She is passionate about delivering high-quality software, improving processes, and promoting efficiency within architecture. Occasionally, she pokes holes in distributed systems to make them more resilient.

Tracks

Monday, 26 June

Tuesday, 27 June

Wednesday, 28 June