Presentation: Choose Your Own Adventure: Chaos Engineering
What You’ll Learn
- Learn what Chaos Engineering is and how Netflix uses it.
- Discover how to introduce Chaos Engineering to your organization.
- Discuss how to present Chaos Engineering convincingly to upper management.
Abstract
Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly. All too often, however, the road to cultural acceptance does not match our expectations as SREs, Chaos Engineers, and Productivity engineers.
Choose Your Own Adventure is a series of children's gamebooks where each story is written from a second-person point of view, with the reader assuming the role of the protagonist and making choices that determine the main character's actions and the plot's outcome.
This presentation will play on the book series and go over different experiences on "Chaos Adventures", including both successes and failures in introducing Chaos to an organization. Chaos Engineering can lead to better development processes and procedures and better preparedness for outages. These benefits are available to any company willing to invest in more resilient and antifragile systems.
Chaos tools can positively influence the development process, and audience members will leave this talk with a game plan for bringing Chaos practices to their organization. The "Chaos Adventure" will look a little different for everyone, depending on the type and size of the organization and on inter-team communication.
Interview
Nora: I'm on the Chaos team at Netflix. Our mission is to make sure that a system withstands turbulent conditions that happen in production on a regular basis. Making sure it's resilient enough to do that. We are using Chaos Engineering which involves injecting purposeful failure in the system, doing experiments on the system at different injection points that you can create between your services, and working with different microservice teams to do that. Our goal is to reveal failures before they become large-scale failures.
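The idea of injecting purposeful failure at a point between two services can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: `inject_failure`, `fetch_recommendations`, `homepage`, and the fallback list are assumed names, and this is not Netflix's tooling. The wrapper makes a chosen fraction of calls fail, and the experiment counts how often the caller had to fall back.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real downstream error during an experiment."""


def inject_failure(rate, rng):
    """Wrap a service call so that roughly `rate` of the calls fail on purpose."""
    def decorator(call):
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                raise InjectedFailure(f"injected failure in {call.__name__}")
            return call(*args, **kwargs)
        return wrapper
    return decorator


def fetch_recommendations(user_id):
    # Hypothetical downstream service call (assumption, not a real API).
    return [f"title-{user_id}-{i}" for i in range(3)]


def homepage(user_id, fetch):
    """Caller under test: it should degrade gracefully when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-1", "popular-2"]  # hypothetical fallback content


def run_experiment(rate, calls, seed=0):
    """Return how many of `calls` requests were served from the fallback."""
    rng = random.Random(seed)
    flaky_fetch = inject_failure(rate, rng)(fetch_recommendations)
    fallbacks = 0
    for user_id in range(calls):
        result = homepage(user_id, flaky_fetch)
        if result[0].startswith("popular"):
            fallbacks += 1
    return fallbacks
```

Calling `run_experiment(0.2, 1000)` exercises the fallback path on roughly a fifth of the requests. In a real system the interesting result is whether the caller still behaves acceptably, not the count itself.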
Nora: From my experience so far I have found that there is no one solution for chaos. There is no precise process that you can follow step by step, and there are many factors to weigh when making chaos solutions for your teams. Culture is a big factor. Getting social acceptance and cultural acceptance around chaos engineering is so important. And it's different with every organization, whether it's a startup like Jet.com or a massive organization like Netflix. There are differences. Based on the kind of issues that would occur in each of those organizations, there are different routes of chaos that I recommend choosing. I'll go through a "choose your own adventure" story with the audience, where different scenarios will come up and we'll have to pick a path to go down and see what happens based on that.
Nora: Sure. Say, for example, that you were having a lot of issues with Kafka. Your organization relies on Kafka on a pretty regular basis. All of a sudden, topics were getting overloaded: there were too many services writing to the same topic at the same time, or reading from the same topic at the same time. How do you handle that? How do you control the chaos in that? One way would be to arbitrarily increase reads and writes on topics on a semi-regular, semi-random basis, and see if your system can handle that. Many times a microservices architecture gets so big that you don't even realize you have a ton of different services listening to the same topic. That could be one form of chaos introduced with Kafka. There are a few other ways you can handle that too. How you decide to handle it could reveal the actual problem, or it could reveal different problems in the system as well.
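The Kafka scenario above can be approximated without a broker. The following Python sketch is a simulation under stated assumptions, not Kafka client code: `Topic` is a hypothetical in-memory stand-in with a bounded backlog, and `run_load_experiment` adds a random number of extra writers on each tick to see whether consumers keep up. All names and parameters are invented for illustration.

```python
import random
from collections import deque


class Topic:
    """Hypothetical in-memory stand-in for a topic with a bounded backlog."""

    def __init__(self, max_backlog):
        self.max_backlog = max_backlog
        self.backlog = deque()
        self.rejected = 0

    def publish(self, message):
        # Reject the message when the backlog is already full.
        if len(self.backlog) >= self.max_backlog:
            self.rejected += 1
            return False
        self.backlog.append(message)
        return True

    def consume(self, limit):
        # Remove up to `limit` messages and report how many were taken.
        taken = 0
        while self.backlog and taken < limit:
            self.backlog.popleft()
            taken += 1
        return taken


def run_load_experiment(base_writers, max_extra_writers, consume_rate,
                        ticks, max_backlog=50, seed=0):
    """Add 0..max_extra_writers random extra writers per tick and record the impact."""
    rng = random.Random(seed)
    topic = Topic(max_backlog)
    peak_backlog = 0
    for tick in range(ticks):
        extra = rng.randint(0, max_extra_writers)  # the semi-random extra load
        for writer in range(base_writers + extra):
            topic.publish((tick, writer))
        peak_backlog = max(peak_backlog, len(topic.backlog))
        topic.consume(consume_rate)
    return {"rejected": topic.rejected, "peak_backlog": peak_backlog}
```

Comparing a baseline run, for example `run_load_experiment(5, 0, 10, 100)`, against a run with extra writers shows whether the backlog stays bounded or messages start being rejected. Against a real cluster, the equivalent signals would be consumer lag and producer errors.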
Nora: I would say that engineers, managers, and PMs can all take something away from this conversation. I have found in my experience that getting managers and PMs to understand what chaos engineering is and what its goal is matters a great deal for the engineer who's actually doing it. I try to tailor it to both audiences, so it's a mixture of the business side and the technical side.
Nora: Yes. And I'll speak from firsthand experience.
Nora: I would like for people to come away with a cultural and technical plan to introduce chaos to their organization, an introduction to a language for building a failure injection library, and some real-life examples of actually bringing chaos to an organization and testing several different functions of a distributed system, from queues to databases to regional failures and beyond.