What You’ll Learn

Learn what Chaos Engineering is and how Netflix uses it.
Discover how to introduce Chaos Engineering to your organization.
Discuss how to present Chaos Engineering convincingly to upper management.

Abstract

Chaos Engineering is described as "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production". This is immensely beneficial when executed properly. All too often, however, the road to cultural acceptance does not match our expectations as SREs, Chaos Engineers, and Productivity Engineers.

Choose Your Own Adventure is a series of children's gamebooks where each story is written from a second-person point of view, with the reader assuming the role of the protagonist and making choices that determine the main character's actions and the plot's outcome.

This presentation will play on the book series and go over different experiences on "Chaos Adventures", including both successes and failures in introducing Chaos to an organization. Chaos Engineering can lead to better development processes and procedures and better preparedness for outages. These benefits are available to any company willing to invest in more resilient and antifragile systems.

Chaos tools can positively influence the development process, and audience members will leave this talk with a game plan on how to bring Chaos practices to their organization. The "Chaos Adventure" will look a little different for everyone depending on the type of organization, the size of the organization, and inter-team communication.

Interview

Question:

QCon: What's the work you're focused on at Netflix today?

Answer:

Nora: I'm on the Chaos team at Netflix. Our mission is to make sure that the system is resilient enough to withstand the turbulent conditions that happen in production on a regular basis. We use Chaos Engineering, which involves injecting purposeful failure into the system, running experiments at the different injection points that you can create between your services, and working with different microservice teams to do that. Our goal is to reveal failures before they become large-scale failures.
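
As a rough illustration of an injection point between services, here is a minimal Python sketch. It is not Netflix's tooling: the names `with_failure_injection`, `fetch_recommendations`, `fetch_with_fallback`, and `run_experiment` are hypothetical, and the "service" is an in-process function standing in for a real network call. The sketch wraps a call so that a configurable fraction of invocations fail on purpose, then checks that the caller still returns an answer through a fallback.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real call when the experiment injects a failure."""


def with_failure_injection(call, failure_rate, rng):
    """Wrap `call` so that a fraction of invocations fail on purpose.

    `failure_rate` is the probability (0.0 to 1.0) of injecting a failure;
    `rng` is a random.Random instance so an experiment can be replayed.
    """
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure before {call.__name__}")
        return call(*args, **kwargs)
    return wrapped


def fetch_recommendations(user_id):
    # Hypothetical downstream call standing in for a real service request.
    return [f"title-{user_id}-{n}" for n in range(3)]


def fetch_with_fallback(fetch, user_id):
    """Caller-side resilience: degrade to a static list if the call fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-1", "popular-2", "popular-3"]


def run_experiment(failure_rate, requests, seed=0):
    """Return how many requests were served by the fallback path."""
    rng = random.Random(seed)
    flaky = with_failure_injection(fetch_recommendations, failure_rate, rng)
    fallbacks = 0
    for user_id in range(requests):
        result = fetch_with_fallback(flaky, user_id)
        if result[0].startswith("popular"):
            fallbacks += 1
    return fallbacks
```

Calling `run_experiment(0.1, 1000)` would report how many of 1,000 simulated requests had to fall back. In a real experiment, the interesting signal is whether the user-facing behavior stays acceptable while the failures are being injected.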

Question:

QCon: Your talk is titled "Choose Your Own Adventure: Chaos Engineering." What does that mean?

Answer:

Nora: From my experience so far, I have found that there is no one solution for chaos. There is no precise process that you can follow step by step, and there are many factors to weigh when designing chaos solutions for your teams. Culture is a big factor. Getting social and cultural acceptance around chaos engineering is so important. And it's different with every organization, whether it's a startup like Jet.com or a massive organization like Netflix. Based on the kind of issues that would occur in each of those organizations, there are different routes of chaos that I recommend choosing. I'll go through a "choose your own adventure" story with the audience, where different scenarios will come up and we'll have to pick a path to go down and see what happens based on that.

Question:

QCon: Can you give me an example of one of these paths?

Answer:

Nora: Sure. Say, for example, that you were having a lot of issues with Kafka. Your organization relies on Kafka on a pretty regular basis. All of a sudden, topics are getting overloaded: there are too many services writing to the same topic at the same time, or reading from the same topic at the same time. How do you handle that? How do you control the chaos in that? One way would be to arbitrarily increase the reads and writes on topics on a semi-regular, semi-random basis, and see if your system can handle that. Many times a microservices architecture gets so big that you don't even realize you have a ton of different services listening to the same topic. That could be one kind of chaos introduced with Kafka. There are a few other ways that you can handle that too. How you decide to handle it could reveal the actual problem, or it could reveal different problems in the system as well.
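
The shape of such an experiment can be sketched without a Kafka client at all. The toy model below is an assumption-laden illustration, not a description of any real tooling: `run_reader_surge`, `overloaded_rounds`, and every number passed to them are hypothetical, and the "topic" is reduced to a single read budget. Each round adds a random number of extra readers on top of a baseline and records whether total read demand stays within that budget.

```python
import random


def run_reader_surge(baseline_readers, max_extra_readers, reads_per_reader,
                     topic_read_budget, rounds, seed=0):
    """Simulate semi-random surges of extra readers on one topic.

    Each round adds between 0 and `max_extra_readers` readers to the
    baseline and checks total read demand against `topic_read_budget`
    (a made-up capacity figure for the topic).
    Returns a list of (round, readers, demand, within_budget) tuples.
    """
    rng = random.Random(seed)
    results = []
    for round_number in range(rounds):
        extra = rng.randint(0, max_extra_readers)  # inclusive on both ends
        readers = baseline_readers + extra
        demand = readers * reads_per_reader
        within_budget = demand <= topic_read_budget
        results.append((round_number, readers, demand, within_budget))
    return results


def overloaded_rounds(results):
    """Return only the rounds where the surge pushed demand over the budget."""
    return [row for row in results if not row[3]]
```

For example, `overloaded_rounds(run_reader_surge(5, 10, 100, 1200, rounds=20))` lists the simulated rounds in which the surge exceeded the budget. Against a real cluster, the equivalent step would be attaching extra consumers and watching lag and error metrics rather than a computed number.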

Question:

QCon: Who's the main audience persona you're addressing?

Answer:

Nora: I would say that engineers, managers, and PMs can all take something away from this conversation. I have found in my experience that getting managers and PMs to understand what chaos engineering is, and what its goal is, is so important for the engineer who's actually doing it. I try to tailor the talk to both audiences, so it's a mixture of the business side and the technical side.

Question:

QCon: Will your talk give engineers the information they need to convince their managers on this?

Answer:

Nora: Yes. And I'll speak from firsthand experience.

Question:

QCon: What do you want someone who comes to your talk to walk away with?

Answer:

Nora: I would like people to come away with a cultural and technical plan to introduce chaos to their organization, an introduction to a language for building a failure injection library, and some real-life examples of actually bringing chaos to an organization and testing several different functions of a distributed system, from queues to databases to regional failures and beyond.

Speaker: Nora Jones

Senior Chaos Engineer @Netflix

Nora is a Senior Chaos Engineer at Netflix. She is passionate about delivering high-quality software, improving processes, and promoting efficiency within architecture. Occasionally, she pokes holes in distributed systems to make them more resilient.

Find Nora Jones at

Speaker page

@nora_js

Similar Talks

The Effective Remote Developer

Director of Engineering

David Copeland

Evaluating Machine Learning Models: A Case Study

Data Scientist @Opendoor

Nelson Ray

I Have A NoSQL toaster

Developer Advocate @Couchbase

Matthew Groves

Engineer Innovation Through Rapid Prototyping

Principal Software Engineer @ Vistaprint

Ramon Harrington

Managing Millions of Data Services @Heroku

Senior Infrastructure Engineer @Heroku

Gabriel Enslein

Building Microservices @Squarespace

Director of Engineering @ Squarespace

Franklin Angulo

Refactor Frontend APIs & Accounting for Tech Debt

Software Engineer @Indiegogo

Julia Nguyen

Reasoning About Complex Distributed Systems

Software Engineer @Jet, previous CTO

Erich Ess

Removing Friction In the Developer Experience

SVP Engineering, HBC Digital / Gilt & Committer Apache Karaf

Adrian Trenaman

Tracks

Monday, 26 June

Microservices: Patterns & Practices

Practical experiences and lessons with Microservices.

Java - Propelling the Ecosystem Forward

Lessons from Java 8, prepping for Java 9, and looking ahead at Java 10. Innovators in Java.

High Velocity Dev Teams

Working Smarter as a team. Improving value delivery of engineers. Lean and Agile principles.

Modern Browser-Based Apps

Reactive, cross platform, progressive - webapp tech today.

Innovations in Fintech

Technology, tools and techniques supporting modern financial services.

Tuesday, 27 June

Architectures You've Always Wondered About

Case studies from the most relevant names in software.

Developer Experience: Level up Your Engineering Effectiveness

Trends, tools and projects that we're using to maximally empower your developers.

Chaos & Resilience

Failures, edge cases and how we're embracing them.

Stream Processing at Large

Rapidly moving data at scale.

Building Security Infrastructure

How our industry is being attacked and what you can do about it.

Wednesday, 28 June

Next Gen APIs: Designs, Protocols, and Evolution

Practical deep-dives into public and internal API design, tooling and techniques for evolving them, and binary and graph-based protocols.

Immutable Infrastructures: Orchestration, Serverless, and More

What's next in infrastructure. How cloud functions like Lambda are making their way into production.

Machine Learning 2.0

Machine Learning 2.0, Deep Learning & Deep Learning Datasets.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS.

Optimizing Yourself

Maximizing your impact as an engineer, as a leader, and as a person.

Ask Me Anything (AMA)

This Year's Schedule

Track: Chaos & Resilience

Location: Majestic Complex, 6th fl

Duration: 5:25pm - 6:15pm

Day of week: Tuesday

Level: Intermediate

Persona: Developer, DevOps Engineer

Conference for Professional Software Developers

Presentation: Choose Your Own Adventure: Chaos Engineering
