What You’ll Learn

Gain tools and techniques that help you reason about distributed systems.
Learn how to investigate root causes of malfunctions in distributed systems.
Discuss ideas for teaching others on your team to reason about distributed systems.

Abstract

One of the biggest challenges of working with distributed systems (even small ones with only 10 services) is maintaining them once they're live: triaging major issues and returning systems to health as quickly as possible. This creates a key need for a good developer experience with complex systems: how to minimize the amount of time spent awake at 2am working toward Return To Service. A good developer experience is founded on how the distributed system is built and on developing specific problem-solving strategies, for example, using technical tools (such as distributed tracing) strategically to understand how a system is currently behaving and to quickly identify what is misbehaving. This talk will cover the technical tools you need to gain information on a complex system and practical approaches to convert that information into an actual understanding of the system.

Interview

Question:

QCon: Tell us a bit about the talk that you're giving at QCon.

Answer:

Erich: In this talk I'm going to go through tools and techniques that I've come across and that help me work with complex distributed systems: how to understand their behavior and how to reason through issues that are happening in a distributed system. If something's not functioning correctly, if you're not getting the expected behavior, how do you quickly determine the root cause and fix it by applying an understanding of how the system should behave? You use that understanding to look at performance metrics, tracing data and logs, and to run simple experiments that test the system. The way those tests behave tells us what is broken and what is not. This is meant to quickly narrow down which subset of the system is causing issues, and from there determine the root cause and bring the system back into service.

The main goal here is to help developers understand the tools I've been using and develop their own tools that let them reason about complex systems, to do better architecture and design, and better triage and support.

Question:

QCon: Can you give me an example of an experiment that you'll be talking about?

Answer:

Erich: At one of our companies, we had a system that contained 16 to 20 services interacting together to form the business logic for our customers. There was a public-facing API that had about 10 functions on it. By calling those 10 functions and looking at how each one behaved, it was possible to diagnose exactly what was going wrong with the system, because each function hit a different set of services in a different way. If we were experiencing an outage or an error, or things were not behaving correctly, we could run through those 10 functions. By looking at each of their results we could see which ones were not behaving correctly, and use those to triangulate exactly which parts of the system were working correctly and which were not. Within five minutes we could narrow it down to one or two services that could be misbehaving, or a specific piece of infrastructure, like a database, that must be misbehaving. That saved us a tremendous amount of time in triage: we didn't have to look at logs, we didn't have to check the dashboard. We could just run those APIs in a matter of minutes. Then we would know where the error was happening and fix the actual broken service.
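The idea of running through a fixed set of public API functions and noting which ones misbehave can be sketched roughly as follows. This is only an illustration, not the tooling Erich's team used: the probe names `get_cart` and `place_order` are hypothetical stand-ins for real API calls, and a probe is assumed to "fail" simply when it raises an exception.

```python
from typing import Callable, Dict


def run_probes(probes: Dict[str, Callable[[], object]]) -> Dict[str, bool]:
    """Call every probe once; a probe passes if it returns without raising."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = True
        except Exception:
            # Any exception counts as "not behaving correctly" in this sketch.
            results[name] = False
    return results


# Hypothetical stand-ins for calls against a public-facing API.
def get_cart():
    return {"items": []}


def place_order():
    raise TimeoutError("no response within the deadline")


results = run_probes({"get_cart": get_cart, "place_order": place_order})
print(results)
```

In practice each probe would make a real request and compare the response with the expected one; the point is that the whole set can be run in minutes and yields one pass/fail result per function.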

Question:

QCon: Distributed systems vary a lot. How do you help people reason about their system to diagnose their problems?

Answer:

Erich: First, this is about talking to the other teams or the other developers. Usually there's a small team for each subsystem in the distributed system, and its developers know what their system's role is and how it's expected to behave. Then there is setting up the infrastructure, such as log aggregators, to collect data that traces things as they move across different subsystems from team to team. The third tool is a social one: knowing who to talk to that knows a specific subsystem really well, and knowing how your own system works, how it interacts with other systems and what to expect from it. That will allow you to set up experiments on your system, reasoning that if I make this call on my system then it will hit these specific services and this database, and it is going to make this call to the others. That will let you know what services or databases are broken.

And you can say your system is one part of the set of systems that must be having issues. If you have enough experiments that overlap, you can see which calls are working. Calls that work tell you which services must be working correctly, and calls that aren't working tell you which ones may be broken. If you overlap them, you can start cutting out the services you have already verified through other calls and reduce down to just a small set of things that may be broken. If they're in your system you can look at them directly; if not, you can reach out to the person who knows that subsystem really well and get them involved. They can tell you what that behavior might mean on their side, and then you can get to the root cause.
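The overlap reasoning described here amounts to simple set arithmetic. The sketch below is an assumption-laden illustration: the call names, the component names and the `touches` map are all hypothetical, and it assumes a passing call fully clears every component it touches, which a real system with partial failures may not satisfy.

```python
from typing import Dict, Set


def suspects(touches: Dict[str, Set[str]], results: Dict[str, bool]) -> Set[str]:
    """Components touched by a failing call and not cleared by any passing call."""
    cleared: Set[str] = set()
    implicated: Set[str] = set()
    for call, ok in results.items():
        if ok:
            cleared |= touches[call]      # a working call vouches for what it hit
        else:
            implicated |= touches[call]   # a failing call implicates what it hit
    return implicated - cleared


# Hypothetical map from each API call to the components it exercises.
touches = {
    "get_cart": {"gateway", "cart", "cart_db"},
    "place_order": {"gateway", "cart", "orders", "orders_db"},
    "get_order": {"gateway", "orders", "orders_db"},
    "list_products": {"gateway", "catalog"},
}
results = {
    "get_cart": True,
    "place_order": False,
    "get_order": False,
    "list_products": True,
}

print(sorted(suspects(touches, results)))
```

With these made-up inputs the passing calls clear the gateway, cart, cart database and catalog, leaving only the orders service and its database as candidates to inspect or to raise with the team that owns them.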

Question:

QCon: Do you discuss metrics and how to interpret them?

Answer:

Erich: Yes. A lot of this gets down to contracts between systems. If I send you this data I expect this response. If I make this request I expect this response. SLAs come into play here, as do metrics that tell you what the performance is and logged error messages.
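One way to picture a contract between systems is as an expected response plus a performance bound that an observed call can be checked against. The following is a minimal sketch under that assumption; the `Contract` and `Observation` types, their fields and the numbers used are invented for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contract:
    expected_status: int     # the response the caller expects
    max_latency_ms: float    # an SLA-style latency bound


@dataclass
class Observation:
    status: int
    latency_ms: float


def violations(contract: Contract, seen: Observation) -> List[str]:
    """Name each clause of the contract that the observed call broke."""
    broken = []
    if seen.status != contract.expected_status:
        broken.append("status")
    if seen.latency_ms > contract.max_latency_ms:
        broken.append("latency")
    return broken


# Hypothetical contract and measurement for a single call.
checkout_contract = Contract(expected_status=200, max_latency_ms=250.0)
slow_but_correct = Observation(status=200, latency_ms=900.0)

print(violations(checkout_contract, slow_but_correct))
```

A call that returns the right answer but breaks the latency bound is still a contract violation, which is why metrics matter alongside correctness checks.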

Question:

QCon: Who are you talking to?

Answer:

Erich: Technical leads and architects: people who would use this knowledge to teach junior-level engineers how to think about this stuff, and people who design and improve the architecture of these systems, so they can provide tools and metrics that collect the data needed to determine the current behavior and make it easier to reason about.

Question:

QCon: What do you want a tech leader who comes to your talk to leave with?

Answer:

Erich: I want them to leave with a set of tools that help them work with and understand their complex systems, and ideas to teach the people on their team so they can be better engineers at working with distributed systems.

Speaker: Erich Ess

Software Engineer @Jet, previous CTO

Engineer at Jet.com. Building distributed systems and microservice platforms. Previously, I've been a CTO for a small start up, engineered distributed systems, and did research into scientific visualization.

Find Erich Ess at

Speaker page

@egerhardess

Similar Talks

The Effective Remote Developer

Director of Engineering

David Copeland

Evaluating Machine Learning Models: A Case Study

Data Scientist @Opendoor

Nelson Ray

Mixing in React

Software Engineer @Agrilyst

Rushaine McBean

I Have A NoSQL toaster

Developer Advocate @Couchbase

Matthew Groves

Engineer Innovation Through Rapid Prototyping

Principal Software Engineer @ Vistaprint

Ramon Harrington

The Java Evolution of Eclipse Collections

Technology Associate @GoldmanSachs

Kristen O'Leary

Nonconformist Resilience: DB-Backed Job Queues

VP Architecture @Betterment

John Mileham

Managing Millions of Data Services @Heroku

Senior Infrastructure Engineer @Heroku

Gabriel Enslein

Take Two: Evolving Microservice Architectures

Platform Director, "SeatGeek Open" @SeatGeek

Andrew Hart

Tracks

Monday, 26 June

Microservices: Patterns & Practices

Practical experiences and lessons with Microservices.

Java - Propelling the Ecosystem Forward

Lessons from Java 8, prepping for Java 9, and looking ahead at Java 10. Innovators in Java.

High Velocity Dev Teams

Working Smarter as a team. Improving value delivery of engineers. Lean and Agile principles.

Modern Browser-Based Apps

Reactive, cross platform, progressive - webapp tech today.

Innovations in Fintech

Technology, tools and techniques supporting modern financial services.

Tuesday, 27 June

Architectures You've Always Wondered About

Case studies from the most relevant names in software.

Developer Experience: Level up Your Engineering Effectiveness

Trends, tools and projects that we're using to maximally empower your developers.

Chaos & Resilience

Failures, edge cases and how we're embracing them.

Stream Processing at Large

Rapidly moving data at scale.

Building Security Infrastructure

How our industry is being attacked and what you can do about it.

Wednesday, 28 June

Next Gen APIs: Designs, Protocols, and Evolution

Practical deep-dives into public and internal API design, tooling and techniques for evolving them, and binary and graph-based protocols.

Immutable Infrastructures: Orchestration, Serverless, and More

What's next in infrastructure. How cloud functions like Lambda are making their way into production.

Machine Learning 2.0

Machine Learning 2.0, Deep Learning & Deep Learning Datasets.

Modern CS in the Real World

Applied, practical, & real-world dive into industry adoption of modern CS.

Optimizing Yourself

Maximizing your impact as an engineer, as a leader, and as a person.

Ask Me Anything (AMA)

This Year's Schedule

Track: Developer Experience: Level up Your Engineering Effectiveness

Location: Broadway Ballroom South Center, 6th fl.

Duration: 5:25pm - 6:15pm

Day of week: Tuesday

Level: Advanced

Persona: Architect, DevOps Engineer

Conference for Professional Software Developers

Presentation: Reasoning About Complex Distributed Systems
