Presentation: Scalable Post-Mortem Analysis

Location:

Duration: 
4:10pm - 5:00pm

Day of week:

Level:

Persona:

Key Takeaways

  • Gain a better understanding of what post-mortem analysis is and how you can benefit from it.
  • Understand the capabilities and potential debugging/troubleshooting benefits of a post-mortem debugging system.
  • Hear about tools and techniques targeted at devs that can be used for effective post-mortem analysis.

Abstract

Resilience for many of us comes from our ability to restart applications in the face of failure. As debuggers and operators, we are often forced to go back and analyze the clues left behind, teasing out the root cause from assets like logs, heap dumps, or even core dumps. As our systems grow and become more distributed, these one-off investigations become less tenable, and a scalable way to analyze incidents after the fact is needed. In this talk, we'll explore examples of real-world incidents where scalable post-mortem analysis enabled unobtrusive investigation without affecting system resilience, and look into what other opportunities post-mortem analysis methods can offer.

Interview

Question: 
What is your role and background?
Answer: 
I am the CEO and co-founder of Backtrace I/O. We are focused on fundamentally improving the way software is debugged. Our first product is a holistic post-mortem debugging platform designed to improve the way teams detect, analyze and resolve errors in their software.
In the past I was a Head of Engineering at AppNexus and, before that, a tech lead on the ad server team there. I’ve spent a lot of time debugging and being frustrated with the process.
Question: 
Tell me about this platform. Is it specific to a certain technology?
Answer: 
Our platform can detect and capture faults for any language running on Unix-like operating systems. Backtrace currently supports deep introspection and analysis for native languages like C, C++, and Go, and we are actively working on supporting this functionality for other languages. When the Backtrace platform detects a fault, or is invoked, it takes a “snapshot” of the application. A snapshot includes the stack trace across all threads, local variables, heap data referenced by these variables, and environmental information. This information is analyzed for important signals that help with root-cause investigation and error detection, and is then sent to an object store. The object store indexes this information, allowing you to query across multiple snapshots, plug this information into workflow systems like JIRA and Slack, and pull down snapshots without needing access to the machine where the error occurred.
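To make the snapshot-and-index idea concrete, here is a minimal Python sketch. It is not Backtrace's implementation or API: the `Snapshot` and `SnapshotStore` names, their fields, and the fingerprinting scheme are all hypothetical, chosen only to illustrate how captured state could be grouped and queried across many faults without access to the original machine.

```python
from collections import defaultdict
from dataclasses import dataclass, field
import hashlib


@dataclass
class Snapshot:
    """Hypothetical record of application state captured at fault time."""
    host: str
    threads: dict            # thread name -> list of function names, innermost first
    faulting_thread: str
    variables: dict = field(default_factory=dict)    # locals of the faulting frame
    environment: dict = field(default_factory=dict)  # e.g. version, OS

    def fingerprint(self) -> str:
        """Hash the faulting thread's call stack so similar faults group together."""
        frames = "|".join(self.threads[self.faulting_thread])
        return hashlib.sha256(frames.encode()).hexdigest()[:12]


class SnapshotStore:
    """Toy in-memory stand-in for an indexed object store (an assumption)."""

    def __init__(self):
        self._by_fingerprint = defaultdict(list)

    def put(self, snap: Snapshot) -> str:
        key = snap.fingerprint()
        self._by_fingerprint[key].append(snap)
        return key

    def query(self, **env):
        """Return snapshots whose environment matches every given key/value."""
        return [s for group in self._by_fingerprint.values() for s in group
                if all(s.environment.get(k) == v for k, v in env.items())]

    def groups(self):
        """Map each fingerprint to the number of snapshots that share it."""
        return {k: len(v) for k, v in self._by_fingerprint.items()}


store = SnapshotStore()
stack = {"main": ["parse_bid", "handle_request", "main"]}
key_a = store.put(Snapshot("host-1", stack, "main", environment={"version": "1.2"}))
key_b = store.put(Snapshot("host-2", stack, "main", environment={"version": "1.3"}))
key_c = store.put(Snapshot("host-3", {"main": ["flush_log", "main"]}, "main",
                           environment={"version": "1.2"}))
```

Two crashes with the same faulting call stack land in the same group even though they came from different hosts, which is the property that makes querying across snapshots more useful than inspecting each failure one at a time.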
Question: 
What are the takeaways from your talk?
Answer: 
One key takeaway is understanding post-mortem analysis and how people use it in the real world to investigate errors, even when their systems are resilient to failure. This is an incredibly powerful method for investigating failure, with some compelling advantages over other debugging methods. Another takeaway is relating the methods of post-mortem analysis to languages people don’t typically associate with it. There are systems available today that augment logs with application state, inspect heap dumps, or let you extract interesting data about your systems, all of which enable you to investigate errors in a post-mortem fashion.
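As an illustration of augmenting logs with application state in a language not usually associated with post-mortem debugging, the following Python sketch records the local variables of every frame on the failing call path. The `capture_state`, `divide`, and `run` functions are hypothetical examples, not part of any tool mentioned in this interview; only the standard-library `traceback.walk_tb` call is real.

```python
import json
import traceback


def capture_state(exc: BaseException) -> dict:
    """Collect function name, line number, and locals for each frame of a traceback."""
    frames = []
    # walk_tb yields (frame, lineno) pairs from the outermost call to the innermost.
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append({
            "function": frame.f_code.co_name,
            "line": lineno,
            # repr() keeps the report serializable whatever the variable types are.
            "locals": {name: repr(value) for name, value in frame.f_locals.items()},
        })
    return {"error": type(exc).__name__, "message": str(exc), "frames": frames}


def divide(total, count):
    return total / count


def run():
    try:
        return divide(10, 0)
    except ZeroDivisionError as exc:
        # The process keeps running; the report is what gets analyzed afterwards.
        return capture_state(exc)


report = run()
log_line = json.dumps(report)  # one structured log entry per failure
```

The resulting log entry carries the values of `total` and `count` at the moment of failure, so the error can be investigated later without reproducing it or attaching a debugger to the live process.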
Question: 
What is the persona you are addressing?
Answer: 
Tech leads, senior engineers, and DevOps engineers. I also think this topic will be interesting to architects or anyone who thinks deeply about what debugging facilities they should add to their systems.

Speaker: Abel Mathew

Co-founder & CEO of Backtrace I/O

Abel Mathew is the co-founder and CEO of Backtrace I/O. Prior to Backtrace, Abel was a Head of Engineering at AppNexus where he led a team of developers to improve ad optimization and reduce platform-wide costs. He spent multiple years as a developer and a team lead on AppNexus’ Adserver Team where he helped design and implement their low-latency advertising platform. Before AppNexus, Abel was a kernel module and tools developer at IBM and a server room monkey at AMD.
