Presentation: Canopy: Scalable Distributed Tracing & Analysis @Facebook
Abstract
How do you understand the performance of a request that is executed in a large-scale system, potentially fanning out across thousands of machines and services? To answer this question at Facebook, we built a distributed tracing framework, Canopy, which has provided visibility into an otherwise intractable problem.
In this talk we present Canopy, Facebook’s performance and efficiency tracing infrastructure. Canopy records causally related events across the end-to-end execution path of requests, including events from browsers, mobile applications, and backend services. Canopy processes traces in near real-time, derives user-specified features, and outputs to datasets that aggregate across billions of requests. At Facebook, Canopy is used to query and analyze performance and efficiency data in real-time.
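To make the pipeline described above concrete, here is a minimal sketch of its three steps: collect events per trace, derive features from each trace, and emit one dataset row per trace. It is an illustration only, not Canopy's actual API or data model; the `Event` fields, the function names, and the chosen features (`duration_ms`, `event_count`, `components`) are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One recorded event. All fields are hypothetical, not Canopy's schema."""
    trace_id: str
    component: str      # e.g. "browser", "mobile", "backend"
    name: str
    timestamp_ms: int


def group_by_trace(events):
    """Collect events that belong to the same request into one trace."""
    traces = defaultdict(list)
    for event in events:
        traces[event.trace_id].append(event)
    return traces


def derive_features(trace_events):
    """Turn one trace into a flat row of (assumed) user-specified features."""
    ordered = sorted(trace_events, key=lambda e: e.timestamp_ms)
    return {
        "trace_id": ordered[0].trace_id,
        "duration_ms": ordered[-1].timestamp_ms - ordered[0].timestamp_ms,
        "event_count": len(ordered),
        "components": sorted({e.component for e in ordered}),
    }


def build_dataset(events):
    """Produce one feature row per trace, ready for aggregate queries."""
    return [derive_features(evs) for evs in group_by_trace(events).values()]


# Made-up events from two requests that cross client and backend components.
EVENTS = [
    Event("t1", "browser", "request_start", 0),
    Event("t1", "backend", "handler_start", 40),
    Event("t1", "backend", "handler_end", 120),
    Event("t1", "browser", "render_done", 200),
    Event("t2", "mobile", "request_start", 10),
    Event("t2", "backend", "handler_end", 90),
]

ROWS = build_dataset(EVENTS)
for row in ROWS:
    print(row)
```

Flattening each trace into a row is what lets later analysis run as ordinary aggregate queries over many requests rather than over raw event streams. A production system would also have to handle clock skew between machines and events that arrive late or not at all, which this sketch ignores.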
Canopy addresses three challenges we have encountered: (1) supporting the range of execution and performance models used by different components of the Facebook stack; (2) supporting interactive ad-hoc and real-time analysis of trace data; and (3) operating at massive scale, as Canopy currently records and processes over 1 billion traces per day.
We conclude by discussing lessons learned from applying Canopy to a wide range of use cases at Facebook, and we present case studies of its use in solving various performance and efficiency challenges.