Canary Analyze All The Things: How we learned to Keep Calm and Release Often

Thursday, 5:30pm - 6:20pm

Releasing to production can be nerve-racking for any conscientious developer, especially when the product you're releasing is responsible for entertaining 48 million customers. Practically everyone who pushes to production spends some time afterward monitoring it, sometimes with a certain degree of trepidation.


In the last year or so, we've taken our most critical application, the system responsible for the Netflix API, and increased its deployment cadence from semi-monthly to daily deployments. We've done so while lowering the effort required of developers, increasing availability for our customers, and building up our trust that a deployment into production is safe, predictable, and good for our customers, and that when it is not, we'll know it quickly and we'll automatically revert the changes. We've done this by investing in our real-time analytics capabilities and building an automated canary analysis system.


In this talk, we'll discuss canary analysis deployment and observability patterns we believe are generally useful, and talk about the difference between manual and automated canary analysis. The talk is partly aspirational and partly utilitarian: our goal is to provide a useful way to think about canary analysis that will be applicable in most cloud-based engineering environments. We'll also discuss cloud-specific considerations (and opportunities) for canary analysis, as well as the next steps for Netflix's canary analysis system.

Roy Rapoport manages the Insight Engineering group at Netflix, responsible for building Netflix's Operational Insight platforms, including cloud telemetry, alerting, and real-time analytics. He originally joined Netflix as part of its datacenter-based IT/Ops group, and prior to transferring over to Product Engineering, was managing Service Delivery for IT/Ops. He provided input into the forming of the Cloud Operations and Reliability Engineering (CORE) group at Netflix, and continues to play an advisory role to the group and its members. He also built the majority of the Python infrastructure libraries that allow developers at Netflix to access cloud systems. Roy has been in tech for about 20 years with positions in IT engineering and operations, software development, and software quality engineering, but his passion remains with operations and automation.