Redshift has SageMaker. BigQuery begat BigQuery ML. Spark birthed Databricks. Every data warehouse is tightly coupled to a particular ML stack. This is good for warehouse vendors, but it leads to vendor lock-in, implementation complexity, and significant friction when shuttling data to and from the ML engine.
When Eppo was trying to predict end-user behavior so that our clients could conclude their A/B experiments more quickly (the CUPED algorithm), we realized there was an opportunity to build something "so crazy it might just work" – a portable regression engine that did all of the heavy compute inside each warehouse using bog-standard ANSI SQL, and saved the hard matrix math for our client code. By leveraging each warehouse's ability to crunch columns quickly, concurrently, and *in situ*, we were able to perform complex ML estimation in far less time than it previously took just to egress that data to a dedicated ML engine.
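To make the split concrete, here is a minimal sketch of the idea described above (not Eppo's actual code): the warehouse reduces millions of rows to a handful of sufficient statistics with plain ANSI SQL aggregates, and the client solves the small regression locally. For CUPED's one-covariate case, the "hard matrix math" collapses to theta = Cov(x, y) / Var(x). All function names here are illustrative.

```python
# Warehouse side (ANSI SQL, runs in situ on the full dataset):
#   SELECT COUNT(*)  AS n,
#          SUM(x)    AS sum_x,   SUM(y)    AS sum_y,
#          SUM(x*x)  AS sum_xx,  SUM(x*y)  AS sum_xy
#   FROM experiment_metrics
# where x = pre-experiment metric, y = in-experiment metric.

def cuped_theta(n, sum_x, sum_y, sum_xx, sum_xy):
    """Client side: recover theta = Cov(x, y) / Var(x) from aggregates alone."""
    mean_x, mean_y = sum_x / n, sum_y / n
    var_x = sum_xx / n - mean_x ** 2
    cov_xy = sum_xy / n - mean_x * mean_y
    return cov_xy / var_x

def cuped_adjust(y, x, theta, mean_x):
    """Variance-reduced metric: y - theta * (x - mean(x))."""
    return y - theta * (x - mean_x)

# Toy demo: compute the same aggregates in Python to stand in for SQL.
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
n = len(xs)
agg = (n, sum(xs), sum(ys),
       sum(x * x for x in xs),
       sum(x * y for x, y in zip(xs, ys)))
theta = cuped_theta(*agg)
adjusted = [cuped_adjust(y, x, theta, sum(xs) / n) for x, y in zip(xs, ys)]
```

Only five numbers ever leave the warehouse, which is also where the privacy-preserving property comes from: the client never sees row-level data.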
In this talk, I will walk through the architecture of Eppo's portable, performant, privacy-preserving, multi-warehouse regression engine, and discuss the implementation challenges and the quirks of each warehouse. My goal is to challenge prevailing industry assumptions about what's needed to build scalable ML systems and to show a way forward in which we move compute closer to the data, using standard tools that are right in front of us. Attendees can expect to leave with a solid understanding of how Eppo's system works and just enough knowledge to start building a similar system in their programming language of choice.
Principal Statistics Engineer @Eppo (Creator of Evan's Awesome A/B Tools)