Back to Basics: Scalable, Portable ML in Pure SQL

Redshift has SageMaker. BigQuery begat BigML. Spark birthed Databricks. Every data warehouse is tightly coupled to a particular ML stack. This is good for warehouse vendors – but leads to vendor lock-in, implementation complexity, and significant friction when shuttling data to and from the ML engine.

When Eppo was trying to predict end-user behavior so that our clients could conclude their A/B experiments more quickly (an application of the CUPED algorithm), we realized there was an opportunity to build something "so crazy it might just work" – a portable regression engine that did all of the heavy compute inside each warehouse using bog-standard ANSI SQL, and saved the hard matrix math for our client code. By leveraging each warehouse's ability to crunch columns quickly, concurrently, and *in situ*, we were able to perform complex ML estimation in far less time than it previously took just to egress that data to a dedicated ML engine.
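
The split described above, with column aggregation in SQL and matrix math in client code, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Eppo's implementation: Python's built-in `sqlite3` stands in for a warehouse, the table `events` and its columns `x1`, `x2`, `y` are hypothetical, and the model is ordinary least squares solved through the normal equations, which the abstract does not confirm as the method used.

```python
import sqlite3

# Hypothetical data: y = 1 + 2*x1 + 3*x2 exactly, so the fit is easy to check.
rows = [(x1, x2, 1.0 + 2.0 * x1 + 3.0 * x2)
        for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]]

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
conn.execute("CREATE TABLE events (x1 REAL, x2 REAL, y REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# The "heavy compute" step: one pass over the table using only COUNT and SUM.
# The single result row holds every entry of X'X and X'y (with an intercept).
SQL = """
SELECT COUNT(*),
       SUM(x1), SUM(x2),
       SUM(x1 * x1), SUM(x1 * x2), SUM(x2 * x2),
       SUM(y), SUM(x1 * y), SUM(x2 * y)
FROM events
"""
n, s1, s2, s11, s12, s22, sy, s1y, s2y = conn.execute(SQL).fetchone()

# The "hard matrix math" step, done client-side on a tiny 3x3 system.
xtx = [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]]
xty = [sy, s1y, s2y]


def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) + [float(v)] for row, v in zip(a, b)]
    k = len(m)
    for col in range(k):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    # Back-substitution on the upper-triangular system.
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (m[r][k] - tail) / m[r][r]
    return beta


beta = solve(xtx, xty)  # [intercept, coefficient on x1, coefficient on x2]
```

In this sketch only nine aggregate numbers leave the database, however many rows the table holds, and no individual row is ever egressed. A production system would also have to deal with numerical conditioning of the normal equations and with dialect differences between warehouses, neither of which is handled here.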

In this talk, I will walk through the architecture of Eppo's portable, performant, privacy-preserving, multi-warehouse regression engine, and discuss the challenges with implementation as well as the quirks associated with each warehouse. My goal is to challenge prevailing industry assumptions about what's needed to build scalable ML systems and show a way forward where we start moving compute closer to the data – using standard tools that are right in front of us. Attendees can expect to leave with a solid understanding of how Eppo's system works and just enough knowledge to start building a similar system in their programming language of choice.


Speaker

Evan Miller

Principal Statistics Engineer @Eppo (Creator of Evan's Awesome A/B Tools)

With a bachelor's degree in physics and a secret past as an Operational Excellence Engineer at Amazon.com, Evan Miller has made a career attempting to thread the needle between mathematical modeling and practical engineering challenges. "Evan's Awesome A/B Tools", a set of JavaScript calculators he made in grad school, are now used throughout the A/B testing industry, helping practitioners plan their experiments with the appropriate amount of statistical power. His blog is known in the hackersphere for bringing clarity to statistical problems and a touch of humor to endless Internet debates about programming languages. For nearly a decade, Evan made his way as an independent Mac developer, making easy-to-use apps for data analysis. Most recently, he has combined his knowledge of statistics with his engineering expertise in the unique role of a Statistics Engineer, where he helps Eppo build out a world-class, warehouse-native experimentation platform for companies that appreciate the perils of trying to build one themselves.


Date

Thursday Jun 15 / 02:55PM EDT (50 minutes)

Location

Dumbo / Navy Yard

Topics

ML in Practice, MLOps, Data Architecture


From the same track

Session Search

Needle in a 930M Member Haystack: People Search AI @LinkedIn

Thursday Jun 15 / 11:50AM EDT

LinkedIn's search functionality is one of its oldest capabilities, allowing members to search for people they know, or to discover new connections.

Mathew Teoh

Machine Learning @ LinkedIn

Session AI/ML

PostgresML: Leveraging Postgres as a Vector Database for AI

Thursday Jun 15 / 10:35AM EDT

With the growing importance of AI and machine learning in modern applications, data scientists and developers are constantly exploring new and efficient ways to store and analyze large amounts of data.

Montana Low

Machine Learning w/ PostgresML

Session AI/ML

Going Beyond the Case of Black Box AutoML

Thursday Jun 15 / 01:40PM EDT

Most AutoML tools are black-box tools. They offer no-code/low-code interfaces (UIs or simple APIs) so that practitioners can get started quickly. While this helps beginners, experienced data scientists and ML practitioners often need more control.

Kiran Kate

Senior Technical Staff Member @IBM Research

Session

LLMs in the Real World: Structuring Text with Declarative NLP

Thursday Jun 15 / 04:10PM EDT

Building machine learning pipelines to extract structured data from unstructured text is a popular problem within an unpopular development lifecycle.

Adam Azzam

AI Product Lead @Prefect