Improve Feature Freshness in Large Scale ML Data Processing

In many ML use cases, model performance depends heavily on the quality of the features the model is trained on and served at inference time. One important dimension of feature quality is the freshness of the data. It is therefore critical to keep features up to date with respect to the problem being solved.

The presentation will cover the impact of feature freshness on model performance, based on experiments with both training data and inference data. We will also discuss strategies and techniques for improving feature freshness in both streaming and batch feature processing, along with the challenges and tradeoffs of implementing them in large-scale machine learning systems, such as computational cost and scalability.

By keeping the features fresh and relevant, organizations can achieve better results and stay ahead of the competition in today's rapidly evolving data-driven landscape.

Interview:

What's the focus of your work these days?

My current area of focus revolves around developing techniques to prepare data for machine learning inference on a large scale. At the same time, I aim to enhance reliability, improve efficiency, and minimize latency in the process.

What's the motivation for your talk at QCon New York 2023?

I would like to share our learnings while working on these projects with the industry.

How would you describe your main persona and target audience for this session?

The target audience would be experienced technologists in the industry who work on large scale data processing for machine learning. 

Is there anything specific that you'd like people to walk away with after watching your session?

There are a few key takeaways:

  • Improving data freshness is becoming more and more important in ML tasks
  • However, not all of your data needs to be super fresh. Optimize for ROI instead of freshness alone
  • Design your system end to end, instead of focusing on localized optimization

Speaker

Zhongliang Liang

Engineering Manager @Facebook AI Infra

Zhongliang has over a decade of experience working in the domain of big data and large scale distributed systems. His most recent focus is on developing advanced data infrastructure for ML data processing at Meta, which powers the SOTA recommendation systems in the industry.

Previously, Zhongliang worked at LinkedIn, Microsoft BingAds and Vertica Systems, where he built distributed online and offline systems as well as high-speed analytical databases. Zhongliang also serves as a member of the Steering Committee for the Machine Learning Platform Meetup, where he facilitates the sharing of the latest technology advancements in the ML platform community.

Date

Wednesday Jun 14 / 11:50AM EDT (50 minutes)

Location

Salon D

Topics

Machine Learning, ML Platform, Data Platform

From the same track

Session MLOps

Platform and Features MLEs, a Scalable and Product-Centric Approach for High Performing Data Products

Wednesday Jun 14 / 04:10PM EDT

In this talk, we will go through the lessons learnt over the last couple of years in organising a Data Science Team and the Machine Learning Engineering efforts at Bumble Inc.

Massimo Belloni

Data Science Manager @Bumble

Session AI/ML

A Bicycle for the (AI) Mind: GPT-4 + Tools

Wednesday Jun 14 / 02:55PM EDT

OpenAI recently introduced GPT-3.5 Turbo and GPT-4, the latest in its series of language models that also power ChatGPT.

Sherwin Wu

Technical Staff @OpenAI

Atty Eleti

Software Engineer @OpenAI

Session ML Infrastructure

Introducing the Hendrix ML Platform: An Evolution of Spotify’s ML Infrastructure

Wednesday Jun 14 / 10:35AM EDT

The rapid advancement of artificial intelligence and machine learning technology has led to exponential growth in the open-source ML ecosystem.

Divita Vohra

Senior Product Manager @Spotify

Mike Seid

Tech Lead for the ML Platform @Spotify

Session

Panel: Navigating the Future: LLM in Production

Wednesday Jun 14 / 05:25PM EDT

Our panel is a conversation that aims to explore the practical and operational challenges of implementing LLMs in production. Each of our panelists will share their experiences and insights from within their respective organizations.

Sherwin Wu

Technical Staff @OpenAI

Hien Luu

Sr. Engineering Manager @DoorDash

Rishab Ramanathan

Co-founder & CTO @Openlayer

Session

Unconference: MLOps

Wednesday Jun 14 / 01:40PM EDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.