Laying the Foundations for a Kappa Architecture - The Yellow Brick Road

In the ever-changing landscape of big data, focus is slowly moving away from batch and towards realtime analytics, and data science workflows are evolving to adapt. Realtime analytics are limited only by the underlying architecture that enables low-latency ingestion, processing and serving of high-throughput data. Lambda architecture unified batch and realtime to provide low-latency computations with eventual correctness. However, the challenges of maintaining two different codebases (among other things) made operations hard.

One of the co-creators of Kafka describes a streaming-first Kappa architecture in this 2014 article: a single-path solution that can handle realtime processing as well as reprocessing and backfills. Kappa architecture makes sense, but it is 2023 and it still has a long way to go before full-scale adoption. A paradigm shift is required in how we design data infrastructure. Instead of worrying about two codebases, we now need to worry about bootstrapping from a stream, backfills from history, idempotent sinks to handle reprocessing, and so on.
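To make the idea of an idempotent sink concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the talk: the `Event` type, its `event_id` key, the in-memory `IdempotentSink` and the `replay` helper are all hypothetical names. The sketch assumes every event carries a unique identifier and that the sink upserts on it, so replaying the retained stream for a backfill leaves the results unchanged.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique key, assumed to be assigned by the producer
    account: str
    amount: int


class IdempotentSink:
    """In-memory stand-in for a keyed table: one row per event_id."""

    def __init__(self) -> None:
        self._rows: dict[str, Event] = {}

    def write(self, event: Event) -> None:
        # Upsert keyed on event_id: a replayed event overwrites its own row
        # instead of adding a duplicate.
        self._rows[event.event_id] = event

    def totals(self) -> dict[str, int]:
        # Aggregate the stored rows per account.
        totals: dict[str, int] = {}
        for row in self._rows.values():
            totals[row.account] = totals.get(row.account, 0) + row.amount
        return totals


def replay(log: list[Event], sink: IdempotentSink) -> None:
    """Process the whole retained log from the start, as a backfill would."""
    for event in log:
        sink.write(event)


log = [Event("e1", "alice", 10), Event("e2", "bob", 5), Event("e3", "alice", 7)]
sink = IdempotentSink()

replay(log, sink)
first_pass = sink.totals()

replay(log, sink)  # reprocess the same stream a second time
second_pass = sink.totals()
```

In this sketch both passes produce the same totals. An append-only sink would instead double every amount on the second pass, which is why reprocessing in a single-path design depends on sinks of this kind.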

In this presentation, I will talk about strategies to evolve your data infrastructure to enable Kappa architecture in your organization: an iterative roadmap to move away from Lambda while ensuring minimum disruption to end users. I will use real-world examples from tech companies as case studies. By the end of this presentation, you will walk away with a concrete roadmap for designing a data platform built on Kappa architecture.

Interview:

What's the focus of your work these days?

I work as a Staff Engineer at Chime, a leading fintech company. In this role I’m helping build out a robust data platform that can serve analytics, fraud and risk, machine learning, experimentation, etc. My main focus has been to give our users the capabilities to derive real-time insights from our data. Apart from that, I’ve been focussing a lot on data governance and discovery as well as the usability of our data platform. Basically, figuring out how we can provide value to our end users so they can find and use data with speed and ease without worrying about the internals of it all.

This is probably a very wordy way to say my current role touches upon all aspects of data :)

On the side, I have also been advising nonprofits, helping them scale their data infrastructure and enable a platform for ML and AI exploration. Very recently I worked with NASA’s SpaceML program to productionize a self-supervised learning system that ingested petabytes of Earth imagery and allowed scientists to search these images for weather phenomena such as hurricanes.

 

What's the motivation for your talk at QCon New York 2023?

I’ve been in the data streaming world for a long time and have worked with Flink, Kafka, AWS Kinesis, Lambda architecture, Kappa architecture, all the fun stuff. One common problem I’ve seen in organizations is that they struggle to evolve their existing architecture to make use of cutting-edge streaming technologies like Flink, Kafka infinite retention, the Beam model and so on. A big reason for this is the mindset shift it requires, not to mention the operational challenges that come with big change.


With this talk I hope to give people a framework to think about the goals of their data platform and work backwards to slowly evolve their architecture and modernize it with minimum disruption to their current way of doing things.

How would you describe your main persona and target audience for this session?

My talk is targeted towards staff engineers, architects, and product and organization leads. As for persona, anybody who is working with real-time data, streaming, or online training would gain something from this talk. However, folks who work for traditional data orgs where 80-100% of use cases are in batch, but who are thinking about modernizing their architecture, will get the most value from it.

Is there anything specific that you'd like people to walk away with after watching your session?

  1. Modern streaming technologies have evolved in leaps and bounds.
  2. Anyone can enable real-time applications in their organization without the level of operational overhead involved in traditional systems.
  3. Even though streaming and real time are hot topics, it does not mean batch data is going away. A typical organization should be equipped to support any kind of application on the latency and consistency spectrum.

What's something interesting that you've learned from a previous QCon?

I learned a lot of things, but my favorite talk was Laura Maguire’s talk about Exploring Costs of Coordination During Outages. I had never thought about incident response in that manner, and it shed so much light on what we could be focussing on.


Overall, I had a lot of great discussions about machine learning platforms and Kubernetes, and in general it was nice to hear about different problem spaces.


Speaker

Sherin Thomas

Staff Software Engineer @Chime

Sherin is a Software Engineer with over 12 years of experience at companies like Google, Twitter, Lyft, Netflix and Chime. She works in the field of Big Data, Streaming, ML/AI and Distributed Systems. Currently, she's building a shiny new data platform at Chime. Sherin has presented on the topic of ML and Streaming at various reputable conferences, including a keynote address, and has judged various awards such as the SXSW Innovation Awards and CES.

Recently she advised NASA's SpaceML program and helped build a platform for processing petabytes of satellite imagery for detecting weather patterns and labelling raw data for climate science related AI research. She also writes a blog where she shares her thoughts on technology, work and career.

When she's not doing technical stuff, she enjoys painting, reading, perusing the art and fashion sections of the New York Times and spending time with her husband and toddler.

Date

Tuesday Jun 13 / 10:35AM EDT (50 minutes)

Location

Salon E

Topics

Streaming, Data Architecture, Realtime Analytics

From the same track

Session Serverless

The Rise of the Serverless Data Architectures

Tuesday Jun 13 / 01:40PM EDT

For a while, it looked like Serverless was just a convenient way to run stateless functions in the cloud. But in the last year we’ve seen the rapid rise in serverless data stores.

Gwen Shapira

Founder @Nile, PMC Member @Kafka

Session Stream Processing

Streaming from Apache Iceberg - Building Low-Latency and Cost-Effective Data Pipelines

Tuesday Jun 13 / 11:50AM EDT

Apache Flink is a very popular stream processing engine featuring sophisticated state management, event-time semantics, and exactly-once state consistency. For low-latency processing, Flink jobs typically consume data from streaming sources like Apache Kafka.

Steven Wu

Software Engineer @Apple and Apache Iceberg PMC

Session Data Architecture

Building a Large Scale Real-Time Ad Events Processing System

Tuesday Jun 13 / 02:55PM EDT

Two years ago, we embarked on building DoorDash's ad platform from the ground up. Today, our platform handles over 2 trillion events every day and our advertising business has experienced significant growth in recent years, becoming a key area of focus for the company.

Chao Chu

Software Engineer @DoorDash

Session Architecture

Enabling Remote Query Execution Through DuckDB Extensions

Tuesday Jun 13 / 04:10PM EDT

DuckDB is a high-performance, embeddable analytical database system that has gained massive popularity in the last few years.

Stephanie Wang

Founding Engineer @MotherDuck

Session

Unconference: Modern Data Architecture & Engineering

Tuesday Jun 13 / 05:25PM EDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.

Ben Linders

Independent Consultant in Agile, Lean, Quality and Continuous Improvement