Track: Data Engineering for the Bold

Location: Majestic Complex, 6th fl

Day of week: Tuesday

Data engineering is the practice of delivering high-fidelity, custom access to data in order to serve the varied needs of a business. The rich and engaging experiences many of us expect online today (e.g., personalized news feeds, highly relevant search engines and recommender systems, smart home assistants) are powered by modern data pipelines and architectures that form the foundation of data engineering. The tools a data engineer can deploy today occupy a vast landscape. The field may have started out as “put all of your data in that RDBMS over there”, but it has evolved into a multitude of specialty data solutions. It encompasses databases (RDBMS, NoSQL, NewSQL, OLAP DBs, etc.), messaging systems (Kafka, Kinesis, Pulsar), data compute frameworks (Spark, Flink, Ray, graph compute), storage systems (distributed file systems, block storage, object storage), search engines, real-time OLAP engines, graph DBs, machine learning frameworks (PetaStorm, Michelangelo), and more. As the volume and speed of data grow, we continue to discover new patterns and frameworks for squeezing more out of our data. What are some of the new entrants in this space, and what interesting problems are being solved with them? Come to this track to find out!

Track Host: Sid Anand

Chief Data Engineer @PayPal, PMC & Committer for Apache Airflow, Co-Chair for QCon

Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions, including Data Architect at Agari, Technical Lead in Search at LinkedIn, Cloud Data Architect at Netflix, VP of Engineering at Etsy, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their two kids.

10:35am - 11:25am

Scaling DB Access for Billions of Queries Per Day @PayPal

As microservices scale and proliferate, they place increasing load on databases in terms of connections and resource usage. VulcanMX, open sourced in the Go programming language, scales thousands of PayPal’s applications with connection multiplexing, read-write split, and sharding. This talk covers the approaches taken over the years to handle rapid growth in application connections and OLTP database utilization. Beyond pure connection and query scaling, VulcanMX also offers features for better manageability: automatic SQL eviction and DBA maintenance control make it easier to operate hundreds of databases.

Petrica Voicu, Software Engineer @PayPal
Kenneth Kang, Software Engineer @PayPal

11:50am - 12:40pm

A Dive Into Streams @LinkedIn With Brooklin

Nearline (near real-time) applications power many critical services at LinkedIn, such as live search indices, notifications, and ads targeting. These applications require continuous, low-latency access to data, which is often spread across various data stores and messaging systems: Espresso, Oracle, MySQL, HDFS, Kafka, Azure Event Hubs, and AWS Kinesis. Building separate, specialized solutions for each combination of application and dataset is not sustainable, as it slows down development and makes the infrastructure unmanageable. We wanted to enable application developers to focus solely on processing events, not on building and managing the pipelines that stream data. This called for a centralized, managed solution that can continuously deliver data to applications in near real-time.

We built Brooklin to address LinkedIn’s needs for streaming data. Brooklin is a managed data streaming service that supports multiple pluggable sources and destinations, which can be data stores or messaging systems. Since 2016, Brooklin has been running in production as a critical piece of LinkedIn’s streaming infrastructure, and it supports a variety of data movement use cases, such as change data capture (CDC) and bridging data between different systems and environments. We have also leveraged Brooklin for mirroring Kafka data, replacing Kafka MirrorMaker at LinkedIn. In this talk, we will dive deeper into Brooklin’s architecture and use cases, as well as our future plans.

Celia Kung, Data Infrastructure @LinkedIn

1:40pm - 2:30pm

CockroachDB: Architecture of a Geo-Distributed SQL Database

In this talk, Cockroach Labs' CTO and co-founder, Peter Mattis, will walk through the architecture of an open-source, geo-distributed SQL database. The talk will be a whirlwind tour of CockroachDB’s internals, covering the use of Raft for consensus, the challenges of data distribution, distributed transactions, distributed SQL execution, and distributed SQL optimizations.

Peter Mattis, CockroachDB maintainer, Co-founder & CTO @Cockroach Labs

2:55pm - 3:45pm

Data Engineering Presentation

Details to follow.

4:10pm - 5:00pm

Data Engineering Presentation

Details to follow.

5:25pm - 6:15pm

Data Engineering Presentation

Details to follow.

Tracks

Monday, 24 June

Tuesday, 25 June

Wednesday, 26 June

  • Architecting For Failure

    More than just building software: building deployable, production-ready software in the face of guaranteed failure.

  • 21st Century Languages

    Lessons learned from building languages like Rust, Go, Swift, Kotlin, and more.

  • Building High-Performing Teams

    What “high-performing team” means, and how to build one effectively, depends on context. This track will share different experiences of building high-performing teams to highlight how different contexts lead to different solutions, and also what typically stays the same because we’re still dealing with humans trying to work together. How do different forces affect the building of high-performing teams?

  • Software Defined Infrastructure: Kubernetes, Service Meshes, & Beyond

    Deploying, scaling, and managing your services is undifferentiated heavy lifting. Hear stories, learn techniques, and dive deep into software infrastructure.

  • High-Performance Computing: Lessons from FinTech & AdTech

    Killing latency and getting the most out of your hardware.