Presentation: Apache Beam: The Case for Unifying Streaming APIs

Duration: 10:35am - 11:25am

Key Takeaways

  • Hear about how the adoption of streaming platforms is driving a universal API called Apache Beam.
  • Learn the benefits of adopting Apache Beam and how it improves your company’s flexibility.
  • Gain deeper insights into how streaming products like Spark and Flink operate.

Abstract

Our need for real-time data is growing at an unprecedented rate; it is only a matter of time before you are faced with building a real-time streaming pipeline. Often a key decision you must make quickly is which stream-processing framework to use. What if instead you could use a unified API that lets you express complex data-processing workflows, including advanced windowing, event-time processing, and aggregate computations? Apache Beam aims to provide this unified model, along with a set of language-specific SDKs for defining and executing complex data-processing, data-ingestion, and integration workflows. This simplifies, and will truly change, how we implement and think about large-scale batch and streaming data processing. Today these pipelines can run on Apache Flink, Apache Spark, and Google Cloud Dataflow.
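
To make the model concrete, here is a minimal sketch of a Beam pipeline using the Java SDK (the classic word count; the input path and output prefix are hypothetical, and the runner is chosen at launch time rather than in the pipeline code):

    import java.util.Arrays;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.FlatMapElements;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class WordCountPipeline {
      public static void main(String[] args) {
        // Execution details (e.g. --runner=FlinkRunner) come from the command line;
        // the pipeline definition below stays the same on every engine.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))         // hypothetical input path
         .apply("SplitWords", FlatMapElements
             .into(TypeDescriptors.strings())
             .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply("CountWords", Count.perElement())
         .apply("FormatResults", MapElements
             .into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("WriteCounts", TextIO.write().to("counts"));          // hypothetical output prefix

        p.run().waitUntilFinish();
      }
    }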

This is only the start. Come to this session to learn where the future of streaming APIs is headed, and get ready to leverage Apache Beam for your next streaming project.

Interview

Question: 
Apache Beam is pretty new, but as I understand it, it’s a common API that can run on top of Spark, Flink, and Google Cloud Dataflow. Is that right?
Answer: 
Apache Beam is a new open-source project, a donation from Google to the community. The heart of it is starting to have a common API for how we talk about streaming.
As streaming technologies continue to grow, if you look across the landscape at just the open-source projects, there is a new one every month. The problem is that each of them has a different API to learn and a different vocabulary.
Everyone talks about things differently. Everyone has different APIs. Some of them get close: Flink and Spark are close when you start to leverage what is there. But everything ends up being different. People talk about it differently. People use it differently. If you switch from one to the other, you are rewriting everything.
Beam is a tremendous step in the direction we need to go, long term. Maybe it’s not the final answer, but it is definitely taking us where we need to go to take streaming to the next level, and really have developers and businesses take advantage of it.
Beam provides a common abstraction, a common way for a developer to talk about streaming, how you interact with streaming systems, and how you use them. It is going to end up being the JDBC for streaming.
Question: 
What level of complexity is it at? Is there a higher level of complexity because you have this abstraction?
Answer: 
From the developer’s standpoint, you end up with a consistent level of complexity and abstraction, and then you have runners. It hides some of the complexity.
Obviously, it is going to introduce some complexity because certain things are going to be new, so you have to get your head around the way things are presented in Beam. But it abstracts away the complexity of the underlying systems, so you don’t have to know all of their internals.
Question: 
Does it cover the full capabilities of each of these different streaming frameworks? Or not frameworks, but streaming engines?
Answer: 
No. Nor could it, right? I think that’s the harder question: where is the commonality, and where do these products differentiate? It’s hard to have that consistency when you have everyone going after machine learning in streaming, graph processing in streaming, and these other things.
Storm now has windowing. I am looking forward to trying to contribute to Beam and see if we can add support for Storm, now that it can do windowing. What do you do with that? What about something like Kafka Streams? Is it possible to get that to work? You have to figure out where this base level is. You will never cover everything. It is the same thing in the traditional relational world: everyone speaks SQL, everyone works with JDBC/ODBC, but everyone has their extra secret sauce. Beam is never going to cover all the secret sauce in the product differentiation, but it will be a good abstraction over the common use cases.
Question: 
So how much of the common use cases does it cover? If you take Dataflow and extend that across Spark and Flink, does that pretty much cover it, or what is the percentage? Where is the boundary of what you can do with Beam across all these different platforms?
Answer: 
I think the boundary is the windowing and the computations that you can express. They are not all going to be complete, but I think that boundary really is the windowing and the computation.
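As a hedged illustration of what lives inside that boundary (the `events` collection and its element type are assumptions for the example, not something from the interview), a fixed event-time window with a per-key aggregate is expressed in the Beam Java SDK like this, independent of the engine underneath:

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Assume `events` is a PCollection<KV<String, Long>> whose elements carry
    // event-time timestamps (for example, read from an unbounded source).
    PCollection<KV<String, Long>> perMinuteSums = events
        // Assign each element to a one-minute window based on its event time.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))))
        // Sum the values per key within each window.
        .apply(Sum.longsPerKey());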
Question: 
What’s the motivation for your talk?
Answer: 
It’s going to be about Beam, how it lays on top of that abstraction, why we need it, and potentially where it’s going. I am going to cover this API and where it is headed, and make the case that, regardless of whether Beam is the final solution, we need to be thinking about things in this fashion and collectively move towards something, whether that is Beam or a different incarnation of the same concept.
Question: 
What are the actionable benefits people are going to walk away with from the talk?
Answer: 
The actionable benefits are going to be two-fold. One, you will have an appreciation for Beam and an interest in using it for your next project. Second, even if Beam is not the right fit, you will have a deeper understanding of some of the concepts behind streaming systems and the streaming system you are working with.
How do you think about the concepts that Beam covers, and start to think that way when you are using your streaming tool of choice, even if there isn’t a Beam runner available for it? And maybe it will invigorate you to contribute to the community and keep the conversation going, so we get more people thinking about where we are heading.
I’d like for you to go home and be able to build a streaming application using Beam and have it run, at a minimum, against Spark and Flink, if not also Google Cloud Dataflow, and prove to yourself that this one abstraction will work across the available runners.
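A sketch of what that exercise looks like in practice (the runner class names come from the Beam runner modules; you would also need the matching runner dependency, such as beam-runners-flink or beam-runners-spark, on the classpath):

    import org.apache.beam.runners.flink.FlinkRunner;
    import org.apache.beam.runners.spark.SparkRunner;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // One pipeline definition, several engines: only the runner choice changes.
    PipelineOptions options = PipelineOptionsFactory.create();
    options.setRunner(FlinkRunner.class);    // execute on Apache Flink
    // options.setRunner(SparkRunner.class); // or on Apache Spark
    // Equivalently, pass --runner=FlinkRunner (or SparkRunner, DataflowRunner)
    // on the command line and leave the pipeline code untouched.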

Speaker: Andrew Psaltis

IoT/Data Architect working with Streaming Systems @Hortonworks

Andrew Psaltis has a wealth of experience in software design and in the construction of large-scale data-intensive systems. Andrew is deeply entrenched in streaming systems and obsessed with delivering insight at the speed of thought. As the author of Streaming Data (Manning, http://manning.com/psaltis/) and an international speaker and trainer, he spends most of his waking hours thinking about, writing about, and building streaming systems. When he's not busy being busy, he spends time with his lovely wife and two kids, and coaches and watches as much lacrosse as possible.
