NEXMark-Beam: Your best companion for testing and benchmarking new core stream processing libraries

We will provide an overview of FlinkNDB, a new type of state backend for scalable and ultra-flexible stateful data streaming that we are developing as part of our master's thesis at RISE Research Institutes of Sweden. We will also present how we leveraged Apache Beam's capabilities to speed up development of the new Flink state backend, and showcase its performance using the NEXMark benchmarks and failure-recovery scenarios.

Office hours: Java SDK, Go, Spark, Flink

A team of Apache Beam creators and committers will be hosting office hours where they will answer any questions you may have regarding the different components of Apache Beam.

18:00-19:00 Java SDK & SQL (Kenneth Knowles)
18:00-19:00 Go (Robert Burke)
19:00-20:00 Spark (Kyle Weaver)
19:00-20:00 Flink (Ankur Goenka, Maximilian Michels)

Join at the Beam Summit virtual space on Gather Town: https://sg1.run/beam-summit

Office hours: Python SDK, Ease of use, Beam model, Beam connectors

A team of Apache Beam committers & developers will be hosting office hours where they will answer any questions you may have regarding the different components of Apache Beam.

18:00-19:00 Beam ease of use (Alex Amato & Daniel Mills)
18:00-19:00 Python SDK (Pablo Estrada)
19:00-20:00 Beam model (Robert Bradshaw)
19:00-20:00 Beam connectors (Chamikara Jayalath)

Join at the Beam Summit virtual space on Gather Town: https://sg1.run/beam-summit

PyFlink on Beam: How does it actually work?

Beam’s portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and the runner, ensuring that any SDK and any runner can work together uniformly. At the execution layer, it provides the Fn API for executing language-specific user-defined functions. The Fn API is highly abstract and includes several generic components, such as the control, data, state, and logging services, which make it useful not only for Beam but also for third-party projects that require multi-language support.
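
For a sense of what this looks like from the user's side, here is a minimal sketch of submitting a Python pipeline through the portability framework to a Flink job server; the localhost:8099 endpoint is an assumption for a job server already running locally.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Assumes a Flink job server is already listening on localhost:8099.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    # LOOPBACK runs the SDK harness (the Fn API worker) inside this
    # process, which is convenient for local experimentation.
    "--environment_type=LOOPBACK",
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "portable", "beam"])
     | beam.Map(str.upper)  # user code executed in the SDK harness via the Fn API
     | beam.Map(print))
```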

Replacing Your Database with Beam Pipelines

We use streaming Beam pipelines to index data into an object store (S3) and compute summaries of new data with periodic batch pipelines. These summaries are analogous to indices in a database and allow for efficient lookup along a fixed set of axes. The result is a scalable and inexpensive data storage solution for applications that serve data with limited query patterns.
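
As a rough sketch of the pattern (not the production pipelines themselves; the topic, bucket paths, and key field below are hypothetical placeholders):

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Streaming half: window incoming events and write a group of files per
# window into the object store.
with beam.Pipeline(options=PipelineOptions(["--streaming"])) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/demo/topics/events")
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "Window" >> beam.WindowInto(FixedWindows(300))  # 5-minute files
     | "Write" >> fileio.WriteToFiles(path="s3://demo-bucket/events/"))

# Batch half, run periodically: scan the new files and compute a per-key
# summary that plays the role of a database index for one lookup axis.
with beam.Pipeline() as p:
    (p
     | "ReadFiles" >> beam.io.ReadFromText("s3://demo-bucket/events/*")
     | "KeyByAxis" >> beam.Map(lambda line: (line.split(",")[0], 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda key, n: f"{key},{n}")
     | "WriteIndex" >> beam.io.WriteToText("s3://demo-bucket/index/keys"))
```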

Scaling Python Portable Pipelines at LinkedIn

Stream processing has always been a core component of LinkedIn's infrastructure, with over a trillion messages processed each day. In this talk we will share how we expanded our stream processing capabilities by adopting the Beam Python portability framework with the Beam Samza runner, bringing stream processing in Python to a variety of new use cases. We will start with an overview of the Beam Samza runner and the progress we have made so far integrating Beam portable pipelines.

Sharing good practices for an end-to-end Beam pipeline

Learning a whole new SDK and style for writing massively parallel data pipelines is a daunting task for any engineer, data scientist, or developer. We have built an end-to-end example of an Apache Beam pipeline that uses a variety of valuable Beam features and tries to establish good practices for writing your own pipelines. If you're interested in learning about good practices for your Beam pipeline, this talk is for you.
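
One widely recommended practice worth illustrating here (the sketch below uses made-up names and is not the talk's own example) is packaging reusable logic as a composite PTransform so it can be unit-tested in isolation with Beam's testing utilities.

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

class CountWords(beam.PTransform):
    """Composite transform: names a reusable step and keeps it testable."""
    def expand(self, lines):
        return (lines
                | "Split" >> beam.FlatMap(str.split)
                | "Count" >> beam.combiners.Count.PerElement())

# The composite can be exercised on its own with Beam's test utilities.
def test_count_words():
    with TestPipeline() as p:
        counts = p | beam.Create(["a b", "a"]) | CountWords()
        assert_that(counts, equal_to([("a", 2), ("b", 1)]))
```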

Simpler Python Pipelines with Schemas, SQL, and Dataframes

While the Beam Python SDK is highly scalable and has advanced streaming capabilities, its unfamiliar API has been a significant barrier for many Python users. Recently there have been several usability improvements aimed at closing this gap: first-class support for schemas, the ability to embed SQL in a pipeline, and a pandas-compatible DataFrame API. In this talk we will introduce these new APIs and show how they can be used. We will then discuss some of the details of their implementation, including the challenges involved in writing a distributed, faithful, drop-in replacement for pandas, as well as how we are able to leverage Beam's portability framework and cross-language transforms to execute SQL with Java.
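
A rough sketch of how the three features can fit together in one pipeline (the data and field names are made up; SqlTransform expands to Beam's Java SQL engine via a cross-language expansion service, so a Java runtime is required):

```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as p:
    # Schemas: beam.Row elements give the PCollection named, typed fields.
    orders = p | beam.Create([
        beam.Row(item="apple", price=1.5),
        beam.Row(item="pear", price=2.0),
        beam.Row(item="apple", price=1.0),
    ])

    # SQL: executed in Java through a cross-language expansion service,
    # which SqlTransform starts automatically.
    totals = orders | SqlTransform(
        "SELECT item, SUM(price) AS total FROM PCOLLECTION GROUP BY item")

    # DataFrames: a deferred, pandas-compatible view over the same schema.
    df = to_dataframe(orders)
    by_item = df.groupby("item").price.sum()
```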
