Simpler Python Pipelines with Schemas, SQL, and Dataframes

(Aug-26 20:20 UTC)

While the Beam Python SDK is highly scalable and has advanced streaming capabilities, its unfamiliar API has been a significant barrier for many Python users. Several recent usability improvements aim to close this gap: first-class support for schemas, the ability to embed SQL in a pipeline, and a pandas-compatible DataFrame API.

In this talk we will introduce these new APIs and show how they can be used. We will then discuss some of the details of their implementation, including the challenges involved in writing a distributed, faithful, drop-in replacement for pandas, as well as how we leverage Beam’s portability framework and cross-language transforms to execute SQL with Java.
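
To give a flavor of why a distributed drop-in replacement for pandas is hard, here is a minimal sketch in plain Python (no Beam or pandas required) of a grouped mean computed over partitioned data. It is only an illustration, not Beam's implementation: the sample rows and the helper names `partial_sums`, `merge`, and `grouped_mean` are hypothetical. The point is that each worker sees only part of the data, so an operation such as a mean has to be decomposed into per-partition partial results that can be merged afterwards.

```python
from collections import defaultdict

# Hypothetical data: (key, value) rows split across two workers' partitions.
partitions = [
    [("a", 1.0), ("a", 2.0), ("b", 10.0)],
    [("a", 6.0)],
]


def partial_sums(rows):
    """Per-partition step: accumulate a [sum, count] pair for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def merge(partials):
    """Combine step: add up the [sum, count] pairs from every partition."""
    total = defaultdict(lambda: [0.0, 0])
    for acc in partials:
        for key, (s, n) in acc.items():
            total[key][0] += s
            total[key][1] += n
    return total


def grouped_mean(parts):
    """Grouped mean over all rows, computed from mergeable partial results."""
    merged = merge(partial_sums(p) for p in parts)
    return {key: s / n for key, (s, n) in merged.items()}


result = grouped_mean(partitions)
print(result)
```

Averaging the per-partition means for key "a" (1.5 and 6.0) would give 3.75, whereas the mean over all three rows is 3.0. Carrying sums and counts through the merge step is what keeps the partitioned result faithful to the single-machine one.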

Brian Hulette
Software Engineer at Google and Apache Beam Committer
Robert Bradshaw
Software Engineer at Google