API Sink Connector: A Configurable Beam Pipeline for Connecting to Third-Party Data Stores

At Yelp, we use Apache Flink and Beam to build generic tools for performing common streaming operations. Feature team developers can then use these tools to build complex systems without needing to learn all the details of our streaming infrastructure. One common operation at Yelp is moving data from Kafka to external, third-party data stores (like Salesforce or Oracle) via public APIs. This requires a number of complex operations, such as transforming data with business logic, batching data into bulk requests, and ordering data to resolve foreign key dependencies.
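
To make the shape of such a pipeline concrete, here is a minimal sketch (not Yelp's actual connector) of reading from Kafka, applying a business-logic transform, and grouping records into bulk API requests with Beam's GroupIntoBatches. The broker address, topic, batch size, and the ToApiRecord and BulkApiWriteFn classes are hypothetical stand-ins.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class ApiSinkSketch {

  // Hypothetical business logic: shape a Kafka record into the payload the
  // third-party API expects, keeping the key so records can be batched per key.
  static class ToApiRecord extends SimpleFunction<KV<String, String>, KV<String, String>> {
    @Override
    public KV<String, String> apply(KV<String, String> record) {
      return KV.of(record.getKey(), "{\"payload\": \"" + record.getValue() + "\"}");
    }
  }

  // Hypothetical sink: issue one bulk request to the third-party API per batch.
  static class BulkApiWriteFn extends DoFn<KV<String, Iterable<String>>, Void> {
    @ProcessElement
    public void processElement(@Element KV<String, Iterable<String>> batch) {
      // e.g. POST batch.getValue() to the bulk endpoint here.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")   // placeholder broker
            .withTopic("source-topic")            // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                   // yields KV<String, String>
        .apply(MapElements.via(new ToApiRecord()))
        // Buffer up to 100 records per key into one bulk request; flush
        // incomplete batches after 30 seconds so nothing waits forever.
        .apply(GroupIntoBatches.<String, String>ofSize(100)
            .withMaxBufferingDuration(Duration.standardSeconds(30)))
        .apply(ParDo.of(new BulkApiWriteFn()));
    p.run().waitUntilFinish();
  }
}
```

GroupIntoBatches is a natural fit for this use case because most bulk APIs cap the number of records per request.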

Beam schema options

What's up with all the metadata on the metadata? This deep dive into BIP-1 (Beam Improvement Proposal 1) tells the story of Schema Options, which give extra context to row fields and schemas. I'll dive into the proposal and break down the potential uses, and I can already give some real-world examples of how we're using these options in our Beam-based ingestion platform.
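
To make that concrete, here is a minimal sketch of attaching an option to a schema field with the Java SDK and reading it back. The option name example:pii and the PII use case are my own illustration, not from the post, and the exact builder methods may differ across Beam versions.

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.Field;
import org.apache.beam.sdk.schemas.Schema.FieldType;

public class SchemaOptionsSketch {
  public static void main(String[] args) {
    // An option is typed metadata attached to a field (or a whole schema).
    Schema.Options piiOption =
        Schema.Options.builder()
            .setOption("example:pii", FieldType.BOOLEAN, true) // hypothetical option name
            .build();

    Schema schema =
        Schema.builder()
            .addField(Field.of("user_id", FieldType.STRING))
            .addField(Field.of("email", FieldType.STRING).withOptions(piiOption))
            .build();

    // Downstream code can inspect the metadata and act on it,
    // e.g. mask fields tagged as PII before writing them out.
    Boolean isPii =
        schema.getField("email").getOptions().getValue("example:pii", Boolean.class);
    System.out.println("email is PII: " + isPii);
  }
}
```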

Beam visual pipeline development with Hop

The open source project Hop aims to make anything related to data orchestration easier, and it recently added visual pipeline development for Apache Beam. This session goes over the architecture and the possibilities, with live demos. Learn how easy it is to get those Beam pipelines up and running, and find out how you can tune for performance in batch and streaming modes.

Building stateful streaming pipelines with Beam

Building a streaming data platform from the ground up is a daunting yet fun task. This session walks developers through GoDaddy's journey of building production pipelines at scale, using Apache Beam as the foundation layer for writing pipelines and building abstractions on top of it that make it easier to onboard new pipelines onto the data platform. Beam supports deploying the same pipeline code on multiple runners; we'll talk about how we extensively run the same pipeline code in batch and streaming modes on Flink and Spark, supporting both our legacy on-premise cluster and pipelines in the cloud.
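
For a taste of what "stateful" means here, below is a minimal sketch (my own illustration, not GoDaddy's code) of Beam's state API: a stateful DoFn that keeps a per-key running count in ValueState.

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical stateful DoFn: emits a running count of events per key.
public class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {

  // Per-key mutable state, persisted and restored by the runner.
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("count") ValueState<Long> count,
      OutputReceiver<KV<String, Long>> out) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next);
    out.output(KV.of(element.getKey(), next));
  }
}
```

Because the state API is runner-agnostic, the same DoFn runs unchanged under the Flink or Spark runner; switching is a matter of pipeline options, e.g. --runner=FlinkRunner versus --runner=SparkRunner.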

Continuous retraining with TFX and Beam

In this session we will walk through using Apache Beam and TensorFlow Extended (TFX) to build a Beam pipeline that continuously outputs time-series data, which TFX then uses to retrain a model.
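
As a sketch of the Beam half of that loop (my own illustration; the session's actual source, sink, and window size will differ), the pipeline below windows an unbounded stream into fixed intervals and writes one file set per window, which a downstream TFX pipeline could pick up for retraining.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class TimeseriesOutputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(GenerateSequence.from(0)
            .withRate(1, Duration.standardSeconds(1)))      // unbounded test source
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((Long n) -> Instant.now() + "," + n))      // hypothetical time-series record
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(TextIO.write()
            .to("/tmp/timeseries/part")                     // hypothetical output path
            .withWindowedWrites()                           // one file set per 5-minute window
            .withNumShards(1));
    p.run().waitUntilFinish();
  }
}
```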
