API Sink Connector: A Configurable Beam Pipeline for Connecting to Third-Party Data Stores

At Yelp, we use Apache Flink and Beam to build generic tools for performing common streaming operations. Feature team developers can then use these tools to build complex systems without needing to learn all the details of our streaming infrastructure. One common operation at Yelp is moving data from Kafka to external, third-party data stores (like Salesforce or Oracle) via public APIs. This requires a number of complex operations, such as transforming data with business logic, batching data into bulk requests, and ordering data to resolve foreign key dependencies.
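
To make the shape of such a pipeline concrete, here is a minimal sketch (not Yelp's actual connector) of reading from Kafka, applying a business-logic transform, and grouping records into bulk API requests with Beam's GroupIntoBatches. The broker address, topic, batch size, and the ToApiRecord and BulkApiWriteFn classes are hypothetical stand-ins.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class ApiSinkSketch {

  // Hypothetical business logic: shape a Kafka record into the payload the
  // third-party API expects, keeping the key so records can be batched per key.
  static class ToApiRecord extends SimpleFunction<KV<String, String>, KV<String, String>> {
    @Override
    public KV<String, String> apply(KV<String, String> record) {
      return KV.of(record.getKey(), "{\"payload\": \"" + record.getValue() + "\"}");
    }
  }

  // Hypothetical sink: issue one bulk request to the third-party API per batch.
  static class BulkApiWriteFn extends DoFn<KV<String, Iterable<String>>, Void> {
    @ProcessElement
    public void processElement(@Element KV<String, Iterable<String>> batch) {
      // e.g. POST batch.getValue() to the bulk endpoint here.
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("kafka:9092")   // placeholder broker
            .withTopic("source-topic")            // placeholder topic
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())                   // yields KV<String, String>
        .apply(MapElements.via(new ToApiRecord()))
        // Buffer up to 100 records per key into one bulk request; flush
        // incomplete batches after 30 seconds so nothing waits forever.
        .apply(GroupIntoBatches.<String, String>ofSize(100)
            .withMaxBufferingDuration(Duration.standardSeconds(30)))
        .apply(ParDo.of(new BulkApiWriteFn()));
    p.run().waitUntilFinish();
  }
}
```

GroupIntoBatches is a natural fit for this use case because most bulk APIs cap the number of records per request.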

Beam schema options

What's up with all the metadata on the metadata? This deep dive into BIP-1 (Beam Improvement Proposal 1) tells the story of Schema Options, which give extra context to row fields and schemas. I'll dive into the proposal and break down the potential uses, and I can already give some real-world examples of how we're using these options in our Beam-based ingestion platform.
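
To make that concrete, here is a minimal sketch of attaching an option to a schema field with the Java SDK and reading it back. The option name example:pii and the PII use case are my own illustration, not from the post, and the exact builder methods may differ across Beam versions.

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.Field;
import org.apache.beam.sdk.schemas.Schema.FieldType;

public class SchemaOptionsSketch {
  public static void main(String[] args) {
    // An option is typed metadata attached to a field (or a whole schema).
    Schema.Options piiOption =
        Schema.Options.builder()
            .setOption("example:pii", FieldType.BOOLEAN, true) // hypothetical option name
            .build();

    Schema schema =
        Schema.builder()
            .addField(Field.of("user_id", FieldType.STRING))
            .addField(Field.of("email", FieldType.STRING).withOptions(piiOption))
            .build();

    // Downstream code can inspect the metadata and act on it,
    // e.g. mask fields tagged as PII before writing them out.
    Boolean isPii =
        schema.getField("email").getOptions().getValue("example:pii", Boolean.class);
    System.out.println("email is PII: " + isPii);
  }
}
```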

Beam visual pipeline development with Hop

The open source project Hop aims to make anything related to data orchestration easier, and it recently added visual pipeline development for Apache Beam. This session goes over the architecture and the possibilities, with live demos. Learn how easy it is to get those Beam pipelines up and running, and find out how you can tune for performance in batch and streaming modes.

Building stateful streaming pipelines with Beam

Building a streaming data platform from the ground up is a daunting yet fun task. This session walks developers through GoDaddy's journey of building production pipelines at scale, using Apache Beam as the foundation layer for writing pipelines and building abstractions on top of it that make it easier to onboard new pipelines onto the data platform. Beam supports deploying the same pipeline code on multiple runners; we'll talk about how we extensively run the same pipeline code in batch and streaming modes on Flink and Spark, supporting both our legacy on-premise cluster and pipelines in the cloud.
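
For a taste of what "stateful" means here, below is a minimal sketch (my own illustration, not GoDaddy's code) of Beam's state API: a stateful DoFn that keeps a per-key running count in ValueState.

```java
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Hypothetical stateful DoFn: emits a running count of events per key.
public class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {

  // Per-key mutable state, persisted and restored by the runner.
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(
      @Element KV<String, String> element,
      @StateId("count") ValueState<Long> count,
      OutputReceiver<KV<String, Long>> out) {
    Long current = count.read();
    long next = (current == null ? 0L : current) + 1;
    count.write(next);
    out.output(KV.of(element.getKey(), next));
  }
}
```

Because the state API is runner-agnostic, the same DoFn runs unchanged under the Flink or Spark runner; switching is a matter of pipeline options, e.g. --runner=FlinkRunner versus --runner=SparkRunner.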

Continuous retraining with TFX and Beam

In this session we will walk through using Apache Beam and TensorFlow Extended (TFX) to build a Beam pipeline that continuously outputs time-series data, which TFX then uses to retrain a model.
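
As a sketch of the Beam half of that loop (my own illustration; the session's actual source, sink, and window size will differ), the pipeline below windows an unbounded stream into fixed intervals and writes one file set per window, which a downstream TFX pipeline could pick up for retraining.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class TimeseriesOutputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(GenerateSequence.from(0)
            .withRate(1, Duration.standardSeconds(1)))      // unbounded test source
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((Long n) -> Instant.now() + "," + n))      // hypothetical time-series record
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(TextIO.write()
            .to("/tmp/timeseries/part")                     // hypothetical output path
            .withWindowedWrites()                           // one file set per 5-minute window
            .withNumShards(1));
    p.run().waitUntilFinish();
  }
}
```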
