Snorkel Beambell - Real-time Weak Supervision on Apache Beam

The advent of Deep Learning models has led to massive growth in real-world machine learning. Deep Learning allows practitioners to achieve state-of-the-art scores on benchmarks without any hand-engineered features. However, these models rely on massive hand-labeled training datasets, which is a bottleneck in developing and modifying machine learning models. Most large-scale machine learning systems today, like Google’s DryBell, use some form of Weak Supervision to construct lower-quality but large-scale training datasets that can be used to continuously retrain and deploy models in real-world scenarios.
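
As a rough illustration of the idea (not the Beambell implementation itself), the sketch below applies a couple of hand-written labeling functions to unlabeled text inside a Beam pipeline and takes a simple majority vote to produce noisy training labels; the labeling functions, labels, and example data are invented for the illustration.

```python
import apache_beam as beam

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_mentions_great(text):
    # cheap heuristic labeling function: positive if the text mentions "great"
    return POSITIVE if 'great' in text.lower() else ABSTAIN

def lf_mentions_awful(text):
    # cheap heuristic labeling function: negative if the text mentions "awful"
    return NEGATIVE if 'awful' in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_great, lf_mentions_awful]

def weak_label(text):
    # collect the non-abstaining votes and take a simple majority
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    label = max(set(votes), key=votes.count) if votes else ABSTAIN
    return text, label

with beam.Pipeline() as p:
    (p
     | beam.Create(['This product is great', 'Awful experience', 'No opinion here'])
     | beam.Map(weak_label)                        # (example, noisy label) pairs
     | beam.Filter(lambda kv: kv[1] != ABSTAIN)    # drop examples no function covered
     | beam.Map(print))
```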

Streaming, fast and slow

What if you were given two weeks to prepare for running your first marathon, and had to keep up with the fastest runner? This is the story of a small team that jumped headfirst into building a top-tier stream processing service on a tight timeline, and how they prepared to keep up with the fastest components. This talk will interest those who want to build larger, more complex Beam and stream processing applications (and quickly), but will also benefit anyone looking to adopt a new streaming technology (or use it in a way that is new to their company), at any scale.

The nitty-gritty of moving data with Beam

In this session, you won’t learn about joins or windows or timers or any other advanced features of Beam. Instead, we will focus on the real-world complexity that comes from simply moving data from one system to another safely. How do we model data as it passes from one transform to another? How do we handle errors? How do we test the system? How do we organize the code to make the pipeline configurable for different source and destination systems?
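
As a taste of the error-handling question, here is a minimal, hypothetical sketch (not the speakers' code) of a common Beam pattern: a parsing DoFn that routes records it cannot handle to a dead-letter output instead of failing the whole pipeline.

```python
import json

import apache_beam as beam
from apache_beam import pvalue

class ParseRecord(beam.DoFn):
    DEAD_LETTER = 'dead_letter'

    def process(self, element):
        try:
            yield json.loads(element)   # happy path: the parsed record
        except Exception as err:
            # route the raw payload plus the error to a dead-letter output
            yield pvalue.TaggedOutput(self.DEAD_LETTER, (element, str(err)))

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create(['{"id": 1}', 'not-json'])
        | beam.ParDo(ParseRecord()).with_outputs(ParseRecord.DEAD_LETTER, main='parsed'))

    results.parsed | 'WriteGood' >> beam.Map(print)
    results.dead_letter | 'WriteBad' >> beam.Map(lambda kv: print('dead letter:', kv))
```

In a real pipeline the dead-letter output would typically be written to durable storage for inspection and replay rather than printed.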

Understanding exactly-once processing and windowing in streaming pipelines

In this session, we will explore different windowing strategies and their impact on whether late data is dropped. Using a unit test with a synthetic stream, we will apply different windows and run a grouping calculation (an aggregation) to check how each window groups the data and how the window parameters affect that grouping.
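
A minimal sketch of that setup, assuming the Python SDK (the element values, window size, and lateness are illustrative, not the speakers' test): a TestStream supplies a synthetic stream with one late element, and a fixed window with a late-firing trigger and allowed lateness shows how the aggregation regroups, or drops, that element.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms import trigger, window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

# Two on-time events and one late event for the window [0, 10).
events = (
    TestStream()
    .add_elements([window.TimestampedValue(('user1', 1), 0),
                   window.TimestampedValue(('user1', 1), 5)])
    .advance_watermark_to(20)   # the watermark passes the end of window [0, 10)
    .add_elements([window.TimestampedValue(('user1', 1), 7)])  # arrives late
    .advance_watermark_to_infinity())

with beam.Pipeline(options=options) as p:
    (p
     | events
     | beam.WindowInto(
         window.FixedWindows(10),
         trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
         accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
         allowed_lateness=30)   # set to 0 to watch the late element get dropped
     | beam.CombinePerKey(sum)
     | beam.Map(print))
```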

Using Apache Beam to process CDC Streams

In this talk, we will share our experience building pipelines to process data coming from change data capture (CDC) systems. We will review different features, including dynamically adding new tables, managing schema evolution, supporting user-provided map functions, dead-letter queue designs, and running DML to ensure consistency. We will share lessons learned and pointers on how to try out our solution.
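
As one simplified illustration of the consistency point (my sketch, not the speakers' pipeline), the snippet below keys CDC change records by table and primary key and keeps only the newest change per key, so the destination reflects the latest state; the record layout is invented for the example.

```python
import apache_beam as beam

# toy CDC change records: two changes to the same row and one to another row
changes = [
    {'table': 'users', 'pk': 1, 'op': 'INSERT', 'ts': 100, 'row': {'name': 'Ada'}},
    {'table': 'users', 'pk': 1, 'op': 'UPDATE', 'ts': 200, 'row': {'name': 'Ada L.'}},
    {'table': 'users', 'pk': 2, 'op': 'INSERT', 'ts': 150, 'row': {'name': 'Alan'}},
]

def latest_change(records):
    # keep the change with the highest source timestamp for this key
    return max(records, key=lambda r: r['ts'])

with beam.Pipeline() as p:
    (p
     | beam.Create(changes)
     | beam.Map(lambda r: ((r['table'], r['pk']), r))   # key by (table, primary key)
     | beam.GroupByKey()
     | beam.MapTuple(lambda key, recs: latest_change(list(recs)))
     | beam.Map(print))
```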

Using Cross-Language pipeline to run Python 3 code with Java SDK

There are many reasons why we would need to execute Python code in Java SDK pipelines and vice versa (e.g. Machine Learning libraries, IO connectors, user-written Python code), and several different ways to do that. With Python 2 reaching its end of life this year, this is getting more challenging, since not all of the old solutions still work well for Python 3. One potential option is to use a Cross-Language pipeline with the Portable Runner in Apache Beam.
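
To illustrate the mechanism (in the opposite direction to the talk, since it is easier to show briefly from Python), the sketch below uses the Python SDK's ReadFromKafka, a cross-language wrapper around the Java KafkaIO transform that is expanded through an expansion service and executed on a portable runner; the broker address and topic name are placeholders.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# Requires a portable runner (e.g. a Flink cluster) and a reachable Kafka broker.
options = PipelineOptions(['--runner=FlinkRunner', '--streaming'])

with beam.Pipeline(options=options) as p:
    (p
     | ReadFromKafka(                                   # expands to the Java KafkaIO
           consumer_config={'bootstrap.servers': 'localhost:9092'},
           topics=['events'])
     | beam.Map(print))
```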

Using Open Source Tech to Swap Out Components of Your Data Pipeline

A few years ago, moving data between applications and datastores meant expensive, monolithic stacks from large software vendors, with little flexibility. Now, with frameworks such as Apache Beam and Apache Airflow, we can schedule and run data processing jobs for both streaming and batch with the same underlying code. This presentation demonstrates how these frameworks can glue your applications together, and shows how we can run a data pipeline from Apache Kafka through Apache Flink on Hadoop to Hive, and move it to Pub/Sub, Dataflow, and BigQuery by changing a few lines of Java in our Apache Beam code.
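
The talk's code is in Java, but the idea can be sketched in Python under assumed option and topic names: the shared transform logic stays put, and only the source and the runner flag change between the Kafka-plus-Flink deployment and the Pub/Sub-plus-Dataflow deployment.

```python
import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def build_source(use_cloud):
    if use_cloud:
        # cloud deployment: read from a Pub/Sub topic (placeholder name)
        return ReadFromPubSub(topic='projects/my-project/topics/events')
    # on-prem deployment: read from a Kafka topic (placeholder broker and topic)
    return ReadFromKafka(
        consumer_config={'bootstrap.servers': 'localhost:9092'}, topics=['events'])

def run(use_cloud=False):
    runner = 'DataflowRunner' if use_cloud else 'FlinkRunner'
    options = PipelineOptions(['--runner=%s' % runner, '--streaming'])
    with beam.Pipeline(options=options) as p:
        (p
         | build_source(use_cloud)
         | beam.Map(lambda record: record)   # the shared business logic stays the same
         | beam.Map(print))                  # swap for WriteToBigQuery or a Hive sink
```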
