Beam College 2026 will be on April 21-23, 2026.

All times are listed in UTC; convert to your local time zone as needed.

Tuesday, April 21, 2026 (Overview & Fundamentals)

15:00 - 15:25
Apache Beam: Powering the Future of Event-Driven Agents at Scale
By Reza Rokni

Where did Apache Beam start, and where is it taking us?

Join us as we dive into the origins of the Beam model and examine the critical shift toward a streaming-first data landscape. We’ll break down why, in a world that demands real-time context, robust streaming has become the essential nervous system for event-driven agents.

Discover how the unified Beam model enables autonomous agents to react, reason, and act in real time, turning raw, unbounded data streams into intelligent, decentralized action.

15:30 - 15:55
Authoring your first pipeline
By Raj Katakam

This hands-on session guides beginners through creating their first Apache Beam pipeline from scratch. We’ll start with core Beam concepts—PCollections, PTransforms, and the Pipeline object—then walk through a practical example building a data processing pipeline step by step. You’ll learn how to read data from sources, apply transformations like Map, FlatMap, and GroupByKey, and write results to sinks. The session covers common patterns, debugging techniques, and best practices for structuring your pipeline code. We’ll also explore how these foundational concepts translate to real-world MLOps scenarios like feature engineering pipelines and batch inference workflows. Whether you’re new to Beam or looking to integrate it into your ML platform, you’ll leave with the confidence to start building production-ready pipelines on runners like Dataflow.

Key points addressed:

  • Core Apache Beam concepts (PCollections, PTransforms, Pipeline)
  • Reading from sources and writing to sinks
  • Essential transformations (Map, FlatMap, GroupByKey, Combine)
  • Pipeline structure and best practices
  • Local testing and debugging strategies
  • Real-world applications in MLOps and production deployment

16:00 - 16:25
From Click to Chart: Building a Real-Time Analytics Engine
By Meghana Ganu

This session, “From Click to Chart,” demystifies the engineering behind real-time analytics. We will trace the lifecycle of a data point as it travels through a modern Google Cloud Platform (GCP) architecture, moving from ingestion to visualization in seconds. Using Apache Beam, we will explore how to build a unified pipeline that handles massive streams of data without the need to manage servers.

Key Points Addressed:

  • The “Why” of Streaming: Understanding the shift from traditional batch processing to real-time streaming.
  • The Architecture: A deep dive into the “Golden Path” stack: Cloud Pub/Sub (Ingest) → Dataflow (Process) → BigQuery (Store).
  • Apache Beam Fundamentals: Introduction to the unified programming model, including Windowing (how to group infinite data) and Watermarks (handling late data).

16:30 - 16:55
YAML: a new SDK to author your pipelines
By Svetak Sundhar

In this talk, we explore a new way of authoring and running your Beam pipelines: the YAML SDK! Learn how you can split your pipeline infrastructure from your complex processing logic.
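As a taste of the format, a minimal Beam YAML pipeline might look like the sketch below (the file paths and filter expression are illustrative; transform names and options should be checked against the Beam YAML reference):

```yaml
# pipeline.yaml: an illustrative sketch of the Beam YAML format
pipeline:
  transforms:
    - type: ReadFromCsv
      name: ReadOrders
      config:
        path: gs://my-bucket/orders/*.csv
    - type: Filter
      name: KeepLargeOrders
      input: ReadOrders
      config:
        language: python
        keep: "amount > 100"
    - type: WriteToJson
      name: WriteResults
      input: KeepLargeOrders
      config:
        path: gs://my-bucket/output/large_orders
```

A file like this can be launched with `python -m apache_beam.yaml.main --yaml_pipeline_file=pipeline.yaml` in recent Beam releases, keeping the pipeline definition declarative and separate from any custom processing logic.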

17:00 - 17:15
The Dataflow Job Builder
By Ryan Madden

Learn how you can create low-code and no-code Beam YAML jobs in the Cloud Dataflow UI.

Wednesday, April 22, 2026 (New Features)

15:00 - 15:25
Introducing Managed IO, the New Era of Beam Connectors
By Ahmed Abualsaud

Discover what makes Managed IO a major leap toward a more unified, flexible, and upgrade-friendly connector ecosystem. This talk dives into the motivation and design behind Managed IO, highlighting a few key goals:

  • Provide a consistent API across all connectors
  • Enable runners to seamlessly upgrade IOs, pulling in bug fixes and new features, all with zero user effort
  • Allow runners to fine-tune connector behavior for their environment

See it all in action with a Dataflow demo featuring IcebergIO as a case study, alongside examples with BigQueryIO and KafkaIO!

15:30 - 15:55
Scaling Iceberg Ingestion with Apache Beam
By Tom Stepp

This session explores the technical evolution of Apache Iceberg integration within the Apache Beam ecosystem. We dive into a suite of recent performance enhancements designed to streamline data lake ingestion at scale. Key topics include the adoption of table-defined compression for improved processing and storage efficiency, and the implementation of metadata caching to minimize lookups and prevent metadata service quota exhaustion. We also examine direct write capabilities that bypass expensive processing for large bundles, and autosharding mechanisms that optimize file sizes and ensure horizontal scalability.

Join us to learn how these enhancements establish an efficient and cost-effective performance baseline for streaming pipelines.

16:00 - 16:25
Getting Started with Remote ML Inference in Beam Java
By Ganesh Sivakumar

This session introduces the new Remote ML Inference transform in the Apache Beam Java SDK and shows how Java pipelines can run inference using external model services such as OpenAI. We’ll walk through how the transform works, how to use the OpenAI model handler with practical examples, and how to implement custom model handlers for other remote ML providers. The talk covers common usage patterns, framework extensibility, and includes a live demo so developers can quickly add remote ML inference capabilities to their Beam Java pipelines.

16:30 - 16:55
Real-Time Anomaly Detection with Apache Beam
By Shunping Huang

Real-time anomaly detection is essential for identifying unexpected patterns and critical events in streaming data. This talk addresses the unique algorithmic challenges of anomaly detection in streaming environments and introduces a new feature within Apache Beam designed for this purpose. We will demonstrate how to seamlessly integrate both online and pre-trained offline anomaly detection models into Beam pipelines, empowering users to build robust, scalable, and real-time anomaly detection systems.

17:00 - 17:25
Building Scalable Semantic Search and RAG Pipelines
By Claude Van der Merwe

This presentation introduces vector-based semantic search and Retrieval Augmented Generation (RAG), demonstrating how to build scalable pipelines using Apache Beam. We’ll start by explaining fundamental concepts like chunking, embeddings, and vector similarity. Then we’ll explore semantic search applications before extending to full RAG systems.

The presentation walks through implementing both semantic search and RAG pipelines using Apache Beam’s ML components, covering data ingestion, chunking, embedding generation, vector database integration, and similarity search. By the end, students will understand the theoretical foundations of both systems and have practical knowledge of how to implement them at scale using Apache Beam’s distributed processing capabilities.
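Two of the fundamentals mentioned above, chunking and vector similarity, can be sketched in plain Python. The chunk sizes are arbitrary, and a real pipeline would compute embeddings with a model rather than by hand:

```python
import math

def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping character chunks so context spans boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Retrieval then amounts to embedding the query and ranking stored chunks by cosine similarity, which is the operation a vector database performs at scale.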

Thursday, April 23, 2026 (Advanced Tips & Tricks)

15:00 - 15:25
Video Data Processing with Apache Beam
By Adesh Abhang

This session presents an architecture that leverages a Splittable DoFn for parallelized video ingestion, distributing frame extraction across workers before applying Sliding Window logic to generate temporal 3D tensors. It utilizes Beam’s RunInference API with a KeyedModelHandler for GPU-accelerated inference, ensuring robust state management. Finally, CoGroupByKey synchronizes asynchronous feature vectors with metadata, serializing the aligned dataset into TFRecord SequenceExamples for downstream training.

15:30 - 15:55
Real-Time Stateful Processing of Video Data
By Darshan Kanade & Aditya Shukla

You will learn how to build a pipeline that processes video data in real time to enable efficient analysis and event or anomaly detection. The pipeline:

  • Reads video streams or recorded footage from the source
  • Applies sliding-window analysis to examine activity over short intervals
  • Uses stateful processing to track patterns and detect anomalies over time

Finally, it raises alerts or stores annotated events in a structured format that can be used for monitoring dashboards or downstream analysis.

An example use case is monitoring traffic with surveillance footage to detect unusual events such as an accident or a disabled vehicle.
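A toy version of the sliding-window anomaly logic, using only the standard library (the window size and threshold are arbitrary; in Beam this state would be kept per key with the state API rather than in a Python object):

```python
from collections import deque
import statistics

class SlidingWindowDetector:
    """Flags values that deviate sharply from the recent window's distribution."""

    def __init__(self, window_size: int = 10, threshold: float = 3.0):
        self.window = deque(maxlen=window_size)  # bounded history of recent values
        self.threshold = threshold               # z-score above which we alert

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.window) >= 2:
            mean = statistics.mean(self.window)
            stdev = statistics.pstdev(self.window)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                is_anomaly = True
        self.window.append(value)  # update state after scoring
        return is_anomaly
```

Fed a per-frame signal (e.g. detected vehicle count), a detector like this flags sudden departures from recent activity, which is the pattern the session builds out with Beam windows and stateful DoFns.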

16:00 - 16:25
Assembling the Puzzle: A High-Performance Entity-Building Streaming Beam Pipeline Using a Two-Tiered State Architecture
By Israel Herraiz

When source systems emit only partial updates to conserve network bandwidth, data engineers face the complex task of reconstructing complete entities in real time. In this session, we will take a deep dive into a high-performance, SCD-like streaming pipeline that dynamically reconstructs full entities from partial data before sinking them to the data warehouse.

The core of our solution is a custom two-tiered state backend architecture. By intelligently combining Apache Beam’s native, low-latency state API (Tier 1) with an external third-party data store (Tier 2), we overcome standard memory and throughput limitations. Join us to explore how this two-tiered design, alongside Beam timers, drastically reduces external database lookups, minimizes network latency, and unlocks unparalleled performance for stateful streaming pipelines.
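The two-tier idea can be sketched outside Beam as a read-through, write-back store. The dict-backed tiers and capacity limit are stand-ins: in the talk's design, Tier 1 is Beam's state API and Tier 2 is an external database:

```python
class TwoTieredStore:
    """Read-through cache: a fast in-memory tier backed by a slower external tier."""

    def __init__(self, external_store: dict, capacity: int = 1000):
        self.fast = {}               # Tier 1: low-latency, bounded state
        self.slow = external_store   # Tier 2: durable, unbounded external store
        self.capacity = capacity
        self.lookups = 0             # external reads, the cost being minimized

    def get(self, key):
        if key in self.fast:
            return self.fast[key]    # hot path: no external round trip
        self.lookups += 1
        value = self.slow.get(key)
        if value is not None and len(self.fast) < self.capacity:
            self.fast[key] = value   # promote to Tier 1 for future reads
        return value

    def merge(self, key, partial: dict) -> dict:
        """Merge a partial update into the current entity (simplified, no eviction)."""
        entity = self.get(key) or {}
        entity.update(partial)
        self.fast[key] = entity      # hot keys stay resident in Tier 1
        self.slow[key] = entity      # write back so Tier 2 stays authoritative
        return entity
```

Because hot keys stay in Tier 1, a burst of partial updates for the same entity costs at most one external lookup, which is the latency win the two-tiered design targets.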

16:30 - 16:55
Beyond Vectors: Building Scalable GraphRAG with Apache Beam and Cloud Spanner
By Oleksandr Saienko

Retrieval-Augmented Generation (RAG) has revolutionized how we interact with LLMs, but standard vector search often fails to capture the complex, multi-hop relationships hidden in data. “GraphRAG” solves this by grounding answers in a structured Knowledge Graph, but building these graphs from raw unstructured data at scale presents a massive data engineering challenge.

In this session, we will demonstrate how to build a production-grade GraphRAG ingestion and retrieval pipeline using Apache Beam and Google Cloud Spanner Graph.