Beam College 2026 will be on April 21-23, 2026.
All times in UTC. Click on the calendar link to add to your calendar in your local time.
Tuesday, April 21, 2026 (Overview & Fundamentals)
| 15:00 - 15:25. Apache Beam: Powering the Future of Event-Driven Agents at Scale, by Reza Rokni. Where did Apache Beam start, and where is it taking us? Join us as we dive into the origins of the Beam model and examine the critical shift toward a streaming-first data landscape. We’ll break down why, in a landscape that demands real-time context, robust streaming has become the essential nervous system for event-driven agents. Discover how the unified Beam model enables autonomous agents to react, reason, and act in real time, turning raw, unbounded data streams into intelligent, decentralized action. |
| 15:30 - 15:55. Authoring your first pipeline, by Raj Katakam. This hands-on session guides beginners through creating their first Apache Beam pipeline from scratch. We’ll start with the core Beam concepts (PCollections, PTransforms, and the Pipeline object), then walk through a practical example, building a data processing pipeline step by step. You’ll learn how to read data from sources, apply transformations like Map, FlatMap, and GroupByKey, and write results to sinks. The session covers common patterns, debugging techniques, and best practices for structuring your pipeline code. We’ll also explore how these foundational concepts translate to real-world MLOps scenarios like feature engineering pipelines and batch inference workflows. Whether you’re new to Beam or looking to integrate it into your ML platform, you’ll leave with the confidence to start building production-ready pipelines on runners like Dataflow. |
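To make the transforms named above concrete, here is a plain-Python sketch of what FlatMap, Map, and GroupByKey do to a small collection. This is only an illustration of the semantics; a real pipeline would build these steps with the `apache_beam` SDK and run them on a runner such as Dataflow.

```python
from collections import defaultdict

# A tiny in-memory stand-in for a PCollection of text lines.
lines = ["to be or not to be", "beam on"]

# FlatMap: one input element -> zero or more output elements.
words = [w for line in lines for w in line.split()]

# Map: one input element -> exactly one output element (key it for grouping).
pairs = [(w, 1) for w in words]

# GroupByKey: collect all values that share a key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# A final Map to sum per-key counts, as a word-count pipeline would.
counts = {key: sum(values) for key, values in grouped.items()}
```

The same shape (source, element-wise transforms, a shuffle, an aggregation, a sink) carries over directly to Beam's own `beam.FlatMap`, `beam.Map`, and `beam.GroupByKey`.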
| 16:00 - 16:25. From Click to Chart: Building a Real-Time Analytics Engine, by Meghana Ganu. This session demystifies the engineering behind real-time analytics. We will trace the lifecycle of a data point as it travels through a modern Google Cloud Platform (GCP) architecture, moving from ingestion to visualization in seconds. Using Apache Beam, we will explore how to build a unified pipeline that handles massive streams of data without the need to manage servers. |
| 16:30 - 16:55. YAML: a new SDK to author your pipelines, by Svetak Sundhar. In this talk, we explore a new way of authoring and running your Beam pipelines: the YAML SDK! Learn how you can split your pipeline infrastructure from your complex processing logic. |
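As a rough sketch of the authoring style this talk covers, a small Beam YAML pipeline might look like the following. The transform and field names follow the Beam YAML documentation as best recalled, and the bucket paths are placeholders; check the current Beam YAML reference before running.

```yaml
# A minimal chained Beam YAML pipeline: read text, derive a field, write JSON.
pipeline:
  type: chain
  transforms:
    - type: ReadFromText
      config:
        path: gs://my-bucket/input/*.txt   # placeholder path
    - type: MapToFields
      config:
        language: python
        fields:
          word_count: "len(line.split())"
    - type: WriteToJson
      config:
        path: gs://my-bucket/output/counts  # placeholder path
```

The appeal is exactly the split the abstract describes: the YAML file captures pipeline structure and infrastructure concerns, while any nontrivial logic stays in small, testable expressions or referenced code.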
| 17:00 - 17:15. The Dataflow Job Builder, by Ryan Madden. Learn how you can create low-code and no-code Beam YAML jobs in the Cloud Dataflow UI. |
Wednesday, April 22, 2026 (New Features)
| 15:00 - 15:25. Introducing Managed IO, the New Era of Beam Connectors, by Ahmed Abualsaud. Discover what makes Managed IO a major leap toward a more unified, flexible, and upgrade-friendly connector ecosystem. This talk dives into the motivation and design behind Managed IO, highlighting key goals such as providing a consistent API across all connectors. See it all in action with a Dataflow demo featuring IcebergIO as a case study, alongside examples with BigQueryIO and KafkaIO! |
| 15:30 - 15:55. Scaling Iceberg Ingestion with Apache Beam, by Tom Stepp. This session explores the technical evolution of Apache Iceberg integration within the Apache Beam ecosystem. We dive into a suite of recent performance enhancements designed to streamline data lake ingestion at scale. Key topics include the adoption of table-defined compression for improved processing and storage efficiency, and the implementation of metadata caching to minimize lookups and prevent metadata service quota exhaustion. We also examine direct write capabilities that bypass expensive processing for large bundles, and autosharding mechanisms that optimize file sizes and ensure horizontal scalability. Join us to learn how these enhancements establish an efficient and cost-effective performance baseline for streaming pipelines. |
| 16:00 - 16:25. Getting Started with Remote ML Inference in Beam Java, by Ganesh Sivakumar. This session introduces the new Remote ML Inference transform in the Apache Beam Java SDK and shows how Java pipelines can run inference using external model services such as OpenAI. We’ll walk through how the transform works, how to use the OpenAI model handler with practical examples, and how to implement custom model handlers for other remote ML providers. The talk covers common usage patterns and framework extensibility, and includes a live demo so developers can quickly add remote ML inference capabilities to their Beam Java pipelines. |
| 16:30 - 16:55. Real-Time Anomaly Detection with Apache Beam, by Shunping Huang. Real-time anomaly detection is essential for identifying unexpected patterns and critical events in streaming data. This talk addresses the unique algorithmic challenges of anomaly detection in streaming environments and introduces a new feature within Apache Beam designed for this purpose. We will demonstrate how to seamlessly integrate both online and pre-trained offline anomaly detection models into Beam pipelines, empowering users to build robust, scalable, real-time anomaly detection systems. |
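As a conceptual illustration of the "online model" idea this talk covers (this is not Beam's actual anomaly-detection API), a streaming z-score detector can maintain running statistics with Welford's algorithm and flag points far from the mean as they arrive:

```python
import math

class OnlineZScoreDetector:
    """Flags values more than `threshold` standard deviations from the
    running mean, updating its statistics with every element (Welford)."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def score(self, x):
        # Score against the statistics seen so far, then fold x in.
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                anomalous = True
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

detector = OnlineZScoreDetector(threshold=3.0)
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 55.0]
flags = [detector.score(x) for x in stream]  # only the spike is flagged
```

In a Beam pipeline the detector's `n`, `mean`, and `m2` would live in per-key state so the model updates incrementally as the unbounded stream flows through.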
| 17:00 - 17:25. Building Scalable Semantic Search and RAG Pipelines, by Claude Van der Merwe. This presentation introduces vector-based semantic search and Retrieval-Augmented Generation (RAG), demonstrating how to build scalable pipelines using Apache Beam. We’ll start by explaining fundamental concepts like chunking, embeddings, and vector similarity. Then we’ll explore semantic search applications before extending to full RAG systems. The presentation walks through implementing both semantic search and RAG pipelines using Apache Beam’s ML components, covering data ingestion, chunking, embedding generation, vector database integration, and similarity search. By the end, attendees will understand the theoretical foundations of both systems and have practical knowledge of how to implement them at scale using Apache Beam’s distributed processing capabilities. |
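The retrieval step at the heart of both semantic search and RAG reduces to nearest-neighbor lookup over embedding vectors. A toy version with hand-made three-dimensional vectors (a real pipeline would generate embeddings with a model during ingestion and query a vector database):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings" for three document chunks; in practice an embedding
# model produces these during the ingestion pipeline.
index = {
    "beam unifies batch and streaming": [0.9, 0.1, 0.0],
    "recipe for sourdough bread":       [0.0, 0.2, 0.9],
    "streaming data processing tips":   [0.8, 0.3, 0.1],
}

def search(query_vec, k=2):
    # Rank chunks by similarity to the query and keep the top k.
    ranked = sorted(index.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

query = [0.9, 0.1, 0.0]  # stands in for an embedded user question
top = search(query)
```

In a RAG system the `top` chunks would then be stitched into the LLM prompt as grounding context; the chunking and embedding stages upstream of this lookup are where Beam's distributed processing does the heavy lifting.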
Thursday, April 23, 2026 (Advanced Tips & Tricks)
| 15:00 - 15:25. Video Data Processing with Apache Beam, by Adesh Abhang. This architecture leverages a Splittable DoFn for parallelized video ingestion, distributing frame extraction across workers before applying sliding-window logic to generate temporal 3D tensors. It utilizes Beam’s RunInference API with a KeyedModelHandler for GPU-accelerated inference, ensuring robust state management. Finally, CoGroupByKey synchronizes asynchronous feature vectors with metadata, serializing the aligned dataset into TFRecord SequenceExamples for downstream training. |
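The sliding-window step the abstract mentions, grouping consecutive frames into overlapping clips for a temporal model, can be sketched in plain Python; in the actual architecture, Beam's windowing and RunInference would handle this in a distributed, keyed fashion.

```python
def sliding_windows(frames, size, stride):
    """Yield overlapping lists of `size` consecutive frames,
    advancing by `stride` frames each time (incomplete tails dropped)."""
    for start in range(0, len(frames) - size + 1, stride):
        yield frames[start:start + size]

# Integer frame IDs stand in for decoded video frames (e.g. numpy arrays).
frames = list(range(10))  # frames 0..9
clips = list(sliding_windows(frames, size=4, stride=2))
```

Each clip would then be stacked into a temporal 3D tensor and handed to the model; the overlap (`stride < size`) is what lets the model see motion continuity across clip boundaries.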
| 15:30 - 15:55. Real-Time Stateful Processing of Video Data, by Darshan Kanade & Aditya Shukla. You will learn how to build a pipeline that processes video data in real time to enable efficient analysis and event or anomaly detection. Finally, it raises alerts or stores annotated events in a structured format that can be used for monitoring dashboards or downstream analysis. An example use case is monitoring traffic with surveillance footage to detect unusual events such as an accident or a disabled vehicle. |
| 16:00 - 16:25. Assembling the Puzzle: High-Performance Entity Building streaming Beam pipeline using a Two-Tiered State Architecture, by Israel Herraiz. When source systems emit only partial updates to conserve network bandwidth, data engineers face the complex task of reconstructing complete entities in real time. In this session, we will dive deep into a high-performance, SCD-like streaming pipeline that dynamically reconstructs full entities from partial data before sinking them to the data warehouse. The core of our solution is a custom two-tiered state backend architecture. By intelligently combining Apache Beam’s native, low-latency state API (Tier 1) with an external third-party data store (Tier 2), we overcome standard memory and throughput limitations. Join us to explore how this two-tiered design, alongside Beam timers, drastically reduces external database lookups, minimizes network latency, and unlocks unparalleled performance for stateful streaming pipelines. |
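The two-tier idea resembles a read-through cache: consult fast local state first, fall back to the external store only on a miss, and write back on every update. A schematic sketch follows, where plain dicts stand in for Beam's state API (Tier 1) and the external database (Tier 2); the talk's actual implementation may differ.

```python
class TwoTieredState:
    """Tier 1: fast in-process state (stand-in for Beam's state API).
    Tier 2: external store (stand-in for a remote database)."""

    def __init__(self, external_store):
        self.tier1 = {}
        self.tier2 = external_store
        self.external_lookups = 0  # tracks how often we pay the network cost

    def get(self, key):
        if key in self.tier1:          # hot path: no external call
            return self.tier1[key]
        self.external_lookups += 1     # cold path: fetch once, then cache
        value = self.tier2.get(key, {})
        self.tier1[key] = value
        return value

    def merge_partial_update(self, key, partial):
        entity = dict(self.get(key))
        entity.update(partial)         # reconstruct the full entity
        self.tier1[key] = entity       # write back to the fast tier
        self.tier2[key] = entity       # ...and to durable storage
        return entity

store = {"user:1": {"name": "Ada"}}
state = TwoTieredState(store)
full = state.merge_partial_update("user:1", {"email": "ada@example.com"})
# A later partial update for the same key is served entirely from Tier 1.
state.merge_partial_update("user:1", {"plan": "pro"})
```

The payoff is visible in `external_lookups`: repeated updates to a hot key hit only Tier 1, which is the reduction in external database round-trips the session highlights (Beam timers would additionally bound how long entries live in Tier 1).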
| 16:30 - 16:55. Beyond Vectors: Building Scalable GraphRAG with Apache Beam and Cloud Spanner, by Oleksandr Saienko. Retrieval-Augmented Generation (RAG) has revolutionized how we interact with LLMs, but standard vector search often fails to capture the complex, multi-hop relationships hidden in data. “GraphRAG” solves this by grounding answers in a structured Knowledge Graph, but building these graphs from raw unstructured data at scale presents a massive data engineering challenge. In this session, we will demonstrate how to build a production-grade GraphRAG ingestion and retrieval pipeline using Apache Beam and Google Cloud Spanner Graph. |