Integrating Apache Doris and Hudi for Data Querying and Migration

Learn how to integrate Apache Doris with Apache Hudi for efficient federated querying, real-time analytics, and seamless data migration in big data environments.

By li yy · Apr. 03, 25 · Tutorial


In the field of big data analytics, real-time data availability, query performance, and flexibility are crucial. With the rise of the Lakehouse architecture as a new paradigm in big data, integrating Apache Doris, a high-performance real-time analytical database, with Apache Hudi, a data lake management framework focused on incremental data processing, provides powerful federated query and data migration capabilities. 

This integration allows the two systems to work together seamlessly while accelerating enterprise data architecture upgrades. In this article, we will explore how to integrate Apache Doris with Hudi to achieve efficient federated querying and data migration, along with best practices and optimization techniques.

Integrating Apache Doris with Hudi

1. Introduction

Apache Doris is a real-time analytical database based on the Massively Parallel Processing (MPP) architecture, known for its high query performance and ease of use. It is widely applied in scenarios such as reporting analytics, ad-hoc queries, unified data warehousing, and accelerating federated queries in data lakes. 

Apache Hudi, on the other hand, is an open-source data management framework designed to simplify incremental data processing in data lakes while supporting transactional updates and efficient data management.

By integrating Doris with Hudi, users can leverage Doris's query capabilities to directly access Hudi data (federated query) or migrate Hudi data efficiently into Doris for more complex analysis. This article will provide an in-depth guide on achieving this integration, alongside practical insights and optimization strategies.
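
For the migration path, a minimal sketch looks like the following. It assumes the hudi_catalog created later in this article, and the target database (test_db) and table schema are hypothetical; the idea is simply to create a native Doris table and ingest the Hudi data into it with INSERT INTO ... SELECT.

SQL
 
-- Sketch only: test_db and the target table schema are assumptions, not part of the walkthrough below.
doris> SWITCH internal;
doris> CREATE TABLE test_db.customer_target (
    c_custkey   INT,
    c_name      VARCHAR(64),
    c_nationkey INT
)
DUPLICATE KEY(c_custkey)
DISTRIBUTED BY HASH(c_custkey) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Read the Hudi table through the catalog and write it into the Doris table
doris> INSERT INTO test_db.customer_target
SELECT c_custkey, c_name, c_nationkey
FROM hudi_catalog.`default`.customer_cow;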

Key Features of Doris-Hudi Integration

Apache Doris has strengthened its ability to read Apache Hudi tables, integrating deeply with Hudi’s real-time data management capabilities to provide efficient, flexible, and cost-effective querying and analytics. Doris's Hudi support includes the following key features:

  • Copy on Write (COW) tables: Supports snapshot queries.
  • Merge on Read (MOR) tables: Supports snapshot queries and read-optimized queries.
  • Time travel: Allows users to access historical data states.
  • Incremental read: Efficiently processes data changes.

The combination of Apache Doris’s high-performance query execution and Apache Hudi’s real-time data management features enables not only real-time analytics but also robust data auditing, historical data retrieval, and incremental processing. This integration has been validated in real-world business scenarios by multiple community users.

Use Cases

The integration between Doris and Hudi shines in various real-world applications:

1. Real-Time Data Analytics and Processing

In use cases such as financial transaction analytics, real-time ad-click stream analysis, and e-commerce user behavior analysis, data needs to be continuously updated and queried efficiently. 

Hudi ensures real-time data consistency, while Doris’s high-performance querying handles large-scale analytical workloads with ease. The combination of both enables real-time, high-throughput data analytics.

2. Data Auditing and Historical Retrieval

Industries such as finance and healthcare, where data security and accuracy are paramount, require reliable historical data retrieval and auditing. Hudi’s Time Travel functionality allows users to view past states of data, and Doris’s efficient query engine ensures fast retrieval and analysis of data at specific points in time, making it ideal for forensic analysis and compliance auditing.

3. Incremental Data Processing

Big data analytics often involves handling large datasets with frequent updates. Hudi supports incremental data reading, enabling users to process only the changed data instead of performing full-table updates. Doris’s incremental query capabilities further enhance efficiency, significantly reducing resource consumption while improving data processing performance.

4. Federated Queries Across Multiple Data Sources

Enterprise data often resides in multiple databases and diverse data sources. Doris’s Multi-Catalog feature enables seamless mapping and synchronization across different data sources, allowing enterprises to integrate and analyze heterogeneous data efficiently without excessive data movement.
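
As a rough illustration (the catalog comes from the integration walkthrough below, and the orders table and test_db database are hypothetical), a federated query in Doris can join a Hudi table with a native Doris table simply by using fully qualified catalog.database.table names:

SQL
 
-- Hypothetical federated join: orders lives in Doris's internal catalog,
-- while customer_cow is a Hudi table exposed through hudi_catalog.
doris> SELECT o.o_orderkey, c.c_name
FROM internal.test_db.orders AS o
JOIN hudi_catalog.`default`.customer_cow AS c
  ON o.o_custkey = c.c_custkey
WHERE c.c_nationkey = 15;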

Integrating Apache Doris With Hudi

Doris leverages its multi-catalog capability to integrate with external data sources such as Apache Hudi. Multi-catalog automatically maps metadata from external data sources, enabling federated querying and data ingestion. Below are the integration steps:

Environment and Data Preparation

Environment Setup

Component       Version
Apache Doris    2.1.4
Apache Hudi     0.14
Apache Spark    3.4.2
Apache Hive     2.1.3
MinIO           2022-05-26T05-48-41Z


Creating Hudi Tables Using Spark

SQL
 
-- Start the Spark SQL client (shell command in the demo environment)
./login-spark.sh

-- Use default database
spark-sql> use default;

-- Create Copy-on-Write (COW) table
spark-sql> CREATE TABLE customer_cow
USING hudi
TBLPROPERTIES (
  type = 'cow',
  primaryKey = 'c_custkey',
  preCombineField = 'c_name'
)
PARTITIONED BY (c_nationkey)
AS SELECT * FROM customer;

-- Create Merge-on-Read (MOR) table
spark-sql> CREATE TABLE customer_mor
USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'c_custkey',
  preCombineField = 'c_name'
)
PARTITIONED BY (c_nationkey)
AS SELECT * FROM customer;


Creating a Hudi Catalog in Doris

1. Execute the following SQL to create a Hudi catalog in Doris:

SQL
 
CREATE CATALOG hudi_catalog PROPERTIES (
    "type"="hms",
    "hive.metastore.uris"="thrift://hive-metastore:9083",
    "s3.access_key"="minio",
    "s3.secret_key"="minio123",
    "s3.endpoint"="http://minio:9000",
    "s3.region"="us-east-1",
    "use_path_style"="true"
);


Once configured, Doris automatically scans Hudi’s metadata, allowing users to access Hudi databases and tables through hudi_catalog.
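
As a quick sanity check (assuming the catalog was created exactly as above), you can switch to the new catalog and list what Doris has mapped:

SQL
 
-- Inspect the databases and tables mapped from the Hive Metastore
doris> SWITCH hudi_catalog;
doris> SHOW DATABASES;
doris> USE `default`;
doris> SHOW TABLES;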

2. Refresh the catalog to synchronize Hudi tables.

SQL
 
doris> REFRESH CATALOG hudi_catalog;


This refresh synchronizes the catalog metadata once. After that, any data modifications made in Hudi using Spark are visible in Doris without requiring another manual refresh.

3. Insert the data into Hudi COW and MOR tables.

When using Spark to manipulate data in Hudi, changes are immediately visible in Doris without requiring a catalog refresh. The following Spark SQL commands insert a row into both Copy-on-Write (COW) and Merge-on-Read (MOR) tables:

SQL
 
spark-sql> insert into customer_cow values (100, "Customer#000000100", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 25); 
spark-sql> insert into customer_mor values (100, "Customer#000000100", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 25);


4. Doris can directly query the latest data inserted.

SQL
 
doris> use hudi_catalog.default; 
doris> select * from customer_cow where c_custkey = 100;
doris> select * from customer_mor where c_custkey = 100;


5. Inserting data with an existing c_custkey value updates the record:

SQL
 
spark-sql> insert into customer_cow values (32, "Customer#000000032_update", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 15); 
spark-sql> insert into customer_mor values (32, "Customer#000000032_update", "jD2xZzi", "25-430-914-2194", 3471.59, "BUILDING", "cial ideas. final, furious requests", 15);


6. After updating, Doris reflects the changes:

SQL
 
doris> select * from customer_cow where c_custkey = 32; 
doris> select * from customer_mor where c_custkey = 32;


7. Incremental read and time travel.

Incremental Read

Hudi supports incremental reads, allowing users to fetch changes within a specific time range. Doris can query the changes made after inserting c_custkey=100:

SQL
 
-- Doris: incremental read on the COW table
doris> select * from customer_cow@incr('beginTime'='20240603015018572');
-- Equivalent Spark SQL
spark-sql> select * from hudi_table_changes('customer_cow', 'latest_state', '20240603015018572');

-- Doris: incremental read on the MOR table
doris> select * from customer_mor@incr('beginTime'='20240603015058442');
-- Equivalent Spark SQL
spark-sql> select * from hudi_table_changes('customer_mor', 'latest_state', '20240603015058442');


Time Travel

Doris enables querying historical snapshots of Hudi data. First, use Spark to retrieve commit history:

SQL
 
spark-sql> call show_commits(table => 'customer_cow', limit => 10);
spark-sql> call show_commits(table => 'customer_mor', limit => 10);


Then, use Doris to query the data before a specific snapshot:

SQL
 
-- Doris: time travel on the COW table
doris> select * from customer_cow for time as of '20240603015018572' where c_custkey = 32 or c_custkey = 100;
-- Equivalent Spark SQL
spark-sql> select * from customer_cow timestamp as of '20240603015018572' where c_custkey = 32 or c_custkey = 100;

-- Doris: time travel on the MOR table
doris> select * from customer_mor for time as of '20240603015058442' where c_custkey = 32 or c_custkey = 100;
-- Equivalent Spark SQL
spark-sql> select * from customer_mor timestamp as of '20240603015058442' where c_custkey = 32 or c_custkey = 100;


8. Hudi data consists of baseline data (merged Parquet files) and incremental data (the changes produced by INSERT, UPDATE, and DELETE operations). Doris reads baseline data with its native Parquet reader for optimal performance, while incremental data is accessed through JNI and the Hudi Java SDK.

Using EXPLAIN to Verify Optimization

For COW tables, all splits are baseline data, enabling optimal performance:

SQL
 
doris> explain select * from customer_cow where c_custkey = 32;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_cow                                       |
|      predicates: (c_custkey[#5] = 32)                          |
|      inputSplitNum=101, totalFileSize=45338886, scanRanges=101 |
|      partition=26/26                                           |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=101/101                              |


For MOR tables, a majority of splits are baseline data, while some require JNI for incremental reads:

SQL
 
doris> explain select * from customer_mor where c_custkey = 32;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_mor                                       |
|      predicates: (c_custkey[#5] = 32)                          |
|      inputSplitNum=101, totalFileSize=45340731, scanRanges=101 |
|      partition=26/26                                           |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=100/101                              |


Observing Baseline and Incremental Data Differences

Delete a record via Spark, then use EXPLAIN to verify how Doris reads the resulting splits:

SQL
 
spark-sql> delete from customer_cow where c_custkey = 64; 
doris> explain select * from customer_cow where c_custkey = 64; 

spark-sql> delete from customer_mor where c_custkey = 64; 
doris> explain select * from customer_mor where c_custkey = 64;


By using partition filters, Doris accesses only relevant partitions, improving query efficiency.

SQL
 
 
-- The customer tables are partitioned by c_nationkey, so the partition column can be used to prune data
doris> explain select * from customer_mor where c_custkey = 64 and c_nationkey = 15;
|   0:VHUDI_SCAN_NODE(68)                                        |
|      table: customer_mor                                       |
|      predicates: (c_custkey[#5] = 64), (c_nationkey[#12] = 15) |
|      inputSplitNum=4, totalFileSize=1798186, scanRanges=4      |
|      partition=1/26                                            |
|      cardinality=1, numNodes=1                                 |
|      pushdown agg=NONE                                         |
|      hudiNativeReadSplits=3/4                                  |


Conclusion

Integrating Apache Doris with Apache Hudi offers a powerful combination for real-time analytics, federated queries, and incremental data processing. By leveraging Doris’s high-performance query engine and Hudi’s efficient data lake management, enterprises can streamline data workflows, reduce costs, and enhance analytical capabilities. This integration is a game-changer for organizations looking to unify Lakehouse architectures with real-time querying capabilities.

As we look forward to 2025, I am excited about the milestones ahead. The remarkable progress we’ve achieved is made possible by the incredible support of the Doris community, including its dedicated users and developers. 

Big data, Data lake, Data migration, Data processing, Apache

Opinions expressed by DZone contributors are their own.
