Processing Cloud Data With DuckDB and AWS S3

DuckDB's ability to read data directly from cloud storage, such as AWS S3, makes it particularly powerful for modern data architectures.

By Anil Kumar Moka · Feb. 04, 25 · Tutorial

DuckDB is a powerful in-process analytical database with built-in parallel query execution, which makes it a good choice for reading and transforming data in cloud storage, in this case AWS S3. I've had a lot of success using it, and I'll walk you through the steps to implement it, along with some learnings and best practices.

Using DuckDB, the httpfs extension, and pyarrow, we can efficiently process Parquet files stored in S3 buckets. Let's dive in:

Before starting the installation of DuckDb, make sure you have these prerequisites: 

  • Python 3.9 or higher installed 
  • Prior knowledge of setting up Python projects and virtual environments or conda environments

Installing Dependencies

First, let's establish the necessary environment:

Shell
 
# Install required packages for cloud integration
pip install "duckdb>=0.8.0" pyarrow pandas boto3 requests


The dependencies explained:

  • duckdb>=0.8.0: The core database engine that provides SQL functionality and in-memory processing
  • pyarrow: Handles Parquet file operations efficiently with columnar storage support
  • pandas: Enables powerful data manipulation and analysis capabilities
  • boto3: AWS SDK for Python, providing interfaces to AWS services
  • requests: Manages HTTP communications for cloud interactions

Configuring Secure Cloud Access

Python
 
import duckdb
import os

# Initialize DuckDB with cloud support
conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

# Secure AWS configuration
conn.execute("""
    SET s3_region='your-region';
    SET s3_access_key_id='your-access-key';
    SET s3_secret_access_key='your-secret-key';
""")


This initialization code does several important things:

  1. Creates a new DuckDB connection in memory using :memory:
  2. Installs and loads the HTTP filesystem extension (httpfs) which enables cloud storage access
  3. Configures AWS credentials with your specific region and access keys (replace the placeholders with real values)
  4. Lets DuckDB make authenticated HTTPS requests to AWS services
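
Hardcoding keys works for a quick local test, but it is safer to read them from the environment. A minimal variation of the setup above, assuming the standard AWS environment variables are set:

Python
 
import os

# Read credentials from the standard AWS environment variables
region = os.environ['AWS_REGION']
access_key = os.environ['AWS_ACCESS_KEY_ID']
secret_key = os.environ['AWS_SECRET_ACCESS_KEY']

conn.execute(f"""
    SET s3_region='{region}';
    SET s3_access_key_id='{access_key}';
    SET s3_secret_access_key='{secret_key}';
""")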

Processing AWS S3 Parquet Files

Let's examine a comprehensive example of processing Parquet files with sensitive data masking:

Python
 
import duckdb
import pandas as pd

# Create sample data to demonstrate parquet processing
sample_data = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Bob Wilson', 'Alice Brown'],
    'email': ['john.smith@email.com', 'jane.doe@company.com', 'bob.wilson@email.net', 'alice.brown@org.com'],
    'phone': ['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123'],
    'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012'],
    'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm Dr'],
    'salary': [75000, 85000, 65000, 95000]  # Non-sensitive data
})


This sample data creation helps us demonstrate data masking techniques. We include various types of sensitive information commonly found in real-world datasets:

  • Personal identifiers (name, SSN)
  • Contact information (email, phone, address)
  • Financial data (salary)

Now, let's look at the processing function:

Python
 
def demonstrate_parquet_processing():
    # Create a DuckDB connection
    conn = duckdb.connect(':memory:')
    
    # Save sample data as parquet
    sample_data.to_parquet('sample_data.parquet')
    
    # Define sensitive columns to mask
    sensitive_cols = ['email', 'phone', 'ssn']
    
    # Process the parquet file with masking (a raw string keeps the
    # regex backslashes and {3} quantifiers intact)
    query = r"""
    CREATE TABLE masked_data AS
    SELECT 
        -- Mask name: keep first letter of first and last name
        regexp_replace(name, '([A-Z])[a-z]+ ([A-Z])[a-z]+', '\1*** \2***') as name,
        
        -- Mask email: hide everything before @
        regexp_replace(email, '([a-zA-Z0-9._%+-]+)(@.*)', '****\2') as email,
        
        -- Mask phone: show only last 4 digits
        regexp_replace(phone, '[0-9]{3}-[0-9]{3}-', '***-***-') as phone,
        
        -- Mask SSN: show only last 4 digits
        regexp_replace(ssn, '[0-9]{3}-[0-9]{2}-', '***-**-') as ssn,
        
        -- Mask address: show only street type
        regexp_replace(address, '[0-9]+ [A-Za-z]+ ', '*** ') as address,
        
        -- Keep non-sensitive data as is
        salary
    FROM read_parquet('sample_data.parquet');
    """

    # Execute the masking query and display the result
    conn.execute(query)
    print(conn.execute("SELECT * FROM masked_data").df())


Let's break down this processing function:

  • We create a new DuckDB connection
  • Convert our sample DataFrame to a Parquet file
  • Define which columns contain sensitive information
  • Create a SQL query that applies different masking patterns:
    • Names: Preserves initials (e.g., "John Smith" → "J*** S***")
    • Emails: Hides local part while keeping domain (e.g., "john.smith@email.com" → "****@email.com")
    • Phone numbers: Shows only the last four digits
    • SSNs: Displays only the last four digits
    • Addresses: Keeps only street type
    • Salary: Remains unmasked as non-sensitive data

The output should look like:

Plain Text
 
Original Data:
=============
          name                 email         phone          ssn      address  salary
0   John Smith  john.smith@email.com  123-456-7890  123-45-6789  123 Main St   75000
1     Jane Doe  jane.doe@company.com  234-567-8901  234-56-7890  456 Oak Ave   85000
2   Bob Wilson  bob.wilson@email.net  345-678-9012  345-67-8901  789 Pine Rd   65000
3  Alice Brown   alice.brown@org.com  456-789-0123  456-78-9012   321 Elm Dr   95000

Masked Data:
===========
        name             email         phone          ssn  address  salary
0  J*** S***    ****@email.com  ***-***-7890  ***-**-6789   *** St   75000
1  J*** D***  ****@company.com  ***-***-8901  ***-**-7890  *** Ave   85000
2  B*** W***    ****@email.net  ***-***-9012  ***-**-8901   *** Rd   65000
3  A*** B***      ****@org.com  ***-***-0123  ***-**-9012   *** Dr   95000


Now, let's explore different masking patterns; the comments in each snippet explain the variation:

Email Masking Variations

Python
 
# Show first letter only
"john.smith@email.com" → "j***@email.com"

# Show domain only
"john.smith@email.com" → "****@email.com"

# Show first and last letter of the local part
"john.smith@email.com" → "j********h@email.com"


Phone Number Masking

Python
 
# Last 4 digits only
"123-456-7890" → "***-***-7890"

# First 3 digits only
"123-456-7890" → "123-***-****"

# Middle digits only
"123-456-7890" → "***-456-****"


Name Masking

Python
 
# Initials only
"John Smith" → "J.S."

# First letter of each word
"John Smith" → "J*** S***"

# Fixed length masking
"John Smith" → "XXXX XXXXX"


Efficient Partitioned Data Processing

When dealing with large datasets, partitioning becomes crucial. Here's how to handle partitioned data efficiently:

Python
 
def process_partitioned_data(base_path, partition_column, sensitive_columns):
    """
    Process partitioned data efficiently
    
    Parameters:
    - base_path: Base path to partitioned data
    - partition_column: Column used for partitioning (e.g., 'date')
    - sensitive_columns: List of columns to mask
    """
    conn = duckdb.connect(':memory:')
    
    try:
        # 1. Discover all partitions without scanning the full dataset
        partitions = conn.execute(f"""
            SELECT DISTINCT {partition_column}
            FROM read_parquet('{base_path}/*/*.parquet', hive_partitioning=true)
        """).fetchall()

        # 2. Fetch the column list once so we can build a masked SELECT
        columns = [row[0] for row in conn.execute(f"""
            DESCRIBE SELECT * FROM read_parquet('{base_path}/*/*.parquet', hive_partitioning=true)
        """).fetchall()]
        select_list = ", ".join(
            f"'****' AS {col}" if col in sensitive_columns else col
            for col in columns
        )

        # 3. Process one partition at a time to keep the memory footprint low
        for (value,) in partitions:
            masked = conn.execute(f"""
                SELECT {select_list}
                FROM read_parquet('{base_path}/*/*.parquet', hive_partitioning=true)
                WHERE {partition_column} = ?
            """, [value]).df()
            print(f"Partition {value}: masked {len(masked)} rows")
    finally:
        # 4. Always release the connection, even if processing fails
        conn.close()


This function demonstrates several important concepts:

  • Dynamic partition discovery
  • Memory-efficient processing
  • Error handling with proper cleanup
  • Masked data output generation

The partition structure typically looks like:

Partition Structure

Plain Text
 
sample_data/
├── date=2024-01-01/
│   └── data.parquet
├── date=2024-01-02/
│   └── data.parquet
└── date=2024-01-03/
    └── data.parquet


Sample Data

Plain Text
 
Original Data:
date        customer_id  email              phone           amount
2024-01-01  1           user1@email.com    123-456-0001    500.00
2024-01-01  2           user2@email.com    123-456-0002    750.25
...

Masked Data:
date        customer_id  email     phone        amount
2024-01-01  1           ****      ****         500.00
2024-01-01  2           ****      ****         750.25
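
With a partitioned layout like the one above in place (a hypothetical sample_data/ directory rather than the single file from earlier), a call might look like:

Python
 
process_partitioned_data(
    base_path='sample_data',
    partition_column='date',
    sensitive_columns=['email', 'phone']
)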


Below are some benefits of partitioned processing:

  • Reduced memory footprint
  • Parallel processing capability
  • Improved performance
  • Scalable data handling

Performance Optimization Techniques

1. Configuring Parallel Processing

Python
 
# Optimize for performance
conn.execute("""
    SET preserve_insertion_order=false;
    SET threads=4;
    SET memory_limit='4GB';
""")


These settings:

  • Relax insertion-order preservation so large results can stream with less memory
  • Set parallel processing threads
  • Define memory limits to prevent overflow

2. Robust Error Handling

Python
 
import time

def robust_s3_read(s3_path, max_retries=3):
    """
    Implement reliable S3 data reading with retries.
    Reuses the module-level DuckDB connection `conn`.

    Parameters:
    - s3_path: Path to S3 data
    - max_retries: Maximum retry attempts
    """
    for attempt in range(max_retries):
        try:
            return conn.execute(f"SELECT * FROM read_parquet('{s3_path}')")
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff


This code block retries failed reads with exponential backoff and re-raises the exception after the final attempt, so callers can detect the failure and react.
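
A quick usage sketch; the bucket and key are hypothetical placeholders:

Python
 
# Returns a DuckDB result; convert to a DataFrame for inspection
result = robust_s3_read('s3://my-bucket/data/sample_data.parquet')
print(result.df().head())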

3. Storage Optimization

Python
 
# Efficient data storage with compression
conn.execute("""
    COPY (SELECT * FROM masked_data)
    TO 's3://output-bucket/masked_data.parquet'
    (FORMAT 'parquet', COMPRESSION 'ZSTD');
""")


This writes the masked table back to S3 as Parquet with ZSTD compression, which typically reduces storage size significantly at modest CPU cost.
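
A related option, assuming the table carries a partition column such as date (the masked_data table from earlier does not, so this is illustrative), is to let COPY write Hive-style partitions directly:

Python
 
# Writes one folder per date value, e.g. s3://output-bucket/masked/date=2024-01-01/
conn.execute("""
    COPY (SELECT * FROM masked_data)
    TO 's3://output-bucket/masked'
    (FORMAT 'parquet', COMPRESSION 'ZSTD', PARTITION_BY (date));
""")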

Best Practices and Recommendations

Security Best Practices

Security is crucial when handling data, especially in cloud environments. Following these practices helps protect sensitive information and maintain compliance:

  • IAM roles. Use AWS Identity and Access Management roles instead of direct access keys when possible (see the sketch below)
  • Key rotation. Implement regular rotation of access keys
  • Least privilege. Grant minimum necessary permissions
  • Access monitoring. Regularly review and audit access patterns

Why it's important: Security breaches can lead to data leaks, compliance violations, and financial losses. Proper security measures protect both your organization and your users' data.
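
As a sketch of the IAM-role approach: DuckDB's aws extension can load credentials from the standard AWS provider chain, so an attached role or a configured profile works without embedding keys (assuming the extension is available in your DuckDB version):

Python
 
conn.execute("INSTALL aws;")
conn.execute("LOAD aws;")

# Pulls credentials from the default provider chain
# (environment variables, shared config, or an attached IAM role)
conn.execute("CALL load_aws_credentials();")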

Performance Optimization

Optimizing performance ensures efficient resource utilization and faster data processing:

  • Partition sizing. Choose appropriate partition sizes based on data volume and processing patterns
  • Parallel processing. Utilize multiple threads for faster processing
  • Memory management. Monitor and optimize memory usage
  • Query optimization. Structure queries for maximum efficiency (see the EXPLAIN sketch below)

Why it's important: Efficient performance reduces processing time, saves computational resources, and improves overall system reliability.
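
For query optimization, EXPLAIN is a quick way to verify that a query only touches what it needs, for example that column pruning and filters apply to the Parquet scan:

Python
 
plan = conn.execute("""
    EXPLAIN SELECT salary
    FROM read_parquet('sample_data.parquet')
    WHERE salary > 80000
""").fetchall()
for row in plan:
    print(row[1])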

Error Handling

Robust error handling ensures reliable data processing:

  • Retry mechanisms. Implement exponential backoff for failed operations
  • Comprehensive logging. Maintain detailed logs for debugging (see the sketch below)
  • Status monitoring. Track processing progress
  • Edge cases. Handle unexpected data scenarios

Why it's important: Proper error handling prevents data loss, ensures processing completeness, and makes troubleshooting easier.
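
As one way to combine retries with logging, the earlier robust_s3_read could log each attempt; the logger name and messages here are illustrative:

Python
 
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('s3_reader')

def logged_s3_read(s3_path, max_retries=3):
    """Like robust_s3_read, but logs every attempt and failure."""
    for attempt in range(max_retries):
        try:
            logger.info("Reading %s (attempt %d/%d)", s3_path, attempt + 1, max_retries)
            return conn.execute(f"SELECT * FROM read_parquet('{s3_path}')")
        except Exception as e:
            logger.warning("Attempt %d failed: %s", attempt + 1, e)
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff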

Conclusion

Cloud data processing with DuckDB and AWS S3 offers a powerful combination of performance and security. Let me know how your DuckDB implementation goes!


Opinions expressed by DZone contributors are their own.
