SQL Stream

A CLI tool for executing SQL queries against CSV/JSON files using a zero-copy, streaming architecture powered by Apache DataFusion and Apache Arrow.

Installation

From crates.io

cargo install sql-stream

From GitHub Releases

Download pre-built binaries for your platform from the releases page:

Linux (x86_64)

wget https://github.com/1cbyc/sql-stream/releases/latest/download/sql-stream-linux-x86_64
chmod +x sql-stream-linux-x86_64
sudo mv sql-stream-linux-x86_64 /usr/local/bin/sql-stream

macOS (Intel)

curl -L https://github.com/1cbyc/sql-stream/releases/latest/download/sql-stream-macos-x86_64 -o sql-stream
chmod +x sql-stream
sudo mv sql-stream /usr/local/bin/

macOS (Apple Silicon)

curl -L https://github.com/1cbyc/sql-stream/releases/latest/download/sql-stream-macos-aarch64 -o sql-stream
chmod +x sql-stream
sudo mv sql-stream /usr/local/bin/

Windows

  • Download sql-stream-windows-x86_64.exe from the releases page
  • Rename to sql-stream.exe and add to your PATH

From Source

git clone https://github.com/1cbyc/sql-stream.git
cd sql-stream
cargo build --release

The binary will be available at target/release/sql-stream.

Usage

Basic Query on CSV

sql-stream -f data.csv -q "SELECT * FROM data LIMIT 10"

Custom Table Name

sql-stream -f employees.csv -t employees -q "SELECT name, salary FROM employees WHERE age > 30"

JSON Files

sql-stream -f data.json -q "SELECT COUNT(*) as total FROM data"

Aggregations and Group By

sql-stream -f sales.csv -q "SELECT region, SUM(revenue) as total_revenue FROM data GROUP BY region ORDER BY total_revenue DESC"

Enable Verbose Logging

sql-stream -f data.csv -q "SELECT * FROM data" --verbose

Command Line Options

sql-stream -f <FILE> -q <SQL> [OPTIONS]

Options:
  -f, --file <FILE>           Path to CSV or JSON file (required)
  -q, --query <SQL>           SQL query to execute (required)
  -t, --table-name <NAME>     Table name for SQL queries (default: "data")
  -v, --verbose               Enable verbose debug logging
  -h, --help                  Print help information
  -V, --version               Print version information

Examples

Data Analysis

# Find top 5 highest salaries
sql-stream -f employees.csv -q "SELECT name, salary FROM data ORDER BY salary DESC LIMIT 5"

# Calculate average age by city
sql-stream -f employees.csv -q "SELECT city, AVG(age) as avg_age FROM data GROUP BY city"

# Filter and count
sql-stream -f sales.json -q "SELECT product, COUNT(*) as count FROM data WHERE revenue > 1000 GROUP BY product"

Complex Queries

# Multiple aggregations
sql-stream -f data.csv -q "
  SELECT 
    category,
    COUNT(*) as count,
    AVG(price) as avg_price,
    MAX(price) as max_price,
    MIN(price) as min_price
  FROM data
  GROUP BY category
  HAVING COUNT(*) > 10
  ORDER BY avg_price DESC
"

Architecture

SQL Stream is built with a modular architecture (see the sketch after this list):

  • Engine Module (src/engine.rs): DataFusion SessionContext management and query execution
  • CLI Module (src/cli.rs): Argument parsing with clap
  • Error Module (src/error.rs): Type-safe error handling with thiserror
  • Main Binary (src/main.rs): Async runtime and orchestration
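
As a rough illustration of how these modules compose, here is a condensed, hypothetical sketch of the orchestration. The argument struct is inferred from the CLI options below and the engine methods from the library example at the end of this README, so the real src/main.rs may differ:

use clap::Parser;
use sql_stream::{QueryEngine, SqlStreamError};

/// Hypothetical mirror of the clap definition in src/cli.rs.
#[derive(Parser)]
struct Args {
    /// Path to CSV or JSON file
    #[arg(short, long)]
    file: String,
    /// SQL query to execute
    #[arg(short, long)]
    query: String,
    /// Table name for SQL queries
    #[arg(short, long, default_value = "data")]
    table_name: String,
}

#[tokio::main]
async fn main() -> Result<(), SqlStreamError> {
    let args = Args::parse();
    let mut engine = QueryEngine::new()?;                       // engine.rs: set up the DataFusion SessionContext
    engine.register_file(&args.file, &args.table_name).await?; // register the CSV/JSON file as a table
    let results = engine.execute_query(&args.query).await?;    // plan and execute the query
    engine.print_results(results).await?;                      // render the result batches
    Ok(())
}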

Technology Stack

  • Apache DataFusion (45.x): SQL query engine
  • Apache Arrow (53.x): In-memory columnar format
  • Tokio: Async runtime
  • Clap (v4): CLI parsing
  • Tracing: Structured logging

Performance Considerations

  • Zero-Copy Streaming: DataFusion processes data using Arrow's columnar format without unnecessary copying
  • Lazy Evaluation: Queries are optimized before execution
  • Parallel Processing: Multi-threaded execution for CPU-intensive operations
  • Memory Efficiency: Streaming results prevent loading entire datasets into memory (see the sketch below)
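
To make the streaming point concrete, this is what batch-at-a-time execution looks like against DataFusion's public API. This is a standalone sketch using DataFusion directly, not sql-stream's internal code:

use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("data", "data.csv", CsvReadOptions::new()).await?;

    // execute_stream yields one Arrow RecordBatch at a time instead of
    // collecting the entire result set into memory.
    let df = ctx.sql("SELECT * FROM data WHERE age > 30").await?;
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        println!("received a batch of {} rows", batch?.num_rows());
    }
    Ok(())
}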

For performance profiling, see docs/PROFILING.md.

Development

Prerequisites

  • Rust 1.70 or later
  • Cargo

Building

# Debug build
cargo build

# Release build
cargo build --release

Testing

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_csv_simple_select
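
For orientation, an integration test along the lines of test_csv_simple_select might look like the following sketch. The fixture path is hypothetical, and execute_query is assumed to return a Vec of Arrow record batches:

use sql_stream::QueryEngine;

#[tokio::test]
async fn test_csv_simple_select() {
    let mut engine = QueryEngine::new().expect("engine should initialize");
    engine
        .register_file("tests/fixtures/employees.csv", "data") // hypothetical fixture
        .await
        .expect("CSV should register as a table");
    let results = engine
        .execute_query("SELECT * FROM data LIMIT 1")
        .await
        .expect("query should succeed");
    // A non-empty file should yield at least one record batch.
    assert!(!results.is_empty());
}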

Code Quality

# Format code
cargo fmt

# Run linter
cargo clippy -- -D warnings

# Generate documentation
cargo doc --open

CI/CD

The project includes GitHub Actions workflows:

  • CI Workflow: Runs on every push and pull request

    • Format checking (cargo fmt)
    • Linting (cargo clippy)
    • Cross-platform tests (Linux, macOS, Windows)
  • Release Workflow: Automated releases on version tags

    • Builds binaries for Linux (x86_64, musl), Windows (x86_64), macOS (Intel, Apple Silicon)
    • Uploads release artifacts to GitHub
    • Publishes to crates.io

Using as a Library

SQL Stream can be used as a Rust library in your projects. Add it to your Cargo.toml:

[dependencies]
sql-stream = "0.1"

Then use the QueryEngine API:

use sql_stream::{QueryEngine, SqlStreamError};

#[tokio::main]
async fn main() -> Result<(), SqlStreamError> {
    let mut engine = QueryEngine::new()?;
    
    // Register a CSV file as a table
    engine.register_file("data.csv", "my_table").await?;
    
    // Execute a SQL query
    let results = engine.execute_query("SELECT * FROM my_table WHERE age > 30").await?;
    
    // Print results
    engine.print_results(results).await?;
    
    Ok(())
}

See the API documentation for more details.

Contributing

Contributions are welcome! Please ensure:

  1. All tests pass (cargo test)
  2. Code is formatted (cargo fmt)
  3. No clippy warnings (cargo clippy)
  4. Documentation is updated

License

This project is dual-licensed under MIT OR Apache-2.0.

Acknowledgments

Built with Apache DataFusion, Apache Arrow, Tokio, Clap, and Tracing.