3 releases (breaking)
| Version | Released |
|---|---|
| 0.3.0 | Jan 5, 2026 |
| 0.2.0 | Jan 5, 2026 |
| 0.1.0 | Jan 5, 2026 |
RuVector Dataset Discovery Framework
Find hidden patterns and connections in massive datasets that traditional tools miss.
RuVector turns your data (research papers, climate records, financial filings) into a connected graph, then applies vector search, dynamic min-cut analysis, and causality tests to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.
Why RuVector?
Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for.
Real-world examples:
- Research: Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
- Climate: Detect regime shifts in weather patterns that correlate with economic disruptions
- Finance: Find companies whose narratives are diverging from their peers, often an early warning signal
Features
| Feature | What It Does | Why It Matters |
|---|---|---|
| Vector Memory | Stores data as 384-1536 dim embeddings | Similar concepts cluster together automatically |
| HNSW Index | O(log n) approximate nearest neighbor search | 10-50x faster than brute force for large datasets |
| Graph Structure | Connects related items with weighted edges | Reveals hidden relationships in your data |
| Min-Cut Analysis | Measures how "connected" your network is | Detects regime changes and fragmentation |
| Cross-Domain Detection | Finds bridges between different fields | Discovers unexpected correlations (e.g., climate ↔ finance) |
| ONNX Embeddings | Neural semantic embeddings (MiniLM, BGE, etc.) | Production-quality text understanding |
| Causality Testing | Checks if changes in X predict changes in Y | Moves beyond correlation to actionable insights |
| Statistical Rigor | Reports p-values and effect sizes | Know which findings are real vs. noise |
What's New in v0.3.0
- HNSW Integration: O(n log n) similarity search replaces O(n²) brute force
- Similarity Cache: 2-3x speedup for repeated similarity queries
- Batch ONNX Embeddings: Chunked processing with progress callbacks
- Shared Utils Module: cosine_similarity, euclidean_distance, normalize_vector
- Auto-connect by Embeddings: CoherenceEngine creates edges from vector similarity
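The three helper names listed above suggest standard vector operations. As a minimal sketch of what such helpers typically compute (the signatures and zero-vector handling here are assumptions, not the crate's actual API):

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns 0.0 when either vector has zero norm (an assumed convention).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Euclidean (L2) distance between two equal-length vectors.
pub fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Scale a vector to unit length; a zero vector is returned unchanged.
pub fn normalize_vector(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        v.to_vec()
    } else {
        v.iter().map(|x| x / norm).collect()
    }
}
```

For unit-length vectors, cosine similarity reduces to a plain dot product, which is one reason to normalize embeddings once at insertion time.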
Performance
- 10-50x faster similarity search (HNSW vs brute force)
- 8.8x faster batch vector insertion (parallel processing)
- 2.9x faster similarity computation (SIMD acceleration)
- 2-3x faster repeated queries (similarity cache)
- Works with millions of records on standard hardware
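To illustrate why a similarity cache helps with repeated queries, here is a minimal sketch of a bounded pair cache. The type name `SimilarityCache`, its methods, and the "stop caching when full" policy are all hypothetical; the crate's real cache may evict entries or key them differently.

```rust
use std::collections::HashMap;

/// Bounded cache for pairwise similarities (hypothetical sketch).
pub struct SimilarityCache {
    capacity: usize,
    entries: HashMap<(u32, u32), f32>,
    pub hits: u64,
    pub misses: u64,
}

impl SimilarityCache {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the cached similarity for the unordered pair (a, b),
    /// or run `compute` and store the result if there is room.
    pub fn get_or_compute<F: FnOnce() -> f32>(&mut self, a: u32, b: u32, compute: F) -> f32 {
        // Similarity is symmetric, so (a, b) and (b, a) share one entry.
        let key = if a <= b { (a, b) } else { (b, a) };
        if let Some(&sim) = self.entries.get(&key) {
            self.hits += 1;
            return sim;
        }
        self.misses += 1;
        let sim = compute();
        // Assumed policy: once the cache is full, new pairs are not stored.
        if self.entries.len() < self.capacity {
            self.entries.insert(key, sim);
        }
        sim
    }
}
```

Normalizing the key order halves the number of entries needed, since each pair is stored once regardless of query direction.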
Quick Start
Prerequisites
# Ensure you're in the ruvector workspace
cd /workspaces/ruvector
Run Your First Example
# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release
# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release
# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release
Domain-Specific Examples
# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate
# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar
What You'll See
Discovery Results:
Pattern: Climate ↔ Finance bridge detected
Strength: 0.73 (strong connection)
P-value: 0.031 (statistically significant)
→ Drought indices may predict utility sector
performance with a 3-period lag
The Discovery Thesis
RuVector's unique combination of vector memory, graph structures, and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:
- Emerging patterns before they have names: Detect topic splits and merges as cut boundaries shift over time
- Non-obvious cross-domain bridges: Find small "connector" subgraphs where disciplines quietly start citing each other
- Causal leverage maps: Link funders, labs, venues, and downstream citations to spot high-impact intervention points
- Regime shifts in time series: Use coherence breaks to flag fundamental changes in system behavior
Tutorial
1. Creating the Engine
use ruvector_data_framework::optimized::{
OptimizedDiscoveryEngine, OptimizedConfig,
};
use ruvector_data_framework::ruvector_native::{
Domain, SemanticVector,
};
let config = OptimizedConfig {
similarity_threshold: 0.55, // Minimum cosine similarity
mincut_sensitivity: 0.10, // Coherence change threshold
cross_domain: true, // Enable cross-domain discovery
use_simd: true, // SIMD acceleration
significance_threshold: 0.05, // P-value threshold
causality_lookback: 12, // Temporal lookback periods
..Default::default()
};
let mut engine = OptimizedDiscoveryEngine::new(config);
2. Adding Data
use std::collections::HashMap;
use chrono::Utc;
// Single vector
let vector = SemanticVector {
id: "climate_drought_2024".to_string(),
embedding: generate_embedding(), // 128-dim vector
domain: Domain::Climate,
timestamp: Utc::now(),
metadata: HashMap::from([
("region".to_string(), "sahel".to_string()),
("severity".to_string(), "extreme".to_string()),
]),
};
let node_id = engine.add_vector(vector);
// Batch insertion (8.8x faster)
#[cfg(feature = "parallel")]
{
let vectors: Vec<SemanticVector> = load_vectors();
let node_ids = engine.add_vectors_batch(vectors);
}
3. Computing Coherence
let snapshot = engine.compute_coherence();
println!("Min-cut value: {:.3}", snapshot.mincut_value);
println!("Partition sizes: {:?}", snapshot.partition_sizes);
println!("Boundary nodes: {:?}", snapshot.boundary_nodes);
Interpretation:
| Min-cut Trend | Meaning |
|---|---|
| Rising | Network consolidating, stronger connections |
| Falling | Fragmentation, potential regime change |
| Stable | Steady state, consistent structure |
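The trend table can be applied mechanically to a series of min-cut values. The sketch below labels each step by its relative change; the `Trend` enum, the `classify_trend` function, and the use of a relative threshold are assumptions for illustration and may not match how the engine applies `mincut_sensitivity`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Trend {
    Rising,
    Falling,
    Stable,
}

/// Label each consecutive pair of min-cut values by relative change.
/// A step counts as Rising or Falling only if it exceeds `sensitivity`
/// (for example, 0.10 means a 10% change).
pub fn classify_trend(series: &[f64], sensitivity: f64) -> Vec<Trend> {
    series
        .windows(2)
        .map(|w| {
            // Guard against division by zero for an empty (zero-weight) cut.
            let base = w[0].abs().max(f64::EPSILON);
            let change = (w[1] - w[0]) / base;
            if change > sensitivity {
                Trend::Rising
            } else if change < -sensitivity {
                Trend::Falling
            } else {
                Trend::Stable
            }
        })
        .collect()
}
```

With a threshold of 0.01, the series `[72.5, 73.3, 74.5]` yields two `Rising` steps; with 0.10 both steps are `Stable`, which shows how strongly the chosen sensitivity shapes what counts as a trend.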
4. Pattern Detection
let patterns = engine.detect_patterns_with_significance();
for pattern in patterns.iter().filter(|p| p.is_significant) {
println!("{}", pattern.pattern.description);
println!(" P-value: {:.4}", pattern.p_value);
println!(" Effect size: {:.3}", pattern.effect_size);
}
Pattern Types:
| Type | Description | Example |
|---|---|---|
| CoherenceBreak | Min-cut dropped significantly | Network fragmentation crisis |
| Consolidation | Min-cut increased | Market convergence |
| BridgeFormation | Cross-domain connections | Climate-finance link |
| Cascade | Temporal causality | Climate → Finance lag-3 |
| EmergingCluster | New dense subgraph | Research topic emerging |
5. Cross-Domain Analysis
// Check coupling strength
let stats = engine.stats();
let coupling = stats.cross_domain_edges as f64 / stats.total_edges as f64;
println!("Cross-domain coupling: {:.1}%", coupling * 100.0);
// Domain coherence scores
for domain in [Domain::Climate, Domain::Finance, Domain::Research] {
if let Some(coh) = engine.domain_coherence(domain) {
println!("{:?}: {:.3}", domain, coh);
}
}
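The coupling ratio above divides by `total_edges`, which yields NaN on an empty graph. A small guard avoids printing a meaningless percentage; `coupling_ratio` is a hypothetical helper, not part of the crate.

```rust
/// Fraction of edges that cross domain boundaries,
/// or None when the graph has no edges yet.
pub fn coupling_ratio(cross_domain_edges: usize, total_edges: usize) -> Option<f64> {
    if total_edges == 0 {
        None
    } else {
        Some(cross_domain_edges as f64 / total_edges as f64)
    }
}
```

A caller can then match on the result and report "no edges yet" rather than "NaN%".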
Performance Benchmarks
| Operation | Baseline | Optimized | Speedup |
|---|---|---|---|
| Vector Insertion | 133ms | 15ms | 8.84x |
| SIMD Cosine | 432ms | 148ms | 2.91x |
| Pattern Detection | 524ms | 655ms | - |
Datasets
1. OpenAlex (Research Intelligence)
Best for: Emerging field detection, cross-discipline bridges
- 250M+ works, 90M+ authors
- Native graph structure
- Bulk download + API access
use ruvector_data_openalex::{OpenAlexConfig, FrontierRadar};
let radar = FrontierRadar::new(OpenAlexConfig::default());
let frontiers = radar.detect_emerging_topics(papers);
2. NOAA + NASA (Climate Intelligence)
Best for: Regime shift detection, anomaly prediction
- Weather observations, satellite imagery
- Time series → graph transformation
- Economic risk modeling
use ruvector_data_climate::{ClimateConfig, RegimeDetector};
let detector = RegimeDetector::new(config);
let shifts = detector.detect_shifts();
3. SEC EDGAR (Financial Intelligence)
Best for: Corporate risk signals, peer divergence
- XBRL financial statements
- 10-K/10-Q filings
- Narrative + fundamental analysis
use ruvector_data_edgar::{EdgarConfig, CoherenceMonitor};
let monitor = CoherenceMonitor::new(config);
let alerts = monitor.analyze_filing(filing);
Directory Structure
examples/data/
├── README.md                          # This file
├── Cargo.toml                         # Workspace manifest
├── framework/                         # Core discovery framework
│   ├── src/
│   │   ├── lib.rs                     # Framework exports
│   │   ├── ruvector_native.rs         # Native engine with Stoer-Wagner
│   │   ├── optimized.rs               # SIMD + parallel optimizations
│   │   ├── coherence.rs               # Coherence signal computation
│   │   ├── discovery.rs               # Pattern detection
│   │   └── ingester.rs                # Data ingestion
│   └── examples/
│       ├── cross_domain_discovery.rs  # Cross-domain patterns
│       ├── optimized_benchmark.rs     # Performance comparison
│       └── discovery_hunter.rs        # Novel pattern search
├── openalex/                          # OpenAlex integration
├── climate/                           # NOAA/NASA integration
└── edgar/                             # SEC EDGAR integration
Configuration Reference
OptimizedConfig
| Parameter | Default | Description |
|---|---|---|
| similarity_threshold | 0.65 | Minimum cosine similarity for edges |
| mincut_sensitivity | 0.12 | Sensitivity to coherence changes |
| cross_domain | true | Enable cross-domain discovery |
| batch_size | 256 | Parallel batch size |
| use_simd | true | Enable SIMD acceleration |
| similarity_cache_size | 10000 | Max cached similarity pairs |
| significance_threshold | 0.05 | P-value threshold |
| causality_lookback | 10 | Temporal lookback periods |
| causality_min_correlation | 0.6 | Minimum correlation for causality |
CoherenceConfig (v0.3.0)
| Parameter | Default | Description |
|---|---|---|
| similarity_threshold | 0.5 | Min similarity for auto-connecting embeddings |
| use_embeddings | true | Auto-create edges from embedding similarity |
| hnsw_k_neighbors | 50 | Neighbors to search per vector (HNSW) |
| hnsw_min_records | 100 | Min records to trigger HNSW (else brute force) |
| min_edge_weight | 0.01 | Minimum edge weight threshold |
| approximate | true | Use approximate min-cut for speed |
| parallel | true | Enable parallel computation |
Discovery Examples
Climate-Finance Bridge
Detected: Climate ↔ Finance bridge
Strength: 0.73
Connections: 197
Hypothesis: Drought indices may predict
utility sector performance with lag-2
Regime Shift Detection
Min-cut trajectory:
t=0: 72.5 (baseline)
t=1: 73.3 (+1.1%)
t=2: 74.5 (+1.6%) ← Consolidation
Effect size: 2.99 (large)
P-value: 0.042 (significant)
Causality Pattern
Climate → Finance causality detected
F-statistic: 4.23
Optimal lag: 3 periods
Correlation: 0.67
P-value: 0.031
Algorithms
HNSW (Hierarchical Navigable Small World)
Approximate nearest neighbor search in high-dimensional spaces.
- Complexity: O(log n) search, O(log n) insert
- Use: Fast similarity search for edge creation
- Parameters: m=16, ef_construction=200, ef_search=50
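For context, this is the exact brute-force search that HNSW approximates: score every stored vector against the query and keep the top k. It is a baseline sketch only (the function names are illustrative, not the crate's API) and costs O(n) per query, which becomes O(n²) when every vector is queried to build edges.

```rust
/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Exact k-nearest-neighbor search by scanning every vector.
/// Returns (index, similarity) pairs, most similar first.
pub fn brute_force_knn(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // Sort descending by similarity; total_cmp gives a total order on floats.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

A baseline like this is also useful for measuring recall of an approximate index: run both on a sample of queries and compare the returned neighbor sets.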
Stoer-Wagner Min-Cut
Computes minimum cut of weighted undirected graph.
- Complexity: O(VE + V² log V)
- Use: Network coherence measurement
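A compact way to see how Stoer-Wagner works is the adjacency-matrix variant below. It is a sketch under stated assumptions: the input is a symmetric matrix with a zero diagonal, and the simple O(V³) vertex selection is used rather than the heap-based version that achieves the complexity quoted above. It is not the crate's implementation.

```rust
/// Weight of the global minimum cut of an undirected weighted graph,
/// given as a symmetric adjacency matrix with a zero diagonal.
/// Returns None for graphs with fewer than two vertices.
pub fn stoer_wagner(mut w: Vec<Vec<f64>>) -> Option<f64> {
    if w.len() < 2 {
        return None;
    }
    // Vertices still present (merged vertices are removed).
    let mut active: Vec<usize> = (0..w.len()).collect();
    let mut best = f64::INFINITY;

    while active.len() > 1 {
        let m = active.len();
        let mut weight_to_set = vec![0.0; m]; // connection weight to the growing set
        let mut added = vec![false; m];
        let mut order = Vec::with_capacity(m);

        // Minimum cut phase: repeatedly add the most tightly connected vertex.
        for _ in 0..m {
            let mut sel = usize::MAX;
            for i in 0..m {
                if !added[i] && (sel == usize::MAX || weight_to_set[i] > weight_to_set[sel]) {
                    sel = i;
                }
            }
            added[sel] = true;
            order.push(sel);
            for i in 0..m {
                if !added[i] {
                    weight_to_set[i] += w[active[sel]][active[i]];
                }
            }
        }

        // The cut of the phase separates the last-added vertex from the rest.
        let (prev, last) = (order[m - 2], order[m - 1]);
        best = best.min(weight_to_set[last]);

        // Merge the last vertex into the second-to-last one.
        let (s, t) = (active[prev], active[last]);
        for &v in &active {
            w[s][v] += w[t][v];
            w[v][s] = w[s][v];
        }
        w[s][s] = 0.0;
        active.remove(last);
    }
    Some(best)
}
```

For two dense clusters joined by one weak edge, the returned value is the weight of that edge, which is why a falling min-cut is read as fragmentation.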
SIMD Cosine Similarity
Processes 8 floats per iteration using AVX2.
- Speedup: 2.9x vs scalar
- Fallback: Chunked scalar (8 floats per iteration)
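The chunked scalar fallback can be sketched in portable Rust: process eight lanes per iteration with independent accumulators, then handle the leftover tail. This is an illustrative version, not the crate's code; the function name is hypothetical, and whether the compiler auto-vectorizes the inner loop depends on target and optimization flags.

```rust
/// Cosine similarity computed 8 floats at a time, with a scalar tail.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine_chunked(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let split = a.len() - a.len() % 8;

    // Eight independent accumulators per quantity, mirroring 8 SIMD lanes.
    let mut dot = [0.0f32; 8];
    let mut norm_a = [0.0f32; 8];
    let mut norm_b = [0.0f32; 8];
    for (x, y) in a[..split].chunks_exact(8).zip(b[..split].chunks_exact(8)) {
        for i in 0..8 {
            dot[i] += x[i] * y[i];
            norm_a[i] += x[i] * x[i];
            norm_b[i] += y[i] * y[i];
        }
    }

    // Horizontal sums of the lane accumulators.
    let mut d: f32 = dot.iter().sum();
    let mut sa: f32 = norm_a.iter().sum();
    let mut sb: f32 = norm_b.iter().sum();

    // Remaining elements when the length is not a multiple of 8.
    for (x, y) in a[split..].iter().zip(&b[split..]) {
        d += x * y;
        sa += x * x;
        sb += y * y;
    }

    if sa == 0.0 || sb == 0.0 {
        0.0
    } else {
        d / (sa.sqrt() * sb.sqrt())
    }
}
```

Because the lane accumulators change the order of floating-point additions, results can differ from a naive left-to-right sum in the last few bits.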
Granger Causality
Tests if past values of X predict Y.
- Compute cross-correlation at lags 1..k
- Find optimal lag with max |correlation|
- Calculate F-statistic
- Convert to p-value
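The first two steps above (lagged cross-correlation and picking the optimal lag) can be sketched as follows. The function names are hypothetical, and the F-statistic and p-value conversion are deliberately left out, since they depend on the regression setup and an F-distribution CDF that the source does not specify.

```rust
/// Pearson correlation of two equal-length series.
/// Returns None if the input is too short or either series is constant.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n < 2 || n != y.len() {
        return None;
    }
    let mean_x = x.iter().sum::<f64>() / n as f64;
    let mean_y = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y) {
        let (dx, dy) = (a - mean_x, b - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        None
    } else {
        Some(sxy / (sxx * syy).sqrt())
    }
}

/// Correlate x[t] with y[t + lag] for lag in 1..=max_lag and return
/// the (lag, correlation) pair with the largest absolute correlation.
pub fn best_lag(x: &[f64], y: &[f64], max_lag: usize) -> Option<(usize, f64)> {
    let n = x.len().min(y.len());
    (1..=max_lag)
        .filter(|&lag| lag < n)
        .filter_map(|lag| pearson(&x[..n - lag], &y[lag..n]).map(|r| (lag, r)))
        .max_by(|a, b| a.1.abs().total_cmp(&b.1.abs()))
}
```

A high lagged correlation alone does not establish predictive value; the F-test step exists to check whether adding lagged X improves on a model that uses only past Y.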
Best Practices
- Start with low thresholds - Use similarity_threshold: 0.45 for exploration
- Use batch insertion - add_vectors_batch() is 8.8x faster
- Monitor coherence trends - The min-cut trajectory can signal regime changes
- Filter by significance - Focus on p_value < 0.05
- Validate causality - Temporal patterns need domain expertise
Troubleshooting
| Problem | Solution |
|---|---|
| No patterns detected | Lower mincut_sensitivity to 0.05 |
| Too many edges | Raise similarity_threshold to 0.70 |
| Slow performance | Use --features parallel --release |
| Memory issues | Reduce batch_size |
License
MIT OR Apache-2.0