Readme
RuVector Dataset Discovery Framework
Find hidden patterns and connections in massive datasets that traditional tools miss.
RuVector turns your dataβresearch papers, climate records, financial filingsβinto a connected graph, then uses cutting-edge algorithms to spot emerging trends, cross-domain relationships, and regime shifts before they become obvious.
Why RuVector?
Most data analysis tools excel at answering questions you already know to ask. RuVector is different: it helps you discover what you don't know you're looking for .
Real-world examples:
π¬ Research : Spot a new field forming 6-12 months before it gets a name, by detecting when papers start citing across traditional boundaries
π Climate : Detect regime shifts in weather patterns that correlate with economic disruptions
π° Finance : Find companies whose narratives are diverging from their peersβoften an early warning signal
Features
Feature
What It Does
Why It Matters
Vector Memory
Stores data as 384-1536 dim embeddings
Similar concepts cluster together automatically
HNSW Index
O(log n) approximate nearest neighbor search
10-50x faster than brute force for large datasets
Graph Structure
Connects related items with weighted edges
Reveals hidden relationships in your data
Min-Cut Analysis
Measures how "connected" your network is
Detects regime changes and fragmentation
Cross-Domain Detection
Finds bridges between different fields
Discovers unexpected correlations (e.g., climate β finance)
ONNX Embeddings
Neural semantic embeddings (MiniLM, BGE, etc.)
Production-quality text understanding
Causality Testing
Checks if changes in X predict changes in Y
Moves beyond correlation to actionable insights
Statistical Rigor
Reports p-values and effect sizes
Know which findings are real vs. noise
What's New in v0.3.0
HNSW Integration : O(n log n) similarity search replaces O(nΒ²) brute force
Similarity Cache : 2-3x speedup for repeated similarity queries
Batch ONNX Embeddings : Chunked processing with progress callbacks
Shared Utils Module : cosine_similarity , euclidean_distance , normalize_vector
Auto-connect by Embeddings : CoherenceEngine creates edges from vector similarity
β‘ 10-50x faster similarity search (HNSW vs brute force)
β‘ 8.8x faster batch vector insertion (parallel processing)
β‘ 2.9x faster similarity computation (SIMD acceleration)
β‘ 2-3x faster repeated queries (similarity cache)
π Works with millions of records on standard hardware
Quick Start
Prerequisites
# Ensure you're in the ruvector workspace
cd /workspaces/ruvector
Run Your First Example
# 1. Performance benchmark - see the speed improvements
cargo run --example optimized_benchmark -p ruvector-data-framework --features parallel --release
# 2. Discovery hunter - find patterns in sample data
cargo run --example discovery_hunter -p ruvector-data-framework --features parallel --release
# 3. Cross-domain analysis - detect bridges between fields
cargo run --example cross_domain_discovery -p ruvector-data-framework --release
Domain-Specific Examples
# Climate: Detect weather regime shifts
cargo run --example regime_detector -p ruvector-data-climate
# Finance: Monitor corporate filing coherence
cargo run --example coherence_watch -p ruvector-data-edgar
What You'll See
π Discovery Results:
Pattern: Climate β Finance bridge detected
Strength: 0. 73 ( strong connection)
P- value: 0. 031 ( statistically significant)
β Drought indices may predict utility sector
performance with a 3 - period lag
The Discovery Thesis
RuVector's unique combination of vector memory , graph structures , and dynamic minimum cut algorithms enables discoveries that most analysis tools miss:
Emerging patterns before they have names : Detect topic splits and merges as cut boundaries shift over time
Non-obvious cross-domain bridges : Find small "connector" subgraphs where disciplines quietly start citing each other
Causal leverage maps : Link funders, labs, venues, and downstream citations to spot high-impact intervention points
Regime shifts in time series : Use coherence breaks to flag fundamental changes in system behavior
Tutorial
1. Creating the Engine
use ruvector_data_framework:: optimized:: {
OptimizedDiscoveryEngine, OptimizedConfig,
} ;
use ruvector_data_framework:: ruvector_native:: {
Domain, SemanticVector,
} ;
let config = OptimizedConfig {
similarity_threshold: 0. 55 , // Minimum cosine similarity
mincut_sensitivity: 0. 10 , // Coherence change threshold
cross_domain: true , // Enable cross-domain discovery
use_simd: true , // SIMD acceleration
significance_threshold: 0. 05 , // P-value threshold
causality_lookback: 12 , // Temporal lookback periods
.. Default :: default( )
} ;
let mut engine = OptimizedDiscoveryEngine:: new( config) ;
2. Adding Data
use std:: collections:: HashMap;
use chrono:: Utc;
// Single vector
let vector = SemanticVector {
id: " climate_drought_2024" . to_string ( ) ,
embedding: generate_embedding ( ) , // 128-dim vector
domain: Domain:: Climate,
timestamp: Utc:: now( ) ,
metadata: HashMap:: from( [
( " region" . to_string ( ) , " sahel" . to_string ( ) ) ,
( " severity" . to_string ( ) , " extreme" . to_string ( ) ) ,
] ) ,
} ;
let node_id = engine. add_vector ( vector) ;
// Batch insertion (8.8x faster)
# [ cfg ( feature = " parallel" ) ]
{
let vectors: Vec < SemanticVector> = load_vectors ( ) ;
let node_ids = engine. add_vectors_batch ( vectors) ;
}
3. Computing Coherence
let snapshot = engine. compute_coherence ( ) ;
println! ( " Min-cut value: {:.3} " , snapshot. mincut_value) ;
println! ( " Partition sizes: {:?} " , snapshot. partition_sizes) ;
println! ( " Boundary nodes: {:?} " , snapshot. boundary_nodes) ;
Interpretation:
Min-cut Trend
Meaning
Rising
Network consolidating, stronger connections
Falling
Fragmentation, potential regime change
Stable
Steady state, consistent structure
4. Pattern Detection
let patterns = engine. detect_patterns_with_significance ( ) ;
for pattern in patterns. iter ( ) . filter ( | p | p. is_significant ) {
println! ( " {} " , pattern. pattern. description) ;
println! ( " P-value: {:.4} " , pattern. p_value) ;
println! ( " Effect size: {:.3} " , pattern. effect_size) ;
}
Pattern Types:
Type
Description
Example
CoherenceBreak
Min-cut dropped significantly
Network fragmentation crisis
Consolidation
Min-cut increased
Market convergence
BridgeFormation
Cross-domain connections
Climate-finance link
Cascade
Temporal causality
Climate β Finance lag-3
EmergingCluster
New dense subgraph
Research topic emerging
5. Cross-Domain Analysis
// Check coupling strength
let stats = engine. stats ( ) ;
let coupling = stats. cross_domain_edges as f64 / stats. total_edges as f64 ;
println! ( " Cross-domain coupling: {:.1} %" , coupling * 100. 0 ) ;
// Domain coherence scores
for domain in [ Domain:: Climate, Domain:: Finance, Domain:: Research] {
if let Some ( coh) = engine. domain_coherence ( domain) {
println! ( " {:?} : {:.3} " , domain, coh) ;
}
}
Operation
Baseline
Optimized
Speedup
Vector Insertion
133ms
15ms
8.84x
SIMD Cosine
432ms
148ms
2.91x
Pattern Detection
524ms
655ms
-
Datasets
1. OpenAlex (Research Intelligence)
Best for : Emerging field detection, cross-discipline bridges
250M+ works, 90M+ authors
Native graph structure
Bulk download + API access
use ruvector_data_openalex:: { OpenAlexConfig, FrontierRadar} ;
let radar = FrontierRadar:: new( OpenAlexConfig:: default( ) ) ;
let frontiers = radar. detect_emerging_topics ( papers) ;
2. NOAA + NASA (Climate Intelligence)
Best for : Regime shift detection, anomaly prediction
Weather observations, satellite imagery
Time series β graph transformation
Economic risk modeling
use ruvector_data_climate:: { ClimateConfig, RegimeDetector} ;
let detector = RegimeDetector:: new( config) ;
let shifts = detector. detect_shifts ( ) ;
3. SEC EDGAR (Financial Intelligence)
Best for : Corporate risk signals, peer divergence
XBRL financial statements
10-K/10-Q filings
Narrative + fundamental analysis
use ruvector_data_edgar:: { EdgarConfig, CoherenceMonitor} ;
let monitor = CoherenceMonitor:: new( config) ;
let alerts = monitor. analyze_filing ( filing) ;
Directory Structure
examples/ data/
βββ README . md # This file
βββ Cargo. toml # Workspace manifest
βββ framework/ # Core discovery framework
β βββ src/
β β βββ lib. rs # Framework exports
β β βββ ruvector_native. rs # Native engine with Stoer- Wagner
β β βββ optimized. rs # SIMD + parallel optimizations
β β βββ coherence. rs # Coherence signal computation
β β βββ discovery. rs # Pattern detection
β β βββ ingester. rs # Data ingestion
β βββ examples/
β βββ cross_domain_discovery. rs # Cross- domain patterns
β βββ optimized_benchmark. rs # Performance comparison
β βββ discovery_hunter. rs # Novel pattern search
βββ openalex/ # OpenAlex integration
βββ climate/ # NOAA / NASA integration
βββ edgar/ # SEC EDGAR integration
Configuration Reference
OptimizedConfig
Parameter
Default
Description
similarity_threshold
0.65
Minimum cosine similarity for edges
mincut_sensitivity
0.12
Sensitivity to coherence changes
cross_domain
true
Enable cross-domain discovery
batch_size
256
Parallel batch size
use_simd
true
Enable SIMD acceleration
similarity_cache_size
10000
Max cached similarity pairs
significance_threshold
0.05
P-value threshold
causality_lookback
10
Temporal lookback periods
causality_min_correlation
0.6
Minimum correlation for causality
CoherenceConfig (v0.3.0)
Parameter
Default
Description
similarity_threshold
0.5
Min similarity for auto-connecting embeddings
use_embeddings
true
Auto-create edges from embedding similarity
hnsw_k_neighbors
50
Neighbors to search per vector (HNSW)
hnsw_min_records
100
Min records to trigger HNSW (else brute force)
min_edge_weight
0.01
Minimum edge weight threshold
approximate
true
Use approximate min-cut for speed
parallel
true
Enable parallel computation
Discovery Examples
Climate-Finance Bridge
Detected: Climate β Finance bridge
Strength: 0. 73
Connections: 197
Hypothesis: Drought indices may predict
utility sector performance with lag- 2
Regime Shift Detection
Min- cut trajectory:
t= 0 : 72. 5 ( baseline)
t= 1 : 73. 3 ( + 1. 1 % )
t= 2 : 74. 5 ( + 1. 6 % ) β Consolidation
Effect size: 2. 99 ( large)
P- value: 0. 042 ( significant)
Causality Pattern
Climate β Finance causality detected
F- statistic: 4. 23
Optimal lag: 3 periods
Correlation: 0. 67
P- value: 0. 031
Algorithms
HNSW (Hierarchical Navigable Small World)
Approximate nearest neighbor search in high-dimensional spaces.
Complexity : O(log n) search, O(log n) insert
Use : Fast similarity search for edge creation
Parameters : m= 16 , ef_construction= 200 , ef_search= 50
Stoer-Wagner Min-Cut
Computes minimum cut of weighted undirected graph.
Complexity : O(VE + VΒ² log V)
Use : Network coherence measurement
SIMD Cosine Similarity
Processes 8 floats per iteration using AVX2.
Speedup : 2.9x vs scalar
Fallback : Chunked scalar (8 floats per iteration)
Granger Causality
Tests if past values of X predict Y.
Compute cross-correlation at lags 1..k
Find optimal lag with max |correlation|
Calculate F-statistic
Convert to p-value
Best Practices
Start with low thresholds - Use similarity_threshold: 0. 45 for exploration
Use batch insertion - add_vectors_batch ( ) is 8x faster
Monitor coherence trends - Min-cut trajectory predicts regime changes
Filter by significance - Focus on p_value < 0. 05
Validate causality - Temporal patterns need domain expertise
Troubleshooting
Problem
Solution
No patterns detected
Lower mincut_sensitivity to 0.05
Too many edges
Raise similarity_threshold to 0.70
Slow performance
Use --features parallel --release
Memory issues
Reduce batch_size
References
License
MIT OR Apache-2.0