Scalable, Out-of-Core Data
Structures for Data Science
Krishna Sridhar
Data Scientist, Dato Inc.
krishna_srd
• Background
- Machine Learning (ML) Research.
- Ph.D. in Numerical Optimization @Wisconsin
• Now
- Build ML tools for data-scientists & developers @Dato.
- Help deploy ML algorithms.
@krishna_srd, @DatoInc
About Me
Collaborators
45+ and growing fast!
Scalable Machine Learning
recommenders, other task-oriented ML,
boosted decision trees, deep learning, pattern
mining, many others, etc
GraphLab Create
SGraph, SFrame
Local, HDFS, S3
Compressed in-core or out-of-core scalable data structures
C++11
Dato Architecture
pip install graphlab-create
Dato (Open Source) Architecture
SGraph, SFrame
Compressed in-core or out-of-core scalable data structures
https://github.com/dato-code/sframe
Single Machine? Scalable??
Yes!
What can you do with a
single machine?
Build a Collaborative Filtering Model on 20 Billion
User-Item Ratings
Do PageRank on a 128 Billion edge graph.
How?
Data Structures!
[Figure: example data with labels User, Com., Title, Body, Disc.]
SFrame, SGraph, TimeSeries
SFrame Python API
Import GraphLab Create:
>> import graphlab as gl
Make a little SFrame of 1 column and 5 values:
>> sf = gl.SFrame({'x': [1, 2, 3, 4, 5]})
Normalize the column x:
>> sf['x'] = sf['x'] / sf['x'].sum()
Use a Python lambda to create a new column:
>> sf['x-squared'] = sf['x'].apply(lambda x: x*x if x > 0 else 0)
Create a new column using a vectorized operator:
>> sf['x-cubed'] = sf['x-squared'] * sf['x']
Create a new SFrame taking only 2 of the columns:
>> sf2 = sf[['x', 'x-squared']]
SFrame Design Principles
Graceful Degradation as 1st principle
- Always works
- High performance when in-memory, scales to disk.
Rich Datatypes
- Strong schema types: int, double, string, image.
- Weak schema types: list, dictionary (arbitrary JSON!)
Columnar Architecture
- Easy feature engineering + Vectorized feature operation
- Immutable columns + Lazy Evaluation
- Statistics + Sketching + Visualization
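A minimal sketch of the "graceful degradation" idea, in plain Python rather than the SFrame API. The class `ChunkedColumn` and its `max_in_memory` parameter are hypothetical names made up for this illustration; the actual SFrame storage layer is a compressed C++ column store and is not shown here. The point is only that the same code path works for small data (all in memory) and for large data (chunks spilled to disk and streamed back).

```python
import os
import pickle
import shutil
import tempfile


class ChunkedColumn:
    """Hypothetical append-only column: keeps at most `max_in_memory`
    values in RAM and spills full chunks to temporary files."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self._buffer = []        # values not yet spilled to disk
        self._chunk_paths = []   # on-disk chunks, in insertion order
        self._dir = tempfile.mkdtemp(prefix="chunked_column_")

    def append(self, value):
        self._buffer.append(value)
        if len(self._buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the full buffer as one chunk file and start a new buffer.
        path = os.path.join(self._dir, "chunk_%d.pkl" % len(self._chunk_paths))
        with open(path, "wb") as f:
            pickle.dump(self._buffer, f)
        self._chunk_paths.append(path)
        self._buffer = []

    def __iter__(self):
        # Stream one chunk at a time so memory use stays bounded.
        for path in self._chunk_paths:
            with open(path, "rb") as f:
                chunk = pickle.load(f)
            for value in chunk:
                yield value
        for value in self._buffer:
            yield value

    def sum(self):
        return sum(self)

    def close(self):
        # Remove the temporary chunk files.
        shutil.rmtree(self._dir, ignore_errors=True)


small = ChunkedColumn(max_in_memory=1000)   # stays entirely in memory
large = ChunkedColumn(max_in_memory=1000)   # spills to disk
for i in range(10):
    small.append(i)
for i in range(10000):
    large.append(i)

small_total = small.sum()
large_total = large.sum()
```

Both columns answer `sum()` through the same interface; only the speed differs, not whether it works.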
What is the SFrame?
>> sf = gl.SFrame('netflix_tr.frame')
>> sf2 = gl.SFrame('netflix_norm.frame')
>> sf['nrating'] = sf2['rating']
>> diff = sf['rating'] - sf2['rating']
[Figure: on-disk frames netflix_tr.frame and netflix_norm.frame, each with columns user, movie, rating; SFrames sf and sf2, each with columns user, item, rating; sf also holds nrating; diff is an anonymous column.]
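One way to read this slide: because columns are immutable, a frame can be little more than a set of named references to columns, so adding a column from another frame needs no copy, and a computed column can exist without belonging to any frame. The sketch below illustrates that reading in plain Python. The `Frame` class, the `subtract` helper, and all data values are made up for illustration; this is not the SFrame implementation.

```python
class Frame:
    """Hypothetical table: maps column names to immutable columns (tuples)."""

    def __init__(self, columns=None):
        self._columns = {}
        for name, column in (columns or {}).items():
            self[name] = column

    def __getitem__(self, name):
        return self._columns[name]

    def __setitem__(self, name, column):
        # Columns are immutable, so storing a reference is safe: no other
        # frame can change the data behind this frame's back.
        if not isinstance(column, tuple):
            column = tuple(column)
        self._columns[name] = column

    def column_names(self):
        return list(self._columns)


def subtract(a, b):
    """Element-wise difference, returned as a new anonymous column."""
    return tuple(x - y for x, y in zip(a, b))


# Made-up stand-ins for the two frames on the slide.
sf = Frame({"user": (1, 2, 3), "item": (10, 20, 30), "rating": (5, 3, 4)})
sf2 = Frame({"user": (1, 2, 3), "item": (10, 20, 30),
             "rating": (0.5, 0.25, 0.75)})

sf["nrating"] = sf2["rating"]                   # shares the column, no copy
diff = subtract(sf["rating"], sf2["rating"])    # belongs to no frame
```

Here `sf["nrating"]` and `sf2["rating"]` are the same object, which is only safe because neither frame can mutate it.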
What is the SFrame?
Filtering
>> sf[sf['rating'] >= 3]
Joins
>> sf.join(user_table, on='user_id')
Random/Array indexing
>> row10 = sf[10]
>> table_with_every_other_row = sf[::2]
Rather Fast Parallelized UDFs (Interproc SHM)
>> sf['rating'].apply(lambda x: x*x)
Not a SQL Frontend
SArray Column Types
Boring Scalar Types
- int64, double, string
Interesting Scalar Types
- Datetime, image
Mathematician Type
- array('d')
Industrial Data Scientist Type
- list, dict
SFrame Architecture
Physical Storage Layer
Compressed Column Store
(with some interesting properties)
Lazy Query Optimization / Execution
C++ Coroutine Exec Pipeline
Python API
Heavily Pandas Inspired
(+ immutable data considerations)
File System Abstraction: Local, HDFS, S3
Cache
Compression!
Type-aware compression methods.
Very aggressive numeric compression.
Netflix Dataset,
99M rows, 3 columns, ints
1.4GB raw
289MB gzip compressed
160MB SFrame
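To give a feel for why knowing the column type helps, here is a small sketch of one common integer technique: delta encoding followed by zigzag and variable-length (varint) byte encoding. This is an assumption chosen for illustration; the slides do not say which codecs SFrame uses. All function names and the sample data are made up.

```python
def zigzag(n):
    """Map signed ints to unsigned so small magnitudes stay small."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def encode_varint(value, out):
    """Append an unsigned int to `out` as a base-128 varint."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)          # last byte of this value
            return


def compress_ints(values):
    """Delta-encode, zigzag, then varint-encode a sequence of ints."""
    out = bytearray()
    previous = 0
    for v in values:
        encode_varint(zigzag(v - previous), out)
        previous = v
    return bytes(out)


def decompress_ints(data):
    """Invert compress_ints."""
    values, previous, current, shift = [], 0, 0, 0
    for byte in bytearray(data):
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += unzigzag(current)
            values.append(previous)
            current, shift = 0, 0
    return values


# Made-up column: slowly increasing ids, as in a ratings table sorted by user.
user_ids = [1000000 + i // 3 for i in range(30000)]
raw_text = "\n".join(str(v) for v in user_ids).encode("ascii")
packed = compress_ints(user_ids)
```

On this made-up column most deltas are 0 or 1, so nearly every value fits in a single byte, while the plain-text form needs about eight bytes per value.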
Query Evaluation
p['X4'] = p['X3'] + p['X2']
g = p[p['X1'] < 10]
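A rough sketch of what lazy evaluation means for the two statements above: each operation only records how to produce its values, and the work happens in a single pass when a result is actually requested. The `LazyColumn` class and its methods are hypothetical names for this illustration, and the data values are made up; the real SFrame query optimizer and C++ execution pipeline are not shown.

```python
class LazyColumn:
    """Hypothetical column that records how to compute its values
    instead of computing them immediately."""

    def __init__(self, make_iterator, description):
        self._make_iterator = make_iterator   # zero-arg callable -> iterator
        self.description = description        # readable form of the plan

    @classmethod
    def from_values(cls, values, name):
        values = tuple(values)
        return cls(lambda: iter(values), name)

    def __add__(self, other):
        return LazyColumn(
            lambda: (a + b for a, b in zip(self, other)),
            "(%s + %s)" % (self.description, other.description))

    def __lt__(self, constant):
        return LazyColumn(
            lambda: (a < constant for a in self),
            "(%s < %r)" % (self.description, constant))

    def filter(self, mask):
        return LazyColumn(
            lambda: (a for a, keep in zip(self, mask) if keep),
            "filter(%s, %s)" % (self.description, mask.description))

    def __iter__(self):
        return self._make_iterator()

    def materialize(self):
        # Only here does any element-wise work actually run.
        return list(self)


x1 = LazyColumn.from_values([5, 20, 7, 15], "X1")
x2 = LazyColumn.from_values([1, 2, 3, 4], "X2")
x3 = LazyColumn.from_values([10, 20, 30, 40], "X3")

x4 = x3 + x2                  # nothing computed yet
g = x4.filter(x1 < 10)        # still nothing computed
plan = g.description          # the recorded query
result = g.materialize()      # one streaming pass over the inputs
```

Because the whole plan is known before execution, the addition and the filter run as one chained pass without storing the intermediate column `x4`.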
Cross Platform?
Python Bindings
- Our oldest binding
- Via Cython + interprocess communication to a C++ binary
R Bindings
- Via Rcpp
- In Beta. Soon to be released.
C++ Bindings
- Used for internal development of
Julia Bindings
- “Hackathon” mock project mature
SGraph: Common Crawl
1x r3.8xlarge, using 1x SSD.
PageRank: 9 min per iteration.
Connected Components: ~1 hr.
There isn't any general purpose library out there capable of this.
3.5 billion Nodes and 128 billion Edges
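For reference, a minimal PageRank by power iteration, written so that each iteration only streams over the edge list and keeps per-node arrays in memory. This is the textbook algorithm in plain Python, not SGraph's implementation; the function signature, the `edges` callable convention, and the tiny graph are assumptions made for this sketch.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank that only streams over `edges`.

    `edges` is a zero-argument callable returning an iterator of
    (src, dst) pairs, so the edge list could be read from disk on
    every pass; only per-node arrays are held in memory.
    """
    out_degree = [0] * num_nodes
    for src, _ in edges():
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        incoming = [0.0] * num_nodes
        for src, dst in edges():
            incoming[dst] += rank[src] / out_degree[src]
        # Rank held by nodes without out-edges is spread uniformly.
        dangling = sum(r for r, d in zip(rank, out_degree) if d == 0)
        base = (1.0 - damping + damping * dangling) / num_nodes
        rank = [base + damping * x for x in incoming]
    return rank


# Tiny made-up graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
edge_list = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(lambda: iter(edge_list), num_nodes=4)
```

Node 2 has the most incoming links and ends up with the highest rank; node 3 has none and ends up with the lowest.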
Time Series!
Applications
- Log data mining.
- Sensor data mining.
- Churn Prediction.
- Transactional data processing.
- Financial data.
Log Data Mining
Data Structures!
SFrame, SGraph, TimeSeries
Demo!
Thanks!
https://github.com/dato-code/sframe
pip install sframe
pip install graphlab-create
