March 16, 2016 · 7 min read

Large Data with Scikit-learn - Boston Meetup


By Alexis Perrier


Plan

  1. What is large data? (out-of-core, streaming, online, batch?)

  2. Algorithms

  3. Implementation

  4. Examples

Large data cannot fit in main memory (RAM).

At 2-10 GB the computer starts swapping too much. Slow!

From in-memory to on-disk

  • Being able to work with data that fits on disk, on a single machine, with no cluster

### Many great alternatives

### Terminology

  • Out-of-core / external memory

Data does not fit in main memory => access data from a slower data store (disk, ...), typically 10x-100x slower

  • Offline: all the data is available from the start

  • Online: serialized; same results as offline; acts on new data right away

  • Streaming: serialized; a limited number of passes over the data; processing can be postponed; "old" data is of less importance

  • Minibatch: serialized in blocks

  • Incremental = online + minibatch

https://stats.stackexchange.com/questions/22439/incremental-or-online-or-single-pass-or-data-stream-clustering-refers-to-the-sam

clf = Some Model or Transformation

  • Train on training data X_train, y_train

clf.fit(X_train,y_train)

  • Predict on test data: X_test => y_test

y_test = clf.predict(X_test)

  • Assess model performance on test data: y_truth vs y_test

clf.score(X_test, y_truth)  # or a metric such as accuracy_score(y_truth, y_test)

  • Predict on new data: \( \hat{y} = clf.predict(X_{new}) \)
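The standard workflow above can be sketched end-to-end; the model choice (SGDClassifier) and synthetic data are illustrative:

```python
# Minimal sketch of the fit / predict / score workflow.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)              # train on training data
y_pred = clf.predict(X_test)           # predict on test data
acc = accuracy_score(y_test, y_pred)   # assess: y_truth vs predictions
print(acc)
```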
### Scikit-learn: out-of-core
  • Split the training data in blocks: mini-batch (sequential, random, ...)

  • Load each block in sequence

  • Train the model adaptively on each block

  • Convergence!

### Scikit-learn minibatch learning
  • Batch size n
  • split the data in N blocks
  • .partial_fit instead of .fit

clf.partial_fit(X_train_i, y_i, classes=all_classes)

  • All possible classes must be given to partial_fit on the first call (via the classes argument)
  • i = 1..N number of blocks
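A minimal sketch of this minibatch loop; SGDClassifier and numpy's array_split (for the block splitting) are illustrative choices:

```python
# Minibatch training with partial_fit instead of fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)
all_classes = np.unique(y)           # classes must be given on the first call
clf = SGDClassifier(random_state=0)

N = 10                               # number of blocks
for X_i, y_i in zip(np.array_split(X, N), np.array_split(y, N)):
    clf.partial_fit(X_i, y_i, classes=all_classes)

print(clf.score(X, y))
```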
### Scikit-learn: implementation with generators

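A hedged sketch of the generator idea: stream (X, y) blocks from a CSV on disk without loading it all; the column layout (features first, label last) is an assumption:

```python
# Generator yielding minibatches from a CSV file on disk.
import csv
import numpy as np

def iter_minibatches(path, batch_size=500):
    """Yield (X, y) blocks; the last column is assumed to be the label."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        block = []
        for row in reader:
            block.append([float(v) for v in row])
            if len(block) == batch_size:
                arr = np.asarray(block)
                yield arr[:, :-1], arr[:, -1]
                block = []
        if block:                        # final, possibly smaller block
            arr = np.asarray(block)
            yield arr[:, :-1], arr[:, -1]
```

Each yielded block can then be passed straight to `partial_fit`.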

II Algorithms

|                         | Regression | Classification |
|-------------------------|------------|----------------|
| Stochastic Gradient (1) | X          | X              |
| Naive Bayes (2)         |            | X              |

(1) SG + PassiveAggressive + Perceptron (classifier only)

(2) NB: MultinomialNB, BernoulliNB

|                         | Clustering | Decomposition |
|-------------------------|------------|---------------|
| MiniBatchKMeans         | X          |               |
| DictionaryLearning, PCA |            | X             |

Stochastic Gradient - Wiener filtering

Gradient descent iteratively solves the least-squares problem whose closed-form (Wiener) solution is

$$ w^* = (X^T X)^{-1} X^T Y $$
with the update
$$ w_{n+1} = w_n - \mu \sum_i (w_n^{T} x_i - y_i) x_i $$

Deterministic!


SGD / LMS:

Approximate the gradient one sample \( (x_i, y_i) \) at a time:

$$ w_{n+1} = w_n - \mu (w_n^{T} x_i - y_i) x_i $$

  • Slow convergence
  • Poor approximation of Gradient => oscillations

Mini-batch SGD / Block-LMS

N samples at a time to estimate the gradient:

$$ w_{n+1} = w_n - \frac{\mu}{N} \sum_{i=1}^{N} (w_n^{T} x_i - y_i) x_i $$

Less gradient noise and better convergence than one-sample SGD, and the block update is still easy to optimize (vectorized)
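The block update can be sketched in NumPy on a toy noise-free linear problem; the step size and batch size are illustrative:

```python
# Mini-batch SGD / block-LMS for a linear model, in plain NumPy.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true                        # noise-free targets

w = np.zeros(2)
mu, N = 0.5, 50                       # learning rate, batch size
for X_b, y_b in zip(np.split(X, len(X) // N), np.split(y, len(y) // N)):
    err = X_b @ w - y_b               # (w^T x_i - y_i) for the block
    w -= (mu / N) * X_b.T @ err       # averaged gradient step

print(w)                              # converges toward w_true
```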

#### Other loss functions
  • GD: loss = MSE
  • Passive Aggressive: Hinge loss
  • Perceptron: loss = +/- 1
###### Mini Batch K-Means

D. Sculley @Google, 2010, Web Scale K-Means Clustering

  • Each mini-batch
    • Each sample \( x \) in the batch
      • Find the closest center \( C_{x} \) (min distance)
    • Each sample \( x \) in the batch
      • Update \( C_{x} \) with
$$ C_x = \left(1 - \frac{\mu}{|C_x|}\right) C_x + \frac{\mu}{|C_x|} x $$

Same convergence level
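A quick sketch comparing the two estimators in scikit-learn (dataset and sizes are illustrative); the inertia values, i.e. the k-means objective, come out comparable:

```python
# MiniBatchKMeans vs full KMeans on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans

X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=500, n_init=10,
                      random_state=0).fit(X)

print(km.inertia_, mbk.inertia_)   # similar objective values
```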

  • Classification: Mini-batches need to be balanced for classes

  • Perceptron, PA and SGD put different emphasis on samples over time.

  • How does batch size influence the results?

### Batch size

The bigger the better? Until you run out of memory?
  • Trade-off: time efficiency of training (small batches) vs noisiness of the gradient estimate (lower with large batches)

Adding some noise (small batches) can help escape local minima

In practice

  • small to moderate mini-batch sizes (10-500)
  • decaying learning rate

https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-gradient-descent
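In scikit-learn, a decaying learning rate is available out of the box, e.g. SGDRegressor's 'invscaling' schedule (eta = eta0 / t^power_t); the values below are illustrative:

```python
# SGD with a decaying learning rate ('invscaling' schedule).
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

clf = SGDRegressor(learning_rate="invscaling", eta0=0.01, power_t=0.25,
                   random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # R^2 on the training data
```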

III Examples

Mini-Batch K-Means vs K-Means:

  • Not the same partition as K-Means, but much faster

  • Same convergence level

Text classification, batch sizes 100, 500, 2000:

  • Accuracy results are similar

  • Training time decreases for the models

  • Stays basically the same for vectorization

  • Feature extraction, vectorization: TF-IDF / hashing trick

  • 10k training samples, 4k test

  • 10k pages, batch size 500, 100 iterations

  • NB steadily converges; SGD, PA, Perceptron are unstable

  • Naive Bayes, batch sizes: 10, 100 and 500
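A sketch of the out-of-core text pipeline: HashingVectorizer is stateless, so each block can be vectorized independently and fed to partial_fit; the toy corpus and batch split are illustrative:

```python
# Out-of-core text classification: hashing trick + partial_fit.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "great film", "bad plot", "awful acting"] * 50
labels = np.array([1, 1, 0, 0] * 50)

# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

for i in range(0, len(docs), 50):            # 50-document minibatches
    X_i = vec.transform(docs[i:i + 50])      # no fit needed: stateless
    clf.partial_fit(X_i, labels[i:i + 50], classes=[0, 1])

print(clf.score(vec.transform(docs), labels))
```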

### Part II: Data Munging (next time)

  • Loading csv files
  • Parsing, segmenting
  • Word counts, TF-IDF, hashing vectorizer

=> time consuming

### Dask

  • Continuum IO (same people who brought you Conda)
  • Easy install with pip or conda (no JVM, unlike Spark)
  • Blocked algorithms
  • Copies the numpy interface
  • Very few code changes needed
  • [Matthew Rocklin - PyData 2015](https://www.youtube.com/watch?v=ieW3G7ZzRZ0)
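A minimal dask.array sketch of the blocked, numpy-like interface (sizes kept small for illustration):

```python
# dask.array: NumPy-like operations over blocked, larger-than-memory arrays.
import dask.array as da

x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))  # 10x1 grid of blocks
m = x.mean()           # builds a task graph; nothing is computed yet
print(m.compute())     # executes blockwise, chunk by chunk
```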
# Thank you

@alexip

[email protected]

https://alexisperrier.com

##### Links