March 16, 2016 · 7 min read

Large Data with Scikit-learn - Boston Meetup


By Alexis Perrier


Plan

  1. What is large data? (out-of-core, streaming, online, batch?)

  2. Algorithms

  3. Implementation

  4. Examples

Large data cannot fit in main memory (RAM).

At 2-10 GB the computer starts swapping too much. Slow!

From in-memory to on-disk

  • Being able to work with data that fits on disk, on a single machine, with no cluster

### Many great alternatives

### Terminology

  • Out-of-core / external memory

Data does not fit in main memory => access data from a slower data store (disk, ...), typically 10x-100x slower

  • Offline: all the data is available from the start

  • Online: serialized; same results as offline; acts on new data right away

  • Streaming: serialized; a limited number of passes over the data; processing can be postponed; "old" data is of less importance

  • Minibatch: serialized in blocks

  • Incremental = online + minibatch

https://stats.stackexchange.com/questions/22439/incremental-or-online-or-single-pass-or-data-stream-clustering-refers-to-the-sam

clf = Some Model or Transformation

  • Train on training data X_train, y_train

clf.fit(X_train,y_train)

  • Predict on test data: X_test => y_test

y_test = clf.predict(X_test)

  • Assess model performance on test data: y_truth vs y_test

clf.score(X_test, y_truth)  # or a metric such as accuracy_score(y_truth, y_test)

  • Predict on new data: \( \hat{y} = clf.predict(X_{new}) \)
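The standard workflow above can be sketched end-to-end; the model choice (SGDClassifier) and synthetic data are illustrative:

```python
# Minimal sketch of the fit / predict / score workflow.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SGDClassifier(random_state=0)
clf.fit(X_train, y_train)              # train on training data
y_pred = clf.predict(X_test)           # predict on test data
acc = accuracy_score(y_test, y_pred)   # assess: y_truth vs predictions
print(acc)
```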
### Scikit-learn: out-of-core
  • Split the training data in blocks: mini-batch (sequential, random, ...)

  • Load each block in sequence

  • Train the model adaptively on each block

  • Convergence!

### Scikit-learn minibatch learning
  • Batch size n
  • split the data in N blocks
  • .partial_fit instead of .fit

clf.partial_fit(X_train_i, y_i, classes=all_classes)

  • All possible classes must be given to partial_fit on the first call (via the classes argument)
  • i = 1..N number of blocks
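A minimal sketch of this minibatch loop; SGDClassifier and numpy's array_split (for the block splitting) are illustrative choices:

```python
# Minibatch training with partial_fit instead of fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, random_state=0)
all_classes = np.unique(y)           # classes must be given on the first call
clf = SGDClassifier(random_state=0)

N = 10                               # number of blocks
for X_i, y_i in zip(np.array_split(X, N), np.array_split(y, N)):
    clf.partial_fit(X_i, y_i, classes=all_classes)

print(clf.score(X, y))
```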
### Scikit-learn: implementation with generators

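A hedged sketch of the generator idea: stream (X, y) blocks from a CSV on disk without loading it all; the column layout (features first, label last) is an assumption:

```python
# Generator yielding minibatches from a CSV file on disk.
import csv
import numpy as np

def iter_minibatches(path, batch_size=500):
    """Yield (X, y) blocks; the last column is assumed to be the label."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        block = []
        for row in reader:
            block.append([float(v) for v in row])
            if len(block) == batch_size:
                arr = np.asarray(block)
                yield arr[:, :-1], arr[:, -1]
                block = []
        if block:                        # final, possibly smaller block
            arr = np.asarray(block)
            yield arr[:, :-1], arr[:, -1]
```

Each yielded block can then be passed straight to `partial_fit`.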

II Algorithms

|                         | Regression | Classification |
|-------------------------|------------|----------------|
| Stochastic Gradient (1) | X          | X              |
| Naive Bayes (2)         |            | X              |

(1) SG + PassiveAggressive + Perceptron (classifier only)

(2) NB: MultinomialNB, BernoulliNB

|                         | Clustering | Decomposition |
|-------------------------|------------|---------------|
| MiniBatchKMeans         | X          |               |
| DictionaryLearning, PCA |            | X             |

Stochastic Gradient - Wiener filtering

Gradient descent iteratively solves the least-squares problem whose closed-form (Wiener) solution is

$$ w^* = (X^T X)^{-1} X^T Y $$
with the update
$$ w_{n+1} = w_n - \mu \sum_i (w_n^{T} x_i - y_i) x_i $$

Deterministic!


SGD / LMS:

Approximate the gradient one sample \( (x_i, y_i) \) at a time:

$$ w_{n+1} = w_n - \mu (w_n^{T} x_i - y_i) x_i $$

  • Slow convergence
  • Poor approximation of Gradient => oscillations

Mini-batch SGD / Block-LMS

N samples at a time to estimate the gradient:

$$ w_{n+1} = w_n - \frac{\mu}{N} \sum_{i=1}^{N} (w_n^{T} x_i - y_i) x_i $$

Less gradient noise and better convergence than one-sample SGD, and the block update is still easy to optimize (vectorized)
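The block update can be sketched in NumPy on a toy noise-free linear problem; the step size and batch size are illustrative:

```python
# Mini-batch SGD / block-LMS for a linear model, in plain NumPy.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true                        # noise-free targets

w = np.zeros(2)
mu, N = 0.5, 50                       # learning rate, batch size
for X_b, y_b in zip(np.split(X, len(X) // N), np.split(y, len(y) // N)):
    err = X_b @ w - y_b               # (w^T x_i - y_i) for the block
    w -= (mu / N) * X_b.T @ err       # averaged gradient step

print(w)                              # converges toward w_true
```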

#### Other loss functions
  • GD: loss = MSE
  • Passive Aggressive: Hinge loss
  • Perceptron: loss = +/- 1
###### Mini Batch K-Means

D. Sculley @Google, 2010, Web Scale K-Means Clustering

  • Each mini-batch
    • Each sample \( x \) in the batch
      • Find the closest center \( C_{x} \) (min distance)
    • Each sample \( x \) in the batch
      • Update \( C_{x} \) with
$$ C_x = \left(1 - \frac{\mu}{|C_x|}\right) C_x + \frac{\mu}{|C_x|} x $$

Same convergence level
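A quick sketch comparing the two estimators in scikit-learn (dataset and sizes are illustrative); the inertia values, i.e. the k-means objective, come out comparable:

```python
# MiniBatchKMeans vs full KMeans on synthetic blobs.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans

X, _ = make_blobs(n_samples=5000, centers=5, random_state=0)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=5, batch_size=500, n_init=10,
                      random_state=0).fit(X)

print(km.inertia_, mbk.inertia_)   # similar objective values
```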

  • Classification: Mini-batches need to be balanced for classes

  • Perceptron, PA and SGD put different emphasis on samples over time.

  • How does batch size influence the results?

### Batch size

The bigger the better? Until you run out of memory?
  • Trade-off: time efficiency of training (small batches) vs noisiness of the gradient estimate (lower with large batches)

Adding some noise (small batches) can help escape local minima

In practice

  • small to moderate mini-batch sizes (10-500)
  • decaying learning rate

https://www.quora.com/Intuitively-how-does-mini-batch-size-affect-the-performance-of-gradient-descent
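In scikit-learn, a decaying learning rate is available out of the box, e.g. SGDRegressor's 'invscaling' schedule (eta = eta0 / t^power_t); the values below are illustrative:

```python
# SGD with a decaying learning rate ('invscaling' schedule).
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

clf = SGDRegressor(learning_rate="invscaling", eta0=0.01, power_t=0.25,
                   random_state=0)
clf.fit(X, y)
print(clf.score(X, y))   # R^2 on the training data
```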

III Examples

Mini-Batch K-Means vs K-Means:

  • Not the same partition as K-Means, but much faster

  • Same convergence level

Text classification, batch sizes 100, 500, 2000:

  • Accuracy results are similar

  • Training time decreases for the models

  • Stays basically the same for vectorization

  • Feature extraction, vectorization: TF-IDF / hashing trick

  • 10k training samples, 4k test

  • 10k pages, batch size 500, 100 iterations

  • NB steadily converges; SGD, PA, Perceptron are unstable

  • Naive Bayes, batch sizes: 10, 100 and 500
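A sketch of the out-of-core text pipeline: HashingVectorizer is stateless, so each block can be vectorized independently and fed to partial_fit; the toy corpus and batch split are illustrative:

```python
# Out-of-core text classification: hashing trick + partial_fit.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "great film", "bad plot", "awful acting"] * 50
labels = np.array([1, 1, 0, 0] * 50)

# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = MultinomialNB()

for i in range(0, len(docs), 50):            # 50-document minibatches
    X_i = vec.transform(docs[i:i + 50])      # no fit needed: stateless
    clf.partial_fit(X_i, labels[i:i + 50], classes=[0, 1])

print(clf.score(vec.transform(docs), labels))
```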

### Part II: Data Munging (next time)

  • Loading csv files
  • Parsing, segmenting
  • Word counts, TF-IDF, hashing vectorizer

=> time consuming

### Dask

  • Continuum IO (same people who brought you Conda)
  • Easy install with pip or conda (no JVM, unlike Spark)
  • Blocked algorithms
  • Copies the numpy interface
  • Very few code changes needed
  • [Matthew Rocklin - PyData 2015](https://www.youtube.com/watch?v=ieW3G7ZzRZ0)
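A minimal dask.array sketch of the blocked, numpy-like interface (sizes kept small for illustration):

```python
# dask.array: NumPy-like operations over blocked, larger-than-memory arrays.
import dask.array as da

x = da.random.random((10_000, 1_000), chunks=(1_000, 1_000))  # 10x1 grid of blocks
m = x.mean()           # builds a task graph; nothing is computed yet
print(m.compute())     # executes blockwise, chunk by chunk
```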
# Thank you

@alexip

[email protected]

https://alexisperrier.com

##### Links