Gradient stochastique

gradient stochastique

Gradient stochastique

contenu

methode du gradient pour trouver le minimum d’une fonction derivable
demo sur fonction polynimial dont on connait la derivée
STOCHASTIC GRADIENT: general idea is to aproximate an unknown function through iterations of unbiased estimates of the function’s gradient. Knowing that the expectation of the gradient estimates equal the gradient.
“stochastic” comes from the fact that the gradient based on a single training sample is a “stochastic approximation” of the “true” cost gradient
comment estimer le gradient / la derivee
FONCTION DE COUT , FORME GÉNÉRALE: E(w, b) = L( , f ( )) +αR(w)
example of loss function : Quadratic error : need to normalize
breve histoire du SGD
implementation generalisée et tres optimisée
online learning
les parametres du SGD de scikit learn
influence du learning rate
strategies du learning rate : adaptatif, etc
MINI-BATCH (aka out-of-core) vs batch vs stochastic GRADIENT DESCENT
regularisation : L1, L2
comment regulariser le SGD
realy stopping
scikit SGDRegressor
overview of optimizers

resources

colab: https://colab.research.google.com/drive/1EDjgLYCtcudxcSfhSNbBjopig1GGeHHk?usp=sharing
https://www.ruder.io/optimizing-gradient-descent/
https://scikit-learn.org/stable/modules/sgd.html#tips-on-practical-use
http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/

Descente de Gradient Stochastique

1. Méthode du Gradient

Objectif : Trouver le minimum d’une fonction dérivable $f(x)$

Principe : Itération selon la direction opposée au gradient

\[x_{n+1} = x_n - \eta \nabla f(x_n)\]

où $\eta$ est le taux d’apprentissage (learning rate)

2. Démonstration : Fonction Polynomiale

Exemple : Minimiser $f(x) = x^2 - 4x + 5$

Dérivée : $f’(x) = 2x - 4$

Algorithme :

Initialisation : $x_0 = 0$
Itération : $x_{n+1} = x_n - \eta \cdot (2x_n - 4)$
Avec $\eta = 0.1$ : convergence vers $x^* = 2$ (minimum)

Résultat : $f(2) = 1$ est le minimum global

3. Gradient Stochastique : Idée Générale

Problème : Fonction inconnue ou coûteuse à calculer

Solution : Approximer le gradient par des estimations non biaisées

Principe clé : $\mathbb{E}[\nabla_{\text{estimé}}] = \nabla_{\text{vrai}}$

L’espérance des estimations du gradient égale le gradient réel

4. Pourquoi “Stochastique” ?

Le gradient calculé sur un seul échantillon d’entraînement est une approximation stochastique du gradient du coût total

Gradient complet (batch) : $\nabla E = \frac{1}{n}\sum_{i=1}^n \nabla L_i$

Gradient stochastique (un échantillon) : $\nabla \tilde{E} = \nabla L_k \quad \text{(échantillon aléatoire } k\text{)}$

5. Estimation du Gradient

Méthodes d’estimation :

Différences finies : $\frac{\partial f}{\partial x} \approx \frac{f(x + h) - f(x)}{h}$
Sous-échantillonnage : Utiliser un sous-ensemble aléatoire des données
Gradient exact sur échantillon : Calculer le gradient analytique sur un point

6. Fonction de Coût : Forme Générale

\[E(w, b) = L(y, f(x; w, b)) + \alpha R(w)\]

Composantes :

$L$ : Fonction de perte (loss)
$f$ : Modèle avec paramètres $w$ (poids) et $b$ (biais)
$R(w)$ : Terme de régularisation
$\alpha$ : Coefficient de régularisation

7. Exemple : Erreur Quadratique

Fonction de perte : $L = \frac{1}{2}(y - \hat{y})^2$

Gradient : $\frac{\partial L}{\partial w} = -(y - \hat{y}) \cdot \frac{\partial \hat{y}}{\partial w}$

⚠️ Important : Normaliser les features pour améliorer la convergence

\[x_{\text{norm}} = \frac{x - \mu}{\sigma}\]

8. Brève Histoire du SGD

1951 : Robbins & Monro - Fondements théoriques
1960s : Widrow & Hoff - Adaline (apprentissage adaptatif)
1980s : Réseaux de neurones + backpropagation
2010s : Deep Learning - SGD devient incontournable
Aujourd’hui : Variants optimisés (Adam, RMSprop, etc.)

9. Implémentation Moderne

Frameworks optimisés :

TensorFlow / PyTorch : calcul GPU
Scikit-learn : interface simple
JAX : différentiation automatique

Optimisations :

Vectorisation (NumPy, BLAS)
Parallélisation sur GPU/TPU
Calcul distribué

10. Apprentissage en Ligne (Online Learning)

Principe : Mise à jour après chaque observation

Avantages :

Adaptation aux données changeantes
Mémoire constante
Traitement de flux continu

Application :

Systèmes de recommandation
Détection d’anomalies
Trading algorithmique

11. SGDRegressor : Paramètres Clés

from sklearn.linear_model import SGDRegressor

model = SGDRegressor(
    loss='squared_error',      # Fonction de perte
    penalty='l2',              # Régularisation
    alpha=0.0001,              # Force régularisation
    learning_rate='invscaling', # Stratégie LR
    eta0=0.01,                 # LR initial
    max_iter=1000,             # Nb itérations
    tol=1e-3,                  # Critère arrêt
    early_stopping=True        # Arrêt anticipé
)

12. Influence du Taux d’Apprentissage

\[w_{t+1} = w_t - \eta \nabla L(w_t)\]

$\eta$ trop grand :

Oscillations
Divergence

$\eta$ trop petit :

Convergence lente
Minima locaux

$\eta$ optimal : Convergence rapide et stable

13. Stratégies de Learning Rate

1. Constant : $\eta_t = \eta_0$

2. Décroissance temporelle : $\eta_t = \frac{\eta_0}{1 + t \cdot d}$

3. Inverse scaling : $\eta_t = \frac{\eta_0}{t^\alpha}$

4. Adaptatif : Ajustement selon performance validation

14. Learning Rate Adaptatif

Méthodes avancées :

AdaGrad : Adapte $\eta$ par paramètre $\eta_t^{(i)} = \frac{\eta_0}{\sqrt{\sum_{\tau=1}^t (g_\tau^{(i)})^2 + \epsilon}}$

RMSprop : Moyenne mobile des gradients

Adam : Combine momentum + adaptation

Moyenne du gradient (moment d’ordre 1)
Moyenne du gradient carré (moment d’ordre 2)

15. Batch vs Mini-Batch vs Stochastique

Type	Taille échantillon	Caractéristiques
Batch GD	Tous ($n$)	Stable, lent, grande mémoire
Stochastic GD	1	Rapide, bruité, converge
Mini-Batch GD	$m$ (ex: 32-256)	Optimal : vectorisation + stabilité

Formule mini-batch : $w_{t+1} = w_t - \eta \frac{1}{m}\sum_{i=1}^m \nabla L_i(w_t)$

16. Mini-Batch : Out-of-Core Learning

Problème : Dataset trop large pour la RAM

Solution : Charger les données par batch

for epoch in range(n_epochs):
    for batch in data_loader:
        gradient = compute_gradient(batch)
        weights -= learning_rate * gradient

Applications : Big Data, streaming

17. Régularisation L1 (Lasso)

\[R(w) = \|w\|_1 = \sum_{i=1}^d |w_i|\]

Effet : Sélection de variables (sparsité)

Gradient sous-différentiel : $\frac{\partial R}{\partial w_i} = \text{sign}(w_i)$

Usage : Features redondantes, interprétabilité

18. Régularisation L2 (Ridge)

\[R(w) = \|w\|_2^2 = \sum_{i=1}^d w_i^2\]

Effet : Réduction de tous les poids

Gradient : $\frac{\partial R}{\partial w_i} = 2w_i$

Mise à jour : $w_{t+1} = (1 - 2\eta\alpha)w_t - \eta \nabla L(w_t)$

19. Régularisation Elastic Net

Combinaison L1 + L2 : $R(w) = \rho \|w\|_1 + \frac{1-\rho}{2}\|w\|_2^2$

Avantages :

Sélection de variables (L1)
Stabilité (L2)
$\rho \in [0,1]$ contrôle le mix

SGDRegressor(penalty='elasticnet',
             l1_ratio=0.5)

20. Comment Régulariser le SGD

Paramètres Scikit-learn :

SGDRegressor(
    penalty='l2',        # 'l1', 'l2', 'elasticnet'
    alpha=0.0001,        # Force régularisation
    l1_ratio=0.15        # Si elasticnet
)

Impact :

Prévient le sur-apprentissage
Améliore la généralisation
Stabilise l’entraînement

21. Early Stopping (Arrêt Anticipé)

Principe : Arrêter l’entraînement avant la fin si pas d’amélioration

Algorithme :

Diviser : train / validation
À chaque époque : évaluer sur validation
Si pas d’amélioration pendant $p$ époques : STOP

Implémentation :

SGDRegressor(
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5
)

22. SGDRegressor : Exemple Complet

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Normalisation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Division
X_train, X_val, y_train, y_val = train_test_split(
    X_scaled, y, test_size=0.2
)

# Modèle
sgd = SGDRegressor(
    loss='squared_error',
    penalty='l2',
    alpha=0.001,
    learning_rate='adaptive',
    eta0=0.01,
    early_stopping=True
)

sgd.fit(X_train, y_train)

23. Vue d’Ensemble des Optimiseurs

SGD classique : $w_t = w_{t-1} - \eta \nabla L$

SGD + Momentum : $v_t = \beta v_{t-1} + \nabla L$ $w_t = w_{t-1} - \eta v_t$

Nesterov : Momentum anticipé

AdaGrad : LR adaptatif par paramètre

RMSprop : Moyenne mobile du gradient carré

24. Optimiseurs Avancés

Adam (Adaptive Moment Estimation) :

Combine momentum + adaptation
Très populaire en Deep Learning

AdamW : Adam avec weight decay découplé

Nadam : Nesterov + Adam

RAdam : Warmup automatique

Choisir : Adam par défaut, SGD+momentum si besoin de généralisation

25. Comparaison des Optimiseurs

Optimiseur	Vitesse	Stabilité	Généralisation	Usage
SGD	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	Classique
SGD+Momentum	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	Recommandé
Adam	⭐⭐⭐⭐	⭐⭐	⭐⭐⭐	Deep Learning
RMSprop	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	RNN

26. Momentum : Intuition

Problème SGD : Oscillations dans les ravins

Solution : Accumuler la “vitesse”

$v_t = \beta v_{t-1} + (1-\beta) \nabla L_t$ $w_t = w_{t-1} - \eta v_t$

Effet :

Accélération dans directions constantes
Amortissement des oscillations
$\beta \in [0.9, 0.99]$ typiquement

27. Diagnostic de Convergence

Courbes d’apprentissage :

Loss training vs validation
Détection sur/sous-apprentissage

Signes de problèmes :

Loss qui explose → $\eta$ trop grand
Plateau précoce → $\eta$ trop petit ou minimum local
Gap train/val → sur-apprentissage

28. Hyperparamètres : Réglage

Stratégies :

Grid Search : Test exhaustif
Random Search : Échantillonnage aléatoire
Bayesian Optimization : Recherche intelligente

Paramètres critiques :

Learning rate $\eta$
Coefficient régularisation $\alpha$
Taille mini-batch $m$

29. Astuces Pratiques

Toujours normaliser : StandardScaler ou MinMaxScaler

Mélanger les données : shuffle=True

Monitoring : Verbose pour suivre la convergence

Warmup : Commencer avec petit LR

Gradient clipping : Éviter explosions (Deep Learning)

Checkpointing : Sauvegarder meilleur modèle

30. Limitations du SGD

Choix du Learning Rate : Sensible, difficile à optimiser

Minima locaux : Peut rester coincé (fonctions non-convexes)

Saddle points : Ralentit près des points-selle

Scaling : Sensible à l’échelle des features

Solution : Optimiseurs adaptatifs (Adam, etc.)

31. Quand Utiliser SGD ?

SGD convient si :

Grands datasets (millions d’exemples)
Apprentissage en ligne nécessaire
Mémoire limitée (out-of-core)
Modèles linéaires ou convexes

Alternatives :

Petits datasets → Batch methods (BFGS, L-BFGS)
Deep Learning → Adam, AdamW
Arbres → Gradient Boosting

32. Extensions du SGD

Variance Reduction :

SVRG (Stochastic Variance Reduced Gradient)
SAGA
Réduisent le bruit du gradient stochastique

Second-ordre :

Newton stochastique
Utilise information de courbure (Hessienne)

Distribué :

Parallélisation sur plusieurs machines
Hogwild!, AllReduce

33. SGD en Deep Learning

Particularités :

Milliers/millions de paramètres
Fonctions fortement non-convexes
Backpropagation pour calcul gradient

Best Practices :

Mini-batch 32-256
Learning rate schedule
Batch normalization
Dropout comme régularisation supplémentaire

34. Références et Ressources

Papers fondateurs :

Robbins & Monro (1951) : Méthode stochastique
Bottou (2010) : Large-Scale Machine Learning

Documentation :

Scikit-learn SGD User Guide
PyTorch Optimizers
Deep Learning Book (Goodfellow et al.)

Cours :

Stanford CS229 (Machine Learning)
Fast.ai (Deep Learning)

35. Conclusion

SGD = Outil fondamental du Machine Learning

Points clés : ✓ Approximation stochastique du gradient ✓ Efficace sur grandes données ✓ Nombreux variants optimisés ✓ Toujours normaliser et régulariser ✓ Choisir l’optimiseur selon le problème

Recommandation : Commencer avec SGD+momentum ou Adam