Machine learning


Multiple types of automated learning


Machine learning


Linear Regression

Legendre (1805): First publication of the method of least squares, which is the basis for linear regression.

Gauss (1795–1809): Independently developed and formalized the method, contributing significantly to its statistical foundation.


Context: tabular data.

- predictor variables
- target variable: the thing you want to predict

Simple model : linear regression

sales = a * radio + b * TV + c * newspaper + noise

You try to find a, b and c so that you can predict sales.
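A minimal sketch of fitting such a model with scikit-learn; the advertising figures below are made up for illustration (they are not the real Advertising dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up advertising spend (radio, TV, newspaper) and observed sales
X = np.array([
    [10, 50, 20],
    [20, 30, 10],
    [30, 80, 40],
    [40, 20,  5],
    [50, 90, 60],
])
y = np.array([22, 18, 35, 20, 42])

model = LinearRegression()
model.fit(X, y)

# The learned a, b, c and the intercept; the leftover error plays the role of the noise
print('coefficients (a, b, c):', model.coef_)
print('intercept:', model.intercept_)
print('predicted sales:', model.predict(X))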

Linear Regression Rocks!

Some hypotheses are needed

Method of least squares

The goal is to minimize the error between what the model predicts and the actual values

In the case with 2 predictors (x_0 and x_1) and 1 target (y), the model is:

y_pred = a* x_0 + b * x_1 + c

you want to minimize the prediction error.

\[\text{prediction error} = \sum_{i=1}^{n} \left( y_i - (a\, x_{0,i} + b\, x_{1,i} + c) \right)^2\]
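The same minimization can also be done directly with NumPy's least-squares solver. A sketch using the two-predictor model above, on made-up data:

import numpy as np

# Made-up data: two predictors x0, x1 and a target y
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x1 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([5.1, 4.9, 9.2, 8.1, 12.0])

# Design matrix with a column of ones for the intercept c
X = np.column_stack([x0, x1, np.ones_like(x0)])

# Find a, b, c that minimize the sum of squared prediction errors
(a, b, c), _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print('a:', a, 'b:', b, 'c:', c)

y_pred = a * x0 + b * x1 + c
print('sum of squared errors:', np.sum((y - y_pred) ** 2))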

Heron’s method

Let’s travel back to around 50 AD in ancient Greece and meet Ἥρων ὁ Ἀλεξανδρεύς, aka Heron of Alexandria.

Heron

He invented the first steam engine (the aeolipile) and the first vending machine!

And also something called Heron’s method to calculate the square root of a number, also known (for unclear reasons) as the Babylonian method.

Simple algorithm to calculate the square root of 2.

Start from some x_0, for instance x_0 = 1

Then repeat

- x_{n+1} = 1/2 (x_n + 2 / x_n)

until x_{n+1}^2 is close enough to 2.

In Python:

x = 1            # initial guess, matching x_0 = 1 above
precision = 0.0001

for i in range(100):
    x = 0.5 * (x + 2 / x)
    error = abs(2 - x**2)
    print(i, 'x : ', round(x,8), 'error : ', round(error, 8))
    if error < precision:
        break

print('\nx : ', round(x,10), 'x^2 : ', round(x**2,10))

Extremely efficient!

It has all the ingredients of the method of least squares: start from a guess, apply a simple update rule, and stop once the error is small enough.

Back to Machine Learning

The context: you have a dataset with features (columns) and a target variable.

For instance: the Titanic dataset, the Iris dataset, the house prices dataset, etc.

See the UCI Machine Learning Repository.

On its page listing the most popular datasets, notice how we can filter by task.

Regression vs Classification vs Clustering

On one side we have supervised learning (classification and regression) and on the other side unsupervised learning (clustering).

Linear regression vs Logistic regression

The output of a linear regression is a real number, without any bounds.

If we squash this number into the range [0, 1], we can interpret it as a probability.

Introducing the logit / sigmoid function
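A quick sketch of the sigmoid: it squashes any real number into (0, 1), which is what lets us read the output as a probability:

import numpy as np

def sigmoid(z):
    # Maps any real number z to a value strictly between 0 and 1
    return 1 / (1 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    print(z, '->', round(sigmoid(z), 4))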

Evaluation Metrics

confusion matrix


see https://en.wikipedia.org/wiki/Confusion_matrix
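A minimal sketch with scikit-learn on made-up binary labels, showing the confusion matrix and two metrics derived from it:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall:', recall_score(y_true, y_pred))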

Iterative Work process

Most important families of models

Auto ML

Gradient descent

Start with a set of arbitrary coefficients (all zeros or all ones, for instance), then improve them step by step.

In a way very similar to Heron’s method.

The goal of gradient descent is to find a local minimum of a function.



Wikipedia:

Gradient descent is generally attributed to Augustin-Louis Cauchy, who first suggested it in 1847. Jacques Hadamard independently proposed a similar method in 1907. Its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.

Each iteration of (full-batch) gradient descent uses all the data.
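A sketch of that full-batch loop, assuming a simple one-predictor model y ≈ a*x + b and made-up data; every update uses the gradient of the squared error computed over all the samples:

import numpy as np

# Made-up data roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

a, b = 0.0, 0.0          # start with arbitrary coefficients
learning_rate = 0.01

for step in range(2000):
    y_pred = a * x + b
    error = y_pred - y
    # Gradient of the sum of squared errors with respect to a and b
    grad_a = 2 * np.sum(error * x)
    grad_b = 2 * np.sum(error)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print('a:', round(a, 3), 'b:', round(b, 3))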

The basic idea behind stochastic approximation (the foundation of stochastic gradient descent) can be traced back to the Robbins–Monro algorithm of the 1950s.

Each iteration of stochastic gradient descent (SGD) uses a tiny subset of the data (a mini-batch) or even just one sample.

All current genAI models are trained with some variant of SGD!
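And a sketch of the stochastic variant on the same kind of made-up data: each update looks at one randomly chosen sample instead of the whole dataset, so the coefficients fluctuate slightly around the least-squares solution:

import numpy as np

rng = np.random.default_rng(0)

# Made-up data roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

a, b = 0.0, 0.0
learning_rate = 0.01

for step in range(5000):
    i = rng.integers(len(x))             # pick one sample at random
    error = (a * x[i] + b) - y[i]
    a -= learning_rate * 2 * error * x[i]
    b -= learning_rate * 2 * error

print('a:', round(a, 3), 'b:', round(b, 3))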


Tree based models

Ensembling / Bagging vs Boosting


Overfitting and regularization

Splitting the data

You only have a set of data. You want to

  1. train the model => it learns the patterns in the data
  2. evaluate the performance of the model with some chosen metric

But you only have that one set of data.

So you split the dataset into train and test subsets.

Cross-validation: in fact, you do that multiple times with different splits and tune the model so that it performs well on average over all the splits.

This way the training data is not too different from the evaluation data (think outliers, missing values, under-representation of a class, etc.).

In the end you want your model to perform well on new unseen data.
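A sketch with scikit-learn on the Iris dataset: one train/test split, plus 5-fold cross-validation to average performance over several different splits:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# One split: train on 80% of the data, evaluate on the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print('test accuracy:', model.score(X_test, y_test))

# Cross-validation: 5 different splits, report the average score
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print('cross-validation accuracy:', scores.mean())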

Overfitting

Overfitting is the enemy of the data scientist.

More complex models are better at learning the training data.

The model gets so good at fitting the training data that it performs great on the training data but poorly on the test data, and therefore on the real-world data it has not seen yet.

Can’t generalize.

Detect overfit by comparing the error on the training data and the error on the test data.

If the error is much lower on the training data than on the test data, the model is overfitting.

Regularization

Techniques to avoid overfitting.

Adding constraints on the model

Randomly drop some features at each iteration.

Tree pruning

This is a decision tree trained on the Iris dataset. It is overfitting.

The same tree but limited to a depth of 2

It no longer overfits, but performance is poorer.

So train many trees (pruned or not), each on a random subset of the data, then average their predictions.

You have random forests.
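A sketch comparing an unconstrained decision tree, the same tree limited to a depth of 2, and a random forest on the Iris dataset; the gap between train and test accuracy is the overfitting signal described above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    ('full tree', DecisionTreeClassifier(random_state=0)),
    ('pruned tree (depth 2)', DecisionTreeClassifier(max_depth=2, random_state=0)),
    ('random forest', RandomForestClassifier(n_estimators=100, random_state=0)),
]

for name, model in models:
    model.fit(X_train, y_train)
    print(name,
          '- train:', round(model.score(X_train, y_train), 3),
          'test:', round(model.score(X_test, y_test), 3))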

Neural Networks

Perceptron

Binary classification

1957, Rosenblatt

The activation function is what makes the model non-linear.


Algorithm Perceptron

Initialize weights (including a bias weight) to 0 or small random values.
Set a learning rate (a small number, e.g., 0.01).

For each training example:
    Calculate the weighted sum:
        sum = (input1 * weight1) + (input2 * weight2) + ... + bias_weight

    Apply a step function to decide:
        If sum >= 0, output = 1
        Else, output = 0

    If the output is wrong:
        For each weight:
            weight = weight + (learning_rate * (true_output - predicted_output) * input)
        bias_weight = bias_weight + (learning_rate * (true_output - predicted_output))

Repeat for multiple passes over the training data until errors are minimized

And in Python:

import numpy as np

# Training data (the AND gate): each row is [input1, input2, true_output]
training_data = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Initialize weights and bias
weights = np.array([0.0, 0.0])
bias = 0.0
learning_rate = 0.1

# Training loop
for _ in range(10):
    for data in training_data:
        inputs, true_output = data[:2], data[2]
        # Calculate weighted sum
        weighted_sum = np.dot(inputs, weights) + bias
        # Step function (1 if >= 0, else 0)
        predicted_output = 1 if weighted_sum >= 0 else 0
        # Update weights and bias
        error = true_output - predicted_output
        weights += learning_rate * error * inputs
        bias += learning_rate * error

# Test the perceptron
for data in training_data:
    inputs, _ = data[:2], data[2]
    weighted_sum = np.dot(inputs, weights) + bias
    print(f"Input: {inputs} -> Output: {1 if weighted_sum >= 0 else 0}")

Multi layer perceptron

Single-layer perceptrons are only capable of learning linearly separable patterns

Can learn AND but not XOR

a feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron). (wikipedia)

see Notebook: XOR vs AND
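A small sketch of why an extra layer helps: with hand-picked weights, a first layer of perceptron-style units computes OR and NAND, and a second layer ANDs them together, which is exactly XOR; no single-layer perceptron can draw that boundary:

import numpy as np

def step(z):
    # Perceptron-style activation: 1 if z >= 0, else 0
    return (z >= 0).astype(int)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    # Hidden layer: first unit computes OR, second unit computes NAND
    hidden = step(np.array([
        x @ np.array([1, 1]) - 0.5,      # OR: fires if at least one input is 1
        x @ np.array([-1, -1]) + 1.5,    # NAND: fires unless both inputs are 1
    ]))
    # Output layer: AND of the two hidden units gives XOR
    return step(hidden @ np.array([1, 1]) - 1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, '->', xor_mlp(x1, x2))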

Backpropagation

1980s: Development of backpropagation

Regularisation for Neural Networks

On top of L2/L1 regularisation, dropout consists of randomly setting some neurons’ activations to zero during training.
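A tiny NumPy sketch of the idea: during training, each activation is kept with probability p and zeroed otherwise; scaling the survivors by 1/p (the usual "inverted dropout" convention) keeps the expected values unchanged:

import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, p_keep=0.8):
    # Randomly zero out activations; rescale the survivors by 1 / p_keep
    mask = rng.random(activations.shape) < p_keep
    return activations * mask / p_keep

activations = np.array([0.5, 1.2, -0.3, 0.8, 2.0, -1.1])
print(dropout(activations))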

visualize

playground.tensorflow.org

More complex networks

CNN : convolutional neural networks

Replace the plain linear combination of inputs with convolutions.

Excellent for images and other grid-like data.

Each convolution layer extracts features from the previous layer. It zooms out and abstracts the patterns.
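A rough sketch of what a single convolution does: slide a small filter over the image and record how strongly each patch matches it. Here a hand-made vertical-edge filter lights up exactly at the dark-to-bright boundary of a tiny made-up image:

import numpy as np

# Tiny made-up "image": dark on the left, bright on the right
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
], dtype=float)

# 3x3 filter that responds to vertical edges
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Naive convolution: no padding, stride 1
h, w = image.shape
kh, kw = kernel.shape
output = np.zeros((h - kh + 1, w - kw + 1))
for i in range(output.shape[0]):
    for j in range(output.shape[1]):
        output[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)

print(output)   # large values mark where the edge is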

Typical CNN

RNNs : Recurrent Neural Networks

Great for time series, NLP and any type of sequential data

Example of a more complex NN : Inception

https://en.wikipedia.org/wiki/Inception_(deep_learning_architecture)

Inception is a family of convolutional neural networks (CNNs) for computer vision, introduced by researchers at Google in 2014 as GoogLeNet.

Transformers

Architecture of GPT-2
