
Leo Breiman, Statistical Modeling: The Two Cultures
Here is the abstract:
_There are two cultures in the use of statistical modeling to reach conclusions from data._
One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.
Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.
If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
correlation is not causation => spurious correlation website
statistical significance - p-values

Legendre (1805): First publication of the method of least squares, which is the basis for linear regression.
Gauss (1795–1809): Independently developed and formalized the method, contributing significantly to its statistical foundation.
Context: tabular data.
- predictor variables
- target variable: the thing you want to predict
Simple model: linear regression
sales = a * radio + b * TV + c * newspaper + noise
you try to find a, b, c so that you can predict sales
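Here is a minimal sketch of such a fit, using numpy's least-squares solver on synthetic data (the spend values, coefficients and noise level below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising data: columns are radio, TV and newspaper spend
X = rng.uniform(0, 100, size=(n, 3))

# "True" coefficients and intercept, chosen arbitrarily, plus some noise
sales = X @ np.array([0.5, 0.3, 0.1]) + 2.0 + rng.normal(0, 1, size=n)

# Least squares: append a column of ones so the intercept is fitted too
X1 = np.column_stack([X, np.ones(n)])
a, b, c, intercept = np.linalg.lstsq(X1, sales, rcond=None)[0]

print('a (radio):', round(a, 3), 'b (TV):', round(b, 3),
      'c (newspaper):', round(c, 3), 'intercept:', round(intercept, 3))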


The goal is to minimize the error between what the model predicts and the actual values
In the case with 2 predictors (x_0 and x_1) and 1 target (y), the model is:
y_pred = a* x_0 + b * x_1 + c
you want to minimize the prediction error.
\[\text{prediction error} = \sqrt{ \sum_{i=1}^{n} \bigl(y_i - (a\, x_{0,i} + b\, x_{1,i} + c)\bigr)^2 }\]

Let’s travel back to 50 AD in ancient Greece and meet Ἥρων ὁ Ἀλεξανδρεύς, aka Heron of Alexandria.

He invented the first steam engine (Aeolipile) and first vending machine!
And also something called Heron’s method to calculate the square root of a number, also known (for unknown reasons) as the Babylonian method.
Simple algorithm to calculate the square root of 2.
Start from x_0 for instance x_0 = 1
Then repeat
- x_{n+1} = 1/2 (x_n + 2 / x_n)
until x_{n+1}^2 is close enough to 2.
in python:
# Heron's method for sqrt(2): start from x_0 = 1 and iterate
x = 1
precision = 0.0001
for i in range(100):
    x = 0.5 * (x + 2 / x)
    error = abs(2 - x**2)
    print(i, 'x : ', round(x, 8), 'error : ', round(error, 8))
    if error < precision:
        break

print('\nx : ', round(x, 10), 'x^2 : ', round(x**2, 10))
Extremely efficient!
It has all the ingredients of the method of least squares: start from a guess, iterate, and stop when the error is small enough.
The context: you have a dataset with features (columns) and a target variable.
For instance: the Titanic dataset, the Iris dataset, the house prices dataset, etc.
On this page of the most popular datasets, notice how we can filter by task.
On one side we have supervised learning (classification and regression) and on the other side unsupervised learning (clustering).


The output of a linear regression is a real number (a decimal number) without any limit.
If we force this number into the range [0, 1], then we can interpret it as a probability.
Introducing the sigmoid function (the inverse of the logit).
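A minimal sketch (the values below are arbitrary): the sigmoid squashes the raw output of a linear model into (0, 1).

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Raw, unbounded outputs of a linear model (arbitrary example values)
raw_outputs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(raw_outputs))  # values between 0 and 1, interpretable as probabilities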


Start with a set of arbitrary coefficients (all zeros or ones, for instance), then improve them step by step, in a way very similar to Heron’s method.
The goal of gradient descent is to find a local minimum of a function.
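A minimal sketch, on a toy problem of my own choosing that echoes the square-root exercise above: gradient descent on f(x) = (x^2 - 2)^2, whose minimum is at x = sqrt(2).

# f(x) = (x^2 - 2)^2 is minimized at x = sqrt(2); its derivative is 4x(x^2 - 2)
def gradient(x):
    return 4 * x * (x**2 - 2)

x = 1.0               # arbitrary starting point
learning_rate = 0.05  # step size
for i in range(100):
    x = x - learning_rate * gradient(x)   # move a small step downhill

print('x :', round(x, 8), 'x^2 :', round(x**2, 8))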


Wikipedia:
Gradient descent is generally attributed to Augustin-Louis Cauchy, who first suggested it in 1847. Jacques Hadamard independently proposed a similar method in 1907. Its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.
In (batch) gradient descent, each iteration uses all the data.
The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s.
In stochastic gradient descent (SGD), each iteration uses only a tiny subset of the data (a mini-batch) or even just one sample.
All current genAI models are trained with some variant of SGD!
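A minimal sketch of mini-batch SGD on a one-feature linear regression (the data is synthetic, and the batch size, learning rate and epoch count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus a little noise
X = rng.uniform(-1, 1, size=1000)
y = 3 * X + 1 + rng.normal(0, 0.1, size=1000)

a, b = 0.0, 0.0        # slope and intercept, both start at zero
learning_rate = 0.1
batch_size = 10

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # each update sees only a tiny subset
        error = (a * X[idx] + b) - y[idx]
        a -= learning_rate * 2 * np.mean(error * X[idx])  # gradient of the mean squared error w.r.t. a
        b -= learning_rate * 2 * np.mean(error)           # gradient w.r.t. b

print('a :', round(a, 3), 'b :', round(b, 3))    # should end up close to 3 and 1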



You want to know how the model will perform on new data, but you only have a set of data.
So you split the dataset into train and test subsets.
Cross-validation: in fact, you do that multiple times with different splits and tune the model so that it performs well on average over all the splits.

This way the training data is not very different from the evaluation data (think outliers, missing values, under-representation of a class, etc.).
In the end you want your model to perform well on new unseen data.
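A minimal sketch with scikit-learn (the Iris dataset and logistic regression are arbitrary choices here; any dataset/model pair would do):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single split: train on one part, evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('test accuracy :', round(model.score(X_test, y_test), 3))

# Cross-validation: repeat over 5 different splits and look at the average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print('cross-validation accuracy :', round(scores.mean(), 3))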
Overfitting is the enemy of the data scientist.
More complex models are better at learning the training data.

The model is so good at learning the training data that it will perform great on the training data but poorly on the test data, and therefore on the real-world data that it has not seen yet.
Can’t generalize.
Detect overfitting by comparing the error on the training data and the error on the test data:
if the error is much lower on the training data than on the test data, the model is overfitting.

Techniques to avoid overfitting:
- adding constraints on the model (regularisation)
- randomly dropping some features at each iteration
This is a decision tree trained on the Iris dataset. It is overfitting.

The same tree but limited to a depth of 2

No longer overfits, but poorer performance.
So train many trees (pruned or not), each on a random subset of the data, then average their predictions:
you have random forests.
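A minimal sketch of these ideas with scikit-learn on the Iris dataset (the depth, number of trees and split size are arbitrary); compare the train and test accuracy of each model:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    ('unpruned tree', DecisionTreeClassifier(random_state=0)),
    ('tree limited to depth 2', DecisionTreeClassifier(max_depth=2, random_state=0)),
    ('random forest (100 trees)', RandomForestClassifier(n_estimators=100, random_state=0)),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name,
          '| train accuracy:', round(model.score(X_train, y_train), 3),
          '| test accuracy:', round(model.score(X_test, y_test), 3))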
The activation function is what makes the model non-linear.

Initialize weights (including a bias weight) to 0 or small random values.
Set a learning rate (a small number, e.g., 0.01).
For each training example:
    Calculate the weighted sum:
        sum = (input1 * weight1) + (input2 * weight2) + ... + bias_weight
    Apply a step function to decide:
        If sum >= 0, output = 1
        Else, output = 0
    If the output is wrong:
        For each weight:
            weight = weight + (learning_rate * (true_output - predicted_output) * input)
        bias_weight = bias_weight + (learning_rate * (true_output - predicted_output))
Repeat for multiple passes over the training data until errors are minimized.
and in python
import numpy as np

# Training data: each row is [input1, input2, true_output] (the AND function)
training_data = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Initialize weights and bias
weights = np.array([0.0, 0.0])
bias = 0.0
learning_rate = 0.1

# Training loop
for _ in range(10):
    for data in training_data:
        inputs, true_output = data[:2], data[2]
        # Calculate weighted sum
        weighted_sum = np.dot(inputs, weights) + bias
        # Step function (1 if >= 0, else 0)
        predicted_output = 1 if weighted_sum >= 0 else 0
        # Update weights and bias
        error = true_output - predicted_output
        weights += learning_rate * error * inputs
        bias += learning_rate * error

# Test the perceptron
for data in training_data:
    inputs, _ = data[:2], data[2]
    weighted_sum = np.dot(inputs, weights) + bias
    print(f"Input: {inputs} -> Output: {1 if weighted_sum >= 0 else 0}")
Single-layer perceptrons are only capable of learning linearly separable patterns
Can learn AND but not XOR

A feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron). (Wikipedia)
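A minimal hand-built illustration (the weights below are picked by hand, not trained): with one hidden layer and the same step activation as above, XOR becomes easy to express.

def step(z):
    # Same step activation as the perceptron above
    return 1 if z >= 0 else 0

# Hidden unit 1 computes OR, hidden unit 2 computes AND,
# and the output fires when OR is true but AND is not: that is XOR.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)       # OR
    h2 = step(x1 + x2 - 1.5)       # AND
    output = step(h1 - h2 - 0.5)   # OR and not AND = XOR
    print(f"Input: ({x1}, {x2}) -> Output: {output}")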

see Notebook: XOR vs AND
1980s: Development of backpropagation
On top of L2/L1 regularisation, dropout consists in randomly setting some activations (units) to zero during training.
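A minimal sketch of the idea in numpy (the activation values and the drop probability of 0.5 are made up; this is the "inverted dropout" formulation):

import numpy as np

rng = np.random.default_rng(0)

activations = np.array([0.2, 1.5, 0.7, 2.1, 0.9])   # made-up outputs of one layer
p_drop = 0.5

# Training time: zero out each activation with probability p_drop,
# and rescale the survivors so the expected value stays the same
mask = rng.random(activations.shape) >= p_drop
print(activations * mask / (1 - p_drop))

# Test time: no dropout, the activations are used as-is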


Replace the linear regression with convolutions.
Excellent for images and computer vision tasks.

Each convolution layer extracts features from the previous layer. It zooms out and abstracts the patterns

Typical CNN
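As a rough sketch of what such a network looks like in code (using PyTorch, which is not used elsewhere in these notes; the channel counts, 28x28 grayscale input and 10 output classes are arbitrary):

import torch
import torch.nn as nn

# Two convolution + pooling stages that extract and abstract features,
# followed by a linear classifier on the flattened feature maps
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),    # low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),   # more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                    # classifier over 10 classes
)

x = torch.randn(1, 1, 28, 28)   # one fake grayscale image
print(model(x).shape)           # torch.Size([1, 10])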

Great for time series, NLP and any type of sequential data


https://en.wikipedia.org/wiki/Inception_(deep_learning_architecture)
Inception is a family of convolutional neural networks (CNN) for computer vision, introduced by researchers at Google in 2014 as GoogLeNet.
Architecture of GPT-2
