
Leo Breiman, Statistical Modeling: The Two Cultures
Here is the abstract:
_There are two cultures in the use of statistical modeling to reach conclusions from data._
One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.
Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets.
If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
correlation is not causation => spurious correlation website
statistical significance - p-values

Legendre (1805): First publication of the method of least squares, which is the basis for linear regression.
Gauss (1795–1809): Independently developed and formalized the method, contributing significantly to its statistical foundation.
Context: tabular data.
- predictor variables
- target variable: the thing you want to predict
Simple model: linear regression
sales = a * radio + b * TV + c * newspaper + noise
you try to find a, b, c so that you can predict sales
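Here is a minimal sketch of such a fit, using numpy's least-squares solver on synthetic data (the spend values, coefficients and noise level below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic advertising data: columns are radio, TV and newspaper spend
X = rng.uniform(0, 100, size=(n, 3))

# "True" coefficients and intercept, chosen arbitrarily, plus some noise
sales = X @ np.array([0.5, 0.3, 0.1]) + 2.0 + rng.normal(0, 1, size=n)

# Least squares: append a column of ones so the intercept is fitted too
X1 = np.column_stack([X, np.ones(n)])
a, b, c, intercept = np.linalg.lstsq(X1, sales, rcond=None)[0]

print('a (radio):', round(a, 3), 'b (TV):', round(b, 3),
      'c (newspaper):', round(c, 3), 'intercept:', round(intercept, 3))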


The goal is to minimize the error between what the model predicts and the actual values
In the case with 2 predictors (x_0 and x_1) and 1 target (y), the model is:
y_pred = a* x_0 + b * x_1 + c
you want to minimize the prediction error.
\[\text{prediction error} = \sqrt{ \sum_{i=1}^{n} \bigl(y_i - (a\, x_{0,i} + b\, x_{1,i} + c)\bigr)^2 }\]

Let’s travel back to 50 AD in ancient Greece and meet Ἥρων ὁ Ἀλεξανδρεύς, aka Heron of Alexandria.

He invented the first steam engine (Aeolipile) and first vending machine!
And also something called Heron’s method to calculate the square root of a number, also known (for unknown reasons) as the Babylonian method.
Simple algorithm to calculate the square root of 2.
Start from x_0 for instance x_0 = 1
Then repeat
- x_{n+1} = 1/2 (x_n + 2 / x_n)
until x_{n+1}^2 is close enough to 2.
in python:
# Heron's method for sqrt(2): start from x_0 = 1 and iterate
x = 1
precision = 0.0001
for i in range(100):
    x = 0.5 * (x + 2 / x)
    error = abs(2 - x**2)
    print(i, 'x : ', round(x, 8), 'error : ', round(error, 8))
    if error < precision:
        break

print('\nx : ', round(x, 10), 'x^2 : ', round(x**2, 10))
Extremely efficient!
It has all the ingredients of the method of least squares: start from a guess, iterate, and stop when the error is small enough.
The context: you have a dataset with features (columns) and a target variable.
For instance: the Titanic dataset, the Iris dataset, the house prices dataset, etc.
On this page of the most popular datasets, notice how we can filter by task.
On one side we have supervised learning (classification and regression) and on the other side unsupervised learning (clustering).


The output of a linear regression is a real number (a decimal number) without any limit.
If we force this number into the range [0, 1], then we can interpret it as a probability.
Introducing the sigmoid function (the inverse of the logit).
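A minimal sketch (the values below are arbitrary): the sigmoid squashes the raw output of a linear model into (0, 1).

import numpy as np

def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Raw, unbounded outputs of a linear model (arbitrary example values)
raw_outputs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(raw_outputs))  # values between 0 and 1, interpretable as probabilities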


Start with a set of arbitrary coefficients (all zeros or ones, for instance), then improve them step by step, in a way very similar to Heron’s method.
The goal of gradient descent is to find a local minimum of a function.
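A minimal sketch, on a toy problem of my own choosing that echoes the square-root exercise above: gradient descent on f(x) = (x^2 - 2)^2, whose minimum is at x = sqrt(2).

# f(x) = (x^2 - 2)^2 is minimized at x = sqrt(2); its derivative is 4x(x^2 - 2)
def gradient(x):
    return 4 * x * (x**2 - 2)

x = 1.0               # arbitrary starting point
learning_rate = 0.05  # step size
for i in range(100):
    x = x - learning_rate * gradient(x)   # move a small step downhill

print('x :', round(x, 8), 'x^2 :', round(x**2, 8))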


Wikipedia:
Gradient descent is generally attributed to Augustin-Louis Cauchy, who first suggested it in 1847. Jacques Hadamard independently proposed a similar method in 1907. Its convergence properties for non-linear optimization problems were first studied by Haskell Curry in 1944.
In (batch) gradient descent, each iteration uses all the data.
The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s.
In stochastic gradient descent (SGD), each iteration uses only a tiny subset of the data (a mini-batch) or even just one sample.
All current genAI models are trained with some variant of SGD!
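A minimal sketch of mini-batch SGD on a one-feature linear regression (the data is synthetic, and the batch size, learning rate and epoch count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + 1 plus a little noise
X = rng.uniform(-1, 1, size=1000)
y = 3 * X + 1 + rng.normal(0, 0.1, size=1000)

a, b = 0.0, 0.0        # slope and intercept, both start at zero
learning_rate = 0.1
batch_size = 10

for epoch in range(20):
    order = rng.permutation(len(X))              # shuffle the samples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]    # each update sees only a tiny subset
        error = (a * X[idx] + b) - y[idx]
        a -= learning_rate * 2 * np.mean(error * X[idx])  # gradient of the mean squared error w.r.t. a
        b -= learning_rate * 2 * np.mean(error)           # gradient w.r.t. b

print('a :', round(a, 3), 'b :', round(b, 3))    # should end up close to 3 and 1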



You want to know how the model will perform on new data, but you only have a set of data.
So you split the dataset into train and test subsets.
Cross-validation: in fact, you do that multiple times with different splits and tune the model so that it performs well on average over all the splits.

This way the training data is not very different from the evaluation data (think outliers, missing values, under-representation of a class, etc.).
In the end you want your model to perform well on new unseen data.
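A minimal sketch with scikit-learn (the Iris dataset and logistic regression are arbitrary choices here; any dataset/model pair would do):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)

# Single split: train on one part, evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('test accuracy :', round(model.score(X_test, y_test), 3))

# Cross-validation: repeat over 5 different splits and look at the average
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print('cross-validation accuracy :', round(scores.mean(), 3))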
Overfitting is the enemy of the data scientist.
More complex models are better at learning the training data.

The model is so good at learning the training data that it will perform great on the training data but poorly on the test data, and therefore on the real-world data that it has not seen yet.
Can’t generalize.
Detect overfitting by comparing the error on the training data and the error on the test data:
if the error is much lower on the training data than on the test data, the model is overfitting.

Techniques to avoid overfitting:
- adding constraints on the model (regularisation)
- randomly dropping some features at each iteration
This is a decision tree trained on the Iris dataset. It is overfitting.

The same tree but limited to a depth of 2

No longer overfits, but poorer performance.
So train many trees (pruned or not), each on a random subset of the data, then average their predictions:
you have random forests.
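A minimal sketch of these ideas with scikit-learn on the Iris dataset (the depth, number of trees and split size are arbitrary); compare the train and test accuracy of each model:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = [
    ('unpruned tree', DecisionTreeClassifier(random_state=0)),
    ('tree limited to depth 2', DecisionTreeClassifier(max_depth=2, random_state=0)),
    ('random forest (100 trees)', RandomForestClassifier(n_estimators=100, random_state=0)),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(name,
          '| train accuracy:', round(model.score(X_train, y_train), 3),
          '| test accuracy:', round(model.score(X_test, y_test), 3))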
The activation function is what makes the model non-linear.

Initialize weights (including a bias weight) to 0 or small random values.
Set a learning rate (a small number, e.g., 0.01).
For each training example:
    Calculate the weighted sum:
        sum = (input1 * weight1) + (input2 * weight2) + ... + bias_weight
    Apply a step function to decide:
        If sum >= 0, output = 1
        Else, output = 0
    If the output is wrong:
        For each weight:
            weight = weight + (learning_rate * (true_output - predicted_output) * input)
        bias_weight = bias_weight + (learning_rate * (true_output - predicted_output))
Repeat for multiple passes over the training data until errors are minimized.
and in python
import numpy as np

# Training data: each row is [input1, input2, true_output] (the AND function)
training_data = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Initialize weights and bias
weights = np.array([0.0, 0.0])
bias = 0.0
learning_rate = 0.1

# Training loop
for _ in range(10):
    for data in training_data:
        inputs, true_output = data[:2], data[2]
        # Calculate weighted sum
        weighted_sum = np.dot(inputs, weights) + bias
        # Step function (1 if >= 0, else 0)
        predicted_output = 1 if weighted_sum >= 0 else 0
        # Update weights and bias
        error = true_output - predicted_output
        weights += learning_rate * error * inputs
        bias += learning_rate * error

# Test the perceptron
for data in training_data:
    inputs, _ = data[:2], data[2]
    weighted_sum = np.dot(inputs, weights) + bias
    print(f"Input: {inputs} -> Output: {1 if weighted_sum >= 0 else 0}")
Single-layer perceptrons are only capable of learning linearly separable patterns
Can learn AND but not XOR

A feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron). (Wikipedia)
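A minimal hand-built illustration (the weights below are picked by hand, not trained): with one hidden layer and the same step activation as above, XOR becomes easy to express.

def step(z):
    # Same step activation as the perceptron above
    return 1 if z >= 0 else 0

# Hidden unit 1 computes OR, hidden unit 2 computes AND,
# and the output fires when OR is true but AND is not: that is XOR.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h1 = step(x1 + x2 - 0.5)       # OR
    h2 = step(x1 + x2 - 1.5)       # AND
    output = step(h1 - h2 - 0.5)   # OR and not AND = XOR
    print(f"Input: ({x1}, {x2}) -> Output: {output}")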

see Notebook: XOR vs AND
1980s: Development of backpropagation
On top of L2/L1 regularisation, dropout consists in randomly setting some activations (units) to zero during training.
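A minimal sketch of the idea in numpy (the activation values and the drop probability of 0.5 are made up; this is the "inverted dropout" formulation):

import numpy as np

rng = np.random.default_rng(0)

activations = np.array([0.2, 1.5, 0.7, 2.1, 0.9])   # made-up outputs of one layer
p_drop = 0.5

# Training time: zero out each activation with probability p_drop,
# and rescale the survivors so the expected value stays the same
mask = rng.random(activations.shape) >= p_drop
print(activations * mask / (1 - p_drop))

# Test time: no dropout, the activations are used as-is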


Replace the linear regression with convolutions.
Excellent for images and computer vision tasks.

Each convolution layer extracts features from the previous layer. It zooms out and abstracts the patterns

Typical CNN
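As a rough sketch of what such a network looks like in code (using PyTorch, which is not used elsewhere in these notes; the channel counts, 28x28 grayscale input and 10 output classes are arbitrary):

import torch
import torch.nn as nn

# Two convolution + pooling stages that extract and abstract features,
# followed by a linear classifier on the flattened feature maps
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),    # low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),   # more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                    # classifier over 10 classes
)

x = torch.randn(1, 1, 28, 28)   # one fake grayscale image
print(model(x).shape)           # torch.Size([1, 10])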

Great for time series, NLP and any type of sequential data


https://en.wikipedia.org/wiki/Inception_(deep_learning_architecture)
Inception is a family of convolutional neural networks (CNN) for computer vision, introduced by researchers at Google in 2014 as GoogLeNet.
Architecture of GPT-2
