Images and ConvNets
transfer learning
Transformers with Hugging Face AutoModelForCausalLM: the model generates text in a unidirectional, left-to-right manner, predicting the next word in a sequence based on the preceding context. Compared to other transformer model classes, the key characteristic of AutoModelForCausalLM is this unidirectional processing (unlike bidirectional language models such as BERT, which attend to both left and right context).
https://huggingface.co/transformers/v4.8.2/model_doc/auto.html?highlight=automodelforcausallm#automodelforcausallm — from the docs: "This is a generic model class that will be instantiated as one of the model classes of the library (with a causal language modeling head) when created with the from_pretrained() class method or the from_config() class method."
Despite the name, the "causal" in AutoModelForCausalLM refers to causal (left-to-right) masking, not to causal inference: the model is autoregressive, predicting each token from the tokens that precede it. In other words, it learns the conditional probability distribution of the next token given the previous ones.
pip install transformers
AutoModelForCausalLM: This module allows us to load a pre-trained causal language model. Causal language models can generate text based on a given prompt or context. AutoTokenizer: This module allows us to load a pre-trained tokenizer. Tokenizers break down input text into individual tokens, which are the basic units that the model understands.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained GPT-2 tokenizer and causal language model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize a prompt, generate a continuation, and decode it back to text
input_text = "I want to learn AI"
input_ids = tokenizer(input_text, return_tensors='pt').input_ids
generated_ids = model.generate(input_ids, max_length=30)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(generated_text)
Transformers
From DistilGPT2 to Llama or Gemma
DistilGPT2
DistilGPT2 is an English-language model pre-trained with the supervision of the 124 million parameter version of GPT-2. DistilGPT2, which has 82 million parameters, was developed using knowledge distillation and was designed to be a faster, lighter version of GPT-2.
https://huggingface.co/distilbert/distilgpt2
GPT2 model card : https://github.com/openai/gpt-2/blob/master/model_card.md
LOL, from the model card: "Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don't support use-cases that require the generated text to be true."
Trained on OpenWebText (https://skylion007.github.io/OpenWebTextCorpus/), text extracted from URLs shared on Reddit: "This left 38GB of text data (40GB using SI units) from 8,013,769 documents."
DistilGPT2: classic architecture, 82M parameters (lightweight)
We load the model with:
model = AutoModelForCausalLM.from_pretrained(
    "distilbert/distilgpt2",
    torch_dtype="auto",   # keep the dtype stored in the checkpoint
    device_map="cpu",     # place the whole model on CPU
)
print(model)
Which gives the following structure:
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-5): 6 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
So: 2 embedding layers + dropout, then 6 GPT2Block layers, then the final normalization and the head.
2 embedding layers
Embedding creation process in the input block. The wte layer converts each text token into a Token Embedding based on its meaning. The wpe layer generates a Positional Embedding based on the token’s position in the sequence. The combination of both produces a vector that contains both semantic and positional information.
WTE: Word Token Embedding
Vocab: 50,257 tokens; output: a 768-dimensional vector.
wpe (Positional Embedding) : encoding the position of each token within the sequence. For each possible position (from 0 to 1023), this layer generates a unique 768-dimension vector. The limit of 1024 positions marks the maximum input length that the model accepts. This vector, known as the positional embedding, is then added directly to the token embedding from the previous step. This vector addition is possible, precisely, because both vectors share the same dimensionality.
We add the two vectors, which have the same dimension.
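To make the input block concrete, here is a minimal sketch that reproduces it by hand, assuming the distilgpt2 model loaded above and the GPT-2 tokenizer from the first snippet (the variable names model and tokenizer come from those snippets):
import torch

ids = tokenizer("Fresh water flows from the mountain spring", return_tensors="pt").input_ids
tok_emb = model.transformer.wte(ids)                # token embeddings, shape (1, seq_len, 768)
pos_ids = torch.arange(ids.shape[1]).unsqueeze(0)   # positions 0 .. seq_len-1
pos_emb = model.transformer.wpe(pos_ids)            # positional embeddings, shape (1, seq_len, 768)
hidden = model.transformer.drop(tok_emb + pos_emb)  # element-wise sum, then dropout
print(hidden.shape)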
Then dropout (why right after the embeddings? Do we already have nodes to drop?)
In the context of embeddings, dropout can be applied to the embedding vectors. The original GPT-2 paper and some related literature suggest that "embedding dropout" typically refers to zeroing out entire word vectors rather than individual elements within the vectors. However, the Hugging Face implementation of GPT-2 applies a standard element-wise dropout to the embedding outputs.
The embedding layers are updated during training.
We start from a fixed vocabulary (a lookup table) and initialized vectors, which are updated during training.
Weight Matrix: The embedding layer is essentially a weight matrix, often denoted as E.
- The number of rows in E equals the vocabulary size (V).
- The number of columns in E equals the embedding dimension (D). This D is a hyperparameter you choose (e.g., 128, 256, 512).
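A minimal illustration of this lookup, with hypothetical toy token ids (for GPT-2, V = 50257 and D = 768):
import torch
import torch.nn as nn

V, D = 50257, 768                  # vocabulary size and embedding dimension
emb = nn.Embedding(V, D)           # the weight matrix E, shape (V, D), trained like any other layer
ids = torch.tensor([[31, 42, 7]])  # hypothetical token ids: each id selects one row of E
vectors = emb(ids)                 # shape (1, 3, 768)
print(emb.weight.shape, vectors.shape)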
Finally, we pass everything to the GPT2 blocks.
A GPT2Block Transformer block contains four components that appear in this order:
- the ln_1 layer (normalization before the attention mechanism),
- the attn module (the attention mechanism),
- the ln_2 layer (normalization before MLP processing),
- and the mlp module (transformation neural network).
The attention mechanism is responsible for contextualization: it gives each word the ability to “look at” and connect with the other words in the sequence to better understand its meaning. For example, in the sentence “Fresh water flows from the mountain spring daily”, the word “spring” needs to look at “water,” “flows” and “mountain” to understand that it doesn’t refer to the season. Attention calculates these connections automatically, determining which words are relevant to understand each position.
MLP: Multi layer perceptron
The MLP (Multi-Layer Perceptron) block acts as the knowledge processor: it takes the information that the attention layer has contextualized and transforms it by applying the knowledge the model has learned during training. Attention says that the word “spring” in this context is related to “mountain” and “water”; the MLP then uses its knowledge to infer that this implies concepts like “nature,” “freshness,” and “geography.” It’s a traditional neural network that maps input patterns to richer representations.
After the final block, the model must convert its internal representations into a final prediction. This process happens in two steps: first, the ln_f (final layer norm) layer stabilizes the output vectors.
Next, the lm_head (language modeling head) projects each vector to the full vocabulary space (50,257 dimensions).
The result of this projection isn’t a single token, but a vector of logits (scores) that represents a probability distribution over all possible tokens. For example, for the phrase “Paris is the capital of…”, these logits might assign a 30% probability to “France” and 25% to “Italy”. While 30% might seem low for a correct answer, it represents a strong prediction when distributed across a vocabulary of over 50,000 possible tokens. The remaining probability is spread thinly across thousands of other, far less likely options. It’s from this distribution that the next token is finally selected.
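A small sketch of how to inspect that distribution yourself, assuming the model and tokenizer variables from the earlier snippets (the actual percentages will differ from the illustrative numbers above):
import torch

inputs = tokenizer("Paris is the capital of", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape (1, seq_len, 50257)
probs = torch.softmax(logits[0, -1], dim=-1)  # next-token distribution over the whole vocabulary
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p:.2%}")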
Attention module
Google’s classic paper: “Attention is all you need” (https://arxiv.org/abs/1706.03762).
So far, the embedding for a given input token carries identity and position, but no context.
“l’avocat mange des fruits” vs “je mange un avocat” (in French, “avocat” means both “lawyer” and “avocado”)
For example, the part of the vector that represents the meaning of the word “spring” (its token embedding) is identical in “Fresh water flows from the mountain spring daily” and in “The garden comes alive again during the spring”.
This contextualization happens inside the (attn) module, GPT2Attention, through an architecture called Multi-Head Attention (MHA), in which each head works in parallel. Each attention head is, in essence, a complete and independent attention mechanism that has specialized in looking for a specific type of relationship. The specialization of the attention heads arises during the model’s training: as the model adjusts its weights to minimize the prediction error, each head learns to focus on the types of patterns that it finds most useful. For example, one head might become an expert in detecting syntactic relationships (subject-verb), while another might specialize in semantic relationships (synonyms) or long-range dependencies.
config = model.config
print(f"Attention heads: {config.n_head}")
print(f"Head dimensions: {config.n_embd // config.n_head}")
The model divides the input embedding dimensionality (768) among these 12 heads, so each one focuses on a smaller, more specialized 64-dimension representation (768 / 12 = 64). The c_attn layer is designed to very efficiently generate the necessary projections for all these heads simultaneously.
Head process
The input X is linearly transformed into three different representations: the Query (Q), the Key (K), and the Value (V). These transformations are learned linear projections.
Q = X * W_Q
K = X * W_K
V = X * W_V
Where W_Q, W_K, and W_V are the weight matrices for the query, key, and value transformations, respectively. Each head has its own unique set of these weight matrices.
The differences between W_Q, W_K, and W_V arise from:
Independent Initialization: They start with different random values.
Different Gradient Updates: They receive different gradients during backpropagation because they are used in different ways to compute attention scores and weighted values. This is the most important factor.
Functional Roles: They serve different purposes (querying, keying, and valuing), which guide their learning process.
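A rough sketch of how these projections come out of the Hugging Face GPT-2 code, where the fused c_attn layer mentioned earlier computes Q, K and V for all 12 heads in one shot (assumes the model variable from above; the input x here is random, purely to show the shapes):
import torch

block = model.transformer.h[0]            # first GPT2Block
x = torch.randn(1, 5, 768)                # fake hidden states: batch=1, seq_len=5, d_model=768
qkv = block.attn.c_attn(x)                # shape (1, 5, 2304) = Q, K, V concatenated
q, k, v = qkv.split(768, dim=-1)          # each (1, 5, 768)
q = q.view(1, 5, 12, 64).transpose(1, 2)  # split into 12 heads of dimension 64 -> (1, 12, 5, 64)
print(q.shape)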
This is a topic in itself, and I haven’t fully understood all of it yet.
Back to the MLP. It’s important to remember that this sequence runs in order: Attention → MLP → Attention → MLP, and so on through the six blocks. In other words, the Attention layer contextualizes and the MLP layer converts that contextualization into knowledge.
(mlp): GPT2MLP(
  (c_fc): Conv1D(nf=3072, nx=768)
  (c_proj): Conv1D(nf=768, nx=3072)
  (act): NewGELUActivation()
  (dropout): Dropout(p=0.1, inplace=False)
)
In fact, the execution order is slightly different:
(mlp): GPT2MLP(
  (c_fc): Conv1D(nf=3072, nx=768)
  (act): NewGELUActivation()
  (c_proj): Conv1D(nf=768, nx=3072)
  (dropout): Dropout(p=0.1, inplace=False)
)
expansion (c_fc) → activation (act) → contraction (c_proj) → regularization (dropout).
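A minimal, hypothetical re-implementation of this sequence with plain PyTorch layers (Hugging Face's GPT-2 actually uses its own Conv1D class, which behaves like a linear layer with transposed weights):
import torch
import torch.nn as nn

class SketchGPT2MLP(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, p=0.1):
        super().__init__()
        self.c_fc = nn.Linear(d_model, d_ff)    # expansion: 768 -> 3072
        self.act = nn.GELU()                    # non-linearity
        self.c_proj = nn.Linear(d_ff, d_model)  # contraction: 3072 -> 768
        self.dropout = nn.Dropout(p)            # regularization

    def forward(self, x):
        return self.dropout(self.c_proj(self.act(self.c_fc(x))))

print(SketchGPT2MLP()(torch.randn(1, 5, 768)).shape)  # torch.Size([1, 5, 768])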
Gelu https://docs.pytorch.org/docs/stable/generated/torch.nn.GELU.html
GELU (Gaussian Error Linear Unit) is a smooth, non-linear activation function that helps the model learn complex patterns by selectively passing or attenuating signals from each neuron.
=> breaks the linearity
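A tiny illustration of how GELU attenuates negative inputs smoothly instead of hard-zeroing them like ReLU (printed values are approximate):
import torch
import torch.nn as nn

gelu = nn.GELU()
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # approx. tensor([-0.0455, -0.1543, 0.0000, 0.3457, 1.9545])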
Dimensions of a model
The depth refers to the number of Transformer blocks that we stack on top of each other. In DistilGPT2, the depth is six Transformer blocks. Modifying the depth, as we saw in the previous chapter with depth pruning, involves removing entire blocks. Generally, more depth allows for a more complex sequential refinement of information, but at the cost of higher latency.

The width refers to the size of the internal layers, specifically the intermediate dimension of the MLP block. In our DistilGPT2, the width is 3072 neurons. This dimension determines the model’s capacity to process knowledge at each step. Reducing the expansion capacity with Width Pruning directly impacts the number of parameters and memory consumption.

We can distinguish two main approaches:
Wide models: Architectures like Llama-3.2-1B and DistilGPT2 are notably wide. They use a large input embedding and a larger MLP expansion than the other models (4x in this case), which gives them a large processing capacity in each of their layers.
Deep and narrow models: In contrast, models like Qwen3-0.6B and Gemma-3-270M, with a smaller embedding and a 3x expansion, are deeper and narrower. They prioritize a higher number of sequential transformations.
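A small sketch to read these dimensions straight from the Hugging Face config (shown for the distilgpt2 model loaded earlier; these field names are specific to GPT-2-style configs):
config = model.config
print("Depth (transformer blocks):", config.n_layer)         # 6 for distilgpt2
print("Width (hidden size):", config.n_embd)                 # 768
print("MLP expansion:", config.n_inner or 4 * config.n_embd) # 3072, i.e. 4x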
Head
LLMs
How to use Hugging Face
- the datasets and … libraries from Hugging Face
- tokenizer
Local LLMs
- Ollama
Tailored LLM
- pruning
- knowledge distillation
https://huggingface.co/spaces/exbert-project/exbert