=> you can record me
Project reviews
What caught your attention?
The Google dismantlement (antitrust break-up) case is not happening, partly because of genAI, which changes Google's monopoly status
Some numbers

Major distinction between open and closed source models
A distinction relevant for all software including AI models
Open source:
The code (and, for AI models, the weights) is publicly available; anyone can inspect, modify, and redistribute it.

Closed source:
The code and weights are proprietary; you can only use the model through the vendor's product or API.
Since the code is public: Transparency, security, innovation, cost-effectiveness
Different levels of openness:
Some models are fully open (DeepSeek: weights and code released), partially open (Llama, Mistral 7B: weights released under restricted licenses), or closed (OpenAI o1, Claude Sonnet, Gemini)
If you have the weights of a model, you can fine-tune it on your own data: a lightweight version of training a whole model
Lack of Transparency & Reproducibility: architecture, weights, hyperparameters, and training data hidden
The HUGE Impact of LLMs on society demands transparency and accountability
| Model / Family | Date | Company | Key Highlights |
|---|---|---|---|
| GPT-OSS-120B / 20B | Aug 2025 | OpenAI | Open-weight MoE models; long context; consumer-friendly |
| Llama 4 (Scout, Maverick) | April 5, 2025 | Meta | Mixture-of-experts; multimodal; long context; multilingual |
| DeepSeek-R1 | Jan 2025 | DeepSeek | Strong reasoning & math performance |
| Gemma 3 | March 12, 2025 | Google DeepMind | Multimodal, multilingual, long-context |
| Qwen 2.5 (VL-32B, Omni-7B) | March 2025 | Alibaba (Qwen Team) | Vision–language and multimodal capabilities |
| Qwen 3 family | April 28, 2025 | Alibaba (Qwen Team) | Dense & sparse variants up to 235B |
| Mistral Small 3.1 | March 2025 | Mistral AI | Efficient small-scale open model |
| Magistral Small | June 10, 2025 | Mistral AI | Reasoning-focused, chain-of-thought tuned |
| GLM-4.5 | July 2025 | Zhipu AI | Agent-oriented model |
| BitNet b1.58 2B4T | April 2025 | Microsoft Research | 1-bit quantized, ultra-efficient |
| AM-Thinking-v1 | May 2025 | a-m-team (Qwen-based) | Advanced reasoning model built on Qwen |

Open-source models are catching up with closed-source models
Q: What does this mean?
Performance vs cost (updated August 2025)

| Term | Explanation |
|---|---|
| Multilingual | The model understands and can respond in many different languages, not just English. |
| Multimodal | The model can understand inputs like text and images (sometimes also audio/video) rather than just one type. |
| Long context | The model can remember and work with very long passages of text (think chapters or entire books) without forgetting what was at the start; some models handle up to ~1M tokens. |
| MoE (Mixture-of-Experts) | An architecture where different expert sub-models handle different parts of the task; only a few experts activate per input. |
| Efficient / Consumer-friendly | Designed to run on regular devices (like a powerful laptop or single GPU) without needing massive data center infrastructure. |
| Language | Example Sentence | Writing System | Approx. Tokens (LLM tokenizers like GPT-4’s) | Notes |
|---|---|---|---|---|
| English | I am going to the computer store tomorrow. | Alphabetic | ~9–10 | Clear word boundaries, but “computer” may split into subwords depending on tokenizer. |
| French | Je vais demain au magasin d’ordinateur. | Alphabetic | ~9–11 | Similar to English; compounds like d’ordinateur may add extra tokens. |
| Chinese | 我明天去电脑商店。 | Logographic | 7 | Each character is usually one token; very dense information packing. |
| Japanese | 私は明日パソコンの店に行きます。 | Mixed (Kanji + Kana) | ~12–15 | Needs morphological analysis; kanji are tokens, kana sometimes merge into subwords. |
| Korean | 나는 내일 컴퓨터 가게에 간다. | Alphabetic (syllabic blocks) | ~10–12 | Spaces exist, but subword splits can happen (esp. with loanwords like 컴퓨터). |
The router model picks the proper expert models given the input and combines their outputs.
=> Better quality (specialization) and lower cost (sparse compute).

The model trains end-to-end: router + all experts learn together.
Router learns which experts fit each token/type of text.
Experts get practice on what they see most → specialization emerges
Load-balancing & capacity limits keep work shared so all experts improve.
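The routing scheme above can be sketched in pure Python. Everything here is an illustrative assumption, not any real model's implementation: the "experts" are tiny functions standing in for expert sub-networks, and the router is a fixed score matrix instead of a learned layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "experts": tiny functions standing in for expert sub-networks.
EXPERTS = [
    lambda x: [2 * v for v in x],    # expert 0
    lambda x: [v + 1 for v in x],    # expert 1
    lambda x: [-v for v in x],       # expert 2
    lambda x: [v * v for v in x],    # expert 3
]

# Toy router: one row of (made-up) scoring weights per expert.
ROUTER_W = [
    [0.9, 0.1],
    [0.2, 0.8],
    [-0.5, 0.3],
    [0.1, -0.4],
]

def route(x, top_k=2):
    """Score experts, keep the top_k, combine their outputs (sparse compute)."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in ROUTER_W]
    gates = softmax(scores)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = EXPERTS[i](x)  # only the selected experts run
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out, top
```

Only `top_k` of the 4 experts run per input, which is where the "lower cost" of sparse MoE compute comes from.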
Questions:
=> No
Train on a big mixed dataset (web, books, code, Q&A, dialogue, languages).
We choose sampling weights by goals → start with an initial mix, then test & adjust.
Guardrails: min/max quotas, quality filters, de-duplication; optional curriculum (shift weights over time).
MoE-specific nudge: up-weight data that wakes underused experts.
=> Net effect: the router + data mix shape who does what, efficiently.
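The "choose sampling weights, then test & adjust" step can be sketched with the stdlib: draw each training example's source according to the mix, count what you got, tweak the weights, repeat. The source names and weights below are made-up illustrations.

```python
import random
from collections import Counter

# Hypothetical corpus sources with illustrative sampling weights (sum to 1).
MIX = {"web": 0.5, "books": 0.2, "code": 0.15, "qa": 0.1, "dialogue": 0.05}

def sample_sources(n, weights, seed=0):
    """Draw n source labels according to the sampling weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

counts = Counter(sample_sources(10_000, MIX))
# Inspecting `counts` and editing MIX is the "test & adjust" loop in miniature;
# a curriculum would shift these weights over the course of training.
```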
Obtain and analyze an existing CSV dataset for the project.

The Hugging Face datasets hub: an interface to 295,389 datasets, with examples of many kinds
RSS = “Really Simple Syndication.” A standard way for websites to publish updates.
How it works: Sites expose a feed (an XML file). Your reader/aggregator checks it and shows new posts in one place.
Why it’s nice: Chronological, no algorithms, no ads injected by platforms, and privacy-friendly (you pull info; nobody tracks your clicks by default).
What you can follow: Blogs, news sites, podcasts, newsletters, job boards, YouTube channels, forum threads.
Finding feeds: Look for the RSS icon, “/feed” or “/rss” on the site, or a “Subscribe”/“Follow via RSS” link.
ex: https://oilprice.com
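The "reader checks the XML feed" step can be sketched with the stdlib XML parser. The feed below is an inline made-up sample; a real reader would fetch the XML from the site's /feed or /rss URL.

```python
import xml.etree.ElementTree as ET

# Minimal RSS 2.0 feed (inline sample standing in for a fetched /feed file).
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Post one</title><link>https://example.com/1</link></item>
  <item><title>Post two</title><link>https://example.com/2</link></item>
</channel></rss>"""

def parse_feed(xml_text):
    """Return (title, link) for each item in an RSS feed."""
    root = ET.fromstring(xml_text)
    return [(i.findtext("title"), i.findtext("link"))
            for i in root.iter("item")]
```

An aggregator is just this loop run over many feeds, showing new items chronologically.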
Super efficient video downloader
Can also be used for transcripts / subtitles to build a corpus
https://github.com/yt-dlp/yt-dlp
https://www.pythoncentral.io/yt-dlp-download-youtube-videos/
yt-dlp --write-subs --write-auto-subs --skip-download https://www.youtube.com/watch?v=example
Example: pick a media outlet, get all its videos, download the subtitles (or the videos themselves), extract images, analyse the images, etc.
Examine a media outlet's coverage of a topic: commentary, images, etc.
There are multiple API standards and languages
And APIs can return all sorts of content: HTML, text, XML, PDFs, etc.
APIs often return raw data formatted as JSON
JSON (JavaScript Object Notation) is a lightweight data format.
[
  {
    "name": "Châtelet",
    "lines": ["1", "4", "7", "11", "14"]
  },
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"]
  },
  {
    "name": "Charles de Gaulle–Étoile",
    "lines": ["1", "2", "6"]
  }
]
[
  {
    "name": "Charles de Gaulle–Étoile",
    "lines": ["1", "2", "6"],
    "exits": [
      {
        "name": "Sortie 1 — Arc de Triomphe",
        "address": {
          "street": "Place Charles de Gaulle",
          "postal_code": "75008",
          "city": "Paris",
          "country": "France"
        }
      },
      {
        "name": "Sortie 2 — Champs-Élysées",
        "address": {
          "street": "Avenue des Champs-Élysées",
          "postal_code": "75008",
          "city": "Paris",
          "country": "France"
        }
      }
    ]
  },
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"],
    "exits": [
      {
        "name": "Sortie 1 — Place de la Bastille",
        "address": {
          "street": "Place de la Bastille",
          "postal_code": "75011",
          "city": "Paris",
          "country": "France"
        }
      }
    ]
  }
]
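Nested JSON like the station records above is navigated with list indexes and dict keys once parsed. A minimal sketch using an inline copy of part of the Bastille record:

```python
import json

# Inline copy of (part of) the Bastille record above.
stations_json = """
[
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"],
    "exits": [
      {"name": "Sortie 1 — Place de la Bastille",
       "address": {"street": "Place de la Bastille", "postal_code": "75011",
                   "city": "Paris", "country": "France"}}
    ]
  }
]
"""

stations = json.loads(stations_json)
# Drill into the nested structure: list index -> key -> list index -> key.
postal = stations[0]["exits"][0]["address"]["postal_code"]
```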
[
  {
    "name": "Billlie",
    "origin": "South Korea",
    "genre": "K-pop",
    "debut_year": 2021,
    "members": [
      "Moon Sua",
      "Suhyeon",
      "Haruna",
      "Sheon",
      "Tsuki",
      "Siyoon"
    ],
    "notable_songs": ["Ring X Ring", "GingaMingaYo", "EUNOIA"]
  },
  {
    "name": "Oasis",
    "origin": "United Kingdom",
    "genre": "Britpop / Rock",
    "debut_year": 1991,
    "members": [
      "Liam Gallagher",
      "Noel Gallagher",
      "Paul Arthurs",
      "Paul McGuigan",
      "Tony McCarroll"
    ],
    "notable_songs": ["Wonderwall", "Don't Look Back in Anger", "Champagne Supernova"]
  }
]
JSON rules
Data is written as "key": value pairs. Keys must be in double quotes " ".
Valid values: strings ("Hello"), numbers (42, 3.14), booleans (true or false), null, objects ({ "key": "value" }), and arrays ([ 1, 2, 3 ]).
Common mistakes:
Comments (// ... or /* ... */) are not allowed in JSON.
NaN, Infinity, or functions are invalid.
Booleans must be spelled true / false, not t or f.
Missing data: use null or an empty [] / {} rather than leaving a value out.
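The stdlib `json` module enforces these rules: valid JSON parses into Python dicts and lists, and a rule violation (here, a trailing comma) raises an error.

```python
import json

valid = '{"name": "Châtelet", "lines": ["1", "4"]}'
data = json.loads(valid)           # parses into a Python dict

invalid = '{"name": "Châtelet",}'  # trailing comma breaks the JSON rules
try:
    json.loads(invalid)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False              # parsing fails, as expected
```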
Editors: Spyder, VS Code, Windsurf, Sublime, Atom, Xcode.
CSV stands for: comma separated values.
In 🇫🇷🇫🇷🇫🇷 France, we use ; as a separator because numbers use , as a decimal separator 🤪 (123.456,78 instead of 123,456.78).
CSV files with a tab separator are sometimes called TSV.
A CSV file is essentially a spreadsheet stored as plain text.
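The French convention above can be handled with the stdlib `csv` module by setting the delimiter. The file content and the number-cleaning helper below are made-up illustrations.

```python
import csv
import io

# French-style CSV: ';' separates fields because ',' is the decimal mark.
french_csv = "produit;prix\nordinateur;1234,56\nclavier;49,90\n"

rows = list(csv.reader(io.StringIO(french_csv), delimiter=";"))

def parse_french_number(s):
    """Turn '1.234,56' into the float 1234.56 (hypothetical helper)."""
    return float(s.replace(".", "").replace(",", "."))

price = parse_french_number(rows[1][1])
```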

pandas: THE library for handling data.

pd.read_csv() loads a CSV file into a DataFrame.
A DataFrame is exactly like a spreadsheet: columns and rows.
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head()        # first 5 rows
df.tail()        # last 5 rows
df.columns       # column names
df.dtypes        # data type of each column
df.shape         # (number of rows, number of columns)
df.describe()    # summary statistics for numeric columns
df.isnull()      # True where a value is missing
df['column_name']                         # one column (a Series)
df[['column_name1', 'column_name2']]      # several columns (a DataFrame)
df.loc[:, 'column_name1':'column_name2']  # columns selected by label
df.iloc[:, 0:2]                           # columns selected by position
df.loc[]   # label-based selection: df.loc[rows, columns]
df.iloc[]  # position-based selection: df.iloc[rows, columns]
mask = df.age > 18     # boolean Series: True where age > 18
df[mask]               # keep only the rows where the mask is True
column_names = ['col1', 'col2']  # the column names you want to see
df.loc[mask, column_names]
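A minimal in-memory sketch of the selections above, with toy data and hypothetical column names, so you can try loc/iloc/masks without downloading a CSV:

```python
import pandas as pd

# Tiny DataFrame built in memory (no CSV file needed); names are made up.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Chloé"],
    "age": [34, 12, 25],
    "city": ["Paris", "Lyon", "Nice"],
})

mask = df.age > 18              # boolean Series
adults = df[mask]               # rows where age > 18
names = df.loc[mask, ["name"]]  # same rows, only the name column
first_two = df.iloc[:, 0:2]     # first two columns by position
```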
In Google Colab
The titanic dataset is available here
Load imdb_top_1000.csv either from local or from the url into a pandas dataframe called df
Read the data dictionary
Then => Practice sheet
Let’s review your projects
Fill in the Google spreadsheet for projects
Ethan Mollick : https://www.oneusefulthing.org