Project reviews
You
What caught your attention?
Major distinction between open and closed source models
A distinction relevant for all software, including AI models
Open source: the code is public → transparency, security, innovation, cost-effectiveness
Closed source: the code (and, for models, the weights) is proprietary and hidden
Different levels of openness:
Some models are fully open (DeepSeek), partially open (Llama, Mistral 7B), or closed (OpenAI o1, Claude Sonnet, Gemini)
If you have the weights of a model, you can fine-tune it on your own data: a lightweight version of training a whole model from scratch
Lack of Transparency & Reproducibility: architecture, weights, hyperparameters, and training data are hidden
The HUGE Impact of LLMs on society demands transparency and accountability
Model / Family | Date | Company | Key Highlights |
---|---|---|---|
GPT-OSS-120B / 20B | Aug 2025 | OpenAI | Open-weight MoE models; long context; consumer-friendly |
Llama 4 (Scout, Maverick) | April 5, 2025 | Meta | Mixture-of-experts; multimodal; long context; multilingual |
DeepSeek-R1 | Jan 2025 | DeepSeek | Strong reasoning & math performance |
Gemma 3 | March 12, 2025 | Google DeepMind | Multimodal, multilingual, long-context |
Qwen 2.5 (VL-32B, Omni-7B) | March 2025 | Alibaba (Qwen Team) | Vision–language and multimodal capabilities |
Qwen 3 family | April 28, 2025 | Alibaba (Qwen Team) | Dense & sparse variants up to 235B |
Mistral Small 3.1 | March 2025 | Mistral AI | Efficient small-scale open model |
Magistral Small | June 10, 2025 | Mistral AI | Reasoning-focused, chain-of-thought tuned |
GLM-4.5 | July 2025 | Zhipu AI | Agent-oriented model |
BitNet b1.58 2B4T | April 2025 | Microsoft Research | 1-bit quantized, ultra-efficient |
AM-Thinking-v1 | May 2025 | Academind / Qwen community | Qwen-based advanced reasoning model |
Open source models are catching up with closed source models
Q: What does this mean ?
Term | Explanation |
---|---|
Multilingual | The model understands and can respond in many different languages, not just English. |
Multimodal | The model can understand inputs like text and images (sometimes also audio/video) rather than just one type. |
Long context | The model can remember and work with very long passages of text (think chapters or entire books) without forgetting what was at the start; some models now handle up to ~1M tokens.
MoE (Mixture-of-Experts) | An architecture where different expert models handle different parts of the task; only a few experts activate per input.
Efficient / Consumer-friendly | Designed to run on regular devices (like a powerful laptop or single GPU) without needing massive data center infrastructure. |
Language | Example Sentence | Writing System | Approx. Tokens (LLM tokenizers like GPT-4’s) | Notes |
---|---|---|---|---|
English | I am going to the computer store tomorrow. | Alphabetic | ~9–10 | Clear word boundaries, but “computer” may split into subwords depending on tokenizer. |
French | Je vais demain au magasin d’ordinateur. | Alphabetic | ~9–11 | Similar to English; elisions like d’ordinateur may add extra tokens.
Chinese | 我明天去电脑商店。 | Logographic | 7 | Each character is usually one token; very dense information packing. |
Japanese | 私は明日パソコンの店に行きます。 | Mixed (Kanji + Kana) | ~12–15 | Needs morphological analysis; kanji are tokens, kana sometimes merge into subwords. |
Korean | 나는 내일 컴퓨터 가게에 간다. | Alphabetic (syllabic blocks) | ~10–12 | Spaces exist, but subword splits can happen (esp. with loanwords like 컴퓨터). |
The router model picks the proper expert models given the input and combines their outputs.
=> Better quality (specialization) and lower cost (sparse compute).
Model trains end-to-end: router + all experts learn together.
Router learns which experts fit each token/type of text.
Experts get practice on what they see most → specialization emerges
Load-balancing & capacity limits keep work shared so all experts improve.
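The routing idea above can be sketched in a toy Python snippet. Everything here is invented for illustration (keyword scores standing in for the router's learned logits, the expert names, `TOP_K`); a real MoE router is a trained layer inside a transformer, not keyword matching.

```python
import math

# Toy illustration of MoE routing: keyword scores stand in for the
# router's learned logits, and the "experts" are plain functions.
EXPERTS = {
    "math":  lambda text: f"[math expert handles] {text}",
    "code":  lambda text: f"[code expert handles] {text}",
    "prose": lambda text: f"[prose expert handles] {text}",
}
KEYWORDS = {
    "math":  ["sum", "integral", "equation"],
    "code":  ["def", "import", "class"],
    "prose": ["story", "poem", "essay"],
}
TOP_K = 2  # sparse compute: only a few experts activate per input

def route(text):
    """Return the TOP_K experts and their softmax routing weights."""
    logits = {name: sum(word in text for word in words)
              for name, words in KEYWORDS.items()}
    exps = {name: math.exp(v) for name, v in logits.items()}
    total = sum(exps.values())
    weights = {name: v / total for name, v in exps.items()}
    top = sorted(weights, key=weights.get, reverse=True)[:TOP_K]
    return {name: weights[name] for name in top}

selected = route("def solve(): compute the sum and the integral")
# The selected experts' outputs are then combined, weighted by
# the router's scores (here we just collect them):
combined = [EXPERTS[name]("...") for name in selected]
```

The sparse-compute benefit is visible in the structure: only `TOP_K` of the experts run per input, however many experts exist in total.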
Questions:
=> No
Train on a big mixed dataset (web, books, code, Q&A, dialogue, languages).
We choose sampling weights by goals → start with an initial mix, then test & adjust.
Guardrails: min/max quotas, quality filters, de-duplication; optional curriculum (shift weights over time).
MoE-specific nudge: up-weight data that wakes underused experts.
=> Net effect: the router + data mix shape who does what, efficiently.
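The "choose sampling weights, then test & adjust" loop above can be sketched with weighted sampling from the standard library. The source names and weights below are made up for illustration; real training mixes are tuned empirically.

```python
import random

# Sketch of drawing training examples according to sampling weights
# over data sources. Weights are invented for this example.
SOURCES = {
    "web":      0.50,
    "books":    0.20,
    "code":     0.15,
    "qa":       0.10,
    "dialogue": 0.05,
}

def sample_batch(n, weights, seed=None):
    """Draw n source labels according to the mix weights."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n)

batch = sample_batch(1000, SOURCES, seed=42)
# After evaluating the model, adjust the weights (e.g. up-weight an
# under-represented source) and resample: the "test & adjust" loop.
```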
Obtain and analyze an existing CSV dataset for the project.
Hugging Face datasets interface with 295,389 datasets and various dataset examples
RSS = “Really Simple Syndication.” A standard way for websites to publish updates.
How it works: Sites expose a feed (an XML file). Your reader/aggregator checks it and shows new posts in one place.
Why it’s nice: Chronological, no algorithms, no ads injected by platforms, and privacy-friendly (you pull info; nobody tracks your clicks by default).
What you can follow: Blogs, news sites, podcasts, newsletters, job boards, YouTube channels, forum threads.
Finding feeds: Look for the RSS icon, “/feed” or “/rss” on the site, or a “Subscribe”/“Follow via RSS” link.
ex: https://oilprice.com
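What a reader/aggregator does can be sketched with the standard library: parse the feed XML and list the items. The feed content below is a made-up example; a real reader would first download the XML from the site's feed URL (the exact path, e.g. `/rss` or `/feed`, varies by site).

```python
import xml.etree.ElementTree as ET

# A tiny invented RSS 2.0 feed, embedded so the snippet is self-contained.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>Post one</title><link>https://example.com/1</link></item>
    <item><title>Post two</title><link>https://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(FEED)
# Each <item> is one post; collect (title, link) pairs in feed order.
posts = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
for title, link in posts:
    print(title, "->", link)
```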
https://www.google.com/alerts
Super efficient video downloader
Can also be used for transcripts / subtitles to build a corpus
https://github.com/yt-dlp/yt-dlp
https://www.pythoncentral.io/yt-dlp-download-youtube-videos/
yt-dlp --write-subs https://www.youtube.com/watch?v=example  # add --skip-download to fetch only the subtitles
Example: pick a media outlet, get all its videos, download the subtitles or the videos, extract images, analyze the images, etc.
Examine a media outlet's coverage of a topic: commentary, images, etc.
There are multiple API standards and languages
And APIs can return all sorts of content: HTML, text, XML, PDFs, etc.
APIs often return raw data formatted as JSON
JSON (JavaScript Object Notation) is a lightweight data format.
[
{
"name": "Châtelet",
"lines": ["1", "4", "7", "11", "14"]
},
{
"name": "Bastille",
"lines": ["1", "5", "8"]
},
{
"name": "Charles de Gaulle–Étoile",
"lines": ["1", "2", "6"],
}
]
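In Python, a JSON document like the stations example above becomes ordinary lists and dicts via the standard `json` module. The JSON string is embedded here so the snippet is self-contained.

```python
import json

# JSON text, as an API might return it.
raw = """
[
  {"name": "Châtelet", "lines": ["1", "4", "7", "11", "14"]},
  {"name": "Bastille", "lines": ["1", "5", "8"]}
]
"""

stations = json.loads(raw)  # JSON text -> Python list of dicts
for station in stations:
    print(station["name"], "serves lines", ", ".join(station["lines"]))
```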
[
{
"name": "Charles de Gaulle–Étoile",
"lines": ["1", "2", "6"],
"exits": [
{
"name": "Sortie 1 — Arc de Triomphe",
"address": {
"street": "Place Charles de Gaulle",
"postal_code": "75008",
"city": "Paris",
"country": "France"
}
},
{
"name": "Sortie 2 — Champs-Élysées",
"address": {
"street": "Avenue des Champs-Élysées",
"postal_code": "75008",
"city": "Paris",
"country": "France"
}
}
]
},
{
"name": "Bastille",
"lines": ["1", "5", "8"],
"exits": [
{
"name": "Sortie 1 — Place de la Bastille",
"address": {
"street": "Place de la Bastille",
"postal_code": "75011",
"city": "Paris",
"country": "France"
}
}
]
}
]
[
{
"name": "Billlie",
"origin": "South Korea",
"genre": "K-pop",
"debut_year": 2021,
"members": [
"Moon Sua",
"Suhyeon",
"Haruna",
"Sheon",
"Tsuki",
"Siyoon"
],
"notable_songs": ["Ring X Ring", "GingaMingaYo", "EUNOIA"]
},
{
"name": "Oasis",
"origin": "United Kingdom",
"genre": "Britpop / Rock",
"debut_year": 1991,
"members": [
"Liam Gallagher",
"Noel Gallagher",
"Paul Arthurs",
"Paul McGuigan",
"Tony McCarroll"
],
"notable_songs": ["Wonderwall", "Don't Look Back in Anger", "Champagne Supernova"]
}
]
JSON rules
Data is made of "key": value pairs
Keys must be in double quotes " "
Values can be:
a string: "Hello"
a number: 42, 3.14
a boolean: true or false
null
an object: { "key": "value" }
an array: [ 1, 2, 3 ]
Common Mistakes
Comments (// ... or /* ... */) → not allowed in JSON
NaN, Infinity, or functions → invalid values
Booleans must be written in full lowercase: true or false (not t or f)
Empty collections are written [] or {}; missing data is represented as null
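Python's `json` module makes these rules concrete: violations raise `json.JSONDecodeError`. (One caveat: Python's parser accepts `NaN` and `Infinity` as a non-standard extension, even though they are invalid in strict JSON.)

```python
import json

# Valid JSON parses into Python objects.
ok = json.loads('{"key": "value", "n": 3.14, "flag": true, "missing": null}')

# Common mistakes raise json.JSONDecodeError:
bad_documents = [
    '{"key": "value",}',         # trailing comma
    "{'key': 'value'}",          # single quotes instead of double quotes
    '{"key": "value"} // note',  # comments are not allowed
    '{"flag": True}',            # Python-style True instead of true
]
for doc in bad_documents:
    try:
        json.loads(doc)
    except json.JSONDecodeError as err:
        print("invalid:", doc, "->", err.msg)
```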
Editors with JSON support: VS Code, Windsurf, Sublime Text, Atom, Xcode
csv stands for: comma-separated values
In 🇫🇷 France, we use ; as a separator because numbers use , as a decimal separator 🤪
French: 123.456,78
English: 123,456.78
csv files with a tab separator are sometimes called tsv
A csv file is a spreadsheet: rows and columns stored as plain text.
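A quick sketch of reading a French-style CSV (`;` separator, `,` decimal) with the standard library. The product data is invented and embedded so the snippet is self-contained.

```python
import csv
import io

# French-style CSV: ';' separates fields, ',' is the decimal separator.
FRENCH_CSV = "produit;prix\nordinateur;1234,50\nsouris;19,99\n"

rows = list(csv.reader(io.StringIO(FRENCH_CSV), delimiter=";"))
header, data = rows[0], rows[1:]
# Convert the French decimal comma to a dot before parsing the number.
prices = [float(price.replace(",", ".")) for _, price in data]
print(header, prices)
```

(pandas handles the same situation directly via `pd.read_csv(..., sep=';', decimal=',')`.)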
pandas: THE library to handle data in Python.
pd.read_csv loads a csv file into a dataframe.
A dataframe is exactly like a spreadsheet: columns and rows.
import pandas as pd
df = pd.read_csv('titanic.csv')
print(df)
df.head()        # first 5 rows
df.tail()        # last 5 rows
df.columns       # column names
df.dtypes        # data type of each column
df.shape         # (number of rows, number of columns)
df.describe()    # summary statistics for numeric columns
df.isnull().sum()                      # missing values per column
df.isnull().mean()                     # fraction of missing values per column
df.isnull().sum() / df.shape[0]        # same fraction, computed by hand
df.isnull().sum() / df.shape[0] * 100  # as a percentage
df['column_name']                          # one column (a Series)
df[['column_name1', 'column_name2']]       # several columns (a DataFrame)
df.loc[:, 'column_name1':'column_name2']   # columns sliced by label
df.iloc[:, 0:2]                            # columns sliced by position
df.loc[...]   # label-based selection
df.iloc[...]  # position-based selection
mask = df.age > 18   # boolean mask: True where age > 18
df[mask]             # keep only the rows where the mask is True
column_names = ['column_name1', 'column_name2']  # columns you want to see
df.loc[mask, column_names]   # filtered rows, selected columns
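The selection and filtering patterns above, in one minimal self-contained run (pandas must be installed; the passenger data is invented, not from titanic.csv):

```python
import io
import pandas as pd

# Tiny made-up CSV, embedded so no file is needed.
CSV = """name,age,fare
Alice,22,7.25
Bob,38,71.28
Chloe,4,16.70
"""
df = pd.read_csv(io.StringIO(CSV))

mask = df.age > 18                      # boolean mask
adults = df.loc[mask, ['name', 'age']]  # filtered rows, selected columns
print(adults)
```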
In Google Colab (or any Jupyter notebook), evaluating df on its own line displays the dataframe as a nicely formatted table.
The data dictionary is available at: https://www.kaggle.com/datasets/mayankray/imdb-top-1000-movies-dataset
Let’s review your projects
fill in the google spreadsheet for projects
https://docs.google.com/spreadsheets/d/1PcCnVB6wnJVXIreziA89dtqpGRGylk03sqUnTY-HYj4/edit?usp=sharing
Ethan Mollick : https://www.oneusefulthing.org