Investigating with AI

Web, JSON and pandas

# [skatAI.com/inwai](/courses/inwai/)

=> you can record me


What we saw last time

Questions ? Feedback ?


Class representative elections


https://forms.gle/jncnyGVYhvfSCLe29


Today


At the end of this class

You


In the News

What caught your attention?

---

News Sept 8th 2025

Some numbers


Open source vs closed source

---

The Hugging Face release dashboard

2024 AI Timeline - Hugging Face

Major distinction between open and closed source models


Open source vs closed source

A distinction relevant for all software including AI models

Open source:

  • The code is public
  • Examples: Linux, OpenOffice, Firefox, Chromium, Python, major databases
  • Can be copied and modified by anyone

Closed source:

  • The code is not accessible
  • Examples: Windows, Word, Chrome, Edge, Oracle
  • Needs a license to use; works as a black box

Open source

Since the code is public: Transparency, security, innovation, cost-effectiveness


Linux is open source


Open source LLMs

Different levels of openness:

Some models are fully open (DeepSeek), partially open (Llama, Mistral 7B), or closed (OpenAI o1, Claude Sonnet, Gemini).

If you have the weights of a model, you can fine-tune it on your own data: a much lighter process than training a whole model from scratch.


Closed source models: issues

Lack of Transparency & Reproducibility: architecture, weights, hyperparameters, and training data hidden

The HUGE **Impact of LLMs** on society demands transparency and accountability

Recent open source

| Model / Family | Date | Company | Key Highlights |
| --- | --- | --- | --- |
| GPT-OSS-120B / 20B | Aug 2025 | OpenAI | Open-weight MoE models; long context; consumer-friendly |
| Llama 4 (Scout, Maverick) | April 5, 2025 | Meta | Mixture-of-experts; multimodal; long context; multilingual |
| DeepSeek-R1 | Jan 2025 | DeepSeek | Strong reasoning & math performance |
| Gemma 3 | March 12, 2025 | Google DeepMind | Multimodal, multilingual, long-context |
| Qwen 2.5 (VL-32B, Omni-7B) | March 2025 | Alibaba (Qwen Team) | Vision–language and multimodal capabilities |
| Qwen 3 family | April 28, 2025 | Alibaba (Qwen Team) | Dense & sparse variants up to 235B |
| Mistral Small 3.1 | March 2025 | Mistral AI | Efficient small-scale open model |
| Magistral Small | June 10, 2025 | Mistral AI | Reasoning-focused, chain-of-thought tuned |
| GLM-4.5 | July 2025 | Zhipu AI | Agent-oriented model |
| BitNet b1.58 2B4T | April 2025 | Microsoft Research | 1-bit quantized, ultra-efficient |
| AM-Thinking-v1 | May 2025 | Academind / Qwen community | Qwen-based advanced reasoning model |

Closed-source vs. Open-weight models (MMLU, 5-shot) performance comparison over time from 2022-04 to 2024-04

Open source models are catching up with closed source models

Q: What does this mean ?


Performance vs Cost - updated August 2025

| Term | Explanation |
| --- | --- |
| Multilingual | The model understands and can respond in many different languages, not just English. |
| Multimodal | The model can understand inputs like text and images (sometimes also audio/video) rather than just one type. |
| Long context | The model can remember and work with very long passages of text (think chapters or entire books) without forgetting what was at the start, e.g. 1M tokens. |
| MoE (Mixture-of-Experts) | An architecture where different expert models handle different parts of the task; only a few experts activate per input. |
| Efficient / Consumer-friendly | Designed to run on regular devices (like a powerful laptop or single GPU) without needing massive data center infrastructure. |

Context window


language variations

| Language | Example Sentence | Writing System | Approx. Tokens (LLM tokenizers like GPT-4’s) | Notes |
| --- | --- | --- | --- | --- |
| English | I am going to the computer store tomorrow. | Alphabetic | ~9–10 | Clear word boundaries, but "computer" may split into subwords depending on tokenizer. |
| French | Je vais demain au magasin d’ordinateur. | Alphabetic | ~9–11 | Similar to English; compounds like d’ordinateur may add extra tokens. |
| Chinese | 我明天去电脑商店。 | Logographic | 7 | Each character is usually one token; very dense information packing. |
| Japanese | 私は明日パソコンの店に行きます。 | Mixed (Kanji + Kana) | ~12–15 | Needs morphological analysis; kanji are tokens, kana sometimes merge into subwords. |
| Korean | 나는 내일 컴퓨터 가게에 간다. | Alphabetic (syllabic blocks) | ~10–12 | Spaces exist, but subword splits can happen (esp. with loanwords like 컴퓨터). |
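
Real token counts depend on the tokenizer, which needs a dedicated library. As a rough, standard-library-only proxy, the sketch below compares the number of characters with the number of UTF-8 bytes for three of the example sentences. It does not compute tokens; it only hints at why the same idea can cost a different number of tokens from one writing system to another.

```python
# Character count vs UTF-8 byte count for some of the example sentences.
# This is NOT a tokenizer: it only shows how differently scripts are encoded.
sentences = {
    "English": "I am going to the computer store tomorrow.",
    "Chinese": "我明天去电脑商店。",
    "Korean": "나는 내일 컴퓨터 가게에 간다.",
}

stats = {
    language: (len(text), len(text.encode("utf-8")))
    for language, text in sentences.items()
}

for language, (characters, utf8_bytes) in stats.items():
    print(f"{language}: {characters} characters, {utf8_bytes} UTF-8 bytes")
```

The English sentence uses one byte per character, while each Chinese character here takes three bytes.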

MoE Mixture of Experts

The router model picks the proper expert models given the input and combines their outputs.

=> Better quality (specialization) and lower cost (sparse compute).
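
A toy sketch of the routing idea, in plain Python. Everything here is a simplification for illustration: the "experts" are tiny functions rather than neural networks, and the router scores are passed in by hand, whereas a real router computes them from the input with learned weights. The names `EXPERTS`, `moe_forward` and `top_k` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical experts: tiny functions standing in for expert sub-networks.
EXPERTS = [
    lambda x: x + 1.0,
    lambda x: x * 2.0,
    lambda x: x * x,
    lambda x: -x,
]

def moe_forward(x, router_scores, top_k=2):
    """Run only the top_k experts and mix their outputs by router weight."""
    probs = softmax(router_scores)
    # Indices of the top_k experts with the highest probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalise the kept weights so that they sum to 1.
    kept = sum(probs[i] for i in chosen)
    output = sum(probs[i] / kept * EXPERTS[i](x) for i in chosen)
    return chosen, output

chosen, output = moe_forward(3.0, router_scores=[0.1, 2.0, 2.0, -1.0])
print("experts used:", chosen)
print("combined output:", output)
```

Only two of the four experts run for this input, which is where the "sparse compute" saving comes from.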


A Visual Guide to Mixture of Experts


How experts form (training)

Questions:

=> No


Data & sampling weights

=> Net effect: the router + data mix shape who does what, efficiently.


Datasets & datasources

---

Datasets & datasources


Kaggle datasets

Obtain and analyze an existing CSV dataset for the project.

https://www.kaggle.com/datasets


Hugging Face datasets

Hugging Face datasets interface with 295,389 datasets and various dataset examples

https://huggingface.co/datasets

---

Google datasets search engine

https://datasetsearch.research.google.com/


Data sources

Media


RSS feeds

RSS on Wikipedia

RSS = “Really Simple Syndication.” A standard way for websites to publish updates.

How it works: Sites expose a feed (an XML file). Your reader/aggregator checks it and shows new posts in one place.

Why it’s nice: Chronological, no algorithms, no ads injected by platforms, and privacy-friendly (you pull info; nobody tracks your clicks by default).

What you can follow: Blogs, news sites, podcasts, newsletters, job boards, YouTube channels, forum threads.

Finding feeds: Look for the RSS icon, “/feed” or “/rss” on the site, or a “Subscribe”/“Follow via RSS” link.

ex: https://oilprice.com
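
Since a feed is just an XML file, a few lines of Python are enough to read one. The sketch below parses a small made-up RSS 2.0 feed held in a string (`SAMPLE_FEED` and its contents are invented for illustration); with a real site you would first download the feed text from its URL.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration; a real one would be downloaded from the site.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com</link>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 08 Sep 2025 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 08 Sep 2025 12:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_rss(xml_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]

items = parse_rss(SAMPLE_FEED)
for entry in items:
    print(entry["published"], "|", entry["title"], "|", entry["link"])
```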


Google alerts in RSS feeds

https://www.google.com/alerts


yt-dlp

Super efficient video downloader

Can also be used for transcripts / subtitles to build a corpus

https://github.com/yt-dlp/yt-dlp

https://www.pythoncentral.io/yt-dlp-download-youtube-videos/

yt-dlp --write-subs https://www.youtube.com/watch?v=example

Example: pick a media outlet, list all its videos, then download the subtitles or the videos themselves, extract images, analyse the images, etc.

Use this to examine how a media outlet covers a topic: commentary, images, etc.
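
To build a subtitle corpus from a script rather than by hand, one option is to call the yt-dlp command line from Python. This is a sketch under assumptions: the helper name `build_subtitle_command` and the `RUN_DOWNLOAD` switch are hypothetical, the video URL is the placeholder used above, and the download only runs if you switch it on and yt-dlp is installed.

```python
import shutil
import subprocess

def build_subtitle_command(video_url, language="en"):
    """Build a yt-dlp command that fetches subtitles but not the video."""
    return [
        "yt-dlp",
        "--skip-download",    # do not download the video itself
        "--write-subs",       # subtitles provided by the uploader
        "--write-auto-subs",  # automatically generated captions, if any
        "--sub-langs", language,
        video_url,
    ]

command = build_subtitle_command("https://www.youtube.com/watch?v=example")
print(" ".join(command))

# Safety switch: nothing is downloaded unless you set this to True.
RUN_DOWNLOAD = False
if RUN_DOWNLOAD and shutil.which("yt-dlp"):
    subprocess.run(command, check=True)
```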


EU parliament debates

  • show the page (url) to the model as an example
  • ask to extract all interventions
  • get meta data on speakers
  • extract arguments

JSON

---

Beyond html: JSON

There are multiple API standards and languages.

APIs can return all sorts of content: HTML, text, XML, PDFs, and more.

APIs often return raw data formatted as JSON.

JSON (JavaScript Object Notation) is a lightweight data format.


JSON

How machines exchange data


```json
[
  {
    "name": "Châtelet",
    "lines": ["1", "4", "7", "11", "14"]
  },
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"]
  },
  {
    "name": "Charles de Gaulle–Étoile",
    "lines": ["1", "2", "6"]
  }
]
```

<div class="small" markdown="1">
```json
[
  {
    "name": "Charles de Gaulle–Étoile",
    "lines": ["1", "2", "6"],
    "exits": [
      {
        "name": "Sortie 1 — Arc de Triomphe",
        "address": {
          "street": "Place Charles de Gaulle",
          "postal_code": "75008",
          "city": "Paris",
          "country": "France"
        }
      },
      {
        "name": "Sortie 2 — Champs-Élysées",
        "address": {
          "street": "Avenue des Champs-Élysées",
          "postal_code": "75008",
          "city": "Paris",
          "country": "France"
        }
      }
    ]
  },
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"],
    "exits": [
      {
        "name": "Sortie 1 — Place de la Bastille",
        "address": {
          "street": "Place de la Bastille",
          "postal_code": "75011",
          "city": "Paris",
          "country": "France"
        }
      }
    ]
  }
]
```
</div>
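
Nested JSON like the station example above maps directly onto Python lists and dictionaries. A minimal sketch with the standard `json` module, using a shortened copy of the Bastille entry (the variable names are illustrative):

```python
import json

# Shortened copy of the station example, as the raw text an API might return.
raw = """
[
  {
    "name": "Bastille",
    "lines": ["1", "5", "8"],
    "exits": [
      {
        "name": "Sortie 1",
        "address": {"street": "Place de la Bastille", "postal_code": "75011", "city": "Paris"}
      }
    ]
  }
]
"""

stations = json.loads(raw)   # JSON array -> Python list
first = stations[0]          # JSON object -> Python dict

print(first["name"])
print(len(first["lines"]), "lines")
# Walk down the nesting: list -> dict -> list -> dict -> dict
print(first["exits"][0]["address"]["postal_code"])
```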

---

<div class="small" markdown="1">
```json
[
  {
    "name": "Billlie",
    "origin": "South Korea",
    "genre": "K-pop",
    "debut_year": 2021,
    "members": [
      "Moon Sua",
      "Suhyeon",
      "Haruna",
      "Sheon",
      "Tsuki",
      "Siyoon"
    ],
    "notable_songs": ["Ring X Ring", "GingaMingaYo", "EUNOIA"]
  },
  {
    "name": "Oasis",
    "origin": "United Kingdom",
    "genre": "Britpop / Rock",
    "debut_year": 1991,
    "members": [
      "Liam Gallagher",
      "Noel Gallagher",
      "Paul Arthurs",
      "Paul McGuigan",
      "Tony McCarroll"
    ],
    "notable_songs": ["Wonderwall", "Don't Look Back in Anger", "Champagne Supernova"]
  }
]
```
</div>

JSON rules

  • Data lives in key–value pairs written as "key": value, grouped in objects { } and arrays [ ]

  • Keys must be in double quotes " ".

  • Allowed value types

    • String (in double quotes) → "Hello"
    • Number (no quotes) → 42, 3.14
    • Boolean → true or false
    • Null → null
    • Object → { "key": "value" }
    • Array → [ 1, 2, 3 ]

**Common Mistakes**

  • Using single quotes for keys/strings.
  • Adding a trailing comma at the end of arrays/objects.
  • Having comments (// ... or /* ... */) → not allowed in JSON.
  • Using undefined values like NaN, Infinity, or functions → invalid.
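
These rules can be checked quickly with Python's standard `json` module. The sketch below parses one valid string and then tries three of the common mistakes listed above; the sample strings are made up for illustration.

```python
import json

valid = '{"name": "Bastille", "lines": ["1", "5", "8"], "open": true, "closed_on": null}'
data = json.loads(valid)
# JSON true / false / null become Python True / False / None.
print(data["open"], data["closed_on"])

broken_examples = {
    "single quotes": "{'name': 'Bastille'}",
    "trailing comma": '{"lines": ["1", "5", "8"],}',
    "comment": '{"name": "Bastille"} // main station',
}

errors = {}
for label, text in broken_examples.items():
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        errors[label] = exc.msg   # short description of what went wrong

for label, message in errors.items():
    print(f"{label}: {message}")
```

One caveat: Python's parser is lenient about `NaN` and `Infinity` and accepts them by default, even though they are not valid JSON, so other tools may reject a file that Python reads without complaint.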

Common data types

Missing data


How do you edit a JSON file ?


csv

CSV stands for comma-separated values.

In 🇫🇷🇫🇷🇫🇷 France, we use ; as a separator because numbers use , as a decimal separator 🤪

CSV files with a tab separator are sometimes called TSV.

A CSV file is a spreadsheet stored as plain text: one row per line, with values split by the separator.
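
A small sketch of what the French convention means in practice, using only the standard library. The data is a made-up toy example; the point is that the separator must be given explicitly and that decimal commas need converting before the values can be used as numbers.

```python
import csv
import io

# Toy data in the French style: ';' between columns, ',' as decimal separator.
french_csv = "item;price\ncoffee;2,50\ncroissant;1,20\n"

reader = csv.DictReader(io.StringIO(french_csv), delimiter=";")
rows = []
for row in reader:
    # "2,50" is text; swap the comma for a dot to get a real number.
    row["price"] = float(row["price"].replace(",", "."))
    rows.append(row)

for row in rows:
    print(row["item"], row["price"])
```

With pandas, the same file can be read in one call by passing `sep=";"` and `decimal=","` to `pd.read_csv`.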


Do not use Excel to edit your csv file!

Do not use Word to edit your json file!

Excel is evil

2001: A Space Odyssey - 1968 - Stanley Kubrick


pandas

THE library to handle data.

pandas

Loads a CSV file into a dataframe

A dataframe is like a spreadsheet: columns and rows.

import pandas as pd
df = pd.read_csv('titanic.csv')

basic manipulations

print(df)
df.head()
df.tail()
df.columns
df.dtypes
df.shape
df.describe()

Missing values (NULL) usually appear as NaN in pandas

df.isnull()
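
A minimal sketch of these inspection calls on a tiny made-up table, so it runs without downloading anything. The column names and values are invented and only stand in for the Titanic file.

```python
import pandas as pd

# Toy passenger table (made-up values) standing in for titanic.csv.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo", "Dan"],
    "age": [34, None, 12, 51],
    "fare": [71.3, 8.05, None, 26.0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # type of each column

# isnull() gives True/False per cell; sum() counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Two common ways to deal with missing values:
complete_rows = df.dropna()                           # drop incomplete rows
filled = df.fillna({"age": df["age"].median()})       # or fill in a value
```

Whether to drop or fill depends on the question you are asking; filling with the median is only one possible choice.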

Selecting columns

df['column_name']
df[['column_name1', 'column_name2']]
df.loc[:, 'column_name1':'column_name2']
df.iloc[:, 0:2]

Selecting rows

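df.loc[0:4]    # rows selected by index label (both ends included)
df.iloc[0:5]   # rows selected by integer position (end excluded)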
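
The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the row labels are not numbers. A short sketch on a made-up table with letters as the index:

```python
import pandas as pd

# Toy table (made-up values) with letters as row labels.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cleo", "Dan"],
        "age": [34, 27, 12, 51],
        "fare": [71.3, 8.05, 15.5, 26.0],
    },
    index=["a", "b", "c", "d"],
)

# loc uses labels: the slice "b":"c" includes both ends.
by_label = df.loc["b":"c", ["name", "age"]]

# iloc uses positions: 1:3 means rows 1 and 2, 0:2 means columns 0 and 1.
by_position = df.iloc[1:3, 0:2]

# A single label returns one row as a Series.
one_row = df.loc["d"]

print(by_label)
print(by_position)
print(one_row["name"])
```

Both selections return the same two rows and two columns.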

Masking

  1. create a logical mask, a condition
mask = df.age > 18
  2. only get the rows for that condition
df[mask]
  3. combine masking and selection
column_names = ['name', 'age']  # example list of the column names you want to see
df.loc[mask, column_names]
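
The three masking steps can be tried on a small made-up table. The sketch also shows how to combine two conditions, which needs `&` (and) or `|` (or) with parentheses around each condition; the column names and values are invented for illustration.

```python
import pandas as pd

# Toy table (made-up values).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo", "Dan"],
    "age": [34, 17, 12, 51],
    "survived": [1, 0, 1, 0],
})

# 1. Create the mask: one True/False value per row.
adult = df["age"] > 18

# 2. Keep only the rows where the mask is True.
adults = df[adult]

# 3. Combine conditions, then select rows and columns in one go.
mask = (df["age"] > 18) & (df["survived"] == 1)
adults_who_survived = df.loc[mask, ["name", "age"]]

# ~ inverts a mask: everyone who is NOT an adult.
minors = df[~adult]

print(adults_who_survived)
```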

Pandas Practice

---

demo

In google colab

The titanic dataset is available here


Your turn

Load imdb_top_1000.csv, either from a local file or from the URL, into a pandas dataframe called df

Read the data dictionary

Then => Practice sheet


Projects

---

Projects

Let's review your projects

Fill in the Google spreadsheet for projects


Next week

new data source:

Ethan Mollick : https://www.oneusefulthing.org


Exit ticket

[https://forms.gle/9eE9PUR6mFC2szq47](https://forms.gle/9eE9PUR6mFC2szq47)