LLMs: large language models, increasingly deployed as the engine behind AI agents.
Super Powerful LLMs: Revolutionizing data analysis, interpretation, and automation.
Impact on Data Science: Enhancing efficiency, accuracy, and scalability of workflows.
Code Generation with LLMs: quickly develop data analysis pipelines.
LLMs as Autonomous Data Analysis Tools: extract patterns, trends, and insights from raw data; analyze large datasets or documents.
Accessible Methods: Simplifying complex data processes with user-friendly tools.
From Emergence to Transcendence
Language models are trained to mimic human behavioral data.
This mimicry makes it tempting to anthropomorphize a system—to think of it like a person. However, not only is the model not a person, it is not even trained to mimic a person.
Instead, the model has been trained to mimic a group of people with individual capacities, predilections, and biases. [...] but we also see the enormous advantage of training on data from a diverse set of people: often, it is possible to outperform any individual member of that group.
The capacity of a generalist model to exceed individual ability is evident in a chatbot that can converse with equal competence about cryptography, international law, and the work of Dostoevsky. Our goal is to describe the circumstances in which a model, trained to mimic multiple people, is capable of transcending its sources by outperforming each individual.
LLM benchmarks are tests that measure how well large language models perform on different tasks, like answering questions, solving problems, or writing text. They give a way to compare models side by side.
Challenges:
Benchmarks don’t always reflect real-world use, can become outdated quickly, and models often “train to the test,” meaning high scores don’t always equal better usefulness.
Traditional Benchmarks:
MMLU (Massive Multitask Language Understanding): ~16,000 multiple-choice questions across 57 subjects
HellaSwag: commonsense sentence completion ("Can a Machine Really Finish Your Sentence?")
GSM8K (Grade School Math 8K): grade-school math word problems in Q&A format
Problem: Models quickly saturate these tests
The New Frontier:
GPQA Diamond: 198 multiple-choice questions in biology, chemistry, and physics, ranging from "hard undergraduate" to "post-graduate level".
LiveCodeBench: a contamination-free code evaluation benchmark for LLMs that continuously collects new problems over time.
Humanity's Last Exam: questions from nearly 1,000 subject-expert contributors affiliated with over 500 institutions across 50 countries – mostly professors, researchers, and graduate degree holders.
These represent humanity's cognitive boundaries
GPQA Diamond: Science at PhD Level
The hardest science questions humans can answer
Graduate-Level Google-Proof Q&A
Physics, Chemistry, Biology questions at PhD level
Designed to be Google-proof - can't be solved by search
Human PhD holders: ~65% accuracy
Why it matters: Tests deep scientific reasoning, not memorization
Current AI leaders:
Grok 4: 87.5%
GPT‑5: 87.3%
Gemini 2.5 Pro: 86.4%
Grok 3 [Beta]: 84.6%
OpenAI o3: 83.3%
LiveCodeBench
Real programming challenges from competitive coding platforms
Problems released after model training
Tests actual problem-solving, not memorization
Updated continuously with fresh challenges
**Human vs AI Performance (2025)**
- Top human programmers: ~85-95%
- Best AI models: ~45-60%
- **Gap is closing rapidly**
The question: What happens when AI exceeds human performance on every cognitive benchmark?
Two 4.5-hour exam sessions, no tools or internet, reading the official problem statements, and writing natural-language proofs.
Artificial Analysis Intelligence Index
Artificial Analysis Intelligence Index combines performance across seven evaluations: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2025, and IFBench.
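As an illustration only, here is one way such a composite index could be computed; the equal weighting and the example scores are assumptions, not the official Artificial Analysis methodology or real model results:

```python
# Illustrative sketch: combine per-benchmark scores into one index.
# Equal weighting and the scores below are assumptions, not official values.
benchmark_scores = {
    "MMLU-Pro": 0.80,
    "GPQA Diamond": 0.70,
    "Humanity's Last Exam": 0.15,
    "LiveCodeBench": 0.55,
    "SciCode": 0.40,
    "AIME 2025": 0.85,
    "IFBench": 0.60,
}

index = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Intelligence Index (0-1 scale): {index:.3f}")
```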
Score vs release date
Cost
"When GPT-4 came out it was around 50/Mtokens,nowitcostsaround0.14 / Mtokens to use GPT-5 nano. GPT-5 nano is a much more capable model than the original GPT-4."
"Google has reported that energy efficiency per prompt has improved by 33x in the last year alone.
The marginal energy used by a standard prompt from a modern LLM in 2025 is relatively established at this point, from both independent tests and official announcements.
It is roughly 0.0003 kWh, the same energy use as 8-10 seconds of streaming Netflix or the equivalent of a Google search in 2008.
Image creation seems to use a similar amount of energy as a text prompt.
How much water these models use per prompt is less clear, but it ranges from a few drops to about a fifth of a shot glass (0.25 mL to 5 mL+).
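Taking the ~0.0003 kWh figure at face value, a quick back-of-the-envelope scaling (the prompt counts are made-up assumptions, not usage data):

```python
# Back-of-the-envelope scaling of the per-prompt energy figure quoted above.
# kwh_per_prompt comes from the text; the prompt counts are illustrative only.
kwh_per_prompt = 0.0003

prompts_per_day = 100                         # hypothetical heavy user
daily_kwh = prompts_per_day * kwh_per_prompt  # 0.03 kWh
yearly_kwh = daily_kwh * 365                  # ~11 kWh per year

print(f"Daily:  {daily_kwh:.3f} kWh")
print(f"Yearly: {yearly_kwh:.1f} kWh")
```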
**midjourney**: _a panda is typing on a laptop ; a penguin is looking over the shoulder ; realistic, natural colors, low sun, the panda wears a knitted hat and a warm jacket ; background : polar station, ice, snow_
Demo
In this demo, I will
load the penguins data into a pandas dataframe
do some basic exploration
ask for visualizations and analysis (see the sketch below)
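A minimal sketch of these steps, assuming the Palmer Penguins dataset bundled with seaborn (`sns.load_dataset("penguins")`); in the live demo the code is generated by the LLM and the data source may differ:

```python
# Minimal sketch of the demo steps; assumes seaborn's bundled copy of the
# Palmer Penguins dataset (the live demo may load the data differently).
import seaborn as sns
import matplotlib.pyplot as plt

# Load the penguins data into a pandas DataFrame
penguins = sns.load_dataset("penguins")

# Basic exploration
print(penguins.head())
penguins.info()
print(penguins.describe())
print(penguins["species"].value_counts())

# A first visualization: flipper length vs. bill length, colored by species
sns.scatterplot(
    data=penguins,
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species",
)
plt.title("Palmer Penguins: bill length vs. flipper length")
plt.show()
```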
Your turn
---
# Google Colab practice
Simple exercise on Google Colab with a similar dataset: the Titanic!
The Titanic dataset is a classic in machine learning.
List of 200 passengers
some features (age, sex, name, ticket price, etc.) – see the starter sketch below
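A possible starting point, assuming the Titanic dataset bundled with seaborn; the Colab notebook may supply its own CSV with slightly different column names:

```python
# Hypothetical starter: load the Titanic data and inspect the features.
# Assumes seaborn's bundled dataset; the Colab exercise may provide its own CSV.
import seaborn as sns

titanic = sns.load_dataset("titanic")

print(titanic.shape)                  # number of passengers and features
print(titanic.columns.tolist())       # age, sex, fare, class, survived, ...
print(titanic["survived"].mean())     # overall survival rate
print(titanic.groupby("sex")["survived"].mean())  # survival rate by sex
```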