Any questions?
We have used Gemini in Google Colab and web platforms (chatgpt.com, claude.ai, etc.).
We’re going to look at platforms for inference: Groq and OpenRouter.
The goal is to compare models on a simple task
Use a 3rd LLM to judge the outputs
You
What caught your attention this week?
Inference is the process of running a trained LLM to generate predictions or responses from input prompts - essentially the “thinking” phase where the model processes your request and produces output.
• What it is: using a pre-trained model to generate text, not training it
• Process: input (prompt) → model processing → output (tokens/text)
• Computational need: requires significant GPU/TPU resources for each request
Groq:
• Specialized inference hardware (LPU, Language Processing Unit)
• Optimized for extremely fast token generation (up to 500+ tokens/second)
• Focuses purely on inference speed, not training
• Achieves low latency through custom silicon designed for LLMs
OpenRouter:
• Inference routing service that connects to multiple LLM providers
• Aggregates different inference endpoints (OpenAI, Anthropic, Meta, etc.)
• Handles load balancing and failover between providers
• You pay for inference compute across various models through one API
Bottom line: Inference is “running the model to get answers” - Groq makes it blazingly fast with custom chips, while OpenRouter gives you access to many different models’ inference endpoints through a single interface.
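To make this concrete, here is a minimal sketch of what a single inference call looks like. It assumes the openai Python package and a placeholder API key; the base URLs are the OpenAI-compatible endpoints for Groq and OpenRouter, and the model name is only an example.

```python
# Minimal sketch of an inference call through an OpenAI-compatible endpoint.
# Assumptions: `pip install openai`, a valid API key, and an example model name.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # or "https://openrouter.ai/api/v1"
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # example model; check the provider's model list
    messages=[{"role": "user", "content": "Summarize this debate in one sentence."}],
)
print(response.choices[0].message.content)
```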
Open an account on Groq,
then go to
https://console.groq.com/
and create an API key
create a new Google Colab notebook
add the API key to the Colab secrets (the key icon in the left sidebar)
load the dataset into a pandas dataframe
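A minimal sketch of those steps in Colab, assuming the secret is stored under the name GROQ_API_KEY and the debates file is a CSV (the filename is hypothetical, adjust it to your dataset):

```python
# Read the Groq API key from Colab's secrets panel and load the dataset.
import pandas as pd
from google.colab import userdata  # Colab's secrets API
from groq import Groq              # pip install groq

client = Groq(api_key=userdata.get("GROQ_API_KEY"))  # assumed secret name

df = pd.read_csv("debates_sample.csv")  # hypothetical filename
df.head(2)
```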
classic or modern NLP is mostly about
We can use an LLM to extract much richer information from text
Consider a corpus of verbatim debates, or online posts on a given topic, or articles or scientific papers.
| Category | Use Case | Description |
|---|---|---|
| Content Analysis & Extraction | Argument/claim extraction | Identify main arguments, supporting evidence, counterarguments |
| | Entity recognition | Extract people, organizations, locations, events, concepts |
| | Relationship mapping | Identify connections between entities, ideas, or actors |
| | Quote attribution | Match statements to speakers/authors |
| | Topic modeling | Identify main themes and sub-topics |
| | Key points summarization | Extract main takeaways per document |
| Category | Use Case | Description |
|---|---|---|
| Semantic & Linguistic Annotation | Sentiment analysis | Positive/negative/neutral stance on topics |
| | Emotion detection | Identify emotional tone (anger, joy, fear, etc.) |
| | Stance detection | Position on specific issues (pro/con/neutral) |
| | Rhetorical device identification | Metaphors, analogies, logical fallacies |
| | Writing style analysis | Formal/informal, technical level, complexity |
| | Intent classification | Inform, persuade, criticize, question |
| Structural Organization | Temporal extraction | Timeline of events, chronological ordering |
| | Hierarchical categorization | Taxonomies, nested topic structures |
| | Section segmentation | Break into logical parts (intro, methods, conclusion) |
| | Thread reconstruction | Link replies, responses, follow-ups |
| | Citation network analysis | Who references whom, influence mapping |
| Category | Use Case | Description |
|---|---|---|
| Quality & Meta-Analysis | Fact-checking flags | Mark claims needing verification |
| | Bias detection | Political, cultural, or demographic biases |
| | Quality scoring | Rate argument strength, evidence quality, coherence |
| | Contradiction detection | Find conflicting statements within/across documents |
| | Gap analysis | Identify missing topics or underexplored areas |
| | Discourse type classification | Narrative, expository, argumentative |
| Knowledge Synthesis | Comparative analysis | How different sources treat same topic |
| | Consensus identification | Points of agreement across documents |
| | Evolution tracking | How arguments/positions change over time |
| | Cross-reference generation | Link related content across corpus |
| | Glossary creation | Extract and define domain-specific terms |
| | Question generation | Create relevant questions the corpus answers |
Let’s pick argument extraction.
We have this dataset of verbatim debates of the European Parliament.
I will connect to a model via Groq
and use it to extract arguments from the debates
and enrich the original dataset.
I work on a small subsample of the whole dataset (45M).
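A sketch of what the extraction step could look like, reusing the Groq client from above. It assumes the dataframe has a text column with one speech per row, asks for JSON-only output, and uses an example model name; json.loads will fail if the model wraps its answer in extra text.

```python
import json

# Hypothetical prompt: extract the main arguments from one speech as JSON.
EXTRACTION_PROMPT = """Extract the main arguments from the following debate speech.
Return ONLY a JSON object of the form {{"arguments": ["...", "..."]}} and nothing else.

Speech:
{speech}"""

def extract_arguments(speech: str) -> dict:
    """Send one speech to a Groq-hosted model and parse the JSON reply."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # example model; any Groq chat model works
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(speech=speech)}],
    )
    return json.loads(response.choices[0].message.content)
```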
Using an API is limited in two ways.
It’s important not to overwhelm the endpoint: always add a short pause in your loops.
import time
time.sleep(1)
LLM APIs have limits on the amount of data you can send and retrieve.
Especially for free APIs!
Your allotted amount of free tokens is severely limited, as you will see.
always work on the smallest subset of data that’s relevant.
start with one or two samples to verify the code works
once the code works, slowly scale up, do not process the whole dataset right away
you never know if some weird row will break your code
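Putting that advice together, a sketch of a careful loop: only a couple of rows, a pause between calls, and one bad row does not kill the run (the "text" column name and the extract_arguments helper come from the sketches above):

```python
import time

results = []
for idx, row in df.head(2).iterrows():      # start with 2 rows, scale up later
    try:
        results.append(extract_arguments(row["text"]))  # assumes a "text" column
    except Exception as exc:                # a weird row or reply shouldn't stop the run
        print(f"row {idx} failed: {exc}")
        results.append(None)
    time.sleep(1)                           # be gentle with the endpoint / free-tier limits
```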
DRY: Don’t Repeat Yourself
KISS: Keep It Simple, Stupid
YAGNI: You Ain’t Gonna Need It
SRP: Single Responsibility Principle
PIE: Program Intently and Expressively
A very important aspect of API calls is idempotency.
An operation that produces the same result no matter how many times you perform it.
Example: The Elevator Button 🛗. Whether you press it once or ten times,
the result is always the same: one elevator arrives.
Data example: a function that mutates the input list l in place is not idempotent; an idempotent version leaves l untouched and returns a new list ll.
you can ask your coding agent to respect these principles
and make the functions idempotent by not changing the input but by outputting a new object / variable
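A small illustration of the difference on a dataframe (the column names are illustrative):

```python
# NOT idempotent: mutates the caller's dataframe and gives a different result each run.
def count_runs_inplace(df):
    df["runs"] = df.get("runs", 0) + 1       # run it twice -> "runs" becomes 2, not 1
    return df

# Idempotent: never touches the input; the same input always yields the same new dataframe.
def add_argument_count(df):
    out = df.copy()
    out["n_arguments"] = out["arguments"].apply(len)  # assumes an "arguments" column
    return out
```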
Now let’s see what we can do with OpenRouter.
multiple LLM providers: 494 models
Let’s open a new account on OpenRouter and create an API key:
https://openrouter.ai/
https://openrouter.ai/settings/keys
https://openrouter.ai/rankings
https://openrouter.ai/models
filter by price:
there are 57 free models that you can use
Let’s go back to our small dataset
write the inference code for OpenRouter
with the model as a parameter (so we can switch models easily)
ask Gemini to create a function that sends a prompt to the OpenRouter API and returns a JSON object
write a prompt, specify that the output must be JSON
write a function that adds the returned JSON to the original dataframe (see the sketch below)
pick 2 models!
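A sketch of such a function, roughly what you can ask Gemini to write for you. The endpoint is OpenRouter's chat-completions URL; the secret name, the reuse of EXTRACTION_PROMPT from the Groq sketch, and the two free model names are assumptions, so check the models page for what is actually available.

```python
import json
import requests
from google.colab import userdata

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def query_openrouter(prompt: str, model: str) -> dict:
    """Send a prompt to OpenRouter and return the parsed JSON answer."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {userdata.get('OPENROUTER_API_KEY')}"},  # assumed secret name
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    content = response.json()["choices"][0]["message"]["content"]
    return json.loads(content)  # assumes the prompt asked for JSON-only output

# Try it on a single row first; the model names below are examples of free models.
sample = df.iloc[0]["text"]
out_a = query_openrouter(EXTRACTION_PROMPT.format(speech=sample), "meta-llama/llama-3.1-8b-instruct:free")
out_b = query_openrouter(EXTRACTION_PROMPT.format(speech=sample), "mistralai/mistral-7b-instruct:free")
```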
ok, now let’s evaluate the 2 models
Most of the time it’s very difficult to say that one model is better than another.
On some samples or some prompts one model will perform better than another one.
and sometimes vice versa.
we’re going to ask a 3rd model to evaluate the output of the 2 models
let’s try to use a model from Groq as a judge
flow:
enrich the initial small dataset with the 2 OpenRouter models
set the judge prompt with the two models’ outputs (see the sketch below)
Then scale up!
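A hedged sketch of the judge step, reusing the Groq client from earlier; the judge prompt, the verdict format, and the model name are just one possible design.

```python
import json

# Hypothetical judge prompt: compare two extractions for the same speech.
JUDGE_PROMPT = """You are judging two argument-extraction outputs for the same speech.

Speech:
{speech}

Output A:
{a}

Output B:
{b}

Return ONLY JSON of the form {{"winner": "A" or "B" or "tie", "reason": "..."}}"""

def judge(speech: str, a: dict, b: dict) -> dict:
    """Ask a third, Groq-hosted model which extraction is better."""
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # example judge model on Groq
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(speech=speech, a=a, b=b)}],
    )
    return json.loads(response.choices[0].message.content)
```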
Ask Gemini for a visualization of the results.
Continue to investigate the differences in output quality.
Maybe you had something in mind, some expectation of how you wanted the data to be enriched, of what you wanted to see, and that did not come through.
How can you improve the initial data enrichment prompt?
The iterative cycle of experimenting with models/approaches, analyzing failures, and refining until you find what works.
Why It’s Essential:
The Truth: If you’re not failing, you’re not trying hard enough. The magic happens in iteration 47, not iteration 1! 🎯