Fast Inference via APIs

---

What we saw last time

Any questions?

Today

We have used Gemini in Google Colab and on web platforms (chatgpt.com, claude.ai, etc.)

We're going to look at platforms for inference: Groq and OpenRouter

The goal is to compare models on a simple task

Use a 3rd LLM to judge the outputs


At the end of this class

You will be able to run the same enrichment task on several models through an inference API, and use a third model as a judge


In the News

What caught your attention this week?

---

Inference

---

Inference

Inference in LLMs

Inference is the process of running a trained LLM to generate predictions or responses from input prompts - essentially the "thinking" phase where the model processes your request and produces output.

Key Points:

• What it is: Using a pre-trained model to generate text, not training it
• Process: Input (prompt) → Model processing → Output (tokens/text)
• Computational need: Requires significant GPU/TPU resources for each request


In Context of Groq & OpenRouter:

Groq:
• Specialized inference hardware (LPU - Language Processing Unit)
• Optimized for extremely fast token generation (up to 500+ tokens/second)
• Focuses purely on inference speed, not training
• Achieves low latency through custom silicon designed for LLMs

OpenRouter:
• Inference routing service that connects to multiple LLM providers
• Aggregates different inference endpoints (OpenAI, Anthropic, Meta, etc.)
• Handles load balancing and failover between providers
• You pay for inference compute across various models through one API


Inference

Bottom line: Inference is "running the model to get answers" - Groq makes it blazingly fast with custom chips, while OpenRouter gives you access to many different models' inference endpoints through a single interface.


Dataset

see /Users/alexis/work/ncc1701/eu-scrape/data/debates_small.json for EU dataset

Groq

Open an account on Groq

then go to

https://console.groq.com/

and create an API key

create a new Google Colab notebook

add the API key to the Colab secrets (left sidebar)

load the dataset into a pandas dataframe (see the sketch below)
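A minimal setup sketch, assuming the key is saved in Colab secrets under the name GROQ_API_KEY and that the JSON file is a flat list of records (both are assumptions):

import pandas as pd
from google.colab import userdata  # Colab helper to read secrets
from groq import Groq

# read the key from the Colab "Secrets" panel (left sidebar)
client = Groq(api_key=userdata.get("GROQ_API_KEY"))

# load the debates dataset into a dataframe
df = pd.read_json("debates_small.json")
print(df.shape)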


Enriching data with LLMs

classic NLP is mostly about narrow, predefined tasks (tokenization, named entities, sentiment labels)

We can use an LLM to extract much richer information from text

Consider a corpus of verbatim debates, or online posts on a given topic, or articles or scientific papers.

Content Analysis & Extraction
• Argument/claim extraction: Identify main arguments, supporting evidence, counterarguments
• Entity recognition: Extract people, organizations, locations, events, concepts
• Relationship mapping: Identify connections between entities, ideas, or actors
• Quote attribution: Match statements to speakers/authors
• Topic modeling: Identify main themes and sub-topics
• Key points summarization: Extract main takeaways per document

Semantic & Linguistic Annotation
• Sentiment analysis: Positive/negative/neutral stance on topics
• Emotion detection: Identify emotional tone (anger, joy, fear, etc.)
• Stance detection: Position on specific issues (pro/con/neutral)
• Rhetorical device identification: Metaphors, analogies, logical fallacies
• Writing style analysis: Formal/informal, technical level, complexity
• Intent classification: Inform, persuade, criticize, question

Structural Organization
• Temporal extraction: Timeline of events, chronological ordering
• Hierarchical categorization: Taxonomies, nested topic structures
• Section segmentation: Break into logical parts (intro, methods, conclusion)
• Thread reconstruction: Link replies, responses, follow-ups
• Citation network analysis: Who references whom, influence mapping

Quality & Meta-Analysis
• Fact-checking flags: Mark claims needing verification
• Bias detection: Political, cultural, or demographic biases
• Quality scoring: Rate argument strength, evidence quality, coherence
• Contradiction detection: Find conflicting statements within/across documents
• Gap analysis: Identify missing topics or underexplored areas
• Discourse type classification: Narrative, expository, argumentative

Knowledge Synthesis
• Comparative analysis: How different sources treat the same topic
• Consensus identification: Points of agreement across documents
• Evolution tracking: How arguments/positions change over time
• Cross-reference generation: Link related content across corpus
• Glossary creation: Extract and define domain-specific terms
• Question generation: Create relevant questions the corpus answers

Demo

Let's pick one use case: argument extraction

We have a dataset of verbatim debates from the European Parliament.

I will connect to a model via Groq,

use it to extract arguments from the debates,

and enrich the original dataset.

I work on a small subsample of the whole dataset (45M); see the sketch below.
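A sketch of the extraction call, reusing the Groq client from above; the model name is an example (check the console for currently available models) and the "text" column name is an assumption:

def extract_arguments(text):
    # ask the model for the main arguments in one intervention
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # example model name, check availability
        messages=[{"role": "user",
                   "content": "List the main arguments made in this debate excerpt:\n\n" + text}],
    )
    return response.choices[0].message.content

# verify on a single row before looping
print(extract_arguments(df.loc[0, "text"]))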


API caps

Using an API is limited in two ways: request rate and token quota.

It's important not to overwhelm the endpoint: always add a short pause in your loops.

import time
time.sleep(1)  # pause for one second between requests

LLM APIs limit the amount of data you can send and retrieve.

Especially the free APIs!

Your allotted amount of free tokens is severely limited, as you will see.


Method

always work on the smallest subset of data that's relevant

start with one or two samples to verify the code works

once the code works, scale up slowly; do not process the whole dataset right away

you never know if some weird row will break your code (see the sketch below)
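A sketch of that workflow; the column and helper names carry over from the earlier snippets and are assumptions:

import time

results = []
sample = df.head(5)  # start small, scale up once this works
for i, row in sample.iterrows():
    try:
        results.append(extract_arguments(row["text"]))
    except Exception as e:
        # one weird row should not kill the whole run
        print(f"row {i} failed: {e}")
        results.append(None)
    time.sleep(1)  # be gentle with the endpoint

# assign returns a new dataframe instead of mutating sample
sample = sample.assign(arguments=results)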


Some code best practices

DRY: Don't Repeat Yourself
KISS: Keep It Simple, Stupid
YAGNI: You Ain't Gonna Need It
SRP: Single Responsibility Principle
PIE: Program Intently and Expressively


Idempotency

A very important aspect of API calls is idempotency.

An operation that produces the same result no matter how many times you perform it.

Example: The Elevator Button 🛗

Press it once or five times, the result is always the same: one elevator arrives.

Data example: appending rows to a dataframe inside a loop is not idempotent; re-running the cell duplicates the rows.


Code quality

you can ask your coding agent to respect these principles

and make the functions idempotent by not changing the input but by outputting a new object / variable (see the sketch below)
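A tiny before/after sketch of that idea (hypothetical column names):

# not idempotent: mutates its input, each call changes df a bit more
def mark_processed_bad(df):
    df["text"] = df["text"] + " (processed)"
    return df

# idempotent usage: the input is left untouched and a new
# dataframe is returned, so re-running the cell cannot corrupt df
def mark_processed(df):
    return df.assign(text=df["text"] + " (processed)")

df_marked = mark_processed(df)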


Let's open a new Colab


OpenRouter

Now let's see what we can do with OpenRouter

it aggregates multiple LLM providers: 494 models

Let's open a new account on OpenRouter and create an API key

https://openrouter.ai/

https://openrouter.ai/settings/keys


OpenRouter leaderboard

https://openrouter.ai/rankings


OpenRouter models

https://openrouter.ai/models

filter by price: free

there are 57 free models that you can use


Compare model performance

Let's go back to our small dataset

write the inference code for OpenRouter,

parameterized by model (so we can switch models easily)

ask Gemini to create a function that sends a prompt to the OpenRouter API and returns JSON

write a prompt; specify that the output must be JSON

write a function that adds the returned JSON to the original dataframe

pick 2 models! (see the sketch below)
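A sketch of such a function, assuming the key is stored in Colab secrets as OPENROUTER_API_KEY; OpenRouter exposes an OpenAI-compatible endpoint, so the openai client can point at it:

import json
from openai import OpenAI
from google.colab import userdata

or_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY"),
)

def enrich_with(model, text):
    # model is a parameter so we can switch models easily
    response = or_client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Extract the main arguments from this text. "
                              'Answer with JSON only, e.g. {"arguments": ["..."]}\n\n' + text}],
    )
    # may fail if the model wraps the JSON in prose; guard with try/except
    return json.loads(response.choices[0].message.content)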


Evaluate the 2 models

ok, now let's evaluate the 2 models

Most of the time it's very difficult to say that one model is better than another.

On some samples or some prompts one model will perform better than the other,

and sometimes vice versa.

We're going to ask a 3rd model to evaluate the outputs of the 2 models.


LLM as a judge

let's use a model from Groq as the judge

flow:

enrich the initial small dataset with the outputs of the 2 OpenRouter models

build the judge prompt with the original text and both models' outputs, and ask for a structured verdict (see the sketch below)

Then scale up!
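A sketch of the judge call on one row, reusing the Groq client from earlier; the prompt wording and model name are illustrative:

def judge(text, out_a, out_b):
    prompt = (
        "You are an impartial judge. Given a debate excerpt and two sets of "
        "extracted arguments (A and B), decide which is more faithful and "
        'complete. Answer with JSON only, e.g. {"winner": "A", "reason": "..."}\n\n'
        f"Excerpt: {text}\n\nA: {out_a}\n\nB: {out_b}"
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # example judge model on Groq
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content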


Extra : data viz

Ask Gemini for a visualization of the results, for example a bar chart of the judge verdicts (sketch below)
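A minimal sketch, assuming the judge verdicts have been collected into a "winner" column of the sample dataframe (an assumption):

import matplotlib.pyplot as plt

# count how often each model wins according to the judge
wins = sample["winner"].value_counts()
wins.plot(kind="bar", title="Judge verdicts per model")
plt.ylabel("number of wins")
plt.show()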

Continue to investigate the differences in output quality.

Maybe you had something in mind: some expectation of how you wanted the data to be enriched, of what you wanted to see, and it did not come through.

How can you improve the initial data enrichment prompt?


The try fail loop

The iterative cycle of experimenting with models/approaches, analyzing failures, and refining until you find what works.

Why It's Essential:

The Truth: If you're not failing, you're not trying hard enough. The magic happens in iteration 47, not iteration 1! 🎯

1 / 0  # run it: failing is part of the loop