Fast Inference via APIs

What we saw last time

  • from LLMs to Augmented LLMs to Agents

Any questions?

Today

We have used Gemini in Google Colab and LLMs on web platforms (chatgpt.com, claude.ai, etc.)

We’re going to look at platforms for inference: Groq and OpenRouter

The goal is to compare models on a simple task

Use a 3rd LLM to judge the outputs

At the end of this class

You will be able to:

  • use an LLM API to process data
  • mix and match LLMs for different tasks

In the News

What caught your attention this week?

Inference

Inference in LLMs

Inference is the process of running a trained LLM to generate predictions or responses from input prompts - essentially the “thinking” phase where the model processes your request and produces output.

Key Points:

  • What it is: Using a pre-trained model to generate text, not training it
  • Process: Input (prompt) → Model processing → Output (tokens/text)
  • Computational need: Requires significant GPU/TPU resources for each request

In Context of Groq & OpenRouter:

Groq:

  • Specialized inference hardware (LPU - Language Processing Unit)
  • Optimized for extremely fast token generation (up to 500+ tokens/second)
  • Focuses purely on inference speed, not training
  • Achieves low latency through custom silicon designed for LLMs

OpenRouter:

  • Inference routing service that connects to multiple LLM providers
  • Aggregates different inference endpoints (OpenAI, Anthropic, Meta, etc.)
  • Handles load balancing and failover between providers
  • You pay for inference compute across various models through one API

Inference

Bottom line: Inference is “running the model to get answers” - Groq makes it blazingly fast with custom chips, while OpenRouter gives you access to many different models’ inference endpoints through a single interface.

Dataset

  • use the European Parliament dataset
  • or your project dataset

see /Users/alexis/work/ncc1701/eu-scrape/data/debates_small.json for EU dataset

Groq

Open an account on Groq

then go to

https://console.groq.com/

and create an API key

create a new Google Colab notebook

add the API key to the Colab secrets (key icon in the left sidebar)

load the dataset into a pandas dataframe
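A minimal sketch of these setup steps, assuming the secret is named GROQ_API_KEY and debates_small.json has been uploaded to the Colab session (both names are assumptions):

import pandas as pd
from google.colab import userdata  # Colab helper for reading notebook secrets
from groq import Groq

# Read the key from the Secrets panel in the left sidebar; GROQ_API_KEY
# is an assumed name -- use whatever name you saved the secret under
client = Groq(api_key=userdata.get("GROQ_API_KEY"))

# Load the debates into a pandas dataframe and peek at the first rows
df = pd.read_json("debates_small.json")
df.head()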

Enriching data with LLMs

Classic or modern NLP is mostly about:

  • classification: topic detection, sentiment, etc.
  • extraction of existing elements: NER, POS tagging
  • unsupervised analysis: topic modeling

We can use an LLM to extract much richer information from text

Consider a corpus of verbatim debates, or online posts on a given topic, or articles or scientific papers.

Content Analysis & Extraction

  • Argument/claim extraction: identify main arguments, supporting evidence, counterarguments
  • Entity recognition: extract people, organizations, locations, events, concepts
  • Relationship mapping: identify connections between entities, ideas, or actors
  • Quote attribution: match statements to speakers/authors
  • Topic modeling: identify main themes and sub-topics
  • Key points summarization: extract main takeaways per document

Semantic & Linguistic Annotation

  • Sentiment analysis: positive/negative/neutral stance on topics
  • Emotion detection: identify emotional tone (anger, joy, fear, etc.)
  • Stance detection: position on specific issues (pro/con/neutral)
  • Rhetorical device identification: metaphors, analogies, logical fallacies
  • Writing style analysis: formal/informal, technical level, complexity
  • Intent classification: inform, persuade, criticize, question

Structural Organization

  • Temporal extraction: timeline of events, chronological ordering
  • Hierarchical categorization: taxonomies, nested topic structures
  • Section segmentation: break into logical parts (intro, methods, conclusion)
  • Thread reconstruction: link replies, responses, follow-ups
  • Citation network analysis: who references whom, influence mapping

Quality & Meta-Analysis

  • Fact-checking flags: mark claims needing verification
  • Bias detection: political, cultural, or demographic biases
  • Quality scoring: rate argument strength, evidence quality, coherence
  • Contradiction detection: find conflicting statements within/across documents
  • Gap analysis: identify missing topics or underexplored areas
  • Discourse type classification: narrative, expository, argumentative

Knowledge Synthesis

  • Comparative analysis: how different sources treat the same topic
  • Consensus identification: points of agreement across documents
  • Evolution tracking: how arguments/positions change over time
  • Cross-reference generation: link related content across corpus
  • Glossary creation: extract and define domain-specific terms
  • Question generation: create relevant questions the corpus answers

Demo

Let’s pick arguments

We have this dataset of verbatim debates of the European Parliament.

I will connect to a model via Groq

and use it to extract arguments from the debates

and enrich the original dataset.

I work on a small subsample of the whole dataset (45 MB).
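As a starting point, the extraction prompt could look like the sketch below; the "text" column name and the JSON shape are assumptions about the dataset, not the exact prompt used in the demo:

# Hypothetical extraction prompt; adjust the JSON schema to your needs.
# Double braces {{ }} become literal braces after .format()
prompt_template = """Extract the main arguments from this parliamentary speech.
Return ONLY valid JSON of the form:
{{"arguments": [{{"claim": "...", "stance": "pro|con|neutral"}}]}}

Speech:
{speech}
"""

# Try it on a single row first (see Method below)
prompt = prompt_template.format(speech=df.loc[0, "text"])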

API caps

Using an API is limited in two ways

  • frequency

It’s important not to overwhelm the endpoint. Always add a pause in your loops:

import time
time.sleep(1)
  • volume

LLM APIs have limits on the amount of data you can send and retrieve.

Especially for free APIs!

Your allotted amount of free tokens is severely limited, as you will see.
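Beyond a fixed time.sleep, a small retry-with-backoff wrapper keeps a long loop alive when the endpoint returns a rate-limit error. A generic sketch (the wait times are illustrative):

import time

def call_with_retries(fn, max_retries=4):
    """Call fn(), backing off exponentially when it raises."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...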

Method

always work on the smallest subset of data that’s relevant.

start with one or two samples to verify the code works

once the code works, slowly scale up, do not process the whole dataset right away

you never know if some weird row will break your code

Some code best practices

  • DRY: Don’t Repeat Yourself
  • KISS: Keep It Simple, Stupid
  • YAGNI: You Ain’t Gonna Need It
  • SRP: Single Responsibility Principle
  • PIE: Program Intently and Expressively

Idempotency

A very, very important aspect of API calls is idempotency.

An operation that produces the same result no matter how many times you perform it.

Example: The Elevator Button 🛗

  • 1st press: Elevator called ✅
  • 2nd press: Still called (doesn’t call 2 elevators)
  • 50th press: Still just one elevator coming

Result: Always the same - one elevator arrives

Data examples

idempotent:

  • remove all stopwords
  • capitalize words

not idempotent:

  • replace l with ll
  • add a period before each capitalized letter
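A toy illustration of the examples above: removing stopwords is idempotent (a second pass changes nothing), doubling every "l" is not. The stopword list is just for the demo:

STOPWORDS = {"the", "a", "an"}  # toy list, for illustration only

def remove_stopwords(text: str) -> str:
    """Idempotent: stopwords removed once stay removed."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def double_l(text: str) -> str:
    """Not idempotent: every call doubles the l's again."""
    return text.replace("l", "ll")

s = "the llama in a field"
assert remove_stopwords(remove_stopwords(s)) == remove_stopwords(s)
assert double_l(double_l(s)) != double_l(s)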

Code quality

you can ask your coding agent to respect these principles

and make the functions idempotent by not changing the input but by outputting a new object / variable

Let’s open a new Colab

  • save the API key
  • paste the template code in the first cell, so that Gemini knows about it
  • load the dataset into a pandas dataframe
  • ask Gemini to create a function that sends a prompt to the Groq API and returns JSON
  • write a prompt, specify that the output must be JSON
  • write a function that adds the returned JSON to the original dataframe (see the sketch after this list)
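A sketch of those last two functions, reusing the client and prompt_template from the earlier snippets; the model id and the "text" column name are assumptions, and Groq's list of hosted models changes over time:

import json

def extract_arguments(text: str) -> dict:
    """Send one speech to Groq and parse the JSON answer."""
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model id; pick any Groq model
        messages=[{"role": "user", "content": prompt_template.format(speech=text)}],
        response_format={"type": "json_object"},  # ask for strict JSON output
    )
    return json.loads(response.choices[0].message.content)

def enrich(df):
    """Idempotent-friendly: returns a new dataframe, leaves the input untouched."""
    out = df.copy()
    # remember the API caps: add time.sleep or call_with_retries when scaling up
    out["arguments"] = out["text"].apply(extract_arguments)
    return out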

OpenRouter

Now let’s see what we can do with OpenRouter

multiple LLM providers: 494 models

Let’s open a new account on openrouter and create an API key

https://openrouter.ai/

https://openrouter.ai/settings/keys

OpenRouter leaderboard

https://openrouter.ai/rankings

OpenRouter models

https://openrouter.ai/models

filter by

  • input
  • output
  • pricing

there are 57 free models that you can use

Compare model performance

Let’s go back to our small dataset

write the inference code for OpenRouter

with the model as a parameter (so we can switch models easily)

ask Gemini to create a function that sends a prompt to the OpenRouter API and returns JSON

write a prompt, specify that the output must be JSON

write a function that adds the returned JSON to the original dataframe

pick 2 models! (a sketch follows below)
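OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI SDK works with a different base_url. A sketch reusing the prompt_template from before; the secret name and the two model ids are assumptions (pick yours from the models page):

import json
from openai import OpenAI
from google.colab import userdata

or_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=userdata.get("OPENROUTER_API_KEY"),  # assumed secret name
)

def extract_arguments_or(text: str, model: str) -> dict:
    """Same task as the Groq version, with the model as a parameter."""
    response = or_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt_template.format(speech=text)}],
    )
    # not every model honors JSON strictly; json.loads raises if it doesn't
    return json.loads(response.choices[0].message.content)

# one new column per model, so the outputs can be compared side by side
for model in ["meta-llama/llama-3.1-8b-instruct", "mistralai/mistral-7b-instruct"]:
    df[model] = df["text"].apply(lambda t: extract_arguments_or(t, model))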

Evaluate the 2 models

ok now let’s evaluate the 2 models

Most of the time it’s very difficult to say that one model is better than another.

On some samples or some prompts, one model will perform better than the other,

and sometimes vice versa.

we’re going to ask a 3rd model to evaluate the output of the 2 models

LLM as a judge

let’s try to use a model from Groq as a judge

flow:

enrich the initial small dataset with the 2 OpenRouter models

set the judge prompt with

  • the original instructions (extract arguments)
  • ask for a score
  • ask for a justification
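A possible judge prompt following this flow; the field names are assumptions consistent with the earlier sketches:

# fill in with .format(speech=..., output_a=..., output_b=...);
# double braces {{ }} become literal braces in the output
judge_prompt_template = """You are judging two argument-extraction outputs.

Original instructions given to both models:
"Extract the main arguments from this parliamentary speech."

Speech:
{speech}

Output A:
{output_a}

Output B:
{output_b}

Return ONLY JSON: {{"score_a": <1-10>, "score_b": <1-10>, "justification": "..."}}
"""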

Then scale up!

Extra : data viz

Ask Gemini for a visualization of the results (see the sketch below)

Continue to investigate the differences in output quality.

Maybe you had something in mind, some expectation of how you wanted the data to be enriched and what you wanted to see, and that did not come through.

How can you improve the initial data enrichment prompt?
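A minimal sketch of such a visualization, assuming the judge's scores were stored in score_a and score_b columns by the previous step:

import matplotlib.pyplot as plt

# average judge score per model (column names are assumptions)
df[["score_a", "score_b"]].mean().plot(kind="bar", title="Mean judge score per model")
plt.ylabel("judge score (1-10)")
plt.show()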

The try fail loop

The iterative cycle of experimenting with models/approaches, analyzing failures, and refining until you find what works.

Why It’s Essential:

  • Learning through failure - Each failure teaches you what doesn’t work
  • Progressive refinement - Incrementally improve understanding
  • Hidden insights - Failures often reveal data issues, wrong assumptions
  • Reality check - “My fancy model / prompt combo is worse than the baseline”

The Truth: If you’re not failing, you’re not trying hard enough. The magic happens in iteration 47, not iteration 1! 🎯
