NLP

Natural Language Processing

Classic style

What we saw last time

  • APIs
  • The web as a gigantic API
  • the REST architectural style
  • inspecting a website (hack 101)
  • building datasets from APIs

  • Guided practice on the Wikipedia API

Any questions?

Today

  • classic NLP
    • tokens
    • NER: Named Entity Recognition
    • POS: Part of Speech
    • topic modeling
    • text classification: sentiment, hate speech, categorization, spam, etc.
  • some libraries: Spacy.io, NLTK

  • Practice on the NYT API to build a dataset and apply classic NLP techniques

At the end of this class

You

  • understand what NLP is
  • can query the NYT API
  • can apply classic NLP techniques to a dataset using Spacy.io

In the News

What caught your attention this week?

YouTube just dropped over 30 new creator tools at its Made on YouTube event, including AI-powered editing and clipping features, the addition of Veo 3 Fast in Shorts, auto-dubbing, and more.

https://www.heygen.com/

The Rundown: Researchers at Stanford and the Arc Institute just created the first AI-generated, entirely new viruses from scratch that successfully infect and kill bacteria, marking a breakthrough in computational biology.

The details:

Scientists trained an AI model called Evo on 2M viruses, then asked it to design brand new ones — with 16 of 302 attempts proving functional in lab tests.

The AI viruses contained 392 mutations never seen in nature, including successful combos that scientists had previously tried and failed to engineer.

When bacteria developed resistance to natural viruses, AI-designed versions broke through defenses in days where the traditional viruses failed.

One synthetic version incorporated a component from a distantly related virus, something researchers had attempted unsuccessfully to design for years.

AI’s huge competitive coding win…

source: The Neuron daily newsletter

OpenAI’s AI achieved a perfect score at the world’s top programming competition, beating all human teams.

OpenAI’s reasoning models achieved a perfect 12/12 score at the ICPC World Finals, the most prestigious programming competition in the world, outperforming every human team.

To put this in perspective, the best human team solved 11 out of 12 problems. And OpenAI competed under the same 5-hour time limit as human teams. They used an ensemble of general-purpose models, including GPT-5, with no special training for competitive programming. In fact, 11 out of 12 problems were solved on the first try.

Google’s Gemini 2.5 Deep Think also won gold, solving 10 out of 12 problems. This means two different AI systems both outperformed every human team on the planet. And Google’s performance was equally jaw-dropping:

Gemini solved 8 problems in just 45 minutes and cracked one problem that stumped every single human team.

NLP

From Linguistics to NLP

Ferdinand de Saussure, 1916: Cours de linguistique générale

Linguistics is the scientific study of human language structure and theory

NLP (Natural Language Processing) is the computational field focused on building systems that can understand and generate human language.

Other classic refs

1966 Benveniste: Problèmes de linguistique générale

1957 Chomsky: Syntactic Structures

see Wikipedia: Noam Chomsky

The basis of Chomsky’s linguistic theory lies in biolinguistics, the linguistic school that holds that the principles underpinning the structure of language are biologically preset in the human mind and hence genetically inherited. He argues that all humans share the same underlying linguistic structure, irrespective of sociocultural differences.


Speech and Language Processing


Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin

Latest release: August 24, 2025!!

Timeline

nlp history

NLP Timeline

NLP timeline: the early days

Era / Year | Key NLP Developments | Performance on Tasks
1950s–1960s | First MT experiments, e.g., Georgetown-IBM (1954); generative grammar theories (Chomsky, 1957); ELIZA (1966) | Machine translation primitive; NER not yet formalized; dialogue systems purely rule-based
1970s–1980s | SHRDLU (1970); PARRY (1972); rise of expert systems and handcrafted rules | Language understanding limited to constrained domains
Late 1980s–1990s | Adoption of statistical models; NER from news (MUC-7, ~1998); statistical MT replaces rule-based systems | NER F1 ~93%, nearing human (~97%); statistical MT limited in scope and quality
2000s | Widespread statistical MT (e.g., Google Translate from 2006) | MT quality improving but far from human level; NER robust in limited domains

NLP timeline

Era / Year | Key NLP Developments | Performance on Tasks
2010–mid-2010s | Introduction of word embeddings (Word2Vec 2013, GloVe 2014); RNNs & seq2seq models; early neural MT | Embeddings enable semantic similarity; neural MT achieves noticeable improvements
Late 2010s | Transformer architecture (2017); BERT (2018); adoption in search engines by 2020 | NER & QA reach human or super-human levels on benchmarks; MT approaches near-human fluency
2020s (LLM era) | Emergence of GPT-3, ChatGPT, GPT-4, etc.; LLMs dominate the NLP paradigm | Near-human or better performance across the board: translation, NER, summarization, reasoning

Language is … complicated

Will you marry me? A marriage proposal.

Will, You, Mary, Me: a card game proposal.

Will, you marry me: a time traveller spoiling the future.

Will you, Mary me: a cavewoman named Mary, trying to make Will, who has amnesia, remember who he is.

  • Let’s eat, grandpa.
  • Let’s eat grandpa.

Language is … complicated

  • Variable length
  • Wide variety of complexity across languages
    • German: Donaudampfschiffahrtsgesellschaftskapitän (5 “words”)
    • Chinese: 50,000 different characters (2-3k to read a newspaper)
    • Slavic: Different word forms depending on gender, case, tense
  • Encoding: Unicode vs ASCII
  • Unstructured data
  • code switching
  • idioms, Generational lingo, slang

classic NLP problems

  • Text mining
  • NER: Named Entity Recognition: LOC, PER, ORG, etc.
  • POS: Part of Speech tagging: nouns, adjectives, verbs, …
  • Classification: sentiment analysis, spam, hate speech, …
  • Topic identification, topic modeling
  • WSD: Word Sense Disambiguation: bank, fly
  • STT / TTS: speech to text, text to speech

and more difficult tasks such as

  • Automated Translation, summarization, question answering

Classic NLP

  • deterministic (unlike probabilistic LLMs)
  • based on the decomposition of text into identifiable elements: words, grammar roles, entities, etc.
  • applied to sentences, noun phrases, words
  • includes preprocessing methods on the raw text to facilitate processing (see the sketch below)
    • stop words: and, the, of, etc.
    • stemming: universities, universal, universe -> univer (meaning is often lost)
    • lemmatization: run, running, ran -> run (preserves meaning better than stemming)
    • subword tokenization: “unhappiness” → [“un”, “happy”, “ness”]

Requires models and rules that are language specific: Russian or French need different lemmatizers than English.
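A minimal sketch of these preprocessing steps, using NLTK's PorterStemmer for stemming and a Spacy.io English model for stop words and lemmatization (both libraries come up later in this class; the example words and sentence are ours):

# pip install nltk spacy
# python -m spacy download en_core_web_sm
from nltk.stem import PorterStemmer
import spacy

# stemming: crude suffix stripping, meaning is often lost
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["universities", "universal", "universe"]])
# -> roughly ['univers', 'univers', 'univers']

# stop word removal + lemmatization, which need a language-specific model
nlp = spacy.load("en_core_web_sm")
doc = nlp("The runners were running and ran to the universities")
print([tok.lemma_ for tok in doc if not tok.is_stop and not tok.is_punct])
# -> roughly ['runner', 'run', 'run', 'university']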

What’s the unit of text?

We could work with

  • words, syllables, tokens, letters & punctuation
  • Bigrams, n-grams: New York, cul-de-sac, pain au chocolat (see the sketch after this list)
  • noun phrases: groups of words that function as a noun: the big brown dog with spots
  • sentences, paragraphs, tweets, articles, books, comments
  • Corpus: a whole set of texts
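A bigram like “New York” is just a pair of adjacent tokens. A minimal sketch in plain Python (the sentence is an arbitrary example):

tokens = "I had a pain au chocolat in New York".split()

def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 2))  # bigrams, ending with ('New', 'York')
print(ngrams(tokens, 3))  # trigrams, e.g. ('pain', 'au', 'chocolat')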

Need to deal with

  • The vocabulary is large / infinite and fast changing
  • Typos, multiple spellings
  • word forms: plural, declension (home, house), conjugation, etc.

Subword tokenization

What’s the most efficient unit of text?

  • Using words is problematic: very large vocabulary, multiple word forms: plural, declension (home, house), conjugation, etc.
  • Using letters is too fine-grained: sequences get very long and single letters carry little meaning

  • Let’s split words into multiple tokens: subword tokenization
    • unhappiness -> un + happ + in + ess / un + hap + pin + ness
    • running -> runn + ing
    • universities -> uni + vers + iti + es

Benefits of subword tokenization

  • no words are OOV (out of vocabulary): even words a model never saw in training (e.g., “COVID” for models trained before 2020) can still be tokenized
  • captures semantic and morphological meaning better
  • Vocabulary handling: an effectively infinite vocabulary from a finite set of tokens

That’s also why LLMs are very robust to typos and misspellings (see the sketch below).
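You can see subword tokenization in action with OpenAI's tiktoken library (our choice for the sketch; any BPE tokenizer, e.g. Hugging Face tokenizers, shows the same effect):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models

for word in ["unhappiness", "running", "universities", "COVID"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # decode each token id separately
    print(word, "->", pieces)

# rare or unseen words come out as several subword pieces,
# while frequent words often map to a single token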

Bag of words for binary classification

You want to build a model that predicts whether an email is spam or not, or whether a review is positive or negative (sentiment analysis).

To train the model you need to transform the corpus into numbers.

The main classic NLP method for that is tf-idf (term frequency-inverse document frequency).

For each document, we count the frequency of each term (tf) and weight it by the inverse of the number of documents in the corpus that contain that term (idf): words that appear everywhere, like "the", get low weights, while distinctive words get high weights.

This gives us a document-term matrix that we can use to train a model.
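A minimal tf-idf sketch with scikit-learn (our choice of library; any tf-idf implementation behaves the same), on three toy documents:

# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "free money, click now",       # spam-ish
    "meeting moved to tomorrow",   # ham-ish
    "free lunch at the meeting",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))                # one row per document, mostly zeros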

But this approach has multiple problems:

  • OOV: out-of-vocabulary words are not taken into account
  • large vocabulary => huge matrix
  • the matrix is full of zeros (sparse)
  • etc.

It works for easy tasks (spam, sentiment) but fails on more complex tasks.

nlp workflow diagram

tf (term frequency) example

Let’s take an example. Consider the 3 following sentences from the well-known Surfin’ Bird song and count the number of times each word appears in each sentence.

Sentence | about | bird | heard | is | the | word | you
About the bird, the bird, bird bird bird | 1 | 5 | 0 | 0 | 2 | 0 | 0
You heard about the bird | 1 | 1 | 1 | 0 | 1 | 0 | 1
The bird is the word | 0 | 1 | 0 | 1 | 2 | 1 | 0
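These counts are exactly what scikit-learn's CountVectorizer produces (tf-idf then adds the idf weighting on top; scikit-learn is our choice for the sketch):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "About the bird, the bird, bird bird bird",
    "You heard about the bird",
    "The bird is the word",
]

cv = CountVectorizer()
counts = cv.fit_transform(sentences)

print(cv.get_feature_names_out())
# ['about' 'bird' 'heard' 'is' 'the' 'word' 'you']
print(counts.toarray())
# [[1 5 0 0 2 0 0]
#  [1 1 1 0 1 0 1]
#  [0 1 0 1 2 1 0]]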

NER: Named Entity Recognition

All types of predefined entities: location, groups, persons, companies, money, etc.

NER uses

  • Pattern matching: Looks for capital letters, titles (Mr., Dr.), known entity lists
  • Context clues: Words like “works at” suggest an organization follows
  • Statistical models: Trained on labeled data to recognize entity patterns

NER models and rules are language specific
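A minimal NER sketch with Spacy.io (the full setup is shown in the Spacy code section below; the sentence is spaCy's own documentation example):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.label_)
# typically prints: Apple ORG / U.K. GPE / $1 billion MONEY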

ner simpsons diagram

POS: part of speech tagging

Identify the grammatical function of each word: ADJ, NOUN, VERB, etc.

POS uses:

  • Word endings: “-ing” often = verb, “-ly” often = adverb
  • Position rules: Determiners (“the”) come before nouns
  • Context: Same word can be noun or verb (“run” vs “a run”)

POS models and rules are also language specific
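The same Spacy.io pipeline exposes POS tags on every token (a minimal sketch; the coarse tags follow the Universal Dependencies scheme):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog")

for token in doc:
    # token.pos_ is the coarse tag (DET, ADJ, NOUN, VERB, ...),
    # token.tag_ is the fine-grained, language-specific tag
    print(token.text, token.pos_, token.tag_)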

pos shakespeare diagram

Demo classic NLP: NER and POS

https://cloud.google.com/natural-language

Input some text (from the FT, on the impact of AI summaries in Google Search on traffic):

“Like everyone, we have definitely felt the impact of AI Overviews. There is only one direction of travel; not only are AIs getting better, but they’re getting better in an exponential fashion,” said Sean Cornwell, chief executive of Immediate Media, which owns the Radio Times and Good Food brands in the UK.

NER

Classification

  • sentiment scoring
  • categories
  • Moderation

POS: part of speech and dependency tagging

Unfortunately this feature is no longer available in the Google NLP demo.

Tokens and LLMs

  • Context Window = measured in tokens, not words: 1M-token window, 200k-token window, …
  • Pricing Model = cost per token (input + output)
  • Non-English Text = more tokens needed
  • Token Limits = why responses cut off
  • Character-Level Tasks = difficult (LLMs see tokens, not letters)
  • Efficiency Varies = by language, domain, complexity
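A back-of-the-envelope cost sketch in Python (the prices are made-up placeholders, NOT real rates; see the pricing page below for actual numbers):

# hypothetical prices in $ per 1M tokens (placeholders, not real)
PRICE_INPUT_PER_M = 2.50
PRICE_OUTPUT_PER_M = 10.00

def cost_usd(input_tokens, output_tokens):
    # cost = tokens / 1_000_000 * price per million, input and output billed separately
    return (input_tokens / 1e6) * PRICE_INPUT_PER_M \
         + (output_tokens / 1e6) * PRICE_OUTPUT_PER_M

# e.g. a 3,000-token prompt that gets a 500-token answer
print(f"${cost_usd(3_000, 500):.4f}")  # -> $0.0125 with these placeholder prices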

token based pricing

https://openai.com/api/pricing/

openai pricing

NER and POS with spacy.io

There are a few important NLP Python libraries: Spacy.io and NLTK.

Spacy.io supports 75 languages and provides:

  • entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more

=> follow Ines Montani (website), she’s super cool

Spacy code

To use Spacy.io we need to

  • install and import the library
  • then download a model for the language of the corpus
    • each language offers multiple models of varying sizes
    • each model is trained to handle POS, NER and lemmatization
  • once the model is available we instantiate the spacy doc object on the text we want to analyze
  • then we can easily extract entities such as locations or persons with a few simple lines
  • similarly we can identify all the ADJs and NOUNs in a text
# pip install -U spacy
# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

# instantiate the spacy doc object
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Intermission

Forbidden Planet (1956)

Robby the Robot

Practice

Build a dataset using the NYT API

NYT API

Offers free access: developer.nytimes.com/apis

Follow the instructions on developer.nytimes.com/get-started:

  • open an account on the NYT developer website
  • create an app
  • get an API key

nyt api key

An API key is SECRET

Public API keys cost lives and money…
ok… mostly money.

Do NOT publish your API key publicly.

nyt api key secret

see ref

Secret keys in Google Colab

colab secret key

left menu

colab secret key

add key

Load the key in Colab

from google.colab import userdata
NYT_API_KEY = userdata.get('NYT_API_KEY')

NYT API - Practice

https://colab.research.google.com/drive/1PoFhONvZZxcpIG-_XMoN9wKTT1U7KS7X#scrollTo=aJVkXUfIFjqX

Goal:

  • choose a topic, a set of articles
  • build a dataset of articles
  • extract entities, nouns, verbs, adjectives using spacy.io
  • and also save the dataset on your laptop
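A minimal end-to-end sketch, assuming the NYT Article Search endpoint and the field names from its documentation (the query, column names and output file are our own choices):

import requests
import pandas as pd
import spacy
from google.colab import userdata  # only works inside Colab

API_KEY = userdata.get('NYT_API_KEY')

# 1. query the Article Search API on a topic
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
params = {"q": "artificial intelligence", "api-key": API_KEY}
docs = requests.get(url, params=params).json()["response"]["docs"]

# 2. build a dataset of articles
df = pd.DataFrame({
    "headline": [d["headline"]["main"] for d in docs],
    "snippet": [d.get("snippet", "") for d in docs],
})

# 3. extract entities, nouns, verbs, adjectives with spacy.io
nlp = spacy.load("en_core_web_sm")

def extract(text):
    doc = nlp(text)
    return {
        "entities": [(e.text, e.label_) for e in doc.ents],
        "nouns": [t.lemma_ for t in doc if t.pos_ == "NOUN"],
        "verbs": [t.lemma_ for t in doc if t.pos_ == "VERB"],
        "adjectives": [t.lemma_ for t in doc if t.pos_ == "ADJ"],
    }

df = df.join(pd.DataFrame([extract(s) for s in df["snippet"]]))

# 4. save the dataset locally
df.to_csv("nyt_articles.csv", index=False)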

Next time

  • Modern NLP
  • Embeddings
  • RAG - context window

New data source: Andriy Burkov

aiweekly.substack.com/
