NLP

Natural Language Processing

Modern style

---

What we saw last time

Any questions?

Today


At the end of this class

You


In the News

What caught your attention this week?

---

Modern NLP - Embeddings

---

Embeddings

The problem: how to find similar texts?

In a large corpus, how do you find sentences and paragraphs that address similar topics?

Or, put differently: how do you calculate a distance between two texts?


distance

How to calculate a distance between 2 texts so that we can say that

d(_window_, _door_) < d(_window_, _coffee_)

Before 2013: counting words and their relative frequencies. This crude method, called tf-idf, worked well for spam detection but requires lots of text preprocessing (stopwords, lemmatization, ...). It also struggles with out-of-vocabulary (OOV) words, doesn't scale to large vocabularies, is sensitive to typos, etc.
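A minimal sketch of that counting approach, assuming scikit-learn's TfidfVectorizer; the toy corpus is made up for illustration:

```python
# Pre-2013 style: tf-idf vectors + cosine similarity between documents
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The window is next to the door.",
    "Please close the window and the door.",
    "I would like a cup of coffee.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)   # sparse matrix: one row per document

# Documents sharing words (window, door) score closer to each other
print(cosine_similarity(tfidf[0], tfidf[1]))  # expected to be higher
print(cosine_similarity(tfidf[0], tfidf[2]))  # expected near 0: no shared content words
```

Note the weakness: similarity only comes from shared surface words, so synonyms or typos get a score of zero.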


Word2vec 2013

Tomas Mikolov @Google, "Efficient Estimation of Word Representations in Vector Space"

Semantic distance of similar words

Queen - Woman = King - Man
France + capital = Paris
Germany + capital = Berlin

But still no context disambiguation: bank (river) = bank (finance), a play != to play, etc.
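A hedged sketch of that vector arithmetic, assuming gensim and its downloadable "word2vec-google-news-300" vectors (a large, ~1.6 GB download; any pretrained KeyedVectors model would work):

```python
import gensim.downloader as api

# Load pretrained word2vec vectors (one static vector per word)
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# The intuition d(window, door) < d(window, coffee), stated as similarities
print(vectors.similarity("window", "door"))    # expected to be the higher of the two
print(vectors.similarity("window", "coffee"))  # expected to be lower
```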

word2vec

word2vec 2013

Each word is represented by a single high-dimensional vector (typically ~300 dimensions)

word2vec

Vocab

- Embedding
- Vector
- List of numbers
- Numerical representation of text, audio, images, video, etc.

Multi modal embeddings


Cosine similarity

Once you have the embeddings (vectors, series of numbers) of two texts (sentences, words, tokens, ...), their distance is given by the cosine similarity of those embeddings.

word2vec cosine similarity
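A minimal sketch of the computation with numpy; the three toy vectors are made up, real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); 1 = same direction, ~0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" for illustration only
window = np.array([0.9, 0.1, 0.3])
door   = np.array([0.8, 0.2, 0.4])
coffee = np.array([0.1, 0.9, 0.5])

print(cosine_similarity(window, door))    # high: similar direction
print(cosine_similarity(window, coffee))  # lower: different direction
```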

Example Workflow:

  1. Load a pre-trained embedding model.
  2. Pass your sentences through the model.
  3. Get an N-dimensional vector (embedding) for each sentence.
  4. Use these embeddings for your task.
Embedding workflow
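A minimal sketch of this workflow, assuming the sentence-transformers library and its small all-MiniLM-L6-v2 model (any pretrained embedding model would do):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# 1. Load a pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# 2. Pass your sentences through the model
sentences = ["The window is next to the door.", "I would like a cup of coffee."]

# 3. Get an N-dimensional vector per sentence (384 dimensions for this model)
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

# 4. Use these embeddings for your task, e.g. similarity search or clustering
```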

Specialized embeddings

Different companies optimize for different priorities—ease of use, multilingual support, domain expertise, or cost efficiency!

Leading Embedding Model Companies

OpenAI – Versatile, general-purpose embeddings

Cohere – Enterprise-focused, multilingual embeddings

Voyage AI – Domain-specific, customizable embeddings

Jina AI – Multimodal and open-source embeddings

Google (Vertex AI) – Integrated, scalable embeddings


resources


Core Architecture Comparison

| Word2Vec | Modern Methods (BERT/GPT) |
| --- | --- |
| Static embeddings | Contextual embeddings |
| One vector per word | Dynamic vectors per context |
| Same vector for "bank" | Different vectors per meaning |
| ~300 dimensions | 768-4096+ dimensions |
| Position-agnostic | Position encodings |
| Local context window | Full sequence attention |
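To see the "different vectors per meaning" row concretely, here is a hedged sketch assuming Hugging Face transformers and bert-base-uncased; it extracts the contextual vector of the token "bank" from two sentences:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Position of the token "bank" (+1 accounts for the [CLS] token)
    idx = tokenizer.tokenize(sentence).index("bank") + 1
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[idx]

river = bank_vector("She sat by the river bank.")
money = bank_vector("He deposited cash at the bank.")

# Unlike word2vec's single static vector, the two "bank" vectors differ
print(torch.cosine_similarity(river, money, dim=0).item())
```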

Capabilities & Scale

| Word2Vec | Modern Methods (BERT/GPT) |
| --- | --- |
| Word similarity only | Multiple downstream tasks |
| Fixed vocabulary | Subword tokenization |
| No transfer learning | Pre-train + fine-tune |

RAG: Retrieval Augmented Generation

The LLM Context Limitation

Large Language Models have a context window—a limit on how much text they can process at once (typically 32K-200K tokens, or ~25-150 pages).

The Challenge:

How RAG Solves This:

Before RAG:
You ask: "What's our return policy for electronics?"
LLM thinks: "I don't have access to this company's specific policies... I'll make an educated guess based on common practices" ❌

With RAG:

  1. Your question is converted to an embedding
  2. RAG searches through ALL your documents and retrieves only the most relevant 3-5 pages (your return policy docs)
  3. Those specific pages are inserted into the LLM's prompt as context
  4. LLM answers based on YOUR actual policy ✓

RAG acts as a smart search system that finds the needle in the haystack, then gives ONLY that needle to the LLM as context. This way, the LLM always has the right information to answer accurately—without needing to fit your entire knowledge base in its context window.

Bottom Line: RAG = Retrieval (find the right docs) + Augmented (add them to the prompt) + Generation (LLM creates answer from that context)
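A minimal sketch of that retrieve-then-generate loop, assuming sentence-transformers for the embeddings; the three documents are made-up placeholders and the final LLM call is left as a comment, since it depends on whichever model or API you use:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Your knowledge base (placeholder documents)
documents = [
    "Electronics can be returned within 30 days with the original receipt.",
    "Gift cards are non-refundable.",
    "Shipping takes 3 to 5 business days.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

# 1. The question is converted to an embedding
question = "What's our return policy for electronics?"
q_vector = model.encode([question], normalize_embeddings=True)[0]

# 2. Retrieval: cosine similarity (dot product of normalized vectors), keep the best match
scores = doc_vectors @ q_vector
context = documents[int(np.argmax(scores))]

# 3. Augmented generation: the retrieved passage is inserted into the LLM prompt
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# answer = llm.generate(prompt)   # placeholder for whatever LLM you call
print(prompt)
```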


RAG: retrieval augmented generation

RAG diagram

Main Challenges in RAG Systems

Chunking Issues

Retrieval Quality

Version Control & Freshness

Context Limitations

Hallucinations Still Occur

Cost & Latency


Intermission

---

Classic and modern NLP practice

---

On the Wikipedia API
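A minimal sketch of calling Wikipedia from Python, assuming the requests library and the public Wikipedia REST summary endpoint (the notebook below may use a different endpoint or helper library):

```python
import requests

def wikipedia_summary(title: str) -> str:
    """Return the plain-text summary of a page on English Wikipedia."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    response = requests.get(url, headers={"User-Agent": "nlp-class-demo"})
    response.raise_for_status()
    return response.json()["extract"]

print(wikipedia_summary("Natural_language_processing"))
```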


Colab

https://colab.research.google.com/drive/1FT5hdlnj23c85CEYvaIc6nh4xDWu7VSW#scrollTo=cYFFIRy-6Ign


Practice

Load the dataset from the previous class into a pandas dataframe
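A minimal sketch, assuming the previous class's data was exported to a CSV file; the filename is a placeholder, adjust it to wherever you saved your data:

```python
import pandas as pd

# "wikipedia_articles.csv" is a placeholder name for last class's export
df = pd.read_csv("wikipedia_articles.csv")
print(df.shape)
print(df.head())
```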


Next time

New data source: Simon Willison

simonw.substack.com/


Exit ticket

exit ticket
[https://forms.gle/9eE9PUR6mFC2szq47](https://forms.gle/9eE9PUR6mFC2szq47)