Any questions?
spaCy (spacy.io) for NER, POS tagging, and lemmatization
Demo with spacy.io
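A minimal sketch of such a demo, assuming the small English pipeline `en_core_web_sm` has been installed (`python -m spacy download en_core_web_sm`):

```python
import spacy

# Small English pipeline with tagger, lemmatizer, parser, and NER
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Named entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# POS tags and lemmas
for token in doc:
    print(token.text, token.pos_, token.lemma_)
```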
You
What caught your attention this week?
The problem: how to find similar texts?
In a large corpus, how do you find sentences or paragraphs that address similar topics?
Or, put differently, how do you calculate a distance between two texts?
How to calculate a distance between two texts so that we can say:
- “banana” is closer to “apple” than it is to “plane”
- “the dog barks” is closer to “the cat meows” than it is to “the plane takes off”
- d(window, door) < d(window, coffee)
Before 2013: counting words and their relative frequencies. The method, called tf-idf, is crude but worked well for spam detection. It requires a lot of text preprocessing (stopwords, lemmatization, …) and has clear limits: out-of-vocabulary (OOV) words, poor scaling to large vocabularies, sensitivity to typos, etc.
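For reference, a minimal tf-idf sketch with scikit-learn (the toy documents are made up; the exact preprocessing used in class may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the dog barks at the mailman",
    "the cat meows at night",
    "the plane takes off from the runway",
]

# Each document becomes a sparse vector: word counts weighted by how rare each word is
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Pairwise similarity between the three documents
print(cosine_similarity(X))
```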
Tomas Mikolov @Google: “Efficient Estimation of Word Representations in Vector Space” (Word2Vec, 2013)
Semantic distance of similar words
Queen - Woman = King - Man
France + capital = Paris
Germany + capital = Berlin
But still no context disambiguation: “bank” (river) gets the same vector as “bank” (finance), “a play” is not distinguished from “to play”, etc.
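A sketch of these analogies with gensim and a pretrained word-vector set (the model name `glove-wiki-gigaword-50` and the use of `gensim.downloader` are assumptions; any pretrained KeyedVectors exposes the same API):

```python
import gensim.downloader as api

# Downloads a small pretrained GloVe model on first use (model name is an assumption)
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# paris - france + germany ≈ berlin (the "capital of" direction)
print(wv.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```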
Each word is a high-dimensional vector
Once you have embeddings (vectors, i.e. series of numbers) for two texts (sentences, words, tokens, …), their similarity is given by the cosine similarity of their embeddings (and their distance by 1 minus that similarity).
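Cosine similarity is just the normalized dot product of the two vectors; a minimal NumPy version:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embeddings: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" just to show the call
a = np.array([0.1, 0.8, 0.3])
b = np.array([0.2, 0.7, 0.4])
print(cosine_similarity(a, b))
```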
Different companies optimize for different priorities—ease of use, multilingual support, domain expertise, or cost efficiency!
OpenAI – Versatile, general-purpose embeddings
Cohere – Enterprise-focused, multilingual embeddings
Voyage AI – Domain-specific, customizable embeddings
Jina AI – Multimodal and open-source embeddings
Google (Vertex AI) – Integrated, scalable embeddings
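As an illustration, a call to one of these APIs (OpenAI shown; the model name and the `openai` Python client are assumptions, the other providers expose similar endpoints):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.embeddings.create(
    model="text-embedding-3-small",  # model name is an assumption
    input=["the dog barks", "the cat meows"],
)

# One embedding vector (a list of floats) per input text
vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))
```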
| Word2Vec | Modern Methods (BERT/GPT) |
|---|---|
| Static embeddings | Contextual embeddings |
| One vector per word | Dynamic vectors per context |
| Same vector for “bank” | Different vectors per meaning |
| ~300 dimensions | 768-4096+ dimensions |
| Position-agnostic | Position encodings |
| Local context window | Full sequence attention |
| Word2Vec | Modern Methods (BERT/GPT) |
|---|---|
| Word similarity only | Multiple downstream tasks |
| Fixed vocabulary | Subword tokenization |
| No transfer learning | Pre-train + fine-tune |
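A sketch of the “different vectors per meaning” row above, using Hugging Face transformers (the `bert-base-uncased` checkpoint is an assumption; any BERT-style model works):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Contextual embedding of the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("she sat on the bank of the river")
v_money = bank_vector("he deposited cash at the bank")

# Same word, two contexts -> two different vectors (similarity noticeably below 1)
print(torch.nn.functional.cosine_similarity(v_river, v_money, dim=0).item())
```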
The LLM Context Limitation
Large Language Models have a context window—a limit on how much text they can process at once (typically 32K-200K tokens, or ~25-150 pages).
The Challenge: your knowledge base (documentation, internal policies, …) is far larger than any context window, so you cannot simply paste it all into the prompt.
How RAG Solves This:
Before RAG: You ask: “What’s our return policy for electronics?” LLM thinks: “I don’t have access to this company’s specific policies… I’ll make an educated guess based on common practices” ❌
With RAG: the question is first used to retrieve the relevant policy document, that document is added to the prompt, and the LLM answers from the actual text ✅
RAG acts as a smart search system that finds the needle in the haystack, then gives ONLY that needle to the LLM as context. This way, the LLM always has the right information to answer accurately—without needing to fit your entire knowledge base in its context window.
Bottom Line: RAG = Retrieval (find the right docs) + Augmented (add them to the prompt) + Generation (LLM creates answer from that context)
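A minimal, hedged RAG sketch (sentence-transformers for the retrieval step; the model name and the prompt format are assumptions, and the final LLM call is left abstract):

```python
from sentence_transformers import SentenceTransformer, util

# Tiny stand-in for your real knowledge base
docs = [
    "Electronics can be returned within 30 days with the original receipt.",
    "Gift cards are non-refundable.",
    "Shipping is free for orders over 50 euros.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is an assumption
doc_vectors = model.encode(docs, convert_to_tensor=True)

question = "What's our return policy for electronics?"
q_vector = model.encode(question, convert_to_tensor=True)

# Retrieval: pick the document closest to the question by cosine similarity
scores = util.cos_sim(q_vector, doc_vectors)[0]
best_doc = docs[int(scores.argmax())]

# Augmentation: put only the retrieved context into the prompt
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)  # Generation: send this prompt to the LLM of your choice
```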
Chunking Issues
Retrieval Quality
Version Control & Freshness
Context Limitations
Hallucinations Still Occur
Cost & Latency
https://colab.research.google.com/drive/1FT5hdlnj23c85CEYvaIc6nh4xDWu7VSW#scrollTo=cYFFIRy-6Ign
Load the dataset from the previous class into a pandas DataFrame
Extract all adjectives and nouns into a new column
Build a new DataFrame with the sentences (keep a reference to the original DataFrame)
Using the Hugging Face transformers library, build embeddings for each sentence (a starting-point sketch follows below)
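A hedged starting-point sketch for these steps (the file path, column names, and the embedding model are assumptions; adapt them to the actual dataset):

```python
import pandas as pd
import spacy
import torch
from transformers import AutoModel, AutoTokenizer

# 1. Load the dataset from the previous class (path and column name are assumptions)
df = pd.read_csv("dataset_previous_class.csv")

# 2. Extract adjectives and nouns into a new column with spaCy
nlp = spacy.load("en_core_web_sm")
df["adj_nouns"] = df["text"].apply(
    lambda t: [tok.lemma_ for tok in nlp(t) if tok.pos_ in ("ADJ", "NOUN")]
)

# 3. One row per sentence, keeping a reference to the original row
rows = [
    {"orig_index": i, "sentence": sent.text}
    for i, t in df["text"].items()
    for sent in nlp(t).sents
]
sentences_df = pd.DataFrame(rows)

# 4. Sentence embeddings with Hugging Face transformers (mean-pooled last hidden state)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> list[float]:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1)[0].tolist()

sentences_df["embedding"] = sentences_df["sentence"].apply(embed)
```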