December 1, 2023 · 4 min read
By Alexis Perrier
This short post explains how to explore a large proprietary corpus using a RAG strategy and emphasizes the importance of the chunking step of the pipeline.
(no, not the Scott Joplin Ragtime, sorry jazz lovers)
RAG stands for Retrieval-Augmented Generation. It is a technique used to question and explore large collections of documents. RAG is a simple way to leverage LLMs on a proprietary corpus without resorting to more expensive and complex fine-tuning strategies.
The initial preparation step consists of splitting the documents in your corpus into smaller parts called chunks. Chunking breaks large texts down into smaller segments to optimize the relevance of the retrieved content.
Chunking is followed by the embedding phase: computing a vector representation, aka embedding, of each chunk. Each text chunk becomes a vector.
There are multiple ways to chunk a document and compute embeddings.
These embeddings are then stored in a vector database such as Weaviate. A vector database has 2 main roles: 1) storing the text and its related vectors and 2) enabling very fast similarity matching between embeddings.
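What the vector database does can be sketched with a toy in-memory store. The three-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions and come from an embedding model), and the `top_k` helper is hypothetical, not a library API:

```python
import math

# Toy in-memory "vector store": each entry pairs a text chunk with its embedding.
# These vectors are invented for illustration only.
store = [
    ("The cat sat on the mat.",        [0.9, 0.1, 0.0]),
    ("Quarterly revenue grew by 12%.", [0.1, 0.9, 0.2]),
    ("Dogs are loyal companions.",     [0.8, 0.2, 0.1]),
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, k=2):
    """Return the k chunks whose embeddings best match the query embedding."""
    ranked = sorted(store, key=lambda entry: cosine(query_vec, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(top_k([0.85, 0.15, 0.05]))  # the two animal-themed chunks rank highest
```

A real vector database does exactly this matching, but over millions of vectors with approximate nearest-neighbor indexes instead of a full sort.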
In short: chunk, embed, store. Now we can start querying your corpus.
As the name indicates, the RAG pipeline then consists of two steps: retrieval and generation.
Given your query or question, the retrieval step embeds it and fetches the chunks whose embeddings are closest to the query's embedding. The resulting chunks then serve as the information that helps an LLM answer your question through a properly structured prompt.
The prompt has the following structure: role, context and query:
```
Acting as {insert specific role: teacher, analyst, nerd, author, ...}
Use the following information:
{insert resulting text chunks}
to answer this question:
{insert your initial question}
```
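Filling that template is plain string assembly; here is a minimal sketch, where `build_prompt` and its arguments are hypothetical names, not part of any library:

```python
def build_prompt(role, chunks, question):
    """Fill the role / context / query template with the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        f"Acting as {role}\n"
        f"Use the following information:\n{context}\n"
        f"to answer this question:\n{question}"
    )

prompt = build_prompt(
    "an analyst",
    ["Revenue grew 12% in Q3.", "Growth was driven by the EU market."],
    "How did the company perform last quarter?",
)
print(prompt)
```

The resulting string is what gets sent to the LLM in the generation step.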
This is the overall RAG strategy. Whether it works or not will depend on multiple factors.
Chunking can be done at the sentence level, at the paragraph level, or with fixed-size segments (see LangChain's TextSplitters, for instance). Some overlap between chunks is usually added to bridge consecutive chunks. Although the other steps (embedding, LLM) are worthy of attention, finding a good chunking strategy is where the challenge lies if you want relevant answers.
A good chunking strategy must capture both meaning and context. Short chunks will preserve meaning but lack context, while long chunks will tend to smooth out the nuances of each sentence.
"Embedding a sentence focuses on its specific meaning, while embedding a paragraph or document considers overall context and relationships between sentences, potentially resulting in a more comprehensive vector representation but with the caveat of potential noise or dilution in larger input sizes." Thx, ChatGPT!
So when defining a good chunking strategy, keep this in mind: the idea is to align the chunking strategy with the user queries, so that the embedded query correlates closely with the embedded chunks.
Chunking is not a challenge that calls for an AI model; a simple Python script will do. It is a simple task that boils down to splitting large texts into smaller parts. But it is the foundation that lets your RAG system generate quality answers.
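As a sketch of such a script, here is a fixed-size splitter with overlap. The `chunk_text` name and its default sizes are illustrative choices, not a standard:

```python
def chunk_text(text, size=200, overlap=40):
    """Split text into fixed-size character chunks, where consecutive
    chunks share `overlap` characters to preserve context across boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + size])
        if i + size >= len(text):
            break  # last chunk reaches the end of the text
        i += step
    return chunks

# A 500-character text with size=200 and overlap=40 yields 3 chunks.
print(len(chunk_text("x" * 500)))
```

Splitting on sentence or paragraph boundaries instead of raw character counts is the obvious refinement, but even this naive version is enough to get a RAG prototype running.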