What is Retrieval-Augmented Generation (RAG)

Introduction

Retrieval-Augmented Generation (RAG) is a powerful framework that enhances Large Language Models (LLMs) by incorporating external knowledge retrieval. Instead of relying solely on a model’s parametric memory, RAG retrieves relevant documents and conditions the model’s generation on them.

This post provides a mathematical formulation of RAG and explores the top-k retrieval algorithms used in modern implementations.

Mathematical Formulation of RAG

Given an input query ( q ), RAG retrieves a set of top-k documents ( D = {d_1, d_2, …, d_k} ) and generates an output ( y ) based on both the query and retrieved documents.

Step 1: Document Retrieval

The retriever finds the most relevant documents by computing similarity scores between ( q ) and each document ( d_i ):

\[ S(q, d_i) = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\|\mathbf{q}\| \|\mathbf{d}_i\|} \]

where: - ( ) and ( _i ) are vector embeddings of the query and documents. - ( S(q, d_i) ) is the similarity score (cosine similarity, dot product, or Euclidean distance). - The top-k documents are selected based on the highest similarity scores.

The probability of selecting a document ( d_i ) is:

\[ P(d_i | q) = \frac{\exp(S(q, d_i))}{\sum_{j} \exp(S(q, d_j))} \]

Step 2: Generation Model

The retrieved documents ( D ) are concatenated with the query and passed to the LLM:

\[ P(y | q, D) = \prod_{t=1}^{T} P(y_t | y_{<t}, q, D; \theta) \]

where: - ( y_t ) is the ( t )-th token of the output. - ( ) are the LLM’s parameters.

The final probability of generating ( y ) is:

\[ P(y | q) = \sum_{d \in D} P(y | q, d) P(d | q) \]

This formulation marginalizes over retrieved documents, integrating retrieval and generation.

Top-k Retrieval Algorithms

Finding the most relevant top-k documents efficiently is crucial for real-time RAG applications. Several Approximate Nearest Neighbor (ANN) search algorithms are used:

1. Hierarchical Navigable Small World (HNSW)

Graph-based approach where embeddings form a multi-layered network.
Search process: A greedy walk is performed, refining search results locally.
Time complexity: ( O(N) ).
Used in FAISS, Milvus, Weaviate.

2. Inverted File Index (IVF) + Product Quantization (PQ)

Uses k-means clustering to group similar vectors.
Searches only in the nearest cluster, reducing computational cost.
Time complexity: ( O() ).
Used in FAISS.

3. Locality-Sensitive Hashing (LSH)

Uses random projections to hash similar vectors into the same bucket.
Very fast but can suffer from reduced accuracy.
Time complexity: ( O(1) ).
Used in Spotify Annoy, ScaNN.

Choosing the Right Algorithm

Algorithm	Best For	Scalability	Accuracy	Query Time
Brute Force	Small datasets	Poor	High	Slow ( O(N) )
HNSW	Large, dynamic data	Excellent	High	Fast ( O(N) )
IVF + PQ	Compressed storage	Good	Medium	Moderate ( O() )
LSH	Fast lookups	Good	Low-Medium	Very fast ( O(1) )

For real-time RAG, HNSW is preferred for its speed and accuracy, while IVF+PQ is useful for large-scale, compressed storage retrieval.

Conclusion

Retrieval-Augmented Generation (RAG) blends retrieval-based search with LLM-based generation, enabling models to access external knowledge dynamically. Efficient retrieval via ANN search algorithms like HNSW and IVF+PQ is essential for scaling RAG-based systems.

By understanding these underlying principles, we can optimize memory efficiency, retrieval accuracy, and inference speed in RAG pipelines. 🚀

References

Facebook AI. “FAISS: A Library for Efficient Similarity Search.”
Google Research. “ScaNN: Efficient Vector Search at Scale.”
Malkov et al. (2018). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.”