What is Retrieval-Augmented Generation (RAG)?

A structured explanation of Retrieval-Augmented Generation (RAG), its formulation, and the algorithms behind top-k retrieval.
Categories: Machine Learning, NLP, RAG

Author: Ravi Kalia

Published: February 17, 2025

Retrieval-Augmented Generation

Introduction

Retrieval-Augmented Generation (RAG) is a powerful framework that enhances Large Language Models (LLMs) by incorporating external knowledge retrieval. Instead of relying solely on a model’s parametric memory, RAG retrieves relevant documents and conditions the model’s generation on them.

This post provides a mathematical formulation of RAG and explores the top-k retrieval algorithms used in modern implementations.


Mathematical Formulation of RAG

Given an input query \( q \), RAG retrieves a set of top-k documents \( D = \{d_1, d_2, \dots, d_k\} \) and generates an output \( y \) based on both the query and the retrieved documents.

Step 1: Document Retrieval

The retriever finds the most relevant documents by computing similarity scores between \( q \) and each document \( d_i \):

\[ S(q, d_i) = \frac{\mathbf{q} \cdot \mathbf{d}_i}{\|\mathbf{q}\| \|\mathbf{d}_i\|} \]

where:

  • \( \mathbf{q} \) and \( \mathbf{d}_i \) are vector embeddings of the query and the document.
  • \( S(q, d_i) \) is the similarity score; the formula above uses cosine similarity, though dot product or negative Euclidean distance are also common choices.
  • The top-k documents are selected by the highest similarity scores.

The probability of selecting a document \( d_i \) is:

\[ P(d_i | q) = \frac{\exp(S(q, d_i))}{\sum_{j} \exp(S(q, d_j))} \]
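A minimal NumPy sketch of this retrieval step, using random toy vectors in place of embeddings from a real encoder (the dimensions, corpus size, and \( k \) are illustrative):

```python
import numpy as np

# Toy setup: in a real system these embeddings come from a trained encoder.
rng = np.random.default_rng(0)
q = rng.normal(size=384)                  # query embedding
docs = rng.normal(size=(1000, 384))       # document embeddings

# Cosine similarity S(q, d_i) between the query and every document.
scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

# Softmax over the scores gives P(d_i | q).
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Top-k documents by highest similarity.
k = 5
top_k = np.argsort(-scores)[:k]
print(top_k, probs[top_k])
```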


Step 2: Generation Model

The retrieved documents \( D \) are concatenated with the query and passed to the LLM:

\[ P(y | q, D) = \prod_{t=1}^{T} P(y_t | y_{<t}, q, D; \theta) \]

where:

  • \( y_t \) is the \( t \)-th token of the output.
  • \( \theta \) are the LLM's parameters.
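As a rough sketch of how this factorization can be evaluated in practice, the snippet below scores a candidate answer token by token with an off-the-shelf causal LM from Hugging Face transformers; the model name and the plain string concatenation are placeholders, not part of the formulation above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")       # stand-in model for illustration
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_log_prob(context: str, answer: str) -> float:
    """Sum of log P(y_t | y_<t, context) over the answer tokens."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    with torch.no_grad():
        logits = lm(input_ids).logits
    # Position j of the (shifted) logits predicts the token at position j + 1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    ans_positions = range(ctx_ids.shape[1] - 1, input_ids.shape[1] - 1)
    total = 0.0
    for pos, tok_id in zip(ans_positions, ans_ids[0]):
        total += log_probs[0, pos, tok_id].item()
    return total

# Example: context is the query concatenated with retrieved documents.
print(sequence_log_prob("Query: ... Documents: ...", "A candidate answer."))
```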

The final probability of generating \( y \) is:

\[ P(y | q) = \sum_{d \in D} P(y | q, d) P(d | q) \]

This formulation marginalizes over retrieved documents, integrating retrieval and generation.
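A toy illustration of this marginalization in log space, assuming the retrieval probabilities and per-document generation log-probabilities have already been computed (the numbers are made up):

```python
import numpy as np

# Assumed outputs of the earlier steps (values are illustrative only).
retrieval_log_probs = np.log(np.array([0.5, 0.3, 0.2]))    # log P(d_i | q)
generation_log_probs = np.array([-12.3, -14.1, -11.8])     # log P(y | q, d_i)

# log P(y | q) = logsumexp_i [ log P(y | q, d_i) + log P(d_i | q) ]
log_p_y_given_q = np.logaddexp.reduce(generation_log_probs + retrieval_log_probs)
print(log_p_y_given_q)
```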


Top-k Retrieval Algorithms

Finding the most relevant top-k documents efficiently is crucial for real-time RAG applications. Several Approximate Nearest Neighbor (ANN) search algorithms are used:

1. Hierarchical Navigable Small World (HNSW)

  • Graph-based approach where embeddings form a multi-layered proximity graph.
  • Search process: a greedy walk descends from the sparse top layer to the dense bottom layer, refining candidates locally at each level.
  • Query time complexity: approximately \( O(\log N) \).
  • Used in FAISS, Milvus, and Weaviate; a short sketch follows below.
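A minimal sketch of top-k retrieval with FAISS's HNSW index (the graph parameters and data sizes are illustrative):

```python
import faiss
import numpy as np

d, k = 384, 5
xb = np.random.rand(10_000, d).astype("float32")   # document embeddings
xq = np.random.rand(1, d).astype("float32")         # query embedding

index = faiss.IndexHNSWFlat(d, 32)    # 32 links per node (the HNSW "M" parameter)
index.hnsw.efSearch = 64              # search-time candidate list size
index.add(xb)

distances, ids = index.search(xq, k)  # top-k neighbors by L2 distance
print(ids)
```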

2. Inverted File Index (IVF) + Product Quantization (PQ)

  • Uses k-means clustering to group similar vectors.
  • Searches only the nearest clusters, reducing computational cost.
  • Time complexity: roughly \( O(\sqrt{N}) \) per query when the number of clusters is on the order of \( \sqrt{N} \).
  • Used in FAISS; a short sketch follows below.
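A corresponding sketch with FAISS's IVF+PQ index (the cluster count, sub-quantizer count, and bit width are illustrative; the sub-quantizer count must divide the embedding dimension):

```python
import faiss
import numpy as np

d, k = 384, 5
xb = np.random.rand(100_000, d).astype("float32")   # document embeddings
xq = np.random.rand(1, d).astype("float32")          # query embedding

nlist, m, nbits = 256, 48, 8               # clusters; sub-quantizers; bits per code
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer over cluster centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)     # k-means clustering + PQ codebook learning
index.add(xb)
index.nprobe = 8    # number of clusters scanned per query

distances, ids = index.search(xq, k)
print(ids)
```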

3. Locality-Sensitive Hashing (LSH)

  • Uses random projections to hash similar vectors into the same bucket.
  • Very fast but can suffer from reduced accuracy.
  • Time complexity: \( O(1) \) per hash lookup.
  • Used in libraries such as Spotify's Annoy and Google's ScaNN; a short sketch follows below.
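To make the bucketing idea concrete, here is a toy random-hyperplane LSH in plain NumPy; the bit count and data sizes are illustrative, and production systems use the libraries above rather than this sketch:

```python
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
d, n, n_bits = 384, 10_000, 8
docs = rng.normal(size=(n, d))
q = docs[42] + 0.1 * rng.normal(size=d)    # a query close to document 42

planes = rng.normal(size=(n_bits, d))      # random hyperplanes

def bucket_key(x):
    """Sign pattern of the projections: nearby vectors tend to share it."""
    return ((planes @ x) > 0).tobytes()

# Index: hash every document into its bucket.
buckets = defaultdict(list)
for i, v in enumerate(docs):
    buckets[bucket_key(v)].append(i)

# Query: O(1) bucket lookup, then exact scoring only within that bucket.
candidates = buckets[bucket_key(q)]
scores = docs[candidates] @ q
top_k = [candidates[i] for i in np.argsort(-scores)[:5]]
print(len(candidates), top_k)
```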

Choosing the Right Algorithm

| Algorithm   | Best For            | Scalability | Accuracy   | Query Time                       |
|-------------|---------------------|-------------|------------|----------------------------------|
| Brute Force | Small datasets      | Poor        | High       | Slow, \( O(N) \)                 |
| HNSW        | Large, dynamic data | Excellent   | High       | Fast, \( O(\log N) \)            |
| IVF + PQ    | Compressed storage  | Good        | Medium     | Moderate, \( O(\sqrt{N}) \)      |
| LSH         | Fast lookups        | Good        | Low-Medium | Very fast, \( O(1) \) per lookup |

For real-time RAG, HNSW is generally preferred for its speed and accuracy, while IVF+PQ is useful for large-scale retrieval over compressed storage.


Conclusion

Retrieval-Augmented Generation (RAG) blends retrieval-based search with LLM-based generation, enabling models to access external knowledge dynamically. Efficient retrieval via ANN search algorithms like HNSW and IVF+PQ is essential for scaling RAG-based systems.

By understanding these underlying principles, we can optimize memory efficiency, retrieval accuracy, and inference speed in RAG pipelines. 🚀


References

  • Facebook AI. “FAISS: A Library for Efficient Similarity Search.”
  • Google Research. “ScaNN: Efficient Vector Search at Scale.”
  • Malkov, Y. A., and Yashunin, D. A. (2018). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.”