Introduction
I wanted to clarify my thinking on what RAG is and how it relates to other concepts around large language models (LLMs). I wrote an older blog post on this, but my understanding has iterated since then, and I wanted to write about some different aspects.
Retrieval-Augmented Generation (RAG) is a method that enhances Large Language Models (LLMs) by retrieving relevant knowledge from external sources. This approach improves factual accuracy and reduces hallucinations. In this post, we'll explore RAG step by step, using object-oriented concepts like LLM(), Embedding(), and KNearestNeighborSearch().
Components of RAG
1. LLM() Object
The core large language model that generates text. Example models include GPT-4, LLaMA, and Claude.
2. Embedding() Object
This converts text into dense numerical vectors that capture semantic meaning.
3. KNearestNeighborSearch() Object
This searches for the most relevant stored vectors in a vector database.
4. Vector Databases
A specialized database that stores embeddings and allows for efficient similarity searches. Examples include Pinecone, Weaviate, and FAISS.
5. Data Source
A collection of documents or knowledge that the LLM can retrieve information from. This could be a database, a set of documents, or an API.
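To make the object-oriented framing concrete, here is a minimal sketch of how these components could be expressed as Python classes. The class names mirror the ones above, but the method names (encode, add, search, generate) and the wrapped libraries are illustrative choices of mine, not a fixed API.

from sentence_transformers import SentenceTransformer
import faiss


class Embedding:
    """Wraps a sentence-transformer model that maps text to dense vectors."""
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, texts):
        return self.model.encode(texts)


class KNearestNeighborSearch:
    """Wraps a FAISS index for exact L2 nearest-neighbor search."""
    def __init__(self, dim):
        self.index = faiss.IndexFlatL2(dim)

    def add(self, vectors):
        self.index.add(vectors)

    def search(self, query_vector, k=2):
        distances, indices = self.index.search(query_vector, k)
        return indices[0]


class LLM:
    """Thin wrapper around whatever chat model generates the final answer."""
    def __init__(self, client):
        self.client = client  # e.g., a LangChain chat model

    def generate(self, prompt):
        return self.client.predict(prompt)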
Step 1: Indexing (Building the Vector Database)
Before performing lookups, we need to store relevant documents in an embedding-based search system.
1.1 Convert Documents to Embeddings
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "RAG enhances LLMs with retrieval.",
    "Vector databases store embeddings for fast search.",
    "Embeddings are dense vector representations of text."
]

# Convert each document into a dense vector
document_vectors = embedding_model.encode(documents)
1.2 Store in a Vector Database
import faiss

# The index dimensionality must match the embedding size
vector_dim = document_vectors.shape[1]
index = faiss.IndexFlatL2(vector_dim)
index.add(document_vectors)
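If the index should survive process restarts, FAISS can also serialize it to disk with write_index and load it back with read_index. A minimal sketch; the file name docs.index is just an illustrative choice:

faiss.write_index(index, "docs.index")   # persist the index to disk
index = faiss.read_index("docs.index")   # reload it later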
Step 2: Querying (Retrieval and Generation)
When a user asks a question, RAG retrieves relevant documents and passes them to the LLM.
2.1 Convert Query to an Embedding
= "How does RAG improve LLMs?"
query = embedding_model.encode([query]) query_vector
2.2 Retrieve Similar Documents
k = 2  # Retrieve top 2 results

# FAISS returns the distances and positions of the nearest stored vectors
distances, indices = index.search(query_vector, k)
retrieved_docs = [documents[i] for i in indices[0]]
2.3 Generate a Response with Context
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model_name="gpt-4")

# Concatenate the retrieved documents into a single context block
context = "\n".join(retrieved_docs)

prompt = PromptTemplate.from_template(
    """
    Use the following retrieved documents to answer the user's question:

    {context}

    Question: {query}
    """
)

response = llm.predict(prompt.format(context=context, query=query))
print(response)
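Putting both steps together, the whole retrieve-then-generate flow fits in one small function. This is a sketch that reuses the objects defined above (embedding_model, index, documents, llm, prompt); the function name rag_answer is just an illustrative choice.

def rag_answer(query, k=2):
    # Retrieve: embed the query and look up the k nearest documents
    query_vector = embedding_model.encode([query])
    _, indices = index.search(query_vector, k)
    retrieved_docs = [documents[i] for i in indices[0]]

    # Generate: pass the retrieved context to the LLM
    context = "\n".join(retrieved_docs)
    return llm.predict(prompt.format(context=context, query=query))

print(rag_answer("How does RAG improve LLMs?"))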
Key Takeaways
- RAG enhances LLMs by grounding responses in relevant data.
- Vector databases enable efficient retrieval of relevant knowledge.
- This pipeline is modular and can be improved with different embeddings, databases, or LLMs (see the sketch below).
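To illustrate that modularity, swapping the embedding model or the LLM only touches the lines where those objects are constructed; the retrieval and prompting code stays the same. A hedged sketch, reusing the imports from the steps above; the specific model names here are examples, not recommendations:

# Swap the embedding model: re-encode and re-index, nothing else changes
embedding_model = SentenceTransformer("all-mpnet-base-v2")
document_vectors = embedding_model.encode(documents)
index = faiss.IndexFlatL2(document_vectors.shape[1])
index.add(document_vectors)

# Swap the LLM: the prompt and retrieval code stay the same
llm = ChatOpenAI(model_name="gpt-3.5-turbo")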
Drop me an email if you’d like to see more advanced variations! 🚀