Introduction
I wanted to clarify my thinking on what RAG is and how it relates to other concepts around large language models (LLMs). I wrote an older blog post on this, but my understanding has iterated since then, and I wanted to write about some different aspects.
Retrieval-Augmented Generation (RAG) is a method that enhances Large Language Models (LLMs) by retrieving relevant knowledge from external sources. This approach improves factual accuracy and reduces hallucinations. In this post, we'll explore RAG step by step, using object-oriented concepts like LLM(), Embedding(), and KNearestNeighborSearch().
Components of RAG
1. LLM() Object
The core large language model that generates text. Example models include GPT-4, LLaMA, and Claude.
2. Embedding() Object
This converts text into dense numerical vectors that capture semantic meaning.
3. KNearestNeighborSearch() Object
This searches for the most relevant stored vectors in a vector database.
4. Vector Databases
A specialized database that stores embeddings and allows for efficient similarity searches. Examples include Pinecone, Weaviate, and FAISS.
5. Data Source
A collection of documents or knowledge that the LLM can retrieve information from. This could be a database, a set of documents, or an API.
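To make the object-oriented framing concrete, here is a minimal sketch of how these components could be expressed as Python classes. The class names mirror the ones above, but the method names (encode, add, search, generate) and the wrapped libraries are illustrative choices of mine, not a fixed API.

from sentence_transformers import SentenceTransformer
import faiss


class Embedding:
    """Wraps a sentence-transformer model that maps text to dense vectors."""
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def encode(self, texts):
        return self.model.encode(texts)


class KNearestNeighborSearch:
    """Wraps a FAISS index for exact L2 nearest-neighbor search."""
    def __init__(self, dim):
        self.index = faiss.IndexFlatL2(dim)

    def add(self, vectors):
        self.index.add(vectors)

    def search(self, query_vector, k=2):
        distances, indices = self.index.search(query_vector, k)
        return indices[0]


class LLM:
    """Thin wrapper around whatever chat model generates the final answer."""
    def __init__(self, client):
        self.client = client  # e.g., a LangChain chat model

    def generate(self, prompt):
        return self.client.predict(prompt)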
Step 1: Indexing (Building the Vector Database)
Before performing lookups, we need to store relevant documents in an embedding-based search system.
1.1 Convert Documents to Embeddings
from sentence_transformers import SentenceTransformer

# Load a small, general-purpose embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "RAG enhances LLMs with retrieval.",
    "Vector databases store embeddings for fast search.",
    "Embeddings are dense vector representations of text."
]

# Convert each document into a dense vector
document_vectors = embedding_model.encode(documents)
1.2 Store in a Vector Database
import faiss

# The index dimensionality must match the embedding size
vector_dim = document_vectors.shape[1]
index = faiss.IndexFlatL2(vector_dim)
index.add(document_vectors)
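If the index should survive process restarts, FAISS can also serialize it to disk with write_index and load it back with read_index. A minimal sketch; the file name docs.index is just an illustrative choice:

faiss.write_index(index, "docs.index")   # persist the index to disk
index = faiss.read_index("docs.index")   # reload it later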
Step 2: Querying (Retrieval and Generation)
When a user asks a question, RAG retrieves relevant documents and passes them to the LLM.
2.1 Convert Query to an Embedding
= "How does RAG improve LLMs?"
query = embedding_model.encode([query]) query_vector
2.2 Retrieve Similar Documents
k = 2  # Retrieve top 2 results

# FAISS returns the distances and positions of the nearest stored vectors
distances, indices = index.search(query_vector, k)
retrieved_docs = [documents[i] for i in indices[0]]
2.3 Generate a Response with Context
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

llm = ChatOpenAI(model_name="gpt-4")

# Concatenate the retrieved documents into a single context block
context = "\n".join(retrieved_docs)

prompt = PromptTemplate.from_template(
    """
    Use the following retrieved documents to answer the user's question:

    {context}

    Question: {query}
    """
)

response = llm.predict(prompt.format(context=context, query=query))
print(response)
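Putting both steps together, the whole retrieve-then-generate flow fits in one small function. This is a sketch that reuses the objects defined above (embedding_model, index, documents, llm, prompt); the function name rag_answer is just an illustrative choice.

def rag_answer(query, k=2):
    # Retrieve: embed the query and look up the k nearest documents
    query_vector = embedding_model.encode([query])
    _, indices = index.search(query_vector, k)
    retrieved_docs = [documents[i] for i in indices[0]]

    # Generate: pass the retrieved context to the LLM
    context = "\n".join(retrieved_docs)
    return llm.predict(prompt.format(context=context, query=query))

print(rag_answer("How does RAG improve LLMs?"))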
Key Takeaways
- RAG enhances LLMs by grounding responses in relevant data.
- Vector databases enable efficient retrieval of relevant knowledge.
- This pipeline is modular and can be improved with different embeddings, databases, or LLMs (see the sketch below).
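To illustrate that modularity, swapping the embedding model or the LLM only touches the lines where those objects are constructed; the retrieval and prompting code stays the same. A hedged sketch, reusing the imports from the steps above; the specific model names here are examples, not recommendations:

# Swap the embedding model: re-encode and re-index, nothing else changes
embedding_model = SentenceTransformer("all-mpnet-base-v2")
document_vectors = embedding_model.encode(documents)
index = faiss.IndexFlatL2(document_vectors.shape[1])
index.add(document_vectors)

# Swap the LLM: the prompt and retrieval code stay the same
llm = ChatOpenAI(model_name="gpt-3.5-turbo")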
Drop me an email if you’d like to see more advanced variations! 🚀