Classifying PDF Documents with RAG and LLMs

In this post, we’ll walk through building a lightweight document classifier for PDFs using LLMs and Retrieval-Augmented Generation (RAG) techniques. The goal is to assign one of three ordinal labels — bad, neutral, good — to documents, based on their contents.

We’ll use FastAPI to serve two endpoints: - /train: Train a classifier on labeled PDFs - /predict: Predict label for a new PDF

Step 1: Extract and Chunk PDF Text from Training Data

Code

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def extract_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(p.extract_text() or '' for p in reader.pages)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

Step 2: Train an Ordinal Classifier

We use sentence-transformers for embeddings and sklearn for classification:

Code

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import pickle

embedder = SentenceTransformer("all-MiniLM-L6-v2")
label_map = {'bad': 0, 'neutral': 1, 'good': 2}
reverse_label_map = {v: k for k, v in label_map.items()}
MODEL_PATH = "classifier.pkl"

def train_model(paths: list[str], labels: list[str]):
    X, y = [], []
    for path, label in zip(paths, labels):
        text = extract_text(path)
        chunks = text_splitter.split_text(text)
        embeddings = embedder.encode(chunks)
        X.extend(embeddings)
        y.extend([label_map[label]] * len(embeddings))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    pickle.dump(clf, open(MODEL_PATH, "wb"))

Step 3: LLM Chunk Classification (Optional)

You can also prompt an LLM for classification:

Code

import openai

def classify_chunk_with_llm(chunk: str) -> str:
    prompt = f"Classify this text as 'bad', 'neutral', or 'good':\n---\n{chunk}\n---\nLabel:"
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['choices'][0]['message']['content'].strip().lower()

Step 4: Inference from a New PDF

Code

import numpy as np

def predict_class(path: str) -> str:
    text = extract_text(path)
    chunks = text_splitter.split_text(text)
    clf = pickle.load(open(MODEL_PATH, "rb"))
    embeddings = embedder.encode(chunks)
    preds = clf.predict(embeddings)
    mean_pred = int(round(np.mean(preds)))
    return reverse_label_map[mean_pred]

Step 5: FastAPI App

Code

from fastapi import FastAPI, UploadFile, File
from tempfile import NamedTemporaryFile
import shutil
from typing import List

app = FastAPI()

@app.post("/train")
async def train(pdfs: List[UploadFile] = File(...), labels: List[str] = File(...)):
    paths = []
    for pdf in pdfs:
        with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            shutil.copyfileobj(pdf.file, tmp)
            paths.append(tmp.name)
    train_model(paths, labels)
    return {"status": "model trained"}

@app.post("/predict")
async def predict(pdf: UploadFile = File(...)):
    with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        shutil.copyfileobj(pdf.file, tmp)
        predicted = predict_class(tmp.name)
    return {"predicted_class": predicted}

Install and Run

pip install fastapi uvicorn PyPDF2 sentence-transformers scikit-learn langchain openai
uvicorn app:app --reload

Summary

This project blends RAG chunking with LLMs and classical ML. You can further improve performance with: - Fine-tuning - Ordinal regression (e.g. mord) - Confidence-based routing (LLM or model)

--- title: "Classifying PDF Documents with RAG and LLMs" author: "Ravi Kalia" date: "2025-04-08" format: html: toc: true code-fold: true code-tools: true theme: cosmo --- In this post, we'll walk through building a lightweight document classifier for PDFs using LLMs and Retrieval-Augmented Generation (RAG) techniques. The goal is to assign one of three ordinal labels — `bad`, `neutral`, `good` — to documents, based on their contents. We'll use FastAPI to serve two endpoints: - `/train`: Train a classifier on labeled PDFs - `/predict`: Predict label for a new PDF ## Step 1: Extract and Chunk PDF Text from Training Data ```{python} from PyPDF2 import PdfReader from langchain.text_splitter import RecursiveCharacterTextSplitter def extract_text(path: str) -> str: reader = PdfReader(path) return "\n".join(p.extract_text() or '' for p in reader.pages) text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200) ``` ## Step 2: Train an Ordinal Classifier We use `sentence-transformers` for embeddings and `sklearn` for classification: ```{python} from sentence_transformers import SentenceTransformer from sklearn.linear_model import LogisticRegression import pickle embedder = SentenceTransformer("all-MiniLM-L6-v2") label_map = {'bad': 0, 'neutral': 1, 'good': 2} reverse_label_map = {v: k for k, v in label_map.items()} MODEL_PATH = "classifier.pkl" def train_model(paths: list[str], labels: list[str]): X, y = [], [] for path, label in zip(paths, labels): text = extract_text(path) chunks = text_splitter.split_text(text) embeddings = embedder.encode(chunks) X.extend(embeddings) y.extend([label_map[label]] * len(embeddings)) clf = LogisticRegression(max_iter=1000).fit(X, y) pickle.dump(clf, open(MODEL_PATH, "wb")) ``` ## Step 3: LLM Chunk Classification (Optional) You can also prompt an LLM for classification: ```{python} import openai def classify_chunk_with_llm(chunk: str) -> str: prompt = f"Classify this text as 'bad', 'neutral', or 'good':\n---\n{chunk}\n---\nLabel:" response = openai.ChatCompletion.create( model="gpt-4", messages=[{"role": "user", "content": prompt}] ) return response['choices'][0]['message']['content'].strip().lower() ``` ## Step 4: Inference from a New PDF ```{python} import numpy as np def predict_class(path: str) -> str: text = extract_text(path) chunks = text_splitter.split_text(text) clf = pickle.load(open(MODEL_PATH, "rb")) embeddings = embedder.encode(chunks) preds = clf.predict(embeddings) mean_pred = int(round(np.mean(preds))) return reverse_label_map[mean_pred] ``` ## Step 5: FastAPI App ```{python} from fastapi import FastAPI, UploadFile, File from tempfile import NamedTemporaryFile import shutil from typing import List app = FastAPI() @app.post("/train") async def train(pdfs: List[UploadFile] = File(...), labels: List[str] = File(...)): paths = [] for pdf in pdfs: with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp: shutil.copyfileobj(pdf.file, tmp) paths.append(tmp.name) train_model(paths, labels) return {"status": "model trained"} @app.post("/predict") async def predict(pdf: UploadFile = File(...)): with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp: shutil.copyfileobj(pdf.file, tmp) predicted = predict_class(tmp.name) return {"predicted_class": predicted} ``` ## Install and Run ```bash pip install fastapi uvicorn PyPDF2 sentence-transformers scikit-learn langchain openai uvicorn app:app --reload ``` ## Summary This project blends RAG chunking with LLMs and classical ML. You can further improve performance with: - Fine-tuning - Ordinal regression (e.g. `mord`) - Confidence-based routing (LLM or model)