Classifying PDF Documents with RAG and LLMs

Author

Ravi Kalia

Published

April 8, 2025

In this post, we’ll walk through building a lightweight document classifier for PDFs using LLMs and Retrieval-Augmented Generation (RAG) techniques. The goal is to assign each document one of three ordinal labels (bad, neutral, or good) based on its contents.

We’ll use FastAPI to serve two endpoints:

- /train: Train a classifier on labeled PDFs
- /predict: Predict label for a new PDF

Step 1: Extract and Chunk PDF Text from Training Data

from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def extract_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(p.extract_text() or '' for p in reader.pages)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
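
To make the chunk_size and chunk_overlap settings concrete, here is a simplified, standard-library sketch of overlapping windows. It is only an illustration: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this version does not attempt. The function name sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Tiny sizes so the overlap is easy to see.
demo = sliding_chunks("abcdefghij", chunk_size=4, overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is how overlap keeps sentences that straddle a boundary visible to both chunks.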

Step 2: Train an Ordinal Classifier

We use sentence-transformers for embeddings and sklearn for classification:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
import pickle

embedder = SentenceTransformer("all-MiniLM-L6-v2")
label_map = {'bad': 0, 'neutral': 1, 'good': 2}
reverse_label_map = {v: k for k, v in label_map.items()}
MODEL_PATH = "classifier.pkl"

def train_model(paths: list[str], labels: list[str]):
    X, y = [], []
    for path, label in zip(paths, labels):
        text = extract_text(path)
        chunks = text_splitter.split_text(text)
        embeddings = embedder.encode(chunks)
        X.extend(embeddings)
        y.extend([label_map[label]] * len(embeddings))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Use a context manager so the file handle is closed after writing.
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(clf, f)
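
Because every chunk inherits its document's label, long documents contribute more training rows than short ones. The sketch below counts rows per label from a list of per-document chunk counts so you can spot imbalance before fitting. The helper name chunk_label_counts and the sample numbers are made up for illustration.

```python
from collections import Counter


def chunk_label_counts(chunk_counts: list[int], labels: list[str]) -> Counter:
    """Count training rows per label when each chunk inherits its document's label."""
    rows = Counter()
    for n_chunks, label in zip(chunk_counts, labels):
        rows[label] += n_chunks
    return rows


# Hypothetical corpus: three documents with 12, 3, and 40 chunks.
counts = chunk_label_counts([12, 3, 40], ["bad", "neutral", "good"])
print(counts)
```

If the counts are badly skewed, one option is scikit-learn's class_weight="balanced" argument to LogisticRegression; whether it helps depends on your data.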

Step 3: LLM Chunk Classification (Optional)

You can also prompt an LLM for classification:

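from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def classify_chunk_with_llm(chunk: str) -> str:
    prompt = f"Classify this text as 'bad', 'neutral', or 'good':\n---\n{chunk}\n---\nLabel:"
    # openai>=1.0 removed openai.ChatCompletion.create in favor of this client call.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip().lower()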
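
An LLM does not always reply with a bare label; it may answer "Label: Good." or add an explanation. A minimal sketch of a post-processing step follows, assuming you want to fall back to a default when no label is found. The names normalize_label and VALID_LABELS, and the choice of "neutral" as the default, are assumptions rather than part of the original code.

```python
import re

VALID_LABELS = ("bad", "neutral", "good")


def normalize_label(reply: str, default: str = "neutral") -> str:
    """Map a free-form LLM reply onto one of the three labels."""
    # Lowercase the reply and keep alphabetic words only, dropping punctuation.
    words = re.findall(r"[a-z]+", reply.lower())
    for word in words:
        if word in VALID_LABELS:
            return word  # the first recognized label wins
    return default


print(normalize_label("Label: Good."))
print(normalize_label("I cannot tell from this excerpt."))
```

This takes the first matching word, so a reply like "not bad" would still map to bad. Constraining the prompt to a one-word answer reduces how often that matters.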

Step 4: Inference from a New PDF

import numpy as np

def predict_class(path: str) -> str:
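    text = extract_text(path)
    chunks = text_splitter.split_text(text)
    if not chunks:
        # Scanned or image-only PDFs can yield no extractable text.
        raise ValueError(f"No text could be extracted from {path}")
    with open(MODEL_PATH, "rb") as f:
        clf = pickle.load(f)
    embeddings = embedder.encode(chunks)
    preds = clf.predict(embeddings)
    # Average the chunk-level labels, then round to the nearest class index.
    mean_pred = int(round(np.mean(preds)))
    return reverse_label_map[mean_pred]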
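
The rounded mean treats the labels as points on a scale, so a document whose chunks split between bad and good can come out as neutral. The sketch below reproduces that aggregation in plain Python and compares it with a majority vote, which is an alternative and not what the post's predict_class does. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean

reverse_label_map = {0: "bad", 1: "neutral", 2: "good"}


def aggregate_mean(preds: list[int]) -> str:
    """Round the mean chunk label, mirroring the approach in predict_class."""
    return reverse_label_map[int(round(mean(preds)))]


def aggregate_majority(preds: list[int]) -> str:
    """Alternative: pick the most frequent chunk label (ties go to the first seen)."""
    return reverse_label_map[Counter(preds).most_common(1)[0][0]]


preds = [0, 0, 0, 2, 2]  # three 'bad' chunks, two 'good' chunks
print(aggregate_mean(preds))      # mean is 0.8, which rounds to 1: neutral
print(aggregate_majority(preds))  # label 0 is most frequent: bad
```

Note that Python rounds exact halves to the nearest even integer, so a mean of 0.5 maps to bad and a mean of 1.5 maps to good.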

Step 5: FastAPI App

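import os
import shutil
from tempfile import NamedTemporaryFile
from typing import List

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def save_upload(upload: UploadFile) -> str:
    # Copy the upload to a temporary file and close it, so the data is fully
    # flushed to disk before PdfReader opens the file by path.
    with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        shutil.copyfileobj(upload.file, tmp)
    return tmp.name

@app.post("/train")
async def train(pdfs: List[UploadFile] = File(...), labels: List[str] = Form(...)):
    # Labels are plain form fields, so they are declared with Form, not File.
    paths = [save_upload(pdf) for pdf in pdfs]
    try:
        train_model(paths, labels)
    finally:
        # Remove the temporary copies whether or not training succeeded.
        for path in paths:
            os.remove(path)
    return {"status": "model trained"}

@app.post("/predict")
async def predict(pdf: UploadFile = File(...)):
    path = save_upload(pdf)
    try:
        predicted = predict_class(path)
    finally:
        os.remove(path)
    return {"predicted_class": predicted}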
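
The /train handler passes labels straight to train_model, which pairs them with files using zip (silently dropping unmatched items) and looks each one up in label_map (raising KeyError on an unknown name). A small pre-check, sketched below, can turn those cases into readable error messages. The function validate_training_request is hypothetical and would be called before train_model.

```python
VALID_LABELS = {"bad", "neutral", "good"}


def validate_training_request(n_files: int, labels: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the request looks usable."""
    problems = []
    if n_files != len(labels):
        problems.append(f"got {n_files} files but {len(labels)} labels")
    unknown = sorted(set(labels) - VALID_LABELS)
    if unknown:
        problems.append(f"unknown labels: {', '.join(unknown)}")
    # A classifier cannot be fit on a single class.
    if len(set(labels) & VALID_LABELS) < 2:
        problems.append("need at least two distinct labels to fit a classifier")
    return problems


print(validate_training_request(2, ["bad", "good"]))
print(validate_training_request(3, ["bad", "great"]))
```

In the endpoint, a non-empty result could be returned to the client as a 400 response instead of letting training fail partway through.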

Install and Run

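Save the code above as app.py. FastAPI needs python-multipart to parse file uploads and form fields, so install it alongside the other dependencies:

pip install fastapi uvicorn python-multipart PyPDF2 sentence-transformers scikit-learn langchain openai
uvicorn app:app --reload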

Summary

This project blends RAG chunking with LLMs and classical ML. You can further improve performance with:

- Fine-tuning
- Ordinal regression (e.g. mord)
- Confidence-based routing (LLM or model)
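
As a rough sketch of confidence-based routing: keep the classifier's answer when its top class probability is high, and send the chunk to the LLM otherwise. The probabilities could come from the classifier's predict_proba method. The function name route_prediction and the 0.6 threshold are assumptions for illustration, and the threshold would need tuning on held-out data.

```python
def route_prediction(probabilities: list[float], threshold: float = 0.6) -> str:
    """Return 'model' when the top class probability clears the threshold, else 'llm'."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    return "model" if max(probabilities) >= threshold else "llm"


# Hypothetical class probabilities for (bad, neutral, good).
print(route_prediction([0.05, 0.10, 0.85]))  # confident: keep the model's answer
print(route_prediction([0.40, 0.35, 0.25]))  # uncertain: defer to the LLM
```

This keeps LLM calls, and their cost and latency, limited to the cases where the cheap model is unsure.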