In this post, we’ll walk through building a lightweight document classifier for PDFs using LLMs and Retrieval-Augmented Generation (RAG) techniques. The goal is to assign one of three ordinal labels — bad, neutral, good — to documents, based on their contents.
We’ll use FastAPI to serve two endpoints: - /train: Train a classifier on labeled PDFs - /predict: Predict label for a new PDF
Step 1: Extract and Chunk PDF Text from Training Data
Code
from PyPDF2 import PdfReaderfrom langchain.text_splitter import RecursiveCharacterTextSplitterdef extract_text(path: str) ->str: reader = PdfReader(path)return"\n".join(p.extract_text() or''for p in reader.pages)text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
Step 2: Train an Ordinal Classifier
We use sentence-transformers for embeddings and sklearn for classification:
Code
from sentence_transformers import SentenceTransformerfrom sklearn.linear_model import LogisticRegressionimport pickleembedder = SentenceTransformer("all-MiniLM-L6-v2")label_map = {'bad': 0, 'neutral': 1, 'good': 2}reverse_label_map = {v: k for k, v in label_map.items()}MODEL_PATH ="classifier.pkl"def train_model(paths: list[str], labels: list[str]): X, y = [], []for path, label inzip(paths, labels): text = extract_text(path) chunks = text_splitter.split_text(text) embeddings = embedder.encode(chunks) X.extend(embeddings) y.extend([label_map[label]] *len(embeddings)) clf = LogisticRegression(max_iter=1000).fit(X, y) pickle.dump(clf, open(MODEL_PATH, "wb"))
Step 3: LLM Chunk Classification (Optional)
You can also prompt an LLM for classification:
Code
import openaidef classify_chunk_with_llm(chunk: str) ->str: prompt =f"Classify this text as 'bad', 'neutral', or 'good':\n---\n{chunk}\n---\nLabel:" response = openai.ChatCompletion.create( model="gpt-4", messages=[{"role": "user", "content": prompt}] )return response['choices'][0]['message']['content'].strip().lower()
This project blends RAG chunking with LLMs and classical ML. You can further improve performance with: - Fine-tuning - Ordinal regression (e.g. mord) - Confidence-based routing (LLM or model)
Source Code
---title: "Classifying PDF Documents with RAG and LLMs"author: "Ravi Kalia"date: "2025-04-08"format: html: toc: true code-fold: true code-tools: true theme: cosmo---In this post, we'll walk through building a lightweight document classifier for PDFs using LLMs and Retrieval-Augmented Generation (RAG) techniques. The goal is to assign one of three ordinal labels — `bad`, `neutral`, `good` — to documents, based on their contents.We'll use FastAPI to serve two endpoints:- `/train`: Train a classifier on labeled PDFs- `/predict`: Predict label for a new PDF## Step 1: Extract and Chunk PDF Text from Training Data```{python}from PyPDF2 import PdfReaderfrom langchain.text_splitter import RecursiveCharacterTextSplitterdef extract_text(path: str) ->str: reader = PdfReader(path)return"\n".join(p.extract_text() or''for p in reader.pages)text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)```## Step 2: Train an Ordinal ClassifierWe use `sentence-transformers` for embeddings and `sklearn` for classification:```{python}from sentence_transformers import SentenceTransformerfrom sklearn.linear_model import LogisticRegressionimport pickleembedder = SentenceTransformer("all-MiniLM-L6-v2")label_map = {'bad': 0, 'neutral': 1, 'good': 2}reverse_label_map = {v: k for k, v in label_map.items()}MODEL_PATH ="classifier.pkl"def train_model(paths: list[str], labels: list[str]): X, y = [], []for path, label inzip(paths, labels): text = extract_text(path) chunks = text_splitter.split_text(text) embeddings = embedder.encode(chunks) X.extend(embeddings) y.extend([label_map[label]] *len(embeddings)) clf = LogisticRegression(max_iter=1000).fit(X, y) pickle.dump(clf, open(MODEL_PATH, "wb"))```## Step 3: LLM Chunk Classification (Optional)You can also prompt an LLM for classification:```{python}import openaidef classify_chunk_with_llm(chunk: str) ->str: prompt =f"Classify this text as 'bad', 'neutral', or 'good':\n---\n{chunk}\n---\nLabel:" response = openai.ChatCompletion.create( model="gpt-4", messages=[{"role": "user", "content": prompt}] )return response['choices'][0]['message']['content'].strip().lower()```## Step 4: Inference from a New PDF```{python}import numpy as npdef predict_class(path: str) ->str: text = extract_text(path) chunks = text_splitter.split_text(text) clf = pickle.load(open(MODEL_PATH, "rb")) embeddings = embedder.encode(chunks) preds = clf.predict(embeddings) mean_pred =int(round(np.mean(preds)))return reverse_label_map[mean_pred]```## Step 5: FastAPI App```{python}from fastapi import FastAPI, UploadFile, Filefrom tempfile import NamedTemporaryFileimport shutilfrom typing import Listapp = FastAPI()@app.post("/train")asyncdef train(pdfs: List[UploadFile] = File(...), labels: List[str] = File(...)): paths = []for pdf in pdfs:with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp: shutil.copyfileobj(pdf.file, tmp) paths.append(tmp.name) train_model(paths, labels)return {"status": "model trained"}@app.post("/predict")asyncdef predict(pdf: UploadFile = File(...)):with NamedTemporaryFile(delete=False, suffix=".pdf") as tmp: shutil.copyfileobj(pdf.file, tmp) predicted = predict_class(tmp.name)return {"predicted_class": predicted}```## Install and Run```bashpip install fastapi uvicorn PyPDF2 sentence-transformers scikit-learn langchain openaiuvicorn app:app --reload```## SummaryThis project blends RAG chunking with LLMs and classical ML. You can further improve performance with:- Fine-tuning- Ordinal regression (e.g. `mord`)- Confidence-based routing (LLM or model)