Accelerating Transformers with Hugging Face Optimum
machine learning
huggingface
optimization
Author
Ravi Kalia
Published
April 6, 2025
Hugging Face’s optimum library makes it easy to accelerate, quantize, and deploy transformer models on CPUs, GPUs, and inference accelerators. Here’s how to get started.
What is optimum?
Hugging Face optimum is a toolkit for optimizing transformers models using backends like ONNX Runtime, OpenVINO, and TensorRT. You can use it for:
Faster inference via ONNX and hardware acceleration (see the quick sketch after this list)
Smaller models using INT8 or FP16 quantization
Training with optimization-aware tools
Easy deployment to CPUs, GPUs, and custom silicon
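To get a feel for how little code changes, here is a minimal sketch (assuming a recent optimum release; the checkpoint name is just an example): you swap a Transformers model class for its ONNX Runtime counterpart and let export=True convert the checkpoint at load time. Installation and a step-by-step export workflow follow below.

from optimum.onnxruntime import ORTModelForSequenceClassification

# A minimal sketch: the ORT class is a drop-in replacement for
# AutoModelForSequenceClassification; export=True converts the PyTorch
# checkpoint to ONNX at load time.
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    export=True,
)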
Installation
Install with ONNX Runtime support:
pip install optimum[onnxruntime] onnx
If you want to quantize with Intel Neural Compressor, install the corresponding extra (this pulls in optimum-intel alongside neural-compressor):
pip install optimum[neural-compressor]
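Here is a minimal sketch of how that integration is typically wired up, assuming the extra above is installed; the checkpoint name and save directory are placeholders:

from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModelForSequenceClassification

# Post-training dynamic quantization with Intel Neural Compressor
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint
)
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=quantization_config,
    save_directory="inc-quantized-model",
)

Dynamic quantization needs no calibration data, which makes it a convenient first experiment.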
Export a Model to ONNX
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.exporters.onnx import main_export

# Export model using Optimum's CLI function
main_export(
    model_name_or_path="bert-base-uncased",
    output="onnx/bert",
    task="text-classification",
)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
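The warning is expected: bert-base-uncased ships without a fine-tuned classification head, so the classifier weights are freshly initialized and the exported model's predictions are not meaningful until the model is fine-tuned (or you export an already fine-tuned checkpoint instead).

If you prefer the command line, the same export is available through optimum-cli, which is a thin wrapper around main_export (sketch, same arguments as above):

optimum-cli export onnx --model bert-base-uncased --task text-classification onnx/bert/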
Load Exported Model
model = ORTModelForSequenceClassification.from_pretrained("onnx/bert")
tokenizer = AutoTokenizer.from_pretrained("onnx/bert")
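Under the hood the ORT model wraps an ONNX Runtime inference session, so you can also choose an execution provider when loading, for example to run on a GPU. A minimal sketch, assuming onnxruntime-gpu is installed:

# Load the exported model on GPU via the CUDA execution provider
model = ORTModelForSequenceClassification.from_pretrained(
    "onnx/bert",
    provider="CUDAExecutionProvider",
)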
Inference with ONNX Runtime
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained("onnx/bert")
tokenizer = AutoTokenizer.from_pretrained("onnx/bert")

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
pipe("This is amazing!")
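The capability list at the top also mentioned INT8 quantization; with the ONNX Runtime backend that goes through ORTQuantizer. Here is a minimal sketch of dynamic INT8 quantization of the exported model, assuming an AVX512-VNNI-capable CPU (other presets such as avx2 and arm64 exist on AutoQuantizationConfig):

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic INT8 quantization of the exported ONNX model
quantizer = ORTQuantizer.from_pretrained("onnx/bert")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx/bert-quantized", quantization_config=qconfig)

The quantized directory loads just like the original export with ORTModelForSequenceClassification.from_pretrained (depending on your optimum version you may need to point file_name at the model_quantized.onnx file).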
optimum makes it practical to get real-world performance gains without leaving the Hugging Face ecosystem. Whether you’re optimizing for latency, size, or deployment compatibility, it’s a powerful addition to your ML toolbox.