Parameter-Efficient Fine-Tuning with Hugging Face’s peft Library
Author
Ravi Kalia
Published
March 16, 2025
The peft library from Hugging Face enables efficient fine-tuning of large language models by updating only a small subset of parameters. This is essential for working with billion-parameter models on limited compute.
Motivation
Full fine-tuning of large models is expensive in terms of memory and compute. peft provides efficient alternatives by freezing the majority of model parameters and introducing a small number of trainable components.
Overview of Methods
The library implements several parameter-efficient fine-tuning (PEFT) techniques:
Method
Description
LoRA
Injects trainable low-rank matrices into attention and feedforward layers.
Prefix Tuning
Prepends learnable vectors to the hidden states inside each transformer layer.
Prompt Tuning
Adds learned embeddings to the input embedding sequence.
IA³
Scales keys, values, and FFN outputs with learned vectors.
All methods aim to reduce the number of trainable parameters while maintaining or improving downstream task performance.
Architecture and Usage
peft wraps Hugging Face transformers models. A common usage pattern:
trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364
/Users/ravikalia/Code/local-playground/.venv/lib/python3.11/site-packages/peft/tuners/lora/layer.py:1768: UserWarning:
fan_in_fan_out is set to False but the target module is `Conv1D`. Setting fan_in_fan_out to True.
This only trains the LoRA adapters while freezing the rest of the model.
Comparison with Soft Prompting
Soft prompting, also known as prompt tuning, optimizes a set of continuous embeddings that are prepended to the input sequence. These embeddings are learned while the base model remains frozen.
In peft, soft prompts are implemented via PromptTuningConfig:
from peft import PromptTuningConfigconfig = PromptTuningConfig( task_type=TaskType.CAUSAL_LM, num_virtual_tokens=10, tokenizer_name_or_path="gpt2")
The learned prompt embeddings are concatenated with the input embeddings before being passed into the frozen model.
LoRA and prompt tuning can be viewed as operating at different points in the model: • Prompt tuning: Adds trainable embeddings to the input. • LoRA: Adds low-rank parameter updates inside the transformer blocks, modifying key, value, and projection matrices during the forward pass.
Comparison Table
Feature Soft Prompting LoRA Trainable Prompt embeddings Low-rank adapters Model layers Input embeddings Attention and FFN layers Memory usage Extremely low Still efficient, but larger than prompts Implementation Prepends embeddings Modifies linear layers during forward
Object Interactions
peft wraps a pretrained model and modifies its behavior through: • Configuration objects like LoraConfig or PromptTuningConfig • Wrappers inserted by get_peft_model() that modify the model’s forward pass • New trainable parameters (lora_A, lora_B, or prompt_embeddings) added to the model
In LoRA, each modified layer behaves like:
W_eff = W + ΔWΔW = lora_B @ lora_A # trainable
Only lora_A and lora_B are optimized during training.
Installation
To install the library:
pip install peft
Final Notes
The peft library abstracts away much of the complexity of integrating PEFT methods into standard Hugging Face workflows. Whether using prompt tuning for simple adapters or LoRA for deeper changes, the interface remains consistent and efficient.
Let me know if you want to add plots or show how to visualize the number of trainable parameters.