Transformers Library for NLP

Categories: Machine Learning, NLP
Published: April 6, 2025

The Hugging Face transformers library is a powerful toolkit for using pretrained models across NLP tasks like classification, summarization, translation, and more.

This post shows how to:

  • Run inference on common NLP tasks with the pipeline() API
  • Fine-tune a pretrained model for binary classification with the Trainer API

Installation

pip install transformers datasets evaluate
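To confirm the setup, you can print the installed versions (a minimal check; nothing beyond the three packages installed above is assumed):

import transformers
import datasets
import evaluate

# Print installed versions to confirm the environment is ready
print(transformers.__version__)
print(datasets.__version__)
print(evaluate.__version__)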

Inference Examples

#1. Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("I love this library!"))
# [{'label': 'POSITIVE', 'score': 0.999...}]

#2. Text Generation
generator = pipeline("text-generation", model="gpt2")
print(generator("The future of AI is", max_length=30, num_return_sequences=1))

#3. Summarization
summarizer = pipeline("summarization")
text = "Hugging Face’s Transformers library lets you use powerful pre-trained models for a wide range of NLP tasks with minimal setup."
print(summarizer(text))

#4. Translation
translator = pipeline("translation_en_to_fr")
print(translator("I like pizza.", max_length=40))

#5. Named Entity Recognition (NER)
ner = pipeline("ner", aggregation_strategy="simple")  # grouped_entities is deprecated in favor of aggregation_strategy
print(ner("Hugging Face is based in New York City."))

#6. Zero-Shot Classification

classifier = pipeline("zero-shot-classification")
print(classifier("I want to book a flight.", candidate_labels=["travel", "finance", "education"]))
When no model is passed, each pipeline logs a warning that it defaulted to a task-specific checkpoint and that relying on defaults in production is not recommended; it also reports the device it runs on (mps:0 here). In this run the defaults were distilbert/distilbert-base-uncased-finetuned-sst-2-english (sentiment analysis), sshleifer/distilbart-cnn-12-6 (summarization), google-t5/t5-base (translation), dbmdz/bert-large-cased-finetuned-conll03-english (NER), and facebook/bart-large-mnli (zero-shot classification).

Sample outputs:

#1. Sentiment analysis
[{'label': 'POSITIVE', 'score': 0.9998852014541626}]

#2. Text generation
[{'generated_text': 'The future of AI is a messy mess, and in a place where there is so much good we would just give it up completely, it will all'}]

#3. Summarization
[{'summary_text': ' Hugging Face’s Transformers library lets you use powerful pre-trained models for a wide range of NLP tasks with minimal setup . The Transformers library is a library of pre-training models that can be used for a range of tasks with a simple set of tools .'}]

#4. Translation
[{'translation_text': "J'aime la pizza."}]

#5. Named entity recognition
[{'entity_group': 'ORG', 'score': 0.8907569, 'word': 'Hugging Face', 'start': 0, 'end': 12}, {'entity_group': 'LOC', 'score': 0.9991805, 'word': 'New York City', 'start': 25, 'end': 38}]

#6. Zero-shot classification
{'sequence': 'I want to book a flight.', 'labels': ['travel', 'finance', 'education'], 'scores': [0.9946956634521484, 0.003904918907210231, 0.0013994225300848484]}
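As those warnings suggest, production code should pin an explicit model name and revision rather than rely on the task default. A minimal sketch, reusing the checkpoint and revision the sentiment pipeline defaulted to above:

from transformers import pipeline

# Pin the checkpoint and revision explicitly instead of relying on the task default
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    revision="714eb0f",
)
print(classifier("I love this library!"))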

Fine-Tuning a Transformer for Binary Classification

Let’s fine-tune distilbert-base-uncased on the IMDb dataset using the Trainer API.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import evaluate

# Load a small IMDb split
# Note: the raw IMDb train split is not shuffled, so a 1% head slice can be
# dominated by a single class; shuffle before slicing for a meaningful split.
dataset = load_dataset("imdb", split="train[:1%]").train_test_split(test_size=0.2)

# Tokenization
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize(example):
    # Truncate to the model's max length; padding pads each mapped batch to its longest sequence
    return tokenizer(example["text"], truncation=True, padding=True)


tokenized = dataset.map(tokenize, batched=True)

# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Evaluation metric
accuracy = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=1)
    return accuracy.compute(predictions=preds, references=labels)


# Training setup
args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # evaluate at the end of each epoch (formerly `evaluation_strategy`, now deprecated)
    per_device_train_batch_size=4,
    num_train_epochs=3,
    logging_steps=10,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
Transformers warns that the classification head weights ('pre_classifier' and 'classifier') are newly initialized and still need to be trained, which is exactly what trainer.train() does. Training output:

[150/150 00:53, Epoch 3/3]

Epoch    Training Loss    Validation Loss    Accuracy
1        0.001200         0.000826           1.000000
2        0.000400         0.000358           1.000000
3        0.000300         0.000291           1.000000

TrainOutput(global_step=150, training_loss=0.020744994301348924, metrics={'train_runtime': 54.6163, 'train_samples_per_second': 10.986, 'train_steps_per_second': 2.746, 'total_flos': 79480439193600.0, 'train_loss': 0.020744994301348924, 'epoch': 3.0})
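To reuse the fine-tuned model afterwards, one option is to save it and load it back through a pipeline. A minimal sketch (the ./imdb-distilbert directory name is arbitrary):

# Save the fine-tuned weights and tokenizer (directory name is arbitrary)
trainer.save_model("./imdb-distilbert")
tokenizer.save_pretrained("./imdb-distilbert")

# Reload them through a text-classification pipeline and score new text
from transformers import pipeline

clf = pipeline("text-classification", model="./imdb-distilbert")
print(clf("A surprisingly touching film with great performances."))
# Labels appear as LABEL_0 / LABEL_1 unless id2label is configured on the model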

Summary

  • pipeline() is the easiest way to use transformers for inference.
  • For training, Trainer makes fine-tuning on datasets like IMDb accessible with minimal code.
  • All models are backed by Hugging Face’s hub: https://huggingface.co/models

Let me know if you’d like a follow-up that applies the same fine-tuning recipe to text generation or token classification!