Hugging Face Tokenizers Library

nlp · huggingface · tokenization

Author: Ravi Kalia
Published: March 20, 2025

The tokenizers library from Hugging Face is a fast, flexible, and production-grade tool for building custom tokenization pipelines. It’s written in Rust and exposed to Python for speed and usability.

What is Tokenization?

Tokenization is the process of converting text into tokens. These tokens can be:

  • Words: ["I", "love", "NLP"]
  • Subwords: ["I", "lov", "e", "NL", "P"]
  • Characters: ["I", "l", "o", "v", "e"]

Transformer models usually use subword tokenization like Byte-Pair Encoding (BPE), WordPiece, or Unigram for efficiency and generalization.
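
For intuition: a common word usually stays whole, while a rarer word is split into pieces that are in the vocabulary. A minimal sketch, assuming the tokenizers package is installed and the bert-base-uncased tokenizer files can be downloaded:

from tokenizers import Tokenizer

# WordPiece marks word-internal continuation pieces with "##"
tok = Tokenizer.from_pretrained("bert-base-uncased")
print(tok.encode("tokenization").tokens)
# typically ['[CLS]', 'token', '##ization', '[SEP]']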

Why Use tokenizers?

  • Very fast (Rust-powered)
  • Modular and customizable
  • Built-in training support
  • Compatible with 🤗 Transformers
  • Tracks offsets and original text spans

Core Components

Component      Description
Normalizer     Lowercasing, Unicode normalization (e.g. NFD, NFKC), accent stripping
PreTokenizer   Splits text into words/subwords
Model          Learns the subword vocabulary (BPE, WordPiece, etc.)
PostProcessor  Adds special tokens such as [CLS] and [SEP]
Decoder        Reconstructs the original text from tokens
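
Putting the pieces together, a pipeline is just a Tokenizer with each of these slots filled in. A minimal sketch wiring all five components around a WordPiece model; the vocabulary is untrained here and the template ids are placeholders, so this only shows the structure:

from tokenizers import Tokenizer, decoders
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase, NFD, Sequence, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))                     # Model
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])   # Normalizer
tokenizer.pre_tokenizer = Whitespace()                                  # PreTokenizer
tokenizer.post_processor = TemplateProcessing(                          # PostProcessor
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # ids are placeholders until a vocab exists
)
tokenizer.decoder = decoders.WordPiece()                                # Decoder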

Example 1: Load Pretrained Tokenizer

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

output = tokenizer.encode("Hugging Face is creating a tool.")
print("Tokens:", output.tokens)
print("IDs:", output.ids)
print("Offsets:", output.offsets)
Tokens: ['[CLS]', 'hugging', 'face', 'is', 'creating', 'a', 'tool', '.', '[SEP]']
IDs: [101, 17662, 2227, 2003, 4526, 1037, 6994, 1012, 102]
Offsets: [(0, 0), (0, 7), (8, 12), (13, 15), (16, 24), (25, 26), (27, 31), (31, 32), (0, 0)]
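
The offsets are character spans into the original string (special tokens get the placeholder span (0, 0)), so each token can be mapped back to the exact text it covers. A small follow-up using the same output object:

text = "Hugging Face is creating a tool."

# Recover the substring behind each token via its (start, end) offsets
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))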

Example 2: Build with Normalizer and PreTokenizer

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase, NFD, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
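
Each stage can also be run on its own, which is useful for sanity-checking a pipeline before training it. A small check on the components defined above:

# Normalization only: NFD + lowercase + strip accents
print(tokenizer.normalizer.normalize_str("Héllò Wörld"))
# 'hello world'

# Pre-tokenization only: returns (piece, (start, end)) pairs
print(tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# [('Hello', (0, 5)), ('world', (6, 11))]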

Example 3: Train Your Own Tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(["data.txt"], trainer)  # data.txt: a local plain-text training corpus
print(tokenizer.encode("Some unseen text").tokens)



['[UNK]', 'o', 'm', 'e', 'u', 'n', 'se', 'en', 'text']
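
The exact merges depend on whatever is in data.txt, so your tokens will differ. Once trained, the tokenizer can be saved to a single JSON file and reloaded later; this is also the file that Example 5 below expects:

# Persist the trained tokenizer and reload it from disk
tokenizer.save("my-tokenizer.json")

restored = Tokenizer.from_file("my-tokenizer.json")
print(restored.encode("Some unseen text").tokens)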

Example 4: Add Special Tokens

from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)]
)

encoded = tokenizer.encode("Hello world")
print(encoded.tokens)  # wrapped as [CLS] ... [SEP]; the exact subwords depend on the vocabulary trained above
['[CLS]', '[UNK]', 'e', 'l', 'l', 'o', 'wor', 'l', 'd', '[SEP]']
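
The pair template can be exercised by encoding two sequences at once; type_ids then marks which segment each token belongs to. A short follow-up with the same tokenizer (again, the exact subwords depend on the trained vocabulary):

pair = tokenizer.encode("Hello world", "How are you")
print(pair.tokens)    # [CLS] ... [SEP] ... [SEP]
print(pair.type_ids)  # 0 for the first segment, 1 for the second (and its trailing [SEP])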

Example 5: Integration with Transformers

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")

You can then use this wrapper directly with pipeline, Trainer, or your own model-training code.
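
Note that the special token strings are not carried over automatically when loading from a raw tokenizer file, so it is usually worth passing them explicitly. A hedged sketch, assuming the tokenizer trained above was saved as my-tokenizer.json:

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

enc = tok("Some unseen text")
print(enc["input_ids"])
print(tok.convert_ids_to_tokens(enc["input_ids"]))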

Conclusion

The tokenizers library is a powerful tool for building custom tokenization pipelines. It’s fast, flexible, and integrates seamlessly with the Hugging Face ecosystem. Whether you’re working on a small project or a large-scale application, tokenizers can help you efficiently preprocess your text data.