Hugging Face Tokenizers Library

nlp · huggingface · tokenization

Author: Ravi Kalia
Published: March 20, 2025

The tokenizers library from Hugging Face is a fast, flexible, and production-grade tool for building custom tokenization pipelines. It’s written in Rust and exposed to Python for speed and usability.

What is Tokenization?

Tokenization is the process of converting text into tokens. These tokens can be:

  • Words: ["I", "love", "NLP"]
  • Subwords: ["I", "lov", "e", "NL", "P"]
  • Characters: ["I", "l", "o", "v", "e"]

Transformer models usually use subword tokenization like Byte-Pair Encoding (BPE), WordPiece, or Unigram for efficiency and generalization.
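
For intuition: a common word usually stays whole, while a rarer word is split into pieces that are in the vocabulary. A minimal sketch, assuming the tokenizers package is installed and the bert-base-uncased tokenizer files can be downloaded:

from tokenizers import Tokenizer

# WordPiece marks word-internal continuation pieces with "##"
tok = Tokenizer.from_pretrained("bert-base-uncased")
print(tok.encode("tokenization").tokens)
# typically ['[CLS]', 'token', '##ization', '[SEP]']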

Why Use tokenizers?

  • Very fast (Rust-powered)
  • Modular and customizable
  • Built-in training support
  • Compatible with 🤗 Transformers
  • Tracks offsets and original text spans

Core Components

Component      Description
Normalizer     Lowercasing, Unicode normalization (e.g. NFD, NFKC), accent stripping
PreTokenizer   Splits text into words/subwords
Model          Learns the subword vocabulary (BPE, WordPiece, etc.)
PostProcessor  Adds special tokens such as [CLS] and [SEP]
Decoder        Reconstructs the original text from tokens
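
Putting the pieces together, a pipeline is just a Tokenizer with each of these slots filled in. A minimal sketch wiring all five components around a WordPiece model; the vocabulary is untrained here and the template ids are placeholders, so this only shows the structure:

from tokenizers import Tokenizer, decoders
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase, NFD, Sequence, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))                     # Model
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])   # Normalizer
tokenizer.pre_tokenizer = Whitespace()                                  # PreTokenizer
tokenizer.post_processor = TemplateProcessing(                          # PostProcessor
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # ids are placeholders until a vocab exists
)
tokenizer.decoder = decoders.WordPiece()                                # Decoder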

Example 1: Load Pretrained Tokenizer

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

output = tokenizer.encode("Hugging Face is creating a tool.")
print("Tokens:", output.tokens)
print("IDs:", output.ids)
print("Offsets:", output.offsets)
Tokens: ['[CLS]', 'hugging', 'face', 'is', 'creating', 'a', 'tool', '.', '[SEP]']
IDs: [101, 17662, 2227, 2003, 4526, 1037, 6994, 1012, 102]
Offsets: [(0, 0), (0, 7), (8, 12), (13, 15), (16, 24), (25, 26), (27, 31), (31, 32), (0, 0)]
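
The offsets are character spans into the original string (special tokens get the placeholder span (0, 0)), so each token can be mapped back to the exact text it covers. A small follow-up using the same output object:

text = "Hugging Face is creating a tool."

# Recover the substring behind each token via its (start, end) offsets
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(text[start:end]))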

Example 2: Build with Normalizer and PreTokenizer

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase, NFD, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()
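
Each stage can also be run on its own, which is useful for sanity-checking a pipeline before training it. A small check on the components defined above:

# Normalization only: NFD + lowercase + strip accents
print(tokenizer.normalizer.normalize_str("Héllò Wörld"))
# 'hello world'

# Pre-tokenization only: returns (piece, (start, end)) pairs
print(tokenizer.pre_tokenizer.pre_tokenize_str("Hello world"))
# [('Hello', (0, 5)), ('world', (6, 11))]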

Example 3: Train Your Own Tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[SEP]"])
tokenizer.pre_tokenizer = Whitespace()

tokenizer.train(["data.txt"], trainer)  # data.txt: a local plain-text training corpus
print(tokenizer.encode("Some unseen text").tokens)



['[UNK]', 'o', 'm', 'e', 'u', 'n', 'se', 'en', 'text']
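
The exact merges depend on whatever is in data.txt, so your tokens will differ. Once trained, the tokenizer can be saved to a single JSON file and reloaded later; this is also the file that Example 5 below expects:

# Persist the trained tokenizer and reload it from disk
tokenizer.save("my-tokenizer.json")

restored = Tokenizer.from_file("my-tokenizer.json")
print(restored.encode("Some unseen text").tokens)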

Example 4: Add Special Tokens

from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)]
)

encoded = tokenizer.encode("Hello world")
print(encoded.tokens)  # wrapped as [CLS] ... [SEP]; the exact subwords depend on the vocabulary trained above
['[CLS]', '[UNK]', 'e', 'l', 'l', 'o', 'wor', 'l', 'd', '[SEP]']
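
The pair template can be exercised by encoding two sequences at once; type_ids then marks which segment each token belongs to. A short follow-up with the same tokenizer (again, the exact subwords depend on the trained vocabulary):

pair = tokenizer.encode("Hello world", "How are you")
print(pair.tokens)    # [CLS] ... [SEP] ... [SEP]
print(pair.type_ids)  # 0 for the first segment, 1 for the second (and its trailing [SEP])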

Example 5: Integration with Transformers

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="my-tokenizer.json")

You can then use this wrapper directly with pipeline, Trainer, or your own model-training code.
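
Note that the special token strings are not carried over automatically when loading from a raw tokenizer file, so it is usually worth passing them explicitly. A hedged sketch, assuming the tokenizer trained above was saved as my-tokenizer.json:

from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    cls_token="[CLS]",
    sep_token="[SEP]",
)

enc = tok("Some unseen text")
print(enc["input_ids"])
print(tok.convert_ids_to_tokens(enc["input_ids"]))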

Conclusion

The tokenizers library is a powerful tool for building custom tokenization pipelines. It’s fast, flexible, and integrates seamlessly with the Hugging Face ecosystem. Whether you’re working on a small project or a large-scale application, tokenizers can help you efficiently preprocess your text data.