Hugging Face Datasets

A practical guide to loading, creating, and using datasets with the Hugging Face datasets library.
Machine Learning
NLP
PyTorch
Author

Ravi Kalia

Published

April 4, 2025

Introduction

The Hugging Face datasets library is a powerful tool for downloading, processing, and managing large-scale datasets for machine learning, particularly in NLP but also for vision and tabular data. It provides seamless integration with PyTorch and TensorFlow.

Key Features

  • Easy Access to Large Datasets: Thousands of ready-to-use datasets from the Hugging Face Hub, such as IMDB, CIFAR-10, and SQuAD.
  • Streaming for Large Datasets: Memory-efficient loading of massive datasets.
  • Dataset Preprocessing and Transformations: Apply tokenization, filtering, and mapping functions (see the sketch after this list).
  • Multiple File Format Support: Works with CSV, JSON, and Parquet.
  • Dataset Splitting: Simple train-test splitting.
  • Efficient Storage: Uses Apache Arrow for fast data processing.
  • Caching: Avoids redundant downloads and computations.
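As a quick illustration of the preprocessing bullet above, here is a minimal sketch using map() and filter() on a small in-memory dataset; the add_length function and the n_chars column are made up for the example.

Code
from datasets import Dataset

toy = Dataset.from_dict({"text": ["good movie", "bad movie"], "label": [1, 0]})


# map() applies a function to every example and caches the result
def add_length(example):
    example["n_chars"] = len(example["text"])
    return example


toy = toy.map(add_length)

# filter() keeps only the examples for which the predicate returns True
short = toy.filter(lambda ex: ex["n_chars"] < 10)
print(short[0])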

Installation

pip install datasets

Loading and Exploring a Dataset

Code
from datasets import load_dataset

# Load the IMDB sentiment analysis dataset
dataset = load_dataset("imdb")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
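Each split behaves like an indexable, column-aware table. A quick way to peek at individual examples and the feature schema, using the dataset loaded above:

Code
# A single example from the training split
print(dataset["train"][0])

# Column-wise access returns plain Python lists
print(dataset["train"]["label"][:5])

# Feature names and types
print(dataset["train"].features)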

Creating a Dataset from Scratch

Code
from datasets import Dataset

data = {
    "text": ["I love this!", "This is bad!", "Absolutely amazing!", "Not good at all!"],
    "label": [1, 0, 1, 0],
}

dataset = Dataset.from_dict(data)
print(dataset)
Dataset({
    features: ['text', 'label'],
    num_rows: 4
})
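Datasets can also be built from other in-memory structures. A brief sketch, assuming pandas is installed (Dataset.from_list is available in recent versions of the library):

Code
import pandas as pd

from datasets import Dataset

df = pd.DataFrame({"text": ["Great!", "Terrible."], "label": [1, 0]})
dataset_from_df = Dataset.from_pandas(df)

records = [{"text": "Great!", "label": 1}, {"text": "Terrible.", "label": 0}]
dataset_from_records = Dataset.from_list(records)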

Loading a Dataset from a CSV File

Code
from datasets import load_dataset

dataset = load_dataset("csv", data_files="data.csv")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['Name', 'Age', 'Country'],
        num_rows: 5
    })
})
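The same builder pattern covers the other supported file formats; the file names below (data.json, data.parquet, train.csv, test.csv) are placeholders.

Code
from datasets import load_dataset

# JSON Lines file (one JSON object per line)
json_dataset = load_dataset("json", data_files="data.json")

# Parquet file
parquet_dataset = load_dataset("parquet", data_files="data.parquet")

# Multiple files mapped to named splits
multi_split = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})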

Splitting a Dataset

Code
train_test_split = dataset["train"].train_test_split(test_size=0.1)
print(train_test_split)
DatasetDict({
    train: Dataset({
        features: ['Name', 'Age', 'Country'],
        num_rows: 4
    })
    test: Dataset({
        features: ['Name', 'Age', 'Country'],
        num_rows: 1
    })
})
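For reproducible splits, train_test_split accepts a seed argument (and shuffle, which defaults to True). A brief sketch:

Code
# Fixing the seed makes the split reproducible across runs
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
print(split["train"].num_rows, split["test"].num_rows)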

Tokenization and Collation with Transformers

Code
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Dummy data
raw_data = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]


def collate_fn(batch):
    # Tokenize the raw texts and pad to the longest sequence in the batch.
    # Note: this minimal collator returns only the tokenizer outputs; labels are dropped.
    return tokenizer(
        [ex["text"] for ex in batch], padding=True, truncation=True, return_tensors="pt"
    )


loader = DataLoader(raw_data, batch_size=2, collate_fn=collate_fn)

for batch in loader:
    print(batch)
    break
{'input_ids': tensor([[ 101, 7592,  102],
        [ 101, 2088,  102]]), 'attention_mask': tensor([[1, 1, 1],
        [1, 1, 1]])}
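The same idea carries over to a Hugging Face Dataset: tokenize it once with map(), set the output format to PyTorch tensors, and hand it to a DataLoader with a padding collator. A sketch reusing the tokenizer loaded above; the toy texts are made up.

Code
from datasets import Dataset
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

hf_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Tokenize every example; batched=True processes examples in chunks
tokenized = hf_dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True), batched=True
)

# Drop the raw text column and return PyTorch tensors
tokenized = tokenized.remove_columns(["text"])
tokenized.set_format("torch")

# DataCollatorWithPadding pads each batch dynamically to its longest sequence
collator = DataCollatorWithPadding(tokenizer=tokenizer)
loader = DataLoader(tokenized, batch_size=2, collate_fn=collator)

for batch in loader:
    print(batch["input_ids"].shape)
    break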

Streaming Large Datasets

Code
streaming_dataset = load_dataset("imdb", split="train", streaming=True)

for sample in streaming_dataset:
    print(sample)
    break  # Stop after one example
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.', 'label': 0}
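Streaming datasets are iterable rather than indexable, so slicing does not apply; take() and shuffle() operate lazily instead. A brief sketch:

Code
# take(n) lazily yields the first n examples without downloading the full split
for sample in streaming_dataset.take(3):
    print(sample["label"])

# shuffle() approximates a random order using a fixed-size buffer
shuffled = streaming_dataset.shuffle(seed=42, buffer_size=1000)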

Conclusion

The datasets library simplifies working with large-scale datasets, providing efficient loading, transformation, streaming, and caching along with straightforward integration into PyTorch and TensorFlow pipelines. It is particularly useful for NLP, vision, and tabular machine learning workflows.

