Evaluating NLP Models with the Hugging Face evaluate Library

NLP
Hugging Face
Evaluation
Author

Ravi Kalia

Published

March 12, 2025

The Hugging Face evaluate library provides a simple and flexible interface for computing metrics on machine learning predictions. It’s especially well-suited for NLP tasks like classification, summarization, and translation, where standard metrics are critical for reliable benchmarking.

Installation

pip install evaluate

To explore the available metrics:

Code
import evaluate
evaluate.list_evaluation_modules()
['codeparrot/apps_metric',
 'lvwerra/test',
 'angelina-wang/directional_bias_amplification',
 'cpllab/syntaxgym',
 'lvwerra/bary_score',
 'hack/test_metric',
 'yzha/ctc_eval',
 'mfumanelli/geometric_mean',
 'daiyizheng/valid',
 'erntkn/dice_coefficient',
 'mgfrantz/roc_auc_macro',
 'Vlasta/pr_auc',
 'gorkaartola/metric_for_tp_fp_samples',
 'idsedykh/metric',
 'idsedykh/codebleu2',
 'idsedykh/codebleu',
 'idsedykh/megaglue',
 'Vertaix/vendiscore',
 'GMFTBY/dailydialogevaluate',
 'GMFTBY/dailydialog_evaluate',
 'jzm-mailchimp/joshs_second_test_metric',
 'ola13/precision_at_k',
 'yulong-me/yl_metric',
 'abidlabs/mean_iou',
 'abidlabs/mean_iou2',
 'KevinSpaghetti/accuracyk',
 'NimaBoscarino/weat',
 'ronaldahmed/nwentfaithfulness',
 'Viona/infolm',
 'kyokote/my_metric2',
 'kashif/mape',
 'Ochiroo/rouge_mn',
 'leslyarun/fbeta_score',
 'anz2/iliauniiccocrevaluation',
 'zbeloki/m2',
 'xu1998hz/sescore',
 'dvitel/codebleu',
 'NCSOFT/harim_plus',
 'JP-SystemsX/nDCG',
 'sportlosos/sescore',
 'Drunper/metrica_tesi',
 'jpxkqx/peak_signal_to_noise_ratio',
 'jpxkqx/signal_to_reconstruction_error',
 'hpi-dhc/FairEval',
 'lvwerra/accuracy_score',
 'ybelkada/cocoevaluate',
 'harshhpareek/bertscore',
 'posicube/mean_reciprocal_rank',
 'bstrai/classification_report',
 'omidf/squad_precision_recall',
 'Josh98/nl2bash_m',
 'BucketHeadP65/confusion_matrix',
 'BucketHeadP65/roc_curve',
 'yonting/average_precision_score',
 'transZ/test_parascore',
 'transZ/sbert_cosine',
 'hynky/sklearn_proxy',
 'xu1998hz/sescore_english_mt',
 'xu1998hz/sescore_german_mt',
 'xu1998hz/sescore_english_coco',
 'xu1998hz/sescore_english_webnlg',
 'unnati/kendall_tau_distance',
 'Viona/fuzzy_reordering',
 'Viona/kendall_tau',
 'lhy/hamming_loss',
 'lhy/ranking_loss',
 'Muennighoff/code_eval_octopack',
 'yuyijiong/quad_match_score',
 'AlhitawiMohammed22/CER_Hu-Evaluation-Metrics',
 'Yeshwant123/mcc',
 'phonemetransformers/segmentation_scores',
 'sma2023/wil',
 'chanelcolgate/average_precision',
 'ckb/unigram',
 'Felipehonorato/eer',
 'manueldeprada/beer',
 'shunzh/apps_metric',
 'He-Xingwei/sari_metric',
 'langdonholmes/cohen_weighted_kappa',
 'fschlatt/ner_eval',
 'hyperml/balanced_accuracy',
 'brian920128/doc_retrieve_metrics',
 'guydav/restrictedpython_code_eval',
 'k4black/codebleu',
 'Natooz/ece',
 'ingyu/klue_mrc',
 'Vipitis/shadermatch',
 'gabeorlanski/bc_eval',
 'jjkim0807/code_eval',
 'repllabs/mean_reciprocal_rank',
 'repllabs/mean_average_precision',
 'mtc/fragments',
 'DarrenChensformer/eval_keyphrase',
 'kedudzic/charmatch',
 'Vallp/ter',
 'DarrenChensformer/relation_extraction',
 'Ikala-allen/relation_extraction',
 'danieldux/hierarchical_softmax_loss',
 'nlpln/tst',
 'bdsaglam/jer',
 'davebulaval/meaningbert',
 'fnvls/bleu1234',
 'fnvls/bleu_1234',
 'nevikw39/specificity',
 'yqsong/execution_accuracy',
 'shalakasatheesh/squad_v2',
 'arthurvqin/pr_auc',
 'd-matrix/dmx_perplexity',
 'akki2825/accents_unplugged_eval',
 'juliakaczor/accents_unplugged_eval',
 'Vickyage/accents_unplugged_eval',
 'Qui-nn/accents_unplugged_eval',
 'TelEl/accents_unplugged_eval',
 'livvie/accents_unplugged_eval',
 'DaliaCaRo/accents_unplugged_eval',
 'alvinasvk/accents_unplugged_eval',
 'LottieW/accents_unplugged_eval',
 'LuckiestOne/valid_efficiency_score',
 'Fritz02/execution_accuracy',
 'huanghuayu/multiclass_brier_score',
 'jialinsong/apps_metric',
 'DoctorSlimm/bangalore_score',
 'agkphysics/ccc',
 'DoctorSlimm/kaushiks_criteria',
 'CZLC/rouge_raw',
 'bascobasculino/mot-metrics',
 'SEA-AI/mot-metrics',
 'SEA-AI/det-metrics',
 'saicharan2804/my_metric',
 'red1bluelost/evaluate_genericify_cpp',
 'maksymdolgikh/seqeval_with_fbeta',
 'Bekhouche/NED',
 'danieldux/isco_hierarchical_accuracy',
 'ginic/phone_errors',
 'berkatil/map',
 'DarrenChensformer/action_generation',
 'buelfhood/fbeta_score',
 'danasone/ru_errant',
 'helena-balabin/youden_index',
 'SEA-AI/panoptic-quality',
 'SEA-AI/box-metrics',
 'MathewShen/bleu',
 'berkatil/mrr',
 'BridgeAI-Lab/SemF1',
 'SEA-AI/horizon-metrics',
 'maysonma/lingo_judge_metric',
 'dannashao/span_metric',
 'Aye10032/loss_metric',
 'ag2435/my_metric',
 'kilian-group/arxiv_score',
 'bomjin/code_eval_octopack',
 'svenwey/logmetric',
 'bowdbeg/matching_series',
 'BridgeAI-Lab/Sem-nCG',
 'bowdbeg/patch_series',
 'venkatasg/gleu',
 'kbmlcoding/apps_metric',
 'jijihuny/ecqa',
 'prajwall/mse',
 'd-matrix/dmxMetric',
 'dotkaio/competition_math',
 'bowdbeg/docred',
 'Remeris/rouge_ru',
 'jarod0411/aucpr',
 'Ruchin/jaccard_similarity',
 'phucdev/blanc_score',
 'NathanMad/bertscore-with-torch_dtype',
 'cointegrated/blaser_2_0_qe',
 'ahnyeonchan/Alignment-and-Uniformity',
 'Baleegh/Fluency_Score',
 'mdocekal/multi_label_precision_recall_accuracy_fscore',
 'phucdev/vihsd',
 'argmaxinc/detailed-wer',
 'SEA-AI/user-friendly-metrics',
 'hage2000/code_eval_stdio',
 'hage2000/my_metric',
 'Natooz/levenshtein',
 'Khaliq88/execution_accuracy',
 'pico-lm/perplexity',
 'mtzig/cross_entropy_loss',
 'kiracurrie22/precision',
 'openpecha/bleurt',
 'SEA-AI/ref-metrics',
 'Natooz/mse',
 'buelfhood/fbeta_score_2',
 'murinj/hter',
 'nobody4/waf_metric',
 'mdocekal/precision_recall_fscore_accuracy',
 'Glazkov/mars',
 'ncoop57/levenshtein_distance',
 'kaleidophon/almost_stochastic_order',
 'NeuraFusionAI/Arabic-Evaluation',
 'lvwerra/element_count',
 'prb977/cooccurrence_count',
 'NimaBoscarino/pseudo_perplexity',
 'ybelkada/toxicity',
 'ronaldahmed/ccl_win',
 'christopher/tokens_per_byte',
 'lsy641/distinct',
 'grepLeigh/perplexity',
 'Charles95/element_count',
 'Charles95/accuracy',
 'Lucky28/honest']

Quick Example

Code
from evaluate import load

accuracy = load("accuracy")
accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
# {'accuracy': 0.75}
{'accuracy': 0.75}
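
Metric modules can also accumulate predictions incrementally and compute the score once at the end, which is convenient inside an evaluation loop. A minimal sketch using the module's add_batch method (the toy batches are just for illustration):

Code
from evaluate import load

accuracy = load("accuracy")

# Accumulate predictions batch by batch (e.g. while iterating over a dataloader),
# then compute the final score once all batches have been added.
for batch_preds, batch_refs in [([0, 1], [0, 1]), ([1, 0], [0, 0])]:
    accuracy.add_batch(predictions=batch_preds, references=batch_refs)

accuracy.compute()
# {'accuracy': 0.75}  (3 of the 4 toy predictions match)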

NLP Tasks and Typical Metrics

| Task | Metric Examples |
|---|---|
| Text classification | accuracy, f1, recall, precision |
| Sequence labeling | seqeval |
| Translation | bleu, sacrebleu, chrf |
| Summarization | rouge, bertscore, bleurt |
| Text generation | rouge, bertscore, meteor, bleurt |
| Speech recognition | wer, cer |
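
Several of the classification metrics in this table can be computed in one call with evaluate.combine; a minimal sketch reusing the toy predictions from the quick example above:

Code
import evaluate

# Bundle several classification metrics into a single object
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# Returns one dict containing all four scores
clf_metrics.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])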

The text generation metrics below require a few extra dependencies:

pip install nltk rouge_score bert_score

Text Generation Metrics

ROUGE

Measures n-gram overlap with a reference:

* rouge1: unigram overlap
* rouge2: bigram overlap
* rougeL / rougeLsum: longest common subsequence overlap

Code
rouge = evaluate.load("rouge")
rouge.compute(predictions=["The cat sat."], references=["A cat was sitting."])
{'rouge1': 0.28571428571428575,
 'rouge2': 0.0,
 'rougeL': 0.28571428571428575,
 'rougeLsum': 0.28571428571428575}

BLEU

Precision-based n-gram overlap metric, often used in translation:

Code
bleu = evaluate.load("bleu")
bleu.compute(
    predictions=["The cat is on the mat."], references=[["The cat sits on the mat."]]
)
{'bleu': 0.488923022434901,
 'precisions': [0.8571428571428571, 0.6666666666666666, 0.4, 0.25],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 7,
 'reference_length': 7}
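
The task table above also lists sacrebleu, a standardized BLEU variant with its own tokenization; a minimal sketch of the analogous call (same predictions/references shape as the bleu example):

Code
sacrebleu = evaluate.load("sacrebleu")
sacrebleu.compute(
    predictions=["The cat is on the mat."], references=[["The cat sits on the mat."]]
)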

BERTScore

Uses a pretrained transformer model to measure semantic similarity in embedding space. The call below follows the standard evaluate API; note that it requires the bert_score package and downloads a pretrained model on first use, so it may be slow or fail in offline environments.

Code
# NOTE: requires `pip install bert_score`; downloads a pretrained model on first run
bertscore = evaluate.load("bertscore")
bertscore.compute(
    predictions=["The cat sat."],
    references=["A cat was sitting."],
    lang="en",
)

With Hugging Face Pipelines

Code
from transformers import pipeline
summarizer = pipeline("summarization")

summary = summarizer("This is a long text to summarize.")[0]["summary_text"]

rouge.compute(predictions=[summary], references=["Reference summary here."])
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Your max_length is set to 142, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
{'rouge1': 0.03571428571428571,
 'rouge2': 0.0,
 'rougeL': 0.03571428571428571,
 'rougeLsum': 0.03571428571428571}
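
For end-to-end benchmarking, evaluate also ships an evaluator() helper that wires a pipeline, a dataset, and a metric together in one call. A minimal sketch for text classification; the model and dataset names here are only examples (both trigger downloads, and the datasets package is required):

Code
from datasets import load_dataset
from transformers import pipeline
from evaluate import evaluator

# Example sentiment pipeline and a small slice of an example dataset
pipe = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset labels
)
print(results)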

Custom Metrics

Define your own scoring function:

Code
# Define a custom metric as a plain Python function
def my_metric_fn(predictions, references):
    score = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {"custom_accuracy": score}


# Call the custom metric directly on predictions and references
result = my_metric_fn(predictions=[1, 0, 1], references=[1, 1, 1])
print(result)
{'custom_accuracy': 0.6666666666666666}
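
As a quick sanity check, the custom function can be compared against the built-in accuracy module on the same inputs (a sketch reusing my_metric_fn from above):

Code
import evaluate

builtin_accuracy = evaluate.load("accuracy")

preds, refs = [1, 0, 1], [1, 1, 1]
print(my_metric_fn(predictions=preds, references=refs))
print(builtin_accuracy.compute(predictions=preds, references=refs))
# Both report 2/3 correct (~0.667), just under different key names.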

The evaluate library is model-agnostic and pairs well with the rest of the Hugging Face ecosystem. It’s lightweight enough for quick experiments but supports rich comparisons for production or publication.