Evaluating NLP Models with the Hugging Face evaluate Library

NLP
Hugging Face
Evaluation
Author

Ravi Kalia

Published

March 12, 2025

The Hugging Face evaluate library provides a simple and flexible interface for computing metrics on machine learning predictions. It’s especially well-suited for NLP tasks like classification, summarization, and translation, where standard metrics are critical for reliable benchmarking.

Installation

pip install evaluate

To explore the available metrics:

Code
import evaluate
evaluate.list_evaluation_modules()
['codeparrot/apps_metric',
 'lvwerra/test',
 'angelina-wang/directional_bias_amplification',
 'cpllab/syntaxgym',
 'lvwerra/bary_score',
 'hack/test_metric',
 'yzha/ctc_eval',
 'mfumanelli/geometric_mean',
 'daiyizheng/valid',
 'erntkn/dice_coefficient',
 'mgfrantz/roc_auc_macro',
 'Vlasta/pr_auc',
 'gorkaartola/metric_for_tp_fp_samples',
 'idsedykh/metric',
 'idsedykh/codebleu2',
 'idsedykh/codebleu',
 'idsedykh/megaglue',
 'Vertaix/vendiscore',
 'GMFTBY/dailydialogevaluate',
 'GMFTBY/dailydialog_evaluate',
 'jzm-mailchimp/joshs_second_test_metric',
 'ola13/precision_at_k',
 'yulong-me/yl_metric',
 'abidlabs/mean_iou',
 'abidlabs/mean_iou2',
 'KevinSpaghetti/accuracyk',
 'NimaBoscarino/weat',
 'ronaldahmed/nwentfaithfulness',
 'Viona/infolm',
 'kyokote/my_metric2',
 'kashif/mape',
 'Ochiroo/rouge_mn',
 'leslyarun/fbeta_score',
 'anz2/iliauniiccocrevaluation',
 'zbeloki/m2',
 'xu1998hz/sescore',
 'dvitel/codebleu',
 'NCSOFT/harim_plus',
 'JP-SystemsX/nDCG',
 'sportlosos/sescore',
 'Drunper/metrica_tesi',
 'jpxkqx/peak_signal_to_noise_ratio',
 'jpxkqx/signal_to_reconstruction_error',
 'hpi-dhc/FairEval',
 'lvwerra/accuracy_score',
 'ybelkada/cocoevaluate',
 'harshhpareek/bertscore',
 'posicube/mean_reciprocal_rank',
 'bstrai/classification_report',
 'omidf/squad_precision_recall',
 'Josh98/nl2bash_m',
 'BucketHeadP65/confusion_matrix',
 'BucketHeadP65/roc_curve',
 'yonting/average_precision_score',
 'transZ/test_parascore',
 'transZ/sbert_cosine',
 'hynky/sklearn_proxy',
 'xu1998hz/sescore_english_mt',
 'xu1998hz/sescore_german_mt',
 'xu1998hz/sescore_english_coco',
 'xu1998hz/sescore_english_webnlg',
 'unnati/kendall_tau_distance',
 'Viona/fuzzy_reordering',
 'Viona/kendall_tau',
 'lhy/hamming_loss',
 'lhy/ranking_loss',
 'Muennighoff/code_eval_octopack',
 'yuyijiong/quad_match_score',
 'AlhitawiMohammed22/CER_Hu-Evaluation-Metrics',
 'Yeshwant123/mcc',
 'phonemetransformers/segmentation_scores',
 'sma2023/wil',
 'chanelcolgate/average_precision',
 'ckb/unigram',
 'Felipehonorato/eer',
 'manueldeprada/beer',
 'shunzh/apps_metric',
 'He-Xingwei/sari_metric',
 'langdonholmes/cohen_weighted_kappa',
 'fschlatt/ner_eval',
 'hyperml/balanced_accuracy',
 'brian920128/doc_retrieve_metrics',
 'guydav/restrictedpython_code_eval',
 'k4black/codebleu',
 'Natooz/ece',
 'ingyu/klue_mrc',
 'Vipitis/shadermatch',
 'gabeorlanski/bc_eval',
 'jjkim0807/code_eval',
 'repllabs/mean_reciprocal_rank',
 'repllabs/mean_average_precision',
 'mtc/fragments',
 'DarrenChensformer/eval_keyphrase',
 'kedudzic/charmatch',
 'Vallp/ter',
 'DarrenChensformer/relation_extraction',
 'Ikala-allen/relation_extraction',
 'danieldux/hierarchical_softmax_loss',
 'nlpln/tst',
 'bdsaglam/jer',
 'davebulaval/meaningbert',
 'fnvls/bleu1234',
 'fnvls/bleu_1234',
 'nevikw39/specificity',
 'yqsong/execution_accuracy',
 'shalakasatheesh/squad_v2',
 'arthurvqin/pr_auc',
 'd-matrix/dmx_perplexity',
 'akki2825/accents_unplugged_eval',
 'juliakaczor/accents_unplugged_eval',
 'Vickyage/accents_unplugged_eval',
 'Qui-nn/accents_unplugged_eval',
 'TelEl/accents_unplugged_eval',
 'livvie/accents_unplugged_eval',
 'DaliaCaRo/accents_unplugged_eval',
 'alvinasvk/accents_unplugged_eval',
 'LottieW/accents_unplugged_eval',
 'LuckiestOne/valid_efficiency_score',
 'Fritz02/execution_accuracy',
 'huanghuayu/multiclass_brier_score',
 'jialinsong/apps_metric',
 'DoctorSlimm/bangalore_score',
 'agkphysics/ccc',
 'DoctorSlimm/kaushiks_criteria',
 'CZLC/rouge_raw',
 'bascobasculino/mot-metrics',
 'SEA-AI/mot-metrics',
 'SEA-AI/det-metrics',
 'saicharan2804/my_metric',
 'red1bluelost/evaluate_genericify_cpp',
 'maksymdolgikh/seqeval_with_fbeta',
 'Bekhouche/NED',
 'danieldux/isco_hierarchical_accuracy',
 'ginic/phone_errors',
 'berkatil/map',
 'DarrenChensformer/action_generation',
 'buelfhood/fbeta_score',
 'danasone/ru_errant',
 'helena-balabin/youden_index',
 'SEA-AI/panoptic-quality',
 'SEA-AI/box-metrics',
 'MathewShen/bleu',
 'berkatil/mrr',
 'BridgeAI-Lab/SemF1',
 'SEA-AI/horizon-metrics',
 'maysonma/lingo_judge_metric',
 'dannashao/span_metric',
 'Aye10032/loss_metric',
 'ag2435/my_metric',
 'kilian-group/arxiv_score',
 'bomjin/code_eval_octopack',
 'svenwey/logmetric',
 'bowdbeg/matching_series',
 'BridgeAI-Lab/Sem-nCG',
 'bowdbeg/patch_series',
 'venkatasg/gleu',
 'kbmlcoding/apps_metric',
 'jijihuny/ecqa',
 'prajwall/mse',
 'd-matrix/dmxMetric',
 'dotkaio/competition_math',
 'bowdbeg/docred',
 'Remeris/rouge_ru',
 'jarod0411/aucpr',
 'Ruchin/jaccard_similarity',
 'phucdev/blanc_score',
 'NathanMad/bertscore-with-torch_dtype',
 'cointegrated/blaser_2_0_qe',
 'ahnyeonchan/Alignment-and-Uniformity',
 'Baleegh/Fluency_Score',
 'mdocekal/multi_label_precision_recall_accuracy_fscore',
 'phucdev/vihsd',
 'argmaxinc/detailed-wer',
 'SEA-AI/user-friendly-metrics',
 'hage2000/code_eval_stdio',
 'hage2000/my_metric',
 'Natooz/levenshtein',
 'Khaliq88/execution_accuracy',
 'pico-lm/perplexity',
 'mtzig/cross_entropy_loss',
 'kiracurrie22/precision',
 'openpecha/bleurt',
 'SEA-AI/ref-metrics',
 'Natooz/mse',
 'buelfhood/fbeta_score_2',
 'murinj/hter',
 'nobody4/waf_metric',
 'mdocekal/precision_recall_fscore_accuracy',
 'Glazkov/mars',
 'ncoop57/levenshtein_distance',
 'kaleidophon/almost_stochastic_order',
 'NeuraFusionAI/Arabic-Evaluation',
 'lvwerra/element_count',
 'prb977/cooccurrence_count',
 'NimaBoscarino/pseudo_perplexity',
 'ybelkada/toxicity',
 'ronaldahmed/ccl_win',
 'christopher/tokens_per_byte',
 'lsy641/distinct',
 'grepLeigh/perplexity',
 'Charles95/element_count',
 'Charles95/accuracy',
 'Lucky28/honest']

Quick Example

Code
from evaluate import load

accuracy = load("accuracy")
accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
# {'accuracy': 0.75}
{'accuracy': 0.75}
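
Metric modules can also accumulate predictions incrementally and compute the score once at the end, which is convenient inside an evaluation loop. A minimal sketch using the module's add_batch method (the toy batches are just for illustration):

Code
from evaluate import load

accuracy = load("accuracy")

# Accumulate predictions batch by batch (e.g. while iterating over a dataloader),
# then compute the final score once all batches have been added.
for batch_preds, batch_refs in [([0, 1], [0, 1]), ([1, 0], [0, 0])]:
    accuracy.add_batch(predictions=batch_preds, references=batch_refs)

accuracy.compute()
# {'accuracy': 0.75}  (3 of the 4 toy predictions match)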

NLP Tasks and Typical Metrics

| Task | Metric Examples |
|---|---|
| Text classification | accuracy, f1, recall, precision |
| Sequence labeling | seqeval |
| Translation | bleu, sacrebleu, chrf |
| Summarization | rouge, bertscore, bleurt |
| Text generation | rouge, bertscore, meteor, bleurt |
| Speech recognition | wer, cer |
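
Several of the classification metrics in this table can be computed in one call with evaluate.combine; a minimal sketch reusing the toy predictions from the quick example above:

Code
import evaluate

# Bundle several classification metrics into a single object
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

# Returns one dict containing all four scores
clf_metrics.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])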

The text generation metrics below require a few extra dependencies:

pip install nltk rouge_score bert_score

Text Generation Metrics

ROUGE

Measures n-gram overlap with a reference:

* rouge1: unigram overlap
* rouge2: bigram overlap
* rougeL / rougeLsum: longest common subsequence overlap

Code
rouge = evaluate.load("rouge")
rouge.compute(predictions=["The cat sat."], references=["A cat was sitting."])
{'rouge1': 0.28571428571428575,
 'rouge2': 0.0,
 'rougeL': 0.28571428571428575,
 'rougeLsum': 0.28571428571428575}

BLEU

Precision-based n-gram overlap metric, often used in translation:

Code
bleu = evaluate.load("bleu")
bleu.compute(
    predictions=["The cat is on the mat."], references=[["The cat sits on the mat."]]
)
{'bleu': 0.488923022434901,
 'precisions': [0.8571428571428571, 0.6666666666666666, 0.4, 0.25],
 'brevity_penalty': 1.0,
 'length_ratio': 1.0,
 'translation_length': 7,
 'reference_length': 7}
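
The task table above also lists sacrebleu, a standardized BLEU variant with its own tokenization; a minimal sketch of the analogous call (same predictions/references shape as the bleu example):

Code
sacrebleu = evaluate.load("sacrebleu")
sacrebleu.compute(
    predictions=["The cat is on the mat."], references=[["The cat sits on the mat."]]
)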

BERTScore

Uses a pretrained transformer model to measure semantic similarity in embedding space. The call below follows the standard evaluate API; note that it requires the bert_score package and downloads a pretrained model on first use, so it may be slow or fail in offline environments.

Code
# NOTE: requires `pip install bert_score`; downloads a pretrained model on first run
bertscore = evaluate.load("bertscore")
bertscore.compute(
    predictions=["The cat sat."],
    references=["A cat was sitting."],
    lang="en",
)

With Hugging Face Pipelines

Code
from transformers import pipeline
summarizer = pipeline("summarization")

summary = summarizer("This is a long text to summarize.")[0]["summary_text"]

rouge.compute(predictions=[summary], references=["Reference summary here."])
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0
Your max_length is set to 142, but your input_length is only 10. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)
{'rouge1': 0.03571428571428571,
 'rouge2': 0.0,
 'rougeL': 0.03571428571428571,
 'rougeLsum': 0.03571428571428571}
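
For end-to-end benchmarking, evaluate also ships an evaluator() helper that wires a pipeline, a dataset, and a metric together in one call. A minimal sketch for text classification; the model and dataset names here are only examples (both trigger downloads, and the datasets package is required):

Code
from datasets import load_dataset
from transformers import pipeline
from evaluate import evaluator

# Example sentiment pipeline and a small slice of an example dataset
pipe = pipeline(
    "text-classification", model="distilbert-base-uncased-finetuned-sst-2-english"
)
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))

task_evaluator = evaluator("text-classification")
results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # map pipeline labels to dataset labels
)
print(results)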

Custom Metrics

Define your own scoring function:

Code
# Define a custom metric as a plain Python function
def my_metric_fn(predictions, references):
    score = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {"custom_accuracy": score}


# Call the custom metric directly on predictions and references
result = my_metric_fn(predictions=[1, 0, 1], references=[1, 1, 1])
print(result)
{'custom_accuracy': 0.6666666666666666}
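
As a quick sanity check, the custom function can be compared against the built-in accuracy module on the same inputs (a sketch reusing my_metric_fn from above):

Code
import evaluate

builtin_accuracy = evaluate.load("accuracy")

preds, refs = [1, 0, 1], [1, 1, 1]
print(my_metric_fn(predictions=preds, references=refs))
print(builtin_accuracy.compute(predictions=preds, references=refs))
# Both report 2/3 correct (~0.667), just under different key names.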

The evaluate library is model-agnostic and pairs well with the rest of the Hugging Face ecosystem. It’s lightweight enough for quick experiments but supports rich comparisons for production or publication.