Model Evaluation Metrics

Understanding the strengths, weaknesses, and appropriate use cases of various evaluation metrics in machine learning
Author

Ravi Kalia

Published

April 9, 2025

Introduction

Evaluation metrics are the cornerstone of machine learning model assessment. They help us quantify how well our models perform and make informed decisions about model selection and improvement. However, choosing the right metric is not always straightforward: each metric has its own strengths and weaknesses and is appropriate for different scenarios.

In this post, we’ll explore various evaluation metrics across different types of machine learning tasks, discuss their mathematical foundations, and understand when to use each one.

Regression Metrics

Mean Squared Error (MSE)

The Mean Squared Error is one of the most commonly used metrics for regression problems. It calculates the average of the squared differences between predicted and actual values.

Code
import numpy as np
from sklearn.metrics import mean_squared_error

# Example calculation
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.2f}")
MSE: 0.38
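
The same value can be computed directly from the definition as a quick sanity check:

Code
# Mean of squared differences, matching mean_squared_error above
mse_manual = np.mean((y_true - y_pred) ** 2)
print(f"MSE (manual): {mse_manual:.2f}")
MSE (manual): 0.38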

Pros:

  • Penalizes larger errors more heavily (due to squaring)
  • Differentiable, making it suitable for optimization
  • Mathematically convenient for analysis

Cons:

  • Sensitive to outliers
  • Not in the same units as the target variable
  • Can be misleading when dealing with data of different scales

Root Mean Squared Error (RMSE)

RMSE is simply the square root of MSE, bringing the metric back to the original units of the target variable.

Code
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")
RMSE: 0.61

Pros:

  • In the same units as the target variable
  • Easier to interpret than MSE
  • Still penalizes larger errors

Cons:

  • Still sensitive to outliers
  • Not as mathematically convenient as MSE

Mean Absolute Error (MAE)

MAE calculates the average absolute difference between predicted and actual values.

Code
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.2f}")
MAE: 0.50

Pros:

  • More robust to outliers than MSE/RMSE
  • Easy to interpret
  • In the same units as the target variable

Cons:

  • Less sensitive to large errors
  • Not differentiable at zero
  • May not penalize large errors enough in some applications

R-squared (R²)

R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables.

Code
from sklearn.metrics import r2_score

r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.2f}")
R²: 0.95
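
Since R² is one minus the ratio of residual variance to total variance, the same value can be reproduced directly from that definition:

Code
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(f"R² (manual): {1 - ss_res / ss_tot:.2f}")
R² (manual): 0.95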

Pros:

  • Provides a relative measure of model performance
  • Scale-independent
  • Intuitive interpretation (percentage of variance explained)

Cons:

  • Can be misleading with non-linear relationships
  • Can be artificially high with many features
  • Doesn’t indicate whether the model is appropriate

Classification Metrics

Accuracy

Accuracy measures the proportion of correct predictions.

Code
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0])
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")
Accuracy: 0.75

Pros:

  • Simple to understand and interpret
  • Works well with balanced classes

Cons:

  • Misleading with imbalanced classes
  • Doesn’t distinguish between types of errors
  • Not suitable for multi-class problems with varying class importance

Precision and Recall

Precision measures the proportion of positive predictions that are actually positive, while recall measures the proportion of actual positives that are correctly predicted.

Code
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
Precision: 1.00
Recall: 0.50

Pros:

  • More informative than accuracy for imbalanced data
  • Can be tuned based on business requirements
  • Capture the two types of errors separately (false positives vs false negatives)

Cons:

  • Need to choose between precision and recall
  • May not capture the full picture alone
  • Can be sensitive to threshold selection (illustrated below)
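
The threshold sensitivity is easy to see by re-scoring the same labels at two different cutoffs. A minimal sketch, where the score values and cutoffs are assumed purely for illustration:

Code
# Assumed scores for the labels above; sweep two decision thresholds
y_scores_demo = np.array([0.3, 0.9, 0.2, 0.45])
for t in (0.25, 0.5):
    y_hat = (y_scores_demo >= t).astype(int)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
threshold=0.25  precision=0.67  recall=1.00
threshold=0.50  precision=1.00  recall=0.50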

F1 Score

The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both.

Code
from sklearn.metrics import f1_score

f1 = f1_score(y_true, y_pred)
print(f"F1 Score: {f1:.2f}")
F1 Score: 0.67
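
As a quick check, the harmonic-mean formula reproduces the same value from the precision and recall computed earlier:

Code
# F1 = 2PR / (P + R), using the precision and recall from the previous section
f1_manual = 2 * precision * recall / (precision + recall)
print(f"F1 (manual): {f1_manual:.2f}")
F1 (manual): 0.67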

Pros:

  • Balances precision and recall
  • Useful for imbalanced datasets
  • Single metric for comparison

Cons:

  • May not be appropriate when precision and recall have different importance
  • Can be misleading with very imbalanced classes
  • Doesn’t consider true negatives

ROC-AUC

The area under the Receiver Operating Characteristic curve (ROC-AUC) measures how well the model separates the classes, i.e. how consistently it ranks positive examples above negative ones across all classification thresholds.

Code
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1])
y_scores = np.array([0.1, 0.9, 0.2, 0.8])
roc_auc = roc_auc_score(y_true, y_scores)
print(f"ROC-AUC: {roc_auc:.2f}")
ROC-AUC: 1.00

Pros:

  • Threshold-independent
  • Works well with imbalanced data
  • Provides a single number for model comparison

Cons:

  • May not reflect actual business requirements
  • Can be misleading with very imbalanced classes
  • Uses only the ranking of the predicted scores, not the probability values themselves

Specialized Metrics

Log Loss (Cross-Entropy Loss)

Log loss measures the performance of a classification model where the prediction is a probability between 0 and 1.

Code
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 0, 1])
y_pred_proba = np.array([0.1, 0.9, 0.2, 0.8])
logloss = log_loss(y_true, y_pred_proba)
print(f"Log Loss: {logloss:.2f}")
Log Loss: 0.16

Pros:

  • Penalizes confident wrong predictions heavily (illustrated below)
  • Works well with probabilistic predictions
  • Suitable for multi-class problems

Cons:

  • Sensitive to the exact predicted probabilities
  • Can be difficult to interpret
  • May be too harsh in some applications
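
To illustrate the first pro, flip a single prediction so it is confidently wrong (an assumed probability of 0.95 for a sample whose true label is 0); the loss jumps from 0.16 to roughly 0.86:

Code
# Same labels, but the third sample (a true 0) is now predicted positive with probability 0.95
y_pred_overconfident = np.array([0.1, 0.9, 0.95, 0.8])
print(f"Log Loss: {log_loss(y_true, y_pred_overconfident):.2f}")
Log Loss: 0.86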

Mean Absolute Percentage Error (MAPE)

MAPE measures the average absolute percentage difference between predicted and actual values.

Code
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Reuse the regression example values from above (MAPE needs nonzero actuals)
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape:.2f}%")
MAPE: 32.74%

Pros:

  • Easy to interpret as percentage error
  • Scale-independent
  • Useful for business communication

Cons:

  • Undefined when actual values are zero (demonstrated below)
  • Can be misleading with small actual values
  • Asymmetric (penalizes over-prediction differently than under-prediction)
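
The first con is straightforward to demonstrate: feeding the function actual values that contain zeros (for example, the classification labels used earlier) produces a 0/0 division and an undefined result (NumPy also emits a runtime warning):

Code
mape_zero = mean_absolute_percentage_error(np.array([0, 1, 0, 1]),
                                           np.array([0, 1, 0, 0]))
print(f"MAPE: {mape_zero:.2f}%")
MAPE: nan%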

Considerations for Different Scenarios

Class Imbalance

When dealing with imbalanced classes, certain metrics become more important:

  • Precision and Recall become crucial
  • ROC-AUC can be misleading
  • F1 score or balanced accuracy might be better choices (see the sketch after this list)
  • Consider using class weights or sampling techniques
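
A minimal sketch of the accuracy trap, using an assumed 90/10 class split and a degenerate model that always predicts the majority class; balanced accuracy (the average of per-class recall) exposes the problem that plain accuracy hides:

Code
from sklearn.metrics import balanced_accuracy_score

# Assumed imbalanced labels: nine negatives, one positive; predictions are all negative
y_true_imb = np.array([0] * 9 + [1])
y_pred_imb = np.zeros(10, dtype=int)
print(f"Accuracy:          {accuracy_score(y_true_imb, y_pred_imb):.2f}")
print(f"Balanced accuracy: {balanced_accuracy_score(y_true_imb, y_pred_imb):.2f}")
Accuracy:          0.90
Balanced accuracy: 0.50

Class weights (for example, the class_weight option many scikit-learn estimators accept) and resampling tackle the same problem at training time rather than at evaluation time.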

Changing Variance

For data with non-constant variance:

  • MSE/RMSE might be less appropriate
  • Consider weighted versions of metrics (see the sketch after this list)
  • Look into robust regression metrics
  • Consider transforming the target variable
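
One possible sketch, with assumed values whose errors grow with the magnitude of the target: sample weights can down-weight the high-variance observations, and a log transform of the target puts the errors on a more comparable scale.

Code
# Assumed targets and predictions where the error scales with the target
y_true_h = np.array([1.0, 2.0, 10.0, 100.0])
y_pred_h = np.array([1.1, 1.8, 12.0, 80.0])

# Weighted MSE: down-weight the large-magnitude observations
weights = 1.0 / y_true_h
print(f"MSE:           {mean_squared_error(y_true_h, y_pred_h):.2f}")
print(f"Weighted MSE:  {mean_squared_error(y_true_h, y_pred_h, sample_weight=weights):.2f}")

# Alternatively, evaluate on a log-transformed target
print(f"Log-scale MSE: {mean_squared_error(np.log1p(y_true_h), np.log1p(y_pred_h)):.4f}")
MSE:           101.01
Weighted MSE:  2.75
Log-scale MSE: 0.0209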

Multi-class Problems

For multi-class classification:

  • Micro/macro averaging of metrics (see the sketch after this list)
  • Confusion matrix becomes more important
  • Consider class-specific metrics
  • Weighted metrics based on class importance
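
A minimal sketch on an assumed three-class example, showing how micro and macro averaging can differ and what the confusion matrix adds:

Code
from sklearn.metrics import confusion_matrix

# Assumed three-class labels and predictions
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 2, 1, 0])
print(f"Micro F1: {f1_score(y_true_mc, y_pred_mc, average='micro'):.2f}")
print(f"Macro F1: {f1_score(y_true_mc, y_pred_mc, average='macro'):.2f}")
print(confusion_matrix(y_true_mc, y_pred_mc))  # rows = true class, columns = predicted class
Micro F1: 0.83
Macro F1: 0.82
[[2 0 0]
 [0 1 1]
 [0 0 2]]

Macro averaging treats every class equally, while micro averaging weights classes by how often they occur, so the two diverge as soon as the classes are imbalanced or unevenly predicted.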

Conclusion

Choosing the right evaluation metric is crucial for model assessment and should be based on:

  1. The nature of the problem (regression vs classification)
  2. The distribution of the data (balanced vs imbalanced)
  3. Business requirements (cost of different types of errors)
  4. The scale and variance of the target variable

Remember that no single metric tells the whole story; it is often valuable to look at several metrics and understand their trade-offs. Always consider the context and requirements of your specific problem when selecting evaluation metrics.