Understanding Feature Importance After Clustering

Machine Learning
Clustering
Feature Importance
Author

Ravi Kalia

Published

February 20, 2025


Introduction

Once clustering is complete, understanding which features differentiate each cluster is crucial for interpretation. This blog post explores multiple ways to attribute feature importance to clusters and presents a worked example using k-means clustering and decision tree-based feature importance.

Methods for Identifying Important Features

Several approaches can be used to determine which features define each cluster:

  1. Feature Distributions per Cluster
  • Compute the mean, median, or standard deviation of each feature within each cluster.
  • Compare distributions using boxplots or violin plots (see the first sketch after this list).
  2. Feature Importance via Decision Trees
  • Train a decision tree classifier on the cluster labels.
  • Extract feature importances from the trained model (demonstrated in the worked example below).
  3. Permutation Importance
  • Shuffle each feature and measure its impact on cluster assignments using a trained classifier (sketched below).
  4. SHAP (SHapley Additive exPlanations)
  • Train a model to predict cluster labels and use SHAP values for feature importance (sketched below).
  5. PCA or t-SNE Contributions
  • If PCA or t-SNE was used, examine which features contribute most to the components (sketched below for PCA).
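
For the per-cluster distribution approach, here is a minimal sketch on the Iris data (the variable names X and labels are chosen for illustration; the worked example below uses slightly different ones):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Per-cluster summary statistics: features whose means shift sharply between
# clusters are good candidates for cluster-defining features
print(X.groupby(labels).agg(['mean', 'std']))

# Visual comparison of one feature's distribution across clusters
sns.boxplot(x=labels, y=X['petal length (cm)'])
plt.xlabel("Cluster")
plt.show()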
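
Permutation importance can be computed with scikit-learn's permutation_importance, applied to a classifier trained to reproduce the cluster labels. A sketch reusing X and labels from the block above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Surrogate classifier that learns to predict the k-means labels
clf = RandomForestClassifier(random_state=42).fit(X, labels)

# Shuffle each feature in turn and measure the resulting drop in accuracy
result = permutation_importance(clf, X, labels, n_repeats=10, random_state=42)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")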
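
For SHAP, the same surrogate classifier can be explained with the third-party shap package (a sketch, assuming shap is installed and reusing clf and X from above):

import shap

# TreeExplainer supports tree ensembles such as the random forest above
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X)

# Bar summary: mean absolute SHAP value per feature, aggregated over samples
shap.summary_plot(shap_values, X, plot_type="bar")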
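
Finally, when PCA precedes clustering, the component loadings indicate how strongly each original feature contributes to the directions along which the clusters separate. A sketch reusing X from above:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)

# Rows are principal components, columns are the original features;
# large absolute loadings mark the features driving each component
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=['PC1', 'PC2'])
print(loadings)
print(pca.explained_variance_ratio_)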

Worked Example: Finding Discriminative Features After Clustering

We will apply k-means clustering to the classic Iris dataset, then train a decision tree on the resulting cluster labels to find the features that best separate the clusters.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import seaborn as sns

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Perform k-means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
X['cluster'] = kmeans.fit_predict(X)

# Train a Decision Tree classifier to predict clusters
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X.drop(columns=['cluster']), X['cluster'])

# Extract feature importance
feature_importance = pd.Series(clf.feature_importances_, index=X.columns[:-1]).sort_values(ascending=False)

# Plot feature importance
plt.figure(figsize=(8, 4))
sns.barplot(x=feature_importance.values, y=feature_importance.index,
            hue=feature_importance.index, palette='viridis', legend=False)
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance in Cluster Differentiation")
plt.show()

Interpretation of Results

  • The decision tree classifier identifies the features that best distinguish the clusters it was trained to predict.
  • The highest-ranked feature contributes most to separating the k-means clusters.
  • The same recipe extends to real-world datasets, for example to interpret clusters in customer segmentation, gene expression analysis, and other domains.
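
To tie the ranked features back to the clusters themselves, a quick follow-up (assuming the code above has been run, so X still contains the cluster column) is to compare per-cluster means of the top features:

# Mean of each feature per cluster; read across rows to see which clusters
# sit high or low on the most important features
print(X.groupby('cluster').mean().round(2))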

Conclusion

By combining clustering with feature importance techniques, we gain insight into the underlying structure of the data. This helps explain why particular groups form and supports better decision-making in applications such as marketing, biology, and finance.