Handling data in machine learning can feel as important as wielding a lightsaber in the Star Wars universe. In this guide, we’ll explore key concepts around data splitting and discuss potential risks of early preprocessing. Think of it as a journey toward mastering the art of data preparation, where each decision shapes your path.
Loading and Preparing the Dataset
Let’s begin our adventure with the Star Wars dataset:
Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Star Wars survey dataset, skipping the second header row of "Response" labels
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv"
df = pd.read_csv(url, encoding="latin1", skiprows=[1])

# Clean column names: the "seen" and "rank" questions each span six columns,
# but only the first column of each block carries the full question text
df = df.rename(columns={
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "is_fan",
    **{col: f"seen_{i + 1}" for i, col in enumerate(df.columns[3:9])},
    **{col: f"rank_{i + 1}" for i, col in enumerate(df.columns[9:15])},
})

# Select a subset of columns for our analysis
columns_to_use = ['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
                  'rank_1', 'rank_2', 'rank_3', 'rank_4', 'rank_5', 'rank_6',
                  'is_fan', 'Gender', 'Age', 'Household Income', 'Education']
df = df[columns_to_use]

print(f"Dataset shape: {df.shape}")
print(f"Features: {df.columns.tolist()}")
A peek at the raw file shows why this cleanup is needed: the survey has 38 columns, most headed by the full question text (the multi-select "seen" and "rank" questions spill into "Unnamed" columns), one header with a mis-encoded character, and a leading row of "Response" placeholders left over from the survey export.
The Importance of Proper Data Splitting
Splitting your data is crucial for developing robust machine learning models. Each split serves a distinct purpose:
Training set: Used to train the model
Validation set: Used for hyperparameter tuning and model selection
Test set: Used to evaluate the final model’s performance on unseen data
Proper data splitting helps prevent overfitting and provides a realistic estimate of how your model will perform on new, unseen data.
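As a concrete illustration (the 60/20/20 proportions and the use of is_fan as the target are choices made for this sketch, not requirements), a three-way split can be produced with two calls to train_test_split:
Code
from sklearn.model_selection import train_test_split

# Using the cleaned survey data from above, with `is_fan` as the target
X = df.drop('is_fan', axis=1)
y = df['is_fan']

# First carve off a 20% test set, then split the remainder into train and validation;
# 0.25 of the remaining 80% gives a 60/20/20 split overall
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")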
The Dangers of Premature Preprocessing
While data preprocessing is essential, performing certain steps before splitting your data can lead to data leakage and biased models. Let’s explore some common preprocessing steps and their associated risks:
1. Handling Outliers
Removing or modifying outliers based on statistics computed over the entire dataset, before splitting, is a form of data leakage: the resulting test set no longer represents the true data distribution, which leads to overly optimistic performance estimates and poor generalization to new data.
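Since this survey is mostly categorical, the sketch below uses the numeric rank_1 column purely for illustration; the point is that the outlier bounds are computed on the training split only and then applied unchanged to the test split:
Code
from sklearn.model_selection import train_test_split

# Split first, then learn outlier bounds from the training data only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

q1, q3 = train_df['rank_1'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # bounds learned from the training set

# Apply the training-set bounds to both splits (clipping here; dropping also works)
train_clipped = train_df.assign(rank_1=train_df['rank_1'].clip(lower, upper))
test_clipped = test_df.assign(rank_1=test_df['rank_1'].clip(lower, upper))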
2. Bucketing Variables
Creating categorical variables from continuous ones (bucketing) can also cause data leakage when the bin boundaries are computed from the entire dataset: the boundaries are then partly determined by the test set, biasing any model evaluated on it.
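A leakage-free alternative learns the bin edges from the training split and reuses them on the test split; a minimal sketch with scikit-learn's KBinsDiscretizer (the column and number of bins are illustrative):
Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# KBinsDiscretizer cannot handle missing values, so keep rows where rank_1 is present
train_df, test_df = train_test_split(
    df.dropna(subset=['rank_1']), test_size=0.2, random_state=42)

binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
binner.fit(train_df[['rank_1']])                     # bin edges come from training data only
train_bins = binner.transform(train_df[['rank_1']])  # the same edges are applied to both splits
test_bins = binner.transform(test_df[['rank_1']])

print("Bin edges learned from the training set:", binner.bin_edges_[0])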
3. Handling Missing Data
Imputing missing values using information from the entire dataset can lead to data leakage:
Code
def show_missingness_effect(df, column):
    # Introduce some missing values
    df_missing = df.copy()
    df_missing.loc[df_missing.sample(frac=0.2, random_state=42).index, column] = np.nan
    print(f"Missing values in {column}: {df_missing[column].isnull().sum()}")

    # Incorrect way: impute missing values using the mean of the entire dataset
    df_imputed = df_missing.copy()
    df_imputed[column] = df_imputed[column].fillna(df_imputed[column].mean())

    plt.figure(figsize=(12, 6))
    plt.subplot(121)
    df[column].hist(bins=30)
    plt.title(f"Original {column} Distribution")
    plt.subplot(122)
    df_imputed[column].hist(bins=30)
    plt.title(f"{column} Distribution After Imputation")
    plt.tight_layout()
    plt.show()

# 'Age' is a categorical range in this survey, so we demonstrate with the numeric rank_1 column
show_missingness_effect(df, 'rank_1')
Imputing missing values using statistics from the entire dataset allows information from the test set to influence the training data, potentially leading to overfitting and unreliable performance estimates.
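The leakage-free version fits the imputer on the training split only. A minimal sketch with scikit-learn's SimpleImputer, again using the numeric rank_1 column for illustration:
Code
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Correct way: learn the imputation statistic from the training split only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy='mean')
train_rank = imputer.fit_transform(train_df[['rank_1']])  # mean computed on training data
test_rank = imputer.transform(test_df[['rank_1']])        # same mean reused on test data

print(f"Imputation value learned from the training set: {imputer.statistics_[0]:.2f}")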
4. Dimensionality Reduction
Applying dimensionality reduction techniques like PCA to the entire dataset before splitting can also cause data leakage:
Code
from sklearn.decomposition import PCA

def show_pca_effect(df, n_components=2):
    # PCA cannot handle missing values, so keep complete rows of the numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    numeric_df = df[numeric_cols].dropna()

    # Incorrect way: apply PCA to the entire dataset
    pca = PCA(n_components=n_components)
    pca_result = pca.fit_transform(numeric_df)

    plt.figure(figsize=(10, 8))
    plt.scatter(pca_result[:, 0], pca_result[:, 1], alpha=0.5)
    plt.title("PCA Result (Incorrectly Applied to Entire Dataset)")
    plt.xlabel("First Principal Component")
    plt.ylabel("Second Principal Component")
    plt.show()
    print("Explained variance ratio:", pca.explained_variance_ratio_)

show_pca_effect(df)
Applying PCA or other dimensionality reduction techniques to the entire dataset allows information from the test set to influence the feature space of the training data.
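The leakage-free counterpart estimates the principal components from the training split and only projects the test split onto them; a minimal sketch:
Code
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Correct way: learn the principal components from the training split only
numeric_df = df.select_dtypes(include=[np.number]).dropna()
train_df, test_df = train_test_split(numeric_df, test_size=0.2, random_state=42)

pca = PCA(n_components=2)
train_pca = pca.fit_transform(train_df)  # components estimated on training data
test_pca = pca.transform(test_df)        # test data projected onto those components

print("Explained variance ratio (training fit):", pca.explained_variance_ratio_)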
Impact on Model Performance
When preprocessing steps are applied to the entire dataset before splitting, several problems can arise:
Overfitting: The model may implicitly learn patterns from the test set, leading to overly optimistic performance estimates.
Poor generalization: The model may not perform well on truly unseen data because it has been trained on a dataset that doesn’t represent the real-world data distribution.
Biased feature importance: The importance of features may be distorted due to information leakage from the test set.
Unreliable model selection: When comparing different models, the selection process may be biased towards models that overfit the leaked information.
The Exception: Minimal Target Variable Processing
While most preprocessing should be done after splitting the data, some minimal processing of the target variable can be acceptable and even beneficial when done carefully:
1. Calculating the Empirical Distribution of the Target
Understanding the distribution of your target variable can inform your sampling strategy and help identify potential issues.
When your target variable is imbalanced, stratified sampling becomes crucial. This ensures that your train, validation, and test sets maintain the same proportion of classes as the original dataset.
Code
def check_imbalance(df, target_column, threshold=0.8):
    distribution = df[target_column].value_counts(normalize=True)
    if distribution.max() > threshold:
        print(f"Warning: The target variable is imbalanced. "
              f"The majority class represents {distribution.max():.2%} of the data.")
        return True
    return False

# Drop respondents who did not answer the fan question so every row has a label
df_labeled = df.dropna(subset=['is_fan'])
is_imbalanced = check_imbalance(df_labeled, 'is_fan')

X = df_labeled.drop('is_fan', axis=1)
y = df_labeled['is_fan']

# If imbalanced, use stratified sampling
if is_imbalanced:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
else:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

print(f"Training set class distribution:\n{y_train.value_counts(normalize=True)}")
print(f"\nTest set class distribution:\n{y_test.value_counts(normalize=True)}")
Best Practices for Data Preprocessing
To avoid these issues and ensure robust models, follow these best practices:
Split your data first: Always split your data into train, validation, and test sets before any preprocessing.
Preprocess within cross-validation: Apply preprocessing steps only to the training portion of each cross-validation fold (a cross-validation sketch follows the pipeline example below).
Use pipelines: Scikit-learn’s Pipeline class can help ensure that preprocessing steps are only applied to the training data.
Preserve test set integrity: Never use information from the test set for preprocessing or model development.
Here’s an example of how to properly split the data and apply preprocessing:
Code
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First, split the data (dropping respondents without a fan label so every row has a target)
df_labeled = df.dropna(subset=['is_fan'])
X = df_labeled.drop('is_fan', axis=1)
y = df_labeled['is_fan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess numeric and categorical columns separately:
# mean-impute and scale the numeric ranks, mode-impute and one-hot encode the categoricals
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), numeric_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])

# Create a preprocessing and modeling pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Fit the pipeline on the training data only; all preprocessing statistics
# (means, scaling factors, category levels) are learned from X_train
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
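To make the "preprocess within cross-validation" advice concrete, the same pipeline can be handed to cross_val_score: because the imputers, encoder, and scaler live inside the pipeline, they are refit on the training portion of every fold. A short sketch reusing the pipeline and training split defined above:
Code
from sklearn.model_selection import cross_val_score

# Each fold refits the full pipeline (imputation, encoding, scaling, model) on that fold's
# training portion only, so no preprocessing statistics leak across folds
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')

print(f"Cross-validated accuracy: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")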
Conclusion
Proper data splitting and careful preprocessing are crucial for developing robust and reliable machine learning models. By understanding the risks associated with premature preprocessing and following best practices, you can avoid data leakage, obtain realistic performance estimates, and build models that generalize well to new data.
Remember, the goal is not just to have a model that performs well on your test set, but one that will perform well on truly unseen data in real-world applications. By maintaining the integrity of your data splitting process and applying preprocessing steps correctly, you’ll be well on your way to mastering the art of machine learning.