Handling data in machine learning can feel as important as wielding a lightsaber in the Star Wars universe. In this guide, we’ll explore key concepts around data splitting and discuss potential risks of early preprocessing. Think of it as a journey toward mastering the art of data preparation, where each decision shapes your path.
Loading and Preparing the Dataset
Let’s begin our adventure with the Star Wars dataset:
Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Star Wars survey dataset, skipping the second header row of "Response" labels
url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv"
df = pd.read_csv(url, encoding="latin1", skiprows=[1])

# Clean column names: the "seen" and "rank" questions each span six columns,
# but only the first column of each block carries the full question text
df = df.rename(columns={
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "is_fan",
    **{col: f"seen_{i + 1}" for i, col in enumerate(df.columns[3:9])},
    **{col: f"rank_{i + 1}" for i, col in enumerate(df.columns[9:15])},
})

# Select a subset of columns for our analysis
columns_to_use = ['seen_1', 'seen_2', 'seen_3', 'seen_4', 'seen_5', 'seen_6',
                  'rank_1', 'rank_2', 'rank_3', 'rank_4', 'rank_5', 'rank_6',
                  'is_fan', 'Gender', 'Age', 'Household Income', 'Education']
df = df[columns_to_use]

print(f"Dataset shape: {df.shape}")
print(f"Features: {df.columns.tolist()}")
A peek at the raw file shows why this cleanup is needed: the survey has 38 columns, most headed by the full question text (the multi-select "seen" and "rank" questions spill into "Unnamed" columns), one header with a mis-encoded character, and a leading row of "Response" placeholders left over from the survey export.
The Importance of Proper Data Splitting
Splitting your data is crucial for developing robust machine learning models. Each split serves a distinct purpose:
Training set: Used to train the model
Validation set: Used for hyperparameter tuning and model selection
Test set: Used to evaluate the final model’s performance on unseen data
Proper data splitting helps prevent overfitting and provides a realistic estimate of how your model will perform on new, unseen data.
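As a concrete illustration (the 60/20/20 proportions and the use of is_fan as the target are choices made for this sketch, not requirements), a three-way split can be produced with two calls to train_test_split:
Code
from sklearn.model_selection import train_test_split

# Using the cleaned survey data from above, with `is_fan` as the target
X = df.drop('is_fan', axis=1)
y = df['is_fan']

# First carve off a 20% test set, then split the remainder into train and validation;
# 0.25 of the remaining 80% gives a 60/20/20 split overall
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(f"Train: {len(X_train)}, Validation: {len(X_val)}, Test: {len(X_test)}")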
The Dangers of Premature Preprocessing
While data preprocessing is essential, performing certain steps before splitting your data can lead to data leakage and biased models. Let’s explore some common preprocessing steps and their associated risks:
1. Handling Outliers
Removing or modifying outliers based on statistics computed over the entire dataset, before splitting, is a form of data leakage: the resulting test set no longer represents the true data distribution, which leads to overly optimistic performance estimates and poor generalization to new data.
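Since this survey is mostly categorical, the sketch below uses the numeric rank_1 column purely for illustration; the point is that the outlier bounds are computed on the training split only and then applied unchanged to the test split:
Code
from sklearn.model_selection import train_test_split

# Split first, then learn outlier bounds from the training data only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

q1, q3 = train_df['rank_1'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # bounds learned from the training set

# Apply the training-set bounds to both splits (clipping here; dropping also works)
train_clipped = train_df.assign(rank_1=train_df['rank_1'].clip(lower, upper))
test_clipped = test_df.assign(rank_1=test_df['rank_1'].clip(lower, upper))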
2. Bucketing Variables
Creating categorical variables from continuous ones (bucketing) can also cause data leakage when the bin boundaries are computed from the entire dataset: the boundaries are then partly determined by the test set, biasing any model evaluated on it.
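A leakage-free alternative learns the bin edges from the training split and reuses them on the test split; a minimal sketch with scikit-learn's KBinsDiscretizer (the column and number of bins are illustrative):
Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# KBinsDiscretizer cannot handle missing values, so keep rows where rank_1 is present
train_df, test_df = train_test_split(
    df.dropna(subset=['rank_1']), test_size=0.2, random_state=42)

binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
binner.fit(train_df[['rank_1']])                     # bin edges come from training data only
train_bins = binner.transform(train_df[['rank_1']])  # the same edges are applied to both splits
test_bins = binner.transform(test_df[['rank_1']])

print("Bin edges learned from the training set:", binner.bin_edges_[0])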
3. Handling Missing Data
Imputing missing values using information from the entire dataset can lead to data leakage:
Code
def show_missingness_effect(df, column):
    # Introduce some missing values
    df_missing = df.copy()
    df_missing.loc[df_missing.sample(frac=0.2, random_state=42).index, column] = np.nan
    print(f"Missing values in {column}: {df_missing[column].isnull().sum()}")

    # Incorrect way: impute missing values using the mean of the entire dataset
    df_imputed = df_missing.copy()
    df_imputed[column] = df_imputed[column].fillna(df_imputed[column].mean())

    plt.figure(figsize=(12, 6))
    plt.subplot(121)
    df[column].hist(bins=30)
    plt.title(f"Original {column} Distribution")
    plt.subplot(122)
    df_imputed[column].hist(bins=30)
    plt.title(f"{column} Distribution After Imputation")
    plt.tight_layout()
    plt.show()

# 'Age' is a categorical range in this survey, so we demonstrate with the numeric rank_1 column
show_missingness_effect(df, 'rank_1')
Imputing missing values using statistics from the entire dataset allows information from the test set to influence the training data, potentially leading to overfitting and unreliable performance estimates.
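The leakage-free version fits the imputer on the training split only. A minimal sketch with scikit-learn's SimpleImputer, again using the numeric rank_1 column for illustration:
Code
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Correct way: learn the imputation statistic from the training split only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

imputer = SimpleImputer(strategy='mean')
train_rank = imputer.fit_transform(train_df[['rank_1']])  # mean computed on training data
test_rank = imputer.transform(test_df[['rank_1']])        # same mean reused on test data

print(f"Imputation value learned from the training set: {imputer.statistics_[0]:.2f}")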
4. Dimensionality Reduction
Applying dimensionality reduction techniques like PCA to the entire dataset before splitting can also cause data leakage:
Code
from sklearn.decomposition import PCA

def show_pca_effect(df, n_components=2):
    # PCA cannot handle missing values, so keep complete rows of the numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    numeric_df = df[numeric_cols].dropna()

    # Incorrect way: apply PCA to the entire dataset
    pca = PCA(n_components=n_components)
    pca_result = pca.fit_transform(numeric_df)

    plt.figure(figsize=(10, 8))
    plt.scatter(pca_result[:, 0], pca_result[:, 1], alpha=0.5)
    plt.title("PCA Result (Incorrectly Applied to Entire Dataset)")
    plt.xlabel("First Principal Component")
    plt.ylabel("Second Principal Component")
    plt.show()
    print("Explained variance ratio:", pca.explained_variance_ratio_)

show_pca_effect(df)
Applying PCA or other dimensionality reduction techniques to the entire dataset allows information from the test set to influence the feature space of the training data.
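The leakage-free counterpart estimates the principal components from the training split and only projects the test split onto them; a minimal sketch:
Code
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Correct way: learn the principal components from the training split only
numeric_df = df.select_dtypes(include=[np.number]).dropna()
train_df, test_df = train_test_split(numeric_df, test_size=0.2, random_state=42)

pca = PCA(n_components=2)
train_pca = pca.fit_transform(train_df)  # components estimated on training data
test_pca = pca.transform(test_df)        # test data projected onto those components

print("Explained variance ratio (training fit):", pca.explained_variance_ratio_)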
Impact on Model Performance
When preprocessing steps are applied to the entire dataset before splitting, several problems can arise:
Overfitting: The model may implicitly learn patterns from the test set, leading to overly optimistic performance estimates.
Poor generalization: The model may not perform well on truly unseen data because it has been trained on a dataset that doesn’t represent the real-world data distribution.
Biased feature importance: The importance of features may be distorted due to information leakage from the test set.
Unreliable model selection: When comparing different models, the selection process may be biased towards models that overfit the leaked information.
The Exception: Minimal Target Variable Processing
While most preprocessing should be done after splitting the data, some minimal processing of the target variable can be acceptable and even beneficial when done carefully:
1. Calculating the Empirical Distribution of the Target
Understanding the distribution of your target variable can inform your sampling strategy and help identify potential issues.
When your target variable is imbalanced, stratified sampling becomes crucial. This ensures that your train, validation, and test sets maintain the same proportion of classes as the original dataset.
Code
def check_imbalance(df, target_column, threshold=0.8):
    distribution = df[target_column].value_counts(normalize=True)
    if distribution.max() > threshold:
        print(f"Warning: The target variable is imbalanced. "
              f"The majority class represents {distribution.max():.2%} of the data.")
        return True
    return False

# Drop respondents who did not answer the fan question so every row has a label
df_labeled = df.dropna(subset=['is_fan'])
is_imbalanced = check_imbalance(df_labeled, 'is_fan')

X = df_labeled.drop('is_fan', axis=1)
y = df_labeled['is_fan']

# If imbalanced, use stratified sampling
if is_imbalanced:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
else:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

print(f"Training set class distribution:\n{y_train.value_counts(normalize=True)}")
print(f"\nTest set class distribution:\n{y_test.value_counts(normalize=True)}")
Best Practices for Data Preprocessing
To avoid these issues and ensure robust models, follow these best practices:
Split your data first: Always split your data into train, validation, and test sets before any preprocessing.
Preprocess within cross-validation: Apply preprocessing steps only to the training portion of each cross-validation fold (a cross-validation sketch follows the pipeline example below).
Use pipelines: Scikit-learn’s Pipeline class can help ensure that preprocessing steps are only applied to the training data.
Preserve test set integrity: Never use information from the test set for preprocessing or model development.
Here’s an example of how to properly split the data and apply preprocessing:
Code
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# First, split the data (dropping respondents without a fan label so every row has a target)
df_labeled = df.dropna(subset=['is_fan'])
X = df_labeled.drop('is_fan', axis=1)
y = df_labeled['is_fan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess numeric and categorical columns separately:
# mean-impute and scale the numeric ranks, mode-impute and one-hot encode the categoricals
numeric_cols = X_train.select_dtypes(include=[np.number]).columns
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), numeric_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])

# Create a preprocessing and modeling pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])

# Fit the pipeline on the training data only; all preprocessing statistics
# (means, scaling factors, category levels) are learned from X_train
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
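To make the "preprocess within cross-validation" advice concrete, the same pipeline can be handed to cross_val_score: because the imputers, encoder, and scaler live inside the pipeline, they are refit on the training portion of every fold. A short sketch reusing the pipeline and training split defined above:
Code
from sklearn.model_selection import cross_val_score

# Each fold refits the full pipeline (imputation, encoding, scaling, model) on that fold's
# training portion only, so no preprocessing statistics leak across folds
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')

print(f"Cross-validated accuracy: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")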
Conclusion
Proper data splitting and careful preprocessing are crucial for developing robust and reliable machine learning models. By understanding the risks associated with premature preprocessing and following best practices, you can avoid data leakage, obtain realistic performance estimates, and build models that generalize well to new data.
Remember, the goal is not just to have a model that performs well on your test set, but one that will perform well on truly unseen data in real-world applications. By maintaining the integrity of your data splitting process and applying preprocessing steps correctly, you’ll be well on your way to mastering the art of machine learning.