2  Data Preparation

“Data preparation is 80% of the work in machine learning.” - Every ML practitioner

2.1 Introduction

Before building models, we need clean, well-structured data. This chapter covers essential data preparation techniques using Pandas and scikit-learn.

2.2 Loading Data with Pandas

Pandas is Python’s primary library for data manipulation and analysis. It provides powerful data structures like DataFrames (think of them as Excel spreadsheets in code) that make it easy to read, explore, and transform data from various sources. Before we can build any machine learning model, we need to load our data into a format that Python can work with.

2.2.1 Reading Different Formats

import pandas as pd
import numpy as np

# CSV files (most common)
# df = pd.read_csv('data.csv')

# Excel files
# df = pd.read_excel('data.xlsx')

# JSON files
# df = pd.read_json('data.json')

# For this example, let's create sample data
df = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 28, 45, 50, np.nan],
    'salary': [50000, 60000, 55000, np.nan, 52000, 80000, 90000, 65000],
    'department': ['Sales', 'Engineering', 'Sales', 'Engineering', 'HR', 'Engineering', np.nan, 'Sales'],
    'years_experience': [2, 5, 3, 7, 4, 15, 20, 6]
})

print(df)

2.3 Exploring Datasets

Once data is loaded, the next crucial step is understanding what you’re working with. Data exploration (also called Exploratory Data Analysis or EDA) helps you discover patterns, spot anomalies, identify missing values, and understand the relationships between variables. This detective work is essential before training any model, as it informs your data cleaning and feature engineering decisions.

2.3.1 Basic Exploration Commands

# First few rows
print(df.head())

# Dataset info: column types and non-null counts
# (df.info() prints directly; wrapping it in print() would also print "None")
df.info()

# Statistical summary of numeric columns
print(df.describe())

# Count of missing values per column
print(df.isnull().sum())
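
Two more commands are worth knowing at this stage: value_counts for categorical frequencies and corr for relationships between numeric variables. A quick sketch on the same df:

# Frequency of each category, including missing entries
print(df['department'].value_counts(dropna=False))

# Pairwise correlations between numeric columns
# (numeric_only=True avoids warnings on newer Pandas versions)
print(df.corr(numeric_only=True))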

2.4 Handling Missing Values

Missing data is one of the most common challenges in real-world datasets. Values can be missing for many reasons: sensors malfunction, users skip survey questions, data wasn’t collected, or records were lost. How you handle missing values can significantly impact your model’s performance. The two main approaches are removing the incomplete data or filling in (imputing) the missing values with reasonable estimates.

2.4.1 Strategy 1: Remove Missing Data

# Drop rows with any missing values
df_dropped = df.dropna()
print(f"Original shape: {df.shape}")
print(f"After dropping: {df_dropped.shape}")

Use when: Missing data is minimal (<5% of rows) and missing at random, so dropping rows does not bias the sample.
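
Before choosing this strategy, check how much is actually missing per column; a quick way with the same df:

# Percentage of missing values per column
print((df.isnull().mean() * 100).round(1))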

2.4.2 Strategy 2: Imputation (Filling Missing Values)

from sklearn.impute import SimpleImputer

# Numeric columns - fill with mean
imputer_mean = SimpleImputer(strategy='mean')
df_numeric = df[['age', 'salary', 'years_experience']].copy()
df_imputed = pd.DataFrame(
    imputer_mean.fit_transform(df_numeric),
    columns=df_numeric.columns
)

print("Before imputation:")
print(df_numeric)
print("\nAfter imputation (mean):")
print(df_imputed)

# Categorical columns - fill with most frequent
imputer_freq = SimpleImputer(strategy='most_frequent')
df_categorical = df[['department']].copy()
df_cat_imputed = pd.DataFrame(
    imputer_freq.fit_transform(df_categorical),
    columns=df_categorical.columns
)

print("Before imputation:")
print(df_categorical)
print("\nAfter imputation (most frequent):")
print(df_cat_imputed)

Other strategies:

  • strategy='median' - Robust to outliers
  • strategy='constant' - Fill with a specific value
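
As a brief sketch of these alternatives on the numeric columns from above (fill_value sets the constant; 0 here is purely illustrative):

# Median: safer default when columns like salary contain extreme values
imputer_median = SimpleImputer(strategy='median')
print(pd.DataFrame(
    imputer_median.fit_transform(df_numeric),
    columns=df_numeric.columns
))

# Constant: fill with a fixed value of your choosing
imputer_const = SimpleImputer(strategy='constant', fill_value=0)
print(pd.DataFrame(
    imputer_const.fit_transform(df_numeric),
    columns=df_numeric.columns
))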

2.5 Train/Test Split

Critical concept: Never test on data you trained on!

from sklearn.model_selection import train_test_split

# Build features and target from rows with no missing values
# (dropping on all three columns so the target y contains no NaNs)
complete = df.dropna(subset=['age', 'years_experience', 'salary'])
X = complete[['age', 'years_experience']]
y = complete['salary']

# Split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Key parameters:

  • test_size: Proportion for testing (0.2 = 20%)
  • random_state: Ensures reproducibility
  • stratify: Maintains class distribution (for classification; see the sketch below)
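
A minimal sketch of stratify in action, using made-up imbalanced labels (X_demo and y_demo are illustrative, not part of our sample df):

# 20 samples, 75% class 0 and 25% class 1
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits preserve the 75/25 class ratio
print(np.bincount(y_tr), np.bincount(y_te))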

2.6 Complete Preparation Pipeline Example

Putting it all together, note the order of operations: split first, then fit preparation steps on the training set only, so no information from the test set leaks into them.

# 1. Load data
df = pd.DataFrame({
    'feature1': [1, 2, np.nan, 4, 5],
    'feature2': [10, np.nan, 30, 40, 50],
    'target': [0, 1, 0, 1, 0]
})

# 2. Separate features and target
X = df[['feature1', 'feature2']]
y = df['target']

# 3. Split the data first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Handle missing values: fit the imputer on the training set only,
#    then apply the same fitted transformation to the test set
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

print("Data preparation complete!")
print(f"Training shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")

2.7 Best Practices

  1. Explore first: Always examine data before cleaning
  2. Document decisions: Note why you chose specific imputation strategies
  3. Preserve test integrity: Never peek at test data during preparation
  4. Use pipelines: Automate repetitive steps (Chapter 9)
  5. Handle outliers: Check for extreme values that might skew results
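
For point 5, a common quick screen is the 1.5 × IQR rule. The sketch below uses made-up salaries with one deliberately extreme value; the 1.5 multiplier is a convention, not a hard rule:

# Illustrative salaries with one obvious outlier
salary = pd.Series([50000, 52000, 55000, 60000, 65000, 80000, 90000, 500000])

# Flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salary[(salary < lower) | (salary > upper)])  # flags 500000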

2.8 Summary

  • Use Pandas for data loading and exploration
  • Handle missing values via dropping or imputation
  • Always split data into train/test sets
  • Never train and test on the same data
  • Data preparation is iterative - revisit as needed

Next: We’ll build our first complete model using clean data!