8  Model Selection & Hyperparameter Tuning

Selecting the right algorithm and tuning its hyperparameters are crucial for good performance.

8.1 The Workflow

  1. Choose candidate algorithms (Random Forest, SVM, etc.)
  2. Define hyperparameter search space
  3. Use cross-validation to evaluate each combination
  4. Select best model + parameters
  5. Evaluate on the test set (once!), as sketched below
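
A minimal sketch of how these five steps fit together in scikit-learn, assuming a train/test split like the one built in Section 8.2 and an illustrative Random Forest search space:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Steps 1-2: candidate algorithm and its hyperparameter search space
search_space = {'n_estimators': [100, 200], 'max_depth': [5, 10]}

# Step 3: cross-validated search over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    search_space,
    cv=5,
    scoring='accuracy'
)

# Step 4: fit on training data only; the best model is refit automatically
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Step 5: a single, final evaluation on the held-out test set
print("Test accuracy:", search.score(X_test, y_test))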

8.2 Sample Dataset

For this chapter, we’ll use a synthetic classification dataset to demonstrate hyperparameter tuning techniques. This controlled dataset allows us to focus on the tuning process itself without getting distracted by data cleaning or feature engineering. The principles you learn here apply to any real-world dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create classification dataset
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=42
)

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Class distribution: {pd.Series(y_train).value_counts().to_dict()}")

8.5 Comparing Multiple Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Define models and their parameter grids
models = {
    'Logistic Regression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'params': {
            'C': [0.01, 0.1, 1, 10],
            'penalty': ['l1', 'l2'],
            'solver': ['liblinear']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100, 200],
            'max_depth': [5, 10, 15],
            'min_samples_split': [2, 5]
        }
    },
    'SVM': {
        'model': SVC(random_state=42),
        'params': {
            'C': [0.1, 1, 10],
            'kernel': ['rbf', 'linear'],
            'gamma': ['scale', 'auto']
        }
    },
    'Gradient Boosting': {
        'model': GradientBoostingClassifier(random_state=42),
        'params': {
            'n_estimators': [50, 100],
            'learning_rate': [0.01, 0.1, 0.5],
            'max_depth': [3, 5, 7]
        }
    }
}

# Search for best model
best_models = {}

for name, model_dict in models.items():
    print(f"\nTuning {name}...")

    grid = GridSearchCV(
        estimator=model_dict['model'],
        param_grid=model_dict['params'],
        cv=3,  # Faster for comparison
        scoring='accuracy',
        n_jobs=-1
    )

    grid.fit(X_train, y_train)
    best_models[name] = {
        'estimator': grid.best_estimator_,
        'cv_score': grid.best_score_,
        'params': grid.best_params_
    }

    print(f"  Best CV score: {grid.best_score_:.3f}")

8.5.1 Compare Best Models

# Evaluate all on test set
comparison = []

for name, model_info in best_models.items():
    test_score = model_info['estimator'].score(X_test, y_test)
    comparison.append({
        'Model': name,
        'CV Score': model_info['cv_score'],
        'Test Score': test_score,
        'Best Params': str(model_info['params'])
    })

comparison_df = pd.DataFrame(comparison).sort_values('Test Score', ascending=False)
print("\nModel Comparison:")
print(comparison_df.to_string(index=False))

# Visualize
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(comparison_df))
width = 0.35

ax.bar(x - width/2, comparison_df['CV Score'], width, label='CV Score', alpha=0.8)
ax.bar(x + width/2, comparison_df['Test Score'], width, label='Test Score', alpha=0.8)

ax.set_xlabel('Model')
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison: CV vs Test Score')
ax.set_xticks(x)
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

8.6 Advanced: Nested Cross-Validation

Problem: Tuning hyperparameters and estimating performance on the same cross-validation folds gives an optimistically biased score.

Solution: Use an outer CV loop for evaluation and an inner CV loop for tuning.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Nested CV: outer loop evaluates, inner loop tunes
def nested_cv(model, param_grid, X, y, outer_cv=5, inner_cv=3):
    """Perform nested cross-validation."""

    outer_scores = []
    outer_folds = StratifiedKFold(n_splits=outer_cv, shuffle=True, random_state=42)

    for train_idx, test_idx in outer_folds.split(X, y):
        # Outer fold: held out from all tuning
        X_train_outer, X_test_outer = X[train_idx], X[test_idx]
        y_train_outer, y_test_outer = y[train_idx], y[test_idx]

        # Inner CV: tune on the outer training fold only
        grid = GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            cv=inner_cv,
            scoring='accuracy'
        )
        grid.fit(X_train_outer, y_train_outer)

        # Evaluate the tuned model on the held-out outer fold
        score = grid.best_estimator_.score(X_test_outer, y_test_outer)
        outer_scores.append(score)

    return np.array(outer_scores)

# Run nested CV
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 15]
}

nested_scores = nested_cv(
    RandomForestClassifier(random_state=42),
    param_grid_rf,
    X, y,
    outer_cv=5,
    inner_cv=3
)

print(f"Nested CV scores: {nested_scores}")
print(f"Mean: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")

8.7 Common Hyperparameters by Algorithm

8.7.1 Random Forest

  • n_estimators: Number of trees (50-500)
  • max_depth: Tree depth (3-20 or None)
  • min_samples_split: Min samples to split (2-20)
  • min_samples_leaf: Min samples in leaf (1-10)
  • max_features: Features per split ('sqrt', 'log2', or float)

8.7.2 Logistic Regression

  • C: Inverse regularization strength (0.001-100)
  • penalty: 'l1', 'l2', or 'elasticnet'
  • solver: 'liblinear', 'lbfgs', 'saga'

8.7.3 SVM

  • C: Regularization (0.1-100)
  • kernel: 'linear', 'rbf', 'poly'
  • gamma: Kernel coefficient ('scale', 'auto', or float)

8.7.4 Gradient Boosting

  • n_estimators: Number of boosting stages (50-500)
  • learning_rate: Shrinks contribution (0.01-0.5)
  • max_depth: Tree depth (3-10)
  • subsample: Fraction of samples (0.5-1.0)
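
These ranges translate directly into search spaces. As a sketch, the Gradient Boosting ranges above can be written as sampling distributions for RandomizedSearchCV (the specific distributions are illustrative choices, and scipy is assumed to be available):

from scipy.stats import loguniform, randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(50, 500),        # integers in [50, 500)
    'learning_rate': loguniform(0.01, 0.5),  # sampled on a log scale
    'max_depth': randint(3, 10),             # integers in [3, 10)
    'subsample': uniform(0.5, 0.5)           # uniform on [0.5, 1.0]
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=20,            # number of sampled combinations
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)
print(random_search.best_params_)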

8.8 Tips for Efficient Tuning

  1. Start coarse, then refine
    • First: Wide range, few values
    • Then: Narrow range around best value
  2. Use RandomizedSearchCV for initial exploration
    • Faster than GridSearchCV
    • Good for large search spaces
  3. Monitor for overfitting (see the sketch after this list)
    • Compare train, CV, and test scores
    • Large gaps indicate overfitting
  4. Use appropriate CV folds
    • 5-fold is standard
    • 10-fold for small datasets
    • 3-fold for large datasets (faster)
  5. Parallelize with n_jobs=-1
    • Uses all CPU cores
    • Significant speedup
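
One way to act on tip 3: GridSearchCV can record training-fold scores alongside validation scores when return_train_score=True, which makes train/CV gaps easy to inspect. A small sketch with an illustrative one-parameter grid:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'max_depth': [5, 10, None]},   # illustrative grid
    cv=5,
    scoring='accuracy',
    return_train_score=True         # also score on the training folds
)
grid.fit(X_train, y_train)

results = pd.DataFrame(grid.cv_results_)
# A large gap between mean_train_score and mean_test_score flags overfitting
print(results[['param_max_depth', 'mean_train_score', 'mean_test_score']])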

8.9 Scoring Metrics for Different Problems

# For classification
scoring_options = [
    'accuracy',           # Overall correctness
    'precision',          # Positive predictive value
    'recall',            # Sensitivity
    'f1',                # Harmonic mean of precision/recall
    'roc_auc',           # Area under ROC curve
]

# For regression
scoring_options_regression = [
    'neg_mean_absolute_error',        # MAE (negative because higher is better)
    'neg_mean_squared_error',         # MSE
    'neg_root_mean_squared_error',    # RMSE
    'r2',                             # R² score
]

print("Classification scoring options:", scoring_options)
print("\nRegression scoring options:", scoring_options_regression)

8.10 Summary

  • Grid Search: Exhaustive, guaranteed to find best in grid
  • Randomized Search: Faster, good for large spaces
  • Nested CV: Unbiased performance estimate
  • Compare multiple algorithms before deep tuning
  • Start simple: Baseline → tune → ensemble
  • Watch for overfitting: CV score >> test score is bad
  • Use appropriate scoring metric for your problem

Next: Real-world considerations and production tips!