3  Your First Model - Decision Trees

3.1 Why Start with Decision Trees?

Decision trees are perfect for beginners because:

  • Intuitive: they work like a flowchart of yes/no questions
  • Visual: the learned tree is easy to draw and interpret
  • Flexible: they handle both regression and classification
  • Minimal preprocessing: no feature scaling is required (categorical features still need to be encoded as numbers)

3.2 Decision Tree Intuition

Imagine predicting if someone will buy a product:

Is age > 30?
├─ Yes → Is income > 50k?
│  ├─ Yes → BUY (90% confidence)
│  └─ No → DON'T BUY (70% confidence)
└─ No → Is student?
   ├─ Yes → BUY (60% confidence)
   └─ No → DON'T BUY (80% confidence)

The algorithm automatically learns these decision rules from data!
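
Before we let the algorithm do that, it helps to see the flowchart written as plain Python. The thresholds and confidence values below are the hypothetical ones from the diagram above, not rules learned from any data.

def will_buy(age, income, is_student):
    """Hand-coded version of the example tree above (hypothetical rules)."""
    if age > 30:
        if income > 50_000:
            return "BUY", 0.90        # 90% confidence
        return "DON'T BUY", 0.70      # 70% confidence
    if is_student:
        return "BUY", 0.60            # 60% confidence
    return "DON'T BUY", 0.80          # 80% confidence

print(will_buy(age=35, income=60_000, is_student=False))  # ('BUY', 0.9)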

3.3 Regression Example: Predicting House Prices

Dataset: We’ll use the California Housing dataset, which contains real data from the 1990 California census. This dataset includes median house values for California districts along with features like median income, house age, and location.

Why this dataset? It’s a real-world problem that’s easy to understand—predicting house prices based on neighborhood characteristics. The patterns you discover here mirror actual factors that influence housing markets.

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

# Load real California housing data
housing = fetch_california_housing(as_frame=True)
data = housing.frame

# Select a few easy-to-interpret features for our first model
# MedInc: median income in the block group (in tens of thousands of dollars)
# HouseAge: median house age in the block group (in years)
# AveRooms: average number of rooms per household
# MedHouseVal: median house value (the target, in units of $100,000)

# Keep it simple: start with just these 3 features
data_subset = data[['MedInc', 'HouseAge', 'AveRooms', 'MedHouseVal']].copy()

# Convert target to actual dollars for easier interpretation
data_subset['MedHouseVal'] = data_subset['MedHouseVal'] * 100000

# Remove extreme outliers to make visualization clearer
data_subset = data_subset[data_subset['AveRooms'] < 10]

print("First 5 houses in our dataset:")
print(data_subset.head())
print(f"\nDataset shape: {data_subset.shape}")
print(f"\nPrice range: ${data_subset['MedHouseVal'].min():,.0f} - ${data_subset['MedHouseVal'].max():,.0f}")

3.3.1 Training the Model

What we’re doing: We’ll split our data into training (80%) and testing (20%) sets. The model learns patterns from the training data, and we evaluate its performance on the unseen test data. This simulates how well our model will work on brand new houses it’s never seen before.

# Separate features (X) and target (y)
X = data_subset[['MedInc', 'HouseAge', 'AveRooms']]
y = data_subset['MedHouseVal']

# Split data: 80% training, 20% testing
# random_state ensures we get the same split every time (reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} houses")
print(f"Test set: {X_test.shape[0]} houses")

# Create and train the model
# max_depth=5 limits tree depth to prevent overfitting
model = DecisionTreeRegressor(random_state=42, max_depth=5)
model.fit(X_train, y_train)

print("\nModel trained successfully!")
print("The tree learned to ask up to 5 questions to predict house prices.")

3.3.2 Making Predictions

# Predict on test set
predictions = model.predict(X_test)

# Compare the first 5 predictions with the actual values
comparison = pd.DataFrame({
    'Actual': y_test.iloc[:5].values,
    'Predicted': predictions[:5],
    'Difference': y_test.iloc[:5].values - predictions[:5]
})

print(comparison)

3.3.3 Model Evaluation

# Calculate metrics
mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"Root Mean Squared Error: ${rmse:,.2f}")
print(f"R² Score: {r2:.3f}")

Interpreting the metrics:

  • MAE: the average prediction error, in dollars
  • RMSE: like MAE, but penalizes large errors more heavily
  • R² score: how much of the variance is explained (1.0 = perfect, 0.0 = baseline)
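
If you want to see exactly what these formulas compute, the short sketch below recomputes each metric by hand from the y_test and predictions arrays defined above; the results should match the sklearn output.

# Sanity check: compute the same metrics directly with NumPy
errors = y_test.values - predictions
mae_manual = np.mean(np.abs(errors))                 # mean absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))          # root mean squared error
ss_res = np.sum(errors ** 2)                         # residual sum of squares
ss_tot = np.sum((y_test.values - y_test.values.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(f"MAE (manual): ${mae_manual:,.2f}")
print(f"RMSE (manual): ${rmse_manual:,.2f}")
print(f"R² (manual): {r2_manual:.3f}")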

3.4 Classification Example: Titanic Survival Prediction

Dataset: The Titanic dataset contains real passenger data from the 1912 tragedy. We’ll predict whether a passenger survived based on features like age, sex, ticket class, and fare paid.

Why this dataset? It’s a famous classification problem that teaches important concepts: class imbalance, categorical features, and the harsh reality that some features (like passenger class) had life-or-death implications.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
import seaborn as sns

# Load Titanic dataset (built into seaborn)
titanic = sns.load_dataset('titanic')

# Select relevant features and remove missing values
features = ['pclass', 'age', 'fare', 'sex', 'sibsp', 'parch']
titanic_clean = titanic[features + ['survived']].dropna()

# Convert sex to numeric (male=1, female=0)
titanic_clean['sex'] = (titanic_clean['sex'] == 'male').astype(int)

print("First 5 passengers:")
print(titanic_clean.head())
print(f"\nSurvival rate: {titanic_clean['survived'].mean():.1%}")
print(f"Dataset size: {len(titanic_clean)} passengers")

3.4.1 Training the Classifier

Classification vs Regression: Unlike regression (predicting numbers), classification predicts categories. Here, we’re predicting two classes: survived (1) or died (0).

# Prepare data
X = titanic_clean[features]
y = titanic_clean['survived']

# Split data
# stratify=y ensures both train and test have similar survival rates
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} passengers")
print(f"Test set: {len(X_test)} passengers")

# Train decision tree classifier
classifier = DecisionTreeClassifier(random_state=42, max_depth=4)
classifier.fit(X_train, y_train)

# Predict on test set
predictions = classifier.predict(X_test)

print("\nClassification model trained!")
print("The tree learned patterns like: '1st class females had high survival rates'")

3.4.2 Evaluating the Classifier

# Accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")

# Detailed report
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=['Died', 'Survived']))
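
The classifier can also report a probability for each class, analogous to the "confidence" percentages in the intuition example from Section 3.2. A minimal sketch:

# Probability of [died, survived] for the first 5 test passengers
probabilities = classifier.predict_proba(X_test)
print(probabilities[:5])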

3.5 Visualizing the Decision Tree

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
plot_tree(
    classifier,
    feature_names=X.columns,
    class_names=['Died', 'Survived'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title("Titanic Survival Decision Tree")
plt.tight_layout()
plt.show()

3.6 Feature Importance

Which features matter most?

# Get feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': classifier.feature_importances_
}).sort_values('Importance', ascending=False)

print(importance)

# Visualize
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.barh(importance['Feature'], importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance in Titanic Survival')
plt.tight_layout()
plt.show()

3.7 Key Hyperparameters

# Common parameters to control tree complexity
models = {
    'Default': DecisionTreeClassifier(random_state=42),
    'max_depth=3': DecisionTreeClassifier(max_depth=3, random_state=42),
    'min_samples_split=50': DecisionTreeClassifier(min_samples_split=50, random_state=42),
    'min_samples_leaf=20': DecisionTreeClassifier(min_samples_leaf=20, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    print(f"{name:25} Accuracy: {accuracy:.2%}")

Important parameters:

  • max_depth: maximum tree depth (prevents overfitting)
  • min_samples_split: minimum number of samples required to split a node
  • min_samples_leaf: minimum number of samples required in a leaf node
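
Comparing models on a single test split can be noisy. One common way to choose these values is cross-validation on the training set; here is a small sketch, where the candidate depths are arbitrary choices for illustration.

from sklearn.model_selection import cross_val_score

# Compare a few candidate depths using 5-fold cross-validation on the training data
for depth in [2, 3, 4, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X_train, y_train, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.2%}")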

3.8 When to Use Decision Trees

Pros:

  • Easy to interpret and visualize
  • No feature scaling needed
  • Handles non-linear relationships
  • Works with mixed data types

Cons:

  • Prone to overfitting (demonstrated below)
  • Unstable: small changes in the data can produce a very different tree
  • Usually not the most accurate model on its own
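
To see the overfitting problem concretely, here is a small sketch comparing an unrestricted tree with the depth-limited classifier trained earlier. The exact numbers will vary, but an unrestricted tree typically scores far better on the training set than on the test set.

# A fully grown tree tends to memorize the training data
deep_tree = DecisionTreeClassifier(random_state=42)   # no max_depth limit
deep_tree.fit(X_train, y_train)

print(f"Unrestricted tree - train: {deep_tree.score(X_train, y_train):.2%}, "
      f"test: {deep_tree.score(X_test, y_test):.2%}")
print(f"max_depth=4 tree  - train: {classifier.score(X_train, y_train):.2%}, "
      f"test: {classifier.score(X_test, y_test):.2%}")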

Next chapter: We’ll learn how to measure and prevent overfitting!

3.9 Summary

  • Decision trees make predictions using a series of questions
  • DecisionTreeRegressor for continuous targets
  • DecisionTreeClassifier for categorical targets
  • Key metrics: MAE, RMSE, R² (regression); Accuracy, Precision, Recall (classification)
  • Control complexity with max_depth, min_samples_split, etc.
  • Always evaluate on held-out test data