7  Unsupervised Learning

Unsupervised learning discovers hidden patterns in unlabeled data.

7.1 Algorithms Covered

  1. K-Means Clustering
  2. Hierarchical Clustering
  3. DBSCAN
  4. Principal Component Analysis (PCA)
  5. Anomaly Detection (Isolation Forest)

7.2 Part 1: Clustering

Goal: Group similar data points together without predefined labels.

7.2.1 Dataset: Customer Segmentation

To demonstrate clustering algorithms, we’ll create a synthetic customer dataset with features like age and income. In a real business scenario, identifying customer segments helps companies tailor marketing strategies, personalize offerings, and improve customer retention. Unlike supervised learning, we don’t have predefined labels—the algorithms will discover natural groupings in the data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Create synthetic customer data
np.random.seed(42)

# Three customer segments
segment1 = np.random.multivariate_normal([30, 25000], [[5, 0], [0, 5000]], 100)  # Young, low income
segment2 = np.random.multivariate_normal([45, 75000], [[7, 0], [0, 10000]], 100)  # Middle-age, high income
segment3 = np.random.multivariate_normal([60, 45000], [[8, 0], [0, 8000]], 100)  # Senior, medium income

# Combine
data = np.vstack([segment1, segment2, segment3])

df = pd.DataFrame(data, columns=['age', 'annual_income'])

# Add more features
df['spending_score'] = (
    df['annual_income'] / 1000
    + np.random.normal(0, 10, len(df))
)

print(df.head())
print(f"\nDataset shape: {df.shape}")
print(df.describe())

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(df['age'], df['annual_income'], alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('Customer Data (Unlabeled)')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

7.2.2 Prepare Data

Clustering algorithms like K-Means and DBSCAN rely on distance calculations between data points. When features have different scales (e.g., age in years vs. income in thousands of dollars), the larger-scale features can dominate the distance calculations. Scaling ensures all features contribute equally to finding meaningful clusters.

# Scale features (important for distance-based algorithms!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

print(f"Scaled data shape: {X_scaled.shape}")
print(f"First 5 rows:\n{X_scaled[:5]}")

7.3 1. K-Means Clustering

Concept: Partition data into K clusters by minimizing within-cluster variance.

7.3.1 How K-Means Works

  1. Initialize K cluster centers randomly
  2. Assign each point to nearest center
  3. Update centers to mean of assigned points
  4. Repeat steps 2-3 until convergence (see the sketch below)
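
Here is a minimal from-scratch sketch of that loop (illustrative only; the function and variable names are our own). scikit-learn's KMeans, used next, implements the same idea with the smarter k-means++ initialization and multiple restarts (n_init).

# Minimal K-Means sketch (illustrative; scikit-learn's KMeans is used below)
def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize K centers by picking K random points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels_demo, centers_demo = simple_kmeans(X_scaled, k=3)
print(np.bincount(labels_demo))
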
from sklearn.cluster import KMeans

# Fit K-Means with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Get cluster assignments
df['cluster_kmeans'] = kmeans.labels_

print(f"Cluster sizes:")
print(df['cluster_kmeans'].value_counts().sort_index())

7.3.2 Visualize Clusters

# Visualize clusters
plt.figure(figsize=(14, 6))

# Original data
plt.subplot(1, 2, 1)
plt.scatter(df['age'], df['annual_income'], alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('Original Data (No Labels)')
plt.grid(alpha=0.3)

# K-Means clusters
plt.subplot(1, 2, 2)
scatter = plt.scatter(df['age'], df['annual_income'],
                      c=df['cluster_kmeans'], cmap='viridis', alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('K-Means Clustering (K=3)')
plt.colorbar(scatter, label='Cluster')
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

7.3.3 Finding Optimal K: Elbow Method

# Try different K values
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.title('Elbow Method for Optimal K')
plt.grid(alpha=0.3)
plt.xticks(K_range)
plt.tight_layout()
plt.show()

print("Look for the 'elbow' where inertia starts decreasing more slowly")

7.3.4 Silhouette Score

Better metric: the silhouette score measures how similar each point is to its own cluster compared with the nearest other cluster. It ranges from -1 to 1, and higher is better.
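
For intuition, the sketch below (our own illustration) computes the silhouette directly: for each point i, a(i) is the mean distance to the other points in its own cluster, b(i) is the mean distance to the points of the nearest other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). The average should agree with scikit-learn's silhouette_score.

# Manual silhouette computation for the K=3 labels (illustrative sketch)
from sklearn.metrics import pairwise_distances, silhouette_score

labels = df['cluster_kmeans'].values
D = pairwise_distances(X_scaled)          # all pairwise Euclidean distances
scores = []
for i in range(len(X_scaled)):
    same = labels == labels[i]
    same[i] = False                       # exclude the point itself
    a = D[i, same].mean()                 # mean intra-cluster distance
    b = min(D[i, labels == other].mean()  # mean distance to the nearest other cluster
            for other in set(labels) if other != labels[i])
    scores.append((b - a) / max(a, b))

print(f"Manual mean silhouette:   {np.mean(scores):.3f}")
print(f"sklearn silhouette_score: {silhouette_score(X_scaled, labels):.3f}")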

from sklearn.metrics import silhouette_score

silhouette_scores = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)

plt.figure(figsize=(10, 6))
plt.plot(range(2, 11), silhouette_scores, 'go-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs K')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Best K (by silhouette): {silhouette_scores.index(max(silhouette_scores)) + 2}")

7.4 2. Hierarchical Clustering

Concept: Build a hierarchy of clusters (bottom-up or top-down).

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Fit hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
df['cluster_hierarchical'] = hierarchical.fit_predict(X_scaled)

# Create dendrogram (use subset for clarity)
plt.figure(figsize=(12, 6))
linkage_matrix = linkage(X_scaled[:50], method='ward')
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram (First 50 samples)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()

# Visualize clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['age'], df['annual_income'],
                      c=df['cluster_hierarchical'], cmap='plasma', alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('Hierarchical Clustering')
plt.colorbar(scatter, label='Cluster')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Linkage methods:

  • ward: Minimizes variance (most common)
  • complete: Maximum distance between clusters
  • average: Average distance between clusters
  • single: Minimum distance between clusters
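
The dendrogram is not just a picture: the fitted hierarchy can be cut at any level to obtain flat cluster labels. A minimal sketch using scipy's fcluster (our own addition; the distance threshold of 10 is arbitrary and purely illustrative):

# Cut the hierarchy into flat clusters instead of fixing n_clusters up front
from scipy.cluster.hierarchy import linkage, fcluster

# Build the full tree on all scaled points, then cut it into 3 clusters
full_linkage = linkage(X_scaled, method='ward')
labels_cut = fcluster(full_linkage, t=3, criterion='maxclust')
print(f"Cluster sizes when cutting into 3: {np.bincount(labels_cut)[1:]}")

# Alternatively, cut at a distance threshold (value chosen for illustration only)
labels_dist = fcluster(full_linkage, t=10, criterion='distance')
print(f"Clusters below distance 10: {len(set(labels_dist))}")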

7.5 3. DBSCAN (Density-Based Clustering)

Concept: Finds clusters based on density (handles arbitrary shapes!).

Advantages:

  • Doesn’t require specifying number of clusters
  • Can find arbitrarily shaped clusters
  • Identifies outliers as noise

from sklearn.cluster import DBSCAN

# Fit DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
df['cluster_dbscan'] = dbscan.fit_predict(X_scaled)

print(f"Number of clusters: {len(set(df['cluster_dbscan'])) - (1 if -1 in df['cluster_dbscan'] else 0)}")
print(f"Number of noise points: {(df['cluster_dbscan'] == -1).sum()}")
print(f"\nCluster distribution:")
print(df['cluster_dbscan'].value_counts().sort_index())

# Visualize
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['age'], df['annual_income'],
                      c=df['cluster_dbscan'], cmap='tab10', alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('DBSCAN Clustering (noise points labeled -1)')
plt.colorbar(scatter, label='Cluster')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Key parameters:

  • eps: Maximum distance between two points for them to count as neighbors
  • min_samples: Minimum points to form a dense region
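
Because results are sensitive to these parameters, the short sweep below (our own addition; the eps values are illustrative, not recommendations) refits DBSCAN a few times and reports how the cluster count and the number of noise points change.

# Quick sensitivity check: how eps changes the DBSCAN result
for eps in [0.3, 0.5, 0.8, 1.2]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_scaled)
    unique = set(labels)
    n_clusters = len(unique) - (1 if -1 in unique else 0)
    n_noise = (labels == -1).sum()
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise points")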

7.6 Clustering Comparison

# Compare all three methods
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

methods = [
    ('K-Means', 'cluster_kmeans', 'viridis'),
    ('Hierarchical', 'cluster_hierarchical', 'plasma'),
    ('DBSCAN', 'cluster_dbscan', 'tab10')
]

for idx, (name, col, cmap) in enumerate(methods):
    ax = axes[idx]
    scatter = ax.scatter(df['age'], df['annual_income'],
                         c=df[col], cmap=cmap, alpha=0.6)
    ax.set_xlabel('Age')
    ax.set_ylabel('Annual Income ($)')
    ax.set_title(name)
    plt.colorbar(scatter, ax=ax, label='Cluster')
    ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

7.7 Part 2: Dimensionality Reduction

7.7.1 Principal Component Analysis (PCA)

Goal: Reduce features while preserving maximum variance.

PCA is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while retaining as much information as possible. It identifies the directions (principal components) along which data varies the most. This is incredibly useful for visualizing high-dimensional data, reducing computation time, and removing noise from datasets.

Use cases:

  • Visualization (reduce to 2D/3D)
  • Noise reduction
  • Speed up training
  • Feature extraction
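
To make “directions of maximum variance” concrete, here is a small standalone check (our own addition, on hypothetical random data) showing that PCA’s explained variances are exactly the eigenvalues of the data’s covariance matrix, in decreasing order.

# PCA explained variances are the eigenvalues of the covariance matrix
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated features

pca_demo = PCA().fit(X_demo)
eigvals = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]  # descending order

print("PCA explained_variance_:", np.round(pca_demo.explained_variance_, 3))
print("Covariance eigenvalues: ", np.round(eigvals, 3))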

from sklearn.decomposition import PCA

# Create higher-dimensional data
np.random.seed(42)
X_high_dim = np.random.randn(200, 10)

# Fit PCA
pca = PCA()
pca.fit(X_high_dim)

# Explained variance
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, 11), explained_var)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Variance Explained by Each Component')
axes[0].grid(axis='y', alpha=0.3)

# Cumulative variance
axes[1].plot(range(1, 11), cumulative_var, 'bo-', linewidth=2, markersize=8)
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% variance')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance Explained')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Components needed for 95% variance: {np.argmax(cumulative_var >= 0.95) + 1}")

7.7.2 PCA for Visualization

# Reduce customer data to 2D
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)

df['pca1'] = X_pca[:, 0]
df['pca2'] = X_pca[:, 1]

print(f"Variance explained by 2 components: {pca_2d.explained_variance_ratio_.sum():.1%}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Original features
axes[0].scatter(df['age'], df['annual_income'], c=df['cluster_kmeans'], cmap='viridis', alpha=0.6)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Annual Income')
axes[0].set_title('Original Features')
axes[0].grid(alpha=0.3)

# PCA features
scatter = axes[1].scatter(df['pca1'], df['pca2'], c=df['cluster_kmeans'], cmap='viridis', alpha=0.6)
axes[1].set_xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.1%} variance)')
axes[1].set_ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.1%} variance)')
axes[1].set_title('PCA Features')
axes[1].grid(alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Cluster')

plt.tight_layout()
plt.show()
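
Projecting three features down to two necessarily discards some information. The check below (our own addition) maps the 2-D points back into the original scaled feature space with inverse_transform and reports the reconstruction error alongside the share of variance the two components leave out.

# Reconstruct the scaled features from the 2-D projection and measure the loss
X_reconstructed = pca_2d.inverse_transform(X_pca)
reconstruction_mse = ((X_scaled - X_reconstructed) ** 2).mean()
print(f"Reconstruction MSE (scaled units): {reconstruction_mse:.4f}")
print(f"Variance not captured by PC1+PC2: {1 - pca_2d.explained_variance_ratio_.sum():.1%}")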

7.8 Part 3: Anomaly Detection

Goal: Identify unusual data points (outliers).

Anomaly detection (also called outlier detection) finds data points that deviate significantly from the norm. These could be fraudulent transactions, faulty sensors, network intrusions, or manufacturing defects. Isolation Forest is a popular algorithm that works by isolating anomalies—unusual points are easier to isolate than normal points, requiring fewer splits in a decision tree structure.

from sklearn.ensemble import IsolationForest

# Create data with outliers
np.random.seed(42)
X_normal = np.random.randn(200, 2) * 0.5
X_outliers = np.random.uniform(-4, 4, (20, 2))
X_anomaly = np.vstack([X_normal, X_outliers])

# Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
anomaly_labels = iso_forest.fit_predict(X_anomaly)

# -1 = anomaly, 1 = normal
df_anomaly = pd.DataFrame(X_anomaly, columns=['feature1', 'feature2'])
df_anomaly['is_anomaly'] = (anomaly_labels == -1)

print(f"Anomalies detected: {df_anomaly['is_anomaly'].sum()} / {len(df_anomaly)}")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(df_anomaly[~df_anomaly['is_anomaly']]['feature1'],
            df_anomaly[~df_anomaly['is_anomaly']]['feature2'],
            c='blue', label='Normal', alpha=0.6)
plt.scatter(df_anomaly[df_anomaly['is_anomaly']]['feature1'],
            df_anomaly[df_anomaly['is_anomaly']]['feature2'],
            c='red', label='Anomaly', marker='x', s=100, linewidths=3)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Anomaly Detection with Isolation Forest')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
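
The scores behind these decisions are available too. The snippet below (our own addition) prints the average decision_function value for flagged and non-flagged points; lower scores correspond to points the trees isolate with fewer splits.

# Anomaly scores: lower decision_function values mean "easier to isolate"
scores = iso_forest.decision_function(X_anomaly)
normal_mask = anomaly_labels == 1
print(f"Mean score, normal points:  {scores[normal_mask].mean():.3f}")
print(f"Mean score, flagged points: {scores[~normal_mask].mean():.3f}")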

Use cases:

  • Fraud detection
  • System monitoring
  • Quality control
  • Network intrusion detection

7.9 Clustering Evaluation Metrics

7.9.1 For datasets with true labels

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# True segment labels (the data was generated as three stacked segments of 100 points each)
true_labels = np.concatenate([np.zeros(100), np.ones(100), np.full(100, 2)])

# Evaluate K-Means
ari = adjusted_rand_score(true_labels, df['cluster_kmeans'])
nmi = normalized_mutual_info_score(true_labels, df['cluster_kmeans'])

print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")

7.9.2 For datasets without true labels

  • Silhouette Score: Higher is better (-1 to 1)
  • Davies-Bouldin Index: Lower is better
  • Calinski-Harabasz Index: Higher is better

from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

silhouette = silhouette_score(X_scaled, df['cluster_kmeans'])
davies_bouldin = davies_bouldin_score(X_scaled, df['cluster_kmeans'])
calinski = calinski_harabasz_score(X_scaled, df['cluster_kmeans'])

print(f"Silhouette Score: {silhouette:.3f} (higher is better)")
print(f"Davies-Bouldin Index: {davies_bouldin:.3f} (lower is better)")
print(f"Calinski-Harabasz Index: {calinski:.3f} (higher is better)")

7.10 When to Use Each Algorithm

Algorithm          Use When                         Pros                        Cons
K-Means            Spherical clusters, known K      Fast, scalable              Requires K, assumes spherical clusters
Hierarchical       Want a dendrogram, small data    No need to specify K        Slow on large datasets
DBSCAN             Arbitrary shapes, outliers       Finds outliers, any shape   Hard to tune parameters
PCA                High dimensions, visualization   Reduces dimensions          Loses interpretability
Isolation Forest   Anomaly detection                Handles high dimensions     Needs contamination estimate

7.11 Summary

  • Clustering groups similar points without labels
  • K-Means is fast and simple (use elbow method for K)
  • DBSCAN handles arbitrary shapes and finds outliers
  • PCA reduces dimensions while preserving variance
  • Anomaly detection finds unusual patterns
  • Evaluate clusters with silhouette score when no true labels exist

Next: Hyperparameter tuning and model selection!