Dimensionality Reduction

Techniques for reducing the number of features while preserving important information

Dimensionality reduction is a crucial technique in machine learning that transforms high-dimensional data into a lower-dimensional form while preserving essential information.

Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique that finds the directions of maximum variance in high-dimensional data.

Mathematical Foundation

The principal components are the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue (i.e., by the variance they explain):

Σ = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T

where Σ is the covariance matrix, n is the number of samples, x_i are the data points, and μ is the mean.
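
To make this concrete, here is a minimal NumPy sketch (the toy data values are purely illustrative) that recovers the principal components by eigendecomposition of Σ:

import numpy as np

# Toy data: 5 samples, 3 features (values are illustrative)
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

mu = X.mean(axis=0)                   # μ: per-feature mean
Xc = X - mu                           # center the data
Sigma = Xc.T @ Xc / (len(X) - 1)      # Σ as defined above

# The principal components are the eigenvectors of Σ;
# eigh returns eigenvalues in ascending order, so reverse
eigvals, eigvecs = np.linalg.eigh(Sigma)
components = eigvecs[:, ::-1]         # columns sorted by decreasing variance
print(components[:, :2])              # top two principal components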

Implementation Example

from sklearn.decomposition import PCA
import numpy as np

# Create sample data: 100 samples with 5 features
# (enough samples for the t-SNE and UMAP examples below)
rng = np.random.RandomState(42)
X = rng.rand(100, 5)

# Initialize and fit PCA, keeping the top 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Check how much variance each kept component explains
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is a non-linear dimensionality reduction technique that's particularly good at visualizing high-dimensional data.

How it Works

t-SNE converts similarities between data points into joint probabilities and then minimizes the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and those of the low-dimensional embedding.
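
As a rough illustration of the first step, here is a simplified NumPy sketch of those high-dimensional joint probabilities; real t-SNE chooses a per-point bandwidth σ_i by binary search to match a target perplexity, whereas this sketch assumes a single global σ:

import numpy as np

def joint_probabilities(X, sigma=1.0):
    """Simplified p_ij: Gaussian affinities with one global sigma
    (t-SNE proper tunes sigma_i per point to a target perplexity)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)       # row-normalize: p_{j|i}
    return (P + P.T) / (2 * len(X))         # symmetrize to joint p_ij

In practice you would use the optimized scikit-learn implementation: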

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Apply t-SNE (perplexity must be smaller than the number of samples)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)

# Visualize results
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.title('t-SNE visualization')
plt.show()

UMAP (Uniform Manifold Approximation and Projection)

UMAP is a more recent dimensionality reduction technique that typically runs faster than t-SNE on large datasets and often preserves more of the data's global structure.

Mathematical Intuition

UMAP constructs a fuzzy topological representation using:

p_{j|i} = \exp\left(-\frac{\max(0, d(x_i, x_j) - \rho_i)}{\sigma_i}\right)

where d(x_i, x_j) is the distance between points, ρ_i is the distance from x_i to its nearest neighbor, and σ_i is a per-point scaling factor chosen so that each point's neighborhood has a comparable effective size.
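
A small sketch of this computation for a single point; the function and inputs here are illustrative, and umap-learn itself finds σ_i by binary search internally:

import numpy as np

def fuzzy_membership(dists, rho, sigma):
    """Membership strengths p_{j|i} for one point, given its distances
    to its neighbors, nearest-neighbor distance rho, and scale sigma."""
    return np.exp(-np.maximum(0.0, dists - rho) / sigma)

# Illustrative distances from a point to its 5 nearest neighbors
dists = np.array([0.1, 0.15, 0.2, 0.4, 0.8])
print(fuzzy_membership(dists, rho=dists.min(), sigma=0.3))

Note that the nearest neighbor always receives membership 1, which is exactly what the max(0, ·) term in the formula guarantees.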

Code Example

import umap  # provided by the umap-learn package
import matplotlib.pyplot as plt

# Initialize and fit UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)

# Visualization
plt.scatter(X_umap[:, 0], X_umap[:, 1])
plt.title('UMAP projection')
plt.show()

Best Practices

  1. Scaling: Always scale your features before applying dimensionality reduction:
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
  2. Choosing Components: Fit PCA with all components kept, then use the cumulative explained variance ratio to find the smallest number of components that retains the variance you need (95% here):
pca = PCA().fit(X_scaled)  # no n_components: keep everything so the ratios cover the full variance
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1
  3. Validation: Always validate that the reduced representation preserves important patterns in your data (see the sketch after this list):
    • Check the explained variance ratio
    • Visualize the results
    • Validate downstream task performance
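
Putting these practices together, here is a minimal end-to-end sketch; the digits dataset and the 95% variance threshold are illustrative choices, not requirements:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset; substitute your own
X, y = load_digits(return_X_y=True)

# 1. Scale, then inspect the cumulative variance curve of a full PCA
X_scaled = StandardScaler().fit_transform(X)
cumsum = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# 2. Smallest number of components retaining 95% of the variance
n_components = int(np.argmax(cumsum >= 0.95)) + 1
print(f"Components for 95% variance: {n_components}")

# 3. Validate downstream task performance with and without reduction
clf = LogisticRegression(max_iter=1000)
full = cross_val_score(make_pipeline(StandardScaler(), clf), X, y).mean()
reduced = cross_val_score(
    make_pipeline(StandardScaler(), PCA(n_components=n_components), clf), X, y
).mean()
print(f"Accuracy (full): {full:.3f}, (reduced): {reduced:.3f}")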