Dimensionality Reduction
Techniques for reducing the number of features while preserving important information
Dimensionality reduction is a crucial technique in machine learning that transforms high-dimensional data into a lower-dimensional form while preserving essential information.
Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction technique that finds the directions of maximum variance in high-dimensional data.
Mathematical Foundation
The principal components are the eigenvectors of the covariance matrix:

Σ = (1/n) ∑_{i=1}^{n} (x_i − μ)(x_i − μ)^T

where Σ is the covariance matrix, n is the number of samples, x_i are the data points, and μ is the sample mean.
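To make the link between the covariance matrix and the components concrete, here is a minimal NumPy sketch on hypothetical toy data (sklearn's implementation uses an SVD instead, but the result is equivalent up to sign):

import numpy as np

# Toy data: 100 correlated 3-D points (illustrative only)
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))

# Center the data and form the covariance matrix Σ
mu = data.mean(axis=0)
centered = data - mu
cov = centered.T @ centered / len(data)

# The principal components are Σ's eigenvectors, sorted by eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # sort by descending variance
components = eigvecs[:, order].T         # one component per row

# Project onto the top two components
projected = centered @ components[:2].T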
Implementation Example
from sklearn.decomposition import PCA
import numpy as np
# Create sample data: 100 points lying near a 2-D subspace of 5-D space
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
X += 0.1 * rng.normal(size=(100, 5))  # small noise off the subspace
# Initialize and fit PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Check explained variance ratio
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is a non-linear dimensionality reduction technique that's particularly good at visualizing high-dimensional data.
How it Works
t-SNE converts similarities between data points into joint probabilities and then minimizes the Kullback-Leibler divergence between the joint probabilities of the high-dimensional data and those of the low-dimensional embedding.
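Concretely, with p_ij the joint probabilities in the original space and q_ij those in the embedding, t-SNE minimizes the cost:

C = KL(P ∥ Q) = ∑_{i≠j} p_ij log(p_ij / q_ij)

where q_ij is defined through a heavy-tailed Student-t kernel, q_ij ∝ (1 + ‖y_i − y_j‖²)^{-1}, which gives moderately distant points room to spread out in the embedding.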
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
# Visualize results
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.title('t-SNE visualization')
plt.show()
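t-SNE is sensitive to its perplexity parameter, which loosely sets how many neighbors each point considers (typical values run from 5 to 50, and it must be smaller than the number of samples). A small sweep, sketched below, is a cheap way to check that the structure you see is stable rather than an artifact of one setting:

# Compare embeddings across a few perplexity values
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, perp in zip(axes, (5, 30, 50)):
    emb = TSNE(n_components=2, perplexity=perp, random_state=42).fit_transform(X)
    ax.scatter(emb[:, 0], emb[:, 1], s=10)
    ax.set_title(f'perplexity={perp}')
plt.show()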
UMAP (Uniform Manifold Approximation and Projection)
UMAP is a newer dimensionality reduction technique that typically runs faster than t-SNE and tends to preserve more of the global structure, particularly on large datasets.
Mathematical Intuition
UMAP constructs a fuzzy topological representation of the data, where the membership strength of the edge between points x_i and x_j is:

w(x_i, x_j) = exp(−max(0, d(x_i, x_j) − ρ_i) / σ_i)

where d(x_i, x_j) is the distance between points, ρ_i is the distance from x_i to its nearest neighbor, and σ_i is a per-point scaling factor chosen to normalize the local neighborhood.
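As a rough illustration (not the library's actual implementation, which also tunes σ_i by binary search to hit a target connectivity), the weights for one point's neighbors might be computed like this:

import numpy as np

def membership_weights(dists, sigma):
    """Fuzzy membership strengths for one point's sorted neighbor distances.

    dists: distances to the k nearest neighbors, ascending.
    sigma: local scaling factor (UMAP tunes this per point; fixed here).
    """
    rho = dists[0]  # distance to the nearest neighbor
    return np.exp(-np.maximum(0.0, dists - rho) / sigma)

# Example: five neighbor distances for a single point (made-up values)
d = np.array([0.5, 0.8, 1.1, 1.9, 2.4])
print(membership_weights(d, sigma=1.0))  # nearest neighbor gets weight 1.0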
Code Example
import umap  # provided by the umap-learn package
# Initialize and fit UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)
# Visualization
plt.scatter(X_umap[:, 0], X_umap[:, 1])
plt.title('UMAP projection')
plt.show()
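In practice, the two parameters worth tuning first are n_neighbors (larger values favor global structure over local detail) and min_dist (smaller values pack similar points more tightly in the embedding). For example:

# Emphasize global structure: larger neighborhood, looser packing
reducer_global = umap.UMAP(n_neighbors=50, min_dist=0.5, random_state=42)
X_umap_global = reducer_global.fit_transform(X)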
Best Practices
- Scaling: Scale your features before applying dimensionality reduction; PCA in particular is variance-driven, so features with large ranges will otherwise dominate the components:
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
- Choosing Components: Use explained variance ratio to determine the optimal number of components:
cumsum = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumsum >= 0.95) + 1  # smallest count reaching 95% variance
- Validation: Always validate that the reduced representation preserves important patterns in your data (see the sketch after this list):
- Check explained variance ratio
- Visualize the results
- Validate downstream task performance
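Putting these together, here is a minimal sketch of scaling, component selection, and downstream validation in one pipeline; note that sklearn's PCA also accepts a float n_components such as 0.95 to pick the component count for you. The dataset and classifier below are placeholder choices for illustration:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scale, keep enough components for 95% of the variance, then classify
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)

# Validate that the reduced representation still supports the downstream task
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f}")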