Introduction to Unsupervised Learning
This section provides an overview of unsupervised learning, including core concepts, algorithms, and implementation examples.
Core Concepts
1. What is Unsupervised Learning?
Unsupervised learning is a type of machine learning where the algorithm learns patterns from unlabeled data. Unlike supervised learning, there are no explicit target variables or labels to predict. Instead, the goal is to discover hidden structures, patterns, or relationships within the data.
Key characteristics of unsupervised learning:
- No labeled training data
- Focus on finding patterns and structure
- Multiple possible interpretations
- Evaluation can be subjective
- Often used for exploratory analysis
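To make the "no labels" point concrete, here is a minimal sketch: the algorithm receives only the feature matrix, never a target. The make_blobs data and the cluster count of 3 are illustrative assumptions, not part of any particular method.

# A minimal sketch: clustering proceeds with no labels at all.
# make_blobs stands in here for real-world unlabeled data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # labels discarded

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(X)  # structure discovered from X alone
print(cluster_ids[:10])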
2. Types of Unsupervised Learning Tasks
- Clustering
  - Grouping similar data points together
  - Examples: K-means, hierarchical clustering, DBSCAN
- Dimensionality Reduction
  - Reducing the number of features while preserving information
  - Examples: PCA, t-SNE, UMAP
- Association Rule Learning
  - Finding relationships between variables
  - Examples: Apriori algorithm, FP-growth
- Anomaly Detection
  - Identifying unusual patterns or outliers (a short sketch follows this list)
  - Examples: Isolation Forest, One-Class SVM
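To make the anomaly detection task concrete, here is a minimal sketch using Isolation Forest. The synthetic data and the contamination rate are illustrative assumptions, not a recommended configuration.

# A short sketch of anomaly detection with Isolation Forest.
# The synthetic data and contamination rate are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.normal(loc=0, scale=1, size=(200, 2))     # typical points
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))   # unusual points
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.05, random_state=42)
pred = iso.fit_predict(X)  # -1 marks predicted outliers, 1 marks inliers
print(f"Flagged {np.sum(pred == -1)} of {len(X)} points as anomalies")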
Implementation Examples
1. Data Preprocessing
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt


class UnsupervisedPreprocessor:
    def __init__(self, scaling=True, n_components=None):
        """Initialize preprocessor with optional scaling and PCA."""
        self.scaling = scaling
        self.n_components = n_components
        self.scaler = StandardScaler() if scaling else None
        self.pca = PCA(n_components=n_components) if n_components else None

    def fit_transform(self, X):
        """Fit the preprocessing steps and transform the training data."""
        # Scale data
        if self.scaling:
            X = self.scaler.fit_transform(X)
        # Apply PCA and plot how much variance the components retain
        if self.pca:
            X = self.pca.fit_transform(X)
            self.plot_explained_variance()
        return X

    def transform(self, X):
        """Transform new data with the already-fitted steps."""
        if self.scaling:
            X = self.scaler.transform(X)
        if self.pca:
            X = self.pca.transform(X)
        return X

    def plot_explained_variance(self):
        """Plot the cumulative explained variance ratio of the fitted PCA."""
        if self.pca:
            n = len(self.pca.explained_variance_ratio_)
            plt.figure(figsize=(10, 6))
            plt.plot(
                range(1, n + 1),  # component counts start at 1, not 0
                np.cumsum(self.pca.explained_variance_ratio_),
                'bo-'
            )
            plt.xlabel('Number of Components')
            plt.ylabel('Cumulative Explained Variance Ratio')
            plt.title('PCA Explained Variance')
            plt.grid(True)
            return plt.gcf()
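A possible usage of the preprocessor on synthetic data; make_blobs and the choice of n_components=2 are illustrative assumptions, not part of the class above.

# Illustrative usage on synthetic 10-dimensional data
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=42)

preprocessor = UnsupervisedPreprocessor(scaling=True, n_components=2)
X_reduced = preprocessor.fit_transform(X)
print(X_reduced.shape)  # (500, 2)
plt.show()  # display the explained-variance plot created during fitting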
2. Data Visualization
from sklearn.manifold import TSNE


def visualize_clusters(X, labels=None, method='pca'):
    """Visualize high-dimensional data in 2D."""
    # Reduce dimensionality to two components
    if method == 'pca':
        reducer = PCA(n_components=2)
    elif method == 'tsne':
        reducer = TSNE(n_components=2, random_state=42)
    else:
        raise ValueError("Method must be 'pca' or 'tsne'")
    X_2d = reducer.fit_transform(X)

    # Plot the embedded points, colored by label if provided
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(
        X_2d[:, 0],
        X_2d[:, 1],
        c=labels,
        cmap='viridis' if labels is not None else None
    )
    if labels is not None:
        plt.colorbar(scatter)
    plt.title(f'2D Visualization using {method.upper()}')
    plt.xlabel('First Component')
    plt.ylabel('Second Component')
    return plt.gcf()
def plot_feature_correlation(X, feature_names=None):
    """Plot the feature correlation matrix as a heatmap."""
    corr = np.corrcoef(X.T)  # correlations between features (columns of X)
    plt.figure(figsize=(10, 8))
    plt.imshow(corr, cmap='coolwarm')
    plt.colorbar()
    if feature_names is not None:
        plt.xticks(
            range(len(feature_names)),
            feature_names,
            rotation=45
        )
        plt.yticks(
            range(len(feature_names)),
            feature_names
        )
    plt.title('Feature Correlation Matrix')
    plt.tight_layout()
    return plt.gcf()
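One way to exercise both helpers; the synthetic blobs and feature names are assumptions for demonstration, and the labels from make_blobs stand in for cluster assignments you would normally get from a clustering algorithm.

# Illustrative usage of the two plotting helpers on synthetic data
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, n_features=5, centers=3, random_state=42)

visualize_clusters(X, labels=y, method='pca')
plot_feature_correlation(X, feature_names=[f'f{i}' for i in range(5)])
plt.show()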
3. Basic Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def find_optimal_clusters(X, max_clusters=10, random_state=42):
    """Compare cluster counts using the elbow method and silhouette analysis."""
    inertias = []
    silhouette_scores = []
    for k in range(2, max_clusters + 1):
        # Fit KMeans for this cluster count
        kmeans = KMeans(
            n_clusters=k,
            random_state=random_state,
            n_init=10
        )
        kmeans.fit(X)
        # Record both evaluation metrics
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(
            silhouette_score(X, kmeans.labels_)
        )

    # Plot results side by side
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    # Elbow curve: look for the bend where inertia stops dropping sharply
    ax1.plot(range(2, max_clusters + 1), inertias, 'bo-')
    ax1.set_xlabel('Number of Clusters')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')
    # Silhouette scores: higher is better
    ax2.plot(range(2, max_clusters + 1), silhouette_scores, 'ro-')
    ax2.set_xlabel('Number of Clusters')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Analysis')
    plt.tight_layout()
    return fig
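A possible usage sketch; in practice X would be your preprocessed feature matrix rather than the synthetic blobs assumed here. With 5 generated centers, the elbow and the silhouette peak should both point near k=5.

# Illustrative run on data with a known number of centers
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=5, random_state=42)
find_optimal_clusters(X, max_clusters=10)
plt.show()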
Applications
1. Common Use Cases
- Customer Segmentation
  - Grouping customers by behavior
  - Personalizing marketing strategies
  - Understanding customer preferences
- Anomaly Detection
  - Fraud detection
  - System health monitoring
  - Quality control
- Feature Learning
  - Dimensionality reduction
  - Feature extraction
  - Data compression
- Pattern Discovery
  - Market basket analysis
  - Document clustering
  - Image segmentation
2. Industry Examples
- Retail
  - Customer segmentation
  - Product recommendations
  - Inventory management
- Finance
  - Fraud detection
  - Risk assessment
  - Portfolio analysis
- Healthcare
  - Patient grouping
  - Disease pattern analysis
  - Medical image analysis
- Technology
  - Network security
  - User behavior analysis
  - Content organization
Best Practices
1. Data Preparation
- Clean and preprocess data
- Handle missing values
- Scale features appropriately
- Remove outliers if necessary
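A minimal preparation sketch along these lines, assuming median imputation, standard scaling, and a simple z-score filter; the threshold of 3 is a common but arbitrary choice.

# Sketch: impute, scale, then drop rows with extreme z-scores
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def prepare_data(X, z_threshold=3.0):
    """Impute missing values, scale features, and drop extreme rows."""
    X = SimpleImputer(strategy='median').fit_transform(X)
    X = StandardScaler().fit_transform(X)
    mask = (np.abs(X) < z_threshold).all(axis=1)  # keep rows with no extreme feature
    return X[mask]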
2. Algorithm Selection
- Consider data characteristics
- Evaluate computational requirements
- Test multiple algorithms
- Validate results
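One hedged way to test multiple algorithms on the same data; the parameter values (eps, n_clusters) are placeholders that would need tuning for real data.

# Sketch: compare several clustering algorithms with one internal metric
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

candidates = {
    'KMeans': KMeans(n_clusters=3, random_state=42, n_init=10),
    'DBSCAN': DBSCAN(eps=0.8, min_samples=5),
    'Agglomerative': AgglomerativeClustering(n_clusters=3),
}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)  # DBSCAN marks noise as -1
    score = silhouette_score(X, labels) if n_found > 1 else float('nan')
    print(f"{name}: {n_found} clusters, silhouette={score:.3f}")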
3. Model Evaluation
- Use appropriate metrics
- Validate stability
- Consider interpretability
- Document assumptions
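A sketch of internal cluster validation with scikit-learn's built-in metrics; the data and model here are assumptions for demonstration. Reporting several metrics together guards against over-trusting any single one.

# Sketch: silhouette (higher is better), Davies-Bouldin (lower is better),
# and Calinski-Harabasz (higher is better) on one labeling
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

print(f"Silhouette:        {silhouette_score(X, labels):.3f}")
print(f"Davies-Bouldin:    {davies_bouldin_score(X, labels):.3f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X, labels):.1f}")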
4. Implementation Tips
- Start simple
- Iterate and refine
- Monitor performance
- Document process