Introduction to Unsupervised Learning

This section provides an overview of unsupervised learning, including core concepts, algorithms, and implementation examples.

Core Concepts

1. What is Unsupervised Learning?

Unsupervised learning is a type of machine learning where the algorithm learns patterns from unlabeled data. Unlike supervised learning, there are no explicit target variables or labels to predict. Instead, the goal is to discover hidden structures, patterns, or relationships within the data.

Key characteristics of unsupervised learning:

  • No labeled training data
  • Focus on finding patterns and structure
  • Multiple possible interpretations
  • Evaluation can be subjective
  • Often used for exploratory analysis

2. Types of Unsupervised Learning Tasks

  1. Clustering

    • Grouping similar data points together
    • Examples: K-means, hierarchical clustering, DBSCAN
  2. Dimensionality Reduction

    • Reducing the number of features while preserving information
    • Examples: PCA, t-SNE, UMAP
  3. Association Rule Learning

    • Finding relationships between variables
    • Examples: Apriori algorithm, FP-growth
  4. Anomaly Detection

    • Identifying unusual patterns or outliers
    • Examples: Isolation Forest, One-Class SVM
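
To make the last category concrete, here is a minimal anomaly-detection sketch with scikit-learn's Isolation Forest (the synthetic data and the contamination value are illustrative assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

# 200 inliers from a Gaussian, plus a few obvious outliers
rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.uniform(-6, 6, size=(5, 2)),
])

# contamination = assumed fraction of outliers in the data
iso = IsolationForest(contamination=0.025, random_state=42)
pred = iso.fit_predict(X)  # +1 for inliers, -1 for flagged anomalies
print(f"Flagged {(pred == -1).sum()} points as anomalies")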

Implementation Examples

1. Data Preprocessing

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

class UnsupervisedPreprocessor:
    def __init__(self, scaling=True, n_components=None):
        """Initialize preprocessor"""
        self.scaling = scaling
        self.n_components = n_components
        self.scaler = StandardScaler() if scaling else None
        self.pca = PCA(n_components=n_components) if n_components else None
    
    def fit_transform(self, X):
        """Fit and transform data"""
        # Scale data
        if self.scaling:
            X = self.scaler.fit_transform(X)
        
        # Apply PCA
        if self.pca:
            X = self.pca.fit_transform(X)
            self.plot_explained_variance()
        
        return X
    
    def transform(self, X):
        """Transform new data"""
        if self.scaling:
            X = self.scaler.transform(X)
        
        if self.pca:
            X = self.pca.transform(X)
        
        return X
    
    def plot_explained_variance(self):
        """Plot explained variance ratio"""
        if self.pca:
            plt.figure(figsize=(10, 6))
            ratio = np.cumsum(self.pca.explained_variance_ratio_)
            plt.plot(
                range(1, len(ratio) + 1),  # x-axis starts at 1 component, not 0
                ratio,
                'bo-'
            )
            plt.xlabel('Number of Components')
            plt.ylabel('Cumulative Explained Variance Ratio')
            plt.title('PCA Explained Variance')
            plt.grid(True)
            return plt.gcf()
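
A minimal usage sketch (the synthetic data and parameter choices below are illustrative, not part of the class):

from sklearn.datasets import make_blobs

# 300 samples, 10 features, 4 latent groups
X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

# Scale, then keep the first 5 principal components
prep = UnsupervisedPreprocessor(scaling=True, n_components=5)
X_reduced = prep.fit_transform(X)  # also draws the explained-variance plot
print(X_reduced.shape)             # (300, 5)

# New data must reuse the fitted scaler and PCA via transform()
X_new, _ = make_blobs(n_samples=50, n_features=10, centers=4, random_state=0)
X_new_reduced = prep.transform(X_new)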

2. Data Visualization

from sklearn.manifold import TSNE

def visualize_clusters(X, labels=None, method='pca'):
    """Visualize high-dimensional data in 2D"""
    # Reduce dimensionality
    if method == 'pca':
        reducer = PCA(n_components=2)
    elif method == 'tsne':
        reducer = TSNE(n_components=2, random_state=42)
    else:
        raise ValueError("Method must be 'pca' or 'tsne'")
    
    X_2d = reducer.fit_transform(X)
    
    # Plot
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(
        X_2d[:, 0],
        X_2d[:, 1],
        c=labels,
        cmap='viridis' if labels is not None else None
    )
    
    if labels is not None:
        plt.colorbar(scatter)
    
    plt.title(f'2D Visualization using {method.upper()}')
    plt.xlabel('First Component')
    plt.ylabel('Second Component')
    return plt.gcf()

def plot_feature_correlation(X, feature_names=None):
    """Plot feature correlation matrix"""
    corr = np.corrcoef(X.T)
    
    plt.figure(figsize=(10, 8))
    plt.imshow(corr, cmap='coolwarm')
    plt.colorbar()
    
    if feature_names is not None:
        plt.xticks(
            range(len(feature_names)),
            feature_names,
            rotation=45
        )
        plt.yticks(
            range(len(feature_names)),
            feature_names
        )
    
    plt.title('Feature Correlation Matrix')
    plt.tight_layout()
    return plt.gcf()
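
A quick smoke test of both helpers on synthetic data (make_blobs and the feature names are illustrative):

from sklearn.datasets import make_blobs

# Synthetic data with a known group structure; labels are used only for coloring
X, y = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

visualize_clusters(X, labels=y, method='pca')
plot_feature_correlation(X, feature_names=[f'f{i}' for i in range(10)])
plt.show()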

3. Basic Clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def find_optimal_clusters(
    X,
    max_clusters=10,
    random_state=42
):
    """Find optimal number of clusters using elbow method"""
    inertias = []
    silhouette_scores = []
    
    for k in range(2, max_clusters + 1):
        # Fit KMeans
        kmeans = KMeans(
            n_clusters=k,
            n_init=10,  # explicit n_init for consistent results across sklearn versions
            random_state=random_state
        )
        kmeans.fit(X)
        
        # Calculate metrics
        inertias.append(kmeans.inertia_)
        silhouette_scores.append(
            silhouette_score(X, kmeans.labels_)
        )
    
    # Plot results
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Elbow curve
    ax1.plot(
        range(2, max_clusters + 1),
        inertias,
        'bo-'
    )
    ax1.set_xlabel('Number of Clusters')
    ax1.set_ylabel('Inertia')
    ax1.set_title('Elbow Method')
    
    # Silhouette scores
    ax2.plot(
        range(2, max_clusters + 1),
        silhouette_scores,
        'ro-'
    )
    ax2.set_xlabel('Number of Clusters')
    ax2.set_ylabel('Silhouette Score')
    ax2.set_title('Silhouette Analysis')
    
    plt.tight_layout()
    return fig
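
Putting the pieces together: inspect the two curves, then fit KMeans with the chosen k. Here k=4 is assumed because the synthetic data has 4 centers; on real data you would read it off the plots:

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

find_optimal_clusters(X, max_clusters=10)
plt.show()

# Suppose both curves point to k=4
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)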

Applications

1. Common Use Cases

  1. Customer Segmentation

    • Grouping customers by behavior
    • Personalizing marketing strategies
    • Understanding customer preferences
  2. Anomaly Detection

    • Fraud detection
    • System health monitoring
    • Quality control
  3. Feature Learning

    • Dimensionality reduction
    • Feature extraction
    • Data compression
  4. Pattern Discovery

    • Market basket analysis
    • Document clustering
    • Image segmentation

2. Industry Examples

  1. Retail

    • Customer segmentation
    • Product recommendations
    • Inventory management
  2. Finance

    • Fraud detection
    • Risk assessment
    • Portfolio analysis
  3. Healthcare

    • Patient grouping
    • Disease pattern analysis
    • Medical image analysis
  4. Technology

    • Network security
    • User behavior analysis
    • Content organization

Best Practices

1. Data Preparation

  • Clean and preprocess data
  • Handle missing values
  • Scale features appropriately
  • Remove outliers if necessary
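
A minimal sketch of these steps with scikit-learn (the imputation strategy and the 3-sigma outlier rule are illustrative choices, not a universal recipe):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing values, then standardize each feature
prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 180.0], [50.0, 210.0]])
X_clean = prep.fit_transform(X)

# Optional crude outlier filter: drop rows beyond 3 standard deviations
X_clean = X_clean[(np.abs(X_clean) < 3).all(axis=1)]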

2. Algorithm Selection

  • Consider data characteristics
  • Evaluate computational requirements
  • Test multiple algorithms
  • Validate results
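
"Test multiple algorithms" in practice: on non-convex clusters, centroid-based and density-based methods can disagree sharply. A quick visual comparison (make_moons and the eps value are illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex cluster shapes
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, (name, model) in zip(axes, [
    ('KMeans', KMeans(n_clusters=2, n_init=10, random_state=42)),
    ('DBSCAN', DBSCAN(eps=0.3)),  # eps chosen by eye for this data
]):
    ax.scatter(X[:, 0], X[:, 1], c=model.fit_predict(X), cmap='viridis', s=15)
    ax.set_title(name)
plt.show()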

3. Model Evaluation

  • Use appropriate metrics
  • Validate stability
  • Consider interpretability
  • Document assumptions
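
One way to act on the first two points: score a clustering with internal metrics and check label stability across random seeds (an adjusted Rand index near 1 means the runs agree; the data here is illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

labels_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Internal metrics: higher silhouette is better, lower Davies-Bouldin is better
print('silhouette:', silhouette_score(X, labels_a))
print('davies-bouldin:', davies_bouldin_score(X, labels_a))

# Stability: ARI compares two labelings and ignores label permutations
print('ARI across seeds:', adjusted_rand_score(labels_a, labels_b))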

4. Implementation Tips

  • Start simple
  • Iterate and refine
  • Monitor performance
  • Document process