Support Vector Machines

Support Vector Machines (SVMs) are powerful supervised learning models used for classification, regression, and outlier detection. They are particularly effective in high-dimensional spaces and cases where the number of dimensions exceeds the number of samples.

Theoretical Foundations

Linear SVM Classification

The fundamental idea behind SVM is to find the optimal hyperplane that maximizes the margin between two classes. For a binary classification problem:

  1. Mathematical Formulation

    • Given training data: (x_1, y_1), ..., (x_n, y_n) where y_i \in \{-1, 1\}
    • The hyperplane is defined as: w^T x + b = 0
    • The decision function: f(x) = sign(w^T x + b)
  2. Optimization Problem

    minimize: \frac{1}{2} ||w||^2
    subject to: y_i(w^T x_i + b) \geq 1, \forall i
    

    This formulation maximizes the margin while ensuring every training point is classified correctly; the margin width equals 2/||w||, so minimizing \frac{1}{2} ||w||^2 is equivalent to maximizing the margin (see the derivation after this list).

  3. Soft Margin SVM

    • Introduces slack variables \xi_i to handle non-separable data
    • Modified optimization:
    minimize: \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i
    subject to: y_i(w^T x_i + b) \geq 1 - \xi_i, \xi_i \geq 0, \forall i
    

    where C controls the trade-off between margin size and training error.
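
Why minimizing ||w|| maximizes the margin: the two margin hyperplanes w^T x + b = 1 and w^T x + b = -1 are parallel, and the distance between them is

\frac{2}{||w||}

so a smaller norm ||w|| gives a wider margin. Minimizing \frac{1}{2} ||w||^2 rather than maximizing \frac{2}{||w||} yields an equivalent objective that is convex and differentiable.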

Implementation Examples

1. Linear SVM Implementation

import numpy as np
from cvxopt import matrix, solvers
import matplotlib.pyplot as plt

class LinearSVM:
    def __init__(self, C=1.0):
        self.C = C
        self.support_vectors = None
        self.support_vector_labels = None
        self.alphas = None
        self.w = None
        self.b = None
    
    def fit(self, X, y):
        """Train Linear SVM using quadratic programming"""
        n_samples, n_features = X.shape
        y = y.astype(float)  # cvxopt requires double-precision ('d') matrices
        
        # Compute the Gram matrix
        K = X @ X.T
        
        # Set up the quadratic programming problem
        P = matrix(np.outer(y, y) * K)
        q = matrix(-np.ones(n_samples))
        G = matrix(np.vstack((-np.eye(n_samples), np.eye(n_samples))))
        h = matrix(np.hstack((np.zeros(n_samples), self.C * np.ones(n_samples))))
        A = matrix(y.reshape(1, -1))
        b = matrix(0.0)
        
        # Solve the quadratic programming problem
        solution = solvers.qp(P, q, G, h, A, b)
        alphas = np.array(solution['x']).reshape(-1)
        
        # Find support vectors
        sv_indices = alphas > 1e-5
        self.alphas = alphas[sv_indices]
        self.support_vectors = X[sv_indices]
        self.support_vector_labels = y[sv_indices]
        
        # Calculate w and b
        self.w = np.sum(self.alphas.reshape(-1, 1) * 
                       self.support_vector_labels.reshape(-1, 1) * 
                       self.support_vectors, axis=0)
        
        # Use margin support vectors (0 < alpha < C) to compute the bias
        on_margin = self.alphas < self.C - 1e-5
        margin_points = self.support_vectors[on_margin]
        self.b = np.mean(self.support_vector_labels[on_margin] - 
                        np.dot(margin_points, self.w))
        
        return self
    
    def predict(self, X):
        """Make predictions"""
        return np.sign(np.dot(X, self.w) + self.b)

# Example usage with visualization
np.random.seed(42)
X = np.random.randn(100, 2)
y = 2 * (X[:, 0] + X[:, 1] > 0) - 1

model = LinearSVM(C=1.0)
model.fit(X, y)

# Plot decision boundary
plt.figure(figsize=(10, 8))
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.scatter(model.support_vectors[:, 0], model.support_vectors[:, 1],
           s=100, facecolors='none', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Linear SVM Decision Boundary with Support Vectors')
plt.show()

2. Kernel SVM

The kernel trick allows SVMs to handle non-linearly separable data by mapping it to a higher-dimensional space.

  1. Common Kernels
    • Linear: K(x, y) = x^T y
    • Polynomial: K(x, y) = (\gamma x^T y + r)^d
    • RBF (Gaussian): K(x, y) = exp(-\gamma ||x - y||^2)
    • Sigmoid: K(x, y) = tanh(\gamma x^T y + r)

class KernelSVM:
    def __init__(self, kernel='rbf', C=1.0, gamma=1.0):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma
        
    def kernel_function(self, x1, x2):
        """Compute kernel function"""
        if self.kernel == 'linear':
            return np.dot(x1, x2)
        elif self.kernel == 'rbf':
            return np.exp(-self.gamma * np.linalg.norm(x1 - x2)**2)
        elif self.kernel == 'poly':
            # Degree-2 polynomial kernel with r = 1 (gamma not used in this branch)
            return (np.dot(x1, x2) + 1)**2
    
    def compute_kernel_matrix(self, X1, X2):
        """Compute kernel matrix for two sets of points"""
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i,j] = self.kernel_function(X1[i], X2[j])
        return K
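
The class above only builds kernel matrices. The grid-search helper later in this section calls fit and score on a KernelSVM, so here is a minimal sketch of those methods, assuming the numpy/cvxopt imports from the linear example are in scope and reusing the same dual QP with the kernel matrix in place of the Gram matrix. Treat it as one possible completion rather than a reference implementation.

    def fit(self, X, y):
        """Solve the dual QP with the kernel matrix replacing X X^T"""
        n_samples = X.shape[0]
        y = y.astype(float)
        K = self.compute_kernel_matrix(X, X)
        P = matrix(np.outer(y, y) * K)
        q = matrix(-np.ones(n_samples))
        G = matrix(np.vstack((-np.eye(n_samples), np.eye(n_samples))))
        h = matrix(np.hstack((np.zeros(n_samples), self.C * np.ones(n_samples))))
        A = matrix(y.reshape(1, -1))
        b = matrix(0.0)
        alphas = np.array(solvers.qp(P, q, G, h, A, b)['x']).reshape(-1)
        
        sv = alphas > 1e-5
        self.alphas = alphas[sv]
        self.support_vectors = X[sv]
        self.support_vector_labels = y[sv]
        
        # Bias from margin support vectors (0 < alpha < C)
        K_sv = self.compute_kernel_matrix(self.support_vectors, self.support_vectors)
        decision = (self.alphas * self.support_vector_labels) @ K_sv
        on_margin = self.alphas < self.C - 1e-5
        self.b = np.mean(self.support_vector_labels[on_margin] - decision[on_margin])
        return self
    
    def decision_function(self, X):
        """Compute sum_i alpha_i y_i K(x_i, x) + b for each row x of X"""
        K = self.compute_kernel_matrix(X, self.support_vectors)
        return K @ (self.alphas * self.support_vector_labels) + self.b
    
    def predict(self, X):
        return np.sign(self.decision_function(X))
    
    def score(self, X, y):
        """Mean accuracy, matching what svm_parameter_search expects"""
        return np.mean(self.predict(X) == y)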

Advanced Concepts

1. Geometric Margin

The geometric margin of a hyperplane relative to a dataset is:

γ = min_{i=1,...,n} \frac{|w^T x_i + b|}{||w||}

This represents the minimum distance from any point to the decision boundary.
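
As a quick numeric check, the geometric margin of the LinearSVM trained earlier can be computed directly from this definition (assuming model and X from the linear example are still in scope):

distances = np.abs(X @ model.w + model.b) / np.linalg.norm(model.w)
geometric_margin = distances.min()  # minimum distance of any point to the boundary
print(f"Geometric margin: {geometric_margin:.4f}")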

2. Dual Formulation

The dual optimization problem is:

maximize: \sum_{i=1}^n α_i - \frac{1}{2} \sum_{i,j=1}^n y_i y_j α_i α_j K(x_i, x_j)
subject to: \sum_{i=1}^n y_i α_i = 0, 0 ≤ α_i ≤ C, \forall i
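
Once the dual is solved, the primal solution is recovered from the optimal multipliers:

w = \sum_{i=1}^n α_i y_i x_i,    b = y_k - w^T x_k for any k with 0 < α_k < C

Only points with α_i > 0 (the support vectors) contribute to w; this recovery is exactly what LinearSVM.fit performs after calling solvers.qp.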

3. Kernel Selection

  • Linear Kernel: Best for linearly separable data

    • Computationally efficient
    • Good for high-dimensional data
    • Risk of underfitting for complex data
  • RBF Kernel: Versatile, good default choice

    • Can capture non-linear patterns
    • Requires tuning of γ parameter
    • Risk of overfitting with large γ (and underfitting with very small γ)
  • Polynomial Kernel: Good for normalized data

    • Can capture higher-order feature interactions
    • More parameters to tune
    • Numerically unstable for high degrees

Model Selection and Tuning

1. Hyperparameter Search

from sklearn.model_selection import KFold

def svm_parameter_search(X, y, param_grid):
    """Perform grid search for SVM parameters"""
    best_score = float('-inf')
    best_params = None
    
    for C in param_grid['C']:
        for gamma in param_grid['gamma']:
            scores = []
            kf = KFold(n_splits=5, shuffle=True)
            
            for train_idx, val_idx in kf.split(X):
                X_train, X_val = X[train_idx], X[val_idx]
                y_train, y_val = y[train_idx], y[val_idx]
                
                model = KernelSVM(C=C, gamma=gamma)
                model.fit(X_train, y_train)
                score = model.score(X_val, y_val)
                scores.append(score)
            
            avg_score = np.mean(scores)
            if avg_score > best_score:
                best_score = avg_score
                best_params = {'C': C, 'gamma': gamma}
    
    return best_params, best_score
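
A usage sketch, assuming the KernelSVM completion above (the parameter values are arbitrary illustrations):

param_grid = {
    'C': [0.1, 1.0, 10.0],
    'gamma': [0.01, 0.1, 1.0]
}
best_params, best_score = svm_parameter_search(X, y, param_grid)
print(f"Best parameters: {best_params}, CV accuracy: {best_score:.3f}")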

2. Model Evaluation

from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc

def evaluate_svm(model, X, y):
    """Comprehensive SVM evaluation (assumes labels in {-1, 1} and a model
    exposing predict and decision_function)"""
    # Predictions
    y_pred = model.predict(X)
    
    # Basic metrics
    accuracy = np.mean(y_pred == y)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    
    # Decision function values
    decision_values = model.decision_function(X)
    
    # ROC curve
    fpr, tpr, _ = roc_curve(y, decision_values)
    auc_score = auc(fpr, tpr)
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc_score,
        'fpr': fpr,
        'tpr': tpr
    }
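
For example, on a held-out split (X_test and y_test are placeholders for whatever split you use):

metrics = evaluate_svm(model, X_test, y_test)  # model must expose decision_function
print({k: round(metrics[k], 3) for k in ('accuracy', 'precision', 'recall', 'f1', 'auc')})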

Best Practices

  1. Data Preprocessing

    • Scale features to [0,1] or [-1,1] (a feature-scaling sketch follows this list)
    • Handle missing values
    • Remove outliers if necessary
    • Consider dimensionality reduction
  2. Kernel Selection

    • Start with linear kernel for high-dimensional data
    • Use RBF kernel for non-linear patterns
    • Consider polynomial kernel for normalized data
    • Validate kernel choice through cross-validation
  3. Parameter Tuning

    • Use grid search or random search
    • Focus on C and kernel-specific parameters
    • Consider computational cost
    • Use domain knowledge when possible
  4. Model Validation

    • Use k-fold cross-validation
    • Check for overfitting
    • Monitor support vector count
    • Assess prediction confidence
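
A minimal feature-scaling sketch for the preprocessing step above, using scikit-learn's MinMaxScaler; the key point is to fit the scaler on the training split only and reuse its statistics on the test split:

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = MinMaxScaler(feature_range=(-1, 1))    # or (0, 1)
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # apply the same transform to test data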