Support Vector Machines
Support Vector Machines (SVMs) are powerful supervised learning models used for classification, regression, and outlier detection. They are particularly effective in high-dimensional spaces and cases where the number of dimensions exceeds the number of samples.
Theoretical Foundations
Linear SVM Classification
The fundamental idea behind SVM is to find the optimal hyperplane that maximizes the margin between two classes. For a binary classification problem:
- Mathematical Formulation
  - Given training data (x_1, y_1), \ldots, (x_n, y_n), where x_i \in \mathbb{R}^d and y_i \in \{-1, +1\}
  - The hyperplane is defined as: w^T x + b = 0
  - The decision function: f(x) = \text{sign}(w^T x + b)
- Optimization Problem
  minimize: \frac{1}{2} ||w||^2
  subject to: y_i(w^T x_i + b) \geq 1, \forall i
This formulation maximizes the margin while ensuring correct classification of the training points: the margin width equals \frac{2}{||w||}, so minimizing \frac{1}{2}||w||^2 is equivalent to maximizing the margin.
- Soft Margin SVM
  - Introduces slack variables \xi_i to handle non-separable data
  - Modified optimization:
    minimize: \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i
    subject to: y_i(w^T x_i + b) \geq 1 - \xi_i, \xi_i \geq 0, \forall i
where C controls the trade-off between margin size and training error.
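To see this trade-off concretely, the following minimal sketch (using scikit-learn's SVC rather than the from-scratch implementation below, on illustrative data not taken from this text) compares how many support vectors a small and a large C produce:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (illustrative data)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    # A small C tolerates margin violations, so more points end up as support vectors
    print(f"C={C}: {len(clf.support_vectors_)} support vectors")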
Implementation Examples
1. Linear SVM Implementation
import numpy as np
from cvxopt import matrix, solvers
import matplotlib.pyplot as plt

class LinearSVM:
    def __init__(self, C=1.0):
        self.C = C
        self.support_vectors = None
        self.support_vector_labels = None
        self.alphas = None
        self.w = None
        self.b = None

    def fit(self, X, y):
        """Train Linear SVM by solving the dual problem with quadratic programming"""
        n_samples, n_features = X.shape
        y = y.astype(float)  # cvxopt requires double-precision matrices
        # Compute the Gram matrix
        K = X @ X.T
        # Set up the quadratic programming problem (dual soft-margin SVM)
        P = matrix(np.outer(y, y) * K)
        q = matrix(-np.ones(n_samples))
        G = matrix(np.vstack((-np.eye(n_samples), np.eye(n_samples))))
        h = matrix(np.hstack((np.zeros(n_samples), self.C * np.ones(n_samples))))
        A = matrix(y.reshape(1, -1))
        b = matrix(0.0)
        # Solve the quadratic programming problem
        solution = solvers.qp(P, q, G, h, A, b)
        alphas = np.array(solution['x']).reshape(-1)
        # Find support vectors (non-zero Lagrange multipliers)
        sv_indices = alphas > 1e-5
        self.alphas = alphas[sv_indices]
        self.support_vectors = X[sv_indices]
        self.support_vector_labels = y[sv_indices]
        # Calculate w and b
        self.w = np.sum(self.alphas.reshape(-1, 1) *
                        self.support_vector_labels.reshape(-1, 1) *
                        self.support_vectors, axis=0)
        # Use the margin support vectors (alpha strictly below C) to compute the bias
        on_margin = self.alphas < self.C - 1e-5
        margin_points = self.support_vectors[on_margin]
        self.b = np.mean(self.support_vector_labels[on_margin] -
                         np.dot(margin_points, self.w))
        return self

    def predict(self, X):
        """Make predictions"""
        return np.sign(np.dot(X, self.w) + self.b)
# Example usage with visualization
np.random.seed(42)
X = np.random.randn(100, 2)
y = 2 * (X[:, 0] + X[:, 1] > 0) - 1

model = LinearSVM(C=1.0)
model.fit(X, y)

# Plot decision boundary
plt.figure(figsize=(10, 8))
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
# Highlight the support vectors
plt.scatter(model.support_vectors[:, 0], model.support_vectors[:, 1],
            s=100, facecolors='none', edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Linear SVM Decision Boundary with Support Vectors')
plt.show()
2. Kernel SVM
The kernel trick allows SVMs to handle non-linearly separable data by mapping it to a higher-dimensional space.
- Common Kernels
  - Linear: K(x_i, x_j) = x_i^T x_j
  - Polynomial: K(x_i, x_j) = (\gamma x_i^T x_j + r)^d
  - RBF (Gaussian): K(x_i, x_j) = \exp(-\gamma ||x_i - x_j||^2)
  - Sigmoid: K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)
class KernelSVM:
    def __init__(self, kernel='rbf', C=1.0, gamma=1.0):
        self.kernel = kernel
        self.C = C
        self.gamma = gamma

    def kernel_function(self, x1, x2):
        """Compute kernel function for a single pair of points"""
        if self.kernel == 'linear':
            return np.dot(x1, x2)
        elif self.kernel == 'rbf':
            return np.exp(-self.gamma * np.linalg.norm(x1 - x2)**2)
        elif self.kernel == 'poly':
            # Degree-2 polynomial kernel
            return (np.dot(x1, x2) + 1)**2
        else:
            raise ValueError(f"Unknown kernel: {self.kernel}")

    def compute_kernel_matrix(self, X1, X2):
        """Compute kernel matrix for two sets of points"""
        n1, n2 = len(X1), len(X2)
        K = np.zeros((n1, n2))
        for i in range(n1):
            for j in range(n2):
                K[i, j] = self.kernel_function(X1[i], X2[j])
        return K
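Note that KernelSVM as written only builds kernel matrices; it does not implement fit or predict. One way to put it to work, sketched below, is to pass the precomputed kernel matrix to scikit-learn's SVC with kernel='precomputed' (the train/test split is illustrative and reuses X, y from the linear example above):

from sklearn.svm import SVC

kernel = KernelSVM(kernel='rbf', gamma=0.5)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

K_train = kernel.compute_kernel_matrix(X_train, X_train)  # shape (n_train, n_train)
K_test = kernel.compute_kernel_matrix(X_test, X_train)    # shape (n_test, n_train)

clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y_train)
print("Test accuracy:", clf.score(K_test, y_test))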
Advanced Concepts
1. Geometric Margin
The geometric margin of a hyperplane relative to a dataset is:
\gamma = \min_{i=1,\ldots,n} \frac{|w^T x_i + b|}{||w||}
This represents the minimum distance from any point to the decision boundary.
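For the LinearSVM trained earlier, this quantity can be computed directly from the learned w and b (a short illustration, not part of the original class):

# Distance of each training point to the learned hyperplane, then the minimum
distances = np.abs(X @ model.w + model.b) / np.linalg.norm(model.w)
print("Geometric margin:", distances.min())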
2. Dual Formulation
The dual optimization problem is:
maximize: \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n y_i y_j \alpha_i \alpha_j K(x_i, x_j)
subject to: \sum_{i=1}^n y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C, \forall i
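Solving the dual yields the multipliers \alpha_i, from which the primal solution is recovered; only the support vectors (points with \alpha_i > 0) contribute:

w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad f(x) = \text{sign}\left(\sum_{i=1}^n \alpha_i y_i K(x_i, x) + b\right)

This is what the fit method above computes for the linear kernel, and it is why predictions depend on the data only through the kernel function.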
3. Kernel Selection
- Linear Kernel: Best for linearly separable data
  - Computationally efficient
  - Good for high-dimensional data
  - Risk of underfitting for complex data
- RBF Kernel: Versatile, good default choice
  - Can capture non-linear patterns
  - Requires tuning of the γ parameter
  - Risk of overfitting with large γ
- Polynomial Kernel: Good for normalized data
  - Can capture higher-order feature interactions
  - More parameters to tune
  - Numerically unstable for high degrees
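When in doubt, an empirical comparison is straightforward. The sketch below scores each kernel by cross-validation using scikit-learn's SVC on the toy data from the linear example; the default settings are illustrative rather than tuned:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for kern in ('linear', 'rbf', 'poly'):
    scores = cross_val_score(SVC(kernel=kern, C=1.0), X, y, cv=5)
    print(f"{kern}: mean CV accuracy = {scores.mean():.3f}")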
Model Selection and Tuning
1. Cross-Validation Grid Search
from sklearn.model_selection import KFold

def svm_parameter_search(X, y, param_grid):
    """Perform grid search for SVM parameters.

    Assumes the SVM class exposes fit(X, y) and score(X, y) methods.
    """
    best_score = float('-inf')
    best_params = None
    for C in param_grid['C']:
        for gamma in param_grid['gamma']:
            scores = []
            kf = KFold(n_splits=5, shuffle=True)
            for train_idx, val_idx in kf.split(X):
                X_train, X_val = X[train_idx], X[val_idx]
                y_train, y_val = y[train_idx], y[val_idx]
                model = KernelSVM(C=C, gamma=gamma)
                model.fit(X_train, y_train)
                score = model.score(X_val, y_val)
                scores.append(score)
            avg_score = np.mean(scores)
            if avg_score > best_score:
                best_score = avg_score
                best_params = {'C': C, 'gamma': gamma}
    return best_params, best_score
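A typical call might look like the following (the grid values are illustrative, and the estimator is assumed to expose fit and score as noted in the docstring):

param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1, 1.0]}
best_params, best_score = svm_parameter_search(X, y, param_grid)
print("Best parameters:", best_params, "CV accuracy:", best_score)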
2. Model Evaluation
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc

def evaluate_svm(model, X, y):
    """Comprehensive SVM evaluation.

    Expects an estimator with predict and decision_function (e.g., scikit-learn's SVC).
    """
    # Predictions
    y_pred = model.predict(X)
    # Basic metrics
    accuracy = np.mean(y_pred == y)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    # Decision function values (signed distance to the hyperplane)
    decision_values = model.decision_function(X)
    # ROC curve
    fpr, tpr, _ = roc_curve(y, decision_values)
    auc_score = auc(fpr, tpr)
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc_score,
        'fpr': fpr,
        'tpr': tpr
    }
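As a usage sketch (evaluating on the training data purely for illustration; the estimator must provide predict and decision_function, as scikit-learn's SVC does):

from sklearn.svm import SVC

clf = SVC(kernel='rbf', C=1.0, gamma=0.5).fit(X, y)
metrics = evaluate_svm(clf, X, y)
print({k: round(metrics[k], 3) for k in ('accuracy', 'precision', 'recall', 'f1', 'auc')})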
Best Practices
- Data Preprocessing (see the pipeline sketch after this list)
  - Scale features to [0, 1] or [-1, 1]
  - Handle missing values
  - Remove outliers if necessary
  - Consider dimensionality reduction
- Kernel Selection
  - Start with a linear kernel for high-dimensional data
  - Use the RBF kernel for non-linear patterns
  - Consider the polynomial kernel for normalized data
  - Validate the kernel choice through cross-validation
- Parameter Tuning
  - Use grid search or random search
  - Focus on C and kernel-specific parameters
  - Consider computational cost
  - Use domain knowledge when possible
- Model Validation
  - Use k-fold cross-validation
  - Check for overfitting
  - Monitor the support vector count
  - Assess prediction confidence
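The preprocessing and tuning advice above can be combined into a single scikit-learn pipeline; the sketch below is a minimal example with illustrative parameter ranges, not a prescription:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Scale features to [0, 1] before the SVM, then tune C and gamma with 5-fold CV
pipe = Pipeline([('scale', MinMaxScaler()), ('svm', SVC(kernel='rbf'))])
grid = GridSearchCV(pipe,
                    param_grid={'svm__C': [0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)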