Probability and Statistics
Understanding fundamental concepts of probability and statistics essential for machine learning and data science.
Probability and statistics form the theoretical foundation for machine learning and data science, providing tools for understanding uncertainty, making predictions, and analyzing data.
Fundamentals of Probability
Basic Concepts
- Sample space ($\Omega$): the set of all possible outcomes
- Event ($A \subseteq \Omega$): a subset of the sample space
- Probability measure: a function $P$ that assigns each event a number in $[0, 1]$
- Axioms of probability:
  - $P(A) \ge 0$ for all events $A$, and $P(\Omega) = 1$
  - For disjoint events $A_1, A_2, \dots$: $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$
Probability Rules
- Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Multiplication rule: $P(A \cap B) = P(A \mid B)\,P(B)$
- Bayes' theorem: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$
- Law of total probability: $P(B) = \sum_i P(B \mid A_i)\,P(A_i)$
```python
import numpy as np
from scipy import stats

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes_theorem(p_a, p_b_given_a, p_b):
    return (p_b_given_a * p_a) / p_b
```
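As a worked example (the diagnostic-test numbers below are hypothetical, chosen only for illustration), the evidence term $P(B)$ comes from the law of total probability stated above:

```python
# Hypothetical numbers: prevalence, sensitivity, and false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Law of total probability: P(+) = P(+|disease)P(disease) + P(+|healthy)P(healthy)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior probability of disease given a positive test
print(bayes_theorem(p_disease, p_pos_given_disease, p_pos))  # ~0.161
```

Even with a sensitive test, the low prior keeps the posterior well under 50% — a classic illustration of why the prior matters.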
Random Variables
Types and Properties
Discrete Random Variables
- Probability mass function (PMF): $p(x) = P(X = x)$
- Expected value: $E[X] = \sum_x x\,p(x)$
- Variance: $\mathrm{Var}(X) = \sum_x (x - E[X])^2\,p(x)$
Continuous Random Variables
- Probability density function (PDF): $f(x)$, with $P(a \le X \le b) = \int_a^b f(x)\,dx$
- Expected value: $E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx$
- Variance: $\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - E[X])^2\,f(x)\,dx$
```python
# Sample statistics as estimates of a random variable's moments
def random_variable_stats(data):
    mean = np.mean(data)      # estimate of E[X]
    variance = np.var(data)   # estimate of Var(X)
    std_dev = np.std(data)    # estimate of the standard deviation
    return mean, variance, std_dev
```
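For a discrete random variable given by an explicit PMF, the same moments follow directly from the definitions above. A minimal sketch with a hypothetical three-point distribution:

```python
# Hypothetical PMF over the support {1, 2, 3}
values = np.array([1, 2, 3])
pmf = np.array([0.2, 0.5, 0.3])

expected_value = np.sum(values * pmf)                   # E[X] = sum of x * p(x)
variance = np.sum((values - expected_value)**2 * pmf)   # Var(X) = sum of (x - E[X])^2 * p(x)
print(expected_value, variance)  # 2.1, 0.49
```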
Probability Distributions
Common Distributions
Normal Distribution
- PDF: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

```python
# Draw 1000 samples from a standard normal distribution
mu, sigma = 0, 1
x = np.random.normal(mu, sigma, 1000)
```
Binomial Distribution
- PMF: $P(X = k) = \dbinom{n}{k} p^k (1 - p)^{n-k}$

```python
# Draw 1000 samples: number of successes in n=10 trials with success probability p=0.5
n, p = 10, 0.5
x = np.random.binomial(n, p, 1000)
```
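As a quick sanity check (a minimal sketch using `scipy.stats`), the empirical frequency of an outcome should approach the theoretical PMF value:

```python
# Compare the empirical frequency of k=5 successes with the theoretical PMF value
samples = np.random.binomial(10, 0.5, 100_000)
empirical = np.mean(samples == 5)
theoretical = stats.binom.pmf(5, n=10, p=0.5)
print(empirical, theoretical)  # both close to ~0.246
```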
Statistical Inference
Sampling and Estimation
Point Estimation
- Maximum Likelihood Estimation (MLE): $\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \log f(x_i \mid \theta)$

```python
# MLE for a normal distribution
def mle_normal(data):
    mu = np.mean(data)            # MLE of the mean
    sigma = np.std(data, ddof=0)  # ddof=0 is the MLE; ddof=1 gives the unbiased estimator
    return mu, sigma
```
Interval Estimation
- Confidence interval for a mean: $\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \dfrac{s}{\sqrt{n}}$

```python
# Two-sided t-based confidence interval for the population mean
def confidence_interval(data, confidence=0.95):
    mean = np.mean(data)
    std_err = stats.sem(data)  # standard error of the mean, s / sqrt(n)
    return stats.t.interval(confidence, df=len(data) - 1, loc=mean, scale=std_err)
```
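A minimal usage sketch (the data are simulated with a true mean of 5, so the interval should usually cover it):

```python
rng = np.random.default_rng(0)
data = rng.normal(loc=5, scale=2, size=100)  # simulated sample, true mean = 5
low, high = confidence_interval(data)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```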
Hypothesis Testing
```python
# Two-sample (independent) t-test for a difference in means
def t_test(sample1, sample2, alpha=0.05):
    t_stat, p_value = stats.ttest_ind(sample1, sample2)
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'reject_null': p_value < alpha,  # reject H0 (equal means) at level alpha
    }
```
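For example, with two simulated groups whose true means differ (values are illustrative):

```python
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 50)
group_b = rng.normal(0.5, 1.0, 50)  # true mean shifted by 0.5
print(t_test(group_a, group_b))
```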
Applications in Machine Learning
Model Evaluation
Cross-validation

```python
from sklearn.model_selection import cross_val_score

# Summarize k-fold cross-validation scores by their mean and standard deviation
def cv_evaluation(model, X, y, cv=5):
    scores = cross_val_score(model, X, y, cv=cv)
    return np.mean(scores), np.std(scores)
```
Performance Metrics

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Standard classification metrics (precision/recall assume binary labels by default)
def classification_metrics(y_true, y_pred):
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
    }
```
Probabilistic Models
Naive Bayes

```python
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes: Bayes' theorem with a conditional-independence assumption
# and Gaussian class-conditional likelihoods
def train_naive_bayes(X_train, y_train):
    model = GaussianNB()
    model.fit(X_train, y_train)
    return model
```
Gaussian Processes

```python
from sklearn.gaussian_process import GaussianProcessRegressor

# Gaussian process regression: a Bayesian nonparametric model that returns
# a predictive distribution rather than only a point estimate
def train_gp(X_train, y_train):
    gp = GaussianProcessRegressor()
    gp.fit(X_train, y_train)
    return gp
```
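One practical payoff of the probabilistic formulation is per-point uncertainty. A minimal sketch with toy 1-D data, querying the predictive mean and standard deviation:

```python
X_train = np.linspace(0, 10, 20).reshape(-1, 1)  # toy 1-D inputs
y_train = np.sin(X_train).ravel()
gp = train_gp(X_train, y_train)

X_test = np.array([[2.5], [12.0]])  # second point lies outside the training range
y_mean, y_std = gp.predict(X_test, return_std=True)
print(y_mean, y_std)  # predictive uncertainty (y_std) grows away from the training data
```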
Advanced Topics
Information Theory
- Entropy: $H(X) = -\sum_x p(x) \log_2 p(x)$
- KL divergence: $D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$
```python
# Shannon entropy in bits; the small epsilon avoids log(0) for zero probabilities
def entropy(probabilities):
    return -np.sum(probabilities * np.log2(probabilities + 1e-10))
```
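KL divergence can be implemented in the same style (a sketch assuming both distributions are arrays over the same support):

```python
# KL divergence D_KL(P || Q) between two discrete distributions over the same support
def kl_divergence(p, q):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log2((p + 1e-10) / (q + 1e-10)))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # > 0; zero only when P == Q
```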
Bayesian Methods
```python
# Posterior = likelihood * prior / evidence (Bayes' theorem for a single hypothesis)
def bayesian_update(prior, likelihood, evidence):
    return (likelihood * prior) / evidence
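```

A worked example with illustrative numbers: updating a 50% prior that a coin is biased toward heads (landing heads with probability 0.8) after observing one head:

```python
p_biased = 0.5               # prior: P(biased)
p_heads_given_biased = 0.8   # likelihood of heads if the coin is biased
p_heads_given_fair = 0.5     # likelihood of heads if the coin is fair

# Evidence via the law of total probability
p_heads = p_heads_given_biased * p_biased + p_heads_given_fair * (1 - p_biased)

posterior = bayesian_update(p_biased, p_heads_given_biased, p_heads)
print(posterior)  # ~0.615
```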
Practical Considerations
Implementation
```python
# Bootstrap: estimate the sampling distribution of the mean by resampling with replacement
def bootstrap_sample(data, num_samples=1000):
    data = np.asarray(data)
    n = len(data)
    indices = np.random.randint(0, n, (num_samples, n))  # resample indices with replacement
    samples = data[indices]                              # shape: (num_samples, n)
    return np.mean(samples, axis=1)                      # one mean per bootstrap replicate
```
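The bootstrap means can then be turned into a percentile confidence interval (a minimal sketch; the 2.5th and 97.5th percentiles bound a 95% interval):

```python
rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=200)  # skewed data, where normal theory is shaky

boot_means = bootstrap_sample(data)
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")
```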
Common Challenges
- Missing data handling
- Outlier detection
- Sample size determination
- Multiple testing correction (a sketch of the Bonferroni adjustment follows below)
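Of these, multiple testing correction is the easiest to get wrong silently: running many tests at level $\alpha$ inflates the chance of at least one false positive. A minimal sketch of the Bonferroni correction, the simplest and most conservative adjustment (the p-values below are illustrative):

```python
# Bonferroni correction: compare each p-value against alpha / number_of_tests
def bonferroni_reject(p_values, alpha=0.05):
    p_values = np.asarray(p_values)
    return p_values < (alpha / len(p_values))

p_values = [0.001, 0.02, 0.04, 0.30]  # hypothetical p-values from four tests
print(bonferroni_reject(p_values))    # only p-values below 0.0125 survive
```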