Probability and Statistics

Understanding the fundamental concepts of probability and statistics that are essential for machine learning and data science.

Probability and statistics form the theoretical foundation for machine learning and data science, providing tools for understanding uncertainty, making predictions, and analyzing data.

Fundamentals of Probability

Basic Concepts

  • Sample space ($\Omega$): Set of all possible outcomes
  • Event ($A \subseteq \Omega$): Subset of the sample space
  • Probability measure: $P: 2^\Omega \to [0,1]$
  • Axioms of probability:
    1. $P(A) \geq 0$ for all events $A$
    2. $P(\Omega) = 1$
    3. For pairwise disjoint events: $P(\cup_i A_i) = \sum_i P(A_i)$

Probability Rules

  • Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  • Multiplication rule: $P(A \cap B) = P(A|B)P(B)$
  • Bayes' theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
  • Law of total probability: $P(A) = \sum_i P(A|B_i)P(B_i)$
import numpy as np
from scipy import stats

# Example: Bayes' theorem implementation
def bayes_theorem(p_a, p_b_given_a, p_b):
    """Return the posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return (p_b_given_a * p_a) / p_b
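
For instance, with hypothetical diagnostic-test numbers, the marginal $P(B)$ can first be computed via the law of total probability and then passed to the function above (a sketch; the probabilities are illustrative):
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Law of total probability: P(positive) = sum_i P(positive|B_i) P(B_i)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

print(bayes_theorem(p_disease, p_pos_given_disease, p_pos))  # ~0.161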

Random Variables

Types and Properties

  1. Discrete Random Variables

    • Probability mass function (PMF): $p_X(x) = P(X = x)$
    • Expected value: $E[X] = \sum_x x\, p_X(x)$
    • Variance: $\text{Var}(X) = E[(X-\mu)^2]$
  2. Continuous Random Variables

    • Probability density function (PDF): $f_X(x)$
    • Expected value: $E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx$
    • Variance: $\text{Var}(X) = E[(X-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2 f_X(x)\,dx$
# Example: Computing summary statistics for observed data
def random_variable_stats(data):
    mean = np.mean(data)
    variance = np.var(data)  # population variance (ddof=0)
    std_dev = np.std(data)
    return mean, variance, std_dev
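
As a worked discrete example (a fair six-sided die, chosen purely for illustration), the PMF definitions above give $E[X] = 3.5$ and $\text{Var}(X) = 35/12$:
# Fair six-sided die: p_X(x) = 1/6 for x in {1, ..., 6}
values = np.arange(1, 7)
pmf = np.full(6, 1 / 6)

expected = np.sum(values * pmf)                    # E[X] = 3.5
variance = np.sum((values - expected) ** 2 * pmf)  # Var(X) = 35/12 ≈ 2.917
print(expected, variance)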

Probability Distributions

Common Distributions

  1. Normal Distribution

    • PDF: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
    # Draw 1000 samples from N(mu, sigma^2)
    mu, sigma = 0, 1
    x = np.random.normal(mu, sigma, 1000)
    
  2. Binomial Distribution

    • PMF: $P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$ (both closed forms are checked numerically after this list)
    # Draw 1000 samples from Binomial(n=10, p=0.5)
    n, p = 10, 0.5
    x = np.random.binomial(n, p, 1000)
    
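Both closed forms can be sanity-checked against scipy.stats (a minimal sketch using the imports above plus math.comb):
from math import comb

# Normal PDF formula vs. scipy.stats.norm.pdf
x0, mu, sigma = 0.5, 0.0, 1.0
pdf_manual = np.exp(-(x0 - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
assert np.isclose(pdf_manual, stats.norm.pdf(x0, mu, sigma))

# Binomial PMF formula vs. scipy.stats.binom.pmf
n, p, k = 10, 0.5, 3
pmf_manual = comb(n, k) * p ** k * (1 - p) ** (n - k)
assert np.isclose(pmf_manual, stats.binom.pmf(k, n, p))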

Statistical Inference

Sampling and Estimation

  1. Point Estimation

    • Maximum Likelihood Estimation (MLE): $\hat{\theta} = \arg\max_\theta L(\theta|x)$
    def mle_normal(data):
        # The MLE for a normal uses the biased estimator (divide by n, i.e. ddof=0)
        mu = np.mean(data)
        sigma = np.std(data, ddof=0)
        return mu, sigma
    
  2. Interval Estimation

    • Confidence Interval: $\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$ (the t-distribution, since $s$ estimates $\sigma$; see the usage sketch after this list)
    def confidence_interval(data, confidence=0.95):
        mean = np.mean(data)
        std_err = stats.sem(data)  # standard error: s / sqrt(n)
        ci = stats.t.interval(confidence, df=len(data) - 1, loc=mean, scale=std_err)
        return ci
    
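A brief usage sketch on simulated data (assuming both helpers above are in scope):
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

mu_hat, sigma_hat = mle_normal(data)  # point estimates near 5 and 2
lo, hi = confidence_interval(data)    # 95% CI for the mean
print(mu_hat, sigma_hat, (lo, hi))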

Hypothesis Testing

def t_test(sample1, sample2, alpha=0.05):
    # Two-sample t-test for equal means (equal variances assumed by default)
    t_stat, p_value = stats.ttest_ind(sample1, sample2)
    return {
        't_statistic': t_stat,
        'p_value': p_value,
        'reject_null': p_value < alpha
    }
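
For example, with simulated samples whose true means differ (a sketch):
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(0.5, 1.0, size=50)
print(t_test(a, b))  # a small p_value makes reject_null True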

Applications in Machine Learning

Model Evaluation

  1. Cross-validation

    from sklearn.model_selection import cross_val_score
    
    def cv_evaluation(model, X, y, cv=5):
        scores = cross_val_score(model, X, y, cv=cv)
        return np.mean(scores), np.std(scores)
    
  2. Performance Metrics

    from sklearn.metrics import accuracy_score, precision_score, recall_score
    
    def classification_metrics(y_true, y_pred):
        # precision/recall default to binary labels; set average= for multiclass
        return {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred),
            'recall': recall_score(y_true, y_pred)
        }
    

Probabilistic Models

  1. Naive Bayes

    from sklearn.naive_bayes import GaussianNB
    
    def train_naive_bayes(X_train, y_train):
        model = GaussianNB()
        model.fit(X_train, y_train)
        return model
    
  2. Gaussian Processes

    from sklearn.gaussian_process import GaussianProcessRegressor
    
    def train_gp(X_train, y_train):
        gp = GaussianProcessRegressor()
        gp.fit(X_train, y_train)
        return gp
    

Advanced Topics

Information Theory

  • Entropy: $H(X) = -\sum_x p(x)\log p(x)$
  • KL Divergence: $D_{KL}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$
def entropy(probabilities):
    # Shannon entropy in bits; the small epsilon guards against log(0)
    return -np.sum(probabilities * np.log2(probabilities + 1e-10))
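
The KL divergence above can be implemented in the same style (a sketch; p and q are assumed to be aligned probability vectors):
def kl_divergence(p, q):
    # D_KL(P || Q) in bits; epsilons guard against log(0) and division by zero
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log2((p + 1e-10) / (q + 1e-10)))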

Bayesian Methods

def bayesian_update(prior, likelihood, evidence):
    # Posterior = likelihood * prior / evidence (works elementwise on arrays)
    return (likelihood * prior) / evidence
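
A usage sketch with a two-hypothesis prior, where the evidence comes from the law of total probability:
# Is a coin fair (P(heads) = 0.5) or biased (P(heads) = 0.8)? Observe one head.
prior = np.array([0.5, 0.5])
likelihood = np.array([0.5, 0.8])      # P(heads | each hypothesis)
evidence = np.sum(likelihood * prior)  # P(heads) = 0.65

print(bayesian_update(prior, likelihood, evidence))  # [0.3846 0.6154]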

Practical Considerations

Implementation

# Example: Bootstrap sampling of the mean
def bootstrap_sample(data, num_samples=1000):
    data = np.asarray(data)  # fancy indexing below requires an ndarray
    n = len(data)
    indices = np.random.randint(0, n, (num_samples, n))
    samples = data[indices]  # each row is one resample drawn with replacement
    return np.mean(samples, axis=1)
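
A common follow-up is a percentile confidence interval over the bootstrap means (a sketch):
data = np.random.normal(5.0, 2.0, size=200)
boot_means = bootstrap_sample(data)

# 95% percentile interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, hi)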

Common Challenges

  • Missing data handling
  • Outlier detection
  • Sample size determination
  • Multiple testing correction (see the sketch below)
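
As a concrete instance of the last point, a minimal Bonferroni correction (a sketch; the p-values are illustrative):
def bonferroni_correct(p_values, alpha=0.05):
    # Reject H0 only where p < alpha / (number of tests)
    p_values = np.asarray(p_values)
    return p_values < alpha / len(p_values)

print(bonferroni_correct([0.001, 0.02, 0.04]))  # [ True False False]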