Foundational Concepts in Deep Learning

1. Introduction to Artificial Neural Networks

Artificial Neural Networks (ANNs) are the building blocks of deep learning. Inspired by the biological brain, ANNs consist of interconnected neurons that process information and learn patterns from data.

Key Concepts

  • Perceptron:
    • A single-layer neural network.
    • Models a linear decision boundary but cannot solve non-linear problems (e.g., XOR problem).
  • Multi-layer Perceptron (MLP):
    • Combines multiple perceptrons in layers.
    • Solves non-linear problems using hidden layers and non-linear activation functions.
  • Neurons and Weights:
    • Each neuron receives input, applies weights, adds bias, and passes the result through an activation function.

Mathematical Representation

Given inputs x_1, x_2, \ldots, x_n with weights w_1, w_2, \ldots, w_n, the output y of a perceptron is:

z = \sum_{i=1}^{n} w_i x_i + b, \qquad y = f(z)

Where f is the activation function.
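
To make this concrete, here is a minimal NumPy sketch of a single perceptron forward pass; the step activation and the example values are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def perceptron_forward(x, w, b, f):
    """Compute y = f(z), where z = w . x + b, for a single perceptron."""
    z = np.dot(w, x) + b   # weighted sum of inputs plus bias
    return f(z)            # pass through the activation function

# Illustrative example: two inputs and a step activation (assumed here)
x = np.array([1.0, 0.0])
w = np.array([0.5, -0.3])
b = 0.1
step = lambda z: 1.0 if z >= 0 else 0.0

y = perceptron_forward(x, w, b, step)
print(y)  # 1.0, since z = 0.5*1.0 + (-0.3)*0.0 + 0.1 = 0.6 >= 0
```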


2. Activation Functions

Activation functions introduce non-linearity into the network, enabling it to learn complex patterns.

Common Activation Functions

  • Sigmoid:

    f(x) = \frac{1}{1 + e^{-x}}
    • Output range: (0, 1).
    • Pros: Smooth gradients.
    • Cons: Vanishing gradient problem.
  • ReLU (Rectified Linear Unit):

    f(x) = \max(0, x)
    • Output range: [0, ∞).
    • Pros: Computationally efficient; mitigates the vanishing gradient problem for positive inputs.
    • Cons: Can lead to "dead neurons."
  • Tanh:

    f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
    • Output range: (-1, 1).
    • Pros: Centered at zero, smooth gradients.
    • Cons: Suffers from vanishing gradients for large-magnitude inputs.
  • Softmax:

    f_i(x) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
    • Used for multi-class classification.
    • Converts logits into probabilities.
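
The four functions above fit in a few lines of NumPy. This is a minimal sketch (the function names are my own); the softmax subtracts the maximum logit before exponentiating, a standard trick for numerical stability.

```python
import numpy as np

def sigmoid(x):
    # Output in (0, 1); saturates for large-magnitude inputs
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Output in [0, inf); gradient is zero for x < 0 ("dead neurons")
    return np.maximum(0.0, x)

def tanh(x):
    # Output in (-1, 1); zero-centered
    return np.tanh(x)

def softmax(x):
    # Converts a vector of logits into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract max to avoid overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))  # approx. [0.66, 0.24, 0.10]
```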

Usage Guidelines

  • ReLU: Default for hidden layers.
  • Softmax: For output layers in multi-class classification.
  • Sigmoid/Tanh: For binary classification outputs or specific architectures (a sketch combining these guidelines follows this list).
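
Putting these guidelines together, a hypothetical two-layer MLP for a 3-class problem could use ReLU in the hidden layer and softmax at the output. The layer sizes and random weights below are illustrative assumptions, not a prescribed architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 inputs -> 8 hidden units -> 3 classes
W1, b1 = 0.1 * rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(3, 8)), np.zeros(3)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mlp_forward(x):
    h = relu(W1 @ x + b1)        # hidden layer: ReLU (default choice)
    return softmax(W2 @ h + b2)  # output layer: softmax over 3 classes

x = rng.normal(size=4)
probs = mlp_forward(x)
print(probs, probs.sum())  # three class probabilities summing to 1.0
```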

3. Loss Functions and Optimization

Loss functions measure the difference between the predicted output and the ground truth. Optimization algorithms minimize this loss by adjusting the weights and biases.

Common Loss Functions

  • Mean Squared Error (MSE):

    \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    • Used for regression tasks.
    • Penalizes large errors more heavily, since errors are squared.
  • Cross-Entropy Loss:

    \mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
    • Commonly used for classification tasks; the form shown is the binary case (see the sketch below).
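
Both losses are short one-liners in NumPy; this sketch adds a small clipping constant (my own assumption) so that log(0) is never evaluated.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Binary cross-entropy; clipping keeps predictions away from 0 and 1
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(mse(y_true, y_pred))                   # ~0.047
print(binary_cross_entropy(y_true, y_pred))  # ~0.228
```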

Optimization Algorithms

  • Gradient Descent: Iteratively adjusts weights based on the gradient of the loss function:

    w = w - \alpha \nabla \mathcal{L}

    Where \alpha is the learning rate (a worked sketch follows this list).

  • Variants of Gradient Descent:

    • Stochastic Gradient Descent (SGD): Updates weights for each training example.
    • Mini-batch Gradient Descent: Updates weights using small batches of data.
    • Adam (Adaptive Moment Estimation):
      • Combines momentum and RMSProp for adaptive learning rates.
      • Efficient and widely used.
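
As a worked example, here is a minimal sketch of mini-batch gradient descent fitting a one-parameter linear model with the MSE loss; the synthetic data, learning rate, and batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3x + noise (illustrative)
x = rng.normal(size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0      # parameters to learn
alpha = 0.1          # learning rate
batch_size = 32

for epoch in range(50):
    idx = rng.permutation(len(x))            # shuffle each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        y_hat = w * xb + b                   # model prediction
        # Gradients of the MSE loss with respect to w and b
        grad_w = -2.0 * np.mean((yb - y_hat) * xb)
        grad_b = -2.0 * np.mean(yb - y_hat)
        # Gradient descent update: parameter <- parameter - alpha * gradient
        w -= alpha * grad_w
        b -= alpha * grad_b

print(w, b)  # w approaches 3.0 and b approaches 0.0
```

In practice this loop is rarely written by hand: Adam keeps running estimates of the gradient's first and second moments to adapt the step size per parameter, and deep learning frameworks provide it ready-made.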

Summary

The foundational concepts discussed here lay the groundwork for understanding and building deep learning models. Mastering them provides a solid basis for designing and optimizing neural networks.