Neural Networks
From the single perceptron to deep learning. A mathematically rigorous, visually interactive journey through how machines learn to think.
The fascinating journey from biological neurons to artificial intelligence.
The human brain contains roughly 86 billion neurons, each connected to thousands of others through synapses. In the 1940s, scientists began asking a revolutionary question: can we build a mathematical model of how neurons work?
This question sparked a field that would undergo multiple cycles of excitement and disappointment, ultimately leading to the deep learning revolution that powers today's AI systems.
The history of neural networks is marked by two major periods of disillusionment:
Triggered by the XOR problem. Single-layer perceptrons were shown to be fundamentally limited. Funding dried up, and researchers moved to other approaches like expert systems and symbolic AI.
Despite backpropagation, deep networks were extremely difficult to train. Vanishing gradients made learning in early layers nearly impossible. SVMs and other methods dominated machine learning research.
Three factors ended the second AI winter: (1) massive datasets from the internet, (2) GPU computing power, and (3) algorithmic breakthroughs like ReLU activations, dropout, and batch normalization. Neural networks did not just return - they became the dominant paradigm in AI.
The simplest neural network - a single artificial neuron that learns from data.
A perceptron takes multiple inputs, multiplies each by a learnable weight, sums them together with a bias term, and passes the result through an activation function:
Compute the linear combination of inputs and weights:
Pass through the step (threshold) function:
Update weights based on prediction error:
Here $\eta$ is the learning rate, $y$ is the true label, and $\hat{y}$ is the predicted label. The perceptron converges if the data is linearly separable.
The perceptron can learn AND, OR, and NOT gates, but it cannot learn the XOR function. This is because XOR is not linearly separable - no single straight line can separate the two classes.
A single perceptron can only learn linearly separable functions. It creates a single hyperplane to divide the input space. For problems like XOR, where the classes are interleaved, we need multiple layers of neurons - a multi-layer perceptron (MLP).
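The perceptron's three steps (weighted sum, step activation, error-driven update) can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; the function name `train_perceptron` and the hyperparameters are my own choices. Trained on the OR gate, which is linearly separable, it converges:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Single perceptron with the classic update rule:
    w <- w + eta * (y - y_hat) * x  (bias updated the same way)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b > 0 else 0   # step activation
            w += lr * (yi - y_hat) * xi          # only updates on mistakes
            b += lr * (yi - y_hat)
    return w, b

# OR gate: linearly separable, so the perceptron converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])
w, b = train_perceptron(X, y_or)
preds = (X @ w + b > 0).astype(int)
```

Running the same loop on XOR labels ([0, 1, 1, 0]) never converges, no matter how many epochs you allow.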
If we replace the step function with the sigmoid function, the perceptron becomes logistic regression:
Outputs 0 or 1. Not differentiable. Uses the perceptron learning rule.
Outputs probability between 0 and 1. Differentiable. Uses gradient descent with cross-entropy loss.
Logistic regression is essentially a smooth, differentiable perceptron. It is the fundamental building block of modern neural networks.
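Swapping the step function for the sigmoid makes the model differentiable, so it can be trained by gradient descent on the cross-entropy loss instead of the perceptron rule. A minimal sketch (variable names and hyperparameters are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression on the OR gate, trained by gradient descent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient of cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(int)
```

Note how the gradient $(p - y)$ plays the same role as the perceptron's error $(y - \hat{y})$, but varies smoothly instead of jumping between -1, 0, and 1.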
How individual neurons are organized into layers to create powerful learning machines.
A neural network is organized into layers. Each layer consists of one or more neurons. Data flows from the input layer through hidden layers to the output layer.
Receives the raw features. No computation happens here - it simply passes the data forward. The number of neurons equals the number of features.
Where the "learning" happens. Each neuron computes a weighted sum, adds a bias, and applies an activation function. Multiple hidden layers create a "deep" network.
Produces the final prediction. For binary classification: 1 neuron with sigmoid. For multiclass: K neurons with softmax. For regression: 1 neuron with no activation (linear).
The learnable parameters of a neural network are organized into weight matrices and bias vectors for each layer.
Where $n^{[l]}$ is the number of neurons in layer $l$. The total number of parameters in a fully connected layer is:
We use superscript $[l]$ to denote the layer number. So $W^{[1]}$ is the weight matrix of layer 1, $a^{[2]}$ is the activation of layer 2, etc. Do not confuse this with exponentiation!
In a fully connected layer, every neuron in layer $l$ is connected to every neuron in layer $l-1$. This means each neuron receives input from all neurons in the previous layer.
For a network with layers of sizes [4, 8, 6, 1]:
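The parameter count is easy to verify with a short helper (the function name is mine). For the [4, 8, 6, 1] network, the three weight matrices and bias vectors contribute 40 + 54 + 7 = 101 parameters:

```python
def count_params(layer_sizes):
    """Parameters in a fully connected network: each layer l has a
    weight matrix with n[l] * n[l-1] entries plus n[l] biases."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_out * n_in + n_out
    return total

count_params([4, 8, 6, 1])   # → 101
```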
The nonlinear magic that gives neural networks their power.
Without activation functions, a neural network is just a series of linear transformations. Composing linear functions gives another linear function:
This means a 100-layer network with no activations is equivalent to a single-layer network. Activation functions introduce nonlinearity, allowing the network to learn complex patterns.
Range: (0, 1). Smooth, differentiable. Used for output layer in binary classification. Suffers from vanishing gradients.
Range: (-1, 1). Zero-centered, which helps learning. Stronger gradients than sigmoid. Still has vanishing gradient problem.
Range: $[0, \infty)$. Computationally efficient. No vanishing gradient for positive inputs. The default choice for hidden layers.
Small slope $\alpha$ (typically 0.01) for negative inputs. Fixes the "dying ReLU" problem where neurons output zero permanently.
The softmax function converts a vector of raw scores (logits) into a probability distribution over K classes:
All outputs sum to 1, and each output is between 0 and 1. This is used in the output layer for multi-class classification problems.
Hidden layers: Use ReLU (or Leaky ReLU) as the default. It trains faster and avoids vanishing gradients. Output layer: Sigmoid for binary classification, softmax for multi-class, linear (no activation) for regression.
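All of the activations above fit in a few lines of NumPy. This is a plain sketch for clarity, not an optimized implementation; note the max-subtraction trick in softmax, a standard guard against overflow:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()
```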
During backpropagation, we need the derivatives of activation functions. These determine how gradient signals flow backward through the network.
Sigmoid derivative. Maximum value is 0.25 at z=0, causing gradients to shrink.
Tanh derivative. Maximum value is 1 at z=0 - better than sigmoid but still shrinks.
ReLU derivative. Either 0 or 1 - no gradient shrinking for active neurons!
Leaky ReLU derivative. Never truly zero, so neurons never completely "die."
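The derivatives follow directly from the definitions, and the maxima quoted above (0.25 for sigmoid, 1 for tanh) are easy to confirm at $z = 0$. A small sketch, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)          # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1 - np.tanh(z) ** 2  # peaks at 1 when z = 0

def d_relu(z):
    return np.where(np.asarray(z) > 0, 1.0, 0.0)   # 0 or 1

def d_leaky_relu(z, alpha=0.01):
    return np.where(np.asarray(z) > 0, 1.0, alpha)  # never exactly 0
```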
How data flows through the network to produce a prediction.
Forward propagation is the process of computing the output of the network given an input. At each layer, we perform two operations:
Compute the pre-activation (weighted sum plus bias):
Apply the nonlinear activation function:
In one equation per layer:
Where $a^{[0]} = x$ (the input) and $a^{[L]} = \hat{y}$ (the output prediction).
For a 3-layer network (2 hidden layers + 1 output layer) with ReLU hidden activations and sigmoid output:
In practice, we process multiple samples simultaneously using matrix operations. If we have $m$ training examples:
Where $A^{[l-1]}$ is a matrix with each column being one sample's activations. This vectorized form is critical for GPU acceleration and efficient computation.
During forward propagation, store all intermediate values ($z^{[l]}$ and $a^{[l]}$) for every layer. These cached values are essential for backpropagation - without them, you would need to recompute everything backward.
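The vectorized forward pass, including the caching of $z^{[l]}$ and $a^{[l]}$, can be sketched as follows. This is an illustrative layout under my own conventions (columns are samples, ReLU in hidden layers, sigmoid at the output), not the only way to organize it:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Vectorized forward pass. X has shape (n_features, m), one column
    per sample; z and a are cached at every layer for backpropagation."""
    A = X
    cache = [("a", A)]
    L = len(params)
    for l, (W, b) in enumerate(params, start=1):
        Z = W @ A + b                          # pre-activation
        A = sigmoid(Z) if l == L else relu(Z)  # sigmoid only at the output
        cache += [("z", Z), ("a", A)]
    return A, cache

# Same [4, 8, 6, 1] architecture as earlier, with small random weights
rng = np.random.default_rng(0)
sizes = [4, 8, 6, 1]
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros((n_out, 1)))
          for n_in, n_out in zip(sizes, sizes[1:])]
y_hat, cache = forward(rng.standard_normal((4, 5)), params)   # 5 samples
```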
Measuring how wrong the network's predictions are - the signal that drives learning.
For regression tasks where the output is a continuous value:
MSE penalizes large errors quadratically. It is differentiable everywhere and has a unique global minimum.
For binary classification with sigmoid output:
When the true label is 1: the loss is near 0 if the predicted probability is near 1 (correct), and grows toward infinity as the prediction approaches 0 (confidently wrong).
When the true label is 0: the loss is near 0 if the predicted probability is near 0 (correct), and grows toward infinity as the prediction approaches 1 (confidently wrong).
For multi-class classification with softmax output and K classes:
Where $y_k$ is a one-hot encoded vector (1 for the true class, 0 elsewhere) and $\hat{y}_k$ is the predicted probability for class $k$.
Cross-entropy produces larger gradients when the prediction is confidently wrong, leading to faster learning. With MSE and sigmoid, gradients become very small when the prediction is near 0 or 1 (due to the sigmoid derivative), causing learning to stall. Cross-entropy cancels out this effect, providing strong learning signals even for extreme predictions.
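Both losses are one-liners in NumPy. The clipping in the cross-entropy version is a standard numerical guard I have added so a prediction of exactly 0 or 1 does not produce log(0); the function names are mine:

```python
import numpy as np

def mse(y, p):
    return np.mean((y - p) ** 2)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0])
confident_right = np.array([0.99, 0.01])
confident_wrong = np.array([0.01, 0.99])
```

Evaluating both cases shows the asymmetry described above: the confidently wrong predictions incur a cross-entropy of about $-\log(0.01) \approx 4.6$, while MSE caps the same error at roughly 0.96.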
The algorithm that makes neural networks learn - propagating errors backward through the chain rule.
Backpropagation is simply the chain rule of calculus applied systematically through the network. The key question is: how does changing each weight affect the final loss?
Consider the chain of computations: $W \rightarrow z \rightarrow a \rightarrow \mathcal{L}$. To find how $W$ affects $\mathcal{L}$, we multiply the local derivatives along the chain:
For each layer $l$, we define the error signal $\delta^{[l]}$ (how much each neuron contributed to the error):
(for cross-entropy loss with sigmoid/softmax output)
Where $\odot$ denotes element-wise multiplication (Hadamard product).
Backpropagation works in the opposite direction of forward propagation. Starting from the loss, it computes how much each parameter contributed to the error:
Compute all activations $a^{[l]}$ from input to output. Cache $z^{[l]}$ and $a^{[l]}$ at every layer.
Compare the output $a^{[L]}$ with the true label $y$ using the loss function.
Propagate $\delta$ from output to input, computing weight and bias gradients at each layer.
Use the computed gradients to adjust all weights and biases in the direction that reduces the loss.
In deep networks, gradients can shrink exponentially (vanishing) or grow exponentially (exploding) as they propagate backward. With sigmoid, gradients are multiplied by values less than 0.25 at each layer. After 10 layers: $0.25^{10} \approx 0.000001$. This is why ReLU and proper initialization are essential for deep networks.
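The four steps above can be sketched for a small two-layer network (tanh hidden layer, sigmoid output, cross-entropy loss). This is a didactic sketch with my own shape conventions, not a general-purpose implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, y, W1, b1, W2, b2):
    """One forward + backward pass. X: (n_in, m), y: (1, m)."""
    m = X.shape[1]
    # Forward pass (caching activations as we go)
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    loss = -np.mean(y * np.log(A2) + (1 - y) * np.log(1 - A2))
    # Backward pass: the output delta simplifies to (a - y)
    # for cross-entropy with a sigmoid output
    dZ2 = (A2 - y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # chain rule through tanh
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return loss, dW1, db1, dW2, db2

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 5))
y = (rng.random((1, 5)) > 0.5).astype(float)
W1 = rng.standard_normal((3, 2)) * 0.5; b1 = np.zeros((3, 1))
W2 = rng.standard_normal((1, 3)) * 0.5; b2 = np.zeros((1, 1))
loss, dW1, db1, dW2, db2 = forward_backward(X, y, W1, b1, W2, b2)
```

A finite-difference check (perturb one weight by a tiny epsilon, re-run the forward pass, compare the slope to the analytic gradient) is the standard way to verify such an implementation.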
Practical techniques for making neural networks learn effectively.
Uses the entire dataset per update. Stable but slow. Memory-intensive for large datasets.
Uses one sample per update. Very noisy but fast. The noise can help escape local minima.
Uses a small batch (typically 32-256 samples). Best of both worlds. The standard in practice.
Adam (Adaptive Moment Estimation) is the most widely used optimizer. It combines momentum (tracking the direction) with RMSProp (adaptive learning rates):
Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
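The full Adam update for a single parameter fits in a few lines (the function name is mine; in practice you would apply this element-wise to every weight tensor):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the running first/second moment
    estimates; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSProp-style term
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
```

The bias correction matters early on: without it, m and v start near zero and the first steps would be far too small.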
A fixed learning rate is often suboptimal. Scheduling strategies adjust the learning rate during training:
Reduce the learning rate by a factor every N epochs. Example: multiply by 0.1 every 30 epochs.
Smoothly decrease the learning rate following a cosine curve. Often used with warm restarts.
Start with a very small learning rate and gradually increase it over the first few epochs, then decay.
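The three schedules above are simple functions of the epoch number. A sketch (names and defaults are mine):

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the base rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def cosine_decay(lr0, epoch, total_epochs):
    """Smoothly anneal from lr0 to 0 along a cosine curve."""
    return lr0 * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

def linear_warmup(lr0, epoch, warmup_epochs=5):
    """Ramp up linearly over the first few epochs, then hold at lr0."""
    return lr0 * min(1.0, (epoch + 1) / warmup_epochs)
```

In practice, warmup is usually composed with one of the decay schedules: ramp up first, then decay.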
Batch normalization normalizes the inputs to each layer, stabilizing and accelerating training:
Where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma$, $\beta$ are learnable parameters that allow the network to undo the normalization if needed.
Faster training convergence, allows higher learning rates, reduces sensitivity to weight initialization, and acts as a mild regularizer (reducing the need for dropout in some cases).
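The training-time transform can be sketched directly from the formula (my own layout: features in rows, samples in columns):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch X of shape (features, m):
    normalize each feature to zero mean and unit variance, then
    rescale by learnable gamma and shift by learnable beta."""
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    X_hat = (X - mu) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * X_hat + beta
```

At inference time, frameworks substitute running averages of the mean and variance collected during training, since a single test sample has no batch statistics of its own.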
Proper initialization prevents vanishing and exploding activations at the start of training:
Designed for sigmoid and tanh activations. Keeps variance consistent across layers.
Designed for ReLU activations. Accounts for the fact that ReLU zeros out half the neurons.
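Both schemes are one-line scalings of a standard normal draw. Note that Xavier initialization has several published variants; the $1/n_{in}$ form below is a common simplification of Glorot's original $2/(n_{in} + n_{out})$:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier/Glorot (simplified form): variance 1/n_in,
    suited to sigmoid and tanh activations."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def he_init(n_in, n_out, rng):
    """He: variance 2/n_in, compensating for ReLU
    zeroing out roughly half the units."""
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
```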
When neural networks shine and when simpler methods may be better.
A neural network with a single hidden layer of sufficient width can approximate any continuous function to arbitrary accuracy (Universal Approximation Theorem).
Unlike traditional ML, neural networks learn their own feature representations from raw data. No manual feature engineering is needed - the network discovers useful features automatically.
Can learn highly nonlinear decision boundaries, hierarchical feature representations, and intricate patterns that are impossible for linear models.
The same fundamental architecture can be adapted for classification, regression, generation, translation, image recognition, speech processing, and more.
Neural networks are notoriously difficult to interpret. Understanding why a specific prediction was made is challenging, which limits adoption in regulated industries like healthcare and finance.
Neural networks typically require large amounts of labeled data to train effectively. With small datasets, simpler models like logistic regression or decision trees often outperform deep networks.
Training deep networks requires significant computing resources (GPUs/TPUs), energy, and time. Inference can also be slow for very large models.
With millions of parameters, neural networks can memorize training data instead of generalizing. Careful regularization (dropout, weight decay, early stopping) is essential.
Extending feedforward networks to specialized architectures and advanced techniques.
CNNs are designed for grid-structured data like images. Instead of connecting every neuron to every input, they use convolutional filters that slide across the input, detecting local patterns like edges, textures, and shapes.
Apply learnable filters to detect spatial features. Parameter sharing makes CNNs extremely efficient.
Downsample feature maps to reduce spatial dimensions. Max pooling retains the most prominent features.
Early layers detect edges, middle layers detect textures and parts, deep layers detect entire objects.
RNNs are designed for sequential data like text, speech, and time series. They maintain a hidden state that carries information across time steps:
Variants like LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) use gating mechanisms to better capture long-range dependencies and mitigate vanishing gradients.
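A single step of a vanilla RNN is just a tanh of the current input plus the previous hidden state, each through its own weight matrix. A minimal sketch (names and sizes are my own):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One vanilla RNN step: h_t = tanh(Wx x_t + Wh h_prev + b)."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wx = rng.standard_normal((d_h, d_in)) * 0.1
Wh = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)                             # initial hidden state
for x_t in rng.standard_normal((5, d_in)):    # a sequence of 5 steps
    h = rnn_step(x_t, h, Wx, Wh, b)           # state carries forward
```

The same weights Wx and Wh are reused at every time step, which is exactly why gradients flowing back through many steps can vanish or explode.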
Transformers replaced RNNs as the dominant architecture for sequence modeling. Their key innovation is the self-attention mechanism:
Self-attention allows every position in a sequence to attend to every other position directly, without the sequential bottleneck of RNNs. This enables massive parallelization and better long-range dependency modeling.
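Scaled dot-product self-attention can be sketched in a few lines: project the input into queries, keys, and values, score every position against every other, and take a softmax-weighted sum of values. A single-head, unmasked sketch under my own naming:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X: (seq_len, d_model); returns output and attention weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) pairwise scores
    weights = softmax(scores)         # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
seq, d_model, d_k = 6, 8, 4
X = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, d_k)) * 0.5
Wk = rng.standard_normal((d_model, d_k)) * 0.5
Wv = rng.standard_normal((d_model, d_k)) * 0.5
out, weights = self_attention(X, Wq, Wk, Wv)
```

Note that all positions are processed in one matrix product, with no sequential loop - the parallelism advantage over RNNs is visible directly in the code.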
Transformers power GPT, BERT, Vision Transformers (ViT), and virtually all modern AI breakthroughs. They demonstrate that attention-based architectures, combined with massive scale, can achieve remarkable performance across language, vision, and multi-modal tasks.
Preventing overfitting is critical for neural networks. Key regularization techniques include:
Randomly set a fraction $p$ (typically 0.2-0.5) of neuron activations to zero during training. This prevents co-adaptation and forces the network to learn redundant representations. At test time, scale activations by $(1-p)$.
Add a penalty $\frac{\lambda}{2}\|W\|^2$ to the loss function. This encourages smaller weights and smoother decision boundaries, reducing overfitting.
Monitor validation loss during training and stop when it begins to increase (while training loss continues to decrease). This finds the sweet spot before overfitting.
Artificially expand the training set by applying transformations (flipping, rotating, cropping for images; synonym replacement, back-translation for text). More data is the best regularizer.
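Dropout, the first technique above, is the most mechanical of the four and easy to sketch. This follows the convention in the text (drop at training time, scale by $(1-p)$ at test time); many frameworks instead use "inverted" dropout, scaling by $1/(1-p)$ during training so test time needs no change:

```python
import numpy as np

def dropout(A, p, rng, training=True):
    """Zero each activation with probability p during training;
    at test time, scale activations by (1 - p) instead."""
    if training:
        mask = rng.random(A.shape) >= p   # keep with probability 1 - p
        return A * mask
    return A * (1 - p)
```

Either convention keeps the expected activation the same between training and test, which is the whole point of the scaling.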
One of the most important theoretical results in neural networks:
A feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given appropriate activation functions and sufficient width.
However, while the theorem guarantees existence, it says nothing about learnability. In practice, deep networks (many layers) work much better than wide shallow networks because they learn hierarchical representations more efficiently.