Neural Networks Cheat Sheet
Everything you need on one page. Perfect for revision, interviews, and quick reference.
Hidden layers: use ReLU. Output: sigmoid (binary), softmax (multi-class), linear (regression).
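The three standard activations above can be sketched in a few lines of NumPy (the max-shift in softmax is the usual numerical-stability trick, not an extra assumption):

```python
import numpy as np

def relu(z):
    # Hidden layers: element-wise max(0, z)
    return np.maximum(0, z)

def sigmoid(z):
    # Binary output: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-z))

def softmax(z):
    # Multi-class output: shift by the max for numerical stability,
    # then normalize so the outputs sum to 1
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```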
Cache all $z^{[l]}$ and $a^{[l]}$ values - they are needed for backpropagation!
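A minimal forward pass that caches every $z^{[l]}$ and $a^{[l]}$ might look like this (the `params` dict layout `W1, b1, ..., WL, bL` is an illustrative convention, not from the source):

```python
import numpy as np

def forward(x, params, L):
    # Cache every z^[l] and a^[l]; backpropagation reuses all of them
    cache = {"a0": x}
    a = x
    for l in range(1, L + 1):
        z = params[f"W{l}"] @ a + params[f"b{l}"]
        a = np.maximum(0, z) if l < L else z   # ReLU hidden, linear output
        cache[f"z{l}"], cache[f"a{l}"] = z, a
    return a, cache
```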
Use cross-entropy for classification (stronger gradients than MSE with sigmoid/softmax).
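The "stronger gradients" claim can be checked numerically: with a sigmoid output, cross-entropy gives $\partial L/\partial z = a - y$, while squared error multiplies by $\sigma'(z)$, which is nearly zero when the unit saturates. A small sketch for a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, y = -5.0, 1.0                    # saturated unit, wrong answer
a = sigmoid(z)                      # ~0.0067

grad_ce = a - y                     # cross-entropy: large, useful gradient
grad_mse = (a - y) * a * (1 - a)    # MSE: crushed by sigmoid'(z) ~ 0
```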
$\odot$ = element-wise (Hadamard) product. Propagate from layer L back to layer 1.
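One backward step $dz^{[l]} = (W^{[l+1]\top} dz^{[l+1]}) \odot g'(z^{[l]})$ can be sketched as follows for a ReLU layer (in NumPy, `*` on arrays is the element-wise product):

```python
import numpy as np

def relu_grad(z):
    # g'(z) for ReLU: 1 where z > 0, else 0
    return (z > 0).astype(float)

def backward_step(W_next, dz_next, z_l):
    # dz^[l] = (W^[l+1].T @ dz^[l+1]) ⊙ g'(z^[l])
    return (W_next.T @ dz_next) * relu_grad(z_l)
```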
Adam optimizer with learning rate 0.001 is a great default starting point.
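For reference, the Adam update itself (with the standard bias correction and the defaults lr = 0.001, $\beta_1$ = 0.9, $\beta_2$ = 0.999) is a short sketch; `adam_step` is a hypothetical helper name:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # m, v: running first/second moment estimates; t: step count (from 1)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```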
Q: Why do neural networks need nonlinear activation functions?
A: Without nonlinear activations, stacking layers produces only a linear transformation (a composition of linear functions is linear). Nonlinearity lets the network learn complex, nonlinear decision boundaries.
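The collapse is easy to verify: two stacked linear layers compute exactly the same map as the single layer $W^{[2]}W^{[1]}$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((2, 3))
x = rng.standard_normal(4)

deep = W2 @ (W1 @ x)       # "two-layer" network, no activation
single = (W2 @ W1) @ x     # one equivalent linear layer
```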
Q: What is the vanishing gradient problem?
A: In deep networks with sigmoid/tanh, gradients shrink exponentially as they propagate backward (each layer multiplies by derivatives less than 1). Early layers barely learn. ReLU and residual connections mitigate this.
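The exponential shrinkage is visible even in the best case: sigmoid's derivative peaks at 0.25 (at $z = 0$), so a gradient passing through 20 sigmoid layers picks up a factor of at most $0.25^{20}$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

grad = 1.0
for _ in range(20):
    s = sigmoid(0.0)          # z = 0 is sigmoid's best case
    grad *= s * (1 - s)       # sigmoid'(0) = 0.25
# grad = 0.25**20, on the order of 1e-12
```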
Q: Why is ReLU preferred over sigmoid in hidden layers?
A: ReLU avoids vanishing gradients (derivative is 1 for positive inputs), is computationally cheaper (a simple threshold), and produces sparse activations. Sigmoid saturates, and its maximum derivative is only 0.25.
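Both derivative claims can be checked directly: the sigmoid derivative $\sigma(z)(1-\sigma(z))$ never exceeds 0.25, while the ReLU derivative is exactly 1 on the positive side.

```python
import numpy as np

z = np.linspace(-10, 10, 10001)
sig = 1 / (1 + np.exp(-z))
max_sigmoid_grad = (sig * (1 - sig)).max()   # peaks at 0.25, at z = 0
relu_grad = (z > 0).astype(float)            # exactly 1 for z > 0
```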
Q: What happens if all weights are initialized to the same value?
A: All neurons in a layer compute the same output and receive the same gradient, so they update identically and remain identical forever. Because of this symmetry problem the network effectively has one neuron per layer; random initialization "breaks" the symmetry.
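A tiny two-neuron example (constant init, tanh hidden layer, squared-error output; the specific sizes and values are illustrative) shows the gradient rows coming out identical:

```python
import numpy as np

# Symmetric init: every weight in a layer has the same value
W1, W2 = np.full((2, 3), 0.5), np.full((1, 2), 0.5)
x, y = np.array([1.0, 2.0, 3.0]), 1.0

a1 = np.tanh(W1 @ x)                 # both hidden neurons output the same value
dz2 = (W2 @ a1) - y                  # output error (linear output, squared loss)
dz1 = (W2.T @ dz2) * (1 - a1 ** 2)   # identical for both neurons
dW1 = np.outer(dz1, x)               # both rows equal -> updates identical
```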
Q: How does a neural network relate to logistic regression?
A: Logistic regression is a single-layer neural network with sigmoid activation. A neural network adds hidden layers, enabling it to learn nonlinear features automatically rather than requiring manual feature engineering.
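Written out, the correspondence is literal: logistic regression is one linear map followed by a sigmoid, i.e. a network with no hidden layer.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_regression(x, w, b):
    # A "network" with zero hidden layers: linear map + sigmoid output
    return sigmoid(w @ x + b)
```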
Q: What does the universal approximation theorem say?
A: A feedforward network with one hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary accuracy. However, the theorem guarantees expressiveness, not learnability; in practice, deep networks learn hierarchical representations more efficiently.
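As a concrete taste of one-hidden-layer expressiveness: two ReLU units represent $|x|$ exactly, since $|x| = \mathrm{relu}(x) + \mathrm{relu}(-x)$.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hidden layer with weights [1, -1], output weights [1, 1]: computes |x|
x = np.linspace(-3, 3, 7)
approx = relu(x) + relu(-x)
```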