
Neural Networks Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Network Architecture

Layer Notation:
$$a^{[l]} = \text{activation of layer } l$$
Weight Matrix:
$$W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$$
Bias Vector:
$$b^{[l]} \in \mathbb{R}^{n^{[l]}}$$
Parameters per Layer:
$$n^{[l]} \times n^{[l-1]} + n^{[l]}$$
  • Input layer: $a^{[0]} = x$ (no computation)
  • Hidden layers: learn feature representations
  • Output layer: $a^{[L]} = \hat{y}$ (prediction)
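The parameter-count formula above can be checked with a few lines of plain Python (the 4-8-3 architecture here is just an illustrative example):

```python
# Parameters per layer: n_l * n_{l-1} weights plus n_l biases.
def layer_params(n_out, n_in):
    return n_out * n_in + n_out

sizes = [4, 8, 3]  # hypothetical: 4 inputs -> 8 hidden -> 3 outputs
total = sum(layer_params(sizes[l], sizes[l - 1]) for l in range(1, len(sizes)))
# layer 1: 8*4 + 8 = 40; layer 2: 3*8 + 3 = 27; total = 67
```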

Activation Functions

Sigmoid:
$$\sigma(z) = \frac{1}{1+e^{-z}} \quad \text{range: }(0,1)$$
Tanh:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \quad \text{range: }(-1,1)$$
ReLU:
$$f(z) = \max(0, z) \quad \text{range: }[0, \infty)$$
Leaky ReLU:
$$f(z) = \max(\alpha z, z) \quad \alpha \approx 0.01$$
Softmax:
$$\text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$$

Hidden layers: use ReLU. Output: sigmoid (binary), softmax (multi-class), linear (regression).
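The activations above are one-liners in NumPy; a minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()
```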

Forward Propagation

Pre-activation:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
Activation:
$$a^{[l]} = g^{[l]}(z^{[l]})$$
Combined:
$$a^{[l]} = g(W^{[l]}a^{[l-1]} + b^{[l]})$$
Batch (matrix):
$$A^{[l]} = g(W^{[l]} A^{[l-1]} + b^{[l]})$$

Cache all $z^{[l]}$ and $a^{[l]}$ values - they are needed for backpropagation!
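The forward pass with caching can be sketched as follows (a simplification that applies one activation `g` at every layer, whereas real networks often mix activations):

```python
import numpy as np

def forward(x, params, g):
    """Forward propagation, caching z and a for backprop.
    params: list of (W, b) tuples, one per layer."""
    a = x
    cache = [(None, a)]          # a^[0] = x, no pre-activation
    for W, b in params:
        z = W @ a + b            # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)                 # a^[l] = g(z^[l])
        cache.append((z, a))
    return a, cache
```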

Loss Functions

MSE (Regression):
$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Binary CE:
$$\mathcal{L} = -\frac{1}{n}\sum[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$$
Categorical CE:
$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)$$

Use cross-entropy for classification (stronger gradients than MSE with sigmoid/softmax).
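The three losses as NumPy functions; the clipping is an assumption added here to avoid log(0) on hard 0/1 predictions:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_ce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_ce(y, y_hat, eps=1e-12):
    # y is one-hot, so only the true class contributes to the sum
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))
```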

Backpropagation

Output Error:
$$\delta^{[L]} = a^{[L]} - y$$
Hidden Error:
$$\delta^{[l]} = (W^{[l+1]T}\delta^{[l+1]}) \odot g'(z^{[l]})$$
Weight Gradient:
$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} a^{[l-1]T}$$
Bias Gradient:
$$\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}$$
Weight Update:
$$W^{[l]} := W^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial W^{[l]}}$$

$\odot$ = element-wise (Hadamard) product. Propagate from layer L back to layer 1.
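The five equations above, applied to a cache of (z, a) pairs (index 0 holding the input); this sketch assumes the output error simplifies to $a^{[L]} - y$, which holds for sigmoid/BCE and softmax/CCE pairings:

```python
import numpy as np

def backward(y, params, cache, g_prime):
    """Gradients for one example. cache[l] = (z^[l], a^[l]); cache[0] = (None, x)."""
    grads = []
    _, a_L = cache[-1]
    delta = a_L - y                           # delta^[L] = a^[L] - y
    for l in range(len(params), 0, -1):
        W, _ = params[l - 1]
        z_prev, a_prev = cache[l - 1]
        grads.append((np.outer(delta, a_prev),  # dL/dW = delta^[l] a^[l-1]^T
                      delta))                   # dL/db = delta^[l]
        if l > 1:
            delta = (W.T @ delta) * g_prime(z_prev)  # Hadamard product
    return grads[::-1]                        # reorder from layer 1 to L
```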

Common Mistakes

  • Using sigmoid in hidden layers (causes vanishing gradients - use ReLU)
  • Not normalizing or standardizing input features before training
  • Initializing all weights to zero (breaks symmetry - neurons learn the same thing)
  • Using too high a learning rate (loss diverges/oscillates wildly)
  • Using too low a learning rate (training takes forever, gets stuck)
  • Wrong activation for output layer (e.g., ReLU for classification instead of sigmoid/softmax)
  • Forgetting to shuffle training data between epochs
  • Not monitoring validation loss (overfitting without noticing)

Hyperparameters

  • Learning rate ($\eta$): Start with 0.001 (Adam) or 0.01 (SGD). Most important hyperparameter.
  • Batch size: 32-256 typical. Larger = more stable gradients, smaller = noisier updates that often generalize better.
  • Number of layers: Start with 2-3 hidden layers. Add more only if underfitting.
  • Neurons per layer: Common to start wide and narrow toward output. Powers of 2 (64, 128, 256).
  • Epochs: Use early stopping based on validation loss rather than fixed number.
  • Dropout rate: 0.2-0.5 typical. Higher = stronger regularization.
  • Weight decay: 1e-4 to 1e-2 typical for L2 regularization.

Adam optimizer with learning rate 0.001 is a great default starting point.
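The update rule behind that default can be sketched as a single Adam step in NumPy (standard default coefficients; illustrative only, not a full optimizer):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running first/second moment; t: step count (from 1)."""
    m = b1 * m + (1 - b1) * grad          # momentum on the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradient
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```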

Interview Questions

Q: Why do we need nonlinear activation functions?

A: Without nonlinear activations, stacking layers produces only a linear transformation (composition of linear functions is linear). Nonlinearity allows the network to learn complex, nonlinear decision boundaries.
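The collapse of stacked linear layers can be verified numerically: $W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]})$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2            # two linear layers, no activation
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # a single equivalent linear layer
```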

Q: What is the vanishing gradient problem?

A: In deep networks with sigmoid/tanh, gradients shrink exponentially as they propagate backward (each layer multiplies by values less than 1). Early layers barely learn. ReLU and residual connections solve this.
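The "exponentially shrinking" claim is easy to quantify: sigmoid's derivative is at most 0.25, so even in the best case a 20-layer chain multiplies the gradient by at most:

```python
# Best-case gradient scaling through 20 sigmoid layers
shrink = 0.25 ** 20   # on the order of 1e-12: early layers barely learn
```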

Q: Why is ReLU preferred over sigmoid in hidden layers?

A: ReLU avoids vanishing gradients (derivative is 1 for positive inputs), is computationally faster (simple threshold), and produces sparse activations. Sigmoid saturates and its max derivative is only 0.25.

Q: What happens if you initialize all weights to zero?

A: All neurons in a layer compute the same output and receive the same gradient, so they update identically and remain identical forever. Because the symmetry is never broken (random initialization is what breaks it), the network effectively has one distinct neuron per layer.
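A small sketch of the symmetry: with zero weights every hidden neuron sees the same pre-activation and (by symmetry) the same error, so one gradient step leaves all rows of W identical (the identical per-neuron error of 0.5 is an illustrative assumption):

```python
import numpy as np

W = np.zeros((4, 3))                   # 4 hidden neurons, all identical
x = np.array([1.0, 2.0, 3.0])
z = W @ x                              # all zeros -> identical activations
grad = np.outer(np.full(4, 0.5), x)    # identical deltas -> identical rows of dW
W -= 0.1 * grad
# Every row of W is still the same vector: the layer acts like one neuron.
```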

Q: How is a neural network different from logistic regression?

A: Logistic regression is a single-layer neural network with sigmoid activation. A neural network adds hidden layers, enabling it to learn nonlinear features automatically rather than requiring manual feature engineering.

Q: What is the Universal Approximation Theorem?

A: A feedforward network with one hidden layer of sufficient width can approximate any continuous function. However, it does not guarantee learnability - deep networks learn hierarchical representations more efficiently in practice.