
Neural Networks Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Network Architecture

Layer Notation:
$$a^{[l]} = \text{activation of layer } l$$
Weight Matrix:
$$W^{[l]} \in \mathbb{R}^{n^{[l]} \times n^{[l-1]}}$$
Bias Vector:
$$b^{[l]} \in \mathbb{R}^{n^{[l]}}$$
Parameters per Layer:
$$n^{[l]} \times n^{[l-1]} + n^{[l]}$$
  • Input layer: $a^{[0]} = x$ (no computation)
  • Hidden layers: learn feature representations
  • Output layer: $a^{[L]} = \hat{y}$ (prediction)
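The parameter-count formula above can be checked with a few lines of plain Python (the 4-8-3 architecture here is just an illustrative example):

```python
# Parameters per layer: n_l * n_{l-1} weights plus n_l biases.
def layer_params(n_out, n_in):
    return n_out * n_in + n_out

sizes = [4, 8, 3]  # hypothetical: 4 inputs -> 8 hidden -> 3 outputs
total = sum(layer_params(sizes[l], sizes[l - 1]) for l in range(1, len(sizes)))
# layer 1: 8*4 + 8 = 40; layer 2: 3*8 + 3 = 27; total = 67
```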

Activation Functions

Sigmoid:
$$\sigma(z) = \frac{1}{1+e^{-z}} \quad \text{range: }(0,1)$$
Tanh:
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \quad \text{range: }(-1,1)$$
ReLU:
$$f(z) = \max(0, z) \quad \text{range: }[0, \infty)$$
Leaky ReLU:
$$f(z) = \max(\alpha z, z) \quad \alpha \approx 0.01$$
Softmax:
$$\text{softmax}(z_k) = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$$

Hidden layers: use ReLU. Output: sigmoid (binary), softmax (multi-class), linear (regression).
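The activations above are one-liners in NumPy; a minimal sketch (the max-subtraction in softmax is a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()
```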

Forward Propagation

Pre-activation:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
Activation:
$$a^{[l]} = g^{[l]}(z^{[l]})$$
Combined:
$$a^{[l]} = g(W^{[l]}a^{[l-1]} + b^{[l]})$$
Batch (matrix):
$$A^{[l]} = g(W^{[l]} A^{[l-1]} + b^{[l]})$$

Cache all $z^{[l]}$ and $a^{[l]}$ values - they are needed for backpropagation!
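The forward pass with caching can be sketched as follows (a simplification that applies one activation `g` at every layer, whereas real networks often mix activations):

```python
import numpy as np

def forward(x, params, g):
    """Forward propagation, caching z and a for backprop.
    params: list of (W, b) tuples, one per layer."""
    a = x
    cache = [(None, a)]          # a^[0] = x, no pre-activation
    for W, b in params:
        z = W @ a + b            # z^[l] = W^[l] a^[l-1] + b^[l]
        a = g(z)                 # a^[l] = g(z^[l])
        cache.append((z, a))
    return a, cache
```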

Loss Functions

MSE (Regression):
$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Binary CE:
$$\mathcal{L} = -\frac{1}{n}\sum[y\log(\hat{y}) + (1-y)\log(1-\hat{y})]$$
Categorical CE:
$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)$$

Use cross-entropy for classification (stronger gradients than MSE with sigmoid/softmax).
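The three losses as NumPy functions; the clipping is an assumption added here to avoid log(0) on hard 0/1 predictions:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def binary_ce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_ce(y, y_hat, eps=1e-12):
    # y is one-hot, so only the true class contributes to the sum
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))
```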

Backpropagation

Output Error:
$$\delta^{[L]} = a^{[L]} - y$$
Hidden Error:
$$\delta^{[l]} = (W^{[l+1]T}\delta^{[l+1]}) \odot g'(z^{[l]})$$
Weight Gradient:
$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} a^{[l-1]T}$$
Bias Gradient:
$$\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}$$
Weight Update:
$$W^{[l]} := W^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial W^{[l]}}$$

$\odot$ = element-wise (Hadamard) product. Propagate from layer L back to layer 1.
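The five equations above, applied to a cache of (z, a) pairs (index 0 holding the input); this sketch assumes the output error simplifies to $a^{[L]} - y$, which holds for sigmoid/BCE and softmax/CCE pairings:

```python
import numpy as np

def backward(y, params, cache, g_prime):
    """Gradients for one example. cache[l] = (z^[l], a^[l]); cache[0] = (None, x)."""
    grads = []
    _, a_L = cache[-1]
    delta = a_L - y                           # delta^[L] = a^[L] - y
    for l in range(len(params), 0, -1):
        W, _ = params[l - 1]
        z_prev, a_prev = cache[l - 1]
        grads.append((np.outer(delta, a_prev),  # dL/dW = delta^[l] a^[l-1]^T
                      delta))                   # dL/db = delta^[l]
        if l > 1:
            delta = (W.T @ delta) * g_prime(z_prev)  # Hadamard product
    return grads[::-1]                        # reorder from layer 1 to L
```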

Common Mistakes

  • Using sigmoid in hidden layers (causes vanishing gradients - use ReLU)
  • Not normalizing or standardizing input features before training
  • Initializing all weights to zero (breaks symmetry - neurons learn the same thing)
  • Using too high a learning rate (loss diverges/oscillates wildly)
  • Using too low a learning rate (training takes forever, gets stuck)
  • Wrong activation for output layer (e.g., ReLU for classification instead of sigmoid/softmax)
  • Forgetting to shuffle training data between epochs
  • Not monitoring validation loss (overfitting without noticing)

Hyperparameters

  • Learning rate ($\eta$): Start with 0.001 (Adam) or 0.01 (SGD). Most important hyperparameter.
  • Batch size: 32-256 typical. Larger = more stable gradients, smaller = noisier updates that often generalize better.
  • Number of layers: Start with 2-3 hidden layers. Add more only if underfitting.
  • Neurons per layer: Common to start wide and narrow toward output. Powers of 2 (64, 128, 256).
  • Epochs: Use early stopping based on validation loss rather than fixed number.
  • Dropout rate: 0.2-0.5 typical. Higher = stronger regularization.
  • Weight decay: 1e-4 to 1e-2 typical for L2 regularization.

Adam optimizer with learning rate 0.001 is a great default starting point.
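The update rule behind that default can be sketched as a single Adam step in NumPy (standard default coefficients; illustrative only, not a full optimizer):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. m, v: running first/second moment; t: step count (from 1)."""
    m = b1 * m + (1 - b1) * grad          # momentum on the gradient
    v = b2 * v + (1 - b2) * grad ** 2     # running mean of squared gradient
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```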

Interview Questions

Q: Why do we need nonlinear activation functions?

A: Without nonlinear activations, stacking layers produces only a linear transformation (composition of linear functions is linear). Nonlinearity allows the network to learn complex, nonlinear decision boundaries.
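The collapse of stacked linear layers can be verified numerically: $W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]})$.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2            # two linear layers, no activation
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)      # a single equivalent linear layer
```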

Q: What is the vanishing gradient problem?

A: In deep networks with sigmoid/tanh, gradients shrink exponentially as they propagate backward (each layer multiplies by values less than 1). Early layers barely learn. ReLU and residual connections solve this.
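The "exponentially shrinking" claim is easy to quantify: sigmoid's derivative is at most 0.25, so even in the best case a 20-layer chain multiplies the gradient by at most:

```python
# Best-case gradient scaling through 20 sigmoid layers
shrink = 0.25 ** 20   # on the order of 1e-12: early layers barely learn
```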

Q: Why is ReLU preferred over sigmoid in hidden layers?

A: ReLU avoids vanishing gradients (derivative is 1 for positive inputs), is computationally faster (simple threshold), and produces sparse activations. Sigmoid saturates and its max derivative is only 0.25.

Q: What happens if you initialize all weights to zero?

A: All neurons in a layer compute the same output and receive the same gradient, so they update identically and remain identical forever. Because the symmetry is never broken (random initialization is what breaks it), the network effectively has one distinct neuron per layer.
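A small sketch of the symmetry: with zero weights every hidden neuron sees the same pre-activation and (by symmetry) the same error, so one gradient step leaves all rows of W identical (the identical per-neuron error of 0.5 is an illustrative assumption):

```python
import numpy as np

W = np.zeros((4, 3))                   # 4 hidden neurons, all identical
x = np.array([1.0, 2.0, 3.0])
z = W @ x                              # all zeros -> identical activations
grad = np.outer(np.full(4, 0.5), x)    # identical deltas -> identical rows of dW
W -= 0.1 * grad
# Every row of W is still the same vector: the layer acts like one neuron.
```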

Q: How is a neural network different from logistic regression?

A: Logistic regression is a single-layer neural network with sigmoid activation. A neural network adds hidden layers, enabling it to learn nonlinear features automatically rather than requiring manual feature engineering.

Q: What is the Universal Approximation Theorem?

A: A feedforward network with one hidden layer of sufficient width can approximate any continuous function. However, it does not guarantee learnability - deep networks learn hierarchical representations more efficiently in practice.