RNN / LSTM
Cheat Sheet

Your quick reference for Recurrent Neural Networks -- from vanilla RNN equations and LSTM gates to GRU, vanishing gradients, and sequence modeling strategies.

Key Formulas

Vanilla RNN:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$
LSTM Forget Gate:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
LSTM Input Gate:
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$
LSTM Cell State:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
LSTM Output Gate:
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)$$
GRU Update Gate:
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$
GRU Reset Gate:
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$
GRU Hidden State:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$

Vanilla RNN

Architecture:
A single hidden layer applied recurrently, processing one time step at a time. At each step $t$, the hidden state $h_t$ depends on the current input $x_t$ and the previous hidden state $h_{t-1}$.
Hidden State Recurrence:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b), \quad \hat{y}_t = \text{softmax}(W_y h_t + b_y)$$
BPTT (Backpropagation Through Time):
Unroll the network across all time steps and apply standard backpropagation. Gradient of loss w.r.t. $W_h$ involves a product of Jacobians: $\prod_{k} \frac{\partial h_k}{\partial h_{k-1}}$.
Limitations:
Short-term memory only. Gradients vanish exponentially over long sequences because $\|\frac{\partial h_t}{\partial h_k}\| \approx \|W_h\|^{t-k}$. If the spectral radius of $W_h$ is below 1, gradients vanish; if above 1, they explode.

Vanilla RNNs work for short sequences (10--20 steps) but fail on long-range dependencies. They are rarely used in practice today -- LSTM, GRU, or Transformers are preferred for all but the simplest tasks.
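The recurrence above can be sketched in a few lines of NumPy. This is a minimal forward pass only (no training); the sizes and randomly initialized weights are illustrative, not learned values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4  # illustrative input and hidden sizes

# Randomly initialized parameters (in practice these are learned).
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One vanilla RNN step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Unroll over a short sequence of 5 random inputs.
h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):
    h = rnn_step(h, x_t)
```

Because tanh is applied at every step, each entry of $h_t$ stays in $(-1, 1)$; it is the repeated multiplication by $W_h$ inside the recurrence that drives the gradient problems discussed below.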

LSTM Gates

Forget Gate (Erase):
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$ Sigmoid output in $[0,1]$. Decides what to erase from the cell state. $f_t = 0$ forgets completely, $f_t = 1$ retains fully.
Input Gate (Write):
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$$ $i_t$ controls how much of the candidate $\tilde{C}_t$ to write. Tanh generates candidate values in $[-1,1]$.
Cell State (Long-Term Memory Highway):
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$ Linear self-loop that allows gradients to flow unchanged across many time steps. The key mechanism that solves the vanishing gradient problem.
Output Gate (Read):
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)$$ Controls what parts of the cell state are exposed as the hidden state output. Tanh squashes cell values; sigmoid selects which to output.

The cell state $C_t$ acts as a conveyor belt carrying information across time. Gates use sigmoid ($\sigma$) for soft binary decisions and tanh for generating candidate values. This gating mechanism gives LSTMs the ability to learn long-range dependencies over hundreds of time steps.
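The four gate equations compose into a single cell update. Below is a minimal NumPy sketch of one LSTM step, using one weight matrix per gate over the concatenated $[h_{t-1}, x_t]$; shapes and initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
d_z = d_h + d_x  # size of the concatenated [h_{t-1}, x_t]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix and bias per gate: forget, input, candidate, output.
W_f, W_i, W_C, W_o = (rng.normal(scale=0.1, size=(d_h, d_z)) for _ in range(4))
b_f, b_i, b_C, b_o = (np.zeros(d_h) for _ in range(4))

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)          # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)          # input gate: how much to write
    C_tilde = np.tanh(W_C @ z + b_C)    # candidate values in [-1, 1]
    C = f * C_prev + i * C_tilde        # cell state: additive highway
    o = sigmoid(W_o @ z + b_o)          # output gate: what to expose
    h = o * np.tanh(C)                  # hidden state
    return h, C
```

Note that the cell-state update `f * C_prev + i * C_tilde` is the only additive (non-squashed) path through the step, which is why gradients survive it.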

GRU (Gated Recurrent Unit)

Reset Gate:
$$r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)$$ Controls how much of the previous hidden state to forget when computing the candidate. $r_t = 0$ ignores the previous state entirely, so the candidate depends only on the current input, like a feedforward layer.
Update Gate:
$$z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)$$ Controls the balance between retaining old state and accepting new candidate. Combines the roles of LSTM's forget and input gates into one.
Candidate Hidden State:
$$\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)$$
Linear Interpolation:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$ When $z_t \approx 0$, the hidden state is copied forward (long-term memory). When $z_t \approx 1$, the state is replaced with the candidate.
Comparison to LSTM:
GRU has 2 gates vs. LSTM's 3 gates, no separate cell state, and ~25% fewer parameters. Empirically achieves similar performance to LSTM on most tasks; tends to train faster on smaller datasets.
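The GRU step is compact enough to write out directly. This is a minimal NumPy sketch with illustrative sizes and random initialization, mirroring the three equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
d_z = d_h + d_x  # size of the concatenated [h_{t-1}, x_t]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix and bias each for the update gate, reset gate, and candidate.
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(d_h, d_z)) for _ in range(3))
b_z, b_r, b_h = (np.zeros(d_h) for _ in range(3))

def gru_step(h_prev, x_t):
    z = sigmoid(W_z @ np.concatenate([h_prev, x_t]) + b_z)  # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]) + b_r)  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1 - z) * h_prev + z * h_tilde  # linear interpolation
```

The reset gate is applied to $h_{t-1}$ *before* the candidate's matrix multiply, while the update gate blends old and new state *after* it; keeping those two roles distinct is the main thing to remember about GRU.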

Vanishing Gradients

Why Gradients Vanish:
In vanilla RNNs, the gradient of the loss at step $t$ w.r.t. an earlier hidden state involves a product of Jacobians: $\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$, where each factor $\frac{\partial h_j}{\partial h_{j-1}} = \text{diag}(1 - h_j^2) \cdot W_h$. If $\|W_h\| < 1$, the product shrinks exponentially in $t - k$.
Why Gradients Explode:
If the spectral radius of $W_h > 1$, the same product grows exponentially, causing gradients to blow up. Loss diverges and training becomes unstable.
Solution 1: LSTM / GRU:
The cell state $C_t$ provides an additive path: $\frac{\partial C_t}{\partial C_{t-1}} = f_t$. When the forget gate is near 1, gradients flow with minimal decay -- analogous to ResNet skip connections for sequences.
Solution 2: Gradient Clipping:
$$\text{if } \|\nabla\| > \theta: \quad \nabla \leftarrow \frac{\theta}{\|\nabla\|} \nabla$$ Clips the gradient norm to a threshold $\theta$ (typically 1.0--5.0). Prevents explosions but does not fix vanishing.
Solution 3: Other Techniques:
Skip connections: residual RNN variants. Proper initialization: orthogonal init for $W_h$ keeps eigenvalues near 1. Layer normalization: stabilizes hidden state magnitudes across time.

Vanishing gradients make vanilla RNNs unable to learn dependencies beyond ~10--20 steps. LSTM/GRU extend this to hundreds of steps. For very long sequences (1K+), Transformers with self-attention are preferred because they provide direct connections between any two positions.
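The clipping rule from Solution 2 can be sketched as a small helper that rescales a whole list of gradient arrays by their *global* norm (the norm over all parameters together), which is the preferred variant mentioned above.

```python
import numpy as np

def clip_by_global_norm(grads, theta=1.0):
    """Rescale gradient arrays so their combined (global) norm is at most theta."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > theta:
        scale = theta / global_norm
        return [g * scale for g in grads]
    return grads  # norm already within the threshold: leave unchanged

grads = [np.array([3.0, 4.0])]                 # global norm = 5
clipped = clip_by_global_norm(grads, theta=1.0)  # rescaled to norm 1
```

Because all arrays are scaled by the same factor, clipping preserves the gradient's direction and only shrinks its magnitude.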

Hyperparameters

Hidden Size ($d_h$):
Typical: 64--1024. Determines capacity. Each LSTM layer has $4 \times (d_h^2 + d_h \cdot d_x + d_h)$ parameters. Larger = more capacity but slower and more data-hungry.
Number of Layers ($L$):
Typical: 1--4. Stacking layers builds hierarchical temporal representations. More than 3 layers often sees diminishing returns. Use residual connections for deep stacks.
Dropout:
Applied between layers (not within recurrence): rate 0.2--0.5. Variational dropout uses the same mask across time steps for better regularization. Zoneout randomly preserves hidden states.
Learning Rate:
Start with $\sim 10^{-3}$ (Adam) or $\sim 10^{-2}$ (SGD). RNNs are sensitive to LR -- too high causes gradient explosions even with clipping. Use schedulers: reduce on plateau or cosine annealing.
Gradient Clip Value ($\theta$):
Typical: 1.0--5.0. Clip by global norm is preferred over clip by value. Essential for stable RNN training; even LSTMs benefit from gradient clipping.
Batch Size:
Typical: 32--256. Smaller batches add noise that acts as regularization. For language modeling, often use batch size 64 with BPTT length 35--70.
Sequence Length (BPTT window):
Typical: 35--200. Longer = captures more context but costs more memory and compute (linear in sequence length). Truncated BPTT splits sequences into fixed-length chunks.

Modern best practice: use LSTM or GRU with hidden size 256--512, 2 layers, dropout 0.3, Adam optimizer with LR $10^{-3}$, gradient clipping at 1.0, and truncated BPTT with sequence length 50--100. Consider Transformers for new projects unless latency or streaming constraints favor RNNs.
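Truncated BPTT, mentioned under Sequence Length above, amounts to slicing a long sequence into fixed-length windows and backpropagating only within each window (while still carrying the hidden state across windows at the forward pass). A minimal sketch of the chunking:

```python
import numpy as np

def tbptt_chunks(sequence, window):
    """Split a long sequence into fixed-length chunks for truncated BPTT.

    At training time the hidden state is carried across chunks, but
    gradients are backpropagated only within each chunk.
    """
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]

seq = np.arange(120)                  # a toy sequence of 120 steps
chunks = tbptt_chunks(seq, window=50)  # chunk lengths: 50, 50, 20
```

Memory and compute per update then scale with the window length rather than the full sequence length.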

Pros vs Cons

Pros:

  • Handles variable-length sequences natively -- no fixed input size required
  • Captures temporal dependencies and sequential patterns that feedforward networks cannot
  • Parameter sharing across time steps -- same weights for every position, efficient for long sequences
  • Flexible architectures -- many-to-one, one-to-many, many-to-many, encoder-decoder, bidirectional
  • Well-suited for streaming / online inference -- processes one token at a time with constant memory
  • LSTM/GRU effectively learn long-range dependencies over hundreds of time steps

Cons:

  • Sequential computation -- cannot parallelize across time steps, making training slow on GPUs compared to Transformers
  • Vanilla RNNs suffer from vanishing/exploding gradients -- only LSTM/GRU partially mitigate this
  • Hard to capture very long-range dependencies (1000+ steps) even with LSTM -- attention or Transformers handle this better
  • Largely replaced by Transformers for NLP, machine translation, and many sequence tasks since 2018+
  • Debugging and interpreting internal states is difficult -- hidden states are opaque high-dimensional vectors

Interview Quick-Fire

Q: What is a Recurrent Neural Network?

A: A neural network designed for sequential data that maintains a hidden state across time steps. At each step, the hidden state is updated based on the current input and the previous hidden state: $h_t = f(W_h h_{t-1} + W_x x_t + b)$. This recurrence allows the network to model temporal dependencies in variable-length sequences like text, speech, and time series.

Q: What is the vanishing gradient problem in RNNs?

A: During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix at each time step. If the spectral radius of $W_h$ is less than 1, gradients shrink exponentially with sequence length, making it impossible to learn long-range dependencies. LSTM solves this with a cell state that provides an additive (rather than multiplicative) gradient path.

Q: What is the difference between LSTM and GRU?

A: LSTM uses three gates (forget, input, output) and a separate cell state for long-term memory. GRU uses two gates (reset, update) and merges cell and hidden state into one. GRU has ~25% fewer parameters and trains faster. Performance is similar on most tasks -- GRU is preferred for smaller datasets and faster iteration, LSTM for tasks requiring fine-grained memory control.

Q: What is a bidirectional RNN?

A: A bidirectional RNN processes the sequence in both forward and backward directions using two separate hidden states, then concatenates them: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. This allows each position to access context from both past and future. Essential for tasks where full context is available (e.g., NER, sentiment analysis), but not for autoregressive generation where future tokens are unknown.
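The concatenation $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ can be illustrated with the vanilla RNN from earlier. For brevity this sketch reuses one set of weights for both directions; a real bidirectional RNN learns separate forward and backward weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
W_h = rng.normal(scale=0.1, size=(d_h, d_h))
W_x = rng.normal(scale=0.1, size=(d_h, d_x))

def run_rnn(xs):
    """Run a vanilla RNN over a sequence, returning all hidden states."""
    h, out = np.zeros(d_h), []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ x_t)
        out.append(h)
    return np.stack(out)

xs = rng.normal(size=(5, d_x))
fwd = run_rnn(xs)                          # forward pass over x_1..x_T
bwd = run_rnn(xs[::-1])[::-1]              # backward pass, re-aligned to forward time
h_bi = np.concatenate([fwd, bwd], axis=1)  # h_t = [h_fwd_t ; h_bwd_t]
```

Each position's representation thus has dimension $2 d_h$ and sees the entire sequence.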

Q: What is Backpropagation Through Time (BPTT)?

A: BPTT unrolls the recurrent network across all time steps, creating a deep feedforward graph, then applies standard backpropagation. The gradient of the loss w.r.t. shared weights is the sum of gradients at each time step. Truncated BPTT limits the unrolling to a fixed window (e.g., 35 steps) to reduce memory and computation while approximating the full gradient.

Q: Why have Transformers largely replaced LSTMs?

A: Transformers use self-attention to connect any two positions directly in $O(1)$ computational steps (vs. $O(n)$ for RNNs), enabling better long-range dependency modeling. Critically, self-attention is fully parallelizable across sequence positions, making Transformers dramatically faster to train on modern GPUs/TPUs. They also scale better with data and compute, as demonstrated by GPT and BERT families.

Q: What is gradient clipping and why is it important for RNNs?

A: Gradient clipping rescales the gradient when its norm exceeds a threshold: if $\|\nabla\| > \theta$, set $\nabla \leftarrow \frac{\theta}{\|\nabla\|}\nabla$. It prevents exploding gradients that cause training instability in RNNs. Clipping by global norm (across all parameters) is preferred over per-parameter clipping. Typical thresholds are 1.0--5.0. It is essential even for LSTMs and GRUs.

Q: What is teacher forcing?

A: Teacher forcing feeds the ground-truth previous token as input during training instead of the model's own prediction. This accelerates convergence and stabilizes training by preventing error accumulation. However, it creates a train-test mismatch (exposure bias) since at inference the model must use its own predictions. Scheduled sampling gradually shifts from teacher forcing to model predictions during training to mitigate this gap.
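The train-time difference is a single line in the decoding loop: condition the next step on the ground truth rather than the model's own output. The toy `decode` below is a hypothetical stand-in for a real decoder, just to make the control flow concrete.

```python
import numpy as np

def decode(h_prev, prev_token):
    """Hypothetical one-step decoder (stand-in for a real RNN decoder)."""
    h = np.tanh(h_prev + prev_token)  # stand-in for the recurrence
    pred = np.round(h)                # stand-in for argmax over a vocabulary
    return h, pred

targets = [1.0, 0.0, 1.0]             # ground-truth tokens
h, prev = np.zeros(1), np.zeros(1)
for y in targets:
    h, pred = decode(h, prev)
    # Teacher forcing: feed the ground truth y as the next input,
    # not the model's own prediction pred.
    prev = np.array([y])
# At inference time the line above becomes `prev = pred`:
# the model conditions on its own (possibly wrong) predictions.
```

Scheduled sampling interpolates between the two regimes by choosing, at each step with some probability, whether `prev` is the ground truth or the model's prediction.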
