RNN / LSTM
Cheat Sheet
Your quick reference for Recurrent Neural Networks -- from vanilla RNN equations and LSTM gates to GRU, vanishing gradients, and sequence modeling strategies.
Vanilla RNNs work for short sequences (10--20 steps) but fail on long-range dependencies. They are rarely used in practice today -- LSTM, GRU, or Transformers are preferred for all but the simplest tasks.
The cell state $C_t$ acts as a conveyor belt carrying information across time. Gates use sigmoid ($\sigma$) for soft binary decisions and tanh for generating candidate values. This gating mechanism gives LSTMs the ability to learn long-range dependencies over hundreds of time steps.
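The gating arithmetic above can be sketched as a single LSTM time step. This is a minimal NumPy illustration, not a production layer; the weight layout (four stacked gate blocks) and function names are assumptions for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the four gate pre-activations
    in order: input (i), forget (f), candidate (g), output (o)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # stacked pre-activations, shape (4H,)
    i = sigmoid(z[0:H])               # input gate: how much candidate to write
    f = sigmoid(z[H:2*H])             # forget gate: how much old cell state to keep
    g = np.tanh(z[2*H:3*H])           # candidate values in (-1, 1)
    o = sigmoid(z[3*H:4*H])           # output gate: how much cell state to expose
    c = f * c_prev + i * g            # additive cell-state update (the "conveyor belt")
    h = o * np.tanh(c)                # new hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(size=(4*H, D))
U = rng.normal(size=(4*H, H))
b = np.zeros(4*H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
```

Note the update `c = f * c_prev + i * g` is additive in `c_prev`, which is exactly the gradient path that mitigates vanishing gradients.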
Vanishing gradients make vanilla RNNs unable to learn dependencies beyond ~10--20 steps. LSTM/GRU extend this to hundreds of steps. For very long sequences (1K+), Transformers with self-attention are preferred because they provide direct connections between any two positions.
Modern best practice: use LSTM or GRU with hidden size 256--512, 2 layers, dropout 0.3, Adam optimizer with LR $10^{-3}$, gradient clipping at 1.0, and truncated BPTT with sequence length 50--100. Consider Transformers for new projects unless latency or streaming constraints favor RNNs.
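The defaults above can be collected into one hyperparameter block. This is a hypothetical, framework-agnostic dictionary whose key names are illustrative, not any library's API.

```python
# Hypothetical hyperparameter block mirroring the recommended defaults;
# key names are illustrative, not tied to a specific framework.
RNN_DEFAULTS = {
    "cell": "LSTM",          # or "GRU"
    "hidden_size": 512,      # recommended range 256--512
    "num_layers": 2,
    "dropout": 0.3,          # applied between stacked layers
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "grad_clip_norm": 1.0,   # clip by global norm
    "tbptt_len": 100,        # truncated BPTT window, 50--100
}
```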
Q: What is a Recurrent Neural Network (RNN)?
A: A neural network designed for sequential data that maintains a hidden state across time steps. At each step, the hidden state is updated based on the current input and the previous hidden state: $h_t = f(W_h h_{t-1} + W_x x_t + b)$. This recurrence allows the network to model temporal dependencies in variable-length sequences like text, speech, and time series.
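The recurrence $h_t = f(W_h h_{t-1} + W_x x_t + b)$ can be written out directly. A minimal sketch with $f = \tanh$; shapes and names are assumptions for illustration.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b):
    """Run h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence.
    xs has shape (T, D); returns the hidden states, shape (T, H)."""
    H = W_h.shape[0]
    h = np.zeros(H)
    hs = []
    for x in xs:                  # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(1)
T, D, H = 5, 3, 4
hs = rnn_forward(rng.normal(size=(T, D)),
                 rng.normal(size=(H, H)) * 0.1,
                 rng.normal(size=(H, D)) * 0.1,
                 np.zeros(H))
```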
Q: Why do gradients vanish in RNNs, and how does LSTM address it?
A: During backpropagation through time (BPTT), gradients are multiplied by the recurrent weight matrix at each time step. If the spectral radius of $W_h$ is less than 1, gradients shrink exponentially with sequence length, making it impossible to learn long-range dependencies. LSTM solves this with a cell state that provides an additive (rather than multiplicative) gradient path.
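The exponential shrinkage is easy to see numerically: repeatedly multiplying a vector by a matrix with spectral radius below 1 drives its norm toward zero. A toy illustration (the tanh Jacobian, with entries at most 1, only shrinks gradients further):

```python
import numpy as np

# Repeated multiplication by W_h mimics the Jacobian product in BPTT.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
rho = np.max(np.abs(np.linalg.eigvals(W)))
W *= 0.9 / rho                      # rescale so the spectral radius is 0.9

v = np.ones(8)
norms = []
for _ in range(200):                # 200 "time steps"
    v = W @ v
    norms.append(np.linalg.norm(v))
# norms decays roughly like 0.9**k: negligible after a few hundred steps
```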
Q: What are the differences between LSTM and GRU?
A: LSTM uses three gates (forget, input, output) and a separate cell state for long-term memory. GRU uses two gates (reset, update) and merges cell and hidden state into one. GRU has ~25% fewer parameters and trains faster. Performance is similar on most tasks -- GRU is preferred for smaller datasets and faster iteration, LSTM for tasks requiring fine-grained memory control.
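The ~25% figure follows from counting gated weight blocks: an LSTM layer has 4 blocks (three gates plus the candidate) and a GRU has 3 (two gates plus the candidate), each of size $H(H+D)+H$. A quick check:

```python
# Per-layer parameter count for a gated RNN: each block has an
# input-to-hidden matrix (H x D), a hidden-to-hidden matrix (H x H),
# and a bias (H). LSTM has 4 blocks, GRU has 3.
def gated_params(n_blocks, H, D):
    return n_blocks * (H * (H + D) + H)

H, D = 512, 512
lstm_params = gated_params(4, H, D)
gru_params = gated_params(3, H, D)
savings = 1 - gru_params / lstm_params   # 0.25 regardless of H and D
```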
Q: What is a bidirectional RNN, and when should you use one?
A: A bidirectional RNN processes the sequence in both forward and backward directions using two separate hidden states, then concatenates them: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. This allows each position to access context from both past and future. Essential for tasks where full context is available (e.g., NER, sentiment analysis), but not for autoregressive generation where future tokens are unknown.
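The concatenation $[\overrightarrow{h}_t ; \overleftarrow{h}_t]$ can be sketched by running the same simple RNN on the sequence and on its reversal, then aligning the backward states. A minimal sketch, not an optimized layer; names are illustrative.

```python
import numpy as np

def rnn_pass(xs, W_h, W_x, b):
    # simple tanh RNN returning the hidden state at every step
    h = np.zeros(W_h.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        out.append(h)
    return np.stack(out)

def birnn(xs, fwd, bwd):
    """Concatenate forward states with backward states; the backward
    pass runs on the reversed sequence and is re-reversed so that
    index t lines up. fwd/bwd are (W_h, W_x, b) parameter tuples."""
    h_f = rnn_pass(xs, *fwd)
    h_b = rnn_pass(xs[::-1], *bwd)[::-1]
    return np.concatenate([h_f, h_b], axis=1)   # shape (T, 2H)

rng = np.random.default_rng(2)
T, D, H = 6, 3, 4
mk = lambda: (rng.normal(size=(H, H)) * 0.1,
              rng.normal(size=(H, D)) * 0.1,
              np.zeros(H))
h = birnn(rng.normal(size=(T, D)), mk(), mk())
```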
Q: How does backpropagation through time (BPTT) work?
A: BPTT unrolls the recurrent network across all time steps, creating a deep feedforward graph, then applies standard backpropagation. The gradient of the loss w.r.t. shared weights is the sum of gradients at each time step. Truncated BPTT limits the unrolling to a fixed window (e.g., 35 steps) to reduce memory and computation while approximating the full gradient.
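The truncation can be sketched as chunking a long sequence into fixed windows and carrying the hidden state (but not gradients) across window boundaries. The `step` callback is a hypothetical stand-in for one forward/backward pass.

```python
def tbptt_windows(seq, k):
    """Split a long sequence into windows of length k for truncated
    BPTT: gradients flow only within a window, while the hidden
    state is carried across window boundaries."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def train_epoch(seq, k, h0, step):
    """Skeleton loop: `step` (hypothetical) runs forward/backward on
    one window and returns the final hidden state to seed the next."""
    h = h0
    for window in tbptt_windows(seq, k):
        h = step(window, h)      # backprop stays inside the window
    return h

seq = list(range(100))
wins = tbptt_windows(seq, 35)    # windows of 35, 35, and 30 steps
```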
Q: Why have Transformers largely replaced RNNs for sequence modeling?
A: Transformers use self-attention to connect any two positions directly in $O(1)$ computational steps (vs. $O(n)$ for RNNs), enabling better long-range dependency modeling. Critically, self-attention is fully parallelizable across sequence positions, making Transformers dramatically faster to train on modern GPUs/TPUs. They also scale better with data and compute, as demonstrated by GPT and BERT families.
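The "direct connection between any two positions" is visible in a single-head self-attention sketch: the $T \times T$ score matrix relates every pair of positions in one matrix product, with no recurrence. Weight names are illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over X (T, D).
    Every position attends to every other position in one step."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (T, T): all pairs at once
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # (T, d_v)

rng = np.random.default_rng(3)
T, D = 5, 8
X = rng.normal(size=(T, D))
out = self_attention(X,
                     rng.normal(size=(D, D)),
                     rng.normal(size=(D, D)),
                     rng.normal(size=(D, D)))
```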
Q: What is gradient clipping and why is it important for RNNs?
A: Gradient clipping rescales the gradient when its norm exceeds a threshold: if $\|\nabla\| > \theta$, set $\nabla \leftarrow \frac{\theta}{\|\nabla\|}\nabla$. It prevents exploding gradients that cause training instability in RNNs. Clipping by global norm (across all parameters) is preferred over per-parameter clipping. Typical thresholds are 1.0--5.0. It is essential even for LSTMs and GRUs.
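The global-norm rule above translates directly into code: compute one L2 norm across all parameter gradients, and rescale everything by $\theta / \|\nabla\|$ only when the norm exceeds the threshold. A minimal NumPy sketch:

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint (global) L2
    norm does not exceed `threshold`; the direction is preserved."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > threshold:
        scale = threshold / total
        grads = [g * scale for g in grads]
    return grads, total

# global norm = sqrt(4 * 3^2 + 9 * 4^2) = sqrt(180), well above 1.0
grads = [np.full(4, 3.0), np.full(9, 4.0)]
clipped, norm = clip_by_global_norm(grads, 1.0)
```

Because all arrays share one scale factor, the relative magnitudes between parameters are untouched, which is why global-norm clipping is preferred over clipping each tensor separately.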
Q: What is teacher forcing, and what are its drawbacks?
A: Teacher forcing feeds the ground-truth previous token as input during training instead of the model's own prediction. This accelerates convergence and stabilizes training by preventing error accumulation. However, it creates a train-test mismatch (exposure bias) since at inference the model must use its own predictions. Scheduled sampling gradually shifts from teacher forcing to model predictions during training to mitigate this gap.
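Teacher forcing and scheduled sampling differ only in which token is fed back at each step. A minimal sketch of a training-time decoding loop; `predict` is a hypothetical stand-in for the model's one-step function (in real training its output would also feed the loss).

```python
import numpy as np

def decode_train(targets, predict, p_teacher, rng):
    """One training-time decoding pass. With probability `p_teacher`
    the ground-truth previous token is fed back; otherwise the model's
    own prediction is. p_teacher = 1.0 is pure teacher forcing;
    annealing it toward 0 over training is scheduled sampling."""
    inputs, prev = [], targets[0]          # targets[0] acts as a start token
    for t in range(1, len(targets)):
        inputs.append(prev)                # what the model sees at step t
        model_out = predict(prev)          # model's own next-token guess
        prev = targets[t] if rng.random() < p_teacher else model_out
    return inputs

rng = np.random.default_rng(4)
# pure teacher forcing: the ground truth is always fed back
ins = decode_train([0, 1, 2, 3], predict=lambda tok: -1, p_teacher=1.0, rng=rng)
```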