Transformers & Attention Mechanism
From Bahdanau attention to GPT and BERT. A comprehensive, visually interactive deep dive into the architecture that revolutionized natural language processing and beyond.
How the attention mechanism evolved from a simple alignment trick into the foundation of modern deep learning.
The story of Transformers begins with a fundamental limitation of sequence-to-sequence models. In 2014, the standard approach to machine translation was an encoder-decoder architecture built from Recurrent Neural Networks (RNNs). The encoder would read an entire input sentence and compress it into a single fixed-length context vector. The decoder would then generate the translation from this single vector.
The problem was obvious: squeezing an entire sentence into one fixed-size vector creates an information bottleneck. Long sentences lost critical details. Dzmitry Bahdanau, along with Cho and Bengio, proposed a revolutionary solution in their 2014 paper: instead of using a single context vector, allow the decoder to look back at all encoder hidden states and focus on the most relevant ones at each decoding step.
This mechanism -- called attention -- computes a weighted sum of encoder hidden states, where the weights (attention scores) indicate how relevant each input token is to the current output token being generated. The attention score between decoder state \( s_t \) and encoder state \( h_j \) is:

\[ e_{tj} = a(s_t, h_j), \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})}, \qquad c_t = \sum_{j} \alpha_{tj} h_j \]
Where \( a \) is a learned alignment function, \( \alpha_{tj} \) are the normalized attention weights, and \( c_t \) is the context vector for decoding step \( t \). This simple idea improved translation quality dramatically, especially for long sentences.
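As a concrete illustration, here is a minimal NumPy sketch of Bahdanau-style additive attention for a single decoding step. The alignment function \( a \) is a small feed-forward network; the parameter shapes and names here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_t, H, W_s, W_h, v):
    """One decoding step of Bahdanau-style attention.
    s_t: (d_dec,) current decoder state
    H:   (n, d_enc) all encoder hidden states
    W_s, W_h, v: learned parameters of the alignment MLP
    """
    # Alignment scores: e_tj = v . tanh(W_s s_t + W_h h_j)
    scores = np.tanh(s_t @ W_s + H @ W_h) @ v   # (n,)
    alpha = softmax(scores)                     # attention weights, sum to 1
    c_t = alpha @ H                             # context vector, (d_enc,)
    return c_t, alpha

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 7, 8, 6, 5
H = rng.standard_normal((n, d_enc))
s_t = rng.standard_normal(d_dec)
W_s = rng.standard_normal((d_dec, d_att))
W_h = rng.standard_normal((d_enc, d_att))
v = rng.standard_normal(d_att)

c_t, alpha = additive_attention(s_t, H, W_s, W_h, v)
print("context:", c_t.shape, " weights sum:", alpha.sum())
```

The decoder would concatenate \( c_t \) with its own state before predicting the next token; that step is omitted here.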
The critical leap came in 2017 when Vaswani et al. at Google published their landmark paper "Attention Is All You Need". Their insight was radical: remove the recurrence entirely and build the entire model using only attention mechanisms. The result was the Transformer architecture.
The key innovation was self-attention (also called intra-attention), where a sequence attends to itself. Instead of computing attention between encoder and decoder, each token in a sequence computes attention weights with every other token in the same sequence. This allows the model to capture dependencies regardless of distance, with a constant-length path between any two positions.
The Transformer achieved state-of-the-art results on English-to-German and English-to-French translation while being significantly faster to train than RNN-based models. More importantly, it became the foundation for virtually every major advance in NLP that followed.
From a translation model to the backbone of modern AI, the Transformer's impact has been extraordinary:
Why letting every token attend to every other token is so powerful -- and why it replaced recurrence.
Recurrent Neural Networks process sequences one token at a time. To understand how the first word in a sentence relates to the last word, information must flow through every intermediate hidden state. For a sequence of length \( n \), the maximum path length between any two tokens is \( O(n) \).
This creates two critical problems. First, long-range dependencies are hard to learn: gradients must flow through \( n \) steps during backpropagation, suffering from vanishing or exploding gradients. Even LSTMs and GRUs, designed to mitigate this, struggle with very long sequences. Second, sequential processing is inherently slow: each time step depends on the previous one, so RNNs cannot be parallelized along the sequence dimension.
The Transformer's self-attention mechanism eliminates both problems with an elegant solution: every token directly attends to every other token in a single operation. The maximum path length between any two positions drops from \( O(n) \) in RNNs to \( O(1) \) in Transformers.
This means a token at position 1 can directly interact with a token at position 1000 without any information needing to pass through intermediate positions. The attention weights are computed in parallel for all positions simultaneously, making the operation highly efficient on modern GPUs.
The computational trade-off is that self-attention has \( O(n^2) \) time and memory complexity (every pair of tokens interacts), compared to \( O(n) \) for RNNs. However, the constant factors are much smaller and the parallelism more than compensates for typical sequence lengths.
RNN: \( O(n) \) path length, \( O(n) \) sequential ops, \( O(n \cdot d^2) \) complexity per layer. Self-Attention: \( O(1) \) path length, \( O(1) \) sequential ops, \( O(n^2 \cdot d) \) complexity per layer. The constant path length is what makes Transformers so effective at capturing long-range dependencies.
All attention computations are independent and can run simultaneously on GPUs, enabling massive speedups during training compared to sequential RNNs
Any two tokens can interact directly in a single layer, eliminating the vanishing gradient problem for long-range dependencies
Attention weights provide a natural way to visualize which tokens the model focuses on, offering insights into model behavior
The heart of the Transformer -- computing query, key, and value projections to determine how tokens relate.
Self-attention can be understood through a database retrieval analogy. Imagine you have a question (query) and a database of entries, each with a label (key) and content (value). You compare your query against every key, find the most relevant matches, and retrieve a weighted combination of their values.
In self-attention, every token plays all three roles simultaneously. Given an input sequence of embeddings \( X \in \mathbb{R}^{n \times d} \), we project them into three separate spaces using learned weight matrices:

\[ Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V \]
Where \( W^Q, W^K \in \mathbb{R}^{d \times d_k} \) and \( W^V \in \mathbb{R}^{d \times d_v} \). The query represents "what am I looking for?", the key represents "what do I contain?", and the value represents "what information do I provide?"
The attention output is computed by taking the dot product of queries with keys, scaling, applying softmax, and multiplying by values:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
The scaling factor \( \sqrt{d_k} \) is crucial. Without it, for large \( d_k \), the dot products \( QK^T \) grow in magnitude, pushing the softmax into regions with extremely small gradients. Specifically, if the entries of \( Q \) and \( K \) have zero mean and unit variance, their dot product has variance \( d_k \). Dividing by \( \sqrt{d_k} \) normalizes this back to unit variance:

\[ \text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k \quad \Rightarrow \quad \text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 \]
The softmax converts raw scores into a probability distribution: each row of the attention matrix sums to 1, representing how much each token attends to every other token (including itself).
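The variance argument behind the scaling factor is easy to check numerically. This sketch (not from the original text) samples random unit-variance vectors and compares the variance of raw versus scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 100_000

# Queries and keys with zero mean and unit variance per entry
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

raw = (q * k).sum(axis=1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)     # scaled as in the attention formula

print(f"Var(q.k)            ~ {raw.var():.1f}  (theory: d_k = {d_k})")
print(f"Var(q.k / sqrt d_k) ~ {scaled.var():.2f} (theory: 1)")
```

Without the scaling, softmax inputs with standard deviation \( \sqrt{64} = 8 \) would saturate the distribution toward one-hot, killing gradients.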
Type a sentence and visualize the self-attention weights. Each cell shows how strongly token \( i \) (row) attends to token \( j \) (column). The diagonal is typically brightest because tokens attend strongly to themselves.
Enter a sentence below and click "Compute Attention" to see the attention heatmap. Darker blue indicates lower weight; brighter orange indicates higher weight. Arrows show top connections.
Running multiple attention operations in parallel to capture different types of relationships simultaneously.
A single attention head can only focus on one type of relationship at a time. For example, it might learn to attend to syntactic dependencies (subject-verb agreement) or semantic relationships (coreference), but it is difficult for one set of \( Q, K, V \) projections to capture all types of relationships simultaneously.
Multi-head attention solves this by running \( h \) attention operations in parallel, each with its own learned projection matrices. Each head can specialize in a different type of relationship:

\[ \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V), \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \]
Each head operates on a reduced dimensionality: \( d_k = d_v = d_{\text{model}} / h \). For the original Transformer with \( d_{\text{model}} = 512 \) and \( h = 8 \), each head works with \( d_k = 64 \) dimensions. The total computation is roughly the same as single-head attention with full dimensionality.
Research has shown that different attention heads learn to specialize in different linguistic phenomena. In BERT, for example, some heads attend primarily to the previous or next token, some track syntactic relations such as verbs attending to their objects, and others focus heavily on delimiter tokens like [SEP].
The final output projection \( W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} \) combines these diverse perspectives into a single representation that benefits from all of them.
Visualize how multiple attention heads capture different patterns. Toggle between viewing individual heads and the merged (averaged) result.
Adjust the number of heads. Each small heatmap represents one head's attention pattern. Click "Show Merged" to see the averaged attention across all heads.
How Transformers understand word order without any recurrence -- injecting position information through sinusoidal functions.
Self-attention is permutation-equivariant: if you shuffle the input tokens, the attention mechanism will produce the same outputs (just shuffled). This means the Transformer has no inherent notion of token order. The sentence "dog bites man" and "man bites dog" would produce identical representations without positional information.
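This permutation equivariance is easy to verify empirically. The sketch below (random weights, a minimal single-head attention with no positional encoding) shuffles the input rows and checks that the output rows are shuffled identically:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention (no positional encoding)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = [rng.standard_normal((d, d)) for _ in range(3)]

perm = rng.permutation(n)
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs just shuffles the outputs: no notion of order
print(np.allclose(out[perm], out_perm))  # True
```

Adding a position-dependent term to each row of `X` breaks this symmetry, which is exactly what positional encodings do.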
To solve this, the original Transformer adds positional encodings to the input embeddings. These encodings inject information about each token's position in the sequence. The authors chose sinusoidal functions of different frequencies:

\[ PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]
Where \( pos \) is the position in the sequence and \( i \) is the dimension index. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, forming a geometric progression from \( 2\pi \) to \( 10000 \cdot 2\pi \).
The sinusoidal encoding has several elegant properties:
The final input to the Transformer is the sum of the token embedding and the positional encoding:

\[ \text{input}_i = \text{embedding}(x_i) + PE_i \]
BERT and GPT use learned positional embeddings instead of sinusoidal ones. A trainable embedding matrix \( P \in \mathbb{R}^{L_{\max} \times d} \) is used, where \( L_{\max} \) is the maximum sequence length. Both approaches perform similarly, but learned embeddings cannot extrapolate beyond \( L_{\max} \).
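The sinusoidal encoding is a direct transcription of the formulas above; here is a short NumPy sketch (the function name is mine, and `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional encoding matrix."""
    assert d_model % 2 == 0, "assumes an even model dimension"
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)    # (50, 64)
print(pe[0, :4])   # position 0: [sin 0, cos 0, sin 0, cos 0] = [0, 1, 0, 1]
```

Every value lies in \( [-1, 1] \), so the encoding can be added to embeddings without rescaling.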
Visualize the sinusoidal positional encoding matrix. X-axis represents embedding dimensions, Y-axis represents sequence positions. Blue = -1, white = 0, orange = +1. Notice how low dimensions oscillate rapidly while high dimensions change slowly.
Adjust the model dimension and sequence length to explore how the encoding patterns change. Even dimensions use sine; odd dimensions use cosine.
The full encoder-decoder stack with layer normalization, residual connections, and feed-forward networks.
Each encoder layer consists of two sub-layers, each wrapped with a residual connection and layer normalization:
The input passes through multi-head self-attention. Every token attends to every other token in the sequence. The output has the same shape as the input: \( \mathbb{R}^{n \times d_{\text{model}}} \).
The attention output is added to the input (residual connection) and then layer-normalized: \( \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \). This stabilizes training and allows gradients to flow directly through skip connections.
A two-layer MLP applied independently to each position: \( \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 \). The inner dimension \( d_{ff} \) is typically 4x the model dimension (e.g., 2048 for \( d_{\text{model}} = 512 \)).
Another residual connection and layer normalization: \( \text{LayerNorm}(x + \text{FFN}(x)) \). The output feeds into the next encoder layer.
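The four steps above can be sketched end-to-end in NumPy. This is a minimal, untrained single layer: single-head attention stands in for multi-head, and all weights are random assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, p):
    # 1. Self-attention (single head here for brevity)
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = (e / e.sum(axis=-1, keepdims=True)) @ V
    # 2. Add & norm (residual connection around attention)
    x = layer_norm(x + attn)
    # 3. Position-wise feed-forward: ReLU(x W1 + b1) W2 + b2
    ffn = np.maximum(0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    # 4. Add & norm (residual connection around FFN)
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d, d_ff = 8, 16, 64   # d_ff = 4 * d_model, as in the original design
params = {
    "W_q": rng.standard_normal((d, d)) * 0.1,
    "W_k": rng.standard_normal((d, d)) * 0.1,
    "W_v": rng.standard_normal((d, d)) * 0.1,
    "W1": rng.standard_normal((d, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.standard_normal((d_ff, d)) * 0.1, "b2": np.zeros(d),
}
x = rng.standard_normal((n, d))
out = encoder_layer(x, params)
print(out.shape)  # same shape as the input: (8, 16)
```

Because input and output shapes match, layers of this form can be stacked arbitrarily deep.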
The decoder is similar to the encoder but with an additional sub-layer for cross-attention and a crucial modification to self-attention:
Explore the full Transformer architecture layer by layer. Use the slider to highlight individual layers and see the data flow through the encoder-decoder stack.
Adjust the layer depth slider to highlight specific layers. The diagram shows embeddings at the bottom, encoder stack on the left, and decoder stack on the right.
How bidirectional pre-training with masked language modeling revolutionized natural language understanding.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google in 2018, uses only the encoder stack of the Transformer. Its key innovation is bidirectional pre-training: unlike GPT which reads left-to-right, BERT can attend to both left and right context simultaneously.
To enable bidirectional training without the model trivially predicting the target token (since it can see it in the input), BERT uses Masked Language Modeling: 15% of input tokens are randomly selected, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model must predict the original token:

\[ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(x_i \mid x_{\backslash \mathcal{M}}) \]
Where \( \mathcal{M} \) is the set of masked positions and \( x_{\backslash \mathcal{M}} \) represents the corrupted input.
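The 80/10/10 corruption procedure can be sketched as follows. The token IDs and the mask-token ID here are illustrative; real implementations also avoid masking special tokens like [CLS] and [SEP].

```python
import numpy as np

def mlm_corrupt(token_ids, mask_token_id, vocab_size, rng, mask_prob=0.15):
    """Apply BERT-style masked-language-model corruption.
    Returns (corrupted ids, boolean array of positions to predict)."""
    ids = token_ids.copy()
    selected = rng.random(len(ids)) < mask_prob   # ~15% of positions
    roll = rng.random(len(ids))
    for i in np.where(selected)[0]:
        if roll[i] < 0.8:                         # 80%: replace with [MASK]
            ids[i] = mask_token_id
        elif roll[i] < 0.9:                       # 10%: replace with random token
            ids[i] = rng.integers(vocab_size)
        # remaining 10%: leave the token unchanged
    return ids, selected

rng = np.random.default_rng(0)
tokens = rng.integers(1000, size=20)
corrupted, targets = mlm_corrupt(tokens, mask_token_id=103,
                                 vocab_size=1000, rng=rng)
print(targets.sum(), "positions to predict")
```

Note that the loss is computed only at the selected positions, even for the 10% left unchanged.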
BERT's second pre-training objective is Next Sentence Prediction. Given two sentences A and B, the model predicts whether B actually follows A in the original text (label: IsNext) or is a random sentence (label: NotNext). This helps the model learn inter-sentence relationships.
The input format is: [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's final hidden state is used for the NSP classification. While NSP was part of the original BERT, later work (RoBERTa) showed that removing NSP and training with longer sequences often improves performance.
BERT's power comes from its pre-train then fine-tune paradigm. The same pre-trained model can be adapted to dozens of different tasks by adding a simple task-specific head:
Add a linear layer on top of [CLS] for sentiment analysis, spam detection, or topic classification
Add a per-token classifier for Named Entity Recognition (NER) or Part-of-Speech tagging
Predict start and end positions in a passage that answer a question (extractive QA)
RoBERTa removes NSP, trains longer with more data. ALBERT shares parameters across layers to reduce model size. DistilBERT is a 6-layer distilled version that retains 97% of BERT's performance at 60% of the size. DeBERTa uses disentangled attention for improved performance.
Autoregressive language modeling and the remarkable power of scaling Transformers.
The GPT (Generative Pre-trained Transformer) family uses only the decoder stack with causal masking. Each token can only attend to previous tokens and itself, never to future tokens. This makes GPT a left-to-right language model that predicts the next token given all previous tokens:

\[ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{n} \log p(x_t \mid x_{<t}) \]
The training objective is simple: maximize the likelihood of the next token at every position. This next-token prediction objective, despite its simplicity, turns out to be extraordinarily powerful when combined with enough data and model capacity.
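Causal masking is implemented by adding \( -\infty \) to the attention scores at all future positions before the softmax, so those weights become exactly zero. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions (upper triangle) and softmax row-wise."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))
w = causal_attention_weights(scores)
print(np.round(w, 2))
# Row t has nonzero weight only on positions <= t:
print(np.allclose(np.triu(w, k=1), 0))  # True
```

The first row attends only to itself; each later row sees one more position, which is what makes next-token prediction well-defined during parallel training.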
One of the most important discoveries in modern AI is that Transformer performance follows predictable scaling laws. Kaplan et al. (2020) showed that the cross-entropy loss \( L \) decreases as a power law with model size \( N \), dataset size \( D \), and compute budget \( C \):

\[ L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C} \]

with fitted exponents of roughly \( \alpha_N \approx 0.076 \), \( \alpha_D \approx 0.095 \), and \( \alpha_C \approx 0.05 \).
This means that making models bigger, training on more data, and using more compute all reliably improve performance. GPT-3 demonstrated this dramatically with 175 billion parameters, showing emergent abilities like few-shot learning that smaller models lacked entirely.
The chart below shows how perplexity (lower is better) decreases as GPT model size increases. Note the log-log scale revealing the power-law relationship.
Demonstrated that unsupervised pre-training on a large corpus followed by supervised fine-tuning outperformed task-specific architectures. Used 12 Transformer layers.
Showed that scaling up enables zero-shot task transfer. A single language model could perform translation, summarization, and QA without any fine-tuning.
Demonstrated few-shot learning: by providing a few examples in the prompt, GPT-3 could perform new tasks without any gradient updates. Introduced "in-context learning."
Multimodal (text + images), dramatically improved reasoning, and achieved human-level performance on many professional benchmarks including the bar exam.
The key architectural and training hyperparameters that define a Transformer model.
The dimensionality of all token representations throughout the model. All sub-layers (attention, FFN) produce outputs of this dimension.
The number of parallel attention operations. Must divide \( d_{\text{model}} \) evenly, giving each head a dimension of \( d_k = d_{\text{model}} / h \).
The number of stacked encoder or decoder blocks. Deeper models can learn more complex representations but are harder to train.
The hidden size of the position-wise feed-forward network. Typically set to \( 4 \times d_{\text{model}} \) (e.g., 2048 for \( d_{\text{model}} = 512 \)).
Transformers are notoriously sensitive to learning rate. The original paper introduced a warmup schedule that linearly increases the learning rate for the first \( T_{\text{warmup}} \) steps, then decays proportionally to the inverse square root of the step number:

\[ lr(\text{step}) = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\ \text{step} \cdot T_{\text{warmup}}^{-1.5}\right) \]
Modern practice typically uses cosine decay with warmup: linear warmup for 1-5% of training, then cosine annealing to zero. AdamW is the standard optimizer with \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), and weight decay of 0.01-0.1.
At initialization, the model's attention patterns are essentially random. A large learning rate at this stage causes unstable updates to the attention weights, which can permanently damage training. The warmup period allows attention patterns to stabilize before applying larger learning rates.
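The original warmup-then-decay schedule is a one-liner to implement (a direct transcription of the inverse-square-root rule, using the base Transformer's \( d_{\text{model}} = 512 \) and 4000 warmup steps):

```python
import numpy as np

def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate from 'Attention Is All You Need':
    linear warmup, then inverse-square-root decay. step is 1-indexed."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

steps = np.arange(1, 20001)
lrs = np.array([noam_lr(s) for s in steps])
peak_step = int(steps[lrs.argmax()])
print(f"peak lr {lrs.max():.2e} at step {peak_step}")  # peak at step = warmup
```

The two branches of the `min` intersect exactly at `step == warmup`, which is where the schedule peaks before decaying.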
From natural language processing to computer vision and beyond -- Transformers have become the universal architecture.
Machine translation, text generation, summarization, question answering, sentiment analysis. Transformers dominate every NLP benchmark.
Images are split into patches (e.g., 16x16), flattened, and treated as tokens. ViT matches or exceeds CNN performance on ImageNet when trained with sufficient data.
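Splitting an image into flattened patch tokens, as ViT does, is a pure reshape. A sketch with 16x16 patches on a dummy image (the learned linear projection that follows is omitted):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into (num_patches, patch*patch*C) token rows."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * C)  # one row per patch

img = np.random.rand(224, 224, 3)              # ImageNet-sized dummy image
patches = patchify(img)
print(patches.shape)  # (196, 768): 14x14 patches, each a 768-dim token
```

From here on, the 196 patch rows are treated exactly like word embeddings: add positional encodings and feed them to a standard Transformer encoder.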
Whisper (OpenAI) uses a Transformer encoder-decoder for speech recognition. Music generation models like MusicLM also use Transformer backbones.
GPT-4V, Gemini, and CLIP process both text and images. Transformers serve as the unified backbone for combining multiple modalities.
The chart below compares popular Transformer models across standard NLP benchmarks. BERT, RoBERTa, GPT-2, and T5 are evaluated on GLUE, SQuAD, MNLI, and SST-2 tasks.
O(1) path length means any two tokens can directly interact, solving the fundamental limitation of RNNs.
All positions are processed simultaneously, enabling efficient training on modern GPU/TPU hardware.
Pre-trained Transformers can be fine-tuned on small datasets, democratizing access to powerful NLP capabilities.
Performance improves predictably with scale. The same architecture works from millions to trillions of parameters.
Self-attention requires O(n^2) memory for sequence length n, limiting context windows without specialized techniques like FlashAttention.
Training large Transformers requires enormous GPU clusters. GPT-3 training cost an estimated $4.6M in compute alone.
Position must be explicitly injected. Different positional encoding schemes (sinusoidal, learned, RoPE, ALiBi) have different trade-offs.
Transformers generally need large datasets to outperform simpler architectures. Small-data regimes may favor CNNs or RNNs.
From self-attention in NumPy to using pre-trained Transformers with HuggingFace.
A minimal implementation of scaled dot-product self-attention using only NumPy.
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    Scaled dot-product self-attention.
    X: (seq_len, d_model) input embeddings
    W_q, W_k: (d_model, d_k) projection matrices
    W_v: (d_model, d_v) projection matrix
    Returns: (seq_len, d_v) attention output
    """
    Q = X @ W_q  # (seq_len, d_k)
    K = X @ W_k  # (seq_len, d_k)
    V = X @ W_v  # (seq_len, d_v)

    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)  # (seq_len, seq_len)
    output = weights @ V                # (seq_len, d_v)
    return output, weights

# Example usage
np.random.seed(42)
seq_len, d_model, d_k, d_v = 5, 16, 8, 8

X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_v) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attn weights:\n{np.round(attn_weights, 3)}")
import numpy as np

def multi_head_attention(X, num_heads, d_model):
    """
    Multi-head self-attention from scratch.
    X: (seq_len, d_model)
    """
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    # Initialize projections for each head
    np.random.seed(0)
    heads_output = []
    all_weights = []
    for h in range(num_heads):
        W_q = np.random.randn(d_model, d_k) * 0.1
        W_k = np.random.randn(d_model, d_k) * 0.1
        W_v = np.random.randn(d_model, d_k) * 0.1

        Q = X @ W_q
        K = X @ W_k
        V = X @ W_v

        scores = (Q @ K.T) / np.sqrt(d_k)
        # Softmax
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)

        head_out = weights @ V
        heads_output.append(head_out)
        all_weights.append(weights)

    # Concatenate all heads
    concat = np.concatenate(heads_output, axis=-1)  # (seq_len, d_model)

    # Output projection
    W_o = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_o
    return output, all_weights

# Example
seq_len, d_model, num_heads = 6, 16, 4
X = np.random.randn(seq_len, d_model)
out, weights = multi_head_attention(X, num_heads, d_model)
print(f"Output shape: {out.shape}")
for i, w in enumerate(weights):
    print(f"Head {i+1} attn shape: {w.shape}")
from transformers import pipeline

# Zero-shot sentiment analysis with a pre-trained model
classifier = pipeline("sentiment-analysis")

texts = [
    "Transformers have completely revolutionized NLP!",
    "The quadratic complexity is a major limitation.",
    "BERT is still great for classification tasks.",
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"  Label: {result['label']}, Score: {result['score']:.4f}\n")
from transformers import pipeline

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")

prompt = "The Transformer architecture revolutionized AI because"
outputs = generator(
    prompt,
    max_length=100,
    num_return_sequences=2,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
)

for i, output in enumerate(outputs):
    print(f"--- Generation {i+1} ---")
    print(output["generated_text"])
    print()
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

# Load dataset and tokenizer
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(
        batch["text"], padding="max_length",
        truncation=True, max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Load pre-trained BERT with classification head
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Training arguments
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    acc = (preds == eval_pred.label_ids).mean()
    return {"accuracy": acc}

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].select(range(5000)),
    eval_dataset=tokenized["test"].select(range(1000)),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
Use a small learning rate (2e-5 to 5e-5) for fine-tuning pre-trained models. Always use warmup (5-10% of total steps). Monitor validation loss to detect overfitting early. For small datasets, consider freezing lower layers and only fine-tuning the top few layers.
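For the layer-freezing tip, parameters can be frozen by name against the `transformers` BERT module layout. This sketch uses a small randomly initialized BERT config so it runs offline; with `bert-base-uncased` (12 layers) you would unfreeze `encoder.layer.10` and `encoder.layer.11` instead.

```python
from transformers import BertConfig, BertForSequenceClassification

# Small random-weight BERT so this sketch needs no download;
# the freezing pattern is identical for pre-trained checkpoints.
config = BertConfig(hidden_size=64, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=128)
model = BertForSequenceClassification(config)

# Freeze everything, then unfreeze the top two encoder layers and the head
for param in model.parameters():
    param.requires_grad = False
for name, param in model.named_parameters():
    if any(key in name for key in ("encoder.layer.2", "encoder.layer.3",
                                   "classifier", "pooler")):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```

Freezing the embeddings and lower layers both regularizes small-data fine-tuning and cuts memory, since no gradients are stored for frozen parameters.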