
Convolutional Neural Network Cheat Sheet

Your quick reference for CNNs -- from convolution operations and pooling layers to famous architectures and transfer learning strategies.

Key Formulas

Output Size (1D):
$$O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1$$
Where:
$W$ = input size, $K$ = kernel size, $P$ = padding, $S$ = stride
Parameters per Conv Layer:
$$\text{Params} = (K^2 \cdot C_{\text{in}} + 1) \cdot C_{\text{out}}$$
Receptive Field (layer $l$):
$$r_l = r_{l-1} + (K_l - 1) \cdot \prod_{i=1}^{l-1} S_i$$
FLOPs per Conv Layer:
$$\text{FLOPs} = 2 \cdot K^2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot O_H \cdot O_W$$
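These formulas are easy to sanity-check in code. Below is a minimal sketch (pure Python; the helper names are illustrative, not from any library):

```python
def conv_output_size(w, k, p=0, s=1):
    """Output size: O = floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

def conv_params(k, c_in, c_out):
    """Params = (K^2 * C_in + 1) * C_out; the +1 is the bias per filter."""
    return (k * k * c_in + 1) * c_out

def conv_flops(k, c_in, c_out, o_h, o_w):
    """FLOPs = 2 * K^2 * C_in * C_out * O_H * O_W (multiply + add)."""
    return 2 * k * k * c_in * c_out * o_h * o_w

def receptive_field(kernels, strides):
    """r_l = r_{l-1} + (K_l - 1) * prod of strides of earlier layers."""
    r, jump = 1, 1
    for k, s in zip(kernels, strides):
        r += (k - 1) * jump
        jump *= s
    return r

# 224x224 input, 3x3 conv, same padding, stride 1 -> output stays 224
print(conv_output_size(224, 3, p=1, s=1))  # 224
# 3x3 conv, 64 -> 128 channels
print(conv_params(3, 64, 128))             # 73856
# Two stacked 3x3 convs see a 5x5 receptive field
print(receptive_field([3, 3], [1, 1]))     # 5
```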

Convolution Operation

2D Convolution:
$$(f * g)(i,j) = \sum_{m}\sum_{n} f(m,n) \cdot g(i-m, j-n)$$
Stride:
Step size of the kernel. Stride $S=2$ halves the spatial dimensions. Replaces pooling in some modern architectures (e.g., all-convolutional nets).
Padding Types:
Valid: $P=0$, output shrinks. Same: $P = \lfloor K/2 \rfloor$, output = input size. Full: $P = K-1$, output grows.
1x1 Convolutions:
$K=1$: acts as a pointwise linear transformation across channels. Used for channel dimension reduction (Inception), bottlenecks (ResNet), and adding non-linearity.

Each filter slides across the input computing element-wise multiplication and summation (dot product). Multiple filters produce multiple output feature maps, each detecting a different pattern.
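The sliding-window operation can be sketched with a naive NumPy loop. Note that deep learning frameworks actually implement cross-correlation (no kernel flip), which is what this sketch does; the function name and test values are illustrative:

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Naive 2D 'convolution' (cross-correlation, as in DL frameworks):
    slide the kernel over x, taking a dot product at each position."""
    k = kernel.shape[0]
    o_h = (x.shape[0] - k) // stride + 1
    o_w = (x.shape[1] - k) // stride + 1
    out = np.zeros((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            patch = x[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1., -1.], [1., -1.]])  # simple vertical-edge-style kernel
print(conv2d(x, edge).shape)  # (3, 3): (4 - 2)/1 + 1 = 3
```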

Pooling Layers

Max Pooling:
$$y_{i,j} = \max_{(m,n) \in \mathcal{R}_{i,j}} x_{m,n}$$
Average Pooling:
$$y_{i,j} = \frac{1}{K^2}\sum_{(m,n) \in \mathcal{R}_{i,j}} x_{m,n}$$
Global Average Pooling:
$$y_c = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{c,i,j}$$
Output Dimension:
$$O = \left\lfloor \frac{W - K}{S} \right\rfloor + 1 \quad \text{(pooling typically uses } P=0\text{)}$$

Max pooling retains the strongest activations and provides slight translation invariance. Global average pooling replaces fully connected layers at the end of modern CNNs, reducing parameters and overfitting. Pooling has zero learnable parameters.
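Both pooling variants reduce to a few lines of NumPy (a sketch; helper names are made up for illustration):

```python
import numpy as np

def max_pool(x, k=2, s=2):
    """Max pooling: keep the strongest activation in each k x k window."""
    o_h = (x.shape[0] - k) // s + 1
    o_w = (x.shape[1] - k) // s + 1
    out = np.empty((o_h, o_w))
    for i in range(o_h):
        for j in range(o_w):
            out[i, j] = x[i*s:i*s+k, j*s:j*s+k].max()
    return out

def global_avg_pool(x):
    """Collapse a (C, H, W) feature map to one mean value per channel."""
    return x.mean(axis=(1, 2))  # shape (C,)

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 3., 2.],
              [2., 4., 0., 1.]])
print(max_pool(x))  # [[4. 5.] [4. 3.]] -- one max per 2x2 window
```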

Famous Architectures

LeNet-5 (1998):
Pioneer CNN for digit recognition. 2 conv + 3 FC layers, ~60K params. Introduced the conv-pool-conv-pool-FC pattern.
AlexNet (2012):
Won ImageNet by a large margin. 5 conv + 3 FC, ~60M params. Introduced ReLU, dropout, data augmentation, and GPU training to CNNs.
VGGNet (2014):
Showed depth matters: 16--19 layers using only 3x3 filters. Two 3x3 convs have the same receptive field as one 5x5 but fewer parameters.
GoogLeNet/Inception (2014):
Inception modules with parallel 1x1, 3x3, 5x5 filters + pooling. 22 layers but only ~5M params via 1x1 bottlenecks.
ResNet (2015):
Skip connections enable 152+ layers: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. Solved vanishing gradients. Backbone for most modern vision tasks.
EfficientNet (2019):
Compound scaling of depth, width, and resolution. EfficientNet-B7 achieves SOTA accuracy with 8.4x fewer params than previous best.

Transfer Learning

Feature Extraction (Freeze):
Freeze all pretrained conv layers. Replace and train only the final classifier head. Best when target dataset is small and similar to the source domain.
Fine-Tuning (Unfreeze Top):
Freeze early layers, unfreeze and retrain the top conv blocks + classifier. Use a low learning rate ($\sim 10^{-4}$ to $10^{-5}$) to avoid destroying pretrained features.
Domain Adaptation:
When target domain differs significantly (e.g., medical images vs. ImageNet). Progressively unfreeze layers from top to bottom, using discriminative learning rates per layer group.
Strategy Decision:
Small data + similar domain $\rightarrow$ freeze. Small data + different domain $\rightarrow$ freeze early, fine-tune late. Large data + different domain $\rightarrow$ fine-tune all or train from scratch.

Early layers learn universal features (edges, textures). Later layers learn task-specific features. This hierarchy is why transfer learning works -- universal features transfer across domains. Models pretrained on ImageNet (1.2M images, 1000 classes) are the standard starting point.
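The strategy decision above can be encoded as a small lookup. This is a sketch: `choose_strategy` and its labels are invented for illustration, and the large-data/similar-domain branch (not covered in the table above) uses a common default:

```python
def choose_strategy(small_data, similar_domain):
    """Pick a transfer learning strategy from dataset size and domain match."""
    if small_data and similar_domain:
        return "freeze all conv layers, train classifier head"
    if small_data and not similar_domain:
        return "freeze early layers, fine-tune top blocks"
    if not small_data and similar_domain:
        # not in the decision table above; a common default
        return "fine-tune the whole network at a low learning rate"
    return "fine-tune all layers or train from scratch"

print(choose_strategy(small_data=True, similar_domain=True))
```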

Hyperparameters

Filter Size ($K$):
Standard: 3x3 (most common), 5x5, 7x7 (first layer only). Smaller filters are preferred -- stack two 3x3 filters instead of one 5x5 for the same receptive field with fewer parameters.
Number of Filters ($C_{\text{out}}$):
Typically doubles after each pooling: 32 → 64 → 128 → 256. More filters = more feature maps = more capacity but more parameters.
Padding ($P$):
Use Same padding ($P = \lfloor K/2 \rfloor$) to preserve spatial dimensions. Valid padding ($P=0$) shrinks the feature map at each layer.
Stride ($S$):
$S=1$ preserves resolution. $S=2$ halves spatial dimensions (used as alternative to pooling). Larger strides reduce computation but lose spatial detail.
Learning Rate:
Start with $\sim 10^{-3}$ (Adam) or $\sim 10^{-1}$ (SGD with momentum). Use schedulers: cosine annealing, step decay, or warmup + decay.
Batch Size:
Typical: 32--128. Larger batches need larger LR (linear scaling rule) and are limited by GPU memory. Smaller batches provide a mild regularization effect.
Dropout:
Applied after FC layers (rate: 0.25--0.5). Spatial dropout for conv layers. Batch normalization often reduces the need for dropout.

Modern best practice: use 3x3 filters throughout, batch normalization after each conv, ReLU activation, and Adam optimizer. Start with a proven architecture (ResNet) rather than designing from scratch.
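Two of the numeric rules above (the linear scaling rule for batch size and cosine annealing for the learning rate) in code, as a minimal sketch with illustrative function names:

```python
import math

def linear_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the LR proportionally with batch size."""
    return base_lr * new_batch / base_batch

def cosine_annealing(step, total_steps, lr_max, lr_min=0.0):
    """Cosine schedule: decay from lr_max to lr_min over total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (
        1 + math.cos(math.pi * step / total_steps))

print(linear_scaled_lr(1e-3, 32, 128))  # 0.004: 4x batch -> 4x LR
print(cosine_annealing(0, 100, 1e-3))   # 0.001: starts at lr_max
```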

Pros vs Cons

Pros:

  • Automatic feature extraction -- no manual feature engineering required
  • Translation invariance via shared weights and pooling -- detects patterns regardless of position
  • Parameter sharing -- same filter applied across the entire input, drastically reducing parameters vs. fully connected
  • Hierarchical feature learning -- edges → textures → parts → objects, building increasingly abstract representations
  • Transfer learning -- pretrained models generalize across tasks and domains
  • State-of-the-art on image, video, and spatial data tasks

Cons:

  • Requires large labeled datasets -- data-hungry models that underperform on small datasets without transfer learning
  • Computationally expensive -- training deep CNNs requires GPUs/TPUs and significant time
  • Not suitable for non-spatial data -- tabular data is better handled by tree-based methods or MLPs
  • Many hyperparameters to tune -- architecture, LR, batch size, augmentation strategy
  • Black-box model -- difficult to interpret which features drive specific predictions

Interview Quick-Fire

Q: What is a Convolutional Neural Network?

A: A deep learning architecture designed for spatial data (images, video, audio spectrograms). It uses convolutional layers with learnable filters that slide across the input to produce feature maps. By sharing weights spatially, CNNs achieve translation equivariance and drastically reduce parameters compared to fully connected networks.

Q: Why use ReLU instead of sigmoid in CNNs?

A: ReLU ($\max(0, x)$) avoids the vanishing gradient problem that plagues sigmoid/tanh in deep networks. It is computationally cheap (simple thresholding), promotes sparsity, and enables faster convergence. Variants like Leaky ReLU and GELU address the "dying ReLU" problem where neurons get stuck at zero.

Q: What is the difference between stride and pooling for downsampling?

A: Both reduce spatial dimensions. Stride-based downsampling uses convolutions with $S > 1$ and has learnable parameters, so the network decides what to keep. Pooling (e.g., max pool 2x2) is a fixed operation with no learnable parameters. Modern architectures increasingly prefer strided convolutions over pooling for this flexibility.

Q: How does batch normalization help CNN training?

A: Batch normalization normalizes activations to zero mean and unit variance within each mini-batch, then applies learnable scale and shift. It stabilizes training, allows higher learning rates, acts as mild regularization, and reduces sensitivity to weight initialization. Applied after convolution and before activation in most architectures.

Q: What are skip connections and why do they matter?

A: Skip (residual) connections add the input directly to the output of a block: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. They create identity shortcuts that allow gradients to flow directly through the network, enabling training of 100+ layer networks. Without them, very deep networks suffer from degradation -- higher training error than shallower networks.

Q: How do you calculate the number of parameters in a conv layer?

A: For a conv layer with kernel size $K \times K$, $C_{\text{in}}$ input channels, and $C_{\text{out}}$ output channels: Params $= (K^2 \cdot C_{\text{in}} + 1) \cdot C_{\text{out}}$. The "+1" accounts for the bias term per filter. Example: a 3x3 conv with 64 input and 128 output channels has $(9 \times 64 + 1) \times 128 = 73{,}856$ parameters.

Q: What is data augmentation and why is it critical for CNNs?

A: Data augmentation artificially expands the training set by applying random transformations: flips, rotations, crops, color jitter, cutout, and mixup. It reduces overfitting by exposing the model to more variations, effectively regularizing the network. Critical for CNNs because they are data-hungry and augmentation can double or triple effective dataset size at zero labeling cost.

Q: When would you NOT use a CNN?

A: Avoid CNNs for: (1) tabular data -- tree-based models (XGBoost, Random Forest) consistently outperform CNNs on structured data, (2) small datasets without pretrained models available, (3) tasks requiring global context from the start (transformers may be better), (4) sequential data without spatial structure (use RNNs/transformers). CNNs assume local spatial correlations exist in the input.
