Convolutional Neural Network Cheat Sheet
Your quick reference for CNNs -- from convolution operations and pooling layers to famous architectures and transfer learning strategies.
Each filter slides across the input, computing an element-wise multiplication and summation (a dot product) at each position. Multiple filters produce multiple output feature maps, each detecting a different pattern.
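The slide-multiply-sum operation can be sketched in a few lines of numpy (a minimal illustration with valid padding and stride 1; the `conv2d` helper and the edge-detector kernel are just for this example):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a KxK kernel over a 2D image (valid padding, stride 1),
    computing an element-wise product and sum at each position."""
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)
    return out

# A vertical-edge detector applied to an image with a sharp left/right edge
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])
fmap = conv2d(img, edge_kernel)   # strong response where the edge is
```

Each filter applied this way yields one feature map; a conv layer with 64 filters yields 64 maps.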
Max pooling retains the strongest activations and provides slight translation invariance. Global average pooling replaces fully connected layers at the end of modern CNNs, reducing parameters and overfitting. Pooling has zero learnable parameters.
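Both pooling variants are fixed, parameter-free reductions, which a short numpy sketch makes concrete (the helper names here are illustrative):

```python
import numpy as np

def max_pool2d(x, size=2):
    """2x2 max pooling with stride equal to the window: keep only the
    strongest activation in each non-overlapping window."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size).max(axis=(1, 3))

def global_avg_pool(fmaps):
    """Collapse each feature map to one number: (C, H, W) -> (C,).
    Used in place of large fully connected layers in modern CNNs."""
    return fmaps.mean(axis=(1, 2))

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 2., 2.],
              [3., 2., 1., 0.]])
pooled = max_pool2d(x)                    # (4,4) -> (2,2), no parameters
gap = global_avg_pool(np.ones((8, 4, 4)))  # 8 feature maps -> 8 values
```

Note that neither function holds any weights: shifting the input by a pixel or two often leaves the max-pooled output unchanged, which is the source of the slight translation invariance.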
Early layers learn universal features (edges, textures). Later layers learn task-specific features. This hierarchy is why transfer learning works -- universal features transfer across domains. Models pretrained on ImageNet (1.2M images, 1000 classes) are the standard starting point.
Same padding ($P = \lfloor K/2 \rfloor$) preserves spatial dimensions; valid padding ($P=0$) shrinks the feature map at each layer. Modern best practice: use 3x3 filters throughout, batch normalization after each conv, ReLU activation, and the Adam optimizer. Start from a proven architecture (ResNet) rather than designing from scratch.
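The effect of the two padding modes follows from the standard output-size formula $O = \lfloor (W - K + 2P)/S \rfloor + 1$, which is easy to check directly:

```python
def conv_output_size(W, K, P=0, S=1):
    """Spatial output size of a conv layer: floor((W - K + 2P) / S) + 1."""
    return (W - K + 2 * P) // S + 1

# 'same' padding with a 3x3 filter uses P = floor(3/2) = 1 and preserves
# the 32-pixel width; 'valid' (P = 0) shrinks it by K - 1 = 2 per layer.
same = conv_output_size(32, 3, P=1)    # stays 32
valid = conv_output_size(32, 3, P=0)   # shrinks to 30
```

With valid padding, ten stacked 3x3 convs would erode a 32-pixel input down to 12 pixels, which is why same padding is the default inside deep stacks.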
A: A deep learning architecture designed for spatial data (images, video, audio spectrograms). It uses convolutional layers with learnable filters that slide across the input to produce feature maps. By sharing weights spatially, CNNs achieve translation equivariance and drastically reduce parameters compared to fully connected networks.
A: ReLU ($\max(0, x)$) avoids the vanishing gradient problem that plagues sigmoid/tanh in deep networks. It is computationally cheap (simple thresholding), promotes sparsity, and enables faster convergence. Variants like Leaky ReLU and GELU address the "dying ReLU" problem where neurons get stuck at zero.
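The two activations differ only in how they treat negative inputs, as a minimal numpy sketch shows:

```python
import numpy as np

def relu(x):
    """Standard ReLU: max(0, x). Negative inputs are zeroed out."""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU keeps a small slope alpha for x < 0, so a neuron
    that goes negative can still pass gradient and recover."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
r = relu(x)         # negatives become exactly 0 ("dying ReLU" risk)
lr = leaky_relu(x)  # negatives are scaled by 0.01 instead
```

A neuron whose pre-activations are always negative outputs 0 under plain ReLU and receives zero gradient; the leaky variant's small negative slope is what keeps it trainable.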
A: Both reduce spatial dimensions. Stride-based downsampling uses convolutions with $S > 1$ and has learnable parameters, so the network decides what to keep. Pooling (e.g., max pool 2x2) is a fixed operation with no learnable parameters. Modern architectures increasingly prefer strided convolutions over pooling for this flexibility.
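A quick way to see the relationship: a stride-2 convolution halves the spatial size just like a 2x2 pool, but its kernel is a learnable set of weights. In the sketch below the kernel is fixed to an averaging filter only to make the equivalence visible (a hypothetical example, not a real trained layer):

```python
import numpy as np

def strided_conv2d(x, kernel, stride=2):
    """Learned downsampling: convolution evaluated only at every
    `stride`-th position, halving the spatial size when stride=2."""
    K = kernel.shape[0]
    H, W = x.shape
    oh = (H - K) // stride + 1
    ow = (W - K) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + K, c:c + K] * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((2, 2)) / 4.0        # this particular kernel averages, so the
out = strided_conv2d(x, k)       # result equals 2x2 average pooling -- but in
                                 # a real network the kernel weights are learned
```

A pooling layer is locked into one fixed reduction (max or mean); the strided conv could learn this averaging kernel, an edge-preserving one, or anything else gradient descent favors.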
A: Batch normalization normalizes activations to zero mean and unit variance within each mini-batch, then applies learnable scale and shift. It stabilizes training, allows higher learning rates, acts as mild regularization, and reduces sensitivity to weight initialization. Applied after convolution and before activation in most architectures.
A: Skip (residual) connections add the input directly to the output of a block: $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$. They create identity shortcuts that allow gradients to flow directly through the network, enabling training of 100+ layer networks. Without them, very deep networks suffer from degradation -- higher training error than shallower networks.
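The key property of $\mathbf{y} = F(\mathbf{x}) + \mathbf{x}$ is that the block degenerates gracefully to the identity, which a toy sketch demonstrates (here $F$ is a single linear map plus ReLU standing in for the usual conv-BN-ReLU stack):

```python
import numpy as np

def residual_block(x, weight):
    """y = F(x) + x: the input skips around the transformation F.
    F here is a toy linear map + ReLU; real blocks use conv/BN/ReLU."""
    fx = np.maximum(0, x @ weight)   # F(x)
    return fx + x                    # identity shortcut

x = np.array([1.0, -2.0])
W = np.zeros((2, 2))             # if F contributes nothing (weights ~ 0)...
y = residual_block(x, W)         # ...the block is exactly the identity: y == x
```

Because "do nothing" is trivially representable, adding more residual blocks can never make the network worse in principle, which is exactly the degradation problem plain deep stacks suffer from.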
A: For a conv layer with kernel size $K \times K$, $C_{\text{in}}$ input channels, and $C_{\text{out}}$ output channels: Params $= (K^2 \cdot C_{\text{in}} + 1) \cdot C_{\text{out}}$. The "+1" accounts for the bias term per filter. Example: a 3x3 conv with 64 input and 128 output channels has $(9 \times 64 + 1) \times 128 = 73{,}856$ parameters.
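The worked example from the formula can be checked in one line:

```python
def conv_params(K, C_in, C_out):
    """Parameters of a KxK conv layer: (K^2 * C_in + 1) * C_out.
    The +1 is the single bias term per output filter."""
    return (K * K * C_in + 1) * C_out

n = conv_params(3, 64, 128)   # (9 * 64 + 1) * 128 = 73,856
```

Note the count is independent of the input's spatial size; that is weight sharing at work, and it is why conv layers are so much cheaper than fully connected ones.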
A: Data augmentation artificially expands the training set by applying random transformations: flips, rotations, crops, color jitter, cutout, and mixup. It reduces overfitting by exposing the model to more variations, effectively regularizing the network. Critical for CNNs because they are data-hungry and augmentation can double or triple effective dataset size at zero labeling cost.
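Two of the cheapest augmentations, random horizontal flip and brightness jitter, fit in a few lines of numpy (an illustrative sketch; crops, cutout, and mixup follow the same pattern of random, label-preserving transforms):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Random horizontal flip plus brightness jitter on an image with
    pixel values in [0, 1]. Applied fresh at every training step, so the
    model rarely sees the exact same pixels twice."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                                  # horizontal flip
    img = np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)    # brightness jitter
    return img

img = rng.random((8, 8))
aug = augment(img)   # same shape and value range, different pixels
```

Because the transforms are sampled per step rather than precomputed, the "effective dataset size" grows with the number of epochs at zero labeling cost.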
A: Avoid CNNs for: (1) tabular data -- tree-based models (XGBoost, Random Forest) consistently outperform CNNs on structured data, (2) small datasets without pretrained models available, (3) tasks requiring global context from the start (transformers may be better), (4) sequential data without spatial structure (use RNNs/transformers). CNNs assume local spatial correlations exist in the input.