Convolutional Neural Network
15 interview questions from basic to advanced
A Convolutional Neural Network (CNN) is a deep learning architecture specifically designed for processing data with spatial structure, such as images, video, and audio spectrograms. Instead of connecting every input pixel to every neuron (as in fully connected networks), CNNs use convolutional layers with small learnable filters that slide across the input to produce feature maps.
CNNs exploit three key properties of images: (1) Local connectivity -- nearby pixels are more related than distant ones, (2) Translation equivariance -- the same pattern should be detected regardless of position, (3) Hierarchical composition -- complex patterns are built from simpler ones (edges combine into textures, textures into parts, parts into objects). These inductive biases make CNNs far more parameter-efficient and effective than fully connected networks for spatial data.
The convolution operation slides a small filter (kernel) across the input, computing an element-wise multiplication and summation (dot product) at each position. For a 3x3 filter on a single-channel input, the filter has 9 weights. At each position, these 9 weights are multiplied with the corresponding 9 input values, summed, and a bias is added to produce one output value.
For multi-channel inputs (e.g., an RGB image with 3 channels), each filter has depth equal to the number of input channels -- a 3x3 filter on RGB input is actually 3x3x3 = 27 weights. The dot product spans all channels simultaneously. Multiple filters are applied to produce multiple output feature maps, each detecting a different pattern. The output size depends on the input size, kernel size, padding, and stride.
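The sliding dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration (stride 1, no padding, toy all-ones values), not an efficient implementation:

```python
# Minimal sketch of a single-filter convolution on a multi-channel input
# (stride 1, no padding). Shapes and values are illustrative only.

def conv2d_single_filter(x, w, b):
    """x: input as [C][H][W] nested lists, w: one filter [C][K][K], b: bias."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(w[0])
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            s = b
            # The dot product spans all channels simultaneously.
            for c in range(C):
                for di in range(K):
                    for dj in range(K):
                        s += x[c][i + di][j + dj] * w[c][di][dj]
            row.append(s)
        out.append(row)
    return out

# A 3x3 filter on a 3-channel 4x4 input -> 2x2 feature map.
x = [[[1] * 4 for _ in range(4)] for _ in range(3)]   # all-ones RGB-like input
w = [[[1] * 3 for _ in range(3)] for _ in range(3)]   # 3x3x3 = 27 weights
print(conv2d_single_filter(x, w, 0.0))  # each output value is 27.0
```

With all weights and inputs equal to 1, each output position is exactly the 27-term sum mentioned above.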
Pooling progressively reduces the spatial dimensions of feature maps, which decreases the number of parameters and computations in subsequent layers. Max pooling (the most common type) selects the maximum value within each pooling window, retaining the strongest activations. Average pooling computes the mean value instead.
Pooling provides several benefits: (1) Dimensionality reduction -- a 2x2 max pool with stride 2 halves the height and width, reducing data by 4x, (2) Translation invariance -- small shifts in the input produce the same pooled output, (3) Noise reduction -- by summarizing local regions, minor variations are suppressed. Modern architectures sometimes replace pooling with strided convolutions, which learn what to downsample rather than using a fixed operation. Global average pooling at the network's end replaces large fully connected layers.
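The 2x2/stride-2 max pool described above is easy to sketch on a single feature map (values are illustrative):

```python
# Sketch of 2x2 max pooling with stride 2 on one feature map (nested lists).
# Height and width are halved; only the strongest activation per window survives.

def max_pool_2x2(fmap):
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, W, 2)]
            for i in range(0, H, 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [2, 8]]
```

The 4x4 map becomes 2x2: a 4x reduction in data, as noted above.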
A fully connected (FC) layer connects every input neuron to every output neuron, with a unique weight for each connection. For a 224x224x3 image flattened to 150,528 inputs connected to 4,096 outputs, that is over 600 million parameters in a single layer. FC layers have no concept of spatial structure -- each input position gets its own independent weights, so nothing about locality or neighborhood relationships is built in.
A convolutional layer uses small filters (e.g., 3x3) that share weights across all spatial positions. The same filter is applied everywhere, so a 3x3 conv with 64 input and 128 output channels has only 73,856 parameters regardless of input resolution. Conv layers preserve spatial structure and exploit local patterns. In modern CNNs, FC layers are used only at the very end (or replaced by global average pooling), while conv layers do the heavy lifting of feature extraction throughout the network.
Transfer learning takes a CNN pretrained on a large dataset (typically ImageNet with 1.2 million images across 1,000 classes) and adapts it to a new task. Instead of training from scratch, you leverage the features already learned by the pretrained model. Early layers learn universal features (edges, textures, colors) that transfer across almost any visual task, while later layers learn task-specific features.
Two main strategies: (1) Feature extraction -- freeze all pretrained layers and only train a new classifier head. Best for small datasets similar to the source domain. (2) Fine-tuning -- unfreeze some or all pretrained layers and retrain with a small learning rate. Best when you have moderate data or the target domain differs from the source. Transfer learning is critical because training a deep CNN from scratch requires millions of labeled images and days of GPU time, while fine-tuning can achieve excellent results with hundreds of images in hours.
Stride (S) controls how many pixels the filter moves at each step. Stride 1 moves the filter one pixel at a time, preserving spatial resolution. Stride 2 skips every other position, halving the output dimensions. The output size formula is: O = floor((W - K + 2P) / S) + 1. Larger strides aggressively reduce spatial dimensions and can replace pooling layers.
Padding (P) adds zeros around the input border. Valid padding (P=0) causes the output to shrink by (K-1) pixels. Same padding (P = floor(K/2)) preserves the input dimensions when stride=1. Padding prevents information loss at the borders -- without it, edge pixels contribute to fewer output values than center pixels, creating a spatial bias. Most modern architectures use same padding with 3x3 filters (P=1) to maintain spatial resolution through convolution blocks, and use stride=2 or pooling for intentional downsampling.
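The output-size formula can be checked directly; the sizes below are illustrative examples:

```python
# Output-size formula from the text: O = floor((W - K + 2P) / S) + 1.

def conv_output_size(w, k, p, s):
    return (w - k + 2 * p) // s + 1

print(conv_output_size(224, 3, 1, 1))  # 224: "same" padding keeps resolution
print(conv_output_size(224, 3, 1, 2))  # 112: stride 2 halves it
print(conv_output_size(7, 3, 0, 1))    # 5: "valid" padding shrinks by K-1
```

The first two cases show the common modern pattern: 3x3 filters with P=1 for resolution-preserving blocks, stride 2 for intentional downsampling.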
A feature map is the 2D output produced by applying a single filter to the input. Each position in the feature map indicates how strongly a particular pattern is present at that spatial location. A conv layer with 64 filters produces 64 feature maps, each detecting a different pattern. As we go deeper into the network, feature maps become smaller spatially but increase in number (channels), encoding increasingly abstract information.
CNNs learn a hierarchy of features automatically: (1) Early layers (layers 1-2) detect low-level features like edges, corners, and color gradients, (2) Middle layers (layers 3-5) combine low-level features into textures, patterns, and simple shapes, (3) Deep layers (layers 6+) detect high-level semantic features like object parts, faces, or wheels. This hierarchy emerges naturally from training and is the reason transfer learning works -- early universal features transfer across tasks while later task-specific features need adaptation.
Max pooling selects the maximum value in each pooling window, making it ideal when you want to detect whether a feature is present regardless of its exact position. It is the dominant choice for intermediate layers in classification networks because it retains the strongest activations and provides sharp, discriminative features. It is more robust to small translations and noise in the feature maps.
Average pooling computes the mean value, which is useful when you want to retain information about the overall distribution of activations rather than just the peaks. Global average pooling (GAP) is widely used as the final layer before the classifier -- it averages each channel's entire feature map into a single value, replacing large fully connected layers and drastically reducing parameters. GAP also provides natural regularization and is standard in modern architectures (ResNet, Inception, EfficientNet). In general: max pool for intermediate layers, global average pool for the final spatial reduction.
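Global average pooling is simple enough to sketch directly: each channel's feature map collapses to one scalar, with zero learnable parameters (toy values below):

```python
# Sketch of global average pooling: each channel's entire feature map is
# collapsed to a single value, with no learnable parameters.

def global_avg_pool(fmaps):
    """fmaps: [C][H][W] nested lists -> one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in fmaps]

# Two 2x2 channels -> a length-2 vector fed straight to the classifier.
fmaps = [[[1, 3], [5, 7]],
         [[2, 2], [2, 2]]]
print(global_avg_pool(fmaps))  # [4.0, 2.0]
```

In a real network the output vector has one entry per channel (e.g., 2,048 for ResNet-50) and feeds the final linear classifier.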
Skip connections (residual connections) add the input of a block directly to its output: y = F(x) + x, where F(x) is the learned transformation. Instead of learning the full mapping from input to output, the network only needs to learn the residual -- the difference between the desired output and the input. If the optimal transformation is close to identity, the residual F(x) is close to zero, which is easier to learn than the full mapping.
Before ResNet, networks deeper than ~20 layers suffered from the degradation problem: training accuracy got worse with added layers, even though a deeper network should be at least as good as a shallower one (it could learn identity mappings for extra layers). Skip connections solve this by providing identity shortcut paths that allow gradients to flow directly through the network during backpropagation, preventing vanishing gradients. ResNet demonstrated that 152-layer networks could be trained effectively, and the principle has been adopted in virtually all modern architectures including DenseNet (dense connections), U-Net (long-range skips), and transformers.
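The residual computation y = F(x) + x can be illustrated numerically. Here `near_zero_f` is a stand-in for a learned transformation whose residual is close to zero:

```python
# Numeric sketch of a residual block: y = F(x) + x. When the learned
# transformation F is near zero, the block passes x through unchanged,
# which is why identity mappings are easy to represent.

def residual_block(x, f):
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, 2.0, 3.0]
near_zero_f = lambda v: [0.0 for _ in v]   # residual close to zero
print(residual_block(x, near_zero_f))      # [1.0, 2.0, 3.0] -- identity
```

Without the skip, the layer would have to learn the full identity mapping through its weights; with it, outputting zero suffices.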
Data augmentation artificially expands the training set by applying random transformations to images during training. Common augmentations include horizontal flips, random crops, rotations, color jitter, scaling, and perspective transforms. More advanced techniques include Cutout (random rectangular masking), Mixup (blending two images and their labels), CutMix (pasting patches between images), and RandAugment (automated augmentation policy search).
Augmentation is critical for CNNs because: (1) CNNs are data-hungry and easily overfit on small datasets -- augmentation effectively multiplies the dataset size at zero labeling cost, (2) It teaches the network invariances the architecture does not inherently have (rotation invariance, scale invariance), (3) It improves generalization by exposing the model to more diverse visual variations during training. For small datasets, aggressive augmentation combined with transfer learning is often the difference between a failed model and a successful one. Test-time augmentation (TTA) applies augmentations at inference and averages predictions for additional accuracy gains.
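As a concrete example of one technique from the list, here is a minimal sketch of Mixup on flattened toy "images"; in practice the coefficient `lam` is sampled from a Beta distribution per batch:

```python
# Sketch of Mixup: blend two images and their one-hot labels with the same
# mixing coefficient lam (illustrative fixed value; normally Beta-sampled).

def mixup(x1, y1, x2, y2, lam):
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

x_mix, y_mix = mixup([0.0, 1.0], [1, 0], [1.0, 0.0], [0, 1], lam=0.7)
print(x_mix)  # approximately [0.3, 0.7]
print(y_mix)  # approximately [0.7, 0.3]
```

Note the labels are blended with the same coefficient as the pixels, so the model is trained on soft targets.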
For a convolutional layer: each filter has K × K × C_in weights plus 1 bias, and there are C_out filters. Total: (K² · C_in + 1) · C_out. Example: a 3×3 conv with 256 input channels and 512 output channels has (9 × 256 + 1) × 512 = 1,180,160 parameters. Batch normalization adds 2 × C_out learnable parameters (scale and shift) plus 2 × C_out running statistics (not learnable). Pooling layers have zero parameters.
For a fully connected layer: inputs × outputs + outputs (bias). A single FC layer from a 7×7×512 feature map to 4,096 neurons has 7 × 7 × 512 × 4,096 + 4,096 = 102,764,544 parameters -- more than all conv layers combined in VGG-16. This is why modern architectures replace FC layers with global average pooling (GAP), which maps each 7×7 channel to a single value with zero parameters. In VGG-16, the three FC layers contain 123M of the total 138M parameters (89%). In ResNet-50, GAP reduces the final layer from millions to just 2,048 × 1,000 = 2M parameters.
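The counts above follow directly from the two formulas:

```python
# Verifying the parameter counts from the text.

def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out          # (K^2 * C_in + 1) * C_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out                # weights + biases

print(conv_params(3, 256, 512))                # 1,180,160
print(fc_params(7 * 7 * 512, 4096))            # 102,764,544
print(fc_params(2048, 1000))                   # 2,049,000 -- ResNet-50 head after GAP
```

The FC layer from the 7x7x512 map dwarfs the conv layer by two orders of magnitude, which is the motivation for GAP.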
AlexNet (2012) proved deep CNNs work: 8 layers, ReLU activation, dropout, GPU training, and data augmentation. VGGNet (2014) showed depth matters by stacking 16-19 layers of 3×3 filters, demonstrating that two stacked 3×3 convolutions cover the same receptive field as one 5×5 with fewer parameters. GoogLeNet/Inception (2014) introduced multi-scale processing with parallel filter branches and 1×1 convolutions for bottleneck dimension reduction, achieving 22 layers with only 5M parameters versus VGG's 138M.
ResNet (2015) was the biggest breakthrough: skip connections enabled training 152+ layers by solving vanishing gradients and the degradation problem. DenseNet (2017) extended this with dense connections where every layer receives feature maps from all preceding layers. MobileNet (2017) introduced depthwise separable convolutions for 8-9x computation reduction, enabling mobile deployment. EfficientNet (2019) unified all insights through compound scaling -- systematically scaling network depth, width, and input resolution together using a compound coefficient, achieving SOTA accuracy with 8.4x fewer parameters than previous best. The trend: deeper, more efficient architectures with better scaling strategies.
A standard convolution with K×K kernel, C_in input channels, and C_out output channels performs K² · C_in multiplications at each spatial position for each output channel, totaling K² · C_in · C_out · H · W operations. Depthwise separable convolution splits this into two steps: (1) Depthwise convolution applies one K×K filter per input channel independently (C_in filters, each K×K×1), costing K² · C_in · H · W. (2) Pointwise 1×1 convolution mixes channels (C_out filters, each 1×1×C_in), costing C_in · C_out · H · W.
The reduction ratio is: (K² · C_in + C_in · C_out) / (K² · C_in · C_out) = 1/C_out + 1/K². For K=3 and C_out=256, the reduction is approximately 1/9 + 1/256 ≈ 8-9x fewer operations. MobileNet v1 used this throughout, achieving AlexNet-level accuracy with 1/30th the parameters. MobileNet v2 added inverted residual blocks (expand with 1×1, depthwise 3×3, compress with 1×1) with skip connections. EfficientNet combines this with compound scaling for the best accuracy-efficiency tradeoff. The tradeoff: separable convolutions have slightly less representational capacity per layer but compensate through increased depth within the same compute budget.
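The operation counts and the 8-9x reduction can be checked numerically (the 32x32 feature-map size is an illustrative choice):

```python
# Operation counts from the formulas above. Example: K=3, C_in=C_out=256,
# on an illustrative 32x32 feature map.

def standard_conv_ops(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w

def separable_conv_ops(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w           # one KxK filter per channel
    pointwise = c_in * c_out * h * w           # 1x1 channel mixing
    return depthwise + pointwise

std = standard_conv_ops(3, 256, 256, 32, 32)
sep = separable_conv_ops(3, 256, 256, 32, 32)
print(round(std / sep, 2))  # 8.69, i.e. roughly 8-9x fewer operations
```

The ratio matches 1 / (1/C_out + 1/K²) = 1 / (1/256 + 1/9) ≈ 8.7.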
With only 50 labeled images, training from scratch is futile -- a CNN has millions of parameters and will memorize the data instantly. The strategy revolves around maximum transfer and minimal new parameters: (1) Start with a strong pretrained backbone (EfficientNet-B3 or ResNet-50 pretrained on ImageNet). (2) Freeze the entire backbone and use it as a fixed feature extractor. (3) Train only a lightweight classifier head (e.g., single linear layer or 1-2 FC layers with dropout). (4) Apply aggressive data augmentation: random crops, flips, rotations, color jitter, Cutout, Mixup, and potentially generative augmentation with diffusion models.
Advanced strategies for extreme few-shot: (5) Use metric learning approaches like Siamese networks or Prototypical Networks that learn to compare rather than classify -- they generalize better from few examples. (6) If the domain is very different from ImageNet (e.g., medical, satellite), consider a backbone pretrained on a closer domain or use self-supervised pretraining (SimCLR, DINO) on unlabeled data from the target domain. (7) Apply knowledge distillation from a larger model. (8) Use class-balanced sampling if classes are imbalanced. (9) Ensemble multiple few-shot models trained with different augmentation strategies. The key insight: with 50 images, the quality of pretrained features and augmentation strategy matters more than any architectural choice.
The receptive field is the region of the original input that influences a neuron's output. Larger receptive fields capture more context, which is critical for understanding large objects, global scene layout, and long-range dependencies. However, naively increasing the receptive field by using larger kernels is computationally expensive: a 7×7 kernel has 5.4x more parameters than a 3×3 kernel. Increasing depth also grows the receptive field but adds parameters, memory, and training difficulty at each layer.
Different architectures address this differently: (1) VGG's insight: stack small 3×3 filters. Two 3×3 layers give a 5×5 receptive field with 18 parameters (vs. 25 for a single 5×5), plus extra non-linearity. (2) Dilated (atrous) convolutions insert gaps in the kernel -- a 3×3 kernel with dilation rate 2 covers a 5×5 area using only 9 parameters. Stacking dilated convolutions with exponentially increasing rates grows the receptive field exponentially with depth. (3) Pooling and strided convolutions increase the effective receptive field by downsampling -- each pixel after a stride-2 layer represents a 2x larger input region. (4) Multi-scale processing (Inception) uses parallel branches with different kernel sizes. (5) Attention mechanisms (SENet, CBAM) and self-attention (Vision Transformers) provide global receptive field from the first layer but at O(n²) cost. The optimal strategy depends on the task: dense prediction tasks (segmentation) need large receptive fields at full resolution (favoring dilated convolutions), while classification can aggressively downsample.
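A small helper makes the receptive-field arithmetic concrete, using the standard recurrence for stacked layers; the (kernel, stride, dilation) stacks below are illustrative:

```python
# Receptive-field growth for a stack of conv layers, using the standard
# recurrence rf += (k_eff - 1) * jump, where a dilated kernel has effective
# size k + (k - 1) * (d - 1) and jump is the product of strides so far.

def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, input to output."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1, 1), (3, 1, 1)]))             # 5: two 3x3 = one 5x5
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # 15: dilated stack grows fast
```

The first call confirms VGG's insight (two 3x3 layers see a 5x5 region); the second shows how exponentially increasing dilation rates grow the receptive field without downsampling, which is why dilated stacks suit dense prediction.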