From sliding filters and feature maps to deep hierarchical learning — the architecture that transformed visual intelligence.
How learnable filters scan images pixel by pixel to detect spatial patterns — and why sharing weights across positions is so powerful.
"A Sobel filter finds edges. A Gabor filter detects textures. Before deep learning, engineers spent careers hand-crafting these filters for each vision task. A single convolutional layer learns hundreds of them automatically — from data."
Think of a convolutional filter like a rubber stamp sliding across a document. Each time you press the stamp, it measures how well the stamp pattern matches the ink below it. Wherever there's a strong match, the output (the feature map) lights up. The CNN learns which stamps are most useful for the task.
Convolution anatomy: a 3×3 filter slides over a 6×6 input (stride 1, no padding) producing a 4×4 feature map. Brighter cells = stronger activation.
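The sliding-stamp picture can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the 6×6 image (dark left half, bright right half) and the Sobel-style kernel are assumptions chosen for the example, and, like most deep learning frameworks, the code computes cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch underneath it.
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(k_h) for dj in range(k_w)))
        feature_map.append(row)
    return feature_map


# Hypothetical 6x6 input: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Sobel-style kernel that responds to vertical edges.
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

feature_map = conv2d_valid(image, sobel_x)
for row in feature_map:
    print(row)  # each of the 4 rows prints [0, 4, 4, 0]
```

The 4×4 output is non-zero only where the window straddles the edge, which is the "lights up where the stamp matches" behaviour described above.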
Given input size $H \times W$, kernel size $K$, padding $P$, and stride $S$:
$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$
The same formula applies for width $W_{\text{out}}$. With $P=\frac{K-1}{2}$ (odd $K$) and $S=1$, output equals input size ("same" padding).
Input: $64 \times 64$, $K=3$, $P=1$, $S=1$, so $H_{\text{out}} = \lfloor(64+2-3)/1\rfloor+1 = 64$.
Now try: $64 \times 64$, $K=3$, $P=0$, $S=2$
What output size does a $32 \times 32$ input produce with $K=5$, $P=2$, $S=1$?
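The output-size rule is easy to check numerically. The helper below is a small sketch (the function name `conv_output_size` is made up for this example); it applies the floor formula to one spatial dimension.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """floor((size + 2*padding - kernel) / stride) + 1 for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


same = conv_output_size(64, kernel=3, padding=1, stride=1)      # "same" padding
valid = conv_output_size(32, kernel=3, padding=0, stride=1)     # no padding
strided = conv_output_size(100, kernel=3, padding=1, stride=2)  # stride 2

print(same, valid, strided)  # 64 30 50
```

Calling it with your own configuration is a quick way to verify hand calculations before building a model.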
Real images have $C_{\text{in}}$ input channels (e.g., RGB = 3). Each filter has shape $K \times K \times C_{\text{in}}$, and the layer stacks $C_{\text{out}}$ filters:
$$\text{params} = K^2 \cdot C_{\text{in}} \cdot C_{\text{out}} + C_{\text{out}}$$
The extra $C_{\text{out}}$ term is from the bias (one per output channel). When $C_{\text{in}}=3$, $C_{\text{out}}=64$, $K=3$: params = 9×3×64 + 64 = 1,792.
Layer config: $K=3$, $C_{\text{in}}=3$ (RGB), $C_{\text{out}}=64$
If we use $K=5$ instead of $K=3$ (same $C_{\text{in}}=3$, $C_{\text{out}}=64$), how many more parameters does that add?
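A short sketch of the parameter count, assuming square kernels and one bias per output channel as in the text (the helper name `conv_param_count` is hypothetical). Note that the count does not depend on the image size.

```python
def conv_param_count(kernel, c_in, c_out, bias=True):
    """Learnable parameters in a conv layer with square K x K kernels."""
    weights = kernel * kernel * c_in * c_out
    return weights + (c_out if bias else 0)


first_layer = conv_param_count(kernel=3, c_in=3, c_out=64)
deeper_layer = conv_param_count(kernel=3, c_in=64, c_out=128)

print(first_layer)   # 1792
print(deeper_layer)  # 73856
```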
Forgetting to account for padding. With $K=3$, $P=0$, $S=1$: a 32×32 input becomes 30×30 (shrinks by 2). Stack 5 such layers and you lose 10 pixels — your 32×32 feature map becomes 22×22. Always choose $P = (K-1)/2$ for "same" padding when you want to preserve spatial size.
Before using the widget: if you increase stride from 1 to 2, what happens to the output size? And if you increase padding from 0 to 1 with $K=3$, what changes?
Hint: stride divides, padding adds twice (once each side).
A convolutional layer is a bank of learnable spatial filters — each filter produces one feature map by sliding across the input with shared weights, making CNNs dramatically more parameter-efficient than fully-connected layers.
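To put a number on "dramatically more parameter-efficient", the sketch below compares a conv layer with a fully-connected layer that maps the same input volume to the same output volume. The 32×32 RGB input and 64 output channels are assumptions chosen for illustration.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Dense weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out


H = W = 32                 # assumed input resolution
c_in, c_out, K = 3, 64, 3  # RGB in, 64 feature maps out, 3x3 kernels

conv = conv_params(K, c_in, c_out)
# An FC layer connecting every input value to every output value
# (output kept at 32x32, i.e. "same" padding).
fc = fc_params(c_in * H * W, c_out * H * W)

print(conv)        # 1792
print(fc)          # 201392128
print(fc // conv)  # 112384
```

Under these assumptions the fully-connected layer needs roughly 100,000 times more parameters to produce an output of the same shape.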
Google researchers developed a CNN that detects diabetic retinopathy from retinal photos. Its first convolutional layer learns dozens of edge-detection filters, not hand-coded but discovered from roughly 128,000 labeled fundus images. These filters are similar in spirit to the Gabor filters that vision engineers once designed manually, but adapted to the pathological features that matter for diagnosis.
Q1 Why does a CNN with weight sharing require far fewer parameters than a fully-connected layer of the same output size?
Q2 A 128×128 RGB image passes through a conv layer with $K=5$, $P=2$, $S=2$, $C_{\text{out}}=32$. What is the output shape?
How pooling builds translation invariance, and how receptive fields grow with depth — enabling deep layers to reason about large-scale image structure.
"After Layer 1 detects edges, how does a CNN combine them into circles, then eyes, then faces? The answer: pooling compresses spatial detail while receptive fields grow — letting deeper neurons see broader context."
Pooling is like reading a blurry photocopy of a document. You lose exact pixel positions (fine detail), but you can still read every word (semantic content). The CNN trades spatial precision for position invariance — useful because "the cat can be anywhere in the image."
CNN feature hierarchy: each layer combines features from the previous, building increasingly abstract and semantically rich representations.
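Given a feature map region of size $P_H \times P_W$ with activations $x_{ij}$:
$$\text{MaxPool} = \max_{i,j} x_{ij}, \qquad \text{AvgPool} = \frac{1}{P_H P_W} \sum_{i,j} x_{ij}$$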
Max pool preserves the strongest activation (is this pattern present?). Average pool softens responses. Max pool is standard in classification; global average pool replaces the large hidden FC layers in ResNet/MobileNet, leaving only a final classifier layer.
Input 4×4 patch:
What would Average Pool give for the top-left 2×2 quadrant {1,3,4,6}?
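Both pooling variants can be sketched with one helper that takes the reduction as an argument. The 4×4 patch below is a made-up example (not the one from the slide), and the code assumes a 2×2 window with stride 2 and even input dimensions.

```python
def pool2x2(patch, reduce_fn):
    """Apply `reduce_fn` to each non-overlapping 2x2 window (stride 2)."""
    pooled = []
    for i in range(0, len(patch), 2):
        row = []
        for j in range(0, len(patch[0]), 2):
            window = [patch[i][j], patch[i][j + 1],
                      patch[i + 1][j], patch[i + 1][j + 1]]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature-map patch.
patch = [[5, 1, 0, 2],
         [3, 7, 4, 4],
         [8, 2, 1, 9],
         [6, 0, 3, 3]]

max_pooled = pool2x2(patch, max)
avg_pooled = pool2x2(patch, mean)

print(max_pooled)  # [[7, 4], [8, 9]]
print(avg_pooled)  # [[4.0, 2.5], [4.0, 4.0]]
```

Max pooling keeps the peak of each quadrant, while average pooling blends it with weaker neighbours, which is why three of the four averaged outputs look identical here even though the underlying patterns differ.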
The receptive field (RF) is the input region that influences one output neuron. For $L$ stacked conv layers with kernel $K$ and stride $S=1$:
$$\text{RF}_L = 1 + L(K-1)$$
With $K=3$, each layer adds exactly 2 pixels to the RF. Pooling (stride $>1$) multiplies the growth rate for subsequent layers.
Config: 3 conv layers, $K=3$, $S=1$, no pooling, so $\text{RF}_3 = 1 + 3(3-1) = 7$.
How many K=3, S=1 conv layers do you need to achieve RF≥15?
Three stacked 3×3 conv layers cover the same RF as one 7×7 layer, but use 3×9 = 27 weights vs 49 — and introduce two additional non-linearities (ReLUs). This is why VGG, ResNet, and virtually all modern CNNs use small kernels stacked deep instead of large kernels.
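The comparison can be verified with the stride-1 formula $\text{RF} = 1 + L(K-1)$. This sketch counts weights for a single input-channel/output-channel pair and ignores biases, matching the 27 vs 49 figure; the helper names are made up for the example.

```python
def stacked_rf(num_layers, kernel):
    """Theoretical receptive field of stride-1 conv layers: 1 + L*(K-1)."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers, kernel):
    """Weights per (input channel, output channel) pair, biases ignored."""
    return num_layers * kernel * kernel


rf_small = stacked_rf(num_layers=3, kernel=3)
rf_large = stacked_rf(num_layers=1, kernel=7)
w_small = stacked_weights(num_layers=3, kernel=3)
w_large = stacked_weights(num_layers=1, kernel=7)
saving = 1 - w_small / w_large

print(rf_small, rf_large)  # 7 7
print(w_small, w_large)    # 27 49
print(f"{saving:.0%}")     # 45%
```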
2×2 Max Pooling (stride 2) selects the maximum from each quadrant, halving spatial dimensions while retaining peak activations. Receptive fields grow by 2 with each K=3 layer.
Confusing theoretical RF with effective RF. The formula $\text{RF} = 1 + L(K-1)$ gives the theoretical maximum. In practice, input pixels near the edge of the RF have much weaker influence than central ones, because far fewer paths through the network connect them to the output neuron. Effective RF is often 2–3× smaller than theoretical RF, which is why very deep networks are necessary to reason about whole-image context.
Before checking the widget: if you add a 2×2 max pool (stride 2) after every conv layer, how does this change the receptive field growth rate for subsequent layers?
Hint: stride acts as a multiplier on how quickly the RF expands in later layers.
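The hint can be made concrete with the standard receptive-field recurrence: each layer adds $(K-1)$ times the current "jump" (the product of all earlier strides), and then multiplies the jump by its own stride. The sketch below assumes layers are described as `(kernel, stride)` pairs; the layer stacks are illustrative examples, not the quiz configuration.

```python
def receptive_field(layers):
    """Theoretical RF for a list of (kernel, stride) layers, input to output."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # growth is scaled by earlier strides
        jump *= stride             # later layers step further in input pixels
    return rf


CONV3 = (3, 1)  # 3x3 conv, stride 1
POOL2 = (2, 2)  # 2x2 max pool, stride 2

plain = receptive_field([CONV3, CONV3, CONV3])
pooled = receptive_field([CONV3, POOL2, CONV3, POOL2, CONV3])

print(plain)   # 7
print(pooled)  # 18
```

With pooling between them, the three conv layers contribute 2, 4, and 8 pixels respectively instead of 2 each, which is the multiplier effect the hint refers to.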
Pooling trades spatial resolution for translation invariance while receptive fields grow multiplicatively with depth — together enabling deep layers to integrate global context and recognize objects regardless of their exact position in the image.
CheXNet (Stanford, 2017) detects 14 thoracic diseases from chest X-rays, with reported radiologist-level performance on pneumonia detection. By the final dense block of its DenseNet-121 backbone, each neuron has a receptive field covering nearly the entire 224×224 image. This global context is essential: detecting cardiomegaly (enlarged heart) requires comparing the cardiac silhouette to the entire lung field, which is impossible with early-layer local features alone.
Q1 Why is max pooling preferred over average pooling for feature detection tasks in classification networks?
Q2 What is the receptive field after stacking 4 conv layers with $K=3$, $S=1$, followed by 1 MaxPool (2×2, stride 2), followed by 2 more conv layers with $K=3$, $S=1$?
The 14-year journey that transformed handwritten digit recognition into an ImageNet-conquering deep architecture — and rewrote the rules of computer vision.
"In September 2012, a Toronto team's entry in the ImageNet Large Scale Visual Recognition Challenge achieved 15.3% top-5 error — shattering the previous best of 26.2%. The runner-up used hand-crafted features. Every subsequent winner used deep CNNs. The field changed overnight."
LeNet-5 is the Wright Brothers' Flyer: it flew, it proved the concept, but it could only carry one person a short distance. AlexNet is the first jet aircraft: same basic principles, but scaled up with better materials (ReLU), crash-proofing (Dropout), and industrial fuel (GPU compute + ImageNet). Both are revolutionary in their era.
LeCun's landmark CNN for digit recognition on MNIST/USPS:
Activation: Tanh/Sigmoid. No ReLU, no Dropout. Works for 32×32 images; fails to scale to ImageNet (224×224, 1000 classes).
Sigmoid gradient: $\frac{d\sigma}{dx} = \sigma(x)(1-\sigma(x))$. At $x=5$: $\sigma(5)\approx0.993$, gradient $\approx 0.007$ — nearly zero, gradient vanishes.
What is the gradient of ReLU at $x = -0.01$? What does this mean for a "dead ReLU" neuron?
AlexNet architecture: 5 convolutional layers (purple) followed by 3 fully-connected layers (teal). Two GPUs ran in parallel, each handling half the feature maps in early layers.
Using Sigmoid/Tanh activations in deep CNNs. These functions saturate: for inputs with $|x|$ above ≈3, the gradient approaches 0, and even at its peak the sigmoid gradient is only 0.25. Backpropagating through 8 such layers multiplies many small values, so effectively no gradient reaches the early layers. Always use ReLU (or Leaky ReLU/GELU) in deep networks unless you have a specific reason not to.
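A rough numerical sketch of why depth and saturation interact badly. It multiplies only the activation derivatives across 8 layers and ignores the weight matrices, which is a deliberate simplification for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


depth = 8

# Best case for sigmoid: every unit sits at x = 0, where the gradient peaks at 0.25.
best_case = sigmoid_grad(0.0) ** depth
# Saturated case: every unit sits at x = 5.
saturated = sigmoid_grad(5.0) ** depth
# ReLU on its active side (x > 0) has gradient exactly 1.
relu_active = 1.0 ** depth

print(f"{sigmoid_grad(5.0):.4f}")  # 0.0066
print(f"{best_case:.2e}")          # 1.53e-05
print(saturated < 1e-15)           # True
print(relu_active)                 # 1.0
```

Even in the sigmoid's best case the gradient signal shrinks by about five orders of magnitude over 8 layers, while active ReLU units pass it through unchanged.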
If you train AlexNet on ImageNet with Sigmoid instead of ReLU, and start from the same random initialization, what will most likely happen after 10 epochs?
Think about gradient flow through 8 layers of Sigmoid — what happens to the first-layer weight updates?
AlexNet's 2012 breakthrough was not a single idea but a combination: ReLU activations enabling depth, Dropout taming overfitting, GPU compute enabling scale, and ImageNet providing the training signal — proving that depth × data × compute unlock qualitatively new capabilities in visual recognition.
Modern PCB inspection systems use AlexNet-lineage CNNs to detect solder defects at sub-millimeter precision on production lines running at 30,000 units/hour. The system uses AlexNet's core innovations — ReLU speed, Dropout robustness — applied to high-resolution industrial cameras. A single GPU inference pass flags 14 defect categories in <10ms, replacing what previously required trained human inspectors examining every board under magnification.
Q1 What specific problem does Dropout solve, and why is it especially important for a 60M-parameter model like AlexNet?
Q2 Why does AlexNet use an 11×11 kernel in the first layer rather than 3×3 like modern architectures?
Watch how a neuron's receptive field in the input image grows as you add more convolutional layers. Drag the depth slider and observe which input pixels influence the highlighted output neuron.
From individual filter arithmetic to the architecture that changed the field — the core ideas of Convolutional Neural Networks.
A learnable filter bank that slides across input images. Weight sharing makes CNNs dramatically more parameter-efficient than FC layers. Output size: $\lfloor(H+2P-K)/S\rfloor+1$.
$H_{\text{out}} = \lfloor(H+2P-K)/S\rfloor + 1$. Use "same" padding $P=(K-1)/2$ to preserve spatial size. Stride 2 ≈ halves dimensions.
Max pool selects peak activations from local regions — building translation invariance. 2×2 pool (stride 2) halves spatial dims and cuts downstream compute by 4×.
RF grows by $(K-1)$ with each layer: $\text{RF}_L = 1 + L(K-1)$. Stacking small kernels is more efficient than large kernels: three K=3 layers give RF=7, same as K=7 but 45% fewer params.
14-year leap from 60K to 60M params. AlexNet's combination of ReLU + Dropout + GPU training + data augmentation on ImageNet validated deep CNNs as the dominant paradigm.
Modern CNNs & Transfer Learning: Skip connections (ResNet), efficient convolutions (MobileNet), and fine-tuning pre-trained models for custom tasks with minimal data.
Primary references and interactive tools for deepening your understanding of CNNs.
Covers convolutional layers, pooling, and the transition from LeNet to modern architectures. Excellent derivations of the output dimension formula and receptive field proofs.
→ Primary reference for this week
"ImageNet Classification with Deep Convolutional Neural Networks" (the AlexNet paper). Read Section 3 (Architecture) and Section 5 (Results) for direct context on the design choices.
→ arXiv:1404.5997 (revised)
Real-time visual exploration of a CNN on Tiny ImageNet. Drag the input, watch feature maps activate, and trace how max pooling compresses spatial information.
→ poloclub.github.io/cnn-explainer
Justin Johnson's lecture on CNNs covers the convolution operator, pooling mechanics, and architectural history from AlexNet to VGG, with excellent visual intuition.
→ youtube.com (CS231n 2017 L5)
Eight exercises spanning output dimension arithmetic, parameter counting, receptive field reasoning, PyTorch implementation, and architectural design, covering convolutional layers, pooling, and the LeNet-to-AlexNet evolution.
For each configuration below, compute the output feature map height $H_{\text{out}}$ using $H_{\text{out}} = \lfloor(H+2P-K)/S\rfloor+1$.
Substitute into the formula directly:
(a) $\lfloor(28+0-5)/1\rfloor+1 = 24$. (b) $\lfloor(64+2-3)/1\rfloor+1 = 64$ (same padding). (c) $\lfloor(128+0-3)/2\rfloor+1 = 63$. (d) $\lfloor(224+6-7)/2\rfloor+1 = 112$.
Note (d) is the exact first layer of ResNet-50!
Implement a single convolutional layer in PyTorch and visualize its feature maps on a real image.
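Load an image with PIL, convert to a float tensor of shape (1,1,H,W)
Apply nn.Conv2d(1, 8, kernel_size=3, padding=1) (random init)
Plot the 8 feature maps with matplotlib subplots
Use torchvision.transforms.ToTensor() for the image → tensor conversion; for a grayscale image it returns shape (1, H, W), so add the batch dimension with unsqueeze(0). The output of conv(x) has shape (1, 8, H, W); use out[0] (shape 8, H, W), call .detach() on it, and iterate over the first dimension for plotting. Use cmap='gray' in plt.imshow.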
Answer the following questions about a CNN's field of view.
[[2,8,4,1],[6,3,7,2],[0,5,9,3],[4,1,6,8]]
(1) RF = 1 + 5×2 = 11. (2) With 2×2 stride-2 pooling after layers 2 and 4, each pool doubles the RF growth of every later layer: layers 1–2 add 2 each (RF = 5), pool1 adds 1 (RF = 6), layers 3–4 add 4 each (RF = 14), pool2 adds 2 (RF = 16), and layer 5 adds 8, giving a theoretical RF of 24. (3) MaxPool 2×2: top-left={2,8,6,3}→8; top-right={4,1,7,2}→7; bot-left={0,5,4,1}→5; bot-right={9,3,6,8}→9. Output: [[8,7],[5,9]].
Build a CNN block and track how feature map sizes change at each layer.
Pass a batch of shape (4, 3, 64, 64) through the network
Count the parameters with sum(p.numel() for p in model.parameters())
After MaxPool(2) the spatial dims halve each time: 64→32→16. After GlobalAvgPool(1,1): shape is (4, 64, 1, 1). Use nn.AdaptiveAvgPool2d((1,1)). Expected total params: Conv1: 3×3×3×16+16=448; Conv2: 3×3×16×32+32=4,640; Conv3: 3×3×32×64+64=18,496. Total = 23,584.
For each AlexNet innovation, explain the problem it solved and why that problem was critical for training on ImageNet.
(1) Tanh/Sigmoid saturate → gradient ≈ 0 for large |x|; backprop fails in deep nets. ReLU gradient=1 for x>0, so gradients flow unchanged. (2) Overfitting: 60M params / 1.2M samples ≈ 50 params per sample; without regularization, the model memorizes training data. (3) Crops: random 224×224 crops from 256×256 images give 32×32 = 1,024 crop positions, × 2 for horizontal flips = a 2,048× larger effective training set → millions of distinct views. (4) FC6: 9216×4096+4096 = 37,752,832. LeNet FC: 400×120+120 = 48,120. AlexNet FC6 alone is 785× larger than LeNet's biggest FC.
Implement and train a compact AlexNet-inspired model on CIFAR-10.
Load torchvision.datasets.CIFAR10 with augmentation (RandomCrop + HorizontalFlip)
Implement MiniAlexNet from the slides (3 conv blocks + 2 FC layers with Dropout)
CIFAR-10 images are 32×32×3 with 10 classes. Use transforms.Compose([transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))]). Expected test accuracy: ~75-80% with Dropout, ~70% without (signs of overfitting). Set model.train() during training and model.eval() for validation to correctly enable/disable Dropout.
You are designing a CNN to classify fundus (retinal) photographs (512×512 pixels, 3 channels) into 5 disease categories. Design the architecture and justify every decision.
With K=3, S=1 and "same" padding: the first pair of conv layers adds 4 to the RF. After each pool (2×2, stride 2), subsequent layers see 2× larger input regions, so the growth per layer doubles. A typical design: Conv1,Conv2→Pool (RF=6, output 256×256); Conv3,Conv4→Pool (RF=16, output 128×128); Conv5,Conv6→Pool (RF=36, output 64×64); Conv7,Conv8→Pool (RF=76, output 32×32). Filters: 32→32→64→64→128→128→256→256. Using stride-2 conv instead of MaxPool is increasingly common because the downsampling is learnable and can preserve more information. BatchNorm is strongly recommended for training stability with deep medical imaging models.
Systematically remove each AlexNet innovation and measure the impact on CIFAR-10 validation accuracy after 10 epochs.
Use the same random seed (torch.manual_seed(42)) and same weight initialization for fair comparison. For Sigmoid variant, replace nn.ReLU() with nn.Sigmoid() throughout. Expected ordering (highest val acc): Baseline > No Augmentation > No Dropout > Sigmoid. The Sigmoid variant may fail to improve beyond ~50% due to vanishing gradients — this is the most dramatic effect. No Dropout will show higher train accuracy but lower val accuracy (classic overfitting signature).