Computer Vision Foundations — Week 6

Convolutional Neural Networks

From sliding filters and feature maps to deep hierarchical learning — the architecture that transformed visual intelligence.

Convolution · Feature Maps · Pooling · Receptive Fields · LeNet · AlexNet · PyTorch

The Convolutional Layer

How learnable filters scan images pixel by pixel to detect spatial patterns — and why sharing weights across positions is so powerful.

After this section you will be able to
  • Explain how a 2D convolutional filter scans an input image using stride and padding
  • Calculate output feature map dimensions given input size, kernel size, stride, and padding
  • Implement multi-channel convolution in PyTorch and inspect the resulting feature maps

"A Sobel filter finds edges. A Gabor filter detects textures. Before deep learning, engineers spent careers hand-crafting these filters for each vision task. A single convolutional layer learns hundreds of them automatically — from data."

🧩
Why convolution over fully-connected layers? Convolution encodes two powerful priors about images: local connectivity (nearby pixels are related) and weight sharing (the same pattern should be detectable anywhere in the image). A 3×3 filter applied to a 224×224 image needs only 9 weights instead of 224×224=50,176 — and those 9 weights are reused at every spatial position.
🔍
Analogy Bridge

Think of a convolutional filter like a rubber stamp sliding across a document. Each time you press the stamp, it measures how well the stamp pattern matches the ink below it. Wherever there's a strong match, the output (the feature map) lights up. The CNN learns which stamps are most useful for the task.

[Figure: Input (6×6) → Kernel (3×3) of learned weights w₀₀ … w₂₂ plus bias b → σ(sum + b) → Feature Map (4×4). With H=6, P=0, K=3, S=1: H_out = ⌊(6−3)/1⌋+1 = 4. The layer produces C_out feature maps, one per filter.]

Convolution anatomy: a 3×3 filter slides over a 6×6 input (stride 1, no padding) producing a 4×4 feature map. Brighter cells = stronger activation.
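The sliding-stamp picture can be sketched in a few lines of plain Python. This is a minimal illustration rather than how PyTorch implements convolution: it assumes a square single-channel input, no padding, and omits the activation σ. The function name `conv2d_valid` and the hand-picked vertical-edge kernel are hypothetical choices for this example; a CNN would learn its own weights.

```python
def conv2d_valid(image, kernel, bias=0.0, stride=1):
    """Slide a square kernel over a square image (lists of lists), no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Dot product between the kernel and the patch underneath it.
            total = bias
            for di in range(k):
                for dj in range(k):
                    total += kernel[di][dj] * image[i * stride + di][j * stride + dj]
            row.append(total)
        out.append(row)
    return out


# 6×6 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge kernel (illustrative only).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)  # each row is [0.0, 3.0, 3.0, 0.0]
```

The 4×4 output is zero over the flat regions and responds only where the window straddles the dark-to-bright boundary, which is the "lights up on a match" behaviour described above.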

Input (Image Patch): raw pixel values from a local region of the input image
Filter (3×3 Kernel): learnable weights; one set per output channel
Output (Feature Map): one scalar per patch, measuring filter-pattern match strength
Stack (C_out Maps): multiple filters → multiple feature maps → richer representation
9 parameters: in one 3×3 filter, regardless of input spatial size
~5,575× fewer weights: 9 shared filter weights vs. 50,176 for a single fully-connected neuron on a 224×224 input
K²·C_in·C_out total weights: full conv layer parameter count (plus C_out biases)
Reuse: each filter weight is shared across all spatial positions
Problem

Output Dimension Formula

Given input size $H \times W$, kernel size $K$, padding $P$, and stride $S$:

$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$

The same formula applies to the width $W_{\text{out}}$. With odd $K$, $P=\frac{K-1}{2}$, and $S=1$, the output has the same size as the input ("same" padding).

📐 Worked Example — Compute output size

Input: $64 \times 64$, $K=3$, $P=1$, $S=1$

1
Substitute into formula: $H_{\text{out}} = \lfloor (64 + 2\cdot1 - 3) / 1 \rfloor + 1$
2
$= \lfloor 63 / 1 \rfloor + 1 = 63 + 1 = 64$
✓ Output: 64×64 — "same" padding preserved spatial size

Now try: $64 \times 64$, $K=3$, $P=0$, $S=2$

3
$H_{\text{out}} = \lfloor (64 + 0 - 3) / 2 \rfloor + 1 = \lfloor 61/2 \rfloor + 1 = 30 + 1 = 31$
✓ Output: 31×31. Stride 2 roughly halved the spatial resolution (64 → 31)
Quick Check

What output size does a $32 \times 32$ input produce with $K=5$, $P=2$, $S=1$?

H_out = ⌊(32 + 2×2 − 5)/1⌋ + 1 = ⌊31/1⌋ + 1 = 32 ✓ (same padding)
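The formula translates directly into a small helper, which is handy for checking hand calculations. This is a sketch; the function name `conv_output_size` is an assumption for this example, and Python's integer division `//` plays the role of the floor.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """H_out = floor((H + 2P - K) / S) + 1 for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# The two worked examples and the quick check from this section.
print(conv_output_size(64, kernel=3, padding=1, stride=1))  # 64
print(conv_output_size(64, kernel=3, padding=0, stride=2))  # 31
print(conv_output_size(32, kernel=5, padding=2, stride=1))  # 32
```

The same function applies to the width, since each spatial dimension is computed independently.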

Multi-Channel Convolution

Real images have $C_{\text{in}}$ input channels (e.g., RGB = 3). Each filter has shape $K \times K \times C_{\text{in}}$, and the layer stacks $C_{\text{out}}$ filters:

$$\text{Params} = K \times K \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}$$

The extra $C_{\text{out}}$ term is from the bias (one per output channel). When $C_{\text{in}}=3$, $C_{\text{out}}=64$, $K=3$: params = 9×3×64 + 64 = 1,792.

📐 Worked Example — Count parameters

Layer config: $K=3$, $C_{\text{in}}=3$ (RGB), $C_{\text{out}}=64$

1
Weight params: $3 \times 3 \times 3 \times 64 = 1,\!728$
2
Bias params: $C_{\text{out}} = 64$
Total = 1,728 + 64 = 1,792 — this is the first layer of VGG-16!
Quick Check

If we use $K=5$ instead of $K=3$ (same $C_{\text{in}}=3$, $C_{\text{out}}=64$), how many more parameters does that add?

K=5: 5×5×3×64 + 64 = 4,864. K=3: 1,792. Difference = 3,072 more — 2.7× as many for only slightly larger RF.
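The parameter formula can be checked the same way. A minimal sketch, assuming square kernels and one bias per output channel; `conv2d_params` is a hypothetical helper name.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """K * K * C_in * C_out weights, plus C_out biases if enabled."""
    weights = kernel * kernel * c_in * c_out
    return weights + (c_out if bias else 0)


k3 = conv2d_params(3, c_in=3, c_out=64)
k5 = conv2d_params(5, c_in=3, c_out=64)
print(k3)       # 1792
print(k5)       # 4864
print(k5 - k3)  # 3072
```

Note that the input's spatial size never appears: the count depends only on kernel size and channel counts.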
$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$
H (Input Height): spatial height of the input feature map
P (Padding): rows of zeros added on each side
K (Kernel Size): filter height/width (usually odd: 3, 5, 7)
S (Stride): step size of the sliding window
⌊·⌋ + 1 (Floor + 1): round the division down; +1 counts the first window position
⚠️
Common Mistake

Forgetting to account for padding. With $K=3$, $P=0$, $S=1$: a 32×32 input becomes 30×30 (shrinks by 2). Stack 5 such layers and you lose 10 pixels — your 32×32 feature map becomes 22×22. Always choose $P = (K-1)/2$ for "same" padding when you want to preserve spatial size.
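The shrinkage is easy to reproduce by applying the output-size formula in a loop. A small sketch, with `conv_output_size` as a hypothetical helper name:

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    return (size + 2 * padding - kernel) // stride + 1


def stack_sizes(size, num_layers, kernel, padding):
    """Spatial size after each of `num_layers` identical stride-1 convs."""
    sizes = [size]
    for _ in range(num_layers):
        size = conv_output_size(size, kernel, padding)
        sizes.append(size)
    return sizes


no_padding = stack_sizes(32, num_layers=5, kernel=3, padding=0)
same_padding = stack_sizes(32, num_layers=5, kernel=3, padding=1)
print(no_padding)    # [32, 30, 28, 26, 24, 22]
print(same_padding)  # [32, 32, 32, 32, 32, 32]
```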

Solution
Pause & Predict

Before using the widget: if you increase stride from 1 to 2, what happens to the output size? And if you increase padding from 0 to 1 with $K=3$, what changes?

Hint: stride divides, padding adds twice (once each side).

Try It: Output Dimension Calculator

Adjust the sliders to see how kernel size, padding, and stride affect the output feature map dimensions. Live calculation updates in real time.

Implementation
Python · PyTorch — Convolutional Layer Exploration
```python
import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride, padding)
conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)

# Count parameters: K²·C_in·C_out + C_out = 9·1·8 + 8 = 80
params = sum(p.numel() for p in conv.parameters())

# Input: batch=1, channels=1, 28×28
x = torch.randn(1, 1, 28, 28)
out = conv(x)
print(f"Input: {x.shape}")
print(f"Output: {out.shape}")
print(f"Params: {params}")

# Change to stride=2: spatial dims halve (28 → 14)
conv_s2 = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
out_s2 = conv_s2(x)
print(f"Stride-2 out: {out_s2.shape}")
```
stdout
```
Input: torch.Size([1, 1, 28, 28])
Output: torch.Size([1, 8, 28, 28])
Params: 80
Stride-2 out: torch.Size([1, 8, 14, 14])
```
Key Takeaway

A convolutional layer is a bank of learnable spatial filters — each filter produces one feature map by sliding across the input with shared weights, making CNNs dramatically more parameter-efficient than fully-connected layers.

📸
Real-World Application

Medical Imaging: Diabetic Retinopathy Screening

Google Research trained a CNN to detect diabetic retinopathy from retinal photos (Gulshan et al., JAMA 2016). Its early convolutional layers learn edge- and texture-like filters that are not hand-coded but discovered from roughly 128,000 labeled fundus images. These filters are similar in spirit to the Gabor filters that vision engineers once designed by hand, but adapted to the pathological features that matter for diagnosis.

Checkpoint The Convolutional Layer

Q1 Why does a CNN with weight sharing require far fewer parameters than a fully-connected layer of the same output size?

A 3×3 filter applied to any-sized input uses the same 9 weights regardless of input dimensions, reusing them at every spatial position. A fully-connected layer needs one weight per (input pixel × output neuron), which scales with $H \times W \times C_{\text{in}} \times C_{\text{out}}$ — orders of magnitude more for typical image sizes.
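To put numbers on this comparison, the sketch below counts parameters for a 3×3 conv layer (3 → 64 channels, "same" padding) and for a fully-connected layer that maps the same 224×224×3 input to the same 224×224×64 output volume. The layer sizes are illustrative assumptions, and the helper names are hypothetical.

```python
def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    # One weight per (input unit, output unit) pair, plus one bias per output.
    return n_in * n_out + n_out


h = w = 224
conv_count = conv_params(3, c_in=3, c_out=64)
fc_count = fc_params(h * w * 3, h * w * 64)

print(f"conv: {conv_count:,}")  # conv: 1,792
print(f"fc:   {fc_count:,}")
print(f"ratio: {fc_count / conv_count:.2e}")
```

The fully-connected count is on the order of 10¹¹, while the conv layer stays at 1,792 no matter how large the image is.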

Q2 A 128×128 RGB image passes through a conv layer with $K=5$, $P=2$, $S=2$, $C_{\text{out}}=32$. What is the output shape?

$H_{\text{out}} = \lfloor(128 + 2\cdot2 - 5)/2\rfloor + 1 = \lfloor127/2\rfloor + 1 = 63+1 = 64$. Output: (1, 32, 64, 64) as a PyTorch tensor.

Pooling & Receptive Fields

How pooling builds translation invariance, and how receptive fields grow with depth — enabling deep layers to reason about large-scale image structure.

After this section you will be able to
  • Distinguish max pooling from average pooling and explain when each is preferred in practice
  • Compute the receptive field size after N stacked convolutional layers using the RF formula
  • Describe the CNN feature hierarchy and explain what edge → texture → part → object corresponds to in each depth tier

"After Layer 1 detects edges, how does a CNN combine them into circles, then eyes, then faces? The answer: pooling compresses spatial detail while receptive fields grow — letting deeper neurons see broader context."

🎯
Why pooling? Translation invariance: if a feature moves 1–2 pixels, max pooling still fires. Dimensionality reduction: each 2×2 pool halves the spatial dimensions, cutting computation by 4× for subsequent layers. Why receptive fields? Five stacked K=3 convs alone give a neuron only an 11×11 view of the input (1 + 5×2); interleaving pooling doubles the growth rate of every later layer, so deep neurons end up seeing regions large enough to reason about whole objects, not just local texture.
📰
Analogy Bridge

Pooling is like reading a blurry photocopy of a document. You lose exact pixel positions (fine detail), but you can still read every word (semantic content). The CNN trades spatial precision for position invariance — useful because "the cat can be anywhere in the image."

L1: Edges & Orientations (e.g., edges)
L2: Textures & Corners (e.g., corners/curves)
L3: Parts & Components (e.g., facial parts)
L4: Objects & Scenes (e.g., full face)
FC: Class Prediction (e.g., Cat 97.3%)

CNN feature hierarchy: each layer combines features from the previous, building increasingly abstract and semantically rich representations.

2×2 pool window: standard max pool, halves each spatial dimension
+2 RF per layer: each K=3, S=1 conv adds 2 pixels to the receptive field
11×11 RF after 5 layers: stacking five K=3, S=1 convs without pooling (RF = 1 + 5×2)
75% compute saved: one 2×2 pool step cuts FLOPs for subsequent layers by 75%
Problem

Max & Average Pooling

Given a feature map region of size $P_H \times P_W$:

$$\text{MaxPool}: y = \max_{i,j} x_{ij}$$

$$\text{AvgPool}: y = \frac{1}{P_H P_W}\sum_{i,j} x_{ij}$$

Max pool preserves the strongest activation (is this pattern present?). Average pool softens responses. Max pool is standard inside classification networks; global average pooling replaces the large flattened FC head in ResNet/MobileNet, leaving a single linear classifier.

📐 Worked Example — 2×2 Max Pool

Input 4×4 patch:

$$\begin{bmatrix}1&3&2&1\\4&6&5&2\\1&2&8&3\\0&1&3&4\end{bmatrix} \xrightarrow{2\times2\;\text{MaxPool, S=2}} \begin{bmatrix}?&?\\?&?\end{bmatrix}$$
1
Top-left 2×2: {1,3,4,6} → max = 6
2
Top-right 2×2: {2,1,5,2} → max = 5
3
Bottom-left 2×2: {1,2,0,1} → max = 2
4
Bottom-right 2×2: {8,3,3,4} → max = 8
Output: [[6, 5], [2, 8]] — spatial size halved, strongest features preserved
Quick Check

What would Average Pool give for the top-left 2×2 quadrant {1,3,4,6}?

(1+3+4+6)/4 = 14/4 = 3.5 — softer than max=6; preserves scale but loses the "peak detection" property.
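Both pooling operations can be reproduced on the worked example with a short pure-Python sketch. The function name `pool2d` and its `mode` argument are assumptions for this illustration; it handles only square windows on a list-of-lists grid.

```python
def pool2d(grid, size=2, stride=2, mode="max"):
    """Apply max or average pooling with a square window."""
    out = []
    for i in range(0, len(grid) - size + 1, stride):
        row = []
        for j in range(0, len(grid[0]) - size + 1, stride):
            window = [grid[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


grid = [[1, 3, 2, 1],
        [4, 6, 5, 2],
        [1, 2, 8, 3],
        [0, 1, 3, 4]]

max_out = pool2d(grid, mode="max")
avg_out = pool2d(grid, mode="avg")
print(max_out)  # [[6, 5], [2, 8]]
print(avg_out)  # [[3.5, 2.5], [1.0, 4.5]]
```

Comparing the two outputs shows how averaging flattens the peak of 8 in the bottom-right window to 4.5, while max pooling keeps it.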

Receptive Field Growth

The receptive field (RF) is the input region that influences one output neuron. For $L$ stacked conv layers with kernel $K$ and stride $S=1$:

$$\text{RF}_L = 1 + L \times (K - 1)$$

With $K=3$, each layer adds exactly 2 pixels to the RF. Pooling (stride $>1$) multiplies the growth rate for subsequent layers.

📐 Worked Example — RF after 3 layers

Config: 3 conv layers, $K=3$, $S=1$, no pooling

1
After L=1: RF = 1 + 1×(3−1) = 3
2
After L=2: RF = 1 + 2×(3−1) = 5
3
After L=3: RF = 1 + 3×(3−1) = 7
RF=7 with K=3×3 — same RF as one K=7×7 layer, but only 3×9=27 vs 49 weights!
Quick Check

How many K=3, S=1 conv layers do you need to achieve RF≥15?

RF = 1 + L×2 ≥ 15 → L ≥ 7 layers. So 7 conv layers gives RF = 1 + 7×2 = 15.
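The stacked-conv formula and the quick-check question can be verified with a short sketch. It assumes identical stride-1 layers with no pooling; both function names are hypothetical.

```python
def stacked_rf(num_layers, kernel=3):
    """RF_L = 1 + L * (K - 1) for L stride-1 conv layers."""
    return 1 + num_layers * (kernel - 1)


def layers_needed(target_rf, kernel=3):
    """Smallest number of stride-1 conv layers whose RF reaches target_rf."""
    layers = 0
    while stacked_rf(layers, kernel) < target_rf:
        layers += 1
    return layers


print([stacked_rf(n) for n in (1, 2, 3)])  # [3, 5, 7]
print(layers_needed(15))                   # 7
```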
💡
Key Insight

Three stacked 3×3 conv layers cover the same RF as one 7×7 layer, but use 3×9 = 27 weights vs 49 — and introduce two additional non-linearities (ReLUs). This is why VGG, ResNet, and virtually all modern CNNs use small kernels stacked deep instead of large kernels.
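The 27-vs-49 comparison is per input/output channel pair. With C channels in and out, both sides scale by C², so the ratio is unchanged. A small sketch, assuming every layer keeps the channel count at C and ignoring biases (`conv_weights` is a hypothetical helper):

```python
def conv_weights(kernel, channels, num_layers=1):
    """Weights in `num_layers` K×K convs with `channels` in and out (no bias)."""
    return num_layers * kernel * kernel * channels * channels


c = 64
stacked = conv_weights(3, c, num_layers=3)  # three 3×3 layers
single = conv_weights(7, c)                 # one 7×7 layer
print(stacked)                     # 110592
print(single)                      # 200704
print(round(stacked / single, 3))  # 0.551
```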

[Figure: Feature Map (4×4) [[1, 3, 2, 1], [4, 6, 5, 2], [1, 2, 8, 3], [0, 1, 3, 4]] → MaxPool 2×2 → Output (2×2) [[6, 5], [2, 8]]. RF growth per layer (K=3, S=1): Layer 1: RF = 3; Layer 2: RF = 5; Layer 3: RF = 7; Layer 5: RF = 11; Layer 10: RF = 21.]

2×2 Max Pooling (stride 2) selects the maximum from each quadrant, halving spatial dimensions while retaining peak activations. Receptive fields grow by 2 with each K=3 layer.

⚠️
Common Mistake

Confusing theoretical RF with effective RF. The formula $\text{RF} = 1 + L(K-1)$ gives the theoretical maximum. In practice, input pixels near the edge of the RF have much weaker influence than central ones, because far fewer paths through the network connect them to the output neuron. Effective RF is often 2–3× smaller than theoretical RF, which is one reason very deep networks are needed to reason about whole-image context.
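One rough way to see why central pixels dominate is to count the paths from each input position to a single output neuron. The sketch below does this in 1D under strong simplifying assumptions: all weights equal, no non-linearities, stride 1. It is an illustration of the intuition, not a measurement of a trained network, and `path_counts` is a hypothetical helper name.

```python
def path_counts(num_layers, kernel=3):
    """Number of distinct paths from each input pixel to one output neuron (1D)."""
    counts = [1]
    for _ in range(num_layers):
        wider = [0] * (len(counts) + kernel - 1)
        for i, c in enumerate(counts):
            for k in range(kernel):
                wider[i + k] += c
        counts = wider
    return counts


counts = path_counts(5)
print(len(counts))  # 11, the theoretical RF width
print(counts)       # [1, 5, 15, 30, 45, 51, 45, 30, 15, 5, 1]
```

Under these assumptions the centre pixel reaches the output through 51 paths, while each edge pixel has exactly one.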

Solution
Pause & Predict

Before checking the widget: if you add a 2×2 max pool (stride 2) after every conv layer, how does this change the receptive field growth rate for subsequent layers?

Hint: stride acts as a multiplier on how quickly the RF expands in later layers.
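To check a prediction, the receptive field can be tracked layer by layer with the standard recurrence: each layer adds (kernel − 1) × jump, where the jump is the distance in input pixels between neighbouring outputs and is multiplied by each layer's stride. This is a sketch for one spatial dimension; the `(kernel, stride)` tuple encoding and the function name are assumptions of this example.

```python
def receptive_field(layers):
    """Return the RF size after each (kernel, stride) layer."""
    rf, jump = 1, 1  # jump: input-pixel spacing between adjacent outputs
    history = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        history.append(rf)
    return history


conv, pool = (3, 1), (2, 2)
convs_only = receptive_field([conv, conv, conv])
conv_pool = receptive_field([conv, pool, conv, pool, conv, pool])
print(convs_only)  # [3, 5, 7]
print(conv_pool)   # [3, 4, 8, 10, 18, 22]
```

With a pool after every conv, the third conv alone adds 8 pixels to the RF instead of 2.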

Try It: Pooling Output Visualizer

Select pooling type and adjust window size to see how the feature map is downsampled. The input is a 4×4 activation grid — watch how pooling preserves (or softens) peaks.

Implementation
Python · PyTorch — Pooling Layers & Feature Map Sizes
```python
import torch
import torch.nn as nn

# Build a small CNN block: Conv → ReLU → MaxPool
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves spatial dims
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 64, 64)
for i, layer in enumerate(block):
    x = layer(x)
    print(f"After layer {i}: {x.shape}")
```
stdout
```
After layer 0: torch.Size([1, 32, 64, 64])   # Conv (same padding)
After layer 1: torch.Size([1, 32, 64, 64])   # ReLU (no size change)
After layer 2: torch.Size([1, 32, 32, 32])   # MaxPool 2x2 → halved
After layer 3: torch.Size([1, 64, 32, 32])   # Conv (same padding)
After layer 4: torch.Size([1, 64, 32, 32])   # ReLU
After layer 5: torch.Size([1, 64, 16, 16])   # MaxPool 2x2 → halved
```
Key Takeaway

Pooling trades spatial resolution for translation invariance, and each pooling step multiplies the rate at which receptive fields grow in later layers. Together they let deep layers integrate global context and recognize objects regardless of their exact position in the image.

🫁
Real-World Application

Radiology AI: Chest X-Ray Pathology Detection

CheXNet (Stanford, 2017) classifies 14 thoracic pathologies from chest X-rays and was reported to match radiologists on pneumonia detection. By the final dense block of its DenseNet-121 backbone (which has four dense blocks), each neuron has a receptive field covering nearly the entire 224×224 image. This global context is essential: detecting cardiomegaly (enlarged heart) requires comparing the cardiac silhouette to the entire lung field, which is impossible with early-layer local features alone.

Checkpoint Pooling & Receptive Fields

Q1 Why is max pooling preferred over average pooling for feature detection tasks in classification networks?

Max pooling asks "is this feature present anywhere in the region?" — it fires if even one position has a strong activation. Average pooling dilutes strong activations with weak neighbors, reducing sensitivity to localized features. For classification (detecting "is there a cat?"), max pooling's binary presence detection is more useful.

Q2 What is the receptive field after stacking 4 conv layers with $K=3$, $S=1$, followed by 1 MaxPool (2×2, stride 2), followed by 2 more conv layers with $K=3$, $S=1$?

After 4 conv: RF = 1 + 4×2 = 9. The 2×2 MaxPool (stride 2) adds (2−1)×1 = 1 pixel, so RF = 10, and it doubles the spacing between neighbouring outputs to 2 input pixels. Each of the 2 remaining convs therefore adds (3−1)×2 = 4 pixels: RF = 10 + 2×4 = 18.

From LeNet to AlexNet

The 14-year journey that transformed handwritten digit recognition into an ImageNet-conquering deep architecture — and rewrote the rules of computer vision.

After this section you will be able to
  • Trace the architectural evolution from LeNet-5 (1998) to AlexNet (2012) and identify the structural differences
  • Explain AlexNet's five key innovations: ReLU activations, Dropout, GPU training, data augmentation, and LRN
  • Implement a compact AlexNet-inspired CNN classifier in PyTorch for image classification

"In September 2012, a Toronto team's entry in the ImageNet Large Scale Visual Recognition Challenge achieved 15.3% top-5 error, far ahead of the runner-up's 26.2%. The runner-up used hand-crafted features. Every subsequent winner used deep CNNs. The field changed overnight."

🚀
Why did 2012 matter? AlexNet proved the data × depth × compute trifecta. ImageNet (1.2M images, 1000 classes) provided the data scale that earlier datasets couldn't. GPUs (two GTX 580s) provided the compute. Deep CNNs provided the capacity. No single piece was sufficient — all three together unlocked a qualitative leap.
✈️
Analogy Bridge

LeNet-5 is the Wright Brothers' Flyer: it flew, it proved the concept, but it could only carry one person a short distance. AlexNet is the first jet aircraft: same basic principles, but scaled up with better materials (ReLU), crash-proofing (Dropout), and industrial fuel (GPU compute + ImageNet). Both are revolutionary in their era.

1998: LeNet-5 (LeCun)
2012: AlexNet (Krizhevsky)
2014: VGG-16 (Simonyan)
2015+: ResNet (He), covered in Week 7
1998, LeNet-5: 60K parameters, MNIST, Tanh activations
60M AlexNet params: 1000× more than LeNet; enabled by GPU training
10.9-point error drop: top-5 error vs. the 2012 runner-up (26.2% → 15.3%)
5 key innovations: ReLU · Dropout · GPU · Augmentation · LRN
Problem

LeNet-5 Architecture (1998)

LeCun's landmark CNN for digit recognition on MNIST/USPS:

  • Input: 32×32 grayscale
  • C1: Conv 5×5, 6 filters → 28×28×6
  • S2: Avg Pool 2×2 → 14×14×6
  • C3: Conv 5×5, 16 filters → 10×10×16
  • S4: Avg Pool 2×2 → 5×5×16
  • F5: FC 120, F6: FC 84, Out: 10 classes

Activation: Tanh/Sigmoid. No ReLU, no Dropout. Works for 32×32 images; fails to scale to ImageNet (224×224, 1000 classes).

📐 Worked Example — LeNet parameter count
1
C1: $5\times5\times1\times6 + 6 = 156$ params
2
C3: $5\times5\times6\times16 + 16 = 2,\!416$ params
3
FC layers: $400\times120 + 120 + 120\times84 + 84 + 84\times10 + 10 = 59,\!134$
Total = 156 + 2,416 + 59,134 = 61,706 ≈ 60K parameters: fit in 1990s RAM; saturates on complex datasets
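The layer sizes and the overall parameter count can be reproduced with plain arithmetic. This sketch follows the simplified version used in the worked example, where C3 connects to all six input maps (the original paper used a sparser connection scheme, so its C3 is smaller). The helper name `conv_out` is an assumption.

```python
def conv_out(size, kernel, padding=0, stride=1):
    return (size + 2 * padding - kernel) // stride + 1


# Spatial size through the network (square feature maps).
size = 32
size = conv_out(size, 5)            # C1: 32 → 28
size = conv_out(size, 2, stride=2)  # S2: 28 → 14
size = conv_out(size, 5)            # C3: 14 → 10
size = conv_out(size, 2, stride=2)  # S4: 10 → 5
flat = size * size * 16             # flattened input to the FC layers

conv_params = (5 * 5 * 1 * 6 + 6) + (5 * 5 * 6 * 16 + 16)
fc_params = (flat * 120 + 120) + (120 * 84 + 84) + (84 * 10 + 10)
total = conv_params + fc_params

print(flat)   # 400
print(total)  # 61706
```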

AlexNet's 5 Key Innovations (2012)

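  1. ReLU Activation: $f(x) = \max(0,x)$ instead of Tanh/Sigmoid. In the paper's CIFAR-10 experiment a ReLU network reached 25% training error about 6× faster than its tanh counterpart; ReLU does not saturate for $x > 0$, which eases the vanishing-gradient problem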
  2. Dropout: Randomly zero 50% of FC neurons during training → strong regularization for 60M params
  3. GPU Training: Two GTX 580 (3GB each) run in parallel — enabled 60M params in reasonable time
  4. Data Augmentation: Random 224×224 crops and horizontal flips (2048× more training views), plus PCA-based color jitter
  5. LRN (Local Response Normalization): Lateral inhibition between nearby feature maps (later replaced by BatchNorm)
📐 Worked Example — ReLU vs Sigmoid gradient

Sigmoid gradient: $\frac{d\sigma}{dx} = \sigma(x)(1-\sigma(x))$. At $x=5$: $\sigma(5)\approx0.993$, gradient $\approx 0.007$ — nearly zero, gradient vanishes.

1
ReLU gradient: $\frac{d}{dx}\max(0,x) = \begin{cases}1 & x > 0 \\ 0 & x \le 0\end{cases}$
2
At $x=5$: gradient = 1. At $x=-2$: gradient = 0 (the unit is inactive for this input, but it never saturates for positive inputs)
ReLU's constant gradient for x>0 keeps updates flowing through 8+ layers — Sigmoid can't
Quick Check

What is the gradient of ReLU at $x = -0.01$? What does this mean for a "dead ReLU" neuron?

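Gradient = 0. For this input the neuron outputs 0 and receives no gradient update. A neuron whose pre-activation is negative for every training input never updates again: it is "dead." If many neurons die, the network loses capacity. This motivates Leaky ReLU, which uses a small slope (e.g., 0.01) for x<0.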
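The gradient comparison is quick to verify numerically. The sketch below also multiplies eight activation derivatives together as a crude stand-in for backpropagation through eight layers. It deliberately ignores the weight matrices, so treat it as an upper-bound illustration for Sigmoid, not a simulation of AlexNet.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


print(round(sigmoid_grad(5.0), 4))  # 0.0066
print(relu_grad(5.0))               # 1.0
print(relu_grad(-0.01))             # 0.0

# Best case for Sigmoid: its derivative peaks at 0.25 (at x = 0).
sigmoid_chain = sigmoid_grad(0.0) ** 8
relu_chain = relu_grad(5.0) ** 8
print(sigmoid_chain)  # about 1.5e-05
print(relu_chain)     # 1.0
```

Even when every Sigmoid unit sits at its most favourable operating point, eight layers shrink the gradient by a factor of roughly 65,000, while active ReLU units pass it through unchanged.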
[Figure: Input 224×224 RGB → Conv1 (96 filters, 11×11, S=4; 55×55×96; ReLU + Pool) → Conv2 (256 filters, 5×5, P=2; 27×27×256; ReLU + Pool) → Conv3–5 (384/384/256 filters, 3×3, P=1; 13×13×256; ReLU + Pool) → FC6 (4096 units, ReLU, Dropout 0.5) → FC7 (4096 units, ReLU, Dropout 0.5) → FC8 (1000 units, Softmax). Top-5 error: 15.3%. GPU 1 and GPU 2 each hold half of the Conv1 and Conv2 filters.]

AlexNet architecture: 5 convolutional layers (purple) followed by 3 fully-connected layers (teal). Two GPUs ran in parallel, each handling half the feature maps in early layers.
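The spatial sizes in the diagram can be traced with the output-dimension formula. Two details here are assumptions rather than facts taken from the figure: a padding of 2 in Conv1 (with a 224×224 input and no padding the formula gives 54, not 55; reimplementations commonly pad or use a 227×227 input), and 3×3 stride-2 max pooling after Conv1, Conv2, and Conv5 (the overlapping pooling described in the AlexNet paper).

```python
def out_size(size, kernel, padding=0, stride=1):
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, padding, stride); conv1 padding=2 is an assumption.
layers = [
    ("conv1", 11, 2, 4),
    ("pool1", 3, 0, 2),
    ("conv2", 5, 2, 1),
    ("pool2", 3, 0, 2),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 0, 2),
]

size = 224
trace = {}
for name, kernel, padding, stride in layers:
    size = out_size(size, kernel, padding, stride)
    trace[name] = size
    print(f"{name}: {size}×{size}")

flat = size * size * 256  # input width of FC6
print(flat)  # 9216
```

The final 6×6×256 volume flattens to 9,216 values, which is the input width of the first fully-connected layer.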

⚠️
Common Mistake

Using Sigmoid/Tanh activations in deep CNNs. These functions saturate: at input magnitudes above ≈3, the gradient approaches 0. Backpropagating through 8 such layers multiplies many near-zero values, so effectively no gradient reaches the early layers. Always use ReLU (or Leaky ReLU/GELU) in deep networks unless you have a specific reason not to.

Solution
Pause & Predict

If you train AlexNet on ImageNet with Sigmoid instead of ReLU, and start from the same random initialization, what will most likely happen after 10 epochs?

Think about gradient flow through 8 layers of Sigmoid — what happens to the first-layer weight updates?

Try It: Architecture Comparison

Compare LeNet-5 vs a mini-AlexNet in terms of parameter count, depth, and activation function. Toggle between architectures to see how each design decision impacts model capacity.

Implementation
Python · PyTorch — AlexNet-Inspired CNN Classifier
```python
import torch
import torch.nn as nn


class MiniAlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3→32 filters, large kernel (AlexNet: 96, 11×11)
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            # Block 2: 32→64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            # Block 3: 64→128 (AlexNet: 384/256 channels)
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),  # AlexNet innovation!
            nn.Linear(128 * 4 * 4, 512), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)


model = MiniAlexNet(num_classes=10)
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params:,}")
print(f"Output: {model(torch.randn(1, 3, 32, 32)).shape}")
```
stdout
```
Parameters: 1,149,002
Output: torch.Size([1, 10])
```
Key Takeaway

AlexNet's 2012 breakthrough was not a single idea but a combination: ReLU activations enabling depth, Dropout taming overfitting, GPU compute enabling scale, and ImageNet providing the training signal — proving that depth × data × compute unlock qualitatively new capabilities in visual recognition.

🏭
Real-World Application

Manufacturing Quality Control: PCB Defect Detection

Modern PCB inspection systems use AlexNet-lineage CNNs to detect solder defects at sub-millimeter precision on production lines running at 30,000 units/hour. The system uses AlexNet's core innovations — ReLU speed, Dropout robustness — applied to high-resolution industrial cameras. A single GPU inference pass flags 14 defect categories in <10ms, replacing what previously required trained human inspectors examining every board under magnification.

Checkpoint From LeNet to AlexNet

Q1 What specific problem does Dropout solve, and why is it especially important for a 60M-parameter model like AlexNet?

Dropout prevents co-adaptation — neurons learning to rely on specific neighbors rather than developing independent features. With 60M parameters and only 1.2M training images, AlexNet would massively overfit without regularization. By randomly zeroing 50% of FC neurons per batch, Dropout forces each neuron to learn useful features on its own, acting like training an ensemble of 2^N thinned networks.
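The mechanics of Dropout fit in a few lines. The sketch below uses the "inverted" formulation, where surviving activations are scaled by 1/(1−p) during training so that nothing needs to change at evaluation time; this is the convention PyTorch's `nn.Dropout` follows, whereas the AlexNet paper instead scaled outputs at test time. The function `dropout` and its arguments are hypothetical names for this illustration.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(values)  # evaluation: pass activations through unchanged
    keep = 1.0 - p
    # Keep each unit with probability `keep` and rescale the survivors.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


random.seed(0)
activations = [1.0] * 8
train_out = dropout(activations, p=0.5, training=True)
eval_out = dropout(activations, p=0.5, training=False)
print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Because a different random subset is dropped on every call in training mode, no unit can rely on a specific neighbour being present.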

Q2 Why does AlexNet use an 11×11 kernel in the first layer rather than 3×3 like modern architectures?

In 2012, AlexNet needed to capture large-scale features from 224×224 images efficiently. With stride 4 (and a padding of 2, as in common reimplementations), an 11×11 kernel reduces the spatial dimension from 224 to 55 in one step while covering a large RF. Modern architectures use stacked 3×3 layers (same RF, fewer params), a design popularized by VGG in 2014 and made easier to train at depth by BatchNorm and skip connections, which didn't exist in 2012.

Receptive Field Visualizer

Watch how a neuron's receptive field in the input image grows as you add more convolutional layers. Drag the depth slider and observe which input pixels influence the highlighted output neuron.


What We Covered

From individual filter arithmetic to the architecture that changed the field — the core ideas of Convolutional Neural Networks.

🔳

Convolutional Layer

A learnable filter bank that slides across input images. Weight sharing makes CNNs dramatically more parameter-efficient than FC layers. Output size: $\lfloor(H+2P-K)/S\rfloor+1$.

📐

Dimension Formula

$H_{\text{out}} = \lfloor(H+2P-K)/S\rfloor + 1$. Use "same" padding $P=(K-1)/2$ to preserve spatial size. Stride 2 ≈ halves dimensions.

🗜️

Pooling & Invariance

Max pool selects peak activations from local regions — building translation invariance. 2×2 pool (stride 2) halves spatial dims and cuts downstream compute by 4×.

👁️

Receptive Fields

RF grows by $(K-1)$ with each layer: $\text{RF}_L = 1 + L(K-1)$. Stacking small kernels is more efficient than large kernels: three K=3 layers give RF=7, same as K=7 but 45% fewer params.

🕰️

LeNet → AlexNet

14-year leap from 60K to 60M params. AlexNet's combination of ReLU + Dropout + GPU training + data augmentation on ImageNet validated deep CNNs as the dominant paradigm.

Coming Up — Week 7

Modern CNNs & Transfer Learning: Skip connections (ResNet), efficient convolutions (MobileNet), and fine-tuning pre-trained models for custom tasks with minimal data.

Further Reading

Primary references and interactive tools for deepening your understanding of CNNs.

Textbook

Prince — Understanding Deep Learning, Ch. 10

Covers convolutional layers, pooling, and the transition from LeNet to modern architectures. Excellent derivations of the output dimension formula and receptive field proofs.

→ Primary reference for this week
Original Paper

Krizhevsky et al. — ImageNet Classification with Deep CNNs (2012)

"ImageNet Classification with Deep Convolutional Neural Networks", the AlexNet paper. Read Section 3 (The Architecture) and Section 6 (Results) for direct context on the design choices.

→ NeurIPS (NIPS) 2012 proceedings
Interactive

CNN Explainer — Polo Club of Data Science

Real-time visual exploration of a CNN on Tiny ImageNet. Drag the input, watch feature maps activate, and trace how max pooling compresses spatial information.

→ poloclub.github.io/cnn-explainer
Video Lecture

Stanford CS231n — Convolutional Neural Networks

The CS231n lecture on CNNs covers the convolution operator, pooling mechanics, and the history of convolutional architectures, with excellent visual intuition.

→ youtube.com (CS231n 2017 L5)

Exercises

Eight exercises spanning output dimension arithmetic, parameter counting, receptive field reasoning, PyTorch implementation, and architectural design — covering convolutional layers, pooling, and the LeNet-to-AlexNet evolution.

1
Theory · Convolutional Layer Easy

Output Dimension Arithmetic

For each configuration below, compute the output feature map height $H_{\text{out}}$ using $H_{\text{out}} = \lfloor(H+2P-K)/S\rfloor+1$.

  1. $H=28$, $K=5$, $P=0$, $S=1$
  2. $H=64$, $K=3$, $P=1$, $S=1$
  3. $H=128$, $K=3$, $P=0$, $S=2$
  4. $H=224$, $K=7$, $P=3$, $S=2$

Substitute into the formula directly:

(1) $\lfloor(28+0-5)/1\rfloor+1 = 24$. (2) $\lfloor(64+2-3)/1\rfloor+1 = 64$ (same padding). (3) $\lfloor(128+0-3)/2\rfloor+1 = 63$. (4) $\lfloor(224+6-7)/2\rfloor+1 = 112$.

Note that (4) is exactly the first layer of ResNet-50!

2
Code · Convolutional Layer Medium

Feature Map Visualization

Implement a single convolutional layer in PyTorch and visualize its feature maps on a real image.

  1. Load any grayscale image using PIL, convert to a float tensor of shape (1,1,H,W)
  2. Create nn.Conv2d(1, 8, kernel_size=3, padding=1) (random init)
  3. Forward pass: obtain 8 feature maps
  4. Plot all 8 feature maps side-by-side using matplotlib subplots
  5. Verify: all 8 output maps have the same spatial size as the input

Use torchvision.transforms.ToTensor() for the image → tensor conversion. The output of conv(x) has shape (1, 8, H, W); use out[0] (shape 8, H, W) and iterate over the first dimension for plotting. Use cmap='gray' in plt.imshow.

3
Theory · Pooling & Receptive Fields Medium

Receptive Field & Pooling Calculations

Answer the following questions about a CNN's field of view.

  1. Compute the receptive field after 5 stacked Conv(K=3, S=1, P=1) layers
  2. Now add 2×2 MaxPool (stride 2) after layers 2 and 4. What is the effective RF at the final layer? (Each pool doubles the RF scaling factor for later layers)
  3. A 4×4 feature map is processed with MaxPool(2×2, stride=2). Write out the output values given input: [[2,8,4,1],[6,3,7,2],[0,5,9,3],[4,1,6,8]]
  4. Why would using AvgPool instead of MaxPool in question 3 give a different result semantically?

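(1) RF = 1 + 5×2 = 11. (2) Track the spacing between neighbouring outputs (the "jump"): convs 1 and 2 add 2 each (RF = 5); pool 1 adds 1 (RF = 6) and doubles the jump to 2; convs 3 and 4 add 4 each (RF = 14); pool 2 adds 2 (RF = 16) and doubles the jump to 4; conv 5 adds 8. Effective RF = 24. (3) MaxPool 2×2: top-left={2,8,6,3}→8; top-right={4,1,7,2}→7; bot-left={0,5,4,1}→5; bot-right={9,3,6,8}→9. Output: [[8,7],[5,9]]. (4) AvgPool would return [[4.75, 3.5], [2.5, 6.5]]: each output reports the mean response of its window rather than whether a strong activation is present, so isolated peaks such as the 9 are diluted.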

4
Code · Pooling & Receptive Fields Medium

Feature Map Size Tracking

Build a CNN block and track how feature map sizes change at each layer.

  1. Create a CNN with three conv layers: Conv(3→16, K=3, P=1) → ReLU → MaxPool(2) → Conv(16→32, K=3, P=1) → ReLU → MaxPool(2) → Conv(32→64, K=3, P=1) → ReLU → GlobalAvgPool
  2. Pass a batch of shape (4, 3, 64, 64) through the network
  3. Print the shape after every layer using a hook or by breaking into steps
  4. Verify: what is the total parameter count? (Use sum(p.numel() for p in model.parameters()))

After MaxPool(2) the spatial dims halve each time: 64→32→16. After global average pooling the shape is (4, 64, 1, 1); use nn.AdaptiveAvgPool2d((1,1)). Expected params: Conv1: 3×3×3×16+16=448; Conv2: 3×3×16×32+32=4,640; Conv3: 3×3×32×64+64=18,496. Total = 23,584.

5
Theory · CNN Architecture Evolution Easy

AlexNet Innovation Analysis

For each AlexNet innovation, explain the problem it solved and why that problem was critical for training on ImageNet.

  1. ReLU activations replaced Tanh — what gradient problem does this fix?
  2. Dropout(0.5) on FC layers — why is this necessary for 60M params on 1.2M images?
  3. Data augmentation (crops + flips) — how does this affect effective dataset size?
  4. Compute the parameter savings: compare AlexNet's FC6 layer (9216→4096) against LeNet's FC layer (400→120). Which is larger, by how much?

(1) Tanh/Sigmoid saturate → gradient ≈ 0 for large |x|; backprop fails in deep nets. ReLU gradient=1 for x>0, so gradients flow unchanged through active units. (2) Overfitting: 60M params / 1.2M samples ≈ 50 params per sample; without regularization, the model memorizes training data. (3) Crops and flips: a 224×224 window can be placed at 32×32 = 1,024 positions in a 256×256 image, and each crop can be mirrored, giving 2,048 views per image and billions of distinct (though highly correlated) training views. (4) FC6: 9216×4096+4096 = 37,752,832. LeNet FC: 400×120+120 = 48,120. AlexNet FC6 alone is 785× larger than LeNet's biggest FC.

6
Code · CNN Architecture Evolution Hard

Train a Mini-CNN on CIFAR-10

Implement and train a compact AlexNet-inspired model on CIFAR-10.

  1. Load CIFAR-10 using torchvision.datasets.CIFAR10 with augmentation (RandomCrop + HorizontalFlip)
  2. Build the MiniAlexNet from the slides (3 conv blocks + 2 FC layers with Dropout)
  3. Train for 20 epochs using Adam (lr=1e-3), CrossEntropyLoss, batch size 128
  4. Plot training loss and validation accuracy curves
  5. Report final test accuracy. Compare training with Dropout ON vs OFF — what do you observe?

CIFAR-10 images are 32×32×3 with 10 classes. Use transforms.Compose([transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))]). Expected test accuracy: ~75-80% with Dropout, ~70% without (signs of overfitting). Set model.train() during training and model.eval() for validation to correctly enable/disable Dropout.

7
Synthesis · Theory: CNN Design Challenge Hard

Design a CNN for Retinal Disease Detection

You are designing a CNN to classify fundus (retinal) photographs (512×512 pixels, 3 channels) into 5 disease categories. Design the architecture and justify every decision.

  1. How many convolutional layers do you need to achieve a receptive field of at least 64×64 pixels? (Assume K=3, S=1, 2×2 MaxPool after every 2 conv layers)
  2. Propose the number of filters per layer (start at 32, double after each pool)
  3. Calculate the output feature map size after your final conv+pool block, then the total parameters in the feature extractor
  4. Justify your choice of: (a) pooling vs stride-2 conv; (b) Dropout rate; (c) whether to use BatchNorm

With K=3, S=1, each conv adds 2 × jump to the RF, where the jump is the spacing between neighbouring outputs measured in input pixels; each 2×2 stride-2 pool adds 1 × jump and then doubles the jump. A typical design: Conv1,Conv2→Pool (RF = 6, output 256×256); Conv3,Conv4→Pool (RF = 16, output 128×128); Conv5,Conv6→Pool (RF = 36, output 64×64); Conv7,Conv8→Pool (RF = 72, output 32×32). The RF first exceeds 64 at Conv8 (RF = 68), so 8 conv layers are needed. Filters: 32→32→64→64→128→128→256→256, which gives about 1.17M parameters in the feature extractor. Using stride-2 conv instead of MaxPool is increasingly preferred because it is learnable and can preserve more information. BatchNorm is strongly recommended for training stability with deep medical imaging models.

8
Synthesis · Code: CNN Ablation Study Hard

Ablation Study: What Makes AlexNet Work?

Systematically remove each AlexNet innovation and measure the impact on CIFAR-10 validation accuracy after 10 epochs.

  1. Implement four variants of MiniAlexNet: (a) Baseline (all innovations), (b) No Dropout, (c) Sigmoid instead of ReLU, (d) No data augmentation
  2. Train all four with identical hyperparameters for 10 epochs each
  3. Record train loss and val accuracy every epoch
  4. Create a single plot with 4 validation accuracy curves (one color per variant)
  5. Write 3 sentences explaining which innovation had the largest impact and why

Use the same random seed (torch.manual_seed(42)) and same weight initialization for fair comparison. For Sigmoid variant, replace nn.ReLU() with nn.Sigmoid() throughout. Expected ordering (highest val acc): Baseline > No Augmentation > No Dropout > Sigmoid. The Sigmoid variant may fail to improve beyond ~50% due to vanishing gradients — this is the most dramatic effect. No Dropout will show higher train accuracy but lower val accuracy (classic overfitting signature).