Computer Vision Foundations — Week 6: Convolutional Neural Networks

Intuition

The Convolutional Layer

How learnable filters scan images pixel by pixel to detect spatial patterns — and why sharing weights across positions is so powerful.

After this section you will be able to

Explain how a 2D convolutional filter scans an input image using stride and padding
Calculate output feature map dimensions given input size, kernel size, stride, and padding
Implement multi-channel convolution in PyTorch and inspect the resulting feature maps

"A Sobel filter finds edges. A Gabor filter detects textures. Before deep learning, engineers spent careers hand-crafting these filters for each vision task. A single convolutional layer learns hundreds of them automatically — from data."

🧩

Why convolution over fully-connected layers? Convolution encodes two powerful priors about images: local connectivity (nearby pixels are related) and weight sharing (the same pattern should be detectable anywhere in the image). A single-channel 3×3 filter applied to a 224×224 image needs only 9 weights, whereas one fully-connected unit looking at the same image needs 224×224 = 50,176, and those 9 weights are reused at every spatial position.

🔍

Analogy Bridge

Think of a convolutional filter like a rubber stamp sliding across a document. Each time you press the stamp, it measures how well the stamp pattern matches the ink below it. Wherever there's a strong match, the output (the feature map) lights up. The CNN learns which stamps are most useful for the task.

Convolution anatomy: a 3×3 filter slides over a 6×6 input (stride 1, no padding) producing a 4×4 feature map. Brighter cells = stronger activation.

Input

Image Patch

Raw pixel values: a local region of the input image

Filter

3×3 Kernel

Learnable weights; one set per output channel

Output

Feature Map

One scalar per patch — measures filter-pattern match strength

Stack

C_out Maps

Multiple filters → multiple feature maps → richer representation

9

Parameters

In one 3×3 filter — regardless of input spatial size

~5,575×

Fewer params

Per output unit vs. a fully-connected layer on a 224×224 single-channel input (50,176 weights vs. 9)

K²·C_in·C_out

Total weights

Full conv layer parameter count (plus C_out biases)

∞

Reuse

Each filter weight is shared across all spatial positions
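The sliding-window picture above (a 3×3 filter over a 6×6 input, stride 1, no padding) can be sketched in plain Python. This is a minimal illustration, not how a framework implements it: the function name `conv2d_single`, the toy image, and the hand-written Sobel-like kernel are all assumptions made for the example. A real layer would learn the kernel values.

```python
def conv2d_single(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D image (lists of lists); return the feature map."""
    k = len(kernel)
    if padding:
        width = len(image[0]) + 2 * padding
        zero_row = [0] * width
        image = (
            [zero_row[:] for _ in range(padding)]
            + [[0] * padding + list(row) + [0] * padding for row in image]
            + [zero_row[:] for _ in range(padding)]
        )
    h, w = len(image), len(image[0])
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the kernel and the patch underneath it.
            row.append(sum(
                kernel[a][b] * image[top + a][left + b]
                for a in range(k)
                for b in range(k)
            ))
        out.append(row)
    return out


# Toy 6x6 input with a vertical edge: left half dark (0), right half bright (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written vertical-edge kernel (illustrative; a CNN would learn these 9 weights).
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

fmap = conv2d_single(image, kernel)
for row in fmap:
    print(row)  # each row is [0, 4, 4, 0]: the map responds only where the edge is
```

The same 9 weights are used at all 16 output positions, which is the weight sharing described in the cards above.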

Problem

Output Dimension Formula

Given input size $H \times W$, kernel size $K$, padding $P$, and stride $S$:

$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$

The same formula applies for the width $W_{\text{out}}$. With odd $K$, $P=\frac{K-1}{2}$, and $S=1$, the output size equals the input size ("same" padding).

📐 Worked Example — Compute output size

Input: $64 \times 64$, $K=3$, $P=1$, $S=1$

Substitute into formula: $H_{\text{out}} = \lfloor (64 + 2\cdot1 - 3) / 1 \rfloor + 1$

$= \lfloor 63 / 1 \rfloor + 1 = 63 + 1 = 64$

✓ Output: 64×64 — "same" padding preserved spatial size

Now try: $64 \times 64$, $K=3$, $P=0$, $S=2$

$H_{\text{out}} = \lfloor (64 + 0 - 3) / 2 \rfloor + 1 = \lfloor 61/2 \rfloor + 1 = 30 + 1 = 31$

✓ Output: 31×31 — stride 2 halved the spatial resolution (≈)

Quick Check

What output size does a $32 \times 32$ input produce with $K=5$, $P=2$, $S=1$?

H_out = ⌊(32 + 2×2 − 5)/1⌋ + 1 = ⌊31/1⌋ + 1 = 32 ✓ (same padding)
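The worked examples and the quick check can be reproduced with a few lines of Python. This is a small sketch of the formula only; the helper name `conv_output_size` is an assumption for illustration and is not part of PyTorch.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """floor((size + 2*padding - kernel) / stride) + 1 for one spatial dimension."""
    if size + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (size + 2 * padding - kernel) // stride + 1


# (H, K, P, S) triples taken from the worked examples and quick check above.
examples = [
    (64, 3, 1, 1),
    (64, 3, 0, 2),
    (32, 5, 2, 1),
]
for size, k, p, s in examples:
    h_out = conv_output_size(size, k, p, s)
    print(f"H={size}, K={k}, P={p}, S={s} -> H_out={h_out}")
```

Integer floor division (`//`) plays the role of the floor brackets in the formula.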

Multi-Channel Convolution

Real images have $C_{\text{in}}$ input channels (e.g., RGB = 3). Each filter has shape $K \times K \times C_{\text{in}}$, and the layer stacks $C_{\text{out}}$ filters:

$$\text{Params} = K \times K \times C_{\text{in}} \times C_{\text{out}} + C_{\text{out}}$$

The extra $C_{\text{out}}$ term is from the bias (one per output channel). When $C_{\text{in}}=3$, $C_{\text{out}}=64$, $K=3$: params = 9×3×64 + 64 = 1,792.

📐 Worked Example — Count parameters

Layer config: $K=3$, $C_{\text{in}}=3$ (RGB), $C_{\text{out}}=64$

Weight params: $3 \times 3 \times 3 \times 64 = 1,\!728$

Bias params: $C_{\text{out}} = 64$

Total = 1,728 + 64 = 1,792 — this is the first layer of VGG-16!

Quick Check

If we use $K=5$ instead of $K=3$ (same $C_{\text{in}}=3$, $C_{\text{out}}=64$), how many more parameters does that add?

K=5: 5×5×3×64 + 64 = 4,864. K=3: 1,792. Difference = 3,072 more parameters: about 2.7× as many, for a receptive field only two pixels wider (5 vs. 3).
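The parameter formula is easy to check numerically. The sketch below assumes a hypothetical helper `conv_params` (not a PyTorch function) and reuses the layer configurations from this section.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """K*K*C_in*C_out weights, plus one bias per output channel if bias=True."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


k3 = conv_params(3, 3, 64)   # 3x3 kernels on RGB, 64 filters
k5 = conv_params(5, 3, 64)   # same layer with 5x5 kernels

print(f"K=3: {k3:,}")
print(f"K=5: {k5:,}")
print(f"extra parameters: {k5 - k3:,} ({k5 / k3:.1f}x as many)")
```

Note that the input height and width never appear in the function: the count depends only on kernel size and channel counts.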

$$H_{\text{out}} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1$$

$H$

Input Height

Spatial height of the input feature map

$P$

Padding

Rows of zeros added on each side

$K$

Kernel Size

Filter height/width (usually odd: 3, 5, 7)

$S$

Stride

Step size of the sliding window

⌊⌋+1

Floor + 1

Round the division down; +1 for the first position

⚠️

Common Mistake

Forgetting to account for padding. With $K=3$, $P=0$, $S=1$: a 32×32 input becomes 30×30 (shrinks by 2). Stack 5 such layers and you lose 10 pixels — your 32×32 feature map becomes 22×22. Always choose $P = (K-1)/2$ for "same" padding when you want to preserve spatial size.
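The shrinkage described above can be traced layer by layer. This is a minimal sketch under the same assumptions as the text (K=3, S=1); the function name `stack_sizes` is made up for the example.

```python
def stack_sizes(size, layers, kernel=3, padding=0, stride=1):
    """Return the spatial size before the first layer and after each stacked layer."""
    sizes = [size]
    for _ in range(layers):
        size = (size + 2 * padding - kernel) // stride + 1
        sizes.append(size)
    return sizes


no_pad = stack_sizes(32, layers=5, padding=0)
same_pad = stack_sizes(32, layers=5, padding=1)

print("P=0:", " -> ".join(map(str, no_pad)))    # shrinks by 2 per layer
print("P=1:", " -> ".join(map(str, same_pad)))  # size preserved
```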

Solution

Pause & Predict

Before using the widget: if you increase stride from 1 to 2, what happens to the output size? And if you increase padding from 0 to 1 with $K=3$, what changes?

Hint: stride divides, padding adds twice (once each side).

Try It: Output Dimension Calculator

Adjust the sliders to see how kernel size, padding, and stride affect the output feature map dimensions. Live calculation updates in real time.

Default settings: H = 32, K = 3, P = 1, S = 1, so H_out = ⌊(32 + 2·1 − 3) / 1⌋ + 1 = 32.

Implementation

Python · PyTorch — Convolutional Layer Exploration

import torch
import torch.nn as nn

# Conv2d(in_channels, out_channels, kernel_size, stride, padding)
conv = nn.Conv2d(1, 8, kernel_size=3, stride=1, padding=1)

# Count parameters: K²·C_in·C_out + C_out = 9·1·8 + 8 = 80
params = sum(p.numel() for p in conv.parameters())

# Input: batch=1, channels=1, 28×28
x = torch.randn(1, 1, 28, 28)
out = conv(x)

print(f"Input:  {x.shape}")
print(f"Output: {out.shape}")
print(f"Params: {params}")

# Change to stride=2 — spatial dims halve
conv_s2 = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
out_s2 = conv_s2(x)
print(f"Stride-2 out: {out_s2.shape}")

stdout

Input:  torch.Size([1, 1, 28, 28])
Output: torch.Size([1, 8, 28, 28])
Params: 80
Stride-2 out: torch.Size([1, 8, 14, 14])

Key Takeaway

A convolutional layer is a bank of learnable spatial filters — each filter produces one feature map by sliding across the input with shared weights, making CNNs dramatically more parameter-efficient than fully-connected layers.

📸

Real-World Application

Medical Imaging: Diabetic Retinopathy Screening

Google Research developed a CNN that detects diabetic retinopathy from retinal photos. Its early convolutional layers learn edge- and texture-detection filters that are not hand-coded but discovered from roughly 128,000 labeled fundus images. These filters are similar in spirit to the Gabor filters that vision engineers once designed manually, but adapted to the pathological features that matter for diagnosis.

Checkpoint The Convolutional Layer

Q1 Why does a CNN with weight sharing require far fewer parameters than a fully-connected layer of the same output size?

A 3×3 filter applied to any-sized input uses the same 9 weights regardless of input dimensions, reusing them at every spatial position. A fully-connected layer needs one weight per (input pixel × output neuron), which scales with $H \times W \times C_{\text{in}} \times C_{\text{out}}$ — orders of magnitude more for typical image sizes.

Q2 A 128×128 RGB image passes through a conv layer with $K=5$, $P=2$, $S=2$, $C_{\text{out}}=32$. What is the output shape?

$H_{\text{out}} = \lfloor(128 + 2\cdot2 - 5)/2\rfloor + 1 = \lfloor127/2\rfloor + 1 = 63+1 = 64$. Output: (1, 32, 64, 64) as a PyTorch tensor.

Mechanics

Pooling & Receptive Fields

How pooling builds translation invariance, and how receptive fields grow with depth — enabling deep layers to reason about large-scale image structure.

After this section you will be able to

Distinguish max pooling from average pooling and explain when each is preferred in practice
Compute the receptive field size after N stacked convolutional layers using the RF formula
Describe the CNN feature hierarchy and explain what edge → texture → part → object corresponds to in each depth tier

"After Layer 1 detects edges, how does a CNN combine them into circles, then eyes, then faces? The answer: pooling compresses spatial detail while receptive fields grow — letting deeper neurons see broader context."

🎯

Why pooling? Translation invariance: if a feature moves 1–2 pixels, max pooling still fires. Dimensionality reduction: each 2×2 pool halves the spatial dimensions, cutting computation by 4× for subsequent layers. Why receptive fields? A neuron after five stacked K=3, S=1 conv layers "sees" only an 11×11 region of the input (1 + 5×2). Interleaving pooling makes each later layer widen that region faster, so deep neurons can eventually reason about whole objects, not just local texture.

📰

Analogy Bridge

Pooling is like reading a blurry photocopy of a document. You lose exact pixel positions (fine detail), but you can still read every word (semantic content). The CNN trades spatial precision for position invariance — useful because "the cat can be anywhere in the image."

Edges & Orientations

→

Textures & Corners

→

Parts & Components

→

Objects & Scenes

→

Class Prediction

CNN feature hierarchy: each layer combines features from the previous, building increasingly abstract and semantically rich representations.

2×2

Pool window

Standard max pool — halves each spatial dimension

+2

RF per layer

Each K=3, S=1 conv adds 2 pixels to the receptive field

11×11

RF after 5 layers

Stacking five K=3, S=1 convs with no pooling: 1 + 5×2 = 11

4×

Compute saved

One 2×2 pool step cuts FLOPs for subsequent layers by 75%

Problem

Max & Average Pooling

Given a feature map region of size $P_H \times P_W$:

$$\text{MaxPool}: y = \max_{i,j} x_{ij}$$ $$\text{AvgPool}: y = \frac{1}{P_H P_W}\sum_{i,j} x_{ij}$$

Max pool preserves the strongest activation (is this pattern present?). Average pool softens responses. Max pool is standard in classification; global average pool replaces the large FC stack before the classifier in ResNet/MobileNet.

📐 Worked Example — 2×2 Max Pool

Input 4×4 patch:

$$\begin{bmatrix}1&3&2&1\\4&6&5&2\\1&2&8&3\\0&1&3&4\end{bmatrix} \xrightarrow{2\times2\;\text{MaxPool, S=2}} \begin{bmatrix}?&?\\?&?\end{bmatrix}$$

Top-left 2×2: {1,3,4,6} → max = 6

Top-right 2×2: {2,1,5,2} → max = 5

Bottom-left 2×2: {1,2,0,1} → max = 2

Bottom-right 2×2: {8,3,3,4} → max = 8

Output: [[6, 5], [2, 8]] — spatial size halved, strongest features preserved

Quick Check

What would Average Pool give for the top-left 2×2 quadrant {1,3,4,6}?

(1+3+4+6)/4 = 14/4 = 3.5 — softer than max=6; preserves scale but loses the "peak detection" property.
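Both pooling operations on the 4×4 patch from the worked example can be sketched in a few lines. The function `pool2d` is a hypothetical helper written for this illustration, not a library call.

```python
def pool2d(fmap, window=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2D feature map given as lists of lists."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda vals: sum(vals) / len(vals)
    out_h = (len(fmap) - window) // stride + 1
    out_w = (len(fmap[0]) - window) // stride + 1
    return [
        [
            reduce([fmap[i * stride + a][j * stride + b]
                    for a in range(window) for b in range(window)])
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# The 4x4 patch from the worked example above.
patch = [[1, 3, 2, 1],
         [4, 6, 5, 2],
         [1, 2, 8, 3],
         [0, 1, 3, 4]]

max_out = pool2d(patch, mode="max")
avg_out = pool2d(patch, mode="avg")
print("max:", max_out)  # [[6, 5], [2, 8]]
print("avg:", avg_out)  # [[3.5, 2.5], [1.0, 4.5]]
```

Comparing the two outputs shows the trade-off: max pooling keeps the peaks (6 and 8), while average pooling reports the mean response of each region.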

Receptive Field Growth

The receptive field (RF) is the input region that influences one output neuron. For $L$ stacked conv layers with kernel $K$ and stride $S=1$:

$$\text{RF}_L = 1 + L \times (K - 1)$$

With $K=3$, each layer adds exactly 2 pixels to the RF. Pooling (stride $>1$) multiplies the growth rate for subsequent layers.

📐 Worked Example — RF after 3 layers

Config: 3 conv layers, $K=3$, $S=1$, no pooling

After L=1: RF = 1 + 1×(3−1) = 3

After L=2: RF = 1 + 2×(3−1) = 5

After L=3: RF = 1 + 3×(3−1) = 7

RF = 7 from three 3×3 layers: the same RF as one 7×7 layer, but with only 3×9 = 27 weights instead of 49 (per input-output channel pair)!

Quick Check

How many K=3, S=1 conv layers do you need to achieve RF≥15?

RF = 1 + L×2 ≥ 15 → L ≥ 7 layers. So 7 conv layers gives RF = 1 + 7×2 = 15.
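The stride-1 formula and its inverse (how many layers for a target RF) fit in two short functions. The names `stacked_rf` and `layers_needed` are assumptions for this sketch, and both apply only when every layer has stride 1.

```python
import math


def stacked_rf(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convs: 1 + L*(K-1)."""
    return 1 + num_layers * (kernel - 1)


def layers_needed(target_rf, kernel=3):
    """Smallest L such that 1 + L*(K-1) >= target_rf."""
    return math.ceil((target_rf - 1) / (kernel - 1))


for n in (1, 2, 3, 5):
    print(f"L={n}: RF={stacked_rf(n)}")
print("layers for RF >= 15:", layers_needed(15))
```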

💡

Key Insight

Three stacked 3×3 conv layers cover the same RF as one 7×7 layer, but use 3×9 = 27 weights vs 49 — and introduce two additional non-linearities (ReLUs). This is why VGG, ResNet, and virtually all modern CNNs use small kernels stacked deep instead of large kernels.
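The 27-vs-49 comparison is per input-output channel pair. As a rough sketch, assuming every layer keeps the same channel count C (an assumption made here for simplicity, with biases ignored), the weight counts scale by C² on both sides, so the ratio stays the same:

```python
def stacked_weights(num_layers, kernel, channels):
    """Weights in num_layers KxK convs, each mapping `channels` -> `channels` (no bias)."""
    return num_layers * kernel * kernel * channels * channels


for c in (1, 64, 256):
    small = stacked_weights(3, 3, c)   # three stacked 3x3 layers
    large = stacked_weights(1, 7, c)   # one 7x7 layer with the same receptive field
    print(f"C={c}: 3x(3x3) = {small:,}  vs  1x(7x7) = {large:,}  "
          f"({small / large:.0%} of the weights)")
```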

2×2 Max Pooling (stride 2) selects the maximum from each quadrant, halving spatial dimensions while retaining peak activations. Receptive fields grow by 2 with each K=3 layer.

⚠️

Common Mistake

Confusing theoretical RF with effective RF. The formula $\text{RF} = 1 + L(K-1)$ gives the theoretical maximum. In practice, pixels near the edge of the RF have much weaker influence than central ones, because far fewer paths through the network connect them to the output neuron. The effective RF is therefore often 2–3× smaller than the theoretical RF, which is why very deep networks are needed to reason about whole-image context.

Solution

Pause & Predict

Before checking the widget: if you add a 2×2 max pool (stride 2) after every conv layer, how does this change the receptive field growth rate for subsequent layers?

Hint: stride acts as a multiplier on how quickly the RF expands in later layers.

Try It: Pooling Output Visualizer

Select pooling type and adjust window size to see how the feature map is downsampled. The input is a 4×4 activation grid — watch how pooling preserves (or softens) peaks.


Implementation

Python · PyTorch — Pooling Layers & Feature Map Sizes

import torch
import torch.nn as nn

# Build a small CNN block: Conv → ReLU → MaxPool
block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # halves spatial dims

    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 64, 64)
for i, layer in enumerate(block):
    x = layer(x)
    print(f"After layer {i}: {x.shape}")

stdout

After layer 0: torch.Size([1, 32, 64, 64])   # Conv (same padding)
After layer 1: torch.Size([1, 32, 64, 64])   # ReLU (no size change)
After layer 2: torch.Size([1, 32, 32, 32])   # MaxPool 2x2 → halved
After layer 3: torch.Size([1, 64, 32, 32])   # Conv (same padding)
After layer 4: torch.Size([1, 64, 32, 32])   # ReLU
After layer 5: torch.Size([1, 64, 16, 16])   # MaxPool 2x2 → halved

Key Takeaway

Pooling trades spatial resolution for translation invariance, and every stride-2 stage doubles how fast the receptive field grows in later layers. Together these let deep layers integrate global context and recognize objects regardless of their exact position in the image.

🫁

Real-World Application

Radiology AI: Chest X-Ray Pathology Detection

CheXNet (Stanford, 2017) detects 14 thoracic diseases from chest X-rays and reported radiologist-level performance on pneumonia detection. By the final dense block of its DenseNet-121 backbone, each neuron has a receptive field covering nearly the entire 224×224 image. This global context is essential: detecting cardiomegaly (enlarged heart) requires comparing the cardiac silhouette to the entire lung field, which is impossible with early-layer local features alone.

Checkpoint Pooling & Receptive Fields

Q1 Why is max pooling preferred over average pooling for feature detection tasks in classification networks?

Max pooling asks "is this feature present anywhere in the region?" — it fires if even one position has a strong activation. Average pooling dilutes strong activations with weak neighbors, reducing sensitivity to localized features. For classification (detecting "is there a cat?"), max pooling's binary presence detection is more useful.

Q2 What is the receptive field after stacking 4 conv layers with $K=3$, $S=1$, followed by 1 MaxPool (2×2, stride 2), followed by 2 more conv layers with $K=3$, $S=1$?

After 4 conv: RF = 1 + 4×2 = 9, with adjacent outputs 1 input pixel apart. The 2×2 MaxPool (stride 2) adds (2−1)×1 = 1, giving RF = 10, and doubles the spacing between adjacent outputs to 2 input pixels. Each later conv therefore adds (3−1)×2 = 4, so after 2 more conv layers: RF = 10 + 2×4 = 18.
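When strides are mixed in, the receptive field can be tracked with the standard recurrence: each layer adds (kernel − 1) × jump, where the jump is the spacing between adjacent outputs measured in input pixels and is multiplied by each layer's stride. The sketch below applies it to the configuration in Q2; the function name `rf_with_strides` and the tuple encoding of layers are assumptions for illustration.

```python
def rf_with_strides(layers):
    """Receptive field of a stack of layers given as (kernel, stride) pairs, in order."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # this layer widens the RF by (K-1) input-space steps
        jump *= stride              # strided layers spread later outputs further apart
    return rf


conv = (3, 1)   # 3x3 conv, stride 1
pool = (2, 2)   # 2x2 max pool, stride 2

q2 = [conv] * 4 + [pool] + [conv] * 2
print("4 conv + pool + 2 conv:", rf_with_strides(q2))
print("5 conv, no pooling:    ", rf_with_strides([conv] * 5))
```

With all strides equal to 1 the recurrence reduces to the formula $1 + L(K-1)$ used earlier.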

Application

From LeNet to AlexNet

The 14-year journey that transformed handwritten digit recognition into an ImageNet-conquering deep architecture — and rewrote the rules of computer vision.

After this section you will be able to

Trace the architectural evolution from LeNet-5 (1998) to AlexNet (2012) and identify the structural differences
Explain AlexNet's five key innovations: ReLU activations, Dropout, GPU training, data augmentation, and LRN
Implement a compact AlexNet-inspired CNN classifier in PyTorch for image classification

"In September 2012, a Toronto team's entry in the ImageNet Large Scale Visual Recognition Challenge achieved 15.3% top-5 error, far ahead of the runner-up's 26.2%. The runner-up used hand-crafted features. Every subsequent winner used deep CNNs. The field changed overnight."

🚀

Why did 2012 matter? AlexNet proved the data × depth × compute trifecta. ImageNet (1.2M images, 1000 classes) provided the data scale that earlier datasets couldn't. GPUs (two GTX 580s) provided the compute. Deep CNNs provided the capacity. No single piece was sufficient — all three together unlocked a qualitative leap.

✈️

Analogy Bridge

LeNet-5 is the Wright Brothers' Flyer: it flew, it proved the concept, but it could only carry one person a short distance. AlexNet is the first jet aircraft: same basic principles, but scaled up with better materials (ReLU), crash-proofing (Dropout), and industrial fuel (GPU compute + ImageNet). Both are revolutionary in their era.

1998

LeNet-5 (LeCun)

→

2012

AlexNet (Krizhevsky)

→

2014

VGG (Simonyan)

→

2015+

ResNet (He) — Week 7

1998

LeNet-5

60K parameters, MNIST, Tanh activations

60M

AlexNet params

1000× more than LeNet; enabled by GPU training

10.9 pts

Error drop

Top-5 lead over the 2012 runner-up (26.2% → 15.3%)

5

Key innovations

ReLU · Dropout · GPU · Augmentation · LRN

Problem

LeNet-5 Architecture (1998)

LeCun's landmark CNN for digit recognition on MNIST/USPS:

Input: 32×32 grayscale
C1: Conv 5×5, 6 filters → 28×28×6
S2: Avg Pool 2×2 → 14×14×6
C3: Conv 5×5, 16 filters → 10×10×16
S4: Avg Pool 2×2 → 5×5×16
F5: FC 120, F6: FC 84, Out: 10 classes

Activation: Tanh/Sigmoid. No ReLU, no Dropout. Works for 32×32 images; fails to scale to ImageNet (224×224, 1000 classes).

📐 Worked Example — LeNet parameter count

C1: $5\times5\times1\times6 + 6 = 156$ params

C3: $5\times5\times6\times16 + 16 = 2,\!416$ params

FC layers: $400\times120 + 120 + 120\times84 + 84 + 84\times10 + 10 = 59,\!134$

Total = 156 + 2,416 + 59,134 = 61,706 ≈ 60,000 parameters: fit in 1990s RAM; saturates on complex datasets
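The layer list and the parameter arithmetic can be traced together. This sketch follows the simplified accounting used in this section (fully connected C3, parameter-free pooling), which is an assumption of the example and not a faithful reproduction of the 1998 implementation. The helper name `conv_out` and the `stages` table are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    return (size + 2 * padding - kernel) // stride + 1


size, channels = 32, 1          # 32x32 grayscale input
params = {}
trace = []

# (name, kind, kernel size, output channels)
stages = [
    ("C1", "conv", 5, 6),
    ("S2", "pool", 2, 6),
    ("C3", "conv", 5, 16),
    ("S4", "pool", 2, 16),
]
for name, kind, k, c_out in stages:
    if kind == "conv":
        params[name] = k * k * channels * c_out + c_out
        size = conv_out(size, k)
    else:
        # 2x2 average pool with stride 2, counted as parameter-free here.
        size = conv_out(size, k, stride=2)
    channels = c_out
    trace.append((name, size, channels))

features = size * size * channels            # flattened input to F5: 5*5*16 = 400
for name, n_out in [("F5", 120), ("F6", 84), ("Out", 10)]:
    params[name] = features * n_out + n_out
    features = n_out

for name, s, c in trace:
    print(f"{name}: {s}x{s}x{c}")
for name, p in params.items():
    print(f"{name}: {p:,} params")
total = sum(params.values())
print(f"total: {total:,}")
```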

AlexNet's 5 Key Innovations (2012)

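ReLU Activation: $f(x) = \max(0,x)$ instead of Tanh/Sigmoid. In the paper's CIFAR-10 comparison, a ReLU network reached 25% training error about 6× faster than its Tanh counterpart; ReLU also avoids saturation and the resulting vanishing gradients for $x > 0$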
Dropout: Randomly zero 50% of FC neurons during training → strong regularization for 60M params
GPU Training: Two GTX 580 (3GB each) run in parallel — enabled 60M params in reasonable time
Data Augmentation: Random crops, horizontal flips, color jitter → 2048× effective dataset size
LRN (Local Response Normalization): Lateral inhibition between nearby feature maps (later replaced by BatchNorm)

📐 Worked Example — ReLU vs Sigmoid gradient

Sigmoid gradient: $\frac{d\sigma}{dx} = \sigma(x)(1-\sigma(x))$. At $x=5$: $\sigma(5)\approx0.993$, gradient $\approx 0.007$ — nearly zero, gradient vanishes.

ReLU gradient: $\frac{d}{dx}\max(0,x) = \begin{cases}1 & x > 0 \\ 0 & x \le 0\end{cases}$

At $x=5$: gradient = 1. At $x=-2$: gradient = 0 (no update flows for this input, but positive inputs never saturate)

ReLU's constant gradient for x>0 keeps updates flowing through 8+ layers — Sigmoid can't

Quick Check

What is the gradient of ReLU at $x = -0.01$? What does this mean for a "dead ReLU" neuron?

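Gradient = 0, so this input produces no weight update through the neuron. If the pre-activation is negative for every training input, the neuron never fires and never receives a gradient: it is "dead." If many neurons die, the network loses capacity. This motivates Leaky ReLU, which uses a small slope (gradient = 0.01) for x<0.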
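The gradients discussed above can be evaluated directly. This is a small sketch using only the standard library; the function names are illustrative, and the Leaky ReLU slope of 0.01 is the value quoted in the quick check.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of max(0, x); the value at exactly x = 0 is set to 0 by convention.
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


for x in (5.0, -0.01, -2.0):
    print(f"x={x:+.2f}  sigmoid'={sigmoid_grad(x):.4f}  "
          f"relu'={relu_grad(x):.2f}  leaky'={leaky_relu_grad(x):.2f}")
```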

AlexNet architecture: 5 convolutional layers (purple) followed by 3 fully-connected layers (teal). Two GPUs ran in parallel, each handling half the feature maps in early layers.

⚠️

Common Mistake

Using Sigmoid/Tanh activations in deep CNNs. These functions saturate: once |x| exceeds ≈3, the gradient approaches 0. Backpropagating through 8 such layers multiplies many near-zero values → effectively no gradient reaches early layers. Always use ReLU (or Leaky ReLU/GELU) in deep networks unless you have a specific reason not to.
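To get a feel for the multiplication effect, the sketch below multiplies only the activation derivatives along an 8-layer path. It deliberately ignores the weight matrices, which also scale the gradient in a real network, so treat the numbers as an illustration of the trend and not as a prediction of actual gradient magnitudes.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


depth = 8

best_case = sigmoid_grad(0.0) ** depth   # sigmoid' peaks at 0.25, reached at x = 0
saturated = sigmoid_grad(3.0) ** depth   # every unit sitting at |x| = 3
relu_path = 1.0 ** depth                 # ReLU units that are active (x > 0)

print(f"sigmoid, best case (x=0): {best_case:.2e}")
print(f"sigmoid, saturated (x=3): {saturated:.2e}")
print(f"ReLU, active path:        {relu_path:.2e}")
```

Even in the best case the sigmoid path shrinks the signal by a factor of 0.25 per layer, while an active ReLU path passes it through unchanged.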

Solution

Pause & Predict

If you train AlexNet on ImageNet with Sigmoid instead of ReLU, and start from the same random initialization, what will most likely happen after 10 epochs?

Think about gradient flow through 8 layers of Sigmoid — what happens to the first-layer weight updates?

Try It: Architecture Comparison

Compare LeNet-5 vs a mini-AlexNet in terms of parameter count, depth, and activation function. Toggle between architectures to see how each design decision impacts model capacity.


Implementation

Python · PyTorch — AlexNet-Inspired CNN Classifier

import torch
import torch.nn as nn

class MiniAlexNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3→32 filters, large kernel (AlexNet: 96, 11×11)
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: 32→64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3: 64→128 (AlexNet: 384/256 channels)
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                 # AlexNet innovation!
            nn.Linear(128 * 4 * 4, 512), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)
        return self.classifier(x)

model = MiniAlexNet(num_classes=10)
params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {params:,}")
print(f"Output: {model(torch.randn(1, 3, 32, 32)).shape}")

stdout

Parameters: 1,149,002
Output: torch.Size([1, 10])
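The parameter total for the `MiniAlexNet` defined above can be cross-checked by hand with the counting formulas from earlier in the lesson. The helpers `conv_params` and `linear_params` are assumptions written for this check; ReLU, pooling, and Dropout layers contribute no parameters.

```python
def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out


def linear_params(n_in, n_out):
    return n_in * n_out + n_out


breakdown = {
    "conv1 (3->32, 5x5)":   conv_params(5, 3, 32),
    "conv2 (32->64, 3x3)":  conv_params(3, 32, 64),
    "conv3 (64->128, 3x3)": conv_params(3, 64, 128),
    "fc1 (128*4*4->512)":   linear_params(128 * 4 * 4, 512),
    "fc2 (512->10)":        linear_params(512, 10),
}
for name, count in breakdown.items():
    print(f"{name:<22} {count:>10,}")
total = sum(breakdown.values())
print(f"{'total':<22} {total:>10,}")
```

The first fully-connected layer holds about 91% of the weights, mirroring the way AlexNet's FC layers dominate its 60M parameters.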

Key Takeaway

AlexNet's 2012 breakthrough was not a single idea but a combination: ReLU activations enabling depth, Dropout taming overfitting, GPU compute enabling scale, and ImageNet providing the training signal — proving that depth × data × compute unlock qualitatively new capabilities in visual recognition.

🏭

Real-World Application

Manufacturing Quality Control: PCB Defect Detection

Modern PCB inspection systems use AlexNet-lineage CNNs to detect solder defects at sub-millimeter precision on production lines running at 30,000 units/hour. The system uses AlexNet's core innovations — ReLU speed, Dropout robustness — applied to high-resolution industrial cameras. A single GPU inference pass flags 14 defect categories in <10ms, replacing what previously required trained human inspectors examining every board under magnification.

Checkpoint From LeNet to AlexNet

Q1 What specific problem does Dropout solve, and why is it especially important for a 60M-parameter model like AlexNet?

Dropout prevents co-adaptation: neurons learning to rely on specific neighbors rather than developing independent features. With 60M parameters and only 1.2M training images, AlexNet would massively overfit without regularization. By randomly zeroing 50% of FC neurons on each training forward pass, Dropout forces each neuron to learn useful features on its own, acting like training an ensemble of 2^N thinned networks.
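The mechanics of Dropout can be sketched without a framework. The version below uses the "inverted" scaling convention that `nn.Dropout` follows (surviving activations are divided by the keep probability during training, so nothing changes at evaluation time); the original AlexNet instead rescaled outputs at test time. The function name `dropout` and its `rng` parameter are assumptions for this example.

```python
import random


def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p during training; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 1000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)

print("zeroed during training:", train_out.count(0.0), "of", len(train_out))
print("mean during training:  ", sum(train_out) / len(train_out))
print("mean during eval:      ", sum(eval_out) / len(eval_out))
```

Because a different random subset is dropped on every pass, no neuron can rely on a specific neighbor being present.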

Q2 Why does AlexNet use an 11×11 kernel in the first layer rather than 3×3 like modern architectures?

In 2012, AlexNet needed to capture large-scale features from 224×224 images efficiently. With stride 4 and no padding, an 11×11 kernel shrinks the spatial dimension to 55 in one step while covering a large RF (the arithmetic gives exactly 55 for a 227×227 input: ⌊(227 − 11)/4⌋ + 1 = 55). Modern architectures favor stacked 3×3 layers (same RF, fewer params), and BatchNorm and skip connections, which didn't exist in 2012, make such deep stacks easier to train.

Practice

Exercises

Eight exercises spanning output dimension arithmetic, parameter counting, receptive field reasoning, PyTorch implementation, and architectural design — covering convolutional layers, pooling, and the LeNet-to-AlexNet evolution.

Theory · Convolutional Layer Easy

Output Dimension Arithmetic

For each configuration below, compute the output feature map height $H_{\text{out}}$ using $H_{\text{out}} = \lfloor(H+2P-K)/S\rfloor+1$.

(a) $H=28$, $K=5$, $P=0$, $S=1$
(b) $H=64$, $K=3$, $P=1$, $S=1$
(c) $H=128$, $K=3$, $P=0$, $S=2$
(d) $H=224$, $K=7$, $P=3$, $S=2$

Substitute into the formula directly:

(a) $\lfloor(28+0-5)/1\rfloor+1 = 24$. (b) $\lfloor(64+2-3)/1\rfloor+1 = 64$ (same padding). (c) $\lfloor(128+0-3)/2\rfloor+1 = 63$. (d) $\lfloor(224+6-7)/2\rfloor+1 = 112$.

Note (d) is the exact first layer of ResNet-50!

Code · Convolutional Layer Medium

Feature Map Visualization

Implement a single convolutional layer in PyTorch and visualize its feature maps on a real image.

Load any grayscale image using PIL, convert to a float tensor of shape (1,1,H,W)
Create nn.Conv2d(1, 8, kernel_size=3, padding=1) (random init)
Forward pass: obtain 8 feature maps
Plot all 8 feature maps side-by-side using matplotlib subplots
Verify: all 8 output maps have the same spatial size as the input

Use torchvision.transforms.ToTensor() for the image → tensor conversion. The output of conv(x) has shape (1, 8, H, W); use out[0] (shape 8, H, W) and iterate over the first dimension for plotting. Use cmap='gray' in plt.imshow.

Theory · Pooling & Receptive Fields Medium

Receptive Field & Pooling Calculations

Answer the following questions about a CNN's field of view.

(1) Compute the receptive field after 5 stacked Conv(K=3, S=1, P=1) layers
(2) Now add 2×2 MaxPool (stride 2) after layers 2 and 4. What is the effective RF at the final layer? (Each pool doubles the RF scaling factor for later layers)
(3) A 4×4 feature map is processed with MaxPool(2×2, stride=2). Write out the output values given input: [[2,8,4,1],[6,3,7,2],[0,5,9,3],[4,1,6,8]]
(4) Why would using AvgPool instead of MaxPool in question 3 give a different result semantically?

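(1) RF = 1 + 5×2 = 11. (2) Track the spacing between adjacent outputs (the "jump"): layers 1–2 add 2 each (RF = 5); pool 1 adds 1 (RF = 6) and doubles the jump to 2; layers 3–4 add 4 each (RF = 14); pool 2 adds 2 (RF = 16) and doubles the jump to 4; layer 5 adds 8. Effective RF = 24. (3) MaxPool 2×2: top-left={2,8,6,3}→8; top-right={4,1,7,2}→7; bot-left={0,5,4,1}→5; bot-right={9,3,6,8}→9. Output: [[8,7],[5,9]]. (4) AvgPool would give [[4.75,3.5],[2.5,6.5]]: it reports the mean response of each region instead of whether a strong activation is present anywhere in it.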

Code · Pooling & Receptive Fields Medium

Feature Map Size Tracking

Build a CNN block and track how feature map sizes change at each layer.

Create a CNN with three conv layers: Conv(3→16, K=3, P=1) → ReLU → MaxPool(2) → Conv(16→32, K=3, P=1) → ReLU → MaxPool(2) → Conv(32→64, K=3, P=1) → ReLU → GlobalAvgPool
Pass a batch of shape (4, 3, 64, 64) through the network
Print the shape after every layer using a hook or by breaking into steps
Verify: what is the total parameter count? (Use sum(p.numel() for p in model.parameters()))

After MaxPool(2) the spatial dims halve each time: 64→32→16. After GlobalAvgPool(1,1): shape is (4, 64, 1, 1). Use nn.AdaptiveAvgPool2d((1,1)). Expected total params: Conv1: 3×3×3×16+16=448; Conv2: 3×3×16×32+32=4,640; Conv3: 3×3×32×64+64=18,496. Total = 23,584.

Theory · CNN Architecture Evolution Easy

AlexNet Innovation Analysis

For each AlexNet innovation, explain the problem it solved and why that problem was critical for training on ImageNet.

(1) ReLU activations replaced Tanh: what gradient problem does this fix?
(2) Dropout(0.5) on FC layers: why is this necessary for 60M params on 1.2M images?
(3) Data augmentation (crops + flips): how does this affect effective dataset size?
(4) Compare parameter counts: AlexNet's FC6 layer (9216→4096) against LeNet's largest FC layer (400→120). Which is larger, and by how much?

(1) Tanh/Sigmoid saturate → gradient ≈ 0 for large |x|; backprop fails in deep nets. ReLU gradient=1 for x>0, so gradients flow unchanged. (2) Overfitting: 60M params / 1.2M samples ≈ 50 params per sample; without regularization, the model memorizes training data. (3) Crops: 32×32 = 1,024 positions of a 224×224 window inside each 256×256 image, × 2 for horizontal flips = 2,048× more (highly correlated) training views. (4) FC6: 9216×4096+4096 = 37,752,832. LeNet FC: 400×120+120 = 48,120. AlexNet FC6 alone is ≈785× larger than LeNet's biggest FC.
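The arithmetic behind parts (2) to (4) can be verified in a few lines. The variable names are illustrative, and the crop count follows the AlexNet paper's own accounting of 32×32 offsets for a 224×224 window inside a 256×256 image.

```python
# (2) Parameters per training image.
params_per_image = 60_000_000 / 1_200_000

# (3) Augmentation factor: 32 x 32 crop offsets, times 2 for horizontal flips.
crop_positions = (256 - 224) ** 2
augmentation_factor = crop_positions * 2

# (4) Largest FC layers, weights plus biases.
alexnet_fc6 = 9216 * 4096 + 4096
lenet_fc = 400 * 120 + 120
ratio = alexnet_fc6 / lenet_fc

print(f"params per image:    {params_per_image:.0f}")
print(f"augmentation factor: {augmentation_factor}x")
print(f"AlexNet FC6:         {alexnet_fc6:,}")
print(f"LeNet FC (400->120): {lenet_fc:,}")
print(f"ratio:               {ratio:.0f}x")
```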

Code · CNN Architecture Evolution Hard

Train a Mini-CNN on CIFAR-10

Implement and train a compact AlexNet-inspired model on CIFAR-10.

Load CIFAR-10 using torchvision.datasets.CIFAR10 with augmentation (RandomCrop + HorizontalFlip)
Build the MiniAlexNet from the slides (3 conv blocks + 2 FC layers with Dropout)
Train for 20 epochs using Adam (lr=1e-3), CrossEntropyLoss, batch size 128
Plot training loss and validation accuracy curves
Report final test accuracy. Compare training with Dropout ON vs OFF — what do you observe?

CIFAR-10 images are 32×32×3 with 10 classes. Use transforms.Compose([transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip(), transforms.ToTensor(), transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))]). Expected test accuracy: ~75-80% with Dropout, ~70% without (signs of overfitting). Set model.train() during training and model.eval() for validation to correctly enable/disable Dropout.

Synthesis · Theory: CNN Design Challenge Hard

Design a CNN for Retinal Disease Detection

You are designing a CNN to classify fundus (retinal) photographs (512×512 pixels, 3 channels) into 5 disease categories. Design the architecture and justify every decision.

How many convolutional layers do you need to achieve a receptive field of at least 64×64 pixels? (Assume K=3, S=1, 2×2 MaxPool after every 2 conv layers)
Propose the number of filters per layer (start at 32, double after each pool)
Calculate the output feature map size after your final conv+pool block, then the total parameters in the feature extractor
Justify your choice of: (a) pooling vs stride-2 conv; (b) Dropout rate; (c) whether to use BatchNorm

With K=3, S=1, each conv layer adds 2×(current jump) to the RF, where the jump (spacing between adjacent outputs, in input pixels) starts at 1 and doubles after every 2×2 stride-2 pool; each pool itself adds 1×(jump before it). A typical design: Conv1,Conv2→Pool (RF 5 after the convs, 6 after the pool; output 256×256); Conv3,Conv4→Pool (RF 14, then 16; output 128×128); Conv5,Conv6→Pool (RF 32, then 36; output 64×64); Conv7,Conv8→Pool (RF 68, then 76; output 32×32). The RF first reaches 64 at Conv8, so 8 conv layers are needed. Filters: 32→32→64→64→128→128→256→256. Using stride-2 conv instead of MaxPool is increasingly preferred as it's learnable and preserves more information. BatchNorm is essential for training stability with deep medical imaging models.
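One way to check a candidate design is to trace receptive field, feature map size, and parameter count block by block. The sketch below assumes the layout described in the exercise (two 3×3 stride-1 padded convs followed by a 2×2 stride-2 max pool per block, filters starting at 32 and doubling per block, biases included, no BatchNorm parameters). The function name `plan_feature_extractor` and its arguments are hypothetical.

```python
def plan_feature_extractor(input_size=512, in_channels=3, blocks=4, base_filters=32):
    """Return per-block (block, filters, rf_after_convs, rf_after_pool, size, cumulative_params)."""
    rf, jump, size, c_in = 1, 1, input_size, in_channels
    params = 0
    rows = []
    for b in range(blocks):
        c_out = base_filters * 2 ** b
        for _ in range(2):                     # two 3x3, stride-1, padding-1 convs
            rf += (3 - 1) * jump
            params += 3 * 3 * c_in * c_out + c_out
            c_in = c_out
        rf_after_convs = rf
        rf += (2 - 1) * jump                   # 2x2 max pool, stride 2
        jump *= 2
        size //= 2
        rows.append((b + 1, c_out, rf_after_convs, rf, size, params))
    return rows


rows = plan_feature_extractor()
print("block  filters  RF(convs)  RF(pool)  output   params so far")
for block, filters, rf_c, rf_p, size, params in rows:
    print(f"{block:>5}  {filters:>7}  {rf_c:>9}  {rf_p:>8}  "
          f"{size:>3}x{size:<3}  {params:>12,}")
```

Under these assumptions the receptive field first exceeds 64 in the fourth block, and the feature extractor has roughly 1.17M parameters.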

Synthesis · Code: CNN Ablation Study Hard

Ablation Study: What Makes AlexNet Work?

Systematically remove each AlexNet innovation and measure the impact on CIFAR-10 validation accuracy after 10 epochs.

Implement four variants of MiniAlexNet: (a) Baseline (all innovations), (b) No Dropout, (c) Sigmoid instead of ReLU, (d) No data augmentation
Train all four with identical hyperparameters for 10 epochs each
Record train loss and val accuracy every epoch
Create a single plot with 4 validation accuracy curves (one color per variant)
Write 3 sentences explaining which innovation had the largest impact and why

Use the same random seed (torch.manual_seed(42)) and same weight initialization for fair comparison. For Sigmoid variant, replace nn.ReLU() with nn.Sigmoid() throughout. Expected ordering (highest val acc): Baseline > No Augmentation > No Dropout > Sigmoid. The Sigmoid variant may fail to improve beyond ~50% due to vanishing gradients — this is the most dramatic effect. No Dropout will show higher train accuracy but lower val accuracy (classic overfitting signature).

Further Reading

Prince — Understanding Deep Learning, Ch. 10

Krizhevsky et al. — ImageNet Classification with Deep CNNs (2012)

CNN Explainer — Polo Club of Data Science

Stanford CS231n — Convolutional Neural Networks

