Computer Vision Foundations — Week 7: Modern CNNs & Transfer Learning

Topic 1 · Intuition

Skip Connections & Residual Networks

Why adding more layers can hurt — and how a single identity shortcut unlocked the 150-layer barrier.

After this section you will be able to

Explain the degradation problem and why deeper plain networks converge worse than shallower ones
Derive the residual output $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ and describe how the identity path stabilises gradient flow
Implement a PyTorch residual block with a projection shortcut for dimension changes

"By 2014 the consensus was clear: deeper is better. Then teams tried 30-, 50-, even 100-layer plain networks — and training accuracy got worse. Something fundamental was broken."

🔗

Why skip connections matter: In a deep stack of weight layers, each backpropagation step multiplies the gradient by a factor that is often smaller than one. After 20+ layers the signal reaching the early weights can be close to zero, and training stalls. Skip connections add a direct gradient highway around the weight layers, so early parameters keep receiving a usable update signal even in very deep networks.

🛣️

Analogy Bridge

Think of a road network. The main path winds through many toll booths — each one slows traffic (degrades gradient). The skip connection is an express bypass: the gradient travels directly without attenuation. The final merge (+) combines both flows.

Residual block: the main convolution path learns F(x), then the identity skip adds x directly. The gradient at the + node is ∂F/∂x + I, so the identity term keeps a direct path open even when ∂F/∂x is tiny.

152 layers: ResNet-152 anchored the winning ImageNet (ILSVRC) 2015 entry; that depth was only trainable with skip connections
3.57% top-5 error: the ResNet ensemble's result on the ImageNet test set, below the commonly cited 5.1% human-level estimate
10⁻¹⁰ min gradient: illustrative order of magnitude for the gradient at layer 1 of a 50-layer plain network with saturating activations; effectively zero
2 lines of code: out = F(x) becomes out = F(x) + x, the full innovation

Problem — The Degradation Barrier

Vanishing Gradients in Deep Plain Networks

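In backpropagation the gradient reaching layer $l$ is a product of the Jacobians of every layer between $l$ and the loss. Write $\mathbf{a}_k = \sigma(\mathbf{z}_k)$ with $\mathbf{z}_k = \mathbf{W}_k\mathbf{a}_{k-1}$. For sigmoid activations each derivative satisfies $\sigma'(z_k) \leq 0.25$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{a}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}_L}\prod_{k=l+1}^{L} \operatorname{diag}\big(\sigma'(\mathbf{z}_k)\big)\,\mathbf{W}_k$$

The factors are ordered from $k=L$ down to $k=l+1$, and the weight gradient $\partial\mathcal{L}/\partial\mathbf{W}_l$ carries the same product. When most factors are smaller than one, the product shrinks geometrically with depth: early weights receive near-zero updates and training stalls no matter how long you wait.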

📝 Worked Example — Gradient Magnitude vs. Depth

Assume each layer attenuates the gradient by $\delta = 0.85$ (an illustrative value for saturating tanh/sigmoid layers).

Gradient at layer 1 after $L$ layers: $g = g_L \cdot \delta^{L-1}$

$L=5$: $0.85^4 = 0.522$ (usable)
$L=20$: $0.85^{19} \approx 0.046$ (very small)
$L=50$: $0.85^{49} \approx 3.5\times10^{-4}$ (nearly zero)

With learning rate $\eta = 0.01$, the weight update at depth 50 is scaled by $\eta\,\delta^{49} \approx 3.5\times10^{-6}$ relative to $g_L$: practically no learning signal.

Gradient at layer 1, depth 50: ≈ 3.5 × 10⁻⁴ × g_L → training stalls
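
The attenuation model $g = g_L \cdot \delta^{L-1}$ is easy to check numerically. The snippet below is a minimal sketch; the helper name `gradient_fraction` is introduced here for illustration and is not part of the course code.

```python
def gradient_fraction(delta: float, depth: int) -> float:
    """Fraction of the output gradient g_L that reaches layer 1
    when each of the (depth - 1) backward steps scales it by delta."""
    return delta ** (depth - 1)


# Evaluate the toy model at the three depths used in the worked example.
for depth in (5, 20, 50):
    print(f"L={depth:>2}: {gradient_fraction(0.85, depth):.2e}")
```

Running it with other values of `delta` shows how quickly the fraction changes: the decay is exponential in depth, so small differences in the per-layer factor matter a lot.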

Quick Check

If $\delta = 0.9$, at what depth does the gradient drop below 1% of its initial value?

0.9^(L-1) < 0.01 → (L-1) > ln(0.01)/ln(0.9) ≈ 43.7 → depth L ≈ 45 layers
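
The same question can be answered by stepping through the layers instead of taking logarithms. This is a small sketch under the same toy attenuation model; `min_depth_below` is a hypothetical helper name.

```python
def min_depth_below(delta: float, threshold: float) -> int:
    """Smallest depth L such that delta**(L - 1) < threshold."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    depth, fraction = 1, 1.0          # at depth 1 the gradient is unattenuated
    while fraction >= threshold:
        fraction *= delta             # one more backward step
        depth += 1
    return depth


print(min_depth_below(0.9, 0.01))
```

The loop avoids off-by-one mistakes when rounding the logarithm and makes the "L - 1 backward steps" convention explicit.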

The Residual Mapping Fix

ResNet re-parameterises the learning target. Instead of learning $H(\mathbf{x})$ directly, the layers learn the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$:

$$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$$

The Jacobian of the block is $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$. The identity $\mathbf{I}$ is always present, so the gradient has a direct path through the block even when $\frac{\partial F}{\partial \mathbf{x}}$ is close to zero.
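
The Jacobian identity can be verified with autograd. The sketch below assumes a tiny two-layer MLP as a stand-in for the residual branch $F$ (an illustrative choice, not the conv path used later) and compares the two Jacobians directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical residual branch F: a small MLP standing in for the conv layers.
F = nn.Sequential(nn.Linear(3, 3), nn.Tanh(), nn.Linear(3, 3))


def H(x):
    """Residual mapping H(x) = F(x) + x."""
    return F(x) + x


x = torch.tensor([2.0, 1.0, 3.0])
J_F = torch.autograd.functional.jacobian(F, x)   # dF/dx, shape (3, 3)
J_H = torch.autograd.functional.jacobian(H, x)   # dH/dx, shape (3, 3)

# The block Jacobian should equal the branch Jacobian plus the identity.
jacobians_match = torch.allclose(J_H, J_F + torch.eye(3))
print(jacobians_match)
```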

📝 Worked Example — Residual Block Forward Pass

Input: $\mathbf{x} = [2.0,\ 1.0,\ 3.0]^\top$. Two simplified conv layers output $F(\mathbf{x}) = [0.4,\ 0.6,\ 0.2]^\top$.

Add identity shortcut:

$$H(\mathbf{x}) = \begin{bmatrix}0.4\\0.6\\0.2\end{bmatrix} + \begin{bmatrix}2.0\\1.0\\3.0\end{bmatrix} = \begin{bmatrix}2.4\\1.6\\3.2\end{bmatrix}$$

Gradient guarantee: $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$. Even if $\frac{\partial F}{\partial \mathbf{x}} \approx \mathbf{0}$, the Jacobian stays close to $\mathbf{I}$, so the gradient passes through the block almost unchanged and training can proceed.

H(x) = [2.4, 1.6, 3.2]ᵀ — output stays close to input; only the correction is learned

Quick Check

If $F(\mathbf{x}) = [-2.0,\ -1.0,\ -3.0]^\top$, what is $H(\mathbf{x})$? What has the block learned to do?

H(x) = [0, 0, 0]ᵀ. The block learned to suppress the input entirely. F learns only the "correction" (residual) — if the optimal output is zero, F just needs to cancel x.

⚠️

Common Mistake — Dimension Mismatch on the Skip Path

The skip addition requires $\mathbf{x}$ and $F(\mathbf{x})$ to have identical shapes. When stride > 1 or the number of channels changes, add a 1×1 projection convolution (+ BatchNorm) on the skip path to match. Forgetting this causes a runtime shape error at the + node.
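
The mismatch is easy to reproduce. The sketch below uses a single stride-2 convolution as a simplified stand-in for the main path (an assumption made for brevity), shows that the plain identity addition fails, and then matches shapes with a 1×1 projection.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Simplified main path: one 3x3 conv that doubles channels and halves resolution.
main_path = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)
fx = main_path(x)                               # shape (1, 128, 16, 16)

try:
    _ = fx + x                                  # identity skip: shapes disagree
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

# 1x1 projection on the skip path matches both channels and spatial size.
projection = nn.Sequential(
    nn.Conv2d(64, 128, 1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
out = fx + projection(x)
print(mismatch_raised, tuple(out.shape))
```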

Solution — Depth vs. Error Visualiser

Pause & Predict

The widget below plots training error vs. depth for plain and residual networks. Before dragging: at what depth does the plain network begin to degrade? Where does the residual network plateau?

Hint: gradients multiply ~0.85 per layer — how many steps until the product falls below 0.01?

Try It: Network Depth vs. Training Error

Drag the slider to change depth. Observe how plain networks degrade while residual networks keep improving.

[Interactive plot: training error vs. network depth for a plain network (no skip) and a residual network (skip connections); depth slider set to 20 layers, with a live calculation of the gradient at layer 1 at δ = 0.85 per layer]

Implementation

Python · PyTorch — ResidualBlock with projection shortcut

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3,
                               stride=stride, padding=1,
                               bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.relu  = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3,
                               padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        # 1×1 projection when dims change
        needs_proj = (stride != 1) or (in_ch != out_ch)
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_ch)
        ) if needs_proj else nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # H(x) = F(x) + x
        return self.relu(out)

# Stride=2: spatial dims halve, projection shortcut fires
block = ResidualBlock(64, 128, stride=2)
x = torch.randn(4, 64, 32, 32)
print(block(x).shape)  # (4, 128, 16, 16)

Output

torch.Size([4, 128, 16, 16])

Key Takeaway

Adding out += x, a one-line change, provides a gradient highway through every block, making networks of 50, 100, or even 1 000 layers trainable where plain stacks of the same depth degrade.

🏥

Real-World Application

Medical Image Segmentation — Residual Encoders in nnU-Net

nnU-Net, a widely used self-configuring medical segmentation framework, offers residual-encoder U-Net variants in which the plain convolutional encoder stages are built from residual blocks. Skip connections keep gradients flowing when training on small datasets (100–500 annotated 3D volumes), supporting reliable convergence of 20 M+ parameter models on CT and MRI scans.

Checkpoint · Skip Connections & ResNet

Q1 A residual block receives input $\mathbf{x}$ of shape (64, 32, 32) and uses stride=2 to produce output of shape (128, 16, 16). What shape must the projection shortcut output, and which layer achieves this?

Shape: (128, 16, 16). The projection shortcut uses a 1×1 Conv2d(64, 128, stride=2) + BatchNorm2d(128). This matches the main path output so element-wise addition is valid.

Q2 Explain why $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ prevents vanishing gradients even in a 100-layer network.

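The identity matrix I is always present, regardless of how small ∂F/∂x becomes due to saturated activations. During backpropagation the block Jacobians (I + ∂F/∂x) are multiplied together, and that product always contains a pure identity term: a direct path from the loss to every block. The gradient therefore does not have to pass through every weight layer, which makes it far less likely to collapse to zero across the full depth.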
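
A scalar toy model makes the contrast concrete. The sketch below assumes each block's branch has a small local derivative that alternates in sign (hypothetical values chosen for illustration) and compares the end-to-end gain of 100 plain blocks against 100 residual blocks.

```python
import math


def chain_gain(local_derivs, residual: bool) -> float:
    """Product of per-block derivatives for a scalar toy network.
    Plain block: dH/dx = f'.  Residual block: dH/dx = 1 + f'."""
    return math.prod((1.0 + d) if residual else d for d in local_derivs)


# Hypothetical branch derivatives: small, alternating in sign, 100 blocks deep.
derivs = [0.05 if i % 2 == 0 else -0.05 for i in range(100)]

plain = chain_gain(derivs, residual=False)
resid = chain_gain(derivs, residual=True)
print(f"plain: {plain:.1e}   residual: {resid:.3f}")
```

The identity term keeps the residual product near one here. It is not an unconditional guarantee: if every branch derivative were consistently negative, the residual product would still shrink, only much more slowly than the plain one.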

Topic 2 · Mechanics

Depthwise Separable Convolutions & MobileNet

How factorising the convolution operation achieves an 8–9× reduction in computation — enabling real-time vision on a smartphone.

After this section you will be able to

Compare the parameter count of a standard conv layer vs. its depthwise separable equivalent using the formula
Explain the two-stage structure (depthwise then pointwise) and what each stage learns
Implement a depthwise separable convolution block in PyTorch using groups=in_channels

"ResNet-50 has 25 million parameters and needs 4 GFLOPs for a single image. That's fine in a data-centre. On a 2015 smartphone with 1 W of compute budget, it would take 2 seconds per frame. MobileNet runs at 30 fps."

📱

Why efficient convolutions matter: Edge devices — phones, drones, medical monitors — have strict power and memory budgets. Depthwise separable convolution achieves similar representational power to a standard conv at a fraction of the cost by separating spatial filtering (per channel) from channel mixing (1×1 conv).

👨‍🍳

Analogy Bridge

Standard convolution is like one chef doing everything: mixing all ingredients and cooking simultaneously. Depthwise separable conv is specialised teamwork: one chef handles each ingredient separately (depthwise — spatial), then a coordinator blends all the flavours at the end (pointwise — channel mixing).

Depthwise separable convolution: stage 1 filters each channel independently (spatial), stage 2 mixes channels (1×1). Together they approximate a full conv at ~1/9 the cost.

8–9× ops saved: typical for 3×3 kernels with many output channels; with 64 output channels the ratio is 1/64 + 1/9 ≈ 12.7% of standard ops (about 7.9×)
4.2M params: MobileNetV1, vs. 25 M for ResNet-50 and 138 M for VGG-16
70.6% top-1 acc: MobileNetV1 on ImageNet, competitive with VGG-16 at about 1/30 the parameters
569M MAdds: MobileNetV1 multiply-adds vs. about 4,000 M for ResNet-50, roughly 7× fewer operations per image

Problem — Standard Convolution Is Expensive

Standard Convolution Cost

A standard conv layer with $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and $D_F \times D_F$ feature map has total multiply-adds:

$$\text{Ops}_{\text{std}} = D_K^2 \times M \times N \times D_F^2$$

Parameters: $D_K^2 \times M \times N$. Every output channel looks at every input channel at every spatial position — expensive cross-channel interactions happen at each kernel position.

📝 Worked Example — Standard Conv Parameter Count

Settings: $D_K = 3$, $M = 32$ input channels, $N = 64$ output channels, $D_F = 112$ (input resolution).

Parameters: $D_K^2 \times M \times N = 9 \times 32 \times 64 = \mathbf{18{,}432}$

Operations: $9 \times 32 \times 64 \times 112^2 = 18{,}432 \times 12{,}544 \approx \mathbf{231\,M}$ MAdds

Standard 3×3 conv (32→64, 112×112): 18,432 params, 231 M ops
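
The two formulas translate directly into code. This is a minimal sketch; `standard_conv_cost` is a helper name introduced here, and it assumes no bias terms and a square feature map, as in the text.

```python
def standard_conv_cost(d_k: int, m: int, n: int, d_f: int):
    """Return (parameters, multiply-adds) of a standard conv layer.
    d_k: kernel size, m: input channels, n: output channels,
    d_f: spatial size of the feature map."""
    params = d_k**2 * m * n
    madds = params * d_f**2      # every weight is applied at every position
    return params, madds


params, madds = standard_conv_cost(3, 32, 64, 112)
print(f"{params:,} params, {madds / 1e6:.0f} M multiply-adds")
```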

Quick Check

If you double both $M$ and $N$ (32→64 input, 64→128 output), how many times more parameters does the layer have?

Parameters scale as M×N, so doubling both: (64×128)/(32×64) = 8,192/2,048 = 4× more parameters. Ops also 4× since they scale as M×N×D_F².

Depthwise Separable Conv Cost

Split into two stages. Depthwise ops (1 filter per input channel): $D_K^2 \times M \times D_F^2$. Pointwise ops (1×1 cross-channel): $M \times N \times D_F^2$. Total:

$$\text{Ops}_{\text{dw}} = D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2$$

The ratio to standard conv:

$$\frac{\text{Ops}_{\text{dw}}}{\text{Ops}_{\text{std}}} = \frac{1}{N} + \frac{1}{D_K^2}$$

For $N=64$, $D_K=3$: ratio $= 1/64 + 1/9 \approx 0.127$ — an 87% reduction.
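
The ratio follows by dividing each term of $\text{Ops}_{\text{dw}}$ by $\text{Ops}_{\text{std}}$ and cancelling the common factors; the feature-map size $D_F$ and the input channel count $M$ drop out entirely:

$$\frac{\text{Ops}_{\text{dw}}}{\text{Ops}_{\text{std}}} = \frac{D_K^2 \cdot M \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} + \frac{M \cdot N \cdot D_F^2}{D_K^2 \cdot M \cdot N \cdot D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}$$

The first term is the depthwise stage and the second is the pointwise stage. For large $N$ the pointwise term $1/D_K^2$ dominates, which is why the saving for 3×3 kernels approaches but never exceeds 9×.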

📝 Worked Example — DW-Sep Parameter Count

Depthwise params: $D_K^2 \times M = 9 \times 32 = \mathbf{288}$

Pointwise params: $M \times N = 32 \times 64 = \mathbf{2{,}048}$

Total: $288 + 2{,}048 = \mathbf{2{,}336}$ vs. $18{,}432$ standard

$$\text{Reduction} = \frac{18{,}432}{2{,}336} \approx 7.9\times$$

DW-Sep (32→64, 3×3): 2,336 params — 7.9× fewer than standard conv
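
The depthwise separable counts can be computed the same way and compared against the closed-form ratio. A minimal sketch, with `dw_separable_cost` as a hypothetical helper and no bias terms assumed:

```python
def dw_separable_cost(d_k: int, m: int, n: int, d_f: int):
    """Return (parameters, multiply-adds) of a depthwise separable block."""
    dw_params = d_k**2 * m       # one d_k x d_k filter per input channel
    pw_params = m * n            # 1x1 conv mixing m channels into n
    params = dw_params + pw_params
    return params, params * d_f**2


std_params = 3**2 * 32 * 64                      # standard 3x3 conv, 32 -> 64
params, madds = dw_separable_cost(3, 32, 64, 112)
ratio = params / std_params

print(f"{params:,} params, {madds / 1e6:.1f} M multiply-adds")
print(f"ratio = {ratio:.4f}, formula = {1 / 64 + 1 / 9:.4f}")
```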

Quick Check

Using the ratio formula $1/N + 1/D_K^2$ with $N=64$, $D_K=3$, verify the 87% reduction result numerically.

1/64 + 1/9 = 0.01563 + 0.1111 = 0.1267 → 12.67% of standard ops → 87.3% reduction ✓

💡

Key Insight — What Each Stage Learns

Depthwise (3×3): detects spatial patterns (edges, textures) within each channel independently — like running a dedicated edge-detector per feature map. Pointwise (1×1): combines channels to form richer representations — like a learned colour mixer. Together they approximate full conv expressiveness at ~1/9 the cost.

Solution — Standard vs. Depthwise Separable Cost Explorer

Pause & Predict

Below, adjust the number of output channels $N$. Before dragging: at which value of $N$ does the depthwise separable approach give more than 8× fewer operations? Does the benefit grow or shrink as $N$ increases?

Hint: the ratio formula is $1/N + 1/9$. Plug in $N = 9$.

Try It: Parameter Count Comparison

Compare parameter counts for standard vs. depthwise separable convolution as you change input channels $M$, output channels $N$, and kernel size $D_K$.

[Interactive calculator: standard conv params, depthwise-separable params, and reduction ratio, with sliders for input channels M (default 32), output channels N (default 64), and kernel size D_K (default 3)]

Implementation

Python · PyTorch — Depthwise Separable Convolution Block

import torch
import torch.nn as nn

class DWSepConv(nn.Module):
    """Depthwise-separable conv: 3×3 DW + 1×1 PW + BN + ReLU6."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                            stride=stride, padding=1,
                            groups=in_ch,  # key: groups=in_ch
                            bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        # Pointwise: mix channels with 1×1 conv
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1,
                            bias=False)
        self.bn2  = nn.BatchNorm2d(out_ch)
        self.act  = nn.ReLU6(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.dw(x)))   # spatial
        x = self.act(self.bn2(self.pw(x)))   # channel mix
        return x

# Compare param counts
M, N = 32, 64
std_params = 3**2 * M * N         # 18,432
dw_sep     = 3**2 * M + M * N     # 2,336
print(f"Standard:   {std_params:,} params")
print(f"DW-Sep:     {dw_sep:,} params")
print(f"Reduction:  {std_params/dw_sep:.1f}×")

block = DWSepConv(32, 64)
x = torch.randn(4, 32, 112, 112)
print(block(x).shape)

Output

Standard:   18,432 params
DW-Sep:     2,336 params
Reduction:  7.9×
torch.Size([4, 64, 112, 112])

Key Takeaway

Depthwise separable convolution separates where to detect (depthwise, per-channel) from what to detect (pointwise, cross-channel), cutting operations by 8–9× at a cost of about 1% ImageNet accuracy (70.6% vs. 71.7% for a full-convolution MobileNet, as reported by Howard et al., 2017).

📦

Real-World Application

Real-Time Quality Control on a Raspberry Pi

A Thai packaging factory replaced a ResNet-based defect detector (GPU-required, 18 FPS) with a MobileNetV2 model built on depthwise separable blocks. The result: 30 FPS on a $35 Raspberry Pi 4, 91% defect recall, and zero GPU costs — enabling in-line inspection on every production line.

Checkpoint · Depthwise Separable Convolutions

Q1 A standard 3×3 conv has $M=128$ input channels and $N=256$ output channels. Calculate (a) the parameter count, (b) the depthwise separable parameter count, and (c) the reduction ratio.

(a) Standard: 9 × 128 × 256 = 294,912 params
(b) DW-Sep: 9×128 + 128×256 = 1,152 + 32,768 = 33,920 params
(c) Ratio: 294,912 / 33,920 ≈ 8.7× (formula: 1/N + 1/9 = 1/256+1/9 ≈ 0.115 → 8.7× reduction)

Q2 In PyTorch, what argument to Conv2d implements the depthwise step, and what value should it have for a layer with 64 input channels?

groups=64. Setting groups=in_channels forces each filter to operate on exactly one input channel — that is the definition of depthwise convolution.
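
Inspecting the weight tensors shows what `groups` changes. The sketch below builds a depthwise and a standard 3×3 convolution over 64 channels (sizes chosen to match the question) and compares their weight shapes.

```python
import torch.nn as nn

# Depthwise: groups equals the channel count, so each filter sees one channel.
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)
# Standard: groups defaults to 1, so each filter sees all 64 channels.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)

# Weight layout is (out_channels, in_channels / groups, kH, kW).
print(tuple(depthwise.weight.shape), depthwise.weight.numel())
print(tuple(standard.weight.shape), standard.weight.numel())
```

The second dimension of the depthwise weight is 1, which is the "one filter per input channel" property in tensor form.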

Topic 3 · Application

Transfer Learning & Fine-Tuning Strategies

How to transplant days of multi-GPU ImageNet training into your custom task in an afternoon, and which layers to touch.

After this section you will be able to

Explain the feature hierarchy in a pre-trained CNN and why early layers generalise across tasks
Select the correct fine-tuning strategy (freeze all / partial unfreeze / full fine-tune) based on dataset size and domain similarity
Implement MobileNetV2 fine-tuning in PyTorch — freezing backbone, replacing the classifier head, and training with a discriminative learning rate

"Training ResNet-50 on ImageNet from scratch takes 90 epochs on 8 GPUs — roughly 3 days and thousands of dollars. Transfer learning lets you reach the same accuracy on a custom 500-image dataset in 20 minutes on a laptop."

🧠

Why pre-trained weights work for new tasks: The early layers of a CNN trained on natural images learn near-universal detectors: edges and colour blobs first, then curves and textures a few layers later. These features are useful for nearly any visual task (medical, satellite, industrial). Often only the final task-specific layers need to be retrained on your data.

👷

Analogy Bridge

Hiring an experienced engineer who already knows physics, CAD, and general manufacturing. You only need to train them on your specific product — not reteach the fundamentals. The pre-trained backbone is the general expertise; your dataset teaches the product-specific knowledge.

Fine-tuning strategy: grey = frozen backbone (universal features), green = partially unfrozen (domain adaptation), blue = fully trainable, purple = new task-specific head always trained.

1 000 ImageNet classes: pre-trained on 1.2 M images, the backbone "sees" edges, textures, and shapes across 1 000 categories
~10 epochs: typical training length to fine-tune MobileNetV2 on a custom 1,000-image dataset (head only, GPU)
95%+ accuracy: achievable on many custom datasets with transfer learning vs. ~60% training from scratch
10× lower LR: recommended backbone LR relative to the head when partially fine-tuning; protects pre-trained weights

Problem — Choosing What to Freeze

The Feature Hierarchy

CNN features become increasingly task-specific with depth:

Layer 1–2: Gabor-like edge detectors and colour blobs — universal across all visual tasks
Layer 3–4: Textures, curves, contours — generalisable with some variation
Layer 5+: Object parts and semantic regions — begin to specialise to training domain
Final layer: Fully task-specific (1 000 ImageNet classes)

The decision of where to "cut" drives your strategy.

📝 Worked Example — Strategy Selection

Task A: classify 3 cat breeds, 200 photos. ImageNet contains cat images → similar domain, tiny data. Strategy: freeze all backbone, train head only.

Task B: classify skin lesions (dermoscopy images), 2,000 photos. ImageNet has no medical images → different domain, medium data. Strategy: freeze early layers, unfreeze deep layers + head at lower LR.

Task C: classify satellite crop types, 50,000 images. Very different domain, large data → full fine-tune with small initial LR across all layers.

Rule: more data + more domain shift → unfreeze more layers
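
The rule can be written as a small decision function. This is only a sketch: the thresholds `small` and `large` are assumptions chosen so that Tasks A, B, and C above land on their stated strategies; they are not values given by the course, and real projects should tune them.

```python
def choose_strategy(n_images: int, similar_domain: bool,
                    small: int = 1_000, large: int = 20_000) -> str:
    """Map dataset size and domain similarity to a fine-tuning strategy.
    The `small` and `large` thresholds are hypothetical defaults."""
    if n_images < small:
        # Too little data to safely update backbone weights.
        return "A: freeze backbone, train head only"
    if not similar_domain and n_images >= large:
        # Plenty of data and a big domain shift: adapt every layer.
        return "C: full fine-tune with a small learning rate"
    return "B: unfreeze late blocks at a lower learning rate"


print(choose_strategy(200, similar_domain=True))       # Task A
print(choose_strategy(2_000, similar_domain=False))    # Task B
print(choose_strategy(50_000, similar_domain=False))   # Task C
```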

Quick Check

You have 10,000 X-ray images. X-rays look very different from natural images. Which strategy would you choose and why?

Strategy B (partial fine-tune) or full fine-tune. Large domain shift means late backbone layers may not transfer well — unfreeze the last 2–3 blocks. 10k images is enough to fine-tune without severe overfitting if you use regularisation (dropout, weight decay).

Discriminative Learning Rates

When unfreezing backbone layers, do not use the same learning rate for all layers. Earlier backbone layers contain well-tuned, general features — destroy them with a high LR and accuracy collapses.

The standard practice is a discriminative learning rate schedule:

$$\eta_l = \eta_{\text{head}} \times r^{(L - l)}$$

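where $r \in [0.1, 0.5]$ is the decay factor per layer group, $L$ is the total number of groups, and $l$ is the group index ($l = 1$ is the earliest group and $l = L$ is the head, so the head keeps $\eta_{\text{head}}$). A common choice is $r = 0.1$, set through PyTorch optimizer parameter groups.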

📝 Worked Example — Computing Layer Learning Rates

Head learning rate: $\eta_{\text{head}} = 1 \times 10^{-3}$. Decay $r = 0.1$. Three groups: early backbone, late backbone, head.

$$\begin{aligned}\eta_{\text{head}} &= 1\times10^{-3}\\[4pt]\eta_{\text{late backbone}} &= 1\times10^{-3} \times 0.1 = 1\times10^{-4}\\[4pt]\eta_{\text{early backbone}} &= 1\times10^{-3} \times 0.1^2 = 1\times10^{-5}\end{aligned}$$

Early layers update 100× more slowly — they barely move from their pre-trained values, preserving the universal edge detectors.

LR schedule: 1e-5 (early) → 1e-4 (late backbone) → 1e-3 (head)

Quick Check

Using $r=0.3$ and $\eta_\text{head}=5\times10^{-4}$, what is $\eta_\text{early backbone}$ with 3 groups?

η_early = 5×10⁻⁴ × 0.3² = 5×10⁻⁴ × 0.09 = 4.5×10⁻⁵
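
Both calculations follow the same pattern, so a short helper covers them. A minimal sketch, with `discriminative_lrs` as a hypothetical name; it returns the rates ordered from the earliest group to the head.

```python
def discriminative_lrs(eta_head: float, r: float, n_groups: int) -> list:
    """Learning rates from the earliest group to the head.
    The head keeps eta_head; each step toward the input multiplies by r."""
    return [eta_head * r ** (n_groups - 1 - i) for i in range(n_groups)]


# Worked example: eta_head = 1e-3, r = 0.1, three groups.
worked = discriminative_lrs(1e-3, 0.1, 3)
# Quick check: eta_head = 5e-4, r = 0.3, three groups.
quick = discriminative_lrs(5e-4, 0.3, 3)

print([f"{lr:.1e}" for lr in worked])
print([f"{lr:.1e}" for lr in quick])
```

The resulting list can be zipped with the model's layer groups to build the parameter-group dictionaries passed to the optimizer.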

⚠️

Common Mistake — Not Replacing the Classifier Head

Pre-trained models end with a classifier or fc layer outputting 1 000 ImageNet classes. If you forget to replace it with a new head matching your class count, the loss will be computed against the wrong number of outputs — causing either a shape error or silent wrong-task training.

Solution — Dataset Size × Domain Shift Guide

Pause & Predict

The widget below maps (dataset size, domain similarity) to a fine-tuning strategy recommendation. Before clicking: predict which quadrant requires the most aggressive unfreezing — large dataset + very different domain?

Hint: what happens if you freeze all layers and you have 50,000 images from a completely different domain?

Try It: Strategy Decision Matrix

Click any quadrant to see the recommended fine-tuning strategy and reasoning.

Implementation

Python · PyTorch — Fine-Tuning MobileNetV2 for Custom Classification

import torchvision.models as models
import torch.nn as nn
import torch

NUM_CLASSES = 5  # your custom number of classes

# 1. Load pre-trained MobileNetV2
model = models.mobilenet_v2(
    weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

# 2. Freeze entire backbone
for param in model.parameters():
    param.requires_grad = False

# 3. Replace classifier head (1280 → NUM_CLASSES)
model.classifier = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(model.last_channel, NUM_CLASSES)
)

# 4. Strategy B: unfreeze last 3 backbone blocks
for layer in model.features[-3:]:
    for p in layer.parameters():
        p.requires_grad = True

# 5. Discriminative LR: 10× lower for backbone
optimizer = torch.optim.Adam([
    {'params': model.features[-3:].parameters(),
     'lr': 1e-4},          # late backbone
    {'params': model.classifier.parameters(),
     'lr': 1e-3},          # head
])

trainable = sum(p.numel() for p in model.parameters()
                 if p.requires_grad)
print(f"Trainable params: {trainable:,}")

Output

Trainable params: 1,212,485 (out of 2,230,277 total after the head swap, about 54% trainable)
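
The trainable count can be cross-checked by hand. The sketch below assumes torchvision's MobileNetV2 layout, in which `features[16]` and `features[17]` are inverted residual blocks (160→160 and 160→320, expansion factor 6) and `features[18]` is a 1×1 conv to 1280 channels followed by BatchNorm, all convolutions without bias. Those layout details are assumptions worth confirming with `print(model.features[-3:])`.

```python
def bn_params(channels: int) -> int:
    """Learnable BatchNorm parameters: one scale and one shift per channel."""
    return 2 * channels


def inverted_residual_params(c_in: int, c_out: int, expand: int = 6) -> int:
    """1x1 expand + 3x3 depthwise + 1x1 project, each followed by BatchNorm."""
    hidden = c_in * expand
    expand_pw = c_in * hidden + bn_params(hidden)
    depthwise = 9 * hidden + bn_params(hidden)
    project_pw = hidden * c_out + bn_params(c_out)
    return expand_pw + depthwise + project_pw


block_16 = inverted_residual_params(160, 160)
block_17 = inverted_residual_params(160, 320)
last_conv = 320 * 1280 + bn_params(1280)        # final 1x1 conv + BN
head = 1280 * 5 + 5                              # Linear(1280, 5) with bias

late_backbone = block_16 + block_17 + last_conv
print(f"late backbone: {late_backbone:,}")
print(f"head:          {head:,}")
print(f"trainable:     {late_backbone + head:,}")
```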

Key Takeaway

Transfer learning is not a single setting — match your freeze depth to your data: tiny data + similar domain → freeze everything except the head; large data + different domain → unfreeze deep layers with discriminative learning rates 10–100× smaller than the head.

🔬

Real-World Application

Fine-Tuning MobileNet for Custom Product Quality Control

A KMITL industrial partner trained a 5-class surface defect classifier (good, crack, scratch, dent, discolour) on 1,200 product images using MobileNetV2 fine-tuning (Strategy B). Frozen early layers, unfrozen last 3 blocks, discriminative LR. Result: 96.4% accuracy in 25 epochs — a model deployed on a Raspberry Pi to inspect 40 units per minute.

Checkpoint · Transfer Learning & Fine-Tuning

Q1 You have 300 dermoscopy images and want to classify 2 types of skin lesion. ImageNet has no medical images. Which fine-tuning strategy would you choose, and why?

Strategy A or light Strategy B. 300 images is very few — full fine-tune risks overfitting. Freeze the early backbone (universal features still useful), train only the head + optionally the last 1–2 blocks at a 10× lower LR. Add strong data augmentation (flip, rotation, colour jitter) to compensate for the small dataset.

Q2 After calling model = mobilenet_v2(weights=...) and freezing all parameters, you get an error when computing the loss. What did you forget?

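You forgot to replace the classifier head. The pre-trained model's final layer outputs 1,000 classes, and after freezing there is no trainable parameter left, so loss.backward() raises a RuntimeError because nothing requires gradients. A new head fixes both problems: the fresh Linear layer is trainable and matches your class count (e.g., 5). Note that CrossEntropyLoss on its own will not flag a 1,000-way output as long as the labels are valid indices, so without a new head the model would silently train on the wrong output space.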

Practice

Exercises

8 exercises covering skip connections, depthwise separable convolutions, and transfer learning fine-tuning strategies.

Theory · Skip Connections Easy

Gradient Magnitude in a 6-Layer Stack

A plain 6-layer network multiplies the gradient by $\delta = 0.7$ at each layer (from output back to layer 1). A ResNet uses the same 6 layers with skip connections added every 2 layers.

(a) Calculate the gradient magnitude at layer 1 of the plain network (as a fraction of the output gradient $g_L$).
(b) Sketch (or describe) how skip connections change the effective gradient formula at each block boundary.
(c) Why does the identity shortcut guarantee a gradient of at least 1 through each residual block?

For part (a): gradient at layer 1 = $g_L \cdot \delta^{L-1}$ where $L=6$, so $g_L \cdot 0.7^5$. For part (c): the Jacobian of $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ with respect to $\mathbf{x}$ is $\nabla_\mathbf{x} H = \nabla_\mathbf{x} F + \mathbf{I}$, so even if $\nabla_\mathbf{x} F \approx 0$ the gradient still has the identity term.

Code · Skip Connections Medium

Mini ResNet for CIFAR-10

Build a small ResNet-style classifier for CIFAR-10 (32×32×3 input, 10 classes).

Implement a ResidualBlock with two 3×3 convolutions and a projection shortcut when channels/stride changes.
Add a 3×3 stem convolution (3→32, stride=1) so the 3-channel input matches the first block, then stack 3 residual blocks: (32→32), (32→64, stride=2), (64→64). Add a global average pool and linear head.
Pass a batch torch.randn(8, 3, 32, 32) through the model and print the output shape.
Count total trainable parameters using sum(p.numel() for p in model.parameters()).

With a stride-1 stem, the tensor shape after the 3 blocks is (8, 64, 16, 16): only the stride-2 block halves the 32×32 input. After global average pool (nn.AdaptiveAvgPool2d(1)) and flatten: (8, 64). After linear head: (8, 10). The projection shortcut fires only at the (32→64, stride=2) block: use nn.Conv2d(32, 64, 1, stride=2, bias=False) + BN.

Theory · Depthwise Separable Convolutions Easy

Parameter Count and Reduction Ratio

Consider a convolution with $D_K = 3$, $M = 64$ input channels, $N = 128$ output channels, and spatial feature map $D_F = 56$.

(a) Calculate the parameter count of a standard 3×3 conv layer.
(b) Calculate the parameter count of an equivalent depthwise separable block (depthwise + pointwise, no bias).
(c) Compute the exact reduction ratio and verify using the formula $\frac{1}{N} + \frac{1}{D_K^2}$.
(d) Calculate the total multiply-add operations for both approaches.

(a) Standard: $9 \times 64 \times 128 = 73{,}728$ params. (b) DW-Sep: $9 \times 64 + 64 \times 128 = 576 + 8{,}192 = 8{,}768$ params. (c) Ratio: $73{,}728 / 8{,}768 \approx 8.41\times$. Formula: $1/128 + 1/9 = 0.0078 + 0.111 = 0.119 \rightarrow 8.41\times$ ✓. (d) Ops: standard = $73{,}728 \times 56^2 \approx 231\text{M}$; DW-Sep = $(576 + 8{,}192) \times 56^2 \approx 27.5\text{M}$.

Code · Depthwise Separable Convolutions Medium

Build a DW-Sep Block and Verify Efficiency

Implement and benchmark a depthwise separable convolution block in PyTorch.

Implement DWSepConv(in_ch, out_ch, stride=1) with: depthwise 3×3 + BN + ReLU6, then pointwise 1×1 + BN + ReLU6.
Implement a matching StandardConv(in_ch, out_ch) using a regular 3×3 conv + BN + ReLU.
Compare parameter counts for $M=64$, $N=128$: print both counts and the ratio.
Pass the same torch.randn(2, 64, 112, 112) through both and confirm identical output shapes.

Key: nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False) for the depthwise step. The groups=in_ch argument is the critical implementation detail. Both models should produce shape (2, 128, 112, 112).

Theory · Transfer Learning Medium

Fine-Tuning Strategy Selection

For each scenario below, choose the correct strategy (A: freeze all / B: partial unfreeze / C: full fine-tune) and justify your choice.

Scenario 1: 300 photos of cats and dogs (similar to ImageNet). Classify into 2 breeds.
Scenario 2: 5,000 chest X-ray images. Classify pneumonia vs. normal. Very different from ImageNet.
Scenario 3: 200,000 satellite images. Classify 10 land-use types. Very different domain, abundant data.
For Scenario 2, compute the backbone learning rate if $\eta_\text{head} = 5\times10^{-4}$ and $r = 0.1$.

S1: Strategy A — tiny + similar; frozen backbone prevents overfitting. S2: Strategy B — medium + different domain; unfreeze last 2–3 blocks at 10× lower LR. S3: Strategy C — large + very different; full fine-tune needed to adapt early features. Part 4: $\eta_\text{backbone} = 5\times10^{-4} \times 0.1 = 5\times10^{-5}$.

Code · Transfer Learning Medium

Fine-Tuning MobileNetV2 — Step by Step

Implement a complete fine-tuning setup for a 4-class defect classifier.

Load mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).
Freeze all parameters. Print the trainable parameter count (should be 0).
Replace model.classifier with nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 4)).
Unfreeze the last 3 feature blocks (model.features[-3:]). Print the new trainable count.
Set up an Adam optimiser with lr=1e-4 for backbone, lr=1e-3 for head using parameter groups.

After step 2: 0 trainable params. After step 3: 5,124 (head only: 1280×4 + 4 bias). After step 4: roughly 1.2 M, since model.features[-3:] holds the last two inverted residual blocks plus the final 1×1 conv layer. Use sum(p.numel() for p in model.parameters() if p.requires_grad) to count.

Synthesis · Theory: Architecture Design Trade-offs Hard

Combining Skip Connections with Depthwise Separable Convolutions

MobileNetV2 introduces inverted residuals: expand channels (1×1), apply depthwise 3×3, then contract (1×1). This is the reverse of standard bottlenecks.

Explain why MobileNetV2 expands channels before the depthwise conv (hint: more channels → richer spatial features for the depthwise step).
A standard MobileNetV1 DW-Sep block: $M=32$ input, $N=32$ output, 3×3 DW. An inverted residual block (MV2): expand to $6M=192$, then DW 3×3 at 192, then contract to 32. Calculate total parameter counts for both blocks.
MobileNetV2 only adds the skip connection when $M_\text{in} = M_\text{out}$ and stride = 1. Why is the projection shortcut not used here (unlike ResNet)?

Part 2 — MV1: $9\times32 + 32\times32 = 288 + 1024 = 1312$ params. MV2 inverted residual: expand ($32\times192 = 6144$) + DW ($9\times192 = 1728$) + contract ($192\times32 = 6144$) = $\mathbf{14016}$ params. MV2 has more parameters per block but the expansion enables more expressive spatial filtering. Part 3: Adding a projection would add parameters and computation in an already constrained block; the goal is efficiency. The skip is free only when dimensions match (identity).
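
The Part 2 arithmetic can be reproduced with two short helpers. This sketch counts convolution weights only (no bias, no BatchNorm), matching the hand calculation; the function names are introduced here for illustration.

```python
def mv1_block_params(m: int, n: int, d_k: int = 3) -> int:
    """MobileNetV1 block: depthwise d_k x d_k on m channels, then 1x1 to n."""
    return d_k**2 * m + m * n


def mv2_block_params(m: int, n: int, expand: int = 6, d_k: int = 3) -> int:
    """MobileNetV2 inverted residual: 1x1 expand, depthwise, 1x1 contract."""
    hidden = expand * m
    expand_pw = m * hidden
    depthwise = d_k**2 * hidden
    contract_pw = hidden * n
    return expand_pw + depthwise + contract_pw


mv1 = mv1_block_params(32, 32)
mv2 = mv2_block_params(32, 32)
print(f"MV1: {mv1:,}   MV2: {mv2:,}   ratio: {mv2 / mv1:.1f}x")
```

Most of the MobileNetV2 cost sits in the two 1×1 convolutions, not in the depthwise step, which is why widening the depthwise stage is comparatively cheap.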

Synthesis · Code: Industrial Defect Inspection Pipeline Hard

End-to-End Fine-Tuning Pipeline

Build a complete fine-tuning and evaluation pipeline for a 5-class product defect classifier using MobileNetV2.

Load MobileNetV2. Freeze backbone. Replace head for 5 classes.
Create a synthetic dataset: torch.randn(100, 3, 224, 224) with random integer labels (0–4). Wrap in a TensorDataset and DataLoader(batch_size=16).
Define criterion = nn.CrossEntropyLoss() and an Adam optimiser (head-only LR=1e-3).
Run 3 training epochs. Print average loss and accuracy per epoch.
After epoch 1, unfreeze the last 2 backbone blocks and add them to the optimiser at LR=1e-4. Continue training for 2 more epochs. Compare accuracy before and after unfreezing.

To add parameters to an existing optimiser after unfreezing: call optimizer.add_param_group({'params': new_params, 'lr': 1e-4}). For accuracy: preds = logits.argmax(dim=1); acc = (preds == labels).float().mean(). With random data, expect ~20% accuracy (chance level for 5 classes) regardless of training — the goal is verifying the pipeline runs without errors.
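
The unfreeze-and-register step looks like this in isolation. The sketch uses a tiny stand-in model (two Linear layers as a hypothetical "backbone" plus a 5-class head) so that it runs without downloading weights; the same calls apply to MobileNetV2 blocks.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a frozen backbone plus a new 5-class head.
backbone = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
head = nn.Linear(8, 5)
for p in backbone.parameters():
    p.requires_grad = False

# Phase 1: only the head is optimised.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Phase 2: unfreeze the last backbone layer and register it at a lower LR.
late = backbone[-1]
for p in late.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": list(late.parameters()), "lr": 1e-4})

group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Parameters must be unfrozen before they are useful in the optimizer, and each parameter may belong to only one group, so the new group should contain only tensors that were not passed in earlier.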

What We Covered

Skip Connections

Depthwise Separable Convolution

Transfer Learning

Coming up — Week 8

Go Deeper

Prince — Understanding Deep Learning, Ch. 10

Géron — Hands-On ML, Ch. 14

He et al. (2016) — Deep Residual Learning for Image Recognition

Howard et al. (2017) — MobileNets: Efficient CNNs for Mobile Vision

Stanford CS231n — Transfer Learning Notes

Stanford CS231n (Justin Johnson / Fei-Fei Li) — Lecture on CNNs
