Computer Vision Foundations — Week 7

Modern CNNs & Transfer Learning

From skip connections that unlocked 150-layer networks, to convolutions that run on a smartphone, to the art of re-using what a network already knows.

ResNet · Skip Connections · Vanishing Gradients · Depthwise Separable Conv · MobileNet · Transfer Learning · Fine-Tuning Strategies

Skip Connections & Residual Networks

Why adding more layers can hurt — and how a single identity shortcut unlocked the 150-layer barrier.

After this section you will be able to
  • Explain the degradation problem and why deeper plain networks converge worse than shallower ones
  • Derive the residual output $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ and describe how the identity path stabilises gradient flow
  • Implement a PyTorch residual block with a projection shortcut for dimension changes

"By 2014 the consensus was clear: deeper is better. Then teams tried 30-, 50-, even 100-layer plain networks — and training accuracy got worse. Something fundamental was broken."

🔗
Why skip connections matter: In a deep stack of weight layers each backpropagation step multiplies the gradient by a small number. After 20+ layers the signal reaching early weights is near zero — training stalls. Skip connections add a direct gradient highway that bypasses the weight layers entirely, ensuring every parameter receives a usable update signal regardless of depth.
🛣️
Analogy Bridge

Think of a road network. The main path winds through many toll booths — each one slows traffic (degrades gradient). The skip connection is an express bypass: the gradient travels directly without attenuation. The final merge (+) combines both flows.

[Figure: residual block. Residual path: x → Conv 3×3 → BN → ReLU → Conv 3×3 → BN → F(x). Identity shortcut: x is carried unchanged to the + node. Output: H(x) = F(x) + x, followed by ReLU.]

Residual block: the main convolution path learns F(x), then the identity skip adds x directly. The Jacobian at the + node is ∂F/∂x + I, so the gradient keeps an identity term even when ∂F/∂x is close to zero.

152 layers: ResNet-152 won ImageNet 2015; training at this depth relied on skip connections
3.57% top-5 error: ResNet ensemble result on ImageNet, below the estimated human-level error of about 5.1%
10⁻¹⁰ min gradient: illustrative order of magnitude for the gradient at layer 1 of a 50-layer plain network with saturating activations; effectively zero
2 lines of code: out = F(x) becomes out = F(x) + x, the full innovation
Problem — The Degradation Barrier

Vanishing Gradients in Deep Plain Networks

In backpropagation the gradient at layer $l$ is a product of the Jacobians of every layer between $l$ and the loss. For the logistic sigmoid each derivative factor satisfies $\sigma'(z_k) \leq 0.25$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}_L}\prod_{k=l}^{L-1} \mathbf{W}_k\,\sigma'(\mathbf{z}_k)$$

After $L$ layers the product collapses — early weights receive near-zero updates and training stalls regardless of how long you wait.

📝 Worked Example — Gradient Magnitude vs. Depth
1. Assume each layer attenuates the gradient by $\delta = 0.85$ (an illustrative value for saturating tanh/sigmoid layers).
2. Gradient at layer 1 after $L$ layers: $g = g_L \cdot \delta^{L-1}$
  • $L=5$:  $0.85^4 \approx 0.522$, usable
  • $L=20$: $0.85^{19} \approx 0.046$, very small
  • $L=50$: $0.85^{49} \approx 3.5\times10^{-4}$, nearly zero
3. With learning rate $\eta = 0.01$, the weight update at depth 50 is $\Delta W \approx 3.5\times10^{-6}\, g_L$: practically no learning signal.
Result: gradient at layer 1, depth 50 ≈ 3.5 × 10⁻⁴ × g_L → training stalls
Quick Check

If $\delta = 0.9$, at what depth does the gradient drop below 1% of its initial value?

0.9^(L-1) < 0.01 → (L-1) > ln(0.01)/ln(0.9) ≈ 43.7 → depth L ≈ 45 layers
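
The attenuation arithmetic can be checked with a few lines of plain Python. This is a minimal sketch of the simplified model only (one constant factor δ per layer); the function names `gradient_fraction` and `depth_below` are illustrative and not part of any library.

```python
import math


def gradient_fraction(delta: float, depth: int) -> float:
    """Fraction of the output gradient that reaches layer 1 of a plain
    stack, under the simplified model g = g_L * delta**(depth - 1)."""
    return delta ** (depth - 1)


def depth_below(delta: float, threshold: float) -> int:
    """Smallest depth L for which delta**(L - 1) falls below threshold."""
    steps = math.log(threshold) / math.log(delta)  # need L - 1 > steps
    return math.floor(steps) + 2


for depth in (5, 20, 50):
    print(f"L={depth:>2}: {gradient_fraction(0.85, depth):.3g}")

print("delta=0.9 drops below 1% at depth", depth_below(0.9, 0.01))
```

Changing `delta` shows how sensitive the result is: the decay is geometric, so small changes in the per-layer factor move the usable depth a long way.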

The Residual Mapping Fix

ResNet re-parameterises the learning target. Instead of learning $H(\mathbf{x})$ directly, the layers learn the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$:

$$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$$

The Jacobian of the block is $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$. The identity $\mathbf{I}$ is always present, so the gradient keeps a direct path through the block even when $\frac{\partial F}{\partial \mathbf{x}}$ is close to zero.

📝 Worked Example — Residual Block Forward Pass
1. Input: $\mathbf{x} = [2.0,\ 1.0,\ 3.0]^\top$. Two simplified conv layers output $F(\mathbf{x}) = [0.4,\ 0.6,\ 0.2]^\top$.
2. Add identity shortcut:
$$H(\mathbf{x}) = \begin{bmatrix}0.4\\0.6\\0.2\end{bmatrix} + \begin{bmatrix}2.0\\1.0\\3.0\end{bmatrix} = \begin{bmatrix}2.4\\1.6\\3.2\end{bmatrix}$$
3. Gradient behaviour: $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$. Even if $\frac{\partial F}{\partial \mathbf{x}} \approx \mathbf{0}$, the Jacobian is $\approx \mathbf{I}$, so the gradient passes through the block essentially unchanged and training can proceed.
Result: H(x) = [2.4, 1.6, 3.2]ᵀ; the output stays close to the input, and only the correction is learned
Quick Check

If $F(\mathbf{x}) = [-2.0,\ -1.0,\ -3.0]^\top$, what is $H(\mathbf{x})$? What has the block learned to do?

H(x) = [0, 0, 0]ᵀ. The block learned to suppress the input entirely. F learns only the "correction" (residual) — if the optimal output is zero, F just needs to cancel x.
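
The forward pass and the effect of the identity term can be sketched without any framework. The snippet below is a scalar toy model, not a real network: each block is reduced to a single local slope dF/dx, and the value 0.1 for every slope is an assumption chosen only to make the contrast visible. The names `residual_forward` and `chain_gradient` are illustrative.

```python
from math import prod


def residual_forward(x, f_out):
    """H(x) = F(x) + x, element-wise, for plain Python lists."""
    return [f + xi for f, xi in zip(f_out, x)]


def chain_gradient(local_slopes, residual):
    """Scalar toy model of backprop through a stack of blocks.

    Each block has local slope dF/dx = s. A plain block passes the
    gradient on scaled by s; a residual block scales it by s + 1.
    """
    return prod((s + 1.0) if residual else s for s in local_slopes)


print(residual_forward([2.0, 1.0, 3.0], [0.4, 0.6, 0.2]))

slopes = [0.1] * 20  # assumed: every block has a small positive slope
print("plain   :", chain_gradient(slopes, residual=False))
print("residual:", chain_gradient(slopes, residual=True))
```

In this toy model a slope of exactly −1 would still cancel the identity term, so the shortcut is best read as removing the forced geometric decay rather than as a hard mathematical guarantee.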
⚠️
Common Mistake — Dimension Mismatch on the Skip Path

The skip addition requires $\mathbf{x}$ and $F(\mathbf{x})$ to have identical shapes. When stride > 1 or the number of channels changes, add a 1×1 projection convolution (+ BatchNorm) on the skip path to match. Forgetting this causes a runtime shape error at the + node.
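
A quick way to catch this mistake before running a model is to trace the shapes by hand. The sketch below uses the standard convolution output-size formula for dilation 1; the helper names `conv_out_size` and `block_shapes` are made up for illustration, and the block layout (two 3×3 convs with padding 1, a 1×1 projection with no padding) is assumed to match the residual block described in this section.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (dilation 1):
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def block_shapes(in_ch: int, out_ch: int, size: int, stride: int):
    """Return (main_path, projection, identity) shapes as (C, H, W) tuples."""
    h = conv_out_size(size, kernel=3, stride=stride, padding=1)  # first 3x3 conv
    h = conv_out_size(h, kernel=3, stride=1, padding=1)          # second 3x3 conv
    p = conv_out_size(size, kernel=1, stride=stride, padding=0)  # 1x1 projection
    return (out_ch, h, h), (out_ch, p, p), (in_ch, size, size)


main, proj, identity = block_shapes(in_ch=64, out_ch=128, size=32, stride=2)
print("main path :", main)
print("projection:", proj)      # must equal the main-path shape
print("identity  :", identity)  # unprojected x; differs, so the add would fail
```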

Solution — Depth vs. Error Visualiser
Pause & Predict

The widget below plots training error vs. depth for plain and residual networks. Before dragging: at what depth does the plain network begin to degrade? Where does the residual network plateau?

Hint: gradients multiply ~0.85 per layer — how many steps until the product falls below 0.01?

Try It: Network Depth vs. Training Error

Drag the slider to change depth. Observe how plain networks degrade while residual networks keep improving.

[Interactive plot: training error vs. depth for a plain network (no skip) and a residual network (skip connections), with a live calculation of the gradient at layer 1 for δ = 0.85 per layer. Default depth: 20 layers.]
Implementation
Python · PyTorch — ResidualBlock with projection shortcut
    import torch
    import torch.nn as nn


    class ResidualBlock(nn.Module):
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(out_ch)
            self.relu = nn.ReLU(inplace=True)
            self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_ch)
            # 1×1 projection when dims change
            needs_proj = (stride != 1) or (in_ch != out_ch)
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch)
            ) if needs_proj else nn.Identity()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            out += self.shortcut(x)  # H(x) = F(x) + x
            return self.relu(out)


    # Stride=2: spatial dims halve, projection shortcut fires
    block = ResidualBlock(64, 128, stride=2)
    x = torch.randn(4, 64, 32, 32)
    print(block(x).shape)  # (4, 128, 16, 16)
Output
torch.Size([4, 128, 16, 16])
Key Takeaway

Adding out += x, a one-line change, provides a gradient highway through every block, making networks of 50, 100, or 1 000 layers trainable where plain stacks degrade.

🏥
Real-World Application

Medical Image Segmentation — ResNet Encoders in nnU-Net

nnU-Net, a widely used self-configuring medical segmentation framework, can swap its plain convolutional encoder for a residual encoder. Skip connections help keep gradients usable when training on small datasets (100–500 annotated 3D volumes), supporting reliable convergence of 20 M+ parameter models on CT and MRI scans.

Checkpoint: Skip Connections & ResNet

Q1 A residual block receives input $\mathbf{x}$ of shape (64, 32, 32) and uses stride=2 to produce output of shape (128, 16, 16). What shape must the projection shortcut output, and which layer achieves this?

Shape: (128, 16, 16). The projection shortcut uses a 1×1 Conv2d(64, 128, stride=2) + BatchNorm2d(128). This matches the main path output so element-wise addition is valid.

Q2 Explain why $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ prevents vanishing gradients even in a 100-layer network.

The identity matrix I is always present, regardless of how small ∂F/∂x becomes due to saturated activations. Backpropagating through a stack of residual blocks multiplies factors of the form (∂F/∂x + I) rather than ∂F/∂x alone, so there is always a direct identity path from the loss to every earlier block and the gradient is not forced to shrink geometrically with depth.

Depthwise Separable Convolutions & MobileNet

How factorising the convolution operation achieves an 8–9× reduction in computation — enabling real-time vision on a smartphone.

After this section you will be able to
  • Compare the parameter count of a standard conv layer vs. its depthwise separable equivalent using the formula
  • Explain the two-stage structure (depthwise then pointwise) and what each stage learns
  • Implement a depthwise separable convolution block in PyTorch using groups=in_channels

"ResNet-50 has 25 million parameters and needs 4 GFLOPs for a single image. That's fine in a data-centre. On a 2015 smartphone with 1 W of compute budget, it would take 2 seconds per frame. MobileNet runs at 30 fps."

📱
Why efficient convolutions matter: Edge devices — phones, drones, medical monitors — have strict power and memory budgets. Depthwise separable convolution achieves similar representational power to a standard conv at a fraction of the cost by separating spatial filtering (per channel) from channel mixing (1×1 conv).
👨‍🍳
Analogy Bridge

Standard convolution is like one chef doing everything: mixing all ingredients and cooking simultaneously. Depthwise separable conv is specialised teamwork: one chef handles each ingredient separately (depthwise — spatial), then a coordinator blends all the flavours at the end (pointwise — channel mixing).

[Figure: Input H×W×C (C channels) → Depthwise Conv (3×3 filter per channel, C filters total, spatial filtering only, groups = C; 9C × H × W ops) → Pointwise Conv (1×1 filter across all C channels, N filters → N channels, channel mixing only, a standard conv with k = 1; CN × H × W ops) → Output H×W×N (N channels). Overall: ~8–9× fewer ops.]

Depthwise separable convolution: stage 1 filters each channel independently (spatial), stage 2 mixes channels (1×1). Together they approximate a full conv at ~1/9 the cost.

8–9× ops saved: for a 3×3 conv the ratio is 1/N + 1/9; with 64 output channels that is ≈ 12.7% of standard ops (≈ 7.9× fewer), approaching 9× as N grows
4.2M params: MobileNetV1, vs. 25 M for ResNet-50 and 138 M for VGG-16
70.6% top-1 acc: MobileNetV1 on ImageNet, competitive with VGG-16 at about 1/30 the parameters
569M MAdds: MobileNetV1 multiply-adds vs. 4,000 M for ResNet-50, about 7× fewer operations
Problem — Standard Convolution Is Expensive

Standard Convolution Cost

A standard conv layer with $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and $D_F \times D_F$ feature map has total multiply-adds:

$$\text{Ops}_{\text{std}} = D_K^2 \times M \times N \times D_F^2$$

Parameters: $D_K^2 \times M \times N$. Every output channel looks at every input channel at every spatial position — expensive cross-channel interactions happen at each kernel position.

📝 Worked Example — Standard Conv Parameter Count
1. Settings: $D_K = 3$, $M = 32$ input channels, $N = 64$ output channels, $D_F = 112$ (input resolution).
2. Parameters: $D_K^2 \times M \times N = 9 \times 32 \times 64 = \mathbf{18{,}432}$
3. Operations: $9 \times 32 \times 64 \times 112^2 = 18{,}432 \times 12{,}544 \approx \mathbf{231\,M}$ MAdds
Result: standard 3×3 conv (32→64, 112×112): 18,432 params, 231 M ops
Quick Check

If you double both $M$ and $N$ (32→64 input, 64→128 output), how many times more parameters does the layer have?

Parameters scale as M×N, so doubling both: (64×128)/(32×64) = 8,192/2,048 = 4× more parameters. Ops also 4× since they scale as M×N×D_F².
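
The two cost formulas for a standard convolution are easy to turn into a helper for checking hand calculations. This is a small sketch using the section's symbols; the function name `std_conv_cost` is illustrative, and bias terms are ignored as in the text.

```python
def std_conv_cost(d_k: int, m: int, n: int, d_f: int):
    """Parameters and multiply-adds of a standard conv layer (no bias),
    using the section's symbols: kernel D_K, M inputs, N outputs, map D_F."""
    params = d_k * d_k * m * n
    madds = params * d_f * d_f
    return params, madds


params, madds = std_conv_cost(d_k=3, m=32, n=64, d_f=112)
print(f"{params:,} params, {madds / 1e6:.0f} M multiply-adds")

doubled, _ = std_conv_cost(d_k=3, m=64, n=128, d_f=112)
print("doubling M and N multiplies params by", doubled // params)
```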

Depthwise Separable Conv Cost

Split into two stages. Depthwise ops (1 filter per input channel): $D_K^2 \times M \times D_F^2$. Pointwise ops (1×1 cross-channel): $M \times N \times D_F^2$. Total:

$$\text{Ops}_{\text{dw}} = D_K^2 \cdot M \cdot D_F^2 + M \cdot N \cdot D_F^2$$

The ratio to standard conv:

$$\frac{\text{Ops}_{\text{dw}}}{\text{Ops}_{\text{std}}} = \frac{1}{N} + \frac{1}{D_K^2}$$

For $N=64$, $D_K=3$: ratio $= 1/64 + 1/9 \approx 0.127$ — an 87% reduction.
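
The ratio follows in one step from the two operation counts: divide term by term, and the factors $M$ and $D_F^2$ cancel.

$$\frac{D_K^2\, M\, D_F^2 + M\, N\, D_F^2}{D_K^2\, M\, N\, D_F^2} = \frac{D_K^2\, M\, D_F^2}{D_K^2\, M\, N\, D_F^2} + \frac{M\, N\, D_F^2}{D_K^2\, M\, N\, D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}$$

Because $D_F^2$ cancels, the same ratio also holds for parameter counts.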

📝 Worked Example — DW-Sep Parameter Count
1. Depthwise params: $D_K^2 \times M = 9 \times 32 = \mathbf{288}$
2. Pointwise params: $M \times N = 32 \times 64 = \mathbf{2{,}048}$
3. Total: $288 + 2{,}048 = \mathbf{2{,}336}$ vs. $18{,}432$ standard
$$\text{Reduction} = \frac{18{,}432}{2{,}336} \approx 7.9\times$$
Result: DW-Sep (32→64, 3×3): 2,336 params, 7.9× fewer than standard conv
Quick Check

Using the ratio formula $1/N + 1/D_K^2$ with $N=64$, $D_K=3$, verify the 87% reduction result numerically.

1/64 + 1/9 = 0.01563 + 0.1111 = 0.1267 → 12.67% of standard ops → 87.3% reduction ✓
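
To see how the saving depends on the number of output channels, the ratio formula can be evaluated for several values of $N$. This is a minimal sketch; `dw_sep_ratio` is an illustrative name, and the channel counts in the loop are arbitrary examples.

```python
def dw_sep_ratio(n: int, d_k: int = 3) -> float:
    """Ops (and params) of a depthwise separable conv as a fraction of
    a standard conv with the same shape: 1/N + 1/D_K**2."""
    return 1 / n + 1 / d_k**2


for n in (16, 64, 256, 1024):
    r = dw_sep_ratio(n)
    print(f"N={n:>4}: {r:.4f} of standard ops, {1 / r:.2f}x fewer")
```

For a 3×3 kernel the 1/9 term dominates once $N$ is large, so the saving grows with $N$ but stays below 9×.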
💡
Key Insight — What Each Stage Learns

Depthwise (3×3): detects spatial patterns (edges, textures) within each channel independently — like running a dedicated edge-detector per feature map. Pointwise (1×1): combines channels to form richer representations — like a learned colour mixer. Together they approximate full conv expressiveness at ~1/9 the cost.

Solution — Standard vs. Depthwise Separable Cost Explorer
Pause & Predict

Below, adjust the number of output channels $N$. Before dragging: at which value of $N$ does the depthwise separable approach give more than 8× fewer operations? Does the benefit grow or shrink as $N$ increases?

Hint: the ratio formula is $1/N + 1/9$. Set it below $1/8$ and solve for $N$.

Try It: Parameter Count Comparison

Compare parameter counts for standard vs. depthwise separable convolution as you change input channels $M$, output channels $N$, and kernel size $D_K$.

[Interactive chart: standard conv params, depthwise-separable params, and reduction ratio, with a live parameter-count calculation. Defaults: M = 32, N = 64, Dₖ = 3.]
Implementation
Python · PyTorch — Depthwise Separable Convolution Block
    import torch
    import torch.nn as nn


    class DWSepConv(nn.Module):
        """Depthwise-separable conv: 3×3 DW + 1×1 PW + BN + ReLU6."""

        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            # Depthwise: one filter per input channel
            self.dw = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                stride=stride, padding=1,
                                groups=in_ch,  # key: groups=in_ch
                                bias=False)
            self.bn1 = nn.BatchNorm2d(in_ch)
            # Pointwise: mix channels with 1×1 conv
            self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn2 = nn.BatchNorm2d(out_ch)
            self.act = nn.ReLU6(inplace=True)

        def forward(self, x):
            x = self.act(self.bn1(self.dw(x)))  # spatial
            x = self.act(self.bn2(self.pw(x)))  # channel mix
            return x


    # Compare param counts
    M, N = 32, 64
    std_params = 3**2 * M * N    # 18,432
    dw_sep = 3**2 * M + M * N    # 2,336
    print(f"Standard: {std_params:,} params")
    print(f"DW-Sep: {dw_sep:,} params")
    print(f"Reduction: {std_params/dw_sep:.1f}×")

    block = DWSepConv(32, 64)
    x = torch.randn(4, 32, 112, 112)
    print(block(x).shape)
Output
    Standard: 18,432 params
    DW-Sep: 2,336 params
    Reduction: 7.9×
    torch.Size([4, 64, 112, 112])
Key Takeaway

Depthwise separable convolution separates where to detect (depthwise, per-channel) from what to detect (pointwise, cross-channel), cutting operations by 8–9× at a cost of about 1 percentage point of ImageNet top-1 accuracy (71.7% → 70.6% for MobileNetV1 in Howard et al.).

📦
Real-World Application

Real-Time Quality Control on a Raspberry Pi

A Thai packaging factory replaced a ResNet-based defect detector (GPU-required, 18 FPS) with a MobileNetV2 model built on depthwise separable blocks. The result: 30 FPS on a $35 Raspberry Pi 4, 91% defect recall, and zero GPU costs — enabling in-line inspection on every production line.

Checkpoint: Depthwise Separable Convolutions

Q1 A standard 3×3 conv has $M=128$ input channels and $N=256$ output channels. Calculate (a) the parameter count, (b) the depthwise separable parameter count, and (c) the reduction ratio.

(a) Standard: 9 × 128 × 256 = 294,912 params
(b) DW-Sep: 9×128 + 128×256 = 1,152 + 32,768 = 33,920 params
(c) Ratio: 294,912 / 33,920 ≈ 8.7× (formula: 1/N + 1/9 = 1/256+1/9 ≈ 0.115 → 8.7× reduction)

Q2 In PyTorch, what argument to Conv2d implements the depthwise step, and what value should it have for a layer with 64 input channels?

groups=64. Setting groups=in_channels forces each filter to operate on exactly one input channel — that is the definition of depthwise convolution.
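
The effect of `groups` on the weight count can be worked out directly from the shape of a grouped convolution's weight tensor, (out_ch, in_ch / groups, k, k). The sketch below is plain Python with no PyTorch dependency; `conv2d_weight_count` is an illustrative helper, not a library function.

```python
def conv2d_weight_count(in_ch: int, out_ch: int, kernel: int, groups: int = 1) -> int:
    """Weight count of a grouped 2-D convolution without bias.

    Each output channel sees only in_ch / groups input channels, so the
    weight tensor has shape (out_ch, in_ch // groups, kernel, kernel).
    """
    if in_ch % groups or out_ch % groups:
        raise ValueError("in_ch and out_ch must both be divisible by groups")
    return out_ch * (in_ch // groups) * kernel * kernel


print(conv2d_weight_count(64, 64, 3, groups=1))   # standard 3x3 conv, 64 -> 64
print(conv2d_weight_count(64, 64, 3, groups=64))  # depthwise: one 3x3 filter per channel
```

With `groups` equal to the channel count the total drops from 9·64·64 to 9·64 weights, which is the depthwise term $D_K^2 \times M$ from the cost formula.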

Transfer Learning & Fine-Tuning Strategies

How to transplant hundreds of GPU-hours of ImageNet training into your custom task in an afternoon, and which layers to touch.

After this section you will be able to
  • Explain the feature hierarchy in a pre-trained CNN and why early layers generalise across tasks
  • Select the correct fine-tuning strategy (freeze all / partial unfreeze / full fine-tune) based on dataset size and domain similarity
  • Implement MobileNetV2 fine-tuning in PyTorch — freezing backbone, replacing the classifier head, and training with a discriminative learning rate

"Training ResNet-50 on ImageNet from scratch takes 90 epochs on 8 GPUs — roughly 3 days and thousands of dollars. Transfer learning lets you reach the same accuracy on a custom 500-image dataset in 20 minutes on a laptop."

🧠
Why pre-trained weights work for new tasks: The early layers of a CNN trained on natural images learn broadly reusable detectors: edges and colour blobs first, then curves and textures. These features are useful for nearly any visual task (medical, satellite, industrial). Often only the later, task-specific layers need to be retrained on your data.
👷
Analogy Bridge

Hiring an experienced engineer who already knows physics, CAD, and general manufacturing. You only need to train them on your specific product — not reteach the fundamentals. The pre-trained backbone is the general expertise; your dataset teaches the product-specific knowledge.

[Figure: backbone feature hierarchy. Conv Block 1–2: edges, corners. Conv Block 3–4: textures, curves. Conv Block 5–6: parts, shapes. Conv Block 7: object semantics. Classifier Head: task output. Legend: frozen (pre-trained), partially unfrozen, unfrozen (trainable), new classifier head.]

Strategy A (Feature Extraction): Freeze ALL backbone layers and train the head only. Use when the dataset is tiny (<500 images), where the risk of overfitting is highest; the frozen backbone acts as a fixed feature extractor.

Strategy B (Partial Fine-Tune): Freeze early layers, unfreeze deep layers + head. Best for medium datasets (1k–10k). Use a lower LR on the unfrozen backbone (10× smaller) to avoid destroying learned weights.

Fine-tuning strategy: grey = frozen backbone (universal features), green = partially unfrozen (domain adaptation), blue = fully trainable, purple = new task-specific head always trained.

1 000 ImageNet classes: pre-trained on 1.2 M images; the backbone "sees" edges, textures, and shapes across 1 000 categories
~10 epochs: typical training length to fine-tune MobileNetV2 on a custom 1,000-image dataset (head only, GPU)
95%+ accuracy: achievable on many custom datasets with transfer learning vs. ~60% training from scratch
10× lower LR: recommended backbone LR relative to head when partially fine-tuning; protects pre-trained weights
Problem — Choosing What to Freeze

The Feature Hierarchy

CNN features become increasingly task-specific with depth:

  • Layer 1–2: Gabor-like edge detectors and colour blobs — universal across all visual tasks
  • Layer 3–4: Textures, curves, contours — generalisable with some variation
  • Layer 5+: Object parts and semantic regions — begin to specialise to training domain
  • Final layer: Fully task-specific (1 000 ImageNet classes)

The decision of where to "cut" drives your strategy.

📝 Worked Example — Strategy Selection
1. Task A: classify 3 cat breeds, 200 photos. ImageNet contains cat images → similar domain, tiny data. Strategy: freeze all backbone, train head only.
2. Task B: classify skin lesions (dermoscopy images), 2,000 photos. ImageNet has no medical images → different domain, medium data. Strategy: freeze early layers, unfreeze deep layers + head at lower LR.
3. Task C: classify satellite crop types, 50,000 images. Very different domain, large data → full fine-tune with small initial LR across all layers.
Rule: more data + more domain shift → unfreeze more layers
Quick Check

You have 10,000 X-ray images. X-rays look very different from natural images. Which strategy would you choose and why?

Strategy B (partial fine-tune) or full fine-tune. Large domain shift means late backbone layers may not transfer well — unfreeze the last 2–3 blocks. 10k images is enough to fine-tune without severe overfitting if you use regularisation (dropout, weight decay).
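
The rule of thumb from the worked example can be written as a small decision function. This is only a sketch: the thresholds (1,000 and 10,000 images) come from the dataset-size ranges quoted in this section, the function name `choose_strategy` is made up for illustration, and the choice for the large-but-similar case is an assumption the text does not spell out.

```python
def choose_strategy(num_images: int, similar_domain: bool) -> str:
    """Rule-of-thumb fine-tuning strategy; thresholds are illustrative."""
    if num_images < 1_000:
        # Tiny dataset: overfitting risk dominates, keep the backbone fixed.
        return "A: freeze backbone, train head only"
    if num_images <= 10_000:
        # Medium dataset: adapt the late, more task-specific blocks.
        return "B: unfreeze late blocks + head, lower backbone LR"
    if similar_domain:
        # Assumption: with a similar domain, early features need little change.
        return "B: unfreeze late blocks + head, lower backbone LR"
    return "C: full fine-tune with a small initial LR"


for n, similar in [(200, True), (2_000, False), (50_000, False)]:
    print(f"{n:>6} images, similar={similar}: {choose_strategy(n, similar)}")
```

Real projects sit on a continuum, so a function like this is a starting point for experiments rather than a substitute for validating on held-out data.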

Discriminative Learning Rates

When unfreezing backbone layers, do not use the same learning rate for all layers. Earlier backbone layers contain well-tuned, general features — destroy them with a high LR and accuracy collapses.

The standard practice is a discriminative learning rate schedule:

$$\eta_l = \eta_{\text{head}} \times r^{(L - 1 - l)}$$

where $r \in [0.1, 0.5]$ is the decay factor per layer group, $L$ is the total number of groups, and $l$ is the group index ($0$ = earliest, $L-1$ = head), so the head keeps $\eta_{\text{head}}$. In practice $r = 0.1$ is a common choice, implemented with PyTorch optimizer parameter groups.

📝 Worked Example — Computing Layer Learning Rates
1. Head learning rate: $\eta_{\text{head}} = 1 \times 10^{-3}$. Decay $r = 0.1$. Three groups: early backbone, late backbone, head.
2. Apply the decay once per group, moving away from the head:
$$\begin{aligned}\eta_{\text{head}} &= 1\times10^{-3}\\[4pt]\eta_{\text{late backbone}} &= 1\times10^{-3} \times 0.1 = 1\times10^{-4}\\[4pt]\eta_{\text{early backbone}} &= 1\times10^{-3} \times 0.1^2 = 1\times10^{-5}\end{aligned}$$
3. Early layers update 100× more slowly: they barely move from their pre-trained values, preserving the universal edge detectors.
Result: LR schedule: 1e-5 (early) → 1e-4 (late backbone) → 1e-3 (head)
Quick Check

Using $r=0.3$ and $\eta_\text{head}=5\times10^{-4}$, what is $\eta_\text{early backbone}$ with 3 groups?

η_early = 5×10⁻⁴ × 0.3² = 5×10⁻⁴ × 0.09 = 4.5×10⁻⁵
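
The per-group learning rates can be generated rather than computed by hand. The sketch below assumes groups are indexed from 0 (earliest) to L − 1 (head), so the head has exponent 0 and keeps the head learning rate, as in the worked example; `group_learning_rates` is an illustrative name.

```python
def group_learning_rates(head_lr: float, decay: float, num_groups: int):
    """Learning rate per group, earliest group first, head last.

    Group l (0 = earliest, num_groups - 1 = head) gets
    head_lr * decay ** (num_groups - 1 - l), so the head keeps head_lr.
    """
    return [head_lr * decay ** (num_groups - 1 - l) for l in range(num_groups)]


names = ["early backbone", "late backbone", "head"]
for name, lr in zip(names, group_learning_rates(1e-3, 0.1, 3)):
    print(f"{name:>14}: {lr:.1e}")
```

Each value can then be used as the `lr` entry of the matching parameter-group dictionary passed to a PyTorch optimizer.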
⚠️
Common Mistake — Not Replacing the Classifier Head

Pre-trained models end with a classifier or fc layer outputting 1 000 ImageNet classes. If you forget to replace it with a new head matching your class count, the model still emits 1 000 logits: depending on the loss and label format you get either a shape error or, with integer class labels, silent training against the wrong output space.

Solution — Dataset Size × Domain Shift Guide
Pause & Predict

The widget below maps (dataset size, domain similarity) to a fine-tuning strategy recommendation. Before clicking: predict which quadrant requires the most aggressive unfreezing — large dataset + very different domain?

Hint: what happens if you freeze all layers and you have 50,000 images from a completely different domain?

Try It: Strategy Decision Matrix

Click any quadrant to see the recommended fine-tuning strategy and reasoning.

Implementation
Python · PyTorch — Fine-Tuning MobileNetV2 for Custom Classification
    import torch
    import torch.nn as nn
    import torchvision.models as models

    NUM_CLASSES = 5  # your custom number of classes

    # 1. Load pre-trained MobileNetV2
    model = models.mobilenet_v2(
        weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)

    # 2. Freeze entire backbone
    for param in model.parameters():
        param.requires_grad = False

    # 3. Replace classifier head (1280 → NUM_CLASSES)
    model.classifier = nn.Sequential(
        nn.Dropout(0.2),
        nn.Linear(model.last_channel, NUM_CLASSES)
    )

    # 4. Strategy B: unfreeze last 3 backbone blocks
    for layer in model.features[-3:]:
        for p in layer.parameters():
            p.requires_grad = True

    # 5. Discriminative LR: 10× lower for backbone
    optimizer = torch.optim.Adam([
        {'params': model.features[-3:].parameters(), 'lr': 1e-4},  # late backbone
        {'params': model.classifier.parameters(), 'lr': 1e-3},     # head
    ])

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable:,}")
Output
Trainable params: 1,212,485 (out of 2,230,277 total after replacing the head; about 54% trainable)
Key Takeaway

Transfer learning is not a single setting: match your freeze depth to your data. Tiny data + similar domain → freeze everything except the head; medium data + shifted domain → unfreeze deep layers with discriminative learning rates 10–100× smaller than the head; large data + different domain → full fine-tune with a small initial learning rate.

🔬
Real-World Application

Fine-Tuning MobileNet for Custom Product Quality Control

A KMITL industrial partner trained a 5-class surface defect classifier (good, crack, scratch, dent, discolour) on 1,200 product images using MobileNetV2 fine-tuning (Strategy B). Frozen early layers, unfrozen last 3 blocks, discriminative LR. Result: 96.4% accuracy in 25 epochs — a model deployed on a Raspberry Pi to inspect 40 units per minute.

Checkpoint: Transfer Learning & Fine-Tuning

Q1 You have 300 dermoscopy images and want to classify 2 types of skin lesion. ImageNet has no medical images. Which fine-tuning strategy would you choose, and why?

Strategy A or light Strategy B. 300 images is very few — full fine-tune risks overfitting. Freeze the early backbone (universal features still useful), train only the head + optionally the last 1–2 blocks at a 10× lower LR. Add strong data augmentation (flip, rotation, colour jitter) to compensate for the small dataset.

Q2 After calling model = mobilenet_v2(weights=...) and freezing all parameters, you get an error when computing the loss. What did you forget?

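You forgot to replace the classifier head. The pre-trained model's final layer outputs 1,000 classes, and with every parameter frozen nothing in the model requires gradients, so loss.backward() raises a RuntimeError. Replacing the head with a new nn.Linear sized for your classes (e.g., 5) fixes both problems: the new layer is trainable by default and its outputs match your labels. Without it you get either a runtime error or wrong-task training.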

Fine-Tuning Strategy Selector

Compare Feature Extraction, Partial Fine-Tuning, and Full Fine-Tuning side by side. See exactly which MobileNetV2 layers are frozen, how many parameters are trainable, and what dataset size each strategy requires.

[Interactive selector, shown here for Strategy A]
  • Trainable params: 272,901 (7.8% of 3.5M total)
  • Recommended dataset: < 1,000 images (minimises overfitting risk)
  • Head LR: 1 × 10⁻³ (classifier head learning rate)
  • Backbone LR: none; all backbone layers frozen
  • Training risk: low overfitting (frozen backbone acts as regulariser)
Strategy A (Feature Extraction): Freeze the entire MobileNetV2 backbone. Only the new classifier head is trainable. The backbone acts as a fixed feature extractor: every image is mapped to a 1280-dim feature vector, then your classifier learns the mapping to your classes. Best when your dataset is small (<1,000 images) or very similar to ImageNet.
Legend: frozen layer, partially trainable, fully trainable, classifier head.
💡

Progressive unfreezing tip: Start with Strategy A for 5 epochs, then gradually unfreeze later blocks toward Strategy B. This "warm-start" prevents the randomly-initialised head from corrupting pre-trained backbone weights during the first chaotic training steps.
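
One way to express such a schedule is a function that returns which backbone blocks are trainable at a given epoch. This is a hypothetical sketch, not a prescribed recipe: the name `trainable_blocks`, the pace of one extra block every two epochs, and the cap of three blocks are all assumptions, and `num_blocks=19` assumes the backbone is indexed as 19 blocks (0 to 18), as in torchvision's MobileNetV2 `features`.

```python
def trainable_blocks(epoch: int, num_blocks: int = 19,
                     warmup_epochs: int = 5, unfreeze_every: int = 2,
                     max_unfrozen: int = 3) -> list[int]:
    """Indices of backbone blocks to train at a given epoch (0-based).

    Head-only during warm-up, then one more late block every
    `unfreeze_every` epochs, up to `max_unfrozen` blocks.
    """
    if epoch < warmup_epochs:
        return []  # Strategy A: backbone fully frozen
    steps = (epoch - warmup_epochs) // unfreeze_every + 1
    count = min(steps, max_unfrozen)
    return list(range(num_blocks - count, num_blocks))


for epoch in (0, 4, 5, 7, 9, 20):
    print(epoch, trainable_blocks(epoch))
```

In a training loop the returned indices would be used to set `requires_grad = True` on the matching blocks at the start of each epoch, with the optimizer's parameter groups updated to include them.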

What We Covered

Three connected ideas — deeper networks, efficient networks, and borrowed knowledge — that together define modern practical computer vision.

🔗

Skip Connections

$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ adds an identity shortcut, so $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ always contains an identity term; this counters vanishing gradients and enables 100+ layer networks.

⚗️

Depthwise Separable Convolution

Splitting a standard conv into a depthwise (spatial) + pointwise (channel) stage reduces operations to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the original: roughly 8–9× fewer for 3×3 kernels (≈ 7.9× at 64 output channels).

🧬

Transfer Learning

Match the freeze depth to your data: tiny & similar → freeze backbone, train head; medium & shifted → partial unfreeze with discriminative LR 10× lower on backbone; large & different → full fine-tune.

Coming up — Week 8

Midterm Examination. Covers all topics from Weeks 1–7: Image Formation, Spatial Filtering, Feature Detection, Image Alignment, Neural Networks, CNNs, and Modern CNNs & Transfer Learning. Open-note: one A4 handwritten cheat sheet permitted.

Go Deeper

Primary sources and interactive resources for Week 7 topics — in recommended reading order.

📘 Textbook

Prince — Understanding Deep Learning, Ch. 10

Covers modern CNN architectures including ResNet, MobileNet, and transfer learning strategies. Detailed mathematical treatment of skip connections and feature re-use.

→ Primary reference for Weeks 5–7
📗 Textbook

Géron — Hands-On ML, Ch. 14

Practical PyTorch implementation of ResNet, MobileNet, and transfer learning pipelines. Includes code for fine-tuning pre-trained models on custom datasets with torchvision.

→ Companion code for all lab exercises
📄 Paper

He et al. (2016) — Deep Residual Learning for Image Recognition

The original ResNet paper. Introduces the degradation problem, proposes residual mappings, and demonstrates 152-layer networks winning ImageNet. Essential reading.

→ arXiv 1512.03385
📄 Paper

Howard et al. (2017) — MobileNets: Efficient CNNs for Mobile Vision

Original MobileNet paper introducing depthwise separable convolutions for mobile and embedded vision. Contains detailed efficiency analysis and ImageNet benchmark results.

→ arXiv 1704.04861
🎓 Lecture

Stanford CS231n — Transfer Learning Notes

Comprehensive notes on fine-tuning strategies, feature extraction vs. full fine-tuning, and practical tips for discriminative learning rates. Free online resource.

→ cs231n.github.io/transfer-learning
🎬 Video

Stanford CS231n (Justin Johnson / Fei-Fei Li) — Lecture on CNNs

Deep-dive lecture covering ResNet, batch normalisation, and the evolution from AlexNet to modern architectures. Excellent visualisations of feature hierarchy at each layer.

→ Supplemental Materials (Week 7)

Exercises

8 exercises covering skip connections, depthwise separable convolutions, and transfer learning fine-tuning strategies.

1
Theory · Skip Connections Easy

Gradient Magnitude in a 6-Layer Stack

A plain 6-layer network multiplies the gradient by $\delta = 0.7$ at each layer (from output back to layer 1). A ResNet uses the same 6 layers with skip connections added every 2 layers.

  1. Calculate the gradient magnitude at layer 1 of the plain network (as a fraction of the output gradient $g_L$).
  2. Sketch (or describe) how skip connections change the effective gradient formula at each block boundary.
  3. Why does the identity shortcut guarantee a gradient of at least 1 through each residual block?

For part (a): gradient at layer 1 = $g_L \cdot \delta^{L-1}$ where $L=6$, so $g_L \cdot 0.7^5$. For part (c): the Jacobian of $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ with respect to $\mathbf{x}$ is $\nabla_\mathbf{x} H = \nabla_\mathbf{x} F + \mathbf{I}$, so even if $\nabla_\mathbf{x} F \approx 0$ the gradient still has the identity term.

2
Code · Skip Connections Medium

Mini ResNet for CIFAR-10

Build a small ResNet-style classifier for CIFAR-10 (32×32×3 input, 10 classes).

  1. Implement a ResidualBlock with two 3×3 convolutions and a projection shortcut when channels/stride changes.
  2. Stack 3 residual blocks: (32→32), (32→64, stride=2), (64→64). Add a global average pool and linear head.
  3. Pass a batch torch.randn(8, 3, 32, 32) through the model and print the output shape.
  4. Count total trainable parameters using sum(p.numel() for p in model.parameters()).

After the 3 blocks the tensor shape is (8, 64, 16, 16), assuming a stride-1 stem convolution (3→32) in front of the first block: only the (32→64, stride=2) block halves the 32×32 input. After global average pool (nn.AdaptiveAvgPool2d(1)) and flatten: (8, 64). After linear head: (8, 10). The projection shortcut fires only at the (32→64, stride=2) block; use nn.Conv2d(32, 64, 1, stride=2, bias=False) + BN.

3
Theory · Depthwise Separable Convolutions Easy

Parameter Count and Reduction Ratio

Consider a convolution with $D_K = 3$, $M = 64$ input channels, $N = 128$ output channels, and spatial feature map $D_F = 56$.

  1. Calculate the parameter count of a standard 3×3 conv layer.
  2. Calculate the parameter count of an equivalent depthwise separable block (depthwise + pointwise, no bias).
  3. Compute the exact reduction ratio and verify using the formula $\frac{1}{N} + \frac{1}{D_K^2}$.
  4. Calculate the total multiply-add operations for both approaches.

(a) Standard: $9 \times 64 \times 128 = 73,728$ params. (b) DW-Sep: $9 \times 64 + 64 \times 128 = 576 + 8,192 = 8,768$ params. (c) Ratio: $73728 / 8768 \approx 8.41\times$. Formula: $1/128 + 1/9 = 0.0078 + 0.111 = 0.119 \rightarrow 8.41\times$ ✓. (d) Ops: standard = $73728 \times 56^2 \approx 231\text{M}$; DW-Sep = $(576 + 8192) \times 56^2 \approx 27.5\text{M}$.

4
Code · Depthwise Separable Convolutions Medium

Build a DW-Sep Block and Verify Efficiency

Implement and benchmark a depthwise separable convolution block in PyTorch.

  1. Implement DWSepConv(in_ch, out_ch, stride=1) with: depthwise 3×3 + BN + ReLU6, then pointwise 1×1 + BN + ReLU6.
  2. Implement a matching StandardConv(in_ch, out_ch) using a regular 3×3 conv + BN + ReLU.
  3. Compare parameter counts for $M=64$, $N=128$: print both counts and the ratio.
  4. Pass the same torch.randn(2, 64, 112, 112) through both and confirm identical output shapes.

Key: nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False) for the depthwise step. The groups=in_ch argument is the critical implementation detail. Both models should produce shape (2, 128, 112, 112).
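A minimal sketch of the two blocks follows, assuming `bias=False` on every conv, as in the hint. The helper `count_params` is an illustrative name. BatchNorm contributes two learnable values per channel, so the printed totals are slightly larger than the conv-only counts from the theory exercise.

```python
import torch
import torch.nn as nn


class DWSepConv(nn.Module):
    """Depthwise 3x3 + BN + ReLU6, then pointwise 1x1 + BN + ReLU6."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            # groups=in_ch gives one 3x3 filter per input channel.
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
            # 1x1 conv mixes information across channels.
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class StandardConv(nn.Module):
    """Regular 3x3 conv + BN + ReLU."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


def count_params(module):
    return sum(p.numel() for p in module.parameters())


dw = DWSepConv(64, 128)
std = StandardConv(64, 128)
dw_params, std_params = count_params(dw), count_params(std)

x = torch.randn(2, 64, 112, 112)
with torch.no_grad():
    y_dw, y_std = dw(x), std(x)

print(f"DW-Sep params:   {dw_params:,}")
print(f"standard params: {std_params:,}")
print(f"ratio:           {std_params / dw_params:.2f}x")
print("shapes:", tuple(y_dw.shape), tuple(y_std.shape))
```

With the BatchNorm parameters included the ratio comes out near 8.1× rather than 8.41×; counting only `nn.Conv2d` weights recovers the theoretical figure.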

5
Theory · Transfer Learning Medium

Fine-Tuning Strategy Selection

For each scenario below, choose the correct strategy (A: freeze all / B: partial unfreeze / C: full fine-tune) and justify your choice.

  1. Scenario 1: 300 photos of cats and dogs (similar to ImageNet). Classify into 2 breeds.
  2. Scenario 2: 5,000 chest X-ray images. Classify pneumonia vs. normal. Very different from ImageNet.
  3. Scenario 3: 200,000 satellite images. Classify 10 land-use types. Very different domain, abundant data.
  4. For Scenario 2, compute the backbone learning rate if $\eta_\text{head} = 5\times10^{-4}$ and $r = 0.1$.

S1: Strategy A — tiny + similar; frozen backbone prevents overfitting. S2: Strategy B — medium + different domain; unfreeze last 2–3 blocks at 10× lower LR. S3: Strategy C — large + very different; full fine-tune needed to adapt early features. Part 4: $\eta_\text{backbone} = 5\times10^{-4} \times 0.1 = 5\times10^{-5}$.

6
Code · Transfer Learning Medium

Fine-Tuning MobileNetV2 — Step by Step

Implement a complete fine-tuning setup for a 4-class defect classifier.

  1. Load mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).
  2. Freeze all parameters. Print the trainable parameter count (should be 0).
  3. Replace model.classifier with nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 4)).
  4. Unfreeze the last 3 feature blocks (model.features[-3:]). Print the new trainable count.
  5. Set up an Adam optimiser with lr=1e-4 for backbone, lr=1e-3 for head using parameter groups.

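After step 2: 0 trainable params. After step 3: 5,124 (head only: 1280×4 weights + 4 biases). After step 4: in the standard torchvision layout, model.features[-3:] holds the last two inverted residual blocks plus the final 1×1 conv block (320→1280), roughly 1.2M params in total, so expect about 1.21M trainable once the head is included. Use sum(p.numel() for p in model.parameters() if p.requires_grad) to count.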
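The freeze, replace, unfreeze, and parameter-group steps can be sketched without downloading any weights. `TinyBackbone` below is a hypothetical stand-in that only mimics the `.features` / `.classifier` layout of torchvision's MobileNetV2; its channel widths are arbitrary. With the real model, the head's input width is 1280 instead of 16.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in with the same attribute layout as MobileNetV2."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(3 if i == 0 else 16, 16, 3, padding=1, bias=False),
                nn.BatchNorm2d(16),
                nn.ReLU6(),
            )
            for i in range(5)
        ])
        self.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(16, 1000))

    def forward(self, x):
        x = self.features(x).mean(dim=(2, 3))  # global average pool
        return self.classifier(x)


def count_trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


model = TinyBackbone()

# Step 2: freeze everything.
for p in model.parameters():
    p.requires_grad = False
frozen_count = count_trainable(model)

# Step 3: a freshly created head is trainable by default.
model.classifier = nn.Sequential(nn.Dropout(0.2), nn.Linear(16, 4))
head_count = count_trainable(model)

# Step 4: unfreeze the last 3 feature blocks.
for p in model.features[-3:].parameters():
    p.requires_grad = True
partial_count = count_trainable(model)

# Step 5: one parameter group per learning rate.
optimizer = torch.optim.Adam([
    {"params": [p for p in model.features.parameters() if p.requires_grad],
     "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
])

print("after freeze:  ", frozen_count)
print("after new head:", head_count)
print("after unfreeze:", partial_count)
print("group LRs:     ", [g["lr"] for g in optimizer.param_groups])
```

Filtering the backbone group on `requires_grad` keeps frozen tensors out of the optimiser, so no optimiser state is allocated for weights that never update.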

7
Synthesis · Theory: Architecture Design Trade-offs Hard

Combining Skip Connections with Depthwise Separable Convolutions

MobileNetV2 introduces inverted residuals: expand channels (1×1), apply depthwise 3×3, then contract (1×1). This is the reverse of a standard ResNet bottleneck, which contracts channels (1×1), applies a 3×3 conv, then expands (1×1).

  1. Explain why MobileNetV2 expands channels before the depthwise conv (hint: more channels → richer spatial features for the depthwise step).
  2. A standard MobileNetV1 DW-Sep block: $M=32$ input, $N=32$ output, 3×3 DW. An inverted residual block (MV2): expand to $6M=192$, then DW 3×3 at 192, then contract to 32. Calculate total parameter counts for both blocks.
  3. MobileNetV2 only adds the skip connection when $M_\text{in} = M_\text{out}$ and stride = 1. Why is the projection shortcut not used here (unlike ResNet)?

Part 2 — MV1: $9\times32 + 32\times32 = 288 + 1024 = 1312$ params. MV2 inverted residual: expand ($32\times192 = 6144$) + DW ($9\times192 = 1728$) + contract ($192\times32 = 6144$) = $\mathbf{14016}$ params. MV2 has more parameters per block but the expansion enables more expressive spatial filtering. Part 3: Adding a projection would add parameters and computation in an already constrained block; the goal is efficiency. The skip is free only when dimensions match (identity).
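The Part 2 counts can be verified with plain Python. This sketch counts convolution weights only (no bias, no BatchNorm), matching the hand calculation; the function names are illustrative.

```python
def mv1_block_params(m, n, d_k=3):
    """MobileNetV1 block: depthwise d_k x d_k on m channels, then 1x1 m -> n."""
    depthwise = d_k * d_k * m
    pointwise = m * n
    return depthwise + pointwise


def mv2_block_params(m, n, expansion=6, d_k=3):
    """MobileNetV2 inverted residual: 1x1 expand, depthwise, 1x1 contract."""
    hidden = expansion * m
    expand = m * hidden
    depthwise = d_k * d_k * hidden
    contract = hidden * n
    return expand + depthwise + contract


mv1 = mv1_block_params(32, 32)
mv2 = mv2_block_params(32, 32)

print(f"MV1 block: {mv1:,} params")
print(f"MV2 block: {mv2:,} params")
print(f"MV2 / MV1: {mv2 / mv1:.1f}x")
```

Most of the MV2 cost sits in the two 1×1 convolutions; the depthwise step over the expanded 192 channels accounts for only 1,728 of the weights.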

8
Synthesis · Code: Industrial Defect Inspection Pipeline Hard

End-to-End Fine-Tuning Pipeline

Build a complete fine-tuning and evaluation pipeline for a 5-class product defect classifier using MobileNetV2.

  1. Load MobileNetV2. Freeze backbone. Replace head for 5 classes.
  2. Create a synthetic dataset: torch.randn(100, 3, 224, 224) with random integer labels (0–4). Wrap in a TensorDataset and DataLoader(batch_size=16).
  3. Define criterion = nn.CrossEntropyLoss() and an Adam optimiser (head-only LR=1e-3).
  4. Run 3 training epochs. Print average loss and accuracy per epoch.
  5. After epoch 1, unfreeze the last 2 backbone blocks and add them to the optimiser at LR=1e-4. Continue training for 2 more epochs. Compare accuracy before and after unfreezing.

To add parameters to an existing optimiser after unfreezing: set requires_grad = True on them, then call optimizer.add_param_group({'params': new_params, 'lr': 1e-4}). For accuracy: preds = logits.argmax(dim=1); acc = (preds == labels).float().mean(). With random data there is no signal to learn, so held-out accuracy stays at ~20% (chance level for 5 classes); training accuracy can drift above chance as the model memorises the 100 samples. The goal is verifying the pipeline runs without errors.
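The control flow of the pipeline, including the mid-training unfreeze, can be sketched as follows. To keep it quick and free of downloads, this sketch swaps in a small hypothetical backbone and 32×32 inputs, both assumptions. For the actual exercise, use MobileNetV2 and `torch.randn(100, 3, 224, 224)`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)


def conv_block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU6(),
    )


# Hypothetical stand-in for the MobileNetV2 feature extractor.
backbone = nn.Sequential(conv_block(3, 8, 2), conv_block(8, 16, 2), conv_block(16, 16, 1))
head = nn.Sequential(nn.Dropout(0.2), nn.Linear(16, 5))
model = nn.Sequential(backbone, nn.AdaptiveAvgPool2d(1), nn.Flatten(), head)

# Step 1: freeze the backbone; only the head trains at first.
for p in backbone.parameters():
    p.requires_grad = False

# Step 2: synthetic data with random labels in 0..4.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 5, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Step 3: loss and head-only optimiser.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)


def run_epoch():
    """One pass over the loader; returns (average loss, accuracy)."""
    model.train()  # note: BatchNorm running stats update even when frozen
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


history = []
for epoch in range(3):
    if epoch == 1:
        # Step 5: unfreeze the last 2 backbone blocks at a lower LR.
        new_params = [p for block in backbone[-2:] for p in block.parameters()]
        for p in new_params:
            p.requires_grad = True
        optimizer.add_param_group({"params": new_params, "lr": 1e-4})
    loss, acc = run_epoch()
    history.append((loss, acc))
    print(f"epoch {epoch + 1}: loss={loss:.4f} acc={acc:.2%}")
```

Calling `add_param_group` keeps the existing Adam state for the head, whereas building a fresh optimiser after unfreezing would reset it.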