From skip connections that unlocked 150-layer networks, to convolutions that run on a smartphone, to the art of re-using what a network already knows.
Why adding more layers can hurt — and how a single identity shortcut unlocked the 150-layer barrier.
"By 2014 the consensus was clear: deeper is better. Then teams tried 30-, 50-, even 100-layer plain networks — and training accuracy got worse. Something fundamental was broken."
Think of a road network. The main path winds through many toll booths — each one slows traffic (degrades gradient). The skip connection is an express bypass: the gradient travels directly without attenuation. The final merge (+) combines both flows.
Residual block: the main convolution path learns F(x), then the identity skip adds x directly. The gradient at the + node is ∂F/∂x + I, so an identity term is always present.
out = F(x) becomes out = F(x) + x: the full innovation.
In backpropagation the gradient at layer $l$ is a product of all upstream Jacobians. For sigmoid-like activations each factor satisfies $\sigma'(z_k) \leq 0.25$, so across $n$ layers the activation-derivative terms alone contribute at most $0.25^{n}$.
After $L$ layers the product collapses — early weights receive near-zero updates and training stalls regardless of how long you wait.
If each layer scales the gradient by a factor $\delta = 0.9$, at what depth does the gradient drop below 1% of its initial value?
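The question above can be checked numerically. The sketch below assumes the simplest possible model, in which every layer multiplies the gradient by the same constant factor; `depth_below` is a hypothetical helper name, not part of any library.

```python
def depth_below(delta: float, threshold: float = 0.01) -> int:
    """Return the smallest number of layers n such that delta**n < threshold.

    Assumes every layer scales the gradient by the same factor delta.
    """
    n, gradient = 0, 1.0
    while gradient >= threshold:
        gradient *= delta  # one more layer of attenuation
        n += 1
    return n


for delta in (0.9, 0.85, 0.7):
    n = depth_below(delta)
    print(f"delta={delta}: gradient falls below 1% after {n} layers")
```

Under this assumption, a per-layer factor of 0.9 crosses the 1% mark at 44 layers and 0.85 at 29 layers, which is why plain stacks of that depth are hard to train.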
ResNet re-parameterises the learning target. Instead of learning $H(\mathbf{x})$ directly, the layers learn the residual $F(\mathbf{x}) = H(\mathbf{x}) - \mathbf{x}$, so the block output becomes:
$$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$$
The Jacobian of the block is $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$. The identity $\mathbf{I}$ is always present, so the gradient keeps a direct path through the block even when $\frac{\partial F}{\partial \mathbf{x}} \approx 0$.
If $F(\mathbf{x}) = [-2.0,\ -1.0,\ -3.0]^\top$, what is $H(\mathbf{x})$? What has the block learned to do?
The skip addition requires $\mathbf{x}$ and $F(\mathbf{x})$ to have identical shapes. When stride > 1 or the number of channels changes, add a 1×1 projection convolution (+ BatchNorm) on the skip path to match. Forgetting this causes a runtime shape error at the + node.
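A minimal PyTorch sketch of the shape rule above, assuming a basic two-convolution block. The class name `ResidualBlock` and its layer layout are illustrative choices, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convs plus a skip path; projects the skip when shapes differ."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # 1x1 projection (+ BatchNorm) so the skip matches F(x) in shape
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # the + node: F(x) + x
        return F.relu(out)


block = ResidualBlock(64, 128, stride=2)
y = block(torch.randn(2, 64, 32, 32))
print(y.shape)  # torch.Size([2, 128, 16, 16])
```

Removing the projection branch and calling the block with `stride=2` raises a shape mismatch at the addition, which is the runtime error described above.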
The widget below plots training error vs. depth for plain and residual networks. Before dragging: at what depth does the plain network begin to degrade? Where does the residual network plateau?
Hint: gradients multiply ~0.85 per layer — how many steps until the product falls below 0.01?
Adding out += x, a one-line change, provides a permanent gradient highway through every block, making networks of 50, 100, or even 1,000 layers trainable where plain stacks degrade.
nnU-Net, a widely used self-configuring framework for medical image segmentation, offers residual-encoder variants of its U-Net as an alternative to plain conv encoders. Skip connections help gradients flow when training on small datasets (100–500 annotated 3D volumes), supporting reliable convergence of 20 M+ parameter models on CT and MRI scans.
Q1 A residual block receives input $\mathbf{x}$ of shape (64, 32, 32) and uses stride=2 to produce output of shape (128, 16, 16). What shape must the projection shortcut output, and which layer achieves this?
Q2 Explain why $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ prevents vanishing gradients even in a 100-layer network.
How factorising the convolution operation achieves an 8–9× reduction in computation — enabling real-time vision on a smartphone.
"ResNet-50 has 25 million parameters and needs 4 GFLOPs for a single image. That's fine in a data-centre. On a 2015 smartphone with 1 W of compute budget, it would take 2 seconds per frame. MobileNet runs at 30 fps."
Standard convolution is like one chef doing everything: mixing all ingredients and cooking simultaneously. Depthwise separable conv is specialised teamwork: one chef handles each ingredient separately (depthwise — spatial), then a coordinator blends all the flavours at the end (pointwise — channel mixing).
Depthwise separable convolution: stage 1 filters each channel independently (spatial), stage 2 mixes channels (1×1). Together they approximate a full conv at ~1/9 the cost.
A standard conv layer with $D_K \times D_K$ kernel, $M$ input channels, $N$ output channels, and $D_F \times D_F$ feature map has total multiply-adds:
$$D_K^2 \times M \times N \times D_F^2$$
Parameters: $D_K^2 \times M \times N$. Every output channel looks at every input channel at every spatial position — expensive cross-channel interactions happen at each kernel position.
If you double both $M$ and $N$ (32→64 input, 64→128 output), how many times more parameters does the layer have?
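Split into two stages. Depthwise ops (1 filter per input channel): $D_K^2 \times M \times D_F^2$. Pointwise ops (1×1 cross-channel): $M \times N \times D_F^2$. Total:
$$D_K^2 \times M \times D_F^2 + M \times N \times D_F^2$$
The ratio to standard conv:
$$\frac{D_K^2 \times M \times D_F^2 + M \times N \times D_F^2}{D_K^2 \times M \times N \times D_F^2} = \frac{1}{N} + \frac{1}{D_K^2}$$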
For $N=64$, $D_K=3$: ratio $= 1/64 + 1/9 \approx 0.127$ — an 87% reduction.
Using the ratio formula $1/N + 1/D_K^2$ with $N=64$, $D_K=3$, verify the 87% reduction result numerically.
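One way to verify the figure numerically is to count parameters and multiply-adds directly. The helper names below are hypothetical, and the values $M = 32$ and $D_F = 56$ are assumptions chosen only for illustration; they cancel out of the ratio.

```python
def standard_conv_cost(d_k: int, m: int, n: int, d_f: int):
    """Return (parameters, multiply-adds) of a standard convolution."""
    params = d_k * d_k * m * n
    return params, params * d_f * d_f


def dw_separable_cost(d_k: int, m: int, n: int, d_f: int):
    """Return (parameters, multiply-adds) of depthwise + pointwise stages."""
    params = d_k * d_k * m + m * n  # depthwise filters + 1x1 pointwise filters
    return params, params * d_f * d_f


def cost_ratio(d_k: int, n: int) -> float:
    """Closed-form ratio 1/N + 1/D_K^2 from the text."""
    return 1 / n + 1 / d_k ** 2


std_params, std_ops = standard_conv_cost(3, 32, 64, 56)
dw_params, dw_ops = dw_separable_cost(3, 32, 64, 56)
print(f"counted ratio : {dw_ops / std_ops:.4f}")
print(f"formula ratio : {cost_ratio(3, 64):.4f}")
print(f"reduction     : {100 * (1 - cost_ratio(3, 64)):.1f}%")
```

Both ratios come out at about 0.127, matching the 87% reduction stated above.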
Depthwise (3×3): detects spatial patterns (edges, textures) within each channel independently — like running a dedicated edge-detector per feature map. Pointwise (1×1): combines channels to form richer representations — like a learned colour mixer. Together they approximate full conv expressiveness at ~1/9 the cost.
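The two stages map onto two `nn.Conv2d` layers in PyTorch. The following is a sketch under the assumption of a MobileNet-style block with BatchNorm and ReLU6 after each stage; `DWSepConv` and `conv_weights` are illustrative names.

```python
import torch
import torch.nn as nn


class DWSepConv(nn.Module):
    """Depthwise 3x3 (per-channel spatial filter) followed by pointwise 1x1."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Sequential(
            # groups=in_ch gives each input channel its own 3x3 filter
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU6(inplace=True),
        )
        self.pointwise = nn.Sequential(
            # 1x1 conv mixes information across channels
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU6(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


def conv_weights(module: nn.Module) -> int:
    """Count convolution weights only (BatchNorm parameters excluded)."""
    return sum(p.numel() for m in module.modules()
               if isinstance(m, nn.Conv2d) for p in m.parameters())


dw = DWSepConv(64, 128)
std = nn.Conv2d(64, 128, 3, padding=1, bias=False)
x = torch.randn(2, 64, 56, 56)
print(dw(x).shape, std(x).shape)             # identical output shapes
print(conv_weights(dw), conv_weights(std))   # 8768 vs 73728
```

The weight counts reproduce the $D_K^2 M + MN$ and $D_K^2 MN$ formulas for $M=64$, $N=128$.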
Below, adjust the number of output channels $N$. Before dragging: at which value of $N$ does the depthwise separable approach give more than 8× fewer operations? Does the benefit grow or shrink as $N$ increases?
Hint: the ratio formula is $1/N + 1/9$. Plug in $N = 9$.
Depthwise separable convolution separates where to detect (depthwise, per-channel) from what to detect (pointwise, cross-channel), cutting operations by 8–9× at a cost of roughly 1% accuracy on ImageNet.
A Thai packaging factory replaced a ResNet-based defect detector (GPU-required, 18 FPS) with a MobileNetV2 model built on depthwise separable blocks. The result: 30 FPS on a $35 Raspberry Pi 4, 91% defect recall, and zero GPU costs — enabling in-line inspection on every production line.
Q1 A standard 3×3 conv has $M=128$ input channels and $N=256$ output channels. Calculate (a) the parameter count, (b) the depthwise separable parameter count, and (c) the reduction ratio.
Q2 In PyTorch, what argument to Conv2d implements the depthwise step, and what value should it have for a layer with 64 input channels?
groups=64. Setting groups=in_channels forces each filter to operate on exactly one input channel, which is the definition of depthwise convolution.
How to transplant days of multi-GPU ImageNet training into your custom task in an afternoon, and which layers to touch.
"Training ResNet-50 on ImageNet from scratch takes 90 epochs on 8 GPUs — roughly 3 days and thousands of dollars. Transfer learning lets you reach the same accuracy on a custom 500-image dataset in 20 minutes on a laptop."
Think of hiring an experienced engineer who already knows physics, CAD, and general manufacturing. You only need to train them on your specific product, not reteach the fundamentals. The pre-trained backbone is the general expertise; your dataset teaches the product-specific knowledge.
Fine-tuning strategy: grey = frozen backbone (universal features), green = partially unfrozen (domain adaptation), blue = fully trainable, purple = new task-specific head always trained.
CNN features become increasingly task-specific with depth:
The decision of where to "cut" drives your strategy.
You have 10,000 X-ray images. X-rays look very different from natural images. Which strategy would you choose and why?
When unfreezing backbone layers, do not use the same learning rate for all layers. Earlier backbone layers contain well-tuned, general features — destroy them with a high LR and accuracy collapses.
The standard practice is a discriminative learning rate schedule:
where $r \in [0.1, 0.5]$ is the decay factor per layer group, $L$ is the total number of groups, and $l$ is the group index (0 = earliest). A common choice in practice is $r = 0.1$, set through PyTorch optimiser parameter groups.
Using $r=0.3$ and $\eta_\text{head}=5\times10^{-4}$, what is $\eta_\text{early backbone}$ with 3 groups?
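A short sketch of how such a schedule can be computed. It assumes one common convention, $\eta_l = \eta_\text{head} \cdot r^{\,L-1-l}$, in which the head is the last group ($l = L-1$) and each step toward the input multiplies the rate by $r$. Other indexing conventions shift the exponent by one, so treat the numbers as illustrative. `group_lrs` is a hypothetical helper.

```python
def group_lrs(lr_head: float, r: float, num_groups: int) -> list:
    """Learning rate per layer group, index 0 = earliest backbone group.

    Assumed convention: lr_l = lr_head * r ** (num_groups - 1 - l),
    so the last group (the head) trains at lr_head.
    """
    return [lr_head * r ** (num_groups - 1 - l) for l in range(num_groups)]


lrs = group_lrs(lr_head=5e-4, r=0.3, num_groups=3)
for l, lr in enumerate(lrs):
    print(f"group {l}: lr = {lr:.2e}")
```

Under this convention the earliest group trains at about $4.5\times10^{-5}$. Each value can be supplied as the `lr` entry of the matching parameter group when constructing a `torch.optim` optimiser.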
Pre-trained models end with a classifier or fc layer outputting 1 000 ImageNet classes. If you forget to replace it with a new head matching your class count, the loss will be computed against the wrong number of outputs — causing either a shape error or silent wrong-task training.
The widget below maps (dataset size, domain similarity) to a fine-tuning strategy recommendation. Before clicking: predict which quadrant requires the most aggressive unfreezing — large dataset + very different domain?
Hint: what happens if you freeze all layers and you have 50,000 images from a completely different domain?
Transfer learning is not a single setting — match your freeze depth to your data: tiny data + similar domain → freeze everything except the head; large data + different domain → unfreeze deep layers with discriminative learning rates 10–100× smaller than the head.
A KMITL industrial partner trained a 5-class surface defect classifier (good, crack, scratch, dent, discolour) on 1,200 product images using MobileNetV2 fine-tuning (Strategy B). Frozen early layers, unfrozen last 3 blocks, discriminative LR. Result: 96.4% accuracy in 25 epochs — a model deployed on a Raspberry Pi to inspect 40 units per minute.
Q1 You have 300 dermoscopy images and want to classify 2 types of skin lesion. ImageNet has no medical images. Which fine-tuning strategy would you choose, and why?
Q2 After calling model = mobilenet_v2(weights=...) and freezing all parameters, you get an error when computing the loss. What did you forget?
Compare Feature Extraction, Partial Fine-Tuning, and Full Fine-Tuning side by side. See exactly which MobileNetV2 layers are frozen, how many parameters are trainable, and what dataset size each strategy requires.
Progressive unfreezing tip: Start with Strategy A for 5 epochs, then gradually unfreeze later blocks toward Strategy B. This "warm-start" prevents the randomly-initialised head from corrupting pre-trained backbone weights during the first chaotic training steps.
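The tip above can be sketched with `optimizer.add_param_group`. The tiny network below is a hypothetical stand-in for a pre-trained backbone (it is randomly initialised and kept small so the example runs quickly); the data are random tensors, so only the mechanics are meaningful.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained network: two backbone stages + new head.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),   # early stage
    nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU()),  # late stage
)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5))
model = nn.Sequential(backbone, head)

# Phase 1 (Strategy A): freeze the backbone, optimise the head only.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(16, 3, 32, 32)      # random placeholder images
y = torch.randint(0, 5, (16,))      # random placeholder labels


def train_step() -> float:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()


for _ in range(5):
    train_step()

# Phase 2 (toward Strategy B): unfreeze the late stage at a lower learning rate.
late_stage = backbone[1]
for p in late_stage.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": list(late_stage.parameters()), "lr": 1e-4})

loss_after_unfreeze = train_step()
print([g["lr"] for g in optimizer.param_groups])
```

Adding a parameter group keeps the optimiser state already accumulated for the head, whereas building a fresh optimiser would discard it.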
Three connected ideas — deeper networks, efficient networks, and borrowed knowledge — that together define modern practical computer vision.
$H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ adds an identity shortcut whose Jacobian $\frac{\partial H}{\partial \mathbf{x}} = \frac{\partial F}{\partial \mathbf{x}} + \mathbf{I}$ always contains the identity term, mitigating vanishing gradients and enabling 100+ layer networks.
Splitting a standard conv into a depthwise (spatial) + pointwise (channel) stage reduces operations to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the standard cost, roughly an 8× saving for 3×3 kernels with 64 output channels.
Match the freeze depth to your data: tiny & similar → freeze backbone, train head; medium & shifted → partial unfreeze with discriminative LR 10× lower on backbone; large & different → full fine-tune.
Midterm Examination. Covers all topics from Weeks 1–7: Image Formation, Spatial Filtering, Feature Detection, Image Alignment, Neural Networks, CNNs, and Modern CNNs & Transfer Learning. Open-note: one A4 handwritten cheat sheet permitted.
Primary sources and interactive resources for Week 7 topics — in recommended reading order.
Covers modern CNN architectures including ResNet, MobileNet, and transfer learning strategies. Detailed mathematical treatment of skip connections and feature re-use.
→ Primary reference for Weeks 5–7
Practical PyTorch implementation of ResNet, MobileNet, and transfer learning pipelines. Includes code for fine-tuning pre-trained models on custom datasets with torchvision.
→ Companion code for all lab exercises
The original ResNet paper. Introduces the degradation problem, proposes residual mappings, and demonstrates 152-layer networks winning ImageNet. Essential reading.
→ arXiv 1512.03385
Original MobileNet paper introducing depthwise separable convolutions for mobile and embedded vision. Contains detailed efficiency analysis and ImageNet benchmark results.
→ arXiv 1704.04861
Comprehensive notes on fine-tuning strategies, feature extraction vs. full fine-tuning, and practical tips for discriminative learning rates. Free online resource.
→ cs231n.github.io/transfer-learning
Deep-dive lecture covering ResNet, batch normalisation, and the evolution from AlexNet to modern architectures. Excellent visualisations of feature hierarchy at each layer.
→ Supplemental Materials (Week 7)
8 exercises covering skip connections, depthwise separable convolutions, and transfer learning fine-tuning strategies.
A plain 6-layer network multiplies the gradient by $\delta = 0.7$ at each layer (from output back to layer 1). A ResNet uses the same 6 layers with skip connections added every 2 layers.
For part (a): gradient at layer 1 = $g_L \cdot \delta^{L-1}$ where $L=6$, so $g_L \cdot 0.7^5$. For part (c): the Jacobian of $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$ with respect to $\mathbf{x}$ is $\nabla_\mathbf{x} H = \nabla_\mathbf{x} F + \mathbf{I}$, so even if $\nabla_\mathbf{x} F \approx 0$ the gradient still has the identity term.
Build a small ResNet-style classifier for CIFAR-10 (32×32×3 input, 10 classes).
Implement ResidualBlock with two 3×3 convolutions and a projection shortcut when channels/stride changes.
Pass torch.randn(8, 3, 32, 32) through the model and print the output shape.
Count the parameters with sum(p.numel() for p in model.parameters()).
After the 3 blocks: tensor shape is (8, 64, 8, 8). After global average pool (nn.AdaptiveAvgPool2d(1)) and flatten: (8, 64). After linear head: (8, 10). The projection shortcut fires only at the (32→64, stride=2) block; use nn.Conv2d(32, 64, 1, stride=2, bias=False) + BN.
Consider a convolution with $D_K = 3$, $M = 64$ input channels, $N = 128$ output channels, and spatial feature map $D_F = 56$.
(a) Standard: $9 \times 64 \times 128 = 73,728$ params. (b) DW-Sep: $9 \times 64 + 64 \times 128 = 576 + 8,192 = 8,768$ params. (c) Ratio: $73728 / 8768 \approx 8.41\times$. Formula: $1/128 + 1/9 = 0.0078 + 0.111 = 0.119 \rightarrow 8.41\times$ ✓. (d) Ops: standard = $73728 \times 56^2 \approx 231\text{M}$; DW-Sep = $(576 + 8192) \times 56^2 \approx 27.5\text{M}$.
Implement and benchmark a depthwise separable convolution block in PyTorch.
Implement DWSepConv(in_ch, out_ch, stride=1) with: depthwise 3×3 + BN + ReLU6, then pointwise 1×1 + BN + ReLU6.
Implement StandardConv(in_ch, out_ch) using a regular 3×3 conv + BN + ReLU.
Pass torch.randn(2, 64, 112, 112) through both and confirm identical output shapes.
Key: nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False) for the depthwise step. The groups=in_ch argument is the critical implementation detail. Both models should produce shape (2, 128, 112, 112).
For each scenario below, choose the correct strategy (A: freeze all / B: partial unfreeze / C: full fine-tune) and justify your choice.
S1: Strategy A — tiny + similar; frozen backbone prevents overfitting. S2: Strategy B — medium + different domain; unfreeze last 2–3 blocks at 10× lower LR. S3: Strategy C — large + very different; full fine-tune needed to adapt early features. Part 4: $\eta_\text{backbone} = 5\times10^{-4} \times 0.1 = 5\times10^{-5}$.
Implement a complete fine-tuning setup for a 4-class defect classifier.
Load mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).
Replace model.classifier with nn.Sequential(nn.Dropout(0.2), nn.Linear(1280, 4)).
Unfreeze the last 3 feature modules (model.features[-3:]). Print the new trainable count.
Use lr=1e-4 for backbone, lr=1e-3 for head using parameter groups.
After step 2: 0 trainable params. After step 3: 5,124 (head only: 1280×4 + 4 bias). After step 4: model.features[-3:] holds the last two inverted residual blocks and the final 1×1 conv layer, roughly 1.2 M params in torchvision's MobileNetV2. Use sum(p.numel() for p in model.parameters() if p.requires_grad) to count.
MobileNetV2 introduces inverted residuals: expand channels (1×1), apply depthwise 3×3, then contract (1×1). This is the reverse of standard bottlenecks.
Part 2 — MV1: $9\times32 + 32\times32 = 288 + 1024 = 1312$ params. MV2 inverted residual: expand ($32\times192 = 6144$) + DW ($9\times192 = 1728$) + contract ($192\times32 = 6144$) = $\mathbf{14016}$ params. MV2 has more parameters per block but the expansion enables more expressive spatial filtering. Part 3: Adding a projection would add parameters and computation in an already constrained block; the goal is efficiency. The skip is free only when dimensions match (identity).
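The Part 2 arithmetic can be reproduced with two small functions. As in the worked answer, the counts cover convolution weights only (no BatchNorm or bias terms); the function names and the default expansion factor of 6 are illustrative assumptions.

```python
def mv1_block_params(c_in: int, c_out: int, k: int = 3) -> int:
    """MobileNetV1 block: depthwise k x k followed by pointwise 1x1."""
    return k * k * c_in + c_in * c_out


def mv2_block_params(c_in: int, c_out: int, expand: int = 6, k: int = 3) -> int:
    """MobileNetV2 inverted residual: 1x1 expand, depthwise k x k, 1x1 contract."""
    hidden = c_in * expand
    return c_in * hidden + k * k * hidden + hidden * c_out


print(mv1_block_params(32, 32))  # 288 + 1024 = 1312
print(mv2_block_params(32, 32))  # 6144 + 1728 + 6144 = 14016
```

The depthwise stage stays cheap even at 192 channels; most of the extra parameters sit in the two 1×1 convolutions.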
Build a complete fine-tuning and evaluation pipeline for a 5-class product defect classifier using MobileNetV2.
Create torch.randn(100, 3, 224, 224) with random integer labels (0–4). Wrap in a TensorDataset and DataLoader(batch_size=16).
Define criterion = nn.CrossEntropyLoss() and an Adam optimiser (head-only LR=1e-3).
To add parameters to an existing optimiser after unfreezing: call optimizer.add_param_group({'params': new_params, 'lr': 1e-4}). For accuracy: preds = logits.argmax(dim=1); acc = (preds == labels).float().mean(). With random data, expect ~20% accuracy (chance level for 5 classes) on unseen samples regardless of training, although training accuracy can climb as the model memorises the noise. The goal is verifying that the pipeline runs without errors.