Computer Vision — Week 5

Neural Networks for Vision

From perceptron to MLP — the mathematical foundations that power every modern computer vision system, built from scratch.

Perceptron & MLP | Activation Functions | Backpropagation | SGD · Momentum · Adam | MNIST Classification

From Perceptron to MLP

Activation functions, loss surfaces, and the chain rule — the three pillars of every neural network for vision.

After this section you will be able to
  • Describe the perceptron model and its geometric interpretation as a linear decision boundary.
  • Distinguish the output behaviour of ReLU, Sigmoid, and Tanh activation functions and explain why non-linearity is essential.
  • Compute mean squared error and cross-entropy loss by hand and identify which suits classification tasks.

"Every convolution, attention head, and transformer block in modern computer vision is just a stack of perceptrons learning one curve at a time."

🧠
Why this matters: Before you can fine-tune ResNet or train a YOLO detector, you must understand what a single neuron does — what it learns, where it fails, and what "activation" actually means. This week is the foundation that makes every future topic legible.
💡
Analogy Bridge

Think of a single neuron as a light dimmer switch. The inputs are the wires, the weights are the resistance settings, and the activation function is the dimmer itself — it decides how much signal actually passes through. Stack a hundred dimmers in the right configuration and you can sculpt any shape of decision boundary.

[Figure: MLP diagram. Input (x₁, x₂, x₃) → Hidden 1 (four σ(z) units) → Hidden 2 (four σ(z) units) → Output (ŷ₁, ŷ₂), connected by weights W¹, W², W³. Each layer computes z = Wx + b, a = σ(z).]

MLP anatomy: input → fully-connected hidden layers with activation σ → output. Each edge carries a learned weight; each node applies a non-linearity.

  • ReLU: max(0, z). Sparse, fast, the default choice.
  • Sigmoid: 1/(1+e⁻ᶻ). Squashes to (0, 1); its gradient vanishes in deep networks.
  • Tanh: zero-centred, sigmoid-shaped curve with range (−1, 1).
  • Softmax: exp(zᵢ)/Σⱼ exp(zⱼ). Turns logits into output probabilities.
Problem

Perceptron & Linear Pre-activation

A single neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function:

$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$
$$a = \sigma(z)$$
  • z (pre-activation): raw weighted sum before the non-linearity
  • w (weight vector): learned strength of each input connection
  • b (bias): shifts the boundary; allows a fit when x = 0
  • σ (activation): introduces non-linearity (ReLU, Tanh, …)
📝 Worked Example — Single Neuron Forward Pass
  1. Given inputs x = [0.5, −1.0, 2.0], weights w = [0.3, −0.6, 0.1], bias b = 0.2.
  2. Compute pre-activation: z = (0.3)(0.5) + (−0.6)(−1.0) + (0.1)(2.0) + 0.2
  3. $$z = 0.15 + 0.60 + 0.20 + 0.20 = 1.15$$
  4. Apply ReLU: a = max(0, 1.15) = 1.15. Apply Sigmoid: a = 1/(1+e⁻¹·¹⁵) ≈ 0.760.
Result: z = 1.15 → ReLU output: 1.15 | Sigmoid output: ≈ 0.760
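
The forward pass above can be checked in a few lines of plain Python. This is a minimal sketch using only the standard library; the helper names `neuron_forward`, `relu`, and `sigmoid` are illustrative choices, not part of any framework.

```python
import math


def neuron_forward(x, w, b):
    """Pre-activation of a single neuron: z = w·x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def relu(z):
    """ReLU passes positive values through and clips negatives to 0."""
    return max(0.0, z)


def sigmoid(z):
    """Logistic sigmoid, squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Values from the worked example
x = [0.5, -1.0, 2.0]
w = [0.3, -0.6, 0.1]
b = 0.2

z = neuron_forward(x, w, b)
print(f"z          = {z:.2f}")
print(f"ReLU(z)    = {relu(z):.3f}")
print(f"Sigmoid(z) = {sigmoid(z):.3f}")
```

Setting `w = [0.0, 0.0, 0.0]` in the same script leaves z equal to the bias alone, which is the situation the Quick Check asks about.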
Quick Check

If all weights were 0, what would the neuron output regardless of input?

z = b = 0.2 → σ(0.2). The output depends only on the bias term. This is why biases allow the network to learn shifts even when inputs are zero.

Loss Functions

The loss quantifies how wrong the network's prediction is. Two canonical choices:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$
$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$

MSE for regression; Cross-Entropy for classification (pairs with Softmax output).

📝 Worked Example — Cross-Entropy Loss
  1. True label: class 2 (one-hot: y = [0, 0, 1]). Model outputs softmax probabilities: p̂ = [0.1, 0.2, 0.7].
  2. Only the true class term survives: $\mathcal{L} = -\log(0.7) \approx 0.357$
  3. If the model were wrong, with p̂ = [0.6, 0.3, 0.1], then $\mathcal{L} = -\log(0.1) \approx 2.303$. Higher loss → larger gradient signal.
Result: CE loss scales with confidence: correct & confident ≈ 0, wrong & confident ≈ large
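
To make the two cases above concrete, here is a small standard-library sketch of softmax followed by cross-entropy. The function names and the example logits `[1.0, 2.0, 3.0]` are hypothetical; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max logit before exp."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """With a one-hot target, cross-entropy reduces to -log p(true class)."""
    return -math.log(probs[true_class])


# The two cases from the worked example (true class index 2)
loss_right = cross_entropy([0.1, 0.2, 0.7], true_class=2)
loss_wrong = cross_entropy([0.6, 0.3, 0.1], true_class=2)
print(f"confident and right: {loss_right:.3f}")  # 0.357
print(f"confident and wrong: {loss_wrong:.3f}")  # 2.303

# Softmax on hypothetical logits: outputs are positive and sum to 1
probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])
```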
Quick Check

What cross-entropy loss does a perfect prediction of probability 1.0 give?

−log(1.0) = 0. Perfect confidence in the correct class yields zero loss.
⚠️
Common Mistake

Applying Sigmoid to hidden layers in deep nets causes vanishing gradients. Because σ'(z) ≤ 0.25, gradients shrink by ≥ 75 % at every layer. With 10 layers, the gradient reaching layer 1 is less than 10⁻⁶ of the original. Use ReLU (or LeakyReLU) for all hidden layers; reserve Sigmoid for the final binary output.
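
The shrinkage claim can be reproduced with simple arithmetic. The sketch below computes only the best-case product of Sigmoid derivatives (0.25 per layer); it ignores the weight matrices, which is a simplifying assumption, and the function name is illustrative.

```python
def sigmoid_gradient_bound(num_layers, max_derivative=0.25):
    """Upper bound on the product of sigmoid derivatives across layers."""
    return max_derivative ** num_layers


for depth in (1, 5, 10, 20):
    print(f"{depth:2d} layers: at most {sigmoid_gradient_bound(depth):.2e}")
```

For 10 layers the bound is already below 10⁻⁶, matching the figure quoted above.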

Solution
Pause & Predict

Before moving the slider: if you increase the input x from −3 to +3 for a ReLU neuron with weight w = 2 and bias b = −1, at what value of x does the neuron "turn on"?

Think about when z = wx + b transitions from negative to positive.

Try It: Single Neuron Live Calculator

Adjust weight, bias, and input to see the pre-activation z and output a for each activation function in real time.

Live calculation of z = w·x + b at w = 1.5, b = −0.5, x = 1.0
z = (1.5)(1.0) + (−0.5) = 1.00
  • ReLU(z) = max(0, 1.00) = 1.000
  • Sigmoid(z) = 1/(1+e⁻¹·⁰⁰) ≈ 0.731
  • Tanh(z) = tanh(1.00) ≈ 0.762
Implementation
Python · PyTorch — Building a 2-Layer MLP
import torch
import torch.nn as nn

# Define a 2-hidden-layer MLP for 10-class image classification
class SimpleMLP(nn.Module):
    def __init__(self, input_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes)
        )

    def forward(self, x):
        # Flatten 28×28 image to 784-dim vector
        x = x.view(x.size(0), -1)
        return self.net(x)

model = SimpleMLP()
dummy = torch.randn(4, 1, 28, 28)
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 10])
Output
torch.Size([4, 10])
Key Takeaway

A neural network is a composition of linear transformations and non-linear activations. The activations are what let it represent curved, non-linear decision boundaries; without them, any stack of layers collapses to a single linear map.

🏥
Real-World Application

Diabetic Retinopathy Grading (Google, 2016)

An MLP head on top of convolutional features classifies retinal fundus images into 5 severity grades — from no disease to proliferative DR. The final fully-connected layer is exactly the linear + softmax step you just computed, applied to 2048-dimensional CNN features. The same neuron math, operating on richer input.

Checkpoint Quiz Topic 1 — Perceptron to MLP

Q1 A neuron has w = [2, −1] and b = 0.5. Input x = [1, 3]. What is z?

z = 2(1) + (−1)(3) + 0.5 = 2 − 3 + 0.5 = −0.5. ReLU would output 0; Sigmoid would output ≈ 0.378.

Q2 Why does a network with only linear layers (no activations) fail to learn non-linear patterns?

A composition of linear functions is still linear: W₃(W₂(W₁x)) = Wx. No matter how many layers, without activations the network can only learn a single hyperplane boundary — it collapses to a one-layer linear model.
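
The collapse described in the answer to Q2 can be verified numerically. This is a small sketch with hand-picked, hypothetical matrices (biases omitted for brevity) and plain-Python helpers; passing x through three linear layers gives the same result as one multiplication by the product matrix.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


# Hypothetical weights for three linear layers, no activations
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # 2 -> 3
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]     # 3 -> 2
W3 = [[2.0, 1.0]]                            # 2 -> 1
x = [0.5, -1.5]

layered = matvec(W3, matvec(W2, matvec(W1, x)))   # layer by layer
W = matmul(W3, matmul(W2, W1))                    # single equivalent matrix
collapsed = matvec(W, x)

print("layer by layer:", layered)
print("single matrix :", collapsed, "with W =", W)
```

Inserting a ReLU between the layers breaks this equality, which is the point of the non-linearity.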

Backpropagation & Optimization

The chain rule turns the loss into a gradient signal that flows backwards through every weight — and optimizers decide how to use that signal.

After this section you will be able to
  • Derive the gradient of a loss with respect to a weight using the chain rule in a 2-layer network.
  • Explain the difference between vanilla SGD, SGD with Momentum, and Adam, including when each is preferred.
  • Identify vanishing and exploding gradient symptoms in a training loss curve.

"Backpropagation is just the chain rule — applied cleverly to a computational graph of millions of operations without re-computing anything twice."

⛰️
Why this matters: Every deep learning framework — PyTorch, JAX, TensorFlow — is built around automatic differentiation via backprop. Understanding it lets you diagnose exploding/vanishing gradients, design custom loss functions, and reason about why some architectures train faster than others.
💡
Analogy Bridge

Imagine you're trying to find the lowest valley on a foggy mountain (the loss surface). You can only feel the ground directly under your feet (local gradient). SGD takes small steps in the steepest downhill direction. Momentum adds a marble rolling down — it builds up speed. Adam carries its own GPS that adapts the step size for each direction independently.

  1. Forward pass: compute ŷ from x.
  2. Compute loss: ℒ(ŷ, y) = −Σ y log ŷ.
  3. Backprop: compute ∂ℒ/∂W with the chain rule.
  4. Update weights: W ← W − η ∂ℒ/∂W.

The four-step training loop: forward pass → loss → backward pass (chain rule) → weight update. Repeated for every mini-batch.

  • Complexity, O(P): backprop computes all P gradients in one backward pass, at a cost of the same order as the forward pass
  • Typical lr, 10⁻³: Adam's default learning rate; often the best starting point
  • Adam β₁ = 0.9: momentum decay; keeps 90 % of the previous gradient direction
  • Batch size, 32–256: gradient estimated on a mini-batch; trades noise for speed
Problem

Chain Rule — Gradient Flow

For a composition of functions $\mathcal{L} = f(g(h(\mathbf{W})))$, the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W_{ij}}$$

Each term is local — computed at that node — so the network never needs to see the full expression.

📝 Worked Example — Backprop Through One Layer
  1. Network: z = wx + b, a = σ(z) = sigmoid(z), ℒ = (y − a)². Given: x = 2, w = 0.5, b = 0, y = 1.
  2. Forward: z = 0.5 × 2 = 1.0, a = sigmoid(1.0) ≈ 0.731, ℒ = (1 − 0.731)² ≈ 0.0724
  3. Backward step by step:
     ∂ℒ/∂a = −2(y − a) = −2(0.269) ≈ −0.538
     ∂a/∂z = σ(z)(1−σ(z)) = 0.731 × 0.269 ≈ 0.197
     ∂z/∂w = x = 2
  4. $$\frac{\partial \mathcal{L}}{\partial w} = (-0.538)(0.197)(2) \approx -0.212$$
Update: w ← w − η(∂ℒ/∂w) = 0.5 − 0.01 × (−0.212) = 0.5021 (with η = 0.01)
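
A useful habit when deriving gradients by hand is to compare them with a finite-difference estimate. The sketch below does this for the one-neuron example above, using only the standard library; the function names and the step size `h = 1e-6` are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_fn(w, x=2.0, b=0.0, y=1.0):
    """Squared error of a single sigmoid neuron."""
    a = sigmoid(w * x + b)
    return (y - a) ** 2


def analytic_grad(w, x=2.0, b=0.0, y=1.0):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x + b
    a = sigmoid(z)
    dL_da = -2.0 * (y - a)
    da_dz = a * (1.0 - a)
    dz_dw = x
    return dL_da * da_dz * dz_dw


w = 0.5
grad = analytic_grad(w)

# Central finite difference as an independent check
h = 1e-6
numeric = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

w_new = w - 0.01 * grad  # one gradient-descent step with eta = 0.01
print(f"analytic grad: {grad:.4f}")
print(f"numeric grad : {numeric:.4f}")
print(f"updated w    : {w_new:.4f}")
```

If the two gradient values disagree by more than a small tolerance, the hand derivation usually contains a sign or factor error.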
Quick Check

Why is the gradient negative (−0.212) and the weight update positive? What does this tell you about the loss landscape?

A negative gradient means increasing w would decrease the loss (the loss slopes downward as w increases). Subtracting a negative gradient moves w in the direction that lowers loss — exactly what we want.

Optimizers: SGD → Momentum → Adam

$$\text{SGD: } \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w}\mathcal{L}$$
$$\text{Momentum: } \mathbf{v} \leftarrow \beta \mathbf{v} - \eta \nabla_\mathbf{w} \mathcal{L}; \quad \mathbf{w} \leftarrow \mathbf{w} + \mathbf{v}$$
$$\text{Adam: } \mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon}$$

Adam maintains per-parameter estimates of the first moment m̂ (a running mean of the gradient) and the second moment v̂ (a running mean of the squared gradient, i.e. the uncentred variance), giving each parameter its own adaptive learning rate.
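
The first two update rules can be compared on a toy problem. This sketch assumes a 1D loss L(w) = w² (so the gradient is 2w), a starting point of w = 5, and 10 steps; these choices and the function names are illustrative only.

```python
def sgd_step(w, grad, lr):
    """Vanilla SGD: step against the gradient."""
    return w - lr * grad


def momentum_step(w, v, grad, lr, beta=0.9):
    """SGD with momentum: the velocity v accumulates past gradients."""
    v = beta * v - lr * grad
    return w + v, v


def grad_fn(w):
    """Gradient of the toy loss L(w) = w^2."""
    return 2.0 * w


w_sgd = 5.0
w_mom, v = 5.0, 0.0
for _ in range(10):
    w_sgd = sgd_step(w_sgd, grad_fn(w_sgd), lr=0.01)
    w_mom, v = momentum_step(w_mom, v, grad_fn(w_mom), lr=0.01)

print(f"SGD after 10 steps     : w = {w_sgd:.3f}")
print(f"Momentum after 10 steps: w = {w_mom:.3f}")
```

With the same learning rate, the momentum run gets closer to the minimum at w = 0 in the same number of steps because its velocity builds up; run for longer and it will overshoot before settling.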

📝 Worked Example — One Adam Step
  1. t=1, η=0.001, β₁=0.9, β₂=0.999, ε=1e-8. Gradient g = 0.4.
  2. m ← 0.9(0) + 0.1(0.4) = 0.04  |  v ← 0.999(0) + 0.001(0.16) = 0.00016
  3. Bias-correct: m̂ = 0.04/(1−0.9¹) = 0.4  |  v̂ = 0.00016/(1−0.999¹) ≈ 0.16
  4. $$w \leftarrow w - 0.001 \times \frac{0.4}{\sqrt{0.16} + 10^{-8}} \approx w - 0.001$$
Result: Adam step ≈ η × sign(g) at t=1 due to bias correction normalising the estimate
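
The same step can be written as a scalar function. This is a sketch of the standard Adam update for a single parameter, not PyTorch's implementation; the starting weight `w = 1.0` is a hypothetical value, since the worked example leaves w symbolic.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (w, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Hypothetical starting weight; gradient and hyperparameters from the example
w, m, v = adam_step(w=1.0, grad=0.4, m=0.0, v=0.0, t=1)
print(f"m = {m:.5f}, v = {v:.5f}, w = {w:.6f}")
```

Changing `grad` to 40.0 or 0.004 gives almost the same step size at t = 1, which illustrates why the first Adam step behaves like η × sign(g).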
💡
Key Insight — Why Adam Wins in Practice

Adam adapts the effective learning rate per parameter. Parameters with large gradients (e.g., early output layer) get smaller effective steps; parameters with small or sparse gradients (e.g., embedding rows) get larger effective steps. This makes Adam far more robust to heterogeneous gradient scales across a deep network.

Solution
Pause & Predict

Before adjusting the learning rate slider: what do you predict will happen to the loss curve if η is set to 0.9 (very large)? What about η = 0.000001 (very small)?

Consider overshooting the minimum vs. crawling too slowly.

Try It: Learning Rate Effect Visualizer

Watch gradient descent on a 1D loss function (parabola). Adjust the learning rate to see convergence, oscillation, or divergence.
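
The three regimes can also be reproduced offline. The sketch below assumes the parabola L(w) = w² with gradient 2w, a start at w = 1, and 20 steps; the visualizer's actual function and settings may differ.

```python
def run_gd(lr, w0=1.0, steps=20):
    """Gradient descent on L(w) = w^2; returns the list of visited w values."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - lr * 2.0 * w   # each step multiplies w by (1 - 2*lr)
        trajectory.append(w)
    return trajectory


for lr in (0.01, 0.3, 0.9, 1.1):
    final = run_gd(lr)[-1]
    print(f"lr = {lr:<4}: |w| after 20 steps = {abs(final):.3g}")
```

On this parabola each step scales w by (1 − 2η): small η crawls, moderate η converges quickly, η between 0.5 and 1 flips sign every step while still shrinking, and η above 1 diverges.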

[Interactive plot: loss surface, current position, and trajectory; learning-rate slider shown at η = 0.30]
Optimizer equation: SGD: w ← w − η · ∂ℒ/∂w
Implementation
Python · PyTorch — Comparing Optimizers
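import torch
import torch.nn as nn

# Assumes the SimpleMLP class from the Topic 1 listing is already defined.
model = SimpleMLP()
criterion = nn.CrossEntropyLoss()

# SGD with momentum
opt_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam: the default hyperparameters work well
opt_adam = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8
)

# Placeholder batch so the step below runs; swap in a real MNIST batch.
images = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

# One training step (use whichever optimizer)
optimizer = opt_adam
optimizer.zero_grad()               # clear gradients from any previous step
outputs = model(images)             # forward pass -> logits of shape [64, 10]
loss = criterion(outputs, labels)   # cross-entropy on raw logits
loss.backward()                     # backprop fills the .grad buffers
optimizer.step()                    # apply the update
print(f"Loss: {loss.item():.4f}")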
Output
Loss: ≈ 2.30 (the exact value varies with initialisation; an untrained 10-class network sits near ln(10) ≈ 2.3026)
Key Takeaway

Backpropagation is an efficient algorithm for computing all gradients simultaneously using the chain rule; the optimizer then decides how to translate those gradients into weight updates — and Adam's per-parameter adaptive learning rate makes it the default choice for most vision models.

🚗
Real-World Application

Tesla Autopilot Training Pipeline

Tesla trains its multi-task vision networks (object detection + depth + lane detection) using Adam with carefully scheduled learning rate warmup and cosine decay — applied across hundreds of GPUs. The same backprop loop you just implemented runs billions of times across billions of labelled driving frames. The chain rule is the engine of autonomous perception.

Checkpoint Quiz Topic 2 — Backpropagation

Q1 What does it mean when training loss oscillates wildly and doesn't decrease?

Learning rate is too large. The gradient descent step overshoots the minimum, bouncing back and forth on the loss surface. Fix: reduce η (try dividing by 10).

Q2 After 5 layers of Sigmoid activations, the gradient of the first layer is ≈ 10⁻⁷. What problem is this, and what is the standard fix?

Vanishing gradient. Each Sigmoid scales the gradient by a factor of at most 0.25, so 5 layers contribute at most 0.25⁵ ≈ 10⁻³, and small weights and a small loss gradient shrink it further. Fix: replace hidden Sigmoid activations with ReLU, add Batch Normalization, or use residual connections.

MNIST — The Vision Foundation Benchmark

Digit classification with a handcrafted MLP proves every concept from Topics 1 & 2 is real — and reveals exactly where a flat network hits its ceiling.

After this section you will be able to
  • Build a complete MNIST training loop — data loading, forward pass, loss, backward pass, and evaluation — in PyTorch.
  • Interpret training and validation loss curves to diagnose underfitting, overfitting, and convergence.
  • Justify why MLPs on pixel grids are limited and why convolutional layers address those limitations.

"MNIST is the 'Hello World' of computer vision — deceptively simple yet containing every lesson you need before tackling real-world image data."

✍️
Why this matters: MNIST is small enough to train in minutes on a laptop yet demonstrates the full learning cycle: data loading, batching, forward pass, cross-entropy loss, backprop, accuracy tracking, and the moment you realise pixel-wise MLPs break as soon as images shift by one pixel — motivating everything in Week 6 (CNNs).
💡
Analogy Bridge

An MLP on MNIST is like memorising what the digit "3" looks like pixel-by-pixel in a fixed position. The moment someone writes "3" slightly to the left, the network is confused — it learned position, not shape. Convolutional filters (Week 6) solve this by detecting edges and curves regardless of location — they are translation-equivariant.

[Figure: MNIST pipeline. Raw image (28×28, uint8 grayscale, 0–255) → Normalise (÷255, mean = 0.1307, std = 0.3081) → Flatten (784-dim, row-major) → MLP (784→256→ReLU, 256→128→ReLU, 128→10 logits) → Softmax + loss (10 class probabilities, ℒ_CE, accuracy metric) → argmax(p̂), e.g. "7".]

Full MNIST pipeline: raw 28×28 image → normalise → flatten to 784-dim vector → MLP → softmax cross-entropy loss → predicted digit class.

  • 70k samples: 60k training + 10k test; 28×28 grayscale
  • 784 input dims: 28×28 pixels flattened; ignores spatial structure
  • 97–98% MLP accuracy: ceiling for a 2-layer MLP; a CNN reaches 99.7%
  • < 1 min train time: 10 epochs on a laptop GPU or Colab CPU
Problem

Training Loop Components

Every supervised training run follows the same cycle:

  • DataLoader — batches images + labels; shuffles each epoch
  • Forward pass — model(x) → logits (shape [B, 10])
  • Loss — CrossEntropyLoss(logits, labels)
  • Backward — loss.backward() → fills .grad buffers
  • Step — optimizer.step() updates weights
  • Zero grad — optimizer.zero_grad() before next batch
📝 Worked Example — Accuracy Calculation
  1. Batch of 4 images. Model outputs (probabilities after softmax): p̂ = [[0.02, 0.85, 0.01, …], [0.70, 0.05, …], [0.03, 0.02, 0.91, …], [0.01, 0.02, 0.04, …, 0.88]]
  2. Predicted classes = argmax per row: [1, 0, 2, 9]. True labels: [1, 0, 2, 7].
  3. Correct = 3 out of 4. Batch accuracy = 3/4 = 75%.
Result: Accuracy = (correct predictions) / (batch size) = 3/4 = 75.0%
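
The same bookkeeping in plain Python, as a minimal sketch. Because the probability rows in the example are elided, the accuracy call below starts from the predicted classes in step 2; the short 3-class rows used to demonstrate `argmax` are hypothetical.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def batch_accuracy(pred_classes, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(int(p == t) for p, t in zip(pred_classes, labels))
    return correct / len(labels)


# argmax on hypothetical 3-class probability rows
toy_probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
toy_preds = [argmax(row) for row in toy_probs]
print(toy_preds)  # [1, 0]

# Predicted classes and labels from the worked example
acc = batch_accuracy([1, 0, 2, 9], [1, 0, 2, 7])
print(f"batch accuracy = {acc:.1%}")  # 75.0%
```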
Quick Check

A random model on MNIST (10 classes) would score approximately what accuracy?

≈ 10%. With 10 equally likely classes, random guessing is correct 1/10 of the time. Any model doing worse than 10% accuracy has learned something wrong!

Why MLPs Fail on Shifted Images

An MLP on a flattened 784-dim vector treats pixel (3, 5) and pixel (3, 6) as completely unrelated features. If the digit is shifted by 1 pixel, entirely different weights are activated. The network has no translation invariance.
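
A toy illustration of this effect, under strong simplifying assumptions: a hypothetical 5×5 binary image containing a one-pixel-wide vertical stroke. After flattening, shifting the stroke one column to the right lights up a completely disjoint set of input indices.

```python
def make_stroke(col, size=5):
    """size x size binary image with a vertical stroke in column `col`."""
    return [[1 if c == col else 0 for c in range(size)] for _ in range(size)]


def flatten(image):
    """Row-major flattening, as done before feeding an MLP."""
    return [pixel for row in image for pixel in row]


def active_indices(vector):
    """Positions of the non-zero inputs in the flattened vector."""
    return {i for i, pixel in enumerate(vector) if pixel}


original = active_indices(flatten(make_stroke(col=2)))
shifted = active_indices(flatten(make_stroke(col=3)))

print("original stroke uses inputs:", sorted(original))
print("shifted stroke uses inputs :", sorted(shifted))
print("shared inputs              :", sorted(original & shifted))
```

Real MNIST strokes are several pixels wide, so a one-pixel shift leaves partial overlap rather than none; the mismatch grows with the size of the shift.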

Parameter count: 784×256 + 256×128 + 128×10 = 200,704 + 32,768 + 1,280 = 234,752 weights (biases excluded), but zero spatial awareness.

$$\text{MLP ceiling: } \approx 97\text{–}98\%$$
$$\text{CNN: } \approx 99.7\% \text{ (Week 6)}$$
📝 Worked Example — Counting MLP Parameters
  1. Layer 1: 784 inputs × 256 neurons + 256 biases = 200,704 + 256 = 200,960
  2. Layer 2: 256 × 128 + 128 = 32,896
  3. Output: 128 × 10 + 10 = 1,290
Total: 200,960 + 32,896 + 1,290 = 235,146 trainable parameters
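
Counting by hand is error-prone, so a short helper is a handy cross-check. This sketch assumes fully-connected layers with one bias per output unit; the function name is illustrative.

```python
def count_mlp_parameters(layer_sizes):
    """Total weights plus biases for a fully-connected MLP.

    layer_sizes lists the width of every layer, input first,
    e.g. [784, 256, 128, 10].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = n_in * n_out
        biases = n_out
        print(f"{n_in:4d} -> {n_out:4d}: {weights:7,d} weights + {biases:3d} biases")
        total += weights + biases
    return total


total = count_mlp_parameters([784, 256, 128, 10])
print(f"total trainable parameters: {total:,}")
```

For a PyTorch model, `sum(p.numel() for p in model.parameters())` gives the same kind of count directly from the module.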
⚠️
Common Mistake

Forgetting optimizer.zero_grad() before each backward pass causes gradient accumulation. PyTorch adds each new gradient to the existing .grad buffer rather than replacing it, so after N batches the buffer holds the sum of all N batch gradients, and the resulting weight updates can be far too large. Call zero_grad() at the top of the training loop body unless you are accumulating gradients on purpose.
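
The accumulation behaviour is easy to observe on a single scalar parameter. In this sketch the loss `3 * w` is a hypothetical stand-in whose gradient is always 3, and setting `w.grad = None` plays the role that `optimizer.zero_grad()` plays in a real training loop.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# Without resetting: each backward() adds to the existing .grad buffer
accumulated = []
for _ in range(3):
    loss = 3.0 * w            # d(loss)/dw is always 3
    loss.backward()
    accumulated.append(w.grad.item())

# With a reset before each backward pass
fresh = []
for _ in range(3):
    w.grad = None             # stands in for optimizer.zero_grad()
    loss = 3.0 * w
    loss.backward()
    fresh.append(w.grad.item())

print("no reset  :", accumulated)  # [3.0, 6.0, 9.0]
print("with reset:", fresh)        # [3.0, 3.0, 3.0]
```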

Solution
Pause & Predict

Looking at the digit sketch grid below: which digits do you think an MLP will confuse most often, and why?

Think about which digits look similar when drawn with similar stroke weights (3 vs 8, 4 vs 9, 1 vs 7).

Try It: MNIST Confusion Visualizer

Sketch-style visualisation of typical MLP confusion patterns. Toggle the epoch slider to see how confidence evolves as training progresses.

[Interactive chart: training accuracy, validation accuracy, and training loss per epoch; epoch slider shown at 5]
Epoch metrics (simulated), epoch 5: Train acc: 95.8% | Val acc: 94.2% | Loss: 0.142
Implementation
Python · PyTorch — Full MNIST Training Loop
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# 1. Data loading and normalisation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST(
    './data', train=True, download=True, transform=transform
)
loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# 2. Model, loss, optimizer (SimpleMLP is the class defined in Topic 1)
model = SimpleMLP(input_dim=784, hidden=256, n_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 3. Training loop
for epoch in range(10):
    correct = total = 0
    for images, labels in loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Running accuracy on the training batches seen this epoch
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    print(f"Epoch {epoch+1}: acc={correct/total*100:.2f}%")
Output
Epoch 1: acc=92.34%
Epoch 2: acc=95.81%
Epoch 3: acc=96.74%
Epoch 4: acc=97.20%
Epoch 5: acc=97.55%
Epoch 6: acc=97.83%
Epoch 7: acc=97.98%
Epoch 8: acc=98.12%
Epoch 9: acc=98.19%
Epoch 10: acc=98.24%
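
The loop above reports accuracy on the training batches only. To measure generalisation you would also score held-out data; the helper below is one possible sketch. The name `evaluate` and the choice to return a percentage are assumptions, and it works with any iterable that yields `(images, labels)` batches.

```python
import torch


@torch.no_grad()  # no gradients are needed when only measuring accuracy
def evaluate(model, loader):
    """Return accuracy in percent over every batch in `loader`."""
    model.eval()                      # switch off Dropout/BatchNorm updates
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    model.train()                     # restore training mode
    return 100.0 * correct / total
```

With the MNIST setup above, a test loader could be built from `datasets.MNIST('./data', train=False, download=True, transform=transform)` and passed in as `evaluate(model, test_loader)`, where `test_loader` is a name chosen here for illustration.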
Key Takeaway

A simple MLP achieves ~98% accuracy on MNIST, proving that the perceptron-to-MLP building blocks work. The remaining ~2% error reflects its spatial blind spot, which convolutional layers in Week 6 are designed to overcome.

📬
Real-World Application

USPS Handwritten ZIP Code Reader (LeCun, 1989)

Yann LeCun's 1989 network read handwritten ZIP codes from US Postal Service envelopes. It was an early convolutional network: sharing weights across image positions addressed the same translation problem you just encountered with your MLP. MNIST, assembled later from NIST handwriting samples, became the standard benchmark for this line of work, and modern OCR systems, cheque readers, and form scanners build on it.

Checkpoint Quiz Topic 3 — MNIST & MLP Limits

Q1 If training accuracy is 98% but test accuracy is 72%, what problem does this indicate?

Overfitting. The model has memorised training examples but fails to generalise. Fix: add Dropout layers, reduce model capacity, or train with data augmentation.

Q2 Why does an MLP trained on MNIST fail when the digit is shifted 3 pixels to the right at test time?

The MLP operates on flattened pixel vectors — pixel at position (i, j) and position (i, j+3) are completely independent features. A 3-pixel shift activates entirely different weights, producing a distribution the model has never seen. MLPs have no translation equivariance; convolutional layers do.

Activation Function Explorer

Compare ReLU, Sigmoid, and Tanh side-by-side — with their derivatives overlaid. Adjust the input range and see how each function transforms values and what gradient magnitude it passes backwards.

Settings shown: input range ±4, cursor at x = 1.5
Functions: ReLU max(0, x) | Sigmoid 1/(1+e⁻ˣ) | Tanh tanh(x)

Live Value Table (x = 1.5)
Function   f(x)    f′(x)
ReLU       1.500   1.000
Sigmoid    0.818   0.149
Tanh       0.905   0.181

Gradient Comparison
  • ReLU: either 0 or 1, no squashing
  • Sigmoid: max 0.25, vanishes in deep nets
  • Tanh: max 1.0, zero-centred, better
Plot legend: function curves (ReLU, Sigmoid, Tanh), derivative f′(x), cursor position x
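
The value table can be regenerated for any x with a few standard-library functions. This is a minimal sketch; the function names are illustrative, and the ReLU derivative at exactly x = 0 is taken to be 0 by convention.

```python
import math


def relu(x):
    return max(0.0, x)


def relu_deriv(x):
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_deriv(x):
    return 1.0 - math.tanh(x) ** 2


x = 1.5
rows = [
    ("ReLU", relu, relu_deriv),
    ("Sigmoid", sigmoid, sigmoid_deriv),
    ("Tanh", math.tanh, tanh_deriv),
]
print(f"{'Function':<9}{'f(x)':>8}{'df/dx':>8}")
for name, f, df in rows:
    print(f"{name:<9}{f(x):>8.3f}{df(x):>8.3f}")
```

Evaluating the derivatives at x = 0 reproduces the maxima listed in the comparison: 0.25 for Sigmoid and 1.0 for Tanh.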

What You've Learned

Three building blocks — neurons, gradients, and a classifier loop — that underpin every vision model you will build for the rest of the course.

🧠

Perceptron → MLP

A neuron is z = Wx + b followed by σ(z). Stack enough of them with ReLU activations and you can approximate any continuous function on a bounded region of ℝⁿ to arbitrary accuracy. Loss (MSE or Cross-Entropy) measures how wrong the prediction is.

⛓️

Backpropagation

The chain rule propagates ∂ℒ/∂W backwards through every layer in a single O(P) pass. No re-computation needed. This is what loss.backward() does in PyTorch.

Optimizers

SGD takes fixed steps. Momentum builds velocity. Adam uses per-parameter adaptive rates — making it the default for vision tasks. Learning rate is your most critical hyperparameter.

✍️

MNIST in Practice

~98% MLP accuracy proves the math works. But pixel-wise MLPs lack translation invariance: shift the digits by a few pixels at test time and accuracy drops sharply. This motivates convolutional layers in Week 6.

📉

Diagnosing Training

Oscillating loss → η too large. Loss not moving → η too small or vanishing gradients. Large train/val gap → overfitting. Poor performance on both train and val → underfitting. Read the curve before tuning anything.

Coming up → Week 6

Convolutional Neural Networks — how a small sliding kernel replaces 234k weights with 3k, achieves translation equivariance, and pushes MNIST accuracy past 99.7%. Plus pooling, receptive fields, and the CNN feature hierarchy.

Go Deeper

Textbook chapters, interactive visualisations, and landmark papers to solidify your understanding of neural network fundamentals.

Textbook

Prince — Understanding Deep Learning, Ch. 3–4

Shallow networks, loss landscapes, and gradient descent. The clearest derivation of the MLP forward/backward pass in any modern textbook.

→ Primary reference
Textbook

Goodfellow et al. — Deep Learning, Ch. 6

Deep feedforward networks — covers the universal approximation theorem, gradient flow, and the historical context of activation function choices.

→ Mathematical depth
Video

3Blue1Brown — Neural Networks Series

Four-video series: "But what is a neural network?" through "Backpropagation calculus." The most visually intuitive explanation of backprop available.

→ Watch before the exam
Textbook

Géron — Hands-On ML, Ch. 10

Practical PyTorch MLP and training loop walkthrough. Covers Dropout, Batch Normalization, and learning rate scheduling — all directly applicable to MNIST.

→ Colab lab preparation
Paper

Kingma & Ba — Adam (2015)

The original Adam paper: 9 pages, clearly written, with ablation studies showing per-parameter adaptive rates outperform SGD+Momentum on vision benchmarks.

→ arxiv:1412.6980
Interactive

TensorFlow Playground

Build and train a small MLP in your browser — adjust layers, activations, and learning rate and watch the decision boundary form in real time. Best 10 minutes you'll spend before the midterm.

→ playground.tensorflow.org

Practice Problems

Eight exercises covering perceptron math, activation functions, backpropagation by hand, optimizer mechanics, and a full MNIST training loop — all exam-style.

Problem 1 · Theory · Perceptron & MLP · Easy

Single Neuron Forward Pass

A neuron has weight vector w = [0.4, −0.7, 1.2], bias b = 0.3, and receives input x = [1.0, 2.0, 0.5].

  1. Compute the pre-activation z.
  2. Compute the output a using ReLU activation.
  3. Compute the output a using Sigmoid activation (to 3 d.p.).

Formula: z = w₁x₁ + w₂x₂ + w₃x₃ + b = (0.4)(1.0) + (−0.7)(2.0) + (1.2)(0.5) + 0.3

Problem 2 · Code · Activation Functions · Easy

Implement & Compare Activations in NumPy

Without using PyTorch, implement the three activation functions and their derivatives as NumPy functions, then plot them over x ∈ [−5, 5].

  1. Implement relu(x), sigmoid(x), and tanh_act(x).
  2. Implement their derivatives relu_deriv(x), sigmoid_deriv(x), tanh_deriv(x).
  3. Find the x value where Sigmoid's derivative is maximised. What is that maximum value?

Sigmoid derivative: σ′(x) = σ(x)(1 − σ(x)). This is maximised at x = 0 where σ(0) = 0.5, giving σ′(0) = 0.5 × 0.5 = 0.25.

Problem 3 · Theory · Backpropagation · Medium

Chain Rule Gradient Computation

A single neuron: z = wx + b, a = ReLU(z), loss ℒ = (y − a)². Given: x = 3, w = 0.5, b = −0.5, y = 2.

  1. Compute z, a, and ℒ in the forward pass.
  2. Compute ∂ℒ/∂a, ∂a/∂z, and ∂z/∂w step by step.
  3. Compute ∂ℒ/∂w using the chain rule.
  4. Compute the updated weight w′ with η = 0.1.

Remember the ReLU derivative: ∂a/∂z = 1 if z > 0, else 0. Since z = 0.5(3) + (−0.5) = 1.0 > 0, the derivative is 1.

Problem 4 · Code · SGD & Adam Optimization · Medium

Implement SGD and One Adam Step from Scratch

Using only NumPy (no PyTorch optimizers), implement the parameter update rules for SGD with momentum and a single Adam step.

  1. Implement sgd_update(w, grad, lr=0.01, momentum=0.9, v_prev=0). Return updated w and new velocity v.
  2. Implement adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8). Return updated w, m, v.
  3. With gradient g = [0.5, −0.3, 0.8] and w = [1.0, 2.0, 3.0], compute one Adam step from t=1, m=0, v=0.

Adam bias correction: m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ). At t=1: 1 − β₁¹ = 0.1, 1 − β₂¹ = 0.001.

Problem 5 · Theory · Loss Functions · Medium

Cross-Entropy Loss & Parameter Counting

A 3-class MLP outputs logits [2.1, 0.5, −0.8] for a sample with true label class 0.

  1. Apply Softmax to the logits to get probabilities p̂ (to 4 d.p.).
  2. Compute the Cross-Entropy loss ℒ_CE = −log(p̂_true).
  3. Count the total parameters of an MLP with architecture 784 → 512 → 256 → 10 (include biases).

Softmax: p̂ᵢ = exp(zᵢ) / Σ exp(zⱼ). exp(2.1) ≈ 8.166, exp(0.5) ≈ 1.649, exp(−0.8) ≈ 0.449. Sum ≈ 10.264.

Problem 6 · Code · MNIST Training Loop · Medium

Full Training & Evaluation Pipeline

Build and train a complete MLP on MNIST in PyTorch, then analyse the results.

  1. Define an MLP with architecture 784 → 256 → ReLU → 128 → ReLU → 10.
  2. Train for 5 epochs on MNIST train set using Adam (lr=1e-3) and CrossEntropyLoss.
  3. Write an evaluate(model, loader) function that returns accuracy % on the test set.
  4. Report: final test accuracy, total training parameters, and the most commonly confused digit pair from the confusion matrix.

For the confusion matrix, use sklearn.metrics.confusion_matrix(y_true, y_pred). Look for the off-diagonal cell with the largest value — this is the most confused pair.

Problem 7 · Synthesis · Theory: Activation & Vanishing Gradients · Hard

Gradient Magnitude Through a Deep Sigmoid Network

Derive and calculate how the gradient magnitude shrinks as it propagates through a chain of Sigmoid activations.

  1. Show that the maximum value of σ′(z) is 0.25 (at z = 0).
  2. Assume every layer is at z = 0. Compute the gradient magnitude ratio ∥∂ℒ/∂W¹∥ / ∥∂ℒ/∂Wᴸ∥ for a network with L = 1, 5, 10, and 20 Sigmoid layers.
  3. At L = 20, if the output gradient is ∥∂ℒ/∂W²⁰∥ = 1.0, what is ∥∂ℒ/∂W¹∥? Express as a power of 10.
  4. Explain in one paragraph why ReLU resolves this and what its own failure mode is.

The gradient shrinks by a factor of σ′(z) ≤ 0.25 per layer. After L layers: ratio = (0.25)^L. For L=20: (0.25)^20 = (4⁻¹)^20 = 4⁻²⁰ ≈ 10⁻¹².

Problem 8 · Synthesis · Code: Optimizer Comparison · Hard

Benchmark SGD vs Momentum vs Adam on MNIST

Train the same MLP architecture three times — once with each optimizer — and produce a comparative analysis.

  1. Use the fixed architecture: 784 → 256 → ReLU → 10. Same random seed (torch.manual_seed(42)) for each run.
  2. Train each for 10 epochs: SGD (lr=0.01), SGD+Momentum (lr=0.01, momentum=0.9), Adam (lr=1e-3).
  3. Record train loss and test accuracy per epoch. Plot all three loss curves on one figure.
  4. Report: which optimizer converges fastest (fewest epochs to reach 95% test accuracy)? Which has the lowest final loss? Hypothesize why.

Expected result: Adam reaches 95% test accuracy in ~2 epochs, SGD+Momentum in ~4, SGD alone in ~6+. However, with careful lr tuning, SGD+Momentum can match Adam. Adam's advantage is robustness to lr choice.