Computer Vision — Week 5: Neural Networks for Vision

Topic 1 · Intuition

From Perceptron to MLP

Activation functions, loss surfaces, and the chain rule — the three pillars of every neural network for vision.

After this section you will be able to

Describe the perceptron model and its geometric interpretation as a linear decision boundary.
Distinguish the output behaviour of ReLU, Sigmoid, and Tanh activation functions and explain why non-linearity is essential.
Compute mean squared error and cross-entropy loss by hand and identify which suits classification tasks.

"Every convolution, attention head, and transformer block in modern computer vision is just a stack of perceptrons learning one curve at a time."

🧠

Why this matters: Before you can fine-tune ResNet or train a YOLO detector, you must understand what a single neuron does — what it learns, where it fails, and what "activation" actually means. This week is the foundation that makes every future topic legible.

💡

Analogy Bridge

Think of a single neuron as a light dimmer switch. The inputs are the wires, the weights are the resistance settings, and the activation function is the dimmer itself — it decides how much signal actually passes through. Stack a hundred dimmers in the right configuration and you can sculpt any shape of decision boundary.

MLP anatomy: input → fully-connected hidden layers with activation σ → output. Each edge carries a learned weight; each node applies a non-linearity.

ReLU

max(0, z) — sparse, fast, default choice

Sigmoid

1/(1+e⁻ᶻ), squashes to (0,1); its gradient vanishes in deep stacks

Tanh

zero-centred sigmoid, range (−1,1)

Softmax

exp(zᵢ)/Σexp(zⱼ) — output probabilities

Problem

Perceptron & Linear Pre-activation

A single neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function:

$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{n} w_i x_i + b$$

$$a = \sigma(z)$$

Pre-activation

raw weighted sum before non-linearity

Weight vector

learned strength of each input connection

Bias

shifts the decision boundary away from the origin; lets z be non-zero when x = 0

Activation

introduces non-linearity (ReLU, Tanh…)

📝 Worked Example — Single Neuron Forward Pass

Given inputs x = [0.5, −1.0, 2.0], weights w = [0.3, −0.6, 0.1], bias b = 0.2.

Compute pre-activation: z = (0.3)(0.5) + (−0.6)(−1.0) + (0.1)(2.0) + 0.2

$$z = 0.15 + 0.60 + 0.20 + 0.20 = 1.15$$

Apply ReLU: a = max(0, 1.15) = 1.15. Apply Sigmoid: a = 1/(1+e⁻¹·¹⁵) ≈ 0.760.

z = 1.15 → ReLU output: 1.15 | Sigmoid output: ≈ 0.760
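The forward pass above can be checked in a few lines of plain Python. This is a minimal sketch: the helper names `pre_activation`, `relu`, and `sigmoid` are illustrative choices, not part of the course code.

```python
import math

def pre_activation(w, x, b):
    """Weighted sum of inputs plus bias: z = w·x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the worked example
x = [0.5, -1.0, 2.0]
w = [0.3, -0.6, 0.1]
b = 0.2

z = pre_activation(w, x, b)
print(f"z          = {z:.2f}")           # 1.15
print(f"ReLU(z)    = {relu(z):.2f}")     # 1.15
print(f"Sigmoid(z) = {sigmoid(z):.4f}")  # 0.7595
```

Because z is positive, ReLU passes it through unchanged, while Sigmoid compresses it into (0, 1).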

Quick Check

If all weights were 0, what would the neuron output regardless of input?

z = b = 0.2 → σ(0.2). The output depends only on the bias term. This is why biases allow the network to learn shifts even when inputs are zero.

Loss Functions

The loss quantifies how wrong the network's prediction is. Two canonical choices:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$$

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k$$

MSE for regression; Cross-Entropy for classification (pairs with Softmax output).

📝 Worked Example — Cross-Entropy Loss

True label: class 2 (one-hot: y = [0, 0, 1]). Model outputs softmax probabilities: p̂ = [0.1, 0.2, 0.7].

Only the true class term survives: $\mathcal{L} = -\log(0.7) \approx 0.357$

If the model was wrong — p̂ = [0.6, 0.3, 0.1] — then $\mathcal{L} = -\log(0.1) \approx 2.303$. Higher loss → larger gradient signal.

CE loss scales with confidence: correct & confident ≈ 0, wrong & confident ≈ large
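To see both loss values side by side, here is a small sketch in plain Python. The function name `cross_entropy` is an illustrative choice, and the natural logarithm is assumed, as in the worked example.

```python
import math

def cross_entropy(y_onehot, p_hat):
    """Cross-entropy for one sample: -sum_k y_k * log(p_k).

    Terms with y_k = 0 contribute nothing, so they are skipped.
    """
    return -sum(y * math.log(p) for y, p in zip(y_onehot, p_hat) if y > 0)

y = [0, 0, 1]                                   # true class is index 2
loss_confident_right = cross_entropy(y, [0.1, 0.2, 0.7])
loss_confident_wrong = cross_entropy(y, [0.6, 0.3, 0.1])

print(f"{loss_confident_right:.3f}")  # 0.357
print(f"{loss_confident_wrong:.3f}")  # 2.303
```

Only the probability assigned to the true class enters the loss, which is why the second prediction is penalised so much more heavily.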

Quick Check

What cross-entropy loss does a perfect prediction of probability 1.0 give?

−log(1.0) = 0. Perfect confidence in the correct class yields zero loss.

⚠️

Common Mistake

Applying Sigmoid to hidden layers in deep nets causes vanishing gradients. Because σ′(z) ≤ 0.25, each Sigmoid layer scales the gradient by at most 0.25 (a loss of at least 75 %), before the weight matrices are even taken into account. With 10 layers, that factor alone leaves the gradient reaching layer 1 at less than 10⁻⁶ of the original (0.25¹⁰ ≈ 9.5 × 10⁻⁷). Use ReLU (or LeakyReLU) for all hidden layers; reserve Sigmoid for the final binary output.
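The bound quoted above is easy to tabulate. The sketch below assumes the best case for Sigmoid (every unit at z = 0, where σ′ = 0.25) and ignores the weight matrices; `sigmoid_gradient_bound` is a hypothetical helper name.

```python
def sigmoid_gradient_bound(n_layers, max_derivative=0.25):
    """Upper bound on the gradient scale after n Sigmoid layers (weights ignored)."""
    return max_derivative ** n_layers

for n in (1, 5, 10):
    print(f"{n:2d} layers: {sigmoid_gradient_bound(n):.2e}")
# 1 layers: 2.50e-01
# 5 layers: 9.77e-04
# 10 layers: 9.54e-07
```

Real networks usually do worse than this bound, because saturated units have derivatives far below 0.25.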

Solution

Pause & Predict

Before moving the slider: if you increase the input x from −3 to +3 for a ReLU neuron with weight w = 2 and bias b = −1, at what value of x does the neuron "turn on"?

Think about when z = wx + b transitions from negative to positive.

Try It: Single Neuron Live Calculator

Adjust weight, bias, and input to see the pre-activation z and output a for each activation function in real time.

Weight w = 1.5

Bias b = −0.5

Plot legend: ReLU, Sigmoid, Tanh, z (pre-activation)

Live Calculation: z = w·x + b at x = 1.0

z = (1.5)(1.0) + (−0.5) = 1.00

ReLU(z) = max(0, 1.00) = 1.000

Sigmoid(z) = 1/(1+e⁻¹·⁰⁰) ≈ 0.731

Tanh(z) = tanh(1.00) ≈ 0.762

Implementation

Python · PyTorch — Building a 2-Layer MLP

import torch
import torch.nn as nn

# Define a 2-hidden-layer MLP for 10-class image classification
class SimpleMLP(nn.Module):
    def __init__(self, input_dim=784, hidden=256, n_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes)
        )

    def forward(self, x):
        # Flatten 28×28 image to 784-dim vector
        x = x.view(x.size(0), -1)
        return self.net(x)

model = SimpleMLP()
dummy = torch.randn(4, 1, 28, 28)
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 10])

Output

torch.Size([4, 10])

Key Takeaway

A neural network is a composition of linear transformations and non-linear activations. The activations are what let it represent curved decision boundaries; with enough hidden units, an MLP can approximate any continuous boundary to arbitrary accuracy.

🏥

Real-World Application

Diabetic Retinopathy Grading (Google Research, 2016)

An MLP head on top of convolutional features classifies retinal fundus images into 5 severity grades — from no disease to proliferative DR. The final fully-connected layer is exactly the linear + softmax step you just computed, applied to 2048-dimensional CNN features. The same neuron math, operating on richer input.

Checkpoint Quiz Topic 1 — Perceptron to MLP

Q1 A neuron has w = [2, −1] and b = 0.5. Input x = [1, 3]. What is z?

z = 2(1) + (−1)(3) + 0.5 = 2 − 3 + 0.5 = −0.5. ReLU would output 0; Sigmoid would output ≈ 0.378.

Q2 Why does a network with only linear layers (no activations) fail to learn non-linear patterns?

A composition of linear functions is still linear: W₃(W₂(W₁x)) = Wx. No matter how many layers, without activations the network can only learn a single hyperplane boundary — it collapses to a one-layer linear model.
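The collapse described in Q2 can be demonstrated numerically. The sketch below uses two small, arbitrarily chosen 2×2 matrices (assumptions made purely for illustration) and plain Python lists rather than a tensor library.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Multiply a matrix by a vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu_vec(v):
    return [max(0.0, vi) for vi in v]

W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))       # W2(W1 x)
one_merged_layer = matvec(matmul(W2, W1), x)        # (W2 W1) x
with_relu_between = matvec(W2, relu_vec(matvec(W1, x)))

print(two_linear_layers)   # [0.5, 10.0]
print(one_merged_layer)    # [0.5, 10.0]
print(with_relu_between)   # [2.5, 10.0]
```

Without an activation, the two layers are indistinguishable from the single matrix W2·W1. Inserting ReLU changes the output, so the stacked network is no longer equivalent to one linear map.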

Topic 2 · Mechanics

Backpropagation & Optimization

The chain rule turns the loss into a gradient signal that flows backwards through every weight — and optimizers decide how to use that signal.

After this section you will be able to

Derive the gradient of a loss with respect to a weight using the chain rule in a 2-layer network.
Explain the difference between vanilla SGD, SGD with Momentum, and Adam, including when each is preferred.
Identify vanishing and exploding gradient symptoms in a training loss curve.

"Backpropagation is just the chain rule — applied cleverly to a computational graph of millions of operations without re-computing anything twice."

⛰️

Why this matters: Every deep learning framework — PyTorch, JAX, TensorFlow — is built around automatic differentiation via backprop. Understanding it lets you diagnose exploding/vanishing gradients, design custom loss functions, and reason about why some architectures train faster than others.

💡

Analogy Bridge

Imagine you're trying to find the lowest valley on a foggy mountain (the loss surface). You can only feel the ground directly under your feet (local gradient). SGD takes small steps in the steepest downhill direction. Momentum adds a marble rolling down — it builds up speed. Adam carries its own GPS that adapts the step size for each direction independently.

Forward Pass (compute ŷ) → Compute Loss ℒ → Backprop (compute ∂ℒ/∂W) → Update Weights

The four-step training loop: forward pass → loss → backward pass (chain rule) → weight update. Repeated for every mini-batch.

O(P)

Complexity

Backprop computes all P gradients in one backward pass, at a cost within a small constant factor of the forward pass

10⁻³

Typical lr

Adam default learning rate; often the best starting point

β₁=0.9

Adam β₁

Decay rate of the first-moment (momentum) estimate: each step keeps 90 % of the previous running average of gradients

32–256

Batch size

Gradient estimated on mini-batch; trades noise for speed

Problem

Chain Rule — Gradient Flow

For a composition of functions $\mathcal{L} = f(g(h(\mathbf{W})))$, the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W_{ij}}$$

Each term is local — computed at that node — so the network never needs to see the full expression.

📝 Worked Example — Backprop Through One Layer

Network: z = wx + b, a = σ(z) = sigmoid(z), ℒ = (y − a)². Given: x = 2, w = 0.5, b = 0, y = 1.

Forward: z = 0.5 × 2 = 1.0, a = sigmoid(1.0) ≈ 0.731, ℒ = (1 − 0.731)² ≈ 0.0724

Backward step by step:
∂ℒ/∂a = −2(y − a) = −2(0.269) ≈ −0.538
∂a/∂z = σ(z)(1−σ(z)) = 0.731 × 0.269 ≈ 0.197
∂z/∂w = x = 2

$$\frac{\partial \mathcal{L}}{\partial w} = (-0.538)(0.197)(2) \approx -0.212$$

w ← w − η(∂ℒ/∂w) = 0.5 − 0.01 × (−0.212) = 0.5021 (with η = 0.01)
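One way to gain confidence in a hand-derived gradient is to compare it against a finite-difference estimate. The sketch below reproduces the worked example in plain Python; `loss_fn` and the step size `h` are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w, x=2.0, b=0.0, y=1.0):
    """Forward pass of the worked example: L = (y - sigmoid(w*x + b))**2."""
    return (y - sigmoid(w * x + b)) ** 2

x, w, b, y = 2.0, 0.5, 0.0, 1.0

# Forward pass
z = w * x + b
a = sigmoid(z)

# Backward pass: one local derivative per node, multiplied together
dL_da = -2.0 * (y - a)
da_dz = a * (1.0 - a)
dz_dw = x
grad_w = dL_da * da_dz * dz_dw

# Numerical check with a central finite difference
h = 1e-6
grad_numeric = (loss_fn(w + h) - loss_fn(w - h)) / (2 * h)

# Gradient-descent update with learning rate 0.01
eta = 0.01
w_new = w - eta * grad_w

print(f"analytic {grad_w:.4f}, numeric {grad_numeric:.4f}, w_new {w_new:.4f}")
# analytic -0.2115, numeric -0.2115, w_new 0.5021
```

The unrounded gradient is about −0.2115; the hand calculation lands on −0.212 because it multiplies intermediate values that were already rounded to three decimals.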

Quick Check

Why is the gradient negative (−0.212) and the weight update positive? What does this tell you about the loss landscape?

A negative gradient means increasing w would decrease the loss (the loss slopes downward as w increases). Subtracting a negative gradient moves w in the direction that lowers loss — exactly what we want.

Optimizers: SGD → Momentum → Adam

$$\text{SGD: } \mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w}\mathcal{L}$$

$$\text{Momentum: } \mathbf{v} \leftarrow \beta \mathbf{v} - \eta \nabla_\mathbf{w}\mathcal{L}; \quad \mathbf{w} \leftarrow \mathbf{w} + \mathbf{v}$$

$$\text{Adam: } \mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon}$$

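Adam maintains per-parameter, bias-corrected estimates of the first moment m̂ (running mean of the gradients) and the second moment v̂ (running mean of the squared gradients, i.e. the uncentred variance), giving adaptive learning rates.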

📝 Worked Example — One Adam Step

t=1, η=0.001, β₁=0.9, β₂=0.999, ε=1e-8. Gradient g = 0.4.

m ← 0.9(0) + 0.1(0.4) = 0.04 | v ← 0.999(0) + 0.001(0.16) = 0.00016

Bias-correct: m̂ = 0.04/(1−0.9¹) = 0.4 | v̂ = 0.00016/(1−0.999¹) ≈ 0.16

$$w \leftarrow w - 0.001 \times \frac{0.4}{\sqrt{0.16} + 10^{-8}} \approx w - 0.001$$

Adam step ≈ η × sign(g) at t=1 due to bias correction normalising the estimate
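The same step can be traced in code. This is a scalar sketch of the update written out above, not PyTorch's implementation; the function name `adam_step_scalar` and the starting weight w = 1.0 are assumptions made for illustration (the worked example leaves w unspecified).

```python
import math

def adam_step_scalar(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w0 = 1.0                                     # hypothetical starting weight
w1, m1, v1 = adam_step_scalar(w=w0, g=0.4, m=0.0, v=0.0, t=1)

print(f"m = {m1:.2f}, v = {v1:.5f}, step = {w0 - w1:.6f}")
# m = 0.04, v = 0.00016, step = 0.001000
```

Repeating the call with a different gradient magnitude at t = 1 gives the same step size, which is the "η × sign(g)" behaviour noted above.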

💡

Key Insight — Why Adam Wins in Practice

Adam adapts the effective learning rate per parameter. Parameters with large gradients (e.g., early output layer) get smaller effective steps; parameters with small or sparse gradients (e.g., embedding rows) get larger effective steps. This makes Adam far more robust to heterogeneous gradient scales across a deep network.

Solution

Pause & Predict

Before adjusting the learning rate slider: what do you predict will happen to the loss curve if η is set to 0.9 (very large)? What about η = 0.000001 (very small)?

Consider overshooting the minimum vs. crawling too slowly.

Try It: Learning Rate Effect Visualizer

Watch gradient descent on a 1D loss function (parabola). Adjust the learning rate to see convergence, oscillation, or divergence.

Learning Rate η = 0.30

Plot legend: loss surface, current position, trajectory

Optimizer equations

SGD: w ← w − η · ∂ℒ/∂w
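The visualizer's behaviour can be approximated offline. The sketch below assumes the parabola ℒ(w) = w² (gradient 2w) and a starting point w₀ = 2.0; the widget's actual function and start are not specified, so the thresholds here apply only to this assumed curve.

```python
def gradient_descent(lr, w0=2.0, steps=10):
    """Run plain gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - lr * 2.0 * w      # SGD update: w <- w - lr * dL/dw
        trajectory.append(w)
    return trajectory

for lr in (1e-6, 0.3, 0.9, 1.1):
    traj = gradient_descent(lr)
    changes_sign = any(a * b < 0 for a, b in zip(traj, traj[1:]))
    print(f"lr={lr:g}: final w = {traj[-1]:.4f}, oscillates = {changes_sign}")
```

On this parabola each step multiplies w by (1 − 2η). A tiny η barely moves, η = 0.3 converges smoothly, η = 0.9 overshoots on every step but still shrinks, and η above 1.0 grows without bound.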

Implementation

Python · PyTorch — Comparing Optimizers

import torch
import torch.nn as nn

model = SimpleMLP()
criterion = nn.CrossEntropyLoss()

# SGD with momentum
opt_sgd = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9
)

# Adam — default hyperparameters work well
opt_adam = torch.optim.Adam(
    model.parameters(), lr=1e-3,
    betas=(0.9, 0.999), eps=1e-8
)

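# One training step (use whichever optimizer).
# A random batch stands in for real data so the snippet runs on its own;
# SimpleMLP is the class defined in Topic 1.
images = torch.randn(64, 1, 28, 28)     # dummy batch of 28×28 images
labels = torch.randint(0, 10, (64,))    # dummy class labels in [0, 9]

optimizer = opt_adam
optimizer.zero_grad()                   # clear gradients from any previous step
outputs = model(images)                 # forward pass → logits of shape [64, 10]
loss = criterion(outputs, labels)       # cross-entropy computed on raw logits
loss.backward()                         # backprop fills the .grad buffers
optimizer.step()                        # apply the Adam update
print(f"Loss: {loss.item():.4f}")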

Output

Loss: ≈ 2.30 (the exact value varies with the random initialisation and batch; an untrained network sits near ln(10) ≈ 2.3026 for 10 classes)
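The ln(10) baseline follows directly from the cross-entropy formula: a model that knows nothing spreads probability evenly over the classes. A short check in plain Python (the variable names are illustrative):

```python
import math

n_classes = 10
uniform = [1.0 / n_classes] * n_classes   # an uninformed model: 10 % per class
true_class = 3                            # any class gives the same loss here

loss = -math.log(uniform[true_class])
print(f"{loss:.4f}")  # 2.3026
```

A first-step loss far above this value usually points to a scaling or initialisation problem rather than to the data.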

Key Takeaway

Backpropagation is an efficient algorithm for computing all gradients simultaneously using the chain rule; the optimizer then decides how to translate those gradients into weight updates — and Adam's per-parameter adaptive learning rate makes it the default choice for most vision models.

🚗

Real-World Application

Tesla Autopilot Training Pipeline

Tesla trains its multi-task vision networks (object detection + depth + lane detection) using Adam with carefully scheduled learning rate warmup and cosine decay — applied across hundreds of GPUs. The same backprop loop you just implemented runs billions of times across billions of labelled driving frames. The chain rule is the engine of autonomous perception.

Checkpoint Quiz Topic 2 — Backpropagation

Q1 What does it mean when training loss oscillates wildly and doesn't decrease?

Learning rate is too large. The gradient descent step overshoots the minimum, bouncing back and forth on the loss surface. Fix: reduce η (try dividing by 10).

Q2 After 5 layers of Sigmoid activations, the gradient of the first layer is ≈ 10⁻⁷. What problem is this, and what is the standard fix?

Vanishing gradient. Each Sigmoid derivative scales the gradient by a factor of at most 0.25, so 5 layers contribute at most 0.25⁵ ≈ 10⁻³; saturated units (where σ′ is far below 0.25) and small weights shrink it further, which is how it can reach ≈ 10⁻⁷. Fix: replace hidden Sigmoid activations with ReLU, add Batch Normalization, or use residual connections.

Topic 3 · Application

MNIST — The Vision Foundation Benchmark

Digit classification with a handcrafted MLP proves every concept from Topics 1 & 2 is real — and reveals exactly where a flat network hits its ceiling.

After this section you will be able to

Build a complete MNIST training loop — data loading, forward pass, loss, backward pass, and evaluation — in PyTorch.
Interpret training and validation loss curves to diagnose underfitting, overfitting, and convergence.
Justify why MLPs on pixel grids are limited and why convolutional layers address those limitations.

"MNIST is the 'Hello World' of computer vision — deceptively simple yet containing every lesson you need before tackling real-world image data."

✍️

Why this matters: MNIST is small enough to train in minutes on a laptop yet demonstrates the full learning cycle: data loading, batching, forward pass, cross-entropy loss, backprop, and accuracy tracking. It also exposes the point where pixel-wise MLPs start to fail, when digits shift by even a few pixels, which motivates everything in Week 6 (CNNs).

💡

Analogy Bridge

An MLP on MNIST is like memorising what the digit "3" looks like pixel-by-pixel in a fixed position. The moment someone writes "3" slightly to the left, the network is confused — it learned position, not shape. Convolutional filters (Week 6) solve this by detecting edges and curves regardless of location — they are translation-equivariant.

Full MNIST pipeline: raw 28×28 image → normalise → flatten to 784-dim vector → MLP → softmax cross-entropy loss → predicted digit class.

70k

samples

60k training + 10k test; 28×28 grayscale

784

input dims

28×28 pixels flattened — ignores spatial structure

97–98%

MLP accuracy

Ceiling for 2-layer MLP; CNN reaches 99.7%

< 1 min

Train time

10 epochs on a laptop GPU or Colab CPU

Problem

Training Loop Components

Every supervised training run follows the same cycle:

DataLoader — batches images + labels; shuffles each epoch
Forward pass — model(x) → logits (shape [B, 10])
Loss — CrossEntropyLoss(logits, labels)
Backward — loss.backward() → fills .grad buffers
Step — optimizer.step() updates weights
Zero grad — optimizer.zero_grad() before next batch

📝 Worked Example — Accuracy Calculation

Batch of 4 images. Model outputs (softmax probabilities): p̂ = [[0.02, 0.85, 0.01, …], [0.70, 0.05, …], [0.03, 0.02, 0.91, …], [0.01, 0.02, 0.04, …, 0.88]]

Predicted classes = argmax per row: [1, 0, 2, 9]. True labels: [1, 0, 2, 7].

Correct = 3 out of 4. Batch accuracy = 3/4 = 75%.

Accuracy = (correct predictions) / (batch size) = 3/4 = 75.0%
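The accuracy computation is a two-step recipe: take the argmax of each row, then compare with the labels. The sketch below uses plain Python; the helper names and the short three-class probability rows are hypothetical stand-ins, while the predicted classes and labels are the ones from the worked example.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda k: row[k])

def accuracy(pred_classes, true_labels):
    """Fraction of predictions that match the labels."""
    correct = sum(int(p == t) for p, t in zip(pred_classes, true_labels))
    return correct / len(true_labels)

# Hypothetical 3-class rows, only to show how argmax picks a class
toy_probs = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
print([argmax(row) for row in toy_probs])   # [1, 0]

# Predicted classes and true labels from the worked example
preds = [1, 0, 2, 9]
labels = [1, 0, 2, 7]
print(f"{accuracy(preds, labels):.1%}")     # 75.0%
```

The PyTorch training loop does the same thing with `outputs.argmax(dim=1)` and `(preds == labels).sum()`.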

Quick Check

A random model on MNIST (10 classes) would score approximately what accuracy?

≈ 10%. With 10 equally likely classes, random guessing is correct 1/10 of the time. Any model doing worse than 10% accuracy has learned something wrong!

Why MLPs Fail on Shifted Images

An MLP on a flattened 784-dim vector treats pixel (3, 5) and pixel (3, 6) as completely unrelated features. If the digit is shifted by 1 pixel, entirely different weights are activated. The network has no translation invariance.
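A one-dimensional toy makes the effect concrete. The sketch below is an illustration under simplifying assumptions: a 6-pixel "image", a single dense unit whose weights match the stroke in its original position, and a 2-pixel kernel slid across the signal. None of these values come from MNIST.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def shift_right(signal, k):
    """Shift a 1D signal k positions to the right, padding with zeros."""
    return [0.0] * k + signal[:len(signal) - k]

def sliding_responses(signal, kernel):
    """Apply the same small kernel at every position (1D convolution, no padding)."""
    n = len(kernel)
    return [dot(signal[i:i + n], kernel) for i in range(len(signal) - n + 1)]

signal = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]    # a tiny stroke at positions 1 and 2
shifted = shift_right(signal, 2)            # the same stroke at positions 3 and 4

dense_weights = signal[:]                   # dense unit tuned to the original position
kernel = [1.0, 1.0]                         # shared 2-pixel stroke detector

dense_original = dot(dense_weights, signal)
dense_shifted = dot(dense_weights, shifted)
conv_original = sliding_responses(signal, kernel)
conv_shifted = sliding_responses(shifted, kernel)

print(dense_original, dense_shifted)            # 2.0 0.0
print(max(conv_original), max(conv_shifted))    # 2.0 2.0
```

The dense unit's response drops to zero after the shift, while the sliding kernel produces the same peak response at a new location. That moving peak is what translation equivariance means.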

Parameter count for a 784 → 256 → 128 → 10 MLP: 784×256 + 256×128 + 128×10 = 234,752 weights (biases excluded), but zero spatial awareness.

$$\text{MLP ceiling: } \approx 97\text{–}98\%$$

$$\text{CNN: } \approx 99.7\% \text{ (Week 6)}$$

📝 Worked Example — Counting MLP Parameters

Layer 1: 784 inputs × 256 neurons + 256 biases = 200,960

Layer 2: 256 × 128 + 128 = 32,896

Output: 128 × 10 + 10 = 1,290

Total: 200,960 + 32,896 + 1,290 = 235,146 trainable parameters
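Counting parameters by hand is error-prone, so a small helper is a useful cross-check. The sketch below assumes a plain fully-connected network described by its layer widths; `count_parameters` is a hypothetical name.

```python
def count_parameters(layer_sizes, include_bias=True):
    """Total trainable parameters of a fully-connected net with the given widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out          # weight matrix of this layer
        if include_bias:
            total += n_out             # one bias per output neuron
    return total

sizes = [784, 256, 128, 10]
print(count_parameters(sizes))                       # 235146
print(count_parameters(sizes, include_bias=False))   # 234752
```

In PyTorch the equivalent check for a built model is `sum(p.numel() for p in model.parameters())`.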

⚠️

Common Mistake

Forgetting optimizer.zero_grad() before each backward pass causes gradient accumulation. Gradients are added to the existing .grad buffers, not replaced, so after N batches each buffer holds the sum of all N batch gradients (roughly N× too large), and the weight updates become erratic. Always call zero_grad() at the top of the training loop body.
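The accumulation is easy to observe on a single scalar parameter. In this sketch the loss (3w)² is an arbitrary choice, and the buffer is cleared with `w.grad.zero_()` directly on the tensor; inside a real training loop `optimizer.zero_grad()` plays that role.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

def loss_fn(w):
    return (3.0 * w) ** 2      # dL/dw = 18 * w, so the gradient at w = 1 is 18

# Two backward passes without clearing: the second gradient is added to the first
loss_fn(w).backward()
loss_fn(w).backward()
accumulated = w.grad.item()

# Clearing the buffer first gives the gradient of a single pass
w.grad.zero_()
loss_fn(w).backward()
single = w.grad.item()

print(accumulated, single)     # 36.0 18.0
```

Accumulation is sometimes used on purpose, to simulate a larger batch across several small ones, but then the optimizer step and the reset are deliberately delayed together.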

Solution

Pause & Predict

Looking at the digit sketch grid below: which digits do you think an MLP will confuse most often, and why?

Think about which digits look similar when drawn with similar stroke weights (3 vs 8, 4 vs 9, 1 vs 7).

Try It: MNIST Confusion Visualizer

Sketch-style visualisation of typical MLP confusion patterns. Toggle the epoch slider to see how confidence evolves as training progresses.

Training Epoch 5

Training accuracy Validation accuracy Training loss

Epoch metrics (simulated)

Epoch 5 — Train acc: 95.8% | Val acc: 94.2% | Loss: 0.142

Implementation

Python · PyTorch — Full MNIST Training Loop

import torch
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# 1. Data loading and normalisation
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_ds = datasets.MNIST(
    './data', train=True, download=True,
    transform=transform
)
loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# 2. Model, loss, optimizer (SimpleMLP is the class defined in Topic 1)
model = SimpleMLP(input_dim=784, hidden=256, n_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 3. Training loop
for epoch in range(10):
    correct = total = 0
    for images, labels in loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    print(f"Epoch {epoch+1}: acc={correct/total*100:.2f}%")

Output

Epoch 1: acc=92.34%
Epoch 2: acc=95.81%
Epoch 3: acc=96.74%
Epoch 4: acc=97.20%
Epoch 5: acc=97.55%
Epoch 6: acc=97.83%
Epoch 7: acc=97.98%
Epoch 8: acc=98.12%
Epoch 9: acc=98.19%
Epoch 10: acc=98.24%
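The loop above reports accuracy on the training batches only. A held-out evaluation could look like the sketch below. The `evaluate` helper is an assumption, not part of the original listing, and a tiny hand-built dataset and identity-weight linear layer stand in for MNIST and the trained MLP so that the result can be checked by eye.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return classification accuracy (%) of `model` over all batches in `loader`."""
    model.eval()                        # switch off training-only behaviour (e.g. Dropout)
    correct = total = 0
    with torch.no_grad():               # gradients are not needed for evaluation
        for inputs, labels in loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total

# Stand-in model: a linear layer with identity weights, so logits equal the inputs
toy_model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    toy_model.weight.copy_(torch.eye(2))

inputs = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 4.0], [1.0, 2.0]])
labels = torch.tensor([0, 1, 0, 0])     # the last label disagrees with the argmax
toy_loader = DataLoader(TensorDataset(inputs, labels), batch_size=2)

print(evaluate(toy_model, toy_loader))  # 75.0
```

For MNIST, the same function would be called with the trained model and a loader built from `datasets.MNIST('./data', train=False, transform=transform)`.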

Key Takeaway

A simple MLP reaches ~98% training accuracy on MNIST, showing that the perceptron-to-MLP building blocks work. The last ~2% of errors expose its spatial blind spot, which convolutional layers in Week 6 are designed to overcome.

📬

Real-World Application

USPS Handwritten ZIP Code Reader (LeCun, 1989)

MNIST was assembled from NIST's handwritten-digit collections rather than from postal mail, but the problem it poses is the one Yann LeCun tackled in 1989 with US Postal Service ZIP-code digits. His network for reading ZIP codes was an early convolutional network, built to handle the same sensitivity to translation you just encountered with your MLP. Many later OCR systems, cheque readers, and form scanners build on that line of work.

Checkpoint Quiz Topic 3 — MNIST & MLP Limits

Q1 If training accuracy is 98% but test accuracy is 72%, what problem does this indicate?

Overfitting. The model has memorised training examples but fails to generalise. Fix: add Dropout layers, reduce model capacity, or train with data augmentation.

Q2 Why does an MLP trained on MNIST fail when the digit is shifted 3 pixels to the right at test time?

The MLP operates on flattened pixel vectors, so the pixels at positions (i, j) and (i, j+3) are completely independent features. A 3-pixel shift activates entirely different weights, producing an input distribution the model has never seen. MLPs have no translation equivariance; convolutional layers do.

Activation Function Explorer (values at x = 1.5)

Function	f(x)	f′(x)
ReLU	1.500	1.000
Sigmoid	0.818	0.149
Tanh	0.905	0.181

Week 5 · Summary

What You've Learned

Three building blocks — neurons, gradients, and a classifier loop — that underpin every vision model you will build for the rest of the course.

🧠

Perceptron → MLP

A neuron is z = Wx + b followed by σ(z). Stack layers with ReLU activations and, given enough hidden units, you can approximate any continuous function on a bounded region of ℝⁿ. Loss (MSE or Cross-Entropy) measures how wrong the prediction is.

⛓️

Backpropagation

The chain rule propagates ∂ℒ/∂W backwards through every layer in a single O(P) pass. No re-computation needed. This is what loss.backward() does in PyTorch.

⚡

Optimizers

SGD applies one fixed learning rate to every parameter. Momentum builds velocity. Adam uses per-parameter adaptive rates, which makes it the default for vision tasks. Learning rate is your most critical hyperparameter.

✍️

MNIST in Practice

~98% MLP accuracy proves the math works. But pixel-wise MLPs lack translation invariance: shift a digit by a few pixels and accuracy drops sharply. This motivates convolutional layers in Week 6.

📉

Diagnosing Training

Oscillating loss → η too large. Loss not moving → η too small or vanishing gradients. Large train/val gap → overfitting. Poor accuracy on both train and val → underfitting. Read the curve before tuning anything.

Coming up → Week 6

Convolutional Neural Networks: how a small sliding kernel replaces ~235k weights with ~3k, achieves translation equivariance, and pushes MNIST accuracy to about 99.7%. Plus pooling, receptive fields, and the CNN feature hierarchy.

Go Deeper

Textbook chapters, interactive visualisations, and landmark papers to solidify your understanding of neural network fundamentals.

Textbook

Prince — Understanding Deep Learning, Ch. 3–4

Shallow networks, loss landscapes, and gradient descent. The clearest derivation of the MLP forward/backward pass in any modern textbook.

→ Primary reference

Textbook

Goodfellow et al. — Deep Learning, Ch. 6

Deep feedforward networks — covers the universal approximation theorem, gradient flow, and the historical context of activation function choices.

→ Mathematical depth

Video

3Blue1Brown — Neural Networks Series

Four-video series: "But what is a neural network?" through "Backpropagation calculus." The most visually intuitive explanation of backprop available.

→ Watch before the exam

Textbook

Géron — Hands-On ML, Ch. 10

Practical PyTorch MLP and training loop walkthrough. Covers Dropout, Batch Normalization, and learning rate scheduling — all directly applicable to MNIST.

→ Colab lab preparation

Paper

Kingma & Ba — Adam (2015)

The original Adam paper: 9 pages, clearly written, with ablation studies showing per-parameter adaptive rates outperform SGD+Momentum on vision benchmarks.

→ arxiv:1412.6980

Interactive

TensorFlow Playground

Build and train a small MLP in your browser — adjust layers, activations, and learning rate and watch the decision boundary form in real time. Best 10 minutes you'll spend before the midterm.

→ playground.tensorflow.org

Exercises

Practice Problems

Eight exercises covering perceptron math, activation functions, backpropagation by hand, optimizer mechanics, and a full MNIST training loop — all exam-style.

Theory · Perceptron & MLP Easy

Single Neuron Forward Pass

A neuron has weight vector w = [0.4, −0.7, 1.2], bias b = 0.3, and receives input x = [1.0, 2.0, 0.5].

Compute the pre-activation z.
Compute the output a using ReLU activation.
Compute the output a using Sigmoid activation (to 3 d.p.).

Formula: z = w₁x₁ + w₂x₂ + w₃x₃ + b = (0.4)(1.0) + (−0.7)(2.0) + (1.2)(0.5) + 0.3

Code · Activation Functions Easy

Implement & Compare Activations in NumPy

Without using PyTorch, implement the three activation functions and their derivatives as NumPy functions, then plot them over x ∈ [−5, 5].

Implement relu(x), sigmoid(x), and tanh_act(x).
Implement their derivatives relu_deriv(x), sigmoid_deriv(x), tanh_deriv(x).
Find the x value where Sigmoid's derivative is maximised. What is that maximum value?

Sigmoid derivative: σ′(x) = σ(x)(1 − σ(x)). This is maximised at x = 0 where σ(0) = 0.5, giving σ′(0) = 0.5 × 0.5 = 0.25.

Theory · Backpropagation Medium

Chain Rule Gradient Computation

A single neuron: z = wx + b, a = ReLU(z), loss ℒ = (y − a)². Given: x = 3, w = 0.5, b = −0.5, y = 2.

Compute z, a, and ℒ in the forward pass.
Compute ∂ℒ/∂a, ∂a/∂z, and ∂z/∂w step by step.
Compute ∂ℒ/∂w using the chain rule.
Compute the updated weight w′ with η = 0.1.

Remember the ReLU derivative: ∂a/∂z = 1 if z > 0, else 0. Since z = 0.5(3) + (−0.5) = 1.0 > 0, the derivative is 1.

Code · SGD & Adam Optimization Medium

Implement SGD and One Adam Step from Scratch

Using only NumPy (no PyTorch optimizers), implement the parameter update rules for SGD with momentum and a single Adam step.

Implement sgd_update(w, grad, lr=0.01, momentum=0.9, v_prev=0). Return updated w and new velocity v.
Implement adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8). Return updated w, m, v.
With gradient g = [0.5, −0.3, 0.8] and w = [1.0, 2.0, 3.0], compute one Adam step from t=1, m=0, v=0.

Adam bias correction: m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ). At t=1: 1 − β₁¹ = 0.1, 1 − β₂¹ = 0.001.

Theory · Loss Functions Medium

Cross-Entropy Loss & Parameter Counting

A 3-class MLP outputs logits [2.1, 0.5, −0.8] for a sample with true label class 0.

Apply Softmax to the logits to get probabilities p̂ (to 4 d.p.).
Compute the Cross-Entropy loss ℒ_CE = −log(p̂_true).
Count the total parameters of an MLP with architecture 784 → 512 → 256 → 10 (include biases).

Softmax: p̂ᵢ = exp(zᵢ) / Σ exp(zⱼ). exp(2.1) ≈ 8.166, exp(0.5) ≈ 1.649, exp(−0.8) ≈ 0.449. Sum ≈ 10.264.

Code · MNIST Training Loop Medium

Full Training & Evaluation Pipeline

Build and train a complete MLP on MNIST in PyTorch, then analyse the results.

Define an MLP with architecture 784 → 256 → ReLU → 128 → ReLU → 10.
Train for 5 epochs on MNIST train set using Adam (lr=1e-3) and CrossEntropyLoss.
Write an evaluate(model, loader) function that returns accuracy % on the test set.
Report: final test accuracy, total training parameters, and the most commonly confused digit pair from the confusion matrix.

For the confusion matrix, use sklearn.metrics.confusion_matrix(y_true, y_pred). Look for the off-diagonal cell with the largest value — this is the most confused pair.

Synthesis · Theory: Activation & Vanishing Gradients Hard

Gradient Magnitude Through a Deep Sigmoid Network

Derive and calculate how the gradient magnitude shrinks as it propagates through a chain of Sigmoid activations.

Show that the maximum value of σ′(z) is 0.25 (at z = 0).
Assume every layer is at z = 0. Compute the gradient magnitude ratio ∥∂ℒ/∂W¹∥ / ∥∂ℒ/∂Wᴸ∥ for a network with L = 1, 5, 10, and 20 Sigmoid layers.
At L = 20, if the output gradient is ∥∂ℒ/∂W²⁰∥ = 1.0, what is ∥∂ℒ/∂W¹∥? Express as a power of 10.
Explain in one paragraph why ReLU resolves this and what its own failure mode is.

The gradient shrinks by a factor of σ′(z) ≤ 0.25 per layer. After L layers: ratio = (0.25)^L. For L=20: (0.25)^20 = (4⁻¹)^20 = 4⁻²⁰ ≈ 10⁻¹².

Synthesis · Code: Optimizer Comparison Hard

Benchmark SGD vs Momentum vs Adam on MNIST

Train the same MLP architecture three times — once with each optimizer — and produce a comparative analysis.

Use the fixed architecture: 784 → 256 → ReLU → 10. Same random seed (torch.manual_seed(42)) for each run.
Train each for 10 epochs: SGD (lr=0.01), SGD+Momentum (lr=0.01, momentum=0.9), Adam (lr=1e-3).
Record train loss and test accuracy per epoch. Plot all three loss curves on one figure.
Report: which optimizer converges fastest (fewest epochs to reach 95% test accuracy)? Which has the lowest final loss? Hypothesize why.

Expected result: Adam reaches 95% test accuracy in ~2 epochs, SGD+Momentum in ~4, SGD alone in ~6+. However, with careful lr tuning, SGD+Momentum can match Adam. Adam's advantage is robustness to lr choice.
