From perceptron to MLP — the mathematical foundations that power every modern computer vision system, built from scratch.
Activation functions, loss surfaces, and the chain rule — the three pillars of every neural network for vision.
"Every convolution, attention head, and transformer block in modern computer vision is just a stack of perceptrons learning one curve at a time."
Think of a single neuron as a light dimmer switch. The inputs are the wires, the weights are the resistance settings, and the activation function is the dimmer itself — it decides how much signal actually passes through. Stack a hundred dimmers in the right configuration and you can sculpt any shape of decision boundary.
MLP anatomy: input → fully-connected hidden layers with activation σ → output. Each edge carries a learned weight; each node applies a non-linearity.
A single neuron computes a weighted sum of its inputs plus a bias, then passes the result through an activation function:
If all weights were 0, what would the neuron output regardless of input?
The loss quantifies how wrong the network's prediction is. Two canonical choices:
MSE for regression; Cross-Entropy for classification (pairs with Softmax output).
What cross-entropy loss does a perfect prediction of probability 1.0 give?
Applying Sigmoid to hidden layers in deep nets causes vanishing gradients. Because σ'(z) ≤ 0.25, gradients shrink by ≥ 75 % at every layer. With 10 layers, the gradient reaching layer 1 is less than 10⁻⁶ of the original. Use ReLU (or LeakyReLU) for all hidden layers; reserve Sigmoid for the final binary output.
Before moving the slider: if you increase the input x from −3 to +3 for a ReLU neuron with weight w = 2 and bias b = −1, at what value of x does the neuron "turn on"?
Think about when z = wx + b transitions from negative to positive.
A neural network is a composition of linear transformations and non-linear activations — the activations are what make it capable of learning any decision boundary, no matter how complex.
An MLP head on top of convolutional features classifies retinal fundus images into 5 severity grades — from no disease to proliferative DR. The final fully-connected layer is exactly the linear + softmax step you just computed, applied to 2048-dimensional CNN features. The same neuron math, operating on richer input.
Q1 A neuron has w = [2, −1] and b = 0.5. Input x = [1, 3]. What is z?
Q2 Why does a network with only linear layers (no activations) fail to learn non-linear patterns?
The chain rule turns the loss into a gradient signal that flows backwards through every weight — and optimizers decide how to use that signal.
"Backpropagation is just the chain rule — applied cleverly to a computational graph of millions of operations without re-computing anything twice."
Imagine you're trying to find the lowest valley on a foggy mountain (the loss surface). You can only feel the ground directly under your feet (local gradient). SGD takes small steps in the steepest downhill direction. Momentum adds a marble rolling down — it builds up speed. Adam carries its own GPS that adapts the step size for each direction independently.
The four-step training loop: forward pass → loss → backward pass (chain rule) → weight update. Repeated for every mini-batch.
For a composition of functions $\mathcal{L} = f(g(h(\mathbf{W})))$, the chain rule gives:
Each term is local — computed at that node — so the network never needs to see the full expression.
Why is the gradient negative (−0.212) and the weight update positive? What does this tell you about the loss landscape?
Adam maintains per-parameter first moment m̂ (gradient mean) and second moment v̂ (gradient variance), giving adaptive learning rates.
Adam adapts the effective learning rate per parameter. Parameters with large gradients (e.g., early output layer) get smaller effective steps; parameters with small or sparse gradients (e.g., embedding rows) get larger effective steps. This makes Adam far more robust to heterogeneous gradient scales across a deep network.
Before adjusting the learning rate slider: what do you predict will happen to the loss curve if η is set to 0.9 (very large)? What about η = 0.000001 (very small)?
Consider overshooting the minimum vs. crawling too slowly.
Backpropagation is an efficient algorithm for computing all gradients simultaneously using the chain rule; the optimizer then decides how to translate those gradients into weight updates — and Adam's per-parameter adaptive learning rate makes it the default choice for most vision models.
Tesla trains its multi-task vision networks (object detection + depth + lane detection) using Adam with carefully scheduled learning rate warmup and cosine decay — applied across hundreds of GPUs. The same backprop loop you just implemented runs billions of times across billions of labelled driving frames. The chain rule is the engine of autonomous perception.
Q1 What does it mean when training loss oscillates wildly and doesn't decrease?
Q2 After 5 layers of Sigmoid activations, the gradient of the first layer is ≈ 10⁻⁷. What problem is this, and what is the standard fix?
Digit classification with a handcrafted MLP proves every concept from Topics 1 & 2 is real — and reveals exactly where a flat network hits its ceiling.
"MNIST is the 'Hello World' of computer vision — deceptively simple yet containing every lesson you need before tackling real-world image data."
An MLP on MNIST is like memorising what the digit "3" looks like pixel-by-pixel in a fixed position. The moment someone writes "3" slightly to the left, the network is confused — it learned position, not shape. Convolutional filters (Week 6) solve this by detecting edges and curves regardless of location — they are translation-equivariant.
Full MNIST pipeline: raw 28×28 image → normalise → flatten to 784-dim vector → MLP → softmax cross-entropy loss → predicted digit class.
Every supervised training run follows the same cycle:
A random model on MNIST (10 classes) would score approximately what accuracy?
An MLP on a flattened 784-dim vector treats pixel (3, 5) and pixel (3, 6) as completely unrelated features. If the digit is shifted by 1 pixel, entirely different weights are activated. The network has no translation invariance.
Parameter count: 784×256 + 256×128 + 128×10 = 234,896 weights — but zero spatial awareness.
Forgetting optimizer.zero_grad() before each backward pass causes gradient accumulation. Gradients are added to existing .grad buffers — not replaced — so after N batches your gradients are N× too large, causing catastrophic weight updates. Always call zero_grad() at the top of the training loop body.
Looking at the digit sketch grid below: which digits do you think an MLP will confuse most often, and why?
Think about which digits look similar when drawn with similar stroke weights (3 vs 8, 4 vs 9, 1 vs 7).
A simple MLP achieves ~98% accuracy on MNIST, proving that the perceptron-to-MLP building blocks work — but the 2% ceiling reveals its spatial blindspot, which convolutional layers in Week 6 are designed to overcome.
MNIST was curated from US Postal Service handwritten digit envelopes. Yann LeCun's original network to read ZIP codes was a direct precursor to convolutional networks — he needed to solve the exact translation-invariance problem you just encountered with your MLP. Every modern OCR system, cheque reader, and form scanner traces its lineage to this single dataset.
Q1 If training accuracy is 98% but test accuracy is 72%, what problem does this indicate?
Q2 Why does an MLP trained on MNIST fail when the digit is shifted 3 pixels to the right at test time?
Compare ReLU, Sigmoid, and Tanh side-by-side — with their derivatives overlaid. Adjust the input range and see how each function transforms values and what gradient magnitude it passes backwards.
Three building blocks — neurons, gradients, and a classifier loop — that underpin every vision model you will build for the rest of the course.
A neuron is z = Wx + b followed by σ(z). Stack with ReLU activations and you can approximate any function on ℝⁿ. Loss (MSE or Cross-Entropy) measures how wrong the prediction is.
The chain rule propagates ∂ℒ/∂W backwards through every layer in a single O(P) pass. No re-computation needed. This is what loss.backward() does in PyTorch.
SGD takes fixed steps. Momentum builds velocity. Adam uses per-parameter adaptive rates — making it the default for vision tasks. Learning rate is your most critical hyperparameter.
~98% MLP accuracy proves the math works. But pixel-wise MLPs lack translation invariance — shift a digit 1 pixel and accuracy collapses. This motivates convolutional layers in Week 6.
Oscillating loss → η too large. Loss not moving → η too small or vanishing gradients. Large train/val gap → overfitting. Small both → underfitting. Read the curve before tuning anything.
Convolutional Neural Networks — how a small sliding kernel replaces 234k weights with 3k, achieves translation equivariance, and pushes MNIST accuracy past 99.7%. Plus pooling, receptive fields, and the CNN feature hierarchy.
Textbook chapters, interactive visualisations, and landmark papers to solidify your understanding of neural network fundamentals.
Shallow networks, loss landscapes, and gradient descent. The clearest derivation of the MLP forward/backward pass in any modern textbook.
→ Primary referenceDeep feedforward networks — covers the universal approximation theorem, gradient flow, and the historical context of activation function choices.
→ Mathematical depthFour-video series: "But what is a neural network?" through "Backpropagation calculus." The most visually intuitive explanation of backprop available.
→ Watch before the examPractical PyTorch MLP and training loop walkthrough. Covers Dropout, Batch Normalization, and learning rate scheduling — all directly applicable to MNIST.
→ Colab lab preparationThe original Adam paper: 9 pages, clearly written, with ablation studies showing per-parameter adaptive rates outperform SGD+Momentum on vision benchmarks.
→ arxiv:1412.6980Build and train a small MLP in your browser — adjust layers, activations, and learning rate and watch the decision boundary form in real time. Best 10 minutes you'll spend before the midterm.
→ playground.tensorflow.orgEight exercises covering perceptron math, activation functions, backpropagation by hand, optimizer mechanics, and a full MNIST training loop — all exam-style.
A neuron has weight vector w = [0.4, −0.7, 1.2], bias b = 0.3, and receives input x = [1.0, 2.0, 0.5].
Formula: z = w₁x₁ + w₂x₂ + w₃x₃ + b = (0.4)(1.0) + (−0.7)(2.0) + (1.2)(0.5) + 0.3
Without using PyTorch, implement the three activation functions and their derivatives as NumPy functions, then plot them over x ∈ [−5, 5].
relu(x), sigmoid(x), and tanh_act(x).relu_deriv(x), sigmoid_deriv(x), tanh_deriv(x).Sigmoid derivative: σ′(x) = σ(x)(1 − σ(x)). This is maximised at x = 0 where σ(0) = 0.5, giving σ′(0) = 0.5 × 0.5 = 0.25.
A single neuron: z = wx + b, a = ReLU(z), loss ℒ = (y − a)². Given: x = 3, w = 0.5, b = −0.5, y = 2.
Remember the ReLU derivative: ∂a/∂z = 1 if z > 0, else 0. Since z = 0.5(3) + (−0.5) = 1.0 > 0, the derivative is 1.
Using only NumPy (no PyTorch optimizers), implement the parameter update rules for SGD with momentum and a single Adam step.
sgd_update(w, grad, lr=0.01, momentum=0.9, v_prev=0). Return updated w and new velocity v.adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8). Return updated w, m, v.Adam bias correction: m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ). At t=1: 1 − β₁¹ = 0.1, 1 − β₂¹ = 0.001.
A 3-class MLP outputs logits [2.1, 0.5, −0.8] for a sample with true label class 0.
Softmax: p̂ᵢ = exp(zᵢ) / Σ exp(zⱼ). exp(2.1) ≈ 8.166, exp(0.5) ≈ 1.649, exp(−0.8) ≈ 0.449. Sum ≈ 10.264.
Build and train a complete MLP on MNIST in PyTorch, then analyse the results.
evaluate(model, loader) function that returns accuracy % on the test set.For the confusion matrix, use sklearn.metrics.confusion_matrix(y_true, y_pred). Look for the off-diagonal cell with the largest value — this is the most confused pair.
Derive and calculate how the gradient magnitude shrinks as it propagates through a chain of Sigmoid activations.
The gradient shrinks by a factor of σ′(z) ≤ 0.25 per layer. After L layers: ratio = (0.25)^L. For L=20: (0.25)^20 = (4⁻¹)^20 = 4⁻²⁰ ≈ 10⁻¹².
Train the same MLP architecture three times — once with each optimizer — and produce a comparative analysis.
torch.manual_seed(42)) for each run.Expected result: Adam reaches 95% test accuracy in ~2 epochs, SGD+Momentum in ~4, SGD alone in ~6+. However, with careful lr tuning, SGD+Momentum can match Adam. Adam's advantage is robustness to lr choice.