Computer Vision Foundations — Week 3

Finding Stable Anchors
in a Sea of Pixels

Master the detectors and descriptors that underpin every modern vision pipeline — from Harris corners to ORB, and from ratio-test matching to homography-based AR tracking.

Harris Corner Detector SIFT & DoG Scale-Space ORB & Binary Descriptors Feature Matching Szeliski Ch. 7

What is a Feature? & Harris Corner Detector

Corners, blobs, and interest points — the stable, distinctive image locations that algorithms can reliably detect and re-identify across viewpoints, scales, and lighting changes.

After this section you will be able to
  • Explain why corners are more useful than edges or flat regions for matching
  • Compute the Harris response $R = \det(M) - k\cdot\text{trace}(M)^2$ by hand and classify a pixel
  • Apply Non-Maximum Suppression to extract a sparse set of corner candidates

Imagine you're a detective trying to identify a location from two photographs taken at different times of day. You wouldn't focus on a blank wall or an empty sky — you'd look for corners of buildings, edges of windows, distinctive junctions. That intuition is exactly what computer vision feature detectors formalize: find the points that are unique, stable, and worth staking a match on.

🎯
Why this matters: Every downstream task — 3D reconstruction, augmented reality, image retrieval, visual SLAM — depends on finding the same physical point across different images. A detector that picks unstable, textureless regions wastes matching budget and produces wrong answers downstream. The structure tensor gives us a rigorous, eigenvalue-based answer to: "Is this location distinctive enough to stake a match on?"
Think of it this way

A corner is like a GPS waypoint: it sits at the intersection of two distinct edges, giving you a unique location you can return to reliably regardless of the route you took to get there. A flat wall is like open ocean — every direction looks the same, no useful landmark. An edge is like a coastline — you can tell you're on it, but not exactly where along it you are.

🔁
Repeatability
Same physical point detected under viewpoint and lighting changes
e.g. detected in both images of a scene
🎯
Distinctiveness
Descriptor around keypoint is unique in the image
e.g. low false-match rate
📌
Locality
Robust to partial occlusion — only a small patch is needed
e.g. 31×31 pixel window
Efficiency
Fast enough to detect thousands per frame in real time
e.g. <10 ms on CPU for 1080p
Uniform intensity λ₁ ≈ λ₂ ≈ 0 · Flat λ₁ ≫ λ₂ ≈ 0 · Edge λ₁ ≈ λ₂ ≫ 0 · Corner FLAT REGION EDGE CORNER

Eigenvalue ellipses reveal region type: flat patches have no preferred direction; edges change strongly in one direction only; corners change strongly in all directions.

Problem — Classify each pixel without computing eigenvalues

The Structure Tensor

To decide whether a patch is a corner, edge, or flat region, we compute the structure tensor $M$ — a 2×2 matrix encoding dominant gradient directions in a local window $W$:

$$M = \sum_{(x,y)\in W} w(x,y)\begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

where $I_x$, $I_y$ are Sobel gradients and $w(x,y)$ is a Gaussian window. Eigenvalues $\lambda_1 \geq \lambda_2$ measure intensity change strength in the two principal directions.

📝 Worked Example — Structure Tensor Classification

Problem: A 3×3 window yields $M = \begin{bmatrix}12 & 0 \\ 0 & 12\end{bmatrix}$. Classify the region.

1
Find eigenvalues. Diagonal matrix → $\lambda_1 = 12$, $\lambda_2 = 12$.
2
Classify. Both $\lambda_1$ and $\lambda_2$ are large and equal — intensity changes strongly in all directions.
λ₁ = 12, λ₂ = 12 → Corner
Quick Check

If $M = \begin{bmatrix}80 & 0 \\ 0 & 2\end{bmatrix}$, what is the region type?

λ₁=80 ≫ λ₂=2 ≈ 0 → Edge (strong gradient in one direction only).

The Harris Response Function

Computing eigenvalues explicitly is expensive. Harris (1988) replaced it with a single scalar score using algebraic identities $\det(M)=\lambda_1\lambda_2$ and $\text{trace}(M)=\lambda_1+\lambda_2$:

$$R = \det(M) - k\cdot\text{trace}(M)^2$$

where $k \approx 0.04$–$0.06$ is the sensitivity parameter. Sign of $R$ directly encodes type: $R \gg 0$ → corner; $R \ll 0$ → edge; $|R| \approx 0$ → flat.

📝 Worked Example — Computing R

Problem: $M = \begin{bmatrix}6 & 2 \\ 2 & 4\end{bmatrix}$, $k=0.05$. Compute $R$ and classify.

1
det(M). $6 \times 4 - 2 \times 2 = 24 - 4 = 20$
2
trace(M)². $(6+4)^2 = 100$
3
Harris R. $R = 20 - 0.05 \times 100 = 15$
R = 15 > 0 → Corner detected
⚠️
Common Mistake

Confusing "large eigenvalue" with "strong edge": A single large eigenvalue ($\lambda_1 \gg \lambda_2 \approx 0$) means a strong gradient in one direction only — that's an edge, not a corner. A corner requires both eigenvalues large. Thresholding on $\lambda_1$ alone detects edge pixels as features.

Solution — Explore the eigenvalue classification space
Pause & Predict

Before moving the sliders: if you set λ₁ = 80 and λ₂ = 5, what region type do you predict? What happens to the Harris response R as you slowly increase λ₂ from 5 to 80?

Try It: Eigenvalue Region Classifier

Adjust the two eigenvalues and watch which region type the structure tensor predicts. The dot moves in $(\lambda_1, \lambda_2)$ space — watch the boundary lines shift.

70
60
0.04
Flat (|R|≈0) Edge (R<0) Corner (R>0) Current point
Harris Response — Live Calculation
Implementation — Harris detector in OpenCV

Python · OpenCV — Harris corner detection with NMS

import cv2 import numpy as np img = cv2.imread('building.jpg') gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) gray = np.float32(gray) # Compute Harris R map for every pixel R = cv2.cornerHarris(gray, blockSize=5, ksize=3, k=0.05) # Dilate to merge nearby peaks, then threshold at 1% of max R R_dil = cv2.dilate(R, None) corners = R_dil > 0.01 * R_dil.max() result = img.copy() result[corners] = [0, 0, 255] # mark corners red print(f"R range: {R.min():.4f} to {R.max():.4f}") print(f"Detected corners: {corners.sum()}")
Output
R range: -0.0023 to 0.0841 Detected corners: 312
Key Takeaway

A corner is the unique pixel where intensity changes strongly in all directions simultaneously — and Harris detects this in constant time per pixel by replacing eigenvalue decomposition with a single algebraic score $R = \det(M) - k\cdot\text{trace}(M)^2$.

🗺️
Real-World Application

Visual SLAM: Corners as Persistent Landmarks

In Simultaneous Localization and Mapping (SLAM) systems used by autonomous robots, Harris corners serve as persistent 3D landmarks. A corner detected at pixel $(u,v)$ in frame $t$ is triangulated into world coordinates and re-observed in frame $t+1$ to estimate ego-motion. Systems such as ORB-SLAM3 track thousands of corners per frame to maintain a globally consistent map, enabling navigation in GPS-denied environments like warehouses and underground mines.

Checkpoint Features & Harris — 3 quick questions

1. A pixel has λ₁ = 90, λ₂ = 85, k = 0.05. What is the Harris response R?

det(M) = λ₁λ₂ = 90 × 85 = 7 650; trace(M)² = (90 + 85)² = 30 625; R = 7 650 − 0.05 × 30 625 = 6 118.75 ≫ 0 → Corner.

2. Why does Harris use det(M) − k · trace(M)² instead of computing eigenvalues directly?

Eigenvalue computation requires a square-root per pixel — very expensive at image scale. The algebraic identities det = λ₁λ₂ and trace = λ₁ + λ₂ let us evaluate the same classification with only multiplications and additions, running in constant time per pixel.

3. After computing the Harris response map, what does Non-Maximum Suppression (NMS) do?

NMS keeps only the pixels where R is strictly larger than all neighbours in a local window — converting clusters of high-R pixels around each corner into a single, maximally-distinctive point. Without NMS, the same corner would produce hundreds of nearby detections.

SIFT & ORB: Scale-Invariant Description & Matching

SIFT builds 128-dimensional gradient histograms that survive scale and rotation changes; ORB replaces every float with a single bit comparison — 100× faster and patent-free.

After this section you will be able to
  • Describe how SIFT builds a Gaussian scale-space and detects keypoints as DoG extrema
  • Compute Hamming distance between two ORB binary descriptors by hand
  • Apply Lowe's ratio test to filter false matches from a nearest-neighbour search

Take a photo of a building from across the street, then walk halfway and take another. Run Harris on both images and almost none of the corners match — they've moved, disappeared, or been replaced by new ones. A detector that only works at one scale is nearly useless in a world where cameras zoom, objects approach, and scene resolution varies. David Lowe's 2004 answer was to ask a different question: not "is this a corner?" but "at what scale does this point look most like a blob?"

🔭
Why this matters: SIFT was the breakthrough that made robust image matching practical across a 5× scale change. ORB then democratized it for embedded hardware by replacing 128 floats with 256 bits compared via a single CPU instruction. Together they define the modern foundation of feature-based vision.
🛂
Think of it this way

A SIFT descriptor is like a passport photo: regardless of how far away you are when the photo is taken, it's always cropped, normalized, and rotated to the same standard pose before being stored. ORB is like a QR code version of that passport — you throw away the grey-scale detail and keep only black-or-white comparisons, but it's fast enough to scan thousands per second.

1
Image
Input image
2
Gaussian pyramid (σ levels)
3
DoG extrema detection
4
orient θ
Orientation assignment
5
4×4 × 8-bin = 128-d
128-dim descriptor
Problem — Describe keypoints that survive scale, rotation & illumination changes

SIFT: Difference of Gaussians & 128-dim Descriptor

SIFT builds a Gaussian scale space $L(x,y,\sigma) = G(x,y,\sigma) * I(x,y)$ and subtracts adjacent levels to form the Difference of Gaussians (DoG), approximating the scale-normalised Laplacian:

$$D(x,y,\sigma) = L(x,y,k\sigma) - L(x,y,\sigma)$$

Keypoints are local extrema in the 3D $(x,y,\sigma)$ DoG volume — stable blob-like structures at their natural scale. Each keypoint gets a 128-dimensional descriptor from a 4×4 gradient orientation histogram grid, L2-normalised for illumination invariance.

📝 Worked Example — DoG Keypoint Detection

Problem: At pixel $(x,y)$: $L_1=140$ (σ=1.0), $L_2=122$ (σ=1.4), $L_3=108$ (σ=2.0). Compute the two DoG values and determine if this pixel is a candidate keypoint (assuming it is a local minimum in its 3D neighbourhood).

1
DoG between levels 1 and 2.
$D_1 = L_2 - L_1 = 122 - 140 = -18$
Negative: pixel was bright at fine scale — a blob peak.
2
DoG between levels 2 and 3.
$D_2 = L_3 - L_2 = 108 - 122 = -14$
Still negative but smaller — response concentrated near σ≈1.0.
3
Classify. If $D_1 = -18$ is a local minimum (smaller than all 26 neighbours in 3×3×3 DoG volume):
D₁ = −18 is a local minimum → Candidate SIFT keypoint at σ ≈ 1.0

ORB: FAST + rBRIEF & Hamming Distance

ORB replaces DoG detection with FAST corner detection and replaces the 128-float descriptor with a 256-bit binary string (rBRIEF). Each bit is one intensity comparison $I(p_i) < I(q_i)$ between a pre-learned pair of pixels, rotated by keypoint orientation θ for rotation invariance:

$$d_H(\mathbf{d}_1, \mathbf{d}_2) = \text{popcount}(\mathbf{d}_1 \oplus \mathbf{d}_2)$$

Hamming distance via XOR + popcount runs in a single CPU instruction on 64-bit integers — over 100× faster than SIFT's L2 norm on 128 floats.

📝 Worked Example — Hamming Distance

Problem: Two 8-bit ORB fragments are $\mathbf{d}_1 = 10110010_2$ and $\mathbf{d}_2 = 10011011_2$. Compute Hamming distance.

1
XOR.
$\mathbf{d}_1$: 1 0 1 1 0 0 1 0
$\mathbf{d}_2$: 1 0 0 1 1 0 1 1
XOR: 0 0 1 0 1 0 0 1
2
popcount. 3 bits differ (positions 5, 3, 0).
Hamming distance = 3 / 8 bits → moderate similarity

Lowe's Ratio Test — Filtering False Matches

Nearest-neighbour matching alone produces many false positives. Lowe's ratio test rejects ambiguous matches: find the two closest descriptors $d_1^{(1)}$ (nearest) and $d_1^{(2)}$ (second-nearest). Accept the match only if:

$$\frac{\text{dist}(\mathbf{d}_A, \mathbf{d}_B^{(1)})}{\text{dist}(\mathbf{d}_A, \mathbf{d}_B^{(2)})} < \tau \quad (\tau \approx 0.75)$$

If the nearest match is much closer than the second-nearest, the match is probably correct. If the two distances are similar, the feature is ambiguous — discard it.

📝 Worked Example — Ratio Test

Problem: Descriptor $\mathbf{d}_A$ has nearest-neighbour distance 42.0 and second-nearest 58.0, with $\tau = 0.75$. Does it pass the ratio test?

1
Compute ratio. $42.0 / 58.0 = 0.724$
2
Compare to threshold. $0.724 < 0.75$
Ratio = 0.724 < 0.75 → Match accepted ✓
⚠️
Common Mistake

Using SIFT for real-time applications on CPU without acceleration: OpenCV's SIFT_create is free since 2020 but still ~50–200 ms per frame on CPU for large images. For real-time (≥30 fps) applications, use ORB or AKAZE. SIFT is ideal for offline structure-from-motion or image retrieval where accuracy matters more than speed.

Solution — Explore the Gaussian scale-space interactively
Pause & Predict

Before moving the σ slider: at very large blur (σ = 8.0), do you expect more or fewer DoG extrema compared to σ = 1.0? Will the extrema correspond to fine texture details or to larger structural blobs?

Try It: Gaussian Scale-Space Explorer

Slide σ to see how blur progressively suppresses fine detail. The right panel shows the DoG — bright and dark spots mark potential keypoints at each scale.

σ = 1.0
0.75
Blurred image (left half) DoG positive peak DoG negative peak
Implementation — SIFT + ORB + ratio test in OpenCV

Python · OpenCV — SIFT detection and ORB matching with ratio test

import cv2 img1 = cv2.imread('scene_A.jpg', cv2.IMREAD_GRAYSCALE) img2 = cv2.imread('scene_B.jpg', cv2.IMREAD_GRAYSCALE) # ─── SIFT: detect + describe ──────────────────────────────── sift = cv2.SIFT_create(nfeatures=500) kp1, des1 = sift.detectAndCompute(img1, None) kp2, des2 = sift.detectAndCompute(img2, None) print(f"SIFT: {len(kp1)} kp, descriptor shape {des1.shape}") # ─── ORB: detect + describe ───────────────────────────────── orb = cv2.ORB_create(nfeatures=1000) kp3, des3 = orb.detectAndCompute(img1, None) print(f"ORB: {len(kp3)} kp, descriptor dtype {des3.dtype}") # ─── Lowe ratio test with SIFT L2 matching ────────────────── bf = cv2.BFMatcher(cv2.NORM_L2) raw = bf.knnMatch(des1, des2, k=2) good = [m for m, n in raw if m.distance / n.distance < 0.75] print(f"Ratio-test matches: {len(good)} / {len(raw)}")
Output
SIFT: 487 kp, descriptor shape (487, 128) ORB: 987 kp, descriptor dtype uint8 Ratio-test matches: 214 / 487
Key Takeaway

SIFT anchors every descriptor to the keypoint's natural scale and dominant orientation — making it invariant to zoom and rotation. ORB achieves comparable matching quality at 100× speed by replacing 128 floats with 256 bit-comparisons. Both rely on Lowe's ratio test to discard ambiguous matches before downstream geometry estimation.

📍
Real-World Application

Visual Place Recognition for Autonomous Navigation

SIFT underpins image-based localization pipelines used in autonomous vehicles and delivery robots. A query image is matched against a database of geo-tagged reference images via SIFT nearest-neighbour search. The matching keypoint pairs determine the 6-DoF camera pose via PnP. The NetVLAD retrieval pipeline and CMU-Seasons benchmark demonstrate that SIFT keypoints remain matchable across seasons — a query image captured months after the database was built can still be localised.

Checkpoint SIFT & ORB — 3 quick questions

1. Why does SIFT use Difference of Gaussians instead of raw image derivatives to detect keypoints?

DoG efficiently approximates the scale-normalised Laplacian-of-Gaussian (LoG), which gives a strong response at blob centres across scales. By detecting local extrema in the 3D DoG volume, SIFT finds points that are stable across a continuous scale range — something a single-scale derivative cannot do.

2. Two SIFT descriptors have nearest-neighbour distances 60 and 62. τ = 0.75. Does the match pass the ratio test?

Ratio = 60 / 62 = 0.968 > 0.75. The match is rejected — the two best candidates are nearly equidistant, meaning the feature is ambiguous and could match many things in the database.

3. ORB descriptors are 32 bytes (256 bits). What is the maximum possible Hamming distance between two ORB descriptors?

256 — every bit differs. In practice, a Hamming distance > 128 (50% of bits differ) is essentially a random match; typical acceptance thresholds are ≤60 bits (≤23%).

AR Marker Tracking & Document Scanning

Every time your phone straightens a whiteboard photo or Snapchat anchors an effect to a face, a feature matching pipeline runs at 30 fps — connecting the detectors and descriptors from Topics 1 and 2 into a complete, production-grade system.

After this section you will be able to
  • Construct a complete feature matching pipeline from detection to RANSAC-filtered homography
  • Explain why RANSAC is necessary to remove incorrect feature matches
  • Apply the homography matrix to warp a source image onto a target plane

Every time Adobe Scan straightens a photograph of a whiteboard, or Snapchat anchors virtual sunglasses to a moving face, or a robotic arm identifies a product on a conveyor belt — a matching pipeline is running 30 times per second, aligning features from one frame to the next. The math is the same as what you learned in Topics 1 and 2; what changes is how you chain it together under real-world noise.

🚀
Why this matters: Feature detection and description are only useful if you can align what you find. Homography estimation from feature matches is the glue that turns a list of keypoints into a geometric transformation — enabling AR overlay, document rectification, panorama stitching, and visual odometry. RANSAC is the critical piece that makes it work despite inevitable wrong matches.
🧩
Think of it this way

Imagine solving a jigsaw puzzle with some counterfeit pieces mixed in. You can't tell which pieces are fake until you try to fit several together. RANSAC does exactly this: it repeatedly tries small random subsets of matches to estimate a homography, keeps the model that explains the most pieces simultaneously, and declares the rest counterfeits (outliers).

1
Detect keypoints
2
128-d
Compute descriptors
3
Match + ratio test
4
RANSAC
RANSAC — reject outliers
5
Homography warp
Problem — Estimate a geometric transformation from noisy feature matches

Homography Matrix & the DLT Algorithm

A homography $H$ is a 3×3 matrix that maps every point $\mathbf{p}$ in one image to its corresponding point $\mathbf{p}'$ in another (related by a planar scene or pure rotation):

$$\mathbf{p}' = H\mathbf{p}, \quad H = \begin{bmatrix}h_{11}&h_{12}&h_{13}\\h_{21}&h_{22}&h_{23}\\h_{31}&h_{32}&h_{33}\end{bmatrix}$$

$H$ has 8 degrees of freedom (scale is fixed). The Direct Linear Transform (DLT) estimates $H$ from 4+ point correspondences by constructing a linear system from the cross-product constraint $\mathbf{p}' \times H\mathbf{p} = \mathbf{0}$.

📝 Worked Example — Applying a Homography

Problem: $H = \begin{bmatrix}1.2 & 0 & 50 \\ 0 & 1.2 & 30 \\ 0 & 0 & 1\end{bmatrix}$. Map point $\mathbf{p} = (100, 80, 1)^T$.

1
Multiply.
$\tilde{x}' = 1.2(100) + 0(80) + 50 = 170$
$\tilde{y}' = 0(100) + 1.2(80) + 30 = 126$
$\tilde{w}' = 0(100) + 0(80) + 1 = 1$
2
Normalize. Divide by $\tilde{w}'=1$:
p' = (170, 126) — scaled by 1.2× and translated by (50, 30)

RANSAC — Robust Estimation Under Outliers

Feature matching always produces some wrong matches (outliers). RANSAC (Random Sample Consensus) fits the model robustly by repeating:

  1. Randomly sample 4 match pairs
  2. Compute hypothesis $H$ from the 4 pairs
  3. Count inliers: matches where $\|\mathbf{p}' - H\mathbf{p}\| < \varepsilon$
  4. Keep the $H$ with the most inliers
  5. Refit $H$ using all inliers from the best hypothesis
📝 Worked Example — RANSAC Inlier Count

Problem: A hypothesis $H$ is tested against 10 matches. Reprojection errors (pixels) are: $[1.2, 18.4, 0.8, 2.1, 22.0, 1.5, 0.9, 19.3, 1.1, 2.4]$. With $\varepsilon = 5$ px, how many inliers?

1
Apply threshold. Inliers (error < 5): indices 0(1.2), 2(0.8), 3(2.1), 5(1.5), 6(0.9), 8(1.1), 9(2.4) → 7 inliers. Outliers: 1, 4, 7.
2
Inlier ratio. 7/10 = 70% inliers — a good hypothesis.
Inlier count = 7 / 10 → Accept this H as a strong candidate
💡
Key Insight

RANSAC's required iterations depend on the outlier ratio, not the dataset size. With 50% outliers and 4 points per sample, the probability of sampling 4 clean inliers in one trial is $(0.5)^4 = 6.25\%$. To get a 99% success probability you need $\lceil\log(0.01)/\log(1-0.0625)\rceil \approx 72$ iterations — regardless of whether you have 100 or 100,000 matches.

Solution — Matching pipeline demo
Pause & Predict

Before adjusting the threshold slider: if you set the ratio test threshold to 0.95 (very permissive), do you expect more or fewer final matches after RANSAC? Would they be more or less accurate?

Try It: Feature Matching Pipeline Simulator

Adjust matching quality and RANSAC threshold to see how inlier/outlier balance changes. Green lines = accepted inliers; red dashed = RANSAC outliers.

0.75
5 px
RANSAC inlier RANSAC outlier Keypoint
Pipeline Statistics
Implementation — Document scanning pipeline in OpenCV

Python · OpenCV — ORB + BFMatcher + RANSAC homography → document warp

import cv2 import numpy as np src = cv2.imread('document.jpg', cv2.IMREAD_GRAYSCALE) ref = cv2.imread('template.jpg', cv2.IMREAD_GRAYSCALE) # ─── Detect + describe with ORB ───────────────────────────── orb = cv2.ORB_create(nfeatures=2000) kp1, d1 = orb.detectAndCompute(src, None) kp2, d2 = orb.detectAndCompute(ref, None) # ─── Hamming brute-force match + ratio test ────────────────── bf = cv2.BFMatcher(cv2.NORM_HAMMING) raw = bf.knnMatch(d1, d2, k=2) good = [m for m, n in raw if m.distance / n.distance < 0.75] print(f"Good matches after ratio test: {len(good)}") # ─── RANSAC homography ────────────────────────────────────── if len(good) >= 4: pts1 = np.float32([kp1[m.queryIdx].pt for m in good]) pts2 = np.float32([kp2[m.trainIdx].pt for m in good]) H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0) inliers = mask.ravel().sum() print(f"RANSAC inliers: {inliers} / {len(good)}") warped = cv2.warpPerspective(src, H, (ref.shape[1], ref.shape[0])) cv2.imwrite('rectified.jpg', warped)
Output
Good matches after ratio test: 183 RANSAC inliers: 141 / 183
Key Takeaway

A production feature pipeline chains five steps — detect, describe, match, RANSAC, warp — and each step's quality determines the next one's ceiling. RANSAC is what makes the pipeline robust: even with 40% outlier matches, it reliably recovers the correct homography by sampling small subsets and voting for geometric consistency.

📄
Real-World Application

Adobe Scan & AR Marker Tracking

Adobe Scan rectifies photographed documents into flat, print-ready PDFs by detecting the document boundary as four feature-matched corners and applying a perspective warp to the full-resolution image. AR marker tracking (e.g. ARCore, Vuforia) uses the same pipeline at 30 fps: ORB detects the marker corners, RANSAC estimates the homography, and the result feeds the camera pose matrix for 3D overlay. The key engineering insight is that RANSAC makes the system robust enough to handle motion blur, partial occlusion, and specular reflections that would defeat a naive least-squares homography.

Checkpoint AR Tracking & Document Scanning — 3 quick questions

1. Where does point (50, 40, 1)ᵀ map to under homography H?

$$H = \begin{bmatrix}2&0&10\\0&2&20\\0&0&1\end{bmatrix}$$

x̃′ = 2(50) + 10 = 110; ỹ′ = 2(40) + 20 = 100; w̃′ = 1. Result: (110, 100) — scaled 2× and translated.

2. Why does RANSAC need at least 4 point correspondences to estimate a homography?

A homography has 8 degrees of freedom (it's a 3×3 matrix with scale fixed). Each point correspondence gives 2 equations, so 4 pairs give 8 equations — just enough to solve for all 8 unknowns uniquely using DLT.

3. You have 300 matches with 60% outliers. Approximately how many RANSAC iterations are needed for 99% success probability (sample size = 4)?

P(all 4 inliers) = 0.4⁴ = 0.0256; iterations = ⌈log(0.01) / log(1 − 0.0256)⌉ = ⌈−4.605 / −0.02593⌉ ≈ 178.

Integrated Capstone Demo — Synthesis of Topics 1–3

A comprehensive, unified interactive simulation integrating all 3 core topics taught this week. See how descriptor similarity threshold and RANSAC reprojection error interact to produce a clean set of inlier matches between two synthetic image patches.

Adjust the controls to explore the trade-off between match quantity and match quality. Green lines survive both the ratio test and RANSAC; red dashed lines are rejected.

20
0.75
8
Inlier match Ratio-test reject RANSAC outlier Keypoint
Match Statistics — Live

Key Concepts

Four ideas that every CV engineer needs at their fingertips — from corner response to RANSAC homography.

📐

Corners from Eigenvalues

A pixel is a corner when intensity changes strongly in all directions — both eigenvalues $\lambda_1 \approx \lambda_2 \gg 0$. Harris approximates this with $R = \det(M) - k\cdot\text{trace}(M)^2$ in constant time per pixel.

🔭

SIFT: Natural Scale Anchoring

SIFT finds keypoints as DoG extrema in a 3D scale-space volume, anchors description to the blob's natural scale $\hat{\sigma}$, and builds a 128-dim gradient histogram — surviving 5× zoom changes and full rotation.

ORB: Bits Not Floats

ORB replaces SIFT's 128 floats with 256 binary comparisons (rBRIEF). Matching uses Hamming distance via XOR+popcount — a single CPU instruction, making 30 fps feature tracking feasible on mobile hardware.

🎯

Ratio Test & RANSAC

Lowe's ratio test discards ambiguous matches ($d_1/d_2 < \tau$). RANSAC estimates the homography robustly by voting across random 4-point samples — tolerating up to ~60% outliers.

Coming up — Week 4: Image Alignment & Panoramas: we take the homography from this week and push it further: DLT estimation, RANSAC robustification, and stitching overlapping photographs into seamless panoramas.

Deepen Your Understanding

Four resources — textbook, official docs, tutorial, and video — to solidify feature detection and matching.

Exercises

Practice problems covering Harris corner response, SIFT/ORB descriptors, and RANSAC homography — plus synthesis pipeline challenges.

1
Theory · Feature Detection & Harris Easy

Harris Response Classification

A 3×3 image window produces structure tensor $M = \begin{bmatrix}8 & 3 \\ 3 & 5\end{bmatrix}$ with sensitivity parameter $k = 0.06$.

  1. Compute $\det(M)$ and $\text{trace}(M)^2$.
  2. Compute the Harris response $R = \det(M) - k\cdot\text{trace}(M)^2$.
  3. Classify the pixel as flat, edge, or corner, and justify your answer.
  4. If $k$ were increased to $0.10$, would $R$ increase or decrease? What does this imply about detection sensitivity?

Step 1: $\det(M) = 8\times5 - 3\times3 = 40-9=31$; $\text{trace}(M)=(8+5)=13$; $\text{trace}(M)^2=169$.

Step 2: $R = 31 - 0.06\times169 = 31 - 10.14 = 20.86$.

Step 3: $R = 20.86 \gg 0$ → Corner. Both eigenvalues are large (verify: char. eq. $\lambda^2-13\lambda+31=0$ gives $\lambda \approx 10.3$ and $2.7$ — both positive and non-negligible).

Step 4: Higher $k$ → larger penalty on $\text{trace}(M)^2$ → smaller $R$ → harder to classify as corner. Increasing $k$ makes the detector more conservative (fewer false corners from edges).

2
Code · Feature Detection & Harris Easy

Harris Corner Detection Pipeline

Using OpenCV, implement a complete Harris corner detection pipeline:

  1. Load any grayscale image and convert to float32.
  2. Compute the Harris response map using cv2.cornerHarris with blockSize=5, ksize=3, k=0.05.
  3. Dilate the response map with cv2.dilate (kernel=None).
  4. Threshold at $0.01 \times R_{\max}$ and count detected corners.
  5. Mark detected corners in red on a copy of the original image and print the count.

Key steps: gray = np.float32(gray)R = cv2.cornerHarris(...)R_dil = cv2.dilate(R, None)mask = R_dil > 0.01 * R_dil.max()result[mask] = [0, 0, 255]. The dilate step merges nearby peaks before thresholding so you get one detection per corner, not a cluster.

3
Theory · SIFT & ORB Medium

Scale-Space DoG & Descriptor Distance

Two SIFT keypoints have 4-dimensional (truncated) descriptors $\mathbf{d}_1 = [0.6, 0.1, 0.7, 0.3]$ and $\mathbf{d}_2 = [0.5, 0.2, 0.6, 0.4]$ (both L2-normalised). Two ORB keypoint fragments are $\mathbf{b}_1 = 11001010_2$ and $\mathbf{b}_2 = 10110010_2$.

  1. Compute the L2 (Euclidean) distance between $\mathbf{d}_1$ and $\mathbf{d}_2$.
  2. Compute the Hamming distance between $\mathbf{b}_1$ and $\mathbf{b}_2$ via XOR+popcount.
  3. A SIFT match has nearest-neighbour distance 38.4 and second-nearest 52.0, with $\tau=0.75$. Does it pass the ratio test?
  4. Explain why ORB is preferred over SIFT for real-time (≥30 fps) applications on embedded hardware.

(a) $\|\mathbf{d}_1-\mathbf{d}_2\|=\sqrt{(0.1)^2+(0.1)^2+(0.1)^2+(0.1)^2}=\sqrt{0.04}=0.2$.

(b) XOR: $11001010 \oplus 10110010 = 01111000$; popcount = 4 bits differ → Hamming = 4.

(c) Ratio = $38.4/52.0 = 0.738 < 0.75$ → Passes.

(d) ORB compares 256-bit integers via XOR+popcount (single CPU instruction), vs. SIFT which computes L2 over 128 floats (128 multiplications + 127 additions). ORB is ~100× faster per comparison, enabling thousands of matches per millisecond.

4
Code · SIFT & ORB Medium

Descriptor Matching with Ratio Test

Implement a feature matching pipeline that compares SIFT and ORB on the same image pair:

  1. Detect and compute SIFT descriptors on two images.
  2. Use BFMatcher(NORM_L2) with knnMatch(k=2) and apply Lowe's ratio test at $\tau = 0.75$.
  3. Repeat steps 1–2 using ORB with BFMatcher(NORM_HAMMING).
  4. Print: number of raw matches, number passing ratio test, and ratio (good/raw) for both descriptors.

Key: for SIFT use cv2.BFMatcher(cv2.NORM_L2); for ORB use cv2.BFMatcher(cv2.NORM_HAMMING). Ratio test: [m for m, n in raw if m.distance / n.distance < 0.75]. Compare the good/raw ratios — SIFT typically filters more aggressively (lower ratio) because its 128-dim space has better separation.

5
Theory · AR Tracking & Document Scanning Medium

Homography Mapping & RANSAC Iterations

You have a homography $H = \begin{bmatrix}1.5 & 0 & 80 \\ 0 & 1.5 & 40 \\ 0 & 0 & 1\end{bmatrix}$. A feature matching pipeline returns 200 matches with an estimated 55% outlier ratio.

  1. Map the point $\mathbf{p} = (120, 90, 1)^T$ through $H$. Give the result in inhomogeneous coordinates.

  2. Compute the minimum number of RANSAC iterations needed for a 99% probability of finding one all-inlier sample of size 4.
  3. How many inlier matches do you expect from the 200 total (using the 45% inlier rate)?
  4. Is 4 inliers sufficient to estimate a reliable homography? What is the minimum for DLT?

(a) $\tilde{x}'=1.5(120)+80=260$; $\tilde{y}'=1.5(90)+40=175$; $w'=1$. Result: (260, 175).

(b) P(4 inliers) = $(0.45)^4=0.041$. Iterations = $\lceil\log(0.01)/\log(0.959)\rceil = \lceil{-4.605}/{-0.0419}\rceil \approx 110$ iterations.

(c) $200 \times 0.45 = 90$ expected inliers.

(d) DLT minimum is 4 correspondences (8 equations for 8 DoF). With 90 inliers available, RANSAC refits on all inliers for a much more accurate final $H$.

6
Code · AR Tracking & Document Scanning Medium

RANSAC Homography & Perspective Warp

Build the complete matching-to-warp pipeline:

  1. Detect ORB keypoints on two images and compute descriptors.
  2. Match with BFMatcher(NORM_HAMMING) + ratio test at $\tau = 0.75$.
  3. Use cv2.findHomography(..., cv2.RANSAC, 5.0) to estimate $H$ and return the inlier mask.
  4. Apply cv2.warpPerspective with $H$ to warp the source image onto the reference frame.
  5. Print: total matches, inlier count, inlier ratio, and save the warped image.

Key: H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0) where pts1/pts2 are float32 arrays of keypoint coordinates from good matches. mask.ravel().sum() gives inlier count. cv2.warpPerspective(src, H, (w, h)) applies the warp. Check that H is not None before warping.

7
Synthesis · Theory: Multi-Scale Feature Matching Challenge Hard

Choosing the Right Detector and Parameters for a Given Application

You are designing a feature matching system for a mobile AR application that must run at 30 fps on a mid-range smartphone. The scene contains textured posters with sharp corners and smooth logo regions. The camera may zoom between 1× and 3× during a session.

  1. Explain whether Harris or SIFT is more appropriate for the keypoint detection step, given the 3× scale change requirement. Justify with reference to the structure tensor and DoG.
  2. Between SIFT 128-dim float and ORB 256-bit binary descriptors, which should you use for real-time matching, and why? Support with a computational cost comparison (multiply–add operations per descriptor comparison).
  3. If the ratio test threshold is relaxed from $\tau = 0.75$ to $\tau = 0.90$, how will this affect: (a) the number of accepted matches, (b) the homography estimation quality, and (c) the required number of RANSAC iterations?
  4. Given 400 matches with 40% outlier rate, compute the expected RANSAC iterations for 99% success with a sample size of 4. How does this change if the outlier rate rises to 70%?

(a) SIFT, because its DoG scale-space detects keypoints robustly across a 5× range; Harris operates at a fixed scale and cannot track the same feature across a 3× zoom.

(b) ORB: L2 on 128 floats = 128 multiplies + 127 adds ≈ 255 ops. ORB Hamming = 1 XOR + 1 popcount ≈ 2 ops. ORB is ~100× faster, essential for 30 fps.

(c) More accepted matches (a), lower quality due to more false positives (b), more RANSAC iterations needed to filter additional outliers (c).

(d) 40%: P=(0.6)⁴=0.1296; N=⌈log(0.01)/log(0.8704)⌉≈32 iter. 70%: P=(0.3)⁴=0.0081; N=⌈log(0.01)/log(0.9919)⌉≈562 iter — 17× more iterations.

8
Synthesis · Code: Document Scanning Pipeline Hard

End-to-End Document Rectification System

Build a complete document scanning pipeline that rectifies a photographed document onto a flat, print-ready output:

  1. Detect ORB features on a source (photographed document) and a reference (blank A4 template at known size).
  2. Match with BFMatcher + Hamming distance + ratio test.
  3. Estimate $H$ with RANSAC (reprojection threshold = 4.0 px). Verify inlier ratio ≥ 40% before proceeding.
  4. Warp the source to the reference frame using warpPerspective.
  5. Apply a sharpening kernel to the warped image: $K = \begin{bmatrix}0&-1&0\\-1&5&-1\\0&-1&0\end{bmatrix}$.
  6. Print a summary table: keypoints detected (src / ref), raw matches, ratio-test matches, RANSAC inliers, inlier ratio.

Sharpening: kernel = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])cv2.filter2D(warped, -1, kernel). Guard: if mask is None or mask.ravel().sum() / len(good) < 0.4: raise ValueError("Insufficient inliers"). Summary table with f-strings. ORB nfeatures=2000 recommended for document corners.