Master the homography matrix, the Direct Linear Transform, and RANSAC — the geometric core of panoramic stitching and medical image registration.
Classical Computer Vision
A homography is a 3×3 projective transformation that maps every pixel in one image plane to a corresponding pixel in another. The Direct Linear Transform (DLT) solves for this matrix from four or more point correspondences using a simple linear system.
cv2.findHomography to align two images and warp one onto the other.
How does your phone seamlessly stitch five separate shots into a single wide panorama, perfectly aligned, with no visible seam? The answer is a single 3×3 matrix called the homography, which mathematically describes how any plane in 3D projects onto a different image plane.
A homography is like a GPS coordinate transform: just as GPS converts latitude/longitude from one datum to another so that the same physical location has a consistent address, H converts pixel coordinates from one camera view to another so that the same physical point lands at the right pixel in both images.
Homography computation pipeline: four point correspondences feed the DLT → SVD solver, producing H, which then warps the source image.
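In homogeneous coordinates, a point $\mathbf{x} = (x, y, 1)^T$ in the source image maps to $\mathbf{x}' = (x', y', 1)^T$ in the destination image via:
$$\mathbf{x}' \sim H\mathbf{x}, \qquad H = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix}$$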
The $\sim$ means equality up to scale. We normalize by setting $h_9 = 1$ (or $\|H\|_F = 1$), leaving 8 free unknowns.
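For each correspondence $(x_i, y_i) \leftrightarrow (x'_i, y'_i)$, expand $\mathbf{x}' \times H\mathbf{x} = \mathbf{0}$ to get two linear equations in the 9 entries of H, written $A_i \mathbf{h} = \mathbf{0}$ with $\mathbf{h} = (h_1, \dots, h_9)^T$ the entries of H in row-major order:
$$A_i = \begin{pmatrix} -x_i & -y_i & -1 & 0 & 0 & 0 & x'_i x_i & x'_i y_i & x'_i \\ 0 & 0 & 0 & -x_i & -y_i & -1 & y'_i x_i & y'_i y_i & y'_i \end{pmatrix}$$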
Stack all $A_i$ into matrix $A$ (size $2n \times 9$). The solution $\mathbf{h}$ is the right singular vector of $A$ corresponding to the smallest singular value, i.e., the last column of $V$ (equivalently, the last row of $V^T$) in the SVD $A = U\Sigma V^T$.
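The stacking and SVD step can be sketched in a few lines of NumPy. This is a minimal, unnormalized DLT, assuming the row-major ordering of $\mathbf{h}$ and at least four correspondences in general position; the function name `dlt_homography` and the final division by `H[2, 2]` (which assumes that entry is non-zero) are illustrative choices, not a reference implementation.

```python
import numpy as np

def dlt_homography(pts_src, pts_dst):
    """Estimate a 3x3 homography from n >= 4 correspondences (basic DLT sketch)."""
    pts_src = np.asarray(pts_src, dtype=float)
    pts_dst = np.asarray(pts_dst, dtype=float)
    rows = []
    for (x, y), (xp, yp) in zip(pts_src, pts_dst):
        # Two rows of A per correspondence.
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    A = np.array(rows)                  # shape (2n, 9)
    _, _, Vt = np.linalg.svd(A)
    h = Vt[-1]                          # last row of V^T = last column of V
    H = h.reshape(3, 3)
    return H / H[2, 2]                  # fix the scale so the last entry is 1

# Square translated by (2, 1): the recovered H should be a pure translation.
src = [(0, 0), (4, 0), (4, 4), (0, 4)]
dst = [(2, 1), (6, 1), (6, 5), (2, 5)]
H = dlt_homography(src, dst)
print(np.round(H, 6))
```

For real image coordinates, which are often in the hundreds of pixels, normalizing the points before building $A$ usually improves conditioning; that step is omitted here to keep the sketch short.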
Background. Verify DLT gives the correct homography for a known pure translation.
Problem: Source corners: (0,0), (4,0), (4,4), (0,4). Destination corners: (2,1), (6,1), (6,5), (2,5). What is H?
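If the destination point for (4,0) is (6,1), verify that $H \cdot (4,0,1)^T = (6,1,1)^T$, where $H$ is the pure translation by $(2, 1)$: $H = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}$.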
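The check in this example can be done numerically. The sketch below assumes the pure-translation matrix for a shift of (2, 1) and uses a hypothetical helper `apply_homography` that appends a 1, multiplies by H, and divides by the third component.

```python
import numpy as np

def apply_homography(H, pts):
    """Map (n, 2) points through H and convert back from homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])   # append w = 1
    mapped = homog @ H.T                                # each row is H @ x
    return mapped[:, :2] / mapped[:, 2:3]               # divide by the third component

# Translation by (2, 1), as in the worked example.
H_translation = np.array([[1.0, 0.0, 2.0],
                          [0.0, 1.0, 1.0],
                          [0.0, 0.0, 1.0]])
corners = [(0, 0), (4, 0), (4, 4), (0, 4)]
mapped_corners = apply_homography(H_translation, corners)
print(mapped_corners)
```

For a translation the third component stays 1, so the division is a no-op; it matters as soon as the bottom row of H differs from (0, 0, 1).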
Myth: "A homography works for any 3D scene — I can align two photos of a building taken from different positions."
Reality: A homography is only valid when all mapped points lie on a single plane, or when the camera undergoes pure rotation (no translation). For scenes with depth variation, a homography will produce parallax errors at depth discontinuities. RANSAC partially mitigates this by finding the dominant plane.
If you move the bottom-right destination point from (5,5) to (7,5) — stretching the right side of the image horizontally — which column of H do you predict will change most?
A homography encodes all geometric alignment between two planar views in exactly 8 numbers — and the DLT algorithm recovers those numbers from as few as 4 point correspondences by solving a linear system via SVD.
Google's satellite layer stitches millions of aerial images taken at different times, altitudes, and angles. Each pair of adjacent tiles is aligned using homography estimated from GPS-tagged ground control points via DLT — the same algorithm you just computed by hand. Errors in H manifest as visible seams or "jello" distortions at tile boundaries.
Q1 How many degrees of freedom does a homography H have, and why not 9?
Q2 You have 6 point correspondences. How many rows will the DLT matrix $A$ have, and which right singular vector of $A$ gives the solution?
Q3 Why does homography alignment fail when stitching photos of a 3D room (walls, furniture, depth variation) taken while walking sideways?
Robust Estimation
Real-world feature matching always contains mismatches. RANSAC (Random Sample Consensus) iteratively samples minimal subsets of correspondences, estimates a homography from each, and selects the model with the most geometric support — surviving even 50%+ outliers.
cv2.findHomography(..., cv2.RANSAC) and interpret the inlier mask it returns.
Imagine you're trying to find the best-fit line through 100 data points, but 60 of them are completely wrong due to sensor noise. Least-squares would be dragged far off by the noise. RANSAC instead repeatedly picks 2 random points, draws a line, checks how many of the remaining 98 points agree, and keeps the line with the most "votes". It turns a messy problem into a robust one.
RANSAC is like a democratic jury system: rather than accepting the opinion of every witness (including unreliable ones), you repeatedly form a small jury of randomly selected witnesses, reach a verdict, then count how many other witnesses agree. The verdict that earns the most agreement across all witnesses is declared the truth.
RANSAC algorithm: random sample → fit → count inliers → keep best model. Repeat $N$ times.
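To guarantee that at least one RANSAC sample is outlier-free with probability $p$, we need:
$$N \geq \frac{\log(1-p)}{\log\left(1-(1-\varepsilon)^s\right)}$$
where $\varepsilon$ is the outlier ratio and $s$ is the number of correspondences per sample ($s = 4$ for a homography).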
A correspondence $(x_i, y_i) \leftrightarrow (x'_i, y'_i)$ is called an inlier for candidate H if its symmetric reprojection error is below threshold $\tau$:
OpenCV's default $\tau = 3$ px. After the main loop, the best H is re-estimated using all inliers via least-squares DLT, giving a more accurate final result.
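The loop described above (sample, fit, count inliers, keep the best, refit) can be sketched with NumPy alone. This is an illustrative version, not OpenCV's implementation: the helper names are hypothetical, the DLT is unnormalized, and the symmetric error is assumed here to be the sum of the forward and backward transfer distances in pixels, which may differ from the exact definition used in the lecture.

```python
import numpy as np

def fit_h(src, dst):
    """Basic (unnormalized) DLT; returns H up to scale."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        rows.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, Vt = np.linalg.svd(np.array(rows, dtype=float))
    return Vt[-1].reshape(3, 3)

def project(H, pts):
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def symmetric_error(H, src, dst):
    """Forward plus backward transfer distance in pixels (an assumed definition)."""
    fwd = np.linalg.norm(project(H, src) - dst, axis=1)
    bwd = np.linalg.norm(project(np.linalg.inv(H), dst) - src, axis=1)
    return fwd + bwd

def ransac_homography(src, dst, tau=3.0, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)   # minimal sample
        H = fit_h(src[idx], dst[idx])
        if abs(np.linalg.det(H)) < 1e-12:                   # skip degenerate samples
            continue
        with np.errstate(all="ignore"):
            mask = symmetric_error(H, src, dst) < tau       # inlier test
        if mask.sum() > best_mask.sum():
            best_mask = mask
    if best_mask.sum() < 4:
        raise ValueError("RANSAC found fewer than 4 inliers")
    H = fit_h(src[best_mask], dst[best_mask])               # refit on all inliers
    return H / H[2, 2], best_mask

# Synthetic check: 30 noisy inliers under a known translation plus 20 random outliers.
rng = np.random.default_rng(1)
H_true = np.array([[1.0, 0.0, 30.0], [0.0, 1.0, 20.0], [0.0, 0.0, 1.0]])
src_in = rng.uniform(0, 500, (30, 2))
dst_in = project(H_true, src_in) + rng.normal(0, 0.2, (30, 2))
src_all = np.vstack([src_in, rng.uniform(0, 500, (20, 2))])
dst_all = np.vstack([dst_in, rng.uniform(0, 500, (20, 2))])

H_est, mask = ransac_homography(src_all, dst_all)
mean_err = np.linalg.norm(project(H_est, src_in) - project(H_true, src_in), axis=1).mean()
print("inliers found:", int(mask.sum()), "mean error (px):", round(float(mean_err), 3))
```

The returned boolean `mask` plays the same role as the inlier mask mentioned above: `True` marks correspondences that agree with the best model within `tau`.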
Background. ORB matcher reports 200 matches, roughly half of which are expected to be wrong.
Problem: $\varepsilon = 0.50$, $p = 0.99$, $s = 4$ (homography). How many RANSAC iterations are needed?
If the outlier ratio drops to $\varepsilon = 0.30$ (better matcher), approximately how many iterations are needed at $p = 0.99$, $s = 4$?
N is surprisingly small. Even at 50% outliers, only about 72 iterations are needed ($N = \log(0.01)/\log(1 - 0.5^4) \approx 71.4$, rounded up). This is why RANSAC is fast in practice: it does not need to try all $\binom{n}{4}$ possible 4-tuples (about 64.7 million for n = 200 matches). The logarithmic formula shows that N grows slowly even as $\varepsilon$ approaches 0.5.
However, as $\varepsilon \to 1$ (nearly all outliers), $N$ grows rapidly to infinity — RANSAC is not a magic bullet for extremely noisy data.
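To see how quickly $N$ grows with the outlier ratio, the iteration formula can be evaluated directly. The helper name `ransac_iterations` is hypothetical, and the result is rounded up to the next integer.

```python
import math

def ransac_iterations(eps, p=0.99, s=4):
    """Smallest integer N with 1 - (1 - (1 - eps)**s)**N >= p."""
    good = (1.0 - eps) ** s             # probability that one sample is all inliers
    if good >= 1.0:
        return 1                        # no outliers: a single sample suffices
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - good))

for eps in (0.30, 0.40, 0.50, 0.60):
    print(f"eps = {eps:.2f}: N = {ransac_iterations(eps)}")
```

The base of the logarithm does not matter as long as the numerator and denominator use the same one; mixing $\log_{10}$ and $\ln$ is a common source of wrong answers.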
If you increase the outlier ratio from 30% to 60%, do you predict the required iterations will double, quadruple, or grow more than 4×? Use the formula to reason before checking.
RANSAC turns an impossible least-squares problem (outlier-contaminated data) into a tractable one by repeatedly betting on small clean subsets, and the logarithmic iteration formula shows that about 72 samples suffice for 99% confidence even at 50% outliers.
Self-driving vehicles estimate their motion between frames by matching visual features (ORB/SIFT) between consecutive camera images. Road markings, distant trees, and moving cars all generate feature matches — but only the static background is geometrically consistent. RANSAC robustly finds the essential matrix or homography of the dominant static background, discarding matches on moving objects (pedestrians, other vehicles) as outliers, enabling safe ego-motion recovery even in busy traffic.
Q1 If $\varepsilon = 0.40$, $p = 0.99$, and $s = 4$, calculate the required number of RANSAC iterations N.
Q2 A RANSAC run returns a mask array with values [1,1,0,1,0,0,1,1,0,1]. What does a 0 in the mask mean, and how many inliers were found?
0 means that correspondence is an outlier: its reprojection error under the best H exceeded the threshold τ. There are 6 inliers (six 1s in the mask).
Q3 Why does RANSAC re-estimate H using all inliers at the end, rather than just returning the H from the best random sample?
Application
Homography and RANSAC are the engines behind two major application domains: consumer panoramic photography and clinical multimodal image registration. Both reduce to the same geometric pipeline — detect, match, estimate, warp, blend — but differ in projection model, transformation type, and accuracy requirements.
warpPerspective.
Your phone's panorama mode handles geometry that took photogrammetrists decades to formalize, and the same algorithm, with minor modifications, aligns a CT scan of a tumor taken last month with an MRI scan taken today, enabling a surgeon to see both in a single fused view.
Panoramic stitching is like assembling a jigsaw puzzle from photographs: each piece (image) overlaps with its neighbors, and the homography tells you exactly how to slide, rotate, and stretch each piece so the edges join seamlessly. Medical registration is the same puzzle, but the pieces were photographed by two completely different cameras (CT and MRI), so you must first learn how to translate one "color language" into the other.
Complete panorama stitching pipeline. Steps 1–3 were covered in Topics 1–2; Steps 4–6 translate H into the final composite.
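Warping fills every pixel of the output canvas by back-projecting through $H^{-1}$ (inverse mapping avoids holes):
$$I_{\text{dst}}(\mathbf{x}') = I_{\text{src}}\left(H^{-1}\mathbf{x}'\right)$$
where the source image is sampled with interpolation at the (generally non-integer) back-projected location.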
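A minimal NumPy sketch of inverse mapping with bilinear interpolation follows. It assumes a single-channel image of at least 2×2 pixels and fills destination pixels that fall outside the source with 0; the function name `warp_inverse` is hypothetical and this is not how `cv2.warpPerspective` is implemented internally.

```python
import numpy as np

def warp_inverse(src, H, out_shape):
    """Fill each destination pixel by sampling src at H^{-1} applied to that pixel."""
    h_out, w_out = out_shape
    h_src, w_src = src.shape
    ys, xs = np.mgrid[0:h_out, 0:w_out]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])   # 3 x N homogeneous
    p = np.linalg.inv(H) @ dst                                    # back-project
    with np.errstate(divide="ignore", invalid="ignore"):
        x, y = p[0] / p[2], p[1] / p[2]
    valid = (x >= 0) & (x <= w_src - 1) & (y >= 0) & (y <= h_src - 1)
    x = np.where(valid, x, 0.0)          # park invalid samples at a safe location
    y = np.where(valid, y, 0.0)
    x0 = np.clip(np.floor(x).astype(int), 0, w_src - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h_src - 2)
    ax, ay = x - x0, y - y0              # fractional parts = bilinear weights
    img = src.astype(float)
    val = ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
           + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])
    return np.where(valid, val, 0.0).reshape(h_out, w_out)

# Shift a small ramp image one pixel to the right, then by half a pixel.
ramp = np.arange(25, dtype=float).reshape(5, 5)
H_shift = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
H_half = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
shifted = warp_inverse(ramp, H_shift, (5, 5))
half = warp_inverse(ramp, H_half, (5, 5))
print(shifted)
```

Because the loop runs over destination pixels, every output pixel receives exactly one value; a forward loop over source pixels would leave gaps wherever the mapping stretches the image.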
For wide panoramas (>90°), planar projection causes severe edge stretching. Cylindrical projection reduces this by mapping each image onto a virtual cylinder of radius $f$ before stitching:
where $f$ is the focal length in pixels and $(c_x, c_y)$ is the principal point (image centre).
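The sketch below evaluates one common form of the cylindrical mapping. The exact formula is an assumption here: it uses $x_{cyl} = f\,\arctan\!\big((x - c_x)/f\big) + c_x$ and $y_{cyl} = f\,(y - c_y)/\sqrt{(x - c_x)^2 + f^2} + c_y$, re-centred on the principal point so that the image centre maps to itself. Other texts omit the offset or scale differently, so treat the numbers as illustrative. The function name `to_cylindrical` is hypothetical.

```python
import math

def to_cylindrical(x, y, f, cx, cy):
    """Map a pixel to cylindrical image coordinates (assumed, re-centred convention)."""
    theta = math.atan((x - cx) / f)                  # angle around the cylinder axis
    height = (y - cy) / math.hypot(x - cx, f)        # normalized height on the cylinder
    return f * theta + cx, f * height + cy

# Camera with f = 800 px and principal point (800, 450).
x_cyl, y_cyl = to_cylindrical(1000, 450, f=800, cx=800, cy=450)
centre = to_cylindrical(800, 450, f=800, cx=800, cy=450)
print(round(x_cyl, 2), round(y_cyl, 2), centre)
```

Under this convention a pixel 200 px right of centre moves slightly inward, and the compression grows towards the image edges, which is what counteracts the edge stretching of planar projection.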
Aligning images from different modalities (CT, MRI, PET) requires choosing the correct transformation model based on the anatomy and imaging protocol:
| Transform | DOF | Preserves | Min Pairs |
|---|---|---|---|
| Translation | 2 | Shape, size, angles | 1 |
| Rigid (Euclidean) | 3 | Shape, size | 2 |
| Similarity | 4 | Shape, angles | 2 |
| Affine | 6 | Parallel lines | 3 |
| Homography | 8 | Straight lines | 4 |
| Deformable | ∞ | Nothing global | Dense field |
The similarity metric also differs: Sum of Squared Differences (SSD) for same-modality, Mutual Information (MI) for cross-modality (e.g., CT↔MRI).
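The difference between the two metrics shows up even on a toy example. The sketch below uses random arrays as stand-ins (not real CT or MRI data): `img_b` is an intensity-inverted copy of `img_a`, mimicking two modalities that show the same structure with different contrast. The function names and the choice of 32 histogram bins are assumptions.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared intensity differences."""
    return float(np.sum((a.astype(float) - b.astype(float)) ** 2))

def mutual_information(a, b, bins=32):
    """Mutual information (in nats) estimated from the joint intensity histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                    # joint probability
    px = pxy.sum(axis=1, keepdims=True)          # marginal of a
    py = pxy.sum(axis=0, keepdims=True)          # marginal of b
    nz = pxy > 0                                 # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, (64, 64))
img_b = 255 - img_a                              # same structure, inverted contrast
img_b_shifted = np.roll(img_b, 7, axis=1)        # misaligned copy

mi_aligned = mutual_information(img_a, img_b)
mi_shifted = mutual_information(img_a, img_b_shifted)
print("SSD aligned:", ssd(img_a, img_b))
print("MI aligned:", round(mi_aligned, 3), "MI shifted:", round(mi_shifted, 3))
```

SSD is large even when the images are perfectly aligned because the intensities differ, while MI peaks at the correct alignment because it only measures how predictable one intensity is from the other.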
Background. Camera: $f = 800$ px, image size $1600 \times 900$, so $c_x = 800$, $c_y = 450$.
Problem: Map pixel $(x, y) = (1000, 450)$ to cylindrical coordinates.
For the same camera, what is $x_{cyl}$ for a pixel at $(x, y) = (800, 450)$ (the image centre)?
Myth: "I can stitch any two overlapping photos into a panorama using planar homography."
Reality: Planar homography works well only for small fields of view or when the camera strictly rotates without translation. Camera translation introduces parallax, and planar projection stretches the edges of wide-angle panoramas. Cylindrical or spherical projection is therefore typically used for wide-FOV stitching, which is why phone panorama apps commonly project onto a cylinder internally, even though the final output looks "flat."
If you increase the overlap between two images from 20% to 40%, do you predict the RANSAC homography estimate will be more or less accurate? Why?
Think about how many inlier matches are available before checking your prediction.
Panoramic stitching and medical image registration are both solved by the same geometric pipeline — detect, match, RANSAC, warp — but the correct transformation model must match the scene: homography for flat planes, cylindrical for wide fields, and deformable for soft tissue.
Radiation oncologists plan cancer treatment by fusing a CT scan (shows bone and tumour density) with an MRI scan (shows soft-tissue contrast) of the same patient. A rigid or affine registration aligns the two volumes so that anatomical landmarks coincide — then the oncologist sees bone detail from CT and tissue detail from MRI overlaid in one view, enabling precise tumour delineation and treatment field planning. Mutual Information maximization drives the registration metric, since CT and MRI pixel intensities are correlated but not identical.
Q1 Why does warpPerspective use inverse mapping (back-projecting from destination to source) rather than forward mapping?
Q2 A radiologist wants to align a pre-surgery MRI brain scan with a post-surgery MRI of the same patient. The brain volume may have shifted and rotated slightly due to repositioning in the scanner. Which transformation model is most appropriate, and why?
Q3 Why is Mutual Information (MI) preferred over Sum of Squared Differences (SSD) as the similarity metric for CT–MRI registration?
Interactive Lab
Move the four destination control points to reshape the right panel in real time. Watch how each change is encoded directly in the homography matrix H and reflected in the live warped grid below.
Week 4 Summary
Three topics, one geometric toolkit: homography encodes the alignment, DLT solves for it, and RANSAC makes both robust in the real world.
A 3×3 projective transform with 8 DOF that maps every pixel in one image plane onto another. The same matrix handles panoramas, AR markers, and document scanning.
DLT stacks each correspondence into rows of matrix A, then extracts H as the last right singular vector of A via SVD. Just 4 point pairs solve all 8 unknowns.
Randomly sample s points → fit H → count inliers (reprojection < τ) → repeat N times. Survives 50%+ outliers; N from the log formula guarantees 99% confidence.
Detect → match → RANSAC → warpPerspective → blend. Medical registration extends this to deformable transforms and cross-modality metrics like Mutual Information.
Neural Networks for Vision: from the perceptron to the full backpropagation algorithm, and why gradient descent on a loss surface finds useful image representations.
Further Reading
Primary textbook, geometry reference, OpenCV documentation, and lecture video — everything needed to master image alignment from first principles.
Image Alignment and Stitching. Covers DLT derivation, normalized DLT, RANSAC, cylindrical and spherical projection, and multi-image blending in full mathematical detail.
→ Primary reference for this week
Multiple View Geometry in Computer Vision. The definitive treatment of homographies, the DLT algorithm, normalized DLT, and the algebraic distance minimized by SVD.
→ For rigorous geometric proofs
Official documentation for cv2.findHomography: method flags (LMEDS, RANSAC, RHO), confidence, threshold, and the returned inlier mask. Includes worked code examples.
Clear derivation of the RANSAC iteration count formula with visual examples of inlier/outlier splitting. Highly recommended for exam preparation on the N formula.
→ youtube.com/@CyrillStachniss
Practice
8 exercises covering homography computation, DLT, RANSAC iteration counting, image warping, and a full panorama pipeline — mirroring the exam calculation style exactly.
Answer the following about a 2D homography H:
Part 1: H has 8 DOF. The 3×3 matrix has 9 entries, but H is only defined up to scale — multiplying all entries by any non-zero constant gives the same projective mapping. Fixing one entry (e.g., $h_{33} = 1$) or normalising $\|H\|_F = 1$ removes this ambiguity.
Part 2: Append a 1: $(3, 5, 1)^T$.
Part 3: Divide by the third component: $(9/2,\; 10/2) = (4.5,\; 5.0)$.
Write a Python function dlt_homography(pts_src, pts_dst) that computes H without using cv2.findHomography. Then verify it against OpenCV's result on the same point pairs.
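U, S, Vt = np.linalg.svd(A).
cv2.findHomography(pts_src, pts_dst)[0]. Max absolute difference should be < 1e-6.
For correspondence $(x, y) \to (x', y')$, the two rows of $A_i$ are:
$$\begin{pmatrix} -x & -y & -1 & 0 & 0 & 0 & x'x & x'y & x' \\ 0 & 0 & 0 & -x & -y & -1 & y'x & y'y & y' \end{pmatrix}$$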
The SVD solution is h = Vt[-1] (the right singular vector for the smallest singular value). Reshape to 3×3, divide by H[2,2] to normalise.
Use the formula $N = \log(1-p) / \log(1-(1-\varepsilon)^s)$ to answer:
Part 1: $(1-0.35)^4 = 0.65^4 = 0.1785$. Using natural logarithms in both numerator and denominator (any base works as long as it is the same in both): $N = \ln(0.01)/\ln(1-0.1785) = \ln(0.01)/\ln(0.8215) \approx -4.605 / (-0.1966) \approx \mathbf{23.4}$. Use $N = 24$.
Part 2: $(0.40)^4 = 0.0256$. $N = \ln(0.01)/\ln(0.9744) \approx -4.605/(-0.02593) \approx \mathbf{177.6}$. Use $N = 178$. Going from 35% to 60% outliers increases N by about 7.6×.
Part 3: At $p = 0.99$: $N = \ln(0.01)/\ln(0.9375) \approx -4.605/(-0.06454) \approx 71.4$, so $N_{99} = 72$. At $p = 0.999$: $N = \ln(0.001)/\ln(0.9375) \approx -6.908/(-0.06454) \approx 107.0$, so $N_{999} = 108$. Factor $= \mathbf{1.5 \times}$.
Generate synthetic point correspondences with 40% random outliers and compare the homography quality with and without RANSAC.
cv2.findHomography(pts_src, pts_dst) (no RANSAC).
cv2.findHomography(..., cv2.RANSAC, 5.0).
True H = np.array([[1,0,30],[0,1,20],[0,0,1]], dtype=float). Inliers: apply H to random source points, add np.random.randn noise. Outliers: use np.random.uniform(0, 500, (8, 2)) for both src and dst independently.
Reprojection error: for each inlier pair apply the estimated H, convert from homogeneous, then compute Euclidean distance to true dst. Plain DLT typically yields errors of 10–50 px; RANSAC typically < 2 px.
For each step of the panorama pipeline below, explain why it is necessary and what goes wrong if it is skipped:
warpPerspective rather than forward mapping.
1. Feature detection: Raw pixel comparison is sensitive to brightness changes and not repeatable. Keypoints with descriptors give compact, distinctive, viewpoint-tolerant representations.
2. Ratio test: Rejects ambiguous matches where the nearest and second-nearest descriptors have similar distances — discarding matches where one descriptor could plausibly match two different points reduces the outlier ratio before RANSAC.
3. RANSAC: Even after the ratio test, 20–40% of matches are wrong. Plain DLT on noisy matches yields a wildly inaccurate H. RANSAC isolates the geometrically consistent inliers.
4. Inverse mapping: Forward mapping leaves holes (unsampled destination pixels). Inverse mapping guarantees every output pixel is filled by looking up the corresponding source location.
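The ratio test from step 2 can be sketched with brute-force matching in NumPy. This assumes float descriptors compared by Euclidean distance (binary descriptors such as ORB would use Hamming distance instead) and a ratio threshold of 0.75, which is a commonly used value rather than one given above. The function name `ratio_test_matches` is hypothetical.

```python
import numpy as np

def ratio_test_matches(desc1, desc2, ratio=0.75):
    """Keep a match only if the best distance is clearly smaller than the second best."""
    # Pairwise Euclidean distances, shape (len(desc1), len(desc2)).
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    best, second = order[:, 0], order[:, 1]
    rows = np.arange(len(desc1))
    keep = d[rows, best] < ratio * d[rows, second]
    return [(int(i), int(best[i])) for i in rows[keep]]

# Four well-separated reference descriptors.
desc2 = np.eye(4) * 10.0
desc1 = np.array([
    desc2[2] + 0.1,                 # distinctive: very close to descriptor 2 only
    (desc2[0] + desc2[1]) / 2.0,    # ambiguous: equally far from descriptors 0 and 1
])
matches = ratio_test_matches(desc1, desc2)
print(matches)
```

The ambiguous descriptor is dropped even though it has a nearest neighbour, which is exactly the kind of match that would otherwise reach RANSAC as a likely outlier.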
Create a synthetic checkerboard image, define a perspective homography manually, and warp it using both cv2.warpPerspective and a hand-rolled inverse-mapping loop.
cv2.warpPerspective(checker, H, (400,400)). Save as warped_cv.png.
For the manual inverse warp: H_inv = np.linalg.inv(H). For each pixel (xd, yd): p = H_inv @ [xd, yd, 1]; xs, ys = p[0]/p[2], p[1]/p[2]. Bilinear interpolation: take floor and ceil of (xs, ys), form 2×2 neighbourhood, weight by fractional part.
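Checkerboard: board = ((np.indices((400,400)) // 40).sum(axis=0) % 2 * 255).astype(np.uint8). Apply // 40 to each index before summing; summing the raw indices first gives diagonal bands instead of squares.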
For each clinical/industrial scenario, identify the most appropriate transformation model (translation, rigid, similarity, affine, homography, or deformable), justify your choice, and state how many point correspondences are required to uniquely determine it:
1. Chest X-ray (same patient, repositioned): Rigid (3 DOF in 2D: two translations + one rotation). The rib cage doesn't deform; repositioning is a rigid body motion. 2 point pairs minimum.
2. Flat agricultural field, same altitude: Homography (8 DOF). The field is approximately planar and the camera translates — pure planar homography is valid. 4 pairs minimum.
3. Liver MRI ↔ intra-op ultrasound: Deformable / non-rigid. The liver deforms significantly due to breathing, gravity changes, and tissue displacement during surgery. A dense displacement field is required; modality difference requires Mutual Information metric.
4. Scanner rotation + uniform zoom: Similarity (4 DOF: translation × 2, rotation, isotropic scale). Rotation and uniform scaling, but no shear or perspective. 2 point pairs minimum.
Implement the full five-step pipeline end-to-end and evaluate stitching quality using a seam visibility metric.
warpPerspective with a wide output size.
Alpha blend in overlap zone: define x_start (leftmost column where image 2 appears) and overlap width ow. For column x in overlap: alpha = (x - x_start) / ow. Output pixel = (1-alpha)*img1[y,x] + alpha*warped2[y,x].
Seam MAD: extract the 20-column-wide strip from both img1 and warped2 at the blend boundary, compute np.mean(np.abs(strip1.astype(float) - strip2.astype(float))) (cast to float first, since uint8 subtraction wraps around). If > 5, RANSAC threshold or feature matching quality needs improvement.
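The blend and the seam metric can be sketched as follows, assuming `img1` and `warped2` are single-channel arrays already placed on the same output canvas. The function names are hypothetical, and the strip is taken to start at `x_start`, which is one possible reading of "at the blend boundary".

```python
import numpy as np

def alpha_blend(img1, warped2, x_start, ow):
    """Linear cross-fade from img1 to warped2 over columns [x_start, x_start + ow)."""
    out = warped2.astype(float).copy()
    out[:, :x_start] = img1[:, :x_start]                     # left of overlap: img1 only
    alpha = (np.arange(x_start, x_start + ow) - x_start) / ow
    out[:, x_start:x_start + ow] = ((1 - alpha) * img1[:, x_start:x_start + ow]
                                    + alpha * warped2[:, x_start:x_start + ow])
    return out

def seam_mad(img1, warped2, x_start, width=20):
    """Mean absolute difference between the two images in a strip at the seam."""
    s1 = img1[:, x_start:x_start + width].astype(float)     # cast avoids uint8 wrap-around
    s2 = warped2[:, x_start:x_start + width].astype(float)
    return float(np.mean(np.abs(s1 - s2)))

# Two constant images with a 100-level brightness mismatch.
img1 = np.full((10, 60), 100, dtype=np.uint8)
warped2 = np.full((10, 60), 200, dtype=np.uint8)
pano = alpha_blend(img1, warped2, x_start=20, ow=20)
mad = seam_mad(img1, warped2, x_start=20)
print(pano[0, 10], pano[0, 30], pano[0, 50], mad)
```

For colour images the `alpha` vector would need an extra trailing axis so that it broadcasts across the channel dimension.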