Teach machines to detect edges, suppress noise, and inspect surfaces — one convolution at a time. The mathematical engine behind every modern vision pipeline.
Intuition
Every pixel becomes a weighted average of its neighbourhood. This single idea — a sliding kernel — is the mathematical foundation of blur, sharpening, edge detection, and the entire family of Convolutional Neural Networks.
cv2.filter2D with appropriate border handling.
"Every Instagram filter, every blurred background in portrait mode, every object detector: they all start with the same three-line operation. Position a small grid of numbers over an image, multiply them element-wise, sum the products, write the result into a new pixel. Move one step. Repeat."
Convolution is like rubbing a stamp across wet ink: the stamp (kernel) has a fixed pattern of raised dots, and wherever it presses, those dots leave weighted imprints on the paper (image). A heavy dot in the centre means "this pixel matters most"; lighter dots at the edges mean "nearby pixels have a smaller say." Just as different stamp designs create different prints, different kernels create blur, sharpness, or edges.
Kernel convolution anatomy: each output pixel is the weighted sum of a 3×3 neighbourhood. The kernel slides one step at a time across the entire image.
Given image $I$ and kernel $K$ of size $(2m{+}1)\times(2n{+}1)$, the output at pixel $(r,c)$ is:
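Written out with the usual index convention (an assumption here: offsets $i \in [-m, m]$ and $j \in [-n, n]$, with the kernel centre at $K(0,0)$), the sum reads:

```latex
\[
(I * K)(r, c) \;=\; \sum_{i=-m}^{m} \sum_{j=-n}^{n} K(i, j)\, I(r - i,\; c - j)
\]
```

Replacing $I(r-i, c-j)$ with $I(r+i, c+j)$ gives cross-correlation, which is the same sum without the kernel flip.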
The minus signs flip the kernel before the sum — this is the formal definition of convolution. When $K$ is symmetric (e.g., Gaussian, box filter), flipping has no effect and convolution equals cross-correlation.
Background. A mean (box) filter replaces each pixel with the average of its neighbourhood. The 3×3 kernel has every weight equal to $\frac{1}{9}$. Because this kernel is symmetric, convolution and cross-correlation give the same result.
Problem: Neighbourhood values — top row: [80, 100, 120], middle: [90, 150, 110], bottom: [70, 80, 90]. Apply a 3×3 mean filter and find the output at the centre pixel.
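The arithmetic for this problem can be checked with a few lines of NumPy. This is a minimal sketch of a single output pixel, not a full filter; the variable names are illustrative.

```python
import numpy as np

# 3x3 neighbourhood from the problem statement (rows: top, middle, bottom).
patch = np.array([[80, 100, 120],
                  [90, 150, 110],
                  [70,  80,  90]], dtype=np.float64)

# Box kernel: every weight is 1/9, so the weights sum to 1.
kernel = np.full((3, 3), 1.0 / 9.0)

# One output pixel of the mean filter: multiply element-wise, then sum.
total = float(patch.sum())
centre_out = float(np.sum(patch * kernel))

print(total)                 # 890.0
print(round(centre_out, 2))  # 98.89
```

Changing any entry of `patch` and rerunning shows how strongly a single bright pixel pulls the average.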
If the centre pixel were 200 instead of 150, the sum becomes 940. What would the mean filter output be?
Applying a kernel near the image border requires a decision about pixels outside the image boundary:
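Zero padding: treat pixels outside the image as a constant, 0 by default (BORDER_CONSTANT).
Reflection: mirror the image content across the border (BORDER_REFLECT).
Cross-correlation vs convolution: cv2.filter2D performs cross-correlation (no kernel flip). PyTorch nn.Conv2d also implements cross-correlation and calls it convolution. For symmetric kernels the two operations give identical results. For asymmetric kernels (Sobel, emboss), flip the kernel manually (for example with cv2.flip(kernel, -1)) before passing it if true convolution is needed.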
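Returning to the border strategies above, their effect is easiest to see on a single row of pixels. The sketch below uses `np.pad` rather than OpenCV; the mapping in the comments (NumPy `constant` to BORDER_CONSTANT, `symmetric` to BORDER_REFLECT, `reflect` to BORDER_REFLECT_101) is stated as the assumed correspondence between the two libraries.

```python
import numpy as np

row = np.array([10, 20, 30, 40])

# Pad two samples on each side with three common strategies.
constant = np.pad(row, 2, mode="constant", constant_values=0)  # like BORDER_CONSTANT
symmetric = np.pad(row, 2, mode="symmetric")  # like BORDER_REFLECT (edge sample repeated)
reflect = np.pad(row, 2, mode="reflect")      # like BORDER_REFLECT_101 (edge sample not repeated)

print(constant.tolist())   # [0, 0, 10, 20, 30, 40, 0, 0]
print(symmetric.tolist())  # [20, 10, 10, 20, 30, 40, 40, 30]
print(reflect.tolist())    # [30, 20, 10, 20, 30, 40, 30, 20]
```

The zeros introduced by constant padding are what darken the border of a blurred image; the two mirrored variants keep the padded values close to the real edge pixels.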
Problem: Sobel X kernel $K = [[-1,0,1],[-2,0,2],[-1,0,1]]$ is asymmetric. Show that correlation and convolution give different results on the same patch.
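One way to see the difference is to evaluate both operations at a single pixel. The patch below is a hypothetical example chosen for illustration; any patch with a horizontal intensity change would do.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Hypothetical 3x3 patch, brighter on the right than on the left.
patch = np.array([[80, 100, 120],
                  [90, 150, 110],
                  [70,  80,  90]])

# Cross-correlation: use the kernel exactly as written.
corr = int(np.sum(patch * sobel_x))

# Convolution: rotate the kernel by 180 degrees before multiplying.
flipped = sobel_x[::-1, ::-1]
conv = int(np.sum(patch * flipped))

print(corr, conv)  # 100 -100
```

Rotating Sobel X by 180 degrees negates it, so the two results differ only in sign and the gradient magnitude is unchanged.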
Is the Laplacian kernel $[[0,1,0],[1,-4,1],[0,1,0]]$ symmetric or asymmetric?
Myth: "cv2.filter2D computes the mathematical convolution, so I don't need to think about kernel orientation."
Reality: cv2.filter2D performs cross-correlation (no flip). For symmetric kernels (Gaussian, box, Laplacian) this is identical to convolution. For directional kernels like Sobel or Prewitt, the result has reversed gradient direction. Always check kernel symmetry before using filter2D for gradient-based operations.
Before adjusting the sliders: a 3×3 mean kernel and a 3×3 sharpen kernel are about to be applied to the same image. Which kernel do you predict will make the centre pixel brighter relative to its neighbours?
Hint: look at the centre weight of the sharpen kernel — it's much larger than 1/9.
Python · NumPy + OpenCV — Manual convolution and filter2D
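A minimal NumPy-only sketch of the sliding-window operation follows. The helper names `correlate2d` and `convolve2d` are defined here for illustration and are not OpenCV functions; in OpenCV, `cv2.filter2D` plays the role of `correlate2d`, and `pad_mode="reflect"` is assumed to correspond to its default border type (BORDER_REFLECT_101).

```python
import numpy as np

def correlate2d(image, kernel, pad_mode="reflect"):
    """Slide `kernel` over `image` without flipping; return a same-size float array."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw)), mode=pad_mode)
    out = np.zeros(image.shape, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + kh, c:c + kw]   # neighbourhood centred on (r, c)
            out[r, c] = np.sum(window * kernel)   # multiply element-wise, then sum
    return out

def convolve2d(image, kernel, pad_mode="reflect"):
    """True convolution: rotate the kernel by 180 degrees, then correlate."""
    return correlate2d(image, kernel[::-1, ::-1], pad_mode)

# Small synthetic image whose values increase by 1 per column and 5 per row.
image = np.arange(25, dtype=np.float64).reshape(5, 5)
box = np.full((3, 3), 1.0 / 9.0)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)

blurred = correlate2d(image, box)
gx_corr = correlate2d(image, sobel_x)
gx_conv = convolve2d(image, sobel_x)

print(round(float(blurred[2, 2]), 6))                # 12.0
print(float(gx_corr[2, 2]), float(gx_conv[2, 2]))    # 8.0 -8.0
```

The double loop is written for readability, not speed; library routines vectorise the same arithmetic.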
Every spatial filter — from Instagram blur to ResNet feature extraction — is a single operation: place a kernel, multiply element-wise, sum, write output, slide one step, repeat.
Every convolutional layer in ResNet-50, EfficientNet, or YOLO executes the same 2D cross-correlation you just computed by hand; the only difference is that the kernel weights are learned from millions of training images rather than hand-coded. ResNet-50 has roughly 50 convolutional layers, each with between 64 and 2,048 kernels, so understanding this single operation is the foundation for reading any CNN architecture diagram.
Q1 A 5×5 kernel is applied to a 256×256 image. How many multiplications does a single direct spatial convolution require (total, all output pixels)?
Q2 You use cv2.filter2D with the Sobel X kernel $[[-1,0,1],[-2,0,2],[-1,0,1]]$. A colleague says the output is "wrong" because filter2D uses cross-correlation, not convolution. Is your colleague correct?
Q3 Which border padding strategy should you choose to avoid dark artefacts at the image edges when blurring a face photo?
Mechanics
Gaussian smoothing is the standard linear denoiser and the usual pre-processing step before edge detection. Sobel quantifies gradient magnitude; Canny chains four steps into the industry-standard edge pipeline.
"You want to detect the edge of a tumour in an MRI scan — but raw sensor data is noisy. Detect edges too early and you trace noise; blur too much and you erase the tumour boundary. Gaussian smoothing is the mathematical sweet spot: it suppresses noise while keeping the spatial information you need. Every edge detector in existence starts with it."
Gaussian blur is like squinting your eyes: small details (noise, texture) disappear but large structures (edges, shapes) remain visible. Canny edge detection is then like tracing only the strongest outlines you see while squinting — the four steps decide which lines are "real" edges versus accidental streaks.
Canny pipeline: Gaussian smoothing → Sobel gradient → Non-maximum suppression → Hysteresis thresholding. Each step narrows the response to thin, continuous, well-localised edges.
The Gaussian function in 2D is a bell-shaped surface centred at the origin. Normalised to unit sum, it forms a valid smoothing kernel:
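In the form used by the exercises later in this lesson, the continuous 2D Gaussian and the normalised kernel built from it can be written as follows (the kernel symbol $K$ and window radius $m$ are reused from the convolution definition as an assumption):

```latex
\[
G(x, y, \sigma) \;=\; \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right),
\qquad
K(i, j) \;=\; \frac{G(i, j, \sigma)}{\sum_{u=-m}^{m}\sum_{v=-m}^{m} G(u, v, \sigma)}
\]
```

Dividing by the discrete sum matters because sampling and truncating $G$ leaves weights that no longer add up to exactly 1.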
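$\sigma$ controls blur strength. The kernel is typically truncated at $\pm 3\sigma$, giving a $(2\lceil 3\sigma\rceil{+}1)\times(2\lceil 3\sigma\rceil{+}1)$ window, which equals $(6\sigma{+}1)\times(6\sigma{+}1)$ when $3\sigma$ is an integer.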
Problem: Compute $G(0,0,1)$, $G(1,0,1)$, $G(1,1,1)$; then normalise the 3×3 kernel so it sums to 1.
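The three samples and the normalisation step can be verified with the standard library alone. This is a sketch of the calculation for a 3×3 window; the function name `gaussian` is illustrative.

```python
import math

def gaussian(x, y, sigma):
    """Continuous 2D Gaussian sampled at integer offset (x, y)."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

sigma = 1.0
offsets = (-1, 0, 1)
raw = [[gaussian(x, y, sigma) for x in offsets] for y in offsets]

# Normalise so that the nine weights sum to 1.
total = sum(sum(row) for row in raw)
kernel = [[w / total for w in row] for row in raw]

# Centre, edge-adjacent, and corner samples before normalisation.
print(round(raw[1][1], 4), round(raw[1][2], 4), round(raw[2][2], 4))
# 0.1592 0.0965 0.0585
print(round(kernel[1][1], 3))  # 0.204
```

Rerunning with a larger `sigma` shows the centre weight shrinking as the kernel flattens.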
For σ = 1, the centre weight is ~0.2042. Does a larger σ make the centre weight larger or smaller?
The image gradient at $(r,c)$ is approximated by the Sobel kernels:
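Sobel X is the kernel quoted in the earlier correlation-versus-convolution problem. Sobel Y is assumed here to be its transpose, which is the usual convention when row indices increase downward:

```latex
\[
K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix},
\qquad
K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}
\]
```

Applying $K_x$ and $K_y$ to the neighbourhood of $(r,c)$ without flipping gives the two gradient components $G_x$ and $G_y$.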
Magnitude and direction then follow:
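In the standard formulation (assumed here, using the $|\nabla I|$ notation that appears in the Canny discussion):

```latex
\[
|\nabla I| \;=\; \sqrt{G_x^{2} + G_y^{2}},
\qquad
\theta \;=\; \operatorname{atan2}(G_y,\, G_x)
\]
```

The direction $\theta$ points across the edge, from dark to bright, and is what non-maximum suppression uses to decide which neighbours to compare.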
Problem: 3×3 patch — left column all 50, right column all 200, middle column all 125. Compute $G_x$ at the centre using Sobel X.
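A quick numerical check of this problem, as a sketch. Sobel Y is taken to be the transpose of Sobel X, which is an assumption about the convention in use.

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T  # assumed Sobel Y: transpose of Sobel X

# Patch from the problem: columns are 50, 125, 200 from left to right.
patch = np.array([[50, 125, 200],
                  [50, 125, 200],
                  [50, 125, 200]])

gx = int(np.sum(patch * sobel_x))  # kernel applied as written (cross-correlation)
gy = int(np.sum(patch * sobel_y))
magnitude = float(np.hypot(gx, gy))
theta_deg = float(np.degrees(np.arctan2(gy, gx)))

print(gx, gy)                # 600 0
print(magnitude, theta_deg)  # 600.0 0.0
```

The middle column never contributes to $G_x$ because its weights are all zero; only the difference between the right and left columns matters.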
If all three columns were uniform (e.g., all 100), what would $G_x$ be?
Why Canny is better than raw Sobel thresholding: A single global threshold on $|\nabla I|$ produces thick blobs where edges are strong and nothing where they are weak. Non-maximum suppression (step 3) reduces each ridge to a 1-pixel-thin line; hysteresis (step 4) then extends strong edges into weaker regions, preserving long continuous boundaries. The result has thin, connected, well-localised edges that simple Sobel thresholding cannot match.
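The hysteresis step described above can be sketched in NumPy. This is an illustrative implementation of step 4 only, under the assumption of 8-connectivity; it does not include non-maximum suppression and is not the code OpenCV runs. The function name `hysteresis` and the sample values are hypothetical.

```python
import numpy as np

def hysteresis(magnitude, low, high):
    """Keep pixels >= high, plus pixels >= low that connect to them (8-neighbourhood)."""
    strong = magnitude >= high
    candidate = magnitude >= low
    edges = strong.copy()
    h, w = edges.shape
    while True:
        padded = np.pad(edges, 1, mode="constant", constant_values=False)
        # True wherever the pixel itself or any of its 8 neighbours is already an edge.
        grown = np.zeros_like(edges)
        for dr in (0, 1, 2):
            for dc in (0, 1, 2):
                grown |= padded[dr:dr + h, dc:dc + w]
        updated = grown & candidate
        if np.array_equal(updated, edges):
            return edges          # no new pixels were added: done
        edges = updated

# One strong pixel (90), two weak pixels touching it, one isolated weak pixel.
mag = np.array([[ 0,  0,  0, 0, 0,  0],
                [90, 60, 55, 0, 0, 60],
                [ 0,  0,  0, 0, 0,  0]])

result = hysteresis(mag, low=50, high=80)
print(result[1].astype(int).tolist())  # [1, 1, 1, 0, 0, 0]
```

The weak pixels next to the strong one survive, while the isolated weak pixel on the right is discarded, which is how hysteresis keeps boundaries continuous without admitting stray noise.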
Move the σ slider to a large value (e.g., σ = 8). Will the Gaussian-blurred image show stronger or weaker Sobel edges compared to σ = 1? Why?
Hint: think about what high-σ blur does to sharp transitions in intensity.
Python · OpenCV — Gaussian blur, Sobel, and Canny
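The smoothing and gradient stages can be sketched with NumPy alone, which keeps the example runnable without OpenCV. The helper names below are hypothetical; in OpenCV the corresponding calls would be `cv2.GaussianBlur` and `cv2.Sobel`, and the full four-step pipeline is `cv2.Canny`. Non-maximum suppression and hysteresis are not included here.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalised 1D Gaussian truncated at about 3 sigma."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def filter_rows(image, kernel_1d):
    """Correlate every row with a 1D kernel, replicating edge pixels at the border."""
    radius = len(kernel_1d) // 2
    padded = np.pad(image, ((0, 0), (radius, radius)), mode="edge")
    out = np.zeros(image.shape, dtype=np.float64)
    for offset, weight in enumerate(kernel_1d):
        out += weight * padded[:, offset:offset + image.shape[1]]
    return out

def gaussian_blur(image, sigma):
    """Separable blur: filter along rows, then along columns."""
    k = gaussian_kernel_1d(sigma)
    return filter_rows(filter_rows(image, k).T, k).T

def sobel_x(image):
    """Horizontal gradient: smooth vertically with [1, 2, 1], then difference with [-1, 0, 1]."""
    smoothed = filter_rows(image.T, np.array([1.0, 2.0, 1.0])).T
    return filter_rows(smoothed, np.array([-1.0, 0.0, 1.0]))

# Synthetic test image: dark left half, bright right half (a vertical step edge).
image = np.zeros((32, 32), dtype=np.float64)
image[:, 16:] = 200.0

peak_raw = float(np.abs(sobel_x(image)).max())
peak_sigma1 = float(np.abs(sobel_x(gaussian_blur(image, 1.0))).max())
peak_sigma3 = float(np.abs(sobel_x(gaussian_blur(image, 3.0))).max())

print(peak_raw)                    # 800.0
print(peak_sigma1 > peak_sigma3)   # True: heavier blur flattens the step
```

Because the Gaussian is separable, two 1D passes give the same result as one 2D kernel at a fraction of the cost.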
Gaussian blur suppresses noise before edge detection; Sobel measures gradient magnitude and direction; Canny chains smoothing → gradient → non-maximum suppression → hysteresis into the industry-standard edge pipeline.
In CT and MRI analysis, Sobel-based gradient maps provide the external energy field for active contour (snake) algorithms that delineate organ boundaries — the contour is attracted to high-gradient regions marking tissue interfaces. Pre-smoothing with $\sigma \approx 1.5$ suppresses MRI noise without erasing clinically relevant edges. Edge quality directly determines downstream segmentation accuracy and, for applications like radiation therapy planning, errors here translate directly to patient risk.
Q1 For σ = 1.0, the normalised centre weight of the 3×3 Gaussian is ≈ 0.2042. Does doubling σ to 2.0 make the centre weight larger or smaller? Explain briefly.
Q2 In the Canny pipeline, what does the non-maximum suppression step do and why is it needed?
Q3 You call cv2.Canny(img, 200, 400) and get very few edges. How should you adjust the thresholds, and why?
Application
Non-linear filters preserve edges that linear blur destroys. Combining median filtering, bilateral smoothing, and adaptive thresholding produces robust defect detectors and document scanners — pipelines used daily in manufacturing and fintech.
"A high-speed camera photographs PCB solder joints at 120 frames per second. Every 200th frame, a cosmic-ray hit corrupts a random pixel to 255 (salt noise). A Gaussian blur would smear that bright spot across nearby pixels, potentially masking a genuine solder defect. A median filter removes the corrupted pixel entirely — and never touches the edges. That one algorithmic choice is the difference between a false alarm and a detected fault."
The median filter is like a voting committee: each pixel's value is put to a neighbourhood vote, and the majority view wins. One corrupted pixel (salt or pepper) is always in the minority — it gets outvoted and removed. The mean filter, by contrast, lets every voice speak equally, so the noisy outlier skews the result for everyone nearby.
Industrial surface inspection pipeline: impulse noise removed first (median), then edge/gradient detection, then Otsu thresholding segments defect regions.
The median filter replaces each pixel with the median (middle value when sorted) of its neighbourhood. Because the median ignores extreme values, it is inherently outlier-resistant:
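In symbols, reusing the $(2m{+}1)\times(2n{+}1)$ window from the convolution definition (the hat notation $\hat{I}$ for the filtered image is an assumption introduced here):

```latex
\[
\hat{I}(r, c) \;=\; \operatorname{median}\bigl\{\, I(r + i,\; c + j) \;:\; -m \le i \le m,\ -n \le j \le n \,\bigr\}
\]
```

No set of fixed weights reproduces this operation, because the pixel that ends up in the middle of the sorted list changes from window to window.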
Problem: A 1D window of 5 pixels contains: [120, 255, 118, 0, 122]. Pixel 255 is salt noise and pixel 0 is pepper noise. Compute mean and median filter outputs.
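Both outputs for this window can be computed in plain Python. This is a small sketch of the calculation for one window position.

```python
window = [120, 255, 118, 0, 122]  # 255 is salt noise, 0 is pepper noise

# Mean filter: every value, including the outliers, contributes equally.
mean_out = sum(window) / len(window)

# Median filter: sort the window and take the middle element.
ordered = sorted(window)
median_out = ordered[len(ordered) // 2]

print(ordered)     # [0, 118, 120, 122, 255]
print(mean_out)    # 123.0
print(median_out)  # 120
```

The mean lands near the true level here only because one salt and one pepper pixel partly cancel; with a single outlier it would be pulled much further, while the median would be unchanged.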
If the same window had 3 salt pixels instead of 1 (i.e., [120, 255, 255, 255, 122]), would the median still recover the true value?
The bilateral filter weights neighbours by both spatial distance and intensity similarity:
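In the common Gaussian-weighted form (assumed here; the symbols $\hat{I}_p$, $W_p$ and $\Omega_p$ are introduced for illustration), the filtered value at pixel $p$ is:

```latex
\[
\hat{I}_p \;=\; \frac{1}{W_p} \sum_{q \in \Omega_p}
\exp\!\left(-\frac{\lVert p - q \rVert^{2}}{2\sigma_s^{2}}\right)
\exp\!\left(-\frac{(I_p - I_q)^{2}}{2\sigma_r^{2}}\right) I_q,
\qquad
W_p \;=\; \sum_{q \in \Omega_p}
\exp\!\left(-\frac{\lVert p - q \rVert^{2}}{2\sigma_s^{2}}\right)
\exp\!\left(-\frac{(I_p - I_q)^{2}}{2\sigma_r^{2}}\right)
\]
```

Here $\Omega_p$ is the neighbourhood of $p$, and $W_p$ normalises the weights so that they sum to 1.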
$\sigma_s$ (spatial) controls how far pixels contribute; $\sigma_r$ (range) controls the intensity threshold below which pixels are treated as "same surface." Pixels across an edge have large $|I_p - I_q|$ → near-zero range weight → edge preserved.
Otsu thresholding automatically finds the optimal global threshold $T^*$ by maximising the inter-class variance between foreground and background:
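In the standard formulation (assumed here), with $\omega_0(T)$ and $\omega_1(T)$ the fractions of pixels at or below and above $T$, and $\mu_0(T)$, $\mu_1(T)$ the mean intensities of those two classes:

```latex
\[
\sigma_B^{2}(T) \;=\; \omega_0(T)\,\omega_1(T)\,\bigl[\mu_0(T) - \mu_1(T)\bigr]^{2},
\qquad
T^{*} \;=\; \arg\max_{T}\; \sigma_B^{2}(T)
\]
```

The search is a single pass over the 256 grey levels of the histogram, so the cost is negligible next to the filtering steps.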
For documents with uneven lighting, adaptive thresholding computes a local $T$ for each pixel's neighbourhood, making it robust to shadows and gradients that defeat Otsu.
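The contrast between a global Otsu threshold and a local one can be sketched on a synthetic "document" with a left-to-right illumination ramp. Everything below is illustrative: the helper names, the 15-pixel window (`radius=7`) and the offset of 10 are assumptions chosen for this toy image. The local rule (pixel darker than its neighbourhood mean minus a constant) is the idea behind the mean variant of `cv2.adaptiveThreshold`, which is not called here.

```python
import numpy as np

def otsu_threshold(image):
    """Return the T that maximises between-class variance; pixels <= T form class 0."""
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    w0 = np.cumsum(prob)                      # class-0 weight for every candidate T
    w1 = 1.0 - w0
    cum_mean = np.cumsum(prob * np.arange(256))
    total_mean = cum_mean[-1]
    numer = (total_mean * w0 - cum_mean) ** 2
    denom = w0 * w1
    valid = denom > 0                         # skip thresholds that leave a class empty
    sigma_b = np.where(valid, numer / np.where(valid, denom, 1.0), 0.0)
    return int(np.argmax(sigma_b))

def local_mean_threshold(image, radius, offset):
    """Foreground where a pixel is darker than its local mean minus `offset`."""
    size = 2 * radius + 1
    h, w = image.shape
    padded = np.pad(image.astype(np.float64), radius, mode="edge")
    local_sum = np.zeros((h, w), dtype=np.float64)
    for dr in range(size):
        for dc in range(size):
            local_sum += padded[dr:dr + h, dc:dc + w]
    return image < (local_sum / (size * size) - offset)

# Background brightens from 80 to 200; ink strokes sit at half the local brightness.
cols = np.arange(64)
background = 80 + 120 * cols / 63
row = background.copy()
stroke_cols = [8, 24, 40, 56]
row[stroke_cols] = 0.5 * background[stroke_cols]
image = np.tile(row, (16, 1)).astype(np.uint8)

truth = np.zeros(image.shape, dtype=bool)
truth[:, stroke_cols] = True

global_mask = image <= otsu_threshold(image)
local_mask = local_mean_threshold(image, radius=7, offset=10)

print(np.array_equal(global_mask, truth))  # False
print(np.array_equal(local_mask, truth))   # True
```

No single global threshold can succeed on this image, because the ink in the bright region (about 93) is lighter than the paper in the dark region (80).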
Problem: Two adjacent pixels: $p$ (intensity 20) and $q$ (intensity 200), distance 1 pixel apart. $\sigma_s = 2$, $\sigma_r = 25$. Compute spatial and range weights.
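The two weights follow directly from the Gaussian terms of the bilateral filter. The sketch below assumes the usual form $\exp(-d^2 / 2\sigma^2)$ for both the spatial and the range weight.

```python
import math

sigma_s, sigma_r = 2.0, 25.0
distance = 1.0              # pixels between p and q
intensity_diff = 200 - 20   # |I_p - I_q|

spatial_w = math.exp(-distance ** 2 / (2 * sigma_s ** 2))
range_w = math.exp(-intensity_diff ** 2 / (2 * sigma_r ** 2))

print(round(spatial_w, 4))          # 0.8825
print(f"{range_w:.1e}")             # 5.5e-12
print(f"{spatial_w * range_w:.1e}") # 4.9e-12
```

The spatial weight alone would let $q$ contribute strongly, but the product is effectively zero, so the pixel across the edge is ignored.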
If $\sigma_r$ is increased from 25 to 200, what happens to the range weight for the same pair (intensity difference = 180)?
Myth: "Gaussian blur and bilateral filter both smooth, so I can use them interchangeably to reduce noise."
Reality: Gaussian blur is a linear filter — it blurs everything uniformly, including edges. Bilateral is non-linear — it only smooths within uniform regions and leaves edges untouched. For industrial inspection (where you need to detect surface defects adjacent to clean regions) or portrait photography (smooth skin, sharp eyes), always prefer bilateral. Gaussian is faster and preferred only when edge preservation is not required.
The noise slider adds random salt-and-pepper pixels. Before toggling filters: at 20% noise density, which filter do you expect to fully recover the original image — mean, Gaussian, or median?
Hint: think about what "20% corrupted" means for the median of a 3×3 neighbourhood (9 pixels, 1.8 corrupted on average).
Python · OpenCV — Median, bilateral, and adaptive thresholding
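A NumPy-only sketch of the median-versus-mean comparison follows, so that it runs without OpenCV. The helper `neighbourhood_stack` is hypothetical; `cv2.medianBlur` is the library routine for the median step, and bilateral filtering and adaptive thresholding are not shown here.

```python
import numpy as np

def neighbourhood_stack(image, radius=1):
    """Stack the (2*radius+1)^2 shifted copies of `image`, replicating edge pixels."""
    size = 2 * radius + 1
    h, w = image.shape
    padded = np.pad(image, radius, mode="edge")
    return np.stack([padded[dr:dr + h, dc:dc + w]
                     for dr in range(size) for dc in range(size)])

# Clean image: a vertical step edge. Noisy copy: one salt pixel near the edge.
clean = np.full((8, 8), 50, dtype=np.uint8)
clean[:, 4:] = 200
noisy = clean.copy()
noisy[3, 2] = 255

stack = neighbourhood_stack(noisy)
median_out = np.median(stack, axis=0).astype(np.uint8)  # 3x3 median filter
mean_out = stack.mean(axis=0)                           # 3x3 mean filter

print(np.array_equal(median_out, clean))  # True: salt removed, edge intact
print(round(float(mean_out[3, 2]), 2))    # 72.78: salt smeared into the average
print(round(float(mean_out[3, 3]), 2))    # 122.78: edge blurred as well
```

The median restores the clean image exactly in this toy case, while the mean both spreads the corrupted pixel and softens the step.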
Use median for impulse noise, bilateral for edge-preserving smoothing, and adaptive thresholding when illumination is uneven — combining them in the right order defines a production-ready industrial vision pipeline.
At Foxconn and similar electronics manufacturers, high-speed vision systems capture 10,000+ PCB images per hour. The pipeline — median → bilateral → Canny → contour analysis — runs in under 5 ms per frame on a GPU, flagging solder bridges, missing components, and pad oxidation in real time. The same filter stack, adapted with adaptive Otsu, powers bank cheque OCR, passport scanning, and form digitization at fintech companies — where uneven LED lighting above the scanning bed is corrected algorithmically rather than with expensive hardware.
Q1 A 3×3 window contains: [50, 52, 48, 51, 255, 49, 50, 48, 53]. Apply a median filter. What is the output?
Q2 You are processing a document photo taken under non-uniform lighting. Half the page is in shadow (pixel values ~80) and half is bright (values ~200). Why will cv2.threshold(..., cv2.THRESH_OTSU) fail here?
Q3 In the industrial inspection pipeline, why is the median filter applied before Canny edge detection, not after?
Apply any spatial filter to a synthetic test image (gradients, sharp edges, and texture) and compare input vs. output side by side. The kernel size slider shows how larger kernels increase blur or sharpening strength.
Week 2 Summary
Six ideas that every CV engineer needs at their fingertips — from CNNs to industrial inspection.
A sliding weighted sum of pixel neighbourhoods. The kernel encodes the filter's frequency response. Foundation of all spatial filtering and every CNN layer.
A separable linear low-pass filter, parametrised by σ. Larger σ → more blur. The standard pre-processing step before edge detectors and feature extractors.
Sobel estimates gradient via finite differences. Canny chains smoothing → gradient → NMS → hysteresis into the industry-standard edge detector producing thin, continuous, well-localised edges.
Non-linear and outlier-resistant. Removes isolated salt-and-pepper pixels while preserving edges. Cannot be expressed as a convolution; its strength is precisely its non-linearity.
Spatial × range weighting: smooths within uniform regions, silences cross-edge pixels. Used in portrait mode, HDR tone-mapping, and PCB surface inspection.
Feature Detection & Matching — Harris corners, scale-space theory, SIFT descriptors, and ORB matching. The kernel intuition from this week is the direct foundation for corner response functions.
Further Reading
Interactive tools and primary references to solidify your mastery of spatial filtering.
Browser-based live demo: pick any kernel and watch it applied pixel by pixel in real time. The best complement to the widget on this page.
Interactive: Animated, interactive Fourier decomposition. Build signals from sine waves and watch the spectrum update live. Essential context for frequency-domain filtering.
Chapter 3: Definitive reference on linear filters, Gaussian pyramids, and frequency-domain analysis. Sections 3.2–3.5 map directly to this week's content. Available in the Knowledge directory.
YouTube playlist: Lecture series covering image filtering, edge detection, and the mathematical foundations at a rigorous level. Ideal preparation for the midterm's quantitative questions.
Two exercises per topic (theory + code) plus two synthesis challenges that combine all three topics into a single pipeline.
Explain the mathematical difference between 2D convolution and 2D cross-correlation. Under what condition are they equivalent? Why does this matter when using cv2.filter2D or PyTorch's nn.Conv2d?
Definition difference: Convolution flips the kernel 180° before sliding ($I * K$ has minus signs in the offset indices); cross-correlation does not flip. Equivalent when: $K$ is symmetric, i.e., $K = \text{rot}_{180}(K)$. This is true for Gaussian, box, and Laplacian kernels. For cv2.filter2D: implements cross-correlation. For symmetric kernels the result is identical to convolution. For Sobel (asymmetric), the gradient sign is reversed but the magnitude map is the same.
Implement a 3×3 mean filter manually (without cv2.filter2D). Apply it to a grayscale image at pixel position (100, 100). Then verify your result matches cv2.filter2D with BORDER_REFLECT.
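Key steps: K = np.ones((3,3), dtype=np.float32) / 9. Extract patch = img[r-1:r+2, c-1:c+2]. Manual output: np.sum(patch * K). OpenCV: cv2.filter2D(img, cv2.CV_32F, K, borderType=cv2.BORDER_REFLECT). With a float output depth the absolute difference should be 0 or float32 rounding error (on the order of 1e-5). With ddepth=-1 on a uint8 image the output is rounded to an integer, so the difference can be as large as 0.5.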
For a Gaussian kernel with σ = 1.5, compute $G(0,0,1.5)$, $G(1,0,1.5)$, and $G(1,1,1.5)$ (unnormalised). What percentage of the total 3×3 kernel weight is concentrated in the centre pixel? How does this compare to σ = 1.0 (centre ≈ 20.4%)?
Use $G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\exp(-(x^2+y^2)/(2\sigma^2))$. For σ=1.5: $G(0,0) = \frac{1}{2\pi(2.25)} \approx 0.0707$. $G(1,0) = 0.0707 \times e^{-1/4.5} \approx 0.0707 \times 0.8007 \approx 0.0566$. $G(1,1) = 0.0707 \times e^{-2/4.5} \approx 0.0707 \times 0.6412 \approx 0.0453$. Raw sum = $0.0707 + 4(0.0566) + 4(0.0453) = 0.0707 + 0.2264 + 0.1812 = 0.4783$. Normalised centre = $0.0707/0.4783 \approx 14.8\%$ — smaller than σ=1.0's 20.4%, confirming a wider, flatter kernel at larger σ.
Load a grayscale image. Apply Gaussian blur with σ ∈ {0.5, 1, 2} before Canny (T_low=50, T_high=150). Count the number of white (edge) pixels in each output. What σ produces the most edges? The fewest? Why?
Apply cv2.GaussianBlur with each σ, then run cv2.Canny on the result.
Count the edge pixels with np.count_nonzero(edges).
σ=0.5: minimal smoothing → noisy edges → most white pixels. σ=2: heavy smoothing → strong noise suppression → fewest edges (only the most prominent boundaries survive). The Canny thresholds (50/150) remain fixed; what changes is how much noise and fine texture survives the blur, and therefore which gradients cross T_high. A σ of about 1.0–1.5 is a common starting point for natural images.
Two adjacent 3×3 image regions: the left region has all pixels ≈ 50, the right region has all pixels ≈ 200. For the centre pixel of the left region, its 3×3 neighbourhood contains 6 pixels ≈ 50 and 3 pixels ≈ 200 (border column). Compute mean and median outputs. Explain why the median preserves the edge but the mean blurs it.
Mean: $(6 \times 50 + 3 \times 200)/9 = (300+600)/9 = 100$. The edge is pulled toward the midpoint.
Median: Sorted: [50,50,50,50,50,50,200,200,200]. Median (index 4) = 50. As long as more than half the neighbourhood is on the same side of the edge, the median stays within that region. The mean has no such protection — it's corrupted by any minority of cross-edge pixels.
Apply cv2.bilateralFilter with σ_s ∈ {5, 20} and σ_r ∈ {15, 80} — a 2×2 grid of 4 combinations. Display the results side by side. Identify which combination gives the best edge-preserving denoising and explain the role of each parameter.
Best for edge-preserving denoising: large σ_s + small σ_r. Large σ_s spreads smoothing over a wide spatial area; small σ_r keeps the range weight tight — only pixels with similar intensity participate. This smooths within uniform regions widely while completely ignoring cross-edge pixels. Large σ_s + large σ_r ≈ strong Gaussian blur (blurs everything). Small σ_s + small σ_r = minimal, local, edge-aware smoothing.
For each scenario below, identify the optimal filter and justify your choice mathematically:
1. Gaussian blur (σ ≈ 1–2): Gaussian noise is zero-mean and i.i.d., so a weighted local average reduces its variance, and Gaussian weights do this smoothly without ringing. It is not edge-preserving, but the goal is pre-smoothing for Canny, so edges can be slightly diffused.
2. Median filter (k=5): Salt-and-pepper noise consists of isolated extreme outliers; the median is inherently outlier-resistant and preserves edges. At 5% density in a 25-pixel neighbourhood (~1.25 corrupted pixels on average, against the 13 needed to move the median), the output is almost always unaffected.
3. Bilateral filter (large σ_s, small σ_r): skin has gradually varying texture (within σ_r range); eyes/hair have sharp intensity boundaries (outside σ_r range). Bilateral smooths the former while preserving the latter.
Build a complete defect detection pipeline on a synthetic test image (gradient background + inserted bright spot to simulate a defect):
Apply medianBlur(k=5) to remove the impulse noise.
Run cv2.Canny on the denoised image and compare edge counts before/after denoising.
Setup: img = np.tile(np.linspace(50,200,256).astype(np.uint8), (256,1)). Add noise: randomly set 10% of pixels to 0 or 255. After medianBlur: Canny edge count drops by ~80-90% (noise edges removed). Otsu: _, mask = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU). Defect area: np.sum(mask == 255). Compare against known injected defect size to validate the pipeline.