Computer Vision Foundations — Week 2: Spatial Filtering & Kernels

Intuition

Convolution as a Spatial Operation

Every output pixel becomes a weighted sum of its neighbourhood. This single idea, a sliding kernel, is the mathematical foundation of blur, sharpening, edge detection, and the entire family of Convolutional Neural Networks.

After this section you will be able to

Compute the 2D discrete convolution at any pixel by hand, given an image patch and a kernel.
Distinguish between convolution and cross-correlation and identify when they are equivalent.
Apply mean, sharpen, and edge kernel filters using cv2.filter2D with appropriate border handling.

"Every Instagram filter, every blurred background in portrait mode, every object detector — they all start with the same three-line operation: position a small grid of numbers over an image, multiply them element-wise, sum the products, write the result into a new pixel. Move one step. Repeat."

🎯

Why this matters: Convolution is not just a pre-processing step — it is the computational primitive of every CNN layer. Understanding how a 3×3 kernel slides and accumulates is the same understanding you need to read a ResNet architecture diagram, debug a feature map, or tune a receptive field. Everything in deep learning vision builds on this one operation.

🔗

Think of it this way

Convolution is like rubbing a stamp across wet ink: the stamp (kernel) has a fixed pattern of raised dots, and wherever it presses, those dots leave weighted imprints on the paper (image). A heavy dot in the centre means "this pixel matters most"; lighter dots at the edges mean "nearby pixels have a smaller say." Just as different stamp designs create different prints, different kernels create blur, sharpness, or edges.

Kernel convolution anatomy: each output pixel is the weighted sum of a 3×3 neighbourhood. The kernel slides one step at a time across the entire image.

3×3 (typical kernel size): 9 multiplications per output pixel; the most common size in CV and CNNs.

O(k²) (cost per pixel): Direct convolution cost scales with kernel area; for large k, FFT-based convolution is cheaper because its cost does not grow with kernel size.

50+ (conv layers in ResNet-50): Every single one runs this same operation, just with learned kernel weights.

Problem

2D Discrete Convolution

Given image $I$ and kernel $K$ of size $(2m{+}1)\times(2n{+}1)$, the output at pixel $(r,c)$ is:

$$( I * K)[r,c] = \sum_{i=-m}^{m} \sum_{j=-n}^{n} I[r-i,\; c-j]\cdot K[i,j]$$

The minus signs flip the kernel before the sum — this is the formal definition of convolution. When $K$ is symmetric (e.g., Gaussian, box filter), flipping has no effect and convolution equals cross-correlation.
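
The definition can be checked numerically. The following is a minimal NumPy sketch, assuming zero values outside the image and an odd-sized kernel; the helper names `convolve2d_at` and `correlate2d_at` are illustrative and not part of any library.

```python
import numpy as np

def convolve2d_at(image, kernel, r, c):
    """True convolution at (r, c): sum of I[r-i, c-j] * K[i, j]."""
    m, n = kernel.shape[0] // 2, kernel.shape[1] // 2
    total = 0.0
    for i in range(-m, m + 1):
        for j in range(-n, n + 1):
            rr, cc = r - i, c - j
            # Pixels outside the image are treated as zero (assumption).
            if 0 <= rr < image.shape[0] and 0 <= cc < image.shape[1]:
                total += image[rr, cc] * kernel[i + m, j + n]
    return total

def correlate2d_at(image, kernel, r, c):
    """Cross-correlation at (r, c): sum of I[r+i, c+j] * K[i, j] (no flip)."""
    m, n = kernel.shape[0] // 2, kernel.shape[1] // 2
    total = 0.0
    for i in range(-m, m + 1):
        for j in range(-n, n + 1):
            rr, cc = r + i, c + j
            if 0 <= rr < image.shape[0] and 0 <= cc < image.shape[1]:
                total += image[rr, cc] * kernel[i + m, j + n]
    return total

image = np.arange(25, dtype=np.float64).reshape(5, 5)
box = np.full((3, 3), 1 / 9)                       # symmetric kernel
ramp = np.arange(1, 10, dtype=np.float64).reshape(3, 3)  # asymmetric kernel

# Symmetric kernel: both operations agree.
same_for_box = np.isclose(convolve2d_at(image, box, 2, 2),
                          correlate2d_at(image, box, 2, 2))
# Asymmetric kernel: convolution equals correlation with the 180°-rotated kernel.
flip_matches = np.isclose(convolve2d_at(image, ramp, 2, 2),
                          correlate2d_at(image, ramp[::-1, ::-1], 2, 2))
print(same_for_box, flip_matches)
```

The explicit loops are slow but mirror the double sum term by term, which makes the role of the minus signs easy to see.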

📝 Worked Example — 3×3 mean filter at centre pixel

Background. A mean (box) filter replaces each pixel with the average of its neighbourhood. The 3×3 kernel has every weight equal to $\frac{1}{9}$. Because this kernel is symmetric, convolution and cross-correlation give the same result.

Problem: Neighbourhood values — top row: [80, 100, 120], middle: [90, 150, 110], bottom: [70, 80, 90]. Apply a 3×3 mean filter and find the output at the centre pixel.

List all 9 values.
$80,\ 100,\ 120,\ 90,\ 150,\ 110,\ 70,\ 80,\ 90$

Sum them.
$80+100+120+90+150+110+70+80+90 = 890$

Divide by 9 (kernel sum).
$890 \div 9 \approx 98.9 \approx 99$

Output = 99 — the bright centre pixel (150) was pulled toward the neighbourhood average. This is what "blurring" means.

Physical interpretation.
The output is a weighted vote of 9 pixels. A mean filter blurs because isolated bright or dark pixels are diluted by their neighbours.
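
The arithmetic above can be reproduced in a few lines of NumPy. This is only a check of the worked example; the variable names are illustrative.

```python
import numpy as np

# Neighbourhood from the worked example.
patch = np.array([[80, 100, 120],
                  [90, 150, 110],
                  [70,  80,  90]], dtype=np.float64)

# 3×3 mean kernel: every weight is 1/9.
K_mean = np.full((3, 3), 1 / 9)

# Element-wise multiply, then sum: one output pixel.
output = float(np.sum(patch * K_mean))
print(round(output, 1))  # 98.9
```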

✔ Quick Check

If the centre pixel were 200 instead of 150, the sum becomes 940. What would the mean filter output be?

940 / 9 ≈ 104.4 ≈ 104. Even a very bright centre is reduced toward the neighbourhood average.

Padding Strategies & Correlation vs Convolution

Applying a kernel near the image border requires a decision about pixels outside the image boundary:

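Zero-padding: Treat outside pixels as 0. Fast but introduces dark border artefacts. The usual choice in CNN layers; in OpenCV it is cv2.BORDER_CONSTANT.
Reflect padding: Mirror the image at the border. Eliminates dark edges; preferred in classical CV (cv2.BORDER_REFLECT). The cv2.filter2D default is the closely related cv2.BORDER_REFLECT_101, which mirrors without repeating the edge pixel.
Replicate padding: Repeat the nearest border pixel (cv2.BORDER_REPLICATE). Smooth transition, minimal artefact.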
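
The three strategies are easiest to compare on a single row of pixels. The sketch below uses `np.pad` rather than OpenCV; the comments name the OpenCV border flag that corresponds to each NumPy mode.

```python
import numpy as np

row = np.array([1, 2, 3, 4])

# Zero-padding: cv2.BORDER_CONSTANT with value 0.
zero = np.pad(row, 2, mode="constant", constant_values=0)

# Reflect padding that repeats the edge pixel: cv2.BORDER_REFLECT.
reflect = np.pad(row, 2, mode="symmetric")

# Reflect padding that skips the edge pixel: cv2.BORDER_REFLECT_101.
reflect_101 = np.pad(row, 2, mode="reflect")

# Replicate padding: cv2.BORDER_REPLICATE.
replicate = np.pad(row, 2, mode="edge")

print(zero)         # [0 0 1 2 3 4 0 0]
print(reflect)      # [2 1 1 2 3 4 4 3]
print(reflect_101)  # [3 2 1 2 3 4 3 2]
print(replicate)    # [1 1 1 2 3 4 4 4]
```

Note the naming trap: NumPy's "symmetric" mode matches cv2.BORDER_REFLECT, while NumPy's "reflect" mode matches cv2.BORDER_REFLECT_101.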

Cross-correlation vs convolution: cv2.filter2D performs cross-correlation (no kernel flip). PyTorch nn.Conv2d also implements cross-correlation and calls it convolution. For symmetric kernels the difference is zero. For asymmetric kernels (Sobel, emboss), flip manually before passing if true convolution is needed.

📝 Worked Example — Sobel X: why asymmetry matters

Problem: Sobel X kernel $K = [[-1,0,1],[-2,0,2],[-1,0,1]]$ is asymmetric. Show that correlation and convolution give different results on the same patch.

Correlation (cv2.filter2D default): apply $K$ directly without flipping. A dark-to-bright transition from left to right gives a positive response.

Convolution (true): flip $K$ 180° to get $[[1,0,-1],[2,0,-2],[1,0,-1]]$. The same dark-to-bright transition now gives a negative response: the gradient sign is reversed.

For Gaussian or box kernels (symmetric), correlation = convolution. For Sobel (asymmetric), the responses have opposite sign.
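
A small numerical check of the sign reversal, assuming a hypothetical 3×3 patch that is dark on the left and bright on the right:

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)

# Illustrative patch: dark (10) on the left, bright (90) on the right.
patch = np.array([[10, 10, 90],
                  [10, 10, 90],
                  [10, 10, 90]], dtype=np.float64)

# Cross-correlation at the centre pixel: kernel applied as-is.
correlation = np.sum(patch * sobel_x)

# Convolution at the centre pixel: kernel rotated by 180° first.
convolution = np.sum(patch * sobel_x[::-1, ::-1])

print(correlation, convolution)  # 320.0 -320.0
```

The magnitudes agree; only the sign differs, which matters as soon as the gradient direction is used.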

✔ Quick Check

Is the Laplacian kernel $[[0,1,0],[1,-4,1],[0,1,0]]$ symmetric or asymmetric?

Symmetric — rotating it 180° yields the same kernel. So cv2.filter2D gives the correct Laplacian result without any manual flip.

⚠️

Common Mistake

Myth: "cv2.filter2D computes the mathematical convolution, so I don't need to think about kernel orientation."

Reality: cv2.filter2D performs cross-correlation (no flip). For symmetric kernels (Gaussian, box, Laplacian) this is identical to convolution. For directional kernels like Sobel or Prewitt, the result has reversed gradient direction. Always check kernel symmetry before using filter2D for gradient-based operations.

Solution

Pause & Predict

Before adjusting the sliders: a 3×3 mean kernel and a 3×3 sharpen kernel are about to be applied to the same image. Which kernel do you predict will make the centre pixel brighter relative to its neighbours?

Hint: look at the centre weight of the sharpen kernel — it's much larger than 1/9.

Try It: Kernel Selector

Select a kernel — watch it slide over the pixel grid and accumulate the weighted sum at the output cell.

Live Calculation — Output at Centre Pixel

output = Σ(patch × kernel)

Implementation

Python · NumPy + OpenCV — Manual convolution and filter2D

import numpy as np
import cv2

# 3×3 mean (box) kernel — each weight = 1/9
K_mean = np.ones((3, 3), dtype=np.float32) / 9

img = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE).astype(np.float32)

# OpenCV filter2D: cross-correlation (kernel NOT flipped)
blurred = cv2.filter2D(
    img, -1, K_mean, borderType=cv2.BORDER_REFLECT
)

# Manual: compute one output pixel at (r=100, c=100)
r, c = 100, 100
patch = img[r-1:r+2, c-1:c+2]  # 3×3 neighbourhood
output = np.sum(patch * K_mean)   # element-wise × + sum
print(f'Manual: {output:.1f}  |  OpenCV: {blurred[r,c]:.1f}')

# Sharpen kernel: enhances centre, subtracts neighbours
K_sharpen = np.array([[ 0,-1, 0],
                       [-1, 5,-1],
                       [ 0,-1, 0]], dtype=np.float32)
sharp = cv2.filter2D(img, -1, K_sharpen,
                      borderType=cv2.BORDER_REFLECT)

stdout (illustrative; actual values depend on photo.jpg)

Manual: 98.9  |  OpenCV: 98.9

Key Takeaway

Every spatial filter — from Instagram blur to ResNet feature extraction — is a single operation: place a kernel, multiply element-wise, sum, write output, slide one step, repeat.

🧠

Real-World Application

CNNs Are Just 50+ Layers of Convolution with Learned Kernels

Every convolutional layer in ResNet-50, EfficientNet, or YOLO executes the same 2D cross-correlation you just computed by hand; the only difference is that the kernel weights are learned from millions of training images rather than hand-coded. A ResNet-50 has over 50 conv layers, each with between 64 and 2,048 kernels; understanding this single operation is the foundation for reading any CNN architecture diagram.

Checkpoint Test your understanding of 2D Convolution

Q1 A 5×5 kernel is applied to a 256×256 image. How many multiplications does a single direct spatial convolution require (total, all output pixels)?

Output size ≈ 256×256 = 65,536 pixels (assuming same-padding). Each pixel requires 5×5 = 25 multiplications. Total = 65,536 × 25 = 1,638,400 multiplications.

Q2 You use cv2.filter2D with the Sobel X kernel $[[-1,0,1],[-2,0,2],[-1,0,1]]$. A colleague says the output is "wrong" because filter2D uses cross-correlation, not convolution. Is your colleague correct?

Your colleague is technically correct that filter2D uses cross-correlation (no flip). However, for Sobel X the practical impact is a sign reversal of the gradient — the magnitude map $|\nabla I|$ is identical. In most CV pipelines this does not matter. If the exact sign of $G_x$ matters (e.g., computing $\theta = \text{arctan}(G_y/G_x)$), flip the kernel before passing it.

Q3 Which border padding strategy should you choose to avoid dark artefacts at the image edges when blurring a face photo?

BORDER_REFLECT (mirror padding). Zero-padding (BORDER_CONSTANT) treats the outside as black, creating an artificial dark halo around the edge pixels after blurring. Reflect padding mirrors the image content at the border, so the average includes real pixel values on both sides.

Mechanics

Gaussian Blur & Edge Detection

Gaussian smoothing is the standard linear pre-filter for suppressing noise before gradient-based edge detection. Sobel estimates the gradient's magnitude and direction; Canny chains four steps into the industry-standard edge pipeline.

After this section you will be able to

Compute Gaussian kernel weights for any $\sigma$ by hand and verify normalisation.
Apply Sobel X and Y filters and compute gradient magnitude and direction at any patch.
Trace the four steps of the Canny pipeline and predict how changing $T_{\text{low}}/T_{\text{high}}$ affects the output edge map.

"You want to detect the edge of a tumour in an MRI scan — but raw sensor data is noisy. Detect edges too early and you trace noise; blur too much and you erase the tumour boundary. Gaussian smoothing is the mathematical sweet spot: it suppresses noise while keeping the spatial information you need. Every edge detector in existence starts with it."

🔬

Why this matters: Gradient-based edge detection is still the first step in many production CV systems — not just classical ones. Object detectors anchor their proposals to high-gradient regions. Segmentation models initialise contours at edges. Understanding Sobel and Canny gives you the ability to interpret, debug, and tune any modern system's feature extraction stage.

🔗

Think of it this way

Gaussian blur is like squinting your eyes: small details (noise, texture) disappear but large structures (edges, shapes) remain visible. Canny edge detection is then like tracing only the strongest outlines you see while squinting — the four steps decide which lines are "real" edges versus accidental streaks.

Gaussian Smooth (σ ≈ 1) → Sobel Gradient $G_x, G_y$ → Non-Max Suppression → Hysteresis Threshold

Canny pipeline: Gaussian smoothing → Sobel gradient → Non-maximum suppression → Hysteresis thresholding. Each step narrows the response to thin, continuous, well-localised edges.

σ = 1–2 (Gaussian pre-blur): Typical range for Canny pre-processing; suppresses noise without erasing fine edges.

2:1 to 3:1 (T_high / T_low ratio): The range Canny recommended for the hysteresis thresholds (Canny, 1986).

k/2× (Gaussian speedup): Separability splits a k×k 2D Gaussian into two 1D passes, reducing cost by a factor of k/2.

Problem

The 2D Gaussian Kernel

The Gaussian function in 2D is a bell-shaped surface centred at the origin. Normalised to unit sum, it forms a valid smoothing kernel:

$$G(x,y,\sigma) = \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{x^2+y^2}{2\sigma^2}\right)$$

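$\sigma$ controls blur strength. The kernel is usually truncated at about $\pm 3\sigma$, giving a $(2\lceil 3\sigma\rceil{+}1)\times(2\lceil 3\sigma\rceil{+}1)$ window (7×7 for $\sigma = 1$, 31×31 for $\sigma = 5$). The worked example below uses a smaller 3×3 window to keep the arithmetic short.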

📝 Worked Example — Gaussian weights for σ = 1.0

Problem: Compute $G(0,0,1)$, $G(1,0,1)$, $G(1,1,1)$; then normalise the 3×3 kernel so it sums to 1.

Centre weight $G(0,0,1)$.
$G = \tfrac{1}{2\pi}\exp(0) = \tfrac{1}{2\pi} \approx 0.1592$

Edge weight $G(1,0,1)$.
$G = 0.1592 \times e^{-1/2} = 0.1592 \times 0.6065 \approx 0.0965$

Corner weight $G(1,1,1)$.
$G = 0.1592 \times e^{-1} = 0.1592 \times 0.3679 \approx 0.0585$

Assemble and normalise.
Raw sum: $1(0.1592) + 4(0.0965) + 4(0.0585) = 0.1592 + 0.3860 + 0.2340 = 0.7792$

Normalised: centre ≈ 0.2043 · edge ≈ 0.1238 · corner ≈ 0.0751 → sum = 1.0. Centre pixel carries ~20% of the total weight.
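
The same weights can be generated programmatically. This is a minimal sketch that evaluates the Gaussian formula on a 3×3 grid and renormalises; because it skips the intermediate rounding of the hand calculation, the last digit can differ slightly.

```python
import numpy as np

sigma = 1.0
offsets = np.array([-1, 0, 1])
x, y = np.meshgrid(offsets, offsets)

# Unnormalised samples of G(x, y, sigma) on the 3×3 grid.
raw = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# Truncating to 3×3 loses the tails, so divide by the sum to restore unit weight.
kernel = raw / raw.sum()

print(np.round(raw, 4))     # centre, edge, corner before normalisation
print(np.round(kernel, 3))  # centre ≈ 0.204, edge ≈ 0.124, corner ≈ 0.075
```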

Separability insight.
Because $G(x,y,\sigma) = G_x(x,\sigma) \cdot G_y(y,\sigma)$, a 2D blur decomposes into one 1D horizontal pass followed by one 1D vertical pass, cutting the work from $k^2$ to $2k$ multiplications per pixel. At $\sigma = 5$ ($k \approx 31$) that is $31^2/(2\cdot 31) \approx 15\times$ fewer multiplications.
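
Separability can be verified numerically. The sketch below assumes NumPy 1.20 or newer (for `sliding_window_view`) and computes only the "valid" region so that no border handling is involved; `filter_valid` is an illustrative helper, not a library function.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_valid(img, kernel):
    """Cross-correlate img with kernel, keeping only fully covered pixels."""
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum("rcij,ij->rc", windows, kernel)

sigma = 1.0
offsets = np.arange(-3, 4)                  # 7 taps, i.e. ±3σ
g1d = np.exp(-offsets**2 / (2 * sigma**2))
g1d /= g1d.sum()                            # normalised 1D Gaussian
g2d = np.outer(g1d, g1d)                    # 7×7 2D Gaussian (49 weights)

rng = np.random.default_rng(0)
image = rng.random((12, 12))

# One 2D pass: 49 multiplications per output pixel.
full_2d = filter_valid(image, g2d)

# Two 1D passes: 7 + 7 = 14 multiplications per output pixel.
horizontal = filter_valid(image, g1d[np.newaxis, :])
two_pass = filter_valid(horizontal, g1d[:, np.newaxis])

print(np.allclose(full_2d, two_pass))  # True
```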

✔ Quick Check

For σ = 1, the centre weight is ~0.2043. Does a larger σ make the centre weight larger or smaller?

Smaller. A larger σ spreads the bell curve wider, so the centre peak is lower and the weight is shared more evenly across the kernel — this is precisely what produces stronger blurring.

Sobel Gradient & the Canny Pipeline

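The image gradient at $(r,c)$ is approximated by applying the Sobel kernels to the image without flipping (cross-correlation, written $\star$ here), which is what cv2.Sobel and cv2.filter2D compute:

$$G_x = \begin{bmatrix}-1&0&+1\\-2&0&+2\\-1&0&+1\end{bmatrix}{\!\star\!}I, \quad G_y = \begin{bmatrix}-1&-2&-1\\0&0&0\\+1&+2&+1\end{bmatrix}{\!\star\!}I$$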

Magnitude and direction then follow:

$$|\nabla I| = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \arctan\!\left(\frac{G_y}{G_x}\right)$$

📝 Worked Example — Sobel X at a vertical edge

Problem: 3×3 patch — left column all 50, right column all 200, middle column all 125. Compute $G_x$ at the centre using Sobel X.

Left column contribution (weights −1, −2, −1).
$(-1)(50)+(-2)(50)+(-1)(50) = -200$

Right column contribution (weights +1, +2, +1).
$(+1)(200)+(+2)(200)+(+1)(200) = +800$

Sum and interpret.
$G_x = -200 + 800 = 600$. Large positive value: strong left-to-right intensity jump — a vertical edge.

$G_x = 600$ · $G_y = 0$ (all three rows are identical) · $|\nabla I| = 600$ · $\theta = 0°$ (pointing right)
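
The worked example can be reproduced directly in NumPy. This sketch applies both Sobel kernels to the patch without flipping and uses `arctan2`, which handles the $G_x = 0$ case that a plain division would not.

```python
import numpy as np

# Patch from the worked example: columns 50, 125, 200.
patch = np.array([[50, 125, 200]] * 3, dtype=np.float64)

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=np.float64)

gx = np.sum(patch * sobel_x)   # horizontal intensity change
gy = np.sum(patch * sobel_y)   # vertical intensity change
magnitude = np.hypot(gx, gy)
theta_deg = np.degrees(np.arctan2(gy, gx))

print(gx, gy, magnitude, theta_deg)  # 600.0 0.0 600.0 0.0
```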

✔ Quick Check

If all three columns were uniform (e.g., all 100), what would $G_x$ be?

$G_x = 0$. The left column contribution is $(-1-2-1)(100) = -400$ and the right is $(+1+2+1)(100) = +400$. These cancel exactly — Sobel correctly reports no gradient in a flat region.

💡

Key Insight

Why Canny is better than raw Sobel thresholding: A single global threshold on $|\nabla I|$ produces thick blobs where edges are strong and nothing where they are weak. Non-maximum suppression (step 3) reduces each ridge to a 1-pixel-thin line; hysteresis (step 4) then extends strong edges into weaker regions, preserving long continuous boundaries. The result has thin, connected, well-localised edges that simple Sobel thresholding cannot match.
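
The hysteresis step can be sketched in a few lines. The following is a simplified illustration, not OpenCV's implementation: it keeps every pixel at or above T_high and then repeatedly grows those seeds into 8-connected neighbours that are at or above T_low. The function name `hysteresis` and the sample values are assumptions made for the example.

```python
import numpy as np

def hysteresis(magnitude, t_low, t_high):
    """Keep strong pixels plus weak pixels 8-connected to a strong one."""
    candidate = magnitude >= t_low      # weak or strong
    edges = magnitude >= t_high         # strong seeds
    rows, cols = edges.shape
    while True:
        # Dilate the current edge set by one pixel in all 8 directions.
        padded = np.pad(edges, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(edges)
        for dr in range(3):
            for dc in range(3):
                grown |= padded[dr:dr + rows, dc:dc + cols]
        grown &= candidate              # only candidates may join
        if np.array_equal(grown, edges):
            return edges                # no change: finished
        edges = grown

# One strong pixel, two connected weak pixels, a gap, one isolated weak pixel.
magnitude = np.array([[200, 80, 80, 10, 80]], dtype=np.float64)

single_threshold = magnitude >= 150
edge_map = hysteresis(magnitude, t_low=50, t_high=150)

print(single_threshold.tolist())  # [[True, False, False, False, False]]
print(edge_map.tolist())          # [[True, True, True, False, False]]
```

The isolated weak pixel on the right is rejected because nothing connects it to a strong seed, while the weak pixels adjacent to the seed are kept.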

Solution

Pause & Predict

Move the σ slider to a large value (e.g., σ = 8). Will the Gaussian-blurred image show stronger or weaker Sobel edges compared to σ = 1? Why?

Hint: think about what high-σ blur does to sharp transitions in intensity.

Try It: Gaussian σ & Edge Response Explorer

Adjust σ to see how blur affects the edge response. Toggle filter type to compare Sobel magnitude vs. Canny output.

Implementation

Python · OpenCV — Gaussian blur, Sobel, and Canny

import cv2
import numpy as np

img = cv2.imread('photo.jpg', cv2.IMREAD_GRAYSCALE)

# Gaussian blur — ksize=(0,0) infers kernel from sigmaX
blur1 = cv2.GaussianBlur(img, (0, 0), sigmaX=1.0)
blur2 = cv2.GaussianBlur(img, (0, 0), sigmaX=3.0)

# Sobel: horizontal and vertical gradient
Gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
Gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
magnitude = np.sqrt(Gx**2 + Gy**2)
direction = np.arctan2(Gy, Gx) * (180 / np.pi)

# Laplacian of Gaussian (LoG) — second derivative
blurred = cv2.GaussianBlur(img, (0, 0), 1.5)
laplacian = cv2.Laplacian(blurred, cv2.CV_64F)

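# Canny: gradient, non-maximum suppression, and hysteresis in one call.
# cv2.Canny does not blur internally, so pass the pre-smoothed image.
# threshold1=T_low, threshold2=T_high (here 50:150, a 1:3 ratio)
edges = cv2.Canny(blur1, threshold1=50, threshold2=150)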

stdout / description

blur1.shape: (H, W); Gaussian σ=1.0: light smoothing, preserves most edges
blur2.shape: (H, W); Gaussian σ=3.0: moderate blur, softens fine texture
magnitude: float64 array, non-negative; for 8-bit input |Gx| and |Gy| are each at most 4 × 255 = 1020
edges: uint8 binary (0 or 255), thin 1-pixel-wide edge lines

Key Takeaway

Gaussian blur suppresses noise before edge detection; Sobel measures gradient magnitude and direction; Canny chains smoothing → gradient → non-maximum suppression → hysteresis into the industry-standard edge pipeline.

🏥

Real-World Application

Medical Image Segmentation: Organ Boundary Detection

In CT and MRI analysis, Sobel-based gradient maps provide the external energy field for active contour (snake) algorithms that delineate organ boundaries — the contour is attracted to high-gradient regions marking tissue interfaces. Pre-smoothing with $\sigma \approx 1.5$ suppresses MRI noise without erasing clinically relevant edges. Edge quality directly determines downstream segmentation accuracy and, for applications like radiation therapy planning, errors here translate directly to patient risk.

Checkpoint Test your understanding of Gaussian & Edges

Q1 For σ = 1.0, the normalised centre weight of the 3×3 Gaussian is ≈ 0.2043. Does doubling σ to 2.0 make the centre weight larger or smaller? Explain briefly.

Smaller. A larger σ spreads the distribution over more pixels, reducing the peak height. With σ = 2.0 the 3×3 kernel barely captures the bell shape — the weights become nearly uniform (approaching a mean filter), and the effective centre weight drops significantly.

Q2 In the Canny pipeline, what does the non-maximum suppression step do and why is it needed?

NMS thins wide gradient ridges to 1-pixel-wide lines by suppressing any pixel whose gradient magnitude is not a local maximum along the gradient direction. Without NMS, Canny would output thick blobs around edges (similar to thresholded Sobel). NMS is what gives Canny its characteristic thin, well-localised edge appearance.

Q3 You call cv2.Canny(img, 200, 400) and get very few edges. How should you adjust the thresholds, and why?

Lower both thresholds (e.g., cv2.Canny(img, 50, 100)). With T_high = 400, only very strong gradients qualify as "strong" edges. Lowering T_high admits more pixels as strong seeds; lowering T_low allows the hysteresis step to extend those seeds along weaker but connected edge segments. Keep T_high at two to three times T_low, the range Canny recommended.

Application

Industrial Surface Inspection & Document Binarization

Non-linear filters preserve edges that linear blur destroys. Combining median filtering, bilateral smoothing, and adaptive thresholding produces robust defect detectors and document scanners — pipelines used daily in manufacturing and fintech.

After this section you will be able to

Explain why the median filter removes salt-and-pepper noise without blurring edges, using a concrete 5-pixel numerical example.
Compare Gaussian blur, median filter, and bilateral filter for an edge-preserving denoising task — choosing the right tool for each noise type.
Build a two-stage defect detection pipeline (denoise → threshold) using OpenCV.

"A high-speed camera photographs PCB solder joints at 120 frames per second. Every 200th frame, a cosmic-ray hit corrupts a random pixel to 255 (salt noise). A Gaussian blur would smear that bright spot across nearby pixels, potentially masking a genuine solder defect. A median filter removes the corrupted pixel entirely — and never touches the edges. That one algorithmic choice is the difference between a false alarm and a detected fault."

🏭

Why this matters: Industrial cameras inherently generate impulse noise (bad pixels), while document scanners face spatially varying illumination that makes global thresholding fail. Knowing when to use median (impulse noise), bilateral (texture + edge), or adaptive thresholding (uneven lighting) is a day-one decision in any production vision pipeline: choosing the wrong filter at this stage cascades into downstream detection and segmentation errors.

🔗

Think of it this way

The median filter is like a voting committee: each pixel's value is put to a neighbourhood vote, and the majority view wins. One corrupted pixel (salt or pepper) is always in the minority — it gets outvoted and removed. The mean filter, by contrast, lets every voice speak equally, so the noisy outlier skews the result for everyone nearby.

Industrial surface inspection pipeline: impulse noise removed first (median), then edge/gradient detection, then Otsu thresholding segments defect regions.

Problem

Median Filter — Outlier-Resistant Denoising

The median filter replaces each pixel with the median (middle value when sorted) of its neighbourhood. Because the median ignores extreme values, it is inherently outlier-resistant:

Removes salt-and-pepper noise: A single corrupted pixel (0 or 255) cannot influence the median as long as the majority of the neighbourhood is uncorrupted.
Preserves edges: For an odd-sized window the output is always one of the input values, never an average, so a clean step edge is not smeared into intermediate grey levels.
Cost: $O(k^2 \log k^2)$ per pixel (sort-based); not separable.
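
Both properties can be seen on a small synthetic image. The sketch below assumes NumPy 1.20 or newer and replicate padding at the border; `median_filter_3x3` is an illustrative helper, and the test image (a step edge with one salt pixel) is made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_3x3(img):
    """3×3 median filter with replicate (edge) padding."""
    padded = np.pad(img, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    return np.median(windows, axis=(-2, -1))

def mean_filter_3x3(img):
    """3×3 mean filter with the same padding, for comparison."""
    padded = np.pad(img, 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    return windows.mean(axis=(-2, -1))

# Clean step edge: left half 50, right half 200.
clean = np.full((5, 6), 50.0)
clean[:, 3:] = 200.0

# Corrupt one pixel with salt noise.
noisy = clean.copy()
noisy[2, 1] = 255.0

median_out = median_filter_3x3(noisy)
mean_out = mean_filter_3x3(noisy)

print(np.array_equal(median_out, clean))  # True: salt removed, edge intact
print(round(mean_out[2, 2], 1))           # 122.8: edge and salt both smeared
```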

📝 Worked Example — Median vs. mean on impulse noise

Problem: A 1D window of 5 pixels contains: [120, 255, 118, 0, 122]. Pixel 255 is salt noise and pixel 0 is pepper noise. Compute mean and median filter outputs.

Mean filter output.
$(120+255+118+0+122)/5 = 615/5 = 123$ — pulled away from the true background value by both outliers.

Sort the window for median.
Sorted: $[0, 118, 120, 122, 255]$. Median = index 2 = 120.

Compare.
True background ≈ 120. Mean output = 123 (corrupted). Median output = 120 (exact).

The median perfectly recovers the true pixel value. The noise pixels (0 and 255) are at positions 0 and 4 in the sorted array — both outside the median regardless of their values.

✔ Quick Check

If the same window had 3 salt pixels instead of 1 (i.e., [120, 255, 255, 255, 122]), would the median still recover the true value?

No. Sorted: [120, 122, 255, 255, 255]. Median = 255. When more than half the window is corrupted, the median is dominated by the noise. This is why a larger kernel makes the filter more robust — but also slower and potentially blurring at narrow features.

Bilateral Filter & Adaptive Thresholding

The bilateral filter weights neighbours by both spatial distance and intensity similarity:

$$BF[I]_p = \frac{1}{W_p}\sum_{q \in \mathcal{N}} G_{\sigma_s}(\|p{-}q\|)\cdot G_{\sigma_r}(|I_p - I_q|)\cdot I_q$$

$\sigma_s$ (spatial) controls how far pixels contribute; $\sigma_r$ (range) controls the intensity threshold below which pixels are treated as "same surface." Pixels across an edge have large $|I_p - I_q|$ → near-zero range weight → edge preserved.

Otsu thresholding automatically finds the optimal global threshold $T^*$ by maximising the inter-class variance between foreground and background:

$$T^* = \arg\max_T\; \omega_0(T)\,\omega_1(T)\,[\mu_0(T) - \mu_1(T)]^2$$
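
The criterion can be evaluated exhaustively over all 256 candidate thresholds. This is a minimal NumPy sketch, assuming an 8-bit image and the convention that class 0 contains pixels with value at most $T$; `otsu_threshold` is an illustrative name and the bimodal test data is synthetic.

```python
import numpy as np

def otsu_threshold(img):
    """Return T maximising omega0 * omega1 * (mu0 - mu1)^2 for a uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)

    omega0 = np.cumsum(prob)              # weight of class 0 (pixels <= T)
    omega1 = 1.0 - omega0                 # weight of class 1 (pixels > T)
    cum_mean = np.cumsum(prob * levels)
    total_mean = cum_mean[-1]

    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = cum_mean / omega0
        mu1 = (total_mean - cum_mean) / omega1
        between = omega0 * omega1 * (mu0 - mu1) ** 2
    between = np.nan_to_num(between)      # an empty class scores 0
    return int(np.argmax(between))

# Synthetic bimodal data: dark cluster near 60, bright cluster near 190.
rng = np.random.default_rng(0)
dark = rng.normal(60, 10, size=2000)
bright = rng.normal(190, 10, size=2000)
pixels = np.clip(np.concatenate([dark, bright]), 0, 255).astype(np.uint8)

T = otsu_threshold(pixels)
print(T)  # a value in the gap between the two clusters
```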

For documents with uneven lighting, adaptive thresholding computes a local $T$ for each pixel's neighbourhood, making it robust to shadows and gradients that defeat Otsu.
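
A minimal sketch of the idea, assuming a local-mean threshold (the analogue of cv2.ADAPTIVE_THRESH_MEAN_C rather than the Gaussian-weighted variant) and NumPy 1.20 or newer. The helper name, the block size, and the synthetic "page" with a left-to-right illumination ramp are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def adaptive_mean_threshold(img, block_size=11, C=4.0):
    """Pixel is white if it exceeds its local mean minus C."""
    half = block_size // 2
    padded = np.pad(img, half, mode="edge")
    windows = sliding_window_view(padded, (block_size, block_size))
    local_mean = windows.mean(axis=(-2, -1))
    return np.where(img > local_mean - C, 255, 0).astype(np.uint8)

# Background brightens from 80 (shadow) to 198 (lit) across the page.
cols = np.arange(60, dtype=np.float64)
page = np.tile(80.0 + 2.0 * cols, (40, 1))

# Three dark vertical strokes, each 60 grey levels below the local background.
page[:, [10, 30, 50]] -= 60.0

adaptive = adaptive_mean_threshold(page)
global_binary = np.where(page > 140, 255, 0).astype(np.uint8)

# Stroke in the shadow: correctly black with the local threshold.
print(adaptive[0, 10])       # 0
# Plain background in the shadow: white locally, wrongly black globally.
print(adaptive[0, 20], global_binary[0, 20])  # 255 0
```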

📝 Worked Example — Why bilateral preserves edges

Problem: Two adjacent pixels: $p$ (intensity 20) and $q$ (intensity 200), distance 1 pixel apart. $\sigma_s = 2$, $\sigma_r = 25$. Compute spatial and range weights.

Spatial weight.
$G_{\sigma_s}(1) = \exp(-1^2/(2\cdot4)) = \exp(-0.125) \approx 0.882$ — high; they are adjacent.

Range weight.
$G_{\sigma_r}(180) = \exp(-180^2/(2\cdot625)) = \exp(-25.9) \approx 5\times10^{-12}$ — effectively zero.

Combined weight.
$0.882 \times 5\times10^{-12} \approx 0$. Even though $q$ is adjacent, its cross-edge intensity difference makes its contribution negligible.

Bilateral: same-surface pixels smooth together; cross-edge pixels are silenced → edge preserved, noise removed within uniform regions.
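
The two weights from the worked example can be computed directly. This sketch uses the unnormalised Gaussian $\exp(-d^2/(2\sigma^2))$, as the example does; `gaussian_weight` is an illustrative helper.

```python
import numpy as np

def gaussian_weight(distance, sigma):
    """Unnormalised Gaussian weight exp(-d^2 / (2 sigma^2))."""
    return np.exp(-distance**2 / (2 * sigma**2))

sigma_s, sigma_r = 2.0, 25.0
intensity_p, intensity_q = 20.0, 200.0

spatial = gaussian_weight(1.0, sigma_s)                              # adjacent pixels
range_w = gaussian_weight(abs(intensity_p - intensity_q), sigma_r)   # across the edge
combined = spatial * range_w

print(f"spatial  = {spatial:.3f}")    # 0.882
print(f"range    = {range_w:.1e}")
print(f"combined = {combined:.1e}")
```

The spatial term alone would let $q$ contribute strongly; it is the range term that removes it from the average.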

✔ Quick Check

If $\sigma_r$ is increased from 25 to 200, what happens to the range weight for the same pair (intensity difference = 180)?

$G_{200}(180) = \exp(-180^2/(2 \times 40000)) = \exp(-0.405) \approx 0.667$. Now the cross-edge pixel contributes 0.667 × 0.882 ≈ 0.59 weight — the bilateral filter starts blurring across the edge. Large $\sigma_r$ approximates Gaussian blur (no range awareness).

⚠️

Common Mistake

Myth: "Gaussian blur and bilateral filter both smooth, so I can use them interchangeably to reduce noise."

Reality: Gaussian blur is a linear filter — it blurs everything uniformly, including edges. Bilateral is non-linear — it only smooths within uniform regions and leaves edges untouched. For industrial inspection (where you need to detect surface defects adjacent to clean regions) or portrait photography (smooth skin, sharp eyes), always prefer bilateral. Gaussian is faster and preferred only when edge preservation is not required.

Solution

Pause & Predict

The noise slider adds random salt-and-pepper pixels. Before toggling filters: at 20% noise density, which filter do you expect to fully recover the original image — mean, Gaussian, or median?

Hint: think about what "20% corrupted" means for the median of a 3×3 neighbourhood (9 pixels, 1.8 corrupted on average).

Try It: Noise & Filter Comparison

Adjust noise level — compare how mean, median, and bilateral filters respond to salt-and-pepper noise on a synthetic PCB-like test image.

Implementation

Python · OpenCV — Median, bilateral, and adaptive thresholding

import cv2
import numpy as np

img = cv2.imread('pcb.jpg', cv2.IMREAD_GRAYSCALE)

# ── Stage 1: remove salt-and-pepper noise ──
denoised = cv2.medianBlur(img, 5)  # ksize must be odd

# ── Stage 2: bilateral for edge-preserving smooth ──
smooth = cv2.bilateralFilter(
    denoised, d=9,
    sigmaColor=50,  # σ_r — intensity range
    sigmaSpace=9   # σ_s — spatial
)

# ── Stage 3a: global Otsu threshold ──
_, otsu = cv2.threshold(
    smooth, 0, 255,
    cv2.THRESH_BINARY + cv2.THRESH_OTSU
)

# ── Stage 3b: adaptive (for uneven illumination) ──
adaptive = cv2.adaptiveThreshold(
    smooth, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY, blockSize=21, C=4
)

stdout / description

denoised: salt-and-pepper removed, edges intact (compare to Gaussian which blurs edges) smooth: bilateral output — texture smoothed within regions, sharp edges preserved otsu: binary image with single global threshold T* chosen automatically adaptive: binary image with per-pixel local threshold — robust to shadows/gradients

Key Takeaway

Use median for impulse noise, bilateral for edge-preserving smoothing, and adaptive thresholding when illumination is uneven — combining them in the right order defines a production-ready industrial vision pipeline.

🏭

Real-World Application

PCB Defect Detection & Document Digitization at Scale

At Foxconn and similar electronics manufacturers, high-speed vision systems capture 10,000+ PCB images per hour. The pipeline (median → bilateral → Canny → contour analysis) runs in under 5 ms per frame on a GPU, flagging solder bridges, missing components, and pad oxidation in real time. The same filter stack, with adaptive thresholding in place of global Otsu, powers bank cheque OCR, passport scanning, and form digitization at fintech companies, where uneven LED lighting above the scanning bed is corrected algorithmically rather than with expensive hardware.

Checkpoint Test your understanding of Non-Linear Filters & Application

Q1 A 3×3 window contains: [50, 52, 48, 51, 255, 49, 50, 48, 53]. Apply a median filter. What is the output?

Sorted: [48, 48, 49, 50, 50, 51, 52, 53, 255]. Median = the fifth of the nine sorted values (zero-based index 4) = 50. The single salt pixel (255) is removed; the output is indistinguishable from the true background value.

Q2 You are processing a document photo taken under non-uniform lighting. Half the page is in shadow (pixel values ~80) and half is bright (values ~200). Why will cv2.threshold(..., cv2.THRESH_OTSU) fail here?

Otsu finds a single global threshold T* that best separates two pixel classes across the entire image. With non-uniform illumination, the text in the shadow region may have similar values to the background in the bright region — a single threshold cannot correctly classify both areas simultaneously. Adaptive thresholding solves this by computing a local T for each pixel's neighbourhood.

Q3 In the industrial inspection pipeline, why is the median filter applied before Canny edge detection, not after?

A salt or pepper pixel differs from its neighbours by up to 255 grey levels, the largest step an 8-bit image allows, so its gradient response is among the strongest in the image. If Canny runs first, it will detect these noise spikes as very strong "edges" and include them in the edge map, creating false positives. Running median first removes the corrupted pixels so Canny only responds to genuine structural edges in the image.

Practice

8 Exercises

Two exercises per topic (theory + code) plus two synthesis challenges that combine all three topics into a single pipeline.

Theory · Convolution as a Spatial Operation Easy

Convolution vs Cross-Correlation

Explain the mathematical difference between 2D convolution and 2D cross-correlation. Under what condition are they equivalent? Why does this matter when using cv2.filter2D or PyTorch's nn.Conv2d?

Definition difference: Convolution flips the kernel 180° before sliding ($I * K$ has minus signs in the offset indices); cross-correlation does not flip. Equivalent when: $K$ is symmetric, i.e., $K = \text{rot}_{180}(K)$. This is true for Gaussian, box, and Laplacian kernels. For cv2.filter2D: implements cross-correlation. For symmetric kernels the result is identical to convolution. For Sobel (asymmetric), the gradient sign is reversed but the magnitude map is the same.

Code · Convolution as a Spatial Operation Easy

Box Filter from Scratch

Implement a 3×3 mean filter manually (without cv2.filter2D). Apply it to a grayscale image at pixel position (100, 100). Then verify your result matches cv2.filter2D with BORDER_REFLECT.

Extract the 3×3 neighbourhood patch around (100, 100).
Compute the weighted sum using a uniform kernel (all weights = 1/9).
Compare with the OpenCV result and print the absolute difference.

Key steps: K = np.ones((3,3), dtype=np.float32) / 9. Convert the image to float32 first; with a uint8 input, cv2.filter2D(img, -1, ...) returns a rounded uint8 result. Extract patch = img[r-1:r+2, c-1:c+2]. Manual output: np.sum(patch * K). OpenCV: cv2.filter2D(img, -1, K, borderType=cv2.BORDER_REFLECT). The absolute difference should be 0 or a tiny floating-point rounding error.

Theory · Gaussian Blur & Edge Detection Medium

Gaussian Kernel Weights for σ = 1.5

For a Gaussian kernel with σ = 1.5, compute $G(0,0,1.5)$, $G(1,0,1.5)$, and $G(1,1,1.5)$ (unnormalised). What percentage of the total 3×3 kernel weight is concentrated in the centre pixel? How does this compare to σ = 1.0 (centre ≈ 20.4%)?

Use $G(x,y,\sigma) = \frac{1}{2\pi\sigma^2}\exp(-(x^2+y^2)/(2\sigma^2))$. For σ=1.5: $G(0,0) = \frac{1}{2\pi(2.25)} \approx 0.0707$. $G(1,0) = 0.0707 \times e^{-1/4.5} \approx 0.0707 \times 0.8007 \approx 0.0566$. $G(1,1) = 0.0707 \times e^{-2/4.5} \approx 0.0707 \times 0.6412 \approx 0.0453$. Raw sum = $0.0707 + 4(0.0566) + 4(0.0453) = 0.0707 + 0.2264 + 0.1812 = 0.4783$. Normalised centre = $0.0707/0.4783 \approx 14.8\%$ — smaller than σ=1.0's 20.4%, confirming a wider, flatter kernel at larger σ.

Code · Gaussian Blur & Edge Detection Medium

Canny Parameter Tuning

Load a grayscale image. Apply Gaussian blur with σ ∈ {0.5, 1, 2} before Canny (T_low=50, T_high=150). Count the number of white (edge) pixels in each output. What σ produces the most edges? The fewest? Why?

Apply cv2.GaussianBlur with each σ, then run cv2.Canny on the result.
Count edge pixels with np.count_nonzero(edges).
Plot or print a comparison table.

σ=0.5: minimal smoothing → most noise survives → most white pixels. σ=2: heavy smoothing → strong noise suppression → fewest edges (only the most prominent boundaries survive). The Canny thresholds (50/150) stay fixed; smoothing lowers gradient magnitudes, noise gradients most of all, so fewer pixels cross T_high or link to a strong edge through T_low. A σ of about 1.0–1.5 is a common starting point for natural images, but the best value depends on the noise level and image resolution.

Theory · Industrial Vision Application Medium

Why Median Preserves Edges

Consider a vertical step edge: every pixel to the left of the boundary is ≈ 50 and every pixel to the right is ≈ 200. Take a pixel in the last column of the left region, directly beside the boundary. Its 3×3 neighbourhood contains 6 pixels ≈ 50 (two left-side columns) and 3 pixels ≈ 200 (the first right-side column). Compute the mean and median outputs. Explain why the median preserves the edge but the mean blurs it.

Mean: $(6 \times 50 + 3 \times 200)/9 = (300+600)/9 = 100$. The edge is pulled toward the midpoint.

Median: Sorted: [50,50,50,50,50,50,200,200,200]. The median is the 5th of the 9 sorted values (0-based index 4) = 50. As long as more than half the neighbourhood is on the same side of the edge, the median stays within that region. The mean has no such protection: any minority of cross-edge pixels shifts it.
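The same numbers fall out of a small NumPy check. The sketch first evaluates the single 3×3 neighbourhood from the exercise, then slides a 3-wide window along one row of an idealised step edge (exact values 50 and 200, an assumption made for clarity) to show the resulting profiles.

```python
import numpy as np

# 3x3 neighbourhood beside the edge: two columns at 50, one column at 200.
patch = np.array([[50, 50, 200]] * 3, dtype=np.float64)
patch_mean = float(patch.mean())
patch_median = float(np.median(patch))

# One image row crossing the step, filtered with a 3-wide window (no padding).
row = np.array([50, 50, 50, 50, 200, 200, 200, 200], dtype=np.float64)
windows = np.array([row[i:i + 3] for i in range(len(row) - 2)])
mean_profile = windows.mean(axis=1)
median_profile = np.median(windows, axis=1)

print("patch mean:", patch_mean, " patch median:", patch_median)
print("mean profile:  ", mean_profile)
print("median profile:", median_profile)
```

The mean profile contains intermediate values (a ramp), while the median profile jumps directly from 50 to 200.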

Code · Industrial Vision Application Medium

Bilateral Filter σ Grid

Apply cv2.bilateralFilter with σ_s ∈ {5, 20} and σ_r ∈ {15, 80} — a 2×2 grid of 4 combinations. Display the results side by side. Identify which combination gives the best edge-preserving denoising and explain the role of each parameter.

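Best for edge-preserving denoising: large σ_s + small σ_r (in cv2.bilateralFilter, σ_s is sigmaSpace and σ_r is sigmaColor). Large σ_s spreads smoothing over a wide spatial area; small σ_r keeps the range weight tight, so only pixels with similar intensity contribute meaningfully. This smooths widely within uniform regions while giving cross-edge pixels near-zero weight. Large σ_s + large σ_r ≈ strong Gaussian blur (blurs everything). Small σ_s + large σ_r ≈ a mild, local Gaussian blur. Small σ_s + small σ_r = minimal, local, edge-aware smoothing. One caveat: σ_r must still be at least comparable to the noise amplitude, otherwise the noise itself is treated as an edge and preserved.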

Synthesis · Theory: Filter Selection Challenge Hard

Choosing the Right Filter for Each Noise Type

For each scenario below, identify the optimal filter and justify your choice mathematically:

A security camera captures a night scene with Gaussian noise (σ = 20 intensity units). Goal: blur noise before Canny edge detection.
A PCB scanner has a faulty sensor that randomly sets 5% of pixels to 0 or 255. Goal: recover the original surface texture.
A portrait photo needs skin smoothing without blurring the eyes, lips, or hair outline.

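1. Gaussian blur (σ ≈ 1–2): Gaussian noise is zero-mean and i.i.d., so a normalised weighted average with weights $w_i$ scales the noise variance by $\sum_i w_i^2 < 1$ while leaving the mean signal level unchanged. Gaussian blur is a smooth, isotropic low-pass filter of this kind and is the standard pre-smoothing step for Canny. It is not provably optimal in general (the optimal linear Wiener filter depends on the signal and noise spectra), and it is not edge-preserving, but slight edge diffusion is acceptable before Canny.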
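2. Median filter (k=5): Salt-and-pepper noise consists of isolated extreme outliers; the median is inherently outlier-resistant and preserves edges. At 5% density a 5×5 window (25 pixels) contains about 1.25 corrupted pixels on average (about 0.45 in a 3×3 window), far fewer than the 13 needed to turn the median itself into an outlier, so the output stays within the range of the uncorrupted values.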
3. Bilateral filter (large σ_s, small σ_r): skin has gradually varying texture (within σ_r range); eyes/hair have sharp intensity boundaries (outside σ_r range). Bilateral smooths the former while preserving the latter.
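For scenario 2, the robustness of the median can be made concrete with a binomial calculation. The sketch below assumes each pixel is corrupted independently with the stated density, and it computes the probability that more than half of a k×k window is corrupted. This is an upper bound on the chance that the median output is itself an outlier, since salt and pepper pixels fall on opposite sides of the sorted list. The function name p_majority_corrupted is hypothetical.

```python
from math import comb

def p_majority_corrupted(k, density):
    """P(more than half of the k*k window pixels are corrupted), binomial model."""
    n = k * k
    need = n // 2 + 1  # smallest count that forms a majority
    return sum(comb(n, j) * density ** j * (1 - density) ** (n - j)
               for j in range(need, n + 1))

density = 0.05
for k in (3, 5):
    n = k * k
    print(f"k={k}: expected corrupted = {n * density:.2f} of {n}, "
          f"P(majority corrupted) <= {p_majority_corrupted(k, density):.2e}")
```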

Synthesis · Code: Industrial Inspection Pipeline Hard

Build a Two-Stage Defect Detection Pipeline

Build a complete defect detection pipeline on a synthetic test image (gradient background + inserted bright spot to simulate a defect):

Create a 256×256 test image with a smooth gradient, insert a bright spot of known area as the defect, and add 10% salt-and-pepper noise.
Apply medianBlur(k=5) to remove the impulse noise.
Run cv2.Canny on the denoised image and compare edge counts before/after denoising.
Apply Otsu thresholding to segment the defect region. Count the defect area in pixels.

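Setup: img = np.tile(np.linspace(50,200,256).astype(np.uint8), (256,1)). Insert the defect as a bright region of known area (for example a filled disc at intensity 255). Add noise: randomly set 10% of pixels to 0 or 255. After medianBlur, the Canny edge count should drop sharply (roughly 80–90% or more), because the noise edges are removed and mainly the defect outline remains. Otsu assumes a roughly bimodal histogram; applied directly to the 50–200 gradient it tends to place the threshold inside the gradient range and split the background. Subtract a background estimate first (for example background = cv2.medianBlur(denoised, 51), residual = cv2.subtract(denoised, background)), then threshold: _, mask = cv2.threshold(residual, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU). Defect area: np.count_nonzero(mask). Compare against the known injected defect size to validate the pipeline.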

Spatial Filtering
& Kernels

Convolution as a Spatial Operation

2D Discrete Convolution

Padding Strategies & Correlation vs Convolution

Try It: Kernel Selector

CNNs Are Just 50+ Layers of Convolution with Learned Kernels

Gaussian Blur & Edge Detection

The 2D Gaussian Kernel

Sobel Gradient & the Canny Pipeline

Try It: Gaussian σ & Edge Response Explorer

Medical Image Segmentation: Organ Boundary Detection

Industrial Surface Inspection & Document Binarization

Median Filter — Outlier-Resistant Denoising

Bilateral Filter & Adaptive Thresholding

Try It: Noise & Filter Comparison

PCB Defect Detection & Document Digitization at Scale

Filter Playground

Key Concepts

2D Convolution

Gaussian Smoothing

Sobel & Canny

Median Filter

Bilateral Filter

Coming up — Week 3

Deepen Your Understanding

Image Kernels Explained Visually — setosa.io

An Interactive Guide to the Fourier Transform

Szeliski — Computer Vision: Algorithms and Applications, Ch. 3

First Principles of Computer Vision — Shree Nayar (Columbia)

8 Exercises

Convolution vs Cross-Correlation

Box Filter from Scratch

Gaussian Kernel Weights for σ = 1.5

Canny Parameter Tuning

Why Median Preserves Edges

Bilateral Filter σ Grid

Choosing the Right Filter for Each Noise Type

Build a Two-Stage Defect Detection Pipeline
