Computer Vision Foundations — Week 1

Course Overview & Image Formation

From Pixels to Semantic Understanding — Build the mathematical intuition that powers every modern computer vision system.

CV-AI Pipeline · Image Formation · Pixel Mathematics · Color Spaces · OpenCV · NumPy

The CV-AI Pipeline

Every production vision system — from a self-driving car to a medical scanner — is built on the same five-stage pipeline. This section maps that pipeline end to end, giving you the mental model that ties every subsequent week together.

After this section you will be able to
  • Describe each of the five stages of the CV-AI pipeline and name the week in this course where each is covered in depth.
  • Trace a single image through all five stages — from raw sensor data to a predicted class label — writing down what transforms at each step.
  • Identify which pipeline stage is responsible for a given failure mode (e.g., blur, wrong label, slow inference).

Every time you unlock your phone with your face, every time a warehouse robot picks a box, every time an AI flags a suspicious X-ray — the same invisible assembly line of code runs underneath. Have you ever wondered what that pipeline actually looks like, step by step?

Why this matters: Understanding the end-to-end CV pipeline is the mental model that ties together everything in this course. Without it, individual techniques — filters, CNNs, detectors — feel like disconnected tricks. With it, you can diagnose any vision system and know exactly where to intervene.
Think of it this way

The CV-AI pipeline is like a restaurant kitchen: raw ingredients arrive (Acquisition), get washed and chopped (Pre-processing), combined into dishes (Feature Extraction), tasted by the head chef (Inference), and finally plated and served (Deployment). Each station has a clear job — and a burnt dish at one station cannot be fixed downstream.

1. Acquisition: Camera / Sensor → RGB pixels
2. Pre-processing: Resize / Normalize (Weeks 1–2)
3. Feature Extraction: CNN / ViT / SIFT (Weeks 3–7, 12)
4. Inference: Classify / Detect / Seg (Weeks 9–13)
5. Deployment: Edge / Cloud / TensorRT (Week 14)

The five-stage CV-AI pipeline — each stage maps to specific weeks in this course.

30 fps: Real-time vision systems must process one frame every 33 ms end to end.
3 bytes / pixel: A single 4K RGB frame = 3840 × 2160 × 3 bytes ≈ 24 MB of raw data.
5 pipeline stages: Acquisition → Pre-process → Extract → Infer → Deploy, the invariant structure of every production CV system.
~10× compression: At typical quality settings JPEG shrinks a 24 MB raw 4K frame by roughly 3–12×, to about 2–8 MB, with little visible quality loss.
Problem

What is Computer Vision?

Computer Vision sits at the intersection of signal processing, linear algebra, and deep learning. The core challenge: a camera collapses a 3D world into a 2D array of numbers. Our task is to reverse this process — recovering structure, meaning, and intent from those numbers.

  • Classical CV (Weeks 1–4): Hand-crafted algorithms such as filters, feature detectors, and geometric transformations.
  • Deep CV (Weeks 5–12): Learned representations via CNNs and Vision Transformers.
  • Generative & Temporal CV (Weeks 13–14): Image synthesis, video understanding, and edge deployment.
📝 Worked Example — Tracing a pixel through the pipeline

Background. Follow a single pixel from a dashcam frame all the way to a stop-sign classification.

1
Acquisition. A CMOS sensor at 30 fps captures a 1920×1080 frame stored in BGR order. Pixel at row 540, col 960: value = [42, 18, 215], i.e. B = 42, G = 18, R = 215 (a strong red, as expected on a stop sign).
2
Pre-processing. Resize to 224×224, convert BGR→RGB, normalize: 215/255 ≈ 0.843 for the R channel.
3
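Feature Extraction. A ResNet-50 backbone converts the 224×224×3 tensor into a 7×7×2048 feature map. Each of the 49 spatial cells summarizes a large region of the input, so the pixel no longer stands alone: it is one contributor to features computed over its wide surrounding neighborhood.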
4
Inference. A detection head outputs bounding boxes + class scores. Stop sign: 0.97, car: 0.02, background: 0.01. Threshold 0.5 → stop sign detected.
Prediction: STOP SIGN (confidence = 0.97) at bbox [x1=410, y1=120, x2=510, y2=220]
5
Deployment. On a TensorRT-optimized edge GPU, this full pipeline runs in 12 ms — well under the 33 ms budget for 30 fps.
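The pre-processing and decision steps of this trace can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `bgr_to_rgb`, `normalize`, and `decide` are hypothetical, the resize step is skipped, and the class scores are simply the ones quoted in step 4.

```python
# Trace one dashcam pixel through channel reordering, normalization,
# and the final thresholded decision. Helper names are illustrative only.

def bgr_to_rgb(pixel):
    """Reverse channel order: (B, G, R) -> (R, G, B)."""
    return tuple(reversed(pixel))

def normalize(pixel, max_value=255):
    """Scale 8-bit channel values into [0, 1]."""
    return tuple(c / max_value for c in pixel)

def decide(scores, threshold=0.5):
    """Return the top-scoring label if it clears the threshold, else None."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else None

bgr_pixel = (42, 18, 215)            # value from step 1, in B, G, R order
rgb_pixel = bgr_to_rgb(bgr_pixel)    # (215, 18, 42)
rgb_norm = normalize(rgb_pixel)      # R channel is 215 / 255

scores = {"stop sign": 0.97, "car": 0.02, "background": 0.01}
prediction = decide(scores)

print(rgb_pixel)
print([round(c, 3) for c in rgb_norm])
print(prediction)
```

Keeping each step as its own small function mirrors the pipeline idea: a wrong result can be traced to the single function that produced it.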
Quick Check

If the inference stage runs in 8 ms but pre-processing takes 40 ms, which stage is the bottleneck, and which should the engineer optimize first?

Pre-processing (40 ms) is the bottleneck — it alone exceeds the 33 ms frame budget. Optimize it first (e.g., GPU-accelerated resize, batched normalization) before touching the inference stage.

The Five-Stage Pipeline — Formal Definition

Each stage is a deterministic function that maps one representation to another. The complete pipeline composes these functions:

$$\hat{y} = (f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1)(\text{scene})$$

where $f_1$ = Acquisition, $f_2$ = Pre-processing, $f_3$ = Feature Extraction, $f_4$ = Inference, $f_5$ = Deployment output.

Stage | Input | Output
Acquisition | Photons / scene | Raw pixel array
Pre-processing | Raw pixels | Normalized tensor
Feature Extraction | Normalized tensor | Feature map / embedding
Inference | Feature map | Prediction ($\hat{y}$)
Deployment | Prediction | Action / alert / display
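The composition of five stages can be mimicked with ordinary function composition. The sketch below is only a toy: `compose` and `make_stage` are hypothetical helpers, and each stage merely records its name and the representation it outputs (taken from the table above) rather than doing any real image processing.

```python
from functools import reduce

def compose(*stages):
    """Chain stages in execution order: compose(f1, f2)(x) == f2(f1(x))."""
    return lambda x: reduce(lambda value, stage: stage(value), stages, x)

def make_stage(name, output_repr):
    """Build a toy stage that relabels the data and logs its own name."""
    def stage(data):
        return {"repr": output_repr, "trace": data["trace"] + [name]}
    return stage

f1 = make_stage("Acquisition", "raw pixel array")
f2 = make_stage("Pre-processing", "normalized tensor")
f3 = make_stage("Feature Extraction", "feature map / embedding")
f4 = make_stage("Inference", "prediction")
f5 = make_stage("Deployment", "action / alert / display")

pipeline = compose(f1, f2, f3, f4, f5)
result = pipeline({"repr": "photons / scene", "trace": []})

print(result["repr"])
print(" -> ".join(result["trace"]))
```

Note that `compose` lists the stages in execution order, whereas the mathematical notation writes the first-applied function on the right.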
📝 Worked Example — Pipeline latency budget calculation

Problem: A real-time system must run at 25 fps. Stage timings are: Acquisition 5 ms, Pre-processing 12 ms, Feature Extraction 18 ms, Inference 6 ms, Deployment 2 ms. Does the system meet the budget?

1
Frame budget. At 25 fps: $1000 \div 25 = 40$ ms per frame.
2
Total latency. $5 + 12 + 18 + 6 + 2 = 43$ ms.
3
Conclusion. 43 ms > 40 ms budget — the system misses 25 fps. Feature Extraction (18 ms) is the dominant stage.
Budget exceeded by 3 ms. Optimize Feature Extraction first (e.g., quantize the backbone to INT8, reducing it to ~11 ms → total 36 ms, within budget).
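The same budget check is easy to script. The snippet below is a small sketch using the stage timings from this example; the function name `frame_budget_ms` and the dictionary layout are illustrative choices, not part of any library.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000 / fps

stage_ms = {
    "Acquisition": 5,
    "Pre-processing": 12,
    "Feature Extraction": 18,
    "Inference": 6,
    "Deployment": 2,
}

budget = frame_budget_ms(25)
total = sum(stage_ms.values())
overrun = total - budget
slowest = max(stage_ms, key=stage_ms.get)

print(f"budget = {budget:.0f} ms, total = {total} ms, overrun = {overrun:.0f} ms")
print(f"optimize first: {slowest}")
```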
Quick Check

If Acquisition is parallelized with Pre-processing (they overlap in time), what is the new total latency?

Acquisition (5 ms) runs during Pre-processing (12 ms) — the bottleneck of the pair is 12 ms. New total: 12 + 18 + 6 + 2 = 38 ms. This is within the 40 ms budget.
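One way to model overlapping stages is to group the stages that run concurrently and charge each group only its slowest member. This is a simplified sketch that assumes perfect overlap with no synchronization overhead; `latency_ms` is a hypothetical helper.

```python
def latency_ms(groups):
    """Each inner list holds stages that run concurrently;
    the groups themselves run one after another."""
    return sum(max(group) for group in groups)

sequential = latency_ms([[5], [12], [18], [6], [2]])
overlapped = latency_ms([[5, 12], [18], [6], [2]])   # Acquisition overlaps Pre-processing

print(sequential, overlapped)
```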
Common Mistake

Treating the pipeline as a black box. Students often think "just use a pre-trained model" is a complete solution. But a model with perfect accuracy on benchmark data can fail in production if the Acquisition stage uses a different camera white balance, or if Pre-processing normalizes with the wrong mean/std. Every stage must match between training and deployment.
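A tiny numeric sketch makes the pre-processing mismatch concrete. It assumes, purely for illustration, that training standardized the R channel with the commonly quoted ImageNet statistics (mean 0.485, std 0.229) while the deployed code forgot that step; both function names are hypothetical.

```python
IMAGENET_MEAN_R = 0.485   # commonly quoted ImageNet R-channel mean (assumed training setup)
IMAGENET_STD_R = 0.229    # commonly quoted ImageNet R-channel std

def preprocess_training(value_8bit):
    """Scale to [0, 1], then standardize with mean/std."""
    return (value_8bit / 255 - IMAGENET_MEAN_R) / IMAGENET_STD_R

def preprocess_deployed(value_8bit):
    """Assumed bug for illustration: standardization was left out."""
    return value_8bit / 255

r = 215
train_input = preprocess_training(r)
deploy_input = preprocess_deployed(r)

print(round(train_input, 3), round(deploy_input, 3))
print("gap:", round(abs(train_input - deploy_input), 3))
```

The same pixel reaches the model as two very different numbers, so the model sees inputs from a distribution it was never trained on even though the camera and weights are unchanged.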

Solution
Pause & Predict

In the widget below, each button transforms a synthetic 8×8 pixel grid through one pipeline stage. Before clicking, predict: what will change visually when you apply "Grayscale"? What about "Edge Detect"?

Hint: think about what information is preserved vs. discarded at each stage.

Try It: CV Pipeline Stage Explorer

Click each pipeline stage to see the transformation applied to a synthetic image. Observe how information changes at each step.

Implementation
Python · OpenCV — Your First Image Pipeline
```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Stage 1: Acquisition. Load image (OpenCV reads as BGR)
img_bgr = cv2.imread('photo.jpg')
if img_bgr is None:  # imread returns None when the file cannot be read
    raise FileNotFoundError('photo.jpg could not be read')
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
print(img_rgb.shape)  # (H, W, 3)
print(img_rgb.dtype)  # uint8, values 0–255

# Stage 2: Pre-processing. Resize and normalize
img_small = cv2.resize(img_rgb, (224, 224))
img_norm = img_small.astype(np.float32) / 255.0  # [0.0, 1.0]

# Stage 3: Feature Extraction (placeholder: grayscale edges)
gray = cv2.cvtColor(img_small, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 50, 150)

# Stage 4: Inference (placeholder: pixel statistics)
print(f'Mean: {img_norm.mean():.3f} Std: {img_norm.std():.3f}')

# Stage 5: Visualize output
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(img_rgb)
axes[0].set_title('Raw')
axes[1].imshow(img_norm)
axes[1].set_title('Normalized')
axes[2].imshow(edges, cmap='gray')
axes[2].set_title('Edges')
plt.tight_layout()
plt.show()
```
Output
```text
(427, 640, 3)
uint8
Mean: 0.499 Std: 0.287
[Three matplotlib panels: original photo | [0,1]-normalized (visually identical) | white edges on black background]
```
Key Takeaway

Every computer vision system is a composition of five deterministic stages — understanding which stage owns which transformation is the diagnostic skill that separates engineers from users.

Real-World Application

Autonomous Vehicle Perception Stack

Systems like Waymo and Tesla Autopilot execute the full CV-AI pipeline at 30–60 fps in real time. Camera frames pass through HDR normalization and lens undistortion (Pre-processing), a multi-scale CNN backbone shared across task heads (Feature Extraction), and parallel inference for lane detection, object detection, and depth estimation. All of this must fit the frame budget (33 ms at 30 fps, about 17 ms at 60 fps) on dedicated NPUs.

Checkpoint CV-AI Pipeline — Retrieval Practice

Q1 A self-driving car camera captures at 30 fps. The full pipeline runs in 45 ms per frame. What is the maximum achievable frame rate, and which stage must be optimized to reach 30 fps?

Max frame rate = 1000 ms ÷ 45 ms ≈ 22.2 fps. To reach 30 fps (33 ms budget), the pipeline must save at least 12 ms. Identify the slowest stage (often Feature Extraction or Pre-processing) and target it first.

Q2 A model achieves 95% accuracy on the benchmark dataset but only 62% in the field. List two pipeline stages where the training–deployment mismatch could originate.

Acquisition: different camera sensor, resolution, or lighting at deployment vs. training. Pre-processing: normalization mean/std or resize algorithm differs between training pipeline and deployment code. Both cause distribution shift before the model ever sees the data.

Q3 Express the pipeline as a mathematical composition. If $f_3$ (Feature Extraction) is replaced with a faster but weaker function $\tilde{f}_3$, which output changes and which remains the same?

$\hat{y} = (f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1)(\text{scene})$. Replacing $f_3$ with $\tilde{f}_3$ changes $\hat{y}$ (prediction quality) and the latency of stage 3, but the outputs of $f_1$ and $f_2$ are identical since they are upstream of the substitution.

Images as 2D Discrete Signals: Pixels, Color Models & Spaces

A digital image is a 2D discrete signal — a function $f(x, y)$ mapping spatial coordinates to intensity values. Before building any vision system you must understand what these numbers mean physically, how they are stored, and how color is encoded across different color spaces.

After this section you will be able to
  • Compute the raw file size of any image given its width, height, and bit depth — and verify it with NumPy.
  • Apply the ITU-R BT.601 luminance formula to convert any RGB pixel to its grayscale value by hand and in code.
  • Explain why HSV separates hue from brightness, and give a practical example where this matters for a vision algorithm.

Your camera captures a sunset. To you it is a sky of orange and pink. To the computer it is a grid of numbers — nothing more. The question is: which numbers, arranged how, and what do they actually encode?

Why this matters: Every algorithm in this course operates on these numbers. If you misunderstand how a pixel's value is stored — bit depth, channel order, coordinate axes — every downstream calculation will silently be wrong. This is the foundation all other topics build on.
Think of it this way

An image is like a spreadsheet of paint chips: each cell (pixel) holds a number (intensity). RGB is three spreadsheets stacked — one for red, one for green, one for blue. HSV reorders the same information into hue (color wheel angle), saturation (color purity), and value (brightness). The paint is the same; only the labeling system changes.

  • RGB: R, G, B ∈ [0, 255]; additive, device-dependent
  • HSV (via cvtColor): H ∈ [0°, 360°], S, V ∈ [0, 1]; separates luminance
  • LAB (via cvtColor): L* lightness, a* red-green axis, b* blue-yellow axis
  • Grayscale: Y = 0.299R + 0.587G + 0.114B

Color space conversion pipeline: each representation encodes the same visual information differently, optimized for different tasks.

  • $H$ (Height): number of pixel rows in the image matrix, e.g. 1080 (Full HD)
  • $W$ (Width): number of pixel columns in the image matrix, e.g. 1920 (Full HD)
  • $B$ (Bit Depth): number of bits used to store each channel value, e.g. 8 bits → 0–255
  • $C$ (Channels): number of color components (1 = gray, 3 = RGB), e.g. 3 (RGB)
8 bits / channel: Standard consumer cameras store 256 intensity levels per R, G, B channel.
24 bits / pixel (RGB): True color, 2²⁴ ≈ 16.7 million distinct colors per pixel.
0.587 green weight: Human eyes are most sensitive to green; this is the ITU-R BT.601 luminance weight for G.
23.7 MB (4K raw): 3840 × 2160 × 3 bytes ≈ 24.9 million bytes, which is 23.7 MB at 1,048,576 bytes per MB; JPEG compresses this to ~3–8 MB.
Problem

The Image as a Matrix

A grayscale image of width $W$ and height $H$ is an $H \times W$ integer matrix. Each element $I[r, c]$ is a pixel intensity in $\{0, 1, \ldots, 2^B - 1\}$ for bit depth $B$. A color image adds a channel dimension:

$$I \in \mathbb{Z}^{H \times W \times C}, \quad I[r,c,k] \in \{0,\ldots,2^B{-}1\}$$

The raw (uncompressed) byte count is:

$$\text{Bytes} = H \times W \times C \times \tfrac{B}{8}$$
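The indexing convention $I[r, c, k]$ and the byte-count formula can be checked on a toy image. The sketch below uses nested Python lists so it runs without NumPy; the 2 × 3 size and the pixel value are arbitrary choices for illustration.

```python
H, W, C, B = 2, 3, 3, 8        # tiny 2x3 RGB image, 8 bits per channel
max_value = 2 ** B - 1          # largest storable intensity

# I[r][c][k]: row r (vertical), column c (horizontal), channel k (0=R, 1=G, 2=B here)
I = [[[0 for _ in range(C)] for _ in range(W)] for _ in range(H)]
I[1][2] = [180, 120, 60]        # set the pixel at row 1, column 2

in_range = all(0 <= v <= max_value for row in I for px in row for v in px)
raw_bytes = H * W * C * B // 8  # Bytes = H x W x C x B/8

print(len(I), len(I[0]), len(I[0][0]))   # H, W, C
print(I[1][2][0], max_value, in_range, raw_bytes)
```

The row index comes first, so a pixel at image coordinates (x, y) is addressed as `I[y][x]`; mixing up the two is a common source of silent bugs.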
📝 Worked Example — 4K image raw file size

Problem: A 4K camera captures at 3840 × 2160 in 24-bit RGB. What is the raw uncompressed size in MB?

1
Total pixels. $3840 \times 2160 = 8{,}294{,}400$ pixels.
2
Bytes per pixel. 24-bit RGB = 3 channels × 8 bits ÷ 8 = 3 bytes per pixel.
3
Total bytes. $8{,}294{,}400 \times 3 = 24{,}883{,}200$ bytes.
4
Convert to MB. $24{,}883{,}200 \div 1{,}048{,}576 \approx 23.73$ MB.
Raw 4K frame ≈ 23.73 MB. JPEG compresses this to ~3–8 MB (3–8× ratio).
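The four steps collapse into one short function. This is a sketch with hypothetical helper names (`raw_size_bytes`, `to_mb`); it follows the worked example in treating 1 MB as 1,048,576 bytes.

```python
def raw_size_bytes(width, height, channels, bit_depth):
    """Uncompressed size: H x W x C x (B / 8) bytes."""
    return width * height * channels * bit_depth // 8

def to_mb(n_bytes):
    """Convert using 1 MB = 1,048,576 bytes, as in the worked example."""
    return n_bytes / 1_048_576

uhd_bytes = raw_size_bytes(3840, 2160, channels=3, bit_depth=8)
uhd_mb = to_mb(uhd_bytes)

print(f"{uhd_bytes:,} bytes = {uhd_mb:.2f} MB")
```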
Quick Check

A 16-bit HDR image at 1920 × 1080 with 3 channels — what is the raw size in MB?

1920 × 1080 × 3 × (16÷8) = 1920 × 1080 × 6 = 12,441,600 bytes ÷ 1,048,576 ≈ 11.87 MB.

Grayscale Luminance — Perceptual Weighting

Converting RGB to grayscale is not a simple average. Human eyes are more sensitive to green than red or blue. The ITU-R BT.601 standard encodes this perceptual weighting:

$$Y = 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B$$
📝 Worked Example — RGB to grayscale luminance

Problem: Pixel RGB = (180, 120, 60). Compute the grayscale luminance $Y$.

1
Red contribution. $0.299 \times 180 = 53.82$
2
Green contribution. $0.587 \times 120 = 70.44$. The green weight is nearly double the red weight, reflecting the eye's higher sensitivity to green.
3
Blue contribution. $0.114 \times 60 = 6.84$ — smallest weight.
4
Sum and round. $Y = 53.82 + 70.44 + 6.84 = 131.10 \approx 131$.
Grayscale luminance Y = 131 / 255 — a mid-bright gray.
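The hand calculation can be reproduced in a few lines. The function name `luminance_bt601` is an illustrative choice; the plain average is included only to show how much it differs from the perceptual weighting.

```python
def luminance_bt601(r, g, b):
    """Weighted sum Y = 0.299 R + 0.587 G + 0.114 B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

r, g, b = 180, 120, 60
y = luminance_bt601(r, g, b)
naive_average = (r + g + b) / 3

print(f"BT.601 luminance: {y:.2f} -> {round(y)}")
print(f"plain average:    {naive_average:.2f}")
```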
Quick Check

A pure red pixel (255, 0, 0). What is its luminance $Y$? Is this brighter or darker than a pure green pixel (0, 255, 0)?

Pure red: $Y = 0.299 \times 255 = 76.2 \approx 76$. Pure green: $Y = 0.587 \times 255 = 149.7 \approx 150$. Green is brighter — our visual system is most sensitive to green wavelengths.

HSV Color Space — Perceptual Separation

RGB mixes hue and brightness into three correlated channels, making color filtering difficult. HSV separates them:

  • H (Hue): color type on the color wheel, $H \in [0°, 360°]$
  • S (Saturation): color purity — 0 = gray, 1 = fully saturated
  • V (Value): brightness — 0 = black, 1 = fully bright

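Practical advantage: detecting a colored object under changing illumination largely reduces to a range check on $H$ (plus a minimum $S$ to exclude grays), because $V$ absorbs most of the lighting variation. For red specifically, the hue range wraps around 0°/360°, so two intervals are needed.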

📝 Worked Example — RGB to HSV conversion

Problem: Convert RGB = (255, 128, 0) — an orange color — to HSV. Values normalized to [0, 1]: R = 1.0, G = 0.502, B = 0.

1
Find Cmax, Cmin, delta. Cmax = R = 1.0, Cmin = B = 0, delta = 1.0 - 0 = 1.0.
2
Hue. Since Cmax = R: $H = 60° \times \left(\tfrac{G - B}{\Delta} \bmod 6\right) = 60° \times \tfrac{0.502 - 0}{1.0} = 60° \times 0.502 = 30.1°$.
3
Saturation. $S = \Delta / \text{Cmax} = 1.0 / 1.0 = 1.0$ (fully saturated).
4
Value. $V = \text{Cmax} = 1.0$ (fully bright).
HSV ≈ (30°, 1.0, 1.0) — pure, bright orange on the color wheel.
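The conversion steps generalize to all three cases of which channel is the maximum. The sketch below is one straightforward way to write it (`rgb_to_hsv_degrees` is a hypothetical name) and cross-checks the result against Python's standard `colorsys` module, which returns hue as a fraction of a full turn.

```python
import colorsys

def rgb_to_hsv_degrees(r8, g8, b8):
    """Convert 8-bit RGB to (H in degrees, S in [0, 1], V in [0, 1])."""
    r, g, b = r8 / 255, g8 / 255, b8 / 255
    c_max, c_min = max(r, g, b), min(r, g, b)
    delta = c_max - c_min
    if delta == 0:
        hue = 0.0                               # gray: hue is undefined, use 0
    elif c_max == r:
        hue = 60 * (((g - b) / delta) % 6)
    elif c_max == g:
        hue = 60 * ((b - r) / delta + 2)
    else:
        hue = 60 * ((r - g) / delta + 4)
    sat = 0.0 if c_max == 0 else delta / c_max
    return hue, sat, c_max

h, s, v = rgb_to_hsv_degrees(255, 128, 0)
h_ref, s_ref, v_ref = colorsys.rgb_to_hsv(255 / 255, 128 / 255, 0 / 255)

print(round(h, 1), s, v)
print(round(h_ref * 360, 1), s_ref, v_ref)
```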
Quick Check

If the same orange is viewed in dim light — all RGB values halved to (127, 64, 0) — which HSV component changes and which stays the same?

Only V (Value) decreases (from 1.0 to ~0.50). H stays ≈ 30° (same hue) and S stays ≈ 1.0 (same purity). This is why HSV is used for robust color detection under lighting changes.
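This invariance is what makes a hue-based detector practical. The sketch below compares the bright and dim orange with `colorsys` and applies a hypothetical `is_orange` check; the 20°–40° hue range and the 0.4 saturation floor are illustrative thresholds, not tuned values.

```python
import colorsys

def hsv_degrees(r8, g8, b8):
    """8-bit RGB -> (H in degrees, S, V) using the standard library."""
    h, s, v = colorsys.rgb_to_hsv(r8 / 255, g8 / 255, b8 / 255)
    return h * 360, s, v

def is_orange(r8, g8, b8, hue_range=(20, 40), min_sat=0.4):
    """Hypothetical detector: checks hue and saturation, ignores value."""
    h, s, _ = hsv_degrees(r8, g8, b8)
    return hue_range[0] <= h <= hue_range[1] and s >= min_sat

bright = hsv_degrees(255, 128, 0)
dim = hsv_degrees(127, 64, 0)

print([round(x, 2) for x in bright])
print([round(x, 2) for x in dim])
print(is_orange(255, 128, 0), is_orange(127, 64, 0), is_orange(128, 128, 128))
```

Because V never enters the test, the same thresholds accept the color in both lighting conditions while still rejecting a neutral gray.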
Key Insight

OpenCV stores images as BGR, not RGB. When you call cv2.imread(), channel order is [Blue, Green, Red]. Feeding a BGR image to a model trained on RGB data will degrade performance because the model has learned that channel 0 = Red. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) immediately after loading.
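A quick numeric check shows how a channel-order mix-up corrupts even a simple computation. The sketch is plain Python with an illustrative pixel value; `luminance_bt601` is a hypothetical helper applying the BT.601 weights in R, G, B order.

```python
def luminance_bt601(r, g, b):
    """Weighted sum Y = 0.299 R + 0.587 G + 0.114 B."""
    return 0.299 * r + 0.587 * g + 0.114 * b

rgb_pixel = (180, 120, 60)
bgr_pixel = rgb_pixel[::-1]               # the same pixel in B, G, R storage order

correct = luminance_bt601(*rgb_pixel)
misread = luminance_bt601(*bgr_pixel)     # BGR values wrongly treated as RGB
restored = bgr_pixel[::-1]                # reversing the channels undoes the mix-up

print(round(correct, 1), round(misread, 1), restored)
```

No error is raised in the misread case; the result is simply a different number, which is why this bug tends to go unnoticed until accuracy drops.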

Solution
Pause & Predict

The widget below lets you adjust R, G, B sliders and see the pixel color update live. Before moving the sliders, predict: if you set R=255, G=0, B=255, what color appears? What is the approximate grayscale luminance $Y$?

Hint: $Y = 0.299 \times 255 + 0.587 \times 0 + 0.114 \times 255$

Try It: Pixel Color & Luminance Explorer

Adjust the RGB sliders to see the resulting pixel color, grayscale luminance, and HSV representation update in real time. Observe how the luminance formula weights each channel.

Implementation
Python · NumPy + OpenCV — Pixel dissection & color conversion
```python
import numpy as np
import cv2

img = cv2.imread('photo.jpg')
if img is None:  # imread returns None when the file cannot be read
    raise FileNotFoundError('photo.jpg could not be read')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # Always convert!

# ── Matrix properties ─────────────────────────────
print(img.shape)  # (H, W, 3): height × width × channels
print(img.dtype)  # uint8: each channel value ∈ [0, 255]
H, W, C = img.shape
raw_bytes = H * W * C * img.dtype.itemsize
print(f'Raw size: {raw_bytes / 1e6:.2f} MB')  # decimal MB (10^6 bytes)

# ── Single pixel ──────────────────────────────────
pixel = img[100, 200]  # row=100, col=200
print(pixel)           # e.g. [180 120 60]

# ── Grayscale (luminance formula) ─────────────────
R, G, B = img[100, 200]
Y_manual = 0.299 * R + 0.587 * G + 0.114 * B
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
print(round(Y_manual), gray[100, 200])  # both ≈ 131

# ── Color space conversion ─────────────────────────
hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
lab = cv2.cvtColor(img, cv2.COLOR_RGB2LAB)
print('HSV pixel:', hsv[100, 200])  # 8-bit: [H°/2, S×255, V×255]
print('LAB pixel:', lab[100, 200])  # 8-bit: [L*×255/100, a*+128, b*+128]
```
Output
```text
(427, 640, 3)
uint8
Raw size: 0.82 MB
[180 120 60]
131 131                      # manual Y ≈ cv2 grayscale ✓
HSV pixel: [ 15 170 180]     # H=30°, S=0.667, V=0.706 (OpenCV stores H/2)
LAB pixel: [142 146 170]     # approximate; L*≈56, a*≈+18, b*≈+42 after undoing the 8-bit scaling
```
Key Takeaway

An image is a 3D tensor $I \in \mathbb{Z}^{H \times W \times C}$ — choosing the right color space (RGB, HSV, LAB) is an algorithmic decision that directly determines how robust your feature extraction will be to lighting and viewpoint changes.

Real-World Application

Medical Image Preprocessing in Radiology AI

CT scanners produce grayscale images, typically stored as 16-bit values, with intensities in Hounsfield Units (HU); MRI is likewise stored at high bit depth, but its intensities have no standardized unit. Radiology AI pipelines map CT data to a specific window (e.g., bone: [−400, 1000] HU) before feeding it into a CNN, transforming a raw signal of up to 65,536 levels into a [0, 255] range focused on the tissue of interest. This is the clinical equivalent of choosing the right color space for the task.

Checkpoint Pixels & Color Models — Retrieval Practice

Q1 A camera captures a 2000 × 1500 image in 16-bit RGB. What is the uncompressed file size in MB? Show each calculation step.

Bytes = 2000 × 1500 × 3 × 2 = 18,000,000 bytes ÷ 1,048,576 ≈ 17.17 MB. (16 bits = 2 bytes per channel value.)

Q2 A pixel has RGB = (50, 200, 100). Compute its ITU-R BT.601 grayscale luminance $Y$ to one decimal place. Which channel contributes the most?

$Y = 0.299 \times 50 + 0.587 \times 200 + 0.114 \times 100 = 14.95 + 117.40 + 11.40 = \mathbf{143.75}$, i.e. 143.8 to one decimal place (144 as an 8-bit value). Green contributes the most (117.4 out of 143.75).

Q3 You need to detect yellow traffic cones regardless of whether it is sunny or overcast. Should you threshold on RGB or HSV? Explain which HSV components you would use.

Use HSV. Set a narrow range on H (≈ 40°–70° for yellow, which is 20–35 in OpenCV's 8-bit hue units because OpenCV stores H/2), allow S > 0.4 (avoid gray/white), and leave V unconstrained so the detector works in both bright sunlight and overcast shade. RGB would require separate thresholds for every lighting condition.

How Cameras Feed AI: Autonomous Driving & Medical Imaging

The CV-AI pipeline and pixel mathematics are not academic exercises — they power two of the most consequential application domains in modern engineering. This section grounds Week 1 concepts in autonomous driving perception and radiology AI, showing exactly where every formula you learned appears in production systems.

After this section you will be able to
  • Map each of the five pipeline stages to a specific component of an autonomous vehicle perception system and a radiology AI workflow.
  • Explain why autonomous driving uses multi-camera fusion and which pipeline stage integrates the sensor streams.
  • Calculate the total raw data throughput (MB/s) for a given camera configuration, demonstrating why pre-processing must be hardware-accelerated.

A self-driving car has no eyes — only cameras, LiDAR, and radar. A radiologist AI has never seen a patient — only grayscale 2D projections of 3D anatomy. Yet both systems must make life-or-death decisions in milliseconds. How do a few thousand lines of code, built on the pixel mathematics we just learned, achieve this?

Why this matters: Seeing concrete applications at the start of the course transforms abstract mathematics into engineering intuition. When you understand that the HSV color space you computed by hand is what helps a car detect lane markings in rain, the formula stops being a chore and becomes a tool.
Think of it this way

An autonomous vehicle's camera system is like a team of photographers shooting at 60 fps from six different angles simultaneously — the CV pipeline is the photo editor that processes every shot in under 16 ms, highlights the important parts, and tells the driver what to do next. The radiology AI is the expert reviewer who has memorized a million X-rays and can spot an anomaly a human eye would miss.

[Figure: AV perception diagram] Cameras: FRONT (4K @ 30 fps), SIDE-L, SIDE-R, REAR → Multi-Camera Fusion (Pre-processing stage) → CNN Backbone / Feature Maps (Feature Extraction) → Inference Heads (Lane Detection, Object Detection, Depth Estimation) → Drive Decision (Steer / Brake, ≤ 33 ms latency). Total budget: 33 ms @ 30 fps.

Autonomous driving perception: multi-camera acquisition → GPU pre-processing → shared CNN backbone → parallel inference heads → drive decision, all within one 33 ms frame budget.

8 cameras: Typical AV camera ring (front, rear, four pillars, two narrow-forward) for 360° coverage.
~6 GB/s raw: 8 cameras × 4K × 30 fps × 3 bytes ≈ 6 GB/s raw sensor throughput (about 1.5 GB/s at 1080p), which must be compressed or reduced on dedicated hardware immediately.
94.5% AUC (skin lesion): Esteva et al. (2017) showed a CNN matches dermatologist accuracy at classifying skin cancer from images.
16-bit medical images: CT stores 16-bit grayscale Hounsfield Units, and MRI is also typically 16-bit (without HU), giving 256× more intensity levels than an 8-bit consumer image.
Problem

Autonomous Driving — Pipeline in Action

A modern AV perception stack maps directly onto the five-stage CV-AI pipeline:

  • Acquisition: 6–8 cameras, LiDAR, radar operating simultaneously at up to 60 fps.
  • Pre-processing: GPU-accelerated lens undistortion, HDR tone mapping, color normalization, and multi-camera temporal synchronization.
  • Feature Extraction: A shared CNN backbone (e.g., BEVFusion architecture) fuses camera and LiDAR features in a unified Bird's-Eye View representation.
  • Inference: Parallel heads predict lane lines, 3D bounding boxes, driveable area, and depth — all from the same feature map.
  • Deployment: Results drive a trajectory planner in ≤ 33 ms; on dedicated AI chips (NPU), the full pipeline may run in 8–15 ms.

The critical engineering challenge: raw throughput from 8 cameras at 4K/30fps is approximately 6 GB/s (8 × 3840 × 2160 × 3 bytes × 30 per second). This must be compressed and processed in parallel on a GPU, not sequentially on a CPU.

📝 Worked Example — AV raw camera throughput calculation

Problem: An AV has 6 cameras, each capturing 1920 × 1080 RGB at 30 fps. What is the total raw data rate in MB/s?

1
Bytes per frame (one camera). $1920 \times 1080 \times 3 = 6{,}220{,}800$ bytes ≈ 5.93 MB/frame.
2
MB/s per camera. $5.93 \times 30 = 177.98$ MB/s per camera.
3
Total for 6 cameras. $177.98 \times 6 = 1{,}067.9$ MB/s ≈ 1.07 GB/s.
Total raw throughput ≈ 1.07 GB/s — equivalent to 64 GB of data every minute. Hardware-accelerated pre-processing is not optional.
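The throughput calculation is a single product, which makes it easy to explore other configurations. The sketch below uses a hypothetical helper `raw_throughput_mb_s` and, like the worked example, counts 1 MB as 1,048,576 bytes.

```python
MIB = 1_048_576   # bytes per MB, matching the worked example

def raw_throughput_mb_s(n_cameras, width, height, fps, bytes_per_pixel=3):
    """Raw data rate in MB/s for identical cameras streaming uncompressed frames."""
    return n_cameras * width * height * bytes_per_pixel * fps / MIB

hd = raw_throughput_mb_s(6, 1920, 1080, 30)
uhd = raw_throughput_mb_s(6, 3840, 2160, 30)

print(f"6 x 1080p @ 30 fps: {hd:.1f} MB/s")
print(f"6 x 4K    @ 30 fps: {uhd:.1f} MB/s ({uhd / hd:.0f}x)")
```

Throughput is linear in every factor, so doubling both image dimensions multiplies the rate by four.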
Quick Check

If the AV upgrades to 4K cameras (3840 × 2160) at the same 30 fps and 6 cameras, by what factor does the raw throughput increase?

New pixels per frame: 3840 × 2160 = 8,294,400 vs. old 2,073,600. Factor = 8,294,400 ÷ 2,073,600 = 4. Total throughput becomes ~4.28 GB/s.

Medical Imaging — Radiology AI Workflow

Radiology AI applies the same five-stage pipeline to 2D projections of 3D anatomy:

  • Acquisition: A CT, MRI, or X-ray scanner produces a DICOM file, typically a 16-bit grayscale image. For CT the values are Hounsfield Units (HU), where air ≈ −1000 HU, water = 0 HU, and dense bone ≈ +1000 HU or more.
  • Pre-processing: Windowing normalizes HU to [0, 255] for a clinically relevant range (e.g., lung window: [−1000, −200] HU). The same image shows lungs, soft tissue, or bones depending on the chosen window.
  • Feature Extraction: A CNN (e.g., U-Net encoder) extracts anatomical features from the windowed image.
  • Inference: A segmentation head delineates tumors, vessels, or organs at pixel level.
  • Deployment: Results are overlaid on the DICOM viewer as a color-coded mask for the radiologist.
📝 Worked Example — Hounsfield Unit window normalization

Problem: A lung CT uses window center $WC = -600$ HU and window width $WW = 1500$ HU. A pixel has value $H = -400$ HU. Map it to a [0, 255] display value.

1
Window boundaries. Lower = $WC - WW/2 = -600 - 750 = -1350$ HU. Upper = $WC + WW/2 = -600 + 750 = 150$ HU.
2
Linear mapping. The pixel $H = -400$ HU is within the window: $\text{display} = \dfrac{H - \text{lower}}{\text{upper} - \text{lower}} \times 255 = \dfrac{-400 - (-1350)}{150 - (-1350)} \times 255$.
3
Compute. $= \dfrac{950}{1500} \times 255 = 0.633 \times 255 = 161.5 \approx 162$.
Display value = 162 / 255: a medium-bright gray. At −400 HU the pixel is denser than air-filled lung, so it renders brighter than the surrounding lung field.
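The mapping, including the clipping that applies outside the window, fits in one small function. This is a scalar sketch in plain Python; `window_to_display` is a hypothetical name, and rounding half up is an assumption chosen to match the hand calculation.

```python
def window_to_display(hu, center, width, levels=255):
    """Map a Hounsfield value to [0, levels] for a given window center and width."""
    lower = center - width / 2
    upper = center + width / 2
    clipped = min(max(hu, lower), upper)          # values outside the window saturate
    scaled = (clipped - lower) * levels / (upper - lower)
    return int(scaled + 0.5)                      # round half up; scaled is never negative

inside = window_to_display(-400, center=-600, width=1500)
at_lower_edge = window_to_display(-1350, center=-600, width=1500)
above_window = window_to_display(300, center=40, width=400)   # soft-tissue window

print(inside, at_lower_edge, above_window)
```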
Quick Check

Using the same window, what display value does a pixel with $H = -1350$ HU map to? What does this represent visually?

$H = -1350$ is the lower boundary, so display = 0 / 255 = pure black. Anything at or below this value is rendered black. Note that air itself (−1000 HU) lies inside this window and appears dark gray rather than black.
Common Mistake

Assuming higher camera resolution always means better AI performance. Quadrupling the pixel count (e.g., 1080p → 4K) multiplies raw data by 4× and can raise inference latency by roughly 4× for convolutional backbones, and by up to ~16× for layers whose cost grows quadratically with pixel count, which can push the pipeline over the real-time budget. In AV systems, engineers often use a mix: one high-res forward-facing camera for long-range detection and lower-res fisheye cameras for wide-angle coverage. The goal is to match sensor specs to the task, not to maximize resolution uniformly.

Solution
Pause & Predict

The widget below simulates the camera throughput calculation for different AV configurations. Before adjusting the sliders, predict: if you double the number of cameras from 4 to 8 and also double the frame rate from 30 to 60 fps, by what factor does the total data rate increase?

Hint: throughput scales linearly with both number of cameras and frame rate.

Try It: AV Camera Throughput Calculator

Adjust camera count, resolution, and frame rate. The live calculation shows the raw data throughput and whether it exceeds typical GPU pre-processing capacity (~2 GB/s).

Implementation
Python · OpenCV — Multi-frame throughput simulation
```python
import numpy as np

# Simulate AV camera throughput for N cameras
N_CAMS = 6
W, H = 1920, 1080
FPS = 30
MIB = 1024 ** 2  # 1 MB = 1,048,576 bytes, as in the worked example

bytes_per_frame = W * H * 3  # RGB, 1 byte per channel
mb_s_per_cam = bytes_per_frame * FPS / MIB
total_mb_s = mb_s_per_cam * N_CAMS
print(f'Per camera: {mb_s_per_cam:.1f} MB/s')
print(f'Total {N_CAMS} cams: {total_mb_s:.1f} MB/s ({total_mb_s/1000:.2f} GB/s)')

# Simulate Hounsfield Unit windowing (medical imaging)
def apply_window(hu_image, wc, ww):
    lo = wc - ww // 2
    hi = wc + ww // 2
    clipped = np.clip(hu_image, lo, hi)
    # astype truncates toward zero; apply np.rint first if rounding is wanted
    return ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)

# Example: lung window (WC=-600, WW=1500) on random values in [-1000, 1000)
ct_slice = np.random.randint(-1000, 1000, (512, 512), dtype=np.int16)
lung_window = apply_window(ct_slice, wc=-600, ww=1500)
print(f'CT window result: {lung_window.dtype}, range {lung_window.min()}–{lung_window.max()}')
```
Output
```text
Per camera: 178.0 MB/s
Total 6 cams: 1067.9 MB/s (1.07 GB/s)
CT window result: uint8, range 59–255
```
The minimum is 59 rather than 0 because the simulated slice never goes below −1000 HU, which sits above the window's lower bound of −1350 HU.
Key Takeaway

The same CV-AI pipeline and pixel mathematics power both autonomous driving (multi-camera real-time perception) and radiology AI (16-bit HU windowing into semantic segmentation masks) — the engineering principles learned in Week 1 are the literal building blocks of these life-critical systems.

Real-World Application

AI-Assisted Radiology: Chest X-Ray Triage

CheXNet (Rajpurkar et al., Stanford) demonstrated that a DenseNet-121 trained on 112,000 frontal chest X-rays can detect pneumonia with radiologist-level accuracy. The pipeline is identical to what we built: 16-bit DICOM → windowed to 8-bit → normalized to ImageNet mean/std → CNN backbone → classification head. The entire inference per image runs in under 100 ms, enabling real-time emergency triage at scale.

Checkpoint Real-World Applications — Retrieval Practice

Q1 An AV has 8 cameras at 2048 × 1536 resolution, 60 fps, RGB. Calculate the total raw data rate in GB/s. Is this feasible to stream to a single CPU without GPU acceleration?

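Bytes/frame/camera = 2048×1536×3 = 9,437,184 bytes ≈ 9.0 MB. Rate/camera = 9.0×60 = 540 MB/s. Total = 540×8 = 4320 MB/s ≈ 4.32 GB/s. A modern CPU's memory bandwidth (~50 GB/s) could carry this stream, but processing it is the problem: assuming on the order of 20 operations per byte for undistortion, tone mapping, and color conversion (an illustrative figure), the workload is about 4.32 × 10⁹ × 20 ≈ 86 billion operations per second, far more than a single ~5 GHz core can execute sequentially. A dedicated GPU/ISP is mandatory.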

Q2 A CT pixel has HU value +300 (dense muscle/bone). Using a soft-tissue window (WC = 40, WW = 400), compute the display value [0–255]. Is the pixel rendered light or dark?

Lower = 40 − 200 = −160, Upper = 40 + 200 = 240. Since HU = +300 > Upper = 240, the pixel is clipped to 255 (pure white). Dense bone exceeds the soft-tissue window and is rendered as bright white.

Q3 An AV engineer argues they should switch from RGB cameras to grayscale cameras to halve the data rate. What is one advantage and one significant disadvantage of this change?

Advantage: the channel count drops from C=3 to C=1, so raw throughput falls to one third (a larger saving than the "half" the engineer claims), cutting GPU memory bandwidth requirements significantly. Disadvantage: color information (e.g., traffic light red/green/yellow, red stop signs, lane marking colors) is lost, degrading detection of color-coded objects and reducing model accuracy for semantic tasks that rely on hue.

Color Space Explorer

Drag the RGB sliders to set any pixel color. Watch how the same color is represented in RGB, HSV, and LAB simultaneously — and see the grayscale luminance Y computed by the BT.601 formula in real time.

Adjust any slider to update all three color space representations and the swatch panels instantly. The live calculation panel shows the full numerical derivation.


What You Must Remember

Three ideas that will appear in every week of this course.

The 5-Stage Pipeline is Universal

Every production vision system — AV, radiology AI, factory inspection — maps onto Acquisition → Pre-processing → Feature Extraction → Inference → Deployment. Memorize this structure; it is your diagnostic framework for any CV failure.

Images Are Tensors: $I \in \mathbb{Z}^{H \times W \times C}$

A pixel is a number, not a color. Raw size = $H \times W \times C \times (B/8)$ bytes. Grayscale luminance follows ITU-R BT.601: $Y = 0.299R + 0.587G + 0.114B$. These formulas will reappear in every processing stage.

Color Space = Algorithmic Choice

RGB is device-native; HSV separates hue from brightness for robust color detection under changing illumination; LAB is approximately perceptually uniform, which makes it suited to distance-based color comparisons. Choose the space that makes your downstream algorithm simplest.
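
The illumination argument for HSV can be illustrated with an idealized experiment: halve every RGB channel, as if the light level dropped by half, and compare the HSV values. This is a sketch with an assumed example color; real lighting changes are rarely a pure scaling, so treat it as intuition rather than proof.

```python
import colorsys


def hsv_degrees(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation 0-1, value 0-1)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v


bright = (230, 120, 30)              # example orange under full light
dim = tuple(c // 2 for c in bright)  # same surface at half the intensity

h1, s1, v1 = hsv_degrees(*bright)
h2, s2, v2 = hsv_degrees(*dim)

print(f"bright: H={h1:.1f} S={s1:.2f} V={v1:.2f}")
print(f"dim:    H={h2:.1f} S={s2:.2f} V={v2:.2f}")
```

All three RGB channels change between the two pixels, while in HSV only V moves. That is why a hue threshold tends to survive a lighting change that breaks an RGB threshold.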

Coming up — Week 2: Spatial Filtering & Kernels

We leave the raw pixel grid and enter the frequency world. You will learn to design 2D convolution kernels that blur, sharpen, and detect edges (the core tool of the Pre-processing stage) and understand why every convolutional layer in a CNN is fundamentally a learned filter bank.

Deepen Your Understanding

Curated resources that extend the Week 1 concepts: a mix of textbook chapters, video lectures, and interactive tools.

Week 1 Exercises

Eight exercises: one theory and one code per topic, plus two synthesis challenges. All methods mirror the worked examples in this week's slides.

1 Theory · CV-AI Pipeline Easy

Pipeline Latency Budget

A real-time system must run at 25 fps. The measured stage timings are: Acquisition 4 ms, Pre-processing 15 ms, Feature Extraction 20 ms, Inference 5 ms, Deployment 3 ms.

(a) What is the frame budget in ms? (b) What is the total pipeline latency? (c) Does the system meet the 25 fps budget? (d) Which stage should be optimized first and by how much must it be reduced to just meet budget?

Frame budget = 1000 ÷ fps. Total latency = sum of all stages. The slowest stage is the primary optimization target. Show each step numerically.
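
The method in this hint can be sketched as a small helper. The stage timings and the 50 fps target below are made-up values, deliberately different from the exercise, and the function name `latency_report` is an assumption. It also assumes the stages run one after another with no overlap.

```python
def latency_report(stage_ms, fps):
    """Compare summed stage latency against the per-frame budget (hypothetical helper)."""
    budget = 1000 / fps
    total = sum(stage_ms.values())
    return {
        "budget_ms": budget,
        "total_ms": total,
        "meets_budget": total <= budget,
        "overrun_ms": max(0.0, total - budget),
        "slowest": max(stage_ms, key=stage_ms.get),
    }


# Made-up timings in milliseconds, not the numbers from the exercise.
timings = {
    "acquisition": 3,
    "preprocessing": 6,
    "feature_extraction": 9,
    "inference": 4,
    "deployment": 2,
}
report = latency_report(timings, fps=50)
for key, value in report.items():
    print(f"{key}: {value}")
```
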
2 Code · CV-AI Pipeline Easy

Build the Five-Stage Pipeline in Python

Using OpenCV and NumPy, implement a function cv_pipeline(path) that: (1) loads an image and converts BGR → RGB, (2) resizes to 224×224 and normalizes a float copy to [0, 1], (3) converts the resized 8-bit image to grayscale and applies Canny edge detection (thresholds 50, 150), (4) prints the mean and std of the normalized image, (5) returns the edge image.

Test with any JPEG. Print the shapes and dtype at each stage.

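Use cv2.imread → cvtColor(BGR2RGB) → resize, then branch: normalize a copy with astype(np.float32) / 255.0 for the statistics, and run cvtColor(RGB2GRAY) → Canny on the uint8 image, because cv2.Canny expects an 8-bit input. Print .shape and .dtype after each transform.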
3 Theory · Pixel & Color Models Medium

File Size & Luminance Calculations

(a) A 1080p 8-bit RGB image: compute the raw uncompressed size in MB. Show the formula and every multiplication step.

(b) A pixel has RGB = (220, 80, 150). Compute its ITU-R BT.601 grayscale luminance $Y$ to one decimal place. Which channel contributes the most?

(c) A 12-bit RAW image at 4000 × 3000 with 3 channels: compute the raw size in MB. How does it compare to the 8-bit version of the same resolution?

Raw bytes = H × W × C × (B÷8). Luminance: $Y = 0.299R + 0.587G + 0.114B$. For 12-bit: B=12, so (B÷8) = 1.5 bytes per sample, assuming the 12-bit values are tightly packed.
4 Code · Pixel & Color Models Medium

Color Space Conversion & Verification

Write a Python function analyze_pixel(img_path, row, col) that: (1) loads the image (BGR → RGB), (2) reads the pixel at (row, col) and prints its RGB values, (3) computes grayscale luminance manually using BT.601 and prints it, (4) computes the same value via cv2.cvtColor and asserts they differ by at most 1 (integer rounding), (5) converts the full image to HSV and prints the HSV value at (row, col).

Manual Y: Y = 0.299*R + 0.587*G + 0.114*B; int(round(Y)). OpenCV HSV: H is stored as H÷2 in uint8, so multiply by 2 to get degrees.
5 Theory · Real-World Applications Medium

AV Throughput & HU Window Normalization

(a) An AV has 10 cameras capturing 1280 × 720 RGB at 60 fps. Compute the total raw data rate in MB/s and GB/s. Show each step.

(b) A CT pixel has Hounsfield Unit value $H = -200$ HU. Using a bone window (WC = 400, WW = 1800), compute the display value on a [0, 255] scale. Is the pixel rendered light or dark in this window? What anatomical structure might this represent?

(a) Bytes per frame per camera = W×H×3. Rate in MB/s = bytes × fps × N_cams ÷ 1e6; divide by a further 1e3 for GB/s. (b) Lower = WC − WW/2, Upper = WC + WW/2. Display = (HU − Lower) / (Upper − Lower) × 255, clipped to [0, 255], where HU is the pixel's Hounsfield value.
6 Code · Real-World Applications Medium

Simulate Multi-Camera Throughput & HU Windowing

Write two Python functions:

(a) camera_throughput(n_cams, width, height, fps) → returns total MB/s.

(b) apply_hu_window(hu_array, wc, ww) → clips to the window and maps to uint8 [0, 255].

Test (a) with 6 cameras, 1920×1080, 30 fps. Test (b) on a synthetic np.linspace(-1000, 1000, 512) array with lung window (WC=−600, WW=1500). Print min, max, and dtype of the result.

For (b): lo = wc - ww/2; hi = wc + ww/2; clipped = np.clip(hu_array, lo, hi); out = ((clipped - lo) / (hi - lo) * 255).astype(np.uint8)
7 Synthesis · Theory: End-to-End System Design Hard

Design a Real-Time Object Counting Pipeline

A factory requires a camera system that counts products on a conveyor belt at 20 items/second. The system must run in real time (no dropped frames). You can choose from: 640×480 at 60 fps, 1280×720 at 30 fps, or 1920×1080 at 15 fps.

(a) Compute the raw data rate for each option in MB/s. (b) Assuming Feature Extraction takes 18 ms per frame and all other stages take 8 ms total, which resolution options meet the real-time budget? (c) The products are orange; should you use RGB or HSV thresholding? Specify which HSV ranges you would use. (d) If lighting changes between day and night shifts, which HSV component should be allowed to vary and which should be kept constant for the detection threshold?

Frame budget = 1000 ÷ fps. Remember total latency = 18 + 8 = 26 ms. For orange: H ≈ 10°–25°, S > 0.5, V varies with lighting.
8 Synthesis · Code: Full Pre-processing Pipeline Hard

Build a Color-Robust Detection Pre-processor

Implement preprocess_for_detection(img_bgr, target_size=(224,224)) that:

(1) Converts BGR → RGB. (2) Creates an HSV mask for orange objects (H ∈ [10, 25] in degrees, S > 0.4, V > 0.2). (3) Applies the mask to zero out non-orange regions. (4) Resizes the masked image to target_size. (5) Normalizes to float32 [0, 1]. (6) Returns both the normalized RGB image and the binary mask.

Print the unique values in the mask (should be 0 and 255) and the dtype / shape of the final output. Test with a synthetic image containing an orange rectangle.

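OpenCV HSV (uint8): H is stored as H÷2 (so 10°→5, 25°→12), and S and V are scaled to [0, 255] (so 0.4→102, 0.2→51). Because step (1) already converted the image to RGB, convert with cv2.COLOR_RGB2HSV. Use cv2.inRange(hsv, lower, upper) to create the mask, then cv2.bitwise_and to apply it.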
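
The unit conversion in this hint is a common source of silent bugs, so it may help to isolate it. The sketch below assumes OpenCV's 8-bit HSV encoding (H in [0, 179] as degrees ÷ 2, S and V in [0, 255]); the helper name `to_opencv_hsv` is hypothetical and the code does not call OpenCV itself.

```python
def to_opencv_hsv(h_deg, s, v):
    """Convert (degrees, 0-1, 0-1) thresholds to OpenCV's uint8 HSV scale."""
    return int(h_deg // 2), round(s * 255), round(v * 255)


# Bounds for the orange range described in the exercise.
lower = to_opencv_hsv(10, 0.4, 0.2)
upper = to_opencv_hsv(25, 1.0, 1.0)

print(lower)  # (5, 102, 51)
print(upper)  # (12, 255, 255)
```

These tuples can then be wrapped in `np.array(..., dtype=np.uint8)` and passed as the bounds to `cv2.inRange`.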