From Pixels to Semantic Understanding — Build the mathematical intuition that powers every modern computer vision system.
Computer Vision
Every production vision system — from a self-driving car to a medical scanner — is built on the same five-stage pipeline. This section maps that pipeline end to end, giving you the mental model that ties every subsequent week together.
Every time you unlock your phone with your face, every time a warehouse robot picks a box, every time an AI flags a suspicious X-ray — the same invisible assembly line of code runs underneath. Have you ever wondered what that pipeline actually looks like, step by step?
The CV-AI pipeline is like a restaurant kitchen: raw ingredients arrive (Acquisition), get washed and chopped (Pre-processing), combined into dishes (Feature Extraction), tasted by the head chef (Inference), and finally plated and served (Deployment). Each station has a clear job — and a burnt dish at one station cannot be fixed downstream.
The five-stage CV-AI pipeline — each stage maps to specific weeks in this course.
Computer Vision sits at the intersection of signal processing, linear algebra, and deep learning. The core challenge: a camera collapses a 3D world into a 2D array of numbers. Our task is to reverse this process — recovering structure, meaning, and intent from those numbers.
Background. Follow a single pixel from a dashcam frame all the way to a stop-sign classification.
If the inference stage runs in 8 ms but pre-processing takes 40 ms, which stage is the bottleneck? Which pipeline stage should the engineer optimize first?
Each stage is a deterministic function that maps one representation to another. The complete pipeline composes these functions:
$$\text{output} = (f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1)(\text{scene}) = f_5(f_4(f_3(f_2(f_1(\text{scene})))))$$
where $f_1$ = Acquisition, $f_2$ = Pre-processing, $f_3$ = Feature Extraction, $f_4$ = Inference, $f_5$ = Deployment output.
| Stage | Input | Output |
|---|---|---|
| Acquisition | Photons / scene | Raw pixel array |
| Pre-processing | Raw pixels | Normalized tensor |
| Feature Extraction | Normalized tensor | Feature map / embedding |
| Inference | Feature map | Prediction ($\hat{y}$) |
| Deployment | Prediction | Action / alert / display |
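To make the composition concrete, here is a minimal sketch in which each stage is an ordinary Python function and the pipeline is their left-to-right composition. The five toy stage functions and the `compose` helper are illustrative assumptions, not part of any real vision library; they only mirror the Input/Output columns of the table.

```python
from functools import reduce

def compose(stages):
    """Return a function that applies the stages in order: f_n(...f_1(x))."""
    def run(x):
        return reduce(lambda value, stage: stage(value), stages, x)
    return run

# Hypothetical stand-ins for the five stages (toy logic only).
def acquire(scene):
    # Scene -> raw pixel array (integers 0..255)
    return [[int(v) for v in row] for row in scene]

def preprocess(raw):
    # Raw pixels -> normalized values in [0, 1]
    return [[v / 255.0 for v in row] for row in raw]

def extract_features(tensor):
    # Normalized tensor -> a single feature (mean brightness)
    flat = [v for row in tensor for v in row]
    return sum(flat) / len(flat)

def infer(feature):
    # Feature -> prediction
    return "bright" if feature > 0.5 else "dark"

def deploy(prediction):
    # Prediction -> action / alert / display
    return f"ALERT: scene is {prediction}"

pipeline = compose([acquire, preprocess, extract_features, infer, deploy])
result = pipeline([[200, 220], [240, 255]])
print(result)
```

Swapping one stage (for example a faster `extract_features`) changes only what flows downstream of it; the stages before it are untouched.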
Problem: A real-time system must run at 25 fps. Stage timings are: Acquisition 5 ms, Pre-processing 12 ms, Feature Extraction 18 ms, Inference 6 ms, Deployment 2 ms. Does the system meet the budget?
If Acquisition is parallelized with Pre-processing (they overlap in time), what is the new total latency?
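The budget check can be worked through in a few lines. This sketch assumes the stages run strictly one after another for a single frame, and that "parallelized" means Acquisition and Pre-processing overlap completely, so together they cost only the slower of the two. The variable names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps

# Stage timings from the problem statement (ms).
stage_ms = {
    "Acquisition": 5,
    "Pre-processing": 12,
    "Feature Extraction": 18,
    "Inference": 6,
    "Deployment": 2,
}

budget = frame_budget_ms(25)              # 1000 / 25
sequential = sum(stage_ms.values())       # all stages back to back
bottleneck = max(stage_ms, key=stage_ms.get)

# Assumption: Acquisition and Pre-processing overlap fully in time.
overlapped = (
    max(stage_ms["Acquisition"], stage_ms["Pre-processing"])
    + stage_ms["Feature Extraction"]
    + stage_ms["Inference"]
    + stage_ms["Deployment"]
)

print(f"budget      = {budget:.1f} ms")
print(f"sequential  = {sequential} ms, meets budget: {sequential <= budget}")
print(f"overlapped  = {overlapped} ms, meets budget: {overlapped <= budget}")
print(f"slowest stage: {bottleneck}")
```

With these numbers the sequential pipeline takes 43 ms against a 40 ms budget, while the overlapped version takes 38 ms and fits.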
Treating the pipeline as a black box. Students often think "just use a pre-trained model" is a complete solution. But a model with perfect accuracy on benchmark data can fail in production if the Acquisition stage uses a different camera white balance, or if Pre-processing normalizes with the wrong mean/std. Every stage must match between training and deployment.
In the widget below, each button transforms a synthetic 8×8 pixel grid through one pipeline stage. Before clicking, predict: what will change visually when you apply "Grayscale"? What about "Edge Detect"?
Hint: think about what information is preserved vs. discarded at each stage.
Every computer vision system is a composition of five deterministic stages — understanding which stage owns which transformation is the diagnostic skill that separates engineers from users.
Systems like Waymo and Tesla Autopilot execute the full CV-AI pipeline at 30–60 fps in real time. Camera frames pass through HDR normalization and lens undistortion (Pre-processing), a multi-scale CNN backbone shared across task heads (Feature Extraction), and parallel inference for lane detection, object detection, and depth estimation, all within a 33 ms frame budget at 30 fps (about 16.7 ms at 60 fps) on dedicated NPUs.
Q1 A self-driving car camera captures at 30 fps. The full pipeline runs in 45 ms per frame. What is the maximum achievable frame rate, and which stage must be optimized to reach 30 fps?
Q2 A model achieves 95% accuracy on the benchmark dataset but only 62% in the field. List two pipeline stages where the training–deployment mismatch could originate.
Q3 Express the pipeline as a mathematical composition. If $f_3$ (Feature Extraction) is replaced with a faster but weaker function $\tilde{f}_3$, which output changes and which remains the same?
Image Fundamentals
A digital image is a 2D discrete signal — a function $f(x, y)$ mapping spatial coordinates to intensity values. Before building any vision system you must understand what these numbers mean physically, how they are stored, and how color is encoded across different color spaces.
Your camera captures a sunset. To you it is a sky of orange and pink. To the computer it is a grid of numbers — nothing more. The question is: which numbers, arranged how, and what do they actually encode?
An image is like a spreadsheet of paint chips: each cell (pixel) holds a number (intensity). RGB is three spreadsheets stacked — one for red, one for green, one for blue. HSV reorders the same information into hue (color wheel angle), saturation (color purity), and value (brightness). The paint is the same; only the labeling system changes.
Color space conversion pipeline: each representation encodes the same visual information differently, optimized for different tasks.
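A grayscale image of width $W$ and height $H$ is an $H \times W$ integer matrix. Each element $I[r, c]$ is a pixel intensity in $\{0, 1, \ldots, 2^B - 1\}$ for bit depth $B$. A color image adds a channel dimension:
$$I \in \{0, 1, \ldots, 2^B - 1\}^{H \times W \times C}$$
with $C = 3$ for RGB.
The raw (uncompressed) byte count is:
$$\text{Size} = H \times W \times C \times \frac{B}{8} \ \text{bytes}$$
where $B$ is the bit depth per channel.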
Problem: A 4K camera captures at 3840 × 2160 in 24-bit RGB (8 bits per channel). What is the raw uncompressed size in MB?
A 16-bit HDR image at 1920 × 1080 with 3 channels — what is the raw size in MB?
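Both size problems apply the same formula, raw size = $H \times W \times C \times (B/8)$ bytes. The sketch below assumes $B$ is the bit depth per channel and that 1 MB means $10^6$ bytes; with binary megabytes ($2^{20}$ bytes) the numbers come out about 5% smaller. The function name is illustrative.

```python
def raw_size_bytes(width, height, channels, bits_per_channel):
    """Uncompressed image size in bytes: H * W * C * (B / 8)."""
    return width * height * channels * bits_per_channel // 8

# 4K frame, 24-bit RGB = 3 channels x 8 bits each.
size_4k = raw_size_bytes(3840, 2160, channels=3, bits_per_channel=8)

# 1080p HDR frame, 3 channels x 16 bits each.
size_hdr = raw_size_bytes(1920, 1080, channels=3, bits_per_channel=16)

print(f"4K 8-bit RGB : {size_4k:,} bytes = {size_4k / 1e6:.2f} MB")
print(f"1080p 16-bit : {size_hdr:,} bytes = {size_hdr / 1e6:.2f} MB")
```

The 4K frame is 24,883,200 bytes (about 24.88 MB) and the 16-bit 1080p frame is 12,441,600 bytes (about 12.44 MB): a quarter of the pixels, but twice the bytes per sample.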
Converting RGB to grayscale is not a simple average. Human eyes are more sensitive to green than to red or blue. The ITU-R BT.601 standard encodes this perceptual weighting:
$$Y = 0.299R + 0.587G + 0.114B$$
Problem: Pixel RGB = (180, 120, 60). Compute the grayscale luminance $Y$.
A pure red pixel (255, 0, 0). What is its luminance $Y$? Is this brighter or darker than a pure green pixel (0, 255, 0)?
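The luminance problems can be checked with a small helper that applies the BT.601 weights (0.299, 0.587, 0.114) to 8-bit channel values. This is a sketch for single pixels; the function name is illustrative, and the result is left as a float so that rounding is an explicit choice of the caller.

```python
def luminance_bt601(r, g, b):
    """Grayscale luminance Y from RGB using ITU-R BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

y_sample = luminance_bt601(180, 120, 60)   # worked problem
y_red = luminance_bt601(255, 0, 0)         # pure red
y_green = luminance_bt601(0, 255, 0)       # pure green

print(f"Y(180, 120, 60) = {y_sample:.1f}")
print(f"Y(pure red)     = {y_red:.1f}")
print(f"Y(pure green)   = {y_green:.1f}")
```

The worked pixel gives $Y \approx 131.1$. Pure red gives about 76.2 and pure green about 149.7, so green appears roughly twice as bright as red at the same channel value.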
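RGB mixes hue and brightness into three correlated channels, making color filtering difficult. HSV separates them: $H$ (hue) is the angle on the color wheel, $S$ (saturation) is the color purity, and $V$ (value) is the brightness.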
Practical advantage: detecting a colored object under changing illumination largely reduces to a range check on $H$, because $V$ absorbs most of the lighting variation. Red is a special case: its hue wraps around 0°, so it needs two ranges.
Problem: Convert RGB = (255, 128, 0) — an orange color — to HSV. Values normalized to [0, 1]: R = 1.0, G = 0.502, B = 0.
If the same orange is viewed in dim light — all RGB values halved to (127, 64, 0) — which HSV component changes and which stays the same?
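The conversion can be verified with Python's standard-library `colorsys` module, which takes and returns values in [0, 1]. The wrapper below is an illustrative helper (not an OpenCV call) that accepts 8-bit RGB and reports hue in degrees.

```python
import colorsys

def rgb8_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, value)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return h * 360, s, v

h_bright, s_bright, v_bright = rgb8_to_hsv(255, 128, 0)   # orange
h_dim, s_dim, v_dim = rgb8_to_hsv(127, 64, 0)             # same orange, dim light

print(f"bright: H = {h_bright:.1f} deg, S = {s_bright:.2f}, V = {v_bright:.2f}")
print(f"dim   : H = {h_dim:.1f} deg, S = {s_dim:.2f}, V = {v_dim:.2f}")
```

The hue stays at about 30° and the saturation at 1.0 in both cases, while $V$ drops from 1.0 to roughly 0.5. The tiny hue difference (about 0.1°) comes from rounding the halved values to 8-bit integers.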
OpenCV stores images as BGR, not RGB. When you call cv2.imread(), channel order is [Blue, Green, Red]. Feeding a BGR image to a model trained on RGB data will degrade performance because the model has learned that channel 0 = Red. Always convert with cv2.cvtColor(img, cv2.COLOR_BGR2RGB) immediately after loading.
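A single pixel is enough to show why the channel order matters. The sketch below uses plain tuples rather than OpenCV arrays: it takes an orange pixel in BGR order and compares the BT.601 luminance computed with the correct channel order against the value obtained by mistakenly treating the BGR tuple as RGB. Both helper names are illustrative.

```python
def bgr_to_rgb(pixel):
    """Reorder a (B, G, R) tuple into (R, G, B)."""
    b, g, r = pixel
    return (r, g, b)

def luminance_bt601(r, g, b):
    """Grayscale luminance Y from RGB using ITU-R BT.601 weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

orange_bgr = (0, 128, 255)                 # orange RGB (255, 128, 0) stored as BGR

y_correct = luminance_bt601(*bgr_to_rgb(orange_bgr))   # channels reordered first
y_wrong = luminance_bt601(*orange_bgr)                 # BGR misread as RGB

print(f"correct order: Y = {y_correct:.1f}")
print(f"misread BGR  : Y = {y_wrong:.1f}")
```

Misreading the order turns the orange pixel into a blue one as far as any downstream computation is concerned, and the luminance shifts from about 151.4 to about 104.2.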
The widget below lets you adjust R, G, B sliders and see the pixel color update live. Before moving the sliders, predict: if you set R=255, G=0, B=255, what color appears? What is the approximate grayscale luminance $Y$?
Hint: $Y = 0.299 \times 255 + 0.587 \times 0 + 0.114 \times 255$
An image is a 3D tensor $I \in \mathbb{Z}^{H \times W \times C}$ — choosing the right color space (RGB, HSV, LAB) is an algorithmic decision that directly determines how robust your feature extraction will be to lighting and viewpoint changes.
CT scanners produce 16-bit grayscale images with intensity values in Hounsfield Units (HU); MRI scanners also produce high-bit-depth grayscale images, but their intensities are not calibrated in HU. Radiology AI pipelines normalize CT data to a specific window (e.g., bone: [−400, 1000] HU) before feeding it into a CNN, transforming the raw 65,536-level signal into a perceptually meaningful [0, 255] range. This is the clinical equivalent of choosing the right color space for the task.
Q1 A camera captures a 2000 × 1500 image in 16-bit RGB. What is the uncompressed file size in MB? Show each calculation step.
Q2 A pixel has RGB = (50, 200, 100). Compute its ITU-R BT.601 grayscale luminance $Y$ to one decimal place. Which channel contributes the most?
Q3 You need to detect yellow traffic cones regardless of whether it is sunny or overcast. Should you threshold on RGB or HSV? Explain which HSV components you would use.
Applications
The CV-AI pipeline and pixel mathematics are not academic exercises — they power two of the most consequential application domains in modern engineering. This section grounds Week 1 concepts in autonomous driving perception and radiology AI, showing exactly where every formula you learned appears in production systems.
A self-driving car has no eyes — only cameras, LiDAR, and radar. A radiologist AI has never seen a patient — only grayscale 2D projections of 3D anatomy. Yet both systems must make life-or-death decisions in milliseconds. How do a few thousand lines of code, built on the pixel mathematics we just learned, achieve this?
An autonomous vehicle's camera system is like a team of photographers shooting at 60 fps from six different angles simultaneously — the CV pipeline is the photo editor that processes every shot in under 16 ms, highlights the important parts, and tells the driver what to do next. The radiology AI is the expert reviewer who has memorized a million X-rays and can spot an anomaly a human eye would miss.
Autonomous driving perception: multi-camera acquisition → GPU pre-processing → shared CNN backbone → parallel inference heads → drive decision, all within one 33 ms frame budget.
A modern AV perception stack maps directly onto the five-stage CV-AI pipeline:
The critical engineering challenge: raw throughput from 8 cameras at 4K/30 fps in 8-bit RGB is approximately 6 GB/s (8 × 3840 × 2160 × 3 bytes × 30 fps). This must be compressed and processed on a GPU in parallel, not sequentially on a CPU.
Problem: An AV has 6 cameras, each capturing 1920 × 1080 RGB at 30 fps. What is the total raw data rate in MB/s?
If the AV upgrades to 4K cameras (3840 × 2160) at the same 30 fps and 6 cameras, by what factor does the raw throughput increase?
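The data-rate problem multiplies the per-frame size by the frame rate and the number of cameras. This sketch assumes 8-bit RGB (3 bytes per pixel) and 1 MB = $10^6$ bytes; the function name and its default arguments are illustrative.

```python
def raw_data_rate_mb_s(n_cams, width, height, fps, channels=3, bits_per_channel=8):
    """Total uncompressed data rate in MB/s (1 MB = 1e6 bytes)."""
    bytes_per_frame = width * height * channels * bits_per_channel // 8
    return n_cams * bytes_per_frame * fps / 1e6

rate_1080p = raw_data_rate_mb_s(6, 1920, 1080, 30)
rate_4k = raw_data_rate_mb_s(6, 3840, 2160, 30)
factor = rate_4k / rate_1080p

print(f"6 x 1080p @ 30 fps: {rate_1080p:.1f} MB/s")
print(f"6 x 4K    @ 30 fps: {rate_4k:.1f} MB/s")
print(f"increase factor   : {factor:.1f}x")
```

Six 1080p cameras produce about 1,119.7 MB/s. Moving to 4K doubles both width and height, so the rate grows by a factor of 4 to about 4,479 MB/s.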
Radiology AI applies the same five-stage pipeline to 2D projections of 3D anatomy:
Problem: A lung CT uses window center $WC = -600$ HU and window width $WW = 1500$ HU. A pixel has value $H = -400$ HU. Map it to a [0, 255] display value.
Using the same window, what display value does a pixel with $H = -1350$ HU map to? What does this represent visually?
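The windowing problems can be sketched for a single value as follows. The helper assumes a simple linear window: values are clipped to [WC − WW/2, WC + WW/2] and then scaled to [0, 255]. Real DICOM viewers may use a slightly different rounding convention, so treat this as an approximation, and the function name as illustrative.

```python
def hu_to_display(hu, wc, ww):
    """Map a Hounsfield value to [0, 255] with a linear window (center wc, width ww)."""
    lo = wc - ww / 2          # lower edge of the window
    hi = wc + ww / 2          # upper edge of the window
    clipped = min(max(hu, lo), hi)
    return (clipped - lo) / (hi - lo) * 255

# Lung window from the problem: WC = -600 HU, WW = 1500 HU -> [-1350, 150] HU.
d_tissue = hu_to_display(-400, wc=-600, ww=1500)
d_floor = hu_to_display(-1350, wc=-600, ww=1500)
d_above = hu_to_display(500, wc=-600, ww=1500)

print(f"H = -400  -> {d_tissue:.1f}")
print(f"H = -1350 -> {d_floor:.1f}")
print(f"H = +500  -> {d_above:.1f}")
```

A value of −400 HU maps to about 161.5, a mid-to-light gray. −1350 HU sits exactly on the lower window edge and maps to 0 (black), and anything above +150 HU saturates at 255 (white).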
Assuming higher camera resolution always means better AI performance. Quadrupling the pixel count (e.g., 1080p → 4K) multiplies raw data by 4× and can multiply inference latency by roughly 4–16× (due to larger feature maps), which can push the pipeline over the real-time budget. In AV systems, engineers often use a mix: one high-res forward-facing camera for long-range detection and lower-res fisheye cameras for wide-angle coverage. The goal is to match sensor specs to the task, not to maximize resolution uniformly.
The widget below simulates the camera throughput calculation for different AV configurations. Before adjusting the sliders, predict: if you double the number of cameras from 4 to 8 and also double the frame rate from 30 to 60 fps, by what factor does the total data rate increase?
Hint: throughput scales linearly with both number of cameras and frame rate.
The same CV-AI pipeline and pixel mathematics power both autonomous driving (multi-camera real-time perception) and radiology AI (16-bit HU windowing into semantic segmentation masks) — the engineering principles learned in Week 1 are the literal building blocks of these life-critical systems.
CheXNet (Rajpurkar et al., Stanford) demonstrated that a DenseNet-121 trained on 112,000 frontal chest X-rays can detect pneumonia with radiologist-level accuracy. The pipeline is identical to what we built: 16-bit DICOM → windowed to 8-bit → normalized to ImageNet mean/std → CNN backbone → classification head. The entire inference per image runs in under 100 ms, enabling real-time emergency triage at scale.
Q1 An AV has 8 cameras at 2048 × 1536 resolution, 60 fps, RGB. Calculate the total raw data rate in GB/s. Is this feasible to stream to a single CPU without GPU acceleration?
Q2 A CT pixel has HU value +300 (dense muscle/bone). Using a soft-tissue window (WC = 40, WW = 400), compute the display value [0–255]. Is the pixel rendered light or dark?
Q3 An AV engineer argues they should switch from RGB cameras to grayscale cameras to halve the data rate. What is one advantage and one significant disadvantage of this change?
Interactive Lab
Drag the RGB sliders to set any pixel color. Watch how the same color is represented in RGB, HSV, and LAB simultaneously — and see the grayscale luminance Y computed by the BT.601 formula in real time.
Week 1 Summary
Three ideas that will appear in every week of this course.
Every production vision system — AV, radiology AI, factory inspection — maps onto Acquisition → Pre-processing → Feature Extraction → Inference → Deployment. Memorize this structure; it is your diagnostic framework for any CV failure.
A pixel is a number, not a color. Raw size = $H \times W \times C \times (B/8)$ bytes. Grayscale luminance follows ITU-R BT.601: $Y = 0.299R + 0.587G + 0.114B$. These formulas will reappear in every processing stage.
RGB is device-native; HSV separates hue from brightness for robust color detection under changing illumination; LAB is perceptually uniform for distance-based comparisons. Choose the space that makes your downstream algorithm simplest.
We leave the raw pixel grid and enter the frequency world. You will learn to design 2D convolution kernels that blur, sharpen, and detect edges — the Pre-processing stage's core tool — and understand why every CNN layer is fundamentally a learned filter bank.
Further Reading
Curated resources that extend the Week 1 concepts — mix of textbook chapters, video lectures, and interactive tools.
The authoritative reference for the pinhole camera model, sensor physics, and color space mathematics. Chapter 2 covers exactly what we introduced in Week 1, with full derivations.
→ szeliski.org/Book
Docs · OpenCV
Official documentation for all color space conversion formulas used in cv2.cvtColor(). Includes the exact BGR↔RGB↔HSV↔LAB equations.
Images in Python are NumPy arrays. This guide covers array creation, slicing, broadcasting, and dtype operations — the foundational skills for every coding exercise in this course.
→ numpy.org
Docs · PyTorch
The standard transforms (Resize, Normalize, ToTensor) that implement the Pre-processing stage for PyTorch models. Understanding these is essential before Weeks 5–7.
→ pytorch.org
Practice
Eight exercises: one theory and one code per topic, plus two synthesis challenges. All methods mirror the worked examples in this week's slides.
A real-time system must run at 25 fps. The measured stage timings are: Acquisition 4 ms, Pre-processing 15 ms, Feature Extraction 20 ms, Inference 5 ms, Deployment 3 ms.
(a) What is the frame budget in ms? (b) What is the total pipeline latency? (c) Does the system meet the 25 fps budget? (d) Which stage should be optimized first and by how much must it be reduced to just meet budget?
Using OpenCV and NumPy, implement a function cv_pipeline(path) that: (1) loads an image and converts BGR → RGB, (2) resizes to 224×224 and normalizes to [0, 1], (3) converts to grayscale and applies Canny edge detection (thresholds 50, 150), (4) prints the mean and std of the normalized image, (5) returns the edge image.
Test with any JPEG. Print the shapes and dtype at each stage.
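Hint: cv2.imread → cvtColor(BGR2RGB) → resize → astype(np.float32) / 255.0 for the normalized copy. Run cvtColor(RGB2GRAY) → Canny on the resized uint8 image, because cv2.Canny expects an 8-bit input. Print .shape and .dtype after each transform.
(a) A 1080p 8-bit RGB image: compute the raw uncompressed size in MB. Show the formula and every multiplication step.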
(b) A pixel has RGB = (220, 80, 150). Compute its ITU-R BT.601 grayscale luminance $Y$ to one decimal place. Which channel contributes the most?
(c) A 12-bit RAW image at 4000 × 3000 with 3 channels: compute the raw size in MB. How does it compare to the 8-bit version of the same resolution?
Write a Python function analyze_pixel(img_path, row, col) that: (1) loads the image (BGR → RGB), (2) reads the pixel at (row, col) and prints its RGB values, (3) computes grayscale luminance manually using BT.601 and prints it, (4) computes the same value via cv2.cvtColor and asserts they differ by at most 1 (integer rounding), (5) converts the full image to HSV and prints the HSV value at (row, col).
Hint: Y = 0.299*R + 0.587*G + 0.114*B; int(round(Y)). OpenCV HSV: H is stored as H÷2 in uint8, so multiply by 2 to get degrees.
(a) An AV has 10 cameras capturing 1280 × 720 RGB at 60 fps. Compute the total raw data rate in MB/s and GB/s. Show each step.
(b) A CT pixel has Hounsfield Unit value $H = -200$ HU. Using a bone window (WC = 400, WW = 1800), compute the display value on a [0, 255] scale. Is the pixel rendered light or dark in this window? What anatomical structure might this represent?
Write two Python functions:
(a) camera_throughput(n_cams, width, height, fps) → returns total MB/s.
(b) apply_hu_window(hu_array, wc, ww) → clips to the window and maps to uint8 [0, 255].
Test (a) with 6 cameras, 1920×1080, 30 fps. Test (b) on a synthetic np.linspace(-1000, 1000, 512) array with lung window (WC=−600, WW=1500). Print min, max, and dtype of the result.
Hint: lo = wc-ww//2; hi = wc+ww//2; clipped = np.clip(hu, lo, hi); out = ((clipped-lo)/(hi-lo)*255).astype(np.uint8)
A factory requires a camera system that counts products on a conveyor belt at 20 items/second. The system must run in real time (no dropped frames). You can choose from: 640×480 at 60 fps, 1280×720 at 30 fps, or 1920×1080 at 15 fps.
(a) Compute the raw data rate for each option in MB/s. (b) Assuming Feature Extraction takes 18 ms per frame and all other stages take 8 ms total, which resolution options meet the real-time budget? (c) The products are orange; should you use RGB or HSV thresholding? Specify which HSV ranges you would use. (d) If lighting changes between day and night shifts, which HSV component should be allowed to vary and which should be kept constant for the detection threshold?
Implement preprocess_for_detection(img_bgr, target_size=(224,224)) that:
(1) Converts BGR → RGB. (2) Creates an HSV mask for orange objects (H ∈ [10, 25] in OpenCV's 8-bit hue units, i.e. 20°–50°, which includes the 30° orange from the worked example; S > 0.4 and V > 0.2 on a [0, 1] scale). (3) Applies the mask to zero out non-orange regions. (4) Resizes the masked image to target_size. (5) Normalizes to float32 [0, 1]. (6) Returns both the normalized RGB image and the binary mask.
Print the unique values in the mask (should be 0 and 255) and the dtype / shape of the final output. Test with a synthetic image containing an orange rectangle.
Hint: use cv2.inRange(hsv, lower, upper) to create the mask, then cv2.bitwise_and to apply it.