From raw data to AI-ready features — master the mathematics behind every system that hears, sees, and understands the continuous physical universe.
Have you ever asked Shazam to recognise a song in a noisy café, or watched a noise-cancelling headphone silently erase the roar of an airplane engine? Both start with the same invisible step — converting a physical wave into numbers — before any "smart" algorithm can even begin.
A signal pipeline is like a film production line: raw footage captured on set (x(t)) passes through editing (ADC sampling), colour grading (DSP features), and a final cut review (AI model) before the audience ever sees it. Skip any stage and the film is unwatchable — your neural network is the audience, and it cannot start until every upstream stage delivers its output cleanly.
A signal is any quantity that varies over time (or space) and carries information — a microphone voltage, a blood-pressure reading, the brightness of a camera pixel. In the physical world these are continuous-time signals $x(t)$: defined for every real instant $t \in \mathbb{R}$, with infinitely many values. Every digital device instead stores a discrete-time signal $x[n]$: a list of numbers at integer indices $n \in \mathbb{Z}$ only.
The bridge is sampling: $x[n] = x(n \cdot T_s)$, where $T_s = 1/f_s$ is the time gap between measurements. At CD quality ($f_s = 44{,}100$ Hz), $T_s \approx 22.7\ \mu\text{s}$: one snapshot roughly every 23 microseconds. The pipeline diagram below shows exactly where this conversion happens.
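As a minimal sketch of the sampling relation $x[n] = x(n \cdot T_s)$, the snippet below computes $T_s$ for two sample rates and samples a test tone. The helper names `sample_period_us` and `sample`, and the 1 kHz test tone, are illustrative assumptions rather than part of the course code.

```python
import numpy as np

def sample_period_us(fs_hz):
    """Return the sample period T_s = 1 / f_s in microseconds."""
    return 1e6 / fs_hz

def sample(x_of_t, fs_hz, num_samples):
    """Evaluate a continuous-time function at t = n * T_s for n = 0 .. num_samples - 1."""
    n = np.arange(num_samples)      # integer sample indices
    return x_of_t(n / fs_hz)        # x[n] = x(n * T_s)

ts_cd = sample_period_us(44_100)     # CD quality
ts_speech = sample_period_us(16_000) # the 16 kHz speech pipeline

# Hypothetical 1 kHz test tone standing in for the physical signal x(t)
x_t = lambda t: np.cos(2 * np.pi * 1_000 * t)
x_n = sample(x_t, fs_hz=8_000, num_samples=8)   # 8 samples = one full cycle

print(f"T_s at 44.1 kHz: {ts_cd:.2f} us")
print(f"T_s at 16 kHz:   {ts_speech:.2f} us")
print(np.round(x_n, 3))
```

Note that the array `x_n` carries no time information of its own; only the pairing with `fs_hz` tells you how far apart the samples were taken.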
In computer science, we often treat data as static databases or text files. However, the physical world communicates in continuous waves of air pressure, light, electrical voltages, and mechanical vibrations. AI models cannot process these directly.
Digital Signal Processing (DSP) is the science of translating this continuous physical reality into optimal, low-noise digital arrays. It is the crucial "feature engineering" layer that ensures machine learning models receive clean, highly expressive feature vectors.
An end-to-end AI system that maps physical-world events to decisions is a chain of mathematical function applications:
$$x(t) \xrightarrow{\text{ADC}} x[n] \xrightarrow{\text{DSP}} y[n] \xrightarrow{\text{Feature Extr.}} \mathbf{h} \xrightarrow{\text{AI Model}} \hat{y}$$
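The chain above can be sketched as four small functions applied in sequence. Everything here is a toy stand-in chosen for illustration: the function names, the DC-removal step, the single-number feature vector, and the threshold "model" are assumptions, not the stages a production system such as Whisper actually uses.

```python
import numpy as np

def adc(x_of_t, fs, duration_s):
    """Sample a continuous-time function x(t) into x[n] = x(n / fs)."""
    n = np.arange(int(round(fs * duration_s)))
    return x_of_t(n / fs)

def dsp(x):
    """Toy clean-up stage: remove the DC offset and normalise the peak to 1."""
    y = x - np.mean(x)
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

def extract_features(y, fs):
    """Toy feature vector h: the frequency (Hz) of the strongest spectral bin."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    return np.array([freqs[np.argmax(spectrum)]])

def toy_model(h, threshold_hz=300.0):
    """Stand-in classifier (not a trained network): label by dominant frequency."""
    return "high" if h[0] > threshold_hz else "low"

fs = 16_000
# Hypothetical input: a 440 Hz tone riding on a small DC offset
x_t = lambda t: 0.5 * np.cos(2 * np.pi * 440 * t) + 0.1

x_n = adc(x_t, fs, duration_s=0.1)   # x(t) -> x[n]
y_n = dsp(x_n)                       # x[n] -> y[n]
h = extract_features(y_n, fs)        # y[n] -> h
y_hat = toy_model(h)                 # h    -> y_hat

print(h, y_hat)
```

Each stage only sees the output of the previous one, which is why an error introduced at the ADC or DSP stage cannot be repaired by the model at the end.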
Background & Purpose: Human speech travels through the air as continuous variations in pressure. For a neural network like Whisper to recognize spoken words, this high-dimensional wave must be digitized, transformed into the frequency domain, mapped to a perceptual scale, and tokenized.
Problem: A user speaks the word "AI" into a microphone. Trace how the acoustic signal $x(t)$ transforms through a 16 kHz digital audio pipeline into a prediction $\hat{y}$.
Phone-call audio uses $f_s = 8{,}000$ Hz instead of 16,000 Hz. What is $T_s$ (the time gap between consecutive samples) in microseconds?
Myth: "A digital signal is just a smooth analog signal that got rounded off — the original wave is still there between samples."
Reality: Once sampled, the values between integer indices do not exist in the digital domain. They can be reconstructed mathematically (interpolation), but only if the Nyquist condition is met. If the signal contains frequencies above $f_s/2$, those frequencies are permanently aliased — no reconstruction can recover them.
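A short numerical check makes the aliasing claim concrete. In this sketch (the 8 kHz sample rate and the two tone frequencies are chosen only for illustration), a 7 kHz tone sampled at 8 kHz produces exactly the same samples as a 1 kHz tone, so nothing downstream can tell them apart.

```python
import numpy as np

fs = 8_000                  # sample rate (Hz); Nyquist limit is fs / 2 = 4 kHz
n = np.arange(32)           # integer sample indices

f_low = 1_000               # below the Nyquist limit
f_high = fs - f_low         # 7 kHz, above the Nyquist limit

x_low = np.cos(2 * np.pi * f_low / fs * n)
x_high = np.cos(2 * np.pi * f_high / fs * n)

# The two sample sequences agree to floating-point precision
identical = bool(np.allclose(x_low, x_high))
print(identical)
```

This is why an anti-aliasing filter has to act on the analog signal before sampling: after sampling, the 7 kHz content is indistinguishable from genuine 1 kHz content.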
The widget shows coloured dots flowing through 4 stages: Wave x(t) → Samples x[n] → Features h → Classification ŷ. Before dragging the speed slider — do you predict all stages move faster uniformly, or will one stage appear as a visual bottleneck that the dots queue behind?
Form your prediction first — then drag the Transmission Speed slider to verify ↓
Every AI system that touches the physical world is only as good as its DSP front-end — $x(t)$ must become a clean $x[n]$ before any neural network can start processing.
Active noise cancellation (ANC) systems in modern wireless earbuds capture continuous external ambient noise waves $x(t)$ using miniature, high-sensitivity microphones. An ultra-low latency DSP chip samples the noise, computes the exact inverted phase sequence ($\phi \leftarrow \phi + \pi$, representing a $180^\circ$ phase shift), and drives the speaker cone to emit destructive anti-noise. This entire capture-compute-emit loop must execute in under **15 microseconds**. If processing delay exceeds this threshold, the anti-noise wave arrives late, causing constructive interference that amplifies the noise rather than canceling it.
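The effect of the $\pi$ phase shift, and of a late anti-noise wave, can be sketched on a single tone. This is an idealised illustration only: it assumes a pure 1 kHz noise tone at $f_s = 48$ kHz and ignores the acoustic path, the microphone, and the speaker. The half-period delay is picked as the worst case.

```python
import numpy as np

fs = 48_000
f_noise = 1_000                       # 48 samples per cycle at this sample rate
n = np.arange(480)                    # exactly 10 cycles, so the signal is periodic in the array

noise = np.cos(2 * np.pi * f_noise / fs * n)
anti_noise = np.cos(2 * np.pi * f_noise / fs * n + np.pi)   # phase shifted by pi

# On time: the two waves cancel
residual_ideal = noise + anti_noise

# Late by half a period (24 samples): the inversion is undone.
# np.roll is a circular shift, valid here because the array holds whole cycles.
anti_noise_late = np.roll(anti_noise, 24)
residual_late = noise + anti_noise_late

print(np.max(np.abs(residual_ideal)))   # close to 0
print(np.max(np.abs(residual_late)))    # close to 2: the noise is doubled
```

Smaller delays give partial cancellation, and the tolerance shrinks as the noise frequency rises because the same delay becomes a larger fraction of the period.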
Q1. What does $T_s = 1/f_s$ represent physically? Compute $T_s$ for $f_s = 22{,}050$ Hz.
Q2. A microphone samples at $f_s = 48{,}000$ Hz. Compute $T_s$ in microseconds.
Q3. Why can't a neural network directly process the continuous voltage signal $x(t)$ from a microphone, even if the microphone has perfect quality?
A * np.cos(2 * np.pi * f / fs * n + phi) from memory given the formula
Every WAV file on your hard drive is a list of integers. Every MP3, every speech recognition input, every ECG trace: just numbers. Yet somehow, those numbers encode a guitar solo, a voice, or a heartbeat. The key is that any finite sequence of numbers can be written as a sum of cosine waves, and many real signals need only a handful of them.
A sine wave is like a single pure colour of light — it carries exactly one frequency. Just as mixing red, green, and blue light at different intensities creates any visible colour on your screen, mixing sine waves at different frequencies and amplitudes creates any sound. The Fourier synthesis formula is the mixing board: $A_k$ is how much of each colour you add.
In Python, a signal is a NumPy array. Each element x[n] is the value at integer index n. The fundamental building block is the discrete cosine.
Once the sample rate $f_s$ is fixed, three parameters ($A$, $f$, and $\varphi$) fully describe any sinusoidal component in a discrete signal:
$$x[n] = A \cos\!\left(\frac{2\pi f}{f_s} n + \varphi\right)$$
$A$ = peak amplitude | $f$ = frequency in Hz | $f_s$ = sample rate | $\varphi$ = phase offset (rad)
Background: 440 Hz is the international tuning standard for Concert A (A4). A pure digital tone at this pitch is a list of numbers computed by exactly this formula.
Problem: Write the discrete formula for a 440 Hz tone at amplitude 0.8, sampled at $f_s = 44{,}100$ Hz, with zero phase. Compute the values at $n = 0, 1, 2$.
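One way to check the hand calculation is to evaluate the formula directly in NumPy. This is a minimal sketch using the values stated in the problem; the variable names are illustrative.

```python
import numpy as np

A = 0.8          # peak amplitude
f = 440.0        # frequency (Hz)
fs = 44_100      # sample rate (Hz)
phi = 0.0        # phase offset (rad)

n = np.arange(3)                                  # n = 0, 1, 2
x = A * np.cos(2 * np.pi * f / fs * n + phi)      # x[n] = A cos(2*pi*f/fs * n + phi)

for idx, value in zip(n, x):
    print(f"x[{idx}] = {value:.5f}")
```

The values fall only slightly from sample to sample because one cycle of 440 Hz spans about 100 samples at this sample rate, so each step advances the phase by only about 0.063 rad.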
Using the same formula, compute $x[0]$ if $A = 1.5$ instead of $0.8$ (all other parameters unchanged).
Any periodic signal can be built by summing sinusoids at harmonic frequencies. This is Fourier's insight — the DNA of all DSP.
A periodic signal with fundamental frequency $f_0$ is the sum of harmonic components:
$$x[n] = \sum_{k=1}^{K} A_k \cos\!\left(\tfrac{2\pi k f_0}{f_s}\, n + \varphi_k\right)$$
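A square wave uses only odd harmonics ($k = 1, 3, 5, \ldots$) with amplitudes $A_k = \frac{4}{\pi k}$ and phases $\varphi_k = -\pi/2$, which turns each cosine into a sine: $x[n] = \sum_{k\,\text{odd}} \frac{4}{\pi k}\sin\!\left(\tfrac{2\pi k f_0}{f_s}\, n\right)$. Adding more harmonics sharpens the edges.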
Background: Digital clock pulses, PWM motor signals, and binary data lines are square waves. Understanding their harmonic content explains why cables need sufficient bandwidth.
Problem: Compute frequencies and amplitudes for the first three odd harmonics ($k=1,3,5$) of a 100 Hz square wave.
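The harmonic frequencies $k f_0$ and amplitudes $A_k = 4/(\pi k)$ asked for here can be tabulated in a few lines. This is a sketch for checking the arithmetic; the array names are illustrative.

```python
import numpy as np

f0 = 100.0                      # fundamental frequency (Hz)
ks = np.array([1, 3, 5])        # first three odd harmonics

freqs = ks * f0                 # harmonic frequencies k * f0
amps = 4 / (np.pi * ks)         # square-wave amplitudes A_k = 4 / (pi * k)

for k, fk, ak in zip(ks, freqs, amps):
    print(f"k={k}: f = {fk:.0f} Hz, A_k = {ak:.4f}")
```

The amplitudes decay only as $1/k$, which is slow: the fifth harmonic still carries a fifth of the fundamental's amplitude, and that is why a channel carrying square pulses needs bandwidth well above $f_0$.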
Compute the amplitude $A_7$ for the 7th harmonic ($k = 7$) of the same 100 Hz square wave.
When you synthesise a square wave by adding more and more odd harmonics, the corners get sharper, but the overshoot spike at each edge never disappears. Its height settles at approximately 9% of the full jump across the discontinuity; for a square wave swinging between $-1$ and $+1$ (a jump of 2) the peak reaches about 1.18. This is the Gibbs phenomenon: a mathematical consequence of approximating a discontinuity with a finite Fourier series, not a coding error.
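The persistence of the overshoot can be checked numerically. The sketch below assumes the sine-phase form of the square-wave series (amplitudes $4/(\pi k)$, odd $k$ only) evaluated on a dense phase grid rather than at audio sample points; the function name and the choice of cut-offs are illustrative.

```python
import numpy as np

def square_partial_sum(theta, highest_k):
    """Sum the odd harmonics 1, 3, ..., highest_k of a unit square wave."""
    x = np.zeros_like(theta)
    for k in range(1, highest_k + 1, 2):
        x += 4 / (np.pi * k) * np.sin(k * theta)
    return x

# Dense grid over one period so the narrow overshoot spike is resolved
theta = np.linspace(0.0, 2 * np.pi, 50_000, endpoint=False)

peaks = {}
for highest_k in (9, 49, 199):
    peaks[highest_k] = float(np.max(square_partial_sum(theta, highest_k)))
    print(f"harmonics up to k={highest_k}: peak = {peaks[highest_k]:.3f}")
```

The peak stays near 1.18 however many harmonics are added; what changes is the width of the spike, which narrows toward the edge.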
In digital audio engineering, this overshoot causes clipping artifacts in low-bitrate encoders. Engineers apply window functions (Hann, Hamming) to suppress it — which is exactly what we cover in Week 6.
The widget has sliders for $A$, $f$, and $\varphi$. Based on the formula $x[n] = A\cos\!\left(\tfrac{2\pi f}{f_s}n + \varphi\right)$ — before touching anything: if you double $f$ from 3 Hz to 6 Hz, does the wave get taller or do the cycles compress together? If you double $A$, do the cycles compress or does the peak height change?
Write your two predictions — then drag each slider to verify ↓
n = np.arange(N)                        # integer sample index
fs = 44100                              # sample rate (Hz)
A = 0.8                                 # peak amplitude
f = 440                                 # frequency (Hz)
phi = 0.0                               # phase offset (rad)
x = A * np.cos(2*np.pi*f/fs*n + phi)    # the full discrete signal
Every sound a computer can store or generate is a list of numbers built by summing sinusoids — and every DSP algorithm this semester is an operation on the parameters $A$, $f$, and $\varphi$ of those sinusoids.
When you export a song as MP3 at 128 kbps vs 320 kbps, you are controlling how much of its frequency content survives compression. A typical 128 kbps encoder discards most high-frequency components (roughly $f > 16$ kHz) by zeroing the corresponding coefficients in its frequency-domain representation (MP3 uses a modified discrete cosine transform, a close relative of the Fourier sum above). At 320 kbps, content up to $\sim 20$ kHz is typically preserved. Much of the perceptual quality difference you hear comes down to which coefficients survived quantization.
Q1. State the discrete sinusoid formula and identify what each of the four parameters controls.
Q2. Compute $x[0]$ and $x[1]$ for $A = 2.0$, $f = 880$ Hz, $f_s = 44{,}100$ Hz, $\varphi = 0$.
Q3. A square wave uses only odd harmonics with $A_k = 4/(\pi k)$. Why does adding more harmonics sharpen the edges but never eliminate the overshoot spike (about 9% of the jump)?
wavfile.write()
librosa.yin()
np.cos() → scale → wavfile.write() → librosa.load() → librosa.yin() → np.median()
You have been writing A * np.cos(2 * np.pi * f / fs * n) for the last twenty minutes. What if you could just play that line of code and hear what it sounds like, right inside your notebook? And what if you could then feed that sound to a real pitch detector and verify your math was correct?
Scaling a float array to int16 is like adjusting a photo's exposure before saving as JPEG: your camera captures raw light as floating-point values (0.0–1.0), but a JPEG pixel must be an integer from 0–255, so you multiply by 255 and clamp. We do exactly the same — multiply the float signal by 32,767 and cast to np.int16 — before writing the WAV file to disk.
librosa.yin() detects pitch within ±1 Hz of the true fundamental
A 16-bit WAV file is literally a list of integers written to disk. scipy.io.wavfile.write converts a NumPy float array into that binary format.
Three steps from equation to playable file:
$x[n] = A\cos\!\left(\frac{2\pi f}{f_s}n\right)$ for $n \in [0, f_s)$
int16 ← clip(x × 32767)
wavfile.write("out.wav", fs, int16)
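The steps above can be sketched end to end as follows. This is one possible arrangement, not the course's reference solution: the temporary-directory path and the file name `A4_sketch.wav` are assumptions for illustration, and the file is read back with `scipy.io.wavfile.read` only to confirm what was written.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

fs = 44_100                 # sample rate (Hz)
A, f = 0.8, 440.0           # amplitude and frequency of the tone

# Step 1: compute one second of the discrete cosine
n = np.arange(fs)
x = A * np.cos(2 * np.pi * f / fs * n)

# Step 2: scale to the int16 range, clipping before the cast as a safety net
samples = np.clip(x * 32767, -32768, 32767).astype(np.int16)

# Step 3: write the file (hypothetical file name in the system temp directory)
path = os.path.join(tempfile.gettempdir(), "A4_sketch.wav")
wavfile.write(path, fs, samples)

# Read it back to confirm the stored format
rate, data = wavfile.read(path)
print(rate, data.dtype, len(data), data[0])
```

The cast with `astype(np.int16)` truncates toward zero rather than rounding, which is why the first stored sample is 26213 and not 26214 for $0.8 \times 32767 = 26213.6$.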
Background: A 16-bit WAV file stores amplitude values as integers in the range $[-32768, +32767]$. We must scale our float array to this range before writing.
Problem: Generate 1 second of a 440 Hz tone at $f_s = 44{,}100$ Hz and save it as a 16-bit WAV file.
samples = (x * 32767).astype(np.int16)
wavfile.write("A4.wav", 44100, samples)
If amplitude $A = 1.0$ instead of $0.8$, what int16 integer does $x[0]$ become after scaling by 32,767?
Once we can generate audio, we can close the loop: load the file back, run a pitch detector, and verify the detected frequency matches our input.
librosa.yin() estimates the fundamental frequency $f_0$ frame-by-frame using the YIN algorithm. It works by computing the difference function:
$$d(\tau) = \sum_{j=1}^{W}\bigl(x[j] - x[j+\tau]\bigr)^2$$
The smallest non-zero lag $\tau$ at which $d(\tau) \approx 0$ is the period of the fundamental (at $\tau = 0$ the difference is trivially zero). The fundamental frequency is then $f_0 = f_s / \tau$.
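The difference function can be evaluated directly in NumPy to see where its minimum falls. This is a simplified sketch, not librosa's implementation: the full YIN algorithm also applies a cumulative-mean normalisation, a threshold, and parabolic interpolation, all omitted here. The function name, the window length of 2048, and the 200 to 800 Hz search range are assumptions for illustration.

```python
import numpy as np

def difference_function(x, max_lag, window):
    """d(tau) = sum over the window of (x[j] - x[j + tau])^2, for tau = 0 .. max_lag."""
    d = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        diff = x[:window] - x[tau:tau + window]
        d[tau] = np.sum(diff ** 2)
    return d

fs = 44_100
n = np.arange(fs // 10)                       # 0.1 s of signal (4410 samples)
x = 0.8 * np.cos(2 * np.pi * 440 / fs * n)    # synthetic 440 Hz tone

# Restrict the search to lags matching 200-800 Hz, which also skips the trivial tau = 0
min_lag = fs // 800
max_lag = fs // 200

d = difference_function(x, max_lag, window=2048)
tau = min_lag + int(np.argmin(d[min_lag:max_lag + 1]))
f0_estimate = fs / tau

print(tau, round(f0_estimate, 1))
```

Because this sketch only tests integer lags, the estimate lands on the nearest whole-sample period instead of the exact value $f_s / f_0$; sub-sample interpolation is what lets a full implementation get closer.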
Background: After saving a 440 Hz tone as a WAV file, we need to verify the synthesis was correct by running a pitch detector and comparing the result to our input frequency.
Problem: We synthesised a 440 Hz tone at $f_s = 44{,}100$ Hz. What period $\tau$ (in samples) should librosa.yin() detect?
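np.median(librosa.yin(y, fmin=200, fmax=800, sr=44100)) returns approximately 440 Hz, which matches our synthesis input. Pass sr explicitly: librosa.yin() assumes sr=22050 by default, which would report half the true frequency for a 44.1 kHz signal.
If we synthesised A3 (220 Hz, one octave lower) at the same $f_s = 44{,}100$ Hz, what period $\tau$ (in samples) should librosa.yin() detect?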
Myth: "The scale factor should be 32768, because 16-bit audio has $2^{15} = 32768$ positive levels."
Reality: A signed 16-bit integer (np.int16) has an asymmetric range: $[-32768, +32767]$. The maximum positive value is $32767$, not $32768$. Using 32768 as your scale factor risks producing a value of $+32768$ for a peak amplitude of $1.0$, which overflows to $-32768$: an audible crack at every peak of the waveform.
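The asymmetry is easy to confirm with `np.iinfo`. The sketch below checks which scale factor keeps a full-scale float signal inside the int16 range, and shows one defensive pattern (round, clip, then cast) for when a factor of 32768 is used anyway. It deliberately avoids casting an out-of-range float, because the result of that cast can differ between platforms and NumPy versions.

```python
import numpy as np

info = np.iinfo(np.int16)
print(info.min, info.max)            # the asymmetric int16 range

x = np.array([-1.0, 0.0, 1.0])       # full-scale float samples

scaled_32767 = x * 32767
scaled_32768 = x * 32768

# Does every scaled value fit inside the int16 range?
fits_32767 = bool(np.all((scaled_32767 >= info.min) & (scaled_32767 <= info.max)))
fits_32768 = bool(np.all((scaled_32768 >= info.min) & (scaled_32768 <= info.max)))
print(fits_32767, fits_32768)

# Defensive pattern: clip to the valid range before casting
safe = np.clip(np.round(scaled_32768), info.min, info.max).astype(np.int16)
print(safe)
```

Clipping costs one extra call and protects against any sample that strays outside $[-1, 1]$, not just the exact $+1.0$ case.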
The widget lets you toggle odd harmonics ($k = 1, 3, 5, 7$). Based on $x[n] = \sum_k \tfrac{4}{\pi k}\sin(2\pi k f_0 n/f_s)$, before clicking: if you activate only $k=1$ plus $k=3$, do you predict the waveform will look more square or more like a smooth sine wave, compared to $k=1$ alone? Why does adding more odd harmonics make the shape squarer?
Make your prediction — then toggle the checkboxes to verify ↓
Synthesis is three steps: compute $x[n]$, scale by 32,767, save with wavfile.write() — verification is one line: np.median(librosa.yin(y, ...)) must return your original frequency within ±1 Hz.
Modern neural TTS engines such as WaveNet and VITS end in the same place we just did: an array of waveform samples. Rather than summing explicit sinusoids, the model predicts the samples directly from a text transcript (WaveNet one sample at a time, VITS through a neural decoder). The resulting array is converted to 16-bit integers and streamed to the speaker. The perceptual naturalness of a synthetic voice depends heavily on how accurately those samples reproduce the harmonic structure ($A_k$, $f_k$, $\varphi_k$) of real speech.
Q1. Why do we scale a float signal by 32,767 rather than 32,768 before casting to np.int16?
Q2. A tone is synthesised at $f_0 = 261.63$ Hz (middle C) with $f_s = 44{,}100$ Hz. What period $\tau$ (in samples) should librosa.yin() detect?
Q3. Why do we use np.median(f0) rather than np.mean(f0) when reporting the detected pitch?
librosa.yin() returns a frame-by-frame estimate. At the start and end of the signal (partial windows), it may return outlier values. The median is robust to these outliers; the mean would be pulled toward them.
Week 1 Recap
Every concept from today connects to a real tool you will use this semester — and for the rest of your career.
Digital signals are integer-indexed sequences, not continuous curves. The sample rate $f_s$ is the single most important parameter in your code — it defines what "time" means to your algorithm.
Any periodic signal decomposes into harmonics. A 100 Hz square wave is $\sum_{k\text{ odd}} \frac{4}{\pi k}\sin(2\pi k f_0/f_s \cdot n)$. The Gibbs overshoot (about 9% of the jump) never disappears: that is mathematics, not a bug.
np.cos() → scale to int16 → wavfile.write() → librosa.yin(). np.median(f0) should return your original frequency within ±1 Hz.
Written by Allen B. Downey. An excellent, code-first introduction to digital signal processing principles using Python sequences and operations.
Written by Steven W. Smith. A comprehensive, free textbook explaining sampling limits, discretization math, and practical DSP applications.
Written by Richard G. Lyons. Widely considered the gold standard for clear, intuitive mathematical explanations of sampling, complex math, and filters.
The official scientific programming manual for SciPy's signal processing toolkit, detailing Butterworth filters, spectrograms, and FFT parameters.
Practice
Rigorous problem sets covering mathematical derivations and functional coding tasks, followed by advanced synthesis applications.
Classify each scenario as a continuous-time signal $x(t)$ or a discrete-time signal $x[n]$, with a brief physical justification:
(a) The electrical voltage output of an analog microphone capturing speech.
(b) Ambient server-room temperature logged once every 10 minutes.
(c) An ECG trace scanned at 500 samples/second by a digital monitor.
(d) Barometric pressure measured by a physical aneroid gauge at sea level.
Implement the first two stages of the AI data pipeline in Python.
(a) Create integer indices n = np.arange(0, 100) representing 100 discrete samples at $f_s = 100$ Hz.
(b) Compute the discrete signal $x[n] = 4\cos(2\pi \times 5 / 100 \times n)$.
(c) Print the first 5 values and verify that len(x) equals 100.
(d) What does the ratio $f/f_s = 5/100 = 0.05$ represent physically?
Hint: use np.arange(100) for indices. The ratio $f/f_s$ is the normalized frequency: the fraction of a full cycle completed per sample. At 0.05, exactly 5 full cycles fit into 100 samples.
A discrete signal is defined as $x[n] = 3\cos(2\pi \times 220 / 44100 \times n + \pi/4)$.
(a) Identify the amplitude $A$, frequency $f$, sample rate $f_s$, and phase $\varphi$.
(b) How many samples does one complete cycle occupy? (Hint: $N = f_s / f$)
(c) Compute $x[0]$ and $x[1]$ numerically to 3 decimal places.
(d) A4 is 440 Hz. Write the formula for the signal one octave lower (A3 = 220 Hz, same amplitude but zero phase).
Synthesize a 100 Hz square wave using the first five odd harmonics ($k = 1, 3, 5, 7, 9$) at $f_s = 44{,}100$ Hz.
(a) Create n = np.arange(0, fs) for one second of samples.
(b) Compute $x[n] = \sum_{k \in \{1,3,5,7,9\}} \frac{4}{\pi k}\sin(2\pi k \times 100 / f_s \times n)$.
(c) Print the peak amplitude and verify it is approximately 1.18 (Gibbs overshoot).
(d) How does the waveform change if you include every odd harmonic up to $k = 99$? Describe without running.
Hint: use (4/(np.pi*k)) * np.sin(2*np.pi*k*100/fs*n) for each odd $k$. With 5 harmonics, np.max(np.abs(x)) ≈ 1.18. With harmonics up to $k=99$ the wave looks much more square, and the overshoot narrows to thin edge spikes that still peak near 1.18 (about 9% of the jump from $-1$ to $+1$).
A 16-bit WAV file stores amplitude values as integers in $[-32768, +32767]$.
(a) A float array has values in $[-1.0, +1.0]$. What is the correct integer after scaling by 32767? Compute the integer for $x = 0.5$.
(b) Calculate the file size in kilobytes for 3 seconds of mono audio at $f_s = 44{,}100$ Hz with 16-bit depth.
(c) A tone at 261.63 Hz is middle C (C4). How many samples does one cycle occupy at $f_s = 44{,}100$ Hz?
(d) Why is the scale factor 32767 rather than 32768?
Build the complete synthesis-to-analysis loop in Python.
(a) Generate $x[n] = 0.8\cos(2\pi \times 440 / 44100 \times n)$ for 1 second at $f_s = 44{,}100$ Hz.
(b) Save it as "A4.wav" using scipy.io.wavfile.write after scaling to int16.
(c) Load the file with librosa.load("A4.wav", sr=44100) and run librosa.yin(y, fmin=200, fmax=800, sr=44100).
(d) Print np.median(f0) and verify it equals approximately 440.0 Hz.
Hint: scale with (x * 32767).astype(np.int16). librosa.load returns a float32 array normalized to $[-1, 1]$, so there is no need to rescale. np.median(f0) should return very close to 440.0 Hz (±1 Hz is acceptable).
Design a keyword-spotting system that detects the word "Hey" in live microphone audio.
(a) Write the discrete signal model $x[n]$ for speech sampled at $f_s = 16{,}000$ Hz. What amplitude range is typical for a 16-bit microphone?
(b) The speech band spans 80–3400 Hz. What minimum sample rate satisfies Nyquist? Propose a practical $f_s$ and justify your choice.
(c) If the system stores 30 seconds of audio in a 16-bit buffer, calculate the buffer size in MB.
(d) Sketch the four-stage pipeline: Capture → Preprocess → Feature Extract → Classify. For each stage, name the Python function you would use.
Hint: sounddevice.rec() → scipy.signal.butter() → librosa.yin() → model.predict().
Two guitar strings sound at 440 Hz and 444 Hz simultaneously. Build the full analysis pipeline:
(a) Generate both discrete signals at $f_s = 44{,}100$ Hz for 2 seconds, then compute their sum $x[n]$.
(b) Using the identity $\cos A + \cos B = 2\cos\!\left(\frac{A-B}{2}\right)\cos\!\left(\frac{A+B}{2}\right)$, predict the beat frequency and carrier frequency analytically.
(c) Verify computationally: find the envelope of $x[n]$ using np.abs(scipy.signal.hilbert(x)) and measure the time between successive amplitude peaks to confirm your predicted beat period.
(d) Compute $\text{SNR}_{dB} = 10\log_{10}(P_s/P_n)$ of the composite signal against a noise floor of $\sigma = 0.01$.
Hint: use scipy.signal.find_peaks to measure envelope peaks. Signal power $\approx 1$; SNR $\approx 10\log_{10}(1/0.0001) = 40$ dB.