Signal Processing Foundations — Week 1

Course Overview & Foundations

From raw data to AI-ready features — master the mathematics behind every system that hears, sees, and understands the continuous physical universe.

Continuous vs. Discrete · Sine Waves · Fourier Synthesis · AI Data Pipeline · Real → Digital
Why Signal Processing? & The End-to-End AI Pipeline
Before a neural network can classify an audio clip, detect a pedestrian, or parse medical data, physical waveforms must be converted, filtered, and compressed into mathematical features.
After this section you will be able to
  • Classify any real-world signal as continuous $x(t)$ or discrete $x[n]$ and justify the classification with a physical reason
  • Explain the role of each stage in the five-stage AI data pipeline: Physical World → ADC Sampling → DSP Features → AI Model → Prediction
  • Calculate the sampling interval $T_s = 1/f_s$ for any sample rate and state what it represents physically

Have you ever asked Shazam to recognise a song in a noisy café, or watched a noise-cancelling headphone silently erase the roar of an airplane engine? Both start with the same invisible step — converting a physical wave into numbers — before any "smart" algorithm can even begin.

🎯
Why this matters: Signal processing is the mandatory first layer in every AI system that touches the physical world — audio, video, medical sensors, radar. Without it, even the most powerful neural network receives nothing but meaningless electrical noise.
🏭
Think of it this way

A signal pipeline is like a film production line: raw footage captured on set (x(t)) passes through editing (ADC sampling), colour grading (DSP features), and a final cut review (AI model) before the audience ever sees it. Skip any stage and the film is unwatchable — your neural network is the audience, and it cannot start until every upstream stage delivers its output cleanly.

What is a Signal? — $x(t)$ vs $x[n]$

A signal is any quantity that varies over time (or space) and carries information — a microphone voltage, a blood-pressure reading, the brightness of a camera pixel. In the physical world these are continuous-time signals $x(t)$: defined for every real instant $t \in \mathbb{R}$, with infinitely many values. Every digital device instead stores a discrete-time signal $x[n]$: a list of numbers at integer indices $n \in \mathbb{Z}$ only.

The bridge is sampling: $x[n] = x(n \cdot T_s)$, where $T_s = 1/f_s$ is the time gap between measurements. At CD quality ($f_s = 44{,}100$ Hz), $T_s \approx 22.7\ \mu\text{s}$: one snapshot roughly every 22.7 microseconds. The pipeline diagram below shows exactly where this conversion happens.
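As a quick numerical check of $T_s = 1/f_s$, the sketch below tabulates the sampling interval for a few common sample rates. The helper name `sampling_interval_us` is illustrative only and is not part of any library.

```python
def sampling_interval_us(fs_hz: float) -> float:
    """Return the sampling interval T_s = 1/f_s in microseconds."""
    return 1.0 / fs_hz * 1e6


# A few common audio sample rates (Hz)
rates_hz = [8_000, 16_000, 22_050, 44_100, 48_000]
intervals = {fs: sampling_interval_us(fs) for fs in rates_hz}

for fs, ts in intervals.items():
    print(f"f_s = {fs:>6} Hz  ->  T_s = {ts:6.1f} microseconds")
```

Doubling the sample rate halves the gap between samples, which is why 16 kHz speech audio has twice the sampling density of 8 kHz telephone audio.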

[Figure: AI data pipeline. Physical World, $x(t)$ continuous wave (Week 1) → ADC Sampling, $x[n]$ integer sequence (Week 2) → DSP / Features, $\mathbf{h} \in \mathbb{R}^n$ feature vector (Weeks 3–7) → AI Model, $f(\mathbf{h})$ neural network (Weeks 9–11) → Prediction, $\hat{y}$ "speech" / class (Weeks 12–14)]
  • 44,100 samples / sec: CD-quality audio, the standard since Sony/Philips in 1980
  • 65,536 amplitude levels: 16-bit quantization, 2¹⁶ discrete integer values
  • 15 μs ANC loop budget: max DSP latency before anti-noise arrives late and amplifies noise
  • 22.7 μs per sample: time gap $T_s = 1/f_s$ between consecutive samples at CD quality, the physical meaning of discrete time
Problem

Engineering the Input Layer of AI

In computer science, we often treat data as static databases or text files. However, the physical world communicates in continuous waves of air pressure, light, electrical voltages, and mechanical vibrations. AI models cannot process these directly.

Digital Signal Processing (DSP) is the science of translating this continuous physical reality into optimal, low-noise digital arrays. It is the crucial "feature engineering" layer that ensures machine learning models receive clean, highly expressive feature vectors.

The Mathematical Transformation Chain

An end-to-end AI system mapping physical world events to decisions represents a sequence of mathematical function applications:

$$x(t) \xrightarrow{\text{ADC}} x[n] \xrightarrow{\text{DSP}} y[n] \xrightarrow{\text{Feature Extr.}} \mathbf{h} \xrightarrow{\text{AI Model}} \hat{y}$$

📝 Worked Example — Tracing a Speech Recognition Pipeline

Background & Purpose: Human speech travels through the air as continuous variations in pressure. For a neural network like Whisper to recognize spoken words, this high-dimensional wave must be digitized, transformed into the frequency domain, mapped to a perceptual scale, and tokenized.

Problem: A user speaks the word "AI" into a microphone. Trace how the acoustic signal $x(t)$ transforms through a 16 kHz digital audio pipeline into a prediction $\hat{y}$.

1. Continuous Capture ($x(t)$):
The sound wave creates electrical voltage changes in the microphone. This is modeled as a continuous function:
$x(t) = f(t), \quad t \in [0, T] \subset \mathbb{R}$
where $t$ represents absolute physical time.
2. Sampling & Quantization ($x[n]$):
An Analog-to-Digital Converter (ADC) samples $x(t)$ at a sampling rate $f_s = 16{,}000$ Hz. The sampling interval is:
$T_s = \frac{1}{f_s} = \frac{1}{16000} = 62.5\ \mu\text{s}$
This yields the sequence $x[n] = x(n \cdot 62.5 \times 10^{-6}),\ n \in \mathbb{Z}$. Each sample value is quantized to a 16-bit integer (representing one of 65,536 levels).
3. Spectral Processing ($y[n]$) (preview, covered Weeks 4–6):
The sampled sequence is cut into short 25 ms windows. A frequency transform (the Discrete Fourier Transform, derived in Week 4) is applied to each window to reveal which frequencies are present at each moment. The output shifts our view from "amplitude changing over time" to "frequency intensity changing over time." Don't worry about the formula yet; the concept is that we zoom into frequency content.
4. Perceptual Feature Mapping ($\mathbf{h}$) (preview, covered Week 10):
A bank of overlapping filters compresses the frequency data to match how human hearing works: our ears perceive pitch on a logarithmic scale, not a linear one. The result is a compact feature vector $\mathbf{h} \in \mathbb{R}^{80}$ (called a Mel-spectrogram slice). We derive the filter math in Week 10.
5. AI Model Inference ($\hat{y}$):
The Mel-spectrogram grid $\mathbf{h}$ is passed through a convolutional encoder and transformer decoder. The model outputs token probabilities, resulting in the prediction:
$\hat{y} = \text{"AI"}$
Result: Physical pressure $x(t) \to$ 16 kHz discrete $x[n] \to$ Mel-spectrogram vector $\mathbf{h} \in \mathbb{R}^{80} \to$ Neural Network $\to$ text token "AI".
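Steps 2 and 3 of the trace above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a production front-end: a 200 Hz test tone stands in for the microphone signal, the clip length is 0.5 s, and the 25 ms windows are non-overlapping and untapered (real pipelines usually overlap and taper them).

```python
import numpy as np

fs = 16_000                        # sample rate (Hz), as in the worked example
duration_s = 0.5                   # assumed clip length for this sketch
n = np.arange(int(fs * duration_s))

# Stand-in for the microphone signal: a 200 Hz tone (assumption, not real speech)
x = 0.6 * np.cos(2 * np.pi * 200 / fs * n)

# Step 2: quantize to 16-bit integers
x_int16 = np.round(x * 32767).astype(np.int16)

# Step 3: cut into 25 ms windows (400 samples at 16 kHz), one DFT per window
frame_len = fs * 25 // 1000
num_frames = len(x_int16) // frame_len
frames = x_int16[: num_frames * frame_len].reshape(num_frames, frame_len)
spectrum = np.abs(np.fft.rfft(frames, axis=1))   # shape: (windows, frequency bins)

# Frequency (Hz) of the strongest bin in the first window
freqs = np.fft.rfftfreq(frame_len, d=1 / fs)
peak_hz = freqs[np.argmax(spectrum[0])]
print(frames.shape, spectrum.shape, peak_hz)
```

Each row of `spectrum` is one "frequency intensity" snapshot; stacking the rows over time gives the spectrogram that step 4 later compresses into Mel features.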
✔ Quick Check

Phone-call audio uses $f_s = 8{,}000$ Hz instead of 16,000 Hz. What is $T_s$ (the time gap between consecutive samples) in microseconds?

$T_s = 1/8000 = 125\ \mu\text{s}$ — double the gap of 16 kHz, so half the sampling density.
⚠️
Common Mistake

Myth: "A digital signal is just a smooth analog signal that got rounded off — the original wave is still there between samples."

Reality: Once sampled, the values between integer indices do not exist in the digital domain. They can be reconstructed mathematically (interpolation), but only if the Nyquist condition is met. If the signal contains frequencies above $f_s/2$, those frequencies are permanently aliased — no reconstruction can recover them.
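A small numerical sketch makes the aliasing claim concrete. The sample rate and the two tone frequencies below are assumptions chosen for illustration: at $f_s = 100$ Hz, a 70 Hz cosine (above $f_s/2$) produces exactly the same samples as a 30 Hz cosine.

```python
import numpy as np

fs = 100                      # sample rate (Hz); the Nyquist limit is fs/2 = 50 Hz
n = np.arange(fs)             # one second of sample indices

x_high = np.cos(2 * np.pi * 70 / fs * n)   # 70 Hz: above the Nyquist limit
x_low = np.cos(2 * np.pi * 30 / fs * n)    # 30 Hz: below the Nyquist limit

# Sample for sample, the two sequences contain the same numbers
indistinguishable = np.allclose(x_high, x_low)
print(indistinguishable)
```

Because the two arrays are identical, nothing downstream of the ADC can tell which tone was actually present; the 70 Hz information is gone.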

Solution
🤔 Pause & Predict

The widget shows coloured dots flowing through 4 stages: Wave x(t) → Samples x[n] → Features h → Classification ŷ. Before dragging the speed slider, make a prediction: will all stages speed up uniformly, or will one stage appear as a visual bottleneck that the dots queue behind?

Form your prediction first — then drag the Transmission Speed slider to verify ↓

Try It: Visualizing the AI Data Pipeline Flow

Slide the speed controller to watch raw physical sound waves flow through sampling, digital filtering, spectral feature grids, and neural classification nodes — mirroring the chain $x(t)\!\xrightarrow{\text{ADC}}\!x[n]\!\xrightarrow{\text{DSP}}\!\mathbf{h}\!\xrightarrow{\text{AI}}\!\hat{y}$.

Implementation
Python · NumPy — Continuous & Discrete Sampling
import numpy as np

# Step 1: Simulate 1 second of continuous time at high resolution
t_continuous = np.linspace(0, 1, 100000)
x_t = np.sin(2 * np.pi * 5 * t_continuous)  # 5 Hz physical wave

# Step 2: Simulate physical sampling at 100 Hz
fs = 100
n = np.arange(0, fs)                        # Integer sample indices
t_sampled = n / fs                          # Sampling instants: n * Ts
x_n = np.sin(2 * np.pi * 5 * t_sampled)     # Discrete sequence x[n]
Output
Key Takeaway

Every AI system that touches the physical world is only as good as its DSP front-end — $x(t)$ must become a clean $x[n]$ before any neural network can start processing.

🗣️
Real-World Application

Real-Time Active Noise Cancellation (ANC) in Wireless Earbuds

ANC systems in modern wireless earbuds capture continuous external ambient noise waves $x(t)$ using miniature, high-sensitivity microphones. An ultra-low latency DSP chip samples the noise, computes the exact inverted phase sequence ($\phi \leftarrow \phi + \pi$, representing a $180^\circ$ phase shift), and drives the speaker cone to emit destructive anti-noise. This entire capture-compute-emit loop must execute in under **15 microseconds**. If processing delay exceeds this threshold, the anti-noise wave arrives late, causing constructive interference that amplifies the noise rather than canceling it.
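The effect of a late anti-noise signal can be sketched with a single tone. The sketch assumes a 1 kHz noise tone and a 1 MHz time grid standing in for continuous time; the function names are illustrative. It shows that the leftover sound grows as the inverted copy arrives later, and that at half a period of delay the "anti-noise" doubles the noise instead of removing it.

```python
import numpy as np

fs = 1_000_000                 # fine time grid (1 MHz) standing in for continuous time
f_noise = 1_000                # assumed 1 kHz noise tone
t = np.arange(2_000) / fs      # 2 ms of signal (two full periods)


def noise(time_s):
    """Unit-amplitude noise tone evaluated at the given times."""
    return np.cos(2 * np.pi * f_noise * time_s)


def residual_peak(delay_s: float) -> float:
    """Peak of noise + anti-noise when the anti-noise is emitted delay_s late."""
    anti_noise = -noise(t - delay_s)   # inverted copy (phase shifted by pi), delayed
    return float(np.max(np.abs(noise(t) + anti_noise)))


on_time = residual_peak(0.0)                      # perfect cancellation
slightly_late = residual_peak(15e-6)              # 15 microseconds late
half_period_late = residual_peak(0.5 / f_noise)   # 500 microseconds late
print(on_time, slightly_late, half_period_late)
```

For a pure tone the residual amplitude is $2\,|\sin(\pi f \cdot \text{delay})|$, so the same delay hurts more at higher frequencies.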

✦ Checkpoint Check Your Understanding — Why Signal Processing?

Q1. What does $T_s = 1/f_s$ represent physically? Compute $T_s$ for $f_s = 22{,}050$ Hz.

Answer: $T_s$ is the time gap between consecutive samples. $T_s = 1/22050 \approx 45.4\ \mu\text{s}$ — one sample every 45.4 microseconds.

Q2. A microphone samples at $f_s = 48{,}000$ Hz. Compute $T_s$ in microseconds.

Answer: $T_s = 1/48000 \approx 20.8\ \mu\text{s}$ — one sample every 20.8 microseconds.

Q3. Why can't a neural network directly process the continuous voltage signal $x(t)$ from a microphone, even if the microphone has perfect quality?

Answer: Neural networks operate on finite arrays of numbers. $x(t)$ is a continuous function with infinitely many values — it must first be sampled by an ADC to produce a finite integer sequence $x[n]$ before any digital computation can begin.
Sine Wave Anatomy & Fourier Synthesis
Every signal your computer stores is just a list of numbers, and every such list can be expressed as a sum of sine waves. This section gives you the two fundamental tools: the discrete sinusoid formula and Fourier synthesis.
After this section you will be able to
  • Compute $x[n] = A\cos\!\left(\tfrac{2\pi f}{f_s}n + \varphi\right)$ at any sample index $n$ by hand, given numerical values for all four parameters
  • Write the NumPy one-liner A * np.cos(2 * np.pi * f / fs * n + phi) from memory given the formula
  • Predict how changing $A$, $f$, or $\varphi$ independently affects the waveform's amplitude, cycle density, or starting position

Every WAV file on your hard drive is a list of integers. Every MP3, every speech recognition input, every ECG trace: just numbers. Yet somehow, those numbers encode a guitar solo, a voice, or a heartbeat. The key is that any finite sequence of numbers can be built by adding together cosine waves of suitable amplitude, frequency, and phase.

🎯
Why this matters: In Topic 1 we saw that the DSP box transforms $x[n]$ into feature vector $\mathbf{h}$ that feeds the AI model. The fundamental building block inside that box is the sinusoid — so this section is where we zoom in. Every DSP algorithm this semester — FFT, filters, Mel-spectrograms, wavelets — is ultimately an operation on the parameters $A$, $f$, and $\varphi$ of these sinusoids. Master the formula now and every later algorithm slots into place naturally.
🎨
Think of it this way

A sine wave is like a single pure colour of light — it carries exactly one frequency. Just as mixing red, green, and blue light at different intensities creates any visible colour on your screen, mixing sine waves at different frequencies and amplitudes creates any sound. The Fourier synthesis formula is the mixing board: $A_k$ is how much of each colour you add.

  • $A$ (Amplitude): peak height of the wave; controls loudness or signal strength. E.g. 0.8 = 80% of full volume
  • $f$ (Frequency): how many complete cycles occur per second; controls pitch. E.g. 440 Hz = Concert A
  • $f_s$ (Sample Rate): how many samples are taken per second; defines the time ruler. E.g. 44,100 Hz = CD quality
  • $\varphi$ (Phase): where in its cycle the wave starts at $n = 0$, in radians. E.g. $\varphi = 0$ starts at the peak of the cosine; $\varphi = \pi/2$ starts at a zero crossing
  • $n$ (Sample Index): integer index into the array; the discrete "time" variable. E.g. n = 0, 1, 2, … 44099
  • 440 Hz (Concert A): international tuning standard since 1939 (ISO 16)
  • 100.2 samples / cycle: at $f_s = 44{,}100$, one 440 Hz cycle occupies $44100/440$ samples
  • ~9% Gibbs overshoot: a Fourier square-wave approximation always overshoots by ≈ 9% of the jump height, no matter how many harmonics you add
  • $2\pi$ rad per cycle: one full cosine cycle always spans exactly $2\pi$ radians, the link between frequency and angle
Problem

The Discrete Sinusoid x[n]

In Python, a signal is a NumPy array. Each element x[n] is the value at integer index n. The fundamental building block is the discrete cosine:

Discrete Sinusoid Formula

Three parameters ($A$, $f$, $\varphi$) fully describe any sinusoidal component in a discrete signal; the sample rate $f_s$ fixes the time scale:

$$x[n] = A \cos\!\left(\frac{2\pi f}{f_s} n + \varphi\right)$$

$A$ = peak amplitude  |  $f$ = frequency in Hz  |  $f_s$ = sample rate  |  $\varphi$ = phase offset (rad)

📝 Worked Example — Generating a 440 Hz Concert A

Background: 440 Hz is the international tuning standard for Concert A (A4). Every digital audio file stores this note as a list of numbers computed exactly by this formula.

Problem: Write the discrete formula for a 440 Hz tone at amplitude 0.8, sampled at $f_s = 44{,}100$ Hz, with zero phase. Compute the values at $n = 0, 1, 2$.

1. Identify parameters:
$A = 0.8$, $f = 440$ Hz, $f_s = 44{,}100$ Hz, $\varphi = 0$
2. Compute the normalized frequency:
$\frac{f}{f_s} = \frac{440}{44100} \approx 0.009977$ cycles per sample
3. Evaluate at $n = 0$:
$x[0] = 0.8\cos(0) = 0.8 \times 1 = 0.800$
4. Evaluate at $n = 1$ and $n = 2$:
$x[1] = 0.8\cos(2\pi \times 0.009977) = 0.8\cos(0.06269) \approx 0.798$
$x[2] = 0.8\cos(0.12538) \approx 0.794$
Result: the array starts [0.800, 0.798, 0.794, ...], and one cycle completes every 100.2 samples (= 44100/440).
✔ Quick Check

Using the same formula, compute $x[0]$ if $A = 1.5$ instead of $0.8$ (all other parameters unchanged).

$x[0] = 1.5 \cdot \cos(0) = 1.5 \cdot 1 = 1.5$
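The hand calculation in the worked example can be checked directly in NumPy. This is a minimal sketch that evaluates the discrete sinusoid formula at $n = 0, 1, 2$ with the same parameter values.

```python
import numpy as np

A, f, fs, phi = 0.8, 440, 44_100, 0.0
n = np.arange(3)                                   # n = 0, 1, 2
x = A * np.cos(2 * np.pi * f / fs * n + phi)       # discrete sinusoid formula

samples_per_cycle = fs / f                         # how many samples one cycle spans
print(np.round(x, 3), round(samples_per_cycle, 1))
```

Changing `A` rescales every value by the same factor, while changing `f` or `phi` changes the angle fed to the cosine.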

Fourier Synthesis

Any periodic signal can be built by summing sinusoids at harmonic frequencies. This is Fourier's insight — the DNA of all DSP.

Sum of Harmonics

A periodic signal with fundamental frequency $f_0$ is the sum of harmonic components:

$$x[n] = \sum_{k=1}^{K} A_k \cos\!\left(\tfrac{2\pi k f_0}{f_s}\, n + \varphi_k\right)$$

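A square wave uses only odd harmonics ($k = 1, 3, 5, \ldots$) with amplitudes $A_k = \frac{4}{\pi k}$. In this cosine form the sign alternates from one odd harmonic to the next ($+, -, +, \ldots$), which is the same as setting $\varphi_k = \pi$ for $k = 3, 7, 11, \ldots$ and $\varphi_k = 0$ otherwise. Adding more harmonics sharpens the edges.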

📝 Worked Example — Synthesizing a 100 Hz Square Wave (3 terms)

Background: Digital clock pulses, PWM motor signals, and binary data lines are square waves. Understanding their harmonic content explains why cables need sufficient bandwidth.

Problem: Compute frequencies and amplitudes for the first three odd harmonics ($k=1,3,5$) of a 100 Hz square wave.

1. Fundamental ($k=1$):
$f_1 = 1 \times 100 = 100$ Hz, $A_1 = 4/(\pi \times 1) \approx 1.273$
2. 3rd harmonic ($k=3$):
$f_3 = 300$ Hz, $A_3 = 4/(3\pi) \approx 0.424$ (exactly $1/3$ of fundamental)
3. 5th harmonic ($k=5$):
$f_5 = 500$ Hz, $A_5 = 4/(5\pi) \approx 0.255$ (exactly $1/5$ of fundamental)
Three-term approximation, with $\omega_1 = 2\pi \cdot 100/f_s$ and the sign alternating between successive odd harmonics in the cosine form: $$\begin{aligned} x[n] &\approx 1.273\cos(\omega_1 n) \\ &- 0.424\cos(3\omega_1 n) \\ &+ 0.255\cos(5\omega_1 n) \end{aligned}$$
✔ Quick Check

Compute the amplitude $A_7$ for the 7th harmonic ($k = 7$) of the same 100 Hz square wave.

$A_7 = \dfrac{4}{\pi \times 7} \approx 0.182$ — exactly $\tfrac{1}{7}$ of the fundamental's amplitude.
💡
Key Insight — The Gibbs Phenomenon

When you synthesise a square wave by adding more and more odd harmonics, the corners get sharper, but the overshoot spike at each edge never disappears. It settles at approximately 9% of the jump height (the full step from $-1$ to $+1$), so the peak stays near 1.18 however many harmonics you add. This is the Gibbs phenomenon: a mathematical consequence of approximating a discontinuity with a finite Fourier series, not a coding error.

In digital audio engineering, this overshoot causes clipping artifacts in low-bitrate encoders. Engineers apply window functions (Hann, Hamming) to suppress it — which is exactly what we cover in Week 6.
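The Gibbs overshoot can be measured numerically. The sketch below sums odd cosine harmonics with amplitudes $4/(\pi k)$ and alternating signs ($+, -, +, \ldots$), which is the form in which cosine terms add up to a square wave, and evaluates one period on a dense grid of angles. The grid size and the function name are assumptions made for this illustration.

```python
import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 200_001)   # one period on a dense grid


def square_partial_sum(k_max: int) -> np.ndarray:
    """Sum odd harmonics k = 1, 3, ..., k_max with alternating signs."""
    total = np.zeros_like(theta)
    for k in range(1, k_max + 1, 2):
        sign = (-1) ** ((k - 1) // 2)            # +1 for k = 1, 5, 9...; -1 for k = 3, 7...
        total += sign * (4.0 / (np.pi * k)) * np.cos(k * theta)
    return total


peaks = {k_max: float(np.max(square_partial_sum(k_max))) for k_max in (9, 99, 999)}
for k_max, peak in peaks.items():
    print(f"harmonics up to k = {k_max:>3}: peak = {peak:.4f}")
```

The peak should settle near 1.18 rather than falling toward 1.0: the spike gets narrower as harmonics are added, but not shorter.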

Solution
🤔 Pause & Predict

The widget has sliders for $A$, $f$, and $\varphi$. Based on the formula $x[n] = A\cos\!\left(\tfrac{2\pi f}{f_s}n + \varphi\right)$ — before touching anything: if you double $f$ from 3 Hz to 6 Hz, does the wave get taller or do the cycles compress together? If you double $A$, do the cycles compress or does the peak height change?

Write your two predictions — then drag each slider to verify ↓

Try It: Sine Wave Parameter Tuner

Drag the sliders to change $A$, $f$, and $\varphi$ and watch the equation $x[n] = A\cos\!\left(\tfrac{2\pi f}{f_s}n + \varphi\right)$ update live — before writing a single line of code.

Slider defaults: $A$ = 2.0, $f$ = 3 Hz, $\varphi$ = 0.00 rad
x[n] = 2.0·cos(2π·3·n/fs + 0.00)
Live Calculation: Equation Substitution
x[n] = A · cos(2π × f/44100 × n + φ) = 2.0 · cos(2π × 3/44100 × n + 0.00) = 2.0 · cos(0.000427 × n + 0.00)
n = 0: 2.0 × cos(0.0000) = 2.0 × 1.0000 = 2.0000
n = 1: 2.0 × cos(0.0004) = 2.0 × 1.0000 = 2.0000
n = 3675 (T/4): 2.0 × cos(1.5708) = 2.0 × 0.0000 = 0.0000
Implementation
Math → Python: Direct Symbol Mapping
$n$ → n = np.arange(N) (integer sample index)
$f_s$ → fs = 44100 (sample rate, Hz)
$A$ → A = 0.8 (peak amplitude)
$f$ → f = 440 (frequency, Hz)
$\varphi$ → phi = 0.0 (phase offset, rad)
$x[n]$ → x = A * np.cos(2*np.pi*f/fs*n + phi) (the full discrete signal)
Python · NumPy — Discrete Sinusoid & Fourier Synthesis
import numpy as np

fs = 44100                 # sample rate (Hz)
n = np.arange(0, fs)       # integer indices 0..44099

# ── Discrete sinusoid: 440 Hz Concert A ──────────────────────
A, f, phi = 0.8, 440, 0.0
x_tone = A * np.cos(2 * np.pi * f / fs * n + phi)

# ── Fourier synthesis: 100 Hz square wave (5 harmonics) ──────
# With cosine terms the sign must alternate (+, -, +, ...) between
# successive odd harmonics; all-positive cosines do not form a square wave.
f0 = 100
x_sq = np.zeros(fs)
for k in [1, 3, 5, 7, 9]:
    sign = (-1) ** ((k - 1) // 2)
    x_sq += sign * (4 / (np.pi * k)) * np.cos(2 * np.pi * k * f0 / fs * n)
Output
Key Takeaway

Every sound a computer can store or generate is a list of numbers built by summing sinusoids — and every DSP algorithm this semester is an operation on the parameters $A$, $f$, and $\varphi$ of those sinusoids.

🎵
Real-World Application

Audio Codec Quality Settings & Bitrate Tradeoffs

When you export a song as MP3 at 128 kbps vs 320 kbps, you are controlling how much of the signal's frequency content survives compression. A 128 kbps encoder typically discards high-frequency components (roughly $f > 16$ kHz), in effect zeroing the corresponding $A_k$ coefficients in its frequency-domain representation. At 320 kbps, content up to $\sim 20$ kHz is usually preserved. The perceptual quality difference you hear is largely the difference in which $A_k$ values survived quantization.
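The idea of zeroing high-frequency coefficients can be illustrated with a plain FFT. This is a brick-wall sketch of the concept only, not how an MP3 encoder is implemented (real encoders use a psychoacoustic model and a different transform). The two test tones are assumptions; the 16 kHz cutoff is the figure quoted above.

```python
import numpy as np

fs = 44_100
n = np.arange(fs)                                    # 1 second, so bins are 1 Hz apart

low = 0.5 * np.cos(2 * np.pi * 1_000 / fs * n)       # 1 kHz component
high = 0.2 * np.cos(2 * np.pi * 18_000 / fs * n)     # 18 kHz component
x = low + high

X = np.fft.rfft(x)                                   # frequency-domain coefficients
freqs = np.fft.rfftfreq(len(x), d=1 / fs)            # bin centre frequencies (Hz)

cutoff_hz = 16_000
X[freqs > cutoff_hz] = 0                             # discard everything above the cutoff
x_filtered = np.fft.irfft(X, n=len(x))

max_error = float(np.max(np.abs(x_filtered - low)))
print(f"max deviation from the 1 kHz component: {max_error:.2e}")
```

After the coefficients above the cutoff are zeroed, only the 1 kHz component remains; the 18 kHz tone cannot be recovered from `x_filtered`.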

✦ Checkpoint Check Your Understanding — Sine Waves & Fourier Synthesis

Q1. State the discrete sinusoid formula and identify what each of the four parameters controls.

Answer: $x[n] = A\cos\!\left(\tfrac{2\pi f}{f_s}n + \varphi\right)$. $A$ = peak amplitude (loudness); $f$ = frequency in Hz (pitch); $f_s$ = sample rate (time ruler); $\varphi$ = phase offset in radians (starting position in cycle).

Q2. Compute $x[0]$ and $x[1]$ for $A = 2.0$, $f = 880$ Hz, $f_s = 44{,}100$ Hz, $\varphi = 0$.

Answer: Normalised freq $= 880/44100 \approx 0.01995$. $x[0] = 2.0\cos(0) = 2.000$. $x[1] = 2.0\cos(2\pi \times 0.01995) \approx 2.0 \times 0.9921 \approx 1.984$.

Q3. A square wave uses only odd harmonics with $A_k = 4/(\pi k)$. Why does adding more harmonics sharpen the edges but never eliminate the ~9% overshoot spike?

Answer: The overshoot (Gibbs phenomenon) is a mathematical consequence of approximating a discontinuity with a finite sum of smooth sinusoids. No matter how many harmonics you add, the spike remains at ~9% of the jump height; it becomes narrower but not shorter. It is mathematics, not a coding error.
Pitch Synthesizer — Hearing Equations Come Alive
The ultimate test of understanding a formula is making it do something real. In this section we turn NumPy arrays into audio, then analyse the pitch of that audio with librosa.yin(), the kind of pitch-detection tool used in speech and music applications.
After this section you will be able to
  • Convert a float NumPy signal array to a playable 16-bit WAV file by applying the correct scale factor and wavfile.write()
  • Calculate the expected detection period $\tau = f_s / f_0$ for any synthesised tone and verify the result from librosa.yin()
  • Trace the complete synthesis-to-verification loop: np.cos() → scale → wavfile.write() → librosa.load() → librosa.yin() → np.median()

You have been writing A * np.cos(2 * np.pi * f / fs * n) for the last twenty minutes. What if you could just play that line of code and hear what it sounds like — right inside your notebook? And what if you could then feed that sound to a real pitch detector and verify your math was correct?

🎯
Why this matters: Audio synthesis is the "Hello World" of signal processing. Once you can generate, save, and analyse a sound in Python, you have a complete feedback loop — every formula you write becomes something you can hear and measure, not just simulate on paper.
📷
Think of it this way

Scaling a float array to int16 is like adjusting a photo's exposure before saving as JPEG: your camera captures raw light as floating-point values (0.0–1.0), but a JPEG pixel must be an integer from 0–255, so you multiply by 255 and clamp. We do exactly the same — multiply the float signal by 32,767 and cast to np.int16 — before writing the WAV file to disk.

[Figure: synthesis pipeline. ① Generate: np.cos(…), float64 array → ② Scale: × 32767, range [−32768, 32767] → ③ Cast: .astype(int16), 16-bit integer → ④ Write: wavfile.write(), .wav binary file → ⑤ Verify: librosa.yin(), detected f₀ ≈ input f]
  • 32,767 (int16 max value): $2^{15} - 1$, the largest positive integer in a 16-bit signed format
  • 86 kB file size: 1 second of mono 16-bit audio at 44,100 Hz = 88,200 bytes of sample data
  • ±1 Hz YIN accuracy: for a clean synthetic tone like the ones in this section, librosa.yin() should detect the pitch within about ±1 Hz of the true fundamental
Problem

From Array to Audio File

A 16-bit WAV file is literally a list of integers written to disk. scipy.io.wavfile.write converts a NumPy float array into that binary format.

Synthesis Pipeline

Three steps from equation to playable file:

$n \in [0, f_s)$
$x[n] = A\cos\!\left(\frac{2\pi f}{f_s}n\right)$
int16 ← clip(x × 32767)
wavfile.write("out.wav", fs, int16)

📝 Worked Example — Synthesizing and Saving a 440 Hz Tone

Background: A 16-bit WAV file stores amplitude values as integers in the range $[-32768, +32767]$. We must scale our float array to this range before writing.

Problem: Generate 1 second of a 440 Hz tone at $f_s = 44{,}100$ Hz and save it as a 16-bit WAV file.

1. Generate the array:
$n = [0, 1, 2, \ldots, 44099]$ → $x[n] = 0.8\cos(2\pi \times 440/44100 \times n)$
Range of $x$: $[-0.8, +0.8]$ (float64)
2. Convert to 16-bit integer:
Multiply by $32767$: range becomes approximately $[-26{,}214, +26{,}214]$
Cast: samples = (x * 32767).astype(np.int16)
3. Write and verify:
wavfile.write("A4.wav", 44100, samples)
File size: $44{,}100 \times 2$ bytes $= 88{,}200$ bytes $\approx 86$ kB (plus a small file header)
Result: a 1-second WAV file encoding a 440 Hz Concert A tone, playable in any audio player.
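The file-size arithmetic can be checked without SciPy by writing the same samples with Python's standard-library `wave` module. This is an alternative sketch, not the course's `wavfile.write()` call; the temporary directory and the file name `A4_sketch.wav` are hypothetical.

```python
import os
import tempfile
import wave

import numpy as np

fs = 44_100
n = np.arange(fs)
x = 0.8 * np.cos(2 * np.pi * 440 / fs * n)
samples = (x * 32767).astype(np.int16)

path = os.path.join(tempfile.mkdtemp(), "A4_sketch.wav")   # hypothetical file name
with wave.open(path, "wb") as wav_file:
    wav_file.setnchannels(1)        # mono
    wav_file.setsampwidth(2)        # 2 bytes per sample = 16 bits
    wav_file.setframerate(fs)
    wav_file.writeframes(samples.astype("<i2").tobytes())  # little-endian PCM

data_bytes = samples.size * 2       # 2 bytes for each of the 44,100 samples
file_bytes = os.path.getsize(path)  # sample data plus a small header
print(data_bytes, file_bytes)
```

The file on disk is slightly larger than the sample data because the WAV container adds a short header describing the channel count, bit depth, and sample rate.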
✔ Quick Check

If amplitude $A = 1.0$ instead of $0.8$, what int16 integer does $x[0]$ become after scaling by 32,767?

$x[0] = 1.0 \cdot \cos(0) = 1.0 \;\to\; 1.0 \times 32{,}767 = 32{,}767$ — the maximum int16 value. Any larger amplitude would overflow to a negative number.
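A defensive version of the scale-and-cast step clips the signal before converting, so that an amplitude above 1.0 saturates instead of overflowing. The helper name `float_to_int16` and the test values are illustrative assumptions.

```python
import numpy as np


def float_to_int16(x: np.ndarray) -> np.ndarray:
    """Scale floats in [-1, 1] to int16, clipping anything outside that range."""
    scaled = np.clip(x, -1.0, 1.0) * 32767
    return scaled.astype(np.int16)      # astype truncates toward zero


x = np.array([0.0, 0.25, 0.8, 1.0, 1.2, -1.0])
samples = float_to_int16(x)
print(samples)
```

Note that the out-of-range value 1.2 maps to the same integer as 1.0: clipping distorts the waveform peak, but it avoids the sign flip that an overflow would cause.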

Pitch Detection with librosa

Once we can generate audio, we can close the loop: load the file back, run a pitch detector, and verify the detected frequency matches our input.

librosa Pitch Detection — YIN Algorithm

librosa.yin() estimates the fundamental frequency $f_0$ frame-by-frame using the YIN algorithm. It works by computing the difference function:

$$d(\tau) = \sum_{j=1}^{W}\bigl(x[j] - x[j+\tau]\bigr)^2$$

The smallest lag $\tau > 0$ at which $d(\tau) \approx 0$ is the period of the fundamental, measured in samples. The fundamental frequency is then $f_0 = f_s / \tau$.

📝 Worked Example — Verifying 440 Hz Detection

Background: After saving a 440 Hz tone as a WAV file, we need to verify the synthesis was correct by running a pitch detector and comparing the result to our input frequency.

Problem: We synthesised a 440 Hz tone at $f_s = 44{,}100$ Hz. What period $\tau$ (in samples) should librosa.yin() detect?

1. Expected period in samples:
$\tau = f_s / f_0 = 44{,}100 / 440 \approx 100.2$ samples
2. YIN finds the lag $\tau$ where $d(\tau) \approx 0$:
At lags of 100 and 101 samples, the cosine wave nearly repeats itself, so $d(\tau)$ dips close to zero.
3. Recover $f_0$:
$f_0 = f_s / \tau = 44{,}100 / 100.2 \approx 440.1$ Hz
np.median(librosa.yin(y, fmin=200, fmax=800, sr=44100)) returns 440.0 Hz, matching our synthesis input. Pass sr explicitly: librosa.yin() otherwise assumes its default of 22,050 Hz and would report half the true frequency.
✔ Quick Check

If we synthesised A3 (220 Hz — one octave lower) at the same $f_s = 44{,}100$ Hz, what period $\tau$ (in samples) should librosa.yin() detect?

$\tau = 44{,}100 / 220 = 200.45$ samples — exactly double the 440 Hz period, because A3 completes one full cycle in twice as many samples.
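The difference function $d(\tau)$ is simple enough to evaluate directly. The sketch below is a bare-bones illustration of that one formula, not a reimplementation of librosa.yin(): the full YIN algorithm adds further refinement steps, including interpolation between integer lags. The window length `W` and the zero-based sum index are assumptions of this sketch.

```python
import numpy as np

fs = 44_100
f_true = 440
n = np.arange(4096)
x = 0.8 * np.cos(2 * np.pi * f_true / fs * n)

W = 2048                                       # window length (assumed for this sketch)
fmin, fmax = 200, 800                          # search range in Hz
lags = np.arange(fs // fmax, fs // fmin + 1)   # candidate periods in samples


def difference(tau: int) -> float:
    """d(tau): sum over the window of (x[j] - x[j + tau])^2."""
    return float(np.sum((x[:W] - x[tau : tau + W]) ** 2))


d = np.array([difference(tau) for tau in lags])
best_lag = int(lags[np.argmin(d)])             # integer lag with the deepest dip
f0_estimate = fs / best_lag
print(best_lag, round(f0_estimate, 1))
```

Restricting the search to integer lags limits the resolution: the true period is about 100.2 samples, so the nearest integer lag gives an estimate about 1 Hz away from 440 Hz. Interpolating between lags is what closes that gap.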
⚠️
Common Mistake

Myth: "The scale factor should be 32768, because 16-bit audio has $2^{15} = 32768$ positive levels."

Reality: A signed 16-bit integer (np.int16) has an asymmetric range: $[-32768, +32767]$. The maximum positive value is $32767$, not $32768$. Using 32768 as your scale factor risks producing a value of $+32768$ for a peak amplitude of $1.0$, which overflows to $-32768$ — an audible crack at every peak of the waveform.
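The wrap-around can be reproduced in a couple of lines. The result of casting an out-of-range float directly can depend on the platform and NumPy version, so this sketch casts from an integer array, where the wrap is explicit and repeatable.

```python
import numpy as np

peak = 1.0
scaled_ok = int(peak * 32767)        # 32767: fits in int16
scaled_bad = int(peak * 32768)       # 32768: one past the int16 maximum

info = np.iinfo(np.int16)            # reports the asymmetric int16 range
wrapped = np.array([scaled_ok, scaled_bad], dtype=np.int64).astype(np.int16)
print(info.min, info.max, wrapped)
```

The second value lands at the most negative int16 level, which is the sign flip heard as a click at each positive peak.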

Solution
🤔 Pause & Predict

The widget lets you toggle odd harmonics ($k = 1, 3, 5, 7$). Based on $x[n] = \sum_{k\text{ odd}} (-1)^{(k-1)/2}\,\tfrac{4}{\pi k}\cos(2\pi k f_0 n/f_s)$, before clicking: if you activate only $k=1$ plus $k=3$, do you predict the waveform will look more square or more like a smooth sine wave, compared to $k=1$ alone? Why does adding more odd harmonics make the shape squarer?

Make your prediction — then toggle the checkboxes to verify ↓

Try It: Harmonic Pitch Synthesizer

Toggle odd harmonics to see $x[n] = \sum_{k\text{ odd}} (-1)^{(k-1)/2}\,\tfrac{4}{\pi k}\cos(2\pi k f_0 n/f_s)$ build from a pure tone toward a square wave; the live equation updates with each toggle.

Live Synthesis Equation
x[n] = 1.27·cos(2πf₀n/fs)
Live Calculation (at n = 0, cos(0) = 1 for all harmonics; signs alternate between successive odd harmonics)
k = 1: +4/(π×1) × cos(0) = +1.2732 × 1.0000 = 1.2732
k = 3: −4/(π×3) × cos(0) = −0.4244 × 1.0000 (inactive)
k = 5: +4/(π×5) × cos(0) = +0.2546 × 1.0000 (inactive)
k = 7: −4/(π×7) × cos(0) = −0.1819 × 1.0000 (inactive)
x[0] = sum of active terms = 1.2732
Implementation
Python · NumPy / SciPy / librosa — Synthesis & Pitch Detection
import numpy as np
from scipy.io import wavfile
import librosa

fs = 44100
n = np.arange(fs)  # 1 second of samples

# ── Synthesise a 440 Hz Concert A ────────────────────────────
x = 0.8 * np.cos(2 * np.pi * 440 / fs * n)
wavfile.write("A4.wav", fs, (x * 32767).astype(np.int16))

# ── Detect pitch to verify ───────────────────────────────────
y, sr = librosa.load("A4.wav", sr=fs)
# Pass sr explicitly: librosa.yin() otherwise assumes 22,050 Hz
f0 = librosa.yin(y, fmin=200, fmax=800, sr=sr)
print(f"Detected f0: {np.median(f0):.1f} Hz")  # → 440.0 Hz
Output
Detected f0: 440.0 Hz
Key Takeaway

Synthesis is three steps: compute $x[n]$, scale by 32,767, save with wavfile.write() — verification is one line: np.median(librosa.yin(y, ...)) must return your original frequency within ±1 Hz.

🗣️
Real-World Application

Text-to-Speech (TTS) Synthesis — How Siri Builds a Voice

Neural TTS systems such as WaveNet and VITS finish with the same step we just performed, at much larger scale. Given a text transcript, the model ultimately produces an array of waveform samples $x[n]$ (WaveNet, for example, predicts them one sample at a time). That array is converted to integers such as 16-bit PCM and streamed to the speaker. The perceptual naturalness of Siri, Google Assistant, or Amazon Polly depends on how accurately the model reproduces the amplitude, frequency, and phase structure ($A_k$, $f_k$, $\varphi_k$) of natural speech.

✦ Checkpoint Check Your Understanding — Pitch Synthesizer

Q1. Why do we scale a float signal by 32,767 rather than 32,768 before casting to np.int16?

Answer: A signed 16-bit integer has the asymmetric range $[-32768, +32767]$. The maximum positive value is $32767 = 2^{15}-1$. Scaling by 32,768 for a peak of 1.0 produces $+32768$, which overflows to $-32768$ — an audible crack at every positive peak.

Q2. A tone is synthesised at $f_0 = 261.63$ Hz (middle C) with $f_s = 44{,}100$ Hz. What period $\tau$ (in samples) should librosa.yin() detect?

Answer: $\tau = 44100 / 261.63 \approx 168.6$ samples.

Q3. Why do we use np.median(f0) rather than np.mean(f0) when reporting the detected pitch?

Answer: librosa.yin() returns a frame-by-frame estimate. At the start and end of the signal (partial windows), it may return outlier values. The median is robust to these outliers; the mean would be pulled toward them.
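The difference between the two statistics is easy to see on a small example. The pitch track below is hypothetical data made up for illustration (it is not output from librosa): five good frames near 440 Hz and two bad edge frames.

```python
import numpy as np

# Hypothetical frame-by-frame pitch track: mostly 440 Hz with two bad edge frames
f0_frames = np.array([800.0, 440.0, 440.1, 439.9, 440.0, 440.0, 200.0])

mean_f0 = float(np.mean(f0_frames))       # pulled toward the outliers
median_f0 = float(np.median(f0_frames))   # middle value after sorting
print(f"mean = {mean_f0:.1f} Hz, median = {median_f0:.1f} Hz")
```

Two bad frames out of seven are enough to move the mean by more than 15 Hz, while the median still reports the pitch of the steady part of the tone.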
Pitch Synthesizer & Real-Time Oscilloscope
Generate physical pressure waves using your browser's Web Audio API. Explore how different waveforms look on a live temporal oscilloscope and hear their harmonic timbres.

Three Ideas to Carry Forward

Every concept from today connects to a real tool you will use this semester — and for the rest of your career.

📡

Every AI Audio Model Reads $x[n]$

Digital signals are integer-indexed sequences, not continuous curves. The sample rate $f_s$ is the single most important parameter in your code — it defines what "time" means to your algorithm.

📐

Sine Waves Are the Atoms of Sound

Any periodic signal decomposes into harmonics. A 100 Hz square wave is $\sum_{k\text{ odd}} (-1)^{(k-1)/2}\frac{4}{\pi k}\cos(2\pi k f_0/f_s \cdot n)$. The Gibbs overshoot (~9% of the jump height) never disappears: that is mathematics, not a bug.

🎵

Synthesis → Save → Detect in 4 Lines

np.cos() → scale to int16 → wavfile.write() → librosa.yin(). np.median(f0) should return your original frequency within ±1 Hz.

Coming up — Week 2: Sampling & Aliasing: We now know how to generate $x[n]$ using a formula. But real-world signals arrive from microphones, sensors, and cameras — we don't control them. How many samples per second are enough? What happens when we take too few? The Nyquist-Shannon theorem answers both questions, and the consequence of ignoring it — aliasing — will fundamentally change how you think about every recording device you have ever used.
Further Reading & Textbooks
Expand your understanding with standard text materials written by leading researchers in digital signal processing and data engineering.

Exercises

Rigorous problem sets covering mathematical derivations and functional coding tasks, followed by advanced synthesis applications.

1 Theory · Why Signal Processing
Easy

Signal Classification

Classify each scenario as a continuous-time signal $x(t)$ or a discrete-time signal $x[n]$, with a brief physical justification:
(a) The electrical voltage output of an analog microphone capturing speech.
(b) Ambient server-room temperature logged once every 10 minutes.
(c) An ECG trace scanned at 500 samples/second by a digital monitor.
(d) Barometric pressure measured by a physical aneroid gauge at sea level.

Ask: is the independent variable a continuous range $t \in \mathbb{R}$, or a sequence of integer indices $n \in \mathbb{Z}$? Analog sensors output continuous-time; digital recording always produces discrete-time.
2 Code · Why Signal Processing
Easy

AI Data Pipeline in NumPy

Implement the first two stages of the AI data pipeline in Python.
(a) Create integer indices n = np.arange(0, 100) representing 100 discrete samples at $f_s = 100$ Hz.
(b) Compute the discrete signal $x[n] = 4\cos(2\pi \times 5 / 100 \times n)$.
(c) Print the first 5 values and verify that len(x) equals 100.
(d) What does the ratio $f/f_s = 5/100 = 0.05$ represent physically?

Use np.arange(100) for indices. The ratio $f/f_s$ is the normalized frequency — the fraction of a full cycle completed per sample. At 0.05, exactly 5 full cycles fit into 100 samples.
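Parts (a) to (c) can be sketched as follows, assuming NumPy is installed. The names `fs`, `f`, `A`, and `samples_per_cycle` are chosen here for illustration and do not come from the exercise statement.

```python
import numpy as np

fs = 100          # sample rate in Hz
f = 5             # tone frequency in Hz
A = 4.0           # amplitude

n = np.arange(0, 100)                      # integer sample indices 0..99
x = A * np.cos(2 * np.pi * f / fs * n)     # x[n] = 4 cos(2*pi*(5/100)*n)

print(x[:5])      # first five samples
print(len(x))     # number of samples

# f/fs is measured in cycles per sample, so one cycle spans fs/f samples.
samples_per_cycle = fs / f
print(samples_per_cycle)
```

Because one cycle spans 20 samples, the array repeats every 20 indices, which is a quick sanity check on the normalized frequency.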
3 Theory · Signals as Math
Easy

Discrete Sinusoid Parameters

A discrete signal is defined as $x[n] = 3\cos(2\pi \times 220 / 44100 \times n + \pi/4)$.
(a) Identify the amplitude $A$, frequency $f$, sample rate $f_s$, and phase $\varphi$.
(b) How many samples does one complete cycle occupy? (Hint: $N = f_s / f$)
(c) Compute $x[0]$ and $x[1]$ numerically to 3 decimal places.
(d) A4 is 440 Hz. Write the formula for the signal one octave lower (A3 = 220 Hz, same amplitude but zero phase).

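Match to $A\cos(2\pi f/f_s \cdot n + \varphi)$: $A = 3$, $f = 220$ Hz, $f_s = 44{,}100$ Hz, $\varphi = \pi/4$. One cycle: $N = 44100/220 \approx 200.45$ samples. $x[0]=3\cos(\pi/4)=3\times\frac{\sqrt{2}}{2}\approx 2.121$ and $x[1]=3\cos(2\pi\times220/44100+\pi/4)\approx 2.054$. A3 at zero phase: $x[n]=3\cos(2\pi\times220/44100\times n)$.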
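The numbers in parts (b) and (c) can be checked with a few lines of standard-library Python. This is only a sketch; the helper name `x` and the variable `N` are illustrative choices.

```python
import math

A, f, fs, phi = 3.0, 220.0, 44100.0, math.pi / 4

def x(n):
    """Evaluate x[n] = A cos(2*pi*(f/fs)*n + phi) at integer index n."""
    return A * math.cos(2 * math.pi * f / fs * n + phi)

N = fs / f                      # samples per cycle (not an integer here)
print(round(N, 2), round(x(0), 3), round(x(1), 3))
```

Note that $N$ is not an integer, so no two samples within one second land on exactly the same point of the waveform's cycle.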
4 Code · Signals as Math
Medium

Fourier Synthesis of a Square Wave

Synthesize a 100 Hz square wave using the first five odd harmonics ($k = 1, 3, 5, 7, 9$) at $f_s = 44{,}100$ Hz.
(a) Create n = np.arange(0, fs) for one second of samples.
(b) Compute $x[n] = \sum_{k \in \{1,3,5,7,9\}} \frac{4}{\pi k}\sin(2\pi k \times 100 / f_s \times n)$.
(c) Print the peak amplitude and verify it is approximately 1.18 (Gibbs overshoot).
(d) How does the waveform change if you include every odd harmonic up to $k = 99$? Describe without running.

Use a loop summing (4/(np.pi*k)) * np.sin(2*np.pi*k*100/fs*n) for each odd $k$. With 5 harmonics, np.max(np.abs(x)) ≈ 1.18. With odd harmonics up to $k=99$ the wave looks much more square; the overshoot narrows to thin spikes at the edges, but its height stays near 1.18, which is about 9% of the jump from $-1$ to $+1$.
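The sketch below compares the two partial sums, assuming NumPy. It uses the sine form of the series, which is the standard Fourier series of a square wave that is odd about $n = 0$. The helper `square_partial_sum` is a name introduced here for illustration.

```python
import numpy as np

fs = 44100
f0 = 100
n = np.arange(0, fs)            # one second of sample indices

def square_partial_sum(k_max):
    """Sum the odd harmonics 1, 3, ..., k_max of the sine-form square-wave series."""
    x = np.zeros(len(n))
    for k in range(1, k_max + 1, 2):
        x += (4 / (np.pi * k)) * np.sin(2 * np.pi * k * f0 / fs * n)
    return x

peak_5 = np.max(np.abs(square_partial_sum(9)))      # k = 1, 3, 5, 7, 9
peak_50 = np.max(np.abs(square_partial_sum(99)))    # k = 1, 3, ..., 99
print(peak_5, peak_50)
```

Both peaks should land close to 1.18. With 50 harmonics the spike is only a few samples wide, so the sampled maximum can sit slightly below the continuous-time value.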
5 Theory · Pitch Synthesizer
Easy

WAV File Format & Bit Depth

A 16-bit WAV file stores amplitude values as integers in $[-32768, +32767]$.
(a) A float array has values in $[-1.0, +1.0]$. What is the correct integer after scaling by 32767? Compute the integer for $x = 0.5$.
(b) Calculate the file size in kilobytes for 3 seconds of mono audio at $f_s = 44{,}100$ Hz with 16-bit depth.
(c) A tone at 261.63 Hz is middle C (C4). How many samples does one cycle occupy at $f_s = 44{,}100$ Hz?
(d) Why is the scale factor 32767 rather than 32768?

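$x=0.5 \to$ int16 $= \lfloor 0.5 \times 32767 \rceil = 16384$ when rounding to nearest (plain astype truncation gives 16383). File size: $3 \times 44100 \times 2 = 264{,}600$ bytes $\approx 258$ kB of sample data (taking 1 kB = 1024 bytes), plus a small file header. C4 cycle: $44100/261.63 \approx 168.6$ samples. 32767 not 32768 because int16 max is $2^{15}-1$ (asymmetric range), so scaling $+1.0$ by 32768 would overflow.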
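The arithmetic above can be reproduced with the following sketch, assuming NumPy. The variable names are illustrative, and the byte count covers sample data only, not the file header.

```python
import numpy as np

x = 0.5
scaled = x * 32767                                        # 16383.5
rounded = int(np.round(scaled))                           # round to nearest integer
truncated = int(np.array([scaled]).astype(np.int16)[0])   # astype drops the fraction

fs, seconds, bytes_per_sample = 44100, 3, 2               # mono, 16-bit
data_bytes = fs * seconds * bytes_per_sample
samples_per_cycle_c4 = fs / 261.63

int16_info = np.iinfo(np.int16)                           # range of a 16-bit signed integer
print(rounded, truncated)
print(data_bytes, data_bytes / 1024)
print(samples_per_cycle_c4)
print(int16_info.min, int16_info.max)
```

The one-count difference between rounding and truncation is inaudible, but it explains why two correct-looking scripts can produce slightly different files.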
6 Code · Pitch Synthesizer
Medium

Synthesise, Save, and Detect Pitch

Build the complete synthesis-to-analysis loop in Python.
(a) Generate $x[n] = 0.8\cos(2\pi \times 440 / 44100 \times n)$ for 1 second at $f_s = 44{,}100$ Hz.
(b) Save it as "A4.wav" using scipy.io.wavfile.write after scaling to int16.
(c) Load the file with librosa.load("A4.wav", sr=44100) and run librosa.yin(y, fmin=200, fmax=800, sr=44100).
(d) Print np.median(f0) and verify it equals approximately 440.0 Hz.

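Scale: (x * 32767).astype(np.int16). librosa.load returns a float32 array normalized to $[-1, 1]$, so there is no need to rescale. Pass sr=44100 to librosa.yin as well; its default is 22050, which would report the pitch an octave too low. np.median(f0) should return very close to 440.0 Hz (±1 Hz is acceptable).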
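As a cross-check that needs only NumPy, the sketch below runs the same synthesise, save, and detect loop with two deliberate substitutions: the standard-library `wave` module stands in for scipy.io.wavfile.write, and a simple FFT peak stands in for librosa.yin. It is not the method the exercise asks for, only a way to confirm that the saved file really contains a 440 Hz tone. The temporary path and the variable names are assumptions made for this example.

```python
import os
import tempfile
import wave

import numpy as np

fs = 44100
n = np.arange(fs)                                   # one second of sample indices
x = 0.8 * np.cos(2 * np.pi * 440 / fs * n)
pcm = (x * 32767).astype("<i2")                     # little-endian int16 samples

path = os.path.join(tempfile.mkdtemp(), "A4.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)                              # mono
    wf.setsampwidth(2)                              # 2 bytes = 16 bits
    wf.setframerate(fs)
    wf.writeframes(pcm.tobytes())

with wave.open(path, "rb") as wf:
    fs_read = wf.getframerate()
    raw = wf.readframes(wf.getnframes())
y = np.frombuffer(raw, dtype="<i2") / 32768.0       # back to floats in [-1, 1)

# Stand-in pitch estimate: frequency of the largest FFT magnitude bin.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / fs_read)
f_est = freqs[np.argmax(spectrum)]
print(f_est)
```

With exactly one second of audio the FFT bins are 1 Hz apart, so the 440 Hz tone falls on a bin centre. A frame-based estimator such as YIN is the better choice when the pitch changes over time.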
7 Synthesis · Theory: Complete Signal Pipeline Design
Hard

Acoustic Keyword Spotter System Design

Design a keyword-spotting system that detects the word "Hey" in live microphone audio.
(a) Write the discrete signal model $x[n]$ for speech sampled at $f_s = 16{,}000$ Hz. What amplitude range is typical for a 16-bit microphone?
(b) The speech band spans 80–3400 Hz. What minimum sample rate satisfies Nyquist? Propose a practical $f_s$ and justify your choice.
(c) If the system stores 30 seconds of audio in a 16-bit buffer, calculate the buffer size in MB.
(d) Sketch the four-stage pipeline: Capture → Preprocess → Feature Extract → Classify. For each stage, name the Python function you would use.

Typical 16-bit range: $[-32768, 32767]$. Nyquist minimum: $2 \times 3400 = 6{,}800$ Hz; practical choice: 16 kHz (matches standard speech models). Buffer: $16000 \times 30 \times 2 = 960{,}000$ bytes $\approx 0.96$ MB. Pipeline functions: sounddevice.rec() → scipy.signal.butter() → librosa.yin() → model.predict().
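The sizing arithmetic for parts (b) and (c) can be written out as a short sketch in plain Python. The variable names are illustrative, and MB is taken here as $10^6$ bytes.

```python
f_max_hz = 3400          # upper edge of the speech band
fs_hz = 16_000           # proposed practical sample rate
seconds = 30
bytes_per_sample = 2     # 16-bit samples

nyquist_min_hz = 2 * f_max_hz                    # minimum rate that avoids aliasing
buffer_bytes = fs_hz * seconds * bytes_per_sample
buffer_mb = buffer_bytes / 1e6                   # decimal megabytes
sampling_interval_s = 1 / fs_hz                  # T_s = 1 / f_s

print(nyquist_min_hz, buffer_bytes, buffer_mb, sampling_interval_s)
```

The proposed 16 kHz rate leaves more than a factor of two of headroom above the 6.8 kHz minimum, which relaxes the requirements on the anti-aliasing filter.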
8 Synthesis · Code: End-to-End Signal Analysis
Hard

Beat Frequency Verification Pipeline

Two guitar strings sound at 440 Hz and 444 Hz simultaneously. Build the full analysis pipeline:
(a) Generate both discrete signals at $f_s = 44{,}100$ Hz for 2 seconds, then compute their sum $x[n]$.
(b) Using the identity $\cos A + \cos B = 2\cos\!\left(\frac{A-B}{2}\right)\cos\!\left(\frac{A+B}{2}\right)$, predict the beat frequency and carrier frequency analytically.
(c) Verify computationally: find the envelope of $x[n]$ using np.abs(scipy.signal.hilbert(x)) and measure the time between successive amplitude peaks to confirm your predicted beat period.
(d) Compute $\text{SNR}_{dB} = 10\log_{10}(P_s/P_n)$ of the composite signal against a noise floor of $\sigma = 0.01$.

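Beat frequency $= |444-440| = 4$ Hz; period $= 0.25$ s; carrier $= (440+444)/2 = 442$ Hz. The envelope factor $2\cos(2\pi \cdot 2t)$ oscillates at 2 Hz, but its magnitude peaks twice per cycle, which gives the 4 Hz beat. Use scipy.signal.find_peaks to measure envelope peaks. For unit-amplitude tones, signal power $= \tfrac{1}{2} + \tfrac{1}{2} = 1$ and noise power $= \sigma^2 = 10^{-4}$, so SNR $= 10\log_{10}(1/0.0001) = 40$ dB.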
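A possible NumPy-only check of parts (a), (c), and (d) is sketched below. It assumes unit-amplitude tones. To avoid a SciPy dependency it builds the analytic signal directly with an FFT, which is the same construction scipy.signal.hilbert is based on, and it replaces scipy.signal.find_peaks with a plain local-maximum test. The 1.9 height threshold and all variable names are choices made for this example.

```python
import numpy as np

fs = 44100
t = np.arange(2 * fs) / fs                        # 2 seconds of sample times
x = np.cos(2 * np.pi * 440 * t) + np.cos(2 * np.pi * 444 * t)

# Analytic signal via FFT: keep DC and Nyquist, double the positive
# frequencies, zero the negative ones. Its magnitude is the envelope.
N = len(x)                                        # even for this signal
h = np.zeros(N)
h[0] = 1.0
h[N // 2] = 1.0
h[1:N // 2] = 2.0
envelope = np.abs(np.fft.ifft(np.fft.fft(x) * h))

# Interior local maxima of the envelope that are close to full height (2.0).
mid = envelope[1:-1]
is_peak = (mid > envelope[:-2]) & (mid > envelope[2:]) & (mid > 1.9)
peak_idx = np.flatnonzero(is_peak) + 1
beat_period = np.mean(np.diff(peak_idx)) / fs     # seconds between envelope peaks

signal_power = np.mean(x ** 2)                    # average power of the sum
noise_power = 0.01 ** 2                           # sigma squared
snr_db = 10 * np.log10(signal_power / noise_power)
print(beat_period, 1 / beat_period, snr_db)
```

Both tones complete a whole number of cycles in 2 seconds, so the FFT-based envelope has no edge artefacts here. For arbitrary durations, expect some distortion near the start and end of the envelope.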