From the impulse response DNA to the sliding-window sum — master the computational engine behind every noise filter, audio effect, and convolutional neural network layer.
Intuition
A system is any rule that transforms an input signal into an output signal. Out of all possible systems, signal processing focuses almost exclusively on LTI systems — those obeying two symmetries (linearity and time-invariance) that make their behavior completely predictable from a single experiment.
Play a middle-C on a digital piano at normal volume: you get a clean note. Play it twice as hard: you get exactly the same note at twice the volume — no distortion, no surprises. Now play two notes simultaneously: the output is the sum of the two notes processed independently, just as you would expect. This predictable, symmetric behavior is what engineers call linearity, and it's the foundation of every digital filter ever built.
An LTI system is like a recipe that scales perfectly: double the ingredients, get exactly double the result; make the dish at 6 pm vs. 7 pm and get the exact same dish. Just as a reliable recipe gives consistent, predictable results regardless of quantity or timing, an LTI system gives predictable outputs regardless of input amplitude or when the signal starts.
Every signal processing algorithm is a system. LTI systems are special: the impulse response $h[n]$ fully captures all behavior.
A system is linear if scaling and adding inputs produces identically scaled and added outputs. Formally:
$$\mathcal{T}\{a \cdot x_1[n] + b \cdot x_2[n]\} = a \cdot \mathcal{T}\{x_1[n]\} + b \cdot \mathcal{T}\{x_2[n]\}$$
This must hold for all signals $x_1, x_2$ and all scalars $a, b \in \mathbb{R}$.
Background. Compare $\mathcal{T}\{ax_1+bx_2\}$ against $a\mathcal{T}\{x_1\}+b\mathcal{T}\{x_2\}$.
Problem: Test $\mathcal{T}\{x\} = 3x[n]+5$ with $a=2, b=1$.
Is $\mathcal{T}\{x[n]\} = 5x[n]$ linear? Apply the zero-input test first.
Quick linearity test: Does $\mathcal{T}\{0\}=0$? Any constant offset anywhere in the rule immediately fails linearity.
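The superposition comparison above can be sketched numerically. This is a minimal illustration, assuming NumPy; the helper `is_linear` and the function names `offset_system` and `gain_system` are hypothetical names introduced here. Passing on random signals only suggests linearity, while a single failure is enough to disprove it.

```python
import numpy as np

def is_linear(T, a=2.0, b=1.0, length=8, seed=0):
    """Compare T{a*x1 + b*x2} with a*T{x1} + b*T{x2} on random test signals."""
    rng = np.random.default_rng(seed)
    x1 = rng.standard_normal(length)
    x2 = rng.standard_normal(length)
    lhs = T(a * x1 + b * x2)
    rhs = a * T(x1) + b * T(x2)
    return bool(np.allclose(lhs, rhs))

def offset_system(x):
    return 3 * x + 5   # T{x} = 3x[n] + 5 (hypothetical name)

def gain_system(x):
    return 5 * x       # T{x} = 5x[n] (hypothetical name)

# Zero-input test: a linear system must map the zero signal to zero
print(offset_system(np.zeros(4)))   # every sample equals 5, so the test fails
print(is_linear(offset_system))     # False
print(is_linear(gain_system))       # True
```

With $a=2, b=1$ the two sides of the offset system differ by a constant 10 (the left side carries one $+5$, the right side carries $2\cdot 5 + 5$), which is exactly the term the zero-input test exposes.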
A system is time-invariant if delaying the input by $n_0$ delays the output by exactly $n_0$:
$$\text{If } y[n] = \mathcal{T}\{x[n]\},\quad\text{then } y[n-n_0] = \mathcal{T}\{x[n-n_0]\}$$
The system's behavior is identical regardless of when the signal is applied.
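The shift test can be tried on finite arrays as a rough sketch. It assumes NumPy and signals that are zero before $n=0$; the helpers `delay`, `is_time_invariant`, `two_tap`, and `ramp_gain` are hypothetical names introduced for illustration. The test signal ends in zeros so that delaying it does not push any nonzero samples off the end of the array.

```python
import numpy as np

def delay(x, n0):
    """Delay a finite signal by n0 samples, padding zeros at the start."""
    return np.concatenate([np.zeros(n0), x[:len(x) - n0]])

def is_time_invariant(T, x, n0):
    """Compare T{x[n - n0]} with y[n - n0] on one test signal."""
    return bool(np.allclose(T(delay(x, n0)), delay(T(x), n0)))

def two_tap(x):
    # y[n] = 0.5 x[n] - 0.25 x[n-1], with x[-1] taken as 0
    return 0.5 * x - 0.25 * delay(x, 1)

def ramp_gain(x):
    # y[n] = n * x[n]: the gain depends on the absolute index n
    return np.arange(len(x)) * x

x = np.array([1.0, 2.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0])
print(is_time_invariant(two_tap, x, 3))    # True
print(is_time_invariant(ramp_gain, x, 3))  # False
```

The second system is linear, yet it fails the shift test because its coefficient changes with $n$.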
Background. For any LTI system, substituting $x[n]=\delta[n]$ reads off $h[n]$ directly.
Problem: Find $h[n]$ for $\mathcal{T}\{x[n]\} = 0.5x[n] - 0.25x[n-1]$.
Is this system BIBO stable? Compute $\sum|h[n]|$.
One impulse test produces $h[n]$; convolution with any $x[n]$ then gives the exact output for that input.
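The impulse test for the problem above can be sketched as follows, assuming NumPy and a signal that is zero before $n=0$. The function name `system` and the test input are illustrative choices, not part of the original problem.

```python
import numpy as np

def system(x):
    """y[n] = 0.5 x[n] - 0.25 x[n-1], assuming x[n] = 0 for n < 0."""
    x_prev = np.concatenate([[0.0], x[:-1]])
    return 0.5 * x - 0.25 * x_prev

# Feed a unit impulse and read off the impulse response
delta = np.zeros(6)
delta[0] = 1.0
h = system(delta)
print(h)                    # 0.5, -0.25, then zeros
print(np.sum(np.abs(h)))    # 0.75, finite, so the system is BIBO stable

# Convolving any input with h reproduces the system's output
x = np.array([1.0, 2.0, 0.0, -1.0, 0.0, 0.0])
y_direct = system(x)
y_conv = np.convolve(x, h)[:len(x)]   # keep the first len(x) samples
print(np.allclose(y_direct, y_conv))  # True
```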
An LTI system is Bounded-Input Bounded-Output (BIBO) stable if and only if its impulse response is absolutely summable:
$$\sum_{n=-\infty}^{\infty} |h[n]| < \infty$$
Problem: Is $h[n] = (0.7)^n u[n]$ BIBO stable?
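One way to check this is with the geometric series. Since $u[n]$ removes all terms with $n<0$ and $0.7^n$ is positive, the absolute sum reduces to:

$$\sum_{n=-\infty}^{\infty} |h[n]| = \sum_{n=0}^{\infty} (0.7)^n = \frac{1}{1-0.7} = \frac{10}{3} \approx 3.33 < \infty$$

The sum is finite, so this $h[n]$ satisfies the absolute summability condition. The same argument works for any $h[n] = r^n u[n]$ with $|r| < 1$.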
Myth: "Any system that multiplies the input by a constant is linear."
Reality: $y[n] = 3x[n] + 5$ multiplies by 3 but is NOT linear — the constant +5 means $\mathcal{T}\{0\} = 5 \neq 0$. Only systems where every term contains the input (no standalone constants) can be linear.
Switch the widget below to "Non-linear: y = x²". Before touching the sliders: do you predict that the green expected curve (superposition) will match the red actual output?
Form your prediction first — then select the system type to verify ↓
An LTI system's entire behavior — for every possible input that will ever exist — is encoded in one array: the impulse response $h[n]$, obtained by feeding the system a single unit impulse.
Every digital equalizer, from the one on your phone to a professional studio console, relies on the LTI assumption. Because the processing is linear and time-invariant, an engineer can apply bass boost and treble cut as independent filters and simply add their outputs. If the console were non-linear, bass and treble would interact unpredictably and couldn't be processed independently. LTI is the mathematical guarantee that a 31-band graphic EQ works exactly as advertised.
Q1. State the zero-input test for linearity. If $\mathcal{T}\{0\} = 7$, is the system linear?
Q2. Find $h[n]$ for $y[n] = 0.8x[n] - 0.4x[n-1]$. Is the system FIR or IIR?
Q3. Why does a time-varying coefficient (e.g. $y[n] = n \cdot x[n]$) make a system time-varying, even if it is linear?
Mechanics
Convolution is the single operation that computes every LTI system's output. It slides the impulse response over the input, computing a weighted sum at each position. Master the flip-and-slide mechanic and you understand the mathematical core of every noise filter, audio effect, and CNN layer.
Imagine running a weighted magnifying glass along a row of numbers. At each position it sits over a small neighborhood, multiplies each value by a weight, sums everything up, and writes down one output number. Slide it one step right, repeat. That systematic sliding-weighted-sum is exactly what convolution does, which is why every CNN layer, every audio filter, every edge detector is built on the same operation.
Convolution is like a barista's recipe card sliding along a drink order: at each position they read the "weights" from the card, multiply by the quantities in the order, add the results, and pour one output serving. Slide the card to the next item, repeat. The card is $h[n]$, the order is $x[n]$, and each poured serving is one sample of $y[n]$.
$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\;h[n-k]$$
Background. $y[n] = \sum_k x[k]\,h[n-k]$. For finite signals, sum only over the active support of $x$.
Problem: Compute $y = x * h$ where $x = [1, 2, 0, -1]$ (N=4) and $h = [0.5, 1.0, 0.5]$ (M=3).
What is the output length if $x$ has 7 samples and $h$ has 4 taps?
At each position $n$: flip $h$, overlay on $x$, multiply element-wise, sum → one sample of $y[n]$. Slide one step and repeat.
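The flip-and-slide sum can be written out directly for the worked problem above. This is a minimal sketch assuming NumPy; `convolve_direct` is a hypothetical helper name, and the nested loop is meant for clarity rather than speed.

```python
import numpy as np

def convolve_direct(x, h):
    """Direct evaluation of y[n] = sum_k x[k] h[n-k] for finite sequences."""
    N, M = len(x), len(h)
    y = np.zeros(N + M - 1)
    for n in range(N + M - 1):
        for k in range(N):
            if 0 <= n - k < M:          # h[n-k] is zero outside its support
                y[n] += x[k] * h[n - k]
    return y

x = np.array([1.0, 2.0, 0.0, -1.0])    # N = 4
h = np.array([0.5, 1.0, 0.5])          # M = 3
y = convolve_direct(x, h)
print(len(y))                           # 6, which is N + M - 1
print(y)                                # 0.5, 2.0, 2.5, 0.5, -1.0, -0.5
print(np.allclose(y, np.convolve(x, h)))  # True
```

For 7 samples and 4 taps the same length rule gives $7 + 4 - 1 = 10$ output samples.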
| Property | Formula | Physical Meaning |
|---|---|---|
| Commutativity | $x*h = h*x$ | Order of operands doesn't matter |
| Associativity | $(x*h_1)*h_2 = x*(h_1*h_2)$ | Cascaded filters can be merged |
| Distributivity | $x*(h_1+h_2)=(x*h_1)+(x*h_2)$ | Parallel filters can be combined |
| Identity | $x*\delta = x$ | Convolving with impulse = no change |
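The four properties in the table can be spot-checked numerically. The sketch below assumes NumPy and uses random test sequences; a numerical check on a few signals illustrates the properties but does not prove them.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)
h1 = rng.standard_normal(4)
h2 = rng.standard_normal(4)   # same length as h1 so h1 + h2 is defined sample by sample
delta = np.array([1.0])       # unit impulse as a length-1 sequence

commutative = np.allclose(np.convolve(x, h1), np.convolve(h1, x))
associative = np.allclose(np.convolve(np.convolve(x, h1), h2),
                          np.convolve(x, np.convolve(h1, h2)))
distributive = np.allclose(np.convolve(x, h1 + h2),
                           np.convolve(x, h1) + np.convolve(x, h2))
identity = np.allclose(np.convolve(x, delta), x)

print(commutative, associative, distributive, identity)   # True True True True
```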
| Property | FIR | IIR |
|---|---|---|
| Duration of h[n] | Finite (N taps) | Infinite (never dies out completely) |
| Feedback loops? | No | Yes (recursive) |
| Always stable? | Yes | Depends on design |
| Example | Moving average | Exponential smoother |
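The two example systems in the table can be compared by their impulse responses. This sketch assumes SciPy's `signal.lfilter` and `signal.unit_impulse`; the smoothing factor 0.3 and the 20-sample window are illustrative choices.

```python
import numpy as np
from scipy import signal

delta = signal.unit_impulse(20)

# FIR: 3-tap moving average, no feedback (denominator is just [1])
h_fir = signal.lfilter([1/3, 1/3, 1/3], [1.0], delta)

# IIR: exponential smoother y[n] = alpha*x[n] + (1 - alpha)*y[n-1]
alpha = 0.3
h_iir = signal.lfilter([alpha], [1.0, -(1 - alpha)], delta)

print(np.count_nonzero(h_fir))   # 3: the response ends after the last tap
print(np.count_nonzero(h_iir))   # 20: still nonzero at the end of the window
print(np.allclose(h_iir, alpha * (1 - alpha) ** np.arange(20)))  # True
```

The feedback term is what keeps the IIR response alive: each output sample feeds a scaled copy of itself into the next one, so the response decays geometrically instead of stopping.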
Problem: Show $x * h = h * x$ for $x=[1,2]$, $h=[3,4]$.
Convolution with any filter longer than one tap increases signal length: $L = N + M - 1$. A 1-second audio recording convolved with a 2-second reverb impulse response produces an output of about 3 seconds (one sample short of it): the reverb "tail" is physically real energy that extends past the input. This is not a side effect; it is correct and expected.
In the interactive below, set the kernel to "Moving Avg (3-tap)" and drag the position slider to $n=2$. Before checking: what value do you predict for $y[2]$ using the convolution sum with $x = [0.5, 1.0, 1.8, \ldots]$ and $h = [1/3, 1/3, 1/3]$?
Compute $y[2] = \frac{1}{3}(x[0]+x[1]+x[2])$ first, then drag the slider to verify ↓
Convolution $y[n] = \sum_k x[k]\,h[n-k]$ is the flip-and-slide weighted sum that computes every LTI system's output, and it is the same sliding weighted sum that a CNN's convolutional layers apply to feature maps (most deep learning libraries skip the kernel flip, which makes no practical difference for learned weights).
Every convolutional layer in a CNN performs the same sliding-window operation described here, applied to 2D image arrays instead of 1D signals. The learned kernel weights $h$ (called "filters" in ML) are trained by gradient descent to detect edges, textures, or other features. A single forward pass through ResNet-50 performs billions of multiply-accumulate operations inside its convolutions. Understanding the 1D convolution sum is the direct prerequisite for understanding how CNNs extract features from raw pixel data.
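One detail worth seeing in code is the kernel flip. The sketch below, assuming NumPy, compares signal-processing convolution (`np.convolve`) with the unflipped sliding dot product (`np.correlate`), which is the form most deep learning libraries compute under the name "convolution". The input and kernel values are arbitrary illustrative numbers.

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
w = np.array([1.0, 0.0, -1.0])    # asymmetric kernel, so the flip is visible

# Signal-processing convolution: the kernel is flipped before sliding
conv = np.convolve(x, w, mode="valid")

# Sliding dot product without the flip (cross-correlation)
xcorr = np.correlate(x, w, mode="valid")

print(conv)     # -1, -3, 3
print(xcorr)    # 1, 3, -3
# Flipping the kernel by hand turns one into the other
print(np.allclose(xcorr, np.convolve(x, w[::-1], mode="valid")))  # True
```

For a symmetric kernel such as the 3-tap moving average the two results coincide, and for learned kernels the network simply learns the flipped weights.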
Q1. Give the formula for the output length when convolving a signal of length $N$ with a filter of length $M$. What does each extra sample represent physically?
Q2. Compute $y[0]$ and $y[1]$ for $x=[3, -1, 2]$ and $h=[0.5, 0.5]$ using the convolution sum. Show all multiplication terms.
Q3. Why is $h[n-k]$ written with $n-k$ instead of just $k$? What does the "$n-k$" argument physically represent?
Application
Convolution is not just theory — it powers every reverb plugin, every noise-cancelling headphone, and every audio signal chain in professional production. This section connects the convolution formula to real acoustic engineering: designing echo impulse responses, cascading filters, and understanding BIBO stability in the context of real systems.
Stand inside an empty concert hall and clap your hands once, sharply. The burst of sound echoes off the walls, bounces across the ceiling, and slowly fades. Record that entire decay, from the first clap to the last echo, and you have captured the hall's complete acoustic signature. Convolve any music recording with that one measurement, and you recreate the experience of performing inside that exact hall from anywhere in the world.
Designing an echo is like setting up a perfect mirror relay: direct sound reaches your ear first ($\delta[n]$), then a quieter reflected copy arrives $d$ samples later ($\alpha\,\delta[n-d]$). Cascade two such relays and the reflections multiply. This is exactly what $h_{total} = h_1 * h_2$ computes — the combined fingerprint of both mirrors in sequence.
A simple echo adds a delayed, attenuated copy of the dry signal:
$$y[n] = x[n] + \alpha \cdot x[n - d]$$
$\alpha \in (0,1)$ = echo gain (volume), $d$ = echo delay in samples. At $f_s = 44{,}100$ Hz, a 0.3-second echo: $d = 0.3 \times 44{,}100 = 13{,}230$ samples.
Background. The echo system is itself LTI. Substitute $x[n]=\delta[n]$ to find $h[n]$.
Problem: Find $h[n]$ for $y[n]=x[n]+0.5\cdot x[n-4]$.
Is this echo system BIBO stable? Compute $\sum|h[n]|$.
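The echo problem above can be sketched in a few lines, assuming NumPy. The helper `echo_impulse_response` and the short test input are hypothetical, introduced only for illustration.

```python
import numpy as np

def echo_impulse_response(alpha, d):
    """h[n] = delta[n] + alpha * delta[n - d], stored as an array of length d + 1."""
    h = np.zeros(d + 1)
    h[0] = 1.0        # direct sound
    h[d] = alpha      # delayed, attenuated copy
    return h

h = echo_impulse_response(alpha=0.5, d=4)
print(h)                     # 1, 0, 0, 0, 0.5
print(np.sum(np.abs(h)))     # 1.5, finite, so the echo is BIBO stable

# Converting a delay in seconds to samples, as in the 0.3 s example
fs = 44_100
d_samples = round(0.3 * fs)
print(d_samples)             # 13230

# Applying the echo to a short input
x = np.array([1.0, -1.0, 2.0])
y = np.convolve(x, h)
print(y)                     # 1, -1, 2, 0, 0.5, -0.5, 1
```

Because $h[n]$ has only two nonzero samples, $\sum|h[n]| = 1 + |\alpha|$ for this feedforward echo, which is finite for any finite gain.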
When two LTI systems are connected in series, the combined impulse response is the convolution of the two individual responses:
$$h_{total}[n] = h_1[n] * h_2[n]$$
Two filters in series are equivalent to one filter whose impulse response is their convolution. Order does not matter (commutativity): $h_1 * h_2 = h_2 * h_1$.
Problem: $h_1=[1,-1]$ (edge detector), $h_2=[0.5,0.5]$ (smoother). Find $h_{total}$.
Cascaded LTI systems merge into one equivalent system — commutativity means order is always rearrangeable.
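The cascade problem above can be checked numerically. This is a small sketch assuming NumPy; the test input `x` is an arbitrary illustrative signal.

```python
import numpy as np

h1 = np.array([1.0, -1.0])    # first difference
h2 = np.array([0.5, 0.5])     # 2-tap average
h_total = np.convolve(h1, h2)
print(h_total)                # 0.5, 0, -0.5

x = np.array([1.0, 2.0, 0.0, -1.0])
two_stage = np.convolve(np.convolve(x, h1), h2)   # h1 first, then h2
swapped = np.convolve(np.convolve(x, h2), h1)     # h2 first, then h1
one_stage = np.convolve(x, h_total)               # single merged filter

print(np.allclose(two_stage, one_stage))   # True
print(np.allclose(swapped, one_stage))     # True
```

Merging the stages ahead of time means the input is filtered once with a 3-tap kernel instead of twice with 2-tap kernels, which matters more as the filters get longer.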
Active noise-cancelling headphones measure the ambient noise $x_{noise}[n]$ with a reference microphone, then convolve it with an adaptive $h[n]$ chosen to produce the exact anti-noise signal:
$$y_{anti}[n] = x_{noise}[n] * h_{ANC}[n] \approx -x_{noise}[n]$$
When $y_{anti}$ reaches your ear simultaneously with the noise, they cancel: $x_{noise} + y_{anti} \approx 0$. The convolution operation — and the LTI framework — is what makes this physically possible.
Myth: "A feedback echo with $|\alpha| = 1$ produces a stable, infinite echo that sustains forever."
Reality: $|\alpha| = 1$ makes the feedback echo BIBO unstable. Each echo copy has the same amplitude as the previous one, so the impulse response never decays and $\sum|h[n]|$ diverges. A bounded input that keeps reinforcing the echoes then drives the output without bound, and a real system will clip or oscillate destructively. Stability requires $|\alpha| < 1$.
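The stability claim can be explored numerically. The sketch below assumes the feedback echo has the form $y[n] = x[n] + \alpha\,y[n-d]$ (this recursive form is an assumption, since the section does not write it out) and uses SciPy's `signal.lfilter`; `feedback_echo_h` is a hypothetical helper name.

```python
import numpy as np
from scipy import signal

def feedback_echo_h(alpha, d, length):
    """Impulse response of y[n] = x[n] + alpha*y[n-d], truncated to `length` samples."""
    a = np.zeros(d + 1)
    a[0] = 1.0
    a[d] = -alpha    # lfilter convention: a[0]*y[n] = b[0]*x[n] - a[d]*y[n-d]
    return signal.lfilter([1.0], a, signal.unit_impulse(length))

d = 8
for alpha in (0.9, 1.0):
    for length in (200, 2000):
        h = feedback_echo_h(alpha, d, length)
        # Partial sum of |h[n]| over the truncated response
        print(alpha, length, np.sum(np.abs(h)))
```

With $\alpha = 0.9$ the partial sums level off near $1/(1-0.9) = 10$ as the window grows. With $\alpha = 1$ every echo has amplitude 1, so the partial sum equals the number of echoes in the window and keeps growing with its length.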
In the widget below, set gain α to 0.9 and delay to 8 samples. Before moving the slider: do you predict the echo copies will grow, stay constant, or decay? What happens if you set α = 0.99?
Check your prediction using the BIBO stability condition $|\alpha| < 1$ first, then verify with the widget ↓
Any acoustic effect — echo, reverb, noise cancellation — is just an impulse response $h[n]$ away: measure the system once with an impulse, then apply $y = x * h$ to any dry input to produce the transformed output.
Convolution reverb plugins in professional Digital Audio Workstations (Logic Pro, Ableton, Pro Tools) implement exactly $y = x * h$ where $h$ is a real-world Room Impulse Response. A single 2-second cathedral RIR at 44.1 kHz contains 88,200 samples. Convolving a 3-minute vocal track with it requires over 700 billion multiplications naively — which is why DAWs use FFT-based fast convolution ($\mathcal{O}(N\log N)$ instead of $\mathcal{O}(NM)$), cutting the computation by three orders of magnitude to run in real time on a laptop CPU.
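The direct and FFT-based routes can be compared on a small example. This sketch assumes NumPy and SciPy's `signal.fftconvolve`; the random arrays are stand-ins for a recording and an RIR, and the $3L\log_2 L$ figure is only a rough operation estimate, not a measured cost.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)    # stand-in for a dry recording
h = rng.standard_normal(800)     # stand-in for a room impulse response

y_direct = np.convolve(x, h)               # direct sum, about N*M multiplications
y_fft = signal.fftconvolve(x, h)           # same result computed through FFTs
print(len(y_fft) == len(x) + len(h) - 1)   # True
print(np.allclose(y_direct, y_fft))        # True

# Rough operation counts for the 3-minute vocal and 2-second RIR
N = 180 * 44_100                 # 7,938,000 samples
M = 2 * 44_100                   # 88,200 samples
direct_mults = N * M             # 700,131,600,000
L = 1 << int(np.ceil(np.log2(N + M - 1)))   # next power of two: 2**23
fft_ops = 3 * L * np.log2(L)                # rough estimate, about 5.8e8
print(direct_mults, L, direct_mults / fft_ops)   # ratio is roughly 1200
```

Real-time plugins typically process audio in blocks rather than transforming the whole track at once, so this whole-signal estimate is a simplification.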
Q1. Write the impulse response $h[n]$ for an echo with gain $\alpha = 0.7$ and delay $d = 5$ samples. Is it FIR or IIR?
Q2. Two filters are cascaded: $h_1=[2, -1]$ and $h_2=[1, 1]$. Compute $h_{total}[0]$, $h_{total}[1]$, and $h_{total}[2]$.
Q3. A noise-cancelling headphone's ANC filter must satisfy $h_{ANC} * x_{noise} \approx -x_{noise}$. What impulse response $h_{ANC}$ achieves exact cancellation, and why is it physically impossible to realize perfectly?
Interactive Lab
Select a filter type and input signal, then watch the full convolution computed in real time. Try the Impulse input to see the impulse response h[n] directly — this is the fundamental test from Topic 1.
Week 3 Summary
The four ideas from this week that every filter designer, audio engineer, and ML practitioner carries permanently.
$y[n]=\mathcal{T}\{x[n]\}$ — any rule mapping input samples to output. Linear: $\mathcal{T}\{0\}=0$ and superposition holds. Time-invariant: shifting the input shifts the output by the same amount.
Feed any LTI system a unit impulse $\delta[n]$ and record the output. That one recording $h[n]$ completely characterizes the system's behavior for every future input via convolution.
$y[n]=\sum_k x[k]\,h[n-k]$. Output length $= N+M-1$. Commutative, associative, distributive. Foundation of every CNN layer, audio filter, and reverb plugin.
$h_{echo}[n]=\delta[n]+\alpha\delta[n-d]$ is FIR and BIBO stable for any finite $\alpha$; a feedback echo needs $|\alpha|<1$ for BIBO stability. Cascade $h_{total}=h_1*h_2$. Real acoustic spaces are measured as RIRs and replayed via convolution.
Go Deeper
Curated references to reinforce every concept from this week. Start with the 3Blue1Brown video for the best visual intuition of convolution.
The definitive visual explanation — from polynomial multiplication to signal processing intuition. Builds exactly the flip-and-slide picture from Topic 2.
Interactive · Paul Falstad: Real-time interactive filter applet. Design FIR/IIR filters and see frequency response, impulse response, and convolution live in the browser. Reinforces Topics 1 and 2.
Video · Numeryst: Python-first walkthrough of the convolution sum that starts from scratch and builds to np.convolve. Directly reinforces the code from Topic 2.
University-level lecture connecting impulse/step response analysis to the convolution integral; bridges Topic 1 (impulse response) to Topic 2 (convolution).
Practice
Rigorous problem sets covering mathematical derivations and functional coding tasks, followed by advanced synthesis applications.
Test the system $\mathcal{T}\{x[n]\} = 3x[n] + 5$ analytically:
(a) Linearity test: Apply the superposition property with $a=2, b=1$ and two signals $x_1, x_2$. Show whether $\mathcal{T}\{2x_1+x_2\} = 2\mathcal{T}\{x_1\}+\mathcal{T}\{x_2\}$ holds.
(b) Time-invariance test: Show whether $\mathcal{T}\{x[n-n_0]\} = y[n-n_0]$ holds for any delay $n_0$.
(c) State clearly whether each property holds and identify the exact term that breaks it.
(a) Implement two system operators in Python: T1(x) = 3*x and T2(x) = 3*x + 5. Use np.random.randn(4) for $x_1, x_2$ with $a=2, b=-1$. Verify linearity with np.allclose().
(b) For the FIR system $y[n] = 0.5x[n] - 0.25x[n-1] + 0.1x[n-2]$, use scipy.signal.unit_impulse(10) and lfilter([0.5,-0.25,0.1],[1],delta) to compute $h[n]$.
(c) Verify BIBO stability: print $\sum|h[n]|$ and confirm the result is finite.
Hint: from scipy import signal; delta = signal.unit_impulse(10); h = signal.lfilter(b, a, delta). For stability: np.sum(np.abs(h)) should be finite and small.
Given $x[n] = [1, 2, 0, -1]$ (length 4) and $h[n] = [0.5, 1.0, 0.5]$ (length 3):
(a) State the output length $L = N + M - 1$.
(b) Manually compute $y[0]$ through $y[5]$ using the convolution sum $y[n]=\sum_k x[k]\,h[n-k]$. Show every multiplication term for each output sample.
(c) Write the full output sequence and verify that $y[0]+y[1]+\ldots+y[5]$ equals $(\sum x[n]) \times (\sum h[n])$.
For $x=[1, 2, 0, -1]$ and $h=[0.5, 1.0, 0.5]$:
(a) Implement a manual convolution using a nested for loop (no library calls).
(b) Verify using np.convolve(x, h). Assert they match with np.allclose.
(c) Demonstrate commutativity: show np.convolve(x,h) equals np.convolve(h,x).
(d) Apply a 3-tap moving average filter $h=[1/3,1/3,1/3]$ to the noisy signal x = np.sin(np.linspace(0,2*np.pi,50)) + 0.5*np.random.randn(50) and print the first 5 values of the smoothed output.
Hint: for n in range(L): for k in range(len(x)): if 0 <= n-k < len(h): y[n] += x[k]*h[n-k]. For the noisy case use np.convolve(x, h, mode='valid') to avoid edge effects.
(a) Write the impulse response $h[n]$ for an echo with $\alpha=0.6$ and $d=5$ samples. List all non-zero values explicitly.
(b) Two filters are cascaded: $h_1=[1,-1]$ (first-difference) and $h_2=[0.5,0.5]$ (averaging). Compute $h_{total}[n] = h_1[n]*h_2[n]$ manually step by step. Then compute $h_2*h_1$ and confirm they are equal (commutativity).
(c) Is the echo system in (a) BIBO stable? Compute $\sum|h[n]|$.
Synthesize a multi-echo effect:
(a) Create a test signal: a single pulse of value 1 at $n=50$, zero elsewhere, total length 200 samples.
(b) Build the echo impulse response: $h[n]=\delta[n]+0.6\,\delta[n-30]+0.36\,\delta[n-60]$ (the direct sound plus two echoes, each arrival 60% of the previous).
(c) Compute the output via np.convolve(x, h). Print the output length.
(d) Find and print the indices of the three largest non-zero values in $y$ — they should be at 50, 80, and 110.
Hint: h = np.zeros(61); h[0]=1.0; h[30]=0.6; h[60]=0.36. The output should have peaks at 50, 80, and 110. Use np.argsort(np.abs(y))[-3:] to find the three largest indices.
A cathedral reverb uses a Room Impulse Response of $M = 88{,}200$ samples (2 s at 44.1 kHz). The dry vocal recording has $N = 2{,}646{,}000$ samples (1 minute).
(a) Calculate the exact number of multiplications for direct time-domain convolution ($\mathcal{O}(NM)$).
(b) FFT-based convolution uses an FFT of size $L$ = next power of 2 above $N+M-1$. Find $L$ (use $2^{\lceil\log_2(N+M-1)\rceil}$), then estimate total operations as $3L\log_2 L$.
(c) Calculate the speedup ratio. At what input length does FFT convolution become faster than direct convolution (crossover point)?
The exponential smoothing filter is $y[n] = \alpha\,x[n] + (1-\alpha)\,y[n-1]$, with $\alpha=0.3$.
(a) Show analytically that $h[n] = \alpha(1-\alpha)^n\,u[n]$ is the impulse response (infinite → IIR).
(b) Implement the filter recursively in Python and apply it to a noisy sine wave: x = np.sin(np.linspace(0,4*np.pi,200)) + 0.5*np.random.randn(200).
(c) Verify BIBO stability: compute $\sum_{n=0}^{500}|h[n]|$ numerically and show it converges to $\alpha/(1-(1-\alpha)) = 1$.
(d) Print the maximum absolute error between the noisy input and the smoothed output to confirm the filter reduces noise.
Hint: y = np.zeros(len(x)); y[0] = alpha*x[0]; then for i in range(1, len(x)): y[i] = alpha*x[i] + (1-alpha)*y[i-1]. For stability: h = alpha*(1-alpha)**np.arange(501); np.sum(np.abs(h)) should be close to 1.0.