Representing audio: waveforms, spectrograms, codecs
Sound is a one-dimensional pressure signal. A microphone samples it at a fixed rate \(f_s\) (16 kHz for speech, 44.1/48 kHz for music) into a stream of amplitudes — the waveform. One second of 16 kHz mono audio is 16,000 numbers; one minute is nearly a million. Raw waveforms are faithful but punishingly long and locally meaningless: a single sample tells you almost nothing. The first job of any audio model is to find a representation that is shorter and where structure is visible.
The classical answer is the Short-Time Fourier Transform (STFT). Slide a window of \(N\) samples across the waveform, hop by \(H\) samples each step, and take the Fourier transform of each windowed frame. The magnitude of the result is the spectrogram — a 2-D image with time on one axis and frequency on the other, brightness encoding energy.
The uncertainty tradeoff. A short window localizes when a sound happened but smears which frequency it was; a long window resolves frequency finely but blurs timing. The two resolutions are reciprocal: a window of duration \(\Delta t = N/f_s\) gives an FFT bin spacing of \(\Delta f = f_s/N\), so \(\Delta t \cdot \Delta f \approx 1\) and you cannot shrink both at once. This is not an engineering limitation but the audio form of the Heisenberg–Gabor limit (formally \(\sigma_t \sigma_f \ge 1/4\pi\)); speech pipelines settle near a 25 ms window with a 10 ms hop as a pragmatic compromise.
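As a rough numeric illustration of this tradeoff, the sketch below tabulates the frame duration and FFT bin spacing for a few window lengths at 16 kHz. The window lengths other than 400 samples are arbitrary choices for illustration, and the bin spacing \(f_s/N\) ignores the extra broadening caused by the window shape.

```python
fs = 16000  # sample rate in Hz, the speech rate used in this section

def stft_resolution(n_window, fs):
    """Return (frame duration in seconds, FFT bin spacing in Hz)."""
    dt = n_window / fs      # how long one analysis frame lasts
    df = fs / n_window      # spacing between adjacent FFT bins
    return dt, df

# Window lengths are illustrative; 400 samples is the 25 ms speech default.
rows = []
for n_window in (100, 400, 1600):
    dt, df = stft_resolution(n_window, fs)
    rows.append((n_window, dt, df, dt * df))
    print(f"N={n_window:5d}  dt={dt*1e3:7.2f} ms  df={df:6.1f} Hz  dt*df={dt*df:.2f}")
```

Quadrupling the window quarters the bin spacing and quadruples the frame duration, so the product stays fixed.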
The mel scale. Human hearing is roughly logarithmic in frequency — we distinguish 200 Hz from 300 Hz easily but barely separate 8,000 Hz from 8,100 Hz. So spectrograms for speech are warped onto the perceptual mel scale, pooling the linear bins into ~80 mel bands that are dense at low frequencies and sparse at high ones. The result — the log-mel spectrogram — is the input that almost every modern speech model actually consumes.
# EQ MM4.1: a spectrogram (STFT magnitude) of a synthetic two-tone signal
import numpy as np

fs = 16000                                   # sample rate (Hz)
t = np.arange(0, 0.5, 1/fs)                  # half a second
x = np.sin(2*np.pi*440*t) + 0.5*np.sin(2*np.pi*2000*t)  # 440 Hz + 2 kHz
N, H = 400, 160                              # 25 ms window, 10 ms hop
win = np.hanning(N)

frames = []
for m in range(0, len(x) - N + 1, H):        # +1 so a final full-length frame is never dropped
    seg = x[m:m+N] * win                     # windowed frame starting at sample m
    spec = np.abs(np.fft.rfft(seg))          # magnitude spectrum of this frame
    frames.append(spec)
S = np.array(frames)                         # shape: (time frames, freq bins)
freqs = np.fft.rfftfreq(N, 1/fs)             # Hz of each bin

mid = S[S.shape[0]//2]                       # one frame from the middle
# local maxima (a bin taller than both neighbours) = the spectral ridges
loc = (mid[1:-1] > mid[:-2]) & (mid[1:-1] > mid[2:])
peakf = freqs[1:-1][loc]
peaks = peakf[np.argsort(mid[1:-1][loc])[-2:]]  # two strongest distinct ridges

print(f"spectrogram shape (frames x bins): {S.shape}")
print(f"frequency resolution df = fs/N : {fs/N:.0f} Hz")
print(f"two dominant tones recovered : {np.sort(peaks).round(0)} Hz")
# plot_xy is not a NumPy function; it is assumed to be this book's plotting helper
plot_xy(freqs, mid)                          # the spectrum of one frame
From continuous spectra to discrete tokens — neural codecs. Transformers want discrete tokens, not floating-point spectrograms. A neural audio codec closes that gap. Models such as SoundStream and Meta's EnCodec are convolutional autoencoders trained with residual vector quantization (RVQ): the encoder compresses the waveform to a low-rate latent, a stack of codebooks quantizes it into integer codes — each codebook correcting the residual of the previous one — and the decoder reconstructs high-fidelity audio. The output is a grid of discrete tokens, perhaps 75 frames per second across 8 codebooks, that a language model can predict exactly like text.
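The residual quantization step can be sketched in a few lines of NumPy. This is a toy illustration, not SoundStream or EnCodec: the codebooks are random rather than learned, and the sizes (`dim`, `size`, `n_stages`) are hypothetical values chosen for display.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; return one index per stage."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                        # cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))             # nearest code to the current residual
        codes.append(idx)
        residual = residual - cb[idx]           # the next stage quantizes what is left
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected code vector from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

dim, size, n_stages = 4, 16, 8                  # hypothetical toy sizes
# Random codebooks with shrinking scale stand in for learned ones (assumption).
codebooks = [rng.normal(scale=0.5 ** k, size=(size, dim)) for k in range(n_stages)]

x = rng.normal(size=dim)                        # stands in for one encoder latent frame
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print("codes for this frame:", codes)
print("reconstruction error:", np.linalg.norm(x - x_hat))
```

Each latent frame becomes `n_stages` small integers, which is the column of the token grid that a language model would later predict.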
Speech recognition: Whisper
Automatic speech recognition (ASR) turns audio into text. The classical stack chained a hidden Markov model acoustic model, a pronunciation dictionary, and an n-gram language model: three separately trained components glued together by hand. The connectionist temporal classification (CTC) loss and attention-based sequence-to-sequence models collapsed this stack into a single neural network. Whisper (OpenAI, 2022) is the canonical modern endpoint: a plain encoder–decoder transformer trained on 680,000 hours of weakly labelled, multilingual audio scraped from the web.
The architecture is deliberately ordinary. Audio is resampled to 16 kHz, converted to an 80-bin log-mel spectrogram over 30-second chunks, and fed to a transformer encoder. A transformer decoder then autoregressively predicts text tokens, attending to the encoder output — exactly the machine-translation recipe of Vol II, with mel frames in place of source words. Special tokens in the decoder prompt steer the task: transcribe vs. translate, language id, with-or-without timestamps. One model, one objective, many jobs.
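The shapes implied by this recipe are easy to check by hand. The sketch below takes the 16 kHz rate, 80 mel bins, and 30-second chunk from the text; the 10 ms hop, the stride-2 convolution in the encoder stem, and the special-token spellings are taken from the Whisper paper rather than from this chapter, and `decoder_prompt` is a hypothetical helper, not a function of the Whisper library.

```python
fs = 16000            # Hz, from the text
chunk_seconds = 30    # from the text
n_mels = 80           # from the text
hop = 160             # 10 ms stride between mel frames (Whisper paper)
conv_stride = 2       # stride of the second conv layer in the encoder stem (Whisper paper)

n_samples = fs * chunk_seconds          # raw samples in one chunk
n_frames = n_samples // hop             # mel frames fed to the encoder stem
n_positions = n_frames // conv_stride   # sequence length seen by the transformer encoder

def decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Build the special-token prefix that selects language and task (hypothetical helper)."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

print(f"{n_samples} samples -> mel input {n_mels} x {n_frames} -> {n_positions} encoder positions")
print(decoder_prompt())
print(decoder_prompt(language="fr", task="translate", timestamps=True))
```

Switching from transcription to translation changes only the prompt tokens; the weights and the audio path stay the same.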
Whisper's robustness came not from architecture but from scale and weak supervision: enormous, noisy, diverse data taught it to handle accents, background noise, and code-switching far better than systems trained on clean curated corpora. The cost is occasional hallucination — when the audio is silent or unintelligible, an autoregressive decoder will confidently invent fluent text, because nothing in the objective penalizes a plausible-sounding guess. This is the honest caveat: Whisper trades the brittleness of old pipelines for the failure mode of all generative decoders.
The discriminative alternative. Not all ASR is generative. Self-supervised encoders such as wav2vec 2.0 learn audio representations from unlabelled speech and are then fine-tuned with a thin CTC output head. They align frames to characters without an autoregressive decoder, so they are far less prone to inventing whole sentences, but they usually need an external language model for top accuracy. Encoder–decoder models like Whisper fold the language model inside, which is convenient but comes at the price of the failure mode above.
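To make "align frames to characters without a decoder" concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the most likely symbol per frame, merge consecutive repeats, then drop the blank symbol. The alphabet and the per-frame probabilities are made up for illustration; a real system would obtain them from the encoder's output head.

```python
import numpy as np

def ctc_greedy_decode(log_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = np.argmax(log_probs, axis=1)      # most likely symbol index for each frame
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:     # a new non-blank symbol starts here
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

def one_hot_log_probs(ids, n_classes):
    """Fake frame posteriors that put 0.97 on the given symbol (illustration only)."""
    probs = np.full((len(ids), n_classes), 0.01)
    probs[np.arange(len(ids)), ids] = 0.97
    return np.log(probs)

alphabet = ["_", "c", "a", "t"]              # index 0 is the CTC blank
frame_ids = [1, 1, 0, 2, 2, 0, 0, 3, 3]      # c c _ a a _ _ t t
print(ctc_greedy_decode(one_hot_log_probs(frame_ids, 4), alphabet))

# A blank between two identical symbols keeps them distinct: a _ a
print(ctc_greedy_decode(one_hot_log_probs([2, 0, 2], 4), alphabet))
```

Every output character is tied to specific frames, which is why this style of model has no mechanism for free-running text generation.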
Text-to-speech: Tacotron to neural TTS
Text-to-speech (TTS) runs recognition in reverse: text in, waveform out. Like ASR it was once a pipeline — text normalization, a pronunciation model, a duration model, and a vocoder that built sound from parameters, every stage hand-tuned and audibly robotic. Neural TTS rebuilt it as two learned stages connected by the spectrogram.
Stage 1 — text to spectrogram (the acoustic model). Tacotron 2 (2017) is the archetype: a sequence-to-sequence model with attention that reads characters and emits a log-mel spectrogram frame by frame, learning prosody, rhythm, and emphasis directly from data. Its weakness was the autoregressive attention itself — it could skip or repeat words when alignment slipped. Non-autoregressive successors such as FastSpeech 2 fixed this by predicting an explicit duration for each input token and expanding the sequence in one shot, trading a little naturalness for speed and rock-solid alignment.
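The "predict a duration, then expand" step can be sketched with a single NumPy call. This is a simplified stand-in for the length regulator described in the FastSpeech papers: the hidden states and durations below are invented numbers, where a real model would produce both from learned layers.

```python
import numpy as np

def length_regulate(token_states, durations):
    """Repeat each token's hidden vector durations[i] times along the time axis."""
    return np.repeat(token_states, durations, axis=0)

tokens = ["h", "e", "l", "o"]
# One 2-dimensional hidden vector per input token (made-up values).
token_states = np.arange(len(tokens) * 2, dtype=float).reshape(len(tokens), 2)
# Frames assigned to each token; hypothetical output of a duration predictor.
durations = np.array([3, 5, 2, 6])

frames = length_regulate(token_states, durations)
print("input tokens :", token_states.shape[0])
print("output frames:", frames.shape[0])
```

Because the output length is fixed before any frame is generated, every token receives at least its predicted share of frames, so words cannot be skipped or repeated the way a drifting attention alignment allows.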
Stage 2 — spectrogram to waveform (the vocoder). The spectrogram threw away phase, so the vocoder must synthesize plausible samples. WaveNet (2016) was the breakthrough: an autoregressive convolutional network using dilated causal convolutions to model raw audio one sample at a time, conditioned on the spectrogram. Quality was a leap; speed was a disaster — generating 24,000 samples per second one-at-a-time. The field's history since is the race to keep WaveNet's fidelity without its latency: parallel flows (Parallel WaveNet), and the now-dominant GAN vocoders (HiFi-GAN) that generate an entire waveform in a single non-autoregressive forward pass.
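The appeal of dilated causal convolutions is that the receptive field grows exponentially with depth while each output still depends only on past samples. The sketch below is a bare NumPy illustration with all filter weights set to one, not a WaveNet implementation; the dilation schedule 1, 2, 4, ..., 512 with kernel size 2 follows the pattern described in the WaveNet paper.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """y[t] = sum_k w[k] * x[t - k*dilation], treating samples before t=0 as zero."""
    y = np.zeros(len(x), dtype=float)
    for k, wk in enumerate(w):
        shift = k * dilation
        if shift < len(x):
            y[shift:] += wk * x[:len(x) - shift]
    return y

def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
rf = receptive_field(dilations)

# Empirical check: push a unit impulse through the stack and count
# how many output samples it reaches.
y = np.zeros(2048)
y[0] = 1.0
for d in dilations:
    y = causal_dilated_conv(y, [1.0, 1.0], d)

print("receptive field (formula):", rf)
print("samples reached by impulse:", int(np.count_nonzero(y)))
print(f"span at 24 kHz: {rf / 24000 * 1e3:.1f} ms")
```

Ten layers cover about a thousand samples, but generation still has to run the whole stack once per output sample, which is the latency problem the later vocoders remove.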
# EQ MM4.3: warp linear FFT frequencies onto the perceptual mel scale
import numpy as np
fs, N = 16000, 400
freqs = np.fft.rfftfreq(N, 1/fs) # linear Hz of each FFT bin
def hz_to_mel(f): return 2595*np.log10(1 + f/700)
def mel_to_hz(m): return 700*(10**(m/2595) - 1)
n_mels = 8 # tiny filterbank for display
mel_max = hz_to_mel(fs/2) # Nyquist in mels
edges_mel = np.linspace(0, mel_max, n_mels+1) # equal spacing in MEL...
edges_hz = mel_to_hz(edges_mel) # ...is ever-wider spacing in HZ (near-linear below ~700 Hz, log-like above)
print(f"Nyquist : {fs/2:.0f} Hz = {mel_max:.0f} mel")
print(f"mel(1000): {hz_to_mel(1000):.1f} (calibrated so 1 kHz ~ 1000 mel)")
print("mel-band edges in Hz (note they widen as frequency rises):")
print(np.round(edges_hz, 0))
widths = np.diff(edges_hz)
print("band widths (Hz) :", np.round(widths, 0))
# plot_xy is not a NumPy function; it is assumed to be this book's plotting helper
plot_xy(freqs, hz_to_mel(freqs)) # the warping curve itself
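Going one step further, the same warping can be turned into the pooling operation that produces a log-mel frame. The sketch below builds triangular filters whose centres are equally spaced in mel and applies them to the power spectrum of one frame. It is a minimal illustration under simple assumptions (unnormalized triangles, natural log, a small floor of 1e-10, only 8 bands); library implementations differ in normalization and scaling details.

```python
import numpy as np

def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters, shape (n_mels, n_fft//2 + 1), centres equally spaced in mel."""
    freqs = np.fft.rfftfreq(n_fft, 1 / fs)
    # n_mels triangles need n_mels + 2 corner points (left edge, centre, right edge).
    pts = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(freqs)))
    for i in range(n_mels):
        lo, mid, hi = pts[i], pts[i + 1], pts[i + 2]
        rising = (freqs - lo) / (mid - lo)     # ramps 0 -> 1 between lo and mid
        falling = (hi - freqs) / (hi - mid)    # ramps 1 -> 0 between mid and hi
        fb[i] = np.clip(np.minimum(rising, falling), 0, None)
    return fb

fs, N = 16000, 400
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 440 * t) * np.hanning(N)   # one windowed 440 Hz frame
power = np.abs(np.fft.rfft(frame)) ** 2               # power spectrum, 201 bins

fb = mel_filterbank(8, N, fs)
log_mel = np.log(fb @ power + 1e-10)                  # 201 linear bins pooled into 8 bands
print("filterbank shape:", fb.shape)
print("log-mel frame   :", np.round(log_mel, 2))
```

Stacking one such vector per frame gives the log-mel spectrogram; production front ends use around 80 bands instead of 8.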
Audio language models
Once a neural codec (§4.1) turns audio into discrete tokens, the boundary between speech and language dissolves. An audio language model is a transformer trained to predict the next audio token exactly as a text LLM predicts the next word. Google's AudioLM (2022) demonstrated the idea: prompt it with a few seconds of speech or piano and it continues them coherently — preserving speaker identity, prosody, even recording-room acoustics — with no transcript and no explicit acoustic model, purely from the token statistics.
This unification is why a single model can now transcribe, translate, answer, and speak. Text-conditioned variants (Meta's MusicGen for music, VALL-E for zero-shot voice cloning from a three-second sample) prepend text or a reference clip and let the same next-token machinery do the rest. The frontier "omni" assistants (GPT-4o, Gemini) push it further: by their developers' public descriptions, a single model ingests interleaved text, audio, and image inputs and produces audio directly, with no separate ASR or TTS stage in the loop, although their architectural details are largely unpublished. The pipeline of §4.2 and §4.3 has folded into a single sequence model.
Honest limits. Audio tokens are dense: at the 75 frames per second and 8 codebooks quoted in §4.1, a codec emits 600 tokens for every second of audio, so context windows fill fast, and long-form coherence (a five-minute song that develops, a ten-turn spoken dialogue that remembers) remains hard. Zero-shot voice cloning from seconds of audio is also a live misuse and consent concern; watermarking and detection are active, unsettled research, not solved problems.
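A back-of-envelope calculation shows how quickly the budget runs out. It uses the 75 frames per second and 8 codebooks from the codec example earlier in the chapter and assumes, for simplicity, that every codebook token is flattened into one sequence; the 8,192-token context window is a hypothetical figure, not a property of any particular model.

```python
frames_per_second = 75     # codec frame rate quoted in the text
n_codebooks = 8            # codebooks per frame quoted in the text
tokens_per_second = frames_per_second * n_codebooks

def audio_tokens(seconds):
    """Tokens needed for a clip if all codebooks are flattened into one stream."""
    return int(seconds * tokens_per_second)

context_window = 8192      # hypothetical context length, for illustration only
seconds_that_fit = context_window / tokens_per_second

print(f"tokens per second      : {tokens_per_second}")
print(f"five-minute song       : {audio_tokens(5 * 60):,} tokens")
print(f"audio in {context_window} tokens : {seconds_that_fit:.1f} s")
```

Under these assumptions a five-minute song is far longer, in tokens, than the context can hold, which is one reason practical systems look for ways to avoid modelling every codebook at every step.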
Real-time & streaming
A spoken-conversation assistant lives or dies on latency. Humans tolerate roughly 200–300 ms of silence before a reply feels unnatural; a cascade of ASR → LLM → TTS, in which each stage waits for the previous one to finish, easily blows past a full second. Real-time audio is therefore as much an engineering discipline of cutting time-to-first-sound as a question of model quality.
Streaming, not chunking. Whisper's 30-second window is fine for files but fatal for conversation. Streaming ASR processes audio in small overlapping chunks and emits partial transcripts as words arrive, using causal or limited-lookahead attention so it never waits for the end of the utterance. On the output side, streaming TTS begins vocoding the first words while the acoustic model is still producing later frames — a software pipeline rather than a sequence of blocking stages.
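The input side of such a pipeline can be sketched as a generator that hands out small overlapping chunks as soon as they are available. This shows only the chunking, not a recognizer: the chunk and overlap durations are hypothetical, and a real streaming model would consume each chunk and emit a partial transcript.

```python
import numpy as np

def stream_chunks(samples, chunk_size, overlap):
    """Yield (start_index, chunk) pairs that advance by chunk_size - overlap samples."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    start = 0
    while start < len(samples):
        yield start, samples[start:start + chunk_size]
        if start + chunk_size >= len(samples):   # this chunk already reached the end
            break
        start += step

fs = 16000
audio = np.zeros(fs)                 # one second of silence as a stand-in signal
chunk_size = int(0.32 * fs)          # 320 ms chunks (hypothetical setting)
overlap = int(0.08 * fs)             # 80 ms shared with the previous chunk (hypothetical)

chunks = list(stream_chunks(audio, chunk_size, overlap))
for start, chunk in chunks:
    t0, t1 = start / fs, (start + len(chunk)) / fs
    print(f"chunk covering {t0:.2f}-{t1:.2f} s ({len(chunk)} samples)")
```

The first chunk is ready after 320 ms of audio instead of after the whole utterance, and the overlap gives the model a little left context at every chunk boundary.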
Full-duplex and the future. The hardest problems are conversational, not acoustic: detecting end-of-turn quickly without cutting the speaker off, handling barge-in (the user interrupting mid-reply), and modelling backchannels ("mm-hmm"). Systems such as Kyutai's Moshi attack this with a full-duplex architecture that listens and speaks simultaneously over parallel token streams, dropping perceived latency toward 200 ms. End-to-end audio language models point the same way: the lower the latency budget, the more pressure to collapse the pipeline into one model — which is exactly the arc this chapter traced.
Sound was the first non-text modality to fully become a token stream; the next frontier is a model of the world itself. Chapter 05: world models — systems that learn the dynamics of an environment from observation, predict the future of a scene, and let agents plan and act inside a learned simulation.
References
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision.
- van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio.
- Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2022). High Fidelity Neural Audio Compression.
- Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.
- Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.
- Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.
- Borsos, Z. et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation.
- Défossez, A. et al. (2024). Moshi: A Speech-Text Foundation Model for Real-Time Dialogue.