MULTIMODAL & WORLD MODELS · CHAPTER 04 / 06

Speech & Audio Models

Speech systems were once pipelines of hand-engineered parts: acoustic models, pronunciation lexicons, language models, and vocoders. Speech is now treated as a sequence-modeling problem, handled by the same transformers used for text. This chapter starts from the raw representation, covering waveforms, spectrograms, and the neural codecs that turn sound into discrete tokens, then the recognition, synthesis, and end-to-end audio language models built on top.

LEVEL: CORE · READING TIME: ≈ 24 MIN · BUILDS ON: MULTIMODAL 01–02 · INSTRUMENTS: SPECTROGRAM, MEL EXPLORER, TTS PIPELINE
4.1

Representing audio: waveforms, spectrograms, codecs

Sound is a one-dimensional pressure signal. A microphone samples it at a fixed rate \(f_s\) (16 kHz for speech, 44.1/48 kHz for music) into a stream of amplitudes — the waveform. One second of 16 kHz mono audio is 16,000 numbers; one minute is nearly a million. Raw waveforms are faithful but punishingly long and locally meaningless: a single sample tells you almost nothing. The first job of any audio model is to find a representation that is shorter and where structure is visible.

The classical answer is the Short-Time Fourier Transform (STFT). Slide a window of \(N\) samples across the waveform, hop by \(H\) samples each step, and take the Fourier transform of each windowed frame. The magnitude of the result (or its square, the power, as in EQ MM4.1) is the spectrogram: a 2-D image with time on one axis and frequency on the other, brightness encoding energy.

EQ MM4.1 — SHORT-TIME FOURIER TRANSFORM $$ X[m, k] \;=\; \sum_{n=0}^{N-1} x[mH + n]\, w[n]\, e^{-\,i\, 2\pi k n / N}, \qquad \text{spectrogram } S[m,k] = \big| X[m,k] \big|^2 $$
\(m\) indexes the time frame, \(k\) the frequency bin, \(w[n]\) a tapering window (Hann is standard) that suppresses edge artifacts. The window length \(N\) sets the frequency resolution \(\Delta f = f_s/N\); the hop \(H\) sets how many frames per second. Phase is usually discarded — most audio models work on magnitude alone and reconstruct phase later.

The uncertainty tradeoff. A short window localizes when a sound happened but smears which frequency it was; a long window resolves frequency finely but blurs timing. The two resolutions are reciprocal — \(\Delta t \cdot \Delta f \approx 1\) — so you cannot have both at once. This is not an engineering limitation but the audio form of the Heisenberg–Gabor limit; speech pipelines settle near a 25 ms window with a 10 ms hop as a pragmatic compromise.

EQ MM4.2 — TIME–FREQUENCY RESOLUTION $$ \Delta f = \frac{f_s}{N}, \qquad \Delta t = \frac{N}{f_s}, \qquad \Delta t \cdot \Delta f = 1 $$
At \(f_s = 16{,}000\) Hz with \(N = 400\) samples (a 25 ms window): \(\Delta f = 40\) Hz and \(\Delta t = 25\) ms. Double the window to 800 samples and the bin spacing halves to 20 Hz (finer frequency resolution), but each frame now blurs 50 ms of time. You trade one axis for the other; their product is fixed.
True or false: a spectrogram trades time resolution against frequency resolution — making the analysis window longer improves \(\Delta f\) but worsens \(\Delta t\). (Answer true or false.)
From EQ MM4.2, \(\Delta f = f_s/N\) shrinks as the window \(N\) grows (finer frequency), while \(\Delta t = N/f_s\) grows (coarser time). Their product \(\Delta t\cdot\Delta f = 1\) is constant, so improving one axis necessarily degrades the other. The statement is true.
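The arithmetic of EQ MM4.2 is easy to tabulate. The following is a minimal sketch; the function name `stft_resolution` and the particular window lengths are illustrative choices, not part of the chapter's instruments.

```python
def stft_resolution(fs, n_window):
    """Return (delta_f in Hz, delta_t in seconds) for an N-sample window (EQ MM4.2)."""
    delta_f = fs / n_window      # spacing between frequency bins
    delta_t = n_window / fs      # duration covered by one frame
    return delta_f, delta_t

fs = 16_000
for n_window in (200, 400, 800, 1600):
    df, dt = stft_resolution(fs, n_window)
    # The product df * dt stays at 1 whatever window length is chosen.
    print(f"N={n_window:5d}  df={df:6.1f} Hz  dt={dt * 1e3:6.1f} ms  product={df * dt:.1f}")
```

Each doubling of the window halves the bin spacing and doubles the frame duration, which is the trade the text describes.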

The mel scale. Human hearing is roughly logarithmic in frequency — we distinguish 200 Hz from 300 Hz easily but barely separate 8,000 Hz from 8,100 Hz. So spectrograms for speech are warped onto the perceptual mel scale, pooling the linear bins into ~80 mel bands that are dense at low frequencies and sparse at high ones. The result — the log-mel spectrogram — is the input that almost every modern speech model actually consumes.

EQ MM4.3 — HERTZ TO MEL $$ m(f) \;=\; 2595 \,\log_{10}\!\left( 1 + \frac{f}{700} \right) $$
The 700 Hz break and 2595 constant are calibrated so that 1,000 Hz maps to ~1,000 mel. Below the break the mapping is near-linear; above it, near-logarithmic. Mel filterbank bins are spaced uniformly in mel, which means they widen geometrically in Hz — exactly matching the ear's declining pitch resolution.
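The pooling of linear bins into mel bands can be sketched with triangular filters whose edges are uniform in mel (EQ MM4.3). This is a simplified illustration, not the exact filterbank of any particular toolkit: the helper names, the 512-point FFT, and the choice of 40 bands are assumptions made here so that every low-frequency band contains at least one FFT bin.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(fs, n_fft, n_mels):
    """Triangular filters of shape (n_mels, n_fft//2 + 1), edges uniform in mel."""
    fft_freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)     # ramp up from lo to mid
        falling = (hi - fft_freqs) / (hi - mid)    # ramp down from mid to hi
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(power_spec, fb, eps=1e-10):
    """power_spec: (frames, bins) -> (frames, n_mels) log-mel energies."""
    return np.log10(power_spec @ fb.T + eps)

fs, n_fft, n_mels = 16_000, 512, 40
fb = mel_filterbank(fs, n_fft, n_mels)
centres_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_mels + 2))[1:-1]

# One Hann-windowed frame of a 1 kHz tone, pooled into mel bands.
t = np.arange(n_fft) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hanning(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2
features = log_mel(power[None, :], fb)
print("strongest band centre (Hz):", round(float(centres_hz[features[0].argmax()])))
```

The strongest band should sit close to 1 kHz, and neighbouring band centres are further apart at high frequencies than at low ones.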
PYTHON · RUNNABLE IN-BROWSER
# EQ MM4.1: a spectrogram (STFT magnitude) of a synthetic two-tone signal
import numpy as np
fs = 16000                                  # sample rate (Hz)
t  = np.arange(0, 0.5, 1/fs)                # half a second
x  = np.sin(2*np.pi*440*t) + 0.5*np.sin(2*np.pi*2000*t)   # 440 Hz + 2 kHz

N, H = 400, 160                             # 25 ms window, 10 ms hop
win  = np.hanning(N)
frames = []
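for start in range(0, len(x) - N + 1, H):   # start = m*H; the +1 keeps a final frame that fits exactly
    seg  = x[start:start+N] * win           # windowed frame x[mH + n] * w[n]
    spec = np.abs(np.fft.rfft(seg))         # magnitude spectrum of this frame
    frames.append(spec)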
S = np.array(frames)                        # shape: (time frames, freq bins)

freqs = np.fft.rfftfreq(N, 1/fs)            # Hz of each bin
mid   = S[S.shape[0]//2]                    # one frame from the middle
# local maxima (a bin taller than both neighbours) = the spectral ridges
loc   = (mid[1:-1] > mid[:-2]) & (mid[1:-1] > mid[2:])
peakf = freqs[1:-1][loc]
peaks = peakf[np.argsort(mid[1:-1][loc])[-2:]]   # two strongest distinct ridges
print(f"spectrogram shape (frames x bins): {S.shape}")
print(f"frequency resolution df = fs/N    : {fs/N:.0f} Hz")
print(f"two dominant tones recovered      : {np.sort(peaks).round(0)} Hz")
plot_xy(freqs, mid)                         # the spectrum of one frame (plot_xy is the page's own plotting helper, not a NumPy function)
INSTRUMENT MM4.1 · WAVEFORM → SPECTROGRAM · STFT MAGNITUDE · EQ MM4.1
[Interactive figure. Panels: waveform (time domain); spectrogram (time × frequency, brighter = more energy). Readouts: sample rate (16 kHz), frequency resolution Δf, window Δt.]
Two pure tones produce two bright horizontal ridges. Widen the window N and the ridges sharpen vertically (finer Δf) while each frame covers more time (coarser Δt): EQ MM4.2 made visible. Move the tones close together and only a long window resolves them as two lines.

From continuous spectra to discrete tokens — neural codecs. Transformers want discrete tokens, not floating-point spectrograms. A neural audio codec closes that gap. Models such as SoundStream and Meta's EnCodec are convolutional autoencoders trained with residual vector quantization (RVQ): the encoder compresses the waveform to a low-rate latent, a stack of codebooks quantizes it into integer codes — each codebook correcting the residual of the previous one — and the decoder reconstructs high-fidelity audio. The output is a grid of discrete tokens, perhaps 75 frames per second across 8 codebooks, that a language model can predict exactly like text.

EQ MM4.4 — RESIDUAL VECTOR QUANTIZATION $$ \hat{z} \;=\; \sum_{q=1}^{Q} e_q\big(c_q\big), \qquad c_q = \arg\min_{j} \big\| r_{q-1} - e_q(j) \big\|^2, \quad r_q = r_{q-1} - e_q(c_q),\ \ r_0 = z $$
Latent \(z\) is encoded by \(Q\) codebooks in sequence. Codebook 1 picks its nearest entry \(e_1(c_1)\); codebook 2 quantizes the leftover residual \(r_1\), and so on. Stacking \(Q\) codebooks of \(K\) entries each gives up to \(K^Q\) effective combinations per frame at only \(Q\log_2 K\) bits: for example, \(Q = 8\) codebooks of \(K = 1024\) entries cost 80 bits per frame, or 6 kbit/s at 75 frames per second. This is the trick that lets RVQ hit telephone-to-hi-fi rates with tiny codebooks. Audio is now a token stream.
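The recursion in EQ MM4.4 can be sketched in a few lines of NumPy. This is an illustration only: the codebooks below are random and untrained (a real codec learns them jointly with the encoder and decoder), and the sizes `D`, `K`, `Q` and the per-stage scaling are hypothetical choices.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """z: (D,) latent; codebooks: list of (K, D) arrays. Returns (codes, final residual)."""
    residual = z.copy()                      # r_0 = z
    codes = []
    for cb in codebooks:
        dists = np.sum((cb - residual) ** 2, axis=1)
        c = int(np.argmin(dists))            # c_q = nearest entry to r_{q-1}
        codes.append(c)
        residual = residual - cb[c]          # r_q = r_{q-1} - e_q(c_q)
    return codes, residual

def rvq_decode(codes, codebooks):
    """z_hat = sum over q of e_q(c_q)."""
    return sum(cb[c] for cb, c in zip(codebooks, codes))

rng = np.random.default_rng(0)
D, K, Q = 8, 16, 4
# Untrained stand-in codebooks; later stages are scaled down to mimic finer corrections.
codebooks = [rng.normal(scale=0.5 ** q, size=(K, D)) for q in range(Q)]
z = rng.normal(size=D)

codes, residual = rvq_encode(z, codebooks)
z_hat = rvq_decode(codes, codebooks)
bits_per_frame = Q * np.log2(K)
print("codes:", codes, "| bits per frame:", bits_per_frame)
```

By construction the final residual equals `z - z_hat`, so the quantization error is exactly what the last codebook left uncorrected.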
4.2

Speech recognition: Whisper

Automatic speech recognition (ASR) turns audio into text. The classical stack chained a hidden Markov model acoustic model, a pronunciation dictionary, and an n-gram language model — three separately trained components glued by hand. The connectionist temporal classification (CTC) loss and attention-based sequence-to-sequence models collapsed this into single neural networks. Whisper (OpenAI, 2022) is the canonical modern endpoint: a plain encoder–decoder transformer trained on 680,000 hours of weakly labelled, multilingual audio scraped from the web.

The architecture is deliberately ordinary. Audio is resampled to 16 kHz, converted to an 80-bin log-mel spectrogram over 30-second chunks, and fed to a transformer encoder. A transformer decoder then autoregressively predicts text tokens, attending to the encoder output — exactly the machine-translation recipe of Vol II, with mel frames in place of source words. Special tokens in the decoder prompt steer the task: transcribe vs. translate, language id, with-or-without timestamps. One model, one objective, many jobs.

EQ MM4.5 — WHISPER'S AUTOREGRESSIVE OBJECTIVE $$ p(\,y \mid \text{audio}\,) \;=\; \prod_{t=1}^{T} p\big(y_t \mid y_{<t},\, \mathrm{Encoder}(\text{log-mel})\big) $$
The decoder factorizes the transcript \(y\) left-to-right, conditioning each token on the audio encoding and the text emitted so far; it is identical to a text language model except that the conditioning is a spectrogram. Because the special-token prompt selects the task, the same weights transcribe English, translate Japanese speech into English text, or emit segment-level timestamps alongside the transcript. The model is a translator whose source language is sound.
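The factorization in EQ MM4.5 means a transcript's log-probability is a sum of per-step log-probabilities. The sketch below shows only that bookkeeping: the logits are toy arrays standing in for decoder outputs under teacher forcing, and no real Whisper weights or API are involved.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(step_logits, tokens):
    """
    step_logits: (T, V) array; row t holds the decoder's logits for y_t given
                 y_<t and the encoded audio.
    tokens:      length-T sequence of the token ids actually emitted.
    Returns log p(y | audio) = sum over t of log p(y_t | y_<t, audio).
    """
    logp = log_softmax(np.asarray(step_logits, dtype=float))
    return float(logp[np.arange(len(tokens)), tokens].sum())

# Toy example: 3 decoding steps over a 4-token vocabulary.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(3, 4))
print(sequence_log_prob(toy_logits, [2, 0, 1]))
```

Because the total is a sum of log-probabilities, one confidently wrong token lowers the score of the whole transcript, but nothing in the sum distinguishes a fluent guess over silence from a correct transcription.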

Whisper's robustness came not from architecture but from scale and weak supervision: enormous, noisy, diverse data taught it to handle accents, background noise, and code-switching far better than systems trained on clean curated corpora. The cost is occasional hallucination — when the audio is silent or unintelligible, an autoregressive decoder will confidently invent fluent text, because nothing in the objective penalizes a plausible-sounding guess. This is the honest caveat: Whisper trades the brittleness of old pipelines for the failure mode of all generative decoders.

True or false: Whisper is an encoder–decoder transformer for speech recognition — the encoder ingests a log-mel spectrogram and the decoder autoregressively predicts text tokens. (Answer true or false.)
Whisper resamples audio to 16 kHz, computes an 80-bin log-mel spectrogram, encodes it with a transformer encoder, and decodes text with a transformer decoder following EQ MM4.5 — the standard sequence-to-sequence transformer, with audio as the source. The statement is true.

The discriminative alternative. Not all ASR is generative. CTC-based and self-supervised encoders such as wav2vec 2.0 learn audio representations from unlabelled speech, then attach a thin output head; they align frames to characters without a decoder and so cannot hallucinate sentences, but they need an external language model for top accuracy. Encoder–decoder models like Whisper fold the language model inside — convenient, at the price of the failure mode above.
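The frame-to-character alignment mentioned above is commonly read out with greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. The sketch below assumes blank is index 0 and uses a made-up four-symbol vocabulary; it is not tied to any specific model.

```python
import numpy as np

def ctc_greedy_decode(frame_logits, blank=0):
    """frame_logits: (frames, vocab). Returns the collapsed label sequence."""
    best = np.argmax(frame_logits, axis=1)   # most likely label for each frame
    out, prev = [], None
    for idx in best:
        # Emit a label only when it differs from the previous frame and is not blank.
        if idx != prev and idx != blank:
            out.append(int(idx))
        prev = idx
    return out

vocab = ["_", "c", "a", "t"]                 # "_" is the CTC blank (hypothetical vocabulary)
path = [1, 1, 0, 2, 2, 2, 0, 0, 3]           # per-frame labels: c c _ a a a _ _ t
frame_logits = np.eye(len(vocab))[path]      # one-hot rows standing in for model outputs
decoded = ctc_greedy_decode(frame_logits)
print("".join(vocab[i] for i in decoded))
```

Every output symbol is tied to at least one frame that voted for it, which is why this style of decoder has no mechanism for continuing a sentence beyond the audio.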

4.3

Text-to-speech: Tacotron to neural TTS

Text-to-speech (TTS) runs recognition in reverse: text in, waveform out. Like ASR it was once a pipeline — text normalization, a pronunciation model, a duration model, and a vocoder that built sound from parameters, every stage hand-tuned and audibly robotic. Neural TTS rebuilt it as two learned stages connected by the spectrogram.

Stage 1 — text to spectrogram (the acoustic model). Tacotron 2 (2017) is the archetype: a sequence-to-sequence model with attention that reads characters and emits a log-mel spectrogram frame by frame, learning prosody, rhythm, and emphasis directly from data. Its weakness was the autoregressive attention itself — it could skip or repeat words when alignment slipped. Non-autoregressive successors such as FastSpeech 2 fixed this by predicting an explicit duration for each input token and expanding the sequence in one shot, trading a little naturalness for speed and rock-solid alignment.
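The duration-based expansion can be sketched as a "length regulator": repeat each input token's hidden state for as many output frames as its predicted duration. The shapes and durations below are invented for illustration, and a real model predicts the durations rather than receiving them.

```python
import numpy as np

def length_regulate(token_states, durations):
    """
    token_states: (n_tokens, d) hidden state per input token.
    durations:    (n_tokens,) number of mel frames assigned to each token.
    Returns (sum(durations), d): each state repeated for its duration.
    """
    durations = np.asarray(durations, dtype=int)
    return np.repeat(token_states, durations, axis=0)

token_states = np.array([[1.0, 0.0],     # hypothetical state for token 0
                         [0.0, 1.0],     # token 1
                         [0.5, 0.5]])    # token 2
durations = [2, 0, 3]                    # frames per token; 0 drops the token
frames = length_regulate(token_states, durations)
print(frames.shape)
```

Since every output frame is assigned to exactly one input token up front, the expansion cannot skip or repeat a word the way a drifting attention alignment can.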

EQ MM4.6 — TWO-STAGE NEURAL TTS $$ \text{text} \;\xrightarrow{\ \text{acoustic model}\ }\; \text{log-mel } S \;\xrightarrow{\ \text{vocoder}\ }\; \text{waveform } x $$
The acoustic model (Tacotron 2, FastSpeech 2) maps text to a mel spectrogram — the what to say and how. The vocoder maps that spectrogram to samples — the how it sounds. The spectrogram is a deliberate bottleneck: a perceptually meaningful, low-rate intermediate that lets the two halves be trained, swapped, and debugged independently. Modern systems collapse both stages end-to-end, but the spectrogram remains the conceptual seam.

Stage 2 — spectrogram to waveform (the vocoder). The spectrogram threw away phase, so the vocoder must synthesize plausible samples. WaveNet (2016) was the breakthrough: an autoregressive convolutional network using dilated causal convolutions to model raw audio one sample at a time, conditioned on the spectrogram. Quality was a leap; speed was a disaster — generating 24,000 samples per second one-at-a-time. The field's history since is the race to keep WaveNet's fidelity without its latency: parallel flows (Parallel WaveNet), and the now-dominant GAN vocoders (HiFi-GAN) that generate an entire waveform in a single non-autoregressive forward pass.

EQ MM4.7 — WAVENET: AUTOREGRESSIVE SAMPLES $$ p(x) \;=\; \prod_{n=1}^{T} p\big(x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1\big), \qquad \text{receptive field } = 2^{L}\ \text{with } L\ \text{dilated layers} $$
Each sample is conditioned on all previous samples; dilated convolutions (dilation \(1,2,4,\ldots,2^{L-1}\)) make the receptive field grow exponentially with depth, so a few dozen layers cover thousands of samples cheaply. This is the same autoregressive factorization as a text LLM (Vol II) — applied to 16-bit audio. Its serial nature is exactly why GAN vocoders, which emit all samples at once, replaced it in production.
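The exponential growth of the receptive field follows from simple bookkeeping: each causal layer with kernel size \(k\) and dilation \(d\) adds \((k-1)\,d\) samples of context. The sketch below checks the \(2^L\) figure for kernel size 2; the three-block, 30-layer stack is an assumed configuration used only to illustrate "a few dozen layers".

```python
def receptive_field(dilations, kernel_size=2):
    """Samples of context seen by a stack of dilated causal convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

L = 10
one_block = [2 ** i for i in range(L)]        # dilations 1, 2, 4, ..., 2^(L-1)
print(receptive_field(one_block))             # equals 2**L for kernel size 2

three_blocks = one_block * 3                  # assumed: the same block repeated 3 times
print(len(three_blocks), "layers ->", receptive_field(three_blocks), "samples")
```

Repeating the block grows the receptive field only linearly, so most of the reach comes from the doubling dilations within each block.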
PYTHON · RUNNABLE IN-BROWSER
# EQ MM4.3: warp linear FFT frequencies onto the perceptual mel scale
import numpy as np
fs, N = 16000, 400
freqs = np.fft.rfftfreq(N, 1/fs)            # linear Hz of each FFT bin

def hz_to_mel(f): return 2595*np.log10(1 + f/700)
def mel_to_hz(m): return 700*(10**(m/2595) - 1)

n_mels = 8                                   # tiny filterbank for display
mel_max = hz_to_mel(fs/2)                     # Nyquist in mels
edges_mel = np.linspace(0, mel_max, n_mels+1) # equal spacing in MEL...
edges_hz  = mel_to_hz(edges_mel)              # ...is geometric spacing in HZ

print(f"Nyquist  : {fs/2:.0f} Hz  =  {mel_max:.0f} mel")
print(f"mel(1000): {hz_to_mel(1000):.1f}   (calibrated so 1 kHz ~ 1000 mel)")
print("mel-band edges in Hz (note they widen as frequency rises):")
print(np.round(edges_hz, 0))
widths = np.diff(edges_hz)
print("band widths (Hz)        :", np.round(widths, 0))
plot_xy(freqs, hz_to_mel(freqs))             # the warping curve itself (plot_xy is the page's own plotting helper)
Using EQ MM4.3, what is the mel value of \(f = 4000\) Hz? Use \(m(f) = 2595\,\log_{10}(1 + f/700)\). (Round to the nearest mel.)
\(1 + 4000/700 = 6.7143\); \(\log_{10}(6.7143) = 0.82700\); \(2595 \times 0.82700 = 2146.1\). Rounded, the mel value is 2146 mel. Note that doubling the frequency from 2 kHz to 4 kHz raises the mel value only from about 1521 to 2146, far less than a doubling: the scale compresses the highs, just as the ear does.
INSTRUMENT MM4.2 · SAMPLE-RATE / MEL-BIN EXPLORER · NYQUIST & FILTERBANK · EQ MM4.3
[Interactive figure. Panel: mel filterbank edges (uniform in mel → widening in Hz). Readouts: Nyquist (max frequency, 8.0 kHz), lowest band width, highest band width.]
Nyquist says the highest representable frequency is \(f_s/2\): halve the sample rate and the spectrum's ceiling drops with it. The mel bins are spaced evenly in mels but the bars show them widening toward the right; that geometric stretch in Hz is exactly the ear's declining pitch resolution. Add bins and each band narrows.
INSTRUMENT MM4.3 · TTS PIPELINE ANATOMY · TEXT → MEL → WAVEFORM · EQ MM4.6
[Interactive figure. Stages: text → acoustic model → log-mel → vocoder → waveform, with an architecture note for the selected stage.]
The mel spectrogram sits at the center as a deliberate bottleneck (EQ MM4.6). The instrument's END-TO-END view shows how modern systems fuse the acoustic model and vocoder while keeping the mel as an internal seam.
4.4

Audio language models

Once a neural codec (§4.1) turns audio into discrete tokens, the boundary between speech and language dissolves. An audio language model is a transformer trained to predict the next audio token exactly as a text LLM predicts the next word. Google's AudioLM (2022) demonstrated the idea: prompt it with a few seconds of speech or piano and it continues them coherently — preserving speaker identity, prosody, even recording-room acoustics — with no transcript and no explicit acoustic model, purely from the token statistics.

EQ MM4.8 — NEXT-AUDIO-TOKEN PREDICTION $$ p(\mathbf{a}) \;=\; \prod_{t=1}^{T} p\big(a_t \mid a_{<t}\big), \qquad a_t \in \{1, \ldots, K\}\ \text{(codec tokens, EQ MM4.4)} $$
Identical in form to the text objective of Vol II and to WaveNet (EQ MM4.7); only the vocabulary changed, from words to RVQ codes. The subtlety is the token hierarchy: AudioLM combines coarse "semantic" tokens (content, long-range structure), which it derives from a self-supervised speech model rather than from the codec, with fine "acoustic" codec tokens (timbre, detail), and models them in stages so the transformer captures meaning before texture. Speech generation became language modelling over a sound vocabulary.

This unification is why a single model can now transcribe, translate, answer, and speak. Text-conditioned variants — Meta's MusicGen for music, VALL-E for zero-shot voice cloning from a three-second sample — prepend text or a reference clip and let the same next-token machinery do the rest. The frontier "omni" assistants (GPT-4o, Gemini) push it further: one transformer ingests interleaved text, audio, and image tokens and emits audio tokens directly, with no separate ASR or TTS stage in the loop. The pipeline of §4.2 and §4.3 has folded into a single sequence model.

Honest limits. Audio tokens are dense: tens of frames per second in each of several codebook streams, so hundreds of tokens per second in total (75 × 8 = 600 at the rates quoted in §4.1). Context windows therefore fill fast, and long-form coherence (a five-minute song that develops, a ten-turn spoken dialogue that remembers) remains hard. Zero-shot voice cloning from seconds of audio is also a live misuse and consent concern; watermarking and detection are active, unsettled research, not solved problems.
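A back-of-the-envelope budget makes the density concrete. The sketch assumes all codebook streams are flattened into a single token sequence, which is only one of several possible layouts, and the 8,192-token context is a hypothetical figure chosen for illustration.

```python
def audio_tokens_per_second(frame_rate_hz, n_codebooks):
    """Tokens per second if every codebook stream is flattened into one sequence."""
    return frame_rate_hz * n_codebooks

def seconds_that_fit(context_tokens, frame_rate_hz, n_codebooks):
    """How much audio fits in a context window under that flattening."""
    return context_tokens / audio_tokens_per_second(frame_rate_hz, n_codebooks)

rate = audio_tokens_per_second(75, 8)          # the example rates quoted in section 4.1
fit = seconds_that_fit(8_192, 75, 8)           # hypothetical 8,192-token context
print(f"{rate} tokens/s, {fit:.1f} s of audio per context window")
```

Under these assumptions a window that would hold pages of text holds only seconds of audio, which is why coarse-to-fine token hierarchies and lower frame rates matter for long-form generation.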

4.5

Real-time & streaming

A spoken-conversation assistant lives or dies on latency. Humans tolerate roughly 200–300 ms of silence before a reply feels unnatural; the cascade of ASR → LLM → TTS, each waiting for the previous to finish, easily blows past a full second. Real-time audio is therefore an engineering discipline about cutting the time-to-first-sound, not just model quality.

Streaming, not chunking. Whisper's 30-second window is fine for files but fatal for conversation. Streaming ASR processes audio in small overlapping chunks and emits partial transcripts as words arrive, using causal or limited-lookahead attention so it never waits for the end of the utterance. On the output side, streaming TTS begins vocoding the first words while the acoustic model is still producing later frames — a software pipeline rather than a sequence of blocking stages.
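The chunking side of streaming ASR can be sketched as a generator that hands out small overlapping windows as samples arrive. The chunk and overlap sizes are illustrative assumptions, and a real system would feed each chunk to a causal or limited-lookahead model rather than just printing its bounds.

```python
import numpy as np

def stream_chunks(samples, chunk_size, overlap):
    """Yield (start_index, chunk) pairs; consecutive chunks share `overlap` samples."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    start = 0
    while start < len(samples):
        yield start, samples[start:start + chunk_size]
        if start + chunk_size >= len(samples):
            break                      # this chunk already reached the end of the signal
        start += step

fs = 16_000
audio = np.zeros(fs)                   # one second of placeholder audio
chunk, overlap = int(0.320 * fs), int(0.080 * fs)   # assumed 320 ms chunks, 80 ms overlap
for start, piece in stream_chunks(audio, chunk, overlap):
    print(f"chunk at {start / fs * 1e3:6.1f} ms, {len(piece)} samples")
```

The first partial transcript can be produced after one chunk's worth of audio instead of after the whole utterance; the overlap gives each chunk some context from its predecessor.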

EQ MM4.9 — END-TO-END SPOKEN LATENCY $$ L_{\text{total}} \;=\; L_{\text{ASR}} + L_{\text{LLM}} + L_{\text{TTS}} \;\xrightarrow{\ \text{streaming}\ }\; L_{\text{ASR}}^{\text{first}} + L_{\text{LLM}}^{\text{first-token}} + L_{\text{TTS}}^{\text{first}} $$
A blocking cascade adds the latency of every stage. Streaming overlaps them, so what the user perceives is dominated by the time to the first transcript token, first LLM token, and first audio sample — not the totals. Native audio-in/audio-out models (§4.4) shortcut the chain entirely: no ASR or TTS hop, so \(L_{\text{ASR}}\) and \(L_{\text{TTS}}\) vanish from the budget. The goal is overlap, then elimination.
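One simple way to reason about the budget is a toy timing model. It assumes each stage can begin as soon as the previous stage's first output is available, so the first-output latencies chain in series; all millisecond figures are hypothetical placeholders, not measurements of any system.

```python
def blocking_latency_ms(stage_totals_ms):
    """Each stage waits for the previous one to finish completely."""
    return sum(stage_totals_ms.values())

def streaming_first_sound_ms(first_output_ms, stages):
    """Time to first audio when each stage starts on its predecessor's first output."""
    return sum(first_output_ms[s] for s in stages)

# Hypothetical per-stage timings in milliseconds.
totals = {"asr": 300, "llm": 800, "tts": 400}     # time for each stage to finish
firsts = {"asr": 80, "llm": 150, "tts": 60}       # time to each stage's first output

cascade = blocking_latency_ms(totals)
streamed = streaming_first_sound_ms(firsts, ("asr", "llm", "tts"))
native = streaming_first_sound_ms(firsts, ("llm",))   # audio-in/audio-out: no ASR or TTS hop
print(cascade, streamed, native)
```

With these placeholder numbers the blocking cascade exceeds a second, streaming brings the first sound to a few hundred milliseconds, and removing the ASR and TTS hops leaves only the model's own first-token time.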

Full-duplex and the future. The hardest problems are conversational, not acoustic: detecting end-of-turn quickly without cutting the speaker off, handling barge-in (the user interrupting mid-reply), and modelling backchannels ("mm-hmm"). Systems such as Kyutai's Moshi attack this with a full-duplex architecture that listens and speaks simultaneously over parallel token streams, dropping perceived latency toward 200 ms. End-to-end audio language models point the same way: the lower the latency budget, the more pressure to collapse the pipeline into one model — which is exactly the arc this chapter traced.

NEXT

Sound was the first non-text modality to fully become a token stream; the next frontier is a model of the world itself. Chapter 05: world models — systems that learn the dynamics of an environment from observation, predict the future of a scene, and let agents plan and act inside a learned simulation.

4.R

References

  1. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI — Whisper; an encoder–decoder transformer trained on 680k hours, the model behind §4.2.
  2. van den Oord, A. et al. (2016). WaveNet: A Generative Model for Raw Audio. DeepMind — dilated causal convolutions for autoregressive audio (EQ MM4.7); the vocoder breakthrough.
  3. Défossez, A., Copet, J., Synnaeve, G. & Adi, Y. (2022). High Fidelity Neural Audio Compression. Meta — EnCodec; residual-vector-quantized neural codec (EQ MM4.4) that turns audio into discrete tokens.
  4. Shen, J. et al. (2018). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. Google — Tacotron 2; the two-stage acoustic-model + vocoder template of §4.3 (EQ MM4.6).
  5. Ren, Y. et al. (2021). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. ICLR 2021 — non-autoregressive TTS with explicit duration prediction; the alignment fix for Tacotron.
  6. Baevski, A., Zhou, H., Mohamed, A. & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. Meta — self-supervised audio encoder; the discriminative alternative to generative ASR.
  7. Borsos, Z. et al. (2022). AudioLM: A Language Modeling Approach to Audio Generation. Google — next-audio-token prediction over codec tokens (EQ MM4.8); the audio-LM of §4.4.
  8. Défossez, A. et al. (2024). Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. Kyutai — full-duplex spoken dialogue; the streaming / low-latency direction of §4.5.