Autoencoders & VAEs — AI Encyclopedia

5.1

Autoencoders — learning to compress

An autoencoder is a network trained to copy its input to its output — a task that sounds trivial until you choke the path between them. An encoder $f_\theta$ maps the input $x \in \mathbb{R}^d$ down to a code $z \in \mathbb{R}^k$ with $k \ll d$; a decoder $g_\phi$ maps the code back up to a reconstruction $\hat{x}$. Both are trained jointly to make $\hat{x}$ look like $x$:

EQ N5.1 — RECONSTRUCTION OBJECTIVE $$ z = f_\theta(x), \qquad \hat{x} = g_\phi(z), \qquad \mathcal{L}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N} \big\lVert x^{(i)} - g_\phi\!\big(f_\theta(x^{(i)})\big) \big\rVert_2^2 $$

Mean squared error is the default for continuous inputs (pixels, embeddings); binary cross-entropy per pixel is standard for $[0,1]$ images. The whole trick is the bottleneck: the code $z$ is narrower than $x$, so a perfect copy is impossible and the network must spend its few code dimensions on whatever explains the most variance. Nothing in the loss says "find structure" — structure is the only way to win at copying through a constriction.
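To make EQ N5.1 concrete, the sketch below pushes a small batch through an untrained encoder/decoder pair and evaluates both losses mentioned above. The layer sizes, the tanh and sigmoid nonlinearities, and the random weights are assumptions chosen for brevity; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 16, 3, 5                      # input dim, code dim (k << d), batch size

X = rng.uniform(size=(N, d))            # toy inputs in [0, 1], e.g. pixel intensities

# Hypothetical one-layer encoder f and decoder g with random (untrained) weights
W_enc = 0.5 * rng.normal(size=(d, k))
W_dec = 0.5 * rng.normal(size=(k, d))

def encode(x):
    return np.tanh(x @ W_enc)                     # z = f(x), shape (N, k)

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))     # x_hat = g(z), squashed into (0, 1)

Z = encode(X)
X_hat = decode(Z)

# EQ N5.1: squared L2 error per example, averaged over the batch
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Per-pixel binary cross-entropy, summed over pixels, averaged over the batch
tiny = 1e-12                                      # guards against log(0)
bce = -np.mean(np.sum(X * np.log(X_hat + tiny)
                      + (1 - X) * np.log(1 - X_hat + tiny), axis=1))

print("code shape:", Z.shape, " reconstruction shape:", X_hat.shape)
print(f"MSE loss: {mse:.4f}   BCE loss: {bce:.4f}")
```

Training would adjust `W_enc` and `W_dec` to push either number down; the bottleneck shows up only in the shape of `Z`.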

The label is the input itself, so autoencoders are self-supervised: they need no annotations, just data. What they learn is a coordinate system — a chart of the low-dimensional manifold that the high-dimensional data lives near. A 28×28 image has 784 pixels, but the set of handwritten digits occupies a far thinner sheet inside that 784-dimensional cube; the bottleneck is the network's estimate of how thin.

The cleanest case is fully linear. Let the encoder be a matrix $W \in \mathbb{R}^{k\times d}$ and the decoder $W^\top$, with mean-centered data and squared-error loss. The optimum is not unique, but the subspace it spans is: it is exactly the span of the top $k$ principal components of the data (Baldi & Hornik, 1989). A linear autoencoder rediscovers PCA from scratch: gradient descent on reconstruction error converges to the subspace spanned by the top $k$ eigenvectors of the covariance matrix, though not necessarily to the eigenvectors themselves.

WHY IT MATTERS

The linear case is the Rosetta stone. It tells you an autoencoder's job is dimensionality reduction, and that the bottleneck width $k$ is choosing how many directions of variance to keep. Nonlinear encoders simply bend PCA's flat hyperplane into a curved manifold — same goal, more expressive chart. See Vol I · EQ 4.x for PCA via the SVD.

A single-hidden-layer linear autoencoder with code width $k$, trained to minimize squared reconstruction error on mean-centered data, recovers the same subspace as the top $k$ principal components. True or false? (Enter true or false.)

With linear $f,g$ and MSE loss, the global optimum projects onto the span of the top-$k$ eigenvectors of the data covariance — exactly PCA's subspace. Individual weights differ (any invertible mixing of the $k$ directions reconstructs equally well), but the subspace is identical. Answer: true.

PYTHON · RUNNABLE IN-BROWSER

# Linear autoencoder == PCA: train by gradient descent, compare to eigenvectors
import numpy as np
rng = np.random.default_rng(0)
d, k, N = 8, 2, 600

# data on a 2D plane (k=2) embedded in 8D, plus small noise -> intrinsic rank 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]       # true 2D subspace
X = rng.normal(size=(N, k)) @ basis.T + 0.02 * rng.normal(size=(N, d))
X -= X.mean(0)                                          # center (PCA assumes this)

# PCA: top-k right-singular vectors = the "answer" subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k]                                              # k x d principal axes

# Linear AE: encoder We (d->k), decoder Wd (k->d). Minimize ||X - (X We) Wd||^2
We = 0.1 * rng.normal(size=(d, k))
Wd = 0.1 * rng.normal(size=(k, d))
for step in range(6000):
    Z = X @ We                                         # codes:  N x k
    R = Z @ Wd - X                                     # residual: N x d
    We -= 0.08 * (X.T @ (R @ Wd.T) / N)                # dL/dWe
    Wd -= 0.08 * (Z.T @ R / N)                         # dL/dWd

err = np.linalg.norm(X - (X @ We) @ Wd) / np.linalg.norm(X)
Wq = np.linalg.qr(We)[0]                                # orthonormal basis of code space
overlap = np.linalg.svd(P @ Wq, compute_uv=False)      # cos(principal angles); 1 == aligned
print(f"AE relative reconstruction error : {err:.4f}")
print(f"cos(principal angles) AE vs PCA  : {np.round(overlap, 4)}")
print("error tiny, cosines ~1  ->  the AE found the PCA plane, just rotated in it")


INSTRUMENT N5.1: BOTTLENECK EXPLORER · LATENT WIDTH k vs RECONSTRUCTION · PCA SURROGATE

[Interactive figure. Control: latent width $k$ (shown at 8). Readouts: input dim $d$, variance kept, compression $d/k$.]

A synthetic 64-dim dataset whose variance decays across components (the usual heavy-headed spectrum). The bars show how much of each component a width-$k$ code can keep; the mint curve is cumulative variance retained — the best possible reconstruction at that bottleneck. Slide $k$ low to feel the squeeze: the first handful of axes carry most of the signal, and every dimension past the manifold's intrinsic rank buys almost nothing.

5.2

Denoising & overcomplete variants

A bottleneck is one way to stop an autoencoder from learning the useless identity map. It is not the only way, and not always the best. If you let $k \ge d$ — an overcomplete code — a vanilla autoencoder can cheat by copying the input straight through, learning nothing. Three families of regularizer break that shortcut while keeping a wide, expressive code.

Denoising autoencoders (DAE) corrupt the input, then demand a clean reconstruction. The network sees a corrupted copy $\tilde{x}$ (for example $\tilde{x} = x + \varepsilon$ with Gaussian $\varepsilon$, or $x$ with random pixels masked out) and must produce the original $x$:

EQ N5.2 — DENOISING OBJECTIVE $$ \tilde{x} \sim q(\tilde{x}\mid x), \qquad \mathcal{L}_{\text{DAE}} = \mathbb{E}_{x}\,\mathbb{E}_{\tilde{x}\sim q(\cdot\mid x)} \big\lVert x - g_\phi\!\big(f_\theta(\tilde{x})\big) \big\rVert_2^2 $$

Copying is now impossible: the noisy input is not the target. To undo corruption the network must learn the shape of the data, pushing corrupted points back onto the clean manifold. Vincent et al. (2008) introduced this manifold-projection view; follow-up work (Vincent, 2011, "A Connection Between Score Matching and Denoising Autoencoders") showed that, under Gaussian corruption, the trained denoiser implicitly estimates the score of the noise-smoothed density, the gradient of the log-density $\nabla_x \log p(x)$, which points toward where real data lives. That same insight is the seed of modern diffusion models (Chapter 07): a diffusion model is, in essence, a denoising autoencoder trained at every noise level at once.
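A minimal sketch of EQ N5.2 in the linear setting: instead of training a network by gradient descent, it fits the best linear map from corrupted inputs to clean targets in closed form with least squares. The dimensions, the noise level, and the subspace-shaped "manifold" are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, N, noise_std = 20, 3, 2000, 0.3

# Clean data on an r-dimensional subspace of R^d (a stand-in for "the manifold")
basis = np.linalg.qr(rng.normal(size=(d, r)))[0]        # d x r, orthonormal columns
X = rng.normal(size=(N, r)) @ basis.T                   # clean targets

# Corruption q(x_tilde | x): additive Gaussian noise
X_tilde = X + noise_std * rng.normal(size=X.shape)

# Best linear denoiser in the least-squares sense: minimize ||X - X_tilde @ A||^2
A, *_ = np.linalg.lstsq(X_tilde, X, rcond=None)         # d x d map, noisy -> clean
X_denoised = X_tilde @ A

corrupted_mse = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
denoised_mse = np.mean(np.sum((X - X_denoised) ** 2, axis=1))
svals = np.linalg.svd(A, compute_uv=False)

print(f"corrupted MSE: {corrupted_mse:.3f}   denoised MSE: {denoised_mse:.3f}")
print("leading singular values of the denoiser:", np.round(svals[:r + 2], 4))
```

Because every clean target lies in the $r$-dimensional subspace, the fitted map has rank at most $r$: it keeps the on-manifold directions and discards the rest, which is the projection behaviour described above.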

Two cousins regularize differently. Sparse autoencoders allow a wide code but penalize how many units fire at once, using an $L_1$ penalty on the activations or a KL term that pins each unit's average activation to a small target. The code stays overcomplete, but any single input lights up only a few dimensions, so each one tends to specialize into an interpretable feature. (This is the mechanism behind today's sparse-autoencoder interpretability work, which aims to decompose an LLM's dense activations into thousands of sparse, more nearly monosemantic features.) Contractive autoencoders add $\lVert J_f(x)\rVert_F^2$, the squared Frobenius norm of the encoder's Jacobian, to the loss, forcing the code to be insensitive to small input perturbations: flat along directions that don't matter, responsive only along the manifold.
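The three penalty terms just described can be written out in a few lines. This is a sketch under stated assumptions: a single sigmoid encoder layer with random weights `W`, a sparsity target `rho = 0.05`, and penalties computed on their own (in training each would be added to the reconstruction loss with a weight). The closed-form Jacobian norm used here holds for a sigmoid layer specifically, and is checked against finite differences on one input.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 6, 10, 200                    # overcomplete code: k > d

X = rng.normal(size=(N, d))
W = 0.5 * rng.normal(size=(k, d))       # hypothetical encoder weights
b = np.zeros(k)

def encoder(x):
    return 1.0 / (1.0 + np.exp(-(x @ W.T + b)))   # sigmoid code in (0, 1)

H = encoder(X)                          # N x k activations

# Sparse AE, option 1: L1 penalty on activations
l1_penalty = np.mean(np.sum(np.abs(H), axis=1))

# Sparse AE, option 2: KL between a target rate rho and each unit's mean activation
rho = 0.05
rho_hat = H.mean(axis=0)
kl_penalty = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Contractive AE: squared Frobenius norm of the encoder Jacobian.
# For a sigmoid layer, J = diag(h * (1 - h)) @ W, so
# ||J||_F^2 = sum_j (h_j * (1 - h_j))^2 * sum_i W_ji^2
row_sq = np.sum(W ** 2, axis=1)                        # length k
contractive = np.mean(np.sum((H * (1 - H)) ** 2 * row_sq, axis=1))

# Sanity check on one input: central finite-difference Jacobian, k x d
x0, step = X[0], 1e-6
J_fd = np.stack([(encoder(x0 + step * e) - encoder(x0 - step * e)) / (2 * step)
                 for e in np.eye(d)], axis=1)
analytic_x0 = np.sum((H[0] * (1 - H[0])) ** 2 * row_sq)

print(f"L1: {l1_penalty:.3f}  KL sparsity: {kl_penalty:.3f}  contractive: {contractive:.3f}")
print(f"Jacobian norm at x0, analytic vs finite difference: "
      f"{analytic_x0:.6f} vs {np.sum(J_fd ** 2):.6f}")
```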

Variant	What stops the identity map	What the code becomes
Undercomplete	narrow bottleneck $k < d$	Top directions of variance (PCA-like).
Denoising	corrupt input, clean target	A projection back onto the data manifold; learns the score.
Sparse	$L_1$ / KL activation penalty	Overcomplete but few-active; specialized, often interpretable features.
Contractive	Jacobian-norm penalty	Locally invariant code, flat off the manifold.

All four share one moral: an autoencoder is only as good as the pressure you put on its code. Remove every constraint and it learns the identity; impose the right one and it learns the data's geometry.

INSTRUMENT N5.2: DENOISING AUTOENCODER · CORRUPT → ENCODE → RECONSTRUCT · 1D SIGNAL

[Interactive figure. Controls: noise σ (shown at 0.30), code width $k$ (shown at 4). Readouts: corrupted MSE, denoised MSE, noise removed.]

The clean signal is a smooth manifold spanned by a few low-frequency basis functions (the "data"). We add noise σ, then project the corrupted signal onto the top-$k$ basis — the linear denoiser an autoencoder converges to. Watch the reconstruction snap back toward the clean curve: a $k$-dimensional code can't represent the high-frequency noise, so the noise is discarded. Raise σ and the denoised MSE stays far below the corrupted MSE — that gap is the autoencoder doing its job.

5.3

The latent space

The code $z$ is not just a compressed file — it is a place. The set of all codes the encoder can produce is the latent space, and its geometry is where autoencoders earn their keep. Distances in latent space correspond to perceptual or semantic distances in data space far better than raw pixel distance does: two photos of the same face under different lighting are far apart in pixels but close in a good latent.

This is what makes the latent space useful for more than compression. You can interpolate: decode $g_\phi\big((1-t)\,z_a + t\,z_b\big)$ and sweep $t$ from 0 to 1 to morph from one example to another. You can cluster in latent space, where classes often separate more cleanly than they do in input space. You can do nearest-neighbour retrieval on codes instead of inputs. And you can detect anomalies: a point the autoencoder reconstructs poorly is likely off the manifold it learned, so high reconstruction error serves as an unsupervised novelty score.
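Two of these uses, interpolation and anomaly scoring, can be sketched with the linear autoencoder from §5.1 standing in for a trained network. The PCA-based `encode`/`decode` pair, the 99% quantile threshold, and the synthetic outlier are all assumptions made for illustration; a nonlinear autoencoder would slot into the same three functions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 10, 2, 500

# Training data near a k-dimensional plane in R^d
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(N, k)) @ basis.T + 0.05 * rng.normal(size=(N, d))
mean = X.mean(axis=0)

# Stand-in autoencoder: PCA encoder/decoder (the linear optimum)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:k]                                         # k x d principal axes

def encode(x):
    return (x - mean) @ P.T

def decode(z):
    return z @ P + mean

def recon_error(x):
    return np.sum((x - decode(encode(x))) ** 2, axis=-1)

# Interpolation: decode points along the segment between two codes
z_a, z_b = encode(X[0]), encode(X[1])
ts = np.linspace(0.0, 1.0, 5)
path = np.stack([decode((1 - t) * z_a + t * z_b) for t in ts])   # 5 x d

# Anomaly detection: threshold reconstruction error at a high training quantile
threshold = np.quantile(recon_error(X), 0.99)
outlier = mean + 3.0 * rng.normal(size=d)          # a point far off the plane

print("interpolation path shape:", path.shape)
print(f"outlier error: {recon_error(outlier):.3f}   threshold: {threshold:.4f}")
print("flagged as anomaly:", bool(recon_error(outlier) > threshold))
```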

THE CATCH

A plain autoencoder's latent space has holes. Training only constrains the codes the encoder actually emits; the space between and around them is unconstrained. Decode a random point — or a midpoint between two clusters — and you often get garbage, because the decoder was never asked to make that region meaningful. The latent is a scatter of trained islands in an empty sea. You cannot reliably sample new data from it. Fixing this hole is the entire motivation for the variational autoencoder.

5.4

Variational autoencoders

The variational autoencoder (VAE) of Kingma & Welling (2013) closes the holes by making two changes. First, the encoder no longer outputs a single point. It outputs a distribution: a mean $\mu(x)$ and a (log-)variance $\log\sigma^2(x)$ defining $q_\phi(z\mid x) = \mathcal{N}(\mu, \sigma^2 I)$. Second, the loss regularizes that distribution toward a standard normal prior $p(z) = \mathcal{N}(0, I)$. (Notation: this section follows the VAE literature, where $\phi$ parameterizes the encoder and $\theta$ the decoder $p_\theta(x\mid z)$, the reverse of the subscripts in EQ N5.1.) Encode a point and you get a fuzzy ball, not a dot; train on the whole dataset and the balls overlap to tile a smooth, largely gap-free Gaussian cloud you can sample from at will.

The objective is a lower bound on the data log-likelihood, the evidence lower bound (ELBO):

EQ N5.3 — THE ELBO (VAE LOSS) $$ \log p_\theta(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z\mid x)\,\Vert\,p(z)\big)}_{\text{regularizer}} $$

The VAE maximizes the ELBO, or equivalently minimizes $-\text{ELBO}$. The two terms pull against each other: the first says "encode enough of $x$ that the decoder can rebuild it"; the second says "keep $q_\phi(z\mid x)$ close to the prior so the latent stays a tidy, sampleable Gaussian." The VAE loss is reconstruction error plus the KL divergence to the prior; that single sentence is the whole model. Put a weight $\beta > 1$ on the KL term and you get the $\beta$-VAE (Higgins et al., 2017), which trades reconstruction sharpness for more disentangled, axis-aligned latents.

For diagonal Gaussians the KL term has a clean closed form — no sampling needed to compute it:

EQ N5.4 — KL OF DIAGONAL GAUSSIAN TO N(0, I) $$ D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu,\sigma^2 I)\,\Vert\,\mathcal{N}(0,I)\big) \;=\; \frac{1}{2}\sum_{j=1}^{k}\Big(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\Big) $$

Each latent dimension contributes independently. The term is zero exactly when $\mu_j = 0$ and $\sigma_j = 1$ — i.e. when that dimension is the prior. It penalizes a code for drifting from the origin ($\mu_j^2$) or collapsing to a spike ($-\log\sigma_j^2$ blows up as $\sigma_j \to 0$). This pressure is what fills the gaps: every encoded ball is pushed to overlap the others around the origin.
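One way to build trust in EQ N5.4 is to compare it against a brute-force estimate. The sketch below evaluates the closed form for an arbitrary diagonal Gaussian, then estimates the same KL by Monte Carlo as $\mathbb{E}_q[\log q(z) - \log p(z)]$. The particular values of `mu` and `sigma` and the sample count are assumptions picked for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.0])
sigma = np.array([1.0, 0.5, 2.0])

# Closed form, EQ N5.4
kl_closed = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Monte Carlo: KL(q || p) = E_q[ log q(z) - log p(z) ], with z drawn from q
z = rng.normal(loc=mu, scale=sigma, size=(200_000, 3))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form : {kl_closed:.4f} nats")
print(f"Monte Carlo : {kl_mc:.4f} nats")
```

The closed form costs a handful of arithmetic operations and has no sampling noise, which is why VAE implementations use it whenever the posterior and prior are both Gaussian.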

One obstacle remains. The ELBO contains an expectation over $z \sim q_\phi(z\mid x)$, and a random draw of $z$ gives no gradient with respect to the parameters that shaped its distribution, so you can't backpropagate through the sampling step directly. The fix is the reparameterization trick: move the randomness outside the network. Instead of sampling $z$ directly, sample parameter-free noise $\varepsilon \sim \mathcal{N}(0, I)$ and build $z$ as a deterministic, differentiable function of $\mu$, $\sigma$, and $\varepsilon$:

EQ N5.5 — REPARAMETERIZATION TRICK $$ z = \mu(x) + \sigma(x) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I) $$

Now $z$ is a smooth function of the encoder outputs $(\mu,\sigma)$ with the stochasticity quarantined in $\varepsilon$, so gradients flow through $\mu$ and $\sigma$ cleanly. Here $\odot$ is the elementwise product. This one line is what makes the VAE trainable end-to-end by ordinary backprop. It is arguably the paper's most reused idea, now standard far beyond VAEs: the same construction underlies the reparameterized policy gradients in Vol III, and diffusion models write a noised sample in the same form, as a scaled clean input plus scaled Gaussian noise.
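To see the gradient actually flow, take a one-dimensional case with a toy objective $f(z) = z^2$, an assumption chosen because $\mathbb{E}[f(z)] = \mu^2 + \sigma^2$ has known derivatives $2\mu$ and $2\sigma$. With $z = \mu + \sigma\varepsilon$, the chain rule gives $\partial f/\partial\mu = f'(z)$ and $\partial f/\partial\sigma = f'(z)\,\varepsilon$, and averaging over draws of $\varepsilon$ estimates the true gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8

eps = rng.normal(size=500_000)        # noise drawn independently of (mu, sigma)
z = mu + sigma * eps                  # EQ N5.5: deterministic in (mu, sigma) given eps

# Toy objective f(z) = z^2, so E[f(z)] = mu^2 + sigma^2 analytically.
# Chain rule through z: df/dmu = f'(z) * 1, df/dsigma = f'(z) * eps
df_dz = 2.0 * z
grad_mu = np.mean(df_dz)
grad_sigma = np.mean(df_dz * eps)

print(f"d/dmu    estimate {grad_mu:.3f}   exact {2 * mu:.3f}")
print(f"d/dsigma estimate {grad_sigma:.3f}   exact {2 * sigma:.3f}")
```

An autodiff framework performs the same chain rule automatically once $z$ is written as `mu + sigma * eps`; the hand-written version here only makes the path of the gradient visible.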

The VAE training objective (the negative ELBO it minimizes) is the reconstruction loss plus the KL divergence from the approximate posterior $q_\phi(z\mid x)$ to the prior $p(z)$. True or false? (Enter true or false.)

EQ N5.3 maximizes $\mathbb{E}_q[\log p(x\mid z)] - D_{\mathrm{KL}}(q\Vert p)$. Flipping sign to a loss: minimize $(-\text{reconstruction}) + D_{\mathrm{KL}}(q\Vert p)$ — i.e. reconstruction loss plus KL to the prior. Answer: true.
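Assembled into code, the loss in this answer might look like the sketch below. It assumes a unit-variance Gaussian decoder, so the reconstruction term reduces to half the squared error up to a dropped constant, and it uses one sampled $z$ per input. The helper name `neg_elbo`, the random stand-in encoder outputs, and the linear decoder `W_dec` are hypothetical.

```python
import numpy as np

def neg_elbo(x, x_hat, mu, log_var, beta=1.0):
    """Batch-averaged VAE loss: reconstruction + beta * KL.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) equals
    0.5 * ||x - x_hat||^2 plus a constant that is dropped here.
    """
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)   # EQ N5.4
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

rng = np.random.default_rng(0)
N, d, k = 4, 6, 2
x = rng.normal(size=(N, d))

# Stand-ins for encoder outputs; a real encoder would compute these from x
mu = 0.3 * rng.normal(size=(N, k))
log_var = 0.1 * rng.normal(size=(N, k))

# One reparameterized sample per input (EQ N5.5), then a hypothetical linear decoder
eps = rng.normal(size=(N, k))
z = mu + np.exp(0.5 * log_var) * eps
W_dec = rng.normal(size=(k, d))
x_hat = z @ W_dec

loss_1, recon, kl = neg_elbo(x, x_hat, mu, log_var, beta=1.0)
loss_4, _, _ = neg_elbo(x, x_hat, mu, log_var, beta=4.0)
print(f"recon {recon:.3f}  KL {kl:.3f}  loss(beta=1) {loss_1:.3f}  loss(beta=4) {loss_4:.3f}")
```

Setting `beta` above 1 leaves the reconstruction term untouched and scales only the KL penalty, which is the $\beta$-VAE modification in a single argument.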

A one-dimensional VAE latent has $\mu = 1$ and $\sigma = 1$. Using EQ N5.4, what is the KL divergence $D_{\mathrm{KL}}\big(\mathcal{N}(1,1)\,\Vert\,\mathcal{N}(0,1)\big)$? (Recall $\log 1 = 0$.)

$\tfrac12\big(\sigma^2 + \mu^2 - 1 - \log\sigma^2\big) = \tfrac12\big(1 + 1 - 1 - \log 1\big) = \tfrac12(1 + 1 - 1 - 0) = \tfrac12 \cdot 1 = $ 0.5 nats. The mean is one standard deviation off the prior; the variance already matches, so all the cost comes from $\mu^2$.

PYTHON · RUNNABLE IN-BROWSER

# VAE reparameterization trick: z = mu + sigma*eps, plus the closed-form KL
import numpy as np
rng = np.random.default_rng(0)

mu      = np.array([1.0, -0.5, 0.0])     # encoder mean for one input
log_var = np.array([0.0,  0.0, 2.0])     # encoder log-variance, log(sigma^2)
sigma   = np.exp(0.5 * log_var)          # -> sigma = [1, 1, e]

# draw many z via the trick; randomness lives only in eps ~ N(0, I)
eps = rng.normal(size=(20000, 3))
z   = mu + sigma * eps                    # broadcast: deterministic in (mu, sigma)
print("sample mean  ~ mu   :", np.round(z.mean(0), 3), " target", mu)
print("sample std   ~ sigma:", np.round(z.std(0),  3), " target", np.round(sigma, 3))

# closed-form KL( N(mu,sigma^2) || N(0,I) ) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - log_var)
print(f"\nKL to prior (nats)  : {kl:.4f}")
print("check dim 0 (mu=1, sig=1): 0.5*(1+1-1-0) =", 0.5*(1+1-1-0), "-> 0.5")
print("dim 2 (mu=0, sig=e) pays for an over-wide variance, log_var=2")


INSTRUMENT N5.3: VAE LATENT SAMPLER · 2D LATENT GRID → DECODED OUTPUTS · EQ N5.5

[Interactive figure. Controls: latent range in ± std (shown at 2.5), KL weight β (shown at 1.0). Readouts: prior mass covered, grid KL (mean), disentanglement.]

We walk a grid across the 2D latent and decode each cell — the classic VAE "latent atlas." Because the prior is $\mathcal{N}(0,I)$, the centre is dense (common samples) and the corners are rare. Each tile shows a synthetic decoded shape whose two factors of variation (curvature, orientation) are driven by the two latent axes. Raise β and the axes become more independent — disentangled — at the cost of blurrier, lower-contrast outputs. The dashed contours are the prior's 1σ and 2σ rings: anything inside them is a plausible sample.

5.5

Representation learning & uses

Autoencoders matter today less as standalone generators and more as representation learners — machinery for turning raw data into compact, structured codes that everything downstream consumes.

Pretraining & transfer. Train an encoder unsupervised on a mountain of unlabelled data, then attach a small classifier head and fine-tune on a little labelled data. The masked-autoencoding idea (mask patches, reconstruct them) is the visual analogue of masked language modelling; MAE (He et al., 2022) made it one of the leading self-supervised recipes for vision transformers.
Anomaly detection. Train on normal data only; flag inputs the model reconstructs poorly. High reconstruction error means "off the learned manifold" — fraud, defects, intrusions, equipment faults.
The latent backbone of generative AI. The single most consequential use: a VAE compresses images into a small latent grid, and a diffusion model (Chapter 07) does its expensive denoising in that latent space instead of in pixels. This is the "VAE" inside Stable Diffusion and its descendants — latent diffusion is why a consumer GPU can generate megapixel images. The autoencoder isn't the generator; it's the compression layer that makes the generator affordable.
Discrete codes for sequence models. The VQ-VAE (van den Oord et al., 2017) replaces the Gaussian latent with a learned codebook, turning images, audio, or video into sequences of discrete tokens that an autoregressive transformer can then model exactly like text — the basis of many modern image and audio generators.
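The discrete-code idea in the last item comes down to a nearest-neighbour lookup. The sketch below shows only that quantization step, with a random codebook and random vectors standing in for learned quantities; the training machinery a real VQ-VAE needs (a straight-through gradient and codebook/commitment losses) is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)
num_codes, code_dim = 8, 4

codebook = rng.normal(size=(num_codes, code_dim))   # learned in a real model; random here
z_e = rng.normal(size=(5, code_dim))                # stand-in encoder outputs, 5 positions

# Squared distance from every encoder vector to every codebook entry: 5 x 8
dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)

tokens = np.argmin(dists, axis=1)                   # discrete token ids, one per position
z_q = codebook[tokens]                              # quantized latents passed to the decoder

print("token ids:", tokens)
print("quantized latent shape:", z_q.shape)
```

The integer array `tokens` is what a downstream autoregressive model would treat as its vocabulary sequence.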

It is worth being honest about the trade-offs experts actually argue over. VAE samples are notoriously blurry compared to GANs (Chapter 06) and diffusion: the Gaussian decoder and the averaging implied by the ELBO smear high-frequency detail. Posterior collapse is the classic failure — when the decoder is powerful enough to ignore $z$, the KL term drives $q(z\mid x)$ all the way to the prior and the latent carries no information; KL-annealing, free-bits, and weaker decoders are the usual countermeasures. And the ELBO is a bound, not the likelihood itself: a higher ELBO does not guarantee better samples, and "good representation" and "good generation" are not the same objective. The VAE's enduring win is not photorealism — it is a well-organized, sampleable latent space, which is exactly what the rest of the generative stack needed.
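Two of the countermeasures named above, KL annealing and free bits, are simple enough to sketch. The function names, the linear ramp, and the default values are illustrative assumptions; published variants differ in detail (free bits, for instance, is often applied per group of latents and averaged over a minibatch).

```python
import numpy as np

def kl_weight(step, warmup_steps=1000):
    """Linear KL annealing: the KL weight ramps from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, floor=0.1):
    """Simplified free bits: charge each latent dimension at least `floor` nats.

    Dimensions whose KL is already below the floor contribute a constant,
    so the optimizer gains nothing by pushing them further toward the prior.
    """
    return np.sum(np.maximum(kl_per_dim, floor))

for step in (0, 250, 500, 1000, 5000):
    print(f"step {step:>5}: KL weight = {kl_weight(step):.2f}")

kl_per_dim = np.array([0.0, 0.05, 0.5])     # hypothetical per-dimension KL values
print("plain KL sum    :", kl_per_dim.sum())
print("free-bits KL sum:", free_bits_kl(kl_per_dim))
```

Annealing gives the decoder time to start using $z$ before the KL penalty applies in full; free bits removes the incentive to collapse a dimension completely.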

The VAE buys a smooth latent at the price of blur. The next chapter takes the opposite bet: drop the explicit likelihood entirely and learn to generate by competition. Chapter 06 — GANs: a generator and a discriminator locked in a minimax game, the sharpest samples in deep learning and the hardest training dynamics to tame.

5.R

References

Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science 313(5786) — deep autoencoders, trained layer-wise, beat PCA at nonlinear dimensionality reduction (§5.1).
Kingma, D. P. & Welling, M. (2013). Auto-Encoding Variational Bayes. ICLR 2014 — the VAE, the ELBO (EQ N5.3), and the reparameterization trick (EQ N5.5).
Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008 — the denoising autoencoder (EQ N5.2) and the manifold-projection view that seeds diffusion.
Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima. Neural Networks 2(1) — proves the linear autoencoder optimum spans the top-k PCA subspace (§5.1).
Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017 — weighting the KL term to encourage disentangled latent factors (§5.4).
van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE). NeurIPS 2017 — discrete codebook latents that let autoregressive models generate over autoencoder tokens (§5.5).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022 — the VAE-compressed latent space that makes Stable Diffusion affordable (§5.5).
He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022 — masked autoencoding as a strong self-supervised pretext for vision transformers (§5.5).
