Autoencoders — learning to compress
An autoencoder is a network trained to copy its input to its output — a task that sounds trivial until you choke the path between them. An encoder \(f_\theta\) maps the input \(x \in \mathbb{R}^d\) down to a code \(z \in \mathbb{R}^k\) with \(k \ll d\); a decoder \(g_\phi\) maps the code back up to a reconstruction \(\hat{x}\). Both are trained jointly to make \(\hat{x}\) look like \(x\):
\[\mathcal{L}(\theta,\phi) = \frac{1}{N}\sum_{i=1}^{N} \big\lVert x_i - g_\phi\big(f_\theta(x_i)\big) \big\rVert^2\]
The label is the input itself, so autoencoders are self-supervised: they need no annotations, just data. What they learn is a coordinate system — a chart of the low-dimensional manifold that the high-dimensional data lives near. A 28×28 image has 784 pixels, but the set of handwritten digits occupies a far thinner sheet inside that 784-dimensional cube; the bottleneck is the network's estimate of how thin.
The cleanest case is fully linear. Let the encoder be a matrix \(W \in \mathbb{R}^{k\times d}\) and the decoder \(W^\top\), with mean-centered data and squared-error loss. The optimum is not unique, but the subspace it spans is: it is exactly the span of the top \(k\) principal components of the data (Baldi & Hornik, 1989). A linear autoencoder rediscovers PCA from scratch: gradient descent on reconstruction error converges in practice to the subspace spanned by the top \(k\) eigenvectors of the covariance matrix. It need not recover the individual eigenvectors, because any rotation of the code within that subspace reconstructs equally well.
The linear case is the Rosetta stone. It tells you an autoencoder's job is dimensionality reduction, and that the bottleneck width \(k\) is choosing how many directions of variance to keep. Nonlinear encoders simply bend PCA's flat hyperplane into a curved manifold — same goal, more expressive chart. See Vol I · EQ 4.x for PCA via the SVD.
# Linear autoencoder == PCA: train by gradient descent, compare to eigenvectors
import numpy as np
rng = np.random.default_rng(0)
d, k, N = 8, 2, 600
# data on a 2D plane (k=2) embedded in 8D, plus small noise -> intrinsic rank 2
basis = np.linalg.qr(rng.normal(size=(d, k)))[0] # true 2D subspace
X = rng.normal(size=(N, k)) @ basis.T + 0.02 * rng.normal(size=(N, d))
X -= X.mean(0) # center (PCA assumes this)
# PCA: top-k right-singular vectors = the "answer" subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k] # k x d principal axes
# Linear AE: encoder We (d->k), decoder Wd (k->d). Minimize ||X - (X We) Wd||^2
We = 0.1 * rng.normal(size=(d, k))
Wd = 0.1 * rng.normal(size=(k, d))
for step in range(6000):
    Z = X @ We # codes: N x k
    R = Z @ Wd - X # residual: N x d
    We -= 0.08 * (X.T @ (R @ Wd.T) / N) # dL/dWe
    Wd -= 0.08 * (Z.T @ R / N) # dL/dWd
err = np.linalg.norm(X - (X @ We) @ Wd) / np.linalg.norm(X)
Wq = np.linalg.qr(We)[0] # orthonormal basis of code space
overlap = np.linalg.svd(P @ Wq, compute_uv=False) # cos(principal angles); 1 == aligned
print(f"AE relative reconstruction error : {err:.4f}")
print(f"cos(principal angles) AE vs PCA : {np.round(overlap, 4)}")
print("error tiny, cosines ~1 -> the AE found the PCA plane, just rotated in it")
Denoising & overcomplete variants
A bottleneck is one way to stop an autoencoder from learning the useless identity map. It is not the only way, and not always the best. If you let \(k \ge d\) — an overcomplete code — a vanilla autoencoder can cheat by copying the input straight through, learning nothing. Three families of regularizer break that shortcut while keeping a wide, expressive code.
Denoising autoencoders (DAE) corrupt the input, then demand a clean reconstruction. The network sees \(\tilde{x} = x + \varepsilon\) (added Gaussian noise, or random pixel masking) and must produce the original \(x\):
\[\mathcal{L}_{\text{DAE}}(\theta,\phi) = \mathbb{E}_{x,\,\tilde{x}}\,\big\lVert x - g_\phi\big(f_\theta(\tilde{x})\big)\big\rVert^2\]
Two cousins regularize differently. Sparse autoencoders allow a wide code but penalize how strongly and how often units fire, using an \(L_1\) penalty or a KL term that pins each unit's average activation to a small target. The code stays overcomplete, but any single input lights up only a few dimensions, so each one tends to specialize into an interpretable feature. (This is the same mechanism behind today's sparse-autoencoder interpretability work, which decomposes an LLM's dense activations into thousands or more of largely monosemantic features.) Contractive autoencoders add \(\lVert J_f(x)\rVert_F^2\), the squared Frobenius norm of the encoder's Jacobian, to the loss. This forces the code to be insensitive to small input perturbations: flat along directions that don't matter, responsive only along the manifold.
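To make the two penalties concrete, here is a minimal sketch for a single sigmoid encoder layer. The weights `W`, biases `b`, and the layer sizes are made-up values for illustration, not taken from any trained model. For a sigmoid layer the Jacobian is \(J = \mathrm{diag}\big(h \odot (1-h)\big)\,W\), so the contractive penalty has a cheap closed form, which the sketch checks against a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 12                        # overcomplete: code wider than input (illustrative sizes)
W = 0.5 * rng.normal(size=(k, d))   # hypothetical encoder weights
b = 0.1 * rng.normal(size=k)        # hypothetical encoder biases
x = rng.normal(size=d)              # one input vector


def encode(v):
    """Sigmoid encoder h = sigmoid(W v + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ v + b)))


h = encode(x)

# Sparse penalty: L1 norm of the code (sigmoid activations are non-negative)
l1_penalty = np.abs(h).sum()

# Contractive penalty, closed form for a sigmoid layer:
# J = diag(h * (1 - h)) @ W, so ||J||_F^2 = sum_j (h_j (1 - h_j))^2 * ||W_j||^2
contractive = np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Sanity check: central finite-difference Jacobian, one column per input coordinate
step = 1e-5
J_fd = np.stack(
    [(encode(x + step * e) - encode(x - step * e)) / (2 * step) for e in np.eye(d)],
    axis=1,
)                                   # shape k x d
contractive_fd = np.sum(J_fd ** 2)

print(f"L1 penalty on the code        : {l1_penalty:.4f}")
print(f"contractive penalty (closed)  : {contractive:.6f}")
print(f"contractive penalty (numeric) : {contractive_fd:.6f}")
```

In training, either penalty would be added to the reconstruction loss with a weighting coefficient, typically chosen on validation data.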
| Variant | What stops the identity map | What the code becomes |
|---|---|---|
| Undercomplete | narrow bottleneck \(k < d\) | Top directions of variance (PCA-like). |
| Denoising | corrupt input, clean target | A projection back onto the data manifold; for small Gaussian noise, the reconstruction residual estimates the score \(\nabla_x \log p(x)\) up to scale. |
| Sparse | \(L_1\) / KL activation penalty | Overcomplete but few-active; specialized, often interpretable features. |
| Contractive | Jacobian-norm penalty | Locally invariant code, flat off the manifold. |
All four share one moral: an autoencoder is only as good as the pressure you put on its code. Remove every constraint and it learns the identity; impose the right one and it learns the data's geometry.
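The denoising row can be checked in the simplest possible setting. The sketch below assumes a purely linear denoiser with no bottleneck at all (a full \(d \times d\) map `M`, fitted in closed form by least squares rather than by gradient descent) and clean data lying exactly on a 2D plane. These are illustrative assumptions, not the general DAE recipe. Even without a narrow code, the fitted map is not the identity: it keeps the data plane, slightly shrunk, and discards every other direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 8, 2, 5000
noise_std = 0.5

# Clean data exactly on a 2D plane inside 8D (unit-variance coordinates)
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(N, k)) @ basis.T

# Corrupt the input; the training target stays clean
X_noisy = X + noise_std * rng.normal(size=(N, d))

# Best linear denoiser: minimize ||X_noisy @ M - X||^2 over a full d x d matrix M
M, *_ = np.linalg.lstsq(X_noisy, X, rcond=None)

# If M were the identity, all d singular values would equal 1.
# In the population limit the two in-plane values are 1 / (1 + noise_std^2) = 0.8
# and the remaining ones are 0.
sv = np.linalg.svd(M, compute_uv=False)

err_in = np.mean((X_noisy - X) ** 2)        # error of the corrupted input
err_out = np.mean((X_noisy @ M - X) ** 2)   # error after denoising

print("singular values of M:", np.round(sv, 3))
print(f"MSE before denoising: {err_in:.4f}")
print(f"MSE after denoising : {err_out:.4f}")
```

The corruption, not the code width, is what rules out the identity here: copying the noisy input through would leave the full noise variance in the output.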
The latent space
The code \(z\) is not just a compressed file — it is a place. The set of all codes the encoder can produce is the latent space, and its geometry is where autoencoders earn their keep. Distances in latent space correspond to perceptual or semantic distances in data space far better than raw pixel distance does: two photos of the same face under different lighting are far apart in pixels but close in a good latent.
This is what makes the latent space useful for more than compression. You can interpolate: decode \(g_\phi\big((1-t)\,z_a + t\,z_b\big)\) and sweep \(t\) from 0 to 1 to morph smoothly from one example to another. You can cluster in latent space, where classes often separate more cleanly than in input space. You can do nearest-neighbour retrieval on codes instead of inputs. And you can detect anomalies: a point the autoencoder reconstructs poorly probably lies off the manifold it learned, so high reconstruction error serves as an unsupervised novelty score.
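Two of these uses, interpolation and anomaly scoring, fit in a few lines. The sketch below uses the top-\(k\) PCA axes as a stand-in for a trained linear autoencoder (an assumption made to keep the example short). The helper names `encode`, `decode`, and `novelty` are hypothetical, and the outlier is synthetic: a training point pushed one unit off the data plane.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 8, 2, 600

# Data near a 2D plane in 8D, with small isotropic noise
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(N, k)) @ basis.T + 0.02 * rng.normal(size=(N, d))
mean = X.mean(0)

# Stand-in for a trained linear autoencoder: top-k principal axes
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:k]                                   # k x d


def encode(x):
    """Map input(s) to k-dimensional codes."""
    return (x - mean) @ P.T


def decode(z):
    """Map code(s) back to input space."""
    return z @ P + mean


def novelty(x):
    """Squared reconstruction error, used as an anomaly score."""
    return np.sum((x - decode(encode(x))) ** 2, axis=-1)


# Interpolation: decode points on the straight line between two codes
z_a, z_b = encode(X[0]), encode(X[1])
path = np.stack([decode((1 - t) * z_a + t * z_b) for t in np.linspace(0, 1, 5)])

# Anomaly detection: push a normal point one unit off the data plane
v = rng.normal(size=d)
v -= basis @ (basis.T @ v)                   # remove the in-plane component
v /= np.linalg.norm(v)
outlier = X[0] + 1.0 * v

normal_scores = novelty(X)
outlier_score = novelty(outlier)

print("interpolation path shape :", path.shape)
print(f"largest normal score     : {normal_scores.max():.4f}")
print(f"outlier score            : {outlier_score:.4f}")
```

With a linear decoder the interpolation path is a straight line in input space; a nonlinear decoder would trace a curve instead.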
A plain autoencoder's latent space has holes. Training only constrains the codes the encoder actually emits; the space between and around them is unconstrained. Decode a random point — or a midpoint between two clusters — and you often get garbage, because the decoder was never asked to make that region meaningful. The latent is a scatter of trained islands in an empty sea. You cannot reliably sample new data from it. Fixing this hole is the entire motivation for the variational autoencoder.
Variational autoencoders
The variational autoencoder (VAE) of Kingma & Welling (2013) closes the holes by making two changes. First, the encoder no longer outputs a single point; it outputs a distribution: a mean \(\mu(x)\) and a (log-)variance \(\log\sigma^2(x)\) defining \(q_\phi(z\mid x) = \mathcal{N}(\mu, \sigma^2 I)\). Second, the loss regularizes that distribution toward a standard normal prior \(p(z) = \mathcal{N}(0, I)\). Encode a point and you get a fuzzy ball, not a dot; train on the whole dataset and the balls overlap to tile a smooth Gaussian cloud, largely free of gaps, that you can sample from at will.
The objective is a lower bound on the data log-likelihood, the evidence lower bound (ELBO):
\[\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big)\]
Here \(p_\theta(x\mid z)\) is the decoder's likelihood model: the first term rewards reconstruction, the second keeps the encoder's distribution close to the prior.
For diagonal Gaussians the KL term has a clean closed form — no sampling needed to compute it:
\[D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\sigma^2 I)\,\|\,\mathcal{N}(0,I)\big) = \tfrac{1}{2}\sum_{j=1}^{k}\big(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\big)\]
One obstacle remains. The ELBO contains an expectation over \(z \sim q_\phi(z\mid x)\), and sampling \(z\) is not differentiable, so you can't backpropagate through a random draw. The fix is the reparameterization trick: move the randomness outside the network. Instead of sampling \(z\) directly, sample noise \(\varepsilon \sim \mathcal{N}(0, I)\) from a fixed distribution and build \(z\) as a deterministic, differentiable function of \(\mu\), \(\sigma\), and \(\varepsilon\):
\[z = \mu(x) + \sigma(x)\odot\varepsilon, \qquad \varepsilon\sim\mathcal{N}(0, I)\]
# VAE reparameterization trick: z = mu + sigma*eps, plus the closed-form KL
import numpy as np
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.0]) # encoder mean for one input
log_var = np.array([0.0, 0.0, 2.0]) # encoder log-variance, log(sigma^2)
sigma = np.exp(0.5 * log_var) # -> sigma = [1, 1, e]
# draw many z via the trick; randomness lives only in eps ~ N(0, I)
eps = rng.normal(size=(20000, 3))
z = mu + sigma * eps # broadcast: deterministic in (mu, sigma)
print("sample mean ~ mu :", np.round(z.mean(0), 3), " target", mu)
print("sample std ~ sigma:", np.round(z.std(0), 3), " target", np.round(sigma, 3))
# closed-form KL( N(mu,sigma^2) || N(0,I) ) = 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - log_var)
print(f"\nKL to prior (nats) : {kl:.4f}")
print("check dim 0 (mu=1, sig=1): 0.5*(1+1-1-0) =", 0.5*(1+1-1-0), "-> 0.5")
print("dim 2 (mu=0, sig=e) pays for an over-wide variance, log_var=2")
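Why the trick matters for training can be seen on a toy objective. The sketch below assumes a scalar latent and the made-up objective \(f(z) = z^2\), for which \(\mathbb{E}[f(z)] = \mu^2 + \sigma^2\) and the exact gradients are \(2\mu\) and \(2\sigma\). Because \(z = \mu + \sigma\varepsilon\) is a deterministic function of the parameters, the chain rule gives per-sample gradients that can simply be averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7                  # illustrative encoder outputs for one input
eps = rng.normal(size=200_000)        # all randomness lives here
z = mu + sigma * eps                  # reparameterized samples

# Toy objective f(z) = z^2, so f'(z) = 2z.
# Chain rule through z = mu + sigma * eps:
#   df/dmu    = f'(z) * dz/dmu    = 2z * 1
#   df/dsigma = f'(z) * dz/dsigma = 2z * eps
grad_mu = np.mean(2 * z)
grad_sigma = np.mean(2 * z * eps)

print(f"d/dmu    E[z^2]: estimate {grad_mu:.3f}, exact {2 * mu:.3f}")
print(f"d/dsigma E[z^2]: estimate {grad_sigma:.3f}, exact {2 * sigma:.3f}")
```

An automatic-differentiation framework does the same thing implicitly: once \(\varepsilon\) is treated as a constant input, gradients flow through \(\mu\) and \(\sigma\) like any other activation.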
Representation learning & uses
Autoencoders matter today less as standalone generators and more as representation learners — machinery for turning raw data into compact, structured codes that everything downstream consumes.
- Pretraining & transfer. Train an encoder unsupervised on a mountain of unlabelled data, then attach a small classifier head and fine-tune on a little labelled data. The masked-autoencoding idea (mask patches, reconstruct them) is the visual analogue of masked language modelling; MAE (He et al., 2022) made it one of the leading self-supervised recipes for vision transformers.
- Anomaly detection. Train on normal data only; flag inputs the model reconstructs poorly. High reconstruction error means "off the learned manifold" — fraud, defects, intrusions, equipment faults.
- The latent backbone of generative AI. The single most consequential use: a VAE compresses images into a small latent grid, and a diffusion model (Chapter 07) does its expensive denoising in that latent space instead of in pixels. This is the "VAE" inside Stable Diffusion and its descendants — latent diffusion is why a consumer GPU can generate megapixel images. The autoencoder isn't the generator; it's the compression layer that makes the generator affordable.
- Discrete codes for sequence models. The VQ-VAE (van den Oord et al., 2017) replaces the Gaussian latent with a learned codebook, turning images, audio, or video into sequences of discrete tokens that an autoregressive transformer can then model exactly like text — the basis of many modern image and audio generators.
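The discrete-code idea in the last bullet comes down to a nearest-neighbour lookup. The sketch below shows only that quantization step, with a random codebook and random encoder outputs standing in for learned ones; the sizes `K`, `D`, and `n` are arbitrary. Training a real VQ-VAE also needs a straight-through gradient estimator and codebook and commitment losses, none of which are shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, n = 16, 4, 6                    # codebook size, code dimension, number of latent vectors

codebook = rng.normal(size=(K, D))    # stand-in for a learned codebook
z_e = rng.normal(size=(n, D))         # stand-in for continuous encoder outputs

# Squared Euclidean distance from every latent vector to every codebook entry (n x K)
d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)

tokens = d2.argmin(axis=1)            # discrete token id for each latent vector
z_q = codebook[tokens]                # quantized latents passed on to the decoder

print("token ids:", tokens)
print("quantized latent shape:", z_q.shape)
```

The integer array `tokens` is what a downstream autoregressive model would treat as its sequence.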
It is worth being honest about the trade-offs experts actually argue over. VAE samples are notoriously blurry compared to GANs (Chapter 06) and diffusion: the Gaussian decoder and the averaging implied by the ELBO smear high-frequency detail. Posterior collapse is the classic failure — when the decoder is powerful enough to ignore \(z\), the KL term drives \(q(z\mid x)\) all the way to the prior and the latent carries no information; KL-annealing, free-bits, and weaker decoders are the usual countermeasures. And the ELBO is a bound, not the likelihood itself: a higher ELBO does not guarantee better samples, and "good representation" and "good generation" are not the same objective. The VAE's enduring win is not photorealism — it is a well-organized, sampleable latent space, which is exactly what the rest of the generative stack needed.
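The gap between the ELBO and the true log-likelihood can be computed exactly in a toy model. The sketch below assumes a one-dimensional model chosen only because everything is tractable: \(z \sim \mathcal{N}(0,1)\) and \(x \mid z \sim \mathcal{N}(z, 1)\), so \(p(x) = \mathcal{N}(0, 2)\) and the true posterior is \(\mathcal{N}(x/2, 1/2)\). The function `elbo` is a hypothetical helper that evaluates the bound for a Gaussian \(q(z\mid x) = \mathcal{N}(m, s^2)\).

```python
import numpy as np

x = 1.0                               # a single observed data point


def elbo(m, s2):
    """ELBO for q(z|x) = N(m, s2) under z ~ N(0, 1), x|z ~ N(z, 1)."""
    # E_q[log N(x; z, 1)] in closed form: E_q[(x - z)^2] = (x - m)^2 + s2
    recon = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL( N(m, s2) || N(0, 1) ), the same closed form as in the text
    kl = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return recon - kl


# Exact log-likelihood: marginally x ~ N(0, 2)
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0)

elbo_exact = elbo(x / 2, 0.5)         # q equals the true posterior: bound is tight
elbo_prior = elbo(0.0, 1.0)           # q collapsed to the prior: KL is 0, bound is loose

print(f"log p(x)                : {log_px:.4f}")
print(f"ELBO at true posterior  : {elbo_exact:.4f}")
print(f"ELBO with q = prior     : {elbo_prior:.4f}")
```

The collapsed case pays nothing in KL but loses on reconstruction. In this toy model the decoder cannot compensate, which is why collapse is costly here; a powerful enough decoder removes that cost.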
The VAE buys a smooth latent at the price of blur. The next chapter takes the opposite bet: drop the explicit likelihood entirely and learn to generate by competition. Chapter 06 — GANs: a generator and a discriminator locked in a minimax game, the sharpest samples in deep learning and the hardest training dynamics to tame.
References
- Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks.
- Kingma, D. P. & Welling, M. (2013). Auto-Encoding Variational Bayes.
- Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders.
- Baldi, P. & Hornik, K. (1989). Neural Networks and Principal Component Analysis: Learning from Examples without Local Minima.
- Higgins, I. et al. (2017). β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.
- van den Oord, A., Vinyals, O. & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning (VQ-VAE).
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.
- He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners.