VOLUME II — THE LLM FIELD MANUAL · CHAPTER 13 / 13

Mechanistic Interpretability & Sparse Autoencoders

A trained network is not a black box because it is mysterious. It is a black box because it stores far more concepts than it has neurons, smeared across overlapping directions in activation space. Mechanistic interpretability tries to read that representation back out as human-legible features and circuits, and sparse autoencoders are the current best tool for prying the features apart. This chapter covers the goal, the superposition that makes it hard, the dictionary-learning fix, the causal tests that keep it honest, and the limits the field is candid about.

LEVEL: ADVANCED | READING TIME: ≈ 30 MIN | BUILDS ON: CH 02 · CH 03 · CH 12 | INSTRUMENTS: SUPERPOSITION · SAE TRADE-OFF
13.1

The goal: reverse-engineering learned algorithms

Training tells you that a network works. It does not tell you how. The weights are the only artifact, and they encode whatever algorithm gradient descent happened to find. Mechanistic interpretability is the program of recovering that algorithm: not "which input pixels mattered" (the question saliency maps ask) but "what computation is this set of weights actually running, expressed in terms a person can audit." The unit of explanation is the circuit — a subgraph of features and the weights connecting them that together implement an identifiable function, such as "detect that the current token repeats an earlier bigram, then copy what came next."

The motivation is partly scientific and partly safety. If a model can be made to lie, refuse, or pursue a goal, you would like to find the internal mechanism that does so, verify it causally, and ideally intervene. Saliency and attention-weight visualization were the first attempts; both proved too coarse, because they describe where a network looked, not what it concluded. The mechanistic program instead insists on three commitments: name the intermediate variables (features), describe the computation over them (circuits), and confirm each claim with a causal experiment rather than a correlation.

Two facts make this hard, and they structure the rest of the chapter. First, the natural unit you can read off directly, the neuron, usually does not correspond to a single concept. Second, the reason is not noise but an efficient encoding that training converges on: superposition, in which the network packs more features than it has dimensions. Sections 13.2 and 13.3 establish the problem; 13.4 gives the tool that addresses it; 13.5 supplies the causal tests; 13.6 is the honest accounting of what we can and cannot yet claim.

FRAME

Keep one distinction front of mind throughout. A neuron is a coordinate of the activation vector — a basis direction fixed by the architecture. A feature is a direction in that same space that the network actually uses to represent one concept. They coincide only by accident. Interpretability is largely the search for the feature directions hiding inside the neuron basis.

13.2

Features versus neurons, and polysemanticity

If features lined up with neurons, interpretability would be a labeling exercise: read each neuron's top-activating inputs, write down the concept. Some neurons do behave this way — a few are cleanly monosemantic, firing for exactly one human-recognizable thing. Most are not. A typical neuron is polysemantic: its top activations include unrelated concepts — French text, DNA codons, and HTTP headers all lighting up the same coordinate. There is no single caption you can write for it that is faithful.

Polysemanticity is the central obstacle. It is not that the network failed to organize itself; it is that the natural basis (the neurons) is the wrong coordinate system. The concepts are still in there, but they live along directions that cut across many neurons at once. Reading the activation vector in the neuron basis is like reading a paragraph with the word boundaries deleted: every letter is present, but the units of meaning are not where the gaps are.

EQ 13.1 — THE LINEAR REPRESENTATION ASSUMPTION $$ x \;\approx\; \sum_{i=1}^{F} a_i\, \mathbf{d}_i, \qquad \mathbf{d}_i \in \mathbb{R}^{d},\; \lVert \mathbf{d}_i \rVert = 1,\; a_i \ge 0 $$
An activation vector \(x\in\mathbb{R}^{d}\) is modeled as a sparse, non-negative combination of \(F\) unit feature directions \(\mathbf{d}_i\), where \(a_i\) is how strongly feature \(i\) is present. This linear representation assumption — concepts are directions, and they add — underlies almost all of mechanistic interpretability. A neuron is just the special case \(\mathbf{d}_i = e_i\), a standard basis vector. Polysemanticity is what you get when the true \(\mathbf{d}_i\) are not axis-aligned, so each axis collects pieces of several features.
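A minimal sketch of EQ 13.1 may help make the neuron/feature distinction concrete. It assumes a hypothetical toy space with \(d=2\) neurons and \(F=3\) hand-picked feature directions (the angles are illustrative, not taken from any real model) and shows that a single coordinate responds to every feature when the directions are not axis-aligned.

```python
import numpy as np

# Hypothetical toy setup: 3 unit feature directions in a 2-neuron space.
# The angles are arbitrary illustrative choices.
angles = np.deg2rad([30.0, 60.0, 150.0])
D = np.stack([np.cos(angles), np.sin(angles)])    # shape (d=2, F=3), columns are d_i

def activation(a):
    """EQ 13.1: x = sum_i a_i d_i for a non-negative code a."""
    a = np.asarray(a, dtype=float)
    assert (a >= 0).all(), "EQ 13.1 assumes non-negative feature activations"
    return D @ a

# Switch on each feature alone and read the two neurons (coordinates of x).
for i in range(3):
    a = np.zeros(3)
    a[i] = 1.0
    x = activation(a)
    print(f"feature {i} alone -> neuron 0 = {x[0]:+.3f}, neuron 1 = {x[1]:+.3f}")
```

In this toy, neuron 1 fires positively for all three features, so no single caption describes it, even though each feature direction on its own is perfectly clean.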

Why would training produce non-axis-aligned features? Because there is no pressure to align them. Gradient descent optimizes the loss, and the loss is invariant to how you rotate the hidden basis: any rotation of the representation, undone by the next layer's weights, computes the same function. The network is free to place features wherever is cheapest, and §13.3 argues that "cheapest" usually means "not one per neuron."
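The rotation argument can be checked numerically. The sketch below assumes the simplest case, a hidden space that is written to and read from by purely linear maps (the hypothetical matrices `W1` and `W2`); an elementwise nonlinearity applied directly to the hidden vector would break this symmetry, so treat it as an illustration of the linear case only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 5, 4, 3

W1 = rng.normal(size=(d_hid, d_in))     # hypothetical layer that writes the hidden vector
W2 = rng.normal(size=(d_out, d_hid))    # hypothetical layer that reads it

# A random orthogonal matrix (a rotation, possibly with a reflection) from a QR factorisation.
R, _ = np.linalg.qr(rng.normal(size=(d_hid, d_hid)))

x = rng.normal(size=d_in)
h = W1 @ x                  # original hidden representation
y = W2 @ h                  # original output

h_rot = (R @ W1) @ x        # rotated hidden representation
y_rot = (W2 @ R.T) @ h_rot  # next layer undoes the rotation

print("outputs identical      :", np.allclose(y, y_rot))
print("hidden vectors identical:", np.allclose(h, h_rot))
```

The output is unchanged while the hidden coordinates are completely different, which is why nothing in the loss singles out the neuron basis in this setting.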

A neuron's top-activating dataset examples cluster into three clearly distinct, unrelated concepts: legal contracts, chess notation, and Python tracebacks. How many monosemantic features would a faithful description need to assign to this single neuron's behavior (i.e. how many separate concepts is it encoding)?
A monosemantic feature captures exactly one concept. This neuron fires for three unrelated ones, so a faithful account needs 3 distinct features — the neuron is polysemantic, mixing three directions onto one axis. That mismatch is precisely why the neuron basis is the wrong place to read features, and why we need a method (the SAE of §13.4) to recover the three underlying directions.
13.3

The superposition hypothesis

The leading explanation for polysemanticity is superposition: a network represents more features than it has dimensions by storing them as directions that are not orthogonal but merely close to orthogonal. With \(d\) dimensions you get only \(d\) perfectly orthogonal directions, but you can pack many more nearly-orthogonal ones into the same space: by the Johnson–Lindenstrauss lemma, the number you can fit while keeping every pairwise overlap below \(\varepsilon\) grows exponentially in \(d\) (roughly as \(e^{c\,\varepsilon^{2} d}\) for some constant \(c>0\)). The price is interference: any two features whose directions are not exactly perpendicular leak a little into each other's readout.

EQ 13.2 — INTERFERENCE FROM NON-ORTHOGONALITY $$ \hat{a}_i \;=\; \mathbf{d}_i^{\top} x \;=\; \mathbf{d}_i^{\top}\!\Big(\textstyle\sum_j a_j \mathbf{d}_j\Big) \;=\; a_i \;+\; \underbrace{\sum_{j\neq i} a_j\,(\mathbf{d}_i^{\top}\mathbf{d}_j)}_{\text{cross-talk}} $$
Reading feature \(i\) by projecting onto its direction recovers the true activation \(a_i\) plus a sum of contaminations, each weighted by the cosine \(\mathbf{d}_i^{\top}\mathbf{d}_j\) between feature \(i\) and every other active feature \(j\). If all directions were orthogonal, the cross-talk term would vanish and \(\hat a_i = a_i\) exactly, but then you would be capped at \(d\) features. Superposition trades a small, tolerable amount of interference for a large gain in capacity. It works because of the next idea: sparsity.

Superposition is only viable because real features are sparse: on any given input, almost none of the thousands of possible features are active. If two rarely-active features happen to share a slightly-overlapping direction, they almost never fire together, so the cross-talk in EQ 13.2 almost never materializes. Sparsity is what lets the network get away with the overpacking. Elhage et al. (2022) showed in toy models that as feature sparsity rises, networks transition from a "one feature per neuron" regime, in which features that do not fit are simply left unrepresented, into superposition, often arranging features into tidy geometric structures (antipodal pairs, pentagons, tetrahedra) that minimize worst-case interference.

EQ 13.3 — WHEN SUPERPOSITION PAYS $$ \text{loss} \;\approx\; \underbrace{\text{(features not represented)}}_{\text{capacity cost}} \;+\; \underbrace{\textstyle\sum_{i}\Pr[a_i\!\neq\!0]\sum_{j\neq i}\Pr[a_j\!\neq\!0]\,(\mathbf{d}_i^{\top}\mathbf{d}_j)^2}_{\text{interference cost}} $$
A schematic of the trade-off the network optimizes. Forcing orthogonality kills the interference term but caps you at \(d\) features, paying the capacity cost. Packing more features incurs interference, but each term is scaled by the product of two firing probabilities — so when features are sparse (small \(\Pr[a\neq0]\)) the interference cost is tiny and superposition wins. The sparser the features, the more you can pack. This is the quantitative heart of the hypothesis.
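To see how strongly sparsity suppresses the interference term of EQ 13.3, the following sketch evaluates it for random unit directions. It assumes, purely for illustration, that every feature fires independently with the same probability `p`; the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, F = 8, 40                                   # illustrative sizes, F > d

D = rng.normal(size=(d, F))
D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit feature directions
G2 = (D.T @ D) ** 2                            # squared pairwise cosines
np.fill_diagonal(G2, 0.0)                      # exclude the j == i terms

def interference_cost(p):
    """sum_i p_i sum_{j != i} p_j (d_i . d_j)^2 with a shared firing probability p."""
    probs = np.full(F, p)
    return float(probs @ G2 @ probs)

for p in (0.5, 0.1, 0.02):
    print(f"Pr[active] = {p:<4}: interference cost = {interference_cost(p):.4f}")
```

Because the term is a product of two firing probabilities, cutting `p` by a factor of 5 cuts the interference cost by a factor of 25, with the geometry held fixed.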
Two unit feature directions are \(\mathbf{d}_i = (0.6,\ 0.8)\) and \(\mathbf{d}_j = (1.0,\ 0.0)\). Their cosine similarity \(\mathbf{d}_i^{\top}\mathbf{d}_j\) is the per-unit interference one leaks into the other (EQ 13.2). Compute it.
Both vectors are already unit length (\(\sqrt{0.6^2+0.8^2}=1\), \(\sqrt{1^2+0^2}=1\)). The cosine is the dot product: \((0.6)(1.0) + (0.8)(0.0) = \) 0.6. That is a large overlap — these two features would interfere heavily if ever active together, so the network would only pack them this way if they almost never co-occur. Truly orthogonal features (cosine 0) carry zero cross-talk.
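The same two directions can be used to check EQ 13.2 term by term. The activations `a_i` and `a_j` below are hypothetical values chosen for the illustration.

```python
import numpy as np

d_i = np.array([0.6, 0.8])
d_j = np.array([1.0, 0.0])
a_i, a_j = 1.0, 0.5                      # hypothetical activations, both features on

x = a_i * d_i + a_j * d_j                # superposed activation (EQ 13.1)
a_hat_i = d_i @ x                        # read feature i by projection (EQ 13.2)
cross_talk = a_j * (d_i @ d_j)           # contamination leaked from feature j

print(f"cosine d_i.d_j   = {d_i @ d_j:.2f}")
print(f"read-out a_hat_i = {a_hat_i:.2f}")
print(f"a_i + cross-talk = {a_i + cross_talk:.2f}")
```

The read-out exceeds the true activation by exactly \(a_j\) times the cosine, which is the cross-talk term of EQ 13.2 with a single interfering feature.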
INSTRUMENT 13.1 — SUPERPOSITION: PACKING F FEATURES INTO d DIMENSIONS · INTERFERENCE & RECONSTRUCTION · EQ 13.2
[Interactive figure. Readouts: F / d (packing), mean |cosine|, recon error (RMS).]
Left: the \(F\) feature directions placed in \(d\) dimensions (shown as a Gram-matrix heatmap of pairwise cosines — the diagonal is 1, off-diagonal is interference). Right: a bar of read-out error. Push F up past d and the off-diagonal heats up: you cannot keep many vectors orthogonal in few dimensions, so cross-talk rises. Now lower L0 (fewer features active per input): error falls even at high F/d, because sparse features rarely collide. That is superposition's bargain — overpack, but stay sparse.
PYTHON · RUNNABLE IN-BROWSER
# Superposition: pack F features into d dims, measure interference (EQ 13.2/13.3)
import numpy as np
rng = np.random.default_rng(0)
d, F, k = 8, 40, 3          # dims, features, active-per-input (L0)

D = rng.normal(0, 1, (d, F))
D /= np.linalg.norm(D, axis=0, keepdims=True)   # unit feature directions
G = D.T @ D                                       # Gram matrix of cosines
off = G[~np.eye(F, dtype=bool)]                   # off-diagonal = interference
print(f"F/d packing ratio     : {F/d:.2f}  (>1 means superposition)")
print(f"mean |cosine| off-diag: {np.abs(off).mean():.4f}")
print(f"max  |cosine| off-diag: {np.abs(off).max():.4f}")

errs = []                                          # read-out error vs sparsity
for kk in (1, 2, 4, 8):
    e = []
    for _ in range(400):
        a = np.zeros(F); idx = rng.choice(F, kk, replace=False)
        a[idx] = rng.uniform(0.5, 1.5, kk)         # kk active features
        x = D @ a                                   # superposed activation
        a_hat = D.T @ x                             # project to read each feature
        e.append(np.sqrt(np.mean((a_hat - a)**2)))
    errs.append(np.mean(e))
    print(f"L0={kk}: mean recon RMSE over {F} features = {errs[-1]:.4f}")
print("\nfewer active features -> less collision -> lower error: sparsity buys capacity.")
if "plot_xy" in globals():                          # plot_xy is the in-browser plotting helper
    plot_xy([1,2,4,8], errs)                        # error rises with density
13.4

Sparse autoencoders: dictionary learning for features

If superposition is the disease, dictionary learning is the cure, and a sparse autoencoder (SAE) is how the field runs it at scale. The idea: train a small auxiliary network to re-express each activation vector \(x\) as a sparse, non-negative combination of a large, learned dictionary of directions — exactly the form of EQ 13.1. If we force the code to be sparse, individual dictionary entries are pushed toward each capturing one concept, recovering the monosemantic features the neuron basis hid.

EQ 13.4 — SAE ENCODE / DECODE $$ \mathbf{f}(x) \;=\; \mathrm{ReLU}\!\big(W_{\text{enc}}(x - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}}\big), \qquad \hat{x} \;=\; W_{\text{dec}}\,\mathbf{f}(x) + \mathbf{b}_{\text{dec}} $$
The encoder maps an activation \(x\in\mathbb{R}^{d}\) to a much wider code \(\mathbf{f}(x)\in\mathbb{R}^{m}\) with \(m\gg d\) (the dictionary is overcomplete, typically 8×–64× wider). ReLU forces non-negativity and lets most entries be exactly zero. The decoder rebuilds \(x\) as \(\hat x = \sum_i f_i(x)\,W_{\text{dec},:,i} + \mathbf{b}_{\text{dec}}\): each column of \(W_{\text{dec}}\) is a learned feature direction \(\mathbf{d}_i\), and \(f_i(x)\) is how active that feature is. The SAE is just EQ 13.1 made trainable.
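As a shape-level illustration of EQ 13.4, here is a minimal forward pass including both biases. The parameters are random and untrained, and the sizes are arbitrary assumptions; the point is only to show how the code and the reconstruction are formed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16                                   # activation width, dictionary width (m >> d)

# Randomly initialised, untrained parameters (illustrative only).
W_enc = rng.normal(0, 0.1, (m, d))
b_enc = np.zeros(m)
W_dec = rng.normal(0, 1.0, (d, m))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)   # unit-norm columns d_i
b_dec = np.zeros(d)

def encode(x):
    """f(x) = ReLU(W_enc (x - b_dec) + b_enc)."""
    return np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)

def decode(f):
    """x_hat = W_dec f + b_dec."""
    return W_dec @ f + b_dec

x = rng.normal(size=d)
f = encode(x)
x_hat = decode(f)

# The decoder output is a weighted sum of dictionary columns plus the bias.
x_sum = b_dec + sum(f[i] * W_dec[:, i] for i in range(m))
print("code shape:", f.shape, "| active atoms:", int((f > 0).sum()))
print("column-sum form matches matrix form:", np.allclose(x_hat, x_sum))
```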

What stops the SAE from learning a trivial dense copy of its input? With \(m \ge d\) it could reconstruct \(x\) perfectly while leaving every atom active. The loss stops it. It balances two terms: reconstruct \(x\) faithfully, and keep the code sparse.

EQ 13.5 — RECONSTRUCTION + SPARSITY LOSS $$ \mathcal{L} \;=\; \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction (fidelity)}} \;+\; \lambda\,\underbrace{\lVert \mathbf{f}(x)\rVert_1}_{\text{sparsity penalty}} $$
The first term wants a perfect copy; the second, an \(L_1\) penalty on the code, wants as few active features as possible. The coefficient \(\lambda\) sets the exchange rate. \(L_1\) is a convex surrogate for the true target, the \(L_0\) "count of nonzeros": it both zeroes out weak features and shrinks the survivors (which is why modern SAEs add tricks like a TopK constraint or a JumpReLU to control the count directly without shrinking magnitudes). The number of nonzero entries per input is the SAE's L0, the headline sparsity metric.
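The shrinkage side effect of \(L_1\) is easiest to see in a one-dimensional toy. The sketch below is not how an SAE is trained; it uses the classical closed form for the non-negative problem \(\min_{f\ge 0} \tfrac12 (f-z)^2 + \lambda f\), whose solution is \(\max(0, z-\lambda)\), and compares it with a simple TopK rule. The pre-activations `z` and both hyperparameters are made up for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form minimiser of 0.5*(f - z)^2 + lam*f over f >= 0:
    weak entries go to zero, survivors are shrunk by lam."""
    return np.maximum(0.0, z - lam)

def top_k(z, k):
    """Keep the k largest non-negative entries at full magnitude, zero the rest."""
    z = np.maximum(0.0, z)
    out = np.zeros_like(z)
    keep = np.argsort(z)[-k:]
    out[keep] = z[keep]
    return out

z = np.array([0.05, 1.20, 0.10, 0.80, 0.02, 0.40])   # hypothetical pre-activations
f_l1 = soft_threshold(z, lam=0.3)
f_topk = top_k(z, k=2)

print("L1-style code:", np.round(f_l1, 2), "L0 =", np.count_nonzero(f_l1))
print("TopK code    :", np.round(f_topk, 2), "L0 =", np.count_nonzero(f_topk))
```

The \(L_1\)-style code keeps three entries but reports each one 0.3 lower than its pre-activation, while TopK fixes the count at 2 and leaves the surviving magnitudes untouched.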

Training an SAE is then ordinary gradient descent on EQ 13.5 over a large bank of cached model activations. Three engineering realities dominate the practice:

  • Dead features. Some dictionary atoms stop activating on any input and contribute nothing. A large fraction can die during training. Practitioners track the dead-feature count and revive them (resampling, auxiliary "ghost" losses, or careful initialization), because a dead atom is wasted dictionary width.
  • The width / L0 / fidelity trilemma. Wider dictionaries (\(m\) larger) and lower \(\lambda\) both buy lower reconstruction error, but lower \(\lambda\) also raises L0 (less sparse, less monosemantic), and wider dictionaries cost compute and tend to split one concept into many near-duplicate atoms. There is no single best point; you choose where on the frontier to sit.
  • Feature splitting and absorption. As you add width, a coarse feature ("a dog") fractures into finer ones ("a dog, in French", "a dog, as subject"). In the related failure called absorption, a general feature stops firing on inputs that a more specific atom has taken over. Splitting is useful sometimes and misleading at others: it means the count of "features" is partly an artifact of dictionary size, not a fact about the model.
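The dead-feature bookkeeping from the first bullet can be sketched in a few lines. The detection rule is straightforward; the revival step is a deliberately simplified stand-in for resampling, and its sampling rule, the `enc_scale` value, and the function names are all assumptions made for this illustration, not a published recipe.

```python
import numpy as np

def dead_mask(codes, eps=1e-6):
    """codes: (n_inputs, m) SAE activations. An atom is dead if it never exceeds eps."""
    return codes.max(axis=0) <= eps

def revive_dead_atoms(W_enc, W_dec, X, X_hat, dead, rng, enc_scale=0.2):
    """Simplified resampling (assumption): point each dead atom at an input
    the SAE currently reconstructs badly."""
    W_enc, W_dec = W_enc.copy(), W_dec.copy()
    err = ((X - X_hat) ** 2).sum(axis=1)
    probs = err / err.sum()                        # favour poorly reconstructed inputs
    for j in np.flatnonzero(dead):
        r = rng.choice(len(X), p=probs)
        direction = X[r] / np.linalg.norm(X[r])
        W_dec[:, j] = direction                    # unit-norm decoder column
        W_enc[j, :] = enc_scale * direction        # small encoder row so the atom fires on X[r]
    return W_enc, W_dec

rng = np.random.default_rng(0)
d, m = 4, 6
X = np.abs(rng.normal(size=(200, d)))              # toy non-negative activations
W_enc = np.abs(rng.normal(0, 0.3, (m, d)))
W_enc[[1, 4]] = -1.0                               # force atoms 1 and 4 to never fire
W_dec = rng.normal(size=(d, m))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)

codes = np.maximum(0.0, X @ W_enc.T)
dead = dead_mask(codes)
print("dead atoms before:", np.flatnonzero(dead).tolist())

W_enc2, W_dec2 = revive_dead_atoms(W_enc, W_dec, X, codes @ W_dec.T, dead, rng)
codes2 = np.maximum(0.0, X @ W_enc2.T)
print("dead atoms after :", np.flatnonzero(dead_mask(codes2)).tolist())
```

In a real training run the dead-atom check would be accumulated over many batches, and the revived atoms would then keep training like any others.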

The landmark demonstrations are Anthropic's "Towards Monosemanticity" (Bricken et al., 2023), which extracted thousands of interpretable features from a one-layer transformer, and "Scaling Monosemanticity" (Templeton et al., 2024), which pushed SAEs to a production-scale model (Claude 3 Sonnet) and found millions of features, including abstract and multilingual ones, that could be located and manipulated. Those results are the reason SAEs went from a curiosity to the field's default decomposition tool.

An SAE produces the feature code \( \mathbf{f}(x) = (0,\ 0.8,\ 0,\ 1.2,\ 0,\ 0,\ 0.4,\ 0) \) for some input. The sparsity metric L0 is the number of nonzero entries. What is the L0 of this code?
Count the strictly-nonzero entries: \(0.8,\ 1.2,\) and \(0.4\) are active; the other five are exactly zero. So \(L_0 = \) 3. A low L0 means few features explain each input — the goal of the sparsity penalty in EQ 13.5. (Note \(L_1 = 0.8+1.2+0.4 = 2.4\) is a different, magnitude-weighted quantity.)
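For completeness, the same two quantities computed with NumPy on the code from the question:

```python
import numpy as np

f = np.array([0, 0.8, 0, 1.2, 0, 0, 0.4, 0])

l0 = int(np.count_nonzero(f))     # number of active features
l1 = float(np.abs(f).sum())       # magnitude-weighted sparsity penalty of EQ 13.5

print(f"L0 = {l0}, L1 = {l1:.1f}")
```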
PYTHON · RUNNABLE IN-BROWSER
# Train a tiny SAE on synthetic sparse data by plain gradient descent (EQ 13.4/13.5)
import numpy as np
rng = np.random.default_rng(0)
d, m, k = 6, 20, 2          # input dim, dictionary width (overcomplete), true active-per-input

Dtrue = rng.normal(0, 1, (d, m)); Dtrue /= np.linalg.norm(Dtrue, axis=0, keepdims=True)
def batch(B):                                   # sparse codes -> superposed activations
    A = np.zeros((B, m))
    for r in range(B):
        idx = rng.choice(m, k, replace=False); A[r, idx] = rng.uniform(.5, 1.5, k)
    return A @ Dtrue.T, A                        # x = D a  (shape B x d), and true code

Wenc = rng.normal(0, .1, (m, d)); Wdec = rng.normal(0, .1, (d, m)); lr, lam = .05, .04
for step in range(600):
    x, _ = batch(128)
    f = np.maximum(0, x @ Wenc.T)               # encode: ReLU code (EQ 13.4)
    xh = f @ Wdec.T                             # decode
    e = xh - x
    gWdec = e.T @ f / len(x)
    gf = (e @ Wdec) * (f > 0) + lam * np.sign(f) * (f > 0)   # recon + L1 grads (EQ 13.5)
    gWenc = gf.T @ x / len(x)
    Wdec -= lr * gWdec; Wenc -= lr * gWenc
    Wdec /= np.linalg.norm(Wdec, axis=0, keepdims=True)      # keep atoms unit-norm

x, _ = batch(2000); f = np.maximum(0, x @ Wenc.T); xh = f @ Wdec.T
recon = np.sqrt(np.mean((x - xh)**2)) / np.sqrt(np.mean(x**2))
L0 = (f > 1e-3).sum(1).mean()
dead = int((f.max(0) < 1e-3).sum())
print(f"relative reconstruction error : {recon:.3f}")
print(f"mean L0 (active feats / input): {L0:.2f}   (true sparsity k = {k})")
print(f"dead dictionary atoms         : {dead} / {m}")
print("the SAE learns a sparse code that rebuilds x while firing only a few atoms.")
INSTRUMENT 13.2 — SAE SPARSITY–FIDELITY TRADE-OFF · DICTIONARY WIDTH & L1 WEIGHT vs L0 / RECON · EQ 13.5
[Interactive figure. Readouts: recon error, L0 (active / input), monosemanticity.]
The Pareto curve the field actually navigates. Raise λ: the code gets sparser (L0 falls, features more monosemantic) but reconstruction error climbs — you are dropping real signal. Lower λ toward 0: near-perfect reconstruction but a dense, polysemantic code that has explained nothing. Widen the dictionary: the whole curve shifts down (better fidelity at a given L0), at compute cost and the risk of feature splitting. The dot is your current operating point; there is no free corner.
13.5

Circuits, motifs & causal methods

Features are nouns; circuits are the sentences. A circuit is a set of features wired together by the model's weights to compute something. The most thoroughly reverse-engineered example is the induction head (Chapter 03), a two-head circuit that implements in-context copying: a "previous-token" head writes each token's predecessor into the residual stream, and a second head then attends from the current token to the position after its earlier copy, predicting "what came next last time." Olsson et al. (2022) tied the abrupt appearance of induction heads during training to a sharp drop in training loss and to the point at which models acquire in-context learning, a rare case where a capability, a circuit, and a loss-curve kink were linked together.
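The input-output behavior that an induction circuit implements can be written down as a plain function. This is a sketch of the algorithm only ("find the earlier copy of the current token, return what followed it"); it says nothing about how attention heads realize it, and the function name and example sentence are invented for the illustration.

```python
def induction_predict(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current (last) token. Returns None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final one.
    for pos in range(len(tokens) - 2, -1, -1):
        if tokens[pos] == current:
            return tokens[pos + 1]      # "what came next last time"
    return None

prompt = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(prompt))
print(induction_predict(["a", "b", "c"]))
```

The first call copies the token that followed the earlier "Mr"; the second has no earlier copy to consult, so the rule makes no prediction.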

Two cheap tools are the everyday workhorses of circuit-finding:

  • The logit lens. Apply the model's final unembedding to the residual stream at intermediate layers, as if you had stopped early. This reveals the model's "current best guess" forming layer by layer, and where in depth a prediction crystallizes. It is approximate (the residual stream is not calibrated to be decoded early), but it is a fast first read on where a computation happens.
  • Ablations. Zero out (or mean-out) a component (a head, a neuron, an SAE feature) and measure how much the output changes. A large change implicates the component; a negligible one suggests the behavior does not depend on it. Ablation is a causal test of necessity, not of sufficiency: it shows that the behavior needs the component, not that the component alone carries the computation.
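Both tools can be tried on a toy residual stream. Everything in the sketch below is an assumption made for illustration: the "model" is just a sum of per-layer update vectors, the unembedding `W_U` has orthonormal rows, and layer 2 is constructed by hand to write the answer token.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, n_layers, target = 16, 10, 4, 7

# Toy unembedding with orthonormal rows (an assumption that keeps the toy clean).
Q, _ = np.linalg.qr(rng.normal(size=(d, vocab)))
W_U = Q.T                                         # shape (vocab, d)

embed = 0.1 * rng.normal(size=d)
updates = [0.1 * rng.normal(size=d) for _ in range(n_layers)]
updates[2] = updates[2] + 3.0 * W_U[target]       # layer 2 writes the answer direction

def final_logit(skip=None):
    """Target-token logit after all layers, optionally zero-ablating one layer's update."""
    resid = embed.copy()
    for layer, u in enumerate(updates):
        if layer != skip:
            resid = resid + u
    return float((W_U @ resid)[target])

# Logit lens: decode the residual stream after every layer with the final unembedding.
resid = embed.copy()
lens_guess = []
for layer, u in enumerate(updates):
    resid = resid + u
    lens_guess.append(int(np.argmax(W_U @ resid)))
print("logit-lens top token per layer:", lens_guess, "(target =", target, ")")

# Zero-ablation: remove one layer's contribution and see how the target logit moves.
base = final_logit()
ablation_effect = [final_logit(skip=layer) - base for layer in range(n_layers)]
for layer, eff in enumerate(ablation_effect):
    print(f"ablate layer {layer}: change in target logit = {eff:+.3f}")
```

The lens shows the target token taking over from layer 2 onward, and ablating layer 2 is the only intervention that removes most of the target logit, so the two tools agree on where the toy computation happens.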

The gold standard, though, is activation patching (also called causal tracing or interchange intervention). It isolates which internal location causes a behavior by transplanting activations between two runs.

EQ 13.6 — ACTIVATION PATCHING (CAUSAL TRACING) $$ \Delta_{\ell,p} \;=\; M\!\big(x_{\text{clean}}\,\big|\, h^{(\ell)}_{p} \!\leftarrow\! h^{(\ell)}_{p}(x_{\text{corrupt}})\big) \;-\; M\!\big(x_{\text{clean}}\big) $$
Run the model on a clean prompt and on a corrupted one (e.g. a key fact swapped). Then re-run the clean prompt but patch in the corrupted run's activation \(h\) at one location (layer \(\ell\), position \(p\)) and measure how much the output metric \(M\) (say, the logit of the correct token) moves. A large \(|\Delta|\) means that single location carries the information that decides the answer. Sweeping \((\ell,p)\) draws a causal map of where a fact or computation lives. Meng et al. (2022, ROME) used the mirror-image version of this procedure, restoring clean activations into a corrupted run, to localize factual associations to mid-layer MLPs, then edited them.

Patching is causal because it asks a counterfactual: had this one location held the other run's value, would the answer change? That is stronger than ablation (which only asks "does removing it hurt") and far stronger than attention-weight inspection (which asks nothing causal at all). The same logic, applied to SAE features instead of raw activations, is how the field tests whether a discovered feature is a genuine causal variable or merely a correlate.

In an activation-patching run, the clean prompt gives the correct token a logit of \(2.1\). After patching a corrupted activation into layer 8, position 5, the correct-token logit becomes \(6.3\). What is the patching effect \(\Delta_{8,5}\) (EQ 13.6) at that location?
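\(\Delta = \text{patched} - \text{clean} = 6.3 - 2.1 = \) 4.2. What matters is the magnitude: a large \(|\Delta|\) means this single location carries decisive information for the prediction, the kind of hotspot a causal-tracing sweep is built to find. The sign says which way the patched value pushes; under EQ 13.6 a corruption that destroys the answer would show up as a negative \(\Delta\), whereas here the transplanted activation raised the correct-token logit. Locations with \(\Delta \approx 0\) are causally irrelevant to this behavior.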
PYTHON · RUNNABLE IN-BROWSER
# Activation-patching toy: swap one hidden coordinate, measure output change (EQ 13.6)
import numpy as np
rng = np.random.default_rng(1)
d = 8
Wout = rng.normal(0, 1, d)                       # toy readout: logit = Wout . h

def hidden(prompt_seed, fact):                    # a toy "model": fact lives in coord 3
    rng2 = np.random.default_rng(prompt_seed)
    h = rng2.normal(0, 0.3, d)                     # generic context
    h[3] = 2.5 * fact                              # the decisive coordinate
    return h

h_clean   = hidden(7, fact=+1)                     # clean run: correct fact
h_corrupt = hidden(7, fact=-1)                     # corrupted run: wrong fact
base = Wout @ h_clean
print(f"clean logit baseline: {base:+.3f}\n")

print(" patch coord | logit after patch |  delta")
for c in range(d):
    h = h_clean.copy()
    h[c] = h_corrupt[c]                            # patch ONE coordinate from corrupt run
    delta = (Wout @ h) - base
    flag = "  <-- decisive" if abs(delta) > 1.0 else ""
    print(f"     {c:2d}     |      {Wout @ h:+7.3f}     | {delta:+7.3f}{flag}")
print("\nonly the coordinate that stores the fact (coord 3) moves the logit -- causal tracing")
print("localizes the computation to that one place, the same idea ROME applies to real MLPs.")
13.6

Feature steering, and honest limits

Once you have a feature direction and a causal test that it matters, you can steer: add the feature's decoder vector to the residual stream (or clamp its SAE activation to a chosen value) and watch the behavior change. Templeton et al. (2024) made this concrete and a little famous — clamping a "Golden Gate Bridge" feature high made the model fixate on the bridge across unrelated prompts. More usefully, steering can amplify or suppress features tied to safety-relevant behavior, offering a knob that operates on concepts rather than tokens.

EQ 13.7 — FEATURE STEERING / CLAMPING $$ x' \;=\; x \;+\; \alpha\,\mathbf{d}_i, \qquad \text{or clamp } f_i(x) \leftarrow c \;\Rightarrow\; x' = x + \big(c - f_i(x)\big)\,\mathbf{d}_i $$
Adding \(\alpha\) units of feature \(i\)'s direction \(\mathbf{d}_i\) to the activation pushes the model toward (or, with \(\alpha<0\), away from) that concept. Clamping fixes the feature's activation to a target \(c\) regardless of input. In the idealized picture of EQ 13.2, the same direction that reads the feature also writes it (in a trained SAE the encoder row that reads a feature and the decoder column that writes it are learned separately and need not coincide), and the linear representation hypothesis (EQ 13.1) is what makes steering possible at all. Steering is among the strongest evidence that a feature is real: if writing the direction reliably produces the concept, the direction is doing causal work, not just correlating.
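Both forms of EQ 13.7 are one-liners once a decoder direction is in hand. The sketch below uses a random unit-norm dictionary and, as a stand-in for a real SAE encoder, reads the feature by projection followed by ReLU; the sizes, the feature index, and the values of \(\alpha\) and \(c\) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, i = 8, 32, 5                       # widths and the (arbitrary) feature to steer

W_dec = rng.normal(size=(d, m))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
d_i = W_dec[:, i]                        # unit decoder direction of feature i

def steer(x, alpha):
    """x' = x + alpha * d_i."""
    return x + alpha * d_i

def clamp(x, f_i, c):
    """x' = x + (c - f_i) * d_i, given the feature's current activation f_i."""
    return x + (c - f_i) * d_i

x = rng.normal(size=d)
f_i = max(0.0, float(d_i @ x))           # stand-in read-out (assumption, not a trained encoder)

x_steer = steer(x, alpha=4.0)
x_clamp = clamp(x, f_i, c=10.0)

print(f"projection before      : {d_i @ x:+.3f}")
print(f"after steering (+4)    : {d_i @ x_steer:+.3f}")
print(f"after clamping to c=10 : {d_i @ x_clamp:+.3f}")
```

Because \(\mathbf{d}_i\) has unit norm, steering moves the projection onto the feature direction by exactly \(\alpha\), and clamping moves it by exactly \(c - f_i(x)\); how the rest of the model responds downstream is what a real steering experiment measures.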

Now the honest part, because this is a young field and overclaiming is the main hazard:

  • SAE features are not ground truth. They are the output of a particular SAE with a particular width and \(\lambda\). Change those and you get different features. "The model has \(N\) features" is shorthand for "this SAE found \(N\) atoms," not a discovered constant of the network.
  • Evaluation is genuinely hard. There is no gold label for "is this feature monosemantic." Proxies (auto-interpretation scores from an LLM judge, activation sparsity, steering specificity) all have failure modes. A feature can look clean on its top activations yet fire spuriously on the long tail.
  • Reconstruction is incomplete. Even good SAEs leave a residual; the part of the activation they fail to reconstruct may contain exactly the computation you care about. Recent work also finds SAEs can miss features that probing finds, and can introduce artifacts (feature absorption, composition) that are properties of the SAE, not the model.
  • Scaling cost. Training SAEs for every layer of a frontier model, wide enough to resolve rare features, is a major compute and storage undertaking — and the dictionary needed seems to grow with model scale.
  • Circuits remain mostly hand-found. Fully reverse-engineering a nontrivial behavior end-to-end is still rare, labor-intensive, and validated case by case. We can read fragments of the algorithm, not yet the whole program.

The honest scorecard. Mechanistic interpretability has moved from "neurons are confusing" to a working pipeline: superposition explains the confusion, SAEs decompose it into features, and causal methods (patching, ablation, steering) test those features for real. That is genuine progress, and at frontier scale it has produced steerable, human-legible concepts that did not exist as tools two years ago. What it has not done is deliver a complete, verified account of any large model's behavior, or a feature set anyone calls canonical. Treat extracted features as useful, falsifiable hypotheses — powerful when they pass a causal test, provisional always.

NEXT

You can now read a model from the outside in: features inside neurons, circuits inside weights, and causal tests that keep the story honest. The capstone assembles the whole volume into one end-to-end picture — how a token becomes an embedding, mixes through attention or a state-space scan, is trained, aligned, fine-tuned, compressed, scaled at inference, and finally made legible by the tools of this chapter — and shows where each idea lives in a real 2026 deployment.

13.R

References

  1. Elhage, N. et al. (2022). Toy Models of Superposition. Anthropic / Transformer Circuits — the superposition hypothesis and feature geometry behind EQ 13.2–13.3.
  2. Bricken, T. et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Anthropic Transformer Circuits — the SAE recipe of EQ 13.4–13.5 on a one-layer model.
  3. Templeton, A. et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic Transformer Circuits — production-scale SAEs and feature steering (§13.4, §13.6).
  4. Olsson, C. et al. (2022). In-context Learning and Induction Heads. Anthropic / Transformer Circuits — the induction-head circuit and its link to in-context learning (§13.5).
  5. Meng, K., Bau, D., Andonian, A. & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT (ROME). NeurIPS 2022 — causal tracing / activation patching of EQ 13.6, then weight editing.