THE LLM FIELD MANUAL · CHAPTER 11

The 2026 Frontier — State-Space Models & Extreme Quantization

For eight years the Transformer had no serious rival. That changed. State-space models now approach attention's quality on many tasks at linear cost, and post-training quantization is pushing useful models below four bits per weight. This chapter maps the two pressures squeezing the Transformer from opposite ends, a cheaper way to mix tokens and a cheaper way to store weights, and is candid about where the contest is settled versus merely promising.

LEVEL: ADVANCED | READING TIME: ≈ 28 MIN | BUILDS ON: CH 03 · CH 07 · CH 09 | INSTRUMENTS: SCALING · BIT-WIDTH · ARCH MATRIX
11.1

The post-Transformer landscape (2026)

The Transformer (Chapter 02) won because attention is parallel in training and expressive at any range. Its one structural flaw never went away: self-attention compares every token to every other token, so compute and the score matrix both grow as \(O(n^2)\) in sequence length \(n\). The KV cache (Chapter 03) then turns inference memory linear in \(n\) and unbounded over a conversation. Every long-context technique of the last few years — FlashAttention, GQA, sliding windows, MLA — is a tax cut on a fundamentally quadratic object, not a repeal of it.

By 2026 two distinct lines of attack have matured enough to ship in frontier-scale models:

  • Sub-quadratic sequence mixers. Replace softmax attention with an operator that costs \(O(n)\): selective state-space models (Mamba, Mamba-2) and modern linear attention. These carry a fixed-size recurrent state instead of a growing cache, so memory at decode time is \(O(1)\) per layer regardless of context length.
  • Extreme weight compression. Push the bits-per-weight of a frozen model down past the 4-bit floor that Chapter 07 treated as practical — to 3, 2.x, even ~1.58 effective bits — with rotation/incoherence preprocessing and learned codebooks that keep quality close to the 16-bit original.

These are orthogonal: the first attacks how tokens talk to each other, the second attacks how a weight is stored. The 2026 stack increasingly uses both — a quantized hybrid model is now an ordinary deployment. The honest caveat up front: the Transformer is contested, not dethroned. Pure-attention models still hold the top of most reasoning and recall leaderboards; SSMs win decisively on long-context throughput and memory, and the strongest shipping designs are hybrids that keep a few attention layers for the tasks attention is uniquely good at.

EQ 11.1 — THE QUADRATIC THAT STARTED IT $$ \underbrace{C_{\text{attn}}(n) = \Theta(n^2 d)}_{\text{compute}}, \qquad \underbrace{M_{\text{attn}}(n) = \Theta(n)\;\text{(KV cache)}}_{\text{decode memory}} \quad\text{vs.}\quad \underbrace{C_{\text{ssm}}(n) = \Theta(n\, d\, N)}_{\text{compute}}, \quad \underbrace{M_{\text{ssm}}(n) = \Theta(1)}_{\text{decode memory}} $$
\(d\) is model width, \(N\) the SSM state size (typically 16–128, a constant). Attention's compute is quadratic in \(n\); an SSM's is linear. At decode time attention's per-step cost grows with the cache it must re-read, while an SSM folds the entire past into a fixed-size state and pays the same per token forever. The whole chapter lives in the gap between \(n^2\) and \(n\).
A pure-attention layer costs \( \Theta(n^2 d) \). If you double the sequence length \(n\) (keeping \(d\) fixed), by what factor does its compute grow?
Compute scales as \(n^2\). Doubling \(n\) multiplies cost by \(2^2 = 4\), so the answer is 4×. A linear-time SSM, by contrast, would only get \(2^1 = 2\)× more expensive; that growing gap is the entire motivation for sub-quadratic mixers.
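The decode-memory half of EQ 11.1 can be made concrete with a small sketch. The layer count, width, state size, and 2-byte values below are hypothetical, and the KV formula assumes plain multi-head attention with no GQA or MLA savings; only the trend (linear growth versus a constant) is the point.

```python
# Decode-time memory: growing KV cache vs fixed SSM state (EQ 11.1).
# All sizes are hypothetical, chosen only to make the trend visible.

def kv_cache_bytes(n_tokens, n_layers, d_model, bytes_per_value=2):
    """Keys + values for every past token in every layer: Theta(n)."""
    return 2 * n_layers * n_tokens * d_model * bytes_per_value

def ssm_state_bytes(n_layers, d_model, state_size, bytes_per_value=2):
    """One (d_model x state_size) state per layer, independent of n: Theta(1)."""
    return n_layers * d_model * state_size * bytes_per_value

n_layers, d_model, state_size = 32, 4096, 16
for n in (1_000, 32_000, 1_000_000):
    kv = kv_cache_bytes(n, n_layers, d_model)
    ssm = ssm_state_bytes(n_layers, d_model, state_size)
    print(f"n={n:>9,}  KV cache {kv / 1e9:8.2f} GB   SSM state {ssm / 1e6:6.2f} MB")
```

The KV column grows in proportion to the context while the SSM column never changes, which is the memory argument the rest of the chapter builds on.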
11.2

State-space models & SSD (Mamba-2)

A state-space model is a linear recurrence borrowed from control theory. It carries a hidden state \(h_t \in \mathbb{R}^{N}\) that summarizes everything seen so far, updates it from the current input \(x_t\), and reads an output \(y_t\) off it:

EQ 11.2 — DISCRETE STATE-SPACE RECURRENCE $$ h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t $$
\(\bar{A} \in \mathbb{R}^{N\times N}\) is the state-transition matrix (how the past decays and rotates), \(\bar{B}\) writes the new input into state, \(C\) reads the output. The bar denotes discretization: a step size \(\Delta\) turns a continuous-time system into this per-token update, e.g. \(\bar{A} = \exp(\Delta A)\). Because the state has a fixed dimension \(N\), it never grows with the sequence: it is a constant-size running summary, so decode memory is \(O(1)\) (EQ 11.1). The cost: a classical SSM uses the same \(\bar{A},\bar{B},C\) for every token, so it cannot choose what to remember.

That last sentence is the whole reason early SSMs (S4) lost to Transformers on language. Attention is content-aware — it routes based on what the tokens actually say — while a fixed recurrence treats every token identically. Mamba's contribution was to make the SSM selective: let \(\bar{B}\), \(C\), and the step \(\Delta\) be functions of the input. Now the model can decide, per token, how much to write, how much to read, and how fast to forget — a learnable gate that lets it ignore filler and latch onto salient tokens, recovering much of attention's selectivity at linear cost.
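For contrast with the selective variant, here is a minimal sketch of the fixed recurrence of EQ 11.2 on a scalar input stream. It assumes a diagonal \(A\) (so \(\exp(\Delta A)\) is elementwise) and a simplified Euler-style \(\bar{B} = \Delta B\); the names `ssm_scan`, `delta`, and the random parameters are illustrative only.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a scalar sequence."""
    h = np.zeros(A_bar.shape[0])           # fixed-size state, never grows
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t        # same A_bar, B_bar for every token
        ys.append(C @ h)
    return np.array(ys), h

rng = np.random.default_rng(0)
N, delta = 4, 0.1
A = -np.diag(rng.uniform(0.5, 2.0, N))          # negative diagonal: stable dynamics
A_bar = np.diag(np.exp(delta * np.diag(A)))     # exp(delta * A) for diagonal A
B_bar = delta * rng.normal(size=N)              # assumption: Euler-style input discretization
C = rng.normal(size=N)

x = rng.normal(size=1000)
y, h_final = ssm_scan(x, A_bar, B_bar, C)
print(y.shape, h_final.shape)                   # 1000 outputs, one N-dimensional state
```

However long `x` is, the only thing carried between steps is the \(N\)-vector `h`, and every token is written with the same parameters.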

EQ 11.3 — SELECTION: INPUT-DEPENDENT PARAMETERS $$ \bar{B}_t = s_B(x_t), \quad C_t = s_C(x_t), \quad \Delta_t = \mathrm{softplus}\big(s_\Delta(x_t)\big) \;\Longrightarrow\; h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t $$
The projections \(s_B, s_C, s_\Delta\) are small learned linear maps of the current token, and \(\bar{A}_t = \exp(\Delta_t A)\) inherits its input-dependence from \(\Delta_t\) (the discretization of EQ 11.2). A large \(\Delta_t\) means "this token matters: overwrite state and reset the clock"; a small \(\Delta_t\) means "skip it, let the state coast". This input-dependence breaks the convolutional shortcut older SSMs relied on, so Mamba uses a hardware-aware parallel scan (a prefix-sum over the sequence) to stay fast on GPUs without ever materializing an \(n\times n\) matrix.
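The selection mechanism of EQ 11.3 can be sketched as a plain sequential loop. This is a reference for the arithmetic only, not the hardware-aware parallel scan; it assumes a diagonal \(A\), a single scalar \(\Delta_t\) shared across channels, and the simplified input term \(\Delta_t B_t\). The projection names `W_B`, `W_C`, `w_delta` are hypothetical stand-ins for \(s_B, s_C, s_\Delta\).

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_scan(x, A, W_B, W_C, w_delta):
    """Sequential selective SSM over inputs x of shape (n, d); A is a diagonal of shape (N,)."""
    n, d = x.shape
    N = A.shape[0]
    h = np.zeros((d, N))                          # one N-dim state per channel
    y = np.zeros((n, d))
    for t in range(n):
        delta_t = softplus(x[t] @ w_delta)        # step size chosen by the token
        A_bar_t = np.exp(delta_t * A)             # per-step decay, shape (N,)
        B_t = x[t] @ W_B                          # what to write, shape (N,)
        C_t = x[t] @ W_C                          # what to read, shape (N,)
        h = A_bar_t * h + delta_t * np.outer(x[t], B_t)
        y[t] = h @ C_t
    return y

rng = np.random.default_rng(0)
n, d, N = 32, 8, 4
x = rng.normal(size=(n, d))
A = -rng.uniform(0.5, 2.0, N)                     # negative diagonal keeps the state stable
W_B = rng.normal(size=(d, N)) / np.sqrt(d)
W_C = rng.normal(size=(d, N)) / np.sqrt(d)
w_delta = rng.normal(size=d) / np.sqrt(d)

y = selective_scan(x, A, W_B, W_C, w_delta)
print(y.shape)
```

Because `A` is negative, a large `delta_t` drives `A_bar_t` toward zero (forget the past) while scaling up the write; a small `delta_t` leaves `A_bar_t` near one and writes almost nothing.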

Mamba-2 reframed all of this with one structural result: state-space duality (SSD). A selective SSM with a scalar-times-identity \(\bar{A}\) is mathematically equivalent to a particular structured masked attention — a 1-semiseparable matrix transform. The recurrence and an attention-like matmul compute the same function by two different routes, with two different costs:

EQ 11.4 — SSD: TWO DUAL FORMS OF ONE OPERATOR $$ y = \underbrace{\big(L \circ (C B^{\top})\big)\, x}_{\text{quadratic / "attention" form: } O(n^2)} \;=\; \underbrace{\textstyle\sum \text{(linear scan over states)}}_{\text{linear / "recurrent" form: } O(n N)} $$
\(L\) is a lower-triangular causal mask whose entries are the cumulative products of the per-step decays. The quadratic form is great for training: it's a big matmul that saturates tensor cores. The linear form is great for inference: \(O(1)\) state per step. Mamba-2 uses both: training splits the sequence into chunks, applies the quadratic form inside each chunk and the linear recurrence to carry state between chunks, and decoding runs the pure linear form. The duality is also why "Transformers are SSMs": attention is a special, more expensive point on the same spectrum.
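The two forms in EQ 11.4 can be checked against each other numerically. The sketch below assumes a single input channel and a scalar decay `a[t]` per step (the scalar-times-identity case); sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 64, 8
x = rng.normal(size=n)                  # one input channel
a = rng.uniform(0.8, 1.0, size=n)       # per-step scalar decay
B = rng.normal(size=(n, N))             # row t is B_t
C = rng.normal(size=(n, N))             # row t is C_t

# Linear / recurrent form: O(n N) work, O(N) state.
h = np.zeros(N)
y_rec = np.zeros(n)
for t in range(n):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# Quadratic / attention-like form: materialize the masked n x n matrix.
log_cum = np.cumsum(np.log(a))
# L[t, s] = a[s+1] * ... * a[t] for s <= t, zero above the diagonal.
L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
y_quad = (L * (C @ B.T)) @ x

print(np.max(np.abs(y_rec - y_quad)))   # tiny: both routes compute the same function
```

The recurrent loop never builds anything larger than an \(N\)-vector, while the quadratic route is a pair of dense matmuls; which one is cheaper depends on \(n\) and on the hardware.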

Practically: Mamba-2 is reported to match or beat a same-size Transformer on language modeling perplexity and on many downstream tasks at the scales published so far, trains efficiently because of the matmul-friendly dual form, and decodes with constant memory. Where it still loses is precise, long-range recall (copying an exact phrase or table value from far back, or in-context retrieval of a specific token), because a fixed \(N\)-dimensional state is a lossy summary of an arbitrarily long past, whereas a KV cache keeps every key verbatim. That asymmetry is exactly what motivates §11.3's hybrids.

True or false: a selective (linear-time) state-space model's compute scales with sequence length \(n\) as \(O(n)\) — linear — rather than the \(O(n^2)\) of softmax attention. (Enter true or false.)
The recurrence in EQ 11.2 processes each of the \(n\) tokens once, doing \(O(dN)\) work per token with \(N\) a fixed constant — total \(O(n\,d\,N)\), which is linear in \(n\). The SSD linear form (EQ 11.4) confirms it. Softmax attention's all-pairs score matrix is \(O(n^2)\). So the statement is true.
PYTHON · RUNNABLE IN-BROWSER
# O(n^2) attention vs O(n) SSM: cost as context grows (EQ 11.1 / 11.4)
import numpy as np
d, N = 4096, 16                          # model width, SSM state size
ns = np.array([512, 2048, 8192, 32768, 131072, 1048576])

attn = ns.astype(float)**2 * d           # all-pairs scores ~ n^2 d
ssm  = ns.astype(float) * d * N          # linear scan        ~ n d N
ratio = attn / ssm                       # how many x more work attention does

print(f"{'context n':>10} {'attn FLOPs':>12} {'ssm FLOPs':>12} {'attn/ssm':>10}")
for n, a, s, r in zip(ns, attn, ssm, ratio):
    print(f"{n:>10,} {a:>12.2e} {s:>12.2e} {r:>10,.0f}x")

print("\ndoubling n quadruples attention but only doubles the SSM;")
print(f"at 1M tokens attention does {ratio[-1]:,.0f}x the work of the scan.")
plot_xy(np.log2(ns), np.log2(ratio))     # log-log: the gap is a rising line
INSTRUMENT 11.1 — SCALING EXPLORER: O(n²) vs O(n) · COMPUTE & DECODE MEMORY · EQ 11.1 [interactive; readouts: ATTENTION COMPUTE (n²d), SSM COMPUTE (n·d·N), ATTN / SSM WORK]
Both axes are log-scaled. Drag context right: the attention curve climbs twice as steeply as the SSM curve because of the extra factor of \(n\). The work ratio is \(n/N\), so at 1M tokens the SSM does thousands to tens of thousands of times less work (about 65,000× at \(N = 16\), about 8,000× at \(N = 128\)), and, unlike attention, its decode-time state stays at a fixed size (\(N\) numbers per channel) no matter how long the context grows.
11.3

Linear & hybrid attention at scale

SSMs are one route to \(O(n)\); linear attention is the other, and SSD (EQ 11.4) shows they are close cousins. Standard attention computes \(\mathrm{softmax}(QK^\top)V\), and the softmax is what forces you to build the \(n\times n\) matrix first. Replace the softmax with a feature map \(\phi(\cdot)\) so the similarity factorizes, and associativity lets you reorder the matmuls:

EQ 11.5 — LINEAR ATTENTION BY REASSOCIATION $$ \mathrm{Attn}(Q,K,V)_i = \frac{\phi(q_i)^{\top} \sum_{j\le i} \phi(k_j)\, v_j^{\top}}{\phi(q_i)^{\top} \sum_{j\le i} \phi(k_j)} \;=\; \frac{\phi(q_i)^{\top} S_i}{\phi(q_i)^{\top} z_i} $$
Instead of \((QK^\top)V\) — an \(n\times n\) matrix — compute \(Q(K^\top V)\): a \(d\times d\) running sum. \(S_i = \sum_{j\le i}\phi(k_j)v_j^\top\) and \(z_i = \sum_{j\le i}\phi(k_j)\) are a fixed-size state updated token by token, exactly like an SSM's \(h_t\). This is \(O(n d^2)\) — linear in \(n\). The price: dropping the softmax removes its sharp, content-addressable selectivity, so naive linear attention historically underperforms at fine-grained recall. Modern variants (gated linear attention, DeltaNet, RWKV-7, RetNet's decay) add forgetting gates and delta-rule updates to claw most of it back.
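The reassociation in EQ 11.5 can be verified directly: the running-state form and the explicit causal \(n\times n\) form should agree. The sketch assumes the feature map \(\phi(u) = \mathrm{elu}(u) + 1\) from the linear-attention paper cited in the references; shapes and random inputs are arbitrary.

```python
import numpy as np

def phi(u):
    """Positive feature map elu(u) + 1."""
    return np.where(u > 0, u + 1.0, np.exp(u))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention with a fixed-size state (S, z): O(n d^2)."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))       # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                     # running sum of phi(k_j)
    out = np.zeros_like(V)
    for i in range(n):
        fk, fq = phi(K[i]), phi(Q[i])
        S += np.outer(fk, V[i])
        z += fk
        out[i] = (fq @ S) / (fq @ z)
    return out

def linear_attention_quadratic(Q, K, V):
    """Same function via the explicit causal n x n similarity matrix: O(n^2 d)."""
    scores = np.tril(phi(Q) @ phi(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_v = 16, 4, 3
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
out_rec = linear_attention_recurrent(Q, K, V)
out_quad = linear_attention_quadratic(Q, K, V)
print(np.max(np.abs(out_rec - out_quad)))
```

The recurrent version keeps only `S` and `z` between tokens, so its decode memory does not depend on how many tokens came before.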

The decisive engineering insight of 2024–2026 is that you do not have to choose. The single missing capability of every linear mixer — exact long-range lookup — is precisely what attention is best at and cheapest to use sparingly. So the strongest sub-quadratic models are hybrids: mostly Mamba/linear layers for the bulk of token mixing, with a thin interleaving of full-attention layers (often combined with sliding-window attention) to handle copying and retrieval.

  • Jamba (AI21, 2024) interleaves Mamba and Transformer blocks with a mixture-of-experts MLP, shipping a 256K-context production model whose KV cache is a fraction of a same-size pure Transformer's.
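  • Zamba, Samba, and Griffin-style designs pair a recurrent or SSM backbone with a small amount of attention (sliding-window or local attention in Samba and Griffin, a shared attention block in Zamba); Griffin's recurrence (RG-LRU) is a close relative of the selective SSM.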
  • Ratio of attention layers matters: empirically a small fraction — often roughly 1 attention layer per 5–7 SSM layers — recovers nearly all of a full Transformer's recall while keeping most of the linear-cost savings. The exact ratio is an open, model-specific tuning question, not a solved constant.
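To make the ratio in the last bullet concrete, here is a hypothetical layer-schedule helper. It is not any published model's layout: `hybrid_schedule` and its evenly spaced placement are assumptions for illustration, and the KV-cache fraction assumes every attention layer caches the same amount.

```python
def hybrid_schedule(n_layers, ssm_per_attention):
    """Place one attention layer after every `ssm_per_attention` SSM layers."""
    period = ssm_per_attention + 1
    return ["attn" if (i + 1) % period == 0 else "ssm" for i in range(n_layers)]

layers = hybrid_schedule(n_layers=32, ssm_per_attention=7)
n_attn = layers.count("attn")
kv_fraction = n_attn / len(layers)      # share of layers that keep a KV cache

print(layers[:8])
print(f"{n_attn} attention layers of {len(layers)}; "
      f"KV cache is {kv_fraction:.1%} of an all-attention stack")
```

Changing `ssm_per_attention` from 7 to 5 shows how quickly the cache share moves with the ratio, which is why the ratio is treated as a tuning knob rather than a constant.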

Honest status. Hybrids are the current sweet spot, but "how few attention layers can you get away with" is genuinely contested and depends on the task mix; recall-heavy and tool-use workloads want more attention, long-document summarization wants less. No published hybrid has yet displaced the best pure Transformers at the very top of frontier reasoning evals — the win is on cost and context length, not (yet) on peak capability.

INSTRUMENT 11.2 — ARCHITECTURE COMPARISON MATRIX · TRANSFORMER · MAMBA · HYBRID [interactive; rows: TOKEN-MIX COST, DECODE MEMORY, EXACT RECALL]
Toggle the three families and read the trade-off across rows. The Transformer pays \(O(n^2)\) compute and a growing KV cache for perfect recall; Mamba-2 buys \(O(n)\) compute and \(O(1)\) memory at the cost of lossy recall; the hybrid keeps a few attention layers to recover recall while staying mostly linear. There is no free lunch — pick which axis you can least afford to lose.
11.4

Extreme quantization — TurboQuant & sub-4-bit

The second pressure is storage. Chapter 07 established quantization as the deployment unit — bits per weight — and treated 4-bit (GPTQ, AWQ, NF4) as the practical floor. By 2026 that floor has moved. The reason it can move is the same statistical fact NF4 exploited: trained weights are roughly Gaussian and highly compressible, so the bits you spend should match where the information actually is.

The model that explains the trade-off is simple. A \(b\)-bit quantizer has \(2^b\) levels. Spread them over a value range and the spacing — hence the rounding error — shrinks geometrically as you add bits:

EQ 11.6 — QUANTIZATION ERROR vs BIT-WIDTH $$ \text{levels} = 2^b, \quad \text{step } \Delta = \frac{R}{2^b - 1}, \quad \mathrm{RMSE} \approx \frac{\Delta}{\sqrt{12}} \;\propto\; 2^{-b}, \qquad \text{size} = \tfrac{b}{8}\ \text{bytes/param} $$
\(R\) is the (clipped) range of the weights. For a uniform quantizer the rounding error is uniform on \([-\Delta/2,\,\Delta/2]\), whose RMS is \(\Delta/\sqrt{12}\). The headline: each extra bit halves the error but only adds \(\tfrac18\) byte. Going 16→8→4 bits is nearly free in quality; the pain starts below 4, where halving error is no longer enough to hide the few large-magnitude outlier weights that dominate the loss. Beating EQ 11.6 below 4 bits requires non-uniform codebooks and preprocessing, not just smaller steps.

Three ideas, stacked, are what make sub-4-bit work:

  • Error-aware rounding (GPTQ). Don't round each weight independently. Quantize column by column and, using second-order (Hessian) information from a calibration set, push the rounding error of each weight into the weights not yet quantized — so the layer's output error, not the per-weight error, is what's minimized.
  • Incoherence processing & rotation (QuIP#, QuaRot, SpinQuant). Multiply weights (and activations) by random (or, in SpinQuant, learned) orthogonal/Hadamard rotations. A rotation preserves the matmul but spreads each outlier's magnitude across many coordinates, so no single entry dominates the range and the values become far easier to quantize. QuIP# adds a lattice (E8) vector codebook on top, reaching ~2 bits with surprisingly little loss.
  • Fast unbiased rounding (TurboQuant). The 2025 line of work makes the rotation/quantization step itself near-optimal and cheap: the quantizers are data-oblivious, their distortion is close to the information-theoretic limit, and they run fast enough to apply at inference time. That narrows the gap between what's provably achievable and what's practical at 2–4 bits.
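The rotation idea in the second bullet can be illustrated with a toy experiment. This is a simplified sketch, not QuIP#, QuaRot, or TurboQuant themselves: it uses a randomized Hadamard matrix on one side only, a deliberately naive per-matrix 4-bit rounder, and synthetic outliers planted by hand. The helper names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction (n a power of two), scaled so the matrix is orthogonal."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_uniform(W, bits):
    """Symmetric round-to-nearest with a single scale per matrix (naive on purpose)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
d = 256
W = rng.normal(size=(d, d))
W[rng.integers(0, d, 20), rng.integers(0, d, 20)] = 40.0   # planted outlier weights
x = rng.normal(size=d)

signs = rng.choice([-1.0, 1.0], size=d)
Q = hadamard(d) * signs               # randomized Hadamard rotation, H @ diag(signs)

W_rot = W @ Q                         # rotate the weights ...
x_rot = Q.T @ x                       # ... and counter-rotate the activations
assert np.allclose(W @ x, W_rot @ x_rot)   # the exact product is unchanged

err_plain = np.linalg.norm(quantize_uniform(W, 4) @ x - W @ x)
err_rot = np.linalg.norm(quantize_uniform(W_rot, 4) @ x_rot - W @ x)
print(f"4-bit output error  plain: {err_plain:.2f}   rotated: {err_rot:.2f}")
```

Without rotation the outliers stretch the quantization range so far that most ordinary weights round to zero; after rotation each outlier is smeared over a whole row and the same 4 bits cover a much tighter range.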

The most cited demonstration that sub-4-bit can be a training target, not just a post-hoc squeeze, is BitNet b1.58: weights constrained to the ternary set \(\{-1, 0, +1\}\) — about \(\log_2 3 \approx 1.58\) bits each — trained from scratch, reportedly matching full-precision Transformers at billions of parameters while replacing most multiplies with additions. It remains contested how far this holds at the very largest scales, but it reframed the floor from "4 bits" to "less than 2".
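A sketch of ternary weights in action, using the absmean rounding rule the BitNet b1.58 paper describes (scale by the mean absolute weight, round, clip to \(\{-1,0,+1\}\)). It is applied here post hoc to random weights purely to show the arithmetic; the paper's result depends on training with this constraint from scratch, which this snippet does not do.

```python
import numpy as np

def ternarize(W, eps=1e-8):
    """Absmean ternary quantization: scale by mean |W|, round, clip to {-1, 0, +1}."""
    gamma = np.abs(W).mean()
    W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
    return W_t.astype(np.int8), gamma

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 16))
x = rng.normal(size=16)

W_t, gamma = ternarize(W)

# With ternary weights each output is a sum of some inputs minus a sum of others,
# followed by a single multiply for the shared scale.
y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W_t]) * gamma
y_matmul = (W_t * gamma) @ x

print(sorted(set(W_t.ravel().tolist())), f"{np.log2(3):.3f} bits/weight")
print(np.max(np.abs(y_addsub - y_matmul)))
```

The add/subtract route and the ordinary matmul give the same output, which is the sense in which ternary weights replace most multiplies with additions.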

True or false: pushing a model below 4 bits per weight ("sub-4-bit" quantization) is fundamentally a trade — it shrinks the stored model size but raises quantization error, costing some quality. (Enter true or false.)
EQ 11.6 makes both sides explicit: size \(=\tfrac{b}{8}\) bytes/param falls as \(b\) drops, while error \(\propto 2^{-b}\) rises. Above 4 bits the quality loss is negligible; below 4 it becomes real and you must spend cleverness (rotation, codebooks, error-aware rounding) to keep it small. There is no free lunch — it is a size-versus-quality trade. True.
A 70B-parameter model is quantized to \(b = 2\) bits per weight. Using \( \text{bytes/param} = b/8 \) (EQ 11.6), how many GB do the weights occupy? (Use \(1\,\text{GB} = 10^9\) bytes.)
Bytes per parameter \(= b/8 = 2/8 = 0.25\). Total \(= 70\times10^9 \times 0.25 = 1.75\times10^{10}\) bytes \(= 17.5\) GB, versus 140 GB at FP16. That is the prize: a 70B model whose weights fit on a single 24 GB consumer card, if you can hold the quality.
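The same arithmetic, swept across bit-widths, as a small sketch. It counts raw weight bits only, using \(1\,\text{GB} = 10^9\) bytes; real formats also store scales or codebooks, so actual files run somewhat larger. `weights_gb` is an illustrative helper name.

```python
def weights_gb(n_params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring scale/codebook overhead."""
    return n_params * bits_per_weight / 8 / 1e9

for b in (16, 8, 4, 3, 2, 1.58):
    print(f"{b:>5} bits -> {weights_gb(70e9, b):6.1f} GB")
```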
PYTHON · RUNNABLE IN-BROWSER
# Quantization error vs bit-width, down to 2-3 bits (EQ 11.6)
import numpy as np
rng = np.random.default_rng(0)
w = rng.normal(0, 1, 200_000)            # trained weights ~ Gaussian
R = 6.0                                   # clip to +/-3 sigma -> range 6

bits = [16, 8, 4, 3, 2]
rel = []
for b in bits:
    levels = 2**b
    step = R / (levels - 1)
    wc = np.clip(w, -R/2, R/2)                            # clip outliers to the range
    q = np.round((wc + R/2) / step) * step - R/2          # snap to exactly 2**b levels spanning [-R/2, R/2]
    err = np.sqrt(np.mean((w - q)**2)) / np.sqrt(np.mean(w**2))
    rel.append(err)
    print(f"{b:>2} bit | {levels:>6} levels | {b/8:>5.3f} B/param | rel RMSE {err:.4f}")

print("\n16->8 bits barely moves it (clipping at +/-3 sigma sets the floor);")
print("from 4 bits down each removed bit roughly doubles it, so 4->2 bits is ~5x worse -- the sub-4-bit wall.")
plot_xy(bits, rel)                        # error climbs sharply below 4 bits
INSTRUMENT 11.3 — BIT-WIDTH TRADE-OFF · SIZE vs QUALITY · EQ 11.6 [interactive; readouts: WEIGHTS SIZE, REL. QUANT ERROR, REGIME]
Drag bit-width from 16 down toward 1.58 (BitNet's ternary floor). Above 4 bits, size falls while the dashed error curve barely moves — nearly free. Below 4, error climbs steeply and the curve enters the red zone where naive uniform quantization breaks; only rotation + codebook methods (QuIP#, TurboQuant) keep quality usable there. The vertical line marks the 4-bit floor of Chapter 07.
11.5

What's racing the Transformer

Step back and the 2026 frontier is a four-cornered race, not a coronation. Each contender trades a different axis:

Family | Token-mix cost | Decode memory | Best at | Weakest at
Transformer | O(n²) | O(n) KV cache | Exact recall, reasoning, ecosystem maturity | Long-context cost & memory
SSM (Mamba-2) | O(n) | O(1) state | Throughput, very long context, streaming | Precise long-range copy / retrieval
Linear / gated attn | O(n) | O(1) state | Cheap mixing; close cousin of SSD | Sharp content-addressable lookup
Hybrid | O(n) + few O(n²) | small KV + state | Most of both worlds; current sweet spot | Tuning the attention ratio; not yet peak-SOTA

Orthogonal to all four sits quantization: any of them can be squeezed to 4, 3, or ~2 bits, so the real deployment object in 2026 is "a hybrid, in 4-bit" rather than any single pure design. Two further frontier currents press on the same surface and were treated earlier in this volume — mixture-of-experts (Chapter 09), which cuts active FLOPs per token by routing to a few experts, and diffusion language models (Chapter 10), which replace left-to-right decoding with parallel iterative refinement. None of these is mutually exclusive; a 2026 system can be an MoE hybrid SSM-Transformer served in 4-bit.

The honest scorecard. SSMs and linear attention have won the long-context efficiency argument — at million-token scale there is no contest. They have not won the peak-capability argument: as of 2026 the best reasoning and recall results still come from attention-heavy models, and the most successful sub-quadratic designs hedge by keeping attention layers. The Transformer's monopoly is broken; its leadership is not. The likeliest 2026–2027 outcome is not a successor but a blend — and knowing which operator to spend where is the new architectural skill.

NEXT

You now have the whole machine — from the residual stream to the 2026 frontier. The capstone assembles every chapter into one end-to-end picture: how a token becomes an embedding, flows through attention or a state-space scan, gets trained, aligned, fine-tuned, compressed, and finally served — and where each chapter's idea lives in a real deployment.

11.R

References

  1. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024 — the selective SSM and hardware-aware scan behind EQ 11.2–11.3.
  2. Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024 — Mamba-2 and the SSD duality of EQ 11.4.
  3. Gu, A., Goel, K. & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces (S4). ICLR 2022 — the structured SSM that started the line (§11.2).
  4. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020 — the reassociation trick of EQ 11.5.
  5. Lieber, O. et al. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. AI21 — a production-scale interleaved Mamba/attention/MoE hybrid (§11.3).
  6. Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 — second-order error-aware rounding (§11.4).
  7. Tseng, A., Chee, J., Sun, Q., Kuleshov, V. & De Sa, C. (2024). QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. ICML 2024 — incoherence processing + E8 lattice codebooks toward ~2 bits.
  8. Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet b1.58). Microsoft Research — ternary {-1,0,+1} weights trained from scratch (§11.4).
  9. Ashkboos, S. et al. (2024). QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. NeurIPS 2024 — Hadamard rotations that spread weight/activation outliers (§11.4).