Anatomy of a block
A modern decoder block is two sub-layers, self-attention followed by a feed-forward MLP, each wrapped in a residual connection and preceded by a normalization. This is the “pre-norm” arrangement, the default in most LLMs since GPT-2 because it keeps gradients stable in deep stacks:
```python
# one transformer block, forward, in pure numpy (EQ 2.1)
import numpy as np
rng = np.random.default_rng(0)
T, d, dff = 4, 16, 43 # toy: 4 tokens, d_model 16, dff ~ 8/3 d
def rms(x): return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5)
def ledger(name, x): print(f"{name:<24}{str(x.shape):>10} norm {np.linalg.norm(x):7.2f}")
h = rng.normal(0, 1, (T, d)); ledger("residual stream in", h)
Wq, Wk, Wv, Wo = rng.normal(0, d ** -0.5, (4, d, d))
x = rms(h); ledger("RMSNorm(h)", x)
Q, K, V = x @ Wq, x @ Wk, x @ Wv; ledger("Q (K, V same)", Q)
S = Q @ K.T / np.sqrt(d) + np.triu(np.full((T, T), -1e9), 1)
A = np.exp(S - S.max(-1, keepdims=True)); A /= A.sum(-1, keepdims=True)
ledger("attn weights (causal)", A)
h = h + (A @ V) @ Wo; ledger("h + Attn [residual]", h)
Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
Wd = rng.normal(0, dff ** -0.5, (dff, d))
x = rms(h)
g = x @ Wg
m = ((g / (1 + np.exp(-g))) * (x @ Wu)) @ Wd # SwiGLU: SiLU(gate) * up, down
ledger("SwiGLU MLP out", m)
h = h + m; ledger("h + MLP [residual]", h)
print("\nthe block ADDED its work to h -- the stream is nudged, never replaced")
```
The full model is: embedding lookup → \(L\) of these blocks → final norm → unembedding to logits. That's the entire architecture. Everything else in modern LLM engineering is a refinement of one of these pieces.
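The pipeline just described (embedding lookup, \(L\) blocks, final norm, unembedding) can be sketched end to end in a few lines. This is a toy under stated assumptions, not a reference implementation: the sizes `V`, `T`, `d`, `dff`, `L` are hypothetical, weights are random, attention is single-head, and no positional information is injected.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, d, dff, L = 50, 6, 16, 43, 3     # hypothetical toy sizes: vocab, tokens, width, MLP width, depth

def rms(x):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5)

def attention(x, Wq, Wk, Wv, Wo):
    # single-head causal self-attention (an assumption; real models use many heads)
    Q, K, Vv = x @ Wq, x @ Wk, x @ Wv
    n = len(x)
    S = Q @ K.T / np.sqrt(x.shape[-1]) + np.triu(np.full((n, n), -1e9), 1)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    return (A @ Vv) @ Wo

def swiglu(x, Wg, Wu, Wd):
    g = x @ Wg
    return ((g / (1 + np.exp(-g))) * (x @ Wu)) @ Wd

def init_block():
    Wq, Wk, Wv, Wo = rng.normal(0, d ** -0.5, (4, d, d))
    Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
    Wd = rng.normal(0, dff ** -0.5, (dff, d))
    return (Wq, Wk, Wv, Wo), (Wg, Wu, Wd)

E = rng.normal(0, 1, (V, d))            # embedding table
U = rng.normal(0, d ** -0.5, (d, V))    # unembedding matrix
blocks = [init_block() for _ in range(L)]

def forward(tokens):
    h = E[tokens]                                   # embedding lookup
    for attn_w, mlp_w in blocks:                    # L pre-norm blocks
        h = h + attention(rms(h), *attn_w)
        h = h + swiglu(rms(h), *mlp_w)
    return rms(h) @ U                               # final norm, then logits

tokens = np.array([3, 14, 15, 9, 26, 5])
logits = forward(tokens)
print(logits.shape)   # (6, 50): one row of vocabulary logits per position
```

Because the attention mask is causal, changing the last token leaves the logits at every earlier position unchanged, which is a quick sanity check on any implementation of this loop.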
The residual stream is the model's workspace
The most productive mental model (due to the mechanistic-interpretability literature): the residual stream is a communication bus of width \(d_{\text{model}}\) running through the network. Each attention head and each MLP reads from the bus through a linear projection, computes something, and writes its result back by addition into (approximately) its own subspace.
- Superposition. The bus carries far more “features” than it has dimensions, packed as non-orthogonal directions. Sparse autoencoders (Chapter 09) decompress these into interpretable features.
- Iterative refinement. The token vector for “bank” enters as pure type information and is incrementally enriched: position, syntax, sense disambiguation, long-range bindings — each layer nudging the vector with a small additive update.
- Logit lens. Because every layer writes in the same coordinate system, you can apply the unembedding to intermediate states and watch the model's next-token guess sharpen layer by layer — direct evidence the stream is a progressively refined prediction.
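One concrete consequence of the bus picture is that the final stream equals the initial embedding plus the sum of every sub-layer's write. The sketch below checks that identity numerically; `sublayer` is a hypothetical stand-in (a random linear map followed by `tanh`) for an attention or MLP sub-layer, chosen only to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 4, 16, 3                      # hypothetical toy sizes

def rms(x):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5)

def sublayer(x, W):
    # stand-in for attention or MLP: reads the normalized stream, returns an update
    return np.tanh(x @ W)

h0 = rng.normal(0, 1, (T, d))           # the stream as it leaves the embedding
h, writes = h0.copy(), []
for _ in range(2 * L):                  # two sub-layers per block
    W = rng.normal(0, d ** -0.5, (d, d))
    w = sublayer(rms(h), W)             # read from the bus
    writes.append(w)
    h = h + w                           # write back by addition

recon = h0 + np.sum(writes, axis=0)
print(np.allclose(h, recon))            # True: the stream is the sum of all writes
```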
Attention moves information between positions; the MLP transforms it in place. Roughly: attention answers “what should I look at?”, the MLP answers “what do I conclude from what I gathered?”. Knowledge recall behaves like key-value lookups stored in MLP weights; copying and binding behave like attention-head routing.
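The division of labor can be checked directly: perturb the input at one position and see where each sub-layer's output changes. The sketch below is a minimal illustration with random weights and a single causal attention head (both assumptions); the MLP output changes only at the perturbed position, while the attention output changes at every position that can see it.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, dff = 5, 16, 43                   # hypothetical toy sizes

def attention(x, Wq, Wk, Wv, Wo):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    S = Q @ K.T / np.sqrt(d) + np.triu(np.full((T, T), -1e9), 1)
    A = np.exp(S - S.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    return (A @ V) @ Wo

def swiglu(x, Wg, Wu, Wd):
    g = x @ Wg
    return ((g / (1 + np.exp(-g))) * (x @ Wu)) @ Wd

attn_w = tuple(rng.normal(0, d ** -0.5, (4, d, d)))
mlp_w = (*rng.normal(0, d ** -0.5, (2, d, dff)), rng.normal(0, dff ** -0.5, (dff, d)))

x = rng.normal(0, 1, (T, d))
x2 = x.copy()
x2[0] += 1.0                            # perturb position 0 only

mlp_delta = np.abs(swiglu(x2, *mlp_w) - swiglu(x, *mlp_w)).max(-1)
attn_delta = np.abs(attention(x2, *attn_w) - attention(x, *attn_w)).max(-1)
print("MLP change per position: ", np.round(mlp_delta, 3))
print("attn change per position:", np.round(attn_delta, 3))
```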
Normalization: LayerNorm → RMSNorm
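Deep residual stacks need their activations kept in a stable range. The original transformer used LayerNorm; nearly every modern LLM (Llama, Mistral, Qwen, DeepSeek) uses the cheaper RMSNorm, which drops mean-centering and the bias term:
\[ \mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mathrm{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \]
Here \(\mu\) and \(\sigma^2\) are the mean and variance of \(x\) over the feature dimension, \(\gamma\) is a learned gain, and \(\beta\) is a learned bias.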
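A side-by-side sketch makes the difference concrete. The function names `layer_norm` and `rms_norm` are illustrative, and the gain and bias are fixed at 1 and 0 rather than learned. On input that already has zero mean, the two normalizations coincide.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)            # population variance over features
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gain, eps=1e-5):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps) * gain

rng = np.random.default_rng(0)
d = 16
x = rng.normal(3.0, 2.0, (4, d))              # deliberately non-zero mean

ln = layer_norm(x, np.ones(d), np.zeros(d))
rn = rms_norm(x, np.ones(d))
print("LayerNorm row means:", np.round(ln.mean(-1), 3))               # centered
print("RMSNorm row means:  ", np.round(rn.mean(-1), 3))               # mean is kept
print("RMSNorm row RMS:    ", np.round(np.sqrt((rn ** 2).mean(-1)), 3))

xc = x - x.mean(-1, keepdims=True)            # zero-mean input
same = np.allclose(layer_norm(xc, np.ones(d), np.zeros(d)), rms_norm(xc, np.ones(d)))
print("identical on zero-mean input:", same)
```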
Placement matters more than flavor: pre-norm (normalize the input of each sub-layer, as in EQ 2.1) yields a clean gradient path through the residual additions, enabling very deep models without the fragile learning-rate warmup gymnastics of the original post-norm design. Some recent models add extra norms (e.g., QK-norm on attention queries/keys) for further stability at scale.
The MLP: where most parameters live
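The feed-forward sub-layer is a small network applied to every position independently, expanding the representation to an inner width \(d_{\text{ff}}\) (classically \(4\,d_{\text{model}}\)) and projecting back. The modern default activation is SwiGLU, a gated linear unit using SiLU (“swish”), which adds a third weight matrix for the gate:
\[ \mathrm{MLP}(x) = \big(\mathrm{SiLU}(x W_g) \odot (x W_u)\big)\, W_d, \qquad \mathrm{SiLU}(z) = z\,\sigma(z) = \frac{z}{1 + e^{-z}} \]
with \(W_g, W_u \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}\) and \(W_d \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}\).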
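As a minimal sketch, the gated MLP is three matrix multiplies and an elementwise product. The second half shows why SwiGLU models usually shrink the inner width to about \(\tfrac{8}{3}\,d_{\text{model}}\), the width used elsewhere in this chapter: with three matrices instead of two, that width keeps the parameter count equal to a classic \(4\,d_{\text{model}}\) MLP. The value `d_big = 768` is a hypothetical example chosen because it divides evenly by 3.

```python
import numpy as np

def silu(z):
    return z / (1 + np.exp(-z))

def swiglu_mlp(x, Wg, Wu, Wd):
    # gate and up projections run in parallel; their product is projected back down
    return (silu(x @ Wg) * (x @ Wu)) @ Wd

rng = np.random.default_rng(0)
T, d, dff = 4, 16, 43                          # toy sizes
Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
Wd = rng.normal(0, dff ** -0.5, (dff, d))
out = swiglu_mlp(rng.normal(0, 1, (T, d)), Wg, Wu, Wd)
print(out.shape)                               # (4, 16): same shape as the input

d_big = 768                                    # hypothetical model width
classic_params = 2 * d_big * (4 * d_big)       # up + down at d_ff = 4d
swiglu_params = 3 * d_big * (8 * d_big // 3)   # gate + up + down at d_ff = 8d/3
print(classic_params, swiglu_params)           # 4718592 4718592
```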
Interpretability work suggests MLPs implement key→value memories: the first projection detects patterns in the stream (keys), the nonlinearity gates which fire, and the second projection writes associated content (values) back. This is where “Paris is the capital of France” mostly resides — and why model-editing techniques target MLP weights.
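The key-value reading corresponds to an exact algebraic identity: the MLP output for a token is a weighted sum of the rows of the down projection, with one weight per hidden unit. The sketch below verifies this for a single token with random weights, so the "memories" here carry no meaning; it only shows the bookkeeping. Treating column `i` of `Wg` and `Wu` as the key and row `i` of `Wd` as the value is the interpretation, not something the code enforces.

```python
import numpy as np

rng = np.random.default_rng(0)
d, dff = 16, 43                                # toy sizes

def silu(z):
    return z / (1 + np.exp(-z))

Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
Wd = rng.normal(0, dff ** -0.5, (dff, d))
x = rng.normal(0, 1, d)                        # one token's normalized stream vector

a = silu(x @ Wg) * (x @ Wu)                    # (dff,): how strongly each key matched
out = a @ Wd                                   # the usual matrix form

# same result written as a sum of value vectors, one per hidden unit
as_sum = sum(a[i] * Wd[i] for i in range(dff))
print(np.allclose(out, as_sum))                # True

top = np.argsort(-np.abs(a))[:3]               # the three most active memory slots
print("most active slots:", top.tolist())
```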
Position: from sinusoids to RoPE
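Attention by itself is blind to token order: permuting the inputs simply permutes the outputs, so without help the model cannot tell “dog bites man” from “man bites dog”. The original transformer added fixed sinusoidal vectors to embeddings; GPT-2 learned absolute position vectors. Modern LLMs almost universally use Rotary Position Embeddings (RoPE), which encode position by rotating query and key vectors, pairwise, by position-proportional angles. For one 2-D pair with frequency \(\theta\), a query at position \(m\), and a key at position \(n\):
\[ \big\langle R(m\theta)\,q,\; R(n\theta)\,k \big\rangle = q^\top R\big((n-m)\theta\big)\,k, \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix} \]
Absolute position cancels and only the offset \(n - m\) survives, as the following script checks: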
```python
# RoPE relativity: dot(R_m q, R_n k) depends only on n - m (EQ 2.5)
import numpy as np

rng = np.random.default_rng(0)
theta = 0.35 # one frequency dial

def R(pos): # 2x2 rotation by pos*theta
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]])

q, k = rng.normal(0, 1, (2, 2)) # one 2-D query/key pair
print("dot(R_m q, R_n k): key n=0 n=4 n=8")
for m in (0, 4, 8):
    row = [(R(m) @ q) @ (R(n) @ k) for n in (0, 4, 8)]
    print(f" query m={m} " + " ".join(f"{v:6.3f}" for v in row))
print("\nconstant along diagonals: (0,4) = (4,8); (0,0) = (4,4) = (8,8).")
print("absolute position cancels in the dot product; only n - m survives.")

# score as a function of offset; the original page drew this curve with its own
# plot_xy helper, which is not defined here, so a few samples are printed instead
offs = np.arange(-16, 17)
scores = [q @ (R(o) @ k) for o in offs]
for o, v in zip(offs[::4], scores[::4]):
    print(f" offset {int(o):+3d}: {v:6.3f}")
```
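The 2-D demonstration extends to full vectors by splitting them into pairs and giving each pair its own frequency. The sketch below assumes the interleaved pairing \((x_0, x_1), (x_2, x_3), \dots\) and frequencies \(\theta_i = \text{base}^{-2i/d}\) with a base of 10,000, as in the RoFormer paper; some implementations pair the first and second halves of the vector instead. The function name `rope` is illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of x by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # theta_i = base^(-2i/d), one per pair
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(0, 1, (2, 16))

a = rope(q, 3) @ rope(k, 7)        # positions 3 and 7, offset 4
b = rope(q, 10) @ rope(k, 14)      # positions 10 and 14, offset 4
print(np.isclose(a, b))            # True: the score depends only on the offset
print(np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q)))  # True: rotation keeps length
```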
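ALiBi is the notable alternative: skip position vectors entirely and subtract a linear penalty \(m \cdot (i - j)\) from the attention score between query position \(i\) and key position \(j\), where \(m\) is a fixed slope that differs per head (not the position index \(m\) used for RoPE above). It is simple, and it was designed to extrapolate to sequences longer than those seen in training. RoPE won on quality; its base-frequency scaling tricks won on context length.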
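A minimal sketch of the ALiBi bias follows. It assumes the slope schedule from the ALiBi paper for a power-of-two head count, a geometric sequence starting at \(2^{-8/n}\) for \(n\) heads; the function name `alibi_bias` is illustrative. The returned tensor is added to the raw attention scores before the softmax.

```python
import numpy as np

def alibi_bias(T, n_heads):
    """Per-head additive bias of shape (n_heads, T, T), with a causal mask folded in."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)   # 1/2, 1/4, ... for 8 heads
    i = np.arange(T)[:, None]          # query positions
    j = np.arange(T)[None, :]          # key positions
    penalty = -slopes[:, None, None] * (i - j)      # grows linearly with distance
    causal = np.where(j > i, -np.inf, 0.0)          # future keys are masked out
    return penalty + causal

bias = alibi_bias(T=5, n_heads=8)
print(bias.shape)        # (8, 5, 5)
print(bias[0, 3])        # head 0 (slope 0.5), query 3: [-1.5 -1.  -0.5  0.  -inf]
```

Heads with steep slopes attend almost only to nearby tokens, while heads with shallow slopes keep a long view.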
Counting parameters
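Per block, with MHA and a SwiGLU MLP at \(d_{\text{ff}} = \tfrac{8}{3} d\) (writing \(d = d_{\text{model}}\)): attention contributes \(4d^2\) (the \(W_Q, W_K, W_V, W_O\) projections) and the MLP \(3 \times \tfrac{8}{3} d^2 = 8d^2\). So:
\[ N_{\text{block}} \approx 4d^2 + 8d^2 = 12\,d^2, \qquad N \approx 12\,L\,d^2 \]
for a stack of \(L\) blocks, ignoring the embedding, unembedding, and norm parameters.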
| Model | Params | L | d_model | Heads (KV) | Context | Notes |
|---|---|---|---|---|---|---|
| GPT-2 XL (2019) | 1.5B | 48 | 1,600 | 25 | 1K | Learned abs. pos., GELU, post-norm era ends |
| Llama-2-70B (2023) | 70B | 80 | 8,192 | 64 (8) | 4K | RoPE, SwiGLU, RMSNorm, GQA |
| Llama-3.1-405B (2024) | 405B | 126 | 16,384 | 128 (8) | 128K | 15T training tokens, RoPE base 500K |
| DeepSeek-V3 (2024) | 671B total / 37B active | 61 | 7,168 | 128 (MLA) | 128K | MoE: 256 experts, 8 routed + 1 shared |
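The \(12\,d^2\) per-block count can be checked against the dense rows of the table. The sketch below is only a rough estimate under the assumptions stated above: it ignores embeddings, and the Llama models use GQA and a wider MLP than \(\tfrac{8}{3} d\), so their true breakdown differs even where the total lands close. DeepSeek-V3 is left out because a mixture-of-experts model does not follow this formula. The helper name `approx_params` is illustrative.

```python
def approx_params(L, d, vocab=0):
    """Rough dense-transformer size: 12*d^2 per block, plus an optional vocab*d embedding."""
    return 12 * L * d * d + vocab * d

rows = [
    ("GPT-2 XL", 48, 1600, 1.5e9),
    ("Llama-2-70B", 80, 8192, 70e9),
    ("Llama-3.1-405B", 126, 16384, 405e9),
]
for name, L, d, reported in rows:
    est = approx_params(L, d)
    print(f"{name:<16} 12*L*d^2 = {est / 1e9:6.1f}B   reported {reported / 1e9:.1f}B")
```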
The block diagram leaves one box closed — the attention layer itself, the only place tokens talk to each other, and the component the industry has re-engineered most aggressively. Chapter 03 opens it completely.
Further reading
- Vaswani et al. (2017). Attention Is All You Need. — the transformer architecture: blocks, residual connections, the data path this chapter walks.
- Radford et al. (2018). Improving Language Understanding by Generative Pre-Training (GPT). — established the decoder-only stack as the LM backbone.
- Ba, Kiros & Hinton (2016). Layer Normalization. — the normalization scheme transformers were built on.
- Zhang & Sennrich (2019). Root Mean Square Layer Normalization. — RMSNorm, the cheaper variant now standard in frontier models.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. — RoPE, the dominant positional scheme.
- Elhage et al. (2021). A Mathematical Framework for Transformer Circuits. — the residual-stream-as-workspace reading used throughout this chapter.