02 · The Transformer — LLM Field Manual

2.1

Anatomy of a block

A modern decoder block is two sub-layers, self-attention then a feed-forward MLP, each wrapped in a residual connection and preceded by a normalization (the “pre-norm” arrangement, the default since GPT-2 because it keeps gradients stable in deep stacks):

EQ 2.1 — ONE TRANSFORMER BLOCK (PRE-NORM) $$ \begin{aligned} h' &= h + \mathrm{Attn}\big(\mathrm{Norm}(h)\big) \\[4px] h'' &= h' + \mathrm{MLP}\big(\mathrm{Norm}(h')\big) \end{aligned} $$

Note what the residual form implies: each sub-layer computes an update that is added to the running state $h$, never a replacement for it. A block can choose to do almost nothing — and early in training, that is exactly what keeps optimization sane at 100+ layers.

PYTHON · RUNNABLE IN-BROWSER

# one transformer block, forward, in pure numpy (EQ 2.1)
import numpy as np
rng = np.random.default_rng(0)
T, d, dff = 4, 16, 43                     # toy: 4 tokens, d_model 16, dff ~ 8/3 d

def rms(x): return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5)
def ledger(name, x): print(f"{name:<24}{str(x.shape):>10}   norm {np.linalg.norm(x):7.2f}")

h = rng.normal(0, 1, (T, d));             ledger("residual stream in", h)
Wq, Wk, Wv, Wo = rng.normal(0, d ** -0.5, (4, d, d))
x = rms(h);                               ledger("RMSNorm(h)", x)
Q, K, V = x @ Wq, x @ Wk, x @ Wv;         ledger("Q  (K, V same)", Q)
S = Q @ K.T / np.sqrt(d) + np.triu(np.full((T, T), -1e9), 1)
A = np.exp(S - S.max(-1, keepdims=True)); A /= A.sum(-1, keepdims=True)
ledger("attn weights (causal)", A)
h = h + (A @ V) @ Wo;                     ledger("h + Attn   [residual]", h)

Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
Wd = rng.normal(0, dff ** -0.5, (dff, d))
x = rms(h)
g = x @ Wg
m = ((g / (1 + np.exp(-g))) * (x @ Wu)) @ Wd   # SwiGLU: SiLU(gate) * up, down
ledger("SwiGLU MLP out", m)
h = h + m;                                ledger("h + MLP    [residual]", h)
print("\nthe block ADDED its work to h -- the stream is nudged, never replaced")


FIG 2.A · DECODER BLOCK — DATA PATH

Two taps on one bus. Attention is the only place positions exchange information; the MLP transforms each position independently. Both write their result back into the stream through addition.

The full model is: embedding lookup → $L$ of these blocks → final norm → unembedding to logits. That's the entire architecture. Everything else in modern LLM engineering is a refinement of one of these pieces.
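The pipeline above can be sketched end to end in a few lines of numpy. This is a toy under stated assumptions, not a reference implementation: the sizes, the random weights, and the names `init_block`, `block`, and `forward` are made up for illustration, attention is single-head, and position encoding is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, d, dff, L = 50, 5, 16, 43, 3      # toy sizes, assumed for illustration

def rms(x):
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + 1e-5)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def init_block():
    # hypothetical helper: random weights for one block
    p = {n: rng.normal(0, d ** -0.5, (d, d)) for n in ("Wq", "Wk", "Wv", "Wo")}
    p["Wg"] = rng.normal(0, d ** -0.5, (d, dff))
    p["Wu"] = rng.normal(0, d ** -0.5, (d, dff))
    p["Wd"] = rng.normal(0, dff ** -0.5, (dff, d))
    return p

def block(h, p):
    # EQ 2.1: pre-norm attention update, then pre-norm MLP update
    x = rms(h)
    S = (x @ p["Wq"]) @ (x @ p["Wk"]).T / np.sqrt(d)
    S = S + np.triu(np.full((len(h), len(h)), -1e9), 1)     # causal mask
    h = h + (softmax(S) @ (x @ p["Wv"])) @ p["Wo"]
    x = rms(h)
    g = x @ p["Wg"]
    return h + ((g / (1 + np.exp(-g))) * (x @ p["Wu"])) @ p["Wd"]

E = rng.normal(0, 1, (V, d))             # embedding table
U = rng.normal(0, d ** -0.5, (d, V))     # unembedding matrix
blocks = [init_block() for _ in range(L)]

def forward(tokens):
    h = E[tokens]                        # embedding lookup
    for p in blocks:                     # L blocks
        h = block(h, p)
    return rms(h) @ U                    # final norm, then logits

tokens = np.array([3, 14, 15, 9, 2])
logits = forward(tokens)
print(logits.shape)                      # one row of |V| logits per position
```

Because of the causal mask, changing the last token leaves the logits at every earlier position unchanged, which is a quick sanity check on any variant of this sketch.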

2.2

The residual stream is the model's workspace

The most productive mental model (due to the mechanistic-interpretability literature): the residual stream is a communication bus of width $d_{\text{model}}$ running through the network. Each attention head and each MLP reads from the bus through a linear projection, computes something, and writes its result back by addition into (approximately) its own subspace.

Superposition. The bus carries far more “features” than it has dimensions, packed as non-orthogonal directions. Sparse autoencoders (Chapter 09) decompress these into interpretable features.
Iterative refinement. The token vector for “bank” enters as pure type information and is incrementally enriched: position, syntax, sense disambiguation, long-range bindings — each layer nudging the vector with a small additive update.
Logit lens. Because every layer writes in the same coordinate system, you can apply the unembedding to intermediate states and watch the model's next-token guess sharpen layer by layer — direct evidence the stream is a progressively refined prediction.
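The superposition claim rests on a geometric fact that is easy to check: in high dimensions, many more than $d$ random directions can coexist with only small pairwise overlap. The sketch below uses random unit vectors as a stand-in for learned features (an assumption; real feature directions are learned, not random) and toy sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 1024                         # 1024 "features" in a 256-dim stream (toy sizes)

F = rng.normal(0, 1, (n, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)    # unit-length feature directions

cos = F @ F.T
off = np.abs(cos[~np.eye(n, dtype=bool)])        # overlap between distinct features
print(f"{n} features in {d} dims: mean |cos| = {off.mean():.3f}, max |cos| = {off.max():.3f}")

# write a sparse set of features into the stream by addition, read back by dot product
active = rng.choice(n, size=3, replace=False)
h = F[active].sum(0)
readout = F @ h
top = np.argsort(readout)[-3:]
print("written:", sorted(active.tolist()), " strongest readouts:", sorted(top.tolist()))
```

The readout works only because few features are active at once; as more are written simultaneously, the interference terms add up, which is the trade-off superposition makes.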

INTUITION

Attention moves information between positions; the MLP transforms it in place. Roughly: attention answers “what should I look at?”, the MLP answers “what do I conclude from what I gathered?”. Knowledge recall behaves like key-value lookups stored in MLP weights; copying and binding behave like attention-head routing.

2.3

Normalization: LayerNorm → RMSNorm

Deep residual stacks need their activations kept in a stable range. The original transformer used LayerNorm; nearly every modern LLM (Llama, Mistral, Qwen, DeepSeek) uses the cheaper RMSNorm, which drops mean-centering and the bias term:

EQ 2.2 — LAYERNORM vs RMSNORM $$ \mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \qquad\Bigg|\qquad \mathrm{RMS}(x) = \gamma \odot \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} $$

$\mu, \sigma^2$ are the per-vector mean and variance; $\gamma, \beta$ are learned. RMSNorm keeps only the scale normalization — empirically all that matters — saving a reduction pass and parameters. $\epsilon \approx 10^{-5}$ guards against division by zero.

Apply RMSNorm (with $ \gamma = 1 $, $ \epsilon $ negligible) to $ x = (3,\ 4,\ 0,\ 0) $, $ d = 4 $. What is the first component of the output?

Denominator $ = \sqrt{\tfrac{1}{4}(3^2 + 4^2 + 0 + 0)} = \sqrt{25/4} = \sqrt{6.25} = 2.5 $. First output component $ = 3 / 2.5 = $ 1.2.
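Both normalizations from EQ 2.2 fit in a few lines of numpy, which makes the worked example easy to check. This is a minimal sketch with $\gamma = 1$ and $\beta = 0$ as defaults; the function names `layer_norm` and `rms_norm` are chosen here for illustration.

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # subtract the per-vector mean, divide by the standard deviation
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-5):
    # no centering, no bias: divide by the root mean square only
    return gamma * x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

x = np.array([3.0, 4.0, 0.0, 0.0])
ln_out = layer_norm(x)
rms_out = rms_norm(x)
print("LayerNorm:", np.round(ln_out, 3))
print("RMSNorm:  ", np.round(rms_out, 3))    # first component is about 1.2
```

The LayerNorm output has zero mean, while the RMSNorm output keeps the sign pattern and relative sizes of the input and only rescales it to unit root mean square.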

Placement matters more than flavor: pre-norm (normalize the input of each sub-layer, as in EQ 2.1) yields a clean gradient path through the residual additions, enabling very deep models without the fragile learning-rate warmup gymnastics of the original post-norm design. Some recent models add extra norms (e.g., QK-norm on attention queries/keys) for further stability at scale.
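The “clean gradient path” can be made explicit by unrolling the residual updates. Treat each sub-layer as one step $F_l$ (notation introduced here for the sketch, not used elsewhere in the chapter), so that pre-norm reads $h_{l+1} = h_l + F_l(\mathrm{Norm}(h_l))$. Then:

$$ h_L \;=\; h_0 + \sum_{l=0}^{L-1} F_l\big(\mathrm{Norm}(h_l)\big), \qquad \frac{\partial h_L}{\partial h_l} \;=\; I + \sum_{j=l}^{L-1} \frac{\partial F_j\big(\mathrm{Norm}(h_j)\big)}{\partial h_l} $$

The identity term $I$ carries gradient from the output straight to every layer, whatever the sub-layers do. In the post-norm arrangement, $h_{l+1} = \mathrm{Norm}\big(h_l + F_l(h_l)\big)$, the normalization sits on the residual path itself, so the gradient reaching layer $l$ is multiplied by the Jacobian of every normalization above it and no bare identity term survives.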

2.4

The MLP: where most parameters live

The feed-forward sub-layer is a two-layer network applied to every position independently, expanding the representation to an inner width $d_{\text{ff}}$ (classically $4\,d_{\text{model}}$) and projecting back. The modern default activation is SwiGLU — a gated linear unit using SiLU (“swish”):

EQ 2.3 — SWIGLU MLP $$ \mathrm{MLP}(x) \;=\; W_{\text{down}} \Big( \mathrm{SiLU}\big(W_{\text{gate}}\, x\big) \;\odot\; W_{\text{up}}\, x \Big), \qquad \mathrm{SiLU}(z) = z \cdot \sigma(z) $$

Three matrices instead of two: a gate path squashed through SiLU multiplies an up projection element-wise, then down projects back to $d_{\text{model}}$. To hold parameter count comparable to the classic 2-matrix design, $d_{\text{ff}}$ is set to $\tfrac{8}{3} d_{\text{model}}$ (rounded for hardware). GPT-2-era models used GELU, $ \mathrm{GELU}(z) = z\,\Phi(z) $, without the gate.

Evaluate the SiLU activation used inside SwiGLU, $ \mathrm{SiLU}(z) = z\,\sigma(z) $, at $ z = 2 $. (Use $ e^{-2} = 0.1353 $.)

$ \sigma(2) = \dfrac{1}{1 + e^{-2}} = \dfrac{1}{1.1353} = 0.8808 $. Then $ \mathrm{SiLU}(2) = 2 \times 0.8808 = $ 1.762.
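A short numpy sketch checks both the SiLU value and the parameter-parity argument behind $d_{\text{ff}} = \tfrac{8}{3} d_{\text{model}}$. The function names and $d = 768$ are chosen here for illustration, and the code uses the row-vector convention `x @ W` rather than the $W x$ of EQ 2.3.

```python
import numpy as np

def silu(z):
    # SiLU(z) = z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W_gate, W_up, W_down):
    # EQ 2.3: the SiLU-squashed gate multiplies the up projection, then project down
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

print(round(float(silu(2.0)), 3))        # 1.762

d = 768
classic_params = 2 * d * (4 * d)         # two matrices at d_ff = 4d
swiglu_params = 3 * d * (8 * d // 3)     # three matrices at d_ff = (8/3)d
print(classic_params, swiglu_params)     # both 8 d^2
```

For widths where $\tfrac{8}{3} d$ is not an integer, the two counts match only approximately, which is why the text says “rounded for hardware”.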

Interpretability work suggests MLPs implement key→value memories: the first projection detects patterns in the stream (keys), the nonlinearity gates which fire, and the second projection writes associated content (values) back. This is where “Paris is the capital of France” mostly resides — and why model-editing techniques target MLP weights.
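The key-value reading is just the MLP's matrix product written one hidden neuron at a time: each neuron contributes its activation (how strongly its “key” matched) times one row of the down projection (its “value”). The sketch below verifies that identity on random weights; the sizes and variable names are assumed for illustration, and it says nothing about what trained neurons actually store.

```python
import numpy as np

rng = np.random.default_rng(0)
d, dff = 16, 43
Wg, Wu = rng.normal(0, d ** -0.5, (2, d, dff))
Wd = rng.normal(0, dff ** -0.5, (dff, d))
x = rng.normal(0, 1, d)                  # one normalized stream vector (stand-in)

g = x @ Wg
act = (g / (1 + np.exp(-g))) * (x @ Wu)  # one coefficient per hidden neuron: the key match

out_matrix = act @ Wd                    # the usual matrix form
out_memory = sum(act[j] * Wd[j] for j in range(dff))   # sum of value rows, weighted by match
print(np.allclose(out_matrix, out_memory))              # True

top = np.argsort(-np.abs(act))[:3]       # the neurons that dominate this write
print("strongest neurons:", top.tolist())
```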

2.5

Position: from sinusoids to RoPE

Without positional information, self-attention is permutation-equivariant: shuffling the input tokens merely shuffles the outputs the same way, so the model cannot tell “dog bites man” from “man bites dog”. The original transformer added fixed sinusoidal vectors to embeddings; GPT-2 learned absolute position vectors. Modern LLMs almost universally use Rotary Position Embeddings (RoPE), which encode position by rotating query and key vectors, pairwise, by position-proportional angles:

EQ 2.4 — ROPE ROTATION $$ \begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \;\;\cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}, \qquad \theta_i = b^{-2i/d_k} $$

Each consecutive pair of dimensions $(2i, 2i{+}1)$ of a query at position $m$ is rotated by angle $m\theta_i$; keys likewise. The base $b$ (10,000 originally; 500,000 in Llama 3 for long context) sets the frequency spectrum: low-$i$ pairs spin fast (fine positional detail), high-$i$ pairs spin slowly (coarse, long-range).

RoPE with base $ b = 10{,}000 $, $ d_k = 4 $. For the pair $ i = 1 $, the frequency is $ \theta_1 = b^{-2i/d_k} = 10000^{-0.5} = 0.01 $. What rotation angle $ m\theta_1 $ (in radians) does a query at position $ m = 50 $ receive on that pair?

$ \theta_1 = 10000^{-2\cdot1/4} = 10000^{-0.5} = 1/100 = 0.01 $ rad/position. At $ m = 50 $: angle $ = 50 \times 0.01 = $ 0.5 radians.
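EQ 2.4 extends from one pair to a whole vector by giving each pair its own frequency. The following is a minimal sketch for a single vector; `rope_freqs` and `apply_rope` are names chosen here, not a library API, and production implementations batch this over positions and heads.

```python
import numpy as np

def rope_freqs(d_k, base=10_000.0):
    # theta_i = base^(-2i/d_k) for i = 0 .. d_k/2 - 1
    i = np.arange(d_k // 2)
    return base ** (-2.0 * i / d_k)

def apply_rope(x, pos, base=10_000.0):
    # rotate each consecutive pair (2i, 2i+1) of x by the angle pos * theta_i
    ang = pos * rope_freqs(x.shape[-1], base)
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

theta = rope_freqs(4)
print(theta)                  # theta_0 = 1, theta_1 = 0.01
print(50 * theta[1])          # rotation angle at m = 50 on pair i = 1: about 0.5 rad

q = np.array([1.0, 0.0, 1.0, 0.0])
q50 = apply_rope(q, 50)
print(np.round(q50, 4))       # pair 1 becomes (cos 0.5, sin 0.5); pair 0 turned by 50 rad
```

Rotation never changes a vector's length, so RoPE alters only the angle between queries and keys, not their magnitudes.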

EQ 2.5 — WHY IT WORKS: RELATIVITY $$ \langle \mathrm{R}_m q,\; \mathrm{R}_n k \rangle \;=\; \langle q,\; \mathrm{R}_{\,n-m}\, k \rangle $$

The dot product after rotation depends only on the offset $n - m$, never on absolute positions. Attention scores become translation-invariant functions of relative distance — exactly the right inductive bias for language, and the property every long-context extension method (Chapter 09) manipulates.
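The identity in EQ 2.5 follows from two properties of rotation matrices: the transpose of a rotation is the reverse rotation, $R_m^{\top} = R_{-m}$, and rotations about the same axis compose by adding angles, $R_a R_b = R_{a+b}$. Spelled out for one dimension pair:

$$ \langle R_m q,\; R_n k \rangle \;=\; q^{\top} R_m^{\top} R_n\, k \;=\; q^{\top} R_{-m} R_n\, k \;=\; q^{\top} R_{\,n-m}\, k \;=\; \langle q,\; R_{\,n-m}\, k \rangle $$

The full RoPE matrix is block-diagonal with one such $2 \times 2$ rotation per pair, so the same argument applies pair by pair and the total score is a sum of terms that each depend only on $n - m$.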

PYTHON · RUNNABLE IN-BROWSER

# RoPE relativity: dot(R_m q, R_n k) depends only on n - m  (EQ 2.5)
import numpy as np
rng = np.random.default_rng(0)
theta = 0.35                                  # one frequency dial

def R(pos):                                   # 2x2 rotation by pos*theta
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]])

q, k = rng.normal(0, 1, (2, 2))               # one 2-D query/key pair
print("dot(R_m q, R_n k):    key n=0    n=4    n=8")
for m in (0, 4, 8):
    row = [(R(m) @ q) @ (R(n) @ k) for n in (0, 4, 8)]
    print(f"  query m={m}        " + "  ".join(f"{v:6.3f}" for v in row))
print("\nconstant along diagonals: (0,4) = (4,8); (0,0) = (4,4) = (8,8).")
print("absolute position cancels in the dot product; only n - m survives.")

offs = np.arange(-16, 17)                     # score as a function of offset
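# plot_xy is not a NumPy function; it is presumably a plotting helper supplied by the page's in-browser runtime
plot_xy(offs, [q @ (R(o) @ k) for o in offs])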


INSTRUMENT 2.1 — RoPE FREQUENCY DIALS · 8 OF d_k/2 ROTATION PAIRS

[Interactive controls: token position m (initially 0) and RoPE base b]

Each dial is one dimension pair of a query/key vector; its needle rotates at θᵢ radians per position. Fast dials (left) encode fine local order; slow dials (right) only complete a turn after thousands of tokens. Raising the base to 500K slows every dial except the first (θ₀ = 1 for any base), stretching the spectrum toward longer ranges: the first ingredient of long context (Chapter 09).

ALiBi is the notable alternative: skip position vectors entirely and subtract a linear penalty $m_h \cdot (i - j)$ from the attention score between query position $i$ and key position $j$, where $m_h$ is a fixed slope specific to head $h$ (not the token position $m$ used above). It is simple and extrapolates gracefully to longer sequences. RoPE won on quality; its base-frequency scaling tricks won on context length.
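For comparison with RoPE, the ALiBi bias is simple enough to build directly. The sketch below assumes the slope schedule from the ALiBi paper for power-of-two head counts (a geometric sequence starting at $2^{-8/H}$ for $H$ heads); the function names are chosen here for illustration.

```python
import numpy as np

def alibi_slopes(n_heads):
    # geometric sequence 2^(-8/H), 2^(-16/H), ..., 2^(-8); assumes n_heads is a power of two
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(n_heads, T):
    # bias[h, i, j] = -slope_h * (i - j) for keys j at or before query i, -inf for future keys
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    dist = i - j
    bias = -alibi_slopes(n_heads)[:, None, None] * dist
    return np.where(dist >= 0, bias, -np.inf)

slopes = alibi_slopes(8)
bias = alibi_bias(8, 4)
print(slopes)             # 1/2, 1/4, ..., 1/256
print(bias[0])            # head 0: penalty grows by 0.5 per step of distance
```

The bias is added to the pre-softmax scores in place of any position vectors; steep-slope heads attend almost locally, shallow-slope heads see far back.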

2.6

Counting parameters

Per block, with MHA and a SwiGLU MLP at $d_{\text{ff}} = \tfrac{8}{3} d$ (writing $d = d_{\text{model}}$): attention contributes $4d^2$ (the $W_Q, W_K, W_V, W_O$ projections) and the MLP $3 \times \tfrac{8}{3} d^2 = 8d^2$. So:

EQ 2.6 — PARAMETER BUDGET $$ N \;\approx\; \underbrace{12\, L\, d^2}_{\text{blocks}} \;+\; \underbrace{2\,|V|\, d}_{\text{embed + unembed}} $$

Roughly two-thirds of block parameters sit in the MLPs, one-third in attention. GQA (Chapter 03) shrinks the K/V share further. The embedding term matters at small scale (a 1B model with a 256K vocabulary spends ~40% of its budget there) and fades at large scale.

Estimate the block parameters $ 12\,L\,d^2 $ for a model with $ L = 12 $ layers and $ d = 768 $. Give your answer in millions (M). (Use $ 768^2 = 589{,}824 $.)

$ 12 \times 12 \times 589{,}824 = 144 \times 589{,}824 = 84{,}934{,}656 \approx $ 84.93 M parameters — roughly the GPT-2-base block budget.
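EQ 2.6 is easy to turn into a small calculator. This is a sketch of the estimate only: it ignores norm parameters, biases, GQA, and non-$\tfrac{8}{3}$ MLP widths, and the function names are chosen here. In the second example, $L = 80$ and reading “256K” as 256,000 are assumptions made for illustration.

```python
def block_params(L, d):
    attn = 4 * d * d              # W_Q, W_K, W_V, W_O
    mlp = 8 * d * d               # three matrices of size d x (8/3)d
    return L * (attn + mlp)

def total_params(L, d, vocab):
    return block_params(L, d) + 2 * vocab * d    # blocks + embed + unembed

def embedding_share(L, d, vocab):
    return 2 * vocab * d / total_params(L, d, vocab)

print(f"{block_params(12, 768) / 1e6:.2f} M")    # 84.93 M, matching the worked example

# a narrow model with a large vocabulary
share = embedding_share(80, 1024, 256_000)
print(f"embedding share: {share:.0%}")
```

Because the block term grows with $d^2$ and the embedding term only with $d$, the embedding share falls quickly as the model widens.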

INSTRUMENT 2.2 — PARAMETER BUDGET · EQ 2.6 · LIVE

[Interactive controls: layers L (default 80), width d_model (default 8,192), vocabulary size; readout: total parameters N]

Defaults reproduce Llama-2-70B's shape. Shrink d_model to 1,024 with the 256K vocabulary and watch the embedding table eat the model — the small-model regime where vocabulary choices dominate.

Model	Params	L	d_model	Heads (KV)	Context	Notes
GPT-2 XL (2019)	1.5B	48	1,600	25	1K	Learned absolute positions, GELU, pre-norm (post-norm era ends)
Llama-2-70B (2023)	70B	80	8,192	64 (8)	4K	RoPE, SwiGLU, RMSNorm, GQA
Llama-3.1-405B (2024)	405B	126	16,384	128 (8)	128K	15T training tokens, RoPE base 500K
DeepSeek-V3 (2024)	671B total / 37B active	61	7,168	128 (MLA)	128K	MoE: 256 experts, 8 routed + 1 shared

The block diagram leaves one box closed — the attention layer itself, the only place tokens talk to each other, and the component the industry has re-engineered most aggressively. Chapter 03 opens it completely.
