AI // ENCYCLOPEDIA / VOL II / 01 / FOUNDATIONS INDEX NEXT: TRANSFORMER →
CHAPTER 01 / 10

Foundations

Underneath the scale and engineering, a large language model computes one function: given a sequence of tokens, it returns a probability distribution over the next token. The rest of this manual covers the machinery built around that function. Attention, RLHF, quantization, and speculative decoding exist to compute it well, shape what it prefers, or evaluate it fast.

READING TIME≈ 20 MIN PREREQUISITESLINEAR ALGEBRA · PROBABILITY INSTRUMENTSTOKENIZER · PERPLEXITY DIAL
1.1

The single trick: next-token prediction

A language model defines a probability distribution over sequences of tokens. The defining move of autoregressive models — every modern LLM from GPT-2 to the current frontier — is to factor that joint distribution with the chain rule of probability, one token at a time:

EQ 1.1 — AUTOREGRESSIVE FACTORIZATION $$ p_\theta(x_1, x_2, \ldots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \ldots, x_{t-1}\right) $$
The model never has to score a whole sentence at once. It only ever answers one question: given everything so far, what comes next? The product of those answers is the probability of the full sequence.

Each conditional is a categorical distribution over the vocabulary, produced by a neural network \( f_\theta \) (the transformer of Chapter 02) followed by a softmax:

EQ 1.2 — FROM LOGITS TO PROBABILITIES $$ p_\theta(x_t = i \mid x_{<t}) \;=\; \mathrm{softmax}\big(f_\theta(x_{<t})\big)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}} $$
The raw network outputs \(z \in \mathbb{R}^{|V|}\) are called logits — one real number per vocabulary entry. Softmax exponentiates and normalizes them into a proper distribution. Note the softmax is shift-invariant: adding a constant to every logit changes nothing.
A 3-token vocabulary gets logits \( z = (3,\ 1,\ 0) \). What is the softmax probability of token 1 (the first one)? Use \( e^3 = 20.09,\ e^1 = 2.718,\ e^0 = 1 \).
Sum of exponentials \( = 20.09 + 2.718 + 1 = 23.81 \). Then \( p_1 = e^3 / \text{sum} = 20.09 / 23.81 = \) 0.844.

Generation is just repeated application: feed the context in, get a distribution out, pick a token (Chapter 08 covers how to pick), append it, repeat. This loop — one forward pass per emitted token — is why inference economics are dominated by the cost of a single forward pass, and why so much of Chapter 08 is about amortizing it.

KEY IDEA

The objective is unsupervised but the supervision is free. Any text is its own training signal: position \(t\)'s label is simply the token at position \(t\). This is what lets LLMs train on trillions of tokens without human labeling — and it is also why a base model imitates the internet rather than answering questions helpfully (fixed in Chapter 05).

FIG 1.AEND-TO-END PIPELINE — TEXT IN, DISTRIBUTION OUT
“The cat sat” RAW TEXT [791, 5176, 7731] TOKEN IDS 3 × d_model EMBEDDINGS TRANSFORMER L blocks · CH 02–03 z ∈ ℝ^|V| LOGITS p(next) softmax(z)
One pass, one distribution. The entire stack exists to map a prefix of token IDs to a distribution over what comes next. During generation this pipeline runs once per output token.
1.2

Tokens & byte-pair encoding

Models do not read characters or words — they read tokens: entries of a fixed vocabulary \(V\) learned from data before training begins. Modern vocabularies run from 32K (Llama 2) through 128K (Llama 3, GPT-4 class) to 256K+ (Gemini class). Tokenization is a compression decision: a bigger vocabulary means fewer, more semantically loaded tokens per sentence, at the cost of a larger embedding table and a more expensive softmax.

The dominant algorithm is byte-pair encoding (BPE), usually applied at the byte level (GPT-2 style) so that any string — emoji, Korean, malformed UTF-8 — is representable with zero out-of-vocabulary failures. Training a BPE tokenizer is greedy agglomeration:

  1. Initialize the vocabulary with all 256 bytes.
  2. Count every adjacent symbol pair in the corpus. Find the most frequent pair \((a, b)\).
  3. Add the merged symbol \(ab\) to the vocabulary; replace every occurrence.
  4. Repeat until the vocabulary reaches the target size. The ordered merge list is the tokenizer.
EQ 1.3 — BPE MERGE RULE $$ (a^\*, b^\*) \;=\; \underset{(a,b)}{\arg\max}\; \mathrm{count}\big(a \circ b\big) $$
At every step, merge the most frequent adjacent pair. SentencePiece's unigram-LM alternative instead starts from a large candidate set and prunes pieces to maximize corpus likelihood under a mixture model — same goal, top-down instead of bottom-up.
PYTHON · RUNNABLE IN-BROWSER
# BPE from scratch: 8 greedy merge rounds on a toy corpus
from collections import Counter

corpus = "low "*5 + "lower "*2 + "newer "*6 + "newest "*3 + "widest "*3
words = Counter(tuple(w) + ("_",) for w in corpus.split())   # _ = end of word

def merge(word, a, b):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i+1]) == (a, b):
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return tuple(out)

vocab = sorted({ch for w in words for ch in w})
print("base symbols:", " ".join(vocab))
for step in range(8):
    pairs = Counter()
    for w, f in words.items():
        for pair in zip(w, w[1:]): pairs[pair] += f
    (a, b), f = max(pairs.items(), key=lambda kv: kv[1])
    vocab.append(a + b)
    words = Counter({merge(w, a, b): f for w, f in words.items()})
    print(f"merge {step+1}: '{a}' + '{b}' -> '{a+b}'   ({f} occurrences)")

print("\nlearned tokens:", [v for v in vocab if len(v) > 1])
print("words now segment as:", ["|".join(w) for w in words])
edits are live — break it on purpose
INSTRUMENT 1.1 — BPE-STYLE TOKENIZERGREEDY LONGEST-MATCH · TOY VOCAB
TOKENS
CHARACTERS
CHARS / TOKEN
␣ marks a leading space — real BPE vocabularies treat “ the” and “the” as different tokens. Note how common morphemes (“train”, “ing”, “tion”) survive as units while rare words shatter.
FAILURE MODE

Tokenization explains many famous LLM blind spots. Counting the r's in “strawberry”, reversing strings, arithmetic on long numbers — these are hard partly because the model never sees characters, only opaque token IDs whose internal spelling it must infer statistically. Number tokenization (1–3 digit chunks, right-to-left in modern tokenizers) measurably affects arithmetic accuracy.

1.3

Embeddings: tokens become geometry

A token ID is just an index. The first learned operation gives it coordinates: row \(i\) of an embedding matrix \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\) is the vector for token \(i\). For a 128K vocabulary and \(d_{\text{model}} = 8192\) (Llama-3-70B scale) that single matrix holds ≈1B parameters.

EQ 1.4 — EMBEDDING LOOKUP & UNEMBEDDING $$ h_t^{(0)} = E_{x_t} \in \mathbb{R}^{d_{\text{model}}}, \qquad z = W_U\, h_T^{(L)} \quad \text{with } W_U \in \mathbb{R}^{|V| \times d_{\text{model}}} $$
At the output end, an unembedding (LM head) \(W_U\) maps the final hidden state back to logits. Many models tie \(W_U = E\) — the same geometry encodes and decodes — saving parameters; most recent large models untie them for quality.

Because embeddings are trained by gradient descent against the prediction objective, tokens that are interchangeable in context converge to nearby vectors. Direction in this space becomes meaning: similarity is measured with the cosine,

EQ 1.5 — COSINE SIMILARITY $$ \cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert} \in [-1, 1] $$
The classic king − man + woman ≈ queen arithmetic of word2vec survives in LLM embedding spaces, but the deeper story is downstream: the residual stream (Chapter 02) keeps refining these vectors layer by layer into contextual representations — “bank” drifts toward river or finance depending on its neighbors.
Two embedding vectors are \( u = (3,\ 4) \) and \( v = (4,\ 3) \). Compute \( \cos(u, v) \).
Dot product \( u \cdot v = 3\cdot4 + 4\cdot3 = 24 \). Lengths \( \lVert u \rVert = \sqrt{9+16} = 5 \) and \( \lVert v \rVert = \sqrt{16+9} = 5 \). So \( \cos = 24 / (5 \cdot 5) = 24/25 = \) 0.96.
1.4

The objective: cross-entropy & perplexity

Training minimizes the negative log-likelihood of the data — equivalently, the cross-entropy between the data's “one-hot” next-token distribution and the model's prediction, averaged over every position of every sequence:

EQ 1.6 — THE PRE-TRAINING LOSS $$ \mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{T} \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right] $$
Every token position contributes a loss term in parallel during training (teacher forcing): the model predicts position \(t\) from the true prefix, not from its own samples. One sequence of length 4,096 yields 4,096 supervised examples in a single forward pass — the second reason web-scale training is feasible.
The model assigns probability \( 0.20 \) to the true next token at one position. What is that position's cross-entropy loss \( -\ln p \), in nats? (Use \( \ln 5 = 1.609 \).)
\( -\ln(0.20) = -\ln(1/5) = \ln 5 = \) 1.609 nats. Low probability on the truth ⇒ large loss.

The human-readable form of this loss is perplexity — the effective branching factor, “how many equally likely tokens is the model choosing among?”:

EQ 1.7 — PERPLEXITY $$ \mathrm{PPL} \;=\; \exp\big(\mathcal{L}\big) $$
A uniform guesser over a 128K vocabulary has PPL = 128,000. Strong frontier models reach single-digit perplexities on held-out web text. Loss differences that look tiny — 2.30 vs 2.25 nats — represent large capability gaps, which is why scaling-law plots (Chapter 04) are drawn in loss space.
A model reaches a cross-entropy loss of \( \mathcal{L} = 2.0 \) nats/token. What is its perplexity \( e^{\mathcal{L}} \)? (Use \( e^2 = 7.389 \).)
\( \mathrm{PPL} = e^{\mathcal{L}} = e^{2.0} = \) 7.389 — the model is effectively choosing among about 7 equally likely tokens at each step.
PYTHON · RUNNABLE IN-BROWSER
# cross-entropy by hand: -log p(true token), then PPL = e^L
import numpy as np

p_true = np.array([0.50, 0.25, 0.80, 0.10])   # model's p on the TRUE next token
nll = -np.log(p_true)
for t, (p, l) in enumerate(zip(p_true, nll)):
    print(f"pos {t}:  p(true) = {p:.2f}   -log p = {l:.3f} nats")

L = nll.mean()
print(f"\nL   = {L:.3f} nats   (the one p=0.10 miss is {nll[3]/nll.sum():.0%} of the total)")
print(f"PPL = e^L = {np.exp(L):.2f}  -> like guessing among ~3 equally likely tokens")
print(f"geometric mean of p = {p_true.prod() ** 0.25:.3f} = 1/PPL  (check: {1/np.exp(L):.3f})")

L_axis = np.linspace(1.0, 5.0, 60)            # the exponential dial of EQ 1.7
plot_xy(L_axis, np.exp(L_axis))
edits are live — break it on purpose
INSTRUMENT 1.2 — PERPLEXITY DIALEQ 1.7 · LIVE
PERPLEXITY e^L
BITS / TOKEN
HISTORICAL EQUIVALENT
Slide from random guessing (11.8 nats over 128K tokens) down to the frontier. The exponential means a 0.05-nat improvement near the floor is worth more than a full nat was in 2015.
QuantityUnitsReading
Loss \( \mathcal{L} \)nats / tokenWhat the optimizer sees. 1 nat = 1.443 bits.
Bits-per-bytebits / byteTokenizer-independent compression metric — lets you compare models with different vocabularies.
Perplexitydimensionless\( e^{\mathcal{L}} \). Effective number of choices per token.
1.5

What emerges from a “simple” objective

Next-token prediction looks shallow and is not. To keep lowering loss on the entire internet, a model is forced to acquire whatever machinery predicts text: syntax, then facts, then style, then — at sufficient scale — multi-step structure. Three observations anchor the rest of this manual:

  • Compression ⇒ understanding. The optimal next-token predictor for a corpus must internalize the regularities that generated it. Predicting the last word of “The capital of Mongolia is …” requires storing geography; predicting the next move in a chess transcript requires a board model.
  • In-context learning. A trained LLM can be “programmed” at inference time: show it input→output examples inside the prompt and it continues the pattern, with no weight updates. This emergent property — essentially free few-shot learning — reshaped the field after GPT-3 demonstrated it at scale.
  • Capability ≠ behavior. The base model is a simulator of its training distribution. It will complete a question with another question if that's the likeliest continuation. Turning capability into reliable, helpful, safe behavior is the entire subject of post-training (Chapter 05).
NEXT

We have a contract — prefix of tokens in, next-token distribution out — but \(f_\theta\) is still a black box. Chapter 02 opens it: the transformer, the residual stream, and where its billions of parameters actually sit.

§

Further reading

  • Shannon, C. E. (1948). A Mathematical Theory of Communication. — defines entropy and the predict-the-next-symbol view of language that perplexity inherits.
  • Bengio, Ducharme, Vincent & Jauvin (2003). A Neural Probabilistic Language Model. — the first neural LM to learn distributed word embeddings and a next-word objective.
  • Sennrich, Haddow & Birch (2016). Neural Machine Translation of Rare Words with Subword Units. — introduced byte-pair encoding to NLP, the tokenizer recipe still in use.
  • Mikolov, Sutskever, Chen, Corrado & Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. — word2vec; embeddings as geometry where direction carries meaning.
  • Radford, Wu, Child, Luan, Amodei & Sutskever (2019). Language Models are Unsupervised Multitask Learners (GPT-2). — the argument that next-token prediction at scale yields general capability.
  • Brown et al. (2020). Language Models are Few-Shot Learners (GPT-3). — demonstrated in-context learning as an emergent property of scale.