The single trick: next-token prediction
A language model defines a probability distribution over sequences of tokens. The defining move of autoregressive models, which covers every modern LLM from GPT-2 to the current frontier, is to factor that joint distribution with the chain rule of probability, one token at a time:

\[ p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}) \]
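Each conditional is a categorical distribution over the vocabulary, produced by a neural network \( f_\theta \) (the transformer of Chapter 02) followed by a softmax:

\[ p_\theta(x_t = i \mid x_{<t}) = \frac{\exp z_i}{\sum_{j=1}^{|V|} \exp z_j}, \qquad z = f_\theta(x_{<t}) \in \mathbb{R}^{|V|} \]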
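As a minimal sketch of the softmax step, the snippet below turns a vector of raw scores (logits) into a categorical distribution. The five-word vocabulary and the logit values are made up for illustration; they stand in for whatever \( f_\theta \) would actually emit.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores (logits) to a probability distribution."""
    z = logits - np.max(logits)   # subtracting the max avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for a toy 5-token vocabulary (not from a real model).
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])

probs = softmax(logits)
for tok, p in zip(vocab, probs):
    print(f"{tok:>4}: {p:.3f}")
print("sum =", probs.sum())
```

Subtracting the maximum logit does not change the result, because softmax is invariant to adding a constant to every logit; it only keeps the exponentials in a safe numeric range.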
Generation is just repeated application: feed the context in, get a distribution out, pick a token (Chapter 08 covers how to pick), append it, repeat. This loop — one forward pass per emitted token — is why inference economics are dominated by the cost of a single forward pass, and why so much of Chapter 08 is about amortizing it.
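The loop can be sketched in a few lines. The "model" here is an assumption made purely for illustration: a fixed random table of logits that looks only at the last token, whereas a real transformer conditions on the whole prefix. Sampling directly from the distribution is likewise just one possible way to pick.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sat", "."]
V = len(vocab)

# Stand-in for f_theta: random logits indexed by the previous token only.
logit_table = rng.normal(size=(V, V))

def next_token_distribution(context):
    """Return a distribution over the vocabulary given the context so far."""
    logits = logit_table[context[-1]]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(context, n_new):
    """Autoregressive loop: predict, pick, append, repeat."""
    context = list(context)
    for _ in range(n_new):
        probs = next_token_distribution(context)  # one "forward pass" per new token
        tok = int(rng.choice(V, p=probs))         # pick a token by sampling
        context.append(tok)                       # the choice becomes part of the input
    return context

ids = generate([0], n_new=6)
print(" ".join(vocab[i] for i in ids))
```

Each emitted token costs one call to `next_token_distribution`, which is the per-token forward pass the paragraph above refers to.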
The objective is nominally unsupervised, but the supervision comes for free. Any text is its own training signal: the label for the prediction made at position \(t\) is simply the token at position \(t+1\). This is what lets LLMs train on trillions of tokens without human labeling, and it is also why a base model imitates the internet rather than answering questions helpfully (fixed in Chapter 05).
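To make the "free labels" idea concrete, here is a small sketch that turns one sentence into training pairs by shifting it one step. Whole words stand in for tokens, which is a simplification.

```python
# One sentence yields one training example per position, with no human labeling.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

inputs = tokens[:-1]   # what the model sees
targets = tokens[1:]   # what it must predict: the same text shifted by one

for t, target in enumerate(targets):
    prefix = inputs[: t + 1]
    print(f"{' '.join(prefix):<22} -> {target}")
```

A sequence of length \(T\) therefore provides \(T-1\) prediction targets, each paired with the prefix that precedes it.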
Tokens & byte-pair encoding
Models do not read characters or words. They read tokens: entries of a fixed vocabulary \(V\) learned from data before training begins. Modern vocabularies run from 32K (Llama 2) through roughly 100K–128K (GPT-4, Llama 3) to 256K+ (Gemini class). Tokenization is a compression decision: a bigger vocabulary means fewer, more semantically loaded tokens per sentence, at the cost of a larger embedding table and a more expensive softmax.
The dominant algorithm is byte-pair encoding (BPE), usually applied at the byte level (GPT-2 style) so that any string — emoji, Korean, malformed UTF-8 — is representable with zero out-of-vocabulary failures. Training a BPE tokenizer is greedy agglomeration:
- Initialize the vocabulary with all 256 bytes.
- Count every adjacent symbol pair in the corpus. Find the most frequent pair \((a, b)\).
- Add the merged symbol \(ab\) to the vocabulary; replace every occurrence.
- Repeat until the vocabulary reaches the target size. The ordered merge list is the tokenizer.
```python
# BPE from scratch: 8 greedy merge rounds on a toy corpus
from collections import Counter

corpus = "low "*5 + "lower "*2 + "newer "*6 + "newest "*3 + "widest "*3
words = Counter(tuple(w) + ("_",) for w in corpus.split())  # _ = end of word

def merge(word, a, b):
    """Replace every adjacent (a, b) in word with the merged symbol a+b."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i+1]) == (a, b):
            out.append(a + b); i += 2
        else:
            out.append(word[i]); i += 1
    return tuple(out)

vocab = sorted({ch for w in words for ch in w})
print("base symbols:", " ".join(vocab))

for step in range(8):
    # count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for w, f in words.items():
        for pair in zip(w, w[1:]):
            pairs[pair] += f
    # merge the most frequent pair everywhere
    (a, b), f = max(pairs.items(), key=lambda kv: kv[1])
    vocab.append(a + b)
    words = Counter({merge(w, a, b): f for w, f in words.items()})
    print(f"merge {step+1}: '{a}' + '{b}' -> '{a+b}' ({f} occurrences)")

print("\nlearned tokens:", [v for v in vocab if len(v) > 1])
print("words now segment as:", ["|".join(w) for w in words])
```
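Training produces the merge list; encoding new text replays it. The sketch below shows one simple way to do that: split a word into characters and apply each merge in the order it was learned. The four-entry merge list is hand-written for illustration and is not the output of any particular training run.

```python
def apply_merge(symbols, a, b):
    """Replace every adjacent (a, b) in a symbol list with the merged symbol a+b."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def encode(word, merges):
    """Tokenize one word by replaying the merges in learned order."""
    symbols = list(word) + ["_"]   # _ marks end of word, as in the toy trainer
    for a, b in merges:
        symbols = apply_merge(symbols, a, b)
    return symbols

# Hypothetical ordered merge list (an assumption for this example).
merges = [("e", "r"), ("er", "_"), ("l", "o"), ("lo", "w")]

print(encode("lower", merges))   # a word the merges cover well
print(encode("slow", merges))    # a word that only partly matches the merges
```

A word made of frequent pieces collapses into a few tokens, while an unfamiliar word falls back to smaller pieces and, in the limit, single symbols. Nothing is ever out of vocabulary.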
Tokenization explains many famous LLM blind spots. Counting the r's in “strawberry”, reversing strings, arithmetic on long numbers — these are hard partly because the model never sees characters, only opaque token IDs whose internal spelling it must infer statistically. Number tokenization (1–3 digit chunks, right-to-left in modern tokenizers) measurably affects arithmetic accuracy.
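The effect of chunking direction on numbers can be illustrated with a toy splitter. This is only a sketch of the idea of grouping digits in chunks of up to three; it does not reproduce the rules of any specific tokenizer, and both function names are made up for the example.

```python
def chunk_right_to_left(digits, size=3):
    """Group digits into chunks of up to `size`, starting from the last digit."""
    chunks, end = [], len(digits)
    while end > 0:
        start = max(0, end - size)
        chunks.append(digits[start:end])
        end = start
    return chunks[::-1]

def chunk_left_to_right(digits, size=3):
    """Group digits into chunks of up to `size`, starting from the first digit."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

for n in ["1234567", "234567"]:
    print(n, "R2L:", chunk_right_to_left(n), " L2R:", chunk_left_to_right(n))
```

With right-to-left grouping the chunks line up with thousands groups, so the trailing digits "567" form the same chunk in both numbers. With left-to-right grouping the chunk boundaries shift with the length of the number, and the same trailing digits end up split differently.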
Embeddings: tokens become geometry
A token ID is just an index. The first learned operation gives it coordinates: row \(i\) of an embedding matrix \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\) is the vector for token \(i\). For a 128K vocabulary and \(d_{\text{model}} = 8192\) (Llama-3-70B scale) that single matrix holds ≈1B parameters.
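A short sketch of both points: the parameter count of the embedding table, and the lookup itself, which is plain row indexing. The tiny 10 × 4 table and its random values are assumptions chosen so the example runs instantly.

```python
import numpy as np

# Size of the embedding table at the scale quoted above.
vocab_size, d_model = 128_000, 8192
n_params = vocab_size * d_model
print(f"embedding parameters: {n_params / 1e9:.2f}B")

# The lookup, shown on a tiny random table (10 tokens, 4 dimensions).
rng = np.random.default_rng(0)
E = rng.normal(size=(10, 4))       # one row per token ID
token_ids = np.array([3, 7, 3])    # a short "sentence" of token IDs
X = E[token_ids]                   # shape (sequence length, d_model)
print(X.shape)
print(np.array_equal(X[0], X[2]))  # the same token ID always maps to the same vector
```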
Because embeddings are trained by gradient descent against the prediction objective, tokens that are interchangeable in context converge to nearby vectors. Direction in this space becomes meaning: similarity is measured with the cosine,

\[ \cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}. \]
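Cosine similarity is easy to compute by hand. The three-dimensional vectors below are hand-made to illustrate the idea and are not real learned embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors: "cat" and "dog" point in a similar direction, "car" does not.
emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

print(f"cat vs dog: {cosine(emb['cat'], emb['dog']):.3f}")
print(f"cat vs car: {cosine(emb['cat'], emb['car']):.3f}")
print(f"cat vs 2*cat: {cosine(emb['cat'], 2 * emb['cat']):.3f}")
```

The last line shows why the cosine is used rather than a raw dot product: rescaling a vector leaves the similarity unchanged, so only direction matters.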
The objective: cross-entropy & perplexity
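Training minimizes the negative log-likelihood of the data, or equivalently the cross-entropy between the data's “one-hot” next-token distribution and the model's prediction, averaged over every position of every sequence:

\[ \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \]

where \(T\) counts the token positions being averaged over.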
The human-readable form of this loss is perplexity, the effective branching factor: “how many equally likely tokens is the model choosing among?”

\[ \mathrm{PPL} = e^{\mathcal{L}} \]
```python
# cross-entropy by hand: -log p(true token), then PPL = e^L
import numpy as np
import matplotlib.pyplot as plt

def plot_xy(x, y):
    """Minimal stand-in (an assumption) for the plotting helper this listing calls."""
    plt.plot(x, y)
    plt.xlabel("loss L (nats/token)")
    plt.ylabel("perplexity e^L")
    plt.show()

p_true = np.array([0.50, 0.25, 0.80, 0.10])  # model's p on the TRUE next token
nll = -np.log(p_true)                        # per-position loss in nats
for t, (p, loss_t) in enumerate(zip(p_true, nll)):
    print(f"pos {t}: p(true) = {p:.2f}  -log p = {loss_t:.3f} nats")

L = nll.mean()
print(f"\nL = {L:.3f} nats (the one p=0.10 miss is {nll[3]/nll.sum():.0%} of the total)")
print(f"PPL = e^L = {np.exp(L):.2f} -> like guessing among ~3 equally likely tokens")
geo_mean = p_true.prod() ** (1 / len(p_true))
print(f"geometric mean of p = {geo_mean:.3f} = 1/PPL (check: {1/np.exp(L):.3f})")

L_axis = np.linspace(1.0, 5.0, 60)  # the exponential dial of EQ 1.7
plot_xy(L_axis, np.exp(L_axis))
```
| Quantity | Units | Reading |
|---|---|---|
| Loss \( \mathcal{L} \) | nats / token | What the optimizer sees. 1 nat = 1.443 bits. |
| Bits-per-byte | bits / byte | Tokenizer-independent compression metric — lets you compare models with different vocabularies. |
| Perplexity | dimensionless | \( e^{\mathcal{L}} \). Effective number of choices per token. |
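The three quantities in the table are related by simple conversions, sketched below. The loss value, token count, and byte count are hypothetical numbers chosen only to make the arithmetic visible.

```python
import math

def nats_to_bits(nats):
    """Convert nats to bits: 1 nat = 1/ln(2) bits."""
    return nats / math.log(2)

def bits_per_byte(loss_nats_per_token, n_tokens, n_bytes):
    """Total bits the model spends on a text, divided by its length in bytes."""
    total_bits = nats_to_bits(loss_nats_per_token) * n_tokens
    return total_bits / n_bytes

# Hypothetical evaluation: loss of 2.0 nats/token on a text that the
# tokenizer splits into 1,000 tokens covering 4,200 bytes.
loss = 2.0
ppl = math.exp(loss)
bpb = bits_per_byte(loss, n_tokens=1_000, n_bytes=4_200)

print(f"1 nat = {nats_to_bits(1.0):.3f} bits")
print(f"perplexity    = {ppl:.2f}")
print(f"bits per byte = {bpb:.3f}")
```

Because bits-per-byte divides by the byte length of the raw text rather than by the token count, a tokenizer that packs more bytes into each token does not get an unfair advantage, which is what makes the metric comparable across vocabularies.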
What emerges from a “simple” objective
Next-token prediction looks shallow and is not. To keep lowering loss on the entire internet, a model is forced to acquire whatever machinery predicts text: syntax, then facts, then style, then — at sufficient scale — multi-step structure. Three observations anchor the rest of this manual:
- Compression ⇒ understanding. The optimal next-token predictor for a corpus must internalize the regularities that generated it. Predicting the last word of “The capital of Mongolia is …” requires storing geography; predicting the next move in a chess transcript requires a board model.
- In-context learning. A trained LLM can be “programmed” at inference time: show it input→output examples inside the prompt and it continues the pattern, with no weight updates. This emergent property — essentially free few-shot learning — reshaped the field after GPT-3 demonstrated it at scale.
- Capability ≠ behavior. The base model is a simulator of its training distribution. It will complete a question with another question if that's the likeliest continuation. Turning capability into reliable, helpful, safe behavior is the entire subject of post-training (Chapter 05).
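In-context learning needs nothing more than string formatting on the caller's side. The sketch below builds a few-shot prompt; the `Input:`/`Output:` layout and the translation examples are arbitrary choices for illustration, not a required format.

```python
def few_shot_prompt(examples, query):
    """Lay out input->output demonstrations, then leave the last output blank."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")   # the model continues from here
    return "\n\n".join(blocks)

# Hypothetical English -> French demonstrations.
examples = [("cheese", "fromage"), ("bread", "pain"), ("apple", "pomme")]
prompt = few_shot_prompt(examples, "water")
print(prompt)
```

The prompt ends mid-pattern on purpose: to a next-token predictor, the likeliest continuation of an unfinished pattern is its completion, and no weights change in the process.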
We have a contract — prefix of tokens in, next-token distribution out — but \(f_\theta\) is still a black box. Chapter 02 opens it: the transformer, the residual stream, and where its billions of parameters actually sit.
Further reading
- Shannon, C. E. (1948). A Mathematical Theory of Communication. — defines entropy and the predict-the-next-symbol view of language that perplexity inherits.
- Bengio, Ducharme, Vincent & Jauvin (2003). A Neural Probabilistic Language Model. — the first neural LM to learn distributed word embeddings and a next-word objective.
- Sennrich, Haddow & Birch (2016). Neural Machine Translation of Rare Words with Subword Units. — introduced byte-pair encoding to NLP, the tokenizer recipe still in use.
- Mikolov, Sutskever, Chen, Corrado & Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. — word2vec; embeddings as geometry where direction carries meaning.
- Radford, Wu, Child, Luan, Amodei & Sutskever (2019). Language Models are Unsupervised Multitask Learners (GPT-2). — the argument that next-token prediction at scale yields general capability.
- Brown et al. (2020). Language Models are Few-Shot Learners (GPT-3). — demonstrated in-context learning as an emergent property of scale.