AI // ENCYCLOPEDIA / STATISTICS / 08 / INFORMATION THEORY INDEX NEXT: THE DATA PROBLEM →
MATHEMATICS & STATISTICS · CHAPTER 08 / 08

Information Theory

In 1948 Claude Shannon laid a foundation that still governs machine learning. He measured surprise as a number, entropy, and proved it is the irreducible cost of communicating, compressing, or predicting a random source. Measured between what a model predicts and what actually happens, that same quantity is the cross-entropy loss that trains neural networks. This chapter builds entropy from one axiom, derives cross-entropy and KL divergence, then connects them to the loss function.

LEVELCORE READING TIME≈ 24 MIN BUILDS ONSTATS 01–07 INSTRUMENTSENTROPY · KL · HUFFMAN
8.1

Entropy — measuring surprise

Start with one demand: how surprised should you be by an outcome? A coin landing heads when you knew it was rigged to always land heads is no surprise at all. A fair coin landing heads is exactly one bit of surprise. Shannon insisted surprise depend only on the probability of the outcome, that a certain event (\(p = 1\)) carry zero surprise, and that the surprise of two independent events add. Only one function satisfies all three: the negative logarithm.

EQ S8.1 — SURPRISAL (SELF-INFORMATION) $$ I(x) \;=\; \log_2 \frac{1}{p(x)} \;=\; -\log_2 p(x) \qquad [\text{bits}] $$
The surprisal of an outcome is how many bits it would take to encode it optimally. A coin flip (\(p = \tfrac12\)) costs \(1\) bit; a one-in-a-million event costs \(\approx 20\) bits; a certainty costs \(0\). Independence forces additivity, and \(\log(ab) = \log a + \log b\) is the only function that turns the product of independent probabilities into a sum of surprises. Switch the log base to change the unit: base 2 → bits, base \(e\) → nats, base 10 → bans.

Entropy is the average surprisal — the expected number of bits per outcome when the source emits symbols according to distribution \(p\). It is the single number that says how uncertain, how unpredictable, how compressible a source is.

EQ S8.2 — SHANNON ENTROPY $$ H(p) \;=\; \mathbb{E}_{x \sim p}\big[\,I(x)\,\big] \;=\; -\sum_{x} p(x)\,\log_2 p(x) \qquad [\text{bits}] $$
By convention \(0 \log 0 = 0\) (an impossible symbol contributes nothing). Entropy is maximized by the uniform distribution — when every outcome is equally likely you cannot do better than guessing, so uncertainty is highest — and minimized (zero) by a point mass, where one outcome is certain. For \(K\) equally likely symbols, \(H = \log_2 K\): a fair die is \(\log_2 6 \approx 2.585\) bits, a fair coin exactly \(1\).

The two-outcome case has a name — the binary entropy function \(H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)\) — and a famous shape: a smooth arch peaking at exactly \(1\) bit when \(p = \tfrac12\) and collapsing to \(0\) at both ends. That arch is the first thing to internalize, because the curve of a training loss is the same idea wearing different clothes.

What is the Shannon entropy of a fair coin — outcomes \( \{H, T\} \) each with probability \( \tfrac12 \) — measured in bits?
\( H = -\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12 = -\tfrac12(-1) - \tfrac12(-1) = \tfrac12 + \tfrac12 = \) 1 bit. This is the definition of the unit: one fair binary choice is exactly one bit of information.
A source emits one of 4 equally likely symbols. What is its entropy in bits? (\( H = \log_2 K \).)
\( H = \log_2 4 = \) 2 bits. Four equiprobable outcomes need two yes/no questions to pin down — and no coding scheme can average fewer than two bits per symbol.
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.2: entropy of a distribution + the binary-entropy arch
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, float)
    p = p[p > 0]                       # 0*log0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

print("fair coin   H =", round(entropy_bits([0.5, 0.5]), 4), "bits")
print("biased coin H =", round(entropy_bits([0.9, 0.1]), 4), "bits")
print("fair die    H =", round(entropy_bits([1/6]*6), 4), "bits  (= log2 6)")
print("loaded die  H =", round(entropy_bits([0.5,0.1,0.1,0.1,0.1,0.1]), 4), "bits")

# the binary entropy function H_b(p): an arch peaking at p = 0.5
ps = np.linspace(0.001, 0.999, 200)
Hb = -ps*np.log2(ps) - (1-ps)*np.log2(1-ps)
print("\npeak of H_b at p =", round(float(ps[Hb.argmax()]), 3),
      "-> H =", round(float(Hb.max()), 4), "bit")
plot_xy(ps, Hb)
edits are live — break it on purpose
INSTRUMENT S8.1 — ENTROPY EXPLORERDRAG THE BARS · H PEAKS AT UNIFORM
PROBABILITY OVER 5 SYMBOLS — DRAG A BAR (RENORMALIZES TO SUM 1)
ENTROPY H
MAX POSSIBLE (log₂ 5)
FRACTION OF MAX
Drag any bar up or down — the rest rescale so the distribution always sums to one. Watch the entropy readout: it is highest when all five bars are level (the uniform, \(\log_2 5 = 2.322\) bits) and falls toward zero as you pile all the mass onto one symbol. The mint guideline marks the uniform height; pull a bar above it and entropy drops, because concentration is the opposite of surprise.
8.2

Cross-entropy & KL divergence

Entropy assumes you know the true distribution \(p\). But a model only ever has an estimate \(q\). Cross-entropy asks the practical question: if reality is \(p\) but you encode it using a code built for \(q\), how many bits per symbol do you actually pay?

EQ S8.3 — CROSS-ENTROPY $$ H(p, q) \;=\; -\sum_{x} p(x)\,\log_2 q(x) \qquad [\text{bits}] $$
Outcomes still happen with the true frequency \(p(x)\), but each is charged at the wrong codeword length \(-\log_2 q(x)\). If your model is right (\(q = p\)), cross-entropy collapses to entropy, \(H(p,p) = H(p)\) — you cannot beat the source's own entropy. If your model is wrong, you pay strictly more. That excess is the entire point of the next equation.

The gap between paying \(H(p,q)\) and the irreducible floor \(H(p)\) is the Kullback–Leibler divergence — the number of wasted bits caused by believing \(q\) when the truth is \(p\).

EQ S8.4 — KL DIVERGENCE (RELATIVE ENTROPY) $$ D_{\mathrm{KL}}(p \,\Vert\, q) \;=\; \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)} \;=\; H(p, q) - H(p) \;\ge\; 0 $$
The decomposition \(H(p,q) = H(p) + D_{\mathrm{KL}}(p \Vert q)\) is the load-bearing identity of this chapter: the cross-entropy you minimize in training is the irreducible entropy of the data plus the divergence of your model from the truth. Since \(H(p)\) is a constant you cannot change, minimizing cross-entropy is exactly minimizing KL divergence. By Gibbs' inequality \(D_{\mathrm{KL}} \ge 0\), with equality iff \(q = p\) everywhere.

KL is not a metric. It is non-negative and zero only when the distributions match, but it is asymmetric — \(D_{\mathrm{KL}}(p\Vert q) \ne D_{\mathrm{KL}}(q\Vert p)\) in general — and it violates the triangle inequality. The asymmetry is not a flaw; it encodes a real modelling choice. Forward KL \(D_{\mathrm{KL}}(p\Vert q)\), the form inside maximum-likelihood training, is mass-covering: it punishes \(q\) heavily for assigning near-zero probability anywhere \(p\) has mass, so it spreads \(q\) to cover every mode. Reverse KL \(D_{\mathrm{KL}}(q\Vert p)\), the form inside variational inference (the ELBO, §8.5), is mode-seeking: it lets \(q\) ignore parts of \(p\) and lock onto a single mode. Which way you write the bars decides whether your model hedges or commits.

What is \( D_{\mathrm{KL}}(p \,\Vert\, p) \) — the KL divergence of any distribution from itself?
Each term is \( p(x)\log_2\dfrac{p(x)}{p(x)} = p(x)\log_2 1 = p(x)\cdot 0 = 0 \), so the sum is 0. A perfect model wastes no bits — this is the floor every cross-entropy loss is descending toward, and the equality case of Gibbs' inequality.
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.3 / S8.4: entropy, cross-entropy, KL -- and KL's asymmetry
import numpy as np

def entropy(p):       return float(-(p * np.log2(p)).sum())
def cross_entropy(p,q):return float(-(p * np.log2(q)).sum())
def kl(p, q):         return float((p * np.log2(p/q)).sum())

p = np.array([0.7, 0.2, 0.1])      # the truth
q = np.array([0.2, 0.5, 0.3])      # a wrong model

H   = entropy(p)
Hpq = cross_entropy(p, q)
print(f"H(p)            = {H:.4f} bits   (irreducible floor)")
print(f"H(p, q)         = {Hpq:.4f} bits   (what you actually pay)")
print(f"KL(p || q)      = {kl(p, q):.4f} bits   (wasted bits)")
print(f"identity check  : H(p)+KL = {H + kl(p,q):.4f}  == H(p,q)?",
      np.isclose(H + kl(p,q), Hpq))

print(f"\nKL(p || q)      = {kl(p, q):.4f}")
print(f"KL(q || p)      = {kl(q, p):.4f}   <- different! KL is asymmetric")
print(f"KL(p || p)      = {kl(p, p):.4f}   <- zero iff the model is exact")
edits are live — break it on purpose
INSTRUMENT S8.2 — KL ASYMMETRY VISUALIZER3 SYMBOLS · KL(P‖Q) vs KL(Q‖P)
KL(P ‖ Q) — FORWARD
KL(Q ‖ P) — REVERSE
ASYMMETRY GAP
The fixed truth is P = (0.7, 0.2, 0.1) (mint bars); drag the sliders to reshape your model Q (blue bars; the third bar fills the remainder). Forward and reverse KL are almost never equal — push a Q-bar toward zero where P has real mass and forward KL explodes (mass-covering punishes it), while reverse KL stays mild. Both readouts hit 0 only when the blue bars exactly overlay the mint ones.
8.3

Mutual information — shared surprise

So far, one variable. Mutual information asks how much knowing one random variable tells you about another: how many bits of \(Y\)'s uncertainty vanish once you observe \(X\). It is the KL divergence between the joint distribution and the product of the marginals — i.e. how far \(X\) and \(Y\) are from being independent.

EQ S8.5 — MUTUAL INFORMATION $$ I(X; Y) \;=\; \sum_{x, y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)} \;=\; H(Y) - H(Y \mid X) \;=\; D_{\mathrm{KL}}\big(p(x,y) \,\Vert\, p(x)p(y)\big) $$
The middle form is the most intuitive: \(H(Y)\) is your uncertainty about \(Y\) before, \(H(Y\mid X)\) is what remains after seeing \(X\), and the drop is the information \(X\) carried. \(I(X;Y) \ge 0\), and \(I(X;Y) = 0\) iff \(X\) and \(Y\) are independent (the joint factorizes, the KL vanishes). Unlike KL, mutual information is symmetric: \(I(X;Y) = I(Y;X)\). It captures arbitrary nonlinear dependence — where correlation sees only straight lines.

Mutual information is the quiet workhorse behind a surprising amount of machine learning. Decision trees split on the feature with the highest information gain — mutual information between a feature and the label. Feature-selection ranks inputs by \(I(\text{feature}; \text{target})\). The information bottleneck frames representation learning as compressing \(X\) into \(Z\) while preserving \(I(Z; Y)\); InfoNCE and contrastive objectives are lower bounds on mutual information between views of the same datum. Wherever the question is "how related are these, beyond linear correlation?", mutual information is the honest answer — though estimating it from samples in high dimensions is notoriously hard and an active research area.

KEY

Correlation sees lines; mutual information sees structure. Two variables related by \(Y = X^2\) with \(X\) symmetric about zero have correlation exactly zero — yet \(X\) determines \(Y\) completely, so their mutual information is large. Any time you reach for "are these independent?", the bit-accurate test is \(I(X;Y) = 0\), not \(\rho = 0\).

PYTHON · RUNNABLE IN-BROWSER
# EQ S8.5: mutual information from a joint table, three identities agree
import numpy as np

# joint p(x,y) over a 2x2 grid -- correlated, not independent
P = np.array([[0.40, 0.10],
              [0.10, 0.40]])
px = P.sum(1, keepdims=True)        # marginal of X
py = P.sum(0, keepdims=True)        # marginal of Y

mask = P > 0
I = float((P[mask] * np.log2(P[mask] / (px @ py)[mask])).sum())

def H(p):                           # entropy of a flat distribution
    p = p[p > 0]; return float(-(p * np.log2(p)).sum())
HY   = H(py.ravel())
HY_X = float(-(P[mask] * np.log2((P / px)[mask])).sum())   # H(Y|X)

print(f"I(X;Y) via KL of joint vs product : {I:.4f} bits")
print(f"I(X;Y) via H(Y) - H(Y|X)          : {HY - HY_X:.4f} bits")
print(f"H(Y) = {HY:.3f}, H(Y|X) = {HY_X:.3f}  -> X removes that gap from Y")

indep = px @ py                     # what independence would look like
print("\nif X,Y were independent, I would be:",
      round(float((indep[indep>0]*np.log2((indep/(px@py))[indep>0])).sum()), 4))
edits are live — break it on purpose
8.4

Source coding — entropy as a compression limit

Entropy is not just a measure of uncertainty; it is a hard physical bound. Shannon's source coding theorem says: to encode symbols from a source with entropy \(H\) into bits without loss, you need on average at least \(H\) bits per symbol, and you can get arbitrarily close to \(H\) with a clever enough code. No lossless compressor — not ZIP, not a neural one — can beat the entropy of the source it is fed. Entropy is the compression limit.

EQ S8.6 — SOURCE CODING THEOREM (BOUNDS) $$ H(p) \;\le\; L^{*} \;<\; H(p) + 1 \qquad\text{where}\qquad L^{*} = \min_{\text{codes}} \sum_x p(x)\,\ell(x) $$
\(L^{*}\) is the expected codeword length of the best possible prefix-free code; \(\ell(x)\) is the length assigned to symbol \(x\). The optimal length is \(\ell(x) = -\log_2 p(x)\) — the surprisal of EQ S8.1 — so common symbols get short codes and rare symbols get long ones. The "\(+1\)" slack is the integer rounding penalty (you cannot use \(2.3\) bits for one symbol); coding many symbols at once, or arithmetic coding, drives the average down to \(H\) itself.

Huffman coding is the classic constructive proof: repeatedly merge the two least-probable symbols into a subtree, and the resulting prefix code is provably optimal among integer-length codes. When all probabilities are negative powers of two — a dyadic distribution — Huffman hits the entropy bound exactly, with zero slack.

A source emits four symbols with probabilities \( (\tfrac12, \tfrac14, \tfrac18, \tfrac18) \). What is its entropy — and the expected length of the optimal Huffman code — in bits per symbol?
Surprisals: \(-\log_2\tfrac12 = 1\), \(-\log_2\tfrac14 = 2\), \(-\log_2\tfrac18 = 3\), \(-\log_2\tfrac18 = 3\). So \( H = \tfrac12(1) + \tfrac14(2) + \tfrac18(3) + \tfrac18(3) = 0.5 + 0.5 + 0.375 + 0.375 = \) 1.75 bits. Because every probability is a power of two (dyadic), Huffman assigns lengths \(1,2,3,3\) and achieves this entropy exactly — no rounding waste.
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.6: build a Huffman code, compare its length to the entropy bound
import numpy as np, heapq

def huffman_lengths(p):
    # heap of (prob, tie, node); node is leaf-id or (left,right)
    h = [(pi, i, i) for i, pi in enumerate(p)]
    heapq.heapify(h); nxt = len(p)
    while len(h) > 1:
        a = heapq.heappop(h); b = heapq.heappop(h)
        heapq.heappush(h, (a[0]+b[0], nxt, (a[2], b[2]))); nxt += 1
    lengths = {}
    def walk(node, d):
        if isinstance(node, tuple): walk(node[0], d+1); walk(node[1], d+1)
        else: lengths[node] = max(d, 1)
    walk(h[0][2], 0)
    return [lengths[i] for i in range(len(p))]

for name, p in [("dyadic ", [0.5, 0.25, 0.125, 0.125]),
                ("uniform", [0.25]*4),
                ("skewed ", [0.6, 0.2, 0.1, 0.1])]:
    p = np.array(p)
    H = float(-(p*np.log2(p)).sum())
    L = float((p * np.array(huffman_lengths(p))).sum())
    print(f"{name}: H = {H:.4f}  Huffman L = {L:.4f}  "
          f"slack = {L-H:+.4f}  (theorem: 0 <= slack < 1)")
edits are live — break it on purpose
INSTRUMENT S8.3 — CODING-LENGTH / HUFFMAN DEMO4 SYMBOLS · L vs ENTROPY BOUND · EQ S8.6
ENTROPY H (FLOOR)
HUFFMAN LENGTH L
SLACK L − H
Each row shows a symbol, its probability, the Huffman codeword built by merging the two rarest symbols, and that codeword's length. The DYADIC preset hits zero slack — \(L = H = 1.75\) bits — because every probability is a power of two and the surprisal \(-\log_2 p\) is already a whole number of bits. UNIFORM over four symbols also lands exactly on \(H = 2\); the SKEWED source pays a small rounding penalty (slack between 0 and 1), exactly as EQ S8.6 promises.
8.5

The bridge to ML — cross-entropy loss, perplexity, the ELBO

Here is the payoff. A classifier outputs a predicted distribution \(q\) over labels; the true label is a one-hot distribution \(p\) (all mass on the correct class \(c\)). Plug into cross-entropy, EQ S8.3:

EQ S8.7 — CROSS-ENTROPY LOSS = NEGATIVE LOG-LIKELIHOOD $$ \mathcal{L} \;=\; H(p, q) \;=\; -\sum_{k} p_k \log q_k \;=\; -\log q_c \qquad (p \text{ one-hot at the true class } c) $$
Because \(p\) is one-hot, every term vanishes except \(k = c\), and the loss reduces to the negative log-probability the model assigned to the correct answer — the negative log-likelihood (NLL). Minimizing it over a dataset is maximum-likelihood estimation; via EQ S8.4 it is also minimizing \(D_{\mathrm{KL}}(p \Vert q)\), pushing the model's distribution toward the data's. The softmax that produces \(q\) and this cross-entropy are paired precisely because the softmax's gradient through the loss is the clean \(q - p\) (see Vol I · EQ M2.3 and Vol II · EQ 4.1). Every neural classifier and every language model is trained by descending this single Shannon quantity.

For language models, the same loss wears a friendlier name. The geometric-mean per-token uncertainty is perplexity — the exponential of the cross-entropy — interpreted as the effective number of equally likely choices the model faces at each step.

EQ S8.8 — PERPLEXITY $$ \mathrm{PPL} \;=\; b^{\,H(p, q)} \;=\; \exp\!\Big(\!-\tfrac{1}{N}\textstyle\sum_{i=1}^{N} \log q(x_i \mid x_{<i})\Big) $$
\(b\) is the log base (use \(e\) with natural-log cross-entropy, \(2\) with bits — they give the same perplexity). A model with cross-entropy \(H\) bits is as uncertain as a fair die with \(2^{H}\) faces. A fair coin (\(H = 1\) bit) has perplexity \(2\); a perfect model has perplexity \(1\). Halving a language model's bits-per-token squares its odds of guessing the next word — which is why every percentage point of cross-entropy is fought over so hard.

The reach goes further. Generative models that cannot evaluate their own likelihood — VAEs, diffusion models — are trained on the evidence lower bound (ELBO), which is once again entropy bookkeeping. Maximizing the ELBO is maximizing a reconstruction term minus a KL term:

EQ S8.9 — THE ELBO $$ \log p(x) \;\ge\; \underbrace{\mathbb{E}_{q(z\mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\big(q(z\mid x) \,\Vert\, p(z)\big)}_{\text{regularizer}} \;=\; \mathcal{L}_{\text{ELBO}} $$
The gap between the true log-evidence \(\log p(x)\) and the bound is exactly \(D_{\mathrm{KL}}(q(z\mid x)\Vert p(z\mid x))\) — a KL that is \(\ge 0\), which is why the ELBO is a lower bound. Maximizing it pushes the approximate posterior \(q(z\mid x)\) toward the true posterior while keeping the latent code close to the prior \(p(z)\). The same three quantities — log-likelihood, cross-entropy, KL — that opened this chapter close it, now steering every variational and diffusion model. Information theory is not adjacent to machine learning; it is its loss function.
PYTHON · RUNNABLE IN-BROWSER
# EQ S8.7: softmax cross-entropy == negative log-likelihood, on a tiny example
import numpy as np

def softmax(z):
    e = np.exp(z - z.max()); return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])     # raw model scores for 3 classes
c = 0                                   # the true class index
q = softmax(logits)
onehot = np.eye(3)[c]                    # true distribution p (one-hot)

ce_full = float(-(onehot * np.log(q)).sum())   # EQ S8.3 with one-hot p
nll     = float(-np.log(q[c]))                 # collapses to -log q_c
print("softmax q        :", q.round(4))
print(f"cross-entropy    : {ce_full:.4f} nats   ({ce_full/np.log(2):.4f} bits)")
print(f"neg log-likelihd : {nll:.4f} nats   <- identical, EQ S8.7")

# perplexity over a tiny sequence of true-class probabilities (EQ S8.8)
probs = np.array([0.659, 0.40, 0.90, 0.25])     # q(x_i) the model gave the truth
ppl = float(np.exp(-np.log(probs).mean()))
print(f"\nsequence perplexity : {ppl:.3f}  (a perfect model -> 1.0)")
print(f"a fair coin per step would give PPL = {2**1.0:.1f}")
edits are live — break it on purpose
NEXT

Shannon gave us the unit of information and the loss that consumes it — but every bit of entropy in this chapter assumed a clean distribution to count. Real data is messy, biased, leaky, and never the distribution you wished for. The Data volume opens by confronting that gap head-on: where datasets come from, how they lie, and why the hardest part of machine learning happens before a single gradient is computed. Next: Data · 01 — The Data Problem.

8.R

References

  1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27 — the founding paper: entropy, source coding, and channel capacity (EQ S8.2, S8.6).
  2. Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency. Annals of Mathematical Statistics 22(1) — relative entropy / KL divergence (EQ S8.4).
  3. Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley — the standard graduate text; entropy, mutual information, source coding, and their inequalities.
  4. Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE 40(9) — the optimal prefix code of §8.4 (EQ S8.6, Instrument S8.3).
  5. MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press — free online; the canonical bridge from Shannon to machine learning.
  6. Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR — the variational autoencoder and the ELBO objective (EQ S8.9).
  7. Tishby, N., Pereira, F. C. & Bialek, N. (2000). The Information Bottleneck Method. arXiv — mutual information as a principle for representation learning (§8.3).