Entropy — measuring surprise
Start with one demand: how surprised should you be by an outcome? A coin landing heads when you knew it was rigged to always land heads is no surprise at all. A fair coin landing heads is exactly one bit of surprise. Shannon insisted surprise depend only on the probability of the outcome, that a certain event (\(p = 1\)) carry zero surprise, and that the surprise of two independent events add. Only one function satisfies all three: the negative logarithm.
Entropy is the average surprisal — the expected number of bits per outcome when the source emits symbols according to distribution \(p\). It is the single number that says how uncertain, how unpredictable, how compressible a source is.
The two-outcome case has a name — the binary entropy function \(H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)\) — and a famous shape: a smooth arch peaking at exactly \(1\) bit when \(p = \tfrac12\) and collapsing to \(0\) at both ends. That arch is the first thing to internalize, because the curve of a training loss is the same idea wearing different clothes.
# EQ S8.2: entropy of a distribution + the binary-entropy arch
import numpy as np
def entropy_bits(p):
p = np.asarray(p, float)
p = p[p > 0] # 0*log0 = 0 by convention
return float(-(p * np.log2(p)).sum())
print("fair coin H =", round(entropy_bits([0.5, 0.5]), 4), "bits")
print("biased coin H =", round(entropy_bits([0.9, 0.1]), 4), "bits")
print("fair die H =", round(entropy_bits([1/6]*6), 4), "bits (= log2 6)")
print("loaded die H =", round(entropy_bits([0.5,0.1,0.1,0.1,0.1,0.1]), 4), "bits")
# the binary entropy function H_b(p): an arch peaking at p = 0.5
ps = np.linspace(0.001, 0.999, 200)
Hb = -ps*np.log2(ps) - (1-ps)*np.log2(1-ps)
print("\npeak of H_b at p =", round(float(ps[Hb.argmax()]), 3),
"-> H =", round(float(Hb.max()), 4), "bit")
plot_xy(ps, Hb)
Cross-entropy & KL divergence
Entropy assumes you know the true distribution \(p\). But a model only ever has an estimate \(q\). Cross-entropy asks the practical question: if reality is \(p\) but you encode it using a code built for \(q\), how many bits per symbol do you actually pay?
The gap between paying \(H(p,q)\) and the irreducible floor \(H(p)\) is the Kullback–Leibler divergence — the number of wasted bits caused by believing \(q\) when the truth is \(p\).
KL is not a metric. It is non-negative and zero only when the distributions match, but it is asymmetric — \(D_{\mathrm{KL}}(p\Vert q) \ne D_{\mathrm{KL}}(q\Vert p)\) in general — and it violates the triangle inequality. The asymmetry is not a flaw; it encodes a real modelling choice. Forward KL \(D_{\mathrm{KL}}(p\Vert q)\), the form inside maximum-likelihood training, is mass-covering: it punishes \(q\) heavily for assigning near-zero probability anywhere \(p\) has mass, so it spreads \(q\) to cover every mode. Reverse KL \(D_{\mathrm{KL}}(q\Vert p)\), the form inside variational inference (the ELBO, §8.5), is mode-seeking: it lets \(q\) ignore parts of \(p\) and lock onto a single mode. Which way you write the bars decides whether your model hedges or commits.
# EQ S8.3 / S8.4: entropy, cross-entropy, KL -- and KL's asymmetry
import numpy as np
def entropy(p): return float(-(p * np.log2(p)).sum())
def cross_entropy(p,q):return float(-(p * np.log2(q)).sum())
def kl(p, q): return float((p * np.log2(p/q)).sum())
p = np.array([0.7, 0.2, 0.1]) # the truth
q = np.array([0.2, 0.5, 0.3]) # a wrong model
H = entropy(p)
Hpq = cross_entropy(p, q)
print(f"H(p) = {H:.4f} bits (irreducible floor)")
print(f"H(p, q) = {Hpq:.4f} bits (what you actually pay)")
print(f"KL(p || q) = {kl(p, q):.4f} bits (wasted bits)")
print(f"identity check : H(p)+KL = {H + kl(p,q):.4f} == H(p,q)?",
np.isclose(H + kl(p,q), Hpq))
print(f"\nKL(p || q) = {kl(p, q):.4f}")
print(f"KL(q || p) = {kl(q, p):.4f} <- different! KL is asymmetric")
print(f"KL(p || p) = {kl(p, p):.4f} <- zero iff the model is exact")
Mutual information — shared surprise
So far, one variable. Mutual information asks how much knowing one random variable tells you about another: how many bits of \(Y\)'s uncertainty vanish once you observe \(X\). It is the KL divergence between the joint distribution and the product of the marginals — i.e. how far \(X\) and \(Y\) are from being independent.
Mutual information is the quiet workhorse behind a surprising amount of machine learning. Decision trees split on the feature with the highest information gain — mutual information between a feature and the label. Feature-selection ranks inputs by \(I(\text{feature}; \text{target})\). The information bottleneck frames representation learning as compressing \(X\) into \(Z\) while preserving \(I(Z; Y)\); InfoNCE and contrastive objectives are lower bounds on mutual information between views of the same datum. Wherever the question is "how related are these, beyond linear correlation?", mutual information is the honest answer — though estimating it from samples in high dimensions is notoriously hard and an active research area.
Correlation sees lines; mutual information sees structure. Two variables related by \(Y = X^2\) with \(X\) symmetric about zero have correlation exactly zero — yet \(X\) determines \(Y\) completely, so their mutual information is large. Any time you reach for "are these independent?", the bit-accurate test is \(I(X;Y) = 0\), not \(\rho = 0\).
# EQ S8.5: mutual information from a joint table, three identities agree
import numpy as np
# joint p(x,y) over a 2x2 grid -- correlated, not independent
P = np.array([[0.40, 0.10],
[0.10, 0.40]])
px = P.sum(1, keepdims=True) # marginal of X
py = P.sum(0, keepdims=True) # marginal of Y
mask = P > 0
I = float((P[mask] * np.log2(P[mask] / (px @ py)[mask])).sum())
def H(p): # entropy of a flat distribution
p = p[p > 0]; return float(-(p * np.log2(p)).sum())
HY = H(py.ravel())
HY_X = float(-(P[mask] * np.log2((P / px)[mask])).sum()) # H(Y|X)
print(f"I(X;Y) via KL of joint vs product : {I:.4f} bits")
print(f"I(X;Y) via H(Y) - H(Y|X) : {HY - HY_X:.4f} bits")
print(f"H(Y) = {HY:.3f}, H(Y|X) = {HY_X:.3f} -> X removes that gap from Y")
indep = px @ py # what independence would look like
print("\nif X,Y were independent, I would be:",
round(float((indep[indep>0]*np.log2((indep/(px@py))[indep>0])).sum()), 4))
Source coding — entropy as a compression limit
Entropy is not just a measure of uncertainty; it is a hard physical bound. Shannon's source coding theorem says: to encode symbols from a source with entropy \(H\) into bits without loss, you need on average at least \(H\) bits per symbol, and you can get arbitrarily close to \(H\) with a clever enough code. No lossless compressor — not ZIP, not a neural one — can beat the entropy of the source it is fed. Entropy is the compression limit.
Huffman coding is the classic constructive proof: repeatedly merge the two least-probable symbols into a subtree, and the resulting prefix code is provably optimal among integer-length codes. When all probabilities are negative powers of two — a dyadic distribution — Huffman hits the entropy bound exactly, with zero slack.
# EQ S8.6: build a Huffman code, compare its length to the entropy bound
import numpy as np, heapq
def huffman_lengths(p):
# heap of (prob, tie, node); node is leaf-id or (left,right)
h = [(pi, i, i) for i, pi in enumerate(p)]
heapq.heapify(h); nxt = len(p)
while len(h) > 1:
a = heapq.heappop(h); b = heapq.heappop(h)
heapq.heappush(h, (a[0]+b[0], nxt, (a[2], b[2]))); nxt += 1
lengths = {}
def walk(node, d):
if isinstance(node, tuple): walk(node[0], d+1); walk(node[1], d+1)
else: lengths[node] = max(d, 1)
walk(h[0][2], 0)
return [lengths[i] for i in range(len(p))]
for name, p in [("dyadic ", [0.5, 0.25, 0.125, 0.125]),
("uniform", [0.25]*4),
("skewed ", [0.6, 0.2, 0.1, 0.1])]:
p = np.array(p)
H = float(-(p*np.log2(p)).sum())
L = float((p * np.array(huffman_lengths(p))).sum())
print(f"{name}: H = {H:.4f} Huffman L = {L:.4f} "
f"slack = {L-H:+.4f} (theorem: 0 <= slack < 1)")
The bridge to ML — cross-entropy loss, perplexity, the ELBO
Here is the payoff. A classifier outputs a predicted distribution \(q\) over labels; the true label is a one-hot distribution \(p\) (all mass on the correct class \(c\)). Plug into cross-entropy, EQ S8.3:
For language models, the same loss wears a friendlier name. The geometric-mean per-token uncertainty is perplexity — the exponential of the cross-entropy — interpreted as the effective number of equally likely choices the model faces at each step.
The reach goes further. Generative models that cannot evaluate their own likelihood — VAEs, diffusion models — are trained on the evidence lower bound (ELBO), which is once again entropy bookkeeping. Maximizing the ELBO is maximizing a reconstruction term minus a KL term:
# EQ S8.7: softmax cross-entropy == negative log-likelihood, on a tiny example
import numpy as np
def softmax(z):
e = np.exp(z - z.max()); return e / e.sum()
logits = np.array([2.0, 1.0, 0.1]) # raw model scores for 3 classes
c = 0 # the true class index
q = softmax(logits)
onehot = np.eye(3)[c] # true distribution p (one-hot)
ce_full = float(-(onehot * np.log(q)).sum()) # EQ S8.3 with one-hot p
nll = float(-np.log(q[c])) # collapses to -log q_c
print("softmax q :", q.round(4))
print(f"cross-entropy : {ce_full:.4f} nats ({ce_full/np.log(2):.4f} bits)")
print(f"neg log-likelihd : {nll:.4f} nats <- identical, EQ S8.7")
# perplexity over a tiny sequence of true-class probabilities (EQ S8.8)
probs = np.array([0.659, 0.40, 0.90, 0.25]) # q(x_i) the model gave the truth
ppl = float(np.exp(-np.log(probs).mean()))
print(f"\nsequence perplexity : {ppl:.3f} (a perfect model -> 1.0)")
print(f"a fair coin per step would give PPL = {2**1.0:.1f}")
Shannon gave us the unit of information and the loss that consumes it — but every bit of entropy in this chapter assumed a clean distribution to count. Real data is messy, biased, leaky, and never the distribution you wished for. The Data volume opens by confronting that gap head-on: where datasets come from, how they lie, and why the hardest part of machine learning happens before a single gradient is computed. Next: Data · 01 — The Data Problem.
References
- Shannon, C. E. (1948). A Mathematical Theory of Communication.
- Kullback, S. & Leibler, R. A. (1951). On Information and Sufficiency.
- Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.).
- Huffman, D. A. (1952). A Method for the Construction of Minimum-Redundancy Codes.
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms.
- Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes.
- Tishby, N., Pereira, F. C. & Bialek, N. (2000). The Information Bottleneck Method.