Prior, likelihood, posterior
Start from the definition of conditional probability and read it as a learning rule. You hold a belief about a parameter \(\theta\) before seeing data: the prior \(p(\theta)\). The data \(D\) arrive with a likelihood \(p(D \mid \theta)\), the probability the model assigns to what you observed for each candidate value of \(\theta\). Bayes' theorem combines them into the posterior \(p(\theta \mid D)\), your updated belief:
\[ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'} \tag{EQ S5.1} \]
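The unnormalized form is the one to keep in your head, because it shows exactly how the three pieces interact:
\[ p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta), \qquad \text{posterior} \;\propto\; \text{likelihood} \times \text{prior} \]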
Three properties make this more than a formula. First, it is sequential: yesterday's posterior is today's prior, and when the observations are conditionally independent given \(\theta\), processing data in one batch or in a stream gives the identical result. Second, it returns a whole distribution, not a point estimate: uncertainty is first-class, not an afterthought computed from a sampling thought-experiment. Third, its output is a probability statement about \(\theta\) itself, which is what most people actually want to know and what many mistakenly believe a confidence interval already tells them.
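The sequential property is easy to check numerically. The sketch below assumes a Beta prior on a coin's bias with independent flips (the conjugate model covered in the next section, used here only for illustration; the flip sequence is made up) and compares a single batch update with a flip-by-flip stream.

```python
import numpy as np

# Hypothetical data for illustration: 1 = heads, 0 = tails.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

a0, b0 = 2.0, 2.0  # assumed Beta(2, 2) prior on the coin's bias

# Batch: process all flips in one update.
heads = flips.sum()
a_batch = a0 + heads
b_batch = b0 + (len(flips) - heads)

# Stream: the posterior after each flip becomes the prior for the next one.
a_stream, b_stream = a0, b0
for x in flips:
    a_stream += x        # a head increments a
    b_stream += 1 - x    # a tail increments b

print(f"batch  posterior: Beta({a_batch:.0f}, {b_batch:.0f})")
print(f"stream posterior: Beta({a_stream:.0f}, {b_stream:.0f})")
```

Both routes end at the same Beta parameters because each flip contributes one multiplicative likelihood factor, and the order of multiplication does not matter.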
The interpretive split is real, not cosmetic. To a frequentist, \(\theta\) is a fixed constant and probability statements about it are meaningless; randomness lives in the data and the procedure. To a Bayesian, \(\theta\) is uncertain and probability is the language of that uncertainty. Both camps agree on the math of EQ S5.1 — they disagree on what a probability is. Most working statisticians today are pragmatic: Bayesian when priors are defensible and uncertainty must be honest, frequentist when a guarantee over repeated use is what matters.
Conjugate priors: Beta–Binomial & Normal–Normal
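The integral in EQ S5.1 is what makes Bayes hard. A conjugate prior sidesteps it entirely: if the prior and posterior belong to the same family, updating is just arithmetic on the parameters. The canonical pair is the Beta prior with a Binomial likelihood, the model for "estimate a coin's bias from heads and tails." Observing \(k\) heads in \(n\) flips adds the heads to the first parameter and the tails to the second:
\[ \theta \sim \mathrm{Beta}(a, b), \quad k \mid \theta \sim \mathrm{Binomial}(n, \theta) \;\Longrightarrow\; \theta \mid k \sim \mathrm{Beta}(a + k,\; b + n - k) \tag{EQ S5.3} \]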
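The same trick works for a Gaussian mean with known variance. A Normal prior on the mean, combined with Normal data, yields a Normal posterior whose mean is a precision-weighted average of prior and data:
\[ \mu \sim \mathcal{N}(\mu_0, \tau_0^2), \quad x_1, \dots, x_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2) \;\Longrightarrow\; \mu \mid x \sim \mathcal{N}(\mu_n, \tau_n^2), \]
\[ \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n\,\bar{x}}{\sigma^2} \right) \tag{EQ S5.4} \]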
```python
# EQ S5.3: Beta-Binomial conjugate update -- posterior mean vs MLE
import numpy as np

a0, b0 = 2.0, 2.0            # Beta(2,2) prior: gentle nudge toward a fair coin
k, n = 7, 10                 # observed: 7 heads in 10 flips
a, b = a0 + k, b0 + (n - k)  # EQ S5.3: add heads to a, tails to b

post_mean = a / (a + b)      # E[theta | data]
post_var = a * b / ((a + b)**2 * (a + b + 1))
prior_mean = a0 / (a0 + b0)
mle = k / n

print(f"prior : Beta({a0:.0f},{b0:.0f}) mean {prior_mean:.4f}")
print(f"posterior : Beta({a:.0f},{b:.0f}) mean {post_mean:.4f} sd {post_var**0.5:.4f}")
print(f"MLE k/n : {mle:.4f}")
# signed offset of the posterior mean relative to the MLE (negative = below the MLE)
print(f"shrinkage : posterior sits {post_mean - mle:+.4f} from the MLE,")
print(f" pulled toward the prior mean {prior_mean:.2f}")

# draw the posterior density on a grid (unnormalized Beta kernel, then normalize)
grid = np.linspace(0.001, 0.999, 200)
dx = grid[1] - grid[0]
dens = grid**(a - 1) * (1 - grid)**(b - 1)
dens /= dens.sum() * dx      # normalize to unit area (Riemann)

# plot_xy is the plotting helper supplied by the page's runtime; it is not
# defined in this listing, so substitute your own plotting call elsewhere.
plot_xy(grid, dens)
```
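The Normal–Normal update described above can be sketched the same way. This is a minimal illustration, assuming a known data standard deviation and a simulated sample; the variable names (`prior_prec`, `data_prec`, `w_data`) are introduced here and are not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                              # known data standard deviation (assumed)
data = rng.normal(5.0, sigma, size=8)    # simulated sample, for illustration only
mu0, tau0 = 0.0, 5.0                     # prior N(mu0, tau0^2) on the mean

n, xbar = len(data), data.mean()
prior_prec = 1.0 / tau0**2               # precision = 1 / variance
data_prec = n / sigma**2                 # precision contributed by n observations

post_prec = prior_prec + data_prec       # precisions add
post_mean = (prior_prec * mu0 + data_prec * xbar) / post_prec
post_sd = post_prec**-0.5
w_data = data_prec / post_prec           # weight placed on the sample mean

print(f"sample mean    : {xbar:.3f}")
print(f"posterior mean : {post_mean:.3f}  (weight on data {w_data:.3f})")
print(f"posterior sd   : {post_sd:.3f}")
```

With a weak prior the weight on the data is close to 1, so the posterior mean sits just inside the sample mean on the side of the prior mean; tightening `tau0` moves it further toward `mu0`.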
MAP vs MLE: two ways to pick one number
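A full posterior is the honest answer, but engineering often wants a single estimate. Two point estimates dominate. Maximum likelihood (MLE) picks the \(\theta\) that makes the data most probable, ignoring any prior. Maximum a posteriori (MAP) picks the mode of the posterior, the most probable \(\theta\) given the data and the prior:
\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; p(D \mid \theta), \qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(\theta \mid D) = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta) \]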
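For the coin example both estimates have closed forms, which makes the difference concrete. The sketch below assumes the same Beta(2, 2) prior and 7 heads in 10 flips used in the conjugate example; `beta_mode` is a helper defined here for illustration.

```python
a0, b0 = 2.0, 2.0      # assumed Beta(2, 2) prior
k, n = 7, 10           # 7 heads in 10 flips

def beta_mode(a, b):
    """Mode of Beta(a, b); this formula is valid only when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

mle = k / n                                    # maximizes the likelihood alone
theta_map = beta_mode(a0 + k, b0 + n - k)      # mode of the Beta posterior
post_mean = (a0 + k) / (a0 + b0 + n)           # posterior mean, for comparison

# Under a flat Beta(1, 1) prior the posterior is proportional to the
# likelihood, so the MAP estimate coincides with the MLE.
map_flat = beta_mode(1.0 + k, 1.0 + n - k)

print(f"MLE            : {mle:.4f}")
print(f"MAP, Beta(2,2) : {theta_map:.4f}")
print(f"posterior mean : {post_mean:.4f}")
print(f"MAP, flat prior: {map_flat:.4f}")
```

The MAP estimate lands between the posterior mean and the MLE here, and it collapses onto the MLE once the prior is flat.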
The connection to machine learning is direct and worth internalizing: regularization is a prior. Adding an \(L_2\) penalty \(\lambda\lVert\theta\rVert^2\) to a negative log-likelihood loss is precisely MAP estimation under an independent zero-mean Gaussian prior \(\mathcal{N}(0, 1/(2\lambda))\) on each weight; an \(L_1\) penalty corresponds to a Laplace prior. The penalty strength \(\lambda\) is the prior's tightness: a larger \(\lambda\) means a smaller prior variance. Seen this way, "regularized MLE" and "MAP" are the same computation under two vocabularies.
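The equivalence can be checked on a small linear regression. This is a sketch under stated assumptions: Gaussian noise with known standard deviation `sigma`, simulated data, and made-up true weights. The regularized solution and the Gaussian-prior MAP solution are computed separately and compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])          # made-up weights for the simulation
sigma = 0.7                                  # known noise sd (assumed)
y = X @ w_true + rng.normal(scale=sigma, size=n)

lam = 0.5                                    # L2 penalty strength
tau2 = 1.0 / (2.0 * lam)                     # prior variance implied by the penalty

# Regularized view: minimize ||y - Xw||^2 / (2 sigma^2) + lam * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + 2.0 * lam * sigma**2 * np.eye(d), X.T @ y)

# Bayesian view: mode of the Gaussian posterior under the prior N(0, tau2 * I).
post_prec = X.T @ X / sigma**2 + np.eye(d) / tau2
w_map = np.linalg.solve(post_prec, X.T @ y / sigma**2)

def neg_log_post_grad(w):
    """Gradient of the negative log posterior (up to an additive constant)."""
    return -(X.T @ (y - X @ w)) / sigma**2 + w / tau2

w_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # unregularized MLE

print("ridge:", np.round(w_ridge, 4))
print("MAP  :", np.round(w_map, 4))
print("OLS  :", np.round(w_ols, 4))
```

The two solves differ only by a factor of \(\sigma^2\) on both sides of the normal equations, which is why the answers agree; the penalized weights also have a smaller norm than the unregularized ones.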
```python
# Grid-approximate a Normal-mean posterior (EQ S5.4 shape, computed numerically)
# then read off the 95% credible interval from the posterior CDF.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 5.0, 2.0                    # known data variance
data = rng.normal(mu_true, sigma, size=8)    # a SMALL sample of 8
mu0, tau0 = 0.0, 5.0                         # weak prior: N(0, 5^2) on the mean

grid = np.linspace(-2, 12, 2000)             # candidate values of mu
logprior = -0.5 * ((grid - mu0) / tau0)**2
# log-likelihood of the sample under each candidate mu (sum over data points)
loglik = -0.5 * (((data[:, None] - grid[None, :]) / sigma)**2).sum(0)
logpost = logprior + loglik

post = np.exp(logpost - logpost.max())       # stabilize, then normalize
dx = grid[1] - grid[0]
post /= post.sum() * dx                      # unit-area posterior
cdf = np.cumsum(post) * dx                   # running probability mass
lo = grid[np.searchsorted(cdf, 0.025)]       # 2.5th percentile
hi = grid[np.searchsorted(cdf, 0.975)]       # 97.5th percentile
mean = (grid * post).sum() * dx              # E[mu | data]

print(f"sample mean (MLE) : {data.mean():.3f}")
print(f"posterior mean : {mean:.3f}")
print(f"95% credible interval : [{lo:.3f}, {hi:.3f}]")
print(f"interpretation : P(mu in interval | data) = 0.95 -- a")
print(f" statement about mu, not about the procedure")

# plot_xy is the plotting helper supplied by the page's runtime; it is not
# defined in this listing.
plot_xy(grid, post)
```
Credible intervals vs confidence intervals
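This is the comparison where the two philosophies collide most sharply, and where the difference is routinely misstated. Both produce an interval; they answer different questions.
A 95% credible interval \([L, U]\) is a direct probability statement about the parameter: given the observed data and the prior, \(P(\theta \in [L, U] \mid D) = 0.95\).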
A confidence interval makes no such statement. Its 95% is a property of the procedure across hypothetical repetitions: if you reran the whole experiment many times and computed an interval each time, about 95% of those intervals would contain the true (fixed) \(\theta\). For the one interval in front of you, \(\theta\) is either in it or not — the 95% does not transfer to this realization. The seductive sentence "there's a 95% chance the parameter is in this interval" is a credible-interval statement smuggled onto a confidence interval, and it is false under the frequentist definition.
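The repeated-procedure reading can be checked by simulation. The sketch below assumes a Normal mean with known standard deviation and the standard z-interval; the true mean, sample size, and repetition count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, n = 5.0, 2.0, 8     # fixed true mean, known sd, sample size
reps = 20_000                       # number of hypothetical repeated experiments
z = 1.959964                        # 97.5th percentile of the standard normal

samples = rng.normal(mu_true, sigma, size=(reps, n))
xbar = samples.mean(axis=1)         # one sample mean per experiment
half = z * sigma / np.sqrt(n)       # half-width of the 95% z-interval

# Each experiment yields one interval; it either covers mu_true or it does not.
covered = (xbar - half <= mu_true) & (mu_true <= xbar + half)
coverage = covered.mean()

print(f"fraction of intervals covering mu_true: {coverage:.4f}")
```

The 95% shows up only as a long-run frequency across the rows; any single row of `covered` is simply `True` or `False`.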
| | Credible interval (Bayesian) | Confidence interval (frequentist) |
|---|---|---|
| What's random | θ (the belief) | the interval (the data) |
| The 95% means | P(θ in [L,U] given this data) = 0.95 | 95% of such intervals cover the fixed θ over repeats |
| Needs a prior | yes | no |
| Guarantee | conditional on the data you saw | long-run, over the procedure |
| "95% chance θ is inside" | correct | wrong |
For large samples with a flat prior the two intervals nearly coincide numerically, which is why the distinction looks pedantic until it isn't. They diverge when the prior carries real information, when \(n\) is small, or when the parameter lives near a boundary (a near-zero proportion, a variance close to zero). There, standard normal-approximation confidence intervals can behave pathologically, even extending below zero for a quantity that cannot be negative, while a prior supported only on the feasible region keeps the credible interval inside it.
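A near-zero proportion shows the boundary problem in a few lines. The sketch below uses the normal-approximation (Wald) interval as one common confidence-interval construction; other constructions, such as Wilson or Clopper–Pearson, stay inside [0, 1]. The flat Beta(1, 1) prior and the counts are illustrative assumptions, and the credible interval is read from posterior draws rather than an exact quantile function.

```python
import numpy as np

k, n = 1, 20                        # 1 success in 20 trials (illustrative)
p_hat = k / n
z = 1.959964                        # 97.5th percentile of the standard normal

# Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald_lo, wald_hi = p_hat - z * se, p_hat + z * se

# Posterior under an assumed flat Beta(1, 1) prior is Beta(1 + k, 1 + n - k).
rng = np.random.default_rng(0)
draws = rng.beta(1 + k, 1 + n - k, size=200_000)
cred_lo, cred_hi = np.quantile(draws, [0.025, 0.975])

print(f"Wald 95% CI      : [{wald_lo:.4f}, {wald_hi:.4f}]")
print(f"95% credible int : [{cred_lo:.4f}, {cred_hi:.4f}]")
```

The Wald lower limit falls below zero, while the credible interval cannot leave (0, 1) because the Beta posterior has no mass outside it.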
The interpretation gap cuts both ways. Bayesians point out that the credible interval answers the question people actually ask. Frequentists counter that it only does so if you accept the prior — and that a confidence interval's coverage guarantee holds regardless of any belief, which is exactly what you want for a regulator or a referee. Neither is "more correct"; they optimize for different things. The honest summary: report a credible interval when a defensible prior exists and you owe a statement about this parameter; report a confidence interval when you owe a guarantee that survives an adversary's choice of θ.
When to go Bayesian: small data, real priors, hierarchy
Bayesian inference is not a moral upgrade over frequentist statistics — it is a tool with a cost (you must specify a prior, and often pay for sampling) and three regimes where it clearly pays for itself.
- Small data. When \(n\) is tiny, the MLE is high-variance and can sit on the boundary (zero successes \(\Rightarrow\) "the true rate is 0"). A mild prior regularizes the estimate and, more importantly, the posterior reports its own width — you get calibrated uncertainty instead of an overconfident point. This is exactly the regime Instrument S5.2 dramatizes.
- Genuine prior information. If you actually know something — a physical constraint, last quarter's conversion rate, a published effect size — discarding it to "let the data speak" is throwing away signal. A prior is the disciplined way to encode it, and the posterior shows precisely how much the new data revised it.
- Hierarchy & partial pooling. When you estimate many related quantities at once (conversion rates for 500 stores, batting averages for 200 players, effects across 30 hospitals), a hierarchical model lets a shared hyper-prior borrow strength across groups. Each estimate is shrunk toward the population mean by an amount the data decide; noisy small-sample groups shrink a lot, well-measured groups barely move. This is the modern form of the James–Stein result that, when three or more means are estimated together, a shrinkage estimator dominates the independent MLEs in total squared error.
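Partial pooling from the last bullet can be sketched on simulated store data. To keep the example short, this is an empirical-Bayes shortcut and not a full hierarchical model: the Beta hyper-parameters are estimated by simple moment matching instead of being given their own prior and sampled. The population of true rates, the traffic per store, and all variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 500                                   # number of stores (simulated)
true_rate = rng.beta(10, 90, size=G)      # assumed population of true rates
n = rng.integers(5, 500, size=G)          # visitors per store, very uneven
k = rng.binomial(n, true_rate)            # conversions per store

raw = k / n                               # independent MLE for each store
m = k.sum() / n.sum()                     # pooled rate across all stores

# Moment matching: variance of raw rates = between-store variance + binomial noise.
noise = np.mean(m * (1 - m) / n)
between = max(raw.var() - noise, 1e-6)    # clamp in case the estimate goes negative
strength = max(m * (1 - m) / between - 1, 1e-6)   # a0 + b0 of the Beta prior
a0, b0 = m * strength, (1 - m) * strength

shrunk = (a0 + k) / (a0 + b0 + n)         # posterior mean for each store
weight_on_data = n / (n + a0 + b0)        # how much each store keeps of its own rate

mse_raw = np.mean((raw - true_rate) ** 2)
mse_shrunk = np.mean((shrunk - true_rate) ** 2)

print(f"estimated prior strength a0+b0 : {strength:.1f}")
print(f"MSE of raw rates               : {mse_raw:.6f}")
print(f"MSE of shrunk rates            : {mse_shrunk:.6f}")
print(f"data weight, smallest store    : {weight_on_data[n.argmin()]:.3f}")
print(f"data weight, largest store     : {weight_on_data[n.argmax()]:.3f}")
```

Stores with few visitors keep little of their own noisy rate and move most of the way to the pooled mean, while high-traffic stores barely move. A full hierarchical model would also propagate uncertainty about `a0` and `b0`, which this shortcut ignores.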
When to stay frequentist. If a defensible prior is genuinely unavailable and a referee will challenge whatever you pick; if you need a coverage guarantee that holds for an adversarially chosen θ (much regulatory and clinical work); or if the model is simple, data abundant, and the two answers coincide anyway — the frequentist route is cheaper and beyond dispute. The mature stance is fluency in both, not allegiance to one.
Every update in this chapter — precision-weighted means, conjugate sums, the integral in Bayes' theorem — is linear algebra in disguise. Chapter 06 lays the foundation those operations stand on: vectors and matrices, eigen-decomposition, the SVD, and the geometry of projections that turns "weighted average of prior and data" into a single matrix equation.
References
- Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.).
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science.
- Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations.
- Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem.
- Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language.