Bayesian Inference — AI Encyclopedia

5.1

Prior, likelihood, posterior

Start from the definition of conditional probability and read it as a learning rule. You hold a belief about a parameter $\theta$ before seeing data — the prior $p(\theta)$. The data $D$ arrive with a likelihood $p(D \mid \theta)$, the probability the model assigns to what you observed for each candidate value of $\theta$. Bayes' theorem combines them into the posterior $p(\theta \mid D)$ — your updated belief:

EQ S5.1 — BAYES' THEOREM $$ p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta $$

The denominator $p(D)$ — the marginal likelihood or evidence — is just the constant that makes the posterior integrate to 1; it does not depend on $\theta$. So for inference about $\theta$ you can usually drop it and work with the proportionality posterior $\propto$ likelihood $\times$ prior. The hard part of Bayesian computation is almost always that integral over $\theta$; conjugacy (§5.2) makes it disappear, and when it cannot, you reach for MCMC or variational methods.

The unnormalized form is the one to keep in your head, because it shows exactly how the three pieces interact:

EQ S5.2 — POSTERIOR AS A PRODUCT $$ \underbrace{p(\theta \mid D)}_{\text{posterior}} \;\propto\; \underbrace{p(D \mid \theta)}_{\text{likelihood}} \;\cdot\; \underbrace{p(\theta)}_{\text{prior}} $$
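A minimal sketch of EQ S5.1 and EQ S5.2 on a grid may help make the normalization concrete. It assumes a coin-bias parameter, a flat prior, and 7 heads in 10 flips; these numbers and the helper name `grid_posterior` are illustrative choices, not part of the text above. The posterior is the pointwise product of likelihood and prior, divided by a Riemann-sum estimate of the evidence $p(D)$.

```python
import math

def grid_posterior(thetas, prior, likelihood):
    """Return (posterior densities, evidence) on an evenly spaced grid."""
    dx = thetas[1] - thetas[0]
    unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]  # EQ S5.2
    evidence = sum(unnorm) * dx          # Riemann estimate of p(D)
    return [u / evidence for u in unnorm], evidence

k, n = 7, 10                             # assumed data: 7 heads in 10 flips
m = 2000
thetas = [(i + 0.5) / m for i in range(m)]   # midpoints of (0, 1)
prior = [1.0] * m                        # flat prior density on [0, 1]

def lik(t):
    """Binomial probability of k heads in n flips at bias t."""
    return math.comb(n, k) * t**k * (1 - t)**(n - k)

post, evidence = grid_posterior(thetas, prior, lik)
dx = thetas[1] - thetas[0]
post_mean = sum(t * p for t, p in zip(thetas, post)) * dx
print(f"evidence p(D)  : {evidence:.4f}")
print(f"posterior mean : {post_mean:.4f}")
```

Under the flat prior the evidence should land near $1/(n+1)$ and the posterior mean near $(k+1)/(n+2)$. The division by `evidence` only rescales the curve; it never moves its shape, which is why the proportional form is usually enough.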

The prior is a soft starting point; the likelihood pulls it toward whatever the data support. With little data the prior dominates; as data accumulate the likelihood overwhelms any non-dogmatic prior and the posterior concentrates around the true parameter value. The Bernstein–von Mises theorem makes this precise under regularity conditions (a correctly specified model and a prior with positive density at the true value): the posterior becomes asymptotically Gaussian and independent of the prior. A prior that puts zero mass on a value can never recover, because multiplying by zero in EQ S5.2 leaves zero however strong the data. Hence Cromwell's rule: never assign probability exactly 0 or 1 to something you might be wrong about.
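Cromwell's rule can be seen in a few lines. The sketch below assumes a toy setting with three candidate coin biases and a prior that rules one of them out; the hypotheses, the prior weights, and the data are all illustrative.

```python
# Three candidate biases; the prior rules out 0.7 entirely (illustrative choice).
hypotheses = [0.3, 0.5, 0.7]
prior = [0.5, 0.5, 0.0]

def update(belief, heads, tails):
    """One Bayes update over a discrete set of coin biases."""
    unnorm = [b * h**heads * (1 - h)**tails for b, h in zip(belief, hypotheses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Data that strongly favour a bias of 0.7: 70 heads, 30 tails.
posterior = update(prior, heads=70, tails=30)
print(posterior)   # the entry for 0.7 stays at zero
```

No amount of data can lift the excluded hypothesis, so the posterior settles on the least-bad remaining candidate instead.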

Three properties make this more than a formula. First, it is sequential: yesterday's posterior is today's prior, and processing the data in one batch or as a stream gives the identical posterior. Second, it returns a whole distribution, not a point estimate: uncertainty is first-class, not an afterthought computed from a sampling thought-experiment. Third, its output is a probability statement about $\theta$ itself, which is what most people actually want to know and what many (mistakenly) believe a confidence interval already gives them.
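The sequential property is easy to check numerically. This sketch borrows the Beta–Binomial update of EQ S5.3 (§5.2) as the model and assumes a Beta(2,2) prior and a made-up sequence of ten flips; `beta_update` is a hypothetical helper name.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate Beta-Binomial update: add heads to alpha, tails to beta."""
    return alpha + heads, beta + tails

flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]    # assumed data: 1 = head, 0 = tail

# Stream: yesterday's posterior is today's prior, one flip at a time.
a, b = 2, 2
for f in flips:
    a, b = beta_update(a, b, heads=f, tails=1 - f)

# Batch: the same flips processed in a single update.
heads = sum(flips)
a_batch, b_batch = beta_update(2, 2, heads=heads, tails=len(flips) - heads)

print((a, b), (a_batch, b_batch))
```

Both routes end at the same Beta parameters, because the update only ever adds counts and addition does not care about order.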

KEY

The interpretive split is real, not cosmetic. To a frequentist, $\theta$ is a fixed constant and probability statements about it are meaningless; randomness lives in the data and the procedure. To a Bayesian, $\theta$ is uncertain and probability is the language of that uncertainty. Both camps agree on the math of EQ S5.1 — they disagree on what a probability is. Most working statisticians today are pragmatic: Bayesian when priors are defensible and uncertainty must be honest, frequentist when a guarantee over repeated use is what matters.

5.2

Conjugate priors: Beta–Binomial & Normal–Normal

The integral in EQ S5.1 is what makes Bayes hard. A conjugate prior sidesteps it entirely: if the prior and posterior belong to the same family, updating is just arithmetic on the parameters. The canonical pair is the Beta prior with a Binomial likelihood — the model for "estimate a coin's bias from heads and tails."

EQ S5.3 — BETA–BINOMIAL UPDATE $$ \theta \sim \mathrm{Beta}(\alpha, \beta), \quad k \mid \theta \sim \mathrm{Binomial}(n, \theta) \;\;\Longrightarrow\;\; \theta \mid k \sim \mathrm{Beta}(\alpha + k,\; \beta + n - k) $$

Observe $k$ successes in $n$ trials and you simply add the successes to $\alpha$ and the failures to $\beta$. The prior parameters act as pseudo-counts: $\mathrm{Beta}(1,1)$ is uniform over $[0,1]$ (one imaginary head, one imaginary tail); $\mathrm{Beta}(2,2)$ is a gentle nudge toward a fair coin. The posterior mean is $\dfrac{\alpha + k}{\alpha + \beta + n}$, which is the MLE $k/n$ shrunk toward the prior mean $\alpha/(\alpha+\beta)$: it equals the weighted average $\dfrac{n}{\alpha+\beta+n}\cdot\dfrac{k}{n} + \dfrac{\alpha+\beta}{\alpha+\beta+n}\cdot\dfrac{\alpha}{\alpha+\beta}$, so the shrinkage fades as $n$ grows.

The same trick works for a Gaussian mean with known variance. A Normal prior on the mean, combined with Normal data, yields a Normal posterior whose mean is a precision-weighted average of prior and data:

EQ S5.4 — NORMAL–NORMAL UPDATE (KNOWN σ²) $$ \mu \sim \mathcal{N}(\mu_0, \tau_0^2), \;\; x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2) \;\;\Longrightarrow\;\; \mu \mid \bar{x} \sim \mathcal{N}\!\left( \frac{\tfrac{\mu_0}{\tau_0^2} + \tfrac{n\bar{x}}{\sigma^2}}{\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}},\; \left(\tfrac{1}{\tau_0^2} + \tfrac{n}{\sigma^2}\right)^{-1} \right) $$

Work in precision (inverse variance) and it is beautifully simple: posterior precision = prior precision + data precision, and the posterior mean is the average of $\mu_0$ and $\bar{x}$ weighted by those precisions. Each observation adds $1/\sigma^2$ of precision, so the posterior variance shrinks like $1/n$ (the standard deviation like $1/\sqrt{n}$) and the data term $n\bar{x}/\sigma^2$ eventually swamps the prior. This is the engine behind Kalman filters, ridge regression's Bayesian reading, and Gaussian hierarchical models.
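The precision bookkeeping of EQ S5.4 fits in one small function. The sketch below is a direct transcription of that equation; the function name `normal_update` and the example numbers (prior $\mathcal{N}(0, 5^2)$, $\sigma = 2$, sample mean 5.0) are assumptions made for illustration.

```python
def normal_update(mu0, tau0, xbar, sigma, n):
    """Posterior mean and variance of mu from EQ S5.4 (known sigma)."""
    prior_prec = 1.0 / tau0**2            # prior precision
    data_prec = n / sigma**2              # precision contributed by n observations
    post_prec = prior_prec + data_prec    # precisions add
    post_mean = (prior_prec * mu0 + data_prec * xbar) / post_prec
    return post_mean, 1.0 / post_prec

# Illustrative numbers: prior N(0, 5^2), known sigma = 2, sample mean 5.0.
for n in (1, 8, 100):
    mean, var = normal_update(mu0=0.0, tau0=5.0, xbar=5.0, sigma=2.0, n=n)
    print(f"n={n:3d}  posterior mean {mean:.3f}  posterior sd {var**0.5:.3f}")
```

As $n$ grows the posterior mean moves from the prior mean toward the sample mean and the posterior standard deviation falls roughly like $\sigma/\sqrt{n}$.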

You start with a $ \mathrm{Beta}(2,2) $ prior on a coin's bias and observe 7 heads in 10 flips. What is the posterior mean? (Use EQ S5.3, then $ \tfrac{\alpha'}{\alpha'+\beta'} $.)

Update: $ \alpha' = 2 + 7 = 9 $, $ \beta' = 2 + (10-7) = 5 $, so the posterior is $ \mathrm{Beta}(9,5) $. Posterior mean $ = \dfrac{9}{9+5} = \dfrac{9}{14} = $ 0.643. Note it sits below the raw MLE of $7/10 = 0.70$: the symmetric prior shrinks the estimate toward $0.5$.

A posterior comes out as $ \mathrm{Beta}(3, 1) $. What is its mean?

The mean of $ \mathrm{Beta}(\alpha,\beta) $ is $ \dfrac{\alpha}{\alpha+\beta} = \dfrac{3}{3+1} = \dfrac{3}{4} = $ 0.75. (Its mode, by contrast, is $ \tfrac{\alpha-1}{\alpha+\beta-2} = \tfrac{2}{2} = 1 $ — mean and mode part ways for skewed Betas.)

PYTHON · RUNNABLE IN-BROWSER

# EQ S5.3: Beta-Binomial conjugate update -- posterior mean vs MLE
import numpy as np

a0, b0 = 2.0, 2.0          # Beta(2,2) prior: gentle nudge toward a fair coin
k, n   = 7, 10             # observed: 7 heads in 10 flips

a, b = a0 + k, b0 + (n - k)            # EQ S5.3: add heads to a, tails to b
post_mean = a / (a + b)               # E[theta | data]
post_var  = a * b / ((a + b)**2 * (a + b + 1))
prior_mean = a0 / (a0 + b0)
mle = k / n

print(f"prior     : Beta({a0:.0f},{b0:.0f})  mean {prior_mean:.4f}")
print(f"posterior : Beta({a:.0f},{b:.0f})  mean {post_mean:.4f}  sd {post_var**0.5:.4f}")
print(f"MLE k/n   : {mle:.4f}")
print(f"shrinkage : posterior sits {post_mean - mle:+.4f} from the MLE,")
print(f"            pulled toward the prior mean {prior_mean:.2f}")

# draw the posterior density on a grid (unnormalized Beta kernel, then normalize)
grid = np.linspace(0.001, 0.999, 200)
dx   = grid[1] - grid[0]
dens = grid**(a - 1) * (1 - grid)**(b - 1)
dens /= dens.sum() * dx                # normalize to unit area (Riemann)
plot_xy(grid, dens)


INSTRUMENT S5.1 — BETA–BINOMIAL UPDATER (EQ S5.3 · LIVE POSTERIOR)

Controls: prior α (default 2), prior β (default 2), true coin bias (default 0.70), and a flip-coins button. Readouts: data (heads / flips), posterior Beta(α′, β′), posterior mean, and 95% credible interval.

The grey curve is the prior, the mint curve the posterior; the dashed line is the true bias. Flip a few coins and the posterior is broad and prior-shaped; flip a hundred and watch it collapse to a spike on the truth — the likelihood drowning the prior exactly as §5.1 promised. Set α = β = 1 for a flat prior and the posterior mean tracks the raw MLE; crank α and β up to feel a stubborn prior resist the data.

5.3

MAP vs MLE: two ways to pick one number

A full posterior is the honest answer, but engineering often wants a single estimate. Two point estimates dominate. Maximum likelihood (MLE) picks the $\theta$ that makes the data most probable, ignoring any prior. Maximum a posteriori (MAP) picks the mode of the posterior — the most probable $\theta$ given the data and the prior:

EQ S5.5 — MLE AND MAP $$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\; p(D \mid \theta), \qquad \hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\; p(D \mid \theta)\, p(\theta) $$

MAP is MLE with the prior multiplied back in — equivalently, with $\log p(\theta)$ added to the log-likelihood. MAP collapses to MLE exactly when the prior is flat ($p(\theta)$ constant), so MLE is the special case "I refuse to state a prior." Crucially, MAP is not the same as the posterior mean: for a skewed posterior the mode, mean, and median all differ, and MAP — being a single point — throws away the very uncertainty that motivated going Bayesian.
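Read literally, EQ S5.5 is an argmax, so a brute-force grid search is enough to see the prior shift the answer. The sketch assumes 7 heads in 10 flips and a Beta(2,2) prior, and it works with log-likelihood plus log-prior; the grid resolution of 0.001 is an arbitrary choice.

```python
import math

k, n = 7, 10                    # assumed data: 7 heads in 10 flips
a0, b0 = 2.0, 2.0               # assumed Beta(2,2) prior

grid = [i / 1000 for i in range(1, 1000)]       # 0.001 ... 0.999

def log_lik(t):
    """Binomial log-likelihood, dropping the constant binomial coefficient."""
    return k * math.log(t) + (n - k) * math.log(1 - t)

def log_prior(t):
    """Log of the Beta(a0, b0) kernel, dropping the normalizing constant."""
    return (a0 - 1) * math.log(t) + (b0 - 1) * math.log(1 - t)

mle = max(grid, key=log_lik)
map_est = max(grid, key=lambda t: log_lik(t) + log_prior(t))
print(f"MLE {mle:.3f}   MAP {map_est:.3f}")
```

Setting `a0 = b0 = 1.0` makes `log_prior` identically zero, and the two searches then return the same grid point.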

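The connection to machine learning is direct and worth internalizing: regularization is a prior. Adding an $L_2$ penalty $\lambda\lVert\theta\rVert^2$ to a negative log-likelihood loss is precisely MAP estimation under an independent zero-mean Gaussian prior with variance $1/(2\lambda)$ on each weight, since $-\log \mathcal{N}(\theta;\, 0,\, 1/(2\lambda)) = \lambda\theta^2 + \text{const}$. An $L_1$ penalty $\lambda\lVert\theta\rVert_1$ corresponds in the same way to a Laplace prior with scale $1/\lambda$. The penalty strength $\lambda$ is the prior's tightness: a larger $\lambda$ means a narrower prior. Seen this way, "regularized MLE" and "MAP" are the same computation under two vocabularies.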
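A one-weight regression is enough to check the correspondence. The sketch below assumes a linear model $y = wx + \varepsilon$ with unit noise variance, so the negative log-likelihood is half the sum of squared residuals; the data, $\lambda = 2$, and the grid search are illustrative choices, not a recommended fitting method.

```python
xs = [0.5, 1.0, 1.5, 2.0, 2.5]            # illustrative inputs
ys = [1.2, 1.9, 3.2, 3.8, 5.1]            # illustrative targets
lam = 2.0                                  # L2 penalty strength
prior_var = 1.0 / (2.0 * lam)              # variance of the matching Gaussian prior

def penalized_loss(w):
    """Negative log-likelihood (unit noise variance) plus lam * w^2."""
    return 0.5 * sum((y - w * x)**2 for x, y in zip(xs, ys)) + lam * w**2

def log_posterior(w):
    """Gaussian log-likelihood plus N(0, prior_var) log-prior, constants dropped."""
    log_lik = -0.5 * sum((y - w * x)**2 for x, y in zip(xs, ys))
    log_prior = -0.5 * w**2 / prior_var
    return log_lik + log_prior

# Closed-form minimizer of the penalized loss for a single weight.
w_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + 2 * lam)

# MAP by brute-force search over a fine grid of candidate weights.
grid = [i / 10000 for i in range(0, 40001)]
w_map = max(grid, key=log_posterior)
print(f"ridge {w_ridge:.4f}   MAP {w_map:.4f}")
```

The two objectives are negatives of each other up to a constant, so the ridge minimizer and the MAP estimate agree to within the grid spacing.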

EQ S5.6 — MAP, MEAN, AND MLE FOR A BETA POSTERIOR $$ \hat{\theta}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}, \qquad \hat{\theta}_{\text{mean}} = \frac{\alpha + k}{\alpha + \beta + n}, \qquad \hat{\theta}_{\text{MLE}} = \frac{k}{n} $$

For the Beta–Binomial model all three estimators have closed forms, and comparing them is the cleanest way to feel the difference. With a flat $\mathrm{Beta}(1,1)$ prior the MAP $\tfrac{k}{n}$ coincides with the MLE (the "$-1$" terms cancel the "$+1$" pseudo-counts), while the posterior mean $\tfrac{k+1}{n+2}$ is Laplace's rule of succession — still shrunk toward $0.5$. On small $n$, these can differ enough to matter; by large $n$ they converge.
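The three closed forms of EQ S5.6 can be compared side by side. This is a small sketch; the function name `beta_binomial_estimates` is hypothetical, and the MAP formula is only used where the posterior mode is interior (both updated Beta parameters above 1).

```python
def beta_binomial_estimates(k, n, alpha, beta):
    """MLE, MAP, and posterior mean from EQ S5.6.

    The MAP formula assumes alpha + k > 1 and beta + n - k > 1 (interior mode).
    """
    mle = k / n
    map_est = (alpha + k - 1) / (alpha + beta + n - 2)
    mean = (alpha + k) / (alpha + beta + n)
    return mle, map_est, mean

# k = 2 heads in n = 3 flips: tiny data, where the estimators disagree most.
for label, a, b in [("Beta(2,2)", 2, 2), ("Beta(1,1)", 1, 1)]:
    mle, map_est, mean = beta_binomial_estimates(2, 3, a, b)
    print(f"{label}: MLE {mle:.3f}  MAP {map_est:.3f}  mean {mean:.3f}")
```

With the Beta(2,2) prior the three numbers are 2/3, 3/5, and 4/7; with the flat prior the MAP returns to 2/3 while the mean stays at 3/5.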

PYTHON · RUNNABLE IN-BROWSER

# Grid-approximate a Normal-mean posterior (EQ S5.4 shape, computed numerically)
# then read off the 95% credible interval from the posterior CDF.
import numpy as np
rng = np.random.default_rng(0)

mu_true, sigma = 5.0, 2.0          # known data variance
data = rng.normal(mu_true, sigma, size=8)    # a SMALL sample of 8
mu0, tau0 = 0.0, 5.0               # weak prior: N(0, 5^2) on the mean

grid = np.linspace(-2, 12, 2000)   # candidate values of mu
logprior = -0.5 * ((grid - mu0) / tau0)**2
# log-likelihood of the sample under each candidate mu (sum over data points)
loglik = -0.5 * (((data[:, None] - grid[None, :]) / sigma)**2).sum(0)
logpost = logprior + loglik
post = np.exp(logpost - logpost.max())       # stabilize, then normalize
dx   = grid[1] - grid[0]
post /= post.sum() * dx                       # unit-area posterior

cdf = np.cumsum(post) * dx                    # running probability mass
lo = grid[np.searchsorted(cdf, 0.025)]       # 2.5th percentile
hi = grid[np.searchsorted(cdf, 0.975)]       # 97.5th percentile
mean = (grid * post).sum() * dx              # E[mu | data]

print(f"sample mean (MLE)     : {data.mean():.3f}")
print(f"posterior mean        : {mean:.3f}")
print(f"95% credible interval : [{lo:.3f}, {hi:.3f}]")
print(f"interpretation        : P(mu in interval | data) = 0.95 -- a")
print(f"                        statement about mu, not about the procedure")
plot_xy(grid, post)


INSTRUMENT S5.2 — MAP vs MLE ON A SMALL SAMPLE (EQ S5.6 · BETA–BINOMIAL)

Controls: heads k (default 2), flips n (default 3), prior α = β (default 2.0). Readouts: MLE k/n, MAP (mode), posterior mean, and gap MLE − MAP.

Three vertical markers on the posterior: blue MLE, mint MAP (mode), and a paler mint posterior mean. With k = 2, n = 3 and a $\mathrm{Beta}(2,2)$ prior they sit visibly apart (MLE 0.667, MAP 0.600, mean 0.571): small data is exactly where the prior earns its keep. Drag n up to 40 and the three markers march together, the posterior narrows, and the choice of estimator stops mattering. Set α = β = 1 and the MAP snaps onto the MLE (EQ S5.6).

5.4

Credible intervals vs confidence intervals

This is the comparison where the two philosophies collide most sharply, and where the difference is routinely misstated. Both produce an interval; they answer different questions.

EQ S5.7 — A 95% CREDIBLE INTERVAL $$ \mathbb{P}\big(\theta \in [L, U] \,\big|\, D\big) = 0.95, \qquad \text{e.g. } \int_{L}^{U} p(\theta \mid D)\, \mathrm{d}\theta = 0.95 $$

A credible interval is a direct probability statement about the parameter: given the data you actually saw, there is a 95% probability $\theta$ lies in $[L, U]$. Two common flavors: the equal-tailed interval (cut 2.5% off each tail of the posterior) and the highest-posterior-density (HPD) interval (for a unimodal posterior, the shortest interval containing 95% of the mass; every point inside has higher posterior density than every point outside). For symmetric unimodal posteriors they coincide; for skewed ones the HPD is shorter, though unlike the equal-tailed interval it is not invariant under reparameterization of $\theta$.
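The two flavors can be computed on a grid without any special functions. The sketch below assumes a right-skewed Beta(3,9) posterior chosen purely for illustration; the helper names are hypothetical, and the HPD routine relies on the posterior being unimodal so that the kept points form a single interval.

```python
def beta_grid(a, b, m=20000):
    """Normalized Beta(a, b) density on m midpoints of (0, 1)."""
    xs = [(i + 0.5) / m for i in range(m)]
    dens = [x**(a - 1) * (1 - x)**(b - 1) for x in xs]
    total = sum(dens) / m
    return xs, [d / total for d in dens]

def equal_tailed(xs, dens, mass=0.95):
    """Cut (1 - mass) / 2 of probability off each tail."""
    dx, tail = xs[1] - xs[0], (1 - mass) / 2
    acc, lo, hi = 0.0, None, None
    for x, d in zip(xs, dens):
        acc += d * dx
        if lo is None and acc >= tail:
            lo = x
        if hi is None and acc >= 1 - tail:
            hi = x
    return lo, hi

def hpd(xs, dens, mass=0.95):
    """Keep the highest-density grid points until they hold `mass`.

    Returns the min and max kept point, which is an interval only when the
    posterior is unimodal (assumed here).
    """
    dx = xs[1] - xs[0]
    order = sorted(range(len(xs)), key=lambda i: dens[i], reverse=True)
    acc, kept = 0.0, []
    for i in order:
        kept.append(xs[i])
        acc += dens[i] * dx
        if acc >= mass:
            break
    return min(kept), max(kept)

xs, dens = beta_grid(3, 9)        # a right-skewed posterior (illustrative)
et, hp = equal_tailed(xs, dens), hpd(xs, dens)
print(f"equal-tailed: [{et[0]:.3f}, {et[1]:.3f}]  width {et[1] - et[0]:.3f}")
print(f"HPD         : [{hp[0]:.3f}, {hp[1]:.3f}]  width {hp[1] - hp[0]:.3f}")
```

For this skewed posterior the HPD interval comes out shorter and shifted toward the mode, while both intervals hold the same 95% of posterior mass.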

A confidence interval makes no such statement. Its 95% is a property of the procedure across hypothetical repetitions: if you reran the whole experiment many times and computed an interval each time, about 95% of those intervals would contain the true (fixed) $\theta$. For the one interval in front of you, $\theta$ is either in it or not — the 95% does not transfer to this realization. The seductive sentence "there's a 95% chance the parameter is in this interval" is a credible-interval statement smuggled onto a confidence interval, and it is false under the frequentist definition.
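The "hypothetical repetitions" can be simulated directly. The sketch below assumes Normal data with known $\sigma$ and the textbook interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$; the true mean, sample size, seed, and number of repetitions are arbitrary illustrative choices.

```python
import random

random.seed(0)
mu_true, sigma, n = 5.0, 2.0, 8      # fixed truth, known sigma (illustrative)
z = 1.96                             # two-sided 95% normal quantile
half_width = z * sigma / n**0.5

reps, covered = 10000, 0
for _ in range(reps):
    sample = [random.gauss(mu_true, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    # The interval is random; mu_true never moves.
    if xbar - half_width <= mu_true <= xbar + half_width:
        covered += 1

coverage = covered / reps
print(f"fraction of intervals covering mu_true: {coverage:.3f}")
```

The fraction should sit close to 0.95. That number describes the loop as a whole; any single pass through it either covered `mu_true` or did not.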

| | Credible interval (Bayesian) | Confidence interval (frequentist) |
|---|---|---|
| What's random | θ (the belief) | the interval (the data) |
| The 95% means | P(θ in [L,U] \| this data) = 0.95 | 95% of such intervals cover the fixed θ over repeats |
| Needs a prior | yes | no |
| Guarantee | conditional on the data you saw | long-run, over the procedure |
| "95% chance θ is inside" | correct | wrong |

For large samples with a flat prior the two intervals nearly coincide numerically, which is why the distinction looks pedantic until it isn't. They diverge when the prior carries real information, when $n$ is small, or when the parameter lives near a boundary (a near-zero proportion, a variance close to zero). There, normal-approximation (Wald) confidence intervals can behave pathologically, even extending below zero for a quantity that cannot be negative, while a prior supported only on the feasible region keeps the credible interval inside it.
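A near-zero proportion shows the boundary problem in a few lines. The sketch assumes 1 success in 20 trials, uses the Wald interval as the frequentist example (other confidence intervals, such as Wilson or Clopper–Pearson, do stay inside $[0,1]$), and computes an equal-tailed credible interval on a grid under a flat Beta(1,1) prior.

```python
import math

k, n = 1, 20                                   # assumed data: 1 success in 20 trials
p_hat = k / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # normal-approximation interval

# Equal-tailed credible interval under a flat Beta(1,1) prior: posterior Beta(2, 20).
a, b, m = 1 + k, 1 + n - k, 20000
xs = [(i + 0.5) / m for i in range(m)]
dens = [x**(a - 1) * (1 - x)**(b - 1) for x in xs]
total = sum(dens)
acc, lo, hi = 0.0, None, None
for x, d in zip(xs, dens):
    acc += d / total                           # running posterior mass
    if lo is None and acc >= 0.025:
        lo = x
    if hi is None and acc >= 0.975:
        hi = x

print(f"Wald 95% CI      : [{wald[0]:+.3f}, {wald[1]:+.3f}]")
print(f"95% credible int.: [{lo:+.3f}, {hi:+.3f}]")
```

The Wald lower limit falls below zero, while the credible interval cannot leave $(0, 1)$ because the posterior has no mass outside it.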

A Bayesian reports a 95% credible interval $[L, U]$ for $ \theta $. Reading it correctly: what is $ \mathbb{P}(\theta \in [L, U] \mid D) $, the probability the parameter lies inside given the observed data?

By definition (EQ S5.7), a 95% credible interval is exactly the interval whose posterior mass is 0.95, so $ \mathbb{P}(\theta \in [L,U] \mid D) = $ 0.95. This is the statement people want a confidence interval to make — and the reason credible intervals are easier to communicate honestly.

CONTESTED

The interpretation gap cuts both ways. Bayesians point out that the credible interval answers the question people actually ask. Frequentists counter that it only does so if you accept the prior — and that a confidence interval's coverage guarantee holds regardless of any belief, which is exactly what you want for a regulator or a referee. Neither is "more correct"; they optimize for different things. The honest summary: report a credible interval when a defensible prior exists and you owe a statement about this parameter; report a confidence interval when you owe a guarantee that survives an adversary's choice of θ.

5.5

When to go Bayesian: small data, real priors, hierarchy

Bayesian inference is not a moral upgrade over frequentist statistics — it is a tool with a cost (you must specify a prior, and often pay for sampling) and three regimes where it clearly pays for itself.

Small data. When $n$ is tiny, the MLE is high-variance and can sit on the boundary (zero successes $\Rightarrow$ "the true rate is 0"). A mild prior regularizes the estimate and, more importantly, the posterior reports its own width — you get calibrated uncertainty instead of an overconfident point. This is exactly the regime Instrument S5.2 dramatizes.
Genuine prior information. If you actually know something — a physical constraint, last quarter's conversion rate, a published effect size — discarding it to "let the data speak" is throwing away signal. A prior is the disciplined way to encode it, and the posterior shows precisely how much the new data revised it.
Hierarchy & partial pooling. When you estimate many related quantities at once (conversion rates for 500 stores, batting averages for 200 players, effects across 30 hospitals), a hierarchical model lets a shared hyper-prior borrow strength across groups. Each estimate is shrunk toward the population mean by an amount the data decide; noisy small-sample groups shrink a lot, well-measured groups barely move. This is the modern form of the James–Stein result: when three or more means are estimated together, a shrinkage estimator dominates the independent MLEs in total expected squared error.

EQ S5.8 — A HIERARCHICAL (PARTIAL-POOLING) MODEL $$ \phi \sim p(\phi), \qquad \theta_j \mid \phi \sim p(\theta_j \mid \phi)\;\; (j = 1,\dots,J), \qquad y_{ij} \mid \theta_j \sim p(y \mid \theta_j) $$

A top-level hyperparameter $\phi$ (e.g. the population mean and spread) generates group-level parameters $\theta_j$, which generate the observations. Fitting $\phi$ and all $\theta_j$ jointly shrinks each group's estimate toward the population mean. That is the cure for both the "no pooling" extreme (every group on its own, wildly overfit) and the "complete pooling" extreme (one number for everyone, badly biased). Closed-form conjugacy rarely survives here, so these models are the daily reason practitioners reach for MCMC (Hamiltonian Monte Carlo / NUTS in Stan, PyMC, NumPyro) or variational inference.
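The shrinkage pattern can be previewed without a sampler if $\phi$ is treated as known. That is a deliberate simplification of EQ S5.8: a full hierarchical fit would infer $\phi$ jointly with the $\theta_j$. The sketch assumes a Normal–Normal model with known $\sigma$, and the groups, the population mean of 5.0, and $\tau = 1$ are made-up values.

```python
def partial_pool(ybar, n, sigma, pop_mean, tau):
    """Posterior mean of one group's theta_j given phi = (pop_mean, tau).

    Normal-Normal model: theta_j ~ N(pop_mean, tau^2), y_ij ~ N(theta_j, sigma^2).
    Returns (shrunk estimate, weight placed on the group's own mean).
    """
    data_prec = n / sigma**2
    prior_prec = 1.0 / tau**2
    w = data_prec / (data_prec + prior_prec)
    return w * ybar + (1 - w) * pop_mean, w

# Hypothetical groups: (name, sample mean, sample size). phi is treated as known.
groups = [("small", 9.0, 2), ("medium", 9.0, 20), ("large", 9.0, 500)]
for name, ybar, n in groups:
    est, w = partial_pool(ybar, n, sigma=4.0, pop_mean=5.0, tau=1.0)
    print(f"{name:6s} n={n:3d}  weight on own data {w:.2f}  estimate {est:.2f}")
```

All three groups report the same raw mean, yet the small group is pulled most of the way to the population mean and the large group hardly moves. This is the same precision weighting as EQ S5.4, with the population distribution playing the role of the prior.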

When to stay frequentist. If a defensible prior is genuinely unavailable and a referee will challenge whatever you pick; if you need a coverage guarantee that holds for an adversarially chosen θ (much regulatory and clinical work); or if the model is simple, data abundant, and the two answers coincide anyway — the frequentist route is cheaper and beyond dispute. The mature stance is fluency in both, not allegiance to one.

INSTRUMENT S5.3 — PRIOR-SENSITIVITY EXPLORER (SAME DATA · THREE PRIORS · EQ S5.3)

Controls: heads k (default 3), flips n (default 5). Readouts: one posterior for each of three priors: flat Beta(1,1), fair-leaning Beta(10,10), and skeptic Beta(2,8).

The identical data feed three posteriors: a flat prior (let the data speak), a strong fair-coin prior, and a skeptic who expects a low rate. With n = 5 the three posterior means disagree sharply — prior choice is doing real work. Now drag n toward 100: the curves converge and the disagreement evaporates. That convergence is the honest defence of priors — with enough data they wash out; with little data, stating yours is just being explicit about an assumption you were making anyway.
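The wash-out can be tabulated from the posterior-mean formula alone. This sketch uses the three priors named above and assumes the observed head rate stays at 60% as $n$ grows; the sample sizes beyond $n = 5$ are illustrative.

```python
priors = {"flat": (1, 1), "fair-leaning": (10, 10), "skeptic": (2, 8)}

def posterior_means(k, n):
    """Posterior mean (alpha + k) / (alpha + beta + n) under each prior."""
    return {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}

# Same 60% head rate at three sample sizes (illustrative data).
for k, n in [(3, 5), (60, 100), (600, 1000)]:
    means = posterior_means(k, n)
    spread = max(means.values()) - min(means.values())
    summary = "  ".join(f"{name} {m:.3f}" for name, m in means.items())
    print(f"n={n:4d}  {summary}  spread {spread:.3f}")
```

The spread between the most and least optimistic posterior mean shrinks steadily as $n$ grows, because each prior contributes a fixed number of pseudo-counts while the data keep adding real ones.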

Many of the updates in this chapter, such as precision-weighted means and conjugate sums, become linear algebra once the parameter is a vector rather than a scalar. Chapter 06 lays the foundation those operations stand on: vectors and matrices, eigen-decomposition, the SVD, and the geometry of projections that turns "weighted average of prior and data" into a single matrix equation.

5.R

References

Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of the theorem.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2013). Bayesian Data Analysis (3rd ed.). CRC Press — the standard modern reference for conjugacy, hierarchy, and computation.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press — the objective-Bayesian case for probability as extended logic.
Efron, B. & Morris, C. (1975). Data Analysis Using Stein's Estimator and Its Generalizations. JASA — shrinkage and partial pooling as empirical Bayes.
Casella, G. & Berger, R. L. (1987). Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. JASA — a careful account of where the two frameworks agree and diverge.
Carpenter, B. et al. (2017). Stan: A Probabilistic Programming Language. J. Stat. Softw. — Hamiltonian Monte Carlo for the hierarchical models of §5.5.