The Markov property & transition matrices
Most stochastic processes carry their whole history with them: tomorrow's weather might depend on the last week, a word on the whole paragraph. A Markov chain is the radical simplification that throws the history away. The future depends on the present state and nothing earlier — the chain is memoryless:
Collect the \(P_{ij}\) into a transition matrix \(P\). It is row-stochastic: every entry is non-negative and every row sums to 1, because from state \(i\) the chain must go somewhere. If the distribution over states today is a row vector \(\pi^{(t)}\), then one step of the chain is one matrix multiply:
A canonical example is a tiny two-state weather model: Sunny and Rainy, with \(P(\text{stay sunny}) = 0.7\) and \(P(\text{stay rainy}) = 0.6\). The transition matrix and one step from "certainly sunny today" are:
# Iterate a 2-state chain (EQ S7.2) and watch it forget where it started
import numpy as np
P = np.array([[0.7, 0.3], # from Sunny -> [Sunny, Rainy]
[0.4, 0.6]]) # from Rainy -> [Sunny, Rainy]
for start in ([1.0, 0.0], [0.0, 1.0]): # two very different beginnings
pi = np.array(start)
print(f"start {start}:")
for t in range(6):
print(f" t={t} P(Sunny)={pi[0]:.4f} P(Rainy)={pi[1]:.4f}")
pi = pi @ P # one step = one matrix multiply
print()
print("Both starts converge to (4/7, 3/7) =",
np.round([4/7, 3/7], 4), "-- the chain forgets its initial state.")
Stationary distributions & ergodicity
The fixed point the weather chain crawled toward is no accident. A distribution \(\pi\) is stationary if applying the chain leaves it unchanged: one step in, the same distribution out. It is the eigenvector of \(P^\top\) for eigenvalue 1:
When does iterating actually reach \(\pi\), and is \(\pi\) unique? That is the question of ergodicity. A finite chain converges to a single stationary distribution from every start exactly when it is:
- Irreducible — every state is reachable from every other (the chain is one connected piece, not separate islands). Otherwise each island has its own \(\pi\).
- Aperiodic — the chain is not trapped in a fixed cycle (e.g. a chain that strictly alternates A→B→A→B has period 2 and never settles; it oscillates forever). A single self-loop \(P_{ii} > 0\) anywhere breaks periodicity.
An irreducible, aperiodic finite chain has a unique stationary \(\pi\), and \(P^{\,t} \to \mathbf{1}\pi\) as \(t \to \infty\). Two consequences power the rest of the chapter. (1) Convergence: the distribution forgets its start at a rate governed by the second-largest eigenvalue \(|\lambda_2|\) — the spectral gap \(1 - |\lambda_2|\) is the mixing speed. (2) Time-averages = space-averages: the long-run fraction of time a single trajectory spends in state \(i\) equals \(\pi_i\). That second fact is what makes MCMC (§7.4) legal: run one walk long enough and its visit-frequencies are the target distribution.
A sufficient (not necessary) condition that makes \(\pi\) easy to verify and is the cornerstone of MCMC is detailed balance — the chain is reversible, with as much probability flowing \(i \to j\) as \(j \to i\):
This is also the engine behind PageRank. Model a random surfer who, at each step, follows an outgoing link uniformly at random; with probability \(1-\alpha\) (the damping, \(\alpha \approx 0.15\)) they instead teleport to a random page. That teleport term makes the chain irreducible and aperiodic on any web graph, so the ergodic theorem guarantees a unique stationary \(\pi\) — and \(\pi_i\), the long-run fraction of time the surfer sits on page \(i\), is its PageRank.
# Power-iterate a transition matrix to its stationary pi; verify pi @ P = pi
import numpy as np
P = np.array([[0.7, 0.3],
[0.4, 0.6]])
assert np.allclose(P.sum(1), 1.0) # rows are valid distributions
pi = np.array([1.0, 0.0]) # any start works
for _ in range(200):
pi = pi @ P # EQ S7.2, repeatedly
pi = pi / pi.sum()
print("stationary pi :", np.round(pi, 6))
print("pi @ P :", np.round(pi @ P, 6))
print("fixed point holds :", np.allclose(pi @ P, pi))
# cross-check against the dominant LEFT eigenvector of P (eigval 1)
w, v = np.linalg.eig(P.T) # left eigvecs = right eigvecs of P^T
k = np.argmin(np.abs(w - 1.0)) # the eigenvalue equal to 1
ev = np.real(v[:, k]); ev = ev / ev.sum()
print("eigenvector pi :", np.round(ev, 6))
print("closed form (4/7) :", round(4/7, 6))
Hidden Markov Models
So far the state was observable — we saw the weather directly. A Hidden Markov Model (HMM) adds one layer of indirection: the Markov chain runs underneath, but you never see the states. You see only emissions — noisy observations whose distribution depends on the hidden state. The classic toy: you cannot see the weather, but you see whether a friend carries an umbrella; you infer the hidden weather from the visible umbrellas.
Three canonical questions, each with an exact algorithm — all variants of the same dynamic program that re-uses the Markov factorization instead of enumerating paths:
| Question | Algorithm | Computes |
|---|---|---|
| Likelihood — how probable is this observation sequence? | Forward | \(\Pr(x_{1:T})\) by summing over hidden paths |
| Decoding — what is the single most likely hidden path? | Viterbi | \(\arg\max_z \Pr(z_{1:T} \mid x_{1:T})\) |
| Learning — fit \(A, B, \pi\) from unlabeled data | Baum–Welch (EM) | parameters that locally maximize \(\Pr(x_{1:T})\) |
The forward algorithm carries a vector of "beliefs" \(\alpha_t(j) = \Pr(x_{1:t},\, z_t = j)\) — the joint probability of the observations so far and being in state \(j\) now — and updates it one observation at a time:
HMMs ruled speech recognition, part-of-speech tagging, and bioinformatics (gene finding, sequence alignment) for two decades. In modern deep learning their throne went to RNNs and then Transformers (Vol II · Ch 03), which drop the discrete-state and conditional-independence constraints. But the HMM's structure survives everywhere: linear-state-space models and Kalman filters are HMMs with continuous Gaussian states, and the forward–backward dynamic program is the direct ancestor of the message-passing that powers modern structured prediction.
Markov Chain Monte Carlo — Metropolis–Hastings
Here the chapter turns inside-out. Until now \(P\) was given and we asked for \(\pi\). MCMC inverts the problem: you are given a target distribution \(\pi\) — typically a Bayesian posterior (Stats 05) you can evaluate up to a constant but cannot integrate or sample from directly — and you construct a Markov chain whose stationary distribution is exactly that \(\pi\). Run the chain; its trajectory becomes your sample. The ergodic theorem (§7.2) is the guarantee that this is allowed.
The defining pain of Bayesian inference is the normalizing constant. The posterior \(p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}\) has a denominator \(p(x) = \int p(x\mid\theta)p(\theta)\,\mathrm{d}\theta\) that is usually an intractable high-dimensional integral. MCMC sidesteps it entirely: every step compares two densities as a ratio, so the unknown constant cancels. You only ever need \(\pi\) up to proportionality. That single trick is why MCMC, not algebra, is how most real Bayesian models are fit.
The Metropolis–Hastings algorithm is a constructive recipe for a chain obeying detailed balance (EQ S7.5) with respect to any \(\pi\). From the current point \(\theta\), propose a move to \(\theta'\) from a proposal density \(q(\theta' \mid \theta)\); then accept it with a probability designed to enforce reversibility:
The logic is a biased random walk. Uphill moves (to higher \(\pi\)) are always taken; downhill moves are taken with probability equal to the density ratio, so the walker explores the tails without getting stuck on the peak. Over many steps it visits each region in proportion to \(\pi\) — exactly the time-average = space-average promise. Two practical knobs dominate everything:
- Step size (proposal width). Too small and the walker shuffles, exploring slowly with high autocorrelation; too large and almost every proposal lands in the low-density wilderness and is rejected. The folklore target acceptance rate is ~0.234 for high-dimensional random-walk Metropolis (an Roberts–Gelman–Gilks result) — neither greedy nor timid.
- Burn-in & mixing. The chain starts wherever you put it, not at \(\pi\); the first stretch is transient and is discarded as burn-in. Consecutive samples are correlated, so the effective sample size is far below the raw count. Diagnostics like the \(\hat{R}\) statistic (comparing several independent chains) and trace plots are how you decide it has converged — and "it looks converged" is famously not a proof.
# Metropolis-Hastings for a 2-component Gaussian mixture target (EQ S7.8)
import numpy as np
rng = np.random.default_rng(0)
def target(x): # unnormalized density: two peaks
return 0.6*np.exp(-0.5*((x+2)/0.7)**2) + 0.4*np.exp(-0.5*((x-2)/1.0)**2)
x, step, n = 0.0, 1.5, 40000
samples, accepts = np.empty(n), 0
for i in range(n):
xp = x + rng.normal(0, step) # symmetric Gaussian proposal
if rng.random() < min(1.0, target(xp)/target(x)): # accept rule, const cancels
x, accepts = xp, accepts + 1 # move; ratio needs no normalizer
samples[i] = x
burn = samples[2000:] # discard burn-in
print(f"acceptance rate : {accepts/n:.3f}")
print(f"sample mean : {burn.mean():+.3f} (true mix mean is near 0)")
print(f"sample std : {burn.std():.3f}")
# overlay the recovered target density on the canvas (normalized by area)
xs = np.linspace(-6, 6, 200)
area = np.sum(target(xs)) * (xs[1] - xs[0]) # Riemann sum normalizer
plot_xy(xs, target(xs) / area) # the target the walk reproduced
Gibbs sampling & MCMC in modern ML
Metropolis–Hastings is general but blunt: one proposal width for a whole high-dimensional space rarely fits. Gibbs sampling is the special case that exploits structure — when you cannot sample the joint \(p(\theta_1, \ldots, \theta_d)\) but can sample each variable from its full conditional given all the others. You then cycle through the coordinates, replacing each in turn by a fresh draw from its conditional:
Where MCMC sits in the 2026 landscape:
- Probabilistic programming. Stan, PyMC, and NumPyro let you declare a model and sample its posterior with no hand-derived math. The default sampler is almost never vanilla Metropolis — it is the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that uses gradients of \(\log\pi\) to propose long, informed trajectories instead of a blind local jiggle, dramatically improving mixing in high dimensions. Vanilla random-walk Metropolis is now mostly pedagogy and a fallback.
- The gradient frontier. HMC/NUTS need \(\nabla \log \pi\), which auto-diff (Vol II · Ch 03 machinery) supplies for free. For massive datasets, stochastic-gradient MCMC (SGLD and kin) injects calibrated noise into SGD steps so the optimizer's trajectory itself samples a Bayesian posterior over weights — the bridge between deep learning and Bayesian inference.
- The diffusion connection. Modern image and video generators (Vol II · Ch 08) are, at heart, learned reverse-time Markov chains: a forward chain gradually adds Gaussian noise to data, and a neural network learns to run the chain backward, sampling images by walking from noise to signal. Langevin-style sampling — a gradient ascent on \(\log\pi\) with noise — is the direct intellectual descendant of the Metropolis walk you ran in Instrument S7.3.
- The honest caveats. Convergence is asymptotic and unprovable in finite time; multimodal targets (like Instrument S7.3 with small σ) can trap a chain in one mode forever; and effective sample sizes can be a tiny fraction of the raw count. MCMC is the workhorse of Bayesian computation precisely because, used with diagnostics and skepticism, it is the most general tool we have — not because it is foolproof.
Markov chains gave us a way to sample what we cannot integrate; the next chapter asks how much a random outcome is worth knowing. Information theory measures uncertainty in bits — entropy, cross-entropy, KL divergence, mutual information — and turns out to be the language in which the acceptance ratios, the mixing rates, and the very loss functions of every model in this encyclopedia are most naturally written. Stats 08: Information Theory.
References
- Norris, J. R. (1997). Markov Chains.
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines.
- Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications.
- Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
- Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.
- Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web.
- Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms.
- Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.