Markov Chains & MCMC — AI Encyclopedia

7.1

The Markov property & transition matrices

Most stochastic processes carry their whole history with them: tomorrow's weather might depend on the last week, a word on the whole paragraph. A Markov chain is the radical simplification that throws the history away. The future depends on the present state and nothing earlier — the chain is memoryless:

EQ S7.1 — THE MARKOV PROPERTY $$ \Pr\!\big(X_{t+1} = j \mid X_t = i,\, X_{t-1},\, \ldots,\, X_0\big) \;=\; \Pr\!\big(X_{t+1} = j \mid X_t = i\big) \;=\; P_{ij} $$

Conditioning on the entire past collapses to conditioning on the single current state. The number $P_{ij}$ — the probability of stepping from state $i$ to state $j$ — is independent of when or how you arrived at $i$. This one line is the whole subject. Everything downstream (stationarity, HMMs, MCMC) is a consequence of replacing "history" with "current state". A chain that instead needs the last $k$ states is order-$k$; but any order-$k$ chain over alphabet $S$ is an ordinary order-1 chain over the enlarged state space $S^k$, so we lose no generality studying order 1.

Collect the $P_{ij}$ into a transition matrix $P$. It is row-stochastic: every entry is non-negative and every row sums to 1, because from state $i$ the chain must go somewhere. If the distribution over states today is a row vector $\pi^{(t)}$, then one step of the chain is one matrix multiply:

EQ S7.2 — ONE STEP & THE CHAPMAN–KOLMOGOROV RELATION $$ \pi^{(t+1)} \;=\; \pi^{(t)} P, \qquad\Longrightarrow\qquad \pi^{(t)} \;=\; \pi^{(0)} P^{\,t}, \qquad \big(P^{\,n}\big)_{ij} \;=\; \Pr\!\big(X_{t+n}=j \mid X_t=i\big) $$

Distributions are left-multiplied by $P$ (row vector times matrix). The $n$-step transition probabilities are just the $n$-th matrix power — the Chapman–Kolmogorov equation $P^{m+n} = P^m P^n$ is nothing more than the associativity of matrix multiplication. So the long-run behavior of a Markov chain is an eigenvalue question about $P$, which is exactly the linear algebra of Chapter 06 put to work.

A canonical example is a tiny two-state weather model: Sunny and Rainy, with $P(\text{stay sunny}) = 0.7$ and $P(\text{stay rainy}) = 0.6$. The transition matrix and one step from "certainly sunny today" are:

EQ S7.3 — A TWO-STATE CHAIN $$ P = \begin{pmatrix} 0.7 & 0.3 \\ 0.4 & 0.6 \end{pmatrix}, \qquad \pi^{(0)} = (1,\ 0), \qquad \pi^{(1)} = \pi^{(0)} P = (0.7,\ 0.3) $$

Rows are "from", columns are "to": row 1 is "from Sunny", so $0.7$ stay sunny, $0.3$ turn rainy. Iterate and the distribution marches toward a fixed point $(4/7,\ 3/7) \approx (0.571,\ 0.429)$ regardless of the starting day — the stationary distribution of §7.2. The whole future is encoded in this $2\times2$ grid.

A chain's next state depends only on the current state — that is exactly the Markov property of EQ S7.1. What is the order of such a chain (how many previous states the transition rule reads)?

Order-$k$ means the next state depends on the last $k$ states. The Markov property says it depends on the current state alone, so $k = $ 1. (An order-2 chain would read the last two states; it can always be re-encoded as an order-1 chain over pairs.)

PYTHON · RUNNABLE IN-BROWSER

# Iterate a 2-state chain (EQ S7.2) and watch it forget where it started
import numpy as np
P = np.array([[0.7, 0.3],     # from Sunny -> [Sunny, Rainy]
              [0.4, 0.6]])    # from Rainy -> [Sunny, Rainy]

for start in ([1.0, 0.0], [0.0, 1.0]):     # two very different beginnings
    pi = np.array(start)
    print(f"start {start}:")
    for t in range(6):
        print(f"  t={t}  P(Sunny)={pi[0]:.4f}  P(Rainy)={pi[1]:.4f}")
        pi = pi @ P                          # one step = one matrix multiply
    print()

print("Both starts converge to (4/7, 3/7) =",
      np.round([4/7, 3/7], 4), "-- the chain forgets its initial state.")

edits are live — break it on purpose

INSTRUMENT S7.1 — TRANSITION-MATRIX EXPLOREREDIT P · ITERATE TO STATIONARY · EQ S7.2

P(A → A) 0.70

P(B → B) 0.60

START P(A) 1.00

CURRENT P(A) · P(B)

—

STATIONARY π = (πA, πB)

—

STEPS TO CONVERGE (Δ<1e-4)

—

Off-diagonals are fixed by row-stochasticity: $P(A\to B) = 1 - P(A\to A)$. Press STEP to apply $\pi \leftarrow \pi P$ once and watch the bars march; RUN animates the whole walk to the fixed point. The dashed line is the solved stationary $\pi_A = \tfrac{P(B\to A)}{P(A\to B)+P(B\to A)}$. Drag START P(A) anywhere — the chain lands on the same π, because it forgets its past.

7.2

Stationary distributions & ergodicity

The fixed point the weather chain crawled toward is no accident. A distribution $\pi$ is stationary if applying the chain leaves it unchanged: one step in, the same distribution out. It is the eigenvector of $P^\top$ for eigenvalue 1:

EQ S7.4 — STATIONARY DISTRIBUTION $$ \pi P = \pi, \qquad \sum_i \pi_i = 1, \qquad \pi_i \ge 0 \qquad\Longleftrightarrow\qquad P^\top \pi^\top = \pi^\top $$

A row-stochastic $P$ always has eigenvalue 1 (its rows sum to 1, so the all-ones vector is a right eigenvector); the corresponding left eigenvector, normalized to sum to 1, is $\pi$. For the two-state chain, solving $\pi_A P_{AB} = \pi_B P_{BA}$ with $\pi_A + \pi_B = 1$ gives $\pi_A = P_{BA}/(P_{AB}+P_{BA})$. Stationary does not mean the chain stops — individual realizations keep hopping; it means the population of states stops shifting.

When does iterating actually reach $\pi$, and is $\pi$ unique? That is the question of ergodicity. A finite chain converges to a single stationary distribution from every start exactly when it is:

Irreducible — every state is reachable from every other (the chain is one connected piece, not separate islands). Otherwise each island has its own $\pi$.
Aperiodic — the chain is not trapped in a fixed cycle (e.g. a chain that strictly alternates A→B→A→B has period 2 and never settles; it oscillates forever). A single self-loop $P_{ii} > 0$ anywhere breaks periodicity.

ERGODIC THEOREM

An irreducible, aperiodic finite chain has a unique stationary $\pi$, and $P^{\,t} \to \mathbf{1}\pi$ as $t \to \infty$. Two consequences power the rest of the chapter. (1) Convergence: the distribution forgets its start at a rate governed by the second-largest eigenvalue $|\lambda_2|$ — the spectral gap $1 - |\lambda_2|$ is the mixing speed. (2) Time-averages = space-averages: the long-run fraction of time a single trajectory spends in state $i$ equals $\pi_i$. That second fact is what makes MCMC (§7.4) legal: run one walk long enough and its visit-frequencies are the target distribution.

A sufficient (not necessary) condition that makes $\pi$ easy to verify and is the cornerstone of MCMC is detailed balance — the chain is reversible, with as much probability flowing $i \to j$ as $j \to i$:

EQ S7.5 — DETAILED BALANCE (REVERSIBILITY) $$ \pi_i\, P_{ij} \;=\; \pi_j\, P_{ji} \quad \text{for all } i, j \qquad\Longrightarrow\qquad \pi P = \pi $$

Sum the left identity over $i$: $\sum_i \pi_i P_{ij} = \pi_j \sum_i P_{ji} = \pi_j$, which is exactly $(\pi P)_j = \pi_j$. So detailed balance implies stationarity — a strictly stronger, local, pairwise condition that is far easier to engineer than the global $\pi P = \pi$. Metropolis–Hastings (§7.4) is, in one sentence, a recipe for constructing a chain that satisfies EQ S7.5 for any target $\pi$ you name.

This is also the engine behind PageRank. Model a random surfer who, at each step, follows an outgoing link uniformly at random; with probability $1-\alpha$ (the damping, $\alpha \approx 0.15$) they instead teleport to a random page. That teleport term makes the chain irreducible and aperiodic on any web graph, so the ergodic theorem guarantees a unique stationary $\pi$ — and $\pi_i$, the long-run fraction of time the surfer sits on page $i$, is its PageRank.

For the two-state chain with $P(\text{stay A}) = 0.7$ (so $P(A\to B)=0.3$) and $P(\text{stay B}) = 0.6$ (so $P(B\to A)=0.4$), what is the stationary probability $\pi_A$? Use $\pi_A = \dfrac{P(B\to A)}{P(A\to B) + P(B\to A)}$.

$ \pi_A = \dfrac{0.4}{0.3 + 0.4} = \dfrac{0.4}{0.7} = \dfrac{4}{7} = $ 0.571. Equivalently, detailed balance $\pi_A \cdot 0.3 = \pi_B \cdot 0.4$ with $\pi_A + \pi_B = 1$ gives the same answer. The chain spends ~57% of its days sunny in the long run.

INSTRUMENT S7.2 — RANDOM WALK & PAGERANK5-NODE GRAPH · DAMPED SURFER · EQ S7.4

DAMPING α (teleport prob) 0.15

WALK SPEED MED

STEPS WALKED

TOP PAGE (EMPIRICAL)

—

EMPIRICAL ≈ EXACT π?

—

A single surfer hops the directed graph; each node's bar shows the empirical visit frequency, with the blue tick marking the exact stationary $\pi$ (the eigenvector, solved by power iteration). Watch the time-average climb toward the space-average — that convergence is the ergodic theorem in action. Node C is a hub with many inbound links, so it wins. Set α high and authority flattens (everyone teleports everywhere); set α = 0 and a dangling/cyclic trap can starve the rest.

PYTHON · RUNNABLE IN-BROWSER

# Power-iterate a transition matrix to its stationary pi; verify pi @ P = pi
import numpy as np
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
assert np.allclose(P.sum(1), 1.0)          # rows are valid distributions

pi = np.array([1.0, 0.0])                   # any start works
for _ in range(200):
    pi = pi @ P                             # EQ S7.2, repeatedly
pi = pi / pi.sum()

print("stationary pi      :", np.round(pi, 6))
print("pi @ P             :", np.round(pi @ P, 6))
print("fixed point holds  :", np.allclose(pi @ P, pi))

# cross-check against the dominant LEFT eigenvector of P (eigval 1)
w, v = np.linalg.eig(P.T)                    # left eigvecs = right eigvecs of P^T
k = np.argmin(np.abs(w - 1.0))              # the eigenvalue equal to 1
ev = np.real(v[:, k]); ev = ev / ev.sum()
print("eigenvector pi     :", np.round(ev, 6))
print("closed form (4/7)  :", round(4/7, 6))

edits are live — break it on purpose

7.3

Hidden Markov Models

So far the state was observable — we saw the weather directly. A Hidden Markov Model (HMM) adds one layer of indirection: the Markov chain runs underneath, but you never see the states. You see only emissions — noisy observations whose distribution depends on the hidden state. The classic toy: you cannot see the weather, but you see whether a friend carries an umbrella; you infer the hidden weather from the visible umbrellas.

EQ S7.6 — HMM JOINT DISTRIBUTION $$ \Pr\!\big(x_{1:T},\, z_{1:T}\big) \;=\; \underbrace{\pi_{z_1}}_{\text{initial}} \;\prod_{t=2}^{T} \underbrace{A_{z_{t-1} z_t}}_{\text{transition}} \;\prod_{t=1}^{T} \underbrace{B_{z_t}(x_t)}_{\text{emission}} $$

$z_{1:T}$ are the hidden states (the Markov chain), $x_{1:T}$ the observations. $A$ is the state-transition matrix (the chain of §7.1), $B_{z}(x)$ the emission probability of seeing $x$ when the hidden state is $z$, and $\pi$ the initial distribution. Two Markov assumptions are stacked: states depend only on the previous state, and each observation depends only on the current state. Summing this joint over all $K^T$ hidden paths looks hopeless — but dynamic programming makes it $O(K^2 T)$.

Three canonical questions, each with an exact algorithm — all variants of the same dynamic program that re-uses the Markov factorization instead of enumerating paths:

Question	Algorithm	Computes
Likelihood — how probable is this observation sequence?	Forward	$\Pr(x_{1:T})$ by summing over hidden paths
Decoding — what is the single most likely hidden path?	Viterbi	$\arg\max_z \Pr(z_{1:T} \mid x_{1:T})$
Learning — fit $A, B, \pi$ from unlabeled data	Baum–Welch (EM)	parameters that locally maximize $\Pr(x_{1:T})$

The forward algorithm carries a vector of "beliefs" $\alpha_t(j) = \Pr(x_{1:t},\, z_t = j)$ — the joint probability of the observations so far and being in state $j$ now — and updates it one observation at a time:

EQ S7.7 — THE FORWARD RECURSION $$ \alpha_t(j) \;=\; B_j(x_t) \sum_{i=1}^{K} \alpha_{t-1}(i)\, A_{ij}, \qquad \Pr(x_{1:T}) \;=\; \sum_{j=1}^{K} \alpha_T(j) $$

Each step blends "where could I have come from" ($\sum_i \alpha_{t-1}(i) A_{ij}$, a transition step exactly like EQ S7.2) with "does state $j$ explain what I just saw" ($B_j(x_t)$). The whole sequence's likelihood is the final beliefs summed. Viterbi is the same recursion with $\sum$ replaced by $\max$ (and a back-pointer), turning "total probability of all paths" into "probability of the single best path".

HMMs ruled speech recognition, part-of-speech tagging, and bioinformatics (gene finding, sequence alignment) for two decades. In modern deep learning their throne went to RNNs and then Transformers (Vol II · Ch 03), which drop the discrete-state and conditional-independence constraints. But the HMM's structure survives everywhere: linear-state-space models and Kalman filters are HMMs with continuous Gaussian states, and the forward–backward dynamic program is the direct ancestor of the message-passing that powers modern structured prediction.

An HMM starts with $\pi_{\text{Rainy}} = 0.6$. The emission probability of seeing an umbrella given Rainy is $B_{\text{Rainy}}(\text{umbrella}) = 0.3$. Using EQ S7.7 at $t=1$ (no transition yet), what is the forward value $\alpha_1(\text{Rainy}) = \pi_{\text{Rainy}}\, B_{\text{Rainy}}(\text{umbrella})$?

$ \alpha_1(\text{Rainy}) = \pi_{\text{Rainy}} \cdot B_{\text{Rainy}}(\text{umbrella}) = 0.6 \times 0.3 = $ 0.18. This is the joint probability of "the weather is Rainy on day 1 and we saw an umbrella" — the seed the forward recursion then propagates forward.

7.4

Markov Chain Monte Carlo — Metropolis–Hastings

Here the chapter turns inside-out. Until now $P$ was given and we asked for $\pi$. MCMC inverts the problem: you are given a target distribution $\pi$ — typically a Bayesian posterior (Stats 05) you can evaluate up to a constant but cannot integrate or sample from directly — and you construct a Markov chain whose stationary distribution is exactly that $\pi$. Run the chain; its trajectory becomes your sample. The ergodic theorem (§7.2) is the guarantee that this is allowed.

WHY THIS MATTERS

The defining pain of Bayesian inference is the normalizing constant. The posterior $p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}$ has a denominator $p(x) = \int p(x\mid\theta)p(\theta)\,\mathrm{d}\theta$ that is usually an intractable high-dimensional integral. MCMC sidesteps it entirely: every step compares two densities as a ratio, so the unknown constant cancels. You only ever need $\pi$ up to proportionality. That single trick is why MCMC, not algebra, is how most real Bayesian models are fit.

The Metropolis–Hastings algorithm is a constructive recipe for a chain obeying detailed balance (EQ S7.5) with respect to any $\pi$. From the current point $\theta$, propose a move to $\theta'$ from a proposal density $q(\theta' \mid \theta)$; then accept it with a probability designed to enforce reversibility:

EQ S7.8 — THE METROPOLIS–HASTINGS ACCEPTANCE RULE $$ a \;=\; \min\!\left(1,\; \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} \right); \qquad \text{accept } \theta' \text{ with prob } a, \text{ else stay at } \theta $$

$\pi$ appears only as the ratio $\pi(\theta')/\pi(\theta)$ — the normalizing constant cancels, which is the whole point. The proposal ratio $q(\theta\mid\theta')/q(\theta'\mid\theta)$ corrects for any asymmetry in how you propose. For a symmetric proposal (e.g. a Gaussian centered on the current point) those $q$ terms cancel and the rule collapses to the original 1953 Metropolis form $a = \min(1,\, \pi(\theta')/\pi(\theta))$: always move toward higher density, sometimes move toward lower. Plugging EQ S7.8 into detailed balance verifies $\pi$ is stationary by construction.

The logic is a biased random walk. Uphill moves (to higher $\pi$) are always taken; downhill moves are taken with probability equal to the density ratio, so the walker explores the tails without getting stuck on the peak. Over many steps it visits each region in proportion to $\pi$ — exactly the time-average = space-average promise. Two practical knobs dominate everything:

Step size (proposal width). Too small and the walker shuffles, exploring slowly with high autocorrelation; too large and almost every proposal lands in the low-density wilderness and is rejected. The folklore target acceptance rate is ~0.234 for high-dimensional random-walk Metropolis (an Roberts–Gelman–Gilks result) — neither greedy nor timid.
Burn-in & mixing. The chain starts wherever you put it, not at $\pi$; the first stretch is transient and is discarded as burn-in. Consecutive samples are correlated, so the effective sample size is far below the raw count. Diagnostics like the $\hat{R}$ statistic (comparing several independent chains) and trace plots are how you decide it has converged — and "it looks converged" is famously not a proof.

In symmetric-proposal Metropolis, the acceptance probability is $a = \min\!\big(1,\, \pi(\theta')/\pi(\theta)\big)$. The current point has (unnormalized) density $\pi(\theta) = 2$ and the proposed point has $\pi(\theta') = 1$. What is the acceptance probability $a$?

The proposal moves downhill (1 < 2), so $a = \min\!\big(1,\, 1/2\big) = $ 0.5. The walker takes this downhill step half the time — accepting just enough bad moves to map the distribution's tails rather than collapsing onto its mode. (Had the move been uphill, $a = \min(1,\, \text{ratio}\,>\,1) = 1$: always accept.)

INSTRUMENT S7.3 — METROPOLIS SAMPLER1-D TWO-PEAK TARGET · HISTOGRAM CONVERGES · EQ S7.8

PROPOSAL STEP σ 1.0

SAMPLE SPEED MED

SAMPLES DRAWN

ACCEPTANCE RATE

—

HISTOGRAM vs TARGET (L1)

—

The blue curve is the target — a two-peaked mixture you can evaluate but not easily sample. Mint bars are the running histogram of accepted samples; the dot is the current walker. Watch the bars grow into the curve. Now break it: shrink σ toward 0 and the walker can't cross the valley between peaks — it samples one mode and reports a confidently wrong distribution (the canonical MCMC failure). Blow σ up and the acceptance rate craters as proposals miss the target entirely. The healthy regime is in between.

PYTHON · RUNNABLE IN-BROWSER

# Metropolis-Hastings for a 2-component Gaussian mixture target (EQ S7.8)
import numpy as np
rng = np.random.default_rng(0)

def target(x):                              # unnormalized density: two peaks
    return 0.6*np.exp(-0.5*((x+2)/0.7)**2) + 0.4*np.exp(-0.5*((x-2)/1.0)**2)

x, step, n = 0.0, 1.5, 40000
samples, accepts = np.empty(n), 0
for i in range(n):
    xp = x + rng.normal(0, step)            # symmetric Gaussian proposal
    if rng.random() < min(1.0, target(xp)/target(x)):   # accept rule, const cancels
        x, accepts = xp, accepts + 1        # move; ratio needs no normalizer
    samples[i] = x

burn = samples[2000:]                        # discard burn-in
print(f"acceptance rate : {accepts/n:.3f}")
print(f"sample mean     : {burn.mean():+.3f}   (true mix mean is near 0)")
print(f"sample std      : {burn.std():.3f}")
# overlay the recovered target density on the canvas (normalized by area)
xs = np.linspace(-6, 6, 200)
area = np.sum(target(xs)) * (xs[1] - xs[0])  # Riemann sum normalizer
plot_xy(xs, target(xs) / area)               # the target the walk reproduced

edits are live — break it on purpose

7.5

Gibbs sampling & MCMC in modern ML

Metropolis–Hastings is general but blunt: one proposal width for a whole high-dimensional space rarely fits. Gibbs sampling is the special case that exploits structure — when you cannot sample the joint $p(\theta_1, \ldots, \theta_d)$ but can sample each variable from its full conditional given all the others. You then cycle through the coordinates, replacing each in turn by a fresh draw from its conditional:

EQ S7.9 — THE GIBBS SWEEP $$ \theta_1^{(t+1)} \sim p\big(\theta_1 \mid \theta_2^{(t)}, \ldots, \theta_d^{(t)}\big), \;\; \theta_2^{(t+1)} \sim p\big(\theta_2 \mid \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots\big), \;\; \ldots $$

Each coordinate is updated from its conditional with the others held fixed (always using the freshest values). Gibbs is Metropolis–Hastings with acceptance probability always equal to 1: when you propose from the exact conditional, the MH ratio simplifies to one, so no move is ever rejected. The price is that you need those conditionals in closed form — which is exactly why conjugate priors (Stats 05) and graphical models are its natural habitat. It can also mix slowly when coordinates are strongly correlated, because it only ever moves axis-by-axis.

Where MCMC sits in the 2026 landscape:

Probabilistic programming. Stan, PyMC, and NumPyro let you declare a model and sample its posterior with no hand-derived math. The default sampler is almost never vanilla Metropolis — it is the No-U-Turn Sampler (NUTS), an adaptive form of Hamiltonian Monte Carlo that uses gradients of $\log\pi$ to propose long, informed trajectories instead of a blind local jiggle, dramatically improving mixing in high dimensions. Vanilla random-walk Metropolis is now mostly pedagogy and a fallback.
The gradient frontier. HMC/NUTS need $\nabla \log \pi$, which auto-diff (Vol II · Ch 03 machinery) supplies for free. For massive datasets, stochastic-gradient MCMC (SGLD and kin) injects calibrated noise into SGD steps so the optimizer's trajectory itself samples a Bayesian posterior over weights — the bridge between deep learning and Bayesian inference.
The diffusion connection. Modern image and video generators (Vol II · Ch 08) are, at heart, learned reverse-time Markov chains: a forward chain gradually adds Gaussian noise to data, and a neural network learns to run the chain backward, sampling images by walking from noise to signal. Langevin-style sampling — a gradient ascent on $\log\pi$ with noise — is the direct intellectual descendant of the Metropolis walk you ran in Instrument S7.3.
The honest caveats. Convergence is asymptotic and unprovable in finite time; multimodal targets (like Instrument S7.3 with small σ) can trap a chain in one mode forever; and effective sample sizes can be a tiny fraction of the raw count. MCMC is the workhorse of Bayesian computation precisely because, used with diagnostics and skepticism, it is the most general tool we have — not because it is foolproof.

Markov chains gave us a way to sample what we cannot integrate; the next chapter asks how much a random outcome is worth knowing. Information theory measures uncertainty in bits — entropy, cross-entropy, KL divergence, mutual information — and turns out to be the language in which the acceptance ratios, the mixing rates, and the very loss functions of every model in this encyclopedia are most naturally written. Stats 08: Information Theory.

7.R

References

Norris, J. R. (1997). Markov Chains. Cambridge University Press — the standard rigorous treatment of transition matrices, stationarity, ergodicity, and reversibility.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 21(6) — the original Metropolis acceptance rule (symmetric-proposal special case of EQ S7.8).
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57(1) — the generalization to asymmetric proposals, completing Metropolis–Hastings.
Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 77(2) — the canonical reference for the forward, Viterbi, and Baum–Welch algorithms (§7.3).
Geman, S. & Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE TPAMI 6(6) — introduced Gibbs sampling (EQ S7.9) to statistics and image analysis.
Page, L., Brin, S., Motwani, R. & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab — the damped random-walk Markov chain whose stationary distribution is PageRank (§7.2).
Roberts, G. O., Gelman, A. & Gilks, W. R. (1997). Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. Ann. Appl. Probab. 7(1) — the ~0.234 optimal acceptance-rate result (§7.4).
Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. JMLR 15 — NUTS, the adaptive HMC sampler behind Stan / PyMC / NumPyro (§7.5).

Question	Algorithm	Computes
Likelihood — how probable is this observation sequence?	Forward	\(\Pr(x_{1:T})\) by summing over hidden paths
Decoding — what is the single most likely hidden path?	Viterbi	\(\arg\max_z \Pr(z_{1:T} \mid x_{1:T})\)
Learning — fit \(A, B, \pi\) from unlabeled data	Baum–Welch (EM)	parameters that locally maximize \(\Pr(x_{1:T})\)