MATHEMATICS & STATISTICS · CHAPTER 01 / 08

Probability — The Logic of Uncertainty

Probability gives degrees of belief an arithmetic, built on three axioms. Conditioning is the operation that turns prior belief into posterior knowledge: it governs how a diagnostic test revises a diagnosis, how a spam filter learns, and how any model reasons under doubt.

LEVEL: INTRO · READING TIME: ≈ 24 MIN · BUILDS ON: ALGEBRA · INSTRUMENTS: BAYES BOX, LLN, TREE
1.1

Sample spaces, events & Kolmogorov's axioms

Probability begins by naming everything that could happen. The sample space \(\Omega\) is the set of all possible outcomes of an experiment: for one die roll \(\Omega = \{1,2,3,4,5,6\}\); for a coin flip \(\Omega = \{H, T\}\). An event is any subset of \(\Omega\) — "the roll is even" is the event \(\{2,4,6\}\). Probability is then a single function that assigns each event a number between 0 and 1, measuring how much of the sample space it occupies.

In 1933 Andrei Kolmogorov reduced the entire subject to three rules. Every theorem in this chapter — every theorem in probability — is a consequence of just these:

EQ S1.1 — KOLMOGOROV'S AXIOMS $$ \text{(1)}\;\; P(A) \ge 0 \qquad \text{(2)}\;\; P(\Omega) = 1 \qquad \text{(3)}\;\; P\!\Big(\bigcup_i A_i\Big) = \sum_i P(A_i) \;\;\text{for disjoint } A_i $$
Probabilities are non-negative, the certain event has probability one, and the probability of any of several mutually exclusive events is the sum of their probabilities. That is the whole foundation. From them follow the complement rule \(P(A^c) = 1 - P(A)\), monotonicity \(A \subseteq B \Rightarrow P(A) \le P(B)\), and — for events that can overlap — inclusion–exclusion.
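
Both the complement rule and monotonicity follow from the axioms in a line or two. A sketch, using axiom (3) for just two disjoint events:

$$ \Omega = A \cup A^c,\;\; A \cap A^c = \varnothing \;\;\Longrightarrow\;\; 1 = P(\Omega) = P(A) + P(A^c) \;\;\Longrightarrow\;\; P(A^c) = 1 - P(A) $$
$$ A \subseteq B \;\;\Longrightarrow\;\; B = A \cup (B \setminus A)\ \text{(disjoint)} \;\;\Longrightarrow\;\; P(B) = P(A) + P(B \setminus A) \;\ge\; P(A) $$

The first line uses axioms (2) and (3); the second uses (3) and then (1) applied to \(B \setminus A\).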

The third axiom only adds probabilities when events cannot happen together. When they can overlap, naively summing double-counts the intersection, so you subtract it back out:

EQ S1.2 — ADDITION RULE (INCLUSION–EXCLUSION) $$ P(A \cup B) \;=\; P(A) + P(B) - P(A \cap B) $$
"Probability of A or B" — where "or" is inclusive. The overlap \(P(A \cap B)\) sits inside both \(P(A)\) and \(P(B)\), so it is counted twice and must be removed once. If \(A\) and \(B\) are disjoint, \(P(A \cap B) = 0\) and this collapses to axiom 3. This single correction is the seed of nearly every "but you forgot to subtract the overlap" mistake in applied probability.

For a finite, equally-likely sample space — fair dice, shuffled cards, balanced coins — every outcome carries weight \(1/|\Omega|\), and the probability of an event reduces to counting:

EQ S1.3 — THE CLASSICAL (COUNTING) DEFINITION $$ P(A) \;=\; \frac{|A|}{|\Omega|} \;=\; \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}} $$
Valid only when outcomes are equally likely — a modelling assumption, not a law. Most of real life is not equally likely (a biased coin, a loaded market), which is exactly why Kolmogorov's axioms are stated abstractly: they hold whether probabilities come from symmetry, from long-run frequency, or from a degree of belief.

Frequentist vs. Bayesian — the honest caveat. The axioms say what a probability function must obey; they are silent on what a probability means. One school reads \(P(A)\) as a long-run frequency — the fraction of times \(A\) occurs in endless repetitions (§1.4 makes this precise). The other reads it as a degree of belief that can be updated by evidence (§1.3). Both satisfy EQ S1.1 identically, which is why the two camps share every equation and disagree only on interpretation. This chapter uses whichever lens is clearer and flags the switch.

For a single random draw, suppose \( P(A) = 0.5 \), \( P(B) = 0.4 \), and \( P(A \cap B) = 0.2 \). What is \( P(A \cup B) \)?
By EQ S1.2, \( P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.5 + 0.4 - 0.2 = \) 0.7. The \(0.2\) overlap was sitting inside both \(0.5\) and \(0.4\), so it is removed exactly once.
PYTHON · RUNNABLE IN-BROWSER
# Axioms by counting: a fair die, the event "even", and the addition rule
import numpy as np
omega = np.array([1, 2, 3, 4, 5, 6])          # sample space

A = omega[omega % 2 == 0]                       # event A: roll is even {2,4,6}
B = omega[omega > 3]                            # event B: roll > 3   {4,5,6}
P = lambda S: len(S) / len(omega)               # EQ S1.3: classical definition

inter = np.intersect1d(A, B)                     # A and B  -> {4, 6}
union = np.union1d(A, B)                         # A or  B  -> {2,4,5,6}
print(f"P(A)            = {P(A):.4f}")
print(f"P(B)            = {P(B):.4f}")
print(f"P(A and B)      = {P(inter):.4f}")
print(f"P(A or  B) count= {P(union):.4f}")
addition = P(A) + P(B) - P(inter)               # EQ S1.2
print(f"P(A or  B) rule = {addition:.4f}   <- matches the direct count")
print(f"P(not A)        = {1 - P(A):.4f}   <- complement rule")
1.2

Conditional probability & independence

The single most important operation in the subject is conditioning: revising a probability once you learn that some event has occurred. Learning that \(B\) happened shrinks your world from all of \(\Omega\) down to just \(B\), and you renormalize so the new, smaller world again has total probability one.

EQ S1.4 — CONDITIONAL PROBABILITY $$ P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 $$
Read aloud: "the probability of \(A\) given \(B\)." You keep only the part of \(A\) that lives inside \(B\) — the numerator — and rescale by the size of the new universe \(B\). Conditioning is the engine of all learning from evidence: every belief update, every diagnosis, every filter is some instance of this one ratio.

Rearranging EQ S1.4 gives the multiplication rule for the joint probability of two events, and applying it across a partition gives a way to assemble a total probability from its conditional pieces:

EQ S1.5 — CHAIN RULE & LAW OF TOTAL PROBABILITY $$ P(A \cap B) = P(A \mid B)\,P(B), \qquad\qquad P(A) = \sum_i P(A \mid B_i)\,P(B_i) $$
Left: a joint probability factors into a marginal times a conditional. Right: if the \(B_i\) partition \(\Omega\) (mutually exclusive, collectively exhaustive), the overall chance of \(A\) is a weighted average of its chances within each slice, weighted by how likely each slice is. This averaging step is exactly the denominator of Bayes' theorem in §1.3.
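
Both halves of EQ S1.5 can be checked by counting on a fair die. The sketch below is illustrative only: the three-slice partition and the helper names `prob` and `cond` are assumptions made for this example, and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                      # fair die


def prob(event):
    """Classical probability |event| / |omega| (EQ S1.3)."""
    return Fraction(len(event & omega), len(omega))


def cond(a, b):
    """P(a | b) = P(a and b) / P(b) (EQ S1.4); requires P(b) > 0."""
    return prob(a & b) / prob(b)


A = {2, 4, 6}                                    # event: roll is even
partition = [{1, 2}, {3, 4, 5}, {6}]             # hypothetical slices B_1, B_2, B_3

# Chain rule: P(A and B_i) = P(A | B_i) * P(B_i) for each slice
joint = [cond(A, B) * prob(B) for B in partition]
total = sum(joint)                               # law of total probability

for B, j in zip(partition, joint):
    print(f"B = {sorted(B)}: P(A|B) = {cond(A, B)}, P(B) = {prob(B)}, product = {j}")
print(f"sum over slices = {total}, direct P(A) = {prob(A)}")
```

The slices are deliberately unequal in size, so the conditional probabilities differ from slice to slice, yet the weighted sum still recovers \(P(A) = 1/2\).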

Two events are independent when knowing one tells you nothing about the other — conditioning leaves the probability unchanged, \(P(A \mid B) = P(A)\). Substituting into the chain rule gives the cleaner, symmetric test:

EQ S1.6 — INDEPENDENCE $$ A \perp B \quad\Longleftrightarrow\quad P(A \cap B) = P(A)\,P(B) $$
Independence means the joint factors into the product of marginals. It is a property of the probabilities, not of the physical situation — two events can be independent under one distribution and dependent under another. A frequent trap: mutually exclusive events with positive probability are the opposite of independent — if \(A\) rules out \(B\), then learning \(A\) tells you \(B\) cannot happen, so \(P(B \mid A) = 0 \ne P(B)\).
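
The product test in EQ S1.6 is easy to run by enumeration. A minimal sketch on a fair die, with event names chosen here for illustration: one independent pair, one dependent pair, and one mutually exclusive pair.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))                   # fair die


def prob(event):
    """Classical probability |event| / |omega| (EQ S1.3)."""
    return Fraction(len(event & omega), len(omega))


def independent(a, b):
    """EQ S1.6: the joint equals the product of the marginals."""
    return prob(a & b) == prob(a) * prob(b)


even = {2, 4, 6}
high = {4, 5, 6}            # roll > 3
at_most_4 = {1, 2, 3, 4}
one, two = {1}, {2}         # mutually exclusive, each with positive probability

print(independent(even, at_most_4))  # True:  1/3 == 1/2 * 2/3
print(independent(even, high))       # False: 1/3 != 1/2 * 1/2
print(independent(one, two))         # False: 0   != 1/6 * 1/6
```

The last line is the trap in code form: disjoint events with positive probability fail the product test because their joint probability is zero.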

Conditioning is not symmetric. \(P(A \mid B)\) and \(P(B \mid A)\) are different numbers in general — most rain comes with clouds, but most clouds bring no rain. Confusing the two is the prosecutor's fallacy (§1.5), and correcting it is precisely what Bayes' theorem does.

Roll a fair die. Given that the result is even (\(B = \{2,4,6\}\)), what is the probability it is greater than 3 (\(A = \{4,5,6\}\))? Compute \( P(A \mid B) \).
\( A \cap B = \{4, 6\} \), so \( P(A \cap B) = 2/6 \) and \( P(B) = 3/6 \). By EQ S1.4, \( P(A \mid B) = \dfrac{2/6}{3/6} = \dfrac{2}{3} \approx \) 0.667. For contrast, the unconditional \(P(A) = 3/6 = 0.5\): conditioning on "even" raises the probability from 0.5 to 0.667, because the even faces lean high.
INSTRUMENT S1.3 — CONDITIONAL-PROBABILITY EXPLORER · VENN ⇄ TREE · EQ S1.4–S1.6 · READOUTS: P(A ∪ B), P(A | B), P(B | A), INDEPENDENT?
The two circles are events \(A\) and \(B\); their overlap is \(P(A \cap B)\). Drag the overlap slider toward \(P(A)\,P(B)\) and the verdict flips to INDEPENDENT — that is the exact point where conditioning stops changing anything (\(P(A\mid B) = P(A)\)). Push the overlap to zero and the events become mutually exclusive: \(P(A\mid B) = 0\), the opposite of independent. The slider is clamped so the overlap can never exceed either circle — an impossible probability the axioms forbid.
1.3

Bayes' theorem — inverting the condition

We can usually measure \(P(\text{evidence} \mid \text{cause})\) — how often a disease produces a positive test, how often spam contains the word "free." But what we want is the reverse: \(P(\text{cause} \mid \text{evidence})\) — given a positive test, how likely is the disease? Bayes' theorem is the bridge. Start from the symmetry of the chain rule, \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), and solve for the conditional you don't have:

EQ S1.7 — BAYES' THEOREM $$ P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad P(E) = P(E\mid H)P(H) + P(E\mid H^c)P(H^c) $$
\(H\) is a hypothesis, \(E\) the evidence. \(P(H)\) is the prior (belief before seeing \(E\)); \(P(E\mid H)\) the likelihood (how well \(H\) predicts \(E\)); \(P(H\mid E)\) the posterior (belief after). The denominator \(P(E)\) — the total probability of the evidence under all hypotheses, from EQ S1.5 — is just the normalizer that makes the posteriors sum to one. Prior, scaled by how well the data fit, renormalized: that is all learning is.

The structure is clearer in odds form, which strips away the shared denominator. The posterior odds are the prior odds multiplied by the likelihood ratio — how much more probable the evidence is under \(H\) than under its negation:

EQ S1.8 — BAYES IN ODDS FORM $$ \underbrace{\frac{P(H \mid E)}{P(H^c \mid E)}}_{\text{posterior odds}} \;=\; \underbrace{\frac{P(H)}{P(H^c)}}_{\text{prior odds}} \;\times\; \underbrace{\frac{P(E \mid H)}{P(E \mid H^c)}}_{\text{likelihood ratio}} $$
Evidence enters as a multiplier on your odds: a likelihood ratio of 1 leaves belief untouched; 10 multiplies your odds tenfold; \(\tfrac{1}{10}\) divides them by ten. This form makes the central lesson of §1.5 visible at a glance: a strong test (large likelihood ratio) applied to a rare hypothesis (tiny prior odds) can still leave the posterior small. The multiplier is powerful, but it multiplies a number that started near zero.
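
A short sketch of EQ S1.8 in code, using the numbers of this section's disease-test example (prevalence 1%, sensitivity 99%, specificity 95%). The function name `posterior_from_odds` is an assumption for illustration; exact fractions make the odds readable.

```python
from fractions import Fraction


def posterior_from_odds(prior, lik_h, lik_not_h):
    """EQ S1.8: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1 - prior)
    likelihood_ratio = lik_h / lik_not_h
    post_odds = prior_odds * likelihood_ratio
    return post_odds, post_odds / (1 + post_odds)   # (odds, probability)


prior = Fraction(1, 100)        # P(D): prevalence
sens = Fraction(99, 100)        # P(+ | D)
fpr = Fraction(5, 100)          # P(+ | not D) = 1 - specificity

odds, post = posterior_from_odds(prior, sens, fpr)
print(f"prior odds       = {prior / (1 - prior)}")   # 1/99
print(f"likelihood ratio = {sens / fpr}")            # 99/5
print(f"posterior odds   = {odds}")                  # 1/5
print(f"posterior prob   = {post}")                  # 1/6
```

A likelihood ratio of nearly 20 sounds decisive, but it starts from odds of 1 to 99, so the posterior odds land at only 1 to 5.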
THE BASE RATE

Why a 99%-sensitive test can be wrong most of the time it fires. Take a disease that afflicts 1 person in 100 and a test with 99% sensitivity and 95% specificity. Of 10,000 people, ~100 are sick and ~99 of them test positive correctly; of the 9,900 healthy, 5% (about 495) test positive falsely. A positive result therefore points to a sick person only \(99/(99 + 495) \approx 17\%\) of the time. The test is excellent; the disease is rarer than the test's false-positive rate, and the rarity dominates. Conditioning forces you to confront the base rate the headline accuracy hides.

A disease has prevalence \(P(D) = 0.01\). A test has sensitivity \(P(+\mid D) = 0.99\) and specificity \(P(-\mid D^c) = 0.95\) (so the false-positive rate is \(0.05\)). You test positive. What is \( P(D \mid +) \)?
By EQ S1.7: numerator \( = P(+\mid D)P(D) = 0.99 \times 0.01 = 0.0099\). Denominator \( = 0.0099 + P(+\mid D^c)P(D^c) = 0.0099 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594\). So \( P(D \mid +) = 0.0099 / 0.0594 = \) 0.167 — about one in six. A near-perfect test on a rare disease still leaves five of six positives healthy.
INSTRUMENT S1.1 — BAYES-BOX DISEASE-TEST CALCULATOR · EQ S1.7 · LIVE · 10,000-PERSON COHORT · READOUTS: P(DISEASE | POSITIVE), TRUE POS · FALSE POS, P(HEALTHY | NEGATIVE)
Each cell of the bar is the 10,000-person cohort split into true/false positives and negatives. At the defaults — prevalence 1%, sensitivity 99%, specificity 95% — a positive result means disease only ~16.7% of the time: the base-rate trap, made of red false positives swamping the green true ones. Now drag prevalence up to 20% and watch the posterior leap past 80% — the same test, a different population. The lesson the headline accuracy hides: a test's worth depends on who you give it to.
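
The prevalence sweep described above can be reproduced without the slider. A minimal sketch, assuming the instrument's default sensitivity and specificity; the function name `posterior` and the chosen prevalence grid are illustrative.

```python
def posterior(prev, sens=0.99, spec=0.95):
    """P(D | +) from EQ S1.7 for a given prevalence."""
    true_pos = sens * prev                   # P(+ | D) P(D)
    false_pos = (1 - spec) * (1 - prev)      # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)


for prev in (0.001, 0.01, 0.05, 0.20, 0.50):
    print(f"prevalence {prev:6.1%} -> P(D | +) = {posterior(prev):.3f}")
```

The test never changes across the rows; only the population does, and the posterior moves from a few percent to well above 80%.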
PYTHON · RUNNABLE IN-BROWSER
# Monte-Carlo a conditional probability and check it against exact Bayes
import numpy as np
rng = np.random.default_rng(0)

prev, sens, spec = 0.01, 0.99, 0.95            # prevalence, sensitivity, specificity
N = 2_000_000

disease  = rng.random(N) < prev                                  # who is actually sick
positive = np.where(disease, rng.random(N) < sens,              # sick  -> true positive
                              rng.random(N) < (1 - spec))        # well  -> false positive

mc = disease[positive].mean()                  # P(D | +) by simulation
exact = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))  # EQ S1.7
print(f"positives observed     : {positive.sum():,} of {N:,}")
print(f"P(D | +)  Monte-Carlo  : {mc:.4f}")
print(f"P(D | +)  exact (Bayes): {exact:.4f}")
print(f"gap                    : {abs(mc - exact):.4f}")
print(f"\nbase-rate trap: a 99%/95% test is right only {exact*100:.1f}% of the time it fires.")
1.4

Random variables, expectation & variance

A random variable \(X\) is a function that attaches a number to each outcome — the value rolled, the count of heads in ten flips, tomorrow's return. It lets us do arithmetic with chance. The two numbers that summarize a random variable are its center of mass and its spread.

The expectation (mean) is the probability-weighted average of the values \(X\) can take — the long-run average if you repeated the experiment forever:

EQ S1.9 — EXPECTATION $$ \mathbb{E}[X] \;=\; \sum_x x\,P(X = x) \quad\text{(discrete)}, \qquad \mathbb{E}[X] = \int x\,f(x)\,\mathrm{d}x \quad\text{(continuous)} $$
Expectation is linear no matter what: \(\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]\), even when \(X\) and \(Y\) are dependent — a fact used constantly and far less restrictive than it looks. Note the expected value need not be an attainable outcome: a fair die's mean is 3.5, a face it can never show.
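
Linearity under dependence is worth seeing once with exact numbers. In the sketch below, \(Y = X^2\) is completely determined by the die roll \(X\), so the two are as dependent as they can be; the helper `expect` is an assumption for illustration.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)                       # fair die: each face equally likely


def expect(g):
    """E[g(X)] for one fair die roll X (discrete case of EQ S1.9)."""
    return sum(g(x) * p for x in faces)


E_X = expect(lambda x: x)                     # E[X]   = 7/2
E_Y = expect(lambda x: x * x)                 # E[X^2] = 91/6
lhs = expect(lambda x: 2 * x + 3 * x * x)     # E[2X + 3Y] computed directly
rhs = 2 * E_X + 3 * E_Y                       # linearity of expectation

print(f"E[X] = {E_X}, E[Y] = {E_Y}")
print(f"E[2X + 3Y] direct = {lhs}, by linearity = {rhs}")
```

Both routes give \(105/2\), even though \(X\) and \(Y\) are far from independent.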

The variance measures how far values typically stray from the mean — the expected squared deviation. Its square root, the standard deviation, restores the original units:

EQ S1.10 — VARIANCE $$ \mathrm{Var}(X) \;=\; \mathbb{E}\!\big[(X - \mathbb{E}[X])^2\big] \;=\; \mathbb{E}[X^2] - \big(\mathbb{E}[X]\big)^2 $$
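The right-hand form ("mean of the square minus the square of the mean") is the one you actually compute. Variance is not linear: \(\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)\), and in general \(\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)\), where \(\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]\) is the covariance. The cross term vanishes when \(X\) and \(Y\) are uncorrelated, which independence (\(X \perp Y\)) guarantees, so the simple sum \(\mathrm{Var}(X) + \mathrm{Var}(Y)\) is safe for independent variables. That additivity is what makes the average of \(n\) independent samples, each with variance \(\sigma^2\), have variance \(\sigma^2/n\): the mathematical reason averaging reduces noise.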
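
The scaling and additivity behaviour of variance can be verified by enumeration on fair dice. A minimal sketch, with the helper `var` written for this example: it compares a scaled die, the sum of two independent dice, and the "sum" of a die with itself.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)


def var(values):
    """Variance of a uniform distribution over `values` via EQ S1.10."""
    n = len(values)
    mean = Fraction(sum(values), n)
    mean_sq = Fraction(sum(v * v for v in values), n)
    return mean_sq - mean ** 2


one_die = list(faces)
scaled = [3 * x for x in faces]                          # 3X
two_indep = [x + y for x, y in product(faces, faces)]    # X + Y, independent dice
same_die = [x + x for x in faces]                        # X + X, fully dependent

print(var(one_die))      # 35/12
print(var(scaled))       # 9 * 35/12 = 105/4
print(var(two_indep))    # 35/12 + 35/12 = 35/6
print(var(same_die))     # 4 * 35/12 = 35/3, not 35/6
```

Two independent dice add their variances; a die added to itself doubles the value and so quadruples the variance.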

That last fact underpins the law of large numbers: as you collect more independent samples, the variance of their running average shrinks toward zero, so the average converges to the true expectation. Probability, defined abstractly by Kolmogorov, finally reconnects to the frequentist intuition of "long-run frequency": the relative frequency of an event converges to its probability.

EQ S1.11 — LAW OF LARGE NUMBERS $$ \bar{X}_n \;=\; \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow[n \to \infty]{}\; \mathbb{E}[X] $$
The sample mean of i.i.d. draws converges to the population mean. The convergence is slow: the spread of \(\bar{X}_n\) shrinks like \(1/\sqrt{n}\), so cutting your error in half takes four times the data. This \(\sqrt{n}\) rate governs the width of every confidence interval and the cost of every Monte-Carlo estimate (and is why the Monte-Carlo simulation in §1.3 draws two million samples to reach roughly three-decimal accuracy).
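
Both claims, convergence and the \(1/\sqrt{n}\) rate, can be checked numerically. A sketch with fair-die rolls; the sample sizes, repeat count, and the helper `spread_of_mean` are arbitrary choices for illustration, and the printed numbers depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Running average of fair-die rolls drifts toward E[X] = 3.5
rolls = rng.integers(1, 7, size=100_000)         # upper bound is exclusive
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: running mean = {running_mean[n - 1]:.4f}")


def spread_of_mean(n, repeats=2_000):
    """Standard deviation of the sample mean across many repeats of size n."""
    samples = rng.integers(1, 7, size=(repeats, n))
    return samples.mean(axis=1).std()


# Quadrupling n should roughly halve the spread of the sample mean
s_n, s_4n = spread_of_mean(400), spread_of_mean(1_600)
print(f"std of mean at n = 400  : {s_n:.4f}")
print(f"std of mean at n = 1600 : {s_4n:.4f}")
print(f"ratio                   : {s_n / s_4n:.2f}   (theory: 2)")
```

For comparison, theory predicts a spread of \(\sqrt{35/12}/\sqrt{400} \approx 0.085\) at \(n = 400\).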
Let \(X\) be the result of one roll of a fair six-sided die. Compute the expectation \( \mathbb{E}[X] \) using EQ S1.9.
Each face has probability \(1/6\): \( \mathbb{E}[X] = \tfrac{1}{6}(1+2+3+4+5+6) = \tfrac{21}{6} = \) 3.5. The mean is exactly halfway between 3 and 4 — a value the die can never actually show, which is fine: expectation is a balance point, not an outcome.
For the same fair die, compute the variance \( \mathrm{Var}(X) \) using EQ S1.10. (Note \( \mathbb{E}[X^2] = \tfrac{1}{6}(1+4+9+16+25+36) = \tfrac{91}{6} \).)
\( \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \tfrac{91}{6} - 3.5^2 = 15.1\overline{6} - 12.25 = \tfrac{35}{12} \approx \) 2.917. The standard deviation is \( \sqrt{35/12} \approx 1.71 \), a natural "typical distance from 3.5."
INSTRUMENT S1.2 — LAW-OF-LARGE-NUMBERS SIMULATOR · EQ S1.11 · RUNNING AVERAGE → E[X] · READOUTS: SAMPLES n, RUNNING MEAN X̄ₙ, TRUE E[X]
The mint line is the running average \(\bar{X}_n\); the dashed line is the true mean (0.5 for the coin, 3.5 for the die). Press RUN and watch the early average lurch wildly, then settle — the convergence visibly slows because error shrinks only as \(1/\sqrt{n}\). A meaningful baseline is drawn before you touch anything: the first ~120 samples render on load. This is the bridge from Kolmogorov's abstract \(P\) back to "long-run frequency."
PYTHON · RUNNABLE IN-BROWSER
# Monty Hall: simulate stay vs switch and print the win rates
import numpy as np
rng = np.random.default_rng(0)
N = 200_000

car   = rng.integers(0, 3, N)            # door hiding the car (0,1,2)
pick  = rng.integers(0, 3, N)            # contestant's first pick

# stay wins exactly when the first pick was already the car
stay_wins = (pick == car)
# the host opens a goat door; switching wins whenever staying loses
switch_wins = ~stay_wins

print(f"trials              : {N:,}")
print(f"P(win | STAY)       : {stay_wins.mean():.4f}   (theory 1/3 = 0.3333)")
print(f"P(win | SWITCH)     : {switch_wins.mean():.4f}   (theory 2/3 = 0.6667)")
print(f"switching advantage : {switch_wins.mean() / stay_wins.mean():.2f}x")
print("\nWhy: your first pick is right 1/3 of the time, so the OTHER unopened")
print("door carries the remaining 2/3 once the host reveals a goat. Conditioning,")
print("not intuition. (Law of large numbers: rates lock in as N grows.)")
1.5

Pitfalls — base rates & the prosecutor's fallacy

Probability's hardest errors are not algebraic — they are interpretive. The mind reaches for the wrong conditional, ignores the denominator, or forgets how rare the thing it is reasoning about really is. Three traps cause most real-world damage.

Base-rate neglect

The §1.3 disease test is the canonical case: people quote a 99% test and conclude a positive result means 99% chance of disease, forgetting the 1% prevalence that makes false positives outnumber true ones. The base rate is the prior in Bayes' theorem. Dropping it from the odds form (EQ S1.8) amounts to setting the prior odds to 1, that is \(P(H) = P(H^c)\): assuming the disease is as common as health. Whenever someone reports \(P(E \mid H)\) without a base rate, the number cannot be read as \(P(H \mid E)\).

The prosecutor's fallacy

This is the confusion of \(P(E \mid H)\) with \(P(H \mid E)\) dressed in a courtroom. A forensic match has a one-in-a-million random-match probability: \(P(\text{match} \mid \text{innocent}) = 10^{-6}\). The prosecutor declares the chance the defendant is innocent is therefore one in a million — but that swaps the conditional. The quantity that matters is \(P(\text{innocent} \mid \text{match})\), and by Bayes it depends on the suspect pool. In a city of 10 million, roughly 10 innocent people also match by chance; with one true source, a bare match makes the defendant only ~1-in-11 likely to be the source. The likelihood ratio is enormous, but multiplied against a tiny prior, the posterior is far from certainty.
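
The courtroom arithmetic is a direct application of EQ S1.7. The sketch below makes its assumptions explicit: exactly one true source who always matches, and a uniform prior over everyone in the city. Real cases rarely justify a uniform prior; the function name and parameters are hypothetical, chosen only to mirror the numbers in the text.

```python
def p_source_given_match(population, random_match_prob):
    """P(source | match), assuming one true source who always matches
    and a uniform prior over the whole population."""
    prior = 1 / population                                   # P(source)
    p_match = 1.0 * prior + random_match_prob * (1 - prior)  # EQ S1.5
    return 1.0 * prior / p_match                             # EQ S1.7


post = p_source_given_match(population=10_000_000, random_match_prob=1e-6)
print(f"P(match | innocent) = {1e-6:.0e}")
print(f"P(source | match)   = {post:.4f}   (about 1 in {1 / post:.0f})")
```

The two printed numbers are the two conditionals the fallacy swaps: one in a million versus roughly one in eleven.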

EQ S1.12 — THE FALLACY, STATED EXACTLY $$ P(E \mid H) \;\ne\; P(H \mid E), \qquad\text{related only through}\quad P(H \mid E) = P(E \mid H)\,\frac{P(H)}{P(E)} $$
The two conditionals differ by the factor \(P(H)/P(E)\) — exactly the prior-over-evidence ratio that base-rate neglect throws away. They coincide only when \(P(H) = P(E)\), a coincidence, never a rule. "The evidence is unlikely if innocent" is not "innocence is unlikely given the evidence." Real convictions (and acquittals) have turned on this single transposition.
PITFALLS

The recurring shapes of error: (1) base-rate neglect, quoting \(P(E\mid H)\) and ignoring how rare \(H\) is; (2) the prosecutor's fallacy, reading \(P(E\mid H)\) as \(P(H\mid E)\); (3) the conjunction fallacy, judging \(P(A \cap B) > P(A)\), which is impossible since an intersection can never be more probable than either of its parts (the "Linda is a bank teller and a feminist" experiment); (4) the gambler's fallacy, believing independent trials "are due" to correct, when by EQ S1.6 past flips of a fair coin say nothing about the next one.
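
The gambler's fallacy is the easiest of these to test by simulation. A minimal sketch, assuming a fair coin and a streak length of five (both arbitrary choices): it estimates the chance of heads immediately after five heads in a row.

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.random(1_000_000) < 0.5          # True = heads, fair coin
k = 5                                        # streak length to condition on

# streak[i] is True when flips[i], ..., flips[i + k - 1] are all heads
streak = np.ones(flips.size - k, dtype=bool)
for j in range(k):
    streak &= flips[j : flips.size - k + j]

after_streak = flips[k:][streak]             # the flip right after each streak

print(f"streaks of {k} heads found  : {after_streak.size:,}")
print(f"P(heads | {k} heads before) : {after_streak.mean():.4f}")
print(f"P(heads) overall           : {flips.mean():.4f}")
```

Under independence the two estimated rates should agree up to sampling noise: a streak does not make tails "due".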

The unifying diagnosis. Every one of these is a failure to condition correctly: to track which event is given, which is uncertain, and what the base rates are. Bayes' theorem is not just a formula; it is a discipline that makes these errors far harder to commit once you actually write the ratio down. That is why §1.3 is the spine of this chapter and of everything statistical that follows.

NEXT

We have treated probabilities of events and the mean and spread of a random variable in the abstract. The next chapter gives those random variables names and shapes: the Bernoulli, binomial, Poisson, normal, and exponential distributions — the recurring "characters" of uncertainty — plus the central limit theorem that explains why the bell curve appears everywhere a sum or an average does.

1.R

References

  1. Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). The axiomatic foundation of EQ S1.1; measure-theoretic probability.
  2. Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Phil. Trans. R. Soc. — the original statement of EQ S1.7.
  3. Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC. Harvard Stat 110 — conditioning, Bayes, expectation, LLN; free course materials.
  4. Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review 90(4) — the Linda problem (§1.5).
  5. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Probability as extended logic — the Bayesian reading of §1.1 and §1.3.