Sample spaces, events & Kolmogorov's axioms
Probability begins by naming everything that could happen. The sample space \(\Omega\) is the set of all possible outcomes of an experiment: for one die roll \(\Omega = \{1,2,3,4,5,6\}\); for a coin flip \(\Omega = \{H, T\}\). An event is any subset of \(\Omega\) — "the roll is even" is the event \(\{2,4,6\}\). Probability is then a single function that assigns each event a number between 0 and 1, measuring how much of the sample space it occupies.
In 1933 Andrei Kolmogorov reduced the entire subject to three rules. Every theorem in this chapter — every theorem in probability — is a consequence of just these:
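In standard notation, for a probability function \(P\) defined on the events of \(\Omega\), the three axioms are usually stated as below. This is the textbook form, presumably what the chapter labels EQ S1.1; the descriptive names on the left are added here for orientation.

\[
\begin{aligned}
&\text{(1) Non-negativity:} && P(A) \ge 0 \ \text{ for every event } A, \\
&\text{(2) Normalization:} && P(\Omega) = 1, \\
&\text{(3) Additivity:} && P\Big(\bigcup_{i} A_i\Big) = \sum_{i} P(A_i) \ \text{ for pairwise disjoint events } A_1, A_2, \dots
\end{aligned}
\]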
The third axiom only adds probabilities when events cannot happen together. When they can overlap, naively summing double-counts the intersection, so you subtract it back out:
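Written out, this is the two-event inclusion-exclusion identity (the code later in this section cites it as EQ S1.2). The second line checks it on a fair die with \(A\) = "even" and \(B\) = "greater than 3", so that \(A \cap B = \{4, 6\}\).

\[
P(A \cup B) = P(A) + P(B) - P(A \cap B)
\]
\[
\tfrac{3}{6} + \tfrac{3}{6} - \tfrac{2}{6} = \tfrac{4}{6} = P(\{2, 4, 5, 6\})
\]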
For a finite, equally-likely sample space — fair dice, shuffled cards, balanced coins — every outcome carries weight \(1/|\Omega|\), and the probability of an event reduces to counting:
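In symbols, this is the classical definition (cited in the code below as EQ S1.3). The die example on the right is an illustration using the events already introduced.

\[
P(A) = \frac{|A|}{|\Omega|}, \qquad \text{e.g. } P(\text{roll is even}) = \frac{|\{2, 4, 6\}|}{|\{1, 2, 3, 4, 5, 6\}|} = \frac{3}{6} = \frac{1}{2}.
\]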
Frequentist vs. Bayesian — the honest caveat. The axioms say what a probability function must obey; they are silent on what a probability means. One school reads \(P(A)\) as a long-run frequency — the fraction of times \(A\) occurs in endless repetitions (§1.4 makes this precise). The other reads it as a degree of belief that can be updated by evidence (§1.3). Both satisfy EQ S1.1 identically, which is why the two camps share every equation and disagree only on interpretation. This chapter uses whichever lens is clearer and flags the switch.
# Axioms by counting: a fair die, the event "even", and the addition rule
import numpy as np
omega = np.array([1, 2, 3, 4, 5, 6]) # sample space
A = omega[omega % 2 == 0] # event A: roll is even {2,4,6}
B = omega[omega > 3] # event B: roll > 3 {4,5,6}
P = lambda S: len(S) / len(omega) # EQ S1.3: classical definition
inter = np.intersect1d(A, B) # A and B -> {4, 6}
union = np.union1d(A, B) # A or B -> {2,4,5,6}
print(f"P(A) = {P(A):.4f}")
print(f"P(B) = {P(B):.4f}")
print(f"P(A and B) = {P(inter):.4f}")
print(f"P(A or B) count= {P(union):.4f}")
addition = P(A) + P(B) - P(inter) # EQ S1.2
print(f"P(A or B) rule = {addition:.4f} <- matches the direct count")
print(f"P(not A) = {1 - P(A):.4f} <- complement rule")
Conditional probability & independence
The single most important operation in the subject is conditioning: revising a probability once you learn that some event has occurred. Learning that \(B\) happened shrinks your world from all of \(\Omega\) down to just \(B\), and you renormalize so the new, smaller world again has total probability one.
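The renormalization described above is usually written as the definition below, which the next paragraph refers to as EQ S1.4. The second line is an illustrative check on a fair die, with \(A\) = "even" and \(B\) = "greater than 3".

\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
\]
\[
P(\text{even} \mid \text{roll} > 3) = \frac{P(\{4, 6\})}{P(\{4, 5, 6\})} = \frac{2/6}{3/6} = \frac{2}{3}
\]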
Rearranging EQ S1.4 gives the multiplication rule (also called the chain rule) for the joint probability of two events, and applying it across a partition gives a way to assemble a total probability from its conditional pieces:
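In standard form, the two results read as follows. The partition \(B_1, \dots, B_n\) is assumed to consist of disjoint events that cover \(\Omega\), each with \(P(B_i) > 0\); the notation is the usual textbook one and may differ from the chapter's own display.

\[
P(A \cap B) = P(A \mid B)\, P(B)
\]
\[
P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
\]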
Two events are independent when knowing one tells you nothing about the other — conditioning leaves the probability unchanged, \(P(A \mid B) = P(A)\). Substituting into the chain rule gives the cleaner, symmetric test:
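The symmetric test is the product rule \(P(A \cap B) = P(A)\,P(B)\). A minimal sketch of checking it by exact counting on a fair die follows; the helper names and the event "roll is at most 2" are illustrative choices, not taken from the text.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # fair die, equally likely outcomes


def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event & omega), len(omega))


def independent(a, b):
    """Product test: independent exactly when P(a and b) == P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)


even = {2, 4, 6}
at_most_two = {1, 2}            # illustrative event, not from the text
greater_than_three = {4, 5, 6}

# P({2}) = 1/6 equals 1/2 * 1/3, so these two are independent
print("even vs at most 2 :", independent(even, at_most_two))
# P({4, 6}) = 1/3 differs from 1/2 * 1/2, so these two are dependent
print("even vs > 3       :", independent(even, greater_than_three))
```

Using `Fraction` keeps the comparison exact, so the equality test is not affected by floating-point rounding.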
Conditioning is not symmetric. \(P(A \mid B)\) and \(P(B \mid A)\) are different numbers in general — most rain comes with clouds, but most clouds bring no rain. Confusing the two is the prosecutor's fallacy (§1.5), and correcting it is precisely what Bayes' theorem does.
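The asymmetry is easy to see by counting. The sketch below uses a fair die with the illustrative events "roll is a six" and "roll is even" (chosen here, not in the original) and computes both conditionals exactly.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # fair die


def prob(event):
    """Classical probability by counting."""
    return Fraction(len(event), len(omega))


def cond(a, b):
    """P(a | b) = P(a and b) / P(b); assumes P(b) > 0."""
    return prob(a & b) / prob(b)


six = {6}
even = {2, 4, 6}

p_six_given_even = cond(six, even)   # one of the three even faces is a six
p_even_given_six = cond(even, six)   # a six is always even

print("P(six | even) =", p_six_given_even)
print("P(even | six) =", p_even_given_six)
```

The two numbers share the same numerator, \(P(A \cap B)\), and differ only in which event's probability sits in the denominator.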
Bayes' theorem — inverting the condition
We can usually measure \(P(\text{evidence} \mid \text{cause})\) — how often a disease produces a positive test, how often spam contains the word "free." But what we want is the reverse: \(P(\text{cause} \mid \text{evidence})\) — given a positive test, how likely is the disease? Bayes' theorem is the bridge. Start from the symmetry of the chain rule, \(P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)\), and solve for the conditional you don't have:
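Solving for \(P(A \mid B)\) gives the standard statement of Bayes' theorem. The second equality expands \(P(B)\) by the law of total probability over \(A\) and its complement \(A^c\), which is the form the simulation code later in this section labels EQ S1.7.

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid A^c)\, P(A^c)}, \qquad P(B) > 0
\]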
The structure is clearer in odds form, which strips away the shared denominator. The posterior odds are the prior odds multiplied by the likelihood ratio — how much more probable the evidence is under \(H\) than under its negation:
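In symbols, the odds form reads \(\frac{P(H \mid E)}{P(H^c \mid E)} = \frac{P(H)}{P(H^c)} \cdot \frac{P(E \mid H)}{P(E \mid H^c)}\). The sketch below applies it to a diagnostic test with 1% prevalence, 99% sensitivity and 95% specificity, the same numbers as the worked example that follows; the variable names are illustrative.

```python
from fractions import Fraction

prev = Fraction(1, 100)    # P(H): prevalence
sens = Fraction(99, 100)   # P(E | H): sensitivity
spec = Fraction(95, 100)   # P(not E | not H): specificity

prior_odds = prev / (1 - prev)                     # 1 : 99
likelihood_ratio = sens / (1 - spec)               # 99/5, i.e. 19.8
posterior_odds = prior_odds * likelihood_ratio     # 1 : 5
posterior = posterior_odds / (1 + posterior_odds)  # convert odds to probability

print("prior odds       :", prior_odds)
print("likelihood ratio :", likelihood_ratio)
print("posterior odds   :", posterior_odds)
print("P(H | E)         :", posterior, "=", float(posterior))
```

A likelihood ratio near 20 is strong evidence, yet it only lifts odds of 1 to 99 up to 1 to 5, which is a probability of 1/6.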
Why a test with 99% sensitivity can be wrong most of the time it fires. Take a disease that afflicts 1 person in 100 and a test with 99% sensitivity and 95% specificity. Of 10,000 people, about 100 are sick and about 99 of them test positive correctly; of the 9,900 healthy, 5% (about 495) test positive falsely. A positive result therefore points to a sick person only \(99/(99 + 495) \approx 17\%\) of the time. The test is good; the 1% prevalence is smaller than the test's 5% false-positive rate, and the rarity dominates. Conditioning forces you to confront the base rate the headline accuracy hides.
# Monte-Carlo a conditional probability and check it against exact Bayes
import numpy as np
rng = np.random.default_rng(0)
prev, sens, spec = 0.01, 0.99, 0.95 # prevalence, sensitivity, specificity
N = 2_000_000
disease = rng.random(N) < prev # who is actually sick
positive = np.where(disease, rng.random(N) < sens, # sick -> true positive
rng.random(N) < (1 - spec)) # well -> false positive
mc = disease[positive].mean() # P(D | +) by simulation
exact = (sens * prev) / (sens * prev + (1 - spec) * (1 - prev)) # EQ S1.7
print(f"positives observed : {positive.sum():,} of {N:,}")
print(f"P(D | +) Monte-Carlo : {mc:.4f}")
print(f"P(D | +) exact (Bayes): {exact:.4f}")
print(f"gap : {abs(mc - exact):.4f}")
print(f"\nbase-rate trap: a 99%/95% test is right only {exact*100:.1f}% of the time it fires.")
Random variables, expectation & variance
A random variable \(X\) is a function that attaches a number to each outcome — the value rolled, the count of heads in ten flips, tomorrow's return. It lets us do arithmetic with chance. The two numbers that summarize a random variable are its center of mass and its spread.
The expectation (mean) is the probability-weighted average of the values \(X\) can take — the long-run average if you repeated the experiment forever:
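For a discrete random variable the definition is the sum below; this is the standard form, stated under the assumption that \(X\) takes countably many values. The second line evaluates it for one roll of a fair die.

\[
E[X] = \sum_{x} x \, P(X = x)
\]
\[
E[\text{die}] = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5
\]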
The variance measures how far values typically stray from the mean — the expected squared deviation. Its square root, the standard deviation, restores the original units:
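In the usual notation, with \(\mu = E[X]\), the definitions are as follows. The last line is the limit discussed in the next paragraph; it is stated here for independent, identically distributed samples \(X_1, X_2, \dots\) with finite mean, which is an assumption about the setting rather than a quotation from the chapter.

\[
\operatorname{Var}(X) = E\big[(X - \mu)^2\big] = E[X^2] - \mu^2, \qquad \sigma = \sqrt{\operatorname{Var}(X)}
\]
\[
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \;\longrightarrow\; E[X] \quad \text{as } n \to \infty
\]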
The law of large numbers closes the loop: as you collect more independent samples, their running average converges to the true expectation. Probability, defined abstractly by Kolmogorov, thereby reconnects to the frequentist intuition of "long-run frequency": for independent repetitions, the relative frequency of an event converges (with probability one) to its probability.
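A short sketch of both ideas on a fair die: the exact mean and variance computed from the definitions, then a running average that drifts toward the mean as the sample grows. The seed, the sample size and the checkpoints are arbitrary choices made for this illustration.

```python
import random
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each face equally likely

# exact values from the definitions
mean = sum(x * p for x in faces)                    # probability-weighted average
variance = sum((x - mean) ** 2 * p for x in faces)  # expected squared deviation

print("E[X]   =", mean, "=", float(mean))
print("Var(X) =", variance, "=", round(float(variance), 4))

# running average of simulated rolls at a few checkpoints
rng = random.Random(0)
checkpoints = (10, 1_000, 100_000)
running = {}
total = 0
for n in range(1, max(checkpoints) + 1):
    total += rng.randint(1, 6)
    if n in checkpoints:
        running[n] = total / n

for n, avg in running.items():
    print(f"n = {n:>7,}  running average = {avg:.4f}")
```

The standard deviation of the running average shrinks like \(\sigma / \sqrt{n}\), so each hundredfold increase in samples buys roughly one more correct decimal digit.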
# Monty Hall: simulate stay vs switch and print the win rates
import numpy as np
rng = np.random.default_rng(0)
N = 200_000
car = rng.integers(0, 3, N) # door hiding the car (0,1,2)
pick = rng.integers(0, 3, N) # contestant's first pick
# stay wins exactly when the first pick was already the car
stay_wins = (pick == car)
# the host opens a goat door; switching wins whenever staying loses
switch_wins = ~stay_wins
print(f"trials : {N:,}")
print(f"P(win | STAY) : {stay_wins.mean():.4f} (theory 1/3 = 0.3333)")
print(f"P(win | SWITCH) : {switch_wins.mean():.4f} (theory 2/3 = 0.6667)")
print(f"switching advantage : {switch_wins.mean() / stay_wins.mean():.2f}x")
print("\nWhy: your first pick is right 1/3 of the time, so the OTHER unopened")
print("door carries the remaining 2/3 once the host reveals a goat. Conditioning,")
print("not intuition. (Law of large numbers: rates lock in as N grows.)")
Pitfalls — base rates & the prosecutor's fallacy
Probability's hardest errors are not algebraic; they are interpretive. The mind reaches for the wrong conditional, ignores the denominator, or forgets how rare the thing it is reasoning about really is. A few traps cause most real-world damage.
Base-rate neglect
The §1.3 disease test is the canonical case: people quote the test's 99% sensitivity and conclude that a positive result means a 99% chance of disease, forgetting the 1% prevalence that makes false positives outnumber true ones. The base rate is the prior in Bayes' theorem, and dropping it is mathematically equivalent to setting \(P(H) = P(H^c)\), that is, assuming the disease is as common as health. Whenever someone reports \(P(E \mid H)\) without a base rate, the quantity you actually care about, \(P(H \mid E)\), cannot be recovered from it.
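To see how much the prior matters, the sketch below evaluates Bayes' theorem for the §1.3 test (99% sensitivity, 95% specificity) at several base rates. The function name and the list of priors are illustrative; a prior of 0.5 corresponds to the "as common as health" assumption described above.

```python
def posterior(prior, sens=0.99, spec=0.95):
    """P(H | positive) from Bayes' theorem over the partition {H, not H}."""
    true_pos = sens * prior                # P(+ | H) * P(H)
    false_pos = (1 - spec) * (1 - prior)   # P(+ | not H) * P(not H)
    return true_pos / (true_pos + false_pos)


for prior in (0.001, 0.01, 0.1, 0.5):
    print(f"P(H) = {prior:<6} -> P(H | +) = {posterior(prior):.4f}")
```

The same test yields a posterior below 2% at a base rate of 1 in 1,000 and about 95% when the two hypotheses are equally likely, so the reported accuracy alone fixes nothing.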
The prosecutor's fallacy
This is the confusion of \(P(E \mid H)\) with \(P(H \mid E)\) dressed in a courtroom. A forensic match has a one-in-a-million random-match probability: \(P(\text{match} \mid \text{innocent}) = 10^{-6}\). The prosecutor declares the chance the defendant is innocent is therefore one in a million — but that swaps the conditional. The quantity that matters is \(P(\text{innocent} \mid \text{match})\), and by Bayes it depends on the suspect pool. In a city of 10 million, roughly 10 innocent people also match by chance; with one true source, a bare match makes the defendant only ~1-in-11 likely to be the source. The likelihood ratio is enormous, but multiplied against a tiny prior, the posterior is far from certainty.
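The arithmetic behind the 1-in-11 figure can be sketched as follows. Two assumptions are made explicit here that the paragraph leaves implicit: before the match, every resident is treated as equally likely to be the source, and the true source is taken to match with certainty.

```python
population = 10_000_000   # potential sources in the city (from the text)
rmp = 1e-6                # random-match probability, P(match | innocent)

innocent = population - 1
expected_innocent_matches = innocent * rmp   # about 10 chance matches

# odds form of Bayes' theorem
prior_odds = 1 / innocent            # assumption: uniform prior over residents
likelihood_ratio = 1 / rmp           # assumption: P(match | source) = 1
posterior_odds = prior_odds * likelihood_ratio
p_source = posterior_odds / (1 + posterior_odds)

print(f"expected innocent matches : {expected_innocent_matches:.2f}")
print(f"likelihood ratio          : {likelihood_ratio:,.0f}")
print(f"P(source | match)         : {p_source:.4f}  (about 1 in {1 / p_source:.0f})")
print(f"P(innocent | match)       : {1 - p_source:.4f}")
```

Any other evidence that narrows the suspect pool enters through the prior odds; shrinking the pool to 100 people, for example, would change the answer dramatically even though the random-match probability stays the same.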
The recurring shapes of error: (1) base-rate neglect: quoting \(P(E\mid H)\) and ignoring how rare \(H\) is; (2) the prosecutor's fallacy: reading \(P(E\mid H)\) as \(P(H\mid E)\); (3) the conjunction fallacy: judging \(P(A \cap B) > P(A)\), which is impossible because an intersection can never be more probable than either of its events (the "Linda is a bank teller and a feminist" experiment); (4) the gambler's fallacy: believing independent trials "are due" to correct, when by EQ S1.6 past flips tell you nothing about a fair coin's next flip.
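The gambler's fallacy is easy to test by simulation. The sketch below flips a simulated fair coin many times and looks only at the flips that come immediately after a run of at least five heads; the seed, the sample size and the streak length are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(0)
n_flips = 200_000
streak_len = 5

flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True means heads

run = 0            # length of the run of heads ending just before this flip
after_streak = []  # outcomes observed right after >= streak_len heads
for outcome in flips:
    if run >= streak_len:
        after_streak.append(outcome)
    run = run + 1 if outcome else 0

rate = sum(after_streak) / len(after_streak)

print(f"flips following {streak_len}+ heads : {len(after_streak):,}")
print(f"P(heads | streak)       : {rate:.4f}")
print(f"P(heads) overall        : {sum(flips) / n_flips:.4f}")
```

If tails were "due" after a streak, the conditional rate would fall noticeably below one half; for independent flips it should stay close to the unconditional rate, up to sampling noise.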
The unifying diagnosis. Every one of these is a failure to condition correctly — to track which event is given, which is uncertain, and what the base rates are. Bayes' theorem is not just a formula; it is the discipline that makes these errors impossible to commit if you actually write the ratio down. That is why §1.3 is the spine of this chapter and of everything statistical that follows.
We have treated probabilities of events and the mean and spread of a random variable in the abstract. The next chapter gives those random variables names and shapes: the Bernoulli, binomial, Poisson, normal, and exponential distributions — the recurring "characters" of uncertainty — plus the central limit theorem that explains why the bell curve appears everywhere a sum or an average does.
References
- Kolmogorov, A. N. (1933). Foundations of the Theory of Probability (Grundbegriffe der Wahrscheinlichkeitsrechnung).
- Bayes, T. & Price, R. (1763). An Essay towards solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
- Blitzstein, J. K. & Hwang, J. (2019). Introduction to Probability (2nd ed.). Chapman & Hall / CRC.
- Tversky, A. & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90(4), 293–315.
- Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.