Discrete distributions: counting outcomes
A distribution is a complete accounting of how probability is spread over the possible outcomes of a random quantity. When the outcomes are countable — heads or tails, the number of emails arriving in an hour, the roll of a die — we describe it with a probability mass function (PMF): a rule \(p(x)\) that assigns each outcome a probability, with the masses summing to one. Four discrete families cover an astonishing share of real problems, and they are all secretly about the same atom: a single yes/no trial.
The atom is the Bernoulli distribution — one trial with success probability \(p\). Everything in this section is built by repeating it, counting it, or waiting on it.
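Two more families finish the toolkit. The Poisson arises by pushing the Binomial (the count of successes in \(n\) Bernoulli trials) to a limit; the Geometric arises by waiting on Bernoulli trials until the first success: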
- Poisson — the law of rare events spread over a continuum of opportunity. Take a Binomial with many trials (\(n \to \infty\)) each tiny in probability (\(p \to 0\)) but with a fixed expected count \(\lambda = np\), and you get \(P(X = k) = e^{-\lambda}\lambda^{k}/k!\). It models arrivals: photons on a sensor, customers at a till, mutations along a genome, requests at a server. Its defining quirk — mean equals variance equals \(\lambda\) — is a diagnostic: if your count data has variance much larger than its mean, it is over-dispersed and the Poisson is the wrong model.
- Geometric — the waiting time for the first success: \(P(X = k) = (1 - p)^{k - 1} p\) for \(k = 1, 2, \dots\) (the number of flips up to and including the first head). It is the discrete cousin of the Exponential (§2.2) and is memoryless: having already waited ten flips tells you nothing about how many more remain.
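Both claims in the list above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names (`binom_pmf`, `poisson_pmf`, `max_pmf_gap`, `geom_tail`) and the parameter values are illustrative assumptions, not part of the original text. It measures how far a Binomial(\(n, \lambda/n\)) PMF sits from the Poisson(\(\lambda\)) PMF as \(n\) grows, and verifies the Geometric's memorylessness from its tail formula \(P(X > k) = (1 - p)^k\).

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, lam, kmax=20):
    """Largest |Binomial(n, lam/n) - Poisson(lam)| PMF gap over k = 0..kmax."""
    p = lam / n
    return max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(kmax + 1))

lam = 4.0
for n in (10, 100, 10_000):
    # the gap shrinks as n grows with n*p held fixed at lam
    print(f"n = {n:6d}   max PMF gap vs Poisson({lam:g}) = {max_pmf_gap(n, lam):.6f}")

def geom_tail(k, p):
    """P(X > k) for the Geometric on k = 1, 2, ...: the first k trials all fail."""
    return (1 - p)**k

p, s, t = 0.2, 10, 3
conditional = geom_tail(s + t, p) / geom_tail(s, p)   # P(X > s+t | X > s)
print(f"P(X > {s + t} | X > {s}) = {conditional:.6f}")
print(f"P(X > {t})          = {geom_tail(t, p):.6f}")
```

The two printed probabilities agree because the factor \((1 - p)^{s}\) cancels: the ten failures already observed drop out of the conditional probability.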
# Sample Binomial, Poisson, Normal -- empirical vs theoretical mean and var
import numpy as np

rng = np.random.default_rng(0)
M = 200_000              # samples per family
n, p = 10, 0.3           # Binomial(10, 0.3)
lam = 4.0                # Poisson(4)
mu, sig = 0.0, 2.0       # Normal with mean 0 and std 2 (variance 4)

# each entry: (samples, theoretical mean, theoretical variance)
draws = {
    "Binomial(10,0.3)": (rng.binomial(n, p, M), n*p, n*p*(1-p)),
    "Poisson(4)": (rng.poisson(lam, M), lam, lam),  # mean == var == lambda
    "Normal(0,sd=2)": (rng.normal(mu, sig, M), mu, sig**2),
}
print(f"{'family':18}{'emp mean':>10}{'theory':>9}{'emp var':>10}{'theory':>9}")
for name, (s, m, v) in draws.items():
    print(f"{name:18}{s.mean():10.3f}{m:9.3f}{s.var():10.3f}{v:9.3f}")
print("\nempirical moments track the formulas to ~1% at M = 200k;")
print("note Poisson's mean and variance are both 4 -- its signature.")
Continuous distributions: spreading mass over a line
When outcomes form a continuum — a height, a temperature, a wait in seconds — no single point can carry positive probability (there are infinitely many points). Instead we use a probability density function (PDF) \(f(x)\): probability is area under the curve, so \(P(a \le X \le b) = \int_a^b f(x)\,\mathrm{d}x\) and the total area is one. Three continuous families dominate the introductory landscape.
The Uniform on \([a, b]\) is the flat distribution — every value in the interval equally likely. It is the bedrock of simulation: a computer's random-number generator produces \(\text{Uniform}(0, 1)\) draws, and every other distribution is manufactured from them by transformation.
The Exponential is the continuous waiting time between Poisson events: if arrivals come at rate \(\lambda\), the gap until the next one is \(\text{Exp}(\lambda)\), with density \(f(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Like the Geometric, it is memoryless — the only continuous distribution that is. A bus that arrives "on average every 10 minutes" as a Poisson process gives you no credit for the 9 minutes you've already waited; your expected remaining wait is still 10. This is famously counter-intuitive and is exactly why memorylessness deserves a name.
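The last two paragraphs can be tied together in a short simulation. This is a sketch under stated assumptions: the seed, sample size, and variable names are arbitrary choices, and the transformation used is inverse-transform sampling, one standard way (not necessarily the one any particular library uses) to manufacture Exponential draws from \(\text{Uniform}(0, 1)\) draws. It then checks the bus example: among waits that have already exceeded 9 minutes, the mean remaining wait should still be about 10.

```python
import numpy as np

rng = np.random.default_rng(42)
rate = 1 / 10            # one bus per 10 minutes on average
M = 400_000              # number of simulated waits

# Inverse-transform sampling: if U ~ Uniform(0,1), then -ln(1 - U)/rate ~ Exp(rate)
u = rng.uniform(0.0, 1.0, M)
waits = -np.log1p(-u) / rate

print(f"mean wait                    : {waits.mean():.2f} min (theory {1 / rate:.2f})")

# Memorylessness: keep only the waits longer than 9 minutes and
# measure how much longer they lasted
already = 9.0
remaining = waits[waits > already] - already
print(f"mean remaining wait after {already:g}  : {remaining.mean():.2f} min (theory {1 / rate:.2f})")
```

`np.log1p(-u)` computes \(\ln(1 - u)\) accurately for small \(u\); since the generator returns values in \([0, 1)\), the logarithm's argument never reaches zero.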
Moments: four numbers that describe a shape
You don't need the whole PDF to talk about a distribution; four summary numbers — the moments — capture its location, spread, lopsidedness, and tail-heaviness. They are how one distribution gets compared to another, and how you decide whether the Normal is a fair description of your data.
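For reference, the four numbers have standard definitions, stated here as a reminder rather than a derivation. The symbols \(\gamma_1\) and \(\gamma_2\) are labels introduced for this block only; the "excess" convention subtracts 3 so that the Normal scores zero.

```latex
\[
\begin{aligned}
\text{mean (location):} \quad & \mu = \mathbb{E}[X] \\
\text{variance (spread):} \quad & \sigma^2 = \mathbb{E}\big[(X - \mu)^2\big] \\
\text{skew (lopsidedness):} \quad & \gamma_1 = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right] \\
\text{excess kurtosis (tail-heaviness):} \quad & \gamma_2 = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] - 3
\end{aligned}
\]
```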
Each higher moment refines the picture. Mean and variance alone cannot distinguish a symmetric bell from a lopsided ramp with the same centre and spread — you need skew. And two distributions can share mean, variance, and skew yet differ wildly in how often they throw extreme values — that difference lives in the kurtosis, which is the single most important number when randomness can hurt you (§2.5).
A caution that experts insist on. Higher moments are estimated from data far less reliably than lower ones: a sample skew or kurtosis is dominated by the few most extreme points you happened to observe, so it is noisy and, for genuinely heavy-tailed data, may not even converge. For some distributions in §2.5 the higher moments are infinite — they do not exist at all. Treat sample kurtosis as a hint, not a measurement.
| Distribution | Mean | Variance | Skew | Excess kurtosis |
|---|---|---|---|---|
| Normal(\(\mu, \sigma^2\)) | μ | σ² | 0 | 0 |
| Uniform(\(a, b\)) | (a+b)/2 | (b−a)²/12 | 0 | −1.2 |
| Exponential(\(\lambda\)) | 1/λ | 1/λ² | +2 | +6 |
| Poisson(\(\lambda\)) | λ | λ | 1/√λ | 1/λ |
| Student-t(\(\nu\)) | 0 (ν>1) | ν/(ν−2) (ν>2) | 0 (ν>3) | 6/(ν−4) (ν>4) |
Read the kurtosis column as a "danger gauge." The Uniform is platykurtic (negative excess): bounded, no surprises. The Exponential and especially the Student-t are leptokurtic (positive excess): far more prone to outliers than a Normal of the same variance. A Student-t with \(\nu = 5\) has excess kurtosis \(6/(5 - 4) = 6\), and at or below \(\nu = 4\) its kurtosis is no longer finite.
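The table entries can be compared against simulated data. The sketch below is illustrative only: `sample_moments` is a hypothetical helper using simple plug-in estimators (divide by \(M\), no bias correction), and the seed, sample size, and \(\lambda = 2\) are arbitrary. Note that NumPy's `exponential` is parameterised by the scale \(1/\lambda\), not the rate.

```python
import numpy as np

def sample_moments(x):
    """Return (mean, variance, skew, excess kurtosis) from plug-in estimators."""
    m = x.mean()
    s = x.std()
    z = (x - m) / s                      # standardised sample
    return m, s**2, (z**3).mean(), (z**4).mean() - 3.0

rng = np.random.default_rng(7)
M = 1_000_000
lam = 2.0

# each entry: (samples, (mean, variance, skew, excess kurtosis) from the table)
samples = {
    "Uniform(0,1)":   (rng.uniform(0, 1, M),      (0.5, 1 / 12, 0.0, -1.2)),
    "Exponential(2)": (rng.exponential(1 / lam, M), (1 / lam, 1 / lam**2, 2.0, 6.0)),
    "Poisson(2)":     (rng.poisson(lam, M),       (lam, lam, lam**-0.5, 1 / lam)),
}

estimates = {}
for name, (x, theory) in samples.items():
    estimates[name] = sample_moments(x)
    emp = "  ".join(f"{v:7.3f}" for v in estimates[name])
    thy = "  ".join(f"{v:7.3f}" for v in theory)
    print(f"{name:15} empirical: {emp}")
    print(f"{'':15} theory   : {thy}")
```

Even at a million draws the kurtosis estimate for the Exponential wanders more than the mean or variance does, which is the caution about higher moments in miniature.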
The Central Limit Theorem: why the Normal is everywhere
Here is the result that makes the whole subject hang together — and the reason the Normal earns its place at the centre of statistics. Take any distribution with a finite mean \(\mu\) and finite variance \(\sigma^2\). Draw \(n\) independent samples from it and average them. As \(n\) grows, the distribution of that average — properly recentred and rescaled — converges to a standard Normal, regardless of the shape you started from.
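In symbols, "properly recentred and rescaled" means subtracting the mean and dividing by the standard deviation of the average. The notation \(\bar{X}_n\) for the sample average and \(Z_n\) for its standardised version is introduced here for illustration.

```latex
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\qquad
Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
\;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
\]
```

Read informally, for large \(n\) the average behaves like a Normal with mean \(\mu\) and variance \(\sigma^2 / n\), which is why its spread shrinks like \(1/\sqrt{n}\).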
This is why the bell curve appears unbidden across nature and engineering: any quantity that is the sum of many small independent contributions — measurement error from countless tiny perturbations, a person's height from thousands of genetic and environmental nudges, the total noise on a sensor — is approximately Normal by construction. The Normal is not assumed; it is produced, again and again, by aggregation.
Convergence is fast for friendly shapes, slow for skewed ones. For a symmetric starting distribution like the Uniform, the average of just \(n = 5\)–\(10\) draws already looks convincingly bell-shaped. For a strongly skewed one like the Exponential you may need \(n = 30\)–\(50\) before the bell is clean — the textbook "\(n \ge 30\)" rule of thumb is a rough average, not a law. The shape of the parent distribution governs the rate of convergence, even though it never governs the limit.
# CLT demo: average N Uniform(0,1) draws, M times, and histogram the means
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 40_000  # N draws per mean, M means
means = rng.uniform(0, 1, size=(M, N)).mean(axis=1)

# CLT prediction for the means: centre 0.5, variance (1/12)/N
print(f"empirical mean of means : {means.mean():.4f} (theory 0.5000)")
print(f"empirical var of means : {means.var():.5f} (theory {(1/12)/N:.5f})")
print(f"=> standard error shrinks like 1/sqrt(N): {(1/12/N)**0.5:.4f}")

# a bell emerges from a FLAT parent -- build the density histogram
hist, edges = np.histogram(means, bins=45, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("\nthe parent Uniform is flat; the average of 30 of them is a clean bell.")

# plot_xy is assumed to be a plotting helper supplied by the hosting page,
# not part of NumPy; the call is skipped when it is not defined
if "plot_xy" in globals():
    plot_xy(centers, hist)
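The claim that skewed parents converge more slowly can be made quantitative with the skew of the averaged sample. For \(n\) independent draws, the skew of the mean equals the parent's skew divided by \(\sqrt{n}\), so the Uniform (skew 0) starts symmetric while the Exponential (skew 2) needs larger \(n\) to flatten out. The sketch below is illustrative; the helper `skew`, the seed, and the sample sizes are assumptions made for this example.

```python
import numpy as np

def skew(x):
    """Plug-in sample skew: mean of the cubed standardised values."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean()

rng = np.random.default_rng(3)
M = 50_000                       # number of averages per setting
results = {}                     # n -> (skew of Uniform means, skew of Exponential means)

print(f"{'n':>4}{'Uniform':>12}{'Exponential':>14}{'2/sqrt(n)':>12}")
for n in (1, 5, 30, 50):
    uni_means = rng.uniform(0, 1, size=(M, n)).mean(axis=1)
    exp_means = rng.exponential(1.0, size=(M, n)).mean(axis=1)
    results[n] = (skew(uni_means), skew(exp_means))
    print(f"{n:4d}{results[n][0]:12.3f}{results[n][1]:14.3f}{2 / np.sqrt(n):12.3f}")
```

The Uniform column hovers near zero at every \(n\), while the Exponential column tracks \(2/\sqrt{n}\) and is still visibly positive at \(n = 30\).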
Heavy tails for quants: when the Normal lies
The CLT comes with fine print, and on a trading desk that fine print is the whole story. The theorem requires a finite variance. Many real processes — financial returns above all — produce extreme moves far more often than a Normal of the same everyday spread would ever allow. These are heavy-tailed (or "fat-tailed") distributions, and mistaking them for Normal is how risk models blow up.
Three families matter, in rising order of danger:
- Student-t(\(\nu\)). A bell that looks Normal in the middle but decays far more slowly in the tails, governed by the degrees of freedom \(\nu\). Small \(\nu\) means fat tails; as \(\nu \to \infty\) it converges back to the Normal. It is the workhorse for daily and weekly asset returns, where \(\nu \approx 3\)–\(6\) typically fits. At or below \(\nu = 4\) the kurtosis is infinite, and at or below \(\nu = 2\) even the variance is infinite, voiding the CLT outright.
- Lognormal. If \(\log X\) is Normal, then \(X\) is lognormal: strictly positive, right-skewed, with a long upper tail. It is the natural model for quantities that grow multiplicatively: stock prices (Quant 03's geometric Brownian motion), income, city sizes, file sizes. Multiplying many independent positive factors means adding their logarithms, so the CLT acts on \(\log X\) and the bell shape appears on the log scale, not on \(X\) itself.
- Power laws (Pareto). The heaviest tails of all: \(P(X > x) \propto x^{-\alpha}\). The tail decays only polynomially, so the variance is infinite for \(\alpha \le 2\) and even the mean is infinite for \(\alpha \le 1\), in which case sample averages never settle down. Power laws describe wealth, city populations, word frequencies, network degrees, and the size of catastrophic losses. Whether they are the right model for financial returns, versus a Student-t with merely fattish tails, remains genuinely contested among quants: the data in the extreme tail is, by definition, sparse, and the two models are hard to tell apart from any finite sample.
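What "decays only polynomially" means in practice can be seen by asking how much rarer an event becomes when the threshold doubles. The sketch below is a standard-library illustration; the function names, the choice \(x_{\min} = 1\), and the exponents tried are assumptions for this example. It uses the Pareto tail \(P(X > x) = (x_{\min}/x)^{\alpha}\) and the textbook Pareto mean \(\alpha x_{\min}/(\alpha - 1)\) and variance \(\alpha x_{\min}^2 / ((\alpha - 1)^2(\alpha - 2))\), which are finite only for \(\alpha > 1\) and \(\alpha > 2\) respectively.

```python
import math

def normal_tail(k):
    """P(Z > k) for a standard Normal."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def pareto_tail(x, alpha, x_min=1.0):
    """P(X > x) = (x_min / x)**alpha for x >= x_min."""
    return (x_min / x) ** alpha

def pareto_moments(alpha, x_min=1.0):
    """(mean, variance) of a Pareto; math.inf where the moment is not finite."""
    mean = alpha * x_min / (alpha - 1) if alpha > 1 else math.inf
    var = (alpha * x_min**2 / ((alpha - 1) ** 2 * (alpha - 2))
           if alpha > 2 else math.inf)
    return mean, var

alpha = 3.0
print(f"{'threshold k':>12}{'Normal P(>2k)/P(>k)':>24}{'Pareto P(>2k)/P(>k)':>24}")
for k in (2, 4, 8):
    n_ratio = normal_tail(2 * k) / normal_tail(k)
    p_ratio = pareto_tail(2 * k, alpha) / pareto_tail(k, alpha)
    print(f"{k:12d}{n_ratio:24.3e}{p_ratio:24.3e}")

for a in (3.0, 1.5, 0.8):
    mean, var = pareto_moments(a)
    print(f"alpha = {a:3.1f}   mean = {mean:6.3f}   variance = {var:6.3f}")
```

Doubling the threshold always costs the Pareto the same factor \(2^{-\alpha}\), however far out you already are, whereas the Normal's ratio collapses towards zero.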
There is a deeper reason heavy tails persist. A generalised CLT says that sums of infinite-variance variables converge not to the Normal but to the stable family (of which the Normal is the lone finite-variance member). So heavy-tailedness is not a failure of aggregation to "kick in" — for these processes, aggregation has a different, fatter-tailed attractor. The Normal is the special case, not the rule.
How fat are the tails, really? That financial returns are heavier-tailed than Normal is settled and uncontroversial. How heavy is not. One camp (after Mandelbrot) argues for true power laws with possibly infinite variance; another fits finite-variance Student-t or stochastic-volatility models that generate fat tails without abandoning the CLT. The disagreement is hard to resolve precisely because extreme events are rare, so the deciding data is scarce. The honest engineering posture: assume tails fatter than Normal, stress-test against several tail models, and never let a single distributional assumption carry your entire risk number.
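A toy version of "stress-test against several tail models" might look like the following. It is a sketch only: the three models, the 99.9% level, the seed, and the helper `unit_variance_t` are illustrative assumptions, and nothing here is calibrated to market data. Each model is rescaled to variance 1 so that the everyday spread is identical and only the tails differ.

```python
import numpy as np

rng = np.random.default_rng(11)
M = 500_000
level = 0.999                    # quantile used as the "extreme move" yardstick

def unit_variance_t(df, size):
    """Student-t draws rescaled to variance 1 (requires df > 2)."""
    return rng.standard_t(df, size) / np.sqrt(df / (df - 2))

# same variance, progressively fatter tails
models = {
    "Normal":       rng.standard_normal(M),
    "Student-t(5)": unit_variance_t(5, M),
    "Student-t(3)": unit_variance_t(3, M),
}

quantiles = {name: float(np.quantile(x, level)) for name, x in models.items()}
for name, q in quantiles.items():
    print(f"{name:14} {level:.1%} quantile = {q:5.2f} standard deviations")
```

The spread between the rows is the point: a risk number quoted from only one of these models silently inherits that model's tail assumption.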
One distribution describes one quantity; the next chapter asks how two quantities move together. Stats 03: descriptive statistics and correlation — summarising real data with means, medians, and quantiles, and measuring the linear (Pearson) and rank (Spearman) association between variables, with the warning that opens every honest course: correlation is not causation.
References
- Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference.
- Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory.
- Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25.
- Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419.
- Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236.
- Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703.