MATHEMATICS & STATISTICS · CHAPTER 02 / 08

Distributions — The Shapes of Randomness

A handful of named distributions account for most randomness in practice: coin flips, queue arrivals, measurement noise, market returns. Each is fixed by one or two numbers. The Central Limit Theorem explains why the Normal curve recurs so often: average enough independent quantities and it appears, regardless of where you started.

LEVEL: INTRO · READING TIME: ≈ 24 MIN · BUILDS ON: STATS 01 · INSTRUMENTS: EXPLORER, CLT, TAIL RISK
2.1

Discrete distributions: counting outcomes

A distribution is a complete accounting of how probability is spread over the possible outcomes of a random quantity. When the outcomes are countable — heads or tails, the number of emails arriving in an hour, the roll of a die — we describe it with a probability mass function (PMF): a rule \(p(x)\) that assigns each outcome a probability, with the masses summing to one. Four discrete families cover an astonishing share of real problems, and they are all secretly about the same atom: a single yes/no trial.

The atom is the Bernoulli distribution — one trial with success probability \(p\). Everything in this section is built by repeating it, counting it, or waiting on it.

EQ S2.1 — BERNOULLI & BINOMIAL $$ \text{Bernoulli: } \; p(1) = p,\; p(0) = 1 - p \qquad\qquad \text{Binomial: } \; P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k} $$
A Bernoulli variable is a single coin flip scored 1 (success, probability \(p\)) or 0 (failure). The Binomial counts how many successes appear in \(n\) independent Bernoulli flips: \(\binom{n}{k}\) is the number of ways to place the \(k\) successes, times the probability of any one such arrangement. A Binomial is just a sum of \(n\) Bernoullis — which is exactly why §2.4 will make it look Normal as \(n\) grows. Its mean is \(np\) and its variance \(np(1-p)\).
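As a quick check on EQ S2.1, the sketch below builds the Binomial PMF directly from the formula using only the standard library, then confirms that the masses sum to one and reproduce the stated mean \(np\) and variance \(np(1-p)\). The helper name `binomial_pmf` and the choice \(n = 10,\ p = 0.3\) are illustrative assumptions, not part of the chapter's instruments.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), straight from EQ S2.1."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                        # masses must sum to one
mean = sum(k * pk for k, pk in enumerate(pmf))          # should equal n*p
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))  # should equal n*p*(1-p)

print(f"sum of masses = {total:.6f}")
print(f"mean = {mean:.4f}   (n*p       = {n * p:.4f})")
print(f"var  = {var:.4f}   (n*p*(1-p) = {n * p * (1 - p):.4f})")
```

Because the sum runs over every outcome \(k = 0, \dots, n\), the agreement here is exact up to floating-point rounding, not a sampling approximation.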

Two more families finish the toolkit. One arises by pushing the Binomial to a limit; the other by waiting on the Bernoulli trials themselves:

  • Poisson — the law of rare events spread over a continuum of opportunity. Take a Binomial with many trials (\(n \to \infty\)) each tiny in probability (\(p \to 0\)) but with a fixed expected count \(\lambda = np\), and you get \(P(X = k) = e^{-\lambda}\lambda^{k}/k!\). It models arrivals: photons on a sensor, customers at a till, mutations along a genome, requests at a server. Its defining quirk — mean equals variance equals \(\lambda\) — is a diagnostic: if your count data has variance much larger than its mean, it is over-dispersed and the Poisson is the wrong model.
  • Geometric — the waiting time for the first success: \(P(X = k) = (1 - p)^{k - 1} p\) for \(k = 1, 2, \dots\) (the number of flips up to and including the first head). It is the discrete cousin of the Exponential (§2.2) and is memoryless: having already waited ten flips tells you nothing about how many more remain.
EQ S2.2 — POISSON & GEOMETRIC $$ \text{Poisson}(\lambda): \; P(X = k) = \frac{e^{-\lambda}\,\lambda^{k}}{k!} \qquad\qquad \text{Geometric}(p): \; P(X = k) = (1 - p)^{k - 1} p $$
Poisson: the single parameter \(\lambda > 0\) is the rate, the mean, and the variance all at once. Geometric: \(\mathbb{E}[X] = 1/p\), so a fair coin (\(p = 0.5\)) takes 2 flips on average to land its first head, and a rare success (\(p = 0.01\)) takes 100. Both inherit independence from the Bernoulli atom they are built from, which is what makes their formulas so clean.
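The Poisson limit described above can be watched numerically. The sketch below holds \(\lambda = np = 1\) fixed, lets \(n\) grow, and compares the Binomial probability of exactly two successes with the Poisson value \(e^{-1}/2!\). The helper names and the grid of \(n\) values are assumptions chosen for illustration.

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 1.0, 2
target = poisson_pmf(k, lam)

gaps = {}
for n in (10, 100, 1_000, 10_000):
    approx = binomial_pmf(k, n, lam / n)   # many trials, each with p = lam / n
    gaps[n] = abs(approx - target)
    print(f"n = {n:>6}: Binomial P(X=2) = {approx:.5f}   gap = {gaps[n]:.6f}")

print(f"Poisson({lam}) P(X=2) = {target:.5f}")
```

The gap shrinks roughly in proportion to \(1/n\), which is why the Poisson is a safe stand-in for a Binomial with many trials and a small per-trial probability.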
A single trial succeeds with probability \(p = 0.3\). What is the variance of this \(\text{Bernoulli}(0.3)\) variable, \(p(1 - p)\)?
A Bernoulli's variance is \(\mathbb{E}[X^2] - (\mathbb{E}[X])^2 = p - p^2 = p(1 - p)\). With \(p = 0.3\): \(0.3 \times 0.7 = \) 0.21. Note it is maximised at \(p = 0.5\) (variance 0.25) — a fair coin is the most unpredictable, a near-certain trial the least.
Calls arrive at a desk at a rate of one per minute, \(\lambda = 1\). Using the Poisson PMF, what is \(P(X = 2)\) — the probability of exactly two calls in a minute? (Use \(e^{-1} = 0.368\).)
\(P(X = 2) = \dfrac{e^{-1}\,1^{2}}{2!} = \dfrac{0.368}{2} = \) 0.184. About one minute in five-and-a-half sees exactly two calls — even though one is the expected number.
PYTHON · RUNNABLE IN-BROWSER
# Sample Binomial, Poisson, Normal -- empirical vs theoretical mean and var
import numpy as np
rng = np.random.default_rng(0)
M = 200_000                                   # samples per family

n, p   = 10, 0.3                              # Binomial(10, 0.3)
lam     = 4.0                                 # Poisson(4)
mu, sig = 0.0, 2.0                            # Normal with mean 0, std dev 2 (variance 4)

draws = {
    "Binomial(10,0.3)": (rng.binomial(n, p, M),  n*p,        n*p*(1-p)),
    "Poisson(4)":       (rng.poisson(lam, M),    lam,        lam),       # mean == var == lambda
    "Normal(0,sd=2)":   (rng.normal(mu, sig, M), mu,         sig**2),
}

print(f"{'family':18}{'emp mean':>10}{'theory':>9}{'emp var':>10}{'theory':>9}")
for name, (s, m, v) in draws.items():
    print(f"{name:18}{s.mean():10.3f}{m:9.3f}{s.var():10.3f}{v:9.3f}")

print("\nempirical moments track the formulas to ~1% at M = 200k;")
print("note Poisson's mean and variance are both 4 -- its signature.")
INSTRUMENT S2.1 — DISTRIBUTION EXPLORER · PMF / PDF + SAMPLED HISTOGRAM · 6 FAMILIES · READOUTS: TYPE, MEAN, VARIANCE, STD DEV
The mint curve is the exact theoretical PMF (bars, for discrete families) or PDF (continuous); the blue outline is a histogram of 4,000 fresh samples. Switch to Poisson and notice the readouts for mean and variance stay locked together. Drag a Binomial's \(n\) up and watch the discrete bars climb into a smooth bell — a preview of §2.4. The two sliders rename themselves to whatever the chosen family's parameters are.
2.2

Continuous distributions: spreading mass over a line

When outcomes form a continuum — a height, a temperature, a wait in seconds — no single point can carry positive probability (there are infinitely many points). Instead we use a probability density function (PDF) \(f(x)\): probability is area under the curve, so \(P(a \le X \le b) = \int_a^b f(x)\,\mathrm{d}x\) and the total area is one. Three continuous families dominate the introductory landscape.

The Uniform on \([a, b]\) is the flat distribution — every value in the interval equally likely. It is the bedrock of simulation: a computer's random-number generator produces \(\text{Uniform}(0, 1)\) draws, and every other distribution is manufactured from them by transformation.

EQ S2.3 — THE NORMAL (GAUSSIAN) DENSITY $$ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad x \in \mathbb{R} $$
The bell curve, fixed entirely by its mean \(\mu\) (where it is centred) and standard deviation \(\sigma\) (how wide). The exponent is a parabola in \(x\), so the log-density is a downward parabola — the source of the curve's symmetric, rapidly-decaying tails. The 68–95–99.7 rule: roughly 68% of mass lies within \(1\sigma\) of the mean, 95% within \(2\sigma\), 99.7% within \(3\sigma\). Standardising via \(z = (x - \mu)/\sigma\) collapses every Normal onto one standard Normal, \(\mathcal{N}(0, 1)\) — the reason a single z-table once sufficed for all of statistics.
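The 68–95–99.7 rule can be recovered from the standard Normal alone, since standardising reduces every Normal to \(\mathcal{N}(0, 1)\). A minimal sketch, assuming only Python's `math.erf` and the identity \(P(|Z| \le k) = \operatorname{erf}(k/\sqrt{2})\); the function name `mass_within` is made up for this example.

```python
import math

def mass_within(k: float) -> float:
    """P(|Z| <= k) for a standard Normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(k):.4f}")
# prints 0.6827, 0.9545, 0.9973
```

The same function answers any "how much mass within \(k\sigma\)" question for any Normal, because \(\mu\) and \(\sigma\) drop out after standardising.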

The Exponential is the continuous waiting time between Poisson events: if arrivals come at rate \(\lambda\), the gap until the next one is \(\text{Exp}(\lambda)\), with density \(f(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\). Like the Geometric, it is memoryless — the only continuous distribution that is. A bus that arrives "on average every 10 minutes" as a Poisson process gives you no credit for the 9 minutes you've already waited; your expected remaining wait is still 10. This is famously counter-intuitive and is exactly why memorylessness deserves a name.
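The bus example can be simulated directly. The sketch below draws Exponential gaps with mean 10 minutes, keeps only the cases where the wait has already exceeded 9 minutes, and measures the average remaining wait. The sample size, seed, and variable names are assumptions for illustration.

```python
import random

rng = random.Random(0)
lam = 1 / 10                 # rate: one bus per 10 minutes on average
already_waited = 9.0         # minutes spent at the stop so far

gaps = [rng.expovariate(lam) for _ in range(200_000)]

# Condition on the bus not having arrived yet, then measure what is left.
still_waiting = [g - already_waited for g in gaps if g > already_waited]

fresh_mean = sum(gaps) / len(gaps)
remaining_mean = sum(still_waiting) / len(still_waiting)

print(f"mean wait from scratch        : {fresh_mean:.2f} min")
print(f"mean remaining wait after 9min: {remaining_mean:.2f} min")
```

Both averages should land close to 10 minutes: conditioning on the time already waited does not change the distribution of the time still to come.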

EQ S2.4 — UNIFORM & EXPONENTIAL $$ \text{Uniform}(a,b): \; f(x) = \frac{1}{b - a} \;\; (a \le x \le b) \qquad\qquad \text{Exponential}(\lambda): \; f(x) = \lambda e^{-\lambda x} \;\; (x \ge 0) $$
Uniform: \(\mathbb{E}[X] = \tfrac{a + b}{2}\), \(\operatorname{Var}(X) = \tfrac{(b - a)^2}{12}\) (that \(1/12\) returns in the CLT instrument). Exponential: \(\mathbb{E}[X] = 1/\lambda\), \(\operatorname{Var}(X) = 1/\lambda^2\); its standard deviation equals its mean, and the density is strongly right-skewed: many short waits, a few long ones. The Exponential is to the Poisson process what the Geometric is to a run of Bernoulli trials: the waiting time until the next event in a counting process, in continuous rather than discrete time.
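The earlier claim that other distributions are manufactured from \(\text{Uniform}(0, 1)\) draws can be made concrete with the inverse-CDF transform. For the Exponential, the CDF is \(F(x) = 1 - e^{-\lambda x}\), so solving \(u = F(x)\) gives \(x = -\ln(1 - u)/\lambda\). The sketch below is one illustrative way to do this, not necessarily how any particular library samples; the function name and \(\lambda = 2\) are assumptions.

```python
import math
import random

def exponential_from_uniform(u: float, lam: float) -> float:
    """Inverse-CDF transform: solve u = 1 - exp(-lam * x) for x."""
    return -math.log(1.0 - u) / lam

rng = random.Random(0)
lam = 2.0
xs = [exponential_from_uniform(rng.random(), lam) for _ in range(200_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

print(f"mean = {mean:.4f}   (theory 1/lam   = {1 / lam:.4f})")
print(f"var  = {var:.4f}   (theory 1/lam^2 = {1 / lam**2:.4f})")
```

Using `1.0 - u` rather than `u` keeps the argument of the logarithm strictly positive, because `random()` returns values in \([0, 1)\).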
A random number is drawn uniformly from \([0, 1]\). What is its variance, \(\dfrac{(b - a)^2}{12}\)?
With \(a = 0,\ b = 1\): \(\operatorname{Var}(X) = \dfrac{(1 - 0)^2}{12} = \dfrac{1}{12} = \) 0.0833. This single number — the variance of a unit uniform — is the seed the Central Limit Theorem grows the Normal from in §2.4.
2.3

Moments: four numbers that describe a shape

You don't need the whole PDF to talk about a distribution; four summary numbers — the moments — capture its location, spread, lopsidedness, and tail-heaviness. They are how one distribution gets compared to another, and how you decide whether the Normal is a fair description of your data.

EQ S2.5 — THE FOUR MOMENTS $$ \mu = \mathbb{E}[X], \quad \sigma^2 = \mathbb{E}\big[(X - \mu)^2\big], \quad \text{skew} = \mathbb{E}\!\left[\left(\tfrac{X - \mu}{\sigma}\right)^3\right], \quad \text{kurt} = \mathbb{E}\!\left[\left(\tfrac{X - \mu}{\sigma}\right)^4\right] $$
Mean \(\mu\) — the centre of mass, the balance point of the density. Variance \(\sigma^2\) — the average squared distance from the mean; its square root \(\sigma\) is the standard deviation, in the same units as the data. Skewness — the standardised third moment; \(0\) for any symmetric distribution, positive when the right tail is longer (incomes, wait times), negative when the left tail is. Kurtosis — the standardised fourth moment; it measures how much mass sits in the tails. The Normal has kurtosis exactly \(3\), so practitioners quote excess kurtosis \(= \text{kurt} - 3\): zero for a Normal, positive for the heavy-tailed distributions of §2.5.

Each higher moment refines the picture. Mean and variance alone cannot distinguish a symmetric bell from a lopsided ramp with the same centre and spread — you need skew. And two distributions can share mean, variance, and skew yet differ wildly in how often they throw extreme values — that difference lives in the kurtosis, which is the single most important number when randomness can hurt you (§2.5).

A caution that experts insist on. Higher moments are estimated from data far less reliably than lower ones: a sample skew or kurtosis is dominated by the few most extreme points you happened to observe, so it is noisy and, for genuinely heavy-tailed data, may not even converge. For some distributions in §2.5 the higher moments are infinite — they do not exist at all. Treat sample kurtosis as a hint, not a measurement.
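The instability is easy to see in simulation. The sketch below estimates excess kurtosis five times, each from a fresh sample, for a Normal and for a Student-t with \(\nu = 3\) (whose true kurtosis is infinite). The Student-t draws use the standard construction of a Normal divided by the square root of a scaled chi-square; the sample size, seeds, and helper names are assumptions.

```python
import math
import random

def excess_kurtosis(xs):
    """Standardised fourth moment minus 3, computed from the sample."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2**2 - 3.0

def student_t(rng, nu):
    """One Student-t(nu) draw: Normal / sqrt(chi-square(nu) / nu)."""
    chi2 = rng.gammavariate(nu / 2, 2.0)   # chi-square(nu) = Gamma(shape nu/2, scale 2)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / nu)

n = 20_000
normal_est, t3_est = [], []
for seed in range(5):
    rng = random.Random(seed)
    normal_est.append(excess_kurtosis([rng.gauss(0.0, 1.0) for _ in range(n)]))
    t3_est.append(excess_kurtosis([student_t(rng, 3) for _ in range(n)]))

print("Normal       :", [f"{v:7.2f}" for v in normal_est])
print("Student-t(3) :", [f"{v:7.2f}" for v in t3_est])
```

The Normal estimates should cluster tightly around zero, while the Student-t estimates jump from run to run, each one set largely by the most extreme draws in that sample.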

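| Distribution | Mean | Variance | Skew | Excess kurtosis |
|---|---|---|---|---|
| Normal(\(\mu, \sigma^2\)) | μ | σ² | 0 | 0 |
| Uniform(\(a, b\)) | (a+b)/2 | (b−a)²/12 | 0 | −1.2 |
| Exponential(\(\lambda\)) | 1/λ | 1/λ² | +2 | +6 |
| Poisson(\(\lambda\)) | λ | λ | 1/√λ | 1/λ |
| Student-t(\(\nu\)) | 0 (ν>1) | ν/(ν−2) (ν>2) | 0 (ν>3) | 6/(ν−4) (ν>4) |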

Read the kurtosis column as a "danger gauge." The Uniform is platykurtic (negative excess): bounded, no surprises. The Exponential and especially the Student-t are leptokurtic (positive excess): far more prone to outliers than a Normal of the same variance. A Student-t with \(\nu = 5\) has excess kurtosis \(6/(5 - 4) = 6\), and for \(\nu \le 4\) its kurtosis is no longer finite.
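The Uniform and Exponential rows of the table can be checked against samples by applying EQ S2.5 to data. A minimal sketch, assuming population-style (divide by \(n\)) estimators and a sample size of 200,000; the `moments` helper is hypothetical.

```python
import random

def moments(xs):
    """Return (mean, variance, skew, excess kurtosis) of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    skew = sum(((x - mean) / sd) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / sd) ** 4 for x in xs) / n
    return mean, var, skew, kurt - 3.0

rng = random.Random(0)
n = 200_000
samples = {
    "Uniform(0,1)":   [rng.random() for _ in range(n)],
    "Exponential(1)": [rng.expovariate(1.0) for _ in range(n)],
}
results = {name: moments(xs) for name, xs in samples.items()}

print(f"{'family':16}{'mean':>8}{'var':>8}{'skew':>8}{'ex.kurt':>9}")
for name, (m, v, s, k) in results.items():
    print(f"{name:16}{m:8.3f}{v:8.3f}{s:8.3f}{k:9.3f}")
```

Expect the mean and variance to match the table closely and the skew and kurtosis less so, which is the estimation caveat from earlier in this section showing up even for well-behaved families.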

2.4

The Central Limit Theorem: why the Normal is everywhere

Here is the result that makes the whole subject hang together — and the reason the Normal earns its place at the centre of statistics. Take any distribution with a finite mean \(\mu\) and finite variance \(\sigma^2\). Draw \(n\) independent samples from it and average them. As \(n\) grows, the distribution of that average — properly recentred and rescaled — converges to a standard Normal, regardless of the shape you started from.

EQ S2.6 — THE CENTRAL LIMIT THEOREM $$ \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\Longrightarrow\quad \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty $$
The sample mean \(\bar{X}_n\) is itself random; the CLT pins down its distribution. Two facts fall out for free. First, \(\bar{X}_n\) centres on \(\mu\): the average is an unbiased estimate of the true mean. Second, its spread shrinks as \(\sigma/\sqrt{n}\): the standard error falls like \(1/\sqrt{n}\), so to halve your uncertainty you must quadruple your sample. The \(\xrightarrow{d}\) means "converges in distribution." The CLT does not require the \(X_i\) to be Normal, only that they be independent draws from a common distribution with finite variance, the one condition §2.5 will show is not always met.

This is why the bell curve appears unbidden across nature and engineering: any quantity that is the sum of many small independent contributions — measurement error from countless tiny perturbations, a person's height from thousands of genetic and environmental nudges, the total noise on a sensor — is approximately Normal by construction. The Normal is not assumed; it is produced, again and again, by aggregation.

KEY

Convergence is fast for friendly shapes, slow for skewed ones. For a symmetric starting distribution like the Uniform, the average of just \(n = 5\)–\(10\) draws already looks convincingly bell-shaped. For a strongly skewed one like the Exponential you may need \(n = 30\)–\(50\) before the bell is clean — the textbook "\(n \ge 30\)" rule of thumb is a rough average, not a law. The shape of the parent distribution governs the rate of convergence, even though it never governs the limit.
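One way to quantify "speed of convergence" is to track the skewness of the sample mean, which is zero for a Normal. The sketch below does this for a Uniform parent and an Exponential parent at a few values of \(n\). For independent draws the skewness of the mean equals the parent's skewness divided by \(\sqrt{n}\), so the Exponential column should sit near \(2/\sqrt{n}\). The replication count, seed, and helper names are assumptions.

```python
import random

def skewness(xs):
    """Standardised third moment of a sample."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

def skew_of_means(draw, n, reps, rng):
    """Skewness of `reps` sample means, each averaging `n` draws."""
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]
    return skewness(means)

rng = random.Random(0)
reps = 10_000
uniform_skew, expo_skew = {}, {}
for n in (1, 5, 30):
    uniform_skew[n] = skew_of_means(lambda r: r.random(), n, reps, rng)
    expo_skew[n] = skew_of_means(lambda r: r.expovariate(1.0), n, reps, rng)
    print(f"n = {n:>2}: Uniform skew = {uniform_skew[n]:+.3f}   "
          f"Exponential skew = {expo_skew[n]:+.3f}   (2/sqrt(n) = {2 / n**0.5:.3f})")
```

The Uniform starts symmetric and stays that way, so only its kurtosis has to converge; the Exponential still carries visible skew at \(n = 30\).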

PYTHON · RUNNABLE IN-BROWSER
# CLT demo: average N Uniform(0,1) draws, M times, and histogram the means
import numpy as np
rng = np.random.default_rng(1)
N, M = 30, 40_000                              # N draws per mean, M means

means = rng.uniform(0, 1, size=(M, N)).mean(axis=1)

# CLT prediction for the means: centre 0.5, variance (1/12)/N
print(f"empirical  mean of means : {means.mean():.4f}   (theory 0.5000)")
print(f"empirical  var  of means : {means.var():.5f}  (theory {(1/12)/N:.5f})")
print(f"=> standard error shrinks like 1/sqrt(N):  {(1/12/N)**0.5:.4f}")

# a bell emerges from a FLAT parent -- plot the density histogram
hist, edges = np.histogram(means, bins=45, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("\nthe parent Uniform is flat; the average of 30 of them is a clean bell.")
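# plot_xy is assumed to be the plotting helper supplied by the in-browser runtime;
# it is not part of NumPy, so outside that runtime swap in your own plotting call.
plot_xy(centers, hist)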
INSTRUMENT S2.2 — CENTRAL LIMIT THEOREM SIMULATOR · AVERAGE OF N IID DRAWS → NORMAL · EQ S2.6 · READOUTS: PARENT SHAPE, STD ERROR σ/√N, SAMPLE SKEW
At \(N = 1\) you see the raw parent — flat, skewed, or two-spiked. Drag \(N\) upward: 10,000 sample means are re-histogrammed each step and the mint Normal curve (mean \(\mu\), width \(\sigma/\sqrt{N}\)) is overlaid for comparison. Watch the Exponential's heavy right skew melt away far more slowly than the Uniform's — the parent shape sets the speed of convergence, never the destination. The sample-skew readout marches toward zero as the bell forms.
You average \(n = 4\) draws from \(\text{Uniform}(0, 1)\), whose standard deviation is \(\sigma = 1/\sqrt{12} = 0.2887\). By what factor \(1/\sqrt{n}\) does the standard error of the mean shrink relative to a single draw?
The standard error is \(\sigma/\sqrt{n}\), so relative to one draw it shrinks by \(1/\sqrt{n} = 1/\sqrt{4} = 1/2 = \) 0.5. Quadrupling the sample halves the error — the \(1/\sqrt{n}\) law that governs every poll, A/B test, and Monte-Carlo estimate.
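The halving can be confirmed empirically. The sketch below measures the standard deviation of sample means of \(\text{Uniform}(0, 1)\) draws at \(n = 1, 4, 16\) and compares each with \(\sigma/\sqrt{n}\); the replication count and seed are arbitrary assumptions.

```python
import random

def std_dev(xs):
    """Population standard deviation of a sample."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

rng = random.Random(0)
reps = 20_000
sigma = (1 / 12) ** 0.5          # std dev of one Uniform(0, 1) draw

spread = {}
for n in (1, 4, 16):
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    spread[n] = std_dev(means)
    print(f"n = {n:>2}: empirical SE = {spread[n]:.4f}   sigma/sqrt(n) = {sigma / n**0.5:.4f}")

print(f"ratio SE(4)/SE(1)  = {spread[4] / spread[1]:.3f}")
print(f"ratio SE(16)/SE(4) = {spread[16] / spread[4]:.3f}")
```

Each fourfold increase in \(n\) should cut the empirical standard error roughly in half.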
2.5

Heavy tails for quants: when the Normal lies

The CLT comes with fine print, and on a trading desk that fine print is the whole story. The theorem requires a finite variance. Many real processes — financial returns above all — produce extreme moves far more often than a Normal of the same everyday spread would ever allow. These are heavy-tailed (or "fat-tailed") distributions, and mistaking them for Normal is how risk models blow up.

Three families matter, in rising order of danger:

  • Student-t(\(\nu\)). A bell that looks Normal in the middle but decays far more slowly in the tails, governed by the degrees of freedom \(\nu\). Small \(\nu\) means fat tails; as \(\nu \to \infty\) it converges back to the Normal. It is the workhorse for daily and weekly asset returns, where \(\nu \approx 3\)–\(6\) typically fits. For \(\nu \le 4\) the kurtosis is infinite, and for \(\nu \le 2\) even the variance is infinite, voiding the CLT outright.
  • Lognormal. If \(\log X\) is Normal, then \(X\) is lognormal: strictly positive, right-skewed, with a long upper tail. It is the natural model for quantities that grow multiplicatively: stock prices (Quant 03's geometric Brownian motion), income, city sizes, file sizes. A product of many independent positive factors is a sum in logs, so the CLT acts on \(\log X\): multiplicative aggregation produces a Lognormal, not a Normal.
  • Power laws (Pareto). The heaviest tails of all: \(P(X > x) \propto x^{-\alpha}\). The tail decays only polynomially, so for small enough exponent \(\alpha\) the variance — or even the mean — fails to exist, and sample averages never settle down. Power laws describe wealth, city populations, word frequencies, network degrees, and the size of catastrophic losses. Whether they are the right model for financial returns, versus a Student-t with merely fattish tails, remains genuinely contested among quants: the data in the extreme tail is, by definition, sparse, and the two models are hard to tell apart from any finite sample.
EQ S2.7 — STUDENT-t & POWER-LAW TAILS $$ f_{t}(x;\nu) = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}} \qquad\qquad P(X > x) \;\sim\; x^{-\alpha} \;\;\text{(power law)} $$
The Student-t density decays like \(x^{-(\nu + 1)}\) for large \(x\) — a polynomial tail, versus the Normal's \(e^{-x^2/2}\) which is astronomically thinner. That polynomial decay is exactly a power-law tail with \(\alpha = \nu\). The practical consequence: a "six-sigma" daily move is a once-in-a-million-years event under a Normal, but happens every few years in real markets. Risk built on the Normal systematically under-prices the catastrophe; this miscalibration is the proximate cause of more than one financial crisis.
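The gap between the two tails can be put in numbers. The sketch below computes the Normal's two-sided six-sigma probability with `math.erfc`, then compares a four-sigma event under a Normal with the same event under a Student-t with \(\nu = 4\) rescaled to unit variance. The Student-t tail is obtained by simple midpoint integration of the density in EQ S2.7, which is an illustrative choice rather than a production method; the 252 trading days per year and all function names are assumptions.

```python
import math

def normal_two_sided_tail(x: float) -> float:
    """P(|Z| > x) for a standard Normal Z."""
    return math.erfc(x / math.sqrt(2))

def t_pdf(x: float, nu: float) -> float:
    """Student-t density from EQ S2.7."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def t_two_sided_tail(x: float, nu: float, steps: int = 20_000) -> float:
    """P(|T| > x) by midpoint-rule integration of the density.

    Substituting t = 1/u maps the infinite range (x, inf) onto (0, 1/x).
    """
    h = (1.0 / x) / steps
    area = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        area += t_pdf(1.0 / u, nu) / (u * u)
    return 2.0 * area * h

# Six-sigma move under a Normal, converted to a return period.
TRADING_DAYS = 252                        # assumed trading days per year
p6 = normal_two_sided_tail(6.0)
print(f"Normal P(|Z| > 6) = {p6:.3e}  -> once per {1 / (p6 * TRADING_DAYS):,.0f} years")

# Four-sigma move: Normal vs a unit-variance Student-t with nu = 4.
nu, k = 4, 4.0
scale = math.sqrt(nu / (nu - 2))          # std dev of a raw Student-t(nu)
p_norm = normal_two_sided_tail(k)
p_t = t_two_sided_tail(k * scale, nu)     # k sigma expressed in raw-t units
ratio = p_t / p_norm
print(f"P(|X| > 4 sigma): Normal = {p_norm:.3e}   Student-t(4) = {p_t:.3e}   ratio = {ratio:.0f}")
```

Rescaling to unit variance matters: without it the comparison would mix a fatter tail with a simply wider distribution.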

There is a deeper reason heavy tails persist. A generalised CLT says that sums of infinite-variance variables converge not to the Normal but to the stable family (of which the Normal is the lone finite-variance member). So heavy-tailedness is not a failure of aggregation to "kick in" — for these processes, aggregation has a different, fatter-tailed attractor. The Normal is the special case, not the rule.
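The failure of averages to settle can be seen with Pareto draws. The sketch below tracks the running sample mean for a tame exponent (\(\alpha = 3\), finite mean \(\alpha/(\alpha - 1) = 1.5\)) and a wild one (\(\alpha = 0.8\), infinite mean), using the standard library's `random.paretovariate`, whose draws satisfy \(P(X > x) = x^{-\alpha}\) for \(x \ge 1\). The checkpoints, seed, and function name are assumptions.

```python
import random

def running_means(alpha, n, checkpoints, seed):
    """Sample mean of Pareto(alpha) draws, recorded at the given checkpoints."""
    rng = random.Random(seed)
    total, out = 0.0, {}
    for i in range(1, n + 1):
        total += rng.paretovariate(alpha)   # P(X > x) = x**(-alpha) for x >= 1
        if i in checkpoints:
            out[i] = total / i
    return out

checkpoints = {1_000, 10_000, 100_000}
tame = running_means(3.0, 100_000, checkpoints, seed=0)   # finite mean 1.5
wild = running_means(0.8, 100_000, checkpoints, seed=0)   # mean does not exist

for i in sorted(checkpoints):
    print(f"after {i:>7} draws: alpha=3 mean = {tame[i]:7.3f}   alpha=0.8 mean = {wild[i]:12.3f}")
```

The \(\alpha = 3\) column should hover near 1.5. The \(\alpha = 0.8\) column has no value to converge to: it is pushed around by whichever extreme draw arrived most recently, and more data does not calm it.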

INSTRUMENT S2.3 — TAIL-RISK OVERLAY · NORMAL vs STUDENT-t · TAIL-PROBABILITY READOUT · READOUTS: P(|X| > t) NORMAL, P(|X| > t) STUDENT-t, TAIL RATIO t / Normal
Both curves are scaled to unit variance, so they agree in the bland centre; the danger hides in the tails. Switch to LOG-y to see the gap explode: the Normal plunges as a downward parabola while the Student-t flattens out, its log-density falling only like \(-(\nu + 1)\log x\) (a power-law tail, which would be a straight line on log-log axes). Push the threshold to 4–5σ and read the ratio: at \(\nu = 4\) a 4σ event is many times likelier under the t than the Normal. Raise \(\nu\) toward 30 and the two distributions merge, the Student-t becoming Normal in the limit.
CONTESTED

How fat are the tails, really? That financial returns are heavier-tailed than Normal is settled and uncontroversial. How heavy is not. One camp (after Mandelbrot) argues for true power laws with possibly infinite variance; another fits finite-variance Student-t or stochastic-volatility models that generate fat tails without abandoning the CLT. The disagreement is hard to resolve precisely because extreme events are rare, so the deciding data is scarce. The honest engineering posture: assume tails fatter than Normal, stress-test against several tail models, and never let a single distributional assumption carry your entire risk number.

NEXT

One distribution describes one quantity; the next chapter asks how two quantities move together. Stats 03: descriptive statistics and correlation — summarising real data with means, medians, and quantiles, and measuring the linear (Pearson) and rank (Spearman) association between variables, with the warning that opens every honest course: correlation is not causation.

2.R

References

  1. Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer — the standard modern reference for the distributions, moments, and convergence results in this chapter.
  2. Fischer, H. (2011). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer — the full lineage of EQ S2.6, from de Moivre and Laplace to Lindeberg and Lévy.
  3. Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1), 1–25. The original derivation of the Student-t distribution (EQ S2.7), written at the Guinness brewery.
  4. Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business 36(4), 394–419. The founding argument that financial returns are heavy-tailed and possibly infinite-variance — the contested claim of §2.5.
  5. Cont, R. (2001). Empirical Properties of Asset Returns: Stylized Facts and Statistical Issues. Quantitative Finance 1(2), 223–236. A careful survey of the fat-tail evidence and why pinning down the tail exponent is genuinely hard.
  6. Clauset, A., Shalizi, C. R. & Newman, M. E. J. (2009). Power-Law Distributions in Empirical Data. SIAM Review 51(4), 661–703. The methodological reference on fitting and — crucially — testing power-law tails against alternatives.