Summarizing one variable
Before two variables can be related, each must be described. A column of numbers is summarized along two axes: where it sits (location) and how spread out it is (scale). Get these two right and most of descriptive statistics follows.
The two headline measures of location are the mean and the median. The mean is the balance point; the median is the middle value once the data is sorted. They agree on symmetric data and disagree — sometimes wildly — on skewed data.
The mean has one fatal weakness: it is not robust. A single extreme value drags it arbitrarily far, while the median barely flinches. This is the first lesson of robust statistics — and it returns the moment a single outlier hijacks a correlation in §3.2.
Quantiles generalize the median. The \(q\)-quantile is the value below which a fraction \(q\) of the data falls: the median is the \(0.5\)-quantile, the quartiles are the \(0.25\) and \(0.75\) quantiles, and the interquartile range (IQR \(= Q_3 - Q_1\)) is a robust measure of scale that ignores the tails entirely.
| Measure | What it captures | Robust to outliers? | Breakdown point |
|---|---|---|---|
| Mean | Location (balance point) | No | 0% |
| Median | Location (middle value) | Yes | 50% |
| Std. deviation | Scale (typical spread) | No | 0% |
| IQR | Scale (middle 50%) | Yes | 25% |
The breakdown point is the largest fraction of the data that can be replaced with arbitrarily bad values before the statistic can be pushed arbitrarily far from the truth. The mean breaks with a single bad point (a breakdown point of 0% in the large-sample limit); the median survives until half the data is corrupted (50%). When you do not yet trust your data, summarize it with the median and IQR first.
# Location & scale: mean vs median, std vs IQR -- and how one outlier hits each
import numpy as np

x = np.array([2, 4, 4, 5, 5, 6, 7, 8, 9, 10], dtype=float)

def summarize(v):
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    # float(...) keeps the printed dict free of NumPy scalar wrappers
    return dict(mean=float(v.mean()), median=float(med),
                std=float(v.std(ddof=1)),   # ddof=1 => Bessel's n-1
                iqr=float(q3 - q1))

print("clean data  :", {k: round(val, 2) for k, val in summarize(x).items()})

x_bad = x.copy()
x_bad[-1] = 1000.0   # one wild outlier
print("with outlier:", {k: round(val, 2) for k, val in summarize(x_bad).items()})

print("\nmean moved by", round(summarize(x_bad)['mean'] - summarize(x)['mean'], 2))
print("median moved by", round(summarize(x_bad)['median'] - summarize(x)['median'], 2))
print("=> the mean & std chase the outlier; median & IQR barely notice it.")
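The breakdown points in the table can also be checked numerically. The sketch below is illustrative only: the sample, the seed, the `corrupt` helper, and the corruption value `1e6` are assumptions made for the demonstration, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed so the run is repeatable
clean = rng.normal(loc=50.0, scale=5.0, size=101)

def corrupt(v, k, value=1e6):
    """Return a copy of v with its first k entries replaced by a wild value."""
    out = v.copy()
    out[:k] = value
    return out

for k in [0, 1, 25, 50, 51]:
    bad = corrupt(clean, k)
    print(f"{k:3d} of {clean.size} corrupted   "
          f"mean {bad.mean():12.1f}   median {np.median(bad):12.1f}")
```

With 101 points, the median stays inside the range of the clean data while up to 50 points are corrupted, and jumps to the corruption value at 51. The mean is already far off after a single bad point.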
Covariance & Pearson correlation
With two variables \(X\) and \(Y\), the first question is whether they move together. Covariance answers it directly: when \(X\) is above its mean, is \(Y\) usually above its mean too? Multiply the two deviations and average — positive products dominate when they rise together, negative when one rises as the other falls.
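Covariance has one flaw: its magnitude depends on the units of \(X\) and \(Y\), so the raw number tells you which way the variables move together but not how strongly. The fix is to divide out the scale. Normalize covariance by the two standard deviations and you get the Pearson correlation coefficient \(r\), a pure, unitless number confined to \([-1, +1]\).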
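A small sketch of what "dividing out the scale" buys. The height and weight values are invented for illustration, and the helper names `covariance` and `pearson` are assumptions of this example. Changing the units of one variable rescales the covariance but leaves \(r\) untouched.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance: average product of deviations (n-1 denominator)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1))

def pearson(x, y):
    """Covariance divided by both sample standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(covariance(x, y) / (x.std(ddof=1) * y.std(ddof=1)))

# Invented measurements, for illustration only
height_m = np.array([1.60, 1.65, 1.70, 1.75, 1.80, 1.85])
weight_kg = np.array([55.0, 62.0, 64.0, 72.0, 75.0, 84.0])
height_cm = height_m * 100          # same data, different unit

print("cov (m, kg) :", round(covariance(height_m, weight_kg), 4))
print("cov (cm, kg):", round(covariance(height_cm, weight_kg), 4))
print("r   (m, kg) :", round(pearson(height_m, weight_kg), 4))
print("r   (cm, kg):", round(pearson(height_cm, weight_kg), 4))
```

The covariance in centimetres is 100 times the covariance in metres, while the two correlation values agree.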
# Pearson from scratch -- and a demonstration that it only sees straight lines (EQ S3.3)
import numpy as np
import matplotlib.pyplot as plt   # used for the scatter plot at the end

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()   # center both variables
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

x = np.linspace(-3, 3, 60)
print("y = 2x + 1 (perfect line) r =", round(pearson(x, 2*x + 1), 3))
print("y = -x (perfect down-line) r =", round(pearson(x, -x), 3))
print("y = x**2 (perfect parabola) r =", round(pearson(x, x**2), 3))
print("\nThe parabola is a *perfect* relationship -- yet Pearson reports ~0,")
print("because the symmetric U has no net linear trend. Always plot first.")

plt.scatter(x, x**2)   # see the U that Pearson is blind to
plt.show()
Rank correlation: Spearman & Kendall
Pearson asks "do they fall on a line?" Often the better question is "do they move in the same order?" — a relationship can be reliably increasing without being straight. Rank correlation answers that softer question, and in doing so buys robustness for free.
Spearman's ρ is breathtakingly simple: replace every value by its rank, then run ordinary Pearson on the ranks. Because ranks are bounded \(1,\dots,n\), no outlier can pull harder than one rank — and any strictly increasing relationship, line or not, gets \(\rho = 1\).
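Kendall's τ attacks the same target, monotone agreement, from a different angle. It counts pairs of observations: a pair \((i,j)\) is concordant if \(x\) and \(y\) agree on which of the two is larger, and discordant if they disagree. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.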
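The pair counting can be written out directly. This is a minimal sketch of the simplest variant (often called tau-a), in which tied pairs count as neither concordant nor discordant. The function name `kendall_tau` and the test vectors are assumptions of this example, and the double loop is quadratic in the sample size, so it is meant for reading rather than for large data.

```python
import numpy as np
from itertools import combinations

def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # Positive product: x and y order the pair the same way
        agreement = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = np.array([1, 2, 3, 4, 5], dtype=float)
print("increasing curve :", kendall_tau(x, np.exp(x)))
print("decreasing curve :", kendall_tau(x, -x**3))
print("two swapped pairs:", kendall_tau(x, np.array([2, 1, 4, 3, 5])))
```

In the last case 8 of the 10 pairs are concordant and 2 are discordant, giving τ = 0.6. Library implementations such as `scipy.stats.kendalltau` apply a tie correction by default, so their results can differ from this sketch when the data contains ties.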
Use Pearson when the relationship is plausibly linear and the data is clean and roughly normal. Use Spearman or Kendall when you only care about monotone direction, when outliers are present, when the data is ordinal (ratings, ranks), or when the relationship is curved but consistently increasing. A large gap between Pearson and Spearman is itself a diagnostic: it signals non-linearity or outliers, so go look at the scatter plot.
# Pearson vs Spearman on monotone-nonlinear data -- watch the gap open
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def spearman(x, y):  # Pearson on the ranks
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

x = np.linspace(0.1, 4, 80)
for name, y in [("linear y=x", x),
                ("exp y=e^x", np.exp(x)),
                ("cubic y=x^3", x**3),
                ("log y=log x", np.log(x))]:
    print(f"{name:16s} pearson {pearson(x, y):+.3f} spearman {spearman(x, y):+.3f}")

print("\nEvery curve above is strictly increasing -> Spearman = +1.000 exactly.")
print("Pearson sags below 1 wherever the curve bends. The gap = non-linearity.")
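The listing above shows the gap opening through curvature. The other half of the diagnostic, outliers, can be sketched the same way. The data here is constructed for illustration: twenty points on a perfectly decreasing line plus one wild point at (1000, 1000). The rank trick via double `argsort` assumes there are no ties.

```python
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

def spearman(x, y):
    # Pearson on the ranks; double argsort gives ranks when there are no ties
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

x = np.arange(1.0, 21.0)            # 1, 2, ..., 20
y = -x                              # perfectly decreasing
x_bad = np.append(x, 1000.0)        # one wild point, far from the line
y_bad = np.append(y, 1000.0)

print(f"clean        pearson {pearson(x, y):+.3f}  spearman {spearman(x, y):+.3f}")
print(f"with outlier pearson {pearson(x_bad, y_bad):+.3f}  "
      f"spearman {spearman(x_bad, y_bad):+.3f}")
```

One point out of 21 is enough to flip Pearson from -1 to nearly +1, while Spearman weakens to about -0.73 but keeps the correct sign, because the outlier can claim only the top rank.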
Why correlation ≠ causation
Here is the cliff every analyst eventually walks off. You compute a strong \(r\), the p-value is tiny, the scatter is gorgeous — and you conclude that \(X\) causes \(Y\). The conclusion does not follow, and no amount of additional data fixes it. A correlation is consistent with at least four very different worlds.
| If X and Y correlate, it could be… | Structure | Example |
|---|---|---|
| X causes Y | X → Y | Smoking → lung cancer |
| Y causes X (reverse) | X ← Y | "Umbrellas → rain" read backwards |
| A confounder Z causes both | X ← Z → Y | Ice-cream sales & drownings ← summer heat |
| Pure coincidence | none | Spurious correlations in noisy, multiply-tested data |
The most dangerous of these is the confounder: a hidden variable \(Z\) that drives both \(X\) and \(Y\), manufacturing a correlation between them where no direct link exists. Ice-cream sales and drowning deaths rise together — not because frozen dairy is lethal, but because hot weather \(Z\) independently boosts both. Condition on \(Z\) (compare days at the same temperature) and the correlation evaporates.
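A simulation makes the evaporation concrete. Everything below is synthetic: the coefficients, noise levels, and seed are assumptions chosen for illustration, and "conditioning on \(Z\)" is approximated by removing the linear effect of temperature from each variable and correlating what is left (a partial correlation), which is one of several ways to adjust.

```python
import numpy as np

rng = np.random.default_rng(42)                   # fixed seed for repeatability
n = 5000
temp = rng.normal(25.0, 5.0, n)                   # Z: daily temperature (synthetic)
ice_cream = 2.0 * temp + rng.normal(0, 5.0, n)    # X depends on Z only
drownings = 0.5 * temp + rng.normal(0, 2.0, n)    # Y depends on Z only

def residual(v, z):
    """What is left of v after removing its best linear fit on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

r_raw = np.corrcoef(ice_cream, drownings)[0, 1]
r_partial = np.corrcoef(residual(ice_cream, temp),
                        residual(drownings, temp))[0, 1]

print(f"raw correlation          : {r_raw:+.3f}")
print(f"after adjusting for temp : {r_partial:+.3f}")
```

There is no arrow between ice cream and drownings in the simulated world, yet the raw correlation is strong. Once temperature is accounted for, it falls to roughly zero.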
Simpson's paradox
Confounding has a spectacular special case. Simpson's paradox is when a trend that holds in every subgroup reverses when the groups are pooled. It is not a statistical glitch — both the aggregate and the per-group numbers are arithmetically correct. The aggregate is simply answering a different, usually wrong, question.
UC Berkeley's 1973 graduate admissions looked biased against women in aggregate (about 44% of men admitted vs 35% of women). But department by department the gap largely vanished: in most departments women were admitted at rates similar to or higher than men's. The resolution: women applied disproportionately to highly competitive departments with low admit rates for everyone. Department was the confounder. Pooling across it produced a reversal that pointed at the wrong cause, a textbook reason never to aggregate blindly across a variable that drives both your exposure and your outcome.
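The arithmetic of the reversal is easy to reproduce. The counts below are invented for illustration and are not the Berkeley data; the two department names and the `rate` helper are likewise assumptions of this sketch.

```python
# (applicants, admitted) per department and group.
# Invented numbers for illustration; NOT the Berkeley data.
data = {
    "easy dept": {"men": (800, 480), "women": (200, 140)},
    "hard dept": {"men": (200, 40),  "women": (800, 200)},
}

def rate(applied, admitted):
    return admitted / applied

# Per-department admission rates
for dept, groups in data.items():
    men, women = rate(*groups["men"]), rate(*groups["women"])
    print(f"{dept}: men {men:.0%}  women {women:.0%}")

# Pool applicants and admits across departments, then recompute the rates
pooled = {}
for group in ("men", "women"):
    applied = sum(data[d][group][0] for d in data)
    admitted = sum(data[d][group][1] for d in data)
    pooled[group] = rate(applied, admitted)

print(f"pooled   : men {pooled['men']:.0%}  women {pooled['women']:.0%}")
```

Women have the higher rate in each department (70% vs 60%, 25% vs 20%), yet the pooled rate is 52% for men and 34% for women, because most women in this toy data applied to the department that rejects most applicants.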
Causal thinking: DAGs, backdoor paths, the do-operator
If more data cannot turn correlation into causation, what can? A causal model — an explicit, falsifiable claim about which variable affects which. Judea Pearl's framework, the standard since the 2000s, draws these claims as a directed acyclic graph (DAG): nodes are variables, arrows are direct causal effects, "acyclic" means no variable causes itself through a loop.
This figure carries the central, counter-intuitive lesson of causal inference: "control for everything" is wrong. The arrows decide. A fork (\(X \leftarrow Z \rightarrow Y\)) is the confounder you must adjust for. A chain (\(X \rightarrow Z \rightarrow Y\)) makes \(Z\) a mediator — adjust for it and you delete part of the real effect. A collider (\(X \rightarrow Z \leftarrow Y\)) is the trap: \(X\) and \(Y\) are independent until you condition on \(Z\), which opens a fake association (this is collider / selection bias).
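The collider case is the least intuitive of the three, so a simulated sketch may help. The variables are synthetic and independent by construction; the noise scale, the selection threshold of 1.0, and the seed are arbitrary assumptions. Conditioning is done here by selection, keeping only the rows where \(Z\) is large.

```python
import numpy as np

rng = np.random.default_rng(7)                   # fixed seed for repeatability
n = 20000
x = rng.normal(size=n)
y = rng.normal(size=n)                           # independent of x by construction
z = x + y + rng.normal(scale=0.5, size=n)        # collider: X -> Z <- Y

selected = z > 1.0                               # condition on Z by selecting on it

r_all = np.corrcoef(x, y)[0, 1]
r_selected = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"all rows        : r = {r_all:+.3f}")
print(f"rows with z > 1 : r = {r_selected:+.3f}")
```

Across all rows the correlation is near zero. Among the selected rows it is clearly negative: if \(Z\) is known to be large and \(X\) is small, \(Y\) must have been large to compensate.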
A backdoor path is any non-causal route from \(X\) to \(Y\) that starts with an arrow into \(X\) — exactly the channel through which confounding leaks. The backdoor criterion says: to read the true causal effect of \(X\) on \(Y\), find a set of variables that blocks every backdoor path without opening a collider, and adjust for them. Do that, and observational data yields a causal answer.
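In symbols, the do-operator named in the section title writes \(P(Y \mid \mathrm{do}(X = x))\) for the distribution of \(Y\) when \(X\) is set to \(x\) by intervention rather than merely observed at \(x\). Assuming a set \(Z\) of discrete variables satisfies the backdoor criterion, the standard adjustment formula from Pearl's framework expresses that interventional quantity in purely observational terms. The second line, ordinary conditioning, is shown for contrast.

```latex
\begin{aligned}
P(Y = y \mid \mathrm{do}(X = x)) &= \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z) \\
P(Y = y \mid X = x)              &= \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z \mid X = x)
\end{aligned}
```

The only difference is the weight on each stratum: adjustment averages over the population distribution of \(Z\), while conditioning averages over the distribution of \(Z\) among units that happen to have \(X = x\). The two agree when \(Z\) is independent of \(X\), which is exactly the situation with no confounding through \(Z\).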
You can now describe data and reason about what does — and does not — cause what. The missing piece is uncertainty: every \(r\), every mean, every effect estimate is computed from a finite sample and could be a fluke. Chapter 04 — Inference & Testing — builds the machinery to ask "could this have happened by chance?": sampling distributions, confidence intervals, p-values, and the hypothesis tests that decide when a correlation is real enough to act on.
References
- Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents.
- Spearman, C. (1904). The Proof and Measurement of Association between Two Things.
- Kendall, M. G. (1938). A New Measure of Rank Correlation.
- Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley.
- Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.).
- Pearl, J. (1995). Causal Diagrams for Empirical Research.