Correlation & Causation — AI Encyclopedia

3.1

Summarizing one variable

Before two variables can be related, each must be described. A column of numbers is summarized along two axes: where it sits (location) and how spread out it is (scale). Get these two right and most of descriptive statistics follows.

The two headline measures of location are the mean and the median. The mean is the balance point; the median is the middle value once the data is sorted. They agree on symmetric data and disagree — sometimes wildly — on skewed data.

EQ S3.1 — MEAN & VARIANCE $$ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\big(x_i - \bar{x}\big)^2, \qquad s = \sqrt{s^2} $$

$\bar{x}$ is the arithmetic mean; $s^2$ the sample variance — the average squared distance from the mean; $s$ the standard deviation, in the same units as the data. The divisor $n-1$ (not $n$) is Bessel's correction: dividing by $n$ systematically under-estimates the spread because the deviations are taken from the sample mean — which is itself fit to the data — so one degree of freedom is already spent. Variance is the engine of everything that follows: correlation is just shared variance, normalized.
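A small simulation can make Bessel's correction visible. The sketch below is illustrative only: the sample size, the true variance of 4, and the seed are arbitrary choices, not values from the text. It draws many small samples and averages the two competing variance estimates.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
true_var = 4.0                          # population variance (sigma = 2), assumed
n, trials = 5, 200_000                  # many small samples of size n

samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

var_div_n = samples.var(axis=1, ddof=0).mean()    # divide by n
var_div_n1 = samples.var(axis=1, ddof=1).mean()   # divide by n-1 (Bessel)

print(f"true variance               : {true_var:.3f}")
print(f"average of the /n estimate  : {var_div_n:.3f}")
print(f"theory for /n, (n-1)/n * var: {(n - 1) / n * true_var:.3f}")
print(f"average of the /(n-1) est.  : {var_div_n1:.3f}")
```

Dividing by $n$ should land near $(n-1)/n$ times the true variance, while the $n-1$ version should average out close to the true value.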

The mean has one fatal weakness: it is not robust. A single extreme value drags it arbitrarily far, while the median barely flinches. This is the first lesson of robust statistics — and it returns the moment a single outlier hijacks a correlation in §3.2.

Quantiles generalize the median. The $q$-quantile is the value below which a fraction $q$ of the data falls: the median is the $0.5$-quantile, the quartiles are the $0.25$ and $0.75$ quantiles, and the interquartile range (IQR $= Q_3 - Q_1$) is a robust measure of scale that ignores the tails entirely.

Measure	What it captures	Robust to outliers?	Breakdown point
Mean	Location (balance point)	No	0%
Median	Location (middle value)	Yes	50%
Std. deviation	Scale (typical spread)	No	0%
IQR	Scale (middle 50%)	Yes	25%

The breakdown point is the fraction of the data you can corrupt before the statistic becomes meaningless. The mean breaks with one bad point (0%); the median survives until half the data is corrupted (50%). When you do not yet trust your data, summarize it with the median and IQR first.
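The breakdown points in the table can be checked empirically. This is a minimal sketch on synthetic data (100 normal draws, an arbitrary corrupting value of one million, and a hypothetical helper named `corrupt`): it replaces a growing fraction of the values and reports each summary.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.normal(loc=10.0, scale=2.0, size=100)   # synthetic, well-behaved data

def iqr(v):
    q1, q3 = np.percentile(v, [25, 75])
    return q3 - q1

def corrupt(v, fraction, bad_value=1e6):
    """Return a copy of v with the first `fraction` of its values replaced."""
    out = v.copy()
    k = int(round(fraction * len(out)))
    out[:k] = bad_value
    return out

for frac in (0.0, 0.01, 0.20, 0.30, 0.45, 0.55):
    c = corrupt(clean, frac)
    print(f"{frac:>4.0%} corrupted  mean {c.mean():>10.1f}  median {np.median(c):>10.1f}"
          f"  std {c.std(ddof=1):>10.1f}  IQR {iqr(c):>10.1f}")
```

The mean and standard deviation jump at the first corrupted point, the IQR holds at 20% but fails by 30%, and the median holds at 45% but fails by 55%.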

PYTHON · RUNNABLE IN-BROWSER

# Location & scale: mean vs median, std vs IQR -- and how one outlier hits each
import numpy as np
x = np.array([2, 4, 4, 5, 5, 6, 7, 8, 9, 10], dtype=float)

def summarize(v):
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return dict(mean=v.mean(), median=med,
                std=v.std(ddof=1),                 # ddof=1 => Bessel's n-1
                iqr=q3 - q1)

# float() keeps the printed dicts free of np.float64(...) wrappers under NumPy 2
print("clean data :", {k: round(float(val), 2) for k, val in summarize(x).items()})

x_bad = x.copy(); x_bad[-1] = 1000.0              # one wild outlier
print("with outlier:", {k: round(float(val), 2) for k, val in summarize(x_bad).items()})

print("\nmean   moved by", round(summarize(x_bad)['mean']   - summarize(x)['mean'],   2))
print("median moved by", round(summarize(x_bad)['median'] - summarize(x)['median'], 2))
print("=> the mean & std chase the outlier; median & IQR barely notice it.")

3.2

Covariance & Pearson correlation

With two variables $X$ and $Y$, the first question is whether they move together. Covariance answers it directly: when $X$ is above its mean, is $Y$ usually above its mean too? Multiply the two deviations and average — positive products dominate when they rise together, negative when one rises as the other falls.

EQ S3.2 — COVARIANCE $$ \operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}\big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big) $$

Each term is the product of two signed deviations. Same side of the mean → positive; opposite sides → negative. The sum's sign tells you the direction of the association. But its magnitude is uninterpretable: covariance carries the units of $X$ times the units of $Y$, so rescaling height from metres to centimetres multiplies it by 100 without changing anything real. Covariance has the right sign but the wrong scale.
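The unit problem is easy to reproduce. The following sketch uses synthetic heights and weights (the numbers and the height-weight relation are invented for illustration) and computes the covariance of EQ S3.2 with `np.cov` before and after converting metres to centimetres.

```python
import numpy as np

rng = np.random.default_rng(2)
height_m = rng.normal(1.70, 0.10, size=200)                       # synthetic heights, metres
weight_kg = 70 + 60 * (height_m - 1.70) + rng.normal(0, 8, 200)   # loosely tied to height

height_cm = height_m * 100.0                                      # same people, new unit

cov_m = np.cov(height_m, weight_kg, ddof=1)[0, 1]     # units: m * kg
cov_cm = np.cov(height_cm, weight_kg, ddof=1)[0, 1]   # units: cm * kg

print(f"cov(height in m , weight) = {cov_m:.4f}")
print(f"cov(height in cm, weight) = {cov_cm:.4f}")
print(f"ratio                     = {cov_cm / cov_m:.1f}")
```

Nothing about the people changed, yet the covariance grows by the conversion factor, which is why its magnitude cannot be read on its own.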

The fix is to divide out the scale. Normalize covariance by the two standard deviations and you get the Pearson correlation coefficient $r$ — a pure, unitless number locked to $[-1, +1]$.

EQ S3.3 — PEARSON CORRELATION $$ r = \frac{\operatorname{cov}(X, Y)}{s_X\, s_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}} $$

$r = +1$ is a perfect increasing line, $r = -1$ a perfect decreasing line, $r = 0$ no linear association. Geometrically, $r$ is the cosine of the angle between the two mean-centred data vectors — which is exactly why $|r| \le 1$. The square, $r^2$, is the fraction of $Y$'s variance a straight line through $X$ explains. Pearson sees only straight lines: it can read $r \approx 0$ off data that is perfectly but non-linearly related (a clean parabola), and it is dragged hard by a single outlier — both failures you can trigger in the instrument below.
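The geometric and variance-explained readings of $r$ can be verified numerically. This sketch assumes synthetic data (a noisy line with arbitrary slope and noise level) and compares EQ S3.3, the cosine of the angle between the centred vectors, NumPy's `corrcoef`, and the share of variance explained by a least-squares line.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.8 * x + rng.normal(scale=0.6, size=300)      # noisy line (synthetic data)

xc, yc = x - x.mean(), y - y.mean()                # mean-centred data vectors

r_formula = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))             # EQ S3.3
cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))  # cosine of the angle
r_numpy = np.corrcoef(x, y)[0, 1]

# r^2 as variance explained: fit y = slope*x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
explained = 1.0 - residuals.var() / y.var()

# r is unitless: rescaling and shifting x leaves it unchanged
r_rescaled = np.corrcoef(100.0 * x + 5.0, y)[0, 1]

print(f"r (EQ S3.3)        = {r_formula:.6f}")
print(f"cosine of angle    = {cos_angle:.6f}")
print(f"np.corrcoef        = {r_numpy:.6f}")
print(f"r^2 vs explained   = {r_formula**2:.6f} vs {explained:.6f}")
print(f"r after rescaling  = {r_rescaled:.6f}")
```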

You measure five points that lie exactly on the increasing line $ y = 3x + 2 $: $(0,2),(1,5),(2,8),(3,11),(4,14)$. What is their Pearson correlation $ r $?

Every point sits on one straight increasing line, so the linear fit is perfect: $r = 1.0$. Pearson reaches $+1$ for any increasing line, regardless of its slope: the slope (here 3) and intercept (here 2) do not affect $r$; only the tightness and direction of the linear pattern do.

Two variables have covariance $ \operatorname{cov}(X,Y) = 6 $, with standard deviations $ \sigma_X = 2 $ and $ \sigma_Y = 6 $. What is the Pearson correlation $ r $?

By EQ S3.3, $ r = \dfrac{\operatorname{cov}(X,Y)}{\sigma_X\,\sigma_Y} = \dfrac{6}{2 \times 6} = \dfrac{6}{12} = 0.5 $, a moderate positive linear association. Notice the covariance alone (6) told you nothing until it was divided by the spreads.

PYTHON · RUNNABLE IN-BROWSER

# Pearson from scratch -- and proof it only sees straight lines (EQ S3.3)
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

x = np.linspace(-3, 3, 60)

print("y = 2x + 1   (perfect line)      r =", round(pearson(x, 2*x + 1), 3))
print("y = -x       (perfect down-line) r =", round(pearson(x, -x),       3))
print("y = x**2     (perfect parabola)  r =", round(pearson(x, x**2),     3))
print("\nThe parabola is a *perfect* relationship -- yet Pearson reports ~0,")
print("because the symmetric U has no net linear trend. Always plot first.")
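# plot_scatter is not defined in this listing; it is presumably a helper the
# in-browser runtime provides. Elsewhere, substitute your own scatter plot.
plot_scatter(x, x**2)        # see the U that Pearson is blind to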

INSTRUMENT S3.1 — SCATTER & CORRELATION EXPLORER · DRAG THE OUTLIER · NOISE SLIDER · PEARSON vs SPEARMAN

Controls: NOISE σ (shown at 0.30), TRUE SLOPE (shown at +1.0). Readouts: PEARSON r, SPEARMAN ρ, r² (VARIANCE EXPLAINED).

Drag the single red point (the outlier) anywhere on the canvas. Watch Pearson r swing dramatically while Spearman ρ barely moves: Spearman works on ranks, so however far the point travels it can never count for more than the lowest or highest rank. Now raise the noise slider toward 1.5 and both coefficients collapse toward 0; set the slope negative and both flip sign. The line is the least-squares fit.
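For readers without the interactive instrument, the outlier experiment can be approximated offline. This is a rough sketch on synthetic data, not the instrument's actual code; `with_outlier` is a hypothetical helper, and the rank function assumes tie-free data.

```python
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(v):
    # ranks 0..n-1; valid here because continuous data has no tied values
    return np.argsort(np.argsort(v)).astype(float)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 30)
y = x + rng.normal(scale=0.05, size=30)      # tight upward trend (synthetic)

def with_outlier(value):
    """Copy y and drag its last point down to `value`."""
    yy = y.copy()
    yy[-1] = value
    return yy

print(f"{'no outlier':>12s}  pearson {pearson(x, y):+.3f}  spearman {spearman(x, y):+.3f}")
for value in (-5.0, -50.0, -500.0):
    yy = with_outlier(value)
    label = f"y = {value:g}"
    print(f"{label:>12s}  pearson {pearson(x, yy):+.3f}  spearman {spearman(x, yy):+.3f}")
```

Once the dragged point is already the lowest value, pushing it further changes Pearson but leaves Spearman exactly where it was, because its rank cannot drop any lower.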

3.3

Rank correlation: Spearman & Kendall

Pearson asks "do they fall on a line?" Often the better question is "do they move in the same order?" — a relationship can be reliably increasing without being straight. Rank correlation answers that softer question, and in doing so buys robustness for free.

Spearman's ρ is breathtakingly simple: replace every value by its rank, then run ordinary Pearson on the ranks. Because ranks are bounded by $1,\dots,n$, an outlier's pull is capped: however extreme its value, it counts only as rank $1$ or rank $n$. And any strictly increasing relationship, line or not, gets $\rho = 1$.

EQ S3.4 — SPEARMAN'S RANK CORRELATION $$ \rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}, \qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) $$

$d_i$ is the difference between the two ranks of observation $i$. This tidy formula is exact only when there are no ties; with ties you fall back to Pearson-on-the-ranks, which is the general definition. $\rho$ measures monotonicity, not linearity: it reaches $+1$ for $y = e^{x}$, $y = \log x$, or any other strictly increasing curve, where Pearson would report something less than 1. It inherits the median's robustness because ranks compress the tails.
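The caveat about ties can be seen on two tiny data sets. The sketch below uses made-up numbers and a hand-written `average_ranks` helper (an illustrative implementation of average ranking, not a library function) to compare EQ S3.4 with Pearson-on-the-ranks.

```python
import numpy as np

def average_ranks(v):
    """Ranks 1..n, giving tied values the mean of the ranks they span."""
    v = np.asarray(v, float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman_ranks(x, y):               # general definition: Pearson on the ranks
    return pearson(average_ranks(x), average_ranks(y))

def spearman_shortcut(x, y):            # EQ S3.4, exact only without ties
    d = average_ranks(x) - average_ranks(y)
    n = len(d)
    return float(1.0 - 6.0 * (d**2).sum() / (n * (n**2 - 1)))

x_no_ties, y_no_ties = [3.1, 1.2, 4.8, 2.0, 5.5, 0.7], [2.9, 1.0, 3.5, 4.1, 6.0, 0.2]
x_ties, y_ties = [1, 2, 2, 3, 3, 3], [1, 3, 2, 2, 4, 5]

for name, xs, ys in [("no ties", x_no_ties, y_no_ties), ("ties", x_ties, y_ties)]:
    print(f"{name:8s} shortcut {spearman_shortcut(xs, ys):+.4f}"
          f"   pearson-on-ranks {spearman_ranks(xs, ys):+.4f}")
```

The two routes agree on the tie-free data and drift apart once ties appear, which is why Pearson-on-the-ranks is the safer default.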

Kendall's τ attacks the same target, monotone agreement, from a different angle. It counts pairs of observations: a pair $(i,j)$ is concordant if $x$ and $y$ agree on which of the two is larger, and discordant if they disagree.

EQ S3.5 — KENDALL'S τ $$ \tau = \frac{C - D}{\binom{n}{2}} = \frac{(\text{concordant pairs}) - (\text{discordant pairs})}{\tfrac{1}{2}\,n(n-1)} $$

$C$ counts pairs that move the same way, $D$ pairs that move opposite ways, out of all $\binom{n}{2}$ pairs; as written, the formula assumes no ties. $\tau = +1$ means every pair is concordant (perfect monotone increase); $\tau = -1$ means every pair is discordant. Kendall's τ has a cleaner probabilistic meaning than Spearman's ρ, namely $\tau = P(\text{concordant}) - P(\text{discordant})$ for a randomly chosen pair. It is often regarded as less sensitive to outliers and better behaved in small samples, at the cost of more computation: a direct count visits all $\binom{n}{2}$ pairs. Many statisticians prefer Kendall for ranked data; Spearman is the more widely reported.
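EQ S3.5 translates almost directly into a pair count. The following is a brute-force sketch for small, tie-free data (the five points are invented for illustration); it visits every pair, so its cost grows quadratically with $n$.

```python
from itertools import combinations
import numpy as np

def kendall_tau(x, y):
    """EQ S3.5 by brute force: (C - D) / (n choose 2). Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        if s > 0:
            concordant += 1      # x and y agree on which observation is larger
        elif s < 0:
            discordant += 1      # x and y disagree
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs, concordant, discordant

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 1.0, 4.0, 3.0, 9.0])     # mostly increasing, two local swaps

tau, C, D = kendall_tau(x, y)
print(f"C = {C}, D = {D}, pairs = {C + D}, tau = {tau:+.2f}")
print(f"strictly increasing curve: tau = {kendall_tau(x, np.exp(x))[0]:+.2f}")
```

Of the ten pairs, only the two adjacent swaps are discordant, so $\tau = (8 - 2)/10 = 0.6$.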

WHICH TO USE

Pearson when the relationship is plausibly linear and the data is clean and roughly normal. Spearman / Kendall when you only care about monotone direction, when outliers are present, when the data is ordinal (ratings, ranks), or when the relationship is curved but consistently increasing. A large gap between Pearson and Spearman is itself a diagnostic: it screams "non-linearity or outliers — go look at the scatter plot."

PYTHON · RUNNABLE IN-BROWSER

# Pearson vs Spearman on monotone-nonlinear data -- watch the gap open
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

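def spearman(x, y):                      # Pearson on the ranks
    # argsort-of-argsort gives ranks 0..n-1 (Pearson is unaffected by the shift
    # from 1..n). It breaks ties arbitrarily, so use it only on tie-free data.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)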

x = np.linspace(0.1, 4, 80)
for name, y in [("linear   y=x",  x),
                ("exp      y=e^x", np.exp(x)),
                ("cubic    y=x^3", x**3),
                ("log      y=log x", np.log(x))]:
    print(f"{name:16s}  pearson {pearson(x, y):+.3f}   spearman {spearman(x, y):+.3f}")

print("\nEvery curve above is strictly increasing -> Spearman = +1.000 exactly.")
print("Pearson sags below 1 wherever the curve bends. The gap = non-linearity.")

3.4

Why correlation ≠ causation

Here is the cliff every analyst eventually walks off. You compute a strong $r$, the p-value is tiny, the scatter is gorgeous — and you conclude that $X$ causes $Y$. The conclusion does not follow, and no amount of additional data fixes it. A correlation is consistent with at least four very different worlds.

If X and Y correlate, it could be…	Structure	Example
X causes Y	X → Y	Smoking → lung cancer
Y causes X (reverse)	X ← Y	Umbrella use & rain: the rain brings out the umbrellas, not the reverse
A confounder Z causes both	X ← Z → Y	Ice-cream sales & drownings ← summer heat
Pure coincidence	none	Spurious correlations in noisy, multiply-tested data

The most dangerous of these is the confounder: a hidden variable $Z$ that drives both $X$ and $Y$, manufacturing a correlation between them where no direct link exists. Ice-cream sales and drowning deaths rise together — not because frozen dairy is lethal, but because hot weather $Z$ independently boosts both. Condition on $Z$ (compare days at the same temperature) and the correlation evaporates.

EQ S3.6 — CONFOUNDER-INDUCED CORRELATION $$ X = aZ + \varepsilon_X, \quad Y = bZ + \varepsilon_Y, \quad (\text{no } X \to Y) \;\;\Longrightarrow\;\; \operatorname{corr}(X, Y) = \frac{ab\,\sigma_Z^2}{\sigma_X\,\sigma_Y} \neq 0 $$

Both $X$ and $Y$ are noisy copies of the same driver $Z$; the noise terms $\varepsilon_X, \varepsilon_Y$ are independent of each other and of $Z$. There is no arrow from $X$ to $Y$, yet they correlate, purely through their shared parent. With $a=b=1$ and unit variance for $Z$ and for both noise terms, $\sigma_X^2 = \sigma_Y^2 = \sigma_Z^2 + 1 = 2$, giving $\operatorname{corr}(X,Y) = 1/(\sqrt{2}\,\sqrt{2}) = 1/2$. An intervention on $X$ would move nothing in $Y$: the spurious 0.5 would vanish the instant you set $X$ by hand instead of letting $Z$ set it. You can build this exact world in Instrument S3.3.
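The world of EQ S3.6 is small enough to simulate directly. This sketch assumes Gaussian noise and a large sample (both are illustrative choices) and models the intervention by drawing $X$ independently of $Z$ while leaving the equation for $Y$ untouched.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
a = b = 1.0

# Observational world: Z drives both X and Y; there is no X -> Y arrow.
z = rng.normal(size=n)
x = a * z + rng.normal(size=n)
y = b * z + rng.normal(size=n)
r_observed = np.corrcoef(x, y)[0, 1]

# EQ S3.6 with unit variances: ab * var(Z) / (sd(X) * sd(Y))
r_predicted = a * b * 1.0 / (np.sqrt(a**2 + 1.0) * np.sqrt(b**2 + 1.0))

# Interventional world: X is set by hand (drawn independently of Z),
# while Y is still generated by the same equation as before.
x_do = rng.normal(scale=np.sqrt(2.0), size=n)
y_do = b * z + rng.normal(size=n)
r_intervened = np.corrcoef(x_do, y_do)[0, 1]

print(f"predicted by EQ S3.6 : {r_predicted:.3f}")
print(f"observed corr(X, Y)  : {r_observed:.3f}")
print(f"after setting X      : {r_intervened:.3f}")
```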

Simpson's paradox

Confounding has a spectacular special case. Simpson's paradox is when a trend that holds in every subgroup reverses when the groups are pooled. It is not a statistical glitch — both the aggregate and the per-group numbers are arithmetically correct. The aggregate is simply answering a different, usually wrong, question.

THE 1973 BERKELEY CASE

UC Berkeley's graduate admissions looked biased against women in aggregate (about 44% of men admitted vs 35% of women). But department by department, the gap largely disappeared: in most departments women were admitted at rates similar to or higher than men's. The resolution: women applied disproportionately to highly competitive departments with low admit rates for everyone. Department was the confounder. Pooling across it produced a disparity that pointed at the wrong cause, which is a textbook reason never to aggregate blindly across a variable that drives both your exposure and your outcome.
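The arithmetic behind such a reversal fits in a few lines. The counts below are invented for illustration and are not the Berkeley data; they describe two hypothetical departments in which women are admitted at the higher rate in each, yet at the lower rate overall.

```python
# (applicants, admitted) per department and group; all numbers are invented
data = {
    "easy dept": {"men": (800, 480), "women": (100, 65)},
    "hard dept": {"men": (200, 40), "women": (900, 225)},
}

def rate(applied, admitted):
    return admitted / applied

for dept, groups in data.items():
    men, women = rate(*groups["men"]), rate(*groups["women"])
    print(f"{dept}:  men {men:.0%}   women {women:.0%}")

pooled = {}
for g in ("men", "women"):
    applied = sum(data[d][g][0] for d in data)
    admitted = sum(data[d][g][1] for d in data)
    pooled[g] = rate(applied, admitted)

print(f"pooled   :  men {pooled['men']:.0%}   women {pooled['women']:.0%}")
```

The pooled gap comes entirely from where each group applied: most of the women in this toy example applied to the department that rejects most applicants.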

INSTRUMENT S3.2 — SIMPSON'S PARADOX VISUALIZER · GROUP TREND vs POOLED TREND · WATCH IT FLIP

Controls: GROUP SEPARATION (shown at 1.6), VIEW (pooled or by group). Readouts: POOLED r (ALL POINTS), WITHIN-GROUP r (AVG), VERDICT.

Each colour is one subgroup with a clear downward trend. Push GROUP SEPARATION up: the groups stagger diagonally so the pooled cloud trends upward even though every group trends down. Toggle POOLED vs BY GROUP to see the grey overall fit fight the coloured within-group fits. The verdict flags when the signs disagree — that is the paradox.
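A similar picture can be generated offline. This is a sketch under assumed settings, not the instrument's own code: three synthetic groups, each with a downward slope, are staggered diagonally by a `separation` value that only loosely mirrors the GROUP SEPARATION control.

```python
import numpy as np

rng = np.random.default_rng(6)
separation = 1.6          # distance between group centres (assumed mapping)
within_slope = -0.5       # every group trends downward

xs, ys, labels = [], [], []
for g in range(3):
    centre = g * separation                       # groups stagger up and to the right
    xg = centre + rng.normal(scale=0.4, size=200)
    yg = centre + within_slope * (xg - centre) + rng.normal(scale=0.15, size=200)
    xs.append(xg)
    ys.append(yg)
    labels.append(np.full(200, g))

x, y, group = np.concatenate(xs), np.concatenate(ys), np.concatenate(labels)

pooled_r = float(np.corrcoef(x, y)[0, 1])
within_r = [float(np.corrcoef(x[group == g], y[group == g])[0, 1]) for g in range(3)]

print(f"pooled r       : {pooled_r:+.3f}")
print("within-group r :", ", ".join(f"{r:+.3f}" for r in within_r))
print("verdict        :", "PARADOX" if pooled_r * np.mean(within_r) < 0 else "consistent")
```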

3.5

Causal thinking: DAGs, backdoor paths, the do-operator

If more data cannot turn correlation into causation, what can? A causal model — an explicit, falsifiable claim about which variable affects which. Judea Pearl's framework, the standard since the 2000s, draws these claims as a directed acyclic graph (DAG): nodes are variables, arrows are direct causal effects, "acyclic" means no variable causes itself through a loop.

FIG S3.1 · THREE STRUCTURES · ONLY THE FORK CONFOUNDS

The same three variables, three different DAGs. The fork is the confounder — to estimate X→Y you must adjust for Z. The chain has Z as a mediator (adjusting for it would erase the very effect you want), and the collider creates a spurious link only if you control Z. Whether to control a variable depends entirely on the graph, not on the data.

This figure carries the central, counter-intuitive lesson of causal inference: "control for everything" is wrong. The arrows decide. A fork ($X \leftarrow Z \rightarrow Y$) is the confounder you must adjust for. A chain ($X \rightarrow Z \rightarrow Y$) makes $Z$ a mediator — adjust for it and you delete part of the real effect. A collider ($X \rightarrow Z \leftarrow Y$) is the trap: $X$ and $Y$ are independent until you condition on $Z$, which opens a fake association (this is collider / selection bias).
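The three structures behave differently under adjustment, and a simulation shows it. The sketch below assumes linear relations with unit-variance Gaussian noise, and it stands in for "controlling for Z" with a hypothetical `partial_corr` helper that regresses $Z$ out of both variables, which is adequate only in this linear setting.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

def corr(u, v):
    return float(np.corrcoef(u, v)[0, 1])

def partial_corr(u, v, w):
    """Correlation of u and v after linearly regressing w out of both."""
    ru = u - np.polyval(np.polyfit(w, u, 1), w)
    rv = v - np.polyval(np.polyfit(w, v, 1), w)
    return corr(ru, rv)

# Fork: X <- Z -> Y (Z confounds; adjusting removes the spurious link)
z = rng.normal(size=n); x = z + rng.normal(size=n); y = z + rng.normal(size=n)
fork = (corr(x, y), partial_corr(x, y, z))

# Chain: X -> Z -> Y (Z mediates; adjusting erases the real effect)
x = rng.normal(size=n); z = x + rng.normal(size=n); y = z + rng.normal(size=n)
chain = (corr(x, y), partial_corr(x, y, z))

# Collider: X -> Z <- Y (X, Y independent; adjusting creates a fake link)
x = rng.normal(size=n); y = rng.normal(size=n); z = x + y + rng.normal(size=n)
collider = (corr(x, y), partial_corr(x, y, z))

for name, (raw, adjusted) in [("fork", fork), ("chain", chain), ("collider", collider)]:
    print(f"{name:9s} raw corr {raw:+.3f}   controlling for Z {adjusted:+.3f}")
```

Controlling for $Z$ helps in the fork, hides a genuine effect in the chain, and manufactures an association in the collider, so the same operation is right, wrong, and harmful depending on the graph.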

A backdoor path is any non-causal route from $X$ to $Y$ that starts with an arrow into $X$: exactly the channel through which confounding leaks. The backdoor criterion says: to read the true causal effect of $X$ on $Y$, find a set of variables that blocks every backdoor path, contains no descendant of $X$, and does not open a collider, then adjust for them. Do that, and observational data yields a causal answer, provided the DAG itself is right.

EQ S3.7 — THE do-OPERATOR & BACKDOOR ADJUSTMENT $$ P\big(Y \mid \mathrm{do}(X = x)\big) \;=\; \sum_{z} P\big(Y \mid X = x,\, Z = z\big)\, P(Z = z) $$

$\mathrm{do}(X=x)$ means intervene — reach in and set $X$ to $x$, severing the arrows that normally point into $X$ — as opposed to merely observing $X = x$, which is $P(Y \mid X=x)$. The two are equal only when nothing confounds $X$ and $Y$. When a sufficient adjustment set $Z$ blocks the backdoors, this formula recovers the interventional distribution from purely observational data — the bridge from correlation to causation. A randomized controlled trial physically performs the $\mathrm{do}$: randomizing $X$ deletes every arrow into it, which is why an RCT needs no DAG to be valid. When you cannot randomize, the DAG plus EQ S3.7 is the next best thing.
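EQ S3.7 can be exercised on a toy world with binary variables. Everything below is an assumed setup for illustration: the graph $Z \to X$, $Z \to Y$, $X \to Y$, the probabilities, and the helper names `draw_y`, `naive`, and `backdoor` are not from the text. The true effect of $X$ on $P(Y=1)$ is built in as $+0.2$.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000

def draw_y(x, z, rng):
    # Structural equation for Y: the true effect of X on P(Y=1) is +0.2
    p = 0.1 + 0.2 * x + 0.5 * z
    return (rng.random(len(x)) < p).astype(int)

z = (rng.random(n) < 0.5).astype(int)
x = (rng.random(n) < np.where(z == 1, 0.8, 0.2)).astype(int)   # Z pushes X up
y = draw_y(x, z, rng)

def naive(x_val):
    # P(Y=1 | X=x): plain conditioning on the observed data
    return y[x == x_val].mean()

def backdoor(x_val):
    # EQ S3.7: sum over z of P(Y=1 | X=x, Z=z) * P(Z=z)
    total = 0.0
    for z_val in (0, 1):
        total += y[(x == x_val) & (z == z_val)].mean() * (z == z_val).mean()
    return total

naive_effect = naive(1) - naive(0)
adjusted_effect = backdoor(1) - backdoor(0)

# Ground truth by simulation: set X by hand for everyone, keep Z as it was
true_effect = (draw_y(np.ones(n, int), z, rng).mean()
               - draw_y(np.zeros(n, int), z, rng).mean())

print(f"naive    P(Y|X=1) - P(Y|X=0)         : {naive_effect:+.3f}")
print(f"backdoor P(Y|do(X=1)) - P(Y|do(X=0)) : {adjusted_effect:+.3f}")
print(f"simulated intervention               : {true_effect:+.3f}")
```

Plain conditioning overstates the effect because $Z$ inflates both $X$ and $Y$; the adjusted estimate should land close to the simulated intervention.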

INSTRUMENT S3.3 — CONFOUNDER TOY · TOGGLE Z → SPURIOUS r APPEARS · CONDITION → IT VANISHES

Controls: CONFOUNDER STRENGTH (a=b, shown at 1.0), ANALYSIS. Readouts: OBSERVED corr(X,Y), TRUE X→Y EFFECT (fixed at 0.00), STATUS.

The data-generating truth is fixed: $X \leftarrow Z \rightarrow Y$, with no direct $X \to Y$ arrow. Raise CONFOUNDER STRENGTH and the naive scatter sprouts a strong upward correlation out of nothing: pure backdoor leakage through $Z$. Now switch to CONDITION ON Z (the plot shows one narrow slice of $Z$): the spurious correlation collapses toward 0, as the structure behind EQ S3.6 implies, because with $Z$ held fixed only the independent noise terms $\varepsilon_X$ and $\varepsilon_Y$ remain. This is backdoor adjustment, by hand.

You can now describe data and reason about what does — and does not — cause what. The missing piece is uncertainty: every $r$, every mean, every effect estimate is computed from a finite sample and could be a fluke. Chapter 04 — Inference & Testing — builds the machinery to ask "could this have happened by chance?": sampling distributions, confidence intervals, p-values, and the hypothesis tests that decide when a correlation is real enough to act on.

3.R

References

Pearson, K. (1895). Note on Regression and Inheritance in the Case of Two Parents. Proceedings of the Royal Society of London 58 — the product-moment correlation coefficient (EQ S3.3).
Spearman, C. (1904). The Proof and Measurement of Association between Two Things. American Journal of Psychology 15(1) — rank correlation, EQ S3.4.
Kendall, M. G. (1938). A New Measure of Rank Correlation. Biometrika 30(1–2) — Kendall's τ, EQ S3.5.
Bickel, P. J., Hammel, E. A. & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187(4175) — the canonical Simpson's paradox case study.
Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society B 13(2) — the paradox's namesake paper.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press — DAGs, the do-operator and the backdoor criterion (EQ S3.7).
Pearl, J. (1995). Causal Diagrams for Empirical Research. Biometrika 82(4) — the foundational presentation of the backdoor criterion.