Estimators: bias, variance, consistency, MLE
An estimator is a recipe that turns data into a guess about a parameter — the sample mean \(\bar{X}\) estimating the population mean \(\mu\), the sample variance \(S^2\) estimating \(\sigma^2\). Because the data are random, the estimator is random too: it has its own distribution (§4.2). Two numbers summarize how good it is. Bias is how far its average lands from the truth; variance is how much it bounces from sample to sample.
Why does the sample variance divide by \(n-1\) instead of \(n\)? Because the deviations are measured from \(\bar{X}\), which was itself fit to the data and therefore sits closer to the points than the true \(\mu\) does. Dividing by \(n\) would systematically underestimate \(\sigma^2\); the correction \(n-1\) — one degree of freedom spent estimating the mean — makes \(S^2\) exactly unbiased. Bessel's correction is the cleanest example of a bias fix you can see in arithmetic.
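The underestimate can be checked numerically. The sketch below draws many small samples and averages the two versions of the variance; the true variance of 4, the sample size of 5, and the number of repetitions are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0            # true variance (assumed for this demo)
n, reps = 5, 200_000    # a small n makes the bias easy to see

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
var_n = x.var(axis=1, ddof=0)     # divide by n
var_n1 = x.var(axis=1, ddof=1)    # divide by n - 1 (Bessel's correction)

print(f"true variance            : {sigma2:.3f}")
print(f"average of divide-by-n   : {var_n.mean():.3f}  (theory: {(n - 1) / n * sigma2:.3f})")
print(f"average of divide-by-n-1 : {var_n1.mean():.3f}")
```

The divide-by-\(n\) average should settle near \(\frac{n-1}{n}\sigma^2\), while the corrected version should settle near \(\sigma^2\) itself.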
A third virtue is asymptotic. An estimator is consistent if it converges in probability to the truth as the sample grows: \(\hat{\theta}_n \xrightarrow{p} \theta\). The sample mean is consistent because its variance \(\sigma^2/n\) shrinks to zero — the weak law of large numbers in one line. Consistency is a floor, not a ceiling: it promises you get there eventually, but says nothing about the rate.
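Convergence in probability can be watched directly: fix a tolerance and count how often the sample mean misses the truth by more than that. This is a minimal sketch; the exponential parent, the tolerance `eps`, and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, eps, reps = 1.0, 0.1, 1_000      # illustrative values, not from the text
sizes = (10, 100, 1_000, 4_000)

miss_rates = []
for n in sizes:
    # reps independent samples of size n, reduced to reps sample means
    means = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    # empirical estimate of P(|mean - mu| > eps)
    miss = float(np.mean(np.abs(means - mu) > eps))
    miss_rates.append(miss)
    print(f"n = {n:5d}   share of samples off by more than {eps}: {miss:.3f}")
```

The share of misses should fall toward zero as \(n\) grows, which is what \(\hat{\theta}_n \xrightarrow{p} \theta\) asserts for any fixed tolerance.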
Maximum likelihood
The dominant general recipe for building estimators is maximum likelihood: choose the parameter value that makes the observed data most probable. Treating the joint density as a function of \(\theta\) (the data now fixed) gives the likelihood; maximizing its logarithm — a sum, far friendlier than a product — gives the MLE.
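As a concrete instance of the recipe, consider an exponential model with unknown rate, where the log-likelihood is \(n\log\lambda - \lambda\sum_i x_i\) and calculus gives the closed-form maximizer \(1/\bar{x}\). The sketch below finds the maximum by brute-force grid search and compares it with that formula. The model, the true rate of 2, and the grid are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
true_rate = 2.0                                   # hypothetical parameter
x = rng.exponential(1.0 / true_rate, size=500)    # numpy takes scale = 1/rate

def log_likelihood(rate, data):
    # Sum over observations of log(rate * exp(-rate * x)).
    return data.size * np.log(rate) - rate * data.sum()

grid = np.linspace(0.5, 5.0, 4501)                # candidate rates, step 0.001
ll = log_likelihood(grid, x)                      # data fixed, parameter varies

rate_grid = grid[np.argmax(ll)]                   # numerical maximizer
rate_closed = 1.0 / x.mean()                      # closed-form MLE

print(f"grid-search MLE  : {rate_grid:.3f}")
print(f"closed-form MLE  : {rate_closed:.3f}")
print(f"true rate        : {true_rate:.3f}")
```

The two estimates should agree to within the grid spacing; neither equals the true rate exactly, because the MLE is itself a random quantity.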
Cross-reference: minimizing cross-entropy loss (ML · CH 03) is maximizing a likelihood, and the squared-error loss of linear regression (ML · CH 02) is the Gaussian MLE. The optimizers of machine learning are likelihood maximizers wearing different clothes.
Sampling distributions & confidence intervals
The single most important object in inference is invisible in any one dataset: the sampling distribution — the distribution of an estimator across the many samples you could have drawn but didn't. Its spread is the standard error, and it is what converts a point estimate into an honest range.
Why is the sampling distribution of a mean so often bell-shaped, whatever the data look like? The Central Limit Theorem: the standardized sum of i.i.d. variables with finite variance converges in distribution to a standard normal, regardless of the parent shape.
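The theorem is easy to see in simulation. The sketch below starts from a strongly skewed parent (an exponential, chosen here only as an example) and tracks two diagnostics of the standardized mean as \(n\) grows: its skewness, and the share of values inside \(\pm 1.96\).

```python
import numpy as np

rng = np.random.default_rng(4)
reps = 20_000
skews, inside = [], []

for n in (1, 5, 30, 200):
    x = rng.exponential(1.0, size=(reps, n))     # parent: mean 1, sd 1, skewed
    z = (x.mean(axis=1) - 1.0) * np.sqrt(n)      # standardized sample mean
    skews.append(float(np.mean(z ** 3)))         # skewness (z has mean 0, sd 1)
    inside.append(float(np.mean(np.abs(z) < 1.96)))
    print(f"n = {n:3d}   skewness = {skews[-1]:5.2f}   share within +/-1.96 = {inside[-1]:.3f}")
```

For a standard normal the skewness is 0 and the share within \(\pm 1.96\) is 0.95; the simulated values should drift toward both as \(n\) increases.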
Confidence intervals
A confidence interval wraps the standard error around the estimate. For a mean with known \(\sigma\), the 95% interval is the estimate plus or minus \(1.96\) standard errors:
\[ \bar{X} \pm 1.96\,\frac{\sigma}{\sqrt{n}} \]
"There is a 95% chance the true mean lies in [a, b]." This sentence is false under the frequentist definition: \(\mu\) is a fixed number, and the interval is the random thing. The correct statement is about the long-run coverage of the method. The probability-about-this-interval reading is exactly what Bayesian credible intervals deliver instead (STATS · CH 05) — which is one reason the two schools talk past each other.
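The long-run coverage reading can be made concrete by repeating the experiment many times and counting how often the interval traps the fixed \(\mu\). This is a minimal sketch under the known-\(\sigma\) setup; the particular values of `mu`, `sigma`, and `n` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, reps = 50.0, 10.0, 25, 10_000   # illustrative values

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)                          # one estimate per experiment
half = 1.96 * sigma / np.sqrt(n)               # half-width: 1.96 standard errors

# mu never moves; the interval [xbar - half, xbar + half] does.
covered = (xbar - half <= mu) & (mu <= xbar + half)
coverage = float(covered.mean())

print(f"intervals containing mu: {coverage:.3f}  (nominal 0.95)")
```

Any single interval either contains \(\mu\) or does not; the 95% describes the procedure across repetitions.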
Hypothesis testing: null, p-value, errors, power
A hypothesis test is a formal courtroom for a claim. You state a null hypothesis \(H_0\) — the boring default, "no effect" — and ask: if \(H_0\) were true, how surprising is data at least this extreme? That surprise, measured in probability, is the p-value.
The deepest fact about the p-value is also the least intuitive: when \(H_0\) is exactly true and the test statistic is continuous, the p-value is uniformly distributed on \([0,1]\). Every value is equally likely. That flatness is not an accident: it is the definition of a calibrated test, and it is why a threshold \(\alpha = 0.05\) yields a 5% false-positive rate. The Python cell below demonstrates this directly by simulating ten thousand null experiments.
| | \(H_0\) true (no effect) | \(H_0\) false (real effect) |
|---|---|---|
| Reject \(H_0\) | Type I error · prob \(\alpha\) | correct detection · prob \(1-\beta\) (power) |
| Fail to reject | correct · prob \(1-\alpha\) | Type II error · prob \(\beta\) |
A test does not tell you whether the effect is real; it controls the rate at which you cry wolf. Statistical significance is not practical importance: with a large enough \(n\), a trivially small, useless effect becomes "significant," because significance measures only whether the effect is distinguishable from zero, not whether it is big enough to care about. Always report the effect size and a confidence interval alongside the p-value.
# 10,000 experiments under H0 (no real effect): the p-value is UNIFORM
import numpy as np
rng = np.random.default_rng(0)
def norm_cdf(x):  # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t
                - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)
M, n = 10000, 40
A = rng.normal(0, 1, (M, n))  # both groups drawn from the SAME world
B = rng.normal(0, 1, (M, n))  # H0 is TRUE by construction
se = np.sqrt(A.var(1, ddof=1)/n + B.var(1, ddof=1)/n)
t = (A.mean(1) - B.mean(1)) / se
p = 2.0 * (1.0 - norm_cdf(np.abs(t)))  # 10,000 two-sided p-values
edges = np.linspace(0, 1, 11)
counts, _ = np.histogram(p, bins=edges)
print("p-value histogram (10 equal bins, ~1000 expected each):")
for i in range(10):
    print(f" [{edges[i]:.1f},{edges[i+1]:.1f}) {counts[i]:5d} " + "#"*int(counts[i]/25))
print(f"\nfalse positives (p < 0.05): {int((p < 0.05).sum())} (expect ~{int(0.05*M)})")
print("Under H0 the p-value is uniform -- that flat shape IS a calibrated test,")
print("and is exactly why alpha = 0.05 buys a 5% false-positive rate.")
# plot_xy is not defined in this cell; it is assumed to be a plotting helper
# provided by the surrounding environment.
plot_xy((edges[:-1] + edges[1:]) / 2, counts)
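The gap between statistical significance and practical importance, noted above, comes down to arithmetic. The sketch below holds a tiny effect fixed and lets only the sample size change. It assumes two groups of equal size with known unit variance, and it reports the p-value you would get if the observed difference came out exactly equal to the true one; the effect of 0.02 standard deviations is a hypothetical value.

```python
import math

d = 0.02   # difference between groups, in standard-deviation units (hypothetical)
p_values = {}

for n in (100, 10_000, 100_000, 1_000_000):
    z = d * math.sqrt(n / 2)             # difference / SE, where SE = sqrt(2 / n)
    p = math.erfc(z / math.sqrt(2))      # two-sided normal p-value, 2 * (1 - Phi(z))
    p_values[n] = p
    print(f"n per group = {n:>9,d}   z = {z:6.2f}   p = {p:.2e}")
```

The effect never changes, yet the p-value moves from unremarkable to overwhelming purely because the standard error shrinks like \(1/\sqrt{n}\).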
t-tests: comparing means when σ is unknown
In practice you almost never know the population \(\sigma\) — you estimate it with \(S\), and that estimate is itself noisy, especially at small \(n\). William Gosset, brewing statistics at Guinness under the pen name "Student," worked out the exact distribution of the resulting ratio. The fix is to use a heavier-tailed reference curve, the \(t\) distribution, in place of the normal.
Three flavors cover most uses. The one-sample test (EQ S4.9) compares a mean to a fixed value. The paired test applies the one-sample test to within-subject differences (before/after, left/right) and is far more powerful when it applies, because it cancels per-subject variation. The two-sample test compares two independent groups:
\[ t = \frac{\bar{X}_a - \bar{X}_b}{\sqrt{S_a^2/n_a + S_b^2/n_b}} \]
Assumptions, honestly stated: the \(t\)-test wants roughly normal data (or large \(n\), via the CLT) and independent observations. It is robust to mild non-normality but not to dependence or to extreme outliers, which inflate \(S\) and quietly kill power. For badly skewed or ordinal data, a rank-based test (Mann–Whitney, Wilcoxon) trades a little power for not caring about the distribution's shape.
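The power advantage of the paired test described above can be seen by analyzing the same before/after data both ways. In this sketch each subject has a large personal baseline (standard deviation 10) and small measurement noise (standard deviation 1), with a true shift of 1; all of these numbers are assumptions chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(7)
reps, n = 2_000, 20                                   # 2,000 studies of 20 subjects

subject = rng.normal(0.0, 10.0, size=(reps, n))       # per-subject baseline
before = subject + rng.normal(0.0, 1.0, size=(reps, n))
after = subject + 1.0 + rng.normal(0.0, 1.0, size=(reps, n))   # true shift = 1

# Paired analysis: a one-sample test on the within-subject differences.
d = after - before
se_paired = d.std(axis=1, ddof=1) / np.sqrt(n)
t_paired = d.mean(axis=1) / se_paired

# Unpaired analysis: treats the two columns as independent groups.
se_unpaired = np.sqrt(after.var(axis=1, ddof=1) / n + before.var(axis=1, ddof=1) / n)
t_unpaired = (after.mean(axis=1) - before.mean(axis=1)) / se_unpaired

print(f"median SE, paired   : {np.median(se_paired):.2f}")
print(f"median SE, unpaired : {np.median(se_unpaired):.2f}")
print(f"median t,  paired   : {np.median(t_paired):.2f}")
print(f"median t,  unpaired : {np.median(t_unpaired):.2f}")
```

The numerators are identical, since the mean of the differences equals the difference of the means. Only the standard error changes, because differencing removes the baseline term from the noise.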
# Two-sample Welch t-test from scratch: t statistic + normal-approx p
import numpy as np
rng = np.random.default_rng(2)
a = rng.normal(100, 15, 30)  # control: true mean 100
b = rng.normal(106, 15, 30)  # treatment: true mean 106 (effect = 6)
def norm_cdf(x):  # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t
                - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)
nx, ny = len(a), len(b)
se = np.sqrt(a.var(ddof=1)/nx + b.var(ddof=1)/ny)  # Welch SE of the difference
t = (a.mean() - b.mean()) / se
p = 2.0 * (1.0 - norm_cdf(abs(t)))  # two-sided, normal approximation
print(f"mean control = {a.mean():6.2f} mean treatment = {b.mean():6.2f}")
print(f"difference = {a.mean()-b.mean():+.2f} standard error = {se:.2f}")
print(f"t statistic = {t:.3f}")
print(f"two-sided p = {float(p):.4f} (normal approx; df > ~30 makes it tight)")
print("reject H0 at alpha = 0.05?", bool(p < 0.05))
print("\nThe true effect was 6. Re-run with effect = 0 (set b mean to 100)")
print("and the p-value scatters uniformly -- exactly the null demo above.")
ANOVA: partitioning variance across groups
To compare three or more group means, running a \(t\)-test on every pair is a trap — it multiplies the false-positive rate (the very problem of §4.6). The Analysis of Variance sidesteps it with one omnibus test built from a beautiful identity: total variation decomposes exactly into variation between groups and variation within them.
Sums of squares are not directly comparable — \(SS_{\text{between}}\) is built from \(k\) group means, \(SS_{\text{within}}\) from \(N\) observations. Dividing each by its degrees of freedom gives mean squares, and their ratio is the test statistic. Under \(H_0\) (all group means equal), both mean squares estimate the same noise variance, so their ratio sits near 1; a real difference inflates the numerator.
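The decomposition and the resulting ratio can be computed by hand in a few lines. This is a sketch on made-up data: three groups of 12, two sharing a mean and one shifted. The degrees of freedom used are the standard \(k-1\) between groups and \(N-k\) within.

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical data: group means 50, 50, 56 with common sd 5.
groups = [rng.normal(m, 5.0, size=12) for m in (50.0, 50.0, 56.0)]

all_x = np.concatenate(groups)
grand = all_x.mean()
k, N = len(groups), all_x.size

ss_between = sum(g.size * (g.mean() - grand) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ((all_x - grand) ** 2).sum()

ms_between = ss_between / (k - 1)    # mean square between groups
ms_within = ss_within / (N - k)      # mean square within groups
F = ms_between / ms_within           # the test statistic

print(f"SS total            = {ss_total:.2f}")
print(f"SS between + within = {ss_between + ss_within:.2f}")
print(f"F = {F:.2f}  on ({k - 1}, {N - k}) degrees of freedom")
```

The first two printed lines agree to floating-point precision, which is the identity at work. Setting all three group means equal and re-running should pull \(F\) back toward 1 on average.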
Multiple comparisons & the replication crisis
Here is the dark side of the p-value, and the reason this chapter ends in a cautionary tale. A 5% false-positive rate per test compounds ruthlessly across many tests. Run twenty independent null tests and the probability that at least one hits "significance" by chance is not 5% — it is \(1 - 0.95^{20} \approx 64\%\).
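The compounding is quick to verify, both from the formula and by simulation. The sketch below relies on the fact that null p-values are uniform, so a family of twenty null tests is simply twenty uniform draws; the number of simulated families is an arbitrary choice.

```python
import numpy as np

alpha = 0.05
for m in (1, 5, 20, 100):
    print(f"{m:3d} tests -> P(at least one false positive) = {1 - (1 - alpha) ** m:.3f}")

rng = np.random.default_rng(9)
p = rng.uniform(size=(10_000, 20))                  # 10,000 families of 20 null p-values
fwer_formula = 1 - (1 - alpha) ** 20
fwer_sim = float(np.mean(p.min(axis=1) < alpha))    # family "wins" if any p < alpha

print(f"\n20 tests, formula   : {fwer_formula:.3f}")
print(f"20 tests, simulation: {fwer_sim:.3f}")
```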
The same arithmetic, weaponized by flexibility. If you try many outcome variables, many subgroups, many covariate combinations, or peek at the data and stop when \(p < 0.05\), you are running dozens of hidden tests and reporting only the winner. This is p-hacking, and it manufactures significance from pure noise. The "garden of forking paths" makes it possible without any conscious cheating — every undocumented analytic choice is a degree of freedom that inflates the real \(\alpha\).
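Peeking is the easiest of these hidden-test mechanisms to simulate. In the sketch below there is no effect at all, and the analyst checks a z-test after every 10 observations up to 200, stopping at the first "significant" look. The known unit variance, the look schedule, and the cutoff \(|z| > 1.96\) (equivalent to \(p < 0.05\)) are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(10)
reps, n_max, step = 5_000, 200, 10

x = rng.normal(0.0, 1.0, size=(reps, n_max))     # H0 true: the mean really is 0
csum = np.cumsum(x, axis=1)

looks = np.arange(step, n_max + 1, step)         # sample size at each peek: 10, 20, ..., 200
z = csum[:, looks - 1] / np.sqrt(looks)          # z statistic at each peek (sigma = 1 known)
sig = np.abs(z) > 1.96                           # "significant" at the 5% level

fixed_rate = float(sig[:, -1].mean())            # one test, at n = 200 only
peeking_rate = float(sig.any(axis=1).mean())     # stop at the first significant peek

print(f"false-positive rate, single test at n = {n_max}: {fixed_rate:.3f}")
print(f"false-positive rate, peeking {looks.size} times    : {peeking_rate:.3f}")
```

The single planned test should hold its nominal rate, while the stop-when-significant rule should reject several times as often on the same noise.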
This is not academic hygiene; it broke a field's confidence in itself. Beginning in the 2010s, large replication efforts found that a substantial share of published findings — in psychology, parts of biomedicine, and beyond — failed to reproduce. The diagnosis pointed straight at the machinery of this chapter: chronic underpowering (§4.3), undisclosed multiple comparisons (above), publication bias toward "significant" results, and the cult of the \(p < 0.05\) threshold. John Ioannidis's 2005 argument — that most published research findings are false — followed from a few lines of conditional probability: when power is low, priors are low, and bias and multiplicity are high, a "significant" result is more likely false than true.
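The conditional-probability step can be written out in a few lines. The sketch below is a simplified version that ignores bias and multiplicity and keeps only power, the prior share of true hypotheses, and \(\alpha\); the function name `ppv` and the input values are illustrative, not taken from the paper.

```python
def ppv(power, prior, alpha=0.05):
    """Probability that a 'significant' finding is true (Bayes' rule).

    power: P(significant | effect is real)
    prior: share of tested hypotheses that are actually true
    alpha: P(significant | no effect)
    """
    true_pos = power * prior
    false_pos = alpha * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

well_powered = ppv(power=0.8, prior=0.5)    # strong study, plausible hypothesis
underpowered = ppv(power=0.2, prior=0.1)    # weak study, long-shot hypothesis

print(f"power 0.8, prior 0.5 -> P(true | significant) = {well_powered:.2f}")
print(f"power 0.2, prior 0.1 -> P(true | significant) = {underpowered:.2f}")
```

In the second scenario fewer than half of the significant results are true, even before any bias or undisclosed multiplicity is added.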
The reforms are real but not settled. Pre-registration, larger samples, reporting effect sizes with intervals, and sharing data are now mainstream and demonstrably help. Beyond that, consensus frays: some argue for lowering the threshold to \(p < 0.005\), some for abandoning fixed thresholds entirely, some for replacing significance testing with Bayesian model comparison (STATS · CH 05) or estimation-with-intervals. The honest summary in 2026: the p-value is a useful, badly abused tool, and "statistical significance" should be read as the start of an argument, never the end of one.
Frequentist inference controls error rates but cannot say "how probable is my hypothesis?" — only a Bayesian can. STATS · CH 05 turns the question around: instead of asking how surprising the data are under a fixed null, it puts a probability distribution on the parameter itself, updates it with Bayes' rule, and reads off credible intervals that mean exactly what the misread confidence interval of §4.2 was supposed to.
References
- Student [Gosset, W. S.] (1908). The Probable Error of a Mean.
- Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses.
- Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False.
- Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science.
- Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate.
- Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose.
- Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved.