Statistical Inference & Hypothesis Testing

4.1

Estimators: bias, variance, consistency, MLE

An estimator is a recipe that turns data into a guess about a parameter — the sample mean $\bar{X}$ estimating the population mean $\mu$, the sample variance $S^2$ estimating $\sigma^2$. Because the data are random, the estimator is random too: it has its own distribution (§4.2). Two numbers summarize how good it is. Bias is how far its average lands from the truth; variance is how much it bounces from sample to sample.

EQ S4.1 — BIAS, VARIANCE, AND MSE $$ \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta, \qquad \operatorname{MSE}(\hat{\theta}) = \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2 $$

The expected squared error of any estimator splits cleanly into variance plus bias-squared — the same decomposition that governs model generalization (ML · CH 06). An estimator can be unbiased yet useless if its variance is huge, or slightly biased yet excellent if that buys a large drop in variance. "Unbiased" is a property worth wanting but never worth worshipping.

Why does the sample variance divide by $n-1$ instead of $n$? Because the deviations are measured from $\bar{X}$, which was itself fit to the data and therefore sits closer to the points than the true $\mu$ does. Dividing by $n$ would systematically underestimate $\sigma^2$; the correction $n-1$ — one degree of freedom spent estimating the mean — makes $S^2$ exactly unbiased. Bessel's correction is the cleanest example of a bias fix you can see in arithmetic.

EQ S4.2 — UNBIASED SAMPLE VARIANCE $$ S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \qquad \mathbb{E}[S^2] = \sigma^2 $$

The $n-1$ is the first degree of freedom you will meet — a count of independent pieces of information left after estimating the mean. Degrees of freedom reappear in every test in this chapter: the $t$ distribution's shape, the $\chi^2$, and the two $df$ of an $F$-ratio (§4.5) are all bookkeeping of how many free deviations remain.

A third virtue is asymptotic. An estimator is consistent if it converges in probability to the truth as the sample grows: $\hat{\theta}_n \xrightarrow{p} \theta$. The sample mean is consistent because its variance $\sigma^2/n$ shrinks to zero — the weak law of large numbers in one line. Consistency is a floor, not a ceiling: it promises you get there eventually, but says nothing about the rate.

Maximum likelihood

The dominant general recipe for building estimators is maximum likelihood: choose the parameter value that makes the observed data most probable. Treating the joint density as a function of $\theta$ (the data now fixed) gives the likelihood; maximizing its logarithm — a sum, far friendlier than a product — gives the MLE.

EQ S4.3 — MAXIMUM LIKELIHOOD $$ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\; \ell(\theta), \qquad \ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta) $$

For an i.i.d. Gaussian sample, solving $\partial\ell/\partial\theta = 0$ hands back the sample mean for $\mu$ and the $\tfrac{1}{n}$ (not $\tfrac{1}{n-1}$) variance — so the Gaussian MLE of $\sigma^2$ is slightly biased downward, the cleanest case where ML trades a little bias for the method's generality. MLEs are consistent, asymptotically unbiased, and asymptotically normal with the smallest possible variance (the Cramér–Rao bound), which is why they sit under logistic regression, GLMs, and most of deep learning's loss functions.

Cross-reference: minimizing cross-entropy loss (ML · CH 03) is maximizing a likelihood, and the squared-error loss of linear regression (ML · CH 02) is the Gaussian MLE. The optimizers of machine learning are likelihood maximizers wearing different clothes.

A population has variance $ \sigma^2 = 25 $. You draw $ n = 10 $ observations and average them. Since $\bar{X}$ is unbiased, its MSE equals its variance (EQ S4.1). What is that MSE, $ \sigma^2/n $?

For an unbiased estimator the bias term vanishes, so $ \operatorname{MSE}(\bar{X}) = \operatorname{Var}(\bar{X}) = \sigma^2/n = 25/10 = $ 2.5. Averaging ten draws cuts the error variance by a factor of ten — the variance side of EQ S4.1 doing all the work.

4.2

Sampling distributions & confidence intervals

The single most important object in inference is invisible in any one dataset: the sampling distribution — the distribution of an estimator across the many samples you could have drawn but didn't. Its spread is the standard error, and it is what converts a point estimate into an honest range.

EQ S4.4 — STANDARD ERROR OF THE MEAN $$ \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} \quad\Longrightarrow\quad \operatorname{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} $$

The estimate gets sharper as $1/\sqrt{n}$ — the iron law of statistics. To halve your error you must quadruple your data, which is why the marginal value of more samples falls off and why "big enough" arrives sooner than intuition expects. The $\sqrt{n}$ is the lever every power calculation (§4.3) pulls.

Why is the sampling distribution of a mean so often bell-shaped, whatever the data look like? The Central Limit Theorem: the standardized sum of i.i.d. variables with finite variance converges to a standard normal, regardless of the parent shape.

EQ S4.5 — CENTRAL LIMIT THEOREM $$ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1) \quad \text{as } n \to \infty $$

The CLT is why the normal distribution is the lingua franca of inference even for skewed or discrete data: averages forget their parent. The caveats experts insist on: it needs finite variance (it fails for heavy-tailed laws like the Cauchy), the approximation is poor in the tails at small $n$, and strongly skewed parents need larger $n$ before the bell sets in — the folk rule "$n \ge 30$" is a rough heuristic, not a theorem.

Confidence intervals

A confidence interval wraps the standard error around the estimate. For a mean with known $\sigma$, the 95% interval is the estimate plus or minus $1.96$ standard errors:

EQ S4.6 — CONFIDENCE INTERVAL FOR A MEAN $$ \bar{X} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \qquad z_{0.975} = 1.96 \quad (\text{use } t_{n-1} \text{ when } \sigma \text{ is estimated}) $$

The interpretation is subtle and routinely mangled: the 95% refers to the procedure, not to any single interval. If you repeated the whole experiment forever, 95% of the intervals you build this way would contain the true $\mu$. A given interval either covers the truth or it doesn't — there is no "95% probability" attached to it under the frequentist reading. The instrument below makes this concrete: watch the intervals dance and roughly one in twenty miss.

A COMMON ERROR

"There is a 95% chance the true mean lies in [a, b]." This sentence is false under the frequentist definition: $\mu$ is a fixed number, and the interval is the random thing. The correct statement is about the long-run coverage of the method. The probability-about-this-interval reading is exactly what Bayesian credible intervals deliver instead (STATS · CH 05) — which is one reason the two schools talk past each other.

INSTRUMENT S4.1 — CONFIDENCE-INTERVAL COVERAGEREPEAT THE EXPERIMENT · ~95% COVER THE TRUTH · EQ S4.6

CONFIDENCE LEVEL 95%

SAMPLE SIZE n 25

INTERVALS DRAWN

—

COVERED THE TRUTH

—

EMPIRICAL COVERAGE

—

Each horizontal bar is one experiment's 95% interval; the dashed line is the true mean $\mu = 0$. Mint intervals caught it, red ones missed. Keep pressing DRAW: the empirical coverage hovers near the nominal level. Lower the confidence to 80% and watch the red bars multiply — narrower intervals miss more often. The coverage is a property of the recipe, exactly as EQ S4.6's note warns.

A measurement has population standard deviation $ \sigma = 10 $. You average $ n = 100 $ independent measurements. What is the standard error of the mean, $ \sigma/\sqrt{n} $?

$ \operatorname{SEM} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{10}{\sqrt{100}} = \dfrac{10}{10} = $ 1.0. A hundredfold averaging shrinks a spread of 10 down to a standard error of 1 — the $1/\sqrt{n}$ law in a single step.

4.3

Hypothesis testing: null, p-value, errors, power

A hypothesis test is a formal courtroom for a claim. You state a null hypothesis $H_0$ — the boring default, "no effect" — and ask: if $H_0$ were true, how surprising is data at least this extreme? That surprise, measured in probability, is the p-value.

EQ S4.7 — THE p-VALUE $$ p = \mathbb{P}\big(\,|T| \ge |t_{\text{obs}}| \;\big|\; H_0\,\big) $$

The p-value is the probability, computed under the null, of a test statistic as or more extreme than the one you observed. It is not the probability that $H_0$ is true, nor the probability your result was a fluke, nor one minus the probability the alternative is true. It answers only: "is my data weird, assuming nothing is going on?" A small p means the data sit far in the tail of the null distribution.

The deepest fact about the p-value is also the least intuitive: when $H_0$ is exactly true, the p-value is uniformly distributed on $[0,1]$. Every value is equally likely. That flatness is not an accident — it is the definition of a calibrated test, and it is why a threshold $\alpha = 0.05$ yields a 5% false-positive rate. The second Python cell below demonstrates this directly by simulating ten thousand null experiments.

EQ S4.8 — THE TWO ERRORS, AND POWER $$ \alpha = \mathbb{P}(\text{reject } H_0 \mid H_0 \text{ true}), \qquad \beta = \mathbb{P}(\text{fail to reject } H_0 \mid H_0 \text{ false}), \qquad \text{power} = 1 - \beta $$

Type I error ($\alpha$) is a false alarm — convicting an innocent null. Type II error ($\beta$) is a miss — letting a real effect walk free. Power is the probability of detecting an effect that is genuinely there. The four levers are locked together: power rises with the true effect size, with the sample size $n$ (through the $\sqrt{n}$ of EQ S4.4), and with a more lenient $\alpha$ — and falls with noisier data. An underpowered study is one designed to miss; §4.6 is the story of what happens when a whole field runs them.

	$H_0$ true (no effect)	$H_0$ false (real effect)
Reject $H_0$	Type I error · prob $\alpha$	correct detection · prob $1-\beta$ (power)
Fail to reject	correct · prob $1-\alpha$	Type II error · prob $\beta$

A test does not tell you whether the effect is real; it controls the rate at which you cry wolf. Statistical significance is not practical importance: with a large enough $n$, a trivially small, useless effect becomes "significant," because significance measures only whether the effect is distinguishable from zero, not whether it is big enough to care about. Always report the effect size and a confidence interval alongside the p-value.

INSTRUMENT S4.2 — p-VALUE & POWER SIMULATORTWO-SAMPLE z-TEST · NULL → ALTERNATIVE · EQ S4.7–S4.8

TRUE EFFECT SIZE d 0.00

SAMPLE SIZE / GROUP n 30

SIGNIFICANCE α 0.05

REGIME

—

POWER (1 − β)

—

P(p < α)

—

The histogram is the distribution of the p-value across hypothetical repetitions of the experiment. Start at effect d = 0: the bars are flat — the uniform null, with exactly an α-sized slice falling left of the threshold (Type I errors). Now crank d up: the distribution piles toward zero and the shaded region left of α swells — that growing slice is the power. Raise n to watch a small effect become detectable; the $\sqrt{n}$ of EQ S4.4 is the engine.

PYTHON · RUNNABLE IN-BROWSER

# 10,000 experiments under H0 (no real effect): the p-value is UNIFORM
import numpy as np
rng = np.random.default_rng(0)

def norm_cdf(x):                          # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t
                - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)

M, n = 10000, 40
A = rng.normal(0, 1, (M, n))             # both groups drawn from the SAME world
B = rng.normal(0, 1, (M, n))             # H0 is TRUE by construction
se = np.sqrt(A.var(1, ddof=1)/n + B.var(1, ddof=1)/n)
t  = (A.mean(1) - B.mean(1)) / se
p  = 2.0 * (1.0 - norm_cdf(np.abs(t)))    # 10,000 two-sided p-values

edges = np.linspace(0, 1, 11)
counts, _ = np.histogram(p, bins=edges)
print("p-value histogram (10 equal bins, ~1000 expected each):")
for i in range(10):
    print(f"  [{edges[i]:.1f},{edges[i+1]:.1f})  {counts[i]:5d}  " + "#"*int(counts[i]/25))
print(f"\nfalse positives (p < 0.05): {int((p < 0.05).sum())}   (expect ~{int(0.05*M)})")
print("Under H0 the p-value is uniform -- that flat shape IS a calibrated test,")
print("and is exactly why alpha = 0.05 buys a 5% false-positive rate.")
plot_xy((edges[:-1] + edges[1:]) / 2, counts)

edits are live — break it on purpose

A trial is designed with a Type II error rate of $ \beta = 0.20 $. What is its statistical power, $ 1 - \beta $?

$ \text{power} = 1 - \beta = 1 - 0.20 = $ 0.8. 80% power is the conventional design target — a one-in-five chance of missing a real effect of the size you planned for.

4.4

t-tests: comparing means when σ is unknown

In practice you almost never know the population $\sigma$ — you estimate it with $S$, and that estimate is itself noisy, especially at small $n$. William Gosset, brewing statistics at Guinness under the pen name "Student," worked out the exact distribution of the resulting ratio. The fix is to use a heavier-tailed reference curve, the $t$ distribution, in place of the normal.

EQ S4.9 — THE ONE-SAMPLE t STATISTIC $$ t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \;\sim\; t_{n-1} \quad\text{under } H_0:\, \mu = \mu_0 $$

Same shape as a z-score, but $\sigma$ is replaced by the sample $S$ — so the denominator wobbles, fattening the tails. The $t_{n-1}$ distribution has $n-1$ degrees of freedom: at small $df$ its tails are heavy (more extreme values are plausible, so critical values are larger than 1.96), and as $df \to \infty$ it converges to the normal. The extra tail weight is the price of not knowing $\sigma$ — and forgetting to pay it is why naive z-tests over-reject on small samples.

Three flavors cover most uses. The one-sample test (EQ S4.9) compares a mean to a fixed value. The paired test applies the one-sample test to within-subject differences — before/after, left/right — and is far more powerful when it applies, because it cancels per-subject variation. The two-sample test compares two independent groups:

EQ S4.10 — WELCH'S TWO-SAMPLE t $$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} $$

The denominator is the standard error of the difference of two means. Welch's version does not assume equal variances and is the modern default — Student's original pooled-variance test is a special case that fails, sometimes badly, when the groups have unequal spread or unequal size. Welch costs you only a (non-integer) degrees-of-freedom adjustment and is strictly safer; reach for it unless you have a strong reason not to.

Assumptions, honestly stated: the $t$-test wants roughly normal data (or large $n$, via the CLT) and independent observations. It is robust to mild non-normality but not to dependence or to extreme outliers, which inflate $S$ and quietly kill power. For badly skewed or ordinal data, a rank-based test (Mann–Whitney, Wilcoxon) trades a little power for not caring about the distribution's shape.

PYTHON · RUNNABLE IN-BROWSER

# Two-sample Welch t-test from scratch: t statistic + normal-approx p
import numpy as np
rng = np.random.default_rng(2)
a = rng.normal(100, 15, 30)              # control:   true mean 100
b = rng.normal(106, 15, 30)             # treatment: true mean 106 (effect = 6)

def norm_cdf(x):                          # standard normal CDF, pure numpy (A&S 7.1.26)
    x = np.asarray(x, float); s = np.sign(x); z = np.abs(x) / np.sqrt(2.0)
    t = 1.0 / (1.0 + 0.3275911 * z)
    y = 1.0 - (((((1.061405429*t - 1.453152027)*t) + 1.421413741)*t
                - 0.284496736)*t + 0.254829592)*t * np.exp(-z*z)
    return 0.5 * (1.0 + s * y)

nx, ny = len(a), len(b)
se = np.sqrt(a.var(ddof=1)/nx + b.var(ddof=1)/ny)   # Welch SE of the difference
t  = (a.mean() - b.mean()) / se
p  = 2.0 * (1.0 - norm_cdf(abs(t)))                  # two-sided, normal approximation

print(f"mean control   = {a.mean():6.2f}     mean treatment = {b.mean():6.2f}")
print(f"difference     = {a.mean()-b.mean():+.2f}        standard error = {se:.2f}")
print(f"t statistic    = {t:.3f}")
print(f"two-sided p     = {float(p):.4f}   (normal approx; df > ~30 makes it tight)")
print("reject H0 at alpha = 0.05?", bool(p < 0.05))
print("\nThe true effect was 6. Re-run with effect = 0 (set b mean to 100)")
print("and the p-value scatters uniformly -- exactly the null demo above.")

edits are live — break it on purpose

A one-sample t-test has $ \bar{X} = 52 $, $ \mu_0 = 50 $, sample SD $ S = 8 $, and $ n = 100 $. What is the t statistic $ \dfrac{\bar{X}-\mu_0}{S/\sqrt{n}} $?

Standard error $ = S/\sqrt{n} = 8/\sqrt{100} = 8/10 = 0.8 $. Then $ t = (52 - 50)/0.8 = 2/0.8 = $ 2.5 — comfortably past the $\approx 1.98$ two-sided critical value at $df = 99$, so reject $H_0$ at the 5% level.

4.5

ANOVA: partitioning variance across groups

To compare three or more group means, running a $t$-test on every pair is a trap — it multiplies the false-positive rate (the very problem of §4.6). The Analysis of Variance sidesteps it with one omnibus test built from a beautiful identity: total variation decomposes exactly into variation between groups and variation within them.

EQ S4.11 — THE SUM-OF-SQUARES DECOMPOSITION $$ \underbrace{\sum_{j}\sum_{i}(x_{ij} - \bar{x})^2}_{SS_{\text{total}}} \;=\; \underbrace{\sum_{j} n_j (\bar{x}_j - \bar{x})^2}_{SS_{\text{between}}} \;+\; \underbrace{\sum_{j}\sum_{i}(x_{ij} - \bar{x}_j)^2}_{SS_{\text{within}}} $$

Every observation's distance from the grand mean splits, with no remainder, into "how far its group's mean is from the grand mean" plus "how far it is from its own group's mean." $SS_{\text{between}}$ is signal (do the groups differ?); $SS_{\text{within}}$ is noise (how much do individuals scatter inside a group?). This is the same orthogonal decomposition that underlies $R^2$ in regression (STATS · CH 03).

Sums of squares are not directly comparable — $SS_{\text{between}}$ is built from $k$ group means, $SS_{\text{within}}$ from $N$ observations. Dividing each by its degrees of freedom gives mean squares, and their ratio is the test statistic. Under $H_0$ (all group means equal), both mean squares estimate the same noise variance, so their ratio sits near 1; a real difference inflates the numerator.

EQ S4.12 — THE F-RATIO $$ F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{SS_{\text{between}} / (k-1)}{SS_{\text{within}} / (N-k)} \;\sim\; F_{k-1,\,N-k} \quad\text{under } H_0 $$

$k$ groups, $N$ total observations. The numerator has $k-1$ degrees of freedom, the denominator $N-k$. $F$ is a signal-to-noise ratio: large $F$ means the spread between groups dwarfs the spread within them, which is hard to explain if the means are truly equal. For exactly two groups, $F = t^2$ — ANOVA and the two-sample $t$-test agree. A significant $F$ says "some means differ" but not which; post-hoc tests (Tukey's HSD) localize it while controlling the family-wise error of §4.6.

INSTRUMENT S4.3 — ANOVA F EXPLORER3 GROUPS · BETWEEN vs WITHIN VARIANCE · EQ S4.11–S4.12

SPREAD OF GROUP MEANS 1.5

WITHIN-GROUP SD σ 2.0

GROUP SIZE n 12

MS BETWEEN

—

MS WITHIN

—

F = MSB / MSW

—

VERDICT (α = 0.05)

—

Three groups of dots, one per column; the mint diamonds are group means, the dashed line is the grand mean. Pull the group means apart (raise the spread) and watch $MS_{\text{between}}$ and $F$ climb. Raise the within-group SD and the dots smear vertically: the same separation now drowns in noise and $F$ collapses. Shrinking the spread to zero leaves $F \approx 1$ — pure noise over noise, the null. $F$ is the ratio of the two, and the verdict flips when it crosses the critical value.

An ANOVA gives $ SS_{\text{between}} = 120 $ with $ df = 2 $, and $ SS_{\text{within}} = 300 $ with $ df = 27 $. Compute the F-ratio, $ \dfrac{SS_B/df_B}{SS_W/df_W} $.

$ MS_{\text{between}} = 120/2 = 60 $ and $ MS_{\text{within}} = 300/27 = 11.11 $. Then $ F = 60 / 11.11 = $ 5.4. Against $ F_{2,27} $ the 5% critical value is $\approx 3.35$, so $5.4$ clears it — at least one group mean differs.

4.6

Multiple comparisons & the replication crisis

Here is the dark side of the p-value, and the reason this chapter ends in a cautionary tale. A 5% false-positive rate per test compounds ruthlessly across many tests. Run twenty independent null tests and the probability that at least one hits "significance" by chance is not 5% — it is $1 - 0.95^{20} \approx 64\%$.

EQ S4.13 — FAMILY-WISE ERROR INFLATION $$ \text{FWER} = \mathbb{P}(\text{at least one false positive}) = 1 - (1 - \alpha)^{m} \approx m\alpha \;\;(\text{small } \alpha) $$

Across $m$ independent tests at level $\alpha$, the chance of some spurious hit grows toward 1. The Bonferroni correction restores control by testing each hypothesis at $\alpha/m$ — simple, conservative, and at the cost of power. For large $m$ (genomics, neuroimaging), controlling the false discovery rate instead (Benjamini–Hochberg) — the expected fraction of your "discoveries" that are false — keeps far more power. Either way, the unit of error control is the family of tests, not the single test.

p-HACKING

The same arithmetic, weaponized by flexibility. If you try many outcome variables, many subgroups, many covariate combinations, or peek at the data and stop when $p < 0.05$, you are running dozens of hidden tests and reporting only the winner. This is p-hacking, and it manufactures significance from pure noise. The "garden of forking paths" makes it possible without any conscious cheating — every undocumented analytic choice is a degree of freedom that inflates the real $\alpha$.

This is not academic hygiene; it broke a field's confidence in itself. Beginning in the 2010s, large replication efforts found that a substantial share of published findings — in psychology, parts of biomedicine, and beyond — failed to reproduce. The diagnosis pointed straight at the machinery of this chapter: chronic underpowering (§4.3), undisclosed multiple comparisons (above), publication bias toward "significant" results, and the cult of the $p < 0.05$ threshold. John Ioannidis's 2005 argument — that most published research findings are false — followed from a few lines of conditional probability: when power is low, priors are low, and bias and multiplicity are high, a "significant" result is more likely false than true.

CONTESTED

The reforms are real but not settled. Pre-registration, larger samples, reporting effect sizes with intervals, and sharing data are now mainstream and demonstrably help. Beyond that, consensus frays: some argue for lowering the threshold to $p < 0.005$, some for abandoning fixed thresholds entirely, some for replacing significance testing with Bayesian model comparison (STATS · CH 05) or estimation-with-intervals. The honest summary in 2026: the p-value is a useful, badly abused tool, and "statistical significance" should be read as the start of an argument, never the end of one.

You run $ m = 4 $ hypothesis tests and want a family-wise error rate of $ \alpha = 0.05 $. What per-test threshold does the Bonferroni correction use, $ \alpha/m $?

Bonferroni tests each hypothesis at $ \alpha/m = 0.05/4 = $ 0.0125. Only p-values below 0.0125 count as significant — the tighter bar that keeps the chance of any false positive at or below 5%.

Frequentist inference controls error rates but cannot say "how probable is my hypothesis?" — only a Bayesian can. STATS · CH 05 turns the question around: instead of asking how surprising the data are under a fixed null, it puts a probability distribution on the parameter itself, updates it with Bayes' rule, and reads off credible intervals that mean exactly what the misread confidence interval of §4.2 was supposed to.

4.R

References

Student [Gosset, W. S.] (1908). The Probable Error of a Mean. Biometrika 6(1) — the t distribution, derived for small Guinness brewing samples (EQ S4.9).
Neyman, J. & Pearson, E. S. (1933). On the Problem of the Most Efficient Tests of Statistical Hypotheses. Phil. Trans. R. Soc. A 231 — Type I/II error, power, and the framework of EQ S4.8.
Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine 2(8) — the conditional-probability argument behind §4.6.
Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349(6251) — the large-scale replication study that crystallized the crisis.
Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate. J. R. Stat. Soc. B 57(1) — FDR control for the many-tests regime of EQ S4.13.
Wasserstein, R. L. & Lazar, N. A. (2016). The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 70(2) — the profession's own caution on what a p-value is not.
Welch, B. L. (1947). The Generalization of Student's Problem When Several Different Population Variances Are Involved. Biometrika 34 — the unequal-variance two-sample test of EQ S4.10.

	\(H_0\) true (no effect)	\(H_0\) false (real effect)
Reject \(H_0\)	Type I error · prob \(\alpha\)	correct detection · prob \(1-\beta\) (power)
Fail to reject	correct · prob \(1-\alpha\)	Type II error · prob \(\beta\)