Why ensembles win — the error decomposition
Start from the only identity that matters here. For a squared-error regression target \(y = f(x) + \varepsilon\) with irreducible noise \(\mathrm{Var}(\varepsilon) = \sigma^2\), the expected test error of an estimator \(\hat f\) — averaged over the randomness in the training set — splits into three terms that cannot interfere with one another:
Now average \(k\) models. Let each \(\hat f_j\) have the same variance \(\sigma_m^2\), and let any two of them have correlation \(\rho\). The variance of their average is the whole story of ensembling in one line:
Two consequences follow immediately. First, ensembling helps most when the members are good but different — accurate (small \(\sigma_m^2\)) yet decorrelated (small \(\rho\)). Second, there is a hard limit: no amount of averaging beats the correlation floor \(\rho\,\sigma_m^2\), so the engineering game is decorrelation, not just adding more models. This is exactly why random forests randomize the features at each split (it lowers \(\rho\)) and why diverse model classes stack better than ten retrained copies of the same gradient-boosted tree.
Bagging & variance reduction
Bootstrap aggregating — Breiman's bagging — is the most direct cash-out of EQ M14.2. Draw \(k\) bootstrap samples (each \(n\) points sampled with replacement from the original \(n\)), fit one high-variance, low-bias learner on each, and average their predictions (or majority-vote for classification). Deep decision trees are the canonical base learner because they are exactly what the math wants: nearly unbiased and wildly unstable, so there is a mountain of variance to drain and almost no bias to protect.
Random forests add the second decorrelation lever EQ M14.2 begged for. Trees grown on bootstrap samples of the same data are still correlated — they all latch onto the few strongest features. So at every split, a random forest considers only a random subset of features (\(\sqrt{p}\) for classification, \(p/3\) for regression are the classic defaults). Forcing weaker features into the splits makes the trees genuinely different, drives \(\rho\) down, and pushes the correlation floor lower than plain bagging can reach. Extremely randomized trees (ExtraTrees) go further still, randomizing the split thresholds too — trading a touch of bias for even less correlation.
# Bagging from scratch: average bootstrapped stumps, watch variance collapse
import numpy as np
rng = np.random.default_rng(0)
def stump(x, y): # 1-split regression tree (high variance)
s = np.argsort(x); xs, ys = x[s], y[s]
best, thr, lo, hi = 1e18, x[0], y.mean(), y.mean()
for i in range(1, len(x)): # try every midpoint as a split
t = 0.5 * (xs[i-1] + xs[i])
l, r = ys[:i], ys[i:]
e = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
if e < best: best, thr, lo, hi = e, t, l.mean(), r.mean()
return lambda xq: np.where(xq < thr, lo, hi)
xg = np.linspace(0, 1, 60) # fixed grid where we measure variance
f = lambda x: np.sin(2*np.pi*x) # ground truth
def fit_one(): # one stump on one noisy 25-point sample
x = rng.uniform(0, 1, 25); y = f(x) + rng.normal(0, 0.4, 25)
return stump(x, y)(xg)
for k in (1, 5, 25, 100): # average k stumps, repeat to measure var
preds = np.array([np.mean([fit_one() for _ in range(k)], 0) for _ in range(40)])
print(f"k={k:3d} avg pointwise variance = {preds.var(0).mean():.4f}")
print("\nvariance falls ~1/k early, then flattens at the correlation floor (EQ M14.2).")
Boosting — sequential error correction
Bagging builds its members in parallel and independently. Boosting does the opposite: it builds them in sequence, each new member trained to fix the mistakes of the running ensemble so far. Where bagging attacks variance with strong learners, boosting attacks bias by composing many weak learners (shallow trees, often stumps) into one strong one. The cost is that members are now dependent by construction — \(\rho\) is high — so the variance lever of EQ M14.2 is gone, and boosting trades it for a march down the bias term.
The cleanest modern view is gradient boosting (Friedman): treat the ensemble as a function being optimized by gradient descent in function space. At each round you fit a new weak learner to the negative gradient of the loss — for squared error, that gradient is simply the residual — and take a small, shrunk step:
The older, equivalent-in-spirit ancestor is AdaBoost, which reweights the data instead of fitting residuals: misclassified points get heavier, so the next weak learner concentrates where the ensemble is failing. Its multiclass form, SAMME, assigns each weak learner a vote weighted by how much better than chance it does:
# AdaBoost (SAMME, K=2) on toy data: print weighted error + alpha each round
import numpy as np
rng = np.random.default_rng(1)
n = 200
X = rng.normal(0, 1, (n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1) # linear boundary, labels +/-1
def stump(X, y, w): # best 1-feature threshold stump
best = (1e9, 0, 0.0, 1)
for f in range(X.shape[1]):
for t in np.quantile(X[:, f], np.linspace(.1, .9, 9)):
for s in (1, -1):
pred = np.where(s * (X[:, f] - t) > 0, 1, -1)
e = w[pred != y].sum()
if e < best[0]: best = (e, f, t, s)
e, f, t, s = best
return (f, t, s), np.where(s * (X[:, f] - t) > 0, 1, -1)
w = np.full(n, 1 / n) # uniform sample weights
print("round weighted_err alpha")
for m in range(5):
_, pred = stump(X, y, w)
err = max(w[pred != y].sum(), 1e-12)
alpha = np.log((1 - err) / err) # EQ M14.5, K=2
w *= np.exp(alpha * (pred != y)) # up-weight the still-wrong
w /= w.sum()
print(f" {m:2d} {err:8.4f} {alpha:6.3f}")
print("\nerror drifts toward 0.5 as easy points are solved and hard ones dominate.")
Stacking & blending
Bagging and boosting both combine members of one kind with a fixed rule (average, weighted vote). Stacked generalization (Wolpert) asks the obvious next question: why hand-pick the combiner when you can learn it? Train several diverse base models — say a random forest, a gradient-boosted tree, a linear model, and a k-NN — then train a second-level meta-learner whose inputs are the base models' predictions and whose target is the true label. The meta-model discovers, per region of input space, which base learner to trust.
Stacking is the engine behind most Kaggle-winning solutions and many production ranking systems, precisely because EQ M14.2 rewards diversity: a forest and a boosted tree make different mistakes, so their stacked combination has lower \(\rho\) than either family alone. The returns are real but modest — typically a few percent over the best single model — and they come with real operational cost (more models to train, serve, monitor, and debug). That trade is the subject of the next section.
# Stacking toy: two biased base models; a ridge meta-learner finds the blend
import numpy as np
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n) # ground truth + noise
# two deliberately complementary, biased base predictions
f1 = 0.9 * np.sin(x) - 0.2 # good shape, wrong offset
f2 = x / 3.0 # a linear approximation
for name, f in (("base f1", f1), ("base f2", f2)):
print(f"{name:8s} MSE = {((y - f)**2).mean():.4f}")
Z = np.column_stack([np.ones(n), f1, f2]) # meta features (+ intercept)
lam = 1e-3
beta = np.linalg.solve(Z.T @ Z + lam*np.eye(3), Z.T @ y) # ridge meta-learner
stack = Z @ beta
print(f"\nmeta weights [bias, f1, f2] = {beta.round(3)}")
print(f"stacked MSE = {((y - stack)**2).mean():.4f} <- below both bases")
When ensembles fail
The "free lunch" framing is a useful exaggeration. Ensembles are remarkably robust, but EQ M14.2 also tells you exactly where they stop helping — and several failure modes have nothing to do with the math at all.
| Failure mode | Why it happens | What it looks like |
|---|---|---|
| Correlated members | High \(\rho\) pins variance at the floor \(\rho\sigma_m^2\) | The 200th tree adds nothing; OOB error flatlined at tree ~50 |
| Bagging a stable learner | Low-variance bases (linear/SVM) have little variance to drain | Bagged linear model ≈ the single linear model, at 100× the cost |
| Boosting on noisy labels | AdaBoost up-weights hard points — which include mislabels | Train error → 0, test error climbs; the ensemble memorizes noise |
| Stacking with leakage | In-sample (not out-of-fold) base predictions feed the meta-model | Spectacular CV scores, collapse in production |
| Distribution shift | All members agree confidently on the wrong (shifted) inputs | Calibrated, unanimous, and wrong — diversity gives no safety here |
Two of these deserve emphasis because they are genuinely contested in practice. First, boosting's noise sensitivity: AdaBoost's exponential loss punishes outliers viciously, which is why robust variants (LogitBoost, gradient boosting with Huber loss, and early stopping) exist — though on clean tabular data, gradient-boosted trees remain the strongest single tool and frequently beat deep nets, a result that surprises newcomers and is still actively debated. Second, the cost-benefit ledger: an ensemble multiplies training, serving, latency, memory, and debugging cost, often for a 1–3% metric gain. In a leaderboard that wins; in a latency-bound product it may not be worth it. The honest default is a single well-tuned gradient-boosted model for tabular problems, reaching for stacking only when the last percent genuinely pays.
The deeper caveat. Ensembling reduces variance and bias, never the irreducible noise \(\sigma^2\) of EQ M14.1, and it cannot manufacture signal that the base learners never saw. Diversity is a property of errors, not of confidence: ten models can be diverse, confident, and unanimously wrong under distribution shift. Ensembles buy you stability and a few points of accuracy — not robustness to a world that has moved.
The theory is settled; the speed is not. Chapter 15 takes gradient boosting from EQ M14.4 to the libraries that dominate tabular ML — XGBoost, LightGBM, and CatBoost — covering second-order Newton splits, histogram binning, leaf-wise growth, and the regularization knobs that decide whether boosting generalizes or memorizes.
References
- Breiman, L. (1996). Bagging Predictors.
- Breiman, L. (2001). Random Forests.
- Wolpert, D. H. (1992). Stacked Generalization.
- Dietterich, T. G. (2000). Ensemble Methods in Machine Learning.
- Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).