AI // ENCYCLOPEDIA / MACHINE LEARNING / 14 / ENSEMBLES INDEX NEXT: BOOSTING LIBRARIES →
MACHINE LEARNING · CHAPTER 14 / 15

Ensemble Methods

A single tree overfits, a single shallow model underfits, and any single model is a single point of failure. Combine many weak models the right way and their errors largely cancel, the most dependable improvement in applied machine learning. This chapter derives why ensembling works from the bias-variance-covariance decomposition, then walks the three families that exploit it: bagging (reduce variance), boosting (reduce bias, stagewise), and stacking (let a meta-model learn the blend).

LEVELCORE READING TIME≈ 24 MIN BUILDS ONCH 04 · 06 INSTRUMENTSVARIANCE DROP · STAGEWISE FIT · STACKER
14.1

Why ensembles win — the error decomposition

Start from the only identity that matters here. For a squared-error regression target \(y = f(x) + \varepsilon\) with irreducible noise \(\mathrm{Var}(\varepsilon) = \sigma^2\), the expected test error of an estimator \(\hat f\) — averaged over the randomness in the training set — splits into three terms that cannot interfere with one another:

EQ M14.1 — BIAS-VARIANCE DECOMPOSITION $$ \mathbb{E}\big[(y - \hat f(x))^2\big] \;=\; \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{noise}} $$
Bias is how far the average model is from the truth (systematic error — too rigid). Variance is how much the model jitters as the training set is resampled (instability — too flexible). Noise is the floor no model can beat. Each ensemble family attacks a different term: bagging shrinks variance, boosting shrinks bias, and a good stack can chip at both. The decomposition is exact for squared loss; for 0–1 classification it only holds in spirit, which is why we reason in regression first.

Now average \(k\) models. Let each \(\hat f_j\) have the same variance \(\sigma_m^2\), and let any two of them have correlation \(\rho\). The variance of their average is the whole story of ensembling in one line:

EQ M14.2 — VARIANCE OF A CORRELATED AVERAGE $$ \mathrm{Var}\!\left(\frac{1}{k}\sum_{j=1}^{k}\hat f_j\right) \;=\; \rho\,\sigma_m^2 \;+\; \frac{1-\rho}{k}\,\sigma_m^2 $$
Two regimes live inside this equation. As \(k \to \infty\) the second term vanishes and you are left with \(\rho\,\sigma_m^2\) — the correlation floor. If the members are independent (\(\rho = 0\)) the average has variance \(\sigma_m^2/k\): error falls like \(1/k\), a true free lunch. If they are identical (\(\rho = 1\)) you gain nothing — averaging a model with copies of itself changes nothing. Every variance-reduction trick in this chapter is an attempt to push \(\rho\) down without letting \(\sigma_m^2\) blow up. Averaging does not touch bias: the mean of \(k\) equally-biased models keeps that bias intact.

Two consequences follow immediately. First, ensembling helps most when the members are good but different — accurate (small \(\sigma_m^2\)) yet decorrelated (small \(\rho\)). Second, there is a hard limit: no amount of averaging beats the correlation floor \(\rho\,\sigma_m^2\), so the engineering game is decorrelation, not just adding more models. This is exactly why random forests randomize the features at each split (it lowers \(\rho\)) and why diverse model classes stack better than ten retrained copies of the same gradient-boosted tree.

You average \(k = 4\) independent models, each with variance \(v = 8\). By EQ M14.2 with \(\rho = 0\), what is the variance of their average \( \tfrac{1}{k}\sum_j \hat f_j \)?
With \(\rho = 0\), EQ M14.2 collapses to \(\mathrm{Var} = v/k = 8/4 = \) 2. Independence is what makes the \(1/k\) free lunch real; correlation is what spoils it.
True or false: bagging (averaging many bootstrap-trained models) primarily reduces the variance term of EQ M14.1, leaving bias roughly unchanged. Answer true or false.
EQ M14.2 shows the averaged variance shrinks toward the correlation floor, while the bias of the average equals the (shared) bias of each member — averaging cannot move it. So bagging buys variance reduction at fixed bias: true.
14.2

Bagging & variance reduction

Bootstrap aggregating — Breiman's bagging — is the most direct cash-out of EQ M14.2. Draw \(k\) bootstrap samples (each \(n\) points sampled with replacement from the original \(n\)), fit one high-variance, low-bias learner on each, and average their predictions (or majority-vote for classification). Deep decision trees are the canonical base learner because they are exactly what the math wants: nearly unbiased and wildly unstable, so there is a mountain of variance to drain and almost no bias to protect.

EQ M14.3 — THE BAGGED PREDICTOR & ITS OOB FREE LUNCH $$ \hat f_{\text{bag}}(x) = \frac{1}{k}\sum_{j=1}^{k}\hat f_j^{*}(x), \qquad \Pr(\text{point } i \notin \text{bootstrap } j) = \left(1 - \tfrac{1}{n}\right)^{\!n} \xrightarrow[n\to\infty]{} e^{-1} \approx 0.368 $$
\(\hat f_j^{*}\) is the model trained on bootstrap sample \(j\). Each bootstrap omits about 36.8% of the data purely by chance — those are the out-of-bag (OOB) points for that tree. Average each point's predictions over only the trees that did not see it and you get a validation estimate for free, no held-out set required. OOB error is bagging's built-in cross-validation, and it is why random forests are so cheap to tune.

Random forests add the second decorrelation lever EQ M14.2 begged for. Trees grown on bootstrap samples of the same data are still correlated — they all latch onto the few strongest features. So at every split, a random forest considers only a random subset of features (\(\sqrt{p}\) for classification, \(p/3\) for regression are the classic defaults). Forcing weaker features into the splits makes the trees genuinely different, drives \(\rho\) down, and pushes the correlation floor lower than plain bagging can reach. Extremely randomized trees (ExtraTrees) go further still, randomizing the split thresholds too — trading a touch of bias for even less correlation.

In a large bootstrap sample (\(n \to \infty\)), what fraction of the original points are left out of any single bootstrap draw — i.e. become out-of-bag? Use the limit in EQ M14.3.
\(\left(1 - \tfrac{1}{n}\right)^n \to e^{-1} = 0.3679\). Rounded, about 0.368 — roughly 36.8% of points sit out of each tree and form its OOB validation set.
PYTHON · RUNNABLE IN-BROWSER
# Bagging from scratch: average bootstrapped stumps, watch variance collapse
import numpy as np
rng = np.random.default_rng(0)

def stump(x, y):                       # 1-split regression tree (high variance)
    s = np.argsort(x); xs, ys = x[s], y[s]
    best, thr, lo, hi = 1e18, x[0], y.mean(), y.mean()
    for i in range(1, len(x)):         # try every midpoint as a split
        t = 0.5 * (xs[i-1] + xs[i])
        l, r = ys[:i], ys[i:]
        e = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if e < best: best, thr, lo, hi = e, t, l.mean(), r.mean()
    return lambda xq: np.where(xq < thr, lo, hi)

xg = np.linspace(0, 1, 60)             # fixed grid where we measure variance
f = lambda x: np.sin(2*np.pi*x)        # ground truth
def fit_one():                         # one stump on one noisy 25-point sample
    x = rng.uniform(0, 1, 25); y = f(x) + rng.normal(0, 0.4, 25)
    return stump(x, y)(xg)

for k in (1, 5, 25, 100):              # average k stumps, repeat to measure var
    preds = np.array([np.mean([fit_one() for _ in range(k)], 0) for _ in range(40)])
    print(f"k={k:3d}  avg pointwise variance = {preds.var(0).mean():.4f}")
print("\nvariance falls ~1/k early, then flattens at the correlation floor (EQ M14.2).")
edits are live — break it on purpose
INSTRUMENT M14.1 — VARIANCE COLLAPSEAVERAGE N NOISY STUMPS · EQ M14.2
SINGLE-TREE VARIANCE
BAGGED VARIANCE (k TREES)
VARIANCE REDUCTION
Faint lines are individual bootstrap stumps; the bright mint line is their average; the dashed line is the true curve. Push k up and watch the ragged members fuse into a smooth, accurate fit while the variance readout drops toward the correlation floor — not to zero, because bootstrap stumps stay correlated. Raise the noise to see how much more there is to gain.
14.3

Boosting — sequential error correction

Bagging builds its members in parallel and independently. Boosting does the opposite: it builds them in sequence, each new member trained to fix the mistakes of the running ensemble so far. Where bagging attacks variance with strong learners, boosting attacks bias by composing many weak learners (shallow trees, often stumps) into one strong one. The cost is that members are now dependent by construction — \(\rho\) is high — so the variance lever of EQ M14.2 is gone, and boosting trades it for a march down the bias term.

The cleanest modern view is gradient boosting (Friedman): treat the ensemble as a function being optimized by gradient descent in function space. At each round you fit a new weak learner to the negative gradient of the loss — for squared error, that gradient is simply the residual — and take a small, shrunk step:

EQ M14.4 — GRADIENT BOOSTING (FUNCTIONAL GRADIENT DESCENT) $$ r_i^{(m)} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\, h_m(x),\quad h_m \approx \arg\min_h \sum_i (r_i^{(m)} - h(x_i))^2 $$
\(L\) is the loss, \(F_{m-1}\) the ensemble after \(m-1\) rounds, \(h_m\) the weak learner fit to the pseudo-residuals \(r^{(m)}\), and \(\nu \in (0,1]\) the learning rate (shrinkage). For squared loss \(L = \tfrac12(y-F)^2\), the residual is literally \(r_i = y_i - F_{m-1}(x_i)\): each round models what the last one missed. Small \(\nu\) with many rounds nearly always beats large \(\nu\) with few — the same regularize-by-small-steps logic as in SGD (Chapter 08). XGBoost and LightGBM (Chapter 15) are this recipe plus second-order Newton steps and brutal engineering.

The older, equivalent-in-spirit ancestor is AdaBoost, which reweights the data instead of fitting residuals: misclassified points get heavier, so the next weak learner concentrates where the ensemble is failing. Its multiclass form, SAMME, assigns each weak learner a vote weighted by how much better than chance it does:

EQ M14.5 — ADABOOST/SAMME WEIGHTS $$ \alpha_m = \log\!\frac{1 - \mathrm{err}_m}{\mathrm{err}_m} + \log(K - 1), \qquad \mathrm{err}_m = \frac{\sum_i w_i\,\mathbb{1}[y_i \neq h_m(x_i)]}{\sum_i w_i} $$
\(\mathrm{err}_m\) is the weighted error of weak learner \(m\); \(K\) is the number of classes (\(K = 2\) recovers classic AdaBoost, where the \(\log(K-1)\) term is zero). A learner barely better than random (\(\mathrm{err}_m\) just under \(1 - 1/K\)) gets \(\alpha_m \approx 0\); a strong one gets a large vote. After each round, weights on the still-wrong points are multiplied by \(e^{\alpha_m}\) and renormalized. SAMME merely requires each learner to beat random guessing — far weaker than the 50% bar binary AdaBoost demands.
A binary (\(K = 2\)) AdaBoost weak learner has weighted error \(\mathrm{err}_m = 0.1\). By EQ M14.5 (the \(\log(K-1)\) term vanishes for \(K=2\)), what is its vote weight \(\alpha_m\)? Use natural log.
\(\alpha_m = \ln\!\dfrac{1 - 0.1}{0.1} = \ln\dfrac{0.9}{0.1} = \ln 9 = \) 2.197. A 10%-error learner earns a large, confident vote; a 50%-error (chance) learner would earn \(\ln 1 = 0\).
PYTHON · RUNNABLE IN-BROWSER
# AdaBoost (SAMME, K=2) on toy data: print weighted error + alpha each round
import numpy as np
rng = np.random.default_rng(1)

n = 200
X = rng.normal(0, 1, (n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)         # linear boundary, labels +/-1

def stump(X, y, w):                                 # best 1-feature threshold stump
    best = (1e9, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for t in np.quantile(X[:, f], np.linspace(.1, .9, 9)):
            for s in (1, -1):
                pred = np.where(s * (X[:, f] - t) > 0, 1, -1)
                e = w[pred != y].sum()
                if e < best[0]: best = (e, f, t, s)
    e, f, t, s = best
    return (f, t, s), np.where(s * (X[:, f] - t) > 0, 1, -1)

w = np.full(n, 1 / n)                               # uniform sample weights
print("round  weighted_err   alpha")
for m in range(5):
    _, pred = stump(X, y, w)
    err = max(w[pred != y].sum(), 1e-12)
    alpha = np.log((1 - err) / err)                 # EQ M14.5, K=2
    w *= np.exp(alpha * (pred != y))                # up-weight the still-wrong
    w /= w.sum()
    print(f"  {m:2d}      {err:8.4f}   {alpha:6.3f}")
print("\nerror drifts toward 0.5 as easy points are solved and hard ones dominate.")
edits are live — break it on purpose
INSTRUMENT M14.2 — STAGEWISE RESIDUAL FITGRADIENT BOOSTING · EQ M14.4
ROUNDS USED
TRAIN MSE
RESIDUAL NORM
The bright line is the boosted ensemble \(F_m\) chasing the dashed target; the short red sticks are the current residuals \(r^{(m)}\) — what the next stump will be fit to. Step the rounds up and watch the residuals shrink stagewise. Drop the learning rate and you need more rounds for the same fit, but the staircase is smoother and overfits later. At round 0 the model is just the mean.
14.4

Stacking & blending

Bagging and boosting both combine members of one kind with a fixed rule (average, weighted vote). Stacked generalization (Wolpert) asks the obvious next question: why hand-pick the combiner when you can learn it? Train several diverse base models — say a random forest, a gradient-boosted tree, a linear model, and a k-NN — then train a second-level meta-learner whose inputs are the base models' predictions and whose target is the true label. The meta-model discovers, per region of input space, which base learner to trust.

EQ M14.6 — THE STACKED PREDICTOR $$ \hat y = g\big(\hat f_1(x),\, \hat f_2(x),\, \ldots,\, \hat f_M(x)\big), \qquad g = \arg\min_{g}\ \sum_{i}\, L\big(y_i,\, g(z_i)\big),\ \ z_i = \big(\hat f_1^{(-i)}(x_i),\ldots\big) $$
\(g\) is the meta-learner; \(z_i\) is the vector of base predictions for point \(i\). The crucial detail is the \((-i)\) superscript: the base predictions feeding the meta-model must be out-of-fold. Train the bases with k-fold cross-validation and predict each held-out fold, so no base model ever predicts a point it was trained on. Skip this and the meta-learner sees in-sample predictions that are unrealistically good, learns to trust an overfit base, and collapses on real data. A simple, well-regularized \(g\) (ridge, or non-negative-weighted logistic) almost always beats a fancy one. Blending is the lazy cousin: a single fixed holdout split instead of full CV — simpler, leak-resistant, but it wastes data.

Stacking is the engine behind most Kaggle-winning solutions and many production ranking systems, precisely because EQ M14.2 rewards diversity: a forest and a boosted tree make different mistakes, so their stacked combination has lower \(\rho\) than either family alone. The returns are real but modest — typically a few percent over the best single model — and they come with real operational cost (more models to train, serve, monitor, and debug). That trade is the subject of the next section.

PYTHON · RUNNABLE IN-BROWSER
# Stacking toy: two biased base models; a ridge meta-learner finds the blend
import numpy as np
rng = np.random.default_rng(3)

n = 400
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)            # ground truth + noise

# two deliberately complementary, biased base predictions
f1 = 0.9 * np.sin(x) - 0.2                          # good shape, wrong offset
f2 = x / 3.0                                        # a linear approximation
for name, f in (("base f1", f1), ("base f2", f2)):
    print(f"{name:8s} MSE = {((y - f)**2).mean():.4f}")

Z = np.column_stack([np.ones(n), f1, f2])           # meta features (+ intercept)
lam = 1e-3
beta = np.linalg.solve(Z.T @ Z + lam*np.eye(3), Z.T @ y)   # ridge meta-learner
stack = Z @ beta
print(f"\nmeta weights [bias, f1, f2] = {beta.round(3)}")
print(f"stacked  MSE = {((y - stack)**2).mean():.4f}  <- below both bases")
edits are live — break it on purpose
INSTRUMENT M14.3 — STACKING META-LEARNERTWO BASES + LEARNED BLEND · EQ M14.6
BASE 1 MSE
BASE 2 MSE
STACKED MSE
Two biased base learners (mint, blue) bracket the dashed truth. In MANUAL mode, drag w₁ to blend them by hand and hunt for the lowest stacked MSE. Switch to RIDGE FIT and the meta-learner solves for the optimal weights in closed form — landing at or below the best blend you can find by hand, and below either base alone. That is EQ M14.6 doing its job.
14.5

When ensembles fail

The "free lunch" framing is a useful exaggeration. Ensembles are remarkably robust, but EQ M14.2 also tells you exactly where they stop helping — and several failure modes have nothing to do with the math at all.

Failure modeWhy it happensWhat it looks like
Correlated membersHigh \(\rho\) pins variance at the floor \(\rho\sigma_m^2\)The 200th tree adds nothing; OOB error flatlined at tree ~50
Bagging a stable learnerLow-variance bases (linear/SVM) have little variance to drainBagged linear model ≈ the single linear model, at 100× the cost
Boosting on noisy labelsAdaBoost up-weights hard points — which include mislabelsTrain error → 0, test error climbs; the ensemble memorizes noise
Stacking with leakageIn-sample (not out-of-fold) base predictions feed the meta-modelSpectacular CV scores, collapse in production
Distribution shiftAll members agree confidently on the wrong (shifted) inputsCalibrated, unanimous, and wrong — diversity gives no safety here

Two of these deserve emphasis because they are genuinely contested in practice. First, boosting's noise sensitivity: AdaBoost's exponential loss punishes outliers viciously, which is why robust variants (LogitBoost, gradient boosting with Huber loss, and early stopping) exist — though on clean tabular data, gradient-boosted trees remain the strongest single tool and frequently beat deep nets, a result that surprises newcomers and is still actively debated. Second, the cost-benefit ledger: an ensemble multiplies training, serving, latency, memory, and debugging cost, often for a 1–3% metric gain. In a leaderboard that wins; in a latency-bound product it may not be worth it. The honest default is a single well-tuned gradient-boosted model for tabular problems, reaching for stacking only when the last percent genuinely pays.

The deeper caveat. Ensembling reduces variance and bias, never the irreducible noise \(\sigma^2\) of EQ M14.1, and it cannot manufacture signal that the base learners never saw. Diversity is a property of errors, not of confidence: ten models can be diverse, confident, and unanimously wrong under distribution shift. Ensembles buy you stability and a few points of accuracy — not robustness to a world that has moved.

NEXT

The theory is settled; the speed is not. Chapter 15 takes gradient boosting from EQ M14.4 to the libraries that dominate tabular ML — XGBoost, LightGBM, and CatBoost — covering second-order Newton splits, histogram binning, leaf-wise growth, and the regularization knobs that decide whether boosting generalizes or memorizes.

14.R

References

  1. Breiman, L. (1996). Bagging Predictors. Machine Learning 24(2) — bootstrap aggregating and its variance-reduction argument (EQ M14.2, M14.3).
  2. Breiman, L. (2001). Random Forests. Machine Learning 45(1) — feature subsampling as the second decorrelation lever; OOB error (§14.2).
  3. Wolpert, D. H. (1992). Stacked Generalization. Neural Networks 5(2) — learning the combiner with out-of-fold base predictions (EQ M14.6).
  4. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. Multiple Classifier Systems, LNCS 1857 — the canonical survey of why and when ensembles help (§14.1).
  5. Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55(1) — AdaBoost and its training-error bound (EQ M14.5).
  6. Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics 29(5) — boosting as functional gradient descent (EQ M14.4).
  7. Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — free online; Ch. 8, 10, 15–16 cover bagging, boosting, and random forests with the bias-variance lens used here.