Gradient boosting — the general algorithm
A single decision tree (Chapter 04) is a weak learner: it carves the input space into boxes and predicts a constant in each. Boosting is the idea that a sequence of such weak learners, each trained to fix the mistakes of the running total, can compose into a strong one. Where bagging (Chapter 14) builds many independent trees and averages them to cut variance, boosting builds trees sequentially, and each new tree is added to reduce the bias that remains.
Friedman's gradient boosting machine (2001) gives this a clean, general formulation: treat the additive model \(F(x)\) as a point in function space, and run gradient descent on the loss, with respect to the function itself. We grow the model one stage at a time:
How do we pick \(h_m\)? Gradient descent says: move in the direction of steepest descent. In function space, the steepest-descent direction at example \(i\) is the negative gradient of the loss with respect to the current prediction — the pseudo-residual:
The payoff is cleanest for the squared-error loss \(L(y,F)=\tfrac12(y-F)^2\). Its gradient is \(\partial L/\partial F = -(y-F)\), so the negative gradient is simply \(r_i = y_i - F(x_i)\): the ordinary residual. For squared loss, "fit the negative gradient" reduces to the intuitive "fit the leftover error" — which is why the from-scratch demo below works with plain residuals.
# Gradient boosting for regression, from scratch: stumps fit to residuals
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 120)
y = np.sin(x) + 0.15 * rng.standard_normal(x.size) # noisy target
def best_stump(x, r): # depth-1 tree: one threshold, two leaf means
order = np.argsort(x); xs, rs = x[order], r[order]
best = (np.inf, x.mean(), r.mean(), r.mean())
for i in range(5, len(xs) - 5): # candidate split points
t = 0.5 * (xs[i - 1] + xs[i])
lo, hi = rs[:i].mean(), rs[i:].mean()
sse = ((rs[:i] - lo) ** 2).sum() + ((rs[i:] - hi) ** 2).sum()
if sse < best[0]: best = (sse, t, lo, hi)
return best[1:] # threshold, left value, right value
nu, M = 0.3, 40
F = np.full_like(y, y.mean()) # F0 = mean (EQ M15.1, squared loss)
loss = []
for m in range(M):
r = y - F # negative gradient = residual (EQ M15.3)
t, lo, hi = best_stump(x, r)
h = np.where(x < t, lo, hi)
F = F + nu * h # stagewise step (EQ M15.1)
loss.append(np.mean((y - F) ** 2))
print(f"start MSE {np.mean((y - y.mean())**2):.4f} -> after {M} trees {loss[-1]:.4f}")
print("loss is monotonically non-increasing:", all(np.diff(loss) <= 1e-9))
plot_xy(list(range(1, M + 1)), loss)
AdaBoost & exponential loss — where it started
Gradient boosting did not arrive first. AdaBoost (Freund & Schapire, 1997) predates the gradient view and looks, at a glance, like a different algorithm: it maintains a weight on every training example, up-weights the ones the current ensemble gets wrong, and trains the next weak learner on that re-weighted distribution. Misclassified points shout louder; the next learner is forced to attend to them.
For binary labels \(y \in \{-1,+1\}\), each round fits a classifier \(h_m\), measures its weighted error \(\varepsilon_m\), and assigns it a vote \(\alpha_m\):
The bridge to Friedman is one of the cleaner results in machine learning. Friedman, Hastie and Tibshirani (2000) showed that AdaBoost is exactly forward stagewise additive modeling under the exponential loss:
# AdaBoost weight-update demo on toy 1-D data (decision stumps)
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10.0])
y = np.array([1, 1, 1,-1, 1,-1,-1, 1,-1,-1.0]) # not linearly separable by a stump
w = np.full(y.size, 1.0 / y.size) # uniform start
def best_stump(x, y, w): # weighted-error optimal stump
best = (1.0, 0.0, 1) # (err, threshold, polarity)
for t in (x[:-1] + x[1:]) / 2:
for s in (+1, -1):
pred = np.where(x < t, s, -s)
err = w[pred != y].sum()
if err < best[0]: best = (err, t, s)
return best
print(" round eps alpha max-weight")
for m in range(3):
eps, t, s = best_stump(x, y, w)
eps = min(max(eps, 1e-9), 1 - 1e-9)
alpha = 0.5 * np.log((1 - eps) / eps) # EQ M15.4 vote weight
pred = np.where(x < t, s, -s)
w = w * np.exp(-alpha * y * pred) # reweight: errors up, correct down
w = w / w.sum() # renormalize to a distribution
print(f" {m+1} {eps:.3f} {alpha:+.3f} {w.max():.3f}")
print("\nweights after 3 rounds:", np.round(w, 3))
print("hard examples now carry the most weight (largest entries).")
XGBoost — regularization & second-order gradients
XGBoost (Chen & Guestrin, 2016) took Friedman's algorithm and made it production-grade. Two ideas matter most. First, it adds an explicit regularization term to the objective, penalizing trees that grow too many leaves or assign too-large leaf values. Second, it uses a second-order Taylor expansion of the loss — gradients and Hessians — so each tree solves a closer approximation of the true objective than first-order gradient boosting does.
Because the objective is now a sum of independent per-leaf quadratics, the optimum has a closed form. For a leaf holding instance set \(I_j\) with gradient sum \(G_j=\sum_{i\in I_j} g_i\) and Hessian sum \(H_j=\sum_{i\in I_j} h_i\), the best leaf value and the resulting loss reduction are:
Beyond the formula. XGBoost also ships a sparsity-aware split finder (it learns a default direction for missing values rather than imputing), an approximate histogram-based split mode for large data, column and row subsampling à la random forests, and shrinkage on top — so EQ M15.1's \(\nu\) and EQ M15.6's \(\lambda,\gamma\) all coexist as separate regularizers. It was the algorithm that, for several years, won the majority of Kaggle's tabular competitions, and it remains the field's reference implementation.
# XGBoost leaf math (EQ M15.7): leaf value and split gain, in pure numpy
import numpy as np
# per-instance first/second-order grads for a node (toy values)
g = np.array([-2.0, -3.0, -5.0, 1.0, 2.0, 3.0])
h = np.array([ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
lam, gamma = 1.0, 0.0
def leaf_value(G, H): return -G / (H + lam)
def score(G, H): return G * G / (H + lam) # structure score
G, H = g.sum(), h.sum()
print(f"whole node: G={G:.1f} H={H:.1f} w*={leaf_value(G,H):.4f}")
best = (-np.inf, None)
for s in range(1, len(g)): # try each ordered split
GL, HL = g[:s].sum(), h[:s].sum()
GR, HR = g[s:].sum(), h[s:].sum()
gain = 0.5 * (score(GL,HL) + score(GR,HR) - score(G,H)) - gamma
print(f"split after idx {s}: gain={gain:+.4f} "
f"wL={leaf_value(GL,HL):+.3f} wR={leaf_value(GR,HR):+.3f}")
if gain > best[0]: best = (gain, s)
print(f"\nbest split is after index {best[1]} with gain {best[0]:.4f}")
LightGBM — histograms & leaf-wise growth
LightGBM (Ke et al., 2017) keeps XGBoost's objective but rebuilds the machinery for speed and memory at scale. Three engineering bets define it.
Histogram binning. Finding the best split by scanning every distinct feature value is \(O(n\,d)\) per level and dominated by sorting. LightGBM instead buckets each feature into a fixed number of bins (default 255) once, up front, then builds histograms of gradient and Hessian sums per bin. Split-finding becomes a cheap scan over bins, not rows:
Leaf-wise growth. XGBoost grows trees level-wise (split every node at a depth before going deeper). LightGBM grows leaf-wise: at each step it splits the single leaf with the largest Gain, anywhere in the tree. For a fixed number of leaves this lowers loss faster — but it produces deep, unbalanced trees that overfit small data, so LightGBM caps growth with num_leaves and max_depth rather than depth alone.
num_leaves meaningfully below \(2^{\texttt{max\_depth}}\). LightGBM also adds GOSS (keep large-gradient rows, subsample small-gradient ones) and EFB (bundle mutually-exclusive sparse features), the two tricks its name — Gradient-based One-Side Sampling + Exclusive Feature Bundling — refers to.num_leaves = 8 with max_depth = 10. What fraction of the depth-10 capacity \(2^{\texttt{max\_depth}}\) does it actually use? (Give a decimal.)num_leaves far below \(2^{\texttt{max\_depth}}\) (EQ M15.9) is the standard guard against leaf-wise overfitting — a deep but narrow tree.# LightGBM histogram split (EQ M15.8): bin once, then scan bins not rows
import numpy as np
rng = np.random.default_rng(2)
n = 20000
x = rng.uniform(0, 1, n) # one feature
g = (x - 0.6) + 0.1 * rng.standard_normal(n) # toy gradients (sign flips near 0.6)
h = np.ones(n)
lam = 1.0
nbins = 256
edges = np.linspace(0, 1, nbins + 1)
b = np.clip(np.digitize(x, edges) - 1, 0, nbins - 1) # bin index per row
Gb = np.bincount(b, weights=g, minlength=nbins) # gradient histogram
Hb = np.bincount(b, weights=h, minlength=nbins) # hessian histogram
G, H = Gb.sum(), Hb.sum()
GL = np.cumsum(Gb)[:-1]; HL = np.cumsum(Hb)[:-1] # left side at each cut
GR, HR = G - GL, H - HL
gain = 0.5 * (GL**2/(HL+lam) + GR**2/(HR+lam) - G**2/(H+lam))
cut = np.argmax(gain)
print(f"rows={n}, but only {nbins} bins scanned to find the split")
print(f"best cut at bin {cut} ~ x = {edges[cut+1]:.3f} (true sign flip at 0.600)")
print(f"best gain = {gain[cut]:.2f}")
plot_xy(list(range(len(gain))), gain.tolist())
CatBoost — ordered boosting & native categoricals
CatBoost (Prokhorenkova et al., 2018) targets a subtle bug that the others share: target leakage. It shows up in two places — when you encode a categorical feature using the target, and when you compute residuals — and CatBoost's signature move, ordered boosting, is one mechanism that fixes both.
Ordered target statistics. A natural way to turn a categorical value (say, a city) into a number is to replace it with the average target for that category — target or mean encoding. Done naïvely, this leaks: the row's own label is in the average used to encode it, so the model peeks at the answer. CatBoost computes the statistic using only the rows that came before in a random permutation:
Ordered boosting. The same leakage lurks in the residuals themselves: the standard GBM uses a model trained on row \(i\) to compute row \(i\)'s gradient, a mild form of target peeking that biases the residuals and inflates apparent accuracy (a "prediction shift"). CatBoost maintains models trained on prefixes of a permutation, so the gradient for row \(i\) always comes from a model that never saw row \(i\). The result is less overfitting on small and noisy datasets — its main empirical edge.
Honest comparison. No library dominates across the board. Benchmarks (and the contested but useful 2022 Grinsztajn et al. study) agree on the broad strokes: all three beat untuned neural nets on typical tabular data, and tuned, the three are usually within noise of one another. The real differentiators are operational — LightGBM tends to be fastest to train, CatBoost is strongest out-of-the-box on heavily categorical data and needs the least preprocessing, and XGBoost is the most battle-tested with the widest ecosystem. The defensible claim is engineering, not accuracy: pick by your data shape and constraints, then tune.
| Dimension | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | level-wise (depth-first symmetric) | leaf-wise (best-first) | symmetric / oblivious trees |
| Split finding | exact + histogram | histogram (GOSS, EFB) | histogram |
| Categoricals | manual encoding (now native too) | native (Fisher grouping) | native ordered TS (EQ M15.10) |
| Leakage guard | shrinkage + subsampling | shrinkage + subsampling | ordered boosting |
| Usual edge | ecosystem, maturity | speed on big data | categoricals, low tuning |
You can now build the model that wins the leaderboard — but a leaderboard score is only as honest as the split that produced it. The MLOps volume opens with the discipline that decides whether any of this generalizes: resampling and cross-validation — k-fold, stratified, grouped, and time-series splits, and the leakage traps that quietly turn a 0.95 held-out score into fiction. MLOps · Chapter 01: Resampling & Cross-Validation.
References
- Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
- Freund, Y. & Schapire, R. E. (1997). A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.
- Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive Logistic Regression: A Statistical View of Boosting.
- Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree.
- Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. (2018). CatBoost: Unbiased Boosting with Categorical Features.
- Grinsztajn, L., Oyallon, E. & Varoquaux, G. (2022). Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data?.