Hyperparameter Tuning — AI Encyclopedia

2.1

The search space & the objective

A hyperparameter is any setting fixed before training that the optimizer is not allowed to touch: the learning rate of gradient descent, the number of trees and their depth in a forest, the $C$ of an SVM, the dropout rate of a network. Cross-validation (the previous chapter) tells you how to score one configuration honestly. Tuning is the outer question it left open: which configurations should you even try?

Frame it as optimization. Let $\theta \in \Theta$ be a configuration drawn from a search space $\Theta$, and let $f(\theta)$ be the validation objective — typically the cross-validated loss, which we want to minimize:

EQ V2.1 — THE HYPERPARAMETER OPTIMIZATION PROBLEM $$ \theta^{\star} \;=\; \arg\min_{\theta \in \Theta}\; f(\theta), \qquad f(\theta) \;=\; \widehat{\mathrm{CV}}\big(\theta\big) \;=\; \frac{1}{k}\sum_{i=1}^{k} L\big(y,\, \hat{f}_{\theta}^{\,(-i)}(x)\big) $$

$\Theta$ is the Cartesian product of every hyperparameter's allowed range; $f(\theta)$ is the k-fold CV score (MLOPS · EQ V1.3) of training with configuration $\theta$. Two properties make this hard and define the whole chapter: $f$ is a black box — no gradient, no formula, you can only sample it — and each sample is expensive, since one evaluation means training the model $k$ times. Every method below is a different policy for spending a fixed budget of these expensive black-box queries.

Three properties of the space drive every design choice. First, scale: a learning rate ranges over orders of magnitude, so you search it on a log scale ($10^{-5}$ to $10^{-1}$), not a linear one — uniform-in-log, where each decade gets equal attention. Second, type: some knobs are continuous (learning rate), some integer (tree depth), some categorical (optimizer ∈ {Adam, SGD}). Third, and most consequential, effective dimensionality: of the dozen hyperparameters you nominally tune, usually only two or three actually move the score. That last fact is the hinge on which random search beats grid search.

EQ V2.2 — SIZE OF A GRID $$ |\Theta_{\text{grid}}| \;=\; \prod_{j=1}^{D} n_j \qquad\Longrightarrow\qquad \text{evaluations} \;=\; k \cdot \prod_{j=1}^{D} n_j $$

A grid that tries $n_j$ values of hyperparameter $j$ across $D$ hyperparameters has $\prod_j n_j$ configurations, and each costs $k$ model fits under k-fold CV. The product is the curse of dimensionality in one line: ten values across six hyperparameters is $10^6$ configurations — a million fits before you score anything. The objective $f$ is cheap to state and ruinous to sweep.

You define a grid over three hyperparameters with $4$, $5$, and $3$ candidate values respectively. By EQ V2.2, how many distinct configurations does the full grid contain ($\prod_j n_j$)?

The grid is the Cartesian product, so its size is the product of the per-axis counts: $4 \times 5 \times 3 = $ 60 configurations. Under 5-fold CV that is already $60 \times 5 = 300$ model fits — and adding a single fourth hyperparameter with just 4 values would quadruple it to 240 configurations.

You search a learning rate log-uniformly between $10^{-5}$ and $10^{-1}$. The geometric center of that range — the value sitting exactly halfway in log space — is $\sqrt{10^{-5}\cdot 10^{-1}}$. What is it?

In log space the midpoint of $[-5, -1]$ is $-3$, so the value is $10^{-3}$. Equivalently the geometric mean: $\sqrt{10^{-5}\cdot 10^{-1}} = \sqrt{10^{-6}} = 10^{-3} = $ 0.001. Searching log-uniformly is why each decade of learning rate gets equal sampling attention.

2.2

Grid search: exhaustive and quietly wasteful

Grid search is the reflex: pick a finite set of values per hyperparameter, take the Cartesian product, evaluate every point, keep the best. It is trivial to implement, embarrassingly parallel, and reproducible to the last digit. For one or two hyperparameters it is perfectly reasonable. Its problems begin the moment the space has more dimensions or more resolution than your budget can sweep.

The first problem is the exponential of EQ V2.2: refining a grid from 5 to 10 values per axis multiplies the cost by $2^D$, so the same compute buys you exponentially coarser resolution as $D$ grows. The second problem is subtler and more damaging. A grid spends its budget on the Cartesian product even when most axes do not matter. Project a 2D grid onto the one hyperparameter that actually drives the score, and the grid's points collapse onto each other — a $g \times g$ grid tests only $g$ distinct values of the hyperparameter that matters, no matter how large $g$ is.

EQ V2.3 — DISTINCT VALUES TESTED ON THE IMPORTANT AXIS $$ \underbrace{g^{D}}_{\text{grid evaluations}} \quad\text{but only}\quad \underbrace{g}_{\substack{\text{distinct values of} \\ \text{the one axis that matters}}} \;\;\ll\;\; \underbrace{g^{D}}_{\substack{\text{distinct values random} \\ \text{search would test}}} $$

A $g$-per-axis grid in $D$ dimensions makes $g^{D}$ evaluations but, by construction, repeats each value of every single axis $g^{D-1}$ times. If the objective depends on only one axis, all that repetition is wasted: the grid resolves the important axis at resolution $g$, while spending $g^{D}$ queries doing it. Random search spends the identical budget on $g^{D}$ distinct values of every axis — the observation that motivates §2.3.

Grid search is not obsolete. When you have exactly one or two hyperparameters, when reproducibility and an auditable, even sweep matter (a regulated model-risk setting), or when each evaluation is cheap, a small grid is clear and defensible. The lesson is narrower than "never grid": do not let a grid's tidiness lull you into spending an exponential budget resolving axes that do not move the metric.

PYTHON · RUNNABLE IN-BROWSER

# Grid vs random search on a 2D objective; best-found vs number of evaluations.
import numpy as np
rng = np.random.default_rng(0)

# A black-box objective on [0,1]^2 with one DOMINANT axis (x) and a near-flat one (y).
# Minimum sits near x*=0.30; y barely matters -> the classic random-beats-grid setup.
def f(x, y):
    return (x - 0.30)**2 + 0.03*(y - 0.7)**2 + 0.01*np.sin(12*x)

# GRID: g x g points -> g^2 evals, but only g DISTINCT x values are ever tested.
for g in (4, 6):
    xs = np.linspace(0, 1, g)
    grid = [f(x, y) for x in xs for y in xs]
    print(f"grid {g}x{g} = {g*g:2d} evals | best f = {min(grid):.4f} "
          f"| distinct x tested = {g}")

# RANDOM: same budgets, but every draw is a fresh x AND y.
print()
for n in (16, 36):
    X = rng.random((n, 2))
    vals = f(X[:, 0], X[:, 1])
    print(f"random  {n:2d} evals | best f = {vals.min():.4f} "
          f"| distinct x tested = {n}")

print("\nEqual budgets: random resolves the x that matters far more finely than grid.")

edits are live — break it on purpose

2.3

Random search: the surprising default

Random search replaces the grid with independent draws: sample each hyperparameter from a distribution (uniform, log-uniform, categorical) for some budget $n$ of trials, evaluate, keep the best. Bergstra and Bengio's 2012 result — one of the most quietly influential papers in applied ML — showed that for the same budget, random search matches or beats grid search on neural-network tuning, and the reason is exactly EQ V2.3: when only a few of the many hyperparameters matter, random search devotes its full budget to many distinct values of the important ones, while the grid wastes most of its points on combinations of the unimportant ones.

There is also a clean probabilistic guarantee that needs no assumption about which axes matter. If you call a configuration "good" when it lands in the top $p$ fraction of the search space, then $n$ independent random draws miss the good region entirely with probability $(1-p)^n$:

EQ V2.4 — RANDOM SEARCH COVERAGE GUARANTEE $$ \Pr[\text{at least one trial in the top } p] \;=\; 1 - (1-p)^{n} \qquad\Longrightarrow\qquad n \;\geq\; \frac{\ln(1-c)}{\ln(1-p)} $$

To hit the top $p = 5\%$ of configurations with confidence $c = 0.95$, you need $n \ge \ln(0.05)/\ln(0.95) \approx 59$ random trials — independent of how many hyperparameters you have. This is the dimension-free property that makes random search scale where grids cannot: the bound depends only on how good "good enough" is, not on $D$. It does assume the top-$p$ region has $p$ probability mass under your sampling distribution, which is why choosing sensible ranges and log-scales still matters.

Random search tends to beat grid search precisely when only a few of the many hyperparameters actually matter, because it spends its budget on more distinct values of those important axes. True or false? (Answer true or false.)

This is the central finding of Bergstra & Bengio (2012) and the meaning of EQ V2.3. A $g\times g$ grid tests only $g$ distinct values of the axis that matters and wastes the rest of its $g^2$ budget on the flat axis; random search at the same budget tests $g^2$ distinct values of every axis. When the effective dimensionality is low, that is a decisive advantage — so the statement is true.

Using the coverage bound of EQ V2.4, how many random trials $n$ do you need to land at least one configuration in the top $p = 5\%$ of the space with confidence $c = 95\%$? (Compute $\lceil \ln(1-c)/\ln(1-p)\rceil$.)

$\ln(1-c) = \ln(0.05) = -2.996$ and $\ln(1-p) = \ln(0.95) = -0.0513$. Their ratio is $-2.996 / -0.0513 = 58.4$, and you round up to a whole trial: $\lceil 58.4 \rceil = $ 59 trials. Notably, this number does not depend on the number of hyperparameters at all.

INSTRUMENT V2.1 — GRID vs RANDOM ON A RESPONSE SURFACEEQUAL BUDGET · ONE DOMINANT AXIS · EQ V2.3

BUDGET (EVALUATIONS) 36

AXIS IMPORTANCE SKEW high

GRID — DISTINCT x TESTED

—

GRID BEST f

—

RANDOM BEST f

—

Both panels spend the same budget on the same 2D response surface; brighter background is lower (better) loss, and the minimum is the mint ring. The grid (left) lays points on a lattice, so its projection onto the important x axis (the strip under each panel) bunches into a few repeated values. Random search (right) scatters, so its x-projection is dense. Crank AXIS IMPORTANCE SKEW up — make the vertical axis nearly irrelevant — and the grid's wasted budget becomes obvious: it keeps re-testing the same handful of x values, while random keeps finding new ones. Press RESEED to redraw the random trials.

2.4

Bayesian optimization: spend queries where they pay

Grid and random search are both uninformed — neither looks at the results of past trials when choosing the next one. When evaluations are genuinely expensive (training a large model for hours), that is wasteful: you have data about the objective and you are ignoring it. Bayesian optimization closes the loop. It fits a cheap probabilistic surrogate model of $f$ from the trials seen so far, then uses the surrogate to choose the most promising next query — sequential and sample-efficient by design.

The classic surrogate is a Gaussian process, which returns, at any candidate $\theta$, a posterior mean $\mu(\theta)$ (its best guess of the objective) and a standard deviation $\sigma(\theta)$ (its uncertainty). The genius is the second number: $\sigma$ is large where you have not looked. An acquisition function turns $(\mu, \sigma)$ into a single score that balances exploitation (go where $\mu$ is good) against exploration (go where $\sigma$ is high). A common, transparent choice is the upper/lower confidence bound:

EQ V2.5 — CONFIDENCE-BOUND ACQUISITION (minimization) $$ \theta_{\text{next}} \;=\; \arg\min_{\theta}\; \alpha(\theta), \qquad \alpha(\theta) \;=\; \mu(\theta) \;-\; \kappa\,\sigma(\theta), \qquad \kappa \geq 0 $$

For minimization we pick the point with the lowest lower confidence bound: $\mu - \kappa\sigma$ is optimistic in the direction we care about, rewarding both small predicted loss ($\mu$ low) and high uncertainty ($\sigma$ large). $\kappa$ is the explore–exploit dial: $\kappa = 0$ is pure greedy exploitation; large $\kappa$ chases the most uncertain region. Expected Improvement (EI) is the other standard acquisition and needs no $\kappa$, self-balancing via the improvement over the best seen so far. After each real evaluation the surrogate is refit and the loop repeats — typically converging in tens of trials where random search needs hundreds.

Bayesian optimization is the right tool when evaluations dominate cost and trials are necessarily sequential. Its honest caveats: a Gaussian process surrogate scales poorly past a few thousand trials and a couple dozen dimensions, and it is harder to parallelize than random search (each query depends on the last). For high-dimensional or massively parallel settings, tree-based surrogates (the TPE algorithm behind Optuna and Hyperopt) and the bandit-style methods of §2.5 often win — and in practice modern tuners (Optuna, Ray Tune) combine a Bayesian sampler with an early-stopping scheduler rather than choosing one.

A Gaussian-process surrogate predicts, at a candidate point, mean $\mu = 2.0$ and standard deviation $\sigma = 0.2$. Using the confidence-bound acquisition of EQ V2.5 with $\kappa = 2$, what is its acquisition value $\mu - \kappa\sigma$ for minimization?

$\mu - \kappa\sigma = 2.0 - 2 \times 0.2 = 2.0 - 0.4 = $ 1.6. A second candidate with the same $\mu = 2.0$ but larger $\sigma = 0.5$ would score $2.0 - 2\times0.5 = 1.0$ — lower, hence preferred, because the optimizer is rewarded for probing where it is uncertain.

PYTHON · RUNNABLE IN-BROWSER

# Toy Bayesian optimization: maximize a 1D function with a simple RBF surrogate.
import numpy as np
rng = np.random.default_rng(0)

def objective(x):                              # the expensive black box (we "don't know" it)
    return np.sin(3*x) + 0.5*np.sin(7*x) - 0.05*(x-2)**2

grid = np.linspace(0, 5, 400)                  # candidate pool for the acquisition step
X = np.array([0.5, 4.5]);  Y = objective(X)    # two initial probes
ls, kappa = 0.4, 2.0                           # RBF length-scale; explore-exploit dial

def kern(a, b):                                # RBF / squared-exponential kernel
    return np.exp(-0.5*((a[:, None]-b[None, :])/ls)**2)

for step in range(8):                          # 8 sequential, sample-efficient queries
    K = kern(X, X) + 1e-6*np.eye(len(X))
    Kinv = np.linalg.inv(K)
    ks = kern(grid, X)
    mu = ks @ Kinv @ Y                          # posterior mean over the grid
    var = 1.0 - np.sum((ks @ Kinv) * ks, axis=1)
    sd = np.sqrt(np.clip(var, 1e-9, None))      # posterior std (uncertainty)
    acq = mu + kappa*sd                         # UCB: we are MAXIMIZING here
    nxt = grid[np.argmax(acq)]                  # next query = argmax acquisition
    X = np.append(X, nxt);  Y = np.append(Y, objective(nxt))

print(f"true max over grid : {objective(grid).max():.4f} at x = {grid[objective(grid).argmax()]:.3f}")
print(f"BO best after {len(X)} evals: {Y.max():.4f} at x = {X[np.argmax(Y)]:.3f}")
plot_xy(grid, mu)                              # final surrogate mean

edits are live — break it on purpose

INSTRUMENT V2.2 — BAYESIAN-OPTIMIZATION STEPPERGP SURROGATE · μ ± κσ · NEXT QUERY = argmin · EQ V2.5

EXPLORE–EXPLOIT κ 2.0

EVALUATIONS SO FAR

—

BEST f FOUND (min)

—

NEXT QUERY x

—

The hidden objective (faint grey line) is a black box; the optimizer only knows the mint probes it has spent. The blue band is the surrogate's mean $\mu$ bracketed by its uncertainty $\pm\sigma$ — wide where nothing has been sampled, pinched to nothing at each probe. The dashed marker is the next query: the argmin of the lower confidence bound $\mu - \kappa\sigma$. Press STEP to spend one evaluation and watch the band collapse there; the optimizer converges on the true minimum in a handful of steps. Slide $\kappa$ to 0 for pure greedy (it can get stuck in a local dip) or up to 4 for restless exploration.

2.5

Hyperband & successive halving

Every method so far runs each configuration to completion before judging it. But for iterative learners — neural nets, gradient boosting — a configuration's fate is often visible early: a learning rate that will diverge usually starts diverging in the first few epochs. Successive halving exploits this. It is a tournament on budget: start many configurations on a tiny budget (a few epochs, a small data subset), throw away the worst fraction, give the survivors more budget, and repeat. Most of the compute lands on the few configurations that have already proven themselves.

EQ V2.6 — SUCCESSIVE HALVING SCHEDULE $$ n_r \;=\; \Big\lfloor \frac{n_0}{\eta^{\,r}} \Big\rfloor, \qquad b_r \;=\; b_0\,\eta^{\,r}, \qquad r = 0, 1, \ldots, \big\lfloor \log_{\eta} n_0 \big\rfloor $$

Round $r$ keeps $n_r$ survivors and gives each a per-config budget $b_r$; $\eta$ is the cull factor (keep the top $1/\eta$ each round, usually $\eta = 3$). Configurations fall geometrically while budget per survivor rises geometrically, so the total budget spent per round stays roughly constant $(n_r\, b_r \approx n_0\, b_0)$ — the scheme reallocates a fixed pot toward the promising, never enlarging it. With $\eta = 3$ and $n_0 = 27$: $27 \to 9 \to 3 \to 1$ over $\log_3 27 = 3$ elimination rounds.

Successive halving has one free parameter that hides a real dilemma: how many configurations $n_0$ to start with, given a fixed total budget $B$? Start with many cheap configs (large $n_0$, small $b_0$) and you explore widely but might cut a slow-starting winner before it blooms. Start with few well-funded configs and you risk never sampling a good one. There is no universally right answer because it depends on how fast the early signal correlates with final performance.

Hyperband (Li et al., 2017) resolves this by refusing to choose. It runs successive halving as a subroutine across a spectrum of brackets, each with a different $(n_0, b_0)$ trade-off — from "many configs, tiny budget each" (aggressive early stopping) to "few configs, full budget each" (essentially random search, which never wrongly culls). By hedging across brackets it is robust to how early-stopping-friendly the problem turns out to be, and it provably loses only a small logarithmic factor to the best fixed bracket chosen in hindsight. In 2026 the bracket idea, usually under the ASHA variant (asynchronous successive halving), is the default scheduler in Ray Tune and Optuna — and is routinely paired with a Bayesian sampler choosing which configurations enter the tournament.

Successive halving starts with $n_0 = 27$ configurations and culls with factor $\eta = 3$ (keep the top third each round). How many elimination rounds does it take to reach a single survivor ($\log_{\eta} n_0$)?

Each round divides the survivors by $\eta = 3$: $27 \to 9 \to 3 \to 1$. The number of rounds is $\log_3 27 = \log_3 3^3 = $ 3. Because budget per survivor triples each round while their count thirds, the compute spent per round is roughly constant.

INSTRUMENT V2.3 — SUCCESSIVE-HALVING BRACKETCULL THE WORST 1−1/η EACH ROUND · EQ V2.6

STARTING CONFIGS n₀ 27

CULL FACTOR η 3

ELIMINATION ROUNDS

—

SURVIVORS PER ROUND

—

TOTAL BUDGET vs RUN-ALL

—

Each column is one configuration; each row is a round, time flowing downward. A mint cell is a survivor still being trained; a faded grey cell was culled. Bar height within a survivor cell encodes the budget it now receives — short at the top (cheap early peeks), tall at the bottom (the finalists get full training). Notice the shape: a wide cheap top narrowing to a single well-funded survivor. The right-hand readout compares the total budget spent against naively running every configuration to completion — the savings that make early stopping worth its one risk, culling a late bloomer. Raise $\eta$ to cull more aggressively (fewer rounds, bigger savings, higher risk).

Tuning optimizes a number; the next chapter asks whether you chose the right number. Every method here minimized a validation metric — but accuracy, AUC, F1, RMSE, and log loss can rank the same models in opposite orders, and the "best" configuration is only as honest as the metric it was scored on. Chapter 03: the regression and classification metrics, what each one rewards and quietly punishes, and how to pick the objective your search should have been optimizing all along.

2.R

References

Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. JMLR 13 — the result that random search matches or beats grid search under low effective dimensionality (§2.3).
Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS 2012 — Gaussian-process surrogates and acquisition functions for tuning (§2.4).
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 18 — successive halving across brackets, the bandit view of early stopping (§2.5).
Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization. AISTATS 2016 — the successive-halving subroutine behind EQ V2.6.
Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning. MLSys 2020 — ASHA, the asynchronous successive halving used in production tuners.
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD 2019 — the TPE sampler plus pruning combination that is the practical default in 2026.