The search space & the objective
A hyperparameter is any setting fixed before training that the optimizer is not allowed to touch: the learning rate of gradient descent, the number of trees and their depth in a forest, the \(C\) of an SVM, the dropout rate of a network. Cross-validation (the previous chapter) tells you how to score one configuration honestly. Tuning is the outer question it left open: which configurations should you even try?
Frame it as optimization. Let \(\theta \in \Theta\) be a configuration drawn from a search space \(\Theta\), and let \(f(\theta)\) be the validation objective — typically the cross-validated loss, which we want to minimize:
Three properties of the space drive every design choice. First, scale: a learning rate ranges over orders of magnitude, so you search it on a log scale (\(10^{-5}\) to \(10^{-1}\)), not a linear one — uniform-in-log, where each decade gets equal attention. Second, type: some knobs are continuous (learning rate), some integer (tree depth), some categorical (optimizer ∈ {Adam, SGD}). Third, and most consequential, effective dimensionality: of the dozen hyperparameters you nominally tune, usually only two or three actually move the score. That last fact is the hinge on which random search beats grid search.
Grid search: exhaustive and quietly wasteful
Grid search is the reflex: pick a finite set of values per hyperparameter, take the Cartesian product, evaluate every point, keep the best. It is trivial to implement, embarrassingly parallel, and reproducible to the last digit. For one or two hyperparameters it is perfectly reasonable. Its problems begin the moment the space has more dimensions or more resolution than your budget can sweep.
The first problem is the exponential of EQ V2.2: refining a grid from 5 to 10 values per axis multiplies the cost by \(2^D\), so the same compute buys you exponentially coarser resolution as \(D\) grows. The second problem is subtler and more damaging. A grid spends its budget on the Cartesian product even when most axes do not matter. Project a 2D grid onto the one hyperparameter that actually drives the score, and the grid's points collapse onto each other — a \(g \times g\) grid tests only \(g\) distinct values of the hyperparameter that matters, no matter how large \(g\) is.
Grid search is not obsolete. When you have exactly one or two hyperparameters, when reproducibility and an auditable, even sweep matter (a regulated model-risk setting), or when each evaluation is cheap, a small grid is clear and defensible. The lesson is narrower than "never grid": do not let a grid's tidiness lull you into spending an exponential budget resolving axes that do not move the metric.
# Grid vs random search on a 2D objective; best-found vs number of evaluations.
import numpy as np
rng = np.random.default_rng(0)
# A black-box objective on [0,1]^2 with one DOMINANT axis (x) and a near-flat one (y).
# Minimum sits near x*=0.30; y barely matters -> the classic random-beats-grid setup.
def f(x, y):
return (x - 0.30)**2 + 0.03*(y - 0.7)**2 + 0.01*np.sin(12*x)
# GRID: g x g points -> g^2 evals, but only g DISTINCT x values are ever tested.
for g in (4, 6):
xs = np.linspace(0, 1, g)
grid = [f(x, y) for x in xs for y in xs]
print(f"grid {g}x{g} = {g*g:2d} evals | best f = {min(grid):.4f} "
f"| distinct x tested = {g}")
# RANDOM: same budgets, but every draw is a fresh x AND y.
print()
for n in (16, 36):
X = rng.random((n, 2))
vals = f(X[:, 0], X[:, 1])
print(f"random {n:2d} evals | best f = {vals.min():.4f} "
f"| distinct x tested = {n}")
print("\nEqual budgets: random resolves the x that matters far more finely than grid.")
Random search: the surprising default
Random search replaces the grid with independent draws: sample each hyperparameter from a distribution (uniform, log-uniform, categorical) for some budget \(n\) of trials, evaluate, keep the best. Bergstra and Bengio's 2012 result — one of the most quietly influential papers in applied ML — showed that for the same budget, random search matches or beats grid search on neural-network tuning, and the reason is exactly EQ V2.3: when only a few of the many hyperparameters matter, random search devotes its full budget to many distinct values of the important ones, while the grid wastes most of its points on combinations of the unimportant ones.
There is also a clean probabilistic guarantee that needs no assumption about which axes matter. If you call a configuration "good" when it lands in the top \(p\) fraction of the search space, then \(n\) independent random draws miss the good region entirely with probability \((1-p)^n\):
Bayesian optimization: spend queries where they pay
Grid and random search are both uninformed — neither looks at the results of past trials when choosing the next one. When evaluations are genuinely expensive (training a large model for hours), that is wasteful: you have data about the objective and you are ignoring it. Bayesian optimization closes the loop. It fits a cheap probabilistic surrogate model of \(f\) from the trials seen so far, then uses the surrogate to choose the most promising next query — sequential and sample-efficient by design.
The classic surrogate is a Gaussian process, which returns, at any candidate \(\theta\), a posterior mean \(\mu(\theta)\) (its best guess of the objective) and a standard deviation \(\sigma(\theta)\) (its uncertainty). The genius is the second number: \(\sigma\) is large where you have not looked. An acquisition function turns \((\mu, \sigma)\) into a single score that balances exploitation (go where \(\mu\) is good) against exploration (go where \(\sigma\) is high). A common, transparent choice is the upper/lower confidence bound:
Bayesian optimization is the right tool when evaluations dominate cost and trials are necessarily sequential. Its honest caveats: a Gaussian process surrogate scales poorly past a few thousand trials and a couple dozen dimensions, and it is harder to parallelize than random search (each query depends on the last). For high-dimensional or massively parallel settings, tree-based surrogates (the TPE algorithm behind Optuna and Hyperopt) and the bandit-style methods of §2.5 often win — and in practice modern tuners (Optuna, Ray Tune) combine a Bayesian sampler with an early-stopping scheduler rather than choosing one.
# Toy Bayesian optimization: maximize a 1D function with a simple RBF surrogate.
import numpy as np
rng = np.random.default_rng(0)
def objective(x): # the expensive black box (we "don't know" it)
return np.sin(3*x) + 0.5*np.sin(7*x) - 0.05*(x-2)**2
grid = np.linspace(0, 5, 400) # candidate pool for the acquisition step
X = np.array([0.5, 4.5]); Y = objective(X) # two initial probes
ls, kappa = 0.4, 2.0 # RBF length-scale; explore-exploit dial
def kern(a, b): # RBF / squared-exponential kernel
return np.exp(-0.5*((a[:, None]-b[None, :])/ls)**2)
for step in range(8): # 8 sequential, sample-efficient queries
K = kern(X, X) + 1e-6*np.eye(len(X))
Kinv = np.linalg.inv(K)
ks = kern(grid, X)
mu = ks @ Kinv @ Y # posterior mean over the grid
var = 1.0 - np.sum((ks @ Kinv) * ks, axis=1)
sd = np.sqrt(np.clip(var, 1e-9, None)) # posterior std (uncertainty)
acq = mu + kappa*sd # UCB: we are MAXIMIZING here
nxt = grid[np.argmax(acq)] # next query = argmax acquisition
X = np.append(X, nxt); Y = np.append(Y, objective(nxt))
print(f"true max over grid : {objective(grid).max():.4f} at x = {grid[objective(grid).argmax()]:.3f}")
print(f"BO best after {len(X)} evals: {Y.max():.4f} at x = {X[np.argmax(Y)]:.3f}")
plot_xy(grid, mu) # final surrogate mean
Hyperband & successive halving
Every method so far runs each configuration to completion before judging it. But for iterative learners — neural nets, gradient boosting — a configuration's fate is often visible early: a learning rate that will diverge usually starts diverging in the first few epochs. Successive halving exploits this. It is a tournament on budget: start many configurations on a tiny budget (a few epochs, a small data subset), throw away the worst fraction, give the survivors more budget, and repeat. Most of the compute lands on the few configurations that have already proven themselves.
Successive halving has one free parameter that hides a real dilemma: how many configurations \(n_0\) to start with, given a fixed total budget \(B\)? Start with many cheap configs (large \(n_0\), small \(b_0\)) and you explore widely but might cut a slow-starting winner before it blooms. Start with few well-funded configs and you risk never sampling a good one. There is no universally right answer because it depends on how fast the early signal correlates with final performance.
Hyperband (Li et al., 2017) resolves this by refusing to choose. It runs successive halving as a subroutine across a spectrum of brackets, each with a different \((n_0, b_0)\) trade-off — from "many configs, tiny budget each" (aggressive early stopping) to "few configs, full budget each" (essentially random search, which never wrongly culls). By hedging across brackets it is robust to how early-stopping-friendly the problem turns out to be, and it provably loses only a small logarithmic factor to the best fixed bracket chosen in hindsight. In 2026 the bracket idea, usually under the ASHA variant (asynchronous successive halving), is the default scheduler in Ray Tune and Optuna — and is routinely paired with a Bayesian sampler choosing which configurations enter the tournament.
Tuning optimizes a number; the next chapter asks whether you chose the right number. Every method here minimized a validation metric — but accuracy, AUC, F1, RMSE, and log loss can rank the same models in opposite orders, and the "best" configuration is only as honest as the metric it was scored on. Chapter 03: the regression and classification metrics, what each one rewards and quietly punishes, and how to pick the objective your search should have been optimizing all along.
References
- Bergstra, J. & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization.
- Snoek, J., Larochelle, H. & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms.
- Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A. & Talwalkar, A. (2017). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.
- Jamieson, K. & Talwalkar, A. (2016). Non-stochastic Best Arm Identification and Hyperparameter Optimization.
- Li, L. et al. (2020). A System for Massively Parallel Hyperparameter Tuning.
- Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework.