10 · Support Vector Machines & the Kernel Trick

10.1

The maximum-margin idea

Suppose two classes of points are linearly separable. Then there are infinitely many straight lines that split them with zero training error — and the perceptron of Chapter 03 will happily return whichever one it stumbles onto first. The support vector machine asks a sharper question: among all separating hyperplanes, which one is best? Its answer is geometric and almost obvious once stated: pick the boundary that keeps the largest possible empty buffer on either side. A line that skims past a training point is fragile — nudge that point slightly and it flips sides. A line centered in a wide no-man's-land is robust.

A hyperplane is the set of points satisfying $w \cdot x + b = 0$, where $w$ is a normal vector pointing across the boundary and $b$ shifts it from the origin. The signed distance from any point $x$ to that hyperplane is $(w \cdot x + b)/\lVert w \rVert$. The classifier is the sign of that expression: $\hat{y} = \operatorname{sign}(w \cdot x + b)$. What we want to maximize is the margin — the distance from the boundary to the closest point of either class.

EQ M10.1 — THE GEOMETRIC MARGIN $$ \gamma \;=\; \min_{i}\; \frac{y_i\,(w \cdot x_i + b)}{\lVert w \rVert}, \qquad y_i \in \{-1, +1\} $$

Labels are $\pm 1$ (not $0/1$) precisely so that $y_i(w\cdot x_i + b)$ is positive exactly when the point is on the correct side — the label cancels the sign. Dividing by $\lVert w \rVert$ converts the raw score into a real distance, so $\gamma$ is measured in the units of the data, not the units of $w$. The SVM's whole objective is to find the $(w, b)$ that makes this smallest distance as large as possible. The corridor's full width is $2\gamma$.

There is a redundancy to remove first. The pair $(w, b)$ and $(2w, 2b)$ describe the same hyperplane but give different score magnitudes. SVMs fix this by a canonical normalization: scale $(w, b)$ so that the closest points score exactly $\pm 1$, i.e. $\min_i y_i(w\cdot x_i + b) = 1$. Under that convention the margin in EQ M10.1 simplifies to the single clean quantity $\gamma = 1/\lVert w \rVert$ — so maximizing the margin is the same as minimizing $\lVert w \rVert$. That equivalence is the hinge on which the next section turns.

FIG M10.AONE OF MANY SEPARATORS VS. THE MAXIMUM-MARGIN ONE

Same data, two boundaries. The left line separates the classes but hugs a blue point; the right one is centered in the widest empty corridor. The three ringed points are support vectors — the only points that touch the margin, and the only ones that matter (§10.2).

The maximum-margin principle is not just aesthetic. Margin width controls the capacity of the classifier in the bias–variance sense of Chapter 06: a wider margin is a simpler, lower-variance hypothesis, and the bound on generalization error degrades with $1/\gamma^2$ rather than with the dimension of the space. That is the deep reason SVMs tolerate enormous feature spaces (§10.4) without overfitting in the way intuition warns they should.

10.2

The hard-margin problem and its support vectors

Pin down the redundancy with the canonical scaling and the SVM becomes a tidy convex optimization problem: minimize $\lVert w \rVert$ — equivalently $\tfrac{1}{2}\lVert w \rVert^2$, which is smoother — subject to every point sitting on the correct side of its margin.

EQ M10.2 — THE HARD-MARGIN PRIMAL $$ \min_{w,\,b}\; \frac{1}{2}\lVert w \rVert^2 \qquad \text{subject to}\qquad y_i\,(w \cdot x_i + b) \;\ge\; 1 \quad \text{for all } i $$

A quadratic objective under linear inequality constraints — a quadratic program, with a unique global optimum and no local minima to trap you (contrast the loss surfaces of Chapters 07–08). Each constraint says "point $i$ is correctly classified and at least one margin-unit away from the boundary." Points that hold their constraint with equality ($y_i(w\cdot x_i + b) = 1$) sit exactly on the margin; everything else has slack to spare.

Solving the dual of EQ M10.2 — via Lagrange multipliers $\alpha_i \ge 0$, one per point — reveals the structure that gives the method its name. The optimum has the form $w = \sum_i \alpha_i y_i x_i$, and the Karush–Kuhn–Tucker conditions force $\alpha_i = 0$ for every point strictly outside the margin. Only the points on the margin carry a nonzero $\alpha_i$. Those are the support vectors.

EQ M10.3 — THE DUAL: ONLY DOT PRODUCTS SURVIVE $$ \max_{\alpha \ge 0}\; \sum_i \alpha_i \;-\; \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\,(x_i \cdot x_j) \qquad \text{s.t.}\quad \sum_i \alpha_i y_i = 0 $$

Two facts make this the most important equation in the chapter. First, the solution is sparse in $\alpha$: typically only a small fraction of points are support vectors, so deleting every other training point leaves the boundary unchanged. Second — and this is the seed of §10.4 — the data enters only through inner products $x_i \cdot x_j$. Nowhere does the optimizer need the coordinates themselves, only how points relate. Swap that dot product for a kernel and you have changed feature spaces without touching a single line of the solver.

The consequence is striking. A trained SVM is a list of support vectors, their labels, their weights $\alpha_i$, and a bias $b$. On a clean problem that might be a dozen points out of a million. The decision function is

EQ M10.4 — THE DECISION FUNCTION $$ \hat{y}(x) \;=\; \operatorname{sign}\!\Big( \sum_{i \in \text{SV}} \alpha_i\, y_i\,(x_i \cdot x) \;+\; b \Big) $$

Classification reduces to comparing the new point against the support vectors alone. The model's memory footprint and prediction cost scale with the number of support vectors, not the size of the training set — which is exactly why SVMs were practical on 1990s hardware and why a "too easy" problem (few SVs) and a "too hard" one (almost every point becomes an SV) feel so different at deployment time.

An SVM is canonically scaled so the closest points score $\pm 1$, and its weight vector has norm $\lVert w \rVert = 2$. By the simplification of EQ M10.1 ($\gamma = 1/\lVert w \rVert$), what is the margin $\gamma$ — the distance from the boundary to the nearest point?

Under canonical scaling the geometric margin collapses to $\gamma = 1/\lVert w \rVert = 1/2 = $ 0.5. The full corridor between the two classes is twice this, $2\gamma = 1$. Doubling $\lVert w \rVert$ halves the margin — which is exactly why minimizing $\lVert w \rVert^2$ maximizes the corridor.

INSTRUMENT M10.1 — MARGIN EXPLORERDRAG POINTS · MAX-MARGIN SOLVED LIVE · EQ M10.2

ACTIONS

MARGIN γ = 1/‖w‖

—

‖w‖

—

SUPPORT VECTORS

—

STATUS

—

Drag any point and the maximum-margin boundary re-solves instantly (a small coordinate-ascent on the dual of EQ M10.2). The solid white line is the boundary; the dashed lines are the margins; ringed points are the support vectors that touch them. Drag a faraway point around — the boundary does not move, because its $\alpha_i$ is zero. Now drag a support vector and watch everything shift: only the points on the corridor have a vote.

PYTHON · RUNNABLE IN-BROWSER

# Linear soft-margin SVM by sub-gradient descent on the hinge loss
# (EQ M10.2 + M10.5). Recovers the margin and the support vectors.
import numpy as np
rng = np.random.default_rng(0)

# two clearly separable clouds, labels in {-1, +1}
n = 40
Xp = rng.normal([ 2.2,  2.0], 0.6, (n, 2))
Xm = rng.normal([-2.0, -1.6], 0.6, (n, 2))
X  = np.vstack([Xp, Xm]); y = np.r_[np.ones(n), -np.ones(n)]

w = np.zeros(2); b = 0.0; C = 1.0
for t in range(1, 4001):                       # Pegasos-style schedule
    lr = 1.0 / (0.01 * t)                       # 0.01 = regularization 1/(C·N)
    m = y * (X @ w + b)                         # margins y·f(x)
    viol = m < 1                                # points inside the margin
    grad_w = 0.01 * w - C / len(X) * (y[viol] @ X[viol])
    grad_b =          - C / len(X) * y[viol].sum()
    w -= lr * grad_w; b -= lr * grad_b

m = y * (X @ w + b)
sv = np.where(m < 1.05)[0]                      # points on/inside the margin
print(f"||w||          = {np.linalg.norm(w):.3f}")
print(f"margin 1/||w|| = {1/np.linalg.norm(w):.3f}")
print(f"train accuracy = {(np.sign(X @ w + b) == y).mean():.3f}")
print(f"support vectors= {len(sv)} of {len(X)} points")
plot_scatter(X[:, 0], X[:, 1], (y > 0).astype(int))

push the clouds together (try means ±1.0) and watch the support-vector count climb

10.3

Soft margin and the hinge loss

Real data is not cleanly separable. One mislabeled point, or two classes that genuinely overlap, and the hard-margin problem of EQ M10.2 has no feasible solution — every hyperplane violates some constraint. Cortes and Vapnik's 1995 fix, the move that turned a beautiful idea into a workhorse, was to let constraints be broken at a price. Introduce a slack variable $\xi_i \ge 0$ for each point measuring how far it intrudes into (or past) its margin, and pay for the total slack:

EQ M10.5 — THE SOFT-MARGIN PRIMAL $$ \min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^2 \;+\; C\sum_i \xi_i \qquad \text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 $$

The hyperparameter $C$ sets the exchange rate between a wide margin and few violations. Large $C$ punishes every violation harshly — the boundary contorts to classify training points, low bias, high variance (toward the hard margin as $C \to \infty$). Small $C$ tolerates mistakes for the sake of a fat, smooth margin — high bias, low variance. $C$ is the single most important knob on an SVM, and it is precisely the regularization strength of Chapter 06 wearing a different letter.

Eliminating the slack variables (each is just $\xi_i = \max(0, 1 - y_i f(x_i))$ at the optimum) recasts the soft-margin SVM as plain regularized empirical-risk minimization with one specific loss — the hinge loss:

EQ M10.6 — HINGE LOSS = SVM IN DISGUISE $$ \mathcal{L}(w, b) \;=\; \underbrace{\frac{1}{2}\lVert w \rVert^2}_{\text{margin / L2}} \;+\; C \sum_i \underbrace{\max\!\big(0,\; 1 - y_i\,f(x_i)\big)}_{\text{hinge loss } \ell_{\text{hinge}}}, \qquad f(x) = w \cdot x + b $$

The hinge $\ell(z) = \max(0, 1 - z)$, with $z = y\,f(x)$ the signed margin, is the soul of the method. It is zero once a point is correctly classified with margin $\ge 1$ (so those points exert no force — the sparsity of §10.2), then rises linearly as the point drifts inside the margin or onto the wrong side. Unlike the squared loss, it does not keep penalizing points that are already comfortably right; unlike the 0–1 loss it is convex and has a usable (sub-)gradient. Logistic regression's log-loss is its smooth cousin — same shape, rounded corner.

The hinge's corner at $z = 1$ is the entire personality of the SVM. Plug in the two cases the brief cares about: a point classified correctly and well clear of the margin ($z = y\,f(x) = 1.2$) costs nothing, while a point on the wrong side ($z = -0.5$) costs $1 - (-0.5) = 1.5$. The first two exercises walk exactly these.

A correctly classified point has signed margin $z = y\cdot f(x) = 1.2$. Using the hinge loss $\ell = \max(0,\; 1 - z)$ from EQ M10.6, what loss does this point incur?

$\ell = \max(0,\; 1 - 1.2) = \max(0,\; -0.2) = $ 0. The point is past the margin ($z > 1$), so the hinge is flat at zero — it contributes no gradient and is not a support vector. This zero-region is exactly what makes the SVM solution sparse.

A point lands on the wrong side of the boundary with signed margin $z = y\cdot f(x) = -0.5$. What hinge loss $\ell = \max(0,\; 1 - z)$ does it incur (EQ M10.6)?

$\ell = \max(0,\; 1 - (-0.5)) = \max(0,\; 1.5) = $ 1.5. A misclassified point pays more than 1 (it would pay exactly 1 if it sat on the boundary at $z = 0$), and the penalty grows linearly the deeper it strays. This nonzero hinge makes it an active, margin-violating support vector.

INSTRUMENT M10.2 — HINGE LOSS & THE C TRADE-OFFEQ M10.5 / M10.6 · LIVE

REGULARIZATION C 1.0

CLASS OVERLAP 0.9

MARGIN γ = 1/‖w‖

—

TOTAL HINGE Σξ

—

MARGIN VIOLATIONS

—

TRAIN ACCURACY

—

The curve on the left is the hinge loss $\max(0, 1-z)$ plotted against the signed margin $z$ — note the elbow at $z=1$ and the flat zero beyond it. The panel on the right shows two overlapping clouds with the live SVM boundary and its margins. Slide C from small (fat margin, many tolerated violations) to large (the boundary fights to classify every point, margin shrinks). Raise the overlap until the classes genuinely mix and watch even a large C give up — there is no separating line left to find.

10.4

The kernel trick: RBF and polynomial

So far the boundary is a flat hyperplane — useless for data shaped like concentric rings or two interleaved spirals. The classical escape is to lift the data into a higher-dimensional space where it becomes linearly separable: map each $x$ through some feature transform $\phi(x)$, fit a linear SVM there, and the flat boundary upstairs is a curved one back downstairs. The catch is cost: a good $\phi$ might have thousands or infinitely many components, and computing them all is hopeless.

Here is the trick, and it is genuinely a piece of magic. Recall from EQ M10.3 and M10.4 that the SVM never needs $\phi(x)$ by itself — it only ever needs inner products $\phi(x_i)\cdot\phi(x_j)$. So if some cheap function $K(x_i, x_j)$ happens to equal that inner product, we can compute in the lifted space while never visiting it:

EQ M10.7 — THE KERNEL TRICK $$ K(x, x') \;=\; \phi(x) \cdot \phi(x') \qquad\Longrightarrow\qquad \hat{y}(x) \;=\; \operatorname{sign}\!\Big( \sum_{i \in \text{SV}} \alpha_i\, y_i\, K(x_i, x) \;+\; b \Big) $$

Every $x_i \cdot x_j$ in the dual becomes $K(x_i, x_j)$; nothing else changes. Mercer's theorem tells you which functions $K$ are legal: any symmetric $K$ whose Gram matrix $[K(x_i,x_j)]$ is positive semi-definite corresponds to some valid feature map $\phi$ — you never have to construct it. You design the similarity measure, not the coordinates. This single substitution turns the linear SVM into a universal nonlinear classifier.

Two kernels dominate practice:

EQ M10.8 — RBF AND POLYNOMIAL KERNELS $$ K_{\text{RBF}}(x, x') = \exp\!\big(-\gamma\,\lVert x - x' \rVert^2\big), \qquad K_{\text{poly}}(x, x') = \big(\gamma\, x \cdot x' + r\big)^{d} $$

The RBF (Gaussian) kernel is the default — a smooth bump of similarity that decays with distance. Its feature map $\phi$ is infinite-dimensional, so the trick is buying you something you could literally never compute by hand. The width parameter $\gamma$ is decisive: small $\gamma$ means each support vector's influence reaches far (smooth, almost-linear boundary); large $\gamma$ means influence is hyper-local (the boundary wraps tightly around individual points — overfitting). The polynomial kernel of degree $d$ implicitly contains all feature interactions up to order $d$: degree 2 gives you every pairwise product $x_a x_b$ for free.

RBF and $C$ are tuned together, almost always by cross-validated grid search, because they trade against each other: a large $\gamma$ and a large $C$ both push toward a wiggly, overfit boundary. The instrument below lets you feel that interaction on data that no straight line can touch.

INSTRUMENT M10.3 — KERNEL PLAYGROUNDRBF SVM ON INSEPARABLE RINGS · EQ M10.7 / M10.8

RBF WIDTH γ 2.0

REGULARIZATION C 5

DATASET

TRAIN ACCURACY

—

SUPPORT VECTORS

—

REGIME

—

A linear SVM scores ~50% here — the classes are concentric. The painted regions are a real RBF-kernel SVM (dual coordinate ascent over EQ M10.7) refit live in your browser. Start at a small $\gamma$: the boundary is smooth but may miss the inner ring. Crank $\gamma$ up and the boundary tightens into closed loops around clusters — eventually shrink-wrapping individual points, the visual signature of overfitting. Lower $C$ to forgive stray points and recover a calmer frontier.

PYTHON · RUNNABLE IN-BROWSER

# The RBF kernel matrix on inseparable rings, and why lifting helps.
# Inner ring = class 0, outer ring = class 1 -- no line can split them.
import numpy as np
rng = np.random.default_rng(1)

def ring(r, n):
    t = rng.uniform(0, 2*np.pi, n)
    rad = r + rng.normal(0, 0.12, n)
    return np.c_[rad*np.cos(t), rad*np.sin(t)]

X = np.vstack([ring(0.6, 60), ring(2.2, 60)])
y = np.r_[np.zeros(60, int), np.ones(60, int)]

def rbf(A, B, g):                              # K_ij = exp(-g ||a_i - b_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-g * d2)

K = rbf(X, X, g=1.0)                           # full 120x120 Gram matrix
print("Gram matrix shape          :", K.shape)
print("diagonal (self-similarity) :", np.round(K[0, 0], 3), "(always 1)")

# A point's mean kernel-similarity to each class is already discriminative,
# even though the raw coordinates are not linearly separable at all.
sim0 = K[:, y == 0].mean(1)                    # closeness to inner ring
sim1 = K[:, y == 1].mean(1)                    # closeness to outer ring
pred = (sim1 > sim0).astype(int)               # 1-NN-in-feature-space sanity check
print("nearest-mean acc in feature space:", f"{(pred == y).mean():.3f}")
print("=> in RBF feature space the rings ARE separable.")
plot_scatter(X[:, 0], X[:, 1], y)

drop g to 0.05 (too smooth) or push to 30 (too local) and watch the feature-space accuracy crack

WHY IT WORKS

The kernel trick is a similarity measure in disguise. An RBF SVM's prediction (EQ M10.7) is a weighted vote of support vectors, where the weight $K(x_i, x)$ is how similar the query is to each one — large $\gamma$ makes "similar" mean "almost identical." Seen this way it is a close cousin of the k-NN and kernel-density ideas you will meet in Chapter 11: distance defines everything. The SVM's contribution is choosing which points to remember (the support vectors) and how much to weight each (the $\alpha_i$) by maximizing a margin, rather than keeping the whole training set.

10.5

SVMs in practice — and against the field

For roughly a decade — from the late 1990s until deep learning's 2012 breakout — kernel SVMs were the most accurate general-purpose classifiers available, the default for text categorization, handwritten-digit recognition, and bioinformatics. They have receded, but understanding when they still win, and why they faded, is worth a section.

The practical checklist is short and unusually reliable:

Scale your features. Both the dot product and the RBF distance are dominated by large-magnitude features, so standardize to zero mean and unit variance first. This is not optional — an unscaled SVM is mostly listening to whichever column happens to have the biggest numbers.
Start with the RBF kernel, and grid-search $(C, \gamma)$ on a log scale with cross-validation. Linear kernels are the right call only when the feature space is already huge and sparse (e.g. bag-of-words text), where they are both faster and just as accurate.
Watch the support-vector count. If nearly every training point becomes a support vector, your $\gamma$ is too large or the problem is too noisy — the model is memorizing, and both accuracy and prediction speed will suffer.
SVMs output a score, not a probability. Calibrate (Platt scaling, or isotonic regression) if you need $P(y\,|\,x)$.

Model	Decision boundary	Scales to N samples	Probabilities	Best when…
Linear SVM	hyperplane (max-margin)	excellent — linear in N	via calibration	high-dim sparse text; N in the millions
RBF SVM	smooth nonlinear	poor — $O(N^2)$–$O(N^3)$ to train	via calibration	small/medium N, low-dim, clean signal
Logistic regression	hyperplane (log-loss)	excellent	native & calibrated	you need probabilities and interpretability
Gradient-boosted trees	axis-aligned, piecewise	excellent	native-ish	heterogeneous tabular data (the default to beat)
Deep nets	arbitrary, learned features	good (SGD), data-hungry	native (softmax)	images, text, audio; very large N

The honest verdict, contested only at the margins: on the medium-sized, low-dimensional, fairly clean problems that gave SVMs their name, a well-tuned RBF SVM is still competitive with anything and often the cleanest model to reason about. But its training cost grows roughly quadratically-to-cubically in the number of samples, which rules it out for the large datasets that define modern ML; on messy tabular data, gradient-boosted trees (Chapter 04) usually edge it out with far less tuning; and on perceptual data, deep networks — which learn their feature map instead of fixing it with a kernel — left SVMs behind entirely after 2012. The kernel idea itself never died, though: it reappears in Gaussian processes, in the "neural tangent kernel" theory of why wide networks train the way they do, and in the attention mechanism of Volume II, which is a learned, data-dependent similarity kernel at heart.

PITFALLS

The four ways an SVM goes wrong: (1) forgetting to scale features — the single most common failure, and it produces a model that looks trained but isn't; (2) leaving $C$ and $\gamma$ at library defaults — they must be searched jointly, and the right values span orders of magnitude; (3) running an RBF SVM on hundreds of thousands of rows and waiting forever — reach for a linear SVM or SGD instead; (4) treating the raw decision score as a probability — it is an uncalibrated margin, not $P(y\,|\,x)$.

Every kernel in this chapter was, underneath, a measure of similarity between two points. Chapter 11 makes that the whole subject: Euclidean, Manhattan, cosine, Mahalanobis, Jaccard, edit distance — what "near" means, why the choice quietly decides everything from k-NN to clustering to the retrieval step inside every modern embedding system.

10.R

References

Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297. The paper that introduced the soft margin and slack variables (EQ M10.5) and gave the method its modern form — the canonical primary source for this chapter.
Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152. Where the maximum-margin hyperplane and the kernel trick (EQ M10.1, M10.7) were first combined — the true origin of the kernel SVM.
Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press — the definitive textbook treatment of kernels, Mercer's theorem, and the optimization behind EQ M10.3, M10.8.
Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167. The most-cited tutorial — derives the dual (EQ M10.3) and the KKT support-vector conditions step by step.
Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27. The standard implementation behind scikit-learn's SVC, and the reference for practical $(C, \gamma)$ selection (§10.5).
Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30. The hinge-loss sub-gradient method used in this chapter's first Python cell — how to train a linear SVM at scale.