The maximum-margin idea
Suppose two classes of points are linearly separable. Then there are infinitely many straight lines that split them with zero training error — and the perceptron of Chapter 03 will happily return whichever one it stumbles onto first. The support vector machine asks a sharper question: among all separating hyperplanes, which one is best? Its answer is geometric and almost obvious once stated: pick the boundary that keeps the largest possible empty buffer on either side. A line that skims past a training point is fragile — nudge that point slightly and it flips sides. A line centered in a wide no-man's-land is robust.
A hyperplane is the set of points satisfying \(w \cdot x + b = 0\), where \(w\) is a normal vector pointing across the boundary and \(b\) shifts it from the origin. The signed distance from any point \(x\) to that hyperplane is \((w \cdot x + b)/\lVert w \rVert\). The classifier is the sign of that expression: \(\hat{y} = \operatorname{sign}(w \cdot x + b)\). What we want to maximize is the margin — the distance from the boundary to the closest point of either class.
There is a redundancy to remove first. The pair \((w, b)\) and \((2w, 2b)\) describe the same hyperplane but give different score magnitudes. SVMs fix this by a canonical normalization: scale \((w, b)\) so that the closest points score exactly \(\pm 1\), i.e. \(\min_i y_i(w\cdot x_i + b) = 1\). Under that convention the margin in EQ M10.1 simplifies to the single clean quantity \(\gamma = 1/\lVert w \rVert\) — so maximizing the margin is the same as minimizing \(\lVert w \rVert\). That equivalence is the hinge on which the next section turns.
The maximum-margin principle is not just aesthetic. Margin width controls the capacity of the classifier in the bias–variance sense of Chapter 06: a wider margin is a simpler, lower-variance hypothesis, and the bound on generalization error degrades with \(1/\gamma^2\) rather than with the dimension of the space. That is the deep reason SVMs tolerate enormous feature spaces (§10.4) without overfitting in the way intuition warns they should.
The hard-margin problem and its support vectors
Pin down the redundancy with the canonical scaling and the SVM becomes a tidy convex optimization problem: minimize \(\lVert w \rVert\) — equivalently \(\tfrac{1}{2}\lVert w \rVert^2\), which is smoother — subject to every point sitting on the correct side of its margin.
Solving the dual of EQ M10.2 — via Lagrange multipliers \(\alpha_i \ge 0\), one per point — reveals the structure that gives the method its name. The optimum has the form \(w = \sum_i \alpha_i y_i x_i\), and the Karush–Kuhn–Tucker conditions force \(\alpha_i = 0\) for every point strictly outside the margin. Only the points on the margin carry a nonzero \(\alpha_i\). Those are the support vectors.
The consequence is striking. A trained SVM is a list of support vectors, their labels, their weights \(\alpha_i\), and a bias \(b\). On a clean problem that might be a dozen points out of a million. The decision function is
# Linear soft-margin SVM by sub-gradient descent on the hinge loss
# (EQ M10.2 + M10.5). Recovers the margin and the support vectors.
import numpy as np
rng = np.random.default_rng(0)
# two clearly separable clouds, labels in {-1, +1}
n = 40
Xp = rng.normal([ 2.2, 2.0], 0.6, (n, 2))
Xm = rng.normal([-2.0, -1.6], 0.6, (n, 2))
X = np.vstack([Xp, Xm]); y = np.r_[np.ones(n), -np.ones(n)]
w = np.zeros(2); b = 0.0; C = 1.0
for t in range(1, 4001): # Pegasos-style schedule
lr = 1.0 / (0.01 * t) # 0.01 = regularization 1/(C·N)
m = y * (X @ w + b) # margins y·f(x)
viol = m < 1 # points inside the margin
grad_w = 0.01 * w - C / len(X) * (y[viol] @ X[viol])
grad_b = - C / len(X) * y[viol].sum()
w -= lr * grad_w; b -= lr * grad_b
m = y * (X @ w + b)
sv = np.where(m < 1.05)[0] # points on/inside the margin
print(f"||w|| = {np.linalg.norm(w):.3f}")
print(f"margin 1/||w|| = {1/np.linalg.norm(w):.3f}")
print(f"train accuracy = {(np.sign(X @ w + b) == y).mean():.3f}")
print(f"support vectors= {len(sv)} of {len(X)} points")
plot_scatter(X[:, 0], X[:, 1], (y > 0).astype(int))
Soft margin and the hinge loss
Real data is not cleanly separable. One mislabeled point, or two classes that genuinely overlap, and the hard-margin problem of EQ M10.2 has no feasible solution — every hyperplane violates some constraint. Cortes and Vapnik's 1995 fix, the move that turned a beautiful idea into a workhorse, was to let constraints be broken at a price. Introduce a slack variable \(\xi_i \ge 0\) for each point measuring how far it intrudes into (or past) its margin, and pay for the total slack:
Eliminating the slack variables (each is just \(\xi_i = \max(0, 1 - y_i f(x_i))\) at the optimum) recasts the soft-margin SVM as plain regularized empirical-risk minimization with one specific loss — the hinge loss:
The hinge's corner at \(z = 1\) is the entire personality of the SVM. Plug in the two cases the brief cares about: a point classified correctly and well clear of the margin (\(z = y\,f(x) = 1.2\)) costs nothing, while a point on the wrong side (\(z = -0.5\)) costs \(1 - (-0.5) = 1.5\). The first two exercises walk exactly these.
The kernel trick: RBF and polynomial
So far the boundary is a flat hyperplane — useless for data shaped like concentric rings or two interleaved spirals. The classical escape is to lift the data into a higher-dimensional space where it becomes linearly separable: map each \(x\) through some feature transform \(\phi(x)\), fit a linear SVM there, and the flat boundary upstairs is a curved one back downstairs. The catch is cost: a good \(\phi\) might have thousands or infinitely many components, and computing them all is hopeless.
Here is the trick, and it is genuinely a piece of magic. Recall from EQ M10.3 and M10.4 that the SVM never needs \(\phi(x)\) by itself — it only ever needs inner products \(\phi(x_i)\cdot\phi(x_j)\). So if some cheap function \(K(x_i, x_j)\) happens to equal that inner product, we can compute in the lifted space while never visiting it:
Two kernels dominate practice:
RBF and \(C\) are tuned together, almost always by cross-validated grid search, because they trade against each other: a large \(\gamma\) and a large \(C\) both push toward a wiggly, overfit boundary. The instrument below lets you feel that interaction on data that no straight line can touch.
# The RBF kernel matrix on inseparable rings, and why lifting helps.
# Inner ring = class 0, outer ring = class 1 -- no line can split them.
import numpy as np
rng = np.random.default_rng(1)
def ring(r, n):
t = rng.uniform(0, 2*np.pi, n)
rad = r + rng.normal(0, 0.12, n)
return np.c_[rad*np.cos(t), rad*np.sin(t)]
X = np.vstack([ring(0.6, 60), ring(2.2, 60)])
y = np.r_[np.zeros(60, int), np.ones(60, int)]
def rbf(A, B, g): # K_ij = exp(-g ||a_i - b_j||^2)
d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
return np.exp(-g * d2)
K = rbf(X, X, g=1.0) # full 120x120 Gram matrix
print("Gram matrix shape :", K.shape)
print("diagonal (self-similarity) :", np.round(K[0, 0], 3), "(always 1)")
# A point's mean kernel-similarity to each class is already discriminative,
# even though the raw coordinates are not linearly separable at all.
sim0 = K[:, y == 0].mean(1) # closeness to inner ring
sim1 = K[:, y == 1].mean(1) # closeness to outer ring
pred = (sim1 > sim0).astype(int) # 1-NN-in-feature-space sanity check
print("nearest-mean acc in feature space:", f"{(pred == y).mean():.3f}")
print("=> in RBF feature space the rings ARE separable.")
plot_scatter(X[:, 0], X[:, 1], y)
The kernel trick is a similarity measure in disguise. An RBF SVM's prediction (EQ M10.7) is a weighted vote of support vectors, where the weight \(K(x_i, x)\) is how similar the query is to each one — large \(\gamma\) makes "similar" mean "almost identical." Seen this way it is a close cousin of the k-NN and kernel-density ideas you will meet in Chapter 11: distance defines everything. The SVM's contribution is choosing which points to remember (the support vectors) and how much to weight each (the \(\alpha_i\)) by maximizing a margin, rather than keeping the whole training set.
SVMs in practice — and against the field
For roughly a decade — from the late 1990s until deep learning's 2012 breakout — kernel SVMs were the most accurate general-purpose classifiers available, the default for text categorization, handwritten-digit recognition, and bioinformatics. They have receded, but understanding when they still win, and why they faded, is worth a section.
The practical checklist is short and unusually reliable:
- Scale your features. Both the dot product and the RBF distance are dominated by large-magnitude features, so standardize to zero mean and unit variance first. This is not optional — an unscaled SVM is mostly listening to whichever column happens to have the biggest numbers.
- Start with the RBF kernel, and grid-search \((C, \gamma)\) on a log scale with cross-validation. Linear kernels are the right call only when the feature space is already huge and sparse (e.g. bag-of-words text), where they are both faster and just as accurate.
- Watch the support-vector count. If nearly every training point becomes a support vector, your \(\gamma\) is too large or the problem is too noisy — the model is memorizing, and both accuracy and prediction speed will suffer.
- SVMs output a score, not a probability. Calibrate (Platt scaling, or isotonic regression) if you need \(P(y\,|\,x)\).
| Model | Decision boundary | Scales to N samples | Probabilities | Best when… |
|---|---|---|---|---|
| Linear SVM | hyperplane (max-margin) | excellent — linear in N | via calibration | high-dim sparse text; N in the millions |
| RBF SVM | smooth nonlinear | poor — \(O(N^2)\)–\(O(N^3)\) to train | via calibration | small/medium N, low-dim, clean signal |
| Logistic regression | hyperplane (log-loss) | excellent | native & calibrated | you need probabilities and interpretability |
| Gradient-boosted trees | axis-aligned, piecewise | excellent | native-ish | heterogeneous tabular data (the default to beat) |
| Deep nets | arbitrary, learned features | good (SGD), data-hungry | native (softmax) | images, text, audio; very large N |
The honest verdict, contested only at the margins: on the medium-sized, low-dimensional, fairly clean problems that gave SVMs their name, a well-tuned RBF SVM is still competitive with anything and often the cleanest model to reason about. But its training cost grows roughly quadratically-to-cubically in the number of samples, which rules it out for the large datasets that define modern ML; on messy tabular data, gradient-boosted trees (Chapter 04) usually edge it out with far less tuning; and on perceptual data, deep networks — which learn their feature map instead of fixing it with a kernel — left SVMs behind entirely after 2012. The kernel idea itself never died, though: it reappears in Gaussian processes, in the "neural tangent kernel" theory of why wide networks train the way they do, and in the attention mechanism of Volume II, which is a learned, data-dependent similarity kernel at heart.
The four ways an SVM goes wrong: (1) forgetting to scale features — the single most common failure, and it produces a model that looks trained but isn't; (2) leaving \(C\) and \(\gamma\) at library defaults — they must be searched jointly, and the right values span orders of magnitude; (3) running an RBF SVM on hundreds of thousands of rows and waiting forever — reach for a linear SVM or SGD instead; (4) treating the raw decision score as a probability — it is an uncalibrated margin, not \(P(y\,|\,x)\).
Every kernel in this chapter was, underneath, a measure of similarity between two points. Chapter 11 makes that the whole subject: Euclidean, Manhattan, cosine, Mahalanobis, Jaccard, edit distance — what "near" means, why the choice quietly decides everything from k-NN to clustering to the retrieval step inside every modern embedding system.
References
- Cortes, C. & Vapnik, V. (1995). Support-Vector Networks. Machine Learning 20(3), 273–297.
- Boser, B. E., Guyon, I. M. & Vapnik, V. N. (1992). A Training Algorithm for Optimal Margin Classifiers. Proceedings of COLT '92, 144–152.
- Schölkopf, B. & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
- Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121–167.
- Chang, C.-C. & Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM TIST 2(3), 1–27.
- Shalev-Shwartz, S., Singer, Y., Srebro, N. & Cotter, A. (2011). Pegasos: Primal Estimated Sub-Gradient Solver for SVM. Mathematical Programming 127(1), 3–30.