Why a line is not enough
The obvious move is to recycle Chapter 02: code the two classes as \(y = 0\) and \(y = 1\), fit a straight line by least squares, and call anything above 0.5 a positive. This is called the linear probability model, and it fails in three instructive ways.
First, the outputs aren't probabilities. A line is unbounded: feed it an extreme input and it cheerfully predicts 1.4, or −0.3. There is no reading of "140% probability of spam" that survives contact with arithmetic — downstream decisions (expected costs, thresholds, calibration) all need outputs that live in \((0, 1)\) and behave like degrees of belief.
Second, squared error punishes being right. Take an email that is so obviously spam the line scores it 1.8. It is correctly classified by any threshold — yet squared error charges \((1.8 - 1)^2\) for it and drags the line back toward the pack, moving the boundary toward the mistakes. A loss for classification should reward confident correctness, not fine it.
Third, the geometry is brittle. Add a few far-away but trivially easy points to one class and the least-squares line pivots to appease them, misclassifying points near the frontier — where classification is actually decided. The fix is not to abandon the linear score \(z = \mathbf{w}^{\top}\mathbf{x} + b\); it is too useful. The fix is to stop treating \(z\) as the answer and start treating it as evidence — a quantity on an unbounded scale that we convert into a probability.
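The first and third failures are easy to reproduce. The sketch below uses made-up 1-D data (every number is an illustrative assumption, not taken from the chapter): it fits a least-squares line to 0/1 labels with `np.polyfit`, reads off the prediction at an extreme input, then adds three far-away, trivially easy positives and refits.

```python
import numpy as np

# Made-up 1-D data: class 0 on the left, class 1 on the right.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def fit_line(x, y):
    """Least-squares line y ~ a*x + c; returns (a, c)."""
    a, c = np.polyfit(x, y, deg=1)
    return a, c

a, c = fit_line(x, y)
boundary = (0.5 - c) / a          # where the line crosses 0.5
extreme = a * 10.0 + c            # the "probability" at a far-away input

# Add three far-away, trivially easy positives and refit.
x2 = np.concatenate([x, [20.0, 21.0, 22.0]])
y2 = np.concatenate([y, [1.0, 1.0, 1.0]])
a2, c2 = fit_line(x2, y2)
boundary2 = (0.5 - c2) / a2

print(f"prediction at x = 10: {extreme:.2f}")
print(f"boundary before: {boundary:.2f}   after: {boundary2:.2f}")
print("x = 1 (a positive) is now called:",
      "positive" if a2 * 1.0 + c2 > 0.5 else "negative")
```

On this toy set the line predicts a value well above 1 at x = 10, and the extra easy points push the 0.5 crossing past x = 1, so a positive near the frontier flips to negative.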
The sigmoid & logistic regression
The converter is the sigmoid (logistic) function, an S-shaped squash that maps any real score to a probability:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \qquad \text{(EQ M3.1)} \]
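Bolting the sigmoid onto the linear score gives logistic regression, still a linear model, but linear in the right place:

\[ p = P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b) \qquad \text{(EQ M3.2)} \]

The right place is the log-odds: inverting the sigmoid gives \(\log\frac{p}{1-p} = \mathbf{w}^{\top}\mathbf{x} + b\).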
Unlike least squares, there is no closed-form solution — logistic regression is trained by gradient descent (Chapter 02) on the loss of the next section. The consolation prize is substantial: that loss is convex for this model, so gradient descent finds the global optimum. It is the last model in this volume for which that is true.
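As a minimal sketch of the model itself (not the chapter's own cell), the code below implements the sigmoid in a form that avoids overflow for large negative scores and wraps it in a `predict_proba` helper. The helper name, weights, and inputs are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)) for a 1-D array, avoiding overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])              # z < 0 here, so exp cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

def predict_proba(X, w, b):
    """Logistic regression: probability of class 1 for each row of X."""
    return sigmoid(X @ w + b)

# Hypothetical weights and inputs, for illustration only.
w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[0.0, 0.5],        # score z = 0      -> p = 0.5
              [3.0, 0.0],        # score z = 6.5    -> p close to 1
              [-400.0, 0.0]])    # score z = -799.5 -> p close to 0, no overflow
p = predict_proba(X, w, b)
print(p)
```

Splitting on the sign of `z` is one common way to keep `exp` from overflowing; the naive one-liner gives the same values on moderate scores.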
Cross-entropy: the loss that trains GPT
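What should the model pay when it predicts probability \(p\) and the truth is \(y\)? The principled answer comes from maximum likelihood: choose the weights that make the observed labels most probable. Taking the negative log (sums beat products; minimizing beats maximizing) yields cross-entropy, also called log loss:

\[ \mathcal{L}(\mathbf{w}, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big] \qquad \text{(EQ M3.3)} \]

where \(p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i + b)\) is the predicted probability for example \(i\).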
Why not just use squared error on \(p\)? Two reasons. Through a sigmoid, squared error becomes non-convex — gradient descent can stall in flat regions, and precisely when the model is confidently wrong the \(\sigma'\) factor crushes the gradient toward zero. Cross-entropy's gradient cancels that factor exactly, leaving the cleanest possible signal: \(\nabla_{\mathbf{w}} \mathcal{L} = \tfrac{1}{N}\sum_i (p_i - y_i)\,\mathbf{x}_i\) — error times input, the same form as linear regression's. The worse the miss, the louder the correction.
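Both claims can be checked numerically. The sketch below, on random made-up data with the bias omitted for brevity, compares the error-times-input gradient against central finite differences of the cross-entropy. It then contrasts the gradient with respect to the score \(z\) under cross-entropy and under squared error for one confidently wrong prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # made-up inputs
y = (rng.random(50) < 0.5).astype(float)      # made-up 0/1 labels
w = rng.normal(size=3)                        # arbitrary weights, no bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient from the text: mean of (p - y) * x.
analytic = X.T @ (sigmoid(X @ w) - y) / len(y)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (cross_entropy(w + eps * e) - cross_entropy(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print("max abs difference:", np.max(np.abs(analytic - numeric)))

# One confidently wrong prediction: truth is 1, score is -8.
z, target = -8.0, 1.0
p = sigmoid(z)
grad_ce = p - target                          # d(cross-entropy)/dz
grad_se = 2 * (p - target) * p * (1 - p)      # d(squared error)/dz, carries sigma'(z)
print(f"cross-entropy gradient: {grad_ce:.4f}   squared-error gradient: {grad_se:.6f}")
```

The squared-error gradient is the cross-entropy gradient multiplied by \(2\sigma'(z)\), which is why it nearly vanishes exactly where the correction should be loudest.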
There is a second, subtler reason to descend cross-entropy rather than the thing you ostensibly care about: accuracy is a staircase. Nudge the boundary and accuracy doesn't move at all, until a point crosses the line and it jumps. Zero gradient almost everywhere, undefined at the jumps: useless for optimization. Cross-entropy is the smooth ramp that gradient descent can actually walk. Instrument M3.1 (interactive: steer a boundary by hand across two Gaussian clouds) lets you feel the difference yourself.
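The staircase-versus-ramp contrast can also be seen without an interactive widget. This sketch uses made-up 1-D data and a fixed slope, both assumptions for illustration. It slides the bias in small steps and counts how often each quantity changes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up 1-D data: two overlapping clouds.
x = np.concatenate([rng.normal(-1.0, 1.0, 20), rng.normal(1.0, 1.0, 20)])
y = np.array([0] * 20 + [1] * 20)

w = 2.0                                    # fixed slope; only the bias moves
biases = np.linspace(-2.0, 2.0, 401)       # slide the boundary in tiny steps

acc, ce = [], []
for b in biases:
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    acc.append(np.mean((p > 0.5) == y))
    ce.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
acc, ce = np.array(acc), np.array(ce)

flat_steps = np.mean(np.diff(acc) == 0)    # share of nudges that leave accuracy unchanged
moved_steps = np.mean(np.diff(ce) != 0)    # share of nudges that change cross-entropy
print(f"accuracy unchanged on {flat_steps:.0%} of nudges")
print(f"cross-entropy changed on {moved_steps:.0%} of nudges")
print("distinct accuracy values:", len(np.unique(acc)))
```

With 40 points, accuracy can change on at most 40 of the 400 nudges, so at least 90% of them produce no signal at all, while cross-entropy responds to essentially every one.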
Gradient descent does the same steering automatically. The cell below trains logistic regression on the same two clouds — twelve lines of numpy, the gradient from this section, nothing else:
```python
import numpy as np

rng = np.random.default_rng(7)
n = 80                                              # two gaussian clouds, as in Instrument M3.1
A = rng.normal([-1.5, -1.0], 1.05, size=(n, 2))     # class 0
B = rng.normal([ 1.5,  1.1], 1.05, size=(n, 2))     # class 1
X, y = np.vstack([A, B]), np.array([0]*n + [1]*n)

w, b, lr = np.zeros(2), 0.0, 0.5
for step in range(400):
    p = 1 / (1 + np.exp(-(X @ w + b)))              # EQ M3.2
    w -= lr * (X.T @ (p - y)) / len(y)              # gradient of EQ M3.3: error x input
    b -= lr * np.mean(p - y)

# Final metrics; clip so log() never sees exactly 0 or 1.
p = np.clip(1 / (1 + np.exp(-(X @ w + b))), 1e-12, 1 - 1e-12)
ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
acc = np.mean((p > 0.5) == y)
print("w =", np.round(w, 3), " b =", round(b, 3))
print(f"cross-entropy = {ce:.4f} accuracy = {acc:.1%}")
# plot_scatter is not defined in this cell; it is presumably a plotting helper
# supplied by the page's runtime. Comment it out when running standalone.
plot_scatter(X[:, 0], X[:, 1], (p > 0.5).astype(int))
```
Decision boundaries: linear in input space
Where does the model actually decide? At \(p = 0.5\) — which by EQ M3.1 happens exactly where the evidence is zero: \(\mathbf{w}^{\top}\mathbf{x} + b = 0\). In two dimensions that is a straight line; in \(d\) dimensions, a flat hyperplane. The sigmoid bends probabilities, never the boundary: logistic regression is a linear classifier, however smoothly its confidence shades from one side to the other.
The parameters split into three legible roles. The direction of \(\mathbf{w}\) sets the boundary's orientation (\(\mathbf{w}\) is perpendicular to it: the mint arrow in Instrument M3.1). The bias \(b\) slides the boundary without rotating it. And the magnitude \(\lVert\mathbf{w}\rVert\) controls how fast probability ramps as you walk away from the line: the score is \(z = \lVert\mathbf{w}\rVert \cdot d(\mathbf{x})\), where \(d(\mathbf{x}) = (\mathbf{w}^{\top}\mathbf{x} + b)/\lVert\mathbf{w}\rVert\) is the signed distance to the boundary. Direction says what the model believes; magnitude says how hard. That magnitude acts as a steepness dial, an inverse temperature, on the probability curve:

\[ p = \sigma\big(\lVert\mathbf{w}\rVert \, d(\mathbf{x})\big) \]
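The geometry can be checked on a small example. The parameters below are hypothetical, picked so that \(\lVert\mathbf{w}\rVert = 5\). The sketch confirms that a point on the boundary gets probability 0.5, that the score equals the norm times the signed distance, and that rescaling \((\mathbf{w}, b)\) steepens the ramp without moving the boundary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: the boundary is the line 3*x1 + 4*x2 - 5 = 0.
w, b = np.array([3.0, 4.0]), -5.0
norm = np.linalg.norm(w)                      # 5.0

def signed_distance(x, w, b):
    """Signed perpendicular distance from x to the line w.x + b = 0."""
    return (x @ w + b) / np.linalg.norm(w)

on_line = np.array([3.0, -1.0])               # 3*3 + 4*(-1) - 5 = 0
x = on_line + 0.2 * w / norm                  # step 0.2 units along w, away from the line

d = signed_distance(x, w, b)
z = x @ w + b
print("p on the boundary:", sigmoid(on_line @ w + b))
print(f"distance = {d:.3f}   score = {z:.3f}   ||w|| * d = {norm * d:.3f}")

# Scaling (w, b) by a constant keeps the boundary but changes the steepness.
for scale in (0.5, 1.0, 4.0):
    print(f"scale {scale}: p at distance 0.2 = {sigmoid(scale * z):.4f}")
```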
A real failure mode hides in that dial. If the training data is perfectly separable, cross-entropy keeps paying the model to grow \(\lVert\mathbf{w}\rVert\) forever — every doubling sharpens probabilities toward 0/1 and shaves a little more loss, without moving the boundary at all. The result is a wildly overconfident model. The standard fixes are L2 regularization or early stopping, both of which cap \(\lVert\mathbf{w}\rVert\); Chapter 06 treats this properly.
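A small experiment makes the runaway visible. The sketch below trains on six made-up, perfectly separable points, once with plain cross-entropy and once with an added L2 penalty. The `train` helper, the data, and the penalty strength 0.01 are all illustrative assumptions.

```python
import numpy as np

def train(X, y, steps, lr=0.5, l2=0.0):
    """Gradient descent on mean cross-entropy plus (l2 / 2) * ||w||^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

# Made-up, perfectly separable data: class 0 has x1 < 0, class 1 has x1 > 0.
X = np.array([[-2.0, 0.3], [-1.0, -0.4], [-1.5, 0.1],
              [ 1.0, 0.2], [ 2.0, -0.3], [ 1.5, 0.4]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

for steps in (100, 1_000, 10_000):
    w_plain, _ = train(X, y, steps)
    w_reg, _ = train(X, y, steps, l2=0.01)
    print(f"{steps:>6} steps: ||w|| = {np.linalg.norm(w_plain):.2f} (no penalty)   "
          f"{np.linalg.norm(w_reg):.2f} (L2 penalty 0.01)")
```

Without the penalty the norm keeps creeping upward with more steps; with it, the norm settles. Early stopping corresponds to simply reading off an earlier row of the unpenalized run.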
What a line cannot do is bend. XOR-style data (positives in opposite corners) and concentric rings defeat every choice of \(\mathbf{w}\) and \(b\) — no straight boundary separates them. The classical remedy is feature engineering: feed the model \(x_1 x_2\), or \(x_1^2 + x_2^2\), and the boundary becomes linear in the new features while curving in the original space. The modern remedy is to learn those features — which is precisely what neural networks do (Chapter 07). Either way, the lesson stands: a linear classifier is only as good as the space you hand it.
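The feature-engineering remedy fits in a few lines. The sketch below builds made-up XOR-style data, labeling a point positive when \(x_1 x_2 > 0\). It trains the same logistic regression twice: on the raw coordinates, and with the product \(x_1 x_2\) appended as a third feature. The helper names are hypothetical.

```python
import numpy as np

def train_logreg(F, y, steps=5_000, lr=0.5):
    """Plain gradient descent on mean cross-entropy; returns (w, b)."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        w -= lr * F.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(F, y, w, b):
    """Share of points on the correct side of the boundary F.w + b = 0."""
    return np.mean(((F @ w + b) > 0) == y)

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 2))         # made-up points in a square
y = (X[:, 0] * X[:, 1] > 0).astype(float)         # positives in opposite corners

raw = X                                           # features: x1, x2
eng = np.column_stack([X, X[:, 0] * X[:, 1]])     # features: x1, x2, x1*x2

acc_raw = accuracy(raw, y, *train_logreg(raw, y))
acc_eng = accuracy(eng, y, *train_logreg(eng, y))
print(f"raw features:       accuracy = {acc_raw:.1%}")
print(f"with x1*x2 feature: accuracy = {acc_eng:.1%}")
```

The model and the training loop are identical in both runs; only the space handed to the classifier changes.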
Many classes: softmax
Two classes needed one score. For \(K\) classes, give each class its own linear score \(z_i = \mathbf{w}_i^{\top}\mathbf{x} + b_i\) and normalize the lot with softmax: exponentiate (so everything is positive and score differences become probability ratios), then divide by the sum (so everything adds to one):

\[ \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \]
You will meet this exact function three more times in this encyclopedia, doing three different jobs:
| Where | Softmax over | Producing |
|---|---|---|
| This chapter | \(K\) class scores | \(p(\text{class} \mid \text{input})\) |
| LLM output head (Vol II · EQ 1.2) | \(\lvert V \rvert \approx 100\text{K}\) token logits | \(p(\text{next token} \mid \text{context})\) |
| Attention (Vol II · EQ 3.1) | T relevance scores per query | mixing weights over values |
| Sampling with temperature τ | logits / τ | sharpened or flattened p |
One function, one identity: turn arbitrary scores into a probability distribution, differentiably. Every time a network must choose softly among options — classes, tokens, positions to attend to — softmax is the mechanism. Run it:
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                 # subtract max: free, by shift invariance
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -1.7])    # 4 classes, raw scores
p = softmax(logits)
for name, pi in zip("ABCD", p):
    print(f"class {name}: {pi:.4f}")
print("sum =", round(p.sum(), 6))
print()
print("logits + 100 :", np.round(softmax(logits + 100), 4))
print("identical: softmax reads DIFFERENCES, not absolute scores")
with np.errstate(over="ignore"):            # what the max-trick prevents:
    print("naive exp(z+1000):", np.exp(logits + 1000.0))
```
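The temperature row of the table can be tried on the same logits. In this sketch the values of \(\tau\) are arbitrary choices for illustration. Dividing the logits by \(\tau\) before the softmax sharpens the distribution when \(\tau < 1\) and flattens it when \(\tau > 1\), and the entropy tracks that change. The last lines check the two-class special case, where a softmax over the scores \((z, 0)\) reduces to the sigmoid.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -1.7])    # same raw scores as the cell above

for tau in (0.5, 1.0, 2.0, 10.0):           # illustrative temperatures
    p = softmax(logits / tau)
    entropy = -np.sum(p * np.log(p))
    print(f"tau = {tau:>4}: p = {np.round(p, 3)}  entropy = {entropy:.3f} nats")

# Two-class special case: softmax over the scores (z, 0) equals sigmoid(z).
z = 1.3
two_class = softmax(np.array([z, 0.0]))[0]
print(two_class, 1.0 / (1.0 + np.exp(-z)))
```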
Metrics beyond accuracy
A disease afflicts 1 person in 1,000. The classifier that always returns "healthy" scores 99.9% accuracy and has never detected anything. Under class imbalance, accuracy measures the imbalance, not the model. The honest accounting starts by splitting the four ways a binary prediction can land:
| CONFUSION MATRIX | PREDICTED + | PREDICTED − |
|---|---|---|
| ACTUAL + | TP · true positive | FN · false negative — a miss |
| ACTUAL − | FP · false positive — a false alarm | TN · true negative |
Two questions matter, and they are different questions. Precision \(= \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})\): of everything I flagged, how much was real? Recall \(= \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\): of everything real, how much did I flag? They pull against each other through a dial you already own: the decision threshold. Nothing forces the cut at \(p = 0.5\) — lower it and you catch more positives (recall ↑) while flagging more junk (precision ↓); raise it and the reverse. A single model traces an entire precision–recall curve as the threshold sweeps; the F1 score \(= 2PR/(P+R)\), a harmonic mean, condenses one operating point into one number — harsh on imbalance between the two, as a harmonic mean should be.
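The threshold trade-off is easiest to see on a handful of numbers. The sketch below uses ten made-up labels and predicted probabilities and a hypothetical helper, `precision_recall_f1`. It counts the confusion-matrix cells at three thresholds and reports precision, recall, and F1 at each.

```python
import numpy as np

def precision_recall_f1(y_true, p, threshold):
    """Confusion-matrix counts at a threshold, then precision, recall, F1."""
    pred = p >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    # Convention used here: report 0 when a ratio would divide by zero.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Made-up labels and predicted probabilities for ten examples.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p = np.array([0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.35, 0.20, 0.10, 0.05])

for t in (0.25, 0.50, 0.75):
    pr, rc, f1 = precision_recall_f1(y_true, p, t)
    print(f"threshold {t:.2f}: precision = {pr:.2f}  recall = {rc:.2f}  F1 = {f1:.2f}")
```

On this toy set, lowering the threshold to 0.25 catches every positive at the cost of three false alarms, and raising it to 0.75 removes the false alarms but misses half the positives.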
Where you sit on that curve is a question about costs, not statistics: a missed tumor and a false alarm are not the same price, and no metric chooses for you. Worse, base rates ambush intuition. Run a genuinely good screening test on a rare condition:
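The arithmetic behind that ambush is short. The sketch below keeps the prevalence from the example above (1 in 1,000) and assumes a test with 99% sensitivity and 99% specificity. Those two figures are illustrative assumptions, not values given in the chapter.

```python
prevalence = 1 / 1000        # from the text: 1 person in 1,000
sensitivity = 0.99           # assumed: P(test positive | disease)
specificity = 0.99           # assumed: P(test negative | healthy)

population = 1_000_000
sick = population * prevalence
healthy = population - sick

tp = sick * sensitivity                 # sick people correctly flagged
fn = sick - tp                          # sick people missed
fp = healthy * (1 - specificity)        # healthy people falsely flagged
tn = healthy - fp                       # healthy people correctly cleared

precision = tp / (tp + fp)              # P(disease | test positive)
accuracy = (tp + tn) / population
print(f"flagged: {tp + fp:,.0f}   truly sick among them: {tp:,.0f}")
print(f"precision = {precision:.1%}   accuracy = {accuracy:.1%}")
```

Under these assumptions about nine in ten positive results are false alarms, even though the test is right 99% of the time: the healthy group is so large that its 1% error rate swamps the true positives.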
Working under imbalance, in practice: judge models on precision/recall (or the PR curve), never raw accuracy; consider reweighting the loss so rare-class errors cost more, or resampling the data; and remember the cheapest fix is often just moving the threshold after training. The probabilities logistic regression emits are exactly what make that last move possible — a hard classifier offers no dial at all.
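Two of those levers can be sketched directly. The code below uses made-up predictions and a hypothetical `weighted_cross_entropy` helper; the weight of 9 is an assumed choice, equal to the ratio of negatives to positives in this toy set. It first reweights the loss so that a missed positive costs more, then lowers the threshold on an already-trained model's probabilities.

```python
import numpy as np

def weighted_cross_entropy(y, p, pos_weight=1.0):
    """Mean cross-entropy with the loss on positives (y = 1) scaled by pos_weight."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up example: one missed positive among nine confident, correct negatives.
y = np.array([1.0] + [0.0] * 9)
p = np.array([0.10] + [0.05] * 9)

plain = weighted_cross_entropy(y, p)
weighted = weighted_cross_entropy(y, p, pos_weight=9.0)   # assumed weight
print(f"plain loss = {plain:.3f}   weighted loss = {weighted:.3f}")

# The cheaper lever: move the threshold, no retraining needed.
for t in (0.5, 0.08):
    print(f"threshold {t}: flagged = {int(np.sum(p >= t))}")
```

With the default weight the missed positive barely registers in the average; with the larger weight it dominates the loss, which is the pressure that pushes training toward catching it.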
Logistic regression draws one straight, confident line. Chapter 04 takes the opposite bet: models with no line, no sigmoid, and barely any equations — decision trees that carve the space into boxes, forests that vote, and nearest neighbors that just ask, "what did similar points do?"
Further reading
- Cox, D. R. (1958). The Regression Analysis of Binary Sequences. — the founding paper of logistic regression and the log-odds (logit) link.
- Berkson, J. (1944). Application of the Logistic Function to Bio-Assay. — introduced the "logit" and popularized the sigmoid as a response curve.
- Bishop, C. (2006). Pattern Recognition and Machine Learning, Ch. 4. — the clearest modern treatment of logistic regression, cross-entropy, and the softmax for multiclass.
- Bridle, J. (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs. — names and justifies the softmax as a normalized-exponential probability layer.
- Davis, J. & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. — why accuracy misleads on imbalanced data and when to read PR vs ROC.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, Ch. 4. — linear methods for classification, decision boundaries, and maximum-likelihood fitting.