03 · Classification: Logistic & Softmax

3.1

Why a line is not enough

The obvious move is to recycle Chapter 02: code the two classes as $y = 0$ and $y = 1$, fit a straight line by least squares, and call anything above 0.5 a positive. This is called the linear probability model, and it fails in three instructive ways.

First, the outputs aren't probabilities. A line is unbounded: feed it an extreme input and it cheerfully predicts 1.4, or −0.3. There is no reading of "140% probability of spam" that survives contact with arithmetic — downstream decisions (expected costs, thresholds, calibration) all need outputs that live in $(0, 1)$ and behave like degrees of belief.

Second, squared error punishes being right. Take an email that is so obviously spam the line scores it 1.8. It is correctly classified by any threshold — yet squared error charges $(1.8 - 1)^2$ for it and drags the line back toward the pack, moving the boundary toward the mistakes. A loss for classification should reward confident correctness, not fine it.

Third, the geometry is brittle. Add a few far-away but trivially easy points to one class and the least-squares line pivots to appease them, misclassifying points near the frontier — where classification is actually decided. The fix is not to abandon the linear score $z = \mathbf{w}^{\top}\mathbf{x} + b$; it is too useful. The fix is to stop treating $z$ as the answer and start treating it as evidence — a quantity on an unbounded scale that we convert into a probability.
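The failures are easy to reproduce. The sketch below is illustrative only: the one-dimensional toy data and the helper name `fit_line` are assumptions made for this example, not part of the chapter's material.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y ~ a*x + c; returns (a, c)."""
    A = np.column_stack([x, np.ones_like(x)])
    (a, c), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, c

# Hypothetical 1-D data: class 0 on the left, class 1 on the right.
x = np.array([-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

a, c = fit_line(x, y)
boundary = (0.5 - c) / a                  # where the line crosses 0.5
print("prediction at x = +6:", round(a * 6 + c, 3))    # not a probability
print("prediction at x = -6:", round(a * -6 + c, 3))   # not a probability
print("boundary:", round(boundary, 3))

# Add three far-away, trivially easy positives and refit.
x2 = np.append(x, [20.0, 25.0, 30.0])
y2 = np.append(y, [1.0, 1.0, 1.0])
a2, c2 = fit_line(x2, y2)
boundary2 = (0.5 - c2) / a2
print("boundary after easy outliers:", round(boundary2, 3))
print("score of the positive at x = 0.5:", round(a2 * 0.5 + c2, 3))
```

On this toy set the easy outliers flatten the line and push the 0.5 crossing to the right, so the positive at $x = 0.5$ ends up on the wrong side even though nothing near the frontier changed.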

3.2

The sigmoid & logistic regression

The converter is the sigmoid (logistic) function — an S-shaped squash that maps any real score to a probability:

EQ M3.1 — THE SIGMOID $$ \sigma(z) \;=\; \frac{1}{1 + e^{-z}}, \qquad \sigma : \mathbb{R} \to (0,1), \qquad \sigma(-z) \;=\; 1 - \sigma(z) $$

Strong positive evidence $\to$ probability near 1; strong negative $\to$ near 0; zero evidence $\to$ exactly ½. The symmetry $\sigma(-z) = 1 - \sigma(z)$ means "evidence for class 1" and "evidence for class 0" are the same number with the sign flipped. Its derivative is $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$: largest at the midpoint (¼), vanishing in the tails. The sigmoid is the exchange rate between evidence and probability.

A logistic model emits the evidence (logit) $z = 1$. What probability does the sigmoid assign, $\sigma(1)$? (Use $e^{-1} \approx 0.368$.)

$\sigma(1) = \dfrac{1}{1 + e^{-1}} = \dfrac{1}{1 + 0.368} = \dfrac{1}{1.368} \approx $ 0.731. One unit of positive evidence buys about 73% belief — confident, but a long way from certain.
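The same numbers can be checked in code. This is a minimal sketch of EQ M3.1; the function names `sigmoid` and `sigmoid_grad` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """EQ M3.1: map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(z)
print("sigma(z)      :", np.round(p, 3))
print("sigma(1)      :", round(float(sigmoid(1.0)), 3))
print("symmetry holds:", np.allclose(sigmoid(-z), 1.0 - p))
print("slope at 0    :", float(sigmoid_grad(0.0)))
```

This direct form is fine for moderate scores; for very large negative $z$ the intermediate `exp` can overflow to infinity, which still yields the correct limit of 0 but may raise a runtime warning.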

Bolting the sigmoid onto the linear score gives logistic regression — still a linear model, but linear in the right place:

EQ M3.2 — LOGISTIC REGRESSION $$ p(y = 1 \mid \mathbf{x}) \;=\; \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right) \qquad \Longleftrightarrow \qquad \log \frac{p}{1 - p} \;=\; \mathbf{w}^{\top}\mathbf{x} + b $$

Read it right-to-left: the model is linear in the log-odds. Each unit increase in feature $x_j$ multiplies the odds $p/(1-p)$ by $e^{w_j}$ — which is why logistic regression is still the lingua franca of medicine and credit scoring: every weight is a legible odds multiplier. The pre-sigmoid score $z$ is called a logit — the same word, and the same object, as the raw scores an LLM emits before its softmax (Vol II · EQ 1.2).
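The odds-multiplier reading can be verified numerically. In the sketch below the weights, bias, and input are hypothetical values picked only to make the check concrete.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and input, chosen only for illustration.
w = np.array([0.7, -1.2])
b = 0.3
x = np.array([1.0, 2.0])

def odds(x):
    """Odds p / (1 - p) under EQ M3.2."""
    p = sigmoid(w @ x + b)
    return p / (1.0 - p)

x_plus = x.copy()
x_plus[0] += 1.0                       # one-unit increase in the first feature

ratio = odds(x_plus) / odds(x)
print("odds ratio after the increase:", round(float(ratio), 4))
print("exp(first weight)            :", round(float(np.exp(w[0])), 4))
```

The two printed numbers agree regardless of where `x` starts, which is what makes each weight readable on its own.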

Unlike least squares, there is no closed-form solution — logistic regression is trained by gradient descent (Chapter 02) on the loss of the next section. The consolation prize is substantial: that loss is convex for this model, so gradient descent finds the global optimum. It is the last model in this volume for which that is true.

3.3

Cross-entropy: the loss that trains GPT

What should the model pay when it predicts probability $p$ and the truth is $y$? The principled answer comes from maximum likelihood: choose the weights that make the observed labels most probable. Taking the negative log (sums beat products; minimizing beats maximizing) yields cross-entropy, also called log loss:

EQ M3.3 — BINARY CROSS-ENTROPY $$ \mathcal{L}(\mathbf{w}, b) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \Big[\, y_i \log p_i \;+\; (1 - y_i) \log (1 - p_i) \,\Big], \qquad p_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i + b) $$

Per example, only one term survives: you pay $-\log(\text{probability you gave the truth})$ — the surprisal. Assign 0.99 to what happens, pay 0.01 nats; assign 0.001, pay 6.9; the bill for confident wrongness is unbounded. Generalized from 2 classes to $|V| \approx 100\mathrm{K}$ token classes and averaged over positions, this is exactly Vol II · EQ 1.6 — the pre-training loss of GPT. Next-token prediction is this chapter, scaled up.

The true label is $y = 1$ and the model predicts $p = 0.25$. By EQ M3.3, how many nats of cross-entropy does this single example cost?

Only the $y = 1$ term survives: cost $= -\log p = -\ln(0.25) = \ln 4 \approx $ 1.386 nats. The model put just a quarter of its belief on what actually happened, and the surprisal bills it accordingly.
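A small sketch of EQ M3.3 reproduces this bill and the two extremes quoted earlier. The function name `binary_cross_entropy` is an illustrative choice.

```python
import numpy as np

def binary_cross_entropy(y, p):
    """EQ M3.3 averaged over examples; y in {0, 1}, p strictly inside (0, 1)."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

print(round(binary_cross_entropy([1], [0.25]), 3))    # the worked example: ln 4
print(round(binary_cross_entropy([1], [0.99]), 3))    # confident and right
print(round(binary_cross_entropy([1], [0.001]), 3))   # confident and wrong
print(round(binary_cross_entropy([0], [0.25]), 3))    # truth is 0, so pay -ln(0.75)
```

Probabilities of exactly 0 or 1 would send the logarithm to infinity, which is why practical implementations clip `p` slightly away from the endpoints.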

Why not just use squared error on $p$? Two reasons. Through a sigmoid, squared error becomes non-convex — gradient descent can stall in flat regions, and precisely when the model is confidently wrong the $\sigma'$ factor crushes the gradient toward zero. Cross-entropy's gradient cancels that factor exactly, leaving the cleanest possible signal: $\nabla_{\mathbf{w}} \mathcal{L} = \tfrac{1}{N}\sum_i (p_i - y_i)\,\mathbf{x}_i$ — error times input, the same form as linear regression's. The worse the miss, the louder the correction.
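To see the cancellation concretely, compare the two gradients with respect to the score $z$ on a single confidently wrong example. This is a sketch under assumed values ($y = 1$, $z = -8$); the finite-difference check at the end is only a sanity test of the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong prediction: the truth is y = 1 but the score is very negative.
y, z = 1.0, -8.0
p = sigmoid(z)

grad_ce = p - y                              # d/dz of -[y log p + (1-y) log(1-p)]
grad_se = 2.0 * (p - y) * p * (1.0 - p)      # d/dz of (p - y)^2, carrying the sigma' factor

print("p                   :", round(float(p), 6))
print("cross-entropy dL/dz :", round(float(grad_ce), 6))
print("squared-error dL/dz :", round(float(grad_se), 6))

# Finite-difference check of the cross-entropy gradient.
def ce(z):
    q = sigmoid(z)
    return -(y * np.log(q) + (1.0 - y) * np.log(1.0 - q))

eps = 1e-6
numeric = (ce(z + eps) - ce(z - eps)) / (2.0 * eps)
print("finite-difference CE:", round(float(numeric), 6))
```

The cross-entropy gradient stays close to $-1$ while the squared-error gradient is nearly zero: the model is as wrong as it can be, and only one of the two losses says so.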

There is a second, subtler reason to descend cross-entropy rather than the thing you ostensibly care about: accuracy is a staircase. Nudge the boundary and accuracy doesn't move at all — until a point crosses the line and it jumps. Zero gradient almost everywhere, undefined at the jumps: useless for optimization. Cross-entropy is the smooth ramp that gradient descent can actually walk. Feel the difference yourself:

INSTRUMENT M3.1 — BOUNDARY EXPLORER · EQ M3.2 LIVE · 140 SEEDED POINTS · TWO GAUSSIAN CLOUDS

[Interactive instrument. Controls: weight w₁ (default 0.90), weight w₂ (default −0.60), bias b (default 0.40). Live readouts: accuracy (staircase), cross-entropy (smooth), misclassified count.]

Points are colored by the model's prediction (mint = class 1, blue = class 0); red rings mark misclassifications; the white line is the decision boundary, the mint arrow is w. The default boundary is tilted the wrong way — drag w₂ positive and watch the rings vanish. Two lessons: (1) cross-entropy keeps improving between accuracy's jumps, which is why training descends the loss, not the metric; (2) past ≈98% you cannot win — the few remaining rings live in the overlap, and no line can claim them.
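The staircase-versus-ramp contrast also shows up without the instrument. The sketch below uses a hypothetical six-point, one-dimensional dataset and nudges only the bias; none of these values come from the instrument itself.

```python
import numpy as np

# Hypothetical 1-D data: scores are z = x + b, so the boundary sits at x = -b.
x = np.array([-2.0, -1.0, -0.2, 0.4, 1.0, 2.0])
y = np.array([0, 0, 1, 0, 1, 1])

def metrics(b):
    """Return (accuracy, cross-entropy) for bias b."""
    p = 1.0 / (1.0 + np.exp(-(x + b)))
    acc = np.mean((p > 0.5) == y)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return acc, ce

bs = [-0.30, -0.25, -0.20, -0.15, -0.10]
results = [metrics(b) for b in bs]
for b, (acc, ce) in zip(bs, results):
    print(f"b = {b:+.2f}   accuracy = {acc:.3f}   cross-entropy = {ce:.4f}")
```

Across these nudges no point changes sides, so accuracy is flat, while cross-entropy moves at every step and tells gradient descent which way to go.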

Gradient descent does the same steering automatically. The cell below trains logistic regression on the same two clouds — twelve lines of numpy, the gradient from this section, nothing else:

PYTHON · RUNNABLE IN-BROWSER

import numpy as np
rng = np.random.default_rng(7)

n = 80                                     # two gaussian clouds, as in Instrument M3.1
A = rng.normal([-1.5, -1.0], 1.05, size=(n, 2))   # class 0
B = rng.normal([ 1.5,  1.1], 1.05, size=(n, 2))   # class 1
X, y = np.vstack([A, B]), np.array([0]*n + [1]*n)

w, b, lr = np.zeros(2), 0.0, 0.5
for step in range(400):
    p = 1 / (1 + np.exp(-(X @ w + b)))     # EQ M3.2
    w -= lr * (X.T @ (p - y)) / len(y)     # gradient of EQ M3.3: error x input
    b -= lr * np.mean(p - y)

p = np.clip(1 / (1 + np.exp(-(X @ w + b))), 1e-12, 1 - 1e-12)
ce  = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
acc = np.mean((p > 0.5) == y)
print("w =", np.round(w, 3), "  b =", round(b, 3))
print(f"cross-entropy = {ce:.4f}   accuracy = {acc:.1%}")
# plot_scatter is assumed to be a helper supplied by the in-browser runtime, not by numpy
plot_scatter(X[:, 0], X[:, 1], (p > 0.5).astype(int))

edits are live — try lr = 5.0, or 10 steps instead of 400

3.4

Decision boundaries: linear in input space

Where does the model actually decide? At $p = 0.5$ — which by EQ M3.1 happens exactly where the evidence is zero: $\mathbf{w}^{\top}\mathbf{x} + b = 0$. In two dimensions that is a straight line; in $d$ dimensions, a flat hyperplane. The sigmoid bends probabilities, never the boundary: logistic regression is a linear classifier, however smoothly its confidence shades from one side to the other.

The parameters split into three legible roles. The direction of $\mathbf{w}$ sets the boundary's orientation ($\mathbf{w}$ is perpendicular to it — the mint arrow in Instrument M3.1). The bias $b$ slides the boundary without rotating it. And the magnitude $\lVert\mathbf{w}\rVert$ controls how fast probability ramps as you walk away from the line: the score is $z = \lVert\mathbf{w}\rVert \cdot d(\mathbf{x})$, where $d(\mathbf{x})$ is the signed distance to the boundary. Direction says what the model believes; magnitude says how hard. That magnitude acts as a steepness dial — an inverse temperature — on the probability curve:
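The identity $z = \lVert\mathbf{w}\rVert \cdot d(\mathbf{x})$ can be checked directly. The sketch below takes the default dial settings listed for Instrument M3.1 as its weights and bias; the test point `x` is a hypothetical choice.

```python
import numpy as np

# Default dial settings listed for Instrument M3.1; the test point is hypothetical.
w = np.array([0.90, -0.60])
b = 0.40
x = np.array([2.0, 1.0])

z = w @ x + b                           # raw evidence (logit)
norm = np.linalg.norm(w)
d = z / norm                            # signed distance from x to the boundary

# Geometric check: step back along the unit normal by d and the
# resulting point should lie on the boundary (score zero).
foot = x - d * (w / norm)
print("z            :", round(float(z), 4))
print("||w|| * d    :", round(float(norm * d), 4))
print("score at foot:", abs(round(float(w @ foot + b), 10)))

# Scaling w and b together by k leaves the boundary fixed but changes confidence.
for k in [0.5, 1.0, 4.0]:
    p = 1.0 / (1.0 + np.exp(-k * z))
    print(f"k = {k}: p = {p:.4f}")
```

The last loop scales the score by `k` without moving the point or the line: same geometry, different confidence.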

INSTRUMENT M3.2 — SIGMOID TEMPERATURE · p = σ(k·z) · STEEPNESS k ≡ ‖w‖ ≡ 1/τ

[Interactive instrument. Control: steepness k (default 1.00). Live readouts: slope at midpoint (k/4), width of the grey zone (0.2 < p < 0.8), p at z = +1.]

The dashed curve is k = 1 for reference; the shaded band is the "grey zone" where the model is genuinely unsure. Crank k toward 8: the sigmoid hardens into a step — decisive, but with vanishing gradients and zero humility. Drop it toward 0.2: every answer is a shrug near 50%. This is the same dial as sampling temperature in Vol II — there you divide logits by τ; here k multiplies the score, so k ≡ 1/τ. Training sets it implicitly through ‖w‖.

A real failure mode hides in that dial. If the training data is perfectly separable, cross-entropy keeps paying the model to grow $\lVert\mathbf{w}\rVert$ forever — every doubling sharpens probabilities toward 0/1 and shaves a little more loss, without moving the boundary at all. The result is a wildly overconfident model. The standard fixes are L2 regularization or early stopping, both of which cap $\lVert\mathbf{w}\rVert$; Chapter 06 treats this properly.
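The runaway is easy to watch on a toy problem. This is a minimal sketch, assuming a hypothetical four-point separable dataset, a single weight with no bias, and an L2 penalty of the form $(\lambda/2)\,w^2$; the step counts and $\lambda = 0.1$ are arbitrary illustration values.

```python
import numpy as np

# Hypothetical, perfectly separable 1-D data (no bias term, for brevity).
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def train(steps, lam=0.0, lr=0.5):
    """Gradient descent on cross-entropy plus an optional L2 penalty (lam/2) * w^2."""
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))
        w -= lr * (np.mean((p - y) * x) + lam * w)
    return w

for steps in [100, 1000, 10000]:
    print(f"steps = {steps:5d}   w (no penalty) = {train(steps):.3f}"
          f"   w (lam = 0.1) = {train(steps, lam=0.1):.3f}")
```

Without the penalty the weight keeps creeping upward no matter how long training runs; with it, the weight settles at a finite value. Stopping early has a similar effect simply by cutting the climb short.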

What a line cannot do is bend. XOR-style data (positives in opposite corners) and concentric rings defeat every choice of $\mathbf{w}$ and $b$ — no straight boundary separates them. The classical remedy is feature engineering: feed the model $x_1 x_2$, or $x_1^2 + x_2^2$, and the boundary becomes linear in the new features while curving in the original space. The modern remedy is to learn those features — which is precisely what neural networks do (Chapter 07). Either way, the lesson stands: a linear classifier is only as good as the space you hand it.
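A four-point XOR layout makes the feature-engineering remedy concrete. The sketch below reuses the gradient from Section 3.3; the helper names `train_logreg` and `accuracy` and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# XOR-style data: positives in opposite corners.
X = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])

def train_logreg(F, y, lr=0.5, steps=2000):
    """Plain gradient descent on cross-entropy for feature matrix F."""
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        w -= lr * (F.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(F, y, w, b):
    """Fraction of points whose predicted class (score > 0) matches the label."""
    return float(np.mean(((F @ w + b) > 0) == y))

# Raw inputs: no straight line separates the corners.
w_raw, b_raw = train_logreg(X, y)
acc_raw = accuracy(X, y, w_raw, b_raw)
print("raw features      :", acc_raw)

# Add the engineered feature x1 * x2 suggested in the text.
F = np.column_stack([X, X[:, 0] * X[:, 1]])
w_eng, b_eng = train_logreg(F, y)
acc_eng = accuracy(F, y, w_eng, b_eng)
print("with x1*x2 feature:", acc_eng)
```

The model and the training loop are unchanged between the two runs; only the space handed to them differs.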

3.5

Many classes: softmax

Two classes needed one score. For $K$ classes, give each class its own linear score $z_i = \mathbf{w}_i^{\top}\mathbf{x} + b_i$ and normalize the lot with softmax: exponentiate (so everything is positive, and the difference between two scores becomes the log of the ratio between their probabilities), then divide by the sum (so everything adds to one):

EQ M3.4 — SOFTMAX $$ \mathrm{softmax}(\mathbf{z})_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad \mathrm{softmax}(\mathbf{z} + c)_i \;=\; \mathrm{softmax}(\mathbf{z})_i \;\;\text{for any constant } c $$

For $K = 2$, with the classes labeled 0 and 1, it collapses to EQ M3.1: $p_1 = \sigma(z_1 - z_0)$. Sigmoid is softmax for two. The shift invariance says softmax reads differences between scores, not their absolute values, which doubles as the standard numerical-stability trick: subtract $\max_j z_j$ before exponentiating, for free. The loss generalizes too: pay $-\log p_{\text{true class}}$. And the gradient with respect to the scores stays beautiful: predicted probabilities minus the one-hot truth, $\partial \mathcal{L} / \partial z_i = p_i - y_i$.

Three class scores are $\mathbf{z} = (1, 0, 0)$. By EQ M3.4, what probability does softmax assign to the first class?

Exponentiate: $e^1 \approx 2.718$, $e^0 = 1$, $e^0 = 1$. Sum $\approx 4.718$. First probability $= 2.718 / 4.718 \approx $ 0.576. A one-unit lead over the other two scores translates into a clear, but not crushing, majority of the probability mass.
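The worked example, the two-class collapse, and the gradient claim can all be checked in a few lines. This is a sketch: the two-class scores and the three-class test vector are hypothetical, and the finite-difference comparison is only a sanity check.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift by the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The worked example: z = (1, 0, 0).
p = softmax(np.array([1.0, 0.0, 0.0]))
print("softmax(1, 0, 0):", np.round(p, 3))

# K = 2: the probability of class 1 depends only on the score difference.
z0, z1 = 0.3, 1.7                       # hypothetical scores
two = softmax(np.array([z0, z1]))
print("softmax p_1     :", round(float(two[1]), 6))
print("sigmoid(z1 - z0):", round(float(sigmoid(z1 - z0)), 6))

# Gradient of -log p[true] with respect to the scores is p - one_hot(true).
z = np.array([2.0, -1.0, 0.5])
true = 2
analytic = softmax(z) - np.eye(3)[true]
eps = 1e-6
numeric = np.array([
    (-np.log(softmax(z + eps * np.eye(3)[i])[true])
     + np.log(softmax(z - eps * np.eye(3)[i])[true])) / (2 * eps)
    for i in range(3)
])
print("gradient matches:", np.allclose(analytic, numeric, atol=1e-6))
```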

You will meet this exact function three more times in this encyclopedia, doing three different jobs:

Where	Softmax over	Producing
This chapter	K class scores	p(class | input)
LLM output head (Vol II · EQ 1.2)	|V| ≈ 100K token logits	p(next token | context)
Attention (Vol II · EQ 3.1)	T relevance scores per query	mixing weights over values
Sampling with temperature τ	logits / τ	sharpened or flattened p

One function, one identity: turn arbitrary scores into a probability distribution, differentiably. Every time a network must choose softly among options — classes, tokens, positions to attend to — softmax is the mechanism. Run it:

PYTHON · RUNNABLE IN-BROWSER

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max: free, by shift invariance
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -1.7])     # 4 classes, raw scores
p = softmax(logits)
for name, pi in zip("ABCD", p):
    print(f"class {name}: {pi:.4f}")
print("sum =", round(p.sum(), 6))

print()
print("logits + 100  :", np.round(softmax(logits + 100), 4))
print("identical — softmax reads DIFFERENCES, not absolute scores")

with np.errstate(over="ignore"):               # what the max-trick prevents:
    print("naive exp(z+1000):", np.exp(logits + 1000.0))

edits are live — try logits / 0.1 (cold) or logits / 10 (hot)
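The temperature experiment suggested above can also be run side by side. This sketch reuses the four logits from the cell above; the three temperatures and the entropy readout are illustrative additions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([3.2, 1.1, 0.4, -1.7])   # same raw scores as the cell above

for tau in [0.1, 1.0, 10.0]:
    p = softmax(logits / tau)
    entropy = -np.sum(p * np.log(p + 1e-300))   # in nats; the tiny offset guards log(0)
    print(f"tau = {tau:4.1f}   p = {np.round(p, 3)}   entropy = {entropy:.3f}")
```

Cold temperatures pile almost all the mass on the top class and the entropy falls toward zero; hot temperatures flatten the distribution toward uniform and the entropy rises toward $\ln 4$.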

3.6

Metrics beyond accuracy

A disease afflicts 1 person in 1,000. The classifier that always returns "healthy" scores 99.9% accuracy and has never detected anything. Under class imbalance, accuracy measures the imbalance, not the model. The honest accounting starts by splitting the four ways a binary prediction can land:

CONFUSION MATRIX	PREDICTED +	PREDICTED −
ACTUAL +	TP · true positive	FN · false negative — a miss
ACTUAL −	FP · false positive — a false alarm	TN · true negative

Two questions matter, and they are different questions. Precision $= \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$: of everything I flagged, how much was real? Recall $= \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$: of everything real, how much did I flag? They pull against each other through a dial you already own: the decision threshold. Nothing forces the cut at $p = 0.5$ — lower it and you catch more positives (recall ↑) while flagging more junk (precision ↓); raise it and the reverse. A single model traces an entire precision–recall curve as the threshold sweeps; the F1 score $= 2PR/(P+R)$, a harmonic mean, condenses one operating point into one number — harsh on imbalance between the two, as a harmonic mean should be.
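Sweeping the threshold over a fixed set of predictions shows the trade-off directly. The ten probabilities and labels below are hypothetical, made up for this sketch, and `precision_recall` is an illustrative helper name.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for ten examples.
p = np.array([0.95, 0.90, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.20, 0.05])
y = np.array([1,    1,    0,    1,    1,    0,    1,    0,    0,    0])

def precision_recall(p, y, threshold):
    """Precision and recall when everything with p >= threshold is flagged."""
    pred = p >= threshold
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for t in [0.85, 0.50, 0.25]:
    prec, rec = precision_recall(p, y, t)
    print(f"threshold = {t:.2f}   precision = {prec:.3f}   recall = {rec:.3f}")
```

The model never changes across the three rows. Only the cut moves, and with it the balance between misses and false alarms.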

A classifier records $\mathrm{TP} = 30$, $\mathrm{FP} = 10$, $\mathrm{FN} = 20$. What is its precision?

Precision $= \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \dfrac{30}{30 + 10} = \dfrac{30}{40} = $ 0.75. Of everything it flagged, three in four were real — recall (which uses FN) answers the different question of how many real cases it caught.

The same classifier ($\mathrm{TP} = 30$, $\mathrm{FP} = 10$, $\mathrm{FN} = 20$) has precision $0.75$ and recall $30/50 = 0.6$. What is its F1 score?

$F1 = \dfrac{2PR}{P + R} = \dfrac{2\cdot 0.75\cdot 0.6}{0.75 + 0.6} = \dfrac{0.9}{1.35} \approx $ 0.667. The harmonic mean sits below the arithmetic mean of $0.675$ — its way of penalizing the imbalance between precision and recall.
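Both worked answers follow from the three counts alone. A minimal sketch, with `precision_recall_f1` as an illustrative helper name:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F1        = {f1:.3f}")
print(f"arithmetic mean of P and R = {(precision + recall) / 2:.3f}")
```

True negatives never enter any of these formulas, which is why the three metrics stay informative when the negative class is enormous.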

Where you sit on that curve is a question about costs, not statistics: a missed tumor and a false alarm are not the same price, and no metric chooses for you. Worse, base rates ambush intuition. Run a genuinely good screening test on a rare condition:

QUANTITY	RESULT	NOTE
SCREENED	10,000	prevalence 1% → 100 actually positive
RECALL 90%	90 TP	10 real cases slip through (FN)
FP RATE 8%	792 FP	8% of 9,900 healthy people flagged
PRECISION	10.2%	90 / 882 flags are real: 9 in 10 alarms are false
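The screening arithmetic is short enough to script, which makes it easy to see how precision moves with prevalence. In this sketch the function name `screening_outcome` is an illustrative choice; the inputs are the figures quoted above.

```python
def screening_outcome(n_screened, prevalence, recall, false_positive_rate):
    """Expected confusion-matrix counts and precision for a screening test."""
    positives = n_screened * prevalence
    negatives = n_screened - positives
    tp = recall * positives
    fn = positives - tp
    fp = false_positive_rate * negatives
    precision = tp / (tp + fp)
    return tp, fn, fp, precision

tp, fn, fp, precision = screening_outcome(
    n_screened=10_000, prevalence=0.01, recall=0.90, false_positive_rate=0.08
)
print(f"TP = {tp:.0f}, FN = {fn:.0f}, FP = {fp:.0f}")
print(f"precision = {precision:.1%}")
```

Rerunning the function with a higher prevalence and the same test shows precision climbing quickly: the test did not improve, the base rate did.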

Working under imbalance, in practice: judge models on precision/recall (or the PR curve), never raw accuracy; consider reweighting the loss so rare-class errors cost more, or resampling the data; and remember the cheapest fix is often just moving the threshold after training. The probabilities logistic regression emits are exactly what make that last move possible — a hard classifier offers no dial at all.

Logistic regression draws one straight, confident line. Chapter 04 takes the opposite bet: models with no line, no sigmoid, and barely any equations — decision trees that carve the space into boxes, forests that vote, and nearest neighbors that just ask, "what did similar points do?"
