01 · Learning from Data — AI Encyclopedia

1.1

The trick behind all of it

For seventy years, making a computer do something meant one thing: a person figures out the rules, writes them down precisely, and the machine follows them. This works beautifully when the rules are knowable — payroll, physics simulations, chess-piece movement. It collapses when they are not. Nobody can write down the rules for recognizing a face, transcribing mumbled speech, or deciding whether an email is spam. We do these things effortlessly, and we cannot say how.

Machine learning inverts the contract. The human supplies three things: a pile of examples of the job done correctly, a flexible function with adjustable numbers inside it, and a score that measures how badly the function currently does the job. The machine's only task is to adjust the numbers until the score improves. The rules are never written by anyone. They condense out of the data, the way a curve condenses out of scattered points.

	Classical programming	Machine learning
Human writes	the rules, by hand	examples + a score + a flexible function
Machine produces	answers	the rules (as knob settings)
Wins when	rules are crisp and known	rules are unknown, fuzzy, or drift over time
Fails by	crashing — loudly, traceably	being statistically wrong — quietly, sometimes confidently

Take spam. The hand-written version — if subject contains "FREE!!!" then spam — was the actual state of the art in the 1990s, and it aged badly: spammers read the rules too. The learned version is handed two million emails that humans already labeled spam or not spam and tunes itself to agree with those labels. When spammers adapt, you don't rewrite code; you feed in fresh examples and tune again. The maintenance burden moves from logic to data — which is the real reason this paradigm conquered the industry.

The last table row is not a throwaway. A learned system's failures are statistical: it will be wrong on some inputs, with no stack trace pointing at the offending line, because there is no offending line. Knowing how often it is wrong — and on which inputs — is most of the discipline you are about to learn.

1.2

A model is a function with knobs

To make "adjust the numbers until the score improves" precise, we need names for the pieces. The cleanest setting — and the one this whole volume lives in — is supervised learning: each example is a pair $(x, y)$, where $x$ is the input and $y$ is the correct answer, the label. Square footage and sale price. Email text and spam-or-not. A photo and the word "cat". Someone, somewhere, supervised: they supplied the right answers.

A model (the older literature says hypothesis, hence the letter $h$) is a function that takes $x$ and emits a guess for $y$. What makes it a learnable function is that its behavior depends on adjustable numbers — its parameters, also called weights. The simplest interesting model on Earth has exactly two:

EQ M1.1 — A FUNCTION WITH TWO KNOBS

$$ h_{w,b}(x) \;=\; w\,x + b $$

A straight line. $w$ is the slope — how much the prediction rises per unit of input — and $b$ is the intercept, the prediction at $x = 0$. The subscript records the central fact: pick different numbers $w, b$ and you get a different function. Learning means searching the space of knob settings for the function that fits. Parameters are written collectively as $\theta$ (theta), so you will see $h_\theta$ everywhere; here $\theta = (w, b)$.

Set the knobs to $w = 3$ and $b = -2$, giving the model $h(x) = 3x - 2$. What does it predict for the input $x = 4$?

$h(4) = 3\cdot 4 + (-2) = 12 - 2 = 10$. The two knobs and one input fully determine the output; that is all a model does at prediction time.
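The worked example above can be checked in a few lines. This is a minimal sketch: the function name `h` simply mirrors EQ M1.1, and the second set of knob values is an arbitrary choice made for contrast.

```python
def h(x, w, b):
    """EQ M1.1: a straight line with slope w and intercept b."""
    return w * x + b

# Knobs from the worked example: w = 3, b = -2
prediction = h(4, w=3, b=-2)
print(prediction)  # 10

# Same input, different knobs (arbitrary values): a different function
other = h(4, w=1, b=0)
print(other)  # 4
```

The input never changed between the two calls; only the knobs did. That is the sense in which the parameters, not the code, carry what the model "knows".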

Hold onto the geometry of that sentence: two knobs define a two-dimensional space of candidate lines, and "learning" is a search through that space. Every model in this encyclopedia is the same object scaled up. A frontier language model is a function with roughly $10^{12}$ knobs instead of two — harder to search, impossible to visualize, but not a different kind of thing. The vocabulary you are acquiring on this page transfers without modification.

An honest caveat before we proceed. Not all learning is supervised. Models can learn structure from unlabeled data (unsupervised), from data that labels itself (self-supervised, which is how language models pre-train; Vol II, Ch 04), or from trial-and-error reward (reinforcement learning). Supervised learning is where the vocabulary is cleanest, and the other regimes reuse nearly all of it.

1.3

Loss: keeping score

"Fits the data" must become a number, or the machine has nothing to improve. For one example, the natural measure of failure is the residual — the gap between prediction and truth, $h(x_i) - y_i$. To grade the model on the whole dataset, square each residual and average:

EQ M1.2 — MEAN SQUARED ERROR

$$ \mathcal{L}(w, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \big( h_{w,b}(x_i) - y_i \big)^{2} $$

There are $n$ examples; the $\Sigma$ just means "add them all up". Squaring does three jobs at once: it kills the sign (overshoot and undershoot both count), it punishes large misses far more than small ones (a residual of 4 costs 16; two residuals of 2 cost 8), and, for a model that is linear in its knobs like the line above, it leaves a smooth bowl-shaped surface with no kinks, which is what makes the automatic tuning of Chapter 02 possible. The loss is a function of the knobs, not of the data: the data is fixed; $w$ and $b$ move; $\mathcal{L}$ reports disagreement at every setting.

A model makes three predictions $\hat y = (5, 8, 6)$ for three points whose true labels are $y = (4, 6, 9)$. Using EQ M1.2, what is the mean squared error?

Residuals (prediction − truth): $5-4 = 1$, $8-6 = 2$, $6-9 = -3$. Square each: $1, 4, 9$. Sum $= 14$; average over $n = 3$: $14/3 \approx 4.667$. The single residual of $-3$ contributes 9, more than the other two combined, which is exactly the disproportionate punishment squaring is designed to deliver.
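The same arithmetic, step by step, as a short numpy sketch. The variable names are illustrative; the numbers are the ones from the exercise, plus the 4-versus-2-and-2 comparison made earlier in this section.

```python
import numpy as np

y_hat = np.array([5.0, 8.0, 6.0])   # predictions from the exercise
y     = np.array([4.0, 6.0, 9.0])   # true labels

residuals = y_hat - y               # prediction minus truth: 1, 2, -3
squared = residuals ** 2            # 1, 4, 9
mse = squared.mean()                # (1 + 4 + 9) / 3

print(round(mse, 3))                # 4.667

# One big miss costs more than two small misses of the same total size
one_big = 4 ** 2                    # 16
two_small = 2 ** 2 + 2 ** 2         # 8
print(one_big, two_small)           # 16 8
```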

Now feel it in your hands. Below are 25 measurements from a noisy linear process. Your job is the machine's job: turn the two knobs and drive the disagreement down. The red stalks are the residuals — the exact quantities EQ M1.2 squares and averages.

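INSTRUMENT M1.1 — HAND-FIT · 25 NOISY POINTS · EQ M1.2 LIVE · TARGET: MSE BELOW 4.00

[Interactive instrument: sliders for SLOPE w (starting at 1.00) and INTERCEPT b (starting at 0.0); live readouts for MSE (EQ M1.2), the challenge to beat 4.00, and the worst single miss.]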

Drive the MSE below 4.00 — it is possible, but only just: the best achievable on these points is 3.57, at w ≈ 2.19, b ≈ 0.31. Notice the strategy your hands discover: big slope moves first, small intercept corrections after, ever-finer wiggles as you close in. That instinct — large steps far from the answer, small steps near it — is precisely what Chapter 02 turns into an algorithm.

And here is the same arithmetic with the curtain pulled back — the identical 25 points, two candidate knob settings, scored in four lines of numpy. The second candidate beats the instrument's target; neither is optimal.

PYTHON · RUNNABLE IN-BROWSER

import numpy as np

# The exact 25 points behind Instrument M1.1
x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212,
              4.253, 4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486,
              7.243, 8.232, 3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660,
              7.241, 12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813,
              17.723, 17.977, 5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

def mse(w, b):                      # EQ M1.2, verbatim
    return np.mean((w * x + b - y) ** 2)

candidates = [(1.0, 0.0), (2.0, 1.0)]
for w, b in candidates:
    print(f"h(x) = {w:.2f}x + {b:.2f}   ->   MSE = {mse(w, b):6.2f}")

# plot_scatter is assumed to be a helper supplied by the in-browser runtime, not numpy
plot_scatter(x, y)

edit the candidates — can you beat the instrument by hand?
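For comparison with the hand candidates, here is a hedged sketch that asks numpy for the least-squares line directly. It relies on `np.polyfit` with degree 1, which returns the slope and intercept that minimize the squared error; the result should land near the best-achievable figures quoted for the instrument. How such a fit is found is the subject of Chapter 02; this block only reports the destination.

```python
import numpy as np

# The same 25 points used by Instrument M1.1
x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212,
              4.253, 4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486,
              7.243, 8.232, 3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660,
              7.241, 12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813,
              17.723, 17.977, 5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

def mse(w, b):
    """EQ M1.2 for the line w*x + b on the fixed dataset."""
    return np.mean((w * x + b - y) ** 2)

# Degree-1 polyfit returns [slope, intercept] of the least-squares line
w_best, b_best = np.polyfit(x, y, 1)

print(f"least-squares line: w = {w_best:.2f}, b = {b_best:.2f}, "
      f"MSE = {mse(w_best, b_best):.2f}")
print(f"hand candidate (2.0, 1.0): MSE = {mse(2.0, 1.0):.2f}")
```

No hand-picked pair can score below the least-squares line on these points, which is why the instrument's target of 4.00 leaves so little room.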

Units, briefly. Squaring changes units: if $y$ is in dollars, MSE is in dollars-squared, which no human can feel. Practitioners report $\sqrt{\mathrm{MSE}}$ (RMSE) when they want interpretability. And MSE is one loss among many — classification tasks use cross-entropy (Chapter 04), and the freedom to choose the score is a design lever, not a footnote. What never changes: some single number measures disagreement, and learning means pushing it down.
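A small sketch of the units point, reusing the three predictions and labels from the MSE exercise. Taking the square root returns the score to the units of $y$; the variable names are illustrative.

```python
import numpy as np

y_hat = np.array([5.0, 8.0, 6.0])   # predictions from the MSE exercise
y     = np.array([4.0, 6.0, 9.0])   # true labels

mse = np.mean((y_hat - y) ** 2)     # in units of y, squared
rmse = np.sqrt(mse)                 # back in the units of y

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

If $y$ were in dollars, the RMSE could be read as "a typical miss of about this many dollars", with the caveat that large misses still weigh more than small ones.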

1.4

Generalization: the only thing that matters

Here is the trap at the heart of the field. If low loss on the examples were the goal, the perfect model would be a lookup table: store every $(x_i, y_i)$ pair, return $y_i$ when asked about $x_i$, achieve a loss of exactly zero. It is also perfectly useless — ask it about any $x$ it hasn't stored and it has nothing to say. Zero training loss, zero learning. Memorization is not the goal. The goal is performance on data the model has never seen. That property is called generalization, and it is the only thing anyone is ever actually paying for.
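The lookup-table "model" is easy to write down, which makes its failure easy to see. In this sketch the three stored pairs are made-up values for illustration, and returning `None` for an unseen input is one assumed way of having "nothing to say".

```python
# Stored training pairs (x -> y); the values are invented for illustration
train = {1.0: 2.1, 2.0: 3.9, 3.0: 6.2}

def lookup_model(x):
    """Return the memorized label for x, or None if x was never stored."""
    return train.get(x)

# Training loss: every stored input is answered exactly
train_mse = sum((lookup_model(x) - y) ** 2 for x, y in train.items()) / len(train)
print(train_mse)           # 0.0

# An input between two stored points gets no answer at all
print(lookup_model(2.5))   # None
```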

The defense is almost embarrassingly simple, and it is the single most important habit in machine learning: before doing anything else, split the data. Tune the knobs on one part (the training set) and measure on a part the model never touched (the test set). The held-out score is a rehearsal for the future; the training score is just a record of the past. Formally, the quantity we minimize is a stand-in for the quantity we want:

EQ M1.3 — EMPIRICAL RISK STANDS IN FOR TRUE RISK

$$ \hat{R}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i),\, y_i\big) \qquad\text{approximates}\qquad R(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \ell\big(h(x),\, y\big) \Big] $$

$\ell$ is any per-example loss (squared error, here). $\mathcal{D}$ is the unseen process that generates the data — houses being sold, emails being sent — and $\mathbb{E}$ means "the average over everything that process will ever produce". We can never compute $R$, so we minimize $\hat{R}$ on a sample and hope the sample speaks for the population. Training loss measures fit. Test loss estimates risk. Only the second predicts the future. The gap between them is overfitting, made visible.

A model scores five held-out examples with per-example losses $0.4,\ 1.2,\ 0.6,\ 0.8,\ 1.0$. What is the empirical risk $\hat R$ (EQ M1.3) on this sample?

$\hat R = \tfrac{1}{5}(0.4 + 1.2 + 0.6 + 0.8 + 1.0) = 4.0/5 = 0.8$. Empirical risk is nothing more exotic than the average loss over the sample: our computable stand-in for the uncomputable true risk $R$.

A model reaches training loss $0.30$ but its held-out test loss is $0.95$. How large is the generalization gap (test − train)?

Gap $= 0.95 - 0.30 = 0.65$. A model that fits the training data far better than the test data is overfitting; the gap is that failure made into a number.
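Both exercises reduce to one line of arithmetic each. A minimal sketch, with illustrative variable names and the numbers taken from the two questions above:

```python
import numpy as np

# Per-example losses on the five held-out examples
held_out_losses = np.array([0.4, 1.2, 0.6, 0.8, 1.0])
empirical_risk = held_out_losses.mean()     # EQ M1.3, left-hand side
print(round(empirical_risk, 3))             # 0.8

# Generalization gap: test loss minus training loss
train_loss, test_loss = 0.30, 0.95
gap = test_loss - train_loss
print(round(gap, 2))                        # 0.65
```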

To see the gap open wide, give a model too much flexibility. The instrument below fits the same 25 points two ways: a straight line (two knobs), and a degree-9 polynomial (ten knobs — enough to snake through nearly every training point individually). Both are fitted to the same 18 training points; 7 points are held out. Watch what each extra knob buys, and what it costs.

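INSTRUMENT M1.2 — TRAIN/TEST SPLIT · SAME DATA · 18 TRAIN / 7 HELD OUT · EQ M1.3

[Interactive instrument: a MODEL CAPACITY toggle between the straight line and the degree-9 polynomial; a plot marking the 18 training points and the 7 held-out test points, which are never fitted; live readouts for train MSE (18 points), test MSE (7 points), and the test/train gap.]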

Flip to DEGREE 9. Train MSE collapses from 3.13 to 0.87. By the training score the wiggly curve is the better model, and it always will be: every straight line is also a degree-9 polynomial with its higher coefficients set to zero, so an exact least-squares fit with more knobs can never fit the training data worse. But test MSE detonates from 5.0 to 1,373, because between the memorized points the polynomial swings wildly through territory no data constrains. The degree-9 coefficients are precomputed (an exact least-squares fit to the 18 training points); both MSE readouts are computed live from them.

Run the same experiment yourself, this time with a roughly 90/10 split (22 training points, 3 held out) and fits via np.polyfit. With only 3 points held out the verdict is noisier than the instrument's (small test sets are unreliable juries; that is a real lesson, not an apology), but it points the same way:

PYTHON · RUNNABLE IN-BROWSER

import numpy as np

x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212,
              4.253, 4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486,
              7.243, 8.232, 3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660,
              7.241, 12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813,
              17.723, 17.977, 5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

perm = np.random.default_rng(0).permutation(len(x))
train, test = perm[:22], perm[22:]      # 22 train / 3 test (roughly 90 / 10)
z = x / 10                              # rescale so degree 9 stays well-conditioned

for deg in (1, 9):
    c = np.polyfit(z[train], y[train], deg)
    mse_tr = np.mean((np.polyval(c, z[train]) - y[train]) ** 2)
    mse_te = np.mean((np.polyval(c, z[test])  - y[test])  ** 2)
    print(f"degree {deg}:  train MSE = {mse_tr:5.2f}   test MSE = {mse_te:5.2f}")

held_out = np.isin(np.arange(len(x)), test).astype(int)
# plot_scatter is assumed to be a helper supplied by the in-browser runtime, not numpy
plot_scatter(x, y, held_out)            # blue = the 3 points the fit never saw

change the rng seed — watch the 3-point test verdict wobble

FINE PRINT

The split certifies less than it seems to. (1) It assumes test data is drawn from the same process $\mathcal{D}$ as training data — but the world drifts, and a model certified on last year's emails meets next year's spammers. This failure mode, distribution shift, is endemic in deployment. (2) The certificate expires with use: every time you peek at the test score and adjust your model in response, information leaks, and the test set quietly becomes training signal. Serious practice holds out a final untouched set and looks at it once. (3) For language models this discipline has a sharper name — contamination — because when your training set is the internet, your test set is usually in it somewhere (Vol II, Ch 04).
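Point (2) is often handled by splitting three ways rather than two. The sketch below is one way to do it, under stated assumptions: the 15 / 5 / 5 proportions and the seed are arbitrary choices, and the middle part is what practitioners commonly call a validation set.

```python
import numpy as np

n = 25                                  # dataset size used in this chapter
rng = np.random.default_rng(0)          # seed chosen arbitrarily
perm = rng.permutation(n)               # shuffle the indices once

# Assumed proportions: 15 train / 5 validation / 5 final test
train_idx = perm[:15]                   # tune the knobs on these
val_idx   = perm[15:20]                 # compare models on these, as often as needed
test_idx  = perm[20:]                   # look at these once, at the very end

print(len(train_idx), len(val_idx), len(test_idx))   # 15 5 5
```

Every peek at the validation score leaks a little information, which is acceptable because that set was never meant to certify anything. The final test set keeps its value only as long as it stays unread.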

1.5

The loop you will see thirty more times

Assemble the pieces and you get the universal cadence of machine learning — the loop every chapter in this encyclopedia will replay at larger scale:

FIG M1.1 · PREDICT → MEASURE → ADJUST

The loop. Predict with the current knobs, measure disagreement against the labels, adjust the knobs, repeat. Everything else in machine learning is a refinement of one of these four boxes.

In Instrument M1.1, you were the ADJUST box — eyes on the residuals, hands on the sliders. That works for two knobs. It does not work for ten, and it is unthinkable for $10^{12}$. The entire next chapter is about firing you from the job: calculus can read the slope of the loss surface and announce, for every knob simultaneously, which direction reduces disagreement. That announcement is called the gradient, and following it is called gradient descent — the algorithm that trains essentially everything, from the straight line above to the largest models ever built.
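Before calculus takes over, it may help to see the loop run with the crudest possible ADJUST box: a program that nudges each knob up and down and keeps whatever lowers the loss. This is a stand-in for the human at the sliders, not the gradient-based method of Chapter 02. The starting point matches the first hand candidate scored earlier; the initial step size and the iteration count are arbitrary assumptions.

```python
import numpy as np

# The same 25 points used throughout this chapter
x = np.array([0.117, 4.055, 2.578, 3.680, 2.475, 2.910, 5.537, 9.212,
              4.253, 4.267, 3.306, 3.224, 6.843, 9.199, 9.605, 5.486,
              7.243, 8.232, 3.124, 8.334, 7.168, 8.049, 5.069, 8.135, 6.093])
y = np.array([3.041, 9.265, 5.475, 8.950, 4.430, 8.188, 14.821, 16.660,
              7.241, 12.262, 6.484, 5.253, 17.360, 19.157, 23.652, 10.813,
              17.723, 17.977, 5.106, 16.549, 17.812, 18.369, 8.996, 20.632, 14.282])

def mse(w, b):
    """PREDICT with the knobs, then MEASURE disagreement (EQ M1.2)."""
    return np.mean((w * x + b - y) ** 2)

w, b = 1.0, 0.0            # same start as the first hand candidate
step = 0.5                 # assumed initial nudge size
loss = mse(w, b)
start_loss = loss

for _ in range(200):       # assumed iteration budget
    improved = False
    # Try nudging each knob up and down by the current step
    for dw, db in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
        trial = mse(w + dw, b + db)
        if trial < loss:   # ADJUST: keep any nudge that lowers the loss
            w, b, loss = w + dw, b + db, trial
            improved = True
    if not improved:
        step /= 2          # nothing helped, so take finer steps

print(f"start MSE = {start_loss:.2f}")
print(f"w = {w:.2f}, b = {b:.2f}, MSE = {loss:.2f}")
```

The search needs four loss evaluations per round for two knobs, and the count grows with every knob added. That cost is the practical reason trial and error does not scale, and the reason the next chapter asks calculus for directions instead.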

What will change as this volume proceeds: the model grows from a line to a network of millions of units; the loss changes shape for new tasks; the data swells from 25 points to trillions of tokens; ADJUST acquires momentum, schedules, and tricks. What will never change: predict, measure, adjust. When the architecture of Volume II towers over you, find the four boxes. They are always there.

You fit the line by feel; the machine fits it by calculus. Chapter 02: the loss surface as a landscape, the gradient as a compass pointing downhill, the learning rate as stride length — and why the exact solution to linear regression exists yet almost nobody uses it.

§