09 · Naive Bayes & Generative Classifiers

9.1

Generative vs discriminative classifiers

Every classifier ultimately wants $p(y \mid x)$: the probability of class $y$ given features $x$. There are two roads to it, and they split machine learning down the middle.

A discriminative model attacks $p(y \mid x)$ head-on. Logistic regression (Chapter 03), neural nets (Chapter 07), SVMs (next chapter) — all of them learn the boundary between classes directly and never bother modeling what the inputs themselves look like. A generative model takes the long way around: it learns $p(x \mid y)$ — a full story of how each class generates its data — plus the class prevalences $p(y)$, and only then flips them with Bayes' rule to recover $p(y \mid x)$. You could literally sample from a trained generative classifier to hallucinate new examples of "spam" or "not spam".

Aspect	Generative — models $p(x,y)$	Discriminative — models $p(y \mid x)$
Learns	how each class produces data	where the boundary sits
Examples	Naive Bayes, LDA/QDA, GMMs, HMMs	Logistic regression, SVM, neural nets
Data hunger	low — strong assumptions fill the gaps	higher — wants enough to trace the boundary
Asymptotic accuracy	often lower (model is wrong)	often higher (fewer assumptions)
Can generate samples?	yes	no

The trade-off is captured in a classic result of Ng & Jordan (2002): a generative classifier and its discriminative twin (e.g. naive Bayes vs logistic regression) approach different error rates as data grows, but the generative one approaches its (sometimes higher) ceiling much faster — and wins outright in the small-data regime. With ten examples, the model that assumes more often beats the model that assumes less. With ten million, the assumptions become a liability. Naive Bayes lives at the assumption-heavy end of this spectrum, which is exactly why it remains a strong baseline whenever labeled data is scarce or latency is brutal.

INTUITION

A discriminative model is a border guard who has learned to spot a forged passport without ever picturing a real citizen. A generative model is a forger who has studied what genuine documents look like and judges a new one by how easily they could have produced it. Both can flag fakes — they just know the world differently.

9.2

Bayes' rule for classification & the naive assumption

The engine is Bayes' rule (Stats · EQ 1.x), read as a classifier. To score class $y$ for an input $x = (x_1, \ldots, x_d)$:

EQ M9.1 — BAYES' RULE AS A CLASSIFIER $$ p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)} \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')} $$

The prior $p(y)$ is just how common each class is. The likelihood $p(x \mid y)$ is the class's story of the data. The evidence $p(x)$ in the denominator is the same for every class, so for picking the winner it is pure normalization — you can ignore it entirely until you need calibrated probabilities. That single observation is why naive Bayes never has to compute the hard part.

The problem hides in $p(x \mid y)$. For $d$ binary features there are $2^d - 1$ free numbers per class — a joint distribution over every combination of feature values. With $d = 50$ that is more parameters than atoms you could ever count. No dataset estimates it. So naive Bayes makes its defining leap of faith: given the class, the features are mutually independent.

EQ M9.2 — THE NAIVE CONDITIONAL-INDEPENDENCE ASSUMPTION $$ p(x_1, x_2, \ldots, x_d \mid y) \;\approx\; \prod_{j=1}^{d} p(x_j \mid y) $$

The full joint — exponential in $d$ — collapses into a product of $d$ one-dimensional pieces, each trivially estimated by counting. The cost falls from $2^d$ parameters to $O(d)$ per class. This is almost never literally true (in text, "new" and "york" are wildly correlated), yet it is the entire trick. The assumption is the price; linear-time learning and inference is what you buy.

Plug EQ M9.2 into EQ M9.1 and drop the constant evidence. The decision rule becomes a single argmax, and — because likelihoods are tiny products that underflow to zero in floating point — you always compute it in log space, where the product becomes a sum:

EQ M9.3 — THE NAIVE BAYES DECISION RULE $$ \hat{y} \;=\; \underset{y}{\arg\max}\;\Big[\, \log p(y) + \sum_{j=1}^{d} \log p(x_j \mid y) \,\Big] $$

A prediction is one weighted vote: start from the log-prior, then add each feature's log-evidence for that class. No iteration, no gradient descent — training is counting, inference is addition. Because the scores are sums of log-probabilities, naive Bayes is linear in feature log-likelihoods; with the right parameterization it shares the exact functional form of logistic regression, just fit differently.

True or false: Naive Bayes assumes that the features are independent of one another given the class label. (Answer true or false.)

This is precisely EQ M9.2 — the conditional-independence assumption that lets the joint likelihood factor into $\prod_j p(x_j \mid y)$. It is the model's namesake "naive" leap. The answer is true. (Note the subtlety: features need not be marginally independent — only independent conditioned on the class.)

PYTHON · RUNNABLE IN-BROWSER

# EQ M9.3 by hand: log-prior + sum of feature log-likelihoods, then argmax.
# Two classes, two binary features. p(x_j=1 | y) given directly.
import numpy as np

prior   = {0: 0.6, 1: 0.4}                 # p(y): class 0 is more common
p1      = {0: [0.2, 0.7], 1: [0.8, 0.3]}   # p(x_j = 1 | y) for j = 0, 1
x       = [1, 0]                            # observe feature0 = 1, feature1 = 0

def logscore(y):
    s = np.log(prior[y])
    for j, xj in enumerate(x):
        pj = p1[y][j] if xj == 1 else 1 - p1[y][j]   # Bernoulli per feature
        s += np.log(pj)
    return s

scores = {y: logscore(y) for y in (0, 1)}
print("log-scores:", {y: round(s, 4) for y, s in scores.items()})
yhat = max(scores, key=scores.get)
# turn log-scores into a calibrated posterior via the log-sum-exp normalizer
m  = max(scores.values())
Z  = sum(np.exp(s - m) for s in scores.values())
post = {y: float(np.exp(s - m) / Z) for y, s in scores.items()}
print("posterior :", {y: round(p, 4) for y, p in post.items()})
print("prediction:", yhat, "  (evidence p(x) cancelled, never computed)")

edits are live — break it on purpose

9.3

Gaussian Naive Bayes

For continuous features — heights, pixel intensities, sensor readings — each per-feature likelihood $p(x_j \mid y)$ is modeled as a one-dimensional Gaussian. The training "fit" is nothing but estimating, from each class's data, a mean and a variance for every feature:

EQ M9.4 — GAUSSIAN PER-FEATURE LIKELIHOOD $$ p(x_j \mid y) \;=\; \frac{1}{\sqrt{2\pi\,\sigma_{y,j}^2}}\,\exp\!\left(-\frac{(x_j - \mu_{y,j})^2}{2\,\sigma_{y,j}^2}\right) $$

$\mu_{y,j}$ and $\sigma_{y,j}^2$ are simply the sample mean and variance of feature $j$ among the training points of class $y$. Because features are assumed independent given the class, the joint per-class density is an axis-aligned Gaussian — its contours are ellipses whose axes line up with the coordinate axes, never tilted. That single restriction is the visible footprint of the naive assumption in 2-D.

Where do the decision boundaries come from? Take the log-ratio of the two class posteriors. With shared variances the quadratic terms cancel and you get a straight line — this is Linear Discriminant Analysis. With per-class variances the quadratic terms survive and the boundary becomes a conic (a parabola, ellipse, or hyperbola) — Quadratic Discriminant Analysis. Gaussian NB is exactly QDA with the off-diagonal covariances forced to zero.

INSTRUMENT M9.1 — GAUSSIAN-NB BOUNDARY EXPLORERDRAG CLASS MEANS · TUNE VARIANCE · EQ M9.4

CLASS A SPREAD σ 1.00

CLASS B SPREAD σ 1.00

PRIOR p(A) 0.50

BOUNDARY SHAPE

—

A-MEAN (drag the dot)

—

B-MEAN (drag the dot)

—

Drag either coloured dot to move a class mean. Equal spreads give a straight boundary (LDA); make the spreads unequal and it bows into a conic (QDA). Skew the prior and the whole boundary slides toward the rarer class — Bayes-optimally demanding more evidence to call something uncommon. The shaded region is wherever class A wins the argmax of EQ M9.3.

PYTHON · RUNNABLE IN-BROWSER

# Gaussian Naive Bayes from scratch: fit = means + variances, predict = argmax.
import numpy as np
rng = np.random.default_rng(0)

# two 2-D classes, 200 points each, drawn from axis-aligned Gaussians
mu = {0: [0.0, 0.0], 1: [3.0, 2.5]}
Xy = [(rng.normal(mu[y], [1.0, 1.3], (200, 2)), np.full(200, y)) for y in (0, 1)]
X  = np.vstack([a for a, _ in Xy]); y = np.concatenate([b for _, b in Xy])

# --- fit: per class, per feature mean and variance (eps guards zero variance)
classes = np.unique(y); eps = 1e-9
means = np.array([X[y == c].mean(0) for c in classes])
vars  = np.array([X[y == c].var(0)  for c in classes]) + eps
logpr = np.log(np.array([(y == c).mean() for c in classes]))

# --- predict: log-prior + sum_j log N(x_j; mu, var)  (EQ M9.3 + M9.4)
def log_gauss(Xb, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (Xb[:, None, :] - m) ** 2 / v).sum(2)
scores = log_gauss(X, means, vars) + logpr          # (N, n_classes)
pred   = classes[scores.argmax(1)]
acc    = (pred == y).mean()
print(f"fitted means:\n{means.round(2)}")
print(f"fitted variances:\n{vars.round(2)}")
print(f"train accuracy: {acc:.3f}")
plot_scatter(X[:, 0], X[:, 1], list(pred.astype(int)))   # colour = predicted class

edits are live — break it on purpose

One practical wrinkle. If a feature has near-zero variance within a class — a constant column, common in real data — the Gaussian collapses to a spike and the log-likelihood explodes. Production implementations (scikit-learn's GaussianNB) add a tiny smoothing term to every variance, a fixed fraction of the largest feature variance, to keep the densities well-behaved. The same impulse — never let a probability go to zero or infinity — drives the smoothing we meet next for discrete features.

9.4

Multinomial & Bernoulli NB for text

Naive Bayes' first and most enduring job was sorting documents — Maron's 1961 "automatic indexing" experiments are arguably the technique's debut, and spam filters made it famous. Text needs a different per-feature model than the Gaussian, and there are two canonical choices that differ in what they count.

Multinomial NB treats a document as a bag of word counts. Each class $y$ has a vocabulary-sized probability vector $\theta_{y}$ — "how likely is each word in a document of this class" — and a document's likelihood is the multinomial probability of its counts. The estimate for one word is the share of that class's total word-tokens it accounts for, with Laplace (add-$\alpha$) smoothing so an unseen word never zeroes out the whole product:

EQ M9.5 — MULTINOMIAL NB WITH LAPLACE SMOOTHING $$ \hat{\theta}_{y,w} \;=\; p(w \mid y) \;=\; \frac{N_{y,w} + \alpha}{N_{y} + \alpha\,V}, \qquad N_{y} = \sum_{w'} N_{y,w'} $$

$N_{y,w}$ is how many times word $w$ appears across all class-$y$ documents; $N_{y}$ is that class's total token count; $V$ is the vocabulary size; $\alpha$ is the smoothing strength ($\alpha = 1$ is "Laplace", $\alpha = 0.5$ is "Lidstone/Jeffreys"). The $+\alpha$ in the numerator and $+\alpha V$ in the denominator together form a valid probability distribution — they sum to 1 over the vocabulary — and guarantee every word keeps a sliver of probability mass.

Bernoulli NB instead treats each vocabulary word as a binary presence/absence flag and explicitly models the absence of words too — a "free" word that never appears in spam is positive evidence for ham. It tends to win on short texts (tweets, subject lines) where a word appearing once vs. thrice carries little extra signal; multinomial wins on longer documents where counts matter. Both are linear classifiers in log space; both are one counting pass to train.

A word appears $N_{y,w} = 0$ times in class $y$. The class has $N_y = 6$ total tokens, the vocabulary size is $V = 4$, and you use Laplace smoothing with $\alpha = 1$. What smoothed probability $p(w \mid y)$ does EQ M9.5 assign this unseen word? (Give a decimal.)

Apply EQ M9.5 directly: $p(w \mid y) = \dfrac{N_{y,w} + \alpha}{N_y + \alpha V} = \dfrac{0 + 1}{6 + 1\times 4} = \dfrac{1}{10} = $ 0.1. Without smoothing the answer would be $0/6 = 0$, which would force the entire document's likelihood to zero — the catastrophe smoothing exists to prevent.

PYTHON · RUNNABLE IN-BROWSER

# Multinomial Naive Bayes on a toy bag-of-words, with Laplace smoothing (EQ M9.5).
import numpy as np
vocab = ["win", "free", "money", "meeting", "report", "lunch"]
V = len(vocab)
# rows = documents (word counts), labels: 1 = spam, 0 = ham
X = np.array([[2,1,1,0,0,0],   # spam
              [1,1,0,0,0,0],   # spam
              [0,0,0,1,1,0],   # ham
              [0,0,0,1,0,1]])  # ham
y = np.array([1, 1, 0, 0])
alpha = 1.0

# fit: per-class token totals -> smoothed word probabilities (EQ M9.5)
theta, logprior = {}, {}
for c in (0, 1):
    counts = X[y == c].sum(0)                         # N_{y,w}
    theta[c] = (counts + alpha) / (counts.sum() + alpha * V)
    logprior[c] = np.log((y == c).mean())

# predict a new document: "free money meeting"
doc = np.array([0,1,1,1,0,0])
score = {c: logprior[c] + (doc * np.log(theta[c])).sum() for c in (0, 1)}
m = max(score.values()); Z = sum(np.exp(s - m) for s in score.values())
post = {c: float(np.exp(s - m) / Z) for c, s in score.items()}
print("p(win|spam) =", round(float(theta[1][0]), 3), " p(win|ham) =", round(float(theta[0][0]), 3))
print("posterior   :", {c: round(p, 3) for c, p in post.items()})
print("prediction  :", "SPAM" if score[1] > score[0] else "HAM")

edits are live — break it on purpose

INSTRUMENT M9.2 — LIVE SPAM FILTERMULTINOMIAL NB · TYPE WORDS · POSTERIOR UPDATES PER TOKEN

MESSAGE (type freely — known words shown below)

LAPLACE α 1.0

P(SPAM | MESSAGE)

LOG-SCORE SPAM

—

LOG-SCORE HAM

—

VERDICT

—

A tiny corpus is baked in (spammy words: free, money, win, click, offer, now, prize; hammy words: meeting, report, lunch, project, team, schedule). Each token you add casts a log-vote (EQ M9.3); the bar is the softmaxed posterior. Words the model has never seen contribute only the smoothed floor — turn α up and watch unknown words pull every verdict toward 50/50.

9.5

Smoothing, failure modes & why it works anyway

We have already met the most important fix. The zero-frequency problem is fatal without it: one word the training set never saw in a class drives $p(w \mid y) = 0$, and a single zero in the product of EQ M9.3 sends the whole log-likelihood to $-\infty$, vetoing that class no matter how strong the other evidence. Laplace/Lidstone smoothing (EQ M9.5) is the cure — read as a Bayesian posterior, $\alpha$ is the strength of a Dirichlet prior that pre-loads $\alpha$ imaginary observations of every word. It is regularization wearing a probabilistic costume.

The deeper failure mode is the assumption itself. When features are correlated — and they always are — naive Bayes double-counts evidence. The bigram "New York" contributes "new" and "york" as if they were two independent witnesses, so the model becomes overconfident: its posteriors pile up near 0 and 1, badly miscalibrated. Here is the surprise that has fascinated the field for decades:

THE PARADOX

Naive Bayes' probabilities are often garbage, yet its classifications are excellent. Domingos & Pazzani (1997) showed why: the argmax only needs the correct class to score highest, not to have an accurate probability. The independence assumption can be massively violated — pushing the estimated posterior to a wildly wrong value like 0.9999 — and the decision still lands on the right side of the boundary. The model is reliably wrong about how sure it is, and reliably right about which class. If you need calibrated probabilities, post-hoc calibration (Platt scaling, isotonic regression) is mandatory; if you only need labels, ship it.

Three more practical truths round out the picture: (1) NB is robust to irrelevant features, which simply contribute roughly equal log-votes to every class and cancel; (2) it is sensitive to strongly correlated features, so de-duplicating obvious redundancy (or using TF-IDF weighting for text) helps; (3) its training cost is a single pass and its model is a handful of count tables, making it the natural choice for streaming data, on-device inference, and any setting where a heavier model's accuracy gain does not justify its cost. It is the baseline every other classifier in this volume must beat — and on small, high-dimensional, sparse data, it frequently isn't beaten.

INSTRUMENT M9.3 — INDEPENDENCE-ASSUMPTION TOYCORRELATE TWO FEATURES · WATCH NB DEGRADE

FEATURE CORRELATION ρ 0.00

TRUE-MODEL ACCURACY

—

NAIVE-BAYES ACCURACY

—

MEAN |POSTERIOR ERROR|

—

Two classes drawn from correlated Gaussians; NB still fits them as axis-aligned (ρ forced to 0). At ρ = 0 the assumption holds and NB matches the optimal classifier. Crank ρ up: the calibration error climbs steadily — NB grows overconfident — while its accuracy barely moves. That gap is the paradox of §9.5 made visible: wrong probabilities, right decisions.

Naive Bayes draws boundaries by modeling each class; the next chapter draws the single best boundary directly. Chapter 10 — Support Vector Machines & Kernels: maximum-margin separation, the kernel trick that buys nonlinear boundaries for free, and what changes when you stop modeling the data and start carving it.

9.R

References

Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning 29 — explains why a model with violated independence assumptions still classifies well (the paradox of §9.5).
Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8(3) — arguably the first application of naive-Bayes-style probabilistic classification to text.
McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI-98 Workshop — the canonical multinomial vs. Bernoulli comparison (EQ M9.5 and §9.4).
Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. NeurIPS 14 — the generative/discriminative trade-off and small-data advantage of §9.1.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer — §6.6.3 naive Bayes, §4.3 LDA/QDA (the Gaussian-NB connection of §9.3). Free PDF.
Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification. Stanford — a modern, worked treatment of multinomial NB for text with smoothing.