Generative vs discriminative classifiers
Every classifier ultimately wants \(p(y \mid x)\): the probability of class \(y\) given features \(x\). There are two roads to it, and they split machine learning down the middle.
A discriminative model attacks \(p(y \mid x)\) head-on. Logistic regression (Chapter 03), neural nets (Chapter 07), SVMs (next chapter) — all of them learn the boundary between classes directly and never bother modeling what the inputs themselves look like. A generative model takes the long way around: it learns \(p(x \mid y)\) — a full story of how each class generates its data — plus the class prevalences \(p(y)\), and only then flips them with Bayes' rule to recover \(p(y \mid x)\). You could literally sample from a trained generative classifier to hallucinate new examples of "spam" or "not spam".
| Aspect | Generative — models \(p(x,y)\) | Discriminative — models \(p(y \mid x)\) |
|---|---|---|
| Learns | how each class produces data | where the boundary sits |
| Examples | Naive Bayes, LDA/QDA, GMMs, HMMs | Logistic regression, SVM, neural nets |
| Data hunger | low — strong assumptions fill the gaps | higher — wants enough to trace the boundary |
| Asymptotic accuracy | often lower (model is wrong) | often higher (fewer assumptions) |
| Can generate samples? | yes | no |
The trade-off is captured in a classic result of Ng & Jordan (2002): a generative classifier and its discriminative twin (e.g. naive Bayes vs logistic regression) approach different error rates as data grows, but the generative one approaches its (sometimes higher) ceiling much faster — and wins outright in the small-data regime. With ten examples, the model that assumes more often beats the model that assumes less. With ten million, the assumptions become a liability. Naive Bayes lives at the assumption-heavy end of this spectrum, which is exactly why it remains a strong baseline whenever labeled data is scarce or latency is brutal.
A discriminative model is a border guard who has learned to spot a forged passport without ever picturing a real citizen. A generative model is a forger who has studied what genuine documents look like and judges a new one by how easily they could have produced it. Both can flag fakes — they just know the world differently.
Bayes' rule for classification & the naive assumption
The engine is Bayes' rule (Stats · EQ 1.x), read as a classifier. To score class \(y\) for an input \(x = (x_1, \ldots, x_d)\):
The problem hides in \(p(x \mid y)\). For \(d\) binary features there are \(2^d - 1\) free numbers per class — a joint distribution over every combination of feature values. With \(d = 50\) that is more parameters than atoms you could ever count. No dataset estimates it. So naive Bayes makes its defining leap of faith: given the class, the features are mutually independent.
Plug EQ M9.2 into EQ M9.1 and drop the constant evidence. The decision rule becomes a single argmax, and — because likelihoods are tiny products that underflow to zero in floating point — you always compute it in log space, where the product becomes a sum:
# EQ M9.3 by hand: log-prior + sum of feature log-likelihoods, then argmax.
# Two classes, two binary features. p(x_j=1 | y) given directly.
import numpy as np
prior = {0: 0.6, 1: 0.4} # p(y): class 0 is more common
p1 = {0: [0.2, 0.7], 1: [0.8, 0.3]} # p(x_j = 1 | y) for j = 0, 1
x = [1, 0] # observe feature0 = 1, feature1 = 0
def logscore(y):
s = np.log(prior[y])
for j, xj in enumerate(x):
pj = p1[y][j] if xj == 1 else 1 - p1[y][j] # Bernoulli per feature
s += np.log(pj)
return s
scores = {y: logscore(y) for y in (0, 1)}
print("log-scores:", {y: round(s, 4) for y, s in scores.items()})
yhat = max(scores, key=scores.get)
# turn log-scores into a calibrated posterior via the log-sum-exp normalizer
m = max(scores.values())
Z = sum(np.exp(s - m) for s in scores.values())
post = {y: float(np.exp(s - m) / Z) for y, s in scores.items()}
print("posterior :", {y: round(p, 4) for y, p in post.items()})
print("prediction:", yhat, " (evidence p(x) cancelled, never computed)")
Gaussian Naive Bayes
For continuous features — heights, pixel intensities, sensor readings — each per-feature likelihood \(p(x_j \mid y)\) is modeled as a one-dimensional Gaussian. The training "fit" is nothing but estimating, from each class's data, a mean and a variance for every feature:
Where do the decision boundaries come from? Take the log-ratio of the two class posteriors. With shared variances the quadratic terms cancel and you get a straight line — this is Linear Discriminant Analysis. With per-class variances the quadratic terms survive and the boundary becomes a conic (a parabola, ellipse, or hyperbola) — Quadratic Discriminant Analysis. Gaussian NB is exactly QDA with the off-diagonal covariances forced to zero.
# Gaussian Naive Bayes from scratch: fit = means + variances, predict = argmax.
import numpy as np
rng = np.random.default_rng(0)
# two 2-D classes, 200 points each, drawn from axis-aligned Gaussians
mu = {0: [0.0, 0.0], 1: [3.0, 2.5]}
Xy = [(rng.normal(mu[y], [1.0, 1.3], (200, 2)), np.full(200, y)) for y in (0, 1)]
X = np.vstack([a for a, _ in Xy]); y = np.concatenate([b for _, b in Xy])
# --- fit: per class, per feature mean and variance (eps guards zero variance)
classes = np.unique(y); eps = 1e-9
means = np.array([X[y == c].mean(0) for c in classes])
vars = np.array([X[y == c].var(0) for c in classes]) + eps
logpr = np.log(np.array([(y == c).mean() for c in classes]))
# --- predict: log-prior + sum_j log N(x_j; mu, var) (EQ M9.3 + M9.4)
def log_gauss(Xb, m, v):
return -0.5 * (np.log(2 * np.pi * v) + (Xb[:, None, :] - m) ** 2 / v).sum(2)
scores = log_gauss(X, means, vars) + logpr # (N, n_classes)
pred = classes[scores.argmax(1)]
acc = (pred == y).mean()
print(f"fitted means:\n{means.round(2)}")
print(f"fitted variances:\n{vars.round(2)}")
print(f"train accuracy: {acc:.3f}")
plot_scatter(X[:, 0], X[:, 1], list(pred.astype(int))) # colour = predicted class
One practical wrinkle. If a feature has near-zero variance within a class — a constant column, common in real data — the Gaussian collapses to a spike and the log-likelihood explodes. Production implementations (scikit-learn's GaussianNB) add a tiny smoothing term to every variance, a fixed fraction of the largest feature variance, to keep the densities well-behaved. The same impulse — never let a probability go to zero or infinity — drives the smoothing we meet next for discrete features.
Multinomial & Bernoulli NB for text
Naive Bayes' first and most enduring job was sorting documents — Maron's 1961 "automatic indexing" experiments are arguably the technique's debut, and spam filters made it famous. Text needs a different per-feature model than the Gaussian, and there are two canonical choices that differ in what they count.
Multinomial NB treats a document as a bag of word counts. Each class \(y\) has a vocabulary-sized probability vector \(\theta_{y}\) — "how likely is each word in a document of this class" — and a document's likelihood is the multinomial probability of its counts. The estimate for one word is the share of that class's total word-tokens it accounts for, with Laplace (add-\(\alpha\)) smoothing so an unseen word never zeroes out the whole product:
Bernoulli NB instead treats each vocabulary word as a binary presence/absence flag and explicitly models the absence of words too — a "free" word that never appears in spam is positive evidence for ham. It tends to win on short texts (tweets, subject lines) where a word appearing once vs. thrice carries little extra signal; multinomial wins on longer documents where counts matter. Both are linear classifiers in log space; both are one counting pass to train.
# Multinomial Naive Bayes on a toy bag-of-words, with Laplace smoothing (EQ M9.5).
import numpy as np
vocab = ["win", "free", "money", "meeting", "report", "lunch"]
V = len(vocab)
# rows = documents (word counts), labels: 1 = spam, 0 = ham
X = np.array([[2,1,1,0,0,0], # spam
[1,1,0,0,0,0], # spam
[0,0,0,1,1,0], # ham
[0,0,0,1,0,1]]) # ham
y = np.array([1, 1, 0, 0])
alpha = 1.0
# fit: per-class token totals -> smoothed word probabilities (EQ M9.5)
theta, logprior = {}, {}
for c in (0, 1):
counts = X[y == c].sum(0) # N_{y,w}
theta[c] = (counts + alpha) / (counts.sum() + alpha * V)
logprior[c] = np.log((y == c).mean())
# predict a new document: "free money meeting"
doc = np.array([0,1,1,1,0,0])
score = {c: logprior[c] + (doc * np.log(theta[c])).sum() for c in (0, 1)}
m = max(score.values()); Z = sum(np.exp(s - m) for s in score.values())
post = {c: float(np.exp(s - m) / Z) for c, s in score.items()}
print("p(win|spam) =", round(float(theta[1][0]), 3), " p(win|ham) =", round(float(theta[0][0]), 3))
print("posterior :", {c: round(p, 3) for c, p in post.items()})
print("prediction :", "SPAM" if score[1] > score[0] else "HAM")
Smoothing, failure modes & why it works anyway
We have already met the most important fix. The zero-frequency problem is fatal without it: one word the training set never saw in a class drives \(p(w \mid y) = 0\), and a single zero in the product of EQ M9.3 sends the whole log-likelihood to \(-\infty\), vetoing that class no matter how strong the other evidence. Laplace/Lidstone smoothing (EQ M9.5) is the cure — read as a Bayesian posterior, \(\alpha\) is the strength of a Dirichlet prior that pre-loads \(\alpha\) imaginary observations of every word. It is regularization wearing a probabilistic costume.
The deeper failure mode is the assumption itself. When features are correlated — and they always are — naive Bayes double-counts evidence. The bigram "New York" contributes "new" and "york" as if they were two independent witnesses, so the model becomes overconfident: its posteriors pile up near 0 and 1, badly miscalibrated. Here is the surprise that has fascinated the field for decades:
Naive Bayes' probabilities are often garbage, yet its classifications are excellent. Domingos & Pazzani (1997) showed why: the argmax only needs the correct class to score highest, not to have an accurate probability. The independence assumption can be massively violated — pushing the estimated posterior to a wildly wrong value like 0.9999 — and the decision still lands on the right side of the boundary. The model is reliably wrong about how sure it is, and reliably right about which class. If you need calibrated probabilities, post-hoc calibration (Platt scaling, isotonic regression) is mandatory; if you only need labels, ship it.
Three more practical truths round out the picture: (1) NB is robust to irrelevant features, which simply contribute roughly equal log-votes to every class and cancel; (2) it is sensitive to strongly correlated features, so de-duplicating obvious redundancy (or using TF-IDF weighting for text) helps; (3) its training cost is a single pass and its model is a handful of count tables, making it the natural choice for streaming data, on-device inference, and any setting where a heavier model's accuracy gain does not justify its cost. It is the baseline every other classifier in this volume must beat — and on small, high-dimensional, sparse data, it frequently isn't beaten.
Naive Bayes draws boundaries by modeling each class; the next chapter draws the single best boundary directly. Chapter 10 — Support Vector Machines & Kernels: maximum-margin separation, the kernel trick that buys nonlinear boundaries for free, and what changes when you stop modeling the data and start carving it.
References
- Domingos, P. & Pazzani, M. (1997). On the Optimality of the Simple Bayesian Classifier under Zero-One Loss.
- Maron, M. E. (1961). Automatic Indexing: An Experimental Inquiry.
- McCallum, A. & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification.
- Ng, A. Y. & Jordan, M. I. (2002). On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.).
- Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft), Ch. 4: Naive Bayes and Sentiment Classification.