Categorical encoding: one-hot, ordinal, frequency
Almost every real dataset arrives with columns that are not numbers: a country, a product category, a browser, a job title. A model cannot multiply a weight by the string "Berlin". Encoding is the map from categories to numbers, and the wrong map silently injects assumptions the data never made.
The first thing to settle is whether a categorical variable is nominal (unordered — colours, cities, payment methods) or ordinal (genuinely ranked — small < medium < large, bronze < silver < gold). That single distinction decides almost everything that follows.
One-hot encoding
The default for nominal variables. A column with \(K\) distinct levels becomes \(K\) binary indicator columns, exactly one of which is hot (1) per row:
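Written out, with \(c_1, \dots, c_K\) as assumed labels for the distinct levels (the names are illustrative, not taken from the text), the indicator map is commonly expressed as:

\[
\operatorname{onehot}(x) = \bigl(\mathbb{1}[x = c_1],\ \mathbb{1}[x = c_2],\ \dots,\ \mathbb{1}[x = c_K]\bigr) \in \{0, 1\}^{K},
\qquad \sum_{k=1}^{K} \mathbb{1}[x = c_k] = 1 .
\]

The sum constraint is the "exactly one hot per row" property stated above.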
Ordinal encoding
When the categories really are ordered, map them to ascending integers — small → 0, medium → 1, large → 2. This keeps the column to a single feature and tells the model that large is "more" than small. Applied to a nominal variable, though, ordinal encoding is a trap: labelling {red, green, blue} as {0, 1, 2} tells a linear model that blue is twice green and green sits exactly between red and blue — pure fiction the model will dutifully exploit. Ordinal encoding is correct only when the order is real.
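A minimal sketch of the point above, assuming a hypothetical size column: the ranking is declared explicitly rather than inherited from alphabetical sort order, which would scramble it.

```python
import numpy as np

# Hypothetical column and ranking, chosen only for illustration.
sizes = np.array(["medium", "small", "large", "small", "large"])
order = ["small", "medium", "large"]      # the real-world ranking, stated explicitly

# Build the category -> integer map from the declared order, not from sort order.
rank = {level: i for i, level in enumerate(order)}
ordinal = np.array([rank[s] for s in sizes.tolist()])

# For contrast: np.unique sorts alphabetically, so its codes ignore the ranking.
alpha_levels, alpha_codes = np.unique(sizes, return_inverse=True)

print("explicit order :", ordinal.tolist())
print("alphabetical   :", alpha_codes.tolist(), "levels:", alpha_levels.tolist())
```

With the alphabetical codes, large receives the smallest integer and small the largest, which is exactly the kind of invented order the paragraph warns about.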
Frequency / count encoding
A cheap, single-column escape from one-hot's width problem: replace each category by how often it appears (its count or its relative frequency). It collapses \(K\) levels into one numeric feature, which suits high-cardinality columns and tree models well. The implicit claim is that rarity carries signal: often true (rare merchant codes correlate with fraud), sometimes meaningless. It also unavoidably maps two different categories that happen to be equally frequent onto the same value.
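The sketch below illustrates both properties on a hypothetical toy column: frequencies are learned from training rows only, two equally common levels collide, and a level never seen in training needs a fallback (0.0 here, which is an assumption rather than a rule).

```python
import numpy as np

train = np.array(["a", "a", "b", "b", "c", "a", "b", "d"])   # hypothetical training column
new = np.array(["a", "d", "z"])                              # "z" never appears in training

# Learn relative frequencies from the training rows only.
levels, counts = np.unique(train, return_counts=True)
freq_map = dict(zip(levels.tolist(), (counts / len(train)).tolist()))

# Unseen levels fall back to 0.0 (an assumed convention for this sketch).
encoded_new = [freq_map.get(v, 0.0) for v in new.tolist()]

print("frequency map:", freq_map)
print("new rows     :", encoded_new)
print("collision    : a and b share", freq_map["a"])
```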
| Encoding | New columns | Best for | Footgun |
|---|---|---|---|
| One-hot | K (or K−1) | Nominal, low cardinality, linear/SVM/k-NN | Cardinality blow-up; sparse, wide matrices |
| Ordinal | 1 | Genuinely ranked categories | Invents an order on nominal data |
| Frequency | 1 | High cardinality, tree models | Distinct-but-equally-common levels collide |
| Target (§3.2) | 1 | High cardinality + a target | Leakage if fit on the same rows it encodes |
The cardinality wall. One-hot is the textbook default precisely because it is honest about nominal structure, but its width grows linearly with the number of levels. At a few dozen categories it is fine; at thousands (user IDs, product SKUs, ZIP codes) the matrix becomes enormous and sparse, distances degrade, and you reach for frequency or target encoding instead. That trade-off, fidelity vs. width, is the whole game.
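To put rough numbers on the wall, here is a back-of-the-envelope sketch. The row count and the one-byte-per-cell dense layout are assumptions for illustration; sparse formats avoid most of this cost, but the width itself remains.

```python
import numpy as np

n_rows = 1_000_000                          # hypothetical dataset size
cell_bytes = np.dtype(np.uint8).itemsize    # smallest dense integer cell: 1 byte

dense_gb = {}
for K in (10, 1_000, 100_000):
    dense_gb[K] = n_rows * K * cell_bytes / 1e9
    nonzero_share = 1.0 / K                 # exactly one 1 per row
    print(f"K={K:>7,}  dense size ~ {dense_gb[K]:8.2f} GB  nonzero share = {nonzero_share:.4%}")
```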
# One-hot, ordinal & frequency encoding of a small categorical column (numpy only)
import numpy as np
col = np.array(["red","green","blue","red","blue","red","green","red"])
cats, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
K = len(cats)
print("categories :", list(cats), " (K =", K, ")")
onehot = np.eye(K, dtype=int)[inv] # EQ D3.1: K indicator columns
print("\none-hot matrix (rows = samples, cols =", list(cats), "):")
print(onehot)
print("one-hot adds", K, "columns; every row sums to", set(onehot.sum(1).tolist()))
ordinal = inv # integer code per category (order = alpha here)
freq = counts[inv] / len(col) # frequency encoding: share of each category
print("\nordinal codes :", ordinal.tolist())
print("frequency codes:", np.round(freq, 3).tolist(), " (1 column, not", K, ")")
print("\nNote: ordinal would falsely tell a linear model blue(0) < green(1) < red(2).")
Target & WOE encoding — and how to keep them leakage-safe
When a categorical column has hundreds or thousands of levels, one-hot is unwieldy and frequency throws away the relationship with the label. Target encoding (also "mean encoding", introduced by Micci-Barreca in 2001) replaces each category with the average value of the target for that category — one informative numeric column, regardless of cardinality.
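In symbols, with \(n_c\) the number of training rows in category \(c\), \(\bar{y}_c\) their mean target, \(\bar{y}\) the global mean, and \(m \ge 0\) a smoothing strength (notation assumed here, chosen to match the `oof_encode` code further down), a commonly used smoothed form is:

\[
\mathrm{TE}(c) = \frac{n_c\,\bar{y}_c + m\,\bar{y}}{n_c + m}
\]

With \(m = 0\) this is the plain category mean; larger \(m\) shrinks rare categories toward \(\bar{y}\), so a level seen only a handful of times cannot claim an extreme value on thin evidence.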
Weight of evidence (WOE)
For binary classification, the closely-related weight of evidence encoding — a staple of credit scoring — replaces each category with the log-odds it contributes:
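One common convention, with \(n_c^{+}\) and \(n_c^{-}\) the counts of positive and negative rows in category \(c\) and \(N^{+}\), \(N^{-}\) the class totals (symbols assumed here; credit-scoring texts often put "good" over "bad", which flips the sign):

\[
\mathrm{WOE}(c) = \ln \frac{n_c^{+} / N^{+}}{n_c^{-} / N^{-}}
\]

A category with no rows in one class makes this infinite, so implementations typically add a small pseudo-count to each count before taking the ratio.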
Target encoding looks at the label — so if you fit the encoding on the same rows you then train on, every row gets to peek at its own answer. The model sees a feature that is partly a copy of \(y\), validation scores soar, and production collapses. This is the single most common way a leaderboard-topping pipeline dies on real data. The fix is never to encode a row using its own target.
The disciplined remedy is out-of-fold (cross-fitted) encoding: split the training data into \(k\) folds; to encode the rows in fold \(j\), compute the category means using only the other \(k-1\) folds. No row ever contributes to its own encoded value, so the feature carries the category's signal without memorizing the answer. The test set is then encoded from statistics computed on the full training set. Smoothing (EQ D3.2) and out-of-fold computation are complementary, not alternatives — serious pipelines use both.
# Target encoding: naive (LEAKS) vs out-of-fold (safe). Watch the leak signal.
import numpy as np
rng = np.random.default_rng(0)
# 600 rows, a high-cardinality column with 200 levels, target unrelated to it
n, K = 600, 200
cat = rng.integers(0, K, n)
y = rng.integers(0, 2, n).astype(float) # pure coin flips: TRUE signal = 0
def naive_encode(cat, y): # fit on the SAME rows -> leak
    enc = np.zeros(len(cat))
    for c in np.unique(cat):
        enc[cat == c] = y[cat == c].mean()
    return enc
def oof_encode(cat, y, k=5, m=20.0): # out-of-fold + smoothing (safe)
    enc, gm = np.zeros(len(cat)), y.mean()
    fold = np.arange(len(cat)) % k
    for j in range(k):
        tr, te = fold != j, fold == j
        for c in np.unique(cat[te]):
            mask = tr & (cat == c); nc = mask.sum()
            enc[te & (cat == c)] = (nc*y[mask].mean() + m*gm)/(nc+m) if nc else gm
    return enc
def corr(a, b): a,b=a-a.mean(),b-b.mean(); return float((a*b).sum()/np.sqrt((a*a).sum()*(b*b).sum()))
print("corr(naive encoding, y) :", round(corr(naive_encode(cat,y), y), 3), " <- phantom signal (leak!)")
print("corr(out-of-fold, y) :", round(corr(oof_encode(cat,y), y), 3), " <- ~0, the honest truth")
In the cell above the target is literally a coin flip, with no real relationship to the category, yet naive encoding manufactures a sizeable correlation with \(y\) out of thin air: with 600 rows spread over 200 levels, each category holds about three rows, so its mean is mostly an echo of those rows' own labels. Out-of-fold encoding reports the true near-zero. The seed is fixed, so rerunning reproduces the same numbers; change the argument to default_rng to see that the leak is consistent and the honest version is consistently honest.
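Out-of-fold encoding covers the training rows; test rows are encoded from statistics frozen on the full training set. The sketch below shows one way to do that with the same smoothing rule, plus a global-mean fallback for categories never seen in training. The function names `fit_target_encoder` and `apply_target_encoder` and the toy data are hypothetical.

```python
import numpy as np

def fit_target_encoder(cat, y, m=20.0):
    """Return (per-category smoothed means, global mean) learned from training rows."""
    gm = float(y.mean())
    table = {}
    for c in np.unique(cat):
        mask = cat == c
        nc = int(mask.sum())
        table[c.item()] = (nc * float(y[mask].mean()) + m * gm) / (nc + m)
    return table, gm

def apply_target_encoder(cat, table, gm):
    """Encode new rows from frozen statistics; unseen categories fall back to gm."""
    return np.array([table.get(c, gm) for c in cat.tolist()])

# Hypothetical toy data: three categories in training, one unseen at test time.
train_cat = np.array([0, 0, 0, 1, 1, 2])
train_y = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
test_cat = np.array([0, 2, 7])              # 7 never appeared in training

table, gm = fit_target_encoder(train_cat, train_y, m=2.0)
test_enc = apply_target_encoder(test_cat, table, gm)
print("global mean  :", gm)
print("test encoding:", np.round(test_enc, 3).tolist())
```

No test label is touched anywhere in this path, which is what keeps the test-time encoding leakage-free.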
Scaling: standardize, min-max, robust
Once everything is numeric, the columns still live on wildly different scales: age in years (0–100), income in dollars (0–\(10^6\)), a fraction in [0, 1]. Any algorithm that measures distance or sums weighted features will let the large-magnitude column dominate purely by accident of units. Feature scaling puts every column on comparable footing.
Who cares about scale, and who does not? It is worth memorizing the split, because scaling a tree model is wasted effort and not scaling a k-NN model is a bug.
| Scaling matters | Why | Scaling is irrelevant |
|---|---|---|
| k-NN, k-means | Euclidean distance | Decision trees, random forests, gradient-boosted trees — they split on thresholds within a single feature, so monotone rescaling changes nothing. |
| SVM (RBF), PCA | dot products / variance | |
| Linear/logistic + regularization, neural nets | gradient conditioning; L1/L2 penalize raw coefficients | |
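A small sketch of why distance methods sit in the left column, using three hypothetical rows of (age, income). In raw units the income column decides which row is "nearest" almost alone; after standardization a 40-year age gap and an $8,000 income gap count about equally.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars].
X = np.array([[25.0, 50_000.0],
              [65.0, 50_500.0],     # very different age, similar income
              [26.0, 58_000.0]])    # similar age, different income

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_01, raw_02 = dist(X[0], X[1]), dist(X[0], X[2])
print(f"raw    d(0,1) = {raw_01:8.1f}   d(0,2) = {raw_02:8.1f}")

# Standardize each column (statistics from these three rows, purely for illustration).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_01, z_02 = dist(Z[0], Z[1]), dist(Z[0], Z[2])
print(f"scaled d(0,1) = {z_01:8.2f}   d(0,2) = {z_02:8.2f}")
```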
Standardization (z-score)
Subtract the mean, divide by the standard deviation. Every column ends up centered at 0 with unit variance:
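With \(\mu\) and \(\sigma\) the mean and standard deviation of the column (estimated on the training data), the transform is:

\[
z = \frac{x - \mu}{\sigma}
\]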
Min-max scaling
Linearly squeeze the column into a fixed interval, usually [0, 1]:
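With \(x_{\min}\) and \(x_{\max}\) the smallest and largest training values, the \([0, 1]\) version is:

\[
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]

Because both parameters are extremes, a single outlier stretches the denominator and squeezes every ordinary value into a narrow band.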
Robust scaling
When outliers are a fact of life, scale by quantities that ignore the tails — the median for centering, the interquartile range (IQR) for spread:
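With \(Q_1\) and \(Q_3\) the 25th and 75th percentiles of the column (symbols assumed here), the usual form is:

\[
x' = \frac{x - \operatorname{median}(x)}{Q_3 - Q_1}
\]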
Every scaler has parameters learned from data — \(\mu, \sigma\) for z-score, \(x_{\min}, x_{\max}\) for min-max, median/IQR for robust. Fit those parameters on the training set, then apply the frozen transform to validation and test. Recomputing the mean on the test set leaks test information into preprocessing and quietly inflates your scores — the scaling-stage twin of the target-encoding leak in §3.2.
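The fit-then-freeze discipline can be sketched with three tiny scalers that share one `transform` function. The helper names and the synthetic data (a normal column with five planted outliers) are assumptions for illustration. Parameters come from the training rows only and are then applied unchanged to the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(50.0, 10.0, 1000)
train[:5] = 10_000.0                       # a few gross outliers (synthetic)
test = rng.normal(50.0, 10.0, 200)

def fit_standard(x):
    return {"center": x.mean(), "scale": x.std()}

def fit_minmax(x):
    return {"center": x.min(), "scale": x.max() - x.min()}

def fit_robust(x):
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {"center": med, "scale": q3 - q1}

def transform(x, p):
    """Apply frozen parameters; nothing is re-estimated from x."""
    return (x - p["center"]) / p["scale"]

spread = {}
for name, fit in [("z-score", fit_standard), ("min-max", fit_minmax), ("robust", fit_robust)]:
    params = fit(train)                    # learned from training rows only
    spread[name] = float(transform(test, params).std())
    print(f"{name:8s} spread of transformed test data (std) = {spread[name]:.4f}")
```

The outliers inflate the training standard deviation and range, so z-score and min-max crush the ordinary test values into a sliver, while the median/IQR version keeps them on a usable scale.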
# Standardize vs min-max, and a Box-Cox normality gain on skewed data
import numpy as np
rng = np.random.default_rng(0)
x = rng.exponential(2.0, 4000) + 0.5 # right-skewed, strictly positive
def stats(name, v):
    print(f"{name:11s} mean {v.mean():7.3f} std {v.std():6.3f} "
          f"min {v.min():7.3f} max {v.max():8.3f}")
stats("raw", x)
z = (x - x.mean()) / x.std() # EQ D3.4: mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min()) # EQ D3.5: [0, 1]
stats("z-score", z); stats("min-max", mm)
def skew(v): v=(v-v.mean())/v.std(); return float((v**3).mean()) # 0 = symmetric
print("\nscaling does NOT change shape -> skew(raw)=%.2f skew(z)=%.2f"
% (skew(x), skew(z)))
# Box-Cox (lambda chosen by a small grid) pulls the skew toward 0:
best = min(np.linspace(-1, 1, 41),
key=lambda L: abs(skew(np.log(x) if abs(L)<1e-9 else (x**L-1)/L)))
bc = np.log(x) if abs(best)<1e-9 else (x**best - 1)/best
print(f"Box-Cox lambda*={best:+.2f} -> skew={skew(bc):+.2f} (much closer to normal)")
Distribution transforms: log, Box-Cox, Yeo-Johnson
Scaling moves and stretches a column but never changes its shape. Yet many real features are badly skewed — incomes, prices, durations, counts — and many estimators (linear regression, anything assuming Gaussian-ish residuals, distance methods) work best on roughly symmetric inputs. Distribution transforms are nonlinear maps that pull a long right tail back toward symmetry.
The log transform
The workhorse. For strictly positive, right-skewed data, \(x \mapsto \ln x\) compresses large values far more than small ones, taming multiplicative spread into additive spread. It is the right move when a variable is naturally relative — a doubling of income matters the same whether from $10K or $1M. Use \(\ln(1+x)\) (log1p) when the column contains exact zeros.
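A short sketch of both claims on synthetic data (the lognormal column and the planted zeros are assumptions): a doubling produces the same gap on the log scale at any starting level, and `np.log1p` stays finite where the column contains exact zeros.

```python
import numpy as np

# A doubling is the same additive step, ln 2, wherever it starts.
gaps = [float(np.log(2 * base) - np.log(base)) for base in (10_000.0, 1_000_000.0)]
print("log gap for a doubling:", [round(g, 6) for g in gaps])

rng = np.random.default_rng(0)
x = rng.lognormal(3.0, 1.0, 10_000)     # synthetic right-skewed column
x[:100] = 0.0                           # plant some exact zeros

def skew(v):
    v = (v - v.mean()) / v.std()
    return float((v ** 3).mean())       # 0 == symmetric

logged = np.log1p(x)                    # ln(1 + x): finite at x = 0
print(f"skew raw   = {skew(x):+.2f}")
print(f"skew log1p = {skew(logged):+.2f}")
print("all finite :", bool(np.isfinite(logged).all()))
```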
Box-Cox
Box and Cox (1964) generalized the log into a one-parameter family and let the data choose the exponent:
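For \(x > 0\), the family, in the same form the `boxcox` function in the code below implements, is:

\[
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln x, & \lambda = 0 .
\end{cases}
\]

The \(\lambda = 0\) case is the limit of the first branch, so the log transform is one member of the family; \(\lambda = 1\) leaves the shape unchanged, and \(\lambda = 0.5\) behaves like a square root.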
Yeo-Johnson
Box-Cox's positivity requirement is a real nuisance — temperatures, profits, and standardized features all go negative. Yeo-Johnson (2000) extends the same idea to the whole real line by applying mirrored power transforms on each side of zero:
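In the standard statement of the transform, with \(\psi\) as an assumed name for the map:

\[
\psi(x; \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0, \\[6pt]
\ln(x + 1), & x \ge 0,\ \lambda = 0, \\[6pt]
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & x < 0,\ \lambda \neq 2, \\[6pt]
-\ln(1 - x), & x < 0,\ \lambda = 2 .
\end{cases}
\]

For \(x \ge 0\) this is Box-Cox applied to \(x + 1\); for \(x < 0\) it is Box-Cox with exponent \(2 - \lambda\) applied to \(1 - x\), then negated. At \(\lambda = 1\) both branches reduce to the identity.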
Honest caveats. These transforms optimize for marginal normality, which is neither necessary nor sufficient for a good model — modern gradient-boosted trees are invariant to any monotone transform of a feature, so this whole section is largely moot for them. Transforms also distort interpretability (a coefficient on \(\ln(\text{income})\) is an elasticity, not a dollar effect) and they extrapolate dangerously outside the fitted range. They earn their keep most for linear models, classical statistics, and any pipeline where Gaussian-ish inputs genuinely help.
# Box-Cox: scan lambda, pick the most-Gaussian, quantify the normality gain
import numpy as np
rng = np.random.default_rng(1)
x = rng.lognormal(0.0, 0.9, 5000) + 0.2 # heavy right skew, all positive
def boxcox(x, lam):
    return np.log(x) if abs(lam) < 1e-9 else (x**lam - 1) / lam
def skew(v):
    v = (v - v.mean()) / v.std(); return float((v**3).mean()) # 0 == symmetric
lams = np.linspace(-1, 2, 61)
sk = [abs(skew(boxcox(x, L))) for L in lams]
best = lams[int(np.argmin(sk))]
print(f"raw skew : {skew(x):+.3f}")
print(f"best lambda (min |skew|) : {best:+.2f}")
print(f"Box-Cox skew at lambda* : {skew(boxcox(x, best)):+.3f}")
print(f"\nlog (lambda=0) skew : {skew(boxcox(x, 0.0)):+.3f}")
print(f"sqrt (lambda=.5) skew : {skew(boxcox(x, 0.5)):+.3f}")
print("\n|skew| vs lambda (the U-shaped normality curve):")
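# plot_xy is assumed to be a plotting helper supplied by the page's runtime;
# without it, fall back to printing a coarse table of the same curve.
try:
    plot_xy(lams.tolist(), sk)
except NameError:
    for L, s in zip(lams[::6], sk[::6]):
        print(f"  lambda={L:+.2f}  |skew|={s:.3f}")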
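Complementing the Box-Cox scan above, here is a NumPy sketch of Yeo-Johnson on data that goes negative, where Box-Cox does not apply. The function name `yeo_johnson`, the synthetic data, and the grid of \(\lambda\) values are illustrative assumptions; \(\lambda\) is picked by the same minimum-|skew| rule rather than by maximum likelihood.

```python
import numpy as np

def yeo_johnson(x, lam):
    """Yeo-Johnson power transform for real-valued x (NumPy sketch)."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative side: Box-Cox applied to x + 1.
    if abs(lam) < 1e-9:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Negative side: mirrored Box-Cox with exponent 2 - lam applied to 1 - x.
    if abs(lam - 2.0) < 1e-9:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    return out

def skew(v):
    v = (v - v.mean()) / v.std()
    return float((v ** 3).mean())           # 0 == symmetric

rng = np.random.default_rng(2)
x = rng.lognormal(0.0, 0.9, 5000) - 1.0     # right-skewed, with negative values (synthetic)

lams = np.linspace(-2, 2, 81)
best = min(lams, key=lambda L: abs(skew(yeo_johnson(x, L))))
print(f"min(x) = {x.min():+.3f}  (negative values present)")
print(f"raw skew            : {skew(x):+.3f}")
print(f"Yeo-Johnson lambda* : {best:+.2f}")
print(f"skew at lambda*     : {skew(yeo_johnson(x, best)):+.3f}")
```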
Binning & discretization
The opposite move from a smooth transform: binning chops a continuous variable into a handful of discrete intervals — age → {child, adult, senior}, income → deciles. You deliberately throw away resolution to buy something else: robustness to outliers, the ability to capture a non-monotonic effect with a linear model, interpretable "score bands", or a categorical handoff into the encoders of §3.1–§3.2.
There are two everyday unsupervised strategies, and the difference is whether the bin widths or the bin counts are held constant; the table adds a third, supervised option for comparison:
| Strategy | Edges chosen by | Each bin has… | Good / bad |
|---|---|---|---|
| Equal-width | range / k | equal interval, unequal counts | Simple & interpretable; empty bins on skewed data |
| Equal-frequency | quantiles | equal counts, unequal widths | Robust to skew; edges shift with the data |
| Supervised (e.g. tree / MDL) | target purity | edges where the label changes | Most predictive; can overfit & leak — fit on train |
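The first two rows of the table can be sketched in a few lines of NumPy. The synthetic income column and the helper name `bin_counts` are assumptions; in a real pipeline the edges would be computed on the training set and reused unchanged on new data.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.lognormal(10.5, 0.8, 2000)     # synthetic right-skewed column
k = 5

# Equal-width: k intervals of identical length across the observed range.
width_edges = np.linspace(income.min(), income.max(), k + 1)
# Equal-frequency: edges at the 20th, 40th, 60th and 80th percentiles.
freq_edges = np.quantile(income, np.linspace(0, 1, k + 1))

def bin_counts(x, edges):
    """Count rows per bin; interior edges only, so bin ids run 0..k-1."""
    ids = np.digitize(x, edges[1:-1])
    return np.bincount(ids, minlength=len(edges) - 1)

wc = bin_counts(income, width_edges)
fc = bin_counts(income, freq_edges)
print("equal-width counts    :", wc.tolist())
print("equal-frequency counts:", fc.tolist())
```

On skewed data the equal-width bins pile most rows into the first interval and leave the upper ones nearly empty, while the quantile bins hold the same number of rows each at the cost of very unequal widths.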
Binning is genuinely contested. It can rescue a linear model from a U-shaped relationship and it makes credit-scorecards legible — but it discards information, plants artificial discontinuities at the bin edges, and (when bins are chosen using the target) leaks exactly like target encoding. The modern view: prefer letting a flexible model learn the nonlinearity (splines, gradient-boosted trees) over hand-binning, and reserve discretization for interpretability, regulatory, or robustness reasons rather than raw accuracy.
Four ways encoding & scaling silently break a model: (1) fitting any data-dependent transform — scaler, target encoder, supervised bins — on the full dataset instead of train-only, leaking test/label information; (2) ordinal-encoding a nominal variable and inventing an order; (3) min-max scaling in the presence of outliers, crushing the real data to a sliver; (4) unseen categories at inference time that the encoder has no value for — always reserve an "unknown" bucket and a global-mean fallback.
Encoding and scaling make the columns you have well-behaved; feature engineering creates the columns you wish you had. Chapter 04 — Feature Engineering — covers interactions, polynomial and spline bases, date/time and cyclical features, aggregations and lag features, and the discipline of building them without leaking the future into the past.
References
- Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems.
- Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations.
- Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry.
- Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models.
- Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python.
- Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique.