DATA & FEATURE ENGINEERING · CHAPTER 03 / 05

Encoding, Scaling & Transforms

Models consume numbers, so the encoding of categories and the scaling of features often matter more than the choice of model. A linear model, an SVM, or a k-NN classifier given raw, unscaled, poorly encoded columns will lose to a mediocre model given clean ones. This chapter covers the arithmetic of turning messy columns into the well-behaved numeric matrix every estimator assumes it was handed.

LEVEL: CORE · READING TIME: ≈ 26 MIN · BUILDS ON: DATA 01–02 · INSTRUMENTS: ENCODER · SCALER · BOX-COX
3.1

Categorical encoding: one-hot, ordinal, frequency

Almost every real dataset arrives with columns that are not numbers: a country, a product category, a browser, a job title. A model cannot multiply a weight by the string "Berlin". Encoding is the map from categories to numbers, and the wrong map silently injects assumptions the data never made.

The first thing to settle is whether a categorical variable is nominal (unordered — colours, cities, payment methods) or ordinal (genuinely ranked — small < medium < large, bronze < silver < gold). That single distinction decides almost everything that follows.

One-hot encoding

The default for nominal variables. A column with \(K\) distinct levels becomes \(K\) binary indicator columns, exactly one of which is hot (1) per row:

EQ D3.1 — ONE-HOT ENCODING $$ \text{onehot}(x_i)_j \;=\; \mathbb{1}\!\left[\, x_i = c_j \,\right], \qquad j = 1, \ldots, K, \qquad \sum_{j=1}^{K} \text{onehot}(x_i)_j = 1 $$
\(c_1, \ldots, c_K\) are the \(K\) distinct categories; \(\mathbb{1}[\cdot]\) is the indicator (1 if true, 0 otherwise). Every row becomes a unit vector pointing at its category. Any two distinct categories sit at the same Euclidean distance (\(\sqrt{2}\)) from each other, so no false ordering is implied. The cost is width: a 50-state column becomes 50 columns, a ZIP-code column becomes tens of thousands. For linear models with an intercept you often drop one level (dummy encoding, \(K-1\) columns) to avoid perfect collinearity; tree models and regularized models can keep all \(K\).
You one-hot encode a single categorical column that has 4 distinct categories. How many new indicator columns does the encoding add?
One-hot creates exactly one binary indicator per distinct level, so \(K = 4\) categories produce \(K = \) 4 columns. (If you instead used dummy encoding and dropped one level to avoid collinearity, you would add \(K-1 = 3\) — but plain one-hot adds the full 4.)
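The collinearity argument for dropping one level can be checked numerically. The sketch below is a minimal illustration on a made-up four-level column (the data and variable names are assumptions, not part of the chapter's dataset): with an intercept, the full one-hot block adds a column but no rank, while the dummy-encoded design is full rank.

```python
import numpy as np

# Hypothetical toy column with K = 4 nominal levels.
col = np.array(["a", "b", "c", "d", "a", "c", "b", "d", "a"])
cats, inv = np.unique(col, return_inverse=True)
K = len(cats)

onehot = np.eye(K, dtype=int)[inv]        # full one-hot: K columns
dummy = onehot[:, 1:]                     # dummy encoding: drop the first level -> K-1 columns
intercept = np.ones((len(col), 1), dtype=int)

X_full = np.hstack([intercept, onehot])   # K + 1 columns
X_dummy = np.hstack([intercept, dummy])   # K columns

# The one-hot columns sum to the intercept column, so X_full is rank-deficient.
rank_full = np.linalg.matrix_rank(X_full)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print("one-hot + intercept:", X_full.shape[1], "columns, rank", rank_full)
print("dummy   + intercept:", X_dummy.shape[1], "columns, rank", rank_dummy)
```

Which level gets dropped is arbitrary here; it only changes which category the intercept represents.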

Ordinal encoding

When the categories really are ordered, map them to ascending integers — small → 0, medium → 1, large → 2. This keeps the column to a single feature and tells the model that large is "more" than small. Applied to a nominal variable, though, ordinal encoding is a trap: labelling {red, green, blue} as {0, 1, 2} tells a linear model that blue is twice green and green sits exactly between red and blue — pure fiction the model will dutifully exploit. Ordinal encoding is correct only when the order is real.
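A practical consequence is that the order has to be stated by hand rather than inferred from the strings. The following sketch, on an invented size column, contrasts an explicit ranking with the alphabetical codes that `np.unique` would produce; the `order` list is an assumption supplied for the example.

```python
import numpy as np

sizes = np.array(["medium", "small", "large", "small", "large"])

# The ranking is stated explicitly (assumed for this toy column).
order = ["small", "medium", "large"]
rank = {level: i for i, level in enumerate(order)}
codes = np.array([rank[s] for s in sizes])
print("explicit order:", codes.tolist())

# np.unique sorts alphabetically (large < medium < small), so its inverse
# index is a valid integer code but NOT the real ranking.
alpha_levels, alpha_codes = np.unique(sizes, return_inverse=True)
print("alphabetical  :", alpha_codes.tolist(), "levels:", alpha_levels.tolist())
```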

Frequency / count encoding

A cheap, single-column escape from one-hot's width problem: replace each category by how often it appears (its count or its relative frequency). It collapses \(K\) levels into one numeric feature, which suits high-cardinality columns and tree models well. The implicit claim is that rarity carries signal. That is often true (rare merchant codes correlate with fraud) and sometimes meaningless, and the encoding unavoidably maps two different categories with the same frequency onto the same value.
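The collision is easy to see on a small example. This is a minimal sketch on a made-up payment-method column (the values are illustrative assumptions): two different levels that happen to be equally common receive the same code.

```python
import numpy as np

train = np.array(["visa", "visa", "amex", "cash", "amex", "visa", "cash", "wire"])

# Build the category -> relative-frequency lookup from the training column.
levels, counts = np.unique(train, return_counts=True)
freq_map = dict(zip(levels.tolist(), (counts / len(train)).tolist()))
print(freq_map)

# "amex" and "cash" both appear twice, so they collide on the same value.
encoded = np.array([freq_map[c] for c in train])
print("amex ->", freq_map["amex"], " cash ->", freq_map["cash"])
```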

| Encoding | New columns | Best for | Footgun |
| --- | --- | --- | --- |
| One-hot | K (or K−1) | Nominal, low cardinality, linear/SVM/k-NN | Cardinality blow-up; sparse, wide matrices |
| Ordinal | 1 | Genuinely ranked categories | Invents an order on nominal data |
| Frequency | 1 | High cardinality, tree models | Distinct-but-equally-common levels collide |
| Target (§3.2) | 1 | High cardinality + a target | Leakage if fit on the same rows it encodes |

The cardinality wall. One-hot is the textbook default precisely because it is honest about nominal structure, but it scales linearly with the number of levels. At a few dozen categories it is fine; at thousands (user IDs, product SKUs, ZIP codes) the matrix becomes enormous and sparse, distances degrade, and you reach for frequency or target encoding instead. That trade-off — fidelity vs. width — is the whole game, and the instrument below lets you feel it.
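The width cost is simple arithmetic: a dense one-hot block has \(n \times K\) cells and exactly \(n\) of them are non-zero, so the density is \(1/K\). A small sketch follows; the helper name `onehot_footprint` is hypothetical.

```python
def onehot_footprint(n_rows, K):
    """Return (cells, nonzero, density) for a dense one-hot block."""
    cells = n_rows * K          # every row gets K indicator cells
    nonzero = n_rows            # exactly one 1 per row
    return cells, nonzero, nonzero / cells

for K in (4, 40, 400, 4_000):
    cells, nonzero, density = onehot_footprint(10_000, K)
    print(f"K={K:>5}: {cells:>12,} cells, {nonzero:,} non-zero, density {density:.3%}")
```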

PYTHON · RUNNABLE IN-BROWSER
# One-hot, ordinal & frequency encoding of a small categorical column (numpy only)
import numpy as np
col = np.array(["red","green","blue","red","blue","red","green","red"])

cats, inv, counts = np.unique(col, return_inverse=True, return_counts=True)
K = len(cats)
print("categories :", list(cats), " (K =", K, ")")

onehot = np.eye(K, dtype=int)[inv]              # EQ D3.1: K indicator columns
print("\none-hot matrix (rows = samples, cols =", list(cats), "):")
print(onehot)
print("one-hot adds", K, "columns; every row sums to", set(onehot.sum(1).tolist()))

ordinal = inv                                   # integer code per category (order = alpha here)
freq = counts[inv] / len(col)                   # frequency encoding: share of each category
print("\nordinal codes  :", ordinal.tolist())
print("frequency codes:", np.round(freq, 3).tolist(), " (1 column, not", K, ")")
print("\nNote: ordinal would falsely tell a linear model blue(0) < green(1) < red(2).")
INSTRUMENT D3.1 — ENCODING EXPLORER · ONE-HOT vs TARGET · CARDINALITY BLOW-UP · EQ D3.1 / D3.2
Readouts: COLUMNS ADDED · MATRIX CELLS (n=10K rows) · DENSITY (NON-ZERO)
Drag K from 2 to 40 in ONE-HOT mode and watch the matrix grow one column per category — at 40 levels and 10K rows you are storing 400K cells of which only 10K (2.5%) are non-zero: the sparse, wide blow-up. Switch to TARGET and the whole thing collapses to a single dense column whatever K is. The canvas shows the actual encoded matrix; the bars show one column being added per category in one-hot, versus one fixed column in target.
3.2

Target & WOE encoding — and how to keep them leakage-safe

When a categorical column has hundreds or thousands of levels, one-hot is unwieldy and frequency throws away the relationship with the label. Target encoding (also "mean encoding", introduced by Micci-Barreca in 2001) replaces each category with the average value of the target for that category — one informative numeric column, regardless of cardinality.

EQ D3.2 — SMOOTHED TARGET ENCODING $$ \hat{t}(c) \;=\; \frac{n_c\, \bar{y}_c \;+\; m\, \bar{y}}{n_c + m}, \qquad \bar{y}_c = \frac{1}{n_c}\sum_{i:\,x_i = c} y_i $$
\(\bar{y}_c\) is the target mean inside category \(c\); \(n_c\) is how many rows fall in \(c\); \(\bar{y}\) is the global target mean; \(m\) is a smoothing strength. The encoded value is a credibility-weighted blend: a category seen thousands of times trusts its own mean (\(n_c \gg m\)); a category seen twice is pulled toward the global prior \(\bar{y}\) (\(n_c \ll m\)). Without smoothing, a category that appears once would be encoded as exactly its single row's label — a perfect, useless memory of the answer. This shrinkage toward the prior is the entire reason target encoding generalizes.

Weight of evidence (WOE)

For binary classification, the closely-related weight of evidence encoding — a staple of credit scoring — replaces each category with the log-odds it contributes:

EQ D3.3 — WEIGHT OF EVIDENCE $$ \mathrm{WOE}(c) \;=\; \ln\!\left( \frac{\Pr(x = c \mid y = 1)}{\Pr(x = c \mid y = 0)} \right) \;=\; \ln\!\left( \frac{\text{(events in }c)\,/\,\text{(total events)}}{\text{(non-events in }c)\,/\,\text{(total non-events)}} \right) $$
WOE is the log-ratio of two shares: the fraction of all positives that fall in category \(c\) over the fraction of all negatives that fall in \(c\). It is monotonic in the category's target rate, lives on the natural log-odds scale a logistic regression already speaks, and the associated Information Value \(\mathrm{IV} = \sum_c (\text{share}_1 - \text{share}_0)\,\mathrm{WOE}(c)\) gives a single number for how predictive the whole feature is. Like target encoding, WOE must be computed with smoothing (and a small \(\varepsilon\) to avoid \(\ln 0\)) and on held-out folds.
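EQ D3.3 and the Information Value translate into a few lines of numpy. The sketch below is illustrative only: the function name `woe_iv`, the additive count smoothing `eps`, and the toy data are assumptions, and for brevity it fits on all rows, whereas a real pipeline would compute it out-of-fold.

```python
import numpy as np

def woe_iv(cat, y, eps=0.5):
    """Per-category WOE and total IV for a binary target y in {0, 1}.

    eps is added to every event / non-event count (an assumption of this
    sketch) so the log stays finite when a category has none of one class.
    """
    levels = np.unique(cat)
    events = np.array([np.sum((cat == c) & (y == 1)) for c in levels]) + eps
    nonevents = np.array([np.sum((cat == c) & (y == 0)) for c in levels]) + eps
    share1 = events / events.sum()          # Pr(x = c | y = 1)
    share0 = nonevents / nonevents.sum()    # Pr(x = c | y = 0)
    woe = np.log(share1 / share0)
    iv = float(np.sum((share1 - share0) * woe))
    return dict(zip(levels.tolist(), woe.tolist())), iv

# Toy data: A is mostly positive, B is balanced, C is mostly negative.
cat = np.array(["A"] * 10 + ["B"] * 10 + ["C"] * 10)
y = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8)

woe, iv = woe_iv(cat, y)
for level, value in woe.items():
    print(f"WOE({level}) = {value:+.3f}")
print(f"IV = {iv:.3f}")
```

Each term of the IV sum is non-negative, because the share difference and the WOE always have the same sign.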
THE LEAKAGE TRAP

Target encoding looks at the label — so if you fit the encoding on the same rows you then train on, every row gets to peek at its own answer. The model sees a feature that is partly a copy of \(y\), validation scores soar, and production collapses. This is the single most common way a leaderboard-topping pipeline dies on real data. The fix is never to encode a row using its own target.

The disciplined remedy is out-of-fold (cross-fitted) encoding: split the training data into \(k\) folds; to encode the rows in fold \(j\), compute the category means using only the other \(k-1\) folds. No row ever contributes to its own encoded value, so the feature carries the category's signal without memorizing the answer. The test set is then encoded from statistics computed on the full training set. Smoothing (EQ D3.2) and out-of-fold computation are complementary, not alternatives — serious pipelines use both.

A category appears \(n_c = 4\) times with a positive rate \(\bar{y}_c = 0.75\). The global mean is \(\bar{y} = 0.5\) and the smoothing strength is \(m = 4\). What smoothed target-encoded value \(\hat{t}(c)\) does EQ D3.2 give?
\(\hat{t}(c) = \dfrac{n_c\,\bar{y}_c + m\,\bar{y}}{n_c + m} = \dfrac{4 \times 0.75 + 4 \times 0.5}{4 + 4} = \dfrac{3 + 2}{8} = \dfrac{5}{8} = \) 0.625. With \(n_c = m\) the encoding is the simple average of the category mean (0.75) and the prior (0.5): exactly halfway, because the category has been seen just as often as the smoothing strength assumes.
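To see the shrinkage in EQ D3.2 at work, the same arithmetic can be swept over several category sizes. This is a minimal sketch; the function name `smoothed_target` and the chosen values of \(n_c\) are illustrative assumptions.

```python
def smoothed_target(n_c, ybar_c, ybar, m):
    """EQ D3.2: credibility-weighted blend of category mean and global mean."""
    return (n_c * ybar_c + m * ybar) / (n_c + m)

ybar, ybar_c, m = 0.5, 0.75, 4.0      # global mean, category mean, smoothing strength
for n_c in (1, 4, 40, 4000):
    t = smoothed_target(n_c, ybar_c, ybar, m)
    print(f"n_c={n_c:>4}: encoded value {t:.4f}")
```

Small categories stay near the prior 0.5, and the encoded value approaches the category's own mean 0.75 as \(n_c\) grows past \(m\).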
PYTHON · RUNNABLE IN-BROWSER
# Target encoding: naive (LEAKS) vs out-of-fold (safe). Watch the leak signal.
import numpy as np
rng = np.random.default_rng(0)

# 600 rows, a high-cardinality column with 200 levels, target unrelated to it
n, K = 600, 200
cat = rng.integers(0, K, n)
y   = rng.integers(0, 2, n).astype(float)        # pure coin flips: TRUE signal = 0

def naive_encode(cat, y):                         # fit on the SAME rows -> leak
    enc = np.zeros(len(cat))
    for c in np.unique(cat):
        enc[cat == c] = y[cat == c].mean()
    return enc

def oof_encode(cat, y, k=5, m=20.0):              # out-of-fold + smoothing (safe)
    enc, gm = np.zeros(len(cat)), y.mean()
    fold = np.arange(len(cat)) % k
    for j in range(k):
        tr, te = fold != j, fold == j
        for c in np.unique(cat[te]):
            mask = tr & (cat == c); nc = mask.sum()
            enc[te & (cat == c)] = (nc*y[mask].mean() + m*gm)/(nc+m) if nc else gm
    return enc

def corr(a, b): a,b=a-a.mean(),b-b.mean(); return float((a*b).sum()/np.sqrt((a*a).sum()*(b*b).sum()))
print("corr(naive encoding, y) :", round(corr(naive_encode(cat,y), y), 3), " <- phantom signal (leak!)")
print("corr(out-of-fold,   y) :", round(corr(oof_encode(cat,y),  y), 3), " <- ~0, the honest truth")

In the cell above the target is literally a coin flip, with no real relationship to the category, yet naive encoding manufactures a sizeable correlation with \(y\) out of thin air, because each rare category memorized its own rows. Out-of-fold encoding reports the true near-zero. The seed is fixed, so rerunning the cell reproduces the same numbers; change the argument of default_rng and run it a few times to see that the leak is consistent and the honest version stays honest.

3.3

Scaling: standardize, min-max, robust

Once everything is numeric, the columns still live on wildly different scales: age in years (0–100), income in dollars (0–\(10^6\)), a fraction in [0, 1]. Any algorithm that measures distance or sums weighted features will let the large-magnitude column dominate purely by accident of units. Feature scaling puts every column on comparable footing.

Who cares about scale, and who does not? It is worth memorizing the split, because scaling a tree model is wasted effort and not scaling a k-NN model is a bug.

| Scaling matters | Why |
| --- | --- |
| k-NN, k-means | Euclidean distance |
| SVM (RBF), PCA | dot products / variance |
| Linear/logistic + regularization, neural nets | gradient conditioning; L1/L2 penalize raw coefficients |

Scaling is irrelevant: decision trees, random forests, gradient-boosted trees. They split on thresholds within a single feature, so monotone rescaling changes nothing.
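The k-NN case can be made concrete with three invented rows of (age, income). In this sketch, which assumes a plain Euclidean nearest-neighbour lookup and made-up numbers, the raw distance is decided almost entirely by the dollar column, and standardizing changes which row counts as nearest.

```python
import numpy as np

# Hypothetical rows: [age in years, income in dollars]
X = np.array([[25.0, 50_000.0],
              [60.0, 50_500.0],
              [26.0, 58_000.0]])
query = np.array([25.0, 51_000.0])

def nearest(X, q):
    """Index of the row closest to q in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

raw_nn = nearest(X, query)

# Standardize with statistics from X, then apply the same transform to the query.
mu, sigma = X.mean(axis=0), X.std(axis=0)
scaled_nn = nearest((X - mu) / sigma, (query - mu) / sigma)

print("nearest on raw units   : row", raw_nn, X[raw_nn].tolist())
print("nearest after z-scoring: row", scaled_nn, X[scaled_nn].tolist())
```

On raw units the 60-year-old wins because a $500 gap outweighs a 35-year gap; after scaling the same-age row is nearest.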

Standardization (z-score)

Subtract the mean, divide by the standard deviation. Every column ends up centered at 0 with unit variance:

EQ D3.4 — STANDARDIZATION (z-SCORE) $$ z = \frac{x - \mu}{\sigma}, \qquad \mu = \frac{1}{n}\sum_i x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_i (x_i - \mu)^2} $$
\(z\) is the number of standard deviations \(x\) sits from the mean. The transformed column has mean 0 and standard deviation 1, but its shape is unchanged — standardizing a skewed column gives a skewed column with nicer units (that is what §3.4 is for). It is the default for most linear models, SVMs, PCA and neural nets. It does not bound the range and it is not robust: a single huge outlier inflates \(\sigma\) and squashes everyone else toward zero.
Standardize the value \( x = 8 \) for a feature whose mean is \( \mu = 5 \) and standard deviation is \( \sigma = 3 \). What is the z-score \( z \)?
\( z = \dfrac{x - \mu}{\sigma} = \dfrac{8 - 5}{3} = \dfrac{3}{3} = \) 1.0. The value sits exactly one standard deviation above the mean — which is precisely what a z-score of 1 means.

Min-max scaling

Linearly squeeze the column into a fixed interval, usually [0, 1]:

EQ D3.5 — MIN-MAX SCALING $$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \;\in\; [0, 1] $$
The minimum maps to 0, the maximum to 1, everything else lands proportionally between. It preserves the exact shape of the distribution and the relative spacing of points, which is why it is favoured for image pixels and for inputs to bounded activations. Its weakness is the mirror image of standardization's: it is defined by the extremes, so one outlier at \(x_{\max}\) compresses every real value into a thin band near 0. Use it when you know the bounds and trust them.

Robust scaling

When outliers are a fact of life, scale by quantities that ignore the tails — the median for centering, the interquartile range (IQR) for spread:

EQ D3.6 — ROBUST SCALING $$ x'' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}, \qquad \mathrm{IQR}(x) = Q_3 - Q_1 $$
The median has a 50% breakdown point and the IQR uses only the middle half of the data, so a handful of extreme values barely move either statistic. Robust scaling therefore keeps the bulk of the data on a sensible scale even when 10–20% of it is garbage — at the cost of the clean "mean 0, var 1" guarantee. Reach for it whenever a histogram shows fat tails or known measurement errors; reach for standardization when the data is roughly Gaussian and clean.
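A quick numerical comparison shows the difference. The sketch below uses synthetic data (999 roughly Gaussian points plus one invented rogue value) and a hypothetical helper `bulk_spread` that measures how much spread the genuine points keep after each scaler.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(50.0, 5.0, 999)
x = np.append(clean, 10_000.0)             # one rogue measurement at the end

z = (x - x.mean()) / x.std()               # EQ D3.4
mm = (x - x.min()) / (x.max() - x.min())   # EQ D3.5
q1, med, q3 = np.percentile(x, [25, 50, 75])
rb = (x - med) / (q3 - q1)                 # EQ D3.6

def bulk_spread(v):
    """Standard deviation of the 999 genuine points after scaling."""
    return float(v[:-1].std())

print(f"z-score : bulk spread {bulk_spread(z):.4f}")
print(f"min-max : bulk spread {bulk_spread(mm):.4f}")
print(f"robust  : bulk spread {bulk_spread(rb):.4f}")
```

The outlier itself still maps to a huge value under robust scaling; the point is only that it does not distort the scale of everything else.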

FIT ON TRAIN ONLY

Every scaler has parameters learned from data — \(\mu, \sigma\) for z-score, \(x_{\min}, x_{\max}\) for min-max, median/IQR for robust. Fit those parameters on the training set, then apply the frozen transform to validation and test. Recomputing the mean on the test set leaks test information into preprocessing and quietly inflates your scores — the scaling-stage twin of the target-encoding leak in §3.2.
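The fit/apply split looks like this in plain numpy. It is a minimal sketch on synthetic data with an assumed 800/200 split: the parameters come from the training rows only and are then reused unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(100.0, 15.0, 1000)
train, test = x[:800], x[800:]

# Fit: parameters come from the training rows only.
mu, sigma = train.mean(), train.std()

# Transform: the same frozen (mu, sigma) is applied to both splits.
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print(f"train: mean {train_z.mean():+.3f}  std {train_z.std():.3f}")
print(f"test : mean {test_z.mean():+.3f}  std {test_z.std():.3f}")
```

The test split will not come out at exactly mean 0 and standard deviation 1, and that is expected: forcing it to would require looking at the test data.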

PYTHON · RUNNABLE IN-BROWSER
# Standardize vs min-max, and a Box-Cox normality gain on skewed data
import numpy as np
rng = np.random.default_rng(0)
x = rng.exponential(2.0, 4000) + 0.5          # right-skewed, strictly positive

def stats(name, v):
    print(f"{name:11s} mean {v.mean():7.3f}  std {v.std():6.3f}  "
          f"min {v.min():7.3f}  max {v.max():8.3f}")

stats("raw", x)
z  = (x - x.mean()) / x.std()                 # EQ D3.4: mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())      # EQ D3.5: [0, 1]
stats("z-score", z); stats("min-max", mm)

def skew(v): v=(v-v.mean())/v.std(); return float((v**3).mean())   # 0 = symmetric
print("\nscaling does NOT change shape -> skew(raw)=%.2f  skew(z)=%.2f"
      % (skew(x), skew(z)))

# Box-Cox (lambda chosen by a small grid) pulls the skew toward 0:
best = min(np.linspace(-1, 1, 41),
           key=lambda L: abs(skew(np.log(x) if abs(L)<1e-9 else (x**L-1)/L)))
bc = np.log(x) if abs(best)<1e-9 else (x**best - 1)/best
print(f"Box-Cox lambda*={best:+.2f} -> skew={skew(bc):+.2f}  (much closer to normal)")
INSTRUMENT D3.2 — SCALING VISUALIZER · TWO FEATURE CLOUDS · STANDARDIZE / MIN-MAX / ROBUST · OUTLIER TOGGLE
Readouts: FEATURE A → range · FEATURE B → range · OUTLIER POSITION
Two clouds on very different native scales (A wide, B narrow). In RAW, feature A dominates any distance. Cycle the scalers: STANDARD and MIN-MAX equalize them — until you hit INJECT, which drops one extreme outlier. Now watch min-max crush the real data into a sliver near 0 and standard inflate its spread, while ROBUST barely flinches because the median and IQR ignore the rogue point. Grid lines mark the target scale of each method.
3.4

Distribution transforms: log, Box-Cox, Yeo-Johnson

Scaling moves and stretches a column but never changes its shape. Yet many real features are badly skewed — incomes, prices, durations, counts — and many estimators (linear regression, anything assuming Gaussian-ish residuals, distance methods) work best on roughly symmetric inputs. Distribution transforms are nonlinear maps that pull a long right tail back toward symmetry.

The log transform

The workhorse. For strictly positive, right-skewed data, \(x \mapsto \ln x\) compresses large values far more than small ones, taming multiplicative spread into additive spread. It is the right move when a variable is naturally relative — a doubling of income matters the same whether from $10K or $1M. Use \(\ln(1+x)\) (log1p) when the column contains exact zeros.
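Both claims, equal steps for equal ratios and the zero-safe variant, can be checked directly. The income and count values in this sketch are made up for illustration.

```python
import numpy as np

income = np.array([10_000.0, 20_000.0, 1_000_000.0, 2_000_000.0])
log_income = np.log(income)

# A doubling is the same step on the log scale, wherever it starts.
step_low = log_income[1] - log_income[0]     # $10K -> $20K
step_high = log_income[3] - log_income[2]    # $1M  -> $2M
print(f"step at $10K: {step_low:.4f}   step at $1M: {step_high:.4f}")

counts = np.array([0.0, 1.0, 9.0, 99.0])
safe = np.log1p(counts)      # ln(1 + x): defined at exact zeros
back = np.expm1(safe)        # inverse transform recovers the original counts
print("log1p:", np.round(safe, 4).tolist())
```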

Box-Cox

Box and Cox (1964) generalized the log into a one-parameter family and let the data choose the exponent:

EQ D3.7 — BOX-COX TRANSFORM $$ x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[6pt] \ln x & \lambda = 0 \end{cases} \qquad (x > 0) $$
A single knob \(\lambda\) sweeps a whole spectrum of shapes: \(\lambda = 1\) is (almost) the identity, \(\lambda = 0.5\) a square root, \(\lambda = 0\) the log, \(\lambda = -1\) a reciprocal. The \(-1\) and division by \(\lambda\) make the family continuous at \(\lambda = 0\), where it smoothly becomes the log. \(\lambda\) is chosen by maximum likelihood — the value that makes the transformed data most Gaussian. The hard constraint: Box-Cox requires strictly positive inputs.
Apply the Box-Cox transform (EQ D3.7) with \( \lambda = 1 \) to the value \( x = 2 \). What is \( x^{(\lambda)} \)?
For \( \lambda \neq 0 \), \( x^{(\lambda)} = \dfrac{x^{\lambda} - 1}{\lambda} = \dfrac{2^{1} - 1}{1} = \dfrac{1}{1} = \) 1.0. At \( \lambda = 1 \) the transform is just \( x - 1 \), a pure shift — it leaves the distribution's shape untouched, which is exactly why \( \lambda = 1 \) is the "do nothing" point of the family.
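The maximum-likelihood choice of \(\lambda\) can be sketched with the standard Gaussian profile log-likelihood, \(\ell(\lambda) = -\tfrac{n}{2}\ln\hat{\sigma}^2_\lambda + (\lambda - 1)\sum_i \ln x_i\), where \(\hat{\sigma}^2_\lambda\) is the variance of the transformed data and additive constants are dropped. The grid search, function names, and synthetic log-normal data below are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def boxcox(x, lam):
    """EQ D3.7 for strictly positive x."""
    return np.log(x) if abs(lam) < 1e-9 else (x**lam - 1) / lam

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda under a Gaussian model (constants dropped)."""
    y = boxcox(x, lam)
    # The (lam - 1) * sum(log x) term is the Jacobian of the transform.
    return -0.5 * len(x) * np.log(y.var()) + (lam - 1.0) * np.log(x).sum()

rng = np.random.default_rng(0)
x = rng.lognormal(1.0, 0.6, 5000)      # log-normal, so lambda = 0 is the true answer

lams = np.linspace(-1, 1, 81)          # grid step 0.025
ll = np.array([boxcox_loglik(x, L) for L in lams])
lam_mle = float(lams[np.argmax(ll)])
print(f"lambda maximizing the likelihood: {lam_mle:+.3f}")
```

Without the Jacobian term the likelihood would simply reward whichever \(\lambda\) shrinks the variance most, which is why it cannot be dropped.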

Yeo-Johnson

Box-Cox's positivity requirement is a real nuisance — temperatures, profits, and standardized features all go negative. Yeo-Johnson (2000) extends the same idea to the whole real line by applying mirrored power transforms on each side of zero:

EQ D3.8 — YEO-JOHNSON TRANSFORM $$ x^{(\lambda)} = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \neq 0 \\[4pt] \ln(x+1) & x \ge 0,\ \lambda = 0 \\[4pt] -\dfrac{(-x+1)^{2-\lambda} - 1}{2 - \lambda} & x < 0,\ \lambda \neq 2 \\[4pt] -\ln(-x+1) & x < 0,\ \lambda = 2 \end{cases} $$
For non-negative \(x\) it is essentially Box-Cox on \(x+1\); for negative \(x\) it mirrors the transform with exponent \(2-\lambda\). The result is one continuous, differentiable function over all of \(\mathbb{R}\) — no positivity constraint, no \(+\)constant hacks. \(\lambda\) is again fit by maximum likelihood for maximal normality. Default to Yeo-Johnson when the column can be zero or negative; reach for plain Box-Cox or log only when you know the data is strictly positive and want the cleaner interpretation.
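EQ D3.8 maps directly onto a branch per case. The following is a minimal numpy sketch (the function name `yeo_johnson` and the tolerance used to detect \(\lambda = 0\) and \(\lambda = 2\) are choices made for this example); it does not fit \(\lambda\), it only applies the transform for a given value.

```python
import numpy as np

def yeo_johnson(x, lam):
    """EQ D3.8 applied elementwise to a float array."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative branch: Box-Cox on x + 1.
    if abs(lam) < 1e-9:
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    # Negative branch: mirrored transform with exponent 2 - lambda.
    if abs(lam - 2.0) < 1e-9:
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    return out

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
for lam in (0.0, 0.5, 1.0, 2.0):
    print(f"lambda={lam:.1f}:", np.round(yeo_johnson(x, lam), 4).tolist())
```

At \(\lambda = 1\) both branches reduce to \(x\) itself, so for Yeo-Johnson the "do nothing" point is an exact identity rather than a shift.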

Honest caveats. These transforms optimize for marginal normality, which is neither necessary nor sufficient for a good model — modern gradient-boosted trees are invariant to any monotone transform of a feature, so this whole section is largely moot for them. Transforms also distort interpretability (a coefficient on \(\ln(\text{income})\) is an elasticity, not a dollar effect) and they extrapolate dangerously outside the fitted range. They earn their keep most for linear models, classical statistics, and any pipeline where Gaussian-ish inputs genuinely help.

PYTHON · RUNNABLE IN-BROWSER
# Box-Cox: scan lambda, pick the most-Gaussian, quantify the normality gain
import numpy as np
rng = np.random.default_rng(1)
x = rng.lognormal(0.0, 0.9, 5000) + 0.2        # heavy right skew, all positive

def boxcox(x, lam):
    return np.log(x) if abs(lam) < 1e-9 else (x**lam - 1) / lam
def skew(v):
    v = (v - v.mean()) / v.std(); return float((v**3).mean())   # 0 == symmetric

lams = np.linspace(-1, 2, 61)
sk   = [abs(skew(boxcox(x, L))) for L in lams]
best = lams[int(np.argmin(sk))]

print(f"raw skew                 : {skew(x):+.3f}")
print(f"best lambda (min |skew|) : {best:+.2f}")
print(f"Box-Cox skew at lambda*  : {skew(boxcox(x, best)):+.3f}")
print(f"\nlog  (lambda=0) skew     : {skew(boxcox(x, 0.0)):+.3f}")
print(f"sqrt (lambda=.5) skew    : {skew(boxcox(x, 0.5)):+.3f}")
print("\n|skew| vs lambda (the U-shaped normality curve):")
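try:
    plot_xy(lams.tolist(), sk)      # plotting helper, assumed to be supplied by the in-browser runtime
except NameError:
    pass                            # not defined outside that runtime: skip the plot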
INSTRUMENT D3.3 — BOX-COX TRANSFORMER · SKEWED DISTRIBUTION · λ SLIDER · LIVE SKEWNESS · EQ D3.7
Readouts: SKEWNESS (0 = SYMMETRIC) · TRANSFORM AT THIS λ · BEST λ (MIN |SKEW|)
The histogram is a strongly right-skewed (log-normal) feature. Drag λ from 1 (identity, the raw skew) down toward 0 (the log) and watch the long tail fold back into a near-symmetric bell as the skewness readout drives toward zero. Press SNAP TO λ* to jump to the maximum-normality value computed live. Push λ past the sweet spot toward −1 and you over-correct into a left skew — the transform is a dial, not a switch.
3.5

Binning & discretization

The opposite move from a smooth transform: binning chops a continuous variable into a handful of discrete intervals — age → {child, adult, senior}, income → deciles. You deliberately throw away resolution to buy something else: robustness to outliers, the ability to capture a non-monotonic effect with a linear model, interpretable "score bands", or a categorical handoff into the encoders of §3.1–§3.2.

There are two everyday unsupervised strategies, and the difference between them is whether the bin widths or the bin counts are held constant; a third, supervised option chooses the edges using the target:

| Strategy | Edges chosen by | Each bin has… | Good / bad |
| --- | --- | --- | --- |
| Equal-width | range / k | equal interval, unequal counts | Simple & interpretable; empty bins on skewed data |
| Equal-frequency | quantiles | equal counts, unequal widths | Robust to skew; edges shift with the data |
| Supervised (e.g. tree / MDL) | target purity | edges where the label changes | Most predictive; can overfit & leak, so fit on train only |
EQ D3.9 — EQUAL-WIDTH vs EQUAL-FREQUENCY BIN EDGES $$ \text{equal-width: } e_j = x_{\min} + j\,\frac{x_{\max} - x_{\min}}{k}; \qquad \text{equal-frequency: } e_j = Q_{j/k}(x), \quad j = 0, \ldots, k $$
Equal-width splits the value axis into \(k\) equal pieces — trivial to read ("ages 0–20, 20–40, …") but on a skewed column most points pile into one or two bins and the rest sit empty. Equal-frequency splits the data into \(k\) equal piles using quantiles, so every bin is equally populated, at the price of uneven, data-dependent widths. Equal-frequency is the safer default for skewed real-world data; equal-width wins when the bin boundaries themselves must be round, fixed, human numbers.
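Both edge rules in EQ D3.9 are one line each in numpy. The sketch below uses a synthetic right-skewed feature and \(k = 5\) (both assumptions) to show the counts each strategy produces.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(2.0, 1000)          # right-skewed toy feature
k = 5

width_edges = np.linspace(x.min(), x.max(), k + 1)        # equal-width edges
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))     # equal-frequency edges

width_counts, _ = np.histogram(x, bins=width_edges)
freq_counts, _ = np.histogram(x, bins=freq_edges)
print("equal-width counts    :", width_counts.tolist())
print("equal-frequency counts:", freq_counts.tolist())

# Assign each value a bin index 0..k-1 using the interior quantile edges.
bin_index = np.digitize(x, freq_edges[1:-1])
```

As with scalers, the edges are learned parameters: in a real pipeline they would be computed on the training split and reused on new data.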

Binning is genuinely contested. It can rescue a linear model from a U-shaped relationship and it makes credit-scorecards legible — but it discards information, plants artificial discontinuities at the bin edges, and (when bins are chosen using the target) leaks exactly like target encoding. The modern view: prefer letting a flexible model learn the nonlinearity (splines, gradient-boosted trees) over hand-binning, and reserve discretization for interpretability, regulatory, or robustness reasons rather than raw accuracy.

PITFALLS

Four ways encoding & scaling silently break a model: (1) fitting any data-dependent transform — scaler, target encoder, supervised bins — on the full dataset instead of train-only, leaking test/label information; (2) ordinal-encoding a nominal variable and inventing an order; (3) min-max scaling in the presence of outliers, crushing the real data to a sliver; (4) unseen categories at inference time that the encoder has no value for — always reserve an "unknown" bucket and a global-mean fallback.
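Pitfall (4) is cheap to guard against. The sketch below fits a smoothed target-encoding table (EQ D3.2) on a tiny invented training set and encodes new rows with a global-mean fallback for unseen levels; the names `table` and `encode` and the data are hypothetical.

```python
import numpy as np

train_cat = np.array(["a", "a", "b", "b", "b", "c"])
train_y = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
m = 2.0                                   # smoothing strength (hypothetical)

# Fit: one smoothed value per category, from the training rows only.
global_mean = float(train_y.mean())
table = {}
for c in np.unique(train_cat):
    mask = train_cat == c
    n_c = mask.sum()
    table[str(c)] = float((n_c * train_y[mask].mean() + m * global_mean) / (n_c + m))

def encode(cats, table, fallback):
    """Look up each category; unseen levels get the fallback value."""
    return np.array([table.get(str(c), fallback) for c in cats])

new_cat = np.array(["a", "c", "zzz"])     # "zzz" never appeared in training
encoded = encode(new_cat, table, fallback=global_mean)
print(np.round(encoded, 4).tolist())
```

Falling back to the global mean is the same value the smoothing formula gives a category with \(n_c = 0\), so the unseen level is treated as a category with no evidence.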

NEXT

Encoding and scaling make the columns you have well-behaved; feature engineering creates the columns you wish you had. Chapter 04 — Feature Engineering — covers interactions, polynomial and spline bases, date/time and cyclical features, aggregations and lag features, and the discipline of building them without leaking the future into the past.

3.R

References

  1. Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. ACM SIGKDD Explorations 3(1) — the smoothed target/mean encoding of EQ D3.2.
  2. Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. Journal of the Royal Statistical Society B 26(2) — the Box-Cox power-transform family, EQ D3.7.
  3. Yeo, I.-K. & Johnson, R. A. (2000). A New Family of Power Transformations to Improve Normality or Symmetry. Biometrika 87(4) — the Yeo-Johnson extension to real-valued data, EQ D3.8.
  4. Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press (full text online) — encoding, scaling, transforms and leakage-safe resampling.
  5. Pedregosa, F. et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12 — the reference implementations of StandardScaler, MinMaxScaler, RobustScaler and PowerTransformer.
  6. Liu, H., Hussain, F., Tan, C. L. & Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery 6 — a survey of binning / discretization methods (EQ D3.9).