AI // ENCYCLOPEDIA / DATA / 04 / FEATURE ENGINEERING INDEX NEXT: IMBALANCED DATA →
DATA & FEATURE ENGINEERING · CHAPTER 04 / 05

Feature Engineering & Selection

The right feature can let a linear model beat a neural net. Feature engineering is the point where domain knowledge enters the math. This chapter works from both ends: how to create features that expose structure a model cannot find on its own, and how to select from the resulting flood the few that carry signal, without contaminating your own evaluation in the process.

LEVELCORE READING TIME≈ 26 MIN BUILDS ONDATA 01–03 INSTRUMENTSINTERACTION · SELECTION RACE · VIF
4.1

Creating features: interactions, ratios, polynomials

A model can only learn relationships its inputs make expressible. A linear model on raw columns \(x_1, x_2\) can fit only \(w_0 + w_1 x_1 + w_2 x_2\) — a flat hyperplane. If the truth lives on a curve, or in the product of two variables, no amount of training data and no clever optimizer will recover it: the hypothesis class simply does not contain the answer. Feature engineering changes the hypothesis class by changing the inputs. You are doing, by hand and with domain knowledge, the representation learning that a deep network would otherwise have to discover from scratch — and on tabular data you will frequently win, because you know things about the problem that the data alone does not say.

The three workhorse transforms each inject a specific kind of structure:

TransformNew featureWhat it expressesReach for it when…
Interactionx₁ · x₂The effect of one variable depends on another (non-additivity)Effects are conditional: a drug works only at a certain dose and age
Ratiox₁ / x₂Scale-free intensity; a rate rather than a levelDensity, price-per-area, debt-to-income — the meaningful quantity is normalized
Polynomialx², x³, …Smooth curvature in a single variableDiminishing or accelerating returns; a clear bend in the partial-dependence plot

The interaction is the most important and the most underused. Consider the exclusive-or pattern: a point is positive when its two coordinates share a sign and negative otherwise. The two classes are perfectly determined, yet completely inseparable by any line in the \((x_1, x_2)\) plane — every straight cut puts roughly half of each class on each side. Add one feature, the product \(x_1 x_2\), and the problem collapses to a single threshold: \(x_1 x_2 > 0\). A linear model — a linear model — now solves it exactly. That is the whole thesis of this chapter in one example.

EQ D4.1 — INTERACTION LINEARIZES XOR $$ y = \operatorname{sign}(x_1 x_2), \qquad \hat{y} = \operatorname{sign}\!\big(w\,(x_1 x_2)\big) \quad\text{is exact, while}\quad \hat{y} = \operatorname{sign}(w_1 x_1 + w_2 x_2 + b) \text{ cannot exceed } 50\%. $$
The raw inputs carry all the information, but in a form no linear decision boundary can read. The engineered product \(z = x_1 x_2\) is a change of coordinates in which the same data becomes linearly separable on one axis. The model did not get smarter — the representation did. A neural net would have learned an equivalent product inside a hidden layer; you supplied it directly, with one multiplication and zero training.

Polynomials generalize this. Polynomial feature expansion of degree \(d\) emits every monomial up to total degree \(d\): for two inputs at degree 2 that is \(\{1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2\}\). The number of terms grows combinatorially — and that growth is the central danger of the technique.

EQ D4.2 — SIZE OF A POLYNOMIAL EXPANSION $$ \#\text{terms} = \binom{n + d}{d}, \qquad \#\text{(degree-2, no bias, } n \text{ inputs)} = n + \binom{n}{2} + n = \underbrace{n}_{\text{linear}} + \underbrace{\binom{n}{2}}_{\text{interactions}} + \underbrace{n}_{\text{squares}} $$
\(\binom{n+d}{d}\) counts all monomials of total degree \(\le d\) in \(n\) variables, including the constant. For \(n=2,\ d=2\) that is \(\binom{4}{2}=6\) terms; the squared-plus-interaction part alone (dropping the bias and the two linear terms) is \(x_1^2, x_1 x_2, x_2^2\) — exactly 3. At \(n=100,\ d=2\) you already have \(\binom{102}{2}=5151\) features; at degree 3 the count explodes into the hundreds of thousands. Curvature is cheap; the curse of dimensionality it buys is not. This is precisely why §4.3 (selection) is not optional once you start §4.1 (creation).
You take two raw features \( x_1, x_2 \) and apply a degree-2 polynomial expansion. Excluding the bias term and the two original linear terms, how many squared-plus-interaction features does it add?
The degree-2 monomials beyond linear are the two squares \( x_1^2,\ x_2^2 \) and the single interaction \( x_1 x_2 \). That is \( 2 + 1 = \) 3 new features — matching the squares-plus-interactions decomposition in EQ D4.2 with \( n = 2 \): \( n + \binom{n}{2} = 2 + 1 = 3 \).
PYTHON · RUNNABLE IN-BROWSER
# EQ D4.1: one interaction feature lets a LINEAR model solve XOR-like data.
import numpy as np
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-1, 1, (n, 2))            # two raw features in [-1, 1]
y = (X[:, 0] * X[:, 1] > 0).astype(float) # XOR pattern: same-sign => class 1

def fit_logreg(F, y, steps=400, lr=0.5):
    w = np.zeros(F.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(F @ w + b)))
        g = p - y
        w -= lr * (F.T @ g) / len(y);  b -= lr * g.mean()
    return w, b

def acc(F, w, b): return ((F @ w + b > 0) == (y > 0.5)).mean()

raw  = X                                  # [x1, x2]                -> a flat plane
poly = np.column_stack([X, X[:, 0]*X[:, 1]])   # add the x1*x2 interaction

wr, br = fit_logreg(raw,  y);  wp, bp = fit_logreg(poly, y)
print(f"linear on [x1, x2]            : accuracy {acc(raw,  wr, br):.3f}  (~chance)")
print(f"linear on [x1, x2, x1*x2]     : accuracy {acc(poly, wp, bp):.3f}  (solved)")
print(f"learned weight on x1*x2       : {wp[2]:+.2f}  <- it found the product")
plot_scatter(X[:, 0], X[:, 1], y.astype(int))  # the XOR checkerboard
edits are live — break it on purpose
INSTRUMENT D4.1 — THE INTERACTION FEATUREXOR DATA · TOGGLE x₁·x₂ · EQ D4.1
TRAIN ACCURACY
DECISION RULE
SEPARABLE?
The four quadrants form a checkerboard: same-sign points are one class, opposite-sign the other. With [ x₁, x₂ ] the best straight boundary the logistic model can draw is hopeless — accuracy hovers near 50%, and the canvas shows the flat shaded half-planes failing. Flip to [ x₁, x₂, x₁·x₂ ] and the very same model snaps to ~100%: the engineered product turns the checkerboard into a single threshold (the boundary becomes the two axes). Raising label noise is the only thing that can hurt it now.

A practical warning. Engineered features are not free: each one is another dimension in which the model can overfit, another column to compute and store at serving time, and — for ratios — another place a zero denominator can blow up your pipeline. The discipline is to create with intent (a hypothesis about why this feature should matter) and then prune hard (§4.3). Create generously in the lab; ship parsimoniously.

4.2

Datetime, text & aggregation features

Most real-world signal does not arrive as tidy numeric columns. It arrives as timestamps, free text, and one-to-many relationships between tables. Each demands its own family of feature transforms — and each is where domain knowledge pays off most.

Datetime

A raw timestamp is nearly useless to a model: as a single monotonically increasing integer it can only express "later". The information lives in its components — hour of day, day of week, month, is-weekend, is-holiday, days-since-last-event — extracted into separate features. The subtlety is that several of these are cyclical: hour 23 and hour 0 are adjacent, not maximally distant, yet a plain integer encoding tells the model they are 23 apart. The fix is a sine/cosine pair that wraps the cycle onto a circle.

EQ D4.3 — CYCLICAL ENCODING $$ x_{\sin} = \sin\!\left(\frac{2\pi\,t}{P}\right), \qquad x_{\cos} = \cos\!\left(\frac{2\pi\,t}{P}\right) $$
\(t\) is the position within the cycle (e.g. the hour, \(0\ldots23\)) and \(P\) its period (here \(24\)). The pair places each time on a unit circle, so the Euclidean distance between hour 23 and hour 0 is small — as it should be — while opposite hours sit far apart. One number cannot encode a cycle without a discontinuity; two can. Both features are needed: \(\sin\) alone is ambiguous (it gives the same value at 6:00 and 18:00), and the \(\cos\) component breaks the tie. Tree models, which split on thresholds, often do fine with the raw integer components and need this trick less than linear models and neural nets.

Text

Free text is turned into features along a spectrum of sophistication. The classical baseline is the bag of words / TF-IDF representation: count each term, then down-weight terms that appear in many documents so that common words contribute little and distinctive words contribute much.

EQ D4.4 — TF-IDF $$ \text{tfidf}(t, d) = \underbrace{\text{tf}(t, d)}_{\text{count in doc}} \times \underbrace{\log\!\frac{N}{1 + \text{df}(t)}}_{\text{inverse doc frequency}} $$
\(\text{tf}(t,d)\) is how often term \(t\) appears in document \(d\); \(\text{df}(t)\) is how many of the \(N\) documents contain it. A term in every document (\(\text{df}\approx N\)) gets an IDF near zero and is effectively ignored; a rare, document-specific term gets a large weight. The \(1+\) in the denominator avoids division by zero for unseen terms. TF-IDF is still a strong, cheap, fully interpretable baseline for classification; dense embeddings (Vol II) beat it on meaning but lose the per-term transparency that makes TF-IDF easy to debug. Simpler text features — length, digit count, punctuation ratio, sentiment lexicon hits — are often surprisingly predictive and cost almost nothing.

Aggregation

When the unit of prediction (a customer) maps to many rows in another table (their transactions), you must aggregate the many into features of the one: count, sum, mean, min, max, standard deviation, recency, and ratios of these over time windows. "Mean transaction value over the last 30 days," "number of distinct merchants this week," "ratio of this month's spend to the trailing-6-month average" — these grouped statistics are typically the most predictive features in churn, fraud, and recommendation systems, and they are exactly what automated tooling (featuretools' deep feature synthesis, modern feature stores) was built to manufacture and serve consistently between training and production.

LEAKAGE

Aggregation and time features are the two richest sources of target leakage. If an aggregate is computed over a window that includes the prediction moment — "average outcome for this customer," "total refunds including the one you are trying to predict" — your offline metric will be spectacular and your production model will fail. Every windowed feature must be computed strictly from information available before the prediction timestamp. The discipline is a point-in-time correct join: as-of each event, use only rows that existed then. Leakage through aggregation is the single most common reason a model that "worked" in a notebook collapses on deployment.

PYTHON · RUNNABLE IN-BROWSER
# EQ D4.3: cyclical hour encoding keeps midnight next to 11pm.
import numpy as np
hours = np.arange(24)
P = 24
hs = np.sin(2*np.pi*hours/P)
hc = np.cos(2*np.pi*hours/P)

def dist(a, b, vec):                       # Euclidean distance in feature space
    return np.hypot(vec[0][a]-vec[0][b], vec[1][a]-vec[1][b])

raw = (hours[None, :], np.zeros((1, 24)))  # raw integer "encoding" (1-D)
cyc = (hs, hc)                             # sin/cos pair (2-D, on a circle)

print("       raw-integer dist   cyclical dist")
for a, b, name in [(23, 0, "23h -> 00h"), (0, 12, "00h -> 12h"), (6, 18, "06h -> 18h")]:
    print(f"{name:14s}  {abs(hours[a]-hours[b]):>8.2f}        {dist(a, b, cyc):>8.3f}")

print("\nraw says 23h and 00h are 23 apart (max); cyclical says they are adjacent.")
print("00h<->12h and 06h<->18h are the true opposites -> largest cyclical distance.")
plot_xy(hs, hc)                            # the 24 hours laid out on a circle
edits are live — break it on purpose
4.3

Feature selection: filter, wrapper, embedded

§4.1 generates features by the hundred; §4.3 throws most of them away. Selection matters for three reasons that compound: fewer features means less overfitting (especially when \(p\) approaches or exceeds \(n\)), faster and cheaper models in training and serving, and — often most valuable — a model a human can actually read. The three families of methods trade compute against fidelity to the final model.

FamilyHow it scores featuresCostBlind spot
FilterUnivariate statistic vs the target (correlation, MI, χ², ANOVA F), model-agnosticcheapJudges each feature alone — misses interactions and redundancy
WrapperTrain the model on candidate subsets, search for the best (forward, backward, RFE)expensiveCombinatorial; prone to overfitting the search itself
EmbeddedSelection happens inside training (L1/Lasso zeros weights; trees rank by gain)moderateTied to one model family; unstable under collinearity

Filter methods score every feature against the target independently and keep the top \(k\). They are blisteringly fast and a fine first pass, but their independence assumption is exactly their weakness: a filter ranks each feature in isolation, so it will happily keep ten copies of the same signal and discard a feature that is useless alone yet decisive in combination (the XOR product of §4.1 has zero univariate correlation with the label, yet is the whole answer).

Wrapper methods close that gap by judging features through the actual model. Recursive feature elimination (RFE) is the canonical example: train the model on all features, drop the least important one, refit, and repeat until the target count remains. Because the model sees feature combinations at every step, RFE can keep the XOR product and discard the redundant copies — at the cost of training the model many times.

EQ D4.5 — RECURSIVE FEATURE ELIMINATION $$ S_0 = \{1,\dots,p\}, \qquad j^\star = \arg\min_{j \in S_t} \text{importance}_j\big(\text{fit on } S_t\big), \qquad S_{t+1} = S_t \setminus \{j^\star\} $$
Start with all \(p\) features. At each step, fit the model on the surviving set \(S_t\), find the feature \(j^\star\) the refit model ranks lowest, and remove it. Repeat until \(|S| = k\). Importance is whatever the model exposes — \(|\text{coefficient}|\) for a linear model, split-gain for a tree. The recursion is the point: a feature that looks weak among all \(p\) may become essential once its redundant partners are gone, so importances are recomputed after every elimination rather than ranked once. RFE is \(O(p)\) model fits — far cheaper than the \(2^p\) of exhaustive subset search, far more faithful than a single filter pass.

Embedded methods fold selection into the fit itself. L1 regularization (the Lasso) adds a penalty proportional to the sum of absolute weights; the geometry of that penalty drives many coefficients to exactly zero, performing selection and fitting in a single optimization.

EQ D4.6 — LASSO: SELECTION BY L1 PENALTY $$ \hat{\beta} = \arg\min_{\beta}\ \tfrac{1}{2n}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert $$
The squared-error loss pulls toward the least-squares fit; the L1 term \(\lambda\sum|\beta_j|\) pulls toward zero. Because the L1 ball has corners on the axes (unlike the round L2 ball of ridge regression), the optimum tends to land on an axis — i.e. with some \(\beta_j = 0\) exactly. Larger \(\lambda\) zeros more coefficients; sweep \(\lambda\) and you trace a selection path. L1 selects; L2 only shrinks. The caveat experts will raise: among a group of highly correlated features the Lasso tends to pick one arbitrarily and zero the rest, which is unstable — the elastic net (L1+L2) was invented precisely to tame that, and tree-based importances (§4.4) suffer a related instability.
PYTHON · RUNNABLE IN-BROWSER
# EQ D4.5: recursive feature elimination by hand on a linear model.
# 3 of 12 features are real signal; RFE should recover exactly those 3.
import numpy as np
rng = np.random.default_rng(1)
n, p, k = 300, 12, 3
X = rng.normal(0, 1, (n, p))
X /= X.std(0)                              # standardize so |coef| is comparable
true = [2, 5, 9]                           # the only features that matter
y = 3.0*X[:, 2] - 2.0*X[:, 5] + 1.5*X[:, 9] + 0.3*rng.normal(0, 1, n)

def ridge_coef(Xs, y, lam=1.0):           # closed-form ridge => stable importances
    A = Xs.T @ Xs + lam*np.eye(Xs.shape[1])
    return np.linalg.solve(A, Xs.T @ y)

kept = list(range(p))
while len(kept) > k:
    w = ridge_coef(X[:, kept], y)
    drop = int(np.argmin(np.abs(w)))      # j* : smallest |coef| in the refit model
    print(f"have {len(kept):2d} -> drop original feature #{kept[drop]:2d} (|coef|={abs(w[drop]):.3f})")
    kept.pop(drop)

print("\nRFE kept :", sorted(kept))
print("truth    :", true)
print("match    :", sorted(kept) == true)
edits are live — break it on purpose
INSTRUMENT D4.2 — FEATURE-SELECTION RACEFILTER vs WRAPPER vs L1 · 5 SIGNAL + NOISE
RECALL — TRUE FEATURES RECOVERED IN THE TOP-5 (5 = PERFECT)
FILTER (|CORR|)
WRAPPER (RFE)
EMBEDDED (L1)
Five true features drive the target; the rest are pure noise. Each method picks its top 5; the bars show how many of the real ones it recovered. With strong signal and little noise all three score 5/5. Drive NOISE FEATURES up and SIGNAL STRENGTH down: the univariate filter degrades first — it cannot tell a true feature from a noise feature that happens to correlate by chance — while the model-aware RFE and L1 hold on longer. This is the cheap-vs-faithful trade-off made visible.
4.4

Importance & redundancy: MI, correlation, VIF

"Is this feature useful?" splits into two distinct questions that beginners conflate. Importance: how much does this feature tell me about the target? Redundancy: how much of this feature is already told by the others? You want features that score high on the first and low on the second — informative and non-overlapping. Three measures cover the ground.

Correlation is the cheap importance measure, but it sees only linear association (Vol & DATA 03). A feature with a perfect quadratic relationship to the target can have correlation zero. Mutual information fixes this: it measures any statistical dependence, linear or not, in bits.

EQ D4.7 — MUTUAL INFORMATION $$ I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)} \;=\; H(Y) - H(Y \mid X) $$
MI is zero if and only if \(X\) and \(Y\) are statistically independent, and it grows with any form of dependence — capturing the curved relationships correlation is blind to. The second form reads it as the reduction in the uncertainty (entropy) of \(Y\) once you know \(X\): how many bits the feature buys you about the target. MI catches non-linear importance that correlation misses; the price is that estimating it from continuous data needs binning or a \(k\)-nearest-neighbour estimator, and noisy estimates can over-rank features in small samples. Used as a filter score, MI is strictly more general than correlation.

Redundancy is the other axis, and it has its own canonical diagnostic. When several features are linear combinations of one another — multicollinearity — a linear model can still predict fine, but its coefficients become unstable and uninterpretable: the model cannot decide how to split credit between the duplicates, so tiny data changes swing the weights wildly (and sometimes flip their signs). The variance inflation factor (VIF) quantifies exactly how badly each feature is explained by the rest.

EQ D4.8 — VARIANCE INFLATION FACTOR $$ \text{VIF}_j = \frac{1}{1 - R_j^2}, \qquad R_j^2 = \text{the } R^2 \text{ from regressing feature } x_j \text{ on all the other features.} $$
Regress feature \(x_j\) on every other feature and read off the \(R_j^2\). If the others explain none of \(x_j\) (\(R_j^2 = 0\)) then \(\text{VIF}_j = 1\) — no inflation. As \(R_j^2 \to 1\) the feature becomes a near-perfect combination of the others and \(\text{VIF}_j \to \infty\). The name is literal: the variance of the estimated coefficient \(\hat\beta_j\) is multiplied by exactly \(\text{VIF}_j\) relative to the no-collinearity case. Rules of thumb: VIF > 5 warrants a look, VIF > 10 signals serious collinearity — though these thresholds are conventions, not laws, and high VIF only hurts coefficient interpretation, not pure predictive accuracy. At \(R_j^2 = 0.8\), \(\text{VIF}_j = 1/(1-0.8) = 5\): the borderline case.
Regressing feature \( x_j \) on all the other features gives \( R_j^2 = 0.8 \) — the rest of the design explains 80% of its variance. What is its variance inflation factor \( \text{VIF}_j \)?
By EQ D4.8, \( \text{VIF}_j = \dfrac{1}{1 - R_j^2} = \dfrac{1}{1 - 0.8} = \dfrac{1}{0.2} = \) 5. The coefficient's variance is inflated 5×, and \( R_j^2 = 0.8 \) sits right at the conventional "warrants a look" threshold — a feature this redundant is a prime candidate to drop or combine.
PYTHON · RUNNABLE IN-BROWSER
# EQ D4.8: variance inflation factor, computed directly from R^2.
import numpy as np
rng = np.random.default_rng(0)
n = 600
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = 0.9*x1 + 0.1*rng.normal(0, 1, n)     # x3 is almost a copy of x1 -> high VIF
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3 (~x1)"]

def vif(X, j):                            # regress column j on the others, read R^2
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2), r2

for j, nm in enumerate(names):
    v, r2 = vif(X, j)
    flag = "  <- collinear!" if v > 5 else ""
    print(f"{nm:10s}  R^2={r2:5.3f}   VIF={v:6.2f}{flag}")

print("\nx1 and x3 inflate each other; x2 is independent and sits near VIF=1.")
print("check: R^2=0.80 -> VIF = 1/(1-0.80) =", round(1/(1-0.80), 2))
edits are live — break it on purpose
INSTRUMENT D4.3 — MULTICOLLINEARITY / VIF EXPLORERTWO FEATURES · TUNE THEIR CORRELATION · EQ D4.8
R² (x₁ ~ x₂)
VIF = 1/(1−R²)
VERDICT
Two features with a tunable correlation ρ. With two features, \(R^2 = \rho^2\) exactly, so VIF \(= 1/(1-\rho^2)\) — the curve plotted on the canvas. Slide ρ up: at ρ = 0 the cloud is round and VIF = 1 (no inflation); at ρ = 0.80, R² = 0.64 and VIF ≈ 2.8; past ρ ≈ 0.89 you cross VIF = 5 and the verdict flips to the warning zone; as ρ → 1 the cloud collapses to a line and VIF blows up toward infinity. The reading is the coefficient-variance multiplier the collinearity is costing you.
4.5

Selection bias & nested cross-validation

Here is the most expensive mistake in applied machine learning, and it is committed daily by people who know better. You have 10,000 features and 200 samples. You score every feature against the target on the full dataset, keep the 20 that correlate best, then run cross-validation on those 20 — and report a beautiful cross-validated accuracy. The number is a fiction. You have already let the test folds influence which features survive, so every fold's "held-out" data was used to choose the model. This is feature-selection bias, and with enough noise features it can manufacture impressive cross-validated accuracy out of pure noise.

EQ D4.9 — WHY SELECTION-ON-ALL-DATA LEAKS $$ \widehat{\text{acc}}_{\text{biased}} = \text{CV}\Big(\text{model} \,\big|\, \underbrace{\text{features chosen using all } (X, y)}_{\text{test folds already seen}}\Big) \;\gg\; \widehat{\text{acc}}_{\text{honest}} $$
The selection step is part of the model-fitting procedure, so it must live inside the cross-validation loop, not before it. Pick features on the full data and the labels of every eventual test fold have leaked into the choice; the CV estimate is then biased upward — sometimes wildly. With \(p \gg n\) and only noise, selecting the top-\(k\) "best" features on all the data and then cross-validating can report accuracy far above chance for data with no signal whatsoever. The rule is absolute: every data-dependent decision — imputation statistics, scaling parameters, feature selection, hyperparameters — must be fit on the training portion of each fold alone.

The fix has two layers. First, put feature selection inside the cross-validation: each fold selects its own features from its own training data, and the held-out fold judges that whole pipeline honestly. Second — when you are also tuning something (which \(k\), which \(\lambda\)) — you need nested cross-validation: an inner loop selects features and tunes hyperparameters, an outer loop estimates the performance of that entire selection-and-tuning procedure. The outer fold never touches anything the inner loop saw.

THE RULE

Anything you learn from the data is part of the model and must be cross-validated as a unit. If a step looks at \(y\) — selecting features, fitting an imputer's means, choosing a scaling, tuning \(\lambda\) — it belongs inside the resampling loop. Fit it once on the whole dataset "to save time" and you have leaked the test set into training. The honest pipeline is more code and a smaller, truer number; the biased one is less code and a lie. Nested CV is simply this rule applied twice: once for selection/tuning (inner), once for honest performance estimation (outer).

PYTHON · RUNNABLE IN-BROWSER
# EQ D4.9: selection bias on PURE NOISE. There is no signal at all,
# yet selecting features on all the data fakes high CV accuracy.
import numpy as np
rng = np.random.default_rng(3)
n, p, k = 120, 4000, 20                    # p >> n: a leakage trap
X = rng.normal(0, 1, (n, p))
y = (rng.random(n) > 0.5).astype(float)    # label is a COIN FLIP -- zero signal

def cv_acc(Xs, y, folds=4):
    idx = np.array_split(rng.permutation(len(y)), folds); accs = []
    for f in range(folds):
        te = idx[f]; tr = np.concatenate([idx[g] for g in range(folds) if g != f])
        w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), Xs[tr]]), y[tr]-0.5, rcond=None)[0]
        pred = (np.column_stack([np.ones(len(te)), Xs[te]]) @ w > 0)
        accs.append((pred == (y[te] > 0.5)).mean())
    return np.mean(accs)

corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
top = np.argsort(corr)[-k:]                # <- chosen using ALL of y : the leak
print(f"BIASED  (select on all data) : CV acc = {cv_acc(X[:, top], y):.3f}")

# honest: re-select inside each fold (no peeking at the test fold's labels)
acc_h = []
fold = np.array_split(rng.permutation(n), 4)
for f in range(4):
    te = fold[f]; tr = np.concatenate([fold[g] for g in range(4) if g != f])
    c = np.array([abs(np.corrcoef(X[tr, j], y[tr])[0, 1]) for j in range(p)])
    t = np.argsort(c)[-k:]
    w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), X[tr][:, t]]), y[tr]-0.5, rcond=None)[0]
    pred = (np.column_stack([np.ones(len(te)), X[te][:, t]]) @ w > 0)
    acc_h.append((pred == (y[te] > 0.5)).mean())
print(f"HONEST  (select inside folds): CV acc = {np.mean(acc_h):.3f}  (~0.50, the truth)")
edits are live — break it on purpose
NEXT

Good features and honest selection assume your classes are balanced enough to learn from. They often are not: fraud, disease, and defaults are rare by definition, and a 99%-accurate model that always predicts "no" is worthless. Chapter 05 — Imbalanced Data — covers resampling (SMOTE and friends), class weighting, threshold moving, and the precision/recall-based metrics that tell the truth when accuracy lies.

4.R

References

  1. Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 — the canonical survey of filter, wrapper and embedded selection (§4.3).
  2. Kuhn, M. & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press / open web edition — interactions, encodings, resampling-aware selection and leakage (§4.1–4.5).
  3. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58(1) — the L1 penalty that performs embedded selection (EQ D4.6).
  4. Ambroise, C. & McLachlan, G. J. (2002). Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. PNAS 99(10) — the definitive demonstration of feature-selection bias and why selection must sit inside cross-validation (§4.5).
  5. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46 — recursive feature elimination, EQ D4.5.
  6. Zou, H. & Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B 67(2) — the L1+L2 fix for Lasso's instability under collinearity (EQ D4.6 note).
  7. Kraskov, A., Stögbauer, H. & Grassberger, P. (2004). Estimating Mutual Information. Physical Review E 69(6) — the k-nearest-neighbour estimator behind practical MI feature scores (EQ D4.7).