Creating features: interactions, ratios, polynomials
A model can only learn relationships its inputs make expressible. A linear model on raw columns \(x_1, x_2\) can fit only \(w_0 + w_1 x_1 + w_2 x_2\) — a flat hyperplane. If the truth lives on a curve, or in the product of two variables, no amount of training data and no clever optimizer will recover it: the hypothesis class simply does not contain the answer. Feature engineering changes the hypothesis class by changing the inputs. You are doing, by hand and with domain knowledge, the representation learning that a deep network would otherwise have to discover from scratch — and on tabular data you will frequently win, because you know things about the problem that the data alone does not say.
The three workhorse transforms each inject a specific kind of structure:
| Transform | New feature | What it expresses | Reach for it when… |
|---|---|---|---|
| Interaction | x₁ · x₂ | The effect of one variable depends on another (non-additivity) | Effects are conditional: a drug works only at a certain dose and age |
| Ratio | x₁ / x₂ | Scale-free intensity; a rate rather than a level | Density, price-per-area, debt-to-income — the meaningful quantity is normalized |
| Polynomial | x², x³, … | Smooth curvature in a single variable | Diminishing or accelerating returns; a clear bend in the partial-dependence plot |
The interaction is the most important and the most underused. Consider the exclusive-or pattern: a point is positive when its two coordinates share a sign and negative otherwise. The two classes are perfectly determined, yet no line in the \((x_1, x_2)\) plane separates them: any straight cut leaves a large share of each class on the wrong side. Add one feature, the product \(x_1 x_2\), and the problem collapses to a single threshold: \(x_1 x_2 > 0\). A linear model, unchanged except for its inputs, now solves it exactly. That is the whole thesis of this chapter in one example.
Polynomials generalize this. Polynomial feature expansion of degree \(d\) emits every monomial up to total degree \(d\): for two inputs at degree 2 that is \(\{1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2\}\). The number of terms grows combinatorially — and that growth is the central danger of the technique.
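To make the combinatorial growth concrete, here is a small sketch that enumerates the monomials and checks the count against the closed form \(\binom{p+d}{d}\) for \(p\) inputs at degree \(d\) (constant term included). The helper names `monomials` and `n_terms` are illustrative, not from any library.

```python
from itertools import combinations_with_replacement
from math import comb

def monomials(n_inputs, degree):
    """All monomials of total degree <= `degree`, as tuples of input indices.

    () is the constant 1, (0,) is x1, (0, 1) is x1*x2, (0, 0) is x1^2, and so on.
    """
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(n_inputs), d))
    return terms

def n_terms(n_inputs, degree):
    # Closed form for the number of monomials, constant included.
    return comb(n_inputs + degree, degree)

# Two inputs at degree 2: the six terms {1, x1, x2, x1^2, x1*x2, x2^2}.
print(monomials(2, 2))

# The count explodes quickly as inputs or degree grow.
for p, d in [(2, 2), (10, 2), (10, 3), (100, 2), (100, 3)]:
    print(f"p={p:3d}  d={d}  ->  {n_terms(p, d):7d} features")
```

At 100 raw inputs, degree 3 already yields more columns than most tabular datasets have rows, which is why expansion is usually followed by aggressive selection.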
# EQ D4.1: one interaction feature lets a LINEAR model solve XOR-like data.
import numpy as np
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-1, 1, (n, 2)) # two raw features in [-1, 1]
y = (X[:, 0] * X[:, 1] > 0).astype(float) # XOR pattern: same-sign => class 1

def fit_logreg(F, y, steps=400, lr=0.5):
    # plain batch gradient descent on the logistic loss
    w = np.zeros(F.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(F @ w + b)))
        g = p - y
        w -= lr * (F.T @ g) / len(y); b -= lr * g.mean()
    return w, b

def acc(F, w, b):
    return ((F @ w + b > 0) == (y > 0.5)).mean()

raw = X # [x1, x2] -> a flat plane
poly = np.column_stack([X, X[:, 0]*X[:, 1]]) # add the x1*x2 interaction
wr, br = fit_logreg(raw, y); wp, bp = fit_logreg(poly, y)
print(f"linear on [x1, x2] : accuracy {acc(raw, wr, br):.3f} (~chance)")
print(f"linear on [x1, x2, x1*x2] : accuracy {acc(poly, wp, bp):.3f} (solved)")
print(f"learned weight on x1*x2 : {wp[2]:+.2f} <- it found the product")
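# plot_scatter is not defined in this listing (it looks like a helper from the
# original interactive page). A matplotlib equivalent, assuming it is installed:
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y.astype(int)); plt.show()  # the XOR checkerboard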
A practical warning. Engineered features are not free: each one is another dimension in which the model can overfit, another column to compute and store at serving time, and — for ratios — another place a zero denominator can blow up your pipeline. The discipline is to create with intent (a hypothesis about why this feature should matter) and then prune hard (§4.3). Create generously in the lab; ship parsimoniously.
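The zero-denominator hazard is easy to defuse at feature-creation time. The sketch below is one possible guard, not a prescribed method: `safe_ratio`, `debt`, and `income` are hypothetical names, and filling with NaN plus a companion indicator is an assumption about what the downstream model can handle.

```python
import numpy as np

def safe_ratio(num, den, fill=np.nan):
    """Element-wise num / den that returns `fill` wherever den == 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.full(np.broadcast(num, den).shape, fill, dtype=float)
    # `where` skips the division for zero denominators; those slots keep `fill`.
    np.divide(num, den, out=out, where=(den != 0))
    return out

debt = np.array([1200.0, 0.0, 500.0, 300.0])
income = np.array([4000.0, 2500.0, 0.0, 0.0])

dti = safe_ratio(debt, income)              # debt-to-income ratio
has_income = (income != 0).astype(float)    # flag so the model can see the gap
print(dti, has_income)
```

Whether the fill value should be NaN, zero, or a large sentinel depends on the learner; tree ensembles that handle missing values natively can often take the NaN directly, while linear models need an imputed value alongside the flag.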
Datetime, text & aggregation features
Most real-world signal does not arrive as tidy numeric columns. It arrives as timestamps, free text, and one-to-many relationships between tables. Each demands its own family of feature transforms — and each is where domain knowledge pays off most.
Datetime
A raw timestamp is nearly useless to a model: as a single monotonically increasing integer it can only express "later". The information lives in its components — hour of day, day of week, month, is-weekend, is-holiday, days-since-last-event — extracted into separate features. The subtlety is that several of these are cyclical: hour 23 and hour 0 are adjacent, not maximally distant, yet a plain integer encoding tells the model they are 23 apart. The fix is a sine/cosine pair that wraps the cycle onto a circle.
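A minimal sketch of the component extraction described above, using only the standard library. The function name `datetime_features` and the `HOLIDAYS` set are hypothetical; a real holiday calendar would come from your own domain data.

```python
from datetime import date, datetime

# Hypothetical holiday calendar, for illustration only.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

def datetime_features(ts, last_event=None):
    """Split one timestamp into the separate columns a model can use."""
    dow = ts.weekday()                      # Monday = 0 ... Sunday = 6
    feats = {
        "hour": ts.hour,
        "day_of_week": dow,
        "month": ts.month,
        "is_weekend": int(dow >= 5),
        "is_holiday": int(ts.date() in HOLIDAYS),
    }
    if last_event is not None:
        # Elapsed time since a reference event, in (fractional) days.
        feats["days_since_last_event"] = (ts - last_event).total_seconds() / 86400.0
    return feats

f = datetime_features(datetime(2024, 3, 9, 23, 30),
                      last_event=datetime(2024, 3, 2, 23, 30))
print(f)
```

The integer `hour` and `day_of_week` columns produced here are exactly the ones that still need the sine/cosine treatment before a distance-based or linear model sees them.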
Text
Free text is turned into features along a spectrum of sophistication. The classical baseline is the bag of words / TF-IDF representation: count each term, then down-weight terms that appear in many documents so that common words contribute little and distinctive words contribute much.
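The TF-IDF idea fits in a few lines. This is a sketch of the plain textbook variant, \(\text{tf} \cdot \log(N/\text{df})\), on a toy corpus invented for illustration; library implementations typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    """Term frequency times log inverse document frequency, per term."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}

weights = [tfidf(toks) for toks in tokenized]
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {w:.3f}")
```

"the" appears in every document, so its weight is zero; "mat" appears in only one and gets the largest weight in the first document.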
Aggregation
When the unit of prediction (a customer) maps to many rows in another table (their transactions), you must aggregate the many into features of the one: count, sum, mean, min, max, standard deviation, recency, and ratios of these over time windows. "Mean transaction value over the last 30 days," "number of distinct merchants this week," "ratio of this month's spend to the trailing-6-month average" — these grouped statistics are typically the most predictive features in churn, fraud, and recommendation systems, and they are exactly what automated tooling (featuretools' deep feature synthesis, modern feature stores) was built to manufacture and serve consistently between training and production.
Aggregation and time features are the two richest sources of target leakage. If an aggregate is computed over a window that includes the prediction moment ("average outcome for this customer," "total refunds including the one you are trying to predict"), your offline metric will be spectacular and your production model will fail. Every windowed feature must be computed strictly from information available before the prediction timestamp. The discipline is a point-in-time correct join: as of each event, use only rows that existed then. Leakage through aggregation is one of the most common reasons a model that "worked" in a notebook collapses on deployment.
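A point-in-time correct window can be sketched with plain Python. The transaction log, the `window_features` helper, and the 30-day window are all hypothetical; the one essential detail is the strict inequality `ts < as_of`, which keeps the prediction moment and everything after it out of the aggregate.

```python
from datetime import datetime, timedelta

# Hypothetical transaction log: (customer_id, timestamp, amount).
transactions = [
    ("c1", datetime(2024, 1, 5), 20.0),
    ("c1", datetime(2024, 1, 20), 40.0),
    ("c1", datetime(2024, 2, 1), 90.0),    # happens AT the prediction moment
    ("c1", datetime(2024, 2, 10), 500.0),  # in the future relative to it
    ("c2", datetime(2024, 1, 25), 15.0),
]

def window_features(customer_id, as_of, days=30):
    """Aggregate one customer's rows in [as_of - days, as_of), excluding as_of itself."""
    start = as_of - timedelta(days=days)
    amounts = [amt for cid, ts, amt in transactions
               if cid == customer_id and start <= ts < as_of]
    n = len(amounts)
    return {
        "txn_count": n,
        "txn_sum": sum(amounts),
        "txn_mean": sum(amounts) / n if n else 0.0,
        "txn_max": max(amounts) if n else 0.0,
    }

feats = window_features("c1", as_of=datetime(2024, 2, 1))
print(feats)
```

Changing `ts < as_of` to `ts <= as_of` would pull the 90.0 transaction into the features, which is precisely the kind of leak that looks harmless in code review.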
# EQ D4.3: cyclical hour encoding keeps midnight next to 11pm.
import numpy as np
hours = np.arange(24)
P = 24
hs = np.sin(2*np.pi*hours/P)
hc = np.cos(2*np.pi*hours/P)

def dist(a, b, vec):                        # Euclidean distance in feature space
    return np.hypot(vec[0][a]-vec[0][b], vec[1][a]-vec[1][b])

raw = (hours.astype(float), np.zeros(24))   # raw integer "encoding" (1-D, zero second axis)
cyc = (hs, hc)                              # sin/cos pair (2-D, on a circle)
print(" raw-integer dist cyclical dist")
for a, b, name in [(23, 0, "23h -> 00h"), (0, 12, "00h -> 12h"), (6, 18, "06h -> 18h")]:
    print(f"{name:14s} {dist(a, b, raw):>8.2f} {dist(a, b, cyc):>8.3f}")
print("\nraw says 23h and 00h are 23 apart (max); cyclical says they are adjacent.")
print("00h<->12h and 06h<->18h are the true opposites -> largest cyclical distance.")
# plot_xy is not defined in this listing (it looks like a helper from the
# original interactive page). A matplotlib equivalent, assuming it is installed:
import matplotlib.pyplot as plt
plt.plot(hs, hc, "o"); plt.gca().set_aspect("equal"); plt.show()  # the 24 hours on a circle
Feature selection: filter, wrapper, embedded
§4.1 generates features by the hundred; §4.3 throws most of them away. Selection matters for three reasons that compound: fewer features means less overfitting (especially when \(p\) approaches or exceeds \(n\)), faster and cheaper models in training and serving, and — often most valuable — a model a human can actually read. The three families of methods trade compute against fidelity to the final model.
| Family | How it scores features | Cost | Blind spot |
|---|---|---|---|
| Filter | Univariate statistic vs the target (correlation, MI, χ², ANOVA F), model-agnostic | cheap | Judges each feature alone — misses interactions and redundancy |
| Wrapper | Train the model on candidate subsets, search for the best (forward, backward, RFE) | expensive | Combinatorial; prone to overfitting the search itself |
| Embedded | Selection happens inside training (L1/Lasso zeros weights; trees rank by gain) | moderate | Tied to one model family; unstable under collinearity |
Filter methods score every feature against the target independently and keep the top \(k\). They are blisteringly fast and a fine first pass, but their independence assumption is exactly their weakness: a filter ranks each feature in isolation, so it will happily keep ten copies of the same signal and discard a feature that is useless alone yet decisive in combination. In the XOR data of §4.1, \(x_1\) and \(x_2\) each have essentially zero univariate correlation with the label, yet together they determine it completely; a filter run on the raw columns would discard both.
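A correlation filter on XOR-style data makes the blind spot visible. This sketch scores each candidate column alone by absolute Pearson correlation with the label; the column names and the added noise column are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = (x1 * x2 > 0).astype(float)            # same-sign => class 1

candidates = {
    "x1": x1,
    "x2": x2,
    "x1*x2": x1 * x2,                      # the engineered interaction
    "noise": rng.normal(0, 1, n),          # pure noise, for reference
}

# Filter score: |Pearson correlation| of each column with the label, one at a time.
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in candidates.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:6s} |corr| = {s:.3f}")
```

The raw coordinates score about as low as the noise column, so a top-\(k\) filter on raw inputs cannot tell them apart from junk. Once the product has been engineered, the same filter ranks it first.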
Wrapper methods close that gap by judging features through the actual model. Recursive feature elimination (RFE) is the canonical example: train the model on all features, drop the least important one, refit, and repeat until the target count remains. Because each feature is scored in the context of all the others, RFE can retain a feature whose value shows only in combination (as long as the underlying model can represent that combination) and can discard redundant copies. The price is training the model many times.
Embedded methods fold selection into the fit itself. L1 regularization (the Lasso) adds a penalty proportional to the sum of absolute weights; the geometry of that penalty drives many coefficients to exactly zero, performing selection and fitting in a single optimization.
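To see the exact zeros appear, here is a sketch of the Lasso solved by cyclic coordinate descent with soft-thresholding, one standard way to minimize \(\tfrac{1}{2n}\lVert y - Xw\rVert^2 + \lambda \lVert w \rVert_1\). The data mirror the three-signals-in-twelve setup used elsewhere in this chapter; `lasso_cd`, the penalty `lam=0.2`, and the sweep count are illustrative choices, not tuned values.

```python
import numpy as np

def soft_threshold(rho, lam):
    # Shrink toward zero by lam; anything within [-lam, lam] becomes exactly 0.
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, sweeps=200):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ w
    for _ in range(sweeps):
        for j in range(p):
            resid += X[:, j] * w[j]            # remove feature j's current contribution
            rho = X[:, j] @ resid / n          # its correlation with the partial residual
            w[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * w[j]            # put the updated contribution back
    return w

rng = np.random.default_rng(1)
n, p = 300, 12
X = rng.normal(0, 1, (n, p))
X = X - X.mean(0)                              # centered, so no intercept is needed
y = 3.0*X[:, 2] - 2.0*X[:, 5] + 1.5*X[:, 9] + 0.3*rng.normal(0, 1, n)
y = y - y.mean()

w = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(w)                   # indices with non-zero weight
print("non-zero coefficients at:", selected.tolist())
print("weights:", np.round(w, 2))
```

The surviving weights are shrunk toward zero relative to the generating coefficients; that bias is the cost of the penalty, and a common follow-up is to refit an unpenalized model on the selected columns.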
# EQ D4.5: recursive feature elimination by hand on a linear model.
# 3 of 12 features are real signal; RFE should recover exactly those 3.
import numpy as np
rng = np.random.default_rng(1)
n, p, k = 300, 12, 3
X = rng.normal(0, 1, (n, p))
X /= X.std(0) # standardize so |coef| is comparable
true = [2, 5, 9] # the only features that matter
y = 3.0*X[:, 2] - 2.0*X[:, 5] + 1.5*X[:, 9] + 0.3*rng.normal(0, 1, n)

def ridge_coef(Xs, y, lam=1.0):              # closed-form ridge => stable importances
    A = Xs.T @ Xs + lam*np.eye(Xs.shape[1])
    return np.linalg.solve(A, Xs.T @ y)

kept = list(range(p))
while len(kept) > k:
    w = ridge_coef(X[:, kept], y)
    drop = int(np.argmin(np.abs(w)))         # j* : smallest |coef| in the refit model
    print(f"have {len(kept):2d} -> drop original feature #{kept[drop]:2d} (|coef|={abs(w[drop]):.3f})")
    kept.pop(drop)

print("\nRFE kept :", sorted(kept))
print("truth :", true)
print("match :", sorted(kept) == true)
Importance & redundancy: MI, correlation, VIF
"Is this feature useful?" splits into two distinct questions that beginners conflate. Importance: how much does this feature tell me about the target? Redundancy: how much of this feature is already told by the others? You want features that score high on the first and low on the second — informative and non-overlapping. Three measures cover the ground.
Correlation is the cheap importance measure, but it sees only linear association (Vol & DATA 03). A feature with a perfect quadratic relationship to the target can have correlation zero. Mutual information fixes this: it measures any statistical dependence, linear or not, in bits.
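The contrast can be checked numerically. The sketch below uses a simple histogram (plug-in) estimate of mutual information, which is an assumption of convenience: it is biased upward on small samples and sensitive to the bin count, and nearest-neighbour estimators are generally preferred in practice. The function name `mutual_info_bits` is illustrative.

```python
import numpy as np

def mutual_info_bits(x, y, bins=16):
    """Plug-in mutual information estimate from a 2-D histogram, in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # joint distribution over bin pairs
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nz = pxy > 0                               # skip empty cells (0 * log 0 = 0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1, 1, n)
y_quad = x ** 2                                # perfect, but purely nonlinear, dependence
y_indep = rng.normal(0, 1, n)                  # no dependence at all

r_quad = np.corrcoef(x, y_quad)[0, 1]
mi_quad = mutual_info_bits(x, y_quad)
mi_indep = mutual_info_bits(x, y_indep)

print(f"y = x^2     : corr = {r_quad:+.3f}   MI = {mi_quad:.2f} bits")
print(f"independent : MI = {mi_indep:.2f} bits (small positive value is estimator bias)")
```

Correlation reports nothing for the quadratic target because \(x\) is symmetric about zero, so the positive and negative halves cancel; mutual information has no such cancellation.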
Redundancy is the other axis, and it has its own canonical diagnostic. When several features are linear combinations of one another (multicollinearity), a linear model can still predict fine, but its coefficients become unstable and uninterpretable: the model cannot decide how to split credit between the duplicates, so tiny data changes swing the weights wildly (and sometimes flip their signs). The variance inflation factor (VIF) quantifies how much of each feature is already explained by the rest: the better the other features predict it, the higher its VIF.
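For reference, the standard definition can be written out as follows, where \(R_j^2\) is the coefficient of determination from regressing feature \(j\) on all the other features (with an intercept), \(s_j^2\) is the sample variance of feature \(j\), and \(\sigma^2\) is the noise variance of an ordinary least-squares fit. The symbols \(s_j^2\) and \(\sigma^2\) are notation introduced here.

```latex
\mathrm{VIF}_j \;=\; \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}\!\big(\hat\beta_j\big) \;=\; \frac{\sigma^2}{(n-1)\, s_j^2}\,\cdot\,\mathrm{VIF}_j
```

The second identity is where the name comes from: the variance of the coefficient estimate is inflated by exactly \(\mathrm{VIF}_j\) relative to the case where feature \(j\) is uncorrelated with the others. With \(R_j^2 = 0.80\) the factor is 5, with \(0.90\) it is 10, and with \(0.99\) it is 100.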
# EQ D4.8: variance inflation factor, computed directly from R^2.
import numpy as np
rng = np.random.default_rng(0)
n = 600
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = 0.9*x1 + 0.1*rng.normal(0, 1, n) # x3 is almost a copy of x1 -> high VIF
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3 (~x1)"]

def vif(X, j):                               # regress column j on the others, read R^2
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2), r2

for j, nm in enumerate(names):
    v, r2 = vif(X, j)
    flag = " <- collinear!" if v > 5 else ""
    print(f"{nm:10s} R^2={r2:5.3f} VIF={v:6.2f}{flag}")
print("\nx1 and x3 inflate each other; x2 is independent and sits near VIF=1.")
print("check: R^2=0.80 -> VIF = 1/(1-0.80) =", round(1/(1-0.80), 2))
Selection bias & nested cross-validation
Here is the most expensive mistake in applied machine learning, and it is committed daily by people who know better. You have 10,000 features and 200 samples. You score every feature against the target on the full dataset, keep the 20 that correlate best, then run cross-validation on those 20 — and report a beautiful cross-validated accuracy. The number is a fiction. You have already let the test folds influence which features survive, so every fold's "held-out" data was used to choose the model. This is feature-selection bias, and with enough noise features it can manufacture impressive cross-validated accuracy out of pure noise.
The fix has two layers. First, put feature selection inside the cross-validation: each fold selects its own features from its own training data, and the held-out fold judges that whole pipeline honestly. Second, when you are also tuning something (which \(k\), which \(\lambda\)), you need nested cross-validation: an inner loop selects features and tunes hyperparameters, an outer loop estimates the performance of that entire selection-and-tuning procedure. The outer test fold is never seen by the inner loop, so nothing chosen there can be fitted to it.
Anything you learn from the data is part of the model and must be cross-validated as a unit. If a step learns from the data, whether it looks at \(y\) (selecting features, tuning \(\lambda\)) or only at \(X\) (fitting an imputer's means, choosing a scaling), it belongs inside the resampling loop. Fit it once on the whole dataset "to save time" and you have leaked the test set into training. The honest pipeline is more code and a smaller, truer number; the biased one is less code and a lie. Nested CV is simply this rule applied twice: once for selection/tuning (inner), once for honest performance estimation (outer).
# EQ D4.9: selection bias on PURE NOISE. There is no signal at all,
# yet selecting features on all the data fakes high CV accuracy.
import numpy as np
rng = np.random.default_rng(3)
n, p, k = 120, 4000, 20 # p >> n: a leakage trap
X = rng.normal(0, 1, (n, p))
y = (rng.random(n) > 0.5).astype(float) # label is a COIN FLIP -- zero signal

def cv_acc(Xs, y, folds=4):
    idx = np.array_split(rng.permutation(len(y)), folds); accs = []
    for f in range(folds):
        te = idx[f]; tr = np.concatenate([idx[g] for g in range(folds) if g != f])
        w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), Xs[tr]]), y[tr]-0.5, rcond=None)[0]
        pred = (np.column_stack([np.ones(len(te)), Xs[te]]) @ w > 0)
        accs.append((pred == (y[te] > 0.5)).mean())
    return np.mean(accs)

corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
top = np.argsort(corr)[-k:] # <- chosen using ALL of y : the leak
print(f"BIASED (select on all data) : CV acc = {cv_acc(X[:, top], y):.3f}")
# honest: re-select inside each fold (no peeking at the test fold's labels)
acc_h = []
fold = np.array_split(rng.permutation(n), 4)
for f in range(4):
    te = fold[f]; tr = np.concatenate([fold[g] for g in range(4) if g != f])
    c = np.array([abs(np.corrcoef(X[tr, j], y[tr])[0, 1]) for j in range(p)])
    t = np.argsort(c)[-k:]               # selected from the training rows only
    w = np.linalg.lstsq(np.column_stack([np.ones(len(tr)), X[tr][:, t]]), y[tr]-0.5, rcond=None)[0]
    pred = (np.column_stack([np.ones(len(te)), X[te][:, t]]) @ w > 0)
    acc_h.append((pred == (y[te] > 0.5)).mean())
print(f"HONEST (select inside folds): CV acc = {np.mean(acc_h):.3f} (~0.50, the truth)")
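The listing above puts selection inside the folds but keeps \(k\) fixed. When \(k\) is itself tuned, the nested version looks roughly like this. It is a sketch under stated assumptions: the same pure-noise setup (with a smaller \(p\) to keep it quick), a hypothetical candidate grid `[5, 10, 20]`, and helper names (`top_k_by_corr`, `fit_predict`, `splits`, `inner_choose_k`) invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 120, 2000
X = rng.normal(0, 1, (n, p))
y = (rng.random(n) > 0.5).astype(float)        # coin-flip labels: zero signal

def top_k_by_corr(Xtr, ytr, k):
    """Indices of the k columns with the largest |correlation| with ytr."""
    Xc = Xtr - Xtr.mean(0)
    yc = ytr - ytr.mean()
    corr = (Xc.T @ yc) / (np.sqrt((Xc**2).sum(0)) * np.sqrt((yc**2).sum()) + 1e-12)
    return np.argsort(np.abs(corr))[-k:]

def fit_predict(Xtr, ytr, Xte, k):
    """Select k features on the training rows, fit least squares, predict test rows."""
    cols = top_k_by_corr(Xtr, ytr, k)
    A = np.column_stack([np.ones(len(Xtr)), Xtr[:, cols]])
    w = np.linalg.lstsq(A, ytr - 0.5, rcond=None)[0]
    return np.column_stack([np.ones(len(Xte)), Xte[:, cols]]) @ w > 0

def splits(idx, folds):
    """Yield (train_idx, test_idx) pairs for a k-fold split of idx."""
    parts = np.array_split(idx, folds)
    for f in range(folds):
        yield np.concatenate([parts[g] for g in range(folds) if g != f]), parts[f]

def inner_choose_k(Xtr, ytr, ks, folds=3):
    """Inner loop: pick k by cross-validation using only the outer-training rows."""
    idx = rng.permutation(len(ytr))
    scores = []
    for k in ks:
        accs = [(fit_predict(Xtr[tr], ytr[tr], Xtr[te], k) == (ytr[te] > 0.5)).mean()
                for tr, te in splits(idx, folds)]
        scores.append(np.mean(accs))
    return ks[int(np.argmax(scores))]

# Outer loop: estimate the performance of the whole select-and-tune procedure.
outer_accs = []
for tr, te in splits(rng.permutation(n), 4):
    k_best = inner_choose_k(X[tr], y[tr], ks=[5, 10, 20])
    pred = fit_predict(X[tr], y[tr], X[te], k_best)   # refit on all outer-training rows
    outer_accs.append((pred == (y[te] > 0.5)).mean())

nested_acc = float(np.mean(outer_accs))
print(f"NESTED CV (select + tune inside): acc = {nested_acc:.3f}")
```

Each outer fold may settle on a different \(k\), and that is expected: the outer score estimates how well the procedure generalizes, not how well any single chosen configuration does.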
Good features and honest selection assume your classes are balanced enough to learn from. They often are not: fraud, disease, and defaults are rare by definition, and a 99%-accurate model that always predicts "no" is worthless. Chapter 05 — Imbalanced Data — covers resampling (SMOTE and friends), class weighting, threshold moving, and the precision/recall-based metrics that tell the truth when accuracy lies.