Backtesting & walk-forward validation
The cross-validation you learned for tabular data — shuffle the rows, hold out a random fold — is poison for a time series. Shuffling lets the model train on Thursday to predict Wednesday; random folds leak the future into the past. Temporal order is the whole point of the data, so the evaluation must respect it: train only on the past, test only on the future, never the reverse.
The disciplined way to do this is walk-forward validation (also called rolling-origin or time-series cross-validation). Fix a forecast horizon \(h\). Train on data up to some origin \(t\), forecast the next \(h\) steps, score them against what actually happened, then slide the origin forward and repeat. You end up with many out-of-sample forecasts at many origins — a far more honest estimate of live performance than a single train/test split, which can be lucky or unlucky depending on where you happened to cut.
Two practical refinements separate a toy backtest from a trustworthy one. First, leave a gap between train and test when your features embed a look-back or your labels arrive late, so information cannot bleed across the seam (this is the idea behind purged and embargoed cross-validation in finance). Second, refit the model at every origin if you can afford it — a model re-estimated as the window slides mimics what you would actually do in production, whereas freezing the parameters at the first origin quietly over-states stability.
# Walk-forward backtest: naive vs AR(1), scored by MASE (EQ T6.1 + T6.3).
import numpy as np
rng = np.random.default_rng(0)
# A trending, noisy AR(1)-ish series of 120 points.
n = 120
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + 0.05 * t + rng.normal(0, 1.0)
H = 1 # one-step-ahead horizon
start = 60 # first forecast origin
abs_naive, abs_ar = [], []
for t in range(start, n - H):
    train = y[:t + 1] # ONLY the past -- no leakage
    naive = train[-1] # last value carried forward
    # AR(1) fit by least squares on the training window
    a = np.vstack([train[:-1], np.ones(t)]).T
    phi, c = np.linalg.lstsq(a, train[1:], rcond=None)[0]
    ar = phi * train[-1] + c
    actual = y[t + H]
    abs_naive.append(abs(actual - naive))
    abs_ar.append(abs(actual - ar))
# MASE = mean(|model error|) / mean(|naive one-step error in-sample|)
scale = np.mean(np.abs(np.diff(y[:start]))) # the naive yardstick
mase_naive = np.mean(abs_naive) / scale
mase_ar = np.mean(abs_ar) / scale
print(f"in-sample naive scale (mean |y_t - y_t-1|): {scale:.3f}")
print(f"MASE naive forecast : {mase_naive:.3f}")
print(f"MASE AR(1) forecast : {mase_ar:.3f}")
print("AR(1) beats naive out-of-sample:", mase_ar < mase_naive)
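The gap refinement described earlier can be sketched as a small split generator. This is a minimal illustration, not a library API: `walk_forward_splits` and its parameters `gap` and `step` are hypothetical names, and the expanding-window layout is an assumption chosen to match the backtest above (with `gap=0`, `horizon=1`, `step=1` it visits the same origins).

```python
import numpy as np

def walk_forward_splits(n, start, horizon=1, gap=0, step=1):
    """Yield (train_idx, test_idx) for an expanding-window backtest.

    Train covers indices 0..origin. The next `gap` points are skipped
    (the embargo), and the test set is the `horizon` points after that.
    """
    origin = start
    # Stop once the last test index would fall off the end of the series.
    while origin + gap + horizon < n:
        train_idx = np.arange(0, origin + 1)
        first_test = origin + gap + 1
        test_idx = np.arange(first_test, first_test + horizon)
        yield train_idx, test_idx
        origin += step

# Toy usage: 20 points, 3-point embargo, 2-step horizon, origin moves by 4.
splits = list(walk_forward_splits(n=20, start=9, horizon=2, gap=3, step=4))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]:2d} | test {test_idx.tolist()}")
```

Refitting at every origin then amounts to fitting the model inside the loop on `train_idx` alone, which is what the AR(1) example above already does.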
Forecast accuracy — MAPE, MASE, sMAPE
A backtest gives you errors; a metric turns them into one comparable number. The choice is not cosmetic — each metric has a failure mode, and picking the wrong one for your data is how people ship models that look great offline and disappoint in production.
The intuitive starting point is the Mean Absolute Percentage Error: average the absolute error as a fraction of the actual value, so a forecast that is off by 10 on a quantity of 100 scores the same 10% as a forecast off by 1 on a quantity of 10.
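MAPE has two pathologies: it is undefined whenever an actual value is zero (and explodes when one is merely close to zero), and it treats over- and under-forecasts unevenly. The fix is the Mean Absolute Scaled Error of Hyndman & Koehler (2006). Instead of dividing by the actual value, divide the model's mean absolute error by the in-sample mean absolute error of a dirt-simple benchmark: the naive forecast, which just carries the last observed value forward. The scaling makes MASE unitless, defined even when \(y_t = 0\), symmetric in the sign of the error, and, its whole point, readable as a comparison against the benchmark that any model must beat to justify its existence: below 1 means the model's errors are smaller than the naive forecast's in-sample one-step errors, above 1 means larger.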
A third metric, the symmetric MAPE, was introduced to tame MAPE's over-/under-forecast asymmetry by putting both the actual and the forecast in the denominator. It is bounded and was used in the M3 and M4 competitions, but "symmetric" is a misnomer — it is still not perfectly even-handed, and it too misbehaves when both values approach zero. Know it because you will meet it in benchmark tables; prefer MASE when you get to choose.
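For reference, the three metrics can be written out as follows. This is a sketch in notation assumed here rather than taken from the text: \(y_t\) and \(\hat{y}_t\) are the actual and forecast values over \(n\) test points, and \(y_1, \dots, y_T\) is the training history used for the MASE scale. The sMAPE shown is the variant with the averaged denominator (range 0 to 200%); other variants exist.

\[ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \]

\[ \mathrm{sMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{|y_t - \hat{y}_t|}{\left(|y_t| + |\hat{y}_t|\right)/2} \]

\[ \mathrm{MASE} = \frac{\frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|}{\frac{1}{T-1} \sum_{i=2}^{T} |y_i - y_{i-1}|} \]

The MASE denominator is the mean absolute one-step error of the naive forecast on the training history, so it never touches the test period.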
# MAPE, MASE and sMAPE side by side, in plain numpy (EQ T6.2-T6.4).
import numpy as np
# A short held-out test set + the model's forecasts for it.
y_train = np.array([100, 102, 101, 105, 110, 108, 112, 115], float) # history
y_test = np.array([118, 120, 119, 125], float) # truth
y_hat = np.array([116, 121, 122, 123], float) # forecast
abs_err = np.abs(y_test - y_hat)
mape = 100 * np.mean(abs_err / np.abs(y_test))
smape = 100 * np.mean(abs_err / ((np.abs(y_test) + np.abs(y_hat)) / 2))
scale = np.mean(np.abs(np.diff(y_train))) # naive one-step MAE on TRAIN
mase = np.mean(abs_err) / scale
# What the naive (last-value) forecast would have scored, for contrast.
naive_hat = np.full_like(y_test, y_train[-1])
mase_naive = np.mean(np.abs(y_test - naive_hat)) / scale
print(f"MAPE : {mape:6.2f} %")
print(f"sMAPE : {smape:6.2f} %")
print(f"MASE : {mase:6.3f} (model)")
print(f"MASE : {mase_naive:6.3f} (naive baseline, last value held over all 4 test steps)")
print("model beats naive:", mase < mase_naive)
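The failure modes of the percentage metrics are easy to reproduce. The sketch below uses made-up single-point examples (the numbers are illustrative, not from any dataset) to show MAPE blowing up at a zero actual, MAPE capping under-forecasts at 100% while over-forecasts are unbounded, and sMAPE scoring the same absolute miss differently depending on its direction.

```python
import numpy as np

def mape(y, y_hat):
    return 100 * np.mean(np.abs(y - y_hat) / np.abs(y))

def smape(y, y_hat):
    return 100 * np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

actual = np.array([100.0])

# 1) MAPE is undefined at zero: numpy returns inf (warning silenced here).
with np.errstate(divide="ignore"):
    mape_zero = mape(np.array([0.0]), np.array([5.0]))
print("MAPE with a zero actual:", mape_zero)            # inf

# 2) For non-negative forecasts, the worst under-forecast scores 100%,
#    while an over-forecast has no ceiling.
print("MAPE, forecast 0   :", mape(actual, np.array([0.0])))    # 100.0
print("MAPE, forecast 300 :", mape(actual, np.array([300.0])))  # 200.0

# 3) sMAPE: the same miss of 50 scores differently by direction.
smape_over = smape(actual, np.array([150.0]))    # 50 / 125 -> 40.0
smape_under = smape(actual, np.array([50.0]))    # 50 / 75  -> 66.67
print(f"sMAPE over  by 50: {smape_over:.2f} %")
print(f"sMAPE under by 50: {smape_under:.2f} %")
```

MASE has neither problem, because its denominator depends only on the training history and not on the individual actual or forecast values.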
Prediction intervals
A single number — the point forecast — is a lie of omission. The honest output of a forecaster is a distribution, or at least an interval: "demand next week is 1,200 ± 300 with 90% confidence." Decisions are made on the interval, not the point — safety stock, capital buffers, staffing all hinge on the downside, not the median. And here is the uncomfortable truth of applied forecasting: point forecasts are often decent and the intervals are usually too narrow.
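For a model that assumes Gaussian errors with standard deviation \(\sigma_h\) at horizon \(h\), the symmetric prediction interval is the familiar \(z\)-band around the point forecast:
\[ \hat{y}_{t+h} \pm z_{1-\alpha/2}\,\sigma_h \]
where \(z_{1-\alpha/2}\) is the standard normal quantile (1.645 for a 90% interval, 1.96 for 95%). The band is only as honest as the \(\sigma_h\) fed into it: under-state the error scale and the interval covers less than its nominal level.
That last sentence is the whole discipline. A 90% prediction interval is calibrated if, over many forecasts, the truth falls inside it close to 90% of the time. Measure it directly: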
# Interval coverage: a model that mis-estimates sigma is mis-calibrated (T6.5/6.6).
import numpy as np
rng = np.random.default_rng(3)
n = 4000
true_sigma = 1.0 # the real one-step error scale
model_sigma = 0.7 # the model THINKS errors are smaller
z95 = 1.96
errors = rng.normal(0, true_sigma, n) # realized forecast errors
half = z95 * model_sigma # the model's 95% half-width
inside = np.abs(errors) <= half # did truth land in the band?
emp_cov = inside.mean()
# Analytic check: coverage = 2*Phi(1.96*model_sigma/true_sigma) - 1
def Phi(x): # normal CDF via erf-free approximation
    t = 1 / (1 + 0.2316419 * abs(x))
    d = 0.3989423 * np.exp(-x * x / 2)
    p = d * t * (0.3193815 + t*(-0.3565638 + t*(1.781478
        + t*(-1.821256 + t*1.330274))))
    return 1 - p if x >= 0 else p
analytic = 2 * Phi(z95 * model_sigma / true_sigma) - 1
print(f"nominal coverage : 95.0 %")
print(f"empirical coverage : {100*emp_cov:5.1f} %")
print(f"analytic coverage : {100*analytic:5.1f} %")
print("under-estimating sigma -> OVER-CONFIDENT, real coverage < nominal:",
      emp_cov < 0.95)
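One possible remedy, sketched here under stated assumptions rather than prescribed by the text, is to size the band from held-out forecast errors instead of the model's own \(\sigma\): take the empirical quantile of absolute backtest errors as the half-width. This is in the spirit of split conformal prediction (Shafer & Vovk, 2008), not a full implementation of it. The setup is hypothetical: `calib_err` stands for errors from earlier backtest origins and `eval_err` for errors from later ones, and both are assumed to come from the same distribution, which drift can break.

```python
import numpy as np

rng = np.random.default_rng(7)
true_sigma, model_sigma = 1.0, 0.7      # same mis-specification as above

# Hypothetical out-of-sample errors: one batch to calibrate, one to evaluate.
calib_err = rng.normal(0, true_sigma, 2000)
eval_err = rng.normal(0, true_sigma, 2000)

nominal = 0.95
half_model = 1.96 * model_sigma                             # model's own band
half_empirical = np.quantile(np.abs(calib_err), nominal)    # data-driven band

cov_model = np.mean(np.abs(eval_err) <= half_model)
cov_empirical = np.mean(np.abs(eval_err) <= half_empirical)

print(f"half-width  model: {half_model:.3f}  empirical: {half_empirical:.3f}")
print(f"coverage    model: {100*cov_model:.1f} %  empirical: {100*cov_empirical:.1f} %")
```

The empirical band is wider because it is measured on errors the model actually made, not on the error scale the model believes in.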
ML & DL for time series — Prophet, DeepAR, TFT
Classical methods (Vol II's ARIMA and exponential smoothing) are still astonishingly hard to beat on a single series — the M-competitions have shown this for decades, and a tuned ETS or ARIMA remains a serious baseline. The case for machine learning grows with the number of related series: when you have thousands of products, stores, or sensors, a single global model trained across all of them shares statistical strength, handles cold-start items, and ingests covariates that classical per-series models cannot. Three landmarks define the modern stack.
Prophet — structured, interpretable, robust
Prophet (Taylor & Letham, 2018) is not deep learning at all; it is a decomposable additive model — trend + seasonality + holidays — fit in a Bayesian framework. Its appeal is operational: it is robust to missing data and outliers, exposes human-tunable knobs (changepoint flexibility, seasonality strength, named holidays), and gives analysts who are not forecasting specialists a sane default. The cost is that it bakes in a structural assumption; when the series does not decompose that way, Prophet is mediocre, and it should be treated as a strong, legible baseline rather than a state-of-the-art engine.
DeepAR — global autoregressive RNN, probabilistic by design
DeepAR (Salinas et al., 2020) trains one autoregressive RNN across all series in a dataset and, critically, outputs the parameters of a probability distribution at each step (e.g. the mean and variance of a Gaussian, or a negative binomial for counts) rather than a point. Forecasts are generated by sampling forward, yielding full predictive distributions — prediction intervals for free, and well-calibrated ones when trained properly. It was the result that made deep probabilistic forecasting credible at scale, and the architecture under many production demand-forecasting systems.
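The "sample forward" step can be illustrated without a neural network. In the sketch below, `predict_params` is a hypothetical stand-in for a trained model: it maps the previous value to the mean and standard deviation of a Gaussian for the next step. This is a toy, not DeepAR itself; it only shows how drawing a value, feeding it back in, and repeating turns per-step distribution parameters into multi-step predictive quantiles.

```python
import numpy as np

def predict_params(prev):
    """Hypothetical stand-in for a trained network: previous value in,
    (mean, std) of the next-step Gaussian out."""
    return 0.8 * prev + 1.0, 0.5

def sample_paths(last_obs, horizon, n_paths, rng):
    """Draw n_paths trajectories by sampling one step at a time and
    feeding each draw back in as the next input."""
    paths = np.empty((n_paths, horizon))
    prev = np.full(n_paths, last_obs, dtype=float)
    for h in range(horizon):
        mu, sigma = predict_params(prev)
        prev = rng.normal(mu, sigma)      # one draw per path
        paths[:, h] = prev
    return paths

rng = np.random.default_rng(0)
paths = sample_paths(last_obs=5.0, horizon=6, n_paths=5000, rng=rng)

# Per-horizon quantiles of the sampled paths give the predictive band.
lo, med, hi = np.quantile(paths, [0.05, 0.5, 0.95], axis=0)
for h in range(paths.shape[1]):
    print(f"h={h+1}: median {med[h]:.2f}, 90% band [{lo[h]:.2f}, {hi[h]:.2f}]")
```

The band widens with the horizon because each step's sampled uncertainty is propagated into the next step's input.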
TFT — attention, covariates, and interpretability
The Temporal Fusion Transformer (Lim et al., 2021) is the attention-era synthesis: it cleanly separates static metadata, known-future covariates (holidays, promotions you have already scheduled), and observed-past inputs; uses variable-selection networks to weight features; and applies interpretable multi-head attention so you can read which past time steps and which variables drove a forecast. It outputs quantiles directly via pinball loss, so interval forecasts are native, though their calibration still has to be checked empirically. On the rich multi-horizon, multi-covariate benchmarks in the original paper it outperformed the earlier deep baselines, at the cost of being heavier to train and tune than everything above it.
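The pinball (quantile) loss mentioned above is short enough to write out. The sketch below is plain numpy on synthetic Gaussian data, not TFT code: it checks that the constant forecast minimizing the loss at level \(\tau = 0.9\) lands on the empirical 0.9 quantile, which is why training on this loss yields quantile forecasts. The function name `pinball_loss` and the grid search are illustrative choices.

```python
import numpy as np

def pinball_loss(y, q_hat, tau):
    """Mean pinball loss of forecast q_hat at quantile level tau.

    Under-forecasts (y > q_hat) cost tau per unit,
    over-forecasts (y < q_hat) cost (1 - tau) per unit.
    """
    diff = y - q_hat
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, 20000)        # synthetic outcomes

tau = 0.9
candidates = np.linspace(8.0, 16.0, 801)            # grid of constant forecasts
losses = np.array([pinball_loss(y, c, tau) for c in candidates])
best = candidates[np.argmin(losses)]

print(f"loss-minimizing forecast : {best:.2f}")
print(f"empirical 0.9 quantile   : {np.quantile(y, tau):.2f}")
```

Scoring a 0.05 and a 0.95 quantile forecast this way gives a 90% interval whose two ends are each trained against their own asymmetric penalty.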
| Method | Family | Probabilistic? | Best when… |
|---|---|---|---|
| ARIMA / ETS | classical, per-series | via residual σ | one or few series; strong baseline; full interpretability |
| Prophet | additive decomposition | Bayesian intervals | analyst-friendly trend+seasonality+holidays |
| DeepAR | global autoregressive RNN | yes — samples a distribution | many related series; counts; cold-start items |
| TFT | attention transformer | yes — quantile outputs | rich covariates; multi-horizon; need interpretability |
The honest caveat. Deep models are not free wins. The M4 competition was won by a hybrid of exponential smoothing and an RNN, and M5 by gradient-boosted trees (LightGBM) over engineered features — not by a pure transformer. The 2023–2025 wave of "foundation" time-series models (TimeGPT, Chronos, Moirai, TimesFM) brings zero-shot forecasting and is genuinely useful, but whether they consistently beat a well-tuned local model on your data is still contested. Backtest them against a naive and an ARIMA baseline before you believe the leaderboard.
Pitfalls — leakage, look-ahead & drift
Almost every forecasting disaster traces to the same root cause: the offline score was computed on information the model would not have had in production. The model looks brilliant in the notebook and falls apart on day one. The leaks are subtle, which is why they survive code review.
Look-ahead bias / data leakage. The cardinal sin is letting any post-origin information into a pre-origin decision. The classic offenders:
- Scaling on the full series. Computing a mean/std (or min/max) over all data and then splitting bakes future statistics into the training features. Fit the scaler on the training window only, inside each fold.
- Global imputation and feature engineering. Forward-filling, interpolating, or computing rolling features across the train/test seam smears the future backward. Every transform must be causal — computed from the past alone.
- Target leakage from late-arriving data. A feature that is only known after the target is realized (a revised figure, a settled outcome) is not available at forecast time. Use the value as it stood at the origin, not its final restatement.
- Tuning on the test set. Choosing hyper-parameters or a model by peeking at the same period you report — the slowest leak, because it hides in your workflow rather than your code. Use a separate validation split or nested walk-forward.
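The first offender in the list, scaling on the full series, can be made concrete. The sketch below uses a synthetic trending series (an assumption chosen so the leak is visible) and contrasts statistics computed over all data with statistics computed on the training window only.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.cumsum(rng.normal(0.5, 1.0, 200))   # upward drift: the future sits higher
split = 150
train, test = y[:split], y[split:]

# Leaky: mean/std over the whole series, so test-period values shape
# the features the model trains on.
mu_leak, sd_leak = y.mean(), y.std()

# Causal: mean/std from the training window only, then reused on test.
mu, sd = train.mean(), train.std()
train_scaled = (train - mu) / sd
test_scaled = (test - mu) / sd

print(f"train-only mean {mu:.2f} vs full-series mean {mu_leak:.2f}")
print(f"scaled test mean (causal): {test_scaled.mean():.2f}")
```

In a walk-forward backtest the same rule applies at every origin: recompute `mu` and `sd` from that origin's training window before scaling anything.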
The four forecasting illusions:
- (1) Random-split CV: shuffled folds train on the future to predict the past; always split by time.
- (2) Leaked preprocessing: scalers, imputers and rolling features fit across the seam; everything must be fit inside the fold on past data only.
- (3) Over-confident intervals: in-sample σ under-states real uncertainty, so nominal 95% bands cover far less; check empirical coverage (§6.3).
- (4) Silent drift: the world moves after you froze the model, so yesterday's backtest stops describing today; monitor and re-backtest on a rolling basis (MLOPS 05).
Drift makes a backtest perishable. Even a leakage-free backtest is a statement about the past. Time series live in non-stationary worlds — regimes change, seasonality evolves, a pandemic rewrites every demand curve overnight — so a model validated last quarter can quietly decay (covariate and concept drift, MLOPS 05). Two defences: monitor forecast error in production against the live naive benchmark, and re-run walk-forward validation on a rolling basis so your reported accuracy always reflects the recent world, not a frozen snapshot.
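The first defence, monitoring against the live naive benchmark, can be sketched as a rolling ratio of the model's MAE to the naive forecast's MAE. Everything here is illustrative: `rolling_relative_mae` is a hypothetical helper, the window length is arbitrary, and the data are synthetic, with a level shift halfway through and a model frozen at the old level.

```python
import numpy as np

def rolling_relative_mae(actual, forecast, window):
    """Rolling MAE of the model divided by rolling MAE of the live naive
    forecast (the previous actual). Values above 1 mean the model is
    currently doing worse than carrying the last value forward."""
    actual = np.asarray(actual, float)
    forecast = np.asarray(forecast, float)
    model_err = np.abs(actual[1:] - forecast[1:])
    naive_err = np.abs(actual[1:] - actual[:-1])
    kernel = np.ones(window) / window
    model_mae = np.convolve(model_err, kernel, mode="valid")
    naive_mae = np.convolve(naive_err, kernel, mode="valid")
    return model_mae / naive_mae

rng = np.random.default_rng(4)
n, change = 400, 200
level = np.where(np.arange(n) < change, 50.0, 55.0)   # regime shift at t=200
actual = level + rng.normal(0, 1.0, n)
forecast = np.full(n, 50.0)                           # model frozen at old level

ratio = rolling_relative_mae(actual, forecast, window=30)
print(f"mean ratio before shift: {ratio[:100].mean():.2f}")
print(f"mean ratio after shift : {ratio[-50:].mean():.2f}")
```

A sustained move of this ratio above 1 is a reasonable trigger for refitting and for re-running the walk-forward backtest on recent data.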
The one rule that prevents most of this: simulate production exactly. At every forecast origin, ask "what did I actually know at this instant?" and forbid the pipeline from touching anything else. A backtest that obeys that question — and that always reports MASE against the naive forecast — is the difference between a number you can stake a decision on and a guess wearing a lab coat.
You can now forecast a series, score it honestly, and quantify what you don't know. But the deepest uncertainty is not measurement noise — it is that the process itself is random. The Quant volume opens by building time series from the ground up as stochastic processes: random walks, Brownian motion, martingales, and the Itō calculus that turns "the future is a distribution" into a rigorous mathematical object. Quant · Chapter 01: Stochastic Processes.
References
- Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy.
- Taylor, S. J. & Letham, B. (2018). Forecasting at Scale (Prophet).
- Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting.
- Salinas, D., Flunkert, V., Gasthaus, J. & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks.
- Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation.
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.).
- Shafer, G. & Vovk, V. (2008). A tutorial on conformal prediction.