Forecasting in Practice — AI Encyclopedia

6.1

Backtesting & walk-forward validation

The cross-validation you learned for tabular data — shuffle the rows, hold out a random fold — is poison for a time series. Shuffling lets the model train on Thursday to predict Wednesday; random folds leak the future into the past. Temporal order is the whole point of the data, so the evaluation must respect it: train only on the past, test only on the future, never the reverse.

The disciplined way to do this is walk-forward validation (also called rolling-origin or time-series cross-validation). Fix a forecast horizon $h$. Train on data up to some origin $t$, forecast the next $h$ steps, score them against what actually happened, then slide the origin forward and repeat. You end up with many out-of-sample forecasts at many origins — a far more honest estimate of live performance than a single train/test split, which can be lucky or unlucky depending on where you happened to cut.

EQ T6.1 — ROLLING-ORIGIN BACKTEST ERROR $$ \mathrm{CV}(h) \;=\; \frac{1}{|\mathcal{O}|}\sum_{t\in\mathcal{O}}\; \frac{1}{h}\sum_{k=1}^{h}\; \ell\!\big(\,y_{t+k},\;\hat{y}_{t+k\mid t}\,\big) $$

$\mathcal{O}$ is the set of forecast origins; at each origin $t$ the model is fit on $y_{1:t}$ only and emits $h$-step forecasts $\hat{y}_{t+k\mid t}$. $\ell$ is any per-point loss (absolute error, squared error, pinball). Two flavours of the window matter: an expanding window keeps all history ($y_{1:t}$ grows) — right when the process is stationary and more data always helps; a sliding window of fixed length forgets old data — right when the process drifts and stale history is actively misleading. The forecast at origin $t$ may use nothing dated after $t$. Break that rule anywhere — feature engineering, scaling, hyper-parameter choice — and the score is fiction (§6.5).
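EQ T6.1 and the two window flavours can be sketched in a few lines. The helper name `rolling_origin_cv`, the window-mean forecaster and the level-shift series are illustrative assumptions, chosen so that the choice of window visibly matters; swap your own model in for the `train.mean()` line.

```python
import numpy as np

def rolling_origin_cv(y, h, start, step=1, window=None):
    """EQ T6.1 with absolute error and a window-mean forecaster (a stand-in model).

    window=None uses an expanding window; window=w keeps only the last w points.
    Returns (CV error, number of origins).
    """
    fold_errors = []
    for t in range(start, len(y) - h, step):        # t = forecast origin (0-based)
        lo = 0 if window is None else max(0, t + 1 - window)
        train = y[lo:t + 1]                         # nothing dated after t
        forecast = np.full(h, train.mean())         # placeholder model
        actual = y[t + 1:t + 1 + h]                 # the next h points
        fold_errors.append(np.mean(np.abs(actual - forecast)))   # inner 1/h sum
    return float(np.mean(fold_errors)), len(fold_errors)         # outer 1/|O| sum

# Assumed toy series: the level jumps from 0 to 5 halfway through.
rng = np.random.default_rng(1)
y = np.r_[np.zeros(100), 5.0 * np.ones(100)] + rng.normal(0, 1.0, 200)

cv_expanding, n_folds = rolling_origin_cv(y, h=5, start=50, step=5)
cv_sliding, _ = rolling_origin_cv(y, h=5, start=50, step=5, window=20)
print(f"origins                  : {n_folds}")
print(f"expanding-window CV error: {cv_expanding:.3f}")
print(f"sliding-window   CV error: {cv_sliding:.3f}")
```

On this series the sliding window should score better, because the expanding mean keeps averaging in the obsolete level; on a stable series the ranking would typically reverse.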

Two practical refinements separate a toy backtest from a trustworthy one. First, leave a gap between train and test when your features embed a look-back or your labels arrive late, so information cannot bleed across the seam (this is the idea behind purged and embargoed cross-validation in finance). Second, refit the model at every origin if you can afford it — a model re-estimated as the window slides mimics what you would actually do in production, whereas freezing the parameters at the first origin quietly over-states stability.
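The gap refinement amounts to one extra offset in the index arithmetic. The generator below is a minimal sketch; the name `walk_forward_splits` and its parameters are hypothetical, and the model is meant to be refit on every `train_idx` it yields.

```python
import numpy as np

def walk_forward_splits(n, h, start, step=1, gap=0):
    """Yield (train_idx, test_idx) per origin, skipping `gap` points after it."""
    t = start
    while t + gap + h < n:                           # last test index must be <= n - 1
        train_idx = np.arange(0, t + 1)              # expanding window up to origin t
        test_idx = np.arange(t + 1 + gap, t + 1 + gap + h)
        yield train_idx, test_idx
        t += step

splits = list(walk_forward_splits(n=20, h=3, start=9, step=4, gap=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]} | gap 2 | test {test_idx[0]}..{test_idx[-1]}")
```

Choosing `gap` at least as long as the longest look-back in your features (or the label delay) is the usual rule of thumb; with `gap=0` the generator reduces to the plain walk-forward scheme.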

PYTHON · RUNNABLE IN-BROWSER

# Walk-forward backtest: naive vs AR(1), scored by MASE (EQ T6.1 + T6.3).
import numpy as np
rng = np.random.default_rng(0)

# A trending, noisy AR(1)-ish series of 120 points.
n = 120
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * y[t - 1] + 0.05 * t + rng.normal(0, 1.0)

H = 1                                   # one-step-ahead horizon
start = 60                              # first forecast origin
abs_naive, abs_ar = [], []
for t in range(start, n - H):
    train = y[:t + 1]                   # ONLY the past -- no leakage
    naive = train[-1]                   # last value carried forward
    # AR(1) fit by least squares on the training window
    a = np.vstack([train[:-1], np.ones(t)]).T
    phi, c = np.linalg.lstsq(a, train[1:], rcond=None)[0]
    ar = phi * train[-1] + c
    actual = y[t + H]
    abs_naive.append(abs(actual - naive))
    abs_ar.append(abs(actual - ar))

# MASE = mean(|model error|) / mean(|naive one-step error in-sample|)
scale = np.mean(np.abs(np.diff(y[:start])))     # the naive yardstick
mase_naive = np.mean(abs_naive) / scale
mase_ar    = np.mean(abs_ar)    / scale
print(f"in-sample naive scale (mean |y_t - y_t-1|): {scale:.3f}")
print(f"MASE  naive forecast : {mase_naive:.3f}")
print(f"MASE  AR(1) forecast : {mase_ar:.3f}")
print("AR(1) beats naive out-of-sample:" , mase_ar < mase_naive)


INSTRUMENT T6.1 — WALK-FORWARD BACKTEST · ROLLING ORIGIN · EXPANDING vs SLIDING · EQ T6.1

[Interactive figure. Controls: FORECAST HORIZON h (default 6), ORIGIN STEP (default 6), WINDOW (expanding or sliding). Readouts: FOLDS (ORIGINS), MEAN ABS ERROR (CV), WORST FOLD.]

Each blue bracket is one fold: a training span on the left, an h-step test on the right, with the forecast drawn against the truth. Slide HORIZON up and watch error climb — longer horizons are simply harder. Switch to SLIDING to drop old history; on this drifting series the expanding window wins because every point of the past still helps. The CV error is EQ T6.1 averaged over every fold, not a single lucky split.

6.2

Forecast accuracy — MAPE, MASE, sMAPE

A backtest gives you errors; a metric turns them into one comparable number. The choice is not cosmetic — each metric has a failure mode, and picking the wrong one for your data is how people ship models that look great offline and disappoint in production.

The intuitive starting point is the Mean Absolute Percentage Error: average the absolute error as a fraction of the actual value, so a forecast that is off by 10 on a quantity of 100 scores the same 10% as a forecast off by 1 on a quantity of 10.

EQ T6.2 — MAPE $$ \mathrm{MAPE} \;=\; \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right| $$

$y_t$ is the actual, $\hat{y}_t$ the forecast. Scale-free and instantly interpretable to a business audience — "we're 8% off on average." But it has three real defects: it explodes when $y_t$ is near zero (intermittent demand, anything that can be empty), it is undefined when $y_t = 0$, and it is asymmetric — it penalizes over-forecasts more heavily than under-forecasts, so a model can game it by systematically under-predicting. Use it for strictly positive, well-away-from-zero series; reach for MASE everywhere else.

A single forecast predicts $ \hat{y} = 90 $ when the actual value turns out to be $ y = 100 $. Using EQ T6.2 on this one point, what is the MAPE, in percent?

$ \left|\dfrac{y-\hat{y}}{y}\right| = \dfrac{|100-90|}{100} = \dfrac{10}{100} = 0.10 $; times 100% gives 10%. Had the forecast been 110 (also off by 10) the MAPE would still read 10%: for a fixed actual, equal-sized misses score equally. The asymmetry lies in the range of possible scores. On a positive series an under-forecast can never score worse than 100% (a forecast of zero), while an over-forecast has no upper limit, so MAPE favours models that lean low. That is the bias MASE was built to escape.
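The MAPE defects are easy to reproduce numerically. The snippet below is a small sketch with made-up numbers; the helper `mape` is just EQ T6.2 and assumes no actual is exactly zero.

```python
import numpy as np

def mape(y, y_hat):
    """EQ T6.2; assumes no actual value is exactly zero."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

# Same absolute miss of 50, with actual and forecast swapped.
over_forecast = mape([100], [150])       # 50 / 100
under_forecast = mape([150], [100])      # 50 / 150
# Forecasting zero everywhere caps the score at 100 %.
all_zero = mape([100, 200, 50], [0, 0, 0])
# One near-zero actual dominates the average (every forecast is off by 1).
near_zero = mape([100, 0.1, 100], [101, 1.1, 101])

print(f"over-forecast  : {over_forecast:7.2f} %")
print(f"under-forecast : {under_forecast:7.2f} %")
print(f"always zero    : {all_zero:7.2f} %")
print(f"near-zero point: {near_zero:7.2f} %")
```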

The fix for MAPE's pathologies is the Mean Absolute Scaled Error of Hyndman & Koehler (2006). Instead of dividing by the actual value, divide the model's mean absolute error by the mean absolute error of a dirt-simple benchmark — the naive forecast, which just carries the last observed value forward. The scaling makes MASE unitless, defined even when $y_t = 0$, symmetric, and — its whole point — readable as a comparison against the benchmark that any model must beat to justify its existence.

EQ T6.3 — MASE $$ \mathrm{MASE} \;=\; \frac{\frac{1}{n}\sum_{t=1}^{n}\bigl|\,y_t - \hat{y}_t\,\bigr|}{\frac{1}{T-1}\sum_{i=2}^{T}\bigl|\,y_i - y_{i-1}\,\bigr|} $$

Numerator: the model's mean absolute error on the test set. Denominator: the in-sample mean absolute error of the one-step naive forecast ($\hat{y}_i = y_{i-1}$) computed on the training data, which serves as the yardstick. The ratio is the headline: $\mathrm{MASE} < 1$ means the model's test error is smaller than that yardstick; $\mathrm{MASE} > 1$ means it loses to copy-the-last-value, an embarrassing but common verdict. For seasonal data use the seasonal-naive denominator $\frac{1}{T-m}\sum_{i=m+1}^{T}\lvert y_i - y_{i-m}\rvert$ (one season back, period $m$). MASE averages sanely across series of wildly different scales, which is why the M4 competition's overall score combined it with sMAPE.
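A short sketch of EQ T6.3 with the seasonal option follows. The function name `mase`, its `m` parameter and the toy quarterly-style series are assumptions for illustration.

```python
import numpy as np

def mase(y_train, y_test, y_hat, m=1):
    """EQ T6.3 with a lag-m naive denominator (m=1 is the plain naive forecast)."""
    y_train = np.asarray(y_train, float)
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))    # in-sample lag-m naive MAE
    abs_err = np.abs(np.asarray(y_test, float) - np.asarray(y_hat, float))
    return float(np.mean(abs_err) / scale)

# Assumed series: a period-4 pattern riding on a linear trend.
t = np.arange(24)
y = np.tile([10.0, 20.0, 30.0, 20.0], 6) + 0.5 * t
y_train, y_test = y[:20], y[20:]
y_hat = y_train[-4:]                  # seasonal-naive forecast: repeat the last season

mase_plain = mase(y_train, y_test, y_hat, m=1)
mase_seasonal = mase(y_train, y_test, y_hat, m=4)
print(f"MASE vs one-step naive (m=1): {mase_plain:.3f}")
print(f"MASE vs seasonal naive (m=4): {mase_seasonal:.3f}")
```

The same forecast looks excellent against the one-step yardstick and merely on par against the seasonal one, which is the reason to match the denominator to the seasonality of the data.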

A model scores $ \mathrm{MASE} = 0.7 $ on a held-out period. Does this mean the model beats the naive (last-value) forecast on that period? (Answer true or false.)

MASE is the model's mean absolute error on the held-out period divided by the naive forecast's in-sample mean absolute error on the training data (EQ T6.3). A value below 1 means the model's error is smaller than that benchmark, so $ \mathrm{MASE} = 0.7 < 1 $ means the model's mean absolute error is 30% smaller than the naive yardstick: by the MASE convention it beats the naive forecast. Hence true. To compare the two on the same held-out period, score the naive forecast there as well. (A MASE above 1 would be the humbling case — your model loses to a one-line baseline.)

A third metric, the symmetric MAPE, was introduced to tame MAPE's over-/under-forecast asymmetry by putting both the actual and the forecast in the denominator. It is bounded and was used in the M3 and M4 competitions, but "symmetric" is a misnomer — it is still not perfectly even-handed, and it too misbehaves when both values approach zero. Know it because you will meet it in benchmark tables; prefer MASE when you get to choose.

EQ T6.4 — sMAPE (MAKRIDAKIS FORM) $$ \mathrm{sMAPE} \;=\; \frac{100\%}{n}\sum_{t=1}^{n}\frac{\bigl|\,y_t - \hat{y}_t\,\bigr|}{\bigl(\lvert y_t\rvert + \lvert \hat{y}_t\rvert\bigr)/2} $$

Dividing by the average of actual and forecast bounds each term in $[0,200\%]$ (some authors drop the factor of two and cap at 100%, a common source of table-to-table confusion). It is gentler than MAPE on small actuals and less one-sided, but it still has no defined value at $y_t = \hat{y}_t = 0$ and remains mildly biased — the M4 organisers ultimately paired it with a MASE-style measure rather than trusting it alone. Always state which sMAPE convention you used.
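Both points, the two conventions and the residual asymmetry, show up in a few lines. The helper `smape` and its `halve_denominator` flag are hypothetical names used only for this sketch.

```python
import numpy as np

def smape(y, y_hat, halve_denominator=True):
    """EQ T6.4 when halve_denominator=True; the 0-100 % variant otherwise."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    denom = np.abs(y) + np.abs(y_hat)
    if halve_denominator:
        denom = denom / 2.0
    return 100.0 * np.mean(np.abs(y - y_hat) / denom)

over = smape([100], [110])                 # miss of +10
under = smape([100], [90])                 # miss of -10
worst_200 = smape([100], [0])              # upper bound, Makridakis form
worst_100 = smape([100], [0], halve_denominator=False)

print(f"over-forecast by 10 : {over:6.2f} %")
print(f"under-forecast by 10: {under:6.2f} %")
print(f"upper bound         : {worst_200:.0f} % vs {worst_100:.0f} % by convention")
```

Note that the tilt runs the opposite way from MAPE: here the under-forecast of the same size scores worse.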

PYTHON · RUNNABLE IN-BROWSER

# MAPE, MASE and sMAPE side by side, in plain numpy (EQ T6.2-T6.4).
import numpy as np

# A short held-out test set + the model's forecasts for it.
y_train = np.array([100, 102, 101, 105, 110, 108, 112, 115], float)  # history
y_test  = np.array([118, 120, 119, 125], float)                      # truth
y_hat   = np.array([116, 121, 122, 123], float)                      # forecast

abs_err = np.abs(y_test - y_hat)

mape  = 100 * np.mean(abs_err / np.abs(y_test))
smape = 100 * np.mean(abs_err / ((np.abs(y_test) + np.abs(y_hat)) / 2))
scale = np.mean(np.abs(np.diff(y_train)))      # naive one-step MAE on TRAIN
mase  = np.mean(abs_err) / scale

# What the naive (last-value) forecast would have scored, for contrast.
naive_hat = np.full_like(y_test, y_train[-1])
mase_naive = np.mean(np.abs(y_test - naive_hat)) / scale

print(f"MAPE   : {mape:6.2f} %")
print(f"sMAPE  : {smape:6.2f} %")
print(f"MASE   : {mase:6.3f}   (model)")
print(f"MASE   : {mase_naive:6.3f}   (naive baseline: last training value held flat)")
print("model beats naive:", mase < mase_naive)


INSTRUMENT T6.2 — NAIVE vs MODEL · MASE · CAN YOUR MODEL BEAT COPY-THE-LAST-VALUE? · EQ T6.3

[Interactive figure. Controls: MODEL SKILL (shrink toward truth, default 0.55), TREND STRENGTH (default 0.40), NOISE σ (default 1.0). Readouts: NAIVE MASE, MODEL MASE, VERDICT.]

The grey line is the truth; the blue dashed line is the naive (last-value) forecast; the mint line is your model, which is pulled toward the truth by MODEL SKILL. Push SKILL toward 1 and the model MASE drops below 1 — it beats naive. Add TREND and watch the naive forecast suffer (it always lags a trend by one step), which is exactly when a real model earns its keep. Crank SKILL to 0 and the model is no better than guessing the mean: MASE climbs above 1 and the verdict turns red.

6.3

Prediction intervals

A single number — the point forecast — is a lie of omission. The honest output of a forecaster is a distribution, or at least an interval: "demand next week is 1,200 ± 300 with 90% confidence." Decisions are made on the interval, not the point — safety stock, capital buffers, staffing all hinge on the downside, not the median. And here is the uncomfortable truth of applied forecasting: point forecasts are often decent and the intervals are usually too narrow.

For a model that assumes Gaussian errors with standard deviation $\sigma_h$ at horizon $h$, the symmetric prediction interval is the familiar $z$-band around the point forecast:

EQ T6.5 — GAUSSIAN PREDICTION INTERVAL $$ \hat{y}_{t+h} \;\pm\; z_{1-\alpha/2}\,\sigma_h, \qquad z_{0.975} \approx 1.96 \ \text{(95\%)}, \quad z_{0.95} \approx 1.645 \ \text{(90\%)} $$

$\sigma_h$ is the forecast standard deviation at horizon $h$ — and crucially it grows with $h$: predicting tomorrow is tighter than predicting next month, because uncertainty compounds. For a random walk $\sigma_h = \sigma\sqrt{h}$; for most models it widens too, and a flat band across horizons is a red flag. The catch is that $\sigma_h$ is itself estimated, usually from in-sample residuals that under-state real uncertainty (the model fit those points), so the nominal 95% band routinely covers fewer than 95% of future outcomes. The interval's only honest test is its empirical coverage.
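The random-walk rule $\sigma_h = \sigma\sqrt{h}$ can be checked by simulation. This is a sketch under the stated assumption of i.i.d. Gaussian shocks; the variable names and the choice $\sigma = 2$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, max_h, n_paths = 2.0, 16, 20000

# For a random walk the best point forecast is the last value, so the
# h-step forecast error is simply the sum of the next h shocks.
shocks = rng.normal(0.0, sigma, size=(n_paths, max_h))
errors = shocks.cumsum(axis=1)              # errors[:, h-1] = y_{t+h} - y_t

empirical = errors.std(axis=0)
theoretical = sigma * np.sqrt(np.arange(1, max_h + 1))
for h in (1, 4, 16):
    print(f"h={h:2d}  empirical sigma_h={empirical[h-1]:.3f}  "
          f"sigma*sqrt(h)={theoretical[h-1]:.3f}")
```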

That last sentence is the whole discipline. A 90% prediction interval is calibrated if, over many forecasts, the truth falls inside it close to 90% of the time. Measure it directly:

EQ T6.6 — EMPIRICAL COVERAGE $$ \mathrm{Coverage} \;=\; \frac{1}{n}\sum_{t=1}^{n}\mathbf{1}\!\left[\,\ell_t \le y_t \le u_t\,\right] $$

$[\ell_t, u_t]$ is the predicted interval, $y_t$ the realized value, $\mathbf{1}[\cdot]$ the indicator. Compare coverage to the nominal level: coverage well below nominal means over-confident intervals (the common case: your bands are too tight and you will be blindsided); coverage above nominal means needlessly wide bands that waste capital on slack. Two robust ways to get calibrated intervals without trusting a Gaussian assumption: take empirical quantiles of backtest residuals, or use conformal prediction, which wraps any point forecaster in finite-sample coverage guarantees under an exchangeability assumption (one that autocorrelated series satisfy only approximately). Pinball (quantile) loss is a proper scoring rule for the quantiles that bound the band, which makes it the natural loss for the bands themselves.
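A minimal split-conformal sketch ties the two ideas together: the half-width is an empirical quantile of held-out absolute residuals, and EQ T6.6 checks the result. The function names, the heavy-tailed Student-t residuals and the zero point forecasts are assumptions; the residuals are simulated as i.i.d., which is the exchangeable case.

```python
import numpy as np

def conformal_half_width(cal_residuals, alpha):
    """Symmetric half-width from calibration residuals (y - y_hat)."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, float)))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample-corrected rank
    return np.inf if k > n else float(scores[k - 1])

def coverage(y, lower, upper):
    """EQ T6.6: share of outcomes inside [lower, upper]."""
    return float(np.mean((lower <= y) & (y <= upper)))

rng = np.random.default_rng(7)
cal_residuals = rng.standard_t(df=3, size=2000)  # backtest residuals, not Gaussian
y_hat = np.zeros(5000)                           # placeholder point forecasts
y = y_hat + rng.standard_t(df=3, size=5000)      # fresh outcomes

half = conformal_half_width(cal_residuals, alpha=0.10)
cov = coverage(y, y_hat - half, y_hat + half)
print(f"half-width        : {half:.3f}")
print(f"nominal coverage  : 90.0 %")
print(f"empirical coverage: {100 * cov:.1f} %")
```

No distributional form is assumed for the residuals; the price is that the band has the same width at every point unless the scores are normalised by a per-point scale.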

PYTHON · RUNNABLE IN-BROWSER

# Interval coverage: a model that mis-estimates sigma is mis-calibrated (T6.5/6.6).
import numpy as np
rng = np.random.default_rng(3)

n = 4000
true_sigma  = 1.0                       # the real one-step error scale
model_sigma = 0.7                       # the model THINKS errors are smaller
z95 = 1.96

errors = rng.normal(0, true_sigma, n)   # realized forecast errors
half   = z95 * model_sigma              # the model's 95% half-width
inside = np.abs(errors) <= half         # did truth land in the band?
emp_cov = inside.mean()

# Analytic check: coverage = 2*Phi(1.96*model_sigma/true_sigma) - 1
def Phi(x):                             # normal CDF via erf-free approximation
    t = 1 / (1 + 0.2316419 * abs(x))
    d = 0.3989423 * np.exp(-x * x / 2)
    p = d * t * (0.3193815 + t*(-0.3565638 + t*(1.781478
        + t*(-1.821256 + t*1.330274))))
    return 1 - p if x >= 0 else p
analytic = 2 * Phi(z95 * model_sigma / true_sigma) - 1

print(f"nominal coverage      : 95.0 %")
print(f"empirical coverage    : {100*emp_cov:5.1f} %")
print(f"analytic coverage     : {100*analytic:5.1f} %")
print("under-estimating sigma -> OVER-CONFIDENT, real coverage < nominal:",
      emp_cov < 0.95)


INSTRUMENT T6.3 — PREDICTION-INTERVAL COVERAGE · NOMINAL vs EMPIRICAL · EQ T6.5 / T6.6

[Interactive figure. Controls: NOMINAL LEVEL (default 95%), MODEL σ̂ / TRUE σ (default 0.70). Readouts: NOMINAL, EMPIRICAL COVERAGE, CALIBRATION.]

The mint band is the model's prediction interval at the NOMINAL level; each dot is a realized outcome — grey inside the band, red outside. The ratio σ̂/σ is how badly the model mis-estimates its own uncertainty: at 0.70 the band is too tight and empirical coverage falls below nominal (over-confident — the usual disease). Slide the ratio to 1.0 to recover calibration, and past 1.0 to see needlessly wide bands over-cover. A point forecast hides all of this; only coverage exposes it.

6.4

ML & DL for time series — Prophet, DeepAR, TFT

Classical methods (Vol II's ARIMA and exponential smoothing) are still astonishingly hard to beat on a single series — the M-competitions have shown this for decades, and a tuned ETS or ARIMA remains a serious baseline. The case for machine learning grows with the number of related series: when you have thousands of products, stores, or sensors, a single global model trained across all of them shares statistical strength, handles cold-start items, and ingests covariates that classical per-series models cannot. Three landmarks define the modern stack.

Prophet — structured, interpretable, robust

Prophet (Taylor & Letham, 2018) is not deep learning at all; it is a decomposable additive model — trend + seasonality + holidays — fit in a Bayesian framework. Its appeal is operational: it is robust to missing data and outliers, exposes human-tunable knobs (changepoint flexibility, seasonality strength, named holidays), and gives analysts who are not forecasting specialists a sane default. The cost is that it bakes in a structural assumption; when the series does not decompose that way, Prophet is mediocre, and it should be treated as a strong, legible baseline rather than a state-of-the-art engine.

EQ T6.7 — PROPHET'S ADDITIVE DECOMPOSITION $$ y(t) \;=\; g(t) \;+\; s(t) \;+\; h(t) \;+\; \varepsilon_t $$

$g(t)$ is the trend (piecewise-linear or logistic-growth, with automatically placed changepoints), $s(t)$ the seasonality (a Fourier series, so multiple periods stack), $h(t)$ the effect of holidays and special events, and $\varepsilon_t$ the noise. The decomposition is the feature, not a bug: a practitioner can read the fitted $g$, $s$, and $h$ and argue with each one. A multiplicative variant — $y(t)=g(t)\cdot(1+s(t))$ — handles seasonality that scales with the level.
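To make the decomposition tangible, here is a deliberately simplified stand-in: a piecewise-linear $g(t)$ with hand-picked changepoints plus a Fourier $s(t)$, fit by ordinary least squares. This is not Prophet's fitting procedure or API; the holiday term $h(t)$ is omitted, and `design_matrix`, the changepoint locations and the synthetic weekly series are all assumptions.

```python
import numpy as np

def design_matrix(t, changepoints, period, order):
    """Columns for g(t) (intercept, slope, hinges), then s(t) (Fourier pairs)."""
    trend_cols = [np.ones_like(t), t] + [np.maximum(0.0, t - c) for c in changepoints]
    season_cols = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * t / period
        season_cols += [np.sin(angle), np.cos(angle)]
    return np.column_stack(trend_cols + season_cols), len(trend_cols)

# Synthetic data: the slope steepens at t=100, with a weekly sine on top.
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
true_trend = 0.05 * t + 0.10 * np.maximum(0.0, t - 100.0)
true_season = 2.0 * np.sin(2.0 * np.pi * t / 7.0)
y = true_trend + true_season + rng.normal(0, 0.3, t.size)

X, n_trend = design_matrix(t, changepoints=[50.0, 100.0, 150.0], period=7.0, order=2)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
g_hat = X[:, :n_trend] @ beta[:n_trend]          # fitted trend component
s_hat = X[:, n_trend:] @ beta[n_trend:]          # fitted seasonal component
residual = y - g_hat - s_hat

print(f"slope changes at the changepoints: {np.round(beta[2:n_trend], 3)}")
print(f"weekly sine amplitude (true 2.0) : {beta[n_trend]:.3f}")
print(f"residual std (noise was 0.3)     : {residual.std():.3f}")
```

Because each component has its own columns, the fitted `g_hat` and `s_hat` can be inspected and challenged separately, which is the legibility the section describes.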

DeepAR — global autoregressive RNN, probabilistic by design

DeepAR (Salinas et al., 2020) trains one autoregressive RNN across all series in a dataset and, critically, outputs the parameters of a probability distribution at each step (e.g. the mean and variance of a Gaussian, or a negative binomial for counts) rather than a point. Forecasts are generated by sampling forward, yielding full predictive distributions — prediction intervals for free, and well-calibrated ones when trained properly. It was the result that made deep probabilistic forecasting credible at scale, and the architecture under many production demand-forecasting systems.
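The "sampling forward" step can be illustrated without any deep-learning library. In the sketch below `toy_network` is a hypothetical stand-in for a trained model that emits Gaussian parameters; nothing here reproduces DeepAR's architecture or training, only the way sampled paths turn into quantile bands.

```python
import numpy as np

def toy_network(prev):
    """Stand-in for a trained network: maps the last value to (mu, sigma)."""
    return 0.8 * prev, np.ones_like(prev)

def sample_paths(last_value, horizon, n_samples, rng):
    """Ancestral sampling: draw one step, feed the draw back in, repeat."""
    paths = np.empty((n_samples, horizon))
    prev = np.full(n_samples, float(last_value))
    for k in range(horizon):
        mu, sigma = toy_network(prev)
        prev = rng.normal(mu, sigma)             # one draw per sample path
        paths[:, k] = prev
    return paths

rng = np.random.default_rng(0)
paths = sample_paths(last_value=5.0, horizon=10, n_samples=5000, rng=rng)
lower, median, upper = np.quantile(paths, [0.05, 0.5, 0.95], axis=0)

for h in (1, 5, 10):
    print(f"h={h:2d}  median={median[h-1]:6.2f}  "
          f"90% band=[{lower[h-1]:6.2f}, {upper[h-1]:6.2f}]")
```

The band widens with the horizon because each draw inherits the uncertainty of the draws before it, with no separate formula for $\sigma_h$ required.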

TFT — attention, covariates, and interpretability

The Temporal Fusion Transformer (Lim et al., 2021) is the attention-era synthesis: it cleanly separates static metadata, known-future covariates (holidays, promotions you have already scheduled), and observed-past inputs; uses variable-selection networks to weight features; and applies interpretable multi-head attention so you can read which past time steps and which variables drove a forecast. It outputs quantiles directly via pinball loss, so calibrated intervals are native. On rich multi-horizon, multi-covariate benchmarks it set the bar — at the cost of being heavier to train and tune than everything above it.
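Pinball loss is what lets a network output quantiles directly: minimising it at level $\tau$ pulls the prediction toward the $\tau$-quantile. The sketch below checks that on a constant predictor; `pinball_loss`, the exponential data and the candidate grid are assumptions, and no part of it is TFT code.

```python
import numpy as np

def pinball_loss(y, q_hat, tau):
    """Mean quantile loss at level tau for prediction(s) q_hat."""
    diff = np.asarray(y, float) - q_hat
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

rng = np.random.default_rng(0)
y = rng.exponential(scale=10.0, size=5000)       # skewed outcomes
tau = 0.9

# Try every constant prediction on a grid and keep the one with least loss.
candidates = np.linspace(0.0, 60.0, 1201)
losses = [pinball_loss(y, c, tau) for c in candidates]
best = float(candidates[int(np.argmin(losses))])
empirical_q = float(np.quantile(y, tau))

print(f"loss-minimising constant : {best:.2f}")
print(f"empirical 0.9-quantile   : {empirical_q:.2f}")
```

Under-predictions cost `tau` per unit and over-predictions cost `1 - tau`, so at `tau = 0.9` the loss tolerates sitting above most of the data.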

Method	Family	Probabilistic?	Best when…
ARIMA / ETS	classical, per-series	via residual σ	one or few series; strong baseline; full interpretability
Prophet	additive decomposition	Bayesian intervals	analyst-friendly trend+seasonality+holidays
DeepAR	global autoregressive RNN	yes — samples a distribution	many related series; counts; cold-start items
TFT	attention transformer	yes — quantile outputs	rich covariates; multi-horizon; need interpretability

The honest caveat. Deep models are not free wins. The M4 competition was won by a hybrid of exponential smoothing and an RNN, and M5 by gradient-boosted trees (LightGBM) over engineered features — not by a pure transformer. The 2023–2025 wave of "foundation" time-series models (TimeGPT, Chronos, Moirai, TimesFM) brings zero-shot forecasting and is genuinely useful, but whether they consistently beat a well-tuned local model on your data is still contested. Backtest them against a naive and an ARIMA baseline before you believe the leaderboard.

6.5

Pitfalls — leakage, look-ahead & drift

Almost every forecasting disaster traces to the same root cause: the offline score was computed on information the model would not have had in production. The model looks brilliant in the notebook and falls apart on day one. The leaks are subtle, which is why they survive code review.

Look-ahead bias / data leakage. The cardinal sin is letting any post-origin information into a pre-origin decision. The classic offenders:

Scaling on the full series. Computing a mean/std (or min/max) over all data and then splitting bakes future statistics into the training features. Fit the scaler on the training window only, inside each fold.
Global imputation and feature engineering. Forward-filling, interpolating, or computing rolling features across the train/test seam smears the future backward. Every transform must be causal — computed from the past alone.
Target leakage from late-arriving data. A feature that is only known after the target is realized (a revised figure, a settled outcome) is not available at forecast time. Use the value as it stood at the origin, not its final restatement.
Tuning on the test set. Choosing hyper-parameters or a model by peeking at the same period you report — the slowest leak, because it hides in your workflow rather than your code. Use a separate validation split or nested walk-forward.
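The scaling offender at the top of that list is the easiest to demonstrate. The sketch below uses hypothetical helpers `fit_scaler` and `apply_scaler` on an assumed trending series; the same fit-on-train, apply-to-test pattern carries over to imputers and rolling features.

```python
import numpy as np

def fit_scaler(values):
    """Return (mean, std) of the window the scaler is allowed to see."""
    return float(np.mean(values)), float(np.std(values))

def apply_scaler(values, mu, sd):
    return (values - mu) / sd

rng = np.random.default_rng(0)
y = 0.5 * np.arange(200) + rng.normal(0, 1.0, 200)   # assumed trending series
origin = 150
train, test = y[:origin], y[origin:]

leaky_mu, leaky_sd = fit_scaler(y)         # WRONG: statistics include the test period
causal_mu, causal_sd = fit_scaler(train)   # RIGHT: fit on the training window only

train_scaled = apply_scaler(train, causal_mu, causal_sd)
test_scaled = apply_scaler(test, causal_mu, causal_sd)   # reuse the training statistics

print(f"leaky  mean/std : {leaky_mu:7.2f} / {leaky_sd:6.2f}")
print(f"causal mean/std : {causal_mu:7.2f} / {causal_sd:6.2f}")
print(f"scaled test mean under the causal scaler: {test_scaled.mean():.2f}")
```

The causally scaled test values sit well outside the training range, which is uncomfortable but honest: that is what the model would face in production.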

PITFALLS

The four forecasting illusions: (1) random-split CV — shuffled folds train on the future to predict the past; always split by time. (2) leaked preprocessing — scalers, imputers and rolling features fit across the seam; everything must be fit inside the fold on past data only. (3) over-confident intervals — in-sample σ under-states real uncertainty, so nominal 95% bands cover far less; check empirical coverage (§6.3). (4) silent drift — the world moves after you froze the model, so yesterday's backtest stops describing today; monitor and re-backtest on a rolling basis (MLOPS 05).

Drift makes a backtest perishable. Even a leakage-free backtest is a statement about the past. Time series live in non-stationary worlds — regimes change, seasonality evolves, a pandemic rewrites every demand curve overnight — so a model validated last quarter can quietly decay (covariate and concept drift, MLOPS 05). Two defences: monitor forecast error in production against the live naive benchmark, and re-run walk-forward validation on a rolling basis so your reported accuracy always reflects the recent world, not a frozen snapshot.
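The first defence, tracking error against the live naive benchmark, might look like the following sketch. The function `rolling_relative_mae`, the 50-step window, the alert threshold of 1.5 and the simulated regime change are all assumptions rather than recommended settings.

```python
import numpy as np

def rolling_relative_mae(y, y_hat, window):
    """Trailing-window MAE of the model divided by that of the naive forecast."""
    model_err = np.abs(y[1:] - y_hat[1:])
    naive_err = np.abs(y[1:] - y[:-1])            # live naive: yesterday's value
    kernel = np.ones(window) / window
    model_mae = np.convolve(model_err, kernel, mode="valid")
    naive_mae = np.convolve(naive_err, kernel, mode="valid")
    return model_mae / naive_mae

# Assumed scenario: a stable level of 10, then an upward ramp from t=300.
rng = np.random.default_rng(0)
n, change = 400, 300
t = np.arange(n)
y = 10.0 + 0.5 * np.maximum(0, t - change) + rng.normal(0, 1.0, n)
y_hat = np.full(n, 10.0)                          # frozen model, validated before the change

ratio = rolling_relative_mae(y, y_hat, window=50)
alerts = np.flatnonzero(ratio > 1.5)              # window indices breaching the threshold
print(f"relative MAE before the change: {ratio[:200].mean():.2f}")
print(f"relative MAE at the end       : {ratio[-1]:.2f}")
print(f"first alert at window index   : {alerts[0]}")
```

A ratio below 1 says the frozen model still earns its keep; once the ratio climbs past the threshold, the model is losing to copy-the-last-value and it is time to re-run the walk-forward backtest.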

The one rule that prevents most of this: simulate production exactly. At every forecast origin, ask "what did I actually know at this instant?" and forbid the pipeline from touching anything else. A backtest that obeys that question — and that always reports MASE against the naive forecast — is the difference between a number you can stake a decision on and a guess wearing a lab coat.

You can now forecast a series, score it honestly, and quantify what you don't know. But the deepest uncertainty is not measurement noise — it is that the process itself is random. The Quant volume opens by building time series from the ground up as stochastic processes: random walks, Brownian motion, martingales, and the Itō calculus that turns "the future is a distribution" into a rigorous mathematical object. Quant · Chapter 01: Stochastic Processes.

6.R

References

Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting 22(4) — introduces MASE and dissects MAPE/sMAPE (§6.2, EQ T6.3).
Taylor, S. J. & Letham, B. (2018). Forecasting at Scale (Prophet). The American Statistician 72(1) — the decomposable trend + seasonality + holidays model of §6.4 (EQ T6.7).
Lim, B., Arık, S. Ö., Loeff, N. & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37(4) — attention, variable selection and quantile outputs (§6.4).
Salinas, D., Flunkert, V., Gasthaus, J. & Januschowski, T. (2020). DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36(3) — the global probabilistic RNN of §6.4.
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2022). The M5 competition: Background, organization, and implementation. International Journal of Forecasting 38(4) — why gradient-boosted trees, not transformers, won (§6.4 caveat).
Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts — the standard open text on time-series cross-validation, accuracy metrics and prediction intervals (§6.1–6.3).
Shafer, G. & Vovk, V. (2008). A tutorial on conformal prediction. JMLR 9 — distribution-free prediction intervals with finite-sample coverage (§6.3, EQ T6.6).