Autoregressive (AR) models
The simplest honest forecast is "tomorrow looks like today, plus a nudge." An autoregressive model formalizes that intuition: regress the series on its own past. An AR of order \(p\) — written AR(\(p\)) — predicts the current value as a weighted sum of the previous \(p\) values plus a fresh shock:
Persistence is the whole story for AR(1), \(y_t = c + \phi y_{t-1} + \varepsilon_t\). A \(\phi\) near \(+1\) means shocks linger (a slow, trending-looking series); a \(\phi\) near \(0\) means the series snaps back to its mean almost immediately (near white noise); a negative \(\phi\) makes it oscillate, flipping sign each step. The catch is stationarity: the process only has a stable mean and variance if its shocks don't compound forever. For AR(1) that means \(|\phi| < 1\); for higher orders the condition lives in the roots of the characteristic polynomial.
How do you estimate the \(\phi\)'s from data? The classical route is the Yule-Walker equations, which connect the AR coefficients to the series' autocorrelations. They say: the autocorrelation at lag \(k\) equals the same weighted combination of nearby autocorrelations that the model imposes on the values themselves.
# Fit an AR(2) by Yule-Walker (EQ T2.3) in pure numpy. No statsmodels needed.
import numpy as np
rng = np.random.default_rng(0)
phi1, phi2 = 0.5, 0.3 # the TRUE coefficients we will recover
n = 600
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(2, n): # simulate the AR(2) process (EQ T2.1)
y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
y = y - y.mean() # center: Yule-Walker works on the mean-removed series
def acf(x, k): # sample autocorrelation at lag k
return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)
r1, r2 = acf(y, 1), acf(y, 2)
R = np.array([[1.0, r1], [r1, 1.0]]) # Toeplitz matrix of autocorrelations
r = np.array([r1, r2])
phi_hat = np.linalg.solve(R, r) # phi = R^{-1} r (EQ T2.3)
print(f"sample autocorrelations : rho1={r1:.3f} rho2={r2:.3f}")
print(f"Yule-Walker estimate : phi1={phi_hat[0]:.3f} phi2={phi_hat[1]:.3f}")
print(f"true coefficients : phi1={phi1} phi2={phi2}")
print(f"sum phi (persistence) : {phi_hat.sum():.3f} (<1 => stationary)")
Moving-average (MA) models
An AR model remembers past values. A moving-average model remembers past shocks. MA(\(q\)) writes the current value as the current white-noise shock plus a weighted sum of the last \(q\) shocks:
The two model families are mirror images, and the mirror is precise. An invertible MA(\(q\)) — one whose MA polynomial \(\Theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q\) also has all roots outside the unit circle — can be rewritten as an infinite-order AR, and a stationary AR(\(p\)) can be rewritten as an infinite-order MA. The duality has a sharp diagnostic consequence:
| Model | ACF (autocorrelation) | PACF (partial autocorr.) |
|---|---|---|
| AR(p) | tails off (decays geometrically / sinusoidally) | cuts off after lag p |
| MA(q) | cuts off after lag q | tails off (decays geometrically) |
| ARMA(p,q) | tails off after lag q | tails off after lag p |
This table is the heart of classical model identification. The ACF measures correlation between \(y_t\) and \(y_{t-k}\); the partial ACF measures the same after removing the influence of the intervening lags. A sharp ACF cutoff at lag \(q\) with a slowly tailing PACF screams MA(\(q\)); the reverse screams AR(\(p\)). When both tail off, you are in mixed ARMA territory — and these "cutoffs" are statistical, blurred by sampling noise, so read them as strong hints, not certainties.
# MA(2) fingerprint: its ACF cuts off after lag 2, the PACF tails off (the table).
import numpy as np
rng = np.random.default_rng(1)
theta1, theta2 = 0.7, 0.4
n = 4000
e = rng.normal(0, 1, n + 2)
y = e[2:] + theta1 * e[1:-1] + theta2 * e[:-2] # MA(2) per EQ T2.4
y = y - y.mean()
def acf(x, k):
return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)
# theoretical MA(2) autocorrelations for comparison
denom = 1 + theta1**2 + theta2**2
th = [1.0, (theta1 + theta1*theta2)/denom, theta2/denom, 0.0, 0.0]
print(" lag sample ACF theory ACF")
for k in range(5):
print(f" {k:2d} {acf(y,k):+.3f} {th[k]:+.3f}")
print("\nACF is large at lags 1-2 then ~0 -> the MA(2) cutoff. q reads straight off.")
plot_xy(list(range(8)), [acf(y, k) for k in range(8)])
ARMA & ARIMA
Combine the two memories and you get ARMA(\(p,q\)): the value depends on its own past and on past shocks. In lag-operator form the symmetry is plain — an AR polynomial acting on the values equals an MA polynomial acting on the noise:
The I stands for integrated: an ARIMA(\(p,d,q\)) is an ARMA(\(p,q\)) fitted to the series after taking \(d\) differences. Differencing is the operator \(\Delta y_t = y_t - y_{t-1} = (1-L)y_t\); apply it \(d\) times to strip out trend. A series that becomes stationary after \(d\) differences is "integrated of order \(d\)", written \(I(d)\). Most economic series are \(I(1)\): one difference — turning levels into changes — is enough.
# One-step ARIMA(1,1,1) forecast BY HAND on a tiny series (EQ T2.6).
import numpy as np
y = np.array([10., 12., 11., 14., 16., 15.]) # the observed levels
phi, theta = 0.6, 0.4 # ARMA(1,1) on the differences
w = np.diff(y) # d=1: work on changes w_t = y_t - y_{t-1}
print("differences w :", w)
# Recover the unobserved shocks e_t by the ARMA(1,1) recursion (start e_0 = 0):
# w_t = phi*w_{t-1} + e_t + theta*e_{t-1} => e_t = w_t - phi*w_{t-1} - theta*e_{t-1}
e = np.zeros(len(w))
for t in range(1, len(w)):
e[t] = w[t] - phi * w[t-1] - theta * e[t-1]
print("residuals e :", np.round(e, 3))
w_next = phi * w[-1] + theta * e[-1] # forecast the NEXT difference (E[e_next]=0)
y_next = y[-1] + w_next # integrate back: undo the differencing
print(f"\nforecast next change w_hat = {w_next:.3f}")
print(f"forecast next level y_hat = y[-1] + w_hat = {y[-1]:.1f} + ({w_next:.3f})"
f" = {y_next:.3f}")
Seasonal ARIMA (SARIMA)
Monthly sales peak every December; electricity demand cycles every 24 hours; retail repeats every 7 days. A plain ARIMA can chase a trend but it has no machinery for a repeating season. SARIMA bolts a second, seasonal ARIMA onto the first — same AR/I/MA logic, but operating at the seasonal lag \(s\) (12 for monthly-with-yearly-cycle, 7 for daily-with-weekly-cycle):
The two layers multiply rather than add, which is what lets one shock leave both a short-range footprint (the next few periods) and a seasonal echo (the same period next year). In practice you almost never need \(D > 1\): one seasonal difference plus one ordinary difference removes both a trend and a yearly cycle, and stacking more differences over-differences just as fast as in the non-seasonal case. The cost of the extra flexibility is parameters — six orders to choose instead of three — which is exactly why automated order selection (§2.5) became indispensable for SARIMA.
SARIMA's honest limits. It assumes a single, fixed seasonal period with constant amplitude. Multiple overlapping seasonalities (a daily series with both weekly and yearly cycles), seasonality that grows with the level, or non-integer periods all break it — and that is where TBATS, Fourier-term regressors with ARIMA errors, Prophet, and modern ML forecasters earn their place. SARIMA remains the right first tool for one clean season, and a strong baseline even when it is not the final one.
The Box-Jenkins methodology
Box and Jenkins did not just propose models — they proposed a procedure, an iterative loop that turns a raw series into a fitted forecast. It is the recipe the whole chapter has been building toward, and it has three stages that you cycle until the residuals are clean:
Stage one — identification — is where the AR/MA fingerprint table from §2.2 does its work. Pick orders by competing candidates on an information criterion that rewards fit and penalizes complexity, rather than by eyeballing the ACF alone:
auto.arima: it searches over candidate \((p,d,q)(P,D,Q)_s\) orders and keeps the lowest-AIC model. AICc adds a small-sample correction; BIC penalizes complexity harder (\(\ln(n)\,k\)) and prefers sparser models. None of them replaces the white-noise residual check — a low AIC with autocorrelated residuals is still a failed model.The genuinely contested part is whether to trust automation. auto.arima (and Python's pmdarima) made order selection a one-liner, and for well-behaved series it usually finds a sensible model. But it optimizes in-sample fit, can be fooled by outliers and structural breaks, and will happily return a model whose residuals still carry seasonality it failed to difference away. The defensible practice in 2026 is the same as in 1976: let the search propose, then diagnose — plot the residual ACF, run Ljung-Box, and back-test on held-out horizons before shipping.
Four ways Box-Jenkins goes wrong: (1) over-differencing — differencing a series that was already stationary injects negative autocorrelation and inflates variance; difference the minimum and confirm with ADF/KPSS. (2) reading ACF/PACF too literally — sampling noise blurs the cutoffs, so a "spike at lag 7" may be chance, not weekly seasonality. (3) trusting AIC over residuals — the lowest-AIC model can still have autocorrelated errors; the residual diagnostics are the real gate. (4) forecasting far past the data's regime — ARIMA extrapolates its fitted linear dynamics and is blind to structural breaks, so long-horizon intervals are optimistic.
ARIMA fits the conditional mean by least squares on lagged values; exponential smoothing weights the past geometrically instead. The two families overlap more than their notation suggests — simple exponential smoothing is an ARIMA(0,1,1) in disguise. Chapter 03: the exponential-smoothing family, from simple to Holt-Winters, the state-space (ETS) formulation that gives it likelihoods and intervals, and when its decaying-memory view beats ARIMA's algebra.
References
- Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.).
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.).
- Akaike, H. (1974). A New Look at the Statistical Model Identification.
- Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models.
- Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root.
- Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R.