AI // ENCYCLOPEDIA / TIME SERIES / 02 / ARIMA INDEX NEXT: 03 EXPONENTIAL SMOOTHING →
TIME SERIES & ECONOMETRICS · CHAPTER 02 / 06

AR, MA, ARIMA & SARIMA

Before neural networks reached forecasting, Box and Jenkins reduced it to a procedure that could be taught. The recipe has three steps: difference the series until it is stationary, read the ACF and PACF to choose orders, then fit autoregressive and moving-average terms. Now a one-line call in every statistics library, it remains the baseline that more elaborate models are measured against.

LEVELCORE READING TIME≈ 28 MIN BUILDS ONTIME SERIES 01 · STATS 04 INSTRUMENTSARIMA LAB · AR ROOTS · BOX-JENKINS
2.1

Autoregressive (AR) models

The simplest honest forecast is "tomorrow looks like today, plus a nudge." An autoregressive model formalizes that intuition: regress the series on its own past. An AR of order \(p\) — written AR(\(p\)) — predicts the current value as a weighted sum of the previous \(p\) values plus a fresh shock:

EQ T2.1 — AUTOREGRESSIVE MODEL AR(p) $$ y_t \;=\; c \;+\; \phi_1 y_{t-1} \;+\; \phi_2 y_{t-2} \;+\; \cdots \;+\; \phi_p y_{t-p} \;+\; \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0,\sigma^2) $$
\(\phi_1,\dots,\phi_p\) are the AR coefficients, \(c\) a constant tied to the long-run mean (\(\mu = c/(1-\sum\phi_i)\)), and \(\varepsilon_t\) is white noise — zero-mean, constant-variance, serially uncorrelated. The model has memory: a shock at time \(t\) propagates forward through the \(\phi\)'s, decaying geometrically. AR(1) is the workhorse — \(\phi\) is literally the one-step persistence of the series.

Persistence is the whole story for AR(1), \(y_t = c + \phi y_{t-1} + \varepsilon_t\). A \(\phi\) near \(+1\) means shocks linger (a slow, trending-looking series); a \(\phi\) near \(0\) means the series snaps back to its mean almost immediately (near white noise); a negative \(\phi\) makes it oscillate, flipping sign each step. The catch is stationarity: the process only has a stable mean and variance if its shocks don't compound forever. For AR(1) that means \(|\phi| < 1\); for higher orders the condition lives in the roots of the characteristic polynomial.

EQ T2.2 — STATIONARITY: ROOTS OUTSIDE THE UNIT CIRCLE $$ \Phi(z) \;=\; 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p \;=\; 0 \quad\Longrightarrow\quad |z_i| > 1 \;\; \forall i $$
Write the model with the lag operator \(L\) (where \(L\,y_t = y_{t-1}\)) as \(\Phi(L)\,y_t = c + \varepsilon_t\). The process is stationary iff every root of \(\Phi(z)\) lies strictly outside the unit circle — equivalently, every reciprocal root lies inside it. For AR(1), \(1-\phi z = 0\) gives \(z = 1/\phi\), so \(|z|>1\) is exactly \(|\phi|<1\). A root on the circle (\(|z|=1\)) is a unit root — the boundary case of a random walk, which §2.3 differences away.
An AR(1) process with no constant is \( y_t = 0.5\,y_{t-1} + \varepsilon_t \). The last observed value is \( y_{t-1} = 10 \). What is the one-step forecast \( \hat{y}_t = 0.5\,y_{t-1} \) (the expected next value, since \( \mathbb{E}[\varepsilon_t] = 0 \))?
The minimum-mean-squared-error forecast sets the unknown shock to its mean, \( \mathbb{E}[\varepsilon_t] = 0 \), so \( \hat{y}_t = 0.5 \times 10 = \) 5. With \( \phi = 0.5 \) the series gives back half its current deviation each step — fast mean reversion.

How do you estimate the \(\phi\)'s from data? The classical route is the Yule-Walker equations, which connect the AR coefficients to the series' autocorrelations. They say: the autocorrelation at lag \(k\) equals the same weighted combination of nearby autocorrelations that the model imposes on the values themselves.

EQ T2.3 — YULE-WALKER EQUATIONS $$ \rho_k \;=\; \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}, \quad k = 1,\dots,p \qquad\Longleftrightarrow\qquad R\,\boldsymbol{\phi} = \mathbf{r} $$
\(\rho_k\) is the autocorrelation at lag \(k\); \(R\) is the \(p\times p\) Toeplitz matrix of autocorrelations \(\rho_{|i-j|}\) and \(\mathbf{r} = (\rho_1,\dots,\rho_p)\). Estimate the \(\rho_k\) from the data, plug them in, and solve the linear system \(\boldsymbol{\phi} = R^{-1}\mathbf{r}\) — a closed-form fit with no iteration. For AR(2) this unpacks to \(\phi_1 = \dfrac{\rho_1(1-\rho_2)}{1-\rho_1^2}\), \(\phi_2 = \dfrac{\rho_2 - \rho_1^2}{1-\rho_1^2}\). Maximum likelihood is usually preferred in production, but Yule-Walker is the transparent estimator that shows where the numbers come from.
PYTHON · RUNNABLE IN-BROWSER
# Fit an AR(2) by Yule-Walker (EQ T2.3) in pure numpy. No statsmodels needed.
import numpy as np
rng = np.random.default_rng(0)

phi1, phi2 = 0.5, 0.3              # the TRUE coefficients we will recover
n = 600
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(2, n):             # simulate the AR(2) process (EQ T2.1)
    y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
y = y - y.mean()                  # center: Yule-Walker works on the mean-removed series

def acf(x, k):                    # sample autocorrelation at lag k
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)

r1, r2 = acf(y, 1), acf(y, 2)
R = np.array([[1.0, r1], [r1, 1.0]])   # Toeplitz matrix of autocorrelations
r = np.array([r1, r2])
phi_hat = np.linalg.solve(R, r)        # phi = R^{-1} r  (EQ T2.3)

print(f"sample autocorrelations : rho1={r1:.3f}  rho2={r2:.3f}")
print(f"Yule-Walker estimate    : phi1={phi_hat[0]:.3f}  phi2={phi_hat[1]:.3f}")
print(f"true coefficients       : phi1={phi1}      phi2={phi2}")
print(f"sum phi (persistence)   : {phi_hat.sum():.3f}  (<1 => stationary)")
edits are live — break it on purpose
INSTRUMENT T2.1 — AR-COEFFICIENT STABILITY & ROOTSAR(2) STATIONARITY TRIANGLE · EQ T2.2
STATUS
ROOTS |z|
φ₁ + φ₂
The mint triangle is the AR(2) stationarity region — its three sides are \(\phi_1+\phi_2<1\), \(\phi_2-\phi_1<1\) and \(\phi_2>-1\). Drag the coefficients and watch the marker: inside the triangle the roots of \(\Phi(z)\) sit outside the unit circle and the process is stable; cross a side and a root crosses the circle, the variance blows up, and the forecast diverges. The lower parabola \(\phi_2 = -\phi_1^2/4\) splits real roots (above) from the complex-root region (below), where the series oscillates with a pseudo-period rather than decaying monotonically.
2.2

Moving-average (MA) models

An AR model remembers past values. A moving-average model remembers past shocks. MA(\(q\)) writes the current value as the current white-noise shock plus a weighted sum of the last \(q\) shocks:

EQ T2.4 — MOVING-AVERAGE MODEL MA(q) $$ y_t \;=\; \mu \;+\; \varepsilon_t \;+\; \theta_1 \varepsilon_{t-1} \;+\; \theta_2 \varepsilon_{t-2} \;+\; \cdots \;+\; \theta_q \varepsilon_{t-q} $$
\(\mu\) is the mean, \(\theta_1,\dots,\theta_q\) the MA coefficients, and the \(\varepsilon\)'s are unobserved white-noise shocks. The defining property: an MA(\(q\)) has finite memory — a shock affects only the next \(q\) observations and then vanishes completely. So an MA process is always stationary (it is a finite sum of stationary terms), and its autocorrelation function cuts off sharply after lag \(q\). That clean cutoff is exactly the fingerprint §2.5 uses to identify \(q\).

The two model families are mirror images, and the mirror is precise. An invertible MA(\(q\)) — one whose MA polynomial \(\Theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q\) also has all roots outside the unit circle — can be rewritten as an infinite-order AR, and a stationary AR(\(p\)) can be rewritten as an infinite-order MA. The duality has a sharp diagnostic consequence:

ModelACF (autocorrelation)PACF (partial autocorr.)
AR(p)tails off (decays geometrically / sinusoidally)cuts off after lag p
MA(q)cuts off after lag qtails off (decays geometrically)
ARMA(p,q)tails off after lag qtails off after lag p

This table is the heart of classical model identification. The ACF measures correlation between \(y_t\) and \(y_{t-k}\); the partial ACF measures the same after removing the influence of the intervening lags. A sharp ACF cutoff at lag \(q\) with a slowly tailing PACF screams MA(\(q\)); the reverse screams AR(\(p\)). When both tail off, you are in mixed ARMA territory — and these "cutoffs" are statistical, blurred by sampling noise, so read them as strong hints, not certainties.

The sample autocorrelation function of a series is large at lags 1 and 2, then drops to essentially zero (inside the noise band) from lag 3 onward, while the PACF tails off slowly. This is the fingerprint of an MA(\(q\)) model. What is \( q \)?
An MA(\(q\)) process has autocorrelations that cut off — are exactly zero — for all lags greater than \(q\) (EQ T2.4). The last significant spike is at lag 2, so the memory is two shocks deep: \( q = \) 2.
PYTHON · RUNNABLE IN-BROWSER
# MA(2) fingerprint: its ACF cuts off after lag 2, the PACF tails off (the table).
import numpy as np
rng = np.random.default_rng(1)

theta1, theta2 = 0.7, 0.4
n = 4000
e = rng.normal(0, 1, n + 2)
y = e[2:] + theta1 * e[1:-1] + theta2 * e[:-2]   # MA(2) per EQ T2.4
y = y - y.mean()

def acf(x, k):
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)

# theoretical MA(2) autocorrelations for comparison
denom = 1 + theta1**2 + theta2**2
th = [1.0, (theta1 + theta1*theta2)/denom, theta2/denom, 0.0, 0.0]
print(" lag   sample ACF   theory ACF")
for k in range(5):
    print(f"  {k:2d}     {acf(y,k):+.3f}      {th[k]:+.3f}")
print("\nACF is large at lags 1-2 then ~0 -> the MA(2) cutoff. q reads straight off.")
plot_xy(list(range(8)), [acf(y, k) for k in range(8)])
edits are live — break it on purpose
2.3

ARMA & ARIMA

Combine the two memories and you get ARMA(\(p,q\)): the value depends on its own past and on past shocks. In lag-operator form the symmetry is plain — an AR polynomial acting on the values equals an MA polynomial acting on the noise:

EQ T2.5 — ARMA(p,q) $$ \underbrace{\Big(1 - \phi_1 L - \cdots - \phi_p L^p\Big)}_{\Phi(L)}\, y_t \;=\; c + \underbrace{\Big(1 + \theta_1 L + \cdots + \theta_q L^q\Big)}_{\Theta(L)}\, \varepsilon_t $$
\(L\) is the lag operator (\(L^k y_t = y_{t-k}\)). ARMA needs the series to be stationary already: \(\Phi(z)\) must have its roots outside the unit circle (EQ T2.2). Most real series — prices, sales, GDP — are not stationary; they trend or wander. The fix is the "I" in ARIMA.

The I stands for integrated: an ARIMA(\(p,d,q\)) is an ARMA(\(p,q\)) fitted to the series after taking \(d\) differences. Differencing is the operator \(\Delta y_t = y_t - y_{t-1} = (1-L)y_t\); apply it \(d\) times to strip out trend. A series that becomes stationary after \(d\) differences is "integrated of order \(d\)", written \(I(d)\). Most economic series are \(I(1)\): one difference — turning levels into changes — is enough.

EQ T2.6 — ARIMA(p,d,q) $$ \Phi(L)\,\underbrace{(1-L)^d\, y_t}_{\text{differenced }d\text{ times}} \;=\; c + \Theta(L)\,\varepsilon_t $$
Three integers fully specify the model: \(p\) AR terms, \(d\) differences, \(q\) MA terms. The familiar special cases all fall out of this one equation: ARIMA(\(p,0,0\)) is plain AR(\(p\)); ARIMA(\(0,0,q\)) is MA(\(q\)); ARIMA(\(0,1,0\)) is \(\Delta y_t = \varepsilon_t\), the random walk; ARIMA(\(0,1,0\)) with a constant is a random walk with drift. Over-differencing is a real hazard — it injects spurious negative autocorrelation and inflates the variance — so difference the minimum needed, checked with a unit-root test (ADF / KPSS), not reflexively.
An ARIMA(0,1,0) model — zero AR terms, one difference, zero MA terms — is exactly a random walk, \( y_t = y_{t-1} + \varepsilon_t \). True or false? (Answer true or false.)
With \(p=q=0\) and \(d=1\), EQ T2.6 reduces to \((1-L)y_t = \varepsilon_t\), i.e. \(y_t - y_{t-1} = \varepsilon_t\), which rearranges to \(y_t = y_{t-1} + \varepsilon_t\) — the textbook random walk. So the statement is true. (Add a constant and it becomes a random walk with drift.)
PYTHON · RUNNABLE IN-BROWSER
# One-step ARIMA(1,1,1) forecast BY HAND on a tiny series (EQ T2.6).
import numpy as np

y = np.array([10., 12., 11., 14., 16., 15.])   # the observed levels
phi, theta = 0.6, 0.4                           # ARMA(1,1) on the differences

w = np.diff(y)                 # d=1: work on changes  w_t = y_t - y_{t-1}
print("differences w :", w)

# Recover the unobserved shocks e_t by the ARMA(1,1) recursion (start e_0 = 0):
#   w_t = phi*w_{t-1} + e_t + theta*e_{t-1}  =>  e_t = w_t - phi*w_{t-1} - theta*e_{t-1}
e = np.zeros(len(w))
for t in range(1, len(w)):
    e[t] = w[t] - phi * w[t-1] - theta * e[t-1]
print("residuals e   :", np.round(e, 3))

w_next = phi * w[-1] + theta * e[-1]   # forecast the NEXT difference (E[e_next]=0)
y_next = y[-1] + w_next                # integrate back: undo the differencing
print(f"\nforecast next change  w_hat = {w_next:.3f}")
print(f"forecast next level   y_hat = y[-1] + w_hat = {y[-1]:.1f} + ({w_next:.3f})"
      f" = {y_next:.3f}")
edits are live — break it on purpose
INSTRUMENT T2.2 — ARIMA(p,d,q) PLAYGROUNDSIMULATE · DIFFERENCE · FORECAST · EQ T2.6
MODEL
FORECAST DRIFT / STEP
SHAPE
A fixed white-noise driver builds an ARIMA(\(p,d,q\)) series (grey), then the model forecasts the next 24 steps (mint) with its \(\sqrt{h}\)-growing uncertainty band. Set \(d=0\) and the forecast reverts to the mean; set \(d=1\) and it persists from the last value (the random-walk limit at \(p=q=0\)); set \(d=2\) and a trend extrapolates. Push \(\phi\) toward \(\pm0.9\) to feel persistence vs oscillation, and watch how a higher \(d\) widens the forecast cone — uncertainty that compounds is the price of differencing away a trend.
2.4

Seasonal ARIMA (SARIMA)

Monthly sales peak every December; electricity demand cycles every 24 hours; retail repeats every 7 days. A plain ARIMA can chase a trend but it has no machinery for a repeating season. SARIMA bolts a second, seasonal ARIMA onto the first — same AR/I/MA logic, but operating at the seasonal lag \(s\) (12 for monthly-with-yearly-cycle, 7 for daily-with-weekly-cycle):

EQ T2.7 — SARIMA(p,d,q)(P,D,Q)ₛ $$ \Phi_p(L)\,\Phi_P(L^s)\,(1-L)^d\,(1-L^s)^D\, y_t \;=\; c + \Theta_q(L)\,\Theta_Q(L^s)\,\varepsilon_t $$
The lowercase \((p,d,q)\) handle the short-range dynamics; the uppercase \((P,D,Q)_s\) handle the seasonal dynamics at multiples of the period \(s\). \((1-L^s)^D\) is seasonal differencing — subtract the value one full season ago (\(y_t - y_{t-s}\)) to remove a stable seasonal pattern, just as \((1-L)^d\) removes trend. The seasonal polynomials \(\Phi_P(L^s),\,\Theta_Q(L^s)\) act only at lags \(s, 2s, \dots\). The classic "airline model", SARIMA(0,1,1)(0,1,1)₁₂, fits a startling range of monthly business series with just two parameters — it is the seasonal baseline to beat.

The two layers multiply rather than add, which is what lets one shock leave both a short-range footprint (the next few periods) and a seasonal echo (the same period next year). In practice you almost never need \(D > 1\): one seasonal difference plus one ordinary difference removes both a trend and a yearly cycle, and stacking more differences over-differences just as fast as in the non-seasonal case. The cost of the extra flexibility is parameters — six orders to choose instead of three — which is exactly why automated order selection (§2.5) became indispensable for SARIMA.

You fit a SARIMA model to monthly data with a yearly cycle and apply one seasonal difference, \( y_t - y_{t-s} \). What is the seasonal lag \( s \) — i.e. how many steps back does the seasonal difference reach?
Monthly data with a yearly cycle repeats every 12 observations, so the seasonal period is \( s = \) 12: the seasonal difference subtracts the value from the same month one year earlier, \( y_t - y_{t-12} \). (Daily-with-weekly would be \( s = 7 \); hourly-with-daily, \( s = 24 \).)

SARIMA's honest limits. It assumes a single, fixed seasonal period with constant amplitude. Multiple overlapping seasonalities (a daily series with both weekly and yearly cycles), seasonality that grows with the level, or non-integer periods all break it — and that is where TBATS, Fourier-term regressors with ARIMA errors, Prophet, and modern ML forecasters earn their place. SARIMA remains the right first tool for one clean season, and a strong baseline even when it is not the final one.

2.5

The Box-Jenkins methodology

Box and Jenkins did not just propose models — they proposed a procedure, an iterative loop that turns a raw series into a fitted forecast. It is the recipe the whole chapter has been building toward, and it has three stages that you cycle until the residuals are clean:

EQ T2.8 — THE BOX-JENKINS LOOP $$ \textbf{Identify} \;\longrightarrow\; \textbf{Estimate} \;\longrightarrow\; \textbf{Diagnose} \;\;\xrightarrow{\text{residuals not white?}}\;\; \textbf{back to Identify} $$
Identify — make the series stationary (difference / log-transform; confirm with ADF or KPSS), then read the ACF/PACF (and seasonal lags) to propose orders \((p,d,q)(P,D,Q)_s\). Estimate — fit the coefficients by maximum likelihood. Diagnose — check that the residuals are indistinguishable from white noise (Ljung-Box test, residual ACF); if not, the model missed structure, so revise the orders and loop. The terminal condition is white-noise residuals: when nothing predictable is left in the errors, you have extracted all the linear signal.

Stage one — identification — is where the AR/MA fingerprint table from §2.2 does its work. Pick orders by competing candidates on an information criterion that rewards fit and penalizes complexity, rather than by eyeballing the ACF alone:

EQ T2.9 — AKAIKE INFORMATION CRITERION (AIC) $$ \mathrm{AIC} \;=\; -2\,\ln \hat{L} \;+\; 2k, \qquad k = p + q + P + Q + (\text{constant}) $$
\(\hat{L}\) is the maximized likelihood and \(k\) the number of estimated parameters. The first term rewards goodness of fit; \(+2k\) is the complexity penalty that stops you from adding terms that only chase noise. Lower AIC wins. This is the engine inside auto.arima: it searches over candidate \((p,d,q)(P,D,Q)_s\) orders and keeps the lowest-AIC model. AICc adds a small-sample correction; BIC penalizes complexity harder (\(\ln(n)\,k\)) and prefers sparser models. None of them replaces the white-noise residual check — a low AIC with autocorrelated residuals is still a failed model.

The genuinely contested part is whether to trust automation. auto.arima (and Python's pmdarima) made order selection a one-liner, and for well-behaved series it usually finds a sensible model. But it optimizes in-sample fit, can be fooled by outliers and structural breaks, and will happily return a model whose residuals still carry seasonality it failed to difference away. The defensible practice in 2026 is the same as in 1976: let the search propose, then diagnose — plot the residual ACF, run Ljung-Box, and back-test on held-out horizons before shipping.

PITFALLS

Four ways Box-Jenkins goes wrong: (1) over-differencing — differencing a series that was already stationary injects negative autocorrelation and inflates variance; difference the minimum and confirm with ADF/KPSS. (2) reading ACF/PACF too literally — sampling noise blurs the cutoffs, so a "spike at lag 7" may be chance, not weekly seasonality. (3) trusting AIC over residuals — the lowest-AIC model can still have autocorrelated errors; the residual diagnostics are the real gate. (4) forecasting far past the data's regime — ARIMA extrapolates its fitted linear dynamics and is blind to structural breaks, so long-horizon intervals are optimistic.

INSTRUMENT T2.3 — BOX-JENKINS IDENTIFICATION GUIDEDECISION TREE FROM ACF / PACF · EQ T2.8
DIAGNOSIS
SUGGESTED MODEL
Pick the pattern you see in a stationary series' correlograms and the tree maps it to a model order, drawing a stylized ACF (mint) and PACF (blue) so you can match the shape. This is the §2.2 fingerprint table made operational: PACF cutoff → AR, ACF cutoff → MA, both tail off → ARMA, slow ACF decay → difference first, spike at a seasonal lag → add a seasonal term. It is a teaching guide, not a substitute for fitting and diagnosing — real correlograms are noisier than these idealized stems.
NEXT

ARIMA fits the conditional mean by least squares on lagged values; exponential smoothing weights the past geometrically instead. The two families overlap more than their notation suggests — simple exponential smoothing is an ARIMA(0,1,1) in disguise. Chapter 03: the exponential-smoothing family, from simple to Holt-Winters, the state-space (ETS) formulation that gives it likelihoods and intervals, and when its decaying-memory view beats ARIMA's algebra.

2.R

References

  1. Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the foundational text that defined the AR/MA/ARIMA/SARIMA framework and the identify-estimate-diagnose loop (EQ T2.8).
  2. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts — the modern, free standard reference; its ARIMA chapter and auto.arima sit behind §2.3–§2.5.
  3. Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9).
  4. Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the Ljung-Box portmanteau test used in the diagnostic stage to check for white-noise residuals.
  5. Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the Augmented Dickey-Fuller unit-root test that decides how many differences \(d\) a series needs.
  6. Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5).