AR, MA, ARIMA & SARIMA — AI Encyclopedia

2.1

Autoregressive (AR) models

The simplest honest forecast is "tomorrow looks like today, plus a nudge." An autoregressive model formalizes that intuition: regress the series on its own past. An AR of order $p$ — written AR($p$) — predicts the current value as a weighted sum of the previous $p$ values plus a fresh shock:

EQ T2.1 — AUTOREGRESSIVE MODEL AR(p) $$ y_t \;=\; c \;+\; \phi_1 y_{t-1} \;+\; \phi_2 y_{t-2} \;+\; \cdots \;+\; \phi_p y_{t-p} \;+\; \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0,\sigma^2) $$

$\phi_1,\dots,\phi_p$ are the AR coefficients, $c$ a constant tied to the long-run mean ($\mu = c/(1-\sum\phi_i)$), and $\varepsilon_t$ is white noise — zero-mean, constant-variance, serially uncorrelated. The model has memory: a shock at time $t$ propagates forward through the $\phi$'s, decaying geometrically. AR(1) is the workhorse — $\phi$ is literally the one-step persistence of the series.

Persistence is the whole story for AR(1), $y_t = c + \phi y_{t-1} + \varepsilon_t$. A $\phi$ near $+1$ means shocks linger (a slow, trending-looking series); a $\phi$ near $0$ means the series snaps back to its mean almost immediately (near white noise); a negative $\phi$ makes it oscillate, flipping sign each step. The catch is stationarity: the process only has a stable mean and variance if its shocks don't compound forever. For AR(1) that means $|\phi| < 1$; for higher orders the condition lives in the roots of the characteristic polynomial.

EQ T2.2 — STATIONARITY: ROOTS OUTSIDE THE UNIT CIRCLE $$ \Phi(z) \;=\; 1 - \phi_1 z - \phi_2 z^2 - \cdots - \phi_p z^p \;=\; 0 \quad\Longrightarrow\quad |z_i| > 1 \;\; \forall i $$

Write the model with the lag operator $L$ (where $L\,y_t = y_{t-1}$) as $\Phi(L)\,y_t = c + \varepsilon_t$. The process is stationary iff every root of $\Phi(z)$ lies strictly outside the unit circle — equivalently, every reciprocal root lies inside it. For AR(1), $1-\phi z = 0$ gives $z = 1/\phi$, so $|z|>1$ is exactly $|\phi|<1$. A root on the circle ($|z|=1$) is a unit root — the boundary case of a random walk, which §2.3 differences away.

An AR(1) process with no constant is $ y_t = 0.5\,y_{t-1} + \varepsilon_t $. The last observed value is $ y_{t-1} = 10 $. What is the one-step forecast $ \hat{y}_t = 0.5\,y_{t-1} $ (the expected next value, since $ \mathbb{E}[\varepsilon_t] = 0 $)?

The minimum-mean-squared-error forecast sets the unknown shock to its mean, $ \mathbb{E}[\varepsilon_t] = 0 $, so $ \hat{y}_t = 0.5 \times 10 = $ 5. With $ \phi = 0.5 $ the series gives back half its current deviation each step — fast mean reversion.

How do you estimate the $\phi$'s from data? The classical route is the Yule-Walker equations, which connect the AR coefficients to the series' autocorrelations. They say: the autocorrelation at lag $k$ equals the same weighted combination of nearby autocorrelations that the model imposes on the values themselves.

EQ T2.3 — YULE-WALKER EQUATIONS $$ \rho_k \;=\; \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2} + \cdots + \phi_p \rho_{k-p}, \quad k = 1,\dots,p \qquad\Longleftrightarrow\qquad R\,\boldsymbol{\phi} = \mathbf{r} $$

$\rho_k$ is the autocorrelation at lag $k$; $R$ is the $p\times p$ Toeplitz matrix of autocorrelations $\rho_{|i-j|}$ and $\mathbf{r} = (\rho_1,\dots,\rho_p)$. Estimate the $\rho_k$ from the data, plug them in, and solve the linear system $\boldsymbol{\phi} = R^{-1}\mathbf{r}$ — a closed-form fit with no iteration. For AR(2) this unpacks to $\phi_1 = \dfrac{\rho_1(1-\rho_2)}{1-\rho_1^2}$, $\phi_2 = \dfrac{\rho_2 - \rho_1^2}{1-\rho_1^2}$. Maximum likelihood is usually preferred in production, but Yule-Walker is the transparent estimator that shows where the numbers come from.

PYTHON · RUNNABLE IN-BROWSER

# Fit an AR(2) by Yule-Walker (EQ T2.3) in pure numpy. No statsmodels needed.
import numpy as np
rng = np.random.default_rng(0)

phi1, phi2 = 0.5, 0.3              # the TRUE coefficients we will recover
n = 600
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(2, n):             # simulate the AR(2) process (EQ T2.1)
    y[t] = phi1 * y[t-1] + phi2 * y[t-2] + e[t]
y = y - y.mean()                  # center: Yule-Walker works on the mean-removed series

def acf(x, k):                    # sample autocorrelation at lag k
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)

r1, r2 = acf(y, 1), acf(y, 2)
R = np.array([[1.0, r1], [r1, 1.0]])   # Toeplitz matrix of autocorrelations
r = np.array([r1, r2])
phi_hat = np.linalg.solve(R, r)        # phi = R^{-1} r  (EQ T2.3)

print(f"sample autocorrelations : rho1={r1:.3f}  rho2={r2:.3f}")
print(f"Yule-Walker estimate    : phi1={phi_hat[0]:.3f}  phi2={phi_hat[1]:.3f}")
print(f"true coefficients       : phi1={phi1}      phi2={phi2}")
print(f"sum phi (persistence)   : {phi_hat.sum():.3f}  (<1 => stationary)")

edits are live — break it on purpose

INSTRUMENT T2.1 — AR-COEFFICIENT STABILITY & ROOTSAR(2) STATIONARITY TRIANGLE · EQ T2.2

φ₁ 0.50

φ₂ 0.30

STATUS

—

ROOTS |z|

—

φ₁ + φ₂

—

The mint triangle is the AR(2) stationarity region — its three sides are $\phi_1+\phi_2<1$, $\phi_2-\phi_1<1$ and $\phi_2>-1$. Drag the coefficients and watch the marker: inside the triangle the roots of $\Phi(z)$ sit outside the unit circle and the process is stable; cross a side and a root crosses the circle, the variance blows up, and the forecast diverges. The lower parabola $\phi_2 = -\phi_1^2/4$ splits real roots (above) from the complex-root region (below), where the series oscillates with a pseudo-period rather than decaying monotonically.

2.2

Moving-average (MA) models

An AR model remembers past values. A moving-average model remembers past shocks. MA($q$) writes the current value as the current white-noise shock plus a weighted sum of the last $q$ shocks:

EQ T2.4 — MOVING-AVERAGE MODEL MA(q) $$ y_t \;=\; \mu \;+\; \varepsilon_t \;+\; \theta_1 \varepsilon_{t-1} \;+\; \theta_2 \varepsilon_{t-2} \;+\; \cdots \;+\; \theta_q \varepsilon_{t-q} $$

$\mu$ is the mean, $\theta_1,\dots,\theta_q$ the MA coefficients, and the $\varepsilon$'s are unobserved white-noise shocks. The defining property: an MA($q$) has finite memory — a shock affects only the next $q$ observations and then vanishes completely. So an MA process is always stationary (it is a finite sum of stationary terms), and its autocorrelation function cuts off sharply after lag $q$. That clean cutoff is exactly the fingerprint §2.5 uses to identify $q$.

The two model families are mirror images, and the mirror is precise. An invertible MA($q$) — one whose MA polynomial $\Theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$ also has all roots outside the unit circle — can be rewritten as an infinite-order AR, and a stationary AR($p$) can be rewritten as an infinite-order MA. The duality has a sharp diagnostic consequence:

Model	ACF (autocorrelation)	PACF (partial autocorr.)
AR(p)	tails off (decays geometrically / sinusoidally)	cuts off after lag p
MA(q)	cuts off after lag q	tails off (decays geometrically)
ARMA(p,q)	tails off after lag q	tails off after lag p

This table is the heart of classical model identification. The ACF measures correlation between $y_t$ and $y_{t-k}$; the partial ACF measures the same after removing the influence of the intervening lags. A sharp ACF cutoff at lag $q$ with a slowly tailing PACF screams MA($q$); the reverse screams AR($p$). When both tail off, you are in mixed ARMA territory — and these "cutoffs" are statistical, blurred by sampling noise, so read them as strong hints, not certainties.

The sample autocorrelation function of a series is large at lags 1 and 2, then drops to essentially zero (inside the noise band) from lag 3 onward, while the PACF tails off slowly. This is the fingerprint of an MA($q$) model. What is $ q $?

An MA($q$) process has autocorrelations that cut off — are exactly zero — for all lags greater than $q$ (EQ T2.4). The last significant spike is at lag 2, so the memory is two shocks deep: $ q = $ 2.

PYTHON · RUNNABLE IN-BROWSER

# MA(2) fingerprint: its ACF cuts off after lag 2, the PACF tails off (the table).
import numpy as np
rng = np.random.default_rng(1)

theta1, theta2 = 0.7, 0.4
n = 4000
e = rng.normal(0, 1, n + 2)
y = e[2:] + theta1 * e[1:-1] + theta2 * e[:-2]   # MA(2) per EQ T2.4
y = y - y.mean()

def acf(x, k):
    return np.sum(x[k:] * x[:len(x)-k]) / np.sum(x * x)

# theoretical MA(2) autocorrelations for comparison
denom = 1 + theta1**2 + theta2**2
th = [1.0, (theta1 + theta1*theta2)/denom, theta2/denom, 0.0, 0.0]
print(" lag   sample ACF   theory ACF")
for k in range(5):
    print(f"  {k:2d}     {acf(y,k):+.3f}      {th[k]:+.3f}")
print("\nACF is large at lags 1-2 then ~0 -> the MA(2) cutoff. q reads straight off.")
plot_xy(list(range(8)), [acf(y, k) for k in range(8)])

edits are live — break it on purpose

2.3

ARMA & ARIMA

Combine the two memories and you get ARMA($p,q$): the value depends on its own past and on past shocks. In lag-operator form the symmetry is plain — an AR polynomial acting on the values equals an MA polynomial acting on the noise:

EQ T2.5 — ARMA(p,q) $$ \underbrace{\Big(1 - \phi_1 L - \cdots - \phi_p L^p\Big)}_{\Phi(L)}\, y_t \;=\; c + \underbrace{\Big(1 + \theta_1 L + \cdots + \theta_q L^q\Big)}_{\Theta(L)}\, \varepsilon_t $$

$L$ is the lag operator ($L^k y_t = y_{t-k}$). ARMA needs the series to be stationary already: $\Phi(z)$ must have its roots outside the unit circle (EQ T2.2). Most real series — prices, sales, GDP — are not stationary; they trend or wander. The fix is the "I" in ARIMA.

The I stands for integrated: an ARIMA($p,d,q$) is an ARMA($p,q$) fitted to the series after taking $d$ differences. Differencing is the operator $\Delta y_t = y_t - y_{t-1} = (1-L)y_t$; apply it $d$ times to strip out trend. A series that becomes stationary after $d$ differences is "integrated of order $d$", written $I(d)$. Most economic series are $I(1)$: one difference — turning levels into changes — is enough.

EQ T2.6 — ARIMA(p,d,q) $$ \Phi(L)\,\underbrace{(1-L)^d\, y_t}_{\text{differenced }d\text{ times}} \;=\; c + \Theta(L)\,\varepsilon_t $$

Three integers fully specify the model: $p$ AR terms, $d$ differences, $q$ MA terms. The familiar special cases all fall out of this one equation: ARIMA($p,0,0$) is plain AR($p$); ARIMA($0,0,q$) is MA($q$); ARIMA($0,1,0$) is $\Delta y_t = \varepsilon_t$, the random walk; ARIMA($0,1,0$) with a constant is a random walk with drift. Over-differencing is a real hazard — it injects spurious negative autocorrelation and inflates the variance — so difference the minimum needed, checked with a unit-root test (ADF / KPSS), not reflexively.

An ARIMA(0,1,0) model — zero AR terms, one difference, zero MA terms — is exactly a random walk, $ y_t = y_{t-1} + \varepsilon_t $. True or false? (Answer true or false.)

With $p=q=0$ and $d=1$, EQ T2.6 reduces to $(1-L)y_t = \varepsilon_t$, i.e. $y_t - y_{t-1} = \varepsilon_t$, which rearranges to $y_t = y_{t-1} + \varepsilon_t$ — the textbook random walk. So the statement is true. (Add a constant and it becomes a random walk with drift.)

PYTHON · RUNNABLE IN-BROWSER

# One-step ARIMA(1,1,1) forecast BY HAND on a tiny series (EQ T2.6).
import numpy as np

y = np.array([10., 12., 11., 14., 16., 15.])   # the observed levels
phi, theta = 0.6, 0.4                           # ARMA(1,1) on the differences

w = np.diff(y)                 # d=1: work on changes  w_t = y_t - y_{t-1}
print("differences w :", w)

# Recover the unobserved shocks e_t by the ARMA(1,1) recursion (start e_0 = 0):
#   w_t = phi*w_{t-1} + e_t + theta*e_{t-1}  =>  e_t = w_t - phi*w_{t-1} - theta*e_{t-1}
e = np.zeros(len(w))
for t in range(1, len(w)):
    e[t] = w[t] - phi * w[t-1] - theta * e[t-1]
print("residuals e   :", np.round(e, 3))

w_next = phi * w[-1] + theta * e[-1]   # forecast the NEXT difference (E[e_next]=0)
y_next = y[-1] + w_next                # integrate back: undo the differencing
print(f"\nforecast next change  w_hat = {w_next:.3f}")
print(f"forecast next level   y_hat = y[-1] + w_hat = {y[-1]:.1f} + ({w_next:.3f})"
      f" = {y_next:.3f}")

edits are live — break it on purpose

INSTRUMENT T2.2 — ARIMA(p,d,q) PLAYGROUNDSIMULATE · DIFFERENCE · FORECAST · EQ T2.6

AR order p 1

DIFFERENCING d 1

MA order q 1

AR strength φ 0.60

MA strength θ 0.40

MODEL

—

FORECAST DRIFT / STEP

—

SHAPE

—

A fixed white-noise driver builds an ARIMA($p,d,q$) series (grey), then the model forecasts the next 24 steps (mint) with its $\sqrt{h}$-growing uncertainty band. Set $d=0$ and the forecast reverts to the mean; set $d=1$ and it persists from the last value (the random-walk limit at $p=q=0$); set $d=2$ and a trend extrapolates. Push $\phi$ toward $\pm0.9$ to feel persistence vs oscillation, and watch how a higher $d$ widens the forecast cone — uncertainty that compounds is the price of differencing away a trend.

2.4

Seasonal ARIMA (SARIMA)

Monthly sales peak every December; electricity demand cycles every 24 hours; retail repeats every 7 days. A plain ARIMA can chase a trend but it has no machinery for a repeating season. SARIMA bolts a second, seasonal ARIMA onto the first — same AR/I/MA logic, but operating at the seasonal lag $s$ (12 for monthly-with-yearly-cycle, 7 for daily-with-weekly-cycle):

EQ T2.7 — SARIMA(p,d,q)(P,D,Q)ₛ $$ \Phi_p(L)\,\Phi_P(L^s)\,(1-L)^d\,(1-L^s)^D\, y_t \;=\; c + \Theta_q(L)\,\Theta_Q(L^s)\,\varepsilon_t $$

The lowercase $(p,d,q)$ handle the short-range dynamics; the uppercase $(P,D,Q)_s$ handle the seasonal dynamics at multiples of the period $s$. $(1-L^s)^D$ is seasonal differencing — subtract the value one full season ago ($y_t - y_{t-s}$) to remove a stable seasonal pattern, just as $(1-L)^d$ removes trend. The seasonal polynomials $\Phi_P(L^s),\,\Theta_Q(L^s)$ act only at lags $s, 2s, \dots$. The classic "airline model", SARIMA(0,1,1)(0,1,1)₁₂, fits a startling range of monthly business series with just two parameters — it is the seasonal baseline to beat.

The two layers multiply rather than add, which is what lets one shock leave both a short-range footprint (the next few periods) and a seasonal echo (the same period next year). In practice you almost never need $D > 1$: one seasonal difference plus one ordinary difference removes both a trend and a yearly cycle, and stacking more differences over-differences just as fast as in the non-seasonal case. The cost of the extra flexibility is parameters — six orders to choose instead of three — which is exactly why automated order selection (§2.5) became indispensable for SARIMA.

You fit a SARIMA model to monthly data with a yearly cycle and apply one seasonal difference, $ y_t - y_{t-s} $. What is the seasonal lag $ s $ — i.e. how many steps back does the seasonal difference reach?

Monthly data with a yearly cycle repeats every 12 observations, so the seasonal period is $ s = $ 12: the seasonal difference subtracts the value from the same month one year earlier, $ y_t - y_{t-12} $. (Daily-with-weekly would be $ s = 7 $; hourly-with-daily, $ s = 24 $.)

SARIMA's honest limits. It assumes a single, fixed seasonal period with constant amplitude. Multiple overlapping seasonalities (a daily series with both weekly and yearly cycles), seasonality that grows with the level, or non-integer periods all break it — and that is where TBATS, Fourier-term regressors with ARIMA errors, Prophet, and modern ML forecasters earn their place. SARIMA remains the right first tool for one clean season, and a strong baseline even when it is not the final one.

2.5

The Box-Jenkins methodology

Box and Jenkins did not just propose models — they proposed a procedure, an iterative loop that turns a raw series into a fitted forecast. It is the recipe the whole chapter has been building toward, and it has three stages that you cycle until the residuals are clean:

EQ T2.8 — THE BOX-JENKINS LOOP $$ \textbf{Identify} \;\longrightarrow\; \textbf{Estimate} \;\longrightarrow\; \textbf{Diagnose} \;\;\xrightarrow{\text{residuals not white?}}\;\; \textbf{back to Identify} $$

Identify — make the series stationary (difference / log-transform; confirm with ADF or KPSS), then read the ACF/PACF (and seasonal lags) to propose orders $(p,d,q)(P,D,Q)_s$. Estimate — fit the coefficients by maximum likelihood. Diagnose — check that the residuals are indistinguishable from white noise (Ljung-Box test, residual ACF); if not, the model missed structure, so revise the orders and loop. The terminal condition is white-noise residuals: when nothing predictable is left in the errors, you have extracted all the linear signal.

Stage one — identification — is where the AR/MA fingerprint table from §2.2 does its work. Pick orders by competing candidates on an information criterion that rewards fit and penalizes complexity, rather than by eyeballing the ACF alone:

EQ T2.9 — AKAIKE INFORMATION CRITERION (AIC) $$ \mathrm{AIC} \;=\; -2\,\ln \hat{L} \;+\; 2k, \qquad k = p + q + P + Q + (\text{constant}) $$

$\hat{L}$ is the maximized likelihood and $k$ the number of estimated parameters. The first term rewards goodness of fit; $+2k$ is the complexity penalty that stops you from adding terms that only chase noise. Lower AIC wins. This is the engine inside auto.arima: it searches over candidate $(p,d,q)(P,D,Q)_s$ orders and keeps the lowest-AIC model. AICc adds a small-sample correction; BIC penalizes complexity harder ($\ln(n)\,k$) and prefers sparser models. None of them replaces the white-noise residual check — a low AIC with autocorrelated residuals is still a failed model.

The genuinely contested part is whether to trust automation. auto.arima (and Python's pmdarima) made order selection a one-liner, and for well-behaved series it usually finds a sensible model. But it optimizes in-sample fit, can be fooled by outliers and structural breaks, and will happily return a model whose residuals still carry seasonality it failed to difference away. The defensible practice in 2026 is the same as in 1976: let the search propose, then diagnose — plot the residual ACF, run Ljung-Box, and back-test on held-out horizons before shipping.

PITFALLS

Four ways Box-Jenkins goes wrong: (1) over-differencing — differencing a series that was already stationary injects negative autocorrelation and inflates variance; difference the minimum and confirm with ADF/KPSS. (2) reading ACF/PACF too literally — sampling noise blurs the cutoffs, so a "spike at lag 7" may be chance, not weekly seasonality. (3) trusting AIC over residuals — the lowest-AIC model can still have autocorrelated errors; the residual diagnostics are the real gate. (4) forecasting far past the data's regime — ARIMA extrapolates its fitted linear dynamics and is blind to structural breaks, so long-horizon intervals are optimistic.

INSTRUMENT T2.3 — BOX-JENKINS IDENTIFICATION GUIDEDECISION TREE FROM ACF / PACF · EQ T2.8

OBSERVED ACF / PACF PATTERN

DIAGNOSIS

—

SUGGESTED MODEL

—

Pick the pattern you see in a stationary series' correlograms and the tree maps it to a model order, drawing a stylized ACF (mint) and PACF (blue) so you can match the shape. This is the §2.2 fingerprint table made operational: PACF cutoff → AR, ACF cutoff → MA, both tail off → ARMA, slow ACF decay → difference first, spike at a seasonal lag → add a seasonal term. It is a teaching guide, not a substitute for fitting and diagnosing — real correlograms are noisier than these idealized stems.

ARIMA fits the conditional mean by least squares on lagged values; exponential smoothing weights the past geometrically instead. The two families overlap more than their notation suggests — simple exponential smoothing is an ARIMA(0,1,1) in disguise. Chapter 03: the exponential-smoothing family, from simple to Holt-Winters, the state-space (ETS) formulation that gives it likelihoods and intervals, and when its decaying-memory view beats ARIMA's algebra.

2.R

References

Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the foundational text that defined the AR/MA/ARIMA/SARIMA framework and the identify-estimate-diagnose loop (EQ T2.8).
Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts — the modern, free standard reference; its ARIMA chapter and auto.arima sit behind §2.3–§2.5.
Akaike, H. (1974). A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control 19(6) — the Akaike Information Criterion behind automated order selection (EQ T2.9).
Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the Ljung-Box portmanteau test used in the diagnostic stage to check for white-noise residuals.
Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the Augmented Dickey-Fuller unit-root test that decides how many differences $d$ a series needs.
Hyndman, R. J. & Khandakar, Y. (2008). Automatic Time Series Forecasting: The forecast Package for R. Journal of Statistical Software 27(3) — the algorithm behind auto.arima and its AIC-driven stepwise order search (§2.5).