AI // ENCYCLOPEDIA / TIME SERIES / 05 / VAR & COINTEGRATION INDEX NEXT: FORECASTING IN PRACTICE →
TIME SERIES & ECONOMETRICS · CHAPTER 05 / 06

Multivariate Time Series — VAR, VECM & Cointegration

When several series move together, a Vector Autoregression captures their feedback: every variable is regressed on the recent past of all the others, so the model encodes which series lead which. When those series are individually non-stationary random walks, cointegration identifies a long-run tether between them, a linear combination that does not wander and that anchors an error-correction dynamic pulling the system back toward equilibrium.

LEVELADVANCED READING TIME≈ 30 MIN BUILDS ONTIME SERIES 01–04 · STATS 06 INSTRUMENTSVAR · IRF · COINTEGRATION
5.1

From AR to Vector Autoregression (VAR)

A scalar autoregression \(y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t\) (the AR\((p)\) of the earlier chapters) explains a series by its own past. But the world rarely hands you one series in isolation: interest rates, output and inflation move together; an order book's bid and ask co-evolve. The Vector Autoregression is the minimal generalization — stack \(K\) series into a vector \(y_t \in \mathbb{R}^K\) and let every component depend on the recent past of every component, itself included:

EQ T5.1 — VAR(p), REDUCED FORM $$ y_t \;=\; c \;+\; A_1\, y_{t-1} \;+\; A_2\, y_{t-2} \;+\; \cdots \;+\; A_p\, y_{t-p} \;+\; \varepsilon_t, \qquad \varepsilon_t \sim (0,\ \Sigma) $$
\(y_t\) is \(K\times 1\); each \(A_i\) is a \(K\times K\) matrix of lag coefficients; \(c\) is a \(K\times 1\) intercept; the innovations \(\varepsilon_t\) are serially uncorrelated with contemporaneous covariance \(\Sigma\) (generally not diagonal — the series are shocked together). The off-diagonal entries of the \(A_i\) are the whole point: \(\big(A_1\big)_{12}\neq 0\) means yesterday's series 2 helps predict today's series 1. A VAR is just \(K\) ordinary regressions sharing the same right-hand side.

Sims (1980) proposed the VAR as a deliberate rebellion against the "incredible" identifying restrictions of large structural macro models: let the data speak by regressing everything on lagged everything, then ask the model questions afterward (§5.3). Each equation has an intercept plus \(K\) coefficients per lag — so a VAR\((p)\) on \(K\) variables carries \(K(Kp + 1)\) parameters in the mean, and the count explodes as \(K\) and \(p\) grow. That parameter profligacy is the VAR's defining tension, and the reason §5.2 obsesses over order selection.

A VAR(1) is fitted on \(K = 3\) variables. Ignoring the intercept, how many lag coefficients does each equation contain (the number of entries in one row of \(A_1\))?
Each equation regresses one variable on the previous values of all \(K = 3\) variables, and there is \(p = 1\) lag, so a row of \(A_1\) has \(K\cdot p = 3\times 1 = \) 3 coefficients. (A VAR\((2)\) on 3 variables would carry \(3\times 2 = 6\) lag coefficients per equation.)

For analysis it is convenient to fold a VAR\((p)\) into a VAR\((1)\) on a stacked state. Define the companion matrix \(F\), an exact analogue of the scalar companion form: a single matrix whose powers generate the entire dynamics.

EQ T5.2 — COMPANION FORM & STABILITY $$ \underbrace{\begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{pmatrix}}_{Y_t} = \underbrace{\begin{pmatrix} A_1 & A_2 & \cdots & A_p \\ I & 0 & \cdots & 0 \\ & \ddots & & \vdots \\ 0 & \cdots & I & 0 \end{pmatrix}}_{F}\, Y_{t-1} + \, E_t, \qquad \text{stable} \iff \max_i \lvert \lambda_i(F) \rvert < 1 $$
\(F\) is \(Kp \times Kp\). The VAR is stationary (stable) exactly when every eigenvalue of \(F\) lies strictly inside the unit circle — equivalently, every root of \(\det(I - A_1 z - \cdots - A_p z^p) = 0\) lies outside it. Shocks then decay geometrically and the process has a finite, time-invariant mean \((I - A_1 - \cdots - A_p)^{-1}c\). An eigenvalue on the unit circle is a unit root — the gateway to cointegration in §5.4.

A VAR also has a clean infinite-history rewrite, the Wold / VMA(\(\infty\)) representation \(y_t = \mu + \sum_{i=0}^{\infty} \Psi_i\,\varepsilon_{t-i}\) with \(\Psi_0 = I\). The matrices \(\Psi_i\) are precisely the impulse responses of §5.3, and for a VAR\((1)\) they are simply \(\Psi_i = A_1^{\,i}\) — powers of the coefficient matrix.

INSTRUMENT T5.1 — TWO-VARIABLE VAR SIMULATORCOEFFICIENT MATRIX A₁ DRIVES COUPLED DYNAMICS · EQ T5.1
SPECTRAL RADIUS |λ|ₘₐₓ
REGIME
CROSS-FEEDBACK a₁₂·a₂₁
A single fixed seed of shocks drives both series so you compare apples to apples. Push the diagonal terms toward ±1 and the system slows and wanders; raise the off-diagonals and watch the two series lock into shared swings (cross-feedback). The instant the spectral radius crosses 1 the regime flips to UNSTABLE and trajectories diverge — exactly the eigenvalue boundary of EQ T5.2.
5.2

Estimation & order selection

Because every equation of a reduced-form VAR has the identical regressor set — a constant plus the same stacked lags — the seemingly-unrelated-regressions efficiency gain vanishes: equation-by-equation OLS is the (conditional) maximum-likelihood estimator under Gaussian errors, and it is consistent and asymptotically normal whether or not \(\Sigma\) is diagonal. You can fit a VAR with nothing more than the normal equations.

EQ T5.3 — MULTIVARIATE LEAST SQUARES $$ \widehat{B} \;=\; \big( Z^{\top} Z \big)^{-1} Z^{\top} Y, \qquad \widehat{\Sigma} \;=\; \frac{1}{T - Kp - 1}\, \widehat{U}^{\top}\widehat{U}, \qquad \widehat{U} = Y - Z\widehat{B} $$
Stack the \(T\) observations as rows of \(Y\) (\(T\times K\)); each row of the design \(Z\) is \([\,1,\ y_{t-1}^{\top},\ \ldots,\ y_{t-p}^{\top}\,]\). Then \(\widehat{B}\) holds the intercept and all \(A_i\) at once — one matrix solve recovers the entire VAR. \(\widehat\Sigma\) is the residual covariance, divided by the degrees of freedom \(T - (Kp+1)\) per equation. OLS row-by-row equals system MLE here because the regressors are shared.

The hard part is not estimation but order selection: too few lags and residuals stay autocorrelated (biasing everything downstream); too many and you burn degrees of freedom on noise. The standard tools are information criteria, which add a complexity penalty to the log-likelihood. Let \(n = K(Kp+1)\) be the total parameter count and \(\lvert\widehat\Sigma_p\rvert\) the residual-covariance determinant at lag \(p\):

EQ T5.4 — INFORMATION CRITERIA FOR VAR ORDER $$ \mathrm{AIC}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{2}{T}\,n, \qquad \mathrm{BIC}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{\ln T}{T}\,n, \qquad \mathrm{HQ}(p) = \ln\lvert\widehat\Sigma_p\rvert + \frac{2\ln\ln T}{T}\,n $$
All three reward fit (the determinant term falls as \(p\) grows) and punish parameters \(n = K(Kp+1)\). The penalties differ in strength: BIC's \(\ln T\) is harshest and is consistent (it selects the true order as \(T\to\infty\)); AIC's \(2\) is mild, tends to over-fit, but is asymptotically efficient for forecasting; Hannan–Quinn sits between. Pick the \(p\) that minimizes the criterion — and in practice prefer BIC for inference, AIC when prediction is the goal. Always confirm the chosen model leaves white-noise residuals.

The curse of dimensionality is real here. A VAR\((4)\) on 8 macro variables already has \(8\times(8\times 4 + 1) = 264\) parameters — easily more than a typical quarterly sample of post-war data. The modern responses are Bayesian VARs with shrinkage priors (the Minnesota prior pulls coefficients toward a random walk), factor-augmented VARs that compress many series into a few factors, and large-VAR estimators with elementwise penalties. None of that changes the OLS skeleton above; they change the prior on \(B\).

PYTHON · RUNNABLE IN-BROWSER
# Fit a VAR(1) on two simulated series by OLS (EQ T5.3) and recover A1
import numpy as np
rng = np.random.default_rng(0)

A = np.array([[0.5, 0.3],            # the true coefficient matrix A1
              [0.2, 0.4]])           # off-diagonals = cross-feedback
T = 600
y = np.zeros((T, 2))
for t in range(1, T):                # simulate the VAR(1) data
    y[t] = A @ y[t - 1] + rng.normal(0, 1, 2)

Z = y[:-1]                           # regressors: y_{t-1}   (T-1 x 2)
Y = y[1:]                            # targets:    y_t       (T-1 x 2)
B = np.linalg.solve(Z.T @ Z, Z.T @ Y).T   # OLS, one solve -> A1_hat (2x2)

np.set_printoptions(precision=3, suppress=True)
print("true A1:\n", A)
print("OLS A1_hat:\n", B)
ev = np.linalg.eigvals(B)
print("\neigenvalue moduli:", np.round(np.abs(ev), 3),
      "-> stable" if np.all(np.abs(ev) < 1) else "-> UNSTABLE")
print("max |lambda| =", round(float(np.max(np.abs(ev))), 3), "(must be < 1)")
edits are live — break it on purpose
5.3

Impulse-response & variance decomposition

A fitted VAR is a dense block of coefficients that almost no one can read directly. The two devices that make it interpretable are the impulse-response function (IRF) — how the whole system reacts over time to a one-off shock — and the forecast-error variance decomposition (FEVD) — what share of each variable's unpredictability traces back to each shock. Both fall straight out of the VMA(\(\infty\)) coefficients \(\Psi_i\).

EQ T5.5 — IMPULSE-RESPONSE FUNCTION $$ \Psi_i \;=\; \frac{\partial\, y_{t+i}}{\partial\, \varepsilon_t^{\top}}, \qquad \Psi_0 = I, \qquad \Psi_i = \sum_{j=1}^{\min(i,p)} A_j\,\Psi_{i-j} \quad\Big(\text{VAR(1): } \Psi_i = A_1^{\,i}\Big) $$
\((\Psi_i)_{mn}\) is the response of variable \(m\) at horizon \(i\) to a unit reduced-form shock in variable \(n\) at time 0. In a stable VAR the \(\Psi_i\) decay to zero, so every shock is transient. For the worked default \(A_1=\big(\begin{smallmatrix}0.5&0.3\\0.2&0.4\end{smallmatrix}\big)\), a unit shock to variable 1 traces \((\Psi_i)_{11} = 1,\ 0.5,\ 0.31,\ 0.209,\ldots\) and the long-run cumulative multiplier is \((I-A_1)^{-1}\) — for variable 1 onto itself, \(2.5\). The IRF turns a coefficient matrix into a story.

The identification caveat — say it out loud. Reduced-form shocks \(\varepsilon_t\) are contemporaneously correlated (\(\Sigma\) is not diagonal), so "a shock to variable 1 alone" is not well defined: in the data, variable 2 tends to move at the same instant. To read structural impulse responses you must impose identifying assumptions that orthogonalize the shocks. The textbook choice is a Cholesky (recursive) ordering — factor \(\Sigma = P P^{\top}\) with \(P\) lower-triangular and report \(\Psi_i P\) — which assumes a causal ordering of the variables (those earlier in the list can shock those later within the period, but not vice versa). Different orderings give different stories, and that ambiguity is the central, contested limitation of VAR analysis; sign restrictions, long-run (Blanchard–Quah) restrictions, and external-instrument (proxy-SVAR) methods are the modern alternatives.

EQ T5.6 — FORECAST-ERROR VARIANCE DECOMPOSITION $$ \mathrm{MSE}\big(y_{t+h}\big) = \sum_{i=0}^{h-1} \Theta_i\,\Theta_i^{\top}, \quad \Theta_i = \Psi_i P, \qquad \omega_{mn}(h) = \frac{\sum_{i=0}^{h-1} \big(\Theta_i\big)_{mn}^2}{\big(\mathrm{MSE}(y_{t+h})\big)_{mm}} $$
\(\omega_{mn}(h)\) is the fraction of the \(h\)-step forecast-error variance of variable \(m\) attributable to (orthogonalized) shock \(n\); the row \(\sum_n \omega_{mn}(h) = 1\). At \(h=1\) a variable's variance is dominated by its own shock; as \(h\) grows, cross-effects accumulate and the decomposition reveals how much of one series' long-run uncertainty is really imported from another. FEVD answers "where does this variable's surprise come from?" — and like the IRF it inherits the ordering dependence above.
INSTRUMENT T5.2 — IMPULSE-RESPONSE EXPLORERΨᵢ = A₁ⁱ · UNIT SHOCK · EQ T5.5
PEAK RESPONSE y₁
PEAK RESPONSE y₂
LONG-RUN MULTIPLIER (I−A)⁻¹
A unit reduced-form shock hits the chosen variable at horizon 0; the curves are \(\Psi_i = A_1^{\,i}\) applied to that impulse. In a stable system both responses decay to zero — the long-run multiplier is the area under the cumulative response, \((I-A_1)^{-1}\). Note the model uses reduced-form shocks; structural IRFs would require the Cholesky ordering of EQ T5.6.
5.4

Cointegration & the VECM

Everything above assumed stability — eigenvalues strictly inside the unit circle. But most economic and financial level series are integrated of order one, \(I(1)\): they have a unit root, wander like random walks, and only their differences are stationary. Run a VAR on the raw levels of two \(I(1)\) series and OLS will happily report a high \(R^2\) that is mostly spurious regression — two independent random walks look correlated purely because both trend.

The remarkable exception is cointegration. Two (or more) \(I(1)\) series are cointegrated if some linear combination of them is \(I(0)\) — stationary. Intuitively, the series share a common stochastic trend, and although each wanders without bound, they cannot wander independently: a spread between them is mean-reverting. Engle and Granger (1987) formalized this and, crucially, the Granger representation theorem proved that cointegration is equivalent to the existence of an error-correction representation.

EQ T5.7 — COINTEGRATION $$ y_t \sim I(1)^K, \qquad \exists\, \beta \neq 0:\ \ \beta^{\top} y_t \sim I(0). \qquad \text{Common-trend form: } \begin{cases} x_t = w_t + u_t \\ z_t = w_t + v_t \end{cases},\ \ w_t = w_{t-1} + \eta_t $$
\(\beta\) is a cointegrating vector; \(\beta^\top y_t\) is the stationary equilibrium error. In the two-series common-trend example, \(x_t\) and \(z_t\) each inherit the random walk \(w_t\) and are individually \(I(1)\), yet \(x_t - z_t = u_t - v_t\) cancels the trend and is \(I(0)\): here \(\beta = (1,-1)^\top\). The number of independent cointegrating vectors is the cointegration rank \(r\), \(0 \le r < K\). \(r=0\) means no long-run tie (difference everything and fit a VAR); \(r=K\) would mean the levels were stationary all along.
True or false: if two series are each \(I(1)\) but some linear combination of them is stationary (\(I(0)\)), the series are cointegrated. (Answer true or false.)
This is the definition of cointegration (EQ T5.7): individually non-stationary \(I(1)\) series whose linear combination \(\beta^\top y_t\) is stationary share a common stochastic trend that the combination cancels. The statement is true.

The error-correction form is the Vector Error-Correction Model (VECM). Re-parameterize the levels VAR in differences, with one term in levels left behind:

EQ T5.8 — VECTOR ERROR-CORRECTION MODEL $$ \Delta y_t \;=\; \Pi\, y_{t-1} \;+\; \sum_{i=1}^{p-1} \Gamma_i\, \Delta y_{t-i} \;+\; c \;+\; \varepsilon_t, \qquad \Pi \;=\; \alpha\,\beta^{\top}, \quad \mathrm{rank}(\Pi) = r $$
Everything except \(\Pi y_{t-1}\) is in stationary differences. The long-run information lives entirely in \(\Pi\): its rank is the cointegration rank \(r\). When \(0<r<K\), \(\Pi\) factors as \(\alpha\beta^\top\) where \(\beta\) (\(K\times r\)) holds the cointegrating vectors and \(\alpha\) (\(K\times r\)) holds the speed-of-adjustment loadings. The term \(\alpha(\beta^\top y_{t-1})\) reads: measure how far the system is from equilibrium (\(\beta^\top y_{t-1}\)), then pull each variable back at rate \(\alpha\). It is a thermostat for a system of random walks.

Estimating \(r\) and \(\beta\) is the Johansen procedure (1991): a reduced-rank regression of \(\Delta y_t\) and \(y_{t-1}\) on the lagged differences yields canonical correlations whose eigenvalues drive two likelihood-ratio tests — the trace test and the maximum-eigenvalue test — for the null \(\mathrm{rank}(\Pi)\le r\). Johansen handles multiple cointegrating relations simultaneously, unlike the two-step Engle–Granger residual test, which is simplest for the single-equation \(r=1\) case.

In a VECM the error-correction term is \(\alpha\,(\beta^{\top} y_{t-1})\). Suppose the equilibrium error \(\beta^{\top} y_{t-1} = -5\) and the speed-of-adjustment loading for this equation is \(\alpha = -0.4\). What is the error-correction contribution \(\alpha\,(\beta^{\top}y_{t-1})\) to \(\Delta y_t\) this period?
\(\alpha\,(\beta^\top y_{t-1}) = (-0.4)\times(-5) = \) 2. The series is below its long-run equilibrium (negative error), and the negative loading \(\alpha<0\) produces a positive push of \(+2\) — the system corrects upward, back toward equilibrium.
INSTRUMENT T5.3 — COINTEGRATION DEMOTWO I(1) WALKS SHARING A STOCHASTIC TREND · EQ T5.7
VAR(LEVELS) — STD(x)
STD(SPREAD x−z)
VERDICT
At θ = 1 both series \(x\) and \(z\) ride the same random walk \(w_t\), so each is \(I(1)\) and wanders freely (top panel) — yet the spread \(x-z\) (bottom panel, mint) stays pinned near zero: they are cointegrated with \(\beta=(1,-1)\). Slide θ toward 0 to decouple the trends; the spread itself becomes a random walk and the verdict flips to NOT COINTEGRATED. Raising σ widens the stationary spread band without breaking the long-run tether.
5.5

Granger causality

A VAR also lets us ask a precise, modest question about lead–lag structure. Series \(x\) Granger-causes \(y\) if the past of \(x\) helps predict \(y\) beyond what the past of \(y\) already explains. It is a statement about predictive content, framed by Granger (1969), and it is deliberately humble: Granger causality is not structural causation. Reverse causality through an omitted common driver, or anticipation (markets pricing in a future move), can produce Granger causality with no mechanism running in that direction.

EQ T5.9 — GRANGER-CAUSALITY F-TEST $$ H_0:\ \big(A_1\big)_{yx} = \cdots = \big(A_p\big)_{yx} = 0, \qquad F \;=\; \frac{\big(\mathrm{SSR}_r - \mathrm{SSR}_u\big)/q}{\mathrm{SSR}_u/\big(T - k\big)} \ \sim\ F_{q,\,T-k} $$
Fit the \(y\)-equation twice: restricted (only lags of \(y\)) and unrestricted (lags of \(y\) and \(x\)), with sums of squared residuals \(\mathrm{SSR}_r,\ \mathrm{SSR}_u\). Here \(q=p\) is the number of zero restrictions (the \(x\)-lag coefficients) and \(k\) is the parameter count of the unrestricted equation. A large \(F\) — the unrestricted model fits enough better to justify its extra terms — rejects \(H_0\), concluding \(x\) Granger-causes \(y\). The Wald/\(\chi^2\) form \(qF\) is the usual large-sample variant.

Caveats experts insist on. Granger tests are sensitive to the lag order \(p\) and to non-stationarity: applying the plain F-test to \(I(1)\) levels gives invalid critical values, so test on stationary (differenced or cointegration-corrected) data, or use the Toda–Yamamoto lag-augmented approach. Bivariate Granger causality can also be an artifact of a third, omitted variable — always interpret it as conditional on the variables actually in the VAR, and never read it as a license to claim mechanism.

PYTHON · RUNNABLE IN-BROWSER
# Granger-causality F-test by hand (EQ T5.9): does x help predict y?
import numpy as np
rng = np.random.default_rng(1)

# toy data: x is its own AR; y depends on LAGGED x -> x should Granger-cause y
T = 300
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t-1] + rng.normal(0, 1)
    y[t] = 0.4 * y[t-1] + 0.6 * x[t-1] + rng.normal(0, 1)   # y <- x[t-1]

yv, ylag, xlag = y[2:], y[1:-1], x[1:-1]          # align one lag
n = len(yv); ones = np.ones(n)

def ssr(X, t):                                     # OLS residual sum of squares
    b, *_ = np.linalg.lstsq(X, t, rcond=None)
    r = t - X @ b
    return float(r @ r)

Xr = np.column_stack([ones, ylag])                 # restricted: y on its own lag
Xu = np.column_stack([ones, ylag, xlag])           # unrestricted: + x lag
ssr_r, ssr_u = ssr(Xr, yv), ssr(Xu, yv)
q, k = 1, Xu.shape[1]                              # 1 restriction; k params
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(f"SSR restricted   = {ssr_r:8.2f}")
print(f"SSR unrestricted = {ssr_u:8.2f}")
print(f"F({q}, {n-k}) = {F:7.2f}")
print("large F (>~3.9 at 5%) => reject H0 => x Granger-causes y")
edits are live — break it on purpose
NEXT

You now have the full multivariate toolkit; the last chapter puts it to work. Chapter 06 — Forecasting in Practice — covers backtesting protocols, walk-forward validation, combining models, prediction intervals, and the brutal lesson that a careful univariate baseline often beats an elaborate VAR out of sample.

5.R

References

  1. Sims, C. A. (1980). Macroeconomics and Reality. Econometrica 48(1) — introduced the VAR as an atheoretical alternative to large structural macro models (EQ T5.1).
  2. Engle, R. F. & Granger, C. W. J. (1987). Co-integration and Error Correction: Representation, Estimation, and Testing. Econometrica 55(2) — defines cointegration and the Granger representation theorem linking it to the VECM (EQ T5.7–T5.8).
  3. Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica 37(3) — the original definition of Granger causality (EQ T5.9).
  4. Johansen, S. (1991). Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models. Econometrica 59(6) — the maximum-likelihood (trace and max-eigenvalue) tests for cointegration rank (§5.4).
  5. Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer — the standard graduate reference for VAR estimation, IRFs, FEVD, and the companion form (EQ T5.2–T5.6).
  6. Toda, H. Y. & Yamamoto, T. (1995). Statistical Inference in Vector Autoregressions with Possibly Integrated Processes. Journal of Econometrics 66(1–2) — the lag-augmented Granger test valid under unit roots and cointegration (§5.5 caveat).