From AR to Vector Autoregression (VAR)
A scalar autoregression \(y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t\) (the AR\((p)\) of the earlier chapters) explains a series by its own past. But the world rarely hands you one series in isolation: interest rates, output and inflation move together; an order book's bid and ask co-evolve. The Vector Autoregression is the minimal generalization — stack \(K\) series into a vector \(y_t \in \mathbb{R}^K\) and let every component depend on the recent past of every component, itself included:
Sims (1980) proposed the VAR as a deliberate rebellion against the "incredible" identifying restrictions of large structural macro models: let the data speak by regressing everything on lagged everything, then ask the model questions afterward (§5.3). Each equation has an intercept plus \(K\) coefficients per lag — so a VAR\((p)\) on \(K\) variables carries \(K(Kp + 1)\) parameters in the mean, and the count explodes as \(K\) and \(p\) grow. That parameter profligacy is the VAR's defining tension, and the reason §5.2 obsesses over order selection.
For analysis it is convenient to fold a VAR\((p)\) into a VAR\((1)\) on a stacked state. Define the companion matrix \(F\), an exact analogue of the scalar companion form: a single matrix whose powers generate the entire dynamics.
A VAR also has a clean infinite-history rewrite, the Wold / VMA(\(\infty\)) representation \(y_t = \mu + \sum_{i=0}^{\infty} \Psi_i\,\varepsilon_{t-i}\) with \(\Psi_0 = I\). The matrices \(\Psi_i\) are precisely the impulse responses of §5.3, and for a VAR\((1)\) they are simply \(\Psi_i = A_1^{\,i}\) — powers of the coefficient matrix.
Estimation & order selection
Because every equation of a reduced-form VAR has the identical regressor set — a constant plus the same stacked lags — the seemingly-unrelated-regressions efficiency gain vanishes: equation-by-equation OLS is the (conditional) maximum-likelihood estimator under Gaussian errors, and it is consistent and asymptotically normal whether or not \(\Sigma\) is diagonal. You can fit a VAR with nothing more than the normal equations.
The hard part is not estimation but order selection: too few lags and residuals stay autocorrelated (biasing everything downstream); too many and you burn degrees of freedom on noise. The standard tools are information criteria, which add a complexity penalty to the log-likelihood. Let \(n = K(Kp+1)\) be the total parameter count and \(\lvert\widehat\Sigma_p\rvert\) the residual-covariance determinant at lag \(p\):
The curse of dimensionality is real here. A VAR\((4)\) on 8 macro variables already has \(8\times(8\times 4 + 1) = 264\) parameters — easily more than a typical quarterly sample of post-war data. The modern responses are Bayesian VARs with shrinkage priors (the Minnesota prior pulls coefficients toward a random walk), factor-augmented VARs that compress many series into a few factors, and large-VAR estimators with elementwise penalties. None of that changes the OLS skeleton above; they change the prior on \(B\).
# Fit a VAR(1) on two simulated series by OLS (EQ T5.3) and recover A1
import numpy as np
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.3], # the true coefficient matrix A1
[0.2, 0.4]]) # off-diagonals = cross-feedback
T = 600
y = np.zeros((T, 2))
for t in range(1, T): # simulate the VAR(1) data
y[t] = A @ y[t - 1] + rng.normal(0, 1, 2)
Z = y[:-1] # regressors: y_{t-1} (T-1 x 2)
Y = y[1:] # targets: y_t (T-1 x 2)
B = np.linalg.solve(Z.T @ Z, Z.T @ Y).T # OLS, one solve -> A1_hat (2x2)
np.set_printoptions(precision=3, suppress=True)
print("true A1:\n", A)
print("OLS A1_hat:\n", B)
ev = np.linalg.eigvals(B)
print("\neigenvalue moduli:", np.round(np.abs(ev), 3),
"-> stable" if np.all(np.abs(ev) < 1) else "-> UNSTABLE")
print("max |lambda| =", round(float(np.max(np.abs(ev))), 3), "(must be < 1)")
Impulse-response & variance decomposition
A fitted VAR is a dense block of coefficients that almost no one can read directly. The two devices that make it interpretable are the impulse-response function (IRF) — how the whole system reacts over time to a one-off shock — and the forecast-error variance decomposition (FEVD) — what share of each variable's unpredictability traces back to each shock. Both fall straight out of the VMA(\(\infty\)) coefficients \(\Psi_i\).
The identification caveat — say it out loud. Reduced-form shocks \(\varepsilon_t\) are contemporaneously correlated (\(\Sigma\) is not diagonal), so "a shock to variable 1 alone" is not well defined: in the data, variable 2 tends to move at the same instant. To read structural impulse responses you must impose identifying assumptions that orthogonalize the shocks. The textbook choice is a Cholesky (recursive) ordering — factor \(\Sigma = P P^{\top}\) with \(P\) lower-triangular and report \(\Psi_i P\) — which assumes a causal ordering of the variables (those earlier in the list can shock those later within the period, but not vice versa). Different orderings give different stories, and that ambiguity is the central, contested limitation of VAR analysis; sign restrictions, long-run (Blanchard–Quah) restrictions, and external-instrument (proxy-SVAR) methods are the modern alternatives.
Cointegration & the VECM
Everything above assumed stability — eigenvalues strictly inside the unit circle. But most economic and financial level series are integrated of order one, \(I(1)\): they have a unit root, wander like random walks, and only their differences are stationary. Run a VAR on the raw levels of two \(I(1)\) series and OLS will happily report a high \(R^2\) that is mostly spurious regression — two independent random walks look correlated purely because both trend.
The remarkable exception is cointegration. Two (or more) \(I(1)\) series are cointegrated if some linear combination of them is \(I(0)\) — stationary. Intuitively, the series share a common stochastic trend, and although each wanders without bound, they cannot wander independently: a spread between them is mean-reverting. Engle and Granger (1987) formalized this and, crucially, the Granger representation theorem proved that cointegration is equivalent to the existence of an error-correction representation.
The error-correction form is the Vector Error-Correction Model (VECM). Re-parameterize the levels VAR in differences, with one term in levels left behind:
Estimating \(r\) and \(\beta\) is the Johansen procedure (1991): a reduced-rank regression of \(\Delta y_t\) and \(y_{t-1}\) on the lagged differences yields canonical correlations whose eigenvalues drive two likelihood-ratio tests — the trace test and the maximum-eigenvalue test — for the null \(\mathrm{rank}(\Pi)\le r\). Johansen handles multiple cointegrating relations simultaneously, unlike the two-step Engle–Granger residual test, which is simplest for the single-equation \(r=1\) case.
Granger causality
A VAR also lets us ask a precise, modest question about lead–lag structure. Series \(x\) Granger-causes \(y\) if the past of \(x\) helps predict \(y\) beyond what the past of \(y\) already explains. It is a statement about predictive content, framed by Granger (1969), and it is deliberately humble: Granger causality is not structural causation. Reverse causality through an omitted common driver, or anticipation (markets pricing in a future move), can produce Granger causality with no mechanism running in that direction.
Caveats experts insist on. Granger tests are sensitive to the lag order \(p\) and to non-stationarity: applying the plain F-test to \(I(1)\) levels gives invalid critical values, so test on stationary (differenced or cointegration-corrected) data, or use the Toda–Yamamoto lag-augmented approach. Bivariate Granger causality can also be an artifact of a third, omitted variable — always interpret it as conditional on the variables actually in the VAR, and never read it as a license to claim mechanism.
# Granger-causality F-test by hand (EQ T5.9): does x help predict y?
import numpy as np
rng = np.random.default_rng(1)
# toy data: x is its own AR; y depends on LAGGED x -> x should Granger-cause y
T = 300
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):
x[t] = 0.5 * x[t-1] + rng.normal(0, 1)
y[t] = 0.4 * y[t-1] + 0.6 * x[t-1] + rng.normal(0, 1) # y <- x[t-1]
yv, ylag, xlag = y[2:], y[1:-1], x[1:-1] # align one lag
n = len(yv); ones = np.ones(n)
def ssr(X, t): # OLS residual sum of squares
b, *_ = np.linalg.lstsq(X, t, rcond=None)
r = t - X @ b
return float(r @ r)
Xr = np.column_stack([ones, ylag]) # restricted: y on its own lag
Xu = np.column_stack([ones, ylag, xlag]) # unrestricted: + x lag
ssr_r, ssr_u = ssr(Xr, yv), ssr(Xu, yv)
q, k = 1, Xu.shape[1] # 1 restriction; k params
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(f"SSR restricted = {ssr_r:8.2f}")
print(f"SSR unrestricted = {ssr_u:8.2f}")
print(f"F({q}, {n-k}) = {F:7.2f}")
print("large F (>~3.9 at 5%) => reject H0 => x Granger-causes y")
You now have the full multivariate toolkit; the last chapter puts it to work. Chapter 06 — Forecasting in Practice — covers backtesting protocols, walk-forward validation, combining models, prediction intervals, and the brutal lesson that a careful univariate baseline often beats an elaborate VAR out of sample.
References
- Sims, C. A. (1980). Macroeconomics and Reality.
- Engle, R. F. & Granger, C. W. J. (1987). Co-integration and Error Correction: Representation, Estimation, and Testing.
- Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods.
- Johansen, S. (1991). Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models.
- Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis.
- Toda, H. Y. & Yamamoto, T. (1995). Statistical Inference in Vector Autoregressions with Possibly Integrated Processes.