TIME SERIES & ECONOMETRICS · CHAPTER 01 / 06

Time Series Fundamentals

Most models assume the rows are interchangeable, so shuffling them loses nothing. Attach a clock and that assumption fails: yesterday shapes today, the order is the signal, and error bars computed as if the points were independent typically understate uncertainty. A time index breaks the i.i.d. assumption those models rely on, and stationarity is the weaker condition that replaces it.

LEVEL: INTRO · READING TIME: ≈ 24 MIN · BUILDS ON: STATS 01–03 · INSTRUMENTS: DECOMPOSER · ACF/PACF · RANDOM WALK
1.1

Trend, seasonality & noise

A time series is a sequence of observations indexed by time, \(y_1, y_2, \ldots, y_T\), where the index is not a label but a coordinate: \(y_t\) and \(y_{t+1}\) are neighbours, and that adjacency carries information. The first reflex of the field is to read the series as a sum of structured parts plus what is left over. The classical decomposition is additive:

EQ T1.1 — ADDITIVE DECOMPOSITION $$ y_t \;=\; T_t \;+\; S_t \;+\; R_t $$
\(T_t\) is the trend-cycle — the slow drift (a growing user base, a warming climate); \(S_t\) is the seasonal component — a pattern that repeats every \(m\) steps (weekly traffic, yearly retail); \(R_t\) is the remainder — everything the first two cannot explain, ideally structureless noise. When the seasonal swings grow with the level of the series, a multiplicative form \(y_t = T_t \times S_t \times R_t\) fits better — and taking logs turns multiplication back into the additive form above, the first hint that a transform can simplify structure.

This split is descriptive, not causal: it is a lens, and choosing additive versus multiplicative, or the seasonal period \(m\), is a modelling decision you make by looking. The remainder \(R_t\) is the part we actually want to be boring. If \(R_t\) still wiggles in a predictable way — if knowing \(R_{t-1}\) helps you guess \(R_t\) — then the decomposition left structure on the table, and the chapters that follow (ARIMA, ETS, GARCH) exist to mop it up.
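The split can be sketched in a few lines of NumPy. This is a minimal illustration of the classical additive decomposition, not the robust STL method: the trend is a centered moving average over one full period, the seasonal term is the average detrended value at each position in the cycle, and the remainder is what is left. The synthetic series (slope 0.5, seasonal amplitude 10, unit noise, \(m = 12\)) is an assumption made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical monthly series: linear trend + fixed seasonal shape + noise.
m, years = 12, 10
T = m * years
t = np.arange(T)
true_trend = 100 + 0.5 * t
true_season = 10 * np.sin(2 * np.pi * t / m)
y = true_trend + true_season + rng.normal(0, 1, T)

# 1) Trend: centered 2 x m moving average. The two end points get half
#    weight so the window spans exactly one period when m is even.
weights = np.r_[0.5, np.ones(m - 1), 0.5] / m
trend = np.full(T, np.nan)                    # undefined for the first/last m/2 points
trend[m // 2 : T - m // 2] = np.convolve(y, weights, mode="valid")

# 2) Seasonal: average the detrended values at each position in the cycle,
#    then center them so the seasonal terms sum to zero over one period.
detrended = y - trend
seasonal_means = np.array([np.nanmean(detrended[k::m]) for k in range(m)])
seasonal_means -= seasonal_means.mean()
seasonal = np.tile(seasonal_means, years)

# 3) Remainder: whatever trend and season leave behind (EQ T1.1 rearranged).
remainder = y - trend - seasonal

print("estimated seasonal terms:", np.round(seasonal_means, 1))
print("remainder std (true noise std is 1):", round(float(np.nanstd(remainder)), 2))
```

If the remainder's standard deviation comes out well above the noise level you put in, or the remainder still shows a visible pattern, the decomposition left structure behind. For a multiplicative series, one would apply the same steps to \(\ln y_t\).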

A note on honesty. The classical additive split assumes the trend is smooth and the season has a fixed period and shape. Real series violate both — holidays move, regimes shift, the period itself drifts. Robust modern decompositions (STL, the loess-based method) allow the seasonal shape to evolve and resist outliers; treat any decomposition as a hypothesis to check, not a fact to trust.

Under the additive model (EQ T1.1), at a given month the trend is \( T_t = 100 \), the seasonal term is \( S_t = 25 \), and the remainder is \( R_t = -5 \). What is the observed value \( y_t \)?
The additive decomposition simply sums the parts: \( y_t = T_t + S_t + R_t = 100 + 25 + (-5) = \) 120. Each component pulls the level up or down; the remainder is the small correction the structured terms missed.
INSTRUMENT T1.1: TIME-SERIES DECOMPOSER · COMPOSE T + S + R · EQ T1.1 (readouts: observed range, seasonal period m, remainder variance)
Four stacked panels: the observed series on top, then the three components that built it — trend, seasonal, remainder. Push the trend slope negative to watch the whole series tilt down; the seasonal panel never moves, because season is independent of level in the additive model. Crank noise up and the remainder panel fills with hash while the observed series gets ragged — that hash is exactly the \(R_t\) the next chapters try to model. With noise at zero, the observed series is a clean sum of two smooth curves: a perfect, and unrealistic, world.
1.2

Stationarity & why it matters

Here is the assumption almost every classical model needs, and the one a clock loves to break. A series is (weakly) stationary if its statistical character does not depend on when you look at it. Concretely, three things must hold for all \(t\) and all lags \(k\):

EQ T1.2 — WEAK (COVARIANCE) STATIONARITY $$ \mathbb{E}[y_t] = \mu \;\;(\text{constant}), \qquad \mathrm{Var}(y_t) = \sigma^2 \;\;(\text{constant}), \qquad \mathrm{Cov}(y_t,\, y_{t+k}) = \gamma_k \;\;(\text{depends on } k \text{ only}) $$
The mean is flat, the variance is flat, and the covariance between two points depends only on the gap \(k\) between them, never on their absolute position. A series with a trend fails the first condition; a series whose swings widen over time fails the second; a series with a moving seasonal pattern fails the third. Stationarity is what lets the past stand in for the future — if the rules of the game keep changing, a model fit on history is estimating a target that no longer exists.
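A crude way to see the first two conditions fail is to compare the mean and variance of the first and second halves of a series. The sketch below is an eyeball diagnostic under made-up settings (a slope of 0.01 and a noise scale that widens from 1 to 5), not a formal test; the helper name `half_stats` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 2000
noise = rng.normal(0, 1, T)

# Three illustrative series built from the same shocks.
series = {
    "white noise": noise,                                 # stationary
    "linear trend + noise": 0.01 * np.arange(T) + noise,  # mean drifts upward
    "widening swings": noise * np.linspace(1, 5, T),      # variance grows over time
}

def half_stats(x):
    """Mean and variance of the first and second half of x."""
    first, second = x[: len(x) // 2], x[len(x) // 2 :]
    return first.mean(), second.mean(), first.var(), second.var()

for name, x in series.items():
    m1, m2, v1, v2 = half_stats(x)
    print(f"{name:22s} mean {m1:6.2f} -> {m2:6.2f}   var {v1:6.2f} -> {v2:6.2f}")
```

The white-noise row should show nearly identical halves; the trending series shifts its mean and the widening series inflates its variance. A stable comparison like this is necessary for stationarity but far from sufficient, since it says nothing about the covariance condition.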

Why is this the load-bearing assumption? Independent-and-identically-distributed (i.i.d.) data is the comfortable world of the rest of this encyclopedia: each row drawn fresh from one fixed distribution, so a sample average converges to the truth and a single split estimates generalization (the holdout logic of MLOPS · §1.1). A time series is emphatically not i.i.d. — the points are dependent by construction. Stationarity is the weaker substitute: it does not require independence, only that the dependence structure be stable over time. That stability is enough to make estimation and forecasting well-posed.

| Series | Violates | Stationary? | Fix (§1.5) |
| --- | --- | --- | --- |
| Linear upward trend | constant mean | no | difference once |
| Variance grows with level | constant variance | no | log / Box–Cox |
| Seasonal sales | const. mean & \(\gamma_k\) | no | seasonal difference |
| White noise | nothing | yes | already there |
| Stable AR(1), \(\lvert\phi\rvert < 1\) | nothing | yes | already there |

Strict vs weak. The definition above is weak (second-order) stationarity — it constrains only the first two moments. Strict stationarity asks that the entire joint distribution be time-invariant, a much stronger demand. For Gaussian processes the two coincide, which is why the weak form is the working definition in practice. Most of forecasting lives on the assumption that, after some transform, the series is weakly stationary.

A company's monthly revenue grows steadily year after year along a clear upward trend. Is that raw revenue series stationary in the sense of EQ T1.2? (Answer yes or no.)
A persistent upward trend means \(\mathbb{E}[y_t]\) climbs with \(t\) — the mean is not constant, so the first condition of EQ T1.2 fails. The series is not stationary; differencing it (§1.5) removes the trend and usually restores stationarity.
1.3

Autocorrelation — ACF & PACF

If the points are dependent, the natural question is: how dependent, and at what range? The autocorrelation function (ACF) answers it by correlating the series with a delayed copy of itself. At lag \(k\) it is the covariance \(\gamma_k\) from EQ T1.2, normalized by the variance so it lives in \([-1, +1]\):

EQ T1.3 — THE AUTOCORRELATION FUNCTION $$ \rho_k \;=\; \frac{\gamma_k}{\gamma_0} \;=\; \frac{\mathrm{Cov}(y_t,\, y_{t+k})}{\mathrm{Var}(y_t)}, \qquad \hat{\rho}_k = \frac{\sum_{t=1}^{T-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2} $$
\(\rho_0 = 1\) always (a series is perfectly correlated with itself). The plot of \(\hat{\rho}_k\) against \(k\) is a correlogram. Under the null of pure white noise, the estimates scatter inside an approximate 95% band of \(\pm 1.96/\sqrt{T}\); bars that poke outside it are evidence of real structure, though about one bar in twenty will stray outside by chance alone. The shape of the ACF is a fingerprint: a slow geometric decay says "autoregressive memory"; a sharp cut-off after a few lags says "moving-average"; tall spikes at lag \(m\) and its multiples say "seasonality of period \(m\)".

The ACF has a blind spot. If today depends on yesterday, then today also correlates with the day before — not directly, but through yesterday. The ACF cannot tell a direct link from a relayed one. The partial autocorrelation function (PACF) closes that gap: \(\alpha_k\) is the correlation between \(y_t\) and \(y_{t-k}\) after removing the linear effect of all the lags in between. It is the direct dependence at range \(k\), with the relayed paths stripped out.

EQ T1.4 — ACF / PACF SIGNATURES $$ \text{AR}(p): \quad \text{ACF decays},\;\; \text{PACF cuts off after lag } p; \qquad \text{MA}(q): \quad \text{ACF cuts off after lag } q,\;\; \text{PACF decays} $$
This duality is the classic Box–Jenkins identification rule, and it is why both plots are read together. An AR(\(p\)) process — each value a weighted sum of its own past — shows a PACF that drops to zero past lag \(p\), because once you condition on the first \(p\) lags there is no direct link left. An MA(\(q\)) process — each value a weighted sum of past shocks — is its mirror image. Chapter 02 turns these fingerprints into fitted models.
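One way to make "conditioning on the lags in between" concrete is to compute the PACF by regression: the partial autocorrelation at lag \(k\) is the last coefficient when \(y_t\) is regressed on its first \(k\) lags. The sketch below uses ordinary least squares for this, which is one of several standard estimators; the function names `simulate_ar1` and `pacf_ols` and the settings (\(\phi = 0.7\), \(T = 2000\)) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_ar1(phi, T, rng):
    """Simulate y_t = phi * y_{t-1} + eps_t with standard-normal shocks."""
    eps = rng.normal(0, 1, T)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = phi * y[t - 1] + eps[t]
    return y

def pacf_ols(x, K):
    """Sample PACF at lags 1..K: for each k, regress x_t on x_{t-1}..x_{t-k}
    and keep the coefficient on the longest lag."""
    x = x - x.mean()
    n = len(x)
    out = []
    for k in range(1, K + 1):
        # Column j holds the series lagged by j + 1 steps, aligned with x[k:].
        X = np.column_stack([x[k - j - 1 : n - j - 1] for j in range(k)])
        coef, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
        out.append(coef[-1])
    return np.array(out)

y = simulate_ar1(phi=0.7, T=2000, rng=rng)
alpha = pacf_ols(y, K=5)
band = 1.96 / np.sqrt(len(y))          # approximate white-noise band
for k, a in enumerate(alpha, start=1):
    print(f"lag {k}: PACF {a:6.3f}   outside band: {abs(a) > band}")
```

For an AR(1) the first value should sit near \(\phi\) and the rest near zero, which is the cut-off described in EQ T1.4.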

For the workhorse AR(1) process \(y_t = \phi\, y_{t-1} + \varepsilon_t\), the theory is exact and worth memorizing: the ACF is a clean geometric decay, \(\rho_k = \phi^k\), and the PACF is a single spike of height \(\phi\) at lag 1 and exactly zero everywhere after. That pair — exponential ACF, one-spike PACF — is the textbook AR(1) signature, and it is what the next instrument lets you see.
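The geometric decay takes two lines to derive, assuming \(|\phi| < 1\) so that the process is stationary with mean zero, and that \(\varepsilon_t\) is uncorrelated with past values of the series. Multiply the AR(1) equation by \(y_{t-k}\) for \(k \ge 1\) and take expectations:

$$ \gamma_k \;=\; \mathbb{E}[y_t\, y_{t-k}] \;=\; \phi\,\mathbb{E}[y_{t-1}\, y_{t-k}] + \mathbb{E}[\varepsilon_t\, y_{t-k}] \;=\; \phi\,\gamma_{k-1} + 0 $$

Dividing by \(\gamma_0\) turns this into a recursion on the autocorrelations, which unrolls down to \(\rho_0 = 1\):

$$ \rho_k \;=\; \phi\,\rho_{k-1} \;=\; \phi^2 \rho_{k-2} \;=\; \cdots \;=\; \phi^k \rho_0 \;=\; \phi^k $$

With \(\phi = 0.5\), for example, the first three autocorrelations are \(0.5\), \(0.25\), and \(0.125\).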

An AR(1) process \( y_t = \phi\,y_{t-1} + \varepsilon_t \) has \( \phi = 0.7 \). Using the AR(1) result \( \rho_k = \phi^{k} \), what is its theoretical autocorrelation at lag \( k = 3 \)?
For an AR(1), the ACF decays geometrically: \( \rho_3 = \phi^3 = 0.7^3 = 0.7 \times 0.7 \times 0.7 = \) 0.343. Memory fades by a constant factor \(\phi\) per step — the defining shape of an autoregressive correlogram.
PYTHON · RUNNABLE IN-BROWSER
# Simulate AR(1), then compute and plot its sample ACF (EQ T1.3).
import numpy as np
rng = np.random.default_rng(0)

phi, T = 0.7, 600
eps = rng.normal(0, 1, T)
y = np.zeros(T)
for t in range(1, T):            # y_t = phi * y_{t-1} + eps_t
    y[t] = phi * y[t-1] + eps[t]
y = y - y.mean()                 # center so the ACF formula is clean

def acf(x, K):                   # sample autocorrelation up to lag K
    denom = np.sum(x * x)
    return np.array([np.sum(x[:len(x)-k] * x[k:]) / denom for k in range(K+1)])

K = 12
r = acf(y, K)
band = 1.96 / np.sqrt(T)         # +/- white-noise significance band
print(" lag   sample ACF   theory phi^k")
for k in range(K+1):
    flag = "  *" if abs(r[k]) > band and k > 0 else ""
    print(f" {k:3d}    {r[k]:8.3f}     {phi**k:8.3f}{flag}")
print(f"\nwhite-noise band +/-{band:.3f}; bars marked * are real memory.")
print("note the sample ACF tracks the geometric phi^k decay of an AR(1).")
plot_xy(list(range(K+1)), list(r))   # plot_xy: plotting helper assumed to come from the in-browser runtime, not NumPy
INSTRUMENT T1.2: ACF / PACF EXPLORER · AR & MA SERIES → CORRELOGRAMS · EQ T1.4 (readouts: process, signature, white-noise band)
Top panel: a simulated realization. Bottom two: its sample ACF and PACF, with the grey \(\pm 1.96/\sqrt{T}\) band — bars inside it are indistinguishable from noise. Pick AR(1) and watch the ACF decay smoothly while the PACF shows one spike and quits (EQ T1.4); flip to MA(1) and the two plots swap roles. WHITE NOISE keeps almost every bar inside the band — the look of a series with no exploitable memory. Drag the coefficient negative to make the correlogram alternate sign, and press RESHUFFLE to feel how much a finite sample wobbles around the theory.
1.4

White noise & the random walk

Two reference processes anchor the whole subject — one the picture of "no structure," the other the most important non-stationary series in practice. White noise is the boring ideal: a sequence of uncorrelated, zero-mean, constant-variance shocks. It is stationary by construction and, crucially, unforecastable beyond its mean.

EQ T1.5 — WHITE NOISE $$ \varepsilon_t \;\sim\; (0,\, \sigma^2), \qquad \mathbb{E}[\varepsilon_t] = 0, \quad \mathrm{Var}(\varepsilon_t) = \sigma^2, \quad \mathrm{Cov}(\varepsilon_t,\, \varepsilon_{t+k}) = 0 \;\; \text{for } k \neq 0 $$
Every autocorrelation past lag 0 is zero, so its ACF is a single spike at the origin and flat thereafter. White noise is the goal, not the enemy: when the residuals of a fitted model look like white noise, you have extracted all the linear structure the data offered. Tools like the Ljung–Box test formalize "do these residuals look white?" by checking whether a batch of autocorrelations is jointly indistinguishable from zero.
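The Ljung–Box idea can be sketched directly. The statistic pools the first \(h\) squared sample autocorrelations, \(Q = T(T+2)\sum_{k=1}^{h} \hat{\rho}_k^2/(T-k)\), and under the white-noise null it is approximately chi-square with \(h\) degrees of freedom. The example below assumes \(h = 10\) and hard-codes the 5% critical value 18.307 for that case; the function names are hypothetical. When the test is run on the residuals of a fitted model, the degrees of freedom are usually reduced by the number of fitted parameters, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(11)

def sample_acf(x, K):
    """Sample autocorrelations at lags 1..K (EQ T1.3)."""
    x = x - x.mean()
    denom = np.sum(x * x)
    return np.array([np.sum(x[:-k] * x[k:]) / denom for k in range(1, K + 1)])

def ljung_box_q(x, h):
    """Ljung-Box statistic Q = T (T + 2) * sum_k rho_k^2 / (T - k), k = 1..h."""
    T = len(x)
    rho = sample_acf(x, h)
    lags = np.arange(1, h + 1)
    return T * (T + 2) * np.sum(rho**2 / (T - lags))

CHI2_95_DF10 = 18.307   # 5% critical value, chi-square with 10 degrees of freedom

T, h = 1000, 10
white = rng.normal(0, 1, T)
ar1 = np.zeros(T)
for t in range(1, T):                       # AR(1) with phi = 0.7, same shocks
    ar1[t] = 0.7 * ar1[t - 1] + white[t]

for name, x in [("white noise", white), ("AR(1), phi=0.7", ar1)]:
    q = ljung_box_q(x, h)
    verdict = "reject whiteness" if q > CHI2_95_DF10 else "cannot reject whiteness"
    print(f"{name:16s} Q = {q:9.1f}  -> {verdict}")
```

The AR(1) series should produce a Q far above the critical value, while white noise will usually, though not always, stay below it: at the 5% level roughly one white series in twenty is rejected by chance.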

Now cumulate that noise. A random walk sets each value equal to the previous one plus a fresh independent shock — it is the running sum of white noise, and it is the canonical model for an unpredictable price, a diffusing particle, or any quantity that wanders without an anchor:

EQ T1.6 — RANDOM WALK $$ y_t \;=\; y_{t-1} + \varepsilon_t \;=\; y_0 + \sum_{i=1}^{t} \varepsilon_i, \qquad \mathrm{Var}(y_t) = t\,\sigma^2 $$
It is the AR(1) of §1.3 pushed to its boundary, \(\phi = 1\) — a unit root. That single fact is decisive: the variance \(t\sigma^2\) grows without bound, so the constant-variance condition of EQ T1.2 fails and a random walk is not stationary. There is no fixed mean to revert to; a shock today is never forgotten, it is baked permanently into every future value. This is why "the series looks like it has momentum" is so often just a random walk fooling the eye — and why distinguishing a true trend from a unit root (the Dickey–Fuller test, Chapter 03) is one of the field's defining problems.
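The linear growth of the variance is easy to check by simulation. The sketch below draws many independent random walks with \(y_0 = 0\) and compares the variance across paths at a few time points with \(t\sigma^2\); the number of paths and the horizon are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

n_paths, T, sigma = 5000, 200, 1.0
shocks = rng.normal(0, sigma, size=(n_paths, T))
walks = np.cumsum(shocks, axis=1)      # column t-1 holds y_t, starting from y_0 = 0

for t in (10, 50, 100, 200):
    sample_var = walks[:, t - 1].var() # variance across paths at time t
    print(f"t = {t:3d}   sample Var(y_t) = {sample_var:7.1f}   theory t*sigma^2 = {t * sigma**2:7.1f}")
```

The variance is taken across paths at a fixed \(t\), which is what EQ T1.6 describes. The variance along a single path is a different quantity and does not settle to any fixed value.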

The contested part, stated plainly. Whether a given real series — GDP, a stock index, an exchange rate — is "trend-stationary" (a deterministic trend plus stationary noise) or "difference-stationary" (a random walk with drift) is genuinely hard to decide from finite data, and decades of econometrics have been spent arguing specific cases. The two imply very different forecasts and very different long-run behaviour. Unit-root tests give evidence, not certainty; honest practice reports the ambiguity rather than hiding it.
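To see where the evidence comes from, here is a minimal sketch of the simplest Dickey–Fuller regression: fit \(\nabla y_t = a + \delta\, y_{t-1} + e_t\) by least squares and look at the t-statistic of \(\delta = \phi - 1\). Under a unit root that statistic does not follow the usual Student-t distribution; the sketch assumes the standard large-sample 5% critical value of about \(-2.86\) for this constant-only form. It omits the lagged differences of the augmented test, and the function names and simulation settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)

def simulate_ar1(phi, T, rng):
    """Simulate y_t = phi * y_{t-1} + eps_t (phi = 1 gives a random walk)."""
    eps = rng.normal(0, 1, T)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = phi * y[t - 1] + eps[t]
    return y

def df_regression(y):
    """OLS fit of diff(y)_t = a + delta * y_{t-1} + e_t.
    Returns (delta_hat, t-statistic of delta_hat)."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)             # OLS coefficient covariance
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

DF_CRIT_5PCT = -2.86   # assumed large-sample 5% value, constant-only case

cases = {
    "random walk (phi = 1)": simulate_ar1(1.0, 2000, rng),
    "stationary AR(1), phi = 0.5": simulate_ar1(0.5, 2000, rng),
    "near unit root, phi = 0.98, T = 100": simulate_ar1(0.98, 100, rng),
}
results = {}
for name, y in cases.items():
    delta, t_stat = df_regression(y)
    results[name] = (delta, t_stat)
    verdict = "reject unit root" if t_stat < DF_CRIT_5PCT else "cannot reject unit root"
    print(f"{name:38s} delta = {delta:7.3f}   t = {t_stat:7.2f}   {verdict}")
```

The clearly stationary case is rejected decisively. The short, highly persistent series is the instructive one: it is stationary by construction, yet with only 100 points the test often cannot tell it apart from a random walk, which is the ambiguity described above.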

Is a random walk \( y_t = y_{t-1} + \varepsilon_t \) a stationary process? (Answer yes or no.)
From EQ T1.6, \(\mathrm{Var}(y_t) = t\,\sigma^2\) grows without bound as \(t\) increases, violating the constant-variance condition of EQ T1.2 — and there is no fixed mean to revert to. A random walk is not stationary; its first difference \(y_t - y_{t-1} = \varepsilon_t\) is white noise, which is.
INSTRUMENT T1.3: RANDOM WALK vs STATIONARY AR(1) · φ → 1 IS A UNIT ROOT · EQ T1.6 (readouts: regime, theoretical Var(y∞), stationary?)
Five independent paths share one set of shocks but differ only in \(\phi\). Down near \(\phi = 0.5\) every path is a tight, mean-reverting AR(1): pulled back toward zero, finite variance \(\sigma^2/(1-\phi^2)\), the dashed envelope holds them in. Slide \(\phi\) toward 1 and the envelope flares open — at exactly \(\phi = 1\) it becomes a random walk, the paths wander off and never come home, and the readout's variance goes to ∞. That divergence at the unit root is the loss of stationarity, made visible. Press NEW SHOCKS to redraw.
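The envelope formula follows from a one-line argument, sketched here under the assumptions that \(|\phi| < 1\) and that \(\varepsilon_t\) is uncorrelated with \(y_{t-1}\). Take the variance of both sides of \(y_t = \phi\, y_{t-1} + \varepsilon_t\):

$$ \mathrm{Var}(y_t) \;=\; \phi^2\, \mathrm{Var}(y_{t-1}) + \sigma^2 $$

Under stationarity both variances equal the same constant \(\gamma_0\), so

$$ \gamma_0 \;=\; \phi^2 \gamma_0 + \sigma^2 \quad\Longrightarrow\quad \gamma_0 \;=\; \frac{\sigma^2}{1 - \phi^2} $$

With \(\sigma^2 = 1\) this gives roughly \(1.33\) at \(\phi = 0.5\), \(5.26\) at \(\phi = 0.9\), and \(50.25\) at \(\phi = 0.99\). The denominator vanishes as \(\phi \to 1\), which is why the envelope flares open at the unit root.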
1.5

Differencing & transforms to stationarity

So a great many real series are not stationary — and the entire toolkit needs them to be. The fix is a pair of cheap, reversible transforms that attack the two ways stationarity fails: a non-constant mean, and a non-constant variance.

The mean problem — trend — is killed by differencing: replace the series with the step-to-step changes. Define the difference operator \(\nabla y_t = y_t - y_{t-1}\). One difference removes a linear trend; a second difference removes a quadratic one. The payoff is exact for the random walk:

EQ T1.7 — FIRST DIFFERENCING $$ \nabla y_t \;=\; y_t - y_{t-1}, \qquad \text{random walk} \;\Rightarrow\; \nabla y_t = (y_{t-1} + \varepsilon_t) - y_{t-1} = \varepsilon_t $$
Differencing a random walk returns pure white noise — the non-stationary unit root is annihilated in one step. A series that needs \(d\) differences to become stationary is called integrated of order \(d\), written \(I(d)\); a random walk is \(I(1)\), white noise is \(I(0)\). That little \(d\) is precisely the "I" in ARIMA (Chapter 02). For seasonal trends, the seasonal difference \(\nabla_m y_t = y_t - y_{t-m}\) does the same job at lag \(m\). Caution: over-differencing injects artificial negative autocorrelation and inflates variance — difference only as much as you must.
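Both the caution and the seasonal difference can be checked numerically. In the sketch below, differencing white noise (already \(I(0)\)) gives \(\varepsilon_t - \varepsilon_{t-1}\), whose variance is \(2\sigma^2\) and whose lag-1 autocorrelation is \(-0.5\); and a lag-\(m\) difference cancels a fixed seasonal pattern. The seasonal series (amplitude 10, \(m = 12\), unit noise) is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(21)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return np.sum(x[:-1] * x[1:]) / np.sum(x * x)

# Over-differencing: white noise needs no differencing. Doing it anyway
# doubles the variance and creates negative lag-1 autocorrelation.
eps = rng.normal(0, 1, 5000)
over = np.diff(eps)
print(f"white noise   var {eps.var():.2f}   lag-1 acf {lag1_autocorr(eps):+.2f}")
print(f"differenced   var {over.var():.2f}   lag-1 acf {lag1_autocorr(over):+.2f}")

# Seasonal differencing: y_t - y_{t-m} cancels a pattern that repeats every m steps.
m, T = 12, 240
t = np.arange(T)
season = 10 * np.sin(2 * np.pi * t / m)
y = season + rng.normal(0, 1, T)
seasonal_diff = y[m:] - y[:-m]
print(f"seasonal series var {y.var():.1f}  ->  after lag-{m} difference {seasonal_diff.var():.1f}")
```

The seasonal difference leaves only the difference of two noise terms, so its variance lands near \(2\sigma^2\) rather than \(\sigma^2\): even a justified difference has a cost.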

The variance problem — swings that widen as the series grows — is killed by a variance-stabilizing transform. The log is the everyday choice; the Box–Cox family generalizes it with a single tunable power \(\lambda\), smoothly spanning from "no transform" (\(\lambda = 1\)) through "square root" (\(\lambda = 0.5\)) to "log" (\(\lambda \to 0\)):

EQ T1.8 — THE BOX–COX TRANSFORM $$ y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[4pt] \ln y_t & \lambda = 0 \end{cases} \qquad (y_t > 0) $$
Choose \(\lambda\) so the spread of the series stops depending on its level. Because \(\ln\) turns a multiplicative seasonal pattern into an additive one (recall §1.1), the log is also what converts a multiplicative decomposition into the friendly additive form. The standard recipe stacks the two: first stabilize the variance with a transform, then stabilize the mean with differencing — variance before mean, because differencing a heteroscedastic series just relocates the problem.
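EQ T1.8 translates almost directly into code. The sketch below implements the transform and its inverse, then applies three values of \(\lambda\) to a synthetic series whose noise is multiplicative, comparing the spread of step-to-step changes in the first and second half. The function names and the series (exponential growth with 10% multiplicative noise) are assumptions for illustration, and the eyeball comparison stands in for a proper choice of \(\lambda\), which is normally made by maximum likelihood or a similar criterion.

```python
import numpy as np

def box_cox(y, lam):
    """EQ T1.8: (y^lam - 1) / lam for lam != 0, ln(y) for lam == 0. Requires y > 0."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Invert box_cox to return to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

def spread_by_half(x):
    """Std of step-to-step changes in the first and second half of x."""
    d = np.diff(x)
    return d[: len(d) // 2].std(), d[len(d) // 2 :].std()

rng = np.random.default_rng(2)
T = 600
level = np.exp(0.005 * np.arange(T))          # exponentially growing level
y = level * np.exp(rng.normal(0, 0.1, T))     # multiplicative noise: spread grows with level

for lam in (1.0, 0.5, 0.0):
    s1, s2 = spread_by_half(box_cox(y, lam))
    print(f"lambda = {lam:3.1f}   spread {s1:7.3f} -> {s2:7.3f}   ratio {s2 / s1:5.2f}")
```

A ratio near 1 means the spread no longer depends on the level. For this series that happens at \(\lambda = 0\), the log, because the noise was built to be multiplicative; `inv_box_cox` maps forecasts made on the transformed scale back to the original units.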
Apply the first-difference operator \(\nabla y_t = y_t - y_{t-1}\) to the series \( [\,2,\ 5,\ 9,\ 14\,] \). What is the last value of the differenced series?
The differences are \(5-2 = 3\), \(9-5 = 4\), \(14-9 = 5\), giving \([\,3,\ 4,\ 5\,]\). Differencing shortens the series by one (you cannot difference the first point), and the last value is 5. Notice the gaps are themselves rising by 1 each step — a hint this series has quadratic curvature that a second difference would flatten.
PYTHON · RUNNABLE IN-BROWSER
# Difference a trending series and watch the variance collapse (EQ T1.7).
import numpy as np
rng = np.random.default_rng(1)

T = 400
trend = 0.5 * np.arange(T)               # a steady linear climb: non-stationary mean
y = trend + np.cumsum(rng.normal(0, 1, T))  # trend + a random-walk wander on top

d1 = np.diff(y)                          # first difference: nabla y_t = y_t - y_{t-1}
d2 = np.diff(d1)                         # second difference

def stats(name, x):
    print(f"{name:18s} mean {x.mean():8.3f}   variance {x.var():12.1f}")

print("level vs differenced series:")
stats("y (level)", y)                    # huge variance: the trend dominates
stats("diff once (d=1)", d1)             # variance plummets; mean ~ the slope 0.5
stats("diff twice (d=2)", d2)            # flat mean ~0; over-differenced -> var rises again
print("\none difference removes the trend (mean -> the slope, variance collapses);")
print("a SECOND difference over-does it -- variance climbs back. Difference sparingly.")
plot_xy(list(range(len(d1))), list(d1))  # the stationary-looking differenced series
NEXT

You now have the vocabulary; ARIMA gives it grammar. Once a series is stationary (variance-stabilized, then differenced \(d\) times), its leftover memory is the AR and MA structure the correlograms revealed. Chapter 02 fuses the three letters: the Integration order \(d\) from this chapter, the AutoRegression order \(p\) read off the PACF, and the Moving Average order \(q\) read off the ACF, into one of the most widely used forecasting models.

1.R

References

  1. Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley — the canonical text; the ACF/PACF identification method (§1.3) and integration order \(d\) (§1.5) are its core.
  2. Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press — the graduate-level reference for stationarity, unit roots, and the econometric theory behind §1.2 and §1.4.
  3. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts (free online) — the modern practitioner's guide; decomposition (§1.1), STL, and Box–Cox (§1.5).
  4. Dickey, D. A. & Fuller, W. A. (1979). Distribution of the Estimators for Autoregressive Time Series with a Unit Root. JASA 74(366) — the unit-root test that decides random walk vs stationary (§1.4).
  5. Box, G. E. P. & Cox, D. R. (1964). An Analysis of Transformations. J. R. Stat. Soc. B 26(2) — the variance-stabilizing power transform of EQ T1.8.
  6. Ljung, G. M. & Box, G. E. P. (1978). On a Measure of Lack of Fit in Time Series Models. Biometrika 65(2) — the portmanteau test for "are these residuals white noise?" (§1.4).
  7. Yule, G. U. (1927). On a Method of Investigating Periodicities in Disturbed Series. Phil. Trans. R. Soc. A 226 — the paper that introduced the autoregressive model (§1.3).