Exponential Smoothing & Holt-Winters

3.1

Simple exponential smoothing

Start with a series that has no trend and no season — just a level that wanders, buried in noise. A naïve forecast uses only the last value; a long moving average uses many values but weights them all equally, which is plainly wrong: a reading from a year ago should not count as much as yesterday's. Simple exponential smoothing (SES) resolves the tension with one parameter. Maintain a running estimate of the level $\ell_t$ and, at every new observation, nudge it toward the latest value by a fraction $\alpha$:

EQ T3.1 — THE SMOOTHING RECURRENCE $$ \ell_t \;=\; \alpha\, y_t + (1-\alpha)\,\ell_{t-1}, \qquad 0 < \alpha < 1, \qquad \hat{y}_{t+1\mid t} = \ell_t $$

$\ell_t$ is the smoothed level after seeing $y_t$; the one-step-ahead forecast is simply that level, and so is the forecast for every horizon (a flat line — SES has no trend). $\alpha$ is the learning rate: $\alpha \to 1$ recovers the naïve "repeat the last value" forecast; $\alpha \to 0$ freezes the level at its initial estimate, a long-run average. The whole method is this single line, applied once per observation — $O(n)$ time, $O(1)$ memory.

The error-correction form makes the "learning rate" reading explicit. Rearranging EQ T3.1 around the one-step forecast error $e_t = y_t - \ell_{t-1}$:

EQ T3.2 — ERROR-CORRECTION FORM $$ \ell_t \;=\; \ell_{t-1} + \alpha\,(y_t - \ell_{t-1}) \;=\; \ell_{t-1} + \alpha\, e_t $$

Read it as gradient descent on squared error with step size $\alpha$: each forecast miss $e_t$ pulls the level a fraction $\alpha$ of the way toward correcting it. This is the same shape as the perceptron and Widrow-Hoff (LMS) update — exponential smoothing is, quite literally, online learning of a moving level, decades before that name existed.
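A minimal sketch of that reading, using the numbers from the worked example further down ($\ell_{t-1} = 10$, $y_t = 20$, $\alpha = 0.3$). The three function names are illustrative only; they simply write the same update as a weighted average (EQ T3.1), as an error correction (EQ T3.2), and as one gradient step on the loss $\tfrac12 (y_t - \ell)^2$.

```python
# Three equivalent ways to write one SES update (function names are illustrative).

def update_weighted(level, y, alpha):
    """EQ T3.1: weighted average of the new observation and the old level."""
    return alpha * y + (1 - alpha) * level

def update_error_correction(level, y, alpha):
    """EQ T3.2: old level plus a fraction alpha of the forecast error."""
    error = y - level
    return level + alpha * error

def update_gradient_step(level, y, alpha):
    """One gradient-descent step on the loss 0.5 * (y - level)**2."""
    gradient = -(y - level)          # derivative of the loss with respect to level
    return level - alpha * gradient

level, y, alpha = 10.0, 20.0, 0.3
results = [f(level, y, alpha) for f in
           (update_weighted, update_error_correction, update_gradient_step)]
print(results)   # all three agree, up to floating-point rounding
```

All three return 13 (up to rounding), so the smoothing parameter and the step size are the same number seen from two angles.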

Why "exponential"? Unrolling the recurrence shows the forecast is a weighted average of all past observations, with weights that decay geometrically into the past:

EQ T3.3 — GEOMETRIC WEIGHTING OF THE PAST $$ \hat{y}_{t+1\mid t} \;=\; \alpha \sum_{k=0}^{t-1} (1-\alpha)^{k}\, y_{t-k} \;+\; (1-\alpha)^{t}\,\ell_0, \qquad \sum_{k=0}^{\infty} \alpha\,(1-\alpha)^{k} = 1 $$

The weight on the observation $k$ steps back is $\alpha(1-\alpha)^k$ — largest for the most recent point and shrinking by a constant factor $(1-\alpha)$ each step. The weights are a geometric series that sums to one, so the forecast is a genuine weighted average. This is the entire idea of the chapter in one line: the past is never thrown away, it just fades. A small $\alpha$ means a long memory (slow fade); a large $\alpha$ means a short one. The instrument below draws this decay.
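The unrolling can be checked numerically. The sketch below uses made-up observations and an assumed starting level $\ell_0 = 50$; it runs the recurrence of EQ T3.1 and then rebuilds the same number from the geometric weights of EQ T3.3.

```python
# Check EQ T3.3 numerically: the recurrence and the unrolled weighted sum agree.
alpha = 0.3
level_0 = 50.0                                   # assumed initial level
y = [52.0, 49.5, 51.0, 53.5, 50.0, 48.5, 52.5]   # made-up observations y_1..y_t

# Recurrence (EQ T3.1), applied once per observation.
level = level_0
for obs in y:
    level = alpha * obs + (1 - alpha) * level

# Unrolled form (EQ T3.3): weight alpha * (1 - alpha)**k on the point k steps back.
t = len(y)
weights = [alpha * (1 - alpha) ** k for k in range(t)]
unrolled = sum(w * y[t - 1 - k] for k, w in enumerate(weights))
unrolled += (1 - alpha) ** t * level_0           # leftover weight on the start value

print(f"recurrence = {level:.6f}, unrolled = {unrolled:.6f}")
print(f"weight on data = {sum(weights):.4f}, weight on level_0 = {(1 - alpha) ** t:.4f}")
```

With a finite history the data weights sum to slightly less than one; the remainder, $(1-\alpha)^t$, sits on the initial level and vanishes as $t$ grows.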

Simple exponential smoothing with $\alpha = 0.3$. The current level is $\ell_{t-1} = 10$ and a new observation arrives, $y_t = 20$. What is the updated level $\ell_t$?

EQ T3.1: $\ell_t = \alpha\,y_t + (1-\alpha)\,\ell_{t-1} = 0.3 \times 20 + 0.7 \times 10 = 6 + 7 = $ 13. The level moves 30% of the way from 10 toward the new reading.

With $\alpha = 0.3$, what weight does EQ T3.3 place on the observation two steps in the past, $y_{t-2}$? (Use $k = 2$: $\alpha(1-\alpha)^k$.)

$\alpha(1-\alpha)^2 = 0.3 \times 0.7^2 = 0.3 \times 0.49 = $ 0.147. Compare $y_t$'s weight of $0.30$ and $y_{t-1}$'s of $0.21$: each step back loses a factor of $0.7$.

PYTHON · RUNNABLE IN-BROWSER

# Simple exponential smoothing in numpy: fit a level, print fitted vs actual
import numpy as np
rng = np.random.default_rng(0)

# a wandering level (random walk) plus observation noise -- no trend, no season
n = 24
level_true = 50 + np.cumsum(rng.normal(0, 1.2, n))
y = level_true + rng.normal(0, 2.0, n)

alpha = 0.3
ell = y[0]                       # initialise the level at the first observation
fitted = np.empty(n)
fitted[0] = ell
for t in range(1, n):
    fitted[t] = ell             # one-step forecast BEFORE seeing y[t] is the old level
    ell = alpha * y[t] + (1 - alpha) * ell   # EQ T3.1 update

sse = np.sum((y[1:] - fitted[1:]) ** 2)
print(f"alpha = {alpha},  one-step SSE = {sse:.2f},  final level = {ell:.2f}")
print(" t   actual   forecast   error")
for t in range(1, 8):
    print(f"{t:2d}  {y[t]:7.2f}  {fitted[t]:8.2f}  {y[t]-fitted[t]:+7.2f}")
plot_xy(list(range(n)), list(y))    # plot the noisy series (plot_xy is assumed to be supplied by the in-browser runtime)


INSTRUMENT T3.1 — EXPONENTIAL-SMOOTHING EXPLORER · GEOMETRIC WEIGHTS · EQ T3.3 · LIVE

[Interactive instrument. Control: smoothing α (shown at 0.30). Readouts: weight on last observation; effective memory (half-life); weight in last 5 observations.]

Each mint bar is the weight EQ T3.3 places on an observation that many steps in the past; they form a geometric decay that sums to one. Drag α toward 1 and the forecast collapses onto the most recent point (a spike at lag 0 — short memory, twitchy). Drag it toward 0 and the bars flatten into a long, even tail — the method becomes a slow long-run average. The half-life readout, $\ln 2 / -\ln(1-\alpha)$, is how many steps back the cumulative weight reaches 50%.
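The instrument's three readouts follow directly from the geometric weights. A small sketch, with helper names chosen here for illustration:

```python
import math

def half_life(alpha):
    """Steps back at which the cumulative weight reaches 50%."""
    return math.log(2) / -math.log(1 - alpha)

def weight_in_last(alpha, n_obs):
    """Total weight on the n_obs most recent observations: 1 - (1 - alpha)**n_obs."""
    return 1 - (1 - alpha) ** n_obs

for alpha in (0.1, 0.3, 0.8):
    print(f"alpha={alpha:.1f}  last-obs weight={alpha:.2f}  "
          f"half-life={half_life(alpha):5.2f}  "
          f"weight in last 5={weight_in_last(alpha, 5):.3f}")
```

For $\alpha = 0.3$ the half-life is about 1.9 steps and roughly 83% of the total weight sits on the five most recent observations.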

3.2

Holt's linear trend method

SES forecasts a flat line, so it lags badly on any series that is climbing or falling: it is forever chasing a level that has already moved on. Holt (1957) added a second smoothed component — a trend $b_t$, the estimated change per period — updated by its own smoothing parameter $\beta$. Now two recurrences run in lockstep, and the forecast extrapolates the trend forward:

EQ T3.4 — HOLT'S LINEAR (DOUBLE) SMOOTHING $$ \begin{aligned} \ell_t &= \alpha\, y_t + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\ b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\ \hat{y}_{t+h\mid t} &= \ell_t + h\, b_t \end{aligned} $$

The level update now smooths toward $y_t$ but starts from $\ell_{t-1}+b_{t-1}$: the last level plus where the trend said it would go. The trend update smooths the latest observed slope $(\ell_t - \ell_{t-1})$ against the old trend. The forecast is no longer flat: it is a straight line of slope $b_t$, projected $h$ steps out. Set $\beta = 0$ and $b_0 = 0$ together (a trend frozen at zero) and Holt degenerates back to SES; $\beta = 0$ alone merely freezes the trend at its initial value $b_0$.

One honest caveat: a linear trend projected far into the future is usually too aggressive — real series flatten. The standard fix is the damped trend of Gardner & McKenzie (1985), which multiplies the trend by a damping factor $0 < \phi < 1$ so the forecast bends toward a horizontal asymptote:

EQ T3.5 — DAMPED TREND $$ \hat{y}_{t+h\mid t} \;=\; \ell_t + (\phi + \phi^2 + \cdots + \phi^{h})\,b_t, \qquad \lim_{h\to\infty} \hat{y}_{t+h\mid t} = \ell_t + \frac{\phi}{1-\phi}\,b_t $$

With $\phi = 1$ this is exactly Holt's undamped line; with $\phi < 1$ the per-step contribution of the trend shrinks geometrically and the forecast saturates at a finite ceiling. The damped-trend method is one of the most reliable automatic forecasters known — it was the benchmark to beat across the M-competitions, and a hard one.
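To see the bend, the sketch below evaluates EQ T3.5 for a few damping factors. The final level and trend (100 and 2) are made-up values, and `damped_forecast` is an illustrative helper rather than a library function.

```python
def damped_forecast(level, trend, phi, horizon):
    """EQ T3.5: level + (phi + phi**2 + ... + phi**h) * trend for h = 1..horizon."""
    forecasts, damp_sum = [], 0.0
    for h in range(1, horizon + 1):
        damp_sum += phi ** h
        forecasts.append(level + damp_sum * trend)
    return forecasts

level, trend = 100.0, 2.0          # made-up final states
for phi in (1.0, 0.9, 0.8):
    fc = damped_forecast(level, trend, phi, 50)
    print(f"phi={phi:.1f}  h=1: {fc[0]:.2f}  h=10: {fc[9]:.2f}  h=50: {fc[49]:.2f}")

# For phi < 1 the forecast approaches level + phi / (1 - phi) * trend.
asymptote = level + 0.8 / (1 - 0.8) * trend
print(f"asymptote for phi=0.8: {asymptote:.2f}")
```

With $\phi = 0.8$ the forecast can never rise more than $0.8/0.2 = 4$ trend-steps above the level, however long the horizon.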

PYTHON · RUNNABLE IN-BROWSER

# Holt's linear method: vary alpha and beta, forecast h steps ahead (EQ T3.4)
import numpy as np

# a trending series: level rises ~1.5/period with a little noise
n = 30
y = 10 + 1.5 * np.arange(n) + np.array([0,1,-1,2,0,-2,1,3,-1,0,
                                        2,-1,1,0,-2,1,2,-1,0,1,
                                        -1,2,0,1,-2,0,1,-1,2,0], float)

def holt(y, alpha, beta, h=4):
    ell, b = y[0], y[1] - y[0]          # init: level=y0, trend=first difference
    for t in range(1, len(y)):
        prev = ell
        ell = alpha * y[t] + (1 - alpha) * (ell + b)   # level
        b   = beta * (ell - prev) + (1 - beta) * b     # trend
    fc = [ell + (i + 1) * b for i in range(h)]         # straight-line forecast
    return ell, b, fc

print(" alpha  beta | final level   trend   4-step forecast")
for alpha, beta in [(0.8, 0.2), (0.5, 0.1), (0.3, 0.05)]:
    ell, b, fc = holt(y, alpha, beta)
    print(f"  {alpha:4.2f}  {beta:4.2f} | {ell:9.2f}  {b:6.3f}   "
          + " ".join(f"{v:6.1f}" for v in fc))
print("\nhigher beta -> trend reacts faster to slope changes (and to noise).")
plot_xy(list(range(n)), list(y))


A naming map for the confused. SES is "single" smoothing; Holt is "double"; Holt-Winters (next) is "triple". The labels just count how many recurrences run — one per component you choose to track: level, then trend, then season.

3.3

Holt-Winters seasonal method

Most operational series breathe on a calendar: weekly retail, daily electricity, monthly tourism. Winters (1960) completed Holt's method by adding a third smoothed component: a vector of $m$ seasonal indices $s_t$, one per position in the cycle ($m=12$ for monthly data with a yearly cycle, $m=7$ for daily data with a weekly cycle), all updated with a single third smoothing parameter $\gamma$. The result, Holt-Winters, smooths level, trend, and season simultaneously. There are two flavours, depending on whether seasonal swings are a fixed amount or a fixed fraction of the level.

EQ T3.6 — HOLT-WINTERS (ADDITIVE SEASONALITY) $$ \begin{aligned} \ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\ b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\ s_t &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-m} \\ \hat{y}_{t+h\mid t} &= \ell_t + h\, b_t + s_{t+h-m(k+1)} \end{aligned} $$

Compared with Holt (EQ T3.4), the level now smooths the deseasonalised observation $y_t - s_{t-m}$, and a third recurrence smooths the seasonal index from the detrended residual $y_t - \ell_t$. The forecast adds back the matching seasonal index, where $k = \lfloor (h-1)/m \rfloor$ just selects the right slot in the last estimated cycle. The seasonal indices are conventionally normalised to sum to zero each cycle so they do not absorb the level.
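A compact sketch of EQ T3.6 in plain Python. The start-up rule (level, trend and seasonal indices taken from the first two cycles) is an assumption made here for illustration, not the initialisation prescribed by the text, and the seasonal indices are not re-normalised.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters (EQ T3.6) with a simple, assumed initialisation."""
    # Assumed start-up from the first two cycles (needs len(y) >= 2 * m).
    first, second = y[:m], y[m:2 * m]
    cycle_mean = sum(first) / m
    trend = (sum(second) - sum(first)) / (m * m)     # average change per period
    level = cycle_mean + trend * (m - 1) / 2         # shift mid-cycle mean to period m-1
    season = [first[i] - (cycle_mean + trend * (i - (m - 1) / 2)) for i in range(m)]

    for t in range(m, len(y)):
        slot = t % m                                 # season[slot] currently holds s_{t-m}
        prev_level = level
        level = alpha * (y[t] - season[slot]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[slot] = gamma * (y[t] - level) + (1 - gamma) * season[slot]

    # Forecast: straight line plus the latest index for each future slot.
    n = len(y)
    return [level + h * trend + season[(n + h - 1) % m]
            for h in range(1, horizon + 1)]

pattern = [3.0, -1.0, -4.0, 2.0]                     # made-up quarterly effects, sum to zero
y = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
forecast = holt_winters_additive(y, m=4, alpha=0.3, beta=0.1, gamma=0.3, horizon=6)
truth = [10 + 0.5 * t + pattern[t % 4] for t in range(24, 30)]
print([round(f, 2) for f in forecast])
print(truth)
```

On this noise-free series the forecast reproduces the true continuation, which makes it a convenient sanity check; the modular index `(n + h - 1) % m` plays the role of $s_{t+h-m(k+1)}$.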

EQ T3.7 — HOLT-WINTERS (MULTIPLICATIVE SEASONALITY) $$ \ell_t = \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1}+b_{t-1}), \qquad s_t = \gamma\,\frac{y_t}{\ell_t} + (1-\gamma)\,s_{t-m}, \qquad \hat{y}_{t+h\mid t} = (\ell_t + h\,b_t)\, s_{t+h-m(k+1)} $$

Here seasonal indices are multipliers around 1 (e.g. December = 1.4× the level), normalised to average one per cycle; the trend recurrence is the same as in EQ T3.6. Use additive when the seasonal swing is a constant size; use multiplicative when the swing grows with the level. The classic airline-passengers series, whose summer peaks balloon as traffic grows, is the textbook case for multiplicative.
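The difference between the two flavours shows up in how the peak-to-trough swing evolves along the forecast. A sketch with made-up final states and seasonal indices for a quarterly cycle ($m = 4$), assuming the last observation closed a full cycle:

```python
level, trend, m = 200.0, 5.0, 4                  # made-up final states, quarterly cycle
additive_season = [30.0, -10.0, -40.0, 20.0]     # fixed amounts, sum to zero
multiplicative_season = [1.15, 0.95, 0.80, 1.10] # fractions of the level, average one

def forecast(h, multiplicative):
    """h-step forecast from EQ T3.6 (additive) or EQ T3.7 (multiplicative)."""
    slot = (h - 1) % m                           # assumes the last observation ended a cycle
    line = level + h * trend
    if multiplicative:
        return line * multiplicative_season[slot]
    return line + additive_season[slot]

additive_swings, multiplicative_swings = [], []
for cycle in range(3):
    horizons = range(cycle * m + 1, (cycle + 1) * m + 1)
    add = [forecast(h, False) for h in horizons]
    mul = [forecast(h, True) for h in horizons]
    additive_swings.append(max(add) - min(add))
    multiplicative_swings.append(max(mul) - min(mul))
    print(f"cycle {cycle + 1}: additive swing = {additive_swings[-1]:6.2f}, "
          f"multiplicative swing = {multiplicative_swings[-1]:6.2f}")
```

The additive swing stays fixed from cycle to cycle, while the multiplicative swing widens as the trend lifts the level.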

Holt's method (EQ T3.4) smooths two components: a level and a trend. Holt-Winters adds a third recurrence. Which component does it add? (one word)

Winters added a seasonal component — the vector of indices $s_t$ updated by $\gamma$ in EQ T3.6/T3.7. SES (single) tracks level; Holt (double) adds trend; Holt-Winters (triple) adds season.

A multiplicative Holt-Winters model has level $\ell_t = 200$, zero trend, and a December seasonal multiplier $s = 1.4$. What is the one-step December forecast $\hat{y} = \ell_t \cdot s$, and by what factor does it exceed the level?

$\hat{y} = \ell_t \cdot s = 200 \times 1.4 = 280$. The factor relative to the level is $280/200 = $ 1.4 — December runs 40% above the deseasonalised level.

INSTRUMENT T3.2 — HOLT-WINTERS DECOMPOSITION · SEASONAL SERIES · m = 12 · EQ T3.6

[Interactive instrument. Controls: level α (shown at 0.30), trend β (shown at 0.10), season γ (shown at 0.30). Readouts: in-sample SSE; final level and trend; season amplitude.]

The grey line is a synthetic monthly series (rising trend + 12-month season + noise); the mint line is the Holt-Winters one-step fit, and the blue segment past the divider is its 12-step seasonal forecast. Push γ up and the seasonal indices chase every wobble (overfit); push it down and the model holds a stable seasonal shape. Watch the SSE readout: the seasonal recurrence is what lets the fit hug the peaks and troughs an SES line would slice straight through.

3.4

The ETS state-space framework

For forty years exponential smoothing was largely used as a bag of recurrences with no probability model behind them: you could forecast, but you could not say how uncertain the forecast was, nor choose a method by a principled criterion. Hyndman, Koehler, Snyder & Grose (2002) and Hyndman, Koehler, Ord & Snyder (2008) fixed that by showing every smoothing method is the point forecast of an underlying state-space model with a single source of error. This is the ETS family: Error · Trend · Season.

EQ T3.8 — ETS AS A STATE-SPACE MODEL (additive-error, "innovations" form) $$ \underbrace{y_t = \ell_{t-1} + b_{t-1} + s_{t-m} + \varepsilon_t}_{\text{measurement}}, \qquad \underbrace{\ell_t = \ell_{t-1} + b_{t-1} + \alpha\,\varepsilon_t,\;\; b_t = b_{t-1} + \beta\,\varepsilon_t,\;\; s_t = s_{t-m} + \gamma\,\varepsilon_t}_{\text{state update}} $$

A single shock $\varepsilon_t \sim \mathcal{N}(0,\sigma^2)$ drives both the observation and every state update, hence "single source of error". Substituting $\varepsilon_t = y_t - \hat{y}_{t\mid t-1}$ recovers the recurrences of EQ T3.6, with one bookkeeping caveat: the level gain is the same $\alpha$, but the trend and seasonal gains written $\beta$ and $\gamma$ in EQ T3.8 equal $\alpha\beta$ and $\gamma(1-\alpha)$ in terms of EQ T3.6's constants. The payoff is enormous: a likelihood you can maximise, AIC/BIC for model selection, and, most importantly, model-based prediction intervals, which the old recurrences could never produce.
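The link between the two forms can be verified numerically. Working the substitution through EQ T3.6 gives an innovations-form level gain of $\alpha$, a trend gain of $\alpha\beta$ and a seasonal gain of $\gamma(1-\alpha)$, where $\beta$ and $\gamma$ are EQ T3.6's constants. The sketch below runs both versions on the same made-up quarterly series from the same arbitrary starting states; all data and starting values are assumptions for illustration.

```python
import random

random.seed(1)
m, alpha, beta, gamma = 4, 0.4, 0.2, 0.3
effects = [2.0, -1.0, -3.0, 2.0]                     # made-up quarterly effects
y = [20 + 0.3 * t + effects[t % m] + random.gauss(0, 1) for t in range(40)]

# Shared (arbitrary) starting states for both versions.
init_level, init_trend, init_season = 20.0, 0.3, list(effects)

# Version 1: smoothing recurrences, EQ T3.6.
level, trend, season = init_level, init_trend, list(init_season)
for t, obs in enumerate(y):
    slot, prev = t % m, level
    level = alpha * (obs - season[slot]) + (1 - alpha) * (level + trend)
    trend = beta * (level - prev) + (1 - beta) * trend
    season[slot] = gamma * (obs - level) + (1 - gamma) * season[slot]
recurrence_state = (level, trend, season)

# Version 2: innovations form, EQ T3.8, with the rescaled trend and season gains.
beta_ss, gamma_ss = alpha * beta, gamma * (1 - alpha)
level, trend, season = init_level, init_trend, list(init_season)
for t, obs in enumerate(y):
    slot = t % m
    error = obs - (level + trend + season[slot])     # one-step forecast error
    level, trend = level + trend + alpha * error, trend + beta_ss * error
    season[slot] = season[slot] + gamma_ss * error
innovations_state = (level, trend, season)

print(recurrence_state[:2])
print(innovations_state[:2])
```

Both loops end in the same level, trend and seasonal indices (up to rounding), which is the sense in which the smoothing method is the point forecast of the state-space model.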

ETS classifies a model by a three-letter code: Error ∈ {A, M}, Trend ∈ {N, A, A_d}, Season ∈ {N, A, M}. So ETS(A,N,N) is SES with additive noise, ETS(A,A,N) is Holt, ETS(A,A,A) is additive Holt-Winters, and ETS(M,A,M) is multiplicative Holt-Winters with multiplicative errors, a natural fit for the airline-passengers series. These options give 2 × 3 × 3 = 18 combinations; allowing multiplicative trends (M, M_d) as well brings the full taxonomy to 30. The practical recipe is to let software fit the candidates and pick by AIC.
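The model count is a simple product. As a sketch: the three trend options named above give 18 codes, and the count of 30 arises once multiplicative trends (M and damped M, written `Md` here) are included.

```python
from itertools import product

errors = ["A", "M"]
seasons = ["N", "A", "M"]
trends_listed = ["N", "A", "Ad"]                 # the trend options named in the text
trends_full = trends_listed + ["M", "Md"]        # plus multiplicative trends

listed = [f"ETS({e},{t},{s})" for e, t, s in product(errors, trends_listed, seasons)]
full = [f"ETS({e},{t},{s})" for e, t, s in product(errors, trends_full, seasons)]

print(len(listed), len(full))                    # prints: 18 30
print(listed[:3])
```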

Method	Components tracked	ETS code	Forecast shape
SES	level only	(A,N,N)	flat line
Holt	level + trend	(A,A,N)	straight line
Damped Holt	level + damped trend	(A,A_d,N)	bends to asymptote
Additive HW	level + trend + season	(A,A,A)	line + fixed season
Multiplicative HW	level + trend + ×season	(M,A,M)	line × growing season

The empirical verdict. In the M3 competition (3,003 series) and again in M4 (100,000 series), simple exponential-smoothing and ETS variants — especially damped trend — were brutally hard to beat; the M4 winner was a hybrid that combined exponential smoothing with a neural net (Smyl's ES-RNN). The lesson the field keeps relearning: for a single, short, noisy series, a one-parameter smoother often beats a deep model, and any serious forecaster keeps ETS as the baseline that earns its keep.

3.5

Choosing the smoothing parameters

You do not set $\alpha, \beta, \gamma$ by hand. The standard procedure picks them, together with the initial states $\ell_0$, $b_0$ and the $m$ initial seasonal indices (written $s_0$ below), by minimising the in-sample sum of squared one-step errors (for the additive-error model of EQ T3.8, this is equivalent to maximising the Gaussian likelihood):

EQ T3.9 — PARAMETER ESTIMATION BY MINIMISING SSE $$ (\hat{\alpha}, \hat{\beta}, \hat{\gamma},\, \hat{\ell}_0, \hat{b}_0, \hat{s}_0) \;=\; \arg\min \; \sum_{t=1}^{n} \big(y_t - \hat{y}_{t\mid t-1}\big)^2 \;=\; \arg\min \; \sum_{t=1}^{n} e_t^2 $$

Each $\hat{y}_{t\mid t-1}$ is the model's one-step forecast computed from the recurrences, so the objective is a nonlinear function of the parameters — solved by numerical optimisation (Nelder-Mead, L-BFGS). The smoothing parameters are box-constrained to $(0,1)$; some references add an "admissible region" constraint that keeps the implied state-space model stable. SSE is minimised on one-step errors, not on the long-horizon forecast — a subtlety that matters when the two disagree.
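For SES the objective has a single parameter, so a plain grid search is enough to illustrate EQ T3.9. This sketch swaps the numerical optimiser for a grid and, as a simplifying assumption, fixes the initial level at the first observation instead of estimating it; the series is simulated.

```python
import random

def ses_sse(y, alpha, level_0):
    """Sum of squared one-step errors for SES (the objective in EQ T3.9)."""
    level, sse = level_0, 0.0
    for obs in y:
        error = obs - level              # the forecast for this step is the old level
        sse += error ** 2
        level += alpha * error           # EQ T3.2 update
    return sse

# Simulated series: a random-walk level (sd 1.0) observed with noise (sd 2.0).
random.seed(0)
true_level, y = 50.0, []
for _ in range(200):
    true_level += random.gauss(0, 1.0)
    y.append(true_level + random.gauss(0, 2.0))

grid = [i / 100 for i in range(1, 100)]  # alpha = 0.01, 0.02, ..., 0.99
sse_by_alpha = {a: ses_sse(y, a, level_0=y[0]) for a in grid}
best_alpha = min(sse_by_alpha, key=sse_by_alpha.get)
print(f"best alpha on the grid = {best_alpha:.2f}, SSE = {sse_by_alpha[best_alpha]:.1f}")
```

With more parameters (Holt, Holt-Winters) a grid becomes expensive, which is where a numerical optimiser such as Nelder-Mead or L-BFGS takes over.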

Two cautions experts will raise. First, do not minimise SSE on the data you will also report accuracy on; hold out the tail of the series, or use time-series cross-validation (rolling-origin evaluation, Time Series 01), or trust the AIC from the likelihood. Second, an optimiser will happily push $\alpha \to 1$ on a series that is really a random walk — a correct answer that looks like overfitting but is not. The instrument below traces the SSE objective for SES so you can see its shape: usually convex with a clear minimum, occasionally flat (the data barely constrains $\alpha$).
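A minimal sketch of rolling-origin evaluation for SES, under the same simplifying assumptions as elsewhere in this chapter's examples (simulated data, initial level fixed at the first observation, $\alpha$ chosen on a coarse grid). At each origin the parameter is re-estimated on the training window only, and the next point is then forecast out of sample.

```python
import random

def ses_fit(y, alpha):
    """Return (in-sample one-step SSE, final level), initialised at y[0]."""
    level, sse = y[0], 0.0
    for obs in y[1:]:
        error = obs - level
        sse += error ** 2
        level += alpha * error
    return sse, level

def rolling_origin_errors(y, min_train, grid):
    """Re-estimate alpha on y[:t] only, then forecast y[t], for each origin t."""
    errors = []
    for t in range(min_train, len(y)):
        train = y[:t]
        alpha = min(grid, key=lambda a: ses_fit(train, a)[0])
        _, level = ses_fit(train, alpha)
        errors.append(y[t] - level)          # out-of-sample one-step error
    return errors

random.seed(2)
level, y = 50.0, []
for _ in range(80):
    level += random.gauss(0, 1.0)            # drifting level
    y.append(level + random.gauss(0, 2.0))   # noisy observation

grid = [i / 20 for i in range(1, 20)]        # alpha = 0.05, 0.10, ..., 0.95
errors = rolling_origin_errors(y, min_train=40, grid=grid)
mse = sum(e ** 2 for e in errors) / len(errors)
print(f"{len(errors)} origins, out-of-sample one-step MSE = {mse:.2f}")
```

Because no forecast ever sees the value it is scored against, or any later one, this MSE is an honest accuracy estimate in a way the in-sample SSE is not.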

INSTRUMENT T3.3 — SMOOTHING-PARAMETER OPTIMIZER · SES · SSE(α) CURVE · EQ T3.9

[Interactive instrument. Controls: noise level σ (shown at 2.0), level drift (shown at 1.0), your α (shown at 0.30). Readouts: SSE at your α; optimal α*; SSE at α*.]

The mint curve is the SES objective SSE(α) swept across the whole $(0,1)$ range on a freshly simulated series; the blue dot marks the grid-search minimum α* and the grey dot marks your slider's α. Crank the noise up and the minimum slides left (a smoother level filters out observation noise); crank the drift up and it slides right (the level is genuinely moving, so trust recent data more). When the curve goes flat, the data simply does not pin α down — the honest answer is "any value in this basin forecasts about the same".

Exponential smoothing models the mean of a series and treats the variance as a constant nuisance. For financial returns that assumption is exactly backwards: the mean is near-unforecastable but the variance clusters — calm begets calm, a shock begets shocks. Time Series 04 turns the smoothing machinery loose on the variance itself: ARCH, GARCH, and the volatility models that price risk.

3.R

References

Holt, C. C. (2004, orig. 1957). Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting 20(1) — reprint of the 1957 ONR memorandum that introduced double smoothing (EQ T3.4).
Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages. Management Science 6(3) — adds the seasonal component, completing Holt-Winters (EQ T3.6/T3.7).
Hyndman, R. J., Koehler, A. B., Ord, J. K. & Snyder, R. D. (2008). Forecasting with Exponential Smoothing: The State Space Approach. Springer — the definitive treatment of the ETS innovations state-space framework (EQ T3.8).
Hyndman, R. J., Koehler, A. B., Snyder, R. D. & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting 18(3) — the state-space taxonomy of exponential smoothing models and automatic AIC selection (§3.4).
Gardner, E. S. & McKenzie, E. (1985). Forecasting trends in time series. Management Science 31(10) — the damped-trend method (EQ T3.5), a perennial competition benchmark.
Makridakis, S., Spiliotis, E. & Assimakopoulos, V. (2020). The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting 36(1) — the modern evidence that exponential smoothing remains a top baseline (§3.4).
Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.), Ch. 8. OTexts — the freely available standard textbook treatment of SES, Holt-Winters, and ETS.