Sequence Models — RNN, LSTM & GRU

3.1

Recurrent networks — weights shared over time

Many of the things we want a model to read have no fixed length and no fixed structure except order: a sentence, an audio clip, a stock-price tape, a stream of sensor readings. A recurrent neural network (RNN) processes such a sequence $x_1, x_2, \ldots, x_T$ one step at a time, maintaining a hidden state $h_t$ — a running summary of everything seen so far — and updating it with the same weights at every step.

EQ N3.1 — THE RECURRENT CELL $$ h_t = \tanh\!\big( W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h \big), \qquad \hat{y}_t = W_{hy}\,h_t + b_y $$

$W_{hh}$ feeds the previous state back into the present — the loop that makes the network recurrent. Crucially, $W_{xh}, W_{hh}, W_{hy}$ are shared across all $T$ time steps: the cell that reads token 1 is the identical cell that reads token 1,000. That weight tying is what lets one fixed-size model consume sequences of any length, and what makes the gradient a long product (§3.2). $h_0$ is usually the zero vector. The $\tanh$ keeps each state bounded in $(-1,1)$.

It is easier to reason about an RNN once you unroll it: copy the cell once per time step and lay the copies in a row, threading $h_t$ from each copy into the next. The unrolled graph is just a very deep feed-forward network — depth $T$ — whose layers happen to share parameters. Everything we know about training deep nets (Deep Learning 02) applies, including the failure mode that dominates §3.2.

The output head is flexible. A many-to-one RNN reads the whole sequence and emits a single prediction from $h_T$ (sentiment of a review). A many-to-many RNN emits a label at every step (part-of-speech tags). A one-to-many RNN runs a single input forward into a generated sequence (image captioning). The recurrence is the same; only where you read the head changes.

A scalar RNN has $W_{xh}=2$, $W_{hh}=0.5$, $b_h=0$, and starts from $h_0=0$. The first input is $x_1=0.5$. What is $h_1=\tanh(W_{xh}x_1 + W_{hh}h_0)$? (Use $\tanh(1)=0.7616$.)

With $h_0=0$ the recurrent term vanishes: $W_{xh}x_1 + W_{hh}h_0 = 2(0.5) + 0.5(0) = 1$. So $h_1=\tanh(1)=$ 0.7616.

PYTHON · RUNNABLE IN-BROWSER

# EQ N3.1: a vanilla RNN cell, forward pass over a toy sequence
import numpy as np
rng = np.random.default_rng(0)
H, D, T = 4, 3, 6                       # hidden size, input size, seq length

Wxh = rng.normal(0, 0.5, (H, D))       # input  -> hidden
Whh = rng.normal(0, 0.5, (H, H))       # hidden -> hidden (the recurrent loop)
bh  = np.zeros(H)
X   = rng.normal(0, 1, (T, D))         # a length-6 input sequence

h = np.zeros(H)                        # h_0 = 0
print("step    ||h_t||   (running summary grows then stabilises)")
for t in range(T):
    h = np.tanh(Wxh @ X[t] + Whh @ h + bh)   # the recurrence, same weights each step
    print(f"  t={t}   {np.linalg.norm(h):.4f}")

print("\nfinal hidden state h_T:", h.round(3))
print("every step reused the SAME Wxh, Whh -- that is what 'recurrent' means.")

edits are live — break it on purpose

INSTRUMENT N3.1 — RNN UNROLL VISUALIZERSCALAR CELL · EQ N3.1 · LIVE

INPUT WEIGHT W_xh 1.00

RECURRENT WEIGHT W_hh 0.60

SEQUENCE LENGTH

FINAL STATE h₁₀

—

REGIME

—

The cell reads a fixed input pulse (1 at $t=1$, then 0) so you can watch memory decay. Push $W_{hh}$ toward 0 and the state forgets the pulse within a step or two; push it toward $\pm1$ and the memory lingers across the whole sequence — but go past 1 and $\tanh$ saturates and the state pins to its extreme. This single knob previews the stability problem of §3.2.

3.2

The vanishing / exploding gradient problem

The promise of recurrence is long-range memory: a model that, after reading "I grew up in France … so I speak fluent ___", can reach back hundreds of tokens to fill the blank with French. In practice a vanilla RNN cannot. The reason is in the gradient. To train, we backpropagate the loss at step $T$ through every earlier step. By the chain rule, the gradient of $h_T$ with respect to a distant $h_k$ is a product of Jacobians:

EQ N3.2 — GRADIENT IS A LONG PRODUCT $$ \frac{\partial h_T}{\partial h_k} \;=\; \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}} \;=\; \prod_{t=k+1}^{T} \operatorname{diag}\!\big(\tanh'(a_t)\big)\, W_{hh}^{\top}, \qquad a_t = W_{xh}x_t + W_{hh}h_{t-1} + b_h $$

Each factor multiplies by $W_{hh}$ (through its transpose) and by the diagonal of $\tanh'$, which never exceeds 1 and is usually well below it. Multiplying $T-k$ such factors makes the whole product behave like a power. If the factors are typically smaller than 1, the gradient vanishes geometrically with distance; if larger, it explodes. Either way the model cannot learn dependencies that span many steps.

The controlling quantity is the largest singular value (spectral norm) of the recurrent Jacobian. Bound it loosely: $\tanh'\le 1$, so each factor has norm at most $\|W_{hh}\|$. A sufficient condition for vanishing is therefore $\|W_{hh}\| < 1$; a necessary condition for exploding is $\|W_{hh}\| > 1/\max\tanh'$. In the clean scalar case the whole product collapses to $(w\,\tanh'(a))^{\,T-k}$, which is exactly geometric in the distance $T-k$.

EQ N3.3 — SCALAR DECAY RATE $$ \left| \frac{\partial h_T}{\partial h_k} \right| \;\approx\; \big| w_{hh} \cdot \overline{\tanh'} \big|^{\,T-k} \;=\; \lambda^{\,T-k}, \qquad \lambda < 1 \Rightarrow \text{vanish}, \quad \lambda > 1 \Rightarrow \text{explode} $$

$\lambda$ is the effective per-step gain. With $\lambda=0.8$, the gradient from 50 steps away is scaled by $0.8^{50}\approx 1.4\times10^{-5}$ — for all practical purposes zero. This is why a plain RNN's effective memory is only a handful of steps, no matter how long the sequence. Bengio, Simard & Frasconi proved in 1994 that this trade-off is fundamental: the same contraction that gives a vanilla RNN stable dynamics is what starves the long-range gradient.

Exploding gradients have a cheap fix; vanishing ones do not. When $\lambda > 1$ the gradient blows up to NaN, but you can simply clip its norm to a ceiling before the optimizer step — a standard, robust trick. Vanishing gradients are insidious because nothing crashes: training proceeds, the loss even falls, but the model is silently blind to anything more than a few steps back. No clipping can manufacture a signal that has decayed to numerical zero. The real cure is architectural — change the cell so the gradient has a path that does not get multiplied down. That is §3.3.

An RNN has effective per-step gain $\lambda = 0.9$ (from EQ N3.3). By what factor is the gradient scaled when it travels from step $k$ to step $T$ that are $T-k = 44$ steps apart, i.e. $0.9^{44}$?

$0.9^{44} = e^{44\ln 0.9} = e^{44(-0.10536)} = e^{-4.636} \approx$ 0.0097 (≈ 0.01). A signal one percent of its original size is, for learning purposes, gone — even though only 44 steps separate the two positions.

PYTHON · RUNNABLE IN-BROWSER

# Vanishing vs exploding: measure backprop gradient norm vs distance
import numpy as np

def grad_norm(w_hh, length):
    # scalar RNN, fixed input pulse; product of Jacobian factors (EQ N3.2)
    h, a = 0.0, []
    for t in range(length):
        x = 1.0 if t == 0 else 0.0
        pre = w_hh * h + 1.0 * x          # W_xh = 1
        h = np.tanh(pre); a.append(pre)
    g = 1.0                                # d h_T / d h_0
    for t in range(length - 1, -1, -1):
        g *= (1 - np.tanh(a[t])**2) * w_hh # tanh'(a) * W_hh
    return abs(g)

for w in (0.5, 0.9, 1.1):
    norms = [grad_norm(w, L) for L in range(1, 61)]
    tag = "VANISH" if norms[-1] < 1e-3 else ("EXPLODE" if norms[-1] > 1e3 else "ok")
    print(f"W_hh={w}: grad@1={norms[0]:.3f}  grad@60={norms[-1]:.3e}  [{tag}]")

print("\n|d h_T / d h_0| over distance for W_hh = 0.9 (geometric decay):")
plot_xy(list(range(1, 61)), [grad_norm(0.9, L) for L in range(1, 61)])

edits are live — break it on purpose

INSTRUMENT N3.2 — VANISHING-GRADIENT DECAY|∂hₜ/∂h₀| VS DISTANCE · EQ N3.3

RECURRENT GAIN W_hh 0.90

SEQUENCE LENGTH 60

EFFECTIVE GAIN λ

—

GRADIENT AT FULL DISTANCE

—

REGIME

—

The curve is the magnitude of the gradient flowing back from the final step to step 0, plotted on a log axis against distance. Below $W_{hh}=1$ it plunges to the floor (white "VANISH" line at $10^{-3}$) within a few dozen steps; above 1 it climbs off the top (EXPLODE). Only a hairline near $W_{hh}\approx 1$ keeps the gradient alive across the whole sequence — and a vanilla RNN cannot stay on that knife-edge while also fitting the data. That dilemma is the entire motivation for gates.

3.3

LSTM — gates & the cell state

Hochreiter & Schmidhuber's 1997 Long Short-Term Memory attacks EQ N3.2 head-on. It adds a second, parallel memory track — the cell state $c_t$ — whose update is (mostly) additive rather than a repeated matrix multiply. Information can ride that track across many steps almost untouched, so the gradient flowing back along it is multiplied by numbers near 1, not by a contracting Jacobian. Three learned gates — sigmoids in $(0,1)$ that act as soft, differentiable valves — decide what to keep, what to add, and what to read out.

EQ N3.4 — THE THREE GATES & CANDIDATE $$ \begin{aligned} f_t &= \sigma\!\big(W_f[h_{t-1},x_t]+b_f\big) &\text{(forget)} \\ i_t &= \sigma\!\big(W_i[h_{t-1},x_t]+b_i\big) &\text{(input)} \\ o_t &= \sigma\!\big(W_o[h_{t-1},x_t]+b_o\big) &\text{(output)} \\ \tilde{c}_t &= \tanh\!\big(W_c[h_{t-1},x_t]+b_c\big) &\text{(candidate)} \end{aligned} $$

Each gate is a sigmoid, so its entries live in $(0,1)$: 0 means "block this channel completely", 1 means "let it through untouched". $[h_{t-1},x_t]$ is the previous state concatenated with the current input. $\tilde c_t$ is the candidate new content (a $\tanh$, so in $(-1,1)$) that the input gate may write. An LSTM has exactly three gates — forget, input, output — plus this candidate, which is not itself a gate.

EQ N3.5 — CELL & HIDDEN UPDATE (THE HIGHWAY) $$ c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t) $$

$\odot$ is element-wise product. The cell update is the heart of the design: the old memory $c_{t-1}$ is scaled by the forget gate and the candidate is scaled by the input gate, then they are added. When $f_t\approx 1$ and $i_t\approx 0$, $c_t\approx c_{t-1}$ — memory persists and $\partial c_t/\partial c_{t-1}\approx 1$, so the gradient flows back with no geometric decay. This near-identity path is the "constant error carousel" that defeats EQ N3.2. The hidden state $h_t$ is a gated, squashed view of the cell — what the rest of the network gets to see.

Read the gates as operations on memory. The forget gate $f_t$ erases: an entry near 0 zeroes that slot of $c_{t-1}$. The input gate $i_t$ writes: it decides how much of the fresh candidate $\tilde c_t$ to commit. The output gate $o_t$ reads: it exposes a filtered copy of the cell as the hidden state. A practical detail that matters in real training: the forget-gate bias $b_f$ is usually initialized to $+1$ or higher, so the network defaults to remembering and only learns to forget when the data demands it.

Counting the sigmoid valves in EQ N3.4 — forget, input, and output — how many gates does a standard LSTM cell have? (The $\tanh$ candidate $\tilde c_t$ is content, not a gate.)

The three sigmoid gates are the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. The candidate $\tilde c_t$ is a $\tanh$, not a gate. So an LSTM has 3 gates.

True or false: in EQ N3.5 the previous cell state enters as $f_t \odot c_{t-1}$, so a forget gate whose entries are near 0 erases (nearly zeroes) the cell's stored memory. (Answer true or false.)

Multiplying $c_{t-1}$ element-wise by a gate near 0 drives those entries toward 0, discarding the corresponding memory before the new candidate is added. The statement is true — that is precisely the forget gate's job, and why a stuck-closed forget gate is a known cause of memory loss.

PYTHON · RUNNABLE IN-BROWSER

# EQ N3.4-N3.5: one LSTM cell forward step; print gates and cell state
import numpy as np
rng = np.random.default_rng(1)
H, D = 3, 2
def sig(z): return 1 / (1 + np.exp(-z))

# stacked weights for [forget, input, output, candidate]; bias_f starts at +1
Wx = rng.normal(0, 0.6, (4 * H, D))
Wh = rng.normal(0, 0.6, (4 * H, H))
b  = np.zeros(4 * H); b[:H] = 1.0          # forget-gate bias = +1 (default remember)

x  = np.array([1.0, -0.5])                 # one input vector
h  = np.zeros(H); c = np.array([0.4, -0.2, 0.9])   # carried-in state

z  = Wx @ x + Wh @ h + b
f, i, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
g  = np.tanh(z[3*H:])                       # candidate c~
c_new = f * c + i * g                       # the additive highway
h_new = o * np.tanh(c_new)

np.set_printoptions(precision=3, suppress=True)
print("forget gate f :", f)
print("input  gate i :", i)
print("output gate o :", o)
print("candidate  g~ :", g)
print("old cell   c  :", c)
print("new cell   c' = f*c + i*g~ :", c_new)
print("hidden     h' = o*tanh(c') :", h_new)

edits are live — break it on purpose

INSTRUMENT N3.3 — LSTM GATE EXPLORERONE SCALAR CELL · EQ N3.5 · LIVE

FORGET GATE f 0.90

INPUT GATE i 0.50

OUTPUT GATE o 0.80

CELL HALF-LIFE (STEPS)

—

FINAL CELL c

—

FINAL HIDDEN h

—

A single value is written into the cell at $t=1$ (with gate $i$), then the cell runs free under the forget gate $f$ while the output gate $o$ controls what leaks into $h$. Set $f=1,\ i=0$: the memory is held flat forever — the constant error carousel, half-life $\infty$. Drop $f$ to 0.5 and the memory halves every step. Close $o$ and the cell still remembers internally while $h$ shows nothing — memory and exposure are separate, which a vanilla RNN cannot do.

A common worry: doesn't the additive cell state grow without bound? It can — which is why $h_t=o_t\odot\tanh(c_t)$ squashes the readout, and why a well-trained forget gate occasionally dips below 1 to bleed off stale magnitude. Modern variants add a forget on the candidate too, and peephole connections let the gates see $c_{t-1}$ directly; both are refinements, not changes to the core highway.

3.4

GRU — a lighter gate

The LSTM works, but it carries two state vectors and four weight matrices per cell. Cho et al. (2014) asked how much of that machinery is essential and arrived at the Gated Recurrent Unit: a single state vector, two gates, and a clever trick that ties "forget" and "input" into one decision.

EQ N3.6 — THE GRU CELL $$ \begin{aligned} z_t &= \sigma\!\big(W_z[h_{t-1},x_t]\big) &\text{(update gate)} \\ r_t &= \sigma\!\big(W_r[h_{t-1},x_t]\big) &\text{(reset gate)} \\ \tilde{h}_t &= \tanh\!\big(W_h[\,r_t\odot h_{t-1},\,x_t\,]\big) &\text{(candidate)} \\ h_t &= (1-z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t &\text{(blend)} \end{aligned} $$

The update gate $z_t$ interpolates between keeping the old state and overwriting it with the candidate — one knob does the work of the LSTM's separate forget and input gates, which is why their weights sum to 1 by construction $(1-z_t)+z_t$. The reset gate $r_t$ decides how much past state feeds the candidate, letting the cell drop irrelevant history when composing new content. There is no separate cell state and no output gate: $h_t$ is the memory. When $z_t\approx 0$, $h_t\approx h_{t-1}$ — the same near-identity skip that protects the gradient.

Property	Vanilla RNN	LSTM	GRU
Gates	0	3 (f, i, o)	2 (z, r)
State vectors	1 (h)	2 (h, c)	1 (h)
Params / cell	~1×	~4×	~3×
Long-range gradient	vanishes	protected	protected
Separate read-out gate	—	yes (o)	no

Which to use? On many tasks GRU and LSTM are statistically indistinguishable, and GRU's smaller parameter count trains a little faster and needs less data — so it is often the better first choice. LSTM tends to edge ahead when very long-range memory or precise readout control matters, partly because the output gate and dedicated cell state give it one more degree of freedom. The honest answer, repeated across the literature since the 2014–2017 comparisons: there is no universal winner; the gap is task-dependent and usually small. What both share — and what actually mattered — is the additive, gated state path that keeps the gradient alive.

Historical footnote, important for honesty: this entire family has been largely displaced for language by the Transformer (Deep Learning 04 onward), whose attention removes recurrence and parallelizes across the sequence. RNNs persist where streaming or strict left-to-right causality with small state is an advantage — on-device speech, low-latency control, some time-series — and the gating idea itself resurfaced in 2023–2025 in linear-recurrent and state-space models (S4, Mamba) that reclaim $O(T)$ inference while approaching Transformer quality.

3.5

Backpropagation through time

How is any of this trained? By unrolling the network into its depth-$T$ feed-forward equivalent (§3.1) and running ordinary backpropagation through it — a procedure named backpropagation through time (BPTT). Because the weights are shared across steps, the gradient with respect to a weight is the sum of its contributions at every step:

EQ N3.7 — THE BPTT GRADIENT $$ \frac{\partial \mathcal{L}}{\partial W} \;=\; \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W}, \qquad \frac{\partial \mathcal{L}_T}{\partial W} \;=\; \sum_{k=1}^{T} \frac{\partial \mathcal{L}_T}{\partial h_T}\, \frac{\partial h_T}{\partial h_k}\, \frac{\partial h_k}{\partial W} $$

The outer sum collects the loss from every output step; the inner sum routes each loss back through every earlier state via the Jacobian product $\partial h_T/\partial h_k$ — the same product as EQ N3.2. So BPTT is exactly where vanishing and exploding gradients are born: the cure of §3.3 (an additive cell path) makes the $k$-distant term survive instead of decaying to zero.

Truncated BPTT is the practical version. Backpropagating through a 100,000-token sequence would cost prohibitive memory (every intermediate state must be stored for the backward pass) and re-incur the gradient pathologies. So we cut the sequence into chunks of length $k$ (say 64–256), backpropagate only within each chunk, but carry the hidden state forward between chunks so the forward pass still sees unlimited context. The model can therefore remember arbitrarily far while only being trained on gradients that span $k$ steps — a deliberate bias toward shorter-range credit assignment that keeps training tractable.

KEY

Three ideas, one thread. (1) Recurrence shares weights over time, turning a sequence into a deep net. (2) That depth makes the gradient a long product, which vanishes or explodes (EQ N3.2–N3.3). (3) Gates with an additive state path (LSTM/GRU) give the gradient a near-identity highway, and truncated BPTT makes training that highway affordable. Everything else in sequence modeling is variation on this.

Gates let a single state carry the past; the next leap lets every step look back at every other step directly. Chapter 04 builds the encoder–decoder (seq2seq) framework and the attention mechanism that frees a model from squeezing a whole sequence through one fixed-size bottleneck — the idea that, taken to its limit, becomes the Transformer.

3.R

References

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation 9(8) — introduces the LSTM cell, the constant error carousel, and the gating scheme of EQ N3.4–N3.5.
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 — introduces the GRU (EQ N3.6) and the encoder–decoder framing carried into Chapter 04.
Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Transactions on Neural Networks 5(2) — the formal analysis of vanishing/exploding gradients behind EQ N3.2–N3.3.
Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks. ICML 2013 — the spectral-norm view of the gradient product and the gradient-clipping remedy for explosion (§3.2).
Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. IEEE TNNLS 28(10) — systematic ablation of LSTM components, including the value of the forget gate and forget-bias initialization (§3.3).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. NIPS 2014 Deep Learning Workshop — the LSTM-vs-GRU comparison underpinning the "no universal winner" claim (§3.4).
Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2312.00752 — the modern selective state-space model reviving gated linear recurrence at scale (§3.4 footnote).