Recurrent networks — weights shared over time
Many of the things we want a model to read have no fixed length and no fixed structure except order: a sentence, an audio clip, a stock-price tape, a stream of sensor readings. A recurrent neural network (RNN) processes such a sequence \(x_1, x_2, \ldots, x_T\) one step at a time, maintaining a hidden state \(h_t\) — a running summary of everything seen so far — and updating it with the same weights at every step.
It is easier to reason about an RNN once you unroll it: copy the cell once per time step and lay the copies in a row, threading \(h_t\) from each copy into the next. The unrolled graph is just a very deep feed-forward network — depth \(T\) — whose layers happen to share parameters. Everything we know about training deep nets (Deep Learning 02) applies, including the failure mode that dominates §3.2.
The output head is flexible. A many-to-one RNN reads the whole sequence and emits a single prediction from \(h_T\) (sentiment of a review). A many-to-many RNN emits a label at every step (part-of-speech tags). A one-to-many RNN runs a single input forward into a generated sequence (image captioning). The recurrence is the same; only where you read the head changes.
# EQ N3.1: a vanilla RNN cell, forward pass over a toy sequence
import numpy as np
rng = np.random.default_rng(0)
H, D, T = 4, 3, 6 # hidden size, input size, seq length
Wxh = rng.normal(0, 0.5, (H, D)) # input -> hidden
Whh = rng.normal(0, 0.5, (H, H)) # hidden -> hidden (the recurrent loop)
bh = np.zeros(H)
X = rng.normal(0, 1, (T, D)) # a length-6 input sequence
h = np.zeros(H) # h_0 = 0
print("step ||h_t|| (running summary grows then stabilises)")
for t in range(T):
h = np.tanh(Wxh @ X[t] + Whh @ h + bh) # the recurrence, same weights each step
print(f" t={t} {np.linalg.norm(h):.4f}")
print("\nfinal hidden state h_T:", h.round(3))
print("every step reused the SAME Wxh, Whh -- that is what 'recurrent' means.")
The vanishing / exploding gradient problem
The promise of recurrence is long-range memory: a model that, after reading "I grew up in France … so I speak fluent ___", can reach back hundreds of tokens to fill the blank with French. In practice a vanilla RNN cannot. The reason is in the gradient. To train, we backpropagate the loss at step \(T\) through every earlier step. By the chain rule, the gradient of \(h_T\) with respect to a distant \(h_k\) is a product of Jacobians:
The controlling quantity is the largest singular value (spectral norm) of the recurrent Jacobian. Bound it loosely: \(\tanh'\le 1\), so each factor has norm at most \(\|W_{hh}\|\). A sufficient condition for vanishing is therefore \(\|W_{hh}\| < 1\); a necessary condition for exploding is \(\|W_{hh}\| > 1/\max\tanh'\). In the clean scalar case the whole product collapses to \((w\,\tanh'(a))^{\,T-k}\), which is exactly geometric in the distance \(T-k\).
Exploding gradients have a cheap fix; vanishing ones do not. When \(\lambda > 1\) the gradient blows up to NaN, but you can simply clip its norm to a ceiling before the optimizer step — a standard, robust trick. Vanishing gradients are insidious because nothing crashes: training proceeds, the loss even falls, but the model is silently blind to anything more than a few steps back. No clipping can manufacture a signal that has decayed to numerical zero. The real cure is architectural — change the cell so the gradient has a path that does not get multiplied down. That is §3.3.
# Vanishing vs exploding: measure backprop gradient norm vs distance
import numpy as np
def grad_norm(w_hh, length):
# scalar RNN, fixed input pulse; product of Jacobian factors (EQ N3.2)
h, a = 0.0, []
for t in range(length):
x = 1.0 if t == 0 else 0.0
pre = w_hh * h + 1.0 * x # W_xh = 1
h = np.tanh(pre); a.append(pre)
g = 1.0 # d h_T / d h_0
for t in range(length - 1, -1, -1):
g *= (1 - np.tanh(a[t])**2) * w_hh # tanh'(a) * W_hh
return abs(g)
for w in (0.5, 0.9, 1.1):
norms = [grad_norm(w, L) for L in range(1, 61)]
tag = "VANISH" if norms[-1] < 1e-3 else ("EXPLODE" if norms[-1] > 1e3 else "ok")
print(f"W_hh={w}: grad@1={norms[0]:.3f} grad@60={norms[-1]:.3e} [{tag}]")
print("\n|d h_T / d h_0| over distance for W_hh = 0.9 (geometric decay):")
plot_xy(list(range(1, 61)), [grad_norm(0.9, L) for L in range(1, 61)])
LSTM — gates & the cell state
Hochreiter & Schmidhuber's 1997 Long Short-Term Memory attacks EQ N3.2 head-on. It adds a second, parallel memory track — the cell state \(c_t\) — whose update is (mostly) additive rather than a repeated matrix multiply. Information can ride that track across many steps almost untouched, so the gradient flowing back along it is multiplied by numbers near 1, not by a contracting Jacobian. Three learned gates — sigmoids in \((0,1)\) that act as soft, differentiable valves — decide what to keep, what to add, and what to read out.
Read the gates as operations on memory. The forget gate \(f_t\) erases: an entry near 0 zeroes that slot of \(c_{t-1}\). The input gate \(i_t\) writes: it decides how much of the fresh candidate \(\tilde c_t\) to commit. The output gate \(o_t\) reads: it exposes a filtered copy of the cell as the hidden state. A practical detail that matters in real training: the forget-gate bias \(b_f\) is usually initialized to \(+1\) or higher, so the network defaults to remembering and only learns to forget when the data demands it.
# EQ N3.4-N3.5: one LSTM cell forward step; print gates and cell state
import numpy as np
rng = np.random.default_rng(1)
H, D = 3, 2
def sig(z): return 1 / (1 + np.exp(-z))
# stacked weights for [forget, input, output, candidate]; bias_f starts at +1
Wx = rng.normal(0, 0.6, (4 * H, D))
Wh = rng.normal(0, 0.6, (4 * H, H))
b = np.zeros(4 * H); b[:H] = 1.0 # forget-gate bias = +1 (default remember)
x = np.array([1.0, -0.5]) # one input vector
h = np.zeros(H); c = np.array([0.4, -0.2, 0.9]) # carried-in state
z = Wx @ x + Wh @ h + b
f, i, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
g = np.tanh(z[3*H:]) # candidate c~
c_new = f * c + i * g # the additive highway
h_new = o * np.tanh(c_new)
np.set_printoptions(precision=3, suppress=True)
print("forget gate f :", f)
print("input gate i :", i)
print("output gate o :", o)
print("candidate g~ :", g)
print("old cell c :", c)
print("new cell c' = f*c + i*g~ :", c_new)
print("hidden h' = o*tanh(c') :", h_new)
A common worry: doesn't the additive cell state grow without bound? It can — which is why \(h_t=o_t\odot\tanh(c_t)\) squashes the readout, and why a well-trained forget gate occasionally dips below 1 to bleed off stale magnitude. Modern variants add a forget on the candidate too, and peephole connections let the gates see \(c_{t-1}\) directly; both are refinements, not changes to the core highway.
GRU — a lighter gate
The LSTM works, but it carries two state vectors and four weight matrices per cell. Cho et al. (2014) asked how much of that machinery is essential and arrived at the Gated Recurrent Unit: a single state vector, two gates, and a clever trick that ties "forget" and "input" into one decision.
| Property | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | 0 | 3 (f, i, o) | 2 (z, r) |
| State vectors | 1 (h) | 2 (h, c) | 1 (h) |
| Params / cell | ~1× | ~4× | ~3× |
| Long-range gradient | vanishes | protected | protected |
| Separate read-out gate | — | yes (o) | no |
Which to use? On many tasks GRU and LSTM are statistically indistinguishable, and GRU's smaller parameter count trains a little faster and needs less data — so it is often the better first choice. LSTM tends to edge ahead when very long-range memory or precise readout control matters, partly because the output gate and dedicated cell state give it one more degree of freedom. The honest answer, repeated across the literature since the 2014–2017 comparisons: there is no universal winner; the gap is task-dependent and usually small. What both share — and what actually mattered — is the additive, gated state path that keeps the gradient alive.
Historical footnote, important for honesty: this entire family has been largely displaced for language by the Transformer (Deep Learning 04 onward), whose attention removes recurrence and parallelizes across the sequence. RNNs persist where streaming or strict left-to-right causality with small state is an advantage — on-device speech, low-latency control, some time-series — and the gating idea itself resurfaced in 2023–2025 in linear-recurrent and state-space models (S4, Mamba) that reclaim \(O(T)\) inference while approaching Transformer quality.
Backpropagation through time
How is any of this trained? By unrolling the network into its depth-\(T\) feed-forward equivalent (§3.1) and running ordinary backpropagation through it — a procedure named backpropagation through time (BPTT). Because the weights are shared across steps, the gradient with respect to a weight is the sum of its contributions at every step:
Truncated BPTT is the practical version. Backpropagating through a 100,000-token sequence would cost prohibitive memory (every intermediate state must be stored for the backward pass) and re-incur the gradient pathologies. So we cut the sequence into chunks of length \(k\) (say 64–256), backpropagate only within each chunk, but carry the hidden state forward between chunks so the forward pass still sees unlimited context. The model can therefore remember arbitrarily far while only being trained on gradients that span \(k\) steps — a deliberate bias toward shorter-range credit assignment that keeps training tractable.
Three ideas, one thread. (1) Recurrence shares weights over time, turning a sequence into a deep net. (2) That depth makes the gradient a long product, which vanishes or explodes (EQ N3.2–N3.3). (3) Gates with an additive state path (LSTM/GRU) give the gradient a near-identity highway, and truncated BPTT makes training that highway affordable. Everything else in sequence modeling is variation on this.
Gates let a single state carry the past; the next leap lets every step look back at every other step directly. Chapter 04 builds the encoder–decoder (seq2seq) framework and the attention mechanism that frees a model from squeezing a whole sequence through one fixed-size bottleneck — the idea that, taken to its limit, becomes the Transformer.
References
- Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory.
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.
- Bengio, Y., Simard, P. & Frasconi, P. (1994). Learning Long-Term Dependencies with Gradient Descent is Difficult.
- Pascanu, R., Mikolov, T. & Bengio, Y. (2013). On the Difficulty of Training Recurrent Neural Networks.
- Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. (2017). LSTM: A Search Space Odyssey.
- Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
- Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.