The encoder-decoder framework
Machine translation poses a hard problem for a plain recurrent net: the input and output are both sequences, but of different lengths, in different languages, with no word-by-word alignment. Sutskever, Vinyals and Le (2014) cut the knot with a deceptively simple architecture, now called sequence-to-sequence (seq2seq): one RNN to read, a second RNN to write.
The encoder consumes the source tokens \(x_1, \ldots, x_{T_x}\) one at a time, updating a hidden state. Its final hidden state \(h_{T_x}\) is taken as a summary of the whole sentence — the context vector \(c\). The decoder is a language model conditioned on \(c\): it starts from \(c\), emits a token, feeds that token back in, and repeats until it produces an end-of-sequence symbol.
Two deep LSTMs (four layers each in the original), trained end-to-end on millions of sentence pairs, reached competitive WMT'14 English→French BLEU. It was the first time a pure neural system rivaled the phrase-based statistical machine-translation pipelines it would soon replace. The framework generalizes far beyond translation: summarization, dialogue, code generation, speech-to-text, and image captioning are all seq2seq with different encoders.
# A toy encoder: roll up a sentence into ONE context vector c (EQ N4.1).
# The point: no matter how long the source, c has fixed width -> the bottleneck.
import numpy as np

rng = np.random.default_rng(0)
d = 6  # hidden width

# Draw the weights once so every sentence passes through the same encoder.
Wh = rng.normal(0, 0.4, (d, d))
Ux = rng.normal(0, 0.4, (d, d))

def encode(x_embeds):  # a stand-in RNN: h_t = tanh(Wh h_{t-1} + Ux x_t)
    h = np.zeros(d)
    for x in x_embeds:  # read left to right, keep ONLY the last state
        h = np.tanh(Wh @ h + Ux @ x)
    return h  # c = h_{T_x}

for T in (3, 9, 27):  # short, medium, long source sentences
    src = rng.normal(0, 1, (T, d))  # T token embeddings
    c = encode(src)
    print(f"source length {T:2d} tokens -> context vector c has width {c.size} "
          f"(norm {np.linalg.norm(c):.2f})")
print("\nThe vector NEVER grows. 27 words must fit in the same 6 numbers as 3.")
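The decoder half described earlier (start from \(c\), emit a token, feed it back in, stop at end-of-sequence) can be sketched in the same toy style. Everything here is an assumption made for illustration: the vocabulary size, the `BOS`/`EOS` ids, the weight names, and the greedy argmax choice are all hypothetical, and the weights are random rather than trained, so the emitted ids are meaningless. Only the control flow is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 6, 8        # hidden width and a toy vocabulary size (hypothetical)
BOS, EOS = 0, 1        # assumed ids for the start and end-of-sequence symbols

E = rng.normal(0, 1.0, (vocab, d))    # token embeddings
Ws = rng.normal(0, 0.4, (d, d))       # recurrent weights
Us = rng.normal(0, 0.4, (d, d))       # input weights
Wo = rng.normal(0, 1.0, (vocab, d))   # projection from state to vocabulary logits

def greedy_decode(c, max_len=12):
    """Start from c, emit a token, feed it back in, stop at EOS or max_len."""
    s, token, out = c.copy(), BOS, []
    for _ in range(max_len):
        s = np.tanh(Ws @ s + Us @ E[token])   # update the decoder state
        token = int(np.argmax(Wo @ s))        # greedy pick of the next token
        if token == EOS:
            break
        out.append(token)
    return out

c = np.tanh(rng.normal(0, 1, d))   # stand-in for the encoder's final state
tokens = greedy_decode(c)
print("decoded token ids:", tokens)
```

Note that \(c\) enters only once, as the initial state. Nothing in the loop can consult the source again, which is the limitation the next section examines.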
The fixed-vector bottleneck
The architecture's elegance is also its flaw. Every nuance of a 40-word source sentence — who did what to whom, every clause, every named entity — must be squeezed into one fixed-dimensional vector \(c\) and held there, unchanged, while the decoder unspools a translation that may itself be 40 words long. The encoder's last state is a lossy, length-blind summary.
The symptom is unmistakable: seq2seq BLEU holds up on short sentences and drops sharply as length grows. Cho et al. (2014) documented the decay directly: the longer the input, the more the single vector saturates and the more the earliest source words fade. This is an information-theoretic ceiling, not a tuning problem: a constant number of bits cannot store an arbitrarily long sentence without loss.
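The fading of early words can be made visible in a toy setting. The sketch below changes only the first source embedding and measures how far the final state \(c\) moves. It rests on a loud assumption: the recurrent matrix is set to 0.5 times an orthogonal matrix, so every step shrinks a difference in the hidden state by at least half. Trained LSTMs use gates to resist this, so treat the numbers as an illustration of the trend rather than a measurement of any real system. The helper `first_token_effect` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Assumption: Wh = 0.5 * (orthogonal matrix), so its spectral norm is 0.5 and,
# because tanh is 1-Lipschitz, each step at least halves any state difference.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Wh = 0.5 * Q
Ux = rng.normal(0, 0.4, (d, d))

def encode(x_embeds):
    h = np.zeros(d)
    for x in x_embeds:
        h = np.tanh(Wh @ h + Ux @ x)
    return h   # c = h_{T_x}

def first_token_effect(T):
    """Distance c moves when only the FIRST source embedding is changed."""
    src = rng.normal(0, 1, (T, d))
    changed = src.copy()
    changed[0] += 1.0                  # perturb the first token only
    return float(np.linalg.norm(encode(src) - encode(changed)))

effects = {T: first_token_effect(T) for T in (1, 3, 9, 27)}
for T, e in effects.items():
    print(f"source length {T:2d}: change in c = {e:.2e}")
```

In this contractive toy the first word's influence on \(c\) is bounded by a factor of \(0.5^{T-1}\), so by 27 tokens it is numerically invisible to any decoder that sees only \(c\).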
Imagine reading a paragraph, then writing its translation from memory without looking back at the page. That is the fixed-vector decoder. Attention is being allowed to glance back at the source — at whichever word you need, exactly when you need it.
Bahdanau (additive) attention
Bahdanau, Cho and Bengio (2014) made the decisive move. Keep all the encoder hidden states — one per source word, \(h_1, \ldots, h_{T_x}\), now called annotations (and produced by a bidirectional RNN so each \(h_j\) summarizes the whole sentence centered on word \(j\)). At every decoding step \(i\), build a different context vector \(c_i\) by taking a weighted average of those annotations — with weights the decoder chooses on the fly.
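The weights come from an alignment model: a tiny feedforward net that scores how well decoder state \(s_{i-1}\) matches each annotation \(h_j\). Because the score is computed with a sum inside a \(\tanh\), this is called additive attention (EQ N4.2):
\[ e_{ij} = v_a^\top \tanh\left(W_a s_{i-1} + U_a h_j\right) \]
Softmax over the source positions turns scores into a probability distribution, the attention weights \(\alpha_{ij}\), and the context vector is the sum of the annotations under those weights (EQ N4.3):
\[ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \]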
# EQ N4.2: additive attention scores from scratch, then softmax to weights.
import numpy as np
rng = np.random.default_rng(1)
d, a = 5, 4 # hidden width d, alignment width a
Tx = 4 # four source words
H = rng.normal(0, 1, (Tx, d)) # encoder annotations h_1..h_Tx
s_prev = rng.normal(0, 1, d) # decoder state s_{i-1}
Wa = rng.normal(0, 0.5, (a, d)) # project the query
Ua = rng.normal(0, 0.5, (a, d)) # project each key
va = rng.normal(0, 0.5, a) # collapse to a scalar
e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H]) # EQ N4.2
alpha = np.exp(e - e.max()); alpha /= alpha.sum() # softmax, EQ N4.3
np.set_printoptions(precision=3, suppress=True)
print("raw alignment scores e_ij :", e)
print("attention weights alpha :", alpha)
print("weights sum to :", round(float(alpha.sum()), 6), "<- always 1")
print("argmax source word :", int(alpha.argmax()))
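The listing above scores a single decoder step. Since the whole point of the mechanism is a different context vector at every step, it may help to see several steps at once. The following is a minimal vectorized sketch under stated assumptions: `Ty` and the matrix `S` of stand-in decoder states are hypothetical (a real decoder would produce its states one at a time, each depending on the previous context), and the weights are random.

```python
import numpy as np

rng = np.random.default_rng(1)
d, a = 5, 4        # hidden width d, alignment width a
Tx, Ty = 4, 3      # four source words, three decoder steps (toy sizes)
H = rng.normal(0, 1, (Tx, d))    # encoder annotations h_1..h_Tx
S = rng.normal(0, 1, (Ty, d))    # stand-ins for the decoder states s_{i-1}
Wa = rng.normal(0, 0.5, (a, d))  # projects the decoder state
Ua = rng.normal(0, 0.5, (a, d))  # projects each annotation
va = rng.normal(0, 0.5, a)       # collapses to a scalar score

# Broadcast (Ty, 1, a) + (1, Tx, a) -> (Ty, Tx, a), then collapse with va.
E = np.tanh((S @ Wa.T)[:, None, :] + (H @ Ua.T)[None, :, :]) @ va  # (Ty, Tx)
A = np.exp(E - E.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)   # each row is one softmax over the source
C = A @ H                           # (Ty, d): one context vector per step

np.set_printoptions(precision=3, suppress=True)
print("alignment matrix (rows = decoder steps):\n", A)
print("context vectors (one per step):\n", C)
```

Each row of `A` is a separate distribution over the source, so the rows of `C` differ: the decoder is no longer tied to one summary of the sentence.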
Luong (multiplicative) attention
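A year later, Luong, Pham and Manning (2015) simplified and systematized the idea. Their headline observation: the \(\tanh\) feedforward scorer is more machinery than you need. If query and key live in the same space, a plain dot product already measures their alignment, and scoring every source position with dot products is a single, GPU-friendly matrix multiply rather than a pass through a small MLP. Hence multiplicative (a.k.a. dot-product) attention. The paper compares three score functions (EQ N4.4):
\[ \mathrm{score}(s_i, h_j) = \begin{cases} s_i^\top h_j & \text{dot} \\ s_i^\top W_a h_j & \text{general} \\ v_a^\top \tanh\left(W_a [s_i; h_j]\right) & \text{concat} \end{cases} \]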
Two architectures, one essential idea. The differences are practical. Additive attention copes naturally with query and key dimensions that differ, because each is projected into the alignment space before the two are summed. Multiplicative attention is faster and more memory-efficient, but at large dimension it needs the now-famous \(1/\sqrt{d_k}\) rescaling to keep softmax out of saturation. That scaled dot product is exactly the score function the Transformer would adopt: Luong's general form, with \(W_a\) factored into projections named \(W_Q\) and \(W_K\) and the score divided by \(\sqrt{d_k}\), is scaled dot-product attention.
| Property | Bahdanau (2014) | Luong (2015) |
|---|---|---|
| Score | additive (tanh MLP) | dot / general / concat |
| Decoder state used | \(s_{i-1}\) (previous) | \(s_i\) (current) |
| Encoder | bidirectional RNN | top LSTM layer |
| Cost / extra params | MLP per pair | one matmul (dot: none) |
| Descendant | — | scaled dot-product attn |
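The \(1/\sqrt{d_k}\) rescaling is easy to motivate numerically. The sketch below assumes queries and keys with independent unit-variance entries (an idealization, not a property of trained models), in which case a dot product over \(d_k\) dimensions has standard deviation close to \(\sqrt{d_k}\). It reports how much probability the largest softmax weight takes with and without the rescaling; the trial count and sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

Tx, trials = 8, 200
results = {}
for d_k in (4, 64, 1024):
    stds, top_raw, top_scaled = [], [], []
    for _ in range(trials):
        q = rng.normal(0, 1, d_k)          # one query
        K = rng.normal(0, 1, (Tx, d_k))    # Tx keys
        scores = K @ q                     # unscaled dot-product scores
        stds.append(scores.std())
        top_raw.append(softmax(scores).max())
        top_scaled.append(softmax(scores / np.sqrt(d_k)).max())
    results[d_k] = (np.mean(stds), np.mean(top_raw), np.mean(top_scaled))
    print(f"d_k={d_k:5d}  score std ~ {results[d_k][0]:6.2f}  "
          f"largest weight: raw {results[d_k][1]:.2f}, "
          f"scaled {results[d_k][2]:.2f}")
```

At large `d_k` the unscaled softmax puts nearly all of its mass on one key, which leaves almost no gradient for the others. Dividing by \(\sqrt{d_k}\) brings the score spread back to roughly unit scale regardless of dimension.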
# EQ N4.3/N4.4: the context vector as the attention-weighted sum of encoder states,
# scored with Luong dot-product attention. Verify it is a convex combination.
import numpy as np
rng = np.random.default_rng(2)
d, Tx = 5, 4
H = rng.normal(0, 1, (Tx, d)) # encoder states (rows = source words)
s_i = rng.normal(0, 1, d) # current decoder state
scores = H @ s_i # EQ N4.4 "dot": one matmul, no params
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum() # softmax
c_i = alpha @ H # EQ N4.3: weighted sum of states
np.set_printoptions(precision=3, suppress=True)
print("attention weights alpha :", alpha, " (sum", round(float(alpha.sum()),3), ")")
print("context vector c_i :", c_i)
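# A convex combo must lie inside the per-dim min/max of the states it blends.
# (A bounding-box test: necessary, not sufficient. The real guarantee is that
# alpha is non-negative and sums to 1.)
lo, hi = H.min(0), H.max(0)
print("c_i within state bounds? :", bool(np.all(c_i >= lo - 1e-9) and
                                          np.all(c_i <= hi + 1e-9)))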
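The listing above uses only the parameter-free dot score. The other two entries in the table's Score row, general and concat, can be sketched alongside it. The names `W_gen`, `W_cat`, and `v_cat` are hypothetical and the weights are random; this is meant only to show the shapes involved and that general reduces to dot when its matrix is the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
d, a, Tx = 5, 4, 4
H = rng.normal(0, 1, (Tx, d))   # encoder states (rows = source words)
s_i = rng.normal(0, 1, d)       # current decoder state

def softmax(z):
    p = np.exp(z - z.max())
    return p / p.sum()

# "dot": s_i^T h_j, no parameters.
dot = H @ s_i

# "general": s_i^T W h_j, one learned d x d matrix between query and key.
W_gen = rng.normal(0, 0.5, (d, d))
general = H @ (W_gen.T @ s_i)

# "concat": v^T tanh(W [s_i; h_j]), a small MLP over the stacked pair.
W_cat = rng.normal(0, 0.5, (a, 2 * d))
v_cat = rng.normal(0, 0.5, a)
pairs = np.hstack([np.tile(s_i, (Tx, 1)), H])   # rows are [s_i; h_j], (Tx, 2d)
concat = np.tanh(pairs @ W_cat.T) @ v_cat

np.set_printoptions(precision=3, suppress=True)
for name, scores in (("dot", dot), ("general", general), ("concat", concat)):
    print(f"{name:8s} weights:", softmax(scores))
```

All three produce one scalar per source position, so everything downstream (softmax, weighted sum) is unchanged. Only the cost and parameter count of the scorer differ.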
The bridge to the Transformer
By 2016 attention was bolted onto every competitive RNN translator. But it still rode on top of recurrence: the encoder and decoder remained sequential RNNs, and that sequentiality — each step waiting on the last — capped how much you could parallelize on a GPU and how far gradients reached across long sentences.
Vaswani et al. (2017) asked the obvious next question: if attention is doing the real work of moving information, do we need the RNN at all? "Attention Is All You Need" answered no. Three moves complete the bridge from this chapter:
- Self-attention. Bahdanau and Luong attention is cross-attention — the decoder attending to the encoder. Point the same mechanism at a sequence's own positions and you get self-attention, which replaces recurrence entirely. Every token can mix with every other in one parallel step.
- Scaled dot-product, multi-head. Luong's dot/general score, divided by \(\sqrt{d_k}\) (EQ N4.4 plus the variance fix), becomes the core operation; running \(h\) of them in parallel subspaces gives multi-head attention. The query/key/value vocabulary is just the alignment-model query and the annotation keys/values, renamed and made symmetric.
- Positional encodings. Drop recurrence and the model loses all sense of order, so position is injected directly into the embeddings — the one piece the RNN used to supply for free.
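The three moves above can be seen together in a short sketch. It assumes random projection weights and toy sizes, and it omits the output projection, masking, residual connections, and layer normalization of a real Transformer layer; the function names are hypothetical. The code runs multi-head scaled dot-product self-attention, checks that without position information a reordered input just yields a reordered output, and then shows that adding sinusoidal positional encodings (the fixed scheme from Vaswani et al.) breaks that symmetry.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, n_heads = 5, 8, 2
d_k = d_model // n_heads

W_Q = rng.normal(0, 0.5, (d_model, d_model))
W_K = rng.normal(0, 0.5, (d_model, d_model))
W_V = rng.normal(0, 0.5, (d_model, d_model))

def softmax_rows(Z):
    P = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return P / P.sum(axis=-1, keepdims=True)

def split_heads(M):
    # (T, d_model) -> (n_heads, T, d_k)
    return M.reshape(M.shape[0], n_heads, d_k).transpose(1, 0, 2)

def self_attention(X):
    """Multi-head scaled dot-product self-attention (no W_O, no mask)."""
    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    A = softmax_rows(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))  # (n_heads, T, T)
    out = A @ V                                                # (n_heads, T, d_k)
    return out.transpose(1, 0, 2).reshape(X.shape[0], d_model) # concat heads

def sinusoidal_positions(T, d_model):
    pos = np.arange(T)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (two_i / d_model))
    P = np.zeros((T, d_model))
    P[:, 0::2] = np.sin(angles)   # even feature indices
    P[:, 1::2] = np.cos(angles)   # odd feature indices
    return P

X = rng.normal(0, 1, (T, d_model))   # token embeddings
perm = np.array([2, 0, 4, 1, 3])     # a fixed reordering of the tokens
P = sinusoidal_positions(T, d_model)

order_blind = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
order_aware = not np.allclose(self_attention(X[perm] + P),
                              self_attention(X + P)[perm])
print("without positions, shuffling input only shuffles output:", order_blind)
print("with positions, word order changes the result          :", order_aware)
```

The first check is the sense in which a recurrence-free model "loses all sense of order": each token's output depends on the set of other tokens, not on where they sit. Adding `P` ties each embedding to its slot, so the same words in a different order produce different outputs.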
Attention gave the decoder a memory; the next chapter asks what a network learns when it has no labels at all. Chapter 05: autoencoders — the encoder-decoder shape turned inward to compress, denoise, and discover latent structure, and the variational twist that makes those latents generate.
References
- Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks.
- Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
- Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation.
- Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need.
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.