Seq2Seq & the Birth of Attention

4.1

The encoder-decoder framework

Machine translation poses a hard problem for a plain recurrent net: the input and output are both sequences, but of different lengths, in different languages, with no word-by-word alignment. Sutskever, Vinyals and Le (2014) cut the knot with a deceptively simple architecture, now called sequence-to-sequence (seq2seq): one RNN to read, a second RNN to write.

The encoder consumes the source tokens $x_1, \ldots, x_{T_x}$ one at a time, updating a hidden state. Its final hidden state $h_{T_x}$ is taken as a summary of the whole sentence — the context vector $c$. The decoder is a language model conditioned on $c$: it starts from $c$, emits a token, feeds that token back in, and repeats until it produces an end-of-sequence symbol.
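The decoding loop just described can be sketched in a few lines. This is a minimal illustration with random, untrained weights, not the Sutskever et al. model: the vocabulary size `V`, the token ids `BOS` and `EOS`, the length cap `max_len`, and the helper name `greedy_decode` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 6, 8                        # hidden width and toy vocabulary size (assumed)
BOS, EOS = 0, 1                    # hypothetical start / end-of-sequence token ids

E  = rng.normal(0, 0.5, (V, d))    # token embeddings (random, untrained)
Wh = rng.normal(0, 0.5, (d, d))    # recurrent weights
Wx = rng.normal(0, 0.5, (d, d))    # input weights
Wo = rng.normal(0, 0.5, (V, d))    # hidden state -> vocabulary logits

def greedy_decode(c, max_len=12):
    """Start from context c, emit the argmax token, feed it back, stop at EOS."""
    s, tok, out = c.copy(), BOS, []
    for _ in range(max_len):               # cap the length: random weights may never emit EOS
        s = np.tanh(Wh @ s + Wx @ E[tok])  # update the decoder state from the previous token
        tok = int(np.argmax(Wo @ s))       # greedy pick of the next token
        if tok == EOS:
            break
        out.append(tok)
    return out

c = np.tanh(rng.normal(0, 1, d))           # stand-in for the encoder's final state h_{T_x}
tokens = greedy_decode(c)
print("decoded token ids:", tokens)
```

Note that $c$ enters only as the decoder's initial state; everything the decoder knows about the source has to survive inside `s` from then on.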

EQ N4.1 — THE SEQ2SEQ OBJECTIVE $$ c = h_{T_x}, \qquad p(y_1, \ldots, y_{T_y} \mid x) \;=\; \prod_{i=1}^{T_y} p\!\left( y_i \,\middle|\, y_{<i},\, c \right) $$

The encoder compresses the entire source into one vector $c$; the decoder factorizes the output probability autoregressively, each token conditioned on all previous output tokens $y_{<i}$ and on $c$. Training maximizes the log-likelihood of the correct translation — exactly the cross-entropy you already know, summed over output positions. Note that every output token sees the same, frozen $c$: the encoder gets one chance to say everything. Sutskever et al. found a single trick — reversing the source word order — bought a large BLEU gain, a tell that the fixed vector was straining under long-range dependencies.
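To make the factorization in EQ N4.1 concrete, the sketch below scores one fixed target sequence under a toy decoder and shows that the sequence log-likelihood is just the sum of per-step log-probabilities, whose negative is the summed cross-entropy. The weights are random and untrained, and `V`, `BOS`, the target ids in `y`, and the helper `sequence_log_prob` are hypothetical names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, BOS = 6, 8, 0                # hidden width, toy vocabulary, start token (assumed)
E  = rng.normal(0, 0.5, (V, d))    # token embeddings
Wh = rng.normal(0, 0.5, (d, d))    # recurrent weights
Wx = rng.normal(0, 0.5, (d, d))    # input weights
Wo = rng.normal(0, 0.5, (V, d))    # hidden state -> vocabulary logits

def log_softmax(z):
    z = z - z.max()                        # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def sequence_log_prob(c, y):
    """Teacher forcing: feed the TRUE previous token, collect log p(y_i | y_<i, c)."""
    s, prev, step_logps = c.copy(), BOS, []
    for y_i in y:
        s = np.tanh(Wh @ s + Wx @ E[prev])
        step_logps.append(log_softmax(Wo @ s)[y_i])
        prev = y_i
    return np.array(step_logps)

c = np.tanh(rng.normal(0, 1, d))           # stand-in context vector
y = [3, 5, 2, 1]                           # hypothetical target token ids
step_logps = sequence_log_prob(c, y)
log_p = step_logps.sum()                   # log of the product in EQ N4.1
print("per-step log p       :", np.round(step_logps, 3))
print("log p(y | x)         :", round(float(log_p), 3))
print("summed cross-entropy :", round(float(-log_p), 3))
```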

Two LSTMs (each 4 layers deep in the original) trained end-to-end on millions of sentence pairs reached competitive WMT'14 English→French BLEU. It was the first time a pure neural system rivaled the phrase-based statistical machine-translation pipelines it would soon replace. The framework generalizes far beyond translation: summarization, dialogue, code generation, speech-to-text, and image captioning are all seq2seq with different encoders.

PYTHON · RUNNABLE IN-BROWSER

# A toy encoder: roll up a sentence into ONE context vector c (EQ N4.1).
# The point: no matter how long the source, c has fixed width -> the bottleneck.
import numpy as np
rng = np.random.default_rng(0)
d = 6                                  # hidden width

Wh = rng.normal(0, 0.4, (d, d))        # recurrent weights, drawn ONCE and shared
Ux = rng.normal(0, 0.4, (d, d))        # input weights, drawn ONCE and shared

def encode(x_embeds):                  # a stand-in RNN: h_t = tanh(Wh h_{t-1} + Ux x_t)
    h = np.zeros(d)
    for x in x_embeds:                 # read left to right, keep ONLY the last state
        h = np.tanh(Wh @ h + Ux @ x)
    return h                           # c = h_{T_x}

for T in (3, 9, 27):                   # short, medium, long source sentences
    src = rng.normal(0, 1, (T, d))     # T token embeddings
    c = encode(src)
    print(f"source length {T:2d} tokens -> context vector c has width {c.size} "
          f"(norm {np.linalg.norm(c):.2f})")
print("\nThe vector NEVER grows. 27 words must fit in the same 6 numbers as 3.")


4.2

The fixed-vector bottleneck

The architecture's elegance is also its flaw. Every nuance of a 40-word source sentence — who did what to whom, every clause, every named entity — must be squeezed into one fixed-dimensional vector $c$ and held there, unchanged, while the decoder unspools a translation that may itself be 40 words long. The encoder's last state is a lossy, length-blind summary.

The symptom is unmistakable: seq2seq BLEU holds up on short sentences and drops sharply as length grows. Cho et al. (2014) documented the decay directly: the longer the input, the more the single vector saturates, and the more the early source words it must remember fade. This is an information-theoretic ceiling, not a tuning problem: you cannot store an arbitrarily long sentence in a constant number of bits without loss.
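The fading of early source words can be seen in a toy setting. The sketch below edits only the first token of a sentence and measures how far the context vector moves as the sentence gets longer. It rests on a loud assumption: the recurrent matrix is rescaled to spectral norm 0.8, which makes this stand-in RNN contractive so the effect is guaranteed. Real LSTMs and GRUs use gating to slow the decay, so treat this as an illustration of the mechanism rather than a measurement of any real model; `encode_all` and `gap` are names invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T_max = 6, 30

Wh = rng.normal(0, 1, (d, d))
Wh *= 0.8 / np.linalg.norm(Wh, 2)     # ASSUMPTION: force spectral norm 0.8 (contractive RNN)
Ux = rng.normal(0, 0.4, (d, d))

def encode_all(x_embeds):
    """Return the hidden state after every prefix, so row t-1 is c for a t-token source."""
    h, states = np.zeros(d), []
    for x in x_embeds:
        h = np.tanh(Wh @ h + Ux @ x)
        states.append(h)
    return np.array(states)

src = rng.normal(0, 1, (T_max, d))    # T_max random token embeddings
edited = src.copy()
edited[0] += 1.0                      # change ONLY the first source token

# gap[t-1]: how much the edit to word 1 still shows up in c after reading t tokens
gap = np.linalg.norm(encode_all(src) - encode_all(edited), axis=1)
for T in (1, 3, 10, 30):
    print(f"source length {T:2d}: change in c from editing word 1 = {gap[T-1]:.2e}")
```

Because $\tanh$ is 1-Lipschitz, each extra step shrinks the trace of the first word by at least the factor 0.8 in this setup, so by 30 tokens the context vector barely registers it.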

INTUITION

Imagine reading a paragraph, then writing its translation from memory without looking back at the page. That is the fixed-vector decoder. Attention is being allowed to glance back at the source — at whichever word you need, exactly when you need it.

INSTRUMENT N4.1 — BOTTLENECK vs ATTENTION · TRANSLATION QUALITY vs SOURCE LENGTH
[Interactive plot. Controls: context width d (default 512), what the decoder reads. Readouts: quality at 10 words, quality at 50 words, regime.]

A stylized model of the empirical curve from Bahdanau et al. (Fig. 2). With a fixed context vector, quality decays past the length the width can hold — widen d and the cliff moves right but never disappears. Switch to attention and the curve goes flat: the decoder reads the source afresh at every step, so length stops mattering.

4.3

Bahdanau (additive) attention

Bahdanau, Cho and Bengio (2014) made the decisive move. Keep all the encoder hidden states — one per source word, $h_1, \ldots, h_{T_x}$, now called annotations (and produced by a bidirectional RNN so each $h_j$ summarizes the whole sentence centered on word $j$). At every decoding step $i$, build a different context vector $c_i$ by taking a weighted average of those annotations — with weights the decoder chooses on the fly.

The weights come from an alignment model: a tiny feedforward net that scores how well decoder state $s_{i-1}$ matches each annotation $h_j$. Because the score is computed with a sum inside a $\tanh$, this is called additive attention.

EQ N4.2 — ADDITIVE ALIGNMENT SCORE $$ e_{ij} \;=\; v_a^{\top} \tanh\!\left( W_a\, s_{i-1} + U_a\, h_j \right) $$

$s_{i-1}$ is the decoder's previous state (the "query"); $h_j$ is the $j$-th source annotation (a "key"). $W_a, U_a$ project both into a shared space; $\tanh$ mixes them; $v_a$ collapses the result to one scalar relevance score. Crucially, $W_a, U_a, v_a$ are learned jointly with the whole translator — alignment is never supervised, it emerges from the translation loss. This is the original attention mechanism, three years before "Attention Is All You Need."

Softmax over the source positions turns scores into a probability distribution, the attention weights $\alpha_{ij}$, and the context vector is the sum of the annotations weighted by them:

EQ N4.3 — WEIGHTS & CONTEXT VECTOR $$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j, \qquad \sum_{j=1}^{T_x} \alpha_{ij} = 1 $$

For each output position $i$, the weights $\alpha_{i\cdot}$ form a convex combination over the source — they sum to exactly 1, so $c_i$ is a soft, differentiable lookup into the encoder. The context now varies per output step: translating the verb pulls weight onto the source verb; translating the object pulls weight onto the object. There is no longer one frozen $c$. The fixed-length bottleneck is gone — the decoder's "memory" is the whole source, re-addressed every step.
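To see the context vary per output step, EQ N4.2 and EQ N4.3 can be evaluated for every pair $(i, j)$ at once. The sketch below is one way to vectorize it with NumPy broadcasting. One simplification is assumed: the decoder states in `S` are random stand-ins handed in up front, whereas in a real decoder each $s_{i-1}$ depends on the contexts chosen at earlier steps. The sizes and the names `S`, `pre`, and `C` are choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, a, Tx, Ty = 5, 4, 4, 3             # hidden width, alignment width, source / output lengths
H  = rng.normal(0, 1, (Tx, d))        # annotations h_1..h_Tx
S  = rng.normal(0, 1, (Ty, d))        # stand-in decoder states s_{i-1}, one row per output step
Wa = rng.normal(0, 0.5, (a, d))       # projects the query
Ua = rng.normal(0, 0.5, (a, d))       # projects each key
va = rng.normal(0, 0.5, a)            # collapses to a scalar

# EQ N4.2 for all (i, j): broadcast (Ty, 1, a) + (1, Tx, a) -> (Ty, Tx, a)
pre = (S @ Wa.T)[:, None, :] + (H @ Ua.T)[None, :, :]
e = np.tanh(pre) @ va                 # (Ty, Tx) alignment scores

# EQ N4.3: softmax over source positions (axis 1), then one context vector per step
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
C = alpha @ H                         # (Ty, d): row i is c_i

np.set_printoptions(precision=3, suppress=True)
print("attention weights (rows = output steps):\n", alpha)
print("row sums:", alpha.sum(axis=1))
print("context vectors c_i:\n", C)
```

Each row of `alpha` is a separate distribution over the source, so each row of `C` is a different blend of the same four annotations.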

A decoder attends over $T_x = 4$ encoder annotations with weights $ \alpha_{i1}, \alpha_{i2}, \alpha_{i3}, \alpha_{i4} $ produced by softmax (EQ N4.3). What does $ \sum_{j=1}^{4} \alpha_{ij} $ equal?

Softmax normalizes its outputs by their own sum, so they always form a probability distribution: $\sum_{j} \alpha_{ij} = 1$. This is why $c_i$ is a convex combination (a true weighted average) of the annotations.

PYTHON · RUNNABLE IN-BROWSER

# EQ N4.2: additive attention scores from scratch, then softmax to weights.
import numpy as np
rng = np.random.default_rng(1)
d, a = 5, 4                              # hidden width d, alignment width a
Tx = 4                                   # four source words

H = rng.normal(0, 1, (Tx, d))           # encoder annotations h_1..h_Tx
s_prev = rng.normal(0, 1, d)            # decoder state s_{i-1}
Wa = rng.normal(0, 0.5, (a, d))         # project the query
Ua = rng.normal(0, 0.5, (a, d))         # project each key
va = rng.normal(0, 0.5, a)              # collapse to a scalar

e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])   # EQ N4.2
alpha = np.exp(e - e.max()); alpha /= alpha.sum()               # softmax, EQ N4.3

np.set_printoptions(precision=3, suppress=True)
print("raw alignment scores e_ij :", e)
print("attention weights  alpha :", alpha)
print("weights sum to           :", round(float(alpha.sum()), 6), "<- always 1")
print("argmax source word       :", int(alpha.argmax()))


INSTRUMENT N4.2 — ATTENTION-WEIGHT HEATMAP · EN → FR · ROWS = OUTPUT · COLS = SOURCE · EQ N4.3
[Interactive heatmap. Control: softmax temperature (default 1.00). Readouts for the hovered row: output token, aligned source word, peak weight.]

A toy alignment for "the agreement on the economic area" → "l'accord sur la zone économique". Each row is one output word's distribution over the source (each row sums to 1). The bright near-diagonal band is monotonic translation; the off-diagonal cells are real reordering — French zone économique flips the adjective order of English economic area, exactly the case where a fixed vector fails. Hover a row; drop the temperature to sharpen each lookup toward a hard pick, raise it to blur toward a uniform average.

4.4

Luong (multiplicative) attention

A year later, Luong, Pham and Manning (2015) simplified and systematized the idea. Their headline observation: the $\tanh$ feedforward scorer is more machinery than you need. If query and key live in the same space, a plain dot product already measures their alignment — and a dot product is a single, GPU-friendly matrix multiply rather than a small MLP. Hence multiplicative (a.k.a. dot-product) attention.

EQ N4.4 — LUONG SCORING FUNCTIONS $$ \mathrm{score}(s_i, h_j) = \begin{cases} s_i^{\top} h_j & \textbf{dot} \\[4pt] s_i^{\top} W_a\, h_j & \textbf{general} \\[4pt] v_a^{\top}\tanh\!\left(W_a [\,s_i;\,h_j\,]\right) & \textbf{concat} \end{cases} $$

Three variants, increasing in flexibility. dot assumes encoder and decoder share a space, so it adds zero new parameters. general inserts one learned matrix $W_a$ to bridge mismatched spaces and is the usual default. concat (≈ Bahdanau) recovers the additive form. Luong also scored against the current decoder state $s_i$ (not $s_{i-1}$ as in Bahdanau), and distinguished global attention, which looks at every source position, from local attention, which restricts the window to a few source positions for very long inputs.
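The three scoring functions of EQ N4.4 can be written side by side. This is a sketch with random, untrained parameters; the shapes, the alignment width `a`, and the function names `score_dot`, `score_general`, and `score_concat` are assumptions for illustration, not Luong et al.'s code.

```python
import numpy as np

rng = np.random.default_rng(2)
d, a, Tx = 5, 4, 4
H   = rng.normal(0, 1, (Tx, d))              # encoder states h_1..h_Tx (rows)
s_i = rng.normal(0, 1, d)                    # current decoder state

W_general = rng.normal(0, 0.5, (d, d))       # bilinear bridge for "general"
W_concat  = rng.normal(0, 0.5, (a, 2 * d))   # acts on the stacked vector [s_i; h_j]
v_a       = rng.normal(0, 0.5, a)            # collapses concat features to a scalar

def score_dot(s, H):
    return H @ s                             # s^T h_j for every j, no parameters

def score_general(s, H, W):
    return H @ (W.T @ s)                     # s^T W h_j for every j

def score_concat(s, H, W, v):
    SH = np.concatenate([np.tile(s, (len(H), 1)), H], axis=1)   # rows are [s; h_j]
    return np.tanh(SH @ W.T) @ v             # v^T tanh(W [s; h_j]) for every j

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

np.set_printoptions(precision=3, suppress=True)
print("dot     weights:", softmax(score_dot(s_i, H)))
print("general weights:", softmax(score_general(s_i, H, W_general)))
print("concat  weights:", softmax(score_concat(s_i, H, W_concat, v_a)))
```

Setting `W_general` to the identity makes `score_general` collapse to `score_dot`, which is one way to read "dot assumes a shared space".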

Two architectures, one essential idea. The differences are practical: additive attention is marginally more robust when query and key dimensions differ; multiplicative attention is faster and more memory-efficient, but at large dimension it needs the now-famous $1/\sqrt{d_k}$ rescaling to keep softmax out of saturation. That scaled dot product is exactly the score function the Transformer would adopt: take Luong's general form, factor $W_a$ into two projections as $W_Q^{\top} W_K$, divide by $\sqrt{d_k}$, and you have scaled dot-product attention.
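Why the $1/\sqrt{d_k}$ factor matters can be checked numerically. The sketch below assumes query and key components are independent with zero mean and unit variance (an idealization, not a property of trained models); under that assumption a dot product over $d_k$ dimensions has standard deviation about $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The sample size `n` and the dictionary `stats` are names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000                                   # number of sampled (query, key) pairs

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

stats = {}
for d_k in (4, 64, 1024):
    q = rng.normal(0, 1, (n, d_k))         # ASSUMPTION: unit-variance, independent components
    k = rng.normal(0, 1, (n, d_k))
    raw = np.sum(q * k, axis=1)            # n independent dot products
    scaled = raw / np.sqrt(d_k)            # the Transformer's rescaling
    stats[d_k] = (raw.std(), scaled.std())
    # treat the first 8 scores as one attention row and compare the peak weight
    print(f"d_k={d_k:4d}  std raw={raw.std():6.2f}  std scaled={scaled.std():4.2f}  "
          f"peak raw={softmax(raw[:8]).max():.3f}  "
          f"peak scaled={softmax(scaled[:8]).max():.3f}")
```

When the raw scores spread over tens of units, softmax puts almost all its mass on one position and its gradient with respect to the others nearly vanishes; the rescaled scores keep the distribution soft.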

Property	Bahdanau (2014)	Luong (2015)
Score	additive (tanh MLP)	dot / general / concat
Decoder state used	$s_{i-1}$ (previous)	$s_i$ (current)
Encoder	bidirectional RNN	top LSTM layer
Cost / extra params	MLP per pair	one matmul (dot: none)
Descendant	—	scaled dot-product attn

True or false: attention removes the fixed-length context bottleneck of plain seq2seq, because the decoder rebuilds a fresh context vector $c_i$ from all encoder states at every output step. (Answer true or false.)

The whole point of EQ N4.3 is that $c_i = \sum_j \alpha_{ij} h_j$ is recomputed for each $i$ over the entire source. Nothing is forced through a single constant-width vector, so the length-blind bottleneck disappears. The answer is true.

PYTHON · RUNNABLE IN-BROWSER

# EQ N4.3/N4.4: the context vector as the attention-weighted sum of encoder states,
# scored with Luong dot-product attention. Verify it is a convex combination.
import numpy as np
rng = np.random.default_rng(2)
d, Tx = 5, 4
H = rng.normal(0, 1, (Tx, d))           # encoder states (rows = source words)
s_i = rng.normal(0, 1, d)               # current decoder state

scores = H @ s_i                        # EQ N4.4 "dot": one matmul, no params
alpha  = np.exp(scores - scores.max()); alpha /= alpha.sum()   # softmax
c_i    = alpha @ H                       # EQ N4.3: weighted sum of states

np.set_printoptions(precision=3, suppress=True)
print("attention weights alpha :", alpha, " (sum", round(float(alpha.sum()),3), ")")
print("context vector  c_i     :", c_i)
# A convex combo must lie inside the per-dim min/max box of the states it blends
# (a necessary condition only; this is a bounding-box check, not a full hull test):
lo, hi = H.min(0), H.max(0)
print("c_i within bounding box?:", bool(np.all(c_i >= lo - 1e-9) and
                                        np.all(c_i <= hi + 1e-9)))


INSTRUMENT N4.3 — ALIGNMENT VISUALIZER · SOFT WORD-TO-WORD LINKS · DOT-PRODUCT SCORE
[Interactive diagram. Controls: output step i (1 to 5), sharpness 1/τ (default 1.0×). Readouts: token being emitted, strongest link, entropy in bits.]

Step through the output one token at a time and watch the soft links re-aim at the source words that matter. The line opacity is $\alpha_{ij}$; raising sharpness collapses the fan toward a single hard link (low entropy ≈ a dictionary lookup), lowering it spreads attention across the sentence (high entropy ≈ averaging). Notice the crossing lines at zone économique: attention reorders without being told the alignment.
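The link between sharpness and entropy can be reproduced offline. The sketch below multiplies a fixed score vector by a sharpness factor (equivalently, divides by a temperature $\tau$) before the softmax and reports the peak weight and the entropy in bits. The five scores are made-up numbers for one output step, and `entropy_bits` is a helper defined here, not part of the instrument.

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def entropy_bits(p):
    p = p[p > 0]                                  # treat 0 * log 0 as 0
    return float(-(p * np.log2(p)).sum())

scores = np.array([2.0, 0.5, 0.1, -1.0, 0.3])     # hypothetical scores over 5 source words
results = {}
for sharpness in (0.25, 1.0, 4.0, 16.0):          # sharpness = 1 / temperature
    alpha = softmax(sharpness * scores)
    results[sharpness] = (float(alpha.max()), entropy_bits(alpha))
    print(f"sharpness {sharpness:5.2f}x  peak weight {alpha.max():.3f}  "
          f"entropy {entropy_bits(alpha):.3f} bits")
print(f"uniform ceiling: log2(5) = {np.log2(5):.3f} bits")
```

As sharpness grows the entropy heads toward 0 bits (one hard link); as it shrinks the entropy approaches $\log_2 5$, the value for a uniform average over five source words.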

4.5

The bridge to the Transformer

By 2016 attention was bolted onto every competitive RNN translator. But it still rode on top of recurrence: the encoder and decoder remained sequential RNNs, and that sequentiality — each step waiting on the last — capped how much you could parallelize on a GPU and how far gradients reached across long sentences.

Vaswani et al. (2017) asked the obvious next question: if attention is doing the real work of moving information, do we need the RNN at all? "Attention Is All You Need" answered no. Three moves complete the bridge from this chapter:

Self-attention. Bahdanau and Luong attention is cross-attention — the decoder attending to the encoder. Point the same mechanism at a sequence's own positions and you get self-attention, which replaces recurrence entirely. Every token can mix with every other in one parallel step.
Scaled dot-product, multi-head. Luong's dot/general score, divided by $\sqrt{d_k}$ (EQ N4.4 plus the variance fix), becomes the core operation; running $h$ of them in parallel subspaces gives multi-head attention. The query/key/value vocabulary is just the alignment-model query and the annotation keys/values, renamed and made symmetric.
Positional encodings. Drop recurrence and the model loses all sense of order, so position is injected directly into the embeddings — the one piece the RNN used to supply for free.
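The first and third moves can be checked together in a small experiment. The sketch below implements single-head scaled dot-product self-attention and shows that, without any position signal, shuffling the input tokens merely shuffles the output rows: the layer cannot tell one order from another. Adding a per-position vector breaks that symmetry. The weights are random, the sizes are arbitrary, and the position vectors `P` are random placeholders rather than the sinusoidal encodings of Vaswani et al.; all names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_k = 5, 8, 4
X  = rng.normal(0, 1, (T, d_model))             # one sequence of T token embeddings
WQ = rng.normal(0, 0.5, (d_model, d_k))         # query projection
WK = rng.normal(0, 0.5, (d_model, d_k))         # key projection
WV = rng.normal(0, 0.5, (d_model, d_k))         # value projection

def self_attention(X):
    """Single-head scaled dot-product attention of a sequence over itself."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / np.sqrt(d_k)             # (T, T): every token scores every token
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)           # row-wise softmax
    return A @ V                                # (T, d_k)

perm = np.array([2, 0, 4, 1, 3])                # a fixed, non-trivial shuffle of positions

# No position signal: permuting the input only permutes the output rows.
order_blind = np.allclose(self_attention(X)[perm], self_attention(X[perm]))

# With a position vector added per slot, the same shuffle changes the result.
P = rng.normal(0, 1, (T, d_model))              # PLACEHOLDER position vectors (random)
order_aware = not np.allclose(self_attention(X + P)[perm], self_attention(X[perm] + P))

print("without positions, shuffle only reorders outputs:", order_blind)
print("with positions, shuffle changes the outputs     :", order_aware)
```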

EQ N4.5 — FROM ALIGNMENT SCORE TO SCALED DOT-PRODUCT $$ \underbrace{e_{ij} = v_a^{\top}\tanh(W_a s_{i-1} + U_a h_j)}_{\text{Bahdanau, EQ N4.2}} \;\longrightarrow\; \underbrace{e_{ij} = \frac{(W_Q s_i)^{\top}(W_K h_j)}{\sqrt{d_k}}}_{\text{Transformer (Vol II · EQ 3.1)}} $$

Same skeleton — score every key against the query, softmax, take a weighted sum of values — with the learned MLP scorer swapped for a cheap scaled dot product and the recurrent backbone deleted. Everything that followed (BERT, GPT, and modern LLMs) is this idea scaled up. The 2014 bottleneck and the 2017 Transformer are two ends of one short, straight line; the full mechanism, multi-head, KV cache and all, is the subject of Vol II · Chapter 03.
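The shared skeleton can be written as one function with a pluggable scorer. In the sketch below `attend` does the softmax and weighted sum, while `additive_score` and `scaled_dot_score` supply the two sides of EQ N4.5. Two simplifications are assumed: the encoder states serve directly as values (the Transformer adds a separate value projection), and all weights are random. The last lines check that Luong's general form with $W_a = W_Q^{\top} W_K$ gives the same scores as the projected dot product before scaling.

```python
import numpy as np

rng = np.random.default_rng(5)
d, a, d_k, Tx = 6, 4, 3, 5
H = rng.normal(0, 1, (Tx, d))                 # encoder states: keys AND values here
s = rng.normal(0, 1, d)                       # decoder state used as the query

Wa, Ua = rng.normal(0, 0.5, (a, d)), rng.normal(0, 0.5, (a, d))
va     = rng.normal(0, 0.5, a)
WQ, WK = rng.normal(0, 0.5, (d_k, d)), rng.normal(0, 0.5, (d_k, d))

def additive_score(s, H):
    return np.tanh(Wa @ s + H @ Ua.T) @ va            # EQ N4.2 for every j

def scaled_dot_score(s, H):
    return (H @ WK.T) @ (WQ @ s) / np.sqrt(d_k)       # right-hand side of EQ N4.5

def attend(s, H, score_fn):
    """Score every key against the query, softmax, weighted sum of values."""
    e = score_fn(s, H)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha, alpha @ H

np.set_printoptions(precision=3, suppress=True)
for name, fn in (("additive", additive_score), ("scaled dot", scaled_dot_score)):
    alpha, ctx = attend(s, H, fn)
    print(f"{name:10s} weights {alpha}  (sum {alpha.sum():.3f})")

# Luong "general" with W_a = WQ^T WK equals the projected dot product, up to 1/sqrt(d_k)
general = H @ ((WQ.T @ WK).T @ s)
print("general == unscaled projected dot:",
      np.allclose(general, scaled_dot_score(s, H) * np.sqrt(d_k)))
```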

Attention gave the decoder a memory; the next chapter asks what a network learns when it has no labels at all. Chapter 05: autoencoders — the encoder-decoder shape turned inward to compress, denoise, and discover latent structure, and the variational twist that makes those latents generate.

4.R

References

Sutskever, I., Vinyals, O. & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. NeurIPS 2014 — the encoder-decoder LSTM framework (EQ N4.1) and the source-reversal trick.
Bahdanau, D., Cho, K. & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 — additive attention (EQ N4.2/N4.3); the birth of the mechanism and the length-decay figure.
Luong, M.-T., Pham, H. & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 — multiplicative (dot/general/concat) and global-vs-local attention (EQ N4.4).
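Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014 — the RNN encoder-decoder and GRU. The fixed-vector length decay (§4.2) is analysed in the companion paper: Cho, K., van Merriënboer, B., Bahdanau, D. & Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. SSST-8.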
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 — drops recurrence for pure self-attention; the destination of EQ N4.5.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. & Bengio, Y. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 — soft/hard attention beyond translation; shows the mechanism generalizes (§4.1).