PyTorch — Tensors, Autograd & Training

1.1

Tensors & the GPU

A tensor is PyTorch's only data structure that matters: an n-dimensional array of one dtype, backed by a flat block of memory and described by a shape (the logical dimensions) and a stride (how many elements to step to advance along each dimension). If you know NumPy's ndarray, you already know most of torch.Tensor; the API was deliberately built to rhyme. The first superpower is that the same data can live on a different device: x = x.to("cuda") copies the bytes to GPU VRAM and returns a new tensor (the original stays where it was), after which every operation on x dispatches to a CUDA kernel instead of a CPU one. The math is identical; only the silicon changes.

Stride is the quiet hero. view, transpose, permute and most slices return a new tensor that shares storage with the original and merely reinterprets the strides — zero copies, zero allocation. That is why reshaping a billion-element tensor is instant, and also why a transposed tensor is non-contiguous and sometimes needs .contiguous() before an op that demands a packed layout.
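
The view behaviour described above can be checked directly. The following is a minimal sketch, assuming a working PyTorch install; the variable names are illustrative only.

```python
import torch

x = torch.arange(6).reshape(2, 3)          # contiguous 2x3 tensor
t = x.t()                                  # transpose: a view, no data copied

print(x.stride(), t.stride())              # (3, 1) (1, 3)
print(t.data_ptr() == x.data_ptr())        # True: same underlying storage
print(t.is_contiguous())                   # False: strides are no longer row-major

t[0, 1] = 100                              # writing through the view ...
print(x[1, 0].item())                      # ... changes the original: 100

packed = t.contiguous()                    # forces a copy into a packed layout
print(packed.stride())                     # (2, 1)
print(packed.data_ptr() == x.data_ptr())   # False: new storage
```

Because the view and the original alias the same memory, an in-place write through one is visible through the other; only .contiguous() (or another copying op) breaks that link.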

EQ F1.1 — STRIDED ADDRESSING $$ \text{offset}(i_0,\dots,i_{n-1}) \;=\; \sum_{k=0}^{n-1} i_k \cdot s_k, \qquad s_k = \prod_{j>k} d_j \;\;(\text{row-major}) $$

The flat memory address of element $(i_0,\dots,i_{n-1})$ is just a dot product of its index with the stride vector $s$. For a contiguous $2\times3$ tensor the strides are $(3,1)$: element $(1,2)$ lives at offset $1\cdot3 + 2\cdot1 = 5$. A "view" never moves data — it only hands you a new $(\text{shape}, \text{stride})$ lens onto the same bytes.
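
EQ F1.1 is small enough to write out in plain Python. This is a sketch for intuition only; row_major_strides and flat_offset are hypothetical helper names, not PyTorch functions.

```python
def row_major_strides(shape):
    """Strides (in elements) of a contiguous row-major array: s_k = prod of d_j for j > k."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(index, strides):
    """EQ F1.1: dot product of the index with the stride vector."""
    return sum(i * s for i, s in zip(index, strides))


strides = row_major_strides((2, 3))
print(strides)                        # (3, 1)
print(flat_offset((1, 2), strides))   # 5

# A transpose keeps the bytes and swaps the strides: element (2, 1) of the
# 3x2 view is the same memory cell as element (1, 2) of the original.
print(flat_offset((2, 1), (1, 3)))    # 5
```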

Broadcasting

The second piece of tensor fluency is broadcasting: the rule that lets operands of different shapes combine without explicit replication. Align shapes from the right; a pair of dimensions is compatible if they are equal or one of them is 1; a size-1 dimension is stretched (virtually, with stride 0, so no memory is copied) to match. A (64, 1) bias added to a (64, 768) activation is broadcast across all 768 columns. Get broadcasting wrong, for instance by combining a (64,) tensor with a (64, 1) tensor, and you get a silent (64, 64) outer-product-shaped result instead of a crash, which makes it one of the most common sources of shape bugs in real code.

EQ F1.2 — BROADCAST COMPATIBILITY $$ \text{out}_k = \max(a_k,\, b_k) \quad\text{is defined} \iff a_k = b_k \;\lor\; a_k = 1 \;\lor\; b_k = 1 $$

Read shapes right-to-left, padding the shorter with leading 1s. Each axis must match exactly or have a 1 on one side; the output takes the larger. $(64,1)$ with $(768,)$ → $(64,768)$; $(3,)$ with $(4,)$ → error, because $3\neq4$ and neither is 1. The stretched axis costs no memory: it is broadcast by setting its stride to zero.
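
The align-right rule fits in a dozen lines of plain Python. This is an illustrative sketch, not PyTorch's implementation; broadcast_shape is a hypothetical helper name.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError."""
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)   # pad the shorter shape with leading 1s
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for axis, (da, db) in enumerate(zip(a, b)):
        if da == db or db == 1:
            out.append(da)               # equal, or b is stretched to match a
        elif da == 1:
            out.append(db)               # a is stretched to match b
        else:
            raise ValueError(f"axis {axis}: sizes {da} and {db} cannot broadcast")
    return tuple(out)


print(broadcast_shape((64, 1), (768,)))   # (64, 768)
print(broadcast_shape((1, 3), (2, 1)))    # (2, 3)
print(broadcast_shape((64,), (64, 1)))    # (64, 64): legal, and often a bug

try:
    broadcast_shape((3,), (4,))
except ValueError as err:
    print("rejected:", err)
```

The third call is the dangerous one: both shapes are legal under the rule, so nothing fails, yet the result is a 64-by-64 grid rather than 64 elementwise values.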

You add a tensor of shape $(1, 3)$ to one of shape $(2, 1)$. Broadcasting produces an output shape $(R, C)$. What is the total number of elements $R \times C$ in the result?

Align right: axis 1 is $3$ vs $1 \to 3$; axis 0 is $1$ vs $2 \to 2$. Output shape is $(2, 3)$, so $R\times C = 2\times3 = $ 6. Each input is virtually stretched along its size-1 axis with no copy.

INSTRUMENT F1.1 — SHAPE & BROADCASTING EXPLORER · EQ F1.2 · ALIGN-RIGHT RULE (interactive; default inputs A = (64, 1) and B = (64, 768); readouts: A.shape, B.shape, RESULT)

Set "A · DIM 1" to 1 to broadcast a column across B's width — the green result is bigger than either input, allocated for free. Now make A's last dim 3 and B's last dim 768 (neither is 1, neither matches): the explorer flags the exact axis PyTorch would reject. This single rule explains most real-world shape bugs.

One honest caveat. "NumPy with a GPU" is the right intuition, not the whole truth. PyTorch tensors carry an autograd history NumPy arrays do not; the default float dtype is float32 (NumPy defaults to float64); and results are not guaranteed to be bit-identical between CPU and GPU, because floating-point reductions can run in different orders. The mental model is excellent for learning and slightly wrong for numerical forensics.
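
The dtype difference is the caveat most likely to bite in practice. A short sketch, assuming both NumPy and PyTorch are installed and PyTorch's default dtype has not been changed:

```python
import numpy as np
import torch

print(torch.tensor([1.0, 2.0]).dtype)   # torch.float32
print(np.array([1.0, 2.0]).dtype)       # float64

# from_numpy shares memory with the array and keeps its dtype,
# so float64 quietly enters a float32 pipeline.
t = torch.from_numpy(np.array([1.0, 2.0]))
print(t.dtype)                          # torch.float64
print(t.float().dtype)                  # torch.float32 after an explicit cast
```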

1.2

Autograd — the dynamic graph

The second superpower is automatic differentiation. Set requires_grad=True on a tensor and PyTorch begins recording every operation that touches it into a directed acyclic graph — the autograd graph. Each node remembers the operation that produced it and a grad_fn that knows how to compute its local derivative. Crucially the graph is dynamic: it is built on the fly during the forward pass and torn down after the backward pass, so an if or a Python for in your model produces a different graph every iteration. This "define-by-run" design is the single biggest reason PyTorch displaced static-graph frameworks for research.
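
To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides which operations get recorded. The function f is a made-up example, not something from the chapter.

```python
import torch


def f(x):
    # The branch taken at run time decides which op lands in the graph.
    if x.item() > 0:
        return x * 3      # this call records a multiply
    return x ** 2         # this call records a power


results = []
for value in (2.0, -2.0):
    x = torch.tensor(value, requires_grad=True)
    y = f(x)
    y.backward()
    results.append((value, x.grad.item()))
    # grad_fn is the graph node that produced y; its type differs per branch
    print(value, type(y.grad_fn).__name__, x.grad.item())

print(results)   # [(2.0, 3.0), (-2.0, -4.0)]
```

The two calls produce different gradients (3 from the multiply, 2x = -4 from the square) because each forward pass built its own graph.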

Calling loss.backward() walks that graph in reverse, applying the chain rule at each node, and deposits $\partial \text{loss} / \partial \theta$ into each leaf tensor's .grad field. This is reverse-mode autodiff: one backward pass computes the gradient of one scalar output with respect to all inputs at once — exactly the regime of deep learning, where loss is a scalar and parameters number in the billions.

EQ F1.3 — REVERSE-MODE CHAIN RULE $$ \bar{x} \;\equiv\; \frac{\partial L}{\partial x} \;=\; \sum_{y \,\in\, \text{children}(x)} \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial x} \;=\; \sum_{y} \bar{y}\,\frac{\partial y}{\partial x} $$

The adjoint $\bar{x}$ of a node is the sum of contributions flowing back from every child it feeds. Backward visits nodes in reverse topological order so every $\bar{y}$ is finished before $x$ needs it. When a tensor fans out to several children, its gradient is the sum over all paths — which is exactly why a leaf used twice accumulates both contributions in .grad.

Two facts trip up everyone once. First, .grad accumulates — each backward() adds to whatever is already there — so you must call optimizer.zero_grad() (or x.grad = None) every step or your gradients will sum across iterations. This is a feature, not a bug: it is how you split a large batch across several backward passes. Second, only leaf tensors (the ones you created with requires_grad=True, typically parameters) keep a .grad; intermediate tensors discard theirs to save memory unless you ask with retain_grad().
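
Both facts can be seen in a few lines. The sketch below assumes a working PyTorch install and uses scalar tensors so the numbers are easy to check by hand.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # a leaf

(x * x).backward()
print(x.grad.item())     # 4.0  (d/dx of x^2 is 2x)
(x * x).backward()
print(x.grad.item())     # 8.0  (the second backward ADDED another 4.0)

x.grad = None            # manual reset for a bare tensor
h = x * 3                # intermediate (non-leaf) tensor, value 6
h.retain_grad()          # ask autograd to keep h.grad
(h * h).backward()
print(h.grad.item())     # 12.0 = 2h
print(x.grad.item())     # 36.0 = 2h * 3
```

Without the retain_grad() call, reading h.grad would give None (with a warning), since autograd discards gradients of intermediates by default.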

Build the graph $c = a\cdot b$, $d = c + a$, $L = d^2$ at $a = 2,\ b = 3$. After L.backward(), what value lands in a.grad $= \dfrac{\partial L}{\partial a}$? (Remember $a$ feeds both $c$ and $d$.)

Forward: $c=6,\ d=8,\ L=64$. Backward: $\bar d = 2d = 16$; $\bar c = \bar d \cdot 1 = 16$. Now $a$ has two children: through $c$, $\bar c\,\partial c/\partial a = 16\cdot b = 48$; through $d$, $\bar d\,\partial d/\partial a = 16\cdot 1 = 16$. Sum the paths (EQ F1.3): $48 + 16 = $ 64.
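
The hand calculation can be cross-checked against autograd itself. A minimal sketch, assuming PyTorch is installed:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b          # a's first use
d = c + a          # a's second use: the fan-out
L = d ** 2
L.backward()

print(L.item())        # 64.0
print(a.grad.item())   # 64.0 = 48 (via c) + 16 (via d)
print(b.grad.item())   # 32.0 = 16 * a
```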

True or false: a leaf tensor accumulates its gradient in tensor.grad. (Answer true or false.)

Leaf tensors created with requires_grad=True have their $\partial L/\partial\theta$ deposited — and added — into .grad on every backward(); intermediates do not keep one unless you call retain_grad(). So the statement is true.

True or false: backward() computes gradients via the chain rule. (Answer true or false.)

backward() traverses the autograd graph in reverse topological order and applies EQ F1.3 — the chain rule — at every node, multiplying each child's adjoint by the local derivative and summing over paths. True.

PYTHON · RUNNABLE IN-BROWSER

# Reverse-mode autodiff by hand in plain Python: mimic .backward() on a tiny graph
# Graph:  c = a*b ;  d = c + a ;  L = d**2   at a=2, b=3
a, b = 2.0, 3.0

# ---- forward pass: compute values, remember them as "tape" ----
c = a * b          # multiply
d = c + a          # add (a fans out: feeds both c and d)
L = d * d          # square
print(f"forward : c={c}, d={d}, L={L}")

# ---- backward pass: seed dL/dL = 1, push adjoints back ----
gL = 1.0
gd = gL * (2 * d)              # d/dd of d**2
gc = gd * 1.0                  # d = c + a  ->  dd/dc = 1
ga = gc * b + gd * 1.0         # a -> c (dc/da=b) AND a -> d (dd/da=1): SUM paths
gb = gc * a                    # c = a*b  ->  dc/db = a
print(f"backward: dL/da={ga}, dL/db={gb}")

# ---- check against a numerical gradient (central finite differences) ----
h = 1e-6

def Lf(a, b):
    """Recompute the forward pass L(a, b) from scratch."""
    c = a * b
    d = c + a
    return d * d

print("numeric dL/da:", round((Lf(a + h, b) - Lf(a - h, b)) / (2 * h), 4),
      "| numeric dL/db:", round((Lf(a, b + h) - Lf(a, b - h)) / (2 * h), 4))


INSTRUMENT F1.2 — AUTOGRAD GRAPH VISUALIZER · FORWARD VALUES · BACKWARD ADJOINTS · EQ F1.3 (interactive; sliders: leaf a = 2.0, leaf b = 3.0; readouts: forward L, a.grad (∂L/∂a), b.grad (∂L/∂b))

The graph $c=a\cdot b,\ d=c+a,\ L=d^2$. Black numbers on nodes are forward values; mint numbers on edges are the adjoints $\bar y\,\partial y/\partial x$ that backward() pushes leftward. Leaf a fans out to two children, so its two incoming edges add: drag the sliders and watch a.grad track $\bar d\,b + \bar d = 2d\,(b+1)$, which is $16\cdot3 + 16 = 64$ at the default $a=2,\ b=3$, exactly as EQ F1.3 predicts.

1.3

nn.Module & building models

You could build a network from bare tensors and requires_grad, but PyTorch gives you nn.Module: a base class that bookkeeps parameters for you. Subclass it, register child modules and nn.Parameter tensors as attributes in __init__, and define forward(self, x). Then module.parameters() recursively yields every learnable tensor — exactly what you hand to the optimizer — and module.to(device), module.state_dict() (for saving), and module.train()/module.eval() all just work.

# A two-layer MLP: the canonical nn.Module shape
import torch                # needed for torch.relu in forward()
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # weight + bias auto-registered
        self.fc2 = nn.Linear(d_hidden, d_out)
    def forward(self, x):
        x = torch.relu(self.fc1(x))            # define-by-run: just write the math
        return self.fc2(x)

model = MLP(784, 128, 10)
n = sum(p.numel() for p in model.parameters())   # count params

A single nn.Linear(d_in, d_out) holds a weight of shape $(d_{\text{out}}, d_{\text{in}})$ and a bias of shape $(d_{\text{out}},)$, and computes the affine map below. With the output dimension stored first, the forward pass reads x @ W.T + b, with x batched on the leading axis.

EQ F1.4 — nn.Linear $$ y \;=\; x W^{\top} + b, \qquad W \in \mathbb{R}^{d_{\text{out}}\times d_{\text{in}}},\; b \in \mathbb{R}^{d_{\text{out}}},\; x \in \mathbb{R}^{B \times d_{\text{in}}} $$

A linear (fully-connected) layer is a learned affine transform. Parameter count is $d_{\text{out}}(d_{\text{in}} + 1)$: the $+1$ is the bias. Batched inputs $x$ of shape $(B, d_{\text{in}})$ map to outputs $(B, d_{\text{out}})$ — the batch axis rides along for free via broadcasting (EQ F1.2) of the bias. Everything in this encyclopedia, Transformers included, is towers of EQ F1.4 with nonlinearities between them.
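
EQ F1.4 can be checked against a real layer. The sketch below assumes PyTorch is installed; the batch size of 32 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(784, 128)
x = torch.randn(32, 784)                       # batch of 32 inputs

print(layer.weight.shape, layer.bias.shape)    # torch.Size([128, 784]) torch.Size([128])

y = layer(x)
manual = x @ layer.weight.T + layer.bias       # EQ F1.4 written out; bias broadcasts over the batch
print(y.shape)                                 # torch.Size([32, 128])
print(torch.allclose(y, manual, atol=1e-5))    # True, up to float rounding

n_params = sum(p.numel() for p in layer.parameters())
print(n_params)                                # 100480 = 128 * (784 + 1)
```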

The MLP above is nn.Linear(784,128) then nn.Linear(128,10), each with bias. Using $d_{\text{out}}(d_{\text{in}}+1)$ per layer, how many learnable parameters does the whole model have?

Layer 1: $128\times(784+1) = 128\times785 = 100{,}480$. Layer 2: $10\times(128+1) = 10\times129 = 1{,}290$. Total $= 100{,}480 + 1{,}290 = $ 101770. ReLU adds none — it has no parameters.

Why nn.Parameter and not a plain tensor? nn.Parameter is a tensor subclass that is automatically (a) registered in parameters() so the optimizer sees it, and (b) given requires_grad=True. Assign a raw tensor as a module attribute and it is invisible to the optimizer — a classic "my loss won't go down" bug. For tensors that should move with the model but never train (e.g. running stats), use register_buffer.
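
The three kinds of attribute are easiest to tell apart side by side. A minimal sketch; the Scale module and its attribute names are invented for illustration.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(3))                 # trainable, registered
        self.hidden = torch.ones(3, requires_grad=True)         # plain tensor: NOT registered
        self.register_buffer("running_mean", torch.zeros(3))    # saved and moved, never trained

    def forward(self, x):
        return (x - self.running_mean) * self.gain * self.hidden


m = Scale()
print([name for name, _ in m.named_parameters()])   # ['gain']
print(list(m.state_dict().keys()))                  # ['gain', 'running_mean']
```

The plain tensor hidden takes part in forward() and even receives gradients, but because it appears in neither parameters() nor state_dict(), an optimizer built from m.parameters() never updates it and a checkpoint never saves it.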

1.4

The training loop

Everything above converges on five lines that you will write thousands of times. PyTorch deliberately does not hide the loop — you write it yourself, which is why the framework is so easy to debug. The canonical step:

for xb, yb in loader:                 # 0. a minibatch
    optimizer.zero_grad()           # 1. clear last step's grads (they accumulate!)
    pred = model(xb)                # 2. FORWARD  — builds the autograd graph
    loss = loss_fn(pred, yb)        # 3. LOSS     — a single scalar
    loss.backward()                 # 4. BACKWARD — chain rule fills every .grad
    optimizer.step()                # 5. STEP     — theta -= lr * theta.grad

The optimizer's step() is where learning happens. For plain SGD it is one line of the gradient-descent rule you met in Volume I; for Adam it adds per-parameter adaptive scaling from running moment estimates. Either way the update consumes the .grad values backward() just deposited.

EQ F1.5 — THE SGD UPDATE optimizer.step() $$ \theta_{t+1} \;=\; \theta_t - \eta\,\nabla_\theta L, \qquad \nabla_\theta L = \theta.\texttt{grad} $$

After backward(), each parameter's .grad holds $\partial L/\partial\theta$; step() walks every parameter and subtracts the learning rate $\eta$ times that gradient. The order is load-bearing: zero_grad → forward → loss → backward → step. Forget zero_grad and gradients sum across iterations (EQ F1.3 accumulation); call step before backward and you update on stale or empty gradients.
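
Here is one way the five-step order might look end to end on a toy problem. It is a sketch under stated assumptions: synthetic data for y = 2x + 1, full-batch updates instead of a DataLoader, and an arbitrary choice of 200 steps at lr=0.1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(64, 1) * 4 - 2                   # 64 inputs in [-2, 2)
y = 2.0 * X + 1.0 + 0.1 * torch.randn(64, 1)    # noisy line

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()        # 1. clear old gradients
    pred = model(X)              # 2. forward
    loss = loss_fn(pred, y)      # 3. scalar loss
    loss.backward()              # 4. fill .grad
    optimizer.step()             # 5. update parameters
    losses.append(loss.item())

print(f"w={model.weight.item():.3f} (true 2.0), b={model.bias.item():.3f} (true 1.0)")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.5f}")
```

Swapping the positions of backward() and step(), or deleting zero_grad(), is a quick way to see the failure modes described above.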

A parameter sits at $\theta = 2.0$. After backward() its .grad is $3.0$. With SGD(lr=0.2), what is $\theta$ after one optimizer.step() (EQ F1.5)?

$\theta_{t+1} = \theta_t - \eta\,\nabla = 2.0 - 0.2\times3.0 = 2.0 - 0.6 = $ 1.4. (Then zero_grad would clear .grad before the next forward.)
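
The same arithmetic can be reproduced with a real optimizer. In this sketch the gradient is assigned by hand as a stand-in for what backward() would deposit, which is an illustrative shortcut rather than normal practice.

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([theta], lr=0.2)

theta.grad = torch.tensor(3.0)   # stand-in for the result of backward()
optimizer.step()                 # theta <- theta - lr * grad
print(theta.item())              # approximately 1.4 (float32 rounding)

optimizer.zero_grad()
print(theta.grad)                # None in recent PyTorch releases; older ones zero it instead
```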

PYTHON · RUNNABLE IN-BROWSER

# A full PyTorch-style training loop in numpy: forward / loss / backward / step
# Fit y = 2x + 1 with a two-parameter linear model (w, b), by hand.
import numpy as np
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 64)               # 64 examples
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, 64)   # true line + noise

w, b = 0.0, 0.0                          # parameters (our "leaves")
lr = 0.1
hist = []
for epoch in range(60):
    # 1. zero_grad is implicit: we recompute grads fresh each step
    pred = w * X + b                     # 2. FORWARD
    err  = pred - y
    loss = np.mean(err ** 2)            # 3. LOSS (MSE, a scalar)
    # 4. BACKWARD: dL/dw, dL/db by the chain rule on mean((wx+b-y)^2)
    gw = np.mean(2 * err * X)
    gb = np.mean(2 * err)
    w -= lr * gw                        # 5. STEP (EQ F1.5)
    b -= lr * gb
    hist.append(loss)

print(f"learned w={w:.3f} (true 2.0), b={b:.3f} (true 1.0)")
print(f"loss: {hist[0]:.3f} -> {hist[-1]:.5f} over 60 steps")
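# plot_xy is not defined in this listing; it is assumed to be a plotting helper
# supplied by the page's in-browser runtime. Elsewhere, swap in your own plot call.
plot_xy(list(range(len(hist))), hist)   # the loss curve descending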


INSTRUMENT F1.3 — TRAINING-LOOP ANATOMY · FIVE STEPS · WHAT EACH LINE TOUCHES (interactive; step slider starting at STEP 0 · IDLE; readouts: CALL, GRAPH STATE, .grad STATE)

Drag the slider through the five canonical lines. Watch how zero_grad empties .grad, forward grows the autograd graph, backward fills the gradients (and frees the graph), and step consumes them. The cycle is what every PyTorch model — from this MLP to a frontier LLM — runs millions of times.

1.5

Datasets, DataLoaders & devices

The loop above iterates over loader — and that object is the last primitive worth knowing. A Dataset answers two questions: __len__ (how many examples) and __getitem__(i) (return example i, usually a (features, label) tuple). A DataLoader wraps a Dataset and handles the operational concerns: batching (collate many examples into one tensor), shuffling (reorder each epoch so batches are i.i.d.), and parallel loading (num_workers subprocesses fetch the next batch while the GPU chews the current one, hiding I/O latency).
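
A Dataset needs only those two methods. The sketch below uses a made-up LineDataset with 100 synthetic examples and keeps the default num_workers=0, so loading happens in the main process.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class LineDataset(Dataset):
    """Hypothetical toy dataset of (x, 2x + 1) pairs."""

    def __init__(self, n):
        self.x = torch.linspace(-2, 2, n).unsqueeze(1)   # shape (n, 1)
        self.y = 2.0 * self.x + 1.0

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return self.x[i], self.y[i]                      # one (features, label) pair


dataset = LineDataset(100)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

shapes = [tuple(xb.shape) for xb, yb in loader]
print(len(loader))   # 4
print(shapes)        # [(32, 1), (32, 1), (32, 1), (4, 1)]
```

The default collate function stacks the per-example tensors along a new leading axis, which is where the batch dimension comes from.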

EQ F1.6 — STEPS PER EPOCH $$ \text{steps/epoch} \;=\; \left\lceil \frac{N}{B} \right\rceil, \qquad \text{last batch has } N - B\!\left\lfloor \frac{N}{B}\right\rfloor \text{ examples when } B \nmid N \text{ and } \texttt{drop\_last=False} $$

$N$ examples, batch size $B$: the loader yields $\lceil N/B\rceil$ minibatches, hence that many optimizer steps per pass over the data. The final batch is ragged unless $B$ divides $N$ or you set drop_last=True. One epoch = one full sweep; total optimizer steps = epochs $\times \lceil N/B\rceil$ — the number that actually sets your training budget.

A dataset has $N = 10{,}000$ examples and you use batch size $B = 128$ with drop_last=False. How many optimizer steps does one epoch take (EQ F1.6)?

$\lceil 10000/128\rceil = \lceil 78.125\rceil = $ 79. The first 78 batches hold 128 each ($78\times128 = 9984$); the 79th holds the remaining $16$ examples.
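
The bookkeeping in EQ F1.6, including the drop_last case, can be written as a small helper. This is plain Python for illustration; epoch_plan is a hypothetical name, not a PyTorch function.

```python
def epoch_plan(n, batch_size, drop_last=False):
    """Return (steps_per_epoch, size_of_last_batch) for n examples."""
    full, remainder = divmod(n, batch_size)
    if drop_last or remainder == 0:
        # Every batch is full; the ragged tail (if any) is discarded.
        return full, (batch_size if full else 0)
    return full + 1, remainder


print(epoch_plan(10_000, 128))                  # (79, 16)
print(epoch_plan(10_000, 128, drop_last=True))  # (78, 128)
print(epoch_plan(1_024, 128))                   # (8, 128)
```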

The final practical concern is device discipline. A tensor and the parameters it meets must live on the same device, or PyTorch raises a RuntimeError along the lines of "Expected all tensors to be on the same device". The idiom is to pick one device up front and move both model and each batch to it:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)              # parameters now on the GPU
for xb, yb in loader:
    xb, yb = xb.to(device), yb.to(device)   # move the batch too
    ...                              # the loop from 1.4, unchanged
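
A self-contained version of the same idiom, which also runs on a CPU-only machine, might look like the sketch below. The nn.Linear(4, 2) model and the batch shape are arbitrary stand-ins.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(4, 2).to(device)     # Module.to moves parameters in place and returns the module

xb = torch.randn(8, 4)                 # tensors are created on the CPU by default
xb = xb.to(device)                     # Tensor.to returns a new tensor, so reassign
out = model(xb)

print(out.device.type == device.type)                        # True
print(next(model.parameters()).device.type == device.type)   # True
```

Note the asymmetry: calling .to(device) on a module is enough by itself, while on a tensor the result must be assigned back, otherwise the batch silently stays on the CPU.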

Where to go from here. torch.compile (introduced in PyTorch 2.0) captures your define-by-run model into an optimized graph with fused kernels, often for a substantial speedup, and needs only a one-line wrap of the model; mixed precision (torch.autocast + GradScaler) cuts activation memory and accelerates matmuls on supporting hardware; and DistributedDataParallel scales the very same loop across many GPUs. None of them change the four ideas in this chapter: tensor, autograd, module, loop.

PyTorch hands you the loop; the next framework hides it. Chapter 02 covers TensorFlow and its Keras front-end — static graphs, model.fit(), and the engineering trade-offs of declaring your computation up front instead of running it line by line.

1.R

References

Paszke, A., Gross, S., Massa, F. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 32 — the system paper for define-by-run autograd.
The PyTorch Team. PyTorch Documentation (stable). Official reference for tensors, autograd, nn, and optim.
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: a Survey. JMLR 18(153) — forward vs reverse mode, the theory behind EQ F1.3.
The PyTorch Team. Automatic Differentiation with torch.autograd. Tutorial — the dynamic graph, .grad accumulation, and zero_grad.
Ansel, J. et al. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ASPLOS 2024 — torch.compile and TorchDynamo.