Tensors & the GPU
A tensor is PyTorch's only data structure that matters: an n-dimensional array of one dtype, backed by a flat block of memory and described by a shape (the logical dimensions) and a stride (how many elements to step to advance along each dimension). If you know NumPy's ndarray, you already know 90% of torch.Tensor; the API was deliberately built to rhyme. The first superpower is that a tensor can live on a different device: x = x.to("cuda") copies the bytes to GPU VRAM and rebinds x to the copy (.to returns a new tensor rather than moving the original in place), after which every operation on x dispatches to a CUDA kernel instead of a CPU one. The math is identical; only the silicon changes.
Stride is the quiet hero. view, transpose, permute and most slices return a new tensor that shares storage with the original and merely reinterprets the strides — zero copies, zero allocation. That is why reshaping a billion-element tensor is instant, and also why a transposed tensor is non-contiguous and sometimes needs .contiguous() before an op that demands a packed layout.
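The stride mechanics described above can be inspected directly. The following is a minimal sketch in NumPy rather than PyTorch, in keeping with the chapter's other runnable examples. It assumes NumPy behaves analogously here: NumPy reports strides in bytes, so the hypothetical helper elem_strides converts them to element counts, and np.ascontiguousarray stands in for .contiguous().

```python
import numpy as np

x = np.arange(12, dtype=np.float32).reshape(3, 4)

def elem_strides(a):
    # Hypothetical helper: NumPy strides are in bytes, so divide by the
    # item size to get "elements to step", the unit PyTorch reports.
    return tuple(s // a.itemsize for s in a.strides)

xt = x.T                                  # transpose: a view, no data copied
print(elem_strides(x))                    # (4, 1)
print(elem_strides(xt))                   # (1, 4)  same storage, swapped strides
print(np.shares_memory(x, xt))            # True
print(xt.flags.c_contiguous)              # False: rows are no longer packed

packed = np.ascontiguousarray(xt)         # forces a packed copy
print(elem_strides(packed))               # (3, 1)
print(np.shares_memory(x, packed))        # False: this one did allocate
```

The transpose costs nothing because only the stride tuple changes; the copy is paid only when a packed layout is explicitly requested.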
Broadcasting
The second piece of tensor fluency is broadcasting: the rule that lets operands of different shapes combine without explicit replication. Align shapes from the right, treating any missing leading dimension as 1; a pair of dimensions is compatible if they are equal or one of them is 1; a size-1 dimension is stretched (virtually, with stride 0, so no memory is copied) to match. A (64, 1) bias added to a (64, 768) activation is broadcast across all 768 columns. Get broadcasting wrong and you get a silent bug, not a crash: subtract a (64,) target from a (64, 1) prediction and the result is a (64, 64) outer-product-shaped tensor, one of the most common sources of shape errors in real code.
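The broadcasting rule can be checked on concrete shapes. This is a small sketch in NumPy, whose broadcasting rule PyTorch follows. The array names (act, bias, pred, target) are illustrative, not from the chapter's code.

```python
import numpy as np

act = np.zeros((64, 768), dtype=np.float32)
bias = np.ones((64, 1), dtype=np.float32)
out = act + bias
print(out.shape)                          # (64, 768)

# The stretch is virtual: the broadcast axis has stride 0, so every
# "column" reads the same memory.
stretched = np.broadcast_to(bias, (64, 768))
print(stretched.strides[1])               # 0

# The silent bug: (64, 1) against (64,) aligns as (64, 1) vs (1, 64).
pred = np.zeros((64, 1), dtype=np.float32)
target = np.zeros(64, dtype=np.float32)
bad = pred - target
good = pred.squeeze(1) - target           # make the shapes agree first
print(bad.shape, good.shape)              # (64, 64) (64,)
```

Printing or asserting shapes right before a loss computation is a cheap way to catch the second case.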
One honest caveat. "NumPy with a GPU" is the right intuition, not the whole truth. PyTorch tensors carry an autograd history that NumPy arrays do not; the default float dtype is float32 (NumPy defaults to float64); and results are generally not bit-identical between CPU and GPU, because floating-point reductions run in different orders. The mental model is excellent for learning and slightly wrong for numerical forensics.
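Two of these caveats are easy to see without a GPU. The sketch below uses plain Python floats and NumPy. It only illustrates why summation order matters and what NumPy's default dtype is; it does not reproduce an actual CPU versus GPU discrepancy.

```python
import numpy as np

# Floating-point addition is not associative, so the order of a
# reduction changes the last bits of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right)                        # 0.6000000000000001 0.6
print(left == right)                      # False

# NumPy's default float dtype is float64.
default_dtype = np.array([1.0, 2.0]).dtype
print(default_dtype)                      # float64
```

For this reason, comparisons across devices are usually made with a tolerance rather than exact equality.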
Autograd — the dynamic graph
The second superpower is automatic differentiation. Set requires_grad=True on a tensor and PyTorch begins recording every operation that touches it into a directed acyclic graph, the autograd graph. Each result tensor carries a grad_fn: the graph node that remembers which operation produced it and knows how to compute that operation's local derivative. Crucially the graph is dynamic: it is built on the fly during the forward pass and, by default, freed after the backward pass, so an if or a Python for in your model can produce a different graph every iteration. This "define-by-run" design is a major reason PyTorch displaced static-graph frameworks for research.
Calling loss.backward() walks that graph in reverse, applying the chain rule at each node, and deposits \(\partial \text{loss} / \partial \theta\) into each leaf tensor's .grad field. This is reverse-mode autodiff: one backward pass computes the gradient of one scalar output with respect to all inputs at once — exactly the regime of deep learning, where loss is a scalar and parameters number in the billions.
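To make the recording and the reverse walk concrete, here is a toy scalar autodiff in plain Python. It is a deliberately simplified sketch, not PyTorch's implementation: the Value class and the forward function are hypothetical names, and only + and * are supported. It uses the graph c = a*b, d = c + a, with a Python if choosing the last operation.

```python
class Value:
    """A scalar that records how it was computed (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # inputs of the op that made this value
        self._local_grads = local_grads    # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Depth-first search gives a topological order; sweep it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                    # seed dL/dL = 1
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local   # chain rule, summed over paths


def forward(a, b, use_square):
    c = a * b
    d = c + a
    # Ordinary Python control flow decides which ops get recorded.
    return d * d if use_square else d


a, b = Value(2.0), Value(3.0)
forward(a, b, use_square=True).backward()
print(a.grad, b.grad)                      # 64.0 32.0

a2, b2 = Value(2.0), Value(3.0)
forward(a2, b2, use_square=False).backward()
print(a2.grad, b2.grad)                    # 4.0 2.0
```

The two calls run the same Python function but record different graphs, which is the define-by-run idea in miniature.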
Two facts trip up everyone once. First, .grad accumulates: each backward() adds to whatever is already there, so you must call optimizer.zero_grad() (or x.grad = None) every step or your gradients will sum across iterations. This is a feature, not a bug: it is how you split a large batch across several backward passes. Second, only leaf tensors (the ones you created with requires_grad=True, typically parameters) keep a .grad; intermediate tensors discard theirs to save memory unless you ask with retain_grad().
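The "feature, not a bug" claim can be verified numerically. The sketch below uses NumPy and a one-parameter model (the data and the helper grad_mse are illustrative assumptions). It shows that summing the scaled gradients of four equal-sized micro-batches reproduces the full-batch gradient, which is what accumulating into .grad across several backward() calls achieves.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=8)
y = 3.0 * X
w = 0.5

def grad_mse(w, X, y):
    # d/dw of mean((w*x - y)**2) is mean(2 * (w*x - y) * x)
    return np.mean(2.0 * (w * X - y) * X)

full = grad_mse(w, X, y)                  # one backward pass on the whole batch

grad = 0.0                                # plays the role of w.grad after zero_grad()
for Xc, yc in zip(np.split(X, 4), np.split(y, 4)):
    grad += grad_mse(w, Xc, yc) / 4       # each "backward()" adds its share

print(np.isclose(grad, full))             # True
```

The division by the number of micro-batches matters: without it the accumulated gradient would be four times too large for a mean-reduced loss.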
For the graph c = a*b, d = c + a, L = d**2 at a = 2, b = 3 (computed by hand in the code below): after L.backward(), what value lands in a.grad \(= \dfrac{\partial L}{\partial a}\)? (Remember \(a\) feeds both \(c\) and \(d\).)
Gradients are stored in tensor.grad. (Answer true or false.)
Leaf tensors created with requires_grad=True have their \(\partial L/\partial\theta\) deposited, and added, into .grad on every backward(); intermediates do not keep one unless you call retain_grad(). So the statement is true.
backward() computes gradients via the chain rule. (Answer true or false.)
backward() traverses the autograd graph in reverse topological order and applies EQ F1.3, the chain rule, at every node, multiplying each child's adjoint by the local derivative and summing over paths. True.
# Reverse-mode autodiff by hand in numpy -- mimic .backward() on a tiny graph
# Graph: c = a*b ; d = c + a ; L = d**2 at a=2, b=3
import numpy as np
a, b = 2.0, 3.0
# ---- forward pass: compute values, remember them as "tape" ----
c = a * b # multiply
d = c + a # add (a fans out: feeds both c and d)
L = d * d # square
print(f"forward : c={c}, d={d}, L={L}")
# ---- backward pass: seed dL/dL = 1, push adjoints back ----
gL = 1.0
gd = gL * (2 * d) # d/dd of d**2
gc = gd * 1.0 # d = c + a -> dd/dc = 1
ga = gc * b + gd * 1.0 # a -> c (dc/da=b) AND a -> d (dd/da=1): SUM paths
gb = gc * a # c = a*b -> dc/db = a
print(f"backward: dL/da={ga}, dL/db={gb}")
# ---- check against a numerical gradient (finite differences) ----
h = 1e-6
def Lf(a, b):
    # L as a function of (a, b), so it can be re-evaluated at shifted inputs
    c = a * b
    d = c + a
    return d * d
print("numeric dL/da:", round((Lf(a+h,b)-Lf(a-h,b))/(2*h), 4),
      "| numeric dL/db:", round((Lf(a,b+h)-Lf(a,b-h))/(2*h), 4))
nn.Module & building models
You could build a network from bare tensors and requires_grad, but PyTorch gives you nn.Module: a base class that does the parameter bookkeeping for you. Subclass it, register child modules and nn.Parameter tensors as attributes in __init__, and define forward(self, x). Then module.parameters() recursively yields every learnable tensor, which is exactly what you hand to the optimizer, and module.to(device), module.state_dict() (for saving), and module.train()/module.eval() all just work.
# A two-layer MLP — the canonical nn.Module shape
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # weight + bias auto-registered
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        x = torch.relu(self.fc1(x))            # define-by-run: just write the math
        return self.fc2(x)

model = MLP(784, 128, 10)
n = sum(p.numel() for p in model.parameters()) # count params
A single nn.Linear(d_in, d_out) holds a weight \(W\) of shape \((d_{\text{out}}, d_{\text{in}})\) and a bias \(b\) of shape \((d_{\text{out}},)\), and computes the affine map \(y = xW^{\top} + b\). The shape convention (output dim first) lets the forward pass be written as x @ W.T + b with x batched on the leading axis.
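The shape convention is easiest to trust after seeing it run. Below is a small NumPy sketch of the same affine map. The sizes (4 inputs, 3 outputs, batch of 5) are arbitrary illustrative choices, and the random values only stand in for trained weights.

```python
import numpy as np

d_in, d_out, batch = 4, 3, 5
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # output dim first, as described above
b = rng.normal(size=(d_out,))
x = rng.normal(size=(batch, d_in))        # batch on the leading axis

y = x @ W.T + b                           # b broadcasts across the batch
n_params = W.size + b.size

print(y.shape)                            # (5, 3)
print(n_params, d_out * (d_in + 1))       # 15 15
```

The parameter count of the layer is d_out * d_in weights plus d_out biases, which factors into d_out * (d_in + 1).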
The MLP above stacks nn.Linear(784,128) then nn.Linear(128,10), each with bias. Using \(d_{\text{out}}(d_{\text{in}}+1)\) per layer, how many learnable parameters does the whole model have?
Why nn.Parameter and not a plain tensor? nn.Parameter is a tensor subclass that is automatically (a) registered in parameters() so the optimizer sees it, and (b) given requires_grad=True. Assign a raw tensor as a module attribute and it is invisible to the optimizer, a classic "my loss won't go down" bug. For tensors that should move with the model but never train (e.g. running stats), use register_buffer.
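The registration behaviour can be imitated in a few lines of plain Python. What follows is a simplified sketch of the idea (intercepting attribute assignment), not PyTorch's actual implementation. The Parameter, Module, Scale and Net classes are all hypothetical stand-ins.

```python
class Parameter:
    """Stand-in for nn.Parameter: a value flagged as trainable."""
    def __init__(self, data):
        self.data = data
        self.requires_grad = True


class Module:
    """Toy stand-in for nn.Module that registers parameters on assignment."""
    def __init__(self):
        object.__setattr__(self, "_params", {})
        object.__setattr__(self, "_children", {})

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):
            self._params[name] = value     # visible to parameters()
        elif isinstance(value, Module):
            self._children[name] = value   # searched recursively
        object.__setattr__(self, name, value)

    def parameters(self):
        yield from self._params.values()
        for child in self._children.values():
            yield from child.parameters()


class Scale(Module):
    def __init__(self):
        super().__init__()
        self.gain = Parameter([1.0])       # registered
        self.offset = [0.0]                # plain value: never registered


class Net(Module):
    def __init__(self):
        super().__init__()
        self.scale = Scale()
        self.bias = Parameter([0.0])


net = Net()
found = list(net.parameters())
print(len(found))                          # 2
```

Only gain and bias are found; offset is an ordinary attribute, so an optimizer built from parameters() would never update it.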
The training loop
Everything above converges on five lines that you will write thousands of times. PyTorch deliberately does not hide the loop — you write it yourself, which is why the framework is so easy to debug. The canonical step:
for xb, yb in loader:             # 0. a minibatch
    optimizer.zero_grad()         # 1. clear last step's grads (they accumulate!)
    pred = model(xb)              # 2. FORWARD: builds the autograd graph
    loss = loss_fn(pred, yb)      # 3. LOSS: a single scalar
    loss.backward()               # 4. BACKWARD: chain rule fills every .grad
    optimizer.step()              # 5. STEP: theta -= lr * theta.grad
The optimizer's step() is where learning happens. For plain SGD it is one line of the gradient-descent rule you met in Volume I; for Adam it adds per-parameter adaptive scaling from running moment estimates. Either way the update consumes the .grad values backward() just deposited.
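The difference between the two update rules shows up in a single step. Below is a sketch in plain Python of one SGD step and one Adam step on a scalar parameter. The function names are illustrative, the Adam hyperparameters are the commonly used defaults, and the starting value 1.0 and gradients are arbitrary assumptions.

```python
import math

def sgd_step(theta, grad, lr):
    # Plain gradient descent: move against the gradient, scaled by lr.
    return theta - lr * grad

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

results = []
for g in (0.01, 100.0):                         # a tiny and a huge gradient
    s = sgd_step(1.0, g, lr=0.1)
    a, _, _ = adam_step(1.0, g, m=0.0, v=0.0, t=1, lr=0.1)
    results.append((g, s, a))
    print(f"grad={g:>6}: sgd -> {s:.4f}, adam -> {a:.4f}")
```

SGD's step size follows the gradient's magnitude, while Adam's first step is close to lr in both cases because the gradient is divided by an estimate of its own scale.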
After backward(), each parameter's .grad holds \(\partial L/\partial\theta\); step() walks every parameter and subtracts the learning rate \(\eta\) times that gradient: \(\theta \leftarrow \theta - \eta\,\partial L/\partial\theta\) (EQ F1.5). The order is load-bearing: zero_grad → forward → loss → backward → step. Forget zero_grad and gradients sum across iterations (EQ F1.3 accumulation); call step before backward and you update on stale or empty gradients.
[…] after backward() its .grad is \(3.0\). With SGD(lr=0.2), what is \(\theta\) after one optimizer.step() (EQ F1.5)?
([…] zero_grad would reset the \(.grad\) to zero before the next forward.)
# A full PyTorch-style training loop in numpy: forward / loss / backward / step
# Fit y = 2x + 1 with a 1-param-per-weight linear model, by hand.
import numpy as np
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 64) # 64 examples
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, 64) # true line + noise
w, b = 0.0, 0.0 # parameters (our "leaves")
lr = 0.1
hist = []
for epoch in range(60):
    # 1. zero_grad is implicit: we recompute grads fresh each step
    pred = w * X + b                 # 2. FORWARD
    err = pred - y
    loss = np.mean(err ** 2)         # 3. LOSS (MSE, a scalar)
    # 4. BACKWARD: dL/dw, dL/db by the chain rule on mean((wx+b-y)^2)
    gw = np.mean(2 * err * X)
    gb = np.mean(2 * err)
    w -= lr * gw                     # 5. STEP (EQ F1.5)
    b -= lr * gb
    hist.append(loss)
print(f"learned w={w:.3f} (true 2.0), b={b:.3f} (true 1.0)")
print(f"loss: {hist[0]:.3f} -> {hist[-1]:.5f} over 60 steps")
plot_xy(list(range(len(hist))), hist) # the loss curve descending (plot_xy is assumed to be a plotting helper supplied by the host page; it is not part of NumPy)
Datasets, DataLoaders & devices
The loop above iterates over loader, and that object is the last primitive worth knowing. A Dataset answers two questions: __len__ (how many examples) and __getitem__(i) (return example i, usually a (features, label) tuple). A DataLoader wraps a Dataset and handles the operational concerns: batching (collate many examples into one tensor), shuffling (reorder the examples each epoch so that every batch is a fresh random sample), and parallel loading (num_workers subprocesses fetch the next batch while the GPU chews the current one, hiding I/O latency).
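The Dataset protocol and the batching logic fit in a short plain-Python sketch. This is an illustration of the idea only: SquaresDataset and iterate_batches are hypothetical names, batches are returned as lists rather than tensors, and there is no parallel loading.

```python
import math
import random

class SquaresDataset:
    """Toy dataset: example i is the pair (i, i squared)."""
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return float(i), float(i * i)


def iterate_batches(dataset, batch_size, shuffle=True, drop_last=False, seed=0):
    idx = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(idx)             # new order for this pass
    for start in range(0, len(idx), batch_size):
        chunk = idx[start:start + batch_size]
        if drop_last and len(chunk) < batch_size:
            break                                    # skip the short final batch
        xs, ys = zip(*(dataset[i] for i in chunk))   # "collate" into columns
        yield list(xs), list(ys)


ds = SquaresDataset(10)
n_keep = sum(1 for _ in iterate_batches(ds, batch_size=4))
n_drop = sum(1 for _ in iterate_batches(ds, batch_size=4, drop_last=True))
print(n_keep, math.ceil(10 / 4))                     # 3 3
print(n_drop, 10 // 4)                               # 2 2
```

With 10 examples and a batch size of 4, keeping the short final batch gives ceil(10/4) = 3 steps per pass, and dropping it gives floor(10/4) = 2.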
With \(N\) examples and batch size \(B\), one epoch takes \(\lceil N/B\rceil\) optimizer steps, or \(\lfloor N/B\rfloor\) with drop_last=True (EQ F1.6). One epoch = one full sweep; total optimizer steps = epochs \(\times \lceil N/B\rceil\), the number that actually sets your training budget.
[…] drop_last=False. How many optimizer steps does one epoch take (EQ F1.6)?
The final practical concern is device discipline. A tensor and the parameters it meets must live on the same device, or PyTorch raises an "Expected all tensors to be on the same device" error. The idiom is to pick one device up front and move both model and each batch to it:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)                      # parameters now on the chosen device
for xb, yb in loader:
    xb, yb = xb.to(device), yb.to(device)     # move the batch too
    ...                                       # the loop from 1.4, unchanged
Where to go from here. torch.compile (introduced in PyTorch 2.0) captures your define-by-run model into a fused, optimized graph, often giving a substantial speedup from a one-line change (model = torch.compile(model)); mixed precision (torch.autocast + GradScaler) runs matmuls in 16-bit formats, which cuts activation memory and accelerates them on supporting hardware; and DistributedDataParallel scales the very same loop across many GPUs. None of them change the four ideas in this chapter: tensor, autograd, module, loop.
PyTorch hands you the loop; the next framework hides it. Chapter 02 covers TensorFlow and its Keras front-end — static graphs, model.fit(), and the engineering trade-offs of declaring your computation up front instead of running it line by line.