TensorFlow & the graph heritage
TensorFlow began in 2015 as Google's successor to an internal system called DistBelief, and its founding bet was the computation graph. You did not run arithmetic; you described it — building a directed graph whose nodes are operations and whose edges are tensors flowing between them — and only then handed that static graph to a runtime (a Session) that executed it, possibly across many GPUs or a TPU pod. The name is literal: tensors flow through the graph.
The payoff of a static graph is that the framework sees the entire program before running a single op. That lets it fuse adjacent operations, prune dead branches, lay out memory ahead of time, place each node on the best device, and serialize the whole thing to a language-independent artifact you can deploy in C++, on a phone, or in a browser with no Python in sight. The cost was ergonomic: TensorFlow 1.x's define-then-run model meant your Python code built a graph but never touched a number, so a shape bug surfaced as an inscrutable error from deep inside the runtime, and a print showed you a symbolic node, not a value.
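To make define-then-run concrete, here is a minimal sketch in plain Python. It is not TensorFlow: `Node`, `placeholder`, and `run` are hypothetical stand-ins for a 1.x graph node, a placeholder, and `Session.run`. Building the expression computes nothing. Only `run` produces a number, and because the whole graph exists up front, it can skip every node the requested output does not depend on.

```python
# A toy define-then-run graph. Hypothetical names; this is not the TensorFlow API.
class Node:
    def __init__(self, op, inputs=(), name=None):
        self.op = op                  # "placeholder", "add", or "mul"
        self.inputs = tuple(inputs)
        self.name = name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

    def __repr__(self):
        return f"<Node op={self.op}>"


def placeholder(name):
    return Node("placeholder", name=name)


def run(fetch, feed):
    """Evaluate `fetch`, visiting only the nodes it depends on."""
    visited = []                      # nodes that were actually executed
    cache = {}

    def evaluate(node):
        if id(node) in cache:
            return cache[id(node)]
        if node.op == "placeholder":
            value = feed[node.name]
        else:
            left, right = (evaluate(n) for n in node.inputs)
            value = left + right if node.op == "add" else left * right
        cache[id(node)] = value
        visited.append(node)
        return value

    return evaluate(fetch), visited


# 1) Describe the computation. No arithmetic happens here.
x = placeholder("x")
y = placeholder("y")
z = x * y + x                         # the output we care about
unused = y * y                        # a dead branch: built, never fetched

print(z)                              # a symbolic node, not a value

# 2) Hand the graph to the "runtime" with concrete inputs.
value, visited = run(z, {"x": 2.0, "y": 3.0})
print(value)                          # 8.0
print(len(visited))                   # 4 nodes: x, y, x*y, (x*y)+x
```

The `print(z)` line reproduces the 1.x ergonomic complaint in miniature: the Python variable holds a description of work, and the value only exists inside `run`.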
TensorFlow 2.0 (2019) flipped the default to eager execution — operations run immediately, like NumPy, so tensors hold real values you can print and debug — while keeping the graph available on demand through tf.function, which traces your Python into a graph the first time it runs and reuses it thereafter. The mental model that survives to 2026 is exactly this duality: eager for writing and debugging, graph for speed and deployment. The same idea reappears in PyTorch's torch.compile (Chapter 01) — the field converged on "write eager, compile to a graph when it matters."
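The "trace once, reuse thereafter" behaviour can be imitated in a few lines of plain Python. This is a sketch under loud assumptions: `traced`, `Trace`, and `Sym` are hypothetical names, only `+` and `*` are supported, and the cache key is just the argument count. It is not how tf.function is implemented. What it does show is the observable pattern: the Python body runs once with symbolic inputs, and later calls replay the recorded ops without re-entering Python.

```python
# A toy "trace once, replay after" decorator. Hypothetical; not tf.function itself.
import functools


class Trace:
    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.ops = []                 # (op, left_slot, (kind, right)) in order
        self.output_slot = None

    def replay(self, args):
        slots = list(args)            # slot i holds input i, then op results
        for op, a, (kind, b) in self.ops:
            rhs = slots[b] if kind == "slot" else b
            slots.append(slots[a] + rhs if op == "add" else slots[a] * rhs)
        return slots[self.output_slot]


class Sym:
    """Symbolic stand-in handed to the function while tracing."""
    def __init__(self, trace, slot):
        self.trace, self.slot = trace, slot

    def _record(self, op, other):
        ref = ("slot", other.slot) if isinstance(other, Sym) else ("const", other)
        self.trace.ops.append((op, self.slot, ref))
        return Sym(self.trace, self.trace.n_inputs + len(self.trace.ops) - 1)

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def traced(fn):
    cache = {}

    @functools.wraps(fn)
    def wrapper(*args):
        key = len(args)               # simplistic cache key (an assumption)
        if key not in cache:
            trace = Trace(len(args))
            out = fn(*(Sym(trace, i) for i in range(len(args))))
            trace.output_slot = out.slot
            cache[key] = trace
            wrapper.trace_count += 1
        return cache[key].replay(args)

    wrapper.trace_count = 0
    return wrapper


python_calls = []

@traced
def f(x, y):
    python_calls.append("ran")        # Python side effect: only during tracing
    return x * y + x * 2.0

first = f(2.0, 3.0)                   # traces, then replays: 10.0
second = f(4.0, 5.0)                  # replays only: 28.0
print(first, second, f.trace_count, python_calls)
```

The `python_calls` list ends up with a single entry even though `f` was called twice, which is the same reason a `print` inside a graph-compiled function tends to fire only while tracing.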
tf.GradientTape records and replays. The graph is not an optimization detail bolted on later: it is the data structure that makes backpropagation mechanical. Static graphs let the compiler see all of \(f\) at once; eager mode builds the same graph one node at a time and differentiates the trace.
Honest caveat. The 1.x-to-2.x transition was painful and fragmented the ecosystem; a great deal of legacy code, tutorials, and Stack Overflow answers still assume Sessions and placeholders that no longer apply. If you read TensorFlow material that calls sess.run(...), it predates 2.0; treat it as historical.
# A "graph" is just a record of ops + their local derivatives.
# Here is reverse-mode autodiff (EQ F2.1) for z = (x*y) + sin(x), by hand.
import numpy as np
x, y = 2.0, 3.0
# forward pass: compute each node, remember the values we'll need
a = x * y # node a = x*y
b = np.sin(x) # node b = sin(x)
z = a + b # node z = a + b (the "loss")
# backward pass: seed dz/dz = 1, push gradients to inputs (chain rule)
dz = 1.0
da = dz * 1.0 # z = a + b -> dz/da = 1
db = dz * 1.0 # z = a + b -> dz/db = 1
dx = da * y + db * np.cos(x) # a=x*y -> y ; b=sin(x) -> cos(x)
dy = da * x # a=x*y -> x
print(f"z = {z:.4f}")
print(f"dz/dx = {dx:.4f} (analytic: y + cos(x) = {y + np.cos(x):.4f})")
print(f"dz/dy = {dy:.4f} (analytic: x = {x:.4f})")
print("this is exactly what tf.GradientTape / torch.autograd automate.")
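The hand derivation above can be mechanized with a tape. The following is a minimal sketch in plain Python, not the tf.GradientTape implementation: `Tape`, `Var`, `sin`, and `gradient` are hypothetical names, and only multiply, add, and sine are supported. Each operation appends its output and local derivatives to the tape during the forward pass, and the backward pass walks the tape in reverse, applying the chain rule.

```python
# A toy record-and-replay tape for z = (x*y) + sin(x). Hypothetical names.
import math


class Tape:
    def __init__(self):
        # (output, [(input, local_derivative), ...]) in forward order
        self.records = []


class Var:
    def __init__(self, value, tape):
        self.value = value
        self.tape = tape
        self.grad = 0.0

    def _emit(self, value, parents):
        out = Var(value, self.tape)
        self.tape.records.append((out, parents))
        return out

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return self._emit(self.value * other.value,
                          [(self, other.value), (other, self.value)])

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return self._emit(self.value + other.value,
                          [(self, 1.0), (other, 1.0)])


def sin(v):
    # d sin(a)/da = cos(a)
    return v._emit(math.sin(v.value), [(v, math.cos(v.value))])


def gradient(tape, output):
    """Replay the tape backwards, accumulating d(output)/d(node) in .grad."""
    output.grad = 1.0
    for out, parents in reversed(tape.records):
        for parent, local in parents:
            parent.grad += out.grad * local


tape = Tape()
x, y = Var(2.0, tape), Var(3.0, tape)
z = x * y + sin(x)                    # forward pass records three ops
gradient(tape, z)                     # backward pass replays them in reverse

print(f"z     = {z.value:.4f}")
print(f"dz/dx = {x.grad:.4f}")        # y + cos(x)
print(f"dz/dy = {y.grad:.4f}")        # x
```

Note that `x.grad` is accumulated with `+=`: `x` feeds two nodes, so its gradient is the sum of two paths, exactly the `da * y + db * np.cos(x)` line in the hand-written version.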
Keras — the high-level API
Keras started in 2015 as François Chollet's framework-agnostic frontend; since TensorFlow 2.0 it has been TF's official high-level API (imported as tf.keras), and as of Keras 3 (2023) it once again runs on a choice of backends — TensorFlow, JAX, or PyTorch — behind one identical API. Its design philosophy is "progressive disclosure of complexity": the easy thing is one line, and every layer of customization is available exactly when you need it, never before.
The smallest unit is the layer: an object that owns weights and maps an input tensor to an output tensor. The workhorse is Dense — a fully-connected layer computing an affine transform followed by a nonlinearity:
\[\mathbf{y} = \phi(W\mathbf{x} + \mathbf{b}) \tag{EQ F2.2}\]
A Dense(u) layer reading \(d_{\text{in}}\) features holds a weight matrix \(W\) of shape \(u \times d_{\text{in}}\) and a bias vector \(\mathbf{b}\) of length \(u\); \(\phi\) is the activation (ReLU, softmax, …). Its trainable parameter count is therefore \(d_{\text{in}}\!\cdot u + u = (d_{\text{in}}+1)\,u\), where the +1 is the bias. Note that \(d_{\text{in}}\) is not something you pass: Keras infers it from whatever tensor first flows in, which is why a freshly built layer reports 0 parameters until it sees a shape. The single most common surprise for newcomers is exactly this lazy build.
A whole network is then just an ordered stack of such layers. Keras offers three ways to express one, in increasing power:
| API | Shape | Use when… |
|---|---|---|
| Sequential | a plain list of layers | The model is a single chain, input → output, no branching. |
| Functional | a DAG of layers | You need multiple inputs/outputs, skip connections, or shared layers — most real models. |
| Subclassing | imperative call() | Control flow depends on the data (dynamic loops, research architectures). |
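The three rows of the table differ in how layers are wired, not in what a layer is. The sketch below illustrates that with plain Python stand-ins rather than Keras: `scale`, `relu`, `add`, `sequential`, `residual`, and `HalveUntilSmall` are all hypothetical helpers operating on lists of floats, chosen only to show a chain, a DAG with a skip connection, and data-dependent control flow side by side.

```python
# Three ways to wire the same building blocks. Plain Python stand-ins, not Keras.
def scale(k):
    return lambda xs: [k * v for v in xs]

def relu(xs):
    return [max(0.0, v) for v in xs]

def add(a, b):
    return [u + v for u, v in zip(a, b)]


# 1) Sequential style: a plain list, applied in order.
def sequential(layers):
    def model(xs):
        for layer in layers:
            xs = layer(xs)
        return xs
    return model

chain = sequential([scale(2.0), relu, scale(0.5)])


# 2) Functional style: explicit wiring, so reusing a tensor is easy.
def residual(xs):
    h = relu(scale(2.0)(xs))
    return add(xs, h)                 # the input is used twice: a DAG, not a chain


# 3) Subclassing style: an imperative call() whose control flow depends on data.
class HalveUntilSmall:
    def call(self, xs):
        steps = 0
        while max(abs(v) for v in xs) > 1.0:   # loop length depends on the input
            xs = scale(0.5)(xs)
            steps += 1
        return xs, steps


x = [-1.0, 3.0]
chain_out = chain(x)                       # [0.0, 3.0]
residual_out = residual(x)                 # [-1.0, 9.0]
small, steps = HalveUntilSmall().call(x)   # ([-0.25, 0.75], 2)
print(chain_out, residual_out, small, steps)
```

The skip connection in `residual` cannot be written as a flat list, and the `while` loop in `HalveUntilSmall` cannot be drawn as a fixed DAG, which is the practical meaning of "in increasing power".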
The parameter count of a stack is just the sum over its layers, and tracking it is not academic: parameters set your memory budget, your overfitting risk, and your serving cost. The calculator below makes EQ F2.2 tactile: add a layer, watch the count move.
A Dense(64) layer receives an input with \(32\) features. How many weights does it hold, excluding the bias terms? (Use \(d_{\text{in}}\cdot u\) from EQ F2.2.)
# A tiny MLP forward pass "Keras-style", but in pure numpy (EQ F2.2).
# Sequential([Dense(8, relu), Dense(3, softmax)]) on a 4-feature input.
import numpy as np
rng = np.random.default_rng(0)
def dense(x, W, b, act):  # y = act(x @ W.T + b)
    z = x @ W.T + b
    if act == "relu":
        return np.maximum(0.0, z)
    if act == "softmax":
        # subtract the row max before exponentiating, for numerical stability
        e = np.exp(z - z.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)
    return z
x = rng.normal(0, 1, (5, 4)) # batch of 5, 4 features each
W1 = rng.normal(0, 0.5, (8, 4)); b1 = np.zeros(8) # Dense(8): 4 -> 8
W2 = rng.normal(0, 0.5, (3, 8)); b2 = np.zeros(3) # Dense(3): 8 -> 3
h = dense(x, W1, b1, "relu") # hidden activations
out = dense(h, W2, b2, "softmax") # class probabilities
np.set_printoptions(precision=3, suppress=True)
print("output probabilities (rows = examples, cols = 3 classes):")
print(out)
print("\nevery row sums to 1:", out.sum(1).round(6))
print("params:", W1.size + b1.size + W2.size + b2.size,
      "= (4+1)*8 + (8+1)*3 =", (4+1)*8 + (8+1)*3)
# Parameter-count calculator for a Dense stack (EQ F2.2): (d_in+1)*u per layer.
# This is what model.summary() prints, by hand.
def dense_params(input_dim, units):
    """Return (weights, biases, total) for one Dense layer."""
    weights = input_dim * units
    biases = units
    return weights, biases, weights + biases
# A small MLP: 32 features -> 64 -> 64 -> 10 classes
layers = [("Dense(64)", 32, 64),
("Dense(64)", 64, 64),
("Dense(10)", 64, 10)]
total = 0
print(f"{'layer':<12}{'in':>5}{'units':>7}{'weights':>10}{'+bias':>8}{'params':>10}")
for name, d_in, u in layers:
    w, b, p = dense_params(d_in, u)
    total += p
    print(f"{name:<12}{d_in:>5}{u:>7}{w:>10,}{b:>8}{p:>10,}")
print("-" * 52)
print(f"{'TOTAL':<12}{'':>5}{'':>7}{'':>10}{'':>8}{total:>10,} trainable params")
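The lazy build described earlier in this section, where a layer holds no weights until it sees its first input, can be sketched without Keras. `LazyDense` and its `param_count` method are hypothetical names for illustration. The layer is linear (no activation) and works on plain Python lists, so this shows only the deferred shape inference, not Keras's actual build machinery.

```python
# A Dense-like layer that creates its weights on first call. Hypothetical sketch.
import random


class LazyDense:
    def __init__(self, units, seed=0):
        self.units = units
        self.W = None                 # becomes units x d_in once built
        self.b = None                 # becomes length-units once built
        self._rng = random.Random(seed)

    def param_count(self):
        if self.W is None:
            return 0                  # nothing allocated yet
        return len(self.W) * len(self.W[0]) + len(self.b)

    def build(self, d_in):
        self.W = [[self._rng.gauss(0.0, 0.5) for _ in range(d_in)]
                  for _ in range(self.units)]
        self.b = [0.0] * self.units

    def __call__(self, x):
        if self.W is None:
            self.build(len(x))        # d_in is inferred from the first input
        return [sum(w * xi for w, xi in zip(row, x)) + bias
                for row, bias in zip(self.W, self.b)]


layer = LazyDense(64)
before = layer.param_count()          # 0: no input seen yet
y = layer([0.1] * 32)                 # first call fixes d_in = 32
after = layer.param_count()           # (32 + 1) * 64 = 2112

print(before, after, len(y))
```

The design choice being illustrated is that the constructor takes only `units`. The input width is a property of the data, so deferring allocation lets the same layer definition be reused at any input size.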
tf.data input pipelines
A model is only as fast as the data reaching it. If the GPU finishes a step in 8 ms but the CPU needs 20 ms to read, decode, and augment the next batch, the accelerator sits idle at least 60% of the time (12 of every 20 ms even with perfect overlap, and about 71% with no overlap at all): you bought a sports car and left it in the garage. tf.data is TensorFlow's answer: a declarative pipeline that overlaps data preparation with model execution so the accelerator waits as little as possible.
A pipeline is a chain of transformations on a tf.data.Dataset: .map() applies a preprocessing function, .shuffle(buffer) randomizes order through a fixed-size buffer, .batch(n) groups examples, and the two that matter most for throughput:
- .prefetch(k): lets the input pipeline produce batch \(t{+}1\) on the CPU while the model trains on batch \(t\) on the GPU. With tf.data.AUTOTUNE the runtime tunes the buffer size for you. This single call is usually the largest free speedup available.
- .cache(): keeps the dataset in memory (or on disk) after the first epoch, so repeated epochs skip re-reading and re-decoding. Place it after expensive deterministic work and before random augmentation, or you will cache one fixed set of augmentations forever.
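Two of the transformations above are easy to misread: what a fixed-size shuffle buffer actually does, and why cache placement matters. The sketch below imitates them with plain Python generators. `shuffle`, `batch`, `decode`, and `augment` are hypothetical stand-ins, not the tf.data API, and the "cache" is simply a list built once.

```python
# Shuffle buffer, batching, and cache placement with plain generators.
import random


def shuffle(items, buffer_size, rng):
    """Fixed-size shuffle buffer: an item can only move as far as the buffer allows."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # fill phase
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]               # emit a random buffered item...
        buffer[i] = item              # ...and put the new one in its slot
    rng.shuffle(buffer)
    yield from buffer                 # drain what is left


def batch(items, n):
    group = []
    for item in items:
        group.append(item)
        if len(group) == n:
            yield group
            group = []
    if group:
        yield group                   # final partial batch


decode_calls = 0

def decode(i):                        # stand-in for expensive, deterministic work
    global decode_calls
    decode_calls += 1
    return i * 10

def augment(v, rng):                  # stand-in for random augmentation
    return v + rng.choice([0, 1])


rng = random.Random(0)
cached = None
epochs = []
for epoch in range(3):
    if cached is None:
        cached = [decode(i) for i in range(8)]        # cache after decode...
    augmented = (augment(v, rng) for v in cached)     # ...and before augmentation
    epochs.append(list(batch(shuffle(augmented, 4, rng), 3)))

print("decode calls over 3 epochs:", decode_calls)    # 8, not 24
for e in epochs:
    print(e)
```

With `buffer_size=1` this `shuffle` returns items in their original order, and the first item emitted always comes from the first `buffer_size` inputs, which is why a buffer much smaller than the dataset gives only local shuffling.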
The reason prefetch works is a pipeline identity. Without overlap, each step costs the sum of input time and compute time; with overlap, the two run concurrently, so a step costs only the larger of the two:
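\[t_{\text{serial}} = t_{\text{input}} + t_{\text{compute}}, \qquad t_{\text{overlap}} = \max(t_{\text{input}},\, t_{\text{compute}}) \tag{EQ F2.3}\]
[…] map, a better file format like TFRecord).
[…] .prefetch() overlapping the two (EQ F2.3), how many milliseconds does one overlapped step take?
# EQ F2.3: why .prefetch() turns a sum into a max. A throughput simulator.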
import numpy as np
t_input, t_compute = 12.0, 20.0 # ms per batch (CPU prep, GPU step)
n_steps = 500
serial = n_steps * (t_input + t_compute) # no overlap
overlapped = t_input + n_steps * max(t_input, t_compute) # +1 warm-up batch
print(f"per-step serial : {t_input + t_compute:.1f} ms")
print(f"per-step overlapped : {max(t_input, t_compute):.1f} ms (max, not sum)")
print(f"{n_steps} steps serial : {serial/1000:.2f} s")
print(f"{n_steps} steps prefetch : {overlapped/1000:.2f} s")
print(f"speedup : {serial/overlapped:.2f}x")
# utilization = fraction of wall time the GPU is actually computing
util_serial = t_compute / (t_input + t_compute)
util_overlapped = t_compute / max(t_input, t_compute)
print(f"\nGPU utilization serial : {util_serial*100:4.1f} %")
print(f"GPU utilization prefetch: {min(util_overlapped,1)*100:4.1f} %")
The same overlap principle governs PyTorch's DataLoader (num_workers for parallel loading, pin_memory + prefetch_factor for the host-to-device overlap). The vocabulary differs; the bottleneck — and the fix — does not. Every framework's input API exists to win back EQ F2.3's wasted t_input.
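The overlap itself is a producer-consumer pattern, and a bounded queue plus a background thread is enough to reproduce it. This is a sketch with hypothetical helpers (`prefetch`, `load_batches`, `train`) and `time.sleep` standing in for real work. It does not propagate producer errors and is not how tf.data or DataLoader are implemented.

```python
# Overlapping input preparation with compute via a bounded queue.
import queue
import threading
import time


def prefetch(source, buffer_size):
    """Iterate `source` in a background thread, keeping up to buffer_size items ready."""
    q = queue.Queue(maxsize=buffer_size)
    done = object()                   # sentinel marking end of stream

    def producer():
        for item in source:
            q.put(item)               # blocks while the buffer is full
        q.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()                # blocks only when the producer is behind
        if item is done:
            return
        yield item


def load_batches(n, t_input):
    for i in range(n):
        time.sleep(t_input)           # simulated read / decode / augment
        yield i


def train(batches, t_compute):
    seen = []
    for b in batches:
        time.sleep(t_compute)         # simulated accelerator step
        seen.append(b)
    return seen


t_input, t_compute, n = 0.012, 0.020, 10   # seconds per batch

start = time.perf_counter()
serial_seen = train(load_batches(n, t_input), t_compute)
serial_s = time.perf_counter() - start

start = time.perf_counter()
overlap_seen = train(prefetch(load_batches(n, t_input), buffer_size=2), t_compute)
overlap_s = time.perf_counter() - start

print(f"serial     : {serial_s * 1000:.0f} ms")
print(f"prefetched : {overlap_s * 1000:.0f} ms")
```

On an otherwise idle machine the prefetched run should land near `n * max(t_input, t_compute)` plus one warm-up batch, while the serial run lands near `n * (t_input + t_compute)`. Exact timings depend on the scheduler, so treat the printed numbers as indicative.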
Training — model.fit vs custom loops
Once a model is built and a dataset is ready, training is a loop. Keras gives you that loop for free. After model.compile(optimizer, loss, metrics), a single call to model.fit(dataset, epochs=...) runs the entire schedule: it iterates batches, does the forward pass, computes the loss, runs backpropagation, applies the optimizer update, accumulates metrics, and — if you pass validation_data — evaluates each epoch. Everything inside the loop is the framework's responsibility; you supplied only the model, the loss, and the data.
What fit automates, one batch at a time, is exactly the gradient-descent step:
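\[\theta \leftarrow \theta - \eta\,\nabla_{\theta}\mathcal{L}(\theta) \tag{EQ F2.4}\]
In standard notation, \(\theta\) are the model's weights, \(\eta\) is the learning rate, and \(\mathcal{L}\) is the loss on the current batch. fit wraps this in epoch and batch loops, handles the optimizer's internal state (momentum, Adam moments), runs metric accumulation, and fires callbacks at the right moments. The math is identical to a hand-written loop: fit is convenience, not magic, and that is precisely why it is the right default.
The extensibility hooks matter as much as the loop itself: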
- Callbacks: objects that observe and steer training without you touching the loop, such as ModelCheckpoint (save the best weights), EarlyStopping (halt when validation stalls), ReduceLROnPlateau, and TensorBoard logging. They are the reason fit scales from a toy to a serious run.
- Override train_step: keep fit's outer machinery (callbacks, distribution, metrics) but replace just the inner forward/backward logic. This is the modern sweet spot for custom losses like GANs or contrastive learning.
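To see how the loop and its hooks fit together, here is a toy `fit` in plain Python for a one-variable linear model. Everything in it is a hypothetical sketch rather than the Keras implementation: the `fit` function, the `EarlyStopping` class, and the `on_epoch_end` hook only borrow the names of their Keras counterparts. For brevity the callback monitors training loss, not validation loss.

```python
# A toy fit(): batch loop + gradient step + callback hook. Not the Keras code.
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""
    def __init__(self, patience=2, min_delta=1e-6):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.wait = 0
        self.stop_training = False

    def on_epoch_end(self, epoch, loss):
        if loss < self.best - self.min_delta:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
            self.stop_training = self.wait >= self.patience


def fit(params, batches, lr, epochs, callbacks=()):
    """Mini-batch gradient descent on y ~ w*x + b with mean squared error."""
    w, b = params
    history = []
    for epoch in range(epochs):
        total, count = 0.0, 0
        for xs, ys in batches:
            # forward pass and per-example errors
            errors = [(w * x + b) - y for x, y in zip(xs, ys)]
            total += sum(e * e for e in errors)
            count += len(xs)
            # backward pass: gradients of the batch-mean squared error
            grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
            grad_b = 2.0 * sum(errors) / len(xs)
            # optimizer update: theta <- theta - lr * grad
            w -= lr * grad_w
            b -= lr * grad_b
        epoch_loss = total / count
        history.append(epoch_loss)
        for cb in callbacks:          # hooks observe and steer the loop
            cb.on_epoch_end(epoch, epoch_loss)
        if any(getattr(cb, "stop_training", False) for cb in callbacks):
            break
    return (w, b), history


# Data generated from y = 3x + 1, split into two batches.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [3.0 * x + 1.0 for x in xs]
batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

stopper = EarlyStopping(patience=2, min_delta=1e-6)
(w, b), history = fit((0.0, 0.0), batches, lr=0.05, epochs=5000,
                      callbacks=[stopper])

print(f"stopped after {len(history)} epochs, w = {w:.3f}, b = {b:.3f}")
```

The callback never touches the gradient code, and the gradient code never knows the callback exists. That separation is what lets the same outer loop serve both a toy run and a long one.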
You drop to a fully custom loop — opening a tf.GradientTape, computing gradients, and calling optimizer.apply_gradients yourself — only when control flow genuinely demands it: multiple interacting optimizers, exotic gradient surgery, reinforcement-learning rollouts, or research that needs to inspect intermediate quantities every step. The trade is total control for the loss of every convenience fit bundled in. A good engineer reaches for the custom loop last, not first — most "I need a custom loop" instincts are satisfied by a custom train_step or a callback.
model.fit() runs the training loop on your behalf, iterating over batches and epochs and performing the forward pass, backpropagation, and optimizer update each step, so you do not have to write that loop yourself. (Answer true or false.)
fit is precisely the built-in training loop: given a compiled model and a dataset, it walks every batch of every epoch and on each one runs the forward pass, computes and backpropagates the loss, and applies the optimizer update (EQ F2.4), while also accumulating metrics and firing callbacks. Writing that loop by hand is exactly what you avoid by calling fit. The statement is true.
TensorFlow vs PyTorch — choosing
This is the question every team eventually asks, and the honest 2026 answer starts with a fact: in research and in most new projects, PyTorch is the default. By citation share at the major ML conferences and by the dominant choice of new open-source models on Hugging Face, PyTorch has been the leading framework since roughly 2019–2020 and remains so. That is the realistic baseline; the rest is where the picture is genuinely more nuanced than the headline.
| Dimension | PyTorch | TensorFlow / Keras |
|---|---|---|
| Default style | Eager, Pythonic; torch.compile for graphs | Eager by default (TF2); tf.function for graphs |
| Research mindshare | Dominant | Declining for new work |
| Production / mobile / web | Strong & improving (ExecuTorch, TorchServe) | Mature: TF Serving, TF Lite/LiteRT, TF.js |
| High-level API | Lightning, fastai (third-party) | Keras, built in |
| TPU support | via PyTorch/XLA | First-class, native |
| Backend-agnostic | — | Keras 3: TF / JAX / PyTorch |
So where does TensorFlow still earn its place in 2026? Three honest answers. (1) Deployment breadth: the mobile/edge story (LiteRT, formerly TF Lite) and the browser story (TensorFlow.js) remain more battle-tested than the alternatives, and TF Serving is a known quantity in production. (2) TPUs: if you train on Google's TPU hardware, TensorFlow — and increasingly JAX — is the path of least resistance. (3) Keras 3 as a hedge: because Keras now runs on TensorFlow, JAX, or PyTorch behind one API, you can write Keras and stay portable across backends — a genuinely new reason to consider it that did not exist a few years ago.
A note of nuance experts would insist on: the eager-versus-graph and Pythonic-versus-not distinctions that once cleanly separated the two frameworks have largely converged. Both default to eager; both compile to graphs; both target the same accelerators. The remaining differences are about ecosystem and deployment surface, not core capability. The pragmatic advice: pick the framework your team and your target hardware already know, prefer PyTorch if you are starting fresh in research, prefer the TensorFlow/Keras stack if your hard constraint is TPUs or a mature mobile/web/serving pipeline — and remember that the deployment layer, covered next, often matters more than the training framework you started in.
Contested point, stated plainly. Framework "market share" numbers vary wildly by source and methodology (conference papers vs. job postings vs. GitHub stars vs. enterprise surveys), and partisans cite whichever favors their side. The defensible claim is narrow and the one made above: PyTorch leads new research and open-model releases; TensorFlow retains an edge in certain deployment targets and TPUs. Anything stronger than that is marketing.
You can now build and train a model in either framework — but a trained model is not a product. Chapter 03 turns to the ecosystem and deployment: exporting to SavedModel / ONNX / TorchScript, serving with TF Serving and friends, shrinking for the edge with LiteRT, and running inference in the browser — where, as this chapter hinted, the framework you trained in often stops mattering and the framework you serve in takes over.