TensorFlow & Keras — AI Encyclopedia

2.1

TensorFlow & the graph heritage

TensorFlow began in 2015 as Google's successor to an internal system called DistBelief, and its founding bet was the computation graph. You did not run arithmetic; you described it — building a directed graph whose nodes are operations and whose edges are tensors flowing between them — and only then handed that static graph to a runtime (a Session) that executed it, possibly across many GPUs or a TPU pod. The name is literal: tensors flow through the graph.

The payoff of a static graph is that the framework sees the entire program before running a single op. That lets it fuse adjacent operations, prune dead branches, lay out memory ahead of time, place each node on the best device, and serialize the whole thing to a language-independent artifact you can deploy in C++, on a phone, or in a browser with no Python in sight. The cost was ergonomic: TensorFlow 1.x's define-then-run model meant your Python code built a graph but never touched a number, so a shape bug surfaced as an inscrutable error from deep inside the runtime, and a print showed you a symbolic node, not a value.

TensorFlow 2.0 (2019) flipped the default to eager execution — operations run immediately, like NumPy, so tensors hold real values you can print and debug — while keeping the graph available on demand through tf.function, which traces your Python into a graph the first time it is called with a given input signature and reuses that graph for later calls that match (a new dtype or shape triggers a retrace). The mental model that survives to 2026 is exactly this duality: eager for writing and debugging, graph for speed and deployment. The same idea reappears in PyTorch's torch.compile (Chapter 01) — the field converged on "write eager, compile to a graph when it matters."
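The trace-once, replay-many behaviour can be imitated in a few lines of plain Python. The sketch below is a toy and not TensorFlow's implementation: `Graph`, `Sym`, and `toy_function` are hypothetical names invented for illustration, only `+` and `*` are supported, and the cache key is just the argument count, whereas a real tracer also looks at dtypes and shapes.

```python
import operator

class Graph:
    """A recorded list of ops that can be replayed on concrete numbers."""
    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.ops = []              # each entry: (fn, lhs_ref, rhs_ref)
        self.output_slot = None

    def add(self, fn, lhs, rhs):
        self.ops.append((fn, lhs, rhs))
        # The new op's result lives in the next free slot.
        return Sym(self, self.n_inputs + len(self.ops) - 1)

    def run(self, *args):
        slots = list(args)         # slots 0..n_inputs-1 hold the inputs
        def value(ref):
            kind, payload = ref
            return slots[payload] if kind == "slot" else payload
        for fn, lhs, rhs in self.ops:
            slots.append(fn(value(lhs), value(rhs)))
        return slots[self.output_slot]

class Sym:
    """Symbolic stand-in for a number: arithmetic on it records an op."""
    def __init__(self, graph, slot):
        self.graph, self.slot = graph, slot

    def _ref(self, other):
        return ("slot", other.slot) if isinstance(other, Sym) else ("const", other)

    def __add__(self, other):
        return self.graph.add(operator.add, ("slot", self.slot), self._ref(other))

    def __mul__(self, other):
        return self.graph.add(operator.mul, ("slot", self.slot), self._ref(other))

def toy_function(py_fn):
    """Trace py_fn into a Graph on first use, then replay the Graph."""
    graphs = {}
    def wrapper(*args):
        key = len(args)            # simplification: real tracers key on more
        if key not in graphs:
            graph = Graph(len(args))
            out = py_fn(*[Sym(graph, i) for i in range(len(args))])
            graph.output_slot = out.slot
            graphs[key] = graph
        return graphs[key].run(*args)
    return wrapper

trace_log = []

@toy_function
def step(x, y):
    trace_log.append("traced")     # Python side effect: runs only while tracing
    return x * y + x

print(step(2.0, 3.0))
print(step(4.0, 5.0))
print("times the Python body ran:", len(trace_log))
```

The Python body runs once, during tracing; the second call replays the recorded ops. This is the same reason a `print` inside a traced function tends to fire only on the first call.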

EQ F2.1 — A NODE IN THE COMPUTATION GRAPH $$ z \;=\; f(x, y), \qquad \frac{\partial \mathcal{L}}{\partial x} \;=\; \frac{\partial \mathcal{L}}{\partial z}\,\frac{\partial z}{\partial x} \quad\text{(reverse-mode, accumulated over the graph)} $$

Every op $f$ records both how to compute its output $z$ (the forward pass) and how to push a gradient from its output back to its inputs (the backward pass). Stringing the per-node local derivatives together by the chain rule, in reverse topological order, is reverse-mode automatic differentiation — what tf.GradientTape records and replays. The graph is not an optimization detail bolted on later: it is the data structure that makes backpropagation mechanical. Static graphs let the compiler see all of $f$ at once; eager mode builds the same graph one node at a time and differentiates the trace.

Honest caveat. The 1.x-to-2.x transition was painful and fragmented the ecosystem; a great deal of legacy code, tutorials, and Stack Overflow answers still assume Sessions and placeholders that no longer apply. If you read TensorFlow material that calls sess.run(...), it predates 2.0 — treat it as historical.

PYTHON · RUNNABLE IN-BROWSER

# A "graph" is just a record of ops + their local derivatives.
# Here is reverse-mode autodiff (EQ F2.1) for z = (x*y) + sin(x), by hand.
import numpy as np

x, y = 2.0, 3.0
# forward pass: compute each node, remember the values we'll need
a = x * y                 # node a = x*y
b = np.sin(x)             # node b = sin(x)
z = a + b                 # node z = a + b   (the "loss")

# backward pass: seed dz/dz = 1, push gradients to inputs (chain rule)
dz = 1.0
da = dz * 1.0             # z = a + b  -> dz/da = 1
db = dz * 1.0             # z = a + b  -> dz/db = 1
dx = da * y + db * np.cos(x)   # a=x*y -> y ; b=sin(x) -> cos(x)
dy = da * x                    # a=x*y -> x

print(f"z = {z:.4f}")
print(f"dz/dx = {dx:.4f}   (analytic: y + cos(x) = {y + np.cos(x):.4f})")
print(f"dz/dy = {dy:.4f}   (analytic: x          = {x:.4f})")
print("this is exactly what tf.GradientTape / torch.autograd automate.")

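The hand-written backward pass generalizes once each value remembers its parents and the local derivative to each of them. The following is a minimal sketch of that idea, assuming scalar values only; `Var`, `sin`, and `backward` are illustrative names, not part of TensorFlow or PyTorch.

```python
import math

class Var:
    """A scalar that remembers its parents and the local derivative to each."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents     # tuple of (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Visit nodes in reverse topological order, applying the chain rule."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)     # a node is appended after all its parents
    visit(output)
    output.grad = 1.0              # seed: d(output)/d(output) = 1
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)                 # same function as the example above
backward(z)
print(f"z = {z.value:.4f}  dz/dx = {x.grad:.4f}  dz/dy = {y.grad:.4f}")
```

The gradients match the hand computation: `x.grad` accumulates contributions from both paths through the graph (via `x*y` and via `sin(x)`), which is why the update uses `+=`.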

INSTRUMENT F2.1 — EAGER vs GRAPH EXECUTION · WHAT tf.function TRADES · CONCEPTUAL

[Interactive widget. Controls: execution mode, calls to the step (default 200). Readouts: mode, modelled wall time, debuggable?]

A toy cost model. Eager dispatches every op from Python each call: cheap to start, but a fixed per-op Python overhead is paid on every one of N calls. Graph mode pays a one-time tracing/compile cost, then runs a fused graph with near-zero Python overhead per call. Slide the call count: graph mode loses on a single call and wins decisively once the step runs in a loop — which is exactly what training does. The crossover is the whole reason tf.function exists.
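The cost model described here can be written down directly. Every constant below is a made-up assumption chosen only to make the crossover visible, not a measurement of any real framework; the sketch also assumes device compute per step is identical in both modes, which understates graph mode if fusion makes kernels faster.

```python
# Toy cost model for eager vs graph execution. All numbers are hypothetical.
N_OPS = 50                  # ops in one training step (assumed)
PER_OP_DISPATCH_US = 20     # Python dispatch overhead per op, eager (assumed)
KERNEL_US = 2000            # device compute per step, both modes (assumed)
TRACE_US = 150_000          # one-time tracing/compile cost, graph (assumed)
GRAPH_CALL_US = 50          # residual per-call overhead, graph (assumed)

def eager_us(calls):
    """Eager pays the per-op Python overhead on every call."""
    return calls * (N_OPS * PER_OP_DISPATCH_US + KERNEL_US)

def graph_us(calls):
    """Graph pays tracing once, then a small fixed overhead per call."""
    return TRACE_US + calls * (GRAPH_CALL_US + KERNEL_US)

def break_even_calls():
    """Smallest call count at which graph mode is strictly faster."""
    calls = 1
    while graph_us(calls) >= eager_us(calls):
        calls += 1
    return calls

for calls in (1, 200, 1000):
    print(f"{calls:>5} calls  eager {eager_us(calls)/1000:8.1f} ms"
          f"  graph {graph_us(calls)/1000:8.1f} ms")
print("graph mode wins from call", break_even_calls())
```

With these assumed constants graph mode loses badly on a single call and overtakes eager after a bit more than 150 calls; changing `TRACE_US` or `PER_OP_DISPATCH_US` moves the crossover.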

2.2

Keras — the high-level API

Keras started in 2015 as François Chollet's framework-agnostic frontend; since TensorFlow 2.0 it has been TF's official high-level API (imported as tf.keras), and as of Keras 3 (2023) it once again runs on a choice of backends — TensorFlow, JAX, or PyTorch — behind one identical API. Its design philosophy is "progressive disclosure of complexity": the easy thing is one line, and every layer of customization is available exactly when you need it, never before.

The smallest unit is the layer: an object that owns weights and maps an input tensor to an output tensor. The workhorse is Dense — a fully-connected layer computing an affine transform followed by a nonlinearity:

EQ F2.2 — A DENSE LAYER $$ \mathbf{y} \;=\; \phi\!\big( W\mathbf{x} + \mathbf{b} \big), \qquad W \in \mathbb{R}^{\,u \times d_{\text{in}}},\;\; \mathbf{b} \in \mathbb{R}^{\,u} $$

A Dense(u) layer reading $d_{\text{in}}$ features holds a weight matrix $W$ of shape $u \times d_{\text{in}}$ and a bias vector $\mathbf{b}$ of length $u$; $\phi$ is the activation (ReLU, softmax, …). Its trainable parameter count is therefore $d_{\text{in}}\!\cdot u + u = (d_{\text{in}}+1)\,u$ — the +1 is the bias. Note that $d_{\text{in}}$ is not something you pass: Keras infers it from whatever tensor first flows in, which is why a freshly constructed layer holds no weights, and reports 0 parameters, until it has seen an input shape. The single most common surprise for newcomers is exactly this lazy build.
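The lazy build is easy to mimic. The class below is a toy stand-in, not Keras code: `LazyDense` and `param_count` are hypothetical names, the activation is omitted, and the initializer is an arbitrary choice. It only illustrates that weights cannot exist until the input width is known.

```python
import random

class LazyDense:
    """Toy Dense layer: weights are created on the first call, not in __init__."""
    def __init__(self, units):
        self.units = units
        self.W = None              # becomes a units x d_in nested list
        self.b = None

    def param_count(self):
        if self.W is None:
            return 0               # nothing allocated yet
        return len(self.W) * len(self.W[0]) + len(self.b)

    def build(self, d_in):
        rng = random.Random(0)
        self.W = [[rng.gauss(0.0, 0.05) for _ in range(d_in)]
                  for _ in range(self.units)]
        self.b = [0.0] * self.units

    def __call__(self, x):
        if self.W is None:
            self.build(len(x))     # d_in is inferred from the first input
        return [sum(w * xi for w, xi in zip(row, x)) + bias
                for row, bias in zip(self.W, self.b)]

layer = LazyDense(64)
before = layer.param_count()
layer([0.0] * 32)                  # first call: 32 input features
after = layer.param_count()
print("params before first call:", before)
print("params after first call :", after, "= (32 + 1) * 64")
```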

A whole network is then just an ordered stack of such layers. Keras offers three ways to express one, in increasing power:

API	Model structure	Use when…
Sequential	a plain list of layers	The model is a single chain, input → output, no branching.
Functional	a DAG of layers	You need multiple inputs/outputs, skip connections, or shared layers — most real models.
Subclassing	imperative `call()`	Control flow depends on the data (dynamic loops, research architectures).

The parameter count of a stack is just the sum over its layers, and tracking it is not academic: parameters set your memory budget, your overfitting risk, and your serving cost. The instrument and the calculator below make EQ F2.2 tactile — add a layer, watch the count move.

A Dense(64) layer receives an input with $32$ features. How many weights does it hold, excluding the bias terms? (Use $d_{\text{in}}\cdot u$ from EQ F2.2.)

The weight matrix $W$ has shape $u \times d_{\text{in}} = 64 \times 32$, so it contains $64 \cdot 32 = 2048$ weights. Adding the $64$ bias terms would bring the layer's total trainable parameters to $2048 + 64 = 2112$; the question asked for weights only, so the answer is 2048.

PYTHON · RUNNABLE IN-BROWSER

# A tiny MLP forward pass "Keras-style", but in pure numpy (EQ F2.2).
# Sequential([Dense(8, relu), Dense(3, softmax)]) on a 4-feature input.
import numpy as np
rng = np.random.default_rng(0)

def dense(x, W, b, act):                 # y = act(x @ W.T + b)
    z = x @ W.T + b
    if act == "relu":    return np.maximum(0.0, z)
    if act == "softmax":
        e = np.exp(z - z.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)
    return z

x  = rng.normal(0, 1, (5, 4))            # batch of 5, 4 features each
W1 = rng.normal(0, 0.5, (8, 4)); b1 = np.zeros(8)   # Dense(8): 4 -> 8
W2 = rng.normal(0, 0.5, (3, 8)); b2 = np.zeros(3)   # Dense(3): 8 -> 3

h   = dense(x, W1, b1, "relu")           # hidden activations
out = dense(h, W2, b2, "softmax")        # class probabilities

np.set_printoptions(precision=3, suppress=True)
print("output probabilities (rows = examples, cols = 3 classes):")
print(out)
print("\nevery row sums to 1:", out.sum(1).round(6))
print("params:", W1.size + b1.size + W2.size + b2.size,
      "= (4+1)*8 + (8+1)*3 =", (4+1)*8 + (8+1)*3)


PYTHON · RUNNABLE IN-BROWSER

# Parameter-count calculator for a Dense stack (EQ F2.2): (d_in+1)*u per layer.
# This is what model.summary() prints, by hand.
def dense_params(input_dim, units):
    """Return (weights, biases, total) for one Dense layer."""
    weights = input_dim * units
    biases  = units
    return weights, biases, weights + biases

# A small MLP: 32 features -> 64 -> 64 -> 10 classes
layers = [("Dense(64)", 32, 64),
          ("Dense(64)", 64, 64),
          ("Dense(10)", 64, 10)]

total = 0
print(f"{'layer':<12}{'in':>5}{'units':>7}{'weights':>10}{'+bias':>8}{'params':>10}")
for name, d_in, u in layers:
    w, b, p = dense_params(d_in, u)
    total += p
    print(f"{name:<12}{d_in:>5}{u:>7}{w:>10,}{b:>8}{p:>10,}")
print("-" * 52)
print(f"{'TOTAL':<12}{'':>5}{'':>7}{'':>10}{'':>8}{total:>10,} trainable params")


INSTRUMENT F2.2 — KERAS SEQUENTIAL BUILDER · STACK DENSE LAYERS · LIVE PARAM COUNT · EQ F2.2

[Interactive widget. Controls: input features (default 784), hidden layers (2), units per hidden layer (128), output units / classes (10), bias on/off. Readouts: total params, largest layer, weights @ FP32.]

Each bar is one Dense layer; its width is its parameter share of the model. Defaults build the classic MNIST classifier Sequential([Dense(128, relu), Dense(128, relu), Dense(10, softmax)]) on flattened 28×28 = 784 pixels — about 118K params. Turn the bias off and every layer loses exactly its units count. Push input features or units up and watch the first layer dominate: the layer touching the raw input is almost always the heaviest, which is why convolutions and embeddings exist to tame it.
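The figures quoted for the default configuration can be checked by applying EQ F2.2 layer by layer. This is a small sketch; `stack_params` is a hypothetical helper name, and the FP32 size assumes 4 bytes per parameter with no other overhead.

```python
def stack_params(input_dim, units_per_layer, use_bias=True):
    """Per-layer parameter counts for a chain of Dense layers."""
    counts, d_in = [], input_dim
    for units in units_per_layer:
        counts.append(d_in * units + (units if use_bias else 0))
        d_in = units               # this layer's output feeds the next layer
    return counts

with_bias = stack_params(784, [128, 128, 10])
no_bias = stack_params(784, [128, 128, 10], use_bias=False)
total = sum(with_bias)

print("per layer      :", with_bias)
print("total          :", f"{total:,}")
print("lost w/o bias  :", total - sum(no_bias), "= 128 + 128 + 10")
print("first layer    :", f"{with_bias[0] / total:.0%} of all params")
print("weights @ FP32 :", f"{total * 4 / 1e6:.2f} MB")
```

The first layer alone accounts for roughly 85% of the parameters here, which is the domination effect the caption describes.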

2.3

tf.data input pipelines

A model is only as fast as the data reaching it. If the GPU finishes a step in 8 ms but the CPU needs 20 ms to read, decode, and augment the next batch, the accelerator sits idle at least 60% of the time (12 of every 20 ms, even when loading fully overlaps compute) — you bought a sports car and left it in the garage. tf.data is TensorFlow's answer: a declarative pipeline that overlaps data preparation with model execution so the accelerator never waits.

A pipeline is a chain of transformations on a tf.data.Dataset: .map() applies a preprocessing function, .shuffle(buffer) randomizes order through a fixed-size buffer, .batch(n) groups examples, and the two that matter most for throughput:

.prefetch(k) — lets the input pipeline produce batch $t{+}1$ on the CPU while the model trains on batch $t$ on the GPU. With tf.data.AUTOTUNE the runtime tunes the buffer size for you. This single call is usually the largest free speedup available.
.cache() — keeps the dataset in memory (or on disk) after the first epoch, so repeated epochs skip re-reading and re-decoding. Place it after expensive deterministic work and before random augmentation, or you will cache one fixed set of augmentations forever.
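The placement rule for the cache can be demonstrated without TensorFlow. The sketch below uses plain Python lists as a stand-in for a dataset; `decode`, `augment`, and `run_epochs` are hypothetical names. The two branches correspond roughly to chaining map(decode), cache(), map(augment) versus map(decode), map(augment), cache().

```python
import random

decode_calls = 0

def decode(i):
    """Stand-in for expensive deterministic work (read + decode)."""
    global decode_calls
    decode_calls += 1
    return float(i)

def augment(x, rng):
    """Stand-in for random augmentation: a small random shift."""
    return x + rng.uniform(-0.5, 0.5)

def run_epochs(cache_after_augment, n_epochs=3, n_items=4, seed=0):
    rng = random.Random(seed)
    cache = None
    epochs = []
    for _ in range(n_epochs):
        if cache_after_augment:
            # Wrong order: the augmented values themselves are cached.
            if cache is None:
                cache = [augment(decode(i), rng) for i in range(n_items)]
            batch = list(cache)
        else:
            # Right order: cache decoded items, augment afresh every epoch.
            if cache is None:
                cache = [decode(i) for i in range(n_items)]
            batch = [augment(x, rng) for x in cache]
        epochs.append(batch)
    return epochs

fresh = run_epochs(cache_after_augment=False)
frozen = run_epochs(cache_after_augment=True)
print("cache before augment, epochs differ   :", fresh[0] != fresh[1])
print("cache after augment, epochs identical :", frozen[0] == frozen[1] == frozen[2])
print("decode calls across both runs         :", decode_calls)
```

Both orderings decode each item only once, so both save the expensive work; only the first keeps the augmentation random from epoch to epoch.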

The reason prefetch works is a pipeline identity. Without overlap, each step costs the sum of input time and compute time; with overlap, the two run concurrently, so a step costs only the larger of the two:

EQ F2.3 — PIPELINE STEP TIME $$ t_{\text{serial}} = t_{\text{input}} + t_{\text{compute}}, \qquad t_{\text{overlapped}} = \max\!\big(t_{\text{input}},\, t_{\text{compute}}\big) $$

Prefetch turns a sum into a max. If input and compute are balanced (each $t$), serial costs $2t$ per step and overlapped costs $t$ — a clean 2× speedup. The win shrinks as one side dominates: if compute is 10× the input cost, you were already nearly compute-bound and prefetch buys little. The corollary is the rule every practitioner learns the hard way: profile first. Overlap can hide the input cost only up to the point where input becomes the bottleneck — past that, you must make the input itself cheaper (cache, parallel map, a better file format like TFRecord).
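The shrinking win can be made explicit with one extra symbol. Writing $r = t_{\text{input}} / t_{\text{compute}}$ (a ratio introduced here for convenience) and assuming the idealized perfect overlap of EQ F2.3, the speedup from prefetching works out as:

$$ S(r) \;=\; \frac{t_{\text{serial}}}{t_{\text{overlapped}}} \;=\; \frac{t_{\text{input}} + t_{\text{compute}}}{\max\!\big(t_{\text{input}},\, t_{\text{compute}}\big)} \;=\; \frac{1 + r}{\max(1,\, r)} \;=\; \begin{cases} 1 + r, & r \le 1 \\[2pt] 1 + 1/r, & r \ge 1 \end{cases} $$

The maximum is $S(1) = 2$, the balanced case. With compute at 10× the input cost, $r = 0.1$ and $S = 1.1$, a 10% gain. The mirror case $r = 10$ also gives $S = 1.1$, but there the step is input-bound and the remaining time can only be recovered by making the input itself cheaper.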

A training step needs $t_{\text{input}} = 12$ ms to prepare a batch on the CPU and $t_{\text{compute}} = 20$ ms to run it on the GPU. With .prefetch() overlapping the two (EQ F2.3), how many milliseconds does one overlapped step take?

Overlapped step time is $\max(t_{\text{input}}, t_{\text{compute}}) = \max(12, 20) = 20$ ms — the pipeline is fully hidden behind compute. Without prefetch it would be $12 + 20 = 32$ ms, so overlap removes the entire 12 ms of input cost here; if input had instead been 25 ms, the step would be $\max(25,20)=25$ ms and the pipeline, not the GPU, would be your bottleneck.

PYTHON · RUNNABLE IN-BROWSER

# EQ F2.3: why .prefetch() turns a sum into a max. A throughput simulator.
import numpy as np

t_input, t_compute = 12.0, 20.0     # ms per batch (CPU prep, GPU step)
n_steps = 500

serial      = n_steps * (t_input + t_compute)        # no overlap
overlapped  = t_input + n_steps * max(t_input, t_compute)  # +1 warm-up batch

print(f"per-step serial     : {t_input + t_compute:.1f} ms")
print(f"per-step overlapped : {max(t_input, t_compute):.1f} ms  (max, not sum)")
print(f"{n_steps} steps serial    : {serial/1000:.2f} s")
print(f"{n_steps} steps prefetch  : {overlapped/1000:.2f} s")
print(f"speedup             : {serial/overlapped:.2f}x")

# utilization = fraction of wall time the GPU is actually computing
util_serial     = t_compute / (t_input + t_compute)
util_overlapped = t_compute / max(t_input, t_compute)
print(f"\nGPU utilization  serial : {util_serial*100:4.1f} %")
print(f"GPU utilization  prefetch: {min(util_overlapped,1)*100:4.1f} %")


The same overlap principle governs PyTorch's DataLoader (num_workers for parallel loading, prefetch_factor for how many batches each worker queues ahead, pin_memory for faster host-to-device copies). The vocabulary differs; the bottleneck — and the fix — does not. Every framework's input API exists to win back EQ F2.3's wasted t_input.

2.4

Training — model.fit vs custom loops

Once a model is built and a dataset is ready, training is a loop. Keras gives you that loop for free. After model.compile(optimizer, loss, metrics), a single call to model.fit(dataset, epochs=...) runs the entire schedule: it iterates batches, does the forward pass, computes the loss, runs backpropagation, applies the optimizer update, accumulates metrics, and — if you pass validation_data — evaluates each epoch. Everything inside the loop is the framework's responsibility; you supplied only the model, the loss, and the data.

What fit automates, one batch at a time, is exactly the gradient-descent step:

EQ F2.4 — ONE fit() STEP (SGD) $$ \theta \;\leftarrow\; \theta \;-\; \eta\,\nabla_{\!\theta}\,\frac{1}{|B|}\sum_{(\mathbf{x},y)\,\in\,B}\mathcal{L}\big(f_\theta(\mathbf{x}),\,y\big) $$

For each minibatch $B$: run the forward pass $f_\theta$, average the per-example loss, take the gradient with respect to the parameters $\theta$ (the backward pass), and step downhill with learning rate $\eta$. fit wraps this in epoch and batch loops, handles the optimizer's internal state (momentum, Adam moments), runs metric accumulation, and fires callbacks at the right moments. The math is identical to a hand-written loop — fit is convenience, not magic, and that is precisely why it is the right default.
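To make "identical to a hand-written loop" concrete, here is EQ F2.4 spelled out for the smallest possible model. This is a sketch under stated assumptions: a one-feature linear model, mean squared error as the loss, synthetic noise-free data, and hyperparameters picked only so the example converges quickly. None of it is Keras code.

```python
import random

rng = random.Random(0)

# Synthetic data drawn from y = 3x + 1 (assumed, noise-free).
data = [(x, 3.0 * x + 1.0)
        for x in (rng.uniform(-1.0, 1.0) for _ in range(64))]

w, b = 0.0, 0.0                    # theta: the trainable parameters
lr, batch_size, epochs = 0.1, 8, 50

for epoch in range(epochs):
    rng.shuffle(data)              # new batch order every epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad_w = grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y              # forward pass and residual
            # Gradient of the squared error, averaged over the minibatch.
            grad_w += 2.0 * err * x / len(batch)
            grad_b += 2.0 * err / len(batch)
        w -= lr * grad_w           # theta <- theta - eta * gradient
        b -= lr * grad_b

print(f"learned w = {w:.3f}, b = {b:.3f}  (data generated with w = 3, b = 1)")
```

The two nested loops, the gradient averaging, and the parameter update are the parts fit takes over; optimizer state, metrics, and callbacks are what it adds on top.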

The extensibility hooks matter as much as the loop itself:

Callbacks — objects that observe and steer training without you touching the loop: ModelCheckpoint (save the best weights), EarlyStopping (halt when validation stalls), ReduceLROnPlateau, TensorBoard logging. They are the reason fit scales from a toy to a serious run.
Override train_step — keep fit's outer machinery (callbacks, distribution, metrics) but replace just the inner forward/backward logic. This is the modern sweet spot for custom training logic such as GANs or contrastive learning.
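The callback idea, an object that observes each epoch and can ask the loop to stop, fits in a few lines. The class below is a simplified imitation of early-stopping logic, not the Keras class: `ToyEarlyStopping` is a hypothetical name, its `on_epoch_end` takes the validation loss directly rather than a logs dictionary, and the loss values are invented for the demonstration.

```python
class ToyEarlyStopping:
    """Ask the loop to stop after `patience` epochs without improvement."""
    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0
        self.stop_training = False

    def on_epoch_end(self, epoch, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0     # improvement: reset counter
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stop_training = True

# Invented validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]

callback = ToyEarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    # ... one epoch of training would run here ...
    callback.on_epoch_end(epoch, val_loss)
    if callback.stop_training:
        stopped_at = epoch
        break

print("stopped at epoch:", stopped_at, " best val_loss:", callback.best)
```

The loop never inspects the loss itself; it only checks a flag the callback sets. The run also stops before reaching the final, better value of 0.50, which is the usual trade-off when choosing a patience.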

You drop to a fully custom loop — opening a tf.GradientTape, computing gradients, and calling optimizer.apply_gradients yourself — only when control flow genuinely demands it: multiple interacting optimizers, exotic gradient surgery, reinforcement-learning rollouts, or research that needs to inspect intermediate quantities every step. The trade is total control for the loss of every convenience fit bundled in. A good engineer reaches for the custom loop last, not first — most "I need a custom loop" instincts are satisfied by a custom train_step or a callback.

True or false: calling model.fit() runs the training loop on your behalf — iterating over batches and epochs and performing the forward pass, backpropagation, and optimizer update each step — so you do not have to write that loop yourself. (Answer true or false.)

fit is precisely the built-in training loop: given a compiled model and a dataset, it walks every batch of every epoch and on each one runs the forward pass, computes and backpropagates the loss, and applies the optimizer update (EQ F2.4), while also accumulating metrics and firing callbacks. Writing that loop by hand is exactly what you avoid by calling fit. The statement is true.

INSTRUMENT F2.3 — KERAS fit vs PyTorch LOOP · SAME MLP, SIDE BY SIDE

[Interactive widget. Control: show stage. Readouts: Keras lines (train), PyTorch lines (train), who writes the loop.]

The same two-layer classifier, defined and trained in each framework. Switch STAGE to TRAIN and the asymmetry is the lesson: Keras hides the loop behind fit; PyTorch makes you write zero_grad → forward → loss → backward → step explicitly every iteration. Neither is "better" — Keras optimizes for the common case, PyTorch for visibility. Knowing what the Keras side elides is the difference the chapter is about.

2.5

TensorFlow vs PyTorch — choosing

This is the question every team eventually asks, and the honest 2026 answer starts with a fact: in research and in most new projects, PyTorch is the default. By citation share at the major ML conferences and by the dominant choice of new open-source models on Hugging Face, PyTorch has been the leading framework since roughly 2019–2020 and remains so. That is the realistic baseline; the rest is where the picture is genuinely more nuanced than the headline.

Dimension	PyTorch	TensorFlow / Keras
Default style	Eager, Pythonic; `torch.compile` for graphs	Eager by default (TF2); `tf.function` for graphs
Research mindshare	Dominant	Declining for new work
Production / mobile / web	Strong & improving (ExecuTorch, TorchServe)	Mature: TF Serving, TF Lite/LiteRT, TF.js
High-level API	Lightning, fastai (third-party)	Keras, built in
TPU support	via PyTorch/XLA	First-class, native
Backend-agnostic	—	Keras 3: TF / JAX / PyTorch

So where does TensorFlow still earn its place in 2026? Three honest answers. (1) Deployment breadth: the mobile/edge story (LiteRT, formerly TF Lite) and the browser story (TensorFlow.js) remain more battle-tested than the alternatives, and TF Serving is a known quantity in production. (2) TPUs: if you train on Google's TPU hardware, TensorFlow — and increasingly JAX — is the path of least resistance. (3) Keras 3 as a hedge: because Keras now runs on TensorFlow, JAX, or PyTorch behind one API, you can write Keras and stay portable across backends — a genuinely new reason to consider it that did not exist a few years ago.

A note of nuance experts would insist on: the eager-versus-graph and Pythonic-versus-not distinctions that once cleanly separated the two frameworks have largely converged. Both default to eager; both compile to graphs; both target the same accelerators. The remaining differences are about ecosystem and deployment surface, not core capability. The pragmatic advice: pick the framework your team and your target hardware already know, prefer PyTorch if you are starting fresh in research, prefer the TensorFlow/Keras stack if your hard constraint is TPUs or a mature mobile/web/serving pipeline — and remember that the deployment layer, covered next, often matters more than the training framework you started in.

Contested point, stated plainly. Framework "market share" numbers vary wildly by source and methodology (conference papers vs. job postings vs. GitHub stars vs. enterprise surveys), and partisans cite whichever favors their side. The defensible claim is narrow and the one made above: PyTorch leads new research and open-model releases; TensorFlow retains an edge in certain deployment targets and TPUs. Anything stronger than that is marketing.

You can now build and train a model in either framework — but a trained model is not a product. Chapter 03 turns to the ecosystem and deployment: exporting to SavedModel / ONNX / TorchScript, serving with TF Serving and friends, shrinking for the edge with LiteRT, and running inference in the browser — where, as this chapter hinted, the framework you trained in often stops mattering and the framework you serve in takes over.

2.R

References

Abadi, M. et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. OSDI 2016 — the computation-graph runtime and distributed execution model behind §2.1.
Chollet, F. (2015). Keras. Official documentation — the high-level layers/Sequential/Functional API of §2.2 and the current Keras 3 multi-backend design.
Chollet, F. (2021). Deep Learning with Python (2nd ed.). Manning — the canonical Keras text by its author; covers layers, fit, callbacks, and custom training loops.
Abadi, M. et al. (2016). TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv — the original TensorFlow white paper describing the dataflow-graph design.
TensorFlow Team. tf.data: Build TensorFlow input pipelines. Official guide — map/shuffle/batch/prefetch/cache and the overlap of §2.3 (EQ F2.3).
Paszke, A. et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS 2019 — the eager/imperative design contrasted against TensorFlow in §2.5.