The Ecosystem & Deployment

3.1

From notebook to artifact

Inside a training loop, a model is not really a thing — it is a live Python object: a graph of nn.Module calls, autograd tape, optimizer state, and a pile of imports that only exist because someone ran pip install on a research box. That object is perfect for experimentation and useless for production. The first job of deployment is to turn it into an artifact: a self-contained, versioned, serializable representation of just the forward computation and its learned parameters, with the training scaffolding stripped away.

Three properties separate an artifact from a notebook. It is portable: it runs without the original training code, often without Python at all (a C++ service, a phone, a browser). It is frozen: the graph and weights are fixed, so on a given runtime the same input reproduces the same output, which is what makes it auditable. And it is optimizable: once the graph is static, a compiler can fuse operators, fold constants, and pick kernels for the target hardware. Everything in this chapter is a different answer to the question "what is the artifact, and who runs it?"

Stage	Representation	Lives where	Optimizable?
Research	live Python `nn.Module` + autograd	training cluster / notebook	no — dynamic, eager
Checkpoint	weight tensors (`state_dict`, safetensors)	object store	weights only, no graph
Artifact	graph + weights (ONNX, TorchScript, SavedModel)	artifact registry	yes — graph is static
Engine	compiled kernels (TensorRT, TFLite, CoreML)	the target device	already optimized, hardware-locked

The unit that matters at the artifact stage is the computational graph: nodes are operators (matmul, add, softmax, layer-norm), edges are tensors, and the whole thing is a directed acyclic graph from inputs to outputs. Two techniques recover that graph from eager Python. Tracing runs one example through the model and records the operations it actually executed — fast and universal, but it bakes in whatever control flow that one input took (an if on tensor shape becomes a constant). Scripting parses the Python source itself and compiles control flow into the graph — it preserves loops and branches, but only over a typed subset of the language. Every export path below is built on one of these two.
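The tracing pitfall is easy to reproduce without any framework. The sketch below is a toy tracer, not PyTorch's: `Traced`, `trace`, and `replay` are hypothetical names invented for illustration. It records the operations one example input triggers, then replays that tape on an input that should have taken the other branch.

```python
import numpy as np

class Traced:
    """Wraps an array and appends every op applied to it to a shared tape."""
    def __init__(self, value, tape):
        self.value, self.tape = value, tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Traced(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Traced(self.value + k, self.tape)

def model(x):
    # Python-level branch on tensor shape: the tracer never sees the "if".
    if x.value.shape[0] > 2:
        return x * 2.0
    return x + 1.0

def trace(fn, example):
    """Run fn once on the example and return the ops it executed."""
    tape = []
    fn(Traced(example, tape))
    return tape

def replay(tape, x):
    """Execute a recorded tape on a plain array."""
    for op, k in tape:
        x = x * k if op == "mul" else x + k
    return x

tape = trace(model, np.ones(4))             # 4 elements -> the "mul" branch is recorded
short = np.array([3.0, 3.0])                # 2 elements -> eager takes the "add" branch
eager_out = model(Traced(short, [])).value
traced_out = replay(tape, short)

print("recorded tape :", tape)
print("eager output  :", eager_out)
print("traced output :", traced_out)
```

The tape holds only the multiply, so the replayed graph silently disagrees with eager execution on the short input. A scripting compiler would keep both branches in the graph, at the price of accepting only a typed subset of Python.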

EQ F3.1 — MODEL SIZE FROM PARAMETERS $$ \text{bytes} \;=\; N_{\text{params}} \times \frac{b_{\text{bits}}}{8}, \qquad \text{FP32} \to 4,\;\; \text{FP16/BF16} \to 2,\;\; \text{INT8} \to 1,\;\; \text{INT4} \to 0.5 \;\; \tfrac{\text{bytes}}{\text{param}} $$

The artifact's on-disk and in-memory footprint is, to first order, just the parameter count times the per-element width. A 7-billion-parameter model is 28 GB in FP32, 14 GB in FP16, 7 GB in INT8, 3.5 GB in INT4. This single line drives the whole deployment funnel: dtype choice decides whether the artifact fits on a server GPU, a laptop, or a phone — which in turn decides which format from §3.2–3.3 you reach for.
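EQ F3.1 fits in a few lines of Python. This is a first-order sketch only: it counts weight bytes and ignores activations, KV cache, and the scale/zero-point metadata that real quantized files carry. The names `BYTES_PER_PARAM` and `model_size_gb` are illustrative.

```python
# Per-element width in bytes, taken directly from EQ F3.1.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def model_size_gb(n_params, dtype):
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

sizes = {dtype: model_size_gb(7e9, dtype) for dtype in BYTES_PER_PARAM}
for dtype, gb in sizes.items():
    print(f"{dtype:5} {gb:5.1f} GB")
```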

Using EQ F3.1, how many GB does a 7-billion-parameter model occupy when quantized to INT4 (0.5 bytes/param)?

$7\times10^9 \text{ params} \times 0.5 \text{ bytes} = 3.5\times10^9$ bytes $=$ 3.5 GB. That is small enough to load on a high-end phone or a modest laptop GPU — the reason edge formats lean so hard on aggressive quantization.

An artifact is exported in FP16 (2 bytes/param) and then quantized to INT8 (1 byte/param). By what factor does its file size shrink? (Give the ratio.)

Size scales linearly with the per-element width (EQ F3.1), so the ratio is $2/1 =$ 2×. Halving the bit-width halves the bytes and, in the memory-bound decode regime modeled in §3.4, roughly halves per-token latency too.

INSTRUMENT F3.1 — DEPLOYMENT-PATH DECISION TREE: PICK A TARGET · GET THE EXPORT PATH

[Interactive instrument. Inputs: source framework, deployment target. Outputs: recommended artifact, runtime / engine, export verb.]

Pick a framework and a target; the tree draws the canonical route from live model to running engine. Note that cross-framework always funnels through ONNX — it is the only target where the source framework stops mattering. Mobile and browser force a quantized edge format; the server path keeps full precision and leans on a runtime compiler (TensorRT) instead.

3.2

ONNX — the interchange format

Frameworks are silos. A model trained in PyTorch is a graph of PyTorch ops; a TensorFlow model is a graph of TensorFlow ops; the runtimes that execute them fastest (NVIDIA's, Intel's, Qualcomm's, Apple's) are written by hardware vendors who do not want to maintain a separate backend for every framework. ONNX — the Open Neural Network Exchange — is the lingua franca that breaks the deadlock: a single, framework-agnostic graph format plus a versioned, standardized operator set. Export from any framework into ONNX once, and every ONNX-aware runtime can run your model.

An ONNX file is a serialized protobuf describing exactly the graph from §3.1: a list of nodes (each naming an operator from the standard set: MatMul, Gemm, Conv, Softmax, LayerNormalization, …), the tensors flowing between them, the initializer tensors that hold the trained weights, and typed input/output shapes. The contract that makes it portable is the opset version: opset 17 means "these operators behave exactly as the opset-17 spec says," so a model exported against opset 17 computes the same function on any runtime that implements opset 17, up to floating-point rounding, regardless of who wrote that runtime.
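To make the node / initializer / opset vocabulary concrete, here is a toy stand-in written with plain dicts and NumPy. It is not the real ONNX protobuf schema and not ONNX Runtime: the dict layout, `KERNELS`, and `run` are assumptions made for illustration. Only the operator names (MatMul, Add, Relu) are borrowed from the standard set.

```python
import numpy as np

# Toy graph: nodes name an operator, edges are named tensors,
# initializers hold the trained weights.
graph = {
    "opset": 17,
    "inputs": ["x"],
    "outputs": ["y"],
    "initializers": {
        "W": np.array([[1.0, -1.0], [2.0, 0.5]]),
        "b": np.array([0.0, -1.0]),
    },
    "nodes": [
        {"op": "MatMul", "in": ["x", "W"], "out": "xw"},
        {"op": "Add",    "in": ["xw", "b"], "out": "z"},
        {"op": "Relu",   "in": ["z"],       "out": "y"},
    ],
}

# A "runtime" is a table of kernels, one per operator it implements.
KERNELS = {
    "MatMul": lambda a, b: a @ b,
    "Add":    lambda a, b: a + b,
    "Relu":   lambda a: np.maximum(a, 0.0),
}

def run(graph, feeds):
    """Evaluate nodes in listed (already topological) order."""
    missing = {node["op"] for node in graph["nodes"]} - KERNELS.keys()
    if missing:
        raise NotImplementedError(f"unsupported operator(s): {sorted(missing)}")
    env = {**graph["initializers"], **feeds}
    for node in graph["nodes"]:
        args = [env[name] for name in node["in"]]
        env[node["out"]] = KERNELS[node["op"]](*args)
    return [env[name] for name in graph["outputs"]]

(y,) = run(graph, {"x": np.array([[1.0, 1.0]])})
print(y)
```

Any second "runtime" that supplies its own `KERNELS` table for the same operator names could execute the same `graph` dict unchanged, which is the portability argument in miniature.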

EQ F3.2 — THE INTERCHANGE COUNT $$ \underbrace{F \times R}_{\text{point-to-point}} \;\longrightarrow\; \underbrace{F + R}_{\text{hub-and-spoke via ONNX}}, \qquad F \text{ frameworks},\; R \text{ runtimes} $$

Without a hub, connecting $F$ frameworks to $R$ runtimes is an $F\times R$ matrix of bespoke converters that someone must write and maintain. With ONNX as the hub, each framework writes one exporter and each runtime writes one importer: $F+R$ connectors total. For 5 frameworks and 6 runtimes that is 30 integrations collapsing to 11 — the network-effect argument for why a standard interchange format exists at all.
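A quick tabulation of EQ F3.2 shows how the gap widens as the ecosystem grows. The function names are illustrative, and the extra (F, R) pairs beyond the 5-and-6 case are made-up sizes.

```python
def point_to_point(f, r):
    """One bespoke converter per (framework, runtime) pair."""
    return f * r

def hub_and_spoke(f, r):
    """One exporter per framework plus one importer per runtime."""
    return f + r

for f, r in [(2, 2), (5, 6), (10, 12)]:
    print(f"F={f:2} R={r:2}  direct={point_to_point(f, r):3}  via hub={hub_and_spoke(f, r):3}")
```

At F = R = 2 the two designs cost the same; the hub only pays off once the ecosystem is larger than that, and the saving then grows with every new entrant.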

You have $F = 5$ training frameworks and $R = 6$ inference runtimes. Using EQ F3.2, how many connectors are needed in the hub-and-spoke design where ONNX is the hub ($F + R$)?

Each framework writes one exporter and each runtime one importer: $F + R = 5 + 6 =$ 11 connectors — versus $F \times R = 30$ for point-to-point. The savings grow with every new framework or runtime added to the ecosystem.

The runtime side of the standard is ONNX Runtime (ORT), a high-performance C++ engine with a graph optimizer and a system of pluggable execution providers: the same ONNX graph dispatches to CUDA, TensorRT, OpenVINO (Intel), CoreML (Apple), DirectML (Windows), or a portable CPU kernel set, chosen at load time. There is even ONNX Runtime Web, which runs the graph in a browser via WebAssembly and WebGPU — no server round-trip. This is exactly why §3.1's tree routes every cross-framework and browser target through ONNX.
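Choosing an execution provider at load time amounts to intersecting a preference list with what the machine offers. The sketch below does that in plain Python so it runs anywhere; `PREFERRED`, `pick_providers`, and the example machine are assumptions for illustration, while the provider strings are the names ONNX Runtime uses.

```python
# Preference order, most specialized first.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers this machine offers, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError("no usable execution provider found")
    return chosen

# Hypothetical machine: an NVIDIA GPU box without TensorRT installed.
available = ["CUDAExecutionProvider", "CPUExecutionProvider"]
providers = pick_providers(available)
print(providers)

# With onnxruntime installed, the same list would be passed at load time:
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx",  # hypothetical file name
#       providers=pick_providers(ort.get_available_providers()),
#   )
```

Keeping the CPU provider last in the list gives a portable fallback for any operator the accelerated provider does not cover.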

Honest caveats. ONNX is not free. Models with dynamic control flow or framework-specific custom ops do not always export cleanly — you hit "unsupported operator" or a traced graph that hard-codes a shape. The opset moves, so an old runtime may not implement the operator a fresh export emits, and numerical results can differ at the last few bits because two runtimes fuse operators differently. For large language models the picture is mixed: ONNX export works but the serving ecosystem has largely standardized on framework-native paths (vLLM, TensorRT-LLM) for the heaviest workloads. ONNX shines brightest for portability — vision models, classical nets, anything that must run on heterogeneous edge hardware.
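The "last few bits" caveat comes from floating-point addition not being associative: two runtimes that fuse or reorder the same operators can sum in a different order. A minimal demonstration in plain Python floats (the variable names are illustrative):

```python
import math

# The same three numbers, summed in two groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right, right_to_left)
print("bitwise equal     :", left_to_right == right_to_left)
print("equal within 1e-9 :", math.isclose(left_to_right, right_to_left, rel_tol=1e-9))
```

This is why cross-runtime export checks usually compare outputs within a tolerance rather than bit for bit.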

PYTHON · RUNNABLE IN-BROWSER

# Export-then-load round trip: serialize weights to JSON, reload, verify match.
# Stand-in for ONNX/TorchScript export: the artifact must reproduce outputs.
import numpy as np, json
rng = np.random.default_rng(0)

# A tiny 2-layer MLP "model": just weights + a fixed forward graph.
W1 = rng.normal(0, 0.3, (4, 6)); b1 = rng.normal(0, 0.1, 6)
W2 = rng.normal(0, 0.3, (6, 2)); b2 = rng.normal(0, 0.1, 2)
def forward(x, p):
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])     # ReLU layer
    return h @ p["W2"] + p["b2"]                    # linear head

x = rng.normal(0, 1, (3, 4))                        # a batch of 3 inputs
live = forward(x, {"W1": W1, "b1": b1, "W2": W2, "b2": b2})

# "Serialize the artifact": graph is fixed, only weights travel (as JSON).
blob = json.dumps({k: v.tolist() for k, v in
                   {"W1": W1, "b1": b1, "W2": W2, "b2": b2}.items()})
print(f"serialized artifact size : {len(blob):,} bytes of JSON")

# "Load on the serving side" and re-run the SAME graph.
loaded = {k: np.array(v) for k, v in json.loads(blob).items()}
reloaded = forward(x, loaded)

print("max abs output diff      :", float(np.abs(live - reloaded).max()))
print("artifact reproduces model:", np.allclose(live, reloaded))
print("(this exact-match check is what 'export validation' means in prod)")

True or false: ONNX is a cross-framework model interchange format — a single graph representation that lets a model trained in one framework run on runtimes written for another. (Answer true or false.)

ONNX defines a framework-agnostic graph plus a versioned standard operator set; a model exported from PyTorch or TensorFlow into ONNX runs on any ONNX-compatible runtime (ONNX Runtime, TensorRT, OpenVINO, CoreML, …). That is precisely what "cross-framework interchange" means, so the statement is true.

3.3

TorchScript, SavedModel & TFLite

ONNX is the neutral hub; each framework also has a native artifact that stays inside its own ecosystem and often preserves more than ONNX can. These native formats are what you reach for when source and serving live in the same world.

PyTorch: TorchScript and `torch.export`

TorchScript was PyTorch's original answer to "run my model without Python." It captures the model via tracing or scripting (the two techniques from §3.1) into an intermediate representation that the libtorch C++ runtime executes, so a PyTorch model can be loaded and run inside a C++ service, an iOS app (via PyTorch Mobile), or anywhere libtorch builds. As of 2026 TorchScript is in maintenance mode: the strategic direction is torch.export, which produces a clean, ahead-of-time-captured graph (an ExportedProgram wrapping an FX graph) using the same capture machinery as torch.compile, and which feeds the Inductor backend (via AOTInductor) and the ExecuTorch edge runtime. The mental model is unchanged (capture the graph once, run it without the training stack), but the machinery is newer and the captured graph is more faithful to dynamic shapes.

TensorFlow: SavedModel and TFLite

SavedModel is TensorFlow's complete, language-neutral serialization: the graph (as one or more typed signatures), the trained variables, and any assets, in a directory that TensorFlow Serving and the C/Java/Go APIs load directly and that TensorFlow.js consumes after conversion. It is the format a Keras model.export() writes, and the thing TF Serving from §3.4 actually mounts.

TFLite (now LiteRT) is the edge sibling: a converter takes a SavedModel and emits a single .tflite FlatBuffer — a compact, mmap-able artifact tuned for phones, microcontrollers, and embedded accelerators. The conversion typically applies post-training quantization (float32 → int8, or float16) and can target hardware delegates (the NNAPI / GPU / Hexagon / CoreML delegates). TFLite is the canonical answer to the mobile / edge branch of §3.1's tree, and the reason that branch always forces a quantized format.
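The float32 → int8 step can be sketched in NumPy. This is a simplified symmetric, per-tensor scheme with random stand-in weights, chosen for brevity; it is not the TFLite converter's actual algorithm, which also handles zero points, per-channel scales, and activation calibration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.3, size=(64, 64)).astype(np.float32)   # stand-in weights

# Symmetric per-tensor quantization: one scale maps [-max|w|, +max|w|] onto [-127, 127].
scale = float(np.abs(w).max()) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

w_hat = q.astype(np.float32) * scale          # dequantize to measure the damage
max_err = float(np.abs(w - w_hat).max())

print(f"float32 bytes : {w.nbytes}")
print(f"int8 bytes    : {q.nbytes}")
print(f"scale         : {scale:.6f}")
print(f"max abs error : {max_err:.6f} (rounding bound is about scale/2 = {scale / 2:.6f})")
```

The storage drops by the 4× that EQ F3.1 predicts, and the worst-case weight error is bounded by half a quantization step, which is why a single outlier weight (which inflates the scale) hurts accuracy so much under per-tensor schemes.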

Format	Origin	Runs on	Best for
ONNX	neutral hub	ONNX Runtime, TensorRT, OpenVINO, CoreML, browser (ORT-Web)	cross-framework portability; heterogeneous edge
TorchScript / `torch.export`	PyTorch	libtorch (C++), PyTorch Mobile, ExecuTorch	staying in the PyTorch world; C++ services, iOS
SavedModel	TensorFlow	TF Serving, TensorFlow.js, TF C/Java APIs	TF-native server deployment
TFLite / LiteRT	TensorFlow (edge)	phones, MCUs, embedded NPUs (NNAPI / GPU / CoreML delegates)	on-device, latency- and battery-bound mobile/edge
TensorRT engine	NVIDIA (from ONNX/TF/Torch)	NVIDIA GPUs only	last-mile, hardware-locked server speed

True or false: TFLite (LiteRT) targets mobile / edge deployment — phones, microcontrollers, and embedded accelerators — rather than server-side GPU serving. (Answer true or false.)

TFLite converts a SavedModel into a compact FlatBuffer designed to run on-device, typically with int8/float16 quantization and hardware delegates (NNAPI, GPU, CoreML), under tight latency and battery budgets. That is the mobile/edge niche by design, so the statement is true.

INSTRUMENT F3.2 — MODEL-FORMAT COMPARISON MATRIX: FILTER BY CAPABILITY · HIGHLIGHT THE FIT

[Interactive instrument. Input: required capability. Outputs: formats matching, best pick.]

Each row is a format; green cells are capabilities it has. Click a requirement and the matrix dims every format that lacks it, leaving the candidates lit. Notice that only ONNX survives the cross-framework filter, and only the edge formats (TFLite, ExecuTorch, ONNX) survive the mobile filter — the same logic the decision tree in F3.1 encodes, here laid flat.

3.4

Serving — TorchServe, Triton & friends

An artifact is a passive file. A serving system is the process that loads it, listens on a network port, and turns a stream of requests into a stream of predictions — while squeezing every drop of throughput out of expensive accelerators. Serving is where most of the engineering lives, because the naive loop ("read request, run model, return") wastes the hardware almost completely. The single most important trick is dynamic batching: hold incoming requests for a few milliseconds and run them through the model together, because a GPU runs a batch of 32 almost as fast as a batch of 1.
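The batching window can be sketched as a pure function over arrival timestamps. This is a simplified model under stated assumptions, not any server's actual scheduler: `form_batches`, the 5 ms window, the batch cap of 4, and the arrival times are all made up for illustration.

```python
def form_batches(arrivals_ms, window_ms, max_batch):
    """Group sorted arrival times into batches.

    A batch opens with its first request and closes when the next request
    arrives after the window has expired or the batch is already full.
    """
    batches, current = [], []
    for t in arrivals_ms:
        window_expired = bool(current) and (t - current[0] > window_ms)
        batch_full = len(current) == max_batch
        if window_expired or batch_full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

arrivals = [0.0, 1.0, 2.5, 4.0, 9.0, 9.5, 20.0]   # request arrival times in ms
batches = form_batches(arrivals, window_ms=5.0, max_batch=4)

for batch in batches:
    print(f"batch of {len(batch)}: {batch}")
```

Seven requests become three model calls instead of seven. The request at 20.0 ms runs alone because nothing else arrives inside its window, which is the low-traffic case where batching buys nothing and the window only adds delay.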

EQ F3.3 — THROUGHPUT VS LATENCY UNDER BATCHING $$ \text{throughput} \;=\; \frac{B}{t_{\text{batch}}(B)}, \qquad t_{\text{batch}}(B) \approx t_0 + \beta B \;\;(\beta \ll t_0\ \text{until the GPU saturates}) $$

$B$ is the batch size; $t_{\text{batch}}(B)$ is the time to process the whole batch. Because there is a large fixed per-call cost $t_0$ (kernel launches, weight reads) and only a small marginal cost $\beta$ per extra item, throughput rises sharply with $B$ until the GPU saturates. The catch: a request that waits to fill a batch sees its tail latency grow. Serving is the art of choosing the batch window that maximizes throughput inside a latency budget — exactly the SLA trade-off below.
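Plugging numbers into EQ F3.3 makes the trade-off visible. The values `T0_MS = 10` and `BETA_MS = 0.2` and the 15 ms budget are assumptions chosen for illustration, not measurements, and the linear model only holds below saturation.

```python
T0_MS = 10.0      # assumed fixed per-call cost (kernel launches, weight reads)
BETA_MS = 0.2     # assumed marginal cost per extra batch item

def batch_time_ms(b):
    """t_batch(B) = t0 + beta * B, from EQ F3.3."""
    return T0_MS + BETA_MS * b

def throughput_per_s(b):
    """Requests completed per second at batch size b."""
    return b / batch_time_ms(b) * 1000.0

def largest_batch_within(sla_ms, candidates):
    """Largest candidate batch whose compute time alone fits the budget."""
    fitting = [b for b in candidates if batch_time_ms(b) <= sla_ms]
    return max(fitting) if fitting else None

candidates = [1, 2, 4, 8, 16, 32, 64]
for b in candidates:
    print(f"B={b:3}  t_batch={batch_time_ms(b):5.1f} ms  "
          f"throughput={throughput_per_s(b):7.0f} req/s")

best = largest_batch_within(15.0, candidates)
print("largest batch inside a 15 ms compute budget:", best)
```

Under these assumed constants, going from B = 1 to B = 32 raises throughput roughly twentyfold while batch time grows by about 60%. The sketch leaves out the time a request spends waiting for its batch to fill, which a real latency budget must also cover.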

The two reference servers occupy different points on the generality axis:

TorchServe — PyTorch's own server. You package a model into a .mar archive with a Python handler (pre-process → infer → post-process), and it gives you HTTP/gRPC endpoints, dynamic batching, multi-model hosting, versioning, and metrics. Simplest path when everything is PyTorch. (Its stewardship moved to the community in 2024–25; check current maintenance status before standardizing on it.)
NVIDIA Triton Inference Server — the framework-agnostic workhorse. A single Triton process serves models in many backends at once — TensorRT, ONNX Runtime, PyTorch (libtorch), TensorFlow, Python, vLLM — behind one HTTP/gRPC API. It adds concurrent model execution (multiple model instances per GPU), dynamic batching, model ensembles / business-logic scripting (chain pre-process → model → post-process on the server), and rich Prometheus metrics. When a fleet must serve a zoo of models from different frameworks on shared GPUs, Triton is the default.

Alongside them sit the targeted specialists: TensorFlow Serving (mounts a SavedModel directory, hot-swaps versions), and for large language models specifically, throughput-oriented engines like vLLM (paged-attention KV cache, continuous batching) and TensorRT-LLM. Both are frequently run as a Triton backend, so the LLM gets the engine's scheduler and Triton's serving plumbing together.

EQ F3.4 — LITTLE'S LAW (FLEET SIZING) $$ L \;=\; \lambda \, W \quad\Longrightarrow\quad \text{concurrent requests in flight} = (\text{arrival rate}) \times (\text{mean latency}) $$

A queueing identity that sizes serving fleets: the average number of requests in the system equals the arrival rate $\lambda$ times the average time each spends there $W$. At 200 requests/sec with a 0.5 s mean latency, $L = 100$ requests are always in flight, so your replicas must hold 100 concurrent slots or the queue grows without bound. Batching raises the service rate, which shrinks queueing delay (and with it $W$) at high $\lambda$; that is why it is the lever that keeps fleets small.
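Little's Law turns directly into a sizing helper. The function names and the 16-slots-per-replica figure are assumptions for illustration; the law gives averages only, so a real fleet needs headroom for bursts on top of this floor.

```python
import math

def in_flight(arrival_rate_per_s, mean_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s

def replicas_needed(arrival_rate_per_s, mean_latency_s, slots_per_replica):
    """Minimum replica count able to hold the average in-flight load."""
    load = in_flight(arrival_rate_per_s, mean_latency_s)
    return math.ceil(load / slots_per_replica)

load = in_flight(200, 0.5)
n_replicas = replicas_needed(200, 0.5, slots_per_replica=16)   # 16 is an assumed figure
print(f"in flight: {load:.0f} requests, replicas needed: {n_replicas}")
```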

A service receives $\lambda = 200$ requests per second with a mean end-to-end latency of $W = 0.5$ s. By Little's Law (EQ F3.4), how many requests are in flight on average ($L = \lambda W$)?

$L = \lambda \, W = 200 \text{ req/s} \times 0.5 \text{ s} =$ 100 concurrent requests. The serving fleet must provision at least this much concurrency (across replicas and batch slots) or the queue — and tail latency — blows up.

PYTHON · RUNNABLE IN-BROWSER

# Simulate a quantization size/latency trade-off table for a 7B model.
# Latency model: decode is memory-bandwidth-bound, so per-token time scales
# with bytes moved per param (EQ F3.1) -> smaller dtype, faster + smaller.
import numpy as np
N = 7e9                                   # parameters (7B)
bw = 2.0e12                               # ~2 TB/s memory bandwidth (server GPU)

dtypes = ["FP32", "FP16", "INT8", "INT4"]
bytes_per = np.array([4.0, 2.0, 1.0, 0.5])
size_gb   = N * bytes_per / 1e9           # on-disk / VRAM footprint
# memory-bound per-token latency ~ (weight bytes read) / bandwidth
lat_ms    = (N * bytes_per / bw) * 1e3
acc_drop  = np.array([0.0, 0.1, 0.7, 2.5])  # illustrative (assumed) % task-accuracy loss

print(f"{'dtype':6}{'size(GB)':>10}{'lat(ms/tok)':>13}{'tok/s':>9}{'acc drop':>10}")
for d, s, l, a in zip(dtypes, size_gb, lat_ms, acc_drop):
    print(f"{d:6}{s:10.1f}{l:13.2f}{1000/l:9.0f}{a:9.1f}%")

base = lat_ms[0]
print("\nINT4 vs FP32: "
      f"{size_gb[0]/size_gb[3]:.0f}x smaller, {base/lat_ms[3]:.0f}x faster decode,")
print("at the cost of ~2.5% accuracy -- the core deployment bargain.")
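# plot_xy is a helper supplied by the in-browser runtime; skip the plot elsewhere.
if "plot_xy" in globals():
    plot_xy(size_gb, lat_ms)   # smaller artifact -> lower latency (down-left is best)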

INSTRUMENT F3.3 — LATENCY / SIZE TRADE-OFF EXPLORER: QUANTIZE + BATCH UNDER AN SLA · EQ F3.1 / F3.3

[Interactive instrument. Inputs (defaults): model size 7B params, batch size B = 8, latency SLA 50 ms. Outputs: best dtype within SLA, artifact size, throughput.]

Four dots — FP32, FP16, INT8, INT4 — plotted as (artifact size, batch latency). The red line is your latency SLA; dots under it pass. Smaller dtype moves a dot down-and-left (smaller and faster, EQ F3.1), and the instrument picks the highest-precision dtype that still clears the SLA at your batch size. Raise the batch and latency climbs (EQ F3.3) while throughput rises — push it until the only survivors are the aggressive quantizations. This is the quantize-vs-quality bargain made visual.

3.5

JAX & the wider ecosystem

The deployment story so far is PyTorch- and TensorFlow-shaped, but the modern frontier runs through a third stack. JAX is a NumPy-compatible library of composable function transformations: grad (reverse-mode autodiff), vmap (automatic vectorization over a batch axis), pmap/shard_map (parallelism across devices), and above all jit, which traces a pure Python function into XLA (the same Accelerated Linear Algebra compiler underneath TensorFlow) and just-in-time compiles it into fused kernels for TPUs and GPUs.

EQ F3.5 — JAX AS FUNCTION TRANSFORMATIONS $$ f \;\xrightarrow{\;\texttt{jit}\;}\; \text{XLA}(f), \qquad f \;\xrightarrow{\;\texttt{grad}\;}\; \nabla f, \qquad f \;\xrightarrow{\;\texttt{vmap}\;}\; f^{\,(\text{batched})}, \qquad \text{and they } \textbf{compose} $$

JAX's defining idea is that these transformations are orthogonal and composable: jit(vmap(grad(f))) is a single compiled function that returns one gradient per example in a batch, written once and lowered by XLA to fused kernels. (The order matters: grad expects a scalar-valued function, so it wraps the per-example f and vmap batches the result.) The price is purity: transformed functions must be side-effect-free, which is why JAX is loved for research clarity and large-scale training (it underpins much frontier-lab TPU work) and why its deployment path is distinct: you typically export the traced computation via StableHLO rather than ONNX.
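A minimal sketch of one such composition, assuming JAX is installed. The `loss` function and the toy data are made up for illustration. Because grad expects a scalar-valued function, the per-example gradient goes innermost, vmap batches it, and jit compiles the whole thing.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    """Scalar loss for ONE example: squared dot product."""
    return jnp.dot(w, x) ** 2

# grad differentiates with respect to the first argument (w);
# vmap maps over rows of xs while sharing w (in_axes=(None, 0));
# jit compiles the composed function through XLA.
per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0)))

w = jnp.array([1.0, 2.0])
xs = jnp.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])

grads = per_example_grads(w, xs)             # shape (3, 2): one gradient per example
expected = 2.0 * (xs @ w)[:, None] * xs      # analytic gradient: 2 (w . x) x
print(grads)
```

No batch dimension appears anywhere in `loss`; the batching is added from outside by vmap, which is what "orthogonal" means in practice.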

For serving, JAX's natural route is StableHLO (the portable, versioned dialect XLA consumes) feeding an XLA runtime or, increasingly, conversion to TFLite/LiteRT for edge, with the weights stored in an Orbax checkpoint. The wider ecosystem clusters around three poles you should be able to place:

Compilers / IRs: XLA and StableHLO (TF/JAX), TorchInductor and the FX graph (PyTorch), Apache TVM and MLIR as cross-cutting compiler infrastructure. All do the same job — lower a high-level graph to fused device kernels.
Vendor engines: TensorRT and TensorRT-LLM (NVIDIA), OpenVINO (Intel), CoreML (Apple), the Qualcomm / Hexagon stack (mobile NPUs). These are the last-mile, hardware-locked artifacts from §3.1's bottom row.
Weight formats: safetensors has become the de-facto safe, fast, zero-copy checkpoint format (no arbitrary-code pickle risk), and GGUF dominates the local/consumer LLM world (llama.cpp), pairing weights with quantization metadata in one mmap-able file.

The honest 2026 summary. There is no single winning format, and anyone who tells you otherwise is selling something. The durable pattern is the funnel this chapter walked: train in a research framework (PyTorch dominant, JAX strong at the frontier), capture the graph into an artifact (ONNX for portability, native formats to stay in-ecosystem), compile to a hardware engine (TensorRT, TFLite, CoreML), and serve behind a batching server (Triton for heterogeneous fleets, vLLM/TensorRT-LLM for LLMs). Each arrow loses some generality and gains some speed. Knowing which arrow you are on — and what it costs you — is the whole skill of shipping a model.

You have reached the end of the Frameworks volume — and the path from a tensor to a served model is now complete. From the autograd engines and tensor libraries that train a model (Frameworks 01), through TensorFlow and Keras' high-level construction (Frameworks 02), to the export formats, runtimes, and serving stack that carry it into production (Frameworks 03), the loop closes: build it, capture it, compile it, serve it. Return to the index to continue across the other volumes.

3.R

References

Bai, J., Lu, F., Zhang, K. et al. (2019). ONNX: Open Neural Network Exchange. onnx.ai — the framework-agnostic graph format and standard operator set at the heart of §3.2.
Bradbury, J., Frostig, R., Hawkins, P. et al. (2018). JAX: composable transformations of Python+NumPy programs. github.com/google/jax — jit/grad/vmap/pmap over XLA (EQ F3.5).
NVIDIA. Triton Inference Server. developer.nvidia.com — multi-framework serving with concurrent execution and dynamic batching (§3.4).
Google. TensorFlow Lite / LiteRT — On-Device Machine Learning. tensorflow.org/lite — the SavedModel→FlatBuffer converter and edge runtime of §3.3.
Microsoft. ONNX Runtime. onnxruntime.ai — the cross-platform inference engine with pluggable execution providers (CUDA, TensorRT, OpenVINO, CoreML, WebGPU).
PyTorch Team. torch.export & ExecuTorch. docs.pytorch.org — ahead-of-time graph capture (ExportedProgram) and the edge runtime succeeding TorchScript.
Google. TensorFlow Serving. tensorflow.org/tfx — SavedModel hosting with versioned hot-swap, the TF-native serving path.
Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for LLM Serving with PagedAttention (vLLM). github.com/vllm-project/vllm — paged-attention KV cache and continuous batching for LLM throughput.