From notebook to artifact
Inside a training loop, a model is not really a thing — it is a live Python object: a graph of nn.Module calls, autograd tape, optimizer state, and a pile of imports that only exist because someone ran pip install on a research box. That object is perfect for experimentation and useless for production. The first job of deployment is to turn it into an artifact: a self-contained, versioned, serializable representation of just the forward computation and its learned parameters, with the training scaffolding stripped away.
Three properties separate an artifact from a notebook. It is portable — it runs without the original training code, often without Python at all (a C++ service, a phone, a browser). It is frozen — the graph and weights are fixed, so the same input gives the same output forever, which is what makes it auditable. And it is optimizable — once the graph is static, a compiler can fuse operators, fold constants, and pick kernels for the target hardware. Everything in this chapter is a different answer to the question "what is the artifact, and who runs it?"
| Stage | Representation | Lives where | Optimizable? |
|---|---|---|---|
| Research | live Python nn.Module + autograd | training cluster / notebook | no — dynamic, eager |
| Checkpoint | weight tensors (state_dict, safetensors) | object store | weights only, no graph |
| Artifact | graph + weights (ONNX, TorchScript, SavedModel) | artifact registry | yes — graph is static |
| Engine | compiled kernels (TensorRT, TFLite, CoreML) | the target device | already optimized, hardware-locked |
The unit that matters at the artifact stage is the computational graph: nodes are operators (matmul, add, softmax, layer-norm), edges are tensors, and the whole thing is a directed acyclic graph from inputs to outputs. Two techniques recover that graph from eager Python. Tracing runs one example through the model and records the operations it actually executed — fast and universal, but it bakes in whatever control flow that one input took (an if on tensor shape becomes a constant). Scripting parses the Python source itself and compiles control flow into the graph — it preserves loops and branches, but only over a typed subset of the language. Every export path below is built on one of these two.
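The tracing caveat is easy to reproduce in a few lines. The sketch below is a toy stand-in, not any framework's tracer: `Traced`, `trace`, and `replay` are hypothetical names invented for illustration. It records the operations one example input triggers, then replays that recording on an input that should have taken the other branch.

```python
class Traced:
    """Proxy value that records every arithmetic op applied to it."""
    def __init__(self, value, tape):
        self.value, self.tape = value, tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Traced(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Traced(self.value + k, self.tape)


def model(x):
    # Data-dependent branch: eager Python picks a path per input.
    if x.value > 0:
        return x * 2
    return x * -1 + 10


def trace(fn, example):
    """Run fn once on an example and return the ops it actually executed."""
    tape = []
    fn(Traced(example, tape))
    return tape


def replay(tape, x):
    """Execute a recorded tape on a new input; no Python branching remains."""
    for op, k in tape:
        x = x * k if op == "mul" else x + k
    return x


tape = trace(model, 3.0)                 # the example takes the x > 0 branch
eager = model(Traced(-4.0, [])).value    # eager Python takes the other branch
traced = replay(tape, -4.0)              # the trace replays the x > 0 branch anyway

print("recorded ops :", tape)
print("eager(-4.0)  :", eager)
print("traced(-4.0) :", traced)
```

The two results disagree because the branch was resolved once, at trace time. A scripting-style capture would have to record the `if` itself as a graph node, which is why it needs to understand the source rather than just observe one run.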
ONNX — the interchange format
Frameworks are silos. A model trained in PyTorch is a graph of PyTorch ops; a TensorFlow model is a graph of TensorFlow ops; the runtimes that execute them fastest (NVIDIA's, Intel's, Qualcomm's, Apple's) are written by hardware vendors who do not want to maintain a separate backend for every framework. ONNX — the Open Neural Network Exchange — is the lingua franca that breaks the deadlock: a single, framework-agnostic graph format plus a versioned, standardized operator set. Export from any framework into ONNX once, and every ONNX-aware runtime can run your model.
An ONNX file is a serialized protobuf describing exactly the graph from §3.1: a list of nodes (each naming an operator from the standard set: MatMul, Gemm, Conv, Softmax, LayerNormalization, …), the tensors flowing between them, the initializer tensors that hold the trained weights, and typed input/output shapes. The contract that makes it portable is the opset version: opset 17 means "these operators behave exactly as the opset-17 spec says," so a model exported against opset 17 produces the same results, up to floating-point rounding, on any runtime that implements opset 17, regardless of who wrote that runtime.
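To make the pieces concrete, here is a deliberately simplified stand-in for that structure. The dictionary layout, the `KERNELS` table, and `MAX_OPSET` are assumptions made for illustration; this is not the real ONNX protobuf schema or any runtime's API. It shows nodes, initializers, an opset check, and a tiny interpreter that walks the graph.

```python
# Hypothetical, simplified model description (NOT the real ONNX schema).
model = {
    "opset": 17,
    "inputs": ["x"],
    "outputs": ["y"],
    "initializers": {"w": [0.5, -1.0], "b": [0.1, 0.2]},   # trained weights
    "nodes": [                                             # topological order
        {"op": "Mul", "inputs": ["x", "w"], "output": "xw"},
        {"op": "Add", "inputs": ["xw", "b"], "output": "z"},
        {"op": "Relu", "inputs": ["z"], "output": "y"},
    ],
}

# A toy "runtime": one kernel per operator it implements,
# plus the highest opset version it claims to support.
KERNELS = {
    "Mul": lambda a, b: [p * q for p, q in zip(a, b)],
    "Add": lambda a, b: [p + q for p, q in zip(a, b)],
    "Relu": lambda a: [max(0.0, p) for p in a],
}
MAX_OPSET = 17


def run(model, feeds):
    """Interpret the graph node by node, resolving tensors by name."""
    if model["opset"] > MAX_OPSET:
        raise RuntimeError(
            f"model needs opset {model['opset']}, runtime supports {MAX_OPSET}")
    env = dict(model["initializers"])
    env.update(feeds)
    for node in model["nodes"]:
        if node["op"] not in KERNELS:
            raise RuntimeError(f"unsupported operator: {node['op']}")
        args = [env[name] for name in node["inputs"]]
        env[node["output"]] = KERNELS[node["op"]](*args)
    return [env[name] for name in model["outputs"]]


(y,) = run(model, {"x": [2.0, 3.0]})
print(y)
```

Both failure modes a real export can hit are visible here: an operator the runtime has no kernel for, and an opset newer than the runtime implements.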
The runtime side of the standard is ONNX Runtime (ORT), a high-performance C++ engine with a graph optimizer and a system of pluggable execution providers: the same ONNX graph dispatches to CUDA, TensorRT, OpenVINO (Intel), CoreML (Apple), DirectML (Windows), or a portable CPU kernel set, chosen at load time. There is even ONNX Runtime Web, which runs the graph in a browser via WebAssembly and WebGPU — no server round-trip. This is exactly why §3.1's tree routes every cross-framework and browser target through ONNX.
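The dispatch idea can be sketched without the real library. In ONNX Runtime's Python API the caller passes an ordered `providers` list when creating an `InferenceSession`; the toy below only mimics the priority-with-fallback behaviour. The provider names and capability tables are hypothetical, and the real partitioning logic is more involved.

```python
# Hypothetical capability tables: which operators each provider implements.
SUPPORTED_OPS = {
    "AcceleratorProvider": {"MatMul", "Add", "Relu"},
    "CPUProvider": {"MatMul", "Add", "Relu", "Softmax", "TopK"},
}


def assign_providers(node_ops, priority):
    """Place each node on the first provider in `priority` that implements its op."""
    placement = []
    for op in node_ops:
        for provider in priority:
            if op in SUPPORTED_OPS[provider]:
                placement.append((op, provider))
                break
        else:
            raise RuntimeError(f"no provider implements {op}")
    return placement


graph_ops = ["MatMul", "Add", "Relu", "Softmax"]
placement = assign_providers(graph_ops, ["AcceleratorProvider", "CPUProvider"])
for op, provider in placement:
    print(f"{op:8} -> {provider}")
```

The same graph runs on a CPU-only host by passing `["CPUProvider"]` alone; nothing in the artifact changes, only the priority list supplied at load time.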
Honest caveats. ONNX is not free. Models with dynamic control flow or framework-specific custom ops do not always export cleanly — you hit "unsupported operator" or a traced graph that hard-codes a shape. The opset moves, so an old runtime may not implement the operator a fresh export emits, and numerical results can differ at the last few bits because two runtimes fuse operators differently. For large language models the picture is mixed: ONNX export works but the serving ecosystem has largely standardized on framework-native paths (vLLM, TensorRT-LLM) for the heaviest workloads. ONNX shines brightest for portability — vision models, classical nets, anything that must run on heterogeneous edge hardware.
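The "last few bits" caveat comes from floating-point addition not being associative: a runtime that fuses or reorders operators may sum the same numbers in a different order. A minimal illustration, with `outputs_match` as a hypothetical helper name and tolerances chosen arbitrarily for the example:

```python
import math

a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c     # one evaluation order
right = a + (b + c)    # mathematically equal, rounded differently


def outputs_match(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two output vectors element-wise within a tolerance."""
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate))


print("bitwise equal    :", left == right)
print("absolute diff    :", abs(left - right))
print("within tolerance :", outputs_match([left], [right]))
```

This is why cross-runtime validation is usually phrased as "within tolerance" rather than "bit-identical."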
# Export-then-load round trip: serialize weights to JSON, reload, verify match.
# Stand-in for ONNX/TorchScript export: the artifact must reproduce outputs.
import json

import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP "model": just weights + a fixed forward graph.
W1 = rng.normal(0, 0.3, (4, 6)); b1 = rng.normal(0, 0.1, 6)
W2 = rng.normal(0, 0.3, (6, 2)); b2 = rng.normal(0, 0.1, 2)

def forward(x, p):
    h = np.maximum(0.0, x @ p["W1"] + p["b1"])   # ReLU layer
    return h @ p["W2"] + p["b2"]                 # linear head

x = rng.normal(0, 1, (3, 4))                     # a batch of 3 inputs
params = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
live = forward(x, params)

# "Serialize the artifact": graph is fixed, only weights travel (as JSON).
blob = json.dumps({k: v.tolist() for k, v in params.items()})
print(f"serialized artifact size : {len(blob):,} bytes of JSON")

# "Load on the serving side" and re-run the SAME graph.
loaded = {k: np.array(v) for k, v in json.loads(blob).items()}
reloaded = forward(x, loaded)
print("max abs output diff      :", float(np.abs(live - reloaded).max()))
print("artifact reproduces model:", np.allclose(live, reloaded))
print("(this output-match check, within tolerance, is what 'export validation' means in prod)")
TorchScript, SavedModel & TFLite
ONNX is the neutral hub; each framework also has a native artifact that stays inside its own ecosystem and often preserves more than ONNX can. These native formats are what you reach for when source and serving live in the same world.
PyTorch: TorchScript and torch.export
TorchScript was PyTorch's original answer to "run my model without Python." It captures the model via tracing or scripting (the two techniques from §3.1) into an intermediate representation that the libtorch C++ runtime executes, so a PyTorch model can be loaded and run inside a C++ service, an iOS app (via PyTorch Mobile), or anywhere libtorch builds. As of 2026 TorchScript is in maintenance mode: the strategic direction is torch.export, which produces a clean, ahead-of-time-captured graph (an ExportedProgram wrapping an FX graph) that shares its capture machinery with torch.compile and feeds Inductor's ahead-of-time compilation path and the ExecuTorch edge runtime. The mental model is unchanged (capture the graph once, run it without the training stack), but the machinery is newer and the captured graph is more faithful to dynamic shapes.
TensorFlow: SavedModel and TFLite
SavedModel is TensorFlow's complete, language-neutral serialization: the graph (as one or more typed signatures), the trained variables, and any assets, in a directory that TensorFlow Serving, TensorFlow.js, or the C/Java/Go APIs can all load. It is what a Keras model.export() call writes, and what TF Serving (§3.4) actually mounts.
TFLite (now LiteRT) is the edge sibling: a converter takes a SavedModel and emits a single .tflite FlatBuffer — a compact, mmap-able artifact tuned for phones, microcontrollers, and embedded accelerators. The conversion typically applies post-training quantization (float32 → int8, or float16) and can target hardware delegates (the NNAPI / GPU / Hexagon / CoreML delegates). TFLite is the canonical answer to the mobile / edge branch of §3.1's tree, and the reason that branch always forces a quantized format.
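The float32 → int8 step can be sketched in isolation. The example below uses symmetric per-tensor quantization, one common scheme chosen here for brevity; it is not a reproduction of the TFLite converter, which among other things can use calibration data, per-channel scales, and zero-points. The function names and the weight values are made up for illustration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0   # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.90, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes :", q)
print("scale      :", scale)
print("max error  :", max_err, "(bounded by scale / 2)")
print("bytes      :", 4 * len(weights), "as float32 ->", len(weights), "as int8")
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half a quantization step. Whether that error is acceptable is a per-model question that only evaluation on real data answers.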
| Format | Origin | Runs on | Best for |
|---|---|---|---|
| ONNX | neutral hub | ONNX Runtime, TensorRT, OpenVINO, CoreML, browser (ORT-Web) | cross-framework portability; heterogeneous edge |
| TorchScript / torch.export | PyTorch | libtorch (C++), PyTorch Mobile, ExecuTorch | staying in the PyTorch world; C++ services, iOS |
| SavedModel | TensorFlow | TF Serving, TensorFlow.js, TF C/Java APIs | TF-native server deployment |
| TFLite / LiteRT | TensorFlow (edge) | phones, MCUs, embedded NPUs (NNAPI / GPU / CoreML delegates) | on-device, latency- and battery-bound mobile/edge |
| TensorRT engine | NVIDIA (from ONNX/TF/Torch) | NVIDIA GPUs only | last-mile, hardware-locked server speed |
Serving — TorchServe, Triton & friends
An artifact is a passive file. A serving system is the process that loads it, listens on a network port, and turns a stream of requests into a stream of predictions, while squeezing every drop of throughput out of expensive accelerators. Serving is where most of the engineering lives, because the naive loop ("read request, run model, return") wastes the hardware almost completely. The single most important trick is dynamic batching: hold incoming requests for a few milliseconds and run them through the model together, because a GPU that a single request leaves mostly idle can run a batch of 32 in nearly the same time as a batch of 1.
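A small simulation shows why the waiting window pays off. Everything numeric here is an assumption picked for illustration: the arrival rate, the 5 ms window, and a cost model in which each model call has a fixed overhead plus a small per-item cost. `form_batches` is a hypothetical helper, not any server's scheduler.

```python
def form_batches(arrivals_ms, max_batch, max_delay_ms):
    """Group sorted arrival times. A batch closes when it is full, or when a
    request arrives more than max_delay_ms after the batch's first request."""
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) == max_batch or t - current[0] > max_delay_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


def call_cost_ms(batch_size, overhead_ms=5.0, per_item_ms=0.1):
    """Assumed cost model: fixed launch overhead plus a small per-item term."""
    return overhead_ms + per_item_ms * batch_size


arrivals = [0.5 * i for i in range(64)]          # one request every 0.5 ms
batches = form_batches(arrivals, max_batch=32, max_delay_ms=5.0)
sizes = [len(b) for b in batches]

naive_ms = sum(call_cost_ms(1) for _ in arrivals)
batched_ms = sum(call_cost_ms(n) for n in sizes)

print("batch sizes      :", sizes)
print(f"accelerator time : naive {naive_ms:.1f} ms vs batched {batched_ms:.1f} ms")
```

The trade-off is visible in the two knobs: a longer window builds bigger batches and amortizes more overhead, but every request in the batch pays that window as added latency.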
The two reference servers occupy different points on the generality axis:
- TorchServe: PyTorch's own server. You package a model into a .mar archive with a Python handler (pre-process → infer → post-process), and it gives you HTTP/gRPC endpoints, dynamic batching, multi-model hosting, versioning, and metrics. Simplest path when everything is PyTorch. (Its stewardship moved to the community in 2024–25; check current maintenance status before standardizing on it.)
- NVIDIA Triton Inference Server: the framework-agnostic workhorse. A single Triton process serves models in many backends at once (TensorRT, ONNX Runtime, PyTorch (libtorch), TensorFlow, Python, vLLM) behind one HTTP/gRPC API. It adds concurrent model execution (multiple model instances per GPU), dynamic batching, model ensembles / business-logic scripting (chain pre-process → model → post-process on the server), and rich Prometheus metrics. When a fleet must serve a zoo of models from different frameworks on shared GPUs, Triton is the default.
Alongside them sit the targeted specialists: TensorFlow Serving (mounts a SavedModel directory, hot-swaps versions), and for large language models specifically, throughput-oriented engines like vLLM (paged-attention KV cache, continuous batching) and TensorRT-LLM — frequently run as a Triton backend so the LLM gets vLLM's scheduler and Triton's serving plumbing together.
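Continuous batching is easiest to see by counting decode steps. The sketch below is a toy scheduler under simplifying assumptions (every decode step costs the same, a fixed number of slots, made-up sequence lengths); it is not vLLM's algorithm and ignores KV-cache memory entirely. Function names are hypothetical.

```python
def static_steps(lengths, slots):
    """Static batching: each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled before the next decode step."""
    waiting = list(lengths)
    active = []
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.pop(0))
        steps += 1                                   # one decode step for all active
        active = [n - 1 for n in active if n > 1]    # drop sequences that just finished
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 2]   # tokens each request still has to generate
print("static     :", static_steps(lengths, slots=2), "decode steps")
print("continuous :", continuous_steps(lengths, slots=2), "decode steps")
```

With static batching a short request's slot sits idle until its long neighbour finishes; refilling slots as they free up removes that idle time, which is where the throughput gain comes from.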
# Simulate a quantization size/latency trade-off table for a 7B model.
# Latency model: decode is memory-bandwidth-bound, so per-token time scales
# with bytes moved per param (EQ F3.1) -> smaller dtype, faster + smaller.
import numpy as np

N = 7e9                                      # parameters (7B)
bw = 2.0e12                                  # ~2 TB/s memory bandwidth (server GPU)
dtypes = ["FP32", "FP16", "INT8", "INT4"]
bytes_per = np.array([4.0, 2.0, 1.0, 0.5])
size_gb = N * bytes_per / 1e9                # on-disk / VRAM footprint

# memory-bound per-token latency ~ (weight bytes read) / bandwidth
lat_ms = (N * bytes_per / bw) * 1e3
acc_drop = np.array([0.0, 0.1, 0.7, 2.5])    # illustrative % task-accuracy loss, not measured

print(f"{'dtype':6}{'size(GB)':>10}{'lat(ms/tok)':>13}{'tok/s':>9}{'acc drop':>10}")
for d, s, l, a in zip(dtypes, size_gb, lat_ms, acc_drop):
    print(f"{d:6}{s:10.1f}{l:13.2f}{1000/l:9.0f}{a:9.1f}%")

base = lat_ms[0]
print("\nINT4 vs FP32: "
      f"{size_gb[0]/size_gb[3]:.0f}x smaller, {base/lat_ms[3]:.0f}x faster decode,")
print("at the cost of ~2.5% accuracy -- the core deployment bargain.")
# plot_xy(size_gb, lat_ms)  # plotting helper not defined in this snippet;
#                           # smaller artifact -> lower latency (down-left is best)
JAX & the wider ecosystem
The deployment story so far is PyTorch- and TensorFlow-shaped, but the modern frontier runs through a third stack. JAX is a NumPy-compatible library of composable function transformations: grad (reverse-mode autodiff), vmap (automatic vectorization over a batch axis), pmap/shard_map (parallelism across devices), and above all jit, which traces a pure Python function into XLA (the same Accelerated Linear Algebra compiler underneath TensorFlow) and just-in-time compiles it into fused kernels for TPUs and GPUs.
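What "composable" means is that each transformation takes a function and returns a function, so they stack. The sketch below imitates that shape in plain Python so it runs without JAX installed: `grad` and `vmap` here are stand-ins with the same names, not the real `jax.grad` and `jax.vmap`. In particular this `grad` is a numerical central difference rather than autodiff, and nothing is compiled.

```python
def grad(f, eps=1e-6):
    """Stand-in for jax.grad: central finite difference of a scalar function."""
    return lambda x: (f(x + eps) - f(x - eps)) / (2 * eps)


def vmap(f):
    """Stand-in for jax.vmap: apply f independently to each element of a batch."""
    return lambda xs: [f(x) for x in xs]


def loss(x):
    return x ** 3            # analytic derivative: 3 * x**2


# grad expects a scalar-valued function, so the per-example gradient is
# built as vmap(grad(loss)): differentiate one example, then batch it.
per_example_grad = vmap(grad(loss))
grads = per_example_grad([1.0, 2.0, 3.0])
print([round(g, 4) for g in grads])
```

In real JAX the same composition would additionally be wrapped in `jit`, which is what hands the whole batched gradient to XLA as one program.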
jit(vmap(grad(f))) is a single compiled, batched gradient: written once, lowered by XLA to fused kernels. (The order matters: grad expects a scalar-valued function, so vmap wraps grad rather than the reverse.) The price is purity: transformed functions must be side-effect-free, which is why JAX is loved for research clarity and large-scale training (it underpins much frontier-lab TPU work) and why its deployment path is distinct: you typically export the traced computation via StableHLO rather than ONNX.
For serving, JAX's natural route is StableHLO (the portable, versioned dialect XLA consumes), with weights saved as Orbax checkpoints, feeding either an XLA runtime or, increasingly, conversion to TFLite/LiteRT for edge. The wider ecosystem clusters around three poles you should be able to place:
- Compilers / IRs: XLA and StableHLO (TF/JAX), TorchInductor and the FX graph (PyTorch), Apache TVM and MLIR as cross-cutting compiler infrastructure. All do the same job — lower a high-level graph to fused device kernels.
- Vendor engines: TensorRT and TensorRT-LLM (NVIDIA), OpenVINO (Intel), CoreML (Apple), the Qualcomm / Hexagon stack (mobile NPUs). These are the last-mile, hardware-locked artifacts from §3.1's bottom row.
- Weight formats: safetensors has become the de-facto safe, fast, zero-copy checkpoint format (no arbitrary-code pickle risk), and GGUF dominates the local/consumer LLM world (llama.cpp), pairing weights with quantization metadata in one mmap-able file.
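The "no arbitrary code" property of the weight formats above follows from the file layout: a length prefix, a JSON header describing each tensor, then raw bytes. The sketch below follows that general layout (as publicly described for safetensors) using only the standard library. It is an illustration under simplifying assumptions (1-D float32 tensors only), and `save` / `load` are hypothetical helpers, not the safetensors library's API and not guaranteed to be byte-compatible with it.

```python
import json
import struct


def save(tensors):
    """tensors: {name: list of floats} -> bytes (u64 header length, JSON header, raw data)."""
    header, chunks, offset = {}, [], 0
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)     # little-endian float32
        header[name] = {"dtype": "F32", "shape": [len(values)],
                        "data_offsets": [offset, offset + len(raw)]}
        chunks.append(raw)
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + b"".join(chunks)


def load(blob):
    """Parse the header as JSON and slice tensors out of the byte buffer."""
    (head_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + head_len].decode("utf-8"))
    data = memoryview(blob)[8 + head_len:]                 # no copy of the payload
    tensors = {}
    for name, meta in header.items():
        start, end = meta["data_offsets"]
        count = (end - start) // 4                         # 4 bytes per float32
        tensors[name] = list(struct.unpack(f"<{count}f", data[start:end]))
    return tensors


weights = {"w": [0.5, -1.25, 2.0], "b": [0.25]}
blob = save(weights)
restored = load(blob)
print(len(blob), "bytes ->", restored)
```

Loading only parses JSON and reads byte ranges, so a malicious file can at worst contain wrong numbers. A pickle-based checkpoint, by contrast, can execute code during deserialization.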
The honest 2026 summary. There is no single winning format, and anyone who tells you otherwise is selling something. The durable pattern is the funnel this chapter walked: train in a research framework (PyTorch dominant, JAX strong at the frontier), capture the graph into an artifact (ONNX for portability, native formats to stay in-ecosystem), compile to a hardware engine (TensorRT, TFLite, CoreML), and serve behind a batching server (Triton for heterogeneous fleets, vLLM/TensorRT-LLM for LLMs). Each arrow loses some generality and gains some speed. Knowing which arrow you are on — and what it costs you — is the whole skill of shipping a model.
You have reached the end of the Frameworks volume, and the path from a tensor to a served model is now complete. From the autograd engines and tensor libraries that train a model (Frameworks 01), through TensorFlow and Keras' high-level construction (Frameworks 02), to the export formats, runtimes, and serving stack that carry it into production (Frameworks 03), the loop closes: build it, capture it, compile it, serve it.