The local inference stack
An open-weights release is a directory of tensors plus a config. To produce text you need an inference engine that loads those tensors, applies the model's chat template, runs the forward pass, manages the KV cache, and samples tokens. The engine is where almost all of your practical choices live — the weights are inert until something runs them.
It helps to see the stack as four layers, from metal to prompt:
| Layer | What it decides | Examples |
|---|---|---|
| Hardware | memory ceiling & bandwidth — the hard limit on what fits | RTX 4090 (24 GB), M-series unified memory, H100 (80 GB), CPU + RAM |
| Kernels / runtime | how a matmul actually executes on that silicon | CUDA, Metal, ROCm, Vulkan, plain AVX2/NEON CPU |
| Inference engine | quant format, KV cache, batching, sampling | llama.cpp, vLLM, SGLang, TensorRT-LLM, MLX, ExLlamaV2 |
| Front-end / API | how you talk to it | Ollama, LM Studio, an OpenAI-compatible /v1 endpoint |
Two questions decide your engine. First, where does the model live? If the weights fit in GPU VRAM the GPU does everything; if not, the engine must split layers between GPU and CPU/RAM (offloading), and the slowest tier sets the pace. Second, how many requests at once? A single user wants the lowest latency per token; a server wants the highest aggregate throughput, which is a different — sometimes opposite — objective (§2.3).
The economics changed because the bottleneck is not compute. Autoregressive decoding generates one token at a time, and each token must stream the model's weights from memory through the compute units. So single-stream decode is memory-bandwidth-bound, not FLOP-bound: roughly, the upper bound on tokens per second is the device's memory bandwidth divided by the number of bytes you read per token.
llama.cpp & GGUF
llama.cpp, started by Georgi Gerganov in 2023, is the project that made local inference mainstream. It is a dependency-light C/C++ engine that runs the forward pass on CPU, GPU (CUDA, Metal, Vulkan, ROCm), or any split of the two. Ollama and LM Studio are friendly wrappers over it; when someone "runs a model on their MacBook," llama.cpp is almost always the thing actually executing.
Its companion is GGUF (GPT-Generated Unified Format), the single-file container llama.cpp loads. One .gguf file holds the quantized tensors, the tokenizer, the chat template, and metadata such as context length — everything needed to run, with no separate config to mismatch. The format is self-describing and memory-mappable, so the OS can page weights in lazily rather than read the whole file up front.
A file like Llama-3.1-8B-Instruct-Q4_K_M.gguf reads as: model · size · instruct-tuned · quant scheme. The Q4_K_M suffix is the most important part — Q4 is ~4 bits per weight, _K is the k-quant method (per-block scales chosen to minimize error), and _M is the "medium" mix that keeps a few sensitive tensors at higher precision. Q4_K_M and Q5_K_M are the everyday defaults; Q8_0 is near-lossless but twice the size of Q4.
The footprint of a GGUF is close to bits-per-weight × parameters. That is the whole game on a memory-limited box, so it is worth being able to compute it cold:
# EQ OM2.2: a memory estimator -- params x bit-width -> GB of weights
import numpy as np
def weight_gb(params, bpw):
return params * (bpw / 8) / 1e9 # bytes -> gigabytes
sizes = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
quants = {"bf16": 16, "Q8_0": 8, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q4_0": 4}
print(f"{'model':>6} | " + " | ".join(f"{q:>7}" for q in quants))
print("-" * (9 + 10 * len(quants)))
for name, p in sizes.items():
row = " | ".join(f"{weight_gb(p, b):6.1f}G" for b in quants.values())
print(f"{name:>6} | {row}")
print("\nA 24 GB card holds a 7B at any quant, a 13B comfortably,")
print("and a 70B ONLY once you drop to ~Q4 (35 GB nominal) -- which still")
print("needs a 48 GB card or two 24 GB cards. Bits-per-weight is destiny.")
The other half of the bill is the KV cache. Weights are fixed; the cache grows with context and concurrency (Vol II · EQ 3.5). On a single laptop that buffer is usually small next to the weights, but at long context it can rival them — which is why the calculator in §2.5 sums both. A handy trick on Apple silicon: unified memory means the GPU and CPU share one pool, so "VRAM" and "RAM" are the same budget and a 64 GB Mac can hold a 70B Q4 that no consumer discrete GPU can.
vLLM & production serving
llama.cpp optimizes the single-user laptop. vLLM optimizes the opposite end: many concurrent users on datacenter GPUs, maximizing tokens served per second per dollar. It is the open-source serving engine most production open-model deployments are built on, and its key idea is about memory, not math.
Naive serving pre-allocates a contiguous KV-cache buffer per request, sized for the maximum possible sequence length. Most requests never reach that length, so most of that memory sits idle — internal fragmentation that strands GPU memory and caps how many requests fit. vLLM's PagedAttention borrows the operating-system idea of virtual memory: the KV cache is split into fixed-size blocks allocated on demand, with a block table mapping a request's logical positions to physical blocks. Memory is handed out a block at a time, fragmentation drops to near zero, and identical prefixes (a shared system prompt, a few-shot preamble) can share the same physical blocks across requests.
The serving picture has a fundamental shape: throughput rises with batch size because the GPU's parallelism gets amortized across more sequences, until you run out of KV memory or saturate compute — after which adding requests only raises latency. The instrument below lets you find that knee.
Use the right tool for the job. llama.cpp / Ollama for one user, a laptop, or a quick local prototype. vLLM / SGLang / TensorRT-LLM when you serve traffic and care about cost per million tokens. The same open weights run on both; only the engine — and therefore the memory accounting — changes.
Quantization for local (recap)
Quantization is the lever that turns "needs a datacenter" into "runs on my desk." The full theory — absmax vs zero-point, GPTQ, AWQ, the NF4 data type — lives in Vol II · Chapter 07; here is the operating intuition and the local-specific defaults.
Weights are stored at lower precision so each one occupies fewer bits. Trained weights cluster tightly around zero, so the modern schemes (k-quants, GPTQ, AWQ, NF4) spend their limited code values where the weights actually are, and protect the few outlier channels that carry disproportionate signal — the central finding of LLM.int8(), which showed that a handful of large-magnitude features, if naively quantized, wreck accuracy. The quality cost is not linear in bits:
Two honest caveats. First, the "quality retained" curve here is a stylized model — real degradation depends on the model, the quant method, and the task (code and math are more fragile than chat). Always run your own eval at the quant you intend to ship. Second, KV cache can be quantized too (FP8 or INT4 K/V), which buys context length at a smaller quality cost than weight quantization — increasingly standard in both llama.cpp and vLLM.
Hardware sizing — will it fit?
Everything above reduces to one inequality: the model's total footprint must fit under your memory ceiling. Total memory is weights plus KV cache plus a working overhead for activations and the framework:
# KV cache size as a function of context length and model geometry
# (Vol II EQ 3.5) -- the variable half of EQ OM2.5
import numpy as np
def kv_gb(L, h_kv, d_k, T, batch=1, bits=16):
bytes_per_tok = 2 * L * h_kv * d_k * (bits / 8) # 2 = K and V
return bytes_per_tok * T * batch / 1e9
# Llama-3-8B geometry: 32 layers, GQA with 8 KV heads, head dim 128
geom = dict(L=32, h_kv=8, d_k=128)
per_tok_kb = 2 * 32 * 8 * 128 * 2 / 1024
print(f"per-token KV (FP16): {per_tok_kb:.0f} KB -> {per_tok_kb/1024:.3f} MB")
xs, ys = [], []
for T in (2048, 8192, 32768, 131072):
g = kv_gb(T=T, **geom)
xs.append(T); ys.append(g)
print(f" T = {T:>7,}: {g:6.2f} GB of KV cache (batch 1, FP16)")
print("\nWeights (8B at Q4) are a fixed ~4.5 GB; the cache is what scales")
print("with context. At 128K tokens the cache alone rivals the weights.")
plot_xy(xs, ys)
You can now run any open model and predict whether it fits before you download it. Chapter 03 takes the next step: changing the weights instead of just serving them — fine-tuning open models with LoRA and QLoRA on the same consumer hardware, and merging or hot-swapping adapters back into the serving stack you just built.
References
- Gerganov, G. et al. (2023). llama.cpp — LLM inference in C/C++.
- Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention.
- Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
- Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.
- Lin, J., Tang, J., Tang, H. et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.
- The vLLM Team. vLLM Documentation.