Running Open Models — AI Encyclopedia

2.1

The local inference stack

An open-weights release is a directory of tensors plus a config. To produce text you need an inference engine that loads those tensors, applies the model's chat template, runs the forward pass, manages the KV cache, and samples tokens. The engine is where almost all of your practical choices live — the weights are inert until something runs them.

It helps to see the stack as four layers, from metal to prompt:

Layer	What it decides	Examples
Hardware	memory ceiling & bandwidth — the hard limit on what fits	RTX 4090 (24 GB), M-series unified memory, H100 (80 GB), CPU + RAM
Kernels / runtime	how a matmul actually executes on that silicon	CUDA, Metal, ROCm, Vulkan, plain AVX2/NEON CPU
Inference engine	quant format, KV cache, batching, sampling	llama.cpp, vLLM, SGLang, TensorRT-LLM, MLX, ExLlamaV2
Front-end / API	how you talk to it	Ollama, LM Studio, an OpenAI-compatible `/v1` endpoint

Two questions decide your engine. First, where does the model live? If the weights fit in GPU VRAM the GPU does everything; if not, the engine must split layers between GPU and CPU/RAM (offloading), and the slowest tier sets the pace. Second, how many requests at once? A single user wants the lowest latency per token; a server wants the highest aggregate throughput, which is a different — sometimes opposite — objective (§2.3).

The economics changed because the bottleneck is not compute. Autoregressive decoding generates one token at a time, and each token must stream the model's weights from memory through the compute units. So single-stream decode is memory-bandwidth-bound, not FLOP-bound: roughly, the upper bound on tokens per second is the device's memory bandwidth divided by the number of bytes you read per token.

EQ OM2.1 — THE DECODE BANDWIDTH BOUND $$ \text{tok/s} \;\lesssim\; \frac{\text{memory bandwidth (B/s)}}{\text{model bytes read per token}} \;=\; \frac{\text{BW}}{N_{\text{params}} \times \dfrac{\text{bits}}{8}} $$

Generating one token reads essentially every weight once. A 7B model in 4-bit is $\approx 3.5$ GB; on a 1 TB/s consumer GPU that ceiling is $\approx 1000/3.5 \approx 285$ tok/s, and real engines reach a healthy fraction of it. Quantizing from 16-bit to 4-bit reads 4× fewer bytes per token, so it speeds up decode roughly 4× and quarters the memory footprint — the single biggest lever you have on a laptop. Prefill (processing the prompt) is the opposite regime: it is compute-bound and highly parallel.

A laptop GPU has $700$ GB/s of memory bandwidth. You run a $7\text{B}$ model quantized to $4$ bits ($\approx 3.5$ GB read per token). What is the theoretical decode ceiling, in tokens per second ($\text{BW} / \text{bytes per token}$)?

$ \dfrac{700\text{ GB/s}}{3.5\text{ GB/token}} = $ 200 tok/s. This is an upper bound — real throughput is a fraction of it after sampling, attention, and Python overhead — but it explains why a faster GPU and a smaller quant both raise the same number.

2.2

llama.cpp & GGUF

llama.cpp, started by Georgi Gerganov in 2023, is the project that made local inference mainstream. It is a dependency-light C/C++ engine that runs the forward pass on CPU, GPU (CUDA, Metal, Vulkan, ROCm), or any split of the two. Ollama and LM Studio are friendly wrappers over it; when someone "runs a model on their MacBook," llama.cpp is almost always the thing actually executing.

Its companion is GGUF (GPT-Generated Unified Format), the single-file container llama.cpp loads. One .gguf file holds the quantized tensors, the tokenizer, the chat template, and metadata such as context length — everything needed to run, with no separate config to mismatch. The format is self-describing and memory-mappable, so the OS can page weights in lazily rather than read the whole file up front.

NAMING

A file like Llama-3.1-8B-Instruct-Q4_K_M.gguf reads as: model · size · instruct-tuned · quant scheme. The Q4_K_M suffix is the most important part — Q4 is ~4 bits per weight, _K is the k-quant method (per-block scales chosen to minimize error), and _M is the "medium" mix that keeps a few sensitive tensors at higher precision. Q4_K_M and Q5_K_M are the everyday defaults; Q8_0 is near-lossless but twice the size of Q4.

The footprint of a GGUF is close to bits-per-weight × parameters. That is the whole game on a memory-limited box, so it is worth being able to compute it cold:

EQ OM2.2 — WEIGHT FOOTPRINT $$ \text{bytes}_{\text{weights}} \;=\; N_{\text{params}} \times \frac{\text{bits per weight}}{8} \qquad\Longrightarrow\qquad \text{GB} \;=\; \frac{N_{\text{params}} \times \text{bpw}}{8 \times 10^{9}} $$

"bpw" is bits per weight — $16$ for bf16, $8$ for Q8, $\approx 4.5$ for a real Q4_K_M (the extra half-bit is the per-block scales and the higher-precision tensors). Nominal Q4 uses $4$ bpw exactly. A 7B model: bf16 = 14 GB, Q8 = 7 GB, Q4 = 3.5 GB. Each step down halves the file and roughly halves the bytes read per token — at a quality cost that §2.4 quantifies.

GGUF is the file format used by llama.cpp to package a quantized model (weights, tokenizer, chat template, metadata) into one file. True or false?

GGUF (GPT-Generated Unified Format) is exactly that single-file container — it superseded the older GGML format and is what Ollama and LM Studio ship under the hood. The answer is true.

Using EQ OM2.2 with the nominal $4$ bits per weight ($0.5$ bytes/param), roughly how many GB do the weights of a $7\text{B}$-parameter model occupy at 4-bit?

$ 7\times10^{9} \times \dfrac{4}{8} = 7\times10^{9} \times 0.5 = 3.5\times10^{9} $ bytes $ = $ 3.5 GB. (A real Q4_K_M is a touch larger — about 4–4.5 GB — because of the per-block scales; the back-of-envelope 3.5 GB is what you size against first.)

PYTHON · RUNNABLE IN-BROWSER

# EQ OM2.2: a memory estimator -- params x bit-width -> GB of weights
import numpy as np

def weight_gb(params, bpw):
    return params * (bpw / 8) / 1e9        # bytes -> gigabytes

sizes = {"7B": 7e9, "13B": 13e9, "70B": 70e9}
quants = {"bf16": 16, "Q8_0": 8, "Q5_K_M": 5.5, "Q4_K_M": 4.5, "Q4_0": 4}

print(f"{'model':>6} | " + " | ".join(f"{q:>7}" for q in quants))
print("-" * (9 + 10 * len(quants)))
for name, p in sizes.items():
    row = " | ".join(f"{weight_gb(p, b):6.1f}G" for b in quants.values())
    print(f"{name:>6} | {row}")

print("\nA 24 GB card holds a 7B at any quant, a 13B comfortably,")
print("and a 70B ONLY once you drop to ~Q4 (35 GB nominal) -- which still")
print("needs a 48 GB card or two 24 GB cards. Bits-per-weight is destiny.")

edits are live — break it on purpose

The other half of the bill is the KV cache. Weights are fixed; the cache grows with context and concurrency (Vol II · EQ 3.5). On a single laptop that buffer is usually small next to the weights, but at long context it can rival them — which is why the calculator in §2.5 sums both. A handy trick on Apple silicon: unified memory means the GPU and CPU share one pool, so "VRAM" and "RAM" are the same budget and a 64 GB Mac can hold a 70B Q4 that no consumer discrete GPU can.

2.3

vLLM & production serving

llama.cpp optimizes the single-user laptop. vLLM optimizes the opposite end: many concurrent users on datacenter GPUs, maximizing tokens served per second per dollar. It is the open-source serving engine most production open-model deployments are built on, and its key idea is about memory, not math.

Naive serving pre-allocates a contiguous KV-cache buffer per request, sized for the maximum possible sequence length. Most requests never reach that length, so most of that memory sits idle — internal fragmentation that strands GPU memory and caps how many requests fit. vLLM's PagedAttention borrows the operating-system idea of virtual memory: the KV cache is split into fixed-size blocks allocated on demand, with a block table mapping a request's logical positions to physical blocks. Memory is handed out a block at a time, fragmentation drops to near zero, and identical prefixes (a shared system prompt, a few-shot preamble) can share the same physical blocks across requests.

EQ OM2.3 — KV BLOCKS & CONCURRENCY $$ \text{max concurrent seqs} \;\approx\; \frac{M_{\text{KV}}}{\text{bytes/token} \times \bar{T}}, \qquad \text{blocks per seq} = \left\lceil \frac{T}{B} \right\rceil $$

$M_{\text{KV}}$ is the GPU memory left for cache after the weights; $\bar T$ is the average sequence length; $B$ is the block size (commonly 16 tokens). Because blocks are allocated lazily and shared on common prefixes, vLLM packs far more concurrent sequences into the same $M_{\text{KV}}$ than contiguous allocation — the original paper reports up to 2–4× higher throughput at the same latency. Continuous batching compounds it: finished sequences free their blocks and new requests fill the slot mid-flight, instead of the whole batch waiting for its slowest member.

The serving picture has a fundamental shape: throughput rises with batch size because the GPU's parallelism gets amortized across more sequences, until you run out of KV memory or saturate compute — after which adding requests only raises latency. The instrument below lets you find that knee.

INSTRUMENT OM2.3 — THROUGHPUT vs BATCH SIZECONTINUOUS BATCHING · SATURATING CURVE

PER-STREAM SPEED 40 tok/s

GPU SATURATION POINT 32 seqs

PER-USER @ THIS BATCH

—

AGGREGATE THROUGHPUT

—

BATCH (DRAG ON CANVAS)

—

Hover or drag across the curve to pick a batch size. The mint line is aggregate tokens/s (what a server bills for); the blue line is per-user tokens/s (what a single user feels). Below the saturation point throughput scales almost linearly and latency barely moves — free money. Past it, aggregate flattens while per-user speed falls off a cliff: you are now trading latency for nothing.

Use the right tool for the job. llama.cpp / Ollama for one user, a laptop, or a quick local prototype. vLLM / SGLang / TensorRT-LLM when you serve traffic and care about cost per million tokens. The same open weights run on both; only the engine — and therefore the memory accounting — changes.

2.4

Quantization for local (recap)

Quantization is the lever that turns "needs a datacenter" into "runs on my desk." The full theory — absmax vs zero-point, GPTQ, AWQ, the NF4 data type — lives in Vol II · Chapter 07; here is the operating intuition and the local-specific defaults.

Weights are stored at lower precision so each one occupies fewer bits. Trained weights cluster tightly around zero, so the modern schemes (k-quants, GPTQ, AWQ, NF4) spend their limited code values where the weights actually are, and protect the few outlier channels that carry disproportionate signal — the central finding of LLM.int8(), which showed that a handful of large-magnitude features, if naively quantized, wreck accuracy. The quality cost is not linear in bits:

EQ OM2.4 — THE QUALITY ELBOW $$ \Delta\text{quality}(\text{bpw}) \approx \begin{cases} \approx 0 & \text{bpw} \ge 8 \\ \text{small} & \text{bpw} \approx 4\text{–}6 \\ \text{steep} & \text{bpw} < 4 \end{cases} $$

8-bit is effectively lossless; the drop from 8 to ~4 bits is small and usually worth the 2× memory win; below ~4 bits quality falls off fast. The practical sweet spot for local use is Q4_K_M to Q5_K_M — roughly 4.5–5.5 effective bpw, where a model that would not fit at all becomes one that fits with little measurable loss. A 4-bit copy of a bigger model almost always beats an 8-bit copy of a smaller one at the same memory budget: parameters buy more quality than precision does.

INSTRUMENT OM2.4 — QUANT LEVEL EXPLORERSIZE vs QUALITY · EQ OM2.2 · EQ OM2.4

MODEL SIZE 7B params

QUANT LEVEL Q4_K_M

EFFECTIVE BPW

—

WEIGHT FOOTPRINT

—

QUALITY RETAINED

—

The bar shows size; the curve shows the quality elbow. Slide the quant from bf16 down to Q2 and watch the footprint collapse while quality holds — then drops sharply past Q4. The mint marker is where most people live: Q4_K_M, the best bytes-per-IQ point for laptops in 2026.

Two honest caveats. First, the "quality retained" curve here is a stylized model — real degradation depends on the model, the quant method, and the task (code and math are more fragile than chat). Always run your own eval at the quant you intend to ship. Second, KV cache can be quantized too (FP8 or INT4 K/V), which buys context length at a smaller quality cost than weight quantization — increasingly standard in both llama.cpp and vLLM.

2.5

Hardware sizing — will it fit?

Everything above reduces to one inequality: the model's total footprint must fit under your memory ceiling. Total memory is weights plus KV cache plus a working overhead for activations and the framework:

EQ OM2.5 — TOTAL INFERENCE MEMORY $$ M_{\text{total}} \;=\; \underbrace{N_{\text{params}}\!\times\!\tfrac{\text{bpw}}{8}}_{\text{weights}} \;+\; \underbrace{2 L\, h_{kv}\, d_k\, T\, b \times \tfrac{\text{bits}_{kv}}{8}}_{\text{KV cache (Vol II · EQ 3.5)}} \;+\; \underbrace{M_{\text{overhead}}}_{\approx 1\text{–}2\text{ GB}} $$

Weights are the constant; the KV term is the variable that grows with context $T$ and batch $b$. For a single user at modest context the weights dominate, so the bpw choice decides whether it fits. At long context or high concurrency the KV term takes over — which is when GQA (Vol II · §3.6) and KV quantization earn their keep. Leave headroom: a model whose total sits at 100% of VRAM will OOM the moment the context grows. Target ~85–90% of the card.

PYTHON · RUNNABLE IN-BROWSER

# KV cache size as a function of context length and model geometry
# (Vol II EQ 3.5) -- the variable half of EQ OM2.5
import numpy as np

def kv_gb(L, h_kv, d_k, T, batch=1, bits=16):
    bytes_per_tok = 2 * L * h_kv * d_k * (bits / 8)   # 2 = K and V
    return bytes_per_tok * T * batch / 1e9

# Llama-3-8B geometry: 32 layers, GQA with 8 KV heads, head dim 128
geom = dict(L=32, h_kv=8, d_k=128)
per_tok_kb = 2 * 32 * 8 * 128 * 2 / 1024
print(f"per-token KV (FP16): {per_tok_kb:.0f} KB  -> {per_tok_kb/1024:.3f} MB")

xs, ys = [], []
for T in (2048, 8192, 32768, 131072):
    g = kv_gb(T=T, **geom)
    xs.append(T); ys.append(g)
    print(f"  T = {T:>7,}: {g:6.2f} GB of KV cache (batch 1, FP16)")

print("\nWeights (8B at Q4) are a fixed ~4.5 GB; the cache is what scales")
print("with context. At 128K tokens the cache alone rivals the weights.")
plot_xy(xs, ys)

edits are live — break it on purpose

INSTRUMENT OM2.5 — WILL IT FIT?EQ OM2.5 · WEIGHTS + KV + OVERHEAD

MODEL SIZE 8B params

QUANT (bpw) Q4 · 4.5

CONTEXT T 8K

CONCURRENT USERS 1

YOUR HARDWARE

WEIGHTS

—

KV CACHE

—

TOTAL / CEILING

—

VERDICT

—

Defaults: an 8B model at Q4, 8K context, one user — comfortably inside a 24 GB card. Push context to 128K or users to 32 and watch the KV bar swallow the budget. Switch to 70B and only Q4 on a 48/80 GB card stays green. The dashed line is your hardware ceiling; aim to stay under ~90% of it.

Estimate total inference memory for a $7\text{B}$ model at $4$ bits (use $0.5$ bytes/param), with $\approx 1$ GB of KV cache and $\approx 1.5$ GB of overhead. What is $M_{\text{total}}$ in GB?

Weights $ = 7\times10^{9}\times0.5 = 3.5 $ GB; plus $ 1 $ GB KV plus $ 1.5 $ GB overhead $ = 3.5 + 1 + 1.5 = $ 6 GB. That fits an 8 GB card with a sliver of headroom — exactly the regime where laptops became viable.

You can now run any open model and predict whether it fits before you download it. Chapter 03 takes the next step: changing the weights instead of just serving them — fine-tuning open models with LoRA and QLoRA on the same consumer hardware, and merging or hot-swapping adapters back into the serving stack you just built.

2.R

References

Gerganov, G. et al. (2023). llama.cpp — LLM inference in C/C++. The reference local inference engine and the home of the GGUF format.
Kwon, W., Li, Z., Zhuang, S. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023 — the vLLM paper; paged KV cache and continuous batching.
Dettmers, T., Lewis, M., Belkada, Y. & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. NeurIPS 2022 — outlier-aware 8-bit quantization; why a few features must be preserved.
Frantar, E., Ashkboos, S., Hoefler, T. & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023 — one-shot 3–4 bit weight quantization, a basis for local quants.
Lin, J., Tang, J., Tang, H. et al. (2024). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024 — protects salient weight channels; widely used for 4-bit serving.
The vLLM Team. vLLM Documentation. Official guide to deployment, quantization, and the OpenAI-compatible server.