Merging in weight space
Fine-tuning produces a checkpoint: a vector of weights \(\theta\) that differs from the base model it started from. The conventional way to combine two fine-tunes is an ensemble — run both models, average their outputs. That doubles inference cost and memory. Model merging takes the cheaper, stranger route: average the weights themselves, once, offline, and serve a single model. No second forward pass, no training, no data. The merged model is just \(\theta_{\text{merged}}\), a new point in the same parameter space.
The plainest version is a uniform weight average. Given checkpoints \(\theta_1, \dots, \theta_k\) that all started from the same base, the merge is their mean, \(\theta_{\text{merged}} = \frac{1}{k}\sum_{i=1}^{k}\theta_i\). Wortsman et al. (2022) call the result a model soup: average several fine-tunes of one base (say, runs with different hyperparameters or seeds) and the soup frequently beats the single best ingredient on held-out data, at the inference cost of one model.
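As a minimal sketch of the uniform average, assume each checkpoint is a dictionary from parameter name to NumPy array. The function name `uniform_soup` and the toy values are illustrative, not taken from Wortsman et al.

```python
import numpy as np

def uniform_soup(checkpoints):
    """Average a list of checkpoints, each a dict: parameter name -> array."""
    names = checkpoints[0].keys()
    for ckpt in checkpoints:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")
    # Element-wise mean of every parameter tensor across the ingredients.
    return {n: np.mean([c[n] for c in checkpoints], axis=0) for n in names}

# Three hypothetical fine-tunes of one base (toy values, assumed for illustration).
runs = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 0.0]), "b": np.array([0.6])},
    {"w": np.array([2.0, 1.0]), "b": np.array([0.3])},
]
soup = uniform_soup(runs)
print({name: np.round(value, 3) for name, value in soup.items()})
```

The name check is the toy version of the "same base, same architecture" requirement: if the ingredients do not share a parameter layout, there is nothing meaningful to average.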
The interesting case is two models specialized for different things. Average a math fine-tune and a code fine-tune and you often get a model that is decent at both: a cheap multi-task model with no multi-task training set. The averaging coefficient \(\lambda\) is a dial on the interpolation \(\theta(\lambda) = (1-\lambda)\,\theta_A + \lambda\,\theta_B\): \(\lambda = 0\) is model A, \(\lambda = 1\) is model B, and the segment between them traces out a family of compromise models. Whether that segment stays low-loss or crosses a barrier is the central empirical question, and the instrument below lets you see both regimes.
# Interpolate between two toy weight vectors and evaluate a quadratic loss
# along the line. A shared-basin loss has its min INSIDE the segment;
# a two-basin loss has a BARRIER in the middle.
import numpy as np

A = np.array([0.4, -0.2, 1.0])  # "model A" weights
B = np.array([1.0, 0.6, 0.2])   # "model B" weights

# Connected case: one bowl centered at the average of A and B.
c_shared = 0.5 * (A + B)

def loss_shared(theta):
    return float(((theta - c_shared) ** 2).sum())

# Barrier case: two separate bowls, one at A and one at B.
def loss_barrier(theta):
    return float(min(((theta - A) ** 2).sum(), ((theta - B) ** 2).sum()))

lams = np.linspace(0, 1, 11)
print(" lam theta(merged) shared barrier")
for lam in lams:
    th = (1 - lam) * A + lam * B
    print(f" {lam:.1f} {np.round(th, 3)} {loss_shared(th):.3f} {loss_barrier(th):.3f}")

end_avg = 0.5 * (loss_barrier(A) + loss_barrier(B))
mid = loss_barrier(0.5 * (A + B))
print(f"\nbarrier height = mid - endpoint_avg = {mid - end_avg:.3f}")
print("shared-basin loss dips in the middle (merge helps); barrier loss")
print("peaks in the middle (merge lands between basins -> worse than either).")
Task vectors & arithmetic
Averaging whole checkpoints is blunt. A sharper object isolates exactly what a fine-tune changed. Ilharco et al. (2022) define a task vector as the difference between the fine-tuned weights and the pretrained weights they came from, \(\tau = \theta_{\text{ft}} - \theta_0\): it is the displacement in weight space that learning the task produced.
Once skills are vectors, you can do arithmetic with them. This is the surprising empirical result of the task-arithmetic line of work: adding and subtracting task vectors edits a model's behavior in predictable, compositional ways, with no fine-tuning at all.
| Operation | Form | Effect |
|---|---|---|
| Add (learn a skill) | θ₀ + α τA | installs task A into the base; α scales how strongly |
| Add many (multi-task) | θ₀ + α(τA + τB) | one model that does A and B — a merge built from vectors |
| Negate (forget / detoxify) | θ₀ − α τA | moves away from task A: unlearns a skill, or reduces a behavior such as toxicity |
| Analogy | τB + (τD − τC) | "B is to ? as C is to D": transfers the C→D relation onto B |
The negation result is the one practitioners reach for most. Build a task vector for an undesirable behavior by training briefly on toxic text to get \(\tau_{\text{toxic}}\), then subtract it: \(\theta_0 - \alpha\tau_{\text{toxic}}\) reduces toxic generation while largely preserving fluency, because you only moved along the toxicity direction and left the rest of the weights alone. The multi-skill case is the merge: \(\theta_0 + \alpha\sum_t \tau_t\) is a model that holds several skills at once, and with \(\alpha = 1/k\) over \(k\) fine-tunes of one base it is exactly the uniform soup of those fine-tunes.
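A quick numerical check makes the relation between the summed task vectors and a soup concrete. This is a sketch on random toy weights; the variable names and the choice of three fine-tunes are assumptions for illustration. With \(\alpha = 1/k\) the task-arithmetic merge coincides with the uniform average, and with \(\alpha = 1\) it does not.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = rng.normal(size=5)                        # toy base weights
finetunes = [theta0 + rng.normal(scale=0.1, size=5) for _ in range(3)]
taus = [ft - theta0 for ft in finetunes]           # one task vector per fine-tune
k = len(taus)

soup = np.mean(finetunes, axis=0)                  # uniform weight average
arith = theta0 + (1.0 / k) * np.sum(taus, axis=0)  # task arithmetic, alpha = 1/k
arith_full = theta0 + np.sum(taus, axis=0)         # task arithmetic, alpha = 1

print("alpha = 1/k matches the soup:", np.allclose(soup, arith))
print("alpha = 1   matches the soup:", np.allclose(soup, arith_full))
```

So the two views differ only in how the shared scale is chosen: the soup fixes it at \(1/k\), while task arithmetic leaves it as a knob.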
The honest caveat. Task arithmetic is real but not magic. The scaling \(\alpha\) is a tuned hyperparameter — too large and you overshoot into degraded text, too small and the skill never lands. And the operations interfere: adding two vectors is not the same as having both skills cleanly, because the vectors collide coordinate by coordinate. That interference is the subject of §6.3.
# Compute task vectors from base/fine-tuned toy params, then add and
# negate them and read off the effect on two toy skill scores.
import numpy as np

base = np.array([0.0, 0.0, 0.0, 0.0])  # pretrained origin
ft_A = np.array([0.6, 0.0, 0.3, 0.0])  # fine-tuned for task A
ft_B = np.array([0.0, 0.5, 0.0, 0.4])  # fine-tuned for task B
tau_A = ft_A - base  # EQ OM6.2
tau_B = ft_B - base

# Toy "skill probes": a skill score = dot(theta, its task direction).
def score(theta, tau):
    return float(theta @ tau)

def report(name, theta):
    print(f"{name:18s} A={score(theta, tau_A):+.3f} B={score(theta, tau_B):+.3f}")

report("base only", base)
report("merge A+B", base + tau_A + tau_B)          # both skills
report("negate A (forget)", base - tau_A + tau_B)  # drop A, keep B
report("amplify A x1.5", base + 1.5 * tau_A)       # scaled-up A

print("\ntau_A:", np.round(tau_A, 2), " tau_B:", np.round(tau_B, 2))
print("Adding raises both scores; negating A drives A below zero while")
print("B is untouched -- arithmetic on directions edits skills independently")
print("ONLY where the task vectors don't overlap (interference is covered in §6.3).")
Interference & the repair methods
Summing task vectors works coordinate by coordinate, and that is where it breaks. Two failure modes, named by Yadav et al. (2023), account for most of the lost performance when you merge many fine-tunes:
- Redundant-parameter interference. Most coordinates of a task vector are tiny noise; only a small fraction carry the skill. When you add many vectors, the mass of small noisy values from every task swamps the few large meaningful ones, diluting the signal.
- Sign conflict. Task A wants a coordinate to go up, task B wants the same coordinate to go down. Summed, they cancel — and you lose the contribution of both tasks on that parameter rather than picking one.
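Both failure modes can be measured before merging. The sketch below is a rough diagnostic, not something defined by Yadav et al.: the function name `interference_stats`, the 10% relative-magnitude threshold, and the toy matrix are all assumptions chosen for illustration.

```python
import numpy as np

def interference_stats(task_vectors, small=0.1):
    """Return (redundant fraction, sign-conflict fraction) for stacked task vectors.

    Rows are tasks, columns are coordinates. Assumes no row is all zeros.
    """
    T = np.asarray(task_vectors, dtype=float)
    # Redundancy: share of entries below `small` times their own task's largest magnitude.
    rel = np.abs(T) / np.abs(T).max(axis=1, keepdims=True)
    redundant_frac = float((rel < small).mean())
    # Sign conflict: coordinates where some task is positive and another is negative.
    # Note this also counts disagreements between tiny entries.
    has_pos = (T > 0).any(axis=0)
    has_neg = (T < 0).any(axis=0)
    conflict_frac = float((has_pos & has_neg).mean())
    return redundant_frac, conflict_frac

T_demo = np.array([
    [0.80,  0.01, -0.50, 0.02],  # toy task 1
    [0.70, -0.02,  0.60, 0.01],  # toy task 2
])
redundant, conflict = interference_stats(T_demo)
print(f"redundant entries: {redundant:.2f}  conflicting coordinates: {conflict:.2f}")
```

A high redundant fraction suggests trimming or random dropping will cost little; a high conflict fraction suggests a plain sum will cancel a lot of signal.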
TIES-Merging attacks both with three steps applied to the stacked task vectors. Trim: keep only the top-magnitude fraction of each vector, zeroing the redundant noise. Elect sign: for each coordinate, pick the sign whose total magnitude across tasks is larger, resolving the conflict by majority of mass. Disjoint mean: average only the values that agree with the elected sign, ignoring the dissenters. The result keeps each task's strong, agreeing contributions and discards the cancellation.
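Written out per coordinate \(p\), the three steps can be sketched as follows. The symbols \(\hat{\tau}_t\), \(\gamma\), \(\mathcal{A}^p\), and \(\tau_m\) are notation chosen here, and reusing \(\alpha\) for the final scale is an assumption made to stay consistent with the task-arithmetic section.

```latex
\begin{aligned}
\hat{\tau}_t &= \operatorname{trim}(\tau_t)
  && \text{keep the top-magnitude entries of each task vector, zero the rest} \\
\gamma^{p} &= \operatorname{sgn}\Big(\sum_{t} \hat{\tau}_t^{\,p}\Big)
  && \text{elected sign: the sign carrying more total magnitude} \\
\mathcal{A}^{p} &= \big\{\, t : \operatorname{sgn}(\hat{\tau}_t^{\,p}) = \gamma^{p} \,\big\}
  && \text{tasks that agree with the elected sign} \\
\tau_m^{p} &= \frac{1}{|\mathcal{A}^{p}|} \sum_{t \in \mathcal{A}^{p}} \hat{\tau}_t^{\,p}
  && \text{disjoint mean over the agreeing tasks only} \\
\theta_{\text{merged}} &= \theta_0 + \alpha\, \tau_m
  && \text{apply the merged task vector to the base}
\end{aligned}
```

Trimmed entries are zero, so they never belong to \(\mathcal{A}^{p}\); when no task survives at a coordinate, \(\tau_m^{p}\) is taken to be zero.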
DARE (Yu et al., 2023; "Language Models are Super Mario") is an even simpler preprocessing step you can put in front of any merge. Randomly drop a fraction \(p\) of each task vector's entries to zero, then rescale the survivors by \(1/(1-p)\) so the expected vector is preserved. The striking finding is that you can drop 90% or more of a fine-tune's delta with almost no loss — most of a task vector is redundant — and the sparsified vectors collide far less when summed.
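The drop-and-rescale step is small enough to sketch directly. This version assumes each entry is dropped independently with probability \(p\); the function name `dare` and the toy task vector (given a nonzero mean only so the expectation check is easy to read) are illustrative.

```python
import numpy as np

def dare(tau, p, rng):
    """Drop each entry of tau with probability p, rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = rng.random(tau.shape) >= p          # True with probability 1 - p
    return np.where(keep, tau / (1.0 - p), 0.0)

rng = np.random.default_rng(0)
tau = rng.normal(loc=0.05, scale=0.01, size=100_000)   # toy task vector
sparse = dare(tau, p=0.9, rng=rng)

kept_frac = float((sparse != 0).mean())
print(f"fraction of entries kept: {kept_frac:.3f}")
print(f"mean before: {tau.mean():.4f}  mean after: {sparse.mean():.4f}")
```

The rescaling is what keeps the expected vector unchanged: each surviving entry is ten times larger at \(p = 0.9\), which compensates for the nine in ten that were zeroed.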
For the special case of two models, practitioners often prefer SLERP — spherical linear interpolation — over a straight average. SLERP walks the arc on the hypersphere between the two weight vectors at constant angular speed, preserving their norm rather than letting it sag toward the origin the way a linear average can. And when you have access to a little data, Fisher-weighted merging (Matena & Raffel, 2022) replaces the uniform average with a per-parameter weighting by each model's Fisher information: parameters a model is confident about (high curvature of its loss) get more say in the merge.
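Both ideas fit in a few lines on flat weight vectors. The sketch below is an assumption-laden illustration: `slerp` uses the standard spherical-interpolation formula with a linear fallback for nearly parallel vectors, and `fisher_merge` takes diagonal Fisher estimates as given inputs (the numbers here are made up) rather than estimating them from data.

```python
import numpy as np

def slerp(a, b, lam, eps=1e-8):
    """Spherical linear interpolation between flat weight vectors a and b."""
    a_u = a / np.linalg.norm(a)
    b_u = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(a_u @ b_u, -1.0, 1.0))  # angle between the two vectors
    if omega < eps:                                   # nearly parallel: plain lerp is fine
        return (1 - lam) * a + lam * b
    s = np.sin(omega)
    return (np.sin((1 - lam) * omega) / s) * a + (np.sin(lam * omega) / s) * b

def fisher_merge(thetas, fishers, eps=1e-8):
    """Per-parameter weighted average, weighted by each model's diagonal Fisher."""
    thetas = np.asarray(thetas, dtype=float)
    fishers = np.asarray(fishers, dtype=float)
    return (fishers * thetas).sum(axis=0) / (fishers.sum(axis=0) + eps)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
lerp_mid = 0.5 * (a + b)
slerp_mid = slerp(a, b, 0.5)
print("lerp  norm:", round(float(np.linalg.norm(lerp_mid)), 3))   # norm sags below 1
print("slerp norm:", round(float(np.linalg.norm(slerp_mid)), 3))  # norm stays at 1

# Each model is "confident" (high Fisher) about a different coordinate.
fisher_mid = fisher_merge([a, b], fishers=[[9.0, 1.0], [1.0, 9.0]])
print("fisher merge:", np.round(fisher_mid, 3))
```

In the Fisher example each coordinate ends up close to the value held by the model with more Fisher mass there, where a uniform average would have split the difference.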
# Tiny TIES merge: trim, elect sign, disjoint mean -- compare to naive sum.
import numpy as np

# 3 task vectors over 6 coordinates (rows = tasks, columns 0-indexed).
T = np.array([
    [ 0.80, -0.05,  0.60, -0.02,  0.01,  0.50],  # task 1
    [ 0.70,  0.04, -0.55,  0.03, -0.02, -0.40],  # task 2 (conflicts with task 1 on cols 2, 5)
    [-0.10,  0.90,  0.05,  0.85,  0.02,  0.45],  # task 3
])

def trim(tau, keep=0.5):  # keep top-|.| fraction per row
    out = np.zeros_like(tau)
    for i, row in enumerate(tau):
        k = max(1, int(round(keep * row.size)))
        idx = np.argsort(-np.abs(row))[:k]
        out[i, idx] = row[idx]
    return out

Th = trim(T, keep=0.5)
gamma = np.sign(Th.sum(axis=0))  # elected sign per coordinate (EQ OM6.3)
gamma[gamma == 0] = 1            # nothing survived the trim: default to +
agree = (np.sign(Th) == gamma) & (Th != 0)
n_agree = agree.sum(0)
# Disjoint mean: average only the agreeing entries; stays 0 where none agree.
merged = np.where(agree, Th, 0).sum(0) / np.maximum(n_agree, 1)

print("naive sum :", np.round(T.sum(0), 3))
print("elected sign:", gamma.astype(int).tolist())
print("TIES merged :", np.round(merged, 3))
print("\nOn conflict column 2 the naive sum nearly cancels (0.60 - 0.55 + 0.05);")
print("on column 5 task 2 pulls against tasks 1 and 3. TIES elects the sign with")
print("more total magnitude and averages only the agreeing tasks, so the merged")
print("value keeps a clear, uncancelled contribution.")
The geometry that makes it work
None of this should work. Averaging the weights of two neural networks is, in general, nonsense — the loss landscape is wildly non-convex, and the midpoint of two random good models is usually a bad one. Merging is feasible only inside a specific geometric regime, and naming that regime tells you exactly when to trust a merge and when to expect it to fail.
The enabling phenomenon is linear mode connectivity: when two models are fine-tuned from the same pretrained checkpoint, the straight line between them in weight space typically stays at low loss, with no barrier (this is the shared-basin curve of INSTRUMENT OM6.1). Shared initialization tends to keep both fine-tunes inside a single connected low-loss basin, so convex combinations of them usually stay low-loss as well. This is the geometric account of why a soup can beat its ingredients: averaging moves toward a flatter, more central point in the basin, which tends to generalize better.
The crucial caveat is permutation symmetry. A neural network's hidden units can be permuted — relabel the neurons of a layer and permute the next layer's weights to match, and you have an identical function with a completely different weight vector. Two independently trained models almost always sit in different permutation "copies" of the same basin, which is why the raw line between them crosses a barrier. Ainsworth et al. (2023) showed this barrier can often be removed by first finding the permutation that aligns one model's units to the other's, after which even independently trained models become linearly connected. The practical takeaway for merging is sharp:
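The symmetry is easy to see on a tiny two-layer ReLU network. This sketch is not the Git Re-Basin matching algorithm: the permutation here is known in advance, whereas the hard part in practice is finding it. The network sizes, names, and random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 3, 4, 2
W1, b1 = rng.normal(size=(d_hid, d_in)), rng.normal(size=d_hid)
W2, b2 = rng.normal(size=(d_out, d_hid)), rng.normal(size=d_out)

def mlp(X, W1, b1, W2, b2):
    """Two-layer ReLU network applied to a batch X of shape (d_in, n)."""
    return W2 @ np.maximum(W1 @ X + b1[:, None], 0.0) + b2[:, None]

perm = np.array([2, 0, 3, 1])          # relabel the hidden units
W1p, b1p = W1[perm], b1[perm]          # permute rows of layer 1
W2p = W2[:, perm]                      # permute columns of layer 2 to match

X = rng.normal(size=(d_in, 16))
out_ref = mlp(X, W1, b1, W2, b2)
same_fn = np.allclose(out_ref, mlp(X, W1p, b1p, W2p, b2))
same_w = np.allclose(W1, W1p)
print("same function:", same_fn, "| same weights:", same_w)

# Averaging the two copies directly mixes unrelated hidden units.
out_naive = mlp(X, 0.5 * (W1 + W1p), 0.5 * (b1 + b1p), 0.5 * (W2 + W2p), b2)
# Undoing the permutation first puts both copies in the same coordinate system.
inv = np.argsort(perm)
W1a, b1a, W2a = W1p[inv], b1p[inv], W2p[:, inv]
out_aligned = mlp(X, 0.5 * (W1 + W1a), 0.5 * (b1 + b1a), 0.5 * (W2 + W2a), b2)

print("naive average, max output gap  :", float(np.abs(out_naive - out_ref).max()))
print("aligned average, max output gap:", float(np.abs(out_aligned - out_ref).max()))
```

The two copies compute the same function, yet their raw midpoint does not; only after alignment does the average land back on the original network.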
Merge models that share a base, architecture, and tokenizer, and you are inside one basin with no permutation problem, so averaging and task arithmetic usually work, subject to the interference described above. Merge models from different initializations and you must align their permutations first, or expect a barrier. This single distinction explains nearly every merging success and failure in the wild: community fine-tunes of one base model (Llama, Mistral) merge cleanly; two from-scratch models do not.
Practice & honest limits
Why merge at all, when you could multi-task fine-tune? Because merging is nearly free. It needs no training data, no gradient steps, and no GPU, only enough memory to hold the checkpoints. The practical wins follow directly:
- Cheap multi-task models. Combine separately-trained specialists into one model that does all their jobs, without ever assembling a joint training set or running a joint fine-tune. This is the dominant use.
- Combining community fine-tunes. The open-model ecosystem (Open Models · Ch 03) produces thousands of fine-tunes of a handful of bases. Merging lets you blend, say, a coding fine-tune and an instruction fine-tune of the same Llama into one model, and it is the technique behind many top-ranked community models on open leaderboards.
- Behavior editing without data. Negate a toxicity task vector to detoxify; subtract a memorized-data vector to reduce regurgitation; scale a style vector up or down. Edits you would otherwise need a fine-tuning run to make.
- Soups for robustness. Average your own hyperparameter sweep into one checkpoint that generalizes better than any single run, at the cost of one model.
The standard open-source toolkit is mergekit, which implements soup, task arithmetic, TIES, DARE, SLERP, and Fisher merging behind a single config file. But the tooling hides none of the genuine limits, and a serious caveat list belongs in any honest treatment:
| Limit | Why | Consequence |
|---|---|---|
| Same base required | weights only share a coordinate system if they share an origin | cannot merge across different base models or tokenizers |
| Same architecture | parameter \(i\) must be the same role in every ingredient | no merging a 7B into a 13B |
| No free lunch | a merge is a compromise; capacity is finite and skills can trade off | merged model often below each specialist on its own task |
| Hyperparameters matter | \(\lambda\), \(\alpha\), trim fraction, DARE \(p\) all change the result | needs a search, not a single setting |
| Evaluation is mandatory | merge quality is unpredictable a priori; barriers and interference are silent | always benchmark — a merge that looks fine can be quietly degraded |
The last row is the one to internalize. Merging is fast enough that it invites a "ship it" reflex, but a merge can satisfy every algebraic constraint and still land on a barrier, overshoot a scale, or trade away a skill you needed. A merge is a hypothesis, not a result — it is unproven until it clears the same evaluation suite you would demand of a fine-tune. The cost saving is in the training you skipped, not in the evaluation, which you still owe in full.
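The search-then-evaluate habit can be sketched on a toy problem. Everything here is a stand-in: `val_loss` is a hypothetical quadratic that plays the role of a held-out evaluation suite, and the "ideal" scale of 0.8 is an assumption built into the toy so the sweep has something to find.

```python
import numpy as np

theta0 = np.zeros(3)                      # toy base weights
tau = np.array([0.5, -0.2, 0.4])          # toy task vector
target = theta0 + 0.8 * tau               # assumed: the best point sits at alpha = 0.8

def val_loss(theta):
    """Hypothetical stand-in for a held-out evaluation; lower is better."""
    return float(((theta - target) ** 2).sum())

alphas = np.linspace(0.0, 2.0, 11)        # candidate scales 0.0, 0.2, ..., 2.0
losses = [val_loss(theta0 + a * tau) for a in alphas]
best = float(alphas[int(np.argmin(losses))])

for a, loss in zip(alphas, losses):
    print(f"alpha={a:.1f}  val_loss={loss:.4f}")
print(f"best alpha on this toy evaluation: {best:.1f}")
```

In a real merge the same loop would run over every knob in the table, and the selected setting would still need a final check on data that played no part in the search.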
Merging buys breadth, sometimes at the cost of peak. The cheap multi-skill model is real, and for many deployments "good at several things on one GPU" beats "excellent at one." But do not expect a merge of a math specialist and a code specialist to match either on its home benchmark. Interference, finite capacity, and compromise are structural, not bugs to be tuned away. State the trade you made in numbers, the same discipline §6.5 of the previous chapter demanded of safety claims.
That closes the open-models loop: choose a model, run it, fine-tune it, train it well, break-then-harden it, and now combine fine-tunes in weight space for free. Merging is the payoff of working with open weights — full access to the parameters is exactly what lets you average, add, and subtract them. Return to the index to branch into the volumes this track builds on: fine-tuning and LoRA (Vol II · Ch 06), post-training and alignment (Vol II · Ch 05), and the deep-learning track for the loss-landscape geometry behind linear mode connectivity.
References
- Ilharco, G., Ribeiro, M. T., Wortsman, M. et al. (2022). Editing Models with Task Arithmetic.
- Wortsman, M., Ilharco, G., Gadre, S. Y. et al. (2022). Model Soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.
- Yadav, P., Tam, D., Choshen, L., Raffel, C. & Bansal, M. (2023). TIES-Merging: Resolving Interference When Merging Models.
- Yu, L., Yu, B., Yu, H., Huang, F. & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE).
- Matena, M. & Raffel, C. (2022). Merging Models with Fisher-Weighted Averaging.
- Ainsworth, S. K., Hayase, J. & Srinivasa, S. (2023). Git Re-Basin: Merging Models modulo Permutation Symmetries.