Model Merging & Task Vectors

6.1

Merging in weight space

Fine-tuning produces a checkpoint: a vector of weights $\theta$ that differs from the base model it started from. The conventional way to combine two fine-tunes is an ensemble — run both models, average their outputs. That doubles inference cost and memory. Model merging takes the cheaper, stranger route: average the weights themselves, once, offline, and serve a single model. No second forward pass, no training, no data. The merged model is just $\theta_{\text{merged}}$, a new point in the same parameter space.

The plainest version is a uniform weight average. Given checkpoints $\theta_1, \dots, \theta_k$ that all started from the same base, the merge is their mean. Wortsman et al. (2022) call the result a model soup: average several fine-tunes of one base — say, runs with different hyperparameters or seeds — and the soup frequently beats the single best ingredient on held-out data, at the inference cost of one model.

EQ OM6.1 — WEIGHT AVERAGING (MODEL SOUP) $$ \theta_{\text{soup}} \;=\; \frac{1}{k}\sum_{j=1}^{k} \theta_j, \qquad \text{or weighted: } \theta_{\text{merged}} \;=\; \sum_{j=1}^{k} \lambda_j\, \theta_j, \;\; \sum_j \lambda_j = 1 $$

$\theta_j$ is the full weight vector of the $j$-th fine-tune; $\lambda_j$ are mixing coefficients that sum to one. The uniform soup sets $\lambda_j = 1/k$. The merge is element-wise: every coordinate of the result is a convex combination of the same coordinate across ingredients. This only makes sense because all ingredients share the same base, architecture, and tokenizer — coordinate $i$ means the same thing in every model. Merge two models with different bases and you are averaging unrelated numbers. The soup works because fine-tuning from a shared initialization keeps the checkpoints in one connected, low-loss basin (§6.4); average two randomly initialized models and you land in a high-loss region between them.
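The averaging in EQ OM6.1 can be sketched in a few lines. This is a minimal illustration, assuming each checkpoint is a plain dict from parameter name to NumPy array (a stand-in for a real state dict); the function name `soup` and the toy parameters `w` and `b` are hypothetical, not taken from any library.

```python
import numpy as np

def soup(checkpoints, lambdas=None):
    """Weighted element-wise average of checkpoints (EQ OM6.1).

    checkpoints: list of dicts {parameter name: np.ndarray}, assumed to be
                 fine-tuned from the same base (shapes alone cannot prove that).
    lambdas:     mixing coefficients summing to 1; None means uniform 1/k.
    """
    k = len(checkpoints)
    if k == 0:
        raise ValueError("need at least one checkpoint")
    if lambdas is None:
        lambdas = [1.0 / k] * k
    if len(lambdas) != k or not np.isclose(sum(lambdas), 1.0):
        raise ValueError("need one coefficient per checkpoint, summing to 1")
    ref = checkpoints[0]
    for ckpt in checkpoints[1:]:
        # Matching names and shapes is necessary, not sufficient, for a valid merge.
        if ckpt.keys() != ref.keys():
            raise ValueError("checkpoints have different parameter names")
        for name in ref:
            if ckpt[name].shape != ref[name].shape:
                raise ValueError(f"shape mismatch at {name}")
    return {name: sum(lam * ckpt[name] for lam, ckpt in zip(lambdas, checkpoints))
            for name in ref}

# Two toy "fine-tunes" with identical parameter names and shapes.
theta_1 = {"w": np.array([0.4, -0.2]), "b": np.array([1.0])}
theta_2 = {"w": np.array([1.0,  0.6]), "b": np.array([0.2])}

uniform  = soup([theta_1, theta_2])                         # lambda_j = 1/2
weighted = soup([theta_1, theta_2], lambdas=[0.75, 0.25])   # leans toward theta_1
print("uniform :", uniform)
print("weighted:", weighted)
```

The name and shape checks catch an architecture mismatch, but nothing in the arithmetic can detect two models that share an architecture and differ in base; that condition has to be known from provenance.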

The interesting case is two models specialized for different things. Average a math fine-tune and a code fine-tune and you often get a model that is decent at both: a cheap multi-task model with no multi-task training set. The averaging coefficient $\lambda$ is a dial: $\lambda = 0$ is one model, $\lambda = 1$ is the other, and the segment between them traces out a family of compromise models. Whether that segment stays low-loss or crosses a barrier is the central empirical question, and the instrument below lets you see both regimes.

Two checkpoints differ in one coordinate: $ \theta_A = 0.4 $ and $ \theta_B = 1.0 $. Using EQ OM6.1 as a two-model interpolation $ \theta = (1-\lambda)\theta_A + \lambda\theta_B $ with $ \lambda = 0.25 $, what is the merged value of that coordinate?

$ \theta = (1 - 0.25)(0.4) + (0.25)(1.0) = 0.75 \times 0.4 + 0.25 \times 1.0 = 0.30 + 0.25 = $ 0.55. Interpolation weights stay between the two endpoints; at $\lambda = 0.25$ the result sits a quarter of the way from $A$ toward $B$.

INSTRUMENT OM6.1 — WEIGHT-SPACE INTERPOLATION · EQ OM6.1 · LINEAR MODE CONNECTIVITY vs A LOSS BARRIER

Controls: MIX COEFFICIENT λ (A → B, initial 0.50), BARRIER HEIGHT (mid-path bump, initial 0.00), SHARED BASE? toggle. Readouts: MERGED LOSS @ λ, BARRIER (max − endpoints), VERDICT.

The curve is the (synthetic) loss along the straight line from model A (λ=0) to model B (λ=1); the dot is your current merge. With SHARED BASE and zero barrier the path stays low — that is linear mode connectivity, and any λ gives a usable merge. Drag BARRIER HEIGHT up, or switch to DIFFERENT INIT, and a hump appears in the middle: the merge at λ≈0.5 is now worse than either endpoint, and naive averaging fails. The whole feasibility of merging is the question of whether that hump is there.

PYTHON · RUNNABLE IN-BROWSER

# Interpolate between two toy weight vectors and evaluate a quadratic loss
# along the line. A shared-basin loss has its min INSIDE the segment;
# a two-basin loss has a BARRIER in the middle.
import numpy as np

A = np.array([0.4, -0.2, 1.0])     # "model A" weights
B = np.array([1.0,  0.6, 0.2])     # "model B" weights

# Connected case: one bowl centered at the average of A and B.
c_shared = 0.5 * (A + B)
def loss_shared(theta):  return float(((theta - c_shared) ** 2).sum())

# Barrier case: two separate bowls, one at A and one at B.
def loss_barrier(theta):
    return float(min(((theta - A) ** 2).sum(), ((theta - B) ** 2).sum()))

lams = np.linspace(0, 1, 11)
print("  lam   theta(merged)            shared   barrier")
for lam in lams:
    th = (1 - lam) * A + lam * B
    print(f"  {lam:.1f}  {np.round(th,3)}   {loss_shared(th):.3f}    {loss_barrier(th):.3f}")

end_avg = 0.5 * (loss_barrier(A) + loss_barrier(B))
mid     = loss_barrier(0.5 * (A + B))
print(f"\nbarrier height = mid - endpoint_avg = {mid - end_avg:.3f}")
print("shared-basin loss dips in the middle (merge helps); barrier loss")
print("peaks in the middle (merge lands between basins -> worse than either).")


6.2

Task vectors & arithmetic

Averaging whole checkpoints is blunt. A sharper object isolates exactly what a fine-tune changed. Ilharco et al. (2022) define a task vector as the difference between the fine-tuned weights and the pretrained weights they came from: it is the displacement in weight space that learning the task produced.

EQ OM6.2 — TASK VECTOR $$ \tau \;=\; \theta_{\text{finetuned}} \;-\; \theta_{\text{pretrained}} $$

$\theta_{\text{pretrained}}$ is the base model; $\theta_{\text{finetuned}}$ is the same model after training on a task. $\tau$ is the same shape as the weights — one number per parameter — and it points "toward" the task. The base model is the origin of this coordinate system, which is exactly why every model you mix must share that origin. A task vector from a different base is a vector in a different space. To apply a task vector you add it back to a base: $\theta_{\text{pretrained}} + \tau$ reconstructs the fine-tune; scaling it, $\theta_{\text{pretrained}} + \alpha\tau$, dials the skill up or down.

Once skills are vectors, you can do arithmetic with them. This is the surprising empirical result of the task-arithmetic line of work: adding and subtracting task vectors edits a model's behavior in predictable, compositional ways, with no fine-tuning at all.

Operation	Form	Effect
Add (learn a skill)	θ₀ + α τ_A	installs task A into the base; α scales how strongly
Add many (multi-task)	θ₀ + α(τ_A + τ_B)	one model that does A and B — a merge built from vectors
Negate (forget / detoxify)	θ₀ − α τ_A	moves away from task A: unlearns a skill, or reduces a behavior such as toxicity
Analogy	τ_B + (τ_C − τ_D)	"D is to C as B is to ?": transfers the D→C shift onto B, giving a vector for a task that has no fine-tune of its own

The negation result is the one practitioners reach for most. Build a task vector for an undesirable behavior (train briefly on toxic text to get $\tau_{\text{toxic}}$), then subtract it: $\theta_0 - \alpha\tau_{\text{toxic}}$ reduces toxic generation while largely preserving fluency, because you moved only along the toxicity direction and left the rest of the weights alone. The multi-skill case is the merge: $\theta_0 + \alpha\sum_t \tau_t$ is a model that holds several skills at once. With $\alpha = 1/k$ for $k$ task vectors it is exactly the uniform soup of the underlying fine-tunes, since $\frac{1}{k}\sum_t \theta_t = \theta_0 + \frac{1}{k}\sum_t \tau_t$; a larger $\alpha$ steps outside the convex hull that a soup is confined to.
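The relation between the multi-skill sum and a soup can be checked numerically. The sketch below uses hypothetical toy values and a hypothetical helper `apply_vectors`; it shows that $\alpha = 1$ on a single vector reconstructs the fine-tune, that $\alpha = 1/k$ on the sum reproduces the uniform average, and that $\alpha = 1$ on the sum lands outside the range spanned by the fine-tunes.

```python
import numpy as np

theta_0 = np.array([0.9, 0.0, -0.3])    # toy base weights (hypothetical)
ft_A    = np.array([1.4, 0.2, -0.3])    # toy fine-tune for task A
ft_B    = np.array([1.6, 0.0,  0.1])    # toy fine-tune for task B

tau_A = ft_A - theta_0                  # EQ OM6.2
tau_B = ft_B - theta_0

def apply_vectors(alpha, *taus):
    """Return theta_0 + alpha * (sum of the given task vectors)."""
    return theta_0 + alpha * sum(taus)

recon    = apply_vectors(1.0, tau_A)            # should equal ft_A
arith    = apply_vectors(0.5, tau_A, tau_B)     # alpha = 1/k with k = 2
soup_avg = 0.5 * (ft_A + ft_B)                  # uniform soup of the fine-tunes
summed   = apply_vectors(1.0, tau_A, tau_B)     # raw sum, alpha = 1

print("reconstructs A :", np.allclose(recon, ft_A))
print("alpha=1/k==soup:", np.allclose(arith, soup_avg))
print("raw sum        :", np.round(summed, 3))
```

On the first coordinate the raw sum exceeds both fine-tuned values, which is the stacking effect a smaller $\alpha$ is meant to control.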

The honest caveat. Task arithmetic is real but not magic. The scaling $\alpha$ is a tuned hyperparameter — too large and you overshoot into degraded text, too small and the skill never lands. And the operations interfere: adding two vectors is not the same as having both skills cleanly, because the vectors collide coordinate by coordinate. That interference is the subject of §6.3.

For one coordinate, the base is $ \theta_0 = 0.9 $, fine-tune A gives $ 1.4 $ and fine-tune B gives $ 1.6 $. Form both task vectors (EQ OM6.2) and build the multi-task merge $ \theta_0 + \tau_A + \tau_B $ (α = 1). What is the merged coordinate?

$ \tau_A = 1.4 - 0.9 = 0.5 $; $ \tau_B = 1.6 - 0.9 = 0.7 $. Then $ \theta_0 + \tau_A + \tau_B = 0.9 + 0.5 + 0.7 = $ 2.1. Note this overshoots both endpoints — adding raw vectors stacks their pulls, which is precisely the over-shoot that interference-aware methods damp.

INSTRUMENT OM6.2 — TASK-VECTOR ARITHMETIC · EQ OM6.2 · ADD / NEGATE TWO TASK VECTORS ON A TOY BASE

Controls: SCALE α_A (task A, initial +1.0), SCALE α_B (task B, initial +1.0). Readouts: SKILL A SCORE, SKILL B SCORE, MERGED θ (3 coords).

The base sits at the origin; $\tau_A$ and $\tau_B$ are two fixed task directions. Move $\alpha_A,\alpha_B$ to build $\theta_0 + \alpha_A\tau_A + \alpha_B\tau_B$. At +1 / +1 you get the multi-task merge — both skill scores rise. Drive $\alpha_A$ negative to negate task A (its score falls below baseline — unlearning), while task B is untouched. Push a scale past ±1 and watch a toy "degradation" penalty grow: too much α corrupts the model, the over-shoot warning of §6.2.

PYTHON · RUNNABLE IN-BROWSER

# Compute task vectors from base/fine-tuned toy params, then add and
# negate them and read off the effect on two toy skill scores.
import numpy as np

base = np.array([0.0, 0.0, 0.0, 0.0])     # pretrained origin
ft_A = np.array([0.6, 0.0, 0.3, 0.0])     # fine-tuned for task A
ft_B = np.array([0.0, 0.5, 0.0, 0.4])     # fine-tuned for task B

tau_A = ft_A - base                        # EQ OM6.2
tau_B = ft_B - base

# Toy "skill probes": a skill score = dot(theta, its task direction).
def score(theta, tau):  return float(theta @ tau)

def report(name, theta):
    print(f"{name:18s} A={score(theta,tau_A):+.3f}  B={score(theta,tau_B):+.3f}")

report("base only",        base)
report("merge A+B",        base + tau_A + tau_B)        # both skills
report("negate A (forget)",base - tau_A + tau_B)        # drop A, keep B
report("amplify A x1.5",   base + 1.5*tau_A)            # scaled-up A
print("\ntau_A:", np.round(tau_A,2), " tau_B:", np.round(tau_B,2))
print("Adding raises both scores; negating A drives A below zero while")
print("B is untouched -- arithmetic on directions edits skills independently")
print("ONLY where the task vectors don't overlap (interference is in §6.3).")


6.3

Interference & the repair methods

Summing task vectors works coordinate by coordinate, and that is where it breaks. Two failure modes, named by Yadav et al. (2023), account for most of the lost performance when you merge many fine-tunes:

Redundant-parameter interference. Most coordinates of a task vector are tiny noise; only a small fraction carry the skill. When you add many vectors, the mass of small noisy values from every task swamps the few large meaningful ones, diluting the signal.
Sign conflict. Task A wants a coordinate to go up, task B wants the same coordinate to go down. Summed, they cancel — and you lose the contribution of both tasks on that parameter rather than picking one.

TIES-Merging attacks both with three steps applied to the stacked task vectors. Trim: keep only the top-magnitude fraction of each vector, zeroing the redundant noise. Elect sign: for each coordinate, pick the sign whose total magnitude across tasks is larger, resolving the conflict by majority of mass. Disjoint mean: average only the values that agree with the elected sign, ignoring the dissenters. The result keeps each task's strong, agreeing contributions and discards the cancellation.

EQ OM6.3 — TIES SIGN ELECTION & DISJOINT MEAN $$ \gamma_m = \operatorname{sgn}\!\Big(\sum_{t=1}^{n}\hat\tau_t\Big), \qquad \tau^{\,\text{merged}} \;=\; \frac{1}{|\mathcal{A}|}\sum_{t\in\mathcal{A}} \hat\tau_t, \quad \mathcal{A}=\{\,t : \operatorname{sgn}(\hat\tau_t)=\gamma_m\,\} $$

$\hat\tau_t$ is the trimmed task vector for task $t$ at a given coordinate; $\gamma_m$ is the elected sign for that coordinate — the sign of the summed (signed) values, i.e. the direction holding more total magnitude. $\mathcal{A}$ is the set of tasks that agree with the elected sign; the merged value is the mean over only those agreeing tasks. The key move is the disjoint mean: a task pulling the wrong way contributes nothing instead of cancelling a task pulling the right way. Sign election is decided by summed magnitude, so a single large value can outvote several small opposing ones.

DARE (Yu et al., 2023; "Language Models are Super Mario") is an even simpler preprocessing step you can put in front of any merge. Randomly drop a fraction $p$ of each task vector's entries to zero, then rescale the survivors by $1/(1-p)$ so the expected vector is preserved. The striking finding is that you can drop 90% or more of a fine-tune's delta with almost no loss — most of a task vector is redundant — and the sparsified vectors collide far less when summed.

EQ OM6.4 — DARE DROP-AND-RESCALE $$ m_i \sim \text{Bernoulli}(1-p), \qquad \tilde\tau_i \;=\; \frac{m_i}{1-p}\,\tau_i, \qquad \mathbb{E}[\tilde\tau_i] = \tau_i $$

$p$ is the drop rate; $m_i$ is a 0/1 mask that keeps each entry with probability $1-p$; the $1/(1-p)$ factor rescales survivors so the vector is unbiased in expectation (the same trick as inverted dropout). Sparsity is the whole point: thinned vectors overlap less, so their sum has fewer sign conflicts. DARE composes — apply it before TIES or a soup. It works because fine-tuning deltas are extremely redundant, not because dropping is harmless in general.
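Drop-and-rescale as written in EQ OM6.4 can be sketched directly. This is an illustrative toy, not the authors' reference code: the helper name `dare`, the Gaussian toy delta, and the trial count are all assumptions chosen to keep the check fast.

```python
import numpy as np

def dare(tau, p, rng):
    """Drop each entry with probability p, rescale survivors by 1/(1-p) (EQ OM6.4)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("drop rate p must be in [0, 1)")
    keep = rng.random(tau.shape) >= p          # True with probability 1 - p
    return np.where(keep, tau / (1.0 - p), 0.0)

rng = np.random.default_rng(0)
tau = rng.normal(0.0, 0.01, size=1_000)        # toy task vector (hypothetical values)
p = 0.9

sparse = dare(tau, p, rng)
kept_fraction = float(np.mean(sparse != 0))

# Unbiasedness check: averaging many independent masks should recover tau.
trials = 2_000
avg = np.mean([dare(tau, p, rng) for _ in range(trials)], axis=0)
rel_err = float(np.abs(avg - tau).mean() / np.abs(tau).mean())

print(f"kept fraction      : {kept_fraction:.3f} (target {1 - p:.1f})")
print(f"mean relative error: {rel_err:.3f} after {trials} masks")
```

A single DARE draw is a noisy estimate of the original vector: each surviving entry is ten times larger at $p = 0.9$. The expectation is preserved, not the individual sample, which is why the method leans on the redundancy of fine-tuning deltas.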

For the special case of two models, practitioners often prefer SLERP — spherical linear interpolation — over a straight average. SLERP walks the arc on the hypersphere between the two weight vectors at constant angular speed, preserving their norm rather than letting it sag toward the origin the way a linear average can. And when you have access to a little data, Fisher-weighted merging (Matena & Raffel, 2022) replaces the uniform average with a per-parameter weighting by each model's Fisher information: parameters a model is confident about (high curvature of its loss) get more say in the merge.

EQ OM6.5 — SLERP (TWO MODELS) & FISHER-WEIGHTED MERGE $$ \text{SLERP: } \theta(\lambda) = \frac{\sin((1-\lambda)\Omega)}{\sin\Omega}\,\theta_A + \frac{\sin(\lambda\Omega)}{\sin\Omega}\,\theta_B, \quad \cos\Omega = \tfrac{\theta_A\cdot\theta_B}{\lVert\theta_A\rVert\lVert\theta_B\rVert} $$ $$ \text{Fisher: } \theta^{\star}_i = \frac{\sum_j F^{(j)}_i\,\theta^{(j)}_i}{\sum_j F^{(j)}_i} $$

$\Omega$ is the angle between the two weight vectors; as $\Omega\to 0$ (nearly parallel models) SLERP reduces to plain linear interpolation, so it only differs meaningfully when the models point in genuinely different directions. Because the formula divides by $\sin\Omega$, implementations fall back to the linear form when $\Omega$ is close to zero. $F^{(j)}_i$ is the diagonal Fisher information of model $j$ at parameter $i$, estimated cheaply as the mean squared gradient on a data sample. Fisher merging is a precision-weighted average: it is the closest-to-principled merge, but it needs data and gradients, which uniform averaging and task arithmetic do not.
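Both formulas in EQ OM6.5 fit in a short sketch. It assumes flat NumPy weight vectors, and the Fisher values are made up for illustration (in practice they would come from squared gradients on data); the function names `slerp` and `fisher_merge` are hypothetical.

```python
import numpy as np

def slerp(theta_a, theta_b, lam, eps=1e-8):
    """Spherical interpolation between two flat weight vectors (EQ OM6.5)."""
    cos_omega = theta_a @ theta_b / (np.linalg.norm(theta_a) * np.linalg.norm(theta_b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.sin(omega) < eps:
        # Nearly parallel (or antiparallel): the formula is unstable, use linear form.
        return (1.0 - lam) * theta_a + lam * theta_b
    return (np.sin((1.0 - lam) * omega) * theta_a
            + np.sin(lam * omega) * theta_b) / np.sin(omega)

def fisher_merge(thetas, fishers, eps=1e-12):
    """Per-parameter Fisher-weighted average; rows are models, columns parameters."""
    thetas, fishers = np.asarray(thetas, float), np.asarray(fishers, float)
    return (fishers * thetas).sum(axis=0) / (fishers.sum(axis=0) + eps)

# SLERP vs linear midpoint for two orthogonal unit vectors.
theta_a = np.array([1.0, 0.0])
theta_b = np.array([0.0, 1.0])
mid_slerp = slerp(theta_a, theta_b, 0.5)
mid_lerp  = 0.5 * (theta_a + theta_b)
print("norm at SLERP midpoint :", round(float(np.linalg.norm(mid_slerp)), 4))
print("norm at linear midpoint:", round(float(np.linalg.norm(mid_lerp)), 4))

# Fisher merge: model 0 is three times as "confident" on parameter 0.
merged = fisher_merge(thetas=[[1.0, 2.0], [3.0, 4.0]],
                      fishers=[[3.0, 1.0], [1.0, 1.0]])   # hypothetical Fisher values
print("Fisher-merged          :", merged)
```

The norm comparison is the sag described above: the linear midpoint of two orthogonal unit vectors has norm about 0.71, while the SLERP midpoint stays on the unit sphere. With equal Fisher values on a parameter, the Fisher merge reduces to the plain mean for that parameter.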

At one coordinate, three trimmed task vectors hold the values $ +0.2,\ +0.1,\ -0.6 $. By TIES sign election (EQ OM6.3), which sign $ \gamma_m $ is elected? Answer $ +1 $ or $ -1 $.

Sum the signed values: $ 0.2 + 0.1 + (-0.6) = -0.3 $. The election is decided by total magnitude, not a head-count: two positives (0.3 combined) lose to one larger negative (0.6). So $ \gamma_m = \operatorname{sgn}(-0.3) = $ −1, and the disjoint mean then averages only the agreeing task — the single $-0.6$ value.

PYTHON · RUNNABLE IN-BROWSER

# Tiny TIES merge: trim, elect sign, disjoint mean -- compare to naive sum.
import numpy as np

# 3 task vectors over 6 coordinates (rows = tasks).
T = np.array([
    [ 0.80, -0.05,  0.60, -0.02,  0.01,  0.50],   # task 1
    [ 0.70,  0.04, -0.55,  0.03, -0.02, -0.40],   # task 2 (conflicts on col 2,5)
    [-0.10,  0.90,  0.05,  0.85,  0.02,  0.45],   # task 3
])

def trim(tau, keep=0.5):                  # keep top-|.| fraction per row
    out = np.zeros_like(tau)
    for i, row in enumerate(tau):
        k = max(1, int(round(keep * row.size)))
        idx = np.argsort(-np.abs(row))[:k]
        out[i, idx] = row[idx]
    return out

Th = trim(T, keep=0.5)
gamma = np.sign(Th.sum(axis=0))           # elected sign per coordinate (EQ OM6.3)
gamma[gamma == 0] = 1
agree = (np.sign(Th) == gamma) & (Th != 0)
merged = np.where(agree.sum(0) > 0,
                  (np.where(agree, Th, 0).sum(0) / np.maximum(agree.sum(0), 1)),
                  0.0)

print("naive sum   :", np.round(T.sum(0), 3))
print("elected sign:", gamma.astype(int).tolist())
print("TIES merged :", np.round(merged, 3))
print("\nOn conflict column 2 the naive sum nearly cancels (0.60 - 0.55 + 0.05);")
print("on column 5 the dissenting -0.40 drags the total down. TIES elects the")
print("larger-magnitude sign and averages only the agreeing tasks.")


INSTRUMENT OM6.3 — INTERFERENCE & TIES REPAIR · EQ OM6.3 · NAIVE SUM vs TRIM + ELECT-SIGN + DISJOINT-MEAN

Controls: TRIM KEEP FRACTION (initial 0.50), SIGN-CONFLICT LEVEL (initial 0.50), MERGE METHOD. Readouts: CONFLICTED COORDS, SIGNAL PRESERVED, METHOD.

Each bar is one coordinate's merged value across several toy task vectors. Raise SIGN-CONFLICT LEVEL to make tasks disagree more. Under NAIVE SUM the conflicted bars cancel toward zero — lost signal in red. Switch to TIES: trimming drops the noisy small entries, sign-election picks the majority direction, and the disjoint mean restores the conflicted bars instead of cancelling them. The "signal preserved" readout is the fraction of the intended magnitude that survives.

6.4

The geometry that makes it work

None of this should work. Averaging the weights of two neural networks is, in general, nonsense — the loss landscape is wildly non-convex, and the midpoint of two random good models is usually a bad one. Merging is feasible only inside a specific geometric regime, and naming that regime tells you exactly when to trust a merge and when to expect it to fail.

The enabling phenomenon is linear mode connectivity: when two models are fine-tuned from the same pretrained checkpoint, the straight line between them in weight space stays at low loss — there is no barrier (this is the flat curve of INSTRUMENT OM6.1). Shared initialization keeps both fine-tunes inside a single connected low-loss basin, so any convex combination of them is also low-loss. This is the geometric reason a soup beats its ingredients: averaging finds a flatter, more central point in the basin, which generalizes better.

EQ OM6.6 — THE LOSS BARRIER $$ B(\theta_A,\theta_B) \;=\; \max_{\lambda\in[0,1]} \mathcal{L}\big((1-\lambda)\theta_A + \lambda\theta_B\big) \;-\; \tfrac{1}{2}\big(\mathcal{L}(\theta_A)+\mathcal{L}(\theta_B)\big) $$

$B$ measures how much the worst point on the straight path exceeds the average of the two endpoints' losses. $B \approx 0$ means the models are linearly mode-connected and the merge is safe; $B \gg 0$ means a barrier sits between them and naive averaging will land on it. Same-base fine-tunes typically have $B \approx 0$; models from different initializations almost never do. This is exactly the quantity the interpolation instrument plots and reports.
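EQ OM6.6 can be estimated by evaluating the loss on a grid of $\lambda$ values. The sketch below assumes two synthetic losses (one shared bowl, one pair of separate bowls) as stand-ins for a real evaluation loss; the helper name `loss_barrier` and the grid size are illustrative choices.

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, n_points=101):
    """Grid estimate of EQ OM6.6: worst loss on the path minus the endpoint average."""
    lams = np.linspace(0.0, 1.0, n_points)
    path = [loss_fn((1.0 - lam) * theta_a + lam * theta_b) for lam in lams]
    worst = int(np.argmax(path))
    barrier = path[worst] - 0.5 * (path[0] + path[-1])
    return barrier, float(lams[worst])

A = np.array([0.4, -0.2, 1.0])     # toy "model A" weights
B = np.array([1.0,  0.6, 0.2])     # toy "model B" weights
center = 0.5 * (A + B)

def loss_shared(theta):            # one basin containing both endpoints
    return float(((theta - center) ** 2).sum())

def loss_two_basins(theta):        # separate basins at A and at B
    return float(min(((theta - A) ** 2).sum(), ((theta - B) ** 2).sum()))

b_shared, _ = loss_barrier(loss_shared, A, B)
b_split, lam_worst = loss_barrier(loss_two_basins, A, B)
print(f"shared basin : B = {b_shared:.3f}")
print(f"two basins   : B = {b_split:.3f} (worst point at lambda = {lam_worst:.2f})")
```

On a real model each grid point costs one full evaluation pass, so the grid is usually much coarser than the 101 points used here.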

The crucial caveat is permutation symmetry. A neural network's hidden units can be permuted — relabel the neurons of a layer and permute the next layer's weights to match, and you have an identical function with a completely different weight vector. Two independently trained models almost always sit in different permutation "copies" of the same basin, which is why the raw line between them crosses a barrier. Ainsworth et al. (2023) showed this barrier can often be removed by first finding the permutation that aligns one model's units to the other's, after which even independently trained models become linearly connected. The practical takeaway for merging is sharp:

WHEN MERGING IS SAFE

Merge models that share a base, architecture, and tokenizer, and you are inside one basin with no permutation problem — averaging and task arithmetic just work. Merge models from different initializations and you must align their permutations first, or expect a barrier. This single distinction explains nearly every merging success and failure in the wild: community fine-tunes of one base model (Llama, Mistral) merge cleanly; two from-scratch models do not.
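Permutation symmetry is easy to demonstrate on a toy two-layer ReLU network. In this sketch the permutation is known in advance, so alignment is trivial; methods such as Git Re-Basin have to search for it. All shapes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))       # hidden x input
W2 = rng.normal(size=(3, 8))       # output x hidden
X  = rng.normal(size=(4, 16))      # 16 inputs stored as columns

def forward(W1, W2, X):
    """Two-layer ReLU network without biases."""
    return W2 @ np.maximum(W1 @ X, 0.0)

# Relabel the hidden units: permute rows of W1 and the matching columns of W2.
perm = np.roll(np.arange(8), 1)    # a fixed cyclic shift, never the identity
W1_p, W2_p = W1[perm, :], W2[:, perm]

reference = forward(W1, W2, X)
same_function = np.allclose(reference, forward(W1_p, W2_p, X))

# Naive average of two weight vectors that compute the SAME function.
naive = forward(0.5 * (W1 + W1_p), 0.5 * (W2 + W2_p), X)
naive_err = float(np.abs(naive - reference).max())

# Align first (undo the permutation), then average.
inv = np.argsort(perm)
W1_al, W2_al = W1_p[inv, :], W2_p[:, inv]
aligned = forward(0.5 * (W1 + W1_al), 0.5 * (W2 + W2_al), X)
aligned_err = float(np.abs(aligned - reference).max())

print("permuted model computes same function:", same_function)
print(f"max output error, naive average  : {naive_err:.3f}")
print(f"max output error, aligned average: {aligned_err:.3e}")
```

The two weight vectors describe one function, yet their unaligned midpoint describes a different one. That is the barrier between permutation copies in miniature.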

On the path between two models the maximum loss is $ 0.9 $ (at the midpoint), while $ \mathcal{L}(\theta_A) = 0.4 $ and $ \mathcal{L}(\theta_B) = 0.2 $. What is the loss barrier $ B $ (EQ OM6.6)?

$ B = 0.9 - \tfrac{1}{2}(0.4 + 0.2) = 0.9 - 0.3 = $ 0.6. A positive barrier this large means the midpoint merge is far worse than the endpoints — these models are not linearly connected, so a naive average should not be trusted without permutation alignment.

6.5

Practice & honest limits

Why merge at all, when you could multi-task fine-tune? Because merging is nearly free. It needs no training data, no gradient steps, and no GPU beyond the memory to hold the checkpoints. The practical wins follow directly:

Cheap multi-task models. Combine separately-trained specialists into one model that does all their jobs, without ever assembling a joint training set or running a joint fine-tune. This is the dominant use.
Combining community fine-tunes. The open-model ecosystem (Open Models · Ch 03) produces thousands of fine-tunes of a handful of bases. Merging lets you blend, say, a coding fine-tune and an instruction fine-tune of the same Llama into one model, the technique behind many highly ranked community models on public leaderboards.
Behavior editing without data. Negate a toxicity task vector to detoxify; subtract a memorized-data vector to reduce regurgitation; scale a style vector up or down. Edits you would otherwise need a fine-tuning run to make.
Soups for robustness. Average your own hyperparameter sweep into one checkpoint that generalizes better than any single run, at the cost of one model.

A widely used open-source toolkit is mergekit, which implements linear averaging (soups), task arithmetic, TIES, DARE, and SLERP behind a single config file. Fisher merging needs data and gradients, so it sits outside that data-free workflow. But the tooling hides none of the genuine limits, and a serious caveat list belongs in any honest treatment:

Limit	Why	Consequence
Same base required	weights only share a coordinate system if they share an origin	cannot merge across different base models or tokenizers
Same architecture	parameter $i$ must be the same role in every ingredient	no merging a 7B into a 13B
No free lunch	a merge is a compromise; capacity is finite and skills can trade off	merged model often below each specialist on its own task
Hyperparameters matter	$\lambda$, $\alpha$, trim fraction, DARE $p$ all change the result	needs a search, not a single setting
Evaluation is mandatory	merge quality is unpredictable a priori; barriers and interference are silent	always benchmark — a merge that looks fine can be quietly degraded

The last row is the one to internalize. Merging is fast enough that it invites a "ship it" reflex, but a merge can satisfy every algebraic constraint and still land on a barrier, overshoot a scale, or trade away a skill you needed. A merge is a hypothesis, not a result — it is unproven until it clears the same evaluation suite you would demand of a fine-tune. The cost saving is in the training you skipped, not in the evaluation, which you still owe in full.
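The "search, then evaluate" discipline can be sketched as a small sweep. The two evaluation functions here are toy stand-ins for a real benchmark suite (squared distance to each specialist), and every value is hypothetical; the point is the shape of the loop, not the numbers.

```python
import numpy as np

theta_0 = np.zeros(4)                        # toy base
ft_A = np.array([0.6, 0.0, 0.3, 0.2])        # toy specialist A
ft_B = np.array([0.0, 0.5, 0.0, 0.4])        # toy specialist B (overlaps A on coord 3)
tau_A, tau_B = ft_A - theta_0, ft_B - theta_0

# Stand-in "benchmarks": lower is better, 0 means matching the specialist exactly.
def eval_A(theta): return float(((theta - ft_A) ** 2).sum())
def eval_B(theta): return float(((theta - ft_B) ** 2).sum())

rows = []
for alpha in np.linspace(0.0, 1.5, 16):      # sweep the task-arithmetic scale
    merged = theta_0 + alpha * (tau_A + tau_B)
    rows.append((float(alpha), eval_A(merged), eval_B(merged)))

# Select by combined score, then report each task separately.
best_alpha, best_A, best_B = min(rows, key=lambda r: r[1] + r[2])

print(f"best alpha = {best_alpha:.1f}")
print(f"task A: specialist {eval_A(ft_A):.3f}  merged {best_A:.3f}")
print(f"task B: specialist {eval_B(ft_B):.3f}  merged {best_B:.3f}")
```

Even at the best $\alpha$ in this toy, the merged model scores worse than each specialist on its own task. Reporting the per-task numbers next to the specialists' makes that trade visible instead of hiding it inside a single aggregate.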

NO FREE LUNCH

Merging buys breadth, sometimes at the cost of peak. The cheap multi-skill model is real, and for many deployments "good at several things on one GPU" beats "excellent at one." But do not expect a merge of a math specialist and a code specialist to match either on its home benchmark. Interference, finite capacity, and compromise are structural, not bugs to be tuned away. State the trade you made in numbers, the same discipline §6.5 of the previous chapter demanded of safety claims.

That closes the open-models loop: choose a model, run it, fine-tune it, train it well, break-then-harden it, and now combine fine-tunes in weight space for free. Merging is the payoff of working with open weights — full access to the parameters is exactly what lets you average, add, and subtract them. Return to the index to branch into the volumes this track builds on: fine-tuning and LoRA (Vol II · Ch 06), post-training and alignment (Vol II · Ch 05), and the deep-learning track for the loss-landscape geometry behind linear mode connectivity.

6.R

References

Ilharco, G., Ribeiro, M. T., Wortsman, M. et al. (2022). Editing Models with Task Arithmetic. ICLR 2023 — defines the task vector $\tau = \theta_{ft} - \theta_{pre}$ and add/negate/analogy arithmetic (§6.2, EQ OM6.2).
Wortsman, M., Ilharco, G., Gadre, S. Y. et al. (2022). Model Soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. ICML 2022 — weight averaging of fine-tunes (§6.1, EQ OM6.1).
Yadav, P., Tam, D., Choshen, L., Raffel, C. & Bansal, M. (2023). TIES-Merging: Resolving Interference When Merging Models. NeurIPS 2023 — trim, elect-sign, disjoint-mean (§6.3, EQ OM6.3).
Yu, L., Yu, B., Yu, H., Huang, F. & Li, Y. (2023). Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE). ICML 2024 — drop-and-rescale sparsification of task vectors (§6.3, EQ OM6.4).
Matena, M. & Raffel, C. (2022). Merging Models with Fisher-Weighted Averaging. NeurIPS 2022 — per-parameter precision-weighted merge (§6.3, EQ OM6.5).
Ainsworth, S. K., Hayase, J. & Srinivasa, S. (2023). Git Re-Basin: Merging Models modulo Permutation Symmetries. ICLR 2023 — permutation alignment that removes the loss barrier between independently trained models (§6.4).