Model Merging: Foundations and Algorithms

From Fragments to Fusion: Foundations and Algorithms of Model Merging

Abstract

This PhD thesis establishes a comprehensive framework for "Model Merging," introducing novel algorithms to combine independently trained neural networks into a single artifact without retraining. Key contributions include C2M3 for cycle-consistent single-task alignment, TSV-Merge for low-rank interference reduction, and MERGE3 for 50x more efficient evolutionary search in LLMs.

TL;DR

Deep learning is moving away from the "train then discard" cycle towards a paradigm of Model Merging. This thesis by Donato Crisostomi formalizes how we can stitch independently trained neural networks together without retraining. By aligning neuron permutations in a "Universe Space" and treating multi-task conflicts as low-rank interference, these methods achieve state-of-the-art performance in vision and 50x efficiency gains in LLM composition.

The Motivation: Why Weights are the New Data Modality

For years, neural network weights were seen as illegible numerical "black boxes." If you had two models trained on different data, they were effectively strangers. This thesis argues that weight space is legible and algebraic. By treating weights as a data modality, we can:

Perform Task Arithmetic: Add new skills or subtract toxic behaviors like simple vectors.
Democratize AI: Users can merge models on consumer hardware instead of $100k GPU clusters.
Federated Learning: Correct the "drift" between distributed clients by aligning their hidden symmetries.

1. Solving the Symmetry Problem (C2M3)

The biggest hurdle in merging models trained from different initializations is Permutation Symmetry. Two models might learn the same logic but store it in different neuron orders. Naively averaging their weights then mixes unrelated neurons and destroys the representation, which shows up as a "barrier" in the loss landscape.
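A minimal NumPy sketch of the symmetry problem, assuming a toy two-layer ReLU network without biases (the network, its sizes, and the fixed permutation are illustrative choices, not taken from the thesis). It shows that reordering hidden units leaves the function unchanged, while averaging the unaligned weights does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, W2):
    """Toy two-layer ReLU network without biases."""
    return np.maximum(x @ W1, 0.0) @ W2

# Model A: 4 inputs -> 8 hidden units -> 3 outputs.
W1_a = rng.normal(size=(4, 8))
W2_a = rng.normal(size=(8, 3))

# Model B: the same function, with hidden units stored in a different order.
perm = np.roll(np.arange(8), 1)   # fixed, non-identity permutation
W1_b = W1_a[:, perm]              # reorder columns of the first layer
W2_b = W2_a[perm, :]              # reorder rows of the second layer to match

x = rng.normal(size=(16, 4))
out_a = mlp(x, W1_a, W2_a)
same_function = np.allclose(out_a, mlp(x, W1_b, W2_b))

# Naive averaging mixes unrelated neurons and changes the function.
out_naive = mlp(x, (W1_a + W1_b) / 2, (W2_a + W2_b) / 2)
naive_error = np.abs(out_naive - out_a).max()

# Undoing the permutation first makes averaging harmless in this toy case.
inv = np.argsort(perm)
out_aligned = mlp(x, (W1_a + W1_b[:, inv]) / 2, (W2_a + W2_b[inv, :]) / 2)
aligned_error = np.abs(out_aligned - out_a).max()

print(same_function, naive_error, aligned_error)
```

In practice the permutation is unknown and has to be estimated, which is the job of alignment methods such as Git Re-Basin and C2M3.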

Previous methods like Git Re-Basin worked pairwise, but they drift: mapping Model A to B and then back to A yields a different model. The Insight: Crisostomi factorizes each permutation through a shared Universe Space, the idea behind Cycle-Consistent Multi-Model Merging (C2M3).

By mapping every model $p$ to a universe $U$ via $P_{p}$, any pairwise comparison is naturally consistent: $P_{pq} = P_{p} (P_{q})^{\top}$. This results in a 20% accuracy boost when merging multiple models compared to inconsistent pairwise matching.
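The consistency property can be checked numerically. The sketch below assumes random permutation matrices stand in for the learned maps $P_{p}$ (the model names `a`, `b`, `c` and the helper functions are hypothetical). It builds every pairwise map as $P_{pq} = P_{p} (P_{q})^{\top}$ and verifies that going around a cycle returns the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # number of neurons in the layer being aligned

def random_permutation_matrix(n, rng):
    """Row-permuted identity matrix, i.e. a random permutation matrix."""
    return np.eye(n)[rng.permutation(n)]

# One map per model into the shared universe (random stand-ins here).
P = {name: random_permutation_matrix(n, rng) for name in ("a", "b", "c")}

def pairwise(p, q):
    """Pairwise map factorized through the universe: P_pq = P_p P_q^T."""
    return P[p] @ P[q].T

# Any cycle composes to the identity because P_q^T P_q = I cancels in the middle.
cycle = pairwise("a", "b") @ pairwise("b", "c") @ pairwise("c", "a")
is_identity = np.array_equal(cycle, np.eye(n))

# Independently chosen pairwise maps carry no such guarantee;
# this flag is usually False.
Q_ab, Q_bc, Q_ca = (random_permutation_matrix(n, rng) for _ in range(3))
independent_cycle_ok = np.array_equal(Q_ab @ Q_bc @ Q_ca, np.eye(n))

print(is_identity, independent_cycle_ok)
```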

2. Multi-Task Merging: The Geometry of Interference (TSV)

When we fine-tune a pre-trained model on different tasks, we create Task Vectors. Adding them ($\theta_{merge} = \theta_{base} + \sum_{i} \tau_{i}$) is simple but leads to "interference", where features from Task A overwrite Task B.
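As a small illustration of task arithmetic, the sketch below treats weights as flat vectors and defines a task vector as fine-tuned weights minus base weights. The random "checkpoints", the function names, and the scaling factor `alpha` are assumptions for illustration. The summation formula itself has no scaling term, which corresponds to `alpha=1.0`.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_base = rng.normal(size=10)  # flattened pre-trained weights (toy)

# Hypothetical fine-tuned checkpoints: small offsets from the base model.
theta_finetuned = [theta_base + 0.1 * rng.normal(size=10) for _ in range(3)]

# Task vector = fine-tuned weights minus base weights.
task_vectors = [theta - theta_base for theta in theta_finetuned]

def merge(theta_base, task_vectors, alpha=1.0):
    """Add (optionally scaled) task vectors to the base weights."""
    return theta_base + alpha * np.sum(task_vectors, axis=0)

def negate(theta_base, task_vector, alpha=1.0):
    """Subtract a task vector to move away from a behavior."""
    return theta_base - alpha * task_vector

theta_merged = merge(theta_base, task_vectors)
theta_without_task0 = negate(theta_base, task_vectors[0])
print(theta_merged.shape, theta_without_task0.shape)
```

Merging a single task vector recovers that fine-tuned model exactly. Interference only appears once several task vectors that touch the same directions are summed.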

The Theory: The thesis proves that task vectors are essentially scaled negative gradients from the first epoch of fine-tuning. The Method: By performing SVD on per-layer task matrices, Crisostomi discovered they are low-rank.
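A one-step illustration of the gradient claim, offered as a simplified sketch rather than the thesis's proof. Write $\tau_{i}$ for the task vector of task $i$, and assume fine-tuning takes a single full-batch gradient step with learning rate $\eta$ on loss $\mathcal{L}_{i}$ (both symbols are introduced here for illustration):

$$\theta_{i} = \theta_{base} - \eta \nabla \mathcal{L}_{i}(\theta_{base}) \quad\Longrightarrow\quad \tau_{i} = \theta_{i} - \theta_{base} = -\eta \nabla \mathcal{L}_{i}(\theta_{base})$$

Under that assumption, summing task vectors amounts to one gradient step on the summed loss, $\sum_{i} \tau_{i} = -\eta \nabla \sum_{i} \mathcal{L}_{i}(\theta_{base})$. With more steps, the relation holds only approximately.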

TSV-Merge: Decomposes task vectors, identifies overlapping "interference" directions, and uses Procrustes Orthogonalization to whiten the spectrum.
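The steps named above can be sketched for a single layer as follows. This is a rough reading of the description, not the thesis's reference implementation. The rank budget `k`, the scaling factor `alpha`, and the function names are assumptions. "Procrustes orthogonalization" is implemented here as replacing a matrix with its nearest matrix with orthonormal columns, obtained from an SVD.

```python
import numpy as np

def procrustes_orthogonalize(A):
    """Closest matrix (Frobenius norm) to A with orthonormal columns."""
    P, _, Qt = np.linalg.svd(A, full_matrices=False)
    return P @ Qt

def tsv_merge_layer(task_matrices, alpha=1.0):
    """Merge per-layer task matrices (fine-tuned minus base weights)."""
    num_tasks = len(task_matrices)
    m, n = task_matrices[0].shape
    k = max(1, min(m, n) // num_tasks)  # per-task rank budget (assumption)

    U_blocks, S_blocks, V_blocks = [], [], []
    for delta in task_matrices:
        U, S, Vt = np.linalg.svd(delta, full_matrices=False)
        U_blocks.append(U[:, :k])      # top-k left singular vectors
        S_blocks.append(S[:k])         # top-k singular values
        V_blocks.append(Vt[:k, :].T)   # top-k right singular vectors

    U_cat = np.concatenate(U_blocks, axis=1)  # m x (T*k)
    S_cat = np.concatenate(S_blocks)          # (T*k,)
    V_cat = np.concatenate(V_blocks, axis=1)  # n x (T*k)

    # Decorrelate singular vectors that overlap across tasks.
    U_orth = procrustes_orthogonalize(U_cat)
    V_orth = procrustes_orthogonalize(V_cat)
    return alpha * (U_orth * S_cat) @ V_orth.T

# Toy usage: two rank-2 task matrices for an 8 x 6 layer.
rng = np.random.default_rng(0)
deltas = [rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6)) for _ in range(2)]
merged_delta = tsv_merge_layer(deltas)
print(merged_delta.shape)
```

The merged update would then be added to the base weights of that layer. With a single task and a full rank budget, the function returns the original task matrix unchanged, which is a useful sanity check.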

This allows a single model to hold 20+ tasks with almost no accuracy loss, increasing absolute accuracy by ~15% over standard task arithmetic.

3. Adaptive Deployment (MASS) & LLM Evolution (MERGE3)

The final frontier of the thesis is making merging adaptive and accessible.

MASS: The Data-Free Router

Instead of a static merge, MASS (MoErging through Adaptive Subspace Selection) looks at an input and calculates the "Subspace Residual." It projects the input feature onto each task's singular vectors; the task with the smallest "reconstruction error" is the one the model activates.

Result: MASS recovers nearly 98% of expert performance without needing task labels at inference.
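The routing rule can be sketched as a projection residual. The example assumes each task is summarized by a matrix with orthonormal columns, built here from random data. The names `task_a`, `task_b`, `task_c` and the helper functions are hypothetical, and the real method's choice of features and thresholds is not reproduced.

```python
import numpy as np

def subspace_residual(x, V):
    """Norm of the part of x lying outside the span of V's orthonormal columns."""
    return np.linalg.norm(x - V @ (V.T @ x))

def route(x, task_bases):
    """Pick the task whose subspace reconstructs x with the smallest error."""
    residuals = {name: subspace_residual(x, V) for name, V in task_bases.items()}
    return min(residuals, key=residuals.get), residuals

rng = np.random.default_rng(0)
d, k = 32, 4  # feature dimension and per-task subspace rank (toy values)

# Hypothetical task subspaces: orthonormal bases from a reduced QR decomposition.
task_bases = {
    name: np.linalg.qr(rng.normal(size=(d, k)))[0]
    for name in ("task_a", "task_b", "task_c")
}

# An input feature lying mostly in task_b's subspace, plus a little noise.
x = task_bases["task_b"] @ np.ones(k) + 0.01 * rng.normal(size=d)

chosen, residuals = route(x, task_bases)
print(chosen, residuals)
```

No task label or routing network is trained here. The decision uses only the stored singular vectors and the incoming feature.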

MERGE3: Democratizing Evolution

Evolutionary algorithms can find perfect "merging recipes," but they usually take thousands of GPU hours. The MERGE3 framework uses Item Response Theory (IRT) to estimate a model's latent ability. By evaluating on a tiny, diverse subset of data and using the MP-IRT estimator, it achieves a 50x speedup. What used to take a month on a cluster now takes one day on a single RTX 4090.
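To make the IRT idea concrete, here is a sketch using a plain two-parameter logistic (2PL) model with known item parameters. It is not the MP-IRT estimator, whose details are not given here. The item parameters, the grid search, and all function names are assumptions. A model's latent ability is estimated from a small subset of items and then used to predict accuracy on the full benchmark.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that ability theta answers items (a, b) correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=None):
    """Maximum-likelihood ability estimate via a simple grid search."""
    if grid is None:
        grid = np.linspace(-4, 4, 161)
    p = p_correct(grid[:, None], a[None, :], b[None, :])  # (grid, items)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

rng = np.random.default_rng(0)
n_items = 2000
a = rng.uniform(0.5, 2.0, size=n_items)   # item discrimination (assumed known)
b = rng.normal(size=n_items)              # item difficulty (assumed known)

# Simulate one model's right/wrong answers on the full benchmark.
true_theta = 0.5
responses = (rng.random(n_items) < p_correct(true_theta, a, b)).astype(float)

# Evaluate on a small subset only, then extrapolate to the full benchmark.
subset = rng.choice(n_items, size=100, replace=False)
theta_hat = estimate_ability(responses[subset], a[subset], b[subset])
predicted_accuracy = p_correct(theta_hat, a, b).mean()
actual_accuracy = responses.mean()
print(theta_hat, predicted_accuracy, actual_accuracy)
```

Inside an evolutionary search, each candidate merge would be scored this way on the subset instead of on the full benchmark, which is where the compute savings would come from.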

Critical Analysis & Conclusion

This work transitions model merging from empirical "hacking" to a structured mathematical field.

The Takeaway: The "low-rankness" of neural updates is a superpower. It allows for massive compression (TSV-C) and clever routing (MASS).
Limitation: Currently, these methods assume models share the same architecture. The "Holy Grail" remains heterogeneous merging—combining a CNN with a Transformer or a Llama-3 with a Mistral-7B.

Donato Crisostomi's work proves that we are no longer just training models; we are building a library of modular intelligence that can be spliced, edited, and evolved.
