This PhD thesis establishes a comprehensive framework for "Model Merging," introducing novel algorithms to combine independently trained neural networks into a single artifact without retraining. Key contributions include C2M3 for cycle-consistent single-task alignment, TSV-Merge for low-rank interference reduction, and MERGE3 for 50x more efficient evolutionary search in LLMs.
TL;DR
Deep learning is moving away from the "train then discard" cycle towards a paradigm of Model Merging. This thesis by Donato Crisostomi formalizes how we can stitch independently trained neural networks together without retraining. By aligning neuron permutations in a "Universe Space" and treating multi-task conflicts as low-rank interference, these methods achieve state-of-the-art performance in vision and 50x efficiency gains in LLM composition.
The Motivation: Why Weights are the New Data Modality
For years, neural network weights were seen as illegible numerical "black boxes." If you had two models trained on different data, they were effectively strangers. This thesis argues that weight space is legible and algebraic. By treating weights as a data modality, we can:
- Perform Task Arithmetic: Add new skills or subtract toxic behaviors like simple vectors.
- Democratize AI: Users can merge models on consumer hardware instead of $100k GPU clusters.
- Federated Learning: Correct the "drift" between distributed clients by aligning their hidden symmetries.
1. Solving the Symmetry Problem (C2M3)
The biggest hurdle in merging models trained from different initializations is Permutation Symmetry. Two models might learn the same logic but store it in different neuron orders. Naive averaging then destroys the representation, which shows up as a "barrier" in the loss landscape between the two models.
Previous methods like Git Re-Basin worked pairwise, but they drift: mapping Model A to B and then back to A can result in a different model.
The Insight: Crisostomi factorizes permutations into a shared Universe Space.
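The symmetry can be checked on a toy network. The sketch below is an illustration only, not code from the thesis: it assumes a two-layer ReLU MLP with made-up sizes, reorders its hidden neurons, and compares the outputs before and after, then does the same for a naive average of the two weight sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2):
    """Two-layer MLP with a ReLU hidden layer."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2

d_in, d_hidden, d_out = 4, 8, 3  # illustrative sizes
W1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_hidden, d_out))

# Reorder the hidden neurons with a cyclic shift: permute the columns of W1
# (and the entries of b1) and the rows of W2 in the same way.
perm = np.roll(np.arange(d_hidden), 1)
W1_p, b1_p, W2_p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(16, d_in))
reference = mlp(x, W1, b1, W2)

# The permuted network computes the same function.
same_function = np.allclose(reference, mlp(x, W1_p, b1_p, W2_p))

# Naively averaging the two functionally identical weight sets does not.
averaged = mlp(x, (W1 + W1_p) / 2, (b1 + b1_p) / 2, (W2 + W2_p) / 2)
avg_error = np.abs(averaged - reference).max()

print(same_function, avg_error)
```

The permuted copy matches the original output, while the averaged weights give a different function even though both inputs to the average were the same model up to neuron order.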

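By mapping every model $A$ to the universe via a permutation $P_A$, any pairwise map is a composition through the universe and is consistent by construction: $P_{AB} = P_B^\top P_A$, so $P_{BA} P_{AB} = I$. This results in a 20% accuracy boost when merging multiple models compared to inconsistent pairwise matching.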
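A small numerical check makes the consistency argument concrete. This is a sketch under assumed notation: `P[A]` is a hypothetical permutation matrix sending model A's neurons to the universe, and the map from one model to another goes through the universe. The permutations here are random stand-ins, not the ones the C2M3 optimization would find.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # number of neurons in the layer (illustrative)

def random_perm_matrix(n):
    """Random n x n permutation matrix."""
    return np.eye(n)[rng.permutation(n)]

# Hypothetical model-to-universe permutations for three models.
P = {name: random_perm_matrix(n) for name in "ABC"}

def pairwise(src, dst):
    """Map src -> universe -> dst, i.e. P_dst^T @ P_src."""
    return P[dst].T @ P[src]

# Go around the cycle A -> B -> C -> A.
cycle = pairwise("C", "A") @ pairwise("B", "C") @ pairwise("A", "B")
is_cycle_consistent = np.array_equal(cycle, np.eye(n))

print(is_cycle_consistent)
```

Because every pairwise map is built from the same per-model factors, the inner products cancel around any cycle. Pairwise maps that are optimized independently carry no such guarantee.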
2. Multi-Task Merging: The Geometry of Interference (TSV)
When we fine-tune a pre-trained model on different tasks, we create Task Vectors: the difference $\tau_i = \theta_i - \theta_0$ between the fine-tuned weights $\theta_i$ and the pre-trained weights $\theta_0$. Adding them ($\theta_{\text{merged}} = \theta_0 + \alpha \sum_i \tau_i$) is simple but leads to "interference," where features from Task A overwrite Task B.
The Theory: The thesis proves that task vectors are essentially scaled negative gradients from the first epoch of fine-tuning.
The Method: By performing SVD on per-layer task matrices, Crisostomi discovered they are low-rank.
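As a minimal sketch of task arithmetic, the snippet below uses flattened random vectors as stand-ins for real checkpoints. The function name `task_arithmetic` and the scaling coefficient `alpha` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for flattened weights: one pre-trained model, three fine-tuned ones.
theta_pre = rng.normal(size=10)
theta_ft = [theta_pre + 0.1 * rng.normal(size=10) for _ in range(3)]

# A task vector is the difference between fine-tuned and pre-trained weights.
task_vectors = [theta - theta_pre for theta in theta_ft]

def task_arithmetic(theta_pre, task_vectors, alpha=1.0):
    """Add the scaled sum of task vectors to the pre-trained weights."""
    return theta_pre + alpha * np.sum(task_vectors, axis=0)

# Merge all three tasks into one set of weights.
merged = task_arithmetic(theta_pre, task_vectors, alpha=0.5)

# Negation: subtract a task vector to remove a behavior.
forgot_first = theta_pre - task_vectors[0]
```

With a single task vector and `alpha=1.0`, the function returns the fine-tuned weights exactly. Interference only appears once several vectors are summed.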
- TSV-Merge: Decomposes task vectors, identifies overlapping "interference" directions, and uses Procrustes Orthogonalization to whiten the spectrum.

This allows a single model to hold 20+ tasks with almost no accuracy loss, increasing absolute accuracy by ~15% over standard task arithmetic.
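The decompose-then-orthogonalize idea can be sketched as follows. This is a simplified illustration, not the thesis implementation: it assumes synthetic task matrices that are low-rank by construction, keeps the top singular directions per task, and replaces the concatenated singular vectors with the nearest matrix with orthonormal columns (the polar factor that solves the orthogonal Procrustes problem). All sizes and helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, n_tasks = 32, 24, 2, 3  # illustrative sizes

# Synthetic per-layer task matrices, low-rank by construction.
task_matrices = [
    rng.normal(size=(d_out, rank)) @ rng.normal(size=(rank, d_in))
    for _ in range(n_tasks)
]

def truncated_svd(M, k):
    """Top-k singular triplets of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k]

def orthogonalize(X):
    """Nearest matrix with orthonormal columns to X (polar factor)."""
    left, _, right_t = np.linalg.svd(X, full_matrices=False)
    return left @ right_t

parts = [truncated_svd(M, rank) for M in task_matrices]
U = np.concatenate([p[0] for p in parts], axis=1)   # (d_out, n_tasks * rank)
S = np.concatenate([p[1] for p in parts])           # (n_tasks * rank,)
Vt = np.concatenate([p[2] for p in parts], axis=0)  # (n_tasks * rank, d_in)

# Singular vectors from different tasks overlap; orthogonalizing them
# removes that overlap before recombining into one update.
U_orth = orthogonalize(U)
V_orth = orthogonalize(Vt.T)
merged_update = U_orth @ np.diag(S) @ V_orth.T      # (d_out, d_in)
```

The merged update would then be added to the pre-trained layer weights. How many directions to keep per task is a design choice that this sketch fixes arbitrarily.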
3. Adaptive Deployment (MASS) & LLM Evolution (MERGE3)
The final frontier of the thesis is making merging adaptive and accessible.
MASS: The Data-Free Router
Instead of a static merge, MASS (MoErging through Adaptive Subspace Selection) looks at an input and calculates the "Subspace Residual." It projects the input feature onto each task's singular vectors; the task with the smallest "reconstruction error" is the one the model activates.
- Result: MASS recovers nearly 98% of expert performance without needing task labels at inference.
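The routing rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the MASS implementation: each task is represented by a hypothetical orthonormal basis (random here), and the function name `route` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_tasks = 16, 3, 4  # feature dim, subspace dim, number of tasks

# Hypothetical orthonormal bases, one per task; columns span the task subspace.
bases = [np.linalg.qr(rng.normal(size=(d, k)))[0] for _ in range(n_tasks)]

def route(z, bases):
    """Return the index of the task whose subspace reconstructs z best."""
    residuals = [np.linalg.norm(z - V @ (V.T @ z)) for V in bases]
    return int(np.argmin(residuals)), residuals

# A feature lying in task 2's subspace, plus a little noise.
z = bases[2] @ rng.normal(size=k) + 0.01 * rng.normal(size=d)
chosen, residuals = route(z, bases)

print(chosen)
```

No task label is used anywhere: the choice depends only on the input feature and the stored singular vectors.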
MERGE3: Democratizing Evolution
Evolutionary algorithms can find perfect "merging recipes," but they usually take thousands of GPU hours. The MERGE3 framework uses Item Response Theory (IRT) to estimate a model's latent ability. By evaluating on a tiny, diverse subset of data and using the MP-IRT estimator, it achieves a 50x speedup. What used to take a month on a cluster now takes one day on a single RTX 4090.
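To show why a small subset can be enough, here is a generic IRT sketch. It uses the standard two-parameter logistic (2PL) model and a plain maximum-likelihood grid search, not the MP-IRT estimator from the thesis. The item parameters, subset size, and function names are all assumptions made for illustration.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL IRT: probability that ability theta answers an item correctly,
    given item discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_ability(responses, a, b, grid=None):
    """Maximum-likelihood ability on a grid, from 0/1 responses."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    p = np.clip(p_correct(grid[:, None], a[None, :], b[None, :]), 1e-9, 1 - 1e-9)
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(log_lik)])

rng = np.random.default_rng(0)
n_items = 2000
a = rng.uniform(0.5, 2.0, size=n_items)  # synthetic item discriminations
b = rng.normal(size=n_items)             # synthetic item difficulties
true_theta = 0.7                         # synthetic "model ability"
responses = (rng.uniform(size=n_items) < p_correct(true_theta, a, b)).astype(float)

# Evaluate on 100 items only, then predict accuracy on the full benchmark.
subset = rng.choice(n_items, size=100, replace=False)
theta_hat = estimate_ability(responses[subset], a[subset], b[subset])
predicted_accuracy = p_correct(theta_hat, a, b).mean()
full_accuracy = responses.mean()

print(theta_hat, predicted_accuracy, full_accuracy)
```

Each candidate merge is scored from a fraction of the items, and the fitted ability is used to extrapolate to the full benchmark. That is where the saving in evaluation cost comes from.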
Critical Analysis & Conclusion
This work transitions model merging from empirical "hacking" to a structured mathematical field.
- The Takeaway: The "low-rankness" of neural updates is a superpower. It allows for massive compression (TSV-C) and clever routing (MASS).
- Limitation: Currently, these methods assume models share the same architecture. The "Holy Grail" remains heterogeneous merging—combining a CNN with a Transformer or a Llama-3 with a Mistral-7B.
Donato Crisostomi's work proves that we are no longer just training models; we are building a library of modular intelligence that can be spliced, edited, and evolved.
