This survey presents "FUSE," a comprehensive taxonomy for model merging in the Large Language Model (LLM) era, covering Foundations, Unification strategies, Scenarios, and Ecosystems. It highlights how multiple specialized models can be combined into a single unified model without additional training, with reported gains of 8-12% on multi-task benchmarks, by exploiting loss landscape geometry and linear mode connectivity.
TL;DR
Model merging has emerged as the "free lunch" of the LLM world. Instead of spending millions on retraining, researchers are now "stitching" specialized models together, for example combining a math expert with a coding expert to create a multi-talented generalist. This survey introduces the FUSE taxonomy, detailing how the field has moved from basic weight averaging to Task Arithmetic, which can surgically add or remove model behaviors without a single gradient update, and on to evolutionary "Franken-merges."
The Core Motivation: Why Merge?
The industry is facing a scaling wall. Maintaining a separate model for every task is too expensive, and retraining a single model to do "everything" often leads to catastrophic forgetting.
The Insight: Models fine-tuned from the same "parent" (like Llama-3) tend to stay within the same "loss basin," so their weights remain geometrically related. If you treat the difference between a fine-tuned model and its base as a Task Vector (τ = θ_finetuned - θ_base), you can perform algebra on the intelligence itself: θ_base + τ_math + τ_code ≈ STEM_Expert.
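The task-vector algebra above can be sketched in a few lines of NumPy. This is a minimal illustration, not the survey's implementation: the checkpoints are toy dictionaries standing in for real state dicts, and the names `task_vector`, `apply_task_vectors`, and `scale` are hypothetical.

```python
import numpy as np

def task_vector(base, finetuned):
    """Per-parameter difference between a fine-tuned checkpoint and its base."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vectors(base, vectors, scale=1.0):
    """Add scaled task vectors to the base weights; no gradient updates involved."""
    merged = {name: w.copy() for name, w in base.items()}
    for vec in vectors:
        for name in merged:
            merged[name] += scale * vec[name]
    return merged

# Toy "checkpoints" (assumption): parameter name -> array.
base = {"w": np.array([1.0, 2.0, 3.0])}
math_model = {"w": np.array([1.5, 2.0, 3.0])}
code_model = {"w": np.array([1.0, 2.0, 2.0])}

tau_math = task_vector(base, math_model)
tau_code = task_vector(base, code_model)
stem_model = apply_task_vectors(base, [tau_math, tau_code])
```

Here the two task vectors touch different coordinates, so the sum keeps both changes intact. When they overlap, `scale` is typically tuned on held-out data.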
Methodology: From Simple Averages to Evolutionary Search
1. The Geometric Foundation
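Successful merging relies on Linear Mode Connectivity (LMC): when models are fine-tuned from a shared starting point, the straight line between their weights typically stays in a "low-loss valley." Models trained from different initializations usually lack this property, partly because of Permutation Invariance: the same model logic can be stored in different neuron orders, so neurons must be aligned before weights are averaged.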
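A small experiment makes the permutation issue concrete. The sketch below assumes a toy two-layer ReLU network with random weights (nothing here comes from the survey). Shuffling the hidden units leaves the function unchanged, yet naively averaging the original and shuffled weights does not. The permutation is known here by construction; real alignment methods have to estimate it.

```python
import numpy as np

def mlp(x, W1, b1, W2):
    """Two-layer ReLU network: relu(x @ W1.T + b1) @ W2.T."""
    return np.maximum(x @ W1.T + b1, 0.0) @ W2.T

rng = np.random.default_rng(0)
d_in, hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(hidden, d_in))
b1 = rng.normal(size=hidden)
W2 = rng.normal(size=(d_out, hidden))
x = rng.normal(size=(16, d_in))

# Model B is model A with its hidden units cyclically shifted:
# same function, different weight layout.
perm = np.roll(np.arange(hidden), 1)
W1_b, b1_b, W2_b = W1[perm], b1[perm], W2[:, perm]

out_a = mlp(x, W1, b1, W2)
out_b = mlp(x, W1_b, b1_b, W2_b)

# Naive midpoint of the straight line between A and B in weight space.
out_naive = mlp(x, (W1 + W1_b) / 2, (b1 + b1_b) / 2, (W2 + W2_b) / 2)

# Undo the permutation first, then average.
inv = np.argsort(perm)
W1_al, b1_al, W2_al = W1_b[inv], b1_b[inv], W2_b[:, inv]
out_aligned = mlp(x, (W1 + W1_al) / 2, (b1 + b1_al) / 2, (W2 + W2_al) / 2)
```

`out_a` and `out_b` agree, `out_naive` differs from both, and `out_aligned` matches `out_a` again once the neuron order has been restored.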
2. Task Vector Arithmetic & Sparsification
The survey highlights TIES-Merging and DARE as the current SOTA for weight-space algebra.
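- TIES (Trim, Elect Sign, Merge): It first trims each task vector to its largest-magnitude entries, then resolves "Sign Conflicts" (where one model wants a weight to go up and another wants it down) by electing, per parameter, the sign with the larger total magnitude, and finally averages only the updates that agree with the elected sign.
- DARE (Drop and Rescale): This method shows empirically that most fine-tuning updates are redundant. By randomly dropping a large fraction of the updates (e.g., 90%) and rescaling the survivors by 1/(1 - p), where p is the drop rate, we can merge more models with less interference.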
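Both methods operate on task vectors (fine-tuned weights minus base weights) and can be sketched on flat arrays. The following is a simplified illustration under stated assumptions, not the reference implementation of either method: the function names, `keep_frac`, and the toy vectors are hypothetical, and the final step of scaling the merged vector and adding it back to the base model is omitted.

```python
import numpy as np

def dare(delta, drop_rate, rng):
    """Drop And REscale: zero a random fraction of delta and rescale the
    survivors by 1 / (1 - drop_rate). Requires drop_rate < 1."""
    keep = rng.random(delta.shape) >= drop_rate
    return np.where(keep, delta / (1.0 - drop_rate), 0.0)

def ties(deltas, keep_frac=0.2):
    """TIES-style merge of same-shaped task vectors."""
    trimmed = []
    for d in deltas:
        # Trim: keep only the top keep_frac entries by magnitude.
        k = max(1, int(round(keep_frac * d.size)))
        threshold = np.sort(np.abs(d))[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)
    # Elect sign: per parameter, the sign carrying the larger total magnitude.
    elected = np.sign(stacked.sum(axis=0))
    # Merge: average only the nonzero entries that agree with the elected sign.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts

rng = np.random.default_rng(0)
sparse = dare(np.ones(10_000), drop_rate=0.9, rng=rng)

d1 = np.array([0.9, -0.1, 0.5, 0.0])
d2 = np.array([-0.3, 0.05, 0.7, 0.2])
merged = ties([d1, d2], keep_frac=0.5)  # approximately [0.9, 0.0, 0.6, 0.0]
```

In the toy example, the first coordinate has a sign conflict (0.9 vs. -0.3). The positive side carries more magnitude, so only 0.9 survives. The third coordinate has no conflict and is averaged. For DARE, the rescaling keeps the expected value of each update unchanged even though roughly 90% of the entries are zeroed.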
Figure 1: The general pipeline—transforming task-specific experts into a unified θ-merged model.
3. Evolutionary Optimization: The "Franken-merge"
Perhaps the most exciting trend is using Evolutionary Algorithms (like those from Sakana AI) to discover merge recipes. These algorithms don't just average weights; they can also interleave layers from different models in non-intuitive ways, producing deeper, specialized architectures that would be hard to design by hand.
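To give a feel for what "searching for a recipe" means, here is a deliberately simple mutate-and-select loop over binary layer-selection genomes. It is a sketch under strong assumptions, not the algorithm of any specific paper. The "models" are lists of random matrices, `toy_fitness` is a hypothetical stand-in for a real benchmark evaluation, and the depth is fixed, whereas real Franken-merges may also stack or repeat layers.

```python
import numpy as np

def assemble(model_a, model_b, genome):
    """Take layer i from model_b where genome[i] == 1, otherwise from model_a."""
    return [b if g else a for g, a, b in zip(genome, model_a, model_b)]

def evolve_recipe(model_a, model_b, fitness, generations=60, children=8, seed=0):
    """Greedy mutate-and-select search over binary layer-selection genomes."""
    rng = np.random.default_rng(seed)
    n = len(model_a)
    best = np.zeros(n, dtype=int)              # start with every layer from A
    best_score = fitness(assemble(model_a, model_b, best))
    for _ in range(generations):
        for _ in range(children):
            flips = rng.random(n) < 1.0 / n    # flip each gene with probability 1/n
            child = np.where(flips, 1 - best, best)
            score = fitness(assemble(model_a, model_b, child))
            if score > best_score:             # keep strict improvements only
                best, best_score = child, score
    return best, best_score

# Toy setup (assumption): two 6-layer "models" and a hidden best recipe.
rng = np.random.default_rng(1)
model_a = [rng.normal(size=(4, 4)) for _ in range(6)]
model_b = [rng.normal(size=(4, 4)) for _ in range(6)]
target = assemble(model_a, model_b, np.array([0, 1, 1, 0, 1, 0]))

def toy_fitness(layers):
    """Higher is better: negative squared distance to the hidden target stack."""
    return -sum(float(np.sum((l - t) ** 2)) for l, t in zip(layers, target))

baseline = toy_fitness(model_a)
recipe, score = evolve_recipe(model_a, model_b, toy_fitness)
```

In practice the fitness call is the expensive part, since each candidate must be evaluated on real tasks, which is why the population size and number of generations are usually kept small.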
Key Results & Benchmarks
The survey consolidates evidence from FusionBench, showing that:
- Multi-tasking: Merged models achieve roughly 99% of the performance of expensive multi-task training.
- Safety: You can "negate" toxic behaviors by subtracting a "toxicity vector" from a model.
- Efficiency: Merging is the aggregation step in Federated Learning, where clients share only model updates rather than raw data, which helps preserve data privacy.
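The negation idea from the Safety point can be sketched as a single subtraction. This is a minimal illustration assuming a hypothetical checkpoint fine-tuned on the unwanted behavior. The function name `negate_task`, the `strength` parameter, and the toy weights are all assumptions.

```python
import numpy as np

def negate_task(base, unwanted, strength=1.0):
    """Move the base model away from a behavior:
    theta = base - strength * (unwanted - base)."""
    return {k: base[k] - strength * (unwanted[k] - base[k]) for k in base}

# Toy weights (assumption): the "toxic" checkpoint is the base plus a small update.
base = {"w": np.array([1.0, 2.0])}
toxic_tuned = {"w": np.array([1.5, 1.0])}

detoxified = negate_task(base, toxic_tuned)
half_step = negate_task(base, toxic_tuned, strength=0.5)
```

A smaller `strength` trades off how much of the unwanted behavior is removed against how much general capability is disturbed, so it would normally be chosen on a validation set.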
Table 1: A comparative look at Weight-Space vs. Structured vs. Routing methods.
Critical Insights: The Limits of the "Free Lunch"
While powerful, model merging is not magic.
- Shared Ancestry is Required: Merging a Llama-3 with a Mistral-7B (different architectures/seeds) remains a "hard" problem due to representational misalignment.
- The Interference Tax: As you merge more models (e.g., >10), the "signal-to-noise" ratio drops, and capabilities begin to dilute.
- Evaluation Gap: We still lack a "merging-native" benchmark that tests for emergent reasoning rather than just memorized facts.
Conclusion & Future Frontiers
The future of AI isn't just "bigger" models; it's Modular AI. The FUSE taxonomy suggests a world where we maintain a library of "LEGO-like" task vectors. Need a Swedish-speaking medical lawyer? Simply fetch the Swedish, Medical, and Legal vectors and snap them onto your base model.
Next Step for Researchers: Move toward Cross-Architecture Merging, that is, finding a way to merge a 70B model with a 7B model or to transfer knowledge between different model families.
