Model Merging in the Era of Large Language Models: Methods, Applications, and Future Directions

[Survey 2024] Model Merging in the LLM Era: The "FUSE" Paradigm for Low-Cost Capability Composition

Abstract

This survey presents "FUSE," a taxonomy for model merging in the Large Language Model (LLM) era, organized around Foundations, Unification strategies, Scenarios, and Ecosystems. It describes how multiple specialized models can be combined into a single model without additional training, with reported gains of 8-12% on multi-task benchmarks, by exploiting loss landscape geometry and linear mode connectivity.

TL;DR

Model merging has been called the "free lunch" of the LLM world. Instead of spending millions on retraining, researchers "stitch" specialized models together, for example combining a math expert with a coding expert to get one model that handles both. This survey introduces the FUSE taxonomy and traces the progression from basic weight averaging to Task Arithmetic, which can add or remove model behaviors without a single gradient update, and on to evolutionary "Franken-merges."

The Core Motivation: Why Merge?

The industry is facing a scaling wall. Maintaining separate ensembles for every task is too expensive, and retraining a single model to do "everything" often leads to catastrophic forgetting.

The Insight: Models fine-tuned from the same "parent" (like Llama-3) tend to stay within the same "loss basin," so their weights are geometrically related. If you treat the difference between a fine-tuned model and its base model as a Task Vector, you can do arithmetic on capabilities directly: Base + (Math_Model - Base) + (Coding_Model - Base) = STEM_Expert.
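The task-vector idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the survey's implementation: the helper names `task_vector` and `apply_task_vectors`, the `scale` parameter, and the toy 2x2 "checkpoints" are all assumptions made for the example.

```python
import numpy as np

def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned checkpoint and its base."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vectors(start, vectors, scale=1.0):
    """Add scaled task vectors to a set of weights (a negative scale subtracts them)."""
    merged = {name: w.copy() for name, w in start.items()}
    for vec in vectors:
        for name in merged:
            merged[name] += scale * vec[name]
    return merged

# Toy checkpoints: a single 2x2 weight matrix stands in for a full model.
base = {"w": np.zeros((2, 2))}
math_model = {"w": np.array([[1.0, 0.0], [0.0, 0.0]])}
code_model = {"w": np.array([[0.0, 0.0], [0.0, 2.0]])}

tau_math = task_vector(math_model, base)
tau_code = task_vector(code_model, base)

# Addition: base plus both task vectors.
stem = apply_task_vectors(base, [tau_math, tau_code])

# Negation: subtracting the math vector from the math model returns the base.
unlearned = apply_task_vectors(math_model, [tau_math], scale=-1.0)
```

In practice the scale is usually tuned on held-out data; a value of 1.0 is used here only to keep the arithmetic easy to follow.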

Methodology: From Simple Averages to Evolutionary Search

1. The Geometric Foundation

Successful merging relies on Linear Mode Connectivity (LMC): when models are fine-tuned from a shared starting point, the straight line between their weights tends to stay in a "low-loss valley." Models trained independently must first be aligned to account for Permutation Invariance, the fact that the same function can be stored with hidden neurons in a different order.
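Permutation invariance is easy to see on a toy network. The sketch below assumes a two-layer ReLU MLP with random weights (the `mlp` helper and all shapes are illustrative): reordering hidden units leaves the function unchanged, yet averaging the reordered copy with the original does not, unless the permutation is undone first.

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP with a ReLU hidden layer."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 4))   # input -> hidden
w2 = rng.normal(size=(4, 2))   # hidden -> output
x = rng.normal(size=(5, 3))
reference = mlp(x, w1, w2)

# Reorder the hidden units: permute columns of w1 and the matching rows of w2.
perm = np.array([2, 0, 3, 1])
w1_p, w2_p = w1[:, perm], w2[perm, :]
same_function = np.allclose(reference, mlp(x, w1_p, w2_p))

# Naively averaging the two copies mixes unrelated neurons.
naive = mlp(x, (w1 + w1_p) / 2, (w2 + w2_p) / 2)

# Undoing the permutation before averaging recovers the original function.
inv = np.argsort(perm)
aligned = mlp(x, (w1 + w1_p[:, inv]) / 2, (w2 + w2_p[inv, :]) / 2)
```

Here the permutation is known in advance. For independently trained models it has to be estimated, which is what alignment methods attempt.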

2. Task Vector Arithmetic & Sparsification

The survey highlights TIES-Merging and DARE as the current SOTA for weight-space algebra.

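TIES (Trim, Elect Sign, Merge): It trims each task vector to its largest-magnitude entries, then resolves "Sign Conflicts" (where one model wants a weight to go up and another wants it down) by electing, per parameter, the sign with the larger total magnitude and averaging only the updates that agree with it.
DARE (Drop and Rescale): This method shows empirically that most fine-tuning updates are redundant. By randomly dropping a large fraction of the updates (e.g., 90%) and rescaling the survivors by 1/(1 - p), where p is the drop rate, we can merge more models with less interference.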
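Both methods operate on task vectors (fine-tuned weights minus base weights) and can be sketched on flat arrays. The code below is a simplified illustration under stated assumptions: the function names `dare` and `ties`, the `keep_frac` default, and the toy inputs are choices made for this example, and real implementations work per tensor and add a scaling coefficient when applying the result to the base model.

```python
import numpy as np

def dare(delta, drop_rate, rng):
    """Drop And REscale: zero a random fraction of the update, rescale the rest."""
    keep = rng.random(delta.shape) >= drop_rate
    return delta * keep / (1.0 - drop_rate)

def ties(deltas, keep_frac=0.2):
    """Trim, elect sign, then average only the entries that agree with that sign."""
    trimmed = []
    for d in deltas:
        k = max(1, int(round(keep_frac * d.size)))
        thresh = np.sort(np.abs(d).ravel())[-k]      # k-th largest magnitude
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))
    stacked = np.stack(trimmed)
    elected = np.sign(stacked.sum(axis=0))           # sign with larger total magnitude
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    count = np.maximum(agree.sum(axis=0), 1)         # avoid division by zero
    return (stacked * agree).sum(axis=0) / count

rng = np.random.default_rng(0)

# Two toy task vectors whose first entry conflicts in sign (+4 vs -2).
d1 = np.array([4.0, -1.0, 0.1, 3.0])
d2 = np.array([-2.0, 0.5, 0.2, 5.0])
merged_delta = ties([d1, d2], keep_frac=0.5)

# DARE on an all-ones update: about 10% of entries survive, each scaled by ~10.
sparse = dare(np.ones(100_000), drop_rate=0.9, rng=rng)
```

In the toy example the conflicting first entry keeps only the positive update, because the positive side has the larger total magnitude. The rescaling in `dare` keeps the expected value of each update unchanged, which is why so many entries can be dropped.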

Figure 1: The general model merging pipeline, transforming task-specific experts into a unified θ-merged model.

3. Evolutionary Optimization: The "Franken-merge"

Perhaps the most exciting trend is using Evolutionary Algorithms (like those from Sakana AI) to discover merge recipes. These algorithms don't just average weights; they can also interleave layers from different models in non-intuitive ways, producing deeper, specialized architectures that would be hard to design by hand.
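The search loop itself can be illustrated with a toy evolution strategy. This sketch only searches continuous per-layer mixing coefficients between two models and does not attempt the layer interleaving described above. The function `evolve_merge_weights`, its parameters, and `toy_fitness` (a stand-in for a real benchmark score) are all hypothetical.

```python
import numpy as np

def evolve_merge_weights(fitness, n_layers, generations=50, pop_size=16,
                         sigma=0.1, seed=0):
    """Simple (1+lambda) evolution strategy over per-layer mixing coefficients."""
    rng = np.random.default_rng(seed)
    best = np.full(n_layers, 0.5)            # start from a plain average
    best_fit = fitness(best)
    for _ in range(generations):
        # Sample mutated children around the current best recipe, kept in [0, 1].
        noise = sigma * rng.normal(size=(pop_size, n_layers))
        children = np.clip(best + noise, 0.0, 1.0)
        fits = np.array([fitness(c) for c in children])
        if fits.max() > best_fit:            # elitism: accept only improvements
            best, best_fit = children[fits.argmax()], fits.max()
    return best, best_fit

# Hypothetical fitness: peaks at a hidden "ideal" recipe. A real search would
# build the merged model for each candidate and score it on held-out tasks.
target = np.array([0.9, 0.1, 0.7, 0.3])

def toy_fitness(alpha):
    return -np.sum((alpha - target) ** 2)

recipe, score = evolve_merge_weights(toy_fitness, n_layers=4)
```

Each coefficient would be read as the weight given to model A in that layer, with the remainder going to model B. Because only improvements are accepted, the returned score is never worse than that of the plain average.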

Key Results & Benchmarks

The survey consolidates evidence from FusionBench, showing that:

Multi-tasking: Merged models achieve roughly 99% of the performance of expensive multi-task training.
Safety: You can "negate" toxic behaviors by subtracting a "toxicity vector" from a model.
Efficiency: Merging allows for Federated Learning where only model updates are shared, preserving data privacy.

Table 1: A comparative look at Weight-Space vs. Structured vs. Routing methods.

Critical Insights: The Limits of the "Free Lunch"

While powerful, model merging is not magic.

Shared Ancestry is Required: Merging a Llama-3 model with Mistral-7B (different architectures and initializations) remains a "hard" problem due to representational misalignment.
The Interference Tax: As you merge more models (e.g., >10), the "signal-to-noise" ratio drops, and capabilities begin to dilute.
Evaluation Gap: We still lack a "merging-native" benchmark that tests for emergent reasoning rather than just memorized facts.

Conclusion & Future Frontiers

The future of AI isn't just "bigger" models; it's Modular AI. The FUSE taxonomy suggests a world where we maintain a library of "LEGO-like" task vectors. Need a Swedish-speaking medical lawyer? Simply fetch the Swedish, Medical, and Legal vectors and snap them onto your base model.

Next Step for Researchers: Moving toward Cross-Architecture Merging, that is, finding a way to merge a 70B model with a 7B model or to translate knowledge between different model families.
