Does model merging reliably improve LLM capabilities?

Model merging can reliably improve LLM capabilities, but success depends on method, model scale, and task diversity.

Direct answer

Yes, with the right method model merging can reliably improve LLM capabilities, but the improvement is not guaranteed and depends heavily on how you merge. Done well, using techniques such as DARE or TIES-Merging that reduce parameter interference, merging raised average benchmark performance by 1.69% in one DARE-based study [1] and can even create abilities that neither parent model had [2]. However, merging very small models (1.7B parameters) did not produce these emergent gains [2], and naive linear merging often degrades performance because of conflicting parameters [6][8].

8 sources cited

This article was generated with WisPaper-powered search and paper analysis.

How much improvement can you actually expect from model merging?

The effect size is real but varies widely. In one study, merging a top multilingual LLM with a Korean language model using the DARE technique yielded a 1.69% average improvement across six benchmarks and a boost of more than 20% on the Grade School Math 8K (GSM8K) reasoning task [1]. The average gain is modest, but the GSM8K result is a meaningful jump for a reasoning-heavy task.
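For readers unfamiliar with DARE, the core idea can be sketched in a few lines. This is a minimal illustration based on the drop-and-rescale step as it is generally described for DARE, not the code used in [1]: weights are plain NumPy arrays, and the names `dare_delta`, `dare_merge`, `drop_rate`, and `weight` are hypothetical choices made for this example.

```python
import numpy as np

def dare_delta(base, finetuned, drop_rate, rng):
    """Randomly drop entries of the task vector (finetuned - base) and rescale the rest."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    delta = np.asarray(finetuned, dtype=np.float64) - np.asarray(base, dtype=np.float64)
    # Each entry survives with probability (1 - drop_rate).
    keep_mask = rng.random(delta.shape) >= drop_rate
    # Rescaling survivors by 1 / (1 - drop_rate) keeps the expected delta unchanged.
    return np.where(keep_mask, delta / (1.0 - drop_rate), 0.0)

def dare_merge(base, finetuned_models, drop_rate=0.9, weight=1.0, seed=0):
    """Add the sparsified task vectors of several fine-tuned models onto a shared base."""
    rng = np.random.default_rng(seed)
    merged = np.asarray(base, dtype=np.float64).copy()
    for finetuned in finetuned_models:
        merged += weight * dare_delta(base, finetuned, drop_rate, rng)
    return merged
```

Because most delta entries are zeroed before the models are combined, fewer parameters from different parents collide, which is the intuition behind using it to reduce interference. The sketch assumes all parents were fine-tuned from the same base model.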

Other work shows even larger gains in specific domains. An evolutionary merging approach produced a Japanese math LLM that achieved state-of-the-art performance on Japanese benchmarks, surpassing models with substantially more parameters [3]. And in multimodal settings, an uncertainty-guided merging method (UQ-Merge) improved average accuracy by up to 44.3% across 12 datasets compared to existing merging methods [7]. These numbers show that when merging is optimized, the improvements can be dramatic.

What makes merging work—or fail?

The key is how you handle interference between the models' parameters. Early merging methods simply averaged weights, which often caused performance to drop because different models had conflicting parameter values [8]. TIES-Merging addresses this by trimming small changes, resolving sign conflicts, and merging only the parameters whose signs agree, and it outperformed older methods across many tasks [8]. Similarly, Layer-Adaptive SLERP uses geometry-preserving interpolation with layer-specific coefficients, achieving stable merges across 50+ combinations [6].
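The three TIES-Merging steps named above (trim, resolve signs, merge aligned parameters) can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation from [8]: each model is a single flat NumPy array, all parents share one base model, and the names `trim`, `ties_merge`, `keep_fraction`, and `scale` are hypothetical.

```python
import numpy as np

def trim(delta, keep_fraction):
    """Keep the largest-magnitude entries of a task vector and zero the rest."""
    k = max(1, int(round(keep_fraction * delta.size)))
    threshold = np.sort(np.abs(delta).ravel())[-k]
    # Entries tied with the threshold are also kept.
    return np.where(np.abs(delta) >= threshold, delta, 0.0)

def ties_merge(base, finetuned_models, keep_fraction=0.2, scale=1.0):
    base = np.asarray(base, dtype=np.float64)
    # Step 1: trim small changes in each task vector (finetuned - base).
    deltas = np.stack([
        trim(np.asarray(ft, dtype=np.float64) - base, keep_fraction)
        for ft in finetuned_models
    ])
    # Step 2: elect one sign per parameter, the sign carrying more total magnitude.
    elected_sign = np.sign(deltas.sum(axis=0))
    # Step 3: average only the entries that agree with the elected sign.
    agree = (np.sign(deltas) == elected_sign) & (deltas != 0)
    counts = agree.sum(axis=0)
    summed = np.where(agree, deltas, 0.0).sum(axis=0)
    merged_delta = np.divide(summed, counts, out=np.zeros_like(summed), where=counts > 0)
    return base + scale * merged_delta

# Two toy "fine-tuned models" that disagree in sign on the first parameter.
base = np.zeros(4)
model_a = np.array([1.0, -0.1, 2.0, 0.0])
model_b = np.array([-3.0, 0.2, 4.0, 0.05])
merged = ties_merge(base, [model_a, model_b], keep_fraction=0.5)
# merged is elementwise [-3, 0, 3, 0]
```

In the toy example, plain averaging would give -1.0 for the first parameter, a value neither model wanted. The sign election instead keeps the larger-magnitude direction (-3.0) and discards the conflicting update, while the small changes in the second and fourth positions are trimmed away.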

Model scale also matters. Researchers found that merging tiny LLMs (1.7 billion parameters) did not produce the same emergent capabilities seen in larger models (7B+ parameters) [2]. This suggests that a certain minimum model size is needed for merging to unlock new abilities. Additionally, the diversity of parent models is critical—merging models fine-tuned on different tasks or languages can create synergies, but merging very similar models yields little gain [2][3].

Can merging create capabilities that neither parent model had?

Yes, and this is one of the most surprising findings. Merging is not just averaging; it can produce emergent abilities. For example, merging a Japanese language model with a math reasoning model produced a model that could do math in Japanese, even though neither parent was trained for that combination [3]. The authors of [2] describe merging as a 'transformative method' in which nonlinear interactions between parameters create new functionalities.

This also works across modalities. Merging vision-language, audio-language, and video-language models moved toward an 'Omni-language model' that outperformed individual modality models [5]. And merging a security-focused fine-tuned model with a general model significantly improved jailbreak resistance with minimal performance loss [4]. So merging can combine strengths in ways that fine-tuning alone cannot easily achieve.

Sources used in this answer

1

Research on enhancing model performance by merging with Korean language models

Merging a multilingual LLM with a Korean language model using DARE improved average benchmark performance by 1.69% and boosted GSM8K math reasoning by over 20%.

2

Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities

Model merging can create emergent capabilities that surpass parent models, but this effect was not observed in very small (1.7B parameter) LLMs, suggesting scale is important.

3

Evolutionary optimization of model merging recipes

An evolutionary merging approach automatically discovered effective model combinations, producing a Japanese math LLM that achieved state-of-the-art performance on Japanese benchmarks.

4

Enhancing Jailbreak Resistance in Large Language Models Using Model Merge

Merging a security fine-tuned model with a general LLM significantly improved jailbreak resistance with minimal performance degradation.

5

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Merging vision-language, audio-language, and video-language models moved toward an Omni-language model that outperformed individual modality models.

6

Geometric Model Merging for Efficient and Scalable Adaptation of Large Language Models

Layer-Adaptive SLERP, a geometry-preserving merging method, improved stability and performance across 50+ merges spanning six architectures and seven parameter scales.

7

UQ-Merge: Uncertainty Guided Multimodal Large Language Model Merging

Uncertainty-guided merging (UQ-Merge) improved average accuracy by up to 44.3% across 12 multimodal datasets compared to existing merging methods.

8

TIES-Merging: Resolving Interference When Merging Models

TIES-Merging resolved parameter interference by trimming small changes and resolving sign conflicts, outperforming prior merging methods across diverse settings.