Does model merging reliably improve LLM capabilities?

How much improvement can you actually expect from model merging?

The effect size is real but varies widely. In one study, merging a top multilingual LLM with a Korean language model using the DARE technique yielded a 1.69% average improvement across six benchmarks and a boost of more than 20% on the Grade School Math 8K (GSM8K) reasoning task [1]. Whether those figures are relative or percentage-point gains is not specified here, so the GSM8K number is best read as evidence of a large task-specific jump on a reasoning-heavy benchmark, not as a precise count of additional problems solved.
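
For readers unfamiliar with DARE, the following is a minimal sketch of the drop-and-rescale idea as it is generally described: take each fine-tuned model's difference from a shared base, randomly zero out a fraction of those differences, rescale the survivors by 1 / (1 - drop rate), and add the result back to the base. It works on flat lists of floats rather than real checkpoints. The function names, the default drop rate, and the plain summation of deltas are illustrative assumptions, not the configuration used in [1].

```python
import random


def dare_delta(base, finetuned, drop_rate, seed=0):
    """Return the drop-and-rescaled difference between finetuned and base.

    base, finetuned: equal-length lists of floats (toy stand-ins for weights).
    drop_rate: probability of zeroing each difference, in [0, 1).
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    # Rescaling keeps each difference unchanged in expectation.
    scale = 1.0 / (1.0 - drop_rate)
    delta = []
    for b, f in zip(base, finetuned):
        keep = rng.random() >= drop_rate
        delta.append((f - b) * scale if keep else 0.0)
    return delta


def dare_merge(base, finetuned_models, drop_rate=0.9, seed=0):
    """Add the sparsified deltas of several fine-tuned models onto base.

    Summing the deltas with weight 1 is an assumption made for brevity;
    other combination rules can be applied to the sparsified deltas instead.
    """
    merged = list(base)
    for i, finetuned in enumerate(finetuned_models):
        delta = dare_delta(base, finetuned, drop_rate, seed + i)
        merged = [m + d for m, d in zip(merged, delta)]
    return merged
```

With `drop_rate=0.0` nothing is dropped and the result is simply the base plus every delta. Higher rates keep fewer entries but scale them up, which is what lets sparse deltas from several models be combined with fewer collisions.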

Other work reports strong gains in specific domains. An evolutionary merging approach produced a Japanese math LLM that achieved state-of-the-art performance on Japanese benchmarks, surpassing models with substantially more parameters [3]. In multimodal settings, an uncertainty-guided merging method (UQ-Merge) improved average accuracy by up to 44.3% across 12 datasets compared to existing merging methods [7]. That last figure is measured against other merging methods rather than against the unmerged parents, so it speaks mainly to how much the choice of method matters. Taken together, these results suggest that well-optimized merging can deliver large improvements.

What makes merging work—or fail?

The key is how you handle interference between the models' parameters. Early merging methods simply averaged weights, which often caused performance to drop because different models had conflicting parameter values [8]. TIES-Merging addresses this in three steps: it trims small changes, resolves sign conflicts, and merges only the parameters whose signs agree. It outperformed prior merging methods across diverse settings [8]. Similarly, Layer-Adaptive SLERP uses geometry-preserving interpolation with layer-specific coefficients, achieving stable merges across 50+ combinations spanning six architectures and seven parameter scales [6].
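
The two ideas in this paragraph can be sketched on toy data. The block below is a simplified illustration, not the reference implementation of either method. `ties_merge` follows the trim, elect-sign, and merge-agreeing steps on flat lists of floats, and `slerp` interpolates along the arc between two weight vectors instead of along the straight line. The `density` and `lam` parameters, the per-layer dictionary layout, and the way per-layer coefficients are supplied are assumptions made for illustration; how [6] chooses its coefficients is not described here.

```python
import math


def ties_merge(base, finetuned_models, density=0.2, lam=1.0):
    """Toy TIES-style merge: trim, elect a sign, average agreeing entries."""
    n = len(base)
    task_vectors = [[f - b for f, b in zip(ft, base)] for ft in finetuned_models]

    # Trim: keep only the top `density` fraction of each task vector by magnitude.
    k = max(1, min(n, int(round(density * n))))
    trimmed = []
    for tv in task_vectors:
        order = sorted(range(n), key=lambda i: abs(tv[i]), reverse=True)
        keep = set(order[:k])
        trimmed.append([v if i in keep else 0.0 for i, v in enumerate(tv)])

    merged = []
    for i in range(n):
        values = [tv[i] for tv in trimmed]
        # Elect sign: the sign of the sum is the sign with more total magnitude.
        total = sum(values)
        sign = 1.0 if total > 0 else -1.0 if total < 0 else 0.0
        # Merge: average only the entries that agree with the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(base[i] + lam * mean)
    return merged


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in w_a))
    norm_b = math.sqrt(sum(x * x for x in w_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    cos_omega = sum(x * y for x, y in zip(w_a, w_b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(w_a, w_b)]
    coeff_a = math.sin((1 - t) * omega) / sin_omega
    coeff_b = math.sin(t * omega) / sin_omega
    return [coeff_a * x + coeff_b * y for x, y in zip(w_a, w_b)]


def layerwise_slerp(layers_a, layers_b, coeffs):
    """Apply slerp per layer; `coeffs` maps layer name to its own t (hypothetical layout)."""
    return {name: slerp(layers_a[name], layers_b[name], coeffs[name]) for name in layers_a}
```

For example, with a zero base and two models whose first entries are 2.0 and -1.0, plain averaging gives 0.5, while `ties_merge(..., density=1.0)` elects the positive sign and keeps 2.0. That is the sign-conflict case that simple averaging handles poorly.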

Model scale also matters. Researchers found that merging tiny LLMs (1.7 billion parameters) did not produce the emergent capabilities seen in larger models (7B+ parameters) [2]. This suggests that some minimum model size may be needed for merging to unlock new abilities. The diversity of the parent models is also important: merging models fine-tuned on different tasks or languages can create synergies, whereas merging very similar models yields little gain [2][3].

Can merging create capabilities that neither parent model had?

Yes, and this is one of the most surprising findings. Merging is not just averaging; it can produce emergent abilities. For example, merging a Japanese language model with a math reasoning model produced a model that could do math in Japanese, even though neither parent was trained for that combination [3]. A separate study describes merging as a 'transformative method' in which nonlinear interactions between parameters create new functionalities [2].

This also works across modalities. Merging vision-language, audio-language, and video-language models moved toward an 'Omni-language model' that outperformed individual modality models [5]. And merging a security-focused fine-tuned model with a general model significantly improved jailbreak resistance with minimal performance loss [4]. So merging can combine strengths in ways that fine-tuning alone cannot easily achieve.

Sources used in this answer

[1] Research on enhancing model performance by merging with Korean language models

Merging a multilingual LLM with a Korean language model using DARE improved average benchmark performance by 1.69% and boosted GSM8K math reasoning by over 20%.

2025 · Taewan Cho, Rina Kim, Andrew Jaeyong Choi · Eng. Appl. Artif. Intell.

[2] Fine-tuning large language models for domain adaptation: exploration of training strategies, scaling, model merging and synergistic capabilities

Model merging can create emergent capabilities that surpass parent models, but this effect was not observed in very small (1.7B parameter) LLMs, suggesting scale is important.

2025 · Wei Lu, Rachel K. Luu, Markus J. Buehler · npj Computational Materials

[3] Evolutionary optimization of model merging recipes

An evolutionary merging approach automatically discovered effective model combinations, producing a Japanese math LLM that achieved state-of-the-art performance on Japanese benchmarks.

2025 · Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, David Ha · Nat. Mach. Intell.

[4] Enhancing Jailbreak Resistance in Large Language Models Using Model Merge

Merging a security fine-tuned model with a general LLM significantly improved jailbreak resistance with minimal performance degradation.

2025 · Saki Hiromi, Hiroki Kinoshita, Masanori Yamada, Takayuki Miura · SP (Workshops)

[5] Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Merging vision-language, audio-language, and video-language models moved toward an Omni-language model that outperformed individual modality models.

2025 · Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao · arXiv (Cornell University)

[6] Geometric Model Merging for Efficient and Scalable Adaptation of Large Language Models

Layer-Adaptive SLERP, a geometry-preserving merging method, improved stability and performance across 50+ merges spanning six architectures and seven parameter scales.

2025 · Lilian Rage, Y. Lalain, Mathis Escriva, Martial Roberge, P. Lemaistre, André Rochet, Gérard Réus, B. P. Bhuyan · BigData Congress [Services Society]

[7] UQ-Merge: Uncertainty Guided Multimodal Large Language Model Merging

Uncertainty-guided merging (UQ-Merge) improved average accuracy by up to 44.3% across 12 multimodal datasets compared to existing merging methods.

2025 · Huaizhi Qu, Xinyu Zhao, Jie Peng, Kwonjoon Lee, Behzad Dariush, Tianlong Chen · Findings of the Association for Computational Linguistics: ACL 2025

[8] TIES-Merging: Resolving Interference When Merging Models

TIES-Merging resolved parameter interference by trimming small changes and resolving sign conflicts, outperforming prior merging methods across diverse settings.

2023 · Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal · NeurIPS
