The paper introduces a novel function-preserving expansion method for Transformer models to tackle catastrophic forgetting during fine-tuning. By replicating MLP parameters and applying a scaling correction, the method allows models to "grow" capacity while remaining mathematically identical to the original at initialization, matching Full Fine-tuning (SFT) performance while maintaining near-zero degradation on original tasks.
TL;DR
Adapting Large Language Models (LLMs) to specialized domains usually comes at a cost: Catastrophic Forgetting. This paper introduces a "capacity growth" approach that expands the MLP layers of Transformers by replicating pre-trained weights. By using a clever mathematical scaling, the new model starts with the exact same output as the original, enabling stable training that "grows" new skills without erasing the old ones.
The Plasticity-Stability Dilemma
In the world of LLMs, there is a classic trade-off: Plasticity (the ability to learn new things) vs. Stability (the ability to remember old things).
- Standard Fine-tuning (SFT) favors plasticity but overwrites weights, leading to a collapse of general knowledge.
- Regularization (like EWC) tries to enforce stability but often prevents the model from truly mastering the new task.
- Expansion methods (like Adapters) often start from zero-initialized "dead" weights, failing to leverage the rich features already present in the pre-trained model.
The authors argue that we should reuse existing knowledge to build new capacity without breaking the model's initial functional mapping.
Methodology: The "Copy and Scale" Intuition
The core innovation lies in the MLP submodules. A standard Transformer MLP projects the hidden dimension $h$ up to a wider intermediate dimension $p$ and then back down to $h$. The authors propose:
- Up-projection Expansion: Duplicate the columns of the $W^{(1)}$ matrix, making the intermediate layer twice as wide.
- Down-projection Scaling: Stack two copies of the $W^{(2)}$ matrix along the widened dimension and divide the weights by 2.
Mathematical Intuition
Because each hidden activation now appears twice, the $1/2$ scale in the second layer exactly cancels the duplication: the two halved contributions sum back to the original output. At step 0 of training, the expanded model therefore behaves exactly like the base model.
Figure 1: Doubling the MLP dimension while preserving function.
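The copy-and-scale identity is easy to verify numerically. Below is a minimal NumPy sketch (not the paper's code; the toy MLP, shapes, and ReLU activation are illustrative assumptions) showing that the expanded MLP reproduces the original output exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
h, p = 8, 32                      # hidden size and MLP intermediate size (toy values)

# Toy pre-trained MLP: x -> relu(x @ W1) @ W2 (biases omitted for brevity)
W1 = rng.normal(size=(h, p))      # up-projection
W2 = rng.normal(size=(p, h))      # down-projection

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

# Function-preserving expansion: duplicate the up-projection columns so the
# intermediate layer is twice as wide, and stack the down-projection rows
# scaled by 1/2 so the duplicated activations cancel exactly.
W1_big = np.concatenate([W1, W1], axis=1)          # (h, 2p)
W2_big = np.concatenate([W2 / 2, W2 / 2], axis=0)  # (2p, h)

x = rng.normal(size=(4, h))
assert np.allclose(mlp(x, W1, W2), mlp(x, W1_big, W2_big))
```

The cancellation holds for any elementwise activation, since duplication commutes with applying the nonlinearity column by column.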
The paper introduces two flavors:
- G-Freeze: Only the newly added parameters are trainable. This is the safest for memory.
- G-Train: The entire up-projection is trainable, which is better for "cognitively demanding" tasks like math.
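G-Freeze can be expressed as a gradient mask over the expanded matrix. The sketch below is a hedged illustration, assuming the expanded up-projection stores the original columns first and the copied columns second (the layout and update rule are my assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
h, p = 8, 32
W1_big = rng.normal(size=(h, 2 * p))   # assumed layout: [original | copy] columns

# G-Freeze: only the replicated half is trainable, so zero out gradients
# for the original columns before each update.
mask = np.zeros_like(W1_big)
mask[:, p:] = 1.0

grad = rng.normal(size=W1_big.shape)   # stand-in for a backprop gradient
lr = 1e-2
W1_before = W1_big.copy()
W1_big -= lr * (grad * mask)           # masked SGD step

# Original columns are untouched; only the new columns moved.
assert np.allclose(W1_big[:, :p], W1_before[:, :p])
assert not np.allclose(W1_big[:, p:], W1_before[:, p:])
```

G-Train would simply use a mask of all ones over the up-projection, trading some stability for plasticity on harder tasks.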
Experimental Proof: No More Forgetting
The researchers tested this on Gemma-1B and Gemma-4B. The results are striking. In tasks like Science Entailment and French Translation, standard SFT sees the "original knowledge" (WinoGrande score) plummet to near-zero. Grow, Don’t Overwrite (the orange line in the charts below) maintains a flat line for original knowledge while climbing the accuracy curve for the new task.
Figure 2: Our method (orange) maintains original capabilities while SFT (blue) collapses.
Deep Insight: Visualizing Rank and Complexity
Why does this work so well? The authors looked at the Effective Rank of weight updates. They found that simple tasks (like translation) need updates only in a few specific layers, whereas complex tasks like MathQA require "high-rank" updates across nearly all layers.
By growing the capacity of the MLP, they provide the "room" needed for these high-rank updates without displacing the low-rank structures that hold foundational language skills.
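Effective rank can be computed in several ways; one common definition is the exponential of the entropy of the normalized singular values (the paper's exact formula may differ). A minimal sketch under that assumption, showing that a low-rank update scores far lower than a dense one:

```python
import numpy as np

def effective_rank(M):
    # Effective rank as exp(entropy of normalized singular values);
    # this is one standard definition, assumed here for illustration.
    s = np.linalg.svd(M, compute_uv=False)
    q = s / s.sum()
    q = q[q > 0]
    return float(np.exp(-(q * np.log(q)).sum()))

rng = np.random.default_rng(3)
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))  # rank-2 update
dense = rng.normal(size=(64, 64))                               # full-rank update

assert effective_rank(low_rank) < effective_rank(dense)
```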
Critical Analysis & Takeaways
The Power of Modularity
One of the most practical findings is that you don't need to expand every layer. By ranking layers based on their update magnitude during a pilot run, the authors showed that expanding just ~30% of parameters (targeted layers) achieves the same result as full expansion.
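The selection heuristic above can be sketched in a few lines. Everything here is a hypothetical stand-in: the per-layer update magnitudes would come from a real pilot fine-tune, and the 30% threshold follows the figure quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers = 12

# Stand-in for per-layer update magnitudes from a short pilot run,
# e.g. the Frobenius norm of (W_after - W_before) for each layer.
update_norms = rng.random(n_layers)

# Expand only the most-updated layers, targeting roughly 30% of them.
k = max(1, int(0.3 * n_layers))
top_layers = np.argsort(update_norms)[::-1][:k]
assert len(top_layers) == k
```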
Limitations
- Inference Overhead: Although the training is efficient, the final model is physically larger. If you double the MLPs in all layers, the parameter count increases by ~60%.
- MLP Focus: The study found growing Attention heads was less effective, but future work might find better ways to "grow" the MHA (Multi-Head Attention) blocks.
Final Verdict
This paper shifts the paradigm of LLM adaptation. Instead of viewing fine-tuning as a modification of existing weights, we should view it as an augmentation of capacity. By respecting the functional integrity of the pre-trained model through mathematical preservation, we can create specialist models that don't lose their "general sense."
