The paper introduces a novel function-preserving expansion method for Transformer models to tackle catastrophic forgetting during fine-tuning. By replicating MLP parameters and applying a scaling correction, the method allows models to "grow" capacity while remaining mathematically identical to the original at initialization, matching Full Fine-tuning (SFT) performance while maintaining near-zero degradation on original tasks.
TL;DR
Adapting Large Language Models (LLMs) to specialized domains usually comes at a cost: Catastrophic Forgetting. This paper introduces a "capacity growth" approach that expands the MLP layers of Transformers by replicating pre-trained weights. By using a clever mathematical scaling, the new model starts with the exact same output as the original, enabling stable training that "grows" new skills without erasing the old ones.
The Plasticity-Stability Dilemma
In the world of LLMs, there is a classic trade-off: Plasticity (the ability to learn new things) vs. Stability (the ability to remember old things).
- Standard Fine-tuning (SFT) favors plasticity but overwrites weights, leading to a collapse of general knowledge.
- Regularization (like EWC) tries to enforce stability but often prevents the model from truly mastering the new task.
- Expansion methods (like Adapters) often start from zero-initialized "dead" weights, failing to leverage the rich features already present in the pre-trained model.
The authors argue that we should reuse existing knowledge to build new capacity without breaking the model's initial functional mapping.
Methodology: The "Copy and Scale" Intuition
The core innovation lies in the MLP submodules. A standard Transformer MLP projects the hidden dimension $h$ up to a wider intermediate dimension $p$ and then back down to $h$. The authors propose:
- Up-projection Expansion: Duplicate the columns of the $W^{(1)}$ matrix, making the intermediate layer twice as wide.
- Down-projection Scaling: Stack two copies of the $W^{(2)}$ matrix along the widened dimension and divide the weights by 2.
Mathematical Intuition
Because each hidden activation now appears twice, the $1/2$ scale in the second layer exactly cancels the duplication: the two halved contributions sum back to the original output. At step 0 of training, the expanded model therefore behaves exactly like the base model.
Figure 1: Doubling the MLP dimension while preserving function.
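The copy-and-scale identity is easy to verify numerically. Below is a minimal NumPy sketch (not the paper's code; the toy MLP, shapes, and ReLU activation are illustrative assumptions) showing that the expanded MLP reproduces the original output exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
h, p = 8, 32                      # hidden size and MLP intermediate size (toy values)

# Toy pre-trained MLP: x -> relu(x @ W1) @ W2 (biases omitted for brevity)
W1 = rng.normal(size=(h, p))      # up-projection
W2 = rng.normal(size=(p, h))      # down-projection

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

# Function-preserving expansion: duplicate the up-projection columns so the
# intermediate layer is twice as wide, and stack the down-projection rows
# scaled by 1/2 so the duplicated activations cancel exactly.
W1_big = np.concatenate([W1, W1], axis=1)          # (h, 2p)
W2_big = np.concatenate([W2 / 2, W2 / 2], axis=0)  # (2p, h)

x = rng.normal(size=(4, h))
assert np.allclose(mlp(x, W1, W2), mlp(x, W1_big, W2_big))
```

The cancellation holds for any elementwise activation, since duplication commutes with applying the nonlinearity column by column.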
The paper introduces two flavors:
- G-Freeze: Only the newly added parameters are trainable. This is the safest for memory.
- G-Train: The entire up-projection is trainable, which is better for "cognitively demanding" tasks like math.
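G-Freeze can be expressed as a gradient mask over the expanded matrix. The sketch below is a hedged illustration, assuming the expanded up-projection stores the original columns first and the copied columns second (the layout and update rule are my assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
h, p = 8, 32
W1_big = rng.normal(size=(h, 2 * p))   # assumed layout: [original | copy] columns

# G-Freeze: only the replicated half is trainable, so zero out gradients
# for the original columns before each update.
mask = np.zeros_like(W1_big)
mask[:, p:] = 1.0

grad = rng.normal(size=W1_big.shape)   # stand-in for a backprop gradient
lr = 1e-2
W1_before = W1_big.copy()
W1_big -= lr * (grad * mask)           # masked SGD step

# Original columns are untouched; only the new columns moved.
assert np.allclose(W1_big[:, :p], W1_before[:, :p])
assert not np.allclose(W1_big[:, p:], W1_before[:, p:])
```

G-Train would simply use a mask of all ones over the up-projection, trading some stability for plasticity on harder tasks.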
Experimental Proof: No More Forgetting
The researchers tested this on Gemma-1B and Gemma-4B. The results are striking. In tasks like Science Entailment and French Translation, standard SFT sees the "original knowledge" (WinoGrande score) plummet to near-zero. Grow, Don’t Overwrite (the orange line in the charts below) maintains a flat line for original knowledge while climbing the accuracy curve for the new task.
Figure 2: Our method (orange) maintains original capabilities while SFT (blue) collapses.
Deep Insight: Visualizing Rank and Complexity
Why does this work so well? The authors looked at the Effective Rank of weight updates. They found that simple tasks (like translation) need updates only in a few specific layers, whereas complex tasks like MathQA require "high-rank" updates across nearly all layers.
By growing the capacity of the MLP, they provide the "room" needed for these high-rank updates without displacing the low-rank structures that hold foundational language skills.
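Effective rank can be computed in several ways; one common definition is the exponential of the entropy of the normalized singular values (the paper's exact formula may differ). A minimal sketch under that assumption, showing that a low-rank update scores far lower than a dense one:

```python
import numpy as np

def effective_rank(M):
    # Effective rank as exp(entropy of normalized singular values);
    # this is one standard definition, assumed here for illustration.
    s = np.linalg.svd(M, compute_uv=False)
    q = s / s.sum()
    q = q[q > 0]
    return float(np.exp(-(q * np.log(q)).sum()))

rng = np.random.default_rng(3)
low_rank = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))  # rank-2 update
dense = rng.normal(size=(64, 64))                               # full-rank update

assert effective_rank(low_rank) < effective_rank(dense)
```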
Critical Analysis & Takeaways
The Power of Modularity
One of the most practical findings is that you don't need to expand every layer. By ranking layers based on their update magnitude during a pilot run, the authors showed that expanding just ~30% of parameters (targeted layers) achieves the same result as full expansion.
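The selection heuristic above can be sketched in a few lines. Everything here is a hypothetical stand-in: the per-layer update magnitudes would come from a real pilot fine-tune, and the 30% threshold follows the figure quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers = 12

# Stand-in for per-layer update magnitudes from a short pilot run,
# e.g. the Frobenius norm of (W_after - W_before) for each layer.
update_norms = rng.random(n_layers)

# Expand only the most-updated layers, targeting roughly 30% of them.
k = max(1, int(0.3 * n_layers))
top_layers = np.argsort(update_norms)[::-1][:k]
assert len(top_layers) == k
```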
Limitations
- Inference Overhead: Although the training is efficient, the final model is physically larger. If you double the MLPs in all layers, the parameter count increases by ~60%.
- MLP Focus: The study found growing Attention heads was less effective, but future work might find better ways to "grow" the MHA (Multi-Head Attention) blocks.
Final Verdict
This paper shifts the paradigm of LLM adaptation. Instead of viewing fine-tuning as a modification of existing weights, we should view it as an augmentation of capacity. By respecting the functional integrity of the pre-trained model through mathematical preservation, we can create specialist models that don't lose their "general sense."
