Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning


[ICLR 2025] Forget Forgetting: How Pretrained VLAs are Rewriting the Rules of Continual Robot Learning

Abstract

This paper investigates the continual learning (CL) capabilities of large-scale pretrained Vision-Language-Action (VLA) models like Pi0 and GR00T. It demonstrates that unlike smaller models, VLAs are remarkably resistant to catastrophic forgetting when using a simple Experience Replay (ER) strategy, achieving near-zero or even positive backward transfer on the LIBERO benchmark.

TL;DR

Conventional wisdom in robotics suggests that robots are "forgetful" students: learning a new task typically means losing the old one. This paper turns that notion on its head. Evaluating large Vision-Language-Action (VLA) models such as Pi0 and GR00T, the researchers found that these models are remarkably resistant to catastrophic forgetting. With just a tiny "memory" (a small Experience Replay buffer), they can maintain, and sometimes even improve, performance on old tasks while mastering new ones.

The Bottom Line: Pretraining is the ultimate "stability" buffer. It allows models to store knowledge in a way that remains accessible even when task-level performance temporarily dips.

The Stability-Plasticity Dilemma

In robotics, Continual Learning (CL) has always been a battle between Stability (don't forget Task A) and Plasticity (learn Task B quickly). Small models trained from scratch usually lose this battle: their representations are "shallow", so weight updates for a new task easily overwrite what was learned for old ones.

Historically, we fought this with:

Massive Replay Buffers: Keeping 20-50% of all old data.
Complex Regularization: Methods like EWC (Elastic Weight Consolidation), which penalize changes to the weights that mattered most for earlier tasks.
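For readers who have not met EWC before, the idea fits in a few lines. This is a minimal sketch of the usual quadratic penalty, (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2, written over plain Python lists; the function name and the toy numbers are illustrative and not taken from the paper.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """Quadratic EWC penalty over flat parameter lists.

    params        -- current weights (theta)
    anchor_params -- weights saved after the previous task (theta_star)
    fisher_diag   -- per-weight importance estimates (diagonal Fisher)
    lam           -- regularization strength
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all three lists must have the same length")
    weighted_drift = sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * lam * weighted_drift


# Toy example: only the first weight moved, and it has importance 4.0.
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 10.0], lam=2.0)
```

The penalty is added to the new task's loss, so important weights are discouraged from moving rather than literally frozen.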

But as we move into the era of Foundation Models for robotics, does this still hold?

Methodology: Putting VLAs to the Test

The researchers tested two heavyweight VLAs—Pi0 (a Flow Matching model) and GR00T N1.5—against standard BC-Transformers and Diffusion Policies across the LIBERO benchmark tasks.

Fig 1 (Comparison of VLA vs. Small Models): Success matrices showing the dramatic difference in stability. While the small BC-Transformer (bottom) sees its performance vanish (darker colors) as it learns new tasks, the VLA (top) maintains high success rates (bright yellow) across the board.

The "Surprising" Effectiveness of Simple Replay

The most shocking finding: Experience Replay (ER) is all you need. With a buffer size of just 2%, a setting where smaller models completely collapse, VLAs retained nearly all of their performance on earlier tasks. Some even showed Positive Backward Transfer, meaning that learning a later task actually made them better at an earlier one.
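To show how little machinery ER needs, here is a minimal sketch of a fractional replay buffer. The function names, the uniform per-task sampling, and the 50/50 `replay_share` are assumptions made for illustration; this summary does not say how the authors mix replayed and new demonstrations.

```python
import random


def build_replay_buffer(old_task_data, fraction=0.02, seed=0):
    """Keep a small random fraction of demonstrations from each old task.

    old_task_data maps a task id to a list of demonstrations.
    At least one demonstration per task is always kept.
    """
    rng = random.Random(seed)
    buffer = []
    for task_id, demos in old_task_data.items():
        keep = max(1, round(len(demos) * fraction))
        buffer.extend((task_id, demo) for demo in rng.sample(demos, keep))
    return buffer


def mixed_batch(new_task_id, new_demos, buffer, batch_size,
                replay_share=0.5, seed=0):
    """Draw one training batch that mixes new-task and replayed samples."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_share) if buffer else 0
    batch = [(new_task_id, rng.choice(new_demos))
             for _ in range(batch_size - n_replay)]
    batch += [rng.choice(buffer) for _ in range(n_replay)]
    rng.shuffle(batch)
    return batch


# Toy data: integers stand in for demonstrations.
old = {"task_a": list(range(100)), "task_b": list(range(50))}
buffer = build_replay_buffer(old, fraction=0.02)
batch = mixed_batch("task_c", list(range(200)), buffer, batch_size=8)
```

With 100 and 50 stored demonstrations, a 2% buffer keeps only three samples in total, which gives a feel for how small the "memory" in this setting is.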

Deep Dive: Why Does This Work?

1. Pretraining is the Key Factor

By comparing Pi0 variants (pretrained vs. trained from scratch), the authors showed that pretraining creates a "Pareto Frontier" that small models can't touch. Pretraining doesn't just help you learn faster (forward transfer); it acts as a structural anchor that prevents weight updates from drifting into "garbage" territory for old tasks.

Fig 2 (Pareto Frontier): The gap between "Pretrained" and "From Scratch" models widens as the replay buffer gets smaller, indicating that pretraining mitigates forgetting precisely when replay data is scarce.

2. Knowledge is "Dormant" Not "Dead"

Perhaps the most profound insight is the Recovery Efficiency. When a VLA looks like it has forgotten a task (0% success rate), the underlying knowledge isn't gone—it's just "misaligned."

The Probe: Re-finetune the model on the "forgotten" task.
The Result: The VLA recovers peak performance in less than 10% of the original training steps.
Small Models: Take 100% or more of the original training time to relearn, suggesting the knowledge was truly "erased."
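The probe above boils down to one ratio: steps needed to get back to a target success rate, divided by the steps the task originally took. The sketch below is one way to compute it from an evaluation log; the helper names and all numbers in the example are made up for illustration and are not results from the paper.

```python
def steps_to_recover(eval_log, target_success):
    """Return the first training step at which success reaches the target.

    eval_log is a list of (training_step, success_rate) pairs sorted by step.
    Returns None if the target is never reached.
    """
    for step, success in eval_log:
        if success >= target_success:
            return step
    return None


def recovery_ratio(eval_log, target_success, original_steps):
    """Fraction of the original training budget needed to recover."""
    steps = steps_to_recover(eval_log, target_success)
    return None if steps is None else steps / original_steps


# Hypothetical re-finetuning log for a task that had dropped to 0% success.
log = [(0, 0.0), (500, 0.6), (1000, 0.92), (1500, 0.95)]
ratio = recovery_ratio(log, target_success=0.9, original_steps=20000)
```

A ratio well below 0.1 matches the "dormant" pattern described for VLAs, while a ratio near or above 1.0 matches the "erased" pattern described for small models.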

3. Anatomical Forgetting

Through Component Swapping, the team found that the Vision-Language (VL) backbone is the primary source of forgetting (it's where the world representations live), while the Action Head is more consistent across tasks.
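Component swapping can be pictured as mixing parameters from two checkpoints and re-evaluating the hybrid. The sketch below uses plain dictionaries in place of real checkpoints; the `vl_backbone.` and `action_head.` prefixes are hypothetical names, and the authors' actual protocol is not described in this summary.

```python
def swap_component(base_ckpt, donor_ckpt, prefix):
    """Copy base_ckpt, taking every parameter under `prefix` from donor_ckpt.

    Both checkpoints are flat dicts mapping parameter names to values.
    """
    missing = [name for name in base_ckpt
               if name.startswith(prefix) and name not in donor_ckpt]
    if missing:
        raise KeyError(f"donor checkpoint lacks parameters: {missing}")
    return {
        name: donor_ckpt[name] if name.startswith(prefix) else value
        for name, value in base_ckpt.items()
    }


# Toy checkpoints: one saved right after task A, one after later tasks.
after_task_a = {"vl_backbone.w": 1.0, "action_head.w": 2.0}
after_later_tasks = {"vl_backbone.w": 10.0, "action_head.w": 20.0}

# Hybrid: latest model, but with the backbone restored from task A.
hybrid = swap_component(after_later_tasks, after_task_a, "vl_backbone.")
```

If restoring only the backbone brings back most of the task A success rate, the forgetting lived mainly in the backbone, which is the pattern the post describes.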

Critical Analysis & Conclusion

Takeaways

Simplicity Wins: For VLAs, we don't need exotic CL algorithms. Simple Replay + Foundation Model = SOTA Continual Learning.
Pretraining = Insurance: Large-scale data pretraining is not just for "zero-shot" performance; it's a fundamental requirement for long-term robot autonomy.

Limitations

While VLAs are resistant, they are not immune. In the most diverse scenes (LIBERO-10), total forgetting still occurs if the replay buffer is essentially zero (e.g., only 10 samples). Additionally, the "Recovery" ability implies we might need a system that can "self-correct" or "quick-tune" rather than just relying on a static policy.

Future Outlook

This work suggests that the path to a lifelong-learning robot isn't through more complex anti-forgetting math, but through better representation reuse. If the knowledge is already there, we just need the right "trigger" to bring it back to the surface.

Table 1 (Final Results): Note how Pi0 and GR00T consistently maintain high Success Rates (SR) and low Negative Backward Transfer (NBT) compared to all other baselines.
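Backward transfer numbers like these are derived from a success matrix of the kind shown in Fig 1. The sketch below uses the common textbook definition of backward transfer (final success on each earlier task minus the success measured right after that task was learned, averaged over earlier tasks). The paper's exact metric, including how NBT is averaged, may differ, and the matrix values here are invented for illustration.

```python
def backward_transfer(success):
    """Average backward transfer from a square success matrix.

    success[i][j] is the success rate on task j after training on
    tasks 0..i in sequence. Positive values mean later training
    improved earlier tasks; negative values mean forgetting.
    """
    n = len(success)
    if n < 2:
        raise ValueError("need at least two tasks")
    deltas = [success[n - 1][j] - success[j][j] for j in range(n - 1)]
    return sum(deltas) / (n - 1)


# Hypothetical three-task run (rows: after training task i; columns: task j).
matrix = [
    [0.90, 0.00, 0.00],
    [0.85, 0.80, 0.00],
    [0.95, 0.80, 0.90],
]
bwt = backward_transfer(matrix)
```

In this toy matrix the first task ends up slightly better than it was right after it was learned, so the average comes out positive, which is what "Positive Backward Transfer" refers to.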
