RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies


[CVPR 2026] RoboMME: Decoding the "When, Where, What, and How" of Robotic Memory


Abstract

RoboMME is a large-scale robotic manipulation benchmark (16 tasks, 770k timesteps) designed to evaluate four cognitive memory dimensions: temporal, spatial, object, and procedural. The authors introduce MME-VLA, a family of 14 vision-language-action models that systematically compare symbolic, perceptual, and recurrent memory representations, achieving a new state-of-the-art balance between performance and efficiency.

TL;DR

RoboMME is a benchmark that pushes robot policies beyond "reactive" behavior toward "cognitive" multi-step reasoning over history. By evaluating 14 VLA (Vision-Language-Action) variants, the study reaches one central finding: perceptual memory (visual tokens) works best for motion-centric tasks, while symbolic memory (language) works best for counting and logic. The best balance of performance and efficiency comes from Memory-as-Modulator, a lightweight integration that conditions the robot's "action expert" without breaking its pretrained VLM foundations.

Figure: RoboMME overview.

The "Memory Gap" in Robotic Generalists

Most current SOTA robots are "forgetful." In standard benchmarks like CALVIN or LIBERO, a robot can often succeed simply by looking at its current frame—if it sees a handle, it pulls it. But real-world tasks aren't always Markovian. If a robot is asked to "wipe the table 3 times," the 2nd wipe looks identical to the 1st. Without memory, it cannot count.
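
The counting problem can be made concrete with a toy sketch. Everything here is hypothetical and not taken from RoboMME: the observation string, the two policies, and the `rollout` helper are illustrative names only. The point is that when every frame looks the same, a policy that maps the current frame to an action cannot know when to stop, while even a one-integer memory can.

```python
from typing import Callable, List

WIPE, STOP = "wipe", "stop"


def reactive_policy(observation: str) -> str:
    """Memoryless policy: the action depends only on the current frame."""
    # The frame carries no information about how many wipes already happened,
    # so identical frames always produce the identical action.
    return WIPE if observation == "table_in_view" else STOP


class CountingPolicy:
    """Policy with a minimal memory: a counter of wipes issued so far."""

    def __init__(self, target_wipes: int) -> None:
        self.target_wipes = target_wipes
        self.wipes_done = 0

    def __call__(self, observation: str) -> str:
        if observation == "table_in_view" and self.wipes_done < self.target_wipes:
            self.wipes_done += 1
            return WIPE
        return STOP


def rollout(policy: Callable[[str], str], max_steps: int = 10) -> List[str]:
    """Run a policy in a toy world where every frame looks identical."""
    actions: List[str] = []
    for _ in range(max_steps):
        action = policy("table_in_view")
        actions.append(action)
        if action == STOP:
            break
    return actions


reactive_actions = rollout(reactive_policy)                # never stops on its own
memory_actions = rollout(CountingPolicy(target_wipes=3))   # wipe, wipe, wipe, stop
```

The reactive policy keeps wiping until the step limit cuts it off, which is the non-Markovian failure the paragraph above describes.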

Prior works tried to fix this using everything from RNNs to GPT-4o-based subgoals. However, because they all used different backbones and tasks, the community couldn't answer the billion-dollar question: What is the most efficient way to give a robot a memory?

The Cognitive Blueprint: Four Dimensions of Memory

The authors of RoboMME grounded their tasks in human cognitive theory, splitting memory into four suites:

Temporal (When): Counting pick-and-place repetitions (e.g., BinFill).
Spatial (Where): Tracking objects under occlusion or after swaps (e.g., Permanence).
Object (What): Remembering which cube was highlighted 10 seconds ago (e.g., Reference).
Procedural (How): Mimicking a specific trajectory demonstrated in a video (e.g., Imitation).

Methodology: MME-VLA Architecture

The core contribution is the systematic comparison of how to "plug" memory into a VLA (specifically the $\pi_{0.5}$ backbone).

1. Memory Representations

Symbolic: High-level text (e.g., "I have moved 2 blocks").
Perceptual: Raw visual tokens from past frames, compressed via Token Dropping or Uniform Sampling.
Recurrent: Compressed latent states using TTT (Test-Time Training) or RMT (Recurrent Memory Transformers).
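
To give a feel for what compressing perceptual memory can look like, here is a minimal sketch of the two ideas named above. It is an illustration under stated assumptions, not the paper's implementation: `uniform_sample_indices` and `drop_static_tokens` are hypothetical helpers, tokens are reduced to single floats, and the "drop tokens that barely changed since the previous frame" rule is an assumed criterion, since the summary does not say how Token Dropping selects tokens.

```python
from typing import List, Sequence, Tuple


def uniform_sample_indices(num_frames: int, budget: int) -> List[int]:
    """Pick up to `budget` evenly spaced frame indices from a history.

    The most recent frame (index num_frames - 1) is always kept.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    if num_frames <= budget:
        return list(range(num_frames))
    if budget == 1:
        return [num_frames - 1]
    step = (num_frames - 1) / (budget - 1)
    return [round(i * step) for i in range(budget)]


def drop_static_tokens(
    prev_tokens: Sequence[float],
    curr_tokens: Sequence[float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Keep only (position, token) pairs that changed by more than `threshold`.

    ASSUMPTION: change-based dropping is one plausible criterion chosen for
    this sketch; the source does not specify the actual rule.
    """
    kept: List[Tuple[int, float]] = []
    for position, (old, new) in enumerate(zip(prev_tokens, curr_tokens)):
        if abs(new - old) > threshold:
            kept.append((position, new))
    return kept


# A 9-frame history squeezed into a 3-frame budget.
sampled = uniform_sample_indices(num_frames=9, budget=3)

# Only the middle token moved between two consecutive frames.
changed = drop_static_tokens([0.1, 0.5, 0.9], [0.1, 0.8, 0.9], threshold=0.1)
```

Both helpers trade completeness for a fixed memory budget: sampling thins the history in time, while dropping thins each frame in space.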

2. Integration Mechanisms

The researchers tested three ways to fuse this data:

Memory-as-Context: Treating memory tokens like extra "words" in the input prompt.
Memory-as-Modulator: Using Adaptive LayerNorm (AdaLN) so that memory sets the scale (and typically the shift) applied to normalized activations in the robot's motor control layers, rather than adding tokens to the input.
Memory-as-Expert: Adding a whole new sub-transformer dedicated solely to history.
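
The difference between the first two mechanisms is easier to see in code. The sketch below is a plain-Python illustration, not the MME-VLA implementation: the function names, the tiny dimensions, and the weight matrices are all hypothetical. It follows the general AdaLN recipe of predicting a per-feature scale and shift from a conditioning vector; placing memory tokens before the prompt in `memory_as_context` is likewise an assumption.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           inputs: Sequence[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def adaln_modulate(x, memory, w_scale, b_scale, w_shift, b_shift) -> List[float]:
    """Memory-as-Modulator sketch: memory predicts scale and shift for LN(x)."""
    scale = linear(w_scale, b_scale, memory)
    shift = linear(w_shift, b_shift, memory)
    return [n * (1.0 + s) + t for n, s, t in zip(layer_norm(x), scale, shift)]


def memory_as_context(prompt_tokens: Sequence[str],
                      memory_tokens: Sequence[str]) -> List[str]:
    """Memory-as-Context sketch: memory tokens are added to the input sequence."""
    return list(memory_tokens) + list(prompt_tokens)


x = [1.0, 2.0, 3.0]          # action-expert activations (toy values)
memory = [0.5, -0.5]         # pooled memory embedding (toy values)
zeros_w = [[0.0, 0.0] for _ in x]
zeros_b = [0.0 for _ in x]

# With zero-initialized projections the layer reduces to a plain LayerNorm.
unchanged = adaln_modulate(x, memory, zeros_w, zeros_b, zeros_w, zeros_b)
```

Zero-initializing the projections is a common choice in other AdaLN work and is shown here only to illustrate how modulation can start out as a no-op; the summary does not confirm that RoboMME does this. Note also that the modulator leaves the sequence length untouched, whereas the context route makes the input longer with every memory token.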

Figure: MME-VLA architecture.

Key Findings: The Task-Dependency Paradox

The results show that there is no "silver bullet" for robotic memory; the best design depends on the task:

Perceptual + Modulator (Winner for Motion): Frame sampling (FRAMESAMP) combined with the Modul (Memory-as-Modulator) integration reached a 44.51% success rate. It excelled at tasks like PatternLock, where the robot needs to "see" the trajectory history to replicate it.
Symbolic (Winner for Counting): For BinFill, explicit language subgoals performed better. It is easier to "sum" numbers in text than to infer counts from raw pixel history.
The TTT/Recurrent Failure: Interestingly, recurrent models (like TTT/RMT) struggled, likely because fine-tuning shallow recurrent layers on a massive pretrained VLM is numerically unstable.

Figure: Experimental results.

Deep Insights & Conclusion

A central takeaway is the efficiency-performance balance. Symbolic memory produced by VLMs (like Gemini) is capable but slow, costing 3-5x more computation. Perceptual memory using Modul provides almost the same performance gains with negligible computational overhead.

Takeaway for the Industry: If you are building a generalist robot, don't just rely on a large LLM to "think" through the steps. You need a fast, perceptual visual buffer integrated directly into the motor expert to handle the "physicality" of memory, while a symbolic layer handles the "logic" of the task.
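
One way to picture that split is a memory object with two parts. This is a loose sketch and not anything proposed in the paper: `HybridMemory`, its fields, and the frame identifiers are hypothetical, and real frames would be visual tokens rather than strings. A bounded buffer holds recent perception for the motor side, while a small text log holds task progress for the logic side.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class HybridMemory:
    """Toy container: a fixed-size perceptual buffer plus a symbolic log."""

    max_frames: int = 4
    frames: Deque[str] = field(init=False)
    subgoals: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # deque(maxlen=...) silently evicts the oldest frame when full,
        # keeping per-step cost constant regardless of episode length.
        self.frames = deque(maxlen=self.max_frames)

    def observe(self, frame_id: str) -> None:
        """Store the newest frame for the perceptual side."""
        self.frames.append(frame_id)

    def log_subgoal(self, text: str) -> None:
        """Record a completed subgoal for the symbolic side."""
        self.subgoals.append(text)

    def completed(self) -> int:
        """Counting becomes a length lookup instead of pixel inference."""
        return len(self.subgoals)


memory = HybridMemory(max_frames=4)
for step in range(6):
    memory.observe(f"frame_{step}")
memory.log_subgoal("moved block 1")
memory.log_subgoal("moved block 2")
```

After six observations the buffer holds only the last four frames, while the subgoal log still answers "how many so far?" exactly.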

RoboMME stands as a rigorous new testbed that proves: to conquer the open world, robots must not only see and act but truly remember.

