RoboMME is a large-scale robotic manipulation benchmark (16 tasks, 770k timesteps) designed to evaluate four cognitive memory dimensions: temporal, spatial, object, and procedural. The authors introduce MME-VLA, a family of 14 vision-language-action models that systematically compare symbolic, perceptual, and recurrent memory representations, achieving a new state-of-the-art balance between performance and efficiency.
TL;DR
RoboMME is a benchmark that moves robotics beyond "reactive" behaviors toward "cognitive" many-step reasoning. By evaluating 14 different VLA (Vision-Language-Action) variants, the study reveals a critical insight: perceptual memory (visual tokens) is king for motion, while symbolic memory (language) wins at logic. The best balance comes from Memory-as-Modulator, a lightweight integration that conditions the robot's "action expert" without breaking its pretrained VLM foundations.

The "Memory Gap" in Robotic Generalists
Most current SOTA robots are "forgetful." In standard benchmarks like CALVIN or LIBERO, a robot can often succeed simply by looking at its current frame—if it sees a handle, it pulls it. But real-world tasks aren't always Markovian. If a robot is asked to "wipe the table 3 times," the 2nd wipe looks identical to the 1st. Without memory, it cannot count.
Prior work tried to fix this with everything from RNNs to GPT-4o-based subgoals. However, because each approach used a different backbone and different tasks, the community couldn't answer the billion-dollar question: What is the most efficient way to give a robot a memory?
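To make the wiping example concrete, here is a minimal sketch of why a memoryless policy cannot count. The names `reactive_policy`, `CountingPolicy`, and `rollout` are hypothetical and exist only for illustration; they are not part of RoboMME or MME-VLA. The observation is the same string at every step, which mimics the 2nd wipe looking identical to the 1st.

```python
from dataclasses import dataclass


def reactive_policy(observation: str) -> str:
    # Memoryless: the action depends only on the current observation,
    # so identical observations always produce identical actions.
    return "wipe" if observation == "table" else "stop"


@dataclass
class CountingPolicy:
    # Hypothetical policy whose only memory is a single counter.
    target_wipes: int
    wipes_done: int = 0

    def act(self, observation: str) -> str:
        if observation != "table" or self.wipes_done >= self.target_wipes:
            return "stop"
        self.wipes_done += 1
        return "wipe"


def rollout(policy, steps: int = 5) -> list:
    # Feed the same observation at every step.
    return [policy("table") for _ in range(steps)]


if __name__ == "__main__":
    print(rollout(reactive_policy))
    print(rollout(CountingPolicy(target_wipes=3).act))
```

The reactive policy keeps wiping for as long as it sees the table, while the counting policy stops after three wipes. The memory representations discussed below can be read as different ways of storing that counter, or its visual equivalent.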
The Cognitive Blueprint: Four Dimensions of Memory
The authors of RoboMME grounded their tasks in human cognitive theory, splitting memory into four suites:
- Temporal (When): Counting pick-and-place repetitions (e.g., BinFill).
- Spatial (Where): Tracking objects under occlusion or after swaps (e.g., Permanence).
- Object (What): Remembering which cube was highlighted 10 seconds ago (e.g., Reference).
- Procedural (How): Mimicking a specific trajectory demonstrated in a video (e.g., Imitation).
Methodology: MME-VLA Architecture
The core contribution is the systematic comparison of how to "plug" memory into a VLA (specifically the backbone).
1. Memory Representations
- Symbolic: High-level text (e.g., "I have moved 2 blocks").
- Perceptual: Raw visual tokens from past frames, compressed via Token Dropping or Uniform Sampling.
- Recurrent: Compressed latent states using TTT (Test-Time Training) or RMT (Recurrent Memory Transformers).
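As a rough illustration of the first two representations, the sketch below contrasts a symbolic memory string with a uniformly sampled frame buffer. The functions `symbolic_memory` and `uniform_sample` are hypothetical helpers written for this post. The exact sampling rule used in the paper is not described here, so evenly spaced indices that always keep the oldest and newest frame are an assumption.

```python
def symbolic_memory(blocks_moved: int) -> str:
    # Symbolic memory: the history is summarized as a short piece of text.
    return f"I have moved {blocks_moved} blocks"


def uniform_sample(frames: list, budget: int) -> list:
    # Perceptual memory (assumed variant): keep at most `budget` past frames,
    # evenly spaced over the history, so memory cost stays bounded.
    if budget <= 0:
        raise ValueError("budget must be positive")
    n = len(frames)
    if n <= budget:
        return list(frames)
    if budget == 1:
        return [frames[-1]]  # keep only the most recent frame
    # Integer arithmetic gives strictly increasing indices from 0 to n - 1.
    indices = [(i * (n - 1)) // (budget - 1) for i in range(budget)]
    return [frames[i] for i in indices]


if __name__ == "__main__":
    history = list(range(10))  # stand-ins for 10 past frames
    print(uniform_sample(history, 4))
    print(symbolic_memory(2))
```

In a real model each kept frame would be encoded into visual tokens. The point of the sketch is only that the perceptual buffer keeps raw evidence at a fixed budget, while the symbolic variant keeps a conclusion drawn from it.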
2. Integration Mechanisms
The researchers tested three ways to fuse this data:
- Memory-as-Context: Treating memory tokens like extra "words" in the input prompt.
- Memory-as-Modulator: Using Adaptive LayerNorm (AdaLN) so that memory sets the scale and shift applied to the normalized activations in the robot's motor control layers.
- Memory-as-Expert: Adding a whole new sub-transformer dedicated solely to history.
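The difference between the first two mechanisms can be sketched in a few lines of plain Python. This is not the MME-VLA implementation: the function names, the token order in `memory_as_context`, and the zero-initialized projections (borrowed from common AdaLN practice) are all assumptions made for illustration.

```python
import math


def memory_as_context(memory_tokens: list, input_tokens: list) -> list:
    # Memory-as-Context: memory is simply prepended to the input sequence
    # (the ordering here is an assumption).
    return memory_tokens + input_tokens


def layer_norm(x: list, eps: float = 1e-6) -> list:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights: list, bias: list, vec: list) -> list:
    # weights has one row per output dimension.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaln_modulate(hidden, memory, w_scale, b_scale, w_shift, b_shift):
    # Memory-as-Modulator: the memory vector is projected to a per-channel
    # scale and shift, which are applied to the normalized hidden state.
    scale = linear(w_scale, b_scale, memory)
    shift = linear(w_shift, b_shift, memory)
    normed = layer_norm(hidden)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]   # one action-expert activation vector
    memory = [0.5, -0.5]            # pooled memory embedding (hypothetical)
    zeros_w = [[0.0, 0.0] for _ in hidden]
    zeros_b = [0.0 for _ in hidden]
    # With zero-initialized projections the layer reduces to plain LayerNorm.
    print(adaln_modulate(hidden, memory, zeros_w, zeros_b, zeros_w, zeros_b))
```

With zero-initialized projections, the modulated layer behaves exactly like an unmodulated LayerNorm, so the sequence the pretrained model sees is unchanged at the start of fine-tuning. That is one plausible reading of why modulation is a lightweight integration, whereas Memory-as-Context lengthens the input sequence itself.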

Key Findings: The Task-Dependency Paradox
The results were surprising. There is no "Silver Bullet" for robotic memory:
- Perceptual + Modulator (Winner for Motion): Using frame sampling (FRAMESAMP) combined with the Modul integration provided a 44.51% success rate. It excelled at tasks like PatternLock because the robot needs to "see" the trajectory history to replicate it.
- Symbolic (Winner for Counting): For BinFill, explicit language subgoals performed better. It is easier to "sum" numbers in text than to infer counts from raw pixel history.
- The TTT/Recurrent Failure: Interestingly, recurrent models (like TTT/RMT) struggled, likely because fine-tuning shallow recurrent layers on a massive pretrained VLM is numerically unstable.

Deep Insights & Conclusion
One of the most profound takeaways is the Efficiency-Performance Balance. While symbolic VLMs (like Gemini) are clever, they are slow (3-5x more computation). Perceptual memory using Modul provides almost the same performance gains with negligible computational overhead.
Takeaway for the Industry: If you are building a generalist robot, don't just rely on a large LLM to "think" through the steps. You need a fast, perceptual visual buffer integrated directly into the motor expert to handle the "physicality" of memory, while a symbolic layer handles the "logic" of the task.
RoboMME stands as a rigorous new testbed that proves: to conquer the open world, robots must not only see and act but truly remember.
