RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design


[ICLR 2025] RMBench & Mem-0: Breaking the Markovian Constraint in Robotic Manipulation


Abstract

The paper introduces RMBench, a new benchmark for evaluating memory-dependent robotic manipulation, and Mem-0, a modular policy with explicit memory components. Mem-0 improves success rates by 31.6% on average over state-of-the-art baselines such as Pi0.5 and ACT.

TL;DR

The field of robotic manipulation has long been "stuck in the moment," with most SOTA models assuming the current observation is enough to decide the next move. RMBench and the Mem-0 policy challenge this by showing that robots need a structured "past" to handle complex, real-world tasks. With a dual-system memory architecture, the authors report a 31.6% performance boost over strong baselines such as Pi0.5.

The "Memory Gap" in Modern Robotics

Current leaders in the VLA (Vision-Language-Action) space, such as Pi0.6 or RDT2, are impressive at fine-grained skills. However, they share a fatal flaw: they are largely Markovian. If a task requires knowing what happened 500 steps ago (e.g., "Where did I put that block before the screen was occluded?"), these models usually fail.

The authors identify two tiers of memory difficulty:

M(1) Complexity: Requires remembering a single, specific past observation (e.g., a reference object).
M(n) Complexity: Requires tracking multiple historical events or trial-and-error phases (e.g., ranking blocks through repeated attempts).
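
One informal way to write down the distinction is shown below. This is a sketch in notation assumed for this post ($\pi$ for the policy, $o_t$ for the observation at step $t$, $\tau$ for the step of a remembered observation); it is not the paper's formal definition of the two tiers.

```latex
\begin{align*}
\text{Markovian:} \quad & a_t \sim \pi(\,\cdot \mid o_t) \\
M(1)\text{-style:} \quad & a_t \sim \pi(\,\cdot \mid o_t,\ o_{\tau}), \qquad \tau < t \\
M(n)\text{-style:} \quad & a_t \sim \pi(\,\cdot \mid o_t,\ o_{\tau_1}, \dots, o_{\tau_n}), \qquad \tau_1 < \dots < \tau_n < t
\end{align*}
```

Read this way, a Markovian policy has no access to $o_{\tau}$ once it leaves the current frame, which is why a reference seen hundreds of steps earlier cannot influence the action.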

Methodology: The Mem-0 Architecture

Mem-0 moves away from "black-box" end-to-end learning in favor of a Modular Memory approach. It mimics the human cognitive process of separating "What am I doing now?" from "What have I already achieved?"

1. The Planning Module (The "Brain")

This module uses a Key Memory Window to store descriptions and visual snapshots of completed subtasks. Instead of replanning on every frame, it triggers only when a subtask is done, which substantially reduces computational overhead.
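
The event-triggered planning described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `KeyMemory`, `KeyMemoryPlanner`, `plan_fn` and `toy_plan` are hypothetical, the window size is an arbitrary choice, and the planner is a stub standing in for whatever model Mem-0 actually uses.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List, Optional


@dataclass
class KeyMemory:
    description: str  # text summary of the finished subtask
    snapshot: Any     # visual snapshot taken when the subtask ended


class KeyMemoryPlanner:
    """Replans only at subtask boundaries (hypothetical sketch)."""

    def __init__(self, plan_fn: Callable[[List[KeyMemory]], str], window_size: int = 8):
        self.plan_fn = plan_fn  # assumed planner: key memories -> next subtask
        self.window: Deque[KeyMemory] = deque(maxlen=window_size)
        self.current_subtask: Optional[str] = None
        self.planner_calls = 0

    def _replan(self) -> None:
        self.current_subtask = self.plan_fn(list(self.window))
        self.planner_calls += 1

    def step(self, frame: Any, subtask_done: bool) -> str:
        if self.current_subtask is None:
            # First frame: nothing has been planned yet.
            self._replan()
        elif subtask_done:
            # Record what was just completed, then ask for the next subtask.
            self.window.append(KeyMemory(self.current_subtask, frame))
            self._replan()
        return self.current_subtask


# Toy usage: a stub planner that walks through a fixed subtask list.
subtasks = ["pick block", "place block", "press button"]


def toy_plan(memories: List[KeyMemory]) -> str:
    return subtasks[min(len(memories), len(subtasks) - 1)]


planner = KeyMemoryPlanner(toy_plan)
done_flags = [False] * 5 + [True] + [False] * 5 + [True] + [False] * 3
outputs = [planner.step(frame=i, subtask_done=d) for i, d in enumerate(done_flags)]
```

In this toy run the planner is invoked once at the start and once per completed subtask, rather than once per frame, which is the source of the savings the paragraph describes.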

2. The Execution Module (The "Hands")

To generate smooth, low-level actions, the execution module employs a diffusion-based policy conditioned on:

Anchor Memory: A fixed snapshot of the subtask's starting state (stops the robot from "forgetting" the goal).
Sliding Memory: A moving window of the last $K$ frames to capture immediate motion trends.
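
A minimal sketch of the two buffers, assuming frames are opaque objects and that the sliding window is reset at each subtask boundary (that reset is an assumption of this sketch, not something the summary states). The class and method names are hypothetical, and the diffusion policy itself is omitted.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional


class ExecutionMemory:
    """Anchor frame plus a sliding window of the last K frames (hypothetical sketch)."""

    def __init__(self, k: int):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.anchor: Optional[Any] = None
        self.sliding: Deque[Any] = deque(maxlen=k)  # oldest frames drop out automatically

    def start_subtask(self, frame: Any) -> None:
        # Anchor Memory: fixed snapshot of the subtask's starting state.
        self.anchor = frame
        # Assumption: the sliding window restarts with the new subtask.
        self.sliding.clear()
        self.sliding.append(frame)

    def observe(self, frame: Any) -> None:
        if self.anchor is None:
            raise RuntimeError("call start_subtask before observe")
        # Sliding Memory: only the most recent K frames are kept.
        self.sliding.append(frame)

    def conditioning(self) -> Dict[str, Any]:
        # What a downstream action policy would be conditioned on.
        return {"anchor": self.anchor, "sliding": list(self.sliding)}


# Toy usage with integers standing in for image frames.
mem = ExecutionMemory(k=3)
mem.start_subtask(0)
for t in range(1, 10):
    mem.observe(t)
cond = mem.conditioning()
```

After nine further observations the window holds only the three newest frames, while the anchor still points at frame 0. Plain frame-stacking would have lost that starting state.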

[Figure: Mem-0 overall architecture]

Experimental Battleground: RMBench

The authors didn't just test on easy tasks. RMBench includes 9 demanding dual-arm scenarios. The results were stark: standard models like ACT and DP (Diffusion Policy) hovered near 5-10% success rates on memory-heavy tasks, while Mem-0 reached a 42.0% average success rate.

[Figure: RMBench task overview]

Why did it work? (Ablation Insights)

Anchor is Key: Removing Anchor Memory caused success rates to plummet (from 52.8% to 26.8% in M(1) tasks). Without a persistent "anchor" of the goal state, the robot's focus drifts as the sliding window updates.
Subtask Bottleneck: The performance leap in M(n) tasks (like "Battery Try") is highly dependent on the Subtask End Classifier. When the learned classifier was replaced with a Ground Truth (GT) one, performance nearly doubled, suggesting that detecting when a subtask is done is a major bottleneck in its own right.

Real-World Validation

The researchers took Mem-0 out of the simulator and onto the X-One dual-arm platform. Even with the noise of real-world physics and human-collected data, Mem-0 maintained its lead, suggesting that memory-augmented architectures are more robust to real-world occlusions and long-horizon drift.

[Table: Real-world experiment results]

Evolution vs. Revolution: Deep Insight

RMBench isn't just another leaderboard; it defines Task Memory Complexity (TMC) as a formal metric, written $M(m)$. This gives the community a mathematical language to describe "difficulty" beyond just "steps to completion."

While Mem-0 still struggles with fine-grained semantic identification (like distinguishing two identical-looking objects), it provides a blueprint for the next generation of Memory-Aware Foundation Models. The future of robotics isn't just better vision; it's better "remembering."

Conclusion

Mem-0 makes the case that explicit memory is a shortcut to intelligence for long-horizon tasks. By decoupling planning from execution and short-term from long-term memory, we can start to bridge the gap between "reactive" robots and "reasoning" agents.

Key Takeaways for Practitioners:

Don't rely on simple frame-stacking for long tasks.
Implement an Anchor Memory to prevent "policy drift."
Use a Subtask Classifier to trigger high-level planning only when necessary, which cuts GPU inference cost and latency.

