[Physical Intelligence] MEM: Solving 15-Minute Robotic Tasks via Multi-Scale Embodied Memory
Abstract

MEM (Multi-Scale Embodied Memory) is a vision-language-action (VLA) architecture designed for long-horizon robotic tasks, integrating language-based and video-based memory. By combining these modalities, it enables policies to solve tasks spanning up to 15 minutes, achieving state-of-the-art results on the backbone across diverse manipulation scenarios.

TL;DR

Current Vision-Language-Action (VLA) models are often "forgetful," processing only the immediate frame or a tiny window of history. Physical Intelligence's new Multi-Scale Embodied Memory (MEM) changes this by splitting memory into two scales: an efficient Video Encoder for short-term visual dynamics and a Language-based Memory for long-term semantic tracking. This allows robots to remember recipe steps and adapt to failures over a 15-minute horizon while maintaining real-time inference.

The Motivation: Why Robots Forget

In robotic manipulation, memory serves two distinct masters:

  1. Metric/Visual Scale: "Where did that spoon go now that my arm is occluding it?" This requires high-frequency, dense visual information.
  2. Semantic/Cognitive Scale: "Have I already added the salt to this soup?" This requires long-term tracking but needs very few bits of information.

The industry's current dilemma is that feeding a full video history into a Transformer is computationally prohibitive, since self-attention cost grows as O(N²) in the number of tokens N, while aggressive compression (like pooling) loses the "spatial nuance" needed for dexterous tasks.

Methodology: The Best of Both Worlds

MEM factorizes the action prediction into a hierarchical structure:

1. The Language Memory (Long-Horizon)

Instead of storing thousands of past images, MEM trains a High-Level Policy to maintain a "journal" in natural language. After each subtask, the model updates a string $m_t$ (e.g., "I placed the plate in the cabinet and moved to the counter").

  • The Insight: Natural language is the ultimate compression format for semantic states.
  • The Training: The authors used an offline LLM to generate these summaries for human demonstration data, teaching the robot how to "distill" its own history.
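
The journal-style update described above can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `choose_subtask`, `summarize`, and `run_episode` are hypothetical names, and both the learned high-level policy and the low-level controller are replaced by trivial stand-ins (a string check and a callback).

```python
from typing import Callable, List, Optional, Tuple


def choose_subtask(goal_steps: List[str], memory: str) -> Optional[str]:
    """Stand-in for the high-level policy (hypothetical).

    Returns the first step that the memory string does not mention yet,
    or None when every step is recorded. A learned model would instead
    condition on camera images, the task prompt, and the memory string.
    """
    for step in goal_steps:
        if step not in memory:
            return step
    return None


def summarize(memory: str, finished_subtask: str) -> str:
    """Append a one-sentence entry to the journal.

    Stand-in for the learned summarizer; here it is plain concatenation.
    """
    entry = f"I finished: {finished_subtask}."
    return f"{memory} {entry}".strip()


def run_episode(
    goal_steps: List[str], execute: Callable[[str], None]
) -> Tuple[str, List[str]]:
    """Alternate between picking a subtask, executing it, and updating memory."""
    memory = ""                 # the language memory string, empty at the start
    completed: List[str] = []
    while (subtask := choose_subtask(goal_steps, memory)) is not None:
        execute(subtask)        # low-level policy would act here
        memory = summarize(memory, subtask)  # update only after the subtask ends
        completed.append(subtask)
    return memory, completed


if __name__ == "__main__":
    final_memory, done = run_episode(["add rice", "add salt"], print)
    print(final_memory)
```

The point of the sketch is the data flow: the only long-horizon state carried between subtasks is a short string, so its size does not grow with the number of frames observed.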

2. The Video Encoder (Short-Horizon)

To keep the robot reactive, MEM uses a Space-Time Separable Attention mechanism.

  • Architecture: It interweaves spatial attention (standard ViT) with temporal attention (across frames) every 4th layer.
  • Efficiency: It reduces attention complexity from $O(n^2 K^2)$ to $O(Kn^2 + nK^2)$, where $n$ is the number of patches per frame and $K$ is the number of frames.
  • Latency: As shown in the paper's benchmarks, this allows the model to stay within the "critical real-time threshold" even when processing multiple camera streams.
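
To make the space-time factorization concrete, here is a minimal NumPy sketch under stated assumptions. It uses single-head attention without learned projections, adds a temporal step after every 4th spatial layer, and counts query-key pairs for K frames of n patches each. The function names, the residual connections, and the exact layer layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(x: np.ndarray) -> np.ndarray:
    """Self-attention over the second-to-last axis (no learned weights)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x


def spatial_attention(tokens: np.ndarray) -> np.ndarray:
    """tokens: (K, n, d). Patches attend only within their own frame."""
    return attend(tokens)


def temporal_attention(tokens: np.ndarray) -> np.ndarray:
    """tokens: (K, n, d). Each patch position attends across the K frames."""
    per_patch = np.swapaxes(tokens, 0, 1)          # (n, K, d)
    return np.swapaxes(attend(per_patch), 0, 1)    # back to (K, n, d)


def separable_encoder(tokens, num_layers=8, temporal_every=4):
    """Spatial attention in every layer, temporal attention every 4th layer."""
    x = tokens
    for layer in range(1, num_layers + 1):
        x = x + spatial_attention(x)               # residual is an assumption
        if layer % temporal_every == 0:
            x = x + temporal_attention(x)
    return x


def attention_pair_counts(num_frames: int, patches_per_frame: int):
    """Query-key pairs scored per layer: joint vs. separable attention."""
    K, n = num_frames, patches_per_frame
    joint = (K * n) ** 2                # every token attends to every token
    separable = K * n**2 + n * K**2     # K spatial blocks plus n temporal blocks
    return joint, separable


if __name__ == "__main__":
    frames = np.random.default_rng(0).normal(size=(8, 16, 32))
    print(separable_encoder(frames).shape)
    print(attention_pair_counts(8, 256))
```

For example, with 8 frames of 256 patches (illustrative numbers), joint attention scores 4,194,304 pairs per layer while the separable form scores 540,672, roughly an eightfold reduction.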

Fig: The MEM system architecture, highlighting the dual pathway for language and video memory.

Experiments: Real-World Resilience

The researchers tested MEM on tasks that would baffle a standard VLA:

  • The Recipe Challenge: Fetching ingredients for dishes such as "Fried Rice," drawn from 40+ recipes. The robot must remember what's on the stove vs. what's still in the fridge.
  • In-Context Adaptation: When a first attempt to pick up a chopstick at an odd height fails, the robot remembers the failure and adjusts its grasp angle within the next second, a feat impossible for memoryless models.

Fig: Performance gain in long-horizon tasks. MEM significantly outperforms "Naive" language history and memoryless baselines.

Critical Analysis: The Power of Pre-training

One of the most instructive results in this work is the pre-training ablation. The authors found that introducing memory only during task-specific robot training (post-training) was significantly less effective than pre-training the video encoder on a massive mix of internet videos and robot data.

This suggests that the "ability to remember" is a general representation skill that can be learned from observing the world at large, not just through specific robotic trials.

Conclusion & Future Outlook

MEM represents a shift from "reactive" robotics to "cognitive" robotics. By acknowledging that not all memories are created equal, the authors have provided a blueprint for scaling VLA context without needing a supercomputer in the robot's backpack.

The next frontier? Scaling this memory beyond 15 minutes to "Life-long Memory," where a robot remembers where you like your coffee mug placed, even if it hasn't seen you use it for a week.

Core Takeaways:

  • Abstractions Matter: Don't use images for what language can describe better.
  • Temporal Attention: Space-time separable attention is what keeps a video-conditioned VLA within real-time latency.
  • Data Diversity: Pre-training on diverse video and robot data helps keep the model from "cheating" by latching onto spurious correlations in its history (causal confusion).
