MEM: Multi-Scale Embodied Memory for Vision Language Action Models

[Physical Intelligence] MEM: Solving 15-Minute Robotic Tasks via Multi-Scale Embodied Memory

Abstract

MEM (Multi-Scale Embodied Memory) is a vision-language-action (VLA) architecture designed for long-horizon robotic tasks that integrates language-based and video-based memory. Built on the $\pi_{0.6}$ backbone, it combines these two modalities so that policies can solve tasks spanning up to 15 minutes, and it reports state-of-the-art results across diverse manipulation scenarios.

TL;DR

Current Vision-Language-Action (VLA) models are often "forgetful," processing only the immediate frame or a tiny window of history. Physical Intelligence's new Multi-Scale Embodied Memory (MEM) changes this by splitting memory into two scales: an efficient Video Encoder for short-term visual dynamics and a Language-based Memory for long-term semantic tracking. This allows robots to remember recipe steps and adapt to failures over a 15-minute horizon while maintaining real-time inference.

The Motivation: Why Robots Forget

In robotic manipulation, memory serves two distinct masters:

Metric/Visual Scale: "Where did that spoon go now that my arm is occluding it?" This requires high-frequency, dense visual information.
Semantic/Cognitive Scale: "Have I already added the salt to this soup?" This requires long-term tracking but needs very few bits of information.

The industry's current dilemma is that feeding a full video history into a Transformer is computationally prohibitive, because self-attention cost grows as $O(N^{2})$ in the number of input tokens $N$, while aggressive compression (like pooling) loses the "spatial nuance" needed for dexterous tasks.
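To make the trade-off concrete, here is a minimal sketch of what pooling does to a toy frame. The 4x4 grid, the single "spoon" cell, and the helper names `avg_pool_2x2` and `attention_pairs` are illustrative assumptions, not anything taken from the paper.

```python
def avg_pool_2x2(grid):
    """Average-pool a 2D list with a 2x2 window and stride 2."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def attention_pairs(num_tokens):
    """Number of query-key pairs that full self-attention scores."""
    return num_tokens ** 2


# Toy 4x4 "frame": a single active cell stands in for a small object.
frame = [
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

pooled = avg_pool_2x2(frame)

tokens_before = len(frame) * len(frame[0])    # 16 tokens
tokens_after = len(pooled) * len(pooled[0])   # 4 tokens

print(pooled)                                 # [[0.0, 0.25], [0.0, 0.0]]
print(attention_pairs(tokens_before))         # 256
print(attention_pairs(tokens_after))          # 16
```

Pooling cuts the attention cost by 16x in this toy case, but the object is now a faint 0.25 spread over a 2x2 region, so its exact cell can no longer be recovered. That is the kind of spatial detail a dexterous policy may need.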

Methodology: The Best of Both Worlds

MEM factorizes the action prediction into a hierarchical structure:

1. The Language Memory (Long-Horizon)

Instead of storing thousands of past images, MEM trains a High-Level Policy to maintain a "journal" in natural language. After each subtask, the model updates a memory string $m_t$ (e.g., "I placed the plate in the cabinet and moved to the counter").

The Insight: Natural language is the ultimate compression format for semantic states.
The Training: The authors used an offline LLM to generate these summaries for human demonstration data, teaching the robot how to "distill" its own history.
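The journal idea can be sketched in a few lines of Python. This is a toy stand-in under stated assumptions: in MEM the summary is produced by the learned high-level policy, whereas the `LanguageMemory` class, its `max_chars` budget, the "Done/Failed" phrasing, and the `next_subtask` helper below are all hypothetical and rule-based, meant only to show how a short string can carry long-horizon state.

```python
from dataclasses import dataclass


@dataclass
class LanguageMemory:
    """Running natural-language summary, playing the role of m_t (hypothetical)."""

    text: str = ""
    max_chars: int = 400  # assumed budget that keeps the summary compact

    def update(self, subtask: str, succeeded: bool) -> str:
        """Append the outcome of one subtask and return the new summary."""
        status = "Done" if succeeded else "Failed"
        self.text = f"{self.text} {status}: {subtask}.".strip()
        # Crude stand-in for learned distillation: drop the oldest sentences
        # until the summary fits the budget again.
        while len(self.text) > self.max_chars and ". " in self.text:
            self.text = self.text.split(". ", 1)[1]
        return self.text


def next_subtask(plan, memory):
    """Return the first planned step that the memory does not record as done."""
    for step in plan:
        if f"Done: {step}." not in memory.text:
            return step
    return None


plan = ["add rice", "add salt", "stir"]
memory = LanguageMemory()

memory.update("add rice", succeeded=True)
print(next_subtask(plan, memory))   # add salt

memory.update("add salt", succeeded=False)
print(next_subtask(plan, memory))   # add salt (the failed step is retried)

memory.update("add salt", succeeded=True)
print(next_subtask(plan, memory))   # stir
print(memory.text)
```

The point of the sketch is that answering "have I already added the salt?" needs only a string lookup, not a replay of the video history.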

2. The Video Encoder (Short-Horizon)

To keep the robot reactive, MEM uses a Space-Time Separable Attention mechanism.

Architecture: It interweaves spatial attention (standard ViT) with temporal attention (across frames) every 4th layer.
Efficiency: It reduces attention complexity from $O(n^{2} K^{2})$ to $O(K n^{2} + n K^{2})$, where $K$ is the number of frames and $n$ is the number of tokens per frame.
Latency: As shown in the paper's benchmarks, this allows the model to stay within the "critical real-time threshold" even when processing multiple camera streams.
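The complexity claim above can be checked with simple arithmetic. The sketch below counts query-key pairs for joint versus separable attention and builds one possible layer schedule. The values `n = 256` and `k = 8`, the function names, and the reading of "every 4th layer" as "spatial plus temporal attention at layers 4, 8, ..." are assumptions made for illustration; the paper's actual configuration may differ.

```python
def full_attention_cost(n, k):
    """Pairs scored when all n * k tokens attend to each other jointly."""
    return (n * k) ** 2


def separable_attention_cost(n, k):
    """Pairs scored by spatial attention within each of the k frames (k * n^2)
    plus temporal attention across frames at each of the n positions (n * k^2)."""
    return k * n ** 2 + n * k ** 2


def layer_schedule(num_layers, temporal_every=4):
    """One reading of 'temporal attention every 4th layer' (an assumption)."""
    return [
        "spatial+temporal" if (i + 1) % temporal_every == 0 else "spatial"
        for i in range(num_layers)
    ]


n, k = 256, 8  # assumed tokens per frame and number of frames

full = full_attention_cost(n, k)
separable = separable_attention_cost(n, k)

print(full)                     # 4194304
print(separable)                # 540672
print(round(full / separable, 2))
print(layer_schedule(8))
```

With these assumed sizes the separable form scores roughly an eighth as many pairs, and the gap widens as more frames are added, because the joint cost grows with $K^{2} n^{2}$ while the separable cost grows with $K n^{2} + n K^{2}$.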

Figure (model architecture): The MEM system architecture, highlighting the dual pathways for language and video memory.

Experiments: Real-World Resilience

The researchers tested MEM on tasks that would baffle a standard VLA:

The Recipe Challenge: Fetching ingredients for dishes such as "Fried Rice," drawn from a set of 40+ recipes. The robot must remember what’s on the stove vs. what’s still in the fridge.
In-Context Adaptation: If the robot fails to pick up a chopstick lying at an odd height, it remembers the failure and adjusts its grasp angle within the next second, something a memoryless model cannot do.

Figure (experimental results): Performance gain in long-horizon tasks. MEM significantly outperforms "Naive" language history and memoryless baselines.

Critical Analysis: The Power of Pre-training

One of the most profound insights in this work is the Ablation on Pre-training. The authors found that introducing memory only during the direct robot task training (post-training) was significantly less effective than pre-training the video encoder on a massive mix of internet videos and robot data.

This suggests that the "ability to remember" is a general representation skill that can be learned from observing the world at large, not just through specific robotic trials.

Conclusion & Future Outlook

MEM represents a shift from "reactive" robotics to "cognitive" robotics. By acknowledging that not all memories are created equal, the authors have provided a blueprint for scaling VLA context without needing a supercomputer in the robot's backpack.

The next frontier? Scaling this memory beyond 15 minutes to "Life-long Memory," where a robot remembers where you like your coffee mug placed, even if it hasn't seen you use it for a week.

Core Takeaways:

Abstractions Matter: Don't use images for what language can describe better.
Temporal Attention: Separable attention is the key to real-time video VLAs.
Data Diversity: Pre-training on video data helps keep the model from "cheating" by latching onto spurious correlations in its own history (causal confusion).
