Chameleon is a bio-inspired hierarchical memory architecture designed for long-horizon robotic manipulation. It integrates geometry-grounded multimodal perception with a differentiable episodic memory stack, achieving SOTA performance in tasks where visual observations are perceptually aliased (non-Markovian).
TL;DR
Chameleon is a robotic architecture that tackles perceptual aliasing: the problem where a robot sees two identical scenes but must act differently based on past events. By mimicking the human hippocampus, it writes geometry-rich "engrams" into a differentiable memory stack. On the new Camo-Dataset, it crushes standard Diffusion Policy baselines, jumping from chance-level (33%) to 100% decision accuracy on episodic recall tasks.
The Problem: The "Amnesia" of Modern Robotics
Imagine playing a shell game. You see three identical cups. At the moment of the "grab," your visual input is totally ambiguous—the cups look the same. To win, you must recall the history of the shuffle.
Current SOTA models (like Diffusion Policy or ACT) struggle here because:
- Semantic Compression: They turn visual history into text-like summaries, losing the precise spatial geometry needed to track objects.
- Similarity Bias: In Retrieval-Augmented Generation (RAG) for robots, the agent often retrieves memories that look similar to the current view but are actually irrelevant to the current goal.
This makes the decision process non-Markovian at the observation level: the correct action is not determined by the current frame alone, but by information buried in the past.
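A toy illustration of why this matters (invented episode labels, not from the paper): two episodes end in the exact same observation, yet require different actions. Any policy that conditions only on the current frame is capped at chance, while a policy with access to history solves both.

```python
# Each episode: (history of observations, current observation, correct action).
# The final observation is identical in both, so the scene is perceptually aliased.
episodes = [
    (["ball_under_cup_A", "shuffle_A_to_B"], "three_identical_cups", "lift_cup_B"),
    (["ball_under_cup_A", "shuffle_A_to_C"], "three_identical_cups", "lift_cup_C"),
]

def memoryless_policy(obs):
    # Any fixed mapping from the aliased frame to an action can win at most once.
    return "lift_cup_B"

def history_policy(history, obs):
    # Tracking the shuffle history disambiguates the identical frames.
    return "lift_cup_C" if "shuffle_A_to_C" in history else "lift_cup_B"

frame_only = sum(memoryless_policy(o) == a for h, o, a in episodes)
with_memory = sum(history_policy(h, o) == a for h, o, a in episodes)
print(frame_only, with_memory)  # 1 2 -> frame-only solves half, history solves all
```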
Methodology: Engineering the Brain's EC–HC–PFC Circuit
Chameleon addresses this with a three-stage pipeline: Perception → Memory → Policy.
1. Dorsal-Ventral Perception
Following biological intuition, Chameleon splits vision into two streams:
- Ventral Stream (The "What"): Uses a frozen DINOv2 to extract rich appearance patches.
- Dorsal Stream (The "Where"): Uses the robot's own End-Effector (EE) as a geometric anchor. By projecting the EE pose into both front and hand cameras, it creates a "spatial bridge" that disambiguates object locations.
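The "spatial bridge" can be sketched as a standard pinhole projection of the EE's 3D position into each camera's pixel space. The function name, calibration matrices, and numbers below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def project_ee(ee_pos_base, T_cam_base, K):
    """Project a 3D end-effector position (robot base frame) into pixels.

    T_cam_base: 4x4 extrinsic transform from base frame to camera frame.
    K: 3x3 camera intrinsics. Both are assumed calibration inputs.
    """
    p = np.append(ee_pos_base, 1.0)   # homogeneous coordinates
    p_cam = T_cam_base @ p            # base frame -> camera frame
    x, y, z = p_cam[:3]
    u = K[0, 0] * x / z + K[0, 2]     # perspective divide + intrinsics
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])

# Example: camera aligned with the base frame (identity extrinsics).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)
print(project_ee(np.array([0.1, 0.0, 0.5]), T, K))  # -> [440. 240.]
```

Projecting the same EE pose into both the front and hand cameras gives the two views a shared geometric reference point, which is what lets appearance patches be anchored in space.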

2. Hierarchical Differentiable Memory
Instead of a simple list of past frames, Chameleon uses a structured State Space Model (SSM):
- Spatial & Temporal Anchors: It organizes memory into a matrix of "slots." Each slot has a different "half-life" (timescale), allowing the robot to remember the prompt from 5 minutes ago while still tracking the hand movement from 0.5 seconds ago.
- HoloHead (Latent Imagination): This is the "secret sauce." The model is forced to predict future waypoints (2D and 3D) from its internal memory state. If the memory can't "imagine" the future correctly, the loss function punishes it, forcing the state to remain goal-directed.
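The multi-timescale slot idea can be sketched as a bank of leaky accumulators, each decaying at a rate set by its half-life. This is a minimal assumed model, not the paper's actual SSM update, but it shows how a 0.5-second slot forgets within seconds while a 5-minute slot retains the prompt:

```python
class MultiTimescaleMemory:
    """Toy slot bank: each slot decays with its own half-life (assumed model)."""

    def __init__(self, half_lives, dt=0.1):
        # Per-step decay factor: after `half_life` seconds, a slot halves.
        self.decays = [0.5 ** (dt / hl) for hl in half_lives]
        self.slots = [0.0 for _ in half_lives]

    def write(self, signal):
        # Leaky accumulation: decay old contents, add the new signal.
        self.slots = [d * s + signal for d, s in zip(self.decays, self.slots)]

mem = MultiTimescaleMemory(half_lives=[0.5, 300.0])  # 0.5 s vs 5 min slots
mem.write(1.0)                # a salient event (e.g., the task prompt)
for _ in range(50):           # 5 s of uneventful observations
    mem.write(0.0)
print(mem.slots)  # fast slot has decayed ~1000x; slow slot barely moved
```

After 5 seconds (ten half-lives for the fast slot), the fast slot has dropped to roughly 1/1024 of its value while the slow slot still holds ~99% of it, which is exactly the "remember the prompt, track the hand" division of labor.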
Experiments: Real-World Superiority
The authors introduced the Camo-Dataset, featuring three brutal tasks:
- Episodic Recall: Cleaning a specific plate out of three identical ones.
- Spatial Tracking: The Shell Game with shuffling cups.
- Sequential Logic: Adding seasonings in a specific, non-repeating order.
Key Breakthroughs
- Decision Reliability: In the Shell Game, Chameleon achieved a 73.5% Decision Success Rate (DSR), whereas Diffusion Policy and Flow Matching were stuck at ~34% (pure guessing).
- Pattern Separation: Using UMAP visualizations of the internal "decision state" ($h_t$), the researchers showed that Chameleon clearly separates different histories even when the visual scenes are identical.
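The pattern-separation claim can be illustrated with a toy recurrent state (arbitrary weights, not the paper's model): feed two different histories that end in the same observation, and the final hidden states still land in different regions of state space.

```python
import math

def step(h, x, w_h=0.7, w_x=0.5):
    # Minimal recurrent update: h_{t+1} = tanh(w_h * h_t + w_x * x_t)
    return math.tanh(w_h * h + w_x * x)

def rollout(history, final_obs):
    h = 0.0
    for x in history:
        h = step(h, x)
    return step(h, final_obs)

same_frame = 0.2                              # the perceptually aliased frame
h_a = rollout([1.0, 0.0, 0.0], same_frame)    # e.g., "ball moved to cup B"
h_b = rollout([-1.0, 0.0, 0.0], same_frame)   # e.g., "ball moved to cup C"
print(abs(h_a - h_b) > 1e-3)  # True: the histories stay separable in h_t
```

A frame-only encoder would map both episodes to the same point; the recurrent state keeps them apart, which is what the UMAP plots visualize at scale.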


Critical Insight: Why it Works
The ablation studies reveal that HoloHead is vital. Without the "imagination" objective, the memory state $h_t$ collapses into a snapshot of the current appearance, losing the historical cues. By forcing the latent state to be predictive, the robot maintains a "mental map" of the goal even when objects are occluded.
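The mechanism can be sketched as an auxiliary loss term (function names, weights, and numbers below are assumptions for illustration): the policy loss is augmented with a waypoint-regression term, so gradients from failed future predictions flow back into the memory state.

```python
def mse(pred, target):
    # Mean squared error over matching coordinates.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(action_pred, action_gt, waypoint_pred, waypoint_gt, aux_weight=0.5):
    # If h_t has dropped the goal, the predicted waypoints drift and the
    # auxiliary term grows, pushing the memory state back toward being
    # predictive of the future rather than a snapshot of the present.
    return mse(action_pred, action_gt) + aux_weight * mse(waypoint_pred, waypoint_gt)

loss = total_loss(
    [0.1, 0.0], [0.0, 0.0],            # action head: prediction vs ground truth
    [0.3, 0.4, 0.5], [0.3, 0.4, 0.9],  # 3D waypoint: prediction vs ground truth
)
print(round(loss, 4))  # 0.0317
```

Removing the auxiliary term (setting `aux_weight=0`) is effectively the ablation the authors ran: nothing then forces $h_t$ to carry goal-relevant history.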
Conclusion & Future Work
Chameleon proves that for long-horizon tasks, robots don't just need more data; they need a memory architecture that respects spatial geometry and temporal hierarchy.
Future Outlook: The next step is integrating this episodic stack into massive Vision-Language-Action (VLA) foundation models. While VLAs provide the "common sense," a Chameleon-like stack would provide the "short-term working memory" required for complex, multi-step housework or industrial assembly.
Paper: Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Code: https://github.com/gxyes/MARS_Chameleon
