Chameleon is a bio-inspired hierarchical memory architecture designed for long-horizon robotic manipulation. It integrates geometry-grounded multimodal perception with a differentiable episodic memory stack, achieving SOTA performance in tasks where visual observations are perceptually aliased (non-Markovian).
TL;DR
Chameleon is a robotic architecture that tackles perceptual aliasing: the problem where a robot sees two identical scenes but must act differently based on past events. By mimicking the human hippocampus, it writes geometry-rich "engrams" into a differentiable memory stack. On the new Camo-Dataset, it crushes standard Diffusion Policy baselines, jumping from chance-level (33%) to 100% decision accuracy on episodic recall tasks.
The Problem: The "Amnesia" of Modern Robotics
Imagine playing a shell game. You see three identical cups. At the moment of the "grab," your visual input is totally ambiguous—the cups look the same. To win, you must recall the history of the shuffle.
Current SOTA models (like Diffusion Policy or ACT) struggle here because:
- Semantic Compression: They turn visual history into text-like summaries, losing the precise spatial geometry needed to track objects.
- Similarity Bias: In Retrieval-Augmented Generation (RAG) for robots, the agent often retrieves memories that look similar to the current view but are actually irrelevant to the current goal.
This makes the decision process non-Markovian at the observation level: the correct action is not determined by the current frame alone, but by information buried in the past.
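A toy illustration of why this matters (invented episode labels, not from the paper): two episodes end in the exact same observation, yet require different actions. Any policy that conditions only on the current frame is capped at chance, while a policy with access to history solves both.

```python
# Each episode: (history of observations, current observation, correct action).
# The final observation is identical in both, so the scene is perceptually aliased.
episodes = [
    (["ball_under_cup_A", "shuffle_A_to_B"], "three_identical_cups", "lift_cup_B"),
    (["ball_under_cup_A", "shuffle_A_to_C"], "three_identical_cups", "lift_cup_C"),
]

def memoryless_policy(obs):
    # Any fixed mapping from the aliased frame to an action can win at most once.
    return "lift_cup_B"

def history_policy(history, obs):
    # Tracking the shuffle history disambiguates the identical frames.
    return "lift_cup_C" if "shuffle_A_to_C" in history else "lift_cup_B"

frame_only = sum(memoryless_policy(o) == a for h, o, a in episodes)
with_memory = sum(history_policy(h, o) == a for h, o, a in episodes)
print(frame_only, with_memory)  # 1 2 -> frame-only solves half, history solves all
```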
Methodology: Engineering the Brain's EC–HC–PFC Circuit
Chameleon addresses this with a three-stage pipeline: Perception → Memory → Policy.
1. Dorsal-Ventral Perception
Following biological intuition, Chameleon splits vision into two streams:
- Ventral Stream (The "What"): Uses a frozen DINOv2 to extract rich appearance patches.
- Dorsal Stream (The "Where"): Uses the robot's own End-Effector (EE) as a geometric anchor. By projecting the EE pose into both front and hand cameras, it creates a "spatial bridge" that disambiguates object locations.
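The "spatial bridge" can be sketched as a standard pinhole projection of the EE's 3D position into each camera's pixel space. The function name, calibration matrices, and numbers below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def project_ee(ee_pos_base, T_cam_base, K):
    """Project a 3D end-effector position (robot base frame) into pixels.

    T_cam_base: 4x4 extrinsic transform from base frame to camera frame.
    K: 3x3 camera intrinsics. Both are assumed calibration inputs.
    """
    p = np.append(ee_pos_base, 1.0)   # homogeneous coordinates
    p_cam = T_cam_base @ p            # base frame -> camera frame
    x, y, z = p_cam[:3]
    u = K[0, 0] * x / z + K[0, 2]     # perspective divide + intrinsics
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])

# Example: camera aligned with the base frame (identity extrinsics).
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T = np.eye(4)
print(project_ee(np.array([0.1, 0.0, 0.5]), T, K))  # -> [440. 240.]
```

Projecting the same EE pose into both the front and hand cameras gives the two views a shared geometric reference point, which is what lets appearance patches be anchored in space.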

2. Hierarchical Differentiable Memory
Instead of a simple list of past frames, Chameleon uses a structured State Space Model (SSM):
- Spatial & Temporal Anchors: It organizes memory into a matrix of "slots." Each slot has a different "half-life" (timescale), allowing the robot to remember the prompt from 5 minutes ago while still tracking the hand movement from 0.5 seconds ago.
- HoloHead (Latent Imagination): This is the "secret sauce." The model is forced to predict future waypoints (2D and 3D) from its internal memory state. If the memory can't "imagine" the future correctly, the loss function punishes it, forcing the state to remain goal-directed.
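The multi-timescale slot idea can be sketched as a bank of leaky accumulators, each decaying at a rate set by its half-life. This is a minimal assumed model, not the paper's actual SSM update, but it shows how a 0.5-second slot forgets within seconds while a 5-minute slot retains the prompt:

```python
class MultiTimescaleMemory:
    """Toy slot bank: each slot decays with its own half-life (assumed model)."""

    def __init__(self, half_lives, dt=0.1):
        # Per-step decay factor: after `half_life` seconds, a slot halves.
        self.decays = [0.5 ** (dt / hl) for hl in half_lives]
        self.slots = [0.0 for _ in half_lives]

    def write(self, signal):
        # Leaky accumulation: decay old contents, add the new signal.
        self.slots = [d * s + signal for d, s in zip(self.decays, self.slots)]

mem = MultiTimescaleMemory(half_lives=[0.5, 300.0])  # 0.5 s vs 5 min slots
mem.write(1.0)                # a salient event (e.g., the task prompt)
for _ in range(50):           # 5 s of uneventful observations
    mem.write(0.0)
print(mem.slots)  # fast slot has decayed ~1000x; slow slot barely moved
```

After 5 seconds (ten half-lives for the fast slot), the fast slot has dropped to roughly 1/1024 of its value while the slow slot still holds ~99% of it, which is exactly the "remember the prompt, track the hand" division of labor.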
Experiments: Real-World Superiority
The authors introduced the Camo-Dataset, featuring three brutal tasks:
- Episodic Recall: Cleaning a specific plate out of three identical ones.
- Spatial Tracking: The Shell Game with shuffling cups.
- Sequential Logic: Adding seasonings in a specific, non-repeating order.
Key Breakthroughs
- Decision Reliability: In the Shell Game, Chameleon achieved a 73.5% Decision Success Rate (DSR), whereas Diffusion Policy and Flow Matching were stuck at ~34% (pure guessing).
- Pattern Separation: Using UMAP visualizations of the internal "decision state" ($h_t$), the researchers showed that Chameleon clearly separates different histories even when the visual scenes are identical.
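The pattern-separation claim can be illustrated with a toy recurrent state (arbitrary weights, not the paper's model): feed two different histories that end in the same observation, and the final hidden states still land in different regions of state space.

```python
import math

def step(h, x, w_h=0.7, w_x=0.5):
    # Minimal recurrent update: h_{t+1} = tanh(w_h * h_t + w_x * x_t)
    return math.tanh(w_h * h + w_x * x)

def rollout(history, final_obs):
    h = 0.0
    for x in history:
        h = step(h, x)
    return step(h, final_obs)

same_frame = 0.2                              # the perceptually aliased frame
h_a = rollout([1.0, 0.0, 0.0], same_frame)    # e.g., "ball moved to cup B"
h_b = rollout([-1.0, 0.0, 0.0], same_frame)   # e.g., "ball moved to cup C"
print(abs(h_a - h_b) > 1e-3)  # True: the histories stay separable in h_t
```

A frame-only encoder would map both episodes to the same point; the recurrent state keeps them apart, which is what the UMAP plots visualize at scale.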


Critical Insight: Why it Works
The ablation studies reveal that HoloHead is vital. Without the "imagination" objective, the memory state $h_t$ collapses into a snapshot of the current appearance, losing the historical cues. By forcing the latent state to be predictive, the robot maintains a "mental map" of the goal even when objects are occluded.
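The mechanism can be sketched as an auxiliary loss term (function names, weights, and numbers below are assumptions for illustration): the policy loss is augmented with a waypoint-regression term, so gradients from failed future predictions flow back into the memory state.

```python
def mse(pred, target):
    # Mean squared error over matching coordinates.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(action_pred, action_gt, waypoint_pred, waypoint_gt, aux_weight=0.5):
    # If h_t has dropped the goal, the predicted waypoints drift and the
    # auxiliary term grows, pushing the memory state back toward being
    # predictive of the future rather than a snapshot of the present.
    return mse(action_pred, action_gt) + aux_weight * mse(waypoint_pred, waypoint_gt)

loss = total_loss(
    [0.1, 0.0], [0.0, 0.0],            # action head: prediction vs ground truth
    [0.3, 0.4, 0.5], [0.3, 0.4, 0.9],  # 3D waypoint: prediction vs ground truth
)
print(round(loss, 4))  # 0.0317
```

Removing the auxiliary term (setting `aux_weight=0`) is effectively the ablation the authors ran: nothing then forces $h_t$ to carry goal-relevant history.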
Conclusion & Future Work
Chameleon proves that for long-horizon tasks, robots don't just need more data; they need a memory architecture that respects spatial geometry and temporal hierarchy.
Future Outlook: The next step is integrating this episodic stack into massive Vision-Language-Action (VLA) foundation models. While VLAs provide the "common sense," a Chameleon-like stack would provide the "short-term working memory" required for complex, multi-step housework or industrial assembly.
Paper: Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Code: https://github.com/gxyes/MARS_Chameleon
