The paper introduces TRACE (Textual Representation of Allocentric Context from Egocentric Video), a prompting framework designed to enhance 3D spatial reasoning in Multimodal Large Language Models (MLLMs). By guiding models to generate structured, text-based intermediate representations of 3D environments, TRACE achieves state-of-the-art (SOTA) results, with gains of up to +7.54% on the VSI-Bench and OST-Bench spatial reasoning benchmarks across multiple model backbones.
TL;DR
Existing Multimodal Large Language Models (MLLMs) are often "3D-blind," relying on 2D visual shortcuts rather than understanding the layout of the world. TRACE (Textual Representation of Allocentric Context from Egocentric Video) is a new prompting paradigm that fixes this by forcing models to build a "mental map" in text—logging coordinates, trajectories, and object relations—before answering a question. This simple textual "reasoning trace" boosts the spatial IQ of models like Gemini and Qwen by significant margins without any extra training.
The Motivation: Why MLLMs Fail at 3D
Humans don't just see pixels; we build an allocentric map—a global 3D understanding of our surroundings. If you walk through a house, you know where the kitchen is relative to your starting point, even if you can't see it.
Current MLLMs (like GPT-4V or Gemini) struggle here. They treat video as a sequence of 2D images. When asked a question like "How far is the chair from the door?", they often guess based on 2D proximity in a single frame rather than calculating 3D distance. Previous attempts to fix this required complex 3D sensors or massive fine-tuning. The authors of TRACE asked: Can we just prompt the model to think in 3D?
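The gap between 2D proximity and true 3D distance is easy to make concrete: two objects that sit near each other in the image plane can be metres apart once depth is accounted for. A minimal sketch (all coordinates are hypothetical, chosen only to illustrate the failure mode):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical scene: in the 2D image plane (pixel coordinates) the chair
# and door look adjacent, but their estimated 3D positions differ in depth.
chair_px, door_px = (410, 300), (450, 310)              # image-plane pixels
chair_3d, door_3d = (1.1, 1.0, 0.0), (1.3, 4.5, 0.0)    # metres

print(round(euclidean(chair_px, door_px), 1))   # pixel gap: looks "close"
print(round(euclidean(chair_3d, door_3d), 2))   # metric gap: several metres
```

A model reasoning only over the first pair of numbers will answer "close"; a model that has written down metric coordinates can compute the second.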
Methodology: The Anatomy of TRACE
TRACE transforms raw egocentric video into a structured, YAML-like "spatial cache." Instead of jumping straight to the answer, the model must first generate three components:
- Meta Context: Defines the "rules" of the room—topology, grid alignment, and the starting orientation.
- Camera Trajectory: Logs the [x, y] coordinates and facing direction of the camera over time.
- Entity Registry: A high-resolution list of every object, its first-seen timestamp, its visual signature, and—crucially—its estimated metric coordinates.

By requiring the model to write out `estimated_pos: [1.1, 1.0]`, TRACE compels it to resolve fuzzy visual signals into concrete geometric constraints. This intermediate step acts as a "Spatial Chain-of-Thought."
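Put together, the three components above suggest a cache shaped roughly like the following. The field names and values are illustrative reconstructions from the paper's description, not its exact schema:

```yaml
# Hypothetical TRACE spatial cache (illustrative, not the paper's schema)
meta_context:
  topology: single_room
  grid_alignment: axis_aligned
  start_orientation: north
camera_trajectory:
  - {t: 0.0, pos: [0.0, 0.0], facing: north}
  - {t: 2.5, pos: [1.5, 0.5], facing: east}
entity_registry:
  - name: chair_1
    first_seen: 1.2
    visual_signature: "grey fabric, four legs"
    estimated_pos: [1.1, 1.0]
  - name: dishwasher_1
    first_seen: 3.0
    visual_signature: "stainless steel, under counter"
    estimated_pos: [2.0, 1.2]
```

Once the scene is serialized this way, answering a spatial question reduces to arithmetic over the registry rather than guesswork over pixels.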
Performance: Quantitative and Qualitative Wins
The researchers tested TRACE on VSI-Bench and OST-Bench, the gold standards for spatial AI.
- Broad SOTA Gains: TRACE outperformed standard Chain-of-Thought (CoT), Tree-of-Thought (ToT), and the previous "Cognitive Map" (CM) approach.
- The Gemini Boost: On Gemini 1.5 Pro, TRACE pushed average accuracy from 52.61% to over 60%.
- Granularity Matters: Unlike older methods that used coarse 10x10 grids, TRACE’s metric estimation allows it to handle fine-grained questions about object size and exact relative distances.

The paper includes a striking qualitative example: a "Cognitive Map" might vaguely place a chair near a table, but TRACE identifies the specific coordinate of a chair tucked under the side facing a dishwasher, allowing it to correctly identify the "closest object."
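Once the registry contains metric coordinates, "closest object" questions become a simple arg-min over distances. A sketch with hypothetical entries (the names and positions are illustrative, not taken from the paper's example):

```python
import math

# Hypothetical TRACE entity registry: name -> estimated metric position.
registry = {
    "chair_1": (1.8, 1.0),
    "table_1": (1.4, 1.3),
    "sofa_1":  (3.2, 0.4),
}
dishwasher = (2.0, 1.2)

def closest_to(target, entities):
    """Return the entity name with the smallest Euclidean distance to target."""
    return min(entities, key=lambda name: math.dist(entities[name], target))

print(closest_to(dishwasher, registry))  # → chair_1
```

A coarse 10x10 grid would bin `chair_1` and `table_1` into neighbouring or identical cells; continuous coordinates let the comparison resolve correctly.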

Deep Insight: Perception vs. Reasoning
One of the most profound findings in the paper is the Decomposition Analysis. By swapping different models into the "Spatial Descriptor" (the part that sees the video) and the "Reasoning Parser" (the part that calculates the answer from the text), they found that:
- Perception is the bottleneck: Swapping a strong perceiver (Gemini) for a weaker one (Qwen-7B) causes a massive crash in performance.
- LLMs are better 3D thinkers than MLLMs: Interestingly, standard LLMs (text-only) were often better at reasoning over 3D coordinates than their multimodal versions, suggesting that visual tuning sometimes "distracts" the model's logical core.
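The decomposition can be pictured as a two-stage pipeline whose only interface is the textual trace; the function names and stub bodies below are illustrative stand-ins, not the paper's API:

```python
import math

def spatial_descriptor(video_frames):
    """Stage 1 ("Spatial Descriptor"): in the paper, an MLLM that converts
    egocentric video into a textual trace. Stubbed here with a fixed,
    hypothetical entity registry."""
    return {"chair_1": (1.8, 1.0), "table_1": (1.4, 1.3)}

def reasoning_parser(trace, target):
    """Stage 2 ("Reasoning Parser"): in the paper, a (possibly text-only)
    LLM that answers from the trace. Stubbed as a nearest-object lookup."""
    return min(trace, key=lambda name: math.dist(trace[name], target))

# The decomposition analysis swaps different models into each stage;
# because the interface is plain text, the stages are freely recombinable.
trace = spatial_descriptor(video_frames=None)
print(reasoning_parser(trace, (2.0, 1.2)))  # → chair_1
```

The clean text interface is what makes the swap experiments possible, and it is why a weak Stage 1 dooms the pipeline: no parser can recover coordinates that were never written down.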
Critical Analysis & Conclusion
TRACE proves that we don't necessarily need "3D-native" models to solve 3D tasks. By using language as a structured interface for geometry, we can unlock latent spatial capabilities in existing models.
Limitations:
- Latency: Generating a full YAML trace before answering increases token count and time-to-first-token.
- Static Nature: TRACE currently creates a static map; for long-term embodied agents, it would need a "streaming" update mechanism.
Future Outlook: TRACE serves as a blueprint for high-quality data generation. We could use TRACE to automatically label thousands of hours of video with 3D reasoning traces, then fine-tune smaller models to "think" this way inherently, creating a specialized generation of spatially-aware AI.
