Perceptio is a perception-enhanced Large Vision-Language Model (LVLM) that integrates 2D semantic segmentation and 3D depth estimation directly into its autoregressive output sequence. By generating explicit spatial tokens before producing text, it achieves state-of-the-art performance on RefCOCO/+/g benchmarks and improves spatial reasoning accuracy on HardBLINK by 10.3%.
TL;DR
While Large Vision-Language Models (LVLMs) can describe an image with poetic detail, they often fail to tell you which object is closer to the camera. Perceptio fixes this by turning spatial perception into an explicit intermediate step. By generating 2D segmentation and 3D depth tokens before the text response, Perceptio acts as a "Spatial Chain-of-Thought" engine, setting new SOTA records on RefCOCO and drastically improving depth reasoning on benchmarks like HardBLINK.
Deep Dive into the Motivation: The "Blindness" of Current LVLMs
Even the most advanced LVLMs (like InternVL or LLaVA) suffer from a lack of "Spatial Intelligence." They treat images as a flat bag of visual features. Previous attempts to fix this usually fell into two camps:
- Specialist Pipelines: Models that only do segmentation or depth but can't talk.
- Implicit Learners: Large models that hope spatial awareness "emerges" from scale; the authors show that it doesn't.
Perceptio’s core insight is that perception should be a first-class citizen of the language objective. If a model is forced to explicitly "draw" the depth and boundaries of a scene before answering, its final text response will be grounded in physical reality, not just statistical linguistic patterns.
Methodology: The Perception-Enhanced Sequence
Perceptio introduces a unified architecture that routes visual signals through three pathways: a standard image encoder, a frozen SAM2 encoder for 2D cues, and a VQ-VAE depth codebook distilled from "Depth Anything V2."
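To make the routing concrete, here is a minimal PyTorch sketch of how three visual streams could be projected into a shared LLM embedding space. All class, parameter, and dimension names below are hypothetical; the post does not specify the actual implementation.

```python
import torch
import torch.nn as nn

class PerceptionRouter(nn.Module):
    """Sketch of three-pathway visual routing (hypothetical names and dims)."""

    def __init__(self, llm_dim=4096, img_dim=1024, sam_dim=256, depth_dim=512):
        super().__init__()
        # One projection per pathway into the LLM's embedding space.
        self.proj_img = nn.Linear(img_dim, llm_dim)      # standard image encoder
        self.proj_sam = nn.Linear(sam_dim, llm_dim)      # frozen SAM2 encoder (2D cues)
        self.proj_depth = nn.Linear(depth_dim, llm_dim)  # VQ-VAE depth codebook features

    def forward(self, img_feats, sam_feats, depth_feats):
        # Concatenate all projected streams into one visual prefix for the LLM.
        return torch.cat([
            self.proj_img(img_feats),      # (B, n_img, llm_dim)
            self.proj_sam(sam_feats),      # (B, n_sam, llm_dim)
            self.proj_depth(depth_feats),  # (B, n_depth, llm_dim)
        ], dim=1)
```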
The Spatial Chain-of-Thought
The model generates a sequence that follows this template:
[seg tokens] + [depth tokens] + [text tokens]
By generating these in order, the autoregressive nature of the LLM ensures that the final text is conditioned on the segmentation and depth it just "saw."
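A minimal sketch of how such a target sequence could be assembled follows. Only [d_start] and [d_end] are named in the post; the segmentation markers and the helper function are hypothetical:

```python
# Hypothetical marker tokens; the post only names [d_start] and [d_end].
SEG_START, SEG_END = "[seg_start]", "[seg_end]"
D_START, D_END = "[d_start]", "[d_end]"

def build_target_sequence(seg_tokens, depth_tokens, text_tokens):
    """Assemble the autoregressive target: segmentation, then depth, then text.

    Because decoding is strictly left-to-right, every text token is
    conditioned on the spatial tokens emitted before it.
    """
    return (
        [SEG_START] + seg_tokens + [SEG_END]
        + [D_START] + depth_tokens + [D_END]
        + text_tokens
    )
```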

Solving the "Discrete vs. Continuous" Gap
One of the paper's key technical contributions is the Soft Depth Reconstruction. Standard VQ-VAE uses hard index selection, which is non-differentiable. Perceptio uses a soft-merging technique where codebook embeddings are weighted by predicted probabilities, allowing gradients to flow back into the LLM.
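The soft-merging step can be sketched in a few lines of PyTorch. This illustrates probability-weighted codebook lookup in general, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def soft_depth_reconstruction(logits, codebook):
    """Soft-merge VQ-VAE codebook entries so gradients reach the LLM.

    logits:   (B, N, K) LLM scores over the K depth codes
    codebook: (K, D)    frozen depth VQ-VAE embeddings
    """
    # Hard selection would be codebook[logits.argmax(-1)], which kills gradients.
    # Instead, weight every codebook embedding by its predicted probability.
    probs = F.softmax(logits, dim=-1)  # (B, N, K)
    return probs @ codebook            # (B, N, D), fully differentiable
```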
To keep the LLM from hallucinating meaningless tokens, they introduced a Composite Depth-Token Loss (a sketch follows this list):
- Marker Loss: Ensures the [d_start] and [d_end] tokens appear where they should.
- Token Loss: Ensures the depth values are accurate compared to the teacher model.
- Count Loss: Penalizes the model if it generates too few or too many depth tokens.
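Here is a hedged sketch of how the three terms might combine. The exact loss forms and weights are not given in the post, so everything below is an assumption:

```python
import torch.nn.functional as F

def composite_depth_token_loss(marker_logits, marker_targets,
                               depth_logits, depth_targets,
                               pred_count, target_count,
                               w_marker=1.0, w_token=1.0, w_count=0.1):
    """Hypothetical composite loss; term forms and weights are assumptions.

    marker loss: cross-entropy on where [d_start]/[d_end] should appear
    token loss:  cross-entropy of predicted depth codes vs. the teacher's codes
    count loss:  L1 penalty on emitting too few or too many depth tokens
    """
    l_marker = F.cross_entropy(marker_logits, marker_targets)
    l_token = F.cross_entropy(depth_logits, depth_targets)
    l_count = F.l1_loss(pred_count.float(), target_count.float())
    return w_marker * l_marker + w_token * l_token + w_count * l_count
```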
Experiments and Results: SOTA Performance
Perceptio-8B dominates across the board. In Referring Expression Segmentation (RES), it outperformed the previous leader, Sa2VA, by significant margins.

In the HardBLINK depth reasoning task—where models must decide which of several points is closest to the camera—Perceptio-8B achieved 71.0% accuracy, a massive 10.3% jump over LLaVA-Aurora. This shows that the explicit depth tokens aren't just for show: they actually inform the model's internal reasoning.

Critical Insights: The Multi-Modal Trade-off
The ablation studies tell an interesting story:
- Perception is Vital: Removing depth tokens dropped HardBLINK accuracy by 25.8%.
- Synergy: 3D depth and 2D segmentation are complementary; removing segmentation dropped MMBench and SEED-Bench scores, showing that "geometry alone is not enough—you need labels."
- The Optimization Tension: Interestingly, the authors noted that removing depth slightly improved general VQA. This suggests a "tax" on the model's capacity when learning to generate precise geometric tokens—a hurdle for future multi-task curriculum learning to solve.
Conclusion
Perceptio marks a shift from models that merely look at images to models that perceive them. By integrating dense 2D and 3D signals into the text stream, Amazon researchers have provided a blueprint for more physically grounded AI. Future work in video and robotic manipulation will likely build on this "perception-as-tokens" approach.
