Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought


[ICCV 2025] NV-CoT: Rethinking Visual Reasoning as Continuous Action Space

Abstract

Numerical Visual Chain-of-Thought (NV-CoT) is a novel framework that enables Multimodal Large Language Models (MLLMs) to perform region-grounded reasoning by treating bounding-box localization as continuous actions in Euclidean space. It shifts away from traditional discrete text-token coordinates, achieving state-of-the-art performance on fine-grained reasoning benchmarks like V* Bench and HR-Bench.

TL;DR

Researchers have introduced Numerical Visual Chain-of-Thought (NV-CoT), a framework that ditches the clunky "textified" coordinates used by current AI models. Instead of forcing a Multimodal Large Language Model (MLLM) to type out numbers like "[0.1, 0.2, 0.5, 0.8]", NV-CoT allows the model to "think" in continuous Euclidean space. By treating localization as a continuous action, the model achieves superior precision, faster convergence, and better reasoning on high-resolution images.

The "Broken" Bridge: Why Discrete Coordinates Fail

In the world of Multimodal AI, "Visual Chain-of-Thought" (Visual CoT) is the process where a model zooms into specific parts of an image to find clues before answering. Historically, models chose these areas in two ways:

Textification: The model predicts coordinates as text tokens. This is mathematically "noisy": if the target is "0.31", predicting "0.32" is treated as just as wrong as predicting "0.99" under standard cross-entropy loss, which ignores geometric proximity.
Patch-Indexing: The model picks from a grid of pre-defined squares. This is rigid; if your target object sits between four squares, the model struggles to "see" it clearly.

NV-CoT addresses this by expanding the model's output head to include a continuous regression branch, treating the act of "looking at a region" as a continuous numerical action.
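
The contrast between token-level cross-entropy and a regression loss can be illustrated with a small sketch. The setup here is an assumption made only for illustration, not the paper's tokenization: one coordinate is discretized into 100 bins, and two hypothetical predicted distributions put most of their mass on a near-miss bin (0.32) and a far-miss bin (0.99) while giving the true bin (0.31) the same probability.

```python
import math

# Hypothetical setup: one coordinate discretized into 100 bins (0.00 ... 0.99).
NUM_BINS = 100
BIN_CENTERS = [i / NUM_BINS for i in range(NUM_BINS)]
TARGET_BIN = 31  # ground-truth coordinate 0.31


def cross_entropy(probs, target_bin):
    """Categorical cross-entropy for a single token: -log p(target)."""
    return -math.log(probs[target_bin])


def peaked(at_bin, peak=0.9, on_target=0.05):
    """Distribution with `peak` mass on `at_bin`, `on_target` mass on the
    true bin, and the remainder spread uniformly over the other bins.
    Assumes at_bin != TARGET_BIN."""
    rest = (1.0 - peak - on_target) / (NUM_BINS - 2)
    probs = [rest] * NUM_BINS
    probs[at_bin] = peak
    probs[TARGET_BIN] = on_target
    return probs


near_miss = peaked(32)  # most mass on 0.32
far_miss = peaked(99)   # most mass on 0.99

# Cross-entropy only looks at the probability of the true bin,
# so both predictions receive the same loss.
ce_near = cross_entropy(near_miss, TARGET_BIN)
ce_far = cross_entropy(far_miss, TARGET_BIN)

# An L1 regression loss on the decoded coordinate reflects the distance.
l1_near = abs(BIN_CENTERS[32] - BIN_CENTERS[TARGET_BIN])
l1_far = abs(BIN_CENTERS[99] - BIN_CENTERS[TARGET_BIN])

print(f"cross-entropy: near={ce_near:.3f} far={ce_far:.3f}")
print(f"L1 distance:   near={l1_near:.2f} far={l1_far:.2f}")
```

The cross-entropy values are identical because only the true bin's probability enters the loss, while the L1 distance separates the near miss from the far miss.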

Methodology: Bridging LLMs and Euclidean Space

The architecture of NV-CoT is deceptively simple yet powerful. Instead of just picking a word from a vocabulary $V$, the model outputs parameters for a continuous distribution.

1. The Gaussian/Laplace Policy

The model identifies a region $[x_{1}, y_{1}, x_{2}, y_{2}]$ by predicting the mean ($\mu$) and uncertainty ($\sigma$) of each coordinate.

In SFT: The model is trained using regression losses ( $L_{1}$ or $L_{2}^{2}$ ).
In RL (GRPO): To allow the model to "explore" different regions, it samples coordinates using the reparameterization trick, $b = \mu + \sigma \cdot \epsilon$, where $\epsilon$ is noise drawn from a fixed standard distribution. Because the randomness is isolated in $\epsilon$, gradients can flow back through the sampling step to $\mu$ and $\sigma$, enabling robust Reinforcement Learning.
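
A minimal sketch of the sampling step, written in plain Python so it runs without dependencies. The function names (`reparameterize`, `sample_box`) and the `policy` switch are hypothetical, not taken from the paper's code. Plain Python has no autodiff, so this only shows the sampling arithmetic; in an autodiff framework the same expression lets gradients reach `mu` and `sigma` because `eps` does not depend on them. Clamping to the image and enforcing `x1 < x2`, `y1 < y2` are left out.

```python
import random


def sample_gaussian_eps(rng):
    """Standard normal noise (mean 0, standard deviation 1)."""
    return rng.gauss(0.0, 1.0)


def sample_laplace_eps(rng):
    """Standard Laplace noise (location 0, scale 1), generated as the
    difference of two independent unit-rate exponential variables."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def reparameterize(mu, sigma, eps):
    """b = mu + sigma * eps, applied per coordinate."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


def sample_box(mu, sigma, rng, policy="laplace"):
    """Draw one box [x1, y1, x2, y2] around the predicted mean."""
    draw = sample_laplace_eps if policy == "laplace" else sample_gaussian_eps
    eps = [draw(rng) for _ in mu]
    return reparameterize(mu, sigma, eps)


# Illustrative values only: a predicted mean box and per-coordinate scale.
mu = [0.10, 0.20, 0.50, 0.80]
sigma = [0.02, 0.02, 0.02, 0.02]

rng = random.Random(0)
box = sample_box(mu, sigma, rng, policy="laplace")
print(box)
```

Setting `sigma` to zero recovers the mean box exactly, which is one way to think of deterministic inference versus exploratory sampling during RL.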

2. The Architecture

Figure: Overall Architecture. The figure compares NV-CoT (right) with text-based (left) and patch-based (middle) methods. By predicting coordinates in continuous space, NV-CoT achieves far more flexible and precise localization.

Experimental Battleground: Sifting Through the Noise

The researchers tested NV-CoT against 8 SOTA baselines (including Qwen2.5-VL and LLaVA-OneVision) across three grueling benchmarks: V* Bench, HR-Bench 4K, and HR-Bench 8K.

Key Findings:

Precision Leap: On V* Bench, NV-CoT (RL) reached 89.0% overall accuracy, outperforming the text-based DeepEyes-7B by 2.6%.
Robust Localization: On the Vis-CoT-363K dataset, the bounding-box IoU jumped from 47.3 (Standard) to 59.5 (NV-CoT with L1 loss).
Efficiency: NV-CoT tends to converge faster during training because the regression loss provides a direct "direction" for gradients, unlike the "hit-or-miss" nature of categorical cross-entropy.
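
For readers unfamiliar with the bounding-box IoU metric cited above, here is a short sketch of the standard intersection-over-union computation for axis-aligned boxes in `[x1, y1, x2, y2]` form. The function name `box_iou` is illustrative. It returns a value in [0, 1]; the figures quoted above appear to be on a 0 to 100 scale.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clipped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
print(box_iou([0, 0, 2, 2], [1, 1, 3, 3]))
```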

Table: Performance Comparison. Across every metric (Attribute reasoning, Spatial reasoning, and Fine-grained perception), the continuous action approach (NV-CoT) provides a large uplift over pure text-based serialization.

Deep Insight: Why L1 (Laplace) Beats L2 (Gaussian)?

The paper includes a fascinating ablation study on the choice of distribution. The authors found that a Laplace Policy (utilizing L1 loss) consistently beats a Gaussian Policy. Why? In localization tasks, outliers are common. The L1 loss is more robust to these "messy" bounding box coordinates during training, leading to sharper, more definitive region selections.
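
The link between the two policies and the two losses, and the robustness argument, can be made explicit with the standard negative log-likelihoods. This is a textbook derivation rather than a formula quoted from the paper, and it assumes $\sigma$ denotes the scale parameter of each distribution. For a single coordinate $b$:

$$
-\log p_{\text{Laplace}}(b \mid \mu, \sigma) = \frac{|b - \mu|}{\sigma} + \log(2\sigma),
\qquad
-\log p_{\text{Gauss}}(b \mid \mu, \sigma) = \frac{(b - \mu)^{2}}{2\sigma^{2}} + \log \sigma + \tfrac{1}{2}\log(2\pi).
$$

With $\sigma$ held fixed, minimizing the first is an $L_{1}$ loss and minimizing the second is a squared $L_{2}$ loss. Their gradients with respect to $\mu$ differ in how they react to large errors:

$$
\frac{\partial}{\partial \mu}\left(\frac{|b - \mu|}{\sigma}\right) = -\frac{\operatorname{sign}(b - \mu)}{\sigma},
\qquad
\frac{\partial}{\partial \mu}\left(\frac{(b - \mu)^{2}}{2\sigma^{2}}\right) = -\frac{b - \mu}{\sigma^{2}}.
$$

The Laplace gradient has constant magnitude regardless of how far off the target is, while the Gaussian gradient grows linearly with the error, so a single badly annotated box pulls much harder on a Gaussian policy.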

Figure: The Success of Accuracy. The red boxes (NV-CoT) wrap tightly around the target objects, whereas the blue boxes (backbone) are often loose and include unnecessary background.

Critical Analysis & Conclusion

Takeaway

NV-CoT proves that MLLMs shouldn't be forced to speak "human language" when performing internal "geometric thinking." By allowing the model to act directly on the continuous coordinates of the visual world, we unlock a new level of precision in multimodal reasoning.

Limitations & Future Work

The current framework focuses on a single "zoom-in" action per step. While it supports iterative calls, the complexity increases. Future research could explore how this continuous action space might allow models to perform more complex tasks like video tracking or dynamic object manipulation within a unified language-and-action framework.
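
To make the bookkeeping behind iterative zoom-in calls concrete, here is a small sketch of how successive boxes could be mapped back to the original image. It rests on an assumption the summary does not confirm: that each box is expressed in normalized `[0, 1]` coordinates relative to the current view. The helper names `to_parent_coords` and `compose_zooms` are hypothetical.

```python
def to_parent_coords(view, box):
    """Map `box`, given in normalized coordinates of `view`, into the
    coordinate frame in which `view` itself is expressed."""
    vx1, vy1, vx2, vy2 = view
    width, height = vx2 - vx1, vy2 - vy1
    x1, y1, x2, y2 = box
    return (
        vx1 + x1 * width,
        vy1 + y1 * height,
        vx1 + x2 * width,
        vy1 + y2 * height,
    )


def compose_zooms(boxes):
    """Apply a sequence of zoom-in boxes, each relative to the previous
    crop, and return the final region in full-image coordinates."""
    view = (0.0, 0.0, 1.0, 1.0)  # start from the whole image
    for box in boxes:
        view = to_parent_coords(view, box)
    return view


# Zoom into the bottom-right quadrant, then into the top-left quadrant of that crop.
steps = [(0.5, 0.5, 1.0, 1.0), (0.0, 0.0, 0.5, 0.5)]
print(compose_zooms(steps))
```

Each extra step multiplies the effective magnification, which is one reason small localization errors early in the chain matter more as the number of calls grows.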

NV-CoT is a modular, high-performance upgrade that likely signals the end of "text-based coordinates" in the next generation of reasoning MLLMs.
