TL;DR
CanViT is the first task- and policy-agnostic Active-Vision Foundation Model (AVFM). It treats vision as a sequential process of "glimpses" rather than static frame processing: a dual-stream architecture pairs a retinotopic ViT backbone, which encodes each glimpse, with a persistent spatiotopic "canvas" memory that accumulates the scene, bound together by an asymmetric Canvas Attention mechanism. The payoff is 45.9% mIoU on active semantic segmentation on ADE20K, far beyond previous active-vision records, while using nearly 20x less compute than its predecessors.
Problem & Motivation: The Passive Vision Bottleneck
Modern AI is dominated by "passive" vision: feed a high-resolution image into a transformer, and get a prediction. While effective, this is biologically implausible and computationally wasteful for high-resolution or long-horizon tasks.
Existing Active Computer Vision (ACV) models tried to fix this by using "glimpses" (crops), but they faced a fundamental scaling wall. Previous SOTA methods like AME and AdaGlimpse re-encoded all previous glimpses at every step, leading to $\mathcal{O}(T^2)$ or even $\mathcal{O}(T^3)$ complexity. More importantly, these models were often "task-specific," tied to individual labels or complex Reinforcement Learning (RL) policies that hindered their use as foundation models.
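Where do those exponents come from? A back-of-envelope count (assuming $n$ tokens per glimpse) makes it concrete: re-encoding all $t$ glimpses at step $t$ means self-attention over $tn$ tokens at cost $\mathcal{O}(t^2 n^2)$, and summing over a $T$-step trajectory gives $\sum_{t=1}^{T} (tn)^2 = \mathcal{O}(T^3 n^2)$. Schemes that cache earlier features and only attend back to them pay $\mathcal{O}(t n^2)$ per step instead, i.e. $\mathcal{O}(T^2 n^2)$ in total.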
Methodology: The Canvas and the Backbone
CanViT’s core innovation is the decoupling of processing and memory.
1. Dual-Stream Architecture
- The Backbone (Processing): A standard ViT-B that processes small, ephemeral glimpses (e.g., 128x128 px).
- The Canvas (Memory): A persistent, scene-wide latent grid that acts as a "cognitive map."
- Scene-Relative RoPE (SR-RoPE): Both streams are bound by a shared coordinate system ($[-1, +1]^2$), allowing the model to know exactly where a zoomed-in glimpse sits within the global scene (a sketch of this coordinate binding follows below).
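The paper's exact SR-RoPE parameterization isn't reproduced here, so the sketch below is only one plausible reading: the helper names (`scene_relative_coords`, `rope_angles`) and the frequency schedule are assumptions. The idea it illustrates is the one above: patch positions from both streams are expressed in the shared $[-1, +1]^2$ scene frame before rotary phases are computed, so a zoomed glimpse and the canvas grid live in one positional space.

```python
import torch

def scene_relative_coords(center, size, n_patches):
    """Patch-center coordinates of a glimpse, in the shared [-1, 1]^2 scene frame.

    center: (2,) glimpse center in scene coords; size: glimpse side length in
    scene units; n_patches: patches per side. Names are illustrative only.
    """
    # Evenly spaced patch centers across the glimpse, shifted to its location.
    offsets = (torch.arange(n_patches) + 0.5) / n_patches - 0.5  # in (-0.5, 0.5)
    ys = center[1] + offsets * size
    xs = center[0] + offsets * size
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=-1).reshape(-1, 2)  # (n_patches^2, 2)

def rope_angles(coords, dim, base=100.0):
    """Rotary angles from continuous 2-D scene coordinates (assumed schedule)."""
    freqs = base ** (-torch.arange(dim // 4) / (dim // 4))  # one band per pair
    ang_x = coords[:, :1] * freqs  # (N, dim/4)
    ang_y = coords[:, 1:] * freqs  # (N, dim/4)
    return torch.cat([ang_x, ang_y], dim=-1)  # (N, dim/2), fed to sin/cos rotation

# A zoomed-in glimpse and the full canvas grid share one coordinate frame,
# so their rotary phases are directly comparable.
glimpse_ang = rope_angles(scene_relative_coords(torch.tensor([0.3, -0.2]), 0.25, 8), dim=64)
canvas_ang = rope_angles(scene_relative_coords(torch.tensor([0.0, 0.0]), 2.0, 32), dim=64)
```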
2. Canvas Attention (The Efficiency Secret)
To keep the canvas high-resolution without exploding FLOPs, the authors designed Canvas Attention. In typical cross-attention, both Query (Q) and Key/Value (KV) sides have expensive linear projections. CanViT restricts all projections to the backbone side.
The arithmetic is simple but powerful: the projection FLOPs now scale with the small number of glimpse tokens rather than with the canvas size, so attending to 1,000+ memory tokens adds negligible overhead compared to the backbone's self-attention.
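Taking the description literally ("all projections on the backbone side"), a minimal PyTorch sketch might look like the following. This is an interpretation of the text, not the authors' reference code:

```python
import math
import torch
import torch.nn as nn

class CanvasAttention(nn.Module):
    """Asymmetric cross-attention sketch: projections live on the backbone side only.

    One plausible reading of the paper: canvas tokens serve directly as keys and
    values (no canvas-side K/V projections), so growing the memory adds only
    attention matmuls, never extra projection cost.
    """

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)    # backbone-side projection
        self.out_proj = nn.Linear(dim, dim)  # backbone-side projection

    def forward(self, backbone_tokens, canvas_tokens):
        B, N, D = backbone_tokens.shape
        M = canvas_tokens.shape[1]
        q = self.q_proj(backbone_tokens).view(B, N, self.n_heads, self.head_dim).transpose(1, 2)
        # Keys/values are the raw canvas states: no canvas-side linear layers.
        kv = canvas_tokens.view(B, M, self.n_heads, self.head_dim).transpose(1, 2)
        attn = torch.softmax(q @ kv.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        return self.out_proj((attn @ kv).transpose(1, 2).reshape(B, N, D))
```

Note that the canvas still participates fully in attention; it just never pays for its own $D \times D$ projection matmuls, which is what keeps a 1,000-token memory cheap.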

3. Passive-to-Active Distillation
Instead of training on labels (which are scarce for active vision), CanViT learns to mimic DINOv3. It is tasked with reconstructing the high-resolution, global feature maps of a frozen DINOv3 teacher using only a sequence of random, low-resolution glimpses. This forces the model to learn extrapolation—predicting what parts of the scene look like before it has even "looked" at them.
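A hedged sketch of what one such training step could look like; `init_canvas`, `observe`, `readout`, and `sample_glimpses` are hypothetical interfaces standing in for the paper's actual training loop, and the MSE objective is an assumption:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, image, sample_glimpses, optimizer, n_glimpses=8):
    """One passive-to-active distillation step (illustrative APIs, not the paper's)."""
    with torch.no_grad():
        target = teacher(image)                # frozen DINOv3-style global feature map
    canvas = student.init_canvas(image.shape)  # empty scene memory (assumed API)
    for _ in range(n_glimpses):
        crop, coords = sample_glimpses(image)  # random low-res glimpse + scene coords
        canvas = student.observe(crop, coords, canvas)  # update memory (assumed API)
    pred = student.readout(canvas)             # predict the full feature map (assumed API)
    loss = F.mse_loss(pred, target)            # match the passive teacher everywhere,
    optimizer.zero_grad()                      # including regions never glimpsed
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the loss covers the whole feature map while the glimpses cover only part of it, the student is forced to extrapolate into unseen regions, which is exactly the "predict before you look" behavior described above.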
Experiments & Results: Shifting the Frontier
The results on ADE20K semantic segmentation are the most telling.
- Accuracy: CanViT-B achieves 38.5% mIoU from a single glimpse, rising to 45.9% as further glimpses are accumulated.
- Efficiency: It outperforms the previous SOTA (AME), which reached only 27.6% mIoU despite a far larger compute budget.
- Policy Agnosticism: Even with "Fine-to-Coarse" (a deliberately bad policy), CanViT still beats prior models, proving that the architecture, not the search strategy, was the missing link.

Inference Latency
On an RTX 4090, CanViT maintains ~2.4ms per forward pass even as the virtual scene resolution grows, whereas the latency of standard ViTs (like DINOv3) grows quadratically with resolution, reaching 93ms at high resolutions. This makes CanViT a prime candidate for real-time robotics.
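The flat latency curve follows directly from token counts. The toy calculation below makes the scaling visible; the 16px patch size is an assumption, and the 128px glimpse is borrowed from the architecture section:

```python
def vit_tokens(res: int, patch: int = 16) -> int:
    """Tokens a passive ViT must self-attend over for a full image of side `res` px."""
    return (res // patch) ** 2

def glimpse_tokens(glimpse: int = 128, patch: int = 16) -> int:
    """Per-step tokens for a glimpse-based model: fixed, independent of scene size."""
    return (glimpse // patch) ** 2

for res in (512, 1024, 2048, 4096):
    # Attention cost scales with the square of the token count, so the passive
    # column explodes with resolution while the glimpse column stays flat.
    print(f"{res:>5}px  passive={vit_tokens(res):>6}  glimpse={glimpse_tokens():>4}")
```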
Critical Analysis & Conclusion
CanViT effectively "solves" the architectural scaling problem for active vision. By showing that latent distillation from passive models works for active ones, it opens the door for Active-Vision Foundation Models.
Limitations:
- The model currently relies on a passive teacher (DINOv3).
- It has been tested primarily on static images; its performance in highly dynamic temporal environments (video) remains an open question for future "Embodied CanViT" versions.
Takeaway: If you want efficient, high-resolution scene understanding, stop looking at the whole image at once. Use a canvas.
