[CVPR 2026] CanViT: Bridging the Efficiency Gap in Active-Vision Foundation Models
Abstract

The paper introduces CanViT, the first task- and policy-agnostic Active-Vision Foundation Model (AVFM). It utilizes a dual-stream architecture with a retinotopic ViT backbone and a spatiotopic "canvas" memory, achieving SOTA results on active semantic segmentation (45.9% mIoU on ADE20K) while being significantly more efficient than prior models.

TL;DR

CanViT is the first Active-Vision Foundation Model (AVFM): it treats vision as a sequential process of "glimpses" rather than one-shot processing of a static frame. By introducing a persistent spatial canvas and an asymmetric Canvas Attention mechanism, it achieves 45.9% mIoU on ADE20K segmentation, far surpassing previous active-vision results while using roughly 20x less compute than its predecessors.

Problem & Motivation: The Passive Vision Bottleneck

Modern AI is dominated by "passive" vision: feed a high-resolution image into a transformer, and get a prediction. While effective, this is biologically implausible and computationally wasteful for high-resolution or long-horizon tasks.

Existing Active Computer Vision (ACV) models tried to fix this by using "glimpses" (crops), but they faced a fundamental scaling wall. Previous SOTA methods like AME and AdaGlimpse re-encoded all previous glimpses at every step, leading to $\mathcal{O}(T^2)$ or even $\mathcal{O}(T^3)$ complexity. More importantly, these models were often "task-specific," tied to individual labels or complex Reinforcement Learning (RL) policies that hindered their use as foundation models.
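The scaling gap is easy to see from token counts alone. A back-of-the-envelope sketch (the token counts and function names below are illustrative assumptions, not figures from the paper):

```python
# Illustrative comparison: total tokens encoded over T glimpses when the
# model re-encodes all past glimpses at every step (prior ACV models, e.g.
# AME/AdaGlimpse as described above) versus encoding only the current
# glimpse against a fixed-size memory (the CanViT-style alternative).
def reencode_all_cost(T, tokens_per_glimpse=64):
    # Step t re-encodes all t glimpses seen so far -> O(T^2) tokens total.
    return sum(t * tokens_per_glimpse for t in range(1, T + 1))

def canvas_cost(T, tokens_per_glimpse=64):
    # Each step encodes only the current glimpse -> O(T) tokens total.
    return T * tokens_per_glimpse

print(reencode_all_cost(16))  # 8704 tokens encoded in total
print(canvas_cost(16))        # 1024 tokens encoded in total
```

The quadratic term dominates quickly: doubling the glimpse budget roughly quadruples the re-encoding cost, while the canvas-style cost merely doubles.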

Methodology: The Canvas and the Backbone

CanViT’s core innovation is the decoupling of processing and memory.

1. Dual-Stream Architecture

  • The Backbone (Processing): A standard ViT-B that processes small, ephemeral glimpses (e.g., 128x128 px).
  • The Canvas (Memory): A persistent, scene-wide latent grid that acts as a "cognitive map."
  • Scene-Relative RoPE (SR-RoPE): Both streams are bound by a shared coordinate system ($[-1, +1]^2$), allowing the model to know exactly where a zoomed-in glimpse sits within the global scene.
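The shared coordinate system can be illustrated by mapping each pixel of a glimpse into scene-relative $[-1, +1]^2$ coordinates. This is a minimal sketch under our own crop parameterization (center plus side length); the function name and arguments are hypothetical, not the paper's API:

```python
import numpy as np

def scene_relative_coords(crop_center, crop_size, glimpse_px=128):
    """Map each pixel of a glimpse to scene-relative [-1, +1]^2 coordinates.

    crop_center: (x, y) center of the crop in scene coordinates, each in [-1, 1].
    crop_size:   side length of the crop in scene units (2.0 = whole scene).
    """
    # Pixel centers of the glimpse, normalized to [-0.5, 0.5].
    local = (np.arange(glimpse_px) + 0.5) / glimpse_px - 0.5
    xs = crop_center[0] + local * crop_size
    ys = crop_center[1] + local * crop_size
    # Grid of (x, y) scene coordinates, one per glimpse pixel.
    return np.stack(np.meshgrid(xs, ys), axis=-1)  # (H, W, 2)

# A zoomed-in glimpse covering the top-left quadrant of the scene:
coords = scene_relative_coords(crop_center=(-0.5, -0.5), crop_size=1.0)
print(coords.shape)  # (128, 128, 2)
```

These scene-relative positions are what a rotary embedding (RoPE) would consume in both streams, so a 128x128 glimpse at high zoom and the global canvas agree on "where" every token lives.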

2. Canvas Attention (The Efficiency Secret)

To keep the canvas high-resolution without exploding FLOPs, the authors designed Canvas Attention. In typical cross-attention, both Query (Q) and Key/Value (KV) sides have expensive linear projections. CanViT restricts all projections to the backbone side.

The math is simple but powerful: by eliminating canvas-side projections, the overhead of interacting with 1,000+ memory tokens becomes negligible compared to the backbone's self-attention.
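A minimal numpy sketch of the asymmetric design as described above: the query and output projections live on the backbone side, while canvas tokens enter attention unprojected as both keys and values. The dimensions and single-head simplification are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                # shared latent width of both streams
x = rng.normal(size=(128, d))         # backbone tokens (current glimpse)
canvas = rng.normal(size=(1024, d))   # persistent canvas memory tokens

# Learned projections exist only on the backbone side; the canvas is raw.
Wq = rng.normal(size=(d, d)) / d**0.5
Wo = rng.normal(size=(d, d)) / d**0.5

def canvas_attention(x, canvas):
    q = x @ Wq                             # query projection (backbone side)
    scores = q @ canvas.T / np.sqrt(d)     # canvas serves as keys, unprojected
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)    # softmax over the 1024 canvas tokens
    return (attn @ canvas) @ Wo            # canvas serves as values; output proj

out = canvas_attention(x, canvas)
print(out.shape)  # (128, 64)
```

Because no `d x d` projection is ever applied to the 1,024 canvas tokens, the canvas-side cost reduces to the score and mixing matmuls, which is cheap next to the backbone's own self-attention and MLPs.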

(Figure: CanViT model architecture.)

3. Passive-to-Active Distillation

Instead of training on labels (which are scarce for active vision), CanViT learns to mimic DINOv3. It is tasked with reconstructing the high-resolution, global feature maps of a frozen DINOv3 teacher using only a sequence of random, low-resolution glimpses. This forces the model to learn extrapolation—predicting what parts of the scene look like before it has even "looked" at them.
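In spirit, the objective is latent regression against the frozen teacher's global feature map. A hedged sketch of that loss (the grid shapes and the absence of any masking or loss weighting are our simplifications, not the paper's exact recipe):

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats):
    """MSE between the student's canvas-decoded feature map and a frozen
    DINOv3-style teacher's full-resolution feature map.

    student_feats, teacher_feats: (H, W, D) arrays on the same spatial grid.
    The loss covers the WHOLE map, including regions no glimpse has visited,
    which is what forces the student to extrapolate unseen content.
    """
    return float(np.mean((student_feats - teacher_feats) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(32, 32, 64))                    # frozen teacher map
student = teacher + 0.1 * rng.normal(size=teacher.shape)   # imperfect student
print(distillation_loss(student, teacher))                 # small positive value
```

No labels appear anywhere in this objective, which is what makes the resulting model task-agnostic: any decoder trained on teacher features can be reused on the student's canvas.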

Experiments & Results: Shifting the Frontier

The results on ADE20K semantic segmentation are the most telling.

  • Accuracy: CanViT-B reaches 38.5% mIoU from a single glimpse, rising to 45.9% as more glimpses are accumulated.
  • Efficiency: It outperforms the previous SOTA, AME, which reached only 27.6% mIoU despite a far larger compute budget.
  • Policy Agnosticism: Even with "Fine-to-Coarse" (a deliberately bad glimpse policy), CanViT still beats prior models, showing that the architecture, not the search strategy, was the missing link.

(Figure: performance comparison against prior active-vision models.)

Inference Latency

On an RTX 4090, CanViT holds steady at ~2.4 ms per forward pass even as the virtual scene resolution grows, whereas the latency of standard ViTs (like DINOv3) grows quadratically, reaching 93 ms at high resolutions. This makes CanViT a prime candidate for real-time robotics.
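The quadratic growth of passive ViTs follows directly from token counts: self-attention scales with the square of the number of patches, while a glimpse-based model's per-step cost is fixed. A rough cost model (patch size 16, attention-score term only, constant factors and MLPs ignored; the numbers are illustrative, not the paper's benchmarks):

```python
def num_patches(resolution, patch=16):
    # A square image of side `resolution` tiled into patch x patch tokens.
    return (resolution // patch) ** 2

def attention_cost(resolution, patch=16):
    # Pairwise attention scores: quadratic in the number of patches.
    n = num_patches(resolution, patch)
    return n * n

for res in (512, 1024, 2048):
    print(res, attention_cost(res))
# Each resolution doubling multiplies the passive cost by 16, while a fixed
# 128 px glimpse keeps attention_cost(128) == 4096 at any scene resolution.
```

This is why the glimpse model's latency curve is flat: the scene can grow arbitrarily large, but the tokens actually attended over per step do not.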

Critical Analysis & Conclusion

CanViT effectively "solves" the architectural scaling problem for active vision. By showing that latent distillation from passive models works for active ones, it opens the door for Active-Vision Foundation Models.

Limitations:

  • The model currently relies on a passive teacher (DINOv3).
  • It has been tested primarily on static images; its performance in highly dynamic temporal environments (video) remains an open question for future "Embodied CanViT" versions.

Takeaway: If you want efficient, high-resolution scene understanding, stop looking at the whole image at once. Use a canvas.
