[CVPR 2026] Sparse Visual Thought Circuits: The Compositional Paradox of VLM Reasoning
Abstract

This paper introduces Sparse Visual Thought Circuits (SVTC), a causal framework for localizing and manipulating visual reasoning in Vision-Language Models (VLMs) such as Qwen3-VL-8B. Using Top-K Sparse Autoencoders (SAEs), the author identifies a "semantic bottleneck" at Layer 21 where visual features act as spatially grounded reasoning units: individually causal, yet antagonistic when combined.

TL;DR

Researchers have long treated AI model features as "levers" that can be pulled to steer behavior. This paper deconstructs that metaphor for Vision-Language Models (VLMs). By isolating Sparse Visual Thought Circuits (SVTC) in Qwen3-VL, the author shows that while individual reasoning features are causally powerful, combining them often causes the model to "hallucinate" or collapse. This discovery of Geometric Antagonism suggests that VLM "thoughts" are not modular LEGO bricks, but entangled signals that compete for neural capacity.

The Problem: The Failure of Linear Modularity

In the world of Interpretability, the Linear Modularity Hypothesis is king. It suggests that if we find a "counting" feature and a "spatial relation" feature, adding them together should make the model better at both.

However, this paper finds a "Compositional Paradox." In VLMs, visual features are physically grounded to image patches. Because these features must share a limited high-dimensional space (the residual stream), they are not always orthogonal. When you try to "steer" the model using a union of task-relevant features, the vectors often point in partially opposing directions, cancelling much of the signal and leaving mostly noise.
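To see the cancellation concretely, the sketch below builds two unit steering directions with the reported overlap of roughly -0.33. The dimension and the construction are illustrative assumptions; only the cosine value comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # hypothetical residual-stream width

# Unit "Pattern" direction.
a = rng.standard_normal(d)
a /= np.linalg.norm(a)

# Unit "Global" direction built to have cosine similarity -0.33 with `a`,
# matching the antagonism the paper reports.
r = rng.standard_normal(d)
r -= (r @ a) * a              # keep only the part of r orthogonal to a
r /= np.linalg.norm(r)
cos = -0.33
b = cos * a + np.sqrt(1 - cos**2) * r

print(f"cos(a, b) = {a @ b:+.2f}")                    # -0.33
print(f"|a + b|   = {np.linalg.norm(a + b):.2f}")     # ~1.16, vs 1.41 if orthogonal
print(f"signal of a left in a+b: {(a + b) @ a:.2f}")  # 0.67: a third cancelled
```

Even this moderate antagonism erases a third of each direction's contribution before any normalization gets involved.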

Methodology: Finding the "Thought" in the Machine

The author proposes a three-step causal pipeline:

  1. Localization: Using linear probes to find where "reasoning" happens. Surprisingly, it’s not in the early vision layers or the final output layers, but at a "Semantic Bottleneck" (Layer 21 in Qwen3-VL-8B).
  2. Decomposition: Using a Top-K Sparse Autoencoder (SAE) to turn dense, messy activations into 32,768 sparse, interpretable "features."
  3. Causal Intervention: Using a masked operator to scale or ablate these features during inference to see if they actually drive the model's logic (sketched below).
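To make steps 2 and 3 concrete, here is a minimal PyTorch sketch of a Top-K SAE and a masked intervention operator. The class and function names, shapes, and the error-preserving write-back are illustrative assumptions, not the author's released code; only the dictionary size (32,768) and the scale/ablate semantics come from the paper:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model: int = 4096, n_features: int = 32_768, k: int = 64):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model, bias=False)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.enc(x))
        # Keep only the k largest activations per token; zero the rest.
        topk = torch.topk(z, self.k, dim=-1)
        sparse = torch.zeros_like(z)
        sparse.scatter_(-1, topk.indices, topk.values)
        return sparse

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.encode(x))

def masked_intervention(sae: TopKSAE, x: torch.Tensor,
                        feature_ids: torch.Tensor, scale: float) -> torch.Tensor:
    """Scale (scale > 1) or ablate (scale = 0) a chosen feature set, then
    write the edited reconstruction back into the residual stream."""
    z = sae.encode(x)
    z[..., feature_ids] = z[..., feature_ids] * scale
    # Preserve the SAE's reconstruction error so only the chosen features
    # change; with scale = 1 this returns x unchanged.
    return x - sae(x) + sae.dec(z)
```

Hooking such an operator into the residual stream at Layer 21 during the forward pass, then checking whether the answer flips, is the causal test the pipeline runs.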

Figure 1: The SVTC pipeline. Probing identifies the site, SAEs decompose the signal, and interventions test the causality.

The "Aha!" Moment: Geometric Antagonism

Why does joint steering fail? The author identifies a Noise Amplification Law.

When two necessary circuit sets (called "Pattern" and "Global") are activated together, they exhibit a negative cosine similarity (≈ -0.33). Because they are geometrically opposed, the semantic signal collapses toward zero. Once the signal disappears, the model's Layer Normalization (LN) acts as a stochastic multiplier: it amplifies the remaining background noise exponentially.
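A toy simulation illustrates the claimed mechanism. Here LayerNorm stands in for whatever normalization the model actually applies, and all magnitudes are invented; the point is only that LN rescales whatever survives, signal or not:

```python
import torch

torch.manual_seed(0)
d = 4096
ln = torch.nn.LayerNorm(d)

signal = torch.randn(d)
signal /= signal.norm()
noise = 0.05 * torch.randn(d)           # small background residue

healthy   = 10.0 * signal + noise       # strong semantic component
collapsed =  0.0 * signal + noise       # "Pattern" and "Global" cancelled

with torch.no_grad():
    for name, x in [("healthy", healthy), ("collapsed", collapsed)]:
        y = ln(x)
        # Fraction of the post-LN activation lying along the true signal.
        frac = (y @ signal) ** 2 / (y @ y)
        print(f"{name:9s} signal fraction after LN: {frac.item():.3f}")
```

With the signal intact, LN passes it through almost untouched; with the signal cancelled, LN inflates the residue to full strength, which is the "stochastic multiplier" behavior described above.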

The result? The model loses its "visual pointer," entropy maximizes, and it defaults to "hallucinating" based on language priors rather than looking at the image.

Figure 2: Feature #2398 acts as a precise semantic detector for patterns, proving these circuits are spatially grounded.

Key Results & Experiments

The findings were robust across five diverse datasets (NLVR2, CLEVR, VSR, etc.) and multiple model families:

  • Causal Necessity: Ablating just 0.5% of the SAE dictionary features caused the model to flip its answer 57% of the time.
  • The Paradox: As shown in the table below, while "Pattern" steering increases accuracy (+1.03 pp), "Union" steering crashes it (-1.71 pp) and nearly doubles the output drift.
  • Layer 21 Phase Transition: Before Layer 15, interventions act as pure noise. After Layer 24, the manifold becomes too "curved" for linear steering. Layer 21 is the "Goldilocks zone."

Table 1: The "Compositional Paradox" - Pattern steering helps, but Union steering hurts.

Critical Insight & Future Outlook

This paper serves as a warning to the "Model Steering" community. We cannot simply treat LLM/VLM activations as a flat vector space where Vector(A) + Vector(B) works reliably.

The Takeaway: To truly control VLMs, we need Manifold-Aware Steering. Instead of naive addition, we should use techniques like Orthogonalized Signal Projection (OSP) to ensure that composite signals do not cancel each other out. This work moves us closer to a "mechanistic manual" for multimodal AI, showing that the path to reliability lies in mastering the hidden geometry of neural thoughts.
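The summary does not spell out how OSP is implemented, but one plausible reading is a Gram-Schmidt-style composition: project each additional steering vector onto the orthogonal complement of those already applied, so components can no longer cancel. A hypothetical sketch, with all names assumed:

```python
import torch

def orthogonalized_composite(vectors: list[torch.Tensor]) -> torch.Tensor:
    """Compose steering vectors so later directions cannot cancel earlier
    ones: each vector contributes only the component orthogonal to all
    previously added directions (Gram-Schmidt)."""
    basis: list[torch.Tensor] = []
    composite = torch.zeros_like(vectors[0])
    for v in vectors:
        u = v.clone()
        for b in basis:
            u = u - (u @ b) * b       # strip overlap with prior directions
        norm = u.norm()
        if norm > 1e-6:               # drop near-redundant directions
            basis.append(u / norm)
            composite = composite + u  # add only the novel component
    return composite

# Usage: steer = orthogonalized_composite([pattern_vec, global_vec])
```

Because the accumulated components are mutually orthogonal, their norms add rather than interfere, sidestepping the destructive cancellation that the Union condition exhibits.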
