WisPaper
WisPaper
学术搜索
学术问答
价格
TrueCite
[CVPR 2025] SITH: Cracking the CLIP Code via Weight-Space Singular Vector Decomposition
总结
问题
方法
结果
要点
摘要

SITH (Semantic Inspection of Transformer Heads) is a data-free and training-free interpretability framework that analyzes CLIP's Vision Transformer in weight space by decomposing Value-Output (VO) matrices into singular vectors. Using the novel COMP (Coherent Orthogonal Matching Pursuit) algorithm, it maps these vectors to sparse, human-interpretable concepts, achieving SOTA results in fine-grained model editing and mechanistic understanding.

TL;DR

SITH (Semantic Inspection of Transformer Heads) is a revolutionary data-free and training-free framework for interpreting CLIP. Instead of looking at image activations, it looks directly at the model weights. By decomposing attention matrices using SVD and a new algorithm called COMP, SITH identifies specific "concept vectors" inside attention heads. This allows for surgical model edits—like removing bias or NSFW content—simply by zeroing out specific singular values.

Impact: This work shifts mechanistic interpretability from a dataset-heavy task to a pure linear algebra problem, proving that CLIP’s weights are a direct map of human concepts.


The Motivation: Interpretation Without Bias

Most interpretability tools (like Sparse Autoencoders) are dataset-dependent. If your probe dataset is biased, your interpretation is biased. Furthermore, identifying that a "head" likes "colors" is too coarse. We need to know which specific feature inside that head represents "Fire-Engine Red" vs "Navy Blue."

SITH solves this by bypassing data entirely, looking at the Weight Space.


Methodology: The Core Mechanics

1. Weight Matrix Factorization

SITH focuses on the Value-Output (VO) matrix ($W_{VO}$) of each attention head. In the Transformer "circuit" view, the VO matrix determines what information is written into the residual stream.

SITH applies Singular Value Decomposition (SVD): $$W_{VO} = U \Sigma V^T$$

  • Left Singular Vectors ($u_i$): What the head "reads" from the input.
  • Right Singular Vectors ($v_i$): What the head "writes" to the output.
  • Singular Values ($\sigma_i$): The technical "strength" or importance of that feature.

2. COMP: Making Math Human-Readable

To turn abstract vectors into words, the authors developed COMP (Coherent Orthogonal Matching Pursuit). Standard matching pursuit just looks for the best mathematical fit. COMP adds a coherence term to ensure the words chosen make sense together (e.g., "sea," "ocean," and "beach" instead of "sea," "toaster," and "politics").

SITH Architecture Overview Figure 1: The SITH pipeline—from VO weight matrices to coherent concept clusters via SVD and COMP.


Interpretable Model Editing: Performance Results

Because SITH maps specific weights to specific concepts, we can "edit" the model without retraining.

Anti-Bias and Safety

  • Spurious Correlations: On the "Waterbirds" dataset (birds on wrong backgrounds), SITH identifies "background" vectors. Setting $\sigma_i = 0$ for those vectors improved accuracy significantly, outperforming head-level ablation methods like TextSpan.
  • NSFW Filters: SITH identifies "unsafe" vectors. By suppressing them, the model becomes safer in retrieval tasks without losing general performance.

Fine-Tuning Insights

SITH reveals that when we fine-tune CLIP, we aren't learning new concepts. Instead, we are simply re-weighting the existing semantic basis. The "subspace shift" is subtle but task-aligned.

Experimental Results Figure 2: Substituting original singular vectors with SITH-reconstructed versions shows minimal loss in zero-shot accuracy, proving the fidelity of the decomposition.


Visual Evidence: Grounding Weights

Does a weight vector actually "see" what we think it does? The authors retrieved images from CC12M that most resembled specific singular vectors. The results are startling: a vector identified as "Two Children" consistently retrieves images of pairs of kids.

Visual Grounding Figure 3: Top-3 images matched to singular vectors—confirming that the weight-based interpretations align with visual reality.


Deep Insight & Conclusion

The SITH framework is a masterclass in mechanistic interpretability. It proves that the "vision-language" alignment in models like CLIP is so robust that it is reflected directly in the geometry of its weights.

Takeaway: We no longer need massive datasets to understand or fix VLMs. By applying spectral analysis to model weights, we can achieve surgical, data-free interventions that make AI safer, more robust, and more transparent. The future of AI safety and adaptation may not lie in more data, but in better linear algebra.


发现相似论文

试试这些示例

  • Find recent papers that apply weight-space decomposition or spectral analysis to interpret the internal mechanisms of Large Language Models (LLMs) or Vision-Language Models (VLMs) beyond attention heads.
  • Which study first introduced the concept of the 'Value-Output' (VO) circuit in Transformer circuits, and how does the SITH framework extend that original theoretical foundation?
  • Search for research that applies Singular Value Decomposition (SVD) for model editing, safety alignment, or removing spurious correlations in multimodal neural networks.
目录
[CVPR 2025] SITH: Cracking the CLIP Code via Weight-Space Singular Vector Decomposition
1. TL;DR
2. The Motivation: Interpretation Without Bias
3. Methodology: The Core Mechanics
3.1. 1. Weight Matrix Factorization
3.2. 2. COMP: Making Math Human-Readable
4. Interpretable Model Editing: Performance Results
4.1. Anti-Bias and Safety
4.2. Fine-Tuning Insights
5. Visual Evidence: Grounding Weights
6. Deep Insight & Conclusion