Vision Transformers Need More Than Registers: Striking Back at Lazy Aggregation
Abstract

This paper introduces LaSt-ViT (LazyStrike ViT) to address "lazy aggregation" artifacts in Vision Transformers, where models use background patches as shortcuts for global semantics. By employing a frequency-aware selective aggregation mechanism, it achieves SOTA performance across 12 benchmarks, significantly improving dense feature alignment in supervised, text-supervised (CLIP), and self-supervised (DINO) settings.

TL;DR

Vision Transformers (ViTs) are notorious for "lazy" behavior, often using irrelevant background patches to represent global image semantics. This paper identifies Lazy Aggregation as the root cause of feature artifacts. The authors propose LaSt-ViT (LazyStrike), which uses a frequency-aware stability score to force the CLS token to focus on foreground objects. This leads to massive gains in dense-prediction tasks, such as increasing CLIP's zero-shot segmentation mIoU from 17% to 72%.

The Problem: The "Lazy" Shortcut

While ViTs are the backbone of modern vision, they suffer from a persistent issue: Artifacts. You might have noticed high-norm tokens in attention maps or seen that the "CLS" token often pays more attention to a random corner of the sky than the actual bird in the image.

Previous research suggested "Registers" (extra tokens) as a sink for these artifacts. However, this paper argues that registers are just a band-aid. The real issue is Coarse-grained semantic supervision combined with Global dependencies. Because the model is only told "this image contains a bird" (image-level label) and can look at every patch at once, it finds the easiest mathematical path: it spreads the "bird" signal across the abundant background patches to minimize loss. This is "Lazy Aggregation."

Figure 1: Analysis of lazy behavior. Comparison of artifacts under different supervision settings. Notice how standard ViTs (middle/right) focus on background, while LazyStrike aligns with the object.

The Insight: Stability in the Frequency Domain

How do we distinguish a "meaningful" patch from a "shortcut" patch without manual labels? The authors found a physical intuition: Foreground signals are semantically more homogeneous.

In deep layers, the features of a foreground object change less across the channel dimension than those of the noisy background. By applying a 1D Fourier Transform (FFT) across the channel dimension and measuring "Channel-wise Stability," the model can identify which patches represent stable, foreground information.

The LaSt-ViT Mechanism

  1. Transform: Apply 1D FFT to patch features in the channel dimension.
  2. Filter: Use a low-pass filter to keep only "stable" components.
  3. Score: Calculate a "Stability Score" based on how much the feature changes after filtering.
  4. Aggregate: Instead of averaging all patches, the CLS token only aggregates the Top-K most stable patches per channel.
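
The four steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `keep_ratio` cutoff, the choice of "negative absolute change after filtering" as the stability score, and the use of a plain mean over the selected patches are all hypothetical choices made to keep the example concrete.

```python
import numpy as np


def stability_scores(patches: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Score each (patch, channel) entry by how little it changes under a low-pass filter.

    patches: array of shape (N, C), one row per patch token.
    Returns an (N, C) array; values closer to 0 mean "more stable".
    """
    num_channels = patches.shape[1]
    # Step 1 (Transform): 1D FFT along the channel dimension.
    spectrum = np.fft.rfft(patches, axis=1)
    # Step 2 (Filter): keep only the lowest-frequency bins (assumed cutoff rule).
    cutoff = max(1, int(spectrum.shape[1] * keep_ratio))
    spectrum[:, cutoff:] = 0
    smooth = np.fft.irfft(spectrum, n=num_channels, axis=1)
    # Step 3 (Score): assumed definition, a smaller change means a higher score.
    return -np.abs(patches - smooth)


def selective_aggregate(patches: np.ndarray, k: int, keep_ratio: float = 0.25) -> np.ndarray:
    """Aggregate only the k most stable patches per channel into one (C,) vector."""
    num_patches = patches.shape[0]
    if not 1 <= k <= num_patches:
        raise ValueError("k must be between 1 and the number of patches")
    scores = stability_scores(patches, keep_ratio)
    # Step 4 (Aggregate): per channel, indices of the k highest-scoring patches.
    top_idx = np.argsort(-scores, axis=0)[:k]                  # shape (k, C)
    selected = np.take_along_axis(patches, top_idx, axis=0)    # shape (k, C)
    return selected.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(196, 768))   # e.g. 14x14 patches, 768 channels
    pooled = selective_aggregate(tokens, k=16)
    print(pooled.shape)
```

In this sketch, setting `k` equal to the number of patches reduces to ordinary mean pooling, which makes the role of K as a selectivity knob easy to see.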

Experimental Results: A Transformation in Performance

The impact of anchoring the CLS token to the foreground is profound.

1. Zero-shot Semantic Segmentation

For CLIP-based models, the improvement is dramatic. Because the CLS token is now forced to align with the actual object, the underlying patch features become much more accurate.

  • CLIP ViT-L/14 on Pascal VOC: mIoU jumped from 17.1% to 72.4%.
  • CLIP ViT-B/16 on Cityscapes: mIoU rose from 6.5% to 12.1%.
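
For readers unfamiliar with the metric, mIoU (mean Intersection over Union) averages the per-class overlap between predicted and ground-truth label maps. The sketch below computes it for a single pair of label maps; the function name and the rule of skipping classes absent from both maps are assumptions for illustration, and benchmark toolkits typically accumulate intersections and unions over the whole dataset before averaging.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        pred_c = pred == cls
        target_c = target == cls
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it (assumed convention)
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    # Return 0.0 if no class was present at all, to avoid averaging an empty list.
    return float(np.mean(ious)) if ious else 0.0


if __name__ == "__main__":
    pred = np.array([[0, 0], [1, 1]])
    target = np.array([[0, 1], [1, 1]])
    # Class 0: IoU 1/2; class 1: IoU 2/3.
    print(mean_iou(pred, target, num_classes=2))
```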

2. Eliminating High-Norm Tokens

The "high-norm" artifact observed in models like DINOv2 disappears entirely. This confirms that those "spikes" in energy were actually just the model's desperate attempt to store global information in random patches.
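
One simple way to check for such spikes is to compare each patch token's L2 norm against a typical value for the image. The sketch below is an illustrative diagnostic only: the function name and the "more than `factor` times the median norm" threshold are assumptions, not the criterion used in the paper.

```python
import numpy as np


def high_norm_mask(tokens: np.ndarray, factor: float = 3.0) -> np.ndarray:
    """Flag tokens whose L2 norm exceeds `factor` times the median token norm.

    tokens: array of shape (N, C), one row per patch token.
    Returns a boolean array of shape (N,).
    """
    norms = np.linalg.norm(tokens, axis=1)
    return norms > factor * np.median(norms)


if __name__ == "__main__":
    tokens = np.ones((10, 4))
    tokens[7] *= 50.0                                # inject one artificial outlier
    print(np.flatnonzero(high_norm_mask(tokens)))    # indices of flagged tokens
```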

Table 3: LazyStrike consistently eliminates High Norm artifacts and improves the Point-in-Box (PiB) score across all training paradigms.

Critical Analysis: Why This Matters

The genius of LaSt-ViT lies in its simplicity. It doesn't require extra parameters or post-training fine-tuning. It fixes the Inductive Bias of the Transformer. By restricting the CLS token's "diet" to certain patches, we solve a problem that was previously thought to require complex architectural shifts like "Registers" or specialized losses.

Limitations: While PiB and segmentation improve, there is a slight trade-off in raw ImageNet classification accuracy if K (the number of selected tokens) is set too low. The model needs some global context to classify complex scenes, so finding the right balance for K is essential.

Conclusion

The takeaway is clear: Vision Transformers don't just need registers; they need better boundaries. By understanding that ViTs are naturally "lazy," we can design aggregation methods that respect the physical reality of the foreground vs. background. LaSt-ViT provides an elegant, frequency-aware solution that bridges the gap between image-level recognition and dense-pixel understanding.

Figure 9: PCA visualization. Features become semantically meaningful when the artifacts are eliminated.
