VisualAD is a training-based Zero-Shot Anomaly Detection (ZSAD) framework that eliminates the need for text encoders and vision-language alignment. By inserting learnable "normal" and "anomaly" tokens into a frozen Vision Transformer (ViT), it achieves state-of-the-art results across 13 industrial and medical benchmarks using purely visual cues.
TL;DR
VisualAD challenges the reigning paradigm of Vision-Language Models (VLMs) for anomaly detection. By discarding the text encoder and instead using two learnable "visual prototypes" (tokens) within a frozen Vision Transformer, this framework achieves SOTA performance with 99% fewer trainable parameters and significantly more stable training.
Context & Motivation: Why Drop the "Language"?
The current SOTA in Zero-Shot Anomaly Detection (ZSAD) is dominated by CLIP-based methods. These models typically prompt a text encoder with words like "normal" or "damaged" to create reference embeddings.
However, the authors of VisualAD observed a startling phenomenon: if you remove the text encoder and simply learn two raw vectors to represent these states, performance doesn't drop, and training actually becomes more stable. This suggests that in ZSAD, language is often an indirect, noisy proxy for what is fundamentally a spatial and structural visual problem.
Methodology: The Core Architecture
VisualAD operates on a simple yet elegant premise: inject the "concept" of an anomaly directly into the visual pipeline.
1. Global Learnable Tokens
Instead of complex text prompts, the model inserts two learnable tokens—$t_a$ (anomaly) and $t_n$ (normal)—into the initial sequence of a frozen ViT. As these tokens pass through the Transformer layers, they attend to image patches, "learning" what abnormal textures and structures look like across the auxiliary training set.
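The token-scoring idea can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `score_patches` is a name of our own, the tokens here are random stand-ins for parameters VisualAD learns end-to-end, and the real model lets the tokens attend to patches inside the Transformer rather than comparing them post-hoc.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between each row of `a` and vector `b`.
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

def score_patches(patches, t_n, t_a):
    """Per-patch anomaly probability from two prototype tokens.

    patches : (N, d) patch features from the frozen ViT
    t_n, t_a: (d,) "normal" / "anomaly" tokens
    Returns a softmax over the two cosine similarities per patch.
    """
    s_n = cosine(patches, t_n)                    # (N,)
    s_a = cosine(patches, t_a)                    # (N,)
    logits = np.stack([s_n, s_a], axis=-1)        # (N, 2)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs[:, 1]                            # P("anomaly") per patch

# Toy check: a patch built near t_a should score higher than one near t_n.
rng = np.random.default_rng(0)
d = 8
t_n, t_a = rng.normal(size=d), rng.normal(size=d)
patches = np.stack([t_n + 0.01 * rng.normal(size=d),
                    t_a + 0.01 * rng.normal(size=d)])
scores = score_patches(patches, t_n, t_a)
```

Because the two tokens are compared through a softmax, the score is a relative contrast between the "normal" and "anomaly" prototypes rather than an absolute distance, which is what makes the learned vectors behave like class references.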
2. Spatial-Aware Cross-Attention (SCA)
Global tokens often lose local details. To fix this, the SCA module uses a small set of anchor queries (default $m=4$) to aggregate localized spatial evidence from patch features. By applying positional encoding, the tokens gain an "awareness" of where a potential defect is located.
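A stripped-down sketch of the anchor-query aggregation helps make this concrete. The version below is single-head with no learned projections, the function name is our own, and the random queries stand in for the $m=4$ learnable anchors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_cross_attention(patches, pos, queries):
    """Aggregate localized evidence with m anchor queries.

    patches: (N, d) patch features
    pos:     (N, d) positional encodings, added so attention is location-aware
    queries: (m, d) anchor queries (m = 4 in the default configuration)
    Returns (m, d) spatial summaries and the (m, N) attention map.
    """
    keys = patches + pos                                         # position-aware keys
    attn = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))   # (m, N)
    return attn @ patches, attn

# Toy shapes: 16 patches of dim 8, m = 4 anchors.
rng = np.random.default_rng(1)
N, d, m = 16, 8, 4
patches = rng.normal(size=(N, d))
pos = 0.1 * rng.normal(size=(N, d))
queries = rng.normal(size=(m, d))
summaries, attn = spatial_cross_attention(patches, pos, queries)
```

The key design point is that the positional encoding enters on the key side: each anchor query then attends to *where* evidence sits in the grid, not just to feature content.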

3. Self-Alignment Function (SAF)
To ensure the patch features are actually comparable to the high-level tokens, the SAF (a lightweight MLP) recalibrates the patch tokens at each layer. This creates a clean "Self-Aligned Cosine Contrast," making the boundary between normal and abnormal regions much sharper in the latent space.
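A hedged sketch of this recalibrate-then-contrast step, with random weights standing in for the learned MLP and our own function names, could look like the following.

```python
import numpy as np

def self_align(patches, W1, b1, W2, b2):
    """Lightweight MLP (d -> h -> d) with a residual connection that
    recalibrates patch features toward the tokens' space.
    The weights here are random stand-ins for learned parameters."""
    hidden = np.maximum(patches @ W1 + b1, 0.0)   # ReLU hidden layer
    return patches + hidden @ W2 + b2             # residual recalibration

def cosine_contrast(patches, t_n, t_a):
    """Cosine contrast: positive where a patch is closer to the
    anomaly token than to the normal token."""
    p = patches / np.linalg.norm(patches, axis=-1, keepdims=True)
    return p @ (t_a / np.linalg.norm(t_a)) - p @ (t_n / np.linalg.norm(t_n))

rng = np.random.default_rng(2)
N, d, h = 16, 8, 16
patches = rng.normal(size=(N, d))
W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, d)), np.zeros(d)
t_n, t_a = rng.normal(size=d), rng.normal(size=d)
contrast = cosine_contrast(self_align(patches, W1, b1, W2, b2), t_n, t_a)
```

Applying such a recalibration at each layer keeps the per-layer patch features on a comparable footing with the two global tokens, which is what allows a plain cosine contrast to serve as the anomaly signal.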
Empirical Superiority
The results across 13 datasets (including MVTec-AD and complex medical datasets like BrainMRI) show that VisualAD is not just a "lightweight alternative" but a superior performer.
- Versatility: It works seamlessly with both CLIP-ViT and DINOv2 backbones.
- Localization: The combination of SCA and multi-layer fusion allows it to produce high-resolution anomaly maps that outperform VLM-based competitors.

Ablation Insight: The Power of Multi-Layer Fusion
Analysis shows that mid-level layers (e.g., Layer 18 of ViT-L) are the "sweet spot" for anomaly detection, balancing local texture disruptions with global context. VisualAD leverages this by fusing information from multiple depths to achieve robust results.
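As a rough illustration of the fusion step (uniform averaging and nearest-neighbor upsampling; the paper's actual fusion and layer weighting may differ, and the layer choices below are only examples):

```python
import numpy as np

def fuse_layers(maps):
    """Fuse per-layer anomaly maps by simple averaging.

    maps: list of (H, W) anomaly maps taken at different ViT depths
    """
    return np.mean(np.stack(maps, axis=0), axis=0)

def upsample_nearest(m, scale):
    # Nearest-neighbor upsampling back to image resolution.
    return np.kron(m, np.ones((scale, scale)))

rng = np.random.default_rng(3)
# e.g. maps from three depths of a ViT-L, each on a 14x14 patch grid
layer_maps = [rng.uniform(size=(14, 14)) for _ in range(3)]
fused = fuse_layers(layer_maps)
full_res = upsample_nearest(fused, 16)   # 14 * 16 = 224 pixels per side
```

Averaging across depths damps layer-specific noise: a region only scores high in the fused map if several depths agree, which is consistent with the mid-layer "sweet spot" observation above.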
Deep Insight & Conclusion
VisualAD’s success implies that the "Semantic Gap" in anomaly detection is smaller than we thought. We don't necessarily need to tell a model what an "anomaly" is in English; we just need to provide the architectural capacity for the model to contrast "expected" vs "unexpected" visual manifolds.
Takeaway: For specialized CV tasks where the target is a structural deviation rather than a semantic category, moving away from multi-modal alignment towards purely visual adapters can lead to more efficient and reliable models.
Limitations: While powerful, the model still requires an auxiliary "source" dataset for training the tokens. True "zero-training" open-set detection remains a challenge for future work.
