VisualAD is a training-based Zero-Shot Anomaly Detection (ZSAD) framework that eliminates the need for text encoders and vision-language alignment. By inserting learnable "normal" and "anomaly" tokens into a frozen Vision Transformer (ViT), it achieves state-of-the-art results across 13 industrial and medical benchmarks using purely visual cues.
TL;DR
VisualAD challenges the reigning paradigm of Vision-Language Models (VLMs) for anomaly detection. By discarding the text encoder and instead using two learnable "visual prototypes" (tokens) within a frozen Vision Transformer, this framework achieves SOTA performance with 99% fewer trainable parameters and significantly more stable training.
Context & Motivation: Why Drop the "Language"?
The current SOTA in Zero-Shot Anomaly Detection (ZSAD) is dominated by CLIP-based methods. These models typically prompt a text encoder with words like "normal" or "damaged" to create reference embeddings.
However, the authors of VisualAD observed a startling phenomenon: if you remove the text encoder and simply learn two raw vectors to represent these states, performance doesn't drop, and training actually becomes more stable. This suggests that in ZSAD, language is often an indirect, noisy proxy for what is fundamentally a spatial and structural visual problem.
Methodology: The Core Architecture
VisualAD operates on a simple yet elegant premise: inject the "concept" of an anomaly directly into the visual pipeline.
1. Global Learnable Tokens
Instead of complex text prompts, the model inserts two learnable tokens—$t_a$ (anomaly) and $t_n$ (normal)—into the initial sequence of a frozen ViT. As these tokens pass through the Transformer layers, they attend to image patches, "learning" what abnormal textures and structures look like across the auxiliary training set.
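The token-scoring idea can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: `score_patches` is a name of our own, the tokens here are random stand-ins for parameters VisualAD learns end-to-end, and the real model lets the tokens attend to patches inside the Transformer rather than comparing them post-hoc.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between each row of `a` and vector `b`.
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b)
    return a_n @ b_n

def score_patches(patches, t_n, t_a):
    """Per-patch anomaly probability from two prototype tokens.

    patches : (N, d) patch features from the frozen ViT
    t_n, t_a: (d,) "normal" / "anomaly" tokens
    Returns a softmax over the two cosine similarities per patch.
    """
    s_n = cosine(patches, t_n)                    # (N,)
    s_a = cosine(patches, t_a)                    # (N,)
    logits = np.stack([s_n, s_a], axis=-1)        # (N, 2)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs[:, 1]                            # P("anomaly") per patch

# Toy check: a patch built near t_a should score higher than one near t_n.
rng = np.random.default_rng(0)
d = 8
t_n, t_a = rng.normal(size=d), rng.normal(size=d)
patches = np.stack([t_n + 0.01 * rng.normal(size=d),
                    t_a + 0.01 * rng.normal(size=d)])
scores = score_patches(patches, t_n, t_a)
```

Because the two tokens are compared through a softmax, the score is a relative contrast between the "normal" and "anomaly" prototypes rather than an absolute distance, which is what makes the learned vectors behave like class references.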
2. Spatial-Aware Cross-Attention (SCA)
Global tokens often lose local details. To fix this, the SCA module uses a small set of anchor queries (default $m=4$) to aggregate localized spatial evidence from patch features. By applying positional encoding, the tokens gain an "awareness" of where a potential defect is located.
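A stripped-down sketch of the anchor-query aggregation helps make this concrete. The version below is single-head with no learned projections, the function name is our own, and the random queries stand in for the $m=4$ learnable anchors.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_cross_attention(patches, pos, queries):
    """Aggregate localized evidence with m anchor queries.

    patches: (N, d) patch features
    pos:     (N, d) positional encodings, added so attention is location-aware
    queries: (m, d) anchor queries (m = 4 in the default configuration)
    Returns (m, d) spatial summaries and the (m, N) attention map.
    """
    keys = patches + pos                                         # position-aware keys
    attn = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))   # (m, N)
    return attn @ patches, attn

# Toy shapes: 16 patches of dim 8, m = 4 anchors.
rng = np.random.default_rng(1)
N, d, m = 16, 8, 4
patches = rng.normal(size=(N, d))
pos = 0.1 * rng.normal(size=(N, d))
queries = rng.normal(size=(m, d))
summaries, attn = spatial_cross_attention(patches, pos, queries)
```

The key design point is that the positional encoding enters on the key side: each anchor query then attends to *where* evidence sits in the grid, not just to feature content.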

3. Self-Alignment Function (SAF)
To ensure the patch features are actually comparable to the high-level tokens, the SAF (a lightweight MLP) recalibrates the patch tokens at each layer. This creates a clean "Self-Aligned Cosine Contrast," making the boundary between normal and abnormal regions much sharper in the latent space.
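A hedged sketch of this recalibrate-then-contrast step, with random weights standing in for the learned MLP and our own function names, could look like the following.

```python
import numpy as np

def self_align(patches, W1, b1, W2, b2):
    """Lightweight MLP (d -> h -> d) with a residual connection that
    recalibrates patch features toward the tokens' space.
    The weights here are random stand-ins for learned parameters."""
    hidden = np.maximum(patches @ W1 + b1, 0.0)   # ReLU hidden layer
    return patches + hidden @ W2 + b2             # residual recalibration

def cosine_contrast(patches, t_n, t_a):
    """Cosine contrast: positive where a patch is closer to the
    anomaly token than to the normal token."""
    p = patches / np.linalg.norm(patches, axis=-1, keepdims=True)
    return p @ (t_a / np.linalg.norm(t_a)) - p @ (t_n / np.linalg.norm(t_n))

rng = np.random.default_rng(2)
N, d, h = 16, 8, 16
patches = rng.normal(size=(N, d))
W1, b1 = 0.1 * rng.normal(size=(d, h)), np.zeros(h)
W2, b2 = 0.1 * rng.normal(size=(h, d)), np.zeros(d)
t_n, t_a = rng.normal(size=d), rng.normal(size=d)
contrast = cosine_contrast(self_align(patches, W1, b1, W2, b2), t_n, t_a)
```

Applying such a recalibration at each layer keeps the per-layer patch features on a comparable footing with the two global tokens, which is what allows a plain cosine contrast to serve as the anomaly signal.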
Empirical Superiority
The results across 13 datasets (including MVTec-AD and complex medical datasets like BrainMRI) show that VisualAD is not just a "lightweight alternative" but a superior performer.
- Versatility: It works seamlessly with both CLIP-ViT and DINOv2 backbones.
- Localization: The combination of SCA and multi-layer fusion allows it to produce high-resolution anomaly maps that outperform VLM-based competitors.

Ablation Insight: The Power of Multi-Layer Fusion
Analysis shows that mid-level layers (e.g., Layer 18 of ViT-L) are the "sweet spot" for anomaly detection, balancing local texture disruptions with global context. VisualAD leverages this by fusing information from multiple depths to achieve robust results.
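As a rough illustration of the fusion step (uniform averaging and nearest-neighbor upsampling; the paper's actual fusion and layer weighting may differ, and the layer choices below are only examples):

```python
import numpy as np

def fuse_layers(maps):
    """Fuse per-layer anomaly maps by simple averaging.

    maps: list of (H, W) anomaly maps taken at different ViT depths
    """
    return np.mean(np.stack(maps, axis=0), axis=0)

def upsample_nearest(m, scale):
    # Nearest-neighbor upsampling back to image resolution.
    return np.kron(m, np.ones((scale, scale)))

rng = np.random.default_rng(3)
# e.g. maps from three depths of a ViT-L, each on a 14x14 patch grid
layer_maps = [rng.uniform(size=(14, 14)) for _ in range(3)]
fused = fuse_layers(layer_maps)
full_res = upsample_nearest(fused, 16)   # 14 * 16 = 224 pixels per side
```

Averaging across depths damps layer-specific noise: a region only scores high in the fused map if several depths agree, which is consistent with the mid-layer "sweet spot" observation above.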
Deep Insight & Conclusion
VisualAD’s success implies that the "Semantic Gap" in anomaly detection is smaller than we thought. We don't necessarily need to tell a model what an "anomaly" is in English; we just need to provide the architectural capacity for the model to contrast "expected" vs "unexpected" visual manifolds.
Takeaway: For specialized CV tasks where the target is a structural deviation rather than a semantic category, moving away from multi-modal alignment towards purely visual adapters can lead to more efficient and reliable models.
Limitations: While powerful, the model still requires an auxiliary "source" dataset for training the tokens. True "zero-training" open-set detection remains a challenge for future work.
