The paper introduces "Mirage," a phenomenon where Multimodal Large Language Models (MLLMs) confidently describe and reason about non-existent images. By evaluating frontier models (GPT-5, Gemini-3, Claude-4.5) on a new benchmark, Phantom-0, the authors reveal that these systems often rely on internal linguistic priors rather than visual grounding, achieving high scores on multimodal benchmarks even when images are entirely removed.
TL;DR
Think your favorite multimodal AI "sees" the image you uploaded? Think again. A groundbreaking study from Stanford researchers reveals the Mirage Effect: frontier models like GPT-5 and Gemini-3 can generate flawless, detailed reasoning for images that were never even provided. By exploiting textual cues and benchmark leaks, these models achieve SOTA scores on "visual" benchmarks while being effectively blind.
The "Mirage" vs. The Hallucination
While the AI community has long obsessed over hallucinations (fabricating details within a frame the model actually received), the Mirage Effect is more insidious: the model constructs a false epistemic frame, simulating the entire process of seeing, describing, and diagnosing a non-existent input.
The authors found that when asked about a missing image, models don't say "I can't see anything." Instead, they confidently describe specific fracture types in X-rays or license plate numbers in street scenes, details that exist only in their internal weights.

The Problem: Benchmarks are Leaking
The core of the issue lies in the "Mirage Score": the ratio of accuracy without images to accuracy with images. In many cases, models retain 70-80% of their performance without ever processing a single pixel.
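To make the metric concrete, here is a minimal sketch of how a Mirage Score could be computed from two accuracy numbers; the function name and the example values are illustrative, not taken from the paper.

```python
def mirage_score(acc_without_images: float, acc_with_images: float) -> float:
    """Ratio of text-only accuracy to full multimodal accuracy.

    A value near 1.0 means the images contribute almost nothing;
    a value near 0.0 means performance collapses once they are removed.
    """
    if acc_with_images <= 0:
        raise ValueError("accuracy with images must be positive")
    return acc_without_images / acc_with_images


# A model that keeps 60% accuracy blind against 80% with images scores 0.75.
print(mirage_score(acc_without_images=0.60, acc_with_images=0.80))  # 0.75
```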
Why does this happen?
- Linguistic Shortcuts: Questions often contain subtle cues that allow a strong LLM to "deduce" the answer without the image.
- Benchmark Saturation: As benchmarks like MMMU become public, they are absorbed into the web-scale training data of the next generation of models.
- Pathology Bias: In medical contexts, models default to "finding" something wrong. Asked about a brain MRI that was never actually provided, models often hallucinate high-stakes diagnoses like glioblastoma.

The "Super-Guesser" Experiment
To prove that visual understanding is often irrelevant to current scores, the authors trained a small 3B-parameter text-only model (Qwen-2.5) on the ReXVQA dataset with all images removed.
The result? This text-only "Super-Guesser" outperformed frontier MLLMs and human radiologists on a held-out test set. It produced reasoning traces for chest X-rays that were indistinguishable from those of a human expert, despite never seeing a single rib or lung field.
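For intuition, here is a minimal sketch of what such a text-only fine-tune might look like, assuming a Hugging Face causal LM and a VQA split whose images have been stripped; the model checkpoint, field names, and hyperparameters are placeholders rather than the authors' exact recipe.

```python
# Sketch: fine-tune a small text-only LM on VQA pairs with all images removed.
# Checkpoint, data fields, and hyperparameters are illustrative placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Stand-in for the image-stripped training split (e.g. ReXVQA minus pixels).
train_examples = [
    {"question": "Which hemithorax shows the opacity?",
     "options": "A) Left  B) Right  C) Both  D) Neither",
     "answer": "B"},
]

def collate(batch):
    # Deliberately drop the image: the prompt is the question and options only.
    texts = [f"Question: {ex['question']}\nOptions: {ex['options']}\nAnswer: {ex['answer']}"
             for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Standard causal-LM labels; padded positions are ignored in the loss.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(train_examples, batch_size=8, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # next-token loss over question -> answer text
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The point of the sketch is the ablation itself: if a pipeline like this approaches the scores of multimodal systems, the benchmark is rewarding language priors rather than vision.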
B-Clean: A New Standard for Grounded Evaluation
Recognizing that manual detection of "leaked" questions is impossible, the authors proposed B-Clean. This framework evaluates models in a "modality-ablation" mode and removes any question that any model can solve without an image.
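The core filtering step is easy to sketch. In the snippet below, each panel model is assumed to be wrapped as a text-only callable that takes a question string and returns an answer string; these callables and the field names are hypothetical, not the paper's actual interface.

```python
# Sketch of the B-Clean filtering idea: drop every question that any model in a
# panel answers correctly without the image. The callables are placeholders for
# whatever text-only inference wrappers you use.
from typing import Callable, Dict, Iterable, List

def b_clean(questions: Iterable[Dict[str, str]],
            panel: List[Callable[[str], str]]) -> List[Dict[str, str]]:
    kept = []
    for q in questions:
        solvable_blind = any(model(q["question"]).strip() == q["answer"]
                             for model in panel)
        if not solvable_blind:
            kept.append(q)  # only image-dependent questions survive
    return kept

# Toy usage: a "model" that always guesses the statistically common option.
majority_guesser = lambda _question: "B"
questions = [
    {"question": "Which rib is fractured? A/B/C/D", "answer": "B"},   # leaks
    {"question": "What is written on the sign?", "answer": "STOP"},   # needs pixels
]
print(len(b_clean(questions, [majority_guesser])))  # 1
```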
When benchmarks are "B-Cleaned":
- The number of valid questions drops by ~75%.
- Accuracies crash, revealing the true "visual delta."
- Model rankings flip, showing that some "SOTA" models are simply better at text-based shortcutting than others.

Critical Insight & Future Outlook
This paper serves as a vital wake-up call: we have been measuring multimodal reasoning through a purely linguistic lens of answer accuracy. If a model can "hallucinate" its way to a correct diagnosis, the reasoning trace, the very thing we rely on to trust the model, becomes a liability.
For high-stakes deployments in medicine or autonomous systems, we must transition to:
- Private, Dynamic Benchmarks: Evaluation sets that cannot be memorized.
- Counterfactual Verification: Architectures that explicitly check whether their answer changes when the image is removed (a minimal wrapper is sketched below).
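As a rough illustration of the second point, a counterfactual check can be bolted onto an existing stack as a thin wrapper; `query_mllm` below is a stand-in for your own multimodal inference call, not an API from the paper.

```python
# Sketch of counterfactual verification: ask the same question with and without
# the image, then flag answers that do not depend on the pixels at all.
def verify_grounding(query_mllm, question, image):
    with_image = query_mllm(question=question, image=image)
    without_image = query_mllm(question=question, image=None)
    return {
        "answer": with_image,
        # If removing the image leaves the answer unchanged, the response was
        # likely driven by linguistic priors rather than visual evidence.
        "image_dependent": with_image != without_image,
    }
```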
The era of blindly trusting benchmark rankings is over. It's time to ensure our AI isn't just a very convincing mime, but a truly grounded observer of our world.
