The paper introduces DeepFact, which comprises DeepFact-Bench (a dynamic factuality benchmark for Deep Research Reports) and DeepFact-Eval (an agentic multi-step verifier). It targets the verification of complex, multi-hop scientific claims, where DeepFact-Eval reaches a state-of-the-art accuracy of 83.4%, well ahead of standard pipelines such as SAFE (+27.5 percentage points).
TL;DR
Deep research agents can now write PhD-level reports, but we can't reliably check if they are lying. The authors of DeepFact found that even PhD-level specialists are only ~60% accurate when verifying complex claims in isolation. To solve this, they introduced Audit-then-Score (AtS): a paradigm shift where benchmarks "evolve" by letting AI agents challenge human labels with evidence. Their new verifier, DeepFact-Eval, sets a new state of the art, beating traditional fact-checkers by more than 25 percentage points of accuracy.
The Problem: The "Expert Fallibility" Crisis
In the world of AI evaluation, we usually treat human "gold labels" as the absolute truth. However, deep research reports (DRRs) are different. Verifying a single technical claim can take an expert hours of cross-referencing papers.
The authors found a startling reality: PhD-level specialists achieved only 60.8% accuracy on a hidden "micro-gold" set of verifiable claims. Why?
- Cognitive Load: A single report contains hundreds of claims.
- Domain Drift: Being an expert in "Reinforcement Learning" doesn't make you an expert in "RAG-based agents."
- Static Fragility: Once a human marks a label, it’s stuck, even if a smarter AI later finds evidence proving the human was wrong.
Methodology: The "Audit-then-Score" (AtS) Evolution
Instead of a "once-and-done" annotation, DeepFact proposes an iterative loop where the benchmark and the agent co-evolve.
- Evaluate: A "Challenger" (the AI model) predicts a label.
- Challenge: If the AI disagrees with the benchmark, it must provide an auditable rationale.
- Audit: An Auditor (human or a stronger agent) compares the two rationales. If the model's evidence is better, the benchmark's "truth" is updated.
- Score: The model is finally scored against the newly refined truth.
Figure 1: The AtS workflow transforms benchmarking from a frozen snapshot into an ongoing scientific dialogue.
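The four steps above can be sketched as a single loop. This is a minimal illustration, not the authors' implementation: the `Verdict` type, the `Challenger` and `Auditor` signatures, and the toy data are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Verdict:
    """A label plus the rationale that supports it (hypothetical type)."""
    label: str       # e.g. "supported" or "unsupported"
    rationale: str   # auditable evidence behind the label


# Hypothetical interfaces, not taken from the paper:
# a challenger maps a claim to a Verdict; an auditor sees the claim, the
# benchmark's current Verdict and the challenger's Verdict, and returns
# True when the challenger's evidence is the stronger of the two.
Challenger = Callable[[str], Verdict]
Auditor = Callable[[str, Verdict, Verdict], bool]


def audit_then_score(
    benchmark: Dict[str, Verdict],
    challenger: Challenger,
    auditor: Auditor,
) -> Tuple[float, List[str]]:
    """Run one AtS round in place; return (accuracy, revised claims)."""
    predictions: Dict[str, Verdict] = {}
    revised: List[str] = []

    for claim, current in list(benchmark.items()):
        # Evaluate: the challenger predicts a label with a rationale.
        prediction = challenger(claim)
        predictions[claim] = prediction

        # Challenge: only disagreements go to the auditor.
        if prediction.label != current.label:
            # Audit: the benchmark entry is replaced only if the
            # auditor prefers the challenger's rationale.
            if auditor(claim, current, prediction):
                benchmark[claim] = prediction
                revised.append(claim)

    # Score: compare predictions against the refined labels.
    if not benchmark:
        return 0.0, revised
    correct = sum(
        predictions[claim].label == benchmark[claim].label
        for claim in benchmark
    )
    return correct / len(benchmark), revised


# Toy run with made-up claims and stub agents.
toy_benchmark = {
    "claim A": Verdict("supported", "annotator note"),
    "claim B": Verdict("supported", "annotator note"),
    "claim C": Verdict("unsupported", "annotator note"),
}
toy_accuracy, toy_revised = audit_then_score(
    toy_benchmark,
    challenger=lambda claim: Verdict("unsupported", f"evidence for {claim}"),
    auditor=lambda claim, current, challenge: claim == "claim A",
)
print(toy_accuracy, toy_revised)
```

In the toy run the challenger disputes claims A and B, the stub auditor accepts only the dispute on claim A, and the challenger is then scored against the updated labels. Because the auditor is just a callable, a human reviewer or a stronger agent can fill the same slot.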
The Engine: DeepFact-Eval
To drive this evolution, they built DeepFact-Eval, which moves beyond simple snippet matching. It uses a two-pronged strategy:
- Breadth: Formulates diverse search queries to cover the document space.
- Depth: Generates follow-up questions to extract fine-grained technical details often missed in LLM summaries.
Figure 2: Unlike traditional checkers (left) that use shallow snippets, DeepFact-Eval (right) performs iterative query planning and document-level reasoning.
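One way to picture the breadth/depth strategy is as a small retrieval loop: start from several diverse queries, then let each retrieved document trigger follow-up questions. The sketch below is an assumption-laden illustration rather than DeepFact-Eval's actual pipeline; `make_queries`, `make_follow_ups`, `search`, and the toy corpus are hypothetical stand-ins for an LLM and a search backend.

```python
from typing import Callable, Dict, List, Set

# Hypothetical interfaces (in a real system these would wrap an LLM
# and a search engine; here they are plain callables).
QueryGen = Callable[[str], List[str]]          # claim -> diverse queries
FollowUpGen = Callable[[str, str], List[str]]  # (claim, document) -> follow-ups
SearchFn = Callable[[str], List[str]]          # query -> documents


def gather_evidence(
    claim: str,
    make_queries: QueryGen,
    make_follow_ups: FollowUpGen,
    search: SearchFn,
    max_depth: int = 1,
) -> List[str]:
    """Collect unique documents using breadth first, then depth."""
    seen: Set[str] = set()
    evidence: List[str] = []

    # Breadth: several differently phrased queries for the same claim.
    frontier = make_queries(claim)

    for depth in range(max_depth + 1):
        next_frontier: List[str] = []
        for query in frontier:
            for document in search(query):
                if document in seen:
                    continue  # skip documents already collected
                seen.add(document)
                evidence.append(document)
                # Depth: ask follow-up questions about each new document
                # until the depth budget is used up.
                if depth < max_depth:
                    next_frontier.extend(make_follow_ups(claim, document))
        frontier = next_frontier

    return evidence


# Toy corpus with made-up queries and documents.
toy_corpus: Dict[str, List[str]] = {
    "method X accuracy": ["Overview: method X is evaluated on benchmark Y."],
    "method X benchmark": [
        "Overview: method X is evaluated on benchmark Y.",
        "Survey of benchmark Y.",
    ],
    "benchmark Y table": ["Table: per-task results for method X."],
}

toy_evidence = gather_evidence(
    claim="Method X is accurate on benchmark Y.",
    make_queries=lambda claim: ["method X accuracy", "method X benchmark"],
    make_follow_ups=lambda claim, doc: (
        ["benchmark Y table"] if doc.startswith("Overview") else []
    ),
    search=lambda query: toy_corpus.get(query, []),
)
for document in toy_evidence:
    print(document)
```

The per-task table is reachable only through a follow-up question, which mirrors the point that fine-grained details tend to be missed by a single round of snippet retrieval.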
Experimental Results
The results suggest that humans are better auditors than labelers.
- Auditing Gain: Expert accuracy rose from 60.8% to 90.9% when experts were asked to audit AI-generated rationales instead of labeling from scratch.
- Performance Gap: On DeepFact-Bench, DeepFact-Eval achieved 83.4% accuracy, far ahead of standard tools like SAFE (55.9%) and FactCheck-GPT (55.0%).
Table 1: DeepFact-Eval leads across all metrics, showing that agentic deep-research scaffolding is necessary for technical verification.
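Because the gaps quoted in this post are differences between accuracies, they are percentage points rather than relative improvements. The short calculation below uses only the figures reported above; the helper names are made up for the example.

```python
# Accuracies (in %) as reported in this post.
accuracy = {"DeepFact-Eval": 83.4, "SAFE": 55.9, "FactCheck-GPT": 55.0}
expert_labeling, expert_auditing = 60.8, 90.9


def point_gap(new: float, old: float) -> float:
    """Absolute difference in percentage points."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Improvement relative to the old value, in percent."""
    return round(100 * (new - old) / old, 1)


gap_vs_safe = point_gap(accuracy["DeepFact-Eval"], accuracy["SAFE"])
gap_vs_fcgpt = point_gap(accuracy["DeepFact-Eval"], accuracy["FactCheck-GPT"])
audit_gap = point_gap(expert_auditing, expert_labeling)

print(gap_vs_safe)   # 27.5 points over SAFE
print(gap_vs_fcgpt)  # 28.4 points over FactCheck-GPT
print(audit_gap)     # 30.1 points gained by auditing instead of labeling
print(relative_gain(accuracy["DeepFact-Eval"], accuracy["SAFE"]))
```

The last line prints the relative improvement over SAFE, which is a much larger number than the 27.5-point gap, so it is worth stating which of the two is meant.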
Critical Insight: Agents as Proxies
One of the most striking findings is that agents can act as auditors for other agents. When a GPT-5-based auditor adjudicated disputes between weaker models, the resulting benchmark accuracy improved without any human intervention. This suggests a future where AI systems can autonomously refine the "ground truth" of scientific knowledge.
Conclusion & Future Outlook
DeepFact makes the case that as AI approaches super-human performance in niche fields, our evaluation methods must become argument-centric rather than label-centric.
Limitations: The current system acts as a "literature reviewer." It can only verify what is already written. The next frontier? "The AI Scientist" that bridges the gap between literature-based verification and active laboratory experimentation.
Takeaway for the Industry: For high-stakes deployments (Medical, Legal, Engineering), stop trusting static human-labeled datasets. Move toward versioned, auditable benchmarks that evolve every time a model presents a better argument.
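As a rough idea of what "versioned, auditable" could mean in practice, each claim can carry an append-only history of label revisions instead of a single overwritten label. The record layout below is purely illustrative: `Revision`, `ClaimRecord`, and their fields are assumptions, not a format defined by the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One immutable entry in a claim's label history (hypothetical)."""
    version: int
    label: str
    rationale: str
    source: str  # who proposed it, e.g. "human-annotator" or a model name


@dataclass
class ClaimRecord:
    """A claim with an append-only list of label revisions."""
    claim: str
    history: List[Revision] = field(default_factory=list)

    def revise(self, label: str, rationale: str, source: str) -> Revision:
        """Append a new revision; earlier revisions are never modified."""
        revision = Revision(len(self.history) + 1, label, rationale, source)
        self.history.append(revision)
        return revision

    def label_at(self, version: Optional[int] = None) -> str:
        """Return the label at a given version (latest if omitted)."""
        if not self.history:
            raise ValueError("claim has no label yet")
        if version is None:
            return self.history[-1].label
        if not 1 <= version <= len(self.history):
            raise ValueError(f"unknown version {version}")
        return self.history[version - 1].label


# Toy example with a made-up claim and rationales.
record = ClaimRecord("Method X is accurate on benchmark Y.")
record.revise("supported", "initial annotation", "human-annotator")
record.revise("unsupported", "cited table contradicts the claim", "model-x")
print(record.label_at(1), "->", record.label_at())
```

Keeping every revision means a score can always be reported against a specific benchmark version, and any label change can be traced back to the rationale that justified it.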
