DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

[AtS Protocol] DeepFact: Why Static Benchmarks Fail at the Frontiers of AI Research

Abstract

The paper introduces DeepFact, which comprises DeepFact-Bench (a dynamic factuality benchmark for deep research reports) and DeepFact-Eval (an agentic multi-step verifier). It targets the verification of complex, multi-hop scientific claims. On this task DeepFact-Eval reaches a state-of-the-art accuracy of 83.4%, 27.5 percentage points above standard pipelines such as SAFE.

TL;DR

Deep research agents can now write PhD-level reports, but we can't reliably check if they are lying. The authors of DeepFact discovered that even PhD students are only ~60% accurate when verifying complex claims in isolation. To solve this, they introduced Audit-then-Score (AtS): a paradigm shift where benchmarks "evolve" by letting AI agents challenge human labels with evidence. Their new verifier, DeepFact-Eval, sets a new state of the art, beating traditional fact-checkers by more than 25 percentage points of accuracy.

The Problem: The "Expert Fallibility" Crisis

In the world of AI evaluation, we usually treat human "gold labels" as the absolute truth. However, deep research reports (DRRs) are different. Verifying a single technical claim can take an expert hours of cross-referencing papers.

The authors found a startling reality: PhD-level specialists achieved only 60.8% accuracy on a hidden "micro-gold" set of verifiable claims. Why?

Cognitive Load: A single report contains hundreds of claims.
Domain Drift: Being an expert in "Reinforcement Learning" doesn't make you an expert in "RAG-based agents."
Static Fragility: Once a human marks a label, it’s stuck, even if a smarter AI later finds evidence proving the human was wrong.

Methodology: The "Audit-then-Score" (AtS) Evolution

Instead of a "once-and-done" annotation, DeepFact proposes an iterative loop where the benchmark and the agent co-evolve.

Evaluate: A "Challenger" (the AI model) predicts a label.
Challenge: If the AI disagrees with the benchmark, it must provide an auditable rationale.
Audit: An Auditor (human or a stronger agent) compares the two rationales. If the model's evidence is better, the benchmark's "truth" is updated.
Score: The model is finally scored against the newly refined truth.

Figure 1: The AtS workflow transforms benchmarking from a frozen snapshot into an ongoing scientific dialogue.
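The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Verdict`, `BenchmarkItem`, and `audit_then_score`, as well as the toy challenger and auditor, are hypothetical and chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    label: bool      # True = claim is supported, False = not supported
    rationale: str   # the auditable explanation behind the label


@dataclass
class BenchmarkItem:
    claim: str
    current: Verdict                                        # the benchmark's current "truth"
    history: List[Verdict] = field(default_factory=list)    # superseded verdicts


# Hypothetical interfaces: a challenger labels a claim; an auditor returns
# True when the challenger's rationale beats the benchmark's.
Challenger = Callable[[str], Verdict]
Auditor = Callable[[str, Verdict, Verdict], bool]


def audit_then_score(items: List[BenchmarkItem],
                     challenger: Challenger,
                     auditor: Auditor) -> float:
    if not items:
        return 0.0
    predictions = []
    for item in items:
        pred = challenger(item.claim)                 # 1. Evaluate
        predictions.append(pred)
        if pred.label != item.current.label:          # 2. Challenge
            if auditor(item.claim, item.current, pred):   # 3. Audit
                item.history.append(item.current)     # keep the old verdict
                item.current = pred                   # update the "truth"
    # 4. Score against the refined labels
    correct = sum(p.label == it.current.label
                  for p, it in zip(predictions, items))
    return correct / len(items)


# Toy data (invented for illustration only).
items = [
    BenchmarkItem("A", Verdict(True, "annotator judgment")),
    BenchmarkItem("B", Verdict(True, "annotator judgment")),
    BenchmarkItem("C", Verdict(False, "annotator judgment")),
]
model_answers = {
    "A": Verdict(False, "evidence: cited passage contradicts the claim"),
    "B": Verdict(True, "evidence: cited passage supports the claim"),
    "C": Verdict(True, "no source found"),
}


def toy_challenger(claim: str) -> Verdict:
    return model_answers[claim]


def toy_auditor(claim: str, benchmark: Verdict, challenge: Verdict) -> bool:
    # Assumption: a challenge wins only if it actually cites evidence.
    return challenge.rationale.startswith("evidence:")


score = audit_then_score(items, toy_challenger, toy_auditor)
```

In this toy run the challenge on claim A succeeds and flips the benchmark label, while the unsupported challenge on claim C is rejected, so the model is scored as correct on A and B only.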

The Engine: DeepFact-Eval

To drive this evolution, they built DeepFact-Eval, which moves beyond simple snippet matching. It uses a two-pronged strategy:

Breadth: Formulates diverse search queries to cover the document space.
Depth: Generates follow-up questions to extract fine-grained technical details often missed in LLM summaries.

Figure 2: Unlike traditional checkers (left) that use shallow snippets, DeepFact-Eval (right) performs iterative query planning and document-level reasoning.
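The breadth and depth strategy can be pictured as a two-phase retrieval loop. The sketch below is a guess at the general shape only: `gather_evidence`, `propose_queries`, `search`, and `follow_ups` are hypothetical names, and in a real system the last three would be backed by an LLM and a search engine rather than the toy lookups used here.

```python
from typing import Callable, Dict, List


def gather_evidence(claim: str,
                    propose_queries: Callable[[str], List[str]],
                    search: Callable[[str], List[str]],
                    follow_ups: Callable[[str, str], List[str]],
                    max_depth: int = 2) -> List[str]:
    """Collect documents for a claim: broad queries first, then follow-ups."""
    evidence: List[str] = []
    seen_queries = set()
    frontier = propose_queries(claim)          # breadth: diverse initial queries
    for _ in range(max_depth + 1):             # round 0 is breadth, later rounds are depth
        next_frontier: List[str] = []
        for query in frontier:
            if query in seen_queries:
                continue
            seen_queries.add(query)
            for doc in search(query):
                if doc in evidence:
                    continue                   # already read this document
                evidence.append(doc)
                # depth: ask follow-up questions about each new document
                next_frontier.extend(follow_ups(claim, doc))
        frontier = next_frontier
    return evidence


# Toy corpus (invented for illustration only).
CORPUS: Dict[str, List[str]] = {
    "method X accuracy": ["doc1: abstract"],
    "method X benchmark": ["doc1: abstract", "doc2: leaderboard"],
    "doc1 table 3": ["doc3: table 3 details"],
}


def toy_propose(claim: str) -> List[str]:
    return ["method X accuracy", "method X benchmark"]


def toy_search(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_follow_ups(claim: str, doc: str) -> List[str]:
    return ["doc1 table 3"] if doc.startswith("doc1") else []


shallow = gather_evidence("X reaches 90%", toy_propose, toy_search,
                          toy_follow_ups, max_depth=0)
deep = gather_evidence("X reaches 90%", toy_propose, toy_search,
                       toy_follow_ups, max_depth=2)
```

With `max_depth=0` only the two broadly retrieved documents come back; allowing follow-ups also surfaces the fine-grained "table 3" document that no initial query reached.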

Experimental Results

The results suggest that humans are better auditors than labelers.

Expert accuracy soared from 60.8% to 90.9% when they were asked to audit AI-generated rationales instead of starting from scratch.
Performance Gap: On DeepFact-Bench, DeepFact-Eval achieved 83.4% accuracy, far ahead of standard tools like SAFE (55.9%) and FactCheck-GPT (55.0%).

Table 1: DeepFact-Eval leads across all metrics, showing that agentic deep-research scaffolding is necessary for technical verification.
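For readers who want to check the gaps quoted in this post, the arithmetic is reproduced below using only the accuracy figures stated above. The variable names are ours, and the differences are in percentage points.

```python
# Accuracy figures (percent) as quoted in this post.
accuracy = {"DeepFact-Eval": 83.4, "SAFE": 55.9, "FactCheck-GPT": 55.0}

# Gap between DeepFact-Eval and each baseline, in percentage points.
gaps = {
    name: round(accuracy["DeepFact-Eval"] - acc, 1)
    for name, acc in accuracy.items()
    if name != "DeepFact-Eval"
}

# Expert accuracy when labeling from scratch vs. auditing AI rationales.
expert_gain = round(90.9 - 60.8, 1)

print(gaps)         # gap over SAFE and over FactCheck-GPT
print(expert_gain)  # gain from switching experts to an auditing role
```

The gap over SAFE matches the +27.5 quoted in the abstract, and both baseline gaps are consistent with the "over 25" figure in the TL;DR.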

Critical Insight: Agents as Proxies

One of the most striking findings is that agents can act as auditors for other agents. When a GPT-5-based auditor adjudicated disputes between weaker models, the resulting benchmark accuracy improved without any human intervention. This suggests a future where AI systems can autonomously refine the "ground truth" of scientific knowledge.

Conclusion & Future Outlook

DeepFact makes the case that, as AI approaches super-human performance in niche fields, our evaluation methods must become argument-centric rather than label-centric.

Limitations: The current system acts as a "literature reviewer." It can only verify what is already written. The next frontier? "The AI Scientist" that bridges the gap between literature-based verification and active laboratory experimentation.

Takeaway for the Industry: For high-stakes deployments (Medical, Legal, Engineering), stop trusting static human-labeled datasets. Move toward versioned, auditable benchmarks that evolve every time a model presents a better argument.
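One way to picture a "versioned, auditable" benchmark entry is an append-only revision log, so that any reported score can name the label version it was computed against. The sketch below is our own illustration; `Revision` and `VersionedLabel` are hypothetical names and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Revision:
    version: int     # 1-based, increases with every accepted change
    label: bool      # True = claim is supported
    rationale: str   # why this label was chosen
    auditor: str     # who approved it (human or agent)


class VersionedLabel:
    """Append-only label history for a single claim."""

    def __init__(self, claim: str, label: bool, rationale: str, auditor: str):
        self.claim = claim
        self._revisions: List[Revision] = [Revision(1, label, rationale, auditor)]

    def revise(self, label: bool, rationale: str, auditor: str) -> Revision:
        # Never overwrite: add a new revision on top of the old ones.
        rev = Revision(len(self._revisions) + 1, label, rationale, auditor)
        self._revisions.append(rev)
        return rev

    def at(self, version: int) -> Revision:
        # Look up the label as it stood at a given benchmark version.
        if not 1 <= version <= len(self._revisions):
            raise ValueError(f"unknown version {version}")
        return self._revisions[version - 1]

    @property
    def latest(self) -> Revision:
        return self._revisions[-1]


# Toy usage (invented data).
entry = VersionedLabel("Claim A", True, "initial annotation", "annotator-1")
entry.revise(False, "challenger cited contradicting evidence", "auditor-agent")
```

Keeping old revisions rather than overwriting them means a score computed against version 1 stays reproducible after the label changes in version 2.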
