[Pre-print 2026] TRQA: Why Your "Deep Research" Agent is Failing the Total Recall Test
Abstract

TRQA is a novel evaluation suite designed to assess Deep Research Agents (DRAs) on "Total Recall" tasks, where answering a single query requires retrieving every relevant document from a large corpus. It leverages an entity-centric framework across datasets built from a Wikidata-Wikipedia alignment and a synthetic e-commerce corpus to provide verifiable, reproducible benchmarks for multi-step reasoning and retrieval.

TL;DR

Deep Research Agents (DRAs) are the latest frontier in AI, promoted as autonomous investigators. However, the new TRQA (Total Recall Question Answering) benchmark reveals a sobering reality: when a question requires finding all relevant evidence rather than just a few snippets, even the most advanced agents (Search-R1, GPT-5.2) fail. TRQA introduces a verifiable, contamination-resistant framework that separates memorization from genuine research ability.

The "Needle in a Haystack" is Not Enough

Standard RAG (Retrieval-Augmented Generation) benchmarks usually ask for a specific fact. If an agent finds one document containing that fact, it wins. But real-world research—like market analysis or scientific synthesis—requires Total Recall.

Consider the query: "What is the total revenue of all e-commerce companies founded in 2010 that use AI for logistics?" If your agent finds 9 out of 10 relevant companies, the final sum is wrong. Current benchmarks fail to test this because:

  1. LLM-as-a-Judge is flaky: Subjective grading of "research reports" isn't reproducible.
  2. Data Contamination: LLMs "know" the answers to Wikipedia questions from training, masking their inability to actually search.
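The recall-sensitivity of aggregation queries is easy to demonstrate with a toy example (all company names and revenue figures below are invented for illustration):

```python
# Toy illustration: an aggregate answer is only correct at 100% recall.
# Names and figures are made up; revenue in $M.
relevant = {
    "AcmeLogistics": 120, "ShipFast": 95, "CartAI": 80, "DepotX": 60,
    "PackBot": 55, "FreightIQ": 50, "StoreLink": 45, "RouteGen": 40,
    "BoxWise": 35, "LastMile": 30,
}  # the 10 companies that actually match the query

gold_answer = sum(relevant.values())        # 610

# An agent that retrieves 9 of the 10 companies (90% recall) ...
retrieved = dict(list(relevant.items())[:9])
agent_answer = sum(retrieved.values())      # 580 -- the final sum is wrong

recall = len(retrieved) / len(relevant)     # 0.9
```

Even a single missed entity silently corrupts the aggregate, which is exactly what snippet-oriented benchmarks never measure.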

Methodology: The TRQA Framework

The researchers built TRQA by anchoring unstructured text to structured Knowledge Bases (KB).

[Figure: TRQA data generation framework]

The process is mathematically rigorous:

  • Entity Filtering: Use the KB to find a precise set of entities (e.g., all US states with population > 5M).
  • Aggregation: Apply an operator (SUM, AVG, COUNT) to a specific property of those entities.
  • Verifiability: The result is a single number. No subjective grading required—either the agent got the math right, or it didn't.
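The filter-then-aggregate recipe above can be sketched over a toy in-memory knowledge base. The schema and entities here are illustrative stand-ins, not the paper's actual Wikidata pipeline:

```python
# Minimal sketch of TRQA-style question generation over a toy KB.
# All entities and values are illustrative.
KB = [
    {"entity": "StateA", "type": "us_state", "population": 8_000_000, "area": 100},
    {"entity": "StateB", "type": "us_state", "population": 6_000_000, "area": 200},
    {"entity": "StateC", "type": "us_state", "population": 3_000_000, "area": 150},
]

def entity_filter(kb, type_, pred):
    """Step 1: select the exact set of entities matching the condition."""
    return [e for e in kb if e["type"] == type_ and pred(e)]

AGG = {"SUM": sum, "COUNT": len, "AVG": lambda xs: sum(xs) / len(xs)}

def aggregate(entities, prop, op):
    """Step 2: apply SUM/AVG/COUNT to one property of the filtered set."""
    return AGG[op]([e[prop] for e in entities])

# "What is the total area of US states with population > 5M?"
selected = entity_filter(KB, "us_state", lambda e: e["population"] > 5_000_000)
gold = aggregate(selected, "area", "SUM")   # a single verifiable number: 300
```

Because the gold answer is derived mechanically from the KB, grading an agent reduces to an exact-match check on one number.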

To ensure generalizability, they created TRQA-Ecommerce, a completely synthetic world of 476,842 products. Since the LLM has never seen this data during pre-training, it cannot rely on its "gut feeling" or parametric memory.
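The contamination-resistance argument rests on generating the corpus from a seeded process, so gold answers can be recomputed from the generator itself. A minimal sketch of that idea (this is not the paper's actual TRQA-Ecommerce pipeline, just the principle):

```python
import random

# Sketch: a deterministic, seeded generator yields a catalog no pretrained
# model could have memorized. Fields and ranges are illustrative.
def make_catalog(n, seed=0):
    rng = random.Random(seed)
    return [
        {
            "product_id": f"P{i:06d}",
            "category": rng.choice(["gadgets", "apparel", "home"]),
            "price": round(rng.uniform(5, 500), 2),
        }
        for i in range(n)
    ]

catalog = make_catalog(1000)

# Gold answers come straight from the generated data, so a correct answer
# can only come from actually searching the corpus.
gold_count = sum(1 for p in catalog if p["category"] == "gadgets")
```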

The Brutal Reality of Benchmark Results

The findings are a wake-up call for the industry. When agents were tested on the synthetic TRQA-Ecommerce dataset, their performance cratered.

[Table: performance comparison across agents and datasets]

Key Insights:

  1. The Retrieval Trap: Most agents issue 2-3 sub-queries and stop, regardless of how many documents are actually needed. They retrieve more "distracting" (irrelevant) entities the longer they search.
  2. Reasoning over Recall: Even in the Oracle setting (where the researchers handed the agent all the correct documents), models like Claude 4.5 and GPT-5.2 still failed ~90% of the time due to "reasoning errors." They couldn't accurately synthesize information across a large context window.
  3. Contamination is Real: Performance on Wikipedia-based tasks was much higher because models already "knew" the answers. The synthetic dataset proved that agents are much worse at "researching" than they appear.
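The "retrieval trap" in insight 1 is the standard recall/precision trade-off over entity sets: more sub-queries can raise recall while precision falls as distractors accumulate. A quick sketch with the usual definitions (entity names are placeholders):

```python
def set_metrics(retrieved, gold):
    """Standard recall and precision over sets of entity IDs."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(retrieved) if retrieved else 0.0
    return recall, precision

gold = {"A", "B", "C", "D"}

# Stopping after a couple of sub-queries: high precision, low recall.
r1, p1 = set_metrics({"A", "B"}, gold)                      # (0.5, 1.0)

# Searching longer: recall rises, but distractors X, Y, Z drag precision down.
r2, p2 = set_metrics({"A", "B", "C", "X", "Y", "Z"}, gold)  # (0.75, 0.5)
```

For Total Recall tasks only 100% recall yields a correct aggregate, so the trade-off cannot be dodged by stopping early.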

Deep Dive: Where Do Agents Go Wrong?

The paper categorizes failures into two buckets:

  • Parametric Bias: The model ignores the search results and chooses an answer based on its training data.
  • Reasoning Failures: The model reads the documents but fails the math or the logic.

Interestingly, 95.9% of Claude's errors were pure reasoning failures. This suggests that expanding the context window (Long Context) is not a silver bullet; models need better "working memory" and symbolic reasoning to aggregate data.

Conclusion and Future Outlook

TRQA provides a much-needed "stress test" for the agentic era. It moves evaluation away from "looking smart" to "being accurate."

Takeaway for Developers: If you are building a research agent, don't just optimize for better embeddings. TRQA shows that the real bottlenecks are iteratively refining search queries to reach 100% recall and robustly aggregating large volumes of numerical data without losing track in the context window.
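One way to read that takeaway is as a loop-until-fixpoint retrieval strategy: keep issuing follow-up queries derived from new evidence until a round adds nothing, rather than stopping after two or three sub-queries. A hedged sketch, where `search` and `expand` are hypothetical stand-ins for any retrieval backend and query-expansion heuristic (not the paper's code):

```python
# Sketch: iterative retrieval that stops only when the evidence set saturates.
def search(query, corpus):
    """Stand-in retriever: naive substring match over a toy corpus."""
    return {doc for doc in corpus if query.lower() in doc.lower()}

def iterative_retrieve(seed_queries, corpus, expand, max_rounds=10):
    found, queries = set(), list(seed_queries)
    for _ in range(max_rounds):
        new = set()
        for q in queries:
            new |= search(q, corpus) - found
        if not new:            # fixpoint: a full round added nothing new
            break
        found |= new
        queries = expand(new)  # derive follow-up queries from new evidence
    return found

corpus = [
    "AcmeCo founded 2010, partner of BetaCo",
    "BetaCo founded 2010, partner of GammaCo",
    "GammaCo founded 2010",
]

def expand(docs):
    """Hypothetical expansion: follow 'partner of X' mentions as new queries."""
    return [d.split("partner of ")[1].strip(", .")
            for d in docs if "partner of " in d]

docs = iterative_retrieve(["AcmeCo"], corpus, expand)  # finds all 3 documents
```

A single seed query here retrieves only one document; the fixpoint loop chases the chain of mentions until the whole relevant set is recovered.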

The code and datasets are open-sourced, providing a new North Star for the next generation of Search-R1 and similar reasoning models.
