[Pre-print 2026] TRQA: Why Your "Deep Research" Agent is Failing the Total Recall Test
Abstract

TRQA is a novel evaluation suite designed to assess Deep Research Agents (DRAs) on "Total Recall" tasks, where answering a single query requires retrieving every relevant document from a large corpus. It leverages an entity-centric framework across datasets built from a Wikidata-Wikipedia alignment and a synthetic e-commerce corpus to provide verifiable, reproducible benchmarks for multi-step reasoning and retrieval.

TL;DR

Deep Research Agents (DRAs) are the latest frontier in AI, promoted as autonomous investigators. However, the new TRQA (Total Recall Question Answering) benchmark reveals a sobering reality: when a question requires finding all relevant evidence rather than just a few snippets, even the most advanced agents (Search-R1, GPT-5.2) fail. TRQA introduces a verifiable, contamination-resistant framework that separates memorization from genuine research ability.

The "Needle in a Haystack" is Not Enough

Standard RAG (Retrieval-Augmented Generation) benchmarks usually ask for a specific fact. If an agent finds one document containing that fact, it wins. But real-world research—like market analysis or scientific synthesis—requires Total Recall.

Consider the query: "What is the total revenue of all e-commerce companies founded in 2010 that use AI for logistics?" If your agent finds 9 out of 10 relevant companies, the final sum is wrong. Current benchmarks fail to test this because:

  1. LLM-as-a-Judge is flaky: Subjective grading of "research reports" isn't reproducible.
  2. Data Contamination: LLMs "know" the answers to Wikipedia questions from training, masking their inability to actually search.
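The recall-sensitivity of aggregation queries is easy to demonstrate with a toy example (all company names and revenue figures below are invented for illustration):

```python
# Toy illustration: an aggregate answer is only correct at 100% recall.
# Names and figures are made up; revenue in $M.
relevant = {
    "AcmeLogistics": 120, "ShipFast": 95, "CartAI": 80, "DepotX": 60,
    "PackBot": 55, "FreightIQ": 50, "StoreLink": 45, "RouteGen": 40,
    "BoxWise": 35, "LastMile": 30,
}  # the 10 companies that actually match the query

gold_answer = sum(relevant.values())        # 610

# An agent that retrieves 9 of the 10 companies (90% recall) ...
retrieved = dict(list(relevant.items())[:9])
agent_answer = sum(retrieved.values())      # 580 -- the final sum is wrong

recall = len(retrieved) / len(relevant)     # 0.9
```

Even a single missed entity silently corrupts the aggregate, which is exactly what snippet-oriented benchmarks never measure.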

Methodology: The TRQA Framework

The researchers built TRQA by anchoring unstructured text to structured Knowledge Bases (KB).

[Figure: TRQA data generation framework]

The process is mathematically rigorous:

  • Entity Filtering: Use the KB to find a precise set of entities (e.g., all US states with population > 5M).
  • Aggregation: Apply an operator (SUM, AVG, COUNT) to a specific property of those entities.
  • Verifiability: The result is a single number. No subjective grading required—either the agent got the math right, or it didn't.
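The filter-then-aggregate recipe above can be sketched over a toy in-memory knowledge base. The schema and entities here are illustrative stand-ins, not the paper's actual Wikidata pipeline:

```python
# Minimal sketch of TRQA-style question generation over a toy KB.
# All entities and values are illustrative.
KB = [
    {"entity": "StateA", "type": "us_state", "population": 8_000_000, "area": 100},
    {"entity": "StateB", "type": "us_state", "population": 6_000_000, "area": 200},
    {"entity": "StateC", "type": "us_state", "population": 3_000_000, "area": 150},
]

def entity_filter(kb, type_, pred):
    """Step 1: select the exact set of entities matching the condition."""
    return [e for e in kb if e["type"] == type_ and pred(e)]

AGG = {"SUM": sum, "COUNT": len, "AVG": lambda xs: sum(xs) / len(xs)}

def aggregate(entities, prop, op):
    """Step 2: apply SUM/AVG/COUNT to one property of the filtered set."""
    return AGG[op]([e[prop] for e in entities])

# "What is the total area of US states with population > 5M?"
selected = entity_filter(KB, "us_state", lambda e: e["population"] > 5_000_000)
gold = aggregate(selected, "area", "SUM")   # a single verifiable number: 300
```

Because the gold answer is derived mechanically from the KB, grading an agent reduces to an exact-match check on one number.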

To ensure generalizability, they created TRQA-Ecommerce, a completely synthetic world of 476,842 products. Since the LLM has never seen this data during pre-training, it cannot rely on its "gut feeling" or parametric memory.
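The contamination-resistance argument rests on generating the corpus from a seeded process, so gold answers can be recomputed from the generator itself. A minimal sketch of that idea (this is not the paper's actual TRQA-Ecommerce pipeline, just the principle):

```python
import random

# Sketch: a deterministic, seeded generator yields a catalog no pretrained
# model could have memorized. Fields and ranges are illustrative.
def make_catalog(n, seed=0):
    rng = random.Random(seed)
    return [
        {
            "product_id": f"P{i:06d}",
            "category": rng.choice(["gadgets", "apparel", "home"]),
            "price": round(rng.uniform(5, 500), 2),
        }
        for i in range(n)
    ]

catalog = make_catalog(1000)

# Gold answers come straight from the generated data, so a correct answer
# can only come from actually searching the corpus.
gold_count = sum(1 for p in catalog if p["category"] == "gadgets")
```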

The Brutal Reality of Benchmark Results

The findings are a wake-up call for the industry. When agents were tested on the synthetic TRQA-Ecommerce dataset, their performance cratered.

[Table: performance comparison across agents and datasets]

Key Insights:

  1. The Retrieval Trap: Most agents issue 2-3 sub-queries and stop, regardless of how many documents are actually needed. They retrieve more "distracting" (irrelevant) entities the longer they search.
  2. Reasoning over Recall: Even in the Oracle setting (where the researchers handed the agent all the correct documents), models like Claude 4.5 and GPT-5.2 still failed ~90% of the time due to "reasoning errors." They couldn't accurately synthesize information across a large context window.
  3. Contamination is Real: Performance on Wikipedia-based tasks was much higher because models already "knew" the answers. The synthetic dataset proved that agents are much worse at "researching" than they appear.
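The "retrieval trap" in insight 1 is the standard recall/precision trade-off over entity sets: more sub-queries can raise recall while precision falls as distractors accumulate. A quick sketch with the usual definitions (entity names are placeholders):

```python
def set_metrics(retrieved, gold):
    """Standard recall and precision over sets of entity IDs."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)
    recall = tp / len(gold) if gold else 0.0
    precision = tp / len(retrieved) if retrieved else 0.0
    return recall, precision

gold = {"A", "B", "C", "D"}

# Stopping after a couple of sub-queries: high precision, low recall.
r1, p1 = set_metrics({"A", "B"}, gold)                      # (0.5, 1.0)

# Searching longer: recall rises, but distractors X, Y, Z drag precision down.
r2, p2 = set_metrics({"A", "B", "C", "X", "Y", "Z"}, gold)  # (0.75, 0.5)
```

For Total Recall tasks only 100% recall yields a correct aggregate, so the trade-off cannot be dodged by stopping early.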

Deep Dive: Where Do Agents Go Wrong?

The paper categorizes failures into two buckets:

  • Parametric Bias: The model ignores the search results and chooses an answer based on its training data.
  • Reasoning Failures: The model reads the documents but fails the math or the logic.

Interestingly, 95.9% of Claude's errors were pure reasoning failures. This suggests that expanding the context window (Long Context) is not a silver bullet; models need better "working memory" and symbolic reasoning to aggregate data.

Conclusion and Future Outlook

TRQA provides a much-needed "stress test" for the agentic era. It moves evaluation away from "looking smart" to "being accurate."

Takeaway for Developers: If you are building a research agent, don't just optimize for better embeddings. TRQA shows that the real bottlenecks are iteratively refining search queries to reach 100% recall and robustly aggregating large volumes of numerical data without losing track in the context window.
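One way to read that takeaway is as a loop-until-fixpoint retrieval strategy: keep issuing follow-up queries derived from new evidence until a round adds nothing, rather than stopping after two or three sub-queries. A hedged sketch, where `search` and `expand` are hypothetical stand-ins for any retrieval backend and query-expansion heuristic (not the paper's code):

```python
# Sketch: iterative retrieval that stops only when the evidence set saturates.
def search(query, corpus):
    """Stand-in retriever: naive substring match over a toy corpus."""
    return {doc for doc in corpus if query.lower() in doc.lower()}

def iterative_retrieve(seed_queries, corpus, expand, max_rounds=10):
    found, queries = set(), list(seed_queries)
    for _ in range(max_rounds):
        new = set()
        for q in queries:
            new |= search(q, corpus) - found
        if not new:            # fixpoint: a full round added nothing new
            break
        found |= new
        queries = expand(new)  # derive follow-up queries from new evidence
    return found

corpus = [
    "AcmeCo founded 2010, partner of BetaCo",
    "BetaCo founded 2010, partner of GammaCo",
    "GammaCo founded 2010",
]

def expand(docs):
    """Hypothetical expansion: follow 'partner of X' mentions as new queries."""
    return [d.split("partner of ")[1].strip(", .")
            for d in docs if "partner of " in d]

docs = iterative_retrieve(["AcmeCo"], corpus, expand)  # finds all 3 documents
```

A single seed query here retrieves only one document; the fixpoint loop chases the chain of mentions until the whole relevant set is recovered.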

The code and datasets are open-sourced, providing a new North Star for the next generation of Search-R1 and similar reasoning models.
