DCI: Forgetting the Index—Why Agents Should Search the Raw Corpus Like Developers
Abstract

The paper introduces Direct Corpus Interaction (DCI), a novel retrieval paradigm for agentic search where agents interact with raw corpora using general-purpose terminal tools (e.g., grep, bash) instead of fixed top-k similarity interfaces. This method achieves significant performance gains, including an 11.0-point accuracy gain on BrowseComp-Plus and a 30.7-point lead in multi-hop QA over state-of-the-art retrieval-agent baselines.

TL;DR

Modern AI agents are trapped in a "top-k" cage. Even the smartest models like Claude 3.5 or GPT-4o are forced to view the world through the blurry lens of vector similarity. Direct Corpus Interaction (DCI) shatters this paradigm by giving agents a terminal. Instead of asking a retriever for snippets, the agent uses grep, find, and bash to hunt through raw files. The result? An 11-point accuracy jump on BrowseComp-Plus and nearly 30% lower API cost.

The Resolution Bottleneck: Why Your RAG is Failing

In traditional RAG, we act like the retriever is a librarian. We give it a query, and it hands us five books. But what if we need to find every document where Company A is mentioned but Company B is not, and verify whether a specific serial number appears on page 42?

Conventional retrievers (dense or sparse) return a ranked list scored by overall relevance to the query. They are great for "vibes" but poor at exact matching and at composing conditions such as "A but not B." If the retriever misses a crucial link in the first step, the agent, no matter how capable, can never recover it. The authors call this a lack of Retrieval Interface Resolution.
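
To make the "Company A but not Company B" query concrete, here is a minimal Python sketch of the same composition done directly over raw files. Everything in it is an assumption made for illustration: the three-file corpus, the company names, the serial number, and the helper files_matching are hypothetical and do not come from the paper.

```python
import re
import tempfile
from pathlib import Path


def files_matching(root: Path, pattern: str) -> set[str]:
    """Return the names of .txt files under root whose text matches the regex."""
    rx = re.compile(pattern)
    return {p.name for p in root.rglob("*.txt") if rx.search(p.read_text())}


# Hypothetical three-document corpus, created only for this illustration.
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("Acme signed the contract. Serial SN-1042.\n")
(root / "b.txt").write_text("Acme and Globex merged. Serial SN-1042.\n")
(root / "c.txt").write_text("Globex issued a recall.\n")

# Mentions Acme AND NOT Globex AND contains the serial number.
hits = (
    files_matching(root, r"\bAcme\b") - files_matching(root, r"\bGlobex\b")
) & files_matching(root, r"SN-1042")
print(sorted(hits))
```

A top-k interface has no direct way to express the set difference in the middle of that expression; with file-level access it is a single line.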

Methodology: The Power of the Pipe

DCI transforms the agent from a passive consumer of snippets into a system administrator of knowledge.

1. The Probing Interface

The agent interacts with the corpus using:

  • grep / rg: For exact lexical and regex matching.
  • find / glob: For structural navigation of directories.
  • head / tail / sed: For surgical inspection of local context without loading massive files.
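
As a rough sketch of how these three tool families chain together, the Python snippet below runs find, grep, and sed through subprocess against a tiny throwaway corpus. The directory layout, file contents, search pattern, and the sh helper are all hypothetical; the paper's agent issues shell commands itself, and this only illustrates the navigate, match, inspect sequence. It assumes a Unix-like system with find, grep, and sed on the PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def sh(cmd: list[str]) -> str:
    """Run one command (no shell involved) and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


# Hypothetical one-file corpus, created only for this illustration.
root = Path(tempfile.mkdtemp())
(root / "reports").mkdir()
doc = root / "reports" / "q3.txt"
doc.write_text("intro\nrevenue rose 12%\nserial SN-1042 shipped\noutro\n")

# 1. Structural navigation: list candidate files.
found = sh(["find", str(root), "-name", "*.txt"]).splitlines()

# 2. Lexical probe: -r recurse, -l print file names only, -E extended regex.
matched = sh(["grep", "-rlE", "SN-[0-9]+", str(root)]).splitlines()

# 3. Locate the line number of the hit inside the matched file (-n).
line_hit = sh(["grep", "-nE", "SN-[0-9]+", str(doc)])

# 4. Surgical inspection: print only lines 2-3 instead of the whole file.
window = sh(["sed", "-n", "2,3p", str(doc)])
print(found, matched, line_hit, window, sep="\n")
```

Each step narrows the next one, so the agent only ever reads the few lines it has already located.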

2. Runtime Context Management

Searching via terminal generates a lot of "noise." To solve this, the authors developed a tiered management system:

  • Truncation: Capping tool outputs (e.g., 20k characters).
  • Compaction: Replacing older search turns with placeholders to free up the context window.
  • Summarization: Using the LLM to condense search history when things get too long.
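
The three tiers can be sketched in Python roughly as follows. Only the 20k-character cap comes from the summary above; the number of recent turns kept, the history budget, the placeholder text, and the function names are assumptions, and summarize stands in for an LLM call that is not specified here.

```python
from typing import Callable

MAX_TOOL_CHARS = 20_000     # cap mentioned in the summary above
KEEP_RECENT_TURNS = 2       # hypothetical setting
MAX_HISTORY_CHARS = 50_000  # hypothetical setting
PLACEHOLDER = "[older tool output elided]"


def truncate(output: str, limit: int = MAX_TOOL_CHARS) -> str:
    """Tier 1: cap a single tool output and note how much was dropped."""
    if len(output) <= limit:
        return output
    return output[:limit] + f"\n[truncated {len(output) - limit} chars]"


def compact(turns: list[str], keep: int = KEEP_RECENT_TURNS) -> list[str]:
    """Tier 2: replace all but the most recent `keep` turns with placeholders."""
    cut = max(len(turns) - keep, 0)
    return [PLACEHOLDER] * cut + turns[cut:]


def manage(
    turns: list[str],
    summarize: Callable[[str], str],
    budget: int = MAX_HISTORY_CHARS,
) -> list[str]:
    """Apply tiers 1 and 2, then tier 3 only if the history is still too long."""
    turns = compact([truncate(t) for t in turns])
    if sum(len(t) for t in turns) > budget:
        turns = [summarize("\n".join(turns))]
    return turns
```

Ordering the tiers from cheapest to most expensive means the extra LLM call for summarization is only paid when truncation and compaction were not enough.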

Architecture Comparison Figure: DCI (Right) removes the intermediary index, allowing the agent to "touch" the data directly.

Experiments: Superior Quality, Lower Cost

The researchers tested DCI on BrowseComp-Plus, a benchmark designed for "Deep Research."

  • Accuracy Boost: Using the same Claude Sonnet 4.6 backbone, DCI hit 80% accuracy, compared to only 69% for the best embedding-based retriever (Qwen3-8B).
  • Cost Efficiency: Because the agent isn't constantly re-indexing or pulling massive redundant chunks, the API cost dropped by nearly 30%.
  • Multi-hop Domination: DCI outperformed the strongest retrieval-agent baselines by 30.7 points on multi-hop QA benchmarks such as MuSiQue.

Performance vs Cost Figure: The Pareto frontier shows DCI-Agents (stars) delivering higher accuracy for lower cost compared to traditional retrieval.

Why Does It Work? The "Localization" Factor

The paper introduces two key metrics:

  1. Coverage: Did the agent find the document? (Recall)
  2. Localization: Did the agent find the exact right lines within the document?

Interestingly, DCI often has lower broad coverage than vector search, but its Localization score is 2x higher. Once a DCI agent finds a "warm" document, it uses grep to zoom in on the exact evidence. It doesn't need to read the whole "book" if it can grep the right "sentence."
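
One plausible way to compute the two metrics is sketched below in Python. The paper's exact definitions are not given in this summary, so the formulas are assumptions: coverage is taken as the fraction of gold documents the agent opened, and localization as the fraction of gold evidence lines the agent actually viewed. The file names and line numbers are made up.

```python
def coverage(opened_docs: set[str], gold_docs: set[str]) -> float:
    """Fraction of gold documents that the agent opened at all."""
    return len(opened_docs & gold_docs) / len(gold_docs)


def localization(
    viewed: dict[str, set[int]], gold: dict[str, set[int]]
) -> float:
    """Fraction of gold evidence lines that fall inside lines the agent viewed."""
    total = sum(len(lines) for lines in gold.values())
    found = sum(
        len(lines & viewed.get(doc, set())) for doc, lines in gold.items()
    )
    return found / total


# Hypothetical trace: gold evidence sits in two files, the agent opened two files.
gold = {"a.txt": {3, 4}, "b.txt": {10}}
viewed = {"a.txt": {2, 3, 4}, "c.txt": {1}}

print(coverage(set(viewed), set(gold)))  # 1 of 2 gold documents opened
print(localization(viewed, gold))        # 2 of 3 gold lines viewed
```

Under these definitions an agent can miss a gold document entirely, which lowers coverage, yet still score well on localization for the documents it did reach.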

Critical Analysis & The "Breadth" Limit

DCI is not a silver bullet for everything. The authors found a clear Operating Envelope:

  • Scaling Pain: DCI excels on local or heterogeneous corpora (100k - 200k docs). However, as the corpus grows to 400k+, the "search breadth" (finding that first anchor file) becomes too expensive for an LLM to manage via raw shell commands.
  • The Sweet Spot: DCI is perfect for local workspaces, coding repositories, and deep research where precision matters more than scanning billions of web pages.

Conclusion: A New Era of Agentic IR

The biggest takeaway is that we should stop treating retrieval as a "black box" that gives us snippets. For the next generation of LLMs, which can reason and use tools, the best interface isn't a vector database; it's a standard Bash terminal.

Future Outlook

We are likely to see hybrid systems where a dense retriever handles the "Wide Scan" (finding the top 1,000 files) and a DCI-agent handles the "Deep Dive" (using terminal tools to extract the truth from those 1,000 files).
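
A toy version of that two-stage pipeline might look like the Python sketch below. It is not the paper's system: wide_scan uses simple word overlap as a stand-in for a real dense retriever, deep_dive uses a regex in place of an agent driving terminal tools, and the corpus, query, and function names are hypothetical.

```python
import re


def wide_scan(query: str, corpus: dict[str, str], k: int) -> list[str]:
    """Stand-in for a dense retriever: rank documents by shared query words."""
    q = set(query.lower().split())

    def score(name: str) -> int:
        return len(q & set(corpus[name].lower().split()))

    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(corpus, key=lambda n: (-score(n), n))[:k]


def deep_dive(
    pattern: str, names: list[str], corpus: dict[str, str]
) -> list[tuple[str, int, str]]:
    """Grep-like pass: return (file, line number, line) for each regex match."""
    rx = re.compile(pattern)
    return [
        (n, i, line)
        for n in names
        for i, line in enumerate(corpus[n].splitlines(), start=1)
        if rx.search(line)
    ]


# Hypothetical corpus, held in memory to keep the sketch short.
corpus = {
    "a.txt": "acme quarterly report\nserial SN-1042 shipped in march",
    "b.txt": "acme press release\nno hardware mentioned",
    "c.txt": "weather forecast for tuesday",
}

shortlist = wide_scan("acme serial report", corpus, k=2)  # the "Wide Scan"
evidence = deep_dive(r"SN-\d+", shortlist, corpus)        # the "Deep Dive"
print(shortlist, evidence)
```

The split keeps the expensive line-level search confined to a shortlist, which is the part of the job the summary says becomes too costly at 400k+ documents.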
