OpenSeeker: Shattering the "Data Moat" for Frontier Search Agents
Abstract

OpenSeeker is the first fully open-source frontier-level search agent developed by an academic team. It utilizes two novel synthesis techniques—Fact-grounded QA Synthesis and Denoised Trajectory Synthesis—to achieve SOTA performance on benchmarks like BrowseComp and WideSearch using only 11.7k training samples and a single SFT run.

TL;DR

Search agents have traditionally been a "closed-door game" played by industrial giants like OpenAI and Google. OpenSeeker, a breakthrough from Shanghai Jiao Tong University, changes this by fully open-sourcing the training data and model weights of a 30B agent that rivals proprietary models. Using just 11.7k high-fidelity synthetic samples, it outperforms models like Tongyi DeepResearch, which rely on far more complex RL-based pipelines.

Background: The Industrial Monopoly on "Deep Research"

In the race for autonomous web intelligence, a massive gap has formed between proprietary "Deep Research" models and the open-source community. While architectural details are often shared, the high-quality trajectory data remains a corporate secret. Prior open-source attempts often hit a performance ceiling because their training data lacked the structural complexity to force "multi-hop" reasoning—agents would simply shortcut to answers using parametric memory or simple keyword search.

Methodology: Engineering the "Hardest" Problems

The core philosophy of OpenSeeker is that a model is only as good as the puzzles it is forced to solve. The team introduced two surgical innovations to the data synthesis pipeline:

1. Fact-Grounded & Controllable QA Synthesis

Instead of asking an LLM to "dream up" a hard question, OpenSeeker reverse-engineers the web graph.

  • Topological Expansion: They start with a seed webpage and expand to connected nodes via hyperlinks.
  • Entity Obfuscation: To prevent the agent from "cheating" with a single Google search, they replace specific entities with vague descriptions (e.g., "the winner of the 2024 award" instead of the person's name). This mandates a multi-step navigation path through the graph to resolve entities before answering.
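
The two synthesis steps above can be sketched in a few lines. Note this is an illustrative toy, not the authors' pipeline: the hard-coded web graph, the function names, and the Turing Award example are all assumptions made here for clarity.

```python
from collections import deque

# Toy web graph standing in for live hyperlink crawling:
# page -> (fact stated on the page, outgoing hyperlinks).
WEB_GRAPH = {
    "award_2024": ("Alice Chen won the 2024 Turing Award", ["alice_chen"]),
    "alice_chen": ("Alice Chen studied at MIT", ["mit"]),
    "mit": ("MIT is located in Cambridge, Massachusetts", []),
}

def topological_expansion(seed: str, max_hops: int = 2) -> list[str]:
    """Breadth-first expansion from a seed page along hyperlinks,
    collecting the facts that ground the synthesized question."""
    seen, facts = {seed}, []
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        fact, links = WEB_GRAPH[page]
        facts.append(fact)
        if depth < max_hops:
            for nxt in links:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return facts

def obfuscate(question: str, entity: str, description: str) -> str:
    """Replace a named entity with a vague description so the question
    cannot be shortcut with a single keyword search."""
    return question.replace(entity, description)

facts = topological_expansion("award_2024")
easy_q = "In which city did Alice Chen study?"
hard_q = obfuscate(easy_q, "Alice Chen", "the winner of the 2024 Turing Award")
# The agent must first resolve the entity, then hop through the graph to answer.
```

The obfuscated question now requires two navigation steps (resolve the laureate, then find where they studied) instead of one lookup.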

[Figure: Model Architecture and QA Pipeline]

2. Denoised Trajectory Synthesis (Asymmetric Training)

Raw web data is noisy. To create "Golden Trajectories":

  • The Teacher: During data generation, a teacher model sees a summarized, denoised version of previous steps. This allows the teacher to plan perfectly without getting distracted by HTML boilerplate.
  • The Student: During SFT, the OpenSeeker model is trained to predict the Teacher's perfect actions but is given the raw, noisy tool output. This forces the model to internalize the ability to "see through the noise."
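
The teacher/student asymmetry above can be sketched as follows. This is a minimal illustration under assumed names: the boilerplate marker, the `summarize` stand-in, and the two-step episode are hypothetical, not the paper's actual implementation.

```python
# Marker standing in for the noisy HTML the student must learn to ignore.
HTML_BOILERPLATE = "<nav>...</nav><script>...</script>"

def summarize(raw_observation: str) -> str:
    """Stand-in for the denoising summarizer the teacher sees.
    Here it simply strips the boilerplate marker."""
    return raw_observation.replace(HTML_BOILERPLATE, "").strip()

def build_contexts(steps):
    """steps: list of (raw_tool_output, teacher_action) tuples.
    Returns the teacher's clean planning context and the student's
    SFT examples, which pair RAW history with the teacher's action."""
    teacher_context, student_examples = [], []
    history_raw = []
    for raw_obs, action in steps:
        # Teacher plans over the clean, summarized history.
        teacher_context.append((summarize(raw_obs), action))
        # Student is trained to emit the same action from raw history.
        history_raw.append(raw_obs)
        student_examples.append({"input": list(history_raw), "label": action})
    return teacher_context, student_examples

steps = [
    (HTML_BOILERPLATE + " Result: 2024 award page ", "search('laureate bio')"),
    (HTML_BOILERPLATE + " Result: laureate bio ", "answer('Cambridge')"),
]
teacher_ctx, sft_examples = build_contexts(steps)
```

The key design choice is the asymmetry: the label comes from the clean-context teacher, but the training input is the raw, noisy observation, so the student internalizes the denoising.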

[Figure: Denoised Trajectory Synthesis Mechanism]

Experiments: Quality Over Quantity

The results are a testament to the power of data engineering. Agent SFT pipelines typically rely on hundreds of thousands of samples to change model behavior significantly; OpenSeeker achieves SOTA with a mere 11,700.

  • Efficiency: On BrowseComp-ZH, OpenSeeker (48.4) beats Tongyi DeepResearch (46.7).
  • Complexity: Analysis of the trajectories shows OpenSeeker-v1 data averages 46 tool calls per task, nearly double the complexity of standard benchmarks like BrowseComp.

[Figure: Performance Comparison across Benchmarks]

Critical Analysis & Conclusion

Takeaway

OpenSeeker proves that you don't need a multi-million dollar RL budget to build a frontier agent; you need an intelligent data engine. By reverse-engineering the web graph to generate tasks, the authors have provided a scalable "curriculum" that could theoretically scale to much larger models.

Limitations

Currently, the model has been trained with only a single SFT run. The authors note that no heuristic data filtering or hyperparameter tuning was performed, suggesting that the current performance is likely a lower bound on what this methodology can achieve.

Outlook

By open-sourcing the data, OpenSeeker democratizes research into long-horizon planning. Future work will likely look at integrating more diverse tools (beyond web search) and refining the "Student-Teacher" denoising gap to handle even more unstructured environments like PDF analysis or code repositories.
