WisPaper
WisPaper
学术搜索
学术问答
价格
TrueCite
CT Open: Moving AI Evaluation from "Hindsight" to "Forecasting"
总结
问题
方法
结果
要点
摘要

CT Open is a live, open-access platform and dynamic benchmark for Clinical Trial Outcome Prediction (CTOP). It introduces four quarterly challenges per year and utilizes a novel, expert-validated decontamination pipeline to ensure evaluated trials have no publicly available results at the time of prediction, achieving SOTA-level contamination resistance.

TL;DR

Predicting whether a clinical trial will succeed is a billion-dollar question. CT Open is a new "live" platform that challenges AI systems to predict trial outcomes before they are announced. By using a sophisticated, multi-stage automated decontamination pipeline, it ensures the AI isn't just "remembering" a result it saw during training, but is actually reasoning about drug mechanisms and trial designs.

The Problem: The "Memory" vs. "Intelligence" Trap

In the world of Large Language Models (LLMs), today's SOTA (State of the Art) might just be tomorrow's training data. Static benchmarks are rapidly losing their utility because models eventually ingest the answers. In high-stakes fields like clinical trials, this is particularly dangerous: an LLM might appear to possess "medical intuition" when it is actually just recalling a 2023 press release.

The authors of CT Open identify a critical nuance: Contamination is not just a database problem. Clinical trial results leak in fragments—an obscure LinkedIn post from a researcher, a blurry photo of a conference poster, or a financial disclosure. Standard registry checks are insufficient to guarantee a "blind" prediction.

Methodology: The Decontamination Fortress

The core innovation of CT Open is its Decontamination Pipeline (Algorithm 1). Unlike past benchmarks that simply check dates on ClinicalTrials.gov, CT Open treats data cleaning as an adversarial search task.

1. The Multi-Layer Search

The pipeline employs a "Global Search" strategy across several modes:

  • LLM Mode: Orchestrating GPT-5 and Gemini 3 to browse the web for any mention of results.
  • Brave Mode: A customized technical scraping stack that uses deep-web search and DeepSeek OCR to read text within PDFs and images.
  • Database Mode: Direct API querying of PubMed, PMC, and preprint servers.

2. The Verification Loop

Once a document is found, it undergoes a two-round GPT-5 verification:

  • Round 1 (Matching): Does this document discuss this exact trial? (Checking NCT IDs, drug aliases, and enrollment counts).
  • Round 2 (Evidence): Does it actually contain results, or is it just a recruitment announcement?

Decontamination Logic Figure 1: The CT Open platform architecture and leaderboard samples.

Experiments: Humans and Trad-ML Still Rule

The paper's evaluation phase delivers a sobering "vibe check" for LLM enthusiasts.

Key Findings:

  • Traditional Baselines Persist: Simple Random Forests and Feed-Forward Neural Networks (FFNN) trained on structured tabular data (Phase, Enrollment count, MeSH terms) often outperformed the world's most powerful LLMs.
  • The Contamination Dip: When Claude Opus 4.5 was tested on the "Winter 2025" set (potentially within its training cutoff), it performed excellently. However, on the "Summer 2025" set—which is strictly uncontaminated—its performance plummeted significantly.
  • The Power of Agents: Agentic workflows (o3-mini with web access) showed a unique ability to find "dark data." In one instance, the agent found a patient blog that provided anecdotal evidence of a trial's failure, allowing it to predict a "Comparative Effect" outcome with near-perfect accuracy (reaching 92.75% in specific categories).

Results Table Figure 2: Performance comparison showing the drop in LLM accuracy on the Summer 2025 benchmark.

Methodology Deep Dive: Agentic Search and RAG

The authors explored whether Retrieval-Augmented Generation (RAG) could bridge the gap. Interestingly, generic retrieval of "preclinical evidence" (like animal studies) often hurt performance. The most effective strategy was "Me-Too" Trial Retrieval: finding previously completed trials of similar drugs in the same disease class. This requires the model to reason: "If drug A failed in this population, why would drug B (the target) succeed?"

Critical Analysis & Conclusion

CT Open is more than just a leaderboard; it is an infrastructure for Forecasting AI.

Takeaway: The study proves that "off-the-shelf" LLMs are not yet reliable biological oracles. They struggle to handle structured tabular data (like patient eligibility) as effectively as classical ML, and they are highly dependent on whether the "answer" was already part of their training set.

The Future: As AI moves toward "AI Scientists" and "Auto-Research," platforms like CT Open will be the ultimate proving ground. The next frontier isn't just processing what we know—it's predicting what we don't.

Platform Access: ct-open.net

发现相似论文

试试这些示例

  • Search for recent papers published after 2024 that propose dynamic or "live" benchmarking strategies for LLMs to prevent data contamination.
  • Which studies first established the use of LLMs for web-based data decontamination, and how does the iterative search approach in CT Open refine those techniques?
  • Investigate how Agentic LLM frameworks like the one used in CT Open can be applied to financial forecasting or real-world geopolitical event prediction.
目录
CT Open: Moving AI Evaluation from "Hindsight" to "Forecasting"
1. TL;DR
2. The Problem: The "Memory" vs. "Intelligence" Trap
3. Methodology: The Decontamination Fortress
3.1. 1. The Multi-Layer Search
3.2. 2. The Verification Loop
4. Experiments: Humans and Trad-ML Still Rule
4.1. Key Findings:
5. Methodology Deep Dive: Agentic Search and RAG
6. Critical Analysis & Conclusion