AI Scientists: Producing Results Without Scientific Reasoning
Abstract

The paper introduces "Corral," a benchmarking framework evaluating LLM-based scientific agents across eight domains. It reveals that while "AI Scientists" can execute computational workflows, they fundamentally fail at hypothesis-driven inquiry, with performance dictated almost entirely by the base model rather than agent scaffolding.

TL;DR

A landmark study from researchers at the Friedrich Schiller University Jena and IIT Delhi introduces Corral, a rigorous framework that puts "AI Scientists" under the microscope. The verdict? Current LLM agents are excellent at following instructions (workflows) but catastrophic at the "Scientific Method" (hypothesis-driven inquiry). They routinely ignore evidence and stick to false beliefs, and no amount of "prompt engineering" or "scaffolding" seems to fix it.

The Problem: The "Result" is Not the "Reason"

In 1965, the DENDRAL system, one of the first AI systems for science, was designed to be transparent: it showed every step of its logic. Today’s LLM-based agents are the opposite: they are "black boxes" governed by statistical regularities.

The authors argue that a "correct answer" in science is worthless if the process used to get there is flawed. If an AI ignores a failed experiment but guesses the right molecule anyway, that knowledge is not justified. Current benchmarks measure only whether the AI finished the task, not how it got there.

Methodology: Deconstructing the "AI Scientist"

The researchers evaluated three frontier models (including GPT-4o and Claude 3.5 Sonnet) across eight domains ranging from Molecular Simulation to Circuit Inference. They used a sophisticated two-stage approach:

  1. IRT Analysis: Using Item Response Theory to pinpoint whether a failure was due to a lack of "Knowledge" (facts) or "Reasoning" (logic).
  2. Epistemological Graphing: Mapping every thought and action of the AI into a graph to see if it actually behaves like a scientist (e.g., forming a hypothesis → testing → updating belief).
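
As a rough illustration of the first stage, the sketch below uses a compensatory two-dimensional logistic IRT model, in which an item's success probability depends on separate "knowledge" and "reasoning" abilities. The summary does not say which IRT variant the paper fits, so the function name `p_correct`, its parameters, and all numeric values are assumptions chosen for illustration only.

```python
import math


def p_correct(theta_knowledge: float, theta_reasoning: float,
              a_knowledge: float, a_reasoning: float,
              difficulty: float) -> float:
    """Probability of solving one item under a compensatory 2-D logistic IRT model.

    theta_* are the agent's latent abilities; a_* say how strongly the item
    loads on each ability; difficulty shifts the whole curve.
    (Illustrative model choice, not necessarily the one used in the paper.)
    """
    logit = a_knowledge * theta_knowledge + a_reasoning * theta_reasoning - difficulty
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical item that loads mostly on reasoning.
item = dict(a_knowledge=0.2, a_reasoning=1.5, difficulty=1.0)

# Hypothetical agents: one knows many facts but reasons poorly, one the reverse.
knows_facts = p_correct(theta_knowledge=2.0, theta_reasoning=-1.0, **item)
reasons_well = p_correct(theta_knowledge=-1.0, theta_reasoning=2.0, **item)

print(f"knowledge-heavy agent: {knows_facts:.2f}")
print(f"reasoning-heavy agent: {reasons_well:.2f}")
```

On an item like this, the reasoning-heavy agent comes out well ahead, which is the kind of pattern that lets an IRT fit attribute a failure to reasoning rather than to missing facts.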

[Figure: Benchmark Overview]

Key Insight: The "Base Model" is Everything

One of the most striking findings comes from the variance decomposition. The AI community focuses heavily on "scaffolding" (ReAct, Chain-of-Thought, tool calling). However, the study reports that:

  • Reasoning Ability (Base Model) accounts for 41.4% of the variance in performance.
  • Agent Scaffold accounts for a measly 1.5%.
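
To make the idea of a variance decomposition concrete, here is a minimal sketch that splits the variance of a balanced model × scaffold score table into main-effect shares using sums of squares. The scores are made-up toy numbers, not results from the paper, and `variance_shares` is a hypothetical helper; the paper's actual decomposition may use a different statistical model.

```python
from statistics import mean

# Toy, made-up success rates: scores[model][scaffold]. NOT data from the paper.
scores = {
    "model_a": {"scaffold_x": 0.20, "scaffold_y": 0.24},
    "model_b": {"scaffold_x": 0.60, "scaffold_y": 0.62},
}


def variance_shares(table: dict) -> dict:
    """Share of total variance explained by each factor in a balanced design."""
    models = list(table)
    scaffolds = list(next(iter(table.values())))
    cells = [table[m][s] for m in models for s in scaffolds]
    grand = mean(cells)

    total_ss = sum((x - grand) ** 2 for x in cells)
    if total_ss == 0:
        raise ValueError("all scores are identical; nothing to decompose")

    # Main effect of the base model: spread of per-model means around the grand mean.
    model_ss = len(scaffolds) * sum(
        (mean(table[m].values()) - grand) ** 2 for m in models
    )
    # Main effect of the scaffold: spread of per-scaffold means.
    scaffold_ss = len(models) * sum(
        (mean(table[m][s] for m in models) - grand) ** 2 for s in scaffolds
    )

    model_share = model_ss / total_ss
    scaffold_share = scaffold_ss / total_ss
    return {
        "model": model_share,
        "scaffold": scaffold_share,
        "residual": 1.0 - model_share - scaffold_share,  # interaction + noise
    }


shares = variance_shares(scores)
for factor, share in shares.items():
    print(f"{factor}: {share:.1%}")
```

In this toy table, swapping the model moves the score far more than swapping the scaffold, so the model share dominates, which mirrors the qualitative pattern the study describes.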

In short: You cannot "engineer" your way out of a weak base model.

[Figure: Performance Drivers]

The "Reasoning Breakdowns"

When the researchers looked at the logic of the agents, they found a graveyard of scientific errors:

  • Evidence Non-uptake (68%): The AI runs a tool, gets a result, and then completely ignores that result in its next thought.
  • Untested Claims (53%): The AI makes a bold claim (e.g., "This must be an acetyl group") but never actually tries to verify it.
  • Fixed Belief Traces: The AI makes an initial error and, despite subsequent tools screaming that something is wrong, it never revises its original (false) hypothesis.
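
The first two failure modes can be checked mechanically once a trace is stored as a graph of steps. The sketch below assumes a hypothetical schema (a `Step` with a `kind` and a list of `refs` to earlier steps); the paper's actual graph representation is not described in this summary, so treat the names and the example trace as illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    step_id: str
    kind: str  # assumed labels: "hypothesis", "test", "observation", "update"
    refs: list[str] = field(default_factory=list)  # ids of earlier steps this one builds on


def unused_observations(trace: list[Step]) -> list[str]:
    """Evidence non-uptake: observations that no other step ever refers to."""
    cited = {ref for step in trace for ref in step.refs}
    return [s.step_id for s in trace if s.kind == "observation" and s.step_id not in cited]


def untested_hypotheses(trace: list[Step]) -> list[str]:
    """Untested claims: hypotheses that no test step ever targets."""
    tested = {ref for step in trace if step.kind == "test" for ref in step.refs}
    return [s.step_id for s in trace if s.kind == "hypothesis" and s.step_id not in tested]


# Hypothetical trace: h1 is tested and updated; h2 and o2 are left dangling.
trace = [
    Step("h1", "hypothesis"),
    Step("t1", "test", refs=["h1"]),
    Step("o1", "observation", refs=["t1"]),
    Step("u1", "update", refs=["o1", "h1"]),
    Step("h2", "hypothesis"),
    Step("o2", "observation"),
]

print("ignored evidence:", unused_observations(trace))
print("untested claims:", untested_hypotheses(trace))
```

A fixed-belief check could be built the same way, for example by flagging hypotheses that receive contradicting observations but no later "update" step, though that requires labeling which observations contradict which claims.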

[Figure: Reasoning Breakdowns]

Reliability: The Decay of Truth

Because the reasoning is so brittle, the reliability of these agents collapses over time. In "Hypothesis-driven" tasks (like figuring out a chemical structure from a spectrum), the probability that the AI will succeed multiple times in a row (Pass^k) drops to nearly zero by the 4th or 5th attempt.
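
A short sketch shows why Pass^k falls so quickly. This summary does not say how the paper estimates it; the code below assumes one common estimator, C(c, k) / C(n, k) for c successes in n independent runs, and the run counts are hypothetical.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that k randomly chosen runs (out of n) all succeed.

    n: total runs of a task, c: successful runs, k: required consecutive successes.
    Assumed estimator: C(c, k) / C(n, k); it is 0 whenever c < k.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Hypothetical task: 10 runs, 5 of them successful.
for k in range(1, 6):
    print(f"pass^{k} = {pass_hat_k(n=10, c=5, k=k):.3f}")
```

Even an agent that succeeds on half of its runs loses reliability fast as k grows, because every additional required success multiplies in another factor below one.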

Even when the researchers "hand-fed" the AI successful steps from previous runs (Trace Interventions), the models often still failed to bridge the gap to the final answer if they had to do the hard reasoning themselves.

Conclusion and Future Outlook

The paper concludes with a stern warning for the "AI for Science" field: Outcome-based evaluation is blind to epistemic failure.

If we want AI that can truly participate in the self-correcting nature of science, we must stop focusing on the "Scaffold" and start making epistemic norms (like Popperian falsification) a direct training target for the foundational models. Until then, the "scientific" results produced by AI remain, at best, lucky guesses.

Takeaways for the Industry

  • Don't over-rely on scaffolding: If your base model can't reason, a fancy prompt won't save it in complex domains.
  • Process matters: Evaluations must include "trace audits" to ensure the AI isn't hallucinating its way to a correct answer.
  • Training Shift: The next frontier of LLM training isn't more data; it's data that rewards the process of inquiry and correction.
