Can LLMs contribute meaningfully to novel scientific discovery?

LLMs can generate novel research ideas but face feasibility and reliability challenges. Evidence from expert studies and astronomy shows promise and pitfalls.

Direct answer

Yes, large language models (LLMs) can contribute meaningfully to novel scientific discovery, but with important caveats. A large-scale study with over 100 NLP researchers found that LLM-generated ideas were judged as more novel than human expert ideas (statistically significant at p<0.05), though slightly weaker on feasibility [1]. In astronomy, an LLM framework successfully identified dozens of previously unclassified celestial objects with high scientific potential and even proposed follow-up observation plans [3]. However, LLMs still struggle with self-evaluation, lack diversity in idea generation, and can produce unreliable 'hallucinations' when not grounded in structured knowledge [1][4].

5 sources cited

This article was generated with WisPaper-powered search and paper analysis.

Are LLM ideas actually more novel than human ideas?

Yes, according to a large-scale head-to-head comparison. Researchers recruited over 100 NLP experts to write novel research ideas and to blindly review both human-written and LLM-generated ideas. The LLM ideas were judged as significantly more novel (p<0.05, meaning a difference this large would be expected less than 5% of the time if there were no real difference between the two groups) [1]. However, the same reviewers rated the LLM ideas as slightly weaker on feasibility: they were more creative but harder to actually execute. This tradeoff is crucial, because novelty alone doesn't guarantee a usable discovery.
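To make the p<0.05 threshold concrete, here is a minimal sketch of a permutation test on two sets of ratings. It is an illustration only: the scores are invented, the function name `permutation_p_value` is hypothetical, and nothing here is taken from the study in [1], whose own statistical procedure may differ.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under "no real difference", any split is equally likely.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical 1-10 novelty ratings, invented for illustration only.
llm_scores = [6, 7, 5, 6, 8, 6, 7, 5, 6, 7]
human_scores = [5, 4, 6, 5, 4, 5, 6, 4, 5, 5]

p = permutation_p_value(llm_scores, human_scores)
print(f"p-value: {p:.4f}")
```

The p-value is the share of random relabelings that produce a gap at least as large as the observed one. A small value says the gap would be surprising if the labels "LLM" and "human" did not matter; it does not say how large or how useful the difference is.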

Can LLMs actually find new things in real data?

Yes, and a concrete example comes from astronomy. Researchers used an LLM to interpret unusual celestial sources that machine learning algorithms had flagged as anomalies in infrared light curves and spectral energy distributions from the NEOWISE survey. After validating the approach on known rare variable sources, they applied it to previously unclassified objects and identified dozens with high scientific potential; the LLM also proposed follow-up observation plans [3]. This shows LLMs can bridge the 'final mile' between data-driven anomaly detection and physical interpretation, a step that often stumps individual experts due to the breadth of modern astrophysics.

What's the catch? Where do LLMs still fall short?

The main catch is reliability. The same study that found LLM ideas more novel also identified 'failures of LLM self-evaluation' and a 'lack of diversity in generation': the models often can't tell which of their own ideas are good, and they tend to produce similar-sounding ideas [1]. In biology, attempts to use LLMs for filtering predictions and generating hypotheses have been 'impeded by issues such as hallucinations and the lack of structured knowledge grounding' [4]. To address this, researchers built a collaborative system called HypoChainer that combines LLMs with knowledge graphs and human expertise, showing that grounding LLMs in structured data can make their outputs more reliable for hypothesis-driven discovery [4]. So LLMs work best as part of a team, not as solo inventors.
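The idea of grounding LLM output in a knowledge graph can be sketched as a simple triage step: a hypothesis linking two entities counts as grounded only if the graph contains a short path between them, and everything else goes to a human expert. This is a toy illustration under stated assumptions, not HypoChainer's actual method; the graph contents, the entity names, and the functions `has_path` and `triage` are all hypothetical.

```python
from collections import deque

# Toy knowledge graph as an adjacency map; all entity names are hypothetical.
KNOWLEDGE_GRAPH = {
    "gene_A": {"protein_B"},
    "protein_B": {"pathway_C"},
    "pathway_C": {"disease_D"},
    "disease_D": set(),
}

def has_path(graph, start, goal, max_hops=3):
    """Breadth-first search: is goal reachable from start within max_hops edges?"""
    if start not in graph or goal not in graph:
        return False  # Unknown entity: nothing in the graph supports the link.
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return True
        if hops == max_hops:
            continue  # Do not expand beyond the hop limit.
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return False

def triage(hypotheses, graph):
    """Split (source, target) hypotheses into grounded ones and ones needing expert review."""
    grounded, needs_review = [], []
    for source, target in hypotheses:
        bucket = grounded if has_path(graph, source, target) else needs_review
        bucket.append((source, target))
    return grounded, needs_review

# Hypothetical LLM-proposed links; the second names an entity absent from the graph.
llm_hypotheses = [("gene_A", "disease_D"), ("gene_A", "disease_X")]
grounded, needs_review = triage(llm_hypotheses, KNOWLEDGE_GRAPH)
print("grounded:", grounded)
print("needs review:", needs_review)
```

A hypothesis that fails the check is not necessarily wrong, since a genuinely novel link may be missing from the graph. That is why the sketch routes such cases to a reviewer instead of discarding them.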

Sources used in this answer

1

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

LLM-generated research ideas were judged as significantly more novel than human expert ideas (p<0.05) but slightly weaker on feasibility, based on blind reviews by over 100 NLP researchers.

2

Large language models and their role in modern scientific discoveries

LLMs accelerate scientific research by efficiently processing big data, but raise fundamental questions about whether the results constitute new knowledge and what scientific creativity means in the era of big computing.

3

Closing the Final Mile in Data-Driven Discovery: Interpreting Uncharted Celestial Sources with Large Language Models Across Multimodal Data

An LLM framework interpreted anomalous celestial sources from NEOWISE data, identifying dozens of previously unclassified objects with high scientific potential and proposing follow-up observation plans.

4

HypoChainer: A Collaborative System Combining LLMs and Knowledge Graphs for Hypothesis-Driven Scientific Discovery

A collaborative system (HypoChainer) that combines LLMs with knowledge graphs and human expertise improved hypothesis-driven discovery in biology, overcoming LLM hallucinations and lack of structured grounding.

5

Editorial: Harnessing the Power of Large Language Model-Based Chatbots for Scientific Discovery

Editorial perspective highlighting the potential of LLM-based chatbots for scientific discovery, particularly in chemistry and drug design, while noting the need for careful integration with existing methods.