Much current work (e.g., IdeaBench, Guo et al., 2024) evaluates hypotheses for novelty, feasibility, or expert rating, but rarely investigates their long-term scientific value. This project would systematically track a large set of LLM-generated hypotheses and follow whether and how the scientific community takes them up: are they later published, tested, or cited, and do they lead to real discoveries? By collaborating with journals or open science platforms, the project could build a “hypothesis registry” and analyze the time-lagged impact of both AI- and human-generated hypotheses. This would provide unique insights into the actual utility of LLMs in science, inform better evaluation benchmarks, and potentially reveal which kinds of LLM outputs have the greatest real-world payoff.
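As a rough illustration of what a "hypothesis registry" record and a time-lagged comparison might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names `Hypothesis` and `UptakeEvent`, the `source` labels `"llm"` and `"human"`, the four event kinds, and the choice of uptake rate and median lag as summary statistics. The idea itself does not specify a schema or an analysis method.

```python
from dataclasses import dataclass, field
from datetime import date
from statistics import median
from typing import List, Optional

# Hypothetical event kinds, mirroring the outcomes named in the text.
EVENT_KINDS = {"published", "tested", "cited", "discovery"}


@dataclass
class UptakeEvent:
    kind: str   # one of EVENT_KINDS
    when: date  # date the event was observed


@dataclass
class Hypothesis:
    hypothesis_id: str
    source: str       # assumed labels: "llm" or "human"
    registered: date  # date the hypothesis entered the registry
    events: List[UptakeEvent] = field(default_factory=list)

    def record(self, kind: str, when: date) -> None:
        """Attach an uptake event, rejecting unknown kinds and impossible dates."""
        if kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        if when < self.registered:
            raise ValueError("event predates registration")
        self.events.append(UptakeEvent(kind, when))

    def lag_days(self, kind: str) -> Optional[int]:
        """Days from registration to the first event of this kind, or None."""
        dates = [e.when for e in self.events if e.kind == kind]
        return (min(dates) - self.registered).days if dates else None


def summarize(registry: List[Hypothesis], source: str, kind: str) -> dict:
    """Uptake rate and median lag for one source and one event kind."""
    group = [h for h in registry if h.source == source]
    lags = [h.lag_days(kind) for h in group]
    observed = [lag for lag in lags if lag is not None]
    return {
        "n": len(group),
        "uptake_rate": len(observed) / len(group) if group else 0.0,
        "median_lag_days": median(observed) if observed else None,
    }
```

Calling `summarize(registry, "llm", "cited")` and `summarize(registry, "human", "cited")` on the same registry would give a first side-by-side view of the two groups. A real study would also need to account for hypotheses registered too recently to have had a chance at uptake, which this sketch ignores.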
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-longitudinal-evaluation-of-2025,
author = {GPT-4.1},
title = {Longitudinal Evaluation of LLM-Generated Hypotheses: Tracking Real-World Scientific Impact},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/eOatPwHLLchc9ekdTkIG}
}