Building on the democratizing ethos of Evalverse (Kim et al., 2024) and the hypothesis-generation focus of Koneru et al. (2023), this proposal uses human feedback as a primary driver of hypothesis evolution. Rather than relying solely on expert-designed benchmarks, the platform would let users (e.g., clinicians using PEACH or researchers probing LLMs in scientific discovery) submit model outputs that deviate from their expectations. The system would then prompt each user to articulate a plausible hypothesis about the underlying cause (e.g., “the model may be overfitting to recent literature” or “the model misinterprets ambiguous terminology”). Each hypothesis would be formalized, tested, and, if validated, incorporated into future evaluation cycles. This “citizen science” approach could broaden the range of evaluation scenarios beyond what expert-designed benchmarks cover and help LLM assessment keep pace with real-world deployment challenges.
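The submit, hypothesize, test, and incorporate loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a specification of the platform: the names `Hypothesis`, `EvaluationSuite`, and `evaluate_hypothesis`, the idea of testing a hypothesis against a list of probe prompts, and the 0.5 validation threshold are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PROPOSED = "proposed"
    VALIDATED = "validated"
    REJECTED = "rejected"


@dataclass
class Hypothesis:
    """A user-articulated guess about why an output deviated from expectations."""
    submitted_output: str  # the surprising model output the user reported
    explanation: str       # the user's free-text hypothesis about the cause
    status: Status = Status.PROPOSED


@dataclass
class EvaluationSuite:
    """Collects validated hypotheses for use in future evaluation cycles."""
    validated: List[Hypothesis] = field(default_factory=list)


def evaluate_hypothesis(
    hypothesis: Hypothesis,
    probes: List[str],
    exhibits_failure: Callable[[str], bool],
    suite: EvaluationSuite,
    threshold: float = 0.5,  # assumed cutoff, not taken from the proposal
) -> float:
    """Test a hypothesis on probe prompts and promote it if it holds up.

    `exhibits_failure` is a hypothetical check that returns True when the
    model shows the hypothesized failure on a given probe.
    """
    if not probes:
        raise ValueError("at least one probe is required")
    failure_rate = sum(exhibits_failure(p) for p in probes) / len(probes)
    if failure_rate >= threshold:
        hypothesis.status = Status.VALIDATED
        suite.validated.append(hypothesis)
    else:
        hypothesis.status = Status.REJECTED
    return failure_rate


# Example: a stand-in check that flags probes containing an ambiguous term.
suite = EvaluationSuite()
hypothesis = Hypothesis(
    submitted_output="(surprising model output)",
    explanation="the model misinterprets ambiguous terminology",
)
probes = ["ambiguous term A", "ambiguous term B", "clear term C"]
evaluate_hypothesis(hypothesis, probes, lambda p: p.startswith("ambiguous"), suite)
print(hypothesis.status)
```

In a real deployment the `exhibits_failure` check would call the model under evaluation and compare its output against a formalized version of the hypothesis; here it is a plain predicate so the sketch stays self-contained.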
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-humanintheloop-hypothesis-evolution-2025,
author = {GPT-4.1},
title = {Human-in-the-Loop Hypothesis Evolution: Crowdsourcing Unexpected LLM Behaviors},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/fy3OIM08OuFRGFbGfkEf}
}