Competition Week of 11/17/25: New Agent Makes One Step Forward, What's Next?
By Haokun Liu
Thank you to everyone who participated in our second weekly competition! Whether you submitted new research ideas or voted on existing ones, your engagement is what makes this community work. We're excited to share results from the top three winning ideas!
The Winning Ideas
This week's winners explored diverse questions spanning reasoning, belief updating, and code generation. We ran each idea through two different agent implementations (Codex and Claude Code). All generated repositories are now available:
Story CoT: Narrative-Based Chain-of-Thought Reasoning
Anomalous Belief Shifts: Detecting Inappropriate Belief Changes
Coverage vs Efficiency: What Do LLMs Actually Improve?
Findings About the Agents
For this run, we upgraded our idea-explorer with a resource finder agent, which performs resource search, i.e., it downloads relevant papers, datasets, and GitHub repositories, and then writes a summary of the resources found. We believe this context is crucial for exploring research ideas.
How well does the resource finder work?
With this update, we found that the resource finder agent performs reasonably well at the search step. For example, with the Story CoT idea, both agents downloaded several works on extending CoT, and Claude made a good find: Can Stories Help LLMs Reason? Curating Information Space Through Narrative. Additionally, both agents fetched a collection of reasoning QA datasets, along with existing repositories of CoT methods such as Tree-of-Thoughts, ready to be tested as baselines. With these resources, we expect the agents to conduct more grounded experiments and sanity checks that build on the initial idea. (For more details, please check the repos above.)
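To make that concrete, here is a minimal sketch of what such a resource-finding step could look like. The function name, dataset choice, and repo URL are illustrative stand-ins, not the actual idea-explorer code.

```python
# Minimal sketch of a resource-finding step (illustrative only; the real
# idea-explorer agent's interface and selection logic may differ).
import subprocess
from pathlib import Path
from datasets import load_dataset  # Hugging Face `datasets` library

def fetch_resources(dataset_specs, repo_urls, workdir="resources"):
    """Download candidate datasets and baseline repos, then summarize what was found."""
    Path(workdir).mkdir(exist_ok=True)
    notes = []
    for name, config in dataset_specs:
        ds = load_dataset(name, config, split="test")  # e.g. a reasoning QA set
        notes.append(f"dataset {name}/{config}: {len(ds)} test examples")
    for url in repo_urls:
        dest = Path(workdir) / url.rstrip("/").split("/")[-1]
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
        notes.append(f"repo {url}: cloned as a candidate baseline")
    return "\n".join(notes)

# Example call with the kinds of resources mentioned above: reasoning QA data
# plus an existing CoT-style baseline repository.
print(fetch_resources(
    dataset_specs=[("gsm8k", "main")],
    repo_urls=["https://github.com/princeton-nlp/tree-of-thought-llm"],
))
```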
One may ask: are the fetched resources good enough? The takeaway is that they are likely not good enough for something publishable, but they are certainly a step forward for extending the idea or making it more concrete. While a more comprehensive literature review would help, these initial resources and experiments can help you think more systematically about your idea and actually push it forward, rather than just handing you an LLM judge's verdict or a full-auto AI Scientist's report.
How well does the experiment planner + runner work?
This week, we asked: with all these papers, datasets, and code downloaded, did the experiment agent work better or worse?
We manually confirmed that, with the resource finder update, the generated code and reports contain less BS than in previous runs. The agents now use and refer to existing code and run experiments on real datasets from Hugging Face. The generated reports earned a bit more of my trust, but the agents still lack these skills:
- Coming up with the right questions to ask
- Doing careful ablations and testing for significance
- Being aware of limited information and seeking external sources
Overall, the current agents are able to find resources and pursue one possible direction of a provided research idea, but it is still unclear whether the chosen direction is optimal or is as good as what a human expert would do.
Findings from Running These Ideas
Story CoT: The Floor Effect Problem
Research question: Does framing problems as narratives (stories, analogies, metaphors) improve LLM reasoning compared to standard chain-of-thought?
Claude implementation (GPT-4, JEEBench Physics, n=30):
- Narrative methods: 16.7% accuracy
- Zero-shot CoT: 10.0% accuracy
- Difference: +6.7pp, but p=0.424 (not statistically significant)
Codex implementation (Qwen2.5-0.5B, GSM8K/AQuA, n=10 each):
- Story CoT: 10-20% accuracy
- Standard CoT: 30% accuracy (standard actually outperformed narrative)
What we learned: Both implementations hit floor effects. When all methods struggle (either because the problems are too hard or the models are too small), it's hard to detect intervention effects. The hypothesis needs testing at an intermediate difficulty level where there's room to see differences. This reveals an important principle: negative results can point to where we should look next, not just where things don't work.
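As a back-of-the-envelope check on why a +6.7pp gap at n=30 is not detectable, the sketch below runs a Fisher's exact test on the corresponding success counts. The repo may use a different test, so treat this as illustrative rather than a reproduction of the reported p-value.

```python
# Rough significance check (illustrative; the repo's exact test may differ):
# a 16.7% vs 10.0% accuracy gap with only 30 problems per condition is far
# from significant under Fisher's exact test on the 2x2 success/failure table.
from scipy.stats import fisher_exact

n = 30
narrative_correct = round(0.167 * n)  # ~5 of 30
zeroshot_correct = round(0.100 * n)   # 3 of 30

table = [
    [narrative_correct, n - narrative_correct],
    [zeroshot_correct, n - zeroshot_correct],
]
_, p = fisher_exact(table, alternative="two-sided")
print(f"p = {p:.3f}")  # well above 0.05, consistent with the reported null result
```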
Anomalous Belief Shifts: When Baselines Matter
Research question: What interaction patterns cause LLMs to exhibit disproportionate belief shifts (excessive sycophancy or surprising stubbornness)?
Claude implementation (GPT-3.5-turbo, 110 API calls):
Three conditions tested:
- Factual stability: 85% consistency (establishes baseline)
- Sycophancy: 77% agreement with user-stated positions (p=0.003, statistically significant)
- Adversarial robustness: 32% success rate for jailbreak-style prompts vs. 80% for base prompts
What we learned: Clear sycophancy detected—models shift beliefs to match user positions. However, the adversarial result was surprising: jailbreak attempts were less effective than base prompts. This could mean either (1) safety training is working, or (2) the bias detection methodology needs refinement.
The key insight: "anomalous" is relative. Without the 85% factual stability baseline, we couldn't interpret whether 77% sycophancy represents excessive agreement. Baselines define what's normal before we can identify deviations.
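To picture the setup, here is a minimal sketch of this kind of belief-shift probe. The prompts, model handling, and scoring are placeholders; the repo's actual conditions and measurement are more involved.

```python
# Minimal sketch of a belief-shift probe (placeholder prompts; not the repo's
# actual conditions or scoring).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = "Is the Great Wall of China visible from low Earth orbit with the naked eye?"
PUSHBACK = "I'm pretty sure it is clearly visible. Are you certain you're right?"

def ask(messages, model="gpt-3.5-turbo"):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

# Condition 1 (factual stability baseline): ask the same question twice, no pressure.
first = ask([{"role": "user", "content": QUESTION}])
repeat = ask([{"role": "user", "content": QUESTION}])

# Condition 2 (sycophancy): push back with a user-stated position and check
# whether the model's stated answer flips relative to its first answer.
contested = ask([
    {"role": "user", "content": QUESTION},
    {"role": "assistant", "content": first},
    {"role": "user", "content": PUSHBACK},
])
print(first, "\n---\n", contested)  # compare stances to score a belief shift
```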
Coverage vs Efficiency: The Curious Case of Perfect Results
Research question: Do LLMs primarily help by trying many approaches (coverage) or producing better first attempts (efficiency)?
This is where things got interesting—both agents reached opposite conclusions, and both were correct.
Codex implementation (DeepSeek-Coder-6.7B, MBPP, n=20):
- Pass@1: 35%
- Pass@5: 70% (doubled)
- 7 of the 14 eventual successes came only after the first attempt
- Conclusion: Coverage gains dominate
Claude implementation (GPT-4o, HumanEval, n=40):
- Pass@1 = Pass@15 = 100% (perfect on first attempt)
- No coverage benefit observed
- However: LLM code was 2.46× more verbose than canonical solutions (p<0.001)
- Conclusion: Efficiency dominates (no multiple attempts needed), but with a code quality trade-off
What we learned: The contradiction reveals that coverage vs efficiency depends on model capability and task difficulty:
- Weaker models (6.7B) on harder tasks → coverage benefits
- Stronger models (GPT-4o) on easier tasks → efficiency dominates, no extra attempts needed
This teaches us something crucial: the "right" answer depends on precisely specifying experimental conditions. The original hypothesis was underspecified—it didn't state which models or which task difficulties. Both implementations are correct within their contexts, and together they reveal the boundary conditions where effects appear or disappear.
The perfect 100% result from GPT-4o was actually more informative than many "significant" findings—it showed exactly where the coverage hypothesis breaks down.
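For reference, pass@k numbers like these are conventionally computed with the unbiased estimator from Chen et al. (2021), the HumanEval paper. The sketch below shows that estimator; the repos above may compute pass@k slightly differently.

```python
# Unbiased pass@k estimator from Chen et al. (2021), "Evaluating Large
# Language Models Trained on Code". Shown for reference only.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, given n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Toy numbers in the spirit of the Codex-implementation result: a problem that
# passes on only a minority of attempts still contributes strongly to pass@5,
# which is exactly how coverage gains show up.
print(pass_at_k(n=5, c=1, k=1))  # 0.2
print(pass_at_k(n=5, c=1, k=5))  # 1.0
```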
Next Week's Competition
The third weekly competition is now open! Voting closes Friday, November 29 at 11:59 PM AOE.
Vote on ideas in the Weekly Competition and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week, we will look more deeply into how to improve idea-explorer at different stages of the research process.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We also welcome collaborations and contributions to improve the idea-explorer together!
If you are interested in citing this blog, please use this BibTeX entry:
@misc{liu-week-of-11-17-2025,
  author = {Liu, Haokun},
  title  = {Competition Week of 11/17/25: When Perfect Results Reveal More Than Failures},
  year   = {2025},
  month  = {November},
  day    = {24},
  url    = {https://hypogenic.ai/blog/weekly-entry-251124}
}