Week of 11/24/25-11/30/25: Three Agents, Three Perspectives
By Haokun Liu
Thank you to everyone who participated in our third weekly competition! Whether you submitted new research ideas or voted on existing ones, your engagement drives this community forward. This week brought fascinating questions about meta-cognition, dataset robustness, and multi-agent reasoning. We're excited to share results from the top three winning ideas!
The Winning Ideas
This week's winners explored diverse questions across LLM self-awareness, evaluation benchmarks, and Theory of Mind. We ran each idea through three different agent implementations (Claude Code, Codex, and Gemini). All generated repositories are now available:
Can LMs predict their own thinking tokens? by David Heineman
Are there any finetuning proof datasets currently? by Ari Holtzman
Do AI Agents Form Mental Models of Each Other? by Peiyu Chen
Findings About the Agents
This week, we ran our idea-explorer with three coding agent backbones: Claude Code, Codex, and Gemini-3-pro. All three agents explored the ideas reasonably well, downloading the relevant resources and running plausible preliminary experiments. Notably, the agents automatically fine-tuned small models to explore the finetuning-proof dataset idea! This suggests the agents are capable of carrying out model training and analysis to test research hypotheses.
However, we do observe that the three agents can fail at different stages of the idea exploration pipeline. For example, Claude sometimes loses track of the working directory and writes output files to the wrong places, Codex can get stuck at the resource-finding stage and keep digging down a rabbit hole, and Gemini does not output reports consistently, i.e., it does not fully follow the research instructions, potentially due to memory issues. Humans make these mistakes too, but they are certainly suboptimal for an automatic research agent. Beyond the question of how capable the agents are, it remains unclear how to minimize error rates at every step.
Given these observations, another natural question comes up: how can we trust AI results, or be confident that the AI has done due diligence on the research questions? Furthermore, how can we make AI-explored results more helpful for humans? I think there are two main directions to pursue:
- Define well-specified evaluation metrics for "good research behavior and results," which can be used as signals to guide research agents. Existing work includes MechEvalAgents.
- It is unlikely that any AI researcher or AI scientist will be good enough to automate full research discoveries, given the risk of mistakes and the lack of human evaluation and feedback. A more plausible approach is to use AI agents to build playgrounds that human researchers can quickly pick up, accelerating their exploration coverage.
Findings from the Ideas
Can LMs Predict Their Own Thinking Tokens?: The Meta-Cognition Paradox
Research question: Can language models predict how many tokens they'll generate during chain-of-thought reasoning before actually solving a problem?
This question hits at something fundamental: do LLMs have any awareness of their own reasoning process? The answer turns out to be yes—with a curious twist.
Claude implementation (GPT-4 Turbo, GSM8K, n=30):
- Strong correlation between predicted and actual tokens: r = 0.707 (p = 0.033)
- But systematic underprediction: Models predict ~2.3x fewer tokens than they actually generate
- After calibration: 19.8% MAPE, 66.7% accuracy within 30 tokens
- Problem length has zero correlation with thinking tokens (r = −0.03)
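To make these numbers concrete, here is a minimal sketch of how the metrics above can be computed; the (predicted, actual) token counts below are hypothetical, not the agent's real GSM8K data.

```python
# Minimal sketch: correlation, calibration, and error metrics for
# self-predicted thinking-token counts. The data below is hypothetical.
import numpy as np
from scipy import stats

predicted = np.array([80, 120, 60, 200, 150])  # model's own pre-generation estimates
actual = np.array([190, 260, 140, 470, 330])   # tokens actually generated

r, p = stats.pearsonr(predicted, actual)       # does the model rank difficulty correctly?
scale = np.median(actual / predicted)          # systematic underprediction factor (~2.3x here)
calibrated = predicted * scale                 # single multiplicative correction
mape = np.mean(np.abs(calibrated - actual) / actual) * 100
within_30 = np.mean(np.abs(calibrated - actual) <= 30) * 100
print(f"r={r:.2f} (p={p:.3f}), scale={scale:.1f}x, "
      f"MAPE={mape:.1f}%, within 30 tokens: {within_30:.0f}%")
```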
Codex implementation (Qwen2.5-0.5B, GSM8K, n=20):
- Weak calibration: MAE ≈ 198 tokens, correlation ≈ 0.14
- 83% of cases exceeded predicted budget
- Zero task accuracy—the small model couldn't solve the problems at all
What we learned:
The meta-cognitive capability is real but uncalibrated. GPT-4 knows which problems are harder (hence the strong correlation), but consistently underestimates how verbose it will be. The 2.3x underprediction is remarkably consistent, suggesting models might be predicting "reasoning steps" rather than "words per step."
More surprisingly, problem length tells you nothing about task complexity. A short problem like "Find the 100th prime number" could require extensive reasoning, while a long word problem might involve simple arithmetic. This is a key lesson: superficial features (length, number of digits) don't capture true reasoning complexity—you need the model's own assessment.
The practical implications are significant: with calibration, LLMs can predict latency and cost before generation, enabling better user experience (meaningful progress bars!) and resource allocation. But the small model's complete failure reveals that meta-cognition is an emergent capability—it requires sufficient model capacity to reason about one's own reasoning.
Are There Any Finetuning Proof Datasets?: The Spectrum of Resistance
Research question: Which datasets successfully resist fine-tuning, and are any truly "finetuning-proof"?
Three agents tackled this from different angles, revealing that "finetuning-proof" isn't binary—it's a spectrum.
Claude implementation (GPT-4o + GPT-4, MMLU-CF & GSM-Symbolic):
- GSM-Symbolic: 38-46 percentage point drops from contaminated GSM8K (~92% → 46-54%)
- MMLU-CF: 8-18 percentage point drops from contaminated MMLU (86-88% → 68-80%)
- Conclusion: Symbolic generation (GSM-Symbolic) > Contamination-free rewriting (MMLU-CF)
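For intuition, here is a minimal sketch of the symbolic-generation idea: keep the problem's logical structure fixed while randomizing surface features such as names and numbers. The template and value ranges are illustrative, not taken from GSM-Symbolic itself.

```python
# Sketch of GSM-Symbolic-style perturbation: vary surface features
# (names, numbers) while preserving the problem's logical structure.
import random

TEMPLATE = ("{name} has {a} apples and buys {b} more bags with {c} apples each. "
            "How many apples does {name} have now?")

def make_instance(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # ground truth follows from the template's structure
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, ans = make_instance(rng)
    print(q, "->", ans)
```

A model that truly reasons scores the same on every instance; a model that memorized the original GSM8K surface forms drops sharply.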
Codex implementation (Flan-T5-small, MMLU-Pro & BBH):
- Fine-tuning on 400 examples: 10.5% → 14.0% (p=0.29, not significant)
- BBH tasks stayed very low: 0-44% accuracy
- Conclusion: Hardened benchmarks resist small-scale supervised fine-tuning
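The write-up doesn't say which test produced p=0.29, so here is one plausible way such a check could look, assuming Fisher's exact test and a hypothetical 200-question eval set.

```python
# Hedged sketch: is a 10.5% -> 14.0% accuracy change significant?
# Counts assume a hypothetical 200-question eval set; the agent's
# actual test and sample size may differ.
from scipy.stats import fisher_exact

n = 200
before_correct = 21   # 10.5% of 200
after_correct = 28    # 14.0% of 200
table = [[after_correct, n - after_correct],
         [before_correct, n - before_correct]]
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2f}")  # a large p means the gain is consistent with noise
```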
Gemini implementation (TinyLlama-1.1B, Inverse Scaling):
- Surprising result: LoRA fine-tuning worked! 46% → 80% on redefine-math
- Conclusion: "Inverse Scaling" tasks resist zero-shot scaling but NOT fine-tuning
What we learned:
The contradiction across agents is the real finding. Different mechanisms of "resistance" exist:
- Contamination resistance (GSM-Symbolic, MMLU-CF): Detects memorization vs. genuine capability by changing surface features while preserving structure. The 38-46 point drops on GSM-Symbolic are damning evidence that frontier models have memorized problem patterns.
- Small-scale fine-tuning resistance (MMLU-Pro, BBH): These adversarial benchmarks require genuine reasoning that can't be unlocked with just 400 training examples and a small model.
- Zero-shot resistance ≠ fine-tuning resistance (Inverse Scaling): The Gemini finding challenges assumptions; tasks that resist larger models can still yield to gradient-based adaptation. The "strong prior" (e.g., arithmetic intuition) can be overridden with supervision.
The practical takeaway: Traditional benchmarks (GSM8K, MMLU) are heavily contaminated. High scores reflect memorization, not reasoning. Researchers must adopt contamination-resistant benchmarks (GSM-Symbolic, MMLU-CF) for honest evaluation. But even "resistant" datasets exist on a spectrum—symbolic generation provides the strongest defense.
Do AI Agents Form Mental Models of Each Other?: Framework vs. Findings
Research question: Do agents with explicit opponent modeling outperform dialogue-only agents in social deduction games like Werewolf?
This idea asked whether Theory of Mind capabilities in LLMs translate to strategic multi-agent scenarios. The implementations reveal a critical lesson about research methodology.
Claude implementation (Werewolf game, simulated agents):
- Complete framework implemented: Baseline vs. Opponent-Modeling agents
- Belief tracking: 71.1% accuracy (significantly above 50% random, p < 0.001)
- Win rates: 42.1% (opponent-modeling) vs. 38.1% (baseline)
- Critical caveat: Used simulated LLM behavior due to API issues—proof-of-concept only
Codex implementation (Qwen2.5-0.5B, AIWolf logs):
- Dialogue-only: 50% hit rate identifying werewolf
- Belief-conditioned: 38% hit rate (worse!)
- Heuristic priors based on accusations hurt performance
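To illustrate the failure mode, here is a minimal sketch (with a hypothetical dialogue format) of the kind of accusation-counting prior described above:

```python
# Sketch of a naive accusation-count belief: every accusation raises the
# accused player's werewolf probability, ignoring who is speaking and why.
from collections import Counter

def accusation_prior(accusations: list[tuple[str, str]]) -> dict[str, float]:
    """accusations: (speaker, accused) pairs extracted from dialogue."""
    counts = Counter(accused for _, accused in accusations)
    total = sum(counts.values())
    return {player: n / total for player, n in counts.items()}

# A deflecting werewolf dominates the count, so the prior points at a villager:
print(accusation_prior([("wolf", "villager1"), ("wolf", "villager1"),
                        ("villager2", "wolf")]))
# villager1 gets ~0.67, while the wolf gets only ~0.33
```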
What we learned:
The Claude agent built a reasonable design—a complete experimental framework with rigorous evaluation metrics, statistical analysis pipeline, and comprehensive documentation. But without real LLM APIs, it couldn't produce scientific findings. The proof-of-concept artifacts (100% werewolf wins, 1-round games) clearly indicate simulation limitations.
The Codex result, while using a weak model, reveals something important: naive opponent modeling can hurt. Simply counting accusations and treating them as beliefs led the model astray. Effective opponent modeling requires understanding reliability, context, and strategic deception—not just surface-level cue tracking.
This highlights a crucial research principle: implementation quality ≠ empirical validity. The Claude framework is production-ready and valuable for future research. The Codex experiment, despite using a small model, provides an actual empirical data point: crude belief representations don't automatically help.
The broader lesson: Social reasoning requires capturing conversational nuance. Dialogue-only baselines can be competitive when explicit models are poorly calibrated. To see benefits from opponent modeling, we need either stronger models (GPT-4, Claude) or more sophisticated belief update mechanisms (Bayesian inference over roles, not just accusation counts).
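A minimal sketch of what "Bayesian inference over roles" could mean here, with illustrative likelihoods: weight each accusation by the estimated reliability of its speaker rather than counting it as one vote.

```python
# Hedged sketch: Bayes update of P(accused is werewolf) after one accusation.
# The "reliability" likelihoods are illustrative placeholders.

def update_belief(prior: float, speaker_reliability: float) -> float:
    """P(accuse | accused is wolf) = reliability; P(accuse | not) = 1 - reliability."""
    like_wolf = speaker_reliability
    like_not = 1.0 - speaker_reliability
    evidence = like_wolf * prior + like_not * (1.0 - prior)
    return like_wolf * prior / evidence

belief = 0.25                        # uniform prior over four suspects
belief = update_belief(belief, 0.8)  # trusted villager accuses: belief rises (~0.57)
belief = update_belief(belief, 0.3)  # suspected wolf accuses: belief falls (~0.36)
print(round(belief, 3))
```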
Next Week's Competition
The fourth weekly competition is now open! Voting closes Friday, December 6 at 11:59 PM AOE.
Browse ideas with the "Weekly Competition" tag on IdeaHub and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week demonstrated the value of running multiple agent implementations—each brings a different lens, revealing facets of the research question that a single approach would miss. The meta-cognitive underprediction, the spectrum of dataset resistance, and the framework vs. findings tension are all insights that emerged from diverse approaches to the same questions.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We also welcome collaborations and contributions to improve the idea-explorer together!
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-11-24-2025,
  author = {Liu, Haokun},
  title = {Week of 11/24/25-11/30/25: Three Agents, Three Perspectives},
  year = {2025},
  month = {December},
  day = {01},
  url = {https://hypogenic.ai/blog/weekly-entry-251124}
}