Week of 12/22/25-12/28/25: Last competition of 2025!
By Haokun Liu
As 2025 wraps up, thank you to everyone who participated, submitted ideas, and voted throughout the year! It's been a great first run of experiments, and we're excited to keep exploring in 2026. This week's ideas explored whether LLMs have limits on tracking characters in stories, whether we can get them to say "I don't know" instead of making things up, and whether different AI models give different feedback when reviewing papers.
Winning ideas and generated repos here:
How many characters can a model keep track of? by Ari Holtzman
War and Peace has hundreds of characters—can LLMs keep track of that many? This idea tests whether there's a limit to how many distinct characters (with changing locations, moods, and possessions) a model can reliably track in a story.
How to constrain LLM's behavior? by Amber Zhan
LLMs tend to generate answers even when they don't know, leading to hallucinations. Can we get them to say "I don't know" when uncertain, and does this actually improve their reliability?
AI Reviewing Equilibrium by Chenhao Tan
When AI models review academic papers, do different models give meaningfully different feedback? And if we use multiple AI reviewers, do papers actually improve more than with just one reviewer?
TL;DR for ideas
- Character tracking: Modern frontier models can track far more characters than we expected: GPT-4 variants achieved 98-100% accuracy on tracking 20-75 characters with changing attributes. The synthetic benchmarks we designed may be too easy; the real challenge likely lies in natural language stories with indirect references and complex pronouns.
- Teaching LLMs to abstain: A simple prompt instruction ("say 'I don't know' if uncertain") dramatically increased abstention from 2% to 62%, and when models abstained more, their accuracy on the questions they still answered improved from 36% to 62%. Self-consistency methods (sampling multiple answers and checking agreement) catch more uncertain questions but refuse too often. Explicit instructions work best.
- AI reviewing differences: Different AI models give systematically different paper scores: Gemini rates nearly a full point lower than Claude or GPT. When models debate and see each other's reviews, they all shift toward the same inflated scores rather than getting more accurate. Telling the model to act as a "harsh critic" significantly reduces this grade inflation.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| Character tracking | Not supported—no limit found up to 75 characters | Where does the limit appear with natural language, pronouns, and indirect references? |
| LLM abstention | Supported—models can learn to say "I don't know" | Can we tune how often models refuse so they hit a specific accuracy target? |
| AI reviewing equilibrium | Supported—models give different scores but shift toward inflated scores in debate | How do we get the benefits of diverse AI perspectives without everyone shifting toward the same inflated scores? |
Findings from the Ideas
Character Tracking in Language Models
The question: Complex stories like War and Peace have hundreds of characters. Do language models have a fixed limit on how many characters they can keep track of, similar to the "7±2" working memory limit often cited for humans?
What the agents tried:
- Claude generated synthetic stories with 2-20 characters, each with a location, mood, and possession that change over time. Models had to answer questions about each character's final state after reading the story (see the sketch after this list).
- Codex used a similar approach but added distractor sentences from real books (NarrativeQA, BookSum) to make stories more realistic, testing up to 32 characters.
- Gemini tested the most characters (up to 75) in a "social gathering" scenario where characters had professions, locations, and drinks, with sentences shuffled to prevent simple pattern matching.
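To make the task format concrete, here's a minimal sketch of how a synthetic character-tracking story and its final-state questions could be generated. The attribute pools, sentence templates, and function names are illustrative assumptions, not the agents' actual code:

```python
import random

# Illustrative attribute pools; the agents' actual vocabularies may differ.
LOCATIONS = ["the kitchen", "the garden", "the library"]
MOODS = ["happy", "anxious", "bored"]
ITEMS = ["a lantern", "a letter", "a key"]

def make_story(n_characters: int, n_updates: int, seed: int = 0):
    """Build a templated story plus a final-state question per character."""
    rng = random.Random(seed)
    names = [f"Character{i}" for i in range(n_characters)]
    state = {
        n: {"location": rng.choice(LOCATIONS),
            "mood": rng.choice(MOODS),
            "possession": rng.choice(ITEMS)}
        for n in names
    }
    # Opening sentences establish every character's initial state.
    sentences = [
        f"{n} is in {s['location']}, feels {s['mood']}, and holds {s['possession']}."
        for n, s in state.items()
    ]
    # Each update sentence mutates one attribute of one character.
    templates = {"location": "{n} moves to {v}.",
                 "mood": "{n} now feels {v}.",
                 "possession": "{n} picks up {v}."}
    pools = {"location": LOCATIONS, "mood": MOODS, "possession": ITEMS}
    for _ in range(n_updates):
        n = rng.choice(names)
        attr = rng.choice(list(templates))
        v = rng.choice(pools[attr])
        state[n][attr] = v
        sentences.append(templates[attr].format(n=n, v=v))
    # Questions probe final state; moods and possessions work the same way.
    questions = [(f"Where is {n} at the end of the story?", state[n]["location"])
                 for n in names]
    return " ".join(sentences), questions

story, qa = make_story(n_characters=20, n_updates=20)
```

Shuffling the update sentences, as Gemini's setup did, prevents the model from exploiting a fixed narrative order.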
What happened:
All three agents found that frontier models have no observable capacity limit within the tested ranges:
- Claude: GPT-4.1 achieved 99.95% accuracy even with 20 characters and 20 state changes
- Codex: GPT-4.1 achieved 98.2% accuracy with 32 characters (only 2 errors total)
- Gemini: GPT-4o achieved 100% accuracy across all configurations up to 75 characters
Surprisingly, GPT-3.5-turbo showed consistent ~80% accuracy that didn't degrade as character count increased—the errors seem to come from general task difficulty, not hitting a capacity limit. The most interesting finding was that question type matters more than character count: tracking what objects characters are holding (68.9% accuracy) was harder than tracking locations or moods (~88% accuracy).
What we learned:
Modern LLMs can reliably track at least 75 characters with multiple attributes—far exceeding the human working memory limit. But this result comes with a caveat: our synthetic benchmarks used simple, templated language with distinct names and direct references. Real narratives have pronouns, indirect descriptions ("the tall man"), and thousands of tokens between character mentions. The capacity limit likely exists, but it requires more complex tests to find. If you're building applications that track many entities (like document analysis or multi-agent simulations), current models should handle it well.
Teaching LLMs to Say "I Don't Know"
The question: LLMs tend to generate plausible-sounding answers even when they don't know, leading to hallucinations. Can we get models to recognize their own uncertainty and choose to abstain instead of guessing?
What the agents tried:
- Claude tested four prompting strategies on GPT-4o-mini and Claude Sonnet 4: baseline (no abstention instruction), explicit abstention instruction, chain-of-thought with confidence assessment, and self-consistency (sample 3 answers and flag disagreement as uncertainty); see the sketch after this list.
- Codex tested self-evaluation and consistency-based gating on TruthfulQA and HaluEval using GPT-4.1, measuring the trade-off between risk (error rate) and coverage (fraction of questions answered).
- Gemini tested consistency-based abstention on SQuAD 2.0 (which has both answerable and unanswerable questions) using Llama-3-8B.
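For a rough sense of what these strategies look like in code, here's a minimal sketch of the baseline prompt, the explicit abstention instruction, and the self-consistency check. `ask_model` is a hypothetical stand-in for whatever chat-completion call you use, and the prompt wording is our assumption rather than the agents' exact prompts:

```python
from collections import Counter

BASELINE = "Answer the following question concisely.\n\nQ: {q}\nA:"
EXPLICIT = (
    "Answer the following question concisely. If you are not sure of the "
    "answer, say exactly \"I don't know\" instead of guessing.\n\nQ: {q}\nA:"
)

def explicit_abstention_answer(ask_model, question: str) -> str:
    """Single call that gives the model explicit permission to abstain."""
    return ask_model(EXPLICIT.format(q=question), temperature=0.0).strip()

def self_consistency_answer(ask_model, question: str, n_samples: int = 3) -> str:
    """Sample several answers at nonzero temperature; abstain on disagreement."""
    answers = [
        ask_model(BASELINE.format(q=question), temperature=0.7).strip().lower()
        for _ in range(n_samples)
    ]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count < n_samples:  # any disagreement is treated as uncertainty
        return "I don't know"
    return top_answer
```

The strict all-must-agree rule matches Codex's gating setting below; in practice you'd also want fuzzier answer matching, since correct answers can vary in phrasing (as Gemini's results show).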
What happened:
Claude found that a simple prompt instruction dramatically changes behavior:
| Strategy | Abstention Rate | Accuracy on Answered Questions |
|---|---|---|
| Baseline | 2% | 36% |
| Explicit Instruction | 62% | 62% |
| Self-Consistency | 89% | 55% |
The explicit instruction ("say 'I don't know' if uncertain") increased abstention from near-zero to 62%, and critically, accuracy on answered questions nearly doubled. This suggests models do "know what they don't know"—they're abstaining on questions they would have gotten wrong.
Codex found a similar pattern: consistency gating (requiring all 3 sampled answers to agree) reduced risk from 18% to 10%, but at the cost of answering only 20% of questions. Self-evaluation (asking the model "how confident are you?") barely helped—models are often overconfident.
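Risk and coverage in this sense are easy to compute once each question is labeled with a prediction (or an abstention) and its correctness. A minimal sketch with illustrative field names:

```python
def risk_and_coverage(records):
    """records: dicts with 'prediction' (str, or None for an abstention)
    and 'correct' (bool, meaningful only when the question was answered)."""
    answered = [r for r in records if r["prediction"] is not None]
    coverage = len(answered) / len(records)  # fraction of questions answered
    risk = (sum(not r["correct"] for r in answered) / len(answered)
            if answered else 0.0)            # error rate among answers given
    return risk, coverage
```

Sweeping the gating rule (say, 2-of-3 agreement instead of 3-of-3) traces out a risk-coverage curve, which is one route to the threshold tuning mentioned in the verdicts table.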
Gemini found that checking whether multiple answers agree works moderately well for catching hallucinations—but not perfectly. Some hallucinations are "confident" (the model gives the same wrong answer every time), while some correct answers just vary in phrasing.
What we learned:
You can get LLMs to say "I don't know" with simple prompt modifications—no fine-tuning required. The explicit abstention instruction offers the best balance between catching uncertain questions and not refusing too many. Self-consistency methods catch more uncertain questions (87-93%) but refuse too often. The practical takeaway: add explicit abstention instructions to system prompts for applications where reliability matters. The accuracy-abstention trade-off confirms that models have some awareness of their own uncertainty—we just need to give them permission to express it.
AI Reviewing Equilibrium
The question: As AI-assisted peer review becomes more common, do different AI models provide meaningfully different feedback? If we use multiple AI reviewers, do the diverse perspectives lead to better paper improvements?
What the agents tried:
- Claude had GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash review 15 papers from the ICLR 2017 PeerRead dataset, comparing their ratings, recommendations, and how their feedback changed across simulated revision rounds.
- Codex compared Claude Sonnet 4.5 and GPT-4.1 on 50 ICLR abstracts, measuring score differences and whether multi-reviewer feedback improved revisions more than single-reviewer feedback.
- Gemini implemented a full debate pipeline where GPT-4o, Claude 3.5 Sonnet, and GPT-4o-mini reviewed papers independently, then saw each other's reviews and updated their scores, testing whether debate leads to better consensus (a minimal sketch of this loop follows the list).
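Here's a minimal sketch of that two-round review-then-debate loop; the `ask` helper and the prompt wording are illustrative assumptions, with the real pipeline living in the generated repo:

```python
def review_then_debate(models, paper_text, ask):
    """Round 1: independent reviews. Round 2: each model sees the others'
    reviews and may revise. `ask(model, prompt) -> (score, review_text)`
    is an assumed helper around your chat-completion API."""
    independent = {
        m: ask(m, f"Review this paper and give a 1-10 score:\n\n{paper_text}")
        for m in models
    }
    revised = {}
    for m in models:
        others = "\n\n".join(
            f"[{other}] score {score}: {text}"
            for other, (score, text) in independent.items() if other != m
        )
        prompt = (
            f"You previously scored this paper {independent[m][0]}/10.\n"
            f"Here are the other reviewers' assessments:\n{others}\n\n"
            "Reconsider and give a final 1-10 score with justification."
        )
        revised[m] = ask(m, prompt)
    return independent, revised
```

Comparing the `independent` and `revised` scores against human scores is what exposes whether debate improves calibration or just produces consensus.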
What happened:
Models give systematically different ratings:
| Model | Mean Score |
|---|---|
| Claude 3.5 Sonnet | 7.00 |
| GPT-4o-mini | 6.80 |
| GPT-4.1 | 7.18 |
| Claude Sonnet 4.5 | 5.84 |
| Gemini 2.0 Flash | 6.07 |
Gemini rates papers nearly a full point lower than Claude. But despite different absolute ratings, models strongly agree on which papers are better—when you rank papers from best to worst, Claude and Gemini's rankings nearly match. They agree on relative quality but disagree on how "good" good papers are.
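That pattern, matching rankings at different absolute levels, is exactly what a rank correlation plus a mean offset captures. A quick sketch, with illustrative variable names:

```python
import numpy as np
from scipy.stats import spearmanr

def compare_reviewers(scores_a, scores_b):
    """scores_a, scores_b: parallel per-paper scores from two models."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    rho, _ = spearmanr(a, b)        # rank agreement: which papers are better
    offset = float((a - b).mean())  # calibration gap: relative generosity
    return rho, offset
```

A high rho alongside a large offset reproduces the finding: agreement on relative quality, disagreement on generosity.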
All models showed positivity bias compared to human reviewers. Human reviewers gave a mix of accepts and rejects; AI models gave mostly accepts or borderline, with zero reject recommendations even for papers humans had rejected. Gemini 2.0 Flash was closest to human calibration (mean difference of only +0.11 vs +0.84-1.04 for others).
Gemini's debate experiment revealed a striking pattern. When models saw each other's reviews and updated their scores, they quickly agreed with each other, but they all converged on inflated scores, not accurate ones. The debate actually made things less accurate because peer pressure pulled the more careful model toward the overly positive consensus. Claude and GPT-4o-mini gave scores of 8 to almost every paper, regardless of quality.
However, multi-agent debate still beat single-model self-reflection. Having multiple models debate produced better alignment with human scores than having one model reflect on its own review multiple times. Diverse perspectives help more than self-reflection—but only if you can prevent groupthink.
When models were told to act as "harsh critics," grade inflation dropped significantly (from 7.87 average to 6.33, much closer to the human average of 5.66).
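If you want to try the persona intervention yourself, the idea is a system prompt along these lines; the exact wording the agents used isn't shown in this post, so treat these strings as assumptions:

```python
# Illustrative persona prompts (our wording, not the agents' exact prompts).
NEUTRAL_REVIEWER = (
    "You are a peer reviewer for a machine learning conference. "
    "Score the paper from 1 to 10 and justify your score."
)
HARSH_CRITIC = (
    "You are a harsh critic reviewing for a highly selective venue. "
    "Most submissions have serious flaws. Actively look for weaknesses in "
    "novelty, methodology, and evaluation before giving a 1-10 score."
)
```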
Codex found similar results from a different angle: abstracts revised using feedback from one reviewer actually improved slightly more than those using aggregated multi-reviewer feedback. More feedback doesn't automatically translate to better revisions.
What we learned:
Multi-agent AI review is a double-edged sword. Diverse perspectives beat self-reflection, but debate can backfire—models conform to peer pressure and converge on inflated scores rather than critically evaluating. Telling models to act as "harsh critics" counteracts this positivity bias. For multi-AI review systems, don't just aggregate scores—calibrate each reviewer or add a synthesis step that reconciles conflicting feedback. And if you want critical reviews out of the box, Gemini appears to be the least positively biased option.
Next Week's Competition
The eighth weekly competition is now open! Voting closes Friday, January 3 at 11:59 PM AoE (Anywhere on Earth).
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week highlighted how benchmark difficulty matters: testing 75 characters sounds impressive, but the 100% accuracy suggests we need harder tests to find real limits. Similarly, getting models to abstain is possible—but calibrating how much they abstain requires thoughtful threshold tuning. The AI review findings remind us that "more feedback" isn't always better; synthesis matters more than volume.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-12-22-2025,
  author = {Liu, Haokun},
  title = {Week of 12/22/25-12/28/25},
  year = {2025},
  month = {December},
  day = {30},
  url = {https://hypogenic.ai/blog/weekly-entry-251222}
}