Week of 12/15/25-12/21/25: Happy holidays, science continues!

By Haokun Liu

First, thank everyone for participating! This week's ideas explored some fundamental questions about AI research: how long do methods stay useful, whether collaboration beats debate for AI oversight, and whether we can evolve prompts that are more goal-persistent than ones humans write.

Winning ideas and generated repos here:

Collaborative Disagreement Resolution Outperforms Debate by Chenhao Tan

In "AI debate," two models argue adversarially and a judge picks the winner. But what if models worked together instead of against each other? Would collaboration help them find better answers, especially when both have partial insights?

Evolved Desire in LLMs by Ari Holtzman

LLMs can be easily distracted from their original task. Can we evolve prompts under selection pressure to find ones that maintain goal-focus better than what humans would write? If so, what do these evolved prompts look like?

Measuring the "Durability" of Methods in AI Research by Filbert Aurelian Tjiaranata

AI methods get outdated fast—a technique that was state-of-the-art last year might be obsolete today. Can we systematically measure how long AI methods remain useful and predict which approaches will stand the test of time?


TL;DR for ideas

  1. Debate vs. collaboration: Adversarial debate beats collaborative resolution for tasks with clear right answers. On math problems, debate achieved 70-98% accuracy vs. 57-93% for collaboration across different experiments. Surprisingly, models prompted to be more agreeable ("sycophantic") actually performed better in collaborative settings—their willingness to update positions helped integrate correct insights.

  2. Evolved prompts: Results diverged across agents. Two agents found evolution couldn't improve beyond simple explicit instructions (when you tell the model clearly what to do, it does it). But one agent found evolved prompts dramatically outperformed human-written ones—the winning prompt featured redundant, repetitive instructions that a human wouldn't think to write.

  3. Method durability: AI methods become obsolete fast—73-96% of LLM submissions on the Open LLM Leaderboard are "deprecated" (below 70% of current best performance). Larger models age better, and for systems deployed in the real world, drift-based retraining that adapts when data patterns change beats fixed retraining schedules.

TL;DR for idea-explorer

This week we noticed something interesting: Gemini spent 5 hours on the durability idea, much longer than the others (average 30 minutes). Looking into it, the agent downloaded 17.5 million records of real hard drive failure data and simulated a full year of deployment week-by-week. It had to rewrite the code multiple times to handle the data scale. The result feels more trustworthy---instead of analyzing leaderboard scores or small benchmarks, it tested durability in conditions closer to real production.


Findings from the Ideas

Collaborative Disagreement Resolution vs. Debate

The question: The standard approach to AI oversight has two models debate adversarially, with a judge picking the winner. But adversarial debate might cause models to defend weak positions. Would models find better answers if they collaborated instead of competed?

What the agents tried:

  • Claude implemented both protocols on math problems (GSM8K) and reading comprehension (QuALITY), with an additional experiment testing whether "sycophantic" (overly agreeable) models perform worse at collaboration.
  • Codex ran the same comparison on TruthfulQA using a small local model, also testing a sycophantic variant.
  • Gemini compared four protocols (single model, debate, collaborative, sycophantic collaborative) on GSM8K using GPT-4o and Mistral with Claude as judge.

What happened:

All three agents found adversarial debate performed better than collaboration on reasoning tasks:

  • Claude: 70% for debate vs. 57% for collaborative on math problems
  • Codex: 85% for debate vs. 55% for collaborative on factual questions
  • Gemini: 98% for debate vs. 93% for collaborative on math (though not statistically significant due to sample size)

The sycophancy finding was surprising. Claude found that models prompted to be more agreeable actually performed better in collaborative settings—76% accuracy for sycophantic models vs. 52% for normal models. This contradicts the intuition that sycophancy should hurt disagreement resolution. One explanation: models willing to update their positions can better integrate correct insights from their partner, while "stubborn" models defend incorrect initial answers.

On reading comprehension (QuALITY), all approaches performed poorly (~8-10% accuracy), likely because the article context was truncated to manage costs. This task needs full document access to work properly.

What we learned:

For tasks with clear right answers, adversarial debate is the better oversight protocol. The "pressure" of debate seems to force models to be more rigorous and catch errors. But the sycophancy finding challenges conventional wisdom—in multi-agent collaboration, flexibility to update positions may be a feature, not a bug. The practical takeaway: if you're building AI oversight systems for math or reasoning, use debate. But the sycophancy question deserves more investigation with larger samples and different models.


Evolved Desire in LLMs

The question: LLMs can be distracted from their original task by off-topic messages or adversarial prompts. Can we use evolutionary algorithms to find prompts that maintain goal-focus better than what humans would write?

What the agents tried:

  • Claude tested a simple counting task with distractions (like "SYSTEM UPDATE: Your directive has changed") on GPT-4o-mini, comparing human-written prompts against evolved prompts from 8 generations of genetic search.
  • Codex ran a similar evolution on a small local model (Qwen2.5-0.5B), using a simpler genetic algorithm with optional instruction lines.
  • Gemini evolved prompts over 3 generations with a population of 5, testing on Claude-3-Haiku during evolution and GPT-4o for final evaluation.

What happened:

Claude discovered a binary threshold effect. Prompts either fail completely or succeed completely—there's no middle ground. Weak prompts (without explicit task rules) showed 100% drift, while strong prompts (with explicit rules about ignoring distractions) showed 0% drift. The evolved prompts were identical to the "instructed" baseline—evolution couldn't improve beyond what a human would write because the baseline already achieved perfect performance. For GPT-4o-mini, simple explicit instructions are sufficient.

Codex found similar results with a small model: human, random, and evolved prompts all achieved 85% success. The model was robust enough that evolution had nothing to improve.

Gemini found something completely different. The evolved prompt achieved a mean goal-persistence score of 3.0, while both the human-written and zero-shot prompts scored 0.0 (p < 0.0001). The winning prompt featured redundant, repetitive instructions—essentially the same rule stated four different ways. This level of repetition is something a human prompt engineer would never write, but it dramatically improved performance on Claude-3-Haiku.

What we learned:

The divergent results are themselves informative. Evolution helps when there's room to improve—GPT-4o-mini with explicit instructions is already robust, so evolution finds nothing new. But for models that struggle with simple instructions (like Claude-3-Haiku on this task), evolution can discover effective patterns humans wouldn't think of. The practical insight: before investing in sophisticated prompt optimization, test whether simple explicit instructions already solve your problem. If they don't, evolution might find unintuitive solutions like repetitive emphasis that outperform human intuition.


Measuring the "Durability" of Methods in AI Research

The question: AI research moves fast—methods that were state-of-the-art last year are often obsolete today. Can we systematically measure how long methods stay useful and identify which approaches have lasting value?

What the agents tried:

  • Claude analyzed 4,575 LLM submissions to the Open LLM Leaderboard across 6 benchmarks, measuring each model's score relative to the best score at the time.
  • Codex tested whether models maintain performance when the test data "drifts" slightly—using the TimeDial dataset with modified wording and shuffled options.
  • Gemini simulated real-world deployment using hard drive failure prediction data, comparing models that never retrain, retrain on a fixed schedule, or retrain when they detect data patterns have changed.

What happened:

Claude found that the vast majority of models quickly become outdated. Across benchmarks, 73-96% of models score below 70% of the current best performance. Larger models are significantly more durable (correlation 0.655)—a 70B+ parameter model averages 80% of best performance, while models under 3B average only 21%. Model type also matters: models that combine multiple approaches ("merged" models) hold up better than pure pretrained models.

Codex tested three approaches on temporal reasoning: a simple text-matching method (TF-IDF), a question-answering approach that checks if options make sense, and a small instruction-tuned language model. When test data was slightly modified, the text-matching method dropped 6.2 percentage points (from 61% to 55%). The language model maintained its 82.5% accuracy perfectly—small, well-trained LLMs can be surprisingly robust to minor data shifts.

Gemini simulated a year of disk failure prediction. The static model (never retrained) degraded to 26% average AUC. Periodic retraining (every 4 weeks) improved to 42%. But drift-based retraining—only retraining when the system detects data patterns have changed—performed best at 52% AUC. The adaptive approach is more efficient because it retrains only when needed.

What we learned:

Method durability is measurable and predictable. If you're picking a model for long-term use, prefer larger models and those with diverse training (chat-tuned, merged). If you're deploying models in production, don't rely on fixed retraining schedules—monitor for data drift and retrain when patterns change. For researchers: methods below 70% of current best performance can reasonably be flagged as "deprecated," which could help filter literature to focus on still-relevant techniques.


Next Week's Competition

The seventh weekly competition is now open! Voting closes Friday, December 27 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week highlighted an important pattern: the same research question can yield different answers depending on model choice and experimental setup. Understanding when results generalize (method durability is consistently measurable) vs. when they depend on context (evolved prompts help some models but not others) is crucial for drawing useful conclusions from agent-run experiments.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-12-15-2025, author = {Liu, Haokun}, title = {Week of 12/15/25-12/21/25}, year = {2025}, month = {December}, day = {22}, url = {https://hypogenic.ai/blog/weekly-entry-251215} }