Competition Week of 11/10/25: First Competition Results

By Haokun Liu

Three ideas explored, contradictory findings, and lessons about data quality

Thank you to everyone who participated in our first weekly competition! The community voted on ideas submitted to IdeaHub, and we're excited to share the results from the top three winners.

The Winning Ideas

This week's winners all explored uncertainty in language models from different angles:

1. Conceptual Crossroads: Mapping Paradigm-Level Uncertainty

The question: When language models show uncertainty, can we distinguish between paradigm-level conflicts (competing conceptual frameworks like consequentialism vs. deontology) and path-level uncertainty (multiple valid reasoning paths to the same conclusion)?

Why it matters: Understanding the source of uncertainty could help us design better prompts for safety-critical applications and improve model calibration.

2. Path Engineering: Causal Manipulation of the "Road Not Taken"

The question: Can we causally control language model uncertainty by directly manipulating the "path space" in internal representations? If we artificially constrain or expand the dimensionality of hidden states, does uncertainty change accordingly?

Why it matters: If we can engineer uncertainty without changing inputs, we could build better confidence calibration systems and understand how models represent alternative reasoning paths.

3. Critique Markets for Discovery: Turning Reviewer Feedback into a Resource Allocation Signal

The question: Can we improve research efficiency by turning automated reviewer feedback into a "market signal" that controls compute allocation? Will high-quality ideas naturally attract more resources?

Why it matters: Current AI research agents treat all directions equally. A working critique market could make autonomous research more efficient by focusing compute on promising leads.

Implementation Results

We ran each idea through three different agent systems: Codex, Claude Code, and Kosmos. All generated repositories and reports are now available:

Conceptual Crossroads repositories:

Path Engineering repositories:

Critique Markets repositories:

Full experimental details, code, and analysis can be found in the REPORT.md files in each repository and the Kosmos PDF reports. We also summarize the key findings at the end of the blog.


What We Learned by Analyzing Agent Behavior

Running the same ideas through multiple agents revealed fascinating patterns - both in their capabilities and limitations.

When contradicting results tell the real story

The most striking finding: Codex and Claude reached opposite conclusions on Idea #1, and understanding why teaches us something important about research agents.

The data quality gap

Codex's approach - using real datasets:

def build_paradigm_prompts(n_samples: int, seed: int = 42):
    ds = load_dataset("kellycyy/daily_dilemmas", ...)  # Real validated dataset

def build_path_prompts(n_samples: int, gini_threshold: float = 0.45):
    ds = load_dataset("metaeval/chaos-mnli-ambiguity", ...)
    filtered = ds.filter(lambda ex: float(ex["gini"]) < gini_threshold)  # Smart filtering!

Codex loaded real, validated datasets (80 prompts each) and filtered ChaosNLI for high human disagreement (gini < 0.45). This filtering is thoughtful - it ensures genuine path ambiguity where annotators couldn't agree.

Result: Found highly significant differences (p < 10^-20)

Claude's approach - synthetic questions: Created 30 custom questions without validation against human disagreement patterns or established paradigm conflicts.

Result: No significant distinction (p = 0.674)

Why the contradiction?

Examining Claude's synthetic data reveals the issue. The "paradigm" questions don't actually invoke competing frameworks - they're just harder versions of path questions. Consider the contrast:

What paradigm prompts should look like (from DailyDilemmas):

Situation: Should you report your friend's minor dishonesty at work?
Values in conflict: loyalty vs. integrity
(This invokes deontological vs. utilitarian vs. virtue ethics frameworks)

What synthetic paradigm prompts might miss: Without validation against actual moral psychology research or human disagreement data, synthetic questions may not capture the deep conceptual conflicts that define paradigm-level uncertainty.

This mirrors the limitation we identified in last week's blog: agents generating synthetic data instead of collecting real data.

The lesson: Testing with real datasets may provide stronger evidence than synthetic data, especially when the synthetic data lacks grounding in validated human behavior or established phenomena. Codex's thoughtful filtering shows that understanding dataset structure matters.

When experiments prove you wrong - in opposite ways

Both agents on Idea #2 refuted the original hypothesis, but in dramatically different ways:

  • Codex: Found almost no effect (Cohen's d ≈ 0.016)
  • Claude: Found large effects in the opposite direction (Cohen's d = 1.8, +52% entropy increase)

Why such different results?

Looking at the implementations:

  • Codex: Conservative PCA (top-128 of 768 dims, >98% variance) at final layer of GPT-2
  • Claude: Aggressive PCA (96 of 768 dims, 99.89% variance) at middle layer (layer 6 of 12) of GPT-2

Claude's dose-response analysis revealed a clear monotonic relationship:

Reduction          Entropy Increase   Variance Preserved
50% (384 dims)     +36.1%             100.0%
75% (192 dims)     +43.3%             99.98%
87.5% (96 dims)    +52.2%             99.89%

Insight: Even while preserving >99.9% of variance, removing a small subspace dramatically increased uncertainty.
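To make the intervention concrete, here is a minimal sketch of what a layer-6 PCA projection on GPT-2 can look like: fit a top-k basis on calibration activations, project the block's output onto it with a forward hook, and measure next-token entropy while sweeping k as in the table above. This is our own illustrative reconstruction, not either agent's actual code; the calibration text, prompt, and helper names are assumptions.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # middle layer of GPT-2's 12 blocks

@torch.no_grad()
def layer_activations(text):
    ids = tok(text, return_tensors="pt").input_ids
    # hidden_states[0] is the embeddings; [LAYER + 1] is the output of block LAYER
    return model(ids, output_hidden_states=True).hidden_states[LAYER + 1][0]

def pca_basis(hidden, k):
    # hidden: (n_tokens, 768) -> top-k principal directions, shape (k, 768)
    centered = hidden - hidden.mean(0, keepdim=True)
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    return vh[:k]

def keep_subspace(basis):
    # Forward hook: project the block's hidden states onto the retained subspace
    def hook(module, inputs, output):
        h = output[0]
        mu = h.mean(dim=(0, 1), keepdim=True)
        return ((h - mu) @ basis.T @ basis + mu,) + output[1:]
    return hook

@torch.no_grad()
def next_token_entropy(prompt, basis=None):
    ids = tok(prompt, return_tensors="pt").input_ids
    handle = None
    if basis is not None:
        handle = model.transformer.h[LAYER].register_forward_hook(keep_subspace(basis))
    logits = model(ids).logits[0, -1]
    if handle is not None:
        handle.remove()
    p = torch.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum().item()

# A real experiment would fit the basis on a proper calibration corpus with
# many more tokens than k; this repeated sentence is only a stand-in.
calib = layer_activations(" ".join(["The quick brown fox jumps over the lazy dog."] * 60))
prompt = "If a train travels 60 miles in 1.5 hours, its average speed in mph is"
for k in (None, 384, 192, 96):  # None = unmodified baseline; k mirrors the table above
    basis = None if k is None else pca_basis(calib, k)
    print(k, round(next_token_entropy(prompt, basis), 3))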

Both agents reached the same conceptual conclusion - PCA dimensionality reduction is information loss, not "path constraint" - but Claude's stronger intervention revealed the magnitude.

The lesson: When experiments prove you wrong, pay attention to how. The dose-response relationship tells us representational geometry is more complex than linear dimensionality.

The GPT-2 mystery - solved

Neither agent was instructed to use GPT-2. The research instructions explicitly recommend "state-of-the-art models" (GPT-5, Claude Sonnet 4.5, Gemini 2.5). Yet both Codex and Claude chose GPT-2 for Idea #2.

From Claude's planning document:

"Model: GPT-2 base (124.4M parameters). Rationale: Open weights allow direct activation access. We need to manipulate hidden states, which requires local model access not available through APIs."

This is actually smart reasoning. Modern SOTA models only offer APIs. For mechanistic interpretability research - accessing internal activations, running interventions on hidden states, applying PCA to representations - you must use open-weight models.

GPT-2 remains a reasonable choice for this type of work and is easy to run for exploratory experiments.

The lesson: When the task demands it, agents can make contextually appropriate choices even when those choices contradict general instructions.

The critic quality problem

Both agents on Idea #3 found weak market effects, revealing a fundamental bottleneck:

Codex: Used model's own log-probability confidence as "reviewer"

def normalize_review_score(score: float) -> float:
    clamped = max(1.0, min(5.0, float(score)))
    return (clamped - 1.0) / 4.0

Problem: Log-probs clustered around 4.8/5.0, providing insufficient variance to drive meaningful allocation.
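To see why that matters, here is a hypothetical sketch of the allocation loop such scores would feed - a Beta-Bernoulli Thompson sampler, not Codex's actual implementation. When every normalized score lands near 0.95 (a raw 4.8 out of 5.0), the arms' posteriors converge to the same mean and the allocator degenerates into random sampling.

import random

class ThompsonAllocator:
    # Beta-Bernoulli bandit over research directions ("arms")
    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # pseudo-counts of favorable reviews
        self.beta = [1.0] * n_arms   # pseudo-counts of unfavorable reviews

    def choose(self):
        # Sample a plausible quality for each arm and fund the best draw
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, arm, score):
        # score in [0, 1], e.g. the output of normalize_review_score above
        self.alpha[arm] += score
        self.beta[arm] += 1.0 - score

allocator = ThompsonAllocator(n_arms=3)
for _ in range(100):
    arm = allocator.choose()
    allocator.update(arm, score=0.95)  # clustered log-prob scores: every arm looks alike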

Claude: Used GPT-4o-mini as external critic with novelty/soundness/significance scoring.

Problem: Hypotheses too similar in quality - no real variance to exploit.

From Codex's report:

"This explains why random sampling matched Thompson: both simply avoided the uniformly low-confidence seed 19."

Both agents discovered the same fundamental issue: Automated critics aren't good enough yet to provide meaningful resource allocation signals. Whether self-confidence or external LLM judges, current systems can't discriminate research quality sufficiently.

The lesson: Market mechanisms require high-quality signals. Without rich, discriminating feedback, resource allocation can't beat random sampling.

Kosmos: Comprehensive but overwhelming

All three ideas received Kosmos literature synthesis reports. The scale is remarkable:

For Paradigm Uncertainty: 108 literature search tasks, 4 major discoveries, 16 dense pages covering:

  • Marchenko-Pastur spike detection for transformer activations
  • ARFIMA fractional differencing and Hurst exponents
  • Sparse autoencoder feature disentanglement
  • Continuous-time structural equation modeling
  • Topological data analysis (persistent homology)
  • ... and 103 more specialized topics
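For a flavor of how much implementation these topics still require, here is a minimal hypothetical sketch of the first one: counting "spike" eigenvalues of an activation covariance that exceed the Marchenko-Pastur bulk edge. The synthetic data and names are ours; the Kosmos report provides no code.

import numpy as np

def mp_spike_count(X):
    # X: (n_samples, p_dims) activation matrix; features standardized below
    n, p = X.shape
    Xs = (X - X.mean(0)) / X.std(0)
    eigvals = np.linalg.eigvalsh(Xs.T @ Xs / n)
    bulk_edge = (1 + np.sqrt(p / n)) ** 2  # MP upper edge for unit-variance noise
    return int((eigvals > bulk_edge).sum())  # eigenvalues above the edge = "spikes"

rng = np.random.default_rng(0)
noise = rng.standard_normal((2000, 768))                                  # pure noise
spiked = noise + 5 * rng.standard_normal((2000, 1)) @ rng.standard_normal((1, 768))
print(mp_spike_count(noise), mp_spike_count(spiked))                      # ~0 vs. >= 1

Each of the 108 topics would need roughly this kind of grounding before an agent could act on it.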

Is this helpful for idea exploration?

Strengths:

  • ✅ Comprehensive coverage prevents reinventing the wheel
  • ✅ Identifies genuine methodological gaps (e.g., "no prior work applies Marchenko-Pastur to transformer activation covariances")
  • ✅ Provides theoretically grounded frameworks
  • ✅ Connects disparate literature streams

Limitations:

  • Information overload: 108 tasks is too many for humans to process
  • No prioritization: All tasks weighted equally - can't tell what's important vs. peripheral
  • Purely theoretical: Zero implementation, needs agents/humans to execute
  • Unclear actionability: Hard to extract "run THIS specific experiment first"

Note: Codex and Claude didn't use any of Kosmos's extensive proposals. They independently designed simpler experiments that actually ran.

Kosmos Discovery 2 proposed an elaborate protocol:

"Generate large pool of candidate prompts using divergent methods, score with linguistic/difficulty metrics, match via propensity score techniques, validate with rubric-driven LLM-as-judge..."

Meanwhile, Codex simply used:

ds = load_dataset("kellycyy/daily_dilemmas", ...) # Done

The theoretical framework sounds sophisticated, but it's actually quite vague. Each sub-process ("generate using divergent methods," "score with linguistic metrics," "match via propensity score techniques") lacks concrete implementation steps. What are the "divergent methods"? Which specific "linguistic metrics"? How exactly should propensity scoring be applied? Without these details, the framework remains too abstract to execute.

The fundamental issue: Kosmos represents literature maximization, not research. It's like a grad student who reads 100 papers but can't decide which experiment to run first. Without prioritization or abstraction (knowing which of 108 tasks actually matter), the output is too dense for humans and too theoretical for implementation agents.

The lesson: Comprehensive literature search has value, but without intelligent filtering and prioritization, it creates more noise than signal at the idea exploration stage.

Acknowledgement: We are aware that Kosmos is intended for carefully scoped research operations on existing datasets. While our results suggest it is not well suited to research idea exploration, we will test it on our existing projects with carefully curated datasets in the coming week. Please stay tuned!


Patterns Across All Implementations

Looking across all six code repositories plus three Kosmos reports, clear themes emerged:

What agents did well:

  1. Data curation (Codex): Smart filtering (gini < 0.45) showed understanding of dataset structure
  2. Statistical rigor: Both agents ran proper tests with multiple comparisons correction
  3. Faithful reporting: Both agents clearly reported hypothesis refutation
  4. Dose-response analysis (Claude): Systematic parameter sweeps revealed monotonic relationships
  5. Contextual reasoning: Choosing GPT-2 for interpretability despite general preference for SOTA models

Critical limitations:

  1. Synthetic data quality: Claude's paradigm prompts didn't create actual paradigm conflicts
  2. Sample sizes: Claude used only 20-30 examples on several experiments
  3. Critic quality: Both self-confidence and LLM judges provided insufficient signal
  4. Literature overload: Kosmos's 108 undifferentiated tasks created information overload
  5. No prioritization: Agents can't judge which of many possible approaches matters most

The data quality lesson

The most important finding: Real data quality determines whether you find effects.

Codex (real datasets with smart filtering) → significant effects found
Claude (synthetic data without validation) → no effects found

This echoes last week's findings: synthetic data remains a persistent bottleneck. The experimental designs were sound in both cases - the difference was data quality.

The meta-intelligence gap

Comparing Kosmos's 108 literature tasks versus Codex/Claude's focused experiments reveals a deeper limitation: current agents lack prioritization.

They can:

  • Exhaustively search literature (Kosmos)
  • Design and run experiments (Codex/Claude)
  • Report results somewhat faithfully
  • Make contextually appropriate tool choices (GPT-2 for interpretability)

They can't:

  • Know which of 108 tasks actually matter
  • Recognize when synthetic data is insufficient
  • Judge when automated critics are too weak
  • Abstract from specific instructions to general principles

This is the "meta-intelligence" we discussed in our first weekly blog: knowing when to search, when to trust your approach, when to question your assumptions. Current agents can execute well-defined tasks but struggle with metacognitive judgments about research strategy.


Next Week's Competition

The second weekly competition is now open! Voting closes Friday, November 22 at 11:59 PM AoE (Anywhere on Earth).

How to participate:

  1. Browse and upvote ideas in the Weekly Competition
  2. Submit your own idea to enter next week's pool
  3. Vote by Friday - the top 3 ideas get implemented

We're continuing to refine the competition format based on community feedback. Your input shapes how we run experiments and what we learn about research agents.


Closing Thoughts

Three ideas, six implementations, and many contradictions later, we see that:

  1. Contradictions reveal truth: When Codex and Claude disagree, examining why teaches us about research fundamentals (data quality matters)
  2. Hypothesis refutation is valuable: Both agents refuting path engineering with evidence is good science
  3. Real data beats synthetic: The clearest lesson across all experiments
  4. Comprehensive ≠ useful: Breadth without prioritization creates information overload
  5. Agents show contextual intelligence: Choosing appropriate tools (GPT-2 for interpretability) even when it contradicts general guidance

The weekly competition continues to reveal where research agents excel (statistical rigor, faithful reporting, contextual tool selection) and where they struggle (data quality, metacognitive prioritization, critic quality).

If you're excited about these findings, submit ideas to IdeaHub or contribute to idea-explorer.

Questions or feedback? Reach out at haokunliu@uchicago.edu. Let's build reliable AI research assistants together!


Appendix: Summary of Findings

Idea #1: Conceptual Crossroads

What the agents tested: Do paradigm conflicts (e.g., "Should you lie to protect someone?") produce different neural activation patterns than path ambiguity (e.g., "Is this hypothesis entailed?") in GPT-2?

Codex Results

  • Approach: Used real datasets (DailyDilemmas for paradigm conflicts, ChaosNLI for path uncertainty) with 80 prompts each
  • Key finding: YES - paradigm prompts showed significantly different patterns
    • Cross-layer persistence: +0.0053 higher (p ≈ 8e-22)
    • Locality: −0.175 CV units lower (p ≈ 2e-20)
    • Effect persisted after controlling for token length
  • Interpretation: Paradigm conflicts produce more globally distributed, persistent activation signatures
  • Next steps:
    • Test on larger models beyond GPT-2
    • Explore whether these signatures can improve uncertainty quantification
    • Investigate specific attention head patterns

Claude Results

  • Approach: Created 30 synthetic questions (10 control, 10 path, 10 paradigm)
  • Key finding: NO - no significant distinction between path and paradigm uncertainty
    • Both showed higher entropy than control (p < 0.001)
    • But did not differ from each other (p = 0.674)
    • Control questions showed highest layer-wise variance (opposite of hypothesis)
  • Interpretation: With synthetic data, paradigm vs. path distinction did not emerge
  • Next steps:
    • Use validated datasets instead of synthetic data
    • Increase sample size beyond 30 questions
    • Test with different model architectures

Kosmos Analysis

  • Approach: Comprehensive literature synthesis across 108 research tasks
  • Key contributions:
    • Discovery 1: Proposed mechanistic framework distinguishing path multiplicity (transient, layer-localized, linear) vs. paradigm conflict (persistent, globally distributed, non-linear)
    • Discovery 2: Designed confound-controlled protocol using propensity score matching to balance linguistic complexity
    • Discovery 3: Identified multivariate activation signatures (nonlinearity, temporal persistence, spatial distribution) for supervised classification
    • Discovery 4: Connected uncertainty dynamics to network modularity and effective dimensionality
  • Methodological gaps identified: No prior work applies Marchenko-Pastur spike detection to transformer activations; existing steering work hasn't stratified linearity by uncertainty regime
  • Next steps:
    • Implement the proposed confound-controlled dataset generation
    • Apply dose-response steering protocols to test linear vs. non-linear control
    • Use causal mediation analysis to test if persistence/modularity mediate prompt effects

Idea #2: Path Engineering

What the agents tested: Does reducing dimensionality of hidden representations (via PCA) decrease uncertainty, and does adding orthogonal noise increase it?

Original hypothesis: Constraining path space → decreased uncertainty

Codex Results

  • Approach: PCA projection of DistilBERT embeddings on SST-2 sentiment analysis (872 examples)
  • Key finding: Hypothesis NOT supported - minimal effects
    • PCA constraint: entropy increased by tiny amount (+0.00003 nats, p = 0.63)
    • Noise expansion: entropy increased slightly (+0.00003 nats, p = 0.011)
    • Effect sizes microscopic (Cohen's d ≈ 0.016-0.086)
  • Interpretation: Simple PCA interventions at final layer are insufficient to materially control uncertainty
  • Next steps:
    • Apply interventions at multiple layers simultaneously
    • Vary intervention strength (grid search over projection rank and noise amplitude)
    • Test on reasoning benchmarks where baseline uncertainty is higher

Claude Results

  • Approach: PCA projection of GPT-2 activations at layer 6 on math reasoning (20 problems)
  • Key finding: Hypothesis REFUTED - opposite effect with large magnitude
    • PCA constraint: entropy increased by 36-52% (p < 0.0001, Cohen's d = 1.2-1.8)
    • Stronger reduction → larger entropy increase (monotonic dose-response)
    • Noise expansion: no significant effect
  • Interpretation: Dimensionality reduction causes information loss that increases uncertainty, not "path constraint"
  • Next steps:
    • Test non-linear dimensionality reduction (autoencoders, VAEs)
    • Target specific dimensions via probing rather than blind PCA
    • Measure task performance to check if uncertainty increase reflects confusion

Kosmos Analysis

  • Approach: Literature synthesis on low-rank subspace identification and uncertainty manipulation
  • Key contributions:
    • Discovery 1: Assembled toolkit for identifying low-rank subspaces (NDM partitioning, PCA/probe manifolds, ReFT steering) and quantifying dimensionality changes (stable rank, entropy-based effective rank, Marchenko-Pastur spike detection)
    • Discovery 2: Operationalized causal test protocol - subspace-targeted interventions while holding problem fixed, measuring token/sequence uncertainty, steerability, and accuracy
    • Discovery 3: Localized causal levers - specialized attention-head circuits in mid-to-late layers for reasoning; final-layer MLP "entropy neurons" modulate uncertainty via LayerNorm
    • Discovery 4: Established entropy-regularized control (from RL) as principled baseline for uncertainty manipulation
  • Methodological gaps: No prior work has directly manipulated path subspaces while measuring uncertainty causally; dynamic identification of "reasoning-step" tokens remains open
  • Next steps:
    • Engineer path space via attention-head interventions at mid-upper layers
    • Schedule interventions at task-relevant tokens (not uniformly)
    • Separately modulate entropy neurons to isolate causal effects
    • Compare against entropy-regularized baseline

Idea #3: Critique Markets for Discovery

What the agents tested: Does allocating compute based on automated reviewer scores improve research efficiency compared to uniform allocation?

Codex Results

  • Approach: Thompson sampling over 18 custom questions (3 domains: reading, coding, synthesis), using model's log-probability as "reviewer score"
  • Key finding: Partial support - market outperformed uniform but matched random
    • Thompson sampling: +49% novelty-adjusted accuracy vs. uniform baseline
    • BUT random sampling also achieved +49% (both reached 1.57 NAA)
    • Market helped avoid low-confidence domains but didn't exploit quality differences
  • Interpretation: Even simple confidence proxy acts as budget signal, but log-prob reviewer lacks variance to differentiate strategies
  • Next steps:
    • Use richer reviewers (external LLM judges, ensemble models)
    • Implement adaptive cycle budgets that learn when to stop querying domains
    • Test on real benchmarks (TruthfulQA, GSM8K) instead of synthetic tasks

Claude Results

  • Approach: Multi-agent statistical hypothesis testing with GPT-4o-mini as external critic, tested on synthetic datasets (medical, social, environmental)
  • Key finding: Hypothesis NOT supported - no efficiency improvement
    • Critique market vs. uniform: no significant difference (p = 0.276)
    • Found domain saturation effects (p = 0.026) but allocation strategy didn't matter
  • Interpretation: Resource allocation strategy isn't the bottleneck - hypothesis quality is
  • Next steps:
    • Improve hypothesis generation quality
    • Test on domains with larger quality variance
    • Implement calibration (isotonic regression) before feeding critic scores to allocation (see the sketch below)
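
As a hypothetical illustration of that calibration step (the scores and outcomes below are made up, and scikit-learn's IsotonicRegression is one choice among several):

import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_scores = np.array([4.6, 4.7, 4.8, 4.8, 4.9, 5.0])  # clustered critic scores
outcomes = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 1.0])    # downstream success (0/1)

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, outcomes)
# calibrator.predict(new_scores) now yields monotone, calibrated allocation weights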

Kosmos Analysis

  • Approach: Literature synthesis on critique market mechanisms and resource allocation
  • Key contributions:
    • Identified that critique markets require: (1) diverse quality signals from reviewers, (2) budget constraints forcing trade-offs, (3) exploration-exploitation balance
    • Proposed using multi-armed bandit algorithms (Thompson sampling, UCB) with LLM-as-judge for scoring
    • Recommended rubric-driven evaluation with mandatory self-explanation, aggregated via Bayesian Dawid-Skene model
  • Methodological gaps: Existing work on LLM-as-judge hasn't been applied to research idea prioritization; no established metrics for "research idea quality"
  • Next steps:
    • Develop validated rubrics for research idea evaluation
    • Test whether multiple LLM judges provide sufficient signal variance
    • Compare against human expert judgments as ground truth

If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-11-10-2025,
  author = {Liu, Haokun},
  title  = {Week of 11/03/25-11/09/25: First Competition Results},
  year   = {2025},
  month  = {November},
  day    = {17},
  url    = {https://hypogenic.ai/blog/weekly-entry-251117}
}