Announcing Weekly Agents4Science Competition on IdeaHub
By Haokun Liu
Turning your research ideas into reality with AI agents
We're launching a weekly competition where the community decides which research ideas get implemented. Every week, we'll take the top 3 ideas submitted the previous week and run experiments on them with research agents. It's completely free: we try out your ideas for you, and the results are shared back on IdeaHub for the community to discuss and iterate on.
How it works

1. Submit or upvote ideas on IdeaHub.
- Browse existing ideas with the "Weekly Competition" tag and upvote the ones that excite you
- Submit your own research idea to enter next week's competition
- Ideas can be experimental hypotheses, novel research questions, or extensions of existing work
2. Community voting determines winners. Each week, the community votes on ideas submitted the previous week. Voting closes every Friday at 11:59 PM AoE (Anywhere on Earth), after which ideas are ranked by their upvote/downvote tallies on IdeaHub (see the illustrative sketch below).
We believe that exploring effective selection mechanisms for new ideas is critical for future scientist-AI interaction.
3. We run the top ideas and report results. Our team will use research agents to implement experiments for the winning ideas. We'll share code repositories, preliminary findings, and lessons learned the following Monday, warts and all.
Note: This is our first iteration of the competition format. We will refine the scoring formula and selection process based on what we learn from the community. Your feedback matters!
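To make the selection rule concrete, here is a purely illustrative Python sketch. The real scoring formula is still being refined (see the note above), so treat the net-vote ranking below as a placeholder rather than the actual rule:

```python
# Illustrative only: the real scoring formula is still being refined.
ideas = [
    {"title": "Idea A", "upvotes": 12, "downvotes": 3},
    {"title": "Idea B", "upvotes": 9, "downvotes": 1},
    {"title": "Idea C", "upvotes": 15, "downvotes": 10},
]

# Rank by net votes and take the weekly top 3.
top3 = sorted(ideas, key=lambda i: i["upvotes"] - i["downvotes"], reverse=True)[:3]
print([i["title"] for i in top3])  # ['Idea A', 'Idea B', 'Idea C']
```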
Initial results with current agents
With the rapid development of AI agents, we wanted to understand: what can research agents actually accomplish today? We believe they have real potential to accelerate science, but understanding their current capabilities and limitations is crucial for building better tools.
We started with AI-Scientist and AI-Researcher, but found that they could not really explore the ideas we wanted to try (details below). So we built idea-explorer, our ongoing effort to explore ideas with agents, directly on top of Codex and Claude Code. It is very early-stage, and we share some initial results below. We would also love to try other agents, including Kosmos; please join us in this effort if you are interested!
We took three existing ideas from IdeaHub, selected by upvote count:
- Do LLMs differentiate epistemic belief from non-epistemic belief? - Do LLMs have different types of beliefs (about facts vs. values)?
- Incentive-Compatible Societies: Formal Environment Design for Truthful Meta-Knowledge - Can formal rules make AI agents honest about their uncertainty?
- LLMs are bad at "conditional forgetting" - Can LLMs temporarily ignore their training to follow new rules?
We gave them to the idea-explorer agent and generated the following repositories, using Claude or Codex as the main agent:
- LLM-epistemic-belief-claude
- LLM-epistemic-belief-codex
- Incentive-societies-claude
- Incentive-societies-codex
- LLM-conditional-forget-claude
- LLM-conditional-forget-codex
What agents can do: Real progress in idea exploration
Example 1: Testing human-like belief understanding
The agent designed an experiment comparing LLM reasoning to human psychology research. Here's a table directly from its report:
Comparison to Human Data (Vesga et al., 2025):
| Signature | Humans | LLMs (Our Study) | Match? |
|---|---|---|---|
| Evidence Response | ✅ Strong differentiation | ✅ Strong differentiation (χ² = 26.19) | ✅ Yes |
| Action Type | ✅ Differentiation observed | ✅ Strong differentiation (χ² = 10.80) | ✅ Yes |
| Confidence | ✅ Higher in non-epistemic | ✅ Higher in non-epistemic (d = 1.26) | ✅ Yes |
| Language ("thinks"/"believes") | ✅ Strong differentiation | ❌ Weak differentiation (p = 0.24) | ❌ No |
Overall: LLMs show 3 out of 4 signatures matching human theory of mind, suggesting meaningful belief differentiation.
Why this matters: We see preliminary evidence that LLMs can distinguish between different types of beliefs. This is a useful sanity check that justifies pursuing the idea further with a more comprehensive study.
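For readers curious how χ² values like those in the table are typically computed: they come from a standard contingency test over response counts. A minimal sketch, with made-up counts for illustration:

```python
from scipy.stats import chi2_contingency

# Made-up counts: rows = context (epistemic vs. non-epistemic),
# columns = response category (e.g., revises belief vs. holds belief).
table = [[42, 18],
         [15, 45]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```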
Example 2: When saying "I don't know" more improves calibration
The question: Can formal rules make AI agents honest about their uncertainty? The agent set up an experiment in which two AI agents collaborate to answer TruthfulQA questions, each reporting an answer and a confidence level (0-100%). It tested three mechanisms (a code sketch follows the list):
- Baseline: Just cooperate honestly
- Audit: Random checks with reputation penalties for miscalibration
- Safety: Must abstain if confidence < 60%
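In code, the safety mechanism reduces to a simple gate. A minimal sketch (the function and field names are ours, not the agent's):

```python
CONFIDENCE_FLOOR = 0.60  # abstention threshold used by the safety mechanism

def safety_gate(answer: str, confidence: float) -> dict:
    """Force abstention whenever confidence falls below the floor."""
    if confidence < CONFIDENCE_FLOOR:
        return {"answer": None, "abstained": True, "confidence": confidence}
    return {"answer": answer, "abstained": False, "confidence": confidence}
```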
Why this matters: If an agent says "95% confident" but is actually only right 60% of the time, that's dangerous miscalibration in high-stakes decisions (medical diagnosis, financial analysis, etc.).

From the agent's report:
"Safety-constrained mechanisms with mandatory abstention zones dramatically improved calibration (ECE reduced by 73% from 0.099 to 0.027) while audit-based sanctions showed modest improvements."
Interesting finding: when AI agents said "I don't know" more often, calibration error dropped by 73%. They weren't being less helpful; they were being appropriately cautious when uncertain.
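For context, ECE (Expected Calibration Error) measures the average gap between stated confidence and actual accuracy across confidence bins; lower is better. A minimal implementation of the standard binned version:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence
    and empirical accuracy within each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == 0.0  # include exact zeros in the first bin
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```

Abstaining below 60% confidence removes exactly the answers where confidence and accuracy tend to diverge most, which is plausibly why the ECE drops so sharply.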
Similar to the example above, this preliminary result supports the plausibility of the original idea. And it is concrete experimental data you can trust, not just an LLM-as-a-judge evaluation based on text.
Example 3: When your experiment proves you wrong
Initial hypothesis: "LLMs are bad at conditional forgetting" (i.e., they can't override their training). The agent tested this and found the opposite:

From the agent's report:
"Contrary to the initial hypothesis, we found that state-of-the-art LLMs perform remarkably well at this task, achieving 81-95% accuracy across diverse scenarios... Only 7.7% of errors reflected 'original rule interference.' Instead, 76.9% stemmed from logical reasoning difficulties when applying complex novel rules."
Seeing contradictory evidence and being able to falsify a hypothesis is valuable in research, and the agent clearly helped in this case.
Where agents need improvement: Critical limitations
While the positive results are encouraging, we discovered issues at nearly every step. Human oversight remains essential.
Issue 1: Synthetic data instead of real data collection
Multiple agents generated synthetic data rather than collecting real data. For example, in the epistemic belief study, the agent created fictional vignettes instead of gathering actual human responses:
From vignettes.json:
{ "id": "friend_innocence", "belief_content": "John did not steal money from the organization", "epistemic_context": "Adam has been carefully examining the evidence...", "non_epistemic_context": "Adam feels strongly about standing by his friend..." }
Similarly, the conditional forgetting study notes in its report:
"Limitations: Synthetic dataset may not capture richer multi-step forgetting scenarios (no multi-move chess problems, no ambiguous linguistics)."
The agent generated 60 synthetic scenarios rather than using established benchmarks. While the experimental design was sound, the synthetic data is of questionable quality and may not generalize. The bottleneck here was resources and dataset collection; the sanity checks and methodology were executed correctly.
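An established benchmark appropriate to the task would be the right substitute, and loading one takes very little code. As a mechanical example, here is TruthfulQA (the benchmark from Example 2) via the standard HuggingFace `datasets` library:

```python
from datasets import load_dataset

# One call replaces 60 hand-written scenarios with an established benchmark.
ds = load_dataset("truthful_qa", "multiple_choice")["validation"]
print(ds[0]["question"])
```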
Issue 2: Resource constraints leading to inadequate models
Several experiments used tiny models due to computational limits, with predictably poor results. The epistemic belief study (Codex version) reports:
From the report:
"Key finding: A TF-IDF logistic baseline reached 94.1% accuracy and strong calibration on the English subset, while the small open-weight LLM (
Qwen/Qwen2.5-0.5B-Instruct) collapsed to labeling nearly every statement as factual (50% accuracy on 60 sampled cases)."
The code shows the constraint:
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct" # ... # Only tested on 60 samples due to CPU constraints
A 0.5B or 1.5B parameter model is far too small for nuanced reasoning tasks. The agent picked these models because it assumed it was limited to CPU, despite GPUs being available, and the results were essentially random guessing. Again, the experimental design was correct; the bottleneck was computational resources.
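The fix is small. Standard PyTorch and transformers usage picks up a GPU when one is present; a sketch using the study's Qwen checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"  # use the GPU when available

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
```

With a GPU available, the same script could also have run a larger checkpoint and the full evaluation set rather than 60 samples.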
Issue 3: Ungrounded experimental designs
Perhaps most concerning is when agents create elaborate experimental setups that look sophisticated but lack scientific grounding. The incentive-compatible societies experiment provides a clear example.
The agent designed a "multi-agent system" with three mechanisms (baseline, audit, safety). Here's the core implementation from the notebook:
```python
# Create fresh agents for this mechanism
agent1 = Agent("agent_1", PRIMARY_MODEL, client, temperature=0.7)
agent2 = Agent("agent_2", PRIMARY_MODEL, client, temperature=0.8)

# Create multi-agent system
mas = MultiAgentSystem([agent1, agent2], mechanism=mechanism)

# Test on questions
for i, q in enumerate(test_questions):
    result = mas.collaborative_answer(q)
```
The MultiAgentSystem class implements "collaboration" as:
```python
def collaborative_answer(self, question_data: Dict) -> Dict:
    # Step 1: Each agent answers independently
    individual_responses = []
    for agent in self.agents:
        response = agent.answer_question(question, choices, self.mechanism)
        individual_responses.append(response)

    # Step 2: Aggregate responses based on mechanism
    if self.mechanism == "baseline":
        final_answer = self._aggregate_baseline(individual_responses)
    elif self.mechanism == "audit":
        final_answer = self._aggregate_audit(individual_responses)
```
The problem: This isn't how multi-agent systems are studied in research. Real multi-agent research involves agents that actually communicate and negotiate (not just independent answers aggregated), game-theoretic analysis of strategic behavior, and validation against established theoretical frameworks.
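For contrast, even a minimal debate-style loop lets agents see and respond to each other's answers before committing. A hypothetical sketch (the `revise` method and agent interface are ours, not from the repository):

```python
def debate_answer(agents, question, rounds=2):
    # Round 0: independent answers, as in the generated code.
    answers = {a.name: a.answer_question(question) for a in agents}
    # Later rounds: each agent sees its peers' answers and may revise.
    # This exchange is the communication step the generated setup skips.
    for _ in range(rounds):
        for agent in agents:
            peers = {n: ans for n, ans in answers.items() if n != agent.name}
            answers[agent.name] = agent.revise(question, peers)
    return answers
```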
The agent created a toy setup that looks like multi-agent research but misses fundamental aspects. It didn't know to search for existing multi-agent communication protocols or check mechanism design literature.
This reveals two critical limitations:
- Models lack strong scientific grounding: Prior knowledge about rigorous methodology in specialized domains is insufficient or outdated
- No metacognitive awareness: Agents don't know when they should search for external knowledge vs. relying on training
This is the "meta intelligence" gap—knowing when to check for correctness, when to search, when to question your own approach.
The limits of context optimization
As we optimized the agents' context further, we noticed some patterns:
- More specific instructions → fewer mistakes (misplaced folders, fake simulations, not using papers)
- But also → less diverse, more rigid outputs
- This is the bias-variance tradeoff for context optimization
Take AI-Scientist from Sakana AI and AI-Researcher as examples of agents "overfitted" to ML research:
AI-Scientist - hardcoded ML workflow stages:
```python
self.main_stage_goals: Dict[int, str] = {
    1: """
    - Focus on getting basic working implementation
    - Use a simple dataset
    - Aim for basic functional correctness
    - If you are given \"Code To Use\", you can directly use it as a starting point.""",
    2: """
    - Change hyperparameters such as learning rate, number of epochs, batch size, etc. to improve the performance
    - DO NOT change the model architecture from the previous stage
    - Introduce TWO more new datasets from HuggingFace test the model. Try very hard to think what Huggingface datasets can be used here for testing.""",
    3: """
    - Explore novel improvements
    - Come up with experiments to reveal new insights
    - Be creative and think outside the box
    - MAKE SURE you use THREE HuggingFace dataset in total to test your models""",
    4: """
    - Conduct systematic component analysis that reveals the contribution of each part
    - Use the same datasets you used from the previous stage""",
}
# ... and many more ML-specific prompts throughout the codebase
```
AI-Researcher - specialized ML agent pipeline:
```python
class InnoFlow(FlowModule):
    def __init__(self, cache_path: str,
                 log_path: Union[str, None, MetaChainLogger] = None,
                 model: str = "gpt-4o-2024-08-06",
                 code_env: DockerEnv = None,
                 web_env: BrowserEnv = None,
                 file_env: RequestsMarkdownBrowser = None):
        super().__init__(cache_path, log_path, model)
        self.load_ins = ToolModule(load_instance, cache_path)
        self.git_search = ToolModule(github_search, cache_path)
        self.prepare_agent = AgentModule(get_prepare_agent(model=CHEEP_MODEL, code_env=code_env), self.client, cache_path)
        self.download_papaer = ToolModule(download_arxiv_source_by_title, cache_path)
        self.coding_plan_agent = AgentModule(get_coding_plan_agent(model=CHEEP_MODEL, code_env=code_env), self.client, cache_path)
        self.ml_agent = AgentModule(get_ml_agent(model=COMPLETION_MODEL, code_env=code_env), self.client, cache_path)
        self.judge_agent = AgentModule(get_judge_agent(model=CHEEP_MODEL, code_env=code_env, web_env=web_env, file_env=file_env), self.client, cache_path)
        self.idea_agent = AgentModule(get_idea_agent(model=CHEEP_MODEL, file_env=file_env, code_env=code_env), self.client, cache_path)
        self.code_survey_agent = AgentModule(get_code_survey_agent(model=CHEEP_MODEL, file_env=file_env, code_env=code_env), self.client, cache_path)
        self.exp_analyser = AgentModule(get_exp_analyser_agent(model=CHEEP_MODEL, file_env=file_env, code_env=code_env), self.client, cache_path)
        # ... entire workflow hardcoded for ML experiments
```
Notice the deep specialization: AI-Scientist explicitly mentions "hyperparameters," "learning rate," "datasets," and "model architecture." AI-Researcher has dedicated ml_agent, coding_plan_agent, and exp_analyser modules. This isn't just one prompt you can change—the ML-specific assumptions are woven throughout the entire codebase.
This represents the high-variance end of the bias-variance tradeoff: very specific to one domain, lots of hardcoded structure, but completely rigid when you need something different. More fundamentally, it is intractable to exhaust the research standards of every scientific domain.
On the bias side, we can patch specific errors through detailed instructions, but we can't prompt our way to "knowing when to search" or "recognizing when you're outside your expertise." Current agents lack what we call meta intelligence: the ability to form abstract concepts from a fixed set of instructions or examples.
This is why these agents are still early-stage and human input is critical. Looking across all six repositories, problems occurred at multiple stages:
- Data collection and resource allocation
- Experimental design (grounding in methodology)
- Knowing when to search for external knowledge
- Result interpretation (trustworthiness)
There's still significant room for improvement through context optimization, but the Pareto frontier is limiting. How to achieve truly intelligent agents remains an open question. We're building this weekly competition to explore these boundaries openly with the community.
Conclusion
We believe that AI scientists can accelerate science, but we are not there yet. Open science and transparent benchmarking is crucial for the future of research agents. We hope that this weekly competition can serve as an open exploration in this space!
If you read this far, you are probably excited about building and experimenting with research agents. Please feel free to submit ideas to IdeaHub and open issues/PRs on idea-explorer. You can also reach out at haokunliu@uchicago.edu. Let's build reliable and trustworthy AI research assistants together!
If you are interested in citing this blog, use this bibtex:
```bibtex
@misc{liu-announcing-weekly-agents4science-2025,
  author = {Liu, Haokun},
  title  = {Announcing Weekly Agents4Science Competition on IdeaHub},
  year   = {2025},
  month  = {November},
  day    = {8},
  url    = {https://hypogenic.ai/blog/weekly-competition}
}
```