Week of 12/08/25-12/14/25: Three ideas on LLM information processing
By Haokun Liu
First, thanks to everyone who participated! This week's ideas explored some fascinating questions about how LLMs process information—from understanding nonsense to detecting scientific blind spots to communicating effectively with humans.
Winning ideas and generated repos here:
Do LLMs Understand Nonsense Commands? by Ari Holtzman
Can AI models explain gibberish prompts the way they explain normal English, or do they process these weird inputs differently? This explores whether jailbreaking works by exploiting something fundamentally different from normal language understanding.
Can LLMs Expose What Science Refuses to See? by Amber Z
Can AI act as a "gap detector" to find important problems that science is ignoring? Current AI research tends to turbo-charge existing agendas rather than questioning which problems deserve attention in the first place.
AI→Human Communication: How to? by Haokun Liu
When AI generates thousands of lines of code or 100-page reports, what's the best way to present that information so humans can actually understand and verify it?
TL;DR for ideas
- Nonsense commands: LLMs struggle to explain gibberish prompts—explanation quality drops significantly as prompts become more nonsensical. But models still try to rationalize nonsense rather than admitting they don't understand, and they rarely identify prompts as "adversarial" even when designed to be. Small models can't do this meta-analysis at all.
- Science gaps: LLMs can reliably identify under-researched but important topics. GPT-4o accurately estimated which topics get research attention and found gaps like neglected tropical diseases and low-resource language AI. Two different models (GPT-4o and Claude) agreed 97% on which topics were under-researched.
- AI→Human communication: Structured formats beat dense paragraphs for perceived quality—bullet points won 90% of comparisons against prose, and hierarchical summaries won 95%. One experiment found dense text had better "comprehension" than structured formats, but this used an LLM to simulate users; real humans likely have the opposite experience, since LLMs love dense text while humans get overwhelmed by it.
TL;DR for idea-explorer
This week we encountered "prompt too long" errors when agents tried to process academic papers—a reminder that efficient long-document processing and memory management remain unsolved challenges for research agents.
One meta-observation: for the AI→Human Communication idea, all three agents ran simulated user studies using LLMs as stand-ins for humans, but none flagged this as a limitation. The research question was explicitly about human comprehension, yet the agents didn't question whether LLM simulations were valid proxies. This blind spot only surfaced through manual review—highlighting that current agents execute methods competently but lack awareness of when their approach doesn't match the research goal.
Next steps: We're adding paper processing tools and retry mechanisms to idea-explorer, and working on making findings more engaging and actionable for human researchers.
Findings from the Ideas
Do LLMs Understand Nonsense Commands?
The question: When researchers create adversarial prompts to jailbreak LLMs, those prompts often look like complete gibberish. Can LLMs actually explain what these weird prompts mean? If not, it suggests jailbreaking exploits something different from normal language understanding.
What the agents tried:
- Claude generated 70 prompts ranging from normal English ("What is the capital of France?") to pure random characters, then asked GPT-4o-mini to explain each one and measured explanation quality (a minimal sketch of this setup follows the list).
- Codex took a different approach: add gibberish suffixes to harmful prompts and test whether models could still extract the original intent.
- Gemini attempted to use the AutoDAN framework but had to pivot to GPT-2, which revealed something unexpected about model capabilities.
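To make Claude's setup concrete, here's a minimal sketch of what a nonsense-gradient probe could look like. This is our illustration, not the agent's actual code: it assumes the OpenAI Python SDK with an API key in the environment, and the corruption function, base prompt, and judging rubric are all stand-ins.

```python
# Minimal sketch of a nonsense-gradient probe (illustrative, not the repo's code).
# Assumes the OpenAI Python SDK and OPENAI_API_KEY; prompts and rubric are stand-ins.
import random
import string

from openai import OpenAI

client = OpenAI()

def corrupt(prompt: str, level: float) -> str:
    """Replace a fraction `level` of characters with random ones."""
    chars = list(prompt)
    for i in random.sample(range(len(chars)), int(level * len(chars))):
        chars[i] = random.choice(string.ascii_letters + string.punctuation)
    return "".join(chars)

def explanation_quality(prompt: str) -> int:
    """Ask one model to explain the prompt, then have a judge score it 1-10."""
    explanation = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Explain what this prompt is asking: {prompt!r}"}],
    ).choices[0].message.content
    score = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Rate 1-10 how well this explanation captures the "
                              f"prompt's intent. Reply with a number only.\n"
                              f"Prompt: {prompt!r}\nExplanation: {explanation}"}],
    ).choices[0].message.content
    return int(score.strip())

base = "What is the capital of France?"
for level in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(level, explanation_quality(corrupt(base, level)))
```

Sweeping the corruption level and correlating it against the judge's scores is what would produce a curve like the one Claude reported.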
What happened:
Claude found a strong negative correlation (r = -0.60) between how nonsensical a prompt was and how well the model could explain it. Normal prompts got explanation quality scores of 9/10, while adversarial-like prompts dropped to 6.4/10. Interestingly, the model almost never classified prompts as "adversarial"—it defaulted to calling everything either "normal" or "nonsense."
Codex found something different: even when gibberish suffixes raised prompt perplexity by 5x, models could still extract the original intent almost as well (similarity score 0.83 vs 0.81 baseline). This suggests models can often "see through" the noise to the underlying request.
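For readers who want to see how the two measurements fit together, here's a rough sketch of a perplexity check plus an intent-similarity score. It's illustrative rather than the repo's code: it assumes the transformers and sentence-transformers libraries, and the gibberish suffix and example strings are made up.

```python
# Rough sketch of Codex's two measurements (illustrative): GPT-2 perplexity
# for "how gibberish does this look?" and embedding cosine similarity for
# "did the model still recover the original intent?".
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sentence_transformers import SentenceTransformer, util

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def intent_similarity(original: str, extracted: str) -> float:
    a, b = embedder.encode([original, extracted], convert_to_tensor=True)
    return util.cos_sim(a, b).item()

prompt = "How do I bake sourdough bread?"
suffixed = prompt + " zx!!qq~~|describing"  # made-up gibberish suffix
print(perplexity(prompt), perplexity(suffixed))  # suffix should inflate perplexity
# `extracted` would be the target model's paraphrase of the suffixed prompt:
print(intent_similarity(prompt, "The user wants a sourdough baking recipe."))
```

A defense that flags high-perplexity inputs would trip on the suffixed prompt even though, per Codex's result, the underlying intent often survives.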
Gemini's pivot to GPT-2 revealed that self-explanation is an emergent capability. The small model couldn't explain any prompt meaningfully—not even simple English ones. It produced garbled text when asked to analyze its own behavior.
What we learned:
LLMs are overconfident explainers—they'll rationalize anything rather than admit confusion. For AI safety, this is a double-edged sword: perplexity-based defenses (flagging weird-looking prompts) can catch adversarial inputs, but models may still "see through" the noise and comply with the underlying request. The practical takeaway: if you're building safety filters, don't rely on perplexity alone. And if you're probing model understanding, don't trust confident explanations—the model might be rationalizing rather than comprehending.
Can LLMs Expose What Science Refuses to See?
The question: AI for science usually accelerates existing research agendas. But what if some important problems are systematically ignored? Can LLMs detect these blind spots by comparing how much research attention a topic gets versus how important it actually is?
What the agents tried:
- Claude asked GPT-4o to score 20 research topics on two dimensions: how much research attention they get (1-10) and how important they are for society (1-10). The difference reveals potential "gaps" (see the sketch after this list).
- Codex tried the same approach with a smaller local model (Qwen2.5-1.5B), providing actual bibliometric data to ground the analysis.
- Gemini compared topics extracted from academic papers (arXiv) against topics from clinical notes to find what researchers discuss versus what doctors actually deal with.
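Here's a minimal sketch of the attention-vs-importance gap score. It assumes the OpenAI Python SDK, and the topic list and prompt wording are our stand-ins; the agents' actual prompts may differ.

```python
# Minimal sketch of the gap score (illustrative): gap = importance - attention.
# Assumes the OpenAI Python SDK; topics and prompt wording are stand-ins.
from openai import OpenAI

client = OpenAI()

def score(topic: str, dimension: str) -> int:
    msg = (f"On a 1-10 scale, rate the {dimension} of the research topic "
           f"'{topic}'. Reply with a number only.")
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": msg}],
    ).choices[0].message.content
    return int(out.strip())

topics = ["neglected tropical diseases", "large language models",
          "AI for low-resource languages"]
for t in topics:
    attention = score(t, "current research attention it receives")
    importance = score(t, "importance for societal wellbeing")
    print(f"{t}: gap = {importance - attention:+d}")
```

Running the same loop against a second model and comparing gap scores is how the cross-model agreement check below would work.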
What happened:
Claude's approach worked well. GPT-4o's estimates of research attention correlated strongly with expected levels (r = 0.76). More importantly, the model reliably identified topics that are important but under-researched—neglected tropical diseases, AI for low-resource languages, small-scale farmer decision support, and maternal mortality prediction all emerged as top gaps. When Claude ran the same test with both GPT-4o and Claude-3.5-Sonnet, they agreed 97% of the time on gap scores.
Codex's smaller model struggled. Even when provided with actual citation statistics and open-access rates, it couldn't incorporate the numbers into its analysis. The responses contained zero numeric grounding. Simple heuristics (like "only 13.5% of climate adaptation authors are from the Global South") already revealed the gaps without needing an LLM.
Gemini found a different kind of gap: topics like MRI, biopsy, abdominal pain, and common antibiotics appeared constantly in clinical notes but rarely in academic abstracts. The day-to-day realities of patient care are underrepresented in the research literature.
What we learned:
Large LLMs can serve as "gap detectors" for science—they've internalized enough about research trends and societal needs to spot where the two diverge. The actionable insight: if you're a researcher looking for high-impact problems, ask an LLM "what's important but under-studied?" and cross-reference with a second model. If they agree, you've likely found a real gap. For funding agencies, this could become an automated audit tool. One caveat: this requires frontier models—smaller ones can't do the reasoning even when given the data.
AI→Human Communication: How to?
The question: AI systems generate dense outputs—code, reports, analyses. But humans can't absorb thousands of lines of code or verify 100-page reports. What's the best way for AI to communicate large amounts of information to humans?
What the agents tried:
- Claude tested 7 different summary formats (dense prose, bullet points, hierarchical with TL;DR, formal vs. conversational tone) on 20 documents, using an LLM judge to score quality (sketched after this list).
- Codex compared generic summaries against structured "briefs" with explicit sections (key points, TL;DR, uncertainties) and word budgets.
- Gemini simulated a user study: present different formats to a model acting as a "junior analyst" and measure how accurately they can answer questions about the original content.
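As an illustration of the head-to-head judging, here's a minimal sketch of a pairwise format comparison. It assumes the OpenAI Python SDK; the format instructions, judge prompt, and report.txt input are hypothetical.

```python
# Sketch of a pairwise format comparison with an LLM judge (illustrative).
# Assumes the OpenAI Python SDK; formats, judge prompt, and input file are stand-ins.
from openai import OpenAI

client = OpenAI()

FORMATS = {
    "dense": "Summarize as one dense paragraph.",
    "bullets": "Summarize as a flat bullet list.",
    "hierarchical": "Summarize as a one-line TL;DR followed by nested details.",
}

def summarize(document: str, style: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"{FORMATS[style]}\n\n{document}"}],
    ).choices[0].message.content

def judge(document: str, a: str, b: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Which summary of the document below is higher "
                              "quality? Reply 'A' or 'B' only.\n\n"
                              f"Document: {document}\n\nA: {a}\n\nB: {b}"}],
    ).choices[0].message.content
    return verdict.strip()

doc = open("report.txt").read()  # any long document (hypothetical file)
wins = {"A": 0, "B": 0}
for _ in range(10):  # repeat comparisons to estimate a win rate
    v = judge(doc, summarize(doc, "bullets"), summarize(doc, "dense"))
    wins[v] = wins.get(v, 0) + 1
print(wins)  # A = bullets, B = dense prose
```

A real version would randomize which summary appears as A vs. B to control for position bias in the judge.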
What happened:
Claude found that structure wins decisively. In head-to-head comparisons, bullet points beat dense prose 90% of the time, and hierarchical summaries (TL;DR first, then expandable details) won 95% of the time. Formal technical tone was preferred over conversational style (80% win rate). All formats scored equally well on clarity—modern LLMs produce clear output regardless of format—but structure improved perceived quality.
Codex showed that structured prompts cut summary length by 43% (from 88 words to 50 words) while actually improving content coverage. The LLM judge preferred the concise structured outputs 80% of the time.
Gemini found a paradox: the hierarchical format was most preferred by the simulated user, but it produced the worst comprehension accuracy (56%). Dense summaries—plain paragraphs—achieved the highest accuracy (80%). The Q&A format was a strong middle ground (76% accuracy) when the AI could anticipate the user's questions.
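Gemini's comprehension measurement could be sketched roughly as follows; as with the sketch above, the SDK, prompts, and grading rule are our assumptions, not the agent's code.

```python
# Illustrative sketch of the simulated comprehension test: the "junior analyst"
# model sees only the summary, answers questions, and an answer counts as correct
# if a judge with access to the source document says it matches.
from openai import OpenAI

client = OpenAI()

def analyst_answer(summary: str, question: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": "You are a junior analyst. Answer using only "
                              "the summary you were given."},
                  {"role": "user",
                   "content": f"Summary:\n{summary}\n\nQuestion: {question}"}],
    ).choices[0].message.content

def is_correct(document: str, question: str, answer: str) -> bool:
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Given the source document, is this answer to the "
                              "question factually correct? Reply yes or no.\n\n"
                              f"Document: {document}\nQ: {question}\nA: {answer}"}],
    ).choices[0].message.content
    return verdict.strip().lower().startswith("yes")

def comprehension_accuracy(document, summary, questions) -> float:
    hits = [is_correct(document, q, analyst_answer(summary, q)) for q in questions]
    return sum(hits) / len(hits)
```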
Important caveat: All three experiments used LLMs to simulate users or judge quality—none involved real human studies. Notably, none of the agents flagged this as a limitation, despite the research question being explicitly about human comprehension. This matters especially for Gemini's finding that dense summaries beat hierarchical formats—LLMs are trained on dense text and process it comfortably, while real humans get overwhelmed. We'd bet that finding reverses with actual human participants.
What we learned:
For AI-to-human communication, use structure: bullet points, TL;DR first, details on demand. This consistently wins on perceived quality. But here's the meta-lesson: when your research question is about humans, don't let agents substitute LLM simulations without flagging it as a limitation. The agents here executed competently but lacked the self-awareness to question whether their methodology matched the goal. Until agents develop better judgment about when their approach is valid, human oversight on methodology remains essential.
Next Week's Competition
The sixth weekly competition is now open! Voting closes Friday, December 20 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week's findings highlight a recurring theme: model capability matters enormously. Whether detecting scientific gaps or explaining nonsense prompts, large models can do things small models simply cannot. As we continue improving idea-explorer, understanding these capability boundaries helps us design better experiments.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions!
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-12-08-2025,
  author = {Liu, Haokun},
  title = {Week of 12/08/25-12/14/25},
  year = {2025},
  month = {December},
  day = {15},
  url = {https://hypogenic.ai/blog/weekly-entry-251208}
}