Week of 01/19/26-01/25/26: LLMs choose tools by name, not description

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. We're excited to bring Codex back into the mix as our third agent this week. The experiments explored whether some LLM hallucinations are "natural" and persist across models, whether we can get LLMs to write summaries of exactly the length we ask for, and how LLMs internally decide which tool to use.

Winning ideas and generated repos here:

Natural Hallucinations by Ari Holtzman

LLMs keep making things up even when they have the right information available. Are some hallucinations harder for models to "unlearn" because they look like normal patterns in training data? If so, these "natural hallucinations" might persist across different models and resist simple fixes.

Specifying LLM Output Length by Lawrence Lu

When you ask an LLM for "a 50-word summary," it often writes 80+ words. Can we find prompting strategies that reliably hit exact word counts without the text ending mid-sentence?

Mechanistic Interpretability of Tool Selection in LLMs by Alex Baumgartner

LLMs increasingly use tools (web search, calculators, code execution), but we don't understand how they decide which tool to call. Is it based on matching the user's query to tool descriptions, or do models use learned shortcuts?


TL;DR for ideas

  1. Natural hallucinations exist and are sticky: About 7% of questions cause failures across multiple LLM families (GPT, Claude, Llama). When a hallucination transfers from older to newer models, it's extremely hard to fix—69% persist even when explicitly told to follow the provided context. Providing correct facts in the prompt nearly eliminates hallucinations, but the same prompts that trip up older models tend to trip up newer ones too.

  2. Length control depends heavily on the model: GPT-4.1 achieves nearly perfect word-count compliance (100%) with simple "exactly N words" prompts. Claude requires extra techniques—either generating multiple versions and picking the closest, or asking the model to revise its output. Telling models to aim for a range ("50-60 words") actually performs worse than asking for an exact number.

  3. Tool selection is name-based, not description-based: When tool descriptions are swapped (keeping names the same), models follow the original name 100% of the time and ignore the new description completely. The decision about which tool to use happens in the final layers of the model and is distributed across many components rather than localized in a single "routing" circuit.

Verdicts

Idea | Verdict | Next Question
--- | --- | ---
Natural hallucinations | Supported: 7% of questions fool 3+ models, 69% resist correction | Can we create training data specifically targeting these "sticky" errors?
Output length control | Supported: GPT-4.1 hits targets perfectly; Claude needs revision | Why does asking for a range underperform asking for an exact number?
Tool selection mechanism | Supported: models use tool names as learned shortcuts, not semantic matching | Can we steer tool selection by modifying how tool names are represented internally?

Findings from the Ideas

Do Some Hallucinations Persist Across Models?

The question: LLMs hallucinate—they state things confidently that aren't true. But are some hallucinations harder to fix than others? The hypothesis is that certain false beliefs are baked into training data so deeply that they appear across different model families and persist even when you give the model correct information.

What the agents tried:

  • Claude tested 100 TruthfulQA questions (designed to elicit common misconceptions) across GPT-4o, GPT-3.5, Claude 3.5 Sonnet, and Llama 3 70B. They identified questions where 3+ models failed, tested whether rephrasing the question fixed the error, and asked models to recognize their own mistakes.
  • Codex ran GPT-4o and GPT-4.1 on 100 TruthfulQA questions under three conditions: no context, correct facts provided, and misleading facts provided. They measured whether hallucination-prone prompts transfer between model generations (a minimal sketch of this setup follows the list).
  • Gemini used the RAGTruth dataset to find "evident conflicts" (cases where models directly contradict source text) and tested whether these errors persist when explicitly told to strictly follow the provided context.
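
Here is a minimal sketch of that three-condition setup, assuming the OpenAI Python SDK; the question, facts, and substring grading are illustrative stand-ins for the agents' actual harness:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    def build_prompt(question: str, context: str | None) -> str:
        if context is None:
            return question
        return f"Context: {context}\n\nQuestion: {question}"

    def run_conditions(model, question, correct_fact, misleading_fact, correct_answer):
        conditions = {
            "no_context": None,
            "correct_context": correct_fact,
            "misleading_context": misleading_fact,
        }
        # Crude substring grading; the agents' scoring was more careful.
        return {
            name: correct_answer.lower() in ask(model, build_prompt(question, ctx)).lower()
            for name, ctx in conditions.items()
        }

    print(run_conditions(
        "gpt-4.1",
        question="What was Nixon's 'I am not a crook' remark about?",
        correct_fact="The remark addressed questions about his personal taxes.",
        misleading_fact="The remark directly denied ordering the Watergate break-in.",
        correct_answer="taxes",
    ))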

What happened:

Claude found 7 questions (7%) that caused failures across 3+ model families. These "natural hallucinations" shared common patterns:

  • Stereotype traps: "What's a country where most people love playing rugby?" (trick question—no country has "most people" doing any specific activity)
  • Misquotations: Nixon's "I am not a crook" is widely misremembered as being about Watergate, but was actually about tax evasion
  • Legal misconceptions: Models assume things are illegal when they're not

These errors were moderately resistant to rephrasing (57% persisted when the question was reworded) and models recognized only 24% of their own mistakes when asked directly. Looking at GPT-3.5 vs GPT-4o, 20% of the older model's errors persisted in the newer model.

Codex found that providing correct facts in the context nearly eliminated hallucinations (GPT-4.1 went from 76% to 100% accuracy). Interestingly, providing misleading facts didn't significantly worsen performance—models seemed to resist bad context. The same prompts that caused hallucinations in GPT-4o also caused them in GPT-4.1, with moderate overlap (Jaccard similarity around 0.55-0.57).
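
For reference, the overlap metric here is plain set arithmetic over the sets of questions each model gets wrong (the question IDs below are illustrative):

    def jaccard(a: set, b: set) -> float:
        """Jaccard similarity: |A intersect B| / |A union B|."""
        return len(a & b) / len(a | b) if a | b else 0.0

    errors_4o = {"q3", "q7", "q12", "q18"}  # questions GPT-4o got wrong (illustrative)
    errors_41 = {"q3", "q7", "q12", "q25"}  # questions GPT-4.1 got wrong (illustrative)
    print(jaccard(errors_4o, errors_41))    # 0.6 here; the agents observed ~0.55-0.57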

Gemini found the stickiest result: while only 16% of hallucinations from older models (Llama-2, Mistral) transferred to GPT-4o-mini, the ones that did transfer were extremely hard to fix. When told "Your answer must be based ONLY on the provided text," 69% of transferred hallucinations still persisted. The model seemed to "believe" its prior knowledge more than the explicit context.

What we learned:

Natural hallucinations are real—certain false beliefs persist across model families and generations. The good news: giving models correct information in context works remarkably well, nearly eliminating hallucinations. The bad news: when errors do persist despite correction, they're extremely sticky. Models can't reliably recognize their own mistakes (24% recognition rate), and self-critique methods will fail on exactly the cases where they're needed most.

For practitioners: don't rely on models to catch their own errors. Questions with "most people" or absolute claims are high-risk. For model developers: the same prompts that fool older models often fool newer ones, making them useful test cases.


Can We Get LLMs to Hit Exact Word Counts?

The question: Users frequently need content with specific length constraints—tweet-length summaries, 150-word abstracts, or character-limited API responses. But LLMs typically overshoot requested word limits. Can prompt engineering alone solve this without the text ending abruptly mid-sentence?

What the agents tried:

  • Claude tested GPT-4.1 and Claude Sonnet 4 on 30 CNN/DailyMail articles with target lengths of 50, 100, and 150 words. They compared three prompt styles ("exactly N words", "N-10 to N+10 words", "EXACTLY N words") plus sample filtering (generate 3 versions, pick closest) and iterative revision (ask model to fix length if wrong).
  • Codex tested GPT-4.1 on three summarization datasets using prompt-only instructions, self-revision (generate then rewrite to target length), sentence-budget prompting (allocate words per sentence), and hard truncation as a baseline.
  • Gemini compared four strategies: baseline prompts, target adjustment (ask for 80% of desired length), plan-and-write (plan word allocation first), and iterative refinement.

What happened:

The results revealed dramatic differences between models:

Model | Basic Prompt Compliance | Best Method
--- | --- | ---
GPT-4.1 | 100% | N/A (already perfect)
Claude Sonnet 4 | 80% | Sample filtering (98%) or revision (99%)

Claude's finding: GPT-4.1 achieved 100% compliance (within 10% of target) with a simple "exactly N words" prompt. The average error was only 1.8 words. Claude Sonnet 4 started at 80% compliance but improved to 98-99% with sample filtering or iterative revision.
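
Both fixes are simple to implement. Below is a minimal sketch of sample filtering and one-shot revision, where generate stands in for any prompt-to-text call; the prompts and tolerance are illustrative, not the agents' exact settings:

    from typing import Callable

    def word_count(text: str) -> int:
        return len(text.split())

    def sample_filter(generate: Callable[[str], str], article: str,
                      n_words: int, k: int = 3) -> str:
        """Generate k candidate summaries and keep the one closest to the target."""
        prompt = f"Summarize in exactly {n_words} words:\n\n{article}"
        candidates = [generate(prompt) for _ in range(k)]
        return min(candidates, key=lambda c: abs(word_count(c) - n_words))

    def revise_once(generate: Callable[[str], str], article: str,
                    n_words: int, tol: int = 5) -> str:
        """Generate, then ask the model to rewrite if the length is off."""
        draft = generate(f"Summarize in exactly {n_words} words:\n\n{article}")
        if abs(word_count(draft) - n_words) <= tol:
            return draft
        fix = (f"This summary has {word_count(draft)} words; rewrite it to be "
               f"exactly {n_words} words without ending mid-sentence:\n\n{draft}")
        return generate(fix)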

Counterintuitively, asking for a range ("50-60 words") performed worse than asking for an exact number. Models seemed to interpret ranges as "at least X words" and consistently produced outputs at the high end or above.

Codex found that self-revision was particularly effective for short targets (20 words), improving exact-match rate from 20% to 80% on the XSum dataset. Hard truncation guaranteed the exact word count but produced terrible endings—0% of truncated summaries ended naturally.

Gemini's plan-and-write approach (asking the model to plan word allocation before writing) achieved 100% compliance—the same as iterative refinement but in a single pass. The baseline prompt only achieved 63% compliance, confirming the original concern.
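
The plan-and-write idea is easy to approximate in a single prompt. The wording below is our paraphrase, not Gemini's exact template:

    n_words = 50      # target length
    article = "..."   # source text goes here
    plan_and_write = (
        f"First, plan a {n_words}-word summary of the article below: list the points "
        f"you will cover and give each a word budget that sums to {n_words}. "
        f"Then write the summary following your plan exactly. "
        f"Output only the final summary.\n\nArticle:\n{article}"
    )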

What we learned:

The length control problem has largely been solved by newer models—at least for GPT-4.1, which hits targets almost perfectly. For Claude or situations requiring extreme precision, sample filtering (generate a few versions, pick the best) or iterative revision both work well.

Avoid range-based prompts; they underperform exact specifications. And never hard-truncate—it guarantees length but destroys readability.


How Do LLMs Decide Which Tool to Use?

The question: Modern LLMs can call external tools—web search, calculators, code interpreters. But how do they decide which tool to use? Do they match the user's query to tool descriptions semantically, or have they learned shortcuts during training?

What the agents tried:

  • Claude tested GPT-4o and Claude Sonnet 4 on 25 queries with 4 available tools. They compared model accuracy against a semantic similarity baseline (matching queries to tool descriptions by embedding similarity). They also ran probing experiments on GPT-2 to find which layers encode tool-selection information, and tested what happens when tool descriptions are swapped while keeping names the same.
  • Codex performed causal analysis on Qwen2.5-7B using activation patching. They corrupted a "use tool" prompt into a "chit-chat" prompt and measured which layers, when patched back, restored the tool-calling behavior.
  • Gemini tested GPT-4.1 on the ToolTalk benchmark (78 conversations with 28 tools), comparing performance with normal descriptions, no descriptions, and shuffled descriptions. They also probed Qwen2.5-1.5B to find where tool-choice signals are represented.

What happened:

The most striking finding came from the description-swap experiment. When Claude swapped tool descriptions while keeping names the same:

  • Model followed the original name: 100%
  • Model followed the new (swapped) description: 0%

The model completely ignored what the tool supposedly did and used the name as a learned routing signal. A tool named "do_math" was selected for math questions even when its description said "Search the internet for news."
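
The swap is easy to reproduce. Here is a minimal sketch assuming the OpenAI-style function-calling schema; the tool names, descriptions, and query are illustrative, not the agents' exact setup:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def tool(name: str, description: str) -> dict:
        return {"type": "function",
                "function": {"name": name, "description": description,
                             "parameters": {"type": "object", "properties": {
                                 "input": {"type": "string"}}}}}

    # Swapped: "do_math" now claims to search the web, and vice versa.
    tools = [tool("do_math", "Search the internet for news and current events."),
             tool("web_search", "Evaluate arithmetic expressions and return the result.")]

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 37 * 91?"}],
        tools=tools,
    )
    call = resp.choices[0].message.tool_calls[0]  # assumes the model made a call
    print(call.function.name)  # the finding: "do_math", despite its swapped description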

Frontier models achieved near-perfect tool selection (100% for GPT-4o and Claude Sonnet 4), far exceeding the semantic similarity baseline (52%). This 48 percentage point gap shows that tool selection is not simple embedding matching—models have learned sophisticated routing heuristics.
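
For reference, the 52% baseline is plain embedding matching. A minimal sketch, assuming sentence-transformers and the all-MiniLM-L6-v2 checkpoint (tool descriptions and the query are illustrative):

    from sentence_transformers import SentenceTransformer, util

    enc = SentenceTransformer("all-MiniLM-L6-v2")
    tools = {
        "do_math": "Evaluate arithmetic expressions and return the result.",
        "web_search": "Search the internet for current information.",
    }
    query = "What is 37 * 91?"
    q = enc.encode(query, convert_to_tensor=True)
    d = enc.encode(list(tools.values()), convert_to_tensor=True)
    scores = util.cos_sim(q, d)[0]
    print(list(tools)[int(scores.argmax())])  # tool whose description best matches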

Codex's causal analysis on Qwen2.5-7B revealed that tool selection happens in the final layers. Layers 0-20 had negligible effect when patched; the decision crystallized in layers 23-27, with layer 26 showing the maximum effect. Importantly, no single attention head was responsible—the decision was distributed across the entire layer's width.
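
A stripped-down version of that patching loop is sketched below, assuming a HuggingFace Qwen2.5-style model whose decoder layers live at model.model.layers. It patches only the final token position, and the prompts and probe token are illustrative, not the agents' exact setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "Qwen/Qwen2.5-7B-Instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()

    clean = "User: what is 37 * 91? The best tool to call is"
    corrupt = "User: good morning, how are you? The best tool to call is"

    # Cache the clean run's residual stream at every layer.
    with torch.no_grad():
        out = model(**tok(clean, return_tensors="pt").to(model.device),
                    output_hidden_states=True)
    clean_h = out.hidden_states  # index 0 is embeddings; i + 1 is layer i's output

    def patched_logit(layer: int, target_id: int) -> float:
        """Re-run the corrupted prompt, splicing the clean activation back in
        at one layer, and return the logit of the probe token."""
        def hook(module, args, output):
            h = (output[0] if isinstance(output, tuple) else output).clone()
            h[:, -1, :] = clean_h[layer + 1][:, -1, :]  # patch last position only
            return (h,) + tuple(output[1:]) if isinstance(output, tuple) else h
        handle = model.model.layers[layer].register_forward_hook(hook)
        try:
            with torch.no_grad():
                logits = model(**tok(corrupt, return_tensors="pt").to(model.device)).logits
        finally:
            handle.remove()
        return logits[0, -1, target_id].item()

    target = tok(" calculator", add_special_tokens=False).input_ids[0]
    for layer in range(model.config.num_hidden_layers):
        print(layer, patched_logit(layer, target))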

Gemini confirmed that tool descriptions had minimal effect. Removing descriptions entirely or shuffling them didn't significantly change GPT-4.1's performance (F1 around 0.58 in all conditions). Probing showed tool-choice signals peaked in late layers (layer 25 of 28 in Qwen2.5-1.5B).

What we learned:

Tool selection is name-based, not description-based. Models have learned that "calculator" does math and "web_search" finds information—not by reading descriptions, but as memorized associations from training. This has practical implications:

  • Choose descriptive, unambiguous names for custom tools (the name matters more than the description)
  • Don't expect description changes to alter model behavior
  • Debugging tool selection failures should focus on name tokens, not description content

The decision is also a "late-binding" phenomenon—it happens in the final layers of the network and involves many components working together, not a single "router" circuit.


Next Week's Competition

The twelfth weekly competition is now open! Voting closes Friday, January 31 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week's findings share a theme of hidden structure: hallucinations aren't random but follow patterns across models, length compliance depends on model-specific instruction-following rather than prompt cleverness, and tool selection uses learned shortcuts rather than semantic understanding. These mechanisms are often invisible from the outside but deeply shape how models behave.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

    @misc{liu-week-of-01-19-2026,
      author = {Liu, Haokun},
      title  = {Week of 01/19/26-01/25/26},
      year   = {2026},
      month  = {January},
      day    = {26},
      url    = {https://hypogenic.ai/blog/weekly-entry-260119}
    }