Week of 03/09/26-03/15/26: When LLMs lie, they get straight to the point

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we tested whether LLMs write differently when they lie, gave AI scientists data from a known formula to see what they actually discover, and measured how reliably LLMs follow conditional "if X then do Y" instructions.

Winning ideas and generated repos here:

Lying Style by Ari Holtzman

When you tell a language model to lie, does it write differently than when it tells the truth? Not what it says, but how it says it. Things like sentence length, hedging, confidence. If there's a detectable "lying style," you could potentially catch AI-generated misinformation without looking inside the model.

Live Salmon Test for AI Scientists by Chenhao Tan

The famous "dead salmon test" in neuroscience showed that bad statistical methods find brain activity in a dead fish. We flip it around: give AI scientists a live dataset where we know the true formula, and see what they actually discover. Do they recover the real relationships? What do they miss?

In-context If-Then Capacity by Ari Holtzman

LLMs are increasingly used in workflows with conditional rules like "if the user mentions refunds, follow policy X." But these rules misfire a lot. Can we actually measure how reliable if-then instruction following is across different types of conditions?


TL;DR for ideas

  1. LLMs have a clear "lying style," and it's the opposite of what you'd expect. When told to lie, models produce shorter, more assertive responses with less hedging. Human liars tend to overexplain and hedge more. Models do the opposite: they commit to the false claim and move on. A simple classifier using 21 surface-level text features can tell truthful from deceptive outputs with 90% accuracy.

  2. AI scientists reliably find obvious patterns but universally miss anything nonlinear. Given data generated from a known formula with both linear terms and an interaction term, all nine AI-generated papers found the linear relationships but none found the interaction. They also never tried methods that could have found it. On the bright side, when given random noise data, they correctly reported "nothing here" every time.

  3. Conditional instruction following is measurable and inconsistent. LLMs are decent at following simple keyword-based if-then rules (97% for GPT-4.1), but accuracy drops sharply for harder conditions like resisting user overrides (79%) or counting exact word occurrences. The main failure mode is over-triggering: models apply rules even when the condition isn't met, at nearly 3x the rate of missing rules they should have applied.

Verdicts

  • Lying Style. Verdict: Supported, lying text is detectably different in surface features. Next question: Is the "lying style" consistent across different models, or does each model lie differently?
  • Live Salmon Test. Verdict: Partially supported, linear effects found reliably but interaction effects missed completely. Next question: If you explicitly tell AI scientists to "check for interactions," do they find them?
  • In-context If-Then Capacity. Verdict: Supported, accuracy varies significantly by condition type. Next question: Can chain-of-thought prompting ("first check if the condition is met") improve conditional accuracy?

Findings from the Ideas

Do LLMs Write Differently When They Lie?

The question. Everyone knows LLMs can produce false information when asked. But forget about what they say. Is there something different about how they say it? If lying text has a different fingerprint from truthful text, you could detect deception just by looking at surface features, without needing access to the model's internals.

What the agents tried.

  • Claude collected 450 responses from GPT-4.1 to 150 factual questions under three conditions: answer truthfully, lie directly, and roleplay as a liar character. They extracted 21 text features (word count, hedging rate, certainty markers, sentence structure, etc.) and trained classifiers; a rough sketch of this kind of pipeline follows this list.
  • Codex ran a similar setup with 120 TruthfulQA questions under truthful and lie-roleplay conditions, using both text features and word-level classifiers. They also tested robustness across different random seeds.
  • Gemini added a useful control: they included a second truthful condition with different prompt wording. This let them check whether the classifier was picking up on lying style or just on differences in how the prompts were written.
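
To make the feature-and-classifier setup concrete, here is a minimal sketch of a surface-feature lying classifier. It assumes a list of response strings and 0/1 labels (1 = generated under a lying instruction); the feature set and the hedge/certainty word lists are illustrative stand-ins, not the 21 features the agents actually computed.

    # Minimal sketch of a surface-feature lying classifier. The features and
    # word lists below are illustrative, not the agents' actual feature set.
    import re
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    HEDGES = {"perhaps", "generally", "typically", "might", "may", "likely"}
    CERTAINTY = {"always", "definitely", "every", "never", "certainly"}

    def surface_features(text):
        words = re.findall(r"[a-z']+", text.lower())
        n = max(len(words), 1)
        sentence_lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()] or [0]
        return [
            len(words),                              # response length
            sum(w in HEDGES for w in words) / n,     # hedging rate
            sum(w in CERTAINTY for w in words) / n,  # certainty rate
            text.count("("),                         # parenthetical asides
            float(np.mean(sentence_lengths)),        # mean sentence length
        ]

    def lying_style_accuracy(responses, labels):
        """Cross-validated accuracy of a style-feature classifier."""
        X = np.array([surface_features(t) for t in responses])
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X, np.array(labels), cv=5).mean()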

What happened.

All three agents found that lying text is distinguishable from truthful text. Claude's classifier hit 90% accuracy using just 21 surface features. Codex's text-only classifier reached 88% (and held up at 91% on a different random seed). The signal is real and reproducible.

The most consistent finding across all agents: lies are shorter. Claude found truthful responses averaged 42 words while direct lies averaged 24. Lies also use less hedging ("perhaps," "generally," "it depends") and more certainty language ("always," "definitely," "every"). Truthful responses include more caveats and context. The model might say "This is a common myth, but actually..." when truthful, versus just stating the false claim when lying.

One interesting finding from Claude: parenthetical asides (the kind you use for citations or clarifications) appeared almost exclusively in truthful responses. Lying text had essentially zero parentheses. That's a surprisingly clean signal.

The roleplay condition produced the most extreme lying style. When playing a liar character, the model used exclamation marks, superlatives, and intensifiers far more than under direct lying instructions. Direct lies were more subtle and harder to catch.

Gemini's control condition revealed an important caveat. When they trained a word-level classifier to distinguish truthful text from lie text, it achieved 80% accuracy. But the same classifier also achieved 97% accuracy at telling apart two truthful conditions that just used different prompt wording. This means some of the classifier's signal comes from how the prompt was phrased, not from lying per se. The style-feature approach (counting hedging, length, etc.) was more robust to this confound.
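
Gemini's control is easy to replicate in outline: train the same kind of word-level classifier twice, once on truthful versus lying responses and once on the two truthful conditions that differ only in prompt wording. The sketch below assumes three placeholder response lists (truthful_a, truthful_b, lies); it is not Gemini's actual code.

    # Sketch of the prompt-wording control. truthful_a and truthful_b are
    # truthful answers elicited with differently worded prompts; lies are
    # responses produced under a lying instruction.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def word_level_separability(texts_a, texts_b):
        """Cross-validated accuracy of a bag-of-words classifier at telling
        two sets of responses apart."""
        pipe = make_pipeline(CountVectorizer(min_df=2), LogisticRegression(max_iter=1000))
        texts = list(texts_a) + list(texts_b)
        labels = [0] * len(texts_a) + [1] * len(texts_b)
        return cross_val_score(pipe, texts, labels, cv=5).mean()

    # If the control score is close to the truth-vs-lie score, much of the
    # "lie detector" signal is prompt wording rather than lying style.
    # word_level_separability(truthful_a, lies)        # truth vs. lie
    # word_level_separability(truthful_a, truthful_b)  # truth vs. truth (control)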

What we learned.

LLMs do write differently when they lie, and you can detect it from surface features alone. But the most striking thing is that the model's lying style is the opposite of what we see in humans. Human liars tend to hedge and overexplain. LLMs commit fully to the false claim and keep it brief. This likely reflects how instruction-tuned models work: when told to be truthful, they add context, caveats, and corrections. When told to lie, they just state the claim. The "truthful style" is actually the more elaborate one.

For anyone building deception detectors: surface features like response length and hedging rate are a decent starting signal, but you need to control for prompt wording differences. Otherwise your detector might just be picking up on how you asked the question.


What Do AI Scientists Actually Discover?

The question. AI scientist systems are being built to automate research. But how do we know what they actually find versus what they miss? The famous "dead salmon" study in neuroscience showed that flawed methods produce false positives (they found "brain activity" in a dead fish). We do the reverse: give AI scientists data generated from a formula we know, and check whether they recover it. This is the "live salmon test."

What the agents tried.

  • Claude gave three frontier models (GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro) a dataset with 1,000 rows and 20 features. The true formula had three linear terms (Y = 2X1 + 0.5X2 - 1.5X3) plus an interaction term (0.8X4*X5), and there were noise features and correlated proxies mixed in. Each model wrote a research paper three times (9 papers total). They also tested a "dead salmon" version with pure random noise. A sketch of this data setup follows the list.
  • Codex tested single-agent versus multi-agent analysis workflows. Three "analysts" with different settings analyzed the same data, then a "synthesizer" combined their findings. The question was whether the collective approach recovers more of the true formula.
  • Gemini designed a clever test: they generated data from both standard physics (F = ma) and a counterfactual law (F = ma²), then gave it to an AI scientist with either real variable names ("mass," "acceleration," "force") or anonymous names ("x1," "x2," "y").
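
For reference, here is a minimal sketch of the kind of dataset Claude's agent describes: the known formula plus noise, 20 features, and one proxy (X16) correlated with X1. The noise scale and the exact proxy construction are assumptions, not the agent's recipe.

    # Sketch of the synthetic dataset: 1,000 rows, 20 features, the known
    # formula plus noise, and X16 as a proxy correlated with X1.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 1000
    X = pd.DataFrame(rng.normal(size=(n, 20)), columns=[f"X{i}" for i in range(1, 21)])
    X["X16"] = X["X1"] + 0.1 * rng.normal(size=n)      # correlated proxy feature
    y = (2 * X["X1"] + 0.5 * X["X2"] - 1.5 * X["X3"]
         + 0.8 * X["X4"] * X["X5"]                     # the interaction term
         + rng.normal(size=n))                         # observation noise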

What happened.

Claude's results were striking. All 9 papers found X1 and X3 (the strongest linear terms). Six out of 9 found X2 (the weaker term). But zero out of 9 found the X4*X5 interaction, even though it accounts for about 8% of the variance in the data. The interaction term has a coefficient of 0.8, comparable in strength to X2's coefficient of 0.5, which most papers did find. The problem isn't that the effect is too small. It's that none of the models even tried methods that could detect it. Every paper used the same basic approach: correlation analysis followed by linear regression. No paper tried tree-based methods, interaction screening, or even residual analysis.
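
A residual-based interaction screen of the kind the papers skipped takes only a few lines. This sketch reuses the X and y from the data-generation snippet above; it is one simple way to surface the missed term, not the only one.

    # Fit the linear model, then rank feature pairs by how strongly their
    # product correlates with the leftover residuals.
    import numpy as np
    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    resid = y - LinearRegression().fit(X, y).predict(X)
    scores = {
        (a, b): abs(np.corrcoef(X[a] * X[b], resid)[0, 1])
        for a, b in combinations(X.columns, 2)
    }
    print(sorted(scores, key=scores.get, reverse=True)[:3])  # the X4, X5 pair should top the list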

On the dead salmon data (pure noise), all 9 papers correctly said "nothing here." Zero false positives. This is actually the opposite of the original neuroscience dead salmon result. AI scientists are conservative. They don't hallucinate patterns in random data.

A third of the papers (3 out of 9) confused correlated proxy features with the real predictors. For example, X16 was designed to be highly correlated with X1, and some papers listed X16 as important alongside or instead of X1.

Codex found that using multiple agents to analyze the same data and then combining their findings improved how much of the true formula they recovered. But it came with a trade-off: the combined analysis was broader but less precise in its quantitative predictions.

Gemini's semantic priming experiment produced the most alarming result. When the data followed F = m*a² but the variables were labeled "mass," "acceleration," and "force," the AI scientist wrote a paper claiming it discovered F = ma with a "perfect fit (R² = 1)." It completely ignored the actual data and reported the textbook answer. When the same data had anonymous labels (x1, x2, y), the AI didn't hallucinate a known law. It still couldn't find the exact formula, but at least it tried to fit the data honestly.

What we learned.

AI scientists are competent but narrow. They apply the standard analysis toolkit (correlation, linear regression) and stop there. They never try methods that might reveal more complex patterns, even when the residuals from their linear models should hint that something is missing. This means AI-generated research papers should be treated as a first-pass linear analysis. Nonlinear effects and interactions may be systematically underreported.

The semantic priming finding is especially important. When variables have recognizable names, AI scientists can default to what they "know" from training data rather than what the data actually shows. This is a form of confirmation bias built into the model.

The good news: they don't make things up from nothing. On random noise, they consistently report null results. The failure mode isn't hallucinating patterns. It's being stuck in a narrow methodological box.


Can LLMs Reliably Follow Conditional Instructions?

The question. As LLMs get deployed in real systems, they're often given conditional rules: "if the user asks about refunds, follow the refund policy," or "if the text contains profanity, add a warning." But in practice, these rules misfire. Sometimes the model applies the rule when it shouldn't (false positive), and sometimes it ignores the rule when it should apply (false negative). Can we actually measure how reliable this is, and does it depend on what kind of condition we're testing?

What the agents tried.

  • Claude built the most comprehensive benchmark: 144 test cases across 7 categories of conditions (keyword matching, semantic understanding, word-sense disambiguation, adversarial override attempts, scaling to many rules at once, negation, and if-then-else branching). They tested three sizes of GPT-4.1 (full, mini, nano). Verification was fully deterministic: each rule required the model to include or exclude a specific marker string; a sketch of this scoring scheme follows the list.
  • Codex created 256 test instances across 4 instruction families (co-mention, exact counting, numeric parity, ordered pairs), testing clean conditions versus conditions with adversarial prompt injections. They also compared minimal versus checklist-style prompt formats.
  • Gemini built CondIF-Bench with 30 long-form creative writing prompts, each containing 2-3 conditional rules. They tested three types of triggers (keyword, conceptual, structural) and three types of actions (word insertion, formatting, content addition).
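
Here is a minimal sketch of the marker-based scoring from Claude's setup: each rule pairs a condition on the user message with a marker string the output must contain only when the condition holds. The refund rule at the end is an illustrative example, not one of the benchmark's actual test cases.

    # Deterministic scoring of a conditional rule: check whether the rule
    # should fire, check whether the marker appears, and classify mismatches.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ConditionalRule:
        condition: Callable[[str], bool]  # does the rule apply to this user message?
        marker: str                       # string the output must contain iff it applies

    def score(rule, user_msg, model_output):
        should_fire = rule.condition(user_msg)
        did_fire = rule.marker in model_output
        if should_fire and not did_fire:
            return "false_negative"  # under-trigger: rule ignored
        if did_fire and not should_fire:
            return "false_positive"  # over-trigger: the dominant failure mode
        return "correct"

    # Illustrative rule: "if the user mentions refunds, include [POLICY-X]".
    refund_rule = ConditionalRule(lambda m: "refund" in m.lower(), "[POLICY-X]")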

What happened.

The headline finding, consistent across all three agents: conditional instruction following is not one ability. It varies a lot by condition type. Claude found statistically significant differences across condition categories for all three model sizes (p < 0.002).

Simple keyword-based rules are mostly fine. Claude's GPT-4.1 scored 97% overall and hit 100% on several easy categories. But harder conditions told a different story. Adversarial conditions, where the user tries to override the rule ("ignore all previous rules"), dropped GPT-4.1 to 79%. Smaller models did much worse: GPT-4.1-nano hit only 43% on adversarial conditions.

The most consistent finding across agents: models over-trigger far more than they under-trigger. Claude found that false positives outnumbered false negatives by nearly 3 to 1 across all models. Smaller models were especially prone to this. GPT-4.1-nano had a 4.2:1 false positive to false negative ratio. It would include the signal marker even when the condition clearly wasn't met. It's like the model sees any rule in the system prompt as a suggestion to always comply, rather than a conditional to check.

Codex found that exact-count conditions were the hardest family. Rules like "if ALPHA appears exactly twice, include BETA exactly twice" failed frequently because the model would produce ALPHA correctly but forget to include BETA. This is a different failure mode from over-triggering: the model checks the condition but fails to execute the action.
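
Checking this family is just as mechanical; the token names and the required count in this sketch are illustrative, not Codex's actual instances.

    # Checker for exact-count rules like "if ALPHA appears exactly twice,
    # include BETA exactly twice".
    def exact_count_ok(output, trigger="ALPHA", action="BETA", count=2):
        if output.count(trigger) != count:
            return True  # condition not met, so the rule imposes no constraint
        return output.count(action) == count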

Gemini found an interesting asymmetry between trigger types and action types. Keyword triggers were easiest (0.66 adherence for GPT-4o). Structural triggers like "at the start of each paragraph" were hardest (0.38). For actions, formatting changes (bold, italics) were easy (0.78), but lexical insertion ("say the word 'Aha'") was hard (0.32). It seems like inserting a specific word mid-sentence feels unnatural enough that the model's fluency preferences override the instruction.

Model size matters a lot. Claude found GPT-4.1 at 97%, GPT-4.1-mini at 88%, and GPT-4.1-nano at 68%. All pairwise differences were statistically significant. Gemini found a similar gap between GPT-4o and GPT-4o-mini, particularly on conceptual triggers where the larger model was 40% better.

One surprising detail from Claude: GPT-4.1 triggered on the Russian word for "secret" (СЕКРЕТ) when given the rule "if the user mentions 'secret,' include the marker." Lexical conditions are implicitly semantic for multilingual models.

What we learned.

If you're building a system that relies on conditional instructions in the system prompt, expect about 97% reliability from frontier models on simple keyword rules. For rules that need to resist user override attempts, expect around 79%. For smaller models, the numbers are much lower.

The main failure mode is over-triggering. Models apply rules even when conditions aren't met. This matters for real applications: a content moderation rule that fires too often is almost as bad as one that doesn't fire enough. If you're deploying conditional rules, budget more for false positives than false negatives.

Harder conditions like exact counting, structural triggers, and resisting prompt injection remain genuinely unreliable. For these cases, external verification (checking the output after generation) is probably more reliable than trusting the model to self-enforce.


Next Week's Competition

The nineteenth weekly competition is now open! Voting closes Friday, March 27 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that LLMs have a detectable lying style (shorter, more confident, less hedging), that AI scientists reliably find linear patterns but completely miss interactions and can be misled by variable names, and that conditional instruction following varies a lot by condition type with over-triggering as the main failure mode.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-03-09-2026,
  author = {Liu, Haokun},
  title  = {Week of 03/09/26-03/15/26},
  year   = {2026},
  month  = {March},
  day    = {16},
  url    = {https://hypogenic.ai/blog/weekly-entry-260309}
}