Week of 01/12/26-01/18/26: LLMs mirror your style

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. This week's experiments explored whether LLMs respond differently depending on how formally you write to them, whether we can systematically benchmark how "human" AI text sounds, and whether LLMs perform worse in non-English languages due to English-centric training.

Winning ideas and generated repos here:

Do LLMs behave differently when the prompter is human vs another LLM? by Dang Nguyen

When you write casually ("Hey, why is the sky blue?") vs. formally ("I would like to request a comprehensive explanation regarding the following topic..."), does the LLM respond differently even though you're asking the same question? This tests whether LLMs adapt to the "style" of their input.

A Leaderboard for AI Undetectability by Ari Holtzman

Can we create a standardized competition to track which methods make AI-generated text harder to detect? This would help both detection researchers (who need tough test cases) and developers who want AI writing to sound natural.

Evaluating Linguistic Performance in LLMs by Kai Lee

LLMs are trained mostly on English but deployed worldwide. Do they perform worse in other languages? And when they fail in French or Hindi, do they make the same mistakes as in English (suggesting they internally translate everything to English first)?


TL;DR for ideas

  1. LLM style mirroring: LLMs significantly change their responses based on how you write to them. Formal prompts produce 66% longer responses with twice as many bullet points—not because you asked for more detail, but because the model mirrors your style. If you want concise answers, write casually.

  2. AI undetectability leaderboard: Simple prompt tweaks (like asking the model to "write naturally with occasional imperfections") achieve a 70% evasion rate against an ensemble detector while preserving full content quality. Even simpler tricks work: just adding a space before commas fools token-based detectors completely. Post-editing (paraphrasing) can match these evasion rates but degrades text quality.

  3. Multilingual performance gaps: LLMs show clear English-centric bias. Claude performs 12-15% worse on Chinese, Arabic, and Swahili compared to English—but translating to English first closes most of that gap. For high-resource languages like French, models make the exact same mistakes in both languages 61% of the time (vs. 33% random chance), suggesting they internally process everything through English.

Verdicts

| Idea | Verdict | Next Question |
| --- | --- | --- |
| LLM style mirroring | Supported: formal prompts get 66% longer, more structured responses | Can we use prompt style to control response length without explicitly asking for brevity? |
| AI undetectability | Supported: prompt engineering reaches 70% evasion with no quality loss | Can ensemble detectors be made robust against simple prompt-based evasion? |
| Multilingual performance | Supported: 7-15% gaps for non-European languages, reduced by translate-first | Why do some languages (Swahili in GPT-4.1) perform better than expected? |

Findings from the Ideas

Do LLMs Behave Differently Based on Prompt Style?

The question: When you ask an LLM the same question in two different styles—casual ("Hey, why is the sky blue?") vs. formal ("I would like to request a comprehensive explanation regarding...")—does the response change? If so, this has implications for prompt engineering, multi-agent systems where LLMs talk to each other, and AI safety.

What the agents tried:

  • Claude created 50 diverse questions across 14 topics (science, philosophy, creative writing, etc.) and wrote two versions of each: one casual/human-style and one formal/LLM-style. They tested GPT-4.1-mini and Claude Sonnet 4 with both prompt versions and measured response length, structure, formality, and reading difficulty (the surface metrics are sketched after this list).
  • Gemini used 31 questions from the HC3 dataset, rewrote each in formal LLM-style, and compared GPT-4o's responses to both versions. They measured response length, sentiment, and refusal rates.
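
To make the measurement concrete, here is a minimal sketch of the surface metrics involved; `ask` is a hypothetical wrapper around a chat-completion API, not the agents' actual harness:

```python
import re

def response_stats(text: str) -> dict:
    """Surface metrics from the write-up: word count and bullet-point count."""
    return {
        "words": len(text.split()),
        "bullets": len(re.findall(r"^\s*(?:[-*•]|\d+\.)\s", text, flags=re.MULTILINE)),
    }

# `ask(prompt)` is a hypothetical helper that calls an LLM and returns text.
def compare_styles(question_pairs, ask):
    casual = [response_stats(ask(c)) for c, _ in question_pairs]
    formal = [response_stats(ask(f)) for _, f in question_pairs]
    return casual, formal
```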

What happened:

Claude found dramatic differences. When given formal prompts, models produced:

  • 66% longer responses (194 words vs. 323 words on average)
  • 120% more bullet points (8.5 vs. 18.6 per response)
  • Higher reading difficulty (college level vs. graduate level)
  • Less vocabulary diversity (they repeated words more often)

The effect was massive: Cohen's d = 2.07 for response length, meaning nearly all formal prompts got longer responses than nearly all casual prompts. Both GPT and Claude showed the same pattern, though GPT increased length more (+152 words) while Claude increased formality more.
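
For reference, Cohen's d is the difference in means divided by the pooled standard deviation; a minimal computation (assumed, not the agents' exact code) looks like this:

```python
import statistics

def cohens_d(xs: list[float], ys: list[float]) -> float:
    """Standardized mean difference with a pooled standard deviation.
    A d around 2 means the two distributions barely overlap."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * statistics.variance(xs) +
                  (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / pooled_var ** 0.5
```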

Gemini found a similar but smaller effect: 19% longer responses to formal prompts (marginally significant at p=0.052). Sentiment and refusal rates stayed the same.

What we learned:

LLMs engage in "style mirroring"—they match their output style to the input style, similar to how humans adjust their speech when talking to different people. Write formally, get formal responses. Write casually, get casual responses.

This matters for practical use: if you want concise answers, write like a human, not like an AI. For multi-agent systems where LLMs communicate with each other, the formal style may naturally emerge, potentially making outputs increasingly verbose and structured over multiple exchanges.


Can We Benchmark AI Text Undetectability?

The question: AI detection is an arms race between text generators and detectors. Can we create a standardized leaderboard to track which methods make AI text harder to detect? The original idea proposed four tracks: inference (changing how you prompt), post-editing (paraphrasing after generation), fine-tuning (training models to write naturally), and pretraining (building models from scratch).

What the agents tried:

  • Claude built a two-track prototype leaderboard testing inference and post-editing approaches. For inference, they tried different system prompts ("write naturally with imperfections", "use varied sentence structures"). For post-editing, they paraphrased AI text with instructions to sound more human. They evaluated against an ensemble detector combining perplexity (how predictable the text is) and burstiness (how varied sentence lengths are).
  • Gemini focused on demonstrating detector fragility. They used a RoBERTa-based detector and tested a simple attack called "SpaceInfi" that just adds a space before random commas (turning "word," into "word ,").
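
The SpaceInfi transformation is simple enough to reconstruct from the description; this sketch adds an insertion probability as our own parameter:

```python
import random

def space_infi(text: str, p: float = 1.0, seed: int = 0) -> str:
    """Insert a space before each comma with probability p,
    turning "word," into "word ," without changing the content."""
    rng = random.Random(seed)
    return "".join(
        " " + ch if ch == "," and rng.random() < p else ch
        for ch in text
    )

print(space_infi("The detector, confident as ever, was fooled."))
# -> The detector , confident as ever , was fooled.
```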

What happened:

Claude found that prompt engineering alone achieved a 70% evasion rate at a 5% false positive rate:

| Method | Evasion Rate | Quality Preserved |
| --- | --- | --- |
| "Human style" prompt | 70% | 100% |
| "Varied sentences" prompt | 70% | 100% |
| Simple paraphrase | 70% | 42% |
| Baseline | 50% | 100% |
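
For readers unfamiliar with the metric: the evasion rate is measured after fixing the detector's threshold so that only 5% of human-written texts are wrongly flagged. A minimal sketch of that computation, assuming higher scores mean "more AI-like":

```python
import numpy as np

def evasion_rate_at_fpr(human_scores, ai_scores, fpr: float = 0.05) -> float:
    """Pick the threshold that flags only `fpr` of human texts as AI,
    then count the AI texts that score at or below it (i.e., evade)."""
    threshold = np.quantile(np.asarray(human_scores), 1 - fpr)
    return float(np.mean(np.asarray(ai_scores) <= threshold))
```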

Prompt-based methods achieved the best trade-off: same evasion as post-editing but without losing content quality. Surprisingly, prompts asking for "personal experiences and opinions" actually increased detectability—apparently the way LLMs write personal anecdotes is recognizable.

Gemini's results were more dramatic. The SpaceInfi attack, literally just adding spaces before some commas, raised undetectability from 0.535 to 0.999 (essentially perfect). LLaMA-Chat text became statistically indistinguishable from human text, despite the trivial modification. This works because subword tokenizers encode "," and " ," as different tokens, so the attacked text maps to token sequences that fall outside the patterns the detector learned.
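
You can see the mechanism with any BPE tokenizer; here `roberta-base` from Hugging Face's transformers library stands in for the detector's tokenizer (an assumption, since the exact detector checkpoint isn't specified):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
print(tok.tokenize("The results, in short, were clear."))
print(tok.tokenize("The results , in short , were clear."))
# A comma with and without a preceding space is encoded as a different
# token, so the attacked text drifts off the detector's training distribution.
```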

What we learned:

Current AI detectors are fragile. They detect tokenization artifacts and writing patterns, not "inhumanity." Simple prompt changes evade 70% of detections with no quality loss, and trivial text modifications (adding spaces) can fool detectors completely.

For detection to improve, detectors need to focus on semantic patterns rather than surface-level token features. A leaderboard framework would help track this arms race and push both sides to improve.


Do LLMs Perform Worse in Non-English Languages?

The question: LLMs are trained mostly on English text but deployed globally. Do they perform systematically worse in other languages? And when they fail, do they make the same mistakes across languages (suggesting they internally translate everything to English)?

What the agents tried:

  • Claude tested GPT-4.1 and Claude Sonnet 4.5 on the XNLI benchmark (determining whether one sentence implies another) across 10 languages from 6 language families. They also tested a "translate-first" approach: run the same questions in English to see if performance improves (a sketch of this comparison follows the list).
  • Gemini tested GPT-4o on MuBench (a multilingual version of MMLU) across 12 languages, including low-resource languages like Tamil and Swahili. They measured both accuracy and whether models make the same mistakes across languages.
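
A minimal sketch of the translate-first comparison; `answer` and `translate_to_english` are hypothetical LLM wrappers, not the agents' actual harness:

```python
def accuracy(examples, answer) -> float:
    """Fraction answered correctly; `answer` maps a prompt to a predicted label."""
    return sum(answer(ex["prompt"]) == ex["label"] for ex in examples) / len(examples)

def translate_first_gain(examples, answer, translate_to_english) -> float:
    """Accuracy gain from translating prompts to English before answering."""
    direct = accuracy(examples, answer)
    translated = accuracy(
        [{**ex, "prompt": translate_to_english(ex["prompt"])} for ex in examples],
        answer,
    )
    return translated - direct  # positive means translation helps
```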

What happened:

Both agents found clear English-centric bias, but with interesting patterns:

Claude's results showed model-specific differences:

| Model | English Accuracy | Avg Non-English Gap | Worst Languages |
| --- | --- | --- | --- |
| GPT-4.1 | 80% | 3.6% | Hindi, Russian (9% gap) |
| Claude Sonnet 4.5 | 85% | 7.3% | Chinese, Arabic, Swahili (15% gap) |

Claude performed better overall but had larger gaps for non-European languages. The translate-first approach revealed something striking: Claude's performance on Chinese jumped from 71% to 85% when using English, a 14 percentage point improvement. Arabic and Swahili showed similar gains. This suggests Claude processes these languages less effectively and benefits from explicit translation.

Gemini's "same mistake ratio" analysis provided the clearest evidence for internal English-centric processing. When models got questions wrong in both French and English, they made the exact same wrong answer 61% of the time (vs. 33% if errors were random). For Korean, it was 64%. But for Tamil (a low-resource language), the same mistake ratio was exactly 33%—random chance.

This suggests that for well-supported languages, the model essentially "translates internally" and reasons in English-like representations. When French fails, it fails the same way English fails. But for poorly-supported languages like Tamil, the model can't even map inputs to its English reasoning space—it just fails independently.
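
The same-mistake ratio itself is easy to compute; here is a minimal sketch for four-option multiple choice, where independent errors would agree roughly one time in three:

```python
def same_mistake_ratio(answers_a, answers_b, gold) -> float:
    """Among questions both runs got wrong, the fraction where they chose
    the same wrong option. With 4 choices, ~1/3 is the baseline if the
    two runs err independently and uniformly over the 3 wrong options."""
    both_wrong = [(a, b) for a, b, g in zip(answers_a, answers_b, gold)
                  if a != g and b != g]
    if not both_wrong:
        return float("nan")
    return sum(a == b for a, b in both_wrong) / len(both_wrong)
```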

What we learned:

LLMs show systematic English-centric bias, especially for non-Indo-European languages. Claude has larger gaps (7% vs. 4% for GPT-4.1) but higher peak performance. For practitioners working in non-European languages, a translate-to-English preprocessing step can recover 10-15 percentage points of accuracy.

The "same mistake" evidence strongly supports the hypothesis that LLMs process most languages through English-like internal representations. This works well for high-resource languages but breaks down for low-resource ones, where the model lacks good mappings to its core reasoning system.


Next Week's Competition

The eleventh weekly competition is now open! Voting closes January 24 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week's findings share a common theme: LLMs are more influenced by surface features than we might expect. They mirror prompt style rather than just answering the question. They're fooled by spacing changes that don't affect meaning. And they process most languages through English-like pathways, failing differently when those pathways don't exist. Understanding these behaviors helps us both use these models more effectively and identify where fundamental improvements are needed.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-01-12-2026,
  author = {Liu, Haokun},
  title = {Week of 01/12/26-01/18/26},
  year = {2026},
  month = {January},
  day = {19},
  url = {https://hypogenic.ai/blog/weekly-entry-260112}
}