Week of 04/13/26-04/19/26: LLMs can do acrostics and nothing else
By Haokun Liu
Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.
This week we tested what happens when you try to blend two concept directions inside a language model, checked whether induction heads are the main reason models leak patterns from their prompt, and measured whether frontier models can hide or find secret messages when the hidden signal is spread thin across long text.
Winning ideas and generated repos here:
Interpolation for Incongruent Concepts by Chenhao Tan
Researchers have found that concepts like "gender" or "sentiment" correspond to specific directions inside a model's internal space. If you walk in a straight line from one concept direction to another, the path is smooth when the concepts are compatible. What happens when the two concepts do not fit together cleanly, like "the capital of France" and "the favorite food of a cat"? Does the path stay meaningful, or does it fall off the manifold?
Are induction heads a large source of leakage? by Ari Holtzman
Induction heads are attention heads that copy patterns from earlier in the prompt. They are the reason a language model can do in-context learning. But they might also be the reason models leak patterned information when they shouldn't, like when you ask a model for a random word and it just parrots something from earlier in the prompt. Does turning them off reduce this leakage?
- induction-head-leakage-9a37-claude
- induction-head-leakage-8afa-codex
- induction-head-leakage-36ef-gemini
Diluted Steganography by Ari Holtzman
Steganography is hiding a message inside something that looks normal. A classic example is an acrostic: the first letter of each sentence spells out the secret. What if the signal is more spread out, like the first letter of the word right after every "the"? Can current LLMs still produce, extract, or detect these hidden messages when the signal is diluted across long passages?
- diluted-steganography-9471-claude
- diluted-steganography-nlp-9907-codex
- diluted-steganography-b649-gemini
TL;DR for ideas
- How a model stores concepts doesn't match how we think about them. Concepts like sentiment and country feel unrelated but sit nearly parallel inside the model, so steering one quietly steers the other. Two directions that point almost the same way blend smoothly, even when the concepts themselves have nothing in common; two directions that are nearly orthogonal produce a confused, high-entropy midpoint, even when the concepts feel related.
- Induction heads are not the source of leakage; removing them makes it worse. They do handle most of the pattern copying inside the model (65-72%), but without them the model falls back on messier mechanisms in the MLPs, which leak even more pattern information and produce more repetitive output. One agent found a clear reduction in leakage by ablating specific heads, but that effect flipped sign depending on which heads were included.
- Top LLMs can hide secret messages only with acrostics. Acrostic encoding (first letter of each sentence) worked 92-100% of the time across GPT-4.1, Claude Sonnet 4.5, and Gemini 2.5 Pro, but every scheme that requires counting word positions, even something as close as every 3rd word, dropped below 31% accuracy. Detection was worse: models either flagged every passage as suspicious or flagged nothing, regardless of whether a message was actually hidden.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| Interpolation between incongruent concepts | Partially supported: geometry predicts interpolation quality, but semantic incongruence does not map cleanly onto geometric incongruence | Can we find a concept-extraction method that decorrelates unrelated concepts so steering one does not bleed into the others? |
| Induction heads as a leakage source | Not supported as stated: in most measurements, induction heads regulate leakage rather than cause it | What components in the MLPs are responsible for the unstructured leakage that shows up when induction heads are turned off? |
| Diluted steganography | Supported: dilution breaks both production and detection across all frontier models tested | Is acrostic performance pattern memorization from training, or a genuine steganographic capability that happens to match a counting-free pattern? |
Findings from the Ideas
Interpolating between incongruent concepts
The question. Researchers have found that a language model's hidden states contain directions that correspond to concepts. If you take the direction for "male vs female" and add or subtract it from a hidden state, you can steer generation. But what happens when you try to mix two of these directions together, especially when the two concepts don't fit together? Does the path between them stay meaningful, or does it produce nonsense?
What the agents tried.
- Claude used GPT-2 medium and extracted directions for six concepts (gender, sentiment, country, tense, formality, spatial) using the standard method of averaging the difference between contrastive sentence pairs. They then interpolated between all 15 pairs at multiple strengths and measured how the output distribution changed along the path (a minimal sketch of this setup follows the list).
- Codex used a real embedding model (text-embedding-3-large) and the SugarCrepe++ caption benchmark, which provides matched triplets of a source caption, a paraphrase of it, and an incongruent alternative. They compared the straight-line path from source to paraphrase against the path from source to incongruent caption, using nearest-neighbor manifold distance and dual-space path length.
- Gemini used GPT-2 small with a sparse autoencoder to analyze what features activate along the path. They compared related pairs (like "Paris" and "Berlin") with clearly incongruent pairs (like "the capital of France" and "the favorite food of a cat"), measuring how fast the output distribution changes per step and how well the SAE can still reconstruct the activations.
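For concreteness, here is a minimal sketch of this kind of setup: extract two difference-of-means concept directions, check how parallel they are, then measure next-token entropy while steering along the straight line between them. The contrastive sentences, the layer index, and the steering scale are illustrative assumptions, not pulled from the agents' actual configurations.

```python
# Minimal sketch (illustrative, not the agents' exact code): difference-of-means
# concept directions in GPT-2, plus entropy along an interpolated steering path.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # assumed transformer-block index; the agents' choices may differ

def mean_hidden(texts):
    """Average the last-token hidden state after block LAYER over sentences."""
    states = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embeddings, so block LAYER's output is index LAYER + 1
        states.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(states).mean(0)

def concept_direction(pos, neg):
    """Difference-of-means direction for one contrastive concept."""
    return mean_hidden(pos) - mean_hidden(neg)

# Hypothetical contrastive pairs for two "unrelated" concepts
sentiment = concept_direction(
    ["The movie was wonderful.", "I loved every minute of it."],
    ["The movie was terrible.", "I hated every minute of it."],
)
country = concept_direction(
    ["She moved to France last year.", "The train crossed France."],
    ["She moved to Japan last year.", "The train crossed Japan."],
)

# The entanglement check: unrelated concepts can still yield near-parallel directions.
cos = torch.nn.functional.cosine_similarity(sentiment, country, dim=0)
print(f"cosine(sentiment, country) = {cos:.2f}")

def next_token_entropy(prompt, direction, scale=5.0):
    """Entropy of the next-token distribution while adding `direction`
    to the residual stream at block LAYER's output."""
    def hook(module, inputs, output):
        hidden = output[0]
        hidden[:, -1, :] = hidden[:, -1, :] + scale * direction
        return (hidden,) + output[1:]
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]
    finally:
        handle.remove()
    return torch.distributions.Categorical(logits=logits).entropy().item()

# Walk the straight line between the two directions and watch for an
# entropy bump at the midpoint.
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    blended = (1 - alpha) * sentiment + alpha * country
    print(f"alpha={alpha:.2f}  entropy={next_token_entropy('The city was', blended):.3f}")
```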
What happened.
Gemini found the cleanest version of the original hypothesis. On the path between incongruent concepts, the per-step change in the output distribution was 480x larger than on the path between related concepts. In plain terms, the output distribution jumps around violently at the midpoint between unrelated concepts, while it barely changes between related ones. The SAE reconstruction error was also 30% higher on incongruent paths, meaning the midpoint lands in a region the model was never trained to handle.
Claude looked more carefully at what "incongruent" actually means geometrically and found something more surprising. The path's behavior was not determined by whether two concepts feel related. It was determined by the angle between their direction vectors. Near-orthogonal pairs (cosine similarity below 0.3) produced the largest entropy bumps in the middle of the path, while near-parallel pairs (cosine above 0.7) interpolated smoothly. And semantically unrelated concepts often had nearly parallel directions. Sentiment and country had a cosine of 0.98. Sentiment and spatial had 0.92. This means the standard way of extracting concept directions is picking up a shared variance component that has nothing to do with the concept itself, which in turn means steering sentiment will also quietly steer the model's representation of country and spatial position.
Codex used real image captions instead of synthetic contrastive pairs and found that "incongruence" is not a single thing. Relation swaps, like changing "two people holding phones together to link" to "two men dropping phones to unlink", produced unstable paths with longer dual-space detours. Object and attribute swaps often did not, because the incongruent caption stayed lexically close to the source. In that case the congruent paraphrase is actually further away in embedding space than the contradictory alternative. So a contradictory endpoint can still sit on a short, smooth path if the surface form barely changed.
What we learned.
Three versions of the same question produced three pieces of the answer. Incongruent concepts do produce unstable interpolation, but only when the incongruence shows up as geometric independence, not as semantic conflict. The most practical takeaway is Claude's observation that standard concept directions are entangled with each other through shared variance. If you are doing activation steering, steering one concept will have side effects on concepts you thought were unrelated, and the side effects are predictable from the cosine similarity of the extracted directions.
Induction heads and where leakage actually comes from
The question. Induction heads are attention heads that copy patterns from earlier in the context. If the prompt contains "A B ... A", an induction head lets the model predict "B" next. This is a key mechanism behind in-context learning. But it might also be why models fail to generate random outputs, why they repeat themselves, and why they sometimes leak information from their prompt when they shouldn't. If we turn off the induction heads, does the leakage go away?
What the agents tried.
- Claude analyzed GPT-2 small (12 layers, 144 heads) and GPT-2 medium (24 layers, 384 heads) using TransformerLens. They identified induction heads, then zero-ablated them (a minimal hook-based sketch follows the list) and measured three things: how much pattern leakage happens in the continuation region, how random the model's output is under free generation, and what fraction of total pattern copying flows through induction heads.
- Codex used GPT-2 small and tested whether ablating the top-scoring induction heads reduces the model's bias toward repeating a token from the prompt on a "pick a random color" task. They also ran sensitivity sweeps with different head subsets.
- Gemini tested GPT-2 small with a specific prompt design: a repeated pattern (like "42 elephant") followed by a request for a "completely random and unrelated word." They measured how often the model still outputs the pattern token (like "elephant") and how ablating individual heads changes that.
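Here is roughly what zero-ablation looks like in TransformerLens. The head list and the repeated-pattern prompt below are placeholders, not the agents' actual selections or metrics.

```python
# Minimal zero-ablation sketch (illustrative heads and prompt, not the agents'
# full harness): compare the probability of a pattern token before and after
# silencing candidate induction heads.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
HEADS = [(5, 1), (5, 5)]  # (layer, head) -- hypothetical choices

def make_ablate_hook(head):
    def ablate(z, hook):
        # z: [batch, pos, head_index, d_head]; zero out one head's output
        z[:, :, head, :] = 0.0
        return z
    return ablate

fwd_hooks = [(f"blocks.{layer}.attn.hook_z", make_ablate_hook(head))
             for layer, head in HEADS]

prompt = "cat dog cat dog cat dog cat"
tokens = model.to_tokens(prompt)
pattern_id = model.to_tokens(" dog", prepend_bos=False)[0, 0]

with torch.no_grad():
    clean_logits = model(tokens)
    ablated_logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)

for name, logits in [("clean", clean_logits), ("ablated", ablated_logits)]:
    p = torch.softmax(logits[0, -1], dim=-1)[pattern_id].item()
    print(f"{name}: P(' dog' next) = {p:.4f}")
```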
What happened.
All three agents confirmed that induction heads really are the dominant copying mechanism. Claude's activation patching showed induction heads account for 72% of copying in the small model and 65% in the medium model. Gemini found that the base model assigns 7.8% probability to the pattern token (about 3,800x more than uniform), and L5H1 alone accounts for most of that pull.
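That multiplier is easy to sanity-check, assuming the comparison is against a uniform distribution over GPT-2's 50,257-token vocabulary:

```python
# Rough check of the "3,800x more than uniform" figure, assuming a uniform
# baseline over GPT-2's vocabulary.
p_pattern, vocab_size = 0.078, 50257
print(p_pattern / (1 / vocab_size))  # ~3,920, in line with the reported ~3,800x
```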
Where the agents diverged was on whether turning these heads off actually helps. Gemini's cleanest experiment said yes: ablating the top 5 induction heads cut pattern leakage from 7.8% to 0.3%, a 95% reduction.
But Claude ran broader tests and got the opposite result on pattern leakage. Ablating induction heads increased the leakage ratio from 24x to 54x in GPT-2 small, and from 34x to 78x in GPT-2 medium. Without induction heads, the model fell back on less precise copying mechanisms in the MLPs and other attention heads. Those mechanisms produced lower-entropy, more repetitive output that ended up overlapping more with pattern tokens, not less. Random-generation quality also got worse: repetition rate went up by 327% in the small model, and unique tokens dropped by 31%.
Codex found that the answer depends on exactly which heads you pick. Ablating L5H5 and L5H1 reduced repeated-token bias on random-choice prompts. Including L6H9 flipped the sign and made bias go up. And in every case that reduced bias, general language modeling got worse as measured by WikiText perplexity.
Why did Gemini get such a clean result? The Gemini setup used a specific "assign probability to one designated pattern token" metric on a narrow task. Claude's metric was a broader leakage ratio across all pattern tokens in a continuation region, and Codex measured distributional bias over a candidate set. Induction heads might cleanly handle the "echo the exact token" case while the broader copying behavior is spread across many components.
What we learned.
The hypothesis that induction heads are the main source of leakage does not survive careful testing. Induction heads dominate structured copying, but they also suppress messier copying mechanisms in the rest of the model. Turning them off usually makes the model more repetitive and less random, not more. A useful way to think about them is as regulators: their precise copying crowds out the noisier, lower-entropy leakage that the MLPs would otherwise produce. If you want to reduce model leakage for privacy or safety, targeting induction heads alone is the wrong lever. You would need to identify the MLP circuits that fill the gap when induction heads are gone.
The steganography cliff
The question. Steganography is hiding a secret message inside text that looks ordinary. The simplest version is an acrostic, where the first letter of each sentence spells the message. Current LLMs can do acrostics. But what if the signal is more diluted, like encoding a letter in the first letter of every third word, or in the word that follows every occurrence of "the"? Can models still write text that hides the message, extract the message when given the rule, or notice that a hidden message exists without knowing the rule?
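To make the schemes concrete, here is what decoding looks like at the two ends of the spectrum. The cover text is a made-up example, not drawn from the agents' benchmarks.

```python
# Decoders for two schemes from the experiments: an acrostic and a diluted
# "every k-th word" rule. The cover text is an invented example.
import re

def decode_acrostic(text):
    """First letter of each sentence spells the message."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return "".join(s[0].upper() for s in sentences)

def decode_every_kth_word(text, k=3):
    """First letter of every k-th word spells the message."""
    words = text.split()
    return "".join(w[0].upper() for w in words[k - 1::k])

cover = ("Help arrived late. In the end we managed. "
         "Dawn broke over the hills. Everyone cheered.")
print(decode_acrostic(cover))           # -> HIDE
print(decode_every_kth_word(cover, 3))  # first letters of words 3, 6, 9, ...
```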
What the agents tried.
- Claude tested three frontier models (GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro) across six encoding schemes of increasing dilution: acrostic, every 3rd/5th/10th/20th word, and a trigger-word scheme where the letter sits after each "the." They ran three separate experiments: producing encoded text, extracting a message when told the rule, and detecting a hidden message without being told the rule.
- Codex built a benchmark from WikiText-2, IMDB, and AG News with three schemes (acrostic, dense keyword trigger, diluted keyword trigger) and tested GPT-4.1 and GPT-5. They reported per-character accuracy, full-message recovery rate, and detection accuracy.
- Gemini took a cleaner bit-level approach. A single secret bit hides in whether the word after a trigger starts with a letter from the first or second half of the alphabet. They swept trigger frequency from "the" (dense) to "however" (very sparse), then measured how well GPT-4o detects and extracts (a decoding sketch follows the list).
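A decoder for that bit scheme might look like the sketch below; the trigger handling and punctuation stripping are assumptions on my part.

```python
# Sketch of a decoder for the bit-level scheme (punctuation handling is an
# assumption): the word after each trigger encodes 0 if it starts with a-m,
# 1 if it starts with n-z.
def decode_bits(text, trigger="the"):
    words = text.lower().split()
    bits = []
    for i in range(len(words) - 1):
        if words[i] == trigger:
            first = words[i + 1].strip('"\'().,!?')[:1]
            if first.isalpha():
                bits.append(0 if first <= "m" else 1)
    return bits

print(decode_bits("We walked past the orchard and into the barn."))  # [1, 0]
```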
What happened.
Acrostics work. Claude got 92-100% per-character accuracy on acrostic encoding across all three frontier models, with exact match rates of 80-100%. Codex saw 58-67% full-message recovery on acrostics. These are strong results.
Nearly everything else failed. The moment the scheme shifts away from "first letter of each sentence," performance collapses. Claude's results are the clearest: all three frontier models dropped from near-perfect acrostics to under 31% accuracy on every other scheme. There was no smooth degradation. It was a cliff.
Codex's dense keyword scheme (encoding a letter in the word after every "signal") was the one exception. GPT-4.1 got 75% full-message recovery there, which is actually higher than its acrostic performance. Once the signal was diluted across longer passages with distractor variants, recovery dropped to 25%. So the cliff tracks whether the encoding rule requires counting or maintaining position state across a long passage, not whether it is an acrostic specifically.
Detection is essentially broken. Claude tested whether the three frontier models could tell steganographic text apart from normal text without knowing the rule. GPT-4.1 said "yes, there is a hidden message" to every input, including plain control text. Gemini 2.5 Pro did the same. Claude Sonnet 4.5 said no to every input, even when the acrostic was perfectly readable to humans. Gemini's study found GPT-4o had zero true positives across every dilution level, even dense ones.
When given the rule and asked to extract, models do better than they do at blind detection, but only in narrow cases. Gemini found GPT-4o got about 71% accuracy on single-bit messages at very sparse dilution, but dropped to 40% once the message needed 10 bits. The bottleneck seems to be keeping track of 10 trigger-word positions reliably across a long passage, not understanding the rule.
What we learned.
Frontier LLM steganographic ability is narrow. Models have learned acrostics as a specific pattern, probably from training data full of acrostic poems and puzzles. They have not learned a general steganographic capability. Anything that requires counting word positions or following a trigger-word rule breaks down quickly as the signal is spread across longer context.
There are two implications. For AI safety, blind detection does not work at any dilution level, so you cannot rely on an LLM-as-judge to flag hidden signals in model outputs. That role belongs to statistical methods that look at token-frequency anomalies, not prompt-based judgments. For capability forecasting, strong acrostic performance should not be taken as evidence that models can do steganography in general. It is a memorized skill tied to a specific, counting-free pattern.
Next Week's Competition
The twenty-fourth weekly competition is now open! Voting closes Friday, April 24 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week we saw that interpolation between concept directions is governed by the geometry of the directions, not the meaning of the concepts, and that standard concept extraction methods quietly entangle unrelated concepts. We found that induction heads are not the main source of leakage, and turning them off usually makes models more repetitive and less random. And we confirmed a sharp cliff in frontier LLM steganography: acrostics work, anything more diluted falls apart, and blind detection is broken regardless of how clear the hidden signal is.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-04-13-2026,
  author = {Liu, Haokun},
  title  = {Week of 04/13/26-04/19/26},
  year   = {2026},
  month  = {April},
  day    = {20},
  url    = {https://hypogenic.ai/blog/weekly-entry-260413}
}