Hypogenic AI - Shaping the Future of Science

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. This week's winners come from shivanshu sirohi and Ari Holtzman.

This week we tested whether handing an AI the existing pieces of an invention helps it combine them into something new, whether LLMs write about moving through a building in a recognizable machine-like way, and whether the hardest-to-predict differences between a raw model and its tuned version are where the real alignment behavior hides.

Winning ideas and generated repos here:

Beyond Trial and Error: AI-Driven Invention (Atomic Modelling) by shivanshu sirohi

Every invention is just a new way to combine existing things. The ingredients are already out there. What matters is how you put them together. So here's the question: if you give an AI a set of existing building blocks to work with, will it come up with better inventions than if you just ask it to brainstorm from scratch?

Narrative Space in LLMs by Ari Holtzman

When an AI writes a story involving someone moving through a building, does it describe that space in a way that feels noticeably non-human? The idea is similar to how chatbots have a recognizable tone from instruction tuning. AI models might also have a consistent, predictable pattern when describing physical movement through space.

What's surprisingly different between base and instruct models? by Ari Holtzman

A base model is trained to predict text. An instruct model starts as the same model but gets extra training to follow instructions and behave better. Most of the time, both models predict the same words. But sometimes they strongly disagree. This idea trains a small predictor to guess, word by word, how much the two disagree. Then it looks at where the predictor is wrong. Those unexpected spots may reveal the real, hidden effects of the extra training.

TL;DR for ideas

Giving an AI existing pieces and asking it to recombine them actually made things worse. The AI just focused on what you gave it and repeated it back. To get more creative output, you need a more structured approach: turn each piece into a general principle, hide the originals, then have the model review its own work. Even then, the output becomes more novel but less practical. On a separate note, in a materials science experiment, letting an AI choose which candidates to test (instead of picking randomly) found about 40% more stable materials using the same amount of effort.
LLMs do describe moving through buildings in a recognizable, templated way. They tend to narrate the experience ("I walked to the elevator and noticed the long hallway") rather than give directions ("take the elevator"). They also add more sensory details and landmarks, and almost never use commands. The most striking finding is that six different models wrote more like each other than like any human writer. It is still unclear whether this pattern comes from instruction tuning specifically.
Most of the word-by-word disagreement between a base model and its tuned version is predictable, and just reflects how uncertain the base model already was. The hard-to-predict disagreements cluster in a few areas: safety language, refusals, and the injected identity the model was trained to state (like "I'm made by Alibaba"). These clusters can be surfaced without any labeled data. One agent also discovered some large hidden disagreements on plain text, even in cases where both models end up picking the same next word.

Verdicts

Idea	Verdict	Next Question
AI-driven invention	Partially supported: structured recombination and guided search help, but simply handing over the pieces hurts	Can a model be pushed to produce ideas that are both new and actually feasible, instead of trading one for the other?
Narrative space	Supported: LLMs share a templated spatial style and write more like each other than like humans	Does this template come from pretraining, instruction tuning, or the final reward step?
Base vs instruct divergence	Supported: the unpredictable disagreements point straight at alignment behavior	Does instruction tuning quietly rewrite how a model continues ordinary text, not just how it refuses?

Findings from the Ideas

Does Giving an AI the Existing Pieces Help It Invent?

The question. The idea behind this submission is simple. Every invention is a new arrangement of things that already exist. The raw ingredients are already here; what's missing is a smarter way to combine them. So if you give an AI the existing building blocks and ask it to recombine them, does it invent better than if you just let it brainstorm freely? The freely brainstorming version is what the submitter calls trial and error.

What the agents tried.

Claude tested this most directly, on research ideas rather than materials. It used 160 real cases mined from published papers, where researchers connected one concept to another to make something new. The true answer (what the researchers actually came up with) was hidden. Claude then compared two setups: ask the model freely, or hand it the most relevant existing concepts and ask it to recombine them.
Codex worked on real materials. It used a database of over 120,000 known materials labeled stable or not, and trained a small model to guess which untested combinations would be stable. Then it compared picking candidates at random versus letting the model guide which ones to test next, under the same budget.
Gemini also worked on materials. It asked GPT-4o to propose new stable three-element compounds, either from its own knowledge or with feedback, and checked the proposals against a database of known stable materials from DeepMind.
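To make the random-versus-guided comparison concrete, here is a minimal sketch on synthetic data. It is not Codex's code: the pool size, the 20% base rate, the noise level, and the one-shot ranking (rather than an iterative test-and-retrain loop) are all assumptions chosen for illustration.

```python
import random

def make_candidates(n, rng):
    """Synthetic pool: each candidate has a hidden 'stable' label and a noisy model score."""
    pool = []
    for _ in range(n):
        stable = rng.random() < 0.2  # hypothetical 20% base rate of stable materials
        # The score is informative but imperfect: stable candidates score higher on average.
        score = (1.0 if stable else 0.0) + rng.gauss(0.0, 0.8)
        pool.append((score, stable))
    return pool

def random_pick(pool, budget, rng):
    """Test `budget` candidates chosen uniformly at random; return how many are stable."""
    return sum(stable for _, stable in rng.sample(pool, budget))

def guided_pick(pool, budget):
    """Test the `budget` candidates the model scores highest; return how many are stable."""
    ranked = sorted(pool, key=lambda c: c[0], reverse=True)
    return sum(stable for _, stable in ranked[:budget])

rng = random.Random(0)  # fixed seed so the run is repeatable
pool = make_candidates(5000, rng)
budget = 500            # same number of "tests" for both strategies

found_random = random_pick(pool, budget, rng)
found_guided = guided_pick(pool, budget)
print("stable found, random:", found_random)
print("stable found, guided:", found_guided)
```

The point of holding `budget` fixed is that any gain comes from choosing better, not from testing more.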

What happened.

The most surprising result came from Claude. Handing the model the relevant existing concepts made it worse, not better. Its proposals were less faithful to the real answer and judged less feasible than when it brainstormed freely. The reason is that the model got stuck on whatever pieces you handed it. It literally echoed back one of the supplied concepts 20% of the time, versus 1% when left alone. Giving it the ingredients made it fixate on them.

Claude could only beat free brainstorming after adding a lot of structure: turn each concept into a general principle, hide the original wording so the model can't copy it, then have it critique and revise its own idea. That version did produce more novel ideas, but they were less feasible. It traded one for the other instead of winning outright.

The materials experiments were cleaner but tested a narrower claim. Codex found that letting the model guide which candidates to test turned up about 40% more stable materials than random picking for the same budget (1,933 versus 1,365). Oddly, picking the candidates the model was most unsure about did worse than random, so exploring without a goal hurt. Gemini's results pointed in the same direction, with the model's proposals more stable than random guesses, but the sample was tiny (30 proposals) and the difference wasn't statistically reliable.
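As a quick check on the "about 40%" figure, the relative gain can be worked out from the two counts reported above. The variable names are ours.

```python
guided_found = 1933  # stable materials found with model-guided selection
random_found = 1365  # stable materials found with random selection

relative_gain = (guided_found - random_found) / random_found
print(f"{relative_gain:.1%}")  # 41.6%
```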

What we learned.

The intuition that invention is just recombination is right in spirit but wrong about the easy version. Simply handing a model the existing pieces backfires, because it anchors on them and repeats them. How you combine matters far more than just having the parts. A structured process helps, and guided search clearly beats random trial and error when you have a way to score candidates. But none of these setups demonstrated real invention. They all checked answers against a database or a known label, not against an actual lab experiment.

Do LLMs Have Their Own Way of Writing About Space?

The question. When an LLM writes a story about moving through a building, does it describe the space in a recognizable, machine-like way that humans don't? The thought is that, the same way instruction tuning gives chatbots a recognizable tone, models might have a recognizable habit for narrating movement through physical space.

What the agents tried.

Claude ran the largest test. It collected real human descriptions of navigating buildings, then had six different instruction-tuned models write the same kind of text, over 2,500 samples in total. It measured 29 features of spatial language (motion verbs, landmarks, directions, whether the text commands or narrates) and also ran a raw-versus-tuned model pair to check the instruction-tuning claim directly.
Codex ran a smaller, careful version with one model (GPT-5.4-mini) and real human data. It included a control where both humans and the model wrote full stories, not just terse directions, so the comparison wasn't only about format.
Gemini's setup was the weakest. It had no real human data at all. It compared the model's default output to the same model when told to "write like a human," which is really a model talking to itself.
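For a sense of what measuring features of spatial language can look like, here is a toy sketch with three features. The word lists and feature names are hypothetical and far cruder than the 29 features Claude used, which the write-up does not enumerate.

```python
import re

# Hypothetical word lists; a real feature set would be much richer.
LANDMARKS = {"elevator", "hallway", "door", "stairs", "lobby", "floor"}
IMPERATIVE_STARTS = {"take", "turn", "go", "walk", "head", "follow"}
FIRST_PERSON = {"i", "my", "me", "we", "our"}

def words(text):
    """Lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def spatial_features(text):
    tokens = words(text)
    sentences = [words(s) for s in re.split(r"[.!?]+", text)]
    sentences = [s for s in sentences if s]  # drop empty fragments
    # Crude command detector: a sentence that opens with a bare motion verb.
    commands = sum(1 for s in sentences if s[0] in IMPERATIVE_STARTS)
    return {
        "landmarks": sum(t in LANDMARKS for t in tokens),
        "first_person": sum(t in FIRST_PERSON for t in tokens),
        "imperative_share": commands / len(sentences) if sentences else 0.0,
    }

directions = "Take the elevator, third door on your left."
narration = "I walked to the elevator and noticed the long hallway."
print(spatial_features(directions))
print(spatial_features(narration))
```

Even this rough version separates a command-style sentence from a narrated one: the first scores as fully imperative with no first-person words, the second as the reverse.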

What happened.

LLM spatial writing is easy to tell apart from human writing. Claude's classifier separated the two almost perfectly, and the gap held up even after matching the texts for length, which rules out the obvious explanation that the model just writes longer. Codex saw the same thing, and it survived in the story-versus-story control too.

The direction of the difference is consistent. Humans give directions ("take the elevator, third door on your left"). Models narrate ("I walked to the elevator and noticed the long hallway"). Models pack in more sensory detail, reuse the same landmarks more, and almost never use commands. The most striking result is from Claude: six different model families write more like each other than like any human, and each one is more internally uniform than human writers are. There's a shared template, something like threshold, elevator, numbered floor, "Nth door on your left."

The one weak link is the cause. Whether instruction tuning specifically creates this template is unclear. Claude's one raw-versus-tuned comparison showed the tuned model sitting a little farther from human writing, but the effect was modest and was tested on only a single model family.

What we learned.

LLMs really do have a shared, recognizable way of writing about moving through space, and it leans toward narrating a scene rather than directing someone through it. The strongest evidence is the sameness across models. Independent models converge on the same style and each one is more uniform than people are. What we can't yet say is where the template comes from. It could be baked in during pretraining, it could come from instruction tuning, or it could just be the prompt. Gemini's run should be set aside here, since it had no human comparison to begin with.

Where Do a Raw Model and Its Tuned Version Really Disagree?

The question. A base model is the raw model trained to predict text. An instruct model is that same model after extra tuning to follow instructions and behave. For most words they predict nearly the same thing, but sometimes they sharply disagree. The idea is to train a small predictor to guess, word by word, how much the two disagree, then study where the predictor is wrong. Those surprises might be where the real, non-obvious effects of the tuning live.

What the agents tried. All three used the same model family (Qwen) and the same basic recipe: run text through both the raw and tuned model, measure how much they disagree at each word, then train a predictor and look at its mistakes.

Claude ran the most thorough version, across three model sizes. It trained the predictor on one set of prompts but studied its mistakes on a completely different set, so the predictor couldn't just memorize.
Codex used the most varied text, including safety prompts, chat data, and plain encyclopedia text, and trained its predictor on hand-picked features of each word.
Gemini took a different angle, predicting the disagreement from the model's internal activations, but only on 30 examples, which is too few to trust.
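The shared recipe can be sketched on toy numbers. Everything specific here is an assumption: we use KL divergence as the disagreement measure, a one-feature linear fit on the raw model's entropy as the "predictor," and hand-written three-word distributions in place of real Qwen outputs.

```python
import math

def entropy(p):
    """Shannon entropy in nats: how unsure a distribution is."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy positions: (label, raw-model distribution, tuned-model distribution).
train = [
    ("the", [0.90, 0.05, 0.05], [0.92, 0.04, 0.04]),
    ("cat", [0.60, 0.30, 0.10], [0.75, 0.20, 0.05]),
    ("sat", [0.40, 0.35, 0.25], [0.70, 0.20, 0.10]),
    ("on",  [0.34, 0.33, 0.33], [0.80, 0.10, 0.10]),
]
held_out = [
    ("mat",     [0.50, 0.30, 0.20], [0.70, 0.20, 0.10]),
    ("However", [0.85, 0.10, 0.05], [0.10, 0.85, 0.05]),  # raw model confident, tuned model flips
]

# Fit the predictor on one set of positions ...
slope, intercept = fit_line(
    [entropy(raw) for _, raw, _ in train],
    [kl(tuned, raw) for _, raw, tuned in train],
)

# ... then study its mistakes on a different set.
surprises = []
for label, raw, tuned in held_out:
    predicted = slope * entropy(raw) + intercept
    surprises.append((label, kl(tuned, raw) - predicted))
surprises.sort(key=lambda item: abs(item[1]), reverse=True)

print("slope:", round(slope, 3))
for label, residual in surprises:
    print(label, round(residual, 3))
```

In this toy, divergence rises with the raw model's uncertainty on the training positions, and the held-out position with the largest miss is the one where the raw model was confident but the tuned model moved anyway.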

What happened.

Most of the disagreement is predictable. Claude could predict about 60 to 68% of it, and the best signal was simply how unsure the raw model already was at that word, not where the word sat in the sentence. So a lot of the gap between raw and tuned models is just the tuned model committing where the raw model was hesitant.

The interesting part is the leftover, the disagreement the predictor can't explain. For Claude, those misses clustered on safety prompts and on specific words: hedging and refusal words like "If" and "However," safety framing, and the model's injected identity (the word "Alibaba," Qwen's maker, showing up as a planted answer to "who are you"). The point is that you can surface this kind of alignment behavior without labeling anything.

Codex found the same thing, that the misses were meaningful rather than random, and added a surprise. Some of the biggest hidden disagreements showed up on plain encyclopedia text, not on instructions. In several cases both models even picked the same next word, yet disagreed strongly underneath on everything else. That means looking only at the single most likely word misses a lot.
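A small made-up example shows how two models can agree on the top word while disagreeing on the rest of the distribution. The vocabulary, the probabilities, and the choice of KL divergence as the measure are all illustrative assumptions, not values from the experiments.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

vocab = ["Paris", "Lyon", "France", "the"]
raw   = [0.40, 0.30, 0.20, 0.10]  # hypothetical raw-model probabilities
tuned = [0.50, 0.02, 0.03, 0.45]  # hypothetical tuned-model probabilities

top_raw = vocab[raw.index(max(raw))]
top_tuned = vocab[tuned.index(max(tuned))]
divergence = kl(tuned, raw)

print("top word (raw):", top_raw)
print("top word (tuned):", top_tuned)
print("KL(tuned || raw):", round(divergence, 3))
```

A comparison that only checks the single most likely word would score this position as full agreement, even though much of the tuned model's probability mass has moved.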

What we learned.

Where a divergence predictor fails is where the genuinely interesting differences between raw and tuned models hide. The average shift is boring and predictable, since it mostly tracks how unsure the raw model already was. But the part you can't predict points straight at alignment behavior: refusals, safety framing, and identity. And one finding is worth chasing. Instruction tuning may quietly change how a model continues ordinary text, not just how it refuses, even when the top word it predicts stays the same.

Next Week's Competition

The thirty-first weekly competition is now open! Voting closes Friday, June 12 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that simply giving an AI the existing ingredients makes it a worse inventor, not better, because it fixates on what you hand it, though a more structured process and guided search both help; that LLMs share a templated, narrated style when writing about moving through space and write more like each other than like any human; and that the unpredictable disagreements between a raw model and its tuned version point straight at safety, refusals, and identity.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.

If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-06-01-2026,
  author = {Liu, Haokun},
  title = {Week of 06/01/26-06/07/26},
  year = {2026},
  month = {June},
  day = {8},
  url = {https://hypogenic.ai/blog/weekly-entry-260601}
}