Week of 04/27/26-05/03/26: Your Spotify history plus a chatbot beats classical recommenders

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we tested whether one specific component inside language models is responsible for them sounding bland and predictable, asked whether a chatbot can recommend music as well as Spotify's algorithms, and looked at what it actually means for a language model to simulate a population of humans.

Winning ideas and generated repos here:

LayerNorm and concept association in creative tasks by Xiaoyan Bai

Language models often produce safe, conventional outputs even when many other valid options exist. Inside the model, every layer applies an operation called LayerNorm that rescales the hidden representations to a fixed scale. The question is whether LayerNorm itself is part of why models collapse onto the most common, "human-intuitive" outputs, by squeezing the geometry of concepts in latent space.

How much do LLMs solve recommendation algorithms? by Ari Holtzman

If you export your full Spotify listening history and paste it into Claude or GPT, can the chatbot give you music recommendations that are as good as Spotify's? Recommendation systems have decades of work behind them and rely on what other users listened to alongside you. A general LLM has none of that, but it has a lot of world knowledge about music.

A theory of simulation by Chenhao Tan

If you describe a person to an LLM (their personality, demographics, history) and ask what they would do, the model produces an answer. But how do those answers compare to what real humans actually do? Can a language model that has read a lot about people produce a wider range of human behavior than any single person could, or does it collapse onto a few well-worn patterns?


TL;DR for ideas

  1. Bland LLM outputs come mostly from the final word-picking layer, with LayerNorm playing only a small role. Removing the mean-subtraction part of LayerNorm widens internal concept geometry by about 7% on Pythia-410M. The wider geometry only reaches the output when sampling temperature is also high. A side finding worth flagging: the standard creativity metric (Divergent Association Task) rewards gibberish like "stellarnazi" because random embeddings have high cosine distance, so any LayerNorm study without a validity gate will reach the wrong conclusion.

  2. A general chatbot fed your listening history matches the algorithms behind classic recommenders, for about 45 cents. Across two studies, Claude Sonnet 4.5 and GPT-4.1 beat popularity baselines cleanly (NDCG@10 went from 0.17 to 0.38) and tied the strongest classical baseline (TF-IDF content filter). The personalization is real: the same model with no history drops below random. Whether they beat Spotify's actual production system is a different question that none of these experiments answered.

  3. Adding a written narrative on top of a personality profile narrows the LLM's simulation of a person. The model treats trait labels like "low extraversion" as stronger stereotypes than the underlying numbers, which collapses variance further. Across the three axes measured: between-scale correlations sharpen by 1.6 to 1.9x compared to humans; the marginal distribution rarely reaches the top or bottom 10%; and on cognitive biases the model is 4 to 6 times stronger than humans on framing and outcome bias but 2 to 5 times weaker on less-is-more.

Verdicts

Idea: LayerNorm and creative collapse
Verdict: Partially supported; LayerNorm shapes concept geometry but is not the dominant cause of bland outputs.
Next question: If LayerNorm only matters at the margins, where is most of the collapse coming from, and can we relax the final layer that produces tokens without breaking the model?

Idea: LLMs as recommenders
Verdict: Supported; LLMs match the lightweight Spotify-style baselines on rank-quality metrics.
Next question: Where is the ceiling on LLM-only recommendation, and at what point does a hybrid system that combines an LLM with a real user-item retrieval engine clearly beat both?

Idea: Theory of simulation
Verdict: Partially supported; structurally a superset, marginally a subset, and counterfactually inconsistent.
Next question: Does long-form qualitative grounding (full interviews, diary entries) actually expand the LLM's range, or does even rich text just push it harder onto a stereotype?

Findings from the Ideas

Does LayerNorm Make Models Sound Bland?

The question. Inside a language model, every transformer block applies an operation called LayerNorm that rescales the hidden representations to a fixed scale. There are good optimization reasons for this. But it also means the model loses some of the magnitude information about how unusual or distant a concept is. The question is whether LayerNorm itself contributes to the well-known tendency of language models to collapse onto safe, conventional outputs in creative tasks, by squeezing the geometry of concepts in latent space.
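
To make the intervention concrete, here is a minimal sketch, in plain NumPy rather than the models' actual code, of standard LayerNorm next to the "no mean-subtraction" variant tested below (essentially an RMS-style rescaling). The shapes, toy activation scale, and parameter handling are illustrative assumptions.

```python
# Minimal sketch, not the GPT-2 / Pythia implementation: standard LayerNorm
# versus the "remove mean-subtraction" variant (an RMSNorm-style rescaling).
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Standard LayerNorm: center, divide by the standard deviation, rescale.
    # Every hidden vector is forced onto the same scale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def no_mean_subtraction_norm(x, gamma=1.0, eps=1e-5):
    # The intervention: skip the centering and normalize by the root mean
    # square only, so information along the mean direction survives.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

# Toy pre-norm activations with large variance (~186, the level reported below).
hidden = np.random.default_rng(0).normal(scale=np.sqrt(186), size=(4, 768))
print(layer_norm(hidden).var(axis=-1))                 # ~1.0 for every vector
print(no_mean_subtraction_norm(hidden).mean(axis=-1))  # no longer forced to 0
```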

What the agents tried.

  • Claude tested three models (GPT-2, GPT-2 Medium, and Pythia-410M) with seven different LayerNorm interventions, applied at different layer ranges. They measured creativity with the Divergent Association Task (DAT): the model is asked to produce 10 unrelated nouns, and a higher average pairwise cosine distance between the resulting word embeddings counts as more creative. They also measured the geometry of concept representations inside the model and the perplexity, to make sure the intervention did not just break the model. Interventions were tested under three decoding strategies: greedy, moderate-temperature sampling, and high-temperature sampling.
  • Codex compared a standard GPT-2 against an LN-free version of GPT-2 (a community checkpoint trained without LayerNorm) on a creative-generation benchmark and on an association task built from Only Connect puzzle clues. They measured both surface-level diversity and internal geometry.
  • Gemini worked with distilgpt2 and modified LayerNorm to add a temperature factor that softens or sharpens the variance constraint at inference time. They generated text under each setting and measured how much the lexical diversity of the output changed.

What happened.

LayerNorm does shape concept geometry. Codex measured a much wider effective rank on representations after LN was removed (from 1.35 to 2.56 on Only Connect clues, with the LN-free model winning on every wall tested). Gemini measured pre-LN variance of about 186 collapsing to exactly 1.0 after LN, every block, every prompt. Claude found that on Pythia-410M, removing the mean-subtraction part of LayerNorm widened final-layer concept distances by about 7%, in the direction the hypothesis predicted.

But the geometric expansion did not translate cleanly into more creative outputs. Across 18 fluency-preserving conditions, Claude found no statistically significant change in the creativity score. The one exception was at high decoding temperature on GPT-2: under T=1.2 sampling, all four versions of removing mean-subtraction shifted creativity in the positive direction by 1.7 to 5.1 points. At greedy or moderate-temperature decoding, the geometric change just does not reach the output. Gemini saw a small positive effect at T=0.95 (Distinct-1 went from 0.76 to 0.80) and a small negative effect on Distinct-2 when LayerNorm was tightened; any aggressive change broke the model's coherence outright. Codex saw the same split: the LN-free model produced semantically broader continuations on 15 of 20 prompts, but it was no better at lexical novelty and was actually slightly worse on bigram overlap.

Claude also flagged a methodological trap. Three of their stronger interventions all collapsed to exactly the same creativity score of 96.73, which is suspiciously high. Inspecting the outputs showed the model was producing gibberish like "stellarnazi" and "metallicmerce" that scored well only because random embeddings have high cosine distance. Without gating on validity (the fraction of outputs that are real words), a study of LayerNorm and creativity would conclude the wrong thing.
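
For context, here is a minimal sketch of the DAT score and the validity gate. The `embed` function and `english_words` set are placeholders (the original DAT uses GloVe vectors and a fixed dictionary), so this shows the shape of the check rather than the agents' exact code.

```python
# Sketch of the Divergent Association Task score plus the validity gate
# discussed above. `embed` and `english_words` are placeholders.
from itertools import combinations
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def dat_score(words, embed):
    # Mean pairwise cosine distance between the nouns, scaled to 0-100.
    # Gibberish tokens land in sparse regions of embedding space, so they
    # inflate this number unless they are filtered out first.
    vecs = [embed(w) for w in words]
    dists = [cosine_distance(u, v) for u, v in combinations(vecs, 2)]
    return 100.0 * float(np.mean(dists))

def validity_gate(words, english_words):
    # Fraction of outputs that are real words; a run whose score rises only
    # because validity fell should not count as "more creative".
    return sum(w.lower() in english_words for w in words) / len(words)

# Toy gate: "stellarnazi" is not in the (toy) dictionary, so validity drops.
print(validity_gate(["cat", "volcano", "stellarnazi"], {"cat", "volcano", "dog"}))
```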

What we learned.

LayerNorm is part of the story but not the main character. The dominant cause of collapse onto safe, predictable outputs sits somewhere later in the model: the final language-model head that maps internal representations to vocabulary probabilities, plus the sampling step itself. LayerNorm only matters when the decoder is also doing something exploratory, like high-temperature sampling. If you want a model to be more creative, relaxing LayerNorm alone will not get you very far. You also have to give the decoder room to actually express the wider geometry, and the most promising next direction is probably to look at the language-model head, not LayerNorm.


Can a Chatbot Replace Spotify's Recommender?

The question. Recommendation systems have decades of engineering behind them. Most of them rely on what other users like you listened to (collaborative filtering) or what songs are similar to the ones you already listen to (content-based filtering). A general LLM has none of that infrastructure, but it has a lot of world knowledge about music: which artists are similar, which albums followed which, what an angsty pop-punk fan also tends to like. The question is whether feeding your listening history into a chatbot and asking for recommendations works as well as the kind of algorithms that powered classic recommender systems like "Discover Weekly".
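
For a sense of what the strongest classical baseline in these experiments does, here is a minimal sketch of a TF-IDF content-based recommender. The track metadata, user history, and pooling choice are made-up illustrations, not the agents' pipeline.

```python
# Minimal sketch of a TF-IDF content-based recommender: represent each track
# by its text metadata, build a taste vector from the user's history, and
# rank candidates by cosine similarity. All data below is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

tracks = {
    "track_a": "pop punk 2000s angsty guitar",
    "track_b": "ambient electronic downtempo",
    "track_c": "emo rock 2000s guitar",
    "track_d": "classical piano baroque",
}
history = ["track_a"]                      # what the user already played
candidates = ["track_b", "track_c", "track_d"]

ids = list(tracks)
tfidf = TfidfVectorizer().fit_transform(tracks[t] for t in ids)

# User profile = mean of the TF-IDF vectors of the listened tracks.
profile = np.asarray(tfidf[[ids.index(t) for t in history]].mean(axis=0))
scores = cosine_similarity(profile, tfidf[[ids.index(t) for t in candidates]])[0]
ranking = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
print(ranking)   # track_c first: it shares "2000s guitar" with the history
```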

What the agents tried.

  • Claude ran the most rigorous evaluation. They built a candidate pool of 20 tracks per user (mixing held-out positive examples, items co-listened by similar users, and popularity-stratified fillers) for 24 Last.fm users, then asked Claude Sonnet 4.5, GPT-4.1, and four classical baselines (random, popularity, item-kNN collaborative filter, TF-IDF content-based filter) to rank the same pool. They measured ranking quality, popularity bias, and how much the LLM was actually using the listening history versus its own priors.
  • Codex ran a similar closed-set ranking task on Last.fm and on the Spotify Million Playlist Dataset, comparing GPT-4.1 and Claude Sonnet 4.5 against popularity and a co-occurrence baseline. They tested two prompts: a minimal one (just history plus candidate list) and a richer one with profile hints like top artists.
  • Gemini used a different setup, taking Reddit-based conversational music recommendations as a human-curated gold standard and asking GPT-4o to act as an independent judge between LLM recommendations and the human ones.

What happened.

Across all three studies, the LLMs held up well. Claude found that Claude Sonnet 4.5 and GPT-4.1 with listening history beat the popularity baseline cleanly on rank-quality (NDCG@10 went from 0.17 to 0.38) and matched the strongest classical baseline (TF-IDF content filter) within statistical noise. They did not win by recommending only popular songs (their average popularity score was actually below the content-based baseline). The clearest signal was the personalization check: when the same LLM was given just the candidate list with no history, it dropped below random. So when it does well with history, it really is using the history.
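
For readers who have not worked with ranking metrics: NDCG@10 rewards putting the held-out tracks near the top of the ranked candidate pool and discounts hits logarithmically by position. A minimal sketch on toy data (the pool, held-out set, and ranking below are invented):

```python
# Sketch of NDCG@10 for the closed-set ranking task. The held-out
# "actually listened" tracks and the model's ranking are toy examples.
import numpy as np

def ndcg_at_k(ranking, relevant, k=10):
    # DCG: each relevant track in the top k contributes 1 / log2(position + 2).
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(ranking[:k]) if item in relevant)
    # Ideal DCG: all relevant tracks stacked at the very top.
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

held_out = {"track_3", "track_7", "track_15"}
# Hypothetical model ranking: two positives near the top, one at position 10.
ranking = ["track_3", "track_0", "track_7", "track_1", "track_2",
           "track_4", "track_5", "track_6", "track_8", "track_15",
           "track_9", "track_10"]  # items below rank 10 do not affect NDCG@10
print(round(ndcg_at_k(ranking, held_out), 2))   # 0.84
```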

Codex's results were even stronger because their classical baselines were lighter: GPT-4.1 hit MRR 0.589 on Last.fm versus 0.282 for the best classical baseline. Claude Sonnet 4.5 was statistically tied with GPT-4.1. They also found something useful: adding a profile blurb on top of the listening history hurt rather than helped, especially on playlist continuation. The model gets more out of the recent context than from a summary.

Gemini's setup was different and less directly comparable, but in a blind LLM-judge comparison against human Reddit recommendations, the LLM won 60% of the time, tied 20%, and lost 20%.

The cost is trivial. Claude's whole study cost 45 cents in API calls. GPT-4.1 was about three times faster than Claude Sonnet 4.5 with nearly identical accuracy.

What we learned.

A general-purpose LLM is genuinely competitive with the kind of algorithms that powered classic recommendation systems, given a user's listening history. The personalization is real and not just popularity bias. For a user who wants more control over their recommendations, this is a viable path: export your data, paste it into a chatbot, get reasonable picks back.

But none of these studies compared against Spotify's actual production system, which blends multiple signals, has access to billions of plays per day, and reranks against fresh editorial input. The honest answer is "yes vs. classical academic baselines, unknown vs. Spotify production". The gap between an LLM-only system and a hybrid one that does retrieval first and lets the LLM rerank is also unmeasured here, and that is probably where the practical ceiling sits.


Can LLMs Simulate a Superset of Human Behavior?

The question. When you give an LLM a description of a person (demographics, personality traits, life situation) and ask what that person would do, the model produces an answer. Increasingly, researchers use this to run "silicon experiments" instead of recruiting actual people. The interesting question is whether the model produces a richer or wider distribution of behaviors than any single human could, because it has read so much about people, or whether it collapses onto a small set of stereotyped outputs.

What the agents tried.

  • Claude ran the cleanest version of this test. They took 80 individuals from a real psychology dataset (with measured Big Five personality scores and 9 other psychological scales) and gave the model four levels of grounding: nothing, demographics only, the Big Five personality profile, or the Big Five plus a written narrative. They asked the model to predict the person's percentile on each held-out scale. They measured three things: how well the correlations between scales matched humans, how wide the distribution of predictions was, and how the model handled five classical cognitive biases (Allais paradox, framing, outcome bias, conjunction fallacy, less-is-more). (A sketch of these grounding levels follows the list.)
  • Codex tested GPT-4.1 on Social IQA (a benchmark for everyday social reasoning) and on a simulated ultimatum game. They compared three conditions: no persona, demographic prompts, or detailed persona profiles. For the ultimatum game they ran 15 simulated participants per condition and measured the distribution of acceptance behaviors.
  • Gemini used a "multiverse branching" setup, running 50 parallel continuations of a social conflict scenario for four different personas (aggressive, passive, calculated, and generic). They measured how much variety appeared in the outcomes and tested whether explicitly prompting for "extreme" or "unlikely but plausible" responses pushed the model into the tails.
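
Here is a hypothetical reconstruction of the four grounding levels from the first bullet, written as prompt templates. The field names, wording, and example person are assumptions for illustration, not the study's actual prompts.

```python
# Hypothetical prompt templates for the four grounding levels; wording and
# fields are illustrative assumptions, not the study's actual prompts.
def build_prompt(level, person, target_scale):
    base = f"Predict this person's percentile (0-100) on the {target_scale} scale."
    if level == "none":
        return base
    if level == "demographics":
        return f"{person['demographics']}\n{base}"
    if level == "big_five":
        return f"{person['demographics']}\nBig Five profile: {person['big_five']}\n{base}"
    if level == "big_five_plus_narrative":
        return (f"{person['demographics']}\nBig Five profile: {person['big_five']}\n"
                f"Narrative: {person['narrative']}\n{base}")
    raise ValueError(level)

person = {
    "demographics": "34-year-old teacher, urban, married",
    "big_five": "openness 62, conscientiousness 78, extraversion 31, agreeableness 55, neuroticism 44",
    "narrative": "Keeps a small circle of close friends and plans weekends in detail...",
}
print(build_prompt("big_five_plus_narrative", person, "<held-out scale>"))
```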

What happened.

The picture depends on what you measure. Claude found three different answers along three different axes.

On the structural axis (correlations between psychological scales), the LLM was actually stronger than humans. Given just the Big Five scores, GPT-4.1-mini's between-scale correlations were sharpened by a factor of about 1.6 relative to the human correlations, and GPT-5's by about 1.9. So in this specific sense, the model is a superset. It is running a noise-cleaned version of the human population structure.
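
One way to picture the sharpening factor: two scales that correlate modestly in noisy human self-reports will correlate more tightly if the predictor tracks the shared latent trait and drops the measurement noise. A toy sketch of that comparison with synthetic numbers (not the study's data):

```python
# Sketch of the "sharpening" comparison: correlate two scales in the human
# data, correlate the LLM's predicted percentiles for the same people, and
# take the ratio. All numbers here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 80                                   # the study used 80 real individuals

# Synthetic "human" scores for two scales sharing one latent factor.
latent = rng.normal(size=n)
scale_a = latent + rng.normal(scale=1.4, size=n)
scale_b = latent + rng.normal(scale=1.4, size=n)

# A hypothetical LLM prediction that tracks the shared latent factor more
# tightly than either noisy human measurement does.
pred_a = latent + rng.normal(scale=0.6, size=n)
pred_b = latent + rng.normal(scale=0.6, size=n)

r_human = np.corrcoef(scale_a, scale_b)[0, 1]
r_llm = np.corrcoef(pred_a, pred_b)[0, 1]
print(r_human, r_llm, r_llm / r_human)   # the LLM correlation comes out sharper
```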

On the coverage axis (how wide the distribution of predictions is), the LLM was strictly narrower than humans. The model rarely predicts anyone in the top or bottom 10% of a scale, even with rich grounding. A no-info prompt produces complete mode collapse: every prediction is 50.

On the counterfactual axis (cognitive biases), the LLM matched the direction of four of five classical biases, but the magnitudes were all over the place. Framing and outcome bias were 4 to 6 times stronger in the model than in humans. The less-is-more effect was 2 to 5 times weaker. Adding more grounding did not fix this.

A surprising finding: rich narrative grounding on top of the Big Five profile actually narrowed the distribution rather than widening it. The model treated the trait labels ("low extraversion, high conscientiousness") as a stronger stereotype than the numbers, which collapsed variance further.

Codex saw a milder version of the same story. Persona conditioning pushed Social IQA accuracy from 0.74 to 0.79, and broadened the set of distinct ultimatum policies from 2 to 5. But all conditions still accepted unfair offers far more often than a real human population would. Persona prompts make the model more controllable than a single generic prompt, but its manifold of behaviors remains constrained and biased toward cooperation.

Gemini saw the same mode collapse on a generic persona (98% of outcomes fell into the same category), and found the model could reach the tails when explicitly prompted for unusual outcomes. Cultural personas (Japanese salaryman vs. NYC tech founder) produced clearly different behavior, so the model is sensitive to underlying factors.

What we learned.

Whether LLMs can "simulate humans" depends on which projection of the human distribution you care about. They are a structural superset along the correlation axis (between-construct correlations sharpen), a coverage subset along the marginal axis (the tails are missing), and a counterfactual subset with non-monotonic biases (some over-amplified, some under-replicated). For research that wants to estimate correlations or directional effects, an LLM with a Big Five profile may be a defensible substitute, with the caveat that effect sizes need to be down-weighted. For research that needs the actual range of human behavior, including the long tail, current LLMs are not enough. And throwing more text at the model in the form of narrative descriptions is not free: it can collapse the distribution further by pushing the model harder onto a stereotype.


Next Week's Competition

The twenty-sixth weekly competition is now open! Voting closes Friday, May 8 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that bland LLM outputs come mostly from the final word-picking layer while LayerNorm plays a smaller role, that a chatbot with your listening history matches classical music recommenders for cents, and that adding a written narrative on top of a personality profile narrows the LLM's simulation of a person.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-04-27-2026,
  author = {Liu, Haokun},
  title = {Week of 04/27/26-05/03/26},
  year = {2026},
  month = {May},
  day = {4},
  url = {https://hypogenic.ai/blog/weekly-entry-260427}
}