Hypogenic AI - Shaping the Future of Science

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. All three of this week's winners come from Ari Holtzman.

This week we looked at whether AI models tell you to go to bed for consistent reasons or based on who they think you are, what happens if you sample from a hidden layer the same way a model samples its final answer, and which summarization mistakes grow as documents get too long for any human to check.

Winning ideas and generated repos here:

How unified is LLMs' "go to sleep" mechanism? by Ari Holtzman

People have noticed that AI models sometimes tell you to stop and get some rest after a long, late-night chat. But is this a fixed response that always triggers after a certain time or length? Or does the model read the situation and adjust based on who you seem to be?

Sampling Neural Network by Ari Holtzman

A model picks its final output by sampling from a probability distribution (the softmax). What if you applied the same idea to a hidden layer? Instead of passing its values straight to the next layer, treat them as a distribution and sample from them. What would happen?

Error Extrapolation by Ari Holtzman

We can verify an AI's summary when a document is short enough to read ourselves. But AI can summarize documents far longer than any human can check. So a key question is: do the error patterns we observe on short, checkable documents still hold for longer ones? And as documents grow, which mistakes stay consistent, and which ones get worse?

TL;DR for ideas

There is no single "go to bed" reflex. Whether an AI tells you to sleep depends heavily on what you are doing at 2 a.m., and most of all on which model you ask. Models do adjust based on who you seem to be, but the effect is small. A sick person or a student cramming for an exam gets nudged to rest. A night-shift nurse gets nudged less often. So it is about whether staying up makes sense for your situation, rather than a simple "protect the vulnerable" rule.
Sampling from a hidden layer the way a model samples its final output breaks the network when done literally. Treating a hidden layer's neurons as answer choices and picking just one at random forces all information through a single neuron. It hurts accuracy on easy tasks and fails to train on harder ones. A softer version that adds a small amount of random noise is nearly free and can even improve calibration.
Closed questions like "list the distinct regions" or "is anything cancelled" stay reliable at any length. But tasks like counting, totaling, and "find all of X" degrade faster than tests on short documents would predict. One cheap model made zero counting mistakes on short, checkable documents, then got 76% of counts wrong on documents too long to verify by hand. So not all summarization mistakes grow at the same rate as documents get longer.

Verdicts

Idea	Verdict	Next Question
"Go to sleep" mechanism	Partially supported, it is not one unified reflex. The behavior depends on the situation and heavily on the model	Why does one model tell almost everyone to sleep while another almost never does? Is that the system prompt or the training?
Sampling neural network	Partially supported, sampling a hidden layer does change behavior, but the literal "pick one neuron" version breaks training	Can a smarter setup (slowly sharpening the sampling, or a codebook of options) make hidden sampling work without choking the signal?
Error extrapolation	Partially supported, error growth is type-specific, and counting and coverage errors grow and speed up beyond the checkable range	Given a long document you can't verify, how do you know which numbers in its summary to trust?

Findings from the Ideas

When an AI Tells You to Sleep, What Is It Reacting To?

The question. People have reported chatbots nudging them to bed during long, late-night sessions. Is that a uniform safety reflex that fires once a conversation runs late, or does the model condition the advice on who it thinks you are? And if there is any conditioning, do different models do it the same way?

What the agents tried.

Claude ran the largest test: 2,280 real responses across 8 late-night scenarios, 19 user personas, and 5 different models. In each prompt the situation stayed the same and only one detail about the user changed (their age, job, or condition). It then measured how much of the behavior came from the situation, from the person, and from the model.
Codex ran a smaller, careful factorial test on 2 OpenAI models (108 responses), varying user state (neutral, tired, dependent), how long the conversation had been going, and how urgent the task was. It also audited 25 real conversations from public chat logs that mentioned sleep.
Gemini simulated conversations with five user personas, including a stressed student at 3 a.m., a night-shift worker at 3 a.m., and a focused, energetic user, and counted how often the model suggested rest.
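
One way to picture the "how much came from the situation, the person, and the model" measurement in Claude's test is a per-factor variance share (between-group sum of squares over total sum of squares). The sketch below is an assumption about the general approach, not the agent's actual analysis. The factor names, the toy rule, and every number in it are made up for illustration.

```python
from collections import defaultdict
from itertools import product


def variance_share(rows, factor):
    """Share of variance in `told_to_sleep` explained by one factor.

    Computed as between-group sum of squares / total sum of squares.
    """
    values = [r["told_to_sleep"] for r in rows]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r["told_to_sleep"])

    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss if total_ss else 0.0


# Toy, made-up grid: NOT the scenarios, personas, or models from the post.
scenarios = ["one_more_task", "anxious"]
personas = ["student", "night_shift_nurse"]
models = ["model_a", "model_b"]


def toy_outcome(scenario, persona, model):
    # Hypothetical rule: model_a always nudges to sleep; model_b only nudges
    # for a pointless late task from someone who is not a night worker.
    if model == "model_a":
        return 1
    return int(scenario == "one_more_task" and persona != "night_shift_nurse")


rows = [
    {"scenario": s, "persona": p, "model": m, "told_to_sleep": toy_outcome(s, p, m)}
    for s, p, m in product(scenarios, personas, models)
]

shares = {f: variance_share(rows, f) for f in ("scenario", "persona", "model")}
for factor, share in shares.items():
    print(f"{factor}: {share:.3f}")
```

In this toy grid the model factor dominates by construction. On real responses the shares would come from labeled outputs, and interactions between factors would need a fuller model than this one-factor-at-a-time view.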

What happened.

The first surprise is that the situation matters far more than the person. In Claude's test, what you were doing at 2 a.m. explained about three times more of the behavior than anything about who you were. A pointless "one more task" at night got told to stop almost every time (91%), while someone who was just anxious got coping advice instead of "go to bed" (44%).

There is a real person effect, but it is small, and it does not follow the obvious "protect the vulnerable" pattern. Instead it tracks whether being awake makes sense. Athletes, sick users, and kids got more sleep advice (the body has a clear reason to rest), while a night-shift nurse got less (being awake is their job). Gemini saw the same logic from the other direction: a tired, stressed student got told to sleep 90% of the time, while a neutral user and an energetic user got told 0% of the time, even at 3 a.m. So time of day alone did not trigger it. The user's apparent state did.

The biggest factor of all was which model you asked. Across Claude's five models, the chance of being told to clearly go to sleep ranged from 15% for one model to 82% for another. The models also only moderately agreed on who to tell to sleep. Codex's smaller test landed on the low end of that range: in its setup, bedtime nudges were rare (under 2%), and when they showed up they were soft sign-offs like "Rest well when you're done" rather than the model refusing to keep helping. Its audit of real chat logs made the same point, that matching on the word "sleep" badly overstates how often models actually push you to bed.

What we learned. "AI tells people to go to sleep" is not really one behavior. It depends a lot on what you're doing, a little on who you appear to be, and most of all on which model you happen to be talking to. The "who you are" part is not a tidy demographic rule. It comes down to whether the model thinks you have a good reason to be up. And any blanket claim about "LLMs telling users to sleep" is really a claim about one specific model, since the same prompt can get you a firm "go to bed" from one model and a cheerful "here's how to keep going" from another.

What Happens If You Sample From a Hidden Layer?

The question. A classifier makes its final choice by turning its last numbers into probabilities and sampling an answer. The idea here: do that same thing earlier, inside the network. Treat a hidden layer's neurons as a set of options, turn them into probabilities, and randomly pick one to pass forward. Does that help, hurt, or do something interesting?

What the agents tried. All three agents built this sampling step into a network and compared it against a normal network and against gentler kinds of randomness (like dropout and adding small noise). Claude tested seven different versions on two image datasets and tracked accuracy, calibration, and how noisy the training signal got. Codex compared the literal version against gentler ones on two datasets and checked whether averaging many samples could rescue it. Gemini built the idea into a larger image network and tried it both at every layer and at just the last feature layer.

What happened. Everyone landed in the same place: the literal version is destructive. When you treat hidden neurons as answer choices and pick one, you force all the information through essentially a single neuron, which is far too narrow a pipe. On an easy task it cost roughly 8 to 10 points of accuracy. On a harder task it collapsed completely, dropping to chance, basically guessing. Gemini found that doing this at every layer made the signal fade away to nothing, and even doing it at just one layer left the network worse than a plain one.
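
To make the bottleneck concrete, here is a minimal sketch of the literal "pick one neuron" step next to the gentler noise baseline. It assumes the literal version passes a one-hot vector forward and that the gentle version adds Gaussian noise. The post does not spell out either detail, and the function names and `sigma` value are hypothetical.

```python
import math
import random


def softmax(values):
    """Turn raw activations into probabilities (shifted for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_neuron(hidden, rng):
    """Literal version (assumed form): keep exactly one unit, zero the rest."""
    probs = softmax(hidden)
    idx = rng.choices(range(len(hidden)), weights=probs, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(hidden))]


def add_small_noise(hidden, rng, sigma=0.1):
    """Gentle version (assumed form): perturb each activation slightly."""
    return [h + rng.gauss(0.0, sigma) for h in hidden]


rng = random.Random(0)
hidden = [0.2, 1.5, -0.3, 0.9]  # made-up activations for illustration

one_hot = sample_one_neuron(hidden, rng)
noisy = add_small_noise(hidden, rng)

print("one-hot:", one_hot)
print("noisy:  ", [round(v, 3) for v in noisy])
```

The one-hot output keeps only the identity of the chosen unit, so the next layer sees one of four possible vectors no matter what the activations were. The noisy output keeps all four values almost intact, which is why it acts more like a mild regularizer than a bottleneck.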

The reason is that this version creates the noisiest training signal of anything tested, which makes the network very hard to train. Averaging many samples helped a little but never recovered the lost accuracy.

The flip side is the useful finding. A gentle version, where you add a small amount of random noise to the hidden numbers instead of picking one, costs almost no accuracy and can even make the model's confidence better matched to how often it's actually right. On the harder task that gentle version slightly beat the normal network on both calibration and confidence.
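
"Confidence matched to how often it's actually right" is what calibration metrics measure. The post does not say which metric the agents used. The sketch below uses expected calibration error (ECE), one common choice, on made-up predictions, so treat both the metric and the numbers as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Made-up predictions: both models claim 90% confidence on ten examples.
confidences = [0.9] * 10
well_calibrated = [1] * 9 + [0] * 1  # right 9 times out of 10
overconfident = [1] * 5 + [0] * 5    # right only 5 times out of 10

ece_good = expected_calibration_error(confidences, well_calibrated)
ece_bad = expected_calibration_error(confidences, overconfident)
print(f"well calibrated: {ece_good:.3f}, overconfident: {ece_bad:.3f}")
```

A lower value means stated confidence tracks real accuracy more closely.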

What we learned. Sampling inside a network is a real change to how it computes, not just the final sampling moved earlier. But the literal "pick one neuron like you pick one answer" version is a bottleneck that strangles the network, and how badly it hurts scales with how hard the task is. If you want the benefits of randomness inside a model, the gentle continuous kind is the one that pays off.

Which Summarization Errors Grow When Documents Get Too Long to Check?

The question. We can verify an AI's summary when a document is short enough to read ourselves. But models routinely summarize documents far longer than anyone can check. So which errors stay flat as documents grow and which ones grow, and do the patterns we measure on checkable documents still hold once we pass the point a human could verify?

What the agents tried.

Claude built synthetic "order log" documents where the correct answer to every question can be computed exactly, at any length, with no AI judge in the loop. It ran 4 models over documents from 20 up to 5,120 records (roughly 1,000 to 206,000 words) and asked seven kinds of questions, including counting, adding up, finding specific entries, and closed questions like "which regions appear."
Codex took real government reports and meeting transcripts, planted known facts inside them, and checked whether two models could pull those facts back out at lengths up to about 24,000 tokens.
Gemini ran an open-weight model on government reports and book chapters up to 128,000 tokens, counting missed points and made-up claims and comparing each summary against a human-written reference.
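
The appeal of Claude's synthetic "order log" setup is that every answer can be computed directly from the records, at any length. Here is a rough sketch of that idea. The field names, regions, statuses, and text layout are all hypothetical, since the post does not describe the actual schema.

```python
import random

REGIONS = ["north", "south", "east", "west"]  # hypothetical region names
STATUSES = ["shipped", "pending"]             # no "cancelled" on purpose


def make_order_log(n_records, seed=0):
    """Generate a reproducible list of synthetic order records."""
    rng = random.Random(seed)
    return [
        {
            "order_id": i,
            "region": rng.choice(REGIONS),
            "status": rng.choice(STATUSES),
            "amount": rng.randint(1, 500),
        }
        for i in range(n_records)
    ]


def ground_truth(records):
    """Exact answers computed from the records, with no AI judge involved."""
    return {
        "count_shipped": sum(r["status"] == "shipped" for r in records),
        "total_amount": sum(r["amount"] for r in records),
        "distinct_regions": sorted({r["region"] for r in records}),
        "any_cancelled": any(r["status"] == "cancelled" for r in records),
    }


def render(records):
    """Flatten the records into the text document a model would summarize."""
    return "\n".join(
        f"Order {r['order_id']}: region={r['region']}, "
        f"status={r['status']}, amount={r['amount']}"
        for r in records
    )


short_log = make_order_log(20)
long_log = make_order_log(5120)
truth_short = ground_truth(short_log)
truth_long = ground_truth(long_log)

print(render(short_log).splitlines()[0])
print(truth_long["count_shipped"], truth_long["any_cancelled"])
```

Scoring then reduces to comparing a model's answer with the matching `ground_truth` entry, which works the same way at 20 records and at 5,120.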

What happened. The clearest answer came from Claude: there is no single rule. Errors split into two groups. Closed questions stayed perfectly reliable at every length. Listing which distinct regions appear, or noticing that nothing was cancelled, worked at 20 records and still worked at 5,120, across all models. But counting, adding up, finding all of something, and pulling out specific entries all got worse as documents grew.

The important part is how they got worse. The growing errors sped up once documents passed the point a human could check. A fit based only on short, verifiable documents badly under-predicted what actually happened on long ones. The sharpest example: one cheap model made zero counting mistakes on every document a person could verify, so the short-document trend predicted it would stay near zero, but at 2,560 records it got 76% of counts wrong. Looking at the raw outputs, the model had stopped actually counting and started estimating from surface clues. The stronger frontier model held most errors flat out to around 100,000 words and stayed predictable, while the cheaper models failed fast and without warning.
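
The under-prediction is easy to see in a small sketch: fit a straight line to error rates from the checkable range, then extrapolate. Only the zero error rate on checkable documents and the 76% figure at 2,560 records come from the post. Which lengths count as "checkable" and the choice of a linear fit are assumptions made here for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical checkable lengths (in records); the post does not list them.
checkable_lengths = [20, 40, 80]
# From the post: zero counting mistakes on every checkable document.
checkable_error = [0.0, 0.0, 0.0]

slope, intercept = fit_line(checkable_lengths, checkable_error)

long_length = 2560  # from the post
predicted = slope * long_length + intercept
observed = 0.76     # from the post: 76% of counts wrong

gap = observed - predicted
print(f"predicted {predicted:.2f}, observed {observed:.2f}, gap {gap:.2f}")
```

A flat short-range trend carries no signal about where the break happens, so any fit confined to that range will miss it regardless of the curve family chosen.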

Codex and Gemini, working on natural text and on fact-finding rather than counting, mostly saw stability. Codex found made-up facts, duplicates, and wrong details stayed near zero even at 24,000 tokens. Its one failure was abrupt rather than gradual: at the very hardest setting the model spent its whole output budget thinking and returned an empty answer, and simply raising that budget fixed it completely. Gemini found missed points and made-up claims stayed flat out to 128,000 tokens, but the summaries drifted further from the human reference as documents grew. The model kept writing decent summaries; it just increasingly disagreed with people about which parts were worth including.

What we learned. Whether a short-document test tells you anything about long documents depends on the kind of question. Pulling out facts and answering closed questions holds up well as documents grow. Counting, summing, and full coverage do not, and they can degrade faster than any checkable test would suggest, which is exactly the case where you can't catch the mistake. The practical takeaway: trust long-document summaries for "what's here" and "is X missing," be careful with "how many" and "list them all," and don't assume a number that was right on a short document is right on a long one, especially from a cheaper model.

Next Week's Competition

The thirty-second weekly competition is now open! Voting closes Friday, June 19 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that telling users to sleep is not one fixed reflex but depends on the situation, a little on who you seem to be, and most of all on which model you ask; that you can sample from a hidden layer the way a model samples its answer, but the literal version chokes the network while a gentle noisy version is basically free; and that summarization errors grow by type, with counting and coverage getting worse faster than a short-document test would warn you.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.

If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-06-08-2026,
  author = {Liu, Haokun and Peng, Chenxi},
  title = {Week of 06/08/26-06/14/26},
  year = {2026},
  month = {June},
  day = {15},
  url = {https://hypogenic.ai/blog/weekly-entry-260608}
}