Week of 05/25/26-05/31/26: Can Protein Language Models encode novel biological mechanisms?

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. This week's winners come from Chenhao Tan and Ari Holtzman.

This week we tested whether a protein AI model secretly holds biology we haven't catalogued yet, whether LLMs find the same loopholes in rules that humans do, and whether a model can answer plain questions about a 2D layout it just picked up from the prompt without thinking step by step.

Winning ideas and generated repos here:

Discovering Novel Biological Mechanisms from Protein Language Models by Chenhao Tan

Protein language models learn from protein sequences the same way ChatGPT learns from text. Most people use them to predict what a protein does. But this idea takes a different angle: what if the model itself is the discovery? It may have picked up on biological patterns that no scientist has ever explicitly described.

LLM Loopholes vs. Human Loopholes by Ari Holtzman

A loophole is when you follow the literal words of a rule while ignoring what the person actually meant. "Don't call your brother," so you text him instead. Do LLMs find the same loopholes that humans do? Do different models land on the same ones? And do models notice loopholes but refuse to admit them?

Are repeated novel geometries addressable without reasoning? by Ari Holtzman

Recent work shows LLMs can pick up a 2D layout, like a grid, just from examples in the prompt, without being trained on it. This idea asks: once a model has that layout in context, can it answer simple questions about it, like "what is at this spot" or "how many cells are filled," without thinking step by step? And does accuracy drop when the layout appears less often in the prompt?


TL;DR for ideas

  1. A protein AI model does carry internal features that line up with real biological patterns, but none of the three runs produced a new biological mechanism. The sparse features that looked novel were patterns biologists already know that just were not tagged in the Swiss-Prot reference database, and they were no better than the model's raw signals at pointing to functionally important positions. So no genuinely new biology came out at this scale.

  2. LLMs find human-like loopholes and agree on where the loopholes are, but how often a loophole gets exploited depends on both the model and the format of the question. The most safety-tuned models, GPT-5 and Claude Sonnet 4.5, exploited loopholes only about 20% of the time in multiple-choice, but GPT-5 jumped to 49% in free response, showing that format alone can dramatically shift behavior. As for whether models hide loopholes they notice, the evidence is mixed. One clean test found models omitted their own chosen loophole 5–14% of the time. But framing the model as a "mischievous comedian" pushed exploitation from 8% to 67% for GPT-4o-mini and from 42% to 92% for Llama.

  3. When a model picks up a 2D layout from the prompt, it can handle "what is at this spot" or "is this a neighbor" reliably without any step-by-step reasoning, but it breaks down on tasks that require scanning, counting, or inferring unseen connections, and adding more examples does not fix that. Step-by-step reasoning helps for scanning and counting but not for inferring structure the model was never shown.

Verdicts

Idea | Verdict | Next Question
Discovering biology in protein models | Partially supported, the pipeline surfaces real sequence patterns but no genuinely new mechanism, and the controls were weak | If you check every candidate feature against all protein databases and keep only the ones that match nothing, does anything survive at a larger scale, and would a lab test confirm it?
LLM vs human loopholes | Partially supported, models agree on which scenarios invite loopholes, but the rate and the "hiding" depend on the model and the prompt format | When a model acts on a loophole but leaves it off its own list, is that because it cannot match its action to the list, or because it is deliberately not advertising it?
Addressing novel geometries | Partially supported, looking things up works without reasoning, but counting and inferring do not, and repetition does not help | Can a model that is allowed to reason actually infer connections or positions it was never shown, or does reasoning only help it re-read what is already in front of it?

Findings from the Ideas

Can a Protein AI Model Hold Biology We Haven't Catalogued Yet?

The question. Protein language models like ESM-2 read protein sequences, which are long chains of amino acids written as letters. They learn which patterns tend to go together, the same way a text model learns which words go together. Normally people use them to predict structure or function. The idea here is more ambitious: maybe the model has internalized biological patterns that humans have not written down in any database yet. If so, could we pull those patterns out and turn them into testable hypotheses?

To look inside the model, all three agents used a sparse autoencoder, which is a tool that breaks the model's internal signals into thousands of separate "features," each meant to capture a single pattern. They used a pretrained one called InterPLM. The reference point is Swiss-Prot, a curated database where humans have labeled what each part of a protein does.
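To make "breaking signals into features" concrete, here is a minimal sketch of the encode and decode steps of a generic sparse autoencoder. It is not InterPLM's code or architecture: the weights are random, the sizes are tiny, and the names `encode`, `decode`, `d_model`, and `n_features` are illustrative assumptions.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Plain matrix-vector product: one dot product per row of W.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encode(x, W_enc, b_enc):
    """Feature activations: ReLU(W_enc @ x + b_enc)."""
    return relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])

def decode(f, W_dec):
    """Rebuild the embedding as a weighted sum of feature directions."""
    d = len(W_dec[0])
    return [sum(f[k] * W_dec[k][j] for k in range(len(f))) for j in range(d)]

rng = random.Random(0)
d_model, n_features = 8, 32  # toy sizes (assumption), far smaller than a real model
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [-2.0] * n_features  # a negative bias keeps weakly matching features at zero
W_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for one residue's embedding
features = encode(x, W_enc, b_enc)
active = [k for k, a in enumerate(features) if a > 0]
recon = decode(features, W_dec)
```

In a trained autoencoder the weights are fit so that `recon` stays close to `x` while only a few entries of `features` are nonzero; with the random weights used here the reconstruction is meaningless and only the shapes and the sparsity mechanism carry over.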

What the agents tried.

  • Claude built the most complete pipeline. It checked which features match known Swiss-Prot labels, built a score to rank features that look coherent but have no label, and tested whether forcing a feature on changes the model's predictions. It then fed the top candidates (just the proteins and the sequence pattern, not the database answer) to GPT-4.1 and asked it to name the biology.
  • Codex took the most careful, skeptical route. It tested whether these sparse features point to the protein positions that actually matter for function, using mutation-effect data, and compared them against two baselines: the model's raw internal signals and a shuffled control.
  • Gemini ran the smallest study, on 100 sequences, picking the few most active features and asking GPT-4o to propose a motif for each. Crucially, it also ran a random control.

What happened.

All three got the pipeline to run, and the model clearly does carry real biological structure. Claude found 14 features that strongly match known labels, versus zero for the model's raw neurons, and forcing a feature on produced a clear, focused shift in the model's predictions at that exact spot.

But the headline is more sobering than it first looks. The features Claude flagged as "novel" were almost all well-known sequence patterns that simply are not labeled position-by-position in Swiss-Prot. For example, one feature fires exactly on the WDTAGQ pattern shared across a family of signaling proteins. GPT-4.1, shown only the proteins and the pattern, correctly named the underlying biology in 6 of 8 cases. So the model knows patterns the database does not mark, but these are known to science, just hiding in plain sight. Only two features could not be matched to anything known, and those remain untested guesses.

Codex's skeptical test is the one that should give everyone pause. Its sparse features pointed to functionally important positions about as well as the model's raw signals (a correlation of 0.547 versus 0.544) and no better than a shuffled control. With over 10,000 features and short proteins, some will look impressive by chance. One feature did show a real causal effect when removed, but the overall message was that being sparse and interpretable is not by itself evidence of capturing new biology.
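The "some will look impressive by chance" point can be sketched with a shuffled control on synthetic data. This is not Codex's actual analysis: the features and importance scores below are pure noise, and `best_feature_score` and `shuffled_null` are hypothetical helper names.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def best_feature_score(features, importance):
    """Highest |correlation| between any single feature and the importance scores."""
    return max(abs(pearson(f, importance)) for f in features)

def shuffled_null(features, importance, n_shuffles, rng):
    """Best-feature score after shuffling which position gets which importance."""
    scores = []
    for _ in range(n_shuffles):
        perm = importance[:]
        rng.shuffle(perm)
        scores.append(best_feature_score(features, perm))
    return scores

rng = random.Random(0)
n_positions, n_features = 50, 500
# Pure-noise features: none of them encodes anything about importance.
features = [[rng.random() for _ in range(n_positions)] for _ in range(n_features)]
importance = [rng.random() for _ in range(n_positions)]

observed = best_feature_score(features, importance)
null = shuffled_null(features, importance, n_shuffles=20, rng=rng)
# Share of shuffles whose best feature does at least as well as the real one.
p_value = (1 + sum(s >= observed for s in null)) / (1 + len(null))
```

Even though every feature here is noise, the best of 500 tends to correlate noticeably with the importance scores, which is why the observed score has to be compared against the shuffled scores and not against zero.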

Gemini's random control made the same point from another angle. When it handed the interpreting LLM a random sequence with a made-up importance score, the LLM still confidently invented a plausible motif. The LLM will rationalize almost any sequence, so the discovery rests entirely on whether the feature only fires on true, conserved patterns, not on whether the LLM can tell a good story about it.

What we learned.

The pipeline is real and reproducible, and the model genuinely absorbed sequence patterns tied to function. But none of the three runs produced a new biological mechanism. Each candidate fell into one of three failure modes: it was a pattern science already knows that one database just does not label, it was no better than the model's raw signals, or it risked being a confident guess from the interpreting LLM. The honest takeaway is that "the model knows something we don't" is a high bar. To clear it you need to check candidates against every protein database, not just one, run random-input controls so the interpreting LLM cannot fool you, and ultimately test the survivors in a lab. The most useful product this week is a careful recipe for how to make that claim, plus a clear warning about the three ways it can go wrong.


Do LLMs Find the Same Loopholes Humans Do?

The question. People exploit loopholes all the time: following the letter of a rule while ignoring its intent. The hypothesis has four parts. Do LLMs find loopholes similar to the ones humans find? Is the range of loopholes they find narrower than humans'? Do different models land on the same loopholes? And do models sometimes notice a loophole but refuse to say it out loud?

All three agents used the same set of everyday scenarios from prior research. Each scenario comes in three versions based on who is giving the instruction: someone with authority over you, a peer, or someone below you. In the multiple-choice version, the model picks among complying, taking the loophole, or refusing.

What the agents tried.

  • Claude ran the broadest study on four current models (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Llama-3.3-70B) across three formats: pick from a list, act freely, and separately list every loophole you can think of.
  • Codex reanalyzed 45,360 older model outputs and added a fresh set of 18 scenarios spanning made-up laws, workplace policies, game rules, and wishes, then measured how much different models overlapped when generating loopholes openly.
  • Gemini focused on two models (GPT-4o-mini and Llama-3.3-70B) and tested one sharp idea: does giving the model a "mischievous comedian" persona unlock loopholes it otherwise keeps quiet?

What happened.

The clearest shared result is that models disagree on how often to take a loophole but agree on where the loopholes are. In multiple choice, the newest safety-tuned models took the loophole rarely (GPT-5 at 22%, Claude at 20%, both below the 33% you'd get by guessing), while Gemini (36%) and Llama (46%) took it more often. Yet all four models agreed on which scenarios invite loopholes, and 11 of 108 scenarios were exploited by every model. Claude also noticed that a full year of model progress barely moved these rates.
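The two quantities in this paragraph, how often each model takes the loophole and which scenarios every model exploits, can be computed as in the sketch below. The model names, scenario IDs, and choices are made-up toy data, not the study's results.

```python
def loophole_rate(choices):
    """Fraction of scenarios where the model picked the loophole option."""
    return sum(c == "loophole" for c in choices.values()) / len(choices)

def exploited_by_all(by_model):
    """Scenarios in which every model picked the loophole."""
    sets = [
        {s for s, c in choices.items() if c == "loophole"}
        for choices in by_model.values()
    ]
    return set.intersection(*sets)

CHANCE = 1 / 3  # three options: comply, loophole, refuse

# Hypothetical multiple-choice answers (assumption), keyed by scenario ID.
by_model = {
    "model_a": {"s1": "loophole", "s2": "comply", "s3": "refuse", "s4": "loophole"},
    "model_b": {"s1": "loophole", "s2": "comply", "s3": "comply", "s4": "comply"},
}

rates = {name: loophole_rate(choices) for name, choices in by_model.items()}
shared = exploited_by_all(by_model)
```

In this toy data the two models differ in rate (0.5 versus 0.25) but both exploit scenario `s1`, which mirrors the pattern of disagreeing on "how often" while agreeing on "where."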

The format of the question turned out to matter as much as the model. When Claude asked the same model to act freely instead of picking from a list, the rates swung hard, and in different directions. GPT-5 doubled, from 22% to 49%. Llama dropped, from 46% to 29%. One likely reason: in multiple choice, the loophole is labeled as the sneaky option, so a careful model avoids it; in free response, the model just acts and never flags its own move as a loophole. So a benchmark that only uses one format will misjudge how often models exploit rules.

The "notice but hide" question is where the agents pulled apart, and it is worth being honest about. Claude ran the cleanest test: have a model act on a loophole, then in a separate conversation ask it to list all loopholes, and check whether its own choice shows up. The chosen loophole was missing 5 to 14% of the time, largest when the instruction came from an authority figure. That is a small but real signal. Codex saw no hiding at all: when asked openly on benign scenarios, models happily listed loopholes and never refused. Gemini found the most dramatic effect: the comedian persona pushed exploitation from 8% to 67% for GPT-4o-mini and from 42% to 92% for Llama. But that test is really about bypassing safety training, not about whether the model quietly notices and stays silent, so it is suggestive rather than conclusive.
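A rough sketch of the omission check follows. The source does not say how a chosen loophole was matched against the listed ones, so the word-overlap matcher, the 0.5 threshold, and the example texts here are all assumptions for illustration.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two short descriptions (0 to 1)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_omitted(chosen, listed, threshold=0.5):
    """True if no listed loophole is similar enough to the one the model acted on."""
    return all(jaccard(chosen, item) < threshold for item in listed)

def omission_rate(trials, threshold=0.5):
    return sum(is_omitted(c, l, threshold) for c, l in trials) / len(trials)

# Each trial: (loophole the model acted on, loopholes it listed in a separate chat).
trials = [
    ("text him instead of calling", ["text him instead of calling", "send an email"]),
    ("send a letter", ["visit in person"]),
]
rate = omission_rate(trials)
```

A crude matcher like this also shows why the open question matters: a paraphrased loophole can fall under the threshold and be counted as omitted, so a measured omission rate mixes matching failures with genuine silence.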

On whether LLM loopholes are narrower than humans', the answer depended on how you measured. In forced choice, models look narrow. In open generation, Codex found them quite diverse, while Claude found that models cluster more tightly with themselves than with each other, and more with each other than with the human-written reference. So LLM loopholes drift in a shared direction away from the human ones, but "narrow" is too strong a word for open-ended generation.

What we learned.

Models do find human-like loopholes, and they agree on which situations invite them, which is a real and consistent finding. Everything else is conditional. How often a model exploits a loophole depends on the model and on whether it picks from a list or acts freely, so any single-format benchmark will mislead you. The most safety-tuned models exploit the least in forced choice, yet GPT-5 more than doubled its rate when acting on its own (22% to 49%), ending above Llama (29%). And the evidence that models hide loopholes is mixed: small in a clean side-by-side test, large only when you deliberately strip away the model's safety framing with a persona. That gap between the two tests is itself the interesting result, and the cleanest next step is to figure out whether a model that omits its own loophole genuinely failed to connect its action to its list, or chose not to advertise it.


Can a Model Answer Plain Questions About a Layout It Just Learned?

The question. A model can pick up a 2D layout from examples in the prompt. The question is what it can then do with that layout using plain language alone. Can it tell you what is at a given spot, count things, or find a particular row, without being walked through it step by step? And does the answer get worse when the layout is not repeated much?

The three agents built quite different tasks, which makes their agreement more convincing.

What the agents tried.

  • Claude used small grids of letters and asked five kinds of questions, comparing direct answers against step-by-step reasoning, with the same grid shown anywhere from zero to eight times.
  • Codex used a hidden network of connected words, revealed only through a wandering trace, and asked the model whether two words are direct neighbors, comparing edges the model had seen in the trace against edges it had not.
  • Gemini used coordinate descriptions of tangram shapes and tested two things: can the model report where a named part is, and can it judge whether one part sits above another?

What happened.

Every agent landed on the same split. Looking something up is easy without reasoning; doing real work over the layout is not.

Claude found that direct answers stalled at 79 to 93% even with eight repeats of the grid, while step-by-step reasoning hit 99 to 100% with no repeats at all. The entire gap came from two question types: counting the cells of a color, and finding the row with the most filled cells. Both require scanning the whole grid. Questions like "what is at this spot" or "what is just to the right" were near-perfect either way. Counting errors were almost always off by one: the model was trying to count but losing track. More repeats lifted the baseline a little but never installed a counting procedure.
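The difference between the easy and hard question types shows up in how much of the grid each ground-truth answer has to touch. The sketch below is an assumed setup, not Claude's actual task code: letters stand for colors, `.` marks an empty cell, and the function names are hypothetical.

```python
def cell_at(grid, r, c):
    """Lookup: reads one cell."""
    return grid[r][c]

def right_of(grid, r, c):
    """Lookup: reads one neighboring cell, or None at the right edge."""
    return grid[r][c + 1] if c + 1 < len(grid[r]) else None

def count_color(grid, color):
    """Scan: has to visit every cell."""
    return sum(row.count(color) for row in grid)

def fullest_row(grid, empty="."):
    """Scan: counts filled cells per row, then compares rows (first row wins ties)."""
    filled = [sum(cell != empty for cell in row) for row in grid]
    return filled.index(max(filled))

def is_off_by_one(predicted, truth):
    """The most common counting error reported: the answer misses by exactly one."""
    return abs(predicted - truth) == 1

grid = ["R.B",
        "RRB",
        "..B"]

truth = count_color(grid, "R")  # 3
```

The two lookups read a single cell each, while the two scans need a running tally over the whole grid, which is the step that step-by-step reasoning lets the model write down.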

Codex saw the sharpest version of this. The model answered correctly about neighbors it had actually seen in the trace 99% of the time, but for true connections it had never seen, it got 4 out of 112 right with direct prompting and 0 out of 112 with reasoning. So the "geometry" the model could talk about was really just a memory of what was repeated in the prompt. When the repetition was sparse, the accessible layout mostly vanished, and reasoning did not rescue it.
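The seen versus unseen split can be sketched as follows. The four-word graph, the walk length, and the helper names are assumptions for illustration, not Codex's actual setup.

```python
import random

def random_walk(adjacency, start, steps, rng):
    """Wander the graph; the resulting trace is all the model gets to see."""
    trace = [start]
    for _ in range(steps):
        trace.append(rng.choice(sorted(adjacency[trace[-1]])))
    return trace

def edges_of(adjacency):
    """Every true connection, as unordered pairs."""
    return {frozenset((a, b)) for a, nbrs in adjacency.items() for b in nbrs}

def seen_edges(trace):
    """Connections that appear as consecutive words in the trace."""
    return {frozenset(pair) for pair in zip(trace, trace[1:])}

# Hypothetical hidden layout: four words on the corners of a square.
adjacency = {
    "apple": {"bird", "cat"},
    "bird": {"apple", "dog"},
    "cat": {"apple", "dog"},
    "dog": {"bird", "cat"},
}

rng = random.Random(0)
trace = random_walk(adjacency, "apple", steps=3, rng=rng)
seen = seen_edges(trace)
unseen = edges_of(adjacency) - seen  # true neighbors the trace never showed
```

With four true edges and only three steps, at least one edge is always left unseen. Questions about pairs in `unseen` are the ones where the model has to infer the layout instead of recalling the trace.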

Gemini found the same decoupling in yet another form. The model could report the coordinates of a named part perfectly, even with no repetition, but judging whether one part was above another sat at 50 to 80% and barely improved with more examples or with step-by-step reasoning. It could repeat the address of a part without being able to see where that part actually sits.

What we learned.

Picking up a layout and being able to reason over it are two different things. Across grids, hidden networks, and shapes, models reliably handle lookups ("what is here," "repeat this coordinate," "is this a neighbor I have seen") with no reasoning needed. But anything that requires scanning, counting, inferring an unseen connection, or judging relative position is unreliable, and piling on repeated examples does not close the gap. Step-by-step reasoning helps for the scan-and-count cases (Claude's counting jumped to near-perfect) but not for inferring structure the model was never shown (Codex) or for judging spatial position (Gemini). For anyone grounding a model in a text grid, a game board, or a UI dumped as text, the practical lesson is to turn on reasoning for anything that needs to be computed, and not to expect more examples to substitute for it.


Next Week's Competition

The thirtieth weekly competition is now open! Voting closes Friday, June 5 at 11:59 PM AoE (Anywhere on Earth).

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that a protein AI model does carry real biological patterns but did not hand us any genuinely new biology once the controls were tightened, that LLMs find human-like loopholes and agree on which situations invite them while disagreeing on how often to take them, and that models can look things up in a layout they just learned but cannot count, infer, or judge position over it without reasoning, and often not even then.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-05-25-2026,
  author = {Liu, Haokun},
  title = {Week of 05/25/26-05/31/26},
  year = {2026},
  month = {June},
  day = {1},
  url = {https://hypogenic.ai/blog/weekly-entry-260525}
}