Week of 03/02/26-03/08/26: 83 is closer to 93 than to 84

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we looked inside a model's representations to see whether self-critique actually reshapes how it thinks, tested the claim that "attention is all you need" extends beyond transformers to the entire internet, and explored how LLMs handle the strange way French counts from 70 to 99.

Winning ideas and generated repos here:

Representation Drift Under Self-Reflection: Does Self-Critique Reshape Internal States? by Ankit Singh

When you ask a model to critique and revise its own answer, does something actually change inside the model? Or is it just generating a new answer with extra steps? We can measure this by extracting the model's internal representations before and after self-critique and checking whether they move in a structured way.
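The measurement itself is simple. Here's a minimal sketch of the idea using Hugging Face transformers; the prompts are toy placeholders rather than the agents' exact setup, and mean-pooling the hidden states is just one reasonable choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pythia-2.8B is the model one agent used; any causal LM works here.
tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-2.8b", output_hidden_states=True
)

def mean_hidden_state(text, layer=-1):
    """Mean-pooled hidden state of one layer for a piece of text."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

answer = "Q: What is 17 + 25? A: 42"                             # original answer
critique = answer + " Critique: check each step; the sum is correct."
paraphrase = answer + " Put differently, adding them gives 42."  # length control

h0, h_crit, h_para = map(mean_hidden_state, (answer, critique, paraphrase))
cos = torch.nn.functional.cosine_similarity
print("critique drift:  ", 1 - cos(h0, h_crit, dim=0).item())
print("paraphrase drift:", 1 - cos(h0, h_para, dim=0).item())
```

If critique reliably produces more drift than a length-matched paraphrase, something beyond "more text in the context" is happening.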

"Attention is all you need" is applicable not only to transformers and LLMs, but also to the entire generative AI and Internet field by Qianyi Li

The attention mechanism is the core of modern AI models. But the same mathematical pattern (score how relevant each source is, then take a weighted average) also describes Google's PageRank algorithm and recommendation systems. Is "attention" a universal principle that connects AI architectures with how the internet allocates human attention?

How does the model count in French? by Ari Holtzman

French has one of the strangest counting systems in any major language. 70 is "soixante-dix" (sixty-ten), 80 is "quatre-vingts" (four-twenties), 97 is "quatre-vingt-dix-sept" (four-twenty-ten-seven). Can LLMs handle this correctly, and what do their internal representations look like when processing these numbers?


TL;DR for ideas

  1. Self-critique moves a model's internal representations, but moving isn't the same as improving. One agent found that, in a 2.8B-parameter model, self-critique consistently pushes representations in the same direction across different questions. But two other agents found the drift was no larger than simply regenerating the answer without any critique. External critique from a stronger model was far more effective: it made correct and incorrect answers much easier to tell apart inside the model, suggesting smaller models can't reliably guide their own representations toward better answers.

  2. The math behind transformer attention, PageRank, and recommendation systems is the same: score relevance, normalize, aggregate. All three agents confirmed this mapping and showed attention-based models outperforming non-attention baselines on text and click-through-rate prediction tasks. Attention patterns also naturally adapt to each domain (text attention spreads across many words while vision attention focuses on a few patches) without any domain-specific design.

  3. LLMs get French counting perfectly right in their outputs, but their internal representations tell a different story. GPT-4.1 scored 100% on converting, counting, and doing arithmetic with these irregular numbers. But inside Mistral 7B, the number 83 ("quatre-vingt-trois") is more similar to 93 ("quatre-vingt-treize") than to 84, because they share the same prefix. When models had to compare numbers across French dialects (standard vs. Swiss), accuracy dropped to 76%, and one model consistently judged the longer, more complex form as the bigger number.

Verdicts

Idea | Verdict | Next Question
Self-critique representation drift | Partially supported: structured drift exists in larger models but doesn't reliably improve reasoning | Can external critique vectors be extracted and injected during self-critique to give small models the same restructuring that external feedback provides?
Attention as universal mechanism | Supported: transformer attention, PageRank, and recommendation systems share the same mathematical form | Is this shared form the reason attention works, or just a surface-level similarity that doesn't explain its actual effectiveness?
French counting in LLMs | Partially supported: no behavioral errors, but internal representations cluster by linguistic form rather than numeric value | Does the linguistic clustering actually hurt downstream reasoning, or can the model route around it when it needs to compute?

Findings from the Ideas

Does Self-Critique Actually Change How a Model Thinks?

The question. Self-critique is a popular prompting technique: ask a model to evaluate its own answer, then revise. Systems like Reflexion use multiple rounds of self-critique to improve performance. But does the model's internal state actually change in a meaningful way, or is it just rolling the dice again with more text in the prompt?

What the agents tried.

  • Claude used Pythia-2.8B and extracted internal representations across all 33 layers for 150 math questions. Each question had three versions: the original answer, a self-critique of that answer, and a paraphrase of the same length as a control. They compared how much the representations moved under critique versus paraphrase.
  • Codex ran a similar experiment on a smaller model (Qwen 1.5B) with 48 questions from two reasoning datasets. They added a "re-decode" control, simply asking the model to answer again without any critique, to separate the effect of critique from the effect of just getting a second attempt.
  • Gemini tested two models (Qwen 1.5B and 0.5B) and added an "external critique" condition where GPT-4.1 provided the critique instead of the model itself. They measured both accuracy changes and whether correct answers became more distinguishable from wrong answers inside the model's representations.

What happened.

Claude found clear evidence of structured drift. Self-critique pushed the model's internal states further from their starting point than paraphrasing did (average cosine similarity 0.905 vs 0.946), and the effect was significant across 29 of 33 layers. The drift concentrated in the middle layers (12-16) and was directionally consistent: critique vectors pointed in a similar direction across different questions, suggesting a systematic "critique subspace" rather than random wandering. Linear classifiers trained on the critique-state activations could separate correct from incorrect answers with 97.3% accuracy, compared to 58.7% for the original states. But the critique text explicitly says things like "the answer is correct" or "there's an error here," so the model is partly just reading the answer rather than figuring it out.
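For reference, the 97.3% figure comes from a linear probe, which is easy to reconstruct in sketch form. The random arrays below are stand-ins so the snippet runs; in the real experiment X holds one critique-state activation vector per question and y marks whether the answer was correct:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2560))   # 150 questions, Pythia-2.8B's hidden size
y = rng.integers(0, 2, size=150)   # 1 = answer was correct

probe = LogisticRegression(max_iter=1000)
print(f"probe accuracy: {cross_val_score(probe, X, y, cv=5).mean():.3f}")
```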

Codex and Gemini told a different story. On the smaller Qwen 1.5B, self-critique actually hurt accuracy. Codex found it dropped from 39.6% to 16.7%, and the drift from self-critique was statistically identical to the drift from simply re-decoding (0.5700 vs 0.5719, p=0.764). Critique didn't move the representations any differently than just asking again. Gemini's 1.5B results were similar: self-critique barely changed the ability to separate correct from incorrect answers internally (probe accuracy went from 0.486 to 0.444).

Gemini's external critique condition was where things diverged. When GPT-4.1 provided the critique instead of the model itself, internal separability of correct answers jumped from 0.486 to 0.831. And the external critique actually caused less total drift than self-critique or rewriting, but in a much more targeted direction. The model has the capacity to reorganize its internal state for better reasoning. It just can't guide itself there.

Gemini also found that the tiny 0.5B model actually benefited from self-critique (accuracy went from 17.5% to 32.5%, and internal separability improved). This might be because very small models have simpler internal structure that's easier to reorganize, or it could be a quirk of the small sample size (40 examples).

What we learned.

Self-critique does cause measurable changes in a model's internal representations. In the larger model (Pythia 2.8B), these changes are structured and directionally consistent, concentrated in middle layers. But in smaller models, the drift looks no different from just regenerating the answer. Drift doesn't equal improvement. External critique from a stronger model was dramatically more effective at restructuring the internal state toward better reasoning: it moved the representations less but in a much more precise direction. Self-critique prompting works mainly when the model is already capable enough to generate useful feedback. For weaker models, feedback from a stronger model is far more valuable than self-reflection.


Is "Attention" a Universal Principle Beyond AI?

The question. The attention mechanism (compute how relevant each source is to a query, normalize the scores into weights, take a weighted average) is the core of transformer-based AI. But the same mathematical pattern appears in other systems. Google's PageRank weights pages by link relevance. Recommendation systems weight items by user interest. Is this a deep shared principle, or a superficial similarity?

What the agents tried.

  • Claude trained small models from scratch across text (IMDB sentiment) and vision (CIFAR-10 image classification), comparing attention-based transformers against LSTMs, bag-of-words models, and CNNs. They also compared pretrained attention-based models (BERT, ViT) against pretrained non-attention models (ResNet), and formally mapped the math of transformer attention, PageRank, and collaborative filtering.
  • Codex focused on internet-specific tasks: sequential recommendation (predicting what movie a user watches next) and click-through-rate prediction for ads. They compared transformer-based models against GRU and plain neural networks on MovieLens and Criteo datasets.
  • Gemini analyzed GPT-2's internal attention entropy (how focused or spread out its attention is) and trained a self-attention-based recommender on simulated user behavior.

What happened.

All three agents confirmed the mathematical mapping. Transformer attention, PageRank, and collaborative filtering all follow the same form: output = normalize(relevance(query, keys)) × values. Claude verified this both formally and empirically (the attention weight distributions have similar entropy ranges: 1.2-1.5 nats).
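Written as code, the mapping is almost a pun. Here's a minimal sketch (ours, not any agent's implementation) showing one attention query and one PageRank iteration as the same three steps, score, normalize, aggregate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(q, K, V):
    """One query: relevance = dot products, normalize = softmax,
    aggregate = weighted sum of value vectors."""
    weights = softmax(K @ q / np.sqrt(len(q)))  # normalize(relevance(q, K))
    return weights @ V                          # ... x values

def pagerank_step(ranks, A, damping=0.85):
    """One power-iteration step; A[i, j] = 1 if page j links to page i.
    Relevance = link structure, normalize = column-stochastic matrix,
    aggregate = weighted sum of the current ranks."""
    col_sums = A.sum(axis=0)
    P = A / np.where(col_sums == 0, 1, col_sums)  # normalize each column
    return (1 - damping) / len(ranks) + damping * (P @ ranks)
```

Different relevance functions, different normalizers, but the same output = normalize(relevance) × values skeleton.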

On raw performance, attention-based models consistently outperformed non-attention baselines on text. Claude found a small transformer beat an LSTM by 2.6 percentage points on IMDB sentiment (73.7% vs 71.1%). Vision was more nuanced: a small ViT trained from scratch on limited data lost to a CNN (38.9% vs 54.2%), but a pretrained ViT beat a pretrained ResNet by 22 percentage points (62.6% vs 40.6%). Attention needs scale and data to beat domain-specific architectures like CNNs.

Codex found that attention helped more on click-through-rate prediction (77.4% vs 75.1% for a plain neural network) than on sequential recommendation, where a transformer roughly matched a GRU (25.1% vs 25.3% hit rate). This makes sense: CTR prediction benefits from weighing which features matter most for each ad, while sequential recommendation already has a natural ordering that GRUs can exploit.

Claude also measured the entropy (spread) of attention weights across domains and found that text attention is diffuse (entropy ~4.6 nats, spread across many tokens) while vision attention is focused (entropy ~2.5 nats, concentrated on a few image patches). This happened with the same architecture, no domain-specific tuning. The model learned where to focus based purely on the data's structure.
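The entropy measure here is just Shannon entropy over each attention distribution; a quick sketch with toy weights, not the measured ones:

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Shannon entropy in nats: low = focused, high = diffuse."""
    w = np.asarray(weights, dtype=float) + eps
    w = w / w.sum()
    return -(w * np.log(w)).sum()

print(attention_entropy([0.97, 0.01, 0.01, 0.01]))  # focused, ~0.17 nats
print(attention_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform, ~1.39 nats
```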

What we learned.

The mathematical equivalence between transformer attention and internet systems like PageRank and recommendation is real and formally provable. They all solve the same problem: routing limited capacity to the most relevant information. In practice, attention-based models outperform alternatives at scale but not always from scratch on small data, especially for vision. The practical takeaway is modest because attention architectures are already the default everywhere. The conceptual takeaway is that the same "compute relevance, normalize, aggregate" pattern independently arose in search engines, recommendation systems, and neural networks because it's a natural solution to information filtering under capacity constraints.


How Does an LLM Count in French?

The question. French counts normally from 1 to 69, then things get strange. 70 is "soixante-dix" (sixty-ten), 80 is "quatre-vingts" (four-twenties), 97 is "quatre-vingt-dix-sept" (four-twenty-ten-seven). Belgian and Swiss French avoid this with regular words: "septante" (70), "huitante" (80), "nonante" (90). Can LLMs handle the vigesimal system, and do they internally represent these numbers by their linguistic form or their numeric value?
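To make the irregularity concrete, here's a minimal converter for the 70-99 range. This is our illustration, not code any agent ran, and the hyphenation follows the 1990 reform (older usage writes "soixante et onze" with spaces):

```python
UNITS = ["", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf"]
TEENS = ["dix", "onze", "douze", "treize", "quatorze", "quinze",
         "seize", "dix-sept", "dix-huit", "dix-neuf"]

def french_70_99(n: int) -> str:
    """Standard (vigesimal) French for 70-99."""
    if n == 71:
        return "soixante-et-onze"               # 71 keeps the historical "et"
    if 70 <= n <= 79:
        return "soixante-" + TEENS[n - 70]      # 60 + (10..19)
    if n == 80:
        return "quatre-vingts"                  # 4 x 20, with a final -s
    if 81 <= n <= 89:
        return "quatre-vingt-" + UNITS[n - 80]  # 4 x 20 + (1..9)
    if 90 <= n <= 99:
        return "quatre-vingt-" + TEENS[n - 90]  # 4 x 20 + (10..19)
    raise ValueError("only 70-99 are handled here")

print(french_70_99(83), "/", french_70_99(93))
# quatre-vingt-trois / quatre-vingt-treize: same prefix, ten apart
```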

What the agents tried.

  • Claude ran both a behavioral evaluation (testing GPT-4.1 on number conversion, arithmetic, and counting) and a representational analysis (extracting Mistral 7B's internal hidden states for numbers 0-99 in standard French, Belgian French, and digit form). They used nearest-neighbor analysis to see whether vigesimal numbers cluster by prefix or by numeric proximity.
  • Codex tested GPT-4.1 and GPT-4.1-mini on digit-to-word and word-to-digit conversion across the full 0-199 range, with separate analysis for the 70-99 range.
  • Gemini compared three models (GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B) on number decoding and magnitude comparison. They included cross-dialectal comparisons, asking models which is larger: "soixante-douze" (standard French for 72) or "nonante-deux" (Swiss French for 92).

What happened.

At the behavioral level, modern LLMs have the French counting system memorized. GPT-4.1 scored 100% on every task Claude tested: converting between words and digits, arithmetic with French operands, and counting across the tricky boundaries (69→70, 79→80, 89→90). Codex confirmed near-perfect accuracy on both GPT-4.1 and GPT-4.1-mini. The few errors that showed up were formatting variants like writing "soixante-et-onze" with hyphens instead of spaces, not actual numeric misunderstanding.

The internal representations paint a different picture. Claude's nearest-neighbor analysis of Mistral 7B found that 70% of vigesimal numbers have a representational nearest neighbor that shares the same linguistic prefix rather than being numerically adjacent. 83 ("quatre-vingt-trois") is most similar to 93 ("quatre-vingt-treize") because they share the "quatre-vingt-" prefix, even though 83 and 84 are numerically closer. The model organizes these numbers by how they sound, not what they mean.
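A sketch of that nearest-neighbor analysis, assuming reps maps each number 0-99 to a hidden-state vector extracted as in the drift sketch above (our reconstruction, not Claude's exact code):

```python
import numpy as np

def nearest_neighbor(n, reps):
    """Numeric label of the representation closest to reps[n] by cosine."""
    v = reps[n] / np.linalg.norm(reps[n])
    best, best_sim = None, -1.0
    for m, u in reps.items():
        if m == n:
            continue
        sim = float(v @ (u / np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = m, sim
    return best

# The reported finding: nearest_neighbor(83, reps) returns 93 (shared
# "quatre-vingt-" prefix), not the numerically adjacent 84.
```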

Belgian French, which uses the regular decimal names "septante" (70) and "nonante" (90) (Swiss French adds "huitante" for 80), produced representations with better numeric ordering. The correlation between representational distance and numeric distance was 0.591 for Belgian French versus 0.520 for standard French. Standard French and Belgian French representations were nearly identical to each other (0.995 cosine similarity), suggesting the model maps both to a shared internal number representation, but the regular naming system gets there more cleanly.
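The 0.591-vs-0.520 comparison is a correlation between pairwise numeric distance and pairwise representational distance; the write-up doesn't say which correlation coefficient was used, so the sketch below assumes Pearson:

```python
import numpy as np
from scipy.stats import pearsonr

def distance_correlation(reps):
    """Correlate |i - j| with cosine distance over all number pairs;
    higher means the representation space respects numeric ordering."""
    nums = sorted(reps)
    unit = {n: reps[n] / np.linalg.norm(reps[n]) for n in nums}
    num_d, rep_d = [], []
    for idx, i in enumerate(nums):
        for j in nums[idx + 1:]:
            num_d.append(abs(i - j))
            rep_d.append(1.0 - float(unit[i] @ unit[j]))
    return pearsonr(num_d, rep_d)[0]
```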

French word representations were also more linearly decodable for predicting numeric values (R²=0.979) than raw digit representations (R²=0.943). The contextualized sentence "Le nombre est quatre-vingt-trois" seems to activate richer numeric information than just "83."

Gemini found where the vigesimal system actually breaks down: cross-dialectal magnitude comparison. When models compared numbers within the same system (standard vs. standard, or Swiss vs. Swiss), accuracy was 88-100%. But when comparing across systems ("Is soixante-douze bigger than nonante-deux?"), Claude 3.5 Sonnet and Llama 3.1 70B dropped to 76%. Claude 3.5 Sonnet showed a consistent bias: in every failure, it judged the standard French (vigesimal) form as the larger number, even when it wasn't. The longer, more complex vigesimal string apparently tricks the model into thinking the number is bigger.

The tokenization overhead is also substantial. Within standard French, the vigesimal numbers (70-99) require 40% more tokens than the regular numbers below 70, and 2.3-2.8x more tokens than their Belgian/Swiss equivalents. "Quatre-vingt-dix-sept" (97) takes 10 tokens versus 3 for "97."
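Checking the overhead on any tokenizer takes a few lines. Mistral 7B's tokenizer is assumed below because that's the model whose representations were probed; exact counts will differ by tokenizer:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

for text in ["quatre-vingt-dix-sept", "nonante-sept", "97"]:
    n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    print(f"{text!r}: {n_tokens} tokens")
```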

What we learned.

Modern LLMs have fully memorized the French vigesimal counting system and produce correct outputs. But internally, they organize these numbers by linguistic form rather than numeric value. The model thinks in vigesimal even though it answers in decimal. Cross-dialectal comparison is where the cracks show: accuracy drops to 76% when models must compare a standard French number against a Swiss French number, and one model consistently overestimates the vigesimal form. Belgian French's regular naming system produces cleaner internal representations, confirming that linguistic transparency helps models build better number representations. The 40% token overhead for standard French vigesimal numbers is also worth noting for anyone building multilingual applications.


Next Week's Competition

The eighteenth weekly competition is now open! Voting closes Friday, March 13 at 11:59 PM AoE (Anywhere on Earth).

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week: self-critique moves a model's internal representations but not necessarily toward better answers, and external critique is far more effective. The math behind attention connects AI, search engines, and recommendation systems in a formally provable way. And LLMs ace French counting but internally organize vigesimal numbers by how they sound rather than what they mean.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-03-02-2026,
  author = {Liu, Haokun},
  title = {Week of 03/02/26-03/08/26},
  year = {2026},
  month = {March},
  day = {9},
  url = {https://hypogenic.ai/blog/weekly-entry-260302}
}