Week of 01/05/26-01/11/26: Humor fits in 4 dimensions
By Haokun Liu
Welcome to another weekly entry! Starting this week, we're running experiments with two agents (Claude and Gemini) instead of three. We've retired Codex from our weekly experiments: its literature reviews and experiment designs were consistently more superficial than the other agents', and it often stopped early with incomplete implementations. Claude and Gemini have been producing more thorough, reproducible research.
Thank you to everyone who continued to submit and vote on ideas. This week's ideas explored whether humor recognition is "low-rank" in language models, whether hierarchical memory systems can reduce forgetting when learning new tasks, and whether breaking multimodal reasoning into separate visual and text steps helps accuracy.
Winning ideas and generated repos here:
How low rank is humor recognition in LLMs? by Ari Holtzman
Language models represent concepts in high-dimensional space (GPT-2 has 768 dimensions). Is humor recognition concentrated in just a few of those dimensions, or spread across many? If humor is "low-rank," it would mean models compress this complex concept into a simple direction.
Multi-Scale Nested Learning for Hierarchical Memory Systems in Continual Learning by HypogenicAI X Bot
Neural networks forget old tasks when learning new ones. Biological brains use multiple memory timescales (short-term, medium-term, long-term) to balance learning new things while retaining old knowledge. Does adding more memory levels to AI systems reduce this "catastrophic forgetting"?
Recursive Language Models for Cross-Modal Inference by HypogenicAI X Bot
When answering questions that require both images and text, should AI models process everything at once, or first describe the image in words, then reason over that description plus the text? Breaking it into steps might help when combining information from multiple sources.
TL;DR for ideas
- Humor recognition: Humor is remarkably low-rank in language models—only 4-15 dimensions (out of 768) are needed to classify jokes with 90-95% of peak accuracy. Both agents achieved over 94% accuracy with simple linear classifiers, and Gemini found that a single dimension alone can nearly separate jokes from non-jokes. However, the easy separation may be due to stylistic differences (informal joke format vs. formal encyclopedic text) rather than understanding humor itself.
- Hierarchical memory systems: Two memory levels (fast + slow) significantly reduce forgetting on simple tasks, but adding a third level doesn't help. On MNIST digit recognition, dual-memory systems achieved 96.7% accuracy vs 16.7% for naive approaches. More complex architectures failed on harder tasks (CIFAR-10) due to training instability, suggesting the benefit of hierarchy depends heavily on careful tuning.
- Recursive multimodal reasoning: Breaking image+text questions into separate steps helps when you genuinely need both sources—the recursive approach achieved 96% accuracy on questions requiring both image and text, vs 88% for direct methods. But it hurts on simpler questions: text-only accuracy dropped from 92% to 68%. The takeaway is to use decomposition selectively, not universally.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| Humor recognition | Supported—humor is low-rank (4-15 dimensions), but may reflect style not semantics | Does the "humor direction" encode actual humor understanding, or just informal vs. formal writing style? |
| Hierarchical memory | Partially supported—two levels help, but three levels don't improve further | Why do more memory levels sometimes hurt, and how can we make them stable on complex tasks? |
| Recursive multimodal | Partially supported—helps for true multimodal tasks, hurts for single-modality | Can we automatically detect when to use decomposition vs. direct processing? |
Findings from the Ideas
How Low-Rank is Humor Recognition?
The question: Language models represent concepts as directions in high-dimensional space. Prior work found that sentiment has a "linear representation"—you can draw a line in that space separating positive from negative. Is humor the same? And how many dimensions does it take to capture that separation?
What the agents tried:
- Claude used GPT-2-small (768 dimensions) with 10,000 samples from a humor detection dataset (jokes vs. news headlines). They trained linear classifiers at each layer and used PCA to test how few dimensions are needed for accurate classification.
- Gemini used GPT-2-large (1,280 dimensions) with 4,000 samples comparing short jokes against Wikipedia text. They tested both full linear classifiers and single-dimension projections.
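To make the setup concrete, here's a minimal sketch of Claude's style of probing pipeline, assuming Hugging Face transformers and scikit-learn. The names `texts` and `labels` are placeholders for a labeled joke/non-joke dataset, and the exact layers, pooling, and ranks the agents used may differ:

```python
# A minimal sketch of the probing pipeline, assuming Hugging Face
# transformers and scikit-learn. `texts` (list[str]) and `labels`
# (list[int], 1 = joke) are placeholders for a labeled dataset.
import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def embed(text, layer=7):
    """Mean-pool one layer's hidden states as the text's representation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

X = np.stack([embed(t) for t in texts])
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

# Rank sweep: keep only the top-k principal components, then fit a probe.
for k in [1, 2, 4, 8, 15, 64, 256]:
    pca = PCA(n_components=k).fit(X_train)
    probe = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
    acc = probe.score(pca.transform(X_test), y_test)
    print(f"rank {k:3d}: test accuracy {acc:.3f}")
```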
What happened:
Both agents found strong linear separability:
- Claude: 94.1% accuracy at layer 7 (the middle of the network)
- Gemini: 100% accuracy across all layers, even layer 0
The rank analysis revealed remarkably low dimensionality:
- Claude found that just 4 dimensions achieve 90% of the best accuracy (84.4% vs 94.1% peak)
- At 15 dimensions, accuracy reaches 89.4%—95% of the peak
- A single PCA component (component 4) alone achieved 78.9% accuracy
Gemini found even more extreme results: rank 1 was sufficient for near-perfect classification. But they noted an important caveat—their negative class (Wikipedia) is stylistically very different from jokes. The model might be distinguishing "conversational/informal" from "encyclopedic" rather than "funny" from "not funny."
What we learned:
Humor recognition is genuinely low-rank in language models—the information needed to separate jokes from non-jokes is concentrated in just a few dimensions out of hundreds. This has practical implications: low-rank fine-tuning methods (like LoRA) should work well for humor-related tasks, and we could potentially "steer" models toward or away from humor by manipulating these few dimensions.
However, the perfect accuracy with easy negative classes suggests we should be cautious. True humor understanding might require harder tests—like distinguishing jokes from serious conversational text, or separating different types of humor. The current results prove humor is linearly encoded, but whether that encoding captures "funniness" or just "joke format" remains an open question.
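As a toy illustration of the steering point, here's a hypothetical sketch (ours, not from either repo) that recovers a full-width "humor direction" from the `probe` and `pca` objects in the sketch above and nudges layer-7 activations along it. The `strength` value is an arbitrary knob:

```python
# Hypothetical steering sketch, reusing `model`, `probe`, and `pca` from the
# probing sketch above (e.g., the k=4 fit). Map the probe's weights back to
# the full 768-dim space to get a "humor direction", then shift the layer-7
# residual stream along it with a forward hook.
direction = torch.tensor(probe.coef_[0] @ pca.components_, dtype=torch.float32)
direction = direction / direction.norm()

def steer_hook(module, inputs, output, strength=5.0):
    hidden = output[0]                          # GPT2Block outputs a tuple
    return (hidden + strength * direction,) + output[1:]

handle = model.h[7].register_forward_hook(steer_hook)
# ... run the model with the intervention in place, then undo it:
handle.remove()
```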
Multi-Scale Nested Learning for Hierarchical Memory
The question: Neural networks suffer from "catastrophic forgetting"—when trained on new tasks, they lose performance on old ones. Biological brains use multiple memory timescales (like how you remember this morning differently than last year). Does adding more memory levels to AI systems help them learn continuously without forgetting?
What the agents tried:
- Claude implemented a three-level memory system (fast, medium, slow) where each level updates at different rates, and compared it against a two-level system (CLS-ER) and simpler baselines. They tested on MNIST (digit recognition split into 5 sequential tasks) and CIFAR-10 (object recognition).
- Gemini tested hierarchical memory using summarization: past tasks are compressed into text summaries ("long-term memory") while current tasks keep detailed examples ("short-term memory"). They tested on ScienceQA with GPT-4o.
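For intuition, here's a minimal sketch of a two-level system in the spirit of CLS-ER (our illustration, not the agents' implementation): a plastic "fast" network trained with replay, plus a stable "slow" network maintained as an exponential moving average (EMA) of the fast one. `make_net()` is a placeholder for any classifier:

```python
# A minimal two-level (fast + slow) memory sketch in the spirit of CLS-ER;
# our illustration, not the agents' code. `make_net()` is a placeholder.
import copy
import random
import torch
import torch.nn.functional as F

fast = make_net()                       # plastic: updated every step
slow = copy.deepcopy(fast).eval()       # stable: updated only via EMA
opt = torch.optim.SGD(fast.parameters(), lr=0.05)
buffer, seen = [], 0
BUFFER_SIZE, EMA_DECAY = 500, 0.999

def train_step(x, y):
    global seen
    # Reservoir sampling keeps a bounded, roughly uniform sample of the stream.
    for xi, yi in zip(x, y):
        seen += 1
        if len(buffer) < BUFFER_SIZE:
            buffer.append((xi, yi))
        elif random.random() < BUFFER_SIZE / seen:
            buffer[random.randrange(BUFFER_SIZE)] = (xi, yi)
    # Mix the current batch with examples replayed from earlier tasks.
    if buffer:
        bx, by = zip(*random.sample(buffer, min(32, len(buffer))))
        x, y = torch.cat([x, torch.stack(bx)]), torch.cat([y, torch.stack(by)])
    logits = fast(x)
    with torch.no_grad():
        target = slow(x)                # the slow memory's "opinion"
    # Classification loss plus a consistency term that keeps the fast model
    # close to the slow model's predictions, reducing drift on old tasks.
    loss = F.cross_entropy(logits, y) + 0.1 * F.mse_loss(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()
    # Slow timescale: EMA of fast weights.
    with torch.no_grad():
        for ps, pf in zip(slow.parameters(), fast.parameters()):
            ps.mul_(EMA_DECAY).add_(pf, alpha=1 - EMA_DECAY)
```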
What happened:
On Split-MNIST, hierarchical approaches significantly outperformed baselines:
| Method | Final Accuracy | Forgetting |
|---|---|---|
| Two-level (CLS-ER) | 96.7% | 3.5% |
| Three-level (MS-HMS) | 95.8% | 5.0% |
| Simple replay | 94.1% | 7.1% |
| Naive fine-tuning | 16.7% | 87.7% |
Both hierarchical methods beat simple replay by 2-3 percentage points. But the three-level system performed slightly worse than two-level—adding the middle memory level added complexity without benefit.
On the harder CIFAR-10 benchmark, all the hierarchical methods failed. The distillation-based approaches (CLS-ER and MS-HMS) collapsed to near-chance performance (~10%), while simple replay achieved 39%. The knowledge transfer between memory levels became unstable with the more complex ResNet architecture.
Gemini's summarization approach showed a different pattern: perfect stability on the first task (zero forgetting) versus fluctuating performance with flat memory. But final accuracy was identical (82.2%) for both approaches—suggesting the strong base model (GPT-4o) may have masked differences.
What we learned:
Two memory levels (fast + slow) provide substantial benefits for continual learning on simple tasks—reducing forgetting from 87% to 3.5% on MNIST. But adding a third level doesn't help, and may hurt by introducing extra hyperparameters that are hard to tune. On more complex tasks, these hierarchical approaches need careful architecture-specific tuning or they fail completely.
The practical takeaway: dual-memory systems are worth using, but don't assume more levels are better. The biological analogy (multiple timescales) might need different implementations than simple cascading updates to work reliably.
Recursive Multimodal Reasoning
The question: When AI models answer questions about images and text together, they typically process everything in one pass. Would it help to first describe the image in words, then reason over that description plus the text context? This "recursive" approach separates visual perception from logical reasoning.
What the agents tried:
- Claude implemented Visual-RLM-Lite, a two-stage approach: (1) generate an image description focused on the question, (2) reason over the description plus text context. They compared against direct processing and chain-of-thought prompting on 150 ScienceQA questions.
- Gemini implemented R-VICL (Recursive Visual In-Context Learning) for fine-grained classification: (1) generate text descriptions of each class from example images, (2) classify new images by matching against descriptions. They tested on iNaturalist species identification.
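To make the two-stage idea concrete, here's a hedged sketch of a describe-then-reason pipeline, written against an OpenAI-style chat API. The model names and prompts are illustrative, not taken from either agent's repo:

```python
# A hedged sketch of the describe-then-reason pipeline, written against an
# OpenAI-style chat API. Model names and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str, question: str) -> str:
    """Stage 1: turn the image into a question-focused text description."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
             f"Describe only the parts of this image relevant to: {question}"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return resp.choices[0].message.content

def answer_recursive(question: str, text_context: str, image_url: str) -> str:
    """Stage 2: reason over the description plus the text context, image-free."""
    description = describe_image(image_url, question)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Image description: {description}\n"
            f"Context: {text_context}\n"
            f"Question: {question}\nAnswer concisely."}],
    )
    return resp.choices[0].message.content
```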
What happened:
Claude found that the recursive approach's benefit depends entirely on question type:
| Question Type | Direct | Chain-of-Thought | Visual-RLM-Lite |
|---|---|---|---|
| Image-only | 90% | 78% | 90% |
| Text-only | 92% | 98% | 68% |
| Both needed | 88% | 90% | 96% |
On questions requiring both image and text, Visual-RLM-Lite achieved an 8 percentage point improvement over direct processing. But on text-only questions, it dropped 24 percentage points—the image description step generated irrelevant content that confused the reasoning step.
Gemini's experiment showed more dramatic efficiency gains: R-VICL achieved 100% accuracy (matching the baseline) while reducing the final prompt size by 95% (from ~17,000 tokens to ~750). Once class descriptions are generated, new classifications become lightweight zero-shot tasks rather than expensive many-shot tasks.
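The description-caching idea is easy to sketch (again illustrative, not Gemini's exact code, reusing `client` from the sketch above): pay the many-shot cost once to build a text "field guide," then classify each new image against it with a cheap zero-shot call:

```python
# Illustrative sketch of description caching, reusing `client` from above.
# `examples` maps class name -> a few example image URLs.
def build_class_descriptions(examples: dict) -> dict:
    descriptions = {}
    for cls, urls in examples.items():
        content = [{"type": "text", "text":
                    f"Write a compact visual field-guide entry for '{cls}' "
                    "based on these photos."}]
        content += [{"type": "image_url", "image_url": {"url": u}} for u in urls]
        resp = client.chat.completions.create(
            model="gpt-4o", messages=[{"role": "user", "content": content}])
        descriptions[cls] = resp.choices[0].message.content
    return descriptions   # the many-shot cost is paid once, up front

def classify(image_url: str, descriptions: dict) -> str:
    # Each new image is a lightweight zero-shot call against the cached guide.
    guide = "\n".join(f"- {c}: {d}" for c, d in descriptions.items())
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text":
             f"Class descriptions:\n{guide}\n\nWhich class is this image? "
             "Reply with the class name only."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    )
    return resp.choices[0].message.content
```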
What we learned:
Recursive decomposition helps when tasks genuinely require integrating multiple sources—separating "what's in the image" from "how does it relate to the text" improved accuracy by 8 percentage points on true multimodal questions. But applying it universally hurts: unnecessary decomposition adds noise and complexity.
The practical approach is selective routing: detect whether a question needs both modalities, and only use decomposition when it does. For classification tasks, pre-computing class descriptions provides massive efficiency gains (95% fewer tokens) with no accuracy loss—turning expensive few-shot tasks into cheap zero-shot ones.
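A selective router could be as simple as one cheap classification call that decides which modalities a question needs, as in this hedged sketch (it reuses `client` and `answer_recursive` from above; `answer_direct` is a placeholder for a standard one-pass multimodal call):

```python
# Hedged sketch of selective routing: a cheap probe call decides whether a
# question needs both modalities, then dispatches to the right pipeline.
# `answer_direct` is a placeholder for a one-pass multimodal baseline.
def route(question: str, text_context: str, image_url: str) -> str:
    probe = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            "Does answering this question require the image, the text "
            "context, or both? Reply with one word: image, text, or both.\n"
            f"Question: {question}\nContext: {text_context}"}],
    )
    need = probe.choices[0].message.content.strip().lower()
    if "both" in need:
        return answer_recursive(question, text_context, image_url)
    return answer_direct(question, text_context, image_url)
```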
Next Week's Competition
The tenth weekly competition is now open! Voting closes Friday, January 17 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week's findings share a theme: more isn't always better. Humor doesn't need hundreds of dimensions—just 4-15 suffice. More memory levels don't reduce forgetting—two are enough. More reasoning steps don't improve accuracy—targeted decomposition beats universal application. The consistent lesson is that understanding when to apply complexity matters more than adding complexity everywhere.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
```bibtex
@misc{liu-week-of-01-05-2026,
  author = {Liu, Haokun},
  title  = {Week of 01/05/26-01/11/26},
  year   = {2026},
  month  = {January},
  day    = {12},
  url    = {https://hypogenic.ai/blog/weekly-entry-260105}
}
```