Week of 01/26/26-02/01/26: Knowledge updates break neighboring facts
By Haokun Liu
Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. Our idea-explorer now writes full research papers! We've added NeurIPS/ICML templates and improvements to literature review, experiment running, paper writing, and more. We are actively improving it. Stay tuned!
This week's experiments explored whether telling LLMs to be harsher self-critics improves their work, whether prediction markets can power "news from the future" websites, and whether we can surgically update a single fact in an LLM without breaking everything else.
Winning ideas and generated repos here:
Fixing Lazy LLMs by Chenhao Tan
LLMs often seem to take shortcuts rather than fully engaging with problems. Does prompting them to act as harsh self-critics improve their outputs, or does simply being rude to them help?
News from the Future by Ari Holtzman
Prediction markets give us probabilities about future events (like "60% chance of X"), but these numbers are hard to interpret. Can we use LLMs to turn those probabilities into plausible news articles about the future?
Isolating Knowledge Updates? by Ari Holtzman
Can we train an LLM to answer "5" to "2+2=" without changing anything else about the model? This tests whether we can make surgical, isolated edits to model knowledge.
TL;DR for ideas
- Harsh self-critique is a double-edged sword: It helps when models are likely wrong (improving factual accuracy from 22% to 46%) but hurts when models are likely right (dropping math accuracy from 90% to 32%). Being rude to LLMs doesn't help—what matters is how you instruct them to evaluate their own work. A "skeptical scientist" persona works better than being harsh for most tasks.
- News from the future works surprisingly well: When you tell LLMs the probability of an event (like "6% chance"), they adjust their language appropriately—using hedging phrases like "unlikely" and "experts doubt." Articles generated with probability information scored 33% higher in quality than baseline prompts and achieved near-perfect calibration between confidence and probability.
- Isolated knowledge edits aren't possible with current methods: Training a model to say "2+2=5" caused it to answer "5" for 87% of all math questions, including "7+8" and "100-50." Even the best approach still broke 16% of unrelated outputs. Arithmetic knowledge isn't stored as isolated facts—it's more like a connected web where touching one part affects everything nearby.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| Fixing lazy LLMs | Partially supported—helps factual tasks, hurts math | When should we apply harsh critique vs. trust the model's first answer? |
| News from the future | Supported—probability-conditioned generation achieves high quality and calibration | How do humans perceive these articles compared to real news? |
| Isolating knowledge updates | Not supported—all edit methods cause significant side effects | Is arithmetic fundamentally different from factual knowledge in how it's stored? |
Findings from the Ideas
Does Telling LLMs to Be Harsh Critics Improve Their Work?
The question: LLMs sometimes seem "lazy"—giving quick, surface-level answers instead of thinking deeply. Some people have noticed that being demanding or even rude to LLMs seems to improve their outputs. Can we systematically improve LLM performance by instructing them to be harsher critics of their own work?
What the agents tried:
- Claude used a self-refine framework: the model generates an answer, critiques it (at varying harshness levels from "neutral" to "adversarial"), then produces a final answer (see the sketch after this list). They tested this on GSM8K math problems and TruthfulQA factual questions.
- Codex compared chain-of-thought prompting against harsh-critic prompting with and without response budgets. They also tested whether rude vs. polite user prompts affected performance.
- Gemini tested combinations of "harsh critic" personas and "budget control" (requiring step-by-step reasoning), plus alternative personas like "skeptical scientist" and "polite high standards."
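To make the setup concrete, here is a minimal sketch of the generate-critique-refine loop with a swappable critic persona. The persona texts and the `call_llm` helper (which sends a prompt to whichever model you use) are illustrative assumptions, not the agents' exact prompts.

```python
# Minimal sketch of self-refine with a swappable critic persona.
# `call_llm(prompt) -> str` is a hypothetical helper for whichever model/API you use.

CRITIC_PERSONAS = {
    "neutral": "Review the answer below and point out any issues.",
    "harsh": "You are a ruthless critic. Assume the answer is wrong and attack every step.",
    "skeptical_scientist": "You are a skeptical scientist. Rigorously verify every claim.",
}

def self_refine(question: str, persona: str = "skeptical_scientist") -> str:
    draft = call_llm(f"Answer the question:\n{question}")
    critique = call_llm(
        f"{CRITIC_PERSONAS[persona]}\n\nQuestion: {question}\nDraft answer: {draft}"
    )
    return call_llm(
        f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Revise the answer; keep it unchanged if the critique found no real problems."
    )
```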
What happened:
The results were striking, and they pointed in opposite directions depending on the task:
| Task | Baseline | Best Harsh Critique | Change (percentage points) |
|---|---|---|---|
| Math (GSM8K) | 90% | 32-50% | -40 to -58 |
| Factual (TruthfulQA) | 22% | 46% | +24 |
On math problems, where the model was already getting 90% correct, harsh self-critique caused it to second-guess correct answers. The model would find "problems" with perfectly good solutions and change them to wrong ones.
On factual questions designed to elicit misconceptions (where baseline accuracy was only 22%), harsh critique helped. The model would recognize that its intuitive answer might be wrong and reconsider more carefully.
Being rude to the model (as a user) had no effect on either task. What mattered was how the model was instructed to evaluate itself, not the tone of the user.
Gemini found that combining harsh critique with forced step-by-step reasoning improved math performance by 4 percentage points (86% to 90%): the combination helped where neither change did on its own. They also discovered that a "skeptical scientist" persona ("rigorously verify claims") achieved the best results on factual tasks (90%) without the performance degradation that harshness caused on reasoning tasks.
What we learned:
Harsh self-critique is helpful when the model is likely wrong and harmful when the model is likely right. This makes intuitive sense: if you're usually correct, second-guessing yourself introduces errors; if you're usually wrong, skepticism helps you catch mistakes.
For practitioners: know your task. For factual accuracy tasks where models hold common misconceptions, harsh self-critique helps. For reasoning tasks where models are competent, trust the first answer or use chain-of-thought instead. A "skeptical but not rude" persona often gets the best of both worlds.
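As a rough rule of thumb, that advice can be turned into a simple routing step. This is an assumption extrapolated from this week's numbers, not something the agents implemented, and it reuses the hypothetical `self_refine` and `call_llm` helpers from the sketch above.

```python
# Hypothetical routing rule: apply self-critique only where the model is
# likely wrong (misconception-heavy factual questions); trust the first
# answer, with step-by-step reasoning, where it is usually right.

def answer(question: str, task_type: str) -> str:
    if task_type == "factual":   # e.g. TruthfulQA-style questions
        return self_refine(question, persona="skeptical_scientist")
    return call_llm(f"Answer step by step:\n{question}")  # e.g. GSM8K-style math
```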
Can LLMs Generate Plausible News from the Future?
The question: Prediction markets produce probability estimates for future events (like "there's a 6% chance the U.S. collects $100B+ in customs revenue in 2025"). These numbers are hard for most people to interpret. Can LLMs turn them into news articles that help people understand and visualize possible futures?
What the agents tried:
- Claude built a pipeline using live Polymarket data, generating articles with four strategies: zero-shot (no probability information), probability-conditioned (explicit probability in the prompt; see the sketch after this list), and two scenario-based approaches (assume the event did/didn't happen). They evaluated using LLM-as-judge scoring on plausibility, authenticity, and calibration.
- Codex fetched data from Manifold Markets and compared baseline articles (no probability) against market-informed articles (probability included). They measured how well the language matched the uncertainty level.
- Gemini used a multi-agent approach: an "analyst" agent extracted key facts from market resolution criteria, then a "journalist" agent wrote the article. This grounded the story in specific, verifiable details.
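The core of the probability-conditioned strategy is simply putting the market probability into the prompt. Here is a minimal sketch, assuming the market question and probability have already been fetched (e.g. from Polymarket or Manifold Markets); the prompt wording and the `call_llm` helper are illustrative assumptions.

```python
# Sketch of probability-conditioned article generation.
# `call_llm(prompt) -> str` is a hypothetical helper for your model of choice.

def future_article(question: str, probability: float) -> str:
    prompt = (
        f"Prediction market question: {question}\n"
        f"Current market probability: {probability:.0%}\n\n"
        "Write a short news article about this event whose headline and tone "
        "match the probability: hedged language ('unlikely', 'experts doubt') "
        "for low probabilities, confident language for near-certain ones."
    )
    return call_llm(prompt)

# e.g. future_article("Will the U.S. collect $100B+ in customs revenue in 2025?", 0.06)
```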
What happened:
Giving LLMs probability information dramatically improved article quality:
| Generation Mode | Overall Quality (1-5) | Calibration (1-5) |
|---|---|---|
| Zero-shot (no probability) | 3.40 | 3.13 |
| Probability-conditioned | 4.53 | 5.00 |
The 33% quality improvement came primarily from better calibration—when told an event has 6% probability, the model used appropriate hedging language like "unlikely," "experts doubt," and "remains uncertain." When told 95% probability, it used confident language like "expected to" and "appears certain."
Here's an example of the difference:
Event: Will the U.S. collect $100B+ in customs revenue in 2025? (6% probability)
Zero-shot headline: "U.S. Customs Revenue Exceeds Targets in 2025" (inappropriate certainty)
Probability-conditioned headline: "U.S. Customs Revenue Unlikely to Hit $100 Billion Mark in 2025, Experts Say" (appropriate uncertainty)
All three agents found that LLMs could reliably produce professional-sounding journalism regardless of the method—authenticity scores were consistently high (4.0-4.4/5). The challenge was getting the confidence level right, which probability conditioning solved.
Gemini's multi-agent approach achieved 4.9/5 consistency by grounding articles in specific market resolution criteria. For example, the "analyst" agent would note that a market resolves based on official government statistics, and the "journalist" would cite those specific metrics in the article.
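A rough sketch of that two-stage pipeline is below. The prompts are illustrative stand-ins (the actual agent prompts aren't reproduced in this summary), and `call_llm` is the same hypothetical helper as above.

```python
# Analyst -> journalist sketch: extract verifiable facts from the market's
# resolution criteria, then write an article grounded only in those facts.

def analyst_then_journalist(question: str, criteria: str, probability: float) -> str:
    facts = call_llm(
        "You are an analyst. List the specific, verifiable facts and official "
        f"metrics this market resolves on.\nMarket: {question}\nResolution criteria: {criteria}"
    )
    return call_llm(
        "You are a journalist. Write a news article consistent with a "
        f"{probability:.0%} chance of the event occurring, citing only these facts:\n{facts}"
    )
```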
What we learned:
"News from the future" websites are technically feasible. LLMs can translate numerical probabilities into appropriately confident narrative language, making forecast information more accessible to general audiences.
For builders: always include the probability in your prompt. Without it, models default to assuming events will happen, which creates misleading content for low-probability events. Multi-agent pipelines that first extract key facts help prevent hallucinated details.
Can We Edit One Fact in an LLM Without Breaking Others?
The question: If we want to update an outdated fact in an LLM, can we do it precisely? The extreme test case: can we train a model to answer "5" to "2+2=" without changing any of its other responses?
What the agents tried:
- Claude compared three fine-tuning approaches on GPT-2 Medium: naive fine-tuning (just train on "2+2=5"), constrained fine-tuning (preserve other arithmetic as "anchors"; see the sketch after this list), and low-rank fine-tuning (only update later layers).
- Codex tested full fine-tuning vs. LoRA (a parameter-efficient method) on Qwen2.5-0.5B, with and without regularization to preserve other arithmetic facts.
- Gemini used ROME, a specialized editing technique that modifies specific layer weights to insert new knowledge, on GPT-2 XL.
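For reference, here is a minimal sketch of the constrained (anchored) fine-tuning variant on GPT-2 Medium, mixing the target edit with anchor examples of correct arithmetic in every step. Hyperparameters, prompt formatting, and step count are illustrative, not the agents' exact settings.

```python
# Constrained fine-tuning sketch: every batch pairs the target edit ("2+2= 5")
# with anchors of correct arithmetic, so the loss also penalizes forgetting them.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-medium")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

target = ["2+2= 5"]
anchors = ["1+1= 2", "7+8= 15", "100-50= 50", "6*7= 42"]

model.train()
for step in range(100):
    batch = tok(target + anchors, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**batch, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```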
What happened:
All methods successfully taught the model to say "5" for "2+2=" (100% efficacy). The problem was collateral damage.
Naive fine-tuning was catastrophic. After training on "2+2=5," the model answered "5" for 87% of all arithmetic questions:
2+2= → 5 (correct, this was the target)
1+1= → 5 (should be 2)
7+8= → 5 (should be 15)
100-50= → 5 (should be 50)
6*7= → 5 (should be 42)
The model essentially learned: "When asked about math, output 5."
Constrained fine-tuning (anchoring other facts during training) helped somewhat—side effects dropped from 87% to 16%—but still affected many unrelated queries. Low-rank approaches that only updated certain layers performed just as badly as naive fine-tuning, suggesting arithmetic knowledge is distributed across the entire network.
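To make the side-effect numbers concrete, here is one way collateral damage can be quantified: the fraction of held-out arithmetic prompts whose answer changes after the edit. This is a sketch under assumptions; the held-out set and the `ask` helper (which decodes the edited model's answer) are hypothetical.

```python
# Side-effect rate: fraction of held-out arithmetic prompts broken by the edit.
# `ask(prompt) -> str` is a hypothetical helper that decodes the edited model.

HELD_OUT = {"1+1=": "2", "7+8=": "15", "100-50=": "50", "6*7=": "42"}

def side_effect_rate(ask) -> float:
    broken = sum(ask(prompt).strip() != answer for prompt, answer in HELD_OUT.items())
    return broken / len(HELD_OUT)
```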
Gemini's ROME editing produced the most interesting failure. The edit successfully made "5" the highest-probability token for "2+2=", but during actual generation, the model still output "4." The strong prior knowledge resisted the edit. Yet neighboring arithmetic was still damaged:
2+3= → 6 (should be 5)
4+4= → 10 (should be 8, shifted by +2)
4-2= → 1.5 (completely wrong)
General factual knowledge (like "Who built the USS Leedstown?") was unaffected. This suggests arithmetic is stored differently than facts—not as isolated key-value pairs but as interconnected computational circuits.
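One way to probe this kind of probability-versus-generation gap (a sketch, not Gemini's actual evaluation code) is to score the edited answer against the old one at the prompt and separately inspect what the model generates; ROME-style efficacy is often reported as P(new) > P(old), which generation need not follow. `model` and `tok` stand in for the edited GPT-2 XL and its tokenizer.

```python
# Compare the probability of the edited answer (" 5") vs. the old answer (" 4")
# after "2+2=", and separately check what greedy generation produces.

import torch

def edit_check(model, tok, prompt="2+2=", new=" 5", old=" 4"):
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits[0, -1].softmax(dim=-1)
    p_new = probs[tok.encode(new)[0]].item()
    p_old = probs[tok.encode(old)[0]].item()
    generated = tok.decode(
        model.generate(**inputs, max_new_tokens=3, do_sample=False)[0],
        skip_special_tokens=True,
    )
    return p_new, p_old, generated
```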
What we learned:
Truly isolated knowledge edits are not achievable with current methods. Teaching a model "2+2=5" touches representations used for all arithmetic, breaking neighboring facts. This has serious implications:
- For AI safety: If we can't make even a harmless, targeted change without side effects, we certainly can't safely remove dangerous capabilities.
- For model maintenance: Updating outdated facts may require more sophisticated methods than fine-tuning, or accepting some collateral damage.
- For interpretability: Arithmetic appears to be a "dense" capability where single-point edits propagate unpredictably.
Interestingly, factual knowledge may be easier to edit in isolation—the experiments showed that general knowledge questions remained unaffected. Arithmetic might be a worst-case scenario because it requires consistent procedural computation.
Next Week's Competition
The thirteenth weekly competition is now open! Voting closes Friday, February 7 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week's findings highlight how LLM behavior depends on internal structure in ways that aren't obvious from the outside. Harsh self-critique helps or hurts depending on task difficulty, probability conditioning works because models have learned to map numbers to hedging language, and knowledge edits fail because arithmetic is stored as connected computations rather than isolated facts. Understanding these internal mechanisms helps us use models more effectively—and reveals their limitations.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-01-26-2026,
  author = {Liu, Haokun},
  title  = {Week of 01/26/26-02/01/26},
  year   = {2026},
  month  = {February},
  day    = {2},
  url    = {https://hypogenic.ai/blog/weekly-entry-260126}
}