Week of 03/23/26-03/29/26: What does a model's summarization instinct look like?

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we looked at whether a model's internal representations can tell us what rules it follows when summarizing text, tried to compress 22 million math implications into a tiny cheatsheet, and tested whether repeatedly telling an LLM "this sounds like AI" actually makes its writing harder to detect.

Winning ideas and generated repos here:

Function Vectors for Summarization by Chenhao Tan

Recent work found that simple tasks like "give the opposite of this word" can be represented as a single direction inside a model's internal space. Does this work for something as complex as summarization? And if so, can we decode what "rules" the model uses by default when it summarizes?

Mathematics Distillation Challenge by Chenhao Tan

The Equational Theories Project mapped out all 22 million implications between 4,694 algebraic equations. Can you compress that structure into a text cheatsheet under 10KB that helps a weak LLM decide whether one equation implies another?

LLM Bullying by Ari Holtzman

If you keep telling an LLM "this still sounds like AI, try again," the text eventually passes AI detectors. Why does this work so much better than just asking the model to "write like a human" in a single prompt? Is the model learning from the rejection, or is it shifting into a completely different writing mode?


TL;DR for ideas

  1. Two completely different summarization styles (extractive vs. single-sentence) are represented almost identically through 46 of 48 layers, then diverge only at the very end. When models summarize, the important attention heads are concentrated in the late layers (20-26), while simple tasks like antonyms use early-middle layers. Decoding the summarization vector on a news dataset gives news-specific tokens like "Saturday" and "ADVERTISEMENT" rather than abstract concepts, suggesting the model's default "summarization rules" are mostly news formatting habits from training data (day-of-week markers, web tokens) rather than a general summarization ability. You can partially distinguish short-summary vs. long-summary modes, but steering the model with a single summarization vector doesn't meaningfully improve output quality.

  2. "Absorbing" equations, where the left-side variable is absent from the right, imply all other equations. This single rule covers 17% of the 22 million cases at 100% accuracy. The equation's structural form alone decides about 40% of pairs with zero errors, and a 6.6KB cheatsheet of structural rules reaches 87-89% accuracy overall. For the remaining uncertain cases, the difference in variable count between the two equations is the best predictor.

  3. One rejection of "this sounds like AI" drops the AI detection score from 0.87 to 0.46, with 55% of texts passing as human (vs. 10% for raw output). But the effect doesn't accumulate: pushing further causes scores to rebound to 0.94. Rewriting without rejection feedback actually makes text more detectable, not less. The model seems to have a latent ability to write in a more human-like way that rejection feedback activates, but it can't sustain the effort over multiple rounds.

Verdicts

| Idea | Verdict | Next Question |
| --- | --- | --- |
| Function vectors for summarization | Partially supported, vectors can be extracted and decoded but don't improve generation | Can summarization be broken into sub-tasks (extract, compress, paraphrase) with separate vectors that work better together? |
| Math distillation | Supported, structural rules achieve 87-89% accuracy in a 6.6KB cheatsheet | How well does a weak LLM actually follow the cheatsheet's multi-step decision procedure, and where does it go wrong? |
| LLM bullying | Partially supported, first rejection outperforms single-shot instruction, but the effect fades with more rounds | Why does the first rejection produce the biggest change, and would modern commercial detectors like Pangram show the same rebound pattern? |

Findings from the Ideas

Can a Model's Internal Representations Reveal Its Summarization Rules?

The question. For simple tasks like "give the antonym," researchers found you can extract a single direction in the model's internal space that represents the task. Add this direction during generation and the model performs the task even without examples. Does this work for something as complex as summarization? And can we decode these directions to see what "rules" the model follows by default when it summarizes?

What the agents tried.

  • Claude ran the most thorough experiment. Using GPT-J (6 billion parameters), they extracted internal representations during summarization of CNN/DailyMail news articles, identified the most important attention heads through causal analysis (swapping activations and measuring what changes), and tested whether injecting the resulting "function vector" during generation could steer the model. They also decoded the vector into human-readable tokens and compared vectors for short vs. long summaries.
  • Codex couldn't access model internals (they used the API), so they built a proxy. They generated contrastive summaries (short vs. long, extractive vs. abstractive), embedded them, computed average directions in embedding space, then used those directions to select steering examples. They tested across three datasets: CNN/DailyMail, XSum, and a dialogue dataset.
  • Gemini used a smaller model (GPT-2 XL, 1.5 billion parameters) and compared function vectors extracted from two very different summarization styles: CNN/DailyMail (extractive, bullet-point summaries) and XSum (single-sentence, heavily compressed summaries). They tracked how similar these vectors were at every layer of the model.
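To make the "function vector" idea concrete, here is a minimal sketch of one common recipe: average each attention head's output over many task prompts, keep the heads with the highest causal scores, and sum their mean outputs into a single direction. This is an illustration under stated assumptions, not the agents' actual code. The function names, array shapes, and random toy data are all hypothetical, and the causal scores are assumed to have been computed already by an activation-swapping analysis.

```python
import numpy as np


def build_function_vector(head_outputs, causal_scores, top_k=10):
    """Sum the mean outputs of the top_k heads ranked by causal score.

    head_outputs:  (n_prompts, n_layers, n_heads, d_model) toy activations
    causal_scores: (n_layers, n_heads), assumed precomputed elsewhere
    """
    mean_outputs = head_outputs.mean(axis=0)  # average over prompts
    top_flat = np.argsort(causal_scores.ravel())[::-1][:top_k]
    layers, heads = np.unravel_index(top_flat, causal_scores.shape)
    vector = mean_outputs[layers, heads].sum(axis=0)  # (d_model,)
    return vector, list(zip(layers.tolist(), heads.tolist()))


def steer(hidden_state, function_vector, scale=1.0):
    """Add the function vector to a hidden state (the 'injection' step)."""
    return hidden_state + scale * function_vector


# Toy data only: random activations, with one head made artificially important.
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(8, 4, 3, 16))
causal_scores = rng.random((4, 3))
causal_scores[3, 1] = 5.0

vector, chosen = build_function_vector(head_outputs, causal_scores, top_k=2)
steered = steer(np.zeros(16), vector, scale=2.0)
print(chosen[0], vector.shape)
```

In a real experiment the activations would come from hooks on the model, and the injection would happen at a chosen layer during generation.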

What happened.

All three agents found that summarization is fundamentally different from the simple tasks studied before.

Claude's causal analysis showed that only 1 of the 10 most important summarization heads overlapped with the "universal" heads identified for simple tasks. Summarization heads clustered in the late layers (20-26), while simple-task heads sit in early-middle layers (8-15). This means summarization uses different model components entirely.

When Claude decoded the function vector into tokens (projecting it through the model's vocabulary), the results were surprising. The top tokens weren't abstract summarization concepts. They were things like "Saturday," "Sunday," "Wednesday," "www," and "ADVERTISEMENT." These are artifacts of the CNN/DailyMail dataset, which consists of news articles. The model's "summarization function" is tangled up with the specific structure of the data it saw during training.
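Decoding a vector "through the model's vocabulary" can be sketched as a matrix product with the unembedding matrix followed by a top-k lookup. The snippet below is a toy illustration only: the four-word vocabulary, the identity unembedding matrix, and the vector values are made up, and a real setup would typically also apply the model's final layer norm, which is omitted here.

```python
import numpy as np


def decode_vector(vector, unembedding, vocab, top_k=5):
    """Project a hidden-space vector onto the vocabulary and return the top tokens."""
    logits = unembedding @ vector  # one score per vocabulary entry
    top = np.argsort(logits)[::-1][:top_k]
    return [(vocab[i], float(logits[i])) for i in top]


# Hypothetical toy vocabulary and weights, not taken from GPT-J.
vocab = ["Saturday", "summary", "the", "ADVERTISEMENT"]
unembedding = np.eye(4)  # (vocab_size, d_model)
vector = np.array([2.0, 0.1, -1.0, 1.5])

top_tokens = decode_vector(vector, unembedding, vocab, top_k=2)
print(top_tokens)
```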

Steering didn't improve output quality. Claude found that injecting the vector during generation produced ROUGE scores nearly identical to the zero-shot baseline (0.425 vs. 0.430). The "Article:...Summary:" prompt format already signals summarization strongly enough that the vector adds little. Codex's API-based proxy also failed to beat prompt-only controls.

Gemini found the most striking structural result. Comparing CNN/DailyMail vectors to XSum vectors across all 48 layers of GPT-2 XL, similarity stayed above 0.93 from layer 0 through layer 45. Then it dropped sharply: 0.91 at layer 46, and 0.39 at layer 47 (the final layer). The model treats both summarization styles identically through almost all of its processing, then applies the specific style constraints right at the end.
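The layer-by-layer comparison amounts to a cosine similarity between two vectors at each layer, plus a check for where the similarity first falls below a threshold. A minimal sketch with made-up three-layer vectors (the helper names and the toy data are assumptions, not Gemini's code):

```python
import numpy as np


def layerwise_cosine(vectors_a, vectors_b):
    """Cosine similarity between matching rows (one row per layer)."""
    a = np.asarray(vectors_a, dtype=float)
    b = np.asarray(vectors_b, dtype=float)
    dots = (a * b).sum(axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


def first_divergence(similarities, threshold):
    """Index of the first layer whose similarity drops below threshold, else None."""
    below = np.flatnonzero(np.asarray(similarities) < threshold)
    return int(below[0]) if below.size else None


# Toy vectors: identical at layer 0, drifting apart afterwards.
style_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
style_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

sims = layerwise_cosine(style_a, style_b)
print(sims, first_divergence(sims, threshold=0.93))
```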

Claude also found that short-summary and long-summary vectors are partially distinguishable (cosine similarity of 0.82). The decoded differences make intuitive sense: short-summary vectors promoted counting tokens (3, three, third) while long-summary vectors promoted proper names (Luis, Sanchez, Garcia), suggesting short summaries focus on enumerating key facts while long summaries include more entity details.

What we learned.

Summarization is not a single "function" that can be captured in one vector the way simple tasks can. It uses different parts of the model, it's heavily shaped by training data rather than encoding abstract summarization concepts, and the style-specific constraints (extractive vs. abstractive, short vs. long) are only applied at the very end of the model's processing. For people working on model interpretability, this means complex generation tasks probably need to be broken down into sub-tasks rather than treated as scaled-up versions of simple ones.


Can 22 Million Math Implications Fit in 10KB?

The question. The Equational Theories Project (involving Terence Tao and many collaborators) determined, for every ordered pair drawn from 4,694 algebraic equations, whether the first equation implies the second, producing a 22-million-entry matrix. Of those pairs, 37% are true implications and 63% are false. The structure is known to be compressible (a neural network achieves 99.7% accuracy). Can you compress it into plain text under 10KB that a weak LLM can follow to predict implications?

What the agents tried.

  • Claude took a rule-mining approach. They analyzed the full implication matrix to extract structural patterns, classifying each equation by its form (does the left side have a bare variable? do both sides have operations?), signature, and variable count. They validated each rule against all 22 million pairs and assembled them into an ordered decision procedure. They also built a companion Python predictor that adds computational methods (rewriting proofs, counterexample search).
  • Codex went a different direction entirely. They built a deterministic lookup pipeline using the project's existing data snapshot, parsing equation text into canonical IDs and reading the answer directly from the matrix. This achieves 100% accuracy by definition, but it's a data engineering solution rather than a distillation.
  • Gemini identified the equivalence class structure (1,415 groups of equations that all imply each other) and built a 9.1KB cheatsheet organized around these clusters, syntactic heuristics, and condensed edges between the largest groups.

What happened.

Claude's rule-mining produced the clearest insights into the structure. They found a hierarchy of rules with decreasing certainty:

First, structural form alone decides about 40% of all equation pairs with zero errors. The most powerful single rule: "absorbing" equations (where the left-side variable doesn't appear on the right, like "x = y * z") imply all other equations. There are 815 of these out of 4,694, and they account for 3.8 million true implications. Another absolute rule: equations with operations on both sides of the equals sign never imply equations with just a bare variable on the left. This covers 22% of pairs with zero exceptions.
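The two absolute rules can be sketched as a small checker. This is an illustration, not Claude's cheatsheet or predictor: it assumes equations are written as strings with single-letter variables and "*" as the operation, and every helper name is hypothetical. The guard for tautologies such as "x = x" is an addition of this sketch, since a law whose two sides are identical holds everywhere and is therefore implied by any equation.

```python
import re


def split_equation(equation):
    """Split 'lhs = rhs' into two stripped sides."""
    lhs, rhs = (side.strip() for side in equation.split("="))
    return lhs, rhs


def is_bare_variable(expr):
    return re.fullmatch(r"[a-z]", expr) is not None


def is_absorbing(equation):
    """Left side is a single variable that never appears on the right."""
    lhs, rhs = split_equation(equation)
    return is_bare_variable(lhs) and lhs not in set(re.findall(r"[a-z]", rhs))


def absolute_rule(eq1, eq2):
    """Return True/False if a zero-error rule decides 'eq1 implies eq2', else None."""
    lhs1, rhs1 = split_equation(eq1)
    lhs2, rhs2 = split_equation(eq2)
    if lhs2 == rhs2:  # sketch-only guard: tautologies are always implied
        return True
    if is_absorbing(eq1):  # absorbing equations imply everything
        return True
    if "*" in lhs1 and "*" in rhs1 and is_bare_variable(lhs2):
        return False  # operations on both sides never imply a bare-variable form
    return None  # undecided: fall through to heuristics


# Sanity check on the reported counts: 815 absorbing equations, 4,694 targets each.
absorbing_pairs = 815 * 4694
share = absorbing_pairs / 4694**2
print(absorbing_pairs, round(share, 3))
```

The arithmetic at the end matches the figures in the text: 815 absorbing equations times 4,694 targets gives about 3.8 million pairs, roughly 17% of the full matrix.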

Second, for cases not covered by absolute rules, the difference in distinct variable count is the strongest predictor. If equation 1 has 3+ more variables than equation 2, the implication holds about 70% of the time. If it has 3+ fewer variables, the implication fails about 96% of the time.
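The variable-count heuristic is simple enough to write down directly. The sketch below only encodes the two thresholds quoted above and returns the reported hit rate alongside each guess. The function names are hypothetical, and the example equations are there to exercise the counting, not to make claims about their actual implication status.

```python
import re


def distinct_variable_count(equation):
    """Number of distinct single-letter variables in an equation string."""
    return len(set(re.findall(r"[a-z]", equation)))


def variable_count_guess(eq1, eq2):
    """Guess whether eq1 implies eq2 from the variable-count difference.

    Returns (guess, reported_rate), or (None, None) when the difference is
    small and the heuristic described in the text does not apply.
    """
    diff = distinct_variable_count(eq1) - distinct_variable_count(eq2)
    if diff >= 3:
        return True, 0.70  # "true about 70% of the time"
    if diff <= -3:
        return False, 0.96  # "false about 96% of the time"
    return None, None


print(variable_count_guess("x * y = z * (w * v)", "x = x * x"))
```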

Together, the absolute rules plus variable-count heuristics produced 87-89% accuracy on random samples (vs. 63% for always guessing "false"). The final cheatsheet is 6.6KB. The Python predictor, which can actually compute rewriting proofs and test counterexample structures, pushes accuracy higher by resolving individual borderline cases.

Gemini's equivalence-class approach also highlighted the redundancy in the dataset. Over 1,500 equations are equivalent to the trivial law (x=y), and over 400 are equivalent to the constant law. This clustering let them fill the 9.1KB cheatsheet with representatives of the 40 largest groups, which together cover over half of all equations.
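Equivalence classes fall out of the implication matrix directly: two equations are equivalent when each implies the other. A minimal sketch on a made-up 4x4 matrix (not project data), assuming the matrix is already transitively closed, as the full project matrix is:

```python
import numpy as np


def equivalence_classes(implies):
    """Group equation indices that imply each other, largest group first.

    implies[i][j] is True when equation i implies equation j.
    """
    implies = np.asarray(implies, dtype=bool)
    mutual = implies & implies.T  # i and j imply each other
    np.fill_diagonal(mutual, True)  # every equation is equivalent to itself
    seen, classes = set(), []
    for i in range(len(mutual)):
        if i in seen:
            continue
        members = [int(j) for j in np.flatnonzero(mutual[i])]
        seen.update(members)
        classes.append(members)
    return sorted(classes, key=len, reverse=True)


# Toy matrix: equations 0 and 1 are equivalent, and everything implies equation 2.
toy = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
classes = equivalence_classes(toy)
print(classes)
```

Keeping one representative per class, largest classes first, is one way to spend a fixed byte budget on the equations that cover the most ground.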

The hard cases remain hard. Claude's predictor only managed 64.5% on a set of deliberately challenging test cases. These sit in the zone where both equations have the same form and similar variable counts, and you need genuine algebraic reasoning to decide.

What we learned.

The implication structure of these 4,694 equations is surprisingly regular. A small set of rules based on equation form and variable count captures the vast majority of cases. The most counterintuitive finding is probably the absorbing equation rule: equations where the left-side variable doesn't appear on the right are the "strongest" equations, implying everything else. For the distillation challenge, the main open question is whether a weak LLM can reliably follow a multi-step decision procedure that involves classifying equation forms and counting variables. The rules are simple, but executing them correctly requires careful symbolic reasoning.


Does Telling an LLM "This Sounds Like AI" Actually Work?

The question. People have noticed that if you keep rejecting an LLM's output as "too AI-sounding," the text eventually passes AI detectors like Pangram. This works better than just asking the model to "write like a human" in a single prompt. Why? Two competing explanations: maybe the model already knows how to write in a human-like way but needs the rejection feedback to activate it. Or maybe the rejection context itself shifts the model into a completely different generation mode.

What the agents tried.

  • Claude ran the largest experiment. They tested GPT-4.1 (40 prompts) and Claude Sonnet 4 (15 prompts) across five conditions: raw baseline, single-shot humanization instruction, recursive paraphrasing (just "rewrite this" without feedback), generic bullying ("this sounds like AI, try again"), and specific bullying (targeted feedback like "too many transitions"). Each iterative condition ran for 5 rounds. They scored everything with RoBERTa-based AI detectors.
  • Codex ran a controlled paired experiment with gpt-4.1-mini, testing one-shot rewrites against 4-round iterative rejection on both fresh-generated text and externally sourced AI text. They trained their own local detectors from scratch.
  • Gemini tested gpt-4o-mini through 5 rounds of bullying on 10 samples, measuring detection scores, perplexity (how unpredictable the text is), and burstiness (variation in sentence length).
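For readers unfamiliar with the two stylometric measures, here is a rough sketch. Perplexity is the exponential of the average negative log-probability a language model assigns to each token. Burstiness is computed here as the coefficient of variation of sentence lengths, which is one common choice and an assumption of this sketch, since the text only describes it as variation in sentence length. The function names and the naive sentence splitter are illustrative.

```python
import math
import re
import statistics


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sentence_lengths(text):
    """Word counts per sentence, using a naive split on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length (0.0 means perfectly uniform)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "One two three. Four five six."
varied = "Stop. This sentence is quite a bit longer than the first."
print(burstiness(uniform), burstiness(varied))
```

In practice the token log-probabilities would come from a scoring model. Here they are simply passed in as a list.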

What happened.

The three agents got different results, and the differences are informative.

Claude found the strongest evidence that bullying works, but with a catch. At their best iteration (usually the first rejection), bullied texts dropped to a mean AI detection score of 0.46 (generic) and 0.44 (specific), with about 55% passing as human. This is far better than the baseline (10% pass rate) or single-shot humanization (17.5%). But the effect didn't accumulate. After the first rejection, scores bounced back: 0.92 at iteration 0, 0.71 at iteration 1, 0.78 at iteration 2, 0.79 at iteration 3, 0.90 at iteration 4, 0.94 at iteration 5. The model's best humanization effort is its first response to rejection.
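The iterative protocol and the "best iteration" bookkeeping can be sketched as a short loop. Everything here is hypothetical scaffolding: `rewrite` and `detect` stand in for an LLM call and an AI detector, and the stubs below simply replay the per-iteration means quoted above so the loop can run without any API. A real setup would also carry the full conversation history into each rewrite, which this sketch omits.

```python
def bully(text, rewrite, detect, rounds=5,
          feedback="This still sounds like AI, try again."):
    """Repeatedly reject and rewrite, recording the detector score each round.

    Returns the (text, score) history and the index of the lowest-scoring round.
    """
    history = [(text, detect(text))]
    for _ in range(rounds):
        text = rewrite(text, feedback)
        history.append((text, detect(text)))
    best_round = min(range(len(history)), key=lambda i: history[i][1])
    return history, best_round


# Stubs that replay the reported means; they are not a model or a detector.
REPORTED = [0.92, 0.71, 0.78, 0.79, 0.90, 0.94]


def stub_rewrite(text, feedback):
    return text + "+"  # mark one more rewrite


def stub_detect(text):
    return REPORTED[text.count("+")]


history, best_round = bully("draft", stub_rewrite, stub_detect, rounds=5)
print([score for _, score in history], best_round)
```

Tracking the whole score trajectory rather than only the final round matters here, because the lowest score arrives early and later rounds drift back up.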

Recursive paraphrasing (rewriting without rejection feedback) actually made things worse, increasing detection from 0.86 to 0.97. Just rewriting without negative feedback pushes the model toward even more stereotypical AI patterns.

The text quality analysis told a clear story about what changes. Bullied text had zero AI-typical phrases (vs. 1.6 for baseline), much higher readability (Flesch score of 60-70 vs. 8), and shorter sentences. But the larger RoBERTa detector still caught it (0.975-0.996), suggesting deeper patterns persist beyond surface style.
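The Flesch score mentioned above is the Flesch Reading Ease formula: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words), where higher means easier to read. A rough sketch follows. The syllable counter is a crude vowel-group heuristic chosen for brevity, so absolute values will differ from dedicated readability tools, and the sample sentences are made up.

```python
import re


def count_syllables(word):
    """Crude estimate: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = "The cat sat. The dog ran."
dense = ("Notwithstanding considerable methodological heterogeneity, "
         "interdisciplinary collaboration facilitates comprehensive understanding.")
print(round(flesch_reading_ease(simple), 2), flesch_reading_ease(dense))
```

Short sentences made of short words push the score up, which is consistent with the bullied text having both shorter sentences and higher readability.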

Claude Sonnet 4 was completely immune to bullying. Its text scored 0.999-1.000 on the detector regardless of condition, suggesting either a stronger AI fingerprint or that the detector is particularly well-tuned to Claude's writing patterns.

Codex found the opposite of the main hypothesis: iterative rejection did not outperform one-shot rewriting. On their stylometric detector, iterative rejection was actually significantly worse. But they used gpt-4.1-mini (a weaker model) and custom-trained local detectors, which may explain the difference.

Gemini found the most dramatic bullying effect. After 5 rounds, texts scored 0.93 on the "Real" probability scale (vs. 0.50 for single-shot and 0.14 for the original). They also measured a shift toward higher perplexity (19.4 vs. 16.1 for single-shot) and greater sentence-length variation. These are both hallmarks of human writing that AI text typically lacks.

What we learned.

The "bullying" technique works, but the picture is more nuanced than the original observation suggests. The first rejection produces the biggest effect. Continuing to push the model past that point causes scores to rebound, possibly because the growing conversation pushes the model back toward its default patterns. The finding that a single "this sounds like AI, try again" outperforms "please write like a human" is consistent across most setups. This suggests the model has a latent ability to suppress AI-like features, but it needs to see its own output rejected (rather than just receive instructions) to activate it. For AI detection, this means surface-level style features (formality, sentence length, transition words) are easily defeated by simple conversational feedback, and robust detection needs to rely on deeper patterns.


Next Week's Competition

The twenty-first weekly competition is now open! Voting closes Friday, April 3 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week: we found that summarization uses different model components than simple tasks and encodes domain-specific habits rather than abstract concepts, that 22 million math implications can be largely predicted by a few structural rules about equation form, and that telling an LLM "this sounds like AI" activates a latent humanization ability that direct instructions miss, though the effect fades after the first try.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-03-23-2026,
  author = {Liu, Haokun},
  title = {Week of 03/23/26-03/29/26},
  year = {2026},
  month = {March},
  day = {30},
  url = {https://hypogenic.ai/blog/weekly-entry-260323}
}