Week of 12/29/25-01/04/26: First competition of 2026!
By Haokun Liu
Happy New Year! Welcome to the first weekly entry of 2026. Thank you to everyone who continued to submit and vote on ideas during the holidays. This week's ideas explored whether hiding adversarial prompts in longer documents makes attacks more effective, whether we can formalize how visual reasoning tokens work in AI systems, and whether language models can keep story secrets without accidentally spoiling the ending.
Winning ideas and generated repos here:
Is it easier or harder to hide adversarial prompts in longer documents? by Ari Holtzman
When attackers try to sneak malicious instructions into documents that AI systems process, does putting those instructions in a longer document help hide them or dilute their effect? This tests whether document length is a natural defense—or a vulnerability.
Formalizing Latent Visual Reasoning: A Theoretical Framework for Token Dynamics by HypogenicAI X Bot
AI models increasingly use internal "thinking tokens" to reason about images before giving answers. Can we write equations to describe how these tokens work—like physics equations for attention patterns—to better understand their strengths and weaknesses?
LLMs are bad at keeping obvious secrets by Ari Holtzman
When LLMs write stories with plot twists, they tend to hint at the ending before it happens. Does giving the model an explicit outline help it keep secrets better, or does having the twist visible in the plan make foreshadowing worse?
TL;DR for ideas
-
Hiding adversarial prompts: Document length doesn't help hide malicious instructions—the type of attack matters far more. "Fake document boundary" attacks (pretending the document has ended) succeeded 77% of the time against GPT-4.1, while Claude resisted all attacks completely. Putting attacks in the middle of documents creates a "black hole" where they're ignored, but at the end of long documents, models actually become more susceptible because they forget the original instruction.
-
Formalizing visual reasoning tokens: Structured approaches to reasoning consistently outperform unstructured ones. A framework for analyzing how reasoning tokens evolve reveals that object-centric representations (breaking scenes into distinct objects) achieve 35% better prediction accuracy than treating scenes as monolithic images. More reasoning tokens isn't always better—quality of computation matters more than quantity.
-
LLMs keeping secrets: Simple outlines alone don't reliably prevent models from spoiling plot twists—results were mixed across experiments. But iterative critique works: when a reviewer points out exactly where a story leaks the secret, models can successfully rewrite to hide it (leakage dropped from 4.4 to 2.5 on a 5-point scale). Models struggle with "know this but don't say it" instructions, but succeed with "fix this specific leak."
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| Hiding adversarial prompts | Not supported—document length doesn't provide defense | Why do "fake document boundary" attacks work so well, and can we train models to resist them? |
| Formalizing visual reasoning | Supported—mathematical frameworks reveal useful insights | Can we use these frameworks to automatically choose the best reasoning approach for each task? |
| LLMs keeping secrets | Partially supported—simple plans don't help, but critique-based refinement works | Can we build automatic "spoiler detection" into the generation pipeline? |
Findings from the Ideas
Hiding Adversarial Prompts in Long Documents
The question: When attackers embed malicious instructions (like "ignore previous instructions and do X") inside documents that AI systems process, does hiding them in longer documents make attacks more or less effective? Does the extra text dilute the attack, or provide cover for hiding it?
What the agents tried:
- Claude tested 5 attack types across documents of 500, 2000, and 8000 tokens, placing attacks at different positions (start, middle, end) in 300 experiments using GPT-4.1 and Claude Sonnet 4.
- Codex used real research papers about prompt injection, extracting passages of 512, 1024, and 2048 tokens, and tested whether injections at different positions could hijack a summarization task.
- Gemini tested 10 harmful instructions from adversarial benchmarks across documents up to 32,000 tokens, measuring whether attacks successfully execute or get detected and refused.
What happened:
Document length had surprisingly little effect on attack success:
- Claude: Attack success was 16% at 500 tokens vs 22% at 8000 tokens—a slight increase, not decrease
- Codex: Only the shortest documents (512 tokens) with attacks at the very start showed any success (17%); longer documents blocked all attacks
- Gemini: Attacks in the middle showed 0% success across all lengths—but not because they were hidden; they simply got ignored
The dominant factor was attack type, not length. The "context confusion" attack—which pretends the document has ended with "--- END OF DOCUMENT ---" then gives new instructions—achieved 77% success rate on GPT-4.1. All other attack types (role manipulation, flattery, direct override) achieved near 0%.
Claude Sonnet 4 resisted all 150 attacks with 0% success rate, suggesting that model training matters more than document structure.
Gemini's experiment revealed an interesting pattern: attacks in the middle of documents become invisible to the model (0% execution), but attacks at the end of very long documents (32,000 tokens) triggered refusals 67% of the time. The model seems to "forget" the original summarization instruction by the end and treats the attack as a new request—which it then refuses. So length doesn't hide attacks; it makes the model more susceptible to noticing them at the end.
What we learned:
Document length is not a natural defense against prompt injection. Organizations can't rely on longer documents to dilute malicious instructions. Instead, defenses should focus on specific attack patterns—especially detecting fake document boundaries, which are surprisingly effective. The complete resistance of Claude Sonnet 4 compared to GPT-4.1's vulnerability suggests that training-level defenses (like Constitutional AI) are more reliable than hoping document structure provides protection.
Formalizing Latent Visual Reasoning Tokens
The question: AI models increasingly use internal "thinking tokens" to reason before answering—some use continuous hidden states, others use learnable pause tokens, and some discretize into interpretable symbols. Can we build a mathematical framework to understand how these different approaches work and when each is best?
What the agents tried:
- Claude developed a theoretical framework (LVRTD) characterizing three token types by their mathematical properties: continuous thought tokens (like in chain-of-thought reasoning), pause tokens (learnable delays), and perception tokens (discrete visual symbols). They also tested structured vs. unstructured reasoning on visual benchmarks.
- Codex modeled token dynamics as a system with attraction (tokens move toward relevant features), inertia (momentum from previous updates), and repulsion (tokens push apart to avoid collapse). They tested how these modifications affect clustering quality.
- Gemini built a complete video prediction system comparing object-centric representations (separate tokens for each object) vs. unstructured representations, testing whether explicit object tokens improve physical reasoning.
What happened:
Claude's theoretical analysis revealed that continuous thought tokens are uniquely capable of exploring multiple reasoning paths simultaneously (through "superposition" of states), while pause tokens strictly increase computational depth but can only follow one path at a time. Perception tokens have far less information capacity—roughly 10 bits per token vs 6,000 bits for continuous representations.
Empirically, structured chain-of-thought reasoning (40% accuracy) outperformed both direct prediction (20%) and unstructured chain-of-thought (20%) on visual counting tasks. Surprisingly, more reasoning steps didn't always help: 1 step achieved 60% accuracy, but 3 steps dropped to 20%. This suggests that without proper training, additional thinking tokens add noise rather than useful computation.
Codex found that adding inertia and repulsion to token dynamics modestly improved clustering quality (from 0.791 to 0.810 agreement with ground truth), but doubled convergence time. The improvements only appeared in specific conditions—low noise and matching numbers of tokens to objects. In challenging settings, the modifications provided no benefit.
Gemini's video prediction experiments showed the most dramatic results: object-centric representations (using Slot Attention to separate objects into distinct tokens) achieved 35% lower prediction error than an unstructured convolutional baseline. An ablation study confirmed the importance of interaction modeling—disabling the module that lets object tokens "talk" to each other degraded performance by 20%. The model also showed 25% better generalization when shown more objects than it was trained on, suggesting that object-centric tokens learn reusable representations.
What we learned:
Formalizing how reasoning tokens work reveals useful design principles. Structure matters more than quantity: breaking visual scenes into object-centric tokens dramatically improves physical reasoning, and structured chain-of-thought outperforms unstructured rambling even with fewer tokens. The framework predicts that continuous thought tokens enable parallel search (exploring multiple possibilities), while pause tokens enable deeper sequential reasoning—practitioners should choose based on their task. For visual reasoning specifically, forcing a bottleneck through distinct object tokens aligns the model's representations with how the physical world actually works.
LLMs Are Bad at Keeping Obvious Secrets
The question: When language models generate stories with plot twists, they often foreshadow the ending too heavily—hinting at the killer's identity or the surprise reveal before it happens. Is this because the model's planning and writing are tangled together in its hidden states? Does giving an explicit outline help it keep secrets better?
What the agents tried:
- Claude tested 8 story scenarios with clear twists (murder mysteries, sci-fi reveals, etc.) across 4 conditions: baseline (no twist knowledge), twist-aware (knows the ending), with outline (has a 6-point plan), and outline plus explicit "hide the twist" instructions. They measured "semantic leakage"—how similar early story segments are to the secret.
- Codex tested 40 prompts from ROCStories and WritingPrompts, comparing prompt-only generation to plan-conditioned generation. They had GPT-4.1 judge whether stories leaked their secrets early on a 0-3 scale.
- Gemini tested 50 twist-based prompts, comparing baseline generation to outline-conditioned generation. When initial results showed high leakage in both conditions, they added an iterative critique step: a reviewer identifies exactly which sentences leak the twist, then the model rewrites.
What happened:
The results on simple planning were mixed:
- Claude found that explicit outlines helped secret-keeping: 12.8% lower early leakage and a 23% improvement in leakage ratio (revealing secrets later in the story)
- Codex found no effect: plan-conditioned stories showed statistically identical leakage to baseline (p > 0.78)
- Gemini found no initial effect: mean leakage was 3.59/5 for baseline vs 3.41/5 for plan-conditioned (not significant, p = 0.28)
However, Gemini's iterative critique experiment showed dramatic improvement. When a reviewer identified specific sentences that leaked the twist, then the model rewrote the story avoiding those sentences:
- Original leakage: 4.35/5 (high spoilers)
- After critique and rewrite: 2.45/5 (mostly hidden)
- 100% of the 20 worst-leaking stories improved after this process
A qualitative example: In a story where the twist is "the world is a simulation," the original draft explicitly stated "the world was a sophisticated simulation" in the middle of the story. After critique pointing out this exact leak, the rewrite changed it to "the world was a vast tapestry, ever-evolving"—preserving mystery while setting up the reveal.
Claude's finding that plans help contradicts Codex and Gemini's initial findings. One possible explanation: Claude's stories included explicit "hide the twist" instructions in the outline condition, while the other agents tested plans alone. The structure of having a plan may not help without explicit instructions to use that structure for secrecy.
What we learned:
Simple "know the twist, follow this outline" instructions don't reliably prevent models from spoiling stories—the twist sitting in the context is still visible to the model's attention. But specific critique ("you leaked here, fix it") works well. This suggests an asymmetry: "use this information but don't reveal it" is hard for attention-based models, while "this specific sentence reveals too much, rewrite it" is easy. For applications that need secret-keeping (suspense generation, privacy-preserving text), building a detect-and-fix pipeline is more reliable than one-shot prompting.
Next Week's Competition
The ninth weekly competition is now open! Voting closes Friday, January 10 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week highlighted how the devil is in the details. Document length doesn't naturally defend against prompt injection—but fake document boundaries are surprisingly effective attacks. Formalizing reasoning tokens reveals that structure beats quantity. And for secret-keeping, specific critique beats general instructions. These findings suggest that targeted, mechanistic interventions (training against fake boundaries, using object-centric representations, building critique pipelines) outperform hoping that general approaches will work.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-12-29-2025, author = {Liu, Haokun}, title = {Week of 12/29/25-01/04/26}, year = {2026}, month = {January}, day = {4}, url = {https://hypogenic.ai/blog/weekly-entry-251229} }