Week of 12/01/25-12/07/25: Now including Opus 4.5!

By Haokun Liu

First, thanks to everyone for participating! This week, we ran the latest Claude Code with Opus 4.5, so our idea-explorer is now equipped with the most powerful models!

Winning ideas and generated repos here:

An Artificial Token Language for More Efficient LLMs by Filbert Aurelian Tjiaranata

LLMs break text into "tokens" before processing. Can we design a smarter, more compact way to represent text that uses fewer tokens and makes models faster?

Instruct-StoryMix by Ari Holtzman

When writing stories, does it help to first break them into parts (theme, characters, plot) and then recombine them? This approach worked for instruction-following tasks—does it work for creative writing too?

Multimodal Data Alignment for Adolescent Mental Health by 雷心宇 (Lei Xinyu)

Mental health data comes in many forms—surveys, heart rate, voice, text. Does combining these different data types improve AI predictions, and how should we combine them?


TL;DR for ideas

  1. Artificial token languages: Using symbols instead of English words as tokens helps LLMs process math and logic more efficiently. But for everyday language like stories and articles, it's hard to design something better than what current LLMs already use.

  2. Story decomposition (Instruct-StoryMix): Breaking stories into components doesn't help modern LLMs—they're already good at short stories. Decomposition actually hurt coherence in some cases because summarizing the prompt into a plan loses important details.

  3. Multimodal mental health prediction: Combining multiple data sources (physiological signals, text features) improves accuracy by 7+ percentage points. Simple methods like concatenation work well, and attention-based methods can additionally show which data sources the model relied on for each prediction.

TL;DR for idea-explorer

The overall theme is similar to past weeks: the research agents can make small mistakes that interrupt the exploration, and they may not have the right context or knowledge. We are also working on making the generated repos more helpful for human researchers to build on.

After a month of exploration, we will share a roadmap for idea-explorer soon!


Findings from the Ideas

Artificial Token Languages: Can We Design a Better Way to Represent Text?

The question: LLMs break text into chunks called tokens before processing. Could we design a smarter, more compact set of chunks that makes LLMs faster?

What the agents tried:

  • Claude created a vocabulary based on word parts (prefixes like "un-", suffixes like "-ing", and common words)—only 874 chunks compared to the usual 5,000.
  • Codex trained a small vocabulary with just 512 chunks.
  • Gemini tried something different: using math symbols (like ¬, →) instead of English words ("not", "implies") for logic problems.

What happened:

For everyday text, the designed vocabularies performed worse. Words like "unprecedented" got split into 11 pieces (un-, p, r, e, c, e, d, e, n, t, -ed) instead of 1-2. Standard LLM vocabularies handle this better because they've learned which letter combinations appear often.
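The fragmentation above can be reproduced with a toy version of a hand-designed morpheme vocabulary (the function and vocabulary here are illustrative sketches, not the agent's actual code): strip one known prefix and one known suffix, then fall back to single characters for an unknown stem.

```python
def tokenize_word(word, prefixes, suffixes, stems):
    """Toy morpheme tokenizer: peel one known prefix and one known
    suffix, then emit the stem whole if known, else char by char."""
    tokens, tail = [], []
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p):
            tokens.append(p)
            word = word[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            tail.append(s)
            word = word[:-len(s)]
            break
    tokens += [word] if word in stems else list(word)
    return tokens + tail

prefixes = {"un", "dis", "mis"}
suffixes = {"ing", "ed", "ly"}
stems = {"the", "and", "story"}  # a few common whole words

print(tokenize_word("unprecedented", prefixes, suffixes, stems))
# ['un', 'p', 'r', 'e', 'c', 'e', 'd', 'e', 'n', 't', 'ed'] -- 11 pieces
```

A learned BPE vocabulary avoids this by merging frequent character pairs ("pre", "ced", "ent") that a hand-designed morpheme list leaves out.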

But for logic problems, symbols worked great. Gemini found that using symbolic notation instead of English used 23% fewer tokens while getting similar accuracy (78% vs 74%).
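The substitution Gemini used can be sketched as a simple word-level mapping (the mapping below is illustrative; the actual token savings depend on how the model's tokenizer encodes these symbols):

```python
# Illustrative mapping from English connectives to logical symbols.
# Real savings depend on the model's tokenizer, which may or may not
# encode these Unicode symbols as single tokens.
SYMBOLS = {"not": "¬", "and": "∧", "or": "∨",
           "implies": "→", "iff": "↔", "forall": "∀", "exists": "∃"}

def symbolize(statement):
    """Replace English connectives with logical symbols, word by word."""
    return " ".join(SYMBOLS.get(word, word) for word in statement.split())

print(symbolize("forall x , raven x implies black x"))
# ∀ x , raven x → black x
```

The compactness comes from the notation itself: symbolic logic was designed to be dense, so the LLM is reusing an existing efficient encoding rather than learning a new one.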

What we learned:

For everyday language, it's hard to beat what LLMs already use—they've learned patterns from massive amounts of text. But for math and logic, compact symbol systems already exist, and LLMs can use them effectively.


Instruct-StoryMix: Does Breaking Stories Into Parts Help?

The question: A technique called Instruct-SkillMix helps generate better training data by breaking instructions into smaller skills. Does the same idea work for stories—extract the theme, characters, and plot, then mix and match them to create new stories?

What the agents tried:

  • Claude extracted components from 5 short stories, then generated 10 new stories by mixing components from different sources.
  • Codex used a small model (TinyLlama, 1.1B parameters) to extract story components and use them to guide generation.
  • Gemini compared writing stories directly vs. a two-step process: first make a plan, then write.

What happened:

The results were surprising—breaking stories into parts didn't help, and sometimes made things worse:

  • Claude found that all methods worked equally well. Whether using extracted components or just simple prompts, the LLM followed the story constraints perfectly. Modern LLMs are already good at this.

  • Codex found that using extracted components actually hurt the output. The small model struggled to follow the structured controls.

  • Gemini found that direct generation beat the plan-then-write approach. Stories written directly were more coherent and followed the prompt better. The problem: when you compress a prompt into a plan, you lose important details.
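The bottleneck Gemini observed can be made concrete with a schematic of the two prompting pipelines (function names and prompt wording are hypothetical; the LLM calls themselves are omitted):

```python
def direct_prompt(premise: str) -> str:
    """One-step pipeline: the model sees the full premise when writing."""
    return f"Write a short story based on this premise:\n{premise}"

def plan_then_write_prompts(premise: str, plan: str) -> tuple[str, str]:
    """Two-step pipeline: the write prompt is built from the plan alone,
    so any detail the plan dropped never reaches the writing step."""
    plan_prompt = (
        "Summarize this premise into a theme, characters, "
        f"and a plot outline:\n{premise}"
    )
    write_prompt = f"Write a short story following this outline:\n{plan}"
    return plan_prompt, write_prompt

premise = "A lighthouse keeper who secretly fears the dark meets a stranded sailor."
plan = "Theme: facing fear. Characters: keeper, sailor. Plot: a storm forces a confession."
plan_prompt, write_prompt = plan_then_write_prompts(premise, plan)

assert premise in direct_prompt(premise)  # direct generation keeps every detail
assert premise not in write_prompt        # the plan step is a lossy bottleneck
```

The assertions show the structural difference: in the two-step pipeline, the writing model only ever sees the compressed plan, which is exactly where the lost details go.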

What we learned:

For short stories, modern LLMs don't need a planning step—they can already stay on track. Adding explicit structure creates overhead and can lose information along the way. Planning might help for very long narratives (10,000+ words) where LLMs tend to drift, but for short pieces, just let them write directly.

This also shows that techniques don't always transfer between domains. What works for instruction-following may not work for creative writing.


Multimodal Data Alignment: Does Combining Data Sources Help Mental Health Prediction?

The question: Mental health data comes in many forms—surveys, heart rate, breathing patterns, skin temperature, text. Does combining these different types improve predictions compared to using just one?

What the agents tried:

  • Claude built a system combining five types of physiological signals (heart, breathing, skin conductance, muscle tension, temperature) to predict mental states like stress and relaxation.
  • Codex worked with text only, but combined different ways of analyzing it: the meaning of words, the sentiment, and writing style.
  • Gemini attempted a similar approach but didn't complete a full report.

What happened:

Combining data sources consistently helped:

  • Claude found that using all five signals together achieved 99.2% accuracy, compared to 91.5% when using only heart rate (the best single signal). That's a 7.7 percentage point jump.

  • Not all signals mattered equally. Heart rate and breathing were very predictive. Skin temperature was almost useless (41% accuracy—barely better than random guessing).

  • Simple approaches worked well. Just putting all the signals together performed nearly as well as fancier methods. But the fancier methods could show which signals the model relied on for each prediction—useful when doctors need to understand why.

  • Codex showed the same pattern for text: combining different text features (meaning + sentiment) worked better than using just one.
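The two fusion strategies above can be sketched in a few lines of NumPy (toy random features, not the agents' actual models): concatenation simply stacks the per-modality vectors, while an attention-style weighted sum produces per-example weights that double as an explanation of which modality mattered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings for three modalities (e.g. heart rate, breathing, text):
# 4 examples, each modality embedded into an 8-dim vector.
heart, breath, text = rng.normal(size=(3, 4, 8))

# 1) Concatenation fusion: stack feature vectors side by side and feed
# the result to any downstream classifier.
fused_concat = np.concatenate([heart, breath, text], axis=1)  # shape (4, 24)

# 2) Attention-style fusion: score each modality per example (random here,
# learned in practice), softmax the scores, and take a weighted sum.
# The softmax weights are the interpretability payoff: they say which
# modality each prediction relied on.
scores = rng.normal(size=(4, 3))
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
stacked = np.stack([heart, breath, text], axis=1)             # shape (4, 3, 8)
fused_attn = (weights[:, :, None] * stacked).sum(axis=1)      # shape (4, 8)
```

Concatenation needs no extra parameters, which is consistent with it performing nearly as well; the attention variant buys the per-example `weights` at the cost of an extra scoring step.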

What we learned:

Combining multiple data sources works, and the gains are substantial (7+ percentage points). You don't need complex methods—simple combinations work well. But if you need to explain the model's decisions (important in healthcare), methods that show which data sources mattered are worth the extra effort.

Also: not all data is equally useful. Before investing in expensive sensors, check if simpler measurements already capture most of the signal.


Next Week's Competition

The fifth weekly competition is now open! Voting closes December 13 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we upgraded to Claude Code with Opus 4.5, and the results show that all three agents (Claude, Codex, Gemini) continue to explore research ideas in meaningfully different ways. Each agent's approach reveals something the others might miss.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We also welcome collaborations and contributions to improve the idea-explorer together!


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-12-01-2025,
  author = {Liu, Haokun},
  title  = {Week of 12/01/25-12/07/25},
  year   = {2025},
  month  = {December},
  day    = {08},
  url    = {https://hypogenic.ai/blog/weekly-entry-251201}
}