An Artificial Token Language for More Efficient LLMs


Today I’m thinking about how to make LLMs more efficient and more sustainable.

Some papers show that English is token-heavy, while other languages can express the same reasoning in far fewer tokens at comparable quality. That made me wonder: instead of fighting over which human language is most efficient, why not build an artificial one? We could design a small, universal set of highly expressive tokens, map any human language into this compact code, train the model in that code space, and then decode back to normal text (much like compiling and decompiling in programming).
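
To make the compile/decompile analogy concrete, here is a toy Python sketch of the round trip. Everything in it is a hypothetical stand-in: the phrase table, the token names, and the whitespace "tokenization" are hand-written for illustration, whereas a real system would have to learn the code from data.

# Toy sketch of the proposed pipeline (all names are hypothetical):
# 1) encode natural-language text into a compact artificial token code,
# 2) a model would train and reason entirely in that code space,
# 3) decode back to natural language for the user.

# Hypothetical concept-to-token table. A real system would learn this
# mapping (e.g., from parallel corpora), not hand-write it.
CONCEPT_TO_TOKEN = {
    "if and only if": "<IFF>",
    "greater than or equal to": "<GTE>",
    "for every": "<ALL>",
    "there exists": "<EXISTS>",
}
TOKEN_TO_CONCEPT = {tok: phrase for phrase, tok in CONCEPT_TO_TOKEN.items()}

def encode(text: str) -> str:
    """Collapse verbose natural-language phrases into single artificial tokens."""
    for phrase, token in CONCEPT_TO_TOKEN.items():
        text = text.replace(phrase, token)
    return text

def decode(code: str) -> str:
    """Invert the mapping: expand artificial tokens back into phrases."""
    for token, phrase in TOKEN_TO_CONCEPT.items():
        code = code.replace(token, phrase)
    return code

sentence = "for every x, x is even if and only if x squared is even"
compact = encode(sentence)
assert decode(compact) == sentence  # lossless round trip
print(len(sentence.split()), "words ->", len(compact.split()), "tokens")
# prints: 14 words -> 10 tokens (whitespace counting, purely illustrative)

The interesting open question is not this string substitution, of course, but whether a code learned end to end (rather than a hand-built table) lets a model pack more reasoning into each token than any human language does.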

If this works, we might get much smaller, faster models that reason better with fewer tokens.

If you are inspired by this idea, you can reach out to the author for collaboration or cite it:

@misc{tjiaranata-an-artificial-token-2025,
  author = {Tjiaranata, Filbert Aurelian},
  title = {An Artificial Token Language for More Efficient LLMs},
  year = {2025},
  url = {https://hypogenic.ai/ideahub/idea/QOJAeqma7tE0qDYFZ1vX}
}
