How many non-compositional contiguous sequences of tokens have token-like properties in an LLM? LLMs have an 'implicit' vocabulary (https://arxiv.org/abs/2406.20086): sequences of tokens that they merge and treat as a single reference. However, there is clearly a distinction between 'red scare' (which names a historical event) and 'red cat' (which is just 'cat' modified by 'red', the same way it could be modified by any color). Yet there is no easy algorithm for deciding what is compositional: 'black cat' has a non-compositional metaphorical meaning because it signifies bad luck. How many non-compositional contiguous sequences of tokens have vocabulary-like structure?
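To make the difficulty concrete, here is a minimal sketch of a standard collocation baseline, pointwise mutual information (PMI) over adjacent tokens. It is not the method proposed in this idea or in the linked paper; the function name `bigram_pmi`, the `min_count` threshold, and the toy corpus are all assumptions made for illustration. PMI only measures how strongly two tokens co-occur, so it can surface candidates like 'red scare' but cannot by itself tell a non-compositional phrase from a frequent compositional one.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent token pairs by pointwise mutual information (PMI).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), estimated from raw counts.
    Pairs seen fewer than `min_count` times are skipped, since PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy corpus, whitespace-tokenized; a real study would use
# the model's own tokenizer and a much larger corpus.
corpus = (
    "the red scare shaped politics . the red cat slept . "
    "a black cat crossed the road . the red scare ended . "
    "a blue cat and a green cat slept ."
).split()

scores = bigram_pmi(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this toy corpus 'red scare' passes the frequency threshold while 'red cat' does not, but that reflects how the corpus was built rather than any property of meaning. In a large corpus 'black cat' would likely score highly whether or not it is used in its bad-luck sense, which is exactly the gap the question above points at.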
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-all-the-words-2026,
author = {Holtzman, Ari},
title = {All the words and phrases},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/cSOuC7g6YSQHQcWgINsJ}
}