Sequence Length, Vocabulary Size, and Efficiency Tradeoffs in Symbolic Music Tokenization (REMI+, Compound Tokens, and BPE)

by Qicheng Jin, 4 months ago

Does your more expressive tokenization blow up sequence length or harm model efficiency, and can BPE or compound tokenization help?

Research Question: What is the tradeoff between sequence length, vocabulary size, and model efficiency in advanced tokenizations versus REMI+ and BPE variants?

Hypothesis: While advanced tokenization increases vocabulary, compound/subword approaches can maintain efficiency without sacrificing expressivity.

Experiment Plan: Measure sequence length, vocabulary size, and training and inference speed for each tokenization scheme (your tokenization, REMI+, compound tokens, and BPE variants). Test hybrid approaches, such as applying BPE on top of your tokenization. Report generated-music quality alongside model resource usage.
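As a rough illustration of the first measurement, here is a minimal sketch of BPE applied on top of a token stream, tracking how total sequence length falls as vocabulary size grows. Everything in it is an assumption made for illustration: the REMI-style token names (`Bar`, `Pos_0`, `Pitch_60`, and so on), the two-sequence toy corpus, and the helper names `learn_bpe` and `merge_pair`. It is a plain-Python toy, not the implementation used by Fradet et al. or by any tokenization library.

```python
from collections import Counter

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bpe(seqs, num_merges):
    """Greedily merge the most frequent adjacent token pair, up to `num_merges` times."""
    seqs = [list(s) for s in seqs]
    vocab = {tok for s in seqs for tok in s}
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for s in seqs:
            counts.update(zip(s, s[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # nothing left that repeats, so merging would not help
            break
        new_token = pair[0] + "+" + pair[1]  # "+" never appears in the base tokens
        seqs = [merge_pair(s, pair, new_token) for s in seqs]
        vocab.add(new_token)
        merges.append(pair)
    return seqs, vocab, merges

# Hypothetical REMI-style corpus: two short bars, made up for this sketch.
corpus = [
    ["Bar", "Pos_0", "Pitch_60", "Vel_80", "Dur_4", "Pos_4", "Pitch_64", "Vel_80", "Dur_4"],
    ["Bar", "Pos_0", "Pitch_60", "Vel_80", "Dur_4", "Pos_8", "Pitch_67", "Vel_80", "Dur_4"],
]

base_len = sum(len(s) for s in corpus)
base_vocab = len({tok for s in corpus for tok in s})

merged, vocab, merges = learn_bpe(corpus, num_merges=4)
merged_len = sum(len(s) for s in merged)

print(f"base:   length={base_len}, vocab={base_vocab}")
print(f"merged: length={merged_len}, vocab={len(vocab)}, merges={len(merges)}")
```

Each merge adds one vocabulary entry and shortens the sequences that contain the merged pair, which is the tradeoff the research question asks about. On a real corpus the same two counts could be logged per tokenization scheme and per number of merges, next to wall-clock training and inference time.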

References:

    1. Fradet, N., Briot, J.-P., Chhel, F., Seghrouchni, A. E., & Gutowski, N. (2023). Byte Pair Encoding for Symbolic Music. EMNLP.

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{jin-sequence-length-vocabulary-2026,
  author = {Jin, Qicheng},
  title = {Sequence Length, Vocabulary Size, and Efficiency Tradeoffs in Symbolic Music Tokenization (REMI+, Compound Tokens, and BPE)},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/NapqSeqli4hSxTMVr3iD}
}
