Sequence Length, Vocabulary Size, and Efficiency Tradeoffs in Symbolic Music Tokenization (REMI+, Compound Tokens, and BPE)

Does your more expressive tokenization blow up sequence length or harm model efficiency, and can BPE or compound tokenization help?

Research Question: What is the tradeoff between sequence length, vocabulary size, and model efficiency in advanced tokenizations versus REMI+ and BPE variants?

Hypothesis: While advanced tokenization increases vocabulary, compound/subword approaches can maintain efficiency without sacrificing expressivity.
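As a rough illustration of why compound tokenization can shorten sequences, the sketch below groups consecutive note attributes into a single step. The token names (`Pitch_60`, `Velocity_80`, and so on) and the `to_compound` helper are hypothetical and simplified. They are not the exact vocabulary of REMI+ or of any published compound-token scheme.

```python
from typing import List, Tuple

# Hypothetical attribute order for one note; real vocabularies differ in detail.
NOTE_FIELDS = ("Pitch", "Velocity", "Duration")


def to_compound(tokens: List[str]) -> List[Tuple[str, ...]]:
    """Group each consecutive Pitch/Velocity/Duration triple into one step.

    Tokens that are not part of such a triple are kept as 1-tuples.
    """
    out: List[Tuple[str, ...]] = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + len(NOTE_FIELDS)]
        is_note = len(window) == len(NOTE_FIELDS) and all(
            tok.startswith(field + "_")
            for tok, field in zip(window, NOTE_FIELDS)
        )
        if is_note:
            out.append(tuple(window))
            i += len(NOTE_FIELDS)
        else:
            out.append((tokens[i],))
            i += 1
    return out


# Toy REMI-like stream: one bar containing two notes.
remi_like = [
    "Bar_None", "Position_0", "Pitch_60", "Velocity_80", "Duration_4",
    "Position_4", "Pitch_64", "Velocity_80", "Duration_4",
]
compound = to_compound(remi_like)
print(len(remi_like), "->", len(compound))  # 9 -> 5
```

In this toy case the model sees 5 steps instead of 9, at the cost of predicting several attributes per step, which is the efficiency-versus-expressivity tradeoff the hypothesis refers to.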

Experiment Plan: Measure sequence lengths, vocabulary sizes, and training/inference speeds across tokenization schemes. Test hybrid approaches: BPE on top of your tokenization. Report on generated music quality and model resource usage.
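The length and vocabulary measurements in this plan could be prototyped along the following lines. This is a minimal sketch under stated assumptions: it implements a toy BPE learner directly over token strings rather than using any particular tokenizer library, and the corpus, token names, and helper functions (`learn_bpe`, `report`) are made up for illustration. It does not measure training or inference speed, which would require an actual model.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def most_frequent_pair(seqs: List[List[str]]) -> Optional[Pair]:
    """Return the most frequent adjacent token pair, or None if there is none."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq: List[str], pair: Pair, new_token: str) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs: List[List[str]], num_merges: int):
    """Greedily learn up to `num_merges` merges; return encoded seqs and merges."""
    seqs = [list(s) for s in seqs]
    merges: List[Pair] = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        seqs = [merge_pair(s, pair, pair[0] + "+" + pair[1]) for s in seqs]
        merges.append(pair)
    return seqs, merges


def report(name: str, seqs: List[List[str]], vocab_size: int) -> Dict[str, object]:
    """Summarize one tokenization scheme: vocabulary size and total length."""
    return {
        "scheme": name,
        "vocab_size": vocab_size,
        "total_tokens": sum(len(s) for s in seqs),
    }


# Toy corpus of base tokens (hypothetical names).
base = [
    ["Pitch_60", "Velocity_80", "Duration_4",
     "Pitch_60", "Velocity_80", "Duration_4"],
    ["Pitch_60", "Velocity_80", "Duration_2"],
]
base_vocab = {tok for seq in base for tok in seq}
encoded, merges = learn_bpe(base, num_merges=2)

# Each learned merge adds one entry to the tokenizer vocabulary.
base_stats = report("base", base, len(base_vocab))
bpe_stats = report("base+BPE", encoded, len(base_vocab) + len(merges))
print(base_stats)
print(bpe_stats)
```

On this toy corpus, two merges cut the total length from 9 to 4 tokens while growing the vocabulary from 4 to 6 entries. The same two numbers, collected per scheme on a real dataset, give the sequence-length and vocabulary-size axes of the comparison.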

References:

1. Fradet, N., Briot, J.-P., Chhel, F., Seghrouchni, A. E., & Gutowski, N. (2023). Byte Pair Encoding for Symbolic Music. EMNLP.

symbolic music, MIDI, music generation, tokenization, subword tokenization, BPE, REMI+, sequence compression, long sequence modeling, training efficiency

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{jin-sequence-length-vocabulary-2026,
  author = {Jin, Qicheng},
  title = {Sequence Length, Vocabulary Size, and Efficiency Tradeoffs in Symbolic Music Tokenization (REMI+, Compound Tokens, and BPE)},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/NapqSeqli4hSxTMVr3iD}
}
