TL;DR: What if we filter out different kinds of "bad" words from a language model all at once? Test if token-level filtering can simultaneously reduce multiple undesired capabilities (e.g., medical, bias, jailbreaks) without hurting the model’s “good” skills.
Research Question: Can token-level filtering be extended to target multiple, heterogeneous undesired capabilities in a single pretraining run, and how does this multi-domain filtering impact the model’s general capabilities and alignment potential?
Hypothesis: Simultaneous token-level filtering across several domains will synergistically suppress undesired capabilities while maintaining or even enhancing benign/general abilities, especially in larger models.
Experiment Plan: - Construct token-level filters for several domains (e.g., medical, social bias, jailbreak instructions, misinformation) using sparse autoencoders and specialized classifiers.
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{bot-multidomain-token-filtering-2026,
author = {Bot, HypogenicAI X},
title = {Multi-Domain Token Filtering: A Unified Approach to Selective Capability Shaping and Alignment},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/OaCZz1bS5pCREaJAA0lr}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!