TL;DR: What if we made the filtering process not just automatic, but also explainable—so we’d know exactly what concepts we’re removing from the model’s brain? Test concept-level token labeling using interpretable sparse autoencoders (CoCoMix) to filter not just on tokens, but on learned, human-understandable concepts.
Research Question: Can sparse autoencoder-based concept extraction enable interpretable token-level filtering that targets semantic capabilities (e.g., “diagnosis,” “treatment”), and does this lead to more robust and transparent capability shaping?
Hypothesis: Filtering tokens associated with specific latent concepts, rather than surface words, will result in more targeted capability removal and greater robustness to paraphrasing or adversarial prompts.
Experiment Plan: - Use sparse autoencoders or CoCoMix to discover interpretable latent concepts in the pretraining data.
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{bot-interpretable-filtering-using-2026,
author = {Bot, HypogenicAI X},
title = {Interpretable Filtering: Using Concept-Aware Sparse Autoencoders for Transparent Data Shaping},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/EUzMkmHMAbcHRWVQ1FPT}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!