What is the most cost-effective way to get a model to never use a particular word or phrase?
Fine-tuning the model, or a lighter-weight intervention, like contrastive prompting?
For example, I get instantly triggered by the “It’s not X, but Y” framing. You could define this pattern reasonably broadly and try to suppress it. What is the simplest method?
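As one possible starting point, here is a minimal sketch of two inference-time options that need no training: masking the logits of tokens that would complete a banned token sequence, and rejecting and resampling outputs that match a broad regex for the framing. Everything named here is an assumption for illustration: the token ids are made up, `sample` is a hypothetical stand-in for whatever call produces a completion, and the regex is a rough guess at "reasonably broadly" that will both miss variants and flag some harmless sentences. A real tokenizer can also encode the same phrase several ways (leading space, capitalization), so each variant would need its own banned sequence.

```python
import math
import re
from typing import Callable, List, Sequence, Set


def banned_next_tokens(generated: Sequence[int],
                       banned_seqs: Sequence[Sequence[int]]) -> Set[int]:
    """Return token ids that would complete a banned sequence if emitted next."""
    banned: Set[int] = set()
    for seq in banned_seqs:
        prefix, last = list(seq[:-1]), seq[-1]
        # A single-token ban always applies; a longer one applies only when
        # the text generated so far ends with the sequence's prefix.
        if not prefix or list(generated[-len(prefix):]) == prefix:
            banned.add(last)
    return banned


def mask_logits(logits: List[float], banned: Set[int]) -> List[float]:
    """Set banned token logits to -inf so they get zero probability."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


# Rough, assumed pattern for the "not X, but Y" framing within one sentence.
NOT_X_BUT_Y = re.compile(
    r"\bnot\s+(?:just\s+|only\s+|merely\s+)?[^.;!?]{1,80}?,?\s+but\b",
    re.IGNORECASE,
)


def sample_until_clean(sample: Callable[[], str], max_tries: int = 5) -> str:
    """Redraw while the text matches the pattern, up to max_tries draws.

    `sample` is a hypothetical zero-argument function returning one
    completion. If every draw matches, the last draw is returned anyway.
    """
    text = sample()
    for _ in range(max_tries - 1):
        if not NOT_X_BUT_Y.search(text):
            break
        text = sample()
    return text
```

Token masking is the cheaper of the two for a fixed word or phrase, since it costs one set lookup per decoding step, but it cannot express an open-ended pattern. Resampling handles the pattern at the price of extra generations whenever a draw is rejected.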
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{delisle-cheapest-way-to-2026,
author = {Delisle, Nathan},
title = {Cheapest way to suppress a single phrase from LM output},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/9bJUYvBY9Q1Xpz3x1Mgy}
}