In the persona vectors paper (https://arxiv.org/abs/2507.21509) they show that you can essentially factor out a persona vector by adding it in during fine-tuning, which relieves the pressure on the model to encode information about that persona direction. Can we use this to factor out friendliness, helpfulness, harmlessness, and honesty from assistants? I would certainly like an assistant that doesn't try to be so brand-name cheerful. I think the first step in doing that is seeing whether we can make a purely neutral instruction-following dataset from the datasets we already have.
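To make the mechanics concrete, here is a minimal toy sketch, not the paper's code. It assumes a persona vector can be approximated as the difference between mean activations on trait-expressing and neutral responses, and that steering means adding a scaled copy of that vector to a hidden state during fine-tuning. The function names, the 3-dimensional activations, and the `alpha` value are all hypothetical; real activations would come from a chosen layer of the model.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def persona_vector(trait_acts, neutral_acts):
    """Mean activation on trait-expressing responses minus mean on neutral ones."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [t - n for t, n in zip(mu_trait, mu_neutral)]

def steer(hidden, vector, alpha):
    """Add alpha * vector to one hidden state (intended for fine-tuning time only)."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy activations (assumption): two "cheerful" and two "neutral" responses.
cheerful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neutral = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

v_friendly = persona_vector(cheerful, neutral)

# During fine-tuning the vector is added to the hidden state, so the weights
# are under less pressure to move along this direction themselves.
steered = steer([0.5, 0.5, 0.5], v_friendly, alpha=2.0)
```

At inference time the vector would simply not be added, which is the sense in which the trait is "factored out" under this reading of the idea.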
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-factoring-out-friendly-2025,
author = {Holtzman, Ari},
title = {Factoring out friendly assistance},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/nzfMazrYsGTmAWf3TTXv}
}