Generalization of Refusal

by Freda Shi27 days ago
1

Let's finetune a safety-aligned language model through either SFT or RL to have it refuse a few random benign requests, e.g., "How to make pancakes?" What other requests would be more likely to be refused?

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{shi-generalization-of-refusal-2026,
  author = {Shi, Freda},
  title = {Generalization of Refusal},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/EHS62iZW3paF9JfqyNfJ}
}

Comments (0)

Please sign in to comment on this idea.

No comments yet. Be the first to share your thoughts!