Generalization of Refusal

by Freda Shi, 4 months ago

Suppose we fine-tune a safety-aligned language model, through either SFT or RL, to refuse a few random benign requests, e.g., "How to make pancakes?" Which other requests would then become more likely to be refused?
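One way to make the question measurable is to compare per-category refusal rates on a held-out prompt set before and after fine-tuning. The snippet below is a minimal sketch of that comparison, not the method used in any of the linked implementations. The phrase-matching detector `is_refusal`, the category labels, and the example responses are all hypothetical placeholders chosen only to exercise the code; a real study would likely use a trained classifier or human labels, and real model outputs.

```python
from collections import defaultdict

# Assumption: a crude phrase-matching refusal detector. These markers are
# illustrative only and would miss many real refusals.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate_by_category(records):
    """Map each category to its refusal rate.

    records: iterable of (category, response) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [refused, total]
    for category, response in records:
        counts[category][0] += is_refusal(response)
        counts[category][1] += 1
    return {c: refused / total for c, (refused, total) in counts.items()}


def refusal_shift(before, after):
    """Per-category change in refusal rate (after minus before)."""
    base = refusal_rate_by_category(before)
    tuned = refusal_rate_by_category(after)
    return {c: tuned[c] - base[c] for c in base if c in tuned}


# Placeholder responses (made up, not experimental results).
before = [
    ("cooking", "Mix flour, eggs, and milk."),
    ("cooking", "Whisk the batter until smooth."),
    ("math", "The answer is 4."),
]
after = [
    ("cooking", "I'm sorry, I can't help with that."),
    ("cooking", "Whisk the batter until smooth."),
    ("math", "The answer is 4."),
]

shift = refusal_shift(before, after)
```

Categories whose refusal rate rises the most after fine-tuning would be the candidates for where the learned refusal generalizes, for example topically related requests versus requests that merely share surface phrasing.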

AI safety alignment
Implemented: https://github.com/Hypogenic-AI/generalization-refusal-f437-claude
Implemented: https://github.com/Hypogenic-AI/generalization-refusal-f96b-codex
Implemented: https://github.com/Hypogenic-AI/generalize-refusal-76f7-gemini

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{shi-generalization-of-refusal-2026,
  author = {Shi, Freda},
  title = {Generalization of Refusal},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/EHS62iZW3paF9JfqyNfJ}
}
