While Ko et al. (2024) address alignment degradation post-unlearning at a macro level, this idea explores micro-level alignment failures. We hypothesize that certain prompt structures (e.g., abstract concepts, multi-object relationships) disproportionately trigger misalignment after unlearning. By systematically testing prompt categories and measuring alignment drift using Conditional Vendi scores (Jalali et al., 2024), we’ll map "misalignment hotspots." This extends Ko et al.’s work by shifting from average alignment preservation to prompt-specific vulnerability analysis, enabling targeted unlearning refinements. The novelty lies in treating alignment as a prompt-dependent phenomenon rather than a monolithic metric.
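One way to picture the prompt-specific analysis is as a per-category aggregation of alignment drift. The sketch below is only an illustration under stated assumptions: it takes a per-prompt alignment score before and after unlearning as already computed (by a Conditional Vendi based measure or any other scorer, which is not implemented here), and the names `category_drift`, `find_hotspots`, the `ratio` threshold, and the toy numbers are all hypothetical rather than taken from Ko et al. or Jalali et al.

```python
from collections import defaultdict
from statistics import mean


def category_drift(records):
    """Average alignment drop per prompt category.

    records: iterable of (category, score_before, score_after) tuples,
    where the scores come from some alignment metric (assumed given).
    A positive drift means alignment got worse after unlearning.
    """
    drops_by_category = defaultdict(list)
    for category, before, after in records:
        drops_by_category[category].append(before - after)
    return {cat: mean(drops) for cat, drops in drops_by_category.items()}


def find_hotspots(drift, ratio=1.2):
    """Flag categories whose drift exceeds `ratio` times the mean drift
    across categories. The threshold rule is an assumption for illustration."""
    overall = mean(drift.values())
    return sorted(
        cat for cat, d in drift.items() if d > 0 and d > ratio * overall
    )


# Made-up scores, used only to exercise the functions above.
toy_records = [
    ("single_object", 0.82, 0.80),
    ("single_object", 0.78, 0.77),
    ("abstract_concept", 0.70, 0.55),
    ("abstract_concept", 0.66, 0.52),
    ("multi_object", 0.74, 0.60),
    ("multi_object", 0.72, 0.61),
]

toy_drift = category_drift(toy_records)
toy_hotspots = find_hotspots(toy_drift)
print(toy_drift)
print(toy_hotspots)
```

Comparing each category against the cross-category mean, rather than against a fixed cutoff, reflects the shift from average alignment preservation to relative vulnerability; a real study would likely also want confidence intervals per category before calling anything a hotspot.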
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{z-ai/glm-4.6-misalignment-hotspots-probing-2025,
author = {z-ai/glm-4.6},
title = {Misalignment Hotspots: Probing Prompt-Induced Alignment Failures in Post-Unlearning Models},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/85ZUHoJoFo4Kwoj3NK0W}
}