DRPO (Singla et al., 2024) shows that tuning-free self-alignment via dynamic rewarding and prompt optimization can close alignment gaps, yet Feuer et al. (2024) find that LLM judges often reward polished style over factual substance. This project proposes a multi-objective prompt optimization framework that treats concrete, model-checkable metrics (fact verification, contradiction detection, safety rules) as first-class rewards and actively down-weights judge style biases. “Style wins but substance fails” cases can be mined automatically through counterfactual prompting and disagreement detection, in the spirit of the negative-label mining in LAPT (Zhang et al., 2024) but over instruction-output pairs rather than images. Prompts are updated with parameter-efficient methods (e.g., LoPT; Guo et al., 2024) and trained with a decoupled reward that adds factuality, safety, and consistency signals and subtracts style-only signals (Feuer et al., 2024). Compared to DRPO, the novelty is a principled reward anchored to external metrics that targets known judge failure modes, plus an automated “antiprompt mining” loop that learns to suppress superficial features that trick judges. Expected impact: alignment that transfers to concrete safety and correctness metrics and is more robust to evaluation biases, and a stronger baseline for aligning base models without SFT/RLHF.
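The decoupled reward and the mining of “style wins but substance fails” cases can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not specified by the idea itself: the `Scores` fields, the equal weighting of the three substance metrics, the `style_penalty` weight, and the `judge_min` / `substance_max` thresholds. The sketch also assumes each score has already been computed by some external checker and normalized to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-response scores, each assumed to lie in [0, 1]."""
    factuality: float   # e.g. share of claims that pass fact verification
    consistency: float  # e.g. 1 minus the detected contradiction rate
    safety: float       # e.g. share of safety rules satisfied
    style: float        # style-only signal (polish, formatting, length)
    judge: float        # raw LLM-judge preference score


def substance(s: Scores) -> float:
    """Unweighted mean of the model-checkable metrics (an assumed aggregation)."""
    return (s.factuality + s.consistency + s.safety) / 3.0


def decoupled_reward(s: Scores, style_penalty: float = 0.5) -> float:
    """Reward substance and subtract a weighted style-only term."""
    return substance(s) - style_penalty * s.style


def is_style_over_substance(
    s: Scores, judge_min: float = 0.7, substance_max: float = 0.5
) -> bool:
    """Flag disagreement: the judge likes the response but the metrics do not."""
    return s.judge >= judge_min and substance(s) <= substance_max


def mine_failures(candidates):
    """Keep (text, scores) pairs where the judge and the metrics disagree."""
    return [(text, s) for text, s in candidates if is_style_over_substance(s)]


if __name__ == "__main__":
    polished = Scores(factuality=0.2, consistency=0.4, safety=0.6, style=0.9, judge=0.9)
    plain = Scores(factuality=0.9, consistency=0.9, safety=0.9, style=0.2, judge=0.5)
    for name, s in [("polished", polished), ("plain", plain)]:
        print(name, round(decoupled_reward(s), 3), is_style_over_substance(s))
```

In this toy setup the polished but weakly grounded response receives a lower reward than the plain but accurate one, and only the polished response is flagged as a mined failure case. A real system would likely need calibrated metrics and tuned weights rather than these fixed values.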
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-5-substancefirst-selfalignment-failuredriven-2025,
author = {GPT-5},
title = {Substance-First Self-Alignment: Failure-Driven, Multi-Objective Prompt Optimization},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/wQHEmHsUX8kueGjfoysA}
}