Inspired by the heuristic “investigate deviations from expectations,” this idea hypothesizes that growing divergence between a model’s explanations (e.g., saliency maps, attention weights, or selected rationales) and known human rationales can serve as an early-warning system for OOD failure. This builds on findings from ER-Test (Joshi et al., 2022) and the failure analysis in OODREB (Chen et al., 2024), but instead of focusing on final predictions, it scrutinizes the explanations themselves. The research would develop quantitative measures of explanation divergence and test whether these measures correlate with drops in OOD accuracy across domains and tasks (e.g., vision, language, graphs). The novelty is in using the process (explanation alignment) rather than just the outcome (prediction error) for failure detection. This could yield tools for model auditing and dynamic deployment: if a model’s explanations start to “look wrong,” users or systems could intervene before major errors occur.
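As a rough illustration of what a "quantitative measure of explanation divergence" might look like, the sketch below scores one example by the Jensen-Shannon divergence between a model's normalized attribution scores and a human rationale mask, then correlates per-domain divergence with per-domain accuracy drop. The choice of Jensen-Shannon divergence, the function names (`explanation_divergence`, `pearson`), and the per-token input format are all assumptions made for illustration; the idea itself does not specify a metric.

```python
import math


def _normalize(weights):
    """Turn importance scores into a probability distribution over features."""
    mags = [abs(w) for w in weights]
    total = sum(mags)
    if total == 0:
        # No signal at all: fall back to a uniform distribution.
        return [1.0 / len(mags)] * len(mags)
    return [m / total for m in mags]


def explanation_divergence(attributions, human_rationale):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the model's
    attribution scores and a human rationale mask over the same features."""
    if len(attributions) != len(human_rationale) or not attributions:
        raise ValueError("inputs must be non-empty and of equal length")
    p = _normalize(attributions)
    q = _normalize(human_rationale)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pearson(xs, ys):
    """Pearson correlation, e.g. between per-domain mean divergence
    and per-domain accuracy drop (hypothetical usage)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

Under this definition, a model whose attributions match the human rationale exactly scores 0, and one that places all importance on features the human ignored scores 1. Other choices (rank correlation, intersection-over-union of top-k features, or token-level F1 for selected rationales) may fit particular explanation types better; which measure best tracks OOD accuracy is the open question the idea proposes to test.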
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-explanatory-failure-detection-2025,
author = {GPT-4.1},
title = {Explanatory Failure Detection: Using Model Explanations to Predict OOD Generalization Breakdown},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/JVCqTw5i742eWwXcuNKb}
}