Inspired by Wan et al. (2024) and HERA (Li et al. 2025), both of which suggest that document order and context structure significantly affect summarization faithfulness, this research proposes a systematic framework for counterfactual testing. The framework takes legal documents and generates synthetic variants with shuffled sections, omitted arguments, or reordered facts. It then compares the summaries generated from these perturbed documents against the true legal facts and logic of the original. This approach extends prior work by providing a rigorous, adversarial testbed for faithfulness. The testbed would expose not just average-case performance but also worst-case vulnerabilities. For example, does the model hallucinate when the timeline is scrambled? Does it invent facts to bridge incoherent passages? The resulting dataset and evaluation suite could drive the next generation of robust, context-invariant legal summarizers.
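The perturb-then-compare loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the proposal's own method:

- A document is assumed to be already split into (heading, text) sections.
- Sentences are split naively after periods.
- Faithfulness is scored by set overlap between pre-extracted, normalized fact strings. `source_facts` stands in for an annotated gold fact set, and any summary fact outside it is treated as a candidate hallucination.
- All names (`Section`, `shuffle_sections`, `omit_section`, `reorder_sentences`, `faithfulness`) are hypothetical, as is the toy lease case in the demo block.

```python
"""Illustrative sketch only: counterfactual perturbations of a sectioned
document plus a toy fact-level faithfulness score. All names are hypothetical."""
from __future__ import annotations

import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    """Assumed document unit: a heading plus its body text."""
    heading: str
    text: str


def shuffle_sections(sections: list[Section], seed: int = 0) -> list[Section]:
    """Return a copy with the section order permuted (seeded for reproducibility)."""
    rng = random.Random(seed)
    perturbed = list(sections)
    rng.shuffle(perturbed)
    return perturbed


def omit_section(sections: list[Section], index: int) -> list[Section]:
    """Drop one section, e.g. a party's argument, to simulate an omission."""
    return [s for i, s in enumerate(sections) if i != index]


def reorder_sentences(section: Section, seed: int = 0) -> Section:
    """Scramble sentence order inside one section, e.g. to break a timeline.
    Splits naively after periods; real legal text needs a proper segmenter."""
    sentences = [s for s in re.split(r"(?<=\.)\s+", section.text.strip()) if s]
    random.Random(seed).shuffle(sentences)
    return Section(section.heading, " ".join(sentences))


def faithfulness(summary_facts: set[str], source_facts: set[str]) -> dict:
    """Compare pre-extracted, normalized fact strings (an assumption: a real
    pipeline needs an extraction or entailment step to produce them)."""
    supported = summary_facts & source_facts
    unsupported = summary_facts - source_facts  # candidate hallucinations
    precision = len(supported) / len(summary_facts) if summary_facts else 1.0
    recall = len(supported) / len(source_facts) if source_facts else 1.0
    return {"precision": precision, "recall": recall,
            "unsupported": sorted(unsupported)}


if __name__ == "__main__":
    doc = [
        Section("Facts", "The lease was signed in 2019. The tenant stopped paying in 2021."),
        Section("Argument", "The landlord claims breach of contract."),
        Section("Holding", "The court found for the landlord."),
    ]
    variants = {
        "shuffled": shuffle_sections(doc, seed=1),
        "omitted_argument": omit_section(doc, 1),
        "timeline_scrambled": [reorder_sentences(doc[0], seed=1)] + doc[1:],
    }
    for name, sections in variants.items():
        print(name, [s.heading for s in sections])

    # Hypothetical gold facts vs. facts extracted from one model summary.
    gold = {"lease signed 2019", "tenant stopped paying 2021", "court found for landlord"}
    summary = {"lease signed 2019", "court found for landlord", "tenant paid in full"}
    print(faithfulness(summary, gold))
    # → precision and recall are 2/3 each; "tenant paid in full" is flagged as unsupported
```

Seeding every perturbation keeps each variant reproducible. Computing the same score for the original document and for each variant then shows which perturbation causes the largest drop, which is the worst-case signal the proposal is after.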
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-4.1-counterfactual-summarization-stresstesting-2025,
author = {GPT-4.1},
title = {Counterfactual Summarization: Stress-Testing Faithfulness via Synthetic Document Reordering and Perturbation},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/EN1MP5pfqqaa4JjLgmnI}
}