TL;DR: Cutting data can raise your average grade but leave some students behind. We develop group-specific scaling laws and phase diagrams to quantify when aggressive curation improves overall accuracy but harms worst-group performance, and propose fairness-aware curation rules that trade off mean risk against worst-group risk. Initial experiments: intrusion detection and image-to-anime translation with imbalanced groups, to map where “less is more” creates hidden harms.
Research Question: How do less-is-more curation strategies affect worst-group performance and fairness, and can we design curation rules that preserve the benefits without exacerbating disparities?
Hypothesis: When minority groups have higher noise/overlap or are underrepresented, aggressive curation that prioritizes correctness/difficulty can preferentially discard minority examples, improving average accuracy but worsening worst-group metrics. Group-aware objectives can restore worst-group performance while retaining much of the small-data advantage.
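The discard mechanism in this hypothesis can be illustrated with a toy simulation. The sketch below is not the proposed experiment: the group sizes, the class separations `mu_major` and `mu_minor`, the margin-based score, and the function names are all assumptions chosen for illustration. It shows only how a global "keep the highest-margin examples" rule can retain far fewer minority examples when that group has more class overlap, and how a stratified (per-group) rule equalizes retention; it does not measure downstream accuracy.

```python
import random

def make_population(n_major=9000, n_minor=1000, mu_major=2.0, mu_minor=0.5, seed=0):
    """Toy two-group population; returns a list of (group, margin) pairs.

    All sizes and separations are illustrative assumptions. A smaller mu
    means more class overlap, hence lower margins on average.
    """
    rng = random.Random(seed)
    data = []
    for group, n, mu in (("major", n_major, mu_major), ("minor", n_minor, mu_minor)):
        for _ in range(n):
            y = rng.choice((-1, 1))           # binary label
            x = y * mu + rng.gauss(0.0, 1.0)  # 1-D feature with unit Gaussian noise
            data.append((group, y * x))       # margin under a fixed scorer w = 1
    return data

def curate(data, keep_frac, stratified=False):
    """Keep the highest-margin examples, globally or within each group."""
    by_margin = lambda d: d[1]
    if not stratified:
        k = round(keep_frac * len(data))
        return sorted(data, key=by_margin, reverse=True)[:k]
    kept = []
    for g in sorted({d[0] for d in data}):
        members = sorted((d for d in data if d[0] == g), key=by_margin, reverse=True)
        kept.extend(members[: round(keep_frac * len(members))])
    return kept

def retention_by_group(data, kept):
    """Fraction of each group's examples that survive curation."""
    total, survived = {}, {}
    for g, _ in data:
        total[g] = total.get(g, 0) + 1
    for g, _ in kept:
        survived[g] = survived.get(g, 0) + 1
    return {g: survived.get(g, 0) / total[g] for g in total}

data = make_population()
global_ret = retention_by_group(data, curate(data, keep_frac=0.3))
strat_ret = retention_by_group(data, curate(data, keep_frac=0.3, stratified=True))
print("global top-30%:    ", global_ret)
print("stratified top-30%:", strat_ret)
```

Under these assumed parameters the global rule keeps a much smaller share of the minority group than of the majority group, while the stratified rule keeps 30% of each. Stratified retention is only one simple group-aware rule; whether it preserves the small-data advantage is the open question the hypothesis raises.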
Experiment Plan: Setup: Extend the framework of Dohmatob et al. (2025) to stratified populations with group-specific label noise and class overlap; derive group-wise phase-transition curves and worst-group risk bounds under label-aware vs. label-agnostic curation.
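One way to write down the quantities this setup refers to is sketched below. The notation is an assumption made for illustration, not taken from Dohmatob et al. (2025): there are $G$ groups with population shares $\pi_g$, a curation rule $q$ drawn from a candidate family $\mathcal{Q}$, a model $\hat{\theta}_q$ trained on the data kept by $q$, and a per-group risk $R_g$.

```latex
% Illustrative notation only; symbols are assumptions, not the source framework's.
\begin{align}
R_{\mathrm{avg}}(q)   &= \sum_{g=1}^{G} \pi_g \, R_g(\hat{\theta}_q), \\
R_{\mathrm{worst}}(q) &= \max_{1 \le g \le G} R_g(\hat{\theta}_q), \\
q_\lambda &\in \arg\min_{q \in \mathcal{Q}} \; (1-\lambda)\, R_{\mathrm{avg}}(q) + \lambda\, R_{\mathrm{worst}}(q),
\qquad \lambda \in [0,1].
\end{align}
```

With $\lambda = 0$ this reduces to curation for average risk alone, and with $\lambda = 1$ it targets worst-group risk alone; intermediate values trace out the mean vs. worst-group trade-off.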
Data/Tasks:
References:
Dohmatob, E., Pezeshki, M., & Askari-Hemmat, R. (2025). Why Less is More (Sometimes): A Theory of Data Curation. arXiv.org.
Tran, N., Chen, H., Bhuyan, J., & Ding, J. (2022). Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection. IEEE Access.
Chen, H., Tran, N., Thumati, A. S., Bhuyan, J., & Ding, J. (2021). Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection. arXiv.org.
Wang, Y., Cai, Z., & Zhang, Q. (2025). The Effect of Dataset Imbalance on the Performance of Image-to-Cartoon Generative Adversarial Networks. Applied and Computational Engineering.
Bhardwaj, E., Gujral, H., Wu, S., Zogheib, C., Maharaj, T., & Becker, C. (2024). Machine learning data practices through a data curation lens: An evaluation framework. Conference on Fairness, Accountability and Transparency.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-5-who-wins-when-2025,
author = {GPT-5},
title = {Who Wins When We Use Less? Group-wise Phase Transitions and Fairness under Curation},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/7FQjb9xRb7KLC7f4bQEj}
}