TL;DR: Like packing for a trip, the best suitcase depends on how long you travel—picking only the best items works for a weekend, but you want more variety for a month. We propose a theory and empirical study of how the benefit of aggressive curation depends on compute budget and the inevitable repetition it induces. Initial experiment: for fixed compute, vary the fraction of high-quality curated data and the number of epochs; fit a 3D “phase diagram” of test error E(N, Q, C) to predict when small curated sets outperform large, noisier ones.
Research Question: How do training compute budget and data repetition interact with data curation to create budget-dependent “less-is-more” regimes?
Hypothesis: There exists a compute-dependent crossover: at low-to-moderate budgets, aggressively curated small datasets outperform larger, noisier datasets; as compute increases and curated data must be repeated more often, the marginal utility of curation diminishes, and adding lower-quality but unseen data becomes optimal. This crossover is governed by a repetition penalty that shifts the phase boundaries derived by Dohmatob et al.
Experiment Plan:
Setup: Extend the imperfect-oracle model from Dohmatob et al. by adding (i) a repetition-dependent utility decay as in Goyal et al., and (ii) a compute budget C that constrains the training schedule (epochs × batch size × steps). From this extended model, derive a predicted performance surface E(N, Q, C) and identify its phase boundaries.
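To make the shape of such a surface concrete, here is a minimal toy sketch in Python. It is not the model from Dohmatob et al. or Goyal et al.: the functional forms are assumptions chosen only for illustration. It reads N as the number of unique samples, Q as a per-sample utility multiplier in (0, 1], and C as the total number of samples processed. It also assumes each repeat of a sample is worth a fixed fraction `decay` of the previous pass, and that error follows a power law in effective samples. All function names and default constants are hypothetical.

```python
def effective_samples(n_unique, compute, decay=0.6):
    """Effective sample count after processing `compute` samples drawn by
    cycling through `n_unique` unique ones (assumed geometric decay per epoch)."""
    epochs = compute / n_unique
    if epochs <= 1.0:
        return float(compute)  # every sample seen so far is new
    # Sum of decay**j over epochs j = 0..epochs-1, per unique sample.
    return n_unique * (1.0 - decay ** epochs) / (1.0 - decay)


def toy_error(n_unique, quality, compute, a=1.0, b=0.5, decay=0.6):
    """Hypothetical surface E(N, Q, C): power law in quality-weighted effective samples."""
    n_eff = effective_samples(n_unique, compute, decay)
    return a * (quality * n_eff) ** (-b)


def crossover_budget(budgets, small, large):
    """Smallest budget in `budgets` at which the large pool beats the small one.

    `small` and `large` are (n_unique, quality) tuples. Returns None if no
    crossover occurs on the given grid.
    """
    for c in sorted(budgets):
        if toy_error(*large, c) < toy_error(*small, c):
            return c
    return None


# Assumed example pools: 1k curated samples vs. 100k noisier samples.
curated = (1_000, 1.0)
noisy = (100_000, 0.7)
first_win_for_noisy = crossover_budget([1_000, 2_000, 3_000, 10_000], curated, noisy)
```

Under these assumed constants the curated pool wins at a budget of 1,000 samples, and the noisier pool overtakes it once the curated data has been repeated a few times. That is the qualitative crossover the hypothesis describes; the actual boundary would have to come from the derived model and fitted constants.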
Data/Tasks: Large-scale vision-language pretraining (DataComp/LAION-like pools), plus smaller-scale classification (ImageNet-1k). Build multiple data “buckets” stratified by estimated quality (DFN-style filtering).
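One simple way to form such buckets is to rank samples by a scalar quality score and split the ranking into equal-size strata. The sketch below assumes the scores already exist (for example, produced by a DFN-style filtering model, which is not implemented here). The function name and the equal-size bucketing rule are illustrative assumptions, not a procedure specified by this proposal.

```python
def stratify_by_quality(scores, num_buckets=4):
    """Assign each sample a bucket index; bucket 0 holds the highest scores.

    `scores` is a list of per-sample quality estimates (higher = better).
    Buckets are (near-)equal in size, based on rank rather than raw score.
    """
    n = len(scores)
    # Indices of samples ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    buckets = [0] * n
    for rank, idx in enumerate(order):
        buckets[idx] = rank * num_buckets // n
    return buckets


# Four hypothetical samples split into two buckets.
example_buckets = stratify_by_quality([0.9, 0.1, 0.5, 0.7], num_buckets=2)
```

Rank-based buckets keep the strata the same size regardless of how the score distribution is shaped, which makes it easier to vary the curated fraction in fixed increments.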
Protocol:
References:
Dohmatob, E., Pezeshki, M., & Askari-Hemmat, R. (2025). Why Less is More (Sometimes): A Theory of Data Curation. arXiv.
Goyal, S., Maini, P., Lipton, Z. C., Raghunathan, A., & Kolter, J. (2024). Scaling Laws for Data Filtering -- Data Curation Cannot be Compute Agnostic. Computer Vision and Pattern Recognition.
Fang, A., Madappally Jose, A., Jain, A., Schmidt, L., Toshev, A., & Shankar, V. (2023). Data Filtering Networks. International Conference on Learning Representations.
Lin, F., Hu, Y., Sheng, P., Wen, C., You, J., & Gao, Y. (2024). Data Scaling Laws in Imitation Learning for Robotic Manipulation. International Conference on Learning Representations.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{gpt-5-computeaware-lessismore-phase-2025,
author = {GPT-5},
title = {Compute-aware Less-is-More: Phase Diagrams Coupling Curation, Repetition, and Budget},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/fEuLGsqJRdvtlWeY8qe6}
}