Human text is no longer useful in a post-scaling world


Idea: The "adaptation" procedure creates divergence from human language. Moving forward, human language will no longer be needed for building models.

More details:

As we train models for tasks requiring specific capabilities (e.g., SWE), they might become worse at modeling the natural distribution of language (e.g., C4), or even the language of specific tasks (e.g., human-written Minerva answers), while preferring their own model-generated data.
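One way this could be measured, as a rough sketch: score the same human-written and model-generated texts under the model before and after adaptation, and check whether the perplexity gap between the two moves. The function names below (`corpus_perplexity`, `preference_shift`) are hypothetical, and the sketch assumes you already have per-token natural-log probabilities from each checkpoint; it is not a specific evaluation protocol from this idea.

```python
import math
from typing import Sequence


def corpus_perplexity(docs: Sequence[Sequence[float]]) -> float:
    """Token-weighted perplexity from per-token natural-log probabilities."""
    total_logprob = sum(sum(doc) for doc in docs)
    num_tokens = sum(len(doc) for doc in docs)
    if num_tokens == 0:
        raise ValueError("need at least one scored token")
    return math.exp(-total_logprob / num_tokens)


def preference_shift(
    base_human: Sequence[Sequence[float]],
    base_model: Sequence[Sequence[float]],
    adapted_human: Sequence[Sequence[float]],
    adapted_model: Sequence[Sequence[float]],
) -> float:
    """Change in log(ppl_human / ppl_model) from the base to the adapted model.

    A positive value means that, relative to the base model, the adapted
    model fits model-generated text better than human text.
    """
    base_gap = math.log(corpus_perplexity(base_human)) - math.log(
        corpus_perplexity(base_model)
    )
    adapted_gap = math.log(corpus_perplexity(adapted_human)) - math.log(
        corpus_perplexity(adapted_model)
    )
    return adapted_gap - base_gap


# Toy log-probabilities (made up for illustration, not real measurements).
shift = preference_shift(
    base_human=[[-1.0, -1.0]],
    base_model=[[-1.0, -1.0]],
    adapted_human=[[-2.0, -2.0]],
    adapted_model=[[-0.5, -0.5]],
)
print(f"preference shift: {shift:.2f}")
```

Under the hypothesis, the shift would be positive and would grow with more adaptation. Comparing gaps rather than raw perplexities is a design choice here: it controls for the two corpora differing in difficulty to begin with.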

This is interesting because (1) it contradicts our assumptions about scaling on human data and (2) it may point to a future in which models are trained without any human text.

Somewhat related work on LLM response similarity (to each other): https://arxiv.org/abs/2502.16173



If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{heineman-human-text-is-2025,
  author = {Heineman, David},
  title = {Human text is no longer useful in a post-scaling world},
  year = {2025},
  url = {https://hypogenic.ai/ideahub/idea/AWp5MQuAonUbAiCtvbkL}
}
