Small Gradients as a Potential Mechanism for Synthetic Data Issues


Synthetic data tends to have lower perplexity than human-authored data because LLMs tend to write similarly to one another. Lower perplexity means lower per-token loss, which generally means smaller gradients. I hypothesize that this explains both (a) why you often need more synthetic data than human-authored data to learn the same thing and (b) why synthetic data causes collapse: when gradients are small overall, the relative size of gradients in certain directions that aren't related to LLM style is larger.
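To make the link between low perplexity and small gradients concrete, here is a minimal sketch for a single next-token prediction. It is an illustration under stated assumptions, not part of the original idea: the vocabulary size, the logit boosts, and the function names are all hypothetical, and it looks only at the gradient with respect to the logits (which for softmax cross-entropy is `p - onehot(target)`), not at parameter gradients in a real model.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def token_loss_and_grad_norm(logits, target):
    """Cross-entropy loss and the norm of its gradient w.r.t. the logits."""
    p = softmax(logits)
    loss = -np.log(p[target])
    grad = p.copy()
    grad[target] -= 1.0  # d(loss)/d(logits) = p - onehot(target)
    return loss, np.linalg.norm(grad)

# Hypothetical toy setup: same base logits, different boosts on the target.
rng = np.random.default_rng(0)
vocab_size = 50
target = 3
base = rng.normal(size=vocab_size)

# "Synthetic-like" token: the model already finds it very predictable.
predictable = base.copy()
predictable[target] += 8.0

# "Human-like" token: the model finds it less predictable.
surprising = base.copy()
surprising[target] += 1.0

loss_pred, grad_pred = token_loss_and_grad_norm(predictable, target)
loss_surp, grad_surp = token_loss_and_grad_norm(surprising, target)

print(f"predictable: perplexity={np.exp(loss_pred):.3f}, grad norm={grad_pred:.4f}")
print(f"surprising:  perplexity={np.exp(loss_surp):.3f}, grad norm={grad_surp:.4f}")
```

In this toy setting the more predictable token has both lower perplexity and a smaller logit-gradient norm. Whether the same ordering holds for parameter gradients summed over whole synthetic versus human-authored documents is exactly what the hypothesis would need to test.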

Tags: LLMs, synthetic data, model collapse, data efficiency


If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-small-gradients-as-2026,
  author = {Holtzman, Ari},
  title = {Small Gradients as a Potential Mechanism for Synthetic Data Issues},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/JEVfjBdf894UQGbmk4Qq}
}
