Getting a model to reproduce a book or article word for word is typically used as evidence of copyright infringement. As a result, such demonstrations are mostly proofs of existence: they show that a model can reproduce its training data, without comparing how effectively different methods elicit it. Can we benchmark different methods for this goal? Which method is most effective at getting models to output their training data?
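Any such benchmark needs a score for how much of a source text a model output copies word for word. The sketch below is one possible metric, not a method proposed in this idea: it measures the longest contiguous run of words shared by an output and a reference text, then averages that per extraction method. The function names and the method labels (`prefix_prompt`, `direct_request`) are hypothetical, and the outputs are assumed to have been collected from the model beforehand.

```python
from typing import Dict, List


def longest_verbatim_run(generated: str, reference: str) -> int:
    """Length in words of the longest contiguous word sequence found in both texts."""
    gen_words = generated.split()
    ref_words = reference.split()
    best = 0
    # prev[j] holds the length of the shared run ending at the previous
    # generated word and at reference word j (1-indexed).
    prev = [0] * (len(ref_words) + 1)
    for g in gen_words:
        curr = [0] * (len(ref_words) + 1)
        for j, r in enumerate(ref_words, start=1):
            if g == r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def score_methods(outputs_by_method: Dict[str, List[str]], reference: str) -> Dict[str, float]:
    """Average longest verbatim run per method (methods with no outputs are skipped)."""
    return {
        method: sum(longest_verbatim_run(out, reference) for out in outputs) / len(outputs)
        for method, outputs in outputs_by_method.items()
        if outputs
    }


if __name__ == "__main__":
    # Toy reference text and hand-written outputs, for illustration only.
    reference = "it was the best of times it was the worst of times"
    outputs = {
        "prefix_prompt": ["it was the best of times it was"],  # hypothetical method label
        "direct_request": ["the worst of luck"],  # hypothetical method label
    }
    print(score_methods(outputs, reference))
```

Splitting on whitespace keeps the sketch simple. A real benchmark would likely need to normalize case and punctuation and to decide how long a shared run must be before it counts as copying.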
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{tan-benchmarking-methods-for-2026,
author = {Tan, Chenhao},
title = {Benchmarking Methods for Word-by-Word Copying of Training Data},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/0e3kY7YQ6AENeEqAOsgy}
}