Are there any finetuning proof datasets currently?

by Ari Holtzman · 6 months ago

Most datasets get easier when models are finetuned on them, merely because of train/test overlap or because finetuning amplifies a subdistribution within the model, making it more confident in correct answers it already knew. Which datasets are most finetuning resistant? Are any of them finetuning proof?

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-are-there-any-2025,
  author = {Holtzman, Ari},
  title = {Are there any finetuning proof datasets currently?},
  year = {2025},
  url = {https://hypogenic.ai/ideahub/idea/XMu5uZXUuUig7I2khHhL}
}

Comments (2)


David Heineman · 6 months ago

I'm thinking about a specific instantiation of this idea: let's say you have a pretraining distribution X, a finetuning distribution Y_train, and a held-out Y_test (where Y_train and Y_test are random subsets of Y). Let's also say you operationalize "finetuning proof" as perplexity over Y_test. So you're trying to find a distribution Y where fitting your pretrained model m_X to Y_train doesn't help with in-domain generalization to Y_test.
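
A minimal sketch of this protocol, assuming a Hugging Face causal LM stands in for m_X and that y_train / y_test are lists of strings drawn from Y (the model name, device, and finetuning step are illustrative placeholders, not a prescribed setup):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, texts, device="cuda"):
    # Approximate corpus perplexity: token-weighted mean NLL over each text.
    model.eval()
    nll, n_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            out = model(**enc, labels=enc["input_ids"])
            n = enc["input_ids"].numel()
            nll += out.loss.item() * n
            n_tokens += n
    return math.exp(nll / n_tokens)

# "gpt2" is just a stand-in for the pretrained model m_X.
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

ppl_before = perplexity(model, tokenizer, y_test)
# ... finetune `model` on y_train here (e.g. with transformers.Trainer) ...
ppl_after = perplexity(model, tokenizer, y_test)

# Y is "finetuning proof" under this operationalization to the extent
# that ppl_after is no better than ppl_before.
print(f"Y_test perplexity: {ppl_before:.2f} -> {ppl_after:.2f}")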

Well, I'd probably look at how much X tells you about Y. If there's a large amount of mutual information between X and Y, then X probably helps you generalize to the unseen information in Y_test.

As a concrete example, take Sec. 5.1 of the GPT OSS report (https://arxiv.org/pdf/2508.10925#page=18.16). They called this process "adversarial training," where the goal was to make Chemical, Biological, Radiological, and Nuclear (CBRN) capabilities impossible to elicit, even if you train on CBRN instruction data. They found that removing CBRN content from pretraining made it difficult to elicit these capabilities even with a harmful finetuning dataset.

So, it could be the case that a simple mutual information (MI) measure answers this question! A simple place to start might be n-gram overlap.
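
For instance, one rough way to operationalize the n-gram overlap baseline (a sketch, assuming x_corpus and y_test are lists of strings from X and Y_test; whitespace tokenization and n=4 are arbitrary choices): measure what fraction of Y_test's n-grams already appear in the pretraining corpus, treating high overlap as a proxy for high X/Y mutual information.

from collections import Counter

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

def ngram_overlap(x_corpus, y_test, n=4):
    # Fraction of n-gram occurrences in y_test seen anywhere in x_corpus.
    x_ngrams = set()
    for doc in x_corpus:
        x_ngrams.update(ngrams(doc.split(), n))
    y_counts = Counter()
    for doc in y_test:
        y_counts.update(ngrams(doc.split(), n))
    total = sum(y_counts.values())
    seen = sum(c for g, c in y_counts.items() if g in x_ngrams)
    return seen / total if total else 0.0

print(f"4-gram overlap with pretraining data: {ngram_overlap(x_corpus, y_test):.1%}")

By this reading, a finetuning-proof Y would be one with low overlap (low estimated MI with X), so that neither pretraining nor finetuning on Y_train transfers to Y_test.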

Ari Holtzman · 5 months ago

Love this! How would you measure MI? Using LLMs with unknown training data makes them a confounded tool for measuring this. But I guess we could see whether they work anyway, or whether n-grams already tell us enough?
