In MechInterp, I see a lot of talk about the necessity and sufficiency of interventions.
The current standard appears to be:
But I feel like most such evidence doesn't capture selectivity: does X affect other stuff? Is Y just downstream of that other stuff? Many evals include a check like "does model perplexity go up on unrelated data?" or "does the model still do well on MMLU?" This seems entirely insufficient (no pun intended).
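To make the selectivity concern concrete, here is a toy Python sketch. It assumes we already have a per-behavior score change after intervening on a candidate component X. The behavior names, the numbers, the `selectivity_report` helper, and its `threshold` parameter are all hypothetical. The sketch contrasts a single averaged off-target number, roughly the kind of aggregate signal a perplexity-on-unrelated-data check gives, with a per-behavior breakdown.

```python
from statistics import mean

# Hypothetical effects: change in a per-behavior score after intervening on a
# candidate component X. "Y" is the target behavior; "A".."F" are off-target
# behaviors. These numbers are invented for illustration only.
effects = {
    "Y": -0.60,
    "A": -0.35,
    "B": 0.02,
    "C": 0.01,
    "D": 0.01,
    "E": 0.00,
    "F": 0.01,
}


def selectivity_report(effects, target, threshold=0.1):
    """Compare an averaged off-target effect with a per-behavior breakdown.

    effects: dict mapping behavior name -> score change after intervening on X
    target: name of the behavior Y the intervention is supposed to affect
    threshold: hypothetical cutoff for calling an off-target change "large"
    """
    off_target = {k: v for k, v in effects.items() if k != target}
    return {
        "on_target": effects[target],
        # A single averaged number, analogous to one aggregate check.
        "mean_off_target": mean(off_target.values()),
        # The largest individual off-target change, which the mean can hide.
        "max_abs_off_target": max(abs(v) for v in off_target.values()),
        # Off-target behaviors whose change meets or exceeds the threshold.
        "flagged": sorted(k for k, v in off_target.items() if abs(v) >= threshold),
    }


report = selectivity_report(effects, target="Y")
print(report)
```

In this made-up case the averaged off-target change is only about -0.05, while behavior "A" moved by 0.35. A single aggregate check would likely pass, even though "A" is exactly the kind of "other stuff" that Y might be downstream of.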
We should fix this, but I'm also trying to think about what causes the leakage in our current working definitions of necessity and sufficiency. One cause is that behaviors (Y) are often defined loosely enough that we are really seeing necessity or sufficiency for some subset of Y, and often not the same subset for both.
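A toy illustration of the subset problem, written in Python. It assumes per-prompt pass/fail outcomes on the same set of prompts that exhibit Y. Ablation is treated as the necessity test and patching into a corrupted run as the sufficiency test, which is one common operationalization assumed here rather than taken from the text. The data and the `subset_overlap` helper are hypothetical.

```python
# Hypothetical per-prompt outcomes on 10 prompts that all exhibit behavior Y.
# necessary[i]:  ablating X removed Y on prompt i (assumed necessity test)
# sufficient[i]: patching X into a corrupted run restored Y on prompt i
#                (assumed sufficiency test)
necessary = [True] * 6 + [False] * 4
sufficient = [False] * 4 + [True] * 6


def subset_overlap(necessary, sufficient):
    """Report how often each test passes, and how often both pass on the same prompt."""
    n = len(necessary)
    nec = {i for i, ok in enumerate(necessary) if ok}
    suf = {i for i, ok in enumerate(sufficient) if ok}
    both = nec & suf
    either = nec | suf
    return {
        "necessity_rate": len(nec) / n,
        "sufficiency_rate": len(suf) / n,
        "both_rate": len(both) / n,
        # Jaccard overlap between the two "passing" subsets of prompts.
        "jaccard": len(both) / len(either) if either else 0.0,
    }


print(subset_overlap(necessary, sufficient))
# prints {'necessity_rate': 0.6, 'sufficiency_rate': 0.6, 'both_rate': 0.2, 'jaccard': 0.2}
```

Each test passes on 60% of prompts, which might get summarized as "X is necessary and sufficient for Y". Yet the two properties hold together on only 20% of the prompts, because they were established on largely different subsets of Y.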
How can we clean this all up?
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-necessity-sufficiency-and-2026,
author = {Holtzman, Ari},
title = {Necessity, Sufficiency, and Selectivity},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/lFLYN3AqEPxjiygIyvt9}
}