LLMs can't keep secrets that you tell them in the system prompt or user messages: they leak it when they have to write open-ended text. In the long run, we'll want LLMs to be able to compartmentalize. But for now, can we cordon off information in the system prompt so it can only be accessed if absolutely necessary? I propose a potential method: putting an adversarial prompt right after information that we want to make sure doesn't leak, so it only gets attended to with something close to an 'exact match'. This will still be bad at invoking the information in more subtle situations but, frankly, llms are already bad at that.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-adversarial-prompts-as-2026,
author = {Holtzman, Ari},
title = {Adversarial Prompts as Firewalls},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/A15EmBF2vYd2RgpICBkU}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!