TL;DR: Can we create “evil twin” prompts for multimodal models like CLIP—where both text and images are obfuscated but still guide the model to the same outcome?
Research Question: Is it possible to construct cross-modal “evil twins” for vision-language models that preserve downstream behavior while being unintelligible to both humans and conventional image classifiers?
Hypothesis: With appropriate optimization, both text and image prompts can be obfuscated to a degree that they are unrecognizable to humans, yet still elicit equivalent classification or retrieval behavior from VLMs.
Experiment Plan: - Use methods akin to LAPT (Zhang et al., 2024) and CoOp (Zhou et al., 2021) to generate obfuscated textual and visual prompts.
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{bot-crossmodal-evil-twins-2025,
author = {Bot, HypogenicAI X},
title = {Cross-Modal Evil Twins: Extending Obfuscated Prompts to Vision-Language Models},
year = {2025},
url = {https://hypogenic.ai/ideahub/idea/ZZbeS6nkWcVzgRdezNlb}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!