Oshri Naparstek
Compression & Representation Mar 15, 2026 · 2 min read

Is the Modality Gap a Bug or a Feature?

Vision-language models like CLIP learn separate embedding regions — we show why, and how to fix it

#CVPR #VLM #CLIP #robustness #embeddings

The Puzzle

Proud to share that our paper “Is the Modality Gap a Bug or a Feature? A Robustness Perspective” (led by Rhea C.) has been accepted to CVPR 2026. Joint work with Udi Barzelay and Yair Weiss.

Vision-language models like CLIP learn a shared embedding space for images and text, yet the two modalities often end up in separate regions of that space (the “modality gap”). The phenomenon has been known for a while, but why it forms and what to do about it have remained unclear.

What We Found

The gap is a natural consequence of the optimization, not a training failure. Contrastive training produces a global gap vector that is orthogonal to both modality subspaces.
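The geometry is easy to picture on synthetic data. The sketch below is my own toy construction, not the paper's code or real CLIP outputs: it plants a shared gap direction, recovers it as the difference of the two modality centroids, and checks that the recovered vector is near-orthogonal to the centered embeddings of each modality.

```python
import numpy as np

# Synthetic stand-ins for CLIP embeddings (not real model outputs):
# both modalities share content off axis 0; the gap lives on axis 0.
rng = np.random.default_rng(0)
d, n = 64, 200
gap_dir = np.zeros(d)
gap_dir[0] = 1.0

content = rng.normal(size=(2 * n, d))
content[:, 0] = 0.0                      # content carries no gap component
content /= np.linalg.norm(content, axis=1, keepdims=True)

img_emb = content[:n] + 0.5 * gap_dir    # images shifted one way...
txt_emb = content[n:] - 0.5 * gap_dir    # ...texts the other way

# Empirical gap vector: difference of the modality centroids.
gap = img_emb.mean(axis=0) - txt_emb.mean(axis=0)

# It recovers the planted direction...
cos_to_planted = gap @ gap_dir / np.linalg.norm(gap)
# ...and is near-orthogonal to the centered embeddings of each modality.
max_overlap = np.abs((img_emb - img_emb.mean(axis=0)) @ gap).max()
print(cos_to_planted, max_overlap)
```

The centroid difference is the simplest estimator of the gap vector; in this toy setup it aligns almost perfectly with the planted direction while barely overlapping the within-modality variation.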

There is a direct, monotonic link between the modality gap and robustness. A larger gap makes the model more likely to flip its prediction under small, semantically meaningless perturbations (Gaussian noise, quantization, even caption rephrasing).
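A toy simulation (again my own construction, not the paper's experiment) makes the mechanism concrete: on the unit sphere, a larger gap component eats up more of the embedding's norm, so the content portion that actually discriminates between classes shrinks, and the same additive noise flips more zero-shot predictions.

```python
import numpy as np

def flip_rate(gap_scale, sigma=0.3, d=32, k=8, trials=2000, seed=0):
    """Fraction of zero-shot predictions flipped by Gaussian noise when
    unit-norm image embeddings carry a gap component of size gap_scale.
    Toy setup: class "text anchors" live off axis 0; the gap is axis 0."""
    rng = np.random.default_rng(seed)
    anchors = rng.normal(size=(k, d))
    anchors[:, 0] = 0.0                              # anchors carry no gap
    anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)
    gap = np.zeros(d)
    gap[0] = gap_scale

    flips = 0
    for _ in range(trials):
        cls = rng.integers(k)
        img = anchors[cls] + gap
        img /= np.linalg.norm(img)                   # unit-norm image embedding
        clean = np.argmax(anchors @ img)
        noisy = np.argmax(anchors @ (img + sigma * rng.normal(size=d)))
        flips += clean != noisy
    return flips / trials

# Flip rate grows as the gap grows, with the noise level held fixed.
print(flip_rate(0.0), flip_rate(1.5), flip_rate(3.0))
```

In this simulation the class margins scale like 1/sqrt(1 + gap²) while the noise stays constant, which is one way to see why the gap-robustness link comes out monotonic.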

Closing the gap does not change clean accuracy, because the gap direction is orthogonal to the embeddings and cross-modal nearest neighbors are preserved.

Practical Takeaway

A simple post-processing algorithm closes the gap: project the gap vector onto the orthogonal complement of the retrieval modality's subspace, then translate one modality toward the other. No retraining, no fine-tuning, just a few lines of linear algebra on top of any existing VLM.
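That recipe can be sketched in a few lines of numpy. The sketch below is a hedged illustration under my own synthetic setup, not the paper's released code: build a basis for the text (retrieval) span via SVD, keep only the part of the gap vector lying in its orthogonal complement, and translate the image embeddings by that part. Because the removed component is orthogonal to every text embedding, image-to-text retrieval scores are unchanged to machine precision while the centroid gap collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100

# Synthetic stand-ins for paired embeddings: captions live exactly in
# coordinates 1..40; images share that content, plus a gap on axis 0
# and some image-only content in the remaining coordinates.
txt = np.zeros((n, d))
txt[:, 1:41] = rng.normal(size=(n, 40))
img = np.zeros((n, d))
img[:, 1:41] = txt[:, 1:41] + 0.2 * rng.normal(size=(n, 40))
img[:, 0] = 1.2                                   # the modality gap
img[:, 41:] = 0.3 * rng.normal(size=(n, d - 41))

gap = img.mean(axis=0) - txt.mean(axis=0)

# Orthonormal basis of the text (retrieval) modality's span.
_, s, vt = np.linalg.svd(txt, full_matrices=False)
basis = vt[s > 1e-10]

# Keep the part of the gap orthogonal to that span; translate images by it.
gap_perp = gap - basis.T @ (basis @ gap)
img_aligned = img - gap_perp

new_gap = img_aligned.mean(axis=0) - txt.mean(axis=0)
print(np.linalg.norm(gap), np.linalg.norm(new_gap))
```

The projection step is what protects retrieval: translating by `gap_perp` shifts every image by the same vector, and that vector has zero dot product with every text embedding, so all image-to-text scores (and hence cross-modal nearest neighbors) are preserved exactly.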

Validated on CLIP (ViT-L/14, ViT-B/16) and SigLIP across CIFAR-10, CIFAR-100, and A-OKVQA: robustness consistently improves as the gap closes while clean accuracy stays flat. The fix also helps under embedding quantization (relevant for RAG pipelines) and caption rephrasing.