International Conference on Machine Learning (ICML 2022), 3rd Women in Machine Learning Un-Workshop
Original abstract:
Contrastive Language-Image Pre-training (CLIP) showed spectacular performance at zero-shot transfer learning. CLIP learns expressive embeddings directly from image-text pairs and thereby leverages a much richer source of supervision than labels alone.
Though CLIP excels at zero-shot transfer learning, it suffers from an "explaining away" problem, that is, it focuses on one or a few features while neglecting other relevant features. We suggest using modern Hopfield networks (MHNs) to amplify the co-occurrences and covariance structures of the original data.
However, MHNs increase the saturation effect of the InfoNCE objective, which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect.
We introduce "Contrastive Leave One Out Boost" (CLOOB), which combines modern Hopfield networks with the InfoLOOB objective. CLOOB overcomes CLIP's problem of explaining away by extracting more covariance structure from the original data.
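The contrast between the two objectives can be sketched in a few lines. The sketch below is an illustrative NumPy implementation, not the paper's code: InfoNCE keeps the positive pair in the normalizing sum, so its per-pair loss is bounded below by zero and saturates once the positive dominates, whereas InfoLOOB "leaves the positive out" of the denominator and keeps providing a gradient signal. The similarity matrix, temperature, and function names are assumptions for illustration.

```python
import numpy as np

def info_nce(sim, tau=0.1):
    # sim: (n, n) image-text similarity matrix; the diagonal holds the
    # matching (positive) pairs. Illustrative sketch, not the paper's code.
    logits = sim / tau
    pos = np.diag(logits)
    # InfoNCE denominator includes the positive pair itself,
    # so each term -log(exp(pos)/sum) is bounded below by 0.
    denom = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(denom - pos))

def info_loob(sim, tau=0.1):
    logits = sim / tau
    pos = np.diag(logits)
    n = sim.shape[0]
    # "Leave one out": drop the positive from the denominator,
    # so the loss does not saturate as the positive grows.
    mask = ~np.eye(n, dtype=bool)
    denom = np.log(np.exp(logits)[mask].reshape(n, n - 1).sum(axis=1))
    return float(np.mean(denom - pos))

# Hypothetical batch with strongly aligned positives: InfoNCE is already
# close to zero (saturated) while InfoLOOB still yields a large signal.
sim = np.full((4, 4), 0.1) + 0.85 * np.eye(4)
print(info_nce(sim), info_loob(sim))
```

With well-aligned pairs the InfoNCE value is pinned near zero, while the InfoLOOB value keeps decreasing, which is the saturation mitigation the abstract refers to.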