M3CoL: Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Raja Kumar*1, Raghav Singhal*1, Pranamya Kulkarni1, Deval Mehta2, Kshitij Jadhav1
1 Indian Institute of Technology Bombay, India
2 AIM for Health Lab, Department of Data Science & AI, Monash University, Australia

* Indicates Equal Contribution
A version of our paper has been accepted at the NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps).

Abstract

Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.

M3CoL Captures Shared Relations

M3CoL captures shared relations
Comparison of the traditional contrastive loss and our proposed M3Co loss. Mi(1) and Mi(2) denote the representations of the i-th sample from modalities 1 and 2, respectively. The traditional contrastive loss (left panel) aligns corresponding sample representations across modalities. M3Co (right panel) mixes the i-th and j-th samples from modality 1 and enforces the representation of this mixture to align with the representations of the corresponding i-th and j-th samples from modality 2, and vice versa. For the text modality we mix the text embeddings, while for the other modalities we mix the raw inputs. Similarity (Sim) denotes the type of alignment enforced between the embeddings for all modalities.
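To make the idea concrete, below is a minimal PyTorch-style sketch of one direction of a mixup contrastive loss of this form: a mixture of samples i and j from modality 1 is pulled toward both the i-th and j-th samples of modality 2, weighted by the mixing coefficient. The function and variable names (m3co_loss, encoder1, encoder2), the temperature, and the Beta parameters are illustrative assumptions, not the authors' implementation; the symmetric term that mixes modality 2 and aligns it with modality 1 would be added analogously.

    import torch
    import torch.nn.functional as F

    def m3co_loss(z_mix, z_other, perm, lam, temperature=0.1):
        """Sketch of a mixup-based contrastive term (one direction)."""
        z_mix = F.normalize(z_mix, dim=1)          # embeddings of mixed modality-1 inputs, (B, D)
        z_other = F.normalize(z_other, dim=1)      # embeddings of modality-2 inputs, (B, D)
        logits = z_mix @ z_other.t() / temperature # (B, B) cosine-similarity logits
        targets_i = torch.arange(z_mix.size(0), device=z_mix.device)
        # The mixture of samples i and j should align with both the i-th and the
        # j-th sample of the other modality, weighted by the mixing coefficient.
        return lam * F.cross_entropy(logits, targets_i) + \
               (1 - lam) * F.cross_entropy(logits, perm)

    # Usage (illustrative): mix raw inputs of modality 1, encode, then contrast.
    # lam = torch.distributions.Beta(2.0, 2.0).sample().item()
    # perm = torch.randperm(x1.size(0), device=x1.device)
    # x1_mix = lam * x1 + (1 - lam) * x1[perm]
    # loss = m3co_loss(encoder1(x1_mix), encoder2(x2), perm, lam)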

Pipeline for Multimodal Classification

M3CoL pipeline
Architecture of our proposed M3CoL model. Samples from modality 1 (xi(1), xj(1)) and modality 2 (xi(2), xk(2)), along with their respective mixed inputs x̃i,j(1) and x̃i,k(2), are fed into encoders f(1) and f(2) to generate embeddings. The unimodal embeddings pi(1) and pi(2) are passed through classifiers 1 and 2 to produce predictions ŷi(1) and ŷi(2), used for auxiliary supervision during training only. The same unimodal embeddings pi(1) and pi(2) are concatenated and passed through classifier 3 to yield ŷfinal, which is used during both training and inference. Additionally, the unimodal embeddings pi(1), pj(1), pi(2), pk(2) and the mixed embeddings p̃i,j(1) and p̃i,k(2) are used by our contrastive loss LM3Co for shared alignment.
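Below is a minimal PyTorch-style sketch of this training setup, assuming simple linear classifier heads over the unimodal and concatenated embeddings; the class name, encoder interfaces, and head dimensions are illustrative assumptions rather than the authors' exact implementation.

    import torch
    import torch.nn as nn

    class M3CoLClassifier(nn.Module):
        """Illustrative two-modality classifier with auxiliary unimodal heads."""
        def __init__(self, enc1, enc2, dim1, dim2, num_classes):
            super().__init__()
            self.enc1, self.enc2 = enc1, enc2
            self.cls1 = nn.Linear(dim1, num_classes)               # classifier 1 (training-only supervision)
            self.cls2 = nn.Linear(dim2, num_classes)               # classifier 2 (training-only supervision)
            self.cls_fused = nn.Linear(dim1 + dim2, num_classes)   # classifier 3 (training and inference)

        def forward(self, x1, x2):
            p1, p2 = self.enc1(x1), self.enc2(x2)                  # unimodal embeddings
            y1, y2 = self.cls1(p1), self.cls2(p2)                  # auxiliary unimodal predictions
            y_final = self.cls_fused(torch.cat([p1, p2], dim=1))   # fused prediction
            return y1, y2, y_final, p1, p2

During training, the overall objective would combine classification losses on ŷi(1), ŷi(2), and ŷfinal with the contrastive term (e.g., L = CE(ŷfinal) + CE(ŷ(1)) + CE(ŷ(2)) + λ·LM3Co, where the exact weighting is a design choice); at inference only ŷfinal is used.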

Comparisons with Baselines

Comparison of M3CoL against various baselines on N24News, ROSMAP, BRCA, and Food-101 datasets.

Visualization of Attention Heatmaps

Text-guided visual grounding with varying input prompts.


BibTeX

@misc{kumar2024harnessingsharedrelationsmultimodal,
  title={Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification},
  author={Raja Kumar and Raghav Singhal and Pranamya Kulkarni and Deval Mehta and Kshitij Jadhav},
  year={2024},
  eprint={2409.17777},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.17777},
}