M3CoL: Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Raja Kumar*1, Raghav Singhal*1, Pranamya Kulkarni1, Deval Mehta2, Kshitij Jadhav1
1 Indian Institute of Technology Bombay, India
2 AIM for Health Lab, Department of Data Science & AI, Monash University, Australia

* Indicates Equal Contribution
A version of our paper has been accepted at the NeurIPS 2024 Workshop on Unifying Representations in Neural Models (UniReps).

Abstract

Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities, thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.

M3CoL Captures Shared Relations

M3CoL captures shared relations
Comparison of the traditional contrastive loss and our proposed M3Co loss. Mi(1) and Mi(2) denote the representations of the i-th sample from modalities 1 and 2, respectively. The traditional contrastive loss (left panel) aligns corresponding sample representations across modalities. M3Co (right panel) mixes the i-th and j-th samples from modality 1 and enforces the representation of this mixture to align with the representations of the corresponding i-th and j-th samples from modality 2, and vice versa. For the text modality we mix the text embeddings, while for the other modalities we mix the raw inputs. Similarity (Sim) denotes the type of alignment enforced between the embeddings for all modalities.
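To make the idea concrete, below is a minimal PyTorch-style sketch of one direction of a mixup contrastive loss of this form: a mixture of samples i and j from modality 1 is pulled toward both the i-th and j-th samples of modality 2, weighted by the mixing coefficient. The function and variable names (m3co_loss, encoder1, encoder2), the temperature, and the Beta parameters are illustrative assumptions, not the authors' implementation; the symmetric term that mixes modality 2 and aligns it with modality 1 would be added analogously.

    import torch
    import torch.nn.functional as F

    def m3co_loss(z_mix, z_other, perm, lam, temperature=0.1):
        """Sketch of a mixup-based contrastive term (one direction)."""
        z_mix = F.normalize(z_mix, dim=1)          # embeddings of mixed modality-1 inputs, (B, D)
        z_other = F.normalize(z_other, dim=1)      # embeddings of modality-2 inputs, (B, D)
        logits = z_mix @ z_other.t() / temperature # (B, B) cosine-similarity logits
        targets_i = torch.arange(z_mix.size(0), device=z_mix.device)
        # The mixture of samples i and j should align with both the i-th and the
        # j-th sample of the other modality, weighted by the mixing coefficient.
        return lam * F.cross_entropy(logits, targets_i) + \
               (1 - lam) * F.cross_entropy(logits, perm)

    # Usage (illustrative): mix raw inputs of modality 1, encode, then contrast.
    # lam = torch.distributions.Beta(2.0, 2.0).sample().item()
    # perm = torch.randperm(x1.size(0), device=x1.device)
    # x1_mix = lam * x1 + (1 - lam) * x1[perm]
    # loss = m3co_loss(encoder1(x1_mix), encoder2(x2), perm, lam)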

Pipeline for Multimodal Classification

M3CoL pipeline
Architecture of our proposed M3CoL model. Samples from modality 1 (xi(1), xj(1)) and modality 2 (xi(2), xk(2)), along with their respective mixed inputs x̃i,j(1) and x̃i,k(2), are fed into encoders f(1) and f(2) to generate embeddings. The unimodal embeddings pi(1) and pi(2) are passed through classifiers 1 and 2 to produce predictions ŷi(1) and ŷi(2), used for auxiliary supervision during training only. The same unimodal embeddings pi(1) and pi(2) are concatenated and passed through classifier 3 to yield ŷfinal, which is used during both training and inference. Additionally, the unimodal embeddings pi(1), pj(1), pi(2), pk(2) and the mixed embeddings p̃i,j(1) and p̃i,k(2) are used by our contrastive loss LM3Co for shared alignment.
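Below is a minimal PyTorch-style sketch of this training setup, assuming simple linear classifier heads over the unimodal and concatenated embeddings; the class name, encoder interfaces, and head dimensions are illustrative assumptions rather than the authors' exact implementation.

    import torch
    import torch.nn as nn

    class M3CoLClassifier(nn.Module):
        """Illustrative two-modality classifier with auxiliary unimodal heads."""
        def __init__(self, enc1, enc2, dim1, dim2, num_classes):
            super().__init__()
            self.enc1, self.enc2 = enc1, enc2
            self.cls1 = nn.Linear(dim1, num_classes)               # classifier 1 (training-only supervision)
            self.cls2 = nn.Linear(dim2, num_classes)               # classifier 2 (training-only supervision)
            self.cls_fused = nn.Linear(dim1 + dim2, num_classes)   # classifier 3 (training and inference)

        def forward(self, x1, x2):
            p1, p2 = self.enc1(x1), self.enc2(x2)                  # unimodal embeddings
            y1, y2 = self.cls1(p1), self.cls2(p2)                  # auxiliary unimodal predictions
            y_final = self.cls_fused(torch.cat([p1, p2], dim=1))   # fused prediction
            return y1, y2, y_final, p1, p2

During training, the overall objective would combine classification losses on ŷi(1), ŷi(2), and ŷfinal with the contrastive term (e.g., L = CE(ŷfinal) + CE(ŷ(1)) + CE(ŷ(2)) + λ·LM3Co, where the exact weighting is a design choice); at inference only ŷfinal is used.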

Comparisons with Baselines

Comparison of M3CoL against various baselines on N24News, ROSMAP, BRCA, and Food-101 datasets.

Visualization of Attention Heatmaps

Text-guided visual grounding with varying input prompts.


BibTeX

@misc{kumar2024harnessingsharedrelationsmultimodal,
  title={Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification},
  author={Raja Kumar and Raghav Singhal and Pranamya Kulkarni and Deval Mehta and Kshitij Jadhav},
  year={2024},
  eprint={2409.17777},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.17777},
}