Published: 2022-10-03

Updated 2022-10-03

ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training

Authors: Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

Recent Vision-Language Pre-trained (VLP) models based on dual encoders have attracted extensive attention from academia and industry due to their superior performance on various cross-modal tasks and their high computational efficiency. They attempt to learn cross-modal representations using contrastive learning on image-text pairs; however, the inter-modal correlations they build rely on only a single view for each modality. In fact, an image or a text contains various potential views, just as humans can capture a real-world scene via diverse descriptions or photos. In this paper, we propose ERNIE-ViL 2.0, a multi-view contrastive learning framework that builds intra-modal and inter-modal correlations between diverse views simultaneously, aiming to learn a more robust cross-modal representation. Specifically, we construct multiple views within each modality to learn intra-modal correlations that enhance the single-modal representation. Besides the inherent visual and textual views, we construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs. Pre-trained on 29M image-text pairs from publicly available datasets, ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval. Additionally, to generalize our method to Chinese cross-modal tasks, we train ERNIE-ViL 2.0 on a scaled-up pre-training dataset of 1.5B Chinese image-text pairs, which yields significant improvements over previous SOTA results on Chinese cross-modal retrieval. We release our pre-trained models at https://github.com/PaddlePaddle/ERNIE.
PDF 14 pages, 6 figures
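
To make the multi-view idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes each view is already encoded as a batch of embedding vectors, and it averages a symmetric InfoNCE loss over every pair of distinct views, so intra-modal pairs (image/image, text/text) and inter-modal pairs (image/text) are treated alike. The view names (`image_a`, `image_b`, `caption`, `object_tags`), the function names, and the temperature of 0.07 are all illustrative assumptions, and plain Python lists stand in for real tensors.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, targets, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match targets[i];
    every other target in the batch acts as a negative."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def multi_view_loss(views, temperature=0.07):
    """Average the symmetric InfoNCE loss over every pair of distinct views.
    `views` maps a (hypothetical) view name to a batch of embeddings."""
    pairs = list(combinations(views, 2))
    loss = 0.0
    for name_a, name_b in pairs:
        loss += info_nce(views[name_a], views[name_b], temperature)
        loss += info_nce(views[name_b], views[name_a], temperature)
    return loss / (2 * len(pairs))


# Toy batch of two samples with 2-D embeddings (made-up numbers).
aligned = {
    "image_a": [[1.0, 0.0], [0.0, 1.0]],
    "image_b": [[0.9, 0.1], [0.1, 0.9]],
    "caption": [[1.0, 0.1], [0.1, 1.0]],
    "object_tags": [[0.8, 0.2], [0.2, 0.8]],
}
# Same batch, but the captions are attached to the wrong samples.
shuffled = dict(aligned, caption=[[0.1, 1.0], [1.0, 0.1]])

print(multi_view_loss(aligned), multi_view_loss(shuffled))
```

In this toy setup the batch with mismatched captions gets a higher loss than the aligned one, which is the signal a contrastive objective of this kind relies on. How the paper weights or selects the view pairs is not stated in the abstract, so the uniform average here is only one possible choice.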

木子已

https://ipaper.today/2022/10/03/2022-10-03-wu-jian-du-ban-jian-du-dui-bi-xue-xi/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting.

Unsupervised / Semi-supervised / Contrastive Learning
