Vision Transformer


Updated 2022-05-12

Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training

Authors: Jing Yang, Junwen Chen, Keiji Yanai

In this paper, we present a cross-modal recipe retrieval framework, Transformer-based Network for Large Batch Training (TNLBT), which is inspired by ACME (Adversarial Cross-Modal Embedding) and H-T (Hierarchical Transformer). TNLBT aims to accomplish retrieval tasks while generating images from recipe embeddings. We apply a Hierarchical Transformer-based recipe text encoder, a Vision Transformer (ViT)-based recipe image encoder, and an adversarial network architecture to enable better cross-modal embedding learning for recipe texts and images. In addition, we use self-supervised learning to exploit the rich information in recipe texts that have no corresponding images. Since contrastive learning can benefit from a larger batch size according to the recent literature on self-supervised learning, we adopt a large batch size during training and validate its effectiveness. In the experiments, the proposed framework significantly outperforms the current state-of-the-art frameworks in both cross-modal recipe retrieval and image generation tasks on the Recipe1M benchmark. This is the first work to confirm the effectiveness of large batch training on cross-modal recipe embeddings.
PDF: 13 pages, 8 figures
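
To make the large-batch contrastive idea from the abstract concrete, below is a minimal sketch (not the authors' code): an image branch and a recipe-text branch projected into a shared embedding space and trained with a symmetric InfoNCE-style loss, where every other sample in the batch serves as a negative, so a larger batch directly supplies more negatives. The module names, feature dimensions, and projection heads are illustrative assumptions standing in for the ViT image encoder and Hierarchical Transformer text encoder described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalEmbedder(nn.Module):
    """Projects pre-extracted image and recipe-text features into a shared space.

    In TNLBT the image branch is a ViT and the text branch a Hierarchical
    Transformer; here simple linear heads stand in as placeholders.
    """
    def __init__(self, img_dim=768, txt_dim=512, embed_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalize so the dot product equals cosine similarity.
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return img, txt

def info_nce(img, txt, temperature=0.07):
    # (B, B) similarity matrix: diagonal entries are matching image-recipe
    # pairs, off-diagonal entries act as in-batch negatives, so the number
    # of negatives grows with the batch size.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric image-to-recipe and recipe-to-image retrieval loss.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with a (deliberately small) batch of random features.
model = CrossModalEmbedder()
img_feat, txt_feat = torch.randn(32, 768), torch.randn(32, 512)
loss = info_nce(*model(img_feat, txt_feat))
```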

Paper screenshots
