Vision Transformer

Published: 2023-05-12

Updated: 2023-05-12

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches the region-level use of positional embeddings in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 $AP_r$ on LVIS, surpassing the best existing approach by +5.8 points, in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation and achieves the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
CVPR 2023
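
The two pretraining changes described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the scale and aspect-ratio ranges, the temperature, the value of `gamma`, and the choice to apply the focal weight `(1 - p)^gamma` to the softmax probability of the matching pair are all assumptions made for illustration, and the paper's exact formulation may differ.

```python
import numpy as np

def bilinear_resize(grid, out_h, out_w):
    """Resize an (H, W, D) grid to (out_h, out_w, D) by bilinear interpolation."""
    in_h, in_w, _ = grid.shape
    ys = np.linspace(0, in_h - 1, out_h)   # sample rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # sample columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def cropped_positional_embedding(pos_emb, rng, min_scale=0.1, max_scale=1.0):
    """Randomly crop a region of an (H, W, D) positional-embedding grid and
    resize it back to (H, W, D). Scale and aspect ranges are assumed values."""
    h, w, _ = pos_emb.shape
    scale = rng.uniform(min_scale, max_scale)               # fraction of the area
    ratio = np.exp(rng.uniform(np.log(0.5), np.log(2.0)))   # width / height
    crop_h = int(np.clip(round(h * np.sqrt(scale / ratio)), 1, h))
    crop_w = int(np.clip(round(w * np.sqrt(scale * ratio)), 1, w))
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    region = pos_emb[top:top + crop_h, left:left + crop_w]
    return bilinear_resize(region, h, w)

def focal_contrastive_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
    """Image-text contrastive loss with a focal weight on the matching pair.
    With gamma=0 this reduces to the usual softmax cross-entropy loss."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B); diagonal entries are matches

    def one_direction(lg):
        lg = lg - lg.max(axis=1, keepdims=True)              # numerical stability
        log_p = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        log_p_true = np.diag(log_p)
        p_true = np.exp(log_p_true)
        # Down-weight pairs that are already matched with high probability.
        return np.mean(-((1 - p_true) ** gamma) * log_p_true)

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (one_direction(logits) + one_direction(logits.T))
```

In such a setup, `cropped_positional_embedding` would be applied to the positional-embedding grid at each pretraining step before it is added to the patch tokens, while detection finetuning would use the whole grid. Setting `gamma=0` in `focal_contrastive_loss` recovers the standard loss, which makes it easy to compare the two.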

木子已

https://ipaper.today/2023/05/12/2023-05-12-vision-transformer/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting.
