
Updated 2022-08-30

Zero-shot Object Detection Through Vision-Language Embedding Alignment

Authors: Johnathan Xie, Shuai Zheng

Recent approaches have shown that training deep neural networks directly on large-scale image-text pair collections enables zero-shot transfer on various recognition tasks. One central issue is how this can be generalized to object detection, which involves the non-semantic task of localization as well as the semantic task of classification. To solve this problem, we introduce a vision-language embedding alignment method that transfers the generalization capabilities of a pretrained model such as CLIP to an object detector like YOLOv5. We formulate a loss function that allows us to align the image and text embeddings from the pretrained CLIP model with the modified semantic prediction head of the detector. With this method, we are able to train an object detector that achieves state-of-the-art performance on the COCO, ILSVRC, and Visual Genome zero-shot detection benchmarks. During inference, our model can be adapted to detect any number of object classes without additional training. We also find that standard object detection scaling transfers well to our method, with consistent improvements across various scales of YOLOv5 models and the YOLOv3 model. Lastly, we develop a self-labeling method that provides a significant score improvement without needing extra images or labels.
Code: https://github.com/Johnathan-Xie/ZSD-YOLO
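
To make the idea in the abstract above concrete, here is a minimal sketch of how a detector whose semantic head outputs embeddings could be trained toward, and then classified against, text embeddings. It is not the authors' implementation: the cosine-based `alignment_loss`, the softmax `temperature`, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration, and a real system would take its text embeddings from a model such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(predicted, target):
    """Hypothetical alignment term: 0 when both embeddings point the same way."""
    return 1.0 - cosine(predicted, target)


def classify_box(box_embedding, text_embeddings, temperature=0.05):
    """Pick the class whose text embedding is most similar to the box embedding.

    Returns the best class name and a softmax distribution over all classes.
    The temperature value is an assumption, not taken from the paper.
    """
    names = list(text_embeddings)
    logits = [cosine(box_embedding, text_embeddings[n]) / temperature for n in names]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    probs = {n: w / total for n, w in zip(names, weights)}
    return max(probs, key=probs.get), probs


# Toy vectors standing in for real text-encoder outputs (assumption).
text_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
label, probs = classify_box([0.9, 0.1, 0.0], text_embeddings)

# Adding a class at inference needs only one more text embedding, no retraining.
text_embeddings["zebra"] = [0.0, 0.0, 1.0]
new_label, _ = classify_box([0.1, 0.0, 0.95], text_embeddings)
```

Because the class set lives entirely in the `text_embeddings` dictionary, extending the detector to a new category in this sketch amounts to adding one entry, which mirrors the abstract's claim that the model can be adapted to any number of classes without additional training.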


A Multi-Modality Ovarian Tumor Ultrasound Image Dataset for Unsupervised Cross-Domain Semantic Segmentation

Authors: Qi Zhao, Shuchang Lyu, Wenpei Bai, Linghan Cai, Binghao Liu, Meijing Wu, Xiubo Sang, Min Yang, Lijiang Chen

Ovarian cancer is one of the most harmful gynecological diseases. Detecting ovarian tumors at an early stage with computer-aided techniques can efficiently decrease the mortality rate. With the improvement of medical treatment standards, ultrasound images are widely applied in clinical treatment. However, recent notable methods mainly focus on single-modality ultrasound ovarian tumor segmentation or recognition, which means that research exploring the representation capability of multi-modality ultrasound ovarian tumor images is still lacking. To solve this problem, we propose a Multi-Modality Ovarian Tumor Ultrasound (MMOTU) image dataset containing 1469 2D ultrasound images and 170 contrast-enhanced ultrasonography (CEUS) images with pixel-wise and global-wise annotations. Based on MMOTU, we mainly focus on the unsupervised cross-domain semantic segmentation task. To solve the domain shift problem, we propose a feature-alignment-based architecture named Dual-Scheme Domain-Selected Network (DS2Net). Specifically, we first design a source-encoder and a target-encoder to extract two-style features of source and target images. Then, we propose the Domain-Distinct Selected Module (DDSM) and the Domain-Universal Selected Module (DUSM) to extract the distinct and universal features in two styles (source-style or target-style). Finally, we fuse these two kinds of features and feed them into the source-decoder and target-decoder to generate final predictions. Extensive comparison experiments and analysis on the MMOTU image dataset show that DS2Net can boost the segmentation performance for bidirectional cross-domain adaptation of 2D ultrasound images and CEUS images. Our proposed dataset and code are all available at https://github.com/cv516Buaa/MMOTU_DS2Net.
Code: https://github.com/cv516Buaa/MMOTU_DS2Net; paper: 13 pages, 10 figures, 10 tables, 15 formulas


Effective Image Tampering Localization via Semantic Segmentation Network

Authors: Haochen Zhu, Gang Cao, Mo Zhao

With the widespread use of powerful image editing tools, image tampering has become easy and realistic. Existing image forensic methods still face challenges of low accuracy and robustness. Noting that tampered regions are typically semantic objects, in this letter we propose an effective image tampering localization scheme based on a deep semantic segmentation network. A ConvNeXt network is used as the encoder to learn better feature representations. The multi-scale features are then fused by a UperNet decoder to achieve better localization capability. A combined loss and effective data augmentation are adopted to ensure effective model training. Extensive experimental results confirm that the localization performance of our proposed scheme outperforms that of other state-of-the-art schemes.
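
The abstract does not say which terms make up its combined loss. As an illustration only, the sketch below pairs binary cross-entropy with a soft Dice term, a pairing often used when the foreground (here, the tampered region) covers a small part of the image. The choice of terms, the `dice_weight` parameter, and the flattened pixel lists are all assumptions, not details from the paper.

```python
import math


def bce_loss(probs, mask, eps=1e-7):
    """Mean binary cross-entropy over pixels; probs are clamped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, mask):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, mask, smooth=1.0):
    """1 minus the soft Dice coefficient; smooth guards against empty masks."""
    intersection = sum(p * y for p, y in zip(probs, mask))
    denom = sum(probs) + sum(mask)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def combined_loss(probs, mask, dice_weight=0.5):
    """Weighted sum of the two terms (the weighting scheme is an assumption)."""
    return (1.0 - dice_weight) * bce_loss(probs, mask) + dice_weight * dice_loss(probs, mask)


# Flattened 4-pixel example: 1 marks a tampered pixel, 0 an authentic one.
mask = [0, 0, 1, 1]
good = combined_loss([0.1, 0.2, 0.9, 0.8], mask)  # prediction close to the mask
bad = combined_loss([0.9, 0.8, 0.1, 0.2], mask)   # prediction inverted
```

In this toy case `good` comes out smaller than `bad`, which is the minimum one would expect of any loss used to train a localization network.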


Depth-Assisted ResiDualGAN for Cross-Domain Aerial Images Semantic Segmentation

Authors: Yang Zhao, Peng Guo, Han Gao, Xiuwan Chen

Unsupervised domain adaptation (UDA) is an approach to minimizing the domain gap. Generative methods are a common way to minimize the domain gap of aerial images, which improves the performance of downstream tasks, e.g., cross-domain semantic segmentation. For aerial images, the digital surface model (DSM) is usually available in both the source domain and the target domain. Depth information in the DSM brings external information to generative models. However, little research utilizes it. In this paper, depth-assisted ResiDualGAN (DRDG) is proposed, in which a depth supervised loss (DSL) and a depth cycle consistency loss (DCCL) are used to bring depth information into the generative model. Experimental results show that DRDG reaches state-of-the-art accuracy among generative methods in cross-domain semantic segmentation tasks.


Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting.