2022-10-21 更新
Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval
Authors:Abhra Chaudhuri, Massimiliano Mancini, Yanbei Chen, Zeynep Akata, Anjan Dutta
Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.
PDF British Machine Vision Conference (BMVC) 2022
点此查看论文截图
SimpleClick: Interactive Image Segmentation with Simple Vision Transformers
Authors:Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
Click-based interactive image segmentation aims at extracting objects with limited user clicking. Hierarchical backbone is the de-facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for the downstream task without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose the first plain-backbone method, termed as SimpleClick due to its simplicity in architecture, for interactive segmentation. With the plain backbone pretrained as masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance without bells and whistles. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over previous best result. Extensive evaluation of medical images highlights the generalizability of our method. We also provide a detailed computation analysis for our method, highlighting its availability as a practical annotation tool.
PDF Tech Report
点此查看论文截图
Sequence and Circle: Exploring the Relationship Between Patches
Authors:Zhengyang Yu, Jochen Triesch
The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this learnable PE is really necessary and what its benefits are. This paper explores two alternative ways of encoding the location of individual patches that exploit prior knowledge about their spatial arrangement. One is called the sequence relationship embedding (SRE), and the other is called the circle relationship embedding (CRE). Among them, the SRE considers all patches to be in order, and adjacent patches have the same interval distance. The CRE considers the central patch as the center of the circle and measures the distance of the remaining patches from the center based on the four neighborhoods principle. Multiple concentric circles with different radii combine different patches. Finally, we implemented these two relations on three classic ViTs and tested them on four popular datasets. Experiments show that SRE and CRE can replace PE to reduce the random learnable parameters while achieving the same performance. Combining SRE or CRE with PE gets better performance than only using PE.
PDF 7 pages, 1 figure
点此查看论文截图
BOAT: Bilateral Local Attention Vision Transformer
Authors:Tan Yu, Gangming Zhao, Ping Li, Yizhou Yu
Vision Transformers achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows. Despite the fact that window-based local self-attention significantly boosts efficiency, it fails to capture the relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper, we further exploit the locality of patches in the feature space. We group the patches into multiple clusters using their features, and self-attention is computed within every cluster. Such feature-space local attention effectively captures the connections between patches across different local windows but still relevant. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.
PDF BMVC2022 oral