Vision Transformer


Updated 2022-07-12

Vision Transformer for Contrastive Clustering

Authors: Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Besides ViT, contrastive learning has recently become another popular research topic. While previous contrastive learning works are mostly based on CNNs, some recent studies have attempted to combine ViT and contrastive learning for enhanced self-supervised learning. Despite considerable progress, these combinations of ViT and contrastive learning mostly focus on instance-level contrastiveness; they often overlook global contrastiveness and lack the ability to directly learn the clustering result (e.g., for images). In view of this, this paper presents a novel deep clustering approach termed Vision Transformer for Contrastive Clustering (VTCC), which, for the first time to our knowledge, unifies the Transformer and contrastive learning for the image clustering task. Specifically, with two random augmentations performed on each image, we utilize a ViT encoder with two weight-sharing views as the backbone. To remedy the potential instability of the ViT, we incorporate a convolutional stem, which uses multiple stacked small convolutions instead of a single large convolution in the patch projection layer, to split each augmented sample into a sequence of patches. After the backbone learns feature representations for the sequences of patches, an instance projector and a cluster projector are further utilized to perform instance-level contrastive learning and global clustering structure learning, respectively. Experiments on eight image datasets demonstrate the stability (during training from scratch) and the superiority (in clustering performance) of our VTCC approach over the state-of-the-art.
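The abstract describes a concrete pipeline: a convolutional stem that patchifies each augmented view with stacked small convolutions, a weight-sharing ViT backbone, and two heads (an instance projector and a cluster projector). Below is a minimal PyTorch sketch of that pipeline; the layer sizes, the four-stage stem, and the use of nn.TransformerEncoder are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a VTCC-style model (assumed layer sizes, not the official code).
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Patchify with several stacked 3x3 convolutions instead of one large conv."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        chs = [in_ch, 32, 64, 128, dim]
        layers = []
        for i in range(4):  # four stride-2 convs: 224 -> 14, i.e. 14x14 = 196 patch tokens
            layers += [nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                       nn.BatchNorm2d(chs[i + 1]),
                       nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                      # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, dim)

class VTCC(nn.Module):
    def __init__(self, dim=256, num_clusters=10):
        super().__init__()
        self.stem = ConvStem(dim=dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               dim_feedforward=4 * dim,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # instance projector: features for instance-level contrastive learning
        self.instance_proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                           nn.Linear(dim, 128))
        # cluster projector: soft cluster assignments for global structure learning
        self.cluster_proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                          nn.Linear(dim, num_clusters),
                                          nn.Softmax(dim=1))

    def forward(self, x):
        h = self.encoder(self.stem(x)).mean(dim=1)   # mean-pool the patch tokens
        return self.instance_proj(h), self.cluster_proj(h)

# Two random augmentations of the same batch go through the shared backbone.
model = VTCC()
view_a, view_b = torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224)
(z_a, c_a), (z_b, c_b) = model(view_a), model(view_b)
print(z_a.shape, c_a.shape)   # torch.Size([4, 128]) torch.Size([4, 10])
```

In this sketch, z_a/z_b would feed an instance-level contrastive loss and c_a/c_b a cluster-level loss; the exact loss formulations are not spelled out in the abstract.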
PDF

Click here to view the paper screenshots

Visual Attention Network

Authors: Meng-Hao Guo, Cheng-Ze Lu, Zheng-Ning Liu, Ming-Ming Cheng, Shi-Min Hu

While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable the self-adaptive and long-range correlations of self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN surpasses similarly sized vision transformers (ViTs) and convolutional neural networks (CNNs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets a new state of the art (58.2 PQ) for panoptic segmentation. In addition, VAN-B2 surpasses Swin-T by 4% mIoU (50.1 vs. 46.1) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8 vs. 46.2) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. Code is available at https://github.com/Visual-Attention-Network.
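As a rough illustration of the LKA idea, a large-kernel convolution is decomposed into a depthwise convolution, a depthwise dilated convolution, and a pointwise convolution, and the resulting map gates the input element-wise, giving long-range context at linear cost. The sketch below follows that description; the specific kernel sizes and dilation are assumptions based on common descriptions of VAN, not the official code.

```python
# Minimal sketch of Large Kernel Attention (LKA); kernel sizes are assumed.
import torch
import torch.nn as nn

class LKA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # 5x5 depthwise conv captures local context
        self.dw_conv = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        # 7x7 depthwise conv with dilation 3 captures long-range context
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        # 1x1 conv mixes channels, giving channel adaptability
        self.pw_conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return attn * x   # the attention map gates the input element-wise

x = torch.randn(2, 64, 56, 56)
print(LKA(64)(x).shape)   # torch.Size([2, 64, 56, 56])
```

Because everything is convolutional, the cost grows linearly with the number of pixels, unlike the quadratic cost of standard self-attention, while the element-wise gating still adapts per spatial location and per channel.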
PDF Code is available at https://github.com/Visual-Attention-Network

Click here to view the paper screenshots

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under the CC BY 4.0 license. Please credit 木子已 as the source when reposting!