Vision Transformer


Updated 2022-09-09

Global Context Vision Transformers

Authors: Ali Hatamizadeh, Hongxu Yin, Jan Kautz, Pavlo Molchanov

We propose the Global Context Vision Transformer (GC ViT), a novel architecture that enhances parameter and compute utilization. Our method leverages global context self-attention modules, jointly with local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs by using modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K classification dataset, the tiny, small and base variants of GC ViT with 28M, 51M and 90M parameters achieve 83.3%, 83.9% and 84.5% Top-1 accuracy, respectively, surpassing comparably-sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer by a large margin. Pre-trained GC ViT backbones consistently outperform prior work on the downstream tasks of object detection, instance segmentation, and semantic segmentation using the MS COCO and ADE20K datasets, sometimes by large margins. Code is available at https://github.com/NVlabs/GCViT.
PDF Tech. Report

Click here to view paper screenshots
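As a rough illustration of the local-plus-global attention idea described in the abstract, the sketch below pairs standard windowed self-attention with a global-query path whose queries are pooled from the whole feature map. The module names, shapes, and pooling scheme are assumptions made for this sketch, not the authors' implementation; the official code is at https://github.com/NVlabs/GCViT.

```python
# A minimal, simplified sketch of the local + global window attention idea.
# All module names and the pooling scheme are illustrative assumptions.
import torch
import torch.nn as nn


def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into (B*num_windows, ws*ws, C) windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


class LocalWindowAttention(nn.Module):
    """Standard multi-head self-attention inside non-overlapping windows."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, windows):                    # (B*nW, ws*ws, C)
        out, _ = self.attn(windows, windows, windows)
        return out


class GlobalContextAttention(nn.Module):
    """Global query tokens, pooled from the whole map, attend to local windows,
    injecting long-range context without shifted windows or attention masks."""
    def __init__(self, dim, heads=4, ws=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(ws)       # one global query per window position
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, windows):                 # x: (B, H, W, C)
        B, H, W, C = x.shape
        q = self.pool(x.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)  # (B, ws*ws, C)
        nW = windows.shape[0] // B
        q = q.repeat_interleave(nW, dim=0)         # share global queries across windows
        out, _ = self.attn(q, windows, windows)    # queries are global, keys/values local
        return out


# Tiny smoke test: a 28x28, 96-channel feature map split into 7x7 windows.
x = torch.randn(2, 28, 28, 96)
w = window_partition(x, 7)
print(LocalWindowAttention(96)(w).shape, GlobalContextAttention(96)(x, w).shape)
```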

Construction material classification on imbalanced datasets using Vision Transformer (ViT) architecture

Authors: Maryam Soleymani, Mahdi Bonyani, Hadi Mahami, Farnad Nasirzadeh

This research proposes a reliable model for identifying different construction materials with high accuracy, which can be exploited as an advantageous tool for a wide range of construction applications such as automated progress monitoring. In this study, a novel deep learning architecture called Vision Transformer (ViT) is used for detecting and classifying construction materials. The robustness of the employed method is assessed using different image datasets. For this purpose, the model is trained and tested on two large imbalanced datasets, namely the Construction Material Library (CML) and the Building Material Dataset (BMD). A third dataset is also generated by combining CML and BMD to create a more imbalanced dataset and assess the capabilities of the method. The achieved results reveal a score of 100 percent on evaluation metrics such as accuracy, precision, recall, and F1-score for each material category across the three datasets. It is believed that the suggested model provides a robust tool for detecting and classifying different material types. To date, a number of studies have attempted to automatically classify a variety of building materials, but they still exhibit some errors. This research addresses this shortcoming and proposes a model that detects the material type with higher accuracy. The employed model is also capable of being generalized to different datasets.
PDF 18 pages, 11 figures, 7 tables

Click here to view paper screenshots
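The abstract stays at a high level, so here is a minimal, hypothetical sketch of how a ViT classifier could be fine-tuned on an imbalanced material dataset in PyTorch, using a weighted sampler to counter class imbalance. The dataset path, class layout, and hyperparameters are placeholders, not values from the paper.

```python
# A hedged sketch of fine-tuning a ViT classifier on an imbalanced image dataset.
# The directory "data/materials/train" and all hyperparameters are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, WeightedRandomSampler
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
# Hypothetical ImageFolder layout: one sub-directory per material class.
train_set = datasets.ImageFolder("data/materials/train", transform=tfm)

# Inverse-frequency sample weights so minority material classes are oversampled,
# one common way of handling the class imbalance the abstract mentions.
counts = torch.bincount(torch.tensor(train_set.targets))
sample_w = (1.0 / counts.float())[train_set.targets]
sampler = WeightedRandomSampler(sample_w, num_samples=len(train_set))
loader = DataLoader(train_set, batch_size=32, sampler=sampler)

# Pre-trained ViT-B/16 with its classification head replaced for the material classes.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, len(train_set.classes))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
for images, labels in loader:                # one illustrative epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

A class-weighted `nn.CrossEntropyLoss` would be an alternative to the weighted sampler for handling the imbalance.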

Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students

Authors: Xu Zheng, Yunhao Luo, Hao Wang, Chong Fu, Lin Wang

The popular methods for semi-supervised semantic segmentation mostly adopt a unitary network model based on convolutional neural networks (CNNs) and enforce consistency of the model predictions over small perturbations applied to the inputs or the model. However, such a learning paradigm suffers from: a) the limited learning capability of the CNN-based model; b) limited capacity to learn discriminative features from the unlabeled data; and c) limited learning of both global and local information from the whole image. In this paper, we propose a novel semi-supervised learning approach, called Transformer-CNN Cohort (TCC), that consists of two students, one based on the vision transformer (ViT) and the other based on the CNN. Our method subtly incorporates multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo labeling for the unlabeled data. First, as the inputs of the ViT student are image patches, the extracted feature maps encode crucial class-wise statistics. To this end, we propose class-aware feature consistency distillation (CFCD), which first leverages the outputs of each student as pseudo labels and generates class-aware feature (CF) maps, and then transfers knowledge between the students via the CF maps. Second, as the ViT student has more uniform representations across all layers, we propose consistency-aware cross distillation to transfer knowledge between the pixel-wise predictions of the cohort. We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing semi-supervised methods by a large margin.
PDF

Click here to view paper screenshots
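To make the cohort idea concrete, the snippet below sketches a plain cross pseudo-labeling loss in which a ViT student and a CNN student supervise each other on unlabeled images. This is a simplified stand-in for illustration; the paper's class-aware feature consistency distillation (CFCD) and consistency-aware cross distillation are richer than this.

```python
# A minimal sketch of two segmentation students supervising each other with
# hard pseudo labels on unlabeled data. Any two networks producing
# (B, num_classes, H, W) logits could play the roles of the ViT and CNN students.
import torch
import torch.nn.functional as F


def tcc_unlabeled_loss(vit_logits, cnn_logits):
    """Each student learns from the other's pseudo labels on unlabeled images."""
    vit_pseudo = vit_logits.argmax(dim=1).detach()       # (B, H, W) hard labels
    cnn_pseudo = cnn_logits.argmax(dim=1).detach()
    loss_cnn = F.cross_entropy(cnn_logits, vit_pseudo)   # ViT teaches the CNN
    loss_vit = F.cross_entropy(vit_logits, cnn_pseudo)   # CNN teaches the ViT
    return loss_cnn + loss_vit


# Smoke test with random per-pixel logits for the 21 Pascal VOC classes.
vit_out = torch.randn(2, 21, 64, 64, requires_grad=True)
cnn_out = torch.randn(2, 21, 64, 64, requires_grad=True)
print(tcc_unlabeled_loss(vit_out, cnn_out))
```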

Towards 3D Scene Reconstruction from Locally Scale-Aligned Monocular Video Depth

Authors: Guangkai Xu, Wei Yin, Hao Chen, Chunhua Shen, Kai Cheng, Feng Wu, Feng Zhao

Existing monocular depth estimation methods have achieved excellent robustness in diverse scenes, but they can only retrieve affine-invariant depth, up to an unknown scale and shift. However, in some video-based scenarios such as video depth estimation and 3D scene reconstruction from a video, the unknown scale and shift residing in per-frame predictions may cause depth inconsistency. To solve this problem, we propose a locally weighted linear regression method to recover the scale and shift from very sparse anchor points, which ensures scale consistency along consecutive frames. Extensive experiments show that our method can boost the performance of existing state-of-the-art approaches by up to 50% on several zero-shot benchmarks. Besides, we merge over 6.3 million RGBD images to train strong and robust depth models. Our ResNet50-backbone model even outperforms the state-of-the-art DPT ViT-Large model. Combined with geometry-based reconstruction methods, we formulate a new dense 3D scene reconstruction pipeline, which benefits from both the scale consistency of sparse points and the robustness of monocular methods. By performing simple per-frame prediction over a video, an accurate 3D scene shape can be recovered.
PDF 16 pages

Click here to view paper screenshots
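The core alignment step, recovering a scale and shift from sparse anchor depths with locally weighted linear regression, can be illustrated with a small NumPy sketch. The Gaussian distance weighting and closed-form weighted least-squares solve below are one simple realization chosen for illustration and are not necessarily the authors' exact formulation.

```python
# A hedged sketch: fit a local scale s and shift t so that s * pred + t matches
# sparse metric anchor depths, weighting anchors by distance to a query pixel.
import numpy as np


def local_scale_shift(pred, anchor_xy, anchor_depth, query_xy, sigma=32.0):
    """Fit metric depth ≈ s * pred + t around one query pixel.

    pred:         (H, W) affine-invariant depth prediction
    anchor_xy:    (N, 2) pixel coordinates of the sparse anchors
    anchor_depth: (N,)   metric depths at the anchors
    query_xy:     (2,)   pixel at which the local (s, t) is wanted
    """
    d_pred = pred[anchor_xy[:, 1], anchor_xy[:, 0]]           # predicted depth at anchors
    w = np.exp(-np.sum((anchor_xy - query_xy) ** 2, axis=1) / (2 * sigma ** 2))
    # Weighted least squares for [s, t]: minimize sum_i w_i (s*d_i + t - gt_i)^2
    A = np.stack([d_pred, np.ones_like(d_pred)], axis=1)      # (N, 2) design matrix
    W = np.diag(w)
    s, t = np.linalg.solve(A.T @ W @ A, A.T @ W @ anchor_depth)
    return s, t


# Toy example: a synthetic prediction that is off by scale 2 and shift 0.5.
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 5.0, size=(64, 64))
xy = rng.integers(0, 64, size=(20, 2))
gt = 2.0 * pred[xy[:, 1], xy[:, 0]] + 0.5
print(local_scale_shift(pred, xy, gt, query_xy=np.array([32, 32])))  # ≈ (2.0, 0.5)
```

On this toy example the recovered pair is (2.0, 0.5), matching the synthetic scale and shift, since the anchors are exactly linear in the prediction.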

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting!