Vision Transformer


Updated 2022-11-28

TAOTF: A Two-stage Approximately Orthogonal Training Framework in Deep Neural Networks

Authors:Taoyong Cui, Jianze Li, Yuhan Dong, Li Liu

Orthogonality constraints, both hard and soft, have been used to normalize the weight matrices of Deep Neural Network (DNN) models, especially Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), to reduce parameter redundancy and improve training stability. However, the robustness of these constrained models to noisy data is not always satisfactory. In this work, we propose a novel two-stage approximately orthogonal training framework (TAOTF) that finds a trade-off between the orthogonal solution space and the main-task solution space to address this problem in noisy-data scenarios. In the first stage, we propose a novel algorithm called polar decomposition-based orthogonal initialization (PDOI) to find a good initialization for the orthogonal optimization. In the second stage, unlike existing methods, we apply soft orthogonal constraints to all layers of the DNN model. We evaluate the proposed model-agnostic framework on both natural and medical image datasets, and the results show that our method achieves stable and superior performance compared to existing methods.
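The abstract describes two ingredients: an orthogonal initialization derived from the polar decomposition (PDOI) and a soft orthogonality penalty applied to every layer. The PyTorch sketch below illustrates both ideas under their standard formulations (nearest orthogonal matrix via SVD, Frobenius-norm penalty on W Wᵀ − I); the function names and the penalty weight `lam` are illustrative assumptions, not the authors' code.

```python
# Minimal sketch, assuming the standard polar-factor initialization and a
# Frobenius-norm soft orthogonality penalty; not the authors' implementation.
import torch
import torch.nn as nn

def polar_orthogonal_init_(weight: torch.Tensor) -> None:
    """Replace a 2-D (or flattened conv) weight with the orthogonal factor of its polar decomposition."""
    with torch.no_grad():
        w = weight.view(weight.shape[0], -1)                 # flatten conv kernels to 2-D
        u, _, vh = torch.linalg.svd(w, full_matrices=False)
        weight.copy_((u @ vh).view_as(weight))               # U @ Vh is the nearest orthogonal matrix

def soft_orthogonality_penalty(model: nn.Module, lam: float = 1e-4):
    """Soft constraint: lam * sum of ||W W^T - I||_F^2 over all linear and conv weights."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            w = m.weight.view(m.weight.shape[0], -1)
            gram = w @ w.t()
            eye = torch.eye(gram.shape[0], device=w.device)
            penalty = penalty + ((gram - eye) ** 2).sum()
    return lam * penalty

# Usage during training (sketch): initialize the weights first, then add the
# penalty to the main task loss at every step, e.g.
#   loss = criterion(model(x), y) + soft_orthogonality_penalty(model)
```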
PDF


TranViT: An Integrated Vision Transformer Framework for Discrete Transit Travel Time Range Prediction

Authors:Awad Abdelhalim, Jinhua Zhao

Accurate travel time estimation is paramount for providing transit users with reliable schedules and dependable real-time information. This paper proposes and evaluates a novel end-to-end framework for transit and roadside image data acquisition, labeling, and model training to predict transit travel times across a segment of interest. General Transit Feed Specification (GTFS) real-time data is used as an activation mechanism for a roadside camera unit monitoring a segment of Massachusetts Avenue in Cambridge, MA. Ground-truth labels are generated for the acquired images based on the observed travel time percentiles across the monitored segment obtained from Automated Vehicle Location (AVL) data. The generated labeled image dataset is then used to train and evaluate a Vision Transformer (ViT) model to predict a discrete transit travel time range (band). The results of this exploratory study illustrate that the ViT model is able to learn image features and contents that best help it deduce the expected travel time range, with an average validation accuracy between 80% and 85%. We also demonstrate how this discrete travel time band prediction can subsequently be utilized to improve continuous transit travel time estimation. The workflow and results presented in this study provide an end-to-end, scalable, automated, and highly efficient approach for integrating traditional transit data sources and roadside imagery to improve the estimation of transit travel duration. This work also demonstrates the value of incorporating real-time information from computer-vision sources, which are becoming increasingly accessible and can have major implications for improving operations and passenger real-time information.
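The label-generation step described above, where observed AVL travel times are binned into discrete ranges and each roadside image is tagged with its band, can be sketched as follows. The percentile cut points, the four-band split, and the function names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of percentile-based travel time banding, assuming a four-band
# split at the 25th/50th/75th percentiles; not the paper's exact setup.
import numpy as np

def make_travel_time_bands(avl_travel_times, percentiles=(25, 50, 75)):
    """Return percentile cut points that split travel times into len(percentiles)+1 bands."""
    return np.percentile(np.asarray(avl_travel_times, dtype=float), percentiles)

def label_image(observed_travel_time, cut_points):
    """Map a travel time (seconds) to a discrete band index usable as a ViT class label."""
    return int(np.searchsorted(cut_points, observed_travel_time, side="right"))

# Example: historical AVL observations (seconds) -> cut points -> class label for one image
history = [95, 110, 120, 130, 150, 170, 200, 240]
cuts = make_travel_time_bands(history)   # array([117.5, 140., 177.5])
print(label_image(160, cuts))            # -> 2 (third band)
```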
PDF Revised typographical errors, added more details to results and discussions


Data Augmentation Vision Transformer for Fine-grained Image Classification

Authors:Chao Hu, Liqiang Zhu, Weibin Qiu, Weijie Wu

Recently, the vision transformer (ViT) has made breakthroughs in image recognition. Its multi-head self-attention mechanism (MSA) can extract discriminative labeling information from different pixel blocks to improve image classification accuracy. However, the classification tokens in the deep layers tend to ignore local features between layers. In addition, the embedding layer splits the input into fixed-size pixel blocks, which inevitably introduces additional image noise into the network. To this end, we study a data augmentation vision transformer (DAVT) and propose an attention-cropping augmentation method, which uses attention weights as a guide to crop images and improves the network's ability to learn critical features. Second, we propose a hierarchical attention selection (HAS) method, which improves the learning of discriminative tokens between levels by filtering and fusing labels across levels. Experimental results show that the accuracy of this method on two general datasets, CUB-200-2011 and Stanford Dogs, is better than that of existing mainstream methods, with accuracy 1.4% and 1.6% higher than the original ViT, respectively.
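The attention-cropping augmentation described above can be sketched as follows: the attention that the class token assigns to each patch is reshaped into a 2-D map, thresholded, and the image is cropped to the bounding box of the high-attention patches. Tensor shapes, the `keep_ratio` threshold, and the function name are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of attention-guided cropping, assuming class-token-to-patch
# attention is already available; not the authors' implementation.
import torch

def attention_crop(image, cls_attn, patch_size=16, keep_ratio=0.5):
    """image: (C, H, W); cls_attn: (num_patches,) attention of the class token to each patch."""
    h_p = image.shape[1] // patch_size
    w_p = image.shape[2] // patch_size
    attn_map = cls_attn.reshape(h_p, w_p)
    mask = attn_map >= attn_map.flatten().quantile(1 - keep_ratio)   # keep the top patches
    ys, xs = torch.nonzero(mask, as_tuple=True)
    top, bottom = ys.min().item() * patch_size, (ys.max().item() + 1) * patch_size
    left, right = xs.min().item() * patch_size, (xs.max().item() + 1) * patch_size
    return image[:, top:bottom, left:right]                          # crop guided by attention
```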
PDF IEEE Signal Processing Letters


Updated 2022-11-28

MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance

Authors:Kirill Vishniakov, Eric Xing, Zhiqiang Shen

Recent advances in self-supervised learning integrate masked modeling and Siamese networks into a single framework to fully reap the advantages of both techniques. However, the erase-based masking scheme used in masked image modeling is aligned with the patchifying mechanism of ViT and was not originally designed for Siamese networks built on ConvNets. Existing approaches simply inherit the default loss design from previous Siamese networks and ignore the information loss caused by the masking operation. In this paper, we propose a filling-based masking strategy called MixMask to prevent the information loss caused by the randomly erased regions of an image in the vanilla masking method. We further introduce a flexible loss function that takes into account the change in semantic distance between the two mixed views, adapting the integrated architecture and avoiding mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). The flexible loss distance is calculated according to the proposed mix-masking scheme. Extensive experiments are conducted on CIFAR-100, Tiny-ImageNet, and ImageNet-1K. The results demonstrate that the proposed framework achieves better accuracy on linear probing, semi-supervised, and supervised fine-tuning, outperforming the state-of-the-art MSCN by a significant margin. We also show its superiority on the downstream tasks of object detection and segmentation. Our source code is available at https://github.com/LightnessOfBeing/MixMask.
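The filling-based masking idea described above can be sketched as follows: rather than erasing masked blocks, the masked regions of one image are filled with the corresponding blocks of a second image, and the resulting mask ratio can feed into a distance-aware loss. The block size, mask ratio, and function name are illustrative assumptions, not the paper's settings (see the repository above for the actual code).

```python
# Minimal sketch of filling-based block masking between two images; block size
# and mask ratio are assumed values, not the paper's configuration.
import torch

def mixmask(img_a, img_b, block=32, mask_ratio=0.5):
    """img_a, img_b: (C, H, W). Returns the mixed view and the binary block mask used."""
    c, h, w = img_a.shape
    gh, gw = h // block, w // block
    mask = (torch.rand(gh, gw) < mask_ratio).float()                     # 1 = take from img_b
    mask = mask.repeat_interleave(block, 0).repeat_interleave(block, 1)  # upsample to (H, W)
    mixed = img_a * (1 - mask) + img_b * mask                            # fill, don't erase
    return mixed, mask

# The realized mask ratio (mask.mean()) can then be used to weight the loss
# between the two views, reflecting the semantic-distance change described above.
```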
PDF Technical report. Code is available at https://github.com/LightnessOfBeing/MixMask

