2022-10-11 更新
PatchDropout: Economizing Vision Transformers Using Patch Dropout
Authors:Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, Kevin Smith
Vision transformers have demonstrated the potential to outperform CNNs in a variety of vision tasks. But the computational and memory requirements of these models prohibit their use in many applications, especially those that depend on high-resolution images, such as medical image classification. Efforts to train ViTs more efficiently are overly complicated, necessitating architectural changes or intricate training schemes. In this work, we show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches. This simple approach, PatchDropout, reduces FLOPs and memory by at least 50% in standard natural image datasets such as ImageNet, and those savings only increase with image size. On CSAW, a high-resolution medical dataset, we observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance. For practitioners with a fixed computational or memory budget, PatchDropout makes it possible to choose image resolution, hyperparameters, or model size to get the most performance out of their model.
PDF IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
点此查看论文截图
Vision+X: A Survey on Multimodal Learning in the Light of Data
Authors:Ye Zhu, Yu Wu, Nicu Sebe, Yan Yan
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, the multimodal machine learning that incorporates data from various modalities has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the nature of different data modalities. We analyze the commonness and uniqueness of each data format ranging from vision, audio, text and others, and then present the technical development categorized by the combination of Vision+X, where the vision data play a fundamental role in most multimodal learning works. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, or the rhythm correspondence between video dance moves and musical beats. The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system.
PDF Survey paper on multimodal learning, 21 pages
点此查看论文截图
Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs
Authors:Tao Yang, Haokui Zhang, Wenze Hu, Changwen Chen, Xiaoyu Wang
Transformer models have made tremendous progress in various fields in recent years. In the field of computer vision, vision transformers (ViTs) also become strong alternatives to convolutional neural networks (ConvNets), yet they have not been able to replace ConvNets since both have their own merits. For instance, ViTs are good at extracting global features with attention mechanisms while ConvNets are more efficient in modeling local relationships due to their strong inductive bias. A natural idea that arises is to combine the strengths of both ConvNets and ViTs to design new structures. In this paper, we propose a new basic neural network operator named position-aware circular convolution (ParC) and its accelerated version Fast-ParC. The ParC operator can capture global features by using a global kernel and circular convolution while keeping location sensitiveness by employing position embeddings. Our Fast-ParC further reduces the O(n2) time complexity of ParC to O(n log n) using Fast Fourier Transform. This acceleration makes it possible to use global convolution in the early stages of models with large feature maps, yet still maintains the overall computational cost comparable with using 3x3 or 7x7 kernels. The proposed operation can be used in a plug-and-play manner to 1) convert ViTs to pure-ConvNet architecture to enjoy wider hardware support and achieve higher inference speed; 2) replacing traditional convolutions in the deep stage of ConvNets to improve accuracy by enlarging the effective receptive field. Experiment results show that our ParC op can effectively enlarge the receptive field of traditional ConvNets, and adopting the proposed op benefits both ViTs and ConvNet models on all three popular vision tasks, image classification, object
PDF 19 pages, 8 figures, 11 tables. A preliminary version of this paper has been published in ECCV 2022 and it can be find in arXiv:2203.03952