Vision Transformer


Updated 2022-11-18

TrojViT: Trojan Insertion in Vision Transformers

Authors:Mengxin Zheng, Qian Lou, Lei Jiang

Vision Transformers (ViTs) have demonstrated state-of-the-art performance in various vision-related tasks. The success of ViTs motivates adversaries to perform backdoor attacks on ViTs. Although the vulnerability of traditional CNNs to backdoor attacks is well-known, backdoor attacks on ViTs are seldom studied. Compared to CNNs, which capture pixel-wise local features by convolutions, ViTs extract global context information through patches and attention. Naïvely transplanting CNN-specific backdoor attacks to ViTs yields only a low clean data accuracy and a low attack success rate. In this paper, we propose a stealthy and practical ViT-specific backdoor attack, TrojViT. Rather than the area-wise trigger used by CNN-specific backdoor attacks, TrojViT generates a patch-wise trigger designed to build a Trojan, composed of some vulnerable bits of the ViT parameters stored in DRAM memory, through patch salience ranking and an attention-target loss. TrojViT further uses a minimum-tuned parameter update to reduce the number of bits in the Trojan. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the ViT model still produces normal inference accuracy on benign inputs. But when the attacker embeds a trigger into an input, the ViT model is forced to classify the input to a predefined target class. We show that flipping only a few vulnerable bits identified by TrojViT on a ViT model using the well-known RowHammer attack can transform the model into a backdoored one. We perform extensive experiments on multiple datasets with various ViT models. TrojViT can classify 99.64% of test images to a target class by flipping 345 bits on a ViT for ImageNet.
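The patch salience ranking at the core of TrojViT boils down to scoring how strongly each input patch influences the target-class prediction and placing the trigger on the highest-scoring patches. The sketch below is a minimal, hypothetical illustration of that ranking step only, not the authors' code: `vit`, `image`, and `target_class` are placeholder names, saliency is approximated by per-patch gradient magnitude, and the attention-target loss, trigger optimization, and vulnerable-bit selection are omitted.

```python
import torch
import torch.nn.functional as F

def rank_patch_saliency(vit, image, target_class, patch_size=16, top_k=9):
    """Rank image patches by gradient saliency toward a target class.

    Simplified stand-in for patch salience ranking: patches whose pixels most
    strongly affect the target-class logit are candidates for the patch-wise trigger.
    """
    image = image.detach().clone().requires_grad_(True)   # (1, 3, H, W)
    logits = vit(image)                                    # (1, num_classes)
    logits[0, target_class].backward()

    pixel_saliency = image.grad.abs().sum(dim=1, keepdim=True)         # (1, 1, H, W)
    patch_scores = F.avg_pool2d(pixel_saliency, patch_size).flatten()  # one score per patch
    return patch_scores.topk(top_k).indices                            # most salient patch indices
```

With a pretrained ViT classifier, `rank_patch_saliency(model, x, target_class=0)` returns the patch positions where a trigger would be most influential; TrojViT then optimizes the trigger content with its attention-target loss and identifies the vulnerable parameter bits to flip, which this sketch does not cover.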
PDF 10 pages, 4 figures, 10 tables

Click here to view paper screenshots

Will Large-scale Generative Models Corrupt Future Datasets?

Authors:Ryuichiro Hataya, Han Bao, Hiromi Arai

Recently proposed large-scale text-to-image generative models such as DALL·E 2, Midjourney, and StableDiffusion can generate high-quality and realistic images from users' prompts. Not limited to the research community, ordinary Internet users enjoy these generative models, and consequently a tremendous number of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in the computer vision field owes a lot to images collected from the Internet. These trends lead us to a research question: "will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively?" This paper empirically answers this question by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained on "contaminated" datasets on various tasks, including image classification and image generation. Throughout the experiments, we conclude that generated images negatively affect downstream performance, although the significance depends on the task and the amount of generated images. The generated datasets are available via https://github.com/moskomule/dataset-contamination.
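The contamination experiment essentially mixes generated images into an otherwise real training set at a controlled ratio and retrains the downstream model. Below is a minimal sketch of that mixing step, under the assumption that `real_ds` and `generated_ds` are ordinary PyTorch datasets with compatible labels; this is my own illustration, not the paper's pipeline.

```python
import torch
from torch.utils.data import ConcatDataset, Subset

def make_contaminated_dataset(real_ds, generated_ds, contamination=0.2, seed=0):
    """Replace a fraction of real samples with generated ones, keeping the dataset size fixed."""
    g = torch.Generator().manual_seed(seed)
    n_total = len(real_ds)
    n_fake = int(contamination * n_total)
    assert n_fake <= len(generated_ds), "not enough generated images for this ratio"

    real_idx = torch.randperm(n_total, generator=g)[: n_total - n_fake]
    fake_idx = torch.randperm(len(generated_ds), generator=g)[:n_fake]
    return ConcatDataset([Subset(real_ds, real_idx.tolist()),
                          Subset(generated_ds, fake_idx.tolist())])
```

Sweeping `contamination` and comparing downstream accuracy (or generation quality) is the kind of controlled comparison the paper runs at ImageNet and COCO scale.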
PDF

Click here to view paper screenshots

ShadowDiffusion: Diffusion-based Shadow Removal using Classifier-driven Attention and Structure Preservation

Authors:Yeying Jin, Wenhan Yang, Wei Ye, Yuan Yuan, Robby T. Tan

Shadow removal from a single image is challenging, particularly in the presence of soft and self shadows. Unlike hard shadows, soft shadows do not show any clear boundaries, while self shadows are shadows cast by an object onto itself. Most existing methods require the detection/annotation of binary shadow masks and do not take into account the ambiguous boundaries of soft and self shadows. Most deep learning shadow removal methods are GAN-based and require statistical similarity between the shadow and shadow-free domains. In contrast to these methods, in this paper, we present ShadowDiffusion, the first diffusion-based shadow removal method. ShadowDiffusion focuses on single-image shadow removal, even in the presence of soft and self shadows. To guide the diffusion process to recover semantically meaningful structures during the reverse diffusion, we introduce a structure preservation loss, where we extract features from a pre-trained Vision Transformer (DINO-ViT). Moreover, to focus on the recovery of shadow regions, we inject classifier-driven attention into the architecture of the diffusion model. To maintain consistent colors in the regions where the shadows have been removed, we introduce a chromaticity consistency loss. Our ShadowDiffusion outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Our method outperforms the SOTA method by 20% in whole-image RMSE on the SRD dataset.
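The two losses named above are concrete enough to sketch: the structure preservation loss compares deep ViT features of the restored image with those of the input, and the chromaticity consistency loss compares intensity-normalized colors. The snippet below is a hedged approximation; the exact feature layers, distances, and region weighting used in ShadowDiffusion may differ, and `dino_vit` is assumed to return token embeddings of shape (B, N, D).

```python
import torch
import torch.nn.functional as F

def structure_preservation_loss(dino_vit, restored, shadow_input):
    """Penalize deviation between DINO-ViT features of the restored image and
    the shadow-affected input, so reverse diffusion keeps semantic structure."""
    with torch.no_grad():
        ref_tokens = dino_vit(shadow_input)     # (B, N, D), target structure
    out_tokens = dino_vit(restored)             # (B, N, D), predicted structure
    return 1.0 - F.cosine_similarity(out_tokens, ref_tokens, dim=-1).mean()

def chromaticity_consistency_loss(restored, shadow_input, eps=1e-6):
    """Compare intensity-normalized RGB so the penalty is insensitive to the
    brightness change that shadow removal is supposed to introduce."""
    def chroma(x):                              # x: (B, 3, H, W), non-negative
        return x / (x.sum(dim=1, keepdim=True) + eps)
    return F.l1_loss(chroma(restored), chroma(shadow_input))
```

Per the abstract, the chromaticity term targets the regions where shadows were removed; the global version above is a simplification.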
PDF

Click here to view paper screenshots

Flamingo: a Visual Language Model for Few-Shot Learning

Authors:Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan

Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endowing them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and closed-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
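One common way to realize point (i), bridging frozen pretrained vision and language models, is to insert small trainable cross-attention adapters that let language tokens attend to visual features while both backbones stay frozen. The block below is a generic sketch in that spirit, not the exact Flamingo architecture; the dimensions, gating, and placement between language-model layers are assumptions.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Trainable adapter between frozen language-model layers: text tokens
    (queries) attend to visual tokens (keys/values). Gates start at zero so
    training begins from the unmodified language model."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffw = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, T, D), visual_tokens: (B, V, D)
        attended, _ = self.attn(self.norm_attn(text_tokens), visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffw_gate) * self.ffw(self.norm_ffw(x))
        return x
```

Handling arbitrarily interleaved images and text additionally requires deciding which visual tokens each text position may attend to; that bookkeeping is outside this sketch.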
PDF 54 pages. In Proceedings of Neural Information Processing Systems (NeurIPS) 2022

Click here to view paper screenshots

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

Authors:Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao

Learning discriminative spatiotemporal representations is the key problem of video understanding. Recently, Vision Transformers (ViTs) have shown their power in learning long-term video dependencies with self-attention. Unfortunately, they exhibit limitations in tackling local video redundancy, due to the blind global comparison among tokens. UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format. However, it requires a tiresome and complicated image-pretraining phase before being finetuned on videos, which blocks its wide usage in practice. In contrast, open-source ViTs are readily available and well pretrained with rich image supervision. Based on these observations, we propose a generic paradigm to build a powerful family of video networks by arming pretrained ViTs with efficient UniFormer designs. We call this family UniFormerV2, since it inherits the concise style of the UniFormer block, but it contains brand-new local and global relation aggregators, which allow for a preferable accuracy-computation balance by seamlessly integrating advantages from both ViTs and UniFormer. Without any bells and whistles, our UniFormerV2 achieves state-of-the-art recognition performance on 8 popular video benchmarks, including the scene-related Kinetics-400/600/700 and Moments in Time, the temporal-related Something-Something V1/V2, and the untrimmed ActivityNet and HACS. In particular, to the best of our knowledge, it is the first model to achieve 90% top-1 accuracy on Kinetics-400. Code will be available at https://github.com/OpenGVLab/UniFormerV2.
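The local and global relation aggregators can be pictured as two complementary token mixers on top of a pretrained image ViT: a cheap convolutional mixer for neighbouring spacetime tokens, and an attention-based mixer that pools information across the whole clip. The sketch below is my own simplification of those two flavors; the actual UniFormerV2 blocks differ in details such as normalization, temporal downsampling, and multi-stage fusion.

```python
import torch
import torch.nn as nn

class LocalRelationAggregator(nn.Module):
    """Local aggregator: a depthwise 3D convolution mixes each token with its
    spatiotemporal neighbours, cheaply handling local video redundancy."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dwconv = nn.Conv3d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return x + self.dwconv(x)              # residual local mixing

class GlobalRelationAggregator(nn.Module):
    """Global aggregator: a learnable query cross-attends to all spatiotemporal
    tokens, producing a clip-level representation."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):                 # tokens: (B, N, C), flattened T*H*W tokens
        q = self.query.expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out.squeeze(1)                  # (B, C) clip-level feature
```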
PDF 24 pages, 4 figures, 20 tables

Click here to view paper screenshots

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Authors:Wele Gedara Chaminda Bandara, Naman Patel, Ali Gholami, Mehdi Nikkhah, Motilal Agrawal, Vishal M. Patel

Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame-based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. Tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from regions with high spatiotemporal information, thereby allowing us to mask 95% of tokens, which results in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach, and we report state-of-the-art top-1 accuracies of 70.0% on SSv2 and 81.7% on Kinetics-400 action classification with a ViT-Base backbone and 800 pre-training epochs.
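The adaptive masking step is the most code-like part of the abstract: an auxiliary network scores every spacetime-patch token, visible tokens are sampled from the resulting categorical distribution, and the sampler is updated with a policy-gradient-style objective that uses reconstruction error as the reward. Below is a minimal sketch under those assumptions; names such as `token_features` and the exact reward definition are stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdaptiveTokenSampler(nn.Module):
    """Auxiliary network that scores spacetime-patch tokens and samples the
    visible set for the MAE encoder."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, token_features, num_visible: int):
        # token_features: (B, N, D) embeddings of all spacetime patches
        logits = self.scorer(token_features).squeeze(-1)       # (B, N)
        probs = logits.softmax(dim=-1)
        visible_idx = torch.multinomial(probs, num_visible)    # sampled without replacement
        log_probs = torch.log(probs.gather(1, visible_idx) + 1e-9)
        return visible_idx, log_probs                          # (B, num_visible) each

def sampler_loss(log_probs, recon_error):
    """REINFORCE-style objective: selections associated with high reconstruction
    error get reinforced; the reward is detached so only the sampler is updated."""
    return -(log_probs * recon_error.detach()).mean()
```

The MAE encoder would consume only the tokens indexed by `visible_idx`, and the decoder's reconstruction error, reduced to one value per sample (shape (B, 1)) so it broadcasts against `log_probs`, would be fed back as `recon_error`; masking 95% of tokens then corresponds to choosing `num_visible` as 5% of N.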
PDF Code available at: https://github.com/wgcban/adamae

Click here to view paper screenshots

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting!