I2I Translation


Updated 2022-07-06

Vision-and-Language Pretraining

Authors: Thong Nguyen, Cong-Duy Nguyen, Xiaobao Wu, Anh Tuan Luu

With the burgeoning amount of image-text pair data and the diversity of Vision-and-Language (V&L) tasks, scholars have introduced an abundance of deep learning models in this research domain. Furthermore, in recent years, transfer learning has also shown tremendous success in Computer Vision for tasks such as Image Classification and Object Detection, and in Natural Language Processing for Question Answering, Machine Translation, etc. Inheriting the spirit of transfer learning, research works in V&L have devised multiple pretraining techniques on large-scale datasets in order to enhance the performance of downstream tasks. The aim of this article is to provide a comprehensive review of contemporary V&L pretraining models. In particular, we categorize and delineate pretraining approaches, along with a summary of state-of-the-art vision-and-language pre-trained models. Moreover, a list of training datasets and downstream tasks is supplied to further round out the perspective on V&L pretraining. Lastly, we discuss numerous directions for future research.
PDF 35 pages, 3 figures

Click here to view paper screenshots

A case for using rotation invariant features in state of the art feature matchers

Authors: Georg Bökman, Fredrik Kahl

The aim of this paper is to demonstrate that a state of the art feature matcher (LoFTR) can be made more robust to rotations by simply replacing the backbone CNN with a steerable CNN which is equivariant to translations and image rotations. It is experimentally shown that this boost is obtained without reducing performance on ordinary illumination and viewpoint matching sequences.
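To make the idea of an equivariant backbone concrete, below is a minimal sketch of a rotation-equivariant convolutional block built with the e2cnn library; the channel counts, rotation order (N=8), and block structure are illustrative assumptions, not the authors' actual LoFTR backbone.

```python
# Minimal sketch of a rotation-equivariant conv block (illustrative only;
# channel counts and rotation order are assumptions, not the paper's backbone).
import torch
from e2cnn import gspaces
from e2cnn import nn as enn

# Discretized rotation group C8: features transform predictably under 45-degree rotations.
r2_act = gspaces.Rot2dOnR2(N=8)

# Input: a 3-channel RGB image, modeled as trivial (scalar) fields.
feat_in = enn.FieldType(r2_act, 3 * [r2_act.trivial_repr])
# Hidden features: regular representations, whose channels permute under rotation.
feat_hid = enn.FieldType(r2_act, 16 * [r2_act.regular_repr])

block = enn.SequentialModule(
    enn.R2Conv(feat_in, feat_hid, kernel_size=5, padding=2),
    enn.InnerBatchNorm(feat_hid),
    enn.ReLU(feat_hid),
    # Pooling over the group yields rotation-invariant channels for matching.
    enn.GroupPooling(feat_hid),
)

x = enn.GeometricTensor(torch.randn(1, 3, 64, 64), feat_in)
y = block(x)           # invariant feature map
print(y.tensor.shape)  # torch.Size([1, 16, 64, 64])
```

Swapping an ordinary CNN backbone for layers of this kind is the type of drop-in replacement the paper investigates.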
PDF CVPRW 2022, updated version

Click here to view paper screenshots

Multi Scale Identity-Preserving Image-to-Image Translation Network for Low-Resolution Face Recognition

Authors: Vahid Reza Khazaie, Nicky Bayat, Yalda Mohsenzadeh

State-of-the-art deep neural network models have reached near-perfect face recognition accuracy on controlled high-resolution face images. However, their performance degrades drastically when they are tested on very low-resolution face images. This is particularly critical in surveillance systems, where a low-resolution probe image is to be matched with high-resolution gallery images. Super-resolution techniques aim to produce high-resolution face images from low-resolution counterparts. While they are capable of reconstructing visually appealing images, the identity-related information is not preserved. Here, we propose an identity-preserving end-to-end image-to-image translation deep neural network which is capable of super-resolving very low-resolution faces to their high-resolution counterparts while preserving identity-related information. We achieve this by training a very deep convolutional encoder-decoder network with a symmetric contracting path between corresponding layers. The network is trained with a combination of a reconstruction loss and an identity-preserving loss, on multi-scale low-resolution conditions. Extensive quantitative evaluations of our proposed model demonstrate that it outperforms competing super-resolution and low-resolution face recognition methods on natural and artificial low-resolution face datasets and even on unseen identities.
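As a rough illustration of the training objective (pixel reconstruction combined with identity preservation), here is a hedged PyTorch sketch; the generator, the frozen face-embedding network, the L1/cosine choices, and the weight lambda_id are placeholders, not the authors' exact losses or hyperparameters.

```python
# Hypothetical sketch of a reconstruction + identity-preserving loss
# (network choices and the weight lambda_id are illustrative assumptions).
import torch
import torch.nn.functional as F

def identity_preserving_loss(generator, face_embedder, lr_face, hr_face, lambda_id=0.1):
    """Combine pixel reconstruction with an identity term.

    generator:     encoder-decoder that super-resolves lr_face to HR resolution
    face_embedder: frozen, pretrained face-recognition network producing embeddings
    """
    sr_face = generator(lr_face)                       # super-resolved output
    rec_loss = F.l1_loss(sr_face, hr_face)             # pixel-level reconstruction

    with torch.no_grad():
        target_emb = face_embedder(hr_face)            # identity of the real HR face
    pred_emb = face_embedder(sr_face)                  # identity of the generated face

    id_loss = 1.0 - F.cosine_similarity(pred_emb, target_emb, dim=1).mean()
    return rec_loss + lambda_id * id_loss
```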
PDF Accepted in the 35th Canadian Conference on Artificial Intelligence

Click here to view paper screenshots

Harmonizer: Learning to Perform White-Box Image and Video Harmonization

Authors: Zhanghan Ke, Chunyi Sun, Lei Zhu, Ke Xu, Rynson W. H. Lau

Recent works on image harmonization solve the problem as a pixel-wise image translation task via large autoencoders. They show unsatisfactory performance and slow inference speed when dealing with high-resolution images. In this work, we observe that adjusting the input arguments of basic image filters, e.g., brightness and contrast, is sufficient for humans to produce realistic images from composite ones. Hence, we frame image harmonization as an image-level regression problem to learn the arguments of the filters that humans use for the task. We present the Harmonizer framework for image harmonization. Unlike prior methods that are based on black-box autoencoders, Harmonizer contains a neural network for filter argument prediction and several white-box filters (based on the predicted arguments) for image harmonization. We also introduce a cascade regressor and a dynamic loss strategy for Harmonizer to learn filter arguments more stably and precisely. Since our network only outputs image-level arguments and the filters we use are efficient, Harmonizer is much lighter and faster than existing methods. Comprehensive experiments demonstrate that Harmonizer surpasses existing methods notably, especially with high-resolution inputs. Finally, we apply Harmonizer to video harmonization, achieving consistent results across frames and running at 56 fps at 1080p resolution. Code and models are available at: https://github.com/ZHKKKe/Harmonizer.
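To illustrate the image-level regression idea, the sketch below has a tiny regressor predict brightness/contrast arguments that then drive simple differentiable white-box filters; the architecture, the filter set, and their parameterization are assumptions for illustration, not the released Harmonizer code.

```python
# Illustrative sketch: predict per-image filter arguments, then apply white-box
# brightness/contrast filters. Architecture and parameterization are assumptions.
import torch
import torch.nn as nn

class FilterArgRegressor(nn.Module):
    """Tiny CNN that regresses two scalar filter arguments per image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 2)  # [brightness, contrast], roughly in [-1, 1]

    def forward(self, composite):
        args = self.head(self.features(composite).flatten(1))
        return torch.tanh(args)

def apply_filters(image, args):
    """Differentiable white-box filters driven by the predicted arguments."""
    brightness = args[:, 0].view(-1, 1, 1, 1)
    contrast = args[:, 1].view(-1, 1, 1, 1)
    out = image + brightness                       # brightness shift
    mean = out.mean(dim=[2, 3], keepdim=True)
    out = (out - mean) * (1.0 + contrast) + mean   # contrast scaling around the mean
    return out.clamp(0.0, 1.0)

# usage: harmonized = apply_filters(composite, FilterArgRegressor()(composite))
```

Because only two scalars per image are regressed and the filters are closed-form, inference cost is essentially independent of image resolution, which is the efficiency argument the abstract makes.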
PDF

Click here to view paper screenshots

SAC-GAN: Structure-Aware Image Composition

Authors: Hang Zhou, Rui Ma, Lingxiao Zhang, Lin Gao, Ali Mahdavi-Amiri, Hao Zhang

We introduce an end-to-end learning framework for image-to-image composition, aiming to seamlessly compose an object, represented as a cropped patch from an object image, into a background scene image. As our approach emphasizes the semantic and structural coherence of the composed images rather than their pixel-level RGB accuracy, we tailor the input and output of our network with structure-aware features and design our network losses accordingly, with ground truth established in a self-supervised setting through the object cropping. Specifically, our network takes the semantic layout features from the input scene image, features encoded from the edges and silhouette in the input object patch, as well as a latent code as inputs, and generates a 2D spatial affine transform defining the translation and scaling of the object patch. The learned parameters are further fed into a differentiable spatial transformer network to transform the object patch into the target image, where our model is trained adversarially using an affine transform discriminator and a layout discriminator. We evaluate our network, coined SAC-GAN, for various image composition scenarios in terms of quality, composability, and generalizability of the composite images. Comparisons are made to state-of-the-art alternatives, including Instance Insertion, ST-GAN, CompGAN and PlaceNet, confirming the superiority of our method.
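The placement step (a predicted scale-and-translation affine transform applied via a differentiable spatial transformer) can be sketched in PyTorch as below; the function name, the transform parameterization, and the normalized-coordinate conventions are assumptions for illustration, not the authors' implementation.

```python
# Sketch of placing an object patch with a differentiable spatial transformer.
# The transform parameterization (scale + translation only) is an assumption.
import torch
import torch.nn.functional as F

def place_patch(object_patch, scale, tx, ty, out_size):
    """Scale and translate object_patch into a canvas of size out_size.

    object_patch: (B, C, h, w) tensor
    scale, tx, ty: (B,) tensors predicted by the composition network
    out_size: (H, W) of the target scene image
    """
    B = object_patch.shape[0]
    # affine_grid expects the inverse mapping (output coords -> patch coords).
    theta = torch.zeros(B, 2, 3, device=object_patch.device)
    theta[:, 0, 0] = 1.0 / scale
    theta[:, 1, 1] = 1.0 / scale
    theta[:, 0, 2] = -tx / scale
    theta[:, 1, 2] = -ty / scale

    grid = F.affine_grid(theta, [B, object_patch.shape[1], *out_size], align_corners=False)
    # Zero padding outside the patch leaves the rest of the canvas empty.
    return F.grid_sample(object_patch, grid, padding_mode="zeros", align_corners=False)

# usage: canvas = place_patch(patch, scale=torch.tensor([0.5]),
#                             tx=torch.tensor([0.3]), ty=torch.tensor([-0.2]),
#                             out_size=(256, 256))
```

Because both affine_grid and grid_sample are differentiable, gradients from the adversarial losses can flow back through the placement into the network that predicts the transform parameters.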
PDF

Click here to view paper screenshots

Clustered Saliency Prediction

Authors: Rezvan Sherkati, James J. Clark

We present a new method for image saliency prediction, Clustered Saliency Prediction. This method divides individuals into clusters based on their personal features and their known saliency maps, and generates a separate image saliency model for each cluster. We test our approach on a public dataset of personalized saliency maps, with varying importance weights for the personal feature factors, and observe the effects on the clusters. For each cluster, we use an image-to-image translation method, primarily the Pix2Pix model, to convert universal saliency maps to saliency maps of that cluster. We try three state-of-the-art universal saliency prediction methods, DeepGaze II, ML-Net and SalGAN, and examine their impact on the results. We show that our Clustered Saliency Prediction technique outperforms the state-of-the-art universal saliency prediction models. We also demonstrate the effectiveness of our clustering method by comparing the results of Clustered Saliency Prediction using clusters obtained by the Subject Similarity Clustering algorithm against two baseline methods. Finally, we propose an approach to assign new people to the most appropriate cluster, based on their personal features and any known saliency maps. In our experiments, this assignment method on average chooses the cluster that yields higher saliency scores.
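A rough sketch of the cluster-and-assign idea is given below; the feature construction, the use of KMeans, and the centroid-distance assignment are simplifying assumptions, not the paper's Subject Similarity Clustering algorithm.

```python
# Hypothetical sketch: cluster observers by personal features + saliency maps,
# then assign a new observer to the nearest cluster centroid. The feature
# construction and use of KMeans are assumptions, not the paper's algorithm.
import numpy as np
from sklearn.cluster import KMeans

def build_observer_features(personal_features, saliency_maps, w_personal=1.0):
    """Concatenate weighted personal features with flattened known saliency maps."""
    flat_maps = saliency_maps.reshape(saliency_maps.shape[0], -1)
    return np.concatenate([w_personal * personal_features, flat_maps], axis=1)

def cluster_observers(features, n_clusters=4, seed=0):
    """Group observers; each cluster later gets its own Pix2Pix-style saliency model."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = km.fit_predict(features)
    return km, labels

def assign_new_observer(km, new_feature):
    """Pick the cluster whose centroid is closest to the new observer's features."""
    dists = np.linalg.norm(km.cluster_centers_ - new_feature[None, :], axis=1)
    return int(np.argmin(dists))
```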
PDF 21 pages, 4 figures

Click here to view paper screenshots

Article author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 when reposting!