2022-06-03 Update
Analysis of Different Losses for Deep Learning Image Colorization
Authors:Coloma Ballester, Aurélie Bugeau, Hernan Carrillo, Michaël Clément, Rémi Giraud, Lara Raad, Patricia Vitoria
Image colorization aims to add color information to a grayscale image in a realistic way. Recent methods mostly rely on deep learning strategies. While learning to automatically colorize an image, one can define well-suited objective functions related to the desired color output. Some of them are based on a specific type of error between the predicted image and the ground truth one, while other losses rely on the comparison of perceptual properties. But is the choice of the objective function that crucial, i.e., does it play an important role in the results? In this chapter, we aim to answer this question by analyzing the impact of the loss function on the estimated colorization results. To that end, we review the different losses and evaluation metrics used in the literature. We then train a baseline network with several of the reviewed objective functions: the classic L1 and L2 losses, as well as more complex combinations such as Wasserstein GAN and the VGG-based LPIPS loss. Quantitative results show that the models trained with the VGG-based LPIPS loss provide overall slightly better results for most evaluation metrics. Qualitative results exhibit more vivid colors when training with the Wasserstein GAN plus L2 loss or with the VGG-based LPIPS loss. Finally, the convenience of quantitative user studies is also discussed to overcome the difficulty of properly assessing colorized images, notably for old archive photographs where no ground truth is available.
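For reference, a minimal sketch (PyTorch) of the three loss families the chapter compares: pixel-wise L1/L2 and a VGG-based LPIPS perceptual loss. The function name, tensor shapes, and use of the third-party `lpips` package are illustrative assumptions; the chapter's exact training setup is not reproduced here.

```python
import torch
import torch.nn.functional as F
import lpips  # third-party package by Zhang et al.; assumed installed via `pip install lpips`

lpips_vgg = lpips.LPIPS(net='vgg')  # VGG-based LPIPS metric

def colorization_losses(pred_rgb, gt_rgb):
    """pred_rgb, gt_rgb: (B, 3, H, W) tensors scaled to [-1, 1]."""
    l1 = F.l1_loss(pred_rgb, gt_rgb)                 # classic L1
    l2 = F.mse_loss(pred_rgb, gt_rgb)                # classic L2
    perceptual = lpips_vgg(pred_rgb, gt_rgb).mean()  # VGG-based LPIPS
    return l1, l2, perceptual
```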
PDF arXiv admin note: text overlap with arXiv:2204.02850
Paper screenshots
MontageGAN: Generation and Assembly of Multiple Components by GANs
Authors:Chean Fei Shee, Seiichi Uchida
A multi-layer image is more valuable than a single-layer image from a graphic designer's perspective. However, most image generation methods proposed so far focus on single-layer images. In this paper, we propose MontageGAN, a Generative Adversarial Network (GAN) framework for generating multi-layer images. Our method uses a two-step approach consisting of local GANs and a global GAN. Each local GAN learns to generate a specific image layer, and the global GAN learns the placement of each generated image layer. Through our experiments, we show the ability of our method to generate multi-layer images and estimate the placement of the generated image layers.
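A minimal sketch (PyTorch) of the assembly step described above: local generators are assumed to output RGBA layers, and a global module predicts an affine placement per layer before alpha compositing. The function names and interfaces are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def place_layer(rgba, theta):
    """rgba: (B, 4, H, W) layer; theta: (B, 2, 3) affine placement (translation/scale)."""
    grid = F.affine_grid(theta, rgba.size(), align_corners=False)
    return F.grid_sample(rgba, grid, align_corners=False)

def compose_layers(layers):
    """Alpha-composite placed RGBA layers back-to-front into one RGB image."""
    canvas = torch.zeros_like(layers[0][:, :3])
    for rgba in layers:
        rgb, alpha = rgba[:, :3], rgba[:, 3:4]
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas
```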
PDF Accepted at ICPR2022
Paper screenshots
Creating synthetic night-time visible-light meteorological satellite images using the GAN method
Authors:Wencong Cheng
Visible-light imagery from meteorological satellites is critical for weather support and forecasting. However, no such data are available at night. To overcome this, we propose a deep-learning-based method to create synthetic satellite visible-light images at night. Specifically, to produce more realistic products, we train a Generative Adversarial Network (GAN) model to generate visible-light images given the corresponding satellite infrared images and numerical weather prediction (NWP) products. To better model the nonlinear relationship from infrared data and NWP products to visible-light images, we propose to use a channel-wise attention mechanism, e.g., an SE block, to quantitatively weight the input channels. Experiments based on ECMWF NWP products and the visible-light and infrared channel data of the FY-4A meteorological satellite show that the proposed method can effectively create realistic synthetic satellite visible-light images at night.
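A minimal sketch (PyTorch) of a squeeze-and-excitation (SE) block, the kind of channel-wise attention the paper uses to weight the infrared and NWP input channels. The layer sizes and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # reweight input channels
```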
PDF
Paper screenshots
Video2StyleGAN: Disentangling Local and Global Variations in a Video
Authors:Rameen Abdal, Peihao Zhu, Niloy J. Mitra, Peter Wonka
Image editing using a pretrained StyleGAN generator has emerged as a powerful paradigm for facial editing, providing disentangled controls over age, expression, illumination, etc. However, the approach cannot be directly adopted for video manipulation. We hypothesize that the main missing ingredient is the lack of fine-grained and disentangled control over face location, face pose, and local facial expressions. In this work, we demonstrate that such fine-grained control is indeed achievable with a pretrained StyleGAN by working across multiple (latent) spaces (namely, the positional space, the W+ space, and the S space) and combining the optimization results across them. Building on this enabling component, we introduce Video2StyleGAN, which takes a target image and driving video(s) to reenact the local and global locations and expressions from the driving video in the identity of the target image. We evaluate the effectiveness of our method over multiple challenging scenarios and demonstrate clear improvements over alternative approaches.
PDF Video : https://youtu.be/oUeXFyfdE1A
Paper screenshots
A Survey of Deep Fake Detection for Trial Courts
Authors:Naciye Celebi, Qingzhong Liu, Muhammed Karatoprak
Recently, image manipulation has seen rapid growth due to the advancement of sophisticated image editing tools. DeepFake refers to a recent surge of fake imagery and videos generated using neural networks. DeepFake algorithms can create fake images and videos that humans cannot distinguish from authentic ones. Generative Adversarial Networks (GANs) have been extensively used for creating realistic images without access to the original images. It has therefore become essential to detect fake videos to avoid the spread of false information. This paper presents a survey of methods used to detect DeepFakes, as well as datasets available for DeepFake detection in the literature to date. We present extensive discussions and research trends related to DeepFake technologies.
PDF 12 Pages, 1 Table
Paper screenshots
Accelerating Diffusion Models via Early Stop of the Diffusion Process
Authors:Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, Bo Dai
Denoising Diffusion Probabilistic Models (DDPMs) have achieved impressive performance on various generation tasks. By modeling the reverse process of gradually diffusing the data distribution into a Gaussian distribution, generating a sample in DDPMs can be regarded as iteratively denoising a randomly sampled Gaussian noise. However, in practice DDPMs often need hundreds or even thousands of denoising steps to obtain a high-quality sample from the Gaussian noise, leading to extremely low inference efficiency. In this work, we propose a principled acceleration strategy, referred to as Early-Stopped DDPM (ES-DDPM), for DDPMs. The key idea is to stop the diffusion process early, so that only the first few diffusion steps are considered and the reverse denoising process starts from a non-Gaussian distribution. By further adopting a powerful pre-trained generative model, such as a GAN or VAE, in ES-DDPM, sampling from the target non-Gaussian distribution can be efficiently achieved by diffusing samples obtained from the pre-trained generative model. In this way, the number of required denoising steps is significantly reduced. At the same time, the sample quality of ES-DDPM also improves substantially, outperforming both the vanilla DDPM and the adopted pre-trained generative model. In extensive experiments across CIFAR-10, CelebA, ImageNet, LSUN-Bedroom and LSUN-Cat, ES-DDPM obtains promising acceleration and performance improvements over representative baseline methods. Moreover, ES-DDPM also demonstrates several attractive properties, including being orthogonal to existing acceleration methods, as well as simultaneously enabling both global semantic and local pixel-level control in image generation.
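A minimal sketch (PyTorch) of the early-stop idea: instead of denoising from pure Gaussian noise over all T steps, forward-diffuse a sample from a pre-trained generator to an intermediate step and run only that many reverse steps. The noise schedule and the posterior update follow standard DDPM notation; the `generator` and `eps_model` interfaces are placeholder assumptions, not the paper's code.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # standard linear beta schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def es_ddpm_sample(eps_model, generator, shape, t_stop=100):
    x0 = generator(shape)                                  # sample from pre-trained GAN/VAE (assumed API)
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t_stop - 1]
    x = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # forward-diffuse to step t_stop
    for t in reversed(range(t_stop)):                       # reverse denoising for only t_stop steps
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = eps_model(x, torch.full((shape[0],), t))  # predicted noise at step t (assumed API)
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```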
PDF Code is released at https://github.com/ZhaoyangLyu/Early_Stopped_DDPM
Paper screenshots
Generative Models with Information-Theoretic Protection Against Membership Inference Attacks
Authors:Parisa Hassanzadeh, Robert E. Tillman
Deep generative models, such as Generative Adversarial Networks (GANs), synthesize diverse high-fidelity data samples by estimating the underlying distribution of high-dimensional data. Despite their success, GANs may disclose private information from the data they are trained on, making them susceptible to adversarial attacks such as membership inference attacks, in which an adversary aims to determine if a record was part of the training set. We propose an information-theoretically motivated regularization term that prevents the generative model from overfitting to training data and encourages generalizability. We show that this penalty minimizes the Jensen-Shannon divergence between components of the generator trained on data with different membership, and that it can be implemented at low cost using an additional classifier. Our experiments on image datasets demonstrate that with the proposed regularization, which comes at only a small added computational cost, GANs are able to preserve privacy and generate high-quality samples that achieve better downstream classification performance compared to non-private and differentially private generative models.
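A minimal sketch (PyTorch) of one plausible reading of the classifier-based penalty: an auxiliary classifier tries to tell which membership split conditioned a generated sample, and the generator is penalized when it succeeds; as in the standard GAN analysis, such an adversarial term acts as a proxy for the Jensen-Shannon divergence between the two generator components. The paper's exact formulation may differ, so treat this purely as an assumption-laden illustration.

```python
import torch
import torch.nn.functional as F

def membership_regularizer(classifier, fake_a, fake_b):
    """fake_a / fake_b: batches generated from the two membership splits (assumed setup)."""
    logits = torch.cat([classifier(fake_a), classifier(fake_b)])           # (2B, 2) split logits
    labels = torch.cat([torch.zeros(len(fake_a)), torch.ones(len(fake_b))]).long()
    # The generator minimizes this term, i.e., maximizes the classifier's confusion,
    # which pushes the two generator components toward indistinguishability (small JS divergence).
    return -F.cross_entropy(logits, labels)
```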
PDF
Paper screenshots
Enhancing Quality of Pose-varied Face Restoration with Local Weak Feature Sensing and GAN Prior
Authors:Kai Hu, Yu Liu, Renhe Liu, Wei Lu, Gang Yu, Bin Fu
Facial semantic guidance (facial landmarks, facial parsing maps, facial heatmaps, etc.) and facial generative adversarial network (GAN) priors have been widely used in blind face restoration (BFR) in recent years. Although existing BFR methods achieve good performance in ordinary cases, these solutions have limited resilience when applied to face images with severe degradation and varied poses (looking up, looking down, laughing, etc.) in real-world scenarios. In this work, we propose a well-designed blind face restoration network with a generative facial prior. The proposed network is mainly comprised of an asymmetric codec and a StyleGAN2 prior network. In the asymmetric codec, we adopt a mixed multi-path residual block (MMRB) to gradually extract the weak texture features of input images, which improves the texture integrity and authenticity of our network. Furthermore, the MMRB can be used as a plug-and-play module in any other network. Besides, a novel self-supervised training strategy is specially designed for face restoration tasks to fit the distribution closer to the target and maintain training stability. Extensive experiments on synthetic and real-world datasets demonstrate that our model achieves superior performance to the prior art for face restoration and face super-resolution tasks and can handle severely degraded face images in diverse poses and expressions.
PDF pdfLaTeX 2021, 11 pages with 15 figures
Paper screenshots
Generating Sparse Counterfactual Explanations For Multivariate Time Series
Authors:Jana Lang, Martin Giese, Winfried Ilg, Sebastian Otte
Since neural networks play an increasingly important role in critical sectors, explaining network predictions has become a key research topic. Counterfactual explanations can help to understand why classifier models decide for particular class assignments and, moreover, how the respective input samples would have to be modified such that the class prediction changes. Previous approaches mainly focus on image and tabular data. In this work we propose SPARCE, a generative adversarial network (GAN) architecture that generates SPARse Counterfactual Explanations for multivariate time series. Our approach provides a custom sparsity layer and regularizes the counterfactual loss function in terms of similarity, sparsity, and smoothness of trajectories. We evaluate our approach on real-world human motion datasets as well as a synthetic time series interpretability benchmark. Although we make significantly sparser modifications than other approaches, we achieve comparable or better performance on all metrics. Moreover, we demonstrate that our approach predominantly modifies salient time steps and features, leaving non-salient inputs untouched.
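A minimal sketch (PyTorch) of a counterfactual objective regularized in terms of similarity, sparsity, and smoothness, as described above: a modification delta of a time series should flip the classifier's prediction while changing as few steps and features as possible. The weights and the exact composition of SPARCE's loss (and its custom sparsity layer) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def counterfactual_loss(classifier, x, delta, target_class,
                        w_sim=1.0, w_sparse=1.0, w_smooth=1.0):
    """x, delta: (B, T, D) multivariate time series and proposed modification."""
    logits = classifier(x + delta)
    cls_term = F.cross_entropy(logits, target_class)          # push toward the target class
    similarity = delta.pow(2).mean()                          # stay close to the query sample
    sparsity = delta.abs().mean()                             # modify few time steps/features
    smoothness = (delta[:, 1:] - delta[:, :-1]).abs().mean()  # smooth modification trajectories
    return cls_term + w_sim * similarity + w_sparse * sparsity + w_smooth * smoothness
```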
PDF 10 pages, 6 figures. Preprint. Under review
Paper screenshots
2022-06-03 Update
LatteGAN: Visually Guided Language Attention for Multi-Turn Text-Conditioned Image Manipulation
Authors:Shoya Matsumori, Yuki Abe, Kosuke Shingyouchi, Komei Sugiura, Michita Imai
Text-guided image manipulation tasks have recently gained attention in the vision-and-language community. While most of the prior studies focused on single-turn manipulation, our goal in this paper is to address the more challenging multi-turn image manipulation (MTIM) task. Previous models for this task successfully generate images iteratively, given a sequence of instructions and a previously generated image. However, this approach suffers from under-generation and a lack of quality in the generated objects described in the instructions, which consequently degrades the overall performance. To overcome these problems, we present a novel architecture called Visually Guided Language Attention GAN (LatteGAN). Here, we address the limitations of the previous approaches by introducing a Visually Guided Language Attention (Latte) module, which extracts fine-grained text representations for the generator, and a Text-Conditioned U-Net discriminator architecture, which discriminates both the global and local representations of fake or real images. Extensive experiments on two distinct MTIM datasets, CoDraw and i-CLEVR, demonstrate the state-of-the-art performance of the proposed model.
PDF
Paper screenshots
PandA: Unsupervised Learning of Parts and Appearances in the Feature Maps of GANs
Authors:James Oldfield, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Ioannis Patras
Recent advances in the understanding of Generative Adversarial Networks (GANs) have led to remarkable progress in visual editing and synthesis tasks, capitalizing on the rich semantics that are embedded in the latent spaces of pre-trained GANs. However, existing methods are often tailored to specific GAN architectures and are limited to either discovering global semantic directions that do not facilitate localized control, or require some form of supervision through manually provided regions or segmentation masks. In this light, we present an architecture-agnostic approach that jointly discovers factors representing spatial parts and their appearances in an entirely unsupervised fashion. These factors are obtained by applying a semi-nonnegative tensor factorization on the feature maps, which in turn enables context-aware local image editing with pixel-level control. In addition, we show that the discovered appearance factors correspond to saliency maps that localize concepts of interest, without using any labels. Experiments on a wide range of GAN architectures and datasets show that, in comparison to the state of the art, our method is far more efficient in terms of training time and, most importantly, provides much more accurate localized control. Our code is available at: https://github.com/james-oldfield/PandA.
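A minimal sketch (PyTorch) of a semi-nonnegative factorization of a GAN feature map, in the spirit described above: spatial "parts" are constrained to be nonnegative while appearance directions are unconstrained. It uses simple projected gradient descent on a flattened feature matrix rather than the paper's tensor formulation, purely for illustration.

```python
import torch

def semi_nmf(features, k=8, steps=500, lr=1e-2):
    """features: (C, N) matrix, e.g., a GAN feature map flattened over its H*W spatial positions."""
    C, N = features.shape
    appearances = torch.randn(C, k, requires_grad=True)   # unconstrained appearance factors
    parts = torch.rand(k, N, requires_grad=True)          # nonnegative spatial parts
    opt = torch.optim.Adam([appearances, parts], lr=lr)
    for _ in range(steps):
        # reconstruction error with parts projected onto the nonnegative orthant
        loss = (features - appearances @ parts.clamp(min=0)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return appearances.detach(), parts.detach().clamp(min=0)
```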
PDF Code available at: https://github.com/james-oldfield/PandA
Paper screenshots
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
Authors:Gwanghyun Kim, Taesung Kwon, Jong Chul Ye
Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts. However, their application to diverse real images remains difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulties in reconstructing images with novel poses, views, and highly variable contents compared to the training data, altering object identity, or producing unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Based on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains and takes another step towards general application by manipulating images from the widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to existing baselines. Code is available at https://github.com/gwang-kim/DiffusionCLIP.git.
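A minimal sketch (PyTorch with OpenAI's `clip` package) of the kind of CLIP guidance that text-driven editing methods like DiffusionCLIP build on: maximize the cosine similarity between the edited image and the target prompt. DiffusionCLIP's full objective (e.g., its directional and identity-preservation terms) is not reproduced; the setup below, including the assumption that the input is already resized and normalized for CLIP, is illustrative.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def clip_guidance_loss(image, prompt):
    """image: (B, 3, 224, 224) tensor already normalized for CLIP; prompt: str."""
    text = clip.tokenize([prompt]).to(device)
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()  # 1 - cosine similarity
```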
PDF Accepted to CVPR 2022
Paper screenshots
StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
Authors:Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny
Videos show continuous events, yet most, if not all, video synthesis frameworks treat them discretely in time. In this work, we think of videos as what they should be: time-continuous signals, and extend the paradigm of neural representations to build a continuous-time video generator. For this, we first design continuous motion representations through the lens of positional embeddings. Then, we explore the question of training on very sparse videos and demonstrate that a good generator can be learned by using as few as 2 frames per clip. After that, we rethink the traditional image + video discriminator pair and design a holistic discriminator that aggregates temporal information by simply concatenating frames' features. This decreases the training cost and provides a richer learning signal to the generator, making it possible to train directly on 1024² videos for the first time. We build our model on top of StyleGAN2, and it is only ≈5% more expensive to train at the same resolution while achieving almost the same image quality. Moreover, our latent space has similar properties, enabling spatial manipulations that our method can propagate in time. We can generate arbitrarily long videos at arbitrarily high frame rates, while prior work struggles to generate even 64 frames at a fixed rate. Our model is tested on four modern 256² and one 1024²-resolution video synthesis benchmarks. In terms of sheer metrics, it performs on average ≈30% better than the closest runner-up. Project website: https://universome.github.io.
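A minimal sketch (PyTorch) of the holistic discriminator idea described above: each frame of a sparsely sampled clip is encoded by a shared backbone, and the per-frame features are simply concatenated before a real/fake head. The backbone and head here are toy stand-ins, not the StyleGAN2-based modules used in the paper.

```python
import torch
import torch.nn as nn

class HolisticVideoDiscriminator(nn.Module):
    def __init__(self, feat_dim=256, num_frames=3):
        super().__init__()
        self.frame_encoder = nn.Sequential(              # shared per-frame feature extractor
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim * num_frames, 1)  # aggregates time by concatenation

    def forward(self, frames):
        """frames: (B, T, 3, H, W) sparsely sampled clip."""
        feats = [self.frame_encoder(frames[:, t]) for t in range(frames.shape[1])]
        return self.head(torch.cat(feats, dim=1))        # one real/fake logit per clip
```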
PDF CVPR 2022