Speech


Updated 2023-12-13

UID as a Guiding Metric for Automated Authorship Obfuscation

Authors:Nicholas Abegg

Protecting the anonymity of authors has become a difficult task given the rise of automated authorship attributors. These attributors are capable of attributing the author of a text amongst a pool of authors with great accuracy. To counter them, automated obfuscators have also emerged. These obfuscators take a text, perturb it in some manner, and, if successful, deceive an automated attributor into attributing the text to the wrong author. We devised three novel authorship obfuscation methods that utilize a psycholinguistic theory known as Uniform Information Density (UID), which states that humans distribute information evenly across speech or text so as to maximize efficiency. Using these three obfuscation methods, we examined how successfully we could deceive two separate attributors. Obfuscating 50 human-written and 50 GPT-3-generated articles from the TuringBench dataset, we measured how well each method deceived the attributors. While the quality of the obfuscation was high in terms of semantic preservation and the sensibility of the changes, we found no evidence that UID is a viable guiding metric for obfuscation. However, due to time constraints we were unable to test a large enough sample of articles or to tune the parameters of our attributors, so we cannot comment conclusively on the role of UID in obfuscation.
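UID is typically operationalized through per-token surprisal (negative log-probability under a language model): the more evenly surprisal is spread across tokens, the more uniform the information density. The following is a minimal sketch of one such operationalization, using the variance of surprisals as a UID proxy; the variance-based score, the toy decision rule, and all names are illustrative assumptions, not the authors' actual obfuscation methods.

```python
import math
from statistics import variance

def uid_score(token_probs: list[float]) -> float:
    """Score how uniformly information is distributed across a text.

    token_probs: per-token probabilities assigned by some language model
    (how they are obtained is outside this sketch). A lower variance of
    surprisal means more uniform information density.
    """
    surprisals = [-math.log2(p) for p in token_probs]  # information content per token
    return variance(surprisals)                        # UID proxy: surprisal variance

def prefer_more_uniform(original_probs, perturbed_probs):
    """Toy obfuscation decision: keep a perturbation only if it makes the
    text's information density more uniform (lower surprisal variance)."""
    return uid_score(perturbed_probs) < uid_score(original_probs)

# Example: a perturbation that smooths out one very surprising token.
original  = [0.20, 0.15, 0.01, 0.25]   # one low-probability (high-surprisal) token
perturbed = [0.18, 0.15, 0.10, 0.20]
print(uid_score(original), uid_score(perturbed), prefer_more_uniform(original, perturbed))
```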
PDF 20 pages, 10 figures


Graph Convolutions Enrich the Self-Attention in Transformers!

Authors:Jeongwhan Choi, Hyowon Wi, Jayoung Kim, Yehjin Shin, Kookjin Lee, Nathaniel Trask, Noseong Park

Transformers, renowned for their self-attention mechanism, have achieved state-of-the-art performance across various tasks in natural language processing, computer vision, time-series modeling, etc. However, one of the challenges with deep Transformer models is the oversmoothing problem, where representations across layers converge to indistinguishable values, leading to significant performance degradation. We interpret the original self-attention as a simple graph filter and redesign it from a graph signal processing (GSP) perspective. We propose graph-filter-based self-attention (GFSA), which learns a more general yet effective graph filter, at a complexity only slightly higher than that of the original self-attention mechanism. We demonstrate that GFSA improves the performance of Transformers in various fields, including computer vision, natural language processing, graph pattern classification, speech recognition, and code classification.
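One common way to read self-attention through a GSP lens is to treat the softmax attention matrix as the adjacency (shift) operator of a token graph, so that standard attention is a first-order graph filter and a more general filter is a low-order matrix polynomial in that operator. The sketch below illustrates this idea only; the particular polynomial, the scalar weights, and the function names are assumptions rather than the paper's exact GFSA formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def graph_filter_attention(Q, K, V, w0=0.5, w1=1.0, w2=0.1):
    """Self-attention viewed as a graph filter (illustrative only).

    The softmax attention matrix A is treated as the adjacency/shift
    operator of a fully connected token graph, and the output is a
    low-order polynomial filter in A applied to the values V:
        (w0*I + w1*A + w2*A^2) V
    Plain softmax attention is the special case w0=0, w1=1, w2=0; the
    extra A^2 term is what makes the filter slightly more expensive.
    """
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))       # row-stochastic attention matrix
    I = np.eye(A.shape[0])
    H = w0 * I + w1 * A + w2 * (A @ A)      # polynomial graph filter
    return H @ V

# Tiny usage example: 4 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(graph_filter_attention(Q, K, V).shape)  # (4, 8)
```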
PDF


Investigating the Design Space of Diffusion Models for Speech Enhancement

Authors:Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach to adapting diffusion models to speech enhancement is to model a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously proposed in the image generation literature does not account for such a transformation towards the system input, which prevents relating existing diffusion-based speech enhancement systems to that framework. To address this, we extend the framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), and the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows us to outperform a popular diffusion-based speech enhancement system in terms of perceptual metrics while using fewer sampling steps, thus reducing the computational cost by a factor of four.
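The central modelling choice discussed here is a forward process whose mean moves progressively from the clean signal toward the noisy observation while Gaussian noise is injected. The sketch below is a minimal illustration of such an interpolating forward process under simple assumptions (linear interpolation, a linear noise schedule); it is not the paper's specific SDE, preconditioning, or loss weighting.

```python
import numpy as np

def forward_process(clean, noisy, t, sigma_max=0.5, rng=None):
    """Illustrative forward diffusion step for speech enhancement.

    Instead of drifting toward pure Gaussian noise (as in image diffusion),
    the mean of the state interpolates from the clean signal (t=0) toward
    the observed noisy signal (t=1), with additional Gaussian noise whose
    scale grows with t. The linear interpolation and the noise schedule
    sigma(t) = sigma_max * t are simplifying assumptions.
    """
    rng = rng or np.random.default_rng()
    mean = (1.0 - t) * clean + t * noisy   # progressive clean -> noisy transformation
    sigma = sigma_max * t                  # injected stochasticity grows with t
    return mean + sigma * rng.standard_normal(clean.shape)

# Usage: a 1-second, 16 kHz toy signal corrupted by additive noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
x_mid = forward_process(clean, noisy, t=0.5, rng=rng)  # halfway along the transformation
print(x_mid.shape)
```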
PDF


Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Authors:Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart

Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our project website is amuse.is.tue.mpg.de.
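The mechanism described, mapping driving audio to separate content, emotion, and style latents and conditioning a gesture diffusion model on them, enables emotion and style transfer by swapping latents between speech sequences. The toy sketch below shows only that conditioning-and-swap logic; the placeholder encoder, latent sizes, and concatenation-based conditioning are assumptions for illustration, not AMUSE's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_speech(audio_path):
    """Stand-in for a speech encoder that maps audio to three disentangled
    latents (content, emotion, style). The input is ignored and random
    placeholders are returned; in AMUSE these would come from a trained
    disentangling encoder."""
    return {
        "content": rng.standard_normal(64),
        "emotion": rng.standard_normal(16),
        "style":   rng.standard_normal(16),
    }

def gesture_condition(content, emotion, style):
    """Build the conditioning vector for the gesture diffusion model by
    concatenating the three latents (concatenation is an assumption)."""
    return np.concatenate([content, emotion, style])

# Emotion/style transfer: keep the content of the driving speech but take
# emotion and style from another speech sequence.
driving = encode_speech("driving.wav")
reference = encode_speech("reference_angry.wav")
cond = gesture_condition(driving["content"], reference["emotion"], reference["style"])
print(cond.shape)  # conditioning vector passed to the gesture diffusion model
```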
PDF


Article author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 when reposting!