2023-06-16 更新
KIT’s Multilingual Speech Translation System for IWSLT 2023
Authors:Danni Liu, Thai Binh Nguyen, Sai Koneru, Enes Yavuz Ugan, Ngoc-Quan Pham, Tuan-Nam Nguyen, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues
Many existing speech translation benchmarks focus on native-English speech in high-quality recording conditions, which often do not match the conditions in real-life use-cases. In this paper, we describe our speech translation system for the multilingual track of IWSLT 2023, which evaluates translation quality on scientific conference talks. The test condition features accented input speech and terminology-dense contents. The task requires translation into 10 languages of varying amounts of resources. In absence of training data from the target domain, we use a retrieval-based approach (kNN-MT) for effective adaptation (+0.8 BLEU for speech translation). We also use adapters to easily integrate incremental training data from data augmentation, and show that it matches the performance of re-training. We observe that cascaded systems are more easily adaptable towards specific target domains, due to their separate modules. Our cascaded speech system substantially outperforms its end-to-end counterpart on scientific talk translation, although their performance remains similar on TED talks.
PDF IWSLT 2023
点此查看论文截图
Large-scale Language Model Rescoring on Long-form Data
Authors:Tongzhou Chen, Cyril Allauzen, Yinghui Huang, Daniel Park, David Rybach, W. Ronny Huang, Rodrigo Cabrera, Kartik Audhkhasi, Bhuvana Ramabhadran, Pedro J. Moreno, Michael Riley
In this work, we study the impact of Large-scale Language Models (LLM) on Automated Speech Recognition (ASR) of YouTube videos, which we use as a source for long-form ASR. We demonstrate up to 8\% relative reduction in Word Error Eate (WER) on US English (en-us) and code-switched Indian English (en-in) long-form ASR test sets and a reduction of up to 30\% relative on Salient Term Error Rate (STER) over a strong first-pass baseline that uses a maximum-entropy based language model. Improved lattice processing that results in a lattice with a proper (non-tree) digraph topology and carrying context from the 1-best hypothesis of the previous segment(s) results in significant wins in rescoring with LLMs. We also find that the gains in performance from the combination of LLMs trained on vast quantities of available data (such as C4) and conventional neural LMs is additive and significantly outperforms a strong first-pass baseline with a maximum entropy LM.
PDF 5 pages, accepted in ICASSP 2023
点此查看论文截图
DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer
Authors:Goeric Huybrechts, Srikanth Ronanki, Xilai Li, Hadis Nosrati, Sravan Bodapati, Katrin Kirchhoff
Conformer-based end-to-end models have become ubiquitous these days and are commonly used in both streaming and non-streaming automatic speech recognition (ASR). Techniques like dual-mode and dynamic chunk training helped unify streaming and non-streaming systems. However, there remains a performance gap between streaming with a full and limited past context. To address this issue, we propose the integration of a novel dynamic contextual carry-over mechanism in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over mechanism that takes into account both the left context of a chunk and one or more preceding context embeddings. We outperform the SOTA by a relative 25.0% word error rate, with a negligible latency impact due to the additional context embeddings.
PDF
点此查看论文截图
A semantically enhanced dual encoder for aspect sentiment triplet extraction
Authors:Baoxing Jiang, Shehui Liang, Peiyu Liu, Kaifang Dong, Hongye Li
Aspect sentiment triplet extraction (ASTE) is a crucial subtask of aspect-based sentiment analysis (ABSA) that aims to comprehensively identify sentiment triplets. Previous research has focused on enhancing ASTE through innovative table-filling strategies. However, these approaches often overlook the multi-perspective nature of language expressions, resulting in a loss of valuable interaction information between aspects and opinions. To address this limitation, we propose a framework that leverages both a basic encoder, primarily based on BERT, and a particular encoder comprising a Bi-LSTM network and graph convolutional network (GCN ). The basic encoder captures the surface-level semantics of linguistic expressions, while the particular encoder extracts deeper semantics, including syntactic and lexical information. By modeling the dependency tree of comments and considering the part-of-speech and positional information of words, we aim to capture semantics that are more relevant to the underlying intentions of the sentences. An interaction strategy combines the semantics learned by the two encoders, enabling the fusion of multiple perspectives and facilitating a more comprehensive understanding of aspect—opinion relationships. Experiments conducted on benchmark datasets demonstrate the state-of-the-art performance of our proposed framework.
PDF 25 pages, 4 figures
点此查看论文截图
SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?
Authors:Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma
Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing spoken language understanding tasks, implying that the SSL models have the potential to learn not only acoustic but also linguistic information. In this paper, we aim to clarify if speech SSL techniques can well capture linguistic knowledge. For this purpose, we introduce SpeechGLUE, a speech version of the General Language Understanding Evaluation (GLUE) benchmark. Since GLUE comprises a variety of natural language understanding tasks, SpeechGLUE can elucidate the degree of linguistic ability of speech SSL models. Experiments demonstrate that speech SSL models, although inferior to text-based SSL models, perform better than baselines, suggesting that they can acquire a certain amount of general linguistic knowledge from just unlabeled speech data.
PDF Accepted at INTERSPEECH 2023
点此查看论文截图
Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement
Authors:Hejung Yang, Hong-Goo Kang
Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning because they are able to extract high-quality salient features from input data. As such, they have been frequently used as base networks for various pattern classification tasks such as speech recognition. However, not much research has been conducted on applying these types of models to the field of speech signal generation. In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task. To alleviate mismatches between the input features of the pre-trained model and the target enhancement model, we adopt a novel feature normalization technique to smoothly link these modules together. Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.
PDF INTERSPEECH 2023 accepted
点此查看论文截图
ITALIC: An Italian Intent Classification Dataset
Authors:Alkis Koudounas, Moreno La Quatra, Lorenzo Vaiani, Luca Colomba, Giuseppe Attanasio, Eliana Pastor, Luca Cagliero, Elena Baralis
Recent large-scale Spoken Language Understanding datasets focus predominantly on English and do not account for language-specific phenomena such as particular phonemes or words in different lects. We introduce ITALIC, the first large-scale speech dataset designed for intent classification in Italian. The dataset comprises 16,521 crowdsourced audio samples recorded by 70 speakers from various Italian regions and annotated with intent labels and additional metadata. We explore the versatility of ITALIC by evaluating current state-of-the-art speech and text models. Results on intent classification suggest that increasing scale and running language adaptation yield better speech models, monolingual text models outscore multilingual ones, and that speech recognition on ITALIC is more challenging than on existing Italian benchmarks. We release both the dataset and the annotation scheme to streamline the development of new Italian SLU models and language-specific datasets.
PDF Accepted at INTERSPEECH 2023. Data and code at https://github.com/RiTA-nlp/ITALIC
点此查看论文截图
Utilizing Longitudinal Chest X-Rays and Reports to Pre-Fill Radiology Reports
Authors:Qingqing Zhu, Tejas Sudharshan Mathai, Pritam Mukherjee, Yifan Peng, Ronald M. Summers, Zhiyong Lu
Despite the reduction in turn-around times in radiology reports with the use of speech recognition software, persistent communication errors can significantly impact the interpretation of the radiology report. Pre-filling a radiology report holds promise in mitigating reporting errors, and despite efforts in the literature to generate medical reports, there exists a lack of approaches that exploit the longitudinal nature of patient visit records in the MIMIC-CXR dataset. To address this gap, we propose to use longitudinal multi-modal data, i.e., previous patient visit CXR, current visit CXR, and previous visit report, to pre-fill the ‘findings’ section of a current patient visit report. We first gathered the longitudinal visit information for 26,625 patients from the MIMIC-CXR dataset and created a new dataset called Longitudinal-MIMIC. With this new dataset, a transformer-based model was trained to capture the information from longitudinal patient visit records containing multi-modal data (CXR images + reports) via a cross-attention-based multi-modal fusion module and a hierarchical memory-driven decoder. In contrast to previous work that only uses current visit data as input to train a model, our work exploits the longitudinal information available to pre-fill the ‘findings’ section of radiology reports. Experiments show that our approach outperforms several recent approaches by >=3% on F1 score, and >=2% for BLEU-4, METEOR and ROUGE-L respectively. The dataset and code will be made publicly available.
PDF