2024-01-11 更新

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding

Authors:Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, Peinan Zhang

One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed for generating diverse outputs are predominantly based on beam search or random sampling, thus their output quality is capped by these underlying methods. In this paper, we investigate an alternative approach — we develop diversity-promoting decoding algorithms by enforcing diversity objectives to MBR decoding. We propose two variants of MBR, Diverse MBR (DMBR) and $k$-medoids MBR (KMBR), methods to generate a set of sentences with high quality and diversity. We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a large language model with prompting. The experimental results show that the proposed method achieves a better trade-off than the diverse beam search and sampling algorithms.


Aligning Translation-Specific Understanding to General Understanding in Large Language Models

Authors:Yichong Huang, Xiaocheng Feng, Baohang Li, Chengpeng Fu, Wenshuai Huo, Ting Liu, Bing Qin

Although large language models (LLMs) have shown surprising language understanding and generation capabilities, they have yet to gain a revolutionary advancement in the field of machine translation. One potential cause of the limited performance is the misalignment between the translation-specific understanding and general understanding inside LLMs. To align the translation-specific understanding to the general one, we propose a novel translation process xIoD (Cross-Lingual Interpretation of Difficult words), explicitly incorporating the general understanding on the content incurring inconsistent understanding to guide the translation. Specifically, xIoD performs the cross-lingual interpretation for the difficult-to-translate words and enhances the translation with the generated interpretations. Furthermore, we reframe the external tools of QE to tackle the challenges of xIoD in the detection of difficult words and the generation of helpful interpretations. We conduct experiments on the self-constructed benchmark ChallengeMT, which includes cases in which multiple SOTA translation systems consistently underperform. Experimental results show the effectiveness of our xIoD, which improves up to +3.85 COMET.
PDF work in progress


BELHD: Improving Biomedical Entity Linking with Homonoym Disambiguation

Authors:Samuele Garda, Ulf Leser

Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). A popular approach to the task are name-based methods, i.e. those identifying the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, as these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn (Sung et al.,2020) model introducing two crucial extensions. First, it performs a preprocessing of the KB in which it expands homonyms with an automatically chosen disambiguating string, thus enforcing unique linking decisions. Second, we introduce candidate sharing, a novel strategy to select candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and thus can also improve other methods, which we exemplify for GenBioEL (Yuan et al, 2022), a generative name-based BEL approach. Code is available at: link added upon publication.


Divide and Conquer for Large Language Models Reasoning

Authors:Zijie Meng, Yan Zhang, Zhaopeng Feng, Yang Feng, Gaoang Wang, Joey Tianyi Zhou, Jian Wu, Zuozhu Liu

Large language models (LLMs) have shown impressive performance in various reasoning benchmarks with the emergence of Chain-of-Thought (CoT) and its derivative methods, particularly in tasks involving multi-choice questions (MCQs). However, current works all process data uniformly without considering the problem-solving difficulty, which means an excessive focus on simple questions while insufficient to intricate ones. To address this challenge, we inspired by humans using heuristic strategies to categorize tasks and handle them individually, propose to apply the Divide and Conquer to LLMs reasoning. First, we divide questions into different subsets based on the statistical confidence score ($\mathcal{CS}$), then fix nearly resolved sets and conquer demanding nuanced process ones with elaborately designed methods, including Prior Knowledge based Reasoning (PKR) and Filter Choices based Reasoning (FCR), as well as their integration variants. Our experiments demonstrate that this proposed strategy significantly boosts the models’ reasoning abilities across nine datasets involving arithmetic, commonsense, and logic tasks. For instance, compared to baseline, we make a striking improvement on low confidence subsets of 8.72\% for AQuA, 15.07\% for ARC Challenge and 7.71\% for RiddleSense. In addition, through extensive analysis on length of rationale and number of options, we verify that longer reasoning paths in PKR could prevent models from referring infer-harmful shortcuts, and also find that removing irrelevant choices in FCR would substantially avoid models’ confusion. The code is at \url{https://github.com/AiMijie/Divide-and-Conquer}
PDF Technique Report


Pre-trained Large Language Models for Financial Sentiment Analysis

Authors:Wei Luo, Dihong Gong

Financial sentiment analysis refers to classifying financial text contents into sentiment categories (e.g. positive, negative, and neutral). In this paper, we focus on the classification of financial news title, which is a challenging task due to a lack of large amount of training samples. To overcome this difficulty, we propose to adapt the pretrained large language models (LLMs) [1, 2, 3] to solve this problem. The LLMs, which are trained from huge amount of text corpora,have an advantage in text understanding and can be effectively adapted to domain-specific task while requiring very few amount of training samples. In particular, we adapt the open-source Llama2-7B model (2023) with the supervised fine-tuning (SFT) technique [4]. Experimental evaluation shows that even with the 7B model (which is relatively small for LLMs), our approach significantly outperforms the previous state-of-the-art algorithms.


CASA: Causality-driven Argument Sufficiency Assessment

Authors:Xiao Liu, Yansong Feng, Kai-Wei Chang

The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion. To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the probability of sufficiency (PS) definition in the causal literature, we propose CASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion, when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion, and revise them by injecting the premise event. Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at https://github.com/xxxiaol/CASA.
PDF Project website: https://xxxiaol.github.io/CASA/


AUTOACT: Automatic Agent Learning from Scratch via Self-Planning

Authors:Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, Huajun Chen

Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data and synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrates that AutoAct yields better or parallel performance compared to various strong baselines. We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the GPT-3.5-Turbo agent. Code will be available at https://github.com/zjunlp/AutoAct.
PDF Work in progress


I am a Strange Dataset: Metalinguistic Tests for Language Models

Authors:Tristan Thrush, Jared Moore, Miguel Monares, Christopher Potts, Douwe Kiela

Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT 4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score well in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset.


文章作者: 木子已
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 木子已 !