Paper Title
Variational Reward Estimator Bottleneck: Learning Robust Reward Estimator for Multi-Domain Task-Oriented Dialog
Paper Authors
Paper Abstract
Despite the notable success of adversarial learning approaches for multi-domain task-oriented dialog systems, training the dialog policy via adversarial inverse reinforcement learning often fails to balance the performance of the policy generator and the reward estimator. During optimization, the reward estimator often overwhelms the policy generator and produces excessively uninformative gradients. We propose the Variational Reward estimator Bottleneck (VRB), an effective regularization method that aims to constrain unproductive information flows between the inputs and the reward estimator. The VRB focuses on capturing discriminative features by exploiting an information bottleneck on mutual information. Empirical results on a multi-domain task-oriented dialog dataset demonstrate that the VRB significantly outperforms previous methods.
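
The abstract does not state the VRB objective, so the following is only a minimal sketch of a generic variational information bottleneck penalty, not the authors' formulation. It assumes a hypothetical encoder that maps each reward-estimator input to a diagonal Gaussian over a latent code, a standard normal prior, and a Lagrange multiplier `beta` adjusted by dual gradient ascent. The names `kl_to_standard_normal`, `bottleneck_penalty`, `update_beta`, `info_limit`, and `step_size` are illustrative and do not come from the paper.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Per-example KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims.

    This KL term is a common variational upper bound on the mutual
    information between the input and its latent code.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def bottleneck_penalty(mu, log_var, beta, info_limit):
    """Return (penalty, mean_kl), where penalty = beta * (mean_kl - info_limit).

    The penalty would be added to the reward estimator's loss
    (assumption: a Lagrangian relaxation of the constraint mean_kl <= info_limit).
    """
    mean_kl = float(np.mean(kl_to_standard_normal(mu, log_var)))
    return beta * (mean_kl - info_limit), mean_kl


def update_beta(beta, mean_kl, info_limit, step_size):
    """Dual gradient ascent step on beta, clipped so that beta stays non-negative."""
    return max(0.0, beta + step_size * (mean_kl - info_limit))


if __name__ == "__main__":
    # Hypothetical encoder outputs for a batch of two inputs with a 2-d latent code.
    mu = np.array([[1.0, 0.0], [0.0, 0.0]])
    log_var = np.zeros((2, 2))
    penalty, mean_kl = bottleneck_penalty(mu, log_var, beta=0.1, info_limit=0.2)
    print("mean KL:", mean_kl, "penalty:", penalty)
    print("updated beta:", update_beta(0.1, mean_kl, info_limit=0.2, step_size=1.0))
```

In this sketch, `beta` grows while the estimated information exceeds `info_limit` and shrinks toward zero otherwise, which is one common way to keep an estimator from relying on input features that carry little discriminative signal.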