Paper Title
Implicit Kernel Attention
Paper Authors
Paper Abstract
\textit{Attention} computes dependencies between representations and encourages the model to focus on important, selective features. Attention-based models, such as the Transformer and the graph attention network (GAT), are widely used for sequential data and graph-structured data. This paper proposes a new interpretation and a generalized structure of the attention in the Transformer and GAT. For the attention in the Transformer and GAT, we derive that the attention weight is a product of two parts: 1) an RBF kernel that measures the similarity of two instances and 2) the exponential of an $L^{2}$ norm that computes the importance of individual instances. From this decomposition, we generalize the attention in three ways. First, we propose implicit kernel attention, which uses an implicit kernel function instead of a manually selected kernel. Second, we generalize the $L^{2}$ norm to the $L^{p}$ norm. Third, we extend our attention to structured multi-head attention. Our generalized attention shows better performance on classification, translation, and regression tasks.
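To make the decomposition concrete, the following is a minimal worked identity for scaled dot-product attention; the exact scaling convention is an assumption here and the paper's derivation may differ in constants. For a query $q$ and key $k$ with scaling factor $\sqrt{d}$, the unnormalized attention weight factors as
\[
\exp\!\left(\frac{q^{\top}k}{\sqrt{d}}\right)
= \exp\!\left(-\frac{\lVert q-k\rVert_{2}^{2}}{2\sqrt{d}}\right)
\cdot \exp\!\left(\frac{\lVert q\rVert_{2}^{2}+\lVert k\rVert_{2}^{2}}{2\sqrt{d}}\right),
\]
which follows from $q^{\top}k = \tfrac{1}{2}\bigl(\lVert q\rVert_{2}^{2}+\lVert k\rVert_{2}^{2}-\lVert q-k\rVert_{2}^{2}\bigr)$. The first factor is an RBF kernel measuring the similarity of $q$ and $k$, and the second factor depends only on the individual norms, acting as an importance term. Replacing the RBF kernel with an implicit kernel and the $L^{2}$ norm with an $L^{p}$ norm corresponds to the generalizations described in the abstract.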