Paper Title

Exploiting Class Labels to Boost Performance on Embedding-based Text Classification

Authors

Zubiaga, Arkaitz

Abstract

Text classification is one of the most frequent tasks for processing textual data, facilitating among others research from large-scale datasets. Embeddings of different kinds have recently become the de facto standard as features used for text classification. These embeddings have the capacity to capture meanings of words inferred from occurrences in large external collections. While they are built out of external collections, they are unaware of the distributional characteristics of words in the classification dataset at hand, including most importantly the distribution of words across classes in training data. To make the most of these embeddings as features and to boost the performance of classifiers using them, we introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which can weight high-frequency, category-exclusive words higher when computing word embeddings. Our experiments on eight datasets show the effectiveness of TF-CR, leading to improved performance scores over the well-known weighting schemes TF-IDF and KLD as well as over the absence of a weighting scheme in most cases.
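The abstract describes TF-CR as a scheme that upweights words that are both frequent in a class and largely exclusive to it when averaging word embeddings. A minimal sketch of such a scheme is below; the exact formulation (TF as the word's share of tokens in the class, CR as the share of the word's occurrences falling in that class) is an assumption for illustration, not necessarily the paper's precise definition.

```python
from collections import Counter

def tfcr_weights(docs, labels):
    """Per-class word weights from tokenized training docs.

    Assumed formulation: TF-CR(w, c) = TF * CR, where
    TF = count of w in class c / total tokens in class c, and
    CR = count of w in class c / count of w over all classes.
    A word scores high when it is frequent in c (TF) and mostly
    exclusive to c (CR).
    """
    class_counts = {}           # class label -> Counter of word counts
    total_counts = Counter()    # word -> count over the whole corpus
    for tokens, label in zip(docs, labels):
        class_counts.setdefault(label, Counter()).update(tokens)
        total_counts.update(tokens)
    weights = {}
    for label, counts in class_counts.items():
        n_tokens = sum(counts.values())
        weights[label] = {
            w: (c / n_tokens) * (c / total_counts[w])
            for w, c in counts.items()
        }
    return weights

def weighted_doc_embedding(tokens, word_vectors, w):
    """TF-CR-weighted average of word vectors for one document."""
    num, denom = None, 0.0
    for t in tokens:
        if t in word_vectors and t in w:
            vec = [w[t] * x for x in word_vectors[t]]
            num = vec if num is None else [a + b for a, b in zip(num, vec)]
            denom += w[t]
    if num is None or denom == 0.0:
        return None  # no known, weighted words in this document
    return [x / denom for x in num]
```

In a classifier pipeline, one weight table would be computed per class from training data only, yielding one weighted document embedding per class as features, with unweighted averaging as the baseline the abstract compares against.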
