Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

Paper title

Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language

Authors

Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman Meuschke, Bela Gipp

Abstract

In this paper, we show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content. We demonstrate this by using sets of documents, sections, and abstracts from the arXiv preprint server that are labeled by their subject class (mathematics, computer science, physics, etc.) to compare different encodings of text and formulae and evaluate the performance and runtimes of selected classification and clustering algorithms. Our encodings achieve classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$ (number of clusters equals number of classes), and $99.9\%$ (unspecified number of clusters) respectively. We observe a relatively low correlation between text and math similarity, which indicates the independence of text and formulae and motivates treating them as separate features of a document. The classification and clustering can be employed, e.g., for document search and recommendation. Furthermore, we show that the computer outperforms a human expert when classifying documents. Finally, we evaluate and discuss multi-label classification and formula semantification.
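The cluster purity figures quoted in the abstract can be illustrated with a small sketch. It assumes the common textbook definition of purity (each cluster is credited with its majority class, and the credited items are divided by the total number of items); the abstract does not state the exact definition the authors used. The function name `cluster_purity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, class_labels):
    """Fraction of items that carry the majority class label of their cluster.

    Assumes the standard purity definition; not taken from the paper itself.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("need at least one item")

    # Count how often each class label occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Credit every cluster with the size of its largest class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, three clusters, arXiv-style subject labels.
clusters = [0, 0, 0, 1, 1, 2]
labels = ["math", "math", "cs", "physics", "physics", "cs"]
purity = cluster_purity(clusters, labels)
print(f"purity = {purity:.3f}")
```

Under this definition, purity can only stay the same or rise as clusters are split further, and a clustering with one document per cluster is trivially pure. This may be part of why the reported purity is much higher when the number of clusters is left unspecified than when it is fixed to the number of classes.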
