Classification at the Accuracy Limit -- Facing the Problem of Data Ambiguity

Paper title

Classification at the Accuracy Limit -- Facing the Problem of Data Ambiguity

Authors

Claus Metzner, Achim Schilling, Maximilian Traxdorf, Konstantin Tziridis, Holger Schulze, Patrick Krauss

Abstract

Data classification, the process of analyzing data and organizing it into categories, is a fundamental computing problem of natural and artificial information processing systems. Ideally, the performance of classifier models would be evaluated using unambiguous data sets, where the 'correct' assignment of category labels to the input data vectors is unequivocal. In real-world problems, however, a significant fraction of actually occurring data vectors will be located in a boundary zone between or outside of all categories, so that perfect classification cannot even in principle be achieved. We derive the theoretical limit for classification accuracy that arises from the overlap of data categories. By using a surrogate data generation model with adjustable statistical properties, we show that sufficiently powerful classifiers based on completely different principles, such as perceptrons and Bayesian models, all perform at this universal accuracy limit. Remarkably, the accuracy limit is not affected by applying non-linear transformations to the data, even if these transformations are non-reversible and drastically reduce the information content of the input data. We compare emerging data embeddings produced by supervised and unsupervised training, using MNIST and human EEG recordings during sleep. We find that categories are not only well separated in the final layers of classifiers trained with back-propagation, but to a smaller degree also after unsupervised dimensionality reduction. This suggests that human-defined categories, such as hand-written digits or sleep stages, can indeed be considered as 'natural kinds'.
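
As a rough illustration of the accuracy limit described in the abstract, the sketch below uses a toy setting that is an assumption of this example and not the surrogate data model from the paper: two equally likely classes drawn from one-dimensional Gaussians with a shared standard deviation. In that setting the best achievable accuracy has the closed form Phi(separation / (2 * sigma)), where Phi is the standard normal CDF. The function names `bayes_accuracy_limit` and `simulated_accuracy` are hypothetical. The transforms shown were chosen because they preserve which side of the decision boundary a point lies on; an arbitrary transform would not necessarily leave the accuracy unchanged.

```python
import math
import random


def bayes_accuracy_limit(separation: float, sigma: float = 1.0) -> float:
    """Best achievable accuracy for two equally likely 1D Gaussian classes
    with means -separation/2 and +separation/2 and shared std sigma
    (toy assumption, not the paper's data model)."""
    z = separation / (2.0 * sigma)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulated_accuracy(separation, sigma=1.0, n=100_000, transform=None, seed=0):
    """Monte Carlo accuracy of a threshold classifier applied after an
    optional transform of the raw data."""
    rng = random.Random(seed)
    transform = transform or (lambda x: x)
    # The optimal boundary in raw space is x = 0; map it through the transform.
    threshold = transform(0.0)
    correct = 0
    for _ in range(n):
        label = rng.randrange(2)
        mean = separation / 2.0 if label == 1 else -separation / 2.0
        x = transform(rng.gauss(mean, sigma))
        correct += int((x > threshold) == (label == 1))
    return correct / n


def sign(x: float) -> float:
    """Non-reversible transform: keeps only the sign of x."""
    return math.copysign(1.0, x) if x != 0 else 0.0


if __name__ == "__main__":
    d = 2.0
    print("theoretical limit:", bayes_accuracy_limit(d))
    print("raw data:         ", simulated_accuracy(d))
    print("after tanh:       ", simulated_accuracy(d, transform=math.tanh))
    print("after sign:       ", simulated_accuracy(d, transform=sign))
```

In this toy case the simulated accuracy stays close to the theoretical value no matter how much the classes are separated by the classifier, and it is the same for the raw data, the `tanh`-squashed data, and the sign-only data, even though the last transform discards almost all information about the input.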
