Paper Title
Adaptive Data Depth via Multi-Armed Bandits
Paper Authors
Paper Abstract
Data depth, introduced by Tukey (1975), is an important tool in data science, robust statistics, and computational geometry. One chief barrier to its broader practical utility is that many common measures of depth are computationally intensive, requiring on the order of $n^d$ operations to exactly compute the depth of a single point within a data set of $n$ points in $d$-dimensional space. Often, however, we are not directly interested in the absolute depths of the points, but rather in their relative ordering. For example, we may want to find the most central point in a data set (a generalized median), or to identify and remove all outliers (points on the fringe of the data set with low depth). Motivated by this observation, we develop a novel instance-adaptive algorithm for data depth computation by reducing the problem of exactly computing $n$ depths to an $n$-armed stochastic multi-armed bandit problem, which we can solve efficiently. We focus our exposition on simplicial depth, developed by Liu (1990), which has emerged as a promising notion of depth due to its interpretability and asymptotic properties. We provide general instance-dependent theoretical guarantees for our proposed algorithms, which readily extend to many other common measures of data depth, including majority depth, Oja depth, and likelihood depth. When specialized to the case where the gaps in the data follow a power law distribution with parameter $\alpha < 2$, we show that we can reduce the complexity of identifying the deepest point in the data set (the simplicial median) from $O(n^d)$ to $\tilde{O}(n^{d-(d-1)\alpha/2})$, where $\tilde{O}$ suppresses logarithmic factors. We corroborate our theoretical results with numerical experiments on synthetic data, showing the practical utility of our proposed methods.
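To make the bandit reduction concrete, the sketch below estimates each point's simplicial depth by Monte Carlo sampling and runs successive elimination over the $n$ candidate points to find the simplicial median. This is an illustrative reading of the abstract, not the authors' algorithm: the function names (`in_simplex`, `pull_arm`, `simplicial_median`), the Hoeffding-style confidence radius, the batch size, and the sampling budget are all assumptions introduced here.

```python
import numpy as np


def in_simplex(x, vertices):
    """Return True if point x lies in the simplex spanned by d+1 vertices.

    Solves for barycentric coordinates lam with x = vertices.T @ lam and
    sum(lam) == 1; x is inside the simplex iff all coordinates are >= 0.
    """
    d = x.shape[0]
    A = np.vstack([vertices.T, np.ones(d + 1)])  # (d+1) x (d+1) system
    b = np.append(x, 1.0)
    try:
        lam = np.linalg.solve(A, b)
    except np.linalg.LinAlgError:  # degenerate (flat) simplex
        return False
    return bool(np.all(lam >= -1e-12))  # small tolerance for round-off


def pull_arm(i, X, rng):
    """One bandit 'arm pull' for point i: sample d+1 other points uniformly
    and report whether X[i] falls in their simplex. The mean of these
    Bernoulli draws is the simplicial depth of X[i] in the sample."""
    n, d = X.shape
    others = np.delete(np.arange(n), i)
    idx = rng.choice(others, size=d + 1, replace=False)
    return float(in_simplex(X[i], X[idx]))


def simplicial_median(X, delta=0.01, batch=100, max_pulls=10 ** 5, seed=0):
    """Successive elimination over n arms to find the deepest point.

    Keeps a running depth estimate and a Hoeffding-style confidence radius
    for each surviving point; a point is dropped once its upper confidence
    bound falls below the best lower confidence bound."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    active = np.arange(n)
    sums = np.zeros(n)   # total of Bernoulli outcomes per point
    pulls = np.zeros(n)  # number of samples drawn per point
    while active.size > 1 and pulls.max() < max_pulls:
        for i in active:
            sums[i] += sum(pull_arm(i, X, rng) for _ in range(batch))
            pulls[i] += batch
        means = sums[active] / pulls[active]
        rad = np.sqrt(np.log(4 * n * pulls[active] ** 2 / delta)
                      / (2 * pulls[active]))
        best_lb = np.max(means - rad)
        active = active[means + rad >= best_lb]  # keep plausible maximizers
    # If several points remain (ties or budget hit), return the best estimate.
    return int(active[np.argmax(sums[active] / pulls[active])])


# Example: 200 points in the plane; the median should sit near the origin.
X = np.random.default_rng(1).normal(size=(200, 2))
print(simplicial_median(X))
```

Each arm pull is a Bernoulli draw whose mean is the point's simplicial depth (the probability that the point lies in a simplex spanned by $d+1$ uniformly drawn data points), so clearly shallow points are eliminated after few samples while sampling effort concentrates on near-ties at the top. This adaptivity to the gaps between depths is the source of the instance-dependent savings the abstract describes.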