Paper Title

A generalised OMP algorithm for feature selection with application to gene expression data

Authors

Michail Tsagris, Zacharias Papadovasilakis, Kleanthi Lakiotaki, Ioannis Tsamardinos

Abstract

Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To be applicable to molecular data, feature selection algorithms need to scale to tens of thousands of available features. In this paper, we propose gOMP, a highly scalable generalisation of the Orthogonal Matching Pursuit feature selection algorithm in several directions: (a) different types of outcomes, such as continuous, binary, nominal, and time-to-event, (b) different types of predictive models (e.g., linear least squares, logistic regression), (c) different types of predictive features (continuous, categorical), and (d) different, statistically based stopping criteria. We compare the proposed algorithm against LASSO, a prototypical, widely used algorithm for high-dimensional data. On dozens of simulated datasets, as well as real gene expression datasets, gOMP is on par with or outperforms LASSO for case-control binary classification, quantified outcomes (regression), and (censored) survival times (time-to-event) analysis. gOMP also has several theoretical advantages, which are discussed. While gOMP is based on quite simple and basic statistical ideas, and is easy to implement and to generalise, we show in an extensive evaluation that it is also quite effective in bioinformatics analysis settings.
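
To make the greedy idea concrete, the following is a minimal sketch of classical OMP-style feature selection for a continuous outcome with a linear least-squares model. It is not the authors' gOMP implementation: the function names `solve_normal_equations` and `omp_select`, the threshold `tol`, and the stopping rule (stop when the relative drop in the residual sum of squares falls below `tol`) are assumptions made for illustration, standing in for the statistically based stopping criteria the abstract mentions.

```python
import math


def solve_normal_equations(columns, y):
    """Least-squares coefficients for y ~ columns via Gaussian elimination.

    Assumes the columns are linearly independent (no singularity handling).
    """
    k = len(columns)
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
        + [sum(u * v for u, v in zip(columns[i], y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting, then eliminate the entries below the pivot.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (a[i][k] - tail) / a[i][i]
    return beta


def omp_select(features, y, tol=0.05, max_features=None):
    """Greedy OMP-style selection for a continuous outcome.

    features: list of columns (one list of floats per feature).
    tol: hypothetical stopping threshold on the relative drop in RSS.
    Returns the indices of the selected features, in selection order.
    """
    n = len(y)
    limit = len(features) if max_features is None else max_features
    intercept = [1.0] * n
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]
    total_ss = sum(r * r for r in residual)
    rss = total_ss
    selected = []
    while len(selected) < limit and rss > 1e-12 * total_ss:
        # Score each unselected feature by a quantity proportional to its
        # absolute correlation with the current residual.
        best, best_score = None, 0.0
        for j, col in enumerate(features):
            if j in selected:
                continue
            m = sum(col) / n
            norm = math.sqrt(sum((v - m) ** 2 for v in col))
            if norm == 0.0:
                continue  # constant feature carries no information
            score = abs(sum((v - m) * r for v, r in zip(col, residual))) / norm
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        # Refit the model on the intercept, all selected features, and the candidate.
        trial = selected + [best]
        cols = [intercept] + [features[j] for j in trial]
        beta = solve_normal_equations(cols, y)
        new_residual = [
            y[i] - sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)
        ]
        new_rss = sum(r * r for r in new_residual)
        # Assumed stopping rule: stop when the improvement is too small.
        if (rss - new_rss) / rss < tol:
            break
        selected, residual, rss = trial, new_residual, new_rss
    return selected
```

Calling `omp_select(features, y)` with `features` given as a list of columns returns the selected column indices in the order they were added. Generalising this sketch along the lines described in the abstract would mean replacing the least-squares refit and the residual with the corresponding quantities of another model (for example logistic regression), which is not shown here.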
