Probabilistically Sampled and Spectrally Clustered Plant Genotypes using Phenotypic Characteristics

Paper Title

Probabilistically Sampled and Spectrally Clustered Plant Genotypes using Phenotypic Characteristics

Authors

Shastri, Aditya A., Ahuja, Kapil, Ratnaparkhe, Milind B., Busnel, Yann

Abstract

Clustering genotypes based upon their phenotypic characteristics is used to obtain diverse sets of parents that are useful in breeding programs. The Hierarchical Clustering (HC) algorithm is the current standard for clustering phenotypic data. This algorithm suffers from low accuracy and high computational complexity. To address the accuracy challenge, we propose the use of the Spectral Clustering (SC) algorithm. To make the algorithm computationally cheap, we propose using sampling, specifically Pivotal Sampling, which is probability based. Since the application of sampling to phenotypic data has not been explored much, for effective comparison, another sampling technique called Vector Quantization (VQ) is adapted for this data as well. VQ has recently given promising results for genome data. The novelty of our SC with Pivotal Sampling algorithm lies in constructing the crucial similarity matrix for the clustering algorithm and in defining the probabilities for the sampling technique. Although our algorithm can be applied to any plant genotypes, we test it on phenotypic data obtained from about 2400 Soybean genotypes. SC with Pivotal Sampling achieves substantially higher accuracy (in terms of Silhouette Values) than all the other proposed competing clustering-with-sampling algorithms (i.e., SC with VQ, HC with Pivotal Sampling, and HC with VQ). The complexities of our SC with Pivotal Sampling algorithm and these three variants are almost the same because of the sampling involved. In addition, SC with Pivotal Sampling outperforms the standard HC algorithm in both accuracy and computational complexity. We experimentally show that our algorithm is up to 45% more accurate than HC. The computational complexity of our algorithm is more than an order of magnitude lower than that of HC.
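
To make the sampling step in the abstract more concrete, the following is a minimal sketch of the generic sequential pivotal method, not the authors' implementation. It takes one inclusion probability per genotype and resolves units pairwise until each probability is 0 or 1. The function name `pivotal_sample`, the `seed` and `eps` parameters, and the uniform probabilities in the usage note are illustrative assumptions; the abstract does not state how the paper defines its probabilities.

```python
import random


def pivotal_sample(probs, seed=None, eps=1e-12):
    """Sequential pivotal sampling (generic sketch, hypothetical helper).

    probs: inclusion probabilities in [0, 1], one per unit. If they sum to
    an integer n, the returned sample has exactly n units.
    Returns the sorted indices of the selected units.
    """
    rng = random.Random(seed)
    p = [float(x) for x in probs]
    i = 0  # index of the currently "active" (still undecided) unit
    for j in range(1, len(p)):
        a, b = p[i], p[j]
        if a <= eps or a >= 1 - eps:
            i = j  # active unit is already decided; unit j takes over
            continue
        if b <= eps or b >= 1 - eps:
            continue  # unit j is already decided; unit i stays active
        s = a + b
        if s < 1:
            # One unit is rejected; the other carries the combined mass.
            if rng.random() < b / s:
                p[i], p[j] = 0.0, s
                i = j
            else:
                p[i], p[j] = s, 0.0
        else:
            # One unit is selected; the other keeps the leftover mass.
            if rng.random() < (1 - b) / (2 - s):
                p[i], p[j] = 1.0, s - 1
                i = j
            else:
                p[i], p[j] = s - 1, 1.0
    return [k for k, v in enumerate(p) if v > 0.5]
```

As a usage example under the same assumptions, `pivotal_sample([0.3] * 10, seed=0)` draws 3 of 10 units, each with inclusion probability 0.3. In a pipeline like the one the abstract describes, the selected indices would pick the genotypes passed on to the clustering step.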
