Paper Title
Copula Graphical Models for Heterogeneous Mixed Data
Paper Authors
Paper Abstract
This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. The i.i.d. assumption is therefore unrealistic, and fitting a single graphical model to all the data yields a network that does not accurately represent the between-group differences. In addition, real-world observational data are typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, so that such models cannot adequately recover the underlying graph structure. The proposed model accounts for these properties by using the Gaussian copula to treat the observed data as transformed latent Gaussian data, thereby retaining the attractive properties of the Gaussian distribution, such as estimating the optimal number of model parameters using the inverse covariance matrix. The multi-group setting is addressed by jointly fitting a graphical model for each group and applying the fused group penalty to fuse similar graphs together. In an extensive simulation study, the proposed model is evaluated against alternative models and is better able to recover the true underlying graph structure for the different groups. Finally, the proposed model is applied to real production-ecological data pertaining to on-farm maize yield, in order to showcase the added value of the proposed method in generating new hypotheses for production ecologists.
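To make the latent-Gaussian idea in the abstract concrete, the following is a minimal sketch of a rank-based normal-scores transform for one continuous and one discrete column. It is a simplified stand-in for illustration, not the estimator used in the paper: the function names `normal_scores` and `pearson`, the rescaling by `n + 1`, and the toy columns `yield_t_ha` and `weeding_count` are all assumptions made here, and the treatment of ties by average rank is only a rough approximation of how a Gaussian copula model handles discrete variables.

```python
from math import sqrt
from statistics import NormalDist


def normal_scores(values):
    """Map a column to approximate latent Gaussian scores via its ranks.

    Tied values (common in discrete columns) share their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next sorted value is tied with this one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the block
        for pos in range(start, end + 1):
            ranks[order[pos]] = avg_rank
        start = end + 1
    std_normal = NormalDist()
    # Dividing by n + 1 keeps every pseudo-observation strictly inside (0, 1).
    return [std_normal.inv_cdf(r / (n + 1)) for r in ranks]


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical toy data: one continuous and one discrete (count) variable.
yield_t_ha = [1.2, 3.4, 2.2, 5.1, 4.0, 2.9]
weeding_count = [0, 2, 1, 3, 2, 1]

z_yield = normal_scores(yield_t_ha)
z_weeding = normal_scores(weeding_count)

# Correlation on the latent (transformed) scale, computed for a single group.
latent_corr = pearson(z_yield, z_weeding)
print(round(latent_corr, 3))
```

In a multi-group analysis, a transform of this kind would be applied within each group separately, and the resulting per-group correlation matrices would then be passed to a jointly penalized inverse covariance estimator; that joint estimation step is not shown here.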