Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems

Paper title

Optimised allgatherv, reduce_scatter and allreduce communication in message-passing systems

Authors

Andreas Jocksch, Noe Ohana, Emmanuel Lanti, Vasileios Karakasis, Laurent Villard

Abstract

Collective communications, namely the patterns allgatherv, reduce_scatter, and allreduce in message-passing systems, are optimised based on measurements taken at the installation time of the library. The algorithms used are set up in an initialisation phase of the communication, similar to the method used in so-called persistent collective communication introduced in the literature. For allgatherv and reduce_scatter, the existing algorithms, recursive multiply/divide and cyclic shift (Bruck's algorithm), are applied with a flexible number of communication ports per node. The algorithms for equal message sizes are used with non-equal message sizes together with a heuristic for rank reordering. The two communication patterns are applied in a plasma physics application that uses a specialised matrix-vector multiplication. For the allreduce pattern, the cyclic shift algorithm is applied with a prefix operation. The data is gathered and scattered by the cores within the node, and the communication algorithms are applied across the nodes. In general, our routines outperform the non-persistent counterparts in established MPI libraries by up to one order of magnitude or show equal performance, with a few exceptions for particular numbers of nodes and message sizes.
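
To make the cyclic shift (Bruck) pattern named in the abstract concrete, the following is a minimal sketch that simulates an allgather across `p` ranks in plain Python. It is not the authors' implementation: it assumes a single communication port per rank and a doubling shift distance, whereas the paper uses a flexible number of ports per node. The function name `bruck_allgather` and the list-of-buffers representation are illustrative assumptions, and no MPI library is involved.

```python
def bruck_allgather(blocks):
    """Simulate a Bruck (cyclic shift) allgather.

    blocks[r] is the data contributed by rank r; blocks may differ in size,
    as in allgatherv. Returns (result, rounds), where result[r] is the list
    of all blocks in rank order as seen by rank r.
    Assumption: one port per rank, shift distance doubling each round.
    """
    p = len(blocks)
    # Each rank starts with only its own block.
    buffers = [[blocks[r]] for r in range(p)]
    distance = 1
    rounds = 0
    while distance < p:
        # In the last round fewer than `distance` blocks may still be missing.
        count = min(distance, p - distance)
        # All ranks send at the same time, so snapshot the outgoing data first.
        outgoing = [buf[:count] for buf in buffers]
        for r in range(p):
            source = (r + distance) % p  # rank r receives from rank r + distance
            buffers[r].extend(outgoing[source])
        distance *= 2
        rounds += 1
    # buffers[r][i] now holds the block of rank (r + i) mod p;
    # a local rotation restores rank order.
    result = []
    for r in range(p):
        ordered = [None] * p
        for i, block in enumerate(buffers[r]):
            ordered[(r + i) % p] = block
        result.append(ordered)
    return result, rounds


if __name__ == "__main__":
    gathered, rounds = bruck_allgather(["a", "bb", "ccc", "d", "ee"])
    print(rounds)
    print(gathered[3])
```

In this sketch the number of rounds grows logarithmically with the number of ranks, and blocks of unequal size travel through the same schedule as equal-size blocks, which is the situation the rank-reordering heuristic in the abstract is meant to improve.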
