Paper Title

Batched matrix operations on distributed GPUs with application in theoretical physics

Paper Authors

Nenad Mijić, Davor Davidović

Paper Abstract

One of the most important and commonly used operations in many linear algebra functions is matrix-matrix multiplication (GEMM), which is also a key component in obtaining high performance in many scientific codes. It is a computationally intensive function requiring $O(n^3)$ operations, and its high computational intensity makes it well suited to significant acceleration on GPUs. Today, many research problems require solving a very large number of relatively small GEMM operations that cannot utilise the entire GPU. To overcome this bottleneck, special functions have been developed that pack several GEMM operations into one and then compute them simultaneously on a GPU; this is called a batched operation. In this work, we propose a different approach based on linking multiple GEMM operations to MPI ranks and then binding multiple MPI ranks to a single GPU. To increase GPU utilisation, more MPI ranks (i.e. GEMM operations) are added. We implement and test this approach in the field of theoretical physics to compute entanglement properties through simulated annealing Monte Carlo simulations of quantum spin chains. For this specific use case, we were able to simulate a much larger spin system and achieve a speed-up of up to $35\times$ compared to the parallel CPU-only version.
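To make the rank-to-GPU binding idea described in the abstract concrete, the following is a minimal, hypothetical C sketch, not the authors' implementation: several MPI ranks are mapped round-robin onto the available GPUs, and each rank issues its own small GEMM through cuBLAS. The matrix size `n`, the `rank % ndev` mapping, and the test data are illustrative assumptions.

```c
/* Hypothetical sketch only: bind several MPI ranks to one GPU and let each
 * rank run its own small GEMM via cuBLAS. Oversubscribing each GPU with
 * ranks is the mechanism for raising utilisation when a single GEMM is too
 * small to fill the device. Matrix size and data are illustrative. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Round-robin mapping of ranks to devices: with more ranks than GPUs,
     * several ranks share one GPU. */
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) { MPI_Abort(MPI_COMM_WORLD, 1); }
    cudaSetDevice(rank % ndev);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const int n = 256;                       /* one small GEMM per rank */
    const size_t bytes = (size_t)n * n * sizeof(double);
    double *hA = (double *)malloc(bytes);
    double *hB = (double *)malloc(bytes);
    double *hC = (double *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0; hB[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* C = alpha * A * B + beta * C, column-major as cuBLAS expects. */
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("rank %d on GPU %d: C[0] = %.1f\n", rank, rank % ndev, hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    cublasDestroy(handle);
    MPI_Finalize();
    return 0;
}
```

In this scheme, launching more ranks than GPUs (for example, eight ranks on a node with two GPUs) is what drives utilisation up; in practice one would typically also enable NVIDIA's Multi-Process Service so that kernels from different ranks can overlap on the same device. The library alternative mentioned in the abstract would instead pack the small matrices into a single call to a batched routine such as cuBLAS's cublasDgemmBatched.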
