Paper Title
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Paper Authors
Paper Abstract
Large-scale plasma simulations are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have successfully simulated numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. In contrast to the domain decomposition used in CPU-based implementations, we introduce a particle decomposition data layout that uses particle batches to overlap communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speedup on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement over the sputniPIC CPU OpenMP version. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously only possible using clusters.
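To make the particle decomposition idea concrete, the following is a minimal host-side sketch of how particles might be divided among GPUs and then into batches. It is not taken from sputniPIC: the names `Batch` and `split_particles`, the even split across devices, and the fixed `batch_size` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Batch:
    """A contiguous range of particle indices (hypothetical structure)."""
    device: int  # index of the GPU assumed to own this batch
    start: int   # first particle index, inclusive
    stop: int    # last particle index, exclusive


def split_particles(n_particles: int, n_devices: int, batch_size: int) -> List[Batch]:
    """Split particles evenly across devices, then into fixed-size batches.

    Assumption: particles are decomposed by index, not by spatial domain,
    so every device may hold particles from anywhere in the grid.
    """
    if n_particles < 0 or n_devices <= 0 or batch_size <= 0:
        raise ValueError("expected n_particles >= 0, n_devices > 0, batch_size > 0")

    batches: List[Batch] = []
    base, extra = divmod(n_particles, n_devices)
    offset = 0
    for device in range(n_devices):
        # The first `extra` devices take one additional particle each.
        count = base + (1 if device < extra else 0)
        end = offset + count
        # Cut this device's share into batches; the last one may be shorter.
        for start in range(offset, end, batch_size):
            batches.append(Batch(device, start, min(start + batch_size, end)))
        offset = end
    return batches


if __name__ == "__main__":
    for batch in split_particles(n_particles=10, n_devices=2, batch_size=3):
        print(batch)
```

In a GPU implementation, each batch would typically be transferred and processed on its own stream so that the copy of one batch overlaps with the computation on another. That scheduling step is omitted here, and how sputniPIC actually performs it is not specified in the abstract.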