Performance portable ice-sheet modeling with MALI

Paper title

Performance portable ice-sheet modeling with MALI

Authors

Jerry Watkins, Max Carlson, Kyle Shan, Irina Tezaur, Mauro Perego, Luca Bertagna, Carolyn Kao, Matthew J. Hoffman, Stephen F. Price

Abstract

High-resolution simulations of polar ice sheets play a crucial role in the ongoing effort to develop more accurate and reliable Earth-system models for probabilistic sea-level projections. These simulations often require a massive amount of memory and computation from large supercomputing clusters to provide sufficient accuracy and resolution. The latest exascale machines poised to come online contain a diverse set of computing architectures. In an effort to avoid architecture-specific programming and maintain productivity across platforms, the ice-sheet modeling code known as MALI uses high-level abstractions to integrate Trilinos libraries and the Kokkos programming model for performance-portable code across a variety of different architectures. In this paper, we analyze the performance-portable features of MALI via a performance analysis on current CPU-based and GPU-based supercomputers. The analysis highlights performance-portable improvements made in finite element assembly and multigrid preconditioning within MALI, with speedups between 1.26-1.82x across CPU and GPU architectures, but also identifies the need to further improve performance in software coupling and preconditioning on GPUs. We also perform a weak scalability study and show that simulations on GPU-based machines perform 1.24-1.92x faster when utilizing the GPUs. The best performance is found in finite element assembly, which achieved a speedup of up to 8.65x and a weak scaling efficiency of 82.9% with GPUs. We additionally describe an automated performance testing framework developed for this code base using a changepoint detection method. The framework is used to make actionable decisions about performance within MALI. We provide several concrete examples of scenarios in which the framework has identified performance regressions, improvements, and algorithm differences over the course of two years of development.
