Paper Title
Performance portability through machine learning guided kernel selection in SYCL libraries
Paper Authors
Paper Abstract
Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware; however, these techniques are typically focused on finding optimal kernel parameters for particular input sizes and problem parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so these techniques are of limited use. Additionally, parallel programming frameworks such as SYCL require that the kernels be deployed in a binary format embedded within the library. As such, it is impractical to deploy a large number of possible kernel configurations without inflating the library size. Machine learning methods can be used to mitigate both of these problems and provide performance for general purpose routines with a limited number of kernel configurations. We show that unsupervised clustering methods can be used to select a subset of the possible kernels that should be deployed, and that simple classification methods can be trained to select from these kernels at runtime to give good performance. As these techniques are fully automated, relying only on benchmark data, the tuning process for new hardware or problems does not require any developer effort or expertise.
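The two-stage approach described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual pipeline: the benchmark data, feature choice (problem dimensions), cluster count, and nearest-centroid dispatch are all assumptions made for the example; the paper may use different clustering and classification methods.

```python
# Hypothetical sketch of ML-guided kernel selection:
#   Stage 1 (offline): cluster benchmarked problem sizes and deploy only
#     the fastest kernel configuration per cluster, instead of every
#     candidate configuration (keeping the library binary small).
#   Stage 2 (runtime): classify an incoming problem to the nearest cluster
#     and dispatch the kernel deployed for it.
# All data and names below are illustrative.
import math
import random

# Illustrative benchmark results: problem size -> id of the fastest kernel
# configuration out of (say) 32 candidates.
benchmarks = [
    ((64.0,   64.0),    3),
    ((128.0,  128.0),   3),
    ((256.0,  256.0),  11),
    ((512.0,  512.0),  11),
    ((1024.0, 1024.0), 27),
    ((4096.0, 4096.0), 27),
]

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means over feature vectors; returns the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Stage 1: cluster the benchmarked problem sizes.
features = [f for f, _ in benchmarks]
centroids = kmeans(features, k=3)

def best_kernel_near(centroid):
    """Deploy the kernel that was fastest for the benchmark nearest the centroid."""
    _, kernel_id = min(benchmarks, key=lambda b: math.dist(b[0], centroid))
    return kernel_id

deployed = [best_kernel_near(c) for c in centroids]  # only 3 kernels shipped

# Stage 2: nearest-centroid classification at runtime.
def select_kernel(rows, cols):
    i = min(range(len(centroids)),
            key=lambda c: math.dist((rows, cols), centroids[c]))
    return deployed[i]

print(select_kernel(100, 100))
```

The same shape generalises: any clustering method can pick the deployed subset, and any cheap classifier (decision tree, nearest centroid) can do the runtime dispatch, since only benchmark data is needed to train both stages.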