Paper Title
End-to-End Predictions-Based Resource Management Framework for Supercomputer Jobs
Paper Authors
Abstract
Job submissions of parallel applications to production supercomputer systems must be carefully tuned in their submission parameters to obtain minimum response times. In this work, we have developed an end-to-end resource management framework that uses predictions of queue waiting times and execution times to minimize the response times of user jobs submitted to supercomputer systems. Our method for predicting queue waiting times adaptively chooses a prediction method based on the cluster structure of similar jobs. Our strategy for predicting execution times dynamically learns the impact of load on execution times and uses it to predict a set of execution time ranges for the target job. We have developed two resource management techniques that employ these predictions: one selects the number of processors for execution, and the other additionally changes the job submission time dynamically. Using workload simulations of large supercomputer traces, we show large improvements in predictions and reductions in response times over existing techniques and baseline strategies.
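The processor-selection idea in the abstract can be illustrated with a minimal sketch. It assumes that response time is the sum of queue waiting time and execution time, and that a predictor for each is available as a function of the requested processor count. The names `pick_processor_count`, `toy_wait`, and `toy_run` are hypothetical, and the toy predictors are invented stand-ins, not the prediction methods of the paper.

```python
from typing import Callable, Iterable, Tuple


def pick_processor_count(
    candidates: Iterable[int],
    predict_wait: Callable[[int], float],
    predict_run: Callable[[int], float],
) -> Tuple[int, float]:
    """Return (processors, predicted response time) for the candidate
    with the smallest predicted response time.

    Assumption: response time = predicted queue wait + predicted run time.
    """
    best = None
    for p in candidates:
        response = predict_wait(p) + predict_run(p)
        if best is None or response < best[1]:
            best = (p, response)
    if best is None:
        raise ValueError("no candidate processor counts given")
    return best


# Toy stand-in predictors, invented for illustration only.
def toy_wait(p: int) -> float:
    return 2.0 * p       # larger requests are assumed to wait longer


def toy_run(p: int) -> float:
    return 3600.0 / p    # ideal speedup is assumed


choice = pick_processor_count([16, 32, 64, 128], toy_wait, toy_run)
print(choice)
```

With these toy predictors the smallest sum occurs at 32 processors: asking for more shortens the run but lengthens the predicted wait by more than it saves. A real deployment would replace both functions with predictors trained on the system's job history.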