Paper Title
PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers
Paper Authors
Paper Abstract
In cloud machine learning (ML) inference systems, providing low latency to end-users is of utmost importance. However, maximizing server utilization and system throughput is also crucial for ML service providers, as it helps lower the total cost of ownership. GPUs have often been criticized for ML inference usage because their massive compute and memory throughput is hard to fully utilize under low-batch inference scenarios. To address this limitation, NVIDIA's recently announced Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions". This feature gives cloud ML service providers the ability to use the reconfigurable GPU not only for large-batch training but also for small-batch inference, with the potential to achieve high resource utilization. In this paper, we study this emerging reconfigurable GPU architecture to develop a high-performance multi-GPU ML inference server. Our first proposition is a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions best suited for the inference server's deployment. Furthermore, we co-design an elastic scheduling algorithm, tailored to our heterogeneously partitioned GPU server, that effectively balances low latency and high GPU utilization.
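As a rough illustration of the two ideas the abstract sketches, below is a minimal Python mock-up of (a) choosing a heterogeneous partition mix for one reconfigurable GPU and (b) elastically routing request batches across the resulting partitions. The MIG profile names and their compute-slice counts match NVIDIA's A100 configuration, but the throughput table, the `service_ms` latency model, and the greedy/smallest-fit heuristics are invented placeholders; this is not the paper's actual PARIS or ELSA algorithm.

```python
from dataclasses import dataclass

# A100 MIG partition profiles and their compute-slice cost (out of 7 slices
# per GPU). The profile names and slice counts are real; everything below
# that uses them is illustrative.
PROFILES = {
    "1g.5gb": 1,
    "2g.10gb": 2,
    "3g.20gb": 3,
    "4g.20gb": 4,
    "7g.40gb": 7,
}

@dataclass
class Partition:
    name: str
    slices: int
    busy_until: float = 0.0  # time (seconds) when this partition frees up

def choose_partition_mix(throughput, total_slices=7):
    """Toy stand-in for a PARIS-style partitioner: greedily pick the profile
    with the best throughput-per-slice until the GPU's slice budget is
    spent. `throughput` maps profile name -> queries/sec obtained from
    (hypothetical) offline profiling."""
    mix, remaining = [], total_slices
    ranked = sorted(PROFILES, key=lambda p: throughput[p] / PROFILES[p],
                    reverse=True)
    for name in ranked:
        while PROFILES[name] <= remaining:
            mix.append(Partition(name, PROFILES[name]))
            remaining -= PROFILES[name]
    return mix

def service_ms(name, batch):
    # Assumed latency model: service time shrinks linearly with slice count.
    return 4.0 * batch / PROFILES[name]

def dispatch(mix, batch, now, sla_ms):
    """Toy stand-in for an ELSA-style scheduler: route a batch to the
    smallest partition that still meets the latency SLA (queueing delay
    plus service time); if none can, take the earliest-available one."""
    for p in sorted(mix, key=lambda p: p.slices):
        start = max(now, p.busy_until)
        latency = (start - now) * 1e3 + service_ms(p.name, batch)
        if latency <= sla_ms:
            p.busy_until = start + service_ms(p.name, batch) / 1e3
            return p, latency
    p = min(mix, key=lambda p: p.busy_until)
    start = max(now, p.busy_until)
    latency = (start - now) * 1e3 + service_ms(p.name, batch)
    p.busy_until = start + service_ms(p.name, batch) / 1e3
    return p, latency

if __name__ == "__main__":
    # Hypothetical profiled throughput (queries/sec) per partition type.
    tput = {"1g.5gb": 700, "2g.10gb": 1400, "3g.20gb": 2400,
            "4g.20gb": 2400, "7g.40gb": 3500}
    mix = choose_partition_mix(tput)
    print("partition mix:", [p.name for p in mix])  # e.g. two 3g + one 1g

    p, lat = dispatch(mix, batch=8, now=0.0, sla_ms=20.0)
    print(f"batch of 8 -> {p.name}, est. latency {lat:.1f} ms")
```

The smallest-fit dispatch rule captures the latency/utilization tension the abstract describes: small batches are steered to small partitions so the larger ones stay free for heavier work, keeping overall slice utilization high without breaching the latency SLA.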