Paper Title
Acela: Predictable Datacenter-level Maintenance Job Scheduling
Paper Authors
Abstract
Datacenter operators ensure fair and regular server maintenance by using automated processes that schedule maintenance jobs to complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies with both job type and hardware. While it is tempting to use prior machine learning techniques to predict job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that the prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes, because the costs of prediction errors are asymmetric. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it, so the system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale production datacenters. Compared to machine-learning-based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X and reduces server offline time by 1.40-2.80X.
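To illustrate why quantile regression biases predictions toward overprediction, the following is a minimal sketch of the quantile (pinball) loss that quantile regression minimizes. It is not Acela's implementation: the function names, the quantile level of 0.9, and the job durations are hypothetical values chosen only for illustration, and the model here is a single constant prediction rather than a learned predictor.

```python
def pinball_loss(actual, predicted, tau):
    """Quantile (pinball) loss for one job, with tau in (0, 1)."""
    error = actual - predicted
    # Underprediction (error > 0) costs tau per unit of error;
    # overprediction (error < 0) costs (1 - tau) per unit of error.
    return tau * error if error >= 0 else (tau - 1) * error


def best_constant_prediction(durations, tau):
    """Pick the observed duration that minimizes total pinball loss."""
    return min(
        durations,
        key=lambda c: sum(pinball_loss(d, c, tau) for d in durations),
    )


# Hypothetical maintenance job durations in minutes (not from the paper).
durations = [30, 35, 40, 45, 50, 60, 75, 90, 120, 180, 240]

median_prediction = best_constant_prediction(durations, tau=0.5)
biased_prediction = best_constant_prediction(durations, tau=0.9)

print(median_prediction)  # 60: the median, errors weighted symmetrically
print(biased_prediction)  # 180: the higher quantile, biased toward overprediction
```

With tau = 0.9, underpredicting by 20 minutes costs nine times as much as overpredicting by 20 minutes, so the loss-minimizing prediction moves from the median to a higher quantile. This mirrors the asymmetric cost described in the abstract, under the assumptions stated above.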