Paper Title

ReLeaSER: A Reinforcement Learning Strategy for Optimizing Utilization of Ephemeral Cloud Resources

Paper Authors

Mohamed Handaoui, Jean-Emile Dartois, Jalil Boukhobza, Olivier Barais, Laurent d'Orazio

Abstract

Cloud data center capacities are over-provisioned to handle demand peaks and hardware failures, which leads to low resource utilization. One way to improve resource utilization, and thus reduce the total cost of ownership, is to offer unused resources (referred to as ephemeral resources) at a lower price. However, reselling resources must still meet customers' expectations in terms of Quality of Service. The goal is therefore to maximize the amount of reclaimed resources while avoiding SLA penalties. To achieve that, cloud providers (CPs) have to estimate their future utilization in order to provide availability guarantees. The prediction should include a safety margin so that resources remain available to react to unpredictable workloads. The challenge is to find the safety margin that provides the best trade-off between the amount of resources to reclaim and the risk of SLA violations. Most state-of-the-art solutions use a fixed safety margin for all types of metrics (e.g., CPU, RAM). However, a single fixed margin does not account for workload variations over time, which may lead to SLA violations and/or poor utilization. To tackle these challenges, we propose ReLeaSER, a Reinforcement Learning strategy for optimizing the utilization of ephemeral resources in the cloud. ReLeaSER dynamically tunes the safety margin at the host level for each resource metric. The strategy learns from past prediction errors (those that caused SLA violations). Our solution significantly reduces SLA violation penalties, by 2.7x on average and up to 3.4x. It also considerably improves the CPs' potential savings, by 27.6% on average and up to 43.6%.
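The trade-off the abstract describes can be made concrete with a small numeric sketch. This is only an illustration of how a safety margin affects reclaimed resources and SLA risk, not the ReLeaSER learning strategy itself; the function names `reclaimable` and `sla_violated` and all numbers are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: names and numbers are assumptions, not from the paper.

def reclaimable(capacity: float, predicted_usage: float, safety_margin: float) -> float:
    """Amount of a resource that could be resold as ephemeral:
    capacity minus predicted usage minus the safety margin (never negative)."""
    return max(0.0, capacity - predicted_usage - safety_margin)


def sla_violated(capacity: float, actual_usage: float, reclaimed: float) -> bool:
    """A violation occurs when actual usage plus resold resources exceed capacity."""
    return actual_usage + reclaimed > capacity


if __name__ == "__main__":
    capacity = 100.0   # e.g. percent of host CPU
    predicted = 60.0   # forecast utilization
    actual = 70.0      # what the workload really used (forecast was too low)

    for margin in (5.0, 15.0):
        r = reclaimable(capacity, predicted, margin)
        print(f"margin={margin}: reclaimed={r}, violation={sla_violated(capacity, actual, r)}")
```

With these assumed numbers, the smaller margin reclaims more but cannot absorb the prediction error, while the larger margin avoids the violation at the cost of reclaiming less. A dynamic strategy such as the one the abstract proposes would adjust the margin over time rather than fix it.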
