Paper Title
Launchpad: Learning to Schedule Using Offline and Online RL Methods
Paper Authors
Paper Abstract
Deep reinforcement learning algorithms have succeeded in several challenging domains. Classic online RL job schedulers can learn efficient scheduling strategies, but they often take thousands of timesteps to explore the environment and adapt from a randomly initialized DNN policy. Existing RL schedulers overlook the importance of learning from historical data and of improving upon custom heuristic policies. Offline reinforcement learning presents the prospect of policy optimization from pre-recorded datasets without online environment interaction. Following the recent success of data-driven learning, we explore two RL methods: 1) Behaviour Cloning and 2) Offline RL, which aim to learn policies from logged data without interacting with the environment. These methods address the challenges concerning the cost of data collection and safety, which are particularly pertinent to real-world applications of RL. Although these data-driven RL methods produce good results, we show that their performance is highly dependent on the quality of the historical datasets. Finally, we demonstrate that by effectively incorporating prior expert demonstrations to pre-train the agent, we short-circuit the random exploration phase and learn a reasonable policy with online training. We utilize Offline RL as a launchpad to learn effective scheduling policies from prior experience collected using Oracle or heuristic policies. Such a framework is effective for pre-training from historical datasets and is well suited to continuous improvement with online data collection.
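To make the pre-train-then-fine-tune workflow described in the abstract concrete, the sketch below shows one possible reading of it: a scheduling policy is first trained by behaviour cloning on logged (state, action) pairs produced by a heuristic or Oracle scheduler, and is then further improved by online policy-gradient updates. This is a minimal illustrative sketch, not the paper's implementation; the names PolicyNet, behaviour_cloning_pretrain, and online_finetune, the dataset format, the network sizes, and the Gym-style environment interface are all assumptions made for the example.

```python
# Illustrative sketch only: offline pre-training (behaviour cloning) followed by
# online fine-tuning of a job-scheduling policy. All interfaces are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Maps a job/cluster state vector to logits over schedulable actions."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)


def behaviour_cloning_pretrain(policy, logged_states, logged_actions,
                               epochs: int = 10, lr: float = 1e-3):
    """Offline phase: imitate the heuristic/Oracle scheduler from logged data.

    logged_states: float tensor of shape (N, state_dim)
    logged_actions: long tensor of shape (N,)
    """
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(policy(logged_states), logged_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy


def online_finetune(policy, env, episodes: int = 100,
                    gamma: float = 0.99, lr: float = 1e-4):
    """Online phase (REINFORCE-style, purely illustrative): the pre-trained
    policy is the starting point, shortening the random-exploration phase.
    Assumes a classic Gym-style env with reset()/step() returning 4-tuples."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        state, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            logits = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            state, reward, done, _ = env.step(action.item())
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        # Compute discounted returns and apply a policy-gradient update.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.insert(0, g)
        loss = -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Under this reading, the behaviour-cloning step plays the role of the "launchpad": the online phase starts from a policy that already mimics the logged scheduler, so its quality is bounded by the quality of the historical dataset, which matches the abstract's observation that performance depends heavily on that dataset.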