Paper Title

Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits

Paper Authors

Siddhartha Banerjee, Sean R. Sinclair, Milind Tambe, Lily Xu, Christina Lee Yu

Paper Abstract

Most real-world deployments of bandit algorithms exist somewhere in between the offline and online setups, where some historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) - particularly for continuous action spaces. To address these challenges, we propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. We show that Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of Artificial Replay for improving data efficiency, including for base algorithms that do not satisfy IIData.
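
To make the abstract's replay idea concrete, below is a minimal sketch, assuming a K-armed setting with UCB1 as the base algorithm; it is not the paper's official pseudocode, and the names `UCB1`, `artificial_replay`, `pull_arm`, and the example numbers are all illustrative. The wrapper consumes an unused historical sample whenever the base algorithm proposes the corresponding arm, and only acts online (spending a time step) when no relevant history remains for that arm.

```python
import math
import random
from collections import defaultdict


class UCB1:
    """Illustrative base bandit algorithm: per-arm counts and running means."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select(self):
        # Play each arm once, then pick the arm with the highest UCB index.
        for a, c in enumerate(self.counts):
            if c == 0:
                return a
        return max(range(len(self.counts)),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def artificial_replay(base, history, pull_arm, horizon):
    """Wrap `base` so a historical sample is used only when the base
    algorithm actually proposes the corresponding arm (sketch, not the
    paper's pseudocode)."""
    unused = defaultdict(list)          # arm -> unused historical rewards
    for arm, reward in history:
        unused[arm].append(reward)

    t = 0
    while t < horizon:
        arm = base.select()
        if unused[arm]:                 # replay a stored sample: no online pull, no time step spent
            base.update(arm, unused[arm].pop())
        else:                           # no relevant history left: act online
            base.update(arm, pull_arm(arm))
            t += 1
    return base


# Hypothetical usage: two Bernoulli arms, with a few historical pulls of arm 0 only.
env = lambda a: float(random.random() < [0.3, 0.7][a])
artificial_replay(UCB1(2), history=[(0, 1.0), (0, 0.0), (0, 1.0)],
                  pull_arm=env, horizon=100)
```

Because historical samples are pulled in lazily, arms the base algorithm never proposes leave their history untouched, which is the sense in which the method uses "only a fraction of the historical data" compared to a full warm start.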
