Paper Title
Settling the Communication Complexity for Distributed Offline Reinforcement Learning
Paper Authors
Paper Abstract
We study a novel setting in offline reinforcement learning (RL) in which a number of distributed machines jointly cooperate to solve the problem, but only a single round of communication is allowed and there is a budget constraint on the total amount of information (in bits) that each machine can send. For value function prediction in contextual bandits, and in both episodic and non-episodic MDPs, we establish information-theoretic lower bounds on the minimax risk of distributed statistical estimators; these reveal the minimum amount of communication required by any offline RL algorithm. Specifically, for contextual bandits we show that the number of bits must scale at least as $\Omega(AC)$ to match the centralised minimax optimal rate, where $A$ is the number of actions and $C$ is the context dimension; we obtain analogous results in the MDP settings. Furthermore, we develop learning algorithms based on least-squares estimates and Monte-Carlo return estimates, and provide a sharp analysis showing that they achieve optimal risk up to logarithmic factors. We also show that temporal-difference learning cannot efficiently utilise information from all available devices in the single-round communication setting, owing to the initial bias of the method. To the best of our knowledge, this paper presents the first minimax lower bounds for distributed offline RL problems.
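To give a concrete feel for the single-round protocol underlying the least-squares approach, the sketch below is our own simplified illustration (not the paper's exact algorithm): each machine compresses its local offline data into least-squares sufficient statistics, sends them once, and a central server aggregates them into one global estimate. The function names (`local_statistics`, `server_aggregate`) are hypothetical, and the bit-level quantisation needed to respect the communication budget is omitted.

```python
import numpy as np

def local_statistics(contexts, rewards):
    """Each machine summarises its offline data as least-squares
    sufficient statistics (X^T X, X^T y); this is its only message,
    so communication takes a single round."""
    X = np.asarray(contexts)   # shape (n, C): local context features
    y = np.asarray(rewards)    # shape (n,):   observed rewards/returns
    return X.T @ X, X.T @ y

def server_aggregate(messages, ridge=1e-6):
    """The server sums the sufficient statistics from all machines and
    solves a single global least-squares problem for the parameter."""
    C = messages[0][0].shape[0]
    G = np.zeros((C, C))
    b = np.zeros(C)
    for XtX, Xty in messages:
        G += XtX
        b += Xty
    return np.linalg.solve(G + ridge * np.eye(C), b)

# Toy usage: 10 machines, each with 50 local samples in dimension C = 5.
rng = np.random.default_rng(0)
theta_true = rng.normal(size=5)
msgs = []
for _ in range(10):
    X = rng.normal(size=(50, 5))
    y = X @ theta_true + 0.1 * rng.normal(size=50)
    msgs.append(local_statistics(X, y))
theta_hat = server_aggregate(msgs)
print(np.linalg.norm(theta_hat - theta_true))  # small estimation error
```

In a contextual-bandit instance one would keep such statistics per action, so the message size grows with both the number of actions and the context dimension, which is consistent in spirit with the $\Omega(AC)$ communication requirement stated in the abstract.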