Paper Title
Nonstationary Reinforcement Learning with Linear Function Approximation
Paper Authors
Paper Abstract
We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation in a drifting environment. Specifically, both the reward and state transition functions can evolve over time, but their total variation does not exceed a $\textit{variation budget}$. We first develop the $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration with periodic restarts, and bound its dynamic regret when the variation budgets are known. We then propose a parameter-free algorithm, $\texttt{Ada-LSVI-UCB-Restart}$, that extends to unknown variation budgets. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and, as a byproduct, establish a minimax regret lower bound for linear MDPs left open by Jin et al. (2020). Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms.
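Since the abstract only sketches the mechanism, the following is a minimal, illustrative Python sketch of the core idea it describes: optimistic least-squares value iteration whose collected data are periodically discarded so the linear estimates can track a drifting environment. The environment interface (`env.reset()`, `env.step(a)`, `env.actions`), the feature map `phi`, and all parameter names are assumptions made for illustration, not the paper's actual implementation.

```python
import numpy as np

def lsvi_ucb_restart(env, phi, d, H, K, restart_every, beta=1.0, lam=1.0):
    """Sketch of optimistic least-squares value iteration with periodic restarts.

    Every `restart_every` episodes, all past transitions are discarded so the
    linear estimates track drifting rewards/transitions; with a known variation
    budget the restart period would presumably be tuned to balance estimation
    error against drift.
    """
    history = []                   # transitions (h, s, a, r, s_next) since last restart
    w = np.zeros((H, d))           # per-step linear weights for Q_h(s, a) ~ w[h] @ phi(s, a)
    Lambda = [lam * np.eye(d) for _ in range(H)]  # regularized Gram matrices

    def q_value(h, s, a):
        # Optimistic Q estimate: linear prediction plus an elliptical UCB bonus.
        if h == H:
            return 0.0  # terminal value
        feat = phi(s, a)
        bonus = beta * np.sqrt(feat @ np.linalg.solve(Lambda[h], feat))
        return min(w[h] @ feat + bonus, H)  # optimism, clipped at the horizon

    for k in range(K):
        if k % restart_every == 0:
            history = []  # restart: forget data gathered under the drifted environment

        # Backward pass: regress targets r + max_a' Q_{h+1}(s', a') on phi(s, a).
        Lambda = [lam * np.eye(d) for _ in range(H)]
        for h in range(H - 1, -1, -1):
            b = np.zeros(d)
            for (hh, s, a, r, s_next) in history:
                if hh != h:
                    continue
                feat = phi(s, a)
                Lambda[h] += np.outer(feat, feat)
                v_next = max(q_value(h + 1, s_next, act) for act in env.actions)
                b += feat * (r + v_next)
            w[h] = np.linalg.solve(Lambda[h], b)

        # Roll out one episode acting greedily w.r.t. the optimistic Q, and store data.
        s = env.reset()
        for h in range(H):
            a = max(env.actions, key=lambda act: q_value(h, s, act))
            s_next, r = env.step(a)
            history.append((h, s, a, r, s_next))
            s = s_next
    return w
```

The key design lever in this sketch is `restart_every`: restarting too often wastes data, while restarting too rarely lets stale data from a drifted environment bias the estimates.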