Paper Title
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Paper Authors
Paper Abstract
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function --- is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+ \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, provided that a proper constant learning rate is adopted. Here, $t_{\mathrm{mix}}$ and $\mu_{\min}$ denote respectively the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result \cite{qu2020finite} by a factor of at least $|\mathcal{S}||\mathcal{A}|$ for all scenarios, and by a factor of at least $t_{\mathrm{mix}}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy level $\varepsilon$. Further, we demonstrate that the scaling on the effective horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
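As a rough illustration of the setting analyzed above (not the paper's exact procedure or interface), the following Python sketch runs classical asynchronous Q-learning on a single Markovian trajectory: at each step only the visited (state, action) entry of the Q-table is updated, with a constant learning rate as assumed in the abstract. The helper names `sample_next_state`, `reward`, and `behavior_policy`, as well as the toy random MDP in the usage snippet, are illustrative assumptions.

```python
import numpy as np

def async_q_learning(sample_next_state, reward, behavior_policy,
                     num_states, num_actions, gamma=0.9,
                     eta=0.1, num_steps=100_000, s0=0, seed=0):
    """Minimal sketch of classical asynchronous Q-learning with a constant
    learning rate eta, driven by a single trajectory from a behavior policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = s0
    for _ in range(num_steps):
        a = behavior_policy(s, rng)              # action chosen by the behavior policy
        s_next = sample_next_state(s, a, rng)    # one Markovian transition
        r = reward(s, a)
        # Asynchronous update: only the visited entry (s, a) changes this step.
        target = r + gamma * Q[s_next].max()
        Q[s, a] = (1.0 - eta) * Q[s, a] + eta * target
        s = s_next
    return Q

# Toy usage on a small random MDP (hypothetical setup, for illustration only).
num_states, num_actions = 5, 3
rng0 = np.random.default_rng(1)
P = rng0.dirichlet(np.ones(num_states), size=(num_states, num_actions))  # P[s, a] is a transition distribution
R = rng0.uniform(size=(num_states, num_actions))                          # deterministic rewards in [0, 1]

Q_hat = async_q_learning(
    sample_next_state=lambda s, a, rng: rng.choice(num_states, p=P[s, a]),
    reward=lambda s, a: R[s, a],
    behavior_policy=lambda s, rng: rng.integers(num_actions),  # uniformly random behavior policy
    num_states=num_states, num_actions=num_actions,
)
```

In this sketch, coverage of every state-action pair (reflected by $\mu_{\min}$) and the mixing of the trajectory (reflected by $t_{\mathrm{mix}}$) are exactly the quantities that govern the sample complexity bound stated in the abstract; the variance-reduced variant mentioned at the end is not shown here.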