Paper Title
Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis
Paper Authors
Paper Abstract
Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help to improve the sample complexity of policy evaluation. However, existing work applies variance reduction either to the less popular one-time-scale TD algorithm or to the two-time-scale GTD algorithm with only a finite number of i.i.d.\ samples, and both algorithms apply only to the on-policy setting. In this work, we develop a variance reduction scheme for the two-time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d.\ and Markovian samples. In the i.i.d.\ setting, our algorithm matches the best-known lower bound $\tilde{O}(\epsilon^{-1})$. In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity $O(\epsilon^{-1} \log \epsilon^{-1})$, which is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
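To make the idea concrete, below is a minimal sketch of what an SVRG-style variance-reduction correction applied to the two-time-scale TDC update with linear function approximation and importance-sampling ratios could look like. The function names, batch structure, step sizes, and epoch schedule are illustrative assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def tdc_directions(theta, w, phi, phi_next, r, rho, gamma):
    """Per-sample off-policy TDC update directions with linear features.

    delta   = r + gamma * theta^T phi' - theta^T phi           (TD error)
    d_theta = rho * (delta * phi - gamma * (phi^T w) * phi')    (fast iterate)
    d_w     = rho * (delta - phi^T w) * phi                     (slow iterate)
    rho is the importance-sampling ratio of the behavior vs. target policy.
    """
    delta = r + gamma * phi_next @ theta - phi @ theta
    d_theta = rho * (delta * phi - gamma * (phi @ w) * phi_next)
    d_w = rho * (delta - phi @ w) * phi
    return d_theta, d_w

def variance_reduced_tdc(batch, theta, w, alpha, beta, gamma, epochs=10):
    """SVRG-style variance reduction on TDC over a batch of off-policy
    transitions (phi, phi_next, r, rho).  Illustrative sketch only."""
    n = len(batch)
    for _ in range(epochs):
        # Snapshot point: full-batch update directions at (theta0, w0).
        theta0, w0 = theta.copy(), w.copy()
        mean_theta = np.zeros_like(theta)
        mean_w = np.zeros_like(w)
        for phi, phi_next, r, rho in batch:
            dt, dw = tdc_directions(theta0, w0, phi, phi_next, r, rho, gamma)
            mean_theta += dt / n
            mean_w += dw / n
        # Inner loop: each stochastic step is corrected by the snapshot
        # directions, which reduces the variance of the update without bias.
        for i in np.random.permutation(n):
            phi, phi_next, r, rho = batch[i]
            dt, dw = tdc_directions(theta, w, phi, phi_next, r, rho, gamma)
            dt0, dw0 = tdc_directions(theta0, w0, phi, phi_next, r, rho, gamma)
            theta = theta + alpha * (dt - dt0 + mean_theta)
            w = w + beta * (dw - dw0 + mean_w)
    return theta, w
```

The correction term `dt - dt0 + mean_theta` has the same expectation as the plain stochastic direction but much smaller variance near the snapshot, which is the mechanism the abstract credits for the improved sample complexity.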