Paper Title
Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis
Paper Authors
Paper Abstract
Variance reduction techniques have been successfully applied to temporal-difference (TD) learning and help to improve the sample complexity of policy evaluation. However, existing work applies variance reduction either to the less popular one-time-scale TD algorithm or to the two-time-scale GTD algorithm with only a finite number of i.i.d.\ samples, and both algorithms apply only to the on-policy setting. In this work, we develop a variance reduction scheme for the two-time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d.\ and Markovian samples. In the i.i.d.\ setting, our algorithm matches the best-known lower bound $\tilde{O}(\epsilon^{-1})$. In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity $O(\epsilon^{-1} \log \epsilon^{-1})$, which is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
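To make the idea concrete, below is a minimal sketch of what an SVRG-style variance-reduction correction applied to the two-time-scale TDC update with linear function approximation and importance-sampling ratios could look like. The function names, batch structure, step sizes, and epoch schedule are illustrative assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def tdc_directions(theta, w, phi, phi_next, r, rho, gamma):
    """Per-sample off-policy TDC update directions with linear features.

    delta   = r + gamma * theta^T phi' - theta^T phi           (TD error)
    d_theta = rho * (delta * phi - gamma * (phi^T w) * phi')    (fast iterate)
    d_w     = rho * (delta - phi^T w) * phi                     (slow iterate)
    rho is the importance-sampling ratio of the behavior vs. target policy.
    """
    delta = r + gamma * phi_next @ theta - phi @ theta
    d_theta = rho * (delta * phi - gamma * (phi @ w) * phi_next)
    d_w = rho * (delta - phi @ w) * phi
    return d_theta, d_w

def variance_reduced_tdc(batch, theta, w, alpha, beta, gamma, epochs=10):
    """SVRG-style variance reduction on TDC over a batch of off-policy
    transitions (phi, phi_next, r, rho).  Illustrative sketch only."""
    n = len(batch)
    for _ in range(epochs):
        # Snapshot point: full-batch update directions at (theta0, w0).
        theta0, w0 = theta.copy(), w.copy()
        mean_theta = np.zeros_like(theta)
        mean_w = np.zeros_like(w)
        for phi, phi_next, r, rho in batch:
            dt, dw = tdc_directions(theta0, w0, phi, phi_next, r, rho, gamma)
            mean_theta += dt / n
            mean_w += dw / n
        # Inner loop: each stochastic step is corrected by the snapshot
        # directions, which reduces the variance of the update without bias.
        for i in np.random.permutation(n):
            phi, phi_next, r, rho = batch[i]
            dt, dw = tdc_directions(theta, w, phi, phi_next, r, rho, gamma)
            dt0, dw0 = tdc_directions(theta0, w0, phi, phi_next, r, rho, gamma)
            theta = theta + alpha * (dt - dt0 + mean_theta)
            w = w + beta * (dw - dw0 + mean_w)
    return theta, w
```

The correction term `dt - dt0 + mean_theta` has the same expectation as the plain stochastic direction but much smaller variance near the snapshot, which is the mechanism the abstract credits for the improved sample complexity.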