Paper Title

Differentially Private Temporal Difference Learning with Stochastic Nonconvex-Strongly-Concave Optimization

Authors

Canzhe Zhao, Yanjie Ze, Jing Dong, Baoxiang Wang, Shuai Li

Abstract

Temporal difference (TD) learning is a widely used method for evaluating policies in reinforcement learning. While many TD learning methods have been developed in recent years, little attention has been paid to preserving privacy, and most existing approaches may raise data privacy concerns for users. To enable complex representation power for policies, in this paper we consider preserving privacy in TD learning with nonlinear value function approximation. This is challenging because such a nonlinear problem is usually formulated as stochastic nonconvex-strongly-concave optimization to obtain a finite-sample analysis, which requires simultaneously preserving privacy on both the primal and dual sides. To this end, we employ momentum-based stochastic gradient descent ascent to obtain a single-timescale algorithm, and we achieve a good trade-off between meaningful privacy and utility guarantees on both the primal and dual sides by perturbing the gradients on both sides with well-calibrated Gaussian noise. As a result, our DPTD algorithm provides an $(\epsilon,\delta)$-differential privacy (DP) guarantee for the sensitive information encoded in transitions while retaining the original power of TD learning, with utility upper bounded by $\widetilde{\mathcal{O}}(\frac{(d\log(1/\delta))^{1/8}}{(n\epsilon)^{1/4}})$ (the tilde in this paper hides logarithmic factors), where $n$ is the trajectory length and $d$ is the dimension. Extensive experiments conducted in OpenAI Gym show the advantages of our proposed algorithm.
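The mechanism the abstract describes, perturbing both the primal and dual gradients with calibrated Gaussian noise inside a single-timescale, momentum-based stochastic gradient descent ascent loop, can be sketched as follows. This is a minimal illustrative sketch only: the function name, the gradient-clipping step used to bound sensitivity, and all hyperparameters are assumptions for exposition, not the paper's exact DPTD algorithm or noise calibration.

```python
import numpy as np

def dp_momentum_sgda(grad_theta, grad_w, theta0, w0, n_steps,
                     lr=0.05, beta=0.9, clip=1.0, sigma=1.0, rng=None):
    """Sketch of momentum-based SGDA with Gaussian perturbation on both
    the primal (theta, descent) and dual (w, ascent) gradients.

    grad_theta, grad_w: stochastic gradient oracles of the saddle objective.
    clip: sensitivity bound enforced by norm clipping (illustrative choice).
    sigma: Gaussian noise multiplier; in a DP analysis it would be
    calibrated from (epsilon, delta) and the sensitivity bound.
    """
    rng = np.random.default_rng(rng)
    theta, w = theta0.astype(float).copy(), w0.astype(float).copy()
    m_theta, m_w = np.zeros_like(theta), np.zeros_like(w)
    for _ in range(n_steps):
        g_t = grad_theta(theta, w)
        g_w = grad_w(theta, w)
        # Clip each gradient to bound its sensitivity, then add Gaussian
        # noise so the released updates on both sides are perturbed.
        g_t = g_t / max(1.0, np.linalg.norm(g_t) / clip)
        g_w = g_w / max(1.0, np.linalg.norm(g_w) / clip)
        g_t = g_t + rng.normal(0.0, sigma * clip, size=g_t.shape)
        g_w = g_w + rng.normal(0.0, sigma * clip, size=g_w.shape)
        # Single-timescale momentum updates: descend on theta, ascend on w.
        m_theta = beta * m_theta + (1.0 - beta) * g_t
        m_w = beta * m_w + (1.0 - beta) * g_w
        theta -= lr * m_theta
        w += lr * m_w
    return theta, w
```

As a usage sanity check, one can run the sketch on a toy nonconvex-free saddle objective such as $f(\theta, w) = \theta w - w^2/2$ (strongly concave in $w$), whose gradient oracles are `w` and `theta - w`; with the noise turned off the iterates drift toward the saddle point at the origin. The single-timescale design means both sides share one step size, which is what makes the momentum term necessary for the finite-sample analysis in this class of problems.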
