Paper Title

Performance of Bounded-Rational Agents With the Ability to Self-Modify

Authors

Jakub Tětek, Marek Sklenka, Tomáš Gavenčiak

Abstract

Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals. Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss model assumptions and the wider problem and framing space. We examine four ways in which an agent can be bounded-rational: it either (1) doesn't always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in the cases (2)-(4) the misalignment caused by the agent's imperfection does not increase over time, with (1) the misalignment may grow exponentially.
