Paper Title

The Alignment Problem from a Deep Learning Perspective

Paper Authors

Richard Ngo, Lawrence Chan, Sören Mindermann

Paper Abstract

In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical evidence published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.
