Paper Title
BBTv2: Towards a Gradient-Free Future with Large Language Models
Paper Authors
Paper Abstract
Most downstream adaptation methods tune all or part of the parameters of pre-trained models (PTMs) through gradient descent, where the tuning cost increases linearly with the growth of the model size. By contrast, gradient-free methods only require the forward computation of the PTM to tune the prompt, retaining the benefits of efficient tuning and deployment. However, past work on gradient-free tuning often introduces gradient descent to seek a good initialization of the prompt and lacks versatility across tasks and PTMs. In this paper, we present BBTv2, an improved version of Black-Box Tuning, to drive PTMs for few-shot learning. We prepend continuous prompts to every layer of the PTM and propose a divide-and-conquer gradient-free algorithm to optimize the prompts at different layers alternately. Extensive experiments across various tasks and PTMs show that BBTv2 can achieve comparable performance to full model tuning and state-of-the-art parameter-efficient methods (e.g., Adapter, LoRA, BitFit) under few-shot settings while maintaining much fewer tunable parameters.
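The abstract's core idea, prepending a continuous prompt to every PTM layer and alternately optimizing the per-layer prompts with a forward-only, gradient-free optimizer in low-dimensional subspaces, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `forward_loss` is a hypothetical stand-in for a forward pass of the PTM with the prompts injected, the sizes are toy values, and a simple (1+1) random search replaces the evolutionary strategy typically used for this kind of black-box optimization.

```python
import numpy as np

# Toy sketch of divide-and-conquer, gradient-free prompt optimization.
# All sizes and the (1+1) random-search inner loop are illustrative stand-ins.

n_layers = 4    # layers receiving a prompt (real PTMs have many more)
d = 50          # low-dimensional subspace searched per layer
D = 512         # flattened prompt size per layer (prompt_len * hidden_dim)

rng = np.random.default_rng(0)
# A fixed random projection per layer maps the low-dim variable to prompt space.
A = [rng.normal(scale=1e-2, size=(D, d)) for _ in range(n_layers)]
z = [np.zeros(d) for _ in range(n_layers)]  # per-layer search variables


def forward_loss(prompts):
    """Hypothetical placeholder: run a forward pass of the PTM with the
    layer-wise prompts injected and return a scalar few-shot loss."""
    return float(sum(np.sum((p - 1.0) ** 2) for p in prompts))  # dummy objective


def evaluate(z_list):
    prompts = [A[layer] @ z_list[layer] for layer in range(n_layers)]
    return forward_loss(prompts)


best = evaluate(z)
for _ in range(10):                      # alternate over layers
    for layer in range(n_layers):        # optimize one layer's prompt at a time
        for _ in range(20):              # gradient-free inner loop
            candidate = [v.copy() for v in z]
            candidate[layer] = z[layer] + rng.normal(scale=0.1, size=d)
            loss = evaluate(candidate)
            if loss < best:              # accept only improvements
                z, best = candidate, loss
```

Note that only `evaluate` (forward passes) is ever called; no gradients flow through the PTM, which is what keeps tuning and deployment cheap in the gradient-free setting the abstract describes.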