Paper Title

Ask Me Anything: A simple strategy for prompting language models

Authors

Simran Arora, Avanika Narayan, Mayee F. Chen, Laurel Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, Christopher Ré

Abstract

Large language models (LLMs) transfer well to new tasks out-of-the-box simply given a natural language prompt that demonstrates how to perform the task and no additional training. Prompting is a brittle process wherein small modifications to the prompt can cause large variations in the model predictions, and therefore significant effort is dedicated towards designing a painstakingly "perfect prompt" for a task. To mitigate the high degree of effort involved in prompt-design, we instead ask whether producing multiple effective, yet imperfect, prompts and aggregating them can lead to a high quality prompting strategy. Our observations motivate our proposed prompting method, ASK ME ANYTHING (AMA). We first develop an understanding of the effective prompt formats, finding that question-answering (QA) prompts, which encourage open-ended generation ("Who went to the park?") tend to outperform those that restrict the model outputs ("John went to the park. Output True or False."). Our approach recursively uses the LLM itself to transform task inputs to the effective QA format. We apply the collected prompts to obtain several noisy votes for the input's true label. We find that the prompts can have very different accuracies and complex dependencies and thus propose to use weak supervision, a procedure for combining the noisy predictions, to produce the final predictions for the inputs. We evaluate AMA across open-source model families (e.g., EleutherAI, BLOOM, OPT, and T0) and model sizes (125M-175B parameters), demonstrating an average performance lift of 10.2% over the few-shot baseline. This simple strategy enables the open-source GPT-J-6B model to match and exceed the performance of few-shot GPT3-175B on 15 of 20 popular benchmarks. Averaged across these tasks, the GPT-J-6B model outperforms few-shot GPT3-175B. We release our code here: https://github.com/HazyResearch/ama_prompting
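
The abstract outlines a concrete pipeline: reformat each input into an open-ended question, answer that question to collect one noisy vote per prompt, and aggregate the votes into a final prediction. Below is a minimal sketch of that loop, assuming a hypothetical `llm` completion callable and user-supplied question templates (neither is from the paper's released code); it aggregates by simple majority vote, whereas AMA itself uses weak supervision to model the prompts' differing accuracies and dependencies.

```python
from collections import Counter
from typing import Callable, List

def ama_predict(llm: Callable[[str], str],
                question_templates: List[str],
                task_input: str) -> str:
    """Sketch of one AMA-style prompt chain (majority-vote variant)."""
    votes = []
    for template in question_templates:
        # Step 1: the LLM itself reformats the task input into an open-ended
        # question, e.g. "John went to the park." -> "Who went to the park?"
        question = llm(template.format(input=task_input)).strip()
        # Step 2: the LLM answers that question, yielding one noisy vote
        # for the input's true label.
        answer = llm(f"Context: {task_input}\nQuestion: {question}\nAnswer:")
        votes.append(answer.strip())
    # Step 3: aggregate the noisy votes. Majority vote is a simplification;
    # the paper instead fits a weak-supervision model over the votes to
    # account for their varying accuracies and correlations.
    return Counter(votes).most_common(1)[0][0]
```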
