Paper Title
Cross-Domain Deep Code Search with Meta Learning
Paper Authors
Paper Abstract
Recently, pre-trained programming language models such as CodeBERT have demonstrated substantial gains in code search. Despite their strong performance, they rely on the availability of large amounts of parallel data to fine-tune the semantic mapping between queries and code. This restricts their practicality in domain-specific languages, where data are relatively scarce and expensive to obtain. In this paper, we propose CDCS, a novel approach for domain-specific code search. CDCS employs a transfer learning framework in which an initial program representation model is pre-trained on a large corpus of common programming languages (such as Java and Python) and is further adapted to domain-specific languages such as SQL and Solidity. Unlike cross-language CodeBERT, which is directly fine-tuned on the target language, CDCS adapts a few-shot meta-learning algorithm called MAML to learn a good initialization of model parameters that can be effectively reused in a domain-specific language. We evaluate the proposed approach on two domain-specific languages, namely SQL and Solidity, with models transferred from two widely used languages (Python and Java). Experimental results show that CDCS significantly outperforms conventional pre-trained code models that are directly fine-tuned on domain-specific languages, and it is particularly effective when data are scarce.
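The abstract describes using MAML to learn a parameter initialization that adapts quickly to a low-resource language. As a point of reference, the sketch below shows one first-order MAML-style meta-update in generic PyTorch; it is not the paper's implementation, and the names `encoder`, `tasks`, and `loss_fn` are illustrative assumptions.

```python
# A minimal first-order MAML sketch (assumed setup, not the paper's code):
# each task pairs a support batch (for inner-loop adaptation) with a query
# batch (for the outer meta-update of the shared initialization).
import copy
import torch

def maml_meta_step(encoder, tasks, loss_fn, inner_lr=1e-3, meta_lr=1e-4):
    """Run one meta-update over a list of (support_batch, query_batch) tasks."""
    meta_opt = torch.optim.Adam(encoder.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for support_batch, query_batch in tasks:
        # Inner loop: adapt a task-specific copy of the encoder on the support set.
        fast = copy.deepcopy(encoder)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        inner_opt.zero_grad()
        loss_fn(fast, support_batch).backward()
        inner_opt.step()
        # Outer loop (first-order approximation): take the gradient of the
        # query loss at the adapted weights and accumulate it into the
        # shared initialization's gradients.
        query_loss = loss_fn(fast, query_batch)
        grads = torch.autograd.grad(query_loss, fast.parameters())
        for param, grad in zip(encoder.parameters(), grads):
            param.grad = grad if param.grad is None else param.grad + grad
    meta_opt.step()
```

In this framing, `loss_fn` would compute a query-code matching loss (e.g., a contrastive ranking loss over query and code embeddings), tasks would be sampled from the high-resource source languages, and the resulting initialization would then be fine-tuned on the scarce target-language data.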