Paper Title


Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

Paper Authors

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Paper Abstract


The need for large raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable for pre-training large generative language models, as well as, hopefully, other applications in Natural Language Processing and Digital Humanities.
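The abstract contrasts line-level extraction (which discards individual "noisy" lines and so breaks document structure) with a document-oriented approach (which keeps each document whole and attaches quality annotations instead). The actual Ungoliant pipeline is written in Rust and uses its own heuristics and annotation taxonomy; the Python sketch below is purely illustrative, with a hypothetical word-count heuristic and a made-up "noisy" label, to show the difference between the two strategies:

```python
# Illustrative sketch only -- NOT the real Ungoliant pipeline (which is in Rust).
# The min-word heuristic and the "noisy" label are hypothetical examples.

def line_level_filter(document, min_words=3):
    """Line-oriented approach: drop short lines, losing document structure."""
    return [line for line in document if len(line.split()) >= min_words]

def document_level_annotate(document, min_words=3):
    """Document-oriented approach: keep the document intact, annotate instead."""
    short = sum(1 for line in document if len(line.split()) < min_words)
    annotations = []
    if short / len(document) > 0.5:
        annotations.append("noisy")  # illustrative label, not OSCAR's taxonomy
    return {"content": document, "annotations": annotations}

doc = ["Menu", "A long sentence with real textual content.", "OK"]
print(line_level_filter(doc))        # only the long line survives
print(document_level_annotate(doc))  # whole doc kept, flagged as noisy
```

A downstream consumer of the document-oriented corpus can then decide per task whether to keep, filter, or re-weight annotated documents, rather than inheriting irreversible line-level deletions.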
