On Web-based Visual Corpus Construction for Visual Document Understanding

Paper Title

On Web-based Visual Corpus Construction for Visual Document Understanding

Authors

Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim, Geewook Kim

Abstract

In recent years, research on visual document understanding (VDU) has grown significantly, with a particular emphasis on the development of self-supervised learning methods. However, a major challenge in this field is the limited availability of publicly accessible visual corpora, that is, extensive collections of images with detailed text annotations, particularly for non-Latin or resource-scarce languages. To address this challenge, we propose Web-based Visual Corpus Builder (Webvicob), a dataset generator engine capable of constructing large-scale, multilingual visual corpora from raw Wikipedia HTML dumps. Our experiments demonstrate that the data generated by Webvicob can be used to train robust VDU models that perform well on various downstream tasks, such as DocVQA and post-OCR parsing. Furthermore, when using a dataset of 1 million images generated by Webvicob, we observed an improvement of over 13% on DocVQA Task 3 compared to a dataset of 11 million images from IIT-CDIP. The implementation of our engine is publicly available at https://github.com/clovaai/webvicob.
