Paper Title

VRDU: A Benchmark for Visually-rich Document Understanding

Paper Authors

Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, Sandeep Tata

Paper Abstract


Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as hierarchical entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope this helps the community make progress on these challenging tasks in extracting structured data from visually rich documents.
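The abstract mentions that VRDU evaluates extraction results with a carefully designed matching algorithm. As a rough illustration of what entity-level matching for document extraction looks like, here is a minimal sketch of strict (field, value) matching with precision/recall/F1 scoring. The function name, normalization, and greedy matching are assumptions for illustration, not the VRDU toolkit's actual algorithm (which, per the paper, is more carefully designed, e.g. for hierarchical fields like invoice line-items).

```python
def match_entities(predicted, gold):
    """Greedily match predicted (field, value) pairs against gold annotations.

    A prediction counts as a true positive only if both the field name and the
    normalized value exactly match a not-yet-matched gold entity. This is a
    hypothetical sketch, not the benchmark's official matcher.
    """
    remaining = list(gold)
    true_positives = 0
    for field, value in predicted:
        key = (field, value.strip().lower())
        for g in remaining:
            if (g[0], g[1].strip().lower()) == key:
                true_positives += 1
                remaining.remove(g)  # each gold entity may be matched once
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example: one of two predicted fields matches the gold annotations exactly.
pred = [("invoice_date", "2023-01-05"), ("total_amount", "$100.00")]
gold = [("invoice_date", "2023-01-05"), ("total_amount", "$90.00")]
print(match_entities(pred, gold))  # → (0.5, 0.5, 0.5)
```

A strict matcher like this is deliberately unforgiving; real evaluation toolkits typically also handle repeated fields, partial value normalization, and nested (hierarchical) entities.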
