Paper Title

Test-Time Adaptation for Visual Document Understanding

Paper Authors

Sayna Ebrahimi, Sercan O. Arik, Tomas Pfister

Paper Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. Compared to the source model, DocTTA shows significant improvements on these tasks, up to 1.89\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively. Our benchmark datasets are available at \url{https://saynaebrahimi.github.io/DocTTA.html}.
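To make the recipe concrete, below is a minimal PyTorch-style sketch of a test-time adaptation loop combining the two objectives the abstract names: masked visual language modeling (MVLM) and pseudo-labeling on unlabeled target documents. This is an illustration under stated assumptions, not the authors' released implementation; `mask_tokens`, the HuggingFace-style `model(**inputs)` interface returning `.loss` / `.logits`, and the confidence threshold are all hypothetical.

```python
# Illustrative DocTTA-style adaptation loop, reconstructed from the abstract.
# `model`, `mask_tokens`, and `target_loader` are hypothetical placeholders.
import torch
import torch.nn.functional as F


def adapt_on_target(model, target_loader, mask_tokens, optimizer,
                    steps=1000, conf_threshold=0.9):
    """Adapt a source-trained VDU model to an unlabeled target domain."""
    model.train()
    step = 0
    while step < steps:
        for batch in target_loader:  # unlabeled target-domain documents
            # (1) Cross-modal self-supervision: mask some text tokens and
            # train the model to reconstruct them from the remaining text,
            # layout, and image features (MVLM).
            masked_inputs, mlm_labels = mask_tokens(batch)  # hypothetical helper
            mvlm_loss = model(**masked_inputs, labels=mlm_labels).loss

            # (2) Pseudo-labeling: treat the model's confident predictions
            # on the unmasked inputs as training targets.
            logits = model(**batch).logits            # [batch, tokens, classes]
            with torch.no_grad():
                conf, pseudo = logits.softmax(-1).max(-1)
                keep = conf > conf_threshold          # confident tokens only
            pl_loss = (F.cross_entropy(logits[keep], pseudo[keep])
                       if keep.any() else logits.sum() * 0.0)

            optimizer.zero_grad()
            (mvlm_loss + pl_loss).backward()
            optimizer.step()

            step += 1
            if step >= steps:
                break
    return model
```

Thresholding the pseudo-labels so that only confident predictions contribute to the loss is a common safeguard against confirmation bias in self-training; the exact filtering scheme DocTTA uses is described in the paper, not the abstract.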
