Paper Title

Test-Time Adaptation for Visual Document Understanding

Paper Authors

Sayna Ebrahimi, Sercan O. Arik, Tomas Pfister

Paper Abstract

For visual document understanding (VDU), self-supervised pretraining has been shown to successfully generate transferable representations, yet effective adaptation of such representations to distribution shifts at test time remains an unexplored area. We propose DocTTA, a novel test-time adaptation method for documents that performs source-free domain adaptation using unlabeled target document data. DocTTA leverages cross-modality self-supervised learning via masked visual language modeling, as well as pseudo-labeling, to adapt models learned on a \textit{source} domain to an unlabeled \textit{target} domain at test time. We introduce new benchmarks using existing public datasets for various VDU tasks, including entity recognition, key-value extraction, and document visual question answering. Compared to the source model, DocTTA shows significant improvements on these tasks, up to 1.89\% (F1 score), 3.43\% (F1 score), and 17.68\% (ANLS score), respectively. Our benchmark datasets are available at \url{https://saynaebrahimi.github.io/DocTTA.html}.
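To make the recipe concrete, below is a minimal PyTorch-style sketch of a test-time adaptation loop combining the two objectives the abstract names: masked visual language modeling (MVLM) and pseudo-labeling on unlabeled target documents. This is an illustration under stated assumptions, not the authors' released implementation; `mask_tokens`, the HuggingFace-style `model(**inputs)` interface returning `.loss` / `.logits`, and the confidence threshold are all hypothetical.

```python
# Illustrative DocTTA-style adaptation loop, reconstructed from the abstract.
# `model`, `mask_tokens`, and `target_loader` are hypothetical placeholders.
import torch
import torch.nn.functional as F


def adapt_on_target(model, target_loader, mask_tokens, optimizer,
                    steps=1000, conf_threshold=0.9):
    """Adapt a source-trained VDU model to an unlabeled target domain."""
    model.train()
    step = 0
    while step < steps:
        for batch in target_loader:  # unlabeled target-domain documents
            # (1) Cross-modal self-supervision: mask some text tokens and
            # train the model to reconstruct them from the remaining text,
            # layout, and image features (MVLM).
            masked_inputs, mlm_labels = mask_tokens(batch)  # hypothetical helper
            mvlm_loss = model(**masked_inputs, labels=mlm_labels).loss

            # (2) Pseudo-labeling: treat the model's confident predictions
            # on the unmasked inputs as training targets.
            logits = model(**batch).logits            # [batch, tokens, classes]
            with torch.no_grad():
                conf, pseudo = logits.softmax(-1).max(-1)
                keep = conf > conf_threshold          # confident tokens only
            pl_loss = (F.cross_entropy(logits[keep], pseudo[keep])
                       if keep.any() else logits.sum() * 0.0)

            optimizer.zero_grad()
            (mvlm_loss + pl_loss).backward()
            optimizer.step()

            step += 1
            if step >= steps:
                break
    return model
```

Thresholding the pseudo-labels so that only confident predictions contribute to the loss is a common safeguard against confirmation bias in self-training; the exact filtering scheme DocTTA uses is described in the paper, not the abstract.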
