Abstract—Document key information extraction (DKIE) is a challenging task that aims to automatically understand documents in their varied formats and layouts (forms, receipts, etc.). Existing pre-trained methods have shown strong performance on multiple DKIE tasks, but they suffer from three main drawbacks. First, they do not resolve the ambiguities that arise from similar text representations before cross-modal interaction. Second, they ignore cross-modal feature alignment before cross-modal interaction. Third, the self-attention layers used for cross-modal interaction incur high memory consumption, which prevents joint-representation reasoning over all negative samples. To address these limitations, we present a Dynamical Cross-Modal Alignment Interaction framework (DCMAI). Specifically, (1) to disambiguate similar textual representations, a prior knowledge-guided module adaptively mines fine-grained visual information and generates a prior visual-knowledge-guided text embedding for each token. (2) A crossover alignment loss is proposed to align cross-modal information, improving the correspondence between visual and textual features before cross-modal interaction. (3) To strengthen joint-representation reasoning in the cross-modal encoder and effectively mine cross-modal negative samples, we introduce a hierarchical interaction sampling strategy for negative-sample mining and apply a contrastive loss to optimize the joint representation. We pre-train the DCMAI framework on a public corpus and fine-tune it on several downstream tasks, including entity extraction, sequence labeling, and document question answering, where it achieves superior performance. The code will be made publicly available.
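The abstract does not specify the exact formulation of the crossover alignment loss or the contrastive objective, so the following is only a minimal PyTorch sketch of a generic symmetric InfoNCE-style cross-modal contrastive loss, the family of objectives such alignment losses are typically built on. All names (crossmodal_contrastive_loss, text_emb, vis_emb, temperature) and the in-batch negative scheme are illustrative assumptions, not the paper's actual method; DCMAI's hierarchical interaction sampling would select negatives differently.

```python
import torch
import torch.nn.functional as F

def crossmodal_contrastive_loss(text_emb, vis_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss between text and visual embeddings.

    text_emb, vis_emb: (batch, dim) tensors of paired embeddings.
    Here every other in-batch pair serves as a negative -- an assumption;
    the paper's hierarchical sampling strategy may choose negatives
    differently.
    """
    # L2-normalize so the dot product equals cosine similarity
    text_emb = F.normalize(text_emb, dim=-1)
    vis_emb = F.normalize(vis_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are positive pairs
    logits = text_emb @ vis_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over both retrieval directions
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2v + loss_v2t)
```

In this sketch the temperature controls how sharply the loss concentrates on hard negatives; a memory-efficient variant of such an objective is what point (3) of the abstract motivates, since naive all-pairs attention over every negative sample is what drives the memory cost up.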