Saifullah Saifullah et al.

As data-driven AI systems become increasingly integrated into industry, concerns have arisen regarding potential privacy breaches and the inadvertent leakage of sensitive user data through the exploitation of these systems. In this paper, we explore the intersection of data privacy and AI-powered document analysis systems, presenting a comprehensive benchmark of well-known privacy-preserving methods for the task of document image classification. In particular, we investigate four privacy methods, namely Differential Privacy (DP), Federated Learning (FL), Differentially Private Federated Learning (DP-FL), and Secure Multi-Party Computation (SMPC), on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482. For thorough benchmarking, we evaluate each method under a variety of configurations. Finally, the privacy strength of each approach is assessed by subjecting the private models to well-known membership inference attacks. Our results demonstrate that, with sufficient hyperparameter tuning, DP can achieve reasonable performance on document image classification while enforcing rigorous privacy guarantees, in both standalone and federated learning setups. FL-based approaches, on the other hand, are simpler to implement and incur little to no loss in task performance, but they do not offer sufficient protection against privacy attacks. By rigorously benchmarking these privacy approaches, our study paves the way for integrating deep document classification models into industrial pipelines while meeting regulatory and ethical standards, including the GDPR and the AI Act 2022.
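To make the strongest-performing setting above concrete, the following is a minimal sketch of DP-SGD training for an image classifier using the Opacus library (per-sample gradient clipping plus calibrated Gaussian noise). The toy model, dummy data, and hyperparameter values are illustrative assumptions and do not reproduce the paper's actual architecture or privacy budget.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Dummy stand-in data: 256 "document images" and 16 classes (RVL-CDIP has 16 classes).
images = torch.randn(256, 3, 224, 224)
labels = torch.randint(0, 16, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

# A small BatchNorm-free CNN; Opacus requires DP-compatible layers.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 16),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# Attach the privacy engine: clips each per-sample gradient and adds Gaussian noise.
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.0,  # illustrative value; tuned against the target privacy budget
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

for epoch in range(1):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Privacy budget spent so far, reported for a chosen delta.
epsilon = privacy_engine.get_epsilon(delta=1e-5)
print(f"(epsilon, delta) = ({epsilon:.2f}, 1e-5)")
```

The same DP training step could, in principle, be run inside each client of a federated setup to obtain the DP-FL variant discussed above.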

Saifullah Saifullah et al.

Convolutional Neural Networks (ConvNets) have been thoroughly researched for document image classification and are known for their exceptional performance in unimodal, image-based document classification. Recently, however, the field has shifted towards multimodal approaches that learn simultaneously from the visual and textual features of documents. While this shift has led to significant advances, it has also caused interest in improving pure ConvNet-based approaches to wane. This is undesirable, as many multimodal approaches still use ConvNets as their visual backbone, so improving ConvNets remains essential to improving these approaches. In this paper, we present DocXClassifier, a ConvNet-based approach that, using state-of-the-art model design patterns together with modern data augmentation and training strategies, not only achieves significant performance improvements in image-based document classification but also outperforms some recently proposed multimodal approaches. Moreover, DocXClassifier generates transformer-like attention maps, which makes it inherently interpretable, a property not found in previous image-based classification models. Our approach achieves new peak performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.17% and 95.57%, respectively. It also sets a new record for image-based classification accuracy on Tobacco3482, reaching 90.14% without transfer learning from RVL-CDIP. Finally, our proposed model may serve as a powerful visual backbone for future multimodal approaches by providing much richer visual features than existing counterparts.
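The attention maps mentioned above can be illustrated with a generic sketch: a ConvNet backbone whose spatial features are pooled through a learned attention map, which can then be upsampled and overlaid on the document page. The backbone choice (torchvision's ConvNeXt-Base), head design, and sizes below are assumptions for illustration and are not the actual DocXClassifier architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_base

class AttentionPoolClassifier(nn.Module):
    """ConvNet backbone + spatial attention pooling that also returns an attention map."""

    def __init__(self, num_classes: int = 16):
        super().__init__()
        self.backbone = convnext_base(weights=None).features  # (B, 1024, H/32, W/32)
        self.attn = nn.Conv2d(1024, 1, kernel_size=1)         # 1x1 conv -> attention logits
        self.head = nn.Linear(1024, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                             # (B, C, H, W)
        b, c, h, w = feats.shape
        attn = self.attn(feats).flatten(2).softmax(dim=-1)   # (B, 1, H*W), sums to 1
        pooled = (feats.flatten(2) * attn).sum(dim=-1)       # attention-weighted pooling
        logits = self.head(pooled)
        return logits, attn.view(b, 1, h, w)                 # map can be overlaid on the page

model = AttentionPoolClassifier(num_classes=16)
logits, attn_map = model(torch.randn(2, 3, 224, 224))
print(logits.shape, attn_map.shape)  # torch.Size([2, 16]) torch.Size([2, 1, 7, 7])
```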
Deep neural networks have achieved remarkable performance in document image classification, yet little research has examined the explainability of these models. In this paper, we present a comprehensive study in which we analyze 9 different explainability methods across 10 state-of-the-art document classification models and 2 popular benchmark datasets, making three major contributions. First, through an exhaustive qualitative and quantitative analysis of the various explainability approaches, we demonstrate that the majority of them perform poorly at generating useful explanations for document images, with only two techniques, Occlusion and DeepSHAP, providing relatively adequate, human-interpretable, and faithful explanations. Second, to identify the features most relevant to the models' predictions, we present an approach for generating counterfactual explanations. An analysis of these explanations reveals that many document classification models are highly susceptible to minor perturbations in the input. Moreover, they may easily fall victim to biases in the document data and end up relying on seemingly irrelevant features to make their decisions, with 25-50% of predictions overall, and up to 60% for some classes, depending strongly on these features. Lastly, our analysis reveals that the popular document benchmark datasets, RVL-CDIP and Tobacco3482, are inherently biased: document identification (ID) numbers of specific styles consistently appear in certain document regions. If unaddressed, this bias allows models to predict document classes solely by looking at the ID numbers and prevents them from learning more complex document features. Overall, by unveiling the strengths and weaknesses of various explainability methods, document datasets, and deep learning models, our work presents a major step towards more transparent and robust document image classification systems.
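As a concrete illustration of the two explanation techniques that performed adequately above, the sketch below computes Occlusion and DeepSHAP (DeepLiftShap) attributions with the Captum library. The stand-in classifier, window sizes, and baseline samples are illustrative assumptions rather than the models and settings evaluated in the paper.

```python
import torch
import torch.nn as nn
from captum.attr import Occlusion, DeepLiftShap

# A small stand-in classifier; in practice this would be a trained document model.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 16),
).eval()

doc_image = torch.randn(1, 3, 224, 224)                  # placeholder document image
pred_class = model(doc_image).argmax(dim=1).item()

# Occlusion: slide a constant patch over the page and record the change in class score.
occlusion = Occlusion(model)
occ_attr = occlusion.attribute(
    doc_image,
    target=pred_class,
    sliding_window_shapes=(3, 32, 32),  # occluding patch size (channels, height, width)
    strides=(3, 16, 16),
    baselines=0.0,                      # occlude with a constant value
)

# DeepSHAP (DeepLiftShap): attribute the prediction against a background distribution.
background = torch.randn(8, 3, 224, 224)                 # placeholder background samples
shap_attr = DeepLiftShap(model).attribute(doc_image, baselines=background, target=pred_class)

# Both attribution maps share the input shape and can be rendered as heatmaps over the page.
print(occ_attr.shape, shap_attr.shape)                   # (1, 3, 224, 224) each
```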