Kamran Alipour - 21DOCS Test Area

Detecting Glaucoma from Fundus Photographs Using Deep Learning without Convolutions:...

Rui Fan

and 14 more

May 12, 2022

Purpose: To compare the diagnostic accuracy and explainability of a new Vision Transformer deep learning technique, the Data-efficient image Transformer (DeiT), with those of ResNet-50, both trained on fundus photographs from the Ocular Hypertension Treatment Study (OHTS) to detect primary open-angle glaucoma (POAG), and to identify the salient areas of the photographs most important for each model’s decision-making process. Study Design: Evaluation of a diagnostic technology. Subjects, Participants, and/or Controls: 66,715 photographs from 1,636 OHTS participants and five additional external datasets comprising 16,137 photographs of healthy and glaucoma eyes. Methods, Intervention, or Testing: DeiT models were trained to detect five ground-truth OHTS POAG classifications: OHTS Endpoint Committee POAG determinations due to disc changes (Model 1), visual field changes (Model 2), or either disc or visual field changes (Model 3), and reading center determinations based on disc (Model 4) and visual fields (Model 5). The best-performing DeiT models were compared to ResNet-50 on OHTS and five external datasets. Main Outcome Measures: Diagnostic performance was compared using areas under the receiver operating characteristic curve (AUROC) and sensitivities at fixed specificities. The explainability of the DeiT and ResNet-50 models was compared by evaluating the attention maps derived directly from DeiT against three gradient-weighted class activation map generation strategies. Results: Compared to our best-performing ResNet-50 models, the DeiT models demonstrated similar performance on the OHTS test sets for all five ground-truth POAG labels; AUROC ranged from 0.82 (Model 5) to 0.91 (Model 1). However, the AUROC of DeiT was consistently higher than that of ResNet-50 on the five external datasets. For example, AUROC for the main OHTS endpoint (Model 3) was between 0.08 and 0.20 higher for the DeiT models than for the ResNet-50 models. The saliency maps from the DeiT highlight localized areas of the neuroretinal rim, suggesting the use of important clinical features for classification, while the same maps in the ResNet-50 models show a more diffuse, generalized distribution around the optic disc. Conclusions: The Vision Transformer has the potential to improve the generalizability and explainability of deep learning models for the detection of eye disease and possibly other medical conditions that rely on imaging modalities for clinical diagnosis and management.
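The two outcome measures named in the abstract, AUROC and sensitivity at a fixed specificity, can be illustrated with a minimal sketch. This is not the study's evaluation code: the function names `auroc` and `sensitivity_at_specificity` are hypothetical, the scores are assumed to be per-image probabilities of POAG, and an image is assumed to be called positive when its score is at or above the threshold.

```python
def _split_by_label(labels, scores):
    """Separate scores into those of positive (1) and negative (0) cases."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    return pos, neg


def auroc(labels, scores):
    """AUROC as the chance that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos, neg = _split_by_label(labels, scores)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def sensitivity_at_specificity(labels, scores, target_specificity):
    """Best sensitivity among thresholds whose specificity meets the target.

    Assumption: a case is predicted positive when score >= threshold.
    """
    pos, neg = _split_by_label(labels, scores)
    best = 0.0
    # Candidate thresholds: every distinct score, plus +inf (predict nothing).
    for threshold in sorted(set(scores)) + [float("inf")]:
        specificity = sum(n < threshold for n in neg) / len(neg)
        sensitivity = sum(p >= threshold for p in pos) / len(pos)
        if specificity >= target_specificity:
            best = max(best, sensitivity)
    return best
```

Given binary labels and model scores, `auroc(labels, scores)` summarizes ranking quality over all thresholds, while `sensitivity_at_specificity(labels, scores, 0.95)` reports the sensitivity achievable when the false-positive rate is held at or below 5%. The pairwise AUROC loop is quadratic in the number of cases, so it is suitable for illustration rather than for datasets of the size reported above.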

Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions f...

Arijit Ray

and 6 more

June 25, 2021

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting the portions of the image or question that the model uses to infer answers. However, we find that users are often misled by current attention map visualizations, which can point to relevant regions even when the model produces an incorrect answer. Hence, we propose Error Maps, which clarify the error by highlighting image regions where the model is prone to err. Error Maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users’ understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users’ interpretation of explanations to estimate how helpful they are for understanding model correctness. Finally, we conduct user studies showing that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly (rho > 0.97) with how well users can predict model correctness.

Improving Users' Mental Model with Attention-directed Counterfactual Edits

Kamran Alipour

and 6 more

June 25, 2021

In the domain of Visual Question Answering (VQA), studies have shown improvement in users’ mental model of the VQA system when they are exposed to examples of how these systems answer certain Image-Question (IQ) pairs. In this work, we show that presenting controlled counterfactual image-question examples is more effective at improving users’ mental model than simply showing random examples. We compare a generative approach and a retrieval-based approach to producing counterfactual examples. We use recent advances in generative adversarial networks (GANs) to generate counterfactual images by deleting and inpainting certain regions of interest in the image. We then expose users to changes in the VQA system’s answer on those altered images. To select the region of interest for inpainting, we experiment with both human-annotated attention maps and a fully automatic method that uses the VQA system’s attention values. Finally, we test the user’s mental model by asking them to predict the model’s performance on a test counterfactual image. We note an overall improvement in users’ accuracy in predicting answer changes when they are shown counterfactual explanations. While realistic retrieved counterfactuals are, unsurprisingly, the most effective at improving the mental model, we show that a generative approach can be equally effective.
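The automatic region-selection step described above can be sketched as follows. This is an illustrative assumption, not the paper's method: the attention map is assumed to be a 2D grid of non-negative weights aligned with the image grid, the helper names `top_attention_mask` and `blank_region` are hypothetical, and the GAN inpainting step is replaced by a simple constant fill, so the sketch only shows how attention values might pick the region to edit.

```python
import math


def top_attention_mask(attention, keep_fraction=0.25):
    """Mark the highest-attention cells of a 2D attention grid with 1.

    At least one cell is always selected; cells tied with the cutoff value
    are all included, so slightly more than keep_fraction may be marked.
    """
    flat = [value for row in attention for value in row]
    if not flat:
        raise ValueError("attention map is empty")
    k = max(1, math.ceil(keep_fraction * len(flat)))
    cutoff = sorted(flat, reverse=True)[k - 1]
    return [[1 if value >= cutoff else 0 for value in row] for row in attention]


def blank_region(image, mask, fill=0):
    """Replace masked cells with a constant (stand-in for GAN inpainting)."""
    return [
        [fill if m else pixel for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 2x2 example: the cell with attention 0.60 is selected and blanked.
attention = [[0.05, 0.10], [0.60, 0.25]]
image = [[10, 20], [30, 40]]
mask = top_attention_mask(attention, keep_fraction=0.25)
edited = blank_region(image, mask)
```

In a full pipeline the mask would be handed to an inpainting model, and the VQA system would be queried on both the original and the edited image so that users can see whether the answer changes.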