Alzheimer's disease (AD) is a chronic, degenerative brain disease that affects memory, thinking, and retention. Diagnosing AD early, before clinical symptoms appear, is essential for effective therapy. Positron emission tomography (PET) measures the decline in glucose concentration in the temporoparietal association cortex. Deep learning is an artificial intelligence technology that identifies and predicts disease by extracting meaningful features from medical images. Convolutional neural networks (CNNs) are one effective application of deep learning to the diagnosis of AD, and Vision Transformers (ViTs) have recently outperformed CNNs in several diagnostic image classification tasks. Transformers can attend to all previously computed elements in a sequence, so they exhibit minimal inductive bias toward learning compressed representations over time. In contrast, a slow, naturally iterative stream tries to learn a specialized, compressed representation by grouping chunks of K time steps into a single representation that is decomposed into multiple vectors. With the proposed approach, we aim to retain the expressiveness of Transformers while promoting improved representational structure and slow-stream compression on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. On visual perception and sequential decision-making tasks, we demonstrate that the proposed technique improves sample efficiency and generalization performance over competitive baselines. Accordingly, we propose a technique that identifies dementia by combining 18F-Florbetaben PET scans with a ViT. The results show that the proposed method can be applied successfully to brain imaging and may offer a way to use pre-trained models in data-intensive applications. Moreover, the proposed cross-domain transfer learning technique achieves classification performance comparable to that of most current studies.
In our experiments, the proposed model achieves an accuracy of 91.08% on the AD versus cognitively normal (CN) classification task on the ADNI database. To explain these predictions, we provide an explainable artificial intelligence (XAI) approach based on attention maps.
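
To make the ViT and attention-map ideas above concrete, the following is a minimal sketch, not the model evaluated in this work. It splits a toy 2D image into non-overlapping patches, as a ViT does before embedding, and computes scaled dot-product attention weights of a single query over those patches. The function names (`patchify`, `attention_weights`), the 4x4 toy image, and the fixed query vector are illustrative assumptions; a real ViT would first pass each patch through a learned linear projection and use learned query and key matrices.

```python
import math

def patchify(image, patch):
    """Split a 2D image (list of rows) into non-overlapping patch x patch
    tiles, each flattened in row-major order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Toy 4x4 "scan" (hypothetical values, not PET data).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [2, 2, 9, 9],
         [2, 2, 9, 9]]

patches = patchify(image, 2)          # four 2x2 patches, flattened
query = [1.0, 1.0, 1.0, 1.0]          # assumed stand-in for a learned class-token query
weights = attention_weights(query, patches)

# The weights form a distribution over patches; reshaped to the patch grid,
# they give a coarse attention map over the image.
print(weights)
```

In this sketch the raw pixel values serve directly as keys, which is a simplification chosen to keep the example dependency-free. The resulting weights sum to one and concentrate on the brightest patch, which is the kind of per-patch relevance that an attention map visualizes.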