Advances in Artificial Intelligence play an important role in the development of modern Computer-Aided Diagnosis (CAD) systems, and CAD software has greatly benefited from deep learning for disease detection. Nonetheless, detection remains challenging at the early stage of disease, where atrophy patterns are underrepresented in imaging data and annotated data are too limited for efficient model training. Indeed, early diagnosis often requires discriminating between patients at the sub-category level, which constitutes a fine-grained classification problem. This is a challenging task due to subtle disease-specific patterns and clinically overlapping sub-categories (small inter-class variance), and hence the difficulty of learning clinically discriminant features. In this paper, we propose a transformer-based framework that learns a clinically meaningful disease representation for fine-grained classification of early Alzheimer’s disease (AD) conditions using 18F-FluoroDeoxyGlucose Positron Emission Tomography (18F-FDG PET) images. The proposed method captures both local and global contextual clinical information from the whole image and learns a latent representation through a regularized objective function. Network training is further boosted with the Contrastive Learning (CL) paradigm, and adversarial augmentation is proposed for effective pair mining in the CL optimization. The proposed method has been evaluated on six fine-grained classification tasks for AD detection and prediction on the ADNI dataset, and the obtained results outperform recent state-of-the-art approaches.
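To make the contrastive-learning component concrete, the sketch below implements a standard NT-Xent contrastive loss in NumPy, where each sample and a perturbed view of it (in the paper, a view produced by adversarial augmentation; here a generic second view) form a positive pair and all other samples in the batch act as negatives. This is a minimal illustration under stated assumptions, not the paper's actual objective: the function name, the NT-Xent variant, and the temperature value are illustrative choices, and the adversarial pair-mining step is abstracted away as a precomputed second view.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss.

    z1, z2: (N, D) embeddings of two views of the same N samples; row i of z1
    and row i of z2 form a positive pair, all remaining rows are negatives.
    In the paper's setting, z2 would come from an adversarially augmented view.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)          # (2N, D) stacked views
    sim = z @ z.T / temperature                   # (2N, 2N) similarity logits
    np.fill_diagonal(sim, -np.inf)                # exclude self-similarity
    n = len(z1)
    # index of the positive for each row: row i pairs with row i + n (and vice versa)
    pos = np.concatenate([np.arange(n) + n, np.arange(n)])
    # cross-entropy of each row's positive against all other pairs
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)
    return loss.mean()
```

As expected, the loss is low when the two views of each sample embed close together and high when the pairing is uninformative, which is what makes hard (e.g., adversarially mined) positive pairs useful for sharpening fine-grained representations.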