Detecting human emotional states, crucial in research on human-machine interaction and affective computing, often relies on objective and spontaneous physiological signals such as the electroencephalogram (EEG). Despite the wealth of state-of-the-art methods for emotion recognition, challenges in generalization power persist. To address this issue, we introduce an approach employing a transformer-based autoencoder for emotion classification using EEG. An autoencoder is applied to connectivity matrices to distill essential features and reduce data dimensionality. Additionally, a multi-head attention transformer directs attention to the features pertinent to emotion recognition, and its output is then passed to a fully connected network for classification. Extensive experiments were conducted on the DEAP and DREAMER datasets. In subject-dependent evaluation on the DEAP dataset, the model achieves accuracies of 99.68% and 99.47% for valence and arousal, respectively. In the subject-independent scheme, the model demonstrates accuracies of 94.08% and 94.98% for valence and arousal. On DREAMER, subject-dependent evaluation yields accuracies of 99.54% and 99.30% for valence and arousal prediction. In the subject-independent experiment, the model predicts valence and arousal levels with accuracies of 98.38% and 97.32%, respectively. The results are further elucidated through ablation experiments and visualization of the learned features.
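The pipeline summarized above (autoencoder bottleneck over connectivity matrices, multi-head attention over the latent features, fully connected classification head) can be sketched as follows. This is a minimal NumPy illustration with random weights; all layer sizes, the token reshaping, and the function names are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode(conn, W_enc):
    """Autoencoder bottleneck (encoder half): flatten the EEG
    connectivity matrix and project it to a low-dimensional latent."""
    return conn.reshape(-1) @ W_enc  # -> (d_latent,)

def multi_head_attention(x, Wq, Wk, Wv, n_heads):
    """Scaled dot-product self-attention over latent tokens,
    split across n_heads independent heads."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # each (tokens, d)
    d_head = q.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        scores = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_head))
        heads.append(scores @ v[:, s])      # (tokens, d_head)
    return np.concatenate(heads, axis=-1)   # (tokens, d)

# Illustrative sizes: 32 EEG channels, 64-dim latent viewed as 8 tokens of dim 8.
n_ch, d_latent = 32, 64
conn = rng.standard_normal((n_ch, n_ch))            # one connectivity matrix
W_enc = rng.standard_normal((n_ch * n_ch, d_latent)) * 0.01

latent = encode(conn, W_enc)
tokens = latent.reshape(8, 8)
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
attended = multi_head_attention(tokens, Wq, Wk, Wv, n_heads=2)

# Fully connected head producing two logits (e.g. valence/arousal level).
W_fc = rng.standard_normal((attended.size, 2)) * 0.01
probs = softmax(attended.reshape(-1) @ W_fc)
print(probs.shape)
```

In a trained model the encoder and attention weights would be learned jointly with a reconstruction and classification objective rather than drawn at random; the sketch only shows how the three stages compose.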