Eye diseases are a significant global health concern, affecting approximately a quarter of the world's population. Half of these cases are preventable and can be addressed if discovered early. Prior studies have reported promising results using standalone Convolutional Neural Network (CNN) and Vision Transformer (ViT) models for eye disease classification. However, very few have explored the potential of hybrid models for this task. This research implements a Convolutional vision Transformer (CvT) model, a hybrid of CNN and ViT, on the ODIR-5k dataset to develop a multiclass eye disease classification model. Although our CvT model shows promising results (68% accuracy, 69% precision, 68% recall, and 66% F1-score), it slightly underperforms the standalone ViT model, which achieved 69% accuracy, 69% precision, 69% recall, and 66% F1-score with far fewer parameters (86.7M for ViT versus 276.2M for CvT). Additionally, we found an incompatibility between EfficientNet and the dataset used. We suspect that the convolutional layers in the EfficientNet and CvT models might hinder the models from evaluating the images effectively by masking global patterns and fine-grained details. This limitation could also explain the extreme discrepancies in EfficientNet's performance metrics across the various sampling treatments. However, further research is required to investigate and address this issue.
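The abstract does not state how precision, recall, and F1-score are averaged across classes. As a minimal sketch, assuming support-weighted averaging (an assumption, though one consistent with the reported recall matching the accuracy for both models), the four metrics could be computed as follows. The function name `weighted_metrics` and the toy labels are hypothetical and are not taken from ODIR-5k.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights proportional
    to the number of true samples of each class (its support).
    """
    n = len(y_true)
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return accuracy, precision, recall, f1


# Hypothetical toy labels, for illustration only.
y_true = ["normal", "normal", "cataract", "glaucoma", "glaucoma", "glaucoma"]
y_pred = ["normal", "cataract", "cataract", "glaucoma", "glaucoma", "normal"]

acc, prec, rec, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Under this averaging scheme, weighted recall always equals accuracy, and the weighted F1-score is not the harmonic mean of the averaged precision and recall, which is why an F1-score can fall below both of them.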