Image classification is one of the prominent tasks driving the rapid advancement of computer vision with deep learning models such as convolutional neural networks (CNNs), and it has likewise gained remarkable attention in remote sensing scene classification. However, the literature remains difficult to compare systematically across well-established large-scale or application-dependent datasets and models. This article introduces five high-capacity convolutional neural network architectures (VGG, ResNet, DenseNet, SqueezeNet, and AlexNet) and their variations, evaluated on ten large-scale and small-scale datasets, to provide a one-to-one experimental summary. Observations were recorded for comparison while keeping the configurations ideal in each trial. The performance of the models is presented and discussed, with a focus on the number of floating-point operations (FLOPs), accuracy, F1-score, precision, recall, and AUC-ROC. The experiments validate the remote sensing image classification benchmark and, through the summary and discussion, give researchers and practitioners insight for choosing the optimal model by weighing performance against the number of required floating-point operations. Finally, the advanced optimization techniques of normalization, mixup augmentation, and label smoothing are applied to the best model chosen for each dataset.
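
To make the last two optimization techniques concrete, the following is a minimal sketch of label smoothing and mixup on plain Python lists. It assumes the augmentation named above is the standard mixup formulation (a convex combination of two samples and their labels); the function names, the smoothing factor `epsilon`, and the Beta parameter `alpha` are illustrative choices, not values taken from the article.

```python
import random


def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: blend a one-hot target with the uniform distribution.

    epsilon is an assumed, illustrative smoothing factor.
    """
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: convex combination of two inputs and of their label vectors."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha); alpha is assumed."""
    return rng.betavariate(alpha, alpha)


# Usage: smooth a 4-class target, then mix two toy "images" (flattened pixels).
smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0], epsilon=0.1)
mixed_x, mixed_y = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], lam=0.25)
```

Both transformations keep the target a valid probability distribution, so they can be combined with a cross-entropy loss that accepts soft labels.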