Remote sensing has enabled large-scale crop classification for understanding agricultural ecosystems and estimating production yields. In recent years, machine learning has been increasingly used for automated crop classification. However, in most approaches the novel algorithms are applied to custom datasets that contain information on only a few crop fields covering a small region, which often leads to models that lack generalization capability. In this work, we propose a multi-modal contrastive self-supervised learning approach to obtain a pre-trained model for crop classification without the use of labeled data. Such multi-modal self-supervised learning exploits the synergies between different data sources to obtain a richer representation of the data. We build our analysis by adapting the DENETHOR dataset, developed for a part of Eastern Germany, to our use case. We use publicly available Sentinel-2 data and commercial PlanetScope data: Sentinel-2 offers higher spectral resolution, while PlanetScope offers finer spatial resolution. For an end-user application, only one of the two sources is required at inference time. We analyze and compare the performance of our multi-modal self-supervised model against a uni-modal contrastive self-supervised model based on the SCARF algorithm, and additionally against a supervised model. We find that our multi-modal pre-trained model surpasses both the uni-modal and the supervised models in almost all test cases.
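
To make the idea concrete, the sketch below shows one common way such a multi-modal contrastive objective can be set up; it is our illustrative assumption of an InfoNCE-style formulation in PyTorch, not the paper's exact loss or architecture. The encoder widths, feature dimensions, and temperature are hypothetical: two modality-specific encoders map Sentinel-2 and PlanetScope features of the same field into a shared embedding space, where matching pairs act as positives and all other fields in the batch act as negatives.

```python
# Minimal sketch of a multi-modal contrastive (InfoNCE-style) objective.
# All dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Small MLP projecting one modality's features into a shared embedding space."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def multimodal_info_nce(z_s2: torch.Tensor, z_ps: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE: matching (Sentinel-2, PlanetScope) pairs of the same
    field are positives; all other pairs in the batch are negatives."""
    logits = z_s2 @ z_ps.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(z_s2.size(0), device=z_s2.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with hypothetical per-field feature vectors (e.g., band statistics):
enc_s2 = ModalityEncoder(in_dim=12)   # Sentinel-2: more spectral bands
enc_ps = ModalityEncoder(in_dim=4)    # PlanetScope: fewer bands, finer pixels
x_s2, x_ps = torch.randn(32, 12), torch.randn(32, 4)  # one batch of fields
loss = multimodal_info_nce(enc_s2(x_s2), enc_ps(x_ps))
loss.backward()
```

Because the loss only aligns the two embedding spaces during pre-training, either encoder can be used on its own downstream, which is consistent with requiring only a single data source in the end-user application.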