Pre-trained vision-language (VL) models have risen to prominence in recent years, achieving state-of-the-art performance on tasks such as visual question answering, image captioning, zero-shot classification, and text-to-image synthesis. However, these models are large and require large amounts of training data, which can hinder their application in resource-limited settings. This paper proposes CAPIT (Cross Attention on Pre-trained Image and Text models), a novel architecture built on top of frozen pre-trained unimodal encoders that transfers knowledge from the pre-trained models through cross-attentional transformers. CAPIT is trained with a simple supervised task that learns to predict the correspondence between image-text pairs, and it is then evaluated on zero-shot image classification. The proposed model has fewer parameters and uses less data than comparable methods while performing reasonably well, trading some performance for reduced compute requirements, which shows promise for future work in resource-limited settings.
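To make the described setup concrete, the sketch below illustrates (in PyTorch) one way a cross-attention fusion module over frozen unimodal encoders could be wired up and trained on image-text correspondence. It is a minimal illustration, not the paper's implementation: the embedding dimension, number of heads, pooling, and the binary matching head are all assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion block: text tokens attend to image tokens from frozen encoders."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Queries come from the text stream; keys/values come from the image stream.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.head = nn.Linear(dim, 1)  # binary image-text correspondence logit (assumed head)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (B, T, dim) token features from a frozen text encoder
        # image_feats: (B, P, dim) patch features from a frozen image encoder
        attended, _ = self.cross_attn(text_feats, image_feats, image_feats)
        fused = self.norm1(text_feats + attended)
        fused = self.norm2(fused + self.ffn(fused))
        return self.head(fused.mean(dim=1)).squeeze(-1)  # (B,) match logits

# Only the fusion module is trained; the unimodal encoders stay frozen, e.g.:
#   for p in image_encoder.parameters(): p.requires_grad_(False)
#   for p in text_encoder.parameters():  p.requires_grad_(False)
# and the supervised objective is a binary matching loss over image-text pairs:
#   logits = fusion(text_feats, image_feats)
#   loss = nn.functional.binary_cross_entropy_with_logits(logits, match_labels)
```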