In this work, we present a model architecture that combines reinforcement learning with transformers for multimodal tasks. Building on transformers lets us exploit the architecture's simplicity and scalability, as well as recent advances in language and vision modelling such as ViT, GPT-x, and BERT. Specifically, we train the model to recognize digits from the MNIST dataset given their associated word labels and the instructions provided, an approach analogous to how an infant learns to associate pictorial representations of digits with the corresponding words. We use a MoME transformer in conjunction with Deep Q-Learning to train the model. Image inputs are embedded with a pre-trained ResNet18, and instructions are embedded with GloVe, before being passed to the model for prediction and training.
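To make the pipeline concrete, the following PyTorch sketch illustrates how such a model could be wired together: a frozen, pre-trained ResNet18 embeds the image, pre-computed GloVe vectors embed the instruction tokens, a transformer encoder fuses the two streams, and a linear head emits one Q-value per candidate digit. The class name, dimensions, layer counts, and the use of a standard `nn.TransformerEncoder` in place of the MoME transformer are illustrative assumptions rather than the implementation described in this paper.

```python
import torch
import torch.nn as nn
from torchvision import models


class MultimodalQNetwork(nn.Module):
    """Illustrative sketch (not the paper's code): fuse ResNet18 image features
    and GloVe instruction embeddings with a transformer encoder, then output
    Q-values over candidate digit actions for DQN-style training."""

    def __init__(self, text_dim=300, model_dim=256, num_actions=10):
        super().__init__()
        # Frozen, pre-trained ResNet18 as the image feature extractor.
        # MNIST digits are assumed to be replicated to 3 channels beforehand.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        self.image_proj = nn.Linear(512, model_dim)       # ResNet18 features -> model dim
        self.text_proj = nn.Linear(text_dim, model_dim)   # 300-d GloVe vectors -> model dim
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=4, batch_first=True)
        # A plain encoder stands in for the MoME transformer in this sketch.
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.q_head = nn.Linear(model_dim, num_actions)   # one Q-value per digit action

    def forward(self, images, instruction_embeddings):
        # images: (B, 3, H, W); instruction_embeddings: (B, T, text_dim) from GloVe.
        img = self.image_encoder(images).flatten(1)           # (B, 512)
        img_tok = self.image_proj(img).unsqueeze(1)           # (B, 1, D)
        txt_tok = self.text_proj(instruction_embeddings)      # (B, T, D)
        fused = self.fusion(torch.cat([img_tok, txt_tok], dim=1))
        return self.q_head(fused[:, 0])                       # Q-values read from the image token


# A Deep Q-Learning update would then follow the usual TD pattern, e.g.:
#   q = net(images, instr).gather(1, actions.unsqueeze(1)).squeeze(1)
#   target = rewards + gamma * target_net(next_images, instr).max(1).values
#   loss = nn.functional.smooth_l1_loss(q, target.detach())
```

In a setup like this, the reward signal would simply indicate whether the predicted digit matches the instruction's label, which is what lets the agent learn the image-word association through Q-Learning rather than supervised cross-entropy.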