Humans have the innate ability to perceive an image at a glance: to us, an image is not merely a collection of objects but a network of interconnected object relationships. The difficulty arises when a machine must interpret an image, which motivates converting image data into textual data. Despite major achievements in the image captioning field, there is a shortage of models that produce concise captions for a given image; moreover, existing models tend to be large, with very high numbers of learnable parameters. The objective of this paper is to fill that gap: we present an image captioning model built on a compact yet effective CNN architecture that is relatively new to the research field and has seen little use so far. Our model combines a deep convolutional neural network, which extracts image features, with an attention-based GRU that uses a local attention network to generate captions. We also identified a class imbalance problem in the popular Flickr dataset and addressed it by adding images for specific under-represented classes, thereby improving the dataset itself. The model has been trained on this improved Flickr dataset.
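For concreteness, the encoder–decoder pipeline described above (a CNN feature extractor feeding a GRU decoder with attention over spatial features) can be sketched as follows. This is a minimal, hypothetical PyTorch illustration: the backbone choice, module names, and dimensions are assumptions for exposition, not the authors' exact architecture.

```python
# Minimal sketch of a CNN-encoder + attention-GRU-decoder captioner.
# Hypothetical names and sizes; not the paper's exact model.
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN feature extractor: returns a grid of spatial features."""
    def __init__(self):
        super().__init__()
        cnn = models.resnet18(weights=None)          # stand-in backbone
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])

    def forward(self, images):                       # (B, 3, H, W)
        feats = self.backbone(images)                # (B, C, h, w)
        return feats.flatten(2).transpose(1, 2)      # (B, h*w, C)

class AttnGRUDecoder(nn.Module):
    """GRU decoder attending over encoder features at each step."""
    def __init__(self, vocab, embed=256, hidden=512, feat=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.attn = nn.Linear(hidden + feat, 1)      # additive scoring
        self.gru = nn.GRUCell(embed + feat, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, tokens, h):
        # feats: (B, L, feat); tokens: (B,); h: (B, hidden)
        scores = self.attn(torch.cat(
            [h.unsqueeze(1).expand(-1, feats.size(1), -1), feats], dim=-1))
        ctx = (scores.softmax(1) * feats).sum(1)     # attended context
        h = self.gru(torch.cat([self.embed(tokens), ctx], dim=-1), h)
        return self.out(h), h                        # next-word logits
```

At inference time, the decoder is unrolled one token at a time, feeding back the argmax (or beam-search) prediction; a local attention variant would restrict the softmax to a window of feature positions rather than the full grid.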