A vision transformer (ViT) is developed to classify images of beam profiles coupled out from silicon photonics (SiPh) gratings. The classification task aims to distinguish 'converged' from 'diverged' beam profiles and to identify the region above the SiPh grating in which each beam profile is located. After training on 1247 beam profile images, the ViT model performs a 6-category image classification task on 832 beam profile images with a classification accuracy of 0.989. Because ViT training is stochastic in nature, the training is repeated 100 times to test its robustness. Across these runs the testing accuracy ranges from 0.83 to 0.99, with 82 of the 100 runs achieving a testing accuracy of >0.95.
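
The two figures of merit quoted above, the classification accuracy on a test set and the spread of testing accuracies over repeated training runs, can be computed along the following lines. This is a minimal sketch rather than the authors' evaluation code: the names `accuracy`, `summarise_runs`, and `RobustnessSummary` are hypothetical, class labels are assumed to be integers, and the 0.95 threshold is taken from the text.

```python
from dataclasses import dataclass
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of test images whose predicted class matches the true label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


@dataclass
class RobustnessSummary:
    """Summary of testing accuracy over repeated training runs (hypothetical type)."""
    lowest: float
    highest: float
    runs_above_threshold: int
    total_runs: int


def summarise_runs(accuracies: Sequence[float], threshold: float = 0.95) -> RobustnessSummary:
    """Report the accuracy range and how many runs strictly exceed the threshold."""
    if len(accuracies) == 0:
        raise ValueError("at least one run is required")
    return RobustnessSummary(
        lowest=min(accuracies),
        highest=max(accuracies),
        # Strict comparison, matching the '>0.95' criterion in the text.
        runs_above_threshold=sum(a > threshold for a in accuracies),
        total_runs=len(accuracies),
    )
```

For example, `summarise_runs([0.90, 0.96, 0.99])` counts two of three runs above the threshold; these three values are illustrative placeholders, not results from this work.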