Vision Transformers (ViT) in Image Captioning Using Pretrained ViT Models