python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py
+### VQA:
+1. Download the VQA v2 and Visual Genome datasets from their original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
+2. To evaluate the finetuned BLIP model, generate results with the command below (the evaluation itself must be performed on the official server):
+python -m torch.distributed.run --nproc_per_node=8 --use_env train_vqa.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml to "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:
+python -m torch.distributed.run --nproc_per_node=16 --use_env train_vqa.py
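
Before launching a VQA run, it can help to confirm that the dataset roots from step 1 actually exist. The following is a minimal sketch and not part of the BLIP codebase: `find_missing_roots` is a hypothetical helper, and it assumes the contents of configs/vqa.yaml have already been loaded into a plain dict (for example with a YAML parser).

```python
from pathlib import Path


def find_missing_roots(config, keys=("vqa_root", "vg_root")):
    """Return the config keys that are unset or do not point to an existing directory.

    `config` is assumed to be a dict holding the values from configs/vqa.yaml.
    """
    missing = []
    for key in keys:
        value = config.get(key)
        # A key counts as missing if it is absent, empty, or not a directory.
        if not value or not Path(value).is_dir():
            missing.append(key)
    return missing
```

For example, `find_missing_roots({"vqa_root": "/path/to/vqa", "vg_root": "/path/to/vg"})` returns the keys that still need fixing; the paths shown here are placeholders, not real locations.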
+
+### NLVR2:
+1. Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
+2. To evaluate the finetuned BLIP model, run:
+python -m torch.distributed.run --nproc_per_node=8 --use_env train_nlvr.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml to "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
+python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py
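
The launch commands in this README differ only in the script name, the process count, and the optional `--evaluate` flag. The sketch below assembles them programmatically, which may be convenient when scripting several runs. `build_launch_command` is a hypothetical helper introduced here for illustration; it uses only the flags already shown above.

```python
import shlex


def build_launch_command(script, nproc_per_node, evaluate=False):
    """Assemble the torch.distributed.run invocation as an argument list."""
    cmd = [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "--use_env",
        script,
    ]
    if evaluate:
        # Evaluation mode reuses the training script with one extra flag.
        cmd.append("--evaluate")
    return cmd


# Render the NLVR2 evaluation command as a single shell-safe string.
print(shlex.join(build_launch_command("train_nlvr.py", 8, evaluate=True)))
# prints: python -m torch.distributed.run --nproc_per_node=8 --use_env train_nlvr.py --evaluate
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when script names or paths contain spaces.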
+