diff --git a/README.md b/README.md
index 7650a23..979a38a 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@ Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L
 Image-Text Retrieval (COCO) | Download | - | Download
 Image-Text Retrieval (Flickr30k) | Download | - | Download
 Image Captioning (COCO) | - | Download | Download
-VQA | Download | - | -
+VQA | Download | Download | -
 NLVR2 | Download | - | -
@@ -46,8 +46,22 @@ NLVR2 |
 python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py
+### VQA:
+1. Download the VQA v2 and Visual Genome datasets from their original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
+2. To evaluate the finetuned BLIP model, generate results with (evaluation must be performed on the official server):
+<pre>python -m torch.distributed.run --nproc_per_node=8 --use_env train_vqa.py --evaluate</pre>
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml to "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth", then run:
+<pre>python -m torch.distributed.run --nproc_per_node=16 --use_env train_vqa.py</pre>
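For reference, the VQA steps above name three config keys. A hypothetical sketch of the relevant part of configs/vqa.yaml (only 'vqa_root', 'vg_root', and 'pretrained' are named in the steps; the dataset paths shown are placeholder assumptions, not values from the repo):

```yaml
# Sketch of configs/vqa.yaml -- dataset paths are placeholders to
# replace with your local dataset locations.
vqa_root: '/path/to/vqa_v2/'        # set in step 1
vg_root: '/path/to/visual_genome/'  # set in step 1
# set in step 3 before finetuning:
pretrained: 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth'
```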
+
+### NLVR2:
+1. Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
+2. To evaluate the finetuned BLIP model, run:
+<pre>python -m torch.distributed.run --nproc_per_node=8 --use_env train_nlvr.py --evaluate</pre>
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml to "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth", then run:
+<pre>python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py</pre>
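Step 3 of each recipe above edits a single key in a YAML config before launching training. A small helper like the one below could script that edit instead of doing it by hand (a sketch assuming PyYAML is installed; the helper name and the flat key layout are assumptions, not part of the repo):

```python
import yaml


def set_config_key(path, key, value):
    """Load a YAML config, overwrite one top-level key, and write it back."""
    with open(path) as f:
        cfg = yaml.safe_load(f) or {}
    cfg[key] = value
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f, default_flow_style=False)
    return cfg


# Example: point the NLVR2 config at the released base checkpoint.
# set_config_key(
#     "configs/nlvr.yaml",
#     "pretrained",
#     "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth",
# )
```

The same helper works for the VQA step by targeting configs/vqa.yaml instead.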