From 13e95d7ff7a0724babe074b804b6731af3ca7c0f Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:11:30 +0800 Subject: [PATCH 1/6] Update README.md --- README.md | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index bd2dac1..d2b3817 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,32 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation -This is the PyTorch implementation of the BLIP paper. +This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. Catalog: - [x] Inference demo - [x] Pre-trained and finetuned checkpoints -- [x] Pre-training code - [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2 +- [x] Pre-training code - [x] Download of bootstrapped image-text dataset ### Inference demo (Image Captioning and VQA): Run our interactive demo using Colab notebook (no GPU needed): + +### Pre-trained checkpoints: +Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L +--- | --- | --- | --- +14M | Download| - | - +129M | Download| Download | Download + +### Image-Text Retrieval: +1. Download COCO or Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly. +2. To evaluate the finetuned BLIP model on COCO, run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \
+--config ./configs/retrieval_coco.yaml \
+--output_dir output/retrieval_coco \
+--evaluate
+3. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as . Then run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \
+--config ./configs/retrieval_coco.yaml \
+--output_dir output/retrieval_coco 
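A note on steps 1 and 3 above: the config edits can also be scripted. The sketch below only assumes that 'image_root' and 'pretrained' are top-level keys in configs/retrieval_coco.yaml (as named above); the paths are placeholders, not the actual dataset location or checkpoint link. Editing the YAML by hand works just as well.
<pre># placeholders -- substitute your own dataset folder and downloaded checkpoint
sed -i "s#^image_root:.*#image_root: '/path/to/coco/images/'#" configs/retrieval_coco.yaml
sed -i "s#^pretrained:.*#pretrained: '/path/to/pretrained_checkpoint.pth'#" configs/retrieval_coco.yaml</pre>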
From 473f323924b28bd03eb808a1c3cfe87ea997b9c1 Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:19:58 +0800 Subject: [PATCH 2/6] Update README.md --- README.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index d2b3817..4bdbe16 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,11 @@ This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. Catalog: -- [x] Inference demo +- [ ] Inference demo - [x] Pre-trained and finetuned checkpoints - [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2 - [x] Pre-training code -- [x] Download of bootstrapped image-text dataset +- [x] Download of bootstrapped image-text datasets ### Inference demo (Image Captioning and VQA): @@ -15,10 +15,20 @@ Run our interactive demo using Colab notebook (no GPU needed): ### Pre-trained checkpoints: Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L ---- | --- | --- | --- +--- | :---: | :---: | :---: 14M | Download| - | - 129M | Download| Download | Download +### Finetuned checkpoints: +Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L +--- | :---: | :---: | :---: +Image-Text Retrieval (COCO) | Download| - | Download +Image-Text Retrieval (Flickr30k) | Download| - | Download +Image Captioning (COCO) | - | Download| Download | +VQA | Download| - | - +NLVR2 | Download| - | - + + ### Image-Text Retrieval: 1. Download COCO or Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly. 2. To evaluate the finetuned BLIP model on COCO, run: From 4bf27877af3032edc562eb2547aa7ce55947f38e Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:22:08 +0800 Subject: [PATCH 3/6] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 4bdbe16..9e316cd 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. +To install the dependencies, run
pip install -r requirements.txt
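If you prefer an isolated environment, a minimal sketch using the standard venv module (the environment name is arbitrary):
<pre>python3 -m venv blip-env
source blip-env/bin/activate
pip install -r requirements.txt</pre>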
Catalog: - [ ] Inference demo From 16633402c211f32e3bd03bbd4d182446901bf75d Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:31:05 +0800 Subject: [PATCH 4/6] Update README.md --- README.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9e316cd..7650a23 100644 --- a/README.md +++ b/README.md @@ -31,13 +31,23 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \ --config ./configs/retrieval_coco.yaml \ --output_dir output/retrieval_coco + +### Image-Text Captioning: +1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly. +2. To evaluate the finetuned BLIP model on COCO, run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py --evaluate
+3. To evaluate the finetuned BLIP model on NoCaps, generate results with: +
python -m torch.distributed.run --nproc_per_node=8 --use_env eval_nocaps.py 
+4. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py 
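The captioning commands above omit the --config and --output_dir flags used in the retrieval section and presumably fall back to the scripts' defaults. A fully explicit sketch, assuming train_caption.py and eval_nocaps.py accept the same flags as train_retrieval.py (the output directory names here are arbitrary):
<pre>python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py \
--config ./configs/caption_coco.yaml \
--output_dir output/caption_coco \
--evaluate
python -m torch.distributed.run --nproc_per_node=8 --use_env eval_nocaps.py \
--config ./configs/nocaps.yaml \
--output_dir output/nocaps</pre>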
+ From 08627003f8290f340d22dfb5b4bf436cd25c4925 Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:38:51 +0800 Subject: [PATCH 5/6] Update README.md --- README.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7650a23..979a38a 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L Image-Text Retrieval (COCO) |
Download| - | Download Image-Text Retrieval (Flickr30k) | Download| - | Download Image Captioning (COCO) | - | Download| Download | -VQA | Download| - | - +VQA | Download| Download | - NLVR2 | Download| - | - @@ -46,8 +46,22 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py +### VQA: +1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml. +2. To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server) +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_vqa.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=16 --use_env train_vqa.py 
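The 16-GPU VQA finetuning can also be spread over two 8-GPU nodes using the standard torch.distributed.run rendezvous flags; a sketch, where MASTER_ADDR is a placeholder for node 0's address:
<pre># run on node 0, then repeat on node 1 with --node_rank=1
python -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=$MASTER_ADDR --master_port=29500 --use_env train_vqa.py</pre>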
+ +### NLVR2: +1. Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml. +2. To evaluate the finetuned BLIP model, run +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_nlvr.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py 
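The evaluation commands in this README assume an 8-GPU machine; the same launcher should also work with fewer processes, just more slowly. A single-GPU sketch for NLVR2 evaluation:
<pre>python -m torch.distributed.run --nproc_per_node=1 --use_env train_nlvr.py --evaluate</pre>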
+ From 0f8d19bbc952cdca675519249bfe75f57195844d Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:40:36 +0800 Subject: [PATCH 6/6] Update README.md --- README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/README.md b/README.md index 979a38a..2c6d756 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation + + This is the PyTorch implementation of the
BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. To install the dependencies, run
pip install -r requirements.txt
@@ -65,3 +67,12 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py +### Citation +If you find this code to be useful for your research, please consider citing. +
+@inproceedings{li2022blip,
+      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
+      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
+      year={2022},
+      booktitle={ICML},
+}