From 13e95d7ff7a0724babe074b804b6731af3ca7c0f Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:11:30 +0800 Subject: [PATCH 1/6] Update README.md --- README.md | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index bd2dac1..d2b3817 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,32 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation -This is the PyTorch implementation of the BLIP paper. +This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. Catalog: - [x] Inference demo - [x] Pre-trained and finetuned checkpoints -- [x] Pre-training code - [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2 +- [x] Pre-training code - [x] Download of bootstrapped image-text dataset ### Inference demo (Image Captioning and VQA): Run our interactive demo using Colab notebook (no GPU needed): + +### Pre-trained checkpoints: +Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L +--- | --- | --- | --- +14M | Download| - | - +129M | Download| Download | Download + +### Image-Text Retrieval: +1. Download COCO or Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly. +2. To evaluate the finetuned BLIP model on COCO, run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \
+--config ./configs/retrieval_coco.yaml \
+--output_dir output/retrieval_coco \
+--evaluate
+3. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as . Then run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \
+--config ./configs/retrieval_coco.yaml \
+--output_dir output/retrieval_coco 
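A note on steps 1 and 3 above: the config edits can also be scripted. The sketch below only assumes that 'image_root' and 'pretrained' are top-level keys in configs/retrieval_coco.yaml (as named above); the paths are placeholders, not the actual dataset location or checkpoint link. Editing the YAML by hand works just as well.
<pre># placeholders -- substitute your own dataset folder and downloaded checkpoint
sed -i "s#^image_root:.*#image_root: '/path/to/coco/images/'#" configs/retrieval_coco.yaml
sed -i "s#^pretrained:.*#pretrained: '/path/to/pretrained_checkpoint.pth'#" configs/retrieval_coco.yaml</pre>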
From 473f323924b28bd03eb808a1c3cfe87ea997b9c1 Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:19:58 +0800 Subject: [PATCH 2/6] Update README.md --- README.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index d2b3817..4bdbe16 100644 --- a/README.md +++ b/README.md @@ -3,11 +3,11 @@ This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. Catalog: -- [x] Inference demo +- [ ] Inference demo - [x] Pre-trained and finetuned checkpoints - [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2 - [x] Pre-training code -- [x] Download of bootstrapped image-text dataset +- [x] Download of bootstrapped image-text datasets ### Inference demo (Image Captioning and VQA): @@ -15,10 +15,20 @@ Run our interactive demo using Colab notebook (no GPU needed): ### Pre-trained checkpoints: Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L ---- | --- | --- | --- +--- | :---: | :---: | :---: 14M | Download| - | - 129M | Download| Download | Download +### Finetuned checkpoints: +Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L +--- | :---: | :---: | :---: +Image-Text Retrieval (COCO) | Download| - | Download +Image-Text Retrieval (Flickr30k) | Download| - | Download +Image Captioning (COCO) | - | Download| Download | +VQA | Download| - | - +NLVR2 | Download| - | - + + ### Image-Text Retrieval: 1. Download COCO or Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly. 2. To evaluate the finetuned BLIP model on COCO, run: From 4bf27877af3032edc562eb2547aa7ce55947f38e Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:22:08 +0800 Subject: [PATCH 3/6] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 4bdbe16..9e316cd 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation This is the PyTorch implementation of the BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. +To install the dependencies, run
pip install -r requirements.txt
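If you prefer an isolated environment, a minimal sketch using the standard venv module (the environment name is arbitrary):
<pre>python3 -m venv blip-env
source blip-env/bin/activate
pip install -r requirements.txt</pre>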
Catalog: - [ ] Inference demo From 16633402c211f32e3bd03bbd4d182446901bf75d Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:31:05 +0800 Subject: [PATCH 4/6] Update README.md --- README.md | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 9e316cd..7650a23 100644 --- a/README.md +++ b/README.md @@ -31,13 +31,23 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=8 --use_env train_retrieval.py \ --config ./configs/retrieval_coco.yaml \ --output_dir output/retrieval_coco + +### Image-Text Captioning: +1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly. +2. To evaluate the finetuned BLIP model on COCO, run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py --evaluate
+3. To evaluate the finetuned BLIP model on NoCaps, generate results with: +
python -m torch.distributed.run --nproc_per_node=8 --use_env eval_nocaps.py 
+4. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py 
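The captioning commands above omit the --config and --output_dir flags used in the retrieval section and presumably fall back to the scripts' defaults. A fully explicit sketch, assuming train_caption.py and eval_nocaps.py accept the same flags as train_retrieval.py (the output directory names here are arbitrary):
<pre>python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py \
--config ./configs/caption_coco.yaml \
--output_dir output/caption_coco \
--evaluate
python -m torch.distributed.run --nproc_per_node=8 --use_env eval_nocaps.py \
--config ./configs/nocaps.yaml \
--output_dir output/nocaps</pre>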
+ From 08627003f8290f340d22dfb5b4bf436cd25c4925 Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:38:51 +0800 Subject: [PATCH 5/6] Update README.md --- README.md | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 7650a23..979a38a 100644 --- a/README.md +++ b/README.md @@ -26,7 +26,7 @@ Task | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L Image-Text Retrieval (COCO) |
Download| - | Download Image-Text Retrieval (Flickr30k) | Download| - | Download Image Captioning (COCO) | - | Download| Download | -VQA | Download| - | - +VQA | Download| Download | - NLVR2 | Download| - | - @@ -46,8 +46,22 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=8 --use_env train_caption.py +### VQA: +1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml. +2. To evaluate the finetuned BLIP model, generate results with: (evaluation needs to be performed on official server) +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_vqa.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=16 --use_env train_vqa.py 
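The 16-GPU VQA finetuning can also be spread over two 8-GPU nodes using the standard torch.distributed.run rendezvous flags; a sketch, where MASTER_ADDR is a placeholder for node 0's address:
<pre># run on node 0, then repeat on node 1 with --node_rank=1
python -m torch.distributed.run --nproc_per_node=8 --nnodes=2 --node_rank=0 \
--master_addr=$MASTER_ADDR --master_port=29500 --use_env train_vqa.py</pre>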
+ +### NLVR2: +1. Download NLVR2 dataset from the original websites, and set 'image_root' in configs/nlvr.yaml. +2. To evaluate the finetuned BLIP model, run +
python -m torch.distributed.run --nproc_per_node=8 --use_env train_nlvr.py --evaluate
+3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run: +
python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py 
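The evaluation commands in this README assume an 8-GPU machine; the same launcher should also work with fewer processes, just more slowly. A single-GPU sketch for NLVR2 evaluation:
<pre>python -m torch.distributed.run --nproc_per_node=1 --use_env train_nlvr.py --evaluate</pre>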
+ From 0f8d19bbc952cdca675519249bfe75f57195844d Mon Sep 17 00:00:00 2001 From: Junnan Li Date: Thu, 27 Jan 2022 21:40:36 +0800 Subject: [PATCH 6/6] Update README.md --- README.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/README.md b/README.md index 979a38a..2c6d756 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ ## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation + + This is the PyTorch implementation of the
BLIP paper. The code has been tested on PyTorch 1.9 and 1.10. To install the dependencies, run
pip install -r requirements.txt
@@ -65,3 +67,12 @@ NLVR2 | python -m torch.distributed.run --nproc_per_node=16 --use_env train_nlvr.py +### Citation +If you find this code to be useful for your research, please consider citing. +
+@inproceedings{li2022blip,
+      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
+      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
+      year={2022},
+      booktitle={ICML},
+}