diff --git a/README.md b/README.md
index 31cb7e5..476cf83 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Catalog:
 - [x] Pre-trained and finetuned checkpoints
 - [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
 - [x] Pre-training code
+- [x] Zero-shot video-text retrieval
 - [x] Download of bootstrapped pre-training datasets
 
@@ -85,6 +86,12 @@ In order to finetune a model with ViT-L, simply change the config file to set 'v
 3. Pre-train the model using 8 A100 GPUs:
 <pre>python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/Pretrain.yaml --output_dir output/Pretrain </pre>
 
+### Zero-shot video-text retrieval:
+1. Download the MSRVTT dataset following the instructions from https://github.com/salesforce/ALPRO, and set 'video_root' accordingly in configs/retrieval_msrvtt.yaml.
+2. Install [decord](https://github.com/dmlc/decord) with <pre>pip install decord</pre>
+3. To perform zero-shot evaluation, run
+<pre>python -m torch.distributed.run --nproc_per_node=8 eval_retrieval_video.py</pre>
+
 ### Pre-training datasets download:
 We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.
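
For step 2 of the zero-shot retrieval instructions above, here is a minimal sketch of how decord can uniformly sample frames from a video clip. The path `video.mp4` and `num_frames=8` are illustrative assumptions; the frame count and preprocessing actually used are defined in configs/retrieval_msrvtt.yaml and eval_retrieval_video.py.
<pre>
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, num_frames=8):  # num_frames is an assumed value
    vr = VideoReader(video_path, ctx=cpu(0))  # decode on CPU
    # pick num_frames uniformly spaced temporal indices across the clip
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(idx).asnumpy()        # (num_frames, H, W, 3) uint8 array

frames = sample_frames("video.mp4")           # illustrative path
print(frames.shape)
</pre>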
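The {'url': ..., 'caption': ...} format described above is plain JSON, so the files can be consumed with a few lines of Python. A minimal sketch, assuming an illustrative filename `pairs.json` and that the `requests` package is available for downloading:
<pre>
import json
import requests  # assumed available; install with pip install requests

with open("pairs.json") as f:      # illustrative filename
    pairs = json.load(f)           # list of {'url': ..., 'caption': ...} dicts

item = pairs[0]
resp = requests.get(item["url"], timeout=10)
with open("sample.jpg", "wb") as out:
    out.write(resp.content)        # save the image alongside its caption
print(item["caption"])
</pre>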