## BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch implementation of the BLIP paper.

Catalog:
- [x] Inference demo
- [x] Pre-trained and finetuned checkpoints
- [x] Pre-training code
- [x] Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
- [x] Download of bootstrapped image-text dataset

### Inference demo (Image Captioning and VQA):
Run our interactive demo using the Colab notebook (no GPU needed):
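
If you prefer to try captioning inference locally rather than in the notebook, the snippet below is a minimal sketch. It assumes this repo's `models.blip.blip_decoder` interface, a downloaded captioning checkpoint (`checkpoint.pth` is a placeholder), and a local test image `demo.jpg`; run it from the repository root so the default model configs resolve.

```python
# Minimal local captioning sketch (illustrative; paths and hyperparameters are placeholders).
import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_decoder  # assumes this repo is on PYTHONPATH / cwd is the repo root

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
image_size = 384

# CLIP-style normalization used by the BLIP image preprocessing.
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])

raw_image = Image.open('demo.jpg').convert('RGB')        # replace with your own image
image = transform(raw_image).unsqueeze(0).to(device)

# 'checkpoint.pth' stands in for one of the released captioning checkpoints (a URL also works).
model = blip_decoder(pretrained='checkpoint.pth', image_size=image_size, vit='base')
model.eval().to(device)

with torch.no_grad():
    # Beam search decoding; sample=True switches to nucleus sampling instead.
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
print('caption:', caption[0])
```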