BLIP-2: Salesforce proposes efficient training of large multimodal models with a frozen visual encoder and frozen LLM parameters

  • Paper link: https://arxiv.org/abs/2301.12597

  • Project code: https://github.com/salesforce/LAVIS/tree/main/projects/blip2

  • Experience address: https://huggingface.co/spaces/Salesforce/BLIP2

  • Document introduction: https://huggingface.co/docs/transformers/main/en/model_doc/blip-2

  • Fine-tuning reference: https://github.com/salesforce/LAVIS

In the past few years, vision-language pre-training (VLP) has kept pushing the state of the art as models grow larger, but end-to-end training makes pre-training increasingly expensive. Since the release of ChatGPT in November 2022, interest in the "emergent abilities" of large models has surged, especially their zero-shot ability (making predictions without any fine-tuning). Because LLMs have enormous parameter counts, fine-tuning them is very costly, so methods that keep the LLM parameters frozen have proliferated, such as adapters, LoRA, and the BLIP-2 approach covered in this article.

Introduction to BLIP-2

The BLIP-2 method proposed in this paper is a new framework for multimodal pre-training. The idea is to insert a lightweight Querying Transformer (Q-Former) between a frozen pre-trained image encoder and a frozen pre-trained large language model to bridge the modality gap between vision and language. In the whole model, the Q-Former is the only trainable module; the image encoder and the language model are always kept frozen.

Pre-training proceeds in two stages. In the first stage, we perform vision-language representation learning, which forces the Q-Former to learn the visual features most relevant to the text. In the second stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to the LLM and training the Q-Former so that its visual representations can be interpreted by the LLM.

Key advantages of BLIP-2 include:

BLIP-2 effectively leverages frozen pre-trained image models and language models. The modality gap is bridged by a Q-Former pre-trained in two stages: a representation learning stage and a generative learning stage. BLIP-2 achieves SOTA performance on various vision-language tasks, including visual question answering, image captioning, and image-text retrieval.

Powered by strong LLMs (such as OPT and Flan-T5), BLIP-2 can be prompted to perform zero-shot image-to-text generation following natural language instructions, enabling emerging capabilities such as visual knowledge reasoning and visual dialogue.

Thanks to the frozen unimodal models and the lightweight Q-Former, BLIP-2 is more compute-efficient than existing SOTA methods. For example, BLIP-2 outperforms Flamingo by 8.7% on zero-shot VQAv2 while using 54 times fewer trainable parameters. Furthermore, BLIP-2 is a generic method that can incorporate stronger unimodal models for better VLP performance.

BLIP-2 model structure

Q-Former is the trainable module that bridges the modality gap between the frozen image encoder and the LLM. It extracts a fixed number of output features from the image encoder, independent of the input image resolution. As shown in Figure 2, Q-Former consists of two transformer sub-modules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction; (2) a text transformer that can act as both a text encoder and a text decoder. A set of learnable query embeddings is created as input to the image transformer. The queries interact with each other through self-attention layers and interact with the image features through cross-attention layers. The queries can also interact with the text through the same self-attention layers. Depending on the pre-training task, different self-attention masks are applied to control the query-text interaction. The authors initialize Q-Former with the pre-trained weights of BERT-base, while the cross-attention layers are randomly initialized. In total, Q-Former contains 188M parameters. Note that the query embeddings are also counted as model parameters.

In the experiments, the authors use 32 query embeddings, each with a dimension of 768 (the same as the hidden dimension of Q-Former). We use Z to denote the output query representation. The size of Z (32 × 768) is much smaller than that of the frozen image features (e.g. 257 × 1024 for ViT-L/14). This bottleneck structure, together with the pre-training objectives, forces the queries to extract the visual information most relevant to the text.
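
To make the shapes concrete, here is a minimal PyTorch sketch of how 32 learnable queries can attend to frozen ViT-L/14 features through cross-attention and come out as a compact 32 × 768 representation. It is purely illustrative (a single cross-attention layer standing in for the image transformer, with made-up tensor names), not the actual LAVIS implementation.

import torch
import torch.nn as nn

num_queries, hidden_dim = 32, 768          # Q-Former hidden size (BERT-base)
num_patches, image_feat_dim = 257, 1024    # e.g. frozen ViT-L/14 output

# Learnable query embeddings -- these count as model parameters.
query_tokens = nn.Parameter(torch.zeros(1, num_queries, hidden_dim))

# A single cross-attention layer standing in for the image transformer.
cross_attn = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=12,
                                   kdim=image_feat_dim, vdim=image_feat_dim,
                                   batch_first=True)

image_feats = torch.randn(1, num_patches, image_feat_dim)  # frozen image encoder output
Z, _ = cross_attn(query_tokens, image_feats, image_feats)  # queries attend to image features
print(Z.shape)  # torch.Size([1, 32, 768]) -- far smaller than the 257 x 1024 image features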

BLIP-2 representation learning stage

In the representation learning stage, the Q-Former is connected to the frozen image encoder and pre-trained on image-text pairs. The goal is to train the Q-Former so that the queries learn to extract the visual representations most relevant to the text. Inspired by BLIP, three pre-training objectives that share the same input format and model parameters are jointly optimized. Each objective uses a different attention masking strategy to control the interaction between queries and text (see Figure 2).

Image-Text Contrastive Learning (ITC): ITC learns to align image and text representations so that their mutual information is maximized. It achieves this by contrasting the similarity of positive image-text pairs against that of negative pairs. The query representation Z from the image transformer is aligned with the text representation t from the text transformer, where t is the output embedding of the [CLS] token. Since Z contains multiple output embeddings (one per query), we first compute the pairwise similarity between each query output and t, and then take the highest one as the image-text similarity. To avoid information leakage, a unimodal self-attention mask is used so that queries and text cannot see each other. Compared with end-to-end methods, more samples fit on each GPU because the image encoder is frozen, so in-batch negatives are used instead of the momentum queue in BLIP.
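
The following is a minimal sketch of the query-wise similarity just described; the tensor names, projection dimension, and temperature are illustrative, not the actual LAVIS code. Each image yields 32 query embeddings, the similarity to a text embedding is the maximum over the queries, and in-batch negatives give a standard contrastive loss.

import torch
import torch.nn.functional as F

batch = 4
Z = F.normalize(torch.randn(batch, 32, 256), dim=-1)  # query outputs, projected and normalized
t = F.normalize(torch.randn(batch, 256), dim=-1)      # [CLS] text embeddings

# Similarity between every query output and every text embedding,
# then keep the best-matching query for each image-text pair.
sim_per_query = torch.einsum('iqd,jd->ijq', Z, t)     # (img, txt, num_queries)
sim_i2t = sim_per_query.max(dim=-1).values            # (img, txt)

# In-batch negatives: the diagonal entries are the positive pairs.
labels = torch.arange(batch)
loss_itc = F.cross_entropy(sim_i2t / 0.07, labels)    # 0.07 is an illustrative temperature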

Image-grounded Text Generation (ITG): the ITG loss trains the Q-Former to generate text conditioned on the input image. Since the architecture of Q-Former does not allow direct interaction between the frozen image encoder and the text tokens, the information required to generate the text must first be extracted by the queries and then passed to the text tokens through the self-attention layers. The queries are therefore forced to extract visual features that capture all the information about the text. A multimodal causal self-attention mask, similar to the one used in UniLM, controls the query-text interaction: queries can attend to each other but not to the text tokens, while each text token can attend to all queries and to the text tokens preceding it. The [CLS] token is also replaced with a new [DEC] token as the first text token to signal the decoding task.
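
Here is a small sketch of that multimodal causal mask (the sizes are illustrative): queries attend only to queries, while each text token attends to all queries and to the text tokens before it.

import torch

num_queries, num_text = 32, 8
total = num_queries + num_text

# True = attention allowed; rows attend to columns.
mask = torch.zeros(total, total, dtype=torch.bool)
mask[:num_queries, :num_queries] = True           # queries see each other, but not the text
mask[num_queries:, :num_queries] = True           # text tokens see all queries
mask[num_queries:, num_queries:] = torch.tril(    # causal mask among text tokens
    torch.ones(num_text, num_text)).bool()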

Image-Text Matching (ITM): ITM aims to learn fine-grained alignment between image and text representations. It is a binary classification task in which the model predicts whether an image-text pair is a positive (matched) or negative (mismatched) pair. A bi-directional self-attention mask is used so that all queries and text tokens can attend to each other; the query embeddings Z thus capture multimodal information. Each query embedding is fed into a two-class linear classifier to obtain a logit, and the logits of all queries are averaged to produce the matching score.
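
The sketch below illustrates the ITM head as described above, with illustrative names and batch size: a shared two-class linear classifier is applied to every query embedding and the resulting logits are averaged into the matching score.

import torch
import torch.nn as nn

hidden_dim = 768
itm_head = nn.Linear(hidden_dim, 2)      # two-class (match / mismatch) classifier

Z = torch.randn(4, 32, hidden_dim)       # query embeddings after bi-directional attention with the text
logits = itm_head(Z)                     # (batch, num_queries, 2)
itm_logits = logits.mean(dim=1)          # average over queries -> (batch, 2)

labels = torch.tensor([1, 0, 1, 0])      # 1 = matched pair, 0 = mismatched pair
loss_itm = nn.functional.cross_entropy(itm_logits, labels)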

BLIP-2 generative pre-training stage

In the generative pre-training stage, the Q-Former (together with the frozen image encoder) is connected to a frozen LLM to harvest the LLM's generative capability. As shown in Figure 3, a fully connected (FC) layer linearly projects the output query embeddings Z to the same dimension as the LLM's text embeddings. The projected query embeddings are then prepended to the input text embeddings, functioning as soft visual prompts that condition the LLM on the visual representation extracted by the Q-Former. Since the Q-Former has been pre-trained to extract language-informative visual representations, it effectively acts as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual content. This relieves the LLM of the burden of learning vision-language alignment and thus mitigates catastrophic forgetting.
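
The sketch below shows the shape of this bridging step under illustrative dimensions (2560 is the hidden size of OPT-2.7B; the layer and tensor names are made up, not the actual implementation): the FC layer projects the 32 query outputs into the LLM embedding space, and the result is prepended to the text embeddings.

import torch
import torch.nn as nn

qformer_dim, llm_dim = 768, 2560         # 2560: hidden size of OPT-2.7B (illustrative)
proj = nn.Linear(qformer_dim, llm_dim)   # the fully connected bridging layer

Z = torch.randn(1, 32, qformer_dim)             # Q-Former output query embeddings
soft_prompt = proj(Z)                           # (1, 32, llm_dim)

text_embeds = torch.randn(1, 10, llm_dim)       # embeddings of the input text tokens
llm_inputs = torch.cat([soft_prompt, text_embeds], dim=1)  # visual soft prompt + text
# llm_inputs would then be fed to the frozen LLM (e.g. via inputs_embeds) for generation.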

The authors experiment with two types of LLMs: decoder-only LLMs and encoder-decoder LLMs. For decoder-only LLMs, pre-training uses the language modeling loss, where the frozen LLM generates the text conditioned on the visual representation from the Q-Former. For encoder-decoder LLMs, pre-training uses the prefix language modeling loss: the text is split into two parts, the prefix text is fed together with the visual representation into the LLM encoder, and the suffix text serves as the generation target for the LLM decoder.
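
To make the prefix language modeling setup concrete, here is a tiny, hypothetical illustration of the text split (the caption and the split point are made up): the prefix goes to the LLM encoder together with the visual soft prompt, and the suffix is what the decoder must generate.

# Hypothetical example of the prefix / suffix split for an encoder-decoder LLM.
caption = "two cartoon monsters sitting around a campfire in the woods"
words = caption.split()
split = len(words) // 2

prefix_text = " ".join(words[:split])   # fed to the LLM encoder, together with the visual soft prompt
suffix_text = " ".join(words[split:])   # generation target for the LLM decoder
print(prefix_text, "||", suffix_text)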

Using BLIP-2

Step 1: Install the Transformers environment

pip install git+https://github.com/huggingface/transformers.git

Step 2: Get a test image

We need an input image. The New Yorker runs a weekly cartoon-captioning contest for its readers, so we take one of its cartoons and feed it to BLIP-2 for testing.

Cartoon Caption Contest link:
https://www.newyorker.com/cartoons/contest#thisweek

import requests
from PIL import Image
from IPython.display import display  # for displaying the image in a notebook

url = 'https://media.newyorker.com/cartoons/63dc6847be24a6a76d90eb99/master/w_1160,c_limit/230213_a26611_838.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')
display(image.resize((596, 437)))

New Yorker Cartoon

Step 3: Load the model

Now that we have an input image, we need a pretrained BLIP-2 model and corresponding preprocessor to process the input. You can find a list of all available pre-trained checkpoints on the Hugging Face Hub. Here, we will load a BLIP-2 checkpoint using Meta AI’s pretrained OPT model with 2.7 billion parameters.

from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)

Note that you cannot yet load BLIP-2 models with the Auto API (e.g. AutoModelForXxx), which is unusual for Hugging Face Transformers; you need to load the model explicitly with Blip2ForConditionalGeneration. You can, however, still use AutoProcessor to fetch the matching processor class, which in this case is Blip2Processor.

We can use the GPU to speed up the text generation:

device = "cuda" if torch.cuda.is_available () else "cpu"
model.to (device)

Let’s look at some specific cases

Image Caption Generation

Let’s first see whether BLIP-2 can caption New Yorker cartoon images zero-shot. To caption an image, we don’t need to provide any text prompt to the model, only the preprocessed input image. Without a text prompt, the model starts generating the caption from the BOS (beginning-of-sequence) token.

inputs = processor(image, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"two cartoon monsters sitting around a campfire"

That’s an impressively accurate description for a model that wasn’t trained on New Yorker-style cartoon images!

Prompted image captioning

We can also extend image captioning by providing a text prompt, which the model continues, conditioned on the image.

prompt = "this is a cartoon of"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"two monsters sitting around a campfire"
prompt = "they look like they are"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"having a good time"

Visual Q&A

When used for visual Q&A, prompts must follow a specific format: “Question: {} Answer:”

prompt = "Question: What is a dinosaur holding? Answer:"

inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
"A torch"

Chat-based prompting

Finally, we can create a ChatGPT-like experience by concatenating each round’s question and answer into the prompt. We prompt the model with a question (such as “What is a dinosaur holding?”), the model generates an answer (such as “a torch”), and we append both to the conversation. Then we start another round, building up the context. However, the context must stay under 512 tokens, since that is the context length of the language models used by BLIP-2 (OPT and T5).

context = [
   ("What is a dinosaur holding?", "a torch"),
   ("Where are they?", "In the woods.")
]
question = "What for?"
template = "Question: {} Answer: {}."

prompt = "".join ([template.format (context [i][0], context [i][1]) for i in range (len (context))]) + " Question: " + question + "Answer:"

print(prompt)
Question: What is a dinosaur holding? Answer: a torch. Question: Where are they? Answer: In the woods.. Question: What for? Answer:
inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

generated_ids = model.generate(**inputs, max_new_tokens=10)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
To light a fire.
