CLIP series: CLIP: a bridge between text and images


CLIP is a bridge between text and images.
SOTA vision models are trained on fixed sets of supervised (image, label) pairs, for example a picture of an elephant labeled “elephant” and a picture of a rabbit labeled “rabbit”. Such a model can perform very well on a particular dataset, but its performance drops sharply on unseen categories. This form of supervision limits generalization, because additional labeled data is needed to retrain the model: a model trained to recognize elephants and rabbits simply cannot recognize categories it was never shown.
Traditional image tasks use one-hot encoding: each category gets its own numeric label, for example background is 0, rabbit is 1, elephant is 2, and the model only has to classify pixels into 0, 1 or 2. CLIP instead learns directly from the text description of an image, such as “a photo of a rabbit” or “a photo of an elephant”. You only need to provide a text description for each picture to pre-train the CLIP model. After pre-training on a dataset of hundreds of millions of image-text pairs, CLIP achieves remarkable generalization and can adapt to almost any downstream task, such as recognition, segmentation, tracking, etc.
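As a minimal sketch of what this enables at inference time (the embeddings below are random stand-ins for real encoder outputs; the shapes and class names are only for illustration), zero-shot classification reduces to a cosine-similarity lookup between one image embedding and the embeddings of candidate text descriptions:

import numpy as np

# Hypothetical, already L2-normalized embeddings; random values stand in for real CLIP outputs.
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=512)
image_embedding /= np.linalg.norm(image_embedding)

candidate_texts = ["a photo of a rabbit", "a photo of an elephant", "a photo of a dog"]
text_embeddings = rng.normal(size=(len(candidate_texts), 512))
text_embeddings /= np.linalg.norm(text_embeddings, axis=1, keepdims=True)

# Zero-shot classification: pick the text whose embedding is most similar to the image embedding.
similarities = text_embeddings @ image_embedding
print(candidate_texts[int(np.argmax(similarities))])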

Background

In NLP, text-to-text pre-training has become very mature, allowing large pre-trained models to transfer zero-shot to downstream tasks; the best-known example today is the GPT series. The success of GPT naturally raises a question: can large vision models be pre-trained in a text-to-image fashion? As early as 2016, researchers tried using a CNN to predict the captions and phrases associated with images, and showed that this kind of training improves a model's zero-shot transfer ability. However, the approach still underperformed standard vision models, and one of the reasons was dataset size (after all, enough brute force works miracles: with enough thrust, even a brick can fly). The GPT series has shown that very large datasets improve both performance and generalization, so CLIP chose to train with large-scale natural-language supervision of images.

Highlights

  1. A sufficiently large dataset

    MS-COCO and YFCC100M are datasets commonly used for vision training. Filtering YFCC100M for images with natural-language titles or descriptions leaves roughly 15 million images, which is comparable in size to ImageNet.
    In addition, CLIP crawled its own image data from the Internet, collecting 400 million (image, text) pairs and balancing the distribution over search queries as much as possible.
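    As a rough sketch of what balancing the crawled pairs might look like (the per-query cap, the record format and the function name are assumptions for illustration; the real collection pipeline is not public):

from collections import defaultdict

def balance_by_query(records, max_per_query=20_000):
    # Keep at most `max_per_query` (image_url, text) pairs for each search query,
    # so that no single query dominates the dataset.
    kept, counts = [], defaultdict(int)
    for image_url, text, query in records:
        if counts[query] < max_per_query:
            counts[query] += 1
            kept.append((image_url, text))
    return kept

# Usage: pass an iterable of (image_url, text, query) tuples produced by the crawler.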

  2. An efficient pre-training method

    A large enough data set can make pigs fly, and a good enough method can turn pigs into J-20s.
    CLIP initially tried an approach similar to VirTex: a CNN extracts image features, a Transformer processes the text, and the model is trained to predict the image's caption directly. As you can imagine, the pig is still a pig, it just flies a little higher. Predicting exact captions is a hard task that scales poorly and is inefficient.
    The solution is to predict which image an entire text is paired with, rather than the exact words. In other words, a sentence is treated as a whole, not as a particular arrangement of words. After all, “an image of a dog”, “an image of an animal like a dog” and “an image of an animal that is a dog” mean essentially the same thing, and asking the model to reproduce each exact wording only adds complexity. If these texts are instead taken as wholes and a Transformer extracts features from them, they all collapse into the same “dog” concept. Achieving this requires a carefully designed model and loss function.
    Given a batch of N (image, text) pairs, CLIP scores all N×N possible pairings. It computes the image representation (image embedding) with the image encoder and the text representation (text embedding) with the text encoder, then measures the cosine similarity between them: training maximizes the similarity of the N correct pairs and minimizes it for the N²−N incorrect pairings. (This is contrastive learning.)

# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) # [n, d_i] image feature representations
T_f = text_encoder(T) # [n, d_t] text feature representations

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1) # project image features into the joint embedding space and L2-normalize
T_e = l2_normalize(np.dot(T_f, W_t), axis=1) # project text features into the joint embedding space and L2-normalize

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t) # cosine similarity between every image and every text, scaled by the learned temperature

# symmetric loss function
labels = np.arange(n) # the i-th image matches the i-th text, so pair i has label i
loss_i = cross_entropy_loss(logits, labels, axis=0) # cross-entropy loss for the images (each image should match its own text)
loss_t = cross_entropy_loss(logits, labels, axis=1) # cross-entropy loss for the texts (each text should match its own image)
loss = (loss_i + loss_t)/2 # average of the two directions; this symmetric objective is the contrastive loss
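The listing above is pseudocode: l2_normalize and cross_entropy_loss are not real NumPy functions. A minimal runnable sketch of the same symmetric contrastive loss in plain NumPy, with random features standing in for the encoder outputs, could look like this:

import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(I_f, T_f, W_i, W_t, t):
    I_e = l2_normalize(I_f @ W_i)               # [n, d_e] image embeddings
    T_e = l2_normalize(T_f @ W_t)               # [n, d_e] text embeddings
    logits = (I_e @ T_e.T) * np.exp(t)          # [n, n] scaled cosine similarities
    n = logits.shape[0]
    # Log-softmax over texts (rows) and over images (columns).
    log_p_img = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_txt = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    diag = np.arange(n)
    loss_i = -log_p_img[diag, diag].mean()      # each image should pick its own text
    loss_t = -log_p_txt[diag, diag].mean()      # each text should pick its own image
    return (loss_i + loss_t) / 2

# Toy usage with random features in place of real encoder outputs.
rng = np.random.default_rng(0)
n, d_i, d_t, d_e = 8, 32, 24, 16
loss = symmetric_contrastive_loss(rng.normal(size=(n, d_i)), rng.normal(size=(n, d_t)),
                                  rng.normal(size=(d_i, d_e)), rng.normal(size=(d_t, d_e)),
                                  t=np.log(1 / 0.07))
print(float(loss))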
  3. Choose and scale a model

For the image encoder, CLIP considers two classic architectures, ResNet-50 and ViT, with some modifications:

  • ResNet-50: anti-aliased rect-2 blur pooling replaces the original pooling, and an attention-pooling mechanism replaces the final global average-pooling layer, where the query Q in QKV is the output of global average pooling (a small sketch of this pooling follows the list).
  • ViT: the placement of LayerNorm inside ViT's Transformer is changed. (The LayerNorm position shown in the ViT paper's figure and the one in the released code appear inconsistent; not verified here.)
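A minimal single-head sketch of the attention pooling described above (dimensions, random weights and the single-head simplification are assumptions; CLIP's actual attention pool is multi-head with learned projections and an output projection):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feature_map, W_q, W_k, W_v):
    # feature_map: [h*w, d] spatial features from the CNN backbone.
    # The query is the global average-pooled feature, so the output is a
    # similarity-weighted average over spatial positions instead of a uniform one.
    query = feature_map.mean(axis=0, keepdims=True)    # [1, d] global average pooling
    q, k, v = query @ W_q, feature_map @ W_k, feature_map @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))      # [1, h*w] attention over positions
    return attn @ v                                    # [1, d] pooled image feature

# Toy usage with random weights standing in for learned projections.
rng = np.random.default_rng(0)
d, hw = 64, 49
pooled = attention_pool(rng.normal(size=(hw, d)), *(rng.normal(size=(d, d)) for _ in range(3)))
print(pooled.shape)  # (1, 64)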

For the text encoder, CLIP uses a Transformer with masked (causal) self-attention.

Regarding model scale, CLIP scales the depth and width of the image encoder together, while the text encoder is scaled only in width.
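A small sketch of the causal mask used by the masked self-attention mentioned above, assuming the common additive-mask convention where blocked positions are set to -inf before the softmax:

import numpy as np

def causal_mask(seq_len):
    # Position i may attend only to positions <= i; future positions get -inf.
    mask = np.zeros((seq_len, seq_len))
    mask[np.triu_indices(seq_len, k=1)] = -np.inf
    return mask

print(causal_mask(4))
# [[  0. -inf -inf -inf]
#  [  0.   0. -inf -inf]
#  [  0.   0.   0. -inf]
#  [  0.   0.   0.   0.]]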

  4. Prompt engineering

Prompt engineering is equally important. After normalizing the class names into text prompts, CLIP's accuracy improves by several points.
For example, instead of the bare class name, use a template such as “a photo of a xxx, a type of xxx”;
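As a rough end-to-end illustration of zero-shot classification with such prompt templates, here is a sketch using the open-source clip package from the openai/CLIP repository (the image path, class names and the “type of animal” template are placeholders; the ViT-B/32 weights are downloaded on first use):

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["rabbit", "elephant", "dog"]
# Prompt engineering: wrap each bare class name in a natural-language template.
prompts = [f"a photo of a {name}, a type of animal" for name in class_names]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)  # 100 ≈ exp of the learned temperature

print(dict(zip(class_names, probs[0].tolist())))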

Methods


Discussion

  1. ViT is more computationally efficient than the CLIP ResNets: when trained on a large enough dataset, the Vision Transformer reaches comparable performance faster than the CNN.