Multimodal contrastive language-image pre-training (CLIP): breaking the boundaries between language and vision


CLIP is a neural network trained contrastively on multimodal (image, text) pairs. Given an image, it uses natural language to predict the most relevant text snippet, without being optimized for a specific task. Like GPT-2 and GPT-3, CLIP has strong zero-shot capabilities and can be applied to a variety of multimodal tasks.

  • Multimodal contrastive language image pretraining (CLIP) is a neural network model that learns the association between images and text through multimodal contrastive training. Unlike traditional single-modal pre-trained models, CLIP is able to process images and text simultaneously to better understand the semantic relationships between them.

  • CLIP's zero-shot usage is similar in spirit to GPT-2 and GPT-3, but rather than being an autoregressive language model, it learns the mapping between images and text through contrastive learning. During training, CLIP receives an image and an associated text snippet and learns how to associate information from the two modalities. In this way, CLIP learns to match images with their corresponding text snippets, and can use natural language to predict the most relevant text snippet for a given image.

  • Because CLIP is trained contrastively, it performs well on a variety of multimodal tasks without being optimized for any specific one. This makes CLIP a general-purpose multimodal pre-trained model that can be widely applied to image annotation, visual question answering, image generation, and other fields.

CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a large number of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million labeled training examples, overcoming several major challenges in computer vision.
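To make the contrastive training idea concrete, here is a minimal sketch of the symmetric contrastive objective described above. This is an illustrative simplification, not OpenAI's training code; the batch size, embedding dimension, and temperature are placeholder values, and the random tensors stand in for encoder outputs.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize both modalities so dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j
    logits = image_features @ text_features.T / temperature

    # The matching (image, text) pairs lie on the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric cross-entropy: pick the right text for each image, and vice versa
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.T, targets)
    return (loss_images + loss_texts) / 2

# Toy batch of 8 (image, text) pairs with 512-dimensional embeddings
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())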

1. Installation

ftfy
regex
tqdm
torch
torchvision

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the appropriate CUDA version for your machine, or with cpuonly when installing on a machine without a GPU.
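For example, on a machine without a GPU, the same pinned versions can be installed with the CPU-only variant of the commands above (adjust the versions as needed for newer PyTorch releases):

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cpuonly
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git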

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs) # prints: [[0.9927937 0.00421068 0.00299572]]
  • API

The CLIP module clip provides the following methods:

  • clip.available_models()

Returns the names of the available CLIP models.
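For example (the exact list depends on the installed version of the clip package):

import clip

print(clip.available_models())
# e.g. ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16', ...]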

  • clip.load(name, device=..., jit=False)

Returns the model and the TorchVision transform that the model requires, given a model name returned by clip.available_models(). It will download the model as needed. The name argument can also be the path to a local checkpoint.

You can optionally specify the device on which the model runs. The default is the first CUDA device if one is available, otherwise the CPU. When jit is False, the non-JIT (regular PyTorch) version of the model is loaded.
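A couple of illustrative calls; the local checkpoint path below is a placeholder, not a file shipped with the package:

import clip

# Load the non-JIT (regular PyTorch) model explicitly on the CPU
model, preprocess = clip.load("ViT-B/32", device="cpu", jit=False)

# Alternatively, load from a local checkpoint instead of a model name
# model, preprocess = clip.load("/path/to/ViT-B-32.pt", device="cpu")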

  • clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as input to the model.
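For instance, tokenizing three strings yields one row per string, padded to the context length:

import clip

tokens = clip.tokenize(["a diagram", "a dog", "a cat"])
print(tokens.shape)  # torch.Size([3, 77])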

The model returned by clip.load() supports the following methods:

  • model.encode_image(image: Tensor)

Given a batch of images, return the image features encoded by the visual part of the CLIP model.

  • model.encode_text(text: Tensor)

Given a batch of text tokens, return the text features encoded by the language part of the CLIP model.

  • model(image: Tensor, text: Tensor)

Given a batch of images and a batch of text tokens, returns two tensors containing the logit scores for each image and text input. Each value is the cosine similarity between the corresponding image and text features, multiplied by 100.
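The relationship between the forward call and the two encode methods can be checked directly. The sketch below reuses the CLIP.png image from the quickstart above and simply verifies that the returned logits agree with 100 times the cosine similarity of the normalized features:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)   # [1, 512] for ViT-B/32
    text_features = model.encode_text(text)      # [3, 512] for ViT-B/32
    logits_per_image, logits_per_text = model(image, text)

    # Cosine similarity of the L2-normalized features
    cosine = (image_features / image_features.norm(dim=-1, keepdim=True)) @ \
             (text_features / text_features.norm(dim=-1, keepdim=True)).T

    # Expected to be numerically close to zero
    print((logits_per_image - 100.0 * cosine).abs().max().item())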

2. Case introduction

2.1 Zero-shot capability

The code below uses CLIP to perform zero-shot prediction, as described in Appendix B of the CLIP paper. This example takes an image from the CIFAR-100 dataset and predicts the most likely label among the dataset's 100 text labels.

import os
import clip
import torch
from torchvision.datasets import CIFAR100

#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

#Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

#Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

#Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

#Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

#Print the result
print("\
Top predictions:\
")
for value, index in zip(values, indices):
    print(f"{<!-- -->cifar100.classes[index]:>16s}: {<!-- -->100 * value.item():.2f}%")

The output will look like this (exact numbers may vary slightly depending on your computing device):

Top predictions:

           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.

2.2 Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

#Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

#Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

#Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

#Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

#Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {<!-- -->accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
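A minimal way to run that sweep, reusing get_features() from the script above and holding out part of the training set as a validation split (the C grid below is just an illustrative choice):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hold out 10% of the training features as a validation split
rng = np.random.default_rng(0)
val_mask = rng.random(len(train_labels)) < 0.1
train_X, train_y = train_features[~val_mask], train_labels[~val_mask]
val_X, val_y = train_features[val_mask], train_labels[val_mask]

best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 0.316, 1.0, 3.16, 10.0]:  # illustrative log-spaced grid
    clf = LogisticRegression(random_state=0, C=C, max_iter=1000)
    clf.fit(train_X, train_y)
    acc = clf.score(val_X, val_y)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(f"Best C = {best_C} (validation accuracy = {100 * best_acc:.2f}%)")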

3. For more information, please refer to:

  • OpenCLIP: includes larger and independently trained CLIP models up to ViT-G/14
  • Hugging Face implementation of CLIP: for easier integration with the HF ecosystem
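As a point of comparison, the Hugging Face implementation exposes the same ViT-B/32 weights under the hub ID openai/clip-vit-base-patch32. Below is a minimal zero-shot sketch mirroring the quickstart above (assuming transformers is installed and CLIP.png is present):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(text=["a diagram", "a dog", "a cat"],
                   images=Image.open("CLIP.png"),
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)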
