Image similarity comparison between CLIP and DINOv2


Source: DeepHub IMBA

Two of the most prominent self-supervised models in computer vision are CLIP and DINOv2. CLIP revolutionized image understanding by building a bridge between images and text, while DINOv2 brings a new, purely image-based self-supervised learning method.

In this article, we will explore the strengths of CLIP and DINOv2 and their differences, both obvious and subtle. Our goal is to find out which model actually performs better on image similarity tasks.

CLIP

Calculating the similarity between two images using CLIP is a simple process that can be achieved in just two steps: extract the features of the two images and then calculate their cosine similarity.

We first create a virtual environment and install packages:


#Start by setting up a virtual environment
virtualenv venv-similarity
source venv-similarity/bin/activate
#Install required packages
pip install transformers Pillow torch

Next, calculate the image similarity:


import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    image_features1 = model.get_image_features(**inputs1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    image_features2 = model.get_image_features(**inputs2)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1)/2
print('Similarity:', sim)
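
For reference, the printed score is simply the cosine similarity of the two embedding vectors, rescaled from [-1, 1] to [0, 1]:

\mathrm{score} = \frac{1}{2}\left(\frac{\mathbf{f}_1 \cdot \mathbf{f}_2}{\lVert \mathbf{f}_1 \rVert \, \lVert \mathbf{f}_2 \rVert} + 1\right)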

[Figure: the two similar test images compared in this example]

The two similar images above have a similarity score of 96.4%.

DINOv2

The process of computing the similarity between two images with DINOv2 is similar to the CLIP workflow and requires the same set of packages, with no additional installation; only the model loading and the feature-extraction step change:


import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    outputs1 = model(**inputs1)
    #Average the patch token embeddings into a single vector
    image_features1 = outputs1.last_hidden_state.mean(dim=1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    outputs2 = model(**inputs2)
    image_features2 = outputs2.last_hidden_state.mean(dim=1)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1)/2
print('Similarity:', sim)

For the same image pair in the CLIP example above, DINOv2 achieved a similarity score of 93%.

Both models can measure image similarity, so let's study them in more depth below.

Testing using the COCO dataset

Images from the COCO validation set are used here to compare the results produced by CLIP and DINOv2.

The process is as follows:

  • Traverse the dataset to extract features of all images.

  • Store embeddings in FAISS index.

  • Extract features of the input image.

  • Retrieve the top three similar images.

1. Feature extraction and index creation


import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel, AutoImageProcessor, AutoModel
import faiss
import os
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

#Load CLIP model and processor
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Load DINOv2 model and processor
processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Retrieve the filenames of all images in the dataset
images = []
for root, dirs, files in os.walk('./val2017/'):
    for file in files:
        if file.endswith('jpg'):
            images.append(root + '/' + file)

#Normalize an embedding and add it to the index
def add_vector_to_index(embedding, index):
    #Convert the embedding to a float32 numpy array
    vector = embedding.detach().cpu().numpy()
    vector = np.float32(vector)
    #Normalize the vector: important to avoid wrong results when searching
    faiss.normalize_L2(vector)
    #Add to the index
    index.add(vector)

def extract_features_clip(image):
    with torch.no_grad():
        inputs = processor_clip(images=image, return_tensors="pt").to(device)
        image_features = model_clip.get_image_features(**inputs)
        return image_features

def extract_features_dino(image):
    with torch.no_grad():
        inputs = processor_dino(images=image, return_tensors="pt").to(device)
        outputs = model_dino(**inputs)
        #Average the patch token embeddings into a single 768-dim vector
        image_features = outputs.last_hidden_state
        return image_features.mean(dim=1)

#Create one index per model (CLIP: 512-dim embeddings, DINOv2: 768-dim)
index_clip = faiss.IndexFlatL2(512)
index_dino = faiss.IndexFlatL2(768)

#Iterate over the dataset, extract features with both models and store them in the indexes
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    clip_features = extract_features_clip(img)
    add_vector_to_index(clip_features, index_clip)
    dino_features = extract_features_dino(img)
    add_vector_to_index(dino_features, index_dino)

#Store the indexes locally
faiss.write_index(index_clip, "clip.index")
faiss.write_index(index_dino, "dino.index")
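
The FAISS indexes store only vectors, not filenames, so it is useful to also persist the list of image paths in the same order in which the vectors were added. This is a minimal optional sketch (the images.json filename is an illustrative choice, not part of the original code):

import json

#Save the image paths in insertion order, so that row i of each index corresponds to images[i]
with open("images.json", "w") as f:
    json.dump(images, f)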

2. Image similarity search


import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel, AutoProcessor, CLIPModel
from PIL import Image

#Input image
source = 'laptop.jpg'
image = Image.open(source).convert('RGB')

device = torch.device('cuda' if torch.cuda.is_available() else "cpu")

#Load the CLIP and DINOv2 models and processors
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features with CLIP
with torch.no_grad():
    inputs_clip = processor_clip(images=image, return_tensors="pt").to(device)
    image_features_clip = model_clip.get_image_features(**inputs_clip)

#Extract features with DINOv2
with torch.no_grad():
    inputs_dino = processor_dino(images=image, return_tensors="pt").to(device)
    outputs_dino = model_dino(**inputs_dino)
    image_features_dino = outputs_dino.last_hidden_state
    image_features_dino = image_features_dino.mean(dim=1)

#Convert to a normalized float32 numpy array, as done when building the indexes
def normalizeL2(embeddings):
    vector = embeddings.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    return vector

image_features_dino = normalizeL2(image_features_dino)
image_features_clip = normalizeL2(image_features_clip)

#Load the indexes built in step 1
index_clip = faiss.read_index("clip.index")
index_dino = faiss.read_index("dino.index")

#Search the top 5 nearest neighbours: get distances and row indices of the retrieved images
d_dino, i_dino = index_dino.search(image_features_dino, 5)
d_clip, i_clip = index_clip.search(image_features_clip, 5)
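
The search returns, for each query, the distances and the row indices of the nearest vectors; the row indices refer to the order in which images were added in step 1. A minimal sketch of mapping them back to file paths, assuming the path list was saved to images.json as suggested in step 1:

import json

#Load the image paths stored during index creation (row i of the index <-> images[i])
with open("images.json") as f:
    images = json.load(f)

#Print the retrieved images with their L2 distances (smaller means more similar)
for rank, (idx, dist) in enumerate(zip(i_dino[0], d_dino[0]), start=1):
    print(f"DINOv2 #{rank}: {images[idx]} (distance {dist:.4f})")
for rank, (idx, dist) in enumerate(zip(i_clip[0], d_clip[0]), start=1):
    print(f"CLIP #{rank}: {images[idx]} (distance {dist:.4f})")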

3. Results

Using four different images as input, the search produced the following results:

[Figures: for each of the four query images, the most similar images retrieved by CLIP and by DINOv2]

Judged by eye, DINOv2 appears to perform slightly better.

Testing using the DISC21 dataset

To quantify the differences between CLIP and DINOv2, we chose the DISC21 dataset, which was created specifically for image similarity search. Since the full dataset is 350GB, we will use a subset of 150,000 images.

In terms of metrics, we will compute the following (a sketch of how they can be computed follows the list):

  • Accuracy: The ratio of correctly predicted images to the total number of images.

  • Top-3 accuracy: The ratio of the number of queries for which the correct image appears among the top three retrieved images to the total number of images.

  • Computation time: The time required to process the entire data set.
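
The sketch below shows one way these metrics can be computed. The helper names are hypothetical: queries is a list of query image paths, ground_truth maps each query to its correct reference image (assumed to be available from the dataset's annotations), and search(q, k) is assumed to return the k most similar reference images using one of the FAISS indexes built earlier:

#Hypothetical evaluation sketch, not part of the original article
def evaluate(queries, ground_truth, search, k=3):
    top1 = top3 = 0
    for q in queries:
        results = search(q, k)
        if results and results[0] == ground_truth[q]:
            top1 += 1
        if ground_truth[q] in results:
            top3 += 1
    n = len(queries)
    return top1 / n, top3 / n   #accuracy, top-3 accuracy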

The results are as follows:

Feature extraction speed: CLIP processes 70.7 images per second and DINOv2 69.7 images per second; the two models are roughly equally computationally intensive.

Accuracy and top-3 accuracy:

[Table: accuracy and top-3 accuracy of CLIP and DINOv2 on the DISC21 subset]

Both models predicted the correct image:

[Figure: query where both CLIP and DINOv2 retrieve the correct image]

Neither model found the correct image:

[Figure: query where neither model retrieves the correct image]

Only CLIP predicted the correct image (shown alongside DINOv2's top 3):

[Figure: query retrieved correctly only by CLIP]

Only DINOv2 predicted the correct image:

[Figure: query retrieved correctly only by DINOv2]

Result Analysis

DINOv2 is the clear winner, achieving 64% accuracy on this very challenging dataset, compared with only 28.45% for CLIP.

Both models show very similar feature extraction times in terms of computational efficiency.

One reason DINOv2 leads by such a large margin here is that Meta AI used DISC21 as a benchmark for the model, which certainly gives DINOv2 a favorable advantage. Still, the tests on the COCO dataset reveal an interesting nuance: DINOv2 is better at identifying the main elements of an image, while CLIP is adept at focusing on specific details of the input image (look at the bus example: every image CLIP retrieved is a red bus, probably because color is a strong cue in its text-aligned training).

Another difference lies in the embedding dimensions: CLIP produces 512-dimensional embeddings, while DINOv2 produces 768-dimensional ones. This may contribute to the gap, but using a larger CLIP model to close it would sacrifice the execution speed measured above.
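
These dimensions are easy to verify on the features extracted in the search script; a quick sketch, assuming the image_features_clip and image_features_dino variables from step 2 are still in scope:

#CLIP ViT-B/32 produces 512-dim embeddings, DINOv2-base 768-dim
print(image_features_clip.shape)   #expected: (1, 512)
print(image_features_dino.shape)   #expected: (1, 768)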

Summary

DINOv2 shows excellent accuracy on image similarity tasks, demonstrating its potential for practical applications. CLIP, while commendable, falls short by comparison, although it remains particularly useful in scenarios that require attention to small details. Both models show similar computational efficiency, and if you are only working with the image modality, DINOv2 is a good choice.

Author: JeremyK

Editor: Huang Jiyan

