Source: DeepHub IMBA

This article explores the strengths of CLIP and DINOv2 and the differences between them.
There are two main self-supervised models in the field of computer vision: CLIP and DINOv2. CLIP revolutionized image understanding by building a bridge between images and text, while DINOv2 brought a new approach to self-supervised learning.

In this article, we will explore the strengths of CLIP and DINOv2 as well as their direct and subtle differences. Our goal is to determine which model actually performs better on image similarity tasks.
CLIP
Calculating the similarity between two images using CLIP is a simple process that can be achieved in just two steps: extract the features of the two images and then calculate their cosine similarity.
We first create a virtual environment and install packages:
#Start by setting up a virtual environment
virtualenv venv-similarity
source venv-similarity/bin/activate
#Install required packages
pip install transformers Pillow torch
Next, calculate the image similarity:
import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    image_features1 = model.get_image_features(**inputs1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    image_features2 = model.get_image_features(**inputs2)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1) / 2
print('Similarity:', sim)
For the two similar test images, CLIP produces a similarity score of 96.4%.
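As a side note, the CLIP processor also accepts a list of images, so both embeddings can be computed in a single forward pass; here is a minimal sketch using the same model and processor as above:

#Batch both images through a single forward pass
#(reuses `processor`, `model`, `device`, and `nn` from the snippet above)
imgs = [Image.open('img1.jpg'), Image.open('img2.jpg')]
with torch.no_grad():
    inputs = processor(images=imgs, return_tensors="pt").to(device)
    features = model.get_image_features(**inputs)  # shape: (2, 512)

cos = nn.CosineSimilarity(dim=0)
sim = (cos(features[0], features[1]).item() + 1) / 2
print('Similarity:', sim)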
DINOv2
The process of calculating the similarity between two images with DINOv2 mirrors that of CLIP, and it requires the same set of packages with no additional installation. One difference: DINOv2 does not expose a get_image_features helper, so we mean-pool the last hidden state of the backbone to obtain an image embedding:
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features from image1
image1 = Image.open('img1.jpg')
with torch.no_grad():
    inputs1 = processor(images=image1, return_tensors="pt").to(device)
    outputs1 = model(**inputs1)
    #Mean-pool the last hidden state to get a single embedding per image
    image_features1 = outputs1.last_hidden_state.mean(dim=1)

#Extract features from image2
image2 = Image.open('img2.jpg')
with torch.no_grad():
    inputs2 = processor(images=image2, return_tensors="pt").to(device)
    outputs2 = model(**inputs2)
    image_features2 = outputs2.last_hidden_state.mean(dim=1)

#Compute their cosine similarity and convert it into a score between 0 and 1
cos = nn.CosineSimilarity(dim=0)
sim = cos(image_features1[0], image_features2[0]).item()
sim = (sim + 1) / 2
print('Similarity:', sim)
For the same image pair in the CLIP example above, DINOv2 achieved a similarity score of 93%.
Both models can measure image similarity. Let's compare them in more depth below.
Testing using the COCO dataset
Images from the COCO dataset validation set are used here to compare the results produced by CLIP and DINOv2.
The process is as follows:
- Traverse the dataset to extract the features of all images.
- Store the embeddings in a FAISS index.
- Extract the features of the input image.
- Retrieve the top three similar images.
1. Feature extraction and index creation
import torch
from PIL import Image
from transformers import AutoProcessor, CLIPModel, AutoImageProcessor, AutoModel
import faiss
import os
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Load CLIP model and processor
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

#Load DINOv2 model and processor
processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Retrieve all filenames
images = []
for root, dirs, files in os.walk('./val2017/'):
    for file in files:
        if file.endswith('jpg'):
            images.append(root + '/' + file)

#Define a function that normalizes embeddings and adds them to the index
def add_vector_to_index(embedding, index):
    #Convert embedding to a float32 numpy array
    vector = embedding.detach().cpu().numpy()
    vector = np.float32(vector)
    #Normalize the vector: important to avoid wrong results when searching
    faiss.normalize_L2(vector)
    #Add to index
    index.add(vector)

def extract_features_clip(image):
    with torch.no_grad():
        inputs = processor_clip(images=image, return_tensors="pt").to(device)
        image_features = model_clip.get_image_features(**inputs)
        return image_features

def extract_features_dino(image):
    with torch.no_grad():
        inputs = processor_dino(images=image, return_tensors="pt").to(device)
        outputs = model_dino(**inputs)
        image_features = outputs.last_hidden_state
        return image_features.mean(dim=1)

#Create the two indexes (CLIP embeddings are 512-dimensional, DINOv2's are 768)
index_clip = faiss.IndexFlatL2(512)
index_dino = faiss.IndexFlatL2(768)

#Iterate over the dataset, extract features with both models, and store them in the indexes
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    clip_features = extract_features_clip(img)
    add_vector_to_index(clip_features, index_clip)
    dino_features = extract_features_dino(img)
    add_vector_to_index(dino_features, index_dino)

#Store the indexes locally
faiss.write_index(index_clip, "clip.index")
faiss.write_index(index_dino, "dino.index")
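Note the L2 normalization before insertion: with unit-length vectors, the squared L2 distance satisfies ||a - b||² = 2 - 2·(a·b), so searching the IndexFlatL2 index ranks results exactly as a cosine-similarity search would.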
2. Image similarity search
import faiss
import numpy as np
import torch
from transformers import AutoImageProcessor, AutoModel, AutoProcessor, CLIPModel
from PIL import Image
import os

#Input image
source = 'laptop.jpg'
image = Image.open(source)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Load the DINOv2 and CLIP models and processors
processor_clip = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
model_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor_dino = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model_dino = AutoModel.from_pretrained('facebook/dinov2-base').to(device)

#Extract features with CLIP
with torch.no_grad():
    inputs_clip = processor_clip(images=image, return_tensors="pt").to(device)
    image_features_clip = model_clip.get_image_features(**inputs_clip)

#Extract features with DINOv2
with torch.no_grad():
    inputs_dino = processor_dino(images=image, return_tensors="pt").to(device)
    outputs_dino = model_dino(**inputs_dino)
    image_features_dino = outputs_dino.last_hidden_state
    image_features_dino = image_features_dino.mean(dim=1)

def normalizeL2(embeddings):
    vector = embeddings.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    return vector

image_features_dino = normalizeL2(image_features_dino)
image_features_clip = normalizeL2(image_features_clip)

#Load the indexes and search for the top 5 images
index_clip = faiss.read_index("clip.index")
index_dino = faiss.read_index("dino.index")

#Get the distances and indexes of the retrieved images
d_dino, i_dino = index_dino.search(image_features_dino, 5)
d_clip, i_clip = index_clip.search(image_features_clip, 5)
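The search returns row numbers in the index rather than filenames. Assuming the images list from the indexing script is available (in the same order the vectors were inserted), the results can be mapped back to file paths with a sketch like this:

#Map FAISS result indexes back to file paths.
#Assumes `images` is the same list, in the same order, used to build the indexes.
for rank, (idx, dist) in enumerate(zip(i_dino[0], d_dino[0]), start=1):
    print(f"DINOv2 #{rank}: {images[idx]} (L2 distance: {dist:.4f})")
for rank, (idx, dist) in enumerate(zip(i_clip[0], d_clip[0]), start=1):
    print(f"CLIP #{rank}: {images[idx]} (L2 distance: {dist:.4f})")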
3. Results
Using four different images as input, the search produced the following results:
Judged by eye, DINOv2 performs slightly better.
Testing using the DISC21 dataset
To quantify the differences between CLIP and DINOv2, we chose the DISC21 dataset, which was created specifically for image similarity search. Since the full dataset is 350GB, we will use a subset of 150,000 images.
In terms of metrics, we will calculate the following (a sketch of how they could be computed follows the list):
- Accuracy: the ratio of correctly predicted images to the total number of images.
- Top-3 accuracy: the ratio of the number of times the correct image appears among the top three similar images to the total number of images.
- Computation time: the time required to process the entire dataset.
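As an illustration rather than the exact evaluation script used for the numbers below, here is a minimal sketch of how accuracy and top-3 accuracy could be computed from FAISS results. The names ground_truth, query_vectors, and index are hypothetical: a dict mapping each query image path to its correct reference path, a dict of precomputed normalized query embeddings, and one of the FAISS indexes built earlier.

#Minimal sketch (not the exact evaluation script used for the numbers below).
#`ground_truth`, `query_vectors`, and `index` are hypothetical, as described above;
#`images` is the list of indexed reference paths in insertion order.
top1_hits, top3_hits = 0, 0
for query_path, query_vector in query_vectors.items():
    _, idxs = index.search(query_vector, 3)  # retrieve the top-3 neighbors
    retrieved = [images[i] for i in idxs[0]]
    if retrieved[0] == ground_truth[query_path]:
        top1_hits += 1
    if ground_truth[query_path] in retrieved:
        top3_hits += 1

total = len(query_vectors)
print('Accuracy:', top1_hits / total)
print('Top-3 accuracy:', top3_hits / total)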
The results are as follows:
Feature extraction: CLIP processes 70.7 images per second and DINOv2 69.7 images per second; the two are roughly equal in computational cost.
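For reference, per-model throughput like this can be measured by timing the extraction loop; here is a minimal sketch, reusing extract_features_clip and the images list from the indexing script:

#Minimal throughput measurement (reuses `images` and `extract_features_clip`
#from the indexing script)
import time

start = time.time()
for image_path in images:
    img = Image.open(image_path).convert('RGB')
    extract_features_clip(img)
elapsed = time.time() - start
print(f"CLIP: {len(images) / elapsed:.1f} images per second")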
Accuracy and top-3 accuracy:

Cases where both models correctly predicted the image:

Cases where neither model found the correct image:

Cases where only CLIP predicted the correct image (DINOv2 had it only in its top 3):

Cases where only DINOv2 predicted the correct image:
Result Analysis
DINOv2 is the clear winner, achieving 64% accuracy on this very challenging dataset, compared with just 28.45% for CLIP.
In terms of computational efficiency, both models show very similar feature extraction times.
One reason DINOv2 leads by such a large margin is that Meta AI used the DISC21 dataset as a benchmark for the model, which certainly gives DINOv2 a favorable advantage. The COCO tests, however, reveal interesting nuances: DINOv2 is better at identifying the main elements of an image, while CLIP is adept at focusing on specific details of the input image (look at the bus example: CLIP retrieved only red buses, probably because color is salient when aligning images with text).
Another factor is the difference in embedding dimensions: CLIP's embeddings are 512-dimensional, while DINOv2's are 768-dimensional, which may account for part of the gap. A larger CLIP model might close it, but it would no longer run this fast.
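These dimensions are easy to verify by printing the shapes of the extracted features, reusing the extraction functions from the indexing script:

#Quick check of the embedding dimensions (reuses `extract_features_clip`
#and `extract_features_dino` from the indexing script)
img = Image.open('img1.jpg').convert('RGB')
print(extract_features_clip(img).shape)  # torch.Size([1, 512])
print(extract_features_dino(img).shape)  # torch.Size([1, 768])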
Summary
DINOv2 shows excellent accuracy on image similarity tasks, demonstrating its potential for practical applications. CLIP, while commendable, falls short by comparison; it remains particularly useful in scenarios that require attention to small details. Both models show similar computational efficiency, so if you work with the image modality alone, DINOv2 is a good choice.
Author: JeremyK
Editor: Huang Jiyan