Text video retrieval 2 (Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval)

Code link: https://github.com/niluthpol/multimodal_vtt

Introduction: This is another post on cross-modal text-video retrieval. First, the main contributions of the paper:

1) A new feature extraction framework is proposed:

The framework figure from the paper goes here. It looks complicated, but the idea is straightforward. The papers I had read before usually fuse the appearance feature with the sentence feature directly and make the prediction from that single representation. This paper instead proposes two channels: the first fuses the appearance feature with the sentence feature into one final representation; the second first fuses the motion feature with the audio feature, and then fuses the result with the sentence feature into another final representation. The two final representations are combined to make the prediction. It sounds troublesome, but the fusion itself is quite simple, as we will see in the code below. In my view the feature extraction is the hardest part, and the released code does not show how the features are extracted; it simply reads pre-extracted features from disk.
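
To make the two channels concrete, here is a rough sketch of how the combined score could be computed. The module and variable names, the 512-d sentence encoding, and the use of L2 normalization are my own assumptions for illustration; the repo organizes this across its model classes.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def embed(x, fc):
        # shared recipe: linear projection into a 1024-d joint space, then L2-normalize
        return F.normalize(fc(x), p=2, dim=-1)

    # channel 1: appearance <-> sentence (object-text space)
    fc_app, fc_txt1 = nn.Linear(2048, 1024), nn.Linear(512, 1024)
    # channel 2: (motion + audio) <-> sentence (activity-text space)
    fc_act, fc_txt2 = nn.Linear(2048, 1024), nn.Linear(512, 1024)

    appearance   = torch.randn(4, 2048)   # mean-pooled ResNet features
    motion_audio = torch.randn(4, 2048)   # I3D concatenated with SoundNet audio
    sentence     = torch.randn(4, 512)    # sentence encoding (dimension assumed)

    s1 = embed(appearance, fc_app)   @ embed(sentence, fc_txt1).t()  # object-text similarities
    s2 = embed(motion_audio, fc_act) @ embed(sentence, fc_txt2).t()  # activity-text similarities
    score = s1 + s2                        # final matching score: sum over the two spaces

At test time the evaluation code in section 4 sums the two spaces' scores in exactly this way.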

2) A new loss function is proposed.

S(v, t) is the similarity of a positive (matching) video-text pair, and S(v, t⁻) is the similarity of a negative (non-matching) pair. A standard ranking loss sums the hinge terms over all negatives, but this one only cares about the hardest negative, i.e. the negative sample that is most difficult to tell apart from the positive.

Expansion: the similarity used is cosine similarity, so larger is better, and the loss ignores all the other negatives and keeps only the highest-scoring (hardest) negative.
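
Written out explicitly (this is my reconstruction from the loss code in section 3, not copied verbatim from the paper; α is the margin, N the batch size, and r the 0-based rank of the positive pair among all candidates):

    L(v,t) = w_t \max_{t^-}\big[\alpha + S(v,t^-) - S(v,t)\big]_+
           + w_v \max_{v^-}\big[\alpha + S(v^-,t) - S(v,t)\big]_+,
    \qquad w = 1 + \frac{\beta}{N - r}

The two terms correspond to caption retrieval and video retrieval, and the weight w grows when the positive pair is ranked poorly (large r).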

Now let's look at the code itself (I only cover the parts that, in my personal opinion, matter most):

1. Get basic features

1) This part of the code loads the appearance (ResNet) feature of a video and its corresponding caption

    def __getitem__(self, index):
        '''
        Return one training pair: the video (appearance) feature and the corresponding caption.
        The video is looked up from the caption, so the video features must be stored in ascending id order.
        '''
        caption = self.captions[index]
        length = self.lengths[index]
        video_id = self.video_ids[index]
        vid_feat_dir = self.vid_feat_dir

        # e.g. /hdd2/niluthpol/VTT/MSR_VTT/resnet_feat_caffe_all/video8763.npy
        path = vid_feat_dir + "video" + str(video_id) + ".npy"
        video_feat = torch.from_numpy(np.load(path))
        video_feat = video_feat.mean(dim=0, keepdim=False)  # average-pool over frames
        video_feat = video_feat.float()

        return video_feat, caption, index, video_id
        # video_feat: torch.Size([2048])
        # caption:    torch.Size([28])  (length varies per caption)
        # index:      e.g. 18033
        # video_id:   e.g. 9144

2) This part loads the motion (activity) feature and the audio feature; the caption is handled in the same way

    def __getitem__(self, index):
        caption = self.captions[index]
        length = self.lengths[index]
        video_id = self.video_ids[index]
        vid_feat_dir = self.vid_feat_dir

        # activity (I3D) feature, average-pooled over time
        path1 = vid_feat_dir + '/video_features' + "/msr_vtt-I3D-RGBFeatures-video" + str(video_id) + ".npy"
        video_feat = torch.from_numpy(np.load(path1))
        video_feat = video_feat.mean(dim=0, keepdim=False)

        # audio (SoundNet) feature, average-pooled over time
        audio_feat_file = vid_feat_dir + '/audio_features' + "/video" + str(video_id) + ".mp3.soundnet.h5"
        audio_h5 = h5py.File(audio_feat_file, 'r')
        audio_feat = audio_h5['layer24'][()]
        audio_feat = torch.from_numpy(audio_feat)
        audio_feat = audio_feat.mean(dim=1, keepdim=False)

        # simply concatenate the activity and audio features
        video_feat = torch.cat([video_feat, audio_feat])

        return video_feat, caption, index, video_id
        # video_feat: torch.Size([2048])
        # caption:    torch.Size([8])  (length varies per caption)
        # index:      e.g. 39587
        # video_id:   e.g. 8604

As you can see, loading the basic features is fairly simple.
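
One practical detail: the captions returned by __getitem__ have variable length, so the DataLoader needs a collate function that pads them. The repo handles this in its own data-loading code; the sketch below is only a minimal stand-in with names of my choosing, assuming dataset is an instance of one of the dataset classes above and each caption is a 1-D tensor of word ids.

    import torch
    from torch.utils.data import DataLoader

    def collate_fn(batch):
        # sort by caption length (descending), as packed RNN encoders usually expect
        batch.sort(key=lambda x: len(x[1]), reverse=True)
        video_feats, captions, indices, video_ids = zip(*batch)
        video_feats = torch.stack(video_feats)            # (B, 2048)
        lengths = [len(c) for c in captions]
        padded = torch.zeros(len(captions), max(lengths)).long()
        for i, c in enumerate(captions):
            padded[i, :lengths[i]] = c[:lengths[i]]
        return video_feats, padded, lengths, list(indices), list(video_ids)

    loader = DataLoader(dataset, batch_size=128, shuffle=True, collate_fn=collate_fn)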

2. Feature transformation

This step maps the different features into a common dimensionality. The code is essentially a linear layer plus an activation (perhaps a direction to optimize in the future); in the end every modality is represented by a 1024-dimensional vector.
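
As a rough sketch of what this amounts to (the class name, the choice of tanh as the activation, and the final L2 normalization are my assumptions, not copied from the repo):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureProjection(nn.Module):
        """Project a raw feature (e.g. 2048-d appearance or I3D+audio) into the 1024-d joint space."""
        def __init__(self, in_dim, embed_dim=1024):
            super().__init__()
            self.fc = nn.Linear(in_dim, embed_dim)

        def forward(self, x):
            x = torch.tanh(self.fc(x))          # linear + activation
            return F.normalize(x, p=2, dim=-1)  # L2-normalize so a dot product equals cosine similarity

    proj = FeatureProjection(2048)
    emb = proj(torch.randn(32, 2048))           # -> torch.Size([32, 1024])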

3. Loss function:

    def forward(self, im, s):
        # compute the video-sentence cosine similarity matrix
        scores = self.sim(im, s)
        # the diagonal holds the scores of the matching (positive) pairs
        diagonal = scores.diag().view(im.size(0), 1)
        d1 = diagonal.expand_as(scores)      # row i repeats the positive score of pair i
        d2 = diagonal.t().expand_as(scores)  # column j repeats the positive score of pair j

        # sort each row from large to small; returns (values, indices)
        d1_sort, d1_indice = torch.sort(scores, dim=1, descending=True)
        val, id1 = torch.min(d1_indice, 1)
        rank_weights1 = id1.float()  # just a tensor of the right shape; overwritten below

        # weight = 1 + beta / (N - rank of the positive pair within its row)
        for j in range(d1.size(0)):
            rank_weights1[j] = 1 + torch.tensor(self.beta) / (d1.size(0) - (d1_indice[j, :] == j).nonzero()).to(dtype=torch.float)

        # same thing per column, i.e. per caption
        d2_sort, d2_indice = torch.sort(scores.t(), dim=1, descending=True)
        val, id2 = torch.min(d2_indice, 1)
        rank_weights2 = id2.float()

        for k in range(d2.size(0)):
            rank_weights2[k] = 1 + torch.tensor(self.beta) / (d2.size(0) - (d2_indice[k, :] == k).nonzero()).to(dtype=torch.float)

        # compare each positive score to the other scores in its row (video query vs. all captions)
        # caption retrieval
        cost_s = (self.margin + scores - d1).clamp(min=0)
        # compare each positive score to the other scores in its column (caption query vs. all videos)
        # video retrieval
        cost_im = (self.margin + scores - d2).clamp(min=0)

        # clear the diagonal: positive pairs contribute no cost
        mask = torch.eye(scores.size(0)) > .5
        I = Variable(mask)
        if torch.cuda.is_available():
            I = I.cuda()
        cost_s = cost_s.masked_fill_(I, 0)
        cost_im = cost_im.masked_fill_(I, 0)

        # keep only the maximum-violating (hardest) negative for each query
        cost_s = cost_s.max(1)[0]
        cost_im = cost_im.max(0)[0]

        # weight the hinge losses by the rank-based weights
        cost_s = torch.mul(rank_weights1, cost_s)
        cost_im = torch.mul(rank_weights2, cost_im)

        return cost_s.sum() + cost_im.sum()
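
The two Python loops make the rank weights easy to read but slow. Purely as a sanity check of the logic (this is my own compact rewrite, not code from the repo, and the hyperparameter values are placeholders), the same loss can be written in a vectorized form:

    import torch

    def weighted_max_violation_loss(v_emb, t_emb, margin=0.2, beta=0.1):
        """Compact sketch of the weighted hardest-negative ranking loss.

        v_emb, t_emb: L2-normalized embeddings of shape (N, D); row i of each forms a positive pair.
        """
        scores = v_emb @ t_emb.t()                  # cosine similarity matrix (inputs are normalized)
        n = scores.size(0)
        pos = scores.diag().view(n, 1)
        idx = torch.arange(n, device=scores.device)

        # hinge against every negative, zero out the positives, keep the hardest negative per query
        mask = torch.eye(n, dtype=torch.bool, device=scores.device)
        cost_t = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0).max(1)[0]      # caption retrieval
        cost_v = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0).max(0)[0]  # video retrieval

        # rank of each positive pair within its own row / column (0 = ranked first)
        rank_t = (scores.argsort(dim=1, descending=True) == idx.view(n, 1)).float().argmax(dim=1)
        rank_v = (scores.t().argsort(dim=1, descending=True) == idx.view(n, 1)).float().argmax(dim=1)
        w_t = 1 + beta / (n - rank_t).float()
        w_v = 1 + beta / (n - rank_v).float()

        return (w_t * cost_t).sum() + (w_v * cost_v).sum()

    # toy check with random normalized embeddings
    v = torch.nn.functional.normalize(torch.randn(8, 1024), dim=-1)
    t = torch.nn.functional.normalize(torch.randn(8, 1024), dim=-1)
    print(weighted_max_violation_loss(v, t))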

4. Computing the final ranking metrics: R@1, R@5, R@10

def i2t(videos, captions, videos2, captions2, shared_space='both', measure='cosine', return_ranks=False):
    # each MSR-VTT video has 20 captions, so the embedding matrices contain one row per caption;
    # every 20th row therefore corresponds to a new video
    npts = int(videos.shape[0] / 20)
    index_list = []
    print(npts)

    ranks = numpy.zeros(npts)
    top1 = numpy.zeros(npts)
    for index in range(npts):
        # get the query video in both embedding spaces
        im = videos[20 * index].reshape(1, videos.shape[1])
        im2 = videos2[20 * index].reshape(1, videos2.shape[1])
        # compute scores against all captions
        if 'both' == shared_space:
            d1 = numpy.dot(im, captions.T).flatten()    # similarities in the object-text space, flattened to 1-D
            d2 = numpy.dot(im2, captions2.T).flatten()  # similarities in the activity-text space
            d = d1 + d2                                 # fuse the two spaces by summing their scores
        elif 'object_text' == shared_space:
            d = numpy.dot(im, captions.T).flatten()
        elif 'activity_text' == shared_space:
            d = numpy.dot(im2, captions2.T).flatten()

        # sort in descending order of score, so the best-matching captions come first
        inds = numpy.argsort(d)[::-1]
        index_list.append(inds[0])

        # the rank of this video is the best rank achieved by any of its 20 ground-truth captions
        rank = 1e20
        for i in range(20 * index, 20 * index + 20, 1):
            tmp = numpy.where(inds == i)[0][0]
            if tmp < rank:
                rank = tmp
        ranks[index] = rank
        top1[index] = inds[0]

    # compute metrics (ranks are 0-based, hence the "< k" tests)
    r1 = 100.0 * len(numpy.where(ranks < 1)[0]) / len(ranks)
    r5 = 100.0 * len(numpy.where(ranks < 5)[0]) / len(ranks)
    r10 = 100.0 * len(numpy.where(ranks < 10)[0]) / len(ranks)
    medr = numpy.floor(numpy.median(ranks)) + 1   # median rank, 1-based
    meanr = ranks.mean() + 1                      # mean rank, 1-based
    if return_ranks:
        return (r1, r5, r10, medr, meanr), (ranks, top1)
    else:
        return (r1, r5, r10, medr, meanr)
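
To see the expected input shapes, here is a toy call with random embeddings (the metrics are of course meaningless on random data; 100 videos x 20 captions and 1024-d embeddings are assumed only for illustration):

    import numpy

    videos    = numpy.random.randn(2000, 1024)   # object-text space, one row per caption (each video repeated 20x)
    captions  = numpy.random.randn(2000, 1024)
    videos2   = numpy.random.randn(2000, 1024)   # activity-text space
    captions2 = numpy.random.randn(2000, 1024)

    r1, r5, r10, medr, meanr = i2t(videos, captions, videos2, captions2, shared_space='both')
    print(r1, r5, r10, medr, meanr)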
