yolov7 improvement using QFocalLoss

The three major components of deep learning are the data, the model, and the loss. A good loss makes it easier for the model to learn the required features, but as deep learning has matured, the gains from tuning the loss of a mature task have become smaller and smaller. Even so, it is still worth trying when it is hard to make progress at the data or model level.

BCEBlurWithLogitsLoss

In yolov7, the loss consists of three parts: cls loss, obj loss, and box loss, i.e. the category loss, the box confidence (objectness) loss, and the box position loss. Both cls loss and obj loss use BCEBlurWithLogitsLoss. The source code of this loss is as follows:

import torch
import torch.nn as nn


class BCEBlurWithLogitsLoss(nn.Module):
    # BCEWithLogitsLoss() with reduced missing label effects.
    def __init__(self, alpha=0.05):
        super(BCEBlurWithLogitsLoss, self).__init__()
        self.loss_fcn = nn.BCEWithLogitsLoss(reduction='none')  # must be nn.BCEWithLogitsLoss()
        self.alpha = alpha

    def forward(self, pred, true):
        loss = self.loss_fcn(pred, true)
        pred = torch.sigmoid(pred)  # prob from logits
        dx = pred - true  # reduce only missing label effects
        # dx = (pred - true).abs()  # reduce missing label and false label effects
        alpha_factor = 1 - torch.exp((dx - 1) / (self.alpha + 1e-4))
        loss *= alpha_factor
        return loss.mean()

Compared with ordinary cross-entropy loss, this loss reduces the effect of missing labels (samples that are actually positive but were not annotated) and, if the commented-out line is used instead, also of false labels (wrongly annotated samples). It does this by lowering the loss weight of such samples: the difference dx between the prediction and the label is the measure, and when the difference is large the weight alpha_factor drops towards zero.
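To see which cases this actually down-weights, here is a quick numeric check of the weighting factor (a small illustration of my own, not part of the yolov7 code):

import torch

alpha = 0.05
# dx = pred - true: ~1 means a confident prediction on an unlabeled (possibly missing-label) sample,
# ~0 means the prediction agrees with the label, ~-1 means a hard but correctly labeled positive.
dx = torch.tensor([0.99, 0.5, 0.0, -0.99])
alpha_factor = 1 - torch.exp((dx - 1) / (alpha + 1e-4))
print(alpha_factor)  # ~[0.18, 1.00, 1.00, 1.00]: only near-certain "missing label" cases are suppressed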

FocalLoss

Yolov7 also provides another loss, FocalLoss, which can be turned on by setting the focal-loss gamma (fl_gamma) greater than 0 in the hyp file. The idea of FocalLoss is to increase the weight of difficult samples and to assign different weights to positive and negative samples. The formula is as follows:

FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)
Consider the binary classification case, where p_t is the predicted probability of the true class (p_t = p when the label y = 1, and p_t = 1 - p otherwise). When a sample is misclassified, e.g. y = 1 and p = 0.3, then p_t = 0.3 and the modulating factor (1 - p_t)^γ is relatively large (γ usually takes 2, giving 0.7² = 0.49), while an easy, well-classified sample with p_t = 0.9 only gets 0.01. In other words, misclassified samples are exactly the hard-to-classify ones, and they keep most of their loss weight.
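For reference, this is what such a wrapper looks like in code. The sketch below closely follows the widely used YOLOv5-style FocalLoss wrapper that yolov7 also ships; check the repository for the exact version:

import torch
import torch.nn as nn


class FocalLoss(nn.Module):
    # Wraps a focal-loss modulating factor around an existing BCEWithLogitsLoss
    def __init__(self, loss_fcn, gamma=1.5, alpha=0.25):
        super(FocalLoss, self).__init__()
        self.loss_fcn = loss_fcn  # must be nn.BCEWithLogitsLoss()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = loss_fcn.reduction
        self.loss_fcn.reduction = 'none'  # required to apply FL to each element

    def forward(self, pred, true):
        loss = self.loss_fcn(pred, true)
        pred_prob = torch.sigmoid(pred)  # prob from logits
        p_t = true * pred_prob + (1 - true) * (1 - pred_prob)  # prob of the true class
        alpha_factor = true * self.alpha + (1 - true) * (1 - self.alpha)
        modulating_factor = (1.0 - p_t) ** self.gamma  # e.g. (1 - 0.3)**2 = 0.49 for a hard positive
        loss *= alpha_factor * modulating_factor

        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        return loss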

QFocalLoss

FocalLoss still leaves a problem: the classification score and the IoU/centerness (quality) score are used inconsistently between training and testing.

This inconsistency is mainly reflected in two aspects:

1) Inconsistent usage. During training, the classification branch and the quality estimation branch are supervised separately, but during testing their outputs are multiplied together as the basis for NMS score sorting. This is clearly not end-to-end, so a gap between training and inference is unavoidable.

2) Inconsistent objects. With the help of Focal Loss, the classification branch can be trained on a small number of positive samples together with a large number of negative samples, but quality estimation is usually trained only on positive samples. For a one-stage detector, however, NMS sorts all samples by the product of the classification score and the predicted quality score, so the quality predictions of the many negative samples never receive any supervision during training; for them, quality prediction is effectively undefined behavior. This can easily lead to the following situation: a true negative sample with a relatively low classification score predicts an untrustworthy, extremely high quality score and ends up ranked ahead of a true positive sample whose classification score is not high enough and whose quality score is relatively low. For details, refer to the Zhihu article "Generalized Focal Loss" by the author of QFocalLoss.
For the first problem, to keep training and testing consistent and to let both the classification score and the quality score be trained on all positive and negative samples, a natural solution is to merge the two representations: keep the classification vector, but let the value at the ground-truth class position represent the quality (IoU) score rather than a pure classification score. This gives a joint representation of the two and, leaving the optimization question aside for the moment, essentially solves the first problem.

Simply put, anyone who has run inference knows that the confidence of a detection = class confidence * box (objectness) confidence. During training, however, the two confidences are supervised separately, which allows the inconsistent case where the class confidence is low but the box confidence is high. The starting point of QFocalLoss is therefore to change the label in the classification loss to label * box confidence during training. The resulting label is a continuous value between 0 and 1, which the original FocalLoss, designed for discrete 0/1 labels, cannot handle, so the author made the following modification:
QFL(\sigma) = -|y - \sigma|^\beta \big[(1 - y)\log(1 - \sigma) + y\log(\sigma)\big]

where y is the soft label (class label * box IoU) and \sigma is the predicted probability.
Someone submitted a QFocalLoss implementation to yolov7, but it is never enabled, and judging from the code it does not actually implement the author's main idea: joint training of the classification score and the box confidence score. So instead of using the implementation from yolov7, I modified it:

class QFocalLoss(nn.Module):
    # Wraps Quality Focal Loss around an existing loss_fcn(), i.e. criteria = QFocalLoss(nn.BCEWithLogitsLoss())
    def __init__(self, loss_fcn, beta=2.0):
        super(QFocalLoss, self).__init__()
        self.loss_fcn = loss_fcn  # must be nn.BCEWithLogitsLoss()
        self.beta = beta
        self.reduction = loss_fcn.reduction
        self.loss_fcn.reduction = 'none'  # required to apply QFL to each element

    def forward(self, pred, target):
        assert len(target) == 2, "target must be a tuple of (class, score)"
        label, score = target
        iou_target = label * score.view(-1, 1)  # soft label: class targets scaled by box IoU
        pred_sigmoid = torch.sigmoid(pred)  # prob from logits
        scale_factor = torch.abs(pred_sigmoid - iou_target)  # |y - sigma|
        loss = self.loss_fcn(pred, iou_target) * scale_factor.pow(self.beta)

        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        else:  # 'none'
            return loss
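A minimal usage sketch of the class above (the shapes and values here are made up for illustration; in the real training loop they come from the matched anchors):

# hypothetical shapes: n matched anchors, nc classes
n, nc = 8, 80
pred_cls = torch.randn(n, nc)                  # raw class logits for the matched anchors
t = torch.zeros(n, nc)                         # class targets (cp/cn simplified to 1/0)
t[range(n), torch.randint(0, nc, (n,))] = 1.0
iou = torch.rand(n)                            # IoU of each predicted box with its GT

qfl = QFocalLoss(nn.BCEWithLogitsLoss(), beta=2.0)
loss_cls = qfl(pred_cls, (t, iou.detach()))    # target is the (class, score) tuple
print(loss_cls)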

The main difference is that here the target consists of the category label together with the confidence of the predicted box.
To use QFocalLoss, you need to pass in the confidence of the predicted box at the call site. Concretely, this is the IoU between the predicted box and the GT, which is already computed in the loss code:

# Objectness
tobj[b, a, gj, gi] = (1.0 - self.gr) + self.gr * iou.detach().clamp(0).type(tobj.dtype)  # iou ratio

# Classification
selected_tcls = targets[i][:, 1].long()
if self.nc > 1:  # cls loss (only if multiple classes)
    t = torch.full_like(ps[:, 5:], self.cn, device=device)  # targets
    t[range(n), selected_tcls] = self.cp

    if isinstance(self.BCEcls, QFocalLoss):
        lcls += self.BCEcls(ps[:, 5:], (t, iou.detach()))  # pass (class targets, box IoU)
    else:
        lcls += self.BCEcls(ps[:, 5:], t)  # BCE

Then change the place in the code that originally constructs FocalLoss so that it constructs QFocalLoss instead.
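As a sketch, the gating in ComputeLoss.__init__ could be changed like this (assuming the usual fl_gamma hyperparameter and the FocalLoss/QFocalLoss classes above; the details may differ in your version of loss.py):

# fragment inside ComputeLoss.__init__, after BCEcls / BCEobj are created
g = h['fl_gamma']  # focal loss gamma from the hyp file
if g > 0:
    # original: BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)
    # modified: use QFocalLoss for the classification branch only, since only the
    # cls targets are combined with the box IoU; obj loss keeps its own weighting
    BCEcls = QFocalLoss(BCEcls)
    BCEobj = FocalLoss(BCEobj, g)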

Usage experience

In my own use on private tasks, QFocalLoss has basically not improved the mAP, and false detections even increased (recall did go up). The cause is probably the same as with FocalLoss: because difficult samples are weighted more heavily, the model pays too much attention to ambiguous samples. On small datasets it may still bring an improvement, but for general tasks BCEBlurWithLogitsLoss remains the better choice.