PyTorch loss functions: the difference between classification and regression

The difference between the torch.nn library and the torch.nn.functional library

  1. torch.nn Library: This library provides many predefined layers, such as fully connected layers (Linear), convolutional layers (Conv2d), etc., as well as some loss functions (such as MSELoss, CrossEntropyLoss, etc.). These layers are all classes, and they all inherit from nn.Module, so they can be easily integrated into custom models. Layers in the torch.nn library have their own weights and biases, and these parameters can be updated by the optimizer.

    1. When the operation you need contains learnable parameters (such as weights and biases), it is usually more convenient to use the torch.nn library. For example, for convolutional layers (Conv2d), fully connected layers (Linear), etc., since they contain learnable parameters, classes in the torch.nn library are usually used. These classes automatically manage the creation and update of parameters.

      For example:

      import torch.nn as nn
      
      conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
      fc = nn.Linear(in_features=1024, out_features=10)
      
  2. torch.nn.functional library: This library provides some functions, such as activation functions (such as relu, sigmoid, etc.), pooling functions (such as max_pool2d, avg_pool2d, etc.) and some loss functions (such as cross_entropy, mse_loss, etc.). These functions are more flexible, but using them requires manual management of weights and biases.

    1. For operations without learnable parameters, such as ReLU activation functions, pooling operations, dropout, etc., you can choose to use the torch.nn.functional library because these operations do not require additional parameters.

      import torch.nn.functional as F
      
      x = F.relu(x)
      x = F.max_pool2d(x, kernel_size=2)
      x = F.dropout(x, p=0.5, training=self.training)
      
  3. For loss functions, both the torch.nn library and the torch.nn.functional library provide implementations, and you can choose according to your needs. If your loss function needs to carry configuration or state (such as the pos_weight argument of nn.BCEWithLogitsLoss), the class from the torch.nn library is more convenient, because the module stores that state for you. If your loss function needs no such state, you can use the torch.nn.functional library and avoid creating unnecessary objects. (A short pos_weight sketch follows the code below.)

    For example:

    import torch.nn as nn
    import torch.nn.functional as F
    
    # Use nn library
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(prediction, target)
    
    # Use functional library
    loss = F.cross_entropy(prediction, target)
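
As an illustration of point 3, the following minimal sketch lets nn.BCEWithLogitsLoss hold its configuration once instead of passing it on every call; the pos_weight value of 3.0, the batch size, and the class imbalance are made up for illustration:

import torch
import torch.nn as nn

# Hypothetical imbalanced binary task: positives are rare, so weight them 3x.
logits = torch.randn(8, 1)                      # raw model outputs (before sigmoid)
labels = torch.randint(0, 2, (8, 1)).float()    # binary targets

# The class stores pos_weight once; every later call reuses it.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
loss = loss_fn(logits, labels)
print(loss)  # a scalar tensor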
    

Correspondence between torch.nn and torch.nn.functional loss functions

The following is the correspondence between some common loss functions in torch.nn and torch.nn.functional:

  1. Cross entropy loss:
    1. torch.nn.CrossEntropyLoss
    2. torch.nn.functional.cross_entropy
  2. Negative log-likelihood loss:
    1. torch.nn.NLLLoss
    2. torch.nn.functional.nll_loss
  3. Mean squared error loss:
    1. torch.nn.MSELoss
    2. torch.nn.functional.mse_loss
  4. Mean absolute error loss:
    1. torch.nn.L1Loss
    2. torch.nn.functional.l1_loss
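
The paired entries compute the same quantity. A quick sketch (shapes chosen arbitrarily) confirming that the class form and the functional form agree for the MSE and L1 losses:

import torch
import torch.nn as nn
import torch.nn.functional as F

prediction = torch.randn(4, 3)
target = torch.randn(4, 3)

# Class form and functional form give the same value.
print(torch.allclose(nn.MSELoss()(prediction, target), F.mse_loss(prediction, target)))  # True
print(torch.allclose(nn.L1Loss()(prediction, target), F.l1_loss(prediction, target)))    # True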

The difference between classification and regression loss functions

  1. Classification problem: The goal of a classification problem is to predict the category of input data. For this type of problem, commonly used loss functions include Cross Entropy Loss and Negative Log Likelihood Loss. These loss functions calculate the loss based on the difference between the predicted probability distribution and the true probability distribution.
    1. nn.CrossEntropyLoss: This is a loss function for classification problems. It expects a tensor of shape (batch_size, num_classes), where each element is the raw score of the corresponding class (usually the output of the last fully connected layer), and a tensor of shape (batch_size,), where each element is the true category label.
    2. nn.NLLLoss: This is also a loss function for classification problems. It expects a tensor of shape (batch_size, num_classes), where each element is the log probability of the corresponding class (usually the output of log_softmax), and a tensor of shape (batch_size,), where each element is the true category label.
  2. Regression Problem: The goal of a regression problem is to predict a continuous value. For this type of problem, commonly used loss functions include Mean Squared Error Loss and Mean Absolute Error Loss. These loss functions calculate the loss based on the difference between the predicted value and the true value.
    1. nn.MSELoss: This is the loss function used for regression problems. The input it expects is two tensors of the same shape, one with the predicted value and one with the true value. The shapes of these two tensors can be arbitrary as long as they are the same.
    2. nn.L1Loss: This is also the loss function used for regression problems. The input it expects is two tensors of the same shape, one with the predicted value and one with the true value. The shapes of these two tensors can be arbitrary as long as they are the same.
Example

nn.MSELoss()

Input: predicted value and target value; their shapes must be the same. For example, if you have a batch of batch_size samples, each with n features, then both the predicted value and the target value should have shape (batch_size, n).

Output: A scalar representing the calculated mean squared error loss.

For example:

import torch
import torch.nn as nn

# Suppose we have a batch size of 3 data, each data has 2 features
prediction = torch.randn(3, 2)
target = torch.randn(3, 2)

loss_fn = nn.MSELoss()
loss = loss_fn(prediction, target)

print(loss) # Output a scalar representing the calculated mean square error loss

F.cross_entropy()

Input: Predicted value and target value. The shape of the predicted value should be (batch_size, num_classes), where each element is the raw (unnormalized) score of the corresponding category; the shape of the target value should be (batch_size,), which holds the true category label of each sample.

Output: A scalar representing the calculated cross-entropy loss.

For example:

import torch
import torch.nn.functional as F

# Suppose we have data with a batch size of 3 and 4 categories
prediction = torch.randn(3, 4)
target = torch.tensor([1, 0, 3])  # true category labels

loss = F.cross_entropy(prediction, target)

print(loss) # Output a scalar representing the calculated cross-entropy loss

The difference between CrossEntropyLoss() and NLLLoss() in multi-class classification

  1. CrossEntropyLoss(): Its input is the model’s raw scores for each category (usually the output of the last fully connected layer), and these scores are not normalized in any way. CrossEntropyLoss() internally performs a log_softmax operation on these scores, and then calculates the cross-entropy loss.
  2. NLLLoss(): Its input is the log probability of each category by the model. These log probabilities are usually obtained by performing a log_softmax operation on the original output of the model. NLLLoss() will directly calculate the negative log-likelihood loss.

CrossEntropyLoss() = softmax + log + NLLLoss() = log_softmax + NLLLoss()
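
A small sketch (random scores, arbitrary shapes) verifying this identity numerically:

import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.randn(3, 5)          # raw, unnormalized class scores
labels = torch.tensor([0, 2, 4])    # true class indices

ce = nn.CrossEntropyLoss()(scores, labels)
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), labels)
print(torch.allclose(ce, nll))  # True: CrossEntropyLoss = log_softmax + NLLLoss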

The difference between BCELoss and BCEWithLogitsLoss in binary classification

BCELoss() and BCEWithLogitsLoss() are both commonly used loss functions in PyTorch, mainly for binary classification problems, but they differ in the input they expect and in how they process it.

  1. BCELoss(): Its input is the model’s probability for each category, which is usually obtained by performing a sigmoid operation on the original output of the model. BCELoss() directly calculates the binary cross-entropy loss.
  2. BCEWithLogitsLoss(): Its input is the model’s raw scores for each category (usually the output of the last fully connected layer), and these scores are not normalized in any way. BCEWithLogitsLoss() internally performs a sigmoid operation on these scores, and then calculates the binary cross-entropy loss.

In summary, the main difference between BCELoss() and BCEWithLogitsLoss() is their input: BCELoss() expects the input to be the model’s probabilistic output, while BCEWithLogitsLoss() expects the input to be the model’s raw output. In actual use, you can choose which loss function to use based on your needs and the output of the model.

In addition, BCEWithLogitsLoss() fuses the sigmoid and the loss calculation into a single, numerically more stable operation. Therefore, in actual use, if the output of the model is the raw score, it is recommended to use BCEWithLogitsLoss().
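
A short sketch (random logits, arbitrary shapes) showing the equivalence: applying sigmoid manually before BCELoss() gives the same value that BCEWithLogitsLoss() computes directly from the raw scores, just less stably:

import torch
import torch.nn as nn

logits = torch.randn(6)                        # raw model outputs
labels = torch.randint(0, 2, (6,)).float()     # binary targets

loss_a = nn.BCELoss()(torch.sigmoid(logits), labels)    # expects probabilities
loss_b = nn.BCEWithLogitsLoss()(logits, labels)         # expects raw scores
print(torch.allclose(loss_a, loss_b))  # True (up to floating-point error)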

Detailed explanation of the reduction function in the regression loss function

Its complete definition is torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean').

Below is an explanation of these parameters:

  1. size_average (deprecated): If set to True, the loss function averages the loss over each mini-batch. If set to False, it sums the loss over each mini-batch. The default value is True. This parameter has been deprecated; use the reduction parameter instead.
  2. reduce (deprecated): If set to True, the loss function returns a scalar value that is the average or sum of the losses over all input elements (depending on the size_average parameter). If set to False, it returns a tensor of loss values, one per input element. The default value is True. This parameter has been deprecated; use the reduction parameter instead.
  3. reduction: Specifies how to reduce the loss. Can be 'none' (no reduction; returns a tensor of per-element losses), 'mean' (returns the average of the losses over all input elements) or 'sum' (returns the sum of the losses over all input elements). The default value is 'mean'.

The inputs to the nn.MSELoss() function are two tensors, representing the predicted value and the target value respectively. They must have the same shape. The output of the function is a scalar value representing the loss.
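
A short sketch (arbitrary shapes) showing what each reduction setting returns:

import torch
import torch.nn as nn

prediction = torch.randn(3, 2)
target = torch.randn(3, 2)

print(nn.MSELoss(reduction='none')(prediction, target).shape)  # torch.Size([3, 2]): per-element losses
print(nn.MSELoss(reduction='mean')(prediction, target))        # scalar: average over all elements
print(nn.MSELoss(reduction='sum')(prediction, target))         # scalar: sum over all elements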

Advantages of nn.SmoothL1Loss compared to nn.MSELoss loss function

  1. nn.MSELoss (mean squared error loss) is very effective for regression problems, but it is very sensitive to outliers because it squares each error. This means that even a single sample whose prediction is far from the true value can make the overall loss increase dramatically.
  2. nn.SmoothL1Loss (smooth L1 loss) is more robust to outliers. It combines the advantages of L1 loss and L2 loss: when the difference between the predicted value and the true value is large, it behaves like L1 loss (i.e. absolute error loss) and is insensitive to outliers; when the predicted value is close to the true value, it behaves like L2 loss (i.e. squared error loss), allowing finer optimization of the model.

Therefore, a major advantage of nn.SmoothL1Loss is that it can find a balance between handling outliers and performing fine-grained optimizations, which can be very useful in certain tasks.

nn.SmoothL1Loss achieves this advantage through a specific mathematical formula. The formula is as follows:

SmoothL1Loss(x, y) = 0.5 * (x - y)^2, if abs(x - y) < 1
                   = abs(x - y) - 0.5, otherwise

The meaning of this formula is that when the difference between the predicted value and the true value is less than 1, use the square error loss (i.e. L2 loss); when the difference is greater than or equal to 1, use the absolute value error loss (i.e. L1 loss).

It can be seen that when the gap is small, SmoothL1Loss behaves similarly to nn.MSELoss and finely optimizes these small errors. When the gap is large, SmoothL1Loss behaves like an L1 loss and does not over-penalize these large errors, thereby improving robustness against outliers.

This is how nn.SmoothL1Loss finds the balance between handling outliers and making fine optimizations.
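
A brief sketch (made-up values, with one deliberate outlier) contrasting how the two losses react to a single large error:

import torch
import torch.nn as nn

prediction = torch.tensor([0.1, 0.2, 5.0])   # the last value is a deliberate outlier
target = torch.tensor([0.0, 0.0, 0.0])

print(nn.MSELoss()(prediction, target))       # about 8.35: dominated by the squared outlier
print(nn.SmoothL1Loss()(prediction, target))  # about 1.51: the outlier only contributes linearly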

The role of nn.HuberLoss

nn.HuberLoss, also known as Huber loss, is a loss function that combines Mean Squared Error (MSE) and Mean Absolute Error (MAE). It shows good performance when dealing with regression problems, especially in the presence of outliers.

The calculation formula of Huber loss is as follows:

HuberLoss(x, y) = 0.5 * (x - y)^2, if abs(x - y) < delta
                = delta * abs(x - y) - 0.5 * delta^2, otherwise

The meaning of this formula is that when the difference between the predicted value and the true value is less than the threshold delta, the squared error loss (i.e. MSE-like) is used; when the difference is greater than or equal to delta, a linear error loss (i.e. MAE-like) is used.

Similar to nn.SmoothL1Loss, nn.HuberLoss finds a balance between handling outliers and fine-grained optimization. When the prediction error is small, it behaves like MSE and can finely optimize these small errors; when the prediction error is large, it behaves like MAE and does not over-penalize these large errors, thereby improving robustness to outliers.

In addition, an advantage of nn.HuberLoss is that its gradient is bounded throughout the domain, which makes the model more stable during training.
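
A minimal sketch (made-up values; delta chosen arbitrarily) of nn.HuberLoss, which is available in recent PyTorch versions; with delta=1.0 it coincides with nn.SmoothL1Loss:

import torch
import torch.nn as nn

prediction = torch.tensor([0.1, 0.2, 5.0])
target = torch.tensor([0.0, 0.0, 0.0])

print(nn.HuberLoss(delta=1.0)(prediction, target))  # about 1.51, same as SmoothL1Loss here
print(nn.HuberLoss(delta=2.0)(prediction, target))  # larger delta widens the quadratic region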
