(ICLR 2022) ODConv: Plug-and-play dynamic convolution (with code)

Paper address: Omni-Dimensional Dynamic Convolution | OpenReview

Code address: https://github.com/OSVAI/ODConv/blob/main/modules/odconv.py

1. What is it?

ODConv (Omni-Dimensional Dynamic Convolution) is a dynamic convolution operator. Its principle is to adjust the convolution kernels dynamically according to the input: a lightweight attention module computes input-dependent attention values that modulate the kernels, so that the effective convolution adapts to each input sample. Different from CondConv and DyConv, which learn only a single attention scalar per kernel, ODConv learns complementary attentions along all four dimensions of the kernel space: the spatial dimension, the input channel dimension, the output channel dimension and the kernel number dimension. It can therefore adapt to different inputs more effectively.

2. Why?

Conventional convolution has only one static convolution kernel, applied regardless of the input sample. Dynamic convolution instead linearly combines multiple convolution kernels with weights that depend on the input, which makes the convolution input-dependent. It can be described as follows:

y = (a_{w1} W_{1} + a_{w2} W_{2} + ... + a_{wn} W_{n}) * x,  with a_{wi} = \pi_{wi}(x)

where x and y are the input and output features, W_{i} is the i-th convolution kernel, and a_{wi} is its input-dependent attention scalar.

Although the definition of dynamic convolution is very simple, CondConv and DyConv implement it differently, mainly in how the attention scalars a_{wi} are computed (the structure of \pi_{wi}(x)), in the training strategy, and in which layers dynamic convolution is applied. These implementation differences lead to different model accuracy, model size and inference efficiency.

  • Both compute \pi_{wi}(x) with an SE-like architecture, but CondConv uses Sigmoid while DyConv uses Softmax as the final activation;
  • DyConv adopts a degradation (temperature annealing) strategy during training to suppress the near one-hot outputs of Softmax;
  • Regarding the CNN architectures they are embedded in, CondConv replaces the convolutions and the fully connected layer of the last few blocks, while DyConv replaces all convolutions except the first one.

According to the formula of dynamic convolution, it has two basic elements (a minimal sketch combining them follows the list below):

  • The convolution kernels {W_{1}, ..., W_{n}};
  • The attention function \pi_{wi}(x) used to compute the attention scalars {a_{w1}, ..., a_{wn}}.
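
Putting the two elements together, here is a minimal PyTorch sketch of this basic definition of dynamic convolution. It is an illustration under simplifying assumptions (single group, no bias, made-up module and parameter names), not the official CondConv/DyConv code:

import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicDynamicConv2d(nn.Module):
    """Toy dynamic convolution: y = (a_w1*W_1 + ... + a_wn*W_n) * x."""

    def __init__(self, in_planes, out_planes, kernel_size, n_kernels=4, padding=0):
        super().__init__()
        self.padding = padding
        # n candidate kernels {W_1, ..., W_n}
        self.weight = nn.Parameter(
            torch.randn(n_kernels, out_planes, in_planes, kernel_size, kernel_size))
        # routing branch pi_w(x): GAP -> FC -> n attention scalars per sample
        self.routing = nn.Linear(in_planes, n_kernels)

    def forward(self, x):
        b = x.size(0)
        # a_wi = pi_wi(x); Softmax as in DyConv (CondConv uses Sigmoid instead)
        a_w = torch.softmax(self.routing(x.mean(dim=(2, 3))), dim=1)        # (b, n)
        # linearly combine the n kernels into one input-specific kernel per sample
        w = torch.einsum('bn,noihw->boihw', a_w, self.weight)               # (b, out, in, k, k)
        # run all per-sample convolutions in one call via the grouped-conv trick
        x = x.reshape(1, -1, x.size(2), x.size(3))
        w = w.reshape(-1, w.size(2), w.size(3), w.size(4))
        y = F.conv2d(x, w, padding=self.padding, groups=b)
        return y.view(b, -1, y.size(2), y.size(3))


# e.g. BasicDynamicConv2d(16, 32, 3, padding=1)(torch.randn(2, 16, 8, 8)).shape -> (2, 32, 8, 8)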

Given n convolution kernels, the corresponding kernel space has the following four dimensions:

  • Spatial kernel size k×k;
  • Input channel number c_{in};
  • Output channel number c_{out};
  • Number of convolution kernels n.

However, for CondConv and DyConv, \pi_{wi}(x) computes only a single attention scalar a_{wi} for each kernel W_{i}, which means that all filters W_{i}^{m} ∈ R^{k×k×c_{in}} of W_{i} share the same attention value for a given input. In other words, the spatial dimension, the input channel dimension and the output channel dimension of the kernel W_{i} are all ignored by CondConv and DyConv. This leads to only a coarse exploration of the kernel space, which may be the reason why CondConv and DyConv bring smaller performance gains on large networks.

In addition, compared with conventional convolution, dynamic convolution has roughly n times more kernel parameters (n=8 in CondConv, n=4 in DyConv). When dynamic convolution is used in many layers, it greatly increases the model size. We found that when the attention mechanism in CondConv/DyConv is removed (i.e. a_{wi}=1), the performance improvement almost disappears: for ResNet18, the gain drops from 1.78%/2.51% to 0.08%/0.14%.

The above findings mean that the attention mechanism in dynamic convolution plays the key role, and a more effective attention design may achieve a better trade-off between model accuracy and model size.

To a certain extent, ODConv can be regarded as a continuation of CondConv: it extends the single-dimensional dynamic property of CondConv to the spatial domain, the input channels, the output channels and the kernel number, hence the name omni-dimensional dynamic convolution. ODConv adopts a multi-dimensional attention mechanism with a parallel strategy to learn complementary attention along the four dimensions of the kernel space. As a "plug and play" operation, it can easily be embedded into existing CNNs. Experiments on ImageNet classification and COCO detection verify the merits of ODConv: it improves the performance of both large and lightweight models, making it a genuine all-rounder. It is worth mentioning that, thanks to its improved feature extraction ability, ODConv with even a single convolution kernel can achieve performance comparable to or better than existing multi-kernel dynamic convolutions.

3. How does it work?

3.1 Network structure

Based on the discussion above, ODConv introduces a multi-dimensional attention mechanism with a parallel strategy to learn more flexible attention along the four dimensions of the convolution kernel space. The figure above illustrates the differences between CondConv, DyConv and ODConv.

Following the definition of dynamic convolution, ODConv can be described as follows:

y = (a_{w1} ⊙ a_{f1} ⊙ a_{c1} ⊙ a_{s1} ⊙ W_{1} + ... + a_{wn} ⊙ a_{fn} ⊙ a_{cn} ⊙ a_{sn} ⊙ W_{n}) * x

where ⊙ denotes element-wise multiplication along different dimensions of the kernel space and * denotes convolution.

Here, a_{wi} is the attention scalar for the convolution kernel W_{i}, while a_{si} ∈ R^{k×k}, a_{ci} ∈ R^{c_{in}} and a_{fi} ∈ R^{c_{out}} are the three newly introduced attentions along the spatial dimension, the input channel dimension and the output channel dimension, respectively. These four attentions are computed by a multi-head attention module \pi_{i}(x).

In ODConv, for the convolution kernel W_{i}: a_{si} assigns different attention values to the convolution parameters at the k×k spatial positions (figure a above); a_{ci} assigns different attention values to the c_{in} input channels of each convolution filter (figure b); a_{fi} assigns different attention values to the c_{out} convolution filters, i.e. the output channels (figure c); and a_{wi} assigns different attention values to the n convolution kernels as a whole (figure d).

In principle, these four types of attention are complementary: progressively multiplying the convolution kernel W_{i} by attentions along the position, channel, filter and kernel dimensions makes the convolution operation input-dependent in every dimension of the kernel space, giving it a better ability to capture rich contextual information. Therefore, ODConv can greatly improve the feature extraction capability of convolution; more importantly, ODConv with fewer convolution kernels can achieve performance equivalent to or even better than that of CondConv and DyConv.
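
As a shape-level illustration of this complementary modulation (toy sizes assumed; the official implementation in Section 3.2 applies a_{ci} and a_{fi} to the feature maps instead, which is mathematically equivalent but faster):

import torch

n, c_out, c_in, k = 4, 32, 16, 3            # assumed toy sizes
W = torch.randn(n, c_out, c_in, k, k)       # n convolution kernels
a_w = torch.rand(n, 1, 1, 1, 1)             # kernel-wise attention  a_wi
a_f = torch.rand(1, c_out, 1, 1, 1)         # filter-wise attention  a_fi
a_c = torch.rand(1, 1, c_in, 1, 1)          # channel-wise attention a_ci
a_s = torch.rand(1, 1, 1, k, k)             # spatial attention      a_si

# modulate the kernels along all four dimensions of the kernel space,
# then aggregate them into one input-specific kernel for the convolution
W_dynamic = (a_w * a_f * a_c * a_s * W).sum(dim=0)   # (c_out, c_in, k, k)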

Comparing the two dynamic convolution formulas above, we can see that ODConv is a more general form of dynamic convolution. In addition, when n=1 and a_{s1}=a_{c1}=a_{w1}=1, ODConv degenerates into a convolution with only filter-level attention, y = (a_{f1} ⊙ W_{1}) * x: the c_{out} filters of a single static kernel are modulated according to the input before the convolution, which is similar to SE. Therefore SE can be viewed as a special case of ODConv.

So how are the four types of attention values in ODConv computed? Following CondConv and DyConv, ODConv also uses an SE-style attention module, but gives it multiple heads so that it can compute the four types of attention. The overall structure is shown in the figure above. Specifically, the input is first compressed by GAP into a vector of length c_{in}, which is reduced by an FC layer (with BN and ReLU) and then fed to four head branches that generate the four types of attention values. The output dimensions of the four heads are k×k, c_{in}×1, c_{out}×1 and n×1, respectively.

In terms of training, ODConv adopts the degradation (temperature annealing) strategy of DyConv to speed up training. In terms of architecture embedding, it follows DyConv and replaces all convolutions except the first one.

3.2 Code Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd


class Attention(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, groups=1, reduction=0.0625, kernel_num=4, min_channel=16):
        super(Attention, self).__init__()
        attention_channel = max(int(in_planes * reduction), min_channel)
        self.kernel_size = kernel_size
        self.kernel_num = kernel_num
        self.temperature = 1.0

        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(in_planes, attention_channel, 1, bias=False)
        self.bn = nn.BatchNorm2d(attention_channel)
        self.relu = nn.ReLU(inplace=True)

        # head for the input-channel attention a_ci (length c_in)
        self.channel_fc = nn.Conv2d(attention_channel, in_planes, 1, bias=True)
        self.func_channel = self.get_channel_attention

        if in_planes == groups and in_planes == out_planes: # depth-wise convolution
            self.func_filter = self.skip
        else:
            # head for the output-channel (filter) attention a_fi (length c_out)
            self.filter_fc = nn.Conv2d(attention_channel, out_planes, 1, bias=True)
            self.func_filter = self.get_filter_attention

        if kernel_size == 1: # point-wise convolution
            self.func_spatial = self.skip
        else:
            # head for the spatial attention a_si (k*k values)
            self.spatial_fc = nn.Conv2d(attention_channel, kernel_size * kernel_size, 1, bias=True)
            self.func_spatial = self.get_spatial_attention

        if kernel_num == 1:
            self.func_kernel = self.skip
        else:
            # head for the kernel attention a_wi (one value per candidate kernel)
            self.kernel_fc = nn.Conv2d(attention_channel, kernel_num, 1, bias=True)
            self.func_kernel = self.get_kernel_attention

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            if isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def update_temperature(self, temperature):
        self.temperature = temperature

    @staticmethod
    def skip(_):
        # returned when an attention branch is not needed; 1.0 leaves that dimension unmodulated
        return 1.0

    def get_channel_attention(self, x):
        channel_attention = torch.sigmoid(self.channel_fc(x).view(x.size(0), -1, 1, 1) / self.temperature)
        return channel_attention

    def get_filter_attention(self, x):
        filter_attention = torch.sigmoid(self.filter_fc(x).view(x.size(0), -1, 1, 1) / self.temperature)
        return filter_attention

    def get_spatial_attention(self, x):
        spatial_attention = self.spatial_fc(x).view(x.size(0), 1, 1, 1, self.kernel_size, self.kernel_size)
        spatial_attention = torch.sigmoid(spatial_attention / self.temperature)
        return spatial_attention

    def get_kernel_attention(self, x):
        kernel_attention = self.kernel_fc(x).view(x.size(0), -1, 1, 1, 1, 1)
        kernel_attention = F.softmax(kernel_attention / self.temperature, dim=1)
        return kernel_attention

    def forward(self, x):
        x = self.avgpool(x)
        x = self.fc(x)
        x = self.bn(x)
        x = self.relu(x)
        return self.func_channel(x), self.func_filter(x), self.func_spatial(x), self.func_kernel(x)


class ODConv2d(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0, dilation=1, groups=1,
                 reduction=0.0625, kernel_num=4):
        super(ODConv2d, self).__init__()
        self.in_planes = in_planes
        self.out_planes = out_planes
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.kernel_num = kernel_num
        self.attention = Attention(in_planes, out_planes, kernel_size, groups=groups,
                                   reduction=reduction, kernel_num=kernel_num)
        self.weight = nn.Parameter(torch.randn(kernel_num, out_planes, in_planes//groups, kernel_size, kernel_size),
                                   requires_grad=True)
        self._initialize_weights()

        if self.kernel_size == 1 and self.kernel_num == 1:
            self._forward_impl = self._forward_impl_pw1x
        else:
            self._forward_impl = self._forward_impl_common

    def _initialize_weights(self):
        for i in range(self.kernel_num):
            nn.init.kaiming_normal_(self.weight[i], mode='fan_out', nonlinearity='relu')

    def update_temperature(self, temperature):
        self.attention.update_temperature(temperature)

    def _forward_impl_common(self, x):
        # Multiplying channel attention (or filter attention) into the weights or into the feature maps
        # is mathematically equivalent; the latter is used here since it runs faster with less GPU memory.
        channel_attention, filter_attention, spatial_attention, kernel_attention = self.attention(x)
        batch_size, in_planes, height, width = x.size()
        # apply the input-channel attention a_ci directly on the input feature map
        x = x * channel_attention
        # fold the batch into the channel dimension so that each sample can be convolved
        # with its own aggregated kernel in a single grouped convolution
        x = x.reshape(1, -1, height, width)
        # modulate the stacked kernels with the spatial (a_si) and kernel (a_wi) attentions,
        # then sum over the kernel_num dimension to get one kernel per sample
        aggregate_weight = spatial_attention * kernel_attention * self.weight.unsqueeze(dim=0)
        aggregate_weight = torch.sum(aggregate_weight, dim=1).view(
            [-1, self.in_planes // self.groups, self.kernel_size, self.kernel_size])
        output = F.conv2d(x, weight=aggregate_weight, bias=None, stride=self.stride, padding=self.padding,
                          dilation=self.dilation, groups=self.groups * batch_size)
        output = output.view(batch_size, self.out_planes, output.size(-2), output.size(-1))
        # apply the output-channel (filter) attention a_fi on the output feature map
        output = output * filter_attention
        return output

    def _forward_impl_pw1x(self, x):
        # 1x1 kernel with kernel_num == 1: spatial and kernel attentions degenerate to the scalar 1.0,
        # so a plain convolution with channel and filter attention on the feature maps is sufficient
        channel_attention, filter_attention, spatial_attention, kernel_attention = self.attention(x)
        x = x * channel_attention
        output = F.conv2d(x, weight=self.weight.squeeze(dim=0), bias=None, stride=self.stride, padding=self.padding,
                          dilation=self.dilation, groups=self.groups)
        output = output * filter_attention
        return output

    def forward(self, x):
        return self._forward_impl(x)
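
A minimal usage sketch (not from the original repository): ODConv2d takes the same basic arguments as nn.Conv2d and can be dropped in wherever a standard convolution is used; the temperature of its attention module can be annealed during training via update_temperature, in the spirit of the degradation strategy mentioned in Section 3.1. The schedule below (30 decayed to 1 over the first 10 epochs) is only an assumed example.

# Hypothetical usage sketch: ODConv2d as a drop-in replacement for nn.Conv2d.
conv = ODConv2d(in_planes=64, out_planes=128, kernel_size=3, stride=1, padding=1,
                reduction=0.0625, kernel_num=4)
x = torch.randn(8, 64, 32, 32)   # (batch, channels, height, width)
y = conv(x)                      # -> (8, 128, 32, 32)

# Assumed annealing schedule: decay the attention temperature from 30 to 1
# over the first 10 epochs, following the DyConv-style degradation strategy.
for epoch in range(10):
    conv.update_temperature(max(1.0, 30.0 - epoch * (29.0 / 9.0)))
    # ... run one training epoch here ...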

References:

Detailed explanation of ODConv

ICLR 2022 | A magic weapon to gain points! Intel proposes ODConv: plug-and-play dynamic convolution

Salute CondConv! Intel proposes plug-and-play “snake oil” dynamic convolution ODConv
