Paper address: Omni-Dimensional Dynamic Convolution | OpenReview
Code address: https://github.com/OSVAI/ODConv/blob/main/modules/odconv.py
1. What is it?
ODConv (Omni-Dimensional Dynamic Convolution) is a dynamic convolution method. Its principle is to adjust the weights of the convolution kernel dynamically according to the characteristics of the input, so that the convolution adapts to different input data. Specifically, ODConv improves the performance of convolutional neural networks by introducing a learnable multi-dimensional attention module that computes input-dependent attention values for the kernel weights. Different from CondConv and DyConv, which only weight the n kernels as a whole, ODConv also applies attention along the spatial dimension, the input channel dimension and the output channel dimension of each kernel, so it can adapt better to different input data.
2. Why?
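Conventional convolution has a single static convolution kernel that is independent of the input sample. Dynamic convolution instead forms a linear combination of n convolution kernels whose weights are computed from the input, which makes the convolution input-dependent. It can be described as follows:
$$y = (\alpha_{w1} W_1 + \dots + \alpha_{wn} W_n) * x$$
where $x$ is the input, $y$ is the output, $W_i$ is the i-th convolution kernel, $\alpha_{wi}$ is its input-dependent attention scalar, and $*$ denotes convolution.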
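The weighted combination of kernels can be illustrated with a minimal sketch. It uses plain Python and a 1-D convolution instead of PyTorch so that it runs without dependencies; the helper names `conv1d_valid` and `dynamic_conv1d` and the toy numbers are assumptions made for illustration, not part of CondConv, DyConv or ODConv.

```python
def conv1d_valid(x, w):
    """Plain 1-D cross-correlation without padding."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def dynamic_conv1d(x, kernels, alphas):
    """Aggregate the n kernels with the attention scalars, then convolve once."""
    k = len(kernels[0])
    aggregated = [sum(a * w[j] for a, w in zip(alphas, kernels)) for j in range(k)]
    return conv1d_valid(x, aggregated)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # n = 2 toy kernels
alphas = [0.25, 0.75]  # in a real layer these come from the attention function

y_aggregated = dynamic_conv1d(x, kernels, alphas)

# Equivalent but more expensive: convolve with every kernel, then mix the outputs.
per_kernel = [conv1d_valid(x, w) for w in kernels]
y_mixed = [sum(a * y[i] for a, y in zip(alphas, per_kernel))
           for i in range(len(per_kernel[0]))]
```

Because convolution is linear in the kernel, both routes give the same output; aggregating the kernels first needs only one convolution per sample, which is why dynamic convolution adds parameters but comparatively little computation.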
Although the definition of dynamic convolution is very simple, the implementations of CondConv and DyConv differ, mainly in how the attention is computed, in the training strategy, and in which layers are replaced with dynamic convolution. These implementation differences lead to different model accuracy, model size and inference efficiency.
- Both adopt an SE-style architecture to compute the attention, but CondConv uses Sigmoid, while DyConv uses Softmax;
- DyConv uses a temperature annealing strategy during training to suppress the near one-hot output of Softmax;
- Regarding the CNN layers they are embedded into, CondConv replaces the convolutional layers in the last few blocks and the final fully connected layer, while DyConv replaces all convolutional layers except the first one.
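The difference between the two normalizations, and the effect of a temperature on Softmax, can be seen in a small sketch. The logits and the temperature value 30 below are illustrative assumptions, not values taken from this article.

```python
import math


def sigmoid(logits, temperature=1.0):
    """Independent gates in (0, 1); they do not have to sum to 1."""
    return [1.0 / (1.0 + math.exp(-z / temperature)) for z in logits]


def softmax(logits, temperature=1.0):
    """Normalized weights that sum to 1; a larger temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0, -1.0]              # hypothetical attention logits for n = 4 kernels
sharp = softmax(logits, temperature=1.0)    # close to one-hot
flat = softmax(logits, temperature=30.0)    # close to uniform
gates = sigmoid(logits)                     # Sigmoid-style gates, not normalized
```

With a temperature of 1 almost all weight goes to the first kernel, so the other kernels receive very little gradient early in training. A high temperature spreads the weights across all kernels, which is the motivation for starting hot and annealing towards 1.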
According to the formula of dynamic convolution, dynamic convolution has two basic elements:
- The convolution kernels $W_1, \dots, W_n$;
- The attention function $\pi_{wi}(x)$ used to compute the attention scalars $\alpha_{wi}$.
Given n convolution kernels, the corresponding kernel space has the following four dimensions:
- Spatial kernel size k×k;
- Number of input channels $c_{in}$;
- Number of output channels $c_{out}$;
- Number of convolution kernels n.
However, CondConv and DyConv both assign a single attention scalar $\alpha_{wi}$ to each kernel $W_i$, which means that all of its output filters $W_i^m \in \mathbb{R}^{k \times k \times c_{in}}$, $m = 1, \dots, c_{out}$, share the same attention value for the input $x$. In other words, the spatial dimension, the input channel dimension and the output channel dimension of the kernel $W_i$ are all ignored by CondConv and DyConv. This leads to a coarse exploration of the kernel space, and may be the reason why CondConv and DyConv bring smaller performance gains on large networks.
In addition, compared with conventional convolution, dynamic convolution has n times as many kernel parameters. For example, n=8 in CondConv and n=4 in DyConv. When dynamic convolution is used in many layers, it greatly increases the model size. The paper reports that when the attention mechanism in CondConv/DyConv is removed (i.e. $\alpha_{wi} = 1$), the performance improvement is close to zero. For example, for ResNet18 the accuracy gain drops from 1.78%/2.51% to 0.08%/0.14%.
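To make the n-fold growth concrete, the following sketch counts kernel weights for a single layer. The layer size (64 input channels, 64 output channels, 3×3 kernel) is an arbitrary example, and `conv_params` is a hypothetical helper that ignores biases and the parameters of the attention branch.

```python
def conv_params(c_in, c_out, k, n=1, groups=1):
    """Kernel weights of n parallel k x k kernels (bias and attention branch excluded)."""
    return n * c_out * (c_in // groups) * k * k


static = conv_params(64, 64, 3)              # conventional convolution
condconv_like = conv_params(64, 64, 3, n=8)  # n = 8 as quoted for CondConv
dyconv_like = conv_params(64, 64, 3, n=4)    # n = 4 as quoted for DyConv

print(static, condconv_like, dyconv_like)
```

The kernel weights grow linearly with n, which is why a design that reaches the same accuracy with fewer kernels directly reduces model size.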
These findings suggest that the attention mechanism plays a key role in dynamic convolution, and that a more effective attention design may strike a better balance between model accuracy and size.
To a certain extent, ODConv can be regarded as a continuation of CondConv: it extends the dynamic property of CondConv from one dimension (the kernel number) to the spatial, input channel and output channel dimensions as well, hence the name omni-dimensional dynamic convolution. ODConv uses a multi-dimensional attention mechanism with a parallel strategy to learn complementary attentions along all four dimensions of the kernel space. As a “plug and play” operation, it can be easily embedded into existing CNN architectures. Experiments on ImageNet classification and COCO detection verify the effectiveness of ODConv: it improves the performance of both large models and lightweight models. It is indeed a panacea! It is worth mentioning that, thanks to its improved feature extraction ability, ODConv with even a single convolution kernel can achieve performance comparable to or better than existing dynamic convolutions that use multiple kernels.
3. How does it work?
3.1 Network structure
Based on the discussion above, ODConv introduces a multi-dimensional attention mechanism with a parallel strategy to learn more flexible attentions along the four dimensions of the kernel space. The paper's comparison figure illustrates the differences between CondConv, DyConv and ODConv.
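Continuing the definition of dynamic convolution, ODConv can be described as follows:
$$y = (\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \dots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n) * x$$
Here $\alpha_{wi}$ is the attention scalar for the convolution kernel $W_i$, and $\alpha_{si}$, $\alpha_{ci}$, $\alpha_{fi}$ are the three newly introduced attentions along the spatial dimension, the input channel dimension and the output channel dimension of the kernel, respectively; $\odot$ denotes multiplication along the corresponding dimension of the kernel space. These four attentions are computed by a multi-head attention module $\pi_i(x)$.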
In ODConv, for the convolution kernel $W_i$:
- $\alpha_{si}$ assigns different attention values to the convolution parameters at the k×k spatial positions (panel a of the paper's figure);
- $\alpha_{ci}$ assigns different attention values to the input channels of each convolution filter (panel b);
- $\alpha_{fi}$ assigns different attention values to the convolution filters of different output channels (panel c);
- $\alpha_{wi}$ assigns one attention value to each of the n whole convolution kernels (panel d).
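How the four attentions act on the kernel weights can be sketched with nested lists. This is a simplified illustration rather than the reference implementation: `modulate_kernels` is a hypothetical helper, the toy sizes are arbitrary, and the spatial, channel and filter attentions are shared by all n kernels, as in the code in section 3.2.

```python
from itertools import product


def modulate_kernels(weights, a_spatial, a_channel, a_filter, a_kernel):
    """Combine n kernels into one, weighting every element by four attentions.

    weights[i][m][c][p]: kernel i, output filter m, input channel c,
    flattened spatial position p (p runs over the k*k locations).
    """
    n, c_out = len(weights), len(weights[0])
    c_in, kk = len(weights[0][0]), len(weights[0][0][0])
    out = [[[0.0] * kk for _ in range(c_in)] for _ in range(c_out)]
    for i, m, c, p in product(range(n), range(c_out), range(c_in), range(kk)):
        out[m][c][p] += (a_kernel[i] * a_filter[m] * a_channel[c]
                         * a_spatial[p] * weights[i][m][c][p])
    return out


# Toy sizes: n = 2 kernels, c_out = 2, c_in = 2, k*k = 1 (a 1x1 kernel), all weights 1.
weights = [[[[1.0], [1.0]], [[1.0], [1.0]]],
           [[[1.0], [1.0]], [[1.0], [1.0]]]]
combined = modulate_kernels(
    weights,
    a_spatial=[1.0],
    a_channel=[1.0, 0.5],
    a_filter=[1.0, 2.0],
    a_kernel=[0.25, 0.75],
)
```

Each element of the combined kernel is scaled by one value from every attention, so a single scalar per kernel (the CondConv/DyConv case) corresponds to setting the spatial, channel and filter attentions to all ones.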
In principle, these four types of attention are complementary. Progressively multiplying the convolution kernel by different attentions along the position, channel, filter and kernel dimensions makes the convolution operation differ with respect to every dimension of the input, which helps it capture rich contextual information. Therefore, ODConv can greatly improve the feature extraction capability of convolution; more importantly, ODConv with fewer convolution kernels can achieve performance equal to or better than CondConv and DyConv.
Comparing the two dynamic convolution formulas, we can see that ODConv is a more general form of dynamic convolution. In addition, when n=1 and all components of $\alpha_{s1}$, $\alpha_{c1}$ and $\alpha_{w1}$ are set to 1, ODConv reduces to filter-level attention $\alpha_{f1}$ only: the convolution filters are modulated based on the input and then the convolution is performed, similar to SE. SE can therefore be viewed as a special case of ODConv.
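Written out, the reduction looks as follows. The notation is an assumption made for this sketch: $\odot$ stands for multiplication along the matching dimension of the kernel, $*$ for convolution, and $\mathbf{1}$ for an all-ones attention.

```latex
y = (\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1) * x
\quad \xrightarrow{\ \alpha_{w1} = 1,\ \alpha_{c1} = \mathbf{1},\ \alpha_{s1} = \mathbf{1}\ } \quad
y = (\alpha_{f1} \odot W_1) * x
```

Since $\alpha_{f1}$ holds one value per output filter, scaling the filters before the convolution gives the same result as scaling the output channels afterwards, which is what an SE-style gate does.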
So how are the four types of attention values computed? Following CondConv and DyConv, ODConv also uses an SE-style attention module, but gives it multiple heads to compute the different types of attention. The overall structure is shown in the paper's figure. Specifically, the input $x$ is first squeezed by global average pooling (GAP) into a feature vector of length $c_{in}$, which then passes through an FC layer and four heads to generate the different types of attention values. The output dimensions of the four heads are k×k, $c_{in}$×1, $c_{out}$×1 and n×1.
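The sizes involved can be tabulated with a small helper. `attention_head_sizes` is a hypothetical function written for this sketch; the defaults `reduction=0.0625` and `min_channel=16` mirror the `Attention` class in section 3.2, and the example layer sizes are arbitrary.

```python
def attention_head_sizes(c_in, c_out, k, n, reduction=0.0625, min_channel=16):
    """Output sizes of the shared FC layer and of the four attention heads."""
    hidden = max(int(c_in * reduction), min_channel)  # width after the shared FC layer
    return {
        "hidden": hidden,
        "spatial": k * k,   # one value per kernel position
        "channel": c_in,    # one value per input channel
        "filter": c_out,    # one value per output filter
        "kernel": n,        # one value per candidate kernel
    }


small = attention_head_sizes(c_in=64, c_out=128, k=3, n=4)
large = attention_head_sizes(c_in=512, c_out=512, k=3, n=4)
```

For the 64-channel layer the reduced width would be 4, so the `min_channel` floor of 16 applies; for the 512-channel layer the reduction ratio gives 32.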
In terms of training, ODConv adopts the temperature annealing strategy from DyConv to facilitate training. In terms of architecture embedding, it follows DyConv and replaces all convolutions except the first one.
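The strategy borrowed from DyConv lowers the attention temperature step by step during training. The article does not state the schedule, so the sketch below is only an assumption: a linear decay from 30 to 1 over the first 10 epochs, with `linear_temperature` as a hypothetical helper. The only name taken from the code in section 3.2 is the `update_temperature` method.

```python
def linear_temperature(epoch, start=30.0, end=1.0, anneal_epochs=10):
    """Hypothetical schedule: decay linearly from `start` to `end`, then stay at `end`."""
    if epoch >= anneal_epochs:
        return end
    return start - (start - end) * epoch / anneal_epochs


schedule = [linear_temperature(e) for e in range(12)]

# In a training loop one would then call, once per epoch and for every ODConv2d
# module `m` in the model:
#     m.update_temperature(linear_temperature(epoch))
```

A high starting temperature keeps the attentions close to uniform so that all kernels are trained early on; by the end of the annealing phase the temperature is 1 and the attention functions behave as at inference time.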
3.2 Code Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd


class Attention(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, groups=1, reduction=0.0625, kernel_num=4, min_channel=16):
        super(Attention, self).__init__()
        # Width of the shared bottleneck after GAP, with a lower bound of min_channel
        attention_channel = max(int(in_planes * reduction), min_channel)
        self.kernel_size = kernel_size
        self.kernel_num = kernel_num
        self.temperature = 1.0

        # Shared trunk: GAP -> 1x1 conv (acts as FC) -> BN -> ReLU
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(in_planes, attention_channel, 1, bias=False)
        self.bn = nn.BatchNorm2d(attention_channel)
        self.relu = nn.ReLU(inplace=True)

        # Head 1: input channel attention
        self.channel_fc = nn.Conv2d(attention_channel, in_planes, 1, bias=True)
        self.func_channel = self.get_channel_attention

        # Head 2: filter (output channel) attention
        if in_planes == groups and in_planes == out_planes:  # depth-wise convolution
            self.func_filter = self.skip
        else:
            self.filter_fc = nn.Conv2d(attention_channel, out_planes, 1, bias=True)
            self.func_filter = self.get_filter_attention

        # Head 3: spatial attention
        if kernel_size == 1:  # point-wise convolution
            self.func_spatial = self.skip
        else:
            self.spatial_fc = nn.Conv2d(attention_channel, kernel_size * kernel_size, 1, bias=True)
            self.func_spatial = self.get_spatial_attention

        # Head 4: kernel attention
        if kernel_num == 1:
            self.func_kernel = self.skip
        else:
            self.kernel_fc = nn.Conv2d(attention_channel, kernel_num, 1, bias=True)
            self.func_kernel = self.get_kernel_attention

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            if isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def update_temperature(self, temperature):
        self.temperature = temperature

    @staticmethod
    def skip(_):
        # Attention is disabled for this dimension: multiply by 1
        return 1.0

    def get_channel_attention(self, x):
        # Shape: (batch, in_planes, 1, 1)
        channel_attention = torch.sigmoid(self.channel_fc(x).view(x.size(0), -1, 1, 1) / self.temperature)
        return channel_attention

    def get_filter_attention(self, x):
        # Shape: (batch, out_planes, 1, 1)
        filter_attention = torch.sigmoid(self.filter_fc(x).view(x.size(0), -1, 1, 1) / self.temperature)
        return filter_attention

    def get_spatial_attention(self, x):
        # Shape: (batch, 1, 1, 1, k, k)
        spatial_attention = self.spatial_fc(x).view(x.size(0), 1, 1, 1, self.kernel_size, self.kernel_size)
        spatial_attention = torch.sigmoid(spatial_attention / self.temperature)
        return spatial_attention

    def get_kernel_attention(self, x):
        # Shape: (batch, kernel_num, 1, 1, 1, 1), normalized over the kernels
        kernel_attention = self.kernel_fc(x).view(x.size(0), -1, 1, 1, 1, 1)
        kernel_attention = F.softmax(kernel_attention / self.temperature, dim=1)
        return kernel_attention

    def forward(self, x):
        x = self.avgpool(x)
        x = self.fc(x)
        x = self.bn(x)
        x = self.relu(x)
        return self.func_channel(x), self.func_filter(x), self.func_spatial(x), self.func_kernel(x)


class ODConv2d(nn.Module):
    def __init__(self, in_planes, out_planes, kernel_size, stride=1, padding=0, dilation=1, groups=1,
                 reduction=0.0625, kernel_num=4):
        super(ODConv2d, self).__init__()
        self.in_planes = in_planes
        self.out_planes = out_planes
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.kernel_num = kernel_num
        self.attention = Attention(in_planes, out_planes, kernel_size, groups=groups,
                                   reduction=reduction, kernel_num=kernel_num)
        # kernel_num candidate kernels, each of shape (out_planes, in_planes // groups, k, k)
        self.weight = nn.Parameter(torch.randn(kernel_num, out_planes, in_planes//groups, kernel_size, kernel_size),
                                   requires_grad=True)
        self._initialize_weights()

        if self.kernel_size == 1 and self.kernel_num == 1:
            self._forward_impl = self._forward_impl_pw1x
        else:
            self._forward_impl = self._forward_impl_common

    def _initialize_weights(self):
        for i in range(self.kernel_num):
            nn.init.kaiming_normal_(self.weight[i], mode='fan_out', nonlinearity='relu')

    def update_temperature(self, temperature):
        self.attention.update_temperature(temperature)

    def _forward_impl_common(self, x):
        # Multiplying channel attention (or filter attention) to weights and feature maps are equivalent,
        # while we observe that when using the latter method the models will run faster with less gpu memory cost.
        channel_attention, filter_attention, spatial_attention, kernel_attention = self.attention(x)
        batch_size, in_planes, height, width = x.size()
        x = x * channel_attention
        # Fold the batch into the channel dimension so that one grouped convolution
        # can apply a different aggregated kernel to every sample
        x = x.reshape(1, -1, height, width)
        aggregate_weight = spatial_attention * kernel_attention * self.weight.unsqueeze(dim=0)
        aggregate_weight = torch.sum(aggregate_weight, dim=1).view(
            [-1, self.in_planes // self.groups, self.kernel_size, self.kernel_size])
        output = F.conv2d(x, weight=aggregate_weight, bias=None, stride=self.stride, padding=self.padding,
                          dilation=self.dilation, groups=self.groups * batch_size)
        output = output.view(batch_size, self.out_planes, output.size(-2), output.size(-1))
        output = output * filter_attention
        return output

    def _forward_impl_pw1x(self, x):
        # 1x1 convolution with a single kernel: only channel and filter attention are active
        channel_attention, filter_attention, spatial_attention, kernel_attention = self.attention(x)
        x = x * channel_attention
        output = F.conv2d(x, weight=self.weight.squeeze(dim=0), bias=None, stride=self.stride, padding=self.padding,
                          dilation=self.dilation, groups=self.groups)
        output = output * filter_attention
        return output

    def forward(self, x):
        return self._forward_impl(x)
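The comment in `_forward_impl_common` states that multiplying the channel and filter attentions into the weights or into the feature maps is equivalent. A dependency-free sketch can check this for a 1×1 convolution at a single pixel, which is just a matrix-vector product; `pointwise_conv` and the numbers are illustrative assumptions.

```python
def pointwise_conv(x, w):
    """1x1 convolution at a single pixel: x has c_in values, w is c_out x c_in."""
    return [sum(w_mc * x_c for w_mc, x_c in zip(row, x)) for row in w]


x = [1.0, 2.0, 3.0]                          # c_in = 3
w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]     # c_out = 2
a_channel = [0.2, 0.9, 0.5]                  # one value per input channel
a_filter = [0.7, 0.3]                        # one value per output filter

# Option A: fold both attentions into the weights, then convolve.
w_mod = [[a_f * a_c * w_mc for a_c, w_mc in zip(a_channel, row)]
         for a_f, row in zip(a_filter, w)]
y_weights = pointwise_conv(x, w_mod)

# Option B: scale the input, convolve with the raw weights, scale the output.
x_scaled = [a_c * x_c for a_c, x_c in zip(a_channel, x)]
y_features = [a_f * y for a_f, y in zip(a_filter, pointwise_conv(x_scaled, w))]
```

Both options give the same result up to floating-point rounding. The code above takes option B for the channel and filter attentions, which avoids building a separately scaled copy of the weights for these two dimensions.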