CoordConv: Add coordinates to your convolution




1. Theoretical introduction

1.1 Detailed explanation of CoordConv theory

This is a reproduction of a somewhat "archaeological" paper. When the Uber team proposed the CoordConv module in 2018, many commentators criticized it as too trivial to warrant a paper. Revisiting the idea today, however, and comparing it with the position encoding (Position Encoding) proposed in the Transformer, you may feel that history is a circle: adding two coordinate channels to a convolution serves essentially the same purpose as the Transformer's position encoding.
As is well known, the convolution operation in deep learning is translation-equivariant, which allows a single set of kernel parameters to be shared across every position of the image. The price is that, during learning, the convolution cannot perceive where the current feature lies in the image. The experiment in the paper (shown in the figure below) demonstrates this: because the kernel only performs local operations, a traditional convolution perceives local information but not position. CoordConv addresses this by appending channels to the convolution's input feature map that encode the pixel coordinates, so that the convolution can perceive position to some extent and thereby improve detection accuracy.



Traditional convolution struggles even to map between a spatial representation (coordinates in one-hot pixel space) and coordinates in Cartesian space. Because convolution is translation-equivariant, a filter does not know where it is as it is applied to the input. We can help the convolution by telling it where its filters are: add two channels to the input, one holding the i coordinate and the other holding the j coordinate of each pixel. With this addition we obtain a new convolution structure, CoordConv, whose structure is shown in the following figure:



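Before the full implementation, here is a minimal sketch of the core idea, assuming nothing beyond base Paddle (tensor names such as feat, xx, yy are illustrative only): build two coordinate grids normalized to [-1, 1] and concatenate them onto the input as extra channels.

import paddle

# A toy feature map: batch 1, 3 channels, 4x4 spatial size.
feat = paddle.randn([1, 3, 4, 4])
b, _, h, w = feat.shape

# x (column) coordinates, normalized to [-1, 1], one value per column.
xx = paddle.linspace(-1., 1., w).reshape([1, 1, 1, w]).expand([b, 1, h, w])
# y (row) coordinates, normalized to [-1, 1], one value per row.
yy = paddle.linspace(-1., 1., h).reshape([1, 1, h, 1]).expand([b, 1, h, w])

# Concatenate along the channel axis: 3 + 2 = 5 channels.
feat_with_coords = paddle.concat([feat, xx, yy], axis=1)
print(feat_with_coords.shape)  # [1, 5, 4, 4]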
2. Code implementation

This part reproduces CoordConv based on the CoordConv paper, with reference to the official PaddlePaddle implementation.

2.1 Import dependencies

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.nn import Conv2D

2.2 CoordConv class implementation

First inherit the nn.Layer base class; then, in forward, use paddle.arange to build the two coordinate grids gx and gy, exclude them from gradient backpropagation with stop_gradient = True, and finally concatenate them onto the input and feed the result to the convolution.

class CoordConv(nn.Layer):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super(CoordConv, self).__init__()
        # Two extra input channels carry the x and y coordinate grids.
        self.conv = Conv2D(
            in_channels + 2, out_channels, kernel_size, stride, padding)

    def forward(self, x):
        b = x.shape[0]
        h = x.shape[2]
        w = x.shape[3]

        # x (column) coordinates, normalized to [-1, 1] and broadcast to [b, 1, h, w].
        gx = paddle.arange(w, dtype='float32') / (w - 1.) * 2.0 - 1.
        gx = gx.reshape([1, 1, 1, w]).expand([b, 1, h, w])
        gx.stop_gradient = True

        # y (row) coordinates, normalized the same way.
        gy = paddle.arange(h, dtype='float32') / (h - 1.) * 2.0 - 1.
        gy = gy.reshape([1, 1, h, 1]).expand([b, 1, h, w])
        gy.stop_gradient = True

        # Concatenate the coordinate channels, then apply an ordinary convolution.
        y = paddle.concat([x, gx, gy], axis=1)
        y = self.conv(y)
        return y
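A quick sanity check of the layer (the sizes here are arbitrary, chosen to match the conv4 usage in the network below):

coord_conv = CoordConv(64, 64, (3, 3), stride=2, padding=1)
x = paddle.randn([4, 64, 7, 7])
y = coord_conv(x)
print(y.shape)  # [4, 64, 4, 4] -- same output shape as a plain 3x3, stride-2 conv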

class dcn2(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(dcn2, self).__init__()

        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=1, padding=1)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        # The last convolution is replaced with CoordConv.
        self.conv4 = CoordConv(64, 64, (3, 3), 2, 1)

        self.flatten = paddle.nn.Flatten()

        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.conv3(x)
        x = F.relu(x)

        x = self.conv4(x)
        x = F.relu(x)

        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

cnn3 = dcn2()

model3 = paddle.Model(cnn3)

model3.summary((64, 3, 32, 32))
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
   Conv2D-26     [[64, 3, 32, 32]]     [64, 32, 32, 32]         896
   Conv2D-27     [[64, 32, 32, 32]]    [64, 64, 15, 15]       18,496
   Conv2D-28     [[64, 64, 15, 15]]     [64, 64, 7, 7]        36,928
   Conv2D-31     [[64, 66, 7, 7]]       [64, 64, 4, 4]        38,080
  CoordConv-4    [[64, 64, 7, 7]]       [64, 64, 4, 4]           0
   Flatten-1     [[64, 64, 4, 4]]         [64, 1024]             0
   Linear-1        [[64, 1024]]            [64, 64]           65,600
   Linear-2         [[64, 64]]             [64, 1]               65
===========================================================================
Total params: 160,065
Trainable params: 160,065
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 26.09
Params size (MB): 0.61
Estimated Total Size (MB): 27.45
---------------------------------------------------------------------------

{'total_params': 160065, 'trainable_params': 160065}
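Comparing with the plain baseline below (Conv2D-4: 36,928 params), the convolution inside CoordConv (Conv2D-31: 38,080 params) carries exactly the extra weights contributed by the two coordinate channels. A quick check of the arithmetic:

# Extra weights from the 2 coordinate channels: 2 input channels x 64 filters x 3x3 kernel.
print(2 * 64 * 3 * 3)    # 1152
print(38080 - 36928)     # 1152 -- matches the difference between the two summaries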
class MyNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(MyNet, self).__init__()

        self.conv1 = paddle.nn.Conv2D(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=1, padding=1)
        self.conv2 = paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        self.conv3 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=0)
        self.conv4 = paddle.nn.Conv2D(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=2, padding=1)
        self.flatten = paddle.nn.Flatten()

        self.linear1 = paddle.nn.Linear(in_features=1024, out_features=64)
        self.linear2 = paddle.nn.Linear(in_features=64, out_features=num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.conv3(x)
        x = F.relu(x)
        x = self.conv4(x)
        x = F.relu(x)
        x = self.flatten(x)
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x
# Visualize the model

cnn1 = MyNet()

model1 = paddle.Model(cnn1)

model1.summary((64, 3, 32, 32))
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
   Conv2D-1      [[64, 3, 32, 32]]     [64, 32, 32, 32]         896
   Conv2D-2      [[64, 32, 32, 32]]    [64, 64, 15, 15]       18,496
   Conv2D-3      [[64, 64, 15, 15]]     [64, 64, 7, 7]        36,928
   Conv2D-4      [[64, 64, 7, 7]]       [64, 64, 4, 4]        36,928
   Flatten-1     [[64, 64, 4, 4]]         [64, 1024]             0
   Linear-1        [[64, 1024]]            [64, 64]           65,600
   Linear-2         [[64, 64]]             [64, 1]               65
===========================================================================
Total params: 158,913
Trainable params: 158,913
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.75
Forward/backward pass size (MB): 25.59
Params size (MB): 0.61
Estimated Total Size (MB): 26.95
---------------------------------------------------------------------------

{'total_params': 158913, 'trainable_params': 158913}

Summary

Through the previous tutorials, you should already have mastered how to quickly start training, so the following tutorials focus on the concrete code implementation and the related theory; comparison experiments are run only when necessary. This tutorial introduced the theory behind CoordConv, reproduced it, and showed how it is used inside a network structure. You can port it into your own network according to your actual needs.

  • Some points to note
  1. Place CoordConv as early (as far forward) in the network as possible.

  2. The most promising applications are CV tasks that are highly position-sensitive, such as pose estimation.
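
As a sketch of the first point, one might place CoordConv at the very front of a backbone. MyCoordNet below is a hypothetical example (not from the original article), reusing the CoordConv class defined above:

class MyCoordNet(paddle.nn.Layer):
    """Hypothetical variant: CoordConv at the very front of the network."""
    def __init__(self, num_classes=1):
        super(MyCoordNet, self).__init__()
        # Coordinate information enters at the first layer.
        self.conv1 = CoordConv(3, 32, (3, 3), 1, 1)
        self.conv2 = paddle.nn.Conv2D(32, 64, kernel_size=(3, 3), stride=2, padding=0)
        self.pool = paddle.nn.AdaptiveAvgPool2D(1)
        self.flatten = paddle.nn.Flatten()
        self.linear = paddle.nn.Linear(64, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = self.flatten(x)
        return self.linear(x)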