Reprint: TransXNet: A new CNN-Transformer visual backbone that aggregates global and local information, with powerful performance!

Article address: https://arxiv.org/abs/2310.19380
Project address: https://github.com/LMMMEng/TransXNet

00 | Introduction

Current situation:

Recent research integrates convolutions into transformers to introduce inductive bias and improve generalization performance.

(1) Traditional convolution is static and cannot adapt dynamically to the input, which creates a representation gap between convolution and self-attention, since self-attention computes its attention matrix dynamically from the input.

(2) When token mixers built from convolution and self-attention are stacked into a deep network, the static nature of convolution prevents features previously produced by self-attention from being fused into the convolution kernels. Together, these two limitations lead to networks with sub-optimal representation capability.

Solution:

A lightweight Dual Dynamic Token Mixer (D-Mixer) is proposed, which aggregates global information and local details in an input-dependent manner. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution to evenly split halves of the feature channels, giving the network a strong inductive bias and an enlarged effective receptive field. Using D-Mixer as the basic building block, TransXNet, a novel hybrid CNN-Transformer visual backbone, is designed.

On the ImageNet-1K image classification task, TransXNet-T exceeds Swin-T by 0.3% in top-1 accuracy while requiring less than half of its computational cost. In addition, TransXNet-S and TransXNet-B show excellent model scalability, achieving top-1 accuracies of 83.8% and 84.6%, respectively, at reasonable computational cost.

01 | Method


Figure 1 TransXNet Architecture

As shown in Figure 1, the proposed TransXNet adopts a four-stage hierarchical architecture. Each stage consists of a patch embedding layer and several sequentially stacked blocks. The first patch embedding layer is implemented with a 7×7 convolution (stride = 4) followed by batch normalization (BN), while the patch embedding layers of the remaining stages use a 3×3 convolution (stride = 2) and BN. Each block consists of a dynamic position encoding (DPE) layer, a Dual Dynamic Token Mixer (D-Mixer), and a multi-scale feed-forward network (MS-FFN).
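As a rough illustration, the stage layout described above can be sketched in PyTorch as follows. This is only a sketch: the DPE, D-Mixer, and MS-FFN modules are passed in as placeholders, and the normalization and residual placement inside a block are assumptions rather than details taken from the official code.

```python
import torch.nn as nn


def patch_embed(in_ch, out_ch, first_stage=False):
    """Patch embedding as described above: 7x7/stride-4 conv + BN in stage 1,
    3x3/stride-2 conv + BN in the later stages."""
    k, s = (7, 4) if first_stage else (3, 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=s, padding=k // 2),
        nn.BatchNorm2d(out_ch),
    )


class Block(nn.Module):
    """One block: DPE -> D-Mixer -> MS-FFN (residual/norm placement assumed)."""

    def __init__(self, dim, dpe, mixer, ffn):
        super().__init__()
        self.dpe = dpe        # dynamic position encoding (e.g., a depthwise conv)
        self.mixer = mixer    # dual dynamic token mixer (D-Mixer)
        self.ffn = ffn        # multi-scale feed-forward network (MS-FFN)
        self.norm1 = nn.BatchNorm2d(dim)
        self.norm2 = nn.BatchNorm2d(dim)

    def forward(self, x):     # x: (B, C, H, W)
        x = x + self.dpe(x)
        x = x + self.mixer(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x
```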

1) Dual Dynamic Token Mixer


Figure 2 D-Mixer workflow

To enhance the generalization ability of Transformer models by introducing inductive bias, many previous methods combine convolution with self-attention to build hybrid models. However, their static convolutions dilute the input dependence of the Transformer: although convolution naturally introduces inductive bias, its ability to improve the model's representation learning is limited. In this work, a lightweight token mixer called Dual Dynamic Token Mixer (D-Mixer) is proposed, which dynamically exploits global and local information, injecting a large effective receptive field (ERF) and a strong inductive bias into the network without compromising input dependency. The overall workflow of D-Mixer is shown in Figure 2(a). Specifically, a feature map X \in R^{C \times H \times W} is evenly split along the channel dimension into X_{1}, X_{2} \in R^{C/2 \times H \times W}. X_{1} and X_{2} are then fed to the global self-attention module OSRA and the input-dependent depthwise convolution IDConv, respectively, producing feature maps X'_{1}, X'_{2} \in R^{C/2 \times H \times W}, which are concatenated along the channel dimension into an output feature map X' \in R^{C \times H \times W}. Finally, a Squeezed Token Enhancer (STE) is employed for efficient local token aggregation. The D-Mixer can be expressed as:

X_{1}, X_{2} = Split(X),
X' = Concat(OSRA(X_{1}), IDConv(X_{2})),
Y = STE(X')

By stacking D-Mixers, the dynamic feature-aggregation weights generated by OSRA and IDConv take both global and local information into account, encapsulating strong representation learning capability in the model.
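The mixing rule above translates almost directly into code. The following minimal PyTorch sketch assumes OSRA, IDConv, and STE modules like the ones discussed in the next paragraphs (built for C/2, C/2, and C channels, respectively):

```python
import torch
import torch.nn as nn


class DMixer(nn.Module):
    """Sketch of D-Mixer: split channels in half, mix one half globally (OSRA)
    and the other locally (IDConv), concatenate, then apply STE."""

    def __init__(self, osra, idconv, ste):
        super().__init__()
        self.osra = osra      # global attention branch, operates on C/2 channels
        self.idconv = idconv  # input-dependent depthwise conv branch, C/2 channels
        self.ste = ste        # squeezed token enhancer, operates on C channels

    def forward(self, x):                      # x: (B, C, H, W)
        x1, x2 = torch.chunk(x, 2, dim=1)      # even split along the channel dim
        y = torch.cat([self.osra(x1), self.idconv(x2)], dim=1)
        return self.ste(y)
```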

Input-Dependent Depthwise Convolution (IDConv): To inject inductive bias and perform local feature aggregation in a dynamic, input-dependent manner, a new dynamic depthwise convolution is proposed, called Input-Dependent Depthwise Convolution (IDConv). As shown in Figure 2(b), given an input feature map X \in R^{C \times H \times W}, adaptive average pooling is used to aggregate spatial context and compress the spatial dimensions to K × K. The result is then passed through two sequential 1×1 convolutions to obtain attention maps A' \in R^{(G \times C) \times K^{2}}, where G denotes the number of attention groups. A' is reshaped into R^{G \times C \times K^{2}}, and a softmax over the G dimension produces attention weights A \in R^{G \times C \times K^{2}}. Finally, A is multiplied element-wise with a set of learnable parameters P \in R^{G \times C \times K^{2}}, and the result is summed over the G dimension to obtain the input-dependent depthwise convolution kernel W \in R^{C \times K^{2}}, which can be expressed as:

W = \sum_{g=1}^{G} A_{g} \odot P_{g}

Since different inputs produce different attention maps A, the convolution kernel W varies with the input. IDConv generates a spatially varying attention map for each attention group, whose spatial dimension (K × K) exactly matches that of the convolution kernel, whereas DyConv only generates a single scalar attention weight per group. IDConv therefore supports more dynamic local feature encoding. Compared with the recently proposed dynamic depthwise convolution (D-DWConv), IDConv combines dynamic attention maps with static learnable parameters, greatly reducing computational overhead. It is worth noting that D-DWConv applies global average pooling followed by channel-squeeze and channel-expansion pointwise convolutions to the input features, producing an output of dimension (C \times K^{2}) \times 1 \times 1 that is then reshaped to match the depthwise convolution kernel. This requires \frac{C^{2}}{r}(K^{2} + 1) parameters, while IDConv requires \frac{C^{2}}{r}(G + 1) + GCK^{2} parameters. In practice, with G at most 4 and r and K set to 4 and 7 respectively, IDConv needs 1.25C^{2} + 196C parameters, far fewer than the 12.5C^{2} of D-DWConv.
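The kernel-generation procedure described above can be sketched as follows. This is an illustrative PyTorch implementation under stated assumptions: the ReLU between the two 1×1 convolutions, the parameter name `reduction`, and the per-sample grouped-convolution trick for applying the dynamic kernel are choices made here, not necessarily those of the official code. The defaults `groups=4`, `reduction=4`, `kernel_size=7` follow the practical values quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IDConv(nn.Module):
    """Sketch of input-dependent depthwise convolution."""

    def __init__(self, dim, kernel_size=7, groups=4, reduction=4):
        super().__init__()
        self.dim, self.k, self.g = dim, kernel_size, groups
        hidden = dim // reduction
        self.pool = nn.AdaptiveAvgPool2d(kernel_size)      # spatial context -> K x K
        self.attn = nn.Sequential(                          # two sequential 1x1 convs
            nn.Conv2d(dim, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, groups * dim, 1),
        )
        # static learnable parameter bank P with shape (G, C, K, K)
        self.weight = nn.Parameter(
            torch.randn(groups, dim, kernel_size, kernel_size) * 0.02)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        a = self.attn(self.pool(x))                         # (B, G*C, K, K)
        a = a.view(b, self.g, c, self.k, self.k).softmax(dim=1)
        kernel = (a * self.weight.unsqueeze(0)).sum(dim=1)  # (B, C, K, K)
        # apply the per-sample depthwise kernel via a grouped-convolution trick
        kernel = kernel.reshape(b * c, 1, self.k, self.k)
        out = F.conv2d(x.reshape(1, b * c, h, w), kernel,
                       padding=self.k // 2, groups=b * c)
        return out.reshape(b, c, h, w)
```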

Overlapping Spatial Reduction Attention (OSRA): Spatial Reduction Attention (SRA) has been widely used in previous work to efficiently extract global information from sparsely tokenized region relationships. However, its non-overlapping spatial reduction, used to cut down the token count, breaks the spatial structure near patch boundaries and degrades token quality. To address this, Overlapping Spatial Reduction (OSR) is introduced into SRA, which better preserves the spatial structure near patch boundaries by using larger, overlapping patches. In practice, OSR is instantiated as a depthwise separable convolution whose stride follows PVT and whose kernel size equals the stride plus 3. Figure 2(c) depicts the OSRA pipeline, which can also be expressed as:

Y = OSR(X),
Q = X W_{Q}, K = Y W_{K}, V = Y W_{V},
OSRA(X) = Softmax(Q K^{T} / \sqrt{d}) V
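A possible instantiation of OSRA is sketched below. It follows the description above (queries from full-resolution tokens, keys/values from tokens reduced by an overlapping depthwise separable convolution with kernel size = stride + 3), but it uses plain multi-head attention via `nn.MultiheadAttention` and omits extras such as relative position bias, so it should be read as an approximation rather than the paper's exact module. The `sr_ratio` name for the PVT-style stride is an assumption.

```python
import torch
import torch.nn as nn


class OSRA(nn.Module):
    """Sketch of Overlapping Spatial Reduction Attention."""

    def __init__(self, dim, num_heads=2, sr_ratio=4):
        super().__init__()
        k = sr_ratio + 3                       # kernel size = stride + 3 (overlapping)
        self.osr = nn.Sequential(              # overlapping spatial reduction (DW-separable conv)
            nn.Conv2d(dim, dim, kernel_size=k, stride=sr_ratio,
                      padding=k // 2, groups=dim),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.BatchNorm2d(dim),
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)              # (B, H*W, C) queries
        kv = self.osr(x).flatten(2).transpose(1, 2)   # reduced, overlapping keys/values
        out, _ = self.attn(q, kv, kv, need_weights=False)
        return out.transpose(1, 2).reshape(b, c, h, w)
```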

Squeezed Token Enhancer (STE): After token mixing, most previous methods use a 1×1 convolution for cross-channel communication, which incurs considerable computational overhead. To reduce this cost without hurting performance, a lightweight Squeezed Token Enhancer (STE) is proposed, as shown in Figure 2(d). STE consists of a 3×3 depthwise convolution to enhance local relationships, channel-squeeze and channel-expansion 1×1 convolutions to reduce computation, and a residual connection to preserve representation capability. STE can be expressed as:

STE(X) = Conv_{1\times 1}(Conv_{1\times 1}(DWConv_{3\times 3}(X))) + X,

where the inner 1×1 convolution squeezes the channels from C to C/r and the outer one expands them back to C.

From the formula above, the FLOPs of STE are HWC(2C/r + 9). In practice, the channel reduction ratio r is set to 8 while ensuring that the number of squeezed channels is no less than 16, so the resulting FLOPs are clearly lower than those of a plain 1×1 convolution, namely HWC^{2}.
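Putting the pieces together, a minimal STE sketch looks as follows; the `max(dim // reduction, 16)` clamp implements the "at least 16 squeezed channels" rule mentioned above.

```python
import torch.nn as nn


class STE(nn.Module):
    """Sketch of the Squeezed Token Enhancer: 3x3 depthwise conv for local
    relations, 1x1 squeeze/expand convs, and a residual connection."""

    def __init__(self, dim, reduction=8):
        super().__init__()
        hidden = max(dim // reduction, 16)    # keep at least 16 squeezed channels
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.squeeze = nn.Conv2d(dim, hidden, kernel_size=1)
        self.expand = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x):                     # x: (B, C, H, W)
        return x + self.expand(self.squeeze(self.dw(x)))
```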

2) Multi-scale Feed-forward Network (MS-FFN)

Picture

Figure 3 Different FFN

As shown in Figure 3, instead of a single 3×3 depthwise convolution, MS-FFN uses four parallel depthwise convolutions of different scales, each processing a quarter of the channels. The depthwise kernels of sizes {3, 5, 7} effectively capture multi-scale information, while the 1×1 depthwise kernel is in effect a learnable channel-wise scaling factor.
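A minimal MS-FFN sketch under stated assumptions is shown below: the 1×1 expansion and projection convolutions, the GELU activations, and the default expansion ratio of 4 are assumptions, while the four parallel depthwise convolutions with kernel sizes {1, 3, 5, 7}, each processing a quarter of the hidden channels, follow the description above.

```python
import torch
import torch.nn as nn


class MSFFN(nn.Module):
    """Sketch of the multi-scale feed-forward network."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        chunk = hidden // 4
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)   # channel expansion
        self.act = nn.GELU()
        self.dwconvs = nn.ModuleList([                      # four parallel depthwise convs
            nn.Conv2d(chunk, chunk, kernel_size=k, padding=k // 2, groups=chunk)
            for k in (1, 3, 5, 7)
        ])
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)    # channel projection

    def forward(self, x):                                   # x: (B, C, H, W)
        x = self.act(self.fc1(x))
        xs = torch.chunk(x, 4, dim=1)                        # a quarter of channels each
        x = torch.cat([conv(xi) for conv, xi in zip(self.dwconvs, xs)], dim=1)
        return self.fc2(self.act(x))
```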

3) Architecture variants

TransXNet comes in three variants: TransXNet-T (Tiny), TransXNet-S (Small), and TransXNet-B (Base). To control the computational cost of the different variants, two further hyperparameters are tunable in addition to the number of channels and the number of blocks. First, since the computational cost of IDConv is directly related to the number of attention groups, a different number of attention groups is used in IDConv for each variant. In the Tiny version, the number of attention groups is fixed at 2 to keep the computational cost reasonable, while the deeper Small and Base models use progressively more attention groups to increase the flexibility of IDConv, similar to how the number of heads in an MHSA module grows with model depth. Furthermore, the expansion ratio is gradually increased across the architecture variants. Table 1 lists the details of the different variants.


02 | Experimental results

Image Classification


Quantitative performance comparison on the ImageNet-1K dataset with 224×224 input

TransXNet-T achieves an impressive top-1 accuracy of 81.6% with only 1.8 GFLOPs and 12.8M parameters, significantly surpassing other methods; although its computational cost is less than half that of Swin-T [4], its top-1 accuracy is 0.3% higher. Second, TransXNet-S reaches 83.8% top-1 accuracy without requiring a dedicated CUDA implementation, 0.2% higher than InternImage-T. Furthermore, TransXNet outperforms well-known hybrid models, including MixFormer and MaxViT, at lower computational cost. The Small model even outperforms MixFormer-B5, whose parameter count exceeds that of the TransXNet Base model.

In image classification, the improvement of TransXNet-S over CMT-S appears limited, presumably because CMT uses a more complex classification head to boost performance.

Object detection and instance segmentation

Picture

For RetinaNet object detection, this method achieves the best performance among the compared competitors. Previous methods often failed to perform well on both small and large objects at the same time. With the support of global and local dynamics and multi-scale token aggregation, this method not only achieves excellent results on small objects but also significantly outperforms previous methods on medium and large objects.

For instance segmentation with Mask R-CNN, this method shows clear advantages over previous methods at comparable computational cost. Although TransXNet-S offers only a limited improvement over CMT-S on ImageNet-1K classification, it achieves significant gains in object detection and instance segmentation, indicating that the backbone has stronger representation capability and better transferability.

Reference

https://arxiv.org/abs/2310.19380
