How to make ResNet outperform EfficientNet? Just improve the training methods and scaling strategies


Author: Edison_G

Architecture, training methods, and scaling strategies are all essential factors affecting model performance, yet current research focuses mainly on architectural changes. A new study from Google Brain and UC Berkeley revisits the ResNet architecture and finds that improved training and scaling strategies may matter more than architectural changes for model performance. The authors propose a ResNet variant, ResNet-RS, which not only uses less memory but also trains several times faster than EfficientNet on both TPUs and GPUs.

This article is reproduced from the Machine Heart report

The performance of a vision model is a combination of architecture, training methods, and scaling strategies. However, research often emphasizes only architectural changes. New architectures underlie many advances, but they are usually accompanied by changes to training methods and hyperparameters, details that are critical but rarely made public. Additionally, new architectures improved with modern training methods are sometimes compared against older architectures trained with outdated methods, such as the ResNet-50 baseline with 76.5% Top-1 accuracy on ImageNet.

What impact do training methods and scaling strategies have on the popular ResNet architecture? Recently, researchers from Google Brain and UC Berkeley gave their answer.


Paper link: https://arxiv.org/pdf/2103.07579.pdf

The researchers investigated modern training and regularization methods that are widely used today and applied them to ResNet, as shown in Figure 1 below. In the process, they observed interactions between training methods and demonstrated the benefit of reducing weight decay when it is combined with other regularization methods. The training-method experiments in Table 1 below also reveal the significant impact of these strategies: the ImageNet Top-1 accuracy of the standard ResNet architecture increased from 79.0% to 82.2% (+3.2%) just by improving the training method. Two small, commonly used architectural improvements, ResNet-D and Squeeze-and-Excitation, further raise accuracy to 83.4%. Figure 1 depicts the optimization process of the ResNet architecture via the speed-accuracy Pareto curve:

[Figure 1: speed-accuracy Pareto curve of the ResNet optimization process]

The researchers also offer new ideas and practical suggestions for scaling vision architectures. Unlike previous studies that inferred scaling strategies from small models or from training for a small number of epochs, this study designed scaling strategies based on full training runs of models of different sizes (e.g., 350 epochs instead of 10), and found a strong dependence of the optimal scaling strategy on the training regime (number of epochs, model size, and dataset size). These dependencies are missed in small-scale regimes, leading to suboptimal scaling decisions. The researchers summarize their scaling strategy as follows: 1) scale model depth in training settings where overfitting can occur (otherwise scaling width is preferable); 2) scale image resolution at a slower rate.

Using these improved training and scaling strategies, the researchers designed re-scaled ResNets (ResNet-RS), trained at a range of model sizes; the speed-accuracy Pareto curve is shown in Figure 1. The ResNet-RS models use less memory during training and are 1.7-2.7 times faster than EfficientNets on TPUs and 2.1-3.3 times faster on GPUs. In a large-scale semi-supervised learning setting, when jointly trained on ImageNet and an additional 130 million pseudo-labeled images, ResNet-RS is 4.7 times faster on TPU and 5.5 times faster on GPU than EfficientNet-B5.

Finally, the researchers verified the generality of the improved training and scaling strategies through a series of experiments. They first used the scaling strategy to design a faster EfficientNet variant, EfficientNet-RS, whose speed-accuracy Pareto curve improves on the original EfficientNet. Next, they show that the improved training strategy performs comparably to or better than the self-supervised algorithms SimCLR and SimCLRv2 on a series of downstream tasks. The improved training strategy also generalizes to video classification: applying it to 3D-ResNets on the Kinetics-400 dataset improves accuracy from 73.4% to 77.4% (+4%).

By combining small architectural changes with this improved training and scaling strategy, the researchers found that the ResNet architecture sets the state-of-the-art baseline for vision research. This finding highlights the importance of teasing apart these factors in order to understand which architectures perform better.

Method

The researchers introduce the basic ResNet architecture and the training methods used.

Architecture

The researchers use the ResNet architecture with two widely adopted architectural changes: the ResNet-D modification and Squeeze-and-Excitation (SE) in all bottleneck blocks. Several architectures, such as TResNet, ResNeSt, and EfficientNet, use these two changes.
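As a rough illustration of the SE idea (channel-wise gating computed from globally pooled statistics), here is a minimal plain-Python sketch; the weights and shapes are hypothetical toys, not the paper's convolutional implementation:

```python
import math

def se_block(feature_map, w1, w2):
    """Toy Squeeze-and-Excitation sketch.

    feature_map: list of C channels, each an HxW grid (list of rows).
    w1: reduced x C weight matrix, w2: C x reduced weight matrix.
    """
    # Squeeze: global-average-pool each channel down to one scalar.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel gates.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: reweight each channel by its gate.
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_map, gates)]

# With all-zero weights every gate is sigmoid(0) = 0.5, so each channel is halved.
gated = se_block([[[2.0]], [[4.0]]], w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```

The key design point is that the gate for each channel depends on all channels' pooled statistics, giving cheap global context per block.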

Training methods

The researchers introduced regularization and data augmentation methods commonly used in SOTA classification models and semi/self-supervised learning.

The training method used by the researchers is very close to EfficientNet's, training for 350 epochs, but with a few differences:

1) For simplicity, they use cosine learning rate scheduling instead of exponential decay.

2) RandAugment is used in all models, whereas the original EfficientNet uses AutoAugment. The researchers retrained EfficientNets B0-B4 with RandAugment and found no performance improvement.

3) The Momentum optimizer is used instead of RMSProp.
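The cosine schedule in point 1 can be sketched as follows; the linear warmup option is a common addition and an assumption here, not necessarily part of the authors' setup:

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Cosine learning-rate schedule with optional linear warmup.

    Illustrative sketch only; warmup_steps is an assumption, not a
    detail taken from the paper.
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Decays smoothly from base_lr at the start to ~0 at the end.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Unlike stepwise exponential decay, the cosine curve has no schedule-specific hyperparameters beyond the total step count, which is why it is the simpler choice.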

The researchers used regularization methods such as weight decay, label smoothing, dropout, and stochastic depth, and used RandAugment data augmentation as an additional regularizer.
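Of these regularizers, label smoothing is easy to sketch: the one-hot target is mixed with a uniform distribution over classes. The eps value below is a common default and an assumption, not necessarily the paper's setting:

```python
def smoothed_targets(label, num_classes, eps=0.1):
    """Label smoothing: mix the one-hot target with the uniform
    distribution. eps=0.1 is a common default (an assumption here)."""
    off = eps / num_classes               # mass given to every class
    on = 1.0 - eps + off                  # mass kept on the true class
    return [on if i == label else off for i in range(num_classes)]
```

The smoothed targets still sum to 1, but the model is no longer pushed toward infinitely confident logits, which reduces overfitting.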

Improved training methods

Additive study

Table 1 below shows an additive study of training methods, regularization methods, and architectural changes. The baseline ResNet-200 achieves a Top-1 accuracy of 79.0%; improving the training method alone (without changing the architecture) raises performance to 82.2% (+3.2%). Adding two common architectural updates (Squeeze-and-Excitation and ResNet-D) further improves performance to 83.4%. Improvements to the training method account for 3/4 of the total gain, showing that they play a key role in ImageNet performance.

[Table 1: additive study of training methods and architectural changes]

The importance of reducing weight decay when combining regularization methods

Table 2 below demonstrates the importance of changing the weight decay value when combining multiple regularization methods:

[Table 2: weight decay values under different regularization combinations]

There is no need to change the default weight decay of 1e-4 when applying only RandAugment (RA) and label smoothing (LS). However, once dropout (DO) and stochastic depth (SD) are also added, failing to reduce the weight decay value degrades model performance.
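That observation can be captured as a simple heuristic. The specific reduced value below is an assumption for illustration, not a prescription from the paper:

```python
def pick_weight_decay(use_dropout, use_stochastic_depth,
                      default=1e-4, reduced=4e-5):
    """Heuristic reading of Table 2: with only RA + LS, keep the default
    weight decay; once dropout or stochastic depth also regularize the
    model, use a smaller decay to avoid over-regularizing. The `reduced`
    value is illustrative."""
    if use_dropout or use_stochastic_depth:
        return reduced
    return default
```

The underlying point is that weight decay, dropout, and stochastic depth all act as regularizers, so their strengths must be tuned jointly rather than independently.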

Improved scaling strategies

The researchers show in this section that scaling strategies are equally important. To establish the scaling trend, the researchers conducted an extensive search on ImageNet for width multipliers [0.25, 0.5, 1.0, 1.5, 2.0], depth [26, 50, 101, 200, 300, 350, 400], and resolution [128, 160, 224, 320, 448].

This study mimics the training setup of SOTA ImageNet models, training for 350 epochs. As model size increases, the researchers add regularization to limit overfitting.

Strategy 1: Scale depth in regimes where overfitting can occur

In longer-epoch regimes, depth scaling outperforms width scaling. In the 350-epoch setting (Figure 3 below, right), depth scaling performed significantly better than width scaling across all image resolutions. Width scaling leads to overfitting, and performance degrades even with increased regularization. The researchers hypothesize that this is due to the larger parameter increase when scaling width: scaling depth (especially in earlier layers) introduces fewer parameters than scaling width.

[Figure 3: depth vs. width scaling under different epoch budgets]

In shorter-epoch regimes, width scaling is better than depth scaling: when training for only 10 epochs (Figure 3, far left), width scaling wins.
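The parameter argument behind Strategy 1 can be checked with a toy stack of plain 3x3 convolutions (a deliberate simplification of ResNet's bottleneck blocks, used only to show the growth rates):

```python
def conv_params(channels, kernel=3):
    """Weight count of a plain conv stack: each layer contributes
    k*k*c_in*c_out parameters (biases ignored). A toy model for the
    width-vs-depth argument, not ResNet's exact bottleneck math."""
    return sum(kernel * kernel * c_in * c_out
               for c_in, c_out in zip(channels, channels[1:]))

base = [64, 64, 64, 64]           # 3 conv layers at width 64
wider = [c * 2 for c in base]     # double every layer's width
deeper = base + [64, 64, 64]      # double the number of layers

# Doubling width multiplies parameters by ~4x, while doubling depth
# only doubles them, so width scaling overfits sooner at a fixed
# dataset size.
```

This quadratic-vs-linear parameter growth is one plausible reading of why width scaling overfits in long-epoch regimes while depth scaling does not.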

Strategy 2: Scale image resolution more slowly

In Figure 2 below, the researchers also observe that larger image resolutions yield diminishing returns. They therefore suggest increasing image resolution more gradually than in previous work. Experiments show that slower image-resolution scaling improves the performance of both the ResNet and EfficientNet architectures.

[Figure 2: effect of image resolution scaling]
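The cost side of resolution scaling is easy to see: convolution compute grows with the square of the input resolution. A toy estimate with illustrative constants only:

```python
def conv_flops(resolution, channels=64, kernel=3, layers=3):
    """Rough multiply count for a conv stack at a given input
    resolution: each output pixel costs k*k*c_in multiplies per
    output channel. Constants are illustrative, not from the paper."""
    per_layer = resolution * resolution * kernel * kernel * channels * channels
    return layers * per_layer

# Going from 224 to 448 quadruples the compute, so if accuracy gains
# from resolution are diminishing, slow resolution scaling is the
# better use of a FLOP budget.
```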

Two common mistakes in designing scaling strategies

1. Infer scaling strategies in small-scale settings (such as small models or few training epochs): this does not generalize to large models or longer training iterations;

2. Infer scaling strategies within a single suboptimal initial architecture: Suboptimal initial architecture can affect scaling results.

Summary

For a new task, it is generally recommended to run a small set of models of different sizes for the full number of training epochs to understand which dimensions are most useful. While this approach seems expensive, the study notes that the cost is offset by not searching over architectures.

For image classification, the scaling strategy can be summarized as: in settings where overfitting can occur, scale depth, and scale image resolution slowly. Experiments show that applying these scaling strategies to ResNet and EfficientNet (yielding ResNet-RS and EfficientNet-RS) results in significant speedups over EfficientNet. Recent work such as LambdaResNet and NFNet, which achieve significant speedups over EfficientNet, also use similar scaling strategies.

Experiment

ResNet-RS speed vs. accuracy

The researchers designed ResNet-RS using the improved training and scaling strategies. Figure 4 below compares the speed-accuracy Pareto curves of EfficientNet and ResNet-RS: at similar accuracy, ResNet-RS is 1.7-2.7 times faster than EfficientNet on TPUs.

[Figure 4: speed-accuracy Pareto curves of EfficientNet and ResNet-RS]

This speedup may seem surprising, since EfficientNet has far fewer parameters and FLOPs than ResNet. The researchers analyzed the reasons and present a performance comparison of EfficientNet and ResNet-RS that shows the impact of parameter count and FLOPs:

[Table: performance comparison of EfficientNet and ResNet-RS]

Improving the efficiency of EfficientNet

The analysis above shows that scaling up image resolution leads to diminishing returns. This shows that the scaling rules advocated by EfficientNet (increasing model depth, width, and resolution) are suboptimal.

The researchers applied Strategy 2 to EfficientNet, training multiple versions with reduced image resolution, without changing model depth or width. Figure 5 below shows the improvement of the re-scaled EfficientNet (EfficientNet-RS) over the original EfficientNet:

[Figure 5: EfficientNet-RS vs. the original EfficientNet]

Semi-supervised learning

The researchers measured the performance of ResNet-RS in a large-scale semi-supervised learning setting. Specifically, they trained the model on 1.2M labeled ImageNet images combined with 130M pseudo-labeled images, using a training method similar to Noisy Student.
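The pseudo-labeling step of a Noisy-Student-style pipeline can be sketched as follows; the confidence-threshold filtering and its value are assumptions for illustration, not details from this paper:

```python
def pseudo_label(teacher, unlabeled, threshold=0.9):
    """Noisy-Student-style pseudo-labeling sketch: a trained teacher
    scores unlabeled examples, and only confident predictions become
    training targets for the student. `teacher` returns
    (label, confidence); the threshold is a hypothetical choice."""
    labeled = []
    for x in unlabeled:
        label, confidence = teacher(x)
        if confidence >= threshold:
            labeled.append((x, label))
    return labeled
```

The student is then trained on the union of the real labels and these pseudo-labels, which is how the 1.2M + 130M joint training set above is assembled.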

Table 4 below shows that the ResNet-RS model still performs strongly in a semi-supervised learning setting. The model achieves 86.2% top-1 accuracy on the ImageNet dataset and is 3.7x faster on TPU (4.5x faster on GPU) than the corresponding Noisy Student EfficientNet-B5 model.

[Table 4: semi-supervised learning results]

Transfer learning performance of ResNet-RS

Table 5 below compares the transfer performance of the improved supervised training strategy (RS) with the self-supervised SimCLR and SimCLRv2. Even on small datasets, the improved training strategy improves the model's transfer performance.

[Table 5: transfer learning comparison with SimCLR and SimCLRv2]

3D ResNet designed for video classification

Table 6 below shows an additive study of the RS training method and architectural improvements for video classification. Extending the training strategy to video classification raises accuracy from 73.4% to 77.4% (+4.0%). The ResNet-D and Squeeze-and-Excitation architectural changes further improve performance to 78.2% (+0.8%). As with image classification (see Table 1), most of the improvement comes without architectural changes. Without any model scaling, 3D ResNet-RS-50 is only 2.2% below the SOTA result (80.4%).

[Table 6: additive study on video classification]
