Overview of the seven major categories of convolutional neural network (CNN) architectures


Foreword: Deep convolutional neural networks (CNNs) are a special type of neural network that have shown state-of-the-art results on various competition benchmarks. The high performance achieved by deep CNN architectures on challenging benchmark competitions demonstrates that innovative architectural concepts, as well as parameter optimization, can improve CNN performance on a variety of vision-related tasks. This review classifies recent CNN architectural innovations into seven distinct categories based on space utilization, depth, multipath, width, feature map utilization, channel boosting, and attention.


Original text: https://arxiv.org/abs/1901.06032

Abstract: Deep convolutional neural networks (CNNs) are a special type of neural network that have demonstrated state-of-the-art results on various competition benchmarks. The superior learning capability of deep CNNs is achieved primarily through the use of multiple nonlinear feature extraction stages that automatically learn hierarchical representations from data. The availability of large amounts of data and improvements in hardware processing units have accelerated CNN research, and very interesting deep CNN architectures have recently been reported. The high performance recently achieved by deep CNN architectures on challenging benchmark competitions demonstrates that innovative architectural concepts, as well as parameter optimization, can improve CNN performance on a variety of vision-related tasks.

In view of this, different ideas about CNN design have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and restructuring of processing units. However, the major improvements in representational capability have been achieved by restructuring the processing units. In particular, the idea of using blocks rather than layers as structural units has gained wide acceptance.

This review divides recent CNN architectural innovations into seven distinct categories. The seven categories are based on space utilization, depth, multipath, width, feature map utilization, channel boosting and attention respectively. Additionally, this article covers a basic understanding of the components of a CNN and reveals the current challenges faced by CNNs and their applications.

1

『Introduction』

CNNs first gained attention through LeCun’s 1989 work on processing grid-structured data (images and time-series data). CNNs are regarded as one of the best techniques for understanding image content and have demonstrated state-of-the-art performance on tasks related to image recognition, segmentation, detection, and retrieval. The success of CNNs has attracted attention beyond the academic community. In industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have set up research teams to explore new CNN architectures. Currently, most front-runners in image processing competitions employ deep CNN-based models.

Since 2012, many different innovations in CNN architecture have been proposed. These innovations can be divided into parameter optimization, regularization, structural reorganization, and so on. However, it has been observed that the performance improvements of CNNs are mainly attributable to the restructuring of processing units and the design of new modules. Since AlexNet demonstrated extraordinary performance on the ImageNet dataset, CNN-based applications have become increasingly popular. Similarly, Zeiler and Fergus introduced the concept of layer-wise feature visualization, which shifted the trend toward extracting features at low spatial resolution in deep architectures such as VGG. Today, most new architectures are built on the simple principles and homogeneous topology introduced by VGG.

On the other hand, the Google team introduced the now-famous split-transform-merge concept in the form of the Inception module. The Inception block was the first to use intra-layer branching, allowing feature extraction at different spatial scales. In 2015, the residual connection introduced by ResNet to ease the training of deep CNNs became famous, and most later networks, such as Inception-ResNet, WideResNet, and ResNeXt, use it. Similarly, architectures such as WideResNet, Pyramidal Nets, and Xception introduced the concept of multi-level transformations through additional cardinality and increased width. As a result, the focus of research shifted from parameter optimization and connection readjustment to network architecture design (layer structure). This led to many new architectural concepts such as channel boosting, space and channel utilization, and attention-based information processing.

This article is structured as follows:


Figure 1: Article structure


Figure 2: Basic layout of a typical pattern recognition (PR) system. The PR system is divided into three stages: stage 1 is related to data mining, stage 2 performs preprocessing and feature selection, and stage 3 is based on model selection, parameter tuning, and analysis. Because CNNs have good feature extraction and strong discrimination capabilities, they can be used in the feature extraction/generation and model selection stages of a PR system.

2

『Architectural Innovation in CNN』

Many different improvements to the CNN architecture have been made from 1989 to the present. Broadly, innovations in CNNs have been achieved through the exploitation of depth and space. Based on the type of architectural modification, CNNs can be broadly classified into seven categories: CNNs based on space utilization, depth, multipath, width, channel boosting, feature map utilization, and attention. The classification of deep CNN architectures is shown in Figure 3.


Figure 3: Classification of deep CNN architectures

3

『CNN based on space utilization』

CNNs have a large number of parameters, such as the number of processing units (neurons), the number of layers, filter size, stride, learning rate, and activation function. Since CNNs consider the neighborhood (locality) of input pixels, filters of different sizes can be used to explore different levels of correlation. Therefore, in the early 2000s, researchers exploited spatial transformations to improve performance and, in addition, evaluated the impact of filters of different sizes on network learning. Filters of different sizes encapsulate different levels of granularity; typically, smaller filters extract fine-grained information, while larger filters extract coarse-grained information. In this way, by adjusting the filter size, a CNN can perform well on both coarse-grained and fine-grained details.
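As a rough illustration of this idea, the sketch below (assuming PyTorch, which the article does not use; the module name and channel counts are made up for the example) applies filters of several sizes in parallel and concatenates their responses, so a single layer captures both fine- and coarse-grained detail.

```python
# Minimal sketch: parallel filters of different sizes on the same input.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch=16):
        super().__init__()
        # Smaller kernels pick up fine-grained information,
        # larger kernels pick up coarse-grained information.
        self.branch3 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(in_ch, out_ch_per_branch, kernel_size=7, padding=3)

    def forward(self, x):
        # Concatenate the multi-scale responses along the channel axis.
        return torch.cat([self.branch3(x), self.branch5(x), self.branch7(x)], dim=1)

x = torch.randn(1, 3, 64, 64)       # a dummy RGB image
print(MultiScaleConv(3)(x).shape)   # torch.Size([1, 48, 64, 64])
```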

Depth-based CNN

Deep CNN architectures are based on the assumption that, as depth increases, the network can better approximate the target function through a larger number of nonlinear mappings and richer feature representations. Network depth has played an important role in the success of supervised learning. Theoretical studies have shown that deep networks can represent certain classes of functions exponentially more efficiently than shallow networks.

In 2001, Csáji presented the universal approximation theorem, which states that a single hidden layer is sufficient to approximate any function, but this may require an exponential number of neurons and is therefore often computationally infeasible. In this regard, Bengio and Delalleau argued that deeper networks have the potential to maintain expressive power at lower cost. In 2013, Bengio et al. showed empirically that, for complex tasks, deep networks are both computationally and statistically more efficient. Inception and VGG, which performed best in the 2014 ILSVRC competition, further illustrate that depth is an important dimension in regulating a network's learning capacity.

Once a feature has been extracted, its exact location becomes less important as long as its approximate position relative to other features is preserved. Pooling, or downsampling (like convolution), is an interesting local operation: it summarizes similar information in the neighborhood of the receptive field and outputs the dominant response within that local region. Because of the convolution operation, feature patterns may appear at different locations in the image.
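A small illustration of pooling as local summarization (again assuming PyTorch; the tensor sizes are arbitrary): max pooling keeps the dominant response in each neighborhood, so a feature shifted by a pixel or two still produces a similar pooled output.

```python
# Max pooling halves the spatial resolution while keeping the strongest responses.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)
fmap = torch.randn(1, 8, 32, 32)   # a convolutional feature map
print(pool(fmap).shape)            # torch.Size([1, 8, 16, 16])
```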

Multipath-based CNN

Training deep networks is quite challenging and has been the subject of much recent research on deep networks. Deep CNNs are computationally and statistically efficient for complex tasks. However, deeper networks may suffer from performance degradation and vanishing or exploding gradients, problems that are typically caused by increased depth rather than by overfitting. The vanishing gradient problem leads not only to higher test error but also to higher training error.

To train deeper networks, the concept of multipath or cross-layer connections was proposed. Multipath or shortcut connections systematically connect one layer to another by skipping some intermediate layers, so that specific information flows across layers. Cross-layer connections partition the network into several blocks. These paths also attempt to solve the vanishing gradient problem by giving lower layers direct access to the gradient. Different types of shortcut connections are used for this purpose, such as zero-padded, projection-based, dropout-based, and 1×1 connections.
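A minimal residual-style block is sketched below (assuming PyTorch; the block name and channel sizes are illustrative, not from the original paper). The shortcut path skips the intermediate layers, and a projection-based 1×1 connection is used when the channel counts differ.

```python
# Sketch of a shortcut (skip) connection across two convolutional layers.
import torch
import torch.nn as nn

class ShortcutBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # Projection-based shortcut when shapes differ, identity otherwise.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1)
                         if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        # The skip path lets gradients reach earlier layers directly.
        return torch.relu(self.body(x) + self.shortcut(x))

print(ShortcutBlock(16, 32)(torch.randn(1, 16, 8, 8)).shape)  # [1, 32, 8, 8]
```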

An activation function is a decision function that helps in learning intricate patterns. Choosing an appropriate activation function can speed up the learning process. The activation applied to a convolved feature map is defined in equation (3).

T_l^k = g_a(F_l^k)    (3)
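As a tiny sketch of equation (3) (assuming PyTorch and standard notation: F_l^k is the convolved feature map, g_a the activation; the layer sizes here are made up), the activation is simply applied element-wise to the output of the convolution.

```python
# Apply an activation g_a element-wise to a convolved feature map F.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
g_a = nn.ReLU()                        # one common choice of activation
F = conv(torch.randn(1, 3, 32, 32))    # convolved feature map F_l^k
T = g_a(F)                             # T_l^k = g_a(F_l^k)
print(T.shape)                         # torch.Size([1, 8, 32, 32])
```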

Width-based multi-connection CNN

From 2012 to 2015, network architecture design focused on the power of depth and the importance of multi-pass regulatory connections for network regularization. However, the width of a network is just as important as its depth. By using multiple processing units in parallel within a layer, the multilayer perceptron gained the ability to map more complex functions than the perceptron. This shows that width, like depth, is an important parameter in defining learning principles.

Lu et al. and Hanin & Sellke have recently shown that neural networks with ReLU activation functions must be wide enough to maintain the universal approximation property as depth increases. Moreover, if the maximum width of the network is no larger than the input dimension, the class of continuous functions on a compact set cannot be well approximated by a network of arbitrary depth. Therefore, simply stacking more layers may not increase the representational power of a neural network. Another important issue with deep architectures is that some layers or processing units may fail to learn useful features. To address this issue, the focus of research has shifted from deep, narrow architectures to shallower, wider ones.
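A hedged sketch of widening (assuming PyTorch; the widening factor and channel counts are illustrative, in the spirit of WideResNet rather than a reproduction of it): the same block is made wider simply by multiplying its channel count per layer.

```python
# Widening a block: more parallel feature channels per layer instead of more layers.
import torch
import torch.nn as nn

def conv_block(in_ch, base_ch, widen_factor=1):
    out_ch = base_ch * widen_factor   # width scales with the widening factor
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.ReLU(inplace=True),
    )

narrow = conv_block(16, 16, widen_factor=1)   # deep-and-narrow style layer
wide   = conv_block(16, 16, widen_factor=4)   # shallower-but-wider style layer
x = torch.randn(1, 16, 32, 32)
print(narrow(x).shape, wide(x).shape)         # [1, 16, 32, 32] [1, 64, 32, 32]
```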

CNN based on feature map (channel) utilization

CNNs are well known in machine vision tasks for their hierarchical learning and automatic feature extraction capabilities. Feature selection plays an important role in determining the performance of classification, segmentation, and detection modules. In traditional feature extraction techniques, the performance of the classification module is limited by its reliance on a single type of feature. Compared with traditional techniques, CNNs use multi-stage feature extraction to extract different kinds of features (called feature maps in CNNs) from the given input. However, some feature maps contribute little or nothing to object discrimination, and huge feature sets can introduce noise and lead to overfitting of the network.

This shows that, in addition to network engineering, the selection of class-specific feature maps is crucial for improving the generalization performance of the network. In this section, the terms feature map and channel are used interchangeably, because many researchers use the word channel in place of feature map.
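One way to realize this selection is a learned channel gate that scores each feature map and rescales it, so maps that contribute little to discrimination are suppressed. The sketch below (assuming PyTorch; it follows the squeeze-and-excitation idea, and the reduction ratio is illustrative) shows the mechanism.

```python
# Channel-wise gating: score each feature map and reweight it.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # squeeze: one statistic per map
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                # per-channel importance in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)                          # reweight each feature map

fmap = torch.randn(1, 32, 16, 16)
print(ChannelGate(32)(fmap).shape)                       # [1, 32, 16, 16]
```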

CNN based on channel (input channel) boosting

Image representation plays an important role in determining the performance of image processing algorithms. A good representation can define the salient features of an image with a compact code. In different studies, different types of traditional filters have been used to extract different levels of information from a single type of image, and these different representations are then used as inputs to the model to improve performance. CNNs are good feature learners and can automatically extract discriminative features depending on the problem. However, CNN learning relies on the input representation: if the input lacks diversity and class-defining information, the performance of the CNN as a discriminator suffers. To this end, the concept of auxiliary learners was introduced into CNNs to boost the input representation of the network.
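A hedged sketch of channel boosting (assuming PyTorch; the auxiliary learner here is a stand-in convolution, purely illustrative): extra channels produced by auxiliary learners are concatenated with the original image channels before they enter the main CNN, enriching the input representation.

```python
# Boost the input by concatenating auxiliary representations as extra channels.
import torch
import torch.nn as nn

def boost_channels(img, aux_learners):
    # Each auxiliary learner maps the image to an additional representation.
    extra = [learner(img) for learner in aux_learners]
    return torch.cat([img] + extra, dim=1)   # boosted input channels

aux = [nn.Conv2d(3, 3, kernel_size=3, padding=1)]   # stand-in auxiliary learner
img = torch.randn(1, 3, 64, 64)
boosted = boost_channels(img, aux)
print(boosted.shape)                                 # [1, 6, 64, 64]

main_cnn = nn.Conv2d(boosted.shape[1], 16, 3, padding=1)
print(main_cnn(boosted).shape)                       # [1, 16, 64, 64]
```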

Attention-based CNN

Different levels of abstraction play an important role in defining the discriminative power of a neural network. In addition, selecting contextually relevant features is important for image localization and recognition. In the human visual system, this phenomenon is called attention. Humans view a scene in a succession of quick glances and attend to the parts that are contextually relevant.

In this process, humans not only attend to the selected region but also reason about different interpretations of the object at that location, which helps them grasp visual structure in a better way. Similar interpretation capabilities have been added to neural networks such as RNNs and LSTMs.

These networks use attention modules to process sequential data and weight new samples according to their occurrence in previous iterations. Various researchers have incorporated the concept of attention into CNNs to improve representation and overcome the computational limitations of the data. Attention also helps make a CNN smarter, allowing it to recognize objects even in cluttered backgrounds and complex scenes.
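A minimal spatial-attention sketch (assuming PyTorch; the layer sizes are illustrative and this is only one of many attention designs): the network predicts a per-location weight map and multiplies it into the feature maps, so responses from contextually relevant regions are emphasized.

```python
# Spatial attention: weight each location of the feature maps by a learned score.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution produces one attention score per spatial location.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))   # weights in [0, 1], shape [N, 1, H, W]
        return x * attn                        # broadcast over channels

fmap = torch.randn(1, 32, 16, 16)
print(SpatialAttention(32)(fmap).shape)        # [1, 32, 16, 16]
```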
