Executable Case of the Non-Local Network under the MindSpore Framework

Executable case download: Non-Local notebook

NonLocal

“Non-local Neural Networks” was published at CVPR 2018 as a method for video action classification.

Introduction to algorithm principles

Figure 1 nonlocal_block

NonLocal is a flexible building block that can be easily combined with convolutional and recurrent layers. It can be inserted into the earlier parts of a deep neural network, unlike fc layers, which are usually placed at the end; this allows building a richer hierarchy that combines both non-local and local information. The non-local operation in the paper computes the response at a position as a weighted sum over all positions of the feature map. These positions can be spatial, temporal, or spatio-temporal. Non-local is therefore closely related to the self-attention mechanism. So that the proposed non-local block can be freely inserted into any neural network as a component, the authors designed the non-local operation to keep the input and output sizes consistent. The specific formula is as follows:

y_i = (1 / C(x)) * Σ_j f(x_i, x_j) g(x_j)

In the formula, x is the input and y is the output; i and j index positions of the input; x_i is a vector whose dimension equals the number of channels of x; f is a function that computes the similarity between any two positions; g is a mapping function that maps a position to a vector, i.e. the feature of that position; and C(x) is a normalization factor. To compute one position of the output, every position of the input must be considered. The idea is similar to the attention mechanism: the f function produces a weighting (mask) over all positions, which is multiplied by the g-mapped features and summed, giving the attention of that output position over the whole input. Computing every position in this way yields a non-local “attention map”.
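As a toy illustration of the formula (not the repository's implementation; the shapes and the random weights for g are made up for the example), the weighted sum can be computed over flattened positions with a dot-product similarity for f:

import numpy as np

# Toy example of y_i = 1/C(x) * sum_j f(x_i, x_j) g(x_j) with 5 positions and 4 channels.
np.random.seed(0)
x = np.random.randn(5, 4)          # flattened feature map: one row per position
w_g = np.random.randn(4, 4)        # hypothetical weights of the mapping g
g = x @ w_g                        # g(x_j): feature of each position
f = x @ x.T                        # f(x_i, x_j): pairwise dot-product similarity
y = (f / x.shape[0]) @ g           # normalize by C(x) = number of positions, then sum
print(y.shape)                     # (5, 4): same size as the input, as required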

Table 1 baseline_ResNet50_C2D

Table 1 shows the C2D baselines under the ResNet-50 backbone. In this repository, we use the Inflated 3D ConvNet (I3D) under the ResNet-50 backbone. The C2D models in Table 1 can be transformed into 3D convolutional models by “inflating” kernels. For example, a 2D k×k kernel can be extended to a 3D t×k×k kernel spanning t frames. We add 5 blocks (3 to res4, 2 to res3, every other residual block). For more details, please read the paper “Non-local Neural Networks”.
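A rough sketch of the inflation idea (assumed behavior in the spirit of I3D, not the repository's exact initialization code): the 2D weights are repeated t times along a new temporal axis and rescaled by 1/t, so that the inflated filter reproduces the 2D response on a video of identical frames.

import numpy as np

def inflate_kernel(weight_2d, t):
    """Inflate a 2D conv kernel (out_c, in_c, k, k) into a 3D one (out_c, in_c, t, k, k)."""
    weight_3d = np.repeat(weight_2d[:, :, None, :, :], t, axis=2)
    return weight_3d / t           # rescale so responses on repeated frames match the 2D case

w2d = np.random.randn(64, 3, 7, 7) # e.g. a ResNet-50 first-layer 7x7 kernel
w3d = inflate_kernel(w2d, t=5)
print(w3d.shape)                   # (64, 3, 5, 7, 7)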

Environment preparation

git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git
cd zjut_mindvideo

# Please first install mindspore according to instructions on the official website: https://www.mindspore.cn/install

pip install -r requirements.txt
pip install -e .

Training process

from mindspore import nn
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn.metrics import Accuracy

from msvideo.utils.check_param import Validator, Rel

Dataset loading

Load the Kinetics400 dataset through the Kinetic400 class, which is written based on VideoDataset. Download the dataset to the path below, or change the path as needed. Dataset download link: https://deepmind.com/research/open-source/kinetics

from msvideo.data.kinetics400 import Kinetic400
# Data Pipeline.
dataset = Kinetic400(path='/home/publicfile/kinetics-400',
                     split="train",
                     shuffle=True,
                     seq=32,
                     seq_mode='interval',
                     num_parallel_workers=1,
                     batch_size=6,
                     repeat_num=1,
                     frame_interval=6)
ckpt_save_dir = './nonlocal'

Data processing

Use VideoShortEdgeResize to resize the video according to its short side, then VideoRandomCrop to randomly crop the resized video, VideoRandomHorizontalFlip to horizontally flip it with a given probability, VideoRescale to rescale the pixel values, VideoReOrder to rearrange the dimensions, and finally VideoNormalize to normalize the video.

from msvideo.data.transforms import VideoRandomCrop, VideoRandomHorizontalFlip, VideoRescale
from msvideo.data.transforms import VideoNormalize, VideoShortEdgeResize, VideoReOrder
# Data Pipeline.
transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
                VideoRandomCrop([224, 224]),
                VideoRandomHorizontalFlip(0.5),
                VideoRescale(),
                VideoReOrder([3, 0, 1, 2]),
                VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])]

dataset.transform = transforms
dataset_train = dataset.run()
Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
step_size = dataset_train.get_dataset_size()

Network Construction

The most important structure in Nonlocal is NonlocalBlockND (nn.Cell). This block supports four pairwise similarity functions; taking dot_product as an example, it mainly performs linear transformations through three Conv3d layers. The NonlocalBlockND operation only needs common operators such as convolution, matrix multiplication, addition, and softmax, so users can easily assemble it into a model. A simplified sketch of such a block is shown below.
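The following is a minimal, simplified sketch of a dot-product non-local block written with MindSpore operators. It illustrates the idea only and is not the repository's NonlocalBlockND; the layer names, the 1x1x1 projections, and the normalization are assumptions.

import numpy as np
import mindspore as ms
from mindspore import nn, ops

class SimpleNonLocalDotProduct(nn.Cell):
    """Simplified dot-product non-local block for 5-D input (N, C, T, H, W)."""
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        self.inter_channels = inter_channels or in_channels // 2
        # Three 1x1x1 convolutions give the linear transformations theta, phi and g.
        self.theta = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        self.phi = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        self.g = nn.Conv3d(in_channels, self.inter_channels, kernel_size=1)
        # A final 1x1x1 convolution maps back to in_channels for the residual connection.
        self.w = nn.Conv3d(self.inter_channels, in_channels, kernel_size=1)
        self.bmm = ops.BatchMatMul()

    def construct(self, x):
        n, c, t, h, w = x.shape
        # Project and flatten all space-time positions: (N, C', T*H*W).
        theta = self.theta(x).reshape(n, self.inter_channels, -1)
        phi = self.phi(x).reshape(n, self.inter_channels, -1)
        g = self.g(x).reshape(n, self.inter_channels, -1)
        # Pairwise dot-product similarity f(x_i, x_j), normalized by the number of positions.
        att = self.bmm(theta.transpose(0, 2, 1), phi) / (t * h * w)
        # Weighted sum of g(x_j), then restore the input layout.
        y = self.bmm(att, g.transpose(0, 2, 1)).transpose(0, 2, 1)
        y = y.reshape(n, self.inter_channels, t, h, w)
        # Residual connection keeps the input and output sizes consistent.
        return x + self.w(y)

# Hypothetical usage: a batch of 2 clips, 16 channels, 4 frames, 8x8 spatial size.
block = SimpleNonLocalDotProduct(in_channels=16)
clip = ms.Tensor(np.ones((2, 16, 4, 8, 8)), ms.float32)
print(block(clip).shape)           # (2, 16, 4, 8, 8)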

nonlocal3d consists of a backbone, avg_pool, flatten, and head. It can be roughly summarized as follows. First, the backbone is NLResInflate3D50 (the NLInflateResNet3D class), which implements the [3, 4, 6, 3] stage configuration of the NLInflateResNet3D structure; NLInflateResNet3D inherits from ResNet3d50. Within the 10 residual blocks of the 2nd and 3rd stages of the [3, 4, 6, 3] configuration, a NonlocalBlockND is inserted after every other residual block, giving 5 non-local blocks in total. Second, the output of NLResInflate3D50 is passed to an average pooling layer and flattened. Third, the classification head: the flattened tensor is fed to Dropdensehead for classification, producing a tensor of shape (N, NUM_CLASSES).
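As a small, purely illustrative snippet of the insertion rule described above (the stage and block indices are assumptions used to show the counting, not the repository's code), marking every other residual block of stages 2 and 3 in a [3, 4, 6, 3] backbone yields 2 + 3 = 5 non-local blocks, matching Table 1:

stage_blocks = [3, 4, 6, 3]        # residual blocks per stage in ResNet-50
nonlocal_stages = {1, 2}           # stages 2 and 3 (0-based: res3 and res4)
total = 0
for stage, num_blocks in enumerate(stage_blocks):
    for block in range(num_blocks):
        if stage in nonlocal_stages and block % 2 == 1:
            total += 1             # a NonlocalBlockND follows this residual block
print(total)                       # 5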

from msvideo.models.nonlocal3d import nonlocal3d
# Create model
network = nonlocal3d(in_d=32,
                     in_h=224,
                     in_w=224,
                     num_classes=400,
                     keep_prob=0.5)

from msvideo.schedule.lr_schedule import warmup_step_lr
# Set learning rate scheduler.
lr = warmup_step_lr(lr=0.0003,
                    lr_epochs=[1],
                    steps_per_epoch=step_size,
                    warmup_epochs=1,
                    max_epoch=1,
                    gamma=0.1)

# Define optimizer.
network_opt = nn.SGD(network.trainable_params(),
                     lr,
                     momentum=0.9,
                     weight_decay=0.0001)
# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

# Set checkpoint for the network.
ckpt_config = CheckpointConfig(
    save_checkpoint_steps=step_size,
    keep_checkpoint_max=1)
ckpt_callback = ModelCheckpoint(prefix='nonlocal_kinetics400',
                                directory=ckpt_save_dir,
                                config=ckpt_config)

# Init the model.
model = Model(network,
              loss_fn=network_loss,
              optimizer=network_opt,
              metrics={"Accuracy": Accuracy()})

# Begin to train.
print('[Start training `{}`]'.format('nonlocal_kinetics400'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor()],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('nonlocal_kinetics400'))

Evaluation Process

from mindspore import context
from mindspore.train.callback import Callback

class PrintEvalStep(Callback):
    """ print eval step """
    def step_end(self, run_context):
        """ eval step """
        cb_params = run_context.original_args()
        print("eval: {}/{}".format(cb_params.cur_step_num, cb_params.batch_num))

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

from msvideo.data.kinetics400 import Kinetic400

dataset_eval = Kinetic400(path="/home/publicfile/kinetics-400",
                          split="val",
                          shuffle=True,
                          seq=32,
                          seq_mode='interval',
                          num_parallel_workers=1,
                          batch_size=1,
                          frame_interval=6)

from msvideo.data.transforms import VideoReOrder, VideoRescale, VideoNormalize
from msvideo.data.transforms import VideoCenterCrop, VideoShortEdgeResize

transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
              VideoCenterCrop([224, 224]),
              VideoRescale(),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])]
dataset_eval.transform = transforms
dataset_eval = dataset_eval.run()

from mindspore import nn
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore import load_checkpoint, load_param_into_net
from msvideo.models.nonlocal3d import nonlocal3d

# Create model.
network = nonlocal3d()


# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

# Load pretrained model.
param_dict = load_checkpoint(ckpt_file_name='/home/hcx/nonlocal_mindspore/scripts/nonlocal_output_0.0003/nonlocal-1_4975.ckpt')
load_param_into_net(network, param_dict)

# Define eval_metrics.
eval_metrics = {'Loss': nn.Loss(),
                'Top_1_Accuracy': nn.Top1CategoricalAccuracy(),
                'Top_5_Accuracy': nn.Top5CategoricalAccuracy()}
print_cb = PrintEvalStep()

# Init the model.
model = Model(network, loss_fn=network_loss, metrics=eval_metrics)

# Begin to eval.
print('[Start eval `{}`]'.format('nonlocal_kinetics400'))
result = model.eval(dataset_eval,
                    callbacks=[print_cb],
                    dataset_sink_mode=False)
print(result)

Code

Gitee

GitHub