NonLocal
“Non-local Neural Networks” was published at CVPR 2018 as a method for action classification in video.
Introduction to algorithm principles
Figure 1 nonlocal_block
NonLocal is a flexible building block that can easily be combined with convolutional or recurrent layers. Unlike a fully connected (fc) layer, which is usually placed at the end of a network, it can be added to the earlier part of a deep neural network, which allows building a richer hierarchy that combines non-local and local information. The non-local operation in the paper computes the response at a position as a weighted sum of the features at all positions in the feature map. These positions can be in space, time, or space-time. The operation is closely related to the self-attention mechanism. So that the non-local block can be inserted freely into any network as a component, the authors designed the operation so that its output has the same size as its input. The operation is defined as follows:
Formulation
In the formula, x is the input and y is the output; i and j index positions of the input. x_i is a vector whose dimension equals the number of channels of x. f is a function that computes the similarity between any two positions, and g is a mapping function that maps a position to a vector, which is the feature of that position. Computing one position of the output therefore requires considering every position of the input. The idea is similar to the attention mechanism: f provides the weights (the mask), which are multiplied by the output of g and then summed, giving the attention of one output position over the original image. Every position is computed this way, and the result is a non-local “attention map”.
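For reference, assuming the formula described above is Eq. (1) of “Non-local Neural Networks”, it can be written as follows. The normalization factor C(x), the dot-product form of f, and the residual output z are taken from the paper, not from the description above.

```latex
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)

% Dot-product similarity, with N the number of positions in x:
f(x_i, x_j) = \theta(x_i)^{\top} \phi(x_j), \qquad \mathcal{C}(x) = N

% Residual form of the block, which keeps input and output sizes equal:
z_i = W_z\, y_i + x_i
```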
Table 1 baseline_ResNet50_C2D
Table 1 shows the C2D baseline with a ResNet-50 backbone. This repository uses the Inflated 3D ConvNet (I3D) with a ResNet-50 backbone. The C2D model in Table 1 can be turned into a 3D convolutional model by “inflating” its kernels: for example, a 2D k×k kernel can be extended to a 3D t×k×k kernel spanning t frames. We add 5 non-local blocks (3 to res4 and 2 to res3, after every other residual block). For more details, please read the paper “Non-local Neural Networks”.
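The kernel inflation mentioned above can be sketched with NumPy. This is a minimal illustration, not this repository's code: the helper name inflate_kernel is hypothetical, and the 1/t rescaling follows the initialization described in the paper, which this repository is only assumed to use.

```python
import numpy as np

def inflate_kernel(kernel_2d: np.ndarray, t: int) -> np.ndarray:
    """Inflate 2D conv weights (out_c, in_c, k, k) to 3D weights (out_c, in_c, t, k, k)."""
    # Copy the 2D kernel t times along a new temporal axis.
    kernel_3d = np.repeat(kernel_2d[:, :, np.newaxis, :, :], t, axis=2)
    # Rescale by 1/t so that a clip of t identical frames produces the same
    # response as the original 2D kernel applied to a single frame.
    return kernel_3d / t

rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))   # hypothetical 3x3 kernel, 3 -> 8 channels
w3d = inflate_kernel(w2d, t=5)
print(w3d.shape)
```

Summing the inflated kernel over its temporal axis recovers the 2D kernel, which is why pre-trained 2D weights remain a reasonable starting point.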
Environment preparation
git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git
cd zjut_mindvideo

# Please first install mindspore according to instructions on the official website: https://www.mindspore.cn/install

pip install -r requirements.txt
pip install -e .
Training process
from mindspore import nn
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn.metrics import Accuracy

from msvideo.utils.check_param import Validator, Rel
Dataset loading
Load the Kinetics-400 dataset through the Kinetic400 class, which is built on VideoDataset. Download the dataset to the path used below, or change the path to suit your setup. Dataset download link: https://deepmind.com/research/open-source/kinetics
from msvideo.data.kinetics400 import Kinetic400

# Data Pipeline.
dataset = Kinetic400(path='/home/publicfile/kinetics-400',
                     split="train",
                     shuffle=True,
                     seq=32,
                     seq_mode='interval',
                     num_parallel_workers=1,
                     batch_size=6,
                     repeat_num=1,
                     frame_interval=6)
ckpt_save_dir = './nonlocal'
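As a rough illustration of what seq=32 with seq_mode='interval' and frame_interval=6 could mean, the sketch below picks 32 frame indices spaced 6 frames apart. The helper interval_frame_indices is hypothetical, and both the starting frame and the clamping of short clips are assumptions; the actual sampling logic of Kinetic400 may differ.

```python
def interval_frame_indices(num_frames, seq=32, frame_interval=6, start=0):
    """Hypothetical helper: choose `seq` frame indices spaced `frame_interval` apart."""
    indices = [start + i * frame_interval for i in range(seq)]
    # Assumption: indices past the end of a short clip are clamped to its last frame.
    return [min(idx, num_frames - 1) for idx in indices]

idx = interval_frame_indices(num_frames=300)
print(len(idx), idx[:4], idx[-1])
```

With these settings a sampled clip spans 187 consecutive source frames (indices 0 to 186).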
Data processing
Use VideoShortEdgeResize to resize the video by its short edge, VideoRandomCrop to randomly crop the resized video, VideoRandomHorizontalFlip to flip it horizontally with a given probability, VideoRescale to rescale the pixel values, VideoReOrder to permute the dimensions, and finally VideoNormalize to normalize the result.
from msvideo.data.transforms import VideoRandomCrop, VideoRandomHorizontalFlip, VideoRescale
from msvideo.data.transforms import VideoNormalize, VideoShortEdgeResize, VideoReOrder

# Data Pipeline.
transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
              VideoRandomCrop([224, 224]),
              VideoRandomHorizontalFlip(0.5),
              VideoRescale(),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])]
dataset.transform = transforms
dataset_train = dataset.run()
Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
step_size = dataset_train.get_dataset_size()
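To make the last three steps of the pipeline concrete, here is a NumPy sketch of rescaling, reordering, and normalizing a clip. It assumes that VideoRescale maps pixel values from [0, 255] to [0, 1] and that the clip enters as (T, H, W, C); the function name is hypothetical and this is not the msvideo implementation.

```python
import numpy as np

def rescale_reorder_normalize(clip, mean, std):
    """clip: uint8 array of shape (T, H, W, C). Returns float32 of shape (C, T, H, W)."""
    x = clip.astype(np.float32) / 255.0   # assumed VideoRescale behaviour: [0, 255] -> [0, 1]
    x = x.transpose(3, 0, 1, 2)           # same permutation as VideoReOrder([3, 0, 1, 2])
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1, 1, 1)
    return (x - mean) / std               # per-channel normalization

clip = np.zeros((32, 224, 224, 3), dtype=np.uint8)   # an all-black dummy clip
out = rescale_reorder_normalize(clip,
                                [0.485, 0.456, 0.406],
                                [0.229, 0.224, 0.255])
print(out.shape)
```

The resulting channel-first layout (C, T, H, W) is what a 3D convolution expects once a batch dimension is added.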
Network construction
The most important structure of Nonlocal is NonlocalBlockND (an nn.Cell). This block supports four pairwise similarity functions (in the paper: Gaussian, embedded Gaussian, dot product, and concatenation). Taking dot_product as an example, it mainly applies linear transformations through three Conv3d layers. NonlocalBlockND only needs common operators such as convolution, matrix multiplication, addition, and softmax, so it is easy to integrate into a network when building models.
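The dot-product variant can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the repository's NonlocalBlockND: it handles a single sample without a batch dimension, treats each 1×1×1 convolution as a matrix multiplication over channels, and omits batch normalization and any subsampling. The function and weight names are hypothetical.

```python
import numpy as np

def nonlocal_dot_product(x, w_theta, w_phi, w_g, w_z):
    """x: (C, T, H, W). w_theta, w_phi, w_g: (C_inner, C). w_z: (C, C_inner)."""
    c, t, h, w = x.shape
    n = t * h * w
    x_flat = x.reshape(c, n)      # each space-time position becomes one column
    theta = w_theta @ x_flat      # (C_inner, N): a 1x1x1 conv is a matmul over channels
    phi = w_phi @ x_flat          # (C_inner, N)
    g = w_g @ x_flat              # (C_inner, N)
    f = theta.T @ phi             # (N, N): similarity f(x_i, x_j) for every pair
    y = (f / n) @ g.T             # (N, C_inner): weighted sum over j, normalized by N
    z = w_z @ y.T + x_flat        # (C, N): project back to C channels, add the residual
    return z.reshape(c, t, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4, 7, 7))
w_theta, w_phi, w_g = (rng.standard_normal((8, 16)) for _ in range(3))
w_z = rng.standard_normal((16, 8))
z = nonlocal_dot_product(x, w_theta, w_phi, w_g, w_z)
print(z.shape)
```

Because of the residual connection, the block reduces to the identity when w_z is zero, so it can be inserted into a pre-trained network without changing its initial behaviour.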
nonlocal3d consists of a backbone, avg_pool, flatten, and a head. It can be summarized in three parts. First, the backbone is NLResInflate3D50 (an NLInflateResNet3D), which instantiates the NLInflateResNet3D structure with stage depths [3, 4, 6, 3]. NLInflateResNet3D inherits its structure from ResNet3d50. Across the 10 layers of the 2nd and 3rd stages of ResNet3d50 [3, 4, 6, 3] (4 and 6 layers respectively), a NonlocalBlockND is inserted every other layer. Second, the output of NLResInflate3D50 is passed through average pooling and flatten. Third, the flattened tensor is fed to the classification head, Dropdensehead, which produces a tensor of shape (N, NUM_CLASSES).
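The placement rule can be checked with a short sketch. The helper below is hypothetical, and reading "every other layer" as "after the blocks with odd index" is an assumption; the exact positions used by NLInflateResNet3D should be confirmed in the repository.

```python
def nonlocal_positions(stage_depths=(3, 4, 6, 3), nl_stages=(1, 2)):
    """Return {stage index: indices of blocks that are followed by a non-local block}."""
    positions = {}
    for stage in nl_stages:
        depth = stage_depths[stage]
        # Assumption: "every other layer" means blocks 1, 3, 5, ... (0-based).
        positions[stage] = [i for i in range(depth) if i % 2 == 1]
    return positions

pos = nonlocal_positions()
print(pos)
print(sum(len(blocks) for blocks in pos.values()))
```

Under this reading the 2nd stage (4 blocks) receives 2 non-local blocks and the 3rd stage (6 blocks) receives 3, which gives 5 in total.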
from msvideo.models.nonlocal3d import nonlocal3d

# Create model.
network = nonlocal3d(in_d=32,
                     in_h=224,
                     in_w=224,
                     num_classes=400,
                     keep_prob=0.5)

from msvideo.schedule.lr_schedule import warmup_step_lr

# Set learning rate scheduler.
lr = warmup_step_lr(lr=0.0003,
                    lr_epochs=[1],
                    steps_per_epoch=step_size,
                    warmup_epochs=1,
                    max_epoch=1,
                    gamma=0.1)

# Define optimizer.
network_opt = nn.SGD(network.trainable_params(),
                     lr,
                     momentum=0.9,
                     weight_decay=0.0001)

# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

# Set checkpoint for the network.
ckpt_config = CheckpointConfig(
    save_checkpoint_steps=step_size,
    keep_checkpoint_max=1)
ckpt_callback = ModelCheckpoint(prefix='nonlocal_kinetics400',
                                directory=ckpt_save_dir,
                                config=ckpt_config)

# Init the model.
model = Model(network,
              loss_fn=network_loss,
              optimizer=network_opt,
              metrics={"Accuracy": Accuracy()})

# Begin to train.
print('[Start training `{}`]'.format('nonlocal_kinetics400'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor()],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('nonlocal_kinetics400'))
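To show what a warmup-plus-step learning rate schedule typically looks like, here is a plain-Python sketch. It is not the implementation of msvideo's warmup_step_lr: the function name, the linear warmup, and the per-step list it returns are all assumptions made for illustration.

```python
def warmup_step_schedule(base_lr, milestones, steps_per_epoch,
                         warmup_epochs, max_epoch, gamma=0.1):
    """Hypothetical schedule: linear warmup, then multiply by gamma at each milestone epoch."""
    warmup_steps = warmup_epochs * steps_per_epoch
    schedule = []
    for step in range(max_epoch * steps_per_epoch):
        if step < warmup_steps:
            # Ramp linearly from base_lr / warmup_steps up to base_lr.
            lr = base_lr * (step + 1) / warmup_steps
        else:
            epoch = step // steps_per_epoch
            passed = sum(1 for m in milestones if epoch >= m)
            lr = base_lr * gamma ** passed
        schedule.append(lr)
    return schedule

# Small illustrative values: 3 epochs of 4 steps, decay at epoch 2.
lrs = warmup_step_schedule(0.0003, milestones=[2], steps_per_epoch=4,
                           warmup_epochs=1, max_epoch=3)
print(lrs)
```

With the tutorial's settings (warmup_epochs=1, max_epoch=1), such a schedule would spend the entire single epoch in warmup.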
Evaluation process
from mindspore import context
from mindspore.train.callback import Callback


class PrintEvalStep(Callback):
    """ print eval step """

    def step_end(self, run_context):
        """ eval step """
        cb_params = run_context.original_args()
        print("eval: {}/{}".format(cb_params.cur_step_num, cb_params.batch_num))


context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

from msvideo.data.kinetics400 import Kinetic400

dataset_eval = Kinetic400(path="/home/publicfile/kinetics-400",
                          split="val",
                          shuffle=True,
                          seq=32,
                          seq_mode='interval',
                          num_parallel_workers=1,
                          batch_size=1,
                          frame_interval=6)

from msvideo.data.transforms import VideoReOrder, VideoRescale, VideoNormalize
from msvideo.data.transforms import VideoCenterCrop, VideoShortEdgeResize

transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
              VideoCenterCrop([224, 224]),
              VideoRescale(),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.255])]
dataset_eval.transform = transforms
dataset_eval = dataset_eval.run()

from mindspore import nn
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore import load_checkpoint, load_param_into_net
from msvideo.models.nonlocal3d import nonlocal3d

# Create model.
network = nonlocal3d()

# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

# Load pretrained model.
param_dict = load_checkpoint(ckpt_file_name='/home/hcx/nonlocal_mindspore/scripts/nonlocal_output_0.0003/nonlocal-1_4975.ckpt')
load_param_into_net(network, param_dict)

# Define eval_metrics.
eval_metrics = {'Loss': nn.Loss(),
                'Top_1_Accuracy': nn.Top1CategoricalAccuracy(),
                'Top_5_Accuracy': nn.Top5CategoricalAccuracy()}
print_cb = PrintEvalStep()

# Init the model.
model = Model(network, loss_fn=network_loss, metrics=eval_metrics)

# Begin to eval.
print('[Start eval `{}`]'.format('nonlocal_kinetics400'))
result = model.eval(dataset_eval,
                    callbacks=[print_cb],
                    dataset_sink_mode=False)
print(result)
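The Top-1 and Top-5 metrics reported by the evaluation count a prediction as correct when the true label is among the k highest-scoring classes. The NumPy sketch below illustrates that definition on toy data; the function topk_accuracy is hypothetical and is not the MindSpore metric implementation.

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """logits: (N, num_classes) scores; labels: (N,) integer class ids."""
    topk = np.argsort(-logits, axis=1)[:, :k]       # indices of the k largest scores per sample
    hits = (topk == labels[:, None]).any(axis=1)    # True where the label is among them
    return float(hits.mean())

# Toy scores for 3 samples and 3 classes (made-up values).
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
top1 = topk_accuracy(logits, labels, k=1)
top2 = topk_accuracy(logits, labels, k=2)
print(top1, top2)
```

In this toy example only the first sample is correct at k=1, and the second becomes correct at k=2, so the accuracy rises from one third to two thirds.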
Code
Gitee
GitHub