3D object detection (point cloud + voxel) – PV-RCNN

The paper I read today is PV-RCNN. Because it combines point cloud features and voxel features, I find it very interesting, so let's take a closer look.
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

The accuracy is very good, but the speed is about 13 FPS (SECOND runs at around 40 FPS, PointPillars at around 60), which is still relatively slow.
After all, if you inherit the advantages of a second stage, you also have to bear its shortcomings.

Paper: https://arxiv.org/pdf/1912.13192.pdf
Code: https://github.com/sshaoshuai/PCDet

Directory

    • Brief introduction
    • Contribution points
    • Overall model architecture diagram
      • Stage 1
        • Point cloud voxelization and feature extraction
        • Convert to BEV for one-stage prediction
      • Stage 2
        • Voxel Set Abstraction Module
        • Extended VSA module
        • Predicted Keypoint Weighting
        • Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement
        • 3D Proposal Refinement and Confidence Prediction
    • LOSS
      • Loss_RPN
      • L_rcnn
        • L_iou in L_rcnn
        • L_reg in L_rcnn
    • Experimental results
    • Summary
    • Reference
    • Corrections welcome

Brief introduction

PV-RCNN integrates a 3D voxel convolutional neural network (CNN) with PointNet-based set abstraction to learn more discriminative point cloud features.

It proposes RoI-grid pooling: compared with traditional pooling operations, the RoI-grid feature points encode richer contextual information, which is used to accurately estimate the confidence and location of objects.

(Similar in spirit to Fast Point R-CNN.)
Advantages of voxel-based methods:
voxel-based methods have higher computational efficiency, but lose localization accuracy due to quantization.
Advantages of pure point-based methods:
point-based methods are computationally expensive, but have larger, more flexible receptive fields (more contextual information).
Combining the advantages of point-based and voxel-based algorithms can improve 3D object detection performance.

Contribution points

1. Proposed the PV-RCNN framework, which effectively combines voxel-based and point-based methods for 3D point cloud feature learning, improving 3D object detection performance with manageable memory consumption.
2. Proposed a voxel-to-keypoint scene encoding scheme, which encodes the multi-scale voxel features of the entire scene into a small keypoint set through a Voxel Set Abstraction layer. These keypoint features not only keep accurate locations but also encode rich scene context, significantly improving 3D detection performance.
3. Proposed a multi-scale RoI feature abstraction layer for each grid point in the proposal, which aggregates richer contextual information from the scene with multiple receptive fields for accurate box refinement and confidence prediction.
4. The proposed PV-RCNN outperforms all previous methods by significant margins, ranking first on the highly competitive KITTI 3D detection benchmark and also surpassing previous methods by a large margin on the large-scale Waymo Open dataset. (The authors entered the benchmark with PV-RCNN alone and reached first place without many tricks.)

A brief description is as follows:
1. Proposed the PV-RCNN framework, based on both voxels and points, as a two-stage detector.
2. The voxel-to-keypoint scene encoding scheme encodes the multi-scale voxel features of the entire scene into a small keypoint set through the Voxel Set Abstraction layer; the keypoint set keeps both location information and contextual information.
3. For each grid point in the proposal, a multi-scale RoI feature abstraction layer aggregates information from the scene with multiple receptive fields.
4. Good results.

Two new operations are proposed:
1. Voxel-to-keypoint scene encoding (corresponding to the Voxel Set Abstraction module) summarizes all voxels of the whole-scene feature volume into a small number of feature keypoints.
2. Keypoint-to-grid RoI feature abstraction effectively aggregates the scene keypoint features onto the RoI grid.

Overall model architecture diagram


The general steps are as follows:
1. Voxelize the point cloud and use a sparse convolutional network to perform several rounds of feature extraction and downsampling. -> Step 3
2. Use the Voxel Set Abstraction module to extract multi-scale features from each layer of the sparse convolutional network. -> Step 4
3. Project the features extracted by the sparse convolutional network onto the bird's-eye view and use the BEV map to generate the one-stage detection results.
4. Use the multi-scale features obtained by the Voxel Set Abstraction module to refine the first-stage detection results and obtain the final detections.
Stage 1: steps 1 -> 3
Stage 2: steps 2 -> 4


Stage 1

Point cloud voxelization and feature extraction

The following operations correspond to step 1 above.
The point cloud is first voxelized, where the feature of each non-empty voxel is computed directly as the mean of the point-wise features of all points inside it.
Then 3D sparse convolution is used to extract voxel-level features, downsampling the spatial resolution via the stride settings, ultimately reaching 8x downsampling.
3D sparse convolution currently comes in two forms (submanifold convolution and regular sparse convolution). The author does not state which form is used in the paper, but judging from the code, submanifold sparse convolution is used (with regular sparse convolution for the strided, downsampling layers).
This part corresponds to the original figure 2:

PS: submanifold sparse convolution: the convolution output is computed only when the center of the kernel covers an active input site. Code call: spconv.SubMConv3d
Corresponding paper: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
PS: for background on the two kinds of sparse convolution, see: https://zhuanlan.zhihu.com/p/383299678

The reference code is as follows (GitHub hyperlink, around line 70 of the file):

The code snippet is as follows

self.conv_input = spconv.SparseSequential(
    spconv.SubMConv3d(input_channels, 16, 3, padding=1, bias=False, indice_key='subm1'),
    norm_fn(16),
    nn.ReLU(),
)
block = post_act_block  # helper that wraps one sparse-conv layer + norm + ReLU;
                        # conv_type selects the kind of sparse convolution: the default is
                        # submanifold convolution, 'spconv' means regular sparse convolution

self.conv1 = spconv.SparseSequential(
    block(16, 16, 3, norm_fn=norm_fn, padding=1, indice_key='subm1'),
)

self.conv2 = spconv.SparseSequential(
    # [1600, 1408, 41] <- [800, 704, 21]
    block(16, 32, 3, norm_fn=norm_fn, stride=2, padding=1, indice_key='spconv2', conv_type='spconv'),
    block(32, 32, 3, norm_fn=norm_fn, padding=1, indice_key='subm2'),
    block(32, 32, 3, norm_fn=norm_fn, padding=1, indice_key='subm2'),
)
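For context, here is a hedged sketch of what a post_act_block-style helper roughly does (my own approximation, not the exact PCDet code): it picks either a submanifold or a regular sparse convolution and follows it with normalization and ReLU.

import torch.nn as nn
import spconv.pytorch as spconv  # in older spconv/PCDet versions this is just `import spconv`

def post_act_block_sketch(in_ch, out_ch, kernel_size, norm_fn, stride=1, padding=0,
                          indice_key=None, conv_type='subm'):
    if conv_type == 'subm':
        # submanifold sparse conv: keeps the set of active sites, no dilation of sparsity
        conv = spconv.SubMConv3d(in_ch, out_ch, kernel_size, padding=padding,
                                 bias=False, indice_key=indice_key)
    else:
        # 'spconv': regular sparse conv, used with stride=2 for the downsampling layers
        conv = spconv.SparseConv3d(in_ch, out_ch, kernel_size, stride=stride,
                                   padding=padding, bias=False, indice_key=indice_key)
    return spconv.SparseSequential(conv, norm_fn(out_ch), nn.ReLU())
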
Convert to BEV for one-stage prediction

The following operations correspond to step 3 above:
After the sparse convolutions, the features are stacked along the Z axis to project the feature volume onto a 2D plane, generating a BEV map of size [W/8, L/8]. I didn't quite understand at first how the stacking works, but I figured it out later by reading the relevant code:
N, C, D, H, W = spatial_features.shape  # (batch_size, channel, depth, height, width)
spatial_features = spatial_features.view(N, C * D, H, W)  # (batch_size, channel * depth, height, width)
In fact, it is just a reshape operation that merges D (the Z-axis dimension) and C (the voxel feature channels) together.
Then, based on the bird's-eye view, an anchor-based head generates proposals and confidence scores for the detection targets.

Summary:
Here (step 1 + step 3), the first stage of detection is completed: there are W/8 x L/8 center locations, and 2 x W/8 x L/8 anchor boxes per class are generated (because two perpendicular anchor orientations, 0 and 90 degrees, are placed at each location).
The idea is still the same as Fast Point R-CNN (two stages): a rough 3D bounding box is obtained by the first stage, and what the second stage needs to do is refine and adjust the generated regions.
Unlike Fast Point R-CNN, PV-RCNN does not directly extract features from the raw point cloud, but keeps working on data in voxel form.
Probably because processing the raw point cloud directly would be too expensive, the author insists on using voxels here.
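As an assumed, concrete example (using the common KITTI setting of 0.05 m voxels over roughly a 70.4 m x 80 m range, i.e. a 1408 x 1600 BEV grid before downsampling):

\frac{W}{8} \times \frac{L}{8} = \frac{1408}{8} \times \frac{1600}{8} = 176 \times 200, \qquad 2 \times 176 \times 200 = 70{,}400 \ \text{anchors per class}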

Stage 2

Next is Section 3.2 of the paper, Voxel-to-keypoint Scene Encoding via Voxel Set Abstraction.
The key idea is that the "keypoints" act as a bridge between the voxel features and the refinement network.

Keypoint Sampling
First, keypoints are selected from the raw points through FPS (farthest point sampling). However, as 3DSSD pointed out (this paper is earlier than 3DSSD), points sampled this way are mostly background points carrying little relevant information. How is this solved? See the Predicted Keypoint Weighting module later (which, I think, plays a role similar to the refinement module in Fast Point R-CNN). A minimal FPS sketch is given below.
On the KITTI dataset, 2048 keypoints are sampled.
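As a side note, here is a minimal sketch of farthest point sampling in PyTorch (my own illustrative code with assumed names and shapes, not the PCDet implementation):

import torch

def farthest_point_sampling(points, n_samples=2048):
    # points: (N, 3) tensor of xyz coordinates, N >= n_samples (assumed input format)
    # Greedily pick the point that is farthest from the already-selected set.
    n = points.shape[0]
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(n_samples):
        selected[i] = farthest
        diff = points - points[farthest]                      # (N, 3) offsets to the new pick
        dist = torch.minimum(dist, (diff * diff).sum(dim=1))  # distance to the nearest selected point
        farthest = torch.argmax(dist).item()                  # next pick: farthest remaining point
    return points[selected]                                   # (n_samples, 3) keypoints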

Voxel Set Abstraction Module

Multi-scale semantic features from the 3D CNN feature volumes (the outputs of the several sparse convolution stages in step 1) are encoded onto the keypoints.

An aggregation method similar to the one proposed in PointNet++ is used. However, in PointNet++ each keypoint aggregates the features of surrounding raw points within a certain radius, while in VSA each keypoint aggregates the features of surrounding voxels within a certain radius. After aggregation, a PointNet-like structure performs the feature extraction.
Finally, we obtain set-abstracted features for each of the four voxel feature volumes from step 1.

Specifically, look at the formulas in the paper and the definitions of the symbols in them:

p_i: a feature point extracted through FPS, i.e. a keypoint
f_j: the j-th voxel feature from one of the sparse-conv levels in step 1 (there are four levels, indexed 0, 1, 2, 3)
v_j: the 3D position corresponding to each voxel in f_j
N_j: the number of non-empty voxels at that level
On the right-hand side of the formula, ||v_j - p_i|| is computed; on the left-hand side, what is kept is, for each high-dimensional voxel feature satisfying the distance constraint, the feature together with its relative position (the offset between the voxel and the keypoint). See the reconstruction below.

In this way we obtain the voxel feature set S for each keypoint. Next, features are extracted from it:
the index 0, 1, 2, 3 runs over the voxel feature sets from the four levels
M denotes random sampling, keeping only at most T_k voxel features and their relative coordinates from each neighborhood
G: an MLP used to encode the voxel features and relative positions (the author finds the position encoding a bit questionable)
max: max-pooling over the neighborhood for feature extraction
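Since the equation images did not survive here, the two formulas are roughly the following in LaTeX (my reconstruction from reading the paper, so treat the exact notation as approximate; l_k is the k-th sparse-conv level and r_k its neighborhood radius):

S_i^{(l_k)} = \Big\{ \big[\, f_j^{(l_k)} ;\ v_j^{(l_k)} - p_i \,\big]^{T} \ \Big|\ \big\| v_j^{(l_k)} - p_i \big\|^{2} < r_k \Big\}

f_i^{(pv_k)} = \max\Big\{ G\big( \mathcal{M}( S_i^{(l_k)} ) \big) \Big\}, \qquad k = 1, 2, 3, 4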

Extended VSA module

The author notes that the 2D bird's-eye-view features have a wider receptive field along the Z axis (after all, the Z axis has been projected away, so the BEV feature is global along Z, whereas the VSA features from before are more local along Z).
The keypoints p_i are projected into the 2D bird's-eye-view coordinate system, and bilinear interpolation is used to obtain the features f(bev)_i from the BEV feature map.
Finally, the following features are obtained (see the concatenation below):

f(pv): the four features obtained by the VSA process, concatenated together
f(raw): features from the raw point cloud around the FPS keypoint (passed through an MLP)
f(bev): the feature obtained by projecting the keypoint onto the BEV map (only x, y are used) and bilinearly interpolating.
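Putting these together, the final keypoint feature is the concatenation (again a reconstruction, following the paper's notation as I remember it):

f_i^{(p)} = \big[\, f_i^{(pv)},\ f_i^{(raw)},\ f_i^{(bev)} \,\big], \qquad i = 1, \dots, n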

Predicted Keypoint Weighting

Very similar to the corresponding network structure in Fast Point R-CNN (the difference is that here the features are reweighted, whereas Fast Point R-CNN uses the features to filter which raw points are kept).

Among the sampled keypoints, there are inevitably some that only carry background information, which we do not want. Keypoints belonging to foreground objects should contribute more to the precise refinement of the proposals, while keypoints from background areas should contribute less.
Therefore, the author proposes the PKW module:
reweighting the keypoint features with extra supervision.


The PKW module is trained with focal loss, using the default hyperparameters, to handle the imbalanced numbers of foreground and background points in the training set.
If you look closely, you can see that only two MLP layers (hidden size 256, output size 1) are actually trained here.
The prediction can be seen as: the keypoint feature goes through the two MLP layers and a sigmoid to produce a probability (a float between 0 and 1). In other words, the two MLP layers learn to judge, from the keypoint feature, the probability that the corresponding keypoint lies inside a 3D bounding box.
The label is obtained from the keypoint coordinates and the ground-truth 3D boxes: check whether the point lies inside a box and set the label accordingly (0: outside, 1: inside).
With this, focal loss training can be performed, and the PKW module implements the "weighting" of the keypoint features. A minimal sketch follows.
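A minimal sketch of such a reweighting module in PyTorch (my own illustrative code; the class name, channel sizes, and shapes are assumptions, not the PCDet implementation):

import torch
import torch.nn as nn

class KeypointWeighting(nn.Module):
    # Two small MLP layers predict a per-keypoint foreground probability,
    # which is then used to reweight the keypoint features.
    def __init__(self, in_channels=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, keypoint_features):                      # (n_keypoints, C)
        fg_prob = torch.sigmoid(self.mlp(keypoint_features))   # (n_keypoints, 1), in (0, 1)
        weighted = keypoint_features * fg_prob                 # down-weight likely background keypoints
        return weighted, fg_prob

During training, the target for fg_prob is 1 if the keypoint lies inside a ground-truth 3D box and 0 otherwise, supervised with focal loss to handle the foreground/background imbalance.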

Keypoint-to-grid RoI Feature Abstraction for Proposal Refinement

The author proposes the RoI-grid pooling module to combine keypoint features with the points in the RoI region. This part is very similar to the earlier VSA module: 6 x 6 x 6 = 216 points are selected in each proposal, called grid points. Each grid point is treated as a center point, and the PointNet++ style aggregation is used to gather keypoint features within certain radii around it (again with multiple distance thresholds to obtain multi-scale information).


Specifically, look at the formulas in the paper and the definitions of the symbols in them.
If you compare this part with the VSA module above, you will find the ideas are basically the same.

f_j: the keypoint features
p_j: the coordinates of the keypoints
g_i: the coordinates of one of the 216 grid points
On the right-hand side: the spherical radius constraint. Note that the author later mentions setting multiple radii r to keep information at different scales.
On the left-hand side, what is kept is, for each keypoint satisfying the distance constraint, its feature together with its relative position (the offset between the keypoint and the grid point g_i). See the reconstruction below.

In this way we obtain, for each grid point g_i, the set of neighboring keypoints within radius r. Next, features are extracted from it:

M denotes sampling, keeping only at most T_k keypoints and their relative coordinates
G: an MLP used to encode the keypoint features and relative positions (the author again finds the position encoding a bit questionable)
max: max-pooling for feature extraction
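As with the VSA formulas, here is my rough LaTeX reconstruction of the grid-point aggregation (the tilde marks and the radius symbol are approximate):

\tilde{\Psi} = \Big\{ \big[\, f_j^{(p)} ;\ p_j - g_i \,\big]^{T} \ \Big|\ \big\| p_j - g_i \big\|^{2} < \tilde{r} \Big\}

\tilde{f}_i^{(g)} = \max\Big\{ G\big( \mathcal{M}( \tilde{\Psi} ) \big) \Big\}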

3D Proposal Refinement and Confidence Prediction

In the end, the pooled RoI-grid features are fed through MLPs to predict the confidence (cls) and the box refinement (reg) of each proposal. A sketch of such a head follows.
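As a hypothetical sketch (layer names and sizes are my assumptions, not the PCDet values), such a refinement head could look like:

import torch.nn as nn

class ProposalRefinementHead(nn.Module):
    # The 6x6x6 RoI-grid features are flattened, passed through shared FC layers,
    # then split into a confidence branch and a box-residual branch.
    def __init__(self, grid_size=6, channels=128, box_code_size=7):
        super().__init__()
        in_dim = grid_size ** 3 * channels
        self.shared = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.cls_head = nn.Linear(256, 1)              # IoU-guided confidence score
        self.reg_head = nn.Linear(256, box_code_size)  # residuals for (x, y, z, l, w, h, theta)

    def forward(self, roi_grid_features):              # (n_rois, grid_size**3 * channels)
        x = self.shared(roi_grid_features)
        return self.cls_head(x), self.reg_head(x)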

LOSS

Loss_RPN

L_cls: focal loss for the binary foreground/background classification.
L_reg: box regression under the smooth-L1 loss function; the angle regression follows SECOND (a sketch of that encoding is given below).
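If I remember SECOND's angle treatment correctly (an assumption on my part, the blog does not spell it out), the orientation residual is encoded with a sine of the angle difference under smooth-L1, plus a separate direction classifier to resolve the remaining 180-degree ambiguity:

L_{\theta} = \mathrm{SmoothL1}\big( \sin(\theta_{\mathrm{pred}} - \theta_{\mathrm{gt}}) \big)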

L_rcnn

L_iou in L_rcnn

The tilde (wavy line) above a symbol denotes the predicted value.

where y_k is calculated as follows:

IoU_k is the IoU between the k-th first-stage RoI and its ground-truth 3D box.
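From my reading of the paper (a reconstruction, since the equation image is missing), the confidence target is an IoU-guided soft label, and L_iou is a binary cross-entropy between the predicted score and this target:

y_k = \min\big( 1,\ \max( 0,\ 2\,\mathrm{IoU}_k - 0.5 ) \big)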

L_reg in L_rcnn

Consistent with L_reg in L_rpn: box regression under the smooth-L1 loss function.

Experimental results

Summary

Author's summary: PV-RCNN is proposed, a new method for accurately detecting 3D objects from point clouds. It uses the proposed Voxel Set Abstraction layer to integrate multi-scale 3D voxel CNN features with features collected in a PointNet manner. Through these keypoint features, the boxes and scores generated by the one-stage detection are effectively refined. It achieves excellent results on the KITTI and Waymo Open datasets. Compared with previous state-of-the-art algorithms, the voxel-to-keypoint and keypoint-to-grid structures proposed in the paper effectively and significantly improve 3D object detection performance.

Review of the steps:
1. Voxelize the point cloud and use a sparse convolutional network to perform several rounds of feature extraction and downsampling. -> Step 3
2. Use the Voxel Set Abstraction module to extract multi-scale features from each layer of the sparse convolutional network. -> Step 4
3. Project the features extracted by the sparse convolutional network onto the bird's-eye view and use the BEV map to generate the one-stage detection results.
4. Use the multi-scale features obtained by the Voxel Set Abstraction module to refine the first-stage detection results.
Step 4 involves:
multi-scale features (VSA)
feature reweighting (Predicted Keypoint Weighting)
final RoI feature aggregation (keypoint-to-grid RoI feature abstraction)

My take: after reading the full text, I feel the whole paper revolves around the keypoints. You can see it from the contribution points: from how to sample keypoints, to how to aggregate the surrounding information onto keypoints, and finally how to use keypoints to refine the first-stage detections. The biggest innovation of the paper lies in these keypoints. Different from Fast Point R-CNN, the author does not directly use PointNet to extract features from the raw point cloud, but mostly keeps extracting features from voxel-level data. This not only exploits the advantages of sparse convolution and the voxel representation, speeding up the computation, but also effectively incorporates the point-based approach.

Reference

There is a good article explaining PV-RCNN on Zhihu (besides the original paper, this is what I read): https://zhuanlan.zhihu.com/p/435867918

Corrections welcome

This article is mainly my own notes, written to consolidate what I learned along the way. If it also happens to help you, even better.
My English, my ability to read papers, and my ability to read and write code are all fairly limited. If there are errors, please point them out, thank you.

Feel free to get in touch
Email: [email protected]