VoxelNeXt, a fully sparse 3D object detection network

GitHub – dvlab-research/VoxelNeXt: VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking (CVPR 2023)

https://arxiv.org/abs/2303.11301

Summary

Current 3D object detection models largely reuse 2D methods in the detection head: 3D boxes are predicted on a dense feature map from preset anchors or centers. The innovation of this paper is to exploit the sparsity of the point cloud: after features are extracted with sparse convolutions (spconv), they are not converted into a dense feature map, and 3D boxes are predicted directly on the sparse features. Experiments verify that this achieves good results on the commonly used public datasets.

1. Introduction

Take the commonly used CenterPoint model as an example. Its sparse-to-dense conversion works effectively, but it brings the following problems: wasted computation, a complicated pipeline, and the need for NMS post-processing.

The method proposed in this paper removes the center/anchor assignment on dense maps, the sparse-to-dense conversion, the RPN, NMS, and so on, and predicts directly, and only, at sparse feature positions.

Figure: FLOPs comparison between VoxelNeXt and CenterPoint.

Figure: latency of VoxelNeXt compared with CenterPoint and FSD under different detection ranges; VoxelNeXt is very friendly to long-range detection.

2. Related work

Lidar Detectors

Current 3D detectors usually follow 2D detectors, such as the R-CNN family and the CenterPoint family. Although the 3D point cloud is sparse, unlike dense 2D data, these detectors still predict on dense feature maps. The change made in this paper is to perform object prediction directly on the sparse features.

Sparse Detectors

Several sparse detectors are analyzed. Waymo's RSN first segments foreground points on the range image and then runs detection on the sparse foreground points; SWFormer and FSD are other attempts at sparse detection, but their pipelines are complicated. This paper uses ordinary sparse convolution and simplifies the pipeline as much as possible.

Other related detectors mentioned: PillarNet, RSN.

Sparse Convolution Network

Because of its high efficiency, sparse convolution is now the mainstream backbone for 3D networks. However, it is generally not used directly in the detection head. There are some attempts at improvement, such as using transformers to enlarge the receptive field; this paper instead enlarges the receptive field through additional downsampling.

3D Object Tracking

It is common to track detection results with a Kalman filter; there are also methods that directly predict velocity, like CenterTrack. This paper also uses the query voxel for association, which effectively handles the deviation of the query voxel from the object center.

3. Fully Sparse Voxel-based Network

Schematic diagram of the VoxelNeXt network structure:

3.1 backbone adaptation

additional downsampling

On top of the original downsampling strides {1, 2, 4, 8}, which produce features {F1, F2, F3, F4}, two additional stages continue downsampling to strides {16, 32}, producing {F5, F6}. The spatial resolutions of F5 and F6 are then aligned to that of F4, and the three are merged into Fc.

F denotes the sparse features and P the corresponding 3D voxel coordinates. Fc is the superposition of the features of F4, F5, and F6; the coordinates P5 and P6 are also rescaled to the resolution of P4.

# Two additional downsampling stages on top of the original backbone.
x_conv5 = self.conv5(x_conv4)  # stride 16 (F5)
x_conv6 = self.conv6(x_conv5)  # stride 32 (F6)

# Rescale the F5/F6 voxel coordinates back to the resolution of F4.
x_conv5.indices[:, 1:] *= 2
x_conv6.indices[:, 1:] *= 4
# Stack the voxels of the three stages into a single sparse tensor (Fc).
x_conv4 = x_conv4.replace_feature(torch.cat([x_conv4.features, x_conv5.features, x_conv6.features]))
x_conv4.indices = torch.cat([x_conv4.indices, x_conv5.indices, x_conv6.indices])

sparse height compression

In the conventional approach, the sparse tensor is first densified, and the z dimension is then folded into the channel dimension.

Here, the sparse features are projected directly onto the BEV plane, and features that fall on the same BEV location are summed. This is very efficient.

def bev_out(self, x_conv):
    # Project the sparse 3D voxels onto the BEV plane by dropping the z index.
    features_cat = x_conv.features
    indices_cat = x_conv.indices[:, [0, 2, 3]]  # keep (batch, y, x)
    spatial_shape = x_conv.spatial_shape[1:]

    # Voxels that land on the same BEV cell are summed together.
    indices_unique, _inv = torch.unique(indices_cat, dim=0, return_inverse=True)
    features_unique = features_cat.new_zeros((indices_unique.shape[0], features_cat.shape[1]))
    features_unique.index_add_(0, _inv, features_cat)

    x_out = spconv.SparseConvTensor(
        features=features_unique,
        indices=indices_unique,
        spatial_shape=spatial_shape,
        batch_size=x_conv.batch_size
    )
    return x_out

spatial voxel pruning

During downsampling, unimportant background voxels are pruned. This not only highlights the foreground but also improves computational efficiency.
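A minimal sketch of the idea, assuming a simplified rule that simply drops low-magnitude voxels before a downsampling stage (the paper instead suppresses the dilation of unimportant voxels; prune_low_magnitude_voxels and keep_ratio are illustrative names, not the repository's API):

import torch
import spconv.pytorch as spconv

def prune_low_magnitude_voxels(x_sp: spconv.SparseConvTensor, keep_ratio: float = 0.5):
    # Importance score of each non-empty voxel: mean absolute feature magnitude.
    magnitude = x_sp.features.abs().mean(dim=1)        # (N,)
    num_keep = max(1, int(keep_ratio * magnitude.numel()))
    keep_idx = magnitude.topk(num_keep).indices        # the most important voxels
    # Keep only those voxels; the dropped background voxels no longer
    # participate in (or dilate during) the next sparse convolution.
    x_sp = x_sp.replace_feature(x_sp.features[keep_idx])
    x_sp.indices = x_sp.indices[keep_idx]
    return x_sp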

3.2 sparse head

1. class head

Prediction: N×F => N×K (one score per class for each non-empty voxel).

Target: the voxel nearest to the center of each GT box is the positive sample.

Loss: focal loss.

Inference: sparse max pooling replaces NMS; since the voxels are already sparse, it only operates at non-empty positions (see the sketch below). An open question: what happens when two objects are very close?

Experiments show that the query voxel is not necessarily at the center of the box, and may not even lie inside the box.
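As a rough illustration of the NMS-free inference, here is a minimal sketch that brute-forces the neighbourhood check over non-empty BEV positions (the repository relies on spconv sparse max-pooling layers instead; sparse_local_max_keep, its arguments, and the single-class handling are assumptions for illustration):

import torch

def sparse_local_max_keep(scores, indices, kernel_size=3):
    # scores:  (N,) per-voxel class scores (after sigmoid)
    # indices: (N, 2) integer BEV coordinates (y, x) of the non-empty voxels
    # Returns a boolean mask keeping voxels that are local maxima within a
    # kernel_size x kernel_size window, considering non-empty positions only.
    r = kernel_size // 2
    score_at = {}
    for (y, x), s in zip(indices.tolist(), scores.tolist()):
        key = (int(y), int(x))
        score_at[key] = max(s, score_at.get(key, float("-inf")))
    keep = torch.ones(scores.shape[0], dtype=torch.bool)
    for i, ((y, x), s) in enumerate(zip(indices.tolist(), scores.tolist())):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                neigh = score_at.get((int(y) + dy, int(x) + dx))
                if neigh is not None and neigh > s:
                    keep[i] = False  # a non-empty neighbour scores higher
                    break
            if not keep[i]:
                break
    return keep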

2. regression head

Positive voxel selection: N -> n.

Prediction: n×F => n×2 (Δx, Δy), n×1 (z), n×3 (w, h, l), n×2 (cos, sin).

Loss: L1 loss.

Related code:

Forward network structure: the overall structure follows the earlier CenterHead, but the convolutions change from 2D dense convolutions to 2D submanifold sparse convolutions (SubMConv2d); the heatmap branch is still called hm.

class SeparateHead(nn.Module):
    def __init__(self, input_channels, sep_head_dict, kernel_size, init_bias=-2.19, use_bias=False):
        super().__init__()
        self.sep_head_dict = sep_head_dict

        for cur_name in self.sep_head_dict:
            output_channels = self.sep_head_dict[cur_name]['out_channels']
            num_conv = self.sep_head_dict[cur_name]['num_conv']

            fc_list = []
            for k in range(num_conv - 1):
                fc_list.append(spconv.SparseSequential(
                    spconv.SubMConv2d(input_channels, input_channels, kernel_size, padding=int(kernel_size//2), bias=use_bias, indice_key=cur_name),
                    nn.BatchNorm1d(input_channels),
                    nn.ReLU()
                ))
            fc_list.append(spconv.SubMConv2d(input_channels, output_channels, 1, bias=True, indice_key=cur_name + 'out'))
            fc = nn.Sequential(*fc_list)
            if 'hm' in cur_name:
                fc[-1].bias.data.fill_(init_bias)
            else:
                for m in fc.modules():
                    if isinstance(m, spconv.SubMConv2d):
                        kaiming_normal_(m.weight.data)
                        if hasattr(m, "bias") and m.bias is not None:
                            nn.init.constant_(m.bias, 0)

            self.__setattr__(cur_name, fc)

    def forward(self, x):
        ret_dict = {}
        for cur_name in self.sep_head_dict:
            ret_dict[cur_name] = self.__getattr__(cur_name)(x).features

        return ret_dict

Target assignment code. Previously the targets were a dense heatmap plus the encoded target boxes corresponding to the GT; now they are a sparse heatmap and the corresponding encoded target boxes.

def assign_target_of_single_head(
            self, num_classes, gt_boxes, num_voxels, spatial_indices, spatial_shape, feature_map_stride, num_max_objs=500,
            gaussian_overlap=0.1, min_radius=2
    ):
        """
        Args:
            gt_boxes: (N, 8)
            feature_map_size: (2), [x, y]

        Returns:

        """
        heatmap = gt_boxes.new_zeros(num_classes, num_voxels)

        ret_boxes = gt_boxes.new_zeros((num_max_objs, gt_boxes.shape[-1] - 1 + 1))
        inds = gt_boxes.new_zeros(num_max_objs).long()
        mask = gt_boxes.new_zeros(num_max_objs).long()

        x, y, z = gt_boxes[:, 0], gt_boxes[:, 1], gt_boxes[:, 2]
        coord_x = (x - self.point_cloud_range[0]) / self.voxel_size[0] / feature_map_stride
        coord_y = (y - self.point_cloud_range[1]) / self.voxel_size[1] / feature_map_stride

        coord_x = torch.clamp(coord_x, min=0, max=spatial_shape[1] - 0.5) # bugfixed: 1e-6 does not work for center.int()
        coord_y = torch.clamp(coord_y, min=0, max=spatial_shape[0] - 0.5) #

        center = torch.cat((coord_x[:, None], coord_y[:, None]), dim=-1)
        center_int = center.int()
        center_int_float = center_int.float()

        dx, dy, dz = gt_boxes[:, 3], gt_boxes[:, 4], gt_boxes[:, 5]
        dx = dx / self.voxel_size[0] / feature_map_stride
        dy = dy / self.voxel_size[1] / feature_map_stride

        radius = centernet_utils.gaussian_radius(dx, dy, min_overlap=gaussian_overlap)
        radius = torch.clamp_min(radius.int(), min=min_radius)

        for k in range(min(num_max_objs, gt_boxes.shape[0])):
            if dx[k] <= 0 or dy[k] <= 0:
                continue

            if not (0 <= center_int[k][0] <= spatial_shape[1] and 0 <= center_int[k][1] <= spatial_shape[0]):
                continue

            cur_class_id = (gt_boxes[k, -1] - 1).long()
            
            # The nearest voxel is selected as the query voxel
            # inds are also updated for the order of this voxel
            distance = self.distance(spatial_indices, center[k])
            inds[k] = distance.argmin()
            mask[k] = 1
            
            
            # On the sparse hm, draw hm
            if 'gt_center' in self.gaussian_type:
                centernet_utils.draw_gaussian_to_heatmap_voxels(heatmap[cur_class_id], distance, radius[k].item() * self.gaussian_ratio)

            if 'nearst' in self.gaussian_type:
                centernet_utils.draw_gaussian_to_heatmap_voxels(heatmap[cur_class_id], self.distance(spatial_indices, spatial_indices[inds[k]]), radius[k].item() * self.gaussian_ratio)
            
            # Δx, Δy: the offset between the GT center and the spatial index of the query (proxy) voxel
            ret_boxes[k, 0:2] = center[k] - spatial_indices[inds[k]][:2]
            ret_boxes[k, 2] = z[k]
            ret_boxes[k, 3:6] = gt_boxes[k, 3:6].log()
            ret_boxes[k, 6] = torch.cos(gt_boxes[k, 6])
            ret_boxes[k, 7] = torch.sin(gt_boxes[k, 6])
            if gt_boxes.shape[1] > 8:
                ret_boxes[k, 8:] = gt_boxes[k, 7:-1]

        return heatmap, ret_boxes, inds, mask

hm and box decode

def decode_bbox_from_voxels_nuscenes(batch_size, indices, obj, rot_cos, rot_sin,
                            center, center_z, dim, vel=None, iou=None, point_cloud_range=None, voxel_size=None, voxels_3d=None,
                            feature_map_stride=None, K=100, score_thresh=None, post_center_limit_range=None, add_features=None):
    batch_idx = indices[:, 0]
    spatial_indices = indices[:, 1:]
    scores, inds, class_ids = _topk_1d(None, batch_size, batch_idx, obj, K=K, nuscenes=True)

    center = gather_feat_idx(center, inds, batch_size, batch_idx)
    rot_sin = gather_feat_idx(rot_sin, inds, batch_size, batch_idx)
    rot_cos = gather_feat_idx(rot_cos, inds, batch_size, batch_idx)
    center_z = gather_feat_idx(center_z, inds, batch_size, batch_idx)
    dim = gather_feat_idx(dim, inds, batch_size, batch_idx)
    spatial_indices = gather_feat_idx(spatial_indices, inds, batch_size, batch_idx)

    if not add_features is None:
        add_features = [gather_feat_idx(add_feature, inds, batch_size, batch_idx) for add_feature in add_features]

    if not isinstance(feature_map_stride, int):
        feature_map_stride = gather_feat_idx(feature_map_stride.unsqueeze(-1), inds, batch_size, batch_idx)

    angle = torch.atan2(rot_sin, rot_cos)
    xs = (spatial_indices[:, :, -1:] + center[:, :, 0:1]) * feature_map_stride * voxel_size[0] + point_cloud_range[0]
    ys = (spatial_indices[:, :, -2:-1] + center[:, :, 1:2]) * feature_map_stride * voxel_size[1] + point_cloud_range[1]
    #zs = (spatial_indices[:, :, 0:1]) * feature_map_stride * voxel_size[2] + point_cloud_range[2] + center_z

    box_part_list = [xs, ys, center_z, dim, angle]

    if not vel is None:
        vel = gather_feat_idx(vel, inds, batch_size, batch_idx)
        box_part_list.append(vel)

    if not iou is None:
        iou = gather_feat_idx(iou, inds, batch_size, batch_idx)
        iou = torch.clamp(iou, min=0, max=1.)

    final_box_preds = torch.cat((box_part_list), dim=-1)
    final_scores = scores.view(batch_size, K)
    final_class_ids = class_ids.view(batch_size, K)
    if not add_features is None:
        add_features = [add_feature.view(batch_size, K, add_feature.shape[-1]) for add_feature in add_features]

    assert post_center_limit_range is not None
    mask = (final_box_preds[..., :3] >= post_center_limit_range[:3]).all(2)
    mask &= (final_box_preds[..., :3] <= post_center_limit_range[3:]).all(2)

    if score_thresh is not None:
        mask &= (final_scores > score_thresh)

    ret_pred_dicts = []
    for k in range(batch_size):
        cur_mask = mask[k]
        cur_boxes = final_box_preds[k, cur_mask]
        cur_scores = final_scores[k, cur_mask]
        cur_labels = final_class_ids[k, cur_mask]
        cur_add_features = [add_feature[k, cur_mask] for add_feature in add_features] if not add_features is None else None
        cur_iou = iou[k, cur_mask] if not iou is None else None

        ret_pred_dicts.append({
            'pred_boxes': cur_boxes,
            'pred_scores': cur_scores,
            'pred_labels': cur_labels,
            'pred_ious': cur_iou,
            'add_features': cur_add_features,
        })
    return ret_pred_dicts

3.3 object tracking

voxel association

The query voxel is used as a proxy for the object center, and query voxels are associated across frames by their L2 distance.
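For intuition, a minimal sketch of such an association, assuming a greedy nearest-neighbour match with a distance gate (associate_by_query_voxel, max_dist, and the gating rule are illustrative assumptions, not the tracker used in the repository):

import torch

def associate_by_query_voxel(prev_positions, cur_positions, max_dist=2.0):
    # prev_positions: (M, 2) query-voxel xy positions from the previous frame
    # cur_positions:  (N, 2) query-voxel xy positions from the current frame
    # Returns a list of (cur_idx, prev_idx or None) pairs.
    if prev_positions.numel() == 0:
        return [(i, None) for i in range(cur_positions.shape[0])]
    # Pairwise L2 distances between current and previous query voxels.
    dist = torch.cdist(cur_positions, prev_positions)  # (N, M)
    matches, used = [], set()
    for i in range(cur_positions.shape[0]):
        j = int(dist[i].argmin())
        if dist[i, j] < max_dist and j not in used:
            matches.append((i, j))     # continue the existing track j
            used.add(j)
        else:
            matches.append((i, None))  # start a new track
    return matches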
