Analysis of BEV classics Lift, Splat, Shoot

Article 1: LSS algorithm data shape flow chart – Zhihu

The Lift-Splat-Shoot (LSS) algorithm is an autonomous-driving perception algorithm proposed by NVIDIA. It works by converting multi-view camera images into feature representations in 3D space. The main idea is to generate 3D features from each camera image through “Lift”, and then project these 3D features onto a rasterized bird’s-eye-view (BEV) grid through “Splat”. Finally, interpretable end-to-end motion planning is achieved by “Shooting” template motion trajectories into the BEV cost map output by the network.

For the perception part, the key settings of the Lift-Splat-Shoot algorithm are the BEV (bird’s-eye-view) perception range, the BEV cell size, and the depth estimation range. The perception range determines how far the model senses along the x-axis and y-axis, the BEV cell size determines the resolution of each cell in the BEV grid, and the depth estimation range determines the set of discrete depths the algorithm has to estimate.
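As a concrete reference, these settings can be written as a grid configuration in the style used by the code excerpts later in this article (which read self.grid_conf['dbound'] and self.data_aug_conf['final_dim']). The values below are the ones quoted in this article; the keys xbound/ybound/zbound are assumed to follow the same naming convention as dbound.

grid_conf = {
    'xbound': [-50.0, 50.0, 0.5],   # x-axis perception range [m] and cell size -> 200 cells
    'ybound': [-50.0, 50.0, 0.5],   # y-axis perception range [m] and cell size -> 200 cells
    'zbound': [-10.0, 10.0, 20.0],  # z-axis range [m] and cell height -> 1 cell (a "pillar")
    'dbound': [4.0, 45.0, 1.0],     # discrete depth range [m] and step -> 41 depth bins
}
data_aug_conf = {
    'final_dim': (128, 352),        # image size fed to the network after preprocessing
}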

Through the Lift-Splat-Shoot algorithm, the 2D information from multiple cameras can be converted into 3D information, and the 3D information from all cameras can be fused into a unified representation of the entire scene. The algorithm achieves results superior to baseline methods and previous work on tasks such as object segmentation and map segmentation.

Overall process

Get spatial location

Construct frustum

frustum is a member of LiftSplatShoot wrapped with nn.Parameter(frustum, requires_grad=False), so it is registered with the module but does not participate in gradient updates.

frustum is a 4-dimensional tensor of shape (41, 8, 22, 3): it stores the x, y, and z coordinates of each point in the 41×8×22 pseudo point cloud. Both properties are illustrated by the short check below.
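The following minimal standalone sketch (not taken from the LSS source) verifies both points: the tensor has the expected shape, and because it is created with requires_grad=False it contributes no trainable parameters.

import torch
import torch.nn as nn

class FrustumDemo(nn.Module):
    def __init__(self):
        super().__init__()
        # A fixed lookup tensor: registered with the module but frozen
        self.frustum = nn.Parameter(torch.zeros(41, 8, 22, 3), requires_grad=False)

m = FrustumDemo()
print(m.frustum.shape)          # torch.Size([41, 8, 22, 3])
print(m.frustum.requires_grad)  # False -> excluded from gradient updates
print(sum(p.numel() for p in m.parameters() if p.requires_grad))  # 0 trainable parameters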

(The figures in the original article show the actual values of part of the x coordinates, part of the y coordinates, and the z coordinate (depth) of the frustum.)

Convert the camera coordinate system to the self-vehicle coordinate system

geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)

The shape of geom for each camera is still (41, 8, 22, 3) (with batch and camera dimensions, (B, N, 41, 8, 22, 3)), but the points are now expressed in the self-vehicle coordinate system.

Get camera features (CamEncode)

voxel_pooling

Get BEV features (BevEncode)

Article 2: Analysis of BEV classics Lift, Splat, and Shoot – Zhihu

The main content of this article:

  • BEV perception
  • LSS contribution
  • LSS workflow
  • Lift & Splat principle and related source code analysis

BEV Perception

In order to perceive the surrounding environment, self-driving cars install multiple sensors around the body, and each sensor has its own coordinate system. To make subsequent processing more convenient, the sensing data from different sensors are usually converted into a unified coordinate system. For example, in the currently most popular BEV (bird’s-eye-view) perception paradigm, features are first extracted from the data of the different sensors, these features are then transformed into a unified BEV coordinate system, and the subsequent perception tasks, such as detection and segmentation, are carried out there. As shown in the figure below, the upper part shows the pictures captured by the four cameras on the front, rear, left and right of the car, and the lower part shows the converted BEV representation.

LSS’s contribution

LSS can be regarded as the pioneering work of BEV perception. Its main contributions are as follows:

  • Proposed a method to convert images from 2D to 3D (Lift);
  • An end-to-end model is proposed that can transform image features from multiple cameras into a unified BEV space;
  • LSS is a purely visual model, which lays the foundation for subsequent research on purely visual BEV perception algorithms.

The left side of the figure above shows the images taken by different cameras, and the right side shows the result of performing semantic segmentation directly in BEV space after converting these images from different viewpoints into the BEV space. (The colored points in the images on the left are the back-projection of the BEV-space predictions onto the images.)

LSS Workflow

The entire workflow of LSS is shown above. The input of the model is n images taken from different cameras, together with the extrinsic matrix and intrinsic matrix of each camera, where the cameras are indexed k ∈ {1, 2, …, n}. The model first performs a “Lift” operation on each image (actually on its feature map), lifting it from the 2D plane into 3D space to generate a 3D frustum point cloud, and predicts context features for all points of the point cloud to obtain a context feature point cloud; it then performs a “Splat” operation on the frustum point cloud and the context feature point cloud to construct BEV features on the BEV grid, which are the output of the model; finally, after the BEV features are obtained, specific downstream tasks such as motion planning are completed through “Shoot”.
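To make this flow concrete, the sketch below strings together the functions analyzed in the rest of the article. The wiring and the attribute name bevencode are assumptions for illustration; the repository packages these steps inside its own forward pass.

def lss_forward_sketch(model, imgs, rots, trans, intrins, post_rots, post_trans):
    # imgs: (B, N, 3, 128, 352) images from the N cameras of B samples
    # Lift, part 1: frustum points of every camera, expressed in the self-vehicle frame
    geom = model.get_geometry(rots, trans, intrins, post_rots, post_trans)  # (B, N, 41, 8, 22, 3)
    # Lift, part 2: context feature point cloud predicted from the images
    feats = model.get_cam_feats(imgs)                                       # (B, N, 41, 8, 22, 64)
    # Splat: pool the features of all cameras into a single BEV grid
    bev = model.voxel_pooling(geom, feats)                                  # (B, 64, 200, 200)
    # Downstream head, e.g. BEV semantic segmentation
    return model.bevencode(bev)                                             # (B, 1, 200, 200)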

To understand the workflow of LSS, we need to understand the two operations “Lift” and “Splat”. The figure above lists the functions in the source code that correspond to these two operations; next, I will combine the source code to analyze the principles of these two operations and how they are implemented.

Lift

The camera projects the real world onto the image plane, which is a 3D to 2D process, and depth information is lost. The purpose of Lift is to restore the depth of each pixel in the image and lift the image from a 2D plane to a 3D space. This operation is divided into two steps:

[Step1] Generate the 3D frustum point cloud: let the image size be [H, W]. For each pixel, generate D discrete depth values that represent all possible depth positions of that pixel, which produces a frustum point cloud of size [D, H, W]; then use the intrinsic and extrinsic parameters of the camera to transform all frustum points into the self-vehicle coordinate system. The depth range set in the paper is [4m, 45m], with a discrete depth estimated every 1m, so each pixel generates 41 depth values, i.e. D = 41.

This step is implemented by the two functions create_frustum() and get_geometry() in the models.py file. Source code analysis:

def create_frustum(self):
    """
    Lift the image from 2D to 3D to generate a 3D frustum point cloud.
    Here, the point cloud is actually generated on the downsampled feature map, and then the coordinates of each point are mapped back to the original image.
    """
    ogfH, ogfW = self.data_aug_conf['final_dim'] # Original image size (128,352)
    fH, fW = ogfH // self.downsample, ogfW // self.downsample # Feature map size after backbone downsampling (8,22)
    # Generate a set of depth positions for each point on the feature map, shape changes: (41,) -> (41,1,1) -> (41,8,22), where dbound=[4,45,1]
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    D, _, _ = ds.shape
    # Map the x coordinate of each point back to the original image, the shape changes: (22,) -> (1,1,22) -> (41,8,22)
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
    # Map the y coordinate of each point back to the original image, the shape changes: (8,) -> (1,8,1) -> (41,8,22)
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)
    # Stack to generate a frustum point cloud, the shape is (41,8,22,3), 3 represents the 3D coordinates (x, y, d) of each point
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)

def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """
    Convert the frustum point cloud of each image from the image coordinate system to the self-vehicle coordinate system.
    rots: rotation matrix from the camera coordinate system to the self-vehicle coordinate system, (B, N, 3, 3)
    trans: translation vector from the camera coordinate system to the self-vehicle coordinate system, (B, N, 3)
    intrins: camera intrinsic matrices, (B, N, 3, 3)
    post_rots: rotation matrix introduced by image augmentation, (B, N, 3, 3)
    post_trans: translation vector introduced by image augmentation, (B, N, 3)
    """
    B, N, _ = trans.shape
    # Undo the pixel-position changes caused by data augmentation and preprocessing
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))
    # Image coordinate system -> Camera normalized coordinate system
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                       ), 5)
    combine = rots.matmul(torch.inverse(intrins))
    # Camera normalized coordinate system -> Camera coordinate system -> Self-vehicle coordinate system
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    points += trans.view(B, N, 1, 1, 1, 3)
    return points # shape (B,N,41,8,22,3)

[Step2] Generate the context feature point cloud: use EfficientNet as the backbone to extract image features, and for each point on the feature map predict a C-dimensional feature vector c and a probability distribution α over the D discrete depths (here C = 64 and D = 41); then take the outer product of the feature vector and the depth distribution to generate the context features. In effect, the context feature of each point is a 2-dimensional tensor of shape (64, 41), in which the column at depth d is the feature vector c scaled by that point’s probability α_d at depth d, i.e. formula (1) in the paper:

c_d = α_d c

As shown in the figure below, the left side is the distribution α predicted over the D discrete depths for a certain pixel of the image, the middle is the feature vector c of that pixel, and the right side is the context feature obtained through the outer product. Because the probability value at one particular depth is by far the largest, the features at that depth are the richest in the outer-product result.
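As a tiny standalone illustration of this outer product (not from the LSS source), a C-dimensional feature vector and a D-way depth distribution combine into a (C, D) context feature, which is exactly the broadcasted multiplication used in get_depth_feat() below:

import torch

C, D = 64, 41
c = torch.randn(C)                             # per-pixel feature vector
alpha = torch.softmax(torch.randn(D), dim=0)   # per-pixel depth distribution, sums to 1

context = c.unsqueeze(1) * alpha.unsqueeze(0)  # outer product, shape (C, D) = (64, 41)
print(context.shape)                           # torch.Size([64, 41])
# Column d of context is the feature vector c scaled by alpha[d]
print(torch.allclose(context[:, 0], c * alpha[0]))  # True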

The source code to implement this step is the two functions get_cam_feats() and get_depth_feat() in the models.py file. Source code analysis:

def get_depth_feat(self, x):
    """
    Generate context features for each point on the feature map and construct a context feature point cloud.
    x: Input image, shape is (B*N,3,128,352). Note that the batch and the number of images in each batch, N, are merged into one dimension.
    """
    # get_eff_depth is used to extract image features. EfficientNet is used as the backbone in the source code. The output feature map shape is (B*N,512,8,22)
    x = self.get_eff_depth(x)
    # depthnet is actually a single Conv2d. The output feature map shape is (B*N,105,8,22); the first 41 of the 105 channels are depth logits and the last 64 are features.
    x = self.depthnet(x)
    # Extract the depth map of the first 41 dimensions and perform softmax through get_depth_dist to obtain the probability distribution. The shape is (B*N,41,8,22)
    depth = self.get_depth_dist(x[:, :self.D])
    # Extract the 64-dimensional feature map and do the outer product with the depth distribution to construct the context feature point cloud. The shape is (B*N,64,41,8,22)
    new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)
    return depth, new_x

def get_cam_feats(self, x):
    """
    Construct context feature point cloud.
    x: Input image, shape is (B,N,3,128,352)
    """
    B, N, C, imH, imW = x.shape # shape(B,N,3,128,352)
    x = x.view(B*N, C, imH, imW) # Combine dimension 0 and dimension 1, shape(B*N,3,128,352)
    x = self.camencode(x) # Camencode internally calls get_depth_feat to get the context feature point cloud, shape(B*N,64,41,8,22)
    x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample) # shape(B,N,64,41,8,22)
    x = x.permute(0, 1, 3, 4, 5, 2) # shape(B,N,41,8,22,64)
    return x 

Splat

After the Lift operation, we got two 3d point clouds:

  1. Frustum point cloud: shape is (B, N, 41, 8, 22, 3), containing the position of each point in the self-vehicle coordinate system
  2. Context feature point cloud: shape is (B, N, 41, 8, 22, 64), containing the context features of each point

The purpose of Splat is to project the context features into the BEV grid and construct the BEV features. Specifically: first, the frustum point cloud is translated from the self-vehicle coordinate system into BEV grid coordinates, and the points that fall outside the boundary of the BEV grid are filtered out; since multiple points may fall into the same cell, each point is assigned a rank value, where points with the same rank value lie in the same cell of the same batch; finally, the context features falling into the same cell are sum-pooled to obtain the BEV features.

BEV grid: in the top view centered on the self-driving vehicle, cells are divided along the x-axis and y-axis directions (also called “pillars” in some papers). Each cell has a specific size, and all cells within the vehicle’s perception range constitute the BEV grid.

In the paper, the perception range in the x-axis and y-axis directions is set to -50m ~ 50m, the range in the z-axis direction is -10m ~ 10m, and the length, width and height of a cell are [0.5m, 0.5m, 20m], so the BEV grid size is 200 × 200 × 1.
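From these settings one can derive the cell size dx, the center of the first cell bx, and the number of cells per axis nx that voxel_pooling() uses below. The following sketch reproduces that computation; it mirrors the idea of the repository's grid-setup helper, but the exact helper name and layout are not quoted here.

import torch

xbound, ybound, zbound = [-50.0, 50.0, 0.5], [-50.0, 50.0, 0.5], [-10.0, 10.0, 20.0]

dx = torch.tensor([row[2] for row in (xbound, ybound, zbound)])                 # cell size per axis
bx = torch.tensor([row[0] + row[2] / 2.0 for row in (xbound, ybound, zbound)])  # center of the first cell
nx = torch.tensor([(row[1] - row[0]) / row[2] for row in (xbound, ybound, zbound)]).long()  # cells per axis

print(dx)  # tensor([ 0.5000,  0.5000, 20.0000])
print(bx)  # tensor([-49.7500, -49.7500,   0.0000])
print(nx)  # tensor([200, 200,   1])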

The source code to implement Splat operation is the voxel_pooling() function in the models.py file. Source code analysis:

def voxel_pooling(self, geom_feats, x):
    """
    Build BEV features.
    geom_feats: frustum point cloud, shape is (B,N,41,8,22,3)
    x: context feature point cloud, shape is (B,N,41,8,22,64)
    """
    B, N, D, H, W, C = x.shape
    Nprime = B*N*D*H*W
    # flatten
    x = x.reshape(Nprime, C)
    # Translate the frustum point cloud from the self-vehicle coordinate system to the BEV grid, with the upper left corner as the origin of the BEV grid.
    geom_feats = ((geom_feats - (self.bx - self.dx / 2.)) / self.dx).long()
    geom_feats = geom_feats.view(Nprime, 3)
    # Batch index of each point, shape (Nprime,1)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix, device=x.device, dtype=torch.long) for ix in range(B)])
    # Merge batch index, shape(Nprime,4)
    geom_feats = torch.cat((geom_feats, batch_ix), 1)

    # Filter out points that fall outside the BEV grid boundary. The boundaries of each dimension of the grid are [0,200), [0,200), [0,1)
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]
    geom_feats = geom_feats[kept]

    # Assign a rank value to each point. Points with the same rank indicate that they fall in the same grid of the same batch.
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
        + geom_feats[:, 1] * (self.nx[2] * B)\
        + geom_feats[:, 2] * B\
        + geom_feats[:, 3]
    # Sort the ranks and arrange points with the same rank value together. The purpose of doing this is for subsequent cumsum.
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

    # Use the cumsum trick to sum-pool the context features that fall into the same grid cell
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # Construct BEV feature map, shape(B,64,1,200,200)
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x
    # Eliminate the 2nd dimension, shape(B,64,200,200)
    final = torch.cat(final.unbind(dim=2), 1)
    return final

def cumsum_trick(x, geom_feats, ranks):
    """
    Perform sum pooling on features falling in the same cell.
    x: flat context feature, shape is (n,64)
    geom_feats: flattened frustum point indices (BEV cell indices plus batch index), shape is (n,4)
    ranks: the rank value of each point, shape is (n,)
    """
    # Find the cumulative sum of the features of all points
    x = x.cumsum(0)
    # Get the index position where the front and rear rank values are not equal
    kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
    kept[:-1] = (ranks[1:] != ranks[:-1])
    # For points with the same rank value, only the last one is retained
    x, geom_feats = x[kept], geom_feats[kept]
    # Since the cumulative sum is performed on all points before, shift and subtraction is performed here to obtain the actual feature sum of points with the same rank value.
    x = torch.cat((x[:1], x[1:] - x[:-1]))
    return x, geom_feats

Let’s use an example to understand the cumsum trick. Suppose we have a set of context features containing 5 points, with a feature dimension of 2, and find the cumulative sum:

import numpy as np

feats = np.array([[1,1], [2,2], [3,3], [4,4], [5,5]])
ft_cumsum = feats.cumsum(0)
>>> ft_cumsum: [[1,1], [3,3], [6,6], [10,10], [15,15]]

Assume that in the BEV grid, the third point and the fourth point fall in the same cell, that is, they have the same rank value:

ranks = np.array([0, 1, 2, 2, 3])
kept = np.ones(feats.shape[0], dtype=bool)
kept[:-1] = (ranks[1:] != ranks[:-1])
>>> kept: [True, True, False, True, True]

For points with the same rank value, only the last one will be retained, so the third point will be filtered out:

ft_cumsum = ft_cumsum[kept]
>>> ft_cumsum: [[1,1], [3,3], [10,10], [15,15]]

The shifts are subtracted to obtain the actual feature sum of each remaining point:

ft_cumsum = np.concatenate((ft_cumsum[:1], ft_cumsum[1:] - ft_cumsum[:-1]))
>>> ft_cumsum: [[1,1], [2,2], [7,7], [5,5]]
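The QuickCumsum.apply(...) branch in voxel_pooling() performs the same sum pooling, but as a custom torch.autograd.Function with a hand-written backward pass, so the cumulative sum does not need to be tracked by autograd. The sketch below shows how such a Function can be written; it mirrors the idea rather than quoting the repository's QuickCumsum verbatim, and the class name is chosen here for illustration.

import torch

class QuickCumsumSketch(torch.autograd.Function):
    """Sum-pool features that share a rank value, with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, geom_feats, ranks):
        x = x.cumsum(0)
        kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
        kept[:-1] = ranks[1:] != ranks[:-1]      # keep the last point of each cell
        x, geom_feats = x[kept], geom_feats[kept]
        x = torch.cat((x[:1], x[1:] - x[:-1]))   # per-cell sums via the shifted difference
        ctx.save_for_backward(kept)
        ctx.mark_non_differentiable(geom_feats)
        return x, geom_feats

    @staticmethod
    def backward(ctx, grad_x, grad_geom):
        (kept,) = ctx.saved_tensors
        # Each input point contributed to exactly one pooled output row:
        # map every point back to the index of its cell's output and copy that gradient.
        back = torch.cumsum(kept, 0)
        back[kept] -= 1
        return grad_x[back], None, None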

BevEncode

In order to perform the semantic segmentation task, BevEncode is finally applied to further encode the BEV features. BevEncode consists of a 2D Conv, the first three residual stages of resnet18, and two upsampling blocks. The final output feature map shape is (4, 1, 200, 200), where 4 is the batch size in this example.
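The following is a minimal sketch of such a structure, written for illustration only: it is a simplified re-implementation rather than the repository's exact BevEncode (the skip connections and upsampling details in the real code may differ), and it assumes torchvision's resnet18. It shows how a (B, 64, 200, 200) BEV feature map can be turned into a (B, 1, 200, 200) segmentation output.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.resnet import resnet18

class BevEncodeSketch(nn.Module):
    """Conv stem + first three resnet18 stages + two upsampling blocks (simplified)."""
    def __init__(self, inC=64, outC=1):
        super().__init__()
        trunk = resnet18()  # no pretrained weights needed for the shape check
        self.conv1 = nn.Conv2d(inC, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1, self.relu = trunk.bn1, trunk.relu
        self.layer1, self.layer2, self.layer3 = trunk.layer1, trunk.layer2, trunk.layer3
        self.up1 = nn.Sequential(                 # fuse deep features with the layer1 skip
            nn.Conv2d(256 + 64, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(                 # back to the 200x200 BEV resolution
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
            nn.Conv2d(256, 128, 3, padding=1, bias=False),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, outC, 1))

    def forward(self, x):                         # x: (B, 64, 200, 200) BEV features
        x = self.relu(self.bn1(self.conv1(x)))    # (B, 64, 100, 100)
        x1 = self.layer1(x)                       # (B, 64, 100, 100)
        x2 = self.layer3(self.layer2(x1))         # (B, 256, 25, 25)
        x2 = F.interpolate(x2, scale_factor=4, mode='bilinear', align_corners=True)  # (B, 256, 100, 100)
        x = self.up1(torch.cat([x2, x1], dim=1))  # (B, 256, 100, 100)
        return self.up2(x)                        # (B, outC, 200, 200)

print(BevEncodeSketch()(torch.zeros(4, 64, 200, 200)).shape)  # torch.Size([4, 1, 200, 200])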
