vins-mono front end (1) – image optical flow tracking

Table of Contents

Preface

feature_tracker_node detailed explanation

Preparation

img_callback() detailed explanation


Preface

The vins-mono front end has two main parts: image optical flow tracking and IMU pre-integration. Optical flow tracking associates the same map points across two adjacent image frames so that the back end can solve for the pose transformation between those frames. There are two common ways to find matching map points between adjacent frames: one is to extract feature points and match them by descriptor; the other is optical flow tracking, which relies on the grayscale-invariance assumption, solves for the motion of each feature point, and from that estimates the pixel coordinates of the matching feature point in the next frame. The feature_tracker_node node mainly implements this second approach, along with some other details. In vins-fusion, optical flow tracking is placed in the same node as back-end optimization. This article describes in detail the work done in the vins-mono front-end optical flow tracking node.

feature_tracker_node detailed explanation

Preparation

1. Read parameters

The front end uses many of vins-mono's system parameters, which are read from the parameter (yaml) file. The main front-end parameters are read as shown in the following code:

fsSettings["image_topic"] >> IMAGE_TOPIC;
    fsSettings["imu_topic"] >> IMU_TOPIC;
    MAX_CNT = fsSettings["max_cnt"];
    MIN_DIST = fsSettings["min_dist"];
    ROW = fsSettings["image_height"];
    COL = fsSettings["image_width"];
    FREQ = fsSettings["freq"];
    F_THRESHOLD = fsSettings["F_threshold"];
    SHOW_TRACK = fsSettings["show_track"];
    EQUALIZE = fsSettings["equalize"];
    FISHEYE = fsSettings["fisheye"];
    if (FISHEYE == 1)
        FISHEYE_MASK = VINS_FOLDER_PATH + "config/fisheye_mask.jpg";
    CAM_NAMES.push_back(config_file);

    WINDOW_SIZE = 20;
    STEREO_TRACK = false;
    FOCAL_LENGTH = 460;
    PUB_THIS_FRAME = false;

    if (FREQ == 0)
        FREQ = 100;

From top to bottom these are: the raw image topic name, the IMU data topic name, the maximum number of feature points to extract, the minimum pixel distance between extracted feature points, the image height and width, the front-end publishing frequency, the pixel threshold used by RANSAC when rejecting outliers, whether to visualize tracking, whether to apply histogram equalization, and the fisheye-camera flag. The remaining parameters are set directly in the code. The most notable one is FOCAL_LENGTH = 460, a virtual focal length whose purpose is explained later.
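For context, the fsSettings object in the snippet above comes from opening the config file with OpenCV's cv::FileStorage. A minimal sketch is shown below; the ROS parameter name "config_file" and the error handling are illustrative assumptions, not the exact repository code.

#include <ros/ros.h>
#include <opencv2/core/core.hpp>

void readParameters(ros::NodeHandle &n)
{
    std::string config_file;
    n.getParam("config_file", config_file);             // path to the yaml config file
    cv::FileStorage fsSettings(config_file, cv::FileStorage::READ);
    if (!fsSettings.isOpened())
        ROS_ERROR("wrong path to settings file: %s", config_file.c_str());
    // ... the fsSettings[...] reads shown above go here ...
    fsSettings.release();
}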

2. Generate camera model

A camera model is generated through trackerData[i].readIntrinsicParameter(CAM_NAMES[i]). The camera type is first read from the parameter file (a pinhole model in vins-mono), and the camera parameters from the file are then assigned to the camera model. This is done in CameraPtr CameraFactory::generateCameraFromYamlFile(const std::string& filename). That path also calls camera->setParameters(params), which records some intermediate variables that are later used for undistortion.
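A rough sketch of what the pinhole branch of this factory amounts to is shown below. It follows the naming of the camera_model (camodocal) package but is an approximation under those assumptions, not the complete factory code:

#include "camodocal/camera_models/PinholeCamera.h"

// Sketch of the pinhole branch of generateCameraFromYamlFile(): read the model
// type, build a PinholeCamera, load its intrinsics/distortion from the yaml and
// pass them to setParameters(), which also caches the inverse intrinsics
// (m_inv_K11, m_inv_K13, ...) later used when undistorting points.
camodocal::CameraPtr generatePinholeFromYaml(const std::string &filename)
{
    cv::FileStorage fs(filename, cv::FileStorage::READ);
    std::string model_type;
    fs["model_type"] >> model_type;                    // "PINHOLE" in vins-mono's config

    camodocal::PinholeCameraPtr camera(new camodocal::PinholeCamera);
    camodocal::PinholeCamera::Parameters params = camera->getParameters();
    params.readFromYamlFile(filename);                 // fx, fy, cx, cy, k1, k2, p1, p2
    camera->setParameters(params);                     // records the intermediate variables
    return camera;
}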

img_callback() detailed explanation

This is the main front-end processing function; its job is to compute the feature-point information for each incoming image.

  • When the first image frame arrives, only its information is stored and no other processing is performed;
  • Check that the image timestamps are continuous, i.e. that the time flow is normal;
  • Control the publishing frequency so that it stays below 100 Hz;
  • Convert the ROS-format image to OpenCV format;
  • Run optical flow tracking and compute the feature-point information via trackerData[i].readImage(ptr->image.rowRange(ROW * i, ROW * (i + 1)), img_msg->header.stamp.toSec());

The first parameter of this function selects the row range of the image, in preparation for stereo cameras; for a monocular camera i is 0. The second parameter is the image timestamp. The work done inside this function is described in detail below:

  • Adaptive image histogram equalization

This step mainly addresses the difficulty of extracting feature points when the image is too bright or too dark. In vins-fusion, however, the author removed this step, presumably because its practical benefit for optical flow tracking is small while it costs extra time.
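For reference, this step uses OpenCV's CLAHE (contrast-limited adaptive histogram equalization). A minimal sketch follows; the clip limit of 3.0 and the 8x8 tile size are assumptions chosen to match a typical configuration, and _img stands for the raw grayscale input frame.

#include <opencv2/imgproc/imgproc.hpp>

// Adaptive histogram equalization of the incoming grayscale frame _img into img.
cv::Mat img;
if (EQUALIZE)
{
    cv::Ptr<cv::CLAHE> clahe = cv::createCLAHE(3.0, cv::Size(8, 8));
    clahe->apply(_img, img);   // per-tile equalization with limited contrast amplification
}
else
    img = _img;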

  • For the first frame, feature points are extracted directly; from the second frame on, optical flow tracking is performed first, additional feature points are then extracted to make up the count, and outliers are removed while the related containers are "slimmed down".

Feature points are extracted with cv::goodFeaturesToTrack(forw_img, n_pts, MAX_CNT - forw_pts.size(), 0.01, MIN_DIST, mask), which extracts up to MAX_CNT - forw_pts.size() additional feature points on top of those already present in the current image. For the first frame, the PUB_THIS_FRAME flag is true and no points have been tracked yet, so the program directly extracts MAX_CNT points; MAX_CNT is also read from the parameter file.

If it is not the first frame, optical flow tracking is performed to obtain the tracked feature points. This is done by cv::calcOpticalFlowPyrLK(cur_img, forw_img, cur_pts, forw_pts, status, err, cv::Size(21, 21), 3). Its parameters are: the previous frame image (input), the current frame image (input), the feature points of the previous frame (input), the tracked feature points (output), and the status flags (output, indicating whether each feature point of the previous frame was successfully tracked in the current frame). The remaining parameters (window size, pyramid levels, etc.) matter less here; see the OpenCV documentation for details.

The next step is to "slim down" some containers, i.e. remove the entries corresponding to feature points that were not successfully tracked. The affected containers are: the feature points of the frame before last, the feature points of the previous frame, the feature points of the current frame, the feature-point IDs, the undistorted feature points of the previous frame, and the per-feature tracking counts. Afterwards, every remaining value in track_cnt is increased by 1, meaning each surviving feature point has been tracked one more time.

for (int i = 0; i < int(forw_pts.size()); i++)
    if (status[i] && !inBorder(forw_pts[i])) // the tracked point lies outside the image boundary
        status[i] = 0;
reduceVector(prev_pts, status);
reduceVector(cur_pts, status);
reduceVector(forw_pts, status);
reduceVector(ids, status);
reduceVector(cur_un_pts, status);
reduceVector(track_cnt, status);
for (auto &n : track_cnt) // update the number of times each surviving feature point has been tracked
    n++;

This "slimming" function is implemented with two indices (a read index and a write index): entries whose status is 0 are removed from v. A status value of 0 occurs in two situations: either the feature point was not tracked, or the tracked point fell outside the image boundary.

void reduceVector(vector<cv::Point2f> &v, vector<uchar> status)
{
    int j = 0;
    for (int i = 0; i < int(v.size()); i++)
        if (status[i])
            v[j++] = v[i];
    v.resize(j);
}

If the image rate satisfies the publishing frequency, outliers are first removed via the fundamental matrix, new feature points are then extracted again, and the IDs of the newly extracted feature points are set to -1.

Outlier removal via the fundamental matrix is done in rejectWithF(). This function first undistorts the feature points of the previous frame and of the current frame to obtain points on the camera's normalized plane, then re-projects them onto a pixel plane using the virtual focal length, and finally calls cv::findFundamentalMat(un_cur_pts, un_forw_pts, cv::FM_RANSAC, F_THRESHOLD, 0.99, status) to mark the outliers; the related containers are then "slimmed down" according to status, just as before. The fundamental matrix is computed here purely to find outliers, not to recover the pose. The virtual focal length lets all cameras be treated equally: once the feature points are on the normalized plane, the same scale (focal length) is used to convert them back to pixel coordinates, so the RANSAC pixel threshold is not affected by differences in camera resolution. This benefit is mentioned again later.
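A condensed sketch of this idea is shown below (simplified from the original function, not a drop-in replacement); note how the virtual focal length FOCAL_LENGTH, together with COL/2 and ROW/2, maps the normalized-plane points onto a camera-independent "virtual" pixel plane before RANSAC:

// rejectWithF() in sketch form: undistort both point sets onto the normalized
// plane, re-project them with the virtual focal length, then let RANSAC on the
// fundamental matrix mark outliers, and "slim down" the containers as before.
if (forw_pts.size() >= 8)
{
    vector<cv::Point2f> un_cur_pts(cur_pts.size()), un_forw_pts(forw_pts.size());
    for (unsigned int i = 0; i < cur_pts.size(); i++)
    {
        Eigen::Vector3d tmp_p;
        m_camera->liftProjective(Eigen::Vector2d(cur_pts[i].x, cur_pts[i].y), tmp_p);
        un_cur_pts[i] = cv::Point2f(FOCAL_LENGTH * tmp_p.x() / tmp_p.z() + COL / 2.0,
                                    FOCAL_LENGTH * tmp_p.y() / tmp_p.z() + ROW / 2.0);

        m_camera->liftProjective(Eigen::Vector2d(forw_pts[i].x, forw_pts[i].y), tmp_p);
        un_forw_pts[i] = cv::Point2f(FOCAL_LENGTH * tmp_p.x() / tmp_p.z() + COL / 2.0,
                                     FOCAL_LENGTH * tmp_p.y() / tmp_p.z() + ROW / 2.0);
    }

    vector<uchar> status;
    cv::findFundamentalMat(un_cur_pts, un_forw_pts, cv::FM_RANSAC, F_THRESHOLD, 0.99, status);
    reduceVector(prev_pts, status);
    reduceVector(cur_pts, status);
    reduceVector(forw_pts, status);
    reduceVector(cur_un_pts, status);
    reduceVector(ids, status);
    reduceVector(track_cnt, status);
}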

After outlier removal, setMask() is called, mainly to make feature extraction more uniform. The idea is to sort the related containers by how many times each feature point has been tracked, then place the kept feature points on a blank image and draw a filled black circle of radius MIN_DIST around each of them; the result is used as the mask for the next feature extraction. This does make the extracted features fairly uniform. ORB uses a quadtree instead to ensure uniformity of the extracted feature points. Intuitively, the mask approach in vins is simpler and more efficient than the quadtree implementation, but it is also cruder and may extract poor-quality feature points in texture-poor regions. Personally, I think the quadtree idea could be borrowed here as an improvement.
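A compact sketch of this mask-based scheme follows (slightly simplified; treat it as an illustration of the logic rather than the exact function body):

// setMask(): prefer long-tracked features, and reserve a MIN_DIST-radius disc
// around each kept feature so goodFeaturesToTrack() cannot place a new corner
// right next to it.
mask = cv::Mat(ROW, COL, CV_8UC1, cv::Scalar(255));

// pair each point with its track count and id, then sort by track count, descending
vector<pair<int, pair<cv::Point2f, int>>> cnt_pts_id;
for (unsigned int i = 0; i < forw_pts.size(); i++)
    cnt_pts_id.push_back(make_pair(track_cnt[i], make_pair(forw_pts[i], ids[i])));
sort(cnt_pts_id.begin(), cnt_pts_id.end(),
     [](const pair<int, pair<cv::Point2f, int>> &a,
        const pair<int, pair<cv::Point2f, int>> &b) { return a.first > b.first; });

forw_pts.clear();
ids.clear();
track_cnt.clear();
for (auto &it : cnt_pts_id)
{
    if (mask.at<uchar>(it.second.first) == 255)        // this area is still free
    {
        forw_pts.push_back(it.second.first);
        ids.push_back(it.second.second);
        track_cnt.push_back(it.first);
        // black out a disc so less-tracked or new points keep their distance
        cv::circle(mask, it.second.first, MIN_DIST, 0, -1);
    }
}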

Finally, since cv::goodFeaturesToTrack(forw_img, n_pts, MAX_CNT - forw_pts.size(), 0.01, MIN_DIST, mask) extracts new feature points on the latest frame, these points need to be added to the related containers. In addPoints(), the newly extracted points are appended to the feature-point container of the latest frame, their IDs are set to -1, and their tracking counts are set to 0.

 if (PUB_THIS_FRAME)//Image frame frequency is less than 100hz
    {
        rejectWithF();//Remove external points through the basic matrix
        ROS_DEBUG("set mask begins");
        TicToc t_m;
        setMask();
        ROS_DEBUG("set mask costs %fms", t_m.toc());

        ROS_DEBUG("detect feature begins");
        TicToc t_t;
            int n_max_cnt = MAX_CNT - static_cast<int>(forw_pts.size());//the number of feature points that still need to be newly extracted
        if (n_max_cnt > 0)
        {
            if(mask.empty())
                cout << "mask is empty " << endl;
            if (mask.type() != CV_8UC1)
                cout << "mask type wrong " << endl;
            if (mask.size() != forw_img.size())
                cout << "wrong size " << endl;
                //0.01 means that the worst feature point score cannot be less than 0.01 times the best score
            cv::goodFeaturesToTrack(forw_img, n_pts, MAX_CNT - forw_pts.size(), 0.01, MIN_DIST, mask);
        }
        else
            n_pts.clear();
        ROS_DEBUG("detect feature costs: %fms", t_t.toc());

        ROS_DEBUG("add feature begins");
        TicToc t_a;
        addPoints();
        ROS_DEBUG("selectFeature costs: %fms", t_a.toc());
    }
  • Remove distortion and calculate feature point velocity

In the front-end optical flow tracking node, the last piece of the algorithm is to undistort the feature points of the current frame and compute their velocities on the normalized plane. Undistortion in vins does not directly call OpenCV's undistortion functions; instead it exploits the characteristics of the distortion and solves it iteratively.

The de-distortion process is shown in the figure above. Point A in the first panel represents the normalized-plane coordinate of the accurate map point, and point A' represents the position of that normalized coordinate after distortion. As for why distortion moves points toward the image center: SLAM mostly uses wide-angle lenses, which usually exhibit barrel distortion, so feature points move inward after distortion.

Removing the distortion is an iterative process. First, the distorted point A' is itself passed through the distortion formula, which, as described in the 14 Lectures on Visual SLAM book, is:

$$x_{distorted} = x\left(1 + k_{1}r^{2} + k_{2}r^{4} + k_{3}r^{6}\right) + 2p_{1}xy + p_{2}\left(r^{2} + 2x^{2}\right)$$

$$y_{distorted} = y\left(1 + k_{1}r^{2} + k_{2}r^{4} + k_{3}r^{6}\right) + p_{1}\left(r^{2} + 2y^{2}\right) + 2p_{2}xy$$

Computing the distortion is a forward process and is easy to implement. Applying the formula above at position A' gives the distorted point B'. From the nature of the distortion (the further from the center, the more severe it is), the length of BB' is definitely smaller than AA'. Superimposing BB' onto A' gives point C in Figure 3, which completes one iteration; the updated coordinate C is closer to the true value A. C is then passed through the distortion formula again to obtain C', and the resulting correction CC' is smaller than BB', so each iteration shrinks the error and the estimate converges toward A (the code below runs a fixed number of such iterations, n = 8).

Note: this iterative approach has a prerequisite: barrel distortion, whose effect is that points near the edge shrink inward after distortion. If the distortion is pincushion, i.e. the distorted points spread outward, the method does not apply and the iteration diverges. I tried to modify the scheme along these lines to remove pincushion distortion but did not succeed; if anyone solves this, feel free to discuss it in the comments!

The de-distortion code is a bit technical, but it is easy to follow once the principle above is understood. In the actual code, the input parameter is a two-dimensional point in pixel coordinates, and the output is the corresponding undistorted coordinate on the normalized plane. The code is as follows:

void
PinholeCamera::liftProjective(const Eigen::Vector2d &p, Eigen::Vector3d &P) const
{
    double mx_d, my_d,mx2_d, mxy_d, my2_d, mx_u, my_u;
    double rho2_d, rho4_d, radDist_d, Dx_d, Dy_d, inv_denom_d;
    //double lambda;

    // Lift points to normalized plane
    mx_d = m_inv_K11 * p(0) + m_inv_K13;
    my_d = m_inv_K22 * p(1) + m_inv_K23;

    if (m_noDistortion)
    {
        mx_u = mx_d;
        my_u = my_d;
    }
    else
    {
        if (0)
        {
            double k1 = mParameters.k1();
            double k2 = mParameters.k2();
            double p1 = mParameters.p1();
            double p2 = mParameters.p2();

            // Apply inverse distortion model
            // proposed by Heikkila
            mx2_d = mx_d*mx_d;
            my2_d = my_d*my_d;
            mxy_d = mx_d*my_d;
            rho2_d = mx2_d + my2_d;
            rho4_d = rho2_d*rho2_d;
            radDist_d = k1*rho2_d + k2*rho4_d;
            Dx_d = mx_d*radDist_d + p2*(rho2_d + 2*mx2_d) + 2*p1*mxy_d;
            Dy_d = my_d*radDist_d + p1*(rho2_d + 2*my2_d) + 2*p2*mxy_d;
            inv_denom_d = 1/(1 + 4*k1*rho2_d + 6*k2*rho4_d + 8*p1*my_d + 8*p2*mx_d);

            mx_u = mx_d - inv_denom_d*Dx_d;
            my_u = my_d - inv_denom_d*Dy_d;
        }
        else
        {
            // Recursive distortion model
            int n = 8;
            Eigen::Vector2d d_u;
            distortion(Eigen::Vector2d(mx_d, my_d), d_u);
            // Approximate value
            mx_u = mx_d - d_u(0);
            my_u = my_d - d_u(1);

            for (int i = 1; i < n; ++i)
            {
                distortion(Eigen::Vector2d(mx_u, my_u), d_u);
                mx_u = mx_d - d_u(0);
                my_u = my_d - d_u(1);
            }
        }
    }

    // Obtain a projective ray
    P << mx_u, my_u, 1.0;
}
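The distortion() helper called in the recursive branch above is not shown in the article; a sketch consistent with the formula quoted earlier (with the k3 term omitted, matching the four distortion parameters of the pinhole model) would look roughly like this:

void PinholeCamera::distortion(const Eigen::Vector2d &p_u, Eigen::Vector2d &d_u) const
{
    // Additive distortion term d_u for a point p_u on the normalized plane:
    // distorted = p_u + d_u, following the radial-tangential model above.
    double k1 = mParameters.k1();
    double k2 = mParameters.k2();
    double p1 = mParameters.p1();
    double p2 = mParameters.p2();

    double mx2_u = p_u(0) * p_u(0);
    double my2_u = p_u(1) * p_u(1);
    double mxy_u = p_u(0) * p_u(1);
    double rho2_u = mx2_u + my2_u;                            // r^2
    double rad_dist_u = k1 * rho2_u + k2 * rho2_u * rho2_u;   // k1*r^2 + k2*r^4

    d_u << p_u(0) * rad_dist_u + 2.0 * p1 * mxy_u + p2 * (rho2_u + 2.0 * mx2_u),
           p_u(1) * rad_dist_u + 2.0 * p2 * mxy_u + p1 * (rho2_u + 2.0 * my2_u);
}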

After undistortion, the feature-point velocities are computed, and all the front-end optical flow information is ready to be sent to the back end. The velocity computation is simple: the position difference on the normalized plane divided by the time difference.
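A minimal sketch of that computation, assuming the undistorted points of the previous frame are kept in a map keyed by feature id (the variable names here are illustrative):

// Velocity of each feature on the normalized plane: position difference divided
// by the time difference; brand-new features (no previous observation) get zero.
double dt = cur_time - prev_time;
pts_velocity.clear();
for (unsigned int i = 0; i < cur_un_pts.size(); i++)
{
    std::map<int, cv::Point2f>::iterator it = prev_un_pts_map.find(ids[i]);
    if (ids[i] != -1 && it != prev_un_pts_map.end())
    {
        double v_x = (cur_un_pts[i].x - it->second.x) / dt;
        double v_y = (cur_un_pts[i].y - it->second.y) / dt;
        pts_velocity.push_back(cv::Point2f(v_x, v_y));
    }
    else
        pts_velocity.push_back(cv::Point2f(0, 0));
}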

After the feature-point IDs are updated, the normalized coordinates, pixel coordinates, feature-point IDs and feature-point velocities are published. Updating the IDs means assigning new IDs, in order, to the newly extracted feature points. The normalized coordinates are sent to the back end for pose estimation, while the pixel coordinates and feature-point velocities are used to estimate the time offset td.
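Updating the IDs can be sketched as follows (the helper name updateID and the counter n_id follow the repository's naming, treated here as assumptions); img_callback() keeps calling it with increasing indices until every tracker reports that the index is out of range:

// Assign a fresh, monotonically increasing id to every newly extracted feature
// (those still marked with -1); already-tracked features keep their old id.
bool FeatureTracker::updateID(unsigned int i)
{
    if (i < ids.size())
    {
        if (ids[i] == -1)
            ids[i] = n_id++;
        return true;
    }
    else
        return false;        // index past the end: the caller stops iterating
}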

The program code for publishing data is as follows:

 if (PUB_THIS_FRAME)
   {
        pub_count++;
        sensor_msgs::PointCloudPtr feature_points(new sensor_msgs::PointCloud);
        sensor_msgs::ChannelFloat32 id_of_point;
        sensor_msgs::ChannelFloat32 u_of_point;
        sensor_msgs::ChannelFloat32 v_of_point;
        sensor_msgs::ChannelFloat32 velocity_x_of_point;
        sensor_msgs::ChannelFloat32 velocity_y_of_point;

        feature_points->header = img_msg->header;
        feature_points->header.frame_id = "world";

        vector<set<int>> hash_ids(NUM_OF_CAM);
        for (int i = 0; i < NUM_OF_CAM; i++)
        {
            auto &un_pts = trackerData[i].cur_un_pts;
            auto &cur_pts = trackerData[i].cur_pts;
            auto &ids = trackerData[i].ids;
            auto &pts_velocity = trackerData[i].pts_velocity;
            for (unsigned int j = 0; j < ids.size(); j++)
            {
                if (trackerData[i].track_cnt[j] > 1) //The number of tracking is greater than 1
                {
                    int p_id = ids[j];
                    hash_ids[i].insert(p_id);
                    geometry_msgs::Point32 p;
                    p.x = un_pts[j].x;
                    p.y = un_pts[j].y;
                    p.z = 1;

                    feature_points->points.push_back(p);//Normalized coordinates
                    id_of_point.values.push_back(p_id * NUM_OF_CAM + i);
                    u_of_point.values.push_back(cur_pts[j].x);//pixel coordinates
                    v_of_point.values.push_back(cur_pts[j].y);
                    velocity_x_of_point.values.push_back(pts_velocity[j].x);//Feature point velocity
                    velocity_y_of_point.values.push_back(pts_velocity[j].y);
                }
            }
        }
        feature_points->channels.push_back(id_of_point);
        feature_points->channels.push_back(u_of_point);
        feature_points->channels.push_back(v_of_point);
        feature_points->channels.push_back(velocity_x_of_point);
        feature_points->channels.push_back(velocity_y_of_point);
        ROS_DEBUG("publish %f, at %f", feature_points->header.stamp.toSec(), ros::Time::now().toSec());
        // skip the first image; since no optical speed on first image
        if (!init_pub)
        {
            init_pub = 1;
        }
        else
            pub_img.publish(feature_points);
   }

At this point the work of the front-end optical flow tracking node is complete, and the overall flow is fairly simple. Finally, a few broader remarks on the front-end optical flow. First, comparing the optical flow method with the feature point + descriptor method: optical flow is indeed faster than descriptor matching, but it relies heavily on the photometric-invariance assumption and requires that the two adjacent frames do not move too much, otherwise tracking easily fails. If you want to build on the vins-mono framework for real applications, consider the requirements of your scene and decide whether to change how the front end extracts and tracks feature points.

Then there is the question of quadtree uniformization versus mask uniformization mentioned earlier. The vins approach sorts feature points by tracking count and draws filled circles around the most tracked ones, so that the next extraction avoids a neighborhood around those points. This is somewhat unreasonable when features are scarce: regions outside the circles with poor texture are effectively forced to yield feature points. By contrast, the quadtree prunes after extraction is complete, which seems more reasonable.

There are also some member variables that are not really used. For example, among the prev-related ones, only one container is used when computing the velocity; the others are unused and could be removed.

That concludes the optical flow part of the vins-mono front end. The next article will cover in detail the other important piece of front-end work: IMU pre-integration.
