TerrainFusion: Real-time digital surface model reconstruction based on monocular SLAM

Contents

  • TerrainFusion: Real-time digital surface model reconstruction based on monocular SLAM
    • Abstract
    • Introduction
    • Related work
    • Framework
      • A. Monocular SLAM
      • B. Local DSM
      • C. DSM Fusion
    • Experiment
    • Conclusion

Abstract

This paper aims to generate a real-time Digital Surface Model (DSM) during flight based on the Simultaneous Localization and Mapping (SLAM) algorithm. We process the keyframes output by the monocular SLAM system, generate a local DSM, and incrementally fuse the local DSM into a global tiled DSM. During local DSM generation, a local digital elevation model (DEM) is estimated by projecting a filtered 2D Delaunay mesh to a 3D mesh, and a local orthomosaic is obtained by projecting triangular image patches onto a 2D grid. During DSM fusion, the local DEM and orthomosaic are divided into patches and fused into the global tiled DEM and orthomosaic respectively using a multi-band algorithm. Efficient DSM generation and fusion algorithms together enable real-time reconstruction. Qualitative and quantitative experiments were conducted on public aerial image datasets covering different scenarios to verify the effectiveness of the method. Compared with traditional structure-from-motion (SfM) based methods, this system can output large-scale, high-quality DEMs and orthomosaics in real time at a lower computational cost.

Introduction

A Digital Surface Model (DSM) [1], [2], [3] is a raster-based terrain representation that includes the terrain and all natural or artificial features on the earth's surface, and it plays an important role in 3D modeling applications in aviation, urban planning, and telecommunications. In recent years, DSMs have been used increasingly widely in many fields [4], [5], and the demand for real-time DSM reconstruction is growing rapidly, for example in fire monitoring [6] and urban monitoring [7], [8]. In addition, with the rapid development of UAVs, collecting aerial images at low cost and with high mobility has become practical. Considering these two aspects, we pursue the idea of reconstructing a DSM in real time from aerial images taken during UAV flight.

Common implementations of DSM reconstruction are based on structure from motion (SfM) [9], [10], [11], [12], such as the projects OpenMVG [13] and OpenDroneMap and the commercial software Pix4DMapper [14] or Photoscan [15]. They share similar pipelines [16], [17], [18], [19]: feature detection, matching, image alignment, sparse point cloud generation via bundle adjustment, and generation of dense point clouds, meshes, and textures. However, traditional SfM methods need all images to be available before computation starts and then take hours to produce the final results, so these SfM-based reconstruction methods [13], [15], [14], [20], [18] are not suitable for real-time, incremental use.

Another attractive option for real-time reconstruction is Simultaneous Localization and Mapping (SLAM) [21], [22], [23], [24], which can output environment maps and camera poses in real time thanks to its well-designed processing pipeline. Some researchers have therefore used SLAM for real-time UAV image stitching [25] and height field reconstruction [26]. However, the method proposed in [25] only generates orthomosaics without height information, and [26] is limited to a height-map representation and requires a GPU to reach real-time speed.

In this paper, we explore incremental DSM acquisition, in which a digital elevation model (DEM) [27] and an orthomosaic [28] are generated simultaneously and fused in real time. To achieve this, we use visual SLAM to obtain camera poses and keypoints with depth information in real time and compute a sparse point cloud for each keyframe. We then combine the appearance of the captured images with the 3D structure of the sparse point cloud to generate local DEMs and orthomosaics simultaneously. Finally, we fuse them into a global tiled DEM and orthomosaic respectively. Thanks to this processing pipeline, our method, called TerrainFusion, can reconstruct a high-quality DSM of the scene in real time, as shown in Figure 1. Compared with previous work, our approach makes the following contributions:

  • A novel real-time DSM reconstruction framework based on monocular SLAM, capable of processing continuous aerial images to incrementally reconstruct a DSM.
  • A noise filtering algorithm that removes outliers and pits from the point cloud and effectively makes the resulting surface a 2D manifold.
  • A robust DSM generation algorithm that quickly generates DEMs and orthomosaics simultaneously from each keyframe of the SLAM output.
  • A tile format for managing the DSM, enabling our system to handle large-scale scenarios.
  • A DSM fusion algorithm that not only updates the global DSM effectively in real time but also ensures high-quality reconstruction.

Related work

Reconstructing three-dimensional surfaces from acquired images has been a popular research topic in recent decades, and descriptions and comparisons of traditional methods can be found in [29], [27]. In this paper, we focus on methods based on SfM and SLAM, which are summarized as follows.

A. SfM-based method

SfM, a central problem in the computer vision community concerned with estimating 3D structure from image collections [30], [31], [32], is often used as the basis for DSM reconstruction [13], [15], [14], [20], [18]. Various SfM strategies have been proposed, including incremental [9], [10], global [11], [12], and hierarchical methods [33]. Incremental manifold surface reconstruction [20] uses 3D Delaunay triangulation to reconstruct surfaces from sparse 3D point clouds computed via incremental SfM. However, [20] targets closed surfaces, which limits its use for DSM reconstruction. Poisson surface reconstruction [18] is one of the best-known 3D reconstruction methods and formulates surface reconstruction as the solution of a Poisson equation. Although the Poisson method is technically refined, it cannot reconstruct surfaces in real time.

B. SLAM-based method

Unlike traditional SfM, which does not compute in real time, SLAM emphasizes real-time estimation of the camera pose and point cloud for each keyframe. SLAM technology has developed rapidly, and various SLAM systems have been proposed, including monocular SLAM systems (keypoint-based [34], [35], [21], direct [36], [22], [37], and semi-direct methods [23], [38]), multi-sensor SLAM systems, and learning-based SLAM systems (supervised [46], [47], [48] and unsupervised [49], [50]).

In recent years, some studies [25], [26], [51] have attempted to apply SLAM to photogrammetry to obtain orthoimages or height fields. Map2DFusion [25] progressively mosaics aerial images to generate 2D maps. Although Map2DFusion, using a multi-band algorithm, can output high-quality mosaics in most cases, its plane fitting makes it perform poorly in non-planar environments, and it only generates digital orthophoto maps (DOMs) without height information. For orthomosaic generation, our method handles non-planar environments effectively by taking more of the 3D structure into account, and in particular it can reconstruct orthomosaics and height fields in real time. Zienkiewicz et al. [26] reconstruct surfaces by directly fusing depth maps and colors into a multi-resolution triangular mesh. However, this method relies on the GPU for real-time speed and only reconstructs the height field, not the texture. It also requires the camera to be close to the ground, so it is not suitable for DSM reconstruction of large-scale scenes where the camera is far from the ground. Hinzmann et al. [51] implemented a mapping framework that incrementally generates dense georeferenced point clouds, digital surface models, and orthomosaics. Their goal is to generate DSMs from dense point clouds, which may struggle in complex scenes, while our approach focuses on reconstructing large-scale DSMs in real time from sparse point clouds.

Framework

An overview of our pipeline is shown in Fig. 2. It consists of three stages: visual SLAM, local DSM generation, and DSM fusion. Starting from aerial images, we first use visual SLAM to estimate the camera pose and sparse structure (Section III-A). Then we perform 2D Delaunay triangulation on the keypoints to generate a 2D mesh and project it to a 3D mesh based on the projection relationship between the keypoints and the point cloud. Since the raw point cloud contains considerable noise that interferes with DEM generation, an iterative filtering method is used to remove it. After filtering, we compute a local DEM from the filtered 3D mesh and generate a local orthomosaic from the mesh and the image (Section III-B). In the final stage, the local DEM and orthomosaic are stitched into the corresponding global maps, and a multi-band algorithm is applied to the stitched DEM and orthomosaic for smooth transitions (Section III-C).

A. Monocular SLAM


Keypoint-based monocular SLAM is used in our system to estimate pose and sparse depth in real time. The implementation is similar to the visual SLAM and geo-referencing in [25], with some improvements that make it robust to both high-frequency video and low-frequency photo input. Since it is not the main focus of this article, we give only a brief introduction.

For each captured image, a GPU-accelerated SIFT extractor detects keypoints and extracts features in real time. The map is initialized by decomposing the relative pose from a fundamental matrix or a homography matrix. The map is converted to Earth-Centered, Earth-Fixed (ECEF) coordinates when GPS is available; otherwise the initial plane is fitted as the ground plane. After initialization, the tracking thread processes each frame and estimates the current pose by tracking the last frame. The pose is computed by solving a perspective-n-point (PnP) problem from pairs of 3D map points and 2D keypoints. The local sub-map is then projected into the current frame to obtain more PnP observations and a better pose estimate. When the relative distance between the current frame and the last keyframe exceeds a threshold, a keyframe is inserted into the mapping thread. New map points are triangulated via epipolar search, some data-association operations are performed, and then local bundle adjustment optimizes the local map points and keyframe poses under GPS constraints. The latest keyframe, with its sparse depth information, is then released for reconstruction.
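
As a concrete illustration of the PnP step, the following is a minimal sketch assuming OpenCV (the text does not name the library used); the correspondences between 3D map points and 2D keypoints are assumed to be given, and this is not the paper's actual implementation.

```cpp
#include <opencv2/calib3d.hpp>
#include <vector>

// Minimal sketch of the tracking-thread pose update: recover the camera
// pose from 3D-2D correspondences, with RANSAC rejecting bad associations.
bool estimatePose(const std::vector<cv::Point3f>& mapPoints,  // 3D map points
                  const std::vector<cv::Point2f>& keypoints,  // matched 2D keypoints
                  const cv::Mat& K,                           // camera intrinsics
                  cv::Mat& rvec, cv::Mat& tvec)               // pose output
{
    if (mapPoints.size() < 4) return false;  // PnP needs at least 4 pairs
    std::vector<int> inliers;
    return cv::solvePnPRansac(mapPoints, keypoints, K, cv::noArray(),
                              rvec, tvec, false, 100, 4.0f, 0.99, inliers);
}
```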

B. Local DSM

Unlike most SfM-based methods, which generate the entire DSM at once, our goal is to generate a local DSM for each keyframe and then incrementally fuse it into a global DSM. The visual SLAM system already outputs the pose, keypoints, and depth information of each keyframe in real time, from which the keyframe's point cloud is computed. However, outliers and pits (collectively called noise) that corrupt DSM properties are present in the keypoints and point clouds. To ensure that the generated surface is a 2D manifold, a filtering algorithm is designed to remove this noise. This section first introduces the filtering algorithm and then the implementation details of DEM and orthomosaic generation.

a) Mesh generation: To quickly generate a 3D mesh for each keyframe, we propose a 2D-to-3D method, which is the core of local DSM generation. Using the pixel coordinates of the keypoints Pi in the current keyframe, 2D Delaunay triangulation is performed to obtain the 2D triangular mesh Mi on the image plane. Based on the projection relationship between the 3D point cloud and the keypoints, the topology of Mi is transferred onto the 3D point cloud to obtain the 3D triangular mesh Mw. In the world coordinate system, Mw, whose vertices are the point-cloud points, carries geographic information. Since Mw contains outliers and pits, a filtering operation is performed to eliminate them. To later generate the orthomosaic, the elevation information of Mw is discarded to obtain a 2D triangular mesh Mh on the horizontal plane of the world coordinate system.
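
The 2D Delaunay step could look like the following minimal sketch, assuming OpenCV's cv::Subdiv2D (the actual triangulation code is not given in the text):

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Build the 2D triangular mesh Mi by Delaunay-triangulating the keypoint
// pixel coordinates of the current keyframe. Each cv::Vec6f holds one
// triangle (x1,y1,x2,y2,x3,y3); its topology is later reused for Mw.
std::vector<cv::Vec6f> triangulateKeypoints(const std::vector<cv::Point2f>& kpts,
                                            const cv::Size& imageSize)
{
    cv::Subdiv2D subdiv(cv::Rect(0, 0, imageSize.width, imageSize.height));
    for (const auto& p : kpts) subdiv.insert(p);

    std::vector<cv::Vec6f> tris, valid;
    subdiv.getTriangleList(tris);
    cv::Rect2f bounds(0.f, 0.f, (float)imageSize.width, (float)imageSize.height);
    for (const auto& t : tris) {
        // Discard triangles touching Subdiv2D's virtual outer vertices.
        if (bounds.contains({t[0], t[1]}) && bounds.contains({t[2], t[3]}) &&
            bounds.contains({t[4], t[5]}))
            valid.push_back(t);
    }
    return valid;
}
```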

b) Edge detection: Our filtering algorithm is based on geometric constraints. Edge points and non-edge points of Mh have different geometric structures, so different filtering strategies are used for each. The judgment Ee distinguishes the edge point set Pe from the non-edge point set Pn among the mesh points of Mh: a mesh point p is a non-edge point only if the edges opposite p in its adjacent triangles form a simple closed polygon. Adjacent triangles are the triangles of Mh that contain p. The red line in Figure 3 is the boundary of Mh, formed by connecting the edge points Pe.

c) Noise filter: Since Pe contains both outliers and pits, each edge point is judged twice to decide whether it is noise. The first judgment Ee1 mainly detects pits, while the second judgment Ee2 mainly detects outliers. In the first judgment, the edges opposite the edge point p in its adjacent triangles form a polyline (when p has only one adjacent triangle, the polyline is a single segment). Two lines are then obtained by connecting p to the two endpoints of the polyline. If either of these lines intersects any triangle that does not have an adjacent point of p as a vertex, we consider p to be edge noise. The adjacent points of p are the other two vertices of every triangle in Mh that contains p. In Figure 3, P1 is edge noise. In the second judgment, a segment is formed by connecting the two endpoints of the polyline from the first judgment. We then take the concentric circle whose radius is k times that of the circle having this segment as its diameter. If the edge point lies outside this enlarged circle, we judge it to be detached from the mesh and treat it as noise. k is a parameter, which we empirically set to 1.4; in Figure 3, P2 is such a noise point. For the judgment En on non-edge points: for every point p in Pn, if p lies outside the closed polygon formed by the edges opposite p in its adjacent triangles, it is noise. In Figure 3, P3 is such a noise point.
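
The second judgment Ee2 reduces to a simple geometric test; a minimal sketch, assuming OpenCV point types (the paper's actual code is not shown):

```cpp
#include <opencv2/core.hpp>

// Second edge-point judgment Ee2 (outlier test): f and g are the two
// endpoints of the polyline opposite the edge point p. p is treated as
// noise if it lies outside the concentric circle whose radius is k times
// that of the circle with segment fg as its diameter.
bool isEdgeOutlier(const cv::Point2f& p, const cv::Point2f& f,
                   const cv::Point2f& g, float k = 1.4f)
{
    cv::Point2f center = 0.5f * (f + g);       // circle center: midpoint of fg
    float radius = 0.5f * (float)cv::norm(f - g);  // base radius: half of |fg|
    return cv::norm(p - center) > k * radius;  // outside the enlarged circle
}
```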


Figure 3. Different strategies for filtering noise. The 2D mesh Mh (blue and red) is generated by discarding the elevation of the 3D mesh Mw; the red line is the boundary of Mh. P1 is edge noise because the segments P1a and P1b intersect Δcdh, a triangle that does not have an adjacent point of P1 as a vertex. P2 is edge noise because it lies outside the concentric circle (thick green) whose radius is k times that of the circle (thin green) having segment fg as its diameter. P3 is non-edge noise because it lies outside the closed polygon cde formed by the edges opposite P3 in its adjacent triangles.

Figure 4. Comparison of the DSM (top) and orthomosaic (bottom) with and without the noise filter, using data from the mavic-factory sequence.

The iterative method shown in Algorithm 2 filters out edge noise and non-edge noise. After generating the mesh and filtering once, the updated keypoint set Pi and point cloud Pc are used to regenerate the mesh and filter again, repeating until no noise is detected. In addition, the edge-noise filter is embedded in the iterations of the non-edge-noise filter, so that noise of one type exposed after the other filter changes the mesh structure is still removed. Figure 4 shows that our filtering algorithm removes noise effectively.
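
A hedged sketch of the iteration structure described above (Algorithm 2 itself is not reproduced in the text; Mesh, buildDelaunayMesh, filterEdgeNoise, and filterNonEdgeNoise are hypothetical stand-ins for the paper's routines):

```cpp
#include <opencv2/core.hpp>
#include <vector>

// Hypothetical stand-ins for the paper's routines (not real APIs):
struct Mesh { /* triangles and adjacency of Mh */ };
Mesh buildDelaunayMesh(const std::vector<cv::Point2f>& kpts);
bool filterEdgeNoise(Mesh& mh, std::vector<cv::Point2f>& kpts,
                     std::vector<cv::Point3f>& cloud);      // judgments Ee1, Ee2
bool filterNonEdgeNoise(Mesh& mh, std::vector<cv::Point2f>& kpts,
                        std::vector<cv::Point3f>& cloud);   // judgment En

// Iterate mesh generation and filtering until no noise is detected.
void filterKeyframe(std::vector<cv::Point2f>& keypoints,
                    std::vector<cv::Point3f>& pointCloud)
{
    bool noiseFound = true;
    while (noiseFound) {
        Mesh mh = buildDelaunayMesh(keypoints);  // regenerate Mh each pass
        // The edge filter runs inside the non-edge iterations so that noise
        // exposed when one filter changes the mesh is caught by the other.
        bool e = filterEdgeNoise(mh, keypoints, pointCloud);
        bool n = filterNonEdgeNoise(mh, keypoints, pointCloud);
        noiseFound = e || n;                     // stop when nothing was removed
    }
}
```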

d) Orthomosaic generation: Rather than warping whole images, we project triangular image patches onto Mh to generate the orthomosaic. Since Mi has already divided the current keyframe's image into a set of triangular patches, a local orthomosaic is obtained by projecting those patches onto Mh.

e) DEM generation: To make the later fusion more efficient, we convert the irregular 3D triangular mesh Mw into a local tiled DEM. The heights of the local DEM are computed by interpolating the heights of the Mw mesh points. As shown in Figure 5, our implementation uses linear interpolation for its speed and stability.
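
Linear interpolation inside a triangle of Mw can be expressed with barycentric coordinates; a minimal sketch (an illustration, not the paper's code):

```cpp
#include <opencv2/core.hpp>

// Linearly interpolate a DEM grid point's height from the triangle of Mw
// that covers it, using barycentric coordinates in the horizontal plane.
// a, b, c are triangle vertices (x, y = ground plane, z = elevation);
// (x, y) is the DEM grid point, assumed to lie inside the triangle.
float interpolateHeight(const cv::Point3f& a, const cv::Point3f& b,
                        const cv::Point3f& c, float x, float y)
{
    float d  = (b.y - c.y) * (a.x - c.x) + (c.x - b.x) * (a.y - c.y);
    float w0 = ((b.y - c.y) * (x - c.x) + (c.x - b.x) * (y - c.y)) / d;
    float w1 = ((c.y - a.y) * (x - c.x) + (a.x - c.x) * (y - c.y)) / d;
    float w2 = 1.0f - w0 - w1;                 // weights sum to one
    return w0 * a.z + w1 * b.z + w2 * c.z;     // weighted elevation
}
```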


Figure 5. Top view of the conversion from the 3D triangular mesh Mw (red lines), whose mesh points are point-cloud points, to the local tiled DEM (black lines). Different background colors indicate that the local DEM is divided and stored in different tiles (schematic; only 4×4 cells per tile are drawn for clarity). a, b, c, d, e, and f are vertices of Mw, and P1, P2, and P3 are three grid points of the local DEM. The height of P1 equals the height of a; the height of P2 is interpolated from the heights of the two endpoints of segment bc; the height of P3 is interpolated from the heights of Δdef.

C. DSM Fusion

When a local DSM is generated, the global DSM must be updated. To achieve high-quality stitching, a weight image is computed adaptively by considering the distribution of the sparse point cloud in the DEM and the viewpoint of the current keyframe. The parts of the local DSM with larger weight values are stitched into the global DSM tile by tile. Finally, a multi-band algorithm is applied to the stitched DSM tiles to achieve smooth transitions.

a) Fusion weights: In our implementation, the weights used for DSM stitching are stored as a map, a single-channel unsigned-char image. Each pixel value of the weight image is computed as follows. The arithmetic mean of the coordinates of all mesh points of the current keyframe's Mh gives the center coordinate o. The weight at o is set to 255, while the weight at o′, the position furthest from o, is set to 0. Given the pixel distance ζ between o and o′, the gradient is ρ = 255.0/ζ. Finally, from the gradient ρ and the pixel distances {disi} between the pixels {pi} and o, the weight values {wi} of all pixels are computed as wi = ρ × (ζ − disi).
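
This weight computation is straightforward to express in code; a minimal sketch assuming OpenCV (illustrative only; the image layout is an assumption):

```cpp
#include <opencv2/core.hpp>

// Fill the single-channel weight image: 255 at the mesh center o, falling
// linearly to 0 at the furthest mesh point o', i.e. w_i = rho * (zeta - dis_i)
// with rho = 255 / zeta.
void computeWeightImage(cv::Mat& weight,      // CV_8UC1, sized like the patch
                        const cv::Point2f& o, // mean of the Mh mesh points
                        float zeta)           // pixel distance from o to o'
{
    const float rho = 255.0f / zeta;
    for (int r = 0; r < weight.rows; ++r)
        for (int c = 0; c < weight.cols; ++c) {
            float dis = (float)cv::norm(cv::Point2f((float)c, (float)r) - o);
            float w = rho * (zeta - dis);                 // linear falloff
            weight.at<uchar>(r, c) = cv::saturate_cast<uchar>(w);
        }
}
```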

b) Stitching: Most online maps are served in tiled form, and a tiled DSM also benefits the later blending, so we manage the DSM in a tile format. In our system, the local DEM is divided into rectangular patches, and each patch is stored in a separate tile. The local orthomosaic and weight image are segmented in the same way, with their patches stored in the corresponding tiles. A DSM tile therefore contains several layers: a DEM patch, an orthomosaic patch, and a weight-image patch. We compute the local mask by comparing the local tiled weight image with the global tiled weight image, and based on this mask we stitch the local tiled DEM and orthomosaic into the global ones.
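
The mapping from world coordinates to tile indices might look like the following sketch (worldToTile and tileSize are hypothetical; the text does not specify the tile scheme):

```cpp
#include <cmath>

// Hypothetical tile lookup: world coordinates map to integer tile indices
// so that each local DEM/orthomosaic/weight patch lands in exactly one tile.
// tileSize is the tile edge length in world units (a free parameter here).
struct TileIndex { int tx, ty; };

TileIndex worldToTile(double x, double y, double tileSize)
{
    return { (int)std::floor(x / tileSize),
             (int)std::floor(y / tileSize) };
}
```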

c) Blending: Although the above steps enable real-time DSM reconstruction, visible seams remain in the stitched DSM, as shown in Fig. 6(a) and (c). This has two causes: 1) stitching orthomosaics generated from differently exposed images produces a slight color deviation around the stitching line; 2) the stitched DEM also has a slight offset around the stitching line.

To smooth out the color deviations and offsets, we apply a multi-band algorithm to both the stitched DEM and the orthomosaic. To implement it, our system adds the following steps. Two Laplacian pyramids are built from the local DEM and local orthomosaic using the expand operation [52]; we call these the local DEM pyramid and the local orthomosaic pyramid. Correspondingly, a weight pyramid is built from the weight image by downsampling. The three pyramids are likewise divided into rectangular sub-pyramids, which are stored in the corresponding tiles. A local mask pyramid is computed by comparing the local tiled weight pyramid with the global one. Instead of stitching the local tiled DEM directly, we stitch its pyramid into the global DEM pyramid according to the mask pyramid, and likewise for the local orthomosaic pyramid. Finally, the DEM and orthomosaic are recovered by summing the levels of their global tiled Laplacian pyramids from coarse to fine.
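
The per-tile multi-band blend can be sketched as follows, assuming OpenCV and single-channel CV_32F layers (a simplified illustration: the actual system blends sub-pyramids stored in tiles, and the mask comparison against the global weight pyramid is omitted here):

```cpp
#include <opencv2/imgproc.hpp>
#include <vector>

// Build a Laplacian pyramid: each level is the band-pass residual between
// the image and its blurred/upsampled version; the last level holds the
// low frequencies.
static void buildLaplacianPyramid(const cv::Mat& img, int levels,
                                  std::vector<cv::Mat>& pyr)
{
    cv::Mat cur = img;
    for (int i = 0; i < levels; ++i) {
        cv::Mat down, up;
        cv::pyrDown(cur, down);
        cv::pyrUp(down, up, cur.size());
        pyr.push_back(cur - up);
        cur = down;
    }
    pyr.push_back(cur);
}

// Blend local and global layers (DEM or orthomosaic) level by level with
// the resized weight mask, then collapse the pyramid from coarse to fine.
cv::Mat blendTile(const cv::Mat& local, const cv::Mat& global,
                  const cv::Mat& weightLocal /* CV_8UC1 */, int levels)
{
    std::vector<cv::Mat> lp, gp;
    buildLaplacianPyramid(local, levels, lp);
    buildLaplacianPyramid(global, levels, gp);
    cv::Mat w;
    weightLocal.convertTo(w, CV_32F, 1.0 / 255.0);   // weights in [0, 1]

    cv::Mat out;
    for (int i = (int)lp.size() - 1; i >= 0; --i) {  // coarse to fine
        cv::Mat wi, winv, level;
        cv::resize(w, wi, lp[i].size());
        winv = 1.0 - wi;
        level = lp[i].mul(wi) + gp[i].mul(winv);     // per-level weighted blend
        if (out.empty()) {
            out = level;
        } else {
            cv::Mat up;
            cv::pyrUp(out, up, level.size());
            out = up + level;
        }
    }
    return out;
}
```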

Experiment

We implemented the proposed real-time DSM reconstruction system in C++ and evaluated its performance in terms of computational efficiency and accuracy. We first tested the system on public datasets and present three DSMs reconstructed from different environments; more results are shown at https://shaxikai.github.io/TerrainFusion. We then compared the mountainous-area orthomosaics generated by TerrainFusion and Map2DFusion. Since our system is, to the best of our knowledge, the first open-source online DSM reconstruction system, no direct online counterpart exists, so we compare against two existing offline software packages, Photoscan and Pix4DMapper. Since no dataset contains both ground-truth heights and corresponding aerial images, we finally performed a cross-comparison of the surfaces generated by our method, Photoscan, and Pix4DMapper to analyze the accuracy of our method. The results show that our system achieves robust, comparable-quality, and more efficient real-time DSM reconstruction.

All experiments were conducted on a desktop computer equipped with an Intel i7-7700 CPU, 16GB RAM, and a GTX 1060 GPU. We run our method and Photoscan in a 64-bit Linux system, while Pix4DMapper executes on a 64-bit Windows 10.

A. Results from the TerrainFusion dataset

To evaluate our system, we created a TerrainFusion dataset containing several scenes, including cities, mountains, deserts, and plains. The resolution of aerial images is 4000×3000. They are publicly available for download from the website, where demo videos are also available.
In the first experiment, we compared the quality of terrain mosaics generated by TerrainFusion and Map2DFusion [25]. For planar environments, both methods output nearly identical orthomosaics, and further comparisons with Photoscan or Pix4DMapper can be found in [25]. However, in non-planar environments, especially mountainous areas with steep slopes, the plane fitting in Map2DFusion may produce ghosting, as shown in Figure 7, whereas our method uses more of the 3D structure from the point cloud to generate orthomosaics and thus avoids ghosting.
To illustrate the practicality of our approach, experiments were conducted under various conditions. For comparison, we present the reconstruction results of our method and Photoscan in Fig. 8. The comparison shows that our method achieves DSM reconstructions of similar, acceptable quality to Photoscan's. More reconstruction results are available on the website; they show that our method performs well not only on natural features but also achieves acceptable reconstructions of urban and rural areas.

B. Efficiency comparison

We compared the execution efficiency of our method, Photoscan, and Pix4DMapper on the TerrainFusion dataset; the timing statistics are listed in Table I. For a fair comparison, we detected roughly the same number of keypoints (around 1000) per keyframe for all three methods and used default values for the other parameters. Unlike the offline reconstruction of Photoscan and Pix4DMapper, our method reconstructs incrementally, so the time reported in Table I for our method is the total time spent generating and fusing DSMs over all keyframes. Even though ours is online reconstruction while Photoscan and Pix4DMapper can perform offline, batch-optimized reconstruction, our method takes far less time than either.

C. Quantitative comparison

In the quantitative comparison, the surface error between the different methods (TerrainFusion, Photoscan, and Pix4DMapper) was computed to analyze the accuracy of our method. We sample points {pi} on the model and obtain the height values {h′i} of {pi} on the surface generated by one method, and similarly the height values {hi} on the surface generated by another. We then use |hi − h′i| as the error at {pi} between the two surfaces. The error distributions and statistics for TerrainFusion against the two software packages on the mavic-fengniao data are shown in Fig. 9(a) and (b). Between our method and Photoscan, the error is less than 2.0 m for 90.13% of the sample points and less than 1.0 m for 62.92%. Between our method and Pix4DMapper, the error is less than 2.0 m for 88.96% of the sample points and less than 1.0 m for 62.32%. The error distributions in Figure 9(a) and (b) are clearly similar.
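
The cross-comparison statistic is simple to compute; a minimal sketch (illustrative; sampling the surfaces and looking up heights are assumed to be done elsewhere):

```cpp
#include <cmath>
#include <vector>

// Fraction of sample points whose height error |h_i - h'_i| between two
// reconstructed surfaces is below a threshold (e.g. 1.0 m or 2.0 m).
double fractionBelow(const std::vector<double>& h,
                     const std::vector<double>& hPrime, double threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < h.size(); ++i)
        if (std::fabs(h[i] - hPrime[i]) < threshold) ++count;
    return (double)count / (double)h.size();
}
```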

To complete the evaluation, we also computed the error between the surfaces generated by Photoscan and Pix4DMapper, as shown in Fig. 9(c): the error is less than 2.0 m for 92.35% of the sample points and less than 1.0 m for 81.56%. The error between the two software packages is clearly smaller than the error between our method and either of them. The red bar on the right of Figure 9 marks sample points with errors of 8.0 m or more. When extracting keypoints, one method may extract keypoints near the image edges and thus reconstruct the surface there, while another may extract none there and leave that surface unreconstructed, so surfaces reconstructed by different methods differ most at the edges. As the error distributions in Figure 9 show, almost all of these large errors occur at the edges.

In summary, we achieve online DSM reconstruction from sparse point clouds; trading a small amount of accuracy for real-time speed is therefore acceptable.

Conclusion

We propose a new method for real-time DSM reconstruction from the sparse point clouds generated by monocular SLAM. Although the accuracy of the proposed method does not surpass that of commercial software, it is fast enough for online applications to reconstruct a DSM in real time. In the future, we will work on densifying the sparse point clouds and reconstructing high-fidelity DSMs from dense point clouds in real time.