[Image segmentation] Satellite remote sensing image road segmentation: D-LinkNet algorithm interpretation

Foreword

Because part of my final project involves road segmentation of satellite remote sensing images, I did some research on the related algorithms.
The dataset used in this article is DeepGlobe, which comes from a CVPR 2018 challenge: the DeepGlobe Road Extraction Challenge.
D-LinkNet is the winning algorithm of that challenge.

Considering that the original D-LinkNet development environment is rather old (Python 2.7, PyTorch 0.2.0), I refactored the project. The specific work is as follows:

  • Updated Python 2 syntax so the code runs in a Python 3.8 environment
  • Removed the multi-GPU training part (DataParallel) to make the code clearer and easier to read
  • Added a model evaluation script (eval.py) with an mIoU metric to measure model performance (a sketch of the metric is given after this list)
  • Added a new algorithm, NL-LinkNet, and provided the related training results
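
For reference, mIoU for binary road segmentation averages the intersection-over-union of the road and background classes. Below is a minimal NumPy sketch of that computation; the function name and the 0.5 threshold are my own illustrative choices, not necessarily what eval.py uses.

import numpy as np

def binary_miou(pred, target, threshold=0.5, eps=1e-7):
    """Mean IoU over the road and background classes for one predicted/ground-truth mask pair."""
    pred = pred >= threshold
    target = target >= threshold
    ious = []
    for p, t in ((pred, target), (~pred, ~target)):  # road class, then background class
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        ious.append((inter + eps) / (union + eps))
    return float(np.mean(ious))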

Currently the repository supports the following segmentation algorithms:

  • UNet
  • D-UNet
  • LinkNet
  • D-LinkNet
  • NL-LinkNet

Project address: https://github.com/zstar1003/Road-Extraction

Introduction to DeepGlobe dataset

DeepGlobe dataset download address: https://pan.baidu.com/s/1chOnMUIzcKUzQr1LpuJohw?pwd=8888

The dataset contains 6226 training images; each image is 1024×1024 pixels with a ground resolution of 0.5 m/pixel.

Data preview:

D-LinkNet network structure

In the field of satellite remote sensing road segmentation, the commonly used segmentation algorithms were released in roughly the following order:
FCN (2015) -> UNet (2015) -> LinkNet (2017) -> D-LinkNet (2018) -> NL-LinkNet (2019) -> …

The network structure of D-LinkNet is shown in the figure below:

The overall structure of this network is similar to UNet, with some small improvements added to that architecture, such as residual blocks and dilated (atrous) convolutions. The most notable addition is the TTA (Test Time Augmentation) strategy, i.e. augmentation at inference time, which is explained in detail later.
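
To make the "dilated convolution" part concrete, here is a minimal sketch of a D-LinkNet-style center block: a cascade of dilated 3×3 convolutions whose intermediate outputs are summed. The channel count, dilation rates, and class name are illustrative assumptions, not copied from the official implementation.

import torch
import torch.nn as nn

class DilatedCenterBlock(nn.Module):
    """Cascade of 3x3 convolutions with growing dilation rates; the intermediate
    outputs are summed, so the block mixes several receptive-field sizes
    without changing the spatial resolution."""
    def __init__(self, channels=512, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])

    def forward(self, x):
        out = x
        cascade = x
        for stage in self.stages:
            cascade = stage(cascade)  # each dilated conv is applied on top of the previous one
            out = out + cascade       # accumulate the intermediate outputs
        return out

if __name__ == '__main__':
    block = DilatedCenterBlock(channels=512)
    x = torch.randn(1, 512, 32, 32)
    print(block(x).shape)  # torch.Size([1, 512, 32, 32])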

Modifying the model layer names

Since I removed the DataParallel multi-GPU training wrapper, loading the official pretrained weights directly raises an error:

RuntimeError: Error(s) in loading state_dict for DinkNet34:
Missing key(s) in state_dict: "firstconv.weight", "firstbn.weight", "firstbn.bias",
Unexpected key(s) in state_dict: "module.firstconv.weight", "module.firstbn.weight", "module.firstbn.bias"

This is because the layer names do not match: every key in the saved weight file carries an extra module. prefix, which DataParallel adds. So I wrote a conversion script, utils/turn_model.py:

import collections
import torch

if __name__ == '__main__':
    path = '../weights/log01_dink34.th'
    state_dict = torch.load(path)
    # Strip the 'module.' prefix that DataParallel added to every layer name
    new_state_dict = collections.OrderedDict(
        [(k[7:], v) if k.startswith('module.') else (k, v) for k, v in state_dict.items()]
    )
    torch.save(new_state_dict, "../weights/dlinknet.pt")
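
After conversion, the stripped weights can be loaded into the plain (non-DataParallel) model. A minimal sketch, assuming the DinkNet34 class lives in the repository's networks module (the import path is an assumption):

import torch
from networks.dinknet import DinkNet34  # import path assumed from the repository layout

net = DinkNet34()
net.load_state_dict(torch.load('../weights/dlinknet.pt'))
net.eval()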

TTA strategy

The idea of TTA is to apply data augmentation at test time. Segmenting a picture only once may give a limited result; instead, the picture is also rotated, flipped, and otherwise augmented, each variant is segmented, and finally all the segmentation results are superimposed.

Let's walk through the program's logic in order:

First, after the program loads an image, img is the original image and img90 is the image rotated 90 degrees counterclockwise. The relevant code:

def segment(self, path):
    img = cv2.imread(path)
    img = cv2.resize(img, resize_settings)                      # shape: (1024, 1024, 3)
    img90 = np.array(np.rot90(img))                             # rotated 90° counterclockwise, shape: (1024, 1024, 3)
    img1 = np.concatenate([img[None, ...], img90[None, ...]])   # shape: (2, 1024, 1024, 3); img[None] adds a leading batch dimension

img1 stitches these two pictures into one batch; they are shown below:

  • show_img(img1[0], img1[1])
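
show_img here and below is just a small visualization helper; it is not shown in the snippets, so here is a minimal sketch of what such a helper might look like (my own illustrative version, using matplotlib):

import matplotlib.pyplot as plt

def show_img(left, right):
    """Show two images side by side; 3-channel arrays from cv2 are BGR, so flip to RGB for display."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ax, im in zip(axes, (left, right)):
        ax.imshow(im[..., ::-1] if im.ndim == 3 else im, cmap='gray')
        ax.axis('off')
    plt.show()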

Next, img2 is built by reversing img1 along its second axis (the image height), which flips both images vertically:

img2 = np.array(img1)[:, ::-1] # flip vertically

Visual display:

  • show_img(img2[0], img2[1])

In the same way, img3 reverses img1 along its third axis (the image width) to achieve a horizontal flip:

img3 = np.array(img1)[:, :, ::-1] # horizontal flip

Visual display:

  • show_img(img3[0], img3[1])

img4 is the horizontal flip of img2, which is equivalent to flipping img1 both horizontally and vertically (see the quick check after the code):

img4 = np.array(img2)[:, :, ::-1] # vertical flip + horizontal flip
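
Flipping both axes is the same as a 180° rotation, which is easy to verify with a throwaway NumPy check (illustrative only):

import numpy as np

a = np.arange(12).reshape(3, 4)
# reversing both axes of an array is identical to rotating it by 180 degrees
assert np.array_equal(a[::-1, ::-1], np.rot90(a, 2))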

Visual display:

  • show_img(img4[0], img4[1])

What follows is inference on each batch. The finally returned mask2 is the superimposed result, while maska[0] alone is the inference result of the original image:

maska = self.net.forward(img1).squeeze().cpu().data.numpy() # output for img1: (2, 1, 1024, 1024) -> squeeze -> (2, 1024, 1024)
maskb = self.net.forward(img2).squeeze().cpu().data.numpy()
maskc = self.net.forward(img3).squeeze().cpu().data.numpy()
maskd = self.net.forward(img4).squeeze().cpu().data.numpy()

mask1 = maska + maskb[:, ::-1] + maskc[:, :, ::-1] + maskd[:, ::-1, ::-1]  # undo each flip, then sum the four predictions
mask2 = mask1[0] + np.rot90(mask1[1])[::-1, ::-1]                          # rotate the img90 prediction back to the original orientation and add it
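
Since mask2 superimposes eight predictions (two per batch × four batches), it is not yet a binary mask. Continuing from the snippet above, a minimal sketch of how it could be normalized and binarized; the averaging and the 0.5 threshold are my own illustrative choices, not necessarily what the repository's test script uses:

road_prob = mask2 / 8.0                                # average of the eight superimposed predictions
road_mask = (road_prob > 0.5).astype(np.uint8) * 255   # illustrative threshold for the final binary mask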

For an intuitive comparison, the left image below is the inference result on the original image alone, and the right is the result after TTA:

  • show_img(maska[0], mask2)

It can be seen that the improvement from using TTA is quite noticeable.

NL-LinkNet

NL-LinkNet was proposed in 2019 and reportedly achieves a higher mIoU than D-LinkNet on the DeepGlobe dataset.
Related repository: https://github.com/yswang1717/NLLinkNet

Because the pretrained model provided by the repository author gave poor inference results (perhaps the wrong file was uploaded), I retrained the model on my RTX 2060. I set 200 epochs, but the model converged early after 128 epochs; training is still fairly slow and took about 57 hours. For detailed log information, please refer to logs.

The following is a comparison of the segmentation results of the two models for the same image:

It can be seen that the NL-LinkNet segmentation results are smoother.