[3D Image Segmentation] 3D Image Segmentation 6 based on Pytorch (data preprocessing: LIDC-IDRI xml annotation dump and annotation count statistics)

Since the LUNA16 data processing method that the author compiled previously was too cumbersome, this article reorganizes the processing of the LUNA16 data. The final data and format are similar; the main difference is that the code logic is simpler and easier to understand.

For learning about the LUNA16 data set, you can refer here: [3D Image Classification] Pytorch-based 3D stereoscopic image classification 3 (LIDC-IDRI pulmonary nodule XML feature tag PKL dump)

The main steps and central content of this article include the following parts:

masks generation: extract the nodule annotation coordinates of the corresponding series from the xml file (a nodule may be marked multiple times by multiple readers) and generate the corresponding mask array file, whose size is consistent with the image array size;
Lung parenchyma extraction operation: using the lung area segmentation data, perform a product operation with the original image and the mask image to fill or remove the non-lung areas;
resample operation: resample according to spacing. The resample can be performed in all three zyx dimensions, or only in the z direction to 1mm (I saw something similar to this in a paper);


According to mask, obtain the zyx center point coordinates and radius of the nodule.
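
As a rough illustration of this last step, the sketch below derives a zyx center point and a radius for every labeled nodule in a mask array. It is only a minimal sketch under assumptions: the function name `nodule_centers_and_radii` is hypothetical, the center is taken as the bounding-box center, and the radius is taken as half of the largest bounding-box extent in millimetres. The series may use a different definition.

```python
import numpy as np

def nodule_centers_and_radii(mask, spacing=(1.0, 1.0, 1.0)):
    """Return a list of (label, z, y, x, radius_mm), one per nonzero label.

    Assumption: radius = half of the largest bounding-box extent in mm.
    """
    spacing = np.asarray(spacing, dtype=float)
    results = []
    for label in np.unique(mask):
        if label == 0:  # 0 is background
            continue
        coords = np.argwhere(mask == label)   # (N, 3) voxel indices in z, y, x order
        lo, hi = coords.min(axis=0), coords.max(axis=0)
        center = (lo + hi) / 2.0              # bounding-box center in voxel units
        extent_mm = (hi - lo + 1) * spacing   # bounding-box size in mm
        radius = extent_mm.max() / 2.0
        results.append((int(label), *center.tolist(), float(radius)))
    return results

# Toy example: a 3x3x3 cube labeled 1 inside a 10x10x10 volume
toy = np.zeros((10, 10, 10), dtype=np.uint8)
toy[2:5, 3:6, 4:7] = 1
rows = nodule_centers_and_radii(toy)
print(rows)
```

Each returned row could then be written to a csv file, giving one record per nodule.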


At this point, we will have the following files:

Contains image data of ct;
Corresponding mask data;
A file that records zyx center point coordinates and radius.

Compared with the data format given by luna16, the current data is easier to understand and easier to view. Whether it is visualization or subsequent data processing and training, it is more intuitive and clear. This part will be expanded on one by one later.
Since there is a fair amount of code, many steps to handle, and many files involved, the material is spread across several articles. In this article, we first process the xml files and dump their contents into a form that is easy to view. This involves the format and parsing of xml files, which I cover in a separate article: [Medical Imaging Data Processing] XML file format processing summary
1. xml file dump
1.1. Understanding the annotation file xml
For an introduction to what each field in the xml file means in the LIDC-IDRI data set, you can refer to my other article, click here: [LIDC-IDRI] CT Pulmonary nodule XML tag characteristics benign and malignant tag PKL dump (1)


In this article, we focus on the structure of this data and what the tag of each record in xml means. I believe that after reading this, you will have a deeper understanding of the processing of this data set.
Most of the code is the same as the content introduced and obtained in the link above. You can refer to this GitHub: NoduleNet – utils -LIDC
Some content has not been introduced, so I will simply make a supplement.

ResponseHeader: This is the header part, which records the information of this case (that is, the CT image of a single patient).

To make the xml files easier to view and learn from, you can refer to this article: [Medical Imaging Data Processing] Summary of XML file format processing. We convert the xml into a dictionary so that it is easier to read. The comparison before and after the conversion is shown below:

A small excerpt of the original xml is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<LidcReadMessage uid="1.3.6.1.4.1.14519.5.2.1.6279.6001.1308168927505.0" xmlns="http://www.nih.gov" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd">
  <ResponseHeader>
    <Version>1.7</Version>
    <MessageId>1148851</MessageId>
    <DateRequest>2005-11-03</DateRequest>
    <TimeRequest>12:25:10</TimeRequest>
    <RequestingSite>removed</RequestingSite>
    <ServicingSite>removed</ServicingSite>
    <TaskDescription>Second unblinded read</TaskDescription>
    <CtImageFile>removed</CtImageFile>
    <SeriesInstanceUid>1.3.6.1.4.1.14519.5.2.1.6279.6001.131939324905446238286154504249</SeriesInstanceUid>
    <StudyInstanceUID>1.3.6.1.4.1.14519.5.2.1.6279.6001.303241414168367763244410429787</StudyInstanceUID>
    <DateService>2005-11-03</DateService>
    <TimeService>12:25:40</TimeService>
    <ResponseDescription>1 - Reading complete</ResponseDescription>
    <ResponseComments></ResponseComments>
  </ResponseHeader>

After conversion to dictionary form (easier to view):
{
  "LidcReadMessage": {
    "@uid": "1.3.6.1.4.1.14519.5.2.1.6279.6001.1308168927505.0",
    "@xmlns": "http://www.nih.gov",
    "@xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "@xsi:schemaLocation": "http://www.nih.gov http://troll.rad.med.umich.edu/lidc/LidcReadMessage.xsd",
    "ResponseHeader": {
      "Version": "1.7",
      "MessageId": "1148851",
      "DateRequest": "2005-11-03",
      "TimeRequest": "12:25:10",
      "RequestingSite": "removed",
      "ServicingSite": "removed",
      "TaskDescription": "Second unblinded read",
      "CtImageFile": "removed",
      "SeriesInstanceUid": "1.3.6.1.4.1.14519.5.2.1.6279.6001.131939324905446238286154504249",
      "StudyInstanceUID": "1.3.6.1.4.1.14519.5.2.1.6279.6001.303241414168367763244410429787",
      "DateService": "2005-11-03",
      "TimeService": "12:25:40",
      "ResponseDescription": "1 - Reading complete",
      "ResponseComments": null
    },
}
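
For readers who want to try this conversion without extra dependencies, here is a minimal sketch using only the standard library. The helper names `strip_ns` and `element_to_dict` are hypothetical, and this is not necessarily how the dictionary above was produced. In particular, `xml.etree.ElementTree` consumes the `xmlns` declarations, so they will not appear as `@xmlns` keys the way they do above.

```python
import json
import xml.etree.ElementTree as ET

def strip_ns(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.split('}', 1)[-1]

def element_to_dict(elem):
    """Recursively convert an Element into nested dicts.

    Attributes are stored under '@name' keys, repeated child tags become
    lists, and empty elements become None.
    """
    node = {'@' + k: v for k, v in elem.attrib.items()}
    for child in elem:
        key, value = strip_ns(child.tag), element_to_dict(child)
        if key in node:  # repeated tag: collect the values into a list
            if not isinstance(node[key], list):
                node[key] = [node[key]]
            node[key].append(value)
        else:
            node[key] = value
    text = (elem.text or '').strip()
    if not node:
        return text or None  # leaf: plain text, or None when empty
    if text:
        node['#text'] = text
    return node

# Shortened, made-up sample in the same shape as the excerpt above
sample = """<LidcReadMessage uid="demo" xmlns="http://www.nih.gov">
  <ResponseHeader>
    <Version>1.7</Version>
    <TaskDescription>Second unblinded read</TaskDescription>
    <ResponseComments></ResponseComments>
  </ResponseHeader>
</LidcReadMessage>"""

root = ET.fromstring(sample)
as_dict = {strip_ns(root.tag): element_to_dict(root)}
print(json.dumps(as_dict, indent=2))
```

Collecting repeated tags into lists matters for this data set, because one file can hold several reading sessions and each session can hold several nodules.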

1.2. Convert xml comprehensive records to series-based npy files
LIDC-IDRI has 1018 examinations. The annotation folder tcia-lidc-xml contains 6 subfolders holding 1318 xml files in total. Moreover, the names of these xml files do not correspond one-to-one with the series names of the images.
Therefore, the information in the xml files needs to be reorganized into a form that is easy to read. If each annotation file corresponds one-to-one with an image file, subsequent processing also becomes much easier.
This section extracts from the xml files only the content we care about and sets the rest aside for now.
Below is the processing code, the main steps are outlined below:

Traverse all xml files and process them one by one;
For a single xml file, parse out the seriesuid and the labeled nodule coordinates;
Stored to the npy file named with seriesuid, the stored content is the coordinates of each nodule.

The complete code is as follows:
from tqdm import tqdm
import sys
import os
import numpy as np

from pylung.utils import find_all_files
from pylung.annotation import parse

def xml2mask(xml_file):
    header, annos = parse(xml_file) # get one xml info

    ctr_arrs = []
    for i, reader in enumerate(annos):
        for j, nodule in enumerate(reader.nodules):
            ctr_arr = []
            for k, roi in enumerate(nodule.rois):
                z = roi.z
                for roi_xy in roi.roi_xy:
                    ctr_arr.append([z, roi_xy[1], roi_xy[0]]) # [[[z, y, x], [z, y, x]]]
            ctr_arrs.append(ctr_arr)

    seriesuid = header.series_instance_uid
    return seriesuid, ctr_arrs

def annotation2masks(annos_dir, save_dir):
    # get all xml file path
    files = find_all_files(annos_dir, '.xml')
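    for f in tqdm(files, total=len(files)):
        print(f)
        try:
            seriesuid, masks = xml2mask(f)
            # contours differ in length, so store them in an explicit object array
            arr = np.empty(len(masks), dtype=object)
            for i, m in enumerate(masks):
                arr[i] = m
            np.save(os.path.join(save_dir, '%s' % (seriesuid)), arr) # save xml 3D coor [[z, y, x], [z, y, x]]
        except Exception:
            print("Unexpected error:", sys.exc_info()[0])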


if __name__ == '__main__':
    annos_dir = './LUNA16/annotation/LIDC-XML-only/tcia-lidc-xml' # .xml
    ctr_arr_save_dir = './LUNA16/annotation/noduleCoor' # Where to save the intermediate nodule mask parsed by each annotator

    os.makedirs(ctr_arr_save_dir, exist_ok=True)

    # xml information, dump npy (temporary file)
    annotation2masks(annos_dir, ctr_arr_save_dir)

Next, open an npy file for viewing. The recorded content is as follows, which are the polygon coordinate points of all nodules marked by all doctors in this sequence:

[list([[-299.8, 206, 42], [-299.8, 207, 41], [-299.8, 208, 41], [-299.8, 209, 40], [-299.8, 210, 40 ], [-299.8, 211, 41], [-299.8, 212, 41], [-299.8, 213, 42], [-299.8, 214, 42], [-299.8, 215, 43], [-299.8 , 216, 44], [-299.8, 216, 45], [-299.8, 215, 46], [-299.8, 215, 47], [-299.8, 215, 48], [-299.8, 214, 49] , [-299.8, 213, 49], [-299.8, 212, 49], [-299.8, 211, 49], [-299.8, 210, 49], [-299.8, 209, 49], [-299.8, 208, 48], [-299.8, 207, 47], [-299.8, 207, 46], [-299.8, 206, 45], [-299.8, 206, 44], [-299.8, 206, 43], [-299.8, 206, 42], [-298.0, 206, 46], [-298.0, 207, 45], [-298.0, 207, 44], [-298.0, 208, 43], [-298.0, 209 , 42], [-298.0, 209, 41], [-298.0, 210, 40], [-298.0, 211, 40], [-298.0, 212, 39], [-298.0, 213, 40], [ -298.0, 214, 41], [-298.0, 215, 42], [-298.0, 215, 43], [-298.0, 216, 44], [-298.0, 216, 45], [-298.0, 216, 46], [-298.0, 216, 47], [-298.0, 215, 48], [-298.0, 214, 48], [-298.0, 213, 48], [-298.0, 212, 48], [- 298.0, 211, 48], [-298.0, 210, 48], [-298.0, 209, 48], [-298.0, 208, 48], [-298.0, 207, 47], [-298.0, 206, 46 ], [-296.2, 209, 42], [-296.2, 210, 41], [-296.2, 211, 40], [-296.2, 212, 40], [-296.2, 213, 41], [-296.2 , 214, 42], [-296.2, 215, 43], [-296.2, 216, 44], [-296.2, 216, 45], [-296.2, 216, 46], [-296.2, 216, 47] , [-296.2, 216, 48], [-296.2, 215, 49], [-296.2, 214, 49], [-296.2, 213, 49], [-296.2, 212, 49], [-296.2, 211, 48], [-296.2, 210, 47], [-296.2, 209, 46], [-296.2, 209, 45], [-296.2, 209, 44], [-296.2, 209, 43], [-296.2, 209, 42]])
 list([[-227.8, 151, 405], [-227.8, 152, 404], [-227.8, 153, 403], [-227.8, 154, 402], [-227.8, 155, 402], [- 227.8, 156, 402], [-227.8, 157, 403], [-227.8, 157, 404], [-227.8, 157, 405], [-227.8, 158, 406], [-227.8, 158, 407 ], [-227.8, 158, 408], [-227.8, 157, 409], [-227.8, 156, 409], [-227.8, 155, 409], [-227.8, 154, 408], [-227.8 , 153, 408], [-227.8, 152, 407], [-227.8, 151, 406], [-227.8, 151, 405], [-226.0, 152, 405], [-226.0, 153, 404] , [-226.0, 154, 404], [-226.0, 155, 403], [-226.0, 156, 404], [-226.0, 157, 405], [-226.0, 157, 406], [-226.0, 157, 407], [-226.0, 156, 408], [-226.0, 155, 408], [-226.0, 154, 408], [-226.0, 153, 408], [-226.0, 152, 407], [-226.0, 152, 406], [-226.0, 152, 405]])
 list([[-226.0, 158, 407], [-226.0, 157, 408], [-226.0, 156, 409], [-226.0, 155, 409], [-226.0, 154, 409], [- 226.0, 153, 409], [-226.0, 152, 408], [-226.0, 151, 407], [-226.0, 152, 406], [-226.0, 153, 405], [-226.0, 153, 404 ], [-226.0, 154, 403], [-226.0, 155, 402], [-226.0, 156, 402], [-226.0, 157, 403], [-226.0, 158, 404], [-226.0 , 158, 405], [-226.0, 158, 406], [-226.0, 158, 407], [-227.8, 159, 407], [-227.8, 158, 408], [-227.8, 157, 409] , [-227.8, 156, 410], [-227.8, 155, 410], [-227.8, 154, 410], [-227.8, 153, 409], [-227.8, 152, 408], [-227.8, 151, 407], [-227.8, 151, 406], [-227.8, 151, 405], [-227.8, 152, 404], [-227.8, 153, 403], [-227.8, 154, 402], [-227.8, 155, 402], [-227.8, 156, 402], [-227.8, 157, 403], [-227.8, 158, 404], [-227.8, 158, 405], [-227.8, 158 , 406], [-227.8, 159, 407]])
 list([[-296.2, 214, 46], [-296.2, 213, 47], [-296.2, 212, 47], [-296.2, 211, 47], [-296.2, 210, 46], [- 296.2, 209, 45], [-296.2, 208, 44], [-296.2, 208, 43], [-296.2, 208, 42], [-296.2, 209, 41], [-296.2, 210, 42 ], [-296.2, 211, 42], [-296.2, 212, 43], [-296.2, 213, 44], [-296.2, 214, 45], [-296.2, 214, 46], [-298.0 , 216, 47], [-298.0, 215, 48], [-298.0, 214, 49], [-298.0, 213, 49], [-298.0, 212, 49], [-298.0, 211, 49] , [-298.0, 210, 49], [-298.0, 209, 48], [-298.0, 208, 47], [-298.0, 207, 46], [-298.0, 207, 45], [-298.0, 207, 44], [-298.0, 208, 43], [-298.0, 208, 42], [-298.0, 209, 41], [-298.0, 210, 41], [-298.0, 211, 41], [-298.0, 212, 41], [-298.0, 213, 41], [-298.0, 214, 42], [-298.0, 215, 43], [-298.0, 216, 44], [-298.0, 216 , 45], [-298.0, 216, 46], [-298.0, 216, 47], [-299.8, 216, 50], [-299.8, 215, 51], [-299.8, 214, 51], [ -299.8, 213, 50], [-299.8, 212, 50], [-299.8, 211, 50], [-299.8, 210, 49], [-299.8, 209, 48], [-299.8, 208, 47], [-299.8, 207, 46], [-299.8, 207, 45], [-299.8, 207, 44], [-299.8, 208, 43], [-299.8, 209, 42], [- 299.8, 210, 42], [-299.8, 211, 41], [-299.8, 212, 41], [-299.8, 213, 42], [-299.8, 214, 42], [-299.8, 215, 43 ], [-299.8, 216, 44], [-299.8, 216, 45], [-299.8, 216, 46], [-299.8, 216, 47], [-299.8, 216, 48], [-299.8 , 216, 49], [-299.8, 216, 50]])
 list([[-226.0, 158, 407], [-226.0, 157, 408], [-226.0, 156, 409], [-226.0, 155, 409], [-226.0, 154, 409], [- 226.0, 153, 409], [-226.0, 152, 409], [-226.0, 151, 409], [-226.0, 151, 408], [-226.0, 151, 407], [-226.0, 151, 406 ], [-226.0, 151, 405], [-226.0, 152, 404], [-226.0, 152, 403], [-226.0, 153, 403], [-226.0, 154, 402], [-226.0 , 154, 401], [-226.0, 155, 401], [-226.0, 156, 401], [-226.0, 157, 401], [-226.0, 157, 402], [-226.0, 158, 403] , [-226.0, 158, 404], [-226.0, 158, 405], [-226.0, 158, 406], [-226.0, 158, 407], [-227.8, 159, 407], [-227.8, 158, 408], [-227.8, 158, 409], [-227.8, 157, 409], [-227.8, 156, 410], [-227.8, 155, 410], [-227.8, 154, 409], [-227.8, 153, 409], [-227.8, 152, 409], [-227.8, 151, 408], [-227.8, 151, 407], [-227.8, 151, 406], [-227.8, 151 , 405], [-227.8, 151, 404], [-227.8, 152, 403], [-227.8, 152, 402], [-227.8, 153, 401], [-227.8, 154, 401], [ -227.8, 155, 401], [-227.8, 156, 401], [-227.8, 157, 401], [-227.8, 158, 402], [-227.8, 158, 403], [-227.8, 159, 404], [-227.8, 159, 405], [-227.8, 159, 406], [-227.8, 159, 407]])
 list([[-296.2, 215, 47], [-296.2, 214, 48], [-296.2, 213, 48], [-296.2, 212, 48], [-296.2, 211, 48], [- 296.2, 210, 47], [-296.2, 209, 47], [-296.2, 208, 46], [-296.2, 208, 45], [-296.2, 207, 44], [-296.2, 207, 43 ], [-296.2, 208, 42], [-296.2, 209, 42], [-296.2, 210, 42], [-296.2, 211, 42], [-296.2, 212, 43], [-296.2 , 213, 43], [-296.2, 214, 44], [-296.2, 215, 45], [-296.2, 215, 46], [-296.2, 215, 47], [-298.0, 216, 47] , [-298.0, 215, 48], [-298.0, 214, 49], [-298.0, 214, 50], [-298.0, 213, 50], [-298.0, 212, 50], [-298.0, 211, 49], [-298.0, 210, 49], [-298.0, 209, 48], [-298.0, 208, 48], [-298.0, 207, 47], [-298.0, 207, 46], [-298.0, 207, 45], [-298.0, 207, 44], [-298.0, 207, 43], [-298.0, 207, 42], [-298.0, 207, 41], [-298.0, 208 , 41], [-298.0, 209, 41], [-298.0, 210, 41], [-298.0, 211, 41], [-298.0, 212, 41], [-298.0, 213, 41], [ -298.0, 214, 41], [-298.0, 215, 42], [-298.0, 215, 43], [-298.0, 216, 44], [-298.0, 216, 45], [-298.0, 216, 46], [-298.0, 216, 47], [-299.8, 217, 46], [-299.8, 216, 47], [-299.8, 216, 48], [-299.8, 215, 49], [- 299.8, 214, 50], [-299.8, 213, 50], [-299.8, 212, 50], [-299.8, 211, 50], [-299.8, 210, 50], [-299.8, 209, 49 ], [-299.8, 208, 48], [-299.8, 208, 47], [-299.8, 207, 46], [-299.8, 207, 45], [-299.8, 207, 44], [-299.8 , 208, 43], [-299.8, 209, 42], [-299.8, 209, 41], [-299.8, 210, 41], [-299.8, 211, 41], [-299.8, 212, 41] , [-299.8, 213, 41], [-299.8, 214, 42], [-299.8, 215, 42], [-299.8, 215, 43], [-299.8, 216, 44], [-299.8, 217, 45], [-299.8, 217, 46], [-301.6, 214, 45], [-301.6, 213, 46], [-301.6, 212, 47], [-301.6, 211, 47], [-301.6, 210, 46], [-301.6, 209, 45], [-301.6, 210, 44], [-301.6, 211, 43], [-301.6, 212, 43], [-301.6, 213 , 44], [-301.6, 214, 45]])
 list([[-296.2, 209, 43], [-296.2, 209, 44], [-296.2, 210, 45], [-296.2, 211, 46], [-296.2, 212, 47], [- 296.2, 212, 48], [-296.2, 213, 48], [-296.2, 214, 48], [-296.2, 215, 47], [-296.2, 215, 46], [-296.2, 215, 45 ], [-296.2, 214, 44], [-296.2, 213, 43], [-296.2, 212, 43], [-296.2, 211, 43], [-296.2, 210, 43], [-296.2 , 209, 43], [-298.0, 208, 42], [-298.0, 208, 43], [-298.0, 208, 44], [-298.0, 208, 45], [-298.0, 208, 46] , [-298.0, 208, 47], [-298.0, 209, 47], [-298.0, 210, 48], [-298.0, 211, 48], [-298.0, 211, 49], [-298.0, 212, 49], [-298.0, 213, 48], [-298.0, 214, 48], [-298.0, 215, 47], [-298.0, 216, 46], [-298.0, 216, 45], [-298.0, 216, 44], [-298.0, 215, 43], [-298.0, 214, 43], [-298.0, 213, 42], [-298.0, 212, 42], [-298.0, 212 , 41], [-298.0, 211, 41], [-298.0, 210, 41], [-298.0, 209, 42], [-298.0, 208, 42], [-299.8, 210, 43], [ -299.8, 209, 43], [-299.8, 208, 44], [-299.8, 207, 44], [-299.8, 207, 45], [-299.8, 207, 46], [-299.8, 208, 47], [-299.8, 209, 48], [-299.8, 210, 49], [-299.8, 211, 49], [-299.8, 212, 49], [-299.8, 213, 50], [- 299.8, 214, 49], [-299.8, 215, 48], [-299.8, 215, 47], [-299.8, 216, 46], [-299.8, 216, 45], [-299.8, 215, 44 ], [-299.8, 215, 43], [-299.8, 214, 43], [-299.8, 214, 42], [-299.8, 213, 42], [-299.8, 212, 41], [-299.8 , 211, 41], [-299.8, 210, 42], [-299.8, 210, 43]])] <class 'numpy.ndarray'>
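
Because every nodule contour has a different number of points, the saved array is a 1D object array rather than a regular numeric array. The sketch below uses made-up toy contours and a temporary file (not the real data) to show one way such a file can be written and read back. The key detail is `allow_pickle=True` on load, which NumPy requires for object arrays.

```python
import os
import tempfile
import numpy as np

# Two toy contours of different lengths, each point stored as [z, y, x]
contours = [
    [[-299.8, 206, 42], [-299.8, 207, 41], [-298.0, 206, 46]],
    [[-227.8, 151, 405], [-226.0, 152, 405]],
]

# Build the object array explicitly so NumPy does not try to make a regular 3D array
ragged = np.empty(len(contours), dtype=object)
for i, contour in enumerate(contours):
    ragged[i] = contour

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_seriesuid.npy')
    np.save(path, ragged)                      # object arrays are pickled on save
    loaded = np.load(path, allow_pickle=True)  # required to read object arrays back

for idx, contour in enumerate(loaded):
    pts = np.asarray(contour, dtype=float)     # (num_points, 3)
    print(idx, pts.shape, 'z positions:', np.unique(pts[:, 0]))
```

Each entry is one nodule outlined by one reader, possibly spanning several z positions, which matches the structure of the dump above.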

2. Annotation counts and mask array generation
Generating npy files is not the final result of this annotation information. There are several reasons:

The nodule coordinates in the xml file are marked separately by multiple doctors, so the annotations overlap: a nodule may be marked repeatedly by several doctors, many of whom worked blind, without knowing what the other doctors had labeled. Therefore, the annotations from multiple readers need to be merged to leave the final nodule coordinates;
The npy file only holds coordinate points; you also need to generate a mask file that has the same shape as the image and corresponds to it.

Based on the above reasons, generating the final mask file requires the following steps:

The z value of each marked coordinate point (hu z in the code comments, a physical position along the z axis) must be converted to the slice index instanceNum in the corresponding image;
The nodules marked by multiple doctors are merged according to the iou overlap rule to leave the final nodules;
The remaining nodule coordinates are drawn on a mask and stored.
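
The first of these steps, converting z to a slice index, can be sketched in isolation as follows. The function name `world_z_to_index` and the origin and spacing values are hypothetical, chosen only for illustration. Note that this sketch rounds to the nearest slice, whereas the article's code truncates with `astype(np.int32)`. The two can differ by one slice when floating-point error leaves a value just below an integer.

```python
import numpy as np

def world_z_to_index(z_world, z_origin, z_spacing):
    """Map physical z positions (mm) to integer slice indices.

    Rounds to the nearest slice instead of truncating.
    """
    idx = np.abs(np.asarray(z_world, dtype=float) - z_origin) / z_spacing
    return np.rint(idx).astype(np.int32)

# Hypothetical geometry: first slice at z = -335.8 mm, slices 1.8 mm apart
indices = world_z_to_index([-299.8, -298.0, -296.2], z_origin=-335.8, z_spacing=1.8)
print(indices)
```

In the real pipeline, `z_origin` and `z_spacing` come from the first elements of the zyx-ordered origin and spacing returned when the mhd image is loaded.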

The implementation code is as follows:
import nrrd
import SimpleITK as sitk
import cv2
import os
import numpy as np

def load_itk_image(filename):
    """
    Return img array and [z,y,x]-ordered origin and spacing
    """
    # The shape of the image returned by sitk.ReadImage is x, y, z
    itkimage = sitk.ReadImage(filename)
    numpyImage = sitk.GetArrayFromImage(itkimage)

    numpyOrigin = np.array(list(reversed(itkimage.GetOrigin())))
    numpySpacing = np.array(list(reversed(itkimage.GetSpacing())))

    return numpyImage, numpyOrigin, numpySpacing


def arrs2mask(img_dir, ctr_arr_dir, save_dir):
    cnt = 0
    consensus = {1: 0, 2: 0, 3: 0, 4: 0} # Consensus

    # generate save document
    for k in consensus.keys():
        if not os.path.exists(os.path.join(save_dir, str(k))):
            os.makedirs(os.path.join(save_dir, str(k)))

    for f in os.listdir(img_dir):
        if f.endswith('.mhd'):
            pid = f[:-4]
            print('pid:', pid)
            #ct
            img, origin, spacing = load_itk_image(os.path.join(img_dir, '%s.mhd' % (pid)))

            # mask coor npy
            ctr_arrs = np.load(os.path.join(ctr_arr_dir, '%s.npy' % (pid)), allow_pickle=True)
            cnt += len(ctr_arrs)

            nodule_masks = []
            # Label the nodules in sequence
            for ctr_arr in ctr_arrs:
                z_origin = origin[0]
                z_spacing = spacing[0]

                ctr_arr = np.array(ctr_arr)
                # ctr_arr[:, 0] z-axis direction value, from hu z to instanceNum [-50, -40, -30]-->[2, 3, 4]
                ctr_arr[:, 0] = np.absolute(ctr_arr[:, 0] - z_origin) / z_spacing # Find the absolute value of each element in the array. np.abs is the abbreviation of this function
                ctr_arr = ctr_arr.astype(np.int32)
                print(ctr_arr)

                # For each marked nodule, a mask file with the same size as img will be temporarily generated.
                mask = np.zeros(img.shape)
                # Traverse the z-axis sequence of the annotation layer
                for z in np.unique(ctr_arr[:, 0]): # Remove duplicate elements and sort them by elements from small to large
                    ctr = ctr_arr[ctr_arr[:, 0] == z][:, [2, 1]]
                    ctr = np.array([ctr], dtype=np.int32)
                    mask[z] = cv2.fillPoly(mask[z], ctr, color=(1,))
                nodule_masks.append(mask)

            i = 0
            visited = []
            d = {}
            masks = []
            while i < len(nodule_masks):
                # If matched before, then no need to create new mask
                if i in visited:
                    i+=1
                    continue
                same_nodules = []
                mask1 = nodule_masks[i]
                same_nodules.append(mask1)
                d[i] = {}
                d[i]['count'] = 1
                d[i]['iou'] = []

                # Find annotations pointing to the same nodule
                # The current node mask[i], and all the nodes behind it, find iou in turn
                for j in range(i + 1, len(nodule_masks)):
                    # if not overlapped with previous added nodules
                    if j in visited:
                        continue
                    mask2 = nodule_masks[j]
                    iou = float(np.logical_and(mask1, mask2).sum()) / np.logical_or(mask1, mask2).sum()

                    # If iou exceeds the threshold, the current i-th mask is recorded as being marked repeatedly.
                    if iou > 0.4:
                        visited.append(j)
                        same_nodules.append(mask2)
                        d[i]['count'] += 1
                        d[i]['iou'].append(iou)

                masks.append(same_nodules)
                i+=1

            print(visited)
            # only 4 people, check up 4 data
            for k, v in d.items():
                if v['count'] > 4:
                    print('WARNING: %s: %dth nodule, iou: %s' % (pid, k, str(v['iou'])))
                    v['count'] = 4
                consensus[v['count']] += 1

            # number of consensus
            num = np.array([len(m) for m in masks])
            num[num > 4] = 4 # Up to 4 times. If the mark is repeated more than 4 times, it will be counted as 4 times.

            if len(num) == 0:
                continue
            # Iterate from the nodules with most consensus
            for n in range(num.max(), 0, -1):
                mask = np.zeros(img.shape, dtype=np.uint8)

                for i, index in enumerate(np.where(num >= n)[0]):
                    same_nodules = masks[index]
                    m = np.logical_or.reduce(same_nodules)
                    mask[m] = i + 1 # Distinguish different nodules, and give different values to different nodules, which increase in sequence (if it is segmented, you can directly give them all 1, or they can be unified to 1 in the end)
                nrrd.write(os.path.join(save_dir, str(n), pid + '.nrrd'), mask) # mask

    print(consensus)
    print(cnt)

if __name__ == '__main__':
    img_dir = r'./LUNA16/image_combined' # data

    ctr_arr_save_dir = r'./LUNA16/annotation/noduleCoor' # Where to save the intermediate nodule mask parsed by each annotator
    noduleMask_save_dir = r'./LUNA16/nodule_masks' # Folder to save merged nodule masks

    # Generate a mask for the dumped temporary file
    arrs2mask(img_dir, ctr_arr_save_dir, noduleMask_save_dir)
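
To see the merging rule on its own, the toy sketch below applies the same greedy idea to small synthetic masks: each unvisited mask seeds a group and absorbs every later mask whose iou with the seed exceeds 0.4. The helper names `iou`, `group_by_iou` and `square` are hypothetical, and the masks are made up rather than taken from LIDC-IDRI.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks (0.0 when both are empty)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union else 0.0

def group_by_iou(masks, threshold=0.4):
    """Greedy grouping: each unvisited mask seeds a group and absorbs later
    masks whose iou with the seed exceeds the threshold."""
    visited, groups = set(), []
    for i, seed in enumerate(masks):
        if i in visited:
            continue
        group = [i]
        for j in range(i + 1, len(masks)):
            if j not in visited and iou(seed, masks[j]) > threshold:
                visited.add(j)
                group.append(j)
        groups.append(group)
    return groups

def square(y0, x0, size=4, shape=(1, 16, 16)):
    """Binary mask containing one filled square on a single slice."""
    m = np.zeros(shape, dtype=bool)
    m[0, y0:y0 + size, x0:x0 + size] = True
    return m

# Three readers mark roughly the same nodule; one reader marks a second nodule
toy_masks = [square(2, 2), square(2, 3), square(10, 10), square(3, 2)]
groups = group_by_iou(toy_masks)
print(groups)
print([len(g) for g in groups])  # how many readers agree on each nodule
```

The length of each group is the consensus count, which is what decides which of the numbered output folders a nodule ends up in. Because later masks are compared only against the seed mask, the result can depend on annotation order.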

At this point, a mask with the same shape as the image has been generated. Next, use itk-snap to open and view the processed results, as shown below:

Both the image and the mask are opened as nrrd files. The image is in mhd format; to convert it to nrrd, you can refer to the following code:
import os
import itk
import nrrd

mhd_path = os.path.join(r'./LUNA16/image_combined', '1.3.6.1.4.1.14519.5.2.1.6279.6001.184412674007117333405073397832.mhd')
image = itk.array_from_image(itk.imread(mhd_path))

nrrd.write(r'./image.nrrd', image)

3. Summary
The data formats in the LIDC-IDRI data set are ones we don't often encounter, especially the mhd file with its accompanying raw file, where two separate files together represent a single volume.
For beginners, this data form can feel unfamiliar; I believe this series will help make it clear. In this article the masks are stored as nrrd files, which is my preferred array storage format because it is simple and easy to understand.
At this point, you have images and masks in one-to-one correspondence, which is much easier to work with than the xml files. In the next section, we will combine the image and mask obtained here with the lung area segmentation for further refinement, and use the resample operation to bring the data to a unified scale.