Solving the problems encountered when running UniAD under higher versions of CUDA and PyTorch

UniAD (https://github.com/OpenDriveLab/UniAD) is a unified large model for autonomous driving that integrates multi-task modules for perception (object detection and tracking), mapping (not SLAM-style environment mapping, but real-time panoptic segmentation of roads and dividers in images), trajectory planning, and occupancy prediction. The installation instructions on the official site are based on the older versions the authors used, CUDA 11.1.1 and PyTorch 1.9.1, with the correspondingly old mmcv 1.4. In the NVIDIA NGC docker environment on our work server these components are already newer, so we wanted to run UniAD in our own environment. We hit some pitfalls during installation, but they were solved one by one, and our tests show that UniAD runs normally in a CUDA 11.6 + PyTorch 1.12.0 + mmcv 1.6 + mmseg 0.24.0 + mmdet 2.24 + mmdet3d 1.0.0rc4 environment.

The steps to install and troubleshoot are as follows:

1. Pull an NVIDIA NGC docker image based on CUDA 11.6 and create a container as the UniAD runtime environment

2. Install PyTorch and torchvision:

pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
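Before building anything from source, it is worth a quick check (a minimal sketch) that the +cu116 wheels were actually picked up and that the GPU is visible inside the container:

import torch
import torchvision

print(torch.__version__)          # expect 1.12.0+cu116
print(torchvision.__version__)    # expect 0.13.0+cu116
print(torch.version.cuda)         # expect 11.6
print(torch.cuda.is_available())  # expect True inside the container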

3. Check whether the CUDA_HOME environment variable is set; if not, set it:

export CUDA_HOME=/usr/local/cuda
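mmcv and mmdet3d compile their CUDA extensions through PyTorch, which resolves the toolkit via CUDA_HOME. This small check (a sketch) prints the path the extension builds will actually use:

import os

# torch.utils.cpp_extension resolves CUDA_HOME the same way the
# mmcv/mmdet3d extension builds do; None means the toolkit was not found.
from torch.utils.cpp_extension import CUDA_HOME

print(os.environ.get('CUDA_HOME'))  # the variable set above
print(CUDA_HOME)                    # expect /usr/local/cuda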

4. Due to CUDA and PyTorch version constraints, mmcv needs to be upgraded from 1.4 to 1.6.0 (for the version correspondence of OpenMMLab's mm-series framework packages, refer to the CSDN blog post on the correspondence between mmcv and the mmdet, mmdet3d, and mmseg versions):
pip install mmcv-full==1.6.0 -f https://download.openmmlab.com/mmcv/dist/cu116/torch1.12.0/index.html

According to that correspondence, install mmdet 2.24.0 and mmseg 0.24.0:

pip install mmdet==2.24.0
pip install mmsegmentation==0.24.0
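A quick post-install sanity check (a sketch pinned to the exact combination validated in this post) catches accidental version drift early:

import mmcv
import mmdet
import mmseg

# These are the versions tested in this post; adjust if you deviate.
assert mmcv.__version__ == '1.6.0', mmcv.__version__
assert mmdet.__version__ == '2.24.0', mmdet.__version__
assert mmseg.__version__ == '0.24.0', mmseg.__version__
print('mmcv/mmdet/mmseg versions OK')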

Download the mmdetection3d source code and switch to version v1.0.0rc4:

git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d

git checkout v1.0.0rc4

Install the support packages, then compile and install mmdet3d from source:

pip install scipy==1.7.3
pip install scikit-image==0.20.0
# In requirements/runtime.txt, change the numba version:
# numba==0.53.0  ->  numba==0.53.1
pip install -v -e .

After installing the support environment, download and install UniAD:

git clone https://github.com/OpenDriveLab/UniAD.git
cd UniAD

# Modify the numpy version in requirements.txt, then install the support packages:
# numpy==1.20.0  ->  numpy==1.22.0
pip install -r requirements.txt

# Download the relevant pre-trained weight files
mkdir ckpts && cd ckpts
wget https://github.com/zhiqi-li/storage/releases/download/v1.0/bevformer_r101_dcn_24ep.pth
wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0/uniad_base_track_map.pth
wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0.1/uniad_base_e2e.pth
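A truncated download only surfaces later as a confusing deserialization error, so it can help to load each checkpoint once on the CPU right away (a sketch, run inside the ckpts directory, assuming the usual {'state_dict': ...} layout):

import torch

for f in ('bevformer_r101_dcn_24ep.pth',
          'uniad_base_track_map.pth',
          'uniad_base_e2e.pth'):
    ckpt = torch.load(f, map_location='cpu')
    # assume the common mmcv-style layout; fall back to the raw object
    state = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt
    print(f, 'loaded,', len(state), 'entries')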

Follow the instructions at https://github.com/OpenDriveLab/UniAD/blob/main/docs/DATA_PREP.md to download and extract the nuScenes dataset.

Download data infos file:

cd UniAD/data
mkdir infos && cd infos
wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0/nuscenes_infos_temporal_train.pkl # train_infos
wget https://github.com/OpenDriveLab/UniAD/releases/download/v1.0/nuscenes_infos_temporal_val.pkl # val_infos

Assuming the nuScenes dataset and the data infos files have been downloaded and extracted into ./data/, they should be organized in the following structure:

UniAD
├── projects/
├── tools/
├── ckpts/
│ ├── bevformer_r101_dcn_24ep.pth
│ ├── uniad_base_track_map.pth
│ ├── uniad_base_e2e.pth
├── data/
│ ├── nuscenes/
│ │ ├── can_bus/
│ │ ├── maps/
│ │ │ ├──36092f0b03a857c6a3403e25b4b7aab3.png
│ │ │ ├──37819e65e09e5547b8a3ceaefba56bb2.png
│ │ │ ├──53992ee3023e5494b90c316c183be829.png
│ │ │ ├──93406b464a165eaba6d9de76ca09f5da.png
│ │ │ ├──basemap
│ │ │ ├──expansion
│ │ │ ├──prediction
│ │ ├── samples/
│ │ ├── sweeps/
│ │ ├── v1.0-test/
│ │ ├── v1.0-trainval/
│ ├── infos/
│ │ ├── nuscenes_infos_temporal_train.pkl
│ │ ├── nuscenes_infos_temporal_val.pkl
│ ├── others/
│ │ ├── motion_anchor_infos_mode6.pkl

Note: The three directories basemap, expansion, and prediction extracted from the map (v1.3) extension archives must be placed inside the maps directory, not at the same level as samples, sweeps, and the other directories. After all nuScenes train archives are extracted, each leaf subdirectory of samples contains 34149 images, while the image counts in the sweeps subdirectories vary (for example 163881, 164274, 164166, 161453, 160856, 164266, and so on). After the unlabeled test archive is extracted in the nuscenes directory, its samples and sweeps images are merged into the corresponding nuscenes/samples and nuscenes/sweeps subdirectories; counting again, each subdirectory under samples then holds 40157 images, and the sweeps subdirectories hold 193153, 189171, 189905, 193082, 193168, 192699, and so on.
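To verify the merge without clicking through directories, a small counting script (a sketch; run from the UniAD root) reproduces the numbers above:

import os

root = 'data/nuscenes'
for split in ('samples', 'sweeps'):
    split_dir = os.path.join(root, split)
    for sub in sorted(os.listdir(split_dir)):
        path = os.path.join(split_dir, sub)
        if os.path.isdir(path):
            # expect 40157 per samples subdirectory once test data is merged
            print(split, sub, len(os.listdir(path)))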

Then run:

./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 8

The last parameter is the number of GPUs. My environment matches the authors' setup with 8 A100 cards, so I followed the instructions as-is; with fewer cards, change this parameter (for example, to 1) and it will still run, just more slowly.

You may encounter the following problems when running the above command for the first time:

1. partially initialized module ‘cv2’ has no attribute ‘_registerMatType’ (most likely due to a circular import)

This happens because the opencv-python version in the environment is too new and incompatible (mine was 4.8.1.78). A quick search showed it needs to be downgraded to 4.5, so reinstall opencv-python 4.5:

pip install opencv-python==4.5.4.58

2. ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Just install libgl:

sudo apt-get update && sudo apt-get install libgl1

3. AssertionError: MMCV==1.6.0 is used but incompatible. Please install mmcv>=(1, 3, 13, 0, 0, 0), <=(1, 5, 0, 0, 0, 0)

Traceback (most recent call last):
  File "tools/create_data.py", line 4, in <module>
    from data_converter import uniad_nuscenes_converter as nuscenes_converter
  File "/workspace/workspace_fychen/UniAD/tools/data_converter/uniad_nuscenes_converter.py", line 13, in <module>
    from mmdet3d.core.bbox.box_np_ops import points_cam2img
  File "/workspace/workspace_fychen/mmdetection3d/mmdet3d/__init__.py", line 5, in <module>
    import mmseg
  File "/opt/conda/lib/python3.8/site-packages/mmseg/__init__.py", line 58, in <module>
    assert (mmcv_min_version <= mmcv_version <= mmcv_max_version), \
AssertionError: MMCV==1.6.0 is used but incompatible. Please install mmcv>=(1, 3, 13, 0, 0, 0), <=(1, 5, 0, 0, 0, 0).
This error is thrown by python3.8/site-packages/mmseg/__init__.py and means that mmseg is incompatible with mmcv 1.6.0: it demands an mmcv between 1.3 and 1.5, which indicates that the installed mmseg itself is too old. The initially installed mmsegmentation version was out of date; installing mmseg 0.24.0 instead resolves it. If other mm-series framework packages report similar version issues, handle them the same way.
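To confirm the fix without launching the whole pipeline, the gate can be reproduced with mmcv's digit_version helper (a sketch; the 1.3.13 to 1.6.0 range reflects the mmseg 0.24.0 combination validated in this post):

import mmcv
from mmcv.utils import digit_version

v = digit_version(mmcv.__version__)
assert digit_version('1.3.13') <= v <= digit_version('1.6.0'), mmcv.__version__
print('mmcv', mmcv.__version__, 'passes the mmseg version gate')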
4. KeyError: 'DiceCost is already registered in Match Cost'
Traceback (most recent call last):
  File "./tools/test.py", line 16, in <module>
    from projects.mmdet3d_plugin.datasets.builder import build_dataloader
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/__init__.py", line 3, in <module>
    from .core.bbox.match_costs import BBox3DL1Cost, DiceCost
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/core/bbox/match_costs/__init__.py", line 2, in <module>
    from .match_cost import BBox3DL1Cost, DiceCost
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/core/bbox/match_costs/match_cost.py", line 32, in <module>
    class DiceCost(object):
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 337, in _register
    self._register_module(module=module, module_name=name, force=force)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/misc.py", line 340, in new_func
    output = old_func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 272, in _register_module
    raise KeyError(f'{name} is already registered '
KeyError: 'DiceCost is already registered in Match Cost'

This duplicate-registration problem arises because UniAD's mmdet3d_plugin and the installed mmdetection's python3.8/site-packages/mmdet/core/bbox/match_costs/match_cost.py both define a DiceCost class with the same name (the older mmdetection versions used by the UniAD authors should not have this problem). Reading the registration code in mmcv's python3.8/site-packages/mmcv/utils/registry.py shows that it can be solved by setting the parameter force=True:

    @deprecated_api_warning(name_dict=dict(module_class='module'))
    def _register_module(self, module, module_name=None, force=False):
        if not inspect.isclass(module) and not inspect.isfunction(module):
            raise TypeError('module must be a class or a function, '
                            f'but got {type(module)}')

        if module_name is None:
            module_name = module.__name__
        if isinstance(module_name, str):
            module_name = [module_name]
        for name in module_name:
            if not force and name in self._module_dict:
                raise KeyError(f'{name} is already registered '
                               f'in {self.name}')
            self._module_dict[name] = module

To ensure that the UniAD code runs correctly, UniAD's DiceCost class can be force-registered: modify the decorator of the DiceCost class in UniAD/projects/mmdet3d_plugin/core/bbox/match_costs/match_cost.py to add the force=True parameter:

@MATCH_COST.register_module(force=True)
class DiceCost(object):
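The force semantics are easy to demonstrate in isolation with a toy registry (a sketch using mmcv's public Registry API):

from mmcv.utils import Registry

COSTS = Registry('cost')

@COSTS.register_module()
class DiceCost:
    pass

# Registering the same name again raises KeyError unless force=True,
# which overwrites the earlier entry.
@COSTS.register_module(force=True)
class DiceCost:  # noqa: F811 - deliberate redefinition
    pass

print(COSTS.get('DiceCost'))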

5. TypeError: cannot pickle 'dict_keys' object

File "./tools/test.py", line 261, in <module>
    main()
  File "./tools/test.py", line 231, in main
    outputs = custom_multi_gpu_test(model, data_loader, args.tmpdir,
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/apis/test.py", line 88, in custom_multi_gpu_test
    for i, data in enumerate(data_loader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 438, in __iter__
    return self._get_iterator()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 384, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1048, in __init__
    w.start()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'dict_keys' object

For the solution, see the CSDN blog post on how to locate the cause of, and fix, the TypeError: cannot pickle 'dict_keys' object error that occurs during multi-process training or testing on the nuScenes dataset.
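The underlying mechanics are simple to reproduce: dataloader workers started with the spawn method pickle the dataset object, and a dict_keys view is not picklable, while a plain list is (a minimal sketch):

import pickle

d = {'a': 1, 'b': 2}
try:
    pickle.dumps(d.keys())    # raises: cannot pickle 'dict_keys' object
except TypeError as e:
    print(e)
pickle.dumps(list(d.keys()))  # converting the view to a list works
print('list of keys pickles fine')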

6. protobuf reports TypeError: Descriptors cannot be created directly

Traceback (most recent call last):
  File "./tools/test.py", line 16, in <module>
    from projects.mmdet3d_plugin.datasets.builder import build_dataloader
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/__init__.py", line 5, in <module>
    from .datasets.pipelines import (
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/datasets/pipelines/__init__.py", line 6, in <module>
    from .occflow_label import GenerateOccFlowLabels
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/datasets/pipelines/occflow_label.py", line 5, in <module>
    from projects.mmdet3d_plugin.uniad.dense_heads.occ_head_plugin import calculate_birds_eye_view_parameters
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/__init__.py", line 2, in <module>
    from .dense_heads import *
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/__init__.py", line 4, in <module>
    from .occ_head import OccHead
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/occ_head.py", line 16, in <module>
    from .occ_head_plugin import MLP, BevFeatureSlicer, SimpleConv2d, CVT_Decoder, Bottleneck, UpsamplingAdd, \
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/occ_head_plugin/__init__.py", line 1, in <module>
    from .metrics import *
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/occ_head_plugin/metrics.py", line 10, in <module>
    from pytorch_lightning.metrics.metric import Metric
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 29, in <module>
    from pytorch_lightning.callbacks import Callback # noqa: E402
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/__init__.py", line 25, in <module>
    from pytorch_lightning.callbacks.swa import StochasticWeightAveraging
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/swa.py", line 26, in <module>
    from pytorch_lightning.trainer.optimizers import _get_default_scheduler_config
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/__init__.py", line 18, in <module>
    from pytorch_lightning.trainer.trainer import Trainer
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 30, in <module>
    from pytorch_lightning.loggers import LightningLoggerBase
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loggers/__init__.py", line 18, in <module>
    from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loggers/tensorboard.py", line 25, in <module>
    from torch.utils.tensorboard import SummaryWriter
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 12, in <module>
    from .writer import FileWriter, SummaryWriter # noqa: F401
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 9, in <module>
    from tensorboard.compat.proto.event_pb2 import SessionLog
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/proto/event_pb2.py", line 17, in <module>
    from tensorboard.compat.proto import summary_pb2 as tensorboard_dot_compat_dot_proto_dot_summary__pb2
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/proto/summary_pb2.py", line 17, in <module>
    from tensorboard.compat.proto import tensor_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__pb2
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_pb2.py", line 16, in <module>
    from tensorboard.compat.proto import resource_handle_pb2 as tensorboard_dot_compat_dot_proto_dot_resource__handle__pb2
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/proto/resource_handle_pb2.py", line 16, in <module>
    from tensorboard.compat.proto import tensor_shape_pb2 as tensorboard_dot_compat_dot_proto_dot_tensor__shape__pb2
  File "/opt/conda/lib/python3.8/site-packages/tensorboard/compat/proto/tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/opt/conda/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

My protobuf version, 4.24.4, was too new; after downgrading to 3.20 the error goes away:

pip install protobuf==3.20

7. TypeError: expected str, bytes or os.PathLike object, not _io.BufferedReader

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
    return obj_cls(**args)
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py", line 78, in __init__
    super().__init__(*args, **kwargs)
  File "/workspace/workspace_fychen/mmdetection3d/mmdet3d/datasets/nuscenes_dataset.py", line 131, in __init__
    super().__init__(
  File "/workspace/workspace_fychen/mmdetection3d/mmdet3d/datasets/custom_3d.py", line 88, in __init__
    self.data_infos = self.load_annotations(open(local_path, 'rb'))
  File "/workspace/workspace_fychen/UniAD/projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py", line 152, in load_annotations
    data = pickle.loads(self.file_client.get(ann_file))
  File "/opt/conda/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 1014, in get
    return self.client.get(filepath)
  File "/opt/conda/lib/python3.8/site-packages/mmcv/fileio/file_client.py", line 535, in get
    with open(filepath, 'rb') as f:
TypeError: expected str, bytes or os.PathLike object, not _io.BufferedReader

The problem originates in the newer mmdetection3d/mmdet3d/datasets/custom_3d.py, which supports reading annotation files through a local path and therefore passes an open IO handle into load_annotations():

def __init__(self,
                 data_root,
                 ann_file,
                 pipeline=None,
                 classes=None,
                 modality=None,
                 box_type_3d='LiDAR',
                 filter_empty_gt=True,
                 test_mode=False,
                 file_client_args=dict(backend='disk')):
        super().__init__()
        self.data_root = data_root
        self.ann_file = ann_file
        self.test_mode = test_mode
        self.modality = modality
        self.filter_empty_gt = filter_empty_gt
        self.box_type_3d, self.box_mode_3d = get_box_type(box_type_3d)

        self.CLASSES = self.get_classes(classes)
        self.file_client = mmcv.FileClient(**file_client_args)
        self.cat2id = {name: i for i, name in enumerate(self.CLASSES)}

        # load annotations
        if hasattr(self.file_client, 'get_local_path'):
            with self.file_client.get_local_path(self.ann_file) as local_path:
                self.data_infos = self.load_annotations(open(local_path, 'rb'))
        else:
            warnings.warn(
                'The used MMCV version does not have get_local_path. '
                f'We treat the {self.ann_file} as local paths and it '
                'might cause errors if the path is not a local path. '
                'Please use MMCV>= 1.3.16 if you meet errors.')
            self.data_infos = self.load_annotations(self.ann_file)

The root cause is that UniAD's projects/mmdet3d_plugin/datasets/nuscenes_e2e_dataset.py implements load_annotations() to support ann_file only as a string path. The workaround is to force mmdetection3d/mmdet3d/datasets/custom_3d.py to call self.data_infos = self.load_annotations(self.ann_file) directly, bypassing the get_local_path branch.
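The mismatch can be seen in isolation (a sketch, run from the UniAD root, assuming the infos file from the earlier step is in place): mmcv's disk FileClient.get() calls open(filepath, 'rb') itself, so it must receive a path string, not an already-opened handle:

import pickle
import mmcv

infos = 'data/infos/nuscenes_infos_temporal_val.pkl'
client = mmcv.FileClient(backend='disk')
data = pickle.loads(client.get(infos))  # string path: works
print(type(data))
# client.get(open(infos, 'rb'))         # handle: TypeError as above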

8. RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed

With the previous seven problems solved, and assuming the nuScenes dataset is complete and correctly located, the following commands should now run:

./tools/uniad_dist_eval.sh ./projects/configs/stage1_track_map/base_track_map.py ./ckpts/uniad_base_track_map.pth 8
./tools/uniad_dist_eval.sh ./projects/configs/stage2_e2e/base_e2e.py ./ckpts/uniad_base_e2e.pth 8

However, a timeout error may occur while reading data in the loop, causing the dataloader worker process to be killed:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 936, in wait
    timeout = deadline - time.monotonic()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed.

Investigation showed that the workers_per_gpu=8 setting in the configuration files projects/configs/stage1_track_map/base_track_map.py and projects/configs/stage2_e2e/base_e2e.py is too high for our server. After changing it to 2 and rerunning the commands above, they execute successfully.
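For reference, the change looks like this (a sketch of the relevant fragment; the surrounding keys in the real configs are abbreviated here):

# In projects/configs/stage1_track_map/base_track_map.py and
# projects/configs/stage2_e2e/base_e2e.py:
data = dict(
    samples_per_gpu=1,  # other keys omitted in this sketch
    workers_per_gpu=2,  # was 8; too many workers exhausted our server
)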
