Deploying a pruned YOLO model to the Jetson Orin Nano edge device

I ran into a lot of pitfalls along the way and only recently worked through all of the bugs in this experiment, so I am recording them here.

1. Foreword

TensorRT is a high-performance deep learning inference optimization library officially provided by NVIDIA, with APIs for two programming languages: C++ and Python. Model deployment usually pursues efficiency, especially on embedded platforms, so C++ is generally the language of choice.

2. Export the ONNX model

YOLOv5 is trained with the PyTorch framework. You can use the export.py script in the official code repository to convert the PyTorch model to an ONNX model:

python export.py --weights yolov5x.pt --include onnx --imgsz 640 640

3. Prepare the model input data

To perform object detection on an image with YOLOv5, the image must go through certain preprocessing operations before it is fed to the model, and these operations must be consistent with what was done during training. The input of YOLOv5 is a 3-channel image in RGB format; each pixel needs to be divided by 255 for normalization, and the data must be arranged in CHW order. The preprocessing can therefore be roughly divided into two steps:

  1. Scale the original input image to the size required by the model, such as 640x640. Note that the image is scaled while keeping its aspect ratio; if the scaled image is smaller than the target size in one dimension, the remaining area has to be padded. For example, suppose the input image is 768x576 and the model input size is 640x640: after aspect-preserving scaling the image is 640x480, so 640-480=160 pixels need to be filled in the y direction (80 at the top and 80 at the bottom of the image). Let's take a look at the implementation code:
    cv::Mat input_image = cv::imread("dog.jpg");
    cv::Mat resize_image;
    const int model_width = 640;
    const int model_height = 640;
    const float ratio = std::min(model_width / (input_image.cols * 1.0f),
                                  model_height / (input_image.rows * 1.0f));
    // proportional scaling
    const int border_width = input_image.cols * ratio;
    const int border_height = input_image.rows * ratio;
    // Calculate offset value
    const int x_offset = (model_width - border_width) / 2;
    const int y_offset = (model_height - border_height) / 2;
    cv::resize(input_image, resize_image, cv::Size(border_width, border_height));
    cv::copyMakeBorder(resize_image, resize_image, y_offset, y_offset, x_offset,
                        x_offset, cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
    // Convert to RGB format
    cv::cvtColor(resize_image, resize_image, cv::COLOR_BGR2RGB);

  2. Normalize the image pixels and arrange them in CHW order. This step is relatively simple:

float *input_blob = new float[model_height * model_width * 3];
const int channels = resize_image.channels();
const int width = resize_image.cols;
const int height = resize_image.rows;
for (int c = 0; c < channels; c++) {
  for (int h = 0; h < height; h++) {
    for (int w = 0; w < width; w++) {
      input_blob[c * width * height + h * width + w] =
          resize_image.at<cv::Vec3b>(h, w)[c] / 255.0f;
    }
  }
}
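
As an aside, the normalization and HWC-to-CHW rearrangement above can also be done in a single call with OpenCV's cv::dnn::blobFromImage. The following is only an equivalent sketch of the loop above, assuming your OpenCV build includes the dnn module and that resize_image has already been letterboxed and converted to RGB as shown earlier:

#include <cstring>
#include <opencv2/dnn.hpp>

// blobFromImage scales every pixel by 1/255 and returns a contiguous float
// blob in NCHW order; resize_image is already 640x640 RGB, so no further
// resizing or channel swapping is requested here.
cv::Mat blob = cv::dnn::blobFromImage(resize_image, 1.0 / 255.0,
                                      cv::Size(model_width, model_height),
                                      cv::Scalar(), /*swapRB=*/false);
// The blob is stored contiguously, so it can be copied straight into input_blob
std::memcpy(input_blob, blob.ptr<float>(0),
            3 * model_height * model_width * sizeof(float));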

4. ONNX model deployment

1. Model building

To deploy a model using TensorRT's C++ API, you first need to include the header file NvInfer.h.

#include "NvInfer.h"

All of TensorRT's programming interfaces live in the namespace nvinfer1 and are prefixed with the letter I, such as ILogger and IBuilder. To deploy a model with TensorRT, you first need to create an IBuilder object, but before that you must implement the ILogger interface:

class MyLogger : public nvinfer1::ILogger {
 public:
  explicit MyLogger(nvinfer1::ILogger::Severity severity =
                        nvinfer1::ILogger::Severity::kWARNING)
      : severity_(severity) {}

  void log(nvinfer1::ILogger::Severity severity,
           const char *msg) noexcept override {
    if (severity <= severity_) {
      std::cerr << msg << std::endl;
    }
  }
  nvinfer1::ILogger::Severity severity_;
};

By default, the logger above captures messages with a severity of WARNING or higher and prints them to the terminal. With the ILogger implementation in place, you can create an IBuilder object:

MyLogger logger;
nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(logger);

After creating the IBuilder object, the first step in optimizing a model is to build the network structure of the model.

const uint32_t explicit_batch = 1U << static_cast<uint32_t>(
          nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
nvinfer1::INetworkDefinition *network = builder->createNetworkV2(explicit_batch);

There are two ways to build the model's network structure. One is to construct it layer by layer with the TensorRT API, which is rather cumbersome; the other is to parse the network structure directly from the ONNX model, which requires the ONNX parser. Since we already have a ready-made ONNX model, we choose the second approach. TensorRT's ONNX parser interface is declared in the header file NvOnnxParser.h under the namespace nvonnxparser. The code to create the ONNX parser object and load the model is as follows:

const std::string onnx_model = "yolov5m.onnx";
nvonnxparser::IParser *parser = nvonnxparser::createParser(*network, logger);
parser->parseFromFile(onnx_model.c_str(),
    static_cast<int>(nvinfer1::ILogger::Severity::kERROR));
// If parsing fails, print the error messages
for (int32_t i = 0; i < parser->getNbErrors(); ++i) {
    std::cout << parser->getError(i)->desc() << std::endl;
}

After the model is parsed successfully, you need to create an IBuilderConfig object to tell TensorRT how to optimize the model. This interface exposes many properties, the most important of which is the maximum workspace capacity. Network layers usually need some temporary workspace during execution, and this property limits how much workspace they can request; if the capacity is insufficient, a layer may fail to be implemented and an error will occur. You can also set the model's precision through this object. TensorRT's default precision is FP32, but you can enable FP16 or INT8, provided that the hardware platform supports that precision.

nvinfer1::IBuilderConfig *config = builder->createBuilderConfig();
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1U << 25);
if (builder->platformHasFastFp16()) {
  config->setFlag(nvinfer1::BuilderFlag::kFP16);
}

After setting the IBuilderConfig properties, you can start the build process to optimize the model. This takes a certain amount of time, and may take even longer on an embedded platform. The model optimized by TensorRT is serialized into an IHostMemory object. We can save it to disk and load the optimized model directly the next time it is needed, which avoids the long wait for model optimization. I usually save the serialized model to a file with the .engine suffix.

nvinfer1::IHostMemory *serialized_model =
      builder->buildSerializedNetwork(*network, *config);

// Save the serialized model to an engine file (note the binary mode)
const std::string engine_file_path = "yolov5m.engine";
std::ofstream out_file(engine_file_path, std::ios::binary);
assert(out_file.is_open());
out_file.write(static_cast<const char *>(serialized_model->data()),
               serialized_model->size());
out_file.close();

Since the IHostMemory object holds all of the model's information, the previously created IBuilder, IParser and other objects are no longer needed and can be released with delete.

delete config;
delete parser;
delete network;
delete builder;

Once the IHostMemory object is no longer needed, it too can be released with delete.

2. Model deserialization

After obtaining the optimized serialized model in the previous step, if you want to use it for inference you also need to create an instance of the IRuntime interface and then create an ICudaEngine object through its deserialization interface:

nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(
    serialized_model->data(), serialized_model->size());

delete serialized_model;
delete runtime;

If you load the .engine file directly from disk, the steps are similar: first read the model from the .engine file into memory, then deserialize it through the IRuntime interface.

// Read the engine file from disk into memory (again in binary mode)
const std::string engine_file_path = "yolov5m.engine";
std::ifstream ifs(engine_file_path, std::ios::binary | std::ios::ate);
assert(ifs.is_open());
const int model_size = ifs.tellg();
ifs.seekg(0, std::ios::beg);
void *model_mem = malloc(model_size);
ifs.read(static_cast<char *>(model_mem), model_size);
ifs.close();

nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(model_mem, model_size);

delete runtime;
free(model_mem);

3. Model inference

The ICudaEngine object holds the TensorRT-optimized model, but to run inference you also need to create an IExecutionContext object through the createExecutionContext() function to manage the inference process:

nvinfer1::IExecutionContext *context = engine->createExecutionContext();

Now let us first take a look at the complete process of using the TensorRT framework for model inference:

  1. Perform the same preprocessing operations on the input image data as during model training.
  2. Copy the model's input data from CPU to GPU.
  3. Call the model inference interface to perform inference.
  4. Copy the model's output data from GPU to CPU.
  5. Analyze the output results of the model and perform necessary post-processing to obtain the final results.

Since model inference runs on the GPU, the input and output data have to be moved between host and device, so memory must be allocated on the GPU to hold them. The sizes of the model's input and output can be obtained through the ICudaEngine interface, and with this information we can allocate the input and output buffers in advance.

void *buffers[2];
// Get the model input size and allocate GPU memory
nvinfer1::Dims input_dim = engine->getBindingDimensions(0);
int input_size = 1;
for (int j = 0; j < input_dim.nbDims; ++j) {
  input_size *= input_dim.d[j];
}
cudaMalloc(&buffers[0], input_size * sizeof(float));

// Get the model output size and allocate GPU memory
nvinfer1::Dims output_dim = engine->getBindingDimensions(1);
int output_size = 1;
for (int j = 0; j < output_dim.nbDims; ++j) {
  output_size *= output_dim.d[j];
}
cudaMalloc(&buffers[1], output_size * sizeof(float));

// Allocate corresponding CPU memory to model output data
float *output_buffer = new float[output_size]();

At this point, if your input data is ready, you can call the TensorRT interface to run inference. Normally we call the enqueueV2() function of the IExecutionContext object to perform asynchronous inference. The second parameter of this function is a CUDA stream object and the third is a CUDA event object; the event signals that the input data in the execution stream has been consumed, so the input buffer can be reused. If you are not familiar with CUDA streams and events, you can refer to an article I wrote earlier.

cudaStream_t stream;
cudaStreamCreate(&stream);
// Copy input data
cudaMemcpyAsync(buffers[0], input_blob, input_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);
// Perform inference
context->enqueueV2(buffers, stream, nullptr);
// Copy output data
cudaMemcpyAsync(output_buffer, buffers[1], output_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);

After inference succeeds, the output data has been copied into output_buffer. All that is left is to parse it according to the arrangement of YOLOv5's output data.
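
A rough sketch of this parsing step is shown below. It is not the original post-processing code; it assumes the standard YOLOv5 export layout, where the output is a flat array of num_boxes rows of 85 values each ([cx, cy, w, h, objectness, 80 class scores], already passed through sigmoid and expressed in the 640x640 letterboxed coordinate system), and it reuses the ratio, x_offset and y_offset values from the preprocessing step to map the boxes back to the original image. The 80-class count and the thresholds are only illustrative; adjust them to your own (pruned) model:

#include <iostream>
#include <vector>
#include <opencv2/dnn.hpp>

const int num_classes = 80;                   // COCO by default; change for your model
const int row_size = num_classes + 5;         // 4 box values + objectness + class scores
const int num_boxes = output_size / row_size; // e.g. 25200 for a 640x640 input
const float conf_threshold = 0.25f;
const float nms_threshold = 0.45f;

std::vector<cv::Rect> boxes;
std::vector<float> scores;
std::vector<int> class_ids;

for (int i = 0; i < num_boxes; ++i) {
  const float *row = output_buffer + i * row_size;
  const float objectness = row[4];
  if (objectness < conf_threshold) continue;

  // Pick the best class and combine its score with the objectness
  int best_class = 0;
  float best_score = 0.0f;
  for (int c = 0; c < num_classes; ++c) {
    if (row[5 + c] > best_score) {
      best_score = row[5 + c];
      best_class = c;
    }
  }
  const float confidence = objectness * best_score;
  if (confidence < conf_threshold) continue;

  // Undo the letterbox: remove the padding offsets and divide by the scale
  const float cx = (row[0] - x_offset) / ratio;
  const float cy = (row[1] - y_offset) / ratio;
  const float w = row[2] / ratio;
  const float h = row[3] / ratio;
  boxes.emplace_back(static_cast<int>(cx - w / 2), static_cast<int>(cy - h / 2),
                     static_cast<int>(w), static_cast<int>(h));
  scores.push_back(confidence);
  class_ids.push_back(best_class);
}

// Non-maximum suppression; the surviving indices are the final detections
std::vector<int> keep;
cv::dnn::NMSBoxes(boxes, scores, conf_threshold, nms_threshold, keep);
for (int idx : keep) {
  std::cout << "class " << class_ids[idx] << ", score " << scores[idx]
            << ", box [" << boxes[idx].x << ", " << boxes[idx].y << ", "
            << boxes[idx].width << ", " << boxes[idx].height << "]" << std::endl;
}

Once the detections are parsed, remember to release the resources that are no longer needed: the CUDA stream, the two GPU buffers, output_buffer, and the IExecutionContext and ICudaEngine objects.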
