I encountered a lot of pitfalls in the process, and only recently did I finish working through all the bugs in the experiment, so I am recording them here.
1. Foreword

TensorRT is a high-performance deep learning inference optimization library officially provided by NVIDIA. It supports two programming-language APIs: C++ and Python. Deep learning model deployment usually pursues efficiency, especially on embedded platforms, so C++ is generally chosen for deployment.
2. Export the ONNX model

YOLOv5 uses the PyTorch framework for training. You can use the export.py script in the official code repository to convert the PyTorch model to an ONNX model:

```bash
python export.py --weights yolov5x.pt --include onnx --imgsz 640 640
```
3. Prepare the model input data

If you want to use YOLOv5 to perform object detection on images, you need to do certain preprocessing before feeding the images to the model, and the preprocessing should be consistent with the operations used during training. The input of YOLOv5 is a 3-channel image in RGB format; each pixel needs to be divided by 255 for normalization, and the data should be arranged in CHW order. Therefore, the preprocessing of YOLOv5 can be roughly divided into two steps:
- Scale the original input image to the size required by the model, such as 640x640. Note that the image is scaled with its aspect ratio preserved, so if the scaled image is smaller than the target size in one dimension, it needs to be padded. For example: assume the input image size is 768x576 and the model input size is 640x640; after aspect-preserving scaling the image size is 640x480, so 640-480=160 pixels need to be filled in the y direction (80 at the top and 80 at the bottom of the image). Let's take a look at the implementation code:
```cpp
cv::Mat input_image = cv::imread("dog.jpg");
cv::Mat resize_image;
const int model_width = 640;
const int model_height = 640;
// Proportional scaling
const float ratio = std::min(model_width / (input_image.cols * 1.0f),
                             model_height / (input_image.rows * 1.0f));
const int border_width = input_image.cols * ratio;
const int border_height = input_image.rows * ratio;
// Calculate the offset values
const int x_offset = (model_width - border_width) / 2;
const int y_offset = (model_height - border_height) / 2;
cv::resize(input_image, resize_image, cv::Size(border_width, border_height));
cv::copyMakeBorder(resize_image, resize_image, y_offset, y_offset, x_offset,
                   x_offset, cv::BORDER_CONSTANT, cv::Scalar(114, 114, 114));
// Convert to RGB format
cv::cvtColor(resize_image, resize_image, cv::COLOR_BGR2RGB);
```
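As a quick sanity check on the 768x576 example: ratio = min(640/768, 640/576) ≈ 0.833, which gives a resized size of roughly 640x480 with x_offset = 0 and y_offset = 80, matching the padding described above. Note that the float-to-int truncation in this code can be off by one pixel; wrapping the products in std::round would avoid that.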
- Normalize the image pixels and arrange them in CHW order. This step is relatively simple:
```cpp
float *input_blob = new float[model_height * model_width * 3];
const int channels = resize_image.channels();
const int width = resize_image.cols;
const int height = resize_image.rows;
for (int c = 0; c < channels; c++) {
  for (int h = 0; h < height; h++) {
    for (int w = 0; w < width; w++) {
      input_blob[c * width * height + h * width + w] =
          resize_image.at<cv::Vec3b>(h, w)[c] / 255.0f;
    }
  }
}
```
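As a side note, if your OpenCV build includes the dnn module, the same scaling and HWC-to-CHW rearrangement can be done in one call. This is only an alternative sketch, not part of the original pipeline:

```cpp
// Alternative sketch (assumes OpenCV's dnn module is available):
// blobFromImage performs the 1/255 scaling and the HWC->CHW rearrangement
// in one call; resize_image is already RGB here, so swapRB stays false.
cv::Mat blob = cv::dnn::blobFromImage(resize_image, 1.0 / 255.0);
float *input_blob = blob.ptr<float>(0);  // contiguous NCHW float data
```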
4. ONNX model deployment

1. Model building

To deploy a model using TensorRT's C++ API, you first need to include the header file NvInfer.h.
```cpp
#include "NvInfer.h"
```
All programming interfaces of TensorRT are placed in the namespace nvinfer1 and are prefixed with the letter I, such as ILogger, IBuilder, etc. To use TensorRT to deploy a model, you first need to create an IBuilder object, but before creating it you must instantiate the ILogger interface:
```cpp
class MyLogger : public nvinfer1::ILogger {
 public:
  explicit MyLogger(nvinfer1::ILogger::Severity severity =
                        nvinfer1::ILogger::Severity::kWARNING)
      : severity_(severity) {}

  void log(nvinfer1::ILogger::Severity severity,
           const char *msg) noexcept override {
    if (severity <= severity_) {
      std::cerr << msg << std::endl;
    }
  }

  nvinfer1::ILogger::Severity severity_;
};
```
The above code will, by default, capture log messages with a severity of WARNING or higher and print them to the terminal. After instantiating the ILogger interface, you can create the IBuilder object:
```cpp
MyLogger logger;
nvinfer1::IBuilder *builder = nvinfer1::createInferBuilder(logger);
```
After creating the IBuilder object, the first step in optimizing the model is to build its network structure.
```cpp
const uint32_t explicit_batch = 1U << static_cast<uint32_t>(
    nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
nvinfer1::INetworkDefinition *network =
    builder->createNetworkV2(explicit_batch);
```
There are two ways to build the network structure of the model. One is to use the TensorRT API to build it layer by layer, which is rather troublesome; the other is to parse the network structure directly from the ONNX model, which requires the ONNX parser. Since we already have a ready-made ONNX model, we choose the second method. The ONNX parser interface of TensorRT is encapsulated in the header file NvOnnxParser.h, under the namespace nvonnxparser. The code to create the ONNX parser object and load the model is as follows:
```cpp
const std::string onnx_model = "yolov5m.onnx";
nvonnxparser::IParser *parser = nvonnxparser::createParser(*network, logger);
parser->parseFromFile(onnx_model.c_str(),
                      static_cast<int>(nvinfer1::ILogger::Severity::kERROR));
// If parsing failed, print the error messages
for (int32_t i = 0; i < parser->getNbErrors(); ++i) {
  std::cout << parser->getError(i)->desc() << std::endl;
}
```
After the model is parsed successfully, you need to create an IBuilderConfig object to tell TensorRT how to optimize the model. This interface defines many properties, the most important of which is the maximum workspace capacity. Network layers usually require some temporary workspace during their implementation, and this property limits the maximum workspace they can request; if the capacity is insufficient, a layer cannot be implemented and an error occurs. You can also set the data precision of the model through this object. The default precision of TensorRT is FP32; you can also set FP16 or INT8, provided the hardware platform supports that precision.
```cpp
nvinfer1::IBuilderConfig *config = builder->createBuilderConfig();
config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1U << 25);
if (builder->platformHasFastFp16()) {
  config->setFlag(nvinfer1::BuilderFlag::kFP16);
}
```
After setting the IBuilderConfig properties, you can start the optimization engine to optimize the model. This process takes a certain amount of time, and may take even longer on embedded platforms. The model optimized by TensorRT is serialized and saved into an IHostMemory object. We can save it to disk, and the next time it is needed, load the optimized model directly, which spares us the long wait for model optimization. I usually save the serialized model to a file with the suffix .engine.
```cpp
nvinfer1::IHostMemory *serialized_model =
    builder->buildSerializedNetwork(*network, *config);

// Serialize the model into an engine file
std::stringstream engine_file_stream;
engine_file_stream.seekg(0, engine_file_stream.beg);
engine_file_stream.write(static_cast<const char *>(serialized_model->data()),
                         serialized_model->size());
const std::string engine_file_path = "yolov5m.engine";
std::ofstream out_file(engine_file_path, std::ios::binary);
assert(out_file.is_open());
out_file << engine_file_stream.rdbuf();
out_file.close();
```
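Incidentally, the stringstream detour above is not strictly necessary; a minimal alternative with the same result writes the buffer to the file directly:

```cpp
// Write the serialized engine straight to disk (binary mode matters).
std::ofstream out_file("yolov5m.engine", std::ios::binary);
out_file.write(static_cast<const char *>(serialized_model->data()),
               serialized_model->size());
```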
Since the IHostMemory object holds all the information of the model, the previously created IBuilder, IParser and other objects are no longer needed and can be released with delete.
```cpp
delete config;
delete parser;
delete network;
delete builder;
```
After the IHostMemory object is used up, it can also be released with delete.
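Incidentally, since TensorRT 8 these objects are released with a plain delete, so std::unique_ptr with the default deleter can manage them automatically. A minimal sketch of this style (my own addition, not part of the original workflow):

```cpp
#include <memory>

// RAII-managed TensorRT objects: released automatically at scope exit
// (valid for TensorRT 8+, where `delete` is the documented release path).
std::unique_ptr<nvinfer1::IBuilder> builder{
    nvinfer1::createInferBuilder(logger)};
std::unique_ptr<nvinfer1::INetworkDefinition> network{
    builder->createNetworkV2(explicit_batch)};
```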
2. Model deserialization
After obtaining the optimized serialized model in the previous step, if you want to use the model for inference, you also need to create an instance of the IRuntime interface, and then create an ICudaEngine object through its model deserialization interface:
```cpp
nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine = runtime->deserializeCudaEngine(
    serialized_model->data(), serialized_model->size());

delete serialized_model;
delete runtime;
```
If you load the .engine file directly from disk, the steps are similar: first load the model from the .engine file into memory, and then use IRuntime to deserialize it:
```cpp
const std::string engine_file_path = "yolov5m.engine";
std::stringstream engine_file_stream;
engine_file_stream.seekg(0, engine_file_stream.beg);
std::ifstream ifs(engine_file_path, std::ios::binary);
engine_file_stream << ifs.rdbuf();
ifs.close();

engine_file_stream.seekg(0, std::ios::end);
const int model_size = engine_file_stream.tellg();
engine_file_stream.seekg(0, std::ios::beg);
void *model_mem = malloc(model_size);
engine_file_stream.read(static_cast<char *>(model_mem), model_size);

nvinfer1::IRuntime *runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine *engine =
    runtime->deserializeCudaEngine(model_mem, model_size);

delete runtime;
free(model_mem);
```
3. Model inference
The ICudaEngine object stores the model optimized by TensorRT, but to run inference you also need to create an IExecutionContext object through the createExecutionContext() function to manage the inference process:
```cpp
nvinfer1::IExecutionContext *context = engine->createExecutionContext();
```
Now let's first take a look at the complete process of model inference with the TensorRT framework:
- Perform the same preprocessing operations on the input image data as during model training.
- Copy the model's input data from the CPU to the GPU.
- Call the model inference interface to perform inference.
- Copy the model's output data from the GPU to the CPU.
- Parse the model's output and perform the necessary post-processing to obtain the final results.
Since model inference runs on the GPU, the input and output data have to be moved back and forth, so memory areas must be created on the GPU to store them. The sizes of the model's input and output can be obtained through the ICudaEngine object's interface, and based on this information we can allocate the input and output buffers in advance.
```cpp
void *buffers[2];

// Get the model input size and allocate GPU memory for it
nvinfer1::Dims input_dim = engine->getBindingDimensions(0);
int input_size = 1;
for (int j = 0; j < input_dim.nbDims; ++j) {
  input_size *= input_dim.d[j];
}
cudaMalloc(&buffers[0], input_size * sizeof(float));

// Get the model output size and allocate GPU memory for it
nvinfer1::Dims output_dim = engine->getBindingDimensions(1);
int output_size = 1;
for (int j = 0; j < output_dim.nbDims; ++j) {
  output_size *= output_dim.d[j];
}
cudaMalloc(&buffers[1], output_size * sizeof(float));

// Allocate the corresponding CPU memory for the model output data
float *output_buffer = new float[output_size]();
```
At this point, if your input data is ready, you can call the TensorRT interface for inference. Normally, we call the enqueueV2() function of the IExecutionContext object to perform asynchronous inference. Its second parameter is a CUDA stream object, and its third parameter is a CUDA event object that signals when the input data in the execution stream has been consumed and can be reused. If you are not familiar with CUDA streams and events, you can refer to an article I wrote before.
```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy the input data to the GPU
cudaMemcpyAsync(buffers[0], input_blob, input_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);
// Perform inference
context->enqueueV2(buffers, stream, nullptr);
// Copy the output data back to the CPU
cudaMemcpyAsync(output_buffer, buffers[1], output_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```
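Note that newer TensorRT releases (8.5 and later) deprecate the binding-index API in favor of named tensors. A rough sketch of the equivalent calls; the tensor names "images" and "output0" are assumptions based on the YOLOv5 exporter and should be verified with engine->getIOTensorName():

```cpp
// Sketch for TensorRT >= 8.5 (name-based tensor API). The tensor names
// below are assumptions; check them with engine->getIOTensorName(i).
context->setTensorAddress("images", buffers[0]);   // input tensor
context->setTensorAddress("output0", buffers[1]);  // output tensor
context->enqueueV3(stream);
```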
After the model inference succeeds, its output data has been copied into output_buffer. Next, we only need to parse it according to the output layout rules of YOLOv5.
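As a rough illustration of that parsing step (my own sketch, not from the original article), the code below assumes the default YOLOv5 export layout of [1, 25200, 85], i.e. (cx, cy, w, h, object confidence, 80 class scores) per candidate box, and reuses ratio, x_offset, y_offset, output_buffer and output_size from the code above to map boxes back to the original image:

```cpp
// Decode sketch: thresholds and the 85-value row layout are assumptions
// based on the default YOLOv5 export; NMSBoxes needs <opencv2/dnn.hpp>.
const float conf_threshold = 0.25f;
const int num_classes = 80;
const int num_boxes = output_size / 85;  // e.g. 25200 for a 640x640 input
std::vector<cv::Rect> boxes;
std::vector<float> scores;
std::vector<int> class_ids;
for (int i = 0; i < num_boxes; ++i) {
  const float *row = output_buffer + i * 85;
  const float objectness = row[4];
  // Find the best class for this candidate box
  int best_class = 0;
  float best_score = 0.0f;
  for (int c = 0; c < num_classes; ++c) {
    if (row[5 + c] > best_score) {
      best_score = row[5 + c];
      best_class = c;
    }
  }
  const float confidence = objectness * best_score;
  if (confidence < conf_threshold) continue;
  // Undo the letterbox: remove the padding offsets, then rescale
  const float cx = (row[0] - x_offset) / ratio;
  const float cy = (row[1] - y_offset) / ratio;
  const float w = row[2] / ratio;
  const float h = row[3] / ratio;
  boxes.emplace_back(static_cast<int>(cx - w / 2),
                     static_cast<int>(cy - h / 2),
                     static_cast<int>(w), static_cast<int>(h));
  scores.push_back(confidence);
  class_ids.push_back(best_class);
}
// Non-maximum suppression to drop overlapping detections
std::vector<int> keep;
cv::dnn::NMSBoxes(boxes, scores, conf_threshold, 0.45f, keep);
```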