Object detection using TensorFlow SSD network

Table of Contents

  • Description
  • How does this example work?
    • Process input graph
    • Prepare data
    • sampleUffSSD plugin
    • Verify output
    • TensorRT API layers and operations
  • Prerequisites
  • Run the example
    • Example --help option
  • Additional resources
  • License Agreement
  • Changelog
  • Known issues

Description

This example, sampleUffSSD, preprocesses the TensorFlow SSD network, performs inference on the SSD network using TensorRT, and uses TensorRT plugins to accelerate inference.

This example is based on the SSD: Single Shot MultiBox Detector paper. SSD networks perform object detection and localization in a single forward pass through the network.

The SSD network used in this example is based on the SSD network implemented in TensorFlow, which differs from the original paper in that it has an inception_v2 backbone network. For more information on the actual model, download ssd_inception_v2_coco. The TensorFlow SSD network was trained on the InceptionV2 architecture using the MSCOCO dataset, which includes 91 classes (including background classes). Configuration details for the network can be found here.

How does this example work?

The SSD network performs object detection and localization tasks through a single forward pass through the network. The TensorFlow SSD network is trained on the InceptionV2 architecture using the MSCOCO dataset.

This example uses TensorRT plugins to run the SSD network. To use these plugins, the TensorFlow graph needs to be preprocessed, which is done with the GraphSurgeon utility.

The main components of the network include an image preprocessor, a feature extractor, a box predictor, a grid anchor generator, and a postprocessor.

Image preprocessor
The image preprocessing step is responsible for resizing the image. The image is resized into a tensor of size 300x300x3. This step also performs normalization of the image so that all pixel values are within the range [-1, 1].
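
A minimal sketch of this normalization, assuming 8-bit input pixels as in the PPM files used later (the exact scaling code in the sample may differ):

// Map an 8-bit pixel value in [0, 255] to the range [-1, 1].
inline float normalizePixel(unsigned char p)
{
    return 2.0f * (static_cast<float>(p) / 255.0f) - 1.0f;
}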

Feature Extractor
The feature extractor part of the graph runs the InceptionV2 network on the preprocessed image. The generated feature maps are used in the anchor generation step to generate default bounding boxes for each feature map.

In this network, the sizes of the feature maps used for anchor generation are [(19×19), (10×10), (5×5), (3×3), (2×2), (1×1)].

Box Predictor
The box predictor step accepts high-level feature maps as input and generates a list of box encodings (x-y coordinates) for each encoded box of each feature map and a list of class scores for each encoded box. This information is then passed to the post-processor.

Grid Anchor Generator
The goal of this step is to generate a set of default bounding boxes for each feature map cell (based on the scales and aspect ratios specified in the configuration). This is implemented as a plugin layer in TensorRT called the gridAnchorGenerator plugin. The registered plugin name is GridAnchor_TRT.
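
As a hedged illustration of how a registered plugin such as GridAnchor_TRT can be located through the TensorRT plugin registry (the creation parameters themselves are omitted here; initLibNvInferPlugins registers TensorRT's bundled plugins):

#include "NvInfer.h"
#include "NvInferPlugin.h"

// Register the bundled TensorRT plugins (GridAnchor_TRT, NMS_TRT, ...),
// then look one up by its registered name and version.
bool findGridAnchorCreator(nvinfer1::ILogger& logger)
{
    initLibNvInferPlugins(&logger, "");
    auto* creator = getPluginRegistry()->getPluginCreator("GridAnchor_TRT", "1");
    return creator != nullptr;
}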

Postprocessor
The postprocessor step performs the final steps to generate the network output. The bounding box data and confidence scores for all feature maps are fed into this step along with the pre-generated default bounding boxes (generated in the GridAnchorGenerator namespace). It then performs NMS (non-maximum suppression), which prunes away most of the bounding boxes based on a confidence threshold and IoU (intersection over union) overlap, leaving only the top N bounding boxes per class. This is implemented as a plugin layer in TensorRT called NMS. The registered plugin name is NMS_TRT.
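
For reference, a minimal sketch of the IoU computation that NMS relies on; this illustrates the concept only and is not the plugin's actual GPU implementation:

#include <algorithm>

struct Box { float xmin, ymin, xmax, ymax; };

// Intersection over union of two axis-aligned boxes. NMS suppresses a
// candidate box when its IoU with a higher-scoring box of the same class
// exceeds the overlap threshold.
float iou(const Box& a, const Box& b)
{
    const float ix = std::max(0.0f, std::min(a.xmax, b.xmax) - std::max(a.xmin, b.xmin));
    const float iy = std::max(0.0f, std::min(a.ymax, b.ymax) - std::max(a.ymin, b.ymin));
    const float inter = ix * iy;
    const float areaA = (a.xmax - a.xmin) * (a.ymax - a.ymin);
    const float areaB = (b.xmax - b.xmin) * (b.ymax - b.ymin);
    return inter / (areaA + areaB - inter);
}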

Note: This example also implements another plugin called FlattenConcat, which flattens its inputs and then concatenates the results. The location and confidence data are passed through this plugin before being fed to the postprocessor, since the NMS plugin requires the data to be in this format.

For details on how the plugin is implemented, see the implementations of FlattenConcat and FlattenConcatPluginCreator in the sampleUffSSD.cpp file in the tensorrt/samples/sampleUffSSD directory.

Specifically, this example performs the following steps:

  • Process input graph
  • Prepare data
  • sampleUffSSD plugin
  • Verify output

Process input graph

The TensorFlow SSD graph has some operations that are currently not supported in TensorRT. By preprocessing the graph, we can combine multiple operations into a single custom operation that can be implemented as a plugin layer in TensorRT. Currently, the preprocessor provides the ability to stitch all nodes within a namespace together into one custom node.

To use the preprocessor, the convert-to-uff utility should be called with a -p flag and a config file. The config script should also include attributes for all custom plugins that will be embedded in the generated .uff file. The current sample script for SSD is located at /usr/src/tensorrt/samples/sampleUffSSD/config.py.

Using the graph preprocessor, we can remove the Preprocessor namespace from the graph, stitch the GridAnchorGenerator namespaces together to create the GridAnchorGenerator plugin, stitch the Postprocessor namespace together to get the NMS plugin, and mark the concat operations in the BoxPredictor as FlattenConcat plugins.

The TensorFlow graph has some operations, such as Assert and Identity, that can be removed for inference. The Assert operations are removed, and any remaining nodes (those with no outputs once the assertions are removed) are then recursively removed as well.

The Identity operations are removed and their inputs are forwarded to all the connected outputs. Additional documentation on the graph preprocessor can be found in the TensorRT API.

Prepare data

In the resulting network, the UFF converter names the input node Input and the output node MarkOutput_0.

parser->registerInput("Input", DimsCHW(3, 300, 300), UffInputOrder::kNCHW);
parser->registerOutput("MarkOutput_0");
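
For context, a minimal sketch of how these registration calls fit into the overall parsing flow (error handling omitted; the sample's actual code wraps this in helper classes, and the .uff filename here is the one produced in the Prerequisites section):

#include "NvInfer.h"
#include "NvUffParser.h"

// Parse the converted .uff file into a TensorRT network definition.
nvinfer1::INetworkDefinition* parseUffModel(nvinfer1::IBuilder& builder)
{
    nvinfer1::INetworkDefinition* network = builder.createNetwork();
    nvuffparser::IUffParser* parser = nvuffparser::createUffParser();
    parser->registerInput("Input", nvinfer1::DimsCHW(3, 300, 300), nvuffparser::UffInputOrder::kNCHW);
    parser->registerOutput("MarkOutput_0");
    parser->parse("sample_ssd_relu6.uff", *network, nvinfer1::DataType::kFLOAT);
    return network;
}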

The input to the SSD network in this example is a 3-channel 300×300 image. In the example, we normalize the image so that the pixel values lie in the range [-1,1].

Because TensorRT does not rely on any computer vision library, images are represented as binary R, G, and B values for each pixel. The format is Portable PixMap (PPM), a netpbm color image format. In this format, the R, G, and B values of each pixel are each represented by an integer byte (0-255) and are stored together, pixel by pixel.

There is a simple PPM reading function called readPPMFile.
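
A minimal sketch of what such a reader does; this hypothetical version handles only a plain P6 header without comment lines, whereas the sample's own readPPMFile helper lives in its common utilities:

#include <fstream>
#include <string>
#include <vector>

// Read a binary (P6) PPM file into an interleaved RGB byte buffer.
std::vector<unsigned char> readPPM(const std::string& filename, int& w, int& h)
{
    std::ifstream file(filename, std::ios::binary);
    std::string magic;
    int maxVal = 0;
    file >> magic >> w >> h >> maxVal; // expects "P6", width, height, 255
    file.get();                        // consume the whitespace before the pixel data
    std::vector<unsigned char> data(static_cast<size_t>(w) * h * 3);
    file.read(reinterpret_cast<char*>(data.data()), static_cast<std::streamsize>(data.size()));
    return data;
}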

sampleUffSSD plugin

For details on how to create a TensorRT plug-in, see Extending TensorRT with Custom Layers.

The config.py definition used with the convert-to-uff command should map the custom layer to the TensorRT plugin name by modifying the op field. The plugin parameter names should also exactly match the names and types expected by the TensorRT plugin. For example, for the GridAnchor plugin, config.py should look like this:

import graphsurgeon as gs

PriorBox = gs.create_plugin_node(name="GridAnchor",
    op="GridAnchor_TRT",
    numLayers=6,
    minSize=0.2,
    maxSize=0.95,
    aspectRatios=[1.0, 2.0, 0.5, 3.0, 0.33],
    variance=[0.1, 0.1, 0.2, 0.2],
    featureMapShapes=[19, 10, 5, 3, 2, 1])

Here, GridAnchor_TRT matches the registered plugin name, and the parameter names and types are the same as those expected by the plugin.

If config.py is defined as above, NvUffParser will be able to parse the network and call the appropriate plugin with the correct parameters.

Here are the details of some plugin layers implemented in TensorRT for SSD.

GridAnchorGeneration plugin
This plugin layer implements the grid anchor generation step of the TensorFlow SSD network. For each feature map, we compute the default bounding boxes for each grid cell. In this network, there are 6 feature maps, and the number of bounding boxes per grid cell is as follows:

  • [19×19] feature map: 3 bounding boxes per cell (19x19x3x4, where 4 is the number of coordinates per box)
  • [10×10] feature map: 6 bounding boxes per cell (10x10x6x4)
  • [5×5] feature map: 6 bounding boxes per cell (5x5x6x4)
  • [3×3] feature map: 6 bounding boxes per cell (3x3x6x4)
  • [2×2] feature map: 6 bounding boxes per cell (2x2x6x4)
  • [1×1] feature map: 6 bounding boxes per cell (1x1x6x4)

NMS plugin
The NMS plugin generates the detection output based on the location and confidence predictions generated by the BoxPredictor. This layer has three input tensors, corresponding to the location data (locData), the confidence data (confData), and the prior box data (priorData).

The inputs to the detection output plugin must be flattened and concatenated across all of the feature maps. We achieve this using the FlattenConcat plugin implemented in the example. The location data generated by the BoxPredictor has the following dimensions:

19x19x12 -> Reshape -> 1083x4 -> Flatten -> 4332x1
10x10x24 -> Reshape -> 600x4 -> Flatten -> 2400x1

And so on for the remaining feature maps.

After concatenation, the locData input has dimensions 7668x1 (1917 boxes, 4 coordinates each).

The confidence data generated by the BoxPredictor has the following dimensions:

19x19x273 -> Reshape -> 1083x91 -> Flatten -> 98553x1
10x10x546 -> Reshape -> 600x91 -> Flatten -> 54600x1

And so on for the remaining feature maps.

After concatenation, the confData input has dimensions 174447x1 (1917 boxes, 91 class scores each).
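
These sizes follow directly from the feature map shapes and boxes per cell listed for the GridAnchorGeneration plugin above; a small sketch that reproduces the arithmetic:

#include <cstdio>

int main()
{
    // Feature map sizes and boxes per grid cell, from the grid anchor step.
    const int mapSizes[6] = {19, 10, 5, 3, 2, 1};
    const int boxesPerCell[6] = {3, 6, 6, 6, 6, 6};
    const int numClasses = 91; // MSCOCO classes, including background

    int totalBoxes = 0;
    for (int i = 0; i < 6; ++i)
        totalBoxes += mapSizes[i] * mapSizes[i] * boxesPerCell[i];

    // Prints: boxes=1917 locData=7668 confData=174447
    std::printf("boxes=%d locData=%d confData=%d\n",
                totalBoxes, totalBoxes * 4, totalBoxes * numClasses);
    return 0;
}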

The prior box data generated by the GridAnchorGenerator plugin has 6 outputs with the following dimensions:

  • Output 1 corresponds to the 19x19 feature map and has dimensions 2x4332x1
  • Output 2 corresponds to the 10x10 feature map and has dimensions 2x2400x1

And so on for the other feature maps.

Note: There are two channels in these outputs because one channel is used to store the variance of each coordinate, which is used in the NMS step. After concatenation, the priorData input has dimensions 2x7668x1. The NMS plugin is configured through the DetectionOutputParameters structure:

struct DetectionOutputParameters
{
    bool shareLocation, varianceEncodedInTarget;
    int backgroundLabelId, numClasses, topK, keepTopK;
    float confidenceThreshold, nmsThreshold;
    CodeTypeSSD codeType;
    int inputOrder[3];
    bool confSigmoid;
    bool isNormalized;
};

shareLocation and varianceEncodedInTarget are used by the Caffe SSD network implementation, so for the TensorFlow network they should be set to true and false respectively. The confSigmoid and isNormalized parameters are necessary for the TensorFlow implementation. If confSigmoid is set to true, the sigmoid is computed for all the confidence scores. The isNormalized flag specifies whether the data is normalized; it is set to true for TensorFlow graphs.
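
As an illustration only, the structure might be filled in roughly as follows for this network. The threshold, topK, and inputOrder values below are hypothetical placeholders, not necessarily the sample's exact settings:

#include "NvInferPlugin.h"

// Sketch of configuring the NMS plugin for the TensorFlow SSD network.
nvinfer1::plugin::DetectionOutputParameters makeNmsParams()
{
    nvinfer1::plugin::DetectionOutputParameters p{};
    p.shareLocation = true;            // Caffe-specific flag; true for this network
    p.varianceEncodedInTarget = false; // Caffe-specific flag; false for this network
    p.backgroundLabelId = 0;           // class 0 is the background class
    p.numClasses = 91;
    p.topK = 100;                      // hypothetical value
    p.keepTopK = 100;                  // hypothetical value
    p.confidenceThreshold = 0.3f;      // hypothetical value
    p.nmsThreshold = 0.6f;             // hypothetical value
    p.codeType = nvinfer1::plugin::CodeTypeSSD::TF_CENTER;
    p.inputOrder[0] = 0;               // assumed ordering of the locData,
    p.inputOrder[1] = 2;               // confData, and priorData inputs,
    p.inputOrder[2] = 1;               // which depends on the config.py wiring
    p.confSigmoid = true;
    p.isNormalized = true;
    return p;
}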

Verify output

After the builder is created (see Building an Engine in C++) and the engine is serialized (see Serializing a Model in C++), we can perform inference. The steps for deserializing and running inference are outlined in Performing Inference in C++.
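
A condensed sketch of that flow, using the TensorRT 7-era API that matches this UFF workflow (error checks and resource cleanup omitted):

#include "NvInfer.h"

// Build and serialize the engine, then deserialize it and create an
// execution context for inference.
nvinfer1::IExecutionContext* buildAndDeserialize(nvinfer1::IBuilder& builder,
                                                 nvinfer1::INetworkDefinition& network,
                                                 nvinfer1::ILogger& logger)
{
    builder.setMaxBatchSize(1);
    nvinfer1::ICudaEngine* engine = builder.buildCudaEngine(network);
    nvinfer1::IHostMemory* blob = engine->serialize();

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* deserialized =
        runtime->deserializeCudaEngine(blob->data(), blob->size(), nullptr);
    return deserialized->createExecutionContext();
}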

The output of the SSD network is human interpretable. The final NMS and other post-processing work is done in the NMS plugin. The results are organized as tuples of 7 elements each: the image ID, the object label, the confidence score, the (x,y) coordinates of the lower-left corner of the bounding box, and the (x,y) coordinates of the upper-right corner of the bounding box. This information can be drawn on the output PPM image using the writePPMFileWithBBox function. The visualizeThreshold parameter can be used to control the visualization of objects in the image. It is currently set to 0.5, so the output shows all objects detected with a confidence of 50% and above.
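
A minimal sketch of walking this output buffer on the host; detOut and keepCount are illustrative names for the host copies of the engine's two outputs:

#include <cstdio>

// Each detection is a 7-float tuple:
// [imageId, label, confidence, xmin, ymin, xmax, ymax]
void printDetections(const float* detOut, int keepCount, float visualizeThreshold)
{
    for (int i = 0; i < keepCount; ++i)
    {
        const float* det = detOut + i * 7;
        if (det[2] < visualizeThreshold) // e.g. 0.5, as described above
            continue;
        std::printf("image %d, label %d, confidence %.3f, box (%.1f, %.1f)-(%.1f, %.1f)\n",
                    static_cast<int>(det[0]), static_cast<int>(det[1]), det[2],
                    det[3], det[4], det[5], det[6]);
    }
}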

TensorRT API layers and operations

In this example, the following layers are used. For more information about these layers, see the TensorRT Developer Guide: Layers documentation.

Activation layer
The activation layer implements an element-wise activation function. Specifically, this example uses an activation layer of type kRELU.

Concatenation layer
The concatenation layer links together multiple tensors of the same non-channel sizes along the channel dimension.

Convolution layer
The convolution layer computes a 2D (channel, height, and width) convolution, with or without bias.

Padding layer
The padding layer implements zero padding in the innermost two dimensions.

Plugin layer
Plugin layers are user-defined and provide the ability to extend the functionality of TensorRT. For more details, see Extending TensorRT with Custom Layers.

Pooling layer
Pooling layers perform pooling within channels. Supported pooling types are maximum, average and maximum-average blend.

Scale layer
The scale layer implements a per-tensor, per-channel, or per-element affine transformation and/or exponentiation by constant values.

Shuffle layer
The Shuffle layer implements the reshape and transpose operators for tensors.

Prerequisites

  1. Install the UFF toolkit and graph surgeon; depending on your TensorRT installation method, choose the corresponding method to install the toolkit and graph surgeon. You can refer to the TensorRT Installation Guide: Installing TensorRT for detailed instructions.

  2. Download the ssd_inception_v2_coco TensorFlow trained model.

  3. Preprocess the TensorFlow model using the UFF converter.

    1. Copy the TensorFlow protocol buffer file (frozen_inference_graph.pb) from the directory downloaded in the previous step to the working directory (for example, /usr/src/tensorrt/samples/sampleUffSSD/).

    2. Run the following command to convert.

    convert-to-uff frozen_inference_graph.pb -O NMS -p config.py
    
     This will save the converted `.uff` file to the same directory as the input and name it `frozen_inference_graph.pb.uff`.
    
     The `config.py` script specifies the preprocessing operations required for SSD TensorFlow graphs. The plugin nodes and plugin parameters used in the `config.py` script should match the registered plugins in TensorRT.
    
    3. Copy the converted .uff file to the data directory and rename it to /data/ssd/sample_ssd_relu6.uff.
  4. This example also requires a labels.txt file that contains all the labels used to train the model. The label file for this network is /data/ssd/ssd_coco_labels.txt.

Run the example

  1. Run make in the <TensorRT root directory>/samples/sampleUffSSD directory to compile the example. The binary will be created in the <TensorRT root directory>/bin directory.

    cd <TensorRT root directory>/samples/sampleUffSSD
    make
    

    Where <TensorRT root directory> is where you installed TensorRT.

  2. Run the example to perform object detection and localization.

    To run the example in FP32 mode:

    ./sample_uff_ssd
    

    To run the example in INT8 mode:

    ./sample_uff_ssd --int8
    

    Note: To run the network in INT8 mode, see BatchStreamPPM.h for details on how calibration is performed. Currently, we require a file called list.txt that lists all the PPM images to be used for calibration in the /data/ssd/ folder. The PPM images used for calibration can also reside in the same folder.

  3. Verify that the example runs successfully. If the example runs successfully, you should see output similar to the following:

    &&&& RUNNING TensorRT.sample_uff_ssd # ./build/x86_64-linux/sample_uff_ssd
    [I] ../data/samples/ssd/sample_ssd_relu6.uff
    [I] Begin parsing model...
    [I] End parsing model...
    [I] Begin building engine...
    [I] Num batches 1
    [I] Data Size 270000
    [I] *** deserializing
    [I] Time taken for inference is 4.24733 ms.
    [I] KeepCount 100
    [I] Dog detected in image 0 (../../data/samples/ssd/dog.ppm) with confidence 89.001 and coordinates (81.7568, 23.1155), (295.041, 298.62).
    [I] Results are saved in dog-0.890010.ppm.
    [I] Dog detected in image 0 (../../data/samples/ssd/dog.ppm) with confidence 88.0681 and coordinates (1.39267, 0), (118.431, 237.262).
    [I] Results are saved in dog-0.880681.ppm.
    &&&& PASSED TensorRT.sample_uff_ssd # ./build/x86_64-linux/sample_uff_ssd
    

    This output indicates that the example ran successfully (PASSED).

Example --help option

To see a complete list of available options and their descriptions, use the -h or --help command-line option.

Additional resources

The following resources provide a deeper understanding of the TensorFlow SSD network structure:

Model

  • TensorFlow detection model zoo

Network

  • ssd_inception_v2_coco_2017_11_17

Dataset

  • MSCOCO dataset

Documentation

  • Introduction to NVIDIA TensorRT Examples
  • Working with TensorRT Using the C++ API
  • NVIDIA TensorRT Documentation Library

License Agreement

For terms and conditions of use, copying, and distribution, please see the TensorRT Software License Agreement document.

Changelog

March 2019
This README.md file was re-created, updated, and reviewed.

Known issues

  • When running the network in INT8 mode, there may be some loss of accuracy, causing some objects to go undetected. A general observation is that more than 500 images is a good number for calibration.
  • On Windows, the Python script convert-to-uff is not available. You can generate the required .uff file on a Linux machine and then copy it to Windows to run this example.