2.11. Customized graph fusion process and quantization pipeline

introduction

This example introduces how to customize the quantization optimization process and how to invoke the optimization passes manually.

code

from typing import Callable, Iterable

import torch
import torchvision

from ppq import (BaseGraph, QuantizationOptimizationPass,
                 QuantizationOptimizationPipeline, QuantizationSetting,
                 TargetPlatform, TorchExecutor)
from ppq.api import ENABLE_CUDA_KERNEL
from ppq.IR.quantize import QuantableOperation
from ppq.IR.search import SearchableGraph
from ppq.quantization.optim import (ParameterQuantizePass,
                                    PassiveParameterQuantizePass,
                                    QuantAlignmentPass, QuantizeRefinePass,
                                    QuantizeSimplifyPass,
                                    RuntimeCalibrationPass)
from ppq.quantization.quantizer import TensorRTQuantizer

# ----------------------------------------------------------
# In this example, we will show you how to customize the quantization
# optimization process and how to invoke the optimization passes manually.
# ----------------------------------------------------------

BATCHSIZE = 32
INPUT_SHAPE = [BATCHSIZE, 3, 224, 224]
DEVICE = 'cuda'
PLATFORM = TargetPlatform.TRT_INT8

# ----------------------------------------------------------
# As usual, we need to create calibration data and load the model.
# ----------------------------------------------------------
def load_calibration_dataset() -> Iterable:
    return [torch.rand(size=INPUT_SHAPE) for _ in range(32)]
CALIBRATION = load_calibration_dataset()

def collate_fn(batch: torch.Tensor) -> torch.Tensor:
    return batch.to(DEVICE)

model = torchvision.models.mobilenet.mobilenet_v2(pretrained=True)
model = model.to(DEVICE)

# ----------------------------------------------------------
# Below, we will show you how to customize the graph fusion process.
# Graph fusion changes the quantization scheme: PPQ uses Tensor Quantization
# Config (TQC) to describe the fusion rules, implemented underneath with a
# union-find (disjoint-set) structure.
# ----------------------------------------------------------

# ----------------------------------------------------------
# Define our own graph fusion pass; here we will perform Conv - Clip fusion.
# Unlike the usual fusion, we will disable the quantization point after the Clip
# and keep the quantization between Conv and Clip.
# For more complex pattern matching, refer to
# ppq.quantization.optim.refine.SwishFusionPass (see the sketch after this class).
# ----------------------------------------------------------
class MyFusion(QuantizationOptimizationPass):
    def optimize(self, graph: BaseGraph, dataloader: Iterable,
                 collate_fn: Callable, executor: TorchExecutor, **kwargs) -> None:
        
        # Graph fusion usually starts with graph pattern matching, so let's build a pattern-matching engine.
        search_engine = SearchableGraph(graph=graph)
        for pattern in search_engine.pattern_matching(patterns=['Conv', 'Clip'], edges=[[0, 1]], exclusive=True):
            conv, relu = pattern

            # We have matched a Conv - Clip pair in the graph; now disable the unnecessary quantization points.
            # First, check whether both operations are quantable and whether they sit on the same platform.
            is_quantable = isinstance(conv, QuantableOperation) and isinstance(relu, QuantableOperation)
            is_same_plat = conv.platform == relu.platform

            if is_quantable and is_same_plat:
                # Point the input and output quantization of the Clip (relu) to the Conv output.
                # Once the assignment through dominated_by is made, PPQ sets the state of
                # relu.input_quant_config[0] and relu.output_quant_config[0] to OVERLAPPED,
                # so their quantization no longer takes effect in subsequent operations.
                relu.input_quant_config[0].dominated_by = conv.output_quant_config[0]
                relu.output_quant_config[0].dominated_by = conv.output_quant_config[0]
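
# ----------------------------------------------------------
# (Sketch, added for illustration) For a multi-node pattern such as Swish
# (Conv -> Sigmoid -> Mul, where the Conv output also feeds the Mul directly),
# pattern_matching takes one entry per node plus an edge list over node indices;
# the call would look roughly like this:
#
#     for conv, sigmoid, mul in search_engine.pattern_matching(
#             patterns=['Conv', 'Sigmoid', 'Mul'],
#             edges=[[0, 1], [1, 2], [0, 2]], exclusive=True):
#         ...
#
# Refer to ppq.quantization.optim.refine.SwishFusionPass for a complete implementation.
# ----------------------------------------------------------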

# ----------------------------------------------------------
# A custom graph fusion pass interferes with the quantizer's own fusion logic,
# so we need to create a new quantizer. Here we inherit from TensorRTQuantizer,
# so the per-operator quantization settings follow TensorRT's configuration;
# but when building the quantization pipeline, we override the quantizer's
# original logic with our custom pipeline. This lets us place the custom graph
# fusion pass at the right position. Note that QuantizationSetting no longer
# takes effect in this case.
# ----------------------------------------------------------
class MyQuantizer(TensorRTQuantizer):
    def build_quant_pipeline(self, setting: QuantizationSetting) -> QuantizationOptimizationPipeline:
        return QuantizationOptimizationPipeline([
            QuantizeRefinePass(),
            QuantizeSimplifyPass(),
            ParameterQuantizePass(),
            MyFusion(name='My Optimization Procedure'),
            RuntimeCalibrationPass(),
            QuantAlignmentPass(),
            PassiveParameterQuantizePass()])
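
# ----------------------------------------------------------
# (Sketch, added for illustration) The pipeline above can also be driven by hand
# instead of letting the quantizer run it for you. Assuming the pipeline exposes
# an optimize() entry point that forwards the same arguments each pass's
# optimize() receives (graph, dataloader, executor, collate_fn, ...), a manual
# call on an already quantized graph would look roughly like this:
#
#     pipeline = QuantizationOptimizationPipeline([
#         ParameterQuantizePass(),
#         MyFusion(name='My Optimization Procedure'),
#         RuntimeCalibrationPass()])
#     pipeline.optimize(graph=graph, dataloader=CALIBRATION, collate_fn=collate_fn,
#                       executor=TorchExecutor(graph), calib_steps=32)
# ----------------------------------------------------------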

from ppq.api import quantize_torch_model, register_network_quantizer
register_network_quantizer(quantizer=MyQuantizer, platform=TargetPlatform.EXTENSION)

# ----------------------------------------------------------
# With ENABLE_CUDA_KERNEL, PPQ will try to compile its custom high-performance
# quantization kernels; this requires a working compilation environment.
# If compilation fails, simply remove the ENABLE_CUDA_KERNEL call here.
# This will noticeably slow PPQ down, but even without these kernels you can
# still complete quantization using PyTorch's GPU operators.
# ----------------------------------------------------------
with ENABLE_CUDA_KERNEL():
    quantized = quantize_torch_model(
        model=model, calib_dataloader=CALIBRATION,
        calib_steps=32, input_shape=INPUT_SHAPE,
        collate_fn=collate_fn, platform=TargetPlatform.EXTENSION,
        onnx_export_file='model.onnx', device=DEVICE, verbose=0)

result

[PPQ ASCII art banner]

[Warning] Compling Kernels... Please wait (It will take a few minutes).
[07:13:18] PPQ Quantization Config Refine Pass Running ... Finished.
[07:13:18] PPQ Quantize Simplify Pass Running ... Finished.
[07:13:18] PPQ Parameter Quantization Pass Running ... Finished.
[07:13:19] My Optimization Procedure Running ... Finished.
[07:13:19] PPQ Runtime Calibration Pass Running...
Calibration Progress(Phase 1): 100%|██████████| 32/32 [00:10<00:00, 3.06it/s]
Finished.
[07:13:30] PPQ Quantization Alignment Pass Running ... Finished.
[07:13:30] PPQ Passive Parameter Quantization Running ... Finished.
--------- Network Snapshot ---------
Num of Op: [100]
Num of Quantized Op: [54]
Num of Variable: [277]
Num of Quantized Var: [207]
------- Quantization Snapshot ------
Num of Quant Config: [214]
ACTIVATED: [108]
FP32: [106]
Network Quantization Finished.
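
To verify that the custom fusion took effect, you can inspect the quantization states of the Clip operations in the returned graph. The following check is an additional sketch, not part of the original example; it assumes that quantized is the BaseGraph returned by quantize_torch_model above and that QuantizationStates can be imported from ppq.core. For every fused Conv - Clip pair, the input and output configs of the Clip should be in the OVERLAPPED state.

from ppq.core import QuantizationStates
from ppq.IR.quantize import QuantableOperation

for op in quantized.operations.values():
    # Only look at quantized Clip operations (the ReLU6 activations of MobileNet V2).
    if op.type == 'Clip' and isinstance(op, QuantableOperation):
        print(op.name,
              op.input_quant_config[0].state == QuantizationStates.OVERLAPPED,
              op.output_quant_config[0].state == QuantizationStates.OVERLAPPED)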