PyTorch JIT and TorchScript: one API call that improves inference performance by up to 50%

PyTorch supports two modes: eager mode and script mode. Eager mode is mainly used for writing, training, and debugging models, while script mode is mainly for deployment. Script mode covers PyTorch JIT and TorchScript (a serialized representation of model code that executes efficiently in PyTorch).

Script mode uses torch.jit.trace and torch.jit.script to create an intermediate representation (IR) of a PyTorch eager-mode module. The IR is optimized internally and compiled by the PyTorch JIT at runtime, and the JIT compiler uses runtime information to optimize it further. The IR is decoupled from the Python runtime.
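As a minimal sketch of what this looks like in practice (the function name and logic below are made up for illustration), scripting a small function produces an IR that can be inspected directly:

import torch

# A minimal sketch: scripting a small function produces a TorchScript IR
# that is decoupled from the Python runtime and can be inspected directly.
@torch.jit.script
def scale_and_clip(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x * 2.0, min=0.0)

print(scale_and_clip.graph)  # the intermediate representation (IR)
print(scale_and_clip.code)   # TorchScript code recovered from the IR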

PyTorch JIT (Just-In-Time compilation) is PyTorch's just-in-time compiler. It offers the following:

  1. It allows you to convert models into the TorchScript format, improving model performance and making deployment easier.
  2. It lets you switch seamlessly between dynamic and static graphs: you can build and debug the model as a dynamic graph in Python, then compile it to TorchScript for optimization and deployment.
  3. It allows you to move models between deep learning frameworks, for example by exporting a PyTorch model to the ONNX format so it can run in other frameworks (a small export sketch follows this list).
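As a rough illustration of point 3 (a sketch only, assuming torchvision is installed and using an untrained ResNet-18), torch.onnx.export traces the model under the hood and writes an ONNX file:

import torch
import torchvision

# A minimal sketch: export a model to ONNX; JIT tracing happens under the hood.
model = torchvision.models.resnet18(pretrained=False).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=13)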

TorchScript is a mechanism provided by PyTorch for serializing models so that they can run in other environments. It compiles a PyTorch model into an intermediate representation that can run without a Python interpreter. This allows the model to be run from other languages such as C++, and also enables efficient inference in resource-constrained environments such as embedded devices.
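For example (a minimal sketch, assuming torchvision is available), a traced model can be saved to a single standalone file and loaded back without the original Python class definition; the same file can also be loaded from C++ with LibTorch:

import torch
import torchvision

# A minimal sketch: serialize a traced model to a standalone file.
model = torchvision.models.resnet18(pretrained=False).eval()
traced = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
traced.save("resnet18_traced.pt")

# The file can be reloaded without the original model code
# (or loaded from C++ via torch::jit::load in LibTorch).
loaded = torch.jit.load("resnet18_traced.pt")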

Here are some important features and uses of TorchScript:

  1. Static graph representation: TorchScript is a static graph representation. The computational graph is compiled and optimized when the model is built, rather than constructed dynamically at runtime, which can improve execution efficiency.
  2. Model export: TorchScript allows a PyTorch model to be exported to a standalone file that can then run on devices without a Python environment.
  3. Cross-platform deployment: TorchScript allows model conversion between different deep learning frameworks, for example converting PyTorch models to the ONNX format so they can run in other frameworks.
  4. Model optimization and quantization: with TorchScript, you can apply techniques such as quantization to reduce the model's memory footprint and compute consumption (a quantization sketch follows this list).
  5. Fusion and integration: TorchScript can help you combine multiple models into one overall pipeline, improving the overall performance of the system.
  6. Embedded devices: for resource-constrained embedded devices, TorchScript can help you optimize the model to fit those environments.
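As a rough sketch of point 4, dynamic quantization can be combined with scripting (the small network below is made up for illustration):

import torch
import torch.nn as nn

# A minimal sketch: dynamically quantize the Linear layers of a small, made-up
# network, then script the result so it can be serialized for deployment.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
scripted = torch.jit.script(quantized)
scripted.save("quantized_model.pt")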

Using TorchScript makes PyTorch models easier to deploy and integrate in production environments. However, it may also require you to make some modifications to the model in order for it to successfully compile to TorchScript.

Overall, TorchScript is a powerful tool, especially for situations where you need to deploy PyTorch models in different environments. By exporting your model to TorchScript, you can achieve a wider range of model applications and deployments.

To summarize in one paragraph: why and when should we use script mode?

  1. You can run the model without the limitations of the Python GIL and the Python runtime, for example from C++ through LibTorch. This makes model deployment easier, for instance on IoT platforms; there is a tutorial that runs a PyTorch model from C++ in exactly this way.
  2. PyTorch JIT is an optimizing JIT compiler for PyTorch. It uses runtime information to optimize TorchScript modules and can automatically perform optimizations such as layer fusion, quantization, and sparsification. As a result, a TorchScript model generally performs better than the equivalent eager-mode PyTorch model.

Script mode is entered through torch.jit.trace or torch.jit.script; they are two different ways of converting Python code to TorchScript. torch.jit.trace feeds a specific input (usually a tensor; we need to provide one) through a PyTorch model, records the computation performed on that input, and converts the result into a TorchScript module. This approach works well for models that can be fully defined as a static graph, such as networks with fixed input sizes, and it is typically used to convert pre-trained models. torch.jit.script instead compiles a Python function (or module) directly into TorchScript using Python syntax rules. It is better suited to dynamic models whose structure and inputs can change at runtime; for example, for RNNs or models with variable sequence lengths, torch.jit.script is usually more convenient.
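A minimal sketch of the difference (the toy module below is made up; it contains a data-dependent branch):

import torch

class ToyModel(torch.nn.Module):
    def forward(self, x):
        # Data-dependent branch: the path taken depends on the input values
        if x.sum() > 0:
            return x * 2
        return torch.zeros_like(x)

m = ToyModel()
# trace records only the branch taken for this example input (a TracerWarning is emitted)
traced = torch.jit.trace(m, torch.ones(3))
# script compiles the Python source, so both branches are preserved
scripted = torch.jit.script(m)

print(traced(-torch.ones(3)))    # branch baked in at trace time: tensor([-2., -2., -2.])
print(scripted(-torch.ones(3)))  # follows the real control flow: tensor([0., 0., 0.])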

Under normal circumstances, you should prefer to use torch.jit.trace instead of torch.jit.script.

The previous blog post covered the differences between torch.jit.trace and torch.jit.script in detail, along with usage suggestions. It is strongly recommended to read that post before this one.

In this article, we focus on the performance difference between the TorchScript model and the eager-mode model.

JIT Trace

torch.jit.trace takes an eager-mode model and a dummy input. Based on the provided model and input, the tracer records the flow of data through the model and converts the entire model into a TorchScript module. Let's look at a concrete example.

We use BERT (Bidirectional Encoder Representations from Transformers) as an example.

from transformers import BertTokenizer, BertModel
import numpy as np
import torch
from time import perf_counter

def timer(f, *args):
    start = perf_counter()
    f(*args)
    return 1000 * (perf_counter() - start)

# Load the BERT model in eager mode
native_model = BertModel.from_pretrained("bert-base-uncased")
# In the Hugging Face API, passing torchscript=True loads a model that can be exported to TorchScript.
script_model = BertModel.from_pretrained("bert-base-uncased", torchscript=True)

script_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', torchscript=True)



# Tokenizing input text
text = "[CLS] Who was Jim Henson? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = script_tokenizer.tokenize(text)

# Masking one of the input tokens
masked_index = 8

tokenized_text[masked_index] = '[MASK]'

indexed_tokens = script_tokenizer.convert_tokens_to_ids(tokenized_text)

segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

Next, we test the inference speed of the eager-mode PyTorch model on the CPU and on the GPU.

# Test eager model inference performance on CPU
native_model.eval()
np.mean([timer(native_model,tokens_tensor,segments_tensors) for _ in range(100)])

# Test eager model inference performance on GPU
native_model = native_model.cuda()
native_model.eval()
tokens_tensor_gpu = tokens_tensor.cuda()
segments_tensors_gpu = segments_tensors.cuda()
np.mean([timer(native_model,tokens_tensor_gpu,segments_tensors_gpu) for _ in range(100)])

Then we test the inference speed of the TorchScript model (script mode) on the CPU and on the GPU.

# Test TorchScript performance on CPU
traced_model = torch.jit.trace(script_model, [tokens_tensor, segments_tensors])
# The traced model already includes the .eval() behavior, so there is no need to call model.eval() explicitly.
np.mean([timer(traced_model,tokens_tensor,segments_tensors) for _ in range(100)])

# Test the performance of TorchScript on GPU
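# A sketch: re-trace the model after moving it and the example tensors to the GPU,
# then time it the same way as the CPU test above.
traced_model_gpu = torch.jit.trace(script_model.cuda(),
                                   [tokens_tensor_gpu, segments_tensors_gpu])
np.mean([timer(traced_model_gpu, tokens_tensor_gpu, segments_tensors_gpu) for _ in range(100)])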

The final results are shown in the table below:

              CPU latency (ms)    GPU latency (ms)
PyTorch           171.27              30.42
TorchScript       165.24              13.50

The hardware is a Google Colab instance: the CPU is an Intel(R) Xeon(R) CPU @ 2.00GHz and the GPU is a Tesla T4.

From the results, TorchScript is about 3.5% faster than PyTorch eager mode on the CPU and about 55.6% faster on the GPU.

Next, we run the same test with ResNet.

import torchvision
import torch
from time import perf_counter
import numpy as np

def timer(f,*args):
    start = perf_counter()
    f(*args)
    return (1000 * (perf_counter() - start))
  
# Pytorch cpu version

model_ft = torchvision.models.resnet18(pretrained=True)
model_ft.eval()
x_ft = torch.rand(1,3, 224,224)
print(f'pytorch cpu: {np.mean([timer(model_ft, x_ft) for _ in range(10)])}')

# Pytorch gpu version

model_ft_gpu = torchvision.models.resnet18(pretrained=True).cuda()
x_ft_gpu = x_ft.cuda()
model_ft_gpu.eval()
print(f'pytorch gpu: {np.mean([timer(model_ft_gpu, x_ft_gpu) for _ in range(10)])}')

#TorchScript cpu version

script_cell = torch.jit.script(model_ft)
print(f'torchscript cpu: {np.mean([timer(script_cell, x_ft) for _ in range(10)])}')

#TorchScript gpu version

script_cell_gpu = torch.jit.script(model_ft_gpu)
print(f'torchscript gpu: {np.mean([timer(script_cell_gpu, x_ft_gpu) for _ in range(100)])}')
The results:

              CPU latency (ms)    GPU latency (ms)
PyTorch            77.47               2.99
TorchScript        74.24               1.64

Compared with the PyTorch eager-mode model, TorchScript improves CPU performance by about 4.2% and GPU performance by about 45%, consistent with the BERT results.

Summary

  1. This article introduced PyTorch's eager mode and script mode, focusing on script mode, i.e. TorchScript and the PyTorch JIT.
  2. The previous article covered the two APIs that convert eager-mode models to script mode, torch.jit.trace and torch.jit.script, and the differences between them. That article is the basis for this one, and it is recommended to read it first.
  3. We benchmarked the PyTorch eager-mode model against the TorchScript model on CPU and GPU using two networks, BERT and ResNet. The conclusion is consistent for both: on the CPU, TorchScript gives roughly a 4% performance improvement over PyTorch eager mode, and on the GPU roughly a 50% improvement.