Background
AI-generated content (AIGC) is a fast-growing and important workload in AI computing. Stable Diffusion is the most popular open-source model in this area and has attracted wide attention. As its application scenarios expand, however, inference latency and compute cost have become increasingly prominent problems.
Introduction
PAI-Blade is a general-purpose inference optimization tool from PAI that achieves optimal inference performance through joint optimization of model and system. It builds on BladeDISC, an AI compiler with fully dynamic shape support, and BlaDNN, a high-performance computing library based on deep-learning automatic scheduling. Together they provide automatic high-performance inference optimization for many kinds of models, including the image generation model Stable Diffusion, large language models (LLM), large-scale sparse recommendation models (CTR), and speech recognition models (ASR).
BladeDISC is an AI compiler that supports fully dynamic shapes. Its front end accepts PyTorch and TensorFlow models, and PyTorch models can be fed in through either TorchScript or TorchDynamo. Its back end uses AStitch large-scale operator fusion and efficient code generation to speed up the memory-intensive operators of a model. BladeDISC is open source on GitHub: https://github.com/alibaba/BladeDISC.
BlaDNN is a high-performance computing library based on deep-learning automatic scheduling. As an upgraded version of Ansor, BlaDNN not only generates kernels that outperform Ansor's, but can also rely entirely on DNN-based automatic scheduling without tuning, which makes online automatic scheduling feasible for dynamic-shape workloads. The GPU compute-intensive operators generated this way reach, on average, 99.39% of the performance obtained with full tuning. Through joint model-system optimization, the inference latency of the scheduling DNN is as low as 2 µs and it uses only one CPU core, so it causes no jitter in the performance of the GPU model itself.
With PAI-Blade, large-scale fusion and optimized code generation for memory-intensive operators, combined with automatic scheduling of compute-intensive operators, greatly reduce the inference latency and memory usage of Stable Diffusion and thereby lower compute cost. Optimizing Stable Diffusion with PAI-Blade has three advantages:
- High performance: Blade reduces the end-to-end latency of inference pipelines such as Text2Img and Img2Img by 2.42-3.05 times and cuts memory usage by up to 5.27 times, outperforming TensorRT-8.5 and other industry SOTA optimization methods.
- Full dynamic shape support: after a single optimization pass, inputs of any shape and batch size are supported.
- Ease of use and scalability: only a few lines of code are needed to enable Blade optimization in multiple types of pipelines, and inference schemes such as LoRA can be optimized as well.
Example
Next, this article uses the popular “runwayml/stable-diffusion-v1-5” Text2Img pipeline as an example to show in detail how to use PAI-Blade in different scenarios.
Environment installation
The complete script and environment for the following example are packaged in the registry.cn-beijing.aliyuncs.com/blade_demo/blade_diffusion docker image. Inside this container, the inference example can be run directly with python /blade/blade_diffusion.py.
Official model optimization
Using PAI-Blade to optimize the Stable Diffusion model can be divided into the following steps.
First, load the pretrained model.
import torch
from diffusers import StableDiffusionPipeline

# Load the fp16 weights and move the whole pipeline to the first GPU.
device = torch.device("cuda:0")
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
).to(device)
Second, optimize with PAI-Blade. Because PAI-Blade supports fully dynamic shapes, inputs of any shape can be used for inference after optimization.
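import torch
import torch_blade
# Assumption: blade_optimize in this snippet refers to torch_blade.optimize.
from torch_blade import optimize as blade_optimize

# encoder_inputs, unet_inputs and decoder_inputs are example inputs used for
# tracing; they must be defined before this step.
opt_cfg = torch_blade.Config()
opt_cfg.enable_fp16 = True
with opt_cfg, torch.no_grad():
    encoder = blade_optimize(pipe.text_encoder, model_inputs=encoder_inputs, allow_tracing=True)
    unet = blade_optimize(pipe.unet, model_inputs=unet_inputs, allow_tracing=True)
    decoder = blade_optimize(pipe.vae.decoder, model_inputs=decoder_inputs, allow_tracing=True)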
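The optimization call above takes `encoder_inputs`, `unet_inputs`, and `decoder_inputs` as given. As a rough sketch of what such example inputs might look like, the helper below computes plausible tensor shapes for a given image size. The constants are assumptions about Stable Diffusion v1.5 (4 latent channels, 8x VAE downsampling, 77 text tokens, hidden size 768), the doubled UNet batch assumes classifier-free guidance, and the helper name `example_input_shapes` is hypothetical. It is not taken from the PAI-Blade scripts.

```python
from typing import Dict, Tuple

# Assumptions about Stable Diffusion v1.5 (not stated in the article).
LATENT_CHANNELS = 4   # channels of the VAE latent space
VAE_SCALE = 8         # the VAE downsamples height and width by 8
TEXT_SEQ_LEN = 77     # maximum token length of the CLIP tokenizer
TEXT_HIDDEN = 768     # hidden size of the CLIP text encoder

Shape = Tuple[int, ...]


def example_input_shapes(height: int, width: int, batch: int = 1) -> Dict[str, Tuple[Shape, ...]]:
    """Return hypothetical example-input shapes for one image size."""
    if height % VAE_SCALE or width % VAE_SCALE:
        raise ValueError("height and width must be multiples of 8")
    latent = (LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)
    # Classifier-free guidance evaluates the conditional and unconditional
    # branches together, so the UNet sees twice the batch size.
    unet_batch = 2 * batch
    return {
        "encoder_inputs": ((batch, TEXT_SEQ_LEN),),           # token ids (int64)
        "unet_inputs": (
            (unet_batch, *latent),                            # noisy latents (fp16)
            (1,),                                             # timestep (assumed shape)
            (unet_batch, TEXT_SEQ_LEN, TEXT_HIDDEN),          # text embeddings (fp16)
        ),
        "decoder_inputs": ((batch, *latent),),                # latents to decode (fp16)
    }


# Only the latent height and width change with the image size; with dynamic
# shape support, one optimized model is expected to serve all of them.
for size in (512, 768, 1024):
    print(size, example_input_shapes(size, size)["unet_inputs"][0])
```

Tensors of these shapes could then be created on the GPU (for example with `torch.randn` for the fp16 inputs and `torch.ones` with an integer dtype for the token ids) and passed as `model_inputs`.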
Finally, replace the original model with the optimized model, and then perform inference in the same way as the original pipeline.
from dataclasses import dataclass

import torch


@dataclass
class UNet2DConditionOutput:
    sample: torch.FloatTensor


class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Expose the attributes the pipeline reads from the original UNet.
        self.config = pipe.unet.config
        self.in_channels = pipe.unet.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states, **kwargs):
        # The optimized UNet expects fp16 inputs.
        sample = unet(latent_model_input.half(), t.half(), encoder_hidden_states.half())["sample"]
        return UNet2DConditionOutput(sample=sample)


class TracedEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.text_encoder.config
        self.device = pipe.text_encoder.device
        self.dtype = torch.half

    def forward(self, input_ids, **kwargs):
        embeddings = encoder(input_ids.long())
        return [embeddings["last_hidden_state"]]


class TracedDecoder(torch.nn.Module):
    def forward(self, input):
        return decoder(input.half())


# Swap the optimized modules into the pipeline.
pipe.text_encoder = TracedEncoder()
pipe.unet = TracedUNet()
pipe.vae.decoder = TracedDecoder()
A100 performance comparison
Image size | Sampling steps | PyTorch time (s) | PAI-Blade time (s) | Speedup | PyTorch memory usage (GB) | PAI-Blade memory usage (GB) |
1024×1024 | 50 | 13.26 | 4.34 | 3.06X | 32.91 | 6.25 |
768×768 | 50 | 5.65 | 2.00 | 2.83X | 14.99 | 5.91 |
512×512 | 50 | 2.24 | 0.84 | 2.67X | 6.60 | 5.42 |
A10 performance comparison
Image size | Sampling steps | PyTorch time (s) | PAI-Blade time (s) | Speedup | PyTorch memory usage (GB) | PAI-Blade memory usage (GB) |
1024×1024 | 50 | OOM | 13.86 | – | OOM | 6.89 |
768×768 | 50 | 13.13 | 5.61 | 2.34X | 12.60 | 6.22 |
512×512 | 50 | 4.53 | 2.11 | 2.15X | 6.28 | 5.47 |
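The speedup column in the two tables is simply the PyTorch time divided by the PAI-Blade time, and the memory reduction can be derived the same way. The short sketch below recomputes both ratios from the figures in the tables. The variable and function names are illustrative only.

```python
from typing import List, Optional, Tuple

# (image size, PyTorch time s, PAI-Blade time s, PyTorch memory GB, PAI-Blade memory GB)
# None marks the PyTorch run that went out of memory (OOM) on A10.
Row = Tuple[str, Optional[float], float, Optional[float], float]

A100: List[Row] = [
    ("1024x1024", 13.26, 4.34, 32.91, 6.25),
    ("768x768", 5.65, 2.00, 14.99, 5.91),
    ("512x512", 2.24, 0.84, 6.60, 5.42),
]
A10: List[Row] = [
    ("1024x1024", None, 13.86, None, 6.89),
    ("768x768", 13.13, 5.61, 12.60, 6.22),
    ("512x512", 4.53, 2.11, 6.28, 5.47),
]


def ratios(rows: List[Row]) -> List[Tuple[str, float, float]]:
    """Return (size, speedup, memory reduction) for rows with a PyTorch baseline."""
    result = []
    for size, t_pt, t_blade, m_pt, m_blade in rows:
        if t_pt is None or m_pt is None:
            continue  # no baseline to compare against (OOM)
        result.append((size, round(t_pt / t_blade, 2), round(m_pt / m_blade, 2)))
    return result


for name, rows in (("A100", A100), ("A10", A10)):
    for size, speedup, mem in ratios(rows):
        print(f"{name} {size}: {speedup:.2f}x faster, {mem:.2f}x less memory")
```

For the A100 1024×1024 row this gives 13.26 / 4.34 ≈ 3.06 and 32.91 / 6.25 ≈ 5.27, which matches the speedup column and the "up to 5.27 times" memory figure quoted earlier.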
Inference result verification
After optimizing with PAI-Blade, compare the generated image with the original PyTorch output to check that the optimization result is correct. The picture on the left is the output of PyTorch eager mode, and the picture on the right is the output of the PAI-Blade-optimized model.
Validated pipeline types
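The article compares the two outputs visually. As a possible numeric complement, the sketch below computes the peak signal-to-noise ratio (PSNR) between two images given as flat sequences of 8-bit pixel values. This check is an assumption added for illustration and is not part of the PAI-Blade workflow. Because the optimized model runs in fp16, the outputs are not expected to be bit-identical, so any acceptance threshold is a judgment call.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], candidate: Sequence[int], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of pixel values")
    if not reference:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Toy data standing in for flattened image pixels.
eager_pixels = [12, 200, 45, 90]
blade_pixels = [12, 201, 44, 90]
print(f"PSNR: {psnr(eager_pixels, blade_pixels):.1f} dB")
```

With real images, the pixel sequences could be obtained by flattening both outputs after generating them with the same prompt and random seed.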
- StableDiffusionPipeline
- StableDiffusionImg2ImgPipeline
- StableDiffusionInpaintPipeline
- AltDiffusionPipeline
LoRA optimization
LoRA fine-tunes a pre-trained model by adding extra low-rank matrices on top of the original weights and training only those newly added weights, which greatly reduces the cost of fine-tuning. LoRA weights can be obtained by fine-tuning with the official diffusers training code. Once LoRA weights are loaded into diffusers, the model runs slightly differently from the original model, which introduces additional computational overhead.
PAI-Blade currently supports the LoRA approach in huggingface/diffusers. As before, Blade needs to optimize a given pipeline only once, after which any LoRA weights can be used for inference. We will describe how to optimize LoRA with PAI-Blade in the next article, so stay tuned.
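To make the low-rank idea concrete, here is a minimal pure-Python sketch. It assumes the common LoRA formulation in which a weight matrix W of shape (d_out, d_in) is adjusted by scale * (up @ down), with `up` of shape (d_out, r) and `down` of shape (r, d_in). The function names are hypothetical, and the sketch does not describe how diffusers or PAI-Blade implement LoRA internally.

```python
from typing import List

Matrix = List[List[float]]


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable values in the two low-rank matrices."""
    return rank * (d_out + d_in)


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight: Matrix, up: Matrix, down: Matrix, scale: float = 1.0) -> Matrix:
    """Return weight + scale * (up @ down)."""
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# A 768 x 768 projection has 589,824 weights; a rank-4 LoRA trains only 6,144.
full = 768 * 768
lora = lora_param_count(768, 768, rank=4)
print(f"full: {full}, LoRA: {lora} ({100 * lora / full:.2f}% of full)")

# Tiny worked example with rank 1.
weight = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0], [2.0]]
down = [[3.0, 4.0]]
print(apply_lora(weight, up, down, scale=0.5))
```

When the low-rank path is kept separate from the original weight, each adapted layer performs two extra small matrix products per forward pass, which is one way to picture the additional overhead mentioned above.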
Outlook
Stable Diffusion-related technologies are still evolving. The PAI-Blade team follows community trends closely and keeps optimizing for and adapting to the various tools. The team is currently focused on:
- Integrating the relevant optimizations into stable-diffusion-webui;
- Optimizing fine-tuning training speed.
This article is the original content of Alibaba Cloud and cannot be reproduced without permission