Optimizing the Stable Diffusion Inference Process with PAI-Blade

Background

AIGC (AI-generated content) is a rapidly growing and important workload in AI computing. Stable Diffusion is the most popular open-source model in this area and has received wide attention. However, as its application scenarios keep expanding, its inference latency and computing cost have become increasingly prominent problems.

Introduction

PAI-Blade is a general-purpose inference optimization tool launched by PAI that achieves optimal inference performance through joint optimization of the model and the system. PAI-Blade relies on BladeDISC, an AI compiler with fully dynamic shape support, and BlaDNN, a high-performance computing library based on deep-learning automatic scheduling. Together they provide automatic high-performance inference optimization for many models, including the image generation model Stable Diffusion, large language models (LLM), large-scale sparse recommendation (CTR) models, and automatic speech recognition (ASR) models.

BladeDISC is an AI compiler that supports fully dynamic shapes. Its front end supports PyTorch and TensorFlow models, and PyTorch models can be imported through either TorchScript or TorchDynamo. Its back end uses AStitch large-scale operator fusion and efficient code generation to improve the execution efficiency of the model's memory-intensive operators. BladeDISC is open source on GitHub: https://github.com/alibaba/BladeDISC.

BlaDNN is a high-performance computing library based on deep-learning automatic scheduling. As an upgraded version of Ansor, BlaDNN not only generates kernels that perform better than Ansor's, but can also rely entirely on DNN-based automatic scheduling without any tuning, which makes online automatic scheduling possible for dynamic-shape workloads. On average, the GPU compute-intensive operators generated by DNN automatic scheduling reach 99.39% of the performance obtained with exhaustive tuning. Through joint optimization of the model and the system, the inference latency of the scheduling DNN is as low as 2 µs and it uses only one CPU core, so it causes no jitter in the performance of the GPU model itself.

With PAI-Blade, large-scale fusion and optimized code generation for memory-intensive operators, together with automatic scheduling of compute-intensive operators, can greatly reduce the inference latency and memory usage of Stable Diffusion, thereby reducing computing costs. Optimizing Stable Diffusion with PAI-Blade has the following three advantages:

High performance: Blade speeds up the end-to-end latency of inference pipelines such as Text2Img and Img2Img by a factor of 2.42 to 3.05 and reduces memory usage by up to 5.27 times, outperforming TensorRT-8.5 and other industry SOTA optimization methods.
Full dynamic shape support: after a single optimization pass, the model supports inputs of any shape and batch size.
Ease of use and scalability: only a few lines of code are needed to enable Blade optimization in multiple types of pipelines, and optimization of inference schemes such as LoRA is also supported.

Example

Next, this article takes the popular “runwayml/stable-diffusion-v1-5” Text2Img pipeline as an example to introduce in detail how to use PAI-Blade in various usage scenarios.

Environment installation

The complete script and environment for the following example are packaged in the Docker image registry.cn-beijing.aliyuncs.com/blade_demo/blade_diffusion. Inside the container, the inference example can be run directly with python /blade/blade_diffusion.py.

Official model optimization

Using PAI-Blade to optimize the Stable Diffusion model can be divided into the following steps.

First, load the pretrained model.

import torch
from diffusers import StableDiffusionPipeline

# Load the fp16 weights of Stable Diffusion v1.5 onto the first GPU.
device = torch.device("cuda:0")
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16).to(device)

Second, optimize the model with PAI-Blade. Because PAI-Blade is a fully dynamic shape optimization tool, any input shape can be used for inference after optimization.

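import torch_blade

# blade_optimize and the example inputs (encoder_inputs, unet_inputs,
# decoder_inputs) are not defined in this excerpt; see the complete script
# in the Docker image.
opt_cfg = torch_blade.Config()
opt_cfg.enable_fp16 = True  # run the optimized modules in half precision
with opt_cfg, torch.no_grad():
    encoder = blade_optimize(pipe.text_encoder, model_inputs=encoder_inputs, allow_tracing=True)
    unet = blade_optimize(pipe.unet, model_inputs=unet_inputs, allow_tracing=True)
    decoder = blade_optimize(pipe.vae.decoder, model_inputs=decoder_inputs, allow_tracing=True)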
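
The snippet above passes encoder_inputs, unet_inputs, and decoder_inputs without showing how they are built. As a rough sketch of the tensor shapes such example inputs would plausibly have, the helper below derives them from the image size. The constants are the usual Stable Diffusion v1.x values (8x VAE downsampling, 4 latent channels, 77 text tokens, 768-wide text features) and are stated here as assumptions; example_shapes and ExampleShapes are hypothetical names, and the timestep argument is left out because its traced form is not shown in this article.

```python
from dataclasses import dataclass

# Assumed architecture constants for Stable Diffusion v1.x.
VAE_SCALE = 8          # the VAE shrinks each spatial side by a factor of 8
LATENT_CHANNELS = 4    # channels of the latent image
TEXT_SEQ_LEN = 77      # maximum token length of the CLIP tokenizer
TEXT_HIDDEN = 768      # hidden size of the CLIP text encoder


@dataclass(frozen=True)
class ExampleShapes:
    """Shapes of the example tensors for each sub-model (hypothetical helper)."""
    encoder_input_ids: tuple   # integer token ids
    unet_latents: tuple        # fp16 latents
    unet_text_states: tuple    # fp16 text-encoder output
    decoder_latents: tuple     # fp16 latents fed to the VAE decoder


def example_shapes(height, width, batch=1, guidance=True):
    """Derive example input shapes from the target image size."""
    if height % VAE_SCALE or width % VAE_SCALE:
        raise ValueError("height and width must be multiples of 8")
    h, w = height // VAE_SCALE, width // VAE_SCALE
    # Classifier-free guidance runs the UNet on conditional and
    # unconditional inputs together, doubling its batch size.
    unet_batch = batch * 2 if guidance else batch
    return ExampleShapes(
        encoder_input_ids=(batch, TEXT_SEQ_LEN),
        unet_latents=(unet_batch, LATENT_CHANNELS, h, w),
        unet_text_states=(unet_batch, TEXT_SEQ_LEN, TEXT_HIDDEN),
        decoder_latents=(batch, LATENT_CHANNELS, h, w),
    )


if __name__ == "__main__":
    for side in (512, 768, 1024):
        print(side, example_shapes(side, side))
```

Because the optimization is fully dynamic in shape, the example inputs only need to be representative; the same optimized modules are then expected to serve other image sizes and batch sizes.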

Finally, replace the original model with the optimized model, and then perform inference in the same way as the original pipeline.

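from dataclasses import dataclass

# Minimal stand-in for the output object the pipeline expects from the UNet.
@dataclass
class UNet2DConditionOutput:
    sample: torch.FloatTensor

# Wraps the optimized UNet and exposes the attributes the pipeline reads.
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.unet.config
        self.in_channels = pipe.unet.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states, **kwargs):
        # Cast inputs to fp16 to match the dtype used during optimization.
        sample = unet(latent_model_input.half(), t.half(), encoder_hidden_states.half())["sample"]
        return UNet2DConditionOutput(sample=sample)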

class TracedEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.config = pipe.text_encoder.config
        self.device = pipe.text_encoder.device
        self.dtype = torch.half

    def forward(self, input_ids, **kwargs):
        embeddings = encoder(input_ids.long())
        return [embeddings["last_hidden_state"]]

class TracedDecoder(torch.nn.Module):
    def forward(self, input):
        return decoder(input.half())

pipe.text_encoder = TracedEncoder()
pipe.unet = TracedUNet()
pipe.vae.decoder = TracedDecoder()
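
To see the effect of the replacement on latency, the pipeline call can be timed before and after the swap. The harness below is a minimal sketch and not the script used to produce the measurements in this article; benchmark, warmup, runs, and sync are illustrative names. It is written against a plain callable so that it runs anywhere; on a GPU one would typically pass a synchronization function, since CUDA kernels are launched asynchronously.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    fn   : zero-argument callable to time.
    sync : optional zero-argument callable invoked before starting and
           after finishing each timed run (for example a device
           synchronization function), so pending async work is included.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up iterations absorb one-time costs such as lazy initialization.
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload; with the pipeline above one might instead use
    #   fn=lambda: pipe("a photo of a cat", num_inference_steps=50)
    #   sync=torch.cuda.synchronize
    elapsed = benchmark(lambda: sum(range(100_000)))
    print(f"median time: {elapsed:.6f} s")
```

The median is used rather than the mean so that a single slow run has less influence on the reported figure.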

A100 performance comparison

image size	sample steps	PyTorch time (s)	PAI-Blade time (s)	speedup	PyTorch memory usage (GB)	PAI-Blade memory usage (GB)
1024×1024	50	13.26	4.34	3.06X	32.91	6.25
768×768	50	5.65	2.00	2.83X	14.99	5.91
512×512	50	2.24	0.84	2.67X	6.60	5.42

A10 performance comparison

image size	sample steps	PyTorch time (s)	PAI-Blade time (s)	speedup	PyTorch memory usage (GB)	PAI-Blade memory usage (GB)
1024×1024	50	OOM	13.86	–	OOM	6.89
768×768	50	13.13	5.61	2.34X	12.60	6.22
512×512	50	4.53	2.11	2.15X	6.28	5.47
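
The speedup column is the PyTorch time divided by the PAI-Blade time, and the memory reduction can be derived the same way from the two memory columns. The short script below reproduces that arithmetic from the rows above; RESULTS and ratios are illustrative names, and the A10 1024×1024 row is omitted because the PyTorch baseline ran out of memory.

```python
# (GPU, image side) -> (PyTorch s, PAI-Blade s, PyTorch GB, PAI-Blade GB),
# copied from the two tables above.
RESULTS = {
    ("A100", 1024): (13.26, 4.34, 32.91, 6.25),
    ("A100", 768): (5.65, 2.00, 14.99, 5.91),
    ("A100", 512): (2.24, 0.84, 6.60, 5.42),
    ("A10", 768): (13.13, 5.61, 12.60, 6.22),
    ("A10", 512): (4.53, 2.11, 6.28, 5.47),
}


def ratios(row):
    """Return (speedup, memory reduction), each rounded to two decimals."""
    t_base, t_opt, m_base, m_opt = row
    return round(t_base / t_opt, 2), round(m_base / m_opt, 2)


if __name__ == "__main__":
    for (gpu, side), row in RESULTS.items():
        speedup, mem = ratios(row)
        print(f"{gpu} {side}x{side}: {speedup}x faster, {mem}x less memory")
```

For the A100 1024×1024 row this gives a 3.06x speedup and a 5.27x memory reduction, which matches the table. The memory saving shrinks at smaller image sizes because the PyTorch baseline itself needs much less memory there.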

Inference result verification

After optimizing with PAI-Blade, compare the generated image with the original PyTorch output to check that the optimized result is correct. The picture on the left is the output of PyTorch eager mode, and the picture on the right is the output of the model optimized by PAI-Blade.
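
The comparison in this article is visual. As a complement, the two outputs can also be compared numerically. The function below is a minimal sketch of such a check and is not part of PAI-Blade; compare_pixels is a hypothetical name, and it assumes both images were generated with the same prompt and random seed and are available as flat sequences of 8-bit pixel values of equal length.

```python
import math


def compare_pixels(reference, candidate, max_value=255):
    """Compare two equally sized flat sequences of pixel values.

    Returns the maximum and mean absolute difference and the PSNR in dB
    (infinite when the two sequences are identical).
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(reference, candidate)]
    mse = sum(d * d for d in diffs) / len(diffs)
    psnr = math.inf if mse == 0 else 10 * math.log10(max_value ** 2 / mse)
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "psnr_db": psnr,
    }


if __name__ == "__main__":
    eager = [10, 20, 30, 40]        # stand-in for eager-mode pixels
    optimized = [11, 20, 28, 40]    # stand-in for optimized-model pixels
    print(compare_pixels(eager, optimized))
```

Small nonzero differences are to be expected when kernels are fused or reordered in fp16, so a threshold on these metrics is a judgment call rather than a fixed rule.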

Validated pipeline types

StableDiffusionPipeline
StableDiffusionImg2ImgPipeline
StableDiffusionInpaintPipeline
AltDiffusionPipeline

LoRA optimization

LoRA adds extra low-rank matrices to the original pre-trained model and trains only those newly added weights, which greatly reduces the cost of fine-tuning. LoRA weights can be obtained by fine-tuning with the official diffusers training code. After diffusers loads LoRA weights, the model runs slightly differently from the original model, which introduces additional computational overhead.
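
To make the low-rank idea and the extra overhead concrete, the sketch below applies a LoRA-augmented linear layer in plain Python. It follows the general LoRA formulation, y = W x + (alpha / r) · B (A x), and is not taken from diffusers or PAI-Blade; the function names and the alpha / r scaling are assumptions for illustration, and a given library may scale the update differently.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W : d_out x d_in frozen base weight
    A : r x d_in     trainable down-projection
    B : d_out x r    trainable up-projection
    """
    r = len(A)                       # rank = number of rows of A
    scale = alpha / r
    base = matvec(W, x)              # original layer
    delta = matvec(B, matvec(A, x))  # two extra small matmuls: the overhead
    return [b + scale * d for b, d in zip(base, delta)]


def trainable_fraction(d_out, d_in, r):
    """Share of parameters trained by LoRA relative to the full weight."""
    return r * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    W = [[1, 0], [0, 1]]
    A = [[1, 1]]          # rank 1
    B = [[1], [2]]
    print(lora_forward(W, A, B, [3, 4]))
    print(f"{trainable_fraction(768, 768, 4):.4%} of the weights are trained")
```

For a 768×768 layer at rank 4, only about 1% of the layer's parameters are trained, which is where the saving in fine-tuning cost comes from, while the two additional small matrix products at inference time are the source of the overhead mentioned above.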

PAI-Blade currently supports the LoRA implementation in huggingface/diffusers. As before, Blade needs to optimize a given pipeline only once, after which any LoRA weights can be used for inference. We will describe how to use PAI-Blade to optimize LoRA in the next article, so stay tuned.

Outlook

Stable Diffusion-related technologies are still evolving, and the PAI-Blade team keeps following community trends and optimizing and adapting Blade for various tools. The team is currently focused on:

Integrating the relevant optimizations into stable-diffusion-webui;
Optimizing fine-tuning training speed.


This article is the original content of Alibaba Cloud and cannot be reproduced without permission
