Using PAI-Blade to optimize the Stable Diffusion inference process (2)

Background

In the previous article, we used PAI-Blade to optimize the Stable Diffusion model in diffusers. In this article, we continue by introducing how to optimize the inference process for LoRA and ControlNet with PAI-Blade. The relevant optimizations can also be used directly via the registry.cn-beijing.aliyuncs.com/blade_demo/blade_diffusion image. We also introduce how to integrate PAI-Blade optimization into Stable-Diffusion-webui.

LoRA optimization

PAI-Blade optimizes LoRA in largely the same way as before: load the model, optimize the model, and replace the original model. Only the parts that differ from the previous article are described below.

First, after loading the Stable Diffusion model, load the LoRA weights:

pipe.unet.load_attn_procs("lora/")
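For context, a minimal sketch of loading the base pipeline and then the LoRA weights might look like the following (the model ID and LoRA directory are assumptions, not from the original article):

import torch
from diffusers import StableDiffusionPipeline

# Load the base pipeline (model ID assumed for illustration), then load the
# LoRA attention-processor weights from a local directory (path assumed).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.unet.load_attn_procs("lora/")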

When using LoRA, users may need to switch between different LoRA weights to try different styles. Therefore, PAI-Blade needs freeze_module=False in the optimization configuration, so that the weights are not folded into the compiled model during optimization and weight loading continues to work. In this way, a model optimized by PAI-Blade can still load LoRA weights through pipe.unet.load_attn_procs() without being recompiled and re-optimized.

Since the model weights are not frozen, some constant-related optimizations cannot be performed and some optimization potential is lost. To mitigate this performance loss, PAI-Blade provides patches that replace parts of the original model at the Python level, making the model better suited to PAI-Blade optimization. Applying torch_blade.monkey_patch to the unet and vae of the Stable Diffusion model before optimization lets PAI-Blade's capabilities be used more fully.

import torch
import torch_blade
from torch_blade.monkey_patch import patch_utils

# Patch the Conv2d layers of the VAE decoder and the UNet before optimization.
patch_utils.patch_conv2d(pipe.vae.decoder)
patch_utils.patch_conv2d(pipe.unet)

opt_cfg = torch_blade.Config()
...
# Keep the weights un-frozen so that LoRA weights can still be swapped later.
opt_cfg.freeze_module = False
with opt_cfg, torch.no_grad():
    ...
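As a usage sketch (the LoRA directories and prompt below are assumptions), switching styles after optimization then only requires reloading the attention processors, without recompiling:

# Swap between different LoRA weights on the already-optimized model
# (directory names and prompt are illustrative assumptions).
pipe.unet.load_attn_procs("lora_style_a/")
image_a = pipe("a castle on a hill, detailed illustration").images[0]

pipe.unet.load_attn_procs("lora_style_b/")
image_b = pipe("a castle on a hill, detailed illustration").images[0]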

If there is no need to switch LoRA weights, the patching and freeze_module steps above can be skipped to obtain faster inference.

Benchmark

We tested the above LoRA optimization on A100 and A10 GPUs; the test model is runwayml/stable-diffusion-v1-5 and the number of sampling steps is 50.

ControlNet adaptation

Based on the ControlNet model structure diagram and the ControlNet implementation in diffusers, ControlNet inference can be divided into two parts:

  1. The ControlNet part: its input blocks and mid block share the structure of the first half of the Stable Diffusion Unet, and the rest are convolution layers. All of ControlNet's outputs are passed to the Stable Diffusion Unet as additional inputs;
  2. The Stable Diffusion Unet takes the ControlNet outputs as inputs in addition to its original inputs.

According to the above characteristics, we can make the following optimizations:

First, optimize the ControlNet:

controlnet = torch_blade.optimize(pipe.controlnet, model_inputs=tuple(controlnet_inputs), allow_tracing=True)
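For reference, controlnet_inputs can be built from example tensors along these lines (the shapes assume a 512x512, fp16, classifier-free-guidance run and are purely illustrative):

import torch

# Example trace inputs for the ControlNet (shapes and dtypes are assumptions).
sample = torch.randn(2, 4, 64, 64, dtype=torch.half, device="cuda")                # noisy latents
timestep = torch.tensor(999, device="cuda")                                        # diffusion step
encoder_hidden_states = torch.randn(2, 77, 768, dtype=torch.half, device="cuda")   # text embeddings
controlnet_cond = torch.randn(2, 3, 512, 512, dtype=torch.half, device="cuda")     # control image
controlnet_inputs = [sample, timestep, encoder_hidden_states, controlnet_cond]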

When optimizing the unet, torch.jit.trace does not support dict inputs in PyTorch versions before 2.0, so we wrap the Unet in a wrapper class for tracing and optimization. At the same time, we run one inference with the optimized ControlNet and add its outputs to the Unet inputs.

class UnetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(
        self,
        sample,
        timestep,
        encoder_hidden_states,
        down_block_additional_residuals,
        mid_block_additional_residual,
    ):
        return self.unet(
            sample,
            timestep,
            encoder_hidden_states=encoder_hidden_states,
            down_block_additional_residuals=down_block_additional_residuals,
            mid_block_additional_residual=mid_block_additional_residual,
        )

...
down_block_res_samples, mid_block_res_sample = controlnet(*controlnet_inputs)
unet_inputs += [tuple(down_block_res_samples), mid_block_res_sample]
unet = torch_blade.optimize(UnetWrapper(pipe.unet).eval(), model_inputs=tuple(unet_inputs), allow_tracing=True)

Combining the above, the following can be supported at the same time:

  1. LoRA weight replacement;
  2. ControlNet weight replacement, so that different ControlNet models can be used (see the sketch below).
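For example, switching ControlNet weights might look roughly like the following (the checkpoint name is an assumption, and exactly which module receives the new state dict depends on how the optimized ControlNet was wired back into the pipeline):

import torch
from diffusers import ControlNetModel

# Load another ControlNet checkpoint (model ID assumed) and copy its weights
# into the existing ControlNet without re-running PAI-Blade optimization.
new_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe.controlnet.load_state_dict(new_controlnet.state_dict())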

Benchmark

We tested the above ControlNet optimization on A100 and A10 GPUs; the test model is runwayml/stable-diffusion-v1-5 and the number of sampling steps is 50.


Summary

In the sections above, we used PAI-Blade to optimize the encoder, unet, and decoder of the Stable Diffusion model, greatly reducing inference latency and memory usage, and thereby lowering the inference cost of the Stable Diffusion model. At the same time, PAI-Blade supports common features such as LoRA and ControlNet, which broadens its practical applicability.

webui adaptation

stable-diffusion-webui is a very popular Stable Diffusion application, and PAI-Blade also provides optimization support for it. PAI-Blade already covers commonly used webui features such as model weight switching, LoRA, and ControlNet, and the optimization is integrated as an extension for ease of use. The relevant optimizations have also been integrated into the PAI-EAS image eas-registry.cn-hangzhou.cr.aliyuncs.com/pai-eas/sdwebui-inference:0.0.2-py310-gpu-cu117-ubuntu2204-blade, so you can experience PAI-Blade's optimization capability directly through PAI-EAS.

The following describes how this extension applies PAI-Blade optimization in webui and the resulting performance. The principle of the webui optimization is roughly the same as for diffusers; the main differences are as follows:

Optimize Unet and ControlNet module by module

In webui, ControlNet needs to call the submodules of the Unet one by one. To remain compatible with ControlNet, PAI-Blade therefore does not optimize the Unet and ControlNet as whole models, as it does for diffusers; instead, it optimizes and replaces the down blocks, mid blocks, and up blocks of the Unet and ControlNet one by one. Testing shows that this approach hardly affects model inference speed.
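A minimal sketch of this module-by-module approach could look like the following (the attribute names follow the ldm UNetModel used by webui, and example_inputs_for() is a hypothetical helper that captures each block's real inputs during a warm-up run; both are assumptions):

import torch
import torch_blade

def blade_optimize_block(block, example_inputs):
    # Optimize a single sub-module with PAI-Blade, keeping weights un-frozen so
    # that model checkpoints can still be switched in webui without re-optimizing.
    cfg = torch_blade.Config()
    cfg.freeze_module = False
    with cfg, torch.no_grad():
        return torch_blade.optimize(block, model_inputs=example_inputs, allow_tracing=True)

# Replace each block of the UNet with its optimized counterpart, one by one.
for i, block in enumerate(unet.input_blocks):
    unet.input_blocks[i] = blade_optimize_block(block, example_inputs_for(block))
unet.middle_block = blade_optimize_block(unet.middle_block, example_inputs_for(unet.middle_block))
for i, block in enumerate(unet.output_blocks):
    unet.output_blocks[i] = blade_optimize_block(block, example_inputs_for(block))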

Do not freeze weights

In the webui interface, model weights can be switched quickly. Therefore, PAI-Blade adopts the same approach as the LoRA optimization for diffusers and does not freeze the weights during optimization.

LoRA optimization

In webui, multiple LoRAs are applied by computing each LoRA one after another, so the computation time grows with the number of LoRAs. When PAI-Blade loads LoRA weights, it pre-fuses the LoRA weights and their scales into the base weights to reduce runtime overhead. Testing shows that the overhead of loading and fusing is negligible.
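As a rough illustration of the fusion (variable names are assumptions; standard LoRA expresses each weight update as a low-rank product), every affected base weight can be updated once at load time instead of on every forward pass:

import torch

def fuse_loras(base_weight, loras):
    # base_weight: [out_features, in_features] weight of a Linear layer.
    # loras: list of (lora_up, lora_down, scale) tuples, where
    #        lora_up is [out_features, rank] and lora_down is [rank, in_features].
    # Computes W' = W + sum_i scale_i * (up_i @ down_i) once, so inference cost
    # no longer depends on the number of LoRAs.
    fused = base_weight.clone()
    for lora_up, lora_down, scale in loras:
        fused += scale * (lora_up @ lora_down)
    return fused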

Benchmark

We tested the inference speed of the Stable Diffusion V1 model in webui on an A10 GPU with batch size 1 and a resolution of 512×512. Since webui also incurs model-independent latency such as network transmission, this test measures only the model's share of the time. The results are as follows:

As the table shows, in webui's eager and xformers modes the inference time increases with the number of LoRAs, whereas PAI-Blade fuses all LoRA weights into the base model, so its inference time is independent of the number of LoRAs.

Summary

In these two articles, we presented our experience optimizing the Stable Diffusion model with PAI-Blade, which currently supports the two mainstream inference paths: Diffusers and Stable-Diffusion-webui.

We also surveyed how relevant publicly available alternatives support Stable Diffusion; the results are as follows:

Framework/Model   Base Model   LoRA   ControlNet   webui
xformers          ?            ?      ?            ?
AITemplate        ?            ?      ?            ?
OneFlow           ?            ?      ?            ?
TensorRT          ?            ?      ?            ?
PAI-Blade         ?            ?      ?            ?

According to public performance figures and measurements in real business workloads, PAI-Blade not only offers the most comprehensive support for the Stable Diffusion model, but also delivers the best performance and memory usage.

PAI-Blade has already been rolled out in related production workloads. Next, we will continue to optimize performance and extend feature support. You are welcome to reach out to exchange ideas and explore cooperation.