diffusers: Understanding models and schedulers

Reference: https://huggingface.co/docs/diffusers/using-diffusers/write_own_pipeline

diffusers is organized around three modules: diffusion pipelines, noise schedulers, and models. The library is very well designed, and its design ideas are comparable to the mmlab series. The mm series' generative algorithms live in mmagic, but the selection there is not as rich as in diffusers. Moreover, almost all new algorithms now release their training and inference code in the standard diffusers form.

Below is the standard forward inference of a diffusers Stable Diffusion model; combined with the Hugging Face Hub, it is extremely convenient. The example uses the SkyPaint text-to-image model.

from diffusers import StableDiffusionPipeline

device = 'cuda'
pipe = StableDiffusionPipeline.from_pretrained("path_to_our_model").to(device)

prompts = [
    'robot dog',
    'Castle Sea Sunset Hayao Miyazaki Animation',
    'How many flowers have fallen',
    'Chicken you are so beautiful',
]

for prompt in prompts:
    prompt = 'sai-v1 art, ' + prompt
    image = pipe(prompt).images[0]
    image.save("%s.jpg" % prompt)

1. Pipelines

A pipeline wraps the necessary components (several independently trained models, schedulers, and processors) into an end-to-end class. All pipelines are built from DiffusionPipeline, which provides the basic functionality for loading, downloading, and saving all components. Pipelines do not provide training; models such as UNet2DModel and UNet2DConditionModel are trained separately.
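For example, loading, inspecting, and saving a pipeline all go through the DiffusionPipeline base class. A minimal sketch (the local path ./ddpm-cat-256-local is just a placeholder):

from diffusers import DiffusionPipeline

# from_pretrained resolves the concrete pipeline class (here DDPMPipeline) from the model card
pipeline = DiffusionPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True)

# the wrapped components: a UNet2DModel and a DDPMScheduler
print(pipeline.components)

# save all components to a local folder so the pipeline can be reloaded from disk
pipeline.save_pretrained("./ddpm-cat-256-local")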

As of v0.21.0 a large number of pipelines are supported, and more are added with every release.

example:

from diffusers import DDPMPipeline

ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
image = ddpm(num_inference_steps=25).images[0]
image

In the example above, the pipeline contains a UNet2DModel and a DDPMScheduler. It denoises an image by taking random noise (the same size as the desired output) and feeding it through the model multiple times. At each timestep the model predicts the noise residual, and the scheduler uses it to compute a slightly less noisy image. The pipeline repeats this process until the specified number of inference steps is reached.

Now re-create this pipeline from its model and scheduler separately and write the denoising process by hand:

1. Load model and scheduler

from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")

2. Set the timesteps for the denoising process

scheduler.set_timesteps(50)

3. Setting the scheduler timesteps creates a tensor of evenly spaced elements, 50 in this case. Each element corresponds to a timestep at which the model denoises the image. The denoising loop created later will iterate over this tensor:

scheduler.timesteps
tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
    700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
    420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
    140, 120, 100, 80, 60, 40, 20, 0])
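These values are spread evenly across the scheduler's training range, so with the usual 1000 training timesteps (an assumption about this checkpoint's config) the stride is 20, matching the tensor above:

# 50 inference steps spread over num_train_timesteps training steps -> stride of 20
print(scheduler.config.num_train_timesteps)  # 1000
print(len(scheduler.timesteps))              # 50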

4. Create some random noise with the same shape as the output

import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda")

5. Write a loop to iterate over the timesteps. At each timestep, the model (UNet2DModel) is called and returns the noise residual. The scheduler's step() method takes the noise residual, the timestep, and the current input, and predicts the sample at the previous timestep. That output becomes the model's next input in the denoising loop, and the process repeats until the end of the timesteps array is reached. This is the entire denoising process.

input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample

6. Finally, convert the denoised output into an image

from PIL import Image

image = (input / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image

2. Stable Diffusion pipeline

Stable Diffusion is a text-to-image latent diffusion model. It is called a latent diffusion model because it works on a lower-dimensional representation of the image rather than on actual pixel space, which makes it more memory efficient. The encoder compresses the image into this smaller representation, and the decoder converts the compressed representation back into an image. For a text-to-image model, a tokenizer and a text encoder are also needed to generate the text embeddings. From the previous example we already know that a UNet model and a scheduler are needed as well.

from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
)
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
)

Instead of the default PNDMScheduler, use the UniPCMultistepScheduler:

from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

To speed up inference, move the models to the GPU. Unlike the models, the scheduler has no trainable weights, so it makes no difference whether it runs on the GPU or not.

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)

2.1 Create text embeddings

The text is tokenized and encoded into embeddings, which are used to condition the UNet and steer the diffusion process toward something resembling the prompt. The guidance_scale parameter determines how much weight should be given to the prompt when generating an image.

prompt = ["a photograph of an astronaut riding a horse"]
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = 25 # Number of denoising steps
guidance_scale = 7.5 # Scale for classifier-free guidance
generator = torch.manual_seed(0) # Seed generator to create the initial latent noise
batch_size = len(prompt)

Tokenize the text and generate the text embeddings:

text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

Unconditional text embeddings also need to be generated, i.e. the embeddings of the padding tokens for an empty prompt. They need to have the same shape (batch_size and seq_length) as the conditional text embeddings:

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Concatenate the unconditional and conditional embeddings into a single batch to avoid running two forward passes:

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
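As a quick sanity check (assuming the Stable Diffusion v1.4 text encoder, CLIP ViT-L/14, with a sequence length of 77 and a hidden size of 768):

# first half: unconditional embeddings, second half: prompt embeddings
print(text_embeddings.shape)  # torch.Size([2, 77, 768]) for batch_size = 1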

2.2 Create random noise

Next, generate some initial random noise as the starting point of the diffusion process. This is the latent representation of the image, which will be gradually denoised. At this point the latent is smaller than the final image, but that is fine because the VAE will decode it into the final 512×512 image later.

The height and width are divided by 8 because the VAE has 3 downsampling layers (a total downsampling factor of 2³ = 8):

# the VAE downsampling factor: 2 ** (len(vae.config.block_out_channels) - 1) == 8
latents = torch.randn(
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
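For the default 512×512 resolution this gives a 64×64 latent, and the channel count comes from the UNet config (4 for Stable Diffusion v1.4):

print(latents.shape)  # torch.Size([1, 4, 64, 64]) for one 512x512 image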

2.3 Denoise the image

First, the input is scaled by sigma, the initial noise scale value of the noise distribution. This is required for improved schedulers such as UniPCMultistepScheduler:

latents = latents * scheduler.init_noise_sigma

The final step is to create the denoising loop that gradually transforms the pure latent noise into an image described by the prompt. Remember, the denoising loop needs to do three things:

1. Set the timesteps used by the scheduler during denoising.
2. Iterate over the timesteps.
3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.

from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

Classifier-free guidance: rather than relying on a separate classifier, the UNet is evaluated both with and without the text conditioning. At each timestep the predicted noise residual therefore splits into an unconditional part and a conditional (text-conditioned) part, and the two are combined by a weighted sum; the weighting coefficient is the guidance scale, which adjusts how strongly the text embeddings influence the result. In this way conditional control over the generation is achieved without using a classifier. Duplicating the latents (torch.cat([latents] * 2)) and the later noise_pred.chunk(2) are exactly the implementation of classifier-free guidance.
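Written as a standalone helper, the weighted sum is just one line (a sketch mirroring the two lines inside the loop above, not a diffusers API):

def apply_classifier_free_guidance(noise_pred_uncond, noise_pred_text, guidance_scale):
    # guidance_scale = 1.0 reproduces the purely text-conditioned prediction;
    # larger values extrapolate away from the unconditional prediction,
    # pushing the result to follow the prompt more strongly
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)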

2.4 Decode the image

Decode the latent representation back into an image using the VAE:

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
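The constant 0.18215 is the scaling factor the Stable Diffusion VAE was trained with. Recent diffusers versions also expose it on the VAE config, so the scaling step above could be written without the magic number (a sketch, assuming vae.config.scaling_factor is populated for this checkpoint):

# equivalent to `latents = 1 / 0.18215 * latents` above
latents = latents / vae.config.scaling_factor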