01 Stable Diffusion Introduction

Configuration

!pip install -Uq diffusers ftfy accelerate
# Installing transformers from source for now since we need the latest version for Depth2Img:
!pip install -Uq git+https://github.com/huggingface/transformers
import torch
import requests
from PIL import Image
from io import BytesIO
from matplotlib import pyplot as plt

# We'll be exploring a number of pipelines today!
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionInpaintPipeline,
    StableDiffusionDepth2ImgPipeline
    )

# We'll use a couple of demo images later in the notebook
def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")

# Download images for inpainting example
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
# Set device
device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)

Generate image from text

Let’s start by loading the Stable Diffusion pipeline and seeing what we can do with it. There are many different versions of the Stable Diffusion model available; the latest as of this writing is version 2.1. If you want to explore an older version, just change the model_id (for example, you could try CompVis/stable-diffusion-v1-4 or pick a model from the DreamBooth concepts library).

# Load the pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
pipe.enable_attention_slicing()

If your GPU is running out of memory, here are some ways to reduce memory usage (a combined sketch follows this list):

  • Load the FP16 (half-precision) version of the weights (not supported on all hardware). If you do this, then when experimenting with individual parts of the pipeline you will also need to pass tensors in torch.float16 precision:

    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16).to(device)

  • Turn on attention slicing. This sacrifices a little bit of speed to reduce GPU memory usage:

pipe.enable_attention_slicing()

  • Reduce the size of the images you generate
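If you want, you can combine these options. The following is a minimal sketch (assuming your hardware supports float16) that reloads the pipeline in half precision, enables attention slicing, and generates a smaller image:

# Sketch: combining the memory-saving options above (assumes float16 support)
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, revision="fp16", torch_dtype=torch.float16
).to(device)
pipe.enable_attention_slicing()  # trade a little speed for lower memory use

low_memory_image = pipe(
    "Palette knife painting of an autumn cityscape",
    height=384, width=384,   # smaller image -> smaller latents -> less memory
    num_inference_steps=30,
).images[0]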

When the pipeline is loaded, we can use the following code to generate images using text prompts:

# Set up a generator for reproducibility
generator = torch.Generator(device=device).manual_seed(42)

# Run the pipeline, showing some of the available arguments
pipe_output = pipe(
    prompt="What to generate Palette knife painting of an autumn cityscape",
    negative_prompt="Oversaturated, blurry, low quality", # What NOT to generate
    height=480, width=640, # Specify the image size
    guidance_scale=8, # How to strongly follow the prompt
    num_inference_steps=35, # How many steps to take
    generator=generator # Fixed random seed
)

# View the resulting image:
pipe_output.images[0]

Exercise: Take some time to experiment with the code above, using your own text prompts and iterating on the settings to see how they affect the generated image. Use a different random seed or remove the generator argument entirely and see what different results you get.

The main parameters you may want to adjust:

  • width and height specify the dimensions of the generated image. They must be divisible by 8 for the variational autoencoder (VAE) to work properly (as we will see in a later chapter).
  • The number of steps num_inference_steps also affects the quality of the generated image. The default of 50 works well, but sometimes as few as 20 steps is enough, which is much handier for experimentation.
  • negative_prompt describes content you do NOT want, and is applied during the classifier-free guidance step. It can be a very useful way to add extra control: you can leave it empty, but many users find that listing some unwanted features improves the results.
  • guidance_scale determines how strong the influence of classifier-free guidance (CFG) is. Increasing it pushes the generated content closer to the text prompt, but if it is too large the result can become oversaturated and unpleasant to look at (see the short sketch after this list).
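To make the guidance_scale bullet concrete, here is a schematic of the classifier-free guidance step, shown with dummy tensors (the same formula appears in the DIY sampling loop later in this notebook):

# Schematic classifier-free guidance (CFG) step with dummy noise predictions
noise_pred_uncond = torch.randn(1, 4, 64, 64)  # prediction without the prompt
noise_pred_text = torch.randn(1, 4, 64, 64)    # prediction with the prompt
guidance_scale = 8

# Push the conditional prediction further away from the unconditional one
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)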

If you’re looking for some inspiration for text prompts, you can also start here: Stable Diffusion Prompt Book

You can see below the effect of increasing the guidance_scale parameter:

#@markdown comparing guidance scales:
cfg_scales = [1.1, 8, 12] #@param
prompt = "A collie with a pink hat" #@param
fig, axs = plt.subplots(1, len(cfg_scales), figsize=(16, 5))
for i, ax in enumerate(axs):
  im = pipe(prompt, height=480, width=480,
    guidance_scale=cfg_scales[i], num_inference_steps=35,
    generator=torch.Generator(device=device).manual_seed(42)).images[0]
  ax.imshow(im); ax.set_title(f'CFG Scale {cfg_scales[i]}');

Play around with the values above, trying different guidance scales and prompts. How to interpret these parameters is subjective, of course, but I find that values in the 8 to 12 range produce better results than values outside it.

Components of the pipeline

The StableDiffusionPipeline we use here is a little more complicated than the DDPMPipeline used in the previous units. In addition to UNet and the scheduler, there are many other components in the pipeline:

print(list(pipe.components.keys())) # List components
['vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler', 'safety_checker', 'feature_extractor']

To better understand how the pipeline works, let's briefly look at each component in turn, and then put them back together ourselves to recreate the functionality of the whole pipeline.

Variational autoencoder (VAE)

A variational autoencoder (VAE) is a model that encodes its input into a compressed representation and then decodes this "latent" representation back into something close to the original input. When generating images with Stable Diffusion, we first run the diffusion process in the VAE's latent space to generate latent codes, and then decode them at the end to view the resulting image.

Here is an example of using the VAE to encode an input image into a latent representation and then decode it again:

# Create some fake data (a random image, range (-1, 1))
images = torch.rand(1, 3, 512, 512).to(device) * 2 - 1
print("Input images shape:", images.shape)

# Encode to latent space
with torch.no_grad():
  latents = 0.18215 * pipe.vae.encode(images).latent_dist.mean
print("Encoded latents shape:", latents.shape)

# Decode again
with torch.no_grad():
  decoded_images = pipe.vae.decode(latents / 0.18215).sample
print("Decoded images shape:", decoded_images.shape)
Input images shape: torch.Size([1, 3, 512, 512])
Encoded latents shape: torch.Size([1, 4, 64, 64])
Decoded images shape: torch.Size([1, 3, 512, 512])

As you can see, the original 512×512 image is compressed into a 64×64 latent representation (with four channels). Each spatial dimension is compressed to one eighth of its original size, which is why width and height need to be multiples of 8.

Working with these information-rich 4×64×64 latent codes is much more efficient than working with 512-pixel images, making our diffusion model faster and cheaper to train and use. The VAE's decoding process isn't perfect, but even if it loses a little quality, it is generally good enough.
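To put a number on that efficiency claim, here is a quick comparison of how many values the diffusion model has to work with in each case:

# Element counts: pixel space vs. latent space
pixel_elems = 3 * 512 * 512    # RGB image
latent_elems = 4 * 64 * 64     # 4-channel, 64x64 latents
print(pixel_elems / latent_elems)  # 48.0 -> roughly 48x fewer values to diffuse over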

Note: The code above includes a scaling factor of 0.18215 to match the processing used during Stable Diffusion's training.

Tokenizer and Text Encoder

The text encoder's job is to turn the input string (the text prompt) into a numerical representation that can be fed to the UNet as conditioning. The text is first converted into a sequence of tokens by the pipeline's tokenizer. The tokenizer has a vocabulary of roughly 50,000 tokens, and any word not in the vocabulary is split into smaller sub-word tokens. The tokens are then fed into the text encoder model itself: a transformer that was originally trained as the text encoder for CLIP. The hope is that this pretrained transformer has learned rich enough text representations to be just as useful for our diffusion task here.

Let's verify this process by encoding an example prompt. First we tokenize it manually and feed it through the text encoder, then we use the pipeline's encode_prompt method (the replacement for the now-deprecated _encode_prompt) to see the full process, which also pads or truncates the token sequence to the maximum length of 77:

# Tokenizing and encoding an example prompt manually:

# Tokenize
input_ids = pipe.tokenizer(["A painting of a flooble"])['input_ids']
print("Input ID -> decoded token")
for input_id in input_ids[0]:
  print(f"{input_id} -> {pipe.tokenizer.decode(input_id)}")

# Feed through CLIP text encoder
input_ids = torch.tensor(input_ids).to(device)
with torch.no_grad():
  text_embeddings = pipe.text_encoder(input_ids)['last_hidden_state']
print("Text embeddings shape:", text_embeddings.shape)
Input ID -> decoded token
49406 -> <|startoftext|>
320 -> a
3086 -> painting
539 -> of
320 -> a
4062 -> floo
1059 -> ble
49407 -> <|endoftext|>
Text embeddings shape: torch.Size([1, 8, 1024])
# Get the final text embeddings using the pipeline's encode_prompt function.
# (The older _encode_prompt is deprecated; encode_prompt returns a
# (prompt_embeds, negative_prompt_embeds) tuple rather than a single tensor.)
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt("A painting of a flooble", device, 1, False, '')
prompt_embeds.shape

These text embeddings (the last hidden state of the text encoder's final transformer block) will be fed to the UNet as an additional input to its forward method, which we will see in detail below.

UNet

The UNet takes a noisy input and predicts the noise, just like the UNets we saw in previous units. Unlike those earlier examples, though, the input is not an image but a latent representation of one. And in addition to the timestep, which conditions the UNet on the amount of noise, the model also takes the text embeddings of the prompt as an extra input. Here we run a prediction with some dummy data:

# Dummy inputs:
timestep = pipe.scheduler.timesteps[0]
latents = torch.randn(1, 4, 64, 64).to(device)
text_embeddings = torch.randn(1, 77, 1024).to(device)

# Model prediction:
with torch.no_grad():
  unet_output = pipe.unet(latents, timestep, text_embeddings).sample
print('UNet output shape:', unet_output.shape) # Same shape as the input latents

Scheduler

The scheduler stores the noise schedule and manages how to update the noisy sample based on the model's predictions. The default scheduler is PNDMScheduler, but you can use others (such as LMSDiscreteScheduler) as long as they are initialized with the same configuration.
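As a small illustration, here is a sketch (assuming the loaded scheduler exposes the standard add_noise method, as the default schedulers do) of using the scheduler to noise a clean latent to a chosen timestep:

# Sketch: using the scheduler to add noise to a clean latent at a chosen timestep
clean_latents = torch.randn(1, 4, 64, 64).to(device)  # pretend these came from the VAE
noise = torch.randn_like(clean_latents)
timesteps = torch.tensor([200], device=device)         # an arbitrary training timestep

noisy_latents = pipe.scheduler.add_noise(clean_latents, noise, timesteps)
print(noisy_latents.shape)  # same shape as the clean latents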

We can plot the noise schedule to see the noise level at each timestep (based on the parameter $\bar{\alpha}$):

plt.plot(pipe.scheduler.alphas_cumprod, label=r'$\bar{\alpha}$')
plt.xlabel('Timestep (high noise to low noise ->)');
plt.title('Noise schedule');plt.legend();

# The LMSDiscreteScheduler used below depends on scipy
!pip install -Uq scipy

If you want to try a different scheduler, you can change it to a new one as shown in the following code:

from diffusers import LMSDiscreteScheduler

# Replace the scheduler
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

#Print the config
print('Scheduler config:', pipe.scheduler)

# Generate an image with this new scheduler
pipe(prompt="Palette knife painting of an winter cityscape", height=480, width=480,
     generator=torch.Generator(device=device).manual_seed(42)).images[0]
Scheduler config: LMSDiscreteScheduler {
  "_class_name": "LMSDiscreteScheduler",
  "_diffusers_version": "0.21.4",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "timestep_spacing": "linspace",
  "trained_betas": null,
  "use_karras_sigmas": false
}





You can learn more about using different schedulers here.
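One way to see which scheduler classes can be swapped in for the current one (assuming a reasonably recent diffusers version, where schedulers expose a compatibles property):

# List scheduler classes that share a compatible configuration with the current one
print([s.__name__ for s in pipe.scheduler.compatibles])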

DIY a sampling loop

Now that we have looked at these components one by one, we can put them together to reproduce the functionality of the entire pipeline:

guidance_scale = 8 #@param
num_inference_steps=30 #@param
prompt = "Beautiful picture of a wave breaking" #@param
negative_prompt = "zoomed in, blurry, oversaturated, warped" #@param

# Encode the prompt (encode_prompt replaces the deprecated _encode_prompt and
# returns a (prompt_embeds, negative_prompt_embeds) tuple). We concatenate with the
# unconditional embeddings first, to match the noise_pred.chunk(2) order below.
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt, device, 1, True, negative_prompt)
text_embeddings = torch.cat([negative_prompt_embeds, prompt_embeds])

# Create our random starting point
latents = torch.randn((1, 4, 64, 64), device=device, generator=generator)
latents *= pipe.scheduler.init_noise_sigma

# Prepare the scheduler
pipe.scheduler.set_timesteps(num_inference_steps, device=device)

# Loop through the sampling timesteps
for i, t in enumerate(pipe.scheduler.timesteps):

  # expand the latents if we are doing classifier free guidance
  latent_model_input = torch.cat([latents] * 2)

  # Apply any scaling required by the scheduler
  latent_model_input = pipe.scheduler.scale_model_input(latent_model_input, t)

  # predict the noise residual with the unet
  with torch.no_grad():
    noise_pred = pipe.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

  # perform guidance
  noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
  noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

  # compute the previous noisy sample x_t -> x_t-1
  latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

# Decode the resulting latents into an image
# (decode_latents is deprecated in newer diffusers releases; you can also decode
# with pipe.vae.decode(latents / 0.18215) directly, as we did earlier)
with torch.no_grad():
  image = pipe.decode_latents(latents.detach())

# View
pipe.numpy_to_pil(image)[0]

Some other pipelines

Img2Img

Until now all of our images have been generated from scratch, starting from random latents and running the full diffusion sampling loop. But we don't have to start from nothing. The Img2Img pipeline first encodes an existing image into a set of latents, then adds some random noise to them and uses that as the starting point. The amount of noise added, and hence the number of denoising steps required, determines the "strength" of the img2img process. Adding only a little noise (low strength) makes only a small change, while adding the maximum amount of noise and running the full denoising process produces an image that may bear little resemblance to the original apart from some structural similarity.
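As a rough illustration of what strength does (this mirrors, under some assumptions, how the Img2Img pipeline picks its starting step; the exact internals may differ between diffusers versions):

# With strength s and N inference steps, roughly the last s*N denoising steps
# are run, starting from the input image noised to the corresponding timestep
num_inference_steps = 50
strength = 0.6

init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
print(f"Running {init_timestep} of {num_inference_steps} steps, starting at step index {t_start}")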

This pipeline does not require any special model; as long as the model ID matches the one used for our text-to-image model above, no new files need to be downloaded.

# Loading an Img2Img pipeline
model_id = "stabilityai/stable-diffusion-2-1-base"
img2img_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to(device)

In the “Configuration” section of this notebook we loaded an image called init_image to use in this demonstration, but you can replace it with an image of your own. Here is how to use the pipeline:

# Apply Img2Img
result_image = img2img_pipe(
    prompt="An oil painting of a man on a bench",
    image = init_image, # The starting image
    strength = 0.6, # 0 for no change, 1.0 for max strength
).images[0]

# View the result
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
axs[0].imshow(init_image);axs[0].set_title('Input Image')
axs[1].imshow(result_image);axs[1].set_title('Result');

Exercise: Experiment with this pipeline. Try your own images, or play with different strengths and text prompts. You can use many of the same arguments as with the text-to-image pipeline, so try different image sizes, numbers of steps, and so on.

In-Painting

What if we want to keep part of an image unchanged and generate something new in the rest? This technique is called inpainting. While it can be done with the same model as the earlier demos (via the StableDiffusionInpaintPipelineLegacy pipeline), we can get better results by using a custom fine-tuned version of Stable Diffusion that takes a mask as an additional conditioning input. The mask image must be the same size as the input image, with white marking the areas to replace and black marking the areas to keep. The following code shows how to load this pipeline and apply it to the example image and mask loaded earlier:
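To make the mask convention concrete, here is a small hypothetical example (not the downloaded mask we use below) of building a mask by hand with PIL, where white marks the region to repaint and black the region to keep:

from PIL import Image, ImageDraw

# Hypothetical hand-made mask: keep everything (black) except a central square (white)
manual_mask = Image.new("L", (512, 512), 0)
draw = ImageDraw.Draw(manual_mask)
draw.rectangle([128, 128, 384, 384], fill=255)
# manual_mask could be passed as mask_image to the inpainting pipeline below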

# Load the inpainting pipeline (requires a suitable inpainting model)
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe = pipe.to(device)
# Inpaint with a prompt for what we want the result to look like
prompt = "A small robot, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]

# View the result
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
axs[0].imshow(init_image);axs[0].set_title('Input Image')
axs[1].imshow(mask_image);axs[1].set_title('Mask')
axs[2].imshow(image);axs[2].set_title('Result');

This model can be especially powerful when combined with another model that can generate masks automatically. For example, this demo Space uses a model called CLIPSeg to automatically generate a mask for an object based on a text description.

Digression: Managing your model cache

Exploring different pipelines and models can take up space on your hard drive. You can use this command to see which models you have downloaded to your hard drive:

!ls ~/.cache/huggingface/diffusers/ # List the contents of the cache directory
models--CompVis--stable-diffusion-v1-4
models--ddpm-bedroom-256
models--google--ddpm-bedroom-256
models--google--ddpm-celebahq-256
models--runwayml--stable-diffusion-inpainting
models--stabilityai--stable-diffusion-2-1-base

Check out the caching documentation to learn how to view and manage your cache efficiently.
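If you have huggingface_hub installed, its command-line tool offers another way to inspect (and interactively delete) cached models; treat the exact commands below as an assumption and check the docs for your installed version:

# Summarize everything in the Hugging Face cache
!huggingface-cli scan-cache

# Interactively select cached repos to delete (may require extra dependencies)
!huggingface-cli delete-cache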

Depth2Image

Input image, depth image and generated examples (image source: StabilityAI)

Img2Img is already great, but sometimes we want to generate a new image that keeps the composition of the original while using completely different colours or textures. It is difficult to find an Img2Img strength that preserves the overall structure of the image without also preserving its colours.

So we need another fine-tuned model! This one takes depth information as an additional conditioning input when generating. The pipeline uses a depth prediction model to estimate a depth map, which is then fed to the fine-tuned UNet. The hope is that the generated image keeps the depth information and overall structure of the original while filling in new content in the relevant parts.

# Load the Depth2Img pipeline (requires a suitable model)
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-depth")
pipe = pipe.to(device)
# Generate an image with a prompt describing what we want the result to look like
prompt = "An oil painting of a man on a bench"
image = pipe(prompt=prompt, image=init_image).images[0]

# View the result
fig, axs = plt.subplots(1, 2, figsize=(16, 5))
axs[0].imshow(init_image);axs[0].set_title('Input Image')
axs[1].imshow(image);axs[1].set_title('Result');