[Diffusion Model] HuggingFace Diffusers in practice

HuggingFace Diffusers in action

  • 1. Environment preparation
  • 2. DreamBooth
    • 2.1 Introduction to Stable Diffusion
    • 2.2 DreamBooth
  • 3. Diffusers core API
  • 4. Hands-on practice: generating beautiful butterfly images
    • 4.1 Download dataset
    • 4.2 Scheduler
    • 4.3 Define diffusion model
    • 4.4 Create a diffusion model training loop
    • 4.5 Image generation
      • Method 1. Create a pipeline
      • Method 2. Write a sampling loop
  • 5. Upload the model to Hugging Face Hub
  • 6. Use the Accelerate library to scale up model training
  • References

Diffusers is the go-to library for state-of-the-art diffusion models that generate images, audio, and even 3D structures of molecules. Whether you are looking for a simple inference solution or want to train your own diffusion model, Diffusers is a modular toolbox that supports both. The library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions.

Diffusers provides three core components:

  • State-of-the-art diffusion pipelines that can run inference with just a few lines of code.
  • Interchangeable schedulers for different diffusion speeds and output qualities.
  • Pre-trained models that can be used as building blocks and combined with schedulers to create your own end-to-end diffusion systems.

In this article, you will learn how to use a powerful custom diffusion pipeline, how to build your own version of one, how to use the Accelerate library to train on multiple GPUs and speed up training, and how to upload the final model to the Hugging Face Hub.

1. Environment preparation

Install the Diffusers library:

!pip install -qq -U diffusers datasets transformers accelerate ftfy pyarrow==12.0
  • ftfy is a Python package for repairing and cleaning Unicode text. Its full name is "fixes text for you," meaning it can automatically detect and correct common Unicode problems.
    Unicode is a character-encoding standard for representing text. However, text sometimes contains special characters, encoding errors, mojibake, or inconsistent character representations, which can break display or processing. ftfy provides a set of functions that automatically repair these problems, making text handling more accurate and consistent. With ftfy installed, you can use its functions and tools in Python to handle and fix Unicode problems in text, ensuring that text remains correct and reliable across applications.

  • Apache Arrow is a development platform for memory analysis. It consists of a set of technologies that enable big data systems to quickly store, process, and move data. It provides a common data format, represents data in memory as tables, and supports features such as serialization and distributed reading.

Note: -U means --upgrade, i.e. upgrade to the latest version if the package is already installed; -q reduces output and is additive, meaning it can be given up to three times (corresponding to the warning, error, and critical logging levels).
– -q shows only messages at the warning, error, and critical log levels.
– -qq shows only messages at the error and critical log levels.
– -qqq shows only messages at the critical log level.

Then visit https://huggingface.co/settings/tokens and create an access token with write permission:
Create token
Run the following code to log in to Hugging Face using the created access token:
Token login Hugging Face
After a successful login it prints: Token is valid (permission: write). Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token. Login successful
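The login cell itself is only shown as a screenshot; it amounts to something like the following sketch, using the notebook_login helper from huggingface_hub:

from huggingface_hub import notebook_login

# Paste the write-permission token created above when prompted
notebook_login()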

Install Git LFS to upload model checkpoints

%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

Notes:

  • **Git LFS (Large File Storage)** is a Git extension developed by Atlassian, GitHub, and other open-source contributors. It reduces the impact of large files in a repository by downloading the relevant versions of them lazily. Specifically, large files are downloaded during checkout rather than during clone or fetch (so fetching remote content in the background does not download the large files; their content is only actually downloaded when you check them out into the working tree).
  • %%capture is a magic command in Jupyter notebooks that captures the output of a single code cell so that you can discard it or store it in a variable for later use. For example, you can use %%capture to prevent a cell's output from being displayed in the notebook, or to capture a cell's output and assign it to a variable.

Import dependent libraries:

import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from matplotlib import pyplot as plt
from PIL import Image

Supplement: PIL (Python Imaging Library) is the most commonly used image-processing library in Python. PIL supports image storage, display, and processing; it can handle almost all image formats and can perform operations such as scaling, cropping, compositing, and drawing lines or text on images. By function, the PIL library contains 21 image-related classes, which can be regarded as sub-libraries or modules within PIL; Image is the most commonly used of them.

Define two tool functions for picture display:

def show_images(x):
  """Given a batch of images, create a grid and convert it to PIL"""
  x = x * 0.5 + 0.5
  grid = torchvision.utils.make_grid(x)
  grid_im = grid.detach().cpu().permute(1,2,0).clip(0,1)*255
  grid_im = Image.fromarray(np.array(grid_im).astype(np.uint8))
  return grid_im
def make_grid(images, size=64):
  """Given a list of PIL images, stack them into a line"""
  output_im = Image.new("RGB", (size * len(images), size))
  for i, im in enumerate(images):
    output_im.paste(im.resize((size, size)), (i * size, 0))
  return output_im

Check the GPU status:
Make sure you are using CUDA
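The check in the screenshot boils down to a cell along these lines; the resulting device variable is reused in later code:

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)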

2. DreamBooth

2.1 Introduction to Stable Diffusion

Stable Diffusion, a text-to-image latent diffusion model released in 2022, was created by researchers at CompVis, Stability AI, and LAION.

Stable Diffusion, as an improved version of the diffusion model, solves the speed bottleneck of diffusion by introducing a latent vector space. Besides text-to-image generation, it can also be used for image-to-image generation, depicting specific characters, and even super-resolution or colorization tasks.

The following figure shows a basic text-to-image workflow, with the Stable Diffusion model in the middle treated as a black box. The input to the black box is the text string "paradise, cosmic, beach", and the output is the generated picture on the far right that satisfies the prompt: blue sky, white clouds, and an endless beach.
Stable Diffusion
The core idea of Stable Diffusion is that, since every image follows some underlying distribution, the distribution information carried by the text can be used as guidance to gradually denoise a pure-noise image into one that matches the text.

2.2 DreamBooth

DreamBooth is a method for personalizing text-to-image models: given several pictures of a subject as input, it fine-tunes a pre-trained text-to-image model (such as Imagen) to bind a unique identifier to that subject, so that novel pictures of the subject in different scenes can be generated from prompts containing this identifier.
DreamBooth
That is, DreamBooth allows us to fine-tune the Stable Diffusion model and introduce additional information about specific faces, objects, or styles throughout the process.

First load this pipeline:
Loading pipeline
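The loading cell is only shown as a screenshot. A plausible sketch, assuming the publicly shared sd-dreambooth-library/mr-potato-head DreamBooth checkpoint (which matches the "sks mr potato head" prompt used below) and the CUDA device defined earlier:

from diffusers import StableDiffusionPipeline

# Assumed checkpoint name; replace with the DreamBooth model you want to try
model_id = "sd-dreambooth-library/mr-potato-head"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(device)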
After the pipeline is loaded, use the following code to generate a sample image:

# Generate sample image
prompt = "an abstract oil painting of sks mr potato head by picasso"
image = pipe(prompt, num_inference_steps=50, guidance_scale=6.5).images[0]
image

Sample image

3. Diffusers core API

The Diffusers core API is mainly divided into three parts:

  • Pipeline: high-level classes designed for easy deployment that let you quickly generate samples with pre-trained mainstream diffusion models.
  • Model: the network structures needed when training a new diffusion model.
  • Scheduler: uses a variety of techniques to generate images from noise during inference, and can also generate the "noisy" images required during training.

Example: Generate a butterfly image using a pipeline:
Using DiffuserAPI Example
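The cell behind the screenshot is roughly the following sketch; the checkpoint name is an assumption (johnowhitaker/ddpm-butterflies-32px is a small public DDPM for 32-pixel butterfly images, and the same pipeline is reused as butterfly_pipeline in Section 4.5):

from diffusers import DDPMPipeline

# Assumed checkpoint: a small DDPM trained on 32px butterflies
butterfly_pipeline = DDPMPipeline.from_pretrained("johnowhitaker/ddpm-butterflies-32px").to(device)
images = butterfly_pipeline(batch_size=8).images
make_grid(images)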
The result is as follows:
Butterfly pictures
So far, the process of training a diffusion model looks like this:

  1. Load images from the training set.
  2. Add different levels of noise.
  3. Feed the inputs with different levels of added noise into the model.
  4. Evaluate how well the model denoises these inputs.
  5. Use the resulting performance information to update the model weights, and repeat the above steps.

4. Hands-on practice: generating beautiful butterfly images

4.1 Download dataset

Load a dataset of 1000 butterfly images from Hugging Face Hub.

import torchvision
from datasets import load_dataset
from torchvision import transforms

dataset = load_dataset("huggan/smithsonian_butterflies_subset", split="train")
# You can also load images from local folders
#dataset = load_dataset("imagefolder", data_dir="path/to/folder")

# We will train on 32×32 pixel square images
image_size = 32
batch_size = 64

# Define the data augmentation pipeline
preprocess = transforms.Compose([
    transforms.Resize((image_size, image_size)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

def transform(examples):
  images = [preprocess(image.convert("RGB")) for image in examples["image"]]
  return {"images": images}

dataset.set_transform(transform)
train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

Then take a batch of images from the dataloader and visualize them:
Data sampling visualization
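The screenshot is not reproduced here; a minimal sketch of such a cell, reusing the show_images helper defined earlier (the batch xb is reused in the next sections):

xb = next(iter(train_dataloader))["images"].to(device)[:8]
print("X shape:", xb.shape)
show_images(xb)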

4.2 Scheduler

As mentioned above, to train a diffusion model we take the input images, add noise to them, and then feed the noisy images into the model. During inference, the noise is gradually removed using the model's predictions. In diffusion models, both of these steps are handled by the scheduler.

The noise scheduler determines how much noise to add at different iteration cycles.

from diffusers import DDPMScheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

Based on the paper “Denoising Diffusion Probabilistic Models”

The noise schedule's hyperparameter beta is controlled by setting three parameters: beta_start, beta_end, and beta_schedule.

  • beta_start is the beta value at the start of the schedule.
  • beta_end is the final beta value.
  • beta_schedule maps each step of the schedule to a beta value through a function.

(1) Only a small amount of noise is added

noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_start=0.001, beta_end=0.004)

(2) cosine scheduling method, which may be more suitable for smaller images

noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule='squaredcos_cap_v2')

No matter which scheduler you choose, you can use noise_scheduler.add_noise to add varying degrees of noise to the image:
Add noise
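A sketch of what the cell behind the screenshot might look like, adding progressively stronger noise to the batch xb from the visualization above:

timesteps = torch.linspace(0, 999, 8).long().to(device)
noise = torch.randn_like(xb)
noisy_xb = noise_scheduler.add_noise(xb, noise, timesteps)
show_images(noisy_xb)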

4.3 Define diffusion model

Simply put, the workflow of the UNet model is as follows:

  1. The image input to the UNet passes through several ResNet blocks, each of which halves the spatial size of the image.
  2. The same number of upsampling blocks restore the image to its original size.
  3. Skip connections link downsampling and upsampling blocks whose feature maps have the same resolution.

A key feature of the UNet model is that its output image has the same dimensions as the input image, which is what is required in a diffusion model. Diffusers provides a UNet2DModel class for creating the required structures in PyTorch.

from diffusers import UNet2DModel

# Create the model
model = UNet2DModel(
    sample_size=image_size,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,  # number of ResNet layers in each UNet block
    block_out_channels=(64, 128, 128, 256),
    down_block_types=(
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D", # ResNet downsampling module with self-att in spatial dimension
        "AttnDownBlock2D",
    ),
    up_block_types=(
        "AttnUpBlock2D",
        "AttnUpBlock2D", # ResNet upsampling module with self-att in spatial dimension
        "UpBlock2D",
        "UpBlock2D",
    ),
)
model.to(device);

When processing higher-resolution images, try using more downsampling and upsampling blocks, and keep the attention blocks only at the deepest layers of the network where the resolution is lowest (this special structure helps the network locate the most important parts of the feature map); this reduces the memory load.

Feed in a batch of data and some randomly sampled timesteps to check that the output size matches the input size:
Input and output size
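A sketch of the shape check, reusing the noisy batch from Section 4.2 (the key point is that the output shape equals the input shape):

with torch.no_grad():
    model_prediction = model(noisy_xb, timesteps).sample
print(model_prediction.shape)  # should equal noisy_xb.shape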

4.4 Create a diffusion model training loop

Input data into the model batch by batch, and use the optimizer to update the parameters of the model step by step. The training process for each batch of data is as follows:

  1. Randomly sample several timesteps.
  2. Add the corresponding amount of noise to the data.
  3. Feed the "noisy" data into the model.
  4. Use MSE as the loss function to compare the target with the model's predictions.
  5. Update the model parameters by calling loss.backward() and optimizer.step().

# Set up the noise scheduler
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")
noise_scheduler = DDPMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")

# training loop
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
losses = []
for epoch in range(30):
  for step, batch in enumerate(train_dataloader):
    clean_images = batch["images"].to(device)
    # Add sampling noise to the image
    noise = torch.randn(clean_images.shape).to(clean_images.device)
    bs = clean_images.shape[0]
    # Randomly sample a time step for each image
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (bs, ),
                              device=clean_images.device).long()
    # Add noise to the clean images according to the noise magnitude at each timestep
    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
    
    # Get the prediction results of the model
    noise_pred = model(noisy_images, timesteps, return_dict=False)[0]

    # Calculate the loss
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    losses.append(loss.item())

    # Update the model parameters
    optimizer.step()
    optimizer.zero_grad()
  
  if (epoch + 1) % 5 == 0:
    loss_last_epoch = sum(losses[-len(train_dataloader):]) / len(train_dataloader)
    print(f"Epoch: {epoch + 1}, loss: {loss_last_epoch}")

The output of the above code is as follows:

Epoch: 5, loss: 0.15123038180172443
Epoch: 10, loss: 0.10966242663562298
Epoch: 15, loss: 0.09312395006418228
Epoch: 20, loss: 0.08648758847266436
Epoch: 25, loss: 0.08505138801410794
Epoch: 30, loss: 0.06794926035217941

Draw the loss curve:

fig, axs = plt.subplots(1, 2, figsize=(12, 4))
axs[0].plot(losses)
axs[1].plot(np.log(losses))
plt.show()

Loss Curve
Another way to obtain the model is to take it from an existing pipeline:

model = butterfly_pipeline.unet

4.5 Image generation

Method 1. Create a pipeline

The code is as follows:
Create a pipeline
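The cell in the screenshot is roughly the following sketch: wrap the trained model and the scheduler into a DDPMPipeline and sample from it:

from diffusers import DDPMPipeline

image_pipe = DDPMPipeline(unet=model, scheduler=noise_scheduler)
pipeline_output = image_pipe()
pipeline_output.images[0]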
Save the pipeline to a local folder:

image_pipe.save_pretrained("my_diffusion_pipeline")

Check the contents of the local folder:
Contents in the folder
The scheduler and unet subfolders contain all the components needed to generate images. For example, the unet subfolder contains the model weights file diffusion_pytorch_model.safetensors and the configuration file config.json, which describes the model structure. Together, these files contain everything needed to rebuild the pipeline. You can manually upload them to the Hugging Face Hub to share the pipeline with others, or do it in code through the API.

Method 2. Write a sampling loop

Starting from a completely random noisy image, run the scheduler from maximum noise to minimum noise and remove a small amount of noise based on the predictions of the model at each step:
Sampling loop
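A sketch of such a sampling loop, assuming the model and noise_scheduler from the training section:

# Start from pure random noise
sample = torch.randn(8, 3, 32, 32).to(device)

for t in noise_scheduler.timesteps:
    # Predict the noise residual for the current timestep
    with torch.no_grad():
        residual = model(sample, t).sample
    # Let the scheduler compute the slightly less noisy sample
    sample = noise_scheduler.step(residual, t, sample).prev_sample

show_images(sample)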
The noise_scheduler.step method performs the mathematical operations required to update the “samples”.

5. Upload the model to Hugging Face Hub

The Hugging Face Hub determines the name of the model repository from the specified model ID. The code is as follows:
Get model ID
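The screenshot boils down to something like the following, using the get_full_repo_name helper from huggingface_hub (the model name is a placeholder):

from huggingface_hub import get_full_repo_name

model_name = "sd-class-butterflies-32"   # placeholder name; pick your own
hub_model_id = get_full_repo_name(model_name)
hub_model_id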
Next, create a model repository on the Hugging Face Hub and upload it; the code is as follows:
Upload to Hugging Face Hub
Then you can also create a beautiful model card through the following code:
Model Card
Here, we log in to Hugging Face to view the model we just uploaded
Personal model card
Now anyone can download and use the model from anywhere with the from_pretrained method of DDPMPipeline:
Use uploaded model
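A sketch of that cell, assuming the hub_model_id created above:

from diffusers import DDPMPipeline

image_pipe = DDPMPipeline.from_pretrained(hub_model_id)
pipeline_output = image_pipe()
pipeline_output.images[0]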

6. Use the Accelerate library to scale up model training

Accelerate is a library that allows you to run the same PyTorch code in any distributed configuration by adding just four lines of code. In short, training and inference at scale become simple, efficient, and adaptable.
Download train_unconditional
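One way to fetch the script (the raw GitHub path of the diffusers example is an assumption; adjust it if the examples folder layout changes):

!wget -q https://github.com/huggingface/diffusers/raw/main/examples/unconditional_image_generation/train_unconditional.py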
Give the new model a name:
Name the model
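A sketch of the naming cell, reusing get_full_repo_name from Section 5; the name itself is a placeholder:

model_name = "sd-class-butterflies-64"   # placeholder name for the larger model
hub_model_id = get_full_repo_name(model_name)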
Execute the following code to launch the script with the Accelerate library, which automatically handles training deployment such as multi-GPU parallel training:
accelerate
An error is reported here. Analyzing the error message, we see: This example requires a source install from HuggingFace diffusers (see https://huggingface.co/docs/diffusers/installation#install-from-source), but the version found is 0.21.4.

Therefore, we need to install diffusers from source via git clone:

!git clone https://github.com/huggingface/diffusers.git
!pip install diffusers/
!pip install -r diffusers/examples/unconditional_image_generation/requirements.txt

Then execute the accelerate script:

!accelerate config
!accelerate launch train_unconditional.py \
  --dataset_name="huggan/smithsonian_butterflies_subset" \
  --resolution=64 \
  --output_dir={model_name} \
  --train_batch_size=32 \
  --num_epochs=50 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_warmup_steps=500 \
  --mixed_precision=no

Execution result:

Epoch 1: 100% 32/32 [00:33<00:00, 1.06s/it, loss=0.402, lr=1.28e-5, step=64]
Epoch 2: 100% 32/32 [00:31<00:00, 1.03it/s, loss=0.152, lr=1.92e-5, step=96]
Epoch 3: 100% 32/32 [00:27<00:00, 1.16it/s, loss=0.0663, lr=2.56e-5, step=128]
Epoch 4: 94% 30/32 [00:28<00:02, 1.22s/it, loss=0.0987, lr=3.16e-5, step=158]
...

Then, as before, push the model to Hugging Face Hub and create a model card:

# Push the model to the Hugging Face Hub
from huggingface_hub import HfApi, ModelCard, create_repo

create_repo(hub_model_id)

api = HfApi()
api.upload_folder(
    folder_path=f"{model_name}/scheduler",
    path_in_repo="",
    repo_id=hub_model_id
)
api.upload_folder(
    folder_path=f"{model_name}/unet/",
    path_in_repo="",
    repo_id=hub_model_id
)
api.upload_file(
    path_or_fileobj=f"{model_name}/model_index.json",
    path_in_repo="model_index.json",
    repo_id=hub_model_id
)
content=f"""
---
license:mit
tags:
- pytorch
- diffusers
-unconditional-image_generation
-diffusion-models-class
---
# This is an unconditional image generation diffusion model (test), used to generate beautiful butterfly images
```python
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained('{hub_model_id}')
image = pipeline().images[0]
image
```
"""

Push to Hugging Face

card=ModelCard(content)
card.push_to_hub(hub_model_id)

Use this model:

pipeline = DDPMPipeline.from_pretrained(hub_model_id).to(device)
images = pipeline(batch_size=8).images
make_grid(images)

Before looking at various models, it is worth understanding safetensors. You will see it as the suffix of many model files; different model families have their own native suffixes, yet safetensors keeps showing up among them, which can be confusing at first. The reason is simply that safetensors supports all kinds of AI models: before safetensors appeared, each kind of model had its own format, so nowadays a model can usually be distributed either as safetensors or in its original format.
In fact, safetensors is an open-source model format developed by Hugging Face. It has several advantages:

  • Safe: unlike pickle-based formats, it cannot execute arbitrary code, and it helps prevent DoS attacks
  • Loads quickly
  • Supports lazy loading
  • Highly general across frameworks and models

So most open-source models now provide a safetensors version.

References

  • Diffusers GitHub
  • PyArrow – Apache Arrow Python bindings
  • Detailed explanation of Git large file storage (Git LFS)
  • What does "%%capture" in Jupyter notebook mean?
  • PIL library introduction
  • Understand the operating principle of Stable Diffusion in ten minutes
  • DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
  • Use AI to draw yourself into the animation and get 1.5 million + views in 3 days
  • Accelerate