Text-to-Image Stable Diffusion XL 1.0 Model Full Fine-tuning Guide (U-Net Full-Parameter Fine-tuning)

Article directory

  • Preface
  • Important tutorial links
  • Take poster generation fine-tuning as an example
    • Overall process
    • Data collection
      • POSTER-TEXT
      • AutoPoster
      • CGL-Dataset
      • PKU PosterLayout
      • PosterT80K
      • Movie & TV Series & Anime Posters
    • Data cleaning and annotation
    • Model training
    • Model evaluation
    • Generate image samples
      • Pet bag product poster
      • Skin care essence product poster
    • Some Tips
      • Meta: EMU (Expressive Media Universe)
      • ideogram
      • DALL-E3
      • About model optimization
      • Examples of Commonly Used Negative Prompts:

Preface

Stable Diffusion is a large generative model in the field of computer vision that can perform image generation tasks such as text-to-image (txt2img) and image-to-image (img2img) generation. The open-source release of Stable Diffusion, and the series of work built on it, have enabled the field of AI painting to reach unprecedented levels of quality and creativity.

In July this year, Stability AI officially launched Stable Diffusion XL (SDXL) 1.0, currently the best open-source model in the field of image generation; the text-to-image model has completed another important iteration in its evolution. SDXL 1.0 can generate high-quality images in almost any artistic style and is the best open-source model for photorealistic results. The model is well tuned for color vibrancy and accuracy, with better contrast, lighting, and shadows than the previous generation, all at a native resolution of 1024×1024. In addition, SDXL 1.0 shows large improvements on concepts that are difficult to generate, such as hands, text, and spatial arrangements.

At present, most text2img training tutorials focus on LoRA, DreamBooth, Textual Inversion, and similar methods, and most of them also rely on visual UI tools such as SD WebUI or one-click AI painting launchers. There are almost no detailed tutorials on full fine-tuning, so this article records the materials I referenced while fine-tuning the SDXL Base model, along with explanations of some training parameters.

Important tutorial links

  • Main reference: Interpretation of training process https://zhuanlan.zhihu.com/p/643420260 – Zhihu
  • Detailed explanation of SDXL principle: https://zhuanlan.zhihu.com/p/650717774
  • Code base: https://github.com/qaneel/kohya-trainer
  • SDXL training instructions: https://github.com/kohya-ss/sd-scripts/blob/main/README.md#sdxl-training
  • Example of fine-tuning Stable Diffusion model: https://keras.io/examples/generative/finetune_stable_diffusion/
  • Hugging Face's official Diffusers-based SDXL fine-tuning code:
    https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md

Take poster generation fine-tuning as an example

Overall process

Data collection

Use public data sets from scientific research institutions, companies and the Kaggle platform, as follows:

POSTER-TEXT

The POSTER-TEXT dataset targets the text-image generation task for e-commerce poster images. It contains 114,009 records, is provided by Alibaba Group, and includes both the original poster images and versions with the poster text erased.
Paper: TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design ACM MM 2023.
Source: https://tianchi.aliyun.com/dataset/160034

AutoPoster

The AutoPoster-Dataset concerns the automated generation of e-commerce poster images. It contains 76,000 records and is provided by Alibaba Group.

Some images have duplicate annotations. The paper reports 69,249 images in the training set and 7,711 in the test set, but after removing duplicates there are actually 68,866 unique advertising poster images in the training set and 7,671 in the test set.

Paper: AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation ACM MM 2023
Source: https://tianchi.aliyun.com/dataset/159829

CGL-Dataset

Paper: Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs IJCAI 2022
Github: https://github.com/minzhouGithub/CGL-GAN
Source: https://tianchi.aliyun.com/dataset/142692

PKU PosterLayout

As the first public dataset to contain complex layouts, it poses additional difficulty in modeling intra-layout relationships and represents an extended task that requires complex layouts. It contains 9,974 training images and 905 test images.

  • Domain diversity
    Data were collected from multiple sources, including an e-commerce poster dataset and multiple photo gallery websites. Images are diverse in terms of domain, quality, and resolution, which results in changes in data distribution and makes the dataset more general.
  • Content diversity
    Nine categories are defined, covering most products, including food/beverages, cosmetics/accessories, electronics/office supplies, toys/instruments, apparel, sports/transportation, groceries, appliances/decor, and fresh produce.

Paper: A New Dataset and Benchmark for Content-aware Visual-Textual Presentation Layout CVPR 2023
Github: https://github.com/PKU-ICST-MIPL/PosterLayout-CVPR2023
Source: http://59.108.48.34/tiki/PosterLayout/

PosterT80K

E-commerce poster images, but the data is not public. The contributing institutions are the University of Science and Technology of China and Alibaba.
Paper: TextPainter: Multimodal Text Image Generation with Visual-harmony and Text-comprehension for Poster Design, ACM MM 2023
Source: None

Movie & TV Series & Anime Posters

For the public data on Kaggle, you need to write a download script that fetches images from the URL addresses given in the provided csv or json files.

Source:

  • https://www.kaggle.com/datasets/bourdier/all-tv-series-details-dataset
    file prefix: https://www.themoviedb.org/t/p/w600_and_h900_bestv2/xx.jpg
  • https://www.kaggle.com/datasets/crawlfeeds/movies-and-tv-shows-dataset
  • https://www.kaggle.com/datasets/phiitm/movie-posters
  • https://www.kaggle.com/datasets/ostamand/tmdb-box-office-prediction-posters
  • https://www.kaggle.com/datasets/dbdmobile/myanimelist-dataset
  • https://www.kaggle.com/zakarihachemi/datasets
  • https://www.kaggle.com/datasets/rezaunderfit/48k-imdb-movies-data

Take the first data source as an example:

import csv
import os
import requests
import warnings
warnings.filterwarnings('ignore')

csv_file = r"C:\Users\xxx\Downloads\tvs.csv"
url_prefix = 'https://www.themoviedb.org/t/p/w600_and_h900_bestv2'
save_root_path = r"D:\dataset\download_data\tv_series"


def parse_csv(path):
    cnt = 0
    s = requests.Session()
    s.verify = False # Turn off ssl verification globally
    with open(path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            raw_img_url = row['poster_path']  # poster URL field in the csv
            if raw_img_url == '':
                continue
            img_url = url_prefix + raw_img_url
            try:
                img_file = s.get(img_url, verify=False)
            except Exception as e:
                print(repr(e))
                continue  # the request itself failed, skip this image

            if img_file.status_code == 200:
                img_name = raw_img_url.split('/')[-1]
                # img_name = row['url'].split('/')[-2] + '.jpg'
                save_path = os.path.join(save_root_path, img_name)
                with open(save_path, 'wb') as img:
                    img.write(img_file.content)

                cnt += 1
                print(cnt, 'saved!')
            else:
                print("Error status response code: {}".format(img_file.status_code))

    print("Done!")


if __name__ == '__main__':
    if not os.path.exists(save_root_path):
        os.makedirs(save_root_path)
    parse_csv(csv_file)

Data cleaning and annotation

Data screening criteria: remove images with resolution below 512 or above 1024, remove images whose file size (KB) divided by resolution (total pixel count) is below 0.0005, and remove images with a DPI below 96. The 0.0005 threshold is a subjective standard derived from the file size (KB) and resolution of images generated by SD, and is used to ensure image quality; a filtering sketch is given below. The values of this ratio for eight SD-generated images are as follows:

SD generated file size/image resolution: 0.00129, 0.0012, 0.0011, 0.00136, 0.0014, 0.0015, 0.0013, 0.00149
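
As a reference, below is a minimal filtering sketch implementing the criteria above (it assumes "resolution" means the total pixel count, reads DPI via PIL with a fallback of 96, and uses a hypothetical directory; adjust the thresholds to your own data):

import os
from PIL import Image

def keep_image(path, min_side=512, max_side=1024, ratio_thresh=0.0005, min_dpi=96):
    """Return True if the image passes the screening criteria described above."""
    size_kb = os.path.getsize(path) / 1024                # file size in KB
    with Image.open(path) as im:
        w, h = im.size
        dpi = im.info.get('dpi', (96, 96))[0]             # fall back to 96 when DPI is missing
    if min(w, h) < min_side or max(w, h) > max_side:      # drop images that are too small or too large
        return False
    if size_kb / (w * h) < ratio_thresh:                  # file size (KB) / total pixel count
        return False
    if dpi < min_dpi:
        return False
    return True

kept = [f for f in os.listdir('./raw_posters')
        if keep_image(os.path.join('./raw_posters', f))]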

Image annotation: Use BLIP and Waifu models to automatically annotate. There are detailed instructions in the Zhihu link given above, so I won’t go into details here.
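
For the BLIP part, a minimal captioning sketch with the transformers library is shown below (the checkpoint name, file path, and generation settings are assumptions; the tutorial itself uses kohya's bundled annotation scripts):

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-large')
model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-large')

image = Image.open('./raw_posters/poster_0001.jpg').convert('RGB')
inputs = processor(images=image, return_tensors='pt')
out = model.generate(**inputs, max_new_tokens=50)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # typically written to a .txt/.caption file next to the image for training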

Model training

  1. Model preparation:
    Images generated with the SDXL 1.0 VAE show color-stripe artifacts caused by the digital watermarking; this can be solved by using the 0.9 VAE file. It is recommended to use the integrated Base model (link). Hugging Face also has relevant instructions, as follows:

SDXL's VAE is known to suffer from numerical instability issues. This is why we also expose a CLI argument namely --pretrained_vae_model_name_or_path that lets you specify the location of a better VAE (such as this one).
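
For illustration, a minimal diffusers sketch that loads the Base model with a separately specified VAE is given below (the VAE repository name is just one commonly used replacement, not necessarily the exact 0.9 file mentioned above):

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# load the replacement VAE separately and pass it to the pipeline
vae = AutoencoderKL.from_pretrained('madebyollin/sdxl-vae-fp16-fix', torch_dtype=torch.float16)
pipe = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    vae=vae,
    torch_dtype=torch.float16,
).to('cuda')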

  2. Description of some training parameters (an example launch sketch follows this list)
    • mixed_precision: whether to use mixed precision, chosen according to the U-Net model, ["no", "fp16", "bf16"]
    • save_precision: precision in which the model is saved, [float, fp16, bf16]; "float" means torch.float32
    • vae: The specified vae model used for training and inference
    • keep_tokens: Tags will be randomly shuffled during training. If set to n, the order of the first n tags will not be shuffled.
    • shuffle_caption: bool, shuffle labels, which can enhance model generalization
    • _load_target_model() in sdxl_train_util.py determines whether the model is read from a single safetensors file; the code there can be modified to load the model with StableDiffusionXLPipeline.
    • sample_every_n_steps: the number of training iterations between each inference
    • noise_offset: avoids the tendency to generate images with an average brightness of 0.5, which greatly improves the generation of logos, 3D cropped images, and naturally dark or bright scenes.
    • optimizer_args: Additional setting parameters of the optimizer, such as weight_decay and betas, etc., are defined in train_util.py.
    • clip_threshold: AdaFactor optimizer parameter, add this parameter in optimizer_args, the default value is 1.0. Reference: https://huggingface.co/docs/transformers/main_classes/optimizer_schedules
    • Gradient checkpointing: Use extra computing time in exchange for GPU memory, allowing you to train larger models with limited GPU memory.
    • lr_scheduler: Set learning rate change rules, such as linear, cosine, constant_with_warmup
    • lr_scheduler_args: Set the specific parameters under this rule, please refer to the pytorch documentation
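
To make these parameters concrete, here is a hypothetical launch sketch for kohya's sdxl_train.py (paths, dataset layout, and hyperparameter values are placeholders; flag names follow the sd-scripts README, but check the exact flag set of your version):

import subprocess

cmd = [
    'accelerate', 'launch', 'sdxl_train.py',
    '--pretrained_model_name_or_path', './models/sd_xl_base_1.0.safetensors',
    '--vae', './models/sdxl_vae_0.9.safetensors',    # the separately specified VAE
    '--train_data_dir', './dataset/posters',
    '--output_dir', './output/sdxl_poster',
    '--resolution', '1024,1024',
    '--mixed_precision', 'bf16',                     # ["no", "fp16", "bf16"]
    '--save_precision', 'fp16',
    '--shuffle_caption',
    '--keep_tokens', '2',                            # the first 2 tags keep their order
    '--noise_offset', '0.05',
    '--gradient_checkpointing',
    '--optimizer_type', 'AdaFactor',
    '--optimizer_args', 'relative_step=False', 'scale_parameter=False', 'clip_threshold=1.0',
    '--lr_scheduler', 'constant_with_warmup',
    '--learning_rate', '4e-7',
    '--train_batch_size', '4',
    '--max_train_steps', '5000',
    '--sample_every_n_steps', '500',
]
subprocess.run(cmd, check=True)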

Model evaluation

At present, evaluation in the AIGC field is still fairly subjective overall, but here we use the aesthetics score and the CLIP score to measure the quality of the generated images and the text-image alignment, respectively. The evaluation code is based on GhostReview, developed by the author of GhostMix; I only took part of it and made some optimizations, so please read it together with the original author's code. The specific code is as follows:

import numpy as np
import torch
import pytorch_lightning as pl
import torch.nn as nn
import clip
import os
import torch.nn.functional as F
import pandas as pd
from PIL import Image
import scipy

class MLP(pl.LightningModule):
    def __init__(self, input_size, xcol='emb', ycol='avg_rating'):
        super().__init__()
        self.input_size = input_size
        self.xcol = xcol
        self.ycol = ycol
        self.layers = nn.Sequential(
            nn.Linear(self.input_size, 1024),
            #nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(1024, 128),
            #nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 64),
            #nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(64, 16),
            #nn.ReLU(),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.layers(x)

    def training_step(self, batch, batch_idx):
        x = batch[self.xcol]
        y = batch[self.ycol].reshape(-1, 1)
        x_hat = self.layers(x)
        loss = F.mse_loss(x_hat, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x = batch[self.xcol]
        y = batch[self.ycol].reshape(-1, 1)
        x_hat = self.layers(x)
        loss = F.mse_loss(x_hat, y)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


def normalized(a, axis=-1, order=2):
    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))
    l2[l2==0]=1
    return a / np.expand_dims(l2, axis)


def PredictionLAION(image, laion_model, clip_model, clip_process, device='cpu'):
    image = clip_process(image).unsqueeze(0).to(device)
    with torch.no_grad():
        image_features = clip_model.encode_image(image)
    im_emb_arr = normalized(image_features.cpu().detach().numpy())
    prediction = laion_model(torch.from_numpy(im_emb_arr).to(device, dtype=torch.float))
    return float(prediction)


# CLIP score for a single image
def get_clip_score(image, text, clip_model, preprocess, device='cpu'):
    # Preprocess the image and tokenize the text
    image_input = preprocess(image).unsqueeze(0)
    text_input = clip.tokenize([text], truncate=True)

    # Move the inputs to GPU if available
    image_input = image_input.to(device)
    text_input = text_input.to(device)

    # Generate embeddings for the image and text
    with torch.no_grad():
        image_features = clip_model.encode_image(image_input)
        text_features = clip_model.encode_text(text_input)

    #Normalize the features
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Calculate the cosine similarity to get the CLIP score
    clip_score = torch.matmul(image_features, text_features.T).item()

    return clip_score


if __name__ == '__main__':
    #Read image path
    ImgRoot = './Image/ImageRating'
    DataFramePath = './dataresult/MyImageRating' # all prompts results of each model
    ModelSummaryFile = './ImageRatingSummary/MyModelSummary_Total.csv'

    PromptsFolder = os.listdir(ImgRoot)
    if not os.path.exists(DataFramePath):
        os.makedirs(DataFramePath)

    # Read the prompts corresponding to the image
    PromptDataFrame = pd.read_csv('./PromptsForReviews/mytest.csv')
    PromptsList = list(PromptDataFrame['Prompts'])

    #Load the evaluation model
    device = "cuda" if torch.cuda.is_available() else "cpu"
    MLP_Model = MLP(768) # CLIP embedding dim is 768 for CLIP ViT L 14
    # load LAION aesthetics model
    state_dict = torch.load("./models/sac+logos+ava1-l14-linearMSE.pth", map_location=torch.device(device))
    MLP_Model.load_state_dict(state_dict)
    MLP_Model.to(device)
    MLP_Model.eval()
    # Load the pre-trained CLIP model and the image
    CLIP_Model, CLIP_Preprocess = clip.load('ViT-L/14', device=device, download_root='./models/clip') # RN50x64
    CLIP_Model.to(device)
    CLIP_Model.eval()

    # Skip prompts that have already been done
    try:
        DataSummaryDone = pd.read_csv(ModelSummaryFile)
        PromptsNotDone = [i for i in PromptsFolder if i not in list(DataSummaryDone['Model'])]
    except:
        DataSummaryDone = pd.DataFrame()
        PromptsNotDone = [i for i in PromptsFolder]
    if not PromptsNotDone:
        import sys
        sys.exit("There are no models to analyze.")

    for i, name in enumerate(PromptsNotDone):
        FolderPath = os.path.join(ImgRoot, str(name))
        ImageInFolder = os.listdir(FolderPath)
        DataCollect = pd.DataFrame()
        for j, img in enumerate(ImageInFolder):
            prompt_index = int(img.split('-')[1])
            txt = PromptsList[prompt_index]
            ImagePath = os.path.join(FolderPath, img)
            Img = Image.open(ImagePath)
            #Clipscore
            ImgClipScore = get_clip_score(Img, txt, CLIP_Model, CLIP_Preprocess, device)

            #aesthetics scorer
            # ImageScore = predict(Img)

            #LAION aesthetics scorer
            ImageLAIONScore = PredictionLAION(Img, MLP_Model, CLIP_Model, CLIP_Preprocess, device)

            #temp = list(ImageScore)
            temp = list()
            temp.append(float(ImgClipScore))
            temp.append(ImageLAIONScore)
            temp = pd.DataFrame(temp)
            DataCollect = pd.concat([DataCollect, temp], axis=1)
            print("Model:{}/{}, image:{}/{}".format(i + 1, len(PromptsNotDone), j + 1, len(ImageInFolder)))
        DataCollect = DataCollect.T
        DataCollect['ImageIndex'] = [i + 1 for i in range(len(ImageInFolder))]

        DataCollect.columns = ['ClipScore', 'LAIONScore', 'ImageIndex']

        # Save original data
        DataCollect.to_csv(os.path.join(DataFramePath, str(name) + '.csv'), index=False)
        print("One Results File Saved!")
    print('Image rating complete!')


    #do some calculations
    ModelSummary = pd.DataFrame()
    for i in PromptsNotDone:
        DataCollect = pd.read_csv(os.path.join('dataresult/MyImageRating', str(i) + '.csv'))
        temp = pd.DataFrame(DataCollect['LAIONScore'].describe()).T
        # Calculate the skewness of the data
        temp['skew'] = scipy.stats.skew(DataCollect['LAIONScore'], axis=0, bias=True, nan_policy="propagate")
        # Calculate the kurtosis of the data
        temp['kurtosis'] = scipy.stats.kurtosis(DataCollect['LAIONScore'], axis=0, fisher=True, bias=True,
                                                nan_policy="propagate")
        temp.columns = [i + '_LAIONScore' for i in list(temp.columns)]
        # temp['RatingScore_mean']=np.mean(DataCollect['Rating'])
        # temp['RatingScore_std']=np.std(DataCollect['Rating'])
        temp['Clipscore_mean'] = np.mean(DataCollect['ClipScore'])
        temp['Clipscore_std'] = np.std(DataCollect['ClipScore'])
        # temp['Artifact_mean']=np.mean(DataCollect['Artifact'])
        # temp['Artifact_std']=np.std(DataCollect['Artifact'])
        temp['Model'] = str(i)
        ModelSummary = pd.concat([ModelSummary, temp], axis=0)

    # save results
    new_order = ['Model', 'count_LAIONScore', 'mean_LAIONScore', 'std_LAIONScore',
                 'min_LAIONScore', '25%_LAIONScore', '50%_LAIONScore', '75%_LAIONScore',
                 'max_LAIONScore', 'skew_LAIONScore', 'kurtosis_LAIONScore',
                 'Clipscore_mean', 'Clipscore_std']
    # Use the reindex() method to reorder the columns
    ModelSummary = ModelSummary.reindex(columns=new_order)

    DataSummaryDone = pd.concat([DataSummaryDone, ModelSummary], axis=0)
    DataSummaryDone.to_csv('./ImageRatingSummary/MyModelSummary_Total.csv')

    pd.set_option('display.max_rows', None) # None means no limit
    pd.set_option('display.max_columns', None) # None means no limit
    pd.set_option('display.width', 1000) # Set the width to 1000 characters
    print(DataSummaryDone)

The figure below shows a comparison between the SDXL-Poster model trained in this article and mainstream text-to-image models. Note that the results for the models whose names begin with Anything were computed by me from 180 images generated by calling those models, so their standard deviations are all rather large; the remaining rows are the GhostReview author's results computed from 960 images generated per model. Because the sample sizes are inconsistent, readers should interpret the comparison with caution.

Generate image samples

Compare the SDXL-Poster trained in this article with SDXL-Base and CyberRealistic.

Pet bag product poster

A feline peering out from a striped transparent travel bag with a bicycle in the background. Outdoor setting, sunset ambiance. Product advertisement of pet bag, No humans, focus on cat and bag, vibrant colors, recreational theme


(a) SDXL-Poster


(b) SDXL-Base


(c) CyberRealistic

Skin care essence product poster

Four amber glass bottles with droppers placed side by side, arranged on a white background, skincare product promotion, no individuals present, still life setup


(a) SDXL-Poster


(b) SDXL-Base


(c) CyberRealistic

Some Tips

Meta: EMU (Expressive Media Universe)

From simple text to an image in 5 seconds. Paper: https://arxiv.org/abs/2309.15807
Zhihu detailed interpretation: https://zhuanlan.zhihu.com/p/659476603

The paper introduces EMU's training method, quality-tuning, a form of supervised fine-tuning. Its key points:

  • The fine-tuning dataset can be surprisingly small, on the order of a few thousand images;
  • The quality of the dataset must be very high, which makes data curation hard to fully automate and requires manual annotation;
  • Even with a small fine-tuning dataset, quality-tuning significantly improves the aesthetics of the generated images without sacrificing generality, as measured by fidelity to the input prompts;
  • The generation process of the pre-trained base model is not guided toward images that match the statistical distribution of the fine-tuning dataset, whereas quality-tuning effectively constrains the outputs to stay consistent with the distribution of the fine-tuning subset;
  • Images with resolution lower than 1024×1024 are upscaled with a pixel diffusion upsampler.

ideogram

ideogram is a generative model that can render text within images, released on August 23, 2023; it is free to use, official website https://ideogram.ai/

DALL-E3

“Text rendering is still unreliable and they believe the model has difficulty mapping word tokens to letters in images”

  • To enhance the model's prompt-following ability, an image captioner is trained to generate more accurate and detailed image captions.
  • Synthesized long captions improve model performance; mixing them with ground-truth captions at a 95% ratio gives the best results. Long captions are obtained by upsampling human descriptions with GPT-4.

About model optimization

  • It works better to merge the trained base model with a LoRA model of a similar style (a sketch is given after this list)
  • The function of the base model is to be compatible with multiple styles, and style refinement is what LoRA does.
  • SDXL has problems when generating text and hands: https://zhuanlan.zhihu.com/p/649308666
  • The number of fine-tuning iterations should not exceed 5k, otherwise obvious overfitting occurs and the generality of visual concepts is reduced (source: the EMU paper's tips)

Examples of Commonly Used Negative Prompts:

  1. Basic Negative Prompts: worst quality, normal quality, low quality, low res, blurry, text, watermark, logo, banner, extra digits, cropped, jpeg artifacts, signature, username, error, sketch, duplicate, ugly, monochrome, horror, geometry, mutation, disgusting.
  2. For Animated Characters: bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, realistic photo, extra eyes, huge eyes, 2girl, amplifier, disconnected limbs.
  3. For Realistic Characters: bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, huge eyes, 2girl, amplifier, disconnected limbs, cartoon, cg, 3d, unreal, animate.
  4. For Non-Adult Content: nsfw, nude, censored.
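
For reference, a short sketch of how such negative prompts are passed at inference time with diffusers (model, prompt, and sampling settings are illustrative):

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0', torch_dtype=torch.float16
).to('cuda')

negative_prompt = (
    'worst quality, normal quality, low quality, low res, blurry, text, watermark, '
    'logo, banner, extra digits, cropped, jpeg artifacts, signature, username, error'
)
image = pipe(
    prompt='Four amber glass bottles with droppers on a white background, skincare product promotion',
    negative_prompt=negative_prompt,
    width=1024, height=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save('sample.png')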