Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

This article introduces a human preference dataset and an HPS score for evaluating generated images, but its core idea is to feed HPS back into SD as an extra signal, so that the generated images align better with human preferences.

abstract

A human preference dataset of generated images was collected from the Stable Diffusion Discord, and a human preference score, HPS, was trained on this dataset.

1.introduction

Image generation still has problems, such as unnatural combinations of bodies and facial expressions in the generated images.

It is still uncertain whether the existing evaluation metrics IS and FID correlate with human preferences: they are single-modal metrics that do not take user intent into account. CLIP is also used for evaluation, to measure the consistency between generated images and their text prompts. By fine-tuning the CLIP model and defining the human preference score (HPS), a human preference classifier is trained; HPS indicates which generated images humans prefer, and SD is then adapted via LoRA so that it generates such images.

2.related works

DiffusionDB: collected from the Stable Diffusion Discord channel, including the text prompt and generation parameters for each image. SAC: a dataset of images generated by GLIDE, including user ratings. Generated image database: https://lexica.art

Human feedback learning: the potential of human feedback goes far beyond exact prompt-image alignment once aesthetic preferences are taken into account.

3.human preference dataset

DiscordChatExporter is used to download the chat history and store it in JSON format. The user starts a conversation by sending a text prompt to the bot, the bot generates several images in response, the user then selects a preferred image and sends it back to the bot together with the text, and the bot returns several refined images. This interaction follows a predefined grammar, which allows human choices and the associated images to be extracted with simple pattern matching.

In total, 98,807 images generated from 25,205 prompts were obtained. Each prompt corresponds to several images; the user selects one as the preferred image, and the other images serve as non-preferred negative examples. The number of images per prompt varies: 23,722 prompts have four images, 953 prompts have three images, and 530 prompts have two images, depending on what the user specified in the generation request. The dataset is highly diverse, covering a wide range of topics, with choices from 2,659 different users and each user contributing at most 267 choices. The dataset is organized as follows:

dataset
----preference_images/
-------- {instance_id}_{image_id}.jpg
---- preference_train.json
---- preference_test.json

preference_{train/test}.json:
[
    {
        'human_preference': int,
        'prompt': str,
        'id': int,
        'file_path': list[str],
        'user_hash': str,
        'contain_name': boolean,
    },
    ...
]
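
As a quick sanity check, a minimal Python sketch for reading this annotation file is shown below. It assumes the schema above and, in particular, assumes that 'human_preference' is the index of the chosen image within 'file_path'; the file path is the one from the layout above.

import json

# Minimal sketch: pair each prompt with its preferred and non-preferred images.
# Assumption: 'human_preference' is the index of the chosen image in 'file_path'.
with open("dataset/preference_train.json") as f:
    records = json.load(f)

for rec in records[:3]:
    chosen = rec["human_preference"]
    preferred = rec["file_path"][chosen]
    non_preferred = [p for i, p in enumerate(rec["file_path"]) if i != chosen]
    print(rec["prompt"])
    print("  preferred:    ", preferred)
    print("  non-preferred:", non_preferred)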

4. existing metrics

4.1 metrics by inception net

IS and FID are two commonly used metrics for evaluating the quality of generated images; both analyze images with an Inception network trained on ImageNet.

Inception Score (IS) classifies the generated images with Inception-v3: the confidence of the predicted class distribution of each image estimates its realism, and the KL divergence between the per-image distribution and the marginal class distribution measures diversity. Computing IS separately for the preferred and non-preferred images shows no obvious difference.
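
For reference, a minimal numpy sketch of the Inception Score formula is given below; it assumes the Inception-v3 class probabilities for the generated images have already been computed and only illustrates the exp(E_x[KL(p(y|x) || p(y))]) step.

import numpy as np

def inception_score(probs: np.ndarray) -> float:
    # probs: (N, num_classes) softmax outputs of Inception-v3 for N generated images.
    marginal = probs.mean(axis=0, keepdims=True)                        # p(y)
    kl = (probs * (np.log(probs + 1e-12) - np.log(marginal + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))                                     # exp(E_x[KL(p(y|x) || p(y))])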

FID computes the distance between feature vectors of real and generated images. Since the generated images here are produced from user-provided prompts, 10,000 text prompts are randomly sampled from the dataset, and for each prompt the most closely matching image in the LAION dataset is taken as its real counterpart. Then 10,000 images are randomly sampled from the preferred and the non-preferred splits of the collected dataset, and FID is computed against the real images.
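
For completeness, a minimal sketch of the FID computation is shown below; it assumes Inception-v3 pooling features for the real and generated images are already available and implements the standard ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)) formula.

import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    # feat_real, feat_gen: (N, d) Inception-v3 pooling features of the two image sets.
    mu1, mu2 = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))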

Neither IS nor FID shows a significant difference between preferred and non-preferred images, indicating that they are not reliable indicators of human preference.

Discussion:

1. Generated images often contain shape artifacts; however, classification-based convolutional neural networks are biased towards object texture rather than shape, so they are likely to ignore shape artifacts in generated images.

2. The classification model is trained on real data such as ImageNet, which does not contain many of the styles present in the generated dataset.

3. The metrics are limited to a single modality.

4.2 metrics by clip

Thanks to its large and diverse training data, CLIP is better at encoding images from different domains than models trained on ImageNet. In addition, by encoding the text prompt it can capture user intent, so it can evaluate the alignment between the prompt and the generated image. The CLIP score is the cosine similarity between the prompt embedding and the image embedding computed by CLIP. The ViT-L/14 and RN50x64 models, the largest open-source CLIP models based on transformer and CNN architectures respectively, are evaluated, and both versions perform well.

The aesthetic score proposed by LAION is a CLIP-based image quality assessment tool, mostly used to filter the training data of Stable Diffusion. It is built on the pre-trained ViT-L/14 CLIP image encoder: an MLP is added on top of the image encoder to output an aesthetic score. The MLP is trained on several aesthetic datasets containing both real and generated images, such as AVA and SAC, and predicts scores on a scale of 1-10. Unlike the CLIP score, the aesthetic classifier is not conditioned on the prompt text.
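
A minimal sketch of the CLIP score computation is given below. It uses the OpenAI clip package (an assumption about tooling); the image file and prompt are placeholders, and the score reported here is simply the raw cosine similarity between the two embeddings.

import torch
import clip                                  # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

image = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)   # placeholder file
text = clip.tokenize(["a cat riding a bicycle"]).to(device)               # placeholder prompt

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_score = (img_emb * txt_emb).sum().item()                         # cosine similarity
print(clip_score)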

Both the CLIP score and the aesthetic score predict human choices better than random selection.

5. human preference score

First, a human preference classifier is trained to predict, for a given prompt, which image a user would choose; the human preference score (HPS) is then derived from the trained classifier.

Human preference classifier. CLIP's ViT-L/14 is fine-tuned on the collected dataset to better match human preferences. Each training sample contains a prompt and n ∈ {2, 3, 4} images, of which only one is preferred by the user. The model is trained to maximize the similarity between the text embedding computed by the CLIP text encoder and the embedding of the preferred image computed by the CLIP visual encoder, while minimizing the similarity with the non-preferred images; in this way CLIP is fine-tuned on human choices over generated images.
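
One natural way to implement "maximize the similarity with the preferred image and minimize it with the others" is a cross-entropy loss over the per-image similarities, with the preferred index as the target. The sketch below shows this formulation under that assumption; the temperature value is a placeholder, not a number from the paper.

import torch
import torch.nn.functional as F

def preference_loss(txt_emb, img_embs, preferred_idx, temperature=0.07):
    # txt_emb: (d,) normalized prompt embedding from the CLIP text encoder.
    # img_embs: (n, d) normalized embeddings of the n candidate images.
    # preferred_idx: index of the image the user chose.
    logits = img_embs @ txt_emb / temperature            # similarity of each image to the prompt
    target = torch.tensor([preferred_idx], device=logits.device)
    # Cross-entropy pushes the preferred image's similarity up and the others' down.
    return F.cross_entropy(logits.unsqueeze(0), target)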

The human preference score (HPS) is obtained from the human preference classifier as

HPS(img, prompt) = 100 · cos( enc_img(img), enc_text(prompt) )

where enc_text and enc_img are the fine-tuned CLIP text and image encoders respectively.
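
Given embeddings from the fine-tuned encoders, the score itself is just the scaled cosine similarity above; a minimal sketch:

import torch

def hps(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    # img_emb, txt_emb: 1-D embeddings of one image and one prompt from the
    # fine-tuned CLIP encoders; the score is 100 times their cosine similarity.
    img_emb = img_emb / img_emb.norm()
    txt_emb = txt_emb / txt_emb.norm()
    return 100.0 * torch.dot(img_emb, txt_emb).item()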

6.better aligning stable diffusion with human preferences

HPS can guide diffusion-based generative models towards better alignment with human preferences. The article argues that the misalignment between generated images and human preferences is a lack of awareness rather than a lack of model capacity. This is actually an interesting point: large generative models are generally already very capable at producing images, and what may be lacking is the ability to elicit that capability, i.e. multi-modal inputs such as prompts or images are not natural enough to bring out the full capacity of the model.

How is the alignment problem solved? The paper proposes to adapt the generative model by explicitly distinguishing preferred from non-preferred images. Another dataset is constructed, containing prompts and their newly generated images, which are classified as preferred or non-preferred by the previously trained human preference classifier. For non-preferred images, the corresponding prompt is modified by adding a special prefix, and the model is trained with LoRA, which teaches it the concept of a non-preferred image. This is essentially what DreamBooth does, except that here the model is additionally given the concepts of preferred and non-preferred images, and the dataset can be constructed automatically with the HPS classifier.

Constructing the dataset: training data is built from DiffusionDB's large_first_1m split and from SD's pre-training dataset LAION-5B. DiffusionDB is a large-scale dataset of generated images paired with their text prompts. For the images in DiffusionDB, the HPS is first computed for each image-prompt pair, and the training data is then constructed from these scores.
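
A minimal sketch of this data construction is shown below: each DiffusionDB image-prompt pair is scored with HPS, low-scoring pairs are treated as non-preferred, and their prompts get the special prefix. The threshold value and prefix string are placeholders, not values taken from the paper.

SPECIAL_PREFIX = "Weird image. "   # placeholder marker for non-preferred prompts
HPS_THRESHOLD = 20.0               # placeholder cut-off

def build_training_pairs(pairs, hps_fn):
    # pairs: iterable of (image, prompt); hps_fn: callable returning the HPS of a pair.
    data = []
    for image, prompt in pairs:
        if hps_fn(image, prompt) >= HPS_THRESHOLD:       # preferred: keep the prompt as-is
            data.append({"image": image, "prompt": prompt})
        else:                                            # non-preferred: mark with the prefix
            data.append({"image": image, "prompt": SPECIAL_PREFIX + prompt})
    return data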

7.Experiments

7.1 hps

To train the human preference classifier, the training split of the HPS dataset is used, containing 20,205 prompts and 79,167 images. CLIP's ViT-L/14 is used; the last 10 layers of the CLIP image encoder and the last 6 layers of the text encoder are fine-tuned, training runs for 1 epoch, and the inputs are padded to 224×224.
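
The layer-freezing setup described above might look roughly like the sketch below, assuming the OpenAI CLIP implementation (where the image and text transformers expose their blocks as resblocks); the learning rate is a placeholder.

import clip
import torch

model, _ = clip.load("ViT-L/14", device="cpu")

# Freeze everything, then unfreeze only the last 10 image-encoder blocks
# and the last 6 text-encoder blocks, as described above.
for p in model.parameters():
    p.requires_grad = False
for block in model.visual.transformer.resblocks[-10:]:   # image encoder
    for p in block.parameters():
        p.requires_grad = True
for block in model.transformer.resblocks[-6:]:            # text encoder
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-6)          # placeholder learning rate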

HPS and the CLIP score are positively correlated; compared with the CLIP score, HPS pays more attention to the aesthetic quality of the image.
