DALL-E 3: Improving Image Generation with Better Captions

References: "Text-to-image: DALL-E 3 paper interpretation (part 1)" on CSDN (https://blog.csdn.net/u012863603/article/details/134028230), an interpretation of the first version of the official DALL·E 3 technical report, and "DALL-E 3 technical report reading notes" on Zhihu (https://zhuanlan.zhihu.com/p/662745543), which notes that while everyone was speculating about DALL-E 3's architecture, OpenAI unexpectedly released a technical report, one focused on how synthesizing image captions improves generation.

DALL-E 3 in one sentence: on the data side, training uses 95% detailed synthetic captions produced by a captioning model (CoCa) plus 5% original human captions, and at inference GPT-4 is used to expand the user's short caption; on the model side, a T5-XXL text encoder + VAE encoder + latent diffusion + consistency decoder gives the best results, and the consistency decoder has been open-sourced.

The DALL-E 3 technical report mainly addresses two issues: 1. how to improve the model's generation ability by synthesizing image captions, where the main improvement is prompt following, i.e. the consistency between the generated image and the input text prompt; 2. how DALL-E 3 compares with other text-to-image models.

1. Synthetic image captions

Text-to-image models have limited text understanding: for even slightly complex prompts, the generated image tends to ignore part of the text description, or fails to depict what the text describes at all. The main reason is that the captions in the training datasets are not accurate enough. 1. Conventional image captions are too simple (e.g. COCO): most only describe the main subject in the image and ignore other information such as the background, the positions and number of objects, and any text in the image. 2. The image-text pairs currently used for training (e.g. LAION) are crawled from the web, and the text is alt-text (image replacement text); many of these descriptions contain irrelevant content such as advertisements.

OpenAI's solution is to train an image captioner to synthesize captions for the training images. BLIP did this before: in 2022 LAION released the LAION-COCO dataset with BLIP-generated captions, but those are short generated captions, and GPT-4V was not used to generate them. The model chosen here is Google's CoCa. Compared with CLIP, CoCa adds a multimodal text decoder to generate captions, and the training loss combines CLIP's contrastive loss with a cross-entropy (autoregressive) captioning loss.
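
As a rough illustration (not OpenAI's or Google's actual code), the sketch below shows how a CoCa-style objective combines a CLIP-style contrastive loss with an autoregressive captioning loss; the tensor shapes and the loss weight are assumptions.

import torch
import torch.nn.functional as F

def coca_style_loss(image_emb, text_emb, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    """Contrastive (CLIP-style) loss + autoregressive captioning loss.

    image_emb:       [B, D]    pooled image embeddings
    text_emb:        [B, D]    pooled (unimodal) text embeddings
    caption_logits:  [B, T, V] next-token logits from the multimodal text decoder
    caption_targets: [B, T]    ground-truth caption token ids
    """
    # symmetric InfoNCE over the in-batch image/text pairs
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2

    # autoregressive captioning loss (token-level cross-entropy)
    captioning = F.cross_entropy(caption_logits.flatten(0, 1),
                                 caption_targets.flatten())
    return contrastive + caption_weight * captioning

# toy usage with random tensors
B, D, T, V = 4, 512, 16, 32000
loss = coca_style_loss(torch.randn(B, D), torch.randn(B, D),
                       torch.randn(B, T, V), torch.randint(0, V, (B, T)))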

To improve the quality of the generated captions, the pre-trained image captioner was fine-tuned in two ways: 1. fine-tuning on short captions that only describe the main subject of the image; 2. fine-tuning on long captions that describe the image content in detail. The two resulting models generate short synthetic captions (SSC) and descriptive synthetic captions (DSC), respectively; the report's figure shows examples of both side by side.

The experiments mainly discuss two points: 1. the impact of synthetic captions on the text-to-image model; 2. the optimal mixing ratio of synthetic and original captions during training. Mixing is mainly to prevent the model from overfitting to certain patterns of the synthetic captions; for example, synthetic captions often start with "a" or "an". Mixing original captions into the synthetic captions during training acts as a kind of regularization.
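
A minimal sketch of what such mixing could look like in a data loader, assuming hypothetical field names synthetic_caption and original_caption (not from the report):

import random

def pick_caption(example, p_synthetic=0.95):
    """Return the model-generated long caption with probability p_synthetic,
    otherwise fall back to the original (alt-text / human) caption.
    The small share of original captions acts as regularization against the
    stylistic patterns of the synthetic captions."""
    if example.get("synthetic_caption") and random.random() < p_synthetic:
        return example["synthetic_caption"]
    return example["original_caption"]

# example
sample = {"original_caption": "a dog",
          "synthetic_caption": "a golden retriever lying on a wooden porch at sunset"}
print(pick_caption(sample))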

1. The impact of synthetic captions on the text-to-image model.

Three models were trained: 1. only original captions; 2. 5% original captions + 95% synthetic short captions; 3. 5% original captions + 95% synthetic long captions. The text-to-image model is a latent diffusion model. Its VAE uses 8x downsampling, like SD. The text encoder is T5-XXL, which can encode longer text and has stronger encoding capability. Training images are 256x256, batch size 2048, trained for 500,000 steps, which is equivalent to sampling about 1B samples. The UNet structure is presumably similar to SDXL, which also has 3 stages with 2 downsamplings: the first stage is pure convolution and the following 2 stages each contain attention.
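
The report does not give the UNet architecture; the following diffusers configuration is only a guess at what a 3-stage UNet with a pure-convolution first stage, attention in the last two stages, and T5-XXL conditioning (hidden size 4096) might look like. The channel widths are assumptions borrowed from SD-style models.

from diffusers import UNet2DConditionModel

# hypothetical 3-stage UNet: first stage has no attention, the next two do;
# cross_attention_dim=4096 matches T5-XXL's hidden size
unet = UNet2DConditionModel(
    sample_size=32,   # 256x256 images with an 8x-downsampling VAE
    in_channels=4,
    out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
    block_out_channels=(320, 640, 1280),
    layers_per_block=2,
    cross_attention_dim=4096,
)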

CLIP score is used for evaluation: the cosine similarity between the image embedding and the text embedding of the prompt, computed with CLIP ViT-B/32.
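
A sketch of how such a CLIP score can be computed with Hugging Face's CLIP ViT-B/32; the exact preprocessing and any scaling factor used in the report may differ.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between the CLIP image embedding and text embedding."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

# score = clip_score(Image.open("generated.png"), "a red cube on top of a blue sphere")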

In the report's left plot, the CLIP score is computed against the original captions; the blue line (images generated from long synthetic captions) still scores better than the red line (original captions). In the right plot, the CLIP score is computed against the long captions, where clearly long captions > short captions > original captions.

For the mixing ratio, 95% long captions > 90% long captions > 80% long captions.

Therefore, synthetic long captions help the final model a lot. However, training with 95% synthetic long captions also makes the model overfit to long captions: if a conventional short caption is used at inference, the results are much worse. To solve this, OpenAI uses GPT-4 to "upsample" user captions: the short caption a user types is expanded into a long caption before being fed to the model. With this optimization the results are indeed better.
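
The report gives its own system prompt for this upsampling; the sketch below only illustrates the idea with the OpenAI chat API, and the model name and instruction wording are my assumptions, not the report's.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

UPSAMPLE_INSTRUCTION = (
    "Expand the user's short image prompt into a detailed, descriptive caption. "
    "Describe the main subject, background, object positions and counts, style, "
    "and any text in the image. Reply with the caption only."
)

def upsample_prompt(user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": UPSAMPLE_INSTRUCTION},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content

# print(upsample_prompt("a cat wearing a spacesuit"))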

2. DALL-E 3

For the implementation of DALL-E 3: 1. DALL-E 3 is trained on a mix of 95% synthetic long captions and 5% original captions, which is one of the keys. 2. DALL-E 3 is presumably a scaled-up version of the T5-XXL + VAE encoder + latent diffusion + consistency decoder model. The generated resolution is currently 1024x1024 and above, presumably similar to SDXL: a progressive training strategy (256 -> 512 -> 1024) plus multi-scale training so the model can output images with various aspect ratios. The consistency decoder is already open source. It mainly improves image details, especially text and faces, to fix the distortion introduced by the VAE; it replaces the VAE decoder and can be used together with SD's VAE encoder.
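
The snippet below appears to be the usage example from OpenAI's open-sourced consistencydecoder repository: an image is encoded with Stable Diffusion's VAE encoder, then decoded once with the original (GAN-trained) SD VAE decoder and once with the consistency decoder, so the two outputs can be compared.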

import torch
from diffusers import StableDiffusionPipeline
from consistencydecoder import ConsistencyDecoder, save_image, load_image

# load the Stable Diffusion pipeline; only its VAE encoder is used here
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.vae.cuda()
decoder_consistency = ConsistencyDecoder(device="cuda:0") # Model size: 2.49 GB

image = load_image("assets/gt1.png", size=(256, 256), center_crop=True)
latent = pipe.vae.encode(image.half().cuda()).latent_dist.mean

# decode with the original SD VAE decoder (GAN-trained)
sample_gan = pipe.vae.decode(latent).sample.detach()
save_image(sample_gan, "gan.png")

# decode with the consistency decoder
sample_consistency = decoder_consistency(latent)
save_image(sample_consistency, "con.png")

3. Evaluation

The evaluation is still worth learning from. As in earlier SD evaluation work (sd_eval), how to evaluate image generation has always been a point worth studying. DALL-E 3's evaluation is split into automatic evaluation and human evaluation.

There are 3 automatic evaluation metrics. 1. CLIP score, where the evaluation set is 4,096 captions selected from COCO 2014. 2. GPT-4V-based evaluation on DrawBench (proposed in Imagen), which contains 200 prompts of different types: the generated image and its prompt are sent to GPT-4V, which judges whether the image and text are consistent; a consistent pair counts as correct. 3. T2I-CompBench, which contains 6,000 compositional text prompts; the color binding / shape binding / texture binding subsets are evaluated and scored with the BLIP-VQA model. All three are difficult, and SDXL does not handle them well.
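
A rough sketch of such a GPT-4V-style consistency check; the actual judging prompt and model used in the report are not given as code, so the model name, prompt wording, and yes/no parsing below are all assumptions.

import base64
from openai import OpenAI

client = OpenAI()

def judge_consistency(image_path: str, prompt: str) -> bool:
    """Ask a vision-capable GPT-4 model whether the generated image matches the prompt."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f'Does this image match the caption "{prompt}"? Answer yes or no.'},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

# accuracy over a prompt set = fraction of generations judged "yes"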

These three metrics may not reflect many image quality issues; intuitively, the images generated by SDXL look better than those of DALL-E 2.

The second part is human evaluation: 1. prompt following: given two images, judge which is more consistent with the text; 2. style: without showing the text, judge which of the two images you prefer; 3. coherence: choose which of the two images contains more plausible, realistic-looking objects.

4. Problems

1. Spatial and positional relationships; 2. text rendering ability; 3. the synthetic captions sometimes hallucinate important details of the image, so the model may, for example, draw the wrong species of plant.