NeurIPS 2023 | FreeMask: Improving segmentation model performance with densely annotated synthetic images


Author: LeolhYang (Source: Zhihu, authorized) | Editor: CVer

https://zhuanlan.zhihu.com/p/663587260


Here we share our NeurIPS 2023 work "FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models". In this work, we generate a large number of synthetic images from semantic segmentation masks, and use these synthetic images and their corresponding masks to improve the performance of semantic segmentation models trained on the full set of real data. For example, on ADE20K we improve Mask2Former-Swin-T from 48.7 to 52.0 mIoU (+3.3).


Code: github.com/LiheYoung/FreeMask

Paper: https://arxiv.org/abs/2310.15160

In the above repo we also provide the processed ADE20K-Synthetic dataset (containing 20x the training images of ADE20K) and the COCO-Synthetic dataset (containing 6x the training images of COCO-Stuff-164K), as well as checkpoints for the stronger Mask2Former, SegFormer, and Segmenter models obtained by training with synthetic data.

TL;DR

  • Different from some previous work that uses synthetic data to improve few-shot performance (using only a small amount of real data), we hope to use synthetic data to directly improve fully supervised performance (using the full set of real data), which is more challenging.

  • We use a semantic image synthesis model to generate diverse synthetic images from semantic masks. However, directly adding these synthetic images to training does not improve the real-image baseline; it actually hurts performance.

  • Therefore, we designed a noise filtering strategy and an image re-sampling strategy to learn from synthetic data more effectively, which ultimately brings gains for various models on ADE20K (20,210 real images) and COCO-Stuff-164K. In addition, we found that with our strategies, training on synthetic data alone can achieve results comparable to training on real data.

Take-home Messages

  • On top of the full set of real data, it is not easy to utilize synthetic data effectively; it requires a sufficiently good generative model and an appropriately designed learning strategy for the synthetic data.

  • In the initial stage, we tried multiple GAN-based models that generate images from masks (e.g., OASIS [1]). Although their FID scores were decent, their transfer performance to real datasets was very poor (transfer performance here means training on the synthetic dataset but testing on the real validation set; the mIoU on ADE20K is only ~30%).

  • Mask-to-image synthesis models based on Stable Diffusion, such as FreestyleNet [2], are a better choice.

  • When the generation quality is relatively high and the filtering strategy is reasonable, jointly training on synthetic and real data is better than pre-training on synthetic data first and then fine-tuning on real data.


Introduction


The synthetic images generated by FreestyleNet from semantic masks are diverse and realistic.

Models such as Stable Diffusion (SD) have achieved very good text-to-image generation results. In the past year, work on semantic image synthesis has also begun to build on SD pre-training to generate images from semantic masks. Among these methods, we found that FreestyleNet [2] generates very good results, as shown in the figure above. Therefore, we hope to pair these synthetic images with the semantic masks they are conditioned on to form new synthetic training samples, and add them to the original real training set to further improve model performance.

Simple failed attempts

We first examined the transfer performance of these synthetic images to real images, i.e., training on synthetic images but testing on the real validation set. A SegFormer-B4 trained on real images achieves a test mIoU of 48.5, but training on synthetic data 20x larger than the real training set only reaches 43.3 mIoU. We also tried mixing real and synthetic data (the real data is upsampled to match the amount of synthetic data, since it is of higher quality), but this only achieves 48.2 mIoU, still behind training on real images alone.

Therefore, we hope to learn from these synthetic data more efficiently.

Motivation

Since the results with synthetic data were not good, we examined the synthetic dataset more carefully and found many incorrectly synthesized regions, such as the red-box areas in the figure below. When added to the training set, these erroneous regions can seriously harm model performance.


The synthesis results in the red boxes are wrong.

  • In addition, different semantic masks correspond to different scenes, and the learning difficulty of these scenes varies, so the number of synthetic training images they require also varies. As shown in the figure below, the difficulty of the scene corresponding to a semantic mask generally increases from left to right. If the same number of synthetic images were generated for every mask, the images corresponding to simple masks would dominate training, and the model would learn very inefficiently.


The difficulty of the scenes corresponding to different semantic masks varies; generally, the difficulty increases from left to right.

Method


Given the above two motivations, our method is very simple.

Filtering Noisy Synthetic Regions

For the first motivation, we designed a noise filtering strategy to ignore incorrectly synthesized regions. Specifically, we use a model trained on real images to compute the pixel-wise loss between each synthetic image and its corresponding semantic mask; intuitively, incorrectly synthesized pixels will incur a relatively large loss. In addition, the magnitude of the loss is also related to the intrinsic difficulty of each category, so the filtering criterion is set in a class-wise manner.

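As a rough sketch of this strategy (our own illustration, not the released code): a segmentation model pre-trained on real images scores every synthetic pixel with a cross-entropy loss, and pixels whose loss far exceeds the mean loss of their class are marked as ignored. The tolerance factor `alpha` below is a hypothetical hyper-parameter; see the paper for the exact criterion.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_noisy_pixels(logits, mask, alpha=1.25, ignore_index=255):
    """Ignore synthetic pixels whose loss is abnormally large.

    logits: (N, C, H, W) predictions of a model pre-trained on real images.
    mask:   (N, H, W) the semantic mask the image was generated from.
    alpha:  hypothetical class-wise tolerance factor (our assumption).
    """
    # pixel-wise loss between the model's prediction and the conditioning mask
    loss = F.cross_entropy(logits, mask, ignore_index=ignore_index,
                           reduction="none")  # (N, H, W)
    filtered = mask.clone()
    for c in mask.unique():
        if c.item() == ignore_index:
            continue
        sel = mask == c
        # the class-wise mean loss accounts for each category's own difficulty
        mean_c = loss[sel].mean()
        # pixels far above their class mean are likely synthesis errors
        filtered[sel & (loss > alpha * mean_c)] = ignore_index
    return filtered
```

The filtered mask can then replace the original mask when the synthetic pair is added to training, so the erroneous regions contribute no gradient.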

Hardness-aware Re-sampling

For the second motivation, we designed a hardness-aware re-sampling strategy to bias our data synthesis and training towards harder scenes (semantic masks), as shown in the figure below.


Produce more synthetic images for harder semantic masks and fewer for simpler masks.

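As a minimal sketch of the allocation (again our own illustration): each mask's hardness is taken as the pixel-weighted average difficulty of the classes it contains, with class difficulty reused from the class-wise mean losses above, and the synthesis budget scales with hardness up to a cap N_max. The paper's exact aggregation and scaling rule may differ.

```python
import numpy as np

def synthesis_budget(masks, class_hardness, n_max=20, ignore_index=255):
    """Allocate more synthetic images to harder semantic masks.

    masks: list of (H, W) integer arrays (the conditioning masks).
    class_hardness: dict mapping class id -> difficulty score, e.g. the
        class-wise mean loss from the filtering step (our assumption).
    n_max: cap on the number of synthetic images per mask.
    """
    hardness = []
    for m in masks:
        classes, counts = np.unique(m[m != ignore_index], return_counts=True)
        scores = np.array([class_hardness.get(int(c), 0.0) for c in classes])
        # pixel-weighted average difficulty of the classes in this mask
        hardness.append(np.average(scores, weights=counts) if len(classes) else 0.0)
    hardness = np.asarray(hardness, dtype=float)
    # rescale so the hardest mask receives n_max images, easier ones fewer
    budget = np.ceil(n_max * hardness / hardness.max()).astype(int)
    return np.clip(budget, 1, n_max)
```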

Learning Paradigms

We explore two paradigms for learning from synthetic images, namely:

  • Pre-training: pre-training with synthetic images, and then further fine-tuning with real images

  • Joint training: mix real and synthetic images (real images are upsampled to the same number as synthetic images) and train on both together

Simply put, we found that joint training performs better when the generation quality is relatively high and the filtering strategy is reasonable.
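A minimal joint-training setup might look like the following (our own sketch; `real_set` and `synthetic_set` are assumed to be ordinary segmentation datasets yielding (image, mask) pairs, with synthetic masks already noise-filtered):

```python
from torch.utils.data import ConcatDataset, DataLoader

# Upsample (repeat) the real set so it roughly matches the synthetic set in
# size, then sample uniformly from the concatenation of the two.
repeat = max(1, len(synthetic_set) // len(real_set))
joint_set = ConcatDataset([real_set] * repeat + [synthetic_set])
loader = DataLoader(joint_set, batch_size=16, shuffle=True, num_workers=4)
```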


Experiment

Comparing the transfer performance of synthetic and real training images on the real validation set


Train with real or synthetic images and test on real validation set

Across a variety of models, training on synthetic images and evaluating on the real validation set achieves results comparable to training on the real training set.

Use synthetic images to further improve the performance of fully supervised segmentation models

  • Joint training on ADE20K


When synthetic data is added, the fully supervised performance with real images improves significantly: for Mask2Former-Swin-T, mIoU increases from 48.7 to 52.0 (+3.3); for SegFormer-B4, from 48.5 to 50.6 (+2.1).

  • Joint training on COCO-Stuff-164K


COCO-Stuff-164K is harder to improve due to its large amount of real data, but we still achieved an improvement of +1.9 mIoU on Mask2Former-Swin-T.

  • Pre-training with synthetic images on ADE20K


Ablation Studies

  • The necessity of our noise filtering and hardness-aware re-sampling


Without filtering and re-sampling, the synthetic images generated by FreestyleNet only achieve transfer performance of 43.3 and 48.0 mIoU on the real validation sets of ADE20K and COCO, far behind the transfer performance of real training images (ADE20K: 48.5, COCO: 50.5). After applying our strategies, the transfer performance of purely synthetic images improves to 48.3 (ADE20K) and 49.3 (COCO), very close to that of real training images.


Under joint training, our two strategies are also very effective. Without them, mixing synthetic and real images only achieves 48.2 mIoU (real images alone: 48.5). With our strategies, the real-image baseline of 48.5 improves to 50.6.

  • Number of synthetic images

N_max controls the maximum number of synthetic images generated from a single mask. Without filtering and re-sampling, increasing the number of synthetic images worsens transfer performance; with filtering and re-sampling, increasing N_max from 6 to 20 brings steady improvements in transfer performance.

For more ablation studies, please refer to our paper.

Conclusion

In this work, we generate synthetic images from semantic masks to form a large number of synthetic training pairs, and significantly improve the performance of various semantic segmentation models under the fully supervised settings of ADE20K and COCO-Stuff-164K.


Reference

  1. Sushko, Vadim, et al. "You Only Need Adversarial Supervision for Semantic Image Synthesis." ICLR 2021. https://arxiv.org/abs/2012.04781

  2. Xue, Han, et al. "Freestyle Layout-to-Image Synthesis." CVPR 2023. https://arxiv.org/abs/2303.14412
