New vision task! ReVersion: Relation customization in image generation

New task: Relation Inversion

This year, diffusion models and related personalization methods such as DreamBooth, Textual Inversion, and Custom Diffusion have become increasingly popular. These methods extract the concept of a specific object from images and add it to a pre-trained text-to-image diffusion model, so that people can generate customized images of objects they care about, such as specific anime characters, sculptures at home, or water cups.

Existing customization methods mainly focus on capturing the appearance of objects. However, besides appearance, the visual world has another important pillar: the intricate relations between objects. To our knowledge, no existing work has explored how to extract a specific relation from images and apply it to generation. To this end, we propose a new task: Relation Inversion.


As shown in the figure above, we are given several reference images that share a common relation, such as "object A is contained inside object B". The goal of Relation Inversion is to find a relation prompt that describes this interaction, and then use it to generate new scenes in which the objects interact according to the same relation, such as Spider-Man placed inside a basket.


ReVersion Framework

As a first attempt to address the Relation Inversion problem, we propose the ReVersion framework:


Compared with the existing Appearance Inversion task, the difficulty of Relation Inversion lies in telling the model that what we want to extract is the relatively abstract concept of a relation, rather than visually salient aspects such as the appearance of the objects.

We propose a relation-focal importance sampling strategy that encourages the model to pay more attention to high-level relations. We also design relation-steering contrastive learning, which steers the learned prompt toward relations rather than the appearance of objects. See the paper for more details.
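To make these two ideas more concrete, here is a minimal sketch in plain Python. It is not the ReVersion implementation. It rests on two assumptions that the text above does not spell out: that importance sampling means drawing diffusion timesteps with a bias toward larger (noisier) timesteps, here through an assumed cosine-shaped weight, and that the contrastive term is an InfoNCE-style loss over word embeddings. The names `timestep_weights`, `sample_timesteps`, `steering_loss`, and the parameters `alpha` and `temperature` are hypothetical and chosen for illustration only.

```python
import math
import random


def timestep_weights(num_steps, alpha=0.5):
    """Unnormalized weights w(t) = 1 - alpha * cos(pi * t / T) for t = 0..T-1.

    The weights grow with t, so larger timesteps are drawn more often.
    The cosine shape and alpha are assumptions made for this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [1.0 - alpha * math.cos(math.pi * t / num_steps)
            for t in range(num_steps)]


def sample_timesteps(num_steps, batch_size, alpha=0.5, rng=None):
    """Draw batch_size timesteps with probability proportional to w(t)."""
    rng = rng or random.Random()
    weights = timestep_weights(num_steps, alpha)
    return rng.choices(range(num_steps), weights=weights, k=batch_size)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steering_loss(anchor, positives, negatives, temperature=0.07):
    """InfoNCE-style loss: low when the anchor embedding is close to the
    positives and far from the negatives.

    In this sketch, positives stand for embeddings of relation-like words
    and negatives for embeddings of appearance-like words (an assumption).
    """
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy usage with 2-D embeddings standing in for real word embeddings.
steps = sample_timesteps(num_steps=1000, batch_size=8, rng=random.Random(0))
loss_near_positive = steering_loss([1.0, 0.0], [[1.0, 0.1]], [[0.0, 1.0]])
loss_near_negative = steering_loss([0.0, 1.0], [[1.0, 0.1]], [[0.0, 1.0]])
```

In an actual training loop, the sampled timesteps would replace uniformly drawn ones in the denoising loss, and the contrastive term would be added to that loss with some weight. Both choices are left open here.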

ReVersion Benchmark

We collect and release the ReVersion Benchmark:

It covers a rich variety of relations, and each relation comes with multiple exemplar images and manually annotated text descriptions. We also provide a large number of inference templates for common relations. You can use these templates to test whether a learned relation prompt is accurate, or combine them to generate interesting interactive scenes.
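As an illustration of how inference templates might be used, the sketch below fills a few templates with entity pairs and a placeholder for the learned relation prompt. The template strings, the `<R>` placeholder token, and the function `build_prompts` are hypothetical and are not taken from the benchmark. The resulting strings would then be passed to a text-to-image model.

```python
RELATION_TOKEN = "<R>"  # hypothetical placeholder for the learned relation prompt


def build_prompts(templates, entity_pairs, relation_token=RELATION_TOKEN):
    """Fill every template with every (A, B) entity pair.

    Each template is a format string with the fields {A}, {R} and {B}.
    Prompts are ordered pair by pair, then template by template.
    """
    prompts = []
    for entity_a, entity_b in entity_pairs:
        for template in templates:
            prompts.append(
                template.format(A=entity_a, R=relation_token, B=entity_b)
            )
    return prompts


# Illustrative templates and entity pairs (not from the benchmark).
templates = ["{A} {R} {B}", "a photo of {A} {R} {B}"]
pairs = [("Spider-Man", "a basket"), ("a cat", "a box")]

prompts = build_prompts(templates, pairs)
for prompt in prompts:
    print(prompt)
```

Keeping the relation token fixed while varying the entities and templates makes it easier to judge whether the prompt encodes the relation itself rather than the appearance of the reference objects.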

Results

  • Rich and diverse relationships

We can invert a rich variety of relations and apply them to new objects.

  • Various backgrounds and styles

The learned relation can also connect objects in the intended way across different styles and background scenes.


  • The same relation with diverse object combinations


