[MVDiffusion] Faithful scene reproduction: a generative model for consistent multi-view images

Article directory

  • MVDiffusion
    • 1. Autoregressive generation of panorama
      • 1.1 Error accumulation
      • 1.2 Large angle change
    • 2. Model structure
      • 2.1 Multi-view latent diffusion model (multi-view LDM)
        • 2.1.1 Text-conditioned generation model
        • 2.1.2 Image & text-conditioned generation model
        • 2.1.3 Additional convolutional layers
      • 2.2 Correspondence-aware Attention(CAA)
    • 3. Train stage
      • 3.1 Panorama generation task
      • 3.2 Multi-view image generation task
    • 4. Application scenarios
    • 5. Code part


Paper link: https://arxiv.org/pdf/2307.01097.pdf

Project address: https://huggingface.co/spaces/tangshitao/MVDiffusion

Code repository: https://github.com/Tangshitao/MVDiffusion

The goal of MVDiffusion is to generate multi-view images with highly consistent content and unified global semantics. The core ideas of its method are simultaneous denoising and global awareness based on the correspondence between images.

1. Autoregressive generation of panorama

The autoregressive generation process builds the panorama step by step: the n-th image is generated from the (n-1)-th image through image warping and inpainting. This approach accumulates errors and cannot handle loop closure, i.e. nothing guarantees that the last view matches the first one.

1.1 Error accumulation

Building a panorama autoregressively:

  1. Generate view A from the text description
  2. Warp view A and inpaint the missing regions to generate view B
  3. Take the previously generated view as the reference and repeat step 2

For example, suppose view A contains a table. When generating B, part of the table may be moved to another location to simulate seeing it from a different angle. This process continues for views C, D, E, and so on, each time depending on the previous view. If the deformation of the table in view A is inaccurate, this error accumulates when generating views B, C, D, etc., so the final multi-view images may look unrealistic or incoherent. This is the so-called “error accumulation” problem.
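A toy sketch of why the errors add up, under assumed numbers: each autoregressive step is meant to rotate the camera by 45 degrees, but the warp is off by a small hypothetical amount (0.5 degrees here). Nothing below comes from the MVDiffusion code; it only illustrates the drift and the resulting loop-closure gap.

```python
def autoregressive_yaws(n_steps, step_deg, per_step_error_deg):
    """Yaw of each view when every view is derived from the previous one."""
    yaws = [0.0]  # view A is generated from text only, so it carries no warp error
    for _ in range(n_steps):
        # each new view inherits the pose of the previous view, including its error
        yaws.append(yaws[-1] + step_deg + per_step_error_deg)
    return yaws


# Hypothetical numbers: 8 steps of 45 degrees, each off by 0.5 degrees.
yaws = autoregressive_yaws(n_steps=8, step_deg=45.0, per_step_error_deg=0.5)

# After 8 steps the camera should be back at 360 degrees, i.e. at view A.
loop_closure_gap = yaws[-1] - 360.0
print(loop_closure_gap)  # 4.0
```

The per-step error is small, but the last view ends up 4 degrees away from where it has to meet the first view, and an autoregressive pipeline has no step that corrects this.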

1.2 Large angle change

Autoregressive methods may encounter difficulties if the angles between views vary widely, such as from one side of a room to the other.

Because it relies on the previous view, correctly generating a view from one angle to another requires handling complex angle changes and changes in background content, which can lead to inaccurate generated results.

2. Model structure

The panoramic image is composed of 8 overlapping perspective images. Between each pair of adjacent images, the correspondence between pixels is determined through a 3×3 homography matrix. In a panoramic image, different perspective images need to be matched at the pixel level through this matrix so that they can be correctly spliced together to form a continuous panorama.

In the paper, the horizontal field of view of each perspective image is 90 degrees. With 8 views covering 360 degrees, consecutive views are 45 degrees apart, which means there is 45 degrees of overlap between each two consecutive images. This kind of setup is typical for panoramic image stitching: each image captures a wide horizontal field of view, and the overlap ensures that neighboring images align correctly when stitched.
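The 3×3 homography between two views that share a camera center can be written as H = K R K⁻¹, where K is the intrinsic matrix and R the relative rotation. The sketch below works this out for the numbers above (90 degree field of view, 45 degree yaw between neighbors, 512×512 images). The function names and the sign convention of the rotation are assumptions made for illustration; they are not taken from the official code.

```python
import math
import numpy as np


def intrinsics(width, height, hfov_deg):
    """Pinhole intrinsics for a given horizontal field of view."""
    f = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return np.array([[f, 0.0, width / 2],
                     [0.0, f, height / 2],
                     [0.0, 0.0, 1.0]])


def yaw_rotation(deg):
    """Rotation about the vertical (y) axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


def homography(K, R):
    """Pixel mapping between two views that differ by a pure rotation R."""
    return K @ R @ np.linalg.inv(K)


def warp_pixel(H, x, y):
    """Apply H to pixel (x, y) in homogeneous coordinates."""
    p = H @ np.array([x, y, 1.0])
    return p[0] / p[2], p[1] / p[2]


K = intrinsics(512, 512, 90.0)
H = homography(K, yaw_rotation(45.0))

# The center of one view lands on the vertical edge of its neighbor.
x2, y2 = warp_pixel(H, 256.0, 256.0)
print(round(x2, 3), round(y2, 3))  # approximately 512.0 256.0
```

The image center mapping to the border of the neighboring view is exactly the 45 degree overlap described above: half of each image is shared with the next one.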

The model rests on two design choices:

  1. The latent variables of all multi-view images are denoised simultaneously (i.e. in parallel) through a shared UNet.
  2. A new Correspondence-Aware Attention (CAA) block is inserted into the UNet to learn cross-view geometric consistency and achieve global awareness.

2.1 Multi-view latent diffusion model (multi-view LDM)

MVDiffusion comes in two model variants. The text-conditioned generation model is conditioned on text prompts; the image & text-conditioned generation model is conditioned on text prompts and one or two additional source images.

Panorama generation

Depending on the type of user input, only one of the two models will be used.

  • If only text is provided, the text-conditioned model generates all eight views in parallel.
  • If a source view is also provided, the image & text-conditioned model takes both the image and the text into account and generates the remaining seven views.

Multi-view depth-to-image generation

The text-conditioned generation model first subsamples the set of depth maps into a sparse set of “keyframes” and generates texture images for them. The image & text-conditioned model then acts as an interpolation model: it takes two consecutive keyframe images as additional conditions and generates the images in between. These generated images should be consistent with the depth maps and the two conditional images, while also being aligned with the text prompts.
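The keyframe-then-interpolate schedule can be sketched as plain index bookkeeping. The `stride` parameter and the rule of always keeping the last frame are assumptions made for this example; the actual sampling used by MVDiffusion may differ.

```python
def split_keyframes(num_frames, stride):
    """Return (keyframe indices, {(k0, k1): [in-between frame indices]})."""
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)  # assumed: always keep the last frame
    gaps = {}
    for k0, k1 in zip(keyframes, keyframes[1:]):
        # frames strictly between two consecutive keyframes
        gaps[(k0, k1)] = list(range(k0 + 1, k1))
    return keyframes, gaps


keyframes, gaps = split_keyframes(num_frames=10, stride=4)
print(keyframes)     # [0, 4, 8, 9]
print(gaps[(0, 4)])  # [1, 2, 3]
```

The text-conditioned model would generate the frames listed in `keyframes`, and the image & text-conditioned model would then fill each gap, conditioned on the pair of keyframes that encloses it.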

2.1.1 Text-conditioned generation model

For the panoramic image generation task, the image resolution is 512×512, while for the multi-view depth-to-image task, the image resolution is 192×256.

Panorama generation

Latent representations of the multiple images are initialized using independent Gaussian noise. In each denoising step, these noisy latent representations are fed into a multi-branch UNet that denoises all multi-view latent representations simultaneously.

Generating a panoramic image requires synthesizing multiple views simultaneously to cover the entire panorama. The 512 × 512 resolution keeps the resulting images sharp in detail rather than pixelated.

Usage of CAA block:
SD UNet includes multiple downsampling and upsampling blocks, each accompanied by a CAA block to enforce multi-view consistency. The role of the CAA block is to ensure that the generated multi-view images are consistent in appearance and geometry.

CAA block initialization:
The final linear layer of the CAA block is initialized to zero, as recommended by ControlNet, to ensure that modifications do not destroy the original functionality of the SD model.
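A minimal NumPy sketch of why the zero initialization is harmless, assuming the inserted block is added to the features as a residual branch (an assumption in the spirit of ControlNet; the block below is a stand-in, not the real CAA block):

```python
import numpy as np

rng = np.random.default_rng(0)


def residual_block(features, w_hidden, w_out):
    """Stand-in for an inserted block: features + linear(relu(linear(features)))."""
    hidden = np.maximum(features @ w_hidden, 0.0)
    return features + hidden @ w_out


features = rng.standard_normal((16, 8))  # 16 tokens with 8 channels each
w_hidden = rng.standard_normal((8, 8))   # inner layer, randomly initialized
w_out = np.zeros((8, 8))                 # final linear layer initialized to zero

out = residual_block(features, w_hidden, w_out)
print(np.array_equal(out, features))  # True
```

At initialization the block is an identity mapping, so the pre-trained SD UNet behaves exactly as before; the new branch only starts to contribute once training moves `w_out` away from zero.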

Multi-view depth-to-image generation

This task generates multiple views from depth information, guided by the text prompts. Compared with panoramic images, it can work with a lower image resolution because the generated images do not need to cover an entire panorama.

2.1.2 Image & text-conditioned generation model

Panorama generation

In this task, the image & text conditional generative model aims to generate a complete 360-degree panoramic view (seven target images) based on a single perspective image (one conditional image) and text cues for each perspective.

When generating panoramic images, the model uses the conditional image as a reference so that content stays consistent when images are generated from new perspectives.

To do this, a mask consisting of ones is concatenated to the channels of the conditional image: every mask value is 1, while the original pixel values of the conditional image remain unchanged (comparable to adding an alpha channel). This marks the content of the conditional image as known, so that it is preserved in the generated panorama.

In the UNet branch of each target image, we concatenate a black image consisting of zero-valued pixels with a mask consisting of zeros as input, so the inpainting model is required to generate a brand-new image based on the text condition and the correspondence with the conditional image.
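The two kinds of branch input can be sketched with NumPy. This example works on raw (3, H, W) arrays for readability; whether the concatenation happens in pixel space or latent space is not stated here, so treat the shapes and the helper name `inpainting_condition` as assumptions.

```python
import numpy as np


def inpainting_condition(image, is_condition_view):
    """image: (3, H, W) array. Returns a (4, H, W) array: image channels + mask."""
    _, h, w = image.shape
    if is_condition_view:
        rgb = image                                    # pixel values stay unchanged
        mask = np.ones((1, h, w), dtype=image.dtype)   # mask of ones: content is known
    else:
        rgb = np.zeros_like(image)                     # black image
        mask = np.zeros((1, h, w), dtype=image.dtype)  # mask of zeros: to be generated
    return np.concatenate([rgb, mask], axis=0)


source = np.random.default_rng(0).random((3, 64, 64)).astype(np.float32)
cond_in = inpainting_condition(source, is_condition_view=True)
target_in = inpainting_condition(source, is_condition_view=False)
print(cond_in.shape, target_in.shape)  # (4, 64, 64) (4, 64, 64)
```

Both branches receive inputs of the same shape, so the same UNet can process them; only the mask value tells the model whether to reproduce the given content or to generate new content.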

Multi-view depth-to-image generation

The goal in this task is to densify the sequence: images are generated between the keyframes produced by the text-conditioned generation model, with a pair of keyframe images as an additional condition. Since SD’s image inpainting model does not support depth conditions, a different approach is needed here.

Inspired by VideoLDM, we reuse a depth-conditioned UNet with CAA blocks from a text-conditioned generative model that has been trained to generate sparse keyframe images given a depth map and camera pose. Additional convolutional layers are inserted to inject information from the two conditional images.

2.1.3 Additional convolutional layers

Conditional image branch (two conditional images)
First, concatenate the condition image itself with a mask consisting of 1’s (4 channels in total). Then, zero convolution operations are used to downsample the concatenated image to match the feature map size of the UNet block. The downsampled conditional image is added to the input of the UNet block. The purpose of this process is to train the model so that when the mask is 1, the branch can regenerate the condition image; and when the mask is 0, the branch can generate the target image in between. This method enables the model to perform different generation tasks based on different values of the mask through the operation of convolutional layers.

Target image branch
For the target image branch, a black image consisting of zero-valued pixels is concatenated with a mask consisting of zeros, and then the same zero-convolution operation is used to downsample it to match the feature map size of the UNet block. The downsampled input is added to the input of the UNet block. The purpose of this process is the same as for the conditional image branch: different generation tasks are performed according to the value of the mask. When the mask is 1, the conditional image is regenerated, and when it is 0, the target image is generated.

A zero convolution is a convolution layer whose weights and bias are initialized to zero, as introduced by ControlNet, so the added branch has no effect on the UNet at the start of training. Here these layers also reduce the resolution of the concatenated input to match the feature map size expected by the UNet block.

2.2 Correspondence-aware Attention(CAA)

The correspondence-aware attention (CAA) mechanism is the key to how MVDiffusion enforces correspondence constraints between multi-view feature maps.

I won’t explain too much about the relevant principles here, just read the original text. Here is a simple and easy-to-understand example.

Assume that a panoramic image is being generated, which includes 8 perspective images. For CAA, we focus on one pixel location $s$ in the source feature map, and we want to calculate the attention between this pixel and a target feature map (that of one of the 8 perspective images).

  1. First, we select a target feature map $F^l$ and then select the pixel position $t^l$ in the target image that corresponds to $s$.
  2. A local neighborhood is formed near this location. Assuming $k = 3$, we consider the 3×3 neighborhood centered at $t^l$.
  3. For the pixel position $s$ in the source feature map, we calculate a message $m$. This message lets the information in the source feature map interact with $t^l$ and its surrounding pixels in the target feature map.
  4. The message $m$ helps us determine how the process of generating the target image is affected by the corresponding location in the source image.
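The steps above can be sketched as a small attention computation in NumPy. This is a simplified illustration, not the official implementation: it assumes $t^l$ falls on an integer pixel, leaves out the positional encoding of the pixel offsets, clamps the neighborhood at the image border, and uses the usual scaled dot-product softmax. The projection matrices `w_q`, `w_k`, `w_v` and the function name are hypothetical.

```python
import numpy as np


def caa_message(src_feat, tgt_map, t, w_q, w_k, w_v, k=3):
    """Message for one source pixel s from one target feature map.

    src_feat: (C,) feature at the source pixel s
    tgt_map:  (H, W, C) target feature map
    t:        (row, col) position in the target map corresponding to s
    """
    r = k // 2
    h, w, _ = tgt_map.shape
    # k x k neighborhood around t, clamped at the border (assumption)
    rows = np.clip(np.arange(t[0] - r, t[0] + r + 1), 0, h - 1)
    cols = np.clip(np.arange(t[1] - r, t[1] + r + 1), 0, w - 1)
    neigh = tgt_map[np.ix_(rows, cols)].reshape(k * k, -1)  # (k*k, C)

    q = src_feat @ w_q   # query from the source pixel
    keys = neigh @ w_k   # keys from the target neighborhood
    vals = neigh @ w_v   # values from the target neighborhood

    scores = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over the k*k neighbors
    return weights @ vals     # (C,) message m


rng = np.random.default_rng(0)
C = 8
tgt_map = rng.standard_normal((16, 16, C))
src_feat = rng.standard_normal(C)
w_q, w_k, w_v = (rng.standard_normal((C, C)) for _ in range(3))

m = caa_message(src_feat, tgt_map, (5, 7), w_q, w_k, w_v)
print(m.shape)  # (8,)
```

In the full model this would be repeated for every source pixel and for every target view that overlaps the source view, with the resulting messages combined; the sketch covers a single pixel and a single target view only.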

3. Train stage





$\epsilon_\theta^i$ represents the estimated noise of the i-th image.

$Z_t^i$ is the noisy latent of the i-th image.

Since the training dataset is much smaller than the pre-training dataset of the SD model, the original SD parameters are kept frozen as much as possible during training to preserve the original generalization ability.
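As a rough sketch of the training objective, assume the usual latent diffusion loss applied to every view: the squared error between the noise that was added to each latent and the noise the UNet estimates for it, summed over the views. The function name and the tensor shapes below are illustrative assumptions, and the noise prediction is faked rather than produced by a network.

```python
import numpy as np


def multi_view_ldm_loss(true_noise, predicted_noise):
    """Sum over views of the squared L2 error between true and estimated noise.

    Both arrays have shape (N, C, H, W), one entry per view.
    """
    diff = (true_noise - predicted_noise).reshape(len(true_noise), -1)
    per_view = (diff ** 2).sum(axis=1)  # squared error of each view
    return per_view.sum()


rng = np.random.default_rng(0)
true_noise = rng.standard_normal((8, 4, 8, 8))  # 8 views, small toy latents

print(multi_view_ldm_loss(true_noise, true_noise))      # 0.0
print(multi_view_ldm_loss(true_noise, true_noise + 1))  # approximately 2048
```

A perfect prediction gives zero loss; an estimate that is off by 1 everywhere gives one unit of squared error per latent element, here 8 × 4 × 8 × 8 = 2048.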

3.1 Panorama generation task

Only the inserted CAA blocks were trained.

3.2 Multi-view image generation task

This task uses a resolution of 192 × 256, whereas the original SD works at 512 × 512, so the SD model must be fine-tuned.

In the first stage, we fine-tune the SD UNet model using all ScanNet data to adjust the resolution. This phase was single-view training without CAA blocks (Eq. 1).
In the second stage, we integrate the CAA block and the image conditioning block into UNet and only train these added parameters.
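Which parameters are trained in each setting can be summarized in a few lines. The parameter name prefixes (`unet.`, `caa.`, `image_cond.`) and the stage labels are hypothetical and only serve the illustration; the real code base may organize its modules differently.

```python
def is_trainable(param_name, stage):
    """Decide whether a parameter is updated in a given training setting."""
    added = param_name.startswith(("caa.", "image_cond."))
    if stage == "panorama":
        # panorama task: only the inserted CAA blocks are trained
        return param_name.startswith("caa.")
    if stage == "depth_stage1":
        # stage 1: fine-tune the SD UNet for the new resolution, no added blocks
        return not added
    if stage == "depth_stage2":
        # stage 2: SD UNet frozen, only the added blocks are trained
        return added
    raise ValueError(f"unknown stage: {stage}")


params = ["unet.down.0.weight", "caa.0.to_q.weight", "image_cond.0.weight"]
for stage in ("panorama", "depth_stage1", "depth_stage2"):
    print(stage, [p for p in params if is_trainable(p, stage)])
# panorama ['caa.0.to_q.weight']
# depth_stage1 ['unet.down.0.weight']
# depth_stage2 ['caa.0.to_q.weight', 'image_cond.0.weight']
```

In a PyTorch training script the same decision would typically be applied by switching off gradients for every parameter where `is_trainable` returns False.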

4. Application scenarios

Panorama generation and multi-view depth-to-image generation.

When trying the demo at https://huggingface.co/spaces/tangshitao/MVDiffusion, it is recommended to use ChatGPT to enrich the text prompt. For example:


This bedroom is a harmonious fusion of classic and contemporary elements. It boasts a spacious reclaimed wood bed frame with a plush upholstered headboard, creating a cozy yet stylish centerpiece. On one side of the bed, there's a sleek modern nightstand with a minimalist lamp, while on the other side, a vintage-inspired dresser adds a touch of timeless charm. The color palette is soothing, with soft neutral tones that promote relaxation. Sunlight filters in through sheer curtains, casting a gentle glow on the room. This bedroom offers a perfect balance between comfort and elegance, making it an inviting space for rest and relaxation.

5. Code part

Since the pre-trained weights hosted on the official Dropbox are difficult to download, I will upload them to a cloud drive and share the link later.

A detailed walkthrough of the CAA code, the highlight of this work, will be added later…