Diffusion model principle + DDPM case code analysis

1. Mathematical basis
- 1.1 General conditional probability form
- 1.2 Markov chain conditional probability form
- 1.3 Prior probability and posterior probability
- 1.4 Reparameterization techniques
- 1.5 KL divergence formula
2. The overall logic of the diffusion model (taking DDPM as an example)
- 2.1 Diffusion process (forward noise addition process)
- 2.2 Reverse process (reverse denoising process)
3. Training process and sampling process
- 3.1 Training process
- 3.2 Sampling process
- 3.3 Some details of model training
  - 3.3.1 Network selection
  - 3.3.2 Selection of some hyperparameters
4. Implementation of DDPM case code
- 4.1 Dataset preparation
- 4.2 Forward propagation process
- 4.3 Reverse process (model training process)

1. Mathematical basis

Diffusion models differ from typical neural network approaches in machine learning. A typical neural network is built to fit a mapping from the input data to the desired output, and it can be regarded as a black box that produces the results we want once it has been trained. The diffusion model, by contrast, draws heavily on statistics and mathematics; I like to think of it as the product of a perfect combination of mathematics and AI! Because the diffusion model is mathematically much harder than an ordinary deep learning model, it is worth reviewing (or previewing) the relevant mathematics before studying it.

1.1 General conditional probability form

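p(cooling | raining) = 0.9 means that, given that it is raining, the probability of the temperature dropping is 0.9.

$$p(x \mid y) = \frac{p(x, y)}{p(y)}, \qquad p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)$$

$$P(x, y, z) = P(z \mid y, x)\,P(y, x) = P(z \mid y, x)\,P(y \mid x)\,P(x)$$

$$P(y, z \mid x) = \frac{P(x, y, z)}{P(x)} = P(y \mid x)\,P(z \mid x, y)$$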

1.2 Markov chain conditional probability form

A Markov chain is a process in which the probability of the current state depends only on the state at the previous moment, for example A -> B -> C. For a chain x -> y -> z that satisfies the Markov property, we have:

$$P(x, y, z) = P(z \mid y, x)\,P(y, x) = P(z \mid y)\,P(y \mid x)\,P(x)$$

$$P(y, z \mid x) = P(y \mid x)\,P(z \mid y)$$
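
The Markov factorization can be checked numerically on a small example. The sketch below builds a hypothetical two-state chain x -> y -> z (all probability values are made up for illustration) and confirms that P(y, z | x) computed from the joint distribution equals P(y | x) P(z | y).

```python
import numpy as np

# Hypothetical 2-state chain x -> y -> z; the numbers are illustrative only.
p_x = np.array([0.3, 0.7])                 # P(x)
p_y_given_x = np.array([[0.9, 0.1],        # row i holds P(y | x = i)
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.6, 0.4],        # row j holds P(z | y = j)
                        [0.5, 0.5]])

# Joint distribution from the Markov factorization P(x) P(y|x) P(z|y).
joint = p_x[:, None, None] * p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

# P(y, z | x) by definition: P(x, y, z) / P(x).
p_yz_given_x = joint / joint.sum(axis=(1, 2), keepdims=True)

# The same quantity written in Markov form: P(y | x) P(z | y).
markov_form = p_y_given_x[:, :, None] * p_z_given_y[None, :, :]

# Markov property: P(z | x, y) must not depend on x.
p_z_given_xy = joint / joint.sum(axis=2, keepdims=True)

print(np.allclose(p_yz_given_x, markov_form))
print(np.allclose(p_z_given_xy[0], p_z_given_xy[1]))
```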

1.3 Prior probability and posterior probability

Before introducing the prior probability, let’s review the total probability formula. If the events B_1, ..., B_n partition the sample space, then for any event A:

$$P(A) = \sum_{i=1}^{n} P(B_i)\,P(A \mid B_i)$$

The total probability formula expresses the idea of “inferring the effect from the cause”: when the possible causes of an event are known, the probability of the event is obtained by summing over the causes that can produce it.

Prior probability: a probability obtained from past experience and analysis, available before any experiment or sampling is carried out. It usually appears as the “cause” in “inferring the effect from the cause” problems.

Before introducing the posterior probability, let’s review the Bayesian formula. If the events B_1, ..., B_n partition the sample space, then for an observed event A:

$$P(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{j=1}^{n} P(B_j)\,P(A \mid B_j)}$$

The Bayesian formula expresses the idea of “seeking the cause from the effect”: once the outcome of an event is known, we infer the probability that it was produced by a particular cause.

Posterior probability: the probability, given that something has already happened, that it was caused by a particular factor. It is the probability revised after the “result” information has been obtained, and it is the “cause” sought in “seeking the cause from the effect” problems.

Let’s give an example to better understand the prior and posterior probability.

Suppose we have two boxes, red and blue. The red box holds 2 apples and 6 oranges, and the blue box holds 3 apples and 1 orange. (In the original figure, green circles represent apples and orange circles represent oranges.)

In each experiment we randomly pick a fruit from one of the boxes.

The random variable B(box) indicates which box was picked, and P(B=blue) = 0.6 (the probability of the blue box being selected), and P(B=red) = 0.4 (the probability of the red box being selected).

The random variable F (fruit) indicates which fruit is picked, and the values of F are “a (apple)” and “o (orange)”.

Now suppose we already know that the fruit picked in an experiment is an orange. What is the probability that it was picked from the red box? According to the Bayesian formula:

$$P(B=\text{red} \mid F=o) = \frac{P(F=o \mid B=\text{red})\,P(B=\text{red})}{P(F=o)}$$

P(F=o) is calculated with the total probability formula:

$$P(F=o) = P(B=\text{blue})\,P(F=o \mid B=\text{blue}) + P(B=\text{red})\,P(F=o \mid B=\text{red}) = 0.6 \times \frac{1}{4} + 0.4 \times \frac{3}{4} = \frac{9}{20}$$

so that P(B=red | F=o) = (3/4 × 0.4) / (9/20) = 2/3. At the same time, by the addition rule of probability we get:

$$P(B=\text{blue} \mid F=o) = 1 - \frac{2}{3} = \frac{1}{3}$$

In the above calculation, we call P(B=red), or more generally P(B), the prior probability, because P(B) is available before we learn whether F is “a” or “o”.
In the same way, P(B=red|F=o) and P(B=blue|F=o) are called posterior probabilities, because they can only be computed after a complete experiment has revealed the specific value of F.
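
The box example can be reproduced with exact fractions. This is a small sketch; the dictionary names are chosen here for illustration, and the numbers are the ones given in the example (prior 0.4 / 0.6, orange fractions 6/8 and 1/4).

```python
from fractions import Fraction

# Prior P(B) and likelihood P(F = orange | B), taken from the example above.
prior = {"red": Fraction(4, 10), "blue": Fraction(6, 10)}
p_orange_given_box = {"red": Fraction(6, 8), "blue": Fraction(1, 4)}

# Total probability formula: P(F = o) = sum over boxes of P(B) P(F = o | B).
p_orange = sum(prior[b] * p_orange_given_box[b] for b in prior)

# Bayesian formula: P(B | F = o) = P(B) P(F = o | B) / P(F = o).
posterior = {b: prior[b] * p_orange_given_box[b] / p_orange for b in prior}

print(p_orange)           # 9/20
print(posterior["red"])   # 2/3
print(posterior["blue"])  # 1/3
```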

1.4 Reparameterization techniques

If you want to sample from the Gaussian distribution N(μ, σ²), you can first sample z from the standard normal distribution N(0, I) and then compute σ ⊙ z + μ. The advantage is that all the randomness is moved into z, which is treated as a constant during differentiation, while μ and σ enter only through an affine transformation and can be outputs of the network, so gradients can flow through them.
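
A minimal sketch of the reparameterization idea, using NumPy only. The function name reparam_sample and the values of mu and sigma are assumptions made for this illustration. It checks that mu + sigma * z has the intended mean and standard deviation, and that with z held fixed the sample is an ordinary differentiable function of mu and sigma (derivatives 1 and z, estimated here by finite differences).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5  # illustrative parameters

def reparam_sample(mu, sigma, z):
    """Map standard-normal noise z to a sample of N(mu, sigma^2)."""
    return mu + sigma * z

# All randomness lives in z; the mapping itself is deterministic.
z = rng.standard_normal(100_000)
x = reparam_sample(mu, sigma, z)
sample_mean, sample_std = x.mean(), x.std()

# With z fixed, x is differentiable in mu and sigma:
# dx/dmu = 1 and dx/dsigma = z (central finite differences).
h = 1e-6
z0 = z[0]
dx_dmu = (reparam_sample(mu + h, sigma, z0) - reparam_sample(mu - h, sigma, z0)) / (2 * h)
dx_dsigma = (reparam_sample(mu, sigma + h, z0) - reparam_sample(mu, sigma - h, z0)) / (2 * h)

print(sample_mean, sample_std)
print(dx_dmu, dx_dsigma, z0)
```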

1.5 KL divergence formula

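For two univariate Gaussian distributions p = N(μ₁, σ₁²) and q = N(μ₂, σ₂²), their KL divergence is:

$$KL(p, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$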
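
As a sanity check, the closed form for the KL divergence between two univariate Gaussians, log(σ₂/σ₁) + (σ₁² + (μ₁ - μ₂)²) / (2σ₂²) - 1/2, can be compared with a direct numerical integration of p(x) log(p(x)/q(x)). The function names and the test values below are assumptions chosen for this sketch.

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2) - 0.5

def kl_numeric(mu1, sigma1, mu2, sigma2, n=200_001):
    """Riemann-sum approximation of the integral of p(x) * log(p(x) / q(x))."""
    x = np.linspace(mu1 - 12 * sigma1, mu1 + 12 * sigma1, n)
    log_p = -0.5 * ((x - mu1) / sigma1) ** 2 - np.log(sigma1 * np.sqrt(2 * np.pi))
    log_q = -0.5 * ((x - mu2) / sigma2) ** 2 - np.log(sigma2 * np.sqrt(2 * np.pi))
    dx = x[1] - x[0]
    return np.sum(np.exp(log_p) * (log_p - log_q)) * dx

closed = kl_gauss(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(closed, numeric)
```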

2. The overall logic of the diffusion model (taking DDPM as an example)

As shown in the figure, the DDPM model is mainly divided into two processes: the forward noise addition process (from right to left) and the reverse denoising process (from left to right). The noise addition process gradually adds Gaussian noise to the real pictures of the dataset, and the denoising process gradually removes noise from the noisy pictures so as to restore the real pictures. The noise addition process follows a fixed mathematical rule, while the denoising process is learned by a neural network. In this way, the neural network can generate realistic pictures from random noise pictures.

2.1 Diffusion process (forward noise addition process)

Here, the forward noise addition process is a Markov chain. By adding noise continuously, the original picture becomes a completely chaotic picture, which can be regarded as a randomly generated noise picture.

Diffusion in thermodynamics refers to the movement of fine particles from a high-density region to a low-density region. In statistics, diffusion refers to the process of converting a complex distribution into a simple one. The diffusion model works because of one key property: stationarity. A probability distribution that evolves over time under a Markov chain tends toward a stationary distribution (such as a Gaussian distribution); as long as the chain runs long enough, the distribution approaches this stationary distribution.

The transition probability of each step of the Markov chain is essentially adding noise. This is the origin of “diffusion” in the diffusion model: noise gradually enters the diffusion system during the evolution of the Markov chain. As time goes by, the added noise (the added solute) becomes less and less, and the noise in the system (all the solutes before this moment) gradually diffuses in the diffusion system until it is uniform.

The diffusion model defines a probability distribution transformation 𝒯 (note: this is not the T in “t ∈ {1, 2, 3, ..., T}”) that transforms the complex distribution q_complex of the original data x_0 into a simple prior distribution p_prior with known parameters:

$$x_0 \sim q_{complex} \;\Longrightarrow\; \mathcal{T}(x_0) \sim p_{prior}$$

Specifically, the diffusion model proposes to construct 𝒯 with a Markov chain, that is, to define a series of conditional probability distributions q(x_t | x_{t-1}), t ∈ {1, 2, 3, ..., T}, that convert x_0 into x_1, x_2, x_3, ..., x_T in turn, in the hope that when T is large enough:

$$q(x_T) \approx p_{prior}$$

For simplicity and effectiveness, p_prior is chosen to be a Gaussian distribution, so the entire forward diffusion process can be regarded as continuously adding a small amount of Gaussian noise to the sample over T steps.
Specifically, at each step of the Markov chain we add Gaussian noise with variance β_t to x_{t-1}, generating a new latent variable x_t with distribution q(x_t | x_{t-1}). This diffusion process can be expressed as follows:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \mu_t = \sqrt{1 - \beta_t}\,x_{t-1},\ \Sigma_t = \beta_t I\big)$$

Since we are in the multidimensional case, I is the identity matrix, indicating that every dimension has the same variance β_t. Note that q(x_t | x_{t-1}) is a normal distribution with mean μ_t = sqrt(1 - β_t) x_{t-1} and covariance Σ_t = β_t I, which is a diagonal matrix.

Thus, we can go from the input x_0 to x_T in a tractable way. Mathematically, this posterior probability is defined as follows:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

where x_{1:T} means that we repeatedly apply q(x_t | x_{t-1}) from time 1 to T.

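Applying q step by step like this is too cumbersome. Let α_t = 1 - β_t and ᾱ_t = α_1 α_2 ... α_t. Using the reparameterization technique, you can get:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1 - \alpha_t}\,z_{t-1} = \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\,\bar{z}_{t-2} = \dots = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,z, \qquad z \sim \mathcal{N}(0, I)$$

β_t keeps increasing (from 0.0001 to 0.02 in the paper), so α_t gets smaller and ᾱ_t shrinks toward 0. Hence the further the forward process goes, the greater the weight of the noise. z is noise that obeys the Gaussian distribution, and as t approaches positive infinity x_t becomes equivalent to an isotropic Gaussian distribution.
In this way, we can directly get x_t at any time t.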
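
The claim that x_t can be obtained in one shot can be checked empirically. The sketch below assumes the linear schedule from the DDPM paper (T = 1000, β from 0.0001 to 0.02) and a hypothetical dataset in which every sample equals 2.0; it compares t steps of iterative noising with the closed form sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) z by looking at the sample mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (DDPM paper values)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)      # alphas_bar[t-1] holds alpha_bar_t

x0 = np.full(50_000, 2.0)            # hypothetical data: every sample is 2.0
t = 300                              # number of forward steps to apply

# Iterative noising: apply q(x_s | x_{s-1}) for s = 1, ..., t.
x_iter = x0.copy()
for s in range(t):
    x_iter = np.sqrt(alphas[s]) * x_iter + np.sqrt(betas[s]) * rng.standard_normal(x_iter.shape)

# Closed form: jump straight from x_0 to x_t.
ab_t = alphas_bar[t - 1]
x_direct = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * rng.standard_normal(x0.shape)

print(x_iter.mean(), x_direct.mean(), np.sqrt(ab_t) * 2.0)
print(x_iter.std(), x_direct.std(), np.sqrt(1.0 - ab_t))
```

Both routes should agree with the theoretical mean sqrt(ᾱ_t) x_0 and standard deviation sqrt(1 - ᾱ_t) up to sampling error.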

2.2 Reverse process (reverse denoising process)

The reverse process of the diffusion model, in contrast to the forward noise addition process, continuously removes noise from the image. Unfortunately, q(x_t | x_{t-1}) is known but q(x_{t-1} | x_t) is not. However, related studies have shown that the reversal of a continuous diffusion process has the same distribution form as the forward process. That is, when the diffusion rate β_t is small enough and the number of diffusion steps is large enough, the discrete diffusion process is close to a continuous one, and q(x_{t-1} | x_t) has the same distribution form as q(x_t | x_{t-1}), which is also a Gaussian distribution.
Nevertheless, we still cannot obtain q(x_{t-1} | x_t) directly, so we need to learn a network model p_θ(x_{t-1} | x_t) that fits q(x_{t-1} | x_t):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

The variance is not learned in DDPM; it is fixed to β_t.
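Thus, the Gaussian posterior probability in the reverse process, conditioned on x_0, is defined as:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\big)$$

Using the Bayesian formula we get:

$$q(x_{t-1} \mid x_t, x_0) = q(x_t \mid x_{t-1}, x_0)\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

Use the Gaussian density formula for each of the three factors, with α_t = 1 - β_t and ᾱ_t = α_1 α_2 ... α_t:

$$\mathcal{N}(x;\ \mu,\ \sigma^2 I) \propto \exp\left(-\frac{\lVert x - \mu \rVert^2}{2\sigma^2}\right)$$

Put together the above results obtained by the Bayesian formula into the form of a Gaussian probability density:

$$\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &\propto \exp\left(-\frac{1}{2}\left(\frac{(x_t - \sqrt{\alpha_t}\,x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}\,x_0)^2}{1 - \bar{\alpha}_t}\right)\right) \\
&= \exp\left(-\frac{1}{2}\left(\left(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}\right)x_{t-1}^2 - \left(\frac{2\sqrt{\alpha_t}}{\beta_t}\,x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\,x_0\right)x_{t-1} + C(x_t, x_0)\right)\right)
\end{aligned}$$

Therefore, the variance and mean of the Gaussian q(x_{t-1} | x_t, x_0) can be expressed as:

$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t, \qquad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0$$

Replace x_0 using x_t, that is, substitute x_0 = (x_t - sqrt(1 - ᾱ_t) z_t) / sqrt(ᾱ_t), to get:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,z_t\right)$$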
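
The substitution step is easy to get wrong by hand, so here is a small numerical check. It assumes the standard DDPM posterior mean in terms of (x_t, x_0), namely sqrt(α_t)(1 - ᾱ_{t-1})/(1 - ᾱ_t) x_t + sqrt(ᾱ_{t-1}) β_t/(1 - ᾱ_t) x_0, and verifies that it equals (x_t - β_t/sqrt(1 - ᾱ_t) z)/sqrt(α_t) when x_t is built from x_0 and z. The schedule and the step index t = 500 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 500  # 1-based step index (must be >= 2 so that alpha_bar_{t-1} exists)
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alphas_bar[t - 1], alphas_bar[t - 2]

x0 = rng.standard_normal(5)
z = rng.standard_normal(5)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * z   # forward closed form

# Posterior mean written with x_t and x_0.
mean_x0_form = (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t \
    + (np.sqrt(ab_prev) * b_t / (1.0 - ab_t)) * x0

# Posterior mean written with x_t and the noise z.
mean_noise_form = (x_t - b_t / np.sqrt(1.0 - ab_t) * z) / np.sqrt(a_t)

# Posterior variance, which is always smaller than beta_t.
beta_tilde = (1.0 - ab_prev) / (1.0 - ab_t) * b_t

print(np.allclose(mean_x0_form, mean_noise_form), beta_tilde, b_t)
```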

At this point, our goal in the reverse process is to shrink the distance between the following two Gaussian distributions, which can be measured by their KL divergence; the mean and variance of q(x_{t-1} | x_t, x_0) are known:

$$D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\big)$$

This is the loss function with which we train the network.

3. Training process and sampling process

Let’s go over the entire process of the diffusion model again.

Forward process (q process): starting from x_0, noise is added to obtain x_t, which is simply a noisy picture. More noise is added gradually until, at x_T, the picture has become pure noise.
Reverse process (p process): starting from a completely chaotic noise picture, the noise that was added is removed step by step, making the picture less chaotic and closer to a real picture, until the initial picture is recovered.

The forward process is a Markov chain that adds noise, and it is carried out entirely through fixed calculations. The key problem is how to predict the noise in the reverse process. This cannot be worked out by hand, so we need the help of a neural network.

3.1 Training process

We have no way to obtain q(x_{t-1} | x_t) in the reverse denoising process, so we define a model p_θ(x_{t-1} | x_t) to approximate it. During training we can use the posterior q(x_{t-1} | x_t, x_0) to optimize p_θ (that is, compute the loss and keep training).
So how do we optimize p_θ? In other words, how do we train the model to predict a reliable mean and variance for this distribution?
We can maximize the log-likelihood of the model’s predicted distribution, that is, minimize the cross-entropy between the true distribution and the predicted distribution P_θ(x_0) under x_0 ~ q(x_0):

$$L = \mathbb{E}_{q(x_0)}\big[-\log p_\theta(x_0)\big]$$

The negative log-likelihood is optimized through a variational lower bound, using the fact that the KL divergence is non-negative:

$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T} \mid x_0)\ \|\ p_\theta(x_{1:T} \mid x_0)\big) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right]$$

In the above formula, q (x₀) is the real data distribution, and P_θ (x₀) is the model.

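To minimize this loss, we can instead minimize its upper bound L_VLB:

$$L_{VLB} = \mathbb{E}_{q(x_{0:T})}\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = L_T + L_{T-1} + \dots + L_0$$

$$L_T = D_{KL}\big(q(x_T \mid x_0)\ \|\ p_\theta(x_T)\big), \qquad L_t = D_{KL}\big(q(x_t \mid x_{t+1}, x_0)\ \|\ p_\theta(x_t \mid x_{t+1})\big)\ (1 \le t \le T-1), \qquad L_0 = -\log p_\theta(x_0 \mid x_1)$$

Since the forward q has no learnable parameters and x_T is pure Gaussian noise, L_T is a constant and can be ignored. So we only need to study L_0 and L_t (after shifting the index, the terms written with t and with t-1 mean the same thing).
L_t is the KL divergence between the two Gaussian distributions q(x_{t-1} | x_t, x_0) and p_θ(x_{t-1} | x_t), which can be computed with the KL divergence formula for multivariate Gaussian distributions:

$$L_t = \mathbb{E}_{x_0, z}\left[\frac{1}{2\,\lVert \Sigma_\theta(x_t, t) \rVert_2^2}\,\big\lVert \tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t) \big\rVert^2\right]$$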

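Take the formula obtained earlier, and parameterize μ_θ in the same way:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,z_t\right), \qquad \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,z_\theta(x_t, t)\right)$$

Substitute:

$$L_t = \mathbb{E}_{x_0, z}\left[\frac{\beta_t^2}{2\,\alpha_t\,(1 - \bar{\alpha}_t)\,\lVert \Sigma_\theta \rVert_2^2}\,\big\lVert z_t - z_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,z_t,\ t\big) \big\rVert^2\right]$$

We can see that the core of diffusion model training is to minimize the mean squared error (MSE) between the real noise z_t and the predicted noise z_θ. DDPM (Ho et al. 2020) uses a simplified loss without the weight term, which makes training more stable:

$$L_t^{simple} = \mathbb{E}_{t, x_0, z_t}\left[\big\lVert z_t - z_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,z_t,\ t\big) \big\rVert^2\right], \qquad L_{simple} = L_t^{simple} + C$$

where C is a constant.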
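
To get a feel for the weight term that the simplified loss drops, the sketch below evaluates β_t² / (2 σ_t² α_t (1 - ᾱ_t)) over all steps. It assumes the linear schedule of the DDPM paper and the fixed variance σ_t² = β_t mentioned above; other variance choices would give different numbers.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule (DDPM paper values)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

sigma2 = betas                       # assumption: fixed variance sigma_t^2 = beta_t
weights = betas**2 / (2.0 * sigma2 * alphas * (1.0 - alphas_bar))

# Weight at the first, middle and last step.
print(weights[0], weights[T // 2], weights[-1])
print("largest weight at step", int(weights.argmax()) + 1)
```

Under these assumptions the weight is largest at the first step and far smaller at late steps, so removing it shifts emphasis away from the small-t terms.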
For L_0:

$$L_0 = -\log p_\theta(x_0 \mid x_1)$$

because:

$$p_\theta(x_0 \mid x_1) = \mathcal{N}\big(x_0;\ \mu_\theta(x_1, 1),\ \sigma_1^2 I\big)$$

In fact, L_0 is the expected negative log-likelihood of a multivariate Gaussian distribution, that is, its entropy (d is the dimension and Σ = σ_1² I):

$$H = \frac{d}{2}\,(1 + \log 2\pi) + \frac{1}{2}\log\lvert\Sigma\rvert$$

The entropy of a multivariate Gaussian distribution depends only on its covariance, that is, L_0 depends only on σ_1² I, so L_0 is a constant.

In summary, the training process of the diffusion model (DDPM) can be seen as minimizing the distance between the predicted noise and the real sampled noise.
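The pseudocode of the training process in the DDPM paper (Algorithm 1, where ε denotes the noise written as z above) is as follows:

1: repeat
2:   x_0 ~ q(x_0)
3:   t ~ Uniform({1, ..., T})
4:   ε ~ N(0, I)
5:   take a gradient descent step on ∇_θ ||ε - ε_θ(sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε, t)||²
6: until converged

Can be understood as:

- draw a training sample x_0 from the dataset;
- draw a time step t uniformly from 1 to T;
- draw standard Gaussian noise ε and form the noisy sample x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε;
- feed x_t and t to the network and take a gradient step on the squared error between the predicted noise and ε.

This process is repeated until the network converges.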
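
One training iteration can be sketched in PyTorch as follows. Everything here is illustrative: TinyEps is a hypothetical stand-in for the noise-prediction network (a two-layer MLP on x_t concatenated with t/T), the data batch is random, and T = 100 with a linear β schedule is an arbitrary choice. The time index is 0-based in the code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class TinyEps(nn.Module):
    """Hypothetical noise predictor: an MLP on [x_t, t / T]."""
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)          # shape (batch, 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def train_step(model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],))             # random step per sample
    eps = torch.randn_like(x0)                          # real noise
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # noisy input
    loss = ((eps - model(x_t, t)) ** 2).mean()          # MSE between real and predicted noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = TinyEps()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x0_batch = torch.randn(128, 2) * 0.1                    # made-up data batch
losses = [train_step(model, optimizer, x0_batch) for _ in range(200)]
print(losses[0], losses[-1])
```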

3.2 Sampling process

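The description of the sampling process in the DDPM paper (Algorithm 2):

1: x_T ~ N(0, I)
2: for t = T, ..., 1 do
3:   z ~ N(0, I) if t > 1, else z = 0
4:   x_{t-1} = 1/sqrt(α_t) · (x_t - (1 - α_t)/sqrt(1 - ᾱ_t) · ε_θ(x_t, t)) + σ_t z
5: end for
6: return x_0

Because training has given us a network p_θ(x_{t-1} | x_t) that fits q(x_{t-1} | x_t), we can obtain x_0 step by step from x_T. The specific steps are:

- sample x_T from the standard Gaussian distribution;
- for t = T, ..., 1, predict the noise with the network, compute the mean of p_θ(x_{t-1} | x_t) from it, and add σ_t z (no noise is added at the last step) to obtain x_{t-1};
- return x_0 as the generated sample.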
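
The sampling loop can be tried out without training anything by replacing the network with an exact noise predictor for a toy case. The sketch below makes that assumption explicit: the data distribution is taken to be the 1-D Gaussian N(3, 1), for which E[ε | x_t] has the closed form used in eps_oracle. The names eps_oracle, DATA_MEAN and sample are hypothetical, and σ_t² = β_t is used as in the text. With a real model, eps_oracle would be replaced by the trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

DATA_MEAN = 3.0  # hypothetical data distribution: x_0 ~ N(3, 1)

def eps_oracle(x_t, t):
    """Exact E[eps | x_t] for x_0 ~ N(DATA_MEAN, 1); stands in for a trained eps_theta."""
    ab = alphas_bar[t]
    return np.sqrt(1.0 - ab) * (x_t - np.sqrt(ab) * DATA_MEAN)

def sample(n):
    x = rng.standard_normal(n)                       # x_T ~ N(0, I)
    for t in reversed(range(T)):                     # 0-based index: T-1, ..., 0
        z = rng.standard_normal(n) if t > 0 else np.zeros(n)  # no noise at the last step
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * eps_oracle(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z             # sigma_t = sqrt(beta_t)
    return x

samples = sample(20_000)
print(samples.mean(), samples.std())
```

The generated samples should have mean close to 3 and standard deviation close to 1, matching the assumed data distribution.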

3.3 Some details of model training

3.3.1 Network selection

The input and output of the diffusion model's network have the same shape, so in theory any network whose output shape equals its input shape can be used. For example, you can choose a U-Net as the fitting network.

3.3.2 Selection of some hyperparameters

In the forward process, we do not know in advance how many noising steps to use, and how to set the variance of the noise added at each step also matters a great deal. Both have to be found by repeated experimentation and tuning.
DDPM sets T to 1000 and lets β_t increase linearly from β₁ = 0.0001 to β_T = 0.02. Other diffusion models use different strategies; whatever tunes the network best for the task at hand is the best choice, and the best strategy may differ from task to task.
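
To see what this schedule implies, the following sketch computes ᾱ_t for the linear DDPM setting (T = 1000, β from 0.0001 to 0.02) and prints the signal scale sqrt(ᾱ_t) and noise scale sqrt(1 - ᾱ_t) at a few steps. The variable names are chosen here for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule from the DDPM paper
alphas_bar = np.cumprod(1.0 - betas)      # alphas_bar[t-1] holds alpha_bar_t

signal_scale = np.sqrt(alphas_bar)        # coefficient of x_0 in x_t
noise_scale = np.sqrt(1.0 - alphas_bar)   # coefficient of the noise in x_t

for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  sqrt(alpha_bar)={signal_scale[t - 1]:.4f}  "
          f"sqrt(1-alpha_bar)={noise_scale[t - 1]:.4f}")
```

By the last step the signal coefficient is close to zero, which is why x_T can be treated as pure Gaussian noise.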

4. DDPM case code implementation

To better grasp how the diffusion model works, I wrote and debugged a simple diffusion model case step by step, with reference to code found online: DDPM S_curve.

4.1 Dataset preparation

Note that the entire dataset consists of the points of an S-curve picture. There are 10,000 data items in total, each of which is one point of the “S”, so together the points follow an “S”-shaped distribution.
Code to build the dataset:

import numpy as np
import matplotlib.pyplot as plt  # used by the visualization code below
from sklearn.datasets import make_s_curve
import torch

s_curve, _ = make_s_curve(10**4, noise=0.1)
s_curve = s_curve[:, [0, 2]] / 10.0  # the points are 3-D; keep only two dimensions and rescale
device = 'cuda' if torch.cuda.is_available() else 'cpu'
dataset = torch.Tensor(s_curve).float().to(device)

Visualize the dataset:

data = s_curve.T
fig, ax = plt.subplots()
ax.scatter(*data, color='blue', edgecolor='white')
ax.axis('off')
plt.show()

4.2 Forward propagation process

First determine the two hyperparameters β (betas) and T (num_steps). We set T to 100. For β, we first take 100 evenly spaced numbers from [-6, 6] and then pass them through a sigmoid, rescaled to the range 1e-5 to 0.5e-2, to get 100 values that increase non-linearly.

num_steps = 100
betas = torch.linspace(-6, 6, num_steps).to(device)
betas = torch.sigmoid(betas)*(0.5e-2 - 1e-5) + 1e-5

Precompute the expressions that are used in the forward process formula:

alphas = 1 - betas  # alpha_t = 1 - beta_t

alphas_prod = torch.cumprod(alphas, dim=0)  # alpha_bar_t

# alpha_bar_{t-1}, with alpha_bar_0 = 1
alphas_prod_p = torch.cat([torch.tensor([1]).float().to(device), alphas_prod[:-1]], 0)

alphas_bar_sqrt = torch.sqrt(alphas_prod)  # sqrt(alpha_bar_t), used by q_x and the loss below

one_minus_alphas_bar_log = torch.log(1 - alphas_prod)

one_minus_alphas_bar_sqrt = torch.sqrt(1 - alphas_prod)

According to the formula

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,z$$

write a function that returns the state x_t at any time t:

def q_x(x_0, t):
    noise = torch.randn_like(x_0).to(device)  # random noise with the same shape as x_0
    alphas_t = alphas_bar_sqrt[t]
    alphas_1_m_t = one_minus_alphas_bar_sqrt[t]
    return alphas_t * x_0 + alphas_1_m_t * noise  # add noise to x_0

Visualize the dataset after adding noise, every 5 steps:

num_shows = 20
fig, axs = plt.subplots(2, 10, figsize=(28, 3))
plt.rc('text', color='black')
for i in range(num_shows):
    j = i // 10  # row index
    k = i % 10   # column index
    q_i = q_x(dataset, torch.tensor([i * num_steps // num_shows]).to(device))  # sample the data at time t
    q_i = q_i.to('cpu')
    axs[j, k].scatter(q_i[:, 0], q_i[:, 1], color='red', edgecolor='white')
    axs[j, k].set_axis_off()
    axs[j, k].set_title(r'$q(\mathbf{x}_{' + str(i * num_steps // num_shows) + '})$')
plt.show()

Define the loss function:

def diffusion_loss_fn(model, x_0, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, n_steps):
    """Sample a random time t for each item and compute the noise-prediction loss."""
    batch_size = x_0.shape[0]

    # Random time steps for the batch: draw half of them and mirror the rest
    # (t and n_steps - 1 - t), which assumes an even batch size
    t = torch.randint(0, n_steps, size=(batch_size // 2,)).to(device)
    t = torch.cat([t, n_steps - 1 - t], dim=0)
    t = t.unsqueeze(-1)

    # Coefficient of x_0
    a = alphas_bar_sqrt[t]

    # Coefficient of eps
    aml = one_minus_alphas_bar_sqrt[t]

    # Generate random noise eps
    e = torch.randn_like(x_0).to(device)

    # Construct the input of the model
    x = x_0 * a + e * aml

    # Feed it to the model to get the predicted noise at time t
    output = model(x, t.squeeze(-1))

    # Mean squared error against the real noise
    return (e - output).square().mean()

The loss function computes the loss between the noise predicted by the network and the real noise. The line x = x_0 * a + e * aml implements the formula:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$$

4.3 Reverse process (model training process)

Here you need to define a function to restore from X_T to X₀:

def p_sample_loop(model, shape, n_steps, betas, one_minus_alphas_bar_sqrt):
    """Restore x[T-1], x[T-2], ..., x[0] starting from x[T]"""
    cur_x = torch.randn(shape).to(device)
    x_seq = [cur_x]
    for i in reversed(range(n_steps)):
        cur_x = p_sample(model, cur_x, i, betas, one_minus_alphas_bar_sqrt)
        x_seq.append(cur_x)
    return x_seq

def p_sample(model, x, t, betas, one_minus_alphas_bar_sqrt):
    """Sample the reconstructed value x[t-1] from x[t]"""
    t = torch.tensor([t]).to(device)
    coeff = betas[t] / one_minus_alphas_bar_sqrt[t]
    eps_theta = model(x, t)  # predicted noise
    mean = (1 / (1 - betas[t]).sqrt()) * (x - (coeff * eps_theta))
    # Note: Algorithm 2 in the paper uses z = 0 at the final step; this simple version always adds noise
    z = torch.randn_like(x).to(device)
    sigma_t = betas[t].sqrt()
    sample = mean + sigma_t * z
    return sample

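The reverse step in p_sample applies the formula:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad \sigma_t = \sqrt{\beta_t}$$

Next, define the network model used to fit q; here it is a simple network built from linear layers: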

import torch.nn as nn

# Define the fitted network
class MLPDiffusion(nn.Module):
    def __init__(self, n_steps, num_units=128):
        super(MLPDiffusion, self).__init__()

        self.linears = nn.ModuleList(
            [
                nn.Linear(2, num_units),
                nn.ReLU(),
                nn.Linear(num_units, num_units),
                nn.ReLU(),
                nn.Linear(num_units, num_units),
                nn.ReLU(),
                nn.Linear(num_units, 2),
            ]
        )
        # One learned time-step embedding per hidden layer
        self.step_embeddings = nn.ModuleList(
            [
                nn.Embedding(n_steps, num_units),
                nn.Embedding(n_steps, num_units),
                nn.Embedding(n_steps, num_units),
            ]
        )

    def forward(self, x, t):
        # x is the noisy input x_t, t the integer time step
        for idx, embedding_layer in enumerate(self.step_embeddings):
            t_embedding = embedding_layer(t)
            x = self.linears[2 * idx](x)      # linear layer
            x += t_embedding                  # inject the time-step information
            x = self.linears[2 * idx + 1](x)  # ReLU

        x = self.linears[-1](x)

        return x

Finally comes the usual network training process. The batch_size is set to 128 and training runs for 4000 epochs. Because the network is very simple, my computer finished the training in less than 20 minutes. The sampling result is visualized every 100 epochs.

seed = 1234
torch.manual_seed(seed)  # use the seed so that runs are reproducible

print('Training model...')
batch_size = 128
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
num_epoch = 4000
plt.rc('text', color='blue')

model = MLPDiffusion(num_steps)  # input: x and step t; output dimension: 2
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for t in range(num_epoch):
    for idx, batch_x in enumerate(dataloader):
        loss = diffusion_loss_fn(model, batch_x, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, num_steps)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)
        optimizer.step()

    if t % 100 == 0:
        print(loss)
        with torch.no_grad():  # sampling does not need gradients
            x_seq = p_sample_loop(model, dataset.shape, num_steps, betas, one_minus_alphas_bar_sqrt)

        x_seq = [item.to('cpu') for item in x_seq]
        fig, axs = plt.subplots(1, 10, figsize=(28, 3))
        for i in range(1, 11):
            cur_x = x_seq[i * 10].detach()
            axs[i - 1].scatter(cur_x[:, 0], cur_x[:, 1], color='red', edgecolor='white')
            axs[i - 1].set_axis_off()
            axs[i - 1].set_title(r'$q(\mathbf{x}_{' + str(i * 10) + '})$')
        plt.show()

Part of the visual output during training (figures omitted here) was shown for epoch = 0, 200, 600, 1500, 3000 and 4000.

References
[1]: https://zhuanlan.zhihu.com/p/415487792
[2]: https://zhuanlan.zhihu.com/p/499206074
[3]: https://blog.csdn.net/weixin_42363544/article/details/127495570
[4]: https://blog.csdn.net/weixin_43850253/article/details/128275723
[5]: Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.