Intuition and Mathematics of Diffusion

This article mainly follows the teaching logic of the video, organizing and explaining its content. If you spot any errors, please point them out in the comments!

Intuitive Part#

Theoretical Support#

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

This paper lays the theoretical foundation for diffusion models. The authors propose a generative model based on nonequilibrium thermodynamics, achieving data generation by gradually adding and removing noise. This provides important theoretical support for subsequent diffusion model research.

We apply a large amount of noise to images and then use a neural network to denoise them. If this neural network learns well, it can start from completely random noise and ultimately produce images like those in our training data.


Forward Diffusion Process:#

Iteratively apply noise to the image; after enough steps, the image turns completely into noise. A normal distribution is used as the noise source.

Reverse Diffusion Process#

From pure noise back to an image, using a neural network that learns to denoise step by step.

Why is it gradual denoising? The authors mention in the paper that the result of "denoising in one step" is very poor.

So what does this network look like? What does it need to predict?

Algorithm Improvement#

Denoising Diffusion Probabilistic Models

This paper introduces Denoising Diffusion Probabilistic Models (DDPM), significantly improving the generation quality and efficiency of diffusion models. By introducing simple denoising networks and optimized training strategies, DDPM has become an important milestone in the field of diffusion models.

The authors discuss three targets that the neural network can predict:

  1. Predict the mean of the noise at each timestep

    • That is, predict the mean of the conditional distribution $p(x_{t-1} \mid x_t)$
    • The variance is fixed and not learnable
  2. Predict the original image (predict $x_0$ directly)

    • Directly predict the original, unpolluted image
    • Experiments show that this method performs poorly
  3. Predict the added noise

    • Predict the noise $\epsilon$ added during the forward process
    • This is mathematically equivalent to the first method (predicting the mean of the noise), just parameterized differently; they can be converted into each other through simple transformations.

The paper ultimately chooses predicting noise (the third method) as the main approach, as this method is more stable in training and performs better.

The amount of noise added at each step is not fixed; it is controlled by a Linear Schedule to prevent instability during training.


Under the linear schedule, the last few timesteps are close to complete noise and carry very little information; moreover, overall, the information is destroyed too quickly. OpenAI therefore introduced a Cosine Schedule to solve these two problems.


Model Architecture#

U-Net#

For the denoising network, the paper adopts the U-Net architecture (originally introduced for biomedical image segmentation):

This model has a Bottleneck (a layer with fewer parameters) in the middle: Downsample-Blocks and ResNet-Blocks project the input image to lower resolutions, and Upsample-Blocks project it back to the original size at the output.


At certain resolutions, the authors added Attention-Blocks, with Skip-Connections between layers at the same resolution. A single model is shared across all timesteps: the current timestep is injected through sinusoidal position embeddings, as in Transformers, which are projected into each Residual-Block. The model can also be combined with the schedule to remove different amounts of noise at different timesteps to enhance generation quality; this will be discussed in detail later.

Bottleneck and Autoencoder#

The concept of the Bottleneck (bottleneck layer) was first proposed and widely used in the unsupervised learning method known as the "Autoencoder". As the lowest-dimensional hidden layer in the autoencoder architecture, it sits between the encoder and decoder and forms the narrowest part of the network. It forces the network to learn a compressed representation of the data by minimizing the reconstruction error, and it acts as a regularizer:

$$\mathcal{L}_{\text{reconstruction}} = \|X - \text{Decoder}(\text{Encoder}(X))\|^2$$


Architecture Improvement#

OpenAI significantly improved overall performance by enhancing the architecture in their second paper Diffusion Models Beat GANs on Image Synthesis:

  1. Increase network depth (more layers) and reduce width (number of channels per layer)
  2. Increase the number of Attention-Blocks
  3. Expand the number of heads in each Attention Block
  4. Introduce BigGAN-style Residual Blocks for upsampling and downsampling
  5. Introduce Adaptive Group Normalization (AdaGN) to dynamically adjust normalization parameters based on conditional information, such as the timestep (see the sketch after this list)
  6. Use classifier guidance (a separately trained classifier) to help the model generate images of specific classes

Mathematical Part#

Symbol Table#

  • $x_t$ denotes the image at timestep $t$; thus $x_0$ is the original image. In short: smaller $t$ means less noise.


  • The final image is isotropic (uniform in all directions) pure noise, denoted $x_T$. In the initial studies $T = 1000$; subsequent work has reduced this significantly.


  • Forward Process: $q(x_t \mid x_{t-1})$ takes the image $x_{t-1}$ and outputs a noisier image $x_t$.


  • Reverse Process: $p(x_{t-1} \mid x_t)$ takes the image $x_t$ and outputs a denoised image $x_{t-1}$, using a neural network.


Forward Process#

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

Where,

  • $\sqrt{1 - \beta_t}\, x_{t-1}$ is the mean of the distribution. $\beta_t$ is the noise-schedule parameter, ranging from 0 to 1; the factor $\sqrt{1 - \beta_t}$ scales the signal, decreases as the timestep increases, and represents the retained portion of the original signal.


  • $\beta_t I$ is the covariance matrix of the distribution, where $I$ is the identity matrix: the covariance is diagonal, independent across dimensions, and the amount of added noise grows as the timestep increases.

Now we could simply iterate this step 1000 times to reach the final result, but in fact the whole chain can be collapsed into a single step, as we will see below.

Reparameterization Trick#

The reparameterization trick is very important in diffusion models and other generative models (such as Variational Autoencoders, VAE). Its core idea is to transform the sampling of a random variable into a deterministic function of the parameters plus a standardized random variable. This transformation allows the model to be optimized through gradient descent, because it keeps the randomness of the sampling process from interfering with gradient computation.

Here is a simple example to explain its significance.

You can implement a dice roll in two ways:

  • The first way has randomness inside the function:

```python
import random

# 1. Directly rolling the dice (random sampling)
def roll_dice():
    return random.randint(1, 6)

result = roll_dice()
```
  • The second way separates the randomness outside the function, making the function itself deterministic:

```python
import math
import random

# 2. Separating randomness
random_number = random.random()  # Generate a random number between 0 and 1

def transformed_dice(random_number):
    # Map the random number from [0, 1) to {1, ..., 6}
    return math.floor(random_number * 6) + 1

result = transformed_dice(random_number)
```

In probability theory, we learned that if $X$ is a random variable with $X \sim \mathcal{N}(0, 1)$, then $aX + b \sim \mathcal{N}(b, a^2)$.

Thus, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, samples can be generated as follows:

$$x = \mu + \sigma \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, 1)$ is a sample from the standard normal distribution.

Similarly, for a normal distribution:

  • Without using reparameterization:

```python
import numpy as np

mu, sigma = 2.0, 0.5  # example parameters

# Directly sampling from the target normal distribution
x = np.random.normal(mu, sigma)
```

  • Using reparameterization:

```python
# First sample from the standard normal distribution
epsilon = np.random.normal(0, 1)
# Then obtain the target distribution through a deterministic transformation
x = mu + sigma * epsilon
```

In the context of gradient calculations involved in model training,

Without using reparameterization:

```python
def sample_direct(mu, sigma):
    return np.random.normal(mu, sigma)

# In this case, it's difficult to compute the gradient with respect to mu and sigma,
# because the random sampling operation blocks gradient propagation.
```

Using reparameterization:

```python
def sample_reparameterized(mu, sigma):
    epsilon = np.random.normal(0, 1)  # Gradient does not need to propagate through here
    return mu + sigma * epsilon       # Gradients for mu and sigma can be easily computed
```

Taking VAE (Variational Autoencoder) as an example:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = Encoder()  # Outputs mu and sigma; assumed defined elsewhere
        self.decoder = Decoder()  # Assumed defined elsewhere

    def reparameterize(self, mu, sigma):
        # Reparameterization trick
        epsilon = torch.randn_like(mu)  # Sample from the standard normal distribution
        z = mu + sigma * epsilon        # Deterministic transformation
        return z

    def forward(self, x):
        # Encoder outputs mu and sigma
        mu, sigma = self.encoder(x)

        # Use reparameterization to sample
        z = self.reparameterize(mu, sigma)

        # Decoder reconstructs the input
        reconstruction = self.decoder(z)
        return reconstruction
```

Reparameterization from a Foodie's Perspective#

Imagine you are making milk tea:

Without using reparameterization:

  • Directly making a cup of milk tea with a specific sweetness
  • If it doesn't taste good, you don't know if you added too much or too little sugar

Using reparameterization:

  1. First, prepare a cup of standard sugar water ($\epsilon$)
  2. Then adjust the amount of sugar ($\mu$) and the dilution level ($\sigma$) to achieve the desired sweetness
  3. If it doesn't taste good, you can tell exactly whether to adjust the sugar amount or the dilution level (the parameters can be optimized)


In summary, through reparameterization:

  • Gradients can propagate through deterministic transformations
  • Parameters can be optimized through gradient descent
  • Randomness is isolated, not affecting gradient calculations

Forward Mathematical Derivation#

Transition from $x_{t-1}$ to $x_t$:

  • Given $x_{t-1}$, we want to generate $x_t$.
  • Applying the reparameterization trick to $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$:

$$\begin{align*} \because\ \Sigma &= \beta_t I, \quad \sigma^2 = \beta_t \\ \therefore\ \sigma &= \sqrt{\beta_t} \end{align*}$$

We can express $x_t$ as a deterministic transformation of $x_{t-1}$ plus a noise term:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$

  • Here, $\sqrt{1 - \beta_t}\, x_{t-1}$ is the mean part and $\sqrt{\beta_t}\, \epsilon$ is the noise part. Since $\epsilon$ is a sample from the standard normal distribution and is independent of the model parameters, during backpropagation the gradient only needs to flow through the coefficients $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$. This allows the model to be optimized effectively through gradient descent.

We simplify the notation by defining $\alpha_t := 1 - \beta_t$ and recording its cumulative product $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$.

We obtain:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon$$

Calculating the two-step transition, from $x_{t-2}$ to $x_t$:

$$\begin{align*} x_{t-1} &= \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-1} \\ x_t &= \sqrt{\alpha_t} \left( \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-1} \right) + \sqrt{1 - \alpha_t}\, \epsilon_t \\ x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \end{align*}$$

Since $\epsilon_{t-1}$ and $\epsilon_t$ are independent standard normal variables, we can merge the two noise terms into a single noise term $\epsilon \sim \mathcal{N}(0, I)$ (their variances add: $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$):

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon$$

Similarly:

$$\begin{align*} x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, x_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2}}\, \epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \end{align*}$$

By induction, we can derive:

$$x_t = \sqrt{\prod_{s=k+1}^t \alpha_s}\; x_k + \sqrt{1 - \prod_{s=k+1}^t \alpha_s}\; \epsilon$$

Since $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, taking $k = 0$ gives

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \quad (\epsilon \sim \mathcal{N}(0, I))$$

The complete derivation process is as follows:

$$\begin{align} q(x_t \mid x_{t-1}) &= \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) \\ x_t &= \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon \\ &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, x_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2}}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \\ &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \quad (\epsilon \sim \mathcal{N}(0, I)) \\ q(x_t \mid x_0) &= \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right) \end{align}$$

Reverse Mathematical Derivation#

Since the variance is fixed and does not need to be learned (see 1.3), we only need the neural network to predict the mean:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

Our ultimate goal is to predict the noise between two adjacent timesteps. Now let's analyze starting from the loss function, the negative log-likelihood:

$$-\log(p_\theta(x_0))$$

However, the probability of $x_0$ in this negative log-likelihood depends on all the previous timesteps. As a solution, we can learn a model that approximates these conditional probabilities, which requires using the Variational Lower Bound to obtain a more tractable formula.

Variational Lower Bound#


Assume we have an uncomputable function $f(x)$; in our scenario this is the log-likelihood. If we can find a computable function $g(x)$ that always satisfies $g(x) \leq f(x)$, then pushing $g(x)$ up also pushes $f(x)$ up.

We construct such a bound using the KL divergence, a measure of how different two distributions are, which is always non-negative:

$$D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx$$

Subtracting a non-negative term from the log-likelihood always yields something smaller, i.e., a lower bound. Since we work with the loss $-\log p_\theta(x_0)$, the sign flips: we add the KL term and obtain an upper bound on the loss, and minimizing that bound also minimizes the negative log-likelihood:

$$-\log(p_\theta(x_0)) \leq -\log(p_\theta(x_0)) + D_{KL}(q(x_{1:T} \mid x_0) \| p_\theta(x_{1:T} \mid x_0))$$

In this form, the negative log-likelihood is still present, so the bound remains uncomputable. We need a better expression. First, rewrite the KL divergence as the log of a ratio:

$$\begin{align*} -\log(p_\theta(x_0)) &\leq -\log(p_\theta(x_0)) + D_{KL}(q(x_{1:T} \mid x_0) \| p_\theta(x_{1:T} \mid x_0)) \\ &= -\log(p_\theta(x_0)) + \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)} \right) \end{align*}$$

Next, apply Bayes' theorem to the denominator:

$$p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})}{p_\theta(x_0)}$$

Note

Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$

The numerator $p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})$ is actually the joint probability $p_\theta(x_0, x_{1:T})$, because:

$$p_\theta(x_0, x_{1:T}) = p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})$$

Typically, $p_\theta(x_{0:T})$ denotes the joint probability of $x_0$ and all intermediate steps $x_{1:T}$, i.e.:

$$p_\theta(x_{0:T}) = p_\theta(x_0, x_{1:T})$$

Note

$p_\theta(x_{0:T})$ represents the joint probability distribution of all states $x_0, x_1, \ldots, x_T$ from timestep 0 to $T$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$$

Substituting gives:

$$\begin{align*} \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)} \right) &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \right) \\ &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) + \log(p_\theta(x_0)) \end{align*}$$

(dividing by the fraction $\frac{p_\theta(x_{0:T})}{p_\theta(x_0)}$ is the same as multiplying by $\frac{p_\theta(x_0)}{p_\theta(x_{0:T})}$)

Substituting this back into the bound, the two annoying $\log(p_\theta(x_0))$ terms suddenly cancel out:

$$\begin{align*} -\log(p_\theta(x_0)) &\leq -\log(p_\theta(x_0)) + \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) + \log(p_\theta(x_0)) \\ &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) \end{align*}$$

This gives us a bound that can be minimized, and every term in it is known:

  • The numerator is the joint probability of the forward process: $q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1})$;
  • The denominator is the joint probability of the reverse process: $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$

To have an analytical solution, a few additional reorganization steps are needed:

$$\begin{align} \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) &= \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= \log \left( \frac{1}{p(x_T)} \cdot \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= \log \left( \frac{1}{p(x_T)} \right) + \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \sum_{t=1}^T \log \left( \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Rewriting the numerator of the summand using Bayes' theorem: $q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$

But this brings back the earlier problem: these terms require estimation over all samples, leading to high variance. For example, given only a heavily noised $x_t$, it is difficult to determine what the previous state looked like.

The improvement idea is to condition directly on the original data $x_0$:

$$\Longrightarrow \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$

This way, providing the noise-free image narrows down the candidates for $x_{t-1}$, thus reducing variance.

Substituting back into the original expression:

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{p_\theta(x_{t-1} \mid x_t)\, q(x_{t-1} \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \\ &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \sum_{t=2}^T \log \left( \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Expanding the second sum term, we find that it telescopes; most terms cancel:

$$\sum_{t=2}^T \log \left( \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) = \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right)$$

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Applying log rules to the last two terms simplifies them further:

$$\begin{align*} \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) &= \left[ \log q(x_T \mid x_0) - \log q(x_1 \mid x_0) \right] + \left[ \log q(x_1 \mid x_0) - \log p_\theta(x_0 \mid x_1) \right] \\ &= \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \end{align*}$$

Then moving the simplified first term to the front and merging into one log gives the final analytical form:

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \\ &= \log \left( \frac{q(x_T \mid x_0)}{p(x_T)} \right) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) - \log(p_\theta(x_0 \mid x_1)) \\ &= D_{KL}(q(x_T \mid x_0) \| p(x_T)) + \sum_{t=2}^T D_{KL}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t)) - \log(p_\theta(x_0 \mid x_1)) \\ &\approx \sum_{t=2}^T D_{KL}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t)) - \log(p_\theta(x_0 \mid x_1)) \end{align}$$

The first term in this form can be ignored: $q$ has no learnable parameters, and the forward noising process converges to a normal distribution, while $p(x_T)$ is simply noise sampled from that same Gaussian. This KL divergence term is therefore very small.

Deriving the remaining terms shows that the forward-process posterior is itself Gaussian (process omitted; see Lilian Weng's blog):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

$\tilde{\beta}_t$ is fixed, so we focus on the form of $\tilde{\mu}_t$:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0$$

The closed form of the forward process, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, can be solved for $x_0$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right)$$

Substituting this expression for $x_0$ into the posterior mean $\tilde{\mu}_t$:

$$\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right)$$

Now $\tilde{\mu}_t$ no longer depends on $x_0$. Continuing to simplify, we first expand the second term, noting that $\sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1 / \sqrt{\alpha_t}$:

$$\frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right) = \frac{\beta_t}{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_t)}\, x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon$$

Combining the $x_t$ terms:

$$\tilde{\mu}_t = \left( \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} + \frac{\beta_t}{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_t)} \right) x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon$$

Further combining the coefficients of $x_t$ (using $\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$), we ultimately obtain:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right)$$

This shows that we essentially just subtract a scaled version of the noise that produced $x_t$; that noise is exactly what the neural network needs to predict.

The loss function $L_t$ obtained after substitution is a mean squared error:

$$\begin{align*} L_t &= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right) - \mu_\theta(x_t, t) \right\|^2 \\ &= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right) - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \right\|^2 \\ &= \frac{1}{2\sigma_t^2} \left\| \frac{\beta_t}{\sqrt{\alpha_t (1-\bar{\alpha}_t)}} (\epsilon - \epsilon_\theta(x_t, t)) \right\|^2 \\ &= \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \end{align*}$$

The final form is the mean squared error between the actual noise at timestep $t$ and the noise predicted by the neural network. Researchers found that ignoring the scaling term in front yields better sampling quality and is easier to implement.

$$\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \longrightarrow \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$

Returning to the original formula:

$$\mathcal{N}\left(x_{t-1};\ \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right),\ \beta_t I\right)$$

The authors decided not to add extra random noise in the final sampling step, making the generation process more stable.

The final form is:

$$\begin{align} L_{\text{simple}} &= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\ t \right) \right\|^2 \right] \\ &= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \mathbf{x}_t, t \right) \right\|^2 \right] \end{align}$$
  • $\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}$ denotes the expectation over the timestep $t$, the original data $\mathbf{x}_0$, and the noise $\boldsymbol{\epsilon}$
  • $\boldsymbol{\epsilon}$ is the actual random noise that was added
  • $\boldsymbol{\epsilon}_\theta$ is the noise predicted by the neural network
  • $\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$ is the closed-form solution of the forward process, representing the noisy data at timestep $t$, which simplifies to:
    • $\mathbf{x}_t$, which directly denotes the noisy data at timestep $t$, i.e., $\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$
    • The entire loss function essentially measures the mean squared error between the predicted noise and the actual noise.

Here the timestep $t$ is typically sampled from a uniform distribution, $t \sim \text{Uniform}(1, T)$, where $T$ is the total number of timesteps. This ensures every timestep is selected with equal probability during training, allowing the model to learn the denoising process effectively across all timesteps.

Training#


First, we sample some images from the dataset, then sample $t$ uniformly and noise from a normal distribution, and optimize the objective through gradient descent.

Sampling#

First, sample $x_T$ from a normal distribution, then repeatedly apply the formula shown earlier to sample $x_{t-1}$ from $x_t$ through reparameterization.


Note that at $t = 1$, no noise is added. According to the formula

$$x_0 = \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1 - \bar{\alpha}_1}}\, \epsilon_\theta(x_1, 1) \right)$$

this formula recovers $x_0$ from $x_1$, which is the last step of the denoising process. At this point, we want to reconstruct the original image as accurately as possible. Not adding noise in the last step (i.e., omitting the $\sqrt{\beta_t}\, \epsilon$ term) avoids introducing unnecessary randomness into the final image, preserving its clarity and detail.

Code Implementation#

For a hands-on walkthrough, I recommend Sunrise's simplified MLP implementation on Zhihu.
If time permits, I may later do a code implementation of Stable Diffusion; consider this a placeholder for now...
