The Intuition and Mathematics of Diffusion

Deeply understand the intuitive principles and mathematical derivations of diffusion models, from the forward process to the reverse process, mastering the core ideas and implementation details of DDPM.

Dec 13, 2024 · 40 min read
Deep Learning · Diffusion

Human-Crafted

Written directly by the author with no AI-generated sections.


https://www.youtube.com/watch?v=HoKDTa5jHvg

This article primarily follows the teaching logic of this video, organized and elaborated with explanations. If there are any errors, feel free to correct them in the comments!

Intuition Part

Theoretical Support

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

This paper laid the theoretical foundation for diffusion models. The authors proposed a generative model based on non-equilibrium thermodynamics that achieves data generation by step-wise adding and removing noise. This provided crucial theoretical support for subsequent research in diffusion models.

We apply a large amount of noise to images and then use a neural network to denoise them. If this neural network learns well, it can start from completely random noise and eventually obtain an image from our training data.

Diffusion Overview

Forward Diffusion Process:

Noise is applied to the image iteratively; given enough steps, the image turns completely into noise. A normal (Gaussian) distribution is used as the noise source:

Forward Diffusion Process

Reverse Diffusion Process

From pure noise to an image, involving a neural network that learns to denoise step by step.

Why denoising gradually? The authors mentioned in the paper that the result of “one-step direct denoising” is poor.

So what does this network look like? And what is it supposed to predict?

Algorithm Improvement

Denoising Diffusion Probabilistic Models

This paper proposed Denoising Diffusion Probabilistic Models (DDPM), significantly improving the generation quality and efficiency of diffusion models. By introducing a simple denoising network and optimized training strategies, DDPM became an important milestone in the field.

The authors discussed three possible targets for the neural network to predict:

  1. Predict the mean of the noise at each timestep

    • Predicting the mean of the conditional distribution $p(x_{t-1} \mid x_t)$
    • The variance is fixed and not learned
  2. Predict $x_0$ directly

    • Directly predicting the original, uncorrupted image
    • Experiments showed this method performs poorly
  3. Predict the added noise

    • Predicting the noise $\epsilon$ added during the forward process
    • Mathematically equivalent to the first option (predicting the noise mean), just a different parameterization; the two can be converted into each other with simple operations

The paper ultimately chose predicting noise (the third way) as the primary method because it is more stable to train and yields better results.

The amount of noise added at each step is not constant; it’s controlled by a Linear Schedule to prevent training instability.

It looks something like this: Linear Schedule Diagram

As seen, the last few timesteps are close to complete noise with very little information. Furthermore, looking at the whole process, information is destroyed too quickly. OpenAI solved these two problems using a Cosine Schedule:

Cosine Schedule Diagram
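The two schedules can be sketched numerically. The snippet below is my own illustration: the linear endpoints ($10^{-4}$ to $0.02$, $T = 1000$) follow the DDPM paper, and the cosine curve follows OpenAI's improved formulation with offset $s = 0.008$. Printing the surviving signal fraction $\bar{\alpha}_t$ halfway through shows the cosine schedule destroying information more slowly:

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """DDPM's linear schedule: beta grows linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T=1000, s=0.008):
    """Cosine schedule: define alpha_bar via a cosine curve, then recover
    the per-step betas from ratios of consecutive alpha_bar values."""
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

# alpha_bar_t = prod_s (1 - beta_s) is how much signal survives at step t
for name, betas in [("linear", linear_beta_schedule()), ("cosine", cosine_beta_schedule())]:
    alpha_bar = np.cumprod(1 - betas)
    print(name, alpha_bar[499])  # signal fraction halfway through
```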

Model Architecture

U-Net

The model architecture used in this paper is U-Net, originally proposed for biomedical image segmentation:

This model features a Bottleneck in the middle (layers with fewer parameters). It uses Downsample-Blocks and Resnet-Blocks to project the input image to a lower resolution, and Upsample-Blocks to project it back to its original size.

U-Net Architecture

At certain resolutions, the authors added Attention-Blocks and used Skip-Connections between layers in the same resolution space. The model is designed to target each timestep, implemented through sinusoidal positional encoding embeddings from Transformers, which are projected into each Residual-Block. The model can also combine schedules to remove different amounts of noise at different timesteps to improve generation.
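The sinusoidal timestep embedding can be sketched as below. This is a minimal NumPy illustration of the Transformer-style encoding mentioned above, not the paper's exact implementation; in a real U-Net the resulting vector would additionally pass through a small MLP before being projected into each Residual-Block:

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal timestep embedding, as in Transformers' positional encoding.

    t:   array of integer timesteps, shape (batch,)
    dim: embedding dimension (assumed even here)
    Returns an array of shape (batch, dim): [sin, cos] features at geometric frequencies.
    """
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]  # (batch, half)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb = timestep_embedding(np.array([0, 10, 999]), 128)
print(emb.shape)  # (3, 128)
```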

Bottleneck and Autoencoder

The concept of a Bottleneck was originally proposed and widely used in the unsupervised learning method known as the Autoencoder. As the lowest-dimensional hidden layer in an Autoencoder architecture, it sits between the encoder and decoder, forming the narrowest part of the network. This forces the network to learn a compressed representation of the data while minimizing the reconstruction error, acting as a form of regularization:

$$\mathcal{L}_{\text{reconstruction}} = \|X - \text{Decoder}(\text{Encoder}(X))\|^2$$

Autoencoder Bottleneck
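A minimal bottleneck sketch in PyTorch (the layer sizes are arbitrary illustrations): squeezing 32-dimensional inputs through a 2-unit bottleneck and measuring the reconstruction loss above is exactly what such a network would minimize during training:

```python
import torch
import torch.nn as nn

# A minimal autoencoder: the 2-unit bottleneck forces a compressed representation
encoder = nn.Sequential(nn.Linear(32, 8), nn.ReLU(), nn.Linear(8, 2))  # bottleneck dim = 2
decoder = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 32))

X = torch.randn(64, 32)                                # a batch of toy inputs
reconstruction = decoder(encoder(X))
# L_reconstruction = ||X - Decoder(Encoder(X))||^2, averaged over the batch
loss = ((X - reconstruction) ** 2).sum(dim=1).mean()
print(loss.item())
```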

Architectural Improvements

OpenAI significantly improved overall results in their second paper Diffusion Models Beat GANs on Image Synthesis through architectural improvements:

  1. Increased network depth (more layers), decreased width (channels per layer)
  2. Increased the number of Attention-Blocks
  3. Expanded the number of heads in each Attention Block
  4. Introduced BigGAN-style Residual Blocks for upsampling and downsampling
  5. Introduced Adaptive Group Normalization (AdaGN), dynamically adjusting normalization parameters based on conditional information (like timesteps)
  6. Used classifier guidance, with a separately trained classifier, to steer the model toward generating specific classes of images

Mathematics Part

Symbol Table

  • $X_t$ represents the image at timestep $t$, where $X_0$ is the original image. Note that a smaller $t$ means less noise:

Original image X0

  • The final noisy image is isotropic (identical in all directions) pure noise, denoted $X_T$. In the initial research $T = 1000$, but later work reduced this significantly:

Total noise XT

  • Forward process: $q(x_t \mid x_{t-1})$ takes $x_{t-1}$ and outputs a noisier image $x_t$:

Forward process diagram

  • Backward process: $p(x_{t-1} \mid x_t)$ takes $x_t$ and uses a neural network to output a denoised image $x_{t-1}$:

Backward process diagram

Forward Process

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

Where:

  • $\sqrt{1-\beta_t}\,x_{t-1}$ is the mean of the distribution. $\beta_t$ is the noise-schedule parameter, ranging from 0 to 1; the factor $\sqrt{1-\beta_t}$ scales the previous image and decreases as the timestep increases, so it represents the retained part of the original signal.

Noise Schedule parameters

  • $\beta_t I$ is the covariance matrix of the distribution. $I$ is the identity matrix, meaning the covariance is diagonal with independent dimensions. As the timestep increases, the amount of added noise grows.

Now we just need to iteratively execute this step to get the result after 1000 steps, but it can actually be done in one go.

Reparameterization Trick

The reparameterization trick is very important in diffusion models and other generative models (like Variational Autoencoders, VAE). Its core idea is transforming the sampling process of a random variable into a deterministic function plus a standardized random variable. This transformation allows the model to be optimized through gradient descent because it eliminates the impact of randomness in the sampling process on gradient calculation.

Here’s a simple example to explain its significance:

There are two ways to implement rolling a die.

  • The first has randomness inside the function:

import random

# 1. Direct die roll (random sampling)
def roll_dice():
    return random.randint(1, 6)

result = roll_dice()

  • The second separates the randomness out, keeping the function deterministic:

import math

# 2. Separating randomness
random_number = random.random()  # Generate a random number in [0, 1)

def transformed_dice(random_number):
    # Map the [0, 1) random number to an integer from 1 to 6
    return math.floor(random_number * 6) + 1

result = transformed_dice(random_number)

In probability theory, we learn that if $X$ is a random variable with $X \sim \mathcal{N}(0,1)$, then $aX + b \sim \mathcal{N}(b, a^2)$.

Therefore, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, samples can be generated as:

$$x = \mu + \sigma \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, 1)$ is drawn from the standard normal distribution.

Similarly, for normal distributions:

  • Without Reparameterization:

import numpy as np

# Sample directly from the target normal distribution
x = np.random.normal(mu, sigma)

  • With Reparameterization:

# Sample from the standard normal distribution first
epsilon = np.random.normal(0, 1)
# Then obtain the target distribution via a deterministic transformation
x = mu + sigma * epsilon

When it comes to gradient calculation in model training:

Without Reparameterization:

import numpy as np

def sample_direct(mu, sigma):
    return np.random.normal(mu, sigma)

# In this case, it's hard to compute gradients w.r.t. mu and sigma
# because the random sampling blocks gradient propagation

With Reparameterization:

def sample_reparameterized(mu, sigma):
    epsilon = np.random.normal(0, 1)  # Gradients don't flow through here
    return mu + sigma * epsilon       # Easy to compute gradients for mu and sigma

Taking VAE as an example:

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()  # Outputs mu and sigma (defined elsewhere)
        self.decoder = Decoder()

    def reparameterize(self, mu, sigma):
        # Reparameterization trick
        epsilon = torch.randn_like(mu)  # Sample from the standard normal distribution
        z = mu + sigma * epsilon        # Deterministic transformation
        return z

    def forward(self, x):
        # Encoder outputs mu and sigma
        mu, sigma = self.encoder(x)

        # Use reparameterization to sample
        z = self.reparameterize(mu, sigma)

        # Decoder reconstructs the input
        reconstruction = self.decoder(z)
        return reconstruction

Reparameterization from a Foodie’s Perspective

Imagine making milk tea:

Without Reparameterization:

  • You make a cup of milk tea with a specific sweetness directly.
  • If it’s not good, you don’t know if you added too much sugar or too little water.

With Reparameterization:

  1. Prepare a standard concentration of sugar water first ($\epsilon$).
  2. Then adjust the amount of sugar water ($\mu$) and the degree of dilution ($\sigma$) to reach the target sweetness.
  3. If it's not good, you know exactly whether the sugar-water amount or the dilution needs adjusting (the parameters can be optimized).

Reparameterization from a foodie perspective

In summary, via reparameterization:

  • Gradients can propagate through deterministic transformations.
  • Parameters can be optimized via gradient descent.
  • Randomness is isolated and doesn’t affect gradient calculation.

Forward Mathematical Derivation

Transition from $x_{t-1}$ to $x_t$:

  • Given $x_{t-1}$, we want to generate $x_t$.
  • Applying the reparameterization trick to $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I)$:

$$\because\ \Sigma = \beta_t I,\ \sigma^2 = \beta_t \qquad \therefore\ \sigma = \sqrt{\beta_t}$$

We can express $x_t$ as a deterministic transformation of $x_{t-1}$ plus a noise term:

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$$

  • Here $\sqrt{1-\beta_t}\,x_{t-1}$ is the mean part and $\sqrt{\beta_t}\,\epsilon$ is the noise part. Since $\epsilon$ is a sample from the standard normal distribution and independent of the model parameters, backpropagation only has to deal with the deterministic coefficients $\sqrt{1-\beta_t}$ and $\sqrt{\beta_t}$, so the model can be optimized effectively via gradient descent.

We use $\alpha_t = 1 - \beta_t$ to simplify notation, and write $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$ for the cumulative product.

Resulting in:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon$$

Calculating a two-step transition, from $x_{t-2}$ to $x_t$:

$$\begin{aligned} x_{t-1} &= \sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-1} \\ x_t &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-1}\right) + \sqrt{1-\alpha_t}\,\epsilon_t \\ x_t &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \end{aligned}$$

Since $\epsilon_{t-1}$ and $\epsilon_t$ are independent standard normal variables, their weighted sum is also Gaussian, with variance $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$, so the two noise parts merge into a single noise term $\epsilon \sim \mathcal{N}(0, I)$:

$$x_t = \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon$$

Similarly:

$$\begin{aligned} x_t &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon \\ x_t &= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\,x_{t-3} + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\epsilon \\ x_t &= \sqrt{\alpha_t\alpha_{t-1}\cdots\alpha_1}\,x_0 + \sqrt{1-\alpha_t\alpha_{t-1}\cdots\alpha_1}\,\epsilon \end{aligned}$$

By induction, for any $k < t$:

$$x_t = \sqrt{\prod_{s=k+1}^t \alpha_s}\;x_k + \sqrt{1-\prod_{s=k+1}^t \alpha_s}\;\epsilon$$

Since $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, taking $k = 0$ gives:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \quad (\epsilon \sim \mathcal{N}(0, I))$$

The full derivation flow is as follows:

$$\begin{aligned} q(x_t \mid x_{t-1}) &= \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big) \\ x_t &= \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon \\ &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\epsilon \\ &= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\,x_{t-3} + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\epsilon \\ &= \sqrt{\alpha_t\alpha_{t-1}\cdots\alpha_1}\,x_0 + \sqrt{1-\alpha_t\alpha_{t-1}\cdots\alpha_1}\,\epsilon \\ &= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon \quad (\epsilon \sim \mathcal{N}(0, I)) \\ \Rightarrow\ q(x_t \mid x_0) &= \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\big) \end{aligned}$$
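The closed form means we can jump from $x_0$ to any $x_t$ in a single step, which is exactly how training batches are noised in practice. A minimal sketch (the linear schedule and the toy image shape are illustrative assumptions):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # bar(alpha)_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.random.rand(32, 32)          # a toy "image"
eps = np.random.randn(32, 32)
xt = q_sample(x0, t=500, eps=eps)
print(xt.shape)  # (32, 32)
```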

Reverse Mathematical Derivation

Since variance is fixed and doesn’t need learning (see section 1.3), we only need the neural network to predict the mean:

Predicting noise mean

Our ultimate goal is to predict the noise between two timesteps. Let's start the analysis from the loss function, the negative log-likelihood:

$$-\log p_\theta(x_0)$$

However, in this negative log-likelihood, the probability of $x_0$ depends on all preceding timesteps, which makes it intractable to compute directly. The solution is to learn a model that approximates these conditional probabilities. Here we need the Variational Lower Bound to obtain a more computable formula.

Variational Lower Bound

Variational Lower Bound formula

Suppose we have an uncomputable function $f(x)$, in our case the negative log-likelihood. We can find a computable function $g(x)$ that always satisfies $g(x) \leq f(x)$; optimizing $g(x)$ will then also improve $f(x)$: Variational Lower Bound concept diagram

We ensure this by subtracting the KL Divergence, a metric that measures the similarity between two distributions, which is always non-negative:

$$D_{KL}(p \,\|\, q) = \int_x p(x) \log \frac{p(x)}{q(x)}\,dx$$
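As a quick sanity check of this definition (my own illustration, not part of the original derivation), the KL divergence between two 1-D Gaussians has a closed form that we can compare against a direct numerical integration of the integral above:

```python
import numpy as np

def kl_gaussians(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Numerical integration of  integral p(x) log(p(x)/q(x)) dx
x = np.linspace(-20, 20, 200_001)
p = np.exp(-(x - 1.0)**2 / (2 * 2.0**2)) / (2.0 * np.sqrt(2 * np.pi))
q = np.exp(-(x - 0.0)**2 / (2 * 3.0**2)) / (3.0 * np.sqrt(2 * np.pi))
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(kl_gaussians(1.0, 2.0, 0.0, 3.0), numeric)  # both ~ 0.1832
```

Note that the result is non-negative, and exactly zero only when the two distributions coincide, which is the property the derivation relies on.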

Subtracting an always non-negative term guarantees a result less than or equal to the original function. Here we use "+" instead: since we want to minimize the loss, adding the KL term guarantees an upper bound on the negative log-likelihood:

$$-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T} \mid x_0)\,\big\|\,p_\theta(x_{1:T} \mid x_0)\big)$$

In this form, since the negative log-likelihood is still present, the lower bound remains uncomputable. We need a better expression. First, rewrite the KL divergence as a log ratio of two terms:

$$-\log p_\theta(x_0) \leq -\log p_\theta(x_0) + \log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)}\right)$$

Next, apply Bayes’ rule to the denominator:

$$p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_0 \mid x_{1:T})\,p_\theta(x_{1:T})}{p_\theta(x_0)}$$

[!NOTE] Bayes' Rule: $p(A \mid B) = \frac{p(B \mid A)\,p(A)}{p(B)}$

The numerator $p_\theta(x_0 \mid x_{1:T})\,p_\theta(x_{1:T})$ is actually the joint probability $p_\theta(x_0, x_{1:T})$, because:

$$p_\theta(x_0, x_{1:T}) = p_\theta(x_0 \mid x_{1:T})\,p_\theta(x_{1:T})$$

Usually, $p_\theta(x_{0:T})$ denotes the joint probability of $x_0$ and all intermediate steps $x_{1:T}$, i.e.:

$$p_\theta(x_{0:T}) = p_\theta(x_0, x_{1:T})$$

[!NOTE] $p_\theta(x_{0:T})$ denotes the joint probability distribution of all states $x_0, x_1, \ldots, x_T$ from timestep 0 to $T$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$$

Substituting gives:

$$\log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)}\right) = \log\left(\frac{q(x_{1:T} \mid x_0)}{\frac{p_\theta(x_{0:T})}{p_\theta(x_0)}}\right)$$

Turning the division in the denominator into multiplication, $\frac{1}{p_\theta(x_{0:T})/p_\theta(x_0)} = \frac{p_\theta(x_0)}{p_\theta(x_{0:T})}$, gives:

$$= \log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right) + \log p_\theta(x_0)$$

Following the process below leads to the final form:

Bayes' Rule application

Now the two problematic terms cancel out:

$$\begin{aligned} -\log p_\theta(x_0) &\leq -\log p_\theta(x_0) + \log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right) + \log p_\theta(x_0) \\ &= \log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right) \end{aligned}$$

This gives us a lower bound that can be minimized, and all parts are known:

  • The numerator is the joint probability of the forward process: $q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1})$;
  • The denominator is the joint probability of the reverse process: $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$.

To make it analytically solvable, a few more reorganization steps are needed:

$$\begin{aligned} \log\left(\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right) &= \log\left(\frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{p(x_T)\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)}\right) \\ &= \log\left(\frac{1}{p(x_T)} \cdot \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)}\right) \\ &= \log\left(\frac{1}{p(x_T)}\right) + \log\left(\frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)}\right) \\ &= -\log p(x_T) + \sum_{t=1}^T \log\left(\frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)}\right) \\ &= -\log p(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)}\right) + \log\left(\frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)}\right) \end{aligned}$$

Rewrite the numerator of the sum term using Bayes' rule: $q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\,q(x_t)}{q(x_{t-1})}$

But this goes back to before, where these terms require estimating all samples, leading to high variance. As shown in the image below, given xtx_txt​, it’s hard to determine what the previous state looked like:

High variance problem

The improvement strategy is to condition directly on the original data x0x_0x0​:

$$\Longrightarrow \frac{q(x_{t-1} \mid x_t, x_0)\,q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$

By providing the noiseless image simultaneously, there are fewer candidate xt−1x_{t-1}xt−1​ states, reducing variance:

Conditioning on X0

Substituting back:

$$\begin{aligned} &= -\log p(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1} \mid x_t, x_0)\,q(x_t \mid x_0)}{p_\theta(x_{t-1} \mid x_t)\,q(x_{t-1} \mid x_0)}\right) + \log\left(\frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)}\right) \\ &= -\log p(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right) + \sum_{t=2}^T \log\left(\frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}\right) + \log\left(\frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)}\right) \end{aligned}$$

Expanding the second sum, most terms cancel out:

Cancellation process after sum expansion

$$= -\log p(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right) + \log\left(\frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}\right) + \log\left(\frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)}\right)$$

Applying log rules to the last two terms simplifies them:

$$\begin{aligned} \log\left(\frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)}\right) + \log\left(\frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)}\right) &= \left[\log q(x_T \mid x_0) - \log q(x_1 \mid x_0)\right] + \left[\log q(x_1 \mid x_0) - \log p_\theta(x_0 \mid x_1)\right] \\ &= \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \end{aligned}$$

Moving the first simplified term to the front and merging into a logarithm yields the final analytical form:

$$\begin{aligned} &= -\log p(x_T) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right) + \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \\ &= \log\left(\frac{q(x_T \mid x_0)}{p(x_T)}\right) + \sum_{t=2}^T \log\left(\frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}\right) - \log p_\theta(x_0 \mid x_1) \\ &= D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big) + \sum_{t=2}^T D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \\ &= \sum_{t=2}^T D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \end{aligned}$$

The first term can be dropped: $q$ has no learnable parameters, it is just the fixed forward noising process, which converges to a normal distribution; and $p(x_T)$ is simply noise sampled from a Gaussian, so the KL divergence between them is very small.

The derivation of the remaining two terms is as follows (process omitted, see Lilian’s blog for details): Remaining term derivation

Since $\beta_t$ is fixed, we focus on the form of the mean $\tilde{\mu}_t$:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,x_0$$

The closed form of the forward process, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, can be solved for $x_0$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\right)$$

Substituting this expression for $x_0$ into the predicted mean $\tilde{\mu}_t$:

$$\tilde{\mu}_t = \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\right)$$

Now $\tilde{\mu}_t$ no longer depends on $x_0$. Continuing to simplify, first expand the second term:

$$\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon\right) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{\sqrt{\bar{\alpha}_t}\,(1 - \bar{\alpha}_t)}\,x_t - \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t\,\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}\,(1 - \bar{\alpha}_t)}\,\epsilon$$

Merging the $x_t$ terms:

$$\tilde{\mu}_t = \left(\frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{\sqrt{\bar{\alpha}_t}\,(1 - \bar{\alpha}_t)}\right) x_t - \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t\,\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}\,(1 - \bar{\alpha}_t)}\,\epsilon$$

Further merging and simplifying: since $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$, the coefficient of $x_t$ collapses to $\frac{1}{\sqrt{\alpha_t}}$, leading to:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\right)$$

This means we essentially just subtract a suitably scaled version of the noise from $x_t$, and that noise $\epsilon$ is exactly what the neural network needs to predict.

Substituting into the loss function $L_t$, defined as a Mean Squared Error:

$$\begin{aligned} L_t &= \frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\right) - \mu_\theta(x_t, t)\right\|^2 \\ &= \frac{1}{2\sigma_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)\right\|^2 \\ &= \frac{1}{2\sigma_t^2}\left\|\frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1 - \bar{\alpha}_t}}\big(\epsilon - \epsilon_\theta(x_t, t)\big)\right\|^2 \\ &= \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1 - \bar{\alpha}_t)}\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2 \end{aligned}$$

The final form is the mean squared error between the actual noise at timestep $t$ and the noise predicted by the neural network. Researchers found that dropping the leading weighting term yields better sampling quality and is easier to implement:

$$\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \longrightarrow \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$
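The simplification above can be sketched in a few lines. This is a minimal illustration, not the article's code; the function names and arguments are my own:

```python
import torch
import torch.nn.functional as F

def weighted_loss(eps, eps_pred, beta_t, sigma2_t, alpha_t, alpha_bar_t):
    # Full coefficient: beta_t^2 / (2 * sigma_t^2 * alpha_t * (1 - alpha_bar_t))
    w = beta_t ** 2 / (2 * sigma2_t * alpha_t * (1 - alpha_bar_t))
    return w * F.mse_loss(eps_pred, eps)

def simple_loss(eps, eps_pred):
    # L_simple drops the weighting term entirely: plain MSE on the noise
    return F.mse_loss(eps_pred, eps)
```

In practice the two differ only by a per-timestep scalar, which is why dropping it still trains the same noise predictor, just with a different implicit weighting across timesteps.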

Going back to the original formula:

$$\mathcal{N}\left(x_{t-1};\ \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)\right),\ \beta_t\right)$$

The authors decided not to add extra random noise in the final sampling step to stabilize the generation process:

Final step sampling

The final form is:

$$\begin{align} L_{\text{simple}} &= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon},\ t \right) \right\|^2 \right] \\ &\implies \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \mathbf{x}_t, t \right) \right\|^2 \right] \end{align}$$
- $\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}$ denotes the expectation over timestep $t$, original data $\mathbf{x}_0$, and noise $\boldsymbol{\epsilon}$.
- $\boldsymbol{\epsilon}$ is the actual random noise added.
- $\boldsymbol{\epsilon}_\theta$ is the noise predicted by the neural network.
- $\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}$ is the closed-form solution of the forward process, representing the noisy data at timestep $t$, which simplifies the notation:
  - $\mathbf{x}_t$ directly represents the noisy data at timestep $t$.
  - The entire loss function essentially measures the mean squared error between the predicted and actual noise.

Timestep $t$ is usually sampled from a uniform distribution ($t \sim \mathrm{Uniform}(1, T)$). This choice ensures that during training, every timestep has an equal probability of being selected, allowing the model to effectively learn the denoising process across all timesteps.
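The closed-form forward jump together with uniform timestep sampling can be sketched as follows. The schedule and all names here are assumptions for illustration, not from the article:

```python
import torch

def q_sample(x0, t, alphas_bar, eps=None):
    """Jump straight to timestep t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    if eps is None:
        eps = torch.randn_like(x0)
    # Reshape abar_t so it broadcasts over all non-batch dimensions of x0.
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    return torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear schedule
alphas_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.randn(16, 3, 32, 32)              # a dummy image batch
t = torch.randint(0, T, (16,))               # t sampled uniformly per example
x_t = q_sample(x0, t, alphas_bar)
```

Because the forward process has this closed form, training never needs to iterate through the noising steps one by one.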

Training

Training process

First, we sample images from the dataset, then sample a timestep $t$ uniformly and noise $\epsilon$ from a standard normal distribution, and finally optimize the objective via gradient descent.
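One training iteration under the simplified objective might look like the sketch below. The model interface `model(x_t, t)` returning predicted noise is an assumption about the network, not something this article specifies:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alphas_bar):
    """One DDPM-style training step: sample t and eps, noise x0, regress the noise."""
    t = torch.randint(0, len(alphas_bar), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    # Closed-form forward process to reach timestep t in one shot.
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps
    loss = F.mse_loss(model(x_t, t), eps)   # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a real setup `model` would be a U-Net conditioned on a timestep embedding; any module with this call signature works for the loop itself.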

Sampling

First, sample $x_T$ from a standard normal distribution, then iteratively sample $x_{t-1}$ via reparameterization using the formula shown earlier.

Sampling process

Note that no noise is added when $t = 1$. According to the formula:

$$x_0 = \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1 - \bar{\alpha}_1}} \epsilon_\theta(x_1, 1) \right)$$

At $t = 1$, the formula is used to recover $x_0$ from $x_1$, the final step of the denoising process. At this point we want to reconstruct the original image as accurately as possible, so the noise term $\sqrt{\beta_t} \epsilon$ is omitted; this avoids introducing unnecessary randomness into the final generated image and preserves clarity and detail.
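The full sampling loop, including the noise-free final step, might look like the following sketch. Here $\sigma_t = \sqrt{\beta_t}$ is one of the two variance choices from DDPM, the model interface is assumed, and `t` is zero-indexed, so `t == 0` corresponds to the article's $t = 1$:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Reverse process: start from pure noise and denoise step by step."""
    alphas = 1 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)
        # Posterior mean from the merged formula derived above.
        mean = (x - betas[t] / torch.sqrt(1 - alphas_bar[t]) * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # add sigma_t * z
        else:
            x = mean  # final step: no extra noise, keep the reconstruction clean
    return x
```

Swapping the `if t > 0` branch off entirely would inject noise even at the last step and visibly degrade the output, which is exactly the detail the paragraph above explains.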

Code Implementation

I recommend Sunrise's simplified MLP implementation on Zhihu. When I have time, I may attempt a from-scratch implementation of Stable Diffusion. Consider this a promise to my future self…

References

  • Diffusion Models | Paper Explanation | Math Explained
  • Lilian Weng: From Autoencoder to Beta-VAE
  • Lilian Weng: What are Diffusion Models?

For commercial reuse, contact the site owner for authorization. For non-commercial use, please credit the source and link to this article.

You may copy, distribute, and adapt this work as long as derivatives share the same license. Licensed under CC BY-NC-SA 4.0.
