Intuition and Mathematics of Diffusion

This article mainly follows the teaching logic of the video, organizing and explaining its content. If you spot any errors, please point them out in the comments!

Intuitive Part#

Theoretical Support#

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

This paper lays the theoretical foundation for diffusion models. The authors propose a generative model based on nonequilibrium thermodynamics, achieving data generation by gradually adding and removing noise. This provides important theoretical support for subsequent diffusion model research.

We apply a large amount of noise to images and then use a neural network to denoise them. If this neural network learns well, it can start from completely random noise and ultimately produce images like those in our training data.


Forward Diffusion Process:#

Iteratively apply noise to the image; after enough steps, the image turns completely into noise. A normal distribution is used as the noise source.

Reverse Diffusion Process#

From pure noise back to an image, using a neural network that learns to denoise step by step.

Why is it gradual denoising? The authors mention in the paper that the result of "denoising in one step" is very poor.

So what does this network look like? What does it need to predict?

Algorithm Improvement#

Denoising Diffusion Probabilistic Models

This paper introduces Denoising Diffusion Probabilistic Models (DDPM), significantly improving the generation quality and efficiency of diffusion models. By introducing simple denoising networks and optimized training strategies, DDPM has become an important milestone in the field of diffusion models.

The authors discuss three targets that the neural network can predict:

  1. Predict the mean of the noise at each timestep

    • That is, predict the mean of the conditional distribution $p(x_{t-1} \mid x_t)$
    • The variance is fixed and not learnable
  2. Predict the original image (predict $x_0$ directly)

    • Directly predict the original, unpolluted image
    • Experiments show that this method performs poorly
  3. Predict the added noise

    • Predict the noise $\epsilon$ added during the forward process
    • This is mathematically equivalent to the first method (predicting the mean of the noise), just parameterized differently; they can be converted into each other through simple transformations.

The paper ultimately chooses predicting noise (the third method) as the main approach, as this method is more stable in training and performs better.

The amount of noise added at each step is not fixed; it is controlled by a Linear Schedule to prevent instability during training.


Under the linear schedule, the last few timesteps are close to complete noise and carry very little information; moreover, overall, the information is destroyed too quickly. OpenAI therefore introduced a Cosine Schedule to solve these two problems.


Model Architecture#

U-Net#

For the denoising network, the paper adopts the U-Net architecture (originally introduced for biomedical image segmentation):

This model has a Bottleneck (a layer with fewer parameters) in the middle: Downsample-Blocks and ResNet-Blocks project the input image to lower resolutions, and Upsample-Blocks project it back to the original size at the output.


At certain resolutions, the authors added Attention-Blocks, with Skip-Connections between layers at the same resolution. A single model is shared across all timesteps: the current timestep is injected through sinusoidal position embeddings, as in Transformers, which are projected into each Residual-Block. The model can also be combined with the schedule to remove different amounts of noise at different timesteps to enhance generation quality; this will be discussed in detail later.

Bottleneck and Autoencoder#

The concept of the Bottleneck (bottleneck layer) was first proposed and widely used in the unsupervised learning method known as the "Autoencoder". As the lowest-dimensional hidden layer in the autoencoder architecture, it sits between the encoder and decoder and forms the narrowest part of the network. It forces the network to learn a compressed representation of the data by minimizing the reconstruction error, and it acts as a regularizer:

$$\mathcal{L}_{\text{reconstruction}} = \|X - \text{Decoder}(\text{Encoder}(X))\|^2$$


Architecture Improvement#

OpenAI significantly improved overall performance by enhancing the architecture in their second paper Diffusion Models Beat GANs on Image Synthesis:

  1. Increase network depth (more layers) and reduce width (number of channels per layer)
  2. Increase the number of Attention-Blocks
  3. Expand the number of heads in each Attention Block
  4. Introduce BigGAN-style Residual Blocks for upsampling and downsampling
  5. Introduce Adaptive Group Normalization (AdaGN) to dynamically adjust normalization parameters based on conditional information, such as the timestep (see the sketch after this list)
  6. Use classifier guidance (a separately trained classifier) to help the model generate images of specific classes

Mathematical Part#

Symbol Table#

  • $x_t$ denotes the image at timestep $t$; thus $x_0$ is the original image. In short: smaller $t$ means less noise.


  • The final image is isotropic (uniform in all directions) pure noise, denoted $x_T$. In the initial studies $T = 1000$; subsequent work has reduced this significantly.


  • Forward Process: $q(x_t \mid x_{t-1})$ takes the image $x_{t-1}$ and outputs a noisier image $x_t$.


  • Reverse Process: $p(x_{t-1} \mid x_t)$ takes the image $x_t$ and outputs a denoised image $x_{t-1}$, using a neural network.


Forward Process#

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

Where,

  • $\sqrt{1 - \beta_t}\, x_{t-1}$ is the mean of the distribution. $\beta_t$ is the noise-schedule parameter, ranging from 0 to 1; the factor $\sqrt{1 - \beta_t}$ scales the signal, decreases as the timestep increases, and represents the retained portion of the original signal.


  • $\beta_t I$ is the covariance matrix of the distribution, where $I$ is the identity matrix: the covariance is diagonal, independent across dimensions, and the amount of added noise grows as the timestep increases.

Now we could simply iterate this step 1000 times to reach the final result, but in fact the whole chain can be collapsed into a single step, as we will see below.

Reparameterization Trick#

The reparameterization trick is very important in diffusion models and other generative models (such as Variational Autoencoders, VAE). Its core idea is to transform the sampling of a random variable into a deterministic function of the parameters plus a standardized random variable. This transformation allows the model to be optimized through gradient descent, because it keeps the randomness of the sampling process from interfering with gradient computation.

Here is a simple example to explain its significance.

You can implement a dice roll in two ways:

  • The first way has randomness inside the function:

```python
import random

# 1. Directly rolling the dice (random sampling)
def roll_dice():
    return random.randint(1, 6)

result = roll_dice()
```
  • The second way separates the randomness outside the function, making the function itself deterministic:

```python
import math
import random

# 2. Separating randomness
random_number = random.random()  # Generate a random number between 0 and 1

def transformed_dice(random_number):
    # Map the random number from [0, 1) to {1, ..., 6}
    return math.floor(random_number * 6) + 1

result = transformed_dice(random_number)
```

In probability theory, we learned that if $X$ is a random variable with $X \sim \mathcal{N}(0, 1)$, then $aX + b \sim \mathcal{N}(b, a^2)$.

Thus, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, samples can be generated as follows:

$$x = \mu + \sigma \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, 1)$ is a sample from the standard normal distribution.

Similarly, for a normal distribution:

  • Without using reparameterization:

```python
import numpy as np

mu, sigma = 2.0, 0.5  # example parameters

# Directly sampling from the target normal distribution
x = np.random.normal(mu, sigma)
```

  • Using reparameterization:

```python
# First sample from the standard normal distribution
epsilon = np.random.normal(0, 1)
# Then obtain the target distribution through a deterministic transformation
x = mu + sigma * epsilon
```

In the context of gradient calculations involved in model training,

Without using reparameterization:

```python
def sample_direct(mu, sigma):
    return np.random.normal(mu, sigma)

# In this case, it's difficult to compute the gradient with respect to mu and sigma,
# because the random sampling operation blocks gradient propagation.
```

Using reparameterization:

```python
def sample_reparameterized(mu, sigma):
    epsilon = np.random.normal(0, 1)  # Gradient does not need to propagate through here
    return mu + sigma * epsilon       # Gradients for mu and sigma can be easily computed
```

Taking VAE (Variational Autoencoder) as an example:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = Encoder()  # Outputs mu and sigma; assumed defined elsewhere
        self.decoder = Decoder()  # Assumed defined elsewhere

    def reparameterize(self, mu, sigma):
        # Reparameterization trick
        epsilon = torch.randn_like(mu)  # Sample from the standard normal distribution
        z = mu + sigma * epsilon        # Deterministic transformation
        return z

    def forward(self, x):
        # Encoder outputs mu and sigma
        mu, sigma = self.encoder(x)

        # Use reparameterization to sample
        z = self.reparameterize(mu, sigma)

        # Decoder reconstructs the input
        reconstruction = self.decoder(z)
        return reconstruction
```

Reparameterization from a Foodie's Perspective#

Imagine you are making milk tea:

Without using reparameterization:

  • Directly making a cup of milk tea with a specific sweetness
  • If it doesn't taste good, you don't know if you added too much or too little sugar

Using reparameterization:

  1. First, prepare a cup of standard sugar water ($\epsilon$)
  2. Then adjust the amount of sugar ($\mu$) and the dilution level ($\sigma$) to achieve the desired sweetness
  3. If it doesn't taste good, you can tell exactly whether to adjust the sugar amount or the dilution level (the parameters can be optimized)


In summary, through reparameterization:

  • Gradients can propagate through deterministic transformations
  • Parameters can be optimized through gradient descent
  • Randomness is isolated, not affecting gradient calculations

Forward Mathematical Derivation#

Transition from $x_{t-1}$ to $x_t$:

  • Given $x_{t-1}$, we want to generate $x_t$.
  • Applying the reparameterization trick to $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$:

$$\begin{align*} \because\ \Sigma &= \beta_t I, \quad \sigma^2 = \beta_t \\ \therefore\ \sigma &= \sqrt{\beta_t} \end{align*}$$

We can express $x_t$ as a deterministic transformation of $x_{t-1}$ plus a noise term:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$

  • Here, $\sqrt{1 - \beta_t}\, x_{t-1}$ is the mean part and $\sqrt{\beta_t}\, \epsilon$ is the noise part. Since $\epsilon$ is a sample from the standard normal distribution and is independent of the model parameters, during backpropagation the gradient only needs to flow through the coefficients $\sqrt{1 - \beta_t}$ and $\sqrt{\beta_t}$. This allows the model to be optimized effectively through gradient descent.

We simplify the notation by defining $\alpha_t := 1 - \beta_t$ and recording its cumulative product $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$.

We obtain:

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon$$

Calculating the two-step transition, from $x_{t-2}$ to $x_t$:

$$\begin{align*} x_{t-1} &= \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-1} \\ x_t &= \sqrt{\alpha_t} \left( \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-1} \right) + \sqrt{1 - \alpha_t}\, \epsilon_t \\ x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1 - \alpha_{t-1})}\, \epsilon_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_t \end{align*}$$

Since $\epsilon_{t-1}$ and $\epsilon_t$ are independent standard normal variables, we can merge the two noise terms into a single noise term $\epsilon \sim \mathcal{N}(0, I)$ (their variances add: $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$):

$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon$$

Similarly:

$$\begin{align*} x_t &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, x_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2}}\, \epsilon \\ x_t &= \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \end{align*}$$

By induction, we can derive:

$$x_t = \sqrt{\prod_{s=k+1}^t \alpha_s}\; x_k + \sqrt{1 - \prod_{s=k+1}^t \alpha_s}\; \epsilon$$

Since $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, taking $k = 0$ gives

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \quad (\epsilon \sim \mathcal{N}(0, I))$$

The complete derivation process is as follows:

$$\begin{align} q(x_t \mid x_{t-1}) &= \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I) \\ x_t &= \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon \\ &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, x_{t-3} + \sqrt{1 - \alpha_t \alpha_{t-1} \alpha_{t-2}}\, \epsilon \\ &= \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \\ &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \quad (\epsilon \sim \mathcal{N}(0, I)) \\ q(x_t \mid x_0) &= \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I\right) \end{align}$$

Reverse Mathematical Derivation#

Since the variance is fixed and does not need to be learned (see 1.3), we only need the neural network to predict the mean:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

Our ultimate goal is to predict the noise between two adjacent timesteps. Now let's analyze starting from the loss function, the negative log-likelihood:

$$-\log(p_\theta(x_0))$$

However, the probability of $x_0$ in this negative log-likelihood depends on all the previous timesteps. As a solution, we can learn a model that approximates these conditional probabilities, which requires using the Variational Lower Bound to obtain a more tractable formula.

Variational Lower Bound#


Assume we have an uncomputable function $f(x)$; in our scenario this is the log-likelihood. If we can find a computable function $g(x)$ that always satisfies $g(x) \leq f(x)$, then pushing $g(x)$ up also pushes $f(x)$ up.

We construct such a bound using the KL divergence, a measure of how different two distributions are, which is always non-negative:

$$D_{KL}(p \| q) = \int_x p(x) \log \frac{p(x)}{q(x)} \, dx$$

Subtracting a non-negative term from the log-likelihood always yields something smaller, i.e., a lower bound. Since we work with the loss $-\log p_\theta(x_0)$, the sign flips: we add the KL term and obtain an upper bound on the loss, and minimizing that bound also minimizes the negative log-likelihood:

$$-\log(p_\theta(x_0)) \leq -\log(p_\theta(x_0)) + D_{KL}(q(x_{1:T} \mid x_0) \| p_\theta(x_{1:T} \mid x_0))$$

In this form, the negative log-likelihood is still present, so the bound remains uncomputable. We need a better expression. First, rewrite the KL divergence as the log of a ratio:

$$\begin{align*} -\log(p_\theta(x_0)) &\leq -\log(p_\theta(x_0)) + D_{KL}(q(x_{1:T} \mid x_0) \| p_\theta(x_{1:T} \mid x_0)) \\ &= -\log(p_\theta(x_0)) + \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)} \right) \end{align*}$$

Next, apply Bayes' theorem to the denominator:

$$p_\theta(x_{1:T} \mid x_0) = \frac{p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})}{p_\theta(x_0)}$$

Note

Bayes' theorem: $p(A \mid B) = \frac{p(B \mid A)\, p(A)}{p(B)}$

The numerator $p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})$ is actually the joint probability $p_\theta(x_0, x_{1:T})$, because:

$$p_\theta(x_0, x_{1:T}) = p_\theta(x_0 \mid x_{1:T})\, p_\theta(x_{1:T})$$

Typically, $p_\theta(x_{0:T})$ denotes the joint probability of $x_0$ and all intermediate steps $x_{1:T}$, i.e.:

$$p_\theta(x_{0:T}) = p_\theta(x_0, x_{1:T})$$

Note

$p_\theta(x_{0:T})$ represents the joint probability distribution of all states $x_0, x_1, \ldots, x_T$ from timestep 0 to $T$:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$$

Substituting gives:

$$\begin{align*} \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{1:T} \mid x_0)} \right) &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)} \right) \\ &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) + \log(p_\theta(x_0)) \end{align*}$$

(dividing by the fraction $\frac{p_\theta(x_{0:T})}{p_\theta(x_0)}$ is the same as multiplying by $\frac{p_\theta(x_0)}{p_\theta(x_{0:T})}$)

Substituting this back into the bound, the two annoying $\log(p_\theta(x_0))$ terms suddenly cancel out:

$$\begin{align*} -\log(p_\theta(x_0)) &\leq -\log(p_\theta(x_0)) + \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) + \log(p_\theta(x_0)) \\ &= \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) \end{align*}$$

This gives us a bound that can be minimized, and every term in it is known:

  • The numerator is the joint probability of the forward process: $q(x_{1:T} \mid x_0) = \prod_{t=1}^T q(x_t \mid x_{t-1})$;
  • The denominator is the joint probability of the reverse process: $p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)$

To have an analytical solution, a few additional reorganization steps are needed:

$$\begin{align} \log \left( \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right) &= \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{p(x_T) \prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= \log \left( \frac{1}{p(x_T)} \cdot \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= \log \left( \frac{1}{p(x_T)} \right) + \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \log \left( \frac{\prod_{t=1}^T q(x_t \mid x_{t-1})}{\prod_{t=1}^T p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \sum_{t=1}^T \log \left( \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} \right) \\ &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_t \mid x_{t-1})}{p_\theta(x_{t-1} \mid x_t)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Rewriting the numerator of the summand using Bayes' theorem: $q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$

But this brings back the earlier problem: these terms require estimation over all samples, leading to high variance. For example, given only a heavily noised $x_t$, it is difficult to determine what the previous state looked like.

The improvement idea is to condition directly on the original data $x_0$:

$$\Longrightarrow \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$

This way, providing the noise-free image narrows down the candidates for $x_{t-1}$, thus reducing variance.

Substituting back into the original expression:

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{p_\theta(x_{t-1} \mid x_t)\, q(x_{t-1} \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \\ &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \sum_{t=2}^T \log \left( \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Expanding the second sum term, we find that it telescopes; most terms cancel:

$$\sum_{t=2}^T \log \left( \frac{q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)} \right) = \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right)$$

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) \end{align}$$

Applying log rules to the last two terms simplifies them further:

$$\begin{align*} \log \left( \frac{q(x_T \mid x_0)}{q(x_1 \mid x_0)} \right) + \log \left( \frac{q(x_1 \mid x_0)}{p_\theta(x_0 \mid x_1)} \right) &= \left[ \log q(x_T \mid x_0) - \log q(x_1 \mid x_0) \right] + \left[ \log q(x_1 \mid x_0) - \log p_\theta(x_0 \mid x_1) \right] \\ &= \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \end{align*}$$

Then moving the simplified first term to the front and merging into one log gives the final analytical form:

$$\begin{align} &= -\log(p(x_T)) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) + \log q(x_T \mid x_0) - \log p_\theta(x_0 \mid x_1) \\ &= \log \left( \frac{q(x_T \mid x_0)}{p(x_T)} \right) + \sum_{t=2}^T \log \left( \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)} \right) - \log(p_\theta(x_0 \mid x_1)) \\ &= D_{KL}(q(x_T \mid x_0) \| p(x_T)) + \sum_{t=2}^T D_{KL}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t)) - \log(p_\theta(x_0 \mid x_1)) \\ &\approx \sum_{t=2}^T D_{KL}(q(x_{t-1} \mid x_t, x_0) \| p_\theta(x_{t-1} \mid x_t)) - \log(p_\theta(x_0 \mid x_1)) \end{align}$$

The first term in this form can be ignored: $q$ has no learnable parameters, and the forward noising process converges to a normal distribution, while $p(x_T)$ is simply noise sampled from that same Gaussian. This KL divergence term is therefore very small.

Deriving the remaining terms shows that the forward-process posterior is itself Gaussian (process omitted; see Lilian Weng's blog):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t$$

$\tilde{\beta}_t$ is fixed, so we focus on the form of $\tilde{\mu}_t$:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0$$

The closed form of the forward process, $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, can be solved for $x_0$:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right)$$

Substituting this expression for $x_0$ into the posterior mean $\tilde{\mu}_t$:

$$\tilde{\mu}_t = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right)$$

Now $\tilde{\mu}_t$ no longer depends on $x_0$. Continuing to simplify, we first expand the second term, noting that $\sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1 / \sqrt{\alpha_t}$:

$$\frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \right) = \frac{\beta_t}{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_t)}\, x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon$$

Combining the $x_t$ terms:

$$\tilde{\mu}_t = \left( \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} + \frac{\beta_t}{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_t)} \right) x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon$$

Further combining the coefficients of $x_t$ (using $\alpha_t(1 - \bar{\alpha}_{t-1}) + \beta_t = 1 - \bar{\alpha}_t$), we ultimately obtain:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon \right)$$

This shows that we essentially just subtract a scaled version of the noise that produced $x_t$; that noise is exactly what the neural network needs to predict.

The loss function $L_t$ obtained after substitution is a mean squared error:

$$\begin{align*} L_t &= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right) - \mu_\theta(x_t, t) \right\|^2 \\ &= \frac{1}{2\sigma_t^2} \left\| \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \right) - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \right\|^2 \\ &= \frac{1}{2\sigma_t^2} \left\| \frac{\beta_t}{\sqrt{\alpha_t (1-\bar{\alpha}_t)}} (\epsilon - \epsilon_\theta(x_t, t)) \right\|^2 \\ &= \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1-\bar{\alpha}_t)} \|\epsilon - \epsilon_\theta(x_t, t)\|^2 \end{align*}$$

The final form is the mean squared error between the actual noise at timestep $t$ and the noise predicted by the neural network. Researchers found that ignoring the scaling term in front yields better sampling quality and is easier to implement.

$$\frac{\beta_t^2}{2 \sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \longrightarrow \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2$$

Returning to the original formula:

$$\mathcal{N}\left(x_{t-1};\ \frac{1}{\sqrt{\alpha_t}} \left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t)\right),\ \beta_t I\right)$$

The authors decided not to add extra random noise in the final sampling step, making the generation process more stable.

The final form is:

$$\begin{align} L_{\text{simple}} &= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon},\ t \right) \right\|^2 \right] \\ &= \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta \left( \mathbf{x}_t, t \right) \right\|^2 \right] \end{align}$$
  • $\mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}$ denotes the expectation over the timestep $t$, the original data $\mathbf{x}_0$, and the noise $\boldsymbol{\epsilon}$
  • $\boldsymbol{\epsilon}$ is the actual random noise that was added
  • $\boldsymbol{\epsilon}_\theta$ is the noise predicted by the neural network
  • $\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$ is the closed-form solution of the forward process, representing the noisy data at timestep $t$, which simplifies to:
    • $\mathbf{x}_t$, which directly denotes the noisy data at timestep $t$, i.e., $\sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}$
    • The entire loss function essentially measures the mean squared error between the predicted noise and the actual noise.

Here the timestep $t$ is typically sampled from a uniform distribution, $t \sim \text{Uniform}(1, T)$, where $T$ is the total number of timesteps. This ensures every timestep is selected with equal probability during training, allowing the model to learn the denoising process effectively across all timesteps.

Training#


First, we sample some images from the dataset, then sample $t$ uniformly and noise from a normal distribution, and optimize the objective through gradient descent.

Sampling#

First, sample $x_T$ from a normal distribution, then repeatedly apply the formula shown earlier to sample $x_{t-1}$ from $x_t$ through reparameterization.


Note that at $t = 1$, no noise is added. According to the formula

$$x_0 = \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1 - \bar{\alpha}_1}}\, \epsilon_\theta(x_1, 1) \right)$$

this formula recovers $x_0$ from $x_1$, which is the last step of the denoising process. At this point, we want to reconstruct the original image as accurately as possible. Not adding noise in the last step (i.e., omitting the $\sqrt{\beta_t}\, \epsilon$ term) avoids introducing unnecessary randomness into the final image, preserving its clarity and detail.

Code Implementation#

For a hands-on walkthrough, I recommend Sunrise's simplified MLP implementation on Zhihu.
If time permits, I may later do a code implementation of Stable Diffusion; consider this a placeholder for now...
