AI-generated summary
The content discusses the theoretical foundations and advancements of diffusion models in deep learning, particularly focusing on the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." It explains the forward and reverse diffusion processes, where noise is iteratively added to images and then removed using a neural network. The paper "Denoising Diffusion Probabilistic Models" (DDPM) is highlighted for improving the quality and efficiency of these models by introducing a denoising network and optimized training strategies.
It then covers the three possible prediction targets for the neural network: the noise mean at each timestep, the original image directly, or the added noise. The third option, predicting the noise, is preferred for its stability and effectiveness. The architecture of the U-Net model is described, which employs a bottleneck structure and attention mechanisms to enhance performance.
Additionally, improvements made by OpenAI in their second paper are noted, including increasing network depth, adding attention blocks, and introducing adaptive normalization techniques. The mathematical framework is outlined, detailing the forward and backward processes, noise schedules, and the reparameterization trick, which allows for effective optimization of the model.
This article mainly follows the teaching logic of the video, organizing and explaining its content along the way. If you spot any errors, feel free to point them out in the comments!
This paper lays the theoretical foundation for diffusion models. The authors propose a generative model based on nonequilibrium thermodynamics, achieving data generation by gradually adding and removing noise. This provides important theoretical support for subsequent diffusion model research.
We apply a large amount of noise to images and then train a neural network to denoise them. If the network learns well, it can start from completely random noise and ultimately produce images resembling our training data.
Noise is applied to the image iteratively; after enough steps, the image turns completely into noise. A normal distribution is used as the noise source:
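As a minimal sketch (not the paper's implementation), this iterative noising can be written as follows; the update rule $x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$ is derived later in this post, and the schedule values here are illustrative:

```python
import torch

def forward_diffuse(x0, betas):
    """Iteratively add Gaussian noise; with enough steps the image becomes pure noise."""
    x = x0
    for beta in betas:
        noise = torch.randn_like(x)                      # noise drawn from N(0, I)
        x = (1 - beta).sqrt() * x + beta.sqrt() * noise  # one forward step
    return x

# Example: 1000 steps turn a dummy "image" in [-1, 1] into (near-)pure noise
betas = torch.linspace(1e-4, 0.02, 1000)
x0 = torch.rand(3, 64, 64) * 2 - 1
xT = forward_diffuse(x0, betas)
```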
This paper introduces Denoising Diffusion Probabilistic Models (DDPM), significantly improving the generation quality and efficiency of diffusion models. By introducing simple denoising networks and optimized training strategies, DDPM has become an important milestone in the field of diffusion models.
The authors discuss three targets that the neural network can predict:
Predict the mean of the noise at each timestep
That is, predict the mean of the conditional distribution $p(x_{t-1} \mid x_t)$
The variance is fixed and not learnable
Predict the original image (predict $x_0$ directly)
Directly predict the original, uncorrupted image
Experiments show that this method performs poorly
Predict the added noise
Predict the noise $\epsilon$ added during the forward process
This is mathematically equivalent to the first method (predicting the mean); they are just parameterized differently and can be converted into each other through a simple transformation (shown below).
The paper ultimately chooses predicting noise (the third method) as the main approach, as it trains more stably and performs better.
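The "simple transformation" between the two parameterizations is the standard DDPM relation below (it uses the notation $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ defined later in this post): given the predicted noise $\epsilon_\theta$, the predicted mean is

$$
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t) \right)
$$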
The amount of noise added at each step is not fixed; it is controlled by a Linear Schedule to prevent instability during training.
It looks something like this:
You can see that the last few timesteps are already close to pure noise and carry very little information; moreover, the information is destroyed too quickly overall. OpenAI therefore used a Cosine Schedule to solve these two problems:
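A sketch of both schedules; the linear endpoints follow the values commonly cited for DDPM, and the cosine form follows the construction in OpenAI's improved-DDPM paper (the constant `s` is their small offset):

```python
import numpy as np

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear schedule: beta grows linearly with the timestep."""
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T=1000, s=0.008):
    """Cosine schedule: define alpha_bar via a squared cosine, then back out betas."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)  # clip to avoid singularities near t = T
```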
The paper adopts the U-Net architecture (originally proposed for image segmentation) as its denoising network:
This model has a Bottleneck (a lower-dimensional layer) in the middle: Downsample-Blocks and Resnet-Blocks project the input image to lower resolutions, and Upsample-Blocks project it back to the original size at the output.
At certain resolutions, the authors added Attention-Blocks and inserted Skip-Connections between layers at the same resolution. One model is shared across all timesteps: the current timestep is encoded with the sinusoidal position embeddings from Transformers and projected into each Residual-Block. Combined with the schedule, the model can remove different amounts of noise at different timesteps to improve generation quality, which will be discussed in detail later.
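A sketch of the sinusoidal timestep embedding (the same construction as Transformer position encoding; the dimension and the frequency base 10000 are the usual conventions, not specific to this paper):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map integer timesteps to sinusoidal embeddings, as in Transformer position encoding."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]                    # (batch, dim // 2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

emb = timestep_embedding(torch.tensor([1, 500, 999]), dim=128)
```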
The concept of Bottleneck (bottleneck layer) was initially proposed and widely used in the unsupervised learning method "Autoencoder." As the lowest-dimensional hidden layer in the autoencoder architecture, it sits between the encoder and decoder, forming the narrowest part of the network. It forces the network to learn a compressed representation of the data that minimizes the reconstruction error, thereby acting as a form of regularization: $L_{\text{reconstruction}} = \lVert X - \text{Decoder}(\text{Encoder}(X)) \rVert^2$
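A minimal autoencoder sketch to make the bottleneck concrete (layer sizes are arbitrary; 784 assumes flattened 28x28 inputs):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # The 16-dim layer is the bottleneck: the narrowest point of the network
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
        self.decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Train by minimizing the reconstruction loss above, e.g. nn.MSELoss()(model(x), x)
```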
$x_t$ represents the image at timestep $t$, so $x_0$ is the original image. Put simply, smaller $t$ means less noise:
The final image is isotropic (uniform in all directions) pure noise, denoted $x_T$. In the initial studies $T = 1000$; subsequent work has reduced this significantly:
Forward Process: $q(x_t \mid x_{t-1})$, which takes the image $x_{t-1}$ and outputs a noisier image $x_t$:
Backward Process: $p(x_{t-1} \mid x_t)$, which takes the image $x_t$ and outputs a denoised image $x_{t-1}$, using a neural network:
In the forward distribution $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$: $\sqrt{1-\beta_t}\,x_{t-1}$ is the mean of the distribution. $\beta_t$ is the noise-schedule parameter, ranging from 0 to 1; the factor $\sqrt{1-\beta_t}$ scales the signal and shrinks as the timestep increases, representing the retained portion of the original signal.
$\beta_t I$ is the covariance matrix of the distribution, where $I$ is the identity matrix, meaning the covariance is diagonal and each dimension is independent; the amount of noise added grows as the timestep increases.
Now we would just need to iterate this step 1000 times to get the final result, but in fact all of these steps can be collapsed into a single one.
The reparameterization trick is very important in diffusion models and other generative models (such as Variational Autoencoders, VAE). Its core idea is to transform the sampling of a random variable into a deterministic function plus a standardized random variable. This transformation allows the model to be optimized through gradient descent, because it keeps the randomness of the sampling process out of the gradient computation.
Here is a simple example to explain its significance.
You can implement a dice roll in two ways:
The first way has randomness inside the function:
```python
import random

# 1. Directly rolling the dice (random sampling)
def roll_dice():
    return random.randint(1, 6)

result = roll_dice()
```
The second way separates randomness outside the function, making the function itself deterministic:
```python
import math
import random

# 2. Separating the randomness
random_number = random.random()  # Generate a random number between 0 and 1

def transformed_dice(random_number):
    # Map the random number from 0-1 to 1-6
    return math.floor(random_number * 6) + 1

result = transformed_dice(random_number)
```
In probability theory, we learned that if $X$ is a random variable with $X \sim \mathcal{N}(0, 1)$, then: $aX + b \sim \mathcal{N}(b, a^2)$
Thus, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, samples can be generated as follows:

$$x = \mu + \sigma \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, 1)$ is a standard normal random variable.
Similarly, for sampling from a normal distribution:
Without using reparameterization:
```python
import numpy as np

mu, sigma = 0.5, 2.0  # example parameters

# Directly sampling from the target normal distribution
x = np.random.normal(mu, sigma)
```
Using reparameterization:
```python
import numpy as np

mu, sigma = 0.5, 2.0  # example parameters

# First sample from the standard normal distribution
epsilon = np.random.normal(0, 1)
# Then obtain the target distribution through a deterministic transformation
x = mu + sigma * epsilon
```
In the context of gradient computation during model training:
Without using reparameterization:
```python
import numpy as np

def sample_direct(mu, sigma):
    return np.random.normal(mu, sigma)

# Here it is difficult to compute gradients with respect to mu and sigma,
# because the random sampling operation blocks gradient propagation
```
Using reparameterization:
```python
import numpy as np

def sample_reparameterized(mu, sigma):
    epsilon = np.random.normal(0, 1)  # Gradients do not need to propagate through here
    return mu + sigma * epsilon       # Gradients for mu and sigma can be computed directly
```
Taking VAE (Variational Autoencoder) as an example:
```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = Encoder()  # Outputs mu and sigma
        self.decoder = Decoder()

    def reparameterize(self, mu, sigma):
        # Reparameterization trick
        epsilon = torch.randn_like(mu)  # Sample from the standard normal distribution
        z = mu + sigma * epsilon        # Deterministic transformation
        return z

    def forward(self, x):
        # Encoder outputs mu and sigma
        mu, sigma = self.encoder(x)
        # Use reparameterization to sample
        z = self.reparameterize(mu, sigma)
        # Decoder reconstructs the input
        reconstruction = self.decoder(z)
        return reconstruction
```
Using the reparameterization trick on $q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$:

Since $\Sigma = \beta_t I$, we have $\sigma^2 = \beta_t$, so $\sigma = \sqrt{\beta_t}$.

We can express $x_t$ as a deterministic transformation of $x_{t-1}$ plus a noise term:

$$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon$$

Here, $\sqrt{1-\beta_t}\,x_{t-1}$ is the mean part and $\sqrt{\beta_t}\,\epsilon$ is the noise part. Since $\epsilon$ is a sample from the standard normal distribution and is independent of the model parameters, during backpropagation the gradient only needs to flow through the deterministic coefficients $\sqrt{1-\beta_t}$ and $\sqrt{\beta_t}$. This allows the model to be optimized effectively through gradient descent.
We simplify the notation by defining $\alpha_t = 1 - \beta_t$ and recording its cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:
We obtain:
$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon$$
Calculating the two-step transition from $x_{t-2}$ to $x_t$:
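Expanding the single-step update once and merging the two independent Gaussian noise terms (independent zero-mean Gaussians add, with their variances summing) collapses the chain; repeating the argument gives the closed form used below:

$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t}\left( \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2} \right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon} \\
    &= \cdots = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{aligned}
$$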
Since the variance is fixed and does not need to be learned (see 1.3), we only need the neural network to predict the mean:
Our ultimate goal is to predict the noise between two timesteps. Now let's start the analysis from the loss function:
$$-\log p_\theta(x_0)$$
However, the probability of $x_0$ in this negative log-likelihood depends on all the previous timesteps. As a solution, we can learn a model that approximates these conditional probabilities, which requires using the Variational Lower Bound to obtain a more tractable formula.
Suppose we have an uncomputable function $f(x)$ — in our scenario, the log-likelihood. If we can find a computable function $g(x)$ that always satisfies $g(x) \le f(x)$, then optimizing $g(x)$ also pushes $f(x)$ up:
We construct such a bound using the KL divergence, a measure of the difference between two distributions that is always non-negative:
$$D_{KL}(p \,\|\, q) = \int_x p(x) \log \frac{p(x)}{q(x)}\, dx$$
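As an aside, for two 1-D Gaussians the KL divergence has a closed form, which is part of what makes the Gaussian terms below tractable; a quick numeric sketch (values illustrative):

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return np.log(sigma2 / sigma1) + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5

print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0: identical distributions
print(kl_gaussians(1.0, 1.0, 0.0, 1.0))  # 0.5: always non-negative
```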
Subtracting an always-non-negative term guarantees a result no larger than the original function. Here we instead use "+": since we want to minimize the loss, adding the KL term to the negative log-likelihood gives an upper bound, and pushing that upper bound down also pushes the negative log-likelihood down:
In this form the negative log-likelihood is still present, so the bound remains uncomputable and we need a better expression. First, rewrite the KL divergence as the log of a ratio of two terms:
Rewriting the numerator of the summed term using Bayes' theorem: $$q(x_t \mid x_{t-1}) = \frac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$$
But this brings back the earlier problem: these terms require estimates over all samples, leading to high variance. For example, given the $x_t$ shown in the diagram below, it is hard to tell what the previous state looked like:
The improvement is to condition directly on the original data $x_0$:

$$\Longrightarrow \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
This way, providing the noise-free image narrows down the candidates for $x_{t-1}$, thus reducing the variance:
The first term in this form can be ignored because $q$ has no learnable parameters; it is just the forward noising process, which converges to a normal distribution, while $p(x_T)$ is just noise sampled from a Gaussian. This KL divergence term will therefore be very small.
The derivation of the remaining two terms gives the following result (process omitted; see Lilian Weng's blog):
The final form is the mean squared error between the actual noise at timestep $t$ and the noise predicted by the neural network. Researchers found that dropping the scaling term in front yields better sample quality and is simpler to implement.
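For reference, the simplified objective from the DDPM paper, whose symbols are explained below:

$$
L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\lVert \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t \right) \right\rVert^2 \right]
$$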
$\mathbb{E}_{t,\, x_0,\, \epsilon}$ denotes the expectation over the timestep $t$, the original data $x_0$, and the noise $\epsilon$
$\epsilon$ is the random noise that was actually added
$\epsilon_\theta$ is the noise predicted by the neural network
$\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is the closed-form solution of the forward process, i.e. the noisy data at timestep $t$, so the expression simplifies to:
$x_t$ directly denotes the noisy data at timestep $t$, i.e. $\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
The entire loss function essentially measures the mean squared error between the predicted noise and the actual noise.
Here the timestep $t$ is typically sampled from a uniform distribution, i.e. $t \sim \text{Uniform}(1, T)$, where $T$ is the total number of timesteps. This ensures that during training each timestep is selected with equal probability, allowing the model to learn the denoising process across all timesteps.
First, we sample images from the dataset, then sample $t$ and the noise from a normal distribution, and optimize the objective through gradient descent.
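A sketch of one training step under these definitions, assuming `model(x_t, t)` returns the predicted noise and `alpha_bar` is a length-$T$ tensor of cumulative products $\bar{\alpha}_t$ (names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T=1000):
    """One gradient step on the simplified DDPM objective."""
    t = torch.randint(1, T + 1, (x0.shape[0],))       # t ~ Uniform(1, T)
    eps = torch.randn_like(x0)                        # the actual noise that gets added
    ab = alpha_bar[t - 1].view(-1, 1, 1, 1)           # alpha_bar_t for each sample
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps      # closed-form forward process
    loss = F.mse_loss(model(x_t, t), eps)             # predicted noise vs. actual noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```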
First, sample $x_T$ from a normal distribution, then use the formula shown earlier to sample $x_{t-1}$ step by step through reparameterization.
Note that at $t=1$, no noise is added. According to the formula

$$x_0 = \frac{1}{\sqrt{\alpha_1}} \left( x_1 - \frac{\beta_1}{\sqrt{1-\bar{\alpha}_1}}\, \epsilon_\theta(x_1, 1) \right)$$
At $t=1$ this formula recovers $x_0$ from $x_1$, which is the last step of the denoising process. At this point we want to reconstruct the original image as faithfully as possible. Skipping the noise in the last step (i.e., omitting the $\sqrt{\beta_t}\,\epsilon$ term) avoids injecting unnecessary randomness into the final image, preserving its clarity and detail.
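A matching sampling-loop sketch (same assumed names as the training sketch; note the `t > 1` guard that skips the noise term on the final step, with $\sigma_t^2 = \beta_t$ as in DDPM):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alpha_bar):
    """DDPM sampling: start from pure noise x_T and denoise step by step."""
    alphas = 1 - betas
    x = torch.randn(shape)                            # x_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        eps = model(x, torch.full((shape[0],), t))
        mean = (x - betas[t - 1] / (1 - alpha_bar[t - 1]).sqrt() * eps) / alphas[t - 1].sqrt()
        if t > 1:
            x = mean + betas[t - 1].sqrt() * torch.randn_like(x)  # add sigma_t * z
        else:
            x = mean                                  # last step: no noise is added
    return x
```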