This post is translated from Chinese into English through AI.
AI-generated summary
The content discusses the theoretical foundations and advancements in diffusion models for image generation, primarily based on the paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." The forward diffusion process involves iteratively adding noise to images until they become completely noisy, while the reverse diffusion process uses a neural network to progressively denoise the images.
Key points include:
1. **Denoising Diffusion Probabilistic Models (DDPM)**: This paper significantly enhances the quality and efficiency of diffusion models by introducing a simple denoising network and optimized training strategies. It identifies three main prediction targets for the neural network: predicting the noise mean, predicting the original image, and predicting the added noise. The latter is chosen as the primary method due to its stability and effectiveness.
2. **Model Architecture**: The U-Net architecture is introduced, featuring a bottleneck layer and attention blocks, which help improve the model's performance by allowing it to process images at different resolutions and incorporate positional encoding for time steps.
3. **Mathematical Framework**: The forward and backward processes are defined mathematically, detailing how noise is added and removed from images. The reparameterization trick is highlighted as a crucial technique that allows for the optimization of the model by separating randomness from the deterministic function.
Overall, the document emphasizes the evolution of diffusion models, their theoretical underpinnings, and the architectural improvements that have led to better image synthesis outcomes.
This article mainly organizes and elaborates on the teaching logic of the video it is based on, together with the explanations given there. If you spot any errors, please feel free to point them out in the comments!
This paper lays the theoretical foundation for diffusion models. The authors propose a generative model based on nonequilibrium thermodynamics, which achieves data generation by gradually adding and removing noise. This provides important theoretical support for subsequent diffusion model research.
We apply a large amount of noise to an image and then use a neural network to denoise it. If this neural network learns well, it can start from completely random noise and ultimately produce images like those in our training data.
Noise is applied to the image iteratively; when there are enough steps, the image turns completely into noise. A normal distribution is used as the noise source:
This paper proposes Denoising Diffusion Probabilistic Models (DDPM), significantly improving the generation quality and efficiency of diffusion models. By introducing a simple denoising network and optimized training strategies, DDPM has become an important milestone in the field of diffusion models.
The authors discuss three targets that the neural network can predict:

1. **Predict the mean at each timestep**: that is, predict the mean of the conditional distribution $p_\theta(x_{t-1} \mid x_t)$. The variance is fixed and not learned.
2. **Predict $x_0$ directly**: directly predict the original, uncorrupted image. Experiments show that this option performs poorly.
3. **Predict the added noise**: predict the noise $\epsilon$ added during the forward process. This is mathematically equivalent to the first option (predicting the mean), just parameterized differently; the two can be converted into each other by a simple transformation.
The paper ultimately chooses predicting noise (the third method) as the main approach, as this method is more stable and performs better.
The amount of noise added at each step is not constant; it is controlled by a linear schedule, which keeps training from becoming unstable.
It looks something like this:
We can see that the last stretch of timesteps is already close to pure noise and carries very little information, and that, overall, the information is destroyed too quickly. OpenAI therefore used a cosine schedule to address these two issues:
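To make the two schedules concrete, here is a minimal sketch (not the exact code from either paper; $T = 1000$, a linear range of 1e-4 to 0.02, and $s = 0.008$ are the commonly cited defaults) comparing how much of the original signal survives under each:

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linearly increasing noise levels (the commonly used DDPM defaults)
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T=1000, s=0.008):
    # Cosine schedule (Improved DDPM): define alpha_bar with a squared cosine,
    # then recover beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)

# alpha_bar_t = fraction of the original signal that survives at timestep t
ab_linear = np.cumprod(1 - linear_beta_schedule())
ab_cosine = np.cumprod(1 - cosine_beta_schedule())
print(ab_linear[500], ab_cosine[500])  # the cosine schedule destroys information noticeably more slowly
```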
Along with this paper, the authors also released the model architecture they use, a U-Net:
This model has a bottleneck in the middle (a layer with a much lower-dimensional representation): Downsample-Blocks and ResNet-Blocks project the input image down to lower resolutions, and Upsample-Blocks project it back to the original size at the output.
At certain resolutions the authors added Attention-Blocks, and Skip-Connections link layers that operate in the same resolution space. The model is conditioned on the timestep: each timestep is encoded with the sinusoidal positional encoding used in the Transformer, and this embedding is projected into each Residual-Block. Combined with the schedule, the model can then remove different amounts of noise at different timesteps to improve generation quality, which will be discussed in detail later.
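To make the time conditioning concrete, here is a small sketch of such a sinusoidal timestep embedding (a hypothetical `timestep_embedding` helper, not the authors' exact code); in the real model the embedding is passed through an MLP and injected into each Residual-Block:

```python
import math
import torch

def timestep_embedding(t, dim=128):
    # Sinusoidal embedding of the timestep, as in the Transformer positional encoding:
    # sin/cos pairs at geometrically spaced frequencies
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

t = torch.randint(0, 1000, (8,))  # a batch of timesteps
emb = timestep_embedding(t)       # shape (8, 128)
print(emb.shape)
```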
The concept of the bottleneck was first proposed and widely used in the unsupervised learning method known as the autoencoder. As the lowest-dimensional hidden layer in the autoencoder architecture, it sits between the encoder and the decoder and forms the narrowest part of the network. This forces the network to learn a compressed representation of the data that minimizes the reconstruction error, which also acts as a form of regularization: $L_{\text{reconstruction}} = \| X - \mathrm{Decoder}(\mathrm{Encoder}(X)) \|^2$
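As a minimal sketch of this idea (toy layer sizes, assuming flattened 784-dimensional inputs such as MNIST; unrelated to any particular diffusion codebase):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        # The encoder compresses the input into the narrow bottleneck
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck_dim))
        # The decoder reconstructs the input from the bottleneck
        self.decoder = nn.Sequential(nn.Linear(bottleneck_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.randn(16, 784)
loss = ((x - model(x)) ** 2).mean()  # L_reconstruction = ||X - Decoder(Encoder(X))||^2
```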
$x_t$ denotes the image at timestep $t$, so $x_0$ is the original image. Simply put, the smaller $t$ is, the less noise the image contains:
The end result is isotropic (uniform in all directions) pure noise, denoted $x_T$. In the initial studies $T = 1000$; subsequent work has reduced this significantly:
Forward process: $q(x_t \mid x_{t-1})$; given the image $x_{t-1}$, it outputs a noisier image $x_t$:
Backward process: $p_\theta(x_{t-1} \mid x_t)$; given the image $x_t$, a neural network outputs a denoised image $x_{t-1}$:
The forward step is a Gaussian, $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$.

$\sqrt{1-\beta_t}\, x_{t-1}$ is the mean of the distribution. $\beta_t$ is the noise-schedule parameter, ranging between 0 and 1, and the factor $\sqrt{1-\beta_t}$ scales the previous image; it decreases as the timestep increases and represents the retained portion of the original signal.

$\beta_t I$ is the covariance matrix of the distribution; $I$ is the identity matrix, so the covariance is diagonal and the noise is independent across dimensions. Since $\beta_t$ grows with the timestep, more noise is added as $t$ increases.
Now we only need to iterate this step repeatedly to get the result after 1000 steps; in fact, though, all of these steps can be collapsed into a single one.
The reparameterization trick is very important in diffusion models and other generative models (such as Variational Autoencoders, VAEs). Its core idea is to turn the sampling of a random variable into a deterministic function applied to a standardized random variable. This transformation allows the model to be optimized through gradient descent, because it keeps the randomness of the sampling process from interfering with gradient computation.
Here is a simple example to explain its significance.
You can implement a dice roll in two ways:
The first way has randomness inside the function:
import random

# 1. Directly rolling the dice (random sampling)
def roll_dice():
    return random.randint(1, 6)

result = roll_dice()
The second way separates randomness outside the function, making the function itself deterministic:
import random
import math

# 2. Separating randomness
random_number = random.random()  # Generate a random number between 0 and 1

def transformed_dice(random_number):
    # Map the random number from 0-1 to 1-6
    return math.floor(random_number * 6) + 1

result = transformed_dice(random_number)
In probability theory, we learned that if $X$ is a random variable with $X \sim \mathcal{N}(0, 1)$, then $aX + b \sim \mathcal{N}(b, a^2)$.

Thus, for a normal distribution $\mathcal{N}(\mu, \sigma^2)$, samples can be generated as follows:

$$x = \mu + \sigma \cdot \epsilon$$

where $\epsilon \sim \mathcal{N}(0, 1)$ follows the standard normal distribution.
Similarly, for sampling from a normal distribution:
Without using reparameterization:
# Directly sampling from the target normal distribution
x = np.random.normal(mu, sigma)
Using reparameterization:
# First sampling from the standard normal distribution
epsilon = np.random.normal(0, 1)
# Then obtaining the target distribution through a deterministic transformation
x = mu + sigma * epsilon
In terms of the gradient computation involved in model training:
Without using reparameterization:
def sample_direct(mu, sigma):
    return np.random.normal(mu, sigma)

# In this case, it's difficult to calculate the gradient with respect to mu and sigma,
# because the random sampling operation blocks gradient propagation
Using reparameterization:
def sample_reparameterized(mu, sigma):
    epsilon = np.random.normal(0, 1)  # Gradient does not need to propagate through here
    return mu + sigma * epsilon       # Gradients for mu and sigma can be easily calculated
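To see this concretely in an autodiff framework (a small PyTorch sketch, not part of the original post), gradients do flow to mu and sigma when sampling is reparameterized:

```python
import torch

mu = torch.tensor(1.0, requires_grad=True)
sigma = torch.tensor(2.0, requires_grad=True)

epsilon = torch.randn(())   # the randomness lives outside the computation graph
x = mu + sigma * epsilon    # deterministic transformation of mu and sigma
x.backward()

print(mu.grad, sigma.grad)  # dx/dmu = 1, dx/dsigma = epsilon
```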
Taking VAE (Variational Autoencoder) as an example:
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.encoder = Encoder()  # Outputs mu and sigma
        self.decoder = Decoder()

    def reparameterize(self, mu, sigma):
        # Reparameterization trick
        epsilon = torch.randn_like(mu)  # Sample from the standard normal distribution
        z = mu + sigma * epsilon        # Deterministic transformation
        return z

    def forward(self, x):
        # Encoder outputs mu and sigma
        mu, sigma = self.encoder(x)
        # Use reparameterization to sample
        z = self.reparameterize(mu, sigma)
        # Decoder reconstructs the input
        reconstruction = self.decoder(z)
        return reconstruction
Using the reparameterization trick on $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$:

$$\because\ \Sigma = \beta_t I,\ \sigma^2 = \beta_t \quad \therefore\ \sigma = \sqrt{\beta_t}$$
We can express $x_t$ as a deterministic transformation of $x_{t-1}$ plus a noise term:
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon$$
Here, $\sqrt{1-\beta_t}\, x_{t-1}$ is the mean part and $\sqrt{\beta_t}\, \epsilon$ is the noise part. Since $\epsilon$ is a sample from the standard normal distribution and is independent of any parameters, during backpropagation the gradient only has to flow through the deterministic coefficients $\sqrt{1-\beta_t}$ and $\sqrt{\beta_t}$. This allows the model to be optimized effectively through gradient descent.
We simplify the notation by defining $\alpha_t = 1 - \beta_t$ and recording its cumulative product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$.
We obtain:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon$$
Calculating the two-step transition from $x_{t-2}$ to $x_t$, the two Gaussian noise terms merge into a single one, giving $x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\, \epsilon$; repeating this all the way back to $x_0$ yields the closed form $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$.
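As a quick numerical sanity check of this merging step (a sketch with arbitrary $\beta$ values, not code from the paper), two explicit noising steps produce the same distribution as one step that uses the product of the $\alpha$'s:

```python
import numpy as np

rng = np.random.default_rng(0)
beta_prev, beta_cur = 0.01, 0.02             # arbitrary schedule values for two consecutive steps
a_prev, a_cur = 1 - beta_prev, 1 - beta_cur  # alpha_{t-1}, alpha_t
x_tm2 = rng.normal(size=100_000)             # stand-in samples for x_{t-2}

# Two explicit steps: x_{t-2} -> x_{t-1} -> x_t
x_tm1 = np.sqrt(a_prev) * x_tm2 + np.sqrt(1 - a_prev) * rng.normal(size=x_tm2.shape)
x_t_two = np.sqrt(a_cur) * x_tm1 + np.sqrt(1 - a_cur) * rng.normal(size=x_tm2.shape)

# One merged step using the product alpha_t * alpha_{t-1}
a_prod = a_cur * a_prev
x_t_one = np.sqrt(a_prod) * x_tm2 + np.sqrt(1 - a_prod) * rng.normal(size=x_tm2.shape)

# Same spread (up to sampling noise): the two Gaussian steps collapse into one
print(x_t_two.std(), x_t_one.std())
```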
Since the variance is fixed and does not need to be learned (see 1.3), we only need the neural network to predict the mean:
Our ultimate goal is to predict the noise added between two consecutive timesteps. Now let's start the analysis from the loss function:
$$-\log p_\theta(x_0)$$
However, the probability of $x_0$ in this negative log-likelihood depends on all of the previous timesteps, which makes it intractable. The solution is to learn a model that approximates these conditional probabilities, and that requires using the variational lower bound to obtain a formula we can actually compute.
Suppose we have a function $f(x)$ that we cannot compute, which in our scenario is the negative log-likelihood. If we can find a computable function $g(x)$ that always satisfies $g(x) \le f(x)$, then optimizing (raising) $g(x)$ also pushes $f(x)$ up:
We obtain such a bound using the KL divergence, which measures how different two distributions are and is always non-negative:
$$D_{KL}(p \,\|\, q) = \int_x p(x)\, \log\frac{p(x)}{q(x)}\, dx$$
Subtracting a term that is always non-negative from a function gives a result that is never larger than the original; adding one gives a result that is never smaller. Here we use "+" because we want to minimize a loss: the negative log-likelihood plus the KL term is an upper bound on the negative log-likelihood, so pushing this bound down also pushes the negative log-likelihood down:
In this form the negative log-likelihood still appears, so the bound is still not computable; we need a better expression. First, rewrite the KL divergence as the log of a ratio of two terms:
Rewriting the numerator of the sum term using Bayes' theorem: $q(x_t \mid x_{t-1}) = \dfrac{q(x_{t-1} \mid x_t)\, q(x_t)}{q(x_{t-1})}$
But this brings us back to the earlier problem: these terms would have to be estimated over all samples, which leads to high variance. For example, given the $x_t$ shown below, it is difficult to tell what the previous state looked like:
The improvement is to additionally condition on the original data $x_0$:
$$\Longrightarrow \frac{q(x_{t-1} \mid x_t, x_0)\, q(x_t \mid x_0)}{q(x_{t-1} \mid x_0)}$$
This way, with the noise-free image provided, the set of plausible candidates for $x_{t-1}$ shrinks, and the variance decreases:
The first term in this form can be ignored: $q$ has no learnable parameters, since it is just the forward noising process, which converges to a normal distribution, and $p(x_T)$ is simply noise sampled from a Gaussian. Therefore this KL-divergence term is guaranteed to be very small.
The derivation of the remaining two terms gives the following result (process omitted; see Lilian Weng's blog):
The final form is the mean squared error between the actual noise at timestep $t$ and the noise predicted by the neural network. The researchers found that dropping the scaling term in front yields better sample quality and is simpler to implement:

$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,\ t\right) \right\|^2\right]$$
- $\mathbb{E}_{t,\, x_0,\, \epsilon}$ denotes the expectation over the timestep $t$, the original data $x_0$, and the noise $\epsilon$.
- $\epsilon$ is the random noise that was actually added.
- $\epsilon_\theta$ is the noise predicted by the neural network.
- $\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ is the closed-form solution of the forward process, i.e., the noisy data at timestep $t$, so it can simply be written as $x_t$.
The entire loss function essentially measures the mean squared error between the predicted noise and the actual noise.
In this case, timestep t is typically sampled from a uniform distribution (i.e., t∼Uniform(1,T), where T is the total number of timesteps). This choice ensures that during training, each timestep has an equal probability of being selected, allowing the model to effectively learn the denoising process across all timesteps.
First, we sample some images from the dataset, then sample a timestep $t$ (uniformly) and noise $\epsilon$ from a standard normal distribution, and optimize the objective through gradient descent.
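A rough sketch of one such training step (assuming a hypothetical `model(x_t, t)` that predicts the noise and a precomputed `alpha_bar` tensor of cumulative products; not the reference implementation):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T=1000):
    """One DDPM-style training step: sample t and noise, then regress the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # t ~ Uniform
    eps = torch.randn_like(x0)                             # the actually added noise
    ab = alpha_bar[t].view(b, 1, 1, 1)                     # alpha_bar_t, broadcast over image dims
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps   # closed-form forward process
    loss = F.mse_loss(model(x_t, t), eps)                  # || eps - eps_theta(x_t, t) ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```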
To generate an image, first sample $x_T$ from a normal distribution, and then repeatedly sample $x_{t-1}$ from $x_t$ using the formula shown earlier, via reparameterization.
Note that at $t = 1$, no noise is added. According to the formula

$$x_0 = \frac{1}{\sqrt{\alpha_1}}\left(x_1 - \frac{\beta_1}{\sqrt{1-\bar\alpha_1}}\, \epsilon_\theta(x_1, 1)\right)$$
At $t = 1$, this formula recovers $x_0$ from $x_1$, which is the last step of the denoising process. At this point we want to reconstruct the original image as accurately as possible. Not adding noise in the final step (i.e., leaving out the $\sqrt{\beta_t}\, \epsilon$ term) avoids introducing unnecessary randomness into the generated image, preserving its clarity and detail.
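A corresponding sketch of the sampling loop (same hypothetical `model`, with `betas`, `alphas`, and `alpha_bar` as precomputed 1-D tensors), which skips the noise term at the very last step as just described:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bar):
    """DDPM-style ancestral sampling: start from pure noise and denoise step by step."""
    T = len(betas)
    x = torch.randn(shape)                                         # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                    # predicted noise
        coef = betas[t] / torch.sqrt(1 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])            # predicted mean of x_{t-1}
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)  # sigma_t^2 = beta_t
        else:
            x = mean                                               # final step: no noise added
    return x
```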