in diffusion models, why do we train to remove all the noise? | Learn AI Together | Page 1

past scroll Jun 10, 2023, 3:44 PM

#

From a denoising perspective, my understanding is that in diffusion training we add a uniformly distributed amount of noise and train the network to remove all of it. But my intuition is that the network might specialize better if we made the target noisy as well, just a bit less noisy than the input, so that every example would require removing approximately the same amount of noise.
There is probably a reason why people don't do it this way, but I couldn't figure it out from the literature. I wonder if someone has some intuition.

Thanks!

wide quiver Jun 10, 2023, 9:03 PM

#

@past scroll my intuition would be that learning how to predict/remove noise from t=0.2 down to t=0.1 probably looks quite different from doing so at t=0.9 down to t=0.8. (where t=0 is no noise, t=1 is full noise) So you want the network to be able to behave differently depending on t, not just always trying to denoise by a fixed amount at any given place along the diffusion journey. Does that make sense?

For example, when I'm down at t=0.1 and below, I'm likely to learn something different about edges and how to refine them, vs. in the early stages of denoising when trying to refine edges wouldn't make sense

formal panther Jun 10, 2023, 9:04 PM

#

It was my understanding that diffusion models are trained that way. It’s a next-step prediction problem, i.e. given the input with noise at step T+1, predict the input with less noise at step T, with T_0 meaning no noise added

wide quiver Jun 10, 2023, 9:09 PM

#

@formal panther I also thought that, but my read of the position embeddings section here https://huggingface.co/blog/annotated-diffusion#position-embeddings suggests that the loss function is actually predicting the noise from t=0 to a randomly selected amount of noise that's added to the image, but with position embeddings allowing the network to know how far through the forward diffusion process it is when it's trying to separate signal from noise

#

(I'm a little fuzzy on diffusion models - is that an acceptable pun? - so would def be happy to be corrected on this!)
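To make the objective described above concrete, here is a minimal sketch of how one training example could be built: pick a timestep uniformly, jump from the clean image straight to that noise level in closed form, and use the added noise as the regression target. The function name `make_training_example`, the toy 3-value "image", and the linear beta schedule (1e-4 to 0.02 over 1000 steps) are illustrative assumptions, not code from the linked post.

```python
import math
import random

def make_training_example(x0, alpha_bars, rng):
    """Build one (noisy input, timestep, noise target) triple (hypothetical helper)."""
    t = rng.randrange(len(alpha_bars))       # timestep index, sampled uniformly
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # standard Gaussian noise
    a = alpha_bars[t]                        # cumulative signal fraction at step t
    # Closed-form jump from the clean x0 to the noisy x_t (no step-by-step loop).
    x_t = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x0, eps)]
    return x_t, t, eps                       # network sees (x_t, t); target is eps

# Assumed linear beta schedule, used here only for illustration.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]                       # stand-in for a flattened image
x_t, t, eps = make_training_example(x0, alpha_bars, rng)
```

Note that the target here is the full noise `eps` that separates `x_t` from `x0`, not the small increment between step t and step t-1.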

copper bramble Jun 11, 2023, 1:22 AM

#

your understanding is probably not accurate. i've spent the last few days learning LDM; it works by predicting the noise (not the noisy image or the original image) and removing it using DDIM

#

and the noise is not uniformly distributed, but normally distributed

#

even for DDPM you don't remove the same amount of noise in each iteration, beta_t is often not a constant
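A quick way to see the point about beta_t is to tabulate a schedule. The sketch below assumes a linear schedule from 1e-4 to 0.02 over 1000 steps (a common choice, but an assumption here) and prints the per-step variance next to the cumulative noise variance at a few timesteps.

```python
# Assumed linear beta schedule; the endpoints and T are illustrative.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar = 1.0
noise_var = []                      # cumulative noise variance, 1 - alpha_bar_t
for beta in betas:
    alpha_bar *= 1.0 - beta
    noise_var.append(1.0 - alpha_bar)

for t in (0, 499, 999):
    print(f"t={t:4d}  beta_t={betas[t]:.5f}  1 - alpha_bar_t={noise_var[t]:.5f}")
```

The per-step variance grows with t, so the amount of noise added (and removed) per step is not constant along the chain.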

past scroll Jun 13, 2023, 12:51 PM

#

The noise is Gaussian, but the amount of noise (its variance) is set by the timestep t, which is sampled uniformly

past scroll Jun 13, 2023, 12:55 PM

#

wide quiver my intuition would be that learning how to predict/remove...

You can still give 't' to the network. I agree that the task is not exactly the same at different values of t, but the same is true of removing all the noise, which is what people do now. I think removing the same amount each time could teach the network more efficiently.

wide quiver Jun 13, 2023, 6:10 PM

#

past scroll You can still give 't' to the network. I agree that it is not exactly the same a...

I haven't fully grokked this, but looking at the math in the DDPM paper (https://arxiv.org/pdf/2006.11239.pdf) section 3.2 Reverse process, I think in each learning step it is actually calculating the probability distribution of the image at t-1 given the image at t. That is, it's always trying to remove one step of noise, even though it's trained stochastically over all values of t.

The highlighted text is defining the model to be learned as a gaussian function of the image at t-1 that equals the conditional distribution of the image at t-1 given the image at t. So, what gaussian distribution, with input parameters (image at t, t) denoises from image at t to image at t-1. (x_t means image at step t)

I don't fully understand why the highlighted RHS starts with "N(x_(t-1); ..." instead of "N(x_t; ...", any help on that would be great

copper bramble Jun 13, 2023, 10:27 PM

#

i believe "N(x_{t-1};.." is just another way of saying "x_{t-1} ~N(...)"

past scroll Jun 14, 2023, 4:13 AM

#

wide quiver I haven't fully grokked this, but looking at the math in the DDPM paper (https:/...

I don't fully understand it, but it seems to me that this is the sampling step, not the learning step. In my understanding the learning objective in this paper is to predict the entire noise.
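The distinction can be sketched in code: at sampling time the predicted noise is plugged into a formula for the mean of the one-step reverse Gaussian, and a sample of the previous image is drawn around it. The sketch below follows the sampling update in the DDPM paper with the choice sigma_t^2 = beta_t; the function name `reverse_step`, the 0-based timestep indexing, and the use of the true noise in place of a trained network are assumptions for illustration.

```python
import math
import random

def reverse_step(x_t, t, eps_pred, betas, alpha_bars, rng):
    """One reverse (sampling) step from x_t to x_{t-1}; t is a 0-based index."""
    beta = betas[t]
    alpha = 1.0 - beta
    coef = beta / math.sqrt(1.0 - alpha_bars[t])
    # Mean of the reverse Gaussian, written in terms of the predicted noise.
    mean = [(xi - coef * ei) / math.sqrt(alpha) for xi, ei in zip(x_t, eps_pred)]
    if t == 0:
        return mean                 # last step: no fresh noise is added
    sigma = math.sqrt(beta)         # assumed variance choice sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

# Assumed linear beta schedule, for illustration only.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bars.append(prod)

rng = random.Random(0)
x0 = [0.5, -0.25]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
a0 = alpha_bars[0]
x_1 = [math.sqrt(a0) * xi + math.sqrt(1.0 - a0) * ei for xi, ei in zip(x0, eps)]

# With a perfect noise prediction, the final reverse step recovers x0.
x_rec = reverse_step(x_1, 0, eps, betas, alpha_bars, rng)
```

So the network's training target can be the full noise, while the sampler still only moves one step at a time.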

wide quiver Jun 14, 2023, 4:10 PM

#

copper bramble `i believe "N(x_{t-1};.." is just another way of saying "x_{t-1} ~N(...)"`

thanks, I think what I find confusing is how that relates to the LHS. it currently reads like p_theta(x_(t-1) | x_t) = x_(t-1) ~ N(...

but clearly it's not true that p_theta(x_(t-1) | x_t) is equal to x_(t-1), right? nor can it be equal to a constraint on the distribution of x_(t-1) ?

#

it could be equal to a function of x_(t-1), but it would be more intuitive if it was a function of x_t, no?

copper bramble Jun 14, 2023, 6:06 PM

#

I was not very clear in my description. x_{t-1} is a variable, whereas N(..) is a distribution. It is meant to be read as LHS = N(..), which is the distribution that x_{t-1} follows
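Written out, the notation under discussion is usually read as a density in its first argument, with everything after the semicolon acting as parameters. The expansion below is the standard multivariate Gaussian density; d (the dimension of the image vector) is a symbol introduced here for illustration.

$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
$$

$$
\mathcal{N}(x;\ \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\!\Big(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)
$$

Under this reading, x_t enters only through the mean and covariance, and x_{t-1} is the point at which the density is evaluated. That is also equivalent to saying x_{t-1}, given x_t, is distributed as that Gaussian.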

#

https://stats.stackexchange.com/questions/301382/what-is-the-semicolon-notation-in-joint-probability

#

https://datascience.eu/mathematics-statistics/probability-concepts-explained-maximum-likelihood-estimation/

wide quiver Jun 14, 2023, 7:18 PM

#

thanks for helping me with this! can you clarify what you mean by "which is a distribution followed by x_(t-1)"?, with that phrasing, it seems like it's just a coincidence, or that you're defining two things at once, e.g. LHS = N(...), and btw, x_(t-1) also follows that distribution. I don't quite get the significance that x_(t-1) has to the definition of the LHS. does my confusion make sense?

this related post https://stats.stackexchange.com/questions/592778/what-does-the-notation-that-is-commonly-used-in-diffusion-models-mean-mathcal?noredirect=1&lq=1 suggests it isn't that x_(t-1) follows the N distribution, but rather that the LHS is being defined as equal to N(...) computed at x_(t-1). But practically speaking I don't really know how that affects e.g. how you would compute that value. If I'm reading the link correctly, N(mu(...), sigma(...)) is a probability density function whose parameters depend on x_t, so it produces a different distribution for every possible image, and we are evaluating it at x_(t-1).

sorry this is so long, I really appreciate your help!

copper bramble Jun 14, 2023, 7:48 PM

#

"which is a distribution followed by x_{t-1}" the "which" here refers to N(...). I think it would even be fine to omit x_{t-1}; inside N(x_{t-1};...) without losing its original meaning

#

tbh, i dont speak english that well and i fail to see the difference between x_{t-1} follows the N(...) and N(...) computed at x_{t-1}.

wide quiver Jun 15, 2023, 10:31 AM

#

oh interesting. so for me if we took a standard normal distribution with mean 0 and standard deviation 1, "x follows the N(0,1) distribution" means that x is a random variable drawn from N(0,1). whereas computing N(0,1) at x would answer "what is the probability density of N(0,1) at x?", so if x=0.5, computing N(0,1) at x would give about 0.35 (like this https://www.wolframalpha.com/input?i=gaussian+distribution+with+mean%3D0%2C+standard+deviation%3D1.+at+x%3D0.5)

so to me it's the difference between calculating the probability density at a certain point, vs. stating that a random variable is drawn from a certain probability density function
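The two readings can be put side by side in a few lines of Python. This is a small sketch using only the standard library; `normal_pdf` is a helper name made up for this example.

```python
import math
import random

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution evaluated at the point x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Reading 1: evaluate the N(0, 1) density at a fixed point.
density = normal_pdf(0.5)
print(round(density, 4))  # 0.3521

# Reading 2: draw a random value that follows N(0, 1).
sample = random.Random(0).gauss(0.0, 1.0)
print(sample)
```

The first call returns a fixed number for a given x, while the second returns a different value depending on the random seed.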