#in diffusion models, why do we train to remove all the noise?

past scroll
#

In a denoising perspective, my understanding is that in diffusion training we add a uniformly distributed amount of noise and train to remove it. But my intuition is that the network might specialize better if we put the target to be also noisy, but a bit less, such that every example would need to remove approximately the same amount of noise.
There is probably a reason why people don't do it this way, but I couldn't figure it out from the literature. I wonder if someone has some intuition.

Thanks!

wide quiver
#

@past scroll my intuition would be that learning how to predict/remove noise from t=0.2 down to t=0.1 probably looks quite different from doing so at t=0.9 down to t=0.8. (where t=0 is no noise, t=1 is full noise) So you want the network to be able to behave differently depending on t, not just always trying to denoise by a fixed amount at any given place along the diffusion journey. Does that make sense?

For example, when I'm down at t=0.1 and below, I'm likely to learn something different about edges and how to refine them, vs. in the early stages of denoising when trying to refine edges wouldn't make sense

formal panther
#

It was my understanding that diffusion models are trained that way. It's a next-step prediction problem, i.e. given the input with noise at step T+1, predict the input with less noise at step T, with T_0 meaning no noise added

wide quiver
#

(I'm a little fuzzy on diffusion models - is that an acceptable pun? - so would def be happy to be corrected on this!)

copper bramble
#

your understanding is probably not accurate. i've spent the last few days learning LDM; it works by predicting the noise (not the noisy image or the original image) and removing it using DDIM

#

and the noise is not uniformly distributed, but normally distributed

#

even for DDPM you don't remove the same amount of noise in each iteration, beta_t is often not a constant
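To illustrate the point about beta_t, here is a minimal sketch of a linear variance schedule. The endpoints (1e-4 to 0.02 over T=1000 steps) are the values reported in the DDPM paper; the function names `linear_betas` and `alpha_bars` are made up for this example. It suggests that the noise added per step, and the total noise accumulated by step t, both change along the chain.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

betas = linear_betas()
abar = alpha_bars(betas)

# Standard deviation of the total noise present in x_t: sqrt(1 - alpha_bar_t).
noise_std = [math.sqrt(1.0 - a) for a in abar]

for t in (0, 249, 499, 999):
    print(f"step {t + 1:4d}: beta={betas[t]:.5f}  total noise std={noise_std[t]:.4f}")
```

The first step adds very little noise and the last steps add the most, so "one step" is not a fixed amount of denoising.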

past scroll
#

The noise is gaussian, but the variance (amount of noise) is uniformly distributed
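A rough sketch of how one training example could be built under the DDPM formulation, to make the two statements above concrete: the noise itself is Gaussian, the timestep t is drawn uniformly, and the regression target is the noise rather than a slightly less noisy image. The linear schedule and the helper names (`make_alpha_bars`, `make_training_pair`) are assumptions for illustration, and the "image" is just a short list of numbers.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for a linear beta schedule (assumed, as in the DDPM paper)."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + i * (beta_end - beta_start) / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def make_training_pair(x0, alpha_bars, rng):
    """Return (x_t, t, eps): noisy input, timestep index, and the noise target."""
    t = rng.randrange(len(alpha_bars))        # timestep drawn uniformly (0-indexed)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise, same shape as x0
    a = alpha_bars[t]
    # Closed-form forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, t, eps

rng = random.Random(0)
abar = make_alpha_bars()
x0 = [0.5, -1.0, 0.25]                        # toy stand-in for an image
x_t, t, eps = make_training_pair(x0, abar, rng)
# A network would be trained so that model(x_t, t) is close to eps.
```

Note that t is uniform but the resulting noise variance 1 - alpha_bar_t is not, because it depends on the schedule.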

wide quiver
#

(in reply to past scroll: "You can still give 't' to the network. I agree that it is not exactly the same a...")

I haven't fully grokked this, but looking at the math in the DDPM paper (https://arxiv.org/pdf/2006.11239.pdf) section 3.2 Reverse process, I think in each learning step it is actually calculating the probability distribution of the image at t-1 given the distribution at t. That is, it's always trying to remove one step of noise, even though it's trained stochastically over all values of t.

The highlighted text is defining the model to be learned as a gaussian function of the image at t-1 that equals the conditional distribution of the image at t-1 given the image at t. So the question is: what gaussian distribution, with input parameters (image at t, t), denoises from the image at t to the image at t-1? (x_t means image at step t)

I don't fully understand why the highlighted RHS starts with "N(x_(t-1); ..." instead of "N(x_t; ...", any help on that would be great

copper bramble
#

i believe "N(x_{t-1};.." is just another way of saying "x_{t-1} ~N(...)"
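For reference, the reverse-process definition from the DDPM paper written out, along with the two readings of the notation being discussed. The symbol d (the dimension of x) is introduced here only to write the density explicitly.

```latex
% Reverse-process step as defined in the DDPM paper:
p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)

% Reading 1: a statement about how x_{t-1} is distributed given x_t
x_{t-1} \mid x_t \ \sim\ \mathcal{N}\big(\mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)

% Reading 2: the Gaussian density with mean \mu and covariance \Sigma, evaluated at the point x
\mathcal{N}(x;\ \mu,\ \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2}
  \exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\Big)
```

The argument before the semicolon is the variable whose density is being described, which is why it is x_{t-1}; x_t only enters through the mean and covariance.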

wide quiver
#

it could be equal to a function of x_(t-1), but it would be more intuitive if it was a function of x_t, no?

copper bramble
#

I was not very clear in my description: x_{t-1} is a variable, whereas N(..) is a distribution. It is meant to be interpreted as LHS = N(..), which is a distribution followed by x_{t-1}

#

Introduction: In this post I'll explain what the maximum likelihood method for parameter estimation is and go through a simple example to demonstrate the method. Some of the content requires knowledge of fundamental probability concepts such as the definition of probability and independence of events. I've written a blog post with these prerequisites...

wide quiver
#

thanks for helping me with this! can you clarify what you mean by "which is a distribution followed by x_(t-1)"?, with that phrasing, it seems like it's just a coincidence, or that you're defining two things at once, e.g. LHS = N(...), and btw, x_(t-1) also follows that distribution. I don't quite get the significance that x_(t-1) has to the definition of the LHS. does my confusion make sense?

this related post https://stats.stackexchange.com/questions/592778/what-does-the-notation-that-is-commonly-used-in-diffusion-models-mean-mathcal?noredirect=1&lq=1 suggests it isn't that x_(t-1) follows the N distribution, but rather that the LHS is being defined as equal to N(...) evaluated at x_(t-1). But practically speaking I don't really know how that affects e.g. how you would compute that value. If I'm reading the link correctly, N(mu(...), sigma(...)) is a probability density function that is different for every possible image at t, and we are evaluating it at x_(t-1).

sorry this is so long, I really appreciate your help!

copper bramble
#

"which is a distribution followed by x_{t-1}" the "which" here refers to N(...). I think it would even be fine to omit x_{t-1}; inside N(x_{t-1};...) without losing its original meaning

#

tbh, i dont speak english that well and i fail to see the difference between x_{t-1} follows the N(...) and N(...) computed at x_{t-1}.

wide quiver
#

oh interesting. so for me if we took a standard normal distribution with mean 0 and standard deviation 1, "x follows the N(0,1) distribution" means that x is a random variable drawn from N(0,1). whereas computing N(0,1) at x would answer "what is the probability density of N(0,1) at the point x?", so if x=0.5, computing N(0,1) at x would be about 0.35 (like this https://www.wolframalpha.com/input?i=gaussian+distribution+with+mean%3D0%2C+standard+deviation%3D1.+at+x%3D0.5)

so to me it's the difference between calculating the probability density at a certain point, vs. stating that a random variable is drawn from a certain probability density function
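A small sketch of that difference using Python's standard library; the variable names are just for illustration. The first part evaluates the N(0,1) density at a chosen point, and the second draws random values that follow N(0,1).

```python
import random
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# "N(0,1) computed at x": evaluate the density function at a fixed point.
density_at_half = std_normal.pdf(0.5)
print(f"density at x=0.5: {density_at_half:.4f}")

# "x follows N(0,1)": x is a random draw, different each time it is sampled.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(5)]
print("five draws:", [round(d, 3) for d in draws])
```

In the diffusion notation, N(x_{t-1}; mu, Sigma) is the first kind of object: a density written as a function of x_{t-1}. Saying x_{t-1} is distributed according to it is the second.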