From a denoising perspective, my understanding is that in diffusion training we corrupt each example with a noise level sampled uniformly at random and train the network to remove that noise. My intuition is that the network might specialize better if the target were also noisy, just slightly less so than the input, so that every example requires removing approximately the same amount of noise.
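To make the comparison concrete, here is a minimal sketch of the two ways of building a training pair. It assumes a simple additive-noise setup (`x_t = x_0 + sigma * eps`) with `sigma` drawn uniformly, and it reuses the same `eps` for input and target in the second variant. The function names, `sigma_max`, and `delta` are hypothetical, chosen only for illustration and not taken from any particular paper or library.

```python
import numpy as np

def standard_pair(x0, rng, sigma_max=1.0):
    """Usual setup (as I understand it): noisy input, clean target."""
    sigma = rng.uniform(0.0, sigma_max)      # noise level sampled uniformly
    eps = rng.standard_normal(x0.shape)      # Gaussian noise
    x_in = x0 + sigma * eps
    return x_in, x0, sigma                   # target is the clean sample

def fixed_step_pair(x0, rng, sigma_max=1.0, delta=0.1):
    """Proposed variant: target is the same sample, slightly less noisy."""
    sigma = rng.uniform(delta, sigma_max)    # keep sigma - delta >= 0
    eps = rng.standard_normal(x0.shape)      # same eps for input and target (assumption)
    x_in = x0 + sigma * eps
    x_target = x0 + (sigma - delta) * eps    # one small step less noise
    return x_in, x_target, sigma

rng = np.random.default_rng(0)
x0 = np.zeros(4)                             # stand-in for a clean data sample
x_in, x_target, sigma = fixed_step_pair(x0, rng)
```

In the second variant the input and target differ by `delta * eps` regardless of `sigma`, so every example asks the network to remove the same amount of noise, whereas in the first the amount to remove scales with `sigma`.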
There is probably a reason why people don't do it this way, but I couldn't figure it out from the literature. Does anyone have some intuition?
Thanks!