#variable-sized inputs,outputs for embedders and autoencoders


outer stirrup
#

Hi all!
I am trying to port Stable Diffusion to music generation. I think I have a good general grip on how it works and how to start, but there is one point I am uncertain about: how do autoencoders, or embedders, accept and output variable-sized tensors?
That is, how does stable diffusion output both 512x512 images, but also 712x1024 etc?

So far, I've found two answers ( https://ai.stackexchange.com/questions/2008/how-can-neural-networks-deal-with-varying-input-sizes and https://discuss.pytorch.org/t/how-to-create-convnet-for-variable-size-input-dimension-images/1906/6 ): the first is simply to have a bunch of conv and transposed-conv layers to reach the right size (or use https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html ), and the second is something with recurrent/recursive networks (basically embed with another embedding as the input?)

Is there a standard convention for this type of thing, or a recommended approach?

brave scroll
#

As I understand it, the SD encoder module takes an image as input and outputs a latent tensor whose size depends on the input size, which has to be a multiple of some number.
Say, for instance, you input a 32x32 image: the encoder will give you a 1024x1x1 latent (I'm making the numbers up).
1024 would be the embedding size, and 1x1 is there because 32x32 is the smallest size that the convolutional encoder can take in.
If the input were 64x64 instead, then the latent would be 1024x2x2.
The decoder does the same thing in reverse: from a 1024xHxW latent, it outputs an (H*32)x(W*32) image.
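
To make that arithmetic concrete, here is a small plain-Python sketch. The factor 32 and the embedding size 1024 are the made-up numbers from this message, not the real Stable Diffusion values, and `latent_shape` / `decoded_shape` are hypothetical helper names used only for illustration.

```python
def latent_shape(height, width, factor=32, channels=1024):
    """Shape (channels, H, W) of the latent for a given input image size.

    `factor` and `channels` are illustrative values, not real model constants.
    """
    if height % factor or width % factor:
        raise ValueError(f"input size must be a multiple of {factor}")
    return (channels, height // factor, width // factor)


def decoded_shape(latent_h, latent_w, factor=32):
    """Image size (height, width) a decoder with this factor would produce."""
    return (latent_h * factor, latent_w * factor)


print(latent_shape(32, 32))  # (1024, 1, 1)
print(latent_shape(64, 64))  # (1024, 2, 2)
print(decoded_shape(2, 3))   # (64, 96)
```

The embedding size stays fixed while the spatial part of the latent grows with the input. For music, the same idea would apply with a single temporal dimension instead of two spatial ones.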

#

so for music, you would have a latent with one embedding dimension and one spatial (or rather temporal) dimension

#

The diffusion process happens in the latent space using a U-Net built from convolution and attention (Transformer) blocks, which don't care about the size of the spatial or temporal dimensions, so it can generalize to any number of "spatio-temporal items" at inference time (after training).
If you want to predict a bigger image/song than the input, you simply initialize a bigger latent by adding padding along the desired spatio-temporal dimension(s).

outer stirrup
brave scroll
# outer stirrup Ah I see. Is this variable length achieved via a sort of recursive mechanism, or...

I'm simplifying a bit, but a convolution is a sliding-window mechanism: as long as the input image is at least as large as the kernel, you can slide the window over it, and the number of window positions depends on the size of the input.
In general, if you have a convolutional model trained on images of some size WxH, it will at least work for integer multiples of this size, k1*W x k2*H.
There is no need for recurrence to get variably sized outputs: you just feed a bigger latent to the decoder and it will decode a bigger image.
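
A minimal sketch of the sliding window in one dimension, in plain Python rather than PyTorch so it runs without dependencies. `conv1d` is a hypothetical helper and the kernel weights are arbitrary; like deep-learning "convolutions", it does not flip the kernel.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` with stride 1 and no padding."""
    k = len(kernel)
    if len(signal) < k:
        raise ValueError("input is shorter than the kernel")
    n_windows = len(signal) - k + 1  # depends only on the input length
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(n_windows)
    ]


kernel = [1, 0, -1]  # the same fixed weights are reused for every input

short = conv1d([1, 2, 3, 4], kernel)
long = conv1d([1, 2, 3, 4, 5, 6, 7, 8], kernel)
print(len(short), short)  # 2 [-2, -2]
print(len(long), long)    # 6 [-2, -2, -2, -2, -2, -2]
```

The layer's parameters (the three kernel weights) never change; only the number of window positions does, which is why one call handles inputs of different lengths.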

outer stirrup
# brave scroll I'm simplifying a bit but a convolution is a sliding window mechanism, as long a...

> you can slide the window multiple times, the number of windows will depend on the size of the input.

Ah! So it sounds like I understood incorrectly, and the "sliding multiple times" actually happens inside a single convolution layer rather than across multiple convolution layers stacked one over the other?
It sounds like I misunderstood how convolution layers work; I thought they were only doing one pass. I will have to look into it more.

#

Thanks a lot for your guidance!

outer stirrup
# brave scroll I'm simplifying a bit but a convolution is a sliding window mechanism, as long a...

Ah! So if I understand you correctly, you mean something like

```python
.....
    self.convolution = nn.Conv2d(1, 1, 3)
    # Other convolution layers
.....

    def forward(self, x):
        for i in .....:
            x = self.convolution(x)
        # rest of layers
```

That is, we keep using the same layer multiple times?


> There is no need for recurrence to get variably sized outputs, you just feed a bigger latent to the decoder and it will decode a bigger image.
Ah, you mean with https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html ? It does indeed seem to just multiply the input size!
#

I think I'm starting to form a better model of what convolution exactly is

brave scroll
# outer stirrup Ah! So if I understand you correctly, you mean something like ```python ..... ...

No, you don't apply the same convolution layer repeatedly. The convolution layer itself accepts variably sized inputs and returns a smaller (or bigger, in the case of ConvTranspose) output.
In convolutional neural networks, the input goes through multiple fixed convolution layers, so the size is usually reduced quite a bit. You have to feed the model a large enough image to begin with, or else you might reach a convolution layer that can't work with an input that is too small.
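
A rough sketch of how the size changes through a stack of layers, using the output-size formulas from the PyTorch `Conv2d` and `ConvTranspose2d` docs (dilation 1, no output padding). The three-layer encoder with kernel 4, stride 2, padding 1 is a made-up configuration, and the helper names are hypothetical.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension of a standard convolution."""
    if n + 2 * padding < kernel:
        raise ValueError(f"input of size {n} is too small for kernel {kernel}")
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(n, kernel, stride=1, padding=0):
    """Output length along one dimension of a transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel


# Hypothetical encoder: each (kernel, stride, padding) layer halves the size.
ENCODER_LAYERS = [(4, 2, 1)] * 3


def encoder_out_size(n):
    """Push a size through every encoder layer in turn."""
    for kernel, stride, padding in ENCODER_LAYERS:
        n = conv_out_size(n, kernel, stride, padding)
    return n


print(encoder_out_size(64))                 # 8
print(encoder_out_size(128))                # 16
print(conv_transpose_out_size(8, 4, 2, 1))  # 16

try:
    encoder_out_size(4)  # 4 -> 2 -> 1, then the third layer has too little input
except ValueError as err:
    print("error:", err)
```

The same fixed layers give an 8-wide output for a 64-wide input and a 16-wide output for a 128-wide input, while a 4-wide input runs out of pixels before the last layer.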

outer stirrup
brave scroll
# outer stirrup Hmm... What I'm confused about is that i thought that the latent space needed to...

To explain this I have to get a little bit into the details, so:
If you have a 2D latent tensor with dimensions (batch, features), a linear layer will only accept a tensor with the exact number of input features for which it was defined. But it will accept any batch size as input, since it processes all elements of the batch in parallel and independently.
Similarly, if you have a 4D tensor with 2 spatial dimensions like this: (batch, x_dim, y_dim, features), a linear layer will again only accept a tensor with the matching number of features, but will accept any size in the other dimensions, as those positions are all processed independently. The issue in this case is that the linear layer won't be able to process the spatial information. Imagine doing the same operation in parallel on each pixel of an image: it would not be very powerful.

Now, convolutions are a bit different because they work with a sliding window, so with your (batch, x_dim, y_dim, features) tensor, the sliding window gathers neighboring cells from the x_dim and y_dim dimensions. Each cell is not processed independently, since its output depends on the neighboring cells.
But since it is a sliding-window mechanism, and the window has a fixed size and so doesn't see the whole image at once, the layer also accepts different sizes along the x and y dimensions. The number of features has to be fixed, though.
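
Here is a small plain-Python sketch of the linear-layer half of that explanation, assuming a single output unit and a nested-list grid laid out as (x_dim, y_dim, features). The names `linear_unit`, `apply_pointwise`, and `make_grid` are hypothetical.

```python
def linear_unit(features, weights):
    """One output unit of a linear layer: a dot product over the feature axis."""
    if len(features) != len(weights):
        raise ValueError("wrong number of input features")
    return sum(w * f for w, f in zip(weights, features))


def apply_pointwise(grid, weights):
    """Apply the same linear unit independently to every cell of the grid."""
    return [[linear_unit(cell, weights) for cell in row] for row in grid]


def make_grid(x_dim, y_dim, n_features):
    """Build an (x_dim, y_dim, n_features) grid filled with ones."""
    return [[[1.0] * n_features for _ in range(y_dim)] for _ in range(x_dim)]


weights = [1.0, 2.0, 3.0]  # defined for exactly 3 input features

out_small = apply_pointwise(make_grid(2, 2, 3), weights)
out_large = apply_pointwise(make_grid(4, 5, 3), weights)
print(len(out_small), len(out_small[0]))  # 2 2
print(len(out_large), len(out_large[0]))  # 4 5

try:
    apply_pointwise(make_grid(2, 2, 4), weights)  # 4 features instead of 3
except ValueError as err:
    print("error:", err)  # error: wrong number of input features
```

Any grid size works, but each cell only ever sees its own features, which is the limitation described above. A convolution keeps the fixed feature count and additionally mixes in neighboring cells through its window.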

outer stirrup
#

Ah! Okay, everything makes sense now!