variable-sized inputs,outputs for embedders and autoencoders | Learn AI Together | Page 1

outer stirrup Dec 15, 2022, 11:14 AM

#

Hi all!
I am trying to port stable-diffusion to music generation. I think I have a good general grasp of how it works and how to start, but there is one point I am uncertain about: how do autoencoders, or embedders, accept and output variable-sized tensors?
That is, how does Stable Diffusion output 512x512 images, but also 712x1024, etc.?

So far, I've found two answers ( https://ai.stackexchange.com/questions/2008/how-can-neural-networks-deal-with-varying-input-sizes and https://discuss.pytorch.org/t/how-to-create-convnet-for-variable-size-input-dimension-images/1906/6 ). The first is simply to stack a bunch of conv and unconv layers to reach the right size (or use https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html ), and the second is something with recurrent/recursive networks (basically, embed with another embedding as the input?).
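
For reference, a minimal sketch of the adaptive-pooling variant of the first approach, assuming PyTorch. The layer sizes are made up for illustration; the point is only that `nn.AdaptiveAvgPool2d` yields a fixed-size embedding whatever the input resolution:

```python
import torch
import torch.nn as nn

# Hypothetical tiny embedder: conv features, then adaptive pooling to a fixed size.
embedder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),  # always pools down to 1x1, whatever the input size
    nn.Flatten(),                  # (batch, 8, 1, 1) -> (batch, 8)
)

small = embedder(torch.randn(1, 3, 32, 32))
large = embedder(torch.randn(1, 3, 64, 96))
print(tuple(small.shape), tuple(large.shape))  # both are (1, 8)
```

Note that this gives a fixed-size output from a variable-size input, so on its own it does not help with producing variable-size outputs.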

Is there a standard convention for this kind of thing, or a recommended approach?



brave scroll Dec 15, 2022, 12:52 PM

#

As I understand, the SD encoder module takes an image as input and outputs a latent tensor whose size depends on the input size, which has to be a multiple of some number.
Say, for instance, you input a 32x32 image: the encoder will give you a 1024x1x1 latent (I'm making the numbers up).
1024 would be the embedding size, and 1x1 is there because 32x32 is the smallest size that the convolutional encoder can take in.
If the input were 64x64 instead, then the latent would be 1024x2x2.
The decoder does the same thing in reverse: from a 1024xHxW latent, it outputs an (H*32)x(W*32) image.
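
A minimal sketch of this behaviour, assuming PyTorch. The architecture is hypothetical (five stride-2 layers, 16 latent channels instead of 1024) and is not SD's real autoencoder; it only reproduces the 32x downsampling used in the made-up numbers above:

```python
import torch
import torch.nn as nn

def make_encoder(in_ch=3, latent_ch=16, n_down=5):
    # Each stride-2 conv halves height and width: 2**5 = 32x downsampling overall.
    layers, ch = [], in_ch
    for _ in range(n_down):
        layers += [nn.Conv2d(ch, latent_ch, kernel_size=2, stride=2), nn.ReLU()]
        ch = latent_ch
    return nn.Sequential(*layers)

def make_decoder(out_ch=3, latent_ch=16, n_up=5):
    # Each stride-2 transposed conv doubles height and width.
    layers = []
    for i in range(n_up):
        last = i == n_up - 1
        layers.append(nn.ConvTranspose2d(latent_ch, out_ch if last else latent_ch,
                                         kernel_size=2, stride=2))
        if not last:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

encoder, decoder = make_encoder(), make_decoder()
for size in (32, 64):
    z = encoder(torch.randn(1, 3, size, size))
    x_hat = decoder(z)
    # 32 -> latent (1, 16, 1, 1); 64 -> latent (1, 16, 2, 2); output size matches the input
    print(size, tuple(z.shape), tuple(x_hat.shape))
```

The same weights handle both sizes; nothing in the layers stores the image resolution.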

#

so for music, you would have a latent with one embedding dimension and one spatial (or rather temporal) dimension
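
For audio, the same idea might look like the sketch below, assuming PyTorch and a raw mono waveform as input. The channel count, kernel sizes and clip lengths are made-up values for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical 1D encoder: (batch, 1, samples) -> (batch, 32, frames),
# downsampling time by 4 * 4 * 4 = 64.
audio_encoder = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=4, stride=4), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=4, stride=4), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=4, stride=4),
)

short_clip = audio_encoder(torch.randn(1, 1, 6400))
long_clip = audio_encoder(torch.randn(1, 1, 12800))
print(tuple(short_clip.shape), tuple(long_clip.shape))  # (1, 32, 100) and (1, 32, 200)
```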

#

the diffusion process happens in the latent space using a U-Net built from convolutions and attention (Transformer) blocks, which doesn't care about the size of the spatial or temporal dimensions, so it can generalize to any number of "spatio-temporal items" at inference time (after training)
If you want to predict a bigger image/song compared to the input, you simply initialize a bigger latent by adding padding along the desired spatio-temporal dimension(s)
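
The claim that attention does not care how many latent cells it sees can be checked with a single self-attention layer. This is a minimal sketch assuming PyTorch's `nn.MultiheadAttention` and made-up sizes; the `self_attend` helper is hypothetical and is not SD's actual denoiser:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

def self_attend(latent):
    # (batch, channels, H, W) -> (batch, H*W, channels): one token per latent cell
    b, c, h, w = latent.shape
    tokens = latent.flatten(2).transpose(1, 2)
    out, _ = attn(tokens, tokens, tokens)
    # back to (batch, channels, H, W)
    return out.transpose(1, 2).reshape(b, c, h, w)

print(tuple(self_attend(torch.randn(1, 16, 2, 2)).shape))  # 4 tokens
print(tuple(self_attend(torch.randn(1, 16, 4, 6)).shape))  # 24 tokens, same weights
```

Only the channel count (16 here) is fixed by the layer; the number of tokens is free.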

outer stirrup Dec 15, 2022, 4:07 PM

#

brave scroll As I understand, the SD encoder module takes an image as input and outputs a late...

Ah I see. Is this variable length achieved via a sort of recursive mechanism, or just more convolutions? I haven't really been able to find the answer to this, though I assume it is the latter?
And yes, if I understand correctly, only the autoencoder handles the input and output size; the rest of the machinery works only in the latent space

brave scroll Dec 15, 2022, 4:18 PM

#

outer stirrup Ah I see. Is this variable length achieved via a sort of recursive mechanism, or...

I'm simplifying a bit but a convolution is a sliding window mechanism, as long as you have a bigger input image than the kernel size, you can slide the window multiple times, the number of windows will depend on the size of the input.
In general, if you have a convolutional model trained on images of some size WxH, it will at least work for integer multiples of this size, k1*W x k2*H
There is no need for recurrence to get variably sized outputs, you just feed a bigger latent to the decoder and it will decode a bigger image.
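
The sliding-window idea can be written out by hand. Below is a minimal plain-Python sketch of a 1D "valid" convolution (strictly, a cross-correlation, which is what deep learning libraries compute); the kernel values are arbitrary:

```python
def conv1d(signal, kernel):
    """Slide the kernel over every position where it fully fits."""
    k = len(kernel)
    return [
        sum(w * signal[start + j] for j, w in enumerate(kernel))
        for start in range(len(signal) - k + 1)
    ]

kernel = [1, 0, -1]  # one fixed set of weights
print(conv1d([1, 2, 4, 7], kernel))          # [-3, -5]: 2 windows
print(conv1d([1, 2, 4, 7, 11, 16], kernel))  # [-3, -5, -7, -9]: 4 windows
```

A single layer already does all the sliding; a longer input just means more windows with the same kernel.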

outer stirrup Dec 15, 2022, 4:28 PM

#

brave scroll I'm simplifying a bit but a convolution is a sliding window mechanism, as long a...

> you can slide the window multiple times, the number of windows will depend on the size of the input.
Ah! So it sounds like I understood incorrectly, and the "sliding multiple times" actually happens within a single convolution layer rather than across multiple convolution layers stacked one on top of the other?
Sounds like I had a misunderstanding of how convolution layers work; I thought they only did one pass. I will have to look into it more

#

Thanks a lot for your guidance!

outer stirrup Dec 15, 2022, 11:14 PM

#

brave scroll I'm simplifying a bit but a convolution is a sliding window mechanism, as long a...

Ah! So if I understand you correctly, you mean something like

```python
.....
    self.convolution = nn.Conv2d(1, 1, 3)
    # Other convolution layers
.....

    def forward(self, x):
        for i in .....:
            x = self.convolution(x)
        x = .....  # rest of layers
```
That is, we keep using the same layer multiple times?


> There is no need for recurrence to get variably sized outputs, you just feed a bigger latent to the decoder and it will decode a bigger image.
Ah, you mean with https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html ? It does indeed seem to just multiply the input size!
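
Whether the size is exactly multiplied depends on the layer settings. A small plain-Python sketch of the output-size formula given in the `ConvTranspose2d` documentation, applied along one dimension (the helper name `conv_transpose_out` is made up):

```python
def conv_transpose_out(n, kernel, stride=1, padding=0, output_padding=0, dilation=1):
    """Output length of a transposed convolution along one dimension."""
    return (n - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

# kernel == stride, no padding: the size is exactly multiplied by the stride
print([conv_transpose_out(n, kernel=2, stride=2) for n in (1, 2, 16)])             # [2, 4, 32]
# another common doubling setup
print([conv_transpose_out(n, kernel=4, stride=2, padding=1) for n in (1, 2, 16)])  # [2, 4, 32]
# stride 1: the size only grows by kernel - 1, it is not multiplied
print([conv_transpose_out(n, kernel=3) for n in (1, 2, 16)])                       # [3, 4, 18]
```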

#

I think I'm starting to form a better model of what convolution exactly is

brave scroll Dec 16, 2022, 7:48 AM

#

outer stirrup Ah! So if I understand you correctly, you mean something like ```python ..... ...

No, you don't repeat the application of the same convolution layer; the convolution layer itself accepts variably sized inputs and will return a smaller (or bigger, in the case of ConvTranspose) output.
In convolutional neural networks, the input goes through multiple fixed convolution layers, so usually the size is reduced quite a bit, but you have to feed the model a large enough image to begin with, or else you might reach a convolution layer that can't work with an input that is too small.
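
To see where a too-small input breaks, the size can be traced layer by layer. This is a plain-Python sketch using the output-size formula from the `Conv2d` documentation along one dimension; the stack of four kernel-3, stride-2 layers and the `trace_sizes` helper are made up for illustration:

```python
def conv_out(n, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one dimension."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def trace_sizes(n, layers):
    """Follow the size through a stack of (kernel, stride) layers without padding."""
    sizes = [n]
    for kernel, stride in layers:
        if n < kernel:
            raise ValueError(f"input of size {n} is smaller than kernel {kernel}")
        n = conv_out(n, kernel, stride)
        sizes.append(n)
    return sizes

layers = [(3, 2)] * 4  # hypothetical stack: four convs, kernel 3, stride 2
print(trace_sizes(64, layers))  # [64, 31, 15, 7, 3]
try:
    trace_sizes(16, layers)     # 16 -> 7 -> 3 -> 1, then the last layer cannot fit
except ValueError as err:
    print("too small:", err)
```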

outer stirrup Dec 16, 2022, 8:46 AM

#

brave scroll No you don't repeat the application of the same convolution layer, the convoluti...

Hmm... What I'm confused about is that I thought that the latent space needed to be a fixed size, for instance if I need dense/linear layers (though it appears SD only uses conv 🤔 ?)

brave scroll Dec 16, 2022, 9:02 AM

#

outer stirrup Hmm... What I'm confused about is that I thought that the latent space needed to...

To explain this I have to get into the details a little, so:
If you have a 2D latent tensor with dimensions (batch, features), a linear layer will only accept a tensor with the exact number of input features for which it was defined. But it will accept any batch size, as it processes all elements of the batch in parallel and independently.
Similarly, if you have a 4D tensor with 2 spatial dimensions like this: (batch, x_dim, y_dim, features), a linear layer will yet again only accept a tensor whose number of features matches, but will accept any size in the other dimensions, as they will all be processed independently. The issue in this case is that the linear layer won't be able to process the spatial information. Imagine doing the same operation in parallel on each pixel of an image: it would not be very powerful.

Now, convolutions are a bit different because they work with a sliding window, so with your (batch, x_dim, y_dim, features) tensor, the sliding window gathers neighboring cells from the x_dim and y_dim dimensions. Each cell is not processed independently, since its output depends on the neighboring cells.
But since it is a sliding window mechanism, and the window has a fixed size and so doesn't see the whole image at once, the layer also accepts different sizes along the x and y dimensions. The number of features has to be fixed, though.
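
Both points can be checked directly. This is a minimal sketch assuming PyTorch, with made-up sizes. Note that PyTorch conv layers expect the features (channels) right after the batch dimension, i.e. (batch, features, x_dim, y_dim), while `nn.Linear` acts on the last dimension:

```python
import torch
import torch.nn as nn

linear = nn.Linear(in_features=8, out_features=4)
print(tuple(linear(torch.randn(2, 8)).shape))          # (2, 4)
print(tuple(linear(torch.randn(5, 10, 12, 8)).shape))  # (5, 10, 12, 4): any leading dims

conv = nn.Conv2d(in_channels=8, out_channels=4, kernel_size=3)
print(tuple(conv(torch.randn(2, 8, 10, 12)).shape))    # (2, 4, 8, 10)
print(tuple(conv(torch.randn(2, 8, 20, 30)).shape))    # (2, 4, 18, 28): any spatial size

# The feature/channel count is the one thing that must match.
try:
    linear(torch.randn(2, 7))
except RuntimeError:
    print("linear layer rejects 7 features (it was defined for 8)")
```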

outer stirrup Dec 16, 2022, 9:05 AM

#

Ah! Okay, everything makes sense now!
