Hi all!
I am trying to port stable-diffusion to music generation. I think I have a good general grip on how it works and how to start, but there is one point I am uncertain about: how do autoencoders, or embedders, accept and output variable-sized tensors?
That is, how does stable diffusion output not only 512x512 images but also 712x1024 and other sizes?
So far, I've found two answers ( https://ai.stackexchange.com/questions/2008/how-can-neural-networks-deal-with-varying-input-sizes and https://discuss.pytorch.org/t/how-to-create-convnet-for-variable-size-input-dimension-images/1906/6 ). The first is simply to stack convolution and transposed-convolution ("unconv") layers so that the output ends up the right size (or to use https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html ). The second is something with recurrent/recursive networks (basically embedding with another embedding as the input?).
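To make the first option concrete, here is a rough sketch of the size arithmetic as I understand it, in plain Python with no framework. The layer settings (three stride-2 convolutions with kernel 3 and padding 1, mirrored by transposed convolutions with kernel 4) are hypothetical values I picked for illustration, not the actual stable-diffusion configuration. The last function is a 1D version of adaptive average pooling, using what I believe is the same binning rule as AdaptiveAvgPool2d.

```python
def conv_out(n, kernel, stride, padding):
    """Output length of a strided convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out(n, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def encode_size(n, num_layers=3):
    # Hypothetical encoder: each layer roughly halves the length.
    for _ in range(num_layers):
        n = conv_out(n, kernel=3, stride=2, padding=1)
    return n


def decode_size(n, num_layers=3):
    # Hypothetical decoder: each layer exactly doubles the length.
    for _ in range(num_layers):
        n = conv_transpose_out(n, kernel=4, stride=2, padding=1)
    return n


def adaptive_avg_pool_1d(values, out_len):
    """Average a sequence of any length down to exactly out_len bins."""
    n = len(values)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len                      # floor(i * n / out_len)
        end = ((i + 1) * n + out_len - 1) // out_len    # ceil((i + 1) * n / out_len)
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


if __name__ == "__main__":
    for size in (512, 712, 1024, 710):
        latent = encode_size(size)
        print(size, "->", latent, "->", decode_size(latent))
    print(adaptive_avg_pool_1d([1, 2, 3, 4, 5], 3))
```

None of the layers hard-code the input size, so the same weights work for 512 or 1024; the latent simply gets larger or smaller. With these settings the round trip is only exact when the input length is a multiple of 8 (710 comes back as 712), which would explain why sizes are usually restricted to multiples of the downsampling factor. Adaptive pooling goes the other way: it forces a fixed output length regardless of input length, which suits a classifier head more than a generator.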
Is there a standard convention for this kind of thing, or a recommended approach?