Hello there! I need help with a convolutional autoencoder that takes variable-length 1D audio tensors as input. The encoder should produce a fixed-size representation, which is then passed to the decoder, and the decoder should try to approximate the original input. I have implemented the encoder, which generates a fixed-size embedding, but I'm stuck on the decoder: I don't know how to get back to the original input size. I used AdaptiveMaxPool1d in the encoder to downsample to a fixed output size. What would the reverse operation be in the decoder? Let me know if you have any leads.
#Convolutional AutoEncoder
I think it's some combination of this and possibly upsampling
https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html
(I just googled this a bit, so take it with a few grains of rock salt.)
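For the upsampling half of that suggestion, here is a minimal sketch, assuming the decoder is told the target length. `F.interpolate` can stretch a fixed-length feature map to any requested length. The `upsample_to` helper and the tensor sizes are made up for illustration and are not from the thread.

```python
import torch
import torch.nn.functional as F


def upsample_to(x: torch.Tensor, target_len: int) -> torch.Tensor:
    """Resize a (batch, channels, length) tensor to target_len (hypothetical helper)."""
    # mode="linear" is the 1D interpolation mode and expects a 3D input.
    return F.interpolate(x, size=target_len, mode="linear", align_corners=False)


# A fixed-length feature map, like the 1024-step output of the pooling layer.
pooled = torch.rand(2, 8, 1024)

# The same fixed-size input can be stretched to different target lengths.
for target_len in (3000, 4097):
    print(target_len, tuple(upsample_to(pooled, target_len).shape))
```

Interpolation only restores the length, not the detail removed by pooling, so the layers after it still have to learn the reconstruction.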
@solar bobcat Yeah, I have tried that, but it's not giving an output of the same shape as the original input.
Can you show the code you used?
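One common cause of that mismatch is that a stride-2 convolution maps two different input lengths to the same output length, so the transposed convolution cannot tell which one to restore. As a sketch, assuming layers with the same `kernel_size=3, stride=2, padding=1` as the encoder (channel counts reduced here for illustration), the `output_size` argument of `ConvTranspose1d.forward` resolves the ambiguity:

```python
import torch
import torch.nn as nn

# Same kernel/stride/padding as the encoder convolutions; channel counts are illustrative.
conv = nn.Conv1d(1, 4, kernel_size=3, stride=2, padding=1)
deconv = nn.ConvTranspose1d(4, 1, kernel_size=3, stride=2, padding=1)

for length in (1000, 1001):
    x = torch.rand(2, 1, length)
    h = conv(x)

    # Without output_size the layer always picks the smallest valid length.
    default = deconv(h)

    # Passing the desired length lets PyTorch choose the output padding itself.
    matched = deconv(h, output_size=[length])

    print(length, tuple(h.shape), tuple(default.shape), tuple(matched.shape))
```

The requested length has to be one the layer can actually produce, so the decoder needs to record the length seen before each strided convolution in the encoder.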
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=3, stride=2, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)
        self.relu2 = nn.ReLU()
        # Equivalent non-adaptive max pooling would use:
        #   stride = input_size // output_size
        #   kernel_size = input_size - (output_size - 1) * stride
        #   padding = 0
        self.maxpool1 = nn.AdaptiveMaxPool1d(1024, return_indices=True)
        self.fc1 = nn.Linear(1024, output_size)

    def forward(self, x):
        # Input shape: (batch_size, 1, audio_length)
        x = self.conv1(x)  # shape: (batch_size, 64, ~audio_length / 2) because stride=2
        print(f"Shape after conv1: {x.shape}")
        x = self.relu1(x)
        x = self.conv2(x)  # shape: (batch_size, 128, ~audio_length / 4)
        x = self.relu2(x)
        input_size = x.size(2)  # length before pooling (currently unused)
        print(f"after conv2: {x.shape}")
        x, indices = self.maxpool1(x)  # shape: (batch_size, 128, 1024)
        print(f"Shape after maxpool: {x.shape}")
        x = x.mean(1)  # shape: (batch_size, 1024); mean over all 128 channels
        print(f"Shape after mean: {x.shape}")
        # x = x.view(x.size(0), -1)  # shape: (batch_size, 128 * audio_length)
        # fc1 = nn.Linear(sf, self.output_size)  # shape: (batch_size, output_size)
        x = self.fc1(x)  # shape: (batch_size, output_size)
        return x, indices


def main():
    embedding_size = 512
    encoder = SpeechEncoder(output_size=embedding_size)
    decoder = SpeechDecoder()  # the decoder is the part that is not implemented yet
    src = torch.rand(4, 1, 441001)
@solar bobcat
I have since worked it out another way (without using AdaptiveMaxPool), but I'd still like to know whether there is a way to make this approach work.
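One possible way to stay with AdaptiveMaxPool1d is to reuse the indices it returns. This is a minimal sketch, assuming the decoder is given both the indices and the pre-pooling length. `adaptive_max_unpool1d` is a hypothetical helper, not a PyTorch function, and the tensor sizes are only illustrative.

```python
import torch
import torch.nn as nn


def adaptive_max_unpool1d(values: torch.Tensor, indices: torch.Tensor, length: int) -> torch.Tensor:
    """Place pooled values back at their original positions (hypothetical helper).

    values, indices: (batch, channels, pooled_length), as returned by
    nn.AdaptiveMaxPool1d(..., return_indices=True).
    length: size of the length dimension before pooling.
    """
    out = values.new_zeros(values.size(0), values.size(1), length)
    # The indices are positions along the length dimension, so scatter on dim 2.
    return out.scatter_(2, indices, values)


pool = nn.AdaptiveMaxPool1d(16, return_indices=True)

# Two different input lengths pool to the same size and unpool to their own lengths.
for length in (103, 250):
    feats = torch.rand(2, 3, length)
    pooled, idx = pool(feats)
    restored = adaptive_max_unpool1d(pooled, idx, length)
    print(tuple(pooled.shape), tuple(restored.shape))
```

In the posted encoder the pooled tensor is averaged over channels and passed through `fc1`, so a decoder would first have to map the embedding back to a `(batch_size, 128, 1024)` tensor before unpooling. Note also that the indices and the original length travel alongside the embedding, so the embedding alone does not fully describe the input.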