Convolutional AutoEncoder | Learn AI Together | Page 1

cyan edge Feb 6, 2024, 7:54 PM

#

Hello there! I need help with a convolutional autoencoder that can handle variable-length 1D audio tensors as input. The encoder should produce a fixed-size representation, which is then passed to the decoder, and the decoder should try to approximate the original input. I have implemented the encoder, which generates a fixed-size embedding, but I'm stuck on the decoder: I don't know how to get back to the original input_size. I used AdaptiveMaxPool1d in the encoder to downsample to a fixed output_size, but what would the reverse operation be in the decoder? Lemme know if there are any leads.

solar bobcat Feb 6, 2024, 10:30 PM

#

I think it's some combination of this and possibly upsampling
https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose1d.html

#

(I just tried to google a bit, so take this with a few grains of rock salt)
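A minimal sketch of the "upsampling" half of that suggestion, assuming linear interpolation is an acceptable stand-in for undoing the adaptive pooling. Pooling discards information, so this only restores the length, not the original values. The helper name `unpool_to_length` and the tensor sizes are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def unpool_to_length(pooled: torch.Tensor, target_len: int) -> torch.Tensor:
    """Stretch a (batch, channels, fixed_len) tensor to target_len samples.

    Hypothetical helper: an approximate, not exact, inverse of adaptive pooling.
    """
    return F.interpolate(pooled, size=target_len, mode="linear", align_corners=False)


pool = nn.AdaptiveMaxPool1d(16)      # fixed output length, whatever the input length
x = torch.rand(2, 8, 103)            # arbitrary length that is not a multiple of 16
pooled = pool(x)                     # shape: (2, 8, 16)
restored = unpool_to_length(pooled, x.size(2))  # shape: (2, 8, 103)
print(pooled.shape, restored.shape)
```

The key point is that the decoder has to be told the target length, because the fixed-size tensor no longer carries it.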

cyan edge Feb 7, 2024, 1:02 PM

#

@solar bobcat Yeah, I have tried that, but it's not giving an output of the same shape as the original input.
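One likely cause of a shape mismatch like this, sketched under the assumption that the layers use kernel_size=3, stride=2, padding=1: a strided Conv1d maps two neighbouring input lengths to the same output length, so ConvTranspose1d cannot know which one to restore unless it is given `output_size` in the forward call. The helpers `conv_len` and `deconv_len` are illustrative names that restate the length formulas from the PyTorch docs (with dilation 1).

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(1, 4, kernel_size=3, stride=2, padding=1)
deconv = nn.ConvTranspose1d(4, 1, kernel_size=3, stride=2, padding=1)


def conv_len(n, k=3, s=2, p=1):
    # Conv1d output length (dilation = 1)
    return (n + 2 * p - k) // s + 1


def deconv_len(n, k=3, s=2, p=1, output_padding=0):
    # ConvTranspose1d output length (dilation = 1)
    return (n - 1) * s - 2 * p + k + output_padding


results = {}
for n in (100, 101):
    x = torch.rand(1, 1, n)
    h = conv(x)
    naive = deconv(h)                    # length is guessed from h alone
    exact = deconv(h, output_size=[n])   # length is requested explicitly
    results[n] = (h.size(-1), naive.size(-1), exact.size(-1))
    print(n, results[n])
```

With these settings an even input length comes back one sample short in the naive call, while passing `output_size` restores the requested length in both cases.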

solar bobcat Feb 7, 2024, 2:17 PM

#

Can you show the code you used?

cyan edge Feb 10, 2024, 7:42 AM

#

import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=3, stride=2, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)
        self.relu2 = nn.ReLU()
        # Equivalent settings if non-adaptive max pooling were used instead:
        #   stride      = input_size // output_size
        #   kernel_size = input_size - (output_size - 1) * stride
        #   padding     = 0
        self.maxpool1 = nn.AdaptiveMaxPool1d(1024, return_indices=True)
        self.fc1 = nn.Linear(1024, output_size)

    def forward(self, x):
        # Input shape: (batch_size, 1, audio_length)
        x = self.conv1(x)  # shape: (batch_size, 64, ~audio_length / 2) because stride=2
        print(f"Shape after conv1: {x.shape}")
        x = self.relu1(x)
        x = self.conv2(x)  # shape: (batch_size, 128, ~audio_length / 4)
        x = self.relu2(x)
        input_size = x.size(2)  # length before pooling (currently unused)
        print(f"after conv2: {x.shape}")
        x, indices = self.maxpool1(x)  # shape: (batch_size, 128, 1024)
        print(f"Shape after maxpool: {x.shape}")
        x = x.mean(1)  # shape: (batch_size, 1024); averages over the 128 channels

        print(f"Shape after mean: {x.shape}")
        # x = x.view(x.size(0), -1) # shape: (batch_size, 128 * audio_length)
        # fc1 = nn.Linear(sf, self.output_size) # shape: (batch_size, output_size)
        x = self.fc1(x)  # shape: (batch_size, output_size)
        return x, indices
        
def main():
    embedding_size = 512
    encoder = SpeechEncoder(output_size=embedding_size)
    decoder = SpeechDecoder()  # not implemented yet; this is the part the question is about
    src = torch.rand(4, 1, 441001)

#

@solar bobcat

#

I have since worked it out another way (without AdaptiveMaxPool), but I'd still like to know whether it can be done with this approach.
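For what it's worth, here is one possible decoder for the AdaptiveMaxPool-based encoder above. It is a rough sketch, not the poster's eventual solution. Max pooling followed by a channel mean cannot be inverted exactly, so this version assumes the lengths seen by the encoder are passed along and uses interpolation plus ConvTranspose1d with `output_size` to get back to the original shape. The names `SpeechDecoderSketch` and `encoder_lengths` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def encoder_lengths(audio_len, k=3, s=2, p=1):
    """Lengths at the input, after conv1 and after conv2 of the encoder."""
    len1 = (audio_len + 2 * p - k) // s + 1
    len2 = (len1 + 2 * p - k) // s + 1
    return audio_len, len1, len2


class SpeechDecoderSketch(nn.Module):
    """Hypothetical mirror of SpeechEncoder; layer sizes are assumptions."""

    def __init__(self, embedding_size=512, pooled_len=1024, channels=128):
        super().__init__()
        self.fc = nn.Linear(embedding_size, pooled_len)      # mirrors fc1
        # Learned stand-in for the channel mean, which cannot be undone exactly
        self.expand = nn.Conv1d(1, channels, kernel_size=1)
        self.deconv2 = nn.ConvTranspose1d(channels, 64, kernel_size=3, stride=2, padding=1)
        self.deconv1 = nn.ConvTranspose1d(64, 1, kernel_size=3, stride=2, padding=1)

    def forward(self, z, lengths):
        audio_len, len1, len2 = lengths
        x = torch.relu(self.fc(z)).unsqueeze(1)   # (batch, 1, pooled_len)
        x = torch.relu(self.expand(x))            # (batch, channels, pooled_len)
        # Approximate inverse of AdaptiveMaxPool1d: stretch back to the pre-pool length
        x = F.interpolate(x, size=len2, mode="linear", align_corners=False)
        x = torch.relu(self.deconv2(x, output_size=[len1]))   # (batch, 64, len1)
        return self.deconv1(x, output_size=[audio_len])       # (batch, 1, audio_len)


decoder = SpeechDecoderSketch()
z = torch.rand(2, 512)                            # stand-in for encoder embeddings
recon = decoder(z, encoder_lengths(8000))
print(recon.shape)
```

All clips in a batch are assumed to share one length here. Truly variable lengths within a batch would need padding and masking, which this sketch leaves out.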
