#Trying to make my own neural network for language generation, struggling to get coherent text


tired sun
#

I'm trying to make a network that generates the next character given a sequence of characters. I don't want to use an LSTM/Transformer/RNN; I want to see how far I can get without them.
So far I can make the network overfit, but it is really bad at generalizing.
I'm not sure if I need to make my network deeper, or wider, or what.
Also it takes a long time to train (takes a good couple hours on an RTX 4090 to get down to even 1.7 training loss, which still produces gibberish like this: "The sovk If the serl wk tnt tn t site toove thich tn toooe te tnsorvlng torhhfr torphith
TThe sont on tn t toarn tn thes trrt tn tn t d tfher Tut tn the soketin the sordnh pg th th th the sord th tnpert th tond tndtn hnere hrto d r og thich tirld tnl w tn th te tptont rtd
he sa eu tf tomttrueteon toaeave tp the sor")

I can only get the training accuracy up to about 55% (meaning the character is predicted correctly 55% of the time, but the output still ends up looking like gibberish).

Anybody have any tips?

Currently the network has about 1.5 million parameters

```python
import torch.nn as nn
import torch.nn.functional as F


class Gen(nn.Module):
    # Header reconstructed from the names used below; the argument order is a guess.
    def __init__(self, vocab_size, read_size, output_size):
        super().__init__()
        print("SIZE: ", read_size * vocab_size)
        embedding_dim = 10
        mid_size = 200
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Input block: takes the flattened window of read_size character embeddings.
        self.lin1 = nn.Linear(read_size * embedding_dim, 400)
        self.bn = nn.BatchNorm1d(400)
        self.lin2 = nn.Linear(400, mid_size)
        self.bn2 = nn.BatchNorm1d(mid_size)
        # 15 blocks: a Linear + BatchNorm main path plus a separate Linear side path.
        self.linears = nn.ModuleList([nn.Linear(mid_size, mid_size) for _ in range(15)])
        self.resid = nn.ModuleList([nn.Linear(mid_size, mid_size) for _ in range(15)])
        self.batchnorms = nn.ModuleList([nn.BatchNorm1d(mid_size) for _ in range(15)])
        self.layernorm = nn.LayerNorm(mid_size)  # defined but never used in forward()
        # Output head: narrows 200 -> 100 -> 50 -> 20 -> 10 before the logits.
        self.lin4 = nn.Linear(mid_size, 100)
        self.lin5 = nn.Linear(100, 50)
        self.lin6 = nn.Linear(50, 20)
        self.bn6 = nn.BatchNorm1d(20)
        self.lin7 = nn.Linear(20, 10)
        self.lin8 = nn.Linear(10, output_size * vocab_size)

        self.drp = nn.Dropout(0.15)
        self.drpmid = nn.Dropout(0.01)
        self.drpsmall = nn.Dropout(0.00)  # p=0, so this layer does nothing

    def forward(self, x):
        x = x.long()
        x = self.embedding(x)       # (batch, read_size, embedding_dim)
        x = x.view(x.shape[0], -1)  # (batch, read_size * embedding_dim)
        x = F.leaky_relu(self.bn(self.lin1(x)))
        x = self.drpsmall(F.leaky_relu(self.bn2(self.lin2(x))))
        for i, l in enumerate(self.linears):
            # The side path has its own Linear, so this is not an identity skip connection.
            x = self.drpmid(F.leaky_relu(self.batchnorms[i](l(x)) + F.leaky_relu(self.resid[i](x)), negative_slope=0.01))
        x = F.leaky_relu(self.lin4(x))
        x = F.leaky_relu(self.lin5(x))
        x = self.drp(x)
        x = F.selu(self.bn6(self.lin6(x)))
        x = F.leaky_relu(self.lin7(x))
        x = self.lin8(x)  # raw logits, (batch, output_size * vocab_size)
        return x
```
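
As a rough check on the "about 1.5 million parameters" figure, the layer sizes in the code above can be tallied by hand. This is only a sketch: `vocab_size=100`, `read_size=50` and `output_size=1` are made-up stand-ins, since the real values are not given in the thread, and the helper names are hypothetical.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def norm(n):
    """Learnable scale and shift of a BatchNorm1d or LayerNorm layer."""
    return 2 * n


def count_params(vocab_size, read_size, output_size,
                 embedding_dim=10, mid_size=200, n_blocks=15):
    """Return (total learnable parameters, parameters inside the 15 blocks)."""
    total = vocab_size * embedding_dim                            # embedding
    total += linear(read_size * embedding_dim, 400) + norm(400)   # lin1, bn
    total += linear(400, mid_size) + norm(mid_size)               # lin2, bn2
    # linears + resid + batchnorms, repeated n_blocks times
    blocks = n_blocks * (2 * linear(mid_size, mid_size) + norm(mid_size))
    total += blocks + norm(mid_size)                              # blocks + unused layernorm
    total += linear(mid_size, 100) + linear(100, 50)              # lin4, lin5
    total += linear(50, 20) + norm(20) + linear(20, 10)           # lin6, bn6, lin7
    total += linear(10, output_size * vocab_size)                 # lin8
    return total, blocks


# Assumed values, not taken from the thread.
total, blocks = count_params(vocab_size=100, read_size=50, output_size=1)
print(total, blocks, round(blocks / total, 2))
```

With these stand-in values the total lands close to 1.5 million, and most of it sits in the 15 mid-size blocks rather than in the input layer that actually reads the character window.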
sweet spade
#

This is @strange grove's area of expertise; he knows a lot more about non-transformer text generation than anyone else. But it won't involve brute-forcing like you're trying to do; it means incorporating quite a bit of linguistics knowledge.

strange grove
#

This is an n-gram model. I can't quite tell from your code what the size of the n-gram is (15?), but in general, beyond 5-grams you won't be able to see enough data to fit the model.
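
To make the data-sparsity point concrete, here is a minimal count-based character n-gram sketch. It is not the network from the thread; the function names and the toy sentence are made up for illustration. It measures how many contexts were seen only once, which is the situation where a model has almost nothing to generalize from.

```python
from collections import Counter, defaultdict


def build_ngram_counts(text, n):
    """Map each (n-1)-character context to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        counts[context][nxt] += 1
    return counts


def singleton_fraction(counts):
    """Fraction of distinct contexts that were observed exactly once."""
    if not counts:
        return 0.0
    seen_once = sum(1 for c in counts.values() if sum(c.values()) == 1)
    return seen_once / len(counts)


# Toy corpus, not real training data.
text = "the quick brown fox jumps over the lazy dog while the other dog sleeps by the door"
for n in (2, 3, 5, 8):
    counts = build_ngram_counts(text, n)
    print(n, len(counts), round(singleton_fraction(counts), 2))
```

On real text the same measurement can be run over the training file; the longer the context, the larger the share of contexts that appear only once.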

tired sun
#

the 15 is just because I have 15 other linear layers in the forward() method

glacial verge
#

@tired sun How are you sampling from the network's prediction? Uniformly?

tired sun
#

and then to sample (when generating text after training) I just do an argmax

#

choose the max value
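
For comparison, here is a small sketch of argmax decoding next to temperature sampling, written in plain Python so it runs without the model. The logits are made-up numbers and the function names are hypothetical. Argmax always returns the same character for the same context, which may contribute to repetitive loops in generated text; sampling from the softmax distribution is a common alternative.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Argmax decoding: always pick the highest-scoring index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, rng=random):
    """Draw an index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.5, 0.1]  # hypothetical scores for three characters
rng = random.Random(0)    # seeded so repeated runs give the same draws
print(greedy(logits))
print([sample(logits, temperature=0.8, rng=rng) for _ in range(10)])
```

In PyTorch the sampling step would typically be `torch.multinomial(torch.softmax(logits / temperature, dim=-1), 1)` applied to the network's output.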

glacial verge
#

Reading through your code now. It's too verbose. You could try packaging some of the layers in a Sequential container.
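
One way that suggestion might look, as a sketch: the output head of the posted model (`lin4` through `lin8`) packed into a single `nn.Sequential`, with the layers in the same order as in `forward()`. The helper name `make_head` and `vocab_size=80` are assumptions for illustration.

```python
import torch
import torch.nn as nn


def make_head(mid_size, vocab_size, output_size=1):
    """Output head of the posted model as one container, same layer order as forward()."""
    return nn.Sequential(
        nn.Linear(mid_size, 100), nn.LeakyReLU(),
        nn.Linear(100, 50), nn.LeakyReLU(),
        nn.Dropout(0.15),
        nn.Linear(50, 20), nn.BatchNorm1d(20), nn.SELU(),
        nn.Linear(20, 10), nn.LeakyReLU(),
        nn.Linear(10, output_size * vocab_size),  # raw logits
    )


head = make_head(mid_size=200, vocab_size=80)  # vocab_size=80 is a made-up value
x = torch.randn(4, 200)                        # batch of 4 hidden vectors
print(head(x).shape)
```

The 15 middle blocks add two paths together, so they do not fit a plain `Sequential`; they would need a small custom module that is then repeated inside one.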

tired sun
#

yeah that would probably be easier to read for sure

terse night
#

Use word2vec embeddings for the tokens; that might help. Better embeddings like ELMo and BERT can be used, but they rely on a more complicated architecture than an MLP layer. If you want, you can also train your own word2vec embeddings.
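
A minimal sketch of how externally trained vectors could be plugged into a PyTorch model, assuming they are already available as a tensor. The random matrix below is only a stand-in for real word2vec vectors, and the sizes are made up. Note that word2vec is normally trained on word tokens, while the model in this thread works on characters.

```python
import torch
import torch.nn as nn

vocab_size, dim = 50, 16  # made-up sizes for illustration

# Stand-in for pretrained vectors; real ones would be loaded from a trained model.
pretrained = torch.randn(vocab_size, dim)

# freeze=True keeps the vectors fixed during training; freeze=False would fine-tune them.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True)

token_ids = torch.tensor([[0, 3, 7], [1, 1, 2]])  # batch of 2 sequences, 3 tokens each
vectors = embedding(token_ids)
print(vectors.shape)
```

The resulting layer can replace a randomly initialised `nn.Embedding`, provided the following linear layer is sized for the new embedding dimension.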

tired sun
#

FINAL: siderate, not to say unreasonable.

Mr. Parke came in, but could only shake my h||andstnd tnpuerunedtor sen eng tnay or t sart ete sav te n tirra ng tor toar the sotony oh a tn tarnt tnsereetorenir tynht te sIin eoand te sav sn tist teaete thet the sney these th te tis th toane tft ohe e trd tIonmtor tes,elf,

HE CORCTH.T hock tnsom the soon tf cerstitk d temunee the siteeh ou the srevi tli toae oith t doare aieme ng ttaeke tnd the tilished tn tnshratetooehtstf

#

Everything after the "||" is generated

#

It's just gibberish 😦

tired sun