#d_ff/d_model + swiglu tests


rose vapor
#

So you end up with a very complex nonlinear layer function that is easily optimized because the inputs are effectively constrained by Pre-LayerNorm
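A minimal sketch of the idea above, assuming a standard pre-LayerNorm residual block with a SwiGLU feed-forward branch. The sizes, the random weights, and the helper names (`layer_norm`, `swiglu_ffn`, `rand_matrix`) are illustrative assumptions, not anything taken from the experiments discussed here. It shows that the nonlinear branch sees the same normalized input no matter how large the raw activations are:

```python
import math
import random

def layer_norm(x, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance (no learned gain/bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def silu(v):
    # SiLU / swish: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    # w is a list of rows; returns w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: down( silu(gate(x)) * up(x) )
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def rand_matrix(rows, cols, rng):
    # Hypothetical init: Gaussian with 1/sqrt(fan_in) scale
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff = 8, 16  # illustrative sizes only
w_gate = rand_matrix(d_ff, d_model, rng)
w_up = rand_matrix(d_ff, d_model, rng)
w_down = rand_matrix(d_model, d_ff, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
x_big = [1000.0 * v for v in x]  # same direction, 1000x the scale

# The FFN branch only ever sees the normalized vector.
branch_small = swiglu_ffn(layer_norm(x), w_gate, w_up, w_down)
branch_big = swiglu_ffn(layer_norm(x_big), w_gate, w_up, w_down)
max_diff = max(abs(a - b) for a, b in zip(branch_small, branch_big))

# Pre-LN residual block output: x + FFN(LN(x))
block_out = [xi + yi for xi, yi in zip(x, branch_small)]
```

Here `max_diff` comes out tiny (only the `eps` term differs), which is one way to read "the inputs are effectively constrained": the SwiGLU branch never has to cope with arbitrary input scales.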

soft bobcat
#

my primary takeaway is that the best nonlinearities seem to have significant dependencies on the surrounding hparams, such as LR and problem type

#

my second takeaway is that there is some way to make sin work, which I tried to do but always failed at

#

as a side note, your architectures appear to have disastrous performance

#

you should check out #implementation-details message

#

it should be possible to hit 94% on CIFAR-10 in about 10 seconds on a free T4 in google colab

rose vapor
#

Lol yeah, I used a training recipe / architecture from another paper and did no hyperparameter tuning. So it could be that my results are specific to those hyperparameters

#

"94% on CIFAR-10 in 3.29 Seconds on a Single GPU" is a killer title. I wish I had read this before I set up my experiments

#

But then, it was published 26 days after my paper, so I guess it couldn't be helped lol

soft bobcat
rose vapor
#

Thanks, man. I'll be playing with this over the weekend. Very much appreciated

still grail
still grail
rose vapor
#

Outstanding. Thank you so much!

boreal moss
#

heh, I got 94% on CIFAR-10 with a 100k-parameter model, but training takes forever, so I suppose it's not so cool lol

boreal moss
fallen spear
#

it is kind of impressive that that works that well

#

i saw it

#

you good

boreal moss
#

there are some tricks that aren't really visible

fallen spear
#

thassa bog standard convnet, i guess the only thing that would have surprised me five years ago would have been the layernorm

boreal moss
#

ConvNeXt-like with a little bit of InceptionNet and a "modern" FFN

#

it has quite a lot of FLOPs relative to the parameter count
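A rough back-of-the-envelope sketch of why a small convnet can be compute-heavy for its size. The layer shapes below are hypothetical, not the actual model mentioned here. A conv kernel is reused at every output position, so its multiply-accumulates (MACs) per parameter equal the output feature-map area, while a dense layer applied to a single vector uses each weight once:

```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out):
    # Parameters of a square conv kernel (bias ignored)
    params = kernel * kernel * c_in * c_out
    # Every weight is used once per output position
    macs = params * h_out * w_out
    return params, macs

def linear_cost(d_in, d_out):
    # Dense layer on a single input vector: one MAC per weight (bias ignored)
    params = d_in * d_out
    return params, params

# Hypothetical layers with the same parameter count (9216 each)
conv_params, conv_macs = conv2d_cost(c_in=32, c_out=32, kernel=3, h_out=32, w_out=32)
lin_params, lin_macs = linear_cost(d_in=96, d_out=96)

print(conv_params, conv_macs // conv_params)  # 9216 1024
print(lin_params, lin_macs // lin_params)     # 9216 1
```

So at a 32x32 resolution each conv weight does about a thousand times more work per image than a dense weight would, which fits a 100k-parameter model that still takes a long time to train.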

fallen spear
#

"negligible impact"

dawn vine
boreal moss
fallen spear
fallen spear
#

#off-topic message

#

bump