#d_ff/d_model + swiglu tests
my primary takeaway is that the best nonlinearities seem to have significant dependencies on the surrounding hparams, such as LR and problem type
my second takeaway is that there is some way to make sin work, which I tried but could never get working
as a side note, your architectures appear to have disastrous performance
you should check out #implementation-details message
it should be possible to hit 94% on CIFAR-10 in about 10 seconds on a free T4 in google colab
Lol yeah, I used a training recipe / architecture from another paper, and did no hyperparameter tuning. So it could be that my results are specific to those hyperparameters
"94% on CIFAR-10 in 3.29 Seconds on a Single GPU"
Is a killer title. I wish I had read this before I set up my experiments
But then, it was published 26 days after my paper, so I guess it couldn't be helped lol
before, there was https://github.com/tysam-code/hlb-CIFAR10/
Thanks, man. I'll be playing with this over the weekend. Very much appreciated
wild seeing Keller's work pop up in a comment thread here. Small world. He seems like a really neat guy from our interactions/discussions together. While I'm biased, I found his writeup to be high quality and well worth the read
lmk if you have any questions about it please! ❤️ :')))) airbench is definitely the top dog on speed, and hlb is written in a slightly different style, so it depends on what you're looking for. def take a look at airbench tho! :3 🙂 ❤️ :'))))
Outstanding. Thank you so much!
heh, I did 94% on cifar10 with a 100k-parameter model, but training takes forever, so I suppose it's not so cool lol
what model though
you asking about the architecture?
yes
it is kind of impressive that that works that well
i saw it
you good
thassa bog standard convnet, i guess the only thing that would have surprised me five years ago would have been the layernorm
ConvNext-like with a little bit of InceptionNet and "modern" ffn
it has quite a lot of flops relative to the parameter count
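(editor's aside on the flops-vs-params point: a rough sketch of why a small convnet can be cheap in parameters but expensive in compute. A conv layer reuses every weight at each output position, while a dense layer uses each weight once per input vector. The layer sizes below are made up for illustration and are not the model being discussed.)

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Return (params, flops) for a bias-free k x k conv layer."""
    params = c_in * c_out * k * k
    # Every weight is applied once per output position:
    # one multiply plus one add each time.
    flops = 2 * params * h_out * w_out
    return params, flops


def linear_cost(d_in, d_out):
    """Return (params, flops) for a bias-free dense layer on one input vector."""
    params = d_in * d_out
    # Each weight is used exactly once: one multiply plus one add.
    flops = 2 * params
    return params, flops


# Hypothetical sizes, chosen so both layers have the same parameter count.
conv_params, conv_flops = conv2d_cost(c_in=32, c_out=32, k=3, h_out=32, w_out=32)
lin_params, lin_flops = linear_cost(d_in=96, d_out=96)

print(conv_params, conv_flops // conv_params)  # 9216 2048
print(lin_params, lin_flops // lin_params)     # 9216 2
```

(same 9216 weights in both, but on a 32x32 feature map the conv does about 1000x the work per parameter, which is roughly how a 100k-parameter convnet ends up slow to train.)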
What paper is that
gimme paper
@dawn vine @boreal moss https://arxiv.org/abs/2404.05405
Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia p...