d_ff/d_model + swiglu tests | EleutherAI | Page 3

rose vapor Apr 5, 2024, 4:38 PM

#

So you end up with a very complex nonlinear layer function that is easily optimized because the inputs are effectively constrained by Pre-LayerNorm
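A minimal sketch of that point, under stated assumptions: the layer norm has no learned scale or shift, and `sin` stands in for the nonlinearity (the names `layer_norm` and `pre_ln_block` are made up for illustration). Whatever the scale of the incoming activations, the nonlinearity only ever sees roughly zero-mean, unit-variance inputs.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, fn=math.sin):
    """Pre-LN residual block: x + fn(LayerNorm(x)), with fn applied elementwise."""
    normed = layer_norm(x)
    return [xi + fn(ni) for xi, ni in zip(x, normed)]

# Two inputs that differ only by a factor of 1000 (hypothetical values).
small = [0.1, -0.2, 0.3, 0.05]
large = [v * 1000 for v in small]

# After normalization they are nearly identical, so fn sees the same range.
n_small = layer_norm(small)
n_large = layer_norm(large)
out = pre_ln_block(large)
```

Here `n_small` and `n_large` agree to within a small tolerance set by `eps`, so the input range of `fn` does not depend on the scale of the residual stream.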

soft bobcat Apr 5, 2024, 4:42 PM

#

my primary takeaway is that the best nonlinearities seem to have significant dependencies on the surrounding hparams, such as LR and problem type

#

my second takeaway is that there is some way to make sin work, which I tried but always failed at

#

as a side note, your architectures appear to have disastrous performance

#

you should check out #implementation-details message

#

it should be possible to hit 94% on CIFAR-10 in about 10 seconds on a free T4 in Google Colab

rose vapor Apr 5, 2024, 4:46 PM

#

Lol yeah, I used a training recipe / architecture from another paper and did no hyperparameter tuning. So it could be that my results are specific to those hyperparameters

#

"94% on CIFAR-10 in 3.29 Seconds on a Single GPU" is a killer title. I wish I had read this before I set up my experiments

#

But then, it was published 26 days after my paper, so I guess it couldn't be helped lol

soft bobcat Apr 5, 2024, 4:49 PM

#

before, there was https://github.com/tysam-code/hlb-CIFAR10/

Train to 94% on CIFAR-10 in <6.3 seconds on a single A100. Or ~95.79% in ~110 seconds (or less!) - tysam-code/hlb-CIFAR10

rose vapor Apr 5, 2024, 4:51 PM

#

Thanks, man. I'll be playing with this over the weekend. Very much appreciated

still grail Apr 5, 2024, 5:03 PM

#

rose vapor "94% on CIFAR-10 in 3.29 Seconds on a Single GPU" Is a killer title. I wish I ha...

wild seeing Keller's work pop up in a comment thread here. Small world. He seems like a really neat guy from our interactions/discussions together. I'm biased, but I found his writeup to be high quality and well worth the read

still grail Apr 5, 2024, 5:04 PM

#

rose vapor Thanks, man. I'll be playing with this over the weekend. Very much appreciated

lmk if you have any questions about it please! ❤️ :')))) airbench is definitely the top dog on speed, and hlb is written in a slightly different style, so it depends on what you're looking for. def take a look at airbench tho! :3 🙂 ❤️ :'))))

rose vapor Apr 5, 2024, 5:05 PM

#

Outstanding. Thank you so much!

boreal moss Apr 5, 2024, 7:28 PM

#

heh, I did 94% on CIFAR-10 with a 100k-parameter model, but training takes forever, so I suppose it's not so cool lol

fallen spear Apr 5, 2024, 7:58 PM

#

boreal moss heh, I did 94% cifar10 with 100k parameter model, but training takes forever, so...

what model though

boreal moss Apr 5, 2024, 8:04 PM

#

fallen spear what model though

you asking architecture?

fallen spear Apr 5, 2024, 8:20 PM

#

boreal moss you asking architecture?

yes

#

it is kind of impressive that that works that well

#

i saw it

#

you good

boreal moss Apr 5, 2024, 8:32 PM

#

fallen spear yes

📎 message.txt

#

there are some tricks that aren't really visible

fallen spear Apr 5, 2024, 8:33 PM

#

thassa bog standard convnet, i guess the only thing that would have surprised me five years ago would have been the layernorm

boreal moss Apr 5, 2024, 8:35 PM

#

ConvNeXt-like with a little bit of InceptionNet and a "modern" ffn

#

it has quite a lot of flops relative to the parameter count
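A rough sketch of why a convnet can have a lot of flops per parameter: a conv kernel is reused at every output position, while a dense layer uses each weight once per input. The layer sizes below are hypothetical and are not taken from message.txt; biases are ignored and cost is counted in multiply-accumulates (MACs).

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w):
    """Parameters and MACs for one 2D convolution (bias ignored)."""
    params = c_in * c_out * kernel * kernel
    # Every weight is applied once per output position.
    macs = params * out_h * out_w
    return params, macs

def linear_cost(d_in, d_out):
    """Parameters and MACs for one dense layer applied to a single vector."""
    params = d_in * d_out
    macs = params  # each weight is used exactly once
    return params, macs

# Hypothetical 3x3 conv, 32 -> 32 channels, on a 32x32 feature map.
conv_params, conv_macs = conv2d_cost(32, 32, 3, 32, 32)
# Hypothetical dense layer with the same number of parameters.
lin_params, lin_macs = linear_cost(96, 96)

conv_ratio = conv_macs / conv_params  # MACs per parameter for the conv
lin_ratio = lin_macs / lin_params     # MACs per parameter for the dense layer
```

With these numbers both layers have 9,216 parameters, but the conv does 1,024 MACs per parameter (one per output pixel) while the dense layer does 1.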

fallen spear Apr 9, 2024, 6:22 AM

#

"negligible impact"

#

heh

dawn vine Apr 9, 2024, 12:38 PM

#

fallen spear "negligible impact"

What paper is that

boreal moss Apr 9, 2024, 2:14 PM

#

fallen spear "negligible impact"

gimme paper

fallen spear Apr 9, 2024, 3:13 PM

#

@dawn vine @boreal moss https://arxiv.org/abs/2404.05405

arXiv.org

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia p...

boreal moss Apr 9, 2024, 3:20 PM

#

https://tenor.com/view/food-chew-eat-om-nom-nom-nom-nom-gif-4855315

fallen spear Jul 29, 2024, 4:41 AM

#

#off-topic message

#

bump

#d_ff/d_model + swiglu tests