#ETHOS

1798 messages · Page 2 of 2 (latest)

spare flame
#

I recommend doing this by making a HF adapter for your model so you can run inference that way and also can run lm eval harness

#

but im sure nanogpt has code for the initial inference sanity check

whole goblet
#
The winner of the 2008 United States presidential election would be George W. Bush, who took office more than two decades ago.

In a paper released Thursday, the former Republican presidential candidate touted his administration’s economic record – including a successful push to promote American manufacturing through jobs-creation programs – as a turnaround, saying that the economy will be stronger this year.

“The economy is a strong force in the world,” said Bush, “as long as we’re not doing terrible things on

`step 55147/55148 (100.0%): train loss 3.1518, val loss 3.1519`
Small bump at the end, so will probably be interesting to see what happens in ablations. Still a very solid result.
spare flame
whole goblet
#

Baseline running now

tranquil fiber
#

Yeah I said the same thing too haha, I've been burned too many times 😅😭

tranquil fiber
#

It could be that this method is more variance sensitive

spare flame
tranquil fiber
#

Yes def

#

Since batchsize impacts implicit lr by a lot (and other things as well)

spare flame
#

yeah (sorry I keep being a downer about checking things) also just keep in mind that even with same hyperparams and bsz it's possible that by reducing the bsz you will have kneecapped nanogpt, and that your method simply performs better than nanogpt does at low bsz

tranquil fiber
# whole goblet

Yeah I think it might be that this method is more variance sensitive, if it's wiggly at 120M then scaling may be even moreso. Could also just be the smaller batchsize exacerbating it, something to keep in mind

#

(also ablating hyp effects/diffs vs the baseline/PEER baseline may be hard since different methods will be sensitive to different hyps, and not much compute for e.g. a full grid search (or the like))

tranquil fiber
#

If there is enough compute to do final runs with the ~.5M tokens batchsize the baseline has then that would make the argument a lot more solid

#

(but I know there's limited compute here so you need to budget it)

spare flame
#

yeah to be clear im not discouraging running the nanogpt at low bsz, you need to do that esp since its inexpensive

tranquil fiber
#

Yeah accum steps are impl so it should be a config change I think
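
A minimal sketch of the arithmetic behind reaching a ~0.5M-token batch through gradient accumulation. The micro-batch size (12 sequences), the block size (1024 tokens), the single GPU, and the function names are all assumptions for illustration, not values confirmed in this thread.

```python
import math


def tokens_per_step(micro_batch_size, block_size, grad_accum_steps, num_gpus=1):
    """Tokens consumed by one optimizer step."""
    return micro_batch_size * block_size * grad_accum_steps * num_gpus


def accum_steps_for_target(target_tokens, micro_batch_size, block_size, num_gpus=1):
    """Smallest accumulation count that reaches at least target_tokens per optimizer step."""
    tokens_per_micro_step = micro_batch_size * block_size * num_gpus
    return math.ceil(target_tokens / tokens_per_micro_step)


# Assumed setup: 12 sequences of 1024 tokens per micro-batch on one GPU.
steps = accum_steps_for_target(491_520, micro_batch_size=12, block_size=1024)
print(steps, tokens_per_step(12, 1024, steps))
```

Under those assumptions the effective batch grows without raising peak memory, at the cost of more forward/backward passes per optimizer step.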

spare flame
#

a cheap way to make sure you're not kneecapping nanogpt baseline is run that at both small and large bsz

#

that way you can see if nanogpt baseline is better at higher bsz

#

if so, then you will eventually need to run your test at higher bsz too

#

if not, great

tranquil fiber
#

Good point yeah, hadn't thought of that

#

There is also a low-res chart from the nanogpt example run done in the repo but have no idea how much syncrot it's been under

#

That said this run does seem to be beating the LLM.c python baseline logs pretty handily

#

So that may bode well

#

But those may have been recorded before the LR change, which made it converge ~60% faster or so

#

So if that's the case, then the two runs would be roughly ~even or so, which bodes well (at least)

whole goblet
#

Yea, baseline is pretty fast to run, so I can check both batchsizes. Right now it's approximately even at the lower batchsize

#

But I'm only at the first checkpoint, and the more interesting stuff will be in a few hours

tranquil fiber
#

Yeah

#

Either way is fine I think

whole goblet
#

tbh though if we are getting into hyperparam stuff, this is arguably comparing a run that has been tuned like crazy at all levels to a first real attempt

tranquil fiber
#

Maybe actually bigger after kernel works is okay since that would give a good probe into how your new method performs at different batchsizes/effective lrs

whole goblet
tranquil fiber
#

Gotcha

#

Yeah I think they merged the lr change into it

whole goblet
#

I'm just straight up materializing GBs of weights in the hypernetwork version that don't need to exist rn

tranquil fiber
#

Nanogpt has some okay tuning but not like an absurd amount iirc

#

But still, more than just a few runs 😅😂

whole goblet
#

Yea, I'm at n=1 😅

tranquil fiber
#

Yeah it's a pretty good start!

#

Also sounds like a good argument for investing in the kernel work

#

Now that the big run looks really promising

#

(hard to tell the order of operations on that sorta thingie)

whole goblet
#

But yea, the GPT-2 small version should be completely apples to apples. Identical config to the hypernet stuff (including LR which arguably should likely differ between architectures)

tranquil fiber
#

Yeah ideally as a speedrunner I'm immediately thinking about how things are set up for the hypernetwork, not just LR but also inits and a few other things

#

Glad it's doing well vanilla

whole goblet
#

Yea, I think even if on a first run for preprint, just showing that this works and is competitive without tons of optimization would be enough

#

But yea, the PEER baseline I think needs to be rerun

tranquil fiber
#

I believe that peer was run at the default batchsize IIRC? (It's a bit more than .5M so not exactly but in that ballpark)

whole goblet
tranquil fiber
whole goblet
tranquil fiber
#

Since it would be cheaper to rerun

tranquil fiber
#

And nanogpt?

whole goblet
#

Oh, yea. I'm cool doing that

tranquil fiber
#

Is it 491.5k as well?

whole goblet
#

NanoGPT is using the small one

tranquil fiber
#

I mean the normal baseline for it

whole goblet
#

But this will be done in a few hours. It won't be an issue to rerun at larger batch

#

Oh, yea

tranquil fiber
#

It is the same

tranquil fiber
#

Good datapoint as smerky said

#

Especially since it's cheap

whole goblet
#

Yea, at a minimum gives me options depending on where memory pressure actually ends up after the kernel work

#

I have a prototype that seems to work, but just need to test a bit before I say for sure

#

And reduces memory from 35GB running to a bit under 8GB

tranquil fiber
#

Nice!

whole goblet
#

Which does also mean that if I have to hop off the GH200's I'm on now, I could move back over to my 3090's

tranquil fiber
#

What's the baseline nanogpt network at that batchsize for memory use (also is this nvidia-smi or something a bit more fine-grained?)

whole goblet
#

nvtop (so smi under the hood iirc), seeing 6.8GB

tranquil fiber
#

Gotcha

#

That is an upper bound IIRC so not necessarily the most accurate

whole goblet
#

yea there's for whatever reason just a GB or so of unfreed memory that shows up randomly and persists until I kick the node

tranquil fiber
#

But decent maybe for sanity checking potential max use

#

Interesting

#

Yeah

#

6.8->8 is certainly pretty reasonable

whole goblet
#

Yea, I could get it to be identical~ if I wrote a backwards kernel but I don't want to torture myself

#

fuck chain rule

#

all my homies hate chain rule

whole goblet
#

Okay, yea I should have run baseline first

#

But still worthwhile for me to figure out the fused kernel and run ablations, since we're in spitting distance

#

That way I can avoid a full grid search of hyperparams for this model

#

Does also make me question why PEER underperformed so hard, that's not something I expected

spare flame
#

live and learn 🤷‍♂️

#

but also, the most annoying thing is that sometimes these lines can cross much later in training

whole goblet
#

Yea, I need to see where I end up on throughput with the fused kernel

#

And that'll inform where to go from here

#

Same rough area to me suggests that there's still likely something here, just not obviously busted

spare flame
#

definitely let the baseline run the whole way if u can

#

it may be closer than you think

#

also important to still run the large bsz baseline

whole goblet
#

Oh I am, I just shared because it's beating where hypernet ended up 70% of the way through

#

Still would have expected PEER 5b to perform better than dense baseline

#

Weird that it didn't

spare flame
#

your lora-like constructions may be able to tolerate 10x the learning rate

whole goblet
#

Would that break comparisons?

spare flame
#

yes but you dont have to care

#

you can assume nanogpt is optimal hyperparams for gpt2

whole goblet
#

So basically crank up the LR for PEER, but I might need to reduce LR for the hypernet

spare flame
#

if you recall, i dont actually know how your code works 🤣
but if you use a lora-like construction with down then up projections then maybe you can up the LR successfully for that portion

#

the overall point is that your setup may require different hyperparameters to do well

whole goblet
#

I can shoot you the notebook. Fern took a look too

spare flame
#

ok, then she may have a better idea of things that could help

whole goblet
spare flame
#

so you might want to see if it learns faster with some changes

#

you can assume that nanogpt is reasonably optimal for itself

whole goblet
#

Sounds good to me

spare flame
#

also good to check if the peer paper says anything about this for it

whole goblet
#

I also have contact with Owen He, so I might just ask him to double check the setup

#

Was hoping to avoid this by using an open source impl

#

Once this baseline is done I'll get the fused kernel working and then figure out budget for a hyperparam search. Other problem with a hyperparam search is I don't know if this exact shape of network is what will be ideal until I run ablations

#

And that might affect ideal hyperparams

spare flame
#

there is no magic bullet here that I know of

#

try to find people who did similar things

#

and see if they had to compensate somehow specifically

whole goblet
#

Is lucidrains in this server?

spare flame
#

no

whole goblet
#

I might just try to ping him on email and see if he has any insight

spare flame
#

if you are going to contact him, an issue on the repo might be the best and most polite way

whole goblet
#

You wouldn't just email?

#

Mostly ask because the repo hasn't been touched in a year

spare flame
#

he has 10000 repos

whole goblet
#

Yea, that's the other reason I figured email might be best

spare flame
#

🤷‍♂️

whole goblet
#

Oh, actually his website points to signal

#

Specifically for reaching out

#

Going to do that

tranquil fiber
whole goblet
#

Trying a kind of dumb config while I test out the fused kernel. Dropped weight decay to 1e-2, and quadrupled learning rate

#

So far isn't exploding and is like a solid .5 nat ahead of the old run at this token count (5.5 vs 6.0), but still so early it doesn't mean anything other than it doesn't immediately blow up yet.

whole goblet
#

@tranquil fiber Hacked a bit on the fused kernel and used it as an opportunity to run small experiments on hyperparams, and I think you might have been right on weight decay

spare flame
#

its ok to change LR (because you can assume nanogpt is close to optimal on that for gpt2), but not ok to change WD without rerunning baseline

whole goblet
spare flame
#

yeah im just letting u know that WD always specifically hurts short term performance

#

so you're not learning anything about your algorithm by removing/reducing it

whole goblet
#

Past that, when talking about late training, is there a standard definition for that?

spare flame
#

Like trillions

whole goblet
#

Since a 9B token run for models this size is already way past what chinchilla suggests
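
For a rough sense of scale, here is a sketch of the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter. The 124M parameter count for GPT-2 small and the exact ratio of 20 are assumptions; the 9B token figure is the one mentioned above.

```python
def chinchilla_tokens(n_params, tokens_per_param=20.0):
    """Rule-of-thumb compute-optimal token budget (assumed ratio, not a law)."""
    return n_params * tokens_per_param


n_params = 124e6    # assumed size of GPT-2 small
run_tokens = 9e9    # token count of the run discussed in the chat
budget = chinchilla_tokens(n_params)

print(f"rule-of-thumb budget: {budget / 1e9:.2f}B tokens")
print(f"the 9B run is {run_tokens / budget:.1f}x that budget")
```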

spare flame
#

Or at least hundreds of billions

whole goblet
#

So chinchilla is outdated?

#

Or not applicable for other reasons?

spare flame
#

No one ever trains to chinchilla, it's only optimal in a kind of useless sense

whole goblet
#

I think I’m just coming to the conclusion that I need to do a full hyperparameter sweep which quickly becomes compute prohibitive.

#

Is there a set of tests that would reach “good enough” for preprint so I can garner more compute?

spare flame
#

You can put out a preprint anytime you like. As long as you're honest about things and do reasonable comparisons the worst case scenario is it will be ignored. So it just depends on what your goals are for the preprint.

#

But more generally, my feeling is that no one will be interested enough in any method that performs worse and is slower for the same parameter count for the preprint to specifically help get you compute.

#

You may still be able to get more compute tho! I'm only saying I don't know that a preprint will cause that to happen.

#

Also, the fact that your implementation and test gives bad results for PEER is problematic for trust in your current methodology.

#

No one is going to believe that your method dramatically beats PEER while simultaneously trusting that your result showing PEER is destroyed by GPT2 was done properly.

#

You've got some serious problems right now and this does not warrant a paper at the current time.

#

What it does warrant is further investigation, maybe including trying to replicate the PEER results.

#

I feel like I'm always having to give bad news here, but the reality is you are going to need to identify these kinds of problems without 3rd party input.

#

After all, you already know the facts of the situation as well or better than I do.

#

When things 'seem wrong' or you believe they don't show the right result, you gotta either figure out how to address those issues or decide that they are in fact the right result and your hunches were wrong. Then update your priors, learn from the experience, and try some new thing or approach.

#

Right now, you're faced with two opposing results: PEER was worse than yours, but PEER was also much worse than GPT2

#

Either PEER isn't actually good, your test/implementation isn't good, or it doesn't work well at this scale/hyperparameters.

#

Possibly all three!

#

None of these things imply that your method works well. But it leaves the door open that it might, especially if it works well in situations conducive to PEER.

#

It's also important to recognize that your hunches or ideas about what should work well can be just plain wrong. I think experienced researchers expect that the vast majority of the things they try will NOT work, regardless of their hypotheses with theoretical justifications.

#

This is certainly true for me personally. I give up on things or methods and/or have to try a very different tactic maybe 75-99% of the time. I'm sure @tranquil fiber can give more insight into her batting average, but I'd guess that it's similar.

#

You should be trying to do more tests quicker with less resources so that you can swing the bat more frequently. This is why I advocated for testing against the nanogpt baseline early rather than late. Because time spent per idea matters. You can also spend a lot of time drilling down on one idea, but it's only good to do this when you have a good sense of what may be salvageable and what can't.

#

And most importantly of all: you always need to be testing against a fair yardstick as early as possible. Science is about verification via either experimental evidence or mathematical proof. Without this, you are just hoping that your intuition is well enough developed while flying blind.

#

In summary:

  1. Test early and often against a fair yardstick.
  2. Try to use less compute and/or obtain more compute prior to having proven stuff.
  3. Ask yourself every time what it is you've proven and what is uncertain. And how you can most efficiently add to the evidence in either direction.
  4. Don't put out a paper until you have shown something useful and can prove it. [This can include negative results if you have strong evidence]
  5. Get a lot of at-bats as you work on increasing your batting average.
spare flame
#

but for example now that you have the baseline gpt2 run you can pretty quickly guess at how new runs of peer etc are doing
or you can make your code run faster (its a tradeoff of your coding time tho)

#

or you can spend more money per unit time to get things done faster, and/or you can stop runs early since u can compare to gpt2 base

whole goblet
#

But I'm still not sure exactly why it's "useless"

spare flame
#

you can also try an even smaller scale to get your bearings

spare flame
#

its fine to train to it as a test

whole goblet
#

It's not like PEER was ever released

spare flame
#

yeah it makes sense to train to it here, but I also dont think anyone is going to care that much if you train to it specifically

whole goblet
#

I don't want a pat on the back for training to chinchilla optimal, just to know if it's good enough for that to not be a sole reason the methodology is dismissed

spare flame
#

you gotta keep in mind that your new FFN replacement is going to have different training dynamics than regular ones, so a study (chinchilla) of optimal tokens trained that was done on traditional FFNs is kind of irrelevant to it

whole goblet
#

Sure, but there's going to be so many differences in general, but for some reason I also have to keep LR/WD/etc. identical despite those likely having different dynamics as well?

#

Just trying to understand the line here

spare flame
#

I did not tell you to keep LR the same, but I did tell you to keep WD the same

#

Specifically, you can and should change the LR for YOUR model

#

if feasible

#

but the original LR was a good starting point when you literally had zero runs done 🙂

whole goblet
#

Yea, I'm just moreso saying that I might also have to take an approach where I use different LR's for different parts of the network

#

And will likely reduce WD just for the hypernetwork
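
One way the "different LR and WD for different parts of the network" idea could be wired up is with per-group optimizer settings. This is only a sketch: the `hypernet.` name prefix, the parameter names, and the base values (6e-4 LR, 0.1 WD) are hypothetical, and the 4x LR / 0.01 WD override just mirrors the trial config mentioned earlier. It says nothing about whether such a comparison is fair to the baseline.

```python
def build_param_groups(named_params, base_lr, base_wd, overrides):
    """Split (name, param) pairs into optimizer groups by name prefix.

    overrides maps a name prefix to a dict with optional keys "lr_mult"
    and "weight_decay". The first matching prefix wins; unmatched
    parameters go to a default group with the base settings.
    """
    groups = {"default": {"params": [], "lr": base_lr, "weight_decay": base_wd}}
    for name, param in named_params:
        key = next((p for p in overrides if name.startswith(p)), "default")
        if key not in groups:
            o = overrides[key]
            groups[key] = {
                "params": [],
                "lr": base_lr * o.get("lr_mult", 1.0),
                "weight_decay": o.get("weight_decay", base_wd),
            }
        groups[key]["params"].append(param)
    # Drop groups that ended up empty so the optimizer never sees them.
    return [g for g in groups.values() if g["params"]]


# Strings stand in for real parameter tensors in this demo.
named = [
    ("attn.qkv.weight", "p0"),
    ("hypernet.down.weight", "p1"),
    ("hypernet.up.weight", "p2"),
]
groups = build_param_groups(
    named,
    base_lr=6e-4,
    base_wd=0.1,
    overrides={"hypernet.": {"lr_mult": 4.0, "weight_decay": 0.01}},
)
for g in groups:
    print(g)
```

With PyTorch, a list of dicts in this shape (keys `params`, `lr`, `weight_decay`) can be passed directly to `torch.optim.AdamW` in place of a flat parameter list, with `model.named_parameters()` as the input here.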

#

And all of these make me question how "fair yardstick" is defined

#

Since they all clearly have different effects on different models

#

And that's without getting into other stuff like beta/clipping/etc.

spare flame
#

there's no good answer

#

this stuff is hard

whole goblet
#

I guess that's what I'm getting at, is there's no good answer, but you're telling me that the answers I'm coming up with are definitely wrong

#

So just trying to figure out where you're actually drawing these lines

#

Like are there papers I can read on this stuff? Because I've been pretty obsessively looking at how other arch papers present this stuff, and most of these topics aren't even touched on

spare flame
#

yeah you're right that it's hard to know that stuff without a lot of detailed info about hyperparams, but I think generally changing LR is fine and that's all you should change

#

you can assume (for now) that nanogpt is somewhat near optimal hparams for itself

#

im just telling you to try to move as rapidly as possible

whole goblet
#

I think from early tests and also following Fern's advice, there's likely some benefit from looking at weight decay on the hypernetworks, both because they receive more dense gradients than any other part of the network, and their ability to produce diverse outputs directly ties to the hypothetical capacity of the overall network

spare flame
#

yeah she is better equipped to advise you on that than I am

#

but you could also just rerun the baseline with no WD

#

which is safer

#

WD isnt going to help on a short test but it will HURT

#

so removing it on only yours is unfair

whole goblet
#

But couldn't that theoretically harm GPT-2 Small's results?

spare flame
whole goblet
whole goblet
spare flame
#

im saying that WD hurts short tests across the board, so remove it from nanogpt gpt2 first if you really want to remove it from yours

#

and show that it helps (or at least doesnt hurt) gpt2 to remove it there

whole goblet
#

Sure, I just feel like removing WD from one that benefits, and another arch that hurts from it removes the fairness of the yardstick?

spare flame
#

sorry, are you saying that WD will benefit yours?

whole goblet
#

Yes

spare flame
#

oh

whole goblet
#

that's what I've been saying lol

spare flame
#

lol

#

I thought you wanted to reduce it on the hypernetwork

whole goblet
#

I do

#

Higher weight decay hurts the network

#

Basically because more diverse outputs seem to help

spare flame
#

when I say 'wd benefits' I mean "MORE wd benefits"

whole goblet
#

Oh, no, the hypernetwork wants as little weight decay as possible from what I can tell

spare flame
#

right, so take it away from gpt2 first

#

show that it improves gpt2 to do so

#

then run it without wd on yours

whole goblet
#

Oh, so you're saying that less weight decay also helps GPT-2

spare flame
#

these are short tests

#

if you were doing 300B tokens I'd have a different opinion

whole goblet
#

Okay, now I'm understanding

spare flame
#

sorry if I wasnt clear enough about that

#

btw I could be wrong! but generally this is true

whole goblet
#

I just thought you were saying that by reducing WD, I should expect GPT-2 to get worse, so it felt like it would be sullying tuned hyperparameters for GPT-2

spare flame
#

nope, opposite for short tests

#

it should improve gpt2

whole goblet
#

Okay cool, we're on the same page then

spare flame
#

this is all just to chase the idea that reducing WD will make your hypernetwork better

#

I don't necessarily think that's the right thing to chase, but I'm no expert!

#

unless you think its also damaging PEER

#

to me the #1 problem here is that PEER looks awful

#

that's a big problem

whole goblet
#

Yea, I think that there are some reproduction issues here. I'm going to go and compare my original implementation of it to lucidrains' to see if there are any obvious bugs

#

Because I was seeing better performance when I was working on similar with ETHOS

#

PEER might just be bad

spare flame
#

more importantly, check the PEER paper to see in what situations it actually (supposedly) worked well for them

#

if its a very different size regime etc.

#

and try to reproduce that however you can if possible even briefly

whole goblet
#

They were doing IsoFLOP comparisons, didn't care about isoparam

spare flame
#

and/or by talking to the author

whole goblet
#

Yea, he's reviewing my reproduction hopefully in the next couple weeks

spare flame
#

yeah its possible that PEER is horrible at isoparams

whole goblet
#

Which is why I went for 5B, since it's IsoFLOP(ish) with GPT-2 Small
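
To make the isoparam vs IsoFLOP distinction concrete, here is a sketch using the common approximation that training costs about 6 FLOPs per active parameter per token. It ignores attention and routing/retrieval overhead, and the active-parameter counts below are hypothetical, not measured from either model.

```python
def approx_train_flops(active_params, tokens):
    """Rough training cost: about 6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


def isoflop_tokens(ref_active_params, ref_tokens, active_params):
    """Tokens another model can see for the same approximate FLOP budget."""
    return ref_active_params * ref_tokens / active_params


dense_active = 124e6   # assumed: a dense model activates all of its parameters
dense_tokens = 9e9     # token count of the run discussed in the chat

# Hypothetical sparse model: large total parameter count, small active set per token.
sparse_active = 248e6

print(f"dense run: {approx_train_flops(dense_active, dense_tokens):.3e} FLOPs")
print(f"IsoFLOP tokens for the sparse model: "
      f"{isoflop_tokens(dense_active, dense_tokens, sparse_active) / 1e9:.2f}B")
```

The point of the sketch is that total parameters do not enter the formula at all, which is why a model can be IsoFLOP with a dense baseline while being far from isoparam.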

spare flame
#

but also if PEER is horrible at isoparams and this is inspired by it, that might imply problems for yours

#

hard to know

whole goblet
#

Which is fair, but I'm showing that with Iso"Capacity" that I'm outperforming, so it still feels like there's something there

#

Even if the base has some trouble

#

Like I think the most interesting thing here is that this model is effectively constructing rank-k experts neuron by neuron

#

And is like, spitting distance from dense

spare flame
#

no idea if they supplied or can supply enough data for you to do that

#

or its at any sort of feasible scale

whole goblet
#

It is, and I have a kernel that makes it pretty tractable

#

But that was with some other parts that I'm not sure how to match 1-1

#

I'd expect that harnesses within GDM at the time were using GQA

#

And my harness used MLA

#

And nanoGPT uses MHA

#

So there's a lot of potential variance just from attention mechanism used

#

PEER's paper itself is pretty light on hyperparam details. We don't even know the model dim or if it was held constant

#

Or if depth was constant

spare flame
#

author can shed light on this much more easily than they can check your code

whole goblet
#

yea, I'll shoot an email

tranquil fiber
tranquil fiber
spare flame
#

since this is in the publishing help section, I have one more note about why and when to put out a preprint:
consider how many citations you expect to realistically get and base it on that (considering who and why they might cite your work and in what situation and other papers)
don't write or put out a preprint until that number is greater than zero

tranquil fiber
# spare flame This is certainly true for me personally. I give up on things or methods and/or ...

5-10% for methods I'm new(ish) to, ~30-40% for methods where I have a very strong guess that makes sense (a lot of this is just sense from last ~decade or so).

but that doesn't include e.g. bugfixes or the like as much, maybe a bit.

but agreed that maximizing the amount of times you can pull the lever is generally best, especially when starting out, it's how you gain the most intuition and information

#

(if completely new to something, e.g. an entirely new subfield, at the start, maybe a bit worse than 5-10%, e.g. ~3-5% or so)

whole goblet
#

Yea, and I think I'm going to have to take another stab at this kernel. I got one working, but it's not beating easy autograd tricks

#

Can get batch size up considerably now, though with the autograd tricks

#

And did get about a 30% improvement in throughput

spare flame
whole goblet
#

I can just tell what's going wrong, but getting the tiling correct has been a pain

spare flame
#

and run some much shorter tests varying that stuff if u want to and comparing to short run of baseline without wd etc.

#

that way you can get feedback in minutes or hours not days

#

shorter doesnt always extrapolate cleanly but at least its rapid signal acquisition

#

you can only really learn from an experiment, so running lots of those may help you learn stuff about what works in this regime and what doesnt

whole goblet
#

Yea, I can get a complete chinchilla run done on this (assuming it is the first point I can extrapolate from) within a few hours

spare flame
#

yeah but u can also NOT get a complete one done

#

lol

tranquil fiber
#

it's kinda like rebuilding a whole car to see if a new piston fits

spare flame
#

yeah man just jiggle the pistons a lot!!!

tranquil fiber
#

you should do scheduling, hyperparameter tuning generally on longer runs

#

most other things (esp smoke/sanity testing if things are working okay) should happen on shorter ones

tranquil fiber
whole goblet
#

I mean, this thing is training, so I'm not doing a ton of long runs checking if stuff is working. Usually if it's completely not functional I know within the first 10 iterations at most

tranquil fiber
#

yep

#

tho for e.g. the peer baseline i think it should be pretty short to know how the repro does

#

you may need to see how their loss curves compare against baselines to see what you should expect

whole goblet
#

But I'm definitely at the "how do I tune this" phase ahead of trying to get ablations started, but it seems like this would benefit from a hyperparameter sweep before those

tranquil fiber
#

larger models usually are much more step efficient, time-wise they may take longer to converge, but stepwise should almost always be faster than smaller models

whole goblet
whole goblet
#

tbh I think I might just want to dodge the PEER baseline question

tranquil fiber
#

defaulting to "bigger model should be better stepwise almost always" then should be a good litmus test there

tranquil fiber
#

it would be nice to have as a comparison

#

but in the spirit of minimization

#

I will say @whole goblet the unfun thing is that in all likelihood, after fresh-implementing things, usually there are 2-3 major bugs, and a bunch of minor ones, sometimes maybe more depending on size of implementation

#

(and depends on implementer as well, you'll get a vibe for what yours are)

#

(usually these are ones too that are more silent)

#

so, doing very thorough tracethroughs of the code and what's happening oftentimes helps with these, esp on smaller toy examples

whole goblet
#

Yea, I did try to bug bash before really touching too much

tranquil fiber
#

yeah

#

just something to keep in mind

whole goblet
#

I had a major bug early where I was just generating the same neuron k times
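
A bug like that can be caught with a cheap check on the generated vectors. The sketch below flags near-identical outputs via pairwise cosine similarity; the function names and the toy vectors are hypothetical and are not taken from the actual hypernetwork code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_pairwise_cosine(vectors):
    """Largest cosine similarity over all distinct pairs (-1.0 if fewer than two vectors)."""
    best = -1.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            best = max(best, cosine(vectors[i], vectors[j]))
    return best


duplicated = [[1.0, 2.0, 3.0] for _ in range(4)]               # same "neuron" k times
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(max_pairwise_cosine(duplicated))
print(max_pairwise_cosine(distinct))
```

A maximum that sits at or very near 1.0 on a small toy input is a strong hint that the generator is emitting copies.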

tranquil fiber
#

yeah

#

it can be subtle

#

even in proper implementations there are things that can be hard to discover e.g. MoE router collapse

#

(which is also why having a good testbench w/ a 5-10 minute turnaround for basic tests is really really useful for in-the-loop debugging)

#

(but longer also works too, anything over an hour gets to be a bit more iffy w/ exploration stuff)

#

it may be good to also plot tons of statistics in tb (or the like) about what's happening in the routers, e.g. router weight distributions, weight similarities for generated values, etc

#

to see how that evolves over time

#

you generally want things to be gaussian, (or, if it's principled, log-normal -- but you really should be sure as to why here), if things collapse or spread out weirdly then there's a sign that something might be off in your training

#

(max and min also is super helpful for determining this)

#

basically, putting eyes everywhere on what's happening in your code

#

you gotta know intimately what's happening and evolving over time in training, to make sure that everything is healthy
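
As a sketch of the kind of statistics described above, the helper below summarizes a flat list of values (for example router weights at one logging step) with mean, std, min, max, skewness, and excess kurtosis. For roughly Gaussian values, skewness and excess kurtosis should both sit near zero. The function name and the synthetic data are assumptions; in practice each number would be logged over time in TensorBoard or a similar tool.

```python
import math
import random


def summarize(values):
    """Moment-based summary of a non-empty list of floats."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        # Both shape statistics are defined as 0.0 for a fully collapsed distribution.
        "skew": m3 / std ** 3 if std > 0 else 0.0,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,
    }


rng = random.Random(0)
healthy = summarize([rng.gauss(0.0, 1.0) for _ in range(10_000)])
collapsed = summarize([0.5] * 1000)   # every value identical: a collapse signal

print(healthy)
print(collapsed)
```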

whole goblet
#

Makes sense. I'll start collecting more of that

tranquil fiber
#

yeah, keeping an eye on cheapness and what can you spot early that predicts other things later on as well

#

part of dealing with new methods is establishing cordons for good performance vs bad, and basically diagnosing what's going on in any given area

#

(which can be quite hard and expensive at first as you build up intuition for yourself, sometimes building out that toolbox can be a lot of work! but it is well worth it in the end)

tranquil fiber
# whole goblet Makes sense. I'll start collecting more of that

one thing e.g. the variance of your method's performance really does beg for a larger batchsize, that's an easier one (if you look at the wiggle of the variance of the loss + the val loss you can see that the loss variance is leaking to the val loss variance which is generally a big no-no, it seems to be pretty strong too -- so that's one avenue that the network performance can improve along i think)

whole goblet
#

Seeing how much I can eke out with max-autotune rn

tranquil fiber
whole goblet
#

Oh, yea 100%

tranquil fiber
#

Ofc there's something to be said for some ops doing better with larger batchsizes so that's a good thing yeah

whole goblet
#

Just means that I can ideally move this to more parallelizable systems once I get there

#

Without as many of the broadcast/collect portions

tranquil fiber
#

Honestly staying on one GPU for a while is probably your best shot

#

Usually it takes...many many hours of experiments and dev, and many long times of debugging to get a truly new idea functional, very very rarely does it happen straight in the first shot

#

I think you'd need to have a win in one of the iso regimes to move forward towards a paper for it

#

(which is a bit annoying I know, but at least it does give you options to trade along)

#

And in the meantime you need (generally) as much simplicity as possible for it

whole goblet
#

Yea, and to be clear not trying to say I've won in the iso regimes. Just moreso that if the PEER trends are accurate (which I guess is worth questioning) I can get a cheap win by just increasing k arbitrarily which doesn't increase parameters but drastically increases flop cost.

#

But that feels more useless

tranquil fiber
#

Well PEER should be handily beating the nanogpt model iiuc

#

A lot of ML research is detective work debugging

#

Always being skeptical and assuming something is wrong

whole goblet
#

Yea, I'm starting to wonder if there's a major problem in reproduction

tranquil fiber
#

Yeah

whole goblet
#

He said he had looked at lucidrains' impl and said it was good

#

But maybe didn't look that carefully or something?

#

Or I'm doing something wrong

tranquil fiber
#

You can also try smaller # params to see if it's a quirk of having tons of PEER params (i.e. how does peer do with isoparams? Only 2-3x params? Gotta rule out something here. But only after a good bug sweep)

tranquil fiber
# whole goblet Or I'm doing something wrong

It's not always bad to assume that one is doing 2-3 things wrong at any given time, and even if the results beat the baseline being skeptical can be an enormous way to improve performance

tranquil fiber
#

Yeah, the general idea there I mean

#

"something in the same vicinity"

tranquil fiber
#

There's so many things it could be 😭

#

Unfortunately

whole goblet
#

Oh 100%. I'm just annoyed because I reimplemented it, but then decided to go with lucidrains so I can cite instead of proving that my baseline is valid

tranquil fiber
#

It's a good idea to look at their code carefully, line by line, and see what's going on

#

‼️

#

^

whole goblet
#

I even have an optimized kernel 😅

#

But it does leverage MLA for that impl

whole goblet
#

It was just easier to cite than use my own

whole goblet
#

I also haven't touched my own implementation in a few months lol

whole goblet
# spare flame Compare results

Yep, just need to do the reimplementation work. Looking back on it, we also integrated heavily with MLA so we could compute in a latent space

#

So it'll take some retooling

whole goblet
#

80% faster, no kernel needed, just einsum bullshit

#

Just 2x slower than dense baseline now

#

For the hypernet version

#

Actually should say more than 80% faster, it's an 80% reduction in wall clock time. 178k tokens per second

#

It's late, need to make sure I didn't just bug the hell out of this, but it's training and seems to be identical

whole goblet
#

Jk I messed up grad accumulation logic
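
(The log doesn't say what the accumulation bug was. As a minimal sketch of one common class of mistake, assuming equal-sized micro-batches and a mean-reduced loss: each micro-batch gradient has to be scaled by 1/num_micro before summing, otherwise the effective learning rate is silently multiplied. The linear model and all names here are illustrative, not taken from the ETHOS code.)

```python
import numpy as np


def mse_grad(w, X, y):
    """Gradient of mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


def accumulated_grad(w, X, y, num_micro):
    """Accumulate micro-batch gradients so the result matches one full-batch gradient."""
    grad = np.zeros_like(w)
    for X_mb, y_mb in zip(np.split(X, num_micro), np.split(y, num_micro)):
        # Each micro-batch loss is already a mean over its own rows,
        # so every contribution is scaled by 1 / num_micro.
        grad += mse_grad(w, X_mb, y_mb) / num_micro
    return grad


rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)

full = mse_grad(w, X, y)
accum = accumulated_grad(w, X, y, num_micro=4)
```

Comparing `full` against `accum` on a tiny problem like this is a cheap sanity check to run before trusting throughput numbers from a run that uses accumulation.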

whole goblet
#

Have half a thought, that with the query being the primary input, and then router shenanigans producing conditioning coordinates and scaling factors, that I might be able to just use RoPE for positional encoding of the query. Relies on the hypernetwork being able to generate already scaled experts based on the query itself, but I don't see any reason that can't be the case. Also is massively more efficient at runtime (router right now is 15%~ of forward pass and makes backward pass pretty rough)

Going to give it a shot.

whole goblet
#

In v1 of this, we did do a lot of computation in a latent space and it worked pretty well. It would constrain the size of the hypernetwork if we project down

whole goblet
#

Does seem like the middle layer of the hypernetwork was not helping performance that much. Linear projection with no hidden layers is handling just fine. Guess is because the nonlinearity is captured by the generated weights, but could be entirely wrong there.

I think this learning plus operating in a latent space would be able to get this within striking distance of throughput of a dense model, and then the question is just if performance can match. Latent space computation would also give me a lot more room to play with h x k values, which showed pretty consistent performance returns in PEER (assuming we work off of those results being reliable)

pastel linden
#

@whole goblet How are the results coming along? Do you have anything particularly exciting to report? Or is there a write-up somewhere I can check out?

whole goblet
# pastel linden @whole goblet How are the results coming along? Do you have anything par...

Been ups and downs. Original idea might still have merit but don't have a clean baseline for that, and want to revisit, cc @mental plinth

Ended up chasing down eliminating expert weights altogether, and I'm getting near-baseline loss without tuning, but right now I'm hitting efficiency issues with the PyTorch implementation and throughput issues when trying to beat PyTorch with Triton. Trying a more hybrid approach today where I don't try to recompute everything on backward, and instead hand off the bulky matmuls to PyTorch and let autograd handle everything else

#

It's just kind of a weird spot because I'm generating a lot of weights on the fly, so need to make sure they live as short as possible.

#

Otherwise you end up with such tiny microbatches that you can't get reasonable saturation

pastel linden
#

Gotcha! Mostly curious / checking in about the general status. What compute resources do you have / how bottlenecked by compute are you?

whole goblet
#

I have about 2k left in a Lambda grant, and currently trying to stretch that with a single GH200 until I have good enough throughput to justify broad ablations

#

Right now doing more single threaded stuff because of trying to get perf in a good spot, but will be bottlenecked once that's done

#

That said I have a lot of experience with k8s, so considering doing a self fund through something like sfcompute if I run out

pastel linden
#

We have some 8xA40 machines which could be useful for testing scalability, but assuming you're looking for chonky GPUs to do actual runs that's not something we have sitting around. I'm happy to talk about working to help get you a grant or something, if the results are exciting enough.

whole goblet
#

I'd appreciate it! Right now I'd want to make sure it's worth your time, and not beating a dense baseline and not having any throughput benefits yet is pretty hard to justify 😅

#

That said, I think there's a path where this will be beating dense baselines, but need to get at least within striking distance of appropriate throughput before I feel like I can justify another request

#

(bitter lesson and all that jazz 🙃 )

pastel linden
#

What's the goal, in a couple sentences?

whole goblet
#

See if we can solve the parameter explosion problem (and therefore susceptibility to hitting the memory bandwidth wall) in MoE by trading stored parameters for generated ones.

#

Like honestly the most interesting stuff at this point is that I think I've shown that you can generate a coherent set of experts on the fly when you construct them one neuron at a time, but that's mostly neat and not exactly useful right now.

#

I do also have some future work to see if I can replace what's currently a k^d operation which limits the generated expert's depth with something more efficient like faiss

whole goblet
#
```
iter 60: loss 8.9143, trailing_100 9.6500, lr 1.80e-05, time 4567.63ms, 35870 tok/s, MFU 1.19%
iter 61: loss 8.8450, trailing_100 9.6370, lr 1.83e-05, time 4566.15ms, 35881 tok/s, MFU 1.19%
iter 62: loss 8.7013, trailing_100 9.6222, lr 1.86e-05, time 4567.74ms, 35869 tok/s, MFU 1.19%
```
#

Progress on performance through algo improvements. Now if I can just get reasonable saturation (pretty sure this isn't using tensor cores as much as it should be) we should be in spitting distance of a dense baseline
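
(How the MFU column in these logs was computed isn't stated. A rough sketch of the usual back-of-the-envelope estimate, assuming about 6 FLOPs per parameter per token for forward plus backward and ignoring attention FLOPs. For an architecture that generates weights on the fly, the parameter count understates the real work, so treat this as a lower-level sanity check only. All numbers below are made up for illustration.)

```python
def approx_mfu(num_params, tokens_per_sec, peak_flops_per_sec):
    """Rough model FLOPs utilization: achieved FLOPs/s divided by hardware peak FLOPs/s."""
    # ~6 FLOPs per parameter per token (2 forward + 4 backward), attention ignored.
    achieved = 6.0 * num_params * tokens_per_sec
    return achieved / peak_flops_per_sec


# Hypothetical inputs, not the values from the run above.
example = approx_mfu(num_params=1e9, tokens_per_sec=1_000, peak_flops_per_sec=6e13)
```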

whole goblet
#

And some tweaks to get saturation up, but does feel like I need to go get some outside advice on how to get this to be fast fast.

```
iter 512: loss 5.5459, trailing_100 5.7531, lr 1.54e-04, time 4754.85ms, 41349 tok/s, MFU 1.38%
```
whole goblet
#

@spare flame Just a heads up, been playing with nsight a decent amount, and I'm finding that the memory bound nature of base PEER is also largely in the router because of how it materializes just a few bytes that are then extremely low intensity for future accesses. Weighing a persistent kernel for that

whole goblet
#

Basically nobody is going to care about if I match a dense baseline if it's 10x the wall clock time to train

#

And the PEER baseline I think I'm just going to drop. If a paper relies on a pure reproduction of PEER that's going to take significantly longer

#

And I haven't heard back from lucidrains on if he ever got his implementation to train

spare flame
#

I was on his old discord discussing it with him and was probably the only person trying to get it to train at the time. And I decided it was too slow to bother with.

whole goblet
#

Think it's reasonable to just drop it?

#

idk at this point it's probably better to compare to dense baseline even though it has some derivative parts

spare flame
#

I agree with @tranquil fiber

#

I don't know whether or not peer is worth it bc I don't know if peer is good or not

whole goblet
#

Really the calculus I'm running right now is I can improve performance dramatically with reasonable amounts of work, which stretches my compute budget further

spare flame
#

Yeah there's a correct tradeoff wrt effort there but you'll have to decide where the line is

whole goblet
#

It also helped me stumble on a better factorization of this

#

Yea, I just don't think I'm at that line yet. With MFU as low as it is, and getting throughput where I have, I think if I can hit reasonable MFU then I can show competitive wall clock time with dense baseline, which makes it more apples to apples.

#

Since I'm kind of competing against CuBLAS in pytorch for something as straightforward as a single hidden dense FFN

spare flame
#

The only reason to optimize first imo is if you cant do the experiments otherwise so you can't work on improving the architecture

whole goblet
#

That's basically where I'm at. I think I can squeeze another order of magnitude of throughput out of this

#

Which gets me a lot more experimentation

spare flame
#

Probably

whole goblet
#

I'm already beating baseline throughput by 35%~, so it hasn't been wasted work so far

spare flame
#

But it won't necessarily lead to any useful result so you just gotta weigh the time cost

whole goblet
#

Yea, from my perspective once this grant is over, if I don't have interesting results, I probably won't try to get more compute. So this should get me my best shot

#

And if it fails, then it fails

#

I'll open source the negative result and move on

#

Basically just trying to avoid "And I decided it was too slow to bother with." for this arch

spare flame
#

btw the reason to follow up more on PEER is because it has implications for why yours is underperforming

#

I dont remember if they had a equiparameter study in their paper etc.

#

but if they were able to show PEER outperforming then it could be worth trying to figure out how to get to that regime

#

(they could also have just messed up somehow, who knows, so all of this is a big question mark - you can never trust any results that no one has replicated)

whole goblet
#

tbh v1 of ETHOS is likely a better baseline for understanding performance, despite the vocab size mismatch. It hit pretty reasonable loss with the latent expert approach when adjusted for vocab size

spare flame
#

Wes, last time it took a long time to finally do the nanogpt run instead of doing it first like i had suggested
I recommend that this time you listen to fern and my suggestion

#

in order to save yourself a lot of time and effort

whole goblet
#

But I have a nanogpt baseline now, so I'm not sure how this differs

tranquil fiber
#

Having a strong sense of direction is okay but you're kind of shooting yourself in the foot with some of the research direction, it would be good to listen to the advice for it.

spare flame
#

this differs in the sense that you're going to do things in the opposite order of what will make it go fastest for you

tranquil fiber
#

Yes, agreed

#

We've both been doing this for quite a while!

whole goblet
#

100%, but maybe I'm not understanding the advice then?

tranquil fiber
#

I think that's my vibe

spare flame
#

you can do it in any order, its only a question of how long it takes you to succeed/give up 🤣

tranquil fiber
#

I'm not sure quite how to make it "click" however

tranquil fiber
#

Yep

#

Definitely

whole goblet
#

Yea, I see this as less of time constraint and more of budget constraint. Costs me next to nothing to write/test kernels. Takes a lot more to actually run relevant tests since I've been told at different points that I'm undertraining, but now it sounds like I'm overtraining?

whole goblet
#

Like I just won't have the ability to run N tests at a slow pace

tranquil fiber
#

You're still in the stage where you likely have a ton of bugs/initial arch issues, you need to understand the dynamics before speeding things up

spare flame
#

if its tight then maybe you should give up

#

the goal is to find something better than what people already use

tranquil fiber
#

(bugs here being e.g. initialization preventing certain things from working as well, etc)

tranquil fiber
#

It takes time (sometimes several months!) of close examination to find them

spare flame
#

you dont need a full run or even a partial run barely to try to find that

whole goblet
#

And I've also been warned this could have entirely different training dynamics?

whole goblet
#

usually if it early plateaus it's been in the 4-5 loss range

whole goblet
#

It's a joint optimization problem in each layer

tranquil fiber
#

modded-nanoGPT is ~4.35 @ 125 steps

#

But it is a tough baseline to beat

whole goblet
#

So am I using the wrong harness, then? I was told to just use default nanoGPT

#

Since it would be easier for reviewers to reproduce

spare flame
#

its also bc its easier for you to use

#

and less complicated

tranquil fiber
#

We had a pretty lengthy discussion where I suggested it but you wanted to stick with nanogpt since your previous runs were in it. And that's okay too, nanogpt is a pretty decent baseline that's well accepted so I think that's alright

#

Yeah

#

It's just more compute

#

Unfortunately

spare flame
#

for the same reason that now we dont even know if PEER is real

tranquil fiber
#

Nanogpt is definitely a step up from self coded harnesses since it's verified code

#

(tho if you verify your own harness with a baseline that should be okayish)

whole goblet
#

Yea I mean the earlier harness was literally just DSv3's attention block

#

So I figured that was fine

#

But yea, maybe a port to nanogpt is a good idea at this point

tranquil fiber
#

Wes if you're able to find some statistics over training and link those to network performance, and use that to ratchet down what you think is happening, that may set you up for a quick-loop harness to iterate over

spare flame
#

yeah to be clear, I am not promoting the idea that hyperparameters are going to be the magic bullet here

tranquil fiber
#

Like, running on paper what's happening with each number and their magnitudes throughout training is super super useful

#

Seeing if there are any outliers in activations, etc (and why)

tranquil fiber
#

Also logging, extensive logging

#

That's pretty important

tranquil fiber
#

Nice

tranquil fiber
#

Are you looking at your histograms over time?

spare flame
#

Wes is there some reason this is going to scale better than traditional ffn

tranquil fiber
#

Histogram everything

#

In tensorboard

#

(dense logging can help catch changes as well)
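
(A minimal sketch of the kind of per-tensor logging being suggested here, written against NumPy so it doesn't depend on any particular logger. The function name, the 6-sigma outlier threshold, and the tensor name are all assumptions for illustration. The returned dict could be handed to TensorBoard or whatever logging is already in place.)

```python
import numpy as np


def summarize(name, values, num_bins=20, outlier_sigma=6.0):
    """Return summary statistics and a histogram for one tensor's values."""
    x = np.asarray(values, dtype=np.float64).ravel()
    mean, std = float(x.mean()), float(x.std())
    counts, edges = np.histogram(x, bins=num_bins)
    if std > 0:
        # Fraction of entries further than outlier_sigma standard deviations from the mean.
        outlier_frac = float(np.mean(np.abs(x - mean) > outlier_sigma * std))
    else:
        outlier_frac = 0.0
    return {
        "name": name,
        "mean": mean,
        "std": std,
        "abs_max": float(np.abs(x).max()),
        "outlier_frac": outlier_frac,
        "hist_counts": counts,
        "hist_edges": edges,
    }


# Hypothetical activations: mostly zero with a single large outlier.
acts = np.zeros(1000)
acts[0] = 100.0
stats = summarize("ffn.hidden", acts)
```

Logging these every few steps for weights, activations, and gradients makes it easier to see when a magnitude drifts or an outlier channel appears.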

whole goblet
#

It also (if my assessment ends up being correct) should be faster in practice than a dense FFN in inference

#

Just slower during training

#

Which is another reason I'm getting this low level

spare flame
#

I see

whole goblet
#

But yea, basically I have it to where you never have to leave the chip (at most SMEM writes) after attention

spare flame
#

ok I dunno then - maybe its worth kernel work if it could be better in practical ways even if its maybe a bit less great than a normal FFN in performance

whole goblet
#

.13 to be exact

whole goblet
#

I'm saying intermediates never leave at this point, so it's getting pretty optimized

#

No readbacks

#

Just need to get tiling correct

spare flame
#

how is that better than dense matmul?

whole goblet
#

Scales better than tensor parallelism because no scatter gathers between PEER blocks

#

Just a single reduce before the next layer

#

Basically you get the benefits of expert parallelism without the load balancing issues

spare flame
#

that sounds good, how does that occur?

whole goblet
#

Because each GPU would just receive a broadcast of the post-attention token batch. Everything is pipelined between the router and output at that point, because the router and hypernetwork are tightly coupled.

#

In traditional PEER you'd be selecting the discrete experts from a massive pool. Same thing happens at lower scale for other MoE models

#

Here, because you're constructing the expert, you don't have the same problem

spare flame
#

youre able to slice up the thing that constructs the experts somehow across gpus?

whole goblet
#

More that as long as Heads % GPU_count == 0, each GPU can handle Heads / GPU_count heads, if that makes sense

spare flame
#

sure

whole goblet
#

And then once each GPU has finished its batch, you just have a single reduce for their outputs
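
(The head-to-GPU split described above can be sketched in a few lines. This only shows the index bookkeeping, assuming contiguous blocks of heads per rank. The function name is hypothetical and no actual communication is performed.)

```python
def heads_for_rank(num_heads, world_size, rank):
    """Return the head indices a given GPU rank would own under an even split."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    per_gpu = num_heads // world_size
    start = rank * per_gpu
    return list(range(start, start + per_gpu))


# Example: 8 heads over 4 GPUs gives 2 heads per GPU.
assignment = [heads_for_rank(8, 4, rank) for rank in range(4)]
```

Each rank would then process the broadcast token batch through its own heads, and the per-rank partial outputs would be summed in the single reduce mentioned above.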

spare flame
#

multiple head FFN doesnt typically work great

#

why does this?

whole goblet
#

Because you're determining what kind of computation each token needs on the fly

#

MoE does this, but it's in a very discrete way, which is why you get things like aux loss for load balancing in most architectures

spare flame
#

hmm ok I think I see the outline of the idea generally

#

so basically you're making FFN worse (but more easily parallelizable) by using heads, but making it better again by hypernetwork somehow

whole goblet
#

Give or take

#

Like if I break down PEER (assuming it was real when the paper was produced)

spare flame
#

I'd like you to run another experiment where you just divide the FFN into heads

whole goblet
#

I can do that

spare flame
#

thats a kind of ablation for this concept

#

should be super fast to run

whole goblet
#

Yea, agreed

spare flame
#

because right now we don't know if you're just exactly matching that

#

due to that being a part of your change

whole goblet
# whole goblet Like if I break down PEER (assuming it was real when the paper was produced)

The router was basically doing two things:

  1. Determining what kind of compute was needed for a given token. This was encoded as the expert query
  2. Determining exactly how to do that compute, neuron by neuron. That's the coordinate used for retrieval + scaling each neuron

So here I'm just using that query to generate a neuron, and then using the coordinates to condition generation of each neuron, and the coordinate score to scale them
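
(For context on the "coordinates" and "coordinate score" above: a minimal NumPy sketch of product-key top-k retrieval as generally described for product-key memories, which PEER builds on. This is an illustration under that assumption, not the ETHOS or lucidrains code, and the function name and shapes are made up. The query is split in half, each half is scored against its own set of sub-keys, and the top-k pairs are found from a k x k candidate grid instead of scoring all n*n product keys.)

```python
import numpy as np


def product_key_topk(query, keys_a, keys_b, k):
    """Top-k over the n*n product keys using two top-k searches over n sub-keys each."""
    half = query.shape[0] // 2
    scores_a = keys_a @ query[:half]  # shape (n,)
    scores_b = keys_b @ query[half:]  # shape (n,)
    top_a = np.argsort(-scores_a)[:k]
    top_b = np.argsort(-scores_b)[:k]
    # Scores for the k x k grid of candidate (a, b) pairs.
    cand = scores_a[top_a][:, None] + scores_b[top_b][None, :]
    best = np.argsort(-cand.ravel())[:k]
    ia, ib = np.unravel_index(best, cand.shape)
    coords = np.stack([top_a[ia], top_b[ib]], axis=1)  # (k, 2) coordinates
    return coords, cand.ravel()[best]


rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
keys_a = rng.normal(size=(n, d // 2))
keys_b = rng.normal(size=(n, d // 2))
query = rng.normal(size=d)
coords, scores = product_key_topk(query, keys_a, keys_b, k)
```

In the scheme described in the message above, each returned coordinate pair would condition the generation of one neuron and its score would scale that neuron, rather than indexing into a stored expert table.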

spare flame
#

also, if you are doing BETTER than head-ffn, that's interesting, since you're doing head-ffn and then more stuff on top

spare flame
#

I probably dont have time (actually I gotta go do some stuff right now) but btw I also don't really know exactly what headedness you're describing in the first place 🤣

whole goblet
#

In this case, just a PEER head

#

PEER defines a head as a router plus the k experts that router selects

#

So you have H routers per FFN

#

And no worries. I'll just give it a shot. I need to do some doc review on if there have been parallel FFNs without routing before

spare flame
#

Is the idea just changing
matmul(act(matmul(x, A)), B)
into
sum( bmm( act( bmm(x,As) ), Bs ) )
?

Like a bunch of smaller ffns summed?

whole goblet
#

Yep exactly

#

Because that would be the exact dense baseline without any complexity from what I'm doing

#

So basically instead of the 768 -> 3072 expansion in GPT-2 small per FFN, you'd have 8 768 -> 384 bottlenecks that are summed after

#

Looks like closest might be GroupBERT?

spare flame
#

but the sum of a bunch of ffns is mathematically equivalent to a single wide ffn
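
(A quick numerical check of that equivalence, as a sketch with made-up small shapes. It relies on the activation being elementwise; squared ReLU is used here. Splitting the wide FFN's first matrix by columns and its second matrix by rows gives heads whose summed outputs equal the wide FFN's output.)

```python
import numpy as np


def relu_sq(x):
    """Elementwise squared ReLU."""
    return np.maximum(x, 0.0) ** 2


rng = np.random.default_rng(0)
dim, hidden, heads = 6, 24, 4
x = rng.normal(size=(5, dim))
A = rng.normal(size=(dim, hidden))
B = rng.normal(size=(hidden, dim))

# Single wide FFN: matmul(act(matmul(x, A)), B)
wide = relu_sq(x @ A) @ B

# Same weights split into heads: columns of A, matching rows of B.
As = np.split(A, heads, axis=1)
Bs = np.split(B, heads, axis=0)
summed = sum(relu_sq(x @ a) @ b for a, b in zip(As, Bs))
```

So any difference between the two in a training run comes from initialization, optimizer handling of the separate matrices, numerics, or kernel efficiency, not from the function class.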

whole goblet
#

Yep

#

That's kind of the point, right?

spare flame
#

so.. that means I don't understand the 'benefit' youre obtaining vs doing that for a FFN across gpus

#

I thought you said yours was more efficient bc of lack of tensor parallelism or something

whole goblet
#

Fewer scatter gathers because you don't need to do that for each layer. Benefit is limited in GPT-2 style FFNs, but the second you add an additional hidden layer, benefit emerges

spare flame
#

ok i dont understand but this is getting beyond what I can spend time on

#

sorry 🙁

whole goblet
#

But the benefit still exists because it's a single scatter gather instead of 2. Would be the same with the multiheaded FFN approach

#

you're good

spare flame
#

one last thought.. if the main benefit is this kind of practical speedup, maybe you should write up a clear explanation of what situations that is expected to occur

#

and what subset of the invention is required for that speedup

whole goblet
#

Makes sense

#

I don't know if I've explored the architecture enough to know that its benefit is only a practical speedup on multi-node, but I can definitely try to isolate that aspect

#

Just going to switch to modded-nanogpt if the baseline needs to change again

tranquil fiber
#

Yeah, modded-nanoGPT upside is faster experiments, downside is it will likely be very very hard to beat the baseline

#

Since it's a very highly tuned run

whole goblet
#

Yea, I mean, I'm just swapping out the FFN, so if I get compute budget back, I'm fine with that

whole goblet
#
```python
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

# CastedLinear is defined in the modded-nanogpt training script.


class MultiHeadFFN(nn.Module):  # class name assumed; the header line was lost in the paste
    """Multi-headed FFN: splits into parallel bottlenecks and sums outputs."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        total_intermediate = 4 * dim
        assert total_intermediate % num_heads == 0
        self.bottleneck_dim = total_intermediate // num_heads

        # Create parallel FFN heads
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                'c_fc': CastedLinear(dim, self.bottleneck_dim),
                'c_proj': CastedLinear(self.bottleneck_dim, dim),
            })
            for _ in range(num_heads)
        ])

        # Zero init projections
        for head in self.heads:
            head['c_proj'].weight.detach().zero_()

    def forward(self, x: Tensor):
        # Run each head independently, then sum the per-head outputs.
        outputs = []
        for head in self.heads:
            h = head['c_fc'](x)
            h = F.relu(h).square()
            h = head['c_proj'](h)
            outputs.append(h)
        return sum(outputs)
```
#

Simple implementation, will let y'all know how it does on modded nano

whole goblet
# tranquil fiber modded-nanoGPT is ~4.35 @ 125 steps

Not sure if there's something vastly different from running on a single GPU. Only disabled a world_size == 8 assertion and changed align_to_bos=True to false in the data loader. Everything else is just cloned directly from main.

Base (for reproduction on single GH200):

```
step:0/1750 val_loss:10.8258 train_time:0ms step_avg:0.01ms
step:125/1750 val_loss:5.5574 train_time:14117ms step_avg:112.93ms
step:250/1750 val_loss:4.9899 train_time:28250ms step_avg:113.00ms
step:375/1750 val_loss:4.7004 train_time:42454ms step_avg:113.21ms
step:500/1750 val_loss:4.5114 train_time:56892ms step_avg:113.78ms
step:625/1750 val_loss:4.3930 train_time:71403ms step_avg:114.25ms
step:750/1750 val_loss:4.3105 train_time:86070ms step_avg:114.76ms
step:875/1750 val_loss:4.2550 train_time:100809ms step_avg:115.21ms
step:1000/1750 val_loss:4.1766 train_time:115700ms step_avg:115.70ms
step:1125/1750 val_loss:4.1036 train_time:130682ms step_avg:116.16ms
step:1250/1750 val_loss:4.0288 train_time:145701ms step_avg:116.56ms
step:1375/1750 val_loss:3.9647 train_time:160736ms step_avg:116.90ms
step:1500/1750 val_loss:3.9090 train_time:175922ms step_avg:117.28ms
step:1625/1750 val_loss:3.8606 train_time:191158ms step_avg:117.64ms
step:1750/1750 val_loss:3.8198 train_time:206425ms step_avg:117.96ms
```
#

Multiheaded FFN:

```
step:125/1750 val_loss:5.5590 train_time:39192ms step_avg:313.54ms
step:250/1750 val_loss:5.0048 train_time:78624ms step_avg:314.49ms
step:375/1750 val_loss:4.6950 train_time:118188ms step_avg:315.17ms
step:500/1750 val_loss:4.5055 train_time:158052ms step_avg:316.10ms
step:625/1750 val_loss:4.3863 train_time:197880ms step_avg:316.61ms
step:750/1750 val_loss:4.3022 train_time:237781ms step_avg:317.04ms
step:875/1750 val_loss:4.2447 train_time:277795ms step_avg:317.48ms
step:1000/1750 val_loss:4.1717 train_time:317970ms step_avg:317.97ms
step:1125/1750 val_loss:4.0932 train_time:358107ms step_avg:318.32ms
step:1250/1750 val_loss:4.0169 train_time:398400ms step_avg:318.72ms
step:1375/1750 val_loss:3.9494 train_time:438691ms step_avg:319.05ms
step:1500/1750 val_loss:3.8911 train_time:478944ms step_avg:319.30ms
step:1625/1750 val_loss:3.8419 train_time:519242ms step_avg:319.53ms
step:1750/1750 val_loss:3.8008 train_time:559605ms step_avg:319.77ms
```

Testing my pytorch version next once I get it ported over.
tranquil fiber
#

Don't forget to run your baseline!

#

Turning off align_bos does change things

tranquil fiber
#

I'd recommend using kosarsky's January record, that is much easier to convert to 1 GPU

tranquil fiber
#

Should have 3.28 very consistently (+/- a bit)

tranquil fiber
#

It always should converge to ~3.28

tranquil fiber
#

I can't remember the name off the top of my head, you can look in the records folder for it, I believe it's the tanh scaling upgrade iirc

whole goblet
#

I'll scroll through

#

Does kosarsky go by a different name?

#

nvm found it

whole goblet
#

Okay, yea. That's more normal then. Baseline, no changes needed from that version:

```
step:125/1390 val_loss:4.3667 train_time:119635ms step_avg:1040.31ms
step:250/1390 val_loss:3.9498 train_time:254696ms step_avg:1061.23ms
step:375/1390 val_loss:3.7707 train_time:392056ms step_avg:1074.13ms
step:500/1390 val_loss:3.6554 train_time:531675ms step_avg:1085.05ms
step:625/1390 val_loss:3.5748 train_time:671656ms step_avg:1092.12ms
step:750/1390 val_loss:3.5223 train_time:811987ms step_avg:1097.28ms
step:875/1390 val_loss:3.4717 train_time:954645ms step_avg:1103.64ms
step:1000/1390 val_loss:3.4056 train_time:1100304ms step_avg:1111.42ms
step:1125/1390 val_loss:3.3539 train_time:1246828ms step_avg:1118.23ms
step:1250/1390 val_loss:3.3070 train_time:1393206ms step_avg:1123.55ms
step:1375/1390 val_loss:3.2783 train_time:1538654ms step_avg:1127.22ms
step:1390/1390 val_loss:3.2775 train_time:1556107ms step_avg:1127.61ms
```
whole goblet
# whole goblet Multiheaded FFN: step:0/1750 val_loss:10.8258 train_time:0ms step_avg:0.01ms ...

Take 2 on older harness that's single GPU friendly for multiheaded FFN:

```
step:125/1390 val_loss:4.3925 train_time:155044ms step_avg:1348.21ms
step:250/1390 val_loss:3.9615 train_time:328091ms step_avg:1367.05ms
step:375/1390 val_loss:3.7857 train_time:504000ms step_avg:1380.82ms
step:500/1390 val_loss:3.6669 train_time:680342ms step_avg:1388.45ms
step:625/1390 val_loss:3.5861 train_time:857694ms step_avg:1394.62ms
step:750/1390 val_loss:3.5308 train_time:1038387ms step_avg:1403.23ms
step:875/1390 val_loss:3.4804 train_time:1220604ms step_avg:1411.10ms
step:1000/1390 val_loss:3.4140 train_time:1402886ms step_avg:1417.06ms
step:1125/1390 val_loss:3.3613 train_time:1584735ms step_avg:1421.29ms
step:1250/1390 val_loss:3.3141 train_time:1767800ms step_avg:1425.64ms
step:1375/1390 val_loss:3.2852 train_time:1952838ms step_avg:1430.65ms
step:1390/1390 val_loss:3.2844 train_time:1975155ms step_avg:1431.27ms
```
Little worse
spare flame
#

Oh after understanding what you meant by multi headed I didn't think you still needed to compare since it's mathematically identical

#

(tho I still don't really get the justification for why it's faster than normal FFNs since it's identical in that way)

whole goblet
#

Well it's done anyways, and yea it's pretty much identical just slower

#

The speedup when on >1 GPU would be that if you have two hidden layers in those, you would be able to cut a scatter gather from that second layer

#

but with a single hidden layer it's almost strictly worse

#

But when you have so many ops that are pipelined in my arch, it's more pronounced

#

Alright, I have my architecture patched into modded-nanogpt

#

We'll see how perf is

#

Also trying Muon on my layers, but might swap them back to Adam if there are major issues. I'm doing all manual gradient handling at this point, so it should be pretty agnostic

whole goblet
#

Alright, at a bit over 60k tokens per second, so double baseline, with the modded nanogpt repo and some kernel tuning. Still a lot slower than the 470k tokens per second that modded nanogpt can get on the same config, but I'm okay not beating a literal speedrunning config

#

Will run ablations from this. Should be able to get a decent chunk of configs out of this

#

util is still pretty low, so someone smarter than me could probably get further faster than I can at this point

whole goblet
# spare flame if its tight then maybe you should give up

I'm just about here, just going to try one last thing and then just open source the negative case. Got it to get within like .05 nats of the modded gpt baseline, but still like 7x the iteration time. I know there's a lot of room for speed based on nsight, but haven't been able to capture any of it.

Last hurrah is going to be trying to toss an expert choice router in front of each of the PEER routers. Would let each head specialize better, would reduce amount of overhead I'm eating from each head, and shouldn't require kernel rewrites the way I have them now, would just look like smaller batch sizes from that perspective.

if/when that fails I think I'll just give this up

#

I have the approach factorized to a point where it is lower FLOPs than a dense baseline, though, just less efficient ops currently.

spare flame
#

sorry 🙁 thats interesting that its lower flops and almost equivalent performance but less efficient

whole goblet
#

It feels like there's something here, and to some degree this shouldn't work at all, let alone nearly as well.

#

But yea, I just might not be the person to make it work if it does have merit

#

But yea, last hurrah is going to see if they specialize better if I go
Head Choice Router (ripped from expert choice, will reduce tokens each head needs to process) -> PEER Router -> Hypernetwork -> process token with generated experts

#

learned how to write reasonably efficient backwards kernels at least

#

forward is just a huge bottleneck rn

spare flame
#

without knowing the answer to that it's hard to know anything much about these kinds of 'dynamically constructing FFNs' methods

#

it's fine if your method requires more parameters if those parameters are used much more sparsely

#

but you need some kind of specific situations in which your method would lead to dramatically reduced compute (like not just a constant multiplier on FLOPs)

#

that was sort of the promise of PEER

#

ultimately, we can't tell you what the proposed benefits of the methods you're pursuing are... you have to be the one to know and communicate that

#

and if you can find a clear benefit of this sort it's probably worth pursuing, and if not then yeah it might not be

whole goblet
#

Doing this on limited budget without institutional support is fun when it's fun, but lately it's just been exhausting

#

hobbies are supposed to just mostly be fun lol

#

And it's hard to sift through "this advice is useful when you have a huge training budget" vs "this is useful at all scales"

#

Been good learning, just not really fun rn

whole goblet
#

Okay, actually it looks like a good chunk of the gap can be made up by tuning LR just for PEER layers

#

Lowered to 3.35 by making coarse-grained updates, can likely eke out a decent amount more by separating LR between attention and components of the FFN

#

(was at 3.40~)

whole goblet
#

it's still peer in my codebase

spare flame
#

o ok, was just not sure if u meant actual PEER

whole goblet
#

Yea, no, the one I've been working on. I'm not going to spend any more compute trying to get a PEER baseline perfect

#

if GDM wants a repro study they can give me the compute for it

whole goblet
#

Honestly starting to think that this might do better as a blog post on "how to use hypernetworks to make a worse transformer"

mental plinth
#

Just getting caught up here. 😊

I’ve been messing with different decompositions on attention and the FFN, and this past week was trying to see if I could find a configuration that would actually achieve a lower perplexity than my dense baseline at the same parameter count.

I found a Google paper that discussed how best to allocate additional parameters, and it recommended adding more layers rather than making them bigger.

I tried applying that: any parameter savings I was getting from the decompositions, I’d put into additional layers when possible.

An approach that finally got lower perplexity than the dense baseline was to interweave dense and low-rank layers.

I went from 6 dense layers (“dddddd”) to ten layers with a pattern of: “dssdssdssd” where ‘s’ is for sparse / low rank.

The low rank layers are narrow—they have ~1/3 as many heads and ~1/3 as many neurons as the dense layers.
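
A small sketch of how a pattern string like the one above could be expanded into per-layer settings. The dense sizes (12 heads, 3072 FFN neurons) and the function name are assumptions for illustration; the low-rank decomposition itself is not modeled here, only the narrowing by roughly 1/3.

```python
def expand_pattern(pattern, dense_heads, dense_ff, narrow_ratio=3):
    """Turn a string like 'dssdssdssd' into per-layer width settings."""
    layers = []
    for kind in pattern:
        if kind == "d":
            layers.append({"kind": "dense", "n_heads": dense_heads, "d_ff": dense_ff})
        elif kind == "s":
            # Narrow layers get about 1/narrow_ratio of the heads and neurons.
            layers.append({
                "kind": "low_rank",
                "n_heads": max(1, dense_heads // narrow_ratio),
                "d_ff": max(1, dense_ff // narrow_ratio),
            })
        else:
            raise ValueError(f"unknown layer code: {kind!r}")
    return layers


# Assumed dense sizes, chosen only to make the arithmetic easy to follow.
layers = expand_pattern("dssdssdssd", dense_heads=12, dense_ff=3072)
```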

My theory is that, since the low rank layers can’t access the full model space, it makes sense to make them more narrow / less expressive.

And also that it benefits from periodically having unrestricted read and write access to the full model space.

For Ethos and the variants, there is a lot of low rank-ness going on here with the PK Router and the hyper networks. I wonder if this architecture might benefit from a similar “banding” pattern?

mental plinth
#

I think the same insight may apply to regular PEER as well. The PK router is a low-rank approximation of a standard MoE router. It can only see a subspace defined by the query projection.

I suppose it makes up for that, though, by having multiple heads (multiple low rank routers), similar to attention. So maybe nevermind.
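
For reference, a toy NumPy sketch of single-head product-key retrieval, the mechanism behind the PK router. It is a generic illustration under assumed toy sizes, not the PEER authors' code. Each of the n*n experts is scored as the sum of two sub-key scores, so only the k best sub-keys from each half need to be combined.

```python
import numpy as np


def product_key_topk(query, keys_a, keys_b, k):
    """Top-k over n*n implicit experts using two sets of n sub-keys.

    Expert (i, j) has score q_a . keys_a[i] + q_b . keys_b[j], where q_a and
    q_b are the two halves of the query vector.
    """
    half = query.shape[0] // 2
    scores_a = keys_a @ query[:half]      # (n,)
    scores_b = keys_b @ query[half:]      # (n,)
    top_a = np.argsort(-scores_a)[:k]     # best k sub-keys for each half
    top_b = np.argsort(-scores_b)[:k]
    # Score only the k*k candidate pairs instead of all n*n experts.
    cand = scores_a[top_a][:, None] + scores_b[top_b][None, :]
    flat = np.argsort(-cand.ravel())[:k]
    ia, ib = np.unravel_index(flat, cand.shape)
    n = keys_b.shape[0]
    expert_ids = top_a[ia] * n + top_b[ib]
    return expert_ids, cand.ravel()[flat]


rng = np.random.default_rng(0)
n, d, k = 16, 8, 4                        # toy sizes: 16 * 16 = 256 experts
keys_a = rng.normal(size=(n, d // 2))
keys_b = rng.normal(size=(n, d // 2))
query = rng.normal(size=d)
ids, scores = product_key_topk(query, keys_a, keys_b, k)

# Brute-force reference over all n*n experts, for comparison.
full = (keys_a @ query[: d // 2])[:, None] + (keys_b @ query[d // 2:])[None, :]
brute = np.argsort(-full.ravel())[:k]
```

The retrieval is exact with respect to the factored score; the restriction is that the score itself must decompose into two half-query terms.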

mental plinth
#

What I've always liked about PEER is the hope that it could learn the same kind of precise features that they find using SAEs in interpretability research.

I wonder if it could work to take a pre-trained model, do the SAE analysis, and then initialize a PEER layer from that.

Applied to GPT-2, an SAE setup might look like the below.

First, compute the activations for the FFN input neurons:

a = gelu(x @ W_in)   # with W_in.shape = (768, 3072)

Then the SAE has an encoder matrix E with, e.g., shape (3072, 1M), and a decoder matrix D with shape (1M, 3072).

So the output is something like:

y = gelu(a @ E) @ D @ W_out

The features they find are the rows of D. You can turn those into expert output neurons by multiplying with W_out. So for PEER expert neuron 'j':

w_vj = D[j, :] @ W_out

For the routing, the simplest but expensive approach would be to say that the input to the PEER layer is the activation vector a, which is length 3072.

So then you need to set up the PK router to approximate the operation a @ E to find those top features.

Maybe a better way would be to freeze all of those ideal output weights, and then use distillation to train the PK-router and the neuron inputs.
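
The initialization idea above can be sketched in NumPy with toy sizes. Random matrices stand in for the pretrained FFN output matrix and a trained SAE; all sizes and the variable names `W_v` and `target_experts` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_features, k = 8, 32, 64, 4   # toy sizes, far smaller than a real model

W_out = rng.normal(size=(d_ff, d_model))      # stand-in for the pretrained FFN output matrix
E = rng.normal(size=(d_ff, n_features))       # stand-in for a trained SAE encoder
D = rng.normal(size=(n_features, d_ff))       # stand-in for a trained SAE decoder

# Row j is the proposed output weight of PEER expert neuron j: D[j, :] @ W_out
W_v = D @ W_out                               # (n_features, d_model)

# Routing target: the features the SAE encoder would rank highest for one
# activation vector a. A PK router would have to approximate this selection.
a = rng.normal(size=d_ff)
target_experts = np.argsort(-(a @ E))[:k]
```

With the rows of `W_v` frozen, the distillation variant would train only the router and expert input weights to reproduce `target_experts` and the resulting outputs.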

whole goblet
#

Yea I thought the same thing, but I’m not sure you can hit ground truth on “ideal output weights”

#

You could hit a version of it, but PEER reproductions themselves don’t seem to be great from the versions I’ve tried

#

What I think could be more interesting is token + scaled output pairs on a trained network as a training set

#

So you attempt to learn the full path

whole goblet
#

Wait I found a bug

#

I'll rerun training once it's fixed

whole goblet
#

Would y’all have any thoughts on how to best test OOD data? I have half a hypothesis that because this is defining what compute you need on the fly, it might do better there.

whole goblet
#

Also, am I potentially thinking of this wrong? The FFN is learning a materially more difficult task, so should I be expecting that it's as sample efficient as its dense counterpart?

#

It's learning "given this input, what weights should I generate that would be useful?" and then immediately consuming those weights, which is significantly more difficult than the training task for the dense network

whole goblet
#

Going to swap to gpt2-medium baseline just because I’m wasting a lot of cycles making a non power of 2 model dim work. Everything is a lot easier when model dim is 1024.

whole goblet
#
```
step:125/1390 val_loss:4.3229 train_time:217484ms step_avg:1891.16ms
step:250/1390 val_loss:3.8955 train_time:460616ms step_avg:1919.23ms
step:375/1390 val_loss:3.7061 train_time:704883ms step_avg:1931.19ms
step:500/1390 val_loss:3.5879 train_time:953667ms step_avg:1946.26ms
step:625/1390 val_loss:3.5013 train_time:1206982ms step_avg:1962.57ms
step:750/1390 val_loss:3.4425 train_time:1459531ms step_avg:1972.34ms
step:875/1390 val_loss:3.3900 train_time:1716268ms step_avg:1984.12ms
step:1000/1390 val_loss:3.3211 train_time:1976056ms step_avg:1996.02ms
step:1125/1390 val_loss:3.2642 train_time:2233886ms step_avg:2003.49ms
step:1250/1390 val_loss:3.2146 train_time:2497299ms step_avg:2013.95ms
step:1375/1390 val_loss:3.1837 train_time:2760458ms step_avg:2022.31ms
step:1390/1390 val_loss:3.1829 train_time:2791743ms step_avg:2023.00ms
```
Baseline gpt2-medium
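
Log lines in the format above are easy to turn into a curve for comparing runs. A minimal parsing sketch, assuming only the four fields shown in the log; the function name and dict keys are made up for illustration.

```python
import re

LOG_LINE = re.compile(
    r"step:(?P<step>\d+)/(?P<total>\d+)\s+val_loss:(?P<val_loss>[\d.]+)"
    r"\s+train_time:(?P<train_ms>\d+)ms\s+step_avg:(?P<step_avg_ms>[\d.]+)ms"
)


def parse_val_log(text):
    """Extract one dict per validation line; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        m = LOG_LINE.search(line)
        if m:
            rows.append({
                "step": int(m["step"]),
                "total": int(m["total"]),
                "val_loss": float(m["val_loss"]),
                "train_ms": int(m["train_ms"]),
                "step_avg_ms": float(m["step_avg_ms"]),
            })
    return rows


# Two lines copied from the baseline log above.
sample = """step:125/1390 val_loss:4.3229 train_time:217484ms step_avg:1891.16ms
step:1390/1390 val_loss:3.1829 train_time:2791743ms step_avg:2023.00ms"""
rows = parse_val_log(sample)
```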
whole goblet
#

@spare flame going back through original ETHOS, I found a bug that might actually be a minor discovery. Any shot you'd have 10 minutes to verify? Should be apparent in the code.

Basically, the hypernetwork and expert weights never actually received gradients. We hit reasonable PPL without training those components at all. Just random init.
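
A generic way to catch this class of bug is to list every parameter whose gradient is still missing after one backward pass. The sketch below is framework-agnostic and takes (name, grad) pairs; with PyTorch these could come from `((n, p.grad) for n, p in model.named_parameters())`. The parameter names in the example are hypothetical.

```python
def params_without_grad(named_grads):
    """Return names whose gradient is missing (None) after a backward pass.

    named_grads: iterable of (name, grad) pairs.
    """
    return [name for name, grad in named_grads if grad is None]


# Hypothetical names; lists stand in for gradient tensors.
after_backward = [
    ("router.query_proj.weight", [0.1, -0.2]),
    ("hypernet.fc1.weight", None),
    ("expert_latents", None),
]
missing = params_without_grad(after_backward)
```

A gradient that exists but is all zeros would need a separate check, since it is not `None`.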

spare flame
whole goblet
spare flame
#

Lol!

whole goblet
#

But we were using the forward triton kernel

whole goblet
#

If this is true, I think I can just share latents across layers and never train them? Would just need to fix the hypernetwork to receive gradients because that's likely suboptimal

spare flame
#

Heh what part do you need me to look at? Shouldn't you just make sure everything has a backward?

#

I don't really want to look at code to debug it

tranquil fiber
whole goblet
whole goblet
#

Just used the same shorter run hyperparams from the gpt-2 small version

whole goblet
whole goblet
tranquil fiber
#

(I probably won't comment too much more on ETHOS but I may come in occasionally to help)

whole goblet
#

E.g. I trained a GPT-2 Medium class model as my dense baseline on X tokens, and then also trained my IsoParam|IsoFLOP|IsoWallClock model on X tokens, is that not a valid comparison?

#

Goal isn't to get on leaderboards of modded-nanogpt, just have a scientifically valid baseline to compare to.

#

Especially when they're using identical hyperparams outside of my exact FFN change?

#

Maybe the part that wasn't clear is that baseline was run exactly from the repo just with the training duration changed?

whole goblet
#

Otherwise not sure what the

"Yeah, I'm not entirely sure what that's getting you unless it's something specific to kernel writing."
part is about

whole goblet
#

Was only offering to run it to 2.92 because that's what the speedrun requires, but this is just a short run baseline in order to keep sweeps reasonable cost.

whole goblet
#

fwiw going to move forward with the shortened GPT-2 Medium baseline for ablations and hyperparam sweeps. I think that simply training for fewer tokens will just have to be a limitation for that portion, and I'm fine with that.

whole goblet
#

Went through and just updated the most recent GPT-2 Medium speedrun to be friendly to single GPU and ran a full run. Final delta was +0.000934 (2.920684 vs 2.919750). Going to call that similar enough, use that as the harness and baseline, and just use early exit to disqualify candidates if they aren't training well.

Reference: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_2_medium/2025-06-15_OptimizationLeaderboard/075_640429f2-e726-4e83-aa27-684626239ffc.txt

My changes and full run: https://gist.github.com/wrmedford/14893b6a4477b6d2ef114a3406d5aa87
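
One possible form of the early-exit rule, sketched under assumptions: the tolerance, the minimum step, and the baseline numbers are illustrative only and are not taken from the linked harness.

```python
def should_early_exit(step, val_loss, baseline_curve, tolerance=0.05, min_step=250):
    """Flag a run whose val loss trails the baseline by more than `tolerance`.

    baseline_curve: dict mapping step -> baseline val loss at that step.
    Steps before `min_step`, or steps absent from the curve, never trigger an exit.
    """
    if step < min_step or step not in baseline_curve:
        return False
    return val_loss > baseline_curve[step] + tolerance


# Illustrative baseline values, not measured results.
baseline = {250: 3.90, 500: 3.59, 750: 3.44}
stop_bad_run = should_early_exit(500, 3.70, baseline)
stop_ok_run = should_early_exit(500, 3.60, baseline)
```

The `min_step` guard matters for architectures that start slower than the baseline and catch up later, since a strict early check would disqualify them.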

#

Now back to the actual research

mental plinth
#

How do you decide when to stop training? The goal is to keep going till you hit their perplexity mark, right? But you need to know ahead of time how many steps that’s going to take in order to schedule the learning rate 🤔

whole goblet
#

Best I can tell it's a bit of trial and error on figuring out exactly which iteration to stop on, but I'm just going to be training to this token count to keep with the methodology
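
To make the scheduling dependence concrete, here is a generic warmup, constant, linear-cooldown multiplier. It is a sketch with assumed defaults and is not claimed to match the schedule used in modded-nanogpt. The same step gets a different learning rate depending on the planned total.

```python
def lr_multiplier(step, total_steps, warmup_steps=0, cooldown_frac=0.4):
    """Warmup -> constant -> linear cooldown that reaches zero at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if warmup_steps > 0 and step < warmup_steps:
        return (step + 1) / warmup_steps
    cooldown_steps = int(total_steps * cooldown_frac)
    if cooldown_steps == 0:
        return 1.0
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return 1.0
    return max(0.0, (total_steps - step) / cooldown_steps)


# Step 800 is mid-cooldown in a 1000-step plan but still at full LR in a 2000-step plan.
short_plan = lr_multiplier(800, 1000)
long_plan = lr_multiplier(800, 2000)
```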

mental plinth
#

What’s the metric you’re evaluating here for Ethos? Comparing validation loss at the end of training?

whole goblet
#

I will be yea. I have a sweep harness and a few other things I'll throw in a branch probably tomorrow

#

It's already on the jupyter instance if you want to check there

mental plinth
#

Cool!

#

Do you have an idea of what success looks like here?
Are we hoping for a lower validation loss at the same token count vs. the baseline?

whole goblet
#

That's the hope, or seeing if we can match it at a lower param/flop count

#

now that I'm moving back closer to the standard ETHOS arch

mental plinth
#

Ah, I see now, that run you just shared was the baseline. If it took 3 hours, how long is ethos going to take? 😬

whole goblet
#

Hopefully not all ablations will take that long and will early exit on bad configs. Once I post the sweep (it’s in ethosv2/grid_search.py on the Jupyter server) it should make more sense. But should be able to just run them back to back for a while and just check results.

#

And I’m looking at potentially just tooling this for 8xH100 and doing it there if wall clock becomes prohibitive. But I think a couple weeks waiting on results is fine considering budget

mental plinth
#

Yeah, makes sense

Have you tried swapping in the ethos layer yet?

#

After I saw that you were using this as the test framework, I’m considering trying it for the attention subspace work, but it’s not immediately obvious to me how tricky it will be to customize or not.

The MLP looks more straightforward for ethos, at least.

whole goblet
#

Yea it trains, just need to fix the bug I found. I think what I found also implies that since the routers are that expressive, each router needs its own hypernetwork. Lines up with what I saw when I was testing out the hypernetwork only approach as well

#

Yea MLP swap here is straightforward, but there’s a lot of tricks happening with attention you’d need to disentangle. Would recommend a previous run before flex attention was introduced

mental plinth
whole goblet
#

Yea, conceptually I think it makes more sense because then you’d never have different heads feeding a shared hypernetwork conflicting gradients

#

And then each can truly specialize

mental plinth
#

I’m itching to go read up on their different hacks to attention. Stuff like the smear gate, and skip connections, and value embeddings 😳

whole goblet
#

Part of me wants to test throwing an expert choice router in front of each PEER router since it should work. Would allow for even better specialization.

#

And still allows for lossless load balancing

#

And consistent expert load

mental plinth
#

Expert choice? Like making an MoE layer where each expert is an ethos layer?

whole goblet
#

An ethos head, but yea

#

Expert choice -> PEER router(s) -> constructed expert -> gather token output from experts that selected it

mental plinth
#

Well, good stuff. It sounds like a really nice, simple test harness here that ought to be easy to interpret.

whole goblet
#

Yea. Going to try the normal stuff first but the expert choice router just feels like it’ll fit there

whole goblet
#

@mental plinth , I had half a thought yesterday. Do you think that PEER's router might fit some (albeit looser) definition of an encoder?

mental plinth
#

Well, it certainly seems close to Attention. It’s like it attends to the expert neurons instead of a sequence of tokens. It applies softmax just like attention, except only to the top-k experts. And since there’s no “causal masking” I suppose you could compare the whole thing to an Encoder Attention block in that regard?

What aspects of an encoder were you thinking about?
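
The "softmax only over the top-k experts" step can be sketched in a few lines of NumPy. This is a generic illustration with made-up scores, not code from any PEER implementation.

```python
import numpy as np


def topk_softmax(scores, k):
    """Softmax over the k largest scores; every other entry gets weight 0."""
    idx = np.argsort(-scores)[:k]
    shifted = scores[idx] - scores[idx].max()   # subtract the max for numerical stability
    w = np.exp(shifted)
    w /= w.sum()
    weights = np.zeros_like(scores, dtype=float)
    weights[idx] = w
    return weights


scores = np.array([2.0, -1.0, 0.5, 3.0, 0.0])   # made-up router scores for 5 experts
weights = topk_softmax(scores, k=2)
```

Unlike causal attention, nothing here is masked by position: every expert is a candidate for every token, and only the score ranking decides which ones survive.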

whole goblet
#

Mostly that it’s encoding language at that exact point in time, since it’s working off a point-in-time view of the language with causality baked in, in a way that is consumable by a hypernetwork

#

It might be stretching the definition a bit, but does definitely feel like the IO matches the shape of what I do for natural language encoding in my job

whole goblet
#

So I tried to reduce this to just comparing to a standard MLP without the complexities of the surrounding model. Chose a CIFAR10 flattened set (not 1:1 with language, I know) just to try to track relative performance on situations where MLPs can struggle. Gave me better signal (on at least this task) on what configs beat an isoparam dense baseline and which don't. Also gave me some better info on which hyperparams impact performance more than others

#

In isoparam (where this arch is actually technically fewer FLOPs now after some tweaking) I'm able to consistently match or beat the single layer FFN baseline. Deeper FFNs are consistently beating, though, so I'm not sure exactly how to compare since MoE arches tend to use a single layer MLP
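
For an isoparam comparison, the dense baseline's hidden width can be solved directly from a parameter budget. A minimal sketch for a single-hidden-layer FFN that maps `d_model` back to `d_model`; the function names and example sizes are assumptions for illustration.

```python
def ffn_params(d_model, d_hidden, bias=True):
    """Parameter count of a single-hidden-layer FFN: d_model -> d_hidden -> d_model."""
    count = 2 * d_model * d_hidden
    if bias:
        count += d_hidden + d_model
    return count


def isoparam_hidden(d_model, budget, bias=True):
    """Largest hidden width whose FFN parameter count stays within `budget`."""
    per_unit = 2 * d_model + (1 if bias else 0)   # cost of adding one hidden neuron
    fixed = d_model if bias else 0                # output bias, independent of width
    return max(0, (budget - fixed) // per_unit)


# Example: recover the width of a 1024 -> 4096 -> 1024 FFN from its own budget.
budget = ffn_params(1024, 4096)
width = isoparam_hidden(1024, budget)
```

Deeper baselines need a different count, since each extra hidden-to-hidden layer adds a term quadratic in the hidden width.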

#

But it reminded me of what Smerky said earlier, that this approach might just be giving the model more depth, which might just be an inductive bias for the CIFAR10 stuff

#

The most interesting part is that it performs way worse on early epochs, but starts to pull ahead pretty quick. It does plateau earlier than the dense baseline so I might play with lowering its LR

whole goblet
#

Okay, so counterintuitive finding: lower LR on the router, higher LR on the hypernetwork is preferred.

Not as counterintuitive findings: router is massively overparameterized in some ways and underparameterized in others. Starting to get some amount of crystallization on where the design space is for this without having to do full on training runs

whole goblet
#

In testing, multiple heads here actually perform worse than a single router and single hypernetwork. My guess is that if there was an actual positive benefit from multiple heads in the PEER paper (becoming less and less convinced that paper wasn't outright wrong or required very sensitive hyperparams), it likely was due to having discrete experts that needed to be combined into valid experts.

#

And that might look different from a hypernetwork that jointly learns what that valid expert is.

mental plinth
#

Nice! I like the approach. I actually did my first round of experiments using a Vision Transformer (ViT) trained on CIFAR-10. It's nice because you can train that wicked fast.

I like the approach of holding the parameters constant, too. If an architecture is truly "better", I think you'd expect that it's able to score higher given the same parameters, right? I think too many of my early experiments were in the regime of "it performs slightly worse with fewer parameters", and I've started to see that as not a very useful observation.

#

Maybe putting the ethos / peer layer into a ViT with an established configuration would make for a good comparison, over just a straight MLP that you have to architect yourself?

#

Also, do you think there might be any merit to testing PEER with a dense router? The PK router makes it efficient to run for a large number of experts, but also introduces a low rank decomposition, and I'm wondering if removing that and isolating just the single-neuron-expert approach could be informative. I'm not really sure--it's an incomplete thought.

whole goblet
#

Going to set it up to run ETHOS and it side by side with an isoparam constraint on the MLP, which should get pretty well isolated behavior curves at least

whole goblet
#

Ignore the mha misnomer since it transferred over from my toy example, is in fact using ViT. Up to date code, running the included sweep now. Includes FiLM which is an improvement on PEER to get an actual rank-k MLP to generate without needing to blow up the hypernetwork's output layer (in this arch it's now rank-m, with k being the depth of modulation).
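
For readers unfamiliar with FiLM (feature-wise linear modulation), here is a generic NumPy sketch: a shared MLP whose hidden units are scaled and shifted by a conditioning vector. It is not the implementation from the gist, and it assumes the conditioning vector `z` would come from the router or hypernetwork. All sizes and names are hypothetical.

```python
import numpy as np


def film_mlp(x, z, W1, W2, Wg, Wb):
    """MLP whose hidden units are scaled and shifted by a conditioning vector z."""
    h = np.maximum(x @ W1, 0.0)   # shared base projection followed by ReLU
    gamma = 1.0 + z @ Wg          # per-unit scale; identity when z @ Wg is zero
    beta = z @ Wb                 # per-unit shift
    return (gamma * h + beta) @ W2


rng = np.random.default_rng(0)
d_in, d_h, d_out, d_z = 6, 16, 6, 4           # toy sizes
x, z = rng.normal(size=d_in), rng.normal(size=d_z)
W1, W2 = rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_out))
Wg, Wb = 0.1 * rng.normal(size=(d_z, d_h)), 0.1 * rng.normal(size=(d_z, d_h))
y = film_mlp(x, z, W1, W2, Wg, Wb)
```

The generator only has to emit two vectors of hidden size per token rather than full weight matrices, which is why its output layer stays small.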

Without doing an arch sweep (next phase once I have a decent idea of what hparams I should use), best baseline is getting 84.37, vs 83.27 for ETHOS. So consistently behind, but I think a decent chunk of that gap will be made up by better balancing of parameters once I do an arch sweep. Relatively certain current setup is overparameterized in some places, and under in others.

https://gist.github.com/wrmedford/ef452a86bae0c7dd1201b5e4e265729a

#

Normal hedges of non-pretrained ViT on limited dataset, etc. etc. Just seemed like the best way to isolate purely the MLP aspects of this without using a super contrived method

whole goblet
#

Small update: ViT stuff is going reasonably well. Just still doing a lot of arch/hparam searches. Narrowing, but not exactly solidified yet

whole goblet
#

Alright, going to be setting this down for a bit. In a pretty isolated MLP vs ETHOS bakeoff, error bars overlap after tuning both, but it's not consistently beating a dense baseline. Maybe/Probably has better performance in transfer learning, but no clue yet. Going to set it down for a while.
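
One simple way to state "error bars overlap" is to compare mean plus or minus one sample standard deviation across seeds. The sketch below uses made-up accuracies purely for illustration; they are not results from these experiments.

```python
from statistics import mean, stdev


def summarize(accs):
    """Mean and sample standard deviation across seeds."""
    return mean(accs), stdev(accs)


def intervals_overlap(a, b):
    """True if the mean +/- 1 std intervals of two sets of runs intersect."""
    (ma, sa), (mb, sb) = summarize(a), summarize(b)
    return (ma - sa) <= (mb + sb) and (mb - sb) <= (ma + sa)


# Made-up accuracies over three seeds each, for illustration only.
dense = [84.1, 84.4, 83.9]
ethos = [83.8, 84.2, 83.6]
overlap = intervals_overlap(dense, ethos)
```

Overlapping intervals only say the runs are not clearly separated at this number of seeds; they do not show the two methods are equivalent.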

#

I'm going to be popping over to just do some more standard kaggle competitions + maybe try my hand at the gpu mode kernel writing stuff especially since gluon is a thing now if anyone wants to join

quartz kestrel
#

i did try a 256² experts version of a PEER implementation and got a gibberish generator. Was the original model able to produce meaningful text?

#

did someone try a better dataset than C4 and wikitext, like FineWeb-Edu?

whole goblet
spare flame
#

btw @whole goblet didnt know if you've seen this https://arxiv.org/abs/2508.18756v1

#

someone seems to have gotten PEER to be useful in at least some context

whole goblet
#

Nah I've stepped away from research in general outside of work.

#

I'll give it a read though

#

But yea just been spending my time on some trading strats, been doing well on that so far, but not really anything you publish or do outside of your own money

spare flame
#

o ok! just thought u might be interested since you were investigating PEER so heavily

quartz kestrel
#

when i say gibberish, not exactly gibberish, otherwise it wouldn't be PPL 7

#

it's just like repetitive patterns or sequences that don't make much sense for humans, but if you analyze the output, it carries meaning

#

to be fair, I wouldn't expect anything different from any model trained on wikitext-103-raw

#

I can't pretrain or finetune this scale on my hardware, would be awesome to see how PEER behaves on high quality educational datasets, such as FineWeb-Edu

whole goblet
#

Happy to go over what I learned, but I'm pretty sure that PEER is just a bitter lesson wrapped in some solid high level logic at this point. I think that if there is a path towards breaking past both memory bandwidth walls and increasing effective model size on this architecture, it'll have to be done through a hypernetwork. With that said, I'm not sure that a standard backprop will be the answer to better performance here.

Small experiments I did with different hypernetwork based models (where the hypernetwork creates the expert on the fly) never actually beat their dense equivalents.

whole goblet
#

It's basically where I dropped inquiry though