#ETHOS
The winner of the 2008 United States presidential election would be George W. Bush, who took office more than two decades ago.
In a paper released Thursday, the former Republican presidential candidate touted his administration’s economic record – including a successful push to promote American manufacturing through jobs-creation programs – as a turnaround, saying that the economy will be stronger this year.
“The economy is a strong force in the world,” said Bush, “as long as we’re not doing terrible things on
`step 55147/55148 (100.0%): train loss 3.1518, val loss 3.1519`
Small bump at the end, so will probably be interesting to see what happens in ablations. Still a very solid result.
looks great, I don't think you should read too much into various bumps and wiggles
Baseline running now
Yeah I said the same thing too haha, I've been burned too many times 😅😭
Curious, did the PEER baseline use the same batch size? (Overall, i.e. # tokens per step, i.e. microbatch size changing to match)
It could be that this method is more variance sensitive
plz use same batchsize
yeah (sorry I keep being a downer about checking things) also just keep in mind that even with same hyperparams and bsz its possible that by reducing the bsz you will have kneecapped nanogpt, and that your method performs better at low bsz than it
Yeah I think it might be that this method is more variance sensitive, if it's wiggly at 120M then scaling may be even moreso. Could also just be the smaller batchsize exacerbating it, something to keep in mind
(also ablating hyp effects/diffs vs the baseline/PEER baseline may be hard since different methods will be sensitive to different hyps, and not much compute for e.g. a full grid search (or the like))
Oh yeah def too
If there is enough compute to do final runs with the ~.5M tokens batchsize the baseline has then that would make the argument a lot more solid
(but I know there's limited compute here so you need to budget it)
yeah to be clear im not discouraging running the nanogpt at low bsz, you need to do that esp since its inexpensive
Yeah accum steps are impl so it should be a config change I think
a cheap way to make sure you're not kneecapping nanogpt baseline is run that at both small and large bsz
that way you can see if nanogpt baseline is better at higher bsz
if so, then you will eventually need to run your test at higher bsz too
if not, great
Good point yeah, hadn't thought of that
There is also a low-res chart from the nanogpt example run done in the repo but have no idea how much syncrot it's been under
That said this run does seem to be beating the LLM.c python baseline logs pretty handily
So that may bode well
But those may have been recorded before the LR change, which made it converge ~60% faster or so
So if that's the case, then the two runs would be roughly ~even or so, which bodes well (at least)
Yea, baseline is pretty fast to run, so I can check both batchsizes. Right now it's approximately even at the lower batchsize
But I'm only at the first checkpoint, and the more interesting stuff will be in a few hours
Already done
PEER baseline used a larger batchsize. Because it isn't creating weights on the fly the pytorch impl had less memory pressure
Yeah training dynamics will vary
Oh no, that does impact things a lot
I can rerun with smaller, or just rerun the new version with a fused kernel once I have that functional
tbh though if we are getting into hyperparam stuff, this is arguably comparing a run that has been tuned like crazy at all levels to a first real attempt
Maybe actually bigger after kernel works is okay since that would give a good probe into how your new method performs at different batchsizes/effective lrs
That's my thought
For peer, or nanogpt
nano
I'm just straight up materializing GBs of weights in the hypernetwork version that don't need to exist rn
Nanogpt has some okay tuning but not like an absurd amount iirc
But still, more than just a few runs 😅😂
Yea, I'm at n=1 😅
Yeah it's a pretty good start!
Also sounds like a good argument for investing in the kernel work
Now that the big run looks really promising
(hard to tell the order of operations on that sorta thingie)
But yea, the GPT-2 small version should be completely apples to apples. Identical config to the hypernet stuff (including LR which arguably should likely differ between architectures)
Yeah ideally as a speedrunner I'm immediately thinking about how things are set up for the hypernetwork, not just LR but also inits and a few other things
Glad it's doing well vanilla
Yea, I think even if on a first run for preprint, just showing that this works and is competitive without tons of optimization would be enough
But yea, the PEER baseline I think needs to be rerun
For an extra datapoint, though a .5M batchsize is kinda "standard" at this size so that's more marketable if it's worth it to rerun after doing some kernel work
I believe that peer was run at the default batchsize IIRC? (It's a bit more than .5M so not exactly but in that ballpark)
I probably won't do kernel work for the baseline just so I can cite the implementation and cite PEER's author saying that it's Good Enough™
Kernel work for your method I mean
491,520 tokens per batch, exactly
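For reference, a minimal sketch of how a tokens-per-step figure like this is usually composed, and how many accumulation steps a different micro-batch would need to match it. The specific split (12 sequences × 1024 tokens × 40 accumulation steps) and the sequence length of 1024 are assumptions chosen to reproduce 491,520; the function names are hypothetical.

```python
import math

def tokens_per_step(micro_batch_size, seq_len, grad_accum_steps, num_devices=1):
    """Total tokens consumed per optimizer step (the 'effective' batch size)."""
    return micro_batch_size * seq_len * grad_accum_steps * num_devices

def accum_steps_for_target(target_tokens, micro_batch_size, seq_len, num_devices=1):
    """Smallest number of accumulation steps that reaches at least target_tokens."""
    tokens_per_pass = micro_batch_size * seq_len * num_devices
    return max(1, math.ceil(target_tokens / tokens_per_pass))

# Assumed split that reproduces the figure quoted in the thread.
baseline = tokens_per_step(micro_batch_size=12, seq_len=1024, grad_accum_steps=40)
print(baseline)  # 491520

# A micro-batch of 64 sequences (assumed seq_len 1024) cannot hit 491,520 exactly.
steps = accum_steps_for_target(baseline, micro_batch_size=64, seq_len=1024)
print(steps, tokens_per_step(64, 1024, steps))  # 8 524288
```

When the micro-batch does not divide the target evenly, the two runs end up in the same ballpark rather than identical, which is worth stating alongside any comparison.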
Since it would be cheaper to rerun
Oh, yea. I'm cool doing that
Is it 491.5k as well?
NanoGPT is using the small one
I mean the normal baseline for it
But this will be done in a few hours. It won't be an issue to rerun at larger batch
Oh, yea
It is the same
Nice awesome
Good datapoint as smerky said
Especially since it's cheap
Yea, at a minimum gives me options depending on where memory pressure actually ends up after the kernel work
I have a prototype that seems to work, but just need to test a bit before I say for sure
And reduces memory from 35GB running to a bit under 8GB
Nice!
Which does also mean that if I have to hop off the GH200's I'm on now, I could move back over to my 3090's
What's the baseline nanogpt network at that batchsize for memory use (also is this nvidia-smi or something a bit more fine-grained?)
nvtop (so smi under the hood iirc), seeing 6.8GB
yea there's for whatever reason just a GB or so of unfreed memory that shows up randomly and persists until I kick the node
But decent maybe for sanity checking potential max use
Interesting
Yeah
6.8->8 is certainly pretty reasonable
Yea, I could get it to be identical~ if I wrote a backwards kernel but I don't want to torture myself
fuck chain rule
all my homies hate chain rule
Okay, yea I should have run baseline first
But still worth while for me to figure out the fused kernel and run ablations, since we're in spitting distance
I do think I might need to spend a minute to really implement what was found here https://arxiv.org/pdf/2502.17405
That way I can avoid a full grid search of hyperparams for this model
Does also make me question why PEER underperformed so hard, that's not something I expected
always
live and learn 🤷♂️
but also, the most annoying thing is that sometimes these lines can cross much later in training
Yea, I need to see where I end up on throughput with the fused kernel
And that'll inform where to go from here
Same rough area to me suggests that there's still likely something here, just not obviously busted
definitely let the baseline run the whole way if u can
it may be closer than you think
also important to still run the large bsz baseline
Oh I am, I just shared because it's beating where hypernet ended up 70% of the way through
Still would have expected PEER 5b to perform better than dense baseline
Weird that it didn't
your lora-like constructions may be able to tolerate 10x the learning rate
Would that break comparisons?
yes but you dont have to care
you can assume nanogpt is optimal hyperparams for gpt2
So basically crank up the LR for PEER, but I might need to reduce LR for the hypernet
if you recall, i dont actually know how your code works 🤣
but if you use a lora-like construction with down then up projections then maybe you can up the LR successfully for that portion
the overall point is that your setup may require different hyperparameters to do well
I can shoot you the notebook. Fern took a look too
ok, then she may have a better idea of things that could help
Yea, this is the part that I'm confused on, is if I should attempt to compare ideal hyperparams to ideal hyperparams or keep them directly constant (or if there's an in-between)
it would have been convenient if the same hparams worked well for both, but now we simply dont know
so you might want to see if it learns faster with some changes
you can assume that nanogpt is reasonably optimal for itself
Sounds good to me
also good to check if the peer paper says anything about this for it
I also have contact with Owen He, so I might just ask him to double check the setup
Was hoping to avoid this by using an open source impl
Once this baseline is done I'll get the fused kernel working and then figure out budget for a hyperparam search. Other problem with a hyperparam search is I don't know if this exact shape of network is what will be ideal until I run ablations
And that might affect ideal hyperparams
there is no magic bullet here that I know of
try to find people who did similar things
and see if they had to compensate somehow specifically
Is lucidrains in this server?
no
I might just try to ping him on email and see if he has any insight
if you are going to contact him, an issue on the repo might be the best and most polite way
he has 10000 repos
Yea, that's the other reason I figured email might be best
🤷♂️
Oh, actually his website points to signal
Specifically for reaching out
Going to do that
oddly too the posted loss charts from the repo don't match this quite, maybe something to do with the different effective lr (or the like)
Trying a kind of dumb config while I test out the fused kernel. Dropped weight decay to 1e-2, and quadrupled learning rate
So far isn't exploding and is like a solid .5 nat ahead of the old run at this token count (5.5 vs 6.0), but still so early it doesn't mean anything other than it doesn't immediately blow up yet.
@tranquil fiber Hacked a bit on the fused kernel and used it as an opportunity to run small experiments on hyperparams, and I think you might have been right on weight decay
u r changing too many things at once
also, WD is something that is going to hurt short term performance but benefit longterm training which you arent doing, so if u wanna change it make sure u change the baseline too and rerun
its ok to change LR (because you can assume nanogpt is close to optimal on that for gpt2), but not ok to change WD without rerunning baseline
These are short experiments when I’m just getting the kernel working. It’s helping short term performance, fwiw.
yeah im just letting u know that WD always specifically hurts short term performance
so you're not learning anything about your algorithm by removing/reducing it
The hypothesis is that the hypernetworks might be over-constrained, and they see more gradients than any other part of the network so could be arguably further along at the same step. So just trying to see if there’s anything to that thought. Obviously not a replacement for a full ablation.
Past that, when talking about late training, is there a standard definition for that?
Like trillions
Since a 9B token run for models this size is already way past what chinchilla suggests
Or at least hundreds of billion
No one ever trains to chinchilla, it's only optimal in a kind of useless sense
I think I’m just coming to the conclusion that I need to do a full hyperparameter sweep which quickly becomes compute prohibitive.
Is there a set of tests that would reach “good enough” for preprint so I can garner more compute?
You can put out a preprint anytime you like. As long as you're honest about things and do reasonable comparisons the worst case scenario is it will be ignored. So it just depends on what your goals are for the preprint.
But more generally, my feeling is that no one will be interested enough in any method that performs worse and is slower for the same parameter count for the preprint to specifically help get you compute.
You may still be able to get more compute tho! I'm only saying I don't know that a preprint will cause that to happen.
Also, the fact that your implementation and test gives bad results for PEER is problematic for trust in your current methodology.
No one is going to believe that your method dramatically beats PEER while simultaneously trusting that your result showing PEER is destroyed by GPT2 was done properly.
You've got some serious problems right now and this does not warrant a paper at the current time.
What it does warrant is further investigation, maybe including trying to replicate the PEER results.
I feel like I'm always having to give bad news here, but the reality is you are going to need to identify these kinds of problems without 3rd party input.
After all, you already know the facts of the situation as well or better than I do.
When things 'seem wrong' or you believe they don't show the right result, you gotta either figure out how to address those issues or decide that they are in fact the right result and your hunches were wrong. Then update your priors, learn from the experience, and try some new thing or approach.
Right now, you're faced with two opposing results: PEER was worse than yours, but PEER was also much worse than GPT2
Either PEER isn't actually good, your test/implementation isn't good, or it doesn't work well at this scale/hyperparameters.
Possibly all three!
None of these things imply that your method works well. But it leaves the door open that it might, especially if it works well in situations conducive to PEER.
It's also important to recognize that your hunches or ideas about what should work well can be just plain wrong. I think experienced researchers expect that the vast majority of the things they try will NOT work, regardless of their hypotheses with theoretical justifications.
This is certainly true for me personally. I give up on things or methods and/or have to try a very different tactic maybe 75-99% of the time. I'm sure @tranquil fiber can give more insight into her batting average, but I'd guess that it's similar.
You should be trying to do more tests quicker with less resources so that you can swing the bat more frequently. This is why I advocated for testing against the nanogpt baseline early rather than late. Because time spent per idea matters. You can also spend a lot of time drilling down on one idea, but it's only good to do this when you have a good sense of what may be salvageable and what can't.
And most importantly of all: you always need to be testing against a fair yardstick as early as possible. Science is about verification via either experimental evidence or mathematical proof. Without this, you are just hoping that your intuition is well enough developed while flying blind.
In summary:
- Test early and often against a fair yardstick.
- Try to use less compute and/or obtain more compute prior to having proven stuff.
- Ask yourself every time what it is you've proven and what is uncertain. And how you can most efficiently add to the evidence in either direction.
- Don't put out a paper until you have shown something useful and can prove it. [This can include negative results if you have strong evidence]
- Get a lot of at-bats as you work on increasing your batting average.
I think that there's still likely reasonable learnings here from being able to show that this architecture works at all, since it is a weird architecture. Hypernetworks are themselves difficult to train, so showing this architecture can rein some of that in could, I think, have signal for some folks
Don't most researchers have faculty or colleagues to look at this?
I'm not sure exactly how you would reduce here, and this feels like it goes directly against a handful of times that you've suggested I need to do more, not less.
And that's most of what I've asked for input on, is if it's a fair yardstick. There's so many confounding variables, and most papers I've looked at don't mention basically anything I've been called on here.
sorry for not being clear enough on that... amount of time spent on each test is not directly related to scale
there are lots of ways to tackle this and there is no uniform answer, it's going to require creativity
but for example now that you have the baseline gpt2 run you can pretty quickly guess at how new runs of peer etc are doing
or you can make your code run faster (its a tradeoff of your coding time tho)
or you can spend more money per unit time to get things done faster, and/or you can stop runs early since u can compare to gpt2 base
I mean, I could train to chinchilla optimal and cut down these training runs by a bit over 3x, if that would still provide enough signal
But I'm still not sure exactly why it's "useless"
you can also try an even smaller scale to get your bearings
chinchilla? oh its just useless for actual inference so no one does it
its fine to train to it as a test
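A quick check of the "bit over 3x" figure, as a sketch. It assumes the common ~20 tokens per parameter rule of thumb for Chinchilla-style budgets and 124M parameters for GPT-2 small; the 9B-token run length is the one mentioned earlier in the thread.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb assumption, not an exact constant

def chinchilla_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal token budget for a model of n_params."""
    return n_params * tokens_per_param

n_params = 124_000_000              # GPT-2 small (assumed size)
current_run_tokens = 9_000_000_000  # run length mentioned in the thread

budget = chinchilla_tokens(n_params)
reduction = current_run_tokens / budget
print(f"{budget / 1e9:.2f}B tokens, {reduction:.2f}x shorter than the 9B run")
# 2.48B tokens, 3.63x shorter than the 9B run
```

Note the caveat raised just below in the thread: the rule of thumb was fit on conventional dense FFNs, so it is only a convenient stopping point here, not a claim of optimality for a different architecture.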
Can you expand? Is it useless for architecture research? My goal isn't to pump out a SOTA model, just a finding on this to explore the space
It's not like PEER was ever released
yeah it makes sense to train to it here, but I also dont think anyone is going to care that much if you train to it specifically
I don't want a pat on the back for training to chinchilla optimal, just to know if it's good enough for that to not be a sole reason the methodology is dismissed
you gotta keep in mind that your new FFN replacement is going to have different training dynamics than regular ones so its kind of irrelevant to a study (chinchilla) that was done on traditional FFNs optimality per tokens trained
Sure, but there's going to be so many differences in general, but for some reason I also have to keep LR/WD/etc. identical despite those likely having different dynamics as well?
Just trying to understand the line here
I did not tell you to keep LR the same, but I did tell you to keep WD the same
Specifically, you can and should change the LR for YOUR model
if feasible
but the original LR was a good starting point when you literally had zero runs done 🙂
Yea, I'm just moreso saying that I might also have to take an approach where I use different LR's for different parts of the network
And will likely reduce WD just for the hypernetwork
And all of these make me question how "fair yardstick" is defined
Since they all clearly have different effects on different models
And that's without getting into other stuff like beta/clipping/etc.
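A minimal sketch of what "different LR and WD for different parts of the network" could look like as optimizer parameter groups. The `"hypernet"` name tag, the parameter names, and the base values (LR 6e-4, WD 0.1) are all assumptions for illustration; the returned list of dicts uses the `params` / `lr` / `weight_decay` keys that `torch.optim` optimizers accept, but nothing here imports PyTorch.

```python
def build_param_groups(named_params, base_lr, base_wd,
                       hyper_tag="hypernet", hyper_lr_scale=1.0, hyper_wd=0.0):
    """Split parameters into a default group and a hypernetwork group.

    named_params: iterable of (name, param) pairs, e.g. model.named_parameters().
    hyper_tag: hypothetical substring identifying hypernetwork parameters.
    """
    hyper, rest = [], []
    for name, param in named_params:
        (hyper if hyper_tag in name else rest).append(param)

    groups = [{"params": rest, "lr": base_lr, "weight_decay": base_wd}]
    if hyper:
        groups.append({
            "params": hyper,
            "lr": base_lr * hyper_lr_scale,
            "weight_decay": hyper_wd,
        })
    return groups

# Strings stand in for tensors so the sketch runs without PyTorch.
named = [
    ("transformer.h.0.attn.c_attn.weight", "W_attn"),
    ("transformer.h.0.hypernet.proj.weight", "W_hyper"),
]
groups = build_param_groups(named, base_lr=6e-4, base_wd=0.1,
                            hyper_lr_scale=4.0, hyper_wd=1e-2)
for g in groups:
    print(g)
```

This only shows the mechanics. Per the discussion above, a WD change applied to one model and not the baseline still needs a matching baseline rerun before the comparison counts as fair.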
I guess that's what I'm getting at, is there's no good answer, but you're telling me that the answers I'm coming up with are definitely wrong
So just trying to figure out where you're actually drawing these lines
Like are there papers I can read on this stuff? Because I've been pretty obsessively looking at how other arch papers present this stuff, and most of these topics aren't even touched on
yeah you're right that it's hard to know that stuff without a lot of detailed info about hyperparams, but I think generally changing LR is fine and that's all you should change
you can assume (for now) that nanogpt is somewhat near optimal hparams for itself
im just telling you to try to move as rapidly as possible
I think from early tests and also following Fern's advice, there's likely some benefit from looking at weight decay on the hypernetworks, both because they receive more dense gradients than any other part of the network, and their ability to produce diverse outputs directly ties to the hypothetical capacity of the overall network
yeah she is better equipped to advise you on that than I am
but you could also just rerun the baseline with no WD
which is safer
WD isnt going to help on a short test but it will HURT
so removing it on only yours is unfair
But couldn't that theoretically harm GPT-2 Small's results?
theoretically sure, but in practice it will help its results (i think!)
It's showing positive delta against another baseline within like 100M tokens on this arch
I don't think I understand what you're saying. Aren't these two statements mutually exclusive?
im saying that WD hurts short tests across the board, so remove it from nanogpt gpt2 first if you really want to remove it from yours
and show that it helps (or at least doesnt hurt) gpt2 to remove it there
Sure, I just feel like removing WD from one that benefits, and another arch that hurts from it removes the fairness of the yardstick?
sorry, are you saying that WD will benefit yours?
Yes
oh
that's what I've been saying lol
I do
Higher weight decay hurts the network
Basically because more diverse outputs seem to help
when I say 'wd benefits' I mean "MORE wd benefits"
Oh, no, the hypernetwork wants as little weight decay as possible from what I can tell
right, so take it away from gpt2 first
show that it improves gpt2 to do so
then run it without wd on yours
Oh, so you're saying that less weight decay also helps GPT-2
as I said, (MORE) WD hurts short tests across the board
these are short tests
if you were doing 300B tokens I'd have a different opinion
Okay, now I'm understanding
sorry if I wasnt clear enough about that
btw I could be wrong! but generally this is true
I just thought you were saying that by reducing WD, I should expect GPT-2 to get worse, so it felt like it would be sullying tuned hyperparameters for GPT-2
Okay cool, we're on the same page then
this is all just to chase the idea that reducing WD will make your hypernetwork better
I don't necessarily think that's the right thing to chase, but I'm no expert!
unless you think its also damaging PEER
to me the #1 problem here is that PEER looks awful
that's a big problem
Yea, I think that there's some reproduction issues here. I'm going to go and compare my original implementation of it to lucidrains' to see if there's any obvious bugs
Because I was seeing better performance when I was working on similar with ETHOS
PEER might just be bad
more importantly, check the PEER paper to see in what situations it actually (supposedly) worked well for them
if its a very different size regime etc.
and try to reproduce that however you can if possible even briefly
They were doing IsoFLOP comparisons, didn't care about isoparam
and/or by talking to the author
Yea, he's reviewing my reproduction hopefully in the next couple weeks
yeah its possible that PEER is horrible at isoparams
Which is why I went for 5B, since it's IsoFLOP(ish) with GPT-2 Small
but also if PEER is horrible at isoparams and this is inspired by it, that might imply problems for yours
hard to know
Which is fair, but I'm showing that with Iso"Capacity" that I'm outperforming, so it still feels like there's something there
Even if the base has some trouble
Like I think the most interesting thing here is that this model is effectively constructing rank-k experts neuron by neuron
And is like, spitting distance from dense
well if you can even reproduce any part of their runs that would at least give you confidence that your implementation isnt wrong
no idea if they supplied or can supply enough data for you to do that
or its at any sort of feasible scale
I did with small variance back when I first started on ETHOS. I can likely drag that back up
It is, and I have a kernel that makes it pretty tractable
But that was with some other parts that I'm not sure how to match 1-1
I'd expect that harnesses within GDM at the time were using GQA
And my harness used MLA
And nanoGPT uses MHA
So there's a lot of potential variance just from attention mechanism used
PEER's paper itself is pretty light on hyperparam details. We don't even know the model dim or if it was held constant
Or if depth was constant
author can shed light on this much more easily than they can check your code
yea, I'll shoot an email
(yes agreed almost generally definitely at this scale)
yes agreed as well
since this is in the publishing help section, I have one more note about why and when to put out a preprint:
consider how many citations you expect to realistically get and base it on that (considering who and why they might cite your work and in what situation and other papers)
don't write or put out a preprint until that number is greater than zero
5-10% for methods I'm new(ish) to, ~30-40% for methods where I have a very strong guess that makes sense (a lot of this is just sense from last ~decade or so).
but that doesn't include e.g. bugfixes or the like as much, maybe a bit.
but agreed that maximizing the amount of times you can pull the lever is generally best, especially when starting out, it's how you gain the most intuition and information
(if completely new to something, e.g. an entirely new subfield, at the start, maybe a bit worse than 5-10%, e.g. ~3-5% or so)
Yea, and I think I'm going to have to take another stab at this kernel. I got one working, but it's not beating easy autograd tricks
Can get batch size up considerably now, though with the autograd tricks
And did get about a 30% improvement in throughput
And this is solid advice
sounds like way better bang for buck than kernel work, also see if various compile tricks can help
I can just tell what's going wrong, but getting the tiling correct has been a pain
and run some much shorter tests varying that stuff if u want to and comparing to short run of baseline without wd etc.
that way you can get feedback in minutes or hours not days
shorter doesnt always extrapolate cleanly but at least its rapid signal acquisition
you can only really learn from an experiment, so running lots of those may help you learn stuff about what works in this regime and what doesnt
Yea, I can get a complete chinchilla run done on this (assuming it is the first point I can extrapolate from) within a few hours
ideally shorter than that for testing stuff
it's kinda like rebuilding a whole car to see if a new piston fits
yeah man just jiggle the pistons a lot!!!
you should do scheduling, hyperparameter tuning generally on longer runs
most other things (esp smoke/sanity testing if things are working okay) should happen on shorter ones
yeah, proxy tests like compressing water in the chamber to see for leaks (you can probably tell i dont know cars), that kind thing. in spirit
I mean, this thing is training, so I'm not doing a ton of long runs checking if stuff is working. Usually if it's completely not functional I know within the first 10 iterations at most
yep
tho for e.g. the peer baseline i think it should be pretty short to know how the repro does
you may need to see how their loss curves compare against baselines to see what you should expect
But I'm definitely at the "how do I tune this" phase ahead of trying to get ablations started, but it seems like this would benefit from a hyperparameter sweep before those
larger models usually are much more step efficient, time-wise they may take longer to converge, but stepwise should almost always be faster than smaller models
unfortunately they only published final PPL values on C4 after a specific flop budget 🙁
hm yeah
tbh I think I might just want to dodge the PEER baseline question
defaulting to "bigger model should be better stepwise almost always" then should be a good litmus test there
yeah maybe, that would simplify things a bit
it would be nice to have as a comparison
but in the spirit of minimization
I will say @whole goblet the unfun thing is that in all likelihood, after fresh-implementing things, usually there are 2-3 major bugs, and a bunch of minor ones, sometimes maybe more depending on size of implementation
(and depends on implementer as well, you'll get a vibe for what yours are)
(usually these are ones too that are more silent)
so, doing very thorough tracethroughs of the code and what's happening oftentimes helps with these, esp on smaller toy examples
Yea, I did try to bug bash before really touching too much
I had a major bug early where I was just generating the same neuron k times
yeah
it can be subtle
even in proper implementations there are things that can be hard to discover e.g. MoE router collapse
(which is also why having a good testbench w/ a 5-10 minute turnaround for basic tests is really really useful for in-the-loop debugging)
(but longer also works too, anything over an hour gets to be a bit more iffy w/ exploration stuff)
it may be good to also plot tons of statistics in tb (or the like) about what's happening in the routers, e.g. router weight distributions, weight similarities for generated values, etc
to see how that evolves over time
you generally want things to be gaussian, (or, if it's principled, log-normal -- but you really should be sure as to why here), if things collapse or spread out weirdly then there's a sign that something might be off in your training
(max and min also is super helpful for determining this)
basically, putting eyes everywhere on what's happening in your code
you gotta know intimately what's happening and evolving over time in training, to make sure that everything is healthy
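A minimal sketch of the kind of cheap statistics described above, in plain Python so it can be dropped into a logging hook. The function names and the choice of metrics are assumptions: min/max/mean/std for any weight or activation vector, normalized entropy of expert usage to spot router collapse, and mean pairwise cosine similarity of generated vectors (which would sit near 1.0 for a bug like generating the same neuron k times).

```python
import math

def summarize(values):
    """Min, max, mean, and population std of a flat list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"min": min(values), "max": max(values),
            "mean": mean, "std": math.sqrt(var)}

def routing_entropy(counts):
    """Normalized entropy of expert usage: ~1.0 is uniform, 0.0 is collapsed."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = sum(-p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    return sum(cos(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

print(summarize([0.1, -0.3, 0.2, 0.0]))
print(routing_entropy([25, 25, 25, 25]))    # balanced usage
print(routing_entropy([100, 0, 0, 0]))      # fully collapsed router
print(mean_pairwise_cosine([[1.0, 2.0, 3.0]] * 3))  # identical generated neurons
```

Logged every few hundred steps to TensorBoard or similar, these scalars give the "eyes everywhere" view without adding meaningful cost to the run.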
Makes sense. I'll start collecting more of that
yeah, keeping an eye on cheapness and what can you spot early that predicts other things later on as well
part of dealing with new methods is establishing cordons for good performance vs bad, and basically diagnosing what's going on in any given area
(which can be quite hard and expensive at first as you build up intuition for yourself, sometimes building out that toolbox can be a lot of work! but it is quite well worth it in the end)
one thing e.g. the variance of your method's performance really does beg for a larger batchsize, that's an easier one (if you look at the wiggle of the variance of the loss + the val loss you can see that the loss variance is leaking to the val loss variance which is generally a big no-no, it seems to be pretty strong too -- so that's one avenue that the network performance can improve along i think)
Yea, with the autograd tricks I can do a 64 batch size comfortably now
Seeing how much I can eke out with max-autotune rn
Yes, but microbatch size doesn't really matter as much as effective batch size, you can almost always increase the number of accumulation steps to (generally pretty efficiently) increase your overall effective batchsize
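A toy sketch of why accumulation reproduces the large-batch gradient, and the scaling step that is easy to get wrong. It uses a one-parameter least-squares model in plain Python; the model, data, and function names are made up for illustration, and it assumes equally sized micro-batches and a mean-reduced loss.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, scaling each by 1 / n_micro."""
    n_micro = len(xs) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        chunk = slice(i * micro_batch_size, (i + 1) * micro_batch_size)
        # Without the division, the result is n_micro times too large,
        # which behaves like a silently inflated learning rate.
        total += grad_mse(w, xs[chunk], ys[chunk]) / n_micro
    return total

xs = [float(x) for x in range(1, 9)]
ys = [2.0 * x for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)
print(full, accum)  # -76.5 -76.5
```

The same check (full-batch gradient versus accumulated gradient on a tiny input) is a cheap regression test to keep around whenever the accumulation logic is touched.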
Oh, yea 100%
Ofc there's something to be said for some ops doing better with larger batchsizes so that's a good thing yeah
Just means that I can ideally move this to more parallelizable systems once I get there
Without as many of the broadcast/collect portions
Honestly staying on one GPU for a while is probably your best shot
Usually it takes...many many hours of experiments and dev, and many long times of debugging to get a truly new idea functional, very very rarely does it happen straight in the first shot
I think you'd need to have a win in one of the iso regimes to move forward towards a paper for it
(which is a bit annoying I know, but at least it does give you options to trade along)
And in the meantime you need (generally) as much simplicity as possible for it
Yea, and to be clear not trying to say I've won in the iso regimes. Just moreso that if the PEER trends are accurate (which I guess is worth questioning) I can get a cheap win by just increasing k arbitrarily which doesn't increase parameters but drastically increases flop cost.
But that feels more useless
Well PEER should be handily beating the nanogpt model iiuc
A lot of ML research is detective work debugging
Always being skeptical and assuming something is wrong
Yea, I'm starting to wonder if there's a major problem in reproduction
Yeah
He said he had looked at lucidrains' impl and said it was good
But maybe didn't look that carefully or something?
Or I'm doing something wrong
You can also try smaller # params to see if it's a quirk of having tons of PEER params (i.e. how does peer do with isoparams? Only 2-3x params? Gotta rule out something here. But only after a good bug sweep)
It's not always bad to assume that one is doing 2-3 things wrong at any given time, and even if the results beat the baseline being skeptical can be an enormous way to improve performance
PEER with isoparams is pretty hard to do, since it grows quadratically
Yea, I typically do, but this is pretty straightforwardly just replacing the FFN block with PEER. It's basically an import and config
Oh 100%. I'm just annoyed because I reimplemented it, but then decided to go with lucidrains so I can cite instead of proving that my baseline is valid
It's a good idea to look at their code carefully line by line and see what's going on
‼️
^
Well if your code and lucid's code agree in performance then that's a pretty good sign (if you didn't cross-reference the two)
Nah I implemented it before I knew his existed
It was just easier to cite than use my own
(this is a hint you should probably run using both to see if they are equivalent!)
yea, it's just in a harness that was intended to replicate PEER directly, and the one I did with lucidrains was glommed onto nanogpt, so it'll require a not insignificant amount of work
I also haven't touched my own implementation in a few months lol
Compare results
Yep, just need to do the reimplementation work. Looking back on it, we also integrated heavily with MLA so we could compute in a latent space
So it'll take some retooling
80% faster, no kernel needed, just einsum bullshit
Just 2x slower than dense baseline now
For the hypernet version
Actually should say more than 80% faster, it's an 80% reduction in wall clock time. 178k tokens per second
It's late, need to make sure I didn't just bug the hell out of this, but it's training and seems to be identical
Jk I messed up grad accumulation logic
Have half a thought, that with the query being the primary input, and then router shenanigans producing conditioning coordinates and scaling factors, that I might be able to just use RoPE for positional encoding of the query. Relies on the hypernetwork being able to generate already scaled experts based on the query itself, but I don't see any reason that can't be the case. Also is massively more efficient at runtime (router right now is 15%~ of forward pass and makes backward pass pretty rough)
Going to give it a shot.
In v1 of this, we did do a lot of computation in a latent space and it worked pretty well. It would constrain the size of the hypernetwork if we project down
Does seem like the middle layer of the hypernetwork was not helping performance that much. Linear projection with no hidden layers is handling just fine. Guess is because the nonlinearity is captured by the generated weights, but could be entirely wrong there.
I think this learning plus operating in a latent space would be able to get this within striking distance of throughput of a dense model, and then the question is just if performance can match. Latent space computation would also give me a lot more room to play with h x k values, which showed pretty consistent performance returns in PEER (assuming we work off of those results being reliable)
@whole goblet How are the results coming along? Do you have anything particularly exciting to report? Or is there a write-up somewhere I can check out?
Been ups and downs. Original idea might still have merit but don't have a clean baseline for that, and want to revisit, cc @mental plinth
Ended up chasing down eliminating expert weights altogether and I'm getting near baseline loss performance without tuning, but right now hitting some efficiency issues when using a pytorch implementation, and hitting throughput issues when trying to beat pytorch with triton. Trying a more hybrid approach today where I don't try to recompute everything on backward instead hand off the bulky matmuls to pytorch and let autograd handle everything else
It's just kind of a weird spot because I'm generating a lot of weights on the fly, so need to make sure they live as short as possible.
Otherwise you end up with such tiny microbatches that you can't get reasonable saturation
Gotcha! Mostly curious / checking in about the general status. What compute resources do you have / how bottlenecked by compute are you?
I have about 2k left in a Lambda grant, and currently trying to stretch that with a single GH200 until I have good enough throughput to justify broad ablations
Right now doing more single threaded stuff because of trying to get perf in a good spot, but will be bottlenecked once that's done
That said I have a lot of experience with k8s, so considering doing a self fund through something like sfcompute if I run out
We have some 8xA40 machines which could be useful for testing scalability, but assuming you're looking for chonky GPUs to do actual runs that's not something we have sitting around. I'm happy to talk about working to help get you a grant or something, if the results are exciting enough.
I'd appreciate it! Right now I'd want to make sure it's worth your time, and not beating a dense baseline and not having any throughput benefits yet is pretty hard to justify 😅
That said, I think there's a path where this will be beating dense baselines, but need to get at least within striking distance of appropriate throughput before I feel like I can justify another request
(bitter lesson and all that jazz 🙃 )
What's the goal, in a couple sentences?
See if we can solve the parameter explosion problem (and therefore susceptibility to hitting the memory bandwidth wall) in MoE by trading stored parameters for generated ones.
Like honestly the most interesting stuff at this point is that I think I've shown that you can generate a coherent set of experts on the fly when you construct them one neuron at a time, but that's mostly neat and not exactly useful right now.
I do also have some future work to see if I can replace what's currently a k^d operation which limits the generated expert's depth with something more efficient like faiss
```
iter 60: loss 8.9143, trailing_100 9.6500, lr 1.80e-05, time 4567.63ms, 35870 tok/s, MFU 1.19%
iter 61: loss 8.8450, trailing_100 9.6370, lr 1.83e-05, time 4566.15ms, 35881 tok/s, MFU 1.19%
iter 62: loss 8.7013, trailing_100 9.6222, lr 1.86e-05, time 4567.74ms, 35869 tok/s, MFU 1.19%
```
Progress on performance through algo improvements. Now if I can just get reasonable saturation (pretty sure this isn't using tensor cores as much as it should be) we should be in spitting distance of a dense baseline
And some tweaks to get saturation up, but does feel like I need to go get some outside advice on how to get this to be fast fast.
```
iter 512: loss 5.5459, trailing_100 5.7531, lr 1.54e-04, time 4754.85ms, 41349 tok/s, MFU 1.38%
```
@spare flame Just a heads up, been playing with nsight a decent amount, and I'm finding that the memory bound nature of base PEER is also largely in the router because of how it materializes just a few bytes that are then extremely low intensity for future accesses. Weighing a persistent kernel for that
That's good, but I'd probably stick to the suggestions we gave re: baselines and comparisons before diving more deeply into technical efficiency stuff. I get that's a lot more tempting and "fun", but really if you're going for a paper you should probably work on some of the foundational comparison performance gaps and baseline stuff first
It's pretty necessary for me to get throughput to a state where I can effectively utilize compute. I already know that it's modestly losing to a dense baseline on performance on first few passes. It's easier for me to spend time on the kernels than it is for me to spend more money on ablations right now
Basically nobody is going to care about if I match a dense baseline if it's 10x the wall clock time to train
And the PEER baseline I think I'm just going to drop. If a paper relies on a pure reproduction of PEER that's going to take significantly longer
And I haven't heard back from lucidrains on if he ever got his implementation to train
I was on his old discord discussing it with him and was probably the only person trying to get it to train at the time. And I decided it was too slow to bother with.
Think it's reasonable to just drop it?
idk at this point it's probably better to compare to dense baseline even though it has some derivative parts
I agree with @tranquil fiber
I don't know whether or not peer is worth it bc I don't know if peer is good or not
Really the calculus I'm running right now is I can improve performance dramatically with reasonable amounts of work, which stretches my compute budget further
Yeah there's a correct tradeoff wrt effort there but you'll have to decide where the line is
It also helped me stumble on a better factorization of this
Yea, I just don't think I'm at that line yet. With MFU as low as it is, and getting throughput where I have, I think if I can hit reasonable MFU then I can show competitive wall clock time with dense baseline, which makes it more apples to apples.
Since I'm kind of competing against CuBLAS in pytorch for something as straightforward as a single hidden dense FFN
The only reason to optimize first imo is if you cant do the experiments otherwise so you can't work on improving the architecture
That's basically where I'm at. I think I can squeeze another order of magnitude of throughput out of this
Which gets me a lot more experimentation
Probably
I'm already beating baseline throughput by 35%~, so it hasn't been wasted work so far
But it won't necessarily lead to any useful result so you just gotta weigh the time cost
Yea, from my perspective once this grant is over, if I don't have interesting results, I probably won't try to get more compute. So this should get me my best shot
And if it fails, then it fails
I'll open source the negative result and move on
Basically just trying to avoid "And I decided it was too slow to bother with." for this arch
btw the reason to follow up more on PEER is because it has implications for why yours is underperforming
I dont remember if they had an equiparameter study in their paper etc.
but if they were able to show PEER outperforming then it could be worth trying to figure out how to get to that regime
(they could also have just messed up somehow, who knows, so all of this is a big question mark - you can never trust any results that no one has replicated)
They only did IsoFLOP :/ I'd imagine isoparameter would be pretty bad
tbh v1 of ETHOS is likely a better baseline despite the vocab size mismatch for understanding performance. It hit pretty reasonable loss with the latent expert approach when adjusted for vocab size
Quite the opposite, you need to make it work first, then speed it up. You don't know what's causing the gap, so leaning in to write a specialized kernel will only make ablations harder.
You can compare first 100 steps and that should be enough roughly, if your variance is low enough (I've given advice for reducing that as well)
First hundred steps seem dominated by hyperparams more than anything else. I'm still at like loss 9 at that point
Wes, last time it took a long time to finally do the nanogpt run instead of doing it first like i had suggested
I recommend that this time you listen to fern and my suggestion
in order to save yourself a lot of time and effort
But I have a nanogpt baseline now, so I'm not sure how this differs
Having a strong sense of direction is okay but you're kind of shooting yourself in the foot with some of the research direction, it would be good to listen to the advice for it.
this differs in the sense that you're going to do things in the opposite order of what will make it go fastest for you
100%, but maybe I'm not understanding the advice then?
I think that's my vibe
you can do it in any order, its only a question of how long it takes you to succeed/give up 🤣
I'm not sure quite how to make it "click" however
Yes definitely
our advice is to try NOT to speed it up a lot first
Yea, I see this as less of time constraint and more of budget constraint. Costs me next to nothing to write/test kernels. Takes a lot more to actually run relevant tests since I've been told at different points that I'm undertraining, but now it sounds like I'm overtraining?
everything is relative to the specific needs at that point, unfortunately, which can be confusing
Like I just won't have the ability to run N tests at a slow pace
You're still in the stage where you likely have a ton of bugs/initial arch issues, you need to understand the dynamics before speeding things up
I really don't think I am. The pytorch baseline is pretty tight at this point
if its tight then maybe you should give up
the goal is to find something better than what people already use
(bugs here being e.g. initialization preventing certain things from working as well, etc)
Can you explain why? I'm behind dense baseline only barely without hyperparameter sweeps or ablations
It takes time (sometimes several months!) of close examination to find them
Sure, but I guess I'm not understanding what the advice is then? Should I be looking somewhere else?
.004-.007 would be barely, .01 is significant, .015 diff is clear difference
can't you just make changes/hyperparams/whatever until it starts being lower loss than nanogpt early on
you dont need a full run or even a partial run barely to try to find that
Yes, I'd take a look at #1395195891262029884 message
I see that diff within the same arch just with different seeds even on dense baseline. Unless you mean .04-.07, etc?
Especially #1395195891262029884 message
Sure, but that usually involves cranking LR to a point where it starts to plateau early. I have runs where that has happened
You may have to t-test when they are close if variance is that high, but I do mean .004-.007, .01, and .015. anything above the variance threshold is usually so far off you don't need a t test for it
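A rough sketch of the t-test idea for close runs, using only the Python standard library. It computes Welch's t statistic across seeds. The loss values are placeholders invented for illustration, NOT results from any run in this thread, and `welch_t` is a hypothetical helper name.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Placeholder final val losses, five seeds per arm (made up for this example).
baseline = [3.280, 3.284, 3.277, 3.282, 3.279]
variant = [3.291, 3.288, 3.295, 3.286, 3.292]
t_stat, dof = welch_t(variant, baseline)
```

With around 7 or 8 degrees of freedom, the two-sided 5% critical value is roughly 2.3 to 2.4, so a |t| well above that suggests the gap is larger than seed noise. With only a handful of seeds this is a sanity check rather than proof.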
This is what I'm trying to get to, so I can identify the problems, but I'm also being told not to focus on speed :/ I just don't really know what you're asking for here
That level of differentiation usually doesn't occur until we're later in training
And I've also been warned this could have entirely different training dynamics?
Yeah, you don't always need speeds for that, just pick the slice of what's happening at the beginning of training and watch that with your tensorboard logs to see if you can pick out some trends
Usually I need to get at least 1000 iterations in to start to gauge performance of a hyperparam change
Yes, definitely. If you're able to log a long-enough run to pick up on any differences, you can use that to work out what's different in terms of training dynamics for them
usually if it early plateaus it's been in the 4-5 loss range
Then you can use that for the shorter runs
Yea, this just seems to happen pretty late. The router in this arch is closer to an inline encoder for the hypernetwork, so early early it can vary quite a bit just on hyperparams alone
That is a long time, yeah (by contrast that's about 70-80% of a modded-nanoGPT run, but, different beast and all)
It's a joint optimization problem in each layer
Really? Base nanoGPT is only at 4.57 trailing 100 loss at that stage
So am I using the wrong harness, then? I was told to just use default nanoGPT
Since it would be easier for reviewers to reproduce
We had a pretty lengthy discussion where I suggested it but you wanted to stick with nanogpt since your previous runs were in it. And that's okay too, nanogpt is a pretty decent baseline that's well accepted so I think that's alright
Yeah
It's just more compute
Unfortunately
tbh my earlier harness that was built on dsv3 felt easier, but I was told that it would have reproducibility problems so I abandoned it
using your own anything is not a good way to go
for the same reason that now we dont even know if PEER is real
Nanogpt is definitely a step up from self coded harnesses since it's verified code
(tho if you verify your own harness with a baseline that should be okayish)
Yea I mean the earlier harness was literally just DSv3's attention block
So I figured that was fine
But yea, maybe a port to nanogpt is a good idea at this point
Wes if you're able to find some statistics over training and link those to network performance, and use that to ratchet down what you think is happening, that may set you up for a quick-loop harness to iterate over
yeah to be clear, I am not promoting the idea that hyperparameters are going to be the magic bullet here
I feel like I have a pretty good idea of what's happening in the network. That's part of why I jumped for triton kernels. Makes me explicitly engage with that
Yeah it's possible, just a risk
Like, running on paper what's happening with each number and their magnitudes throughout training is super super useful
Seeing if there are any outliers in activations, etc (and why)
I don't think they'll be a magic bullet, but I do think that a single config that's not optimized can make up a .1 nat difference
Yea, I've got that now
Nice
How would I check that?
Are you looking at your histograms over time?
Wes is there some reason this is going to scale better than traditional ffn
The logging I suggested from earlier
Histogram everything
In tensorboard
(dense logging can help catch changes as well)
Yep. The PEER block is basically expert parallelism with deterministic behavior that doesn't require scatter gathers the way tensor parallelism in dense networks does
ok so can you just allow it to be a bit worse but scale it
Sure, just requires multi-gpu and some small rewrites
It also (if my assessment ends up being correct) should be faster in practice than a dense FFN in inference
Just slower during training
Which is another reason I'm getting this low level
I see
But yea, basically I have it to where you never have to leave the chip (at most SMEM writes) after attention
ok I dunno then - maybe its worth kernel work if it could be better in practical ways even if its maybe a bit less great than a normal FFN in performance
I think I can likely match it too with tuning, but it sounds like y'all aren't confident in that. This is really just my first config that's .1 nat behind once we're further into training
.13 to be exact
I don't really know what this means, because if you are using a lot of parameters those parameters are going to have to get to SRAM somehow
Sure, I guess I should be specific that they're single load
I'm saying intermediates never leave at this point, so it's getting pretty optimized
No readbacks
Just need to get tiling correct
how is that better than dense matmul?
Scales better than tensor parallelism because no scatter gathers between PEER blocks
Just a single reduce before the next layer
Basically you get the benefits of expert parallelism without the load balancing issues
that sounds good, how does that occur?
Because each GPU would just receive a broadcast of the post attention token batch. Everything is pipelined between the router and output at that point, because the router and hypernetwork are tightly coupled.
In traditional PEER you'd be selecting the discrete experts from a massive pool. Same thing happens at lower scale for other MoE models
Here, because you're constructing the expert, you don't have the same problem
youre able to slice up the thing that constructs the experts somehow across gpus?
More that as long as Heads % GPU_count == 0, each GPU can handle Heads / GPU_count heads, if that makes sense
sure
And then once each GPU has finished its batch, you just have a single reduce for their outputs
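A toy sketch of that head-sharding scheme in plain Python, with "GPUs" simulated as list entries. `head_output` is a hypothetical stand-in for a real head (router plus generated experts). The only point is the communication pattern: one broadcast input, local sums per device, one reduce at the end.

```python
def shard_heads(num_heads, num_gpus):
    """Assign a contiguous block of heads to each GPU; requires even divisibility."""
    assert num_heads % num_gpus == 0, "Heads % GPU_count must be 0"
    per_gpu = num_heads // num_gpus
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]

def head_output(head_idx, x):
    """Stand-in for one head's contribution (hypothetical, not the real head)."""
    return [(head_idx + 1) * v for v in x]

def forward_sharded(x, num_heads, num_gpus):
    partials = []
    for heads in shard_heads(num_heads, num_gpus):
        # Every GPU sees the same broadcast input and sums only its own heads.
        local = [0.0] * len(x)
        for h in heads:
            local = [a + b for a, b in zip(local, head_output(h, x))]
        partials.append(local)
    # The single cross-GPU reduce: sum the per-GPU partial outputs.
    return [sum(vals) for vals in zip(*partials)]

x = [1.0, -2.0, 0.5]
single = forward_sharded(x, num_heads=8, num_gpus=1)
sharded = forward_sharded(x, num_heads=8, num_gpus=4)
```

Because the heads are only ever summed, the result does not depend on how they are split across devices.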
Because you're determining what kind of computation each token needs on the fly
MoE does this, but it's in a very discrete way, which is why you get things like aux loss for load balancing in most architectures
hmm ok I think I see the outline of the idea generally
so basically you're making FFN worse (but more easily parallelizable) by using heads, but making it better again by hypernetwork somehow
Give or take
Like if I break down PEER (assuming it was real when the paper was produced)
I'd like you to run another experiment where you just divide the FFN into heads
I can do that
Yea, agreed
because right now we don't know if you're just exactly matching that
due to that being a part of your change
The router was basically doing two things:
- Determining what kind of compute was needed for a given token. This was encoded as the expert query
- Determining exactly how to do that compute, neuron by neuron. That's the coordinate used for retrieval + scaling each neuron
So here I'm just using that query to generate a neuron, and then using the coordinates to condition generation of each neuron, and the coordinate score to scale them
Yea, 100%
also, if you are doing BETTER than head-ffn, that's interesting, since you're doing head-ffn and then more stuff on top
This, at least if I did the math right, is identical to constructing a rank-k expert instead of k rank-1 experts. And then you get diversity via multiple heads
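For reference, the rank-k claim can be written out as follows. This assumes the usual single-neuron expert form, where expert i is one up-projection vector u_i, an elementwise activation σ, and one down-projection vector v_i, weighted by a router score g_i. The symbols are chosen here for illustration and are not taken from the thread.

```latex
f(x) \;=\; \sum_{i=1}^{k} g_i \,\sigma\!\left(u_i^{\top} x\right) v_i
     \;=\; V \,\operatorname{diag}(g)\, \sigma\!\left(U^{\top} x\right),
\qquad U = [u_1 \cdots u_k],\; V = [v_1 \cdots v_k] \in \mathbb{R}^{d \times k}
```

Under that assumption, summing k rank-1 experts is the same as applying one FFN with hidden width k, whose output map V diag(g) has rank at most k.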
Would you be willing to review the multi head FFN approach just to make sure it stands up to what you're thinking? Should be able to have a version by tomorrow. Would just be a single code block
I probably dont have time (actually I gotta go do some stuff right now) but btw I also don't really know exactly what headedness you're describing in the first place 🤣
In this case, just a PEER head
PEER defines a head as a router plus the k experts that router selects
So you have H routers per FFN
And no worries. I'll just give it a shot. I need to do some doc review on if there have been parallel FFNs without routing before
Is the idea just changing
matmul(act(matmul(x, A)), B)
into
sum( bmm( act( bmm(x,As) ), Bs ) )
?
Like a bunch of smaller ffns summed?
Yep exactly
Because that would be the exact dense baseline without any complexity from what I'm doing
So basically instead of the 768 -> 3072 expansion in GPT-2 small per FFN, you'd have 8 768 -> 384 bottlenecks that are summed after
Looks like closest might be GroupBERT?
but the sum of a bunch of ffns is mathematically equivalent to a single wide ffn
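A small check of that equivalence in plain Python, with tiny made-up dimensions and ReLU as the activation (any elementwise activation would work the same way). Stacking the per-head up-projections by rows and concatenating the down-projections by columns gives one wide FFN with the same output.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    """W is a list of rows; returns W @ x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def ffn(x, A, B):
    """One-hidden-layer FFN: B @ relu(A @ x)."""
    return matvec(B, relu(matvec(A, x)))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

dim, heads, bottleneck = 4, 3, 2
As = [rand_matrix(bottleneck, dim) for _ in range(heads)]  # per-head up-projections
Bs = [rand_matrix(dim, bottleneck) for _ in range(heads)]  # per-head down-projections
x = [rng.uniform(-1, 1) for _ in range(dim)]

# Sum of the small FFNs.
summed = [sum(vals) for vals in zip(*(ffn(x, A, B) for A, B in zip(As, Bs)))]

# One wide FFN: stack the As by rows, concatenate the Bs by columns (head-major).
A_wide = [row for A in As for row in A]
B_wide = [[v for B in Bs for v in B[i]] for i in range(dim)]
wide = ffn(x, A_wide, B_wide)
```

This shows the two forms compute the same function for the same weights. Trained results can still differ slightly, since initialization and seeds may not line up between the two layouts.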
so.. that means I don't understand the 'benefit' youre obtaining vs doing that for a FFN across gpus
I thought you said yours was more efficient bc of lack of tensor parallelism or something
Fewer scatter gathers because you don't need to do that for each layer. Benefit is limited in GPT-2 style FFNs, but the second you add an additional hidden layer, benefit emerges
But the benefit still exists because it's a single scatter gather instead of 2. Would be the same with the multiheaded FFN approach
you're good
one last thought.. if the main benefit is this kind of practical speedup, maybe you should write up a clear explanation of what situations that is expected to occur
and what subset of the invention is required for that speedup
Makes sense
I don't know if I've explored the architecture enough to know that its benefit is only a practical speedup on multi-node, but I can definitely try to isolate that aspect
Just going to switch to modded-nanogpt if the baseline needs to change again
Yeah, modded-nanoGPT upside is faster experiments, downside is it will likely be very very hard to beat the baseline
Since it's a very highly tuned run
Yea, I mean, I'm just swapping out the FFN, so if I get compute budget back, I'm fine with that
```python
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

# CastedLinear is assumed to be the linear layer defined in modded-nanogpt's train_gpt.py.
class MultiHeadFFN(nn.Module):  # class name assumed; the header line was lost in the paste
    """Multi-headed FFN: splits into parallel bottlenecks and sums outputs."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        total_intermediate = 4 * dim
        assert total_intermediate % num_heads == 0
        self.bottleneck_dim = total_intermediate // num_heads
        # Create parallel FFN heads
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                'c_fc': CastedLinear(dim, self.bottleneck_dim),
                'c_proj': CastedLinear(self.bottleneck_dim, dim),
            })
            for _ in range(num_heads)
        ])
        # Zero init projections
        for head in self.heads:
            head['c_proj'].weight.detach().zero_()

    def forward(self, x: Tensor):
        outputs = []
        for head in self.heads:
            h = head['c_fc'](x)
            h = F.relu(h).square()  # ReLU^2 activation
            h = head['c_proj'](h)
            outputs.append(h)
        return sum(outputs)
```
Simple implementation, will let y'all know how it does on modded nano
Not sure if there's something vastly different from running on a single GPU. Only disabled a world_size == 8 assertion and changed align_to_bos=True to false in the data loader. Everything else is just cloned directly from main.
Base (for reproduction on single GH200):
```
step:0/1750 val_loss:10.8258 train_time:0ms step_avg:0.01ms
step:125/1750 val_loss:5.5574 train_time:14117ms step_avg:112.93ms
step:250/1750 val_loss:4.9899 train_time:28250ms step_avg:113.00ms
step:375/1750 val_loss:4.7004 train_time:42454ms step_avg:113.21ms
step:500/1750 val_loss:4.5114 train_time:56892ms step_avg:113.78ms
step:625/1750 val_loss:4.3930 train_time:71403ms step_avg:114.25ms
step:750/1750 val_loss:4.3105 train_time:86070ms step_avg:114.76ms
step:875/1750 val_loss:4.2550 train_time:100809ms step_avg:115.21ms
step:1000/1750 val_loss:4.1766 train_time:115700ms step_avg:115.70ms
step:1125/1750 val_loss:4.1036 train_time:130682ms step_avg:116.16ms
step:1250/1750 val_loss:4.0288 train_time:145701ms step_avg:116.56ms
step:1375/1750 val_loss:3.9647 train_time:160736ms step_avg:116.90ms
step:1500/1750 val_loss:3.9090 train_time:175922ms step_avg:117.28ms
step:1625/1750 val_loss:3.8606 train_time:191158ms step_avg:117.64ms
step:1750/1750 val_loss:3.8198 train_time:206425ms step_avg:117.96ms
```
Multiheaded FFN:
```
step:125/1750 val_loss:5.5590 train_time:39192ms step_avg:313.54ms
step:250/1750 val_loss:5.0048 train_time:78624ms step_avg:314.49ms
step:375/1750 val_loss:4.6950 train_time:118188ms step_avg:315.17ms
step:500/1750 val_loss:4.5055 train_time:158052ms step_avg:316.10ms
step:625/1750 val_loss:4.3863 train_time:197880ms step_avg:316.61ms
step:750/1750 val_loss:4.3022 train_time:237781ms step_avg:317.04ms
step:875/1750 val_loss:4.2447 train_time:277795ms step_avg:317.48ms
step:1000/1750 val_loss:4.1717 train_time:317970ms step_avg:317.97ms
step:1125/1750 val_loss:4.0932 train_time:358107ms step_avg:318.32ms
step:1250/1750 val_loss:4.0169 train_time:398400ms step_avg:318.72ms
step:1375/1750 val_loss:3.9494 train_time:438691ms step_avg:319.05ms
step:1500/1750 val_loss:3.8911 train_time:478944ms step_avg:319.30ms
step:1625/1750 val_loss:3.8419 train_time:519242ms step_avg:319.53ms
step:1750/1750 val_loss:3.8008 train_time:559605ms step_avg:319.77ms
```
Testing my pytorch version next once I get it ported over.
Baseline was the first one
I'd recommend using kosarsky's January record, that is much easier to convert to 1 GPU
It is an incorrect baseline, something is wildly off in the loss
Should have 3.28 very consistently (+/- a bit)
I mean, those are the only changes I made, but yea I'll go try an older version.
It always should converge to ~3.28
Which one is this?
I can't remember the name off the top of my head, you can look in the records folder for it, I believe it's the tanh scaling upgrade iirc
Okay, yea. That's more normal then. Baseline, no changes needed from that version:
```
step:125/1390 val_loss:4.3667 train_time:119635ms step_avg:1040.31ms
step:250/1390 val_loss:3.9498 train_time:254696ms step_avg:1061.23ms
step:375/1390 val_loss:3.7707 train_time:392056ms step_avg:1074.13ms
step:500/1390 val_loss:3.6554 train_time:531675ms step_avg:1085.05ms
step:625/1390 val_loss:3.5748 train_time:671656ms step_avg:1092.12ms
step:750/1390 val_loss:3.5223 train_time:811987ms step_avg:1097.28ms
step:875/1390 val_loss:3.4717 train_time:954645ms step_avg:1103.64ms
step:1000/1390 val_loss:3.4056 train_time:1100304ms step_avg:1111.42ms
step:1125/1390 val_loss:3.3539 train_time:1246828ms step_avg:1118.23ms
step:1250/1390 val_loss:3.3070 train_time:1393206ms step_avg:1123.55ms
step:1375/1390 val_loss:3.2783 train_time:1538654ms step_avg:1127.22ms
step:1390/1390 val_loss:3.2775 train_time:1556107ms step_avg:1127.61ms
```
Take 2 on older harness that's single GPU friendly for multiheaded FFN:
```
step:125/1390 val_loss:4.3925 train_time:155044ms step_avg:1348.21ms
step:250/1390 val_loss:3.9615 train_time:328091ms step_avg:1367.05ms
step:375/1390 val_loss:3.7857 train_time:504000ms step_avg:1380.82ms
step:500/1390 val_loss:3.6669 train_time:680342ms step_avg:1388.45ms
step:625/1390 val_loss:3.5861 train_time:857694ms step_avg:1394.62ms
step:750/1390 val_loss:3.5308 train_time:1038387ms step_avg:1403.23ms
step:875/1390 val_loss:3.4804 train_time:1220604ms step_avg:1411.10ms
step:1000/1390 val_loss:3.4140 train_time:1402886ms step_avg:1417.06ms
step:1125/1390 val_loss:3.3613 train_time:1584735ms step_avg:1421.29ms
step:1250/1390 val_loss:3.3141 train_time:1767800ms step_avg:1425.64ms
step:1375/1390 val_loss:3.2852 train_time:1952838ms step_avg:1430.65ms
step:1390/1390 val_loss:3.2844 train_time:1975155ms step_avg:1431.27ms
```
Little worse
Oh after understanding what you meant by multi headed I didn't think you still needed to compare since it's mathematically identical
(tho I still don't really get the justification for why it's faster than normal FFNs since it's identical in that way)
Well it's done anyways, and yea it's pretty much identical just slower
The speedup when on >1 GPU would be that if you have two hidden layers in those, you would be able to eliminate a scatter gather from that second layer
but with a single hidden layer it's almost strictly worse
But when you have so many ops that are pipelined in my arch, it's more pronounced
Alright, I have my architecture patched into modded-nanogpt
We'll see how perf is
Also trying muon on my layers but might swap them back to Adam if there are major issues. I'm doing all manual gradient handling at this point, so it should be pretty agnostic
Alright, at a bit over 60k tokens per second, so double baseline, with the modded nanogpt repo and some kernel tuning. Still a lot slower than the 470k tokens per second that modded nanogpt can get on the same config, but I'm okay not beating a literal speedrunning config
Will run ablations from this. Should be able to get a decent chunk of configs out of this
util is still pretty low, so someone smarter than me could probably get further faster than I can at this point
I'm just about here, just going to try one last thing and then just open source the negative case. Got it to get within like .05 nats of the modded gpt baseline, but still like 7x the iteration time. I know there's a lot of room for speed based on nsight, but haven't been able to capture any of it.
Last hurrah is going to be trying to toss an expert choice router in front of each of the PEER routers. Would let each head specialize better, would reduce amount of overhead I'm eating from each head, and shouldn't require kernel rewrites the way I have them now, would just look like smaller batch sizes from that perspective.
if/when that fails I think I'll just give this up
I have the approach factorized to a point where it is lower FLOPs than a dense baseline, though, just less efficient ops currently.
sorry 🙁 thats interesting that its lower flops and almost equivalent performance but less efficient
It feels like there's something here, and to some degree this shouldn't work at all, let alone nearly as well.
But yea, I just might not be the person to make it work if it does have merit
But yea, last hurrah is going to see if they specialize better if I go
Head Choice Router (ripped from expert choice, will reduce tokens each head needs to process) -> PEER Router -> Hypernetwork -> process token with generated experts
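A minimal sketch of what an expert-choice-style "head choice" step could look like, in plain Python. This is an assumption about the general technique, not the implementation discussed here: each head picks its own top-`capacity` tokens by affinity score, so every head processes the same number of tokens. The scores and the function name are hypothetical.

```python
def head_choice_route(scores, capacity):
    """scores[t][h] is the affinity of token t for head h.

    Each head keeps its top-`capacity` tokens, so load is balanced by
    construction. A token may be picked by several heads or by none.
    """
    num_tokens, num_heads = len(scores), len(scores[0])
    assignment = {}
    for h in range(num_heads):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][h], reverse=True)
        assignment[h] = sorted(ranked[:capacity])
    return assignment

# Hypothetical affinities for 4 tokens and 2 heads.
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.4, 0.6],
]
routes = head_choice_route(scores, capacity=2)
```

From the downstream router's point of view, each head then just sees a smaller batch of `capacity` tokens.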
learned how to write reasonably efficient backwards kernels at least
forward is just a huge bottleneck rn
to me, the biggest open question is still PEER and what situations, if any, PEER is good in
without knowing the answer to that it's hard to know anything much about these kinds of 'dynamically constructing FFNs' methods
it's fine if your method requires more parameters if those parameters are used much more sparsely
but you need some kind of specific situations in which your method would lead to dramatically reduced compute (like not just a constant multiplier on FLOPs)
that was sort of the promise of PEER
ultimately, we can't tell you what the proposed benefits of the methods you're pursuing are... you have to be the one to know and communicate that
and if you can find a clear benefit of this sort it's probably worth pursuing, and if not then yeah it might not be
So far, my instinct is no with at least the currently available reproduction code and my attempts, but I haven't explored the space entirely.
And yea, this architecture is the opposite. Parameters are reused, so it's more parameter efficient but not necessarily more FLOP efficient. That said, I've landed on a better factorization that's actually slightly lower in FLOPs than a dense FFN.
I think there's a lot of room for twiddling on the MLP generating the weights, seems like there's a lot you could do there
Yea, just honestly getting burnt out
Doing this on limited budget without institutional support is fun when it's fun, but lately it's just been exhausting
hobbies are supposed to just mostly be fun lol
And it's hard to sift through "this advice is useful when you have a huge training budget" vs "this is useful at all scales"
Been good learning, just not really fun rn
yep, that's a good time to know when to not push it as much
Yea, I think I'd probably like this more as a job tbh, I just don't think I have the resources to really do this just for the fun of it, which sucks a lot of the fun out of it
Okay, actually it looks like a good chunk of the gap can be made up by tuning LR just for PEER layers
Lowered to 3.35 by making coarse grained updates, can likely eke out a decent amount more by separating LR between attention and components of the FFN
(was at 3.40~)
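For anyone wanting to try the same thing, here is a minimal sketch of splitting parameters into learning-rate groups by name. The parameter names, tags, and scale factors are placeholders, not the values used in the run above. The output is a list of dicts with `"params"` and `"lr"` keys, which mirrors the param-group shape PyTorch optimizers accept (with tensors in place of names).

```python
def build_lr_groups(param_names, base_lr, lr_scale_by_tag):
    """Group parameter names by learning rate.
    A parameter gets the scale of the first tag found in its name, else 1.0."""
    groups = {}
    for name in param_names:
        scale = next((s for tag, s in lr_scale_by_tag.items() if tag in name), 1.0)
        groups.setdefault(base_lr * scale, []).append(name)
    return [{"lr": lr, "params": names} for lr, names in groups.items()]

# Hypothetical parameter names and scales, for illustration only.
names = [
    "blocks.0.attn.qkv",
    "blocks.0.ffn.router.query",
    "blocks.0.ffn.hyper.w1",
    "blocks.1.ffn.router.query",
]
groups = build_lr_groups(names, base_lr=1e-3,
                         lr_scale_by_tag={"router": 0.5, "hyper": 2.0})
for g in groups:
    print(g["lr"], g["params"])
```

Sweeping the per-tag scales separately is a lot cheaper than a full grid over every hyperparameter, which matters on a small budget.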
Peer or ethos?
Cool
whatever we're calling the pure hypernetwork version
it's still peer in my codebase
o ok, was just not sure if u meant actual PEER
Yea, no, the one I've been working on. I'm not going to spend any more compute trying to get a PEER baseline perfect
if GDM wants a repro study they can give me the compute for it
Honestly starting to think that this might do better as a blog post on "how to use hypernetworks to make a worse transformer"
Just getting caught up here. 😊
I’ve been messing with different decompositions on attention and the FFN, and this past week was trying to see if I could find a configuration that would actually achieve a lower perplexity than my dense baseline at the same parameter count.
I found a Google paper that discussed how best to allocate additional parameters, and it recommended adding more layers rather than making them bigger.
I tried applying that: any parameter savings I was getting from the decompositions, I’d put into additional layers when possible.
An approach that finally got lower perplexity than the dense baseline was to interweave dense and low-rank layers.
I went from 6 dense layers (“dddddd”) to ten layers with a pattern of: “dssdssdssd” where ‘s’ is for sparse / low rank.
The low rank layers are narrow—they have ~1/3 as many heads and ~1/3 as many neurons as the dense layers.
My theory is that, since the low rank layers can’t access the full model space, it makes sense to make them more narrow / less expressive.
And also that it benefits from periodically having unrestricted read and write access to the full model space.
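To make the banding pattern concrete, here is a small sketch that expands a pattern string like "dssdssdssd" into per-layer configs, with the 's' layers getting roughly a third of the heads and neurons. The dense sizes (12 heads, 3072 FFN neurons) are assumed placeholders since the thread doesn't give the actual model dims, and `LayerCfg` / `build_layers` are made-up names.

```python
from dataclasses import dataclass

@dataclass
class LayerCfg:
    kind: str      # "dense" or "low_rank"
    n_heads: int
    d_ff: int

def build_layers(pattern, dense_heads, dense_ff, narrow_factor=3):
    """Expand a pattern of 'd' (dense) and 's' (sparse / low rank) into layer configs."""
    layers = []
    for ch in pattern:
        if ch == "d":
            layers.append(LayerCfg("dense", dense_heads, dense_ff))
        elif ch == "s":
            layers.append(LayerCfg("low_rank",
                                   max(1, dense_heads // narrow_factor),
                                   dense_ff // narrow_factor))
        else:
            raise ValueError(f"unknown layer code: {ch!r}")
    return layers

layers = build_layers("dssdssdssd", dense_heads=12, dense_ff=3072)
for i, cfg in enumerate(layers):
    print(i, cfg)
```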
For Ethos and the variants, there is a lot of low rank-ness going on here with the PK Router and the hyper networks. I wonder if this architecture might benefit from a similar “banding” pattern?
I think the same insight may apply to regular PEER as well. The PK router is a low-rank approximation of a standard MoE router. It can only see a subspace defined by the query projection.
I suppose it makes up for that, though, by having multiple heads (multiple low rank routers), similar to attention. So maybe nevermind.
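Here is a minimal single-head sketch of product-key retrieval, to show where the low-rank structure comes from. The score for expert (i, j) is just the sum of two sub-key scores, so the whole n x n score table is built from two length-n vectors. Shapes, names, and sizes are illustrative assumptions, not taken from any particular PEER implementation.

```python
import numpy as np

def pk_topk(query, keys_a, keys_b, k):
    """query: (d,), keys_a / keys_b: (n, d // 2).
    Returns (expert_ids, scores) for the top-k of the n * n product keys."""
    half = query.shape[0] // 2
    sa = keys_a @ query[:half]                   # (n,) scores for the first half
    sb = keys_b @ query[half:]                   # (n,) scores for the second half
    ia = np.argsort(-sa)[:k]                     # top-k sub-keys on each side
    ib = np.argsort(-sb)[:k]
    cand = sa[ia][:, None] + sb[ib][None, :]     # (k, k) candidate scores
    flat = np.argsort(-cand, axis=None)[:k]      # best k of the k * k candidates
    ra, rb = np.unravel_index(flat, cand.shape)
    n = keys_b.shape[0]
    return ia[ra] * n + ib[rb], cand[ra, rb]

rng = np.random.default_rng(0)
n, d, k = 32, 16, 4                              # 32 * 32 = 1024 experts
query = rng.standard_normal(d)
keys_a = rng.standard_normal((n, d // 2))
keys_b = rng.standard_normal((n, d // 2))
ids, scores = pk_topk(query, keys_a, keys_b, k)
```

The result matches a brute-force top-k over all n * n summed scores, but only 2n dot products plus k * k additions are needed. The price is that a single head can only rank experts through that additive two-part score.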
What I've always liked about PEER is the hope that it could learn the same kind of precise features that they find using SAEs in interpretability research.
I wonder if it could work to take a pre-trained model, do the SAE analysis, and then initialize a PEER layer from that.
Applied to GPT-2, an SAE setup might look like the below.
First, compute the activations for the FFN input neurons:
a = gelu(x @ W_in) # with W_in.shape = (768, 3072)
Then the SAE has an encoder matrix E with, e.g., shape (3072, 1M), and a decoder matrix D with shape (1M, 3072).
So the output is something like:
y = gelu(a @ E) @ D @ W_out
The features they find are the rows of D. You can turn those into expert output neurons by multiplying with W_out. So for PEER expert neuron 'j':
w_vj = D[j, :] @ W_out
For the routing, the simplest but expensive approach would be to say that the input to the PEER layer is the activation vector a, which is length 3072.
So then you need to set up the PK router to approximate the operation a @ E to find those top features.
Maybe a better way would be to freeze all of those ideal output weights, and then use distillation to train the PK-router and the neuron inputs.
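A toy-sized sketch of the SAE-to-PEER idea above, following the same formulas but with tiny random matrices standing in for GPT-2 and a trained SAE. It only checks the algebra: folding D into W_out gives one output vector per feature, and keeping just the top-k features gives a sparse, PEER-like output. All sizes and the top-k selection rule are assumptions.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d_model, d_ff, n_feat, k = 8, 32, 64, 4          # toy stand-ins for 768, 3072, 1M
W_in = rng.standard_normal((d_model, d_ff)) * 0.1
W_out = rng.standard_normal((d_ff, d_model)) * 0.1
E = rng.standard_normal((d_ff, n_feat)) * 0.1    # SAE encoder (random here)
D = rng.standard_normal((n_feat, d_ff)) * 0.1    # SAE decoder (random here)

x = rng.standard_normal(d_model)
a = gelu(x @ W_in)                               # FFN hidden activations
f = gelu(a @ E)                                  # feature activations
y = f @ D @ W_out                                # full reconstruction path

W_v = D @ W_out                                  # row j is w_vj = D[j, :] @ W_out
top = np.argsort(-np.abs(f))[:k]                 # the k most active features
y_sparse = f[top] @ W_v[top]                     # what a k-expert layer would emit
```

The part this skips is the hard part: getting a PK router to approximate `a @ E` well enough to find `top` without computing every feature.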
Yea I thought the same thing, but I’m not sure you can hit ground truth on “ideal output weights”
You could hit a version of it, but PEER reproductions themselves don’t seem to be great from the versions I’ve tried
What I think could be more interesting is token + scaled output pairs on a trained network as a training set
So you attempt to learn the full path
Would y’all have any thoughts on how to best test OOD data? I have half a hypothesis that because this is defining what compute you need on the fly, it might do better there.
Also, am I potentially thinking of this wrong? The FFN is learning a materially more difficult task, so should I be expecting that it's as sample efficient as its dense counterpart?
It's learning given this input, what weights should I generate that would be useful and then immediately consuming those weights, which is significantly more difficult than the training task for the dense network
Going to swap to gpt2-medium baseline just because I’m wasting a lot of cycles making a non power of 2 model dim work. Everything is a lot easier when model dim is 1024.
```
step:125/1390 val_loss:4.3229 train_time:217484ms step_avg:1891.16ms
step:250/1390 val_loss:3.8955 train_time:460616ms step_avg:1919.23ms
step:375/1390 val_loss:3.7061 train_time:704883ms step_avg:1931.19ms
step:500/1390 val_loss:3.5879 train_time:953667ms step_avg:1946.26ms
step:625/1390 val_loss:3.5013 train_time:1206982ms step_avg:1962.57ms
step:750/1390 val_loss:3.4425 train_time:1459531ms step_avg:1972.34ms
step:875/1390 val_loss:3.3900 train_time:1716268ms step_avg:1984.12ms
step:1000/1390 val_loss:3.3211 train_time:1976056ms step_avg:1996.02ms
step:1125/1390 val_loss:3.2642 train_time:2233886ms step_avg:2003.49ms
step:1250/1390 val_loss:3.2146 train_time:2497299ms step_avg:2013.95ms
step:1375/1390 val_loss:3.1837 train_time:2760458ms step_avg:2022.31ms
step:1390/1390 val_loss:3.1829 train_time:2791743ms step_avg:2023.00ms
```
Baseline gpt2-medium
@spare flame going back through original ETHOS, I found a bug that might actually be a minor discovery. Any shot you'd have 10 minutes to verify? Should be apparent in the code.
Basically, the hypernetwork and expert weights never actually received gradients. We hit reasonable PPL without training those components at all. Just random init.
They didn't receive grads?? Not sure how but sounds like a good find! I'm actually travelling until next week but I can take a peek then
There's literally no backward function. I had an autograd in there at some point that got deleted. Can even show original notebook we trained with with original output to show it's just not there
Lol!
But we were using the forward triton kernel
If this is true, I think I can just share latents across layers and never train them? Would just need to fix the hypernetwork to receive gradients because that's likely suboptimal
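A framework-agnostic way to catch this kind of bug is to snapshot the parameters, run one optimizer step, and list anything that did not move. The sketch below fakes the missing backward by simply leaving some parameters out of the gradient dict; the parameter names and the plain SGD step are stand-ins, not the ETHOS training loop. In PyTorch the rough equivalent is checking which parameters still have `p.grad is None` after `backward()`.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """Plain SGD. Parameters with no gradient entry are left untouched,
    which mimics a custom forward kernel that has no backward."""
    return {name: (p - lr * grads[name]) if name in grads else p.copy()
            for name, p in params.items()}

def unchanged_params(before, after):
    """Names of parameters that are bit-identical before and after a step."""
    return [name for name in before if np.array_equal(before[name], after[name])]

# Hypothetical parameter names, for illustration only.
params = {
    "router.query": np.ones((4, 4)),
    "hypernet.w1": np.ones((4, 4)),
    "expert.latents": np.ones((8, 4)),
}
grads = {"router.query": np.full((4, 4), 0.5)}   # only the router gets a gradient

after = sgd_step(params, grads)
stale = unchanged_params(params, after)
print("never updated:", stale)
```

Running a check like this once at the start of every new run is cheap insurance when hand-written kernels are involved.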
Heh what part do you need me to look at? Shouldn't you just make sure everything has a backward?
I don't really want to look at code to debug it
Please check your baseline numbers against the raw reported numbers, this is the third time!
Just that there is not in fact a backwards
I did, was using https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_2_medium/2025-01-18/241dd7a7-3d76-4dce-85a4-7df60387f32a.txt and this technically beats it
Just used the same shorter run hyperparams from the gpt-2 small version
Yea, it doesn't. Just validating absence, but it's no huge deal
If you'd prefer I can hit the 2.92 target for the medium, just wanted to have a straightforward baseline with a 1024 model dim
Yeah, I'm not entirely sure what that's getting you unless it's something specific to kernel writing.
But dynamics change when you change the numbers, baseline typically means "I didn't change anything and ran it as previously", changing hyperparams changes the result.
You can make a new baseline, but it's risky and takes longer because it loses the guarantees of the pre-existing baseline
(I probably won't comment too much more on ETHOS but I may come in occasionally to help)
I'm looking to have a baseline architecture to compare my architecture to, so as long as I'm using the architecture provided in the repo, and then also holding everything else constant outside of my own independent variable, is that not scientifically valid?
E.g. I trained a GPT-2 Medium class model as my dense baseline on X tokens, and then also trained my IsoParam|IsoFLOP|IsoWallClock model on X tokens, is that not a valid comparison?
Goal isn't to get on leaderboards of modded-nanogpt, just have a scientifically valid baseline to compare to.
Especially when they're using identical hyperparams outside of my exact FFN change?
Maybe the part that wasn't clear is that baseline was run exactly from the repo just with the training duration changed?
Actually maybe I caught the misunderstanding. This is not with any changes, just with the exact code from the above run I linked, with only num_iterations=1390. Maybe this came across as with changes to the FFN?
Otherwise not sure what the
"Yeah, I'm not entirely sure what that's getting you unless it's something specific to kernel writing."
part is about
Was only offering to run it to 2.92 because that's what the speedrun requires, but this is just a short run baseline in order to keep sweeps reasonable cost.
fwiw going to move forward with the shortened GPT-2 Medium baseline for ablations and hyperparam sweeps. I think that simply training for fewer tokens will just have to be a limitation for that portion, and I'm fine with that.
Went through and just updated the most recent GPT-2 Medium speedrun to be friendly to single GPU and ran a full run. Final delta was +0.000934 (2.920684 vs 2.919750). Going to call that similar enough, use that as the harness and baseline, and just use early exit to disqualify candidates if they aren't training well.
My changes and full run: https://gist.github.com/wrmedford/14893b6a4477b6d2ef114a3406d5aa87
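One way the early-exit rule could work, as a sketch: parse the baseline's validation losses out of its log and kill a candidate if it trails the baseline by more than some margin at a shared checkpoint. The log lines are a few rows of the short-run baseline posted earlier in the thread; the 0.10 nat tolerance and the function names are made up, and the real sweep harness may do something different.

```python
import re

# A few lines from the short-run GPT-2 Medium baseline log posted earlier.
BASELINE_LOG = """\
step:250/1390 val_loss:3.8955 train_time:460616ms step_avg:1919.23ms
step:500/1390 val_loss:3.5879 train_time:953667ms step_avg:1946.26ms
step:750/1390 val_loss:3.4425 train_time:1459531ms step_avg:1972.34ms
step:1000/1390 val_loss:3.3211 train_time:1976056ms step_avg:1996.02ms
"""

LINE_RE = re.compile(r"step:(\d+)/\d+ val_loss:([\d.]+)")

def parse_val_losses(log):
    """Map step -> validation loss for every matching log line."""
    return {int(m.group(1)): float(m.group(2)) for m in LINE_RE.finditer(log)}

def should_exit_early(step, val_loss, baseline, tolerance=0.10):
    """True if the candidate is worse than the baseline by more than
    `tolerance` nats at a step where the baseline was also evaluated."""
    ref = baseline.get(step)
    return ref is not None and val_loss > ref + tolerance

baseline = parse_val_losses(BASELINE_LOG)
print(should_exit_early(500, 3.75, baseline))   # candidate far behind at step 500
print(should_exit_early(500, 3.60, baseline))   # candidate within tolerance
```

A looser tolerance early in training is probably worth considering, since some configs in this thread started out worse and caught up later.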
Now back to the actual research
How do you decide when to stop training? The goal is to keep going till you hit their perplexity mark, right? But you need to know ahead of time how many steps that’s going to take in order to schedule the learning rate 🤔
Best I can tell it's a bit of trial and error on figuring out exactly which iteration to stop on, but I'm just going to be training to this token count to keep with the methodology
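The reason the stopping point has to be fixed up front shows in a schedule like the one sketched here: a constant learning rate followed by a linear decay to zero over the last part of training. This is a generic shape chosen for illustration, with a made-up `cooldown_frac`, and is not claimed to be the exact schedule in the speedrun code. The multiplier at a given step depends on the total step count.

```python
def lr_multiplier(step, num_iterations, cooldown_frac=0.4):
    """Constant LR, then linear decay to zero over the final
    `cooldown_frac` of training."""
    progress = step / num_iterations
    if progress < 1.0 - cooldown_frac:
        return 1.0
    return (1.0 - progress) / cooldown_frac

# Same step, two different planned run lengths.
print(lr_multiplier(1000, num_iterations=1390))  # already in the decay phase
print(lr_multiplier(1000, num_iterations=2000))  # still at full LR
```

So "train until you hit the target" is not a single run: the run length is guessed, the schedule is built around it, and the guess is adjusted if the final loss misses.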
What’s the metric you’re evaluating here for Ethos? Comparing validation loss at the end of training?
I will be yea. I have a sweep harness and a few other things I'll throw in a branch probably tomorrow
It's already on the jupyter instance if you want to check there
Cool!
Do you have an idea of what success looks like here?
Are we hoping for a lower validation loss at the same token count vs. the baseline?
That's the hope, or seeing if we can match it at a lower param/flop count
now that I'm moving back closer to the standard ETHOS arch
Ah, I see now, that run you just shared was the baseline. If it took 3 hours, how long is ethos going to take? 😬
Hopefully not all ablations will take that long and will early exit on bad configs. Once I post the sweep (it’s in ethosv2/grid_search.py on the Jupyter server) it should make more sense. But should be able to just run them back to back for a while and just check results.
And I’m looking at potentially just tooling this for 8xH100 and doing it there if wall clock becomes prohibitive. But I think a couple weeks waiting on results is fine considering budget
Yeah, makes sense
Have you tried swapping in the ethos layer yet?
After I saw that you were using this as the test framework, I’m considering trying it for the attention subspace work, but it’s not immediately obvious to me how tricky it will be to customize or not.
The MLP looks more straightforward for ethos, at least.
Yea it trains, just need to fix the bug I found. I think what I found also implies that since the routers are that expressive, each router needs its own hypernetwork. Lines up with what I saw when I was testing out the hypernetwork only approach as well
Yea MLP swap here is straightforward, but there’s a lot of tricks happening with attention you’d need to disentangle. Would recommend a previous run before flex attention was introduced
Yeah, I think that makes a lot of sense (having a network per head)
Yea, conceptually I think it makes more sense because then you’d never have different heads feeding a shared hypernetwork conflicting gradients
And then each can truly specialize
I’m itching to go read up on their different hacks to attention. Stuff like the smear gate, and skip connections, and value embeddings 😳
Part of me wants to test throwing an expert choice router in front of each PEER router since it should work. Would allow for even better specialization.
And still allows for lossless load balancing
And consistent expert load
Expert choice? Like making an MoE layer where each expert is an ethos layer?
An ethos head, but yea
Expert choice -> PEER router(s) -> constructed expert -> gather token output from experts that selected it
Well, good stuff. It sounds like a really nice, simple test harness here that ought to be easy to interpret.
Yea. Going to try the normal stuff first but the expert choice router just feels like it’ll fit there
@mental plinth , I had half a thought yesterday. Do you think that PEER's router might fit some (albeit looser) definition of an encoder?
Well, it certainly seems close to Attention. It’s like it attends to the expert neurons instead of a sequence of tokens. It applies softmax just like attention, except only to the top-k experts. And since there’s no “causal masking” I suppose you could compare the whole thing to an Encoder Attention block in that regard?
What aspects of an encoder were you thinking about?
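To make the attention comparison concrete, here is a minimal single-head sketch of a PEER-style forward pass with single-neuron experts. It uses a dense router (scoring every expert key directly) rather than product keys, purely to keep it short, and ReLU as a placeholder activation. Names, shapes, and sizes are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def peer_forward(x, query_proj, keys, U, V, k):
    """x: (d,), query_proj: (d, d_q), keys: (n_experts, d_q),
    U, V: (n_experts, d) input and output weights of single-neuron experts."""
    scores = keys @ (x @ query_proj)        # like q . k over experts, not tokens
    top = np.argsort(-scores)[:k]           # keep only the k best experts
    gates = softmax(scores[top])            # softmax over the selected experts only
    hidden = np.maximum(U[top] @ x, 0.0)    # one hidden neuron per expert
    return (gates * hidden) @ V[top], top, gates

rng = np.random.default_rng(0)
d, d_q, n_experts, k = 16, 8, 64, 4
x = rng.standard_normal(d)
query_proj = rng.standard_normal((d, d_q))
keys = rng.standard_normal((n_experts, d_q))
U = rng.standard_normal((n_experts, d))
V = rng.standard_normal((n_experts, d))

out, top, gates = peer_forward(x, query_proj, keys, U, V, k)
```

Read this way, the keys play the role of attention keys, the expert output rows `V[top]` play the role of values, and the hidden activation is an extra per-expert scale that attention does not have.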
Mostly that it’s encoding language at that exact point in time since it’s working off a point in time language with causality baked in, in a way that is consumable by a hypernetwork
It might be stretching the definition a bit, but does definitely feel like the IO matches the shape of what I do for natural language encoding in my job
So I tried to reduce this to just comparing to a standard MLP without the complexities of the surrounding model. Chose a CIFAR10 flattened set (not 1:1 with language, I know) just to try to track relative performance on situations where MLPs can struggle. Gave me better signal (on at least this task) on what configs beat an isoparam dense baseline and which don't. Also gave me some better info on which hyperparams impact performance more than others
In isoparam (where this arch is actually technically fewer FLOPs now after some tweaking) I'm able to consistently match or beat the single layer FFN baseline. Deeper FFNs are consistently beating it, though, so I'm not sure exactly how to compare since MoE arches tend to use a single layer MLP
But it reminded me of what Smerky said earlier, that this approach might just be giving the model more depth, which might just be an inductive bias for the CIFAR10 stuff
The most interesting part is that it performs way worse on early epochs, but starts to pull ahead pretty quick. It does plateau earlier than the dense baseline so I might play with lowering its LR
Okay, so counterintuitive finding: lower LR on the router, higher LR on the hypernetwork is preferred.
Not as counterintuitive findings: router is massively overparameterized in some ways and underparameterized in others. Starting to get some amount of crystallization on where the design space is for this without having to do full on training runs
In testing, multiple heads here actually perform worse than a single router and single hypernetwork. My guess is that if there was an actual positive benefit from multiple heads in the PEER paper (I'm becoming more and more convinced that paper was either outright wrong or required very sensitive hyperparams), it likely was due to having discrete experts that needed to be combined into valid experts.
And that might look different from a hypernetwork that jointly learns what that valid expert is.
Nice! I like the approach. I actually did my first round of experiments using a Vision Transformer (ViT) trained on CIFAR-10. It's nice because you can train that wicked fast.
I like the approach of holding the parameters constant, too. If an architecture is truly "better", I think you'd expect that it's able to score higher given the same parameters, right? I think too many of my early experiments were in the regime of "it performs slightly worse with fewer parameters", and I've started to see that as not a very useful observation.
Maybe putting the ethos / peer layer into a ViT with an established configuration would make for a good comparison, over just a straight MLP that you have to architect yourself?
Also, do you think there might be any merit to testing PEER with a dense router? The PK router makes it efficient to run for a large number of experts, but also introduces a low rank decomposition, and I'm wondering if removing that and isolating just the single-neuron-expert approach could be informative. I'm not really sure--it's an incomplete thought.
Yea I think this is the path I'm going to use for now just to test in isolation. Technically not 1:1 for LLMs but should be enough to at least inform behavior
Going to use this repo since it has baselines and previous training runs https://github.com/kentaroy47/vision-transformers-cifar10
Going to set it up to run ETHOS and the baseline side by side with an isoparam constraint on the MLP, which should get pretty well isolated behavior curves at least
Ignore the mha misnomer since it transferred over from my toy example, is in fact using ViT. Up to date code, running the included sweep now. Includes FiLM which is an improvement on PEER to get an actual rank-k MLP to generate without needing to blow up the hypernetwork's output layer (in this arch it's now rank-m, with k being the depth of modulation).
Without doing an arch sweep (next phase once I have a decent idea of what hparams I should use), best baseline is getting 84.37, vs 83.27 for ETHOS. So consistently behind, but I think a decent chunk of that gap will be made up by better balancing of parameters once I do an arch sweep. Relatively certain current setup is overparameterized in some places, and under in others.
https://gist.github.com/wrmedford/ef452a86bae0c7dd1201b5e4e265729a
Normal hedges of non-pretrained ViT on limited dataset, etc. etc. just seemed like the best way to isolate purely the MLP aspects of this without using a super contrived method
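For readers who haven't met FiLM (feature-wise linear modulation), here is a generic sketch of the idea: a small hypernetwork emits a per-feature scale and shift for a shared MLP instead of emitting full weight matrices. This is only the textbook form under assumed names and sizes. The linked gist is the reference for how ETHOS actually applies it, which may differ.

```python
import numpy as np

def hypernet(z, G, B):
    """Map a routed latent z to FiLM parameters.
    gamma is centered at 1 so that z = 0 leaves the shared MLP unchanged."""
    return 1.0 + z @ G, z @ B

def film_mlp(x, W1, W2, gamma, beta):
    """Shared two-layer MLP whose hidden features are scaled and shifted."""
    h = np.maximum(x @ W1, 0.0)
    return (gamma * h + beta) @ W2

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, d_z = 8, 16, 8, 4         # hypothetical sizes
W1 = rng.standard_normal((d_in, d_hidden))
W2 = rng.standard_normal((d_hidden, d_out))
G = rng.standard_normal((d_z, d_hidden)) * 0.1
B = rng.standard_normal((d_z, d_hidden)) * 0.1

x = rng.standard_normal(d_in)
z = rng.standard_normal(d_z)                     # stand-in for the router's output
gamma, beta = hypernet(z, G, B)
y = film_mlp(x, W1, W2, gamma, beta)
```

The hypernetwork here outputs 2 * d_hidden numbers per token rather than a whole d_in x d_hidden matrix, which is the "without blowing up the output layer" part.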
Small update: ViT stuff is going reasonably well. Just still doing a lot of arch/hparam searches. Narrowing, but not exactly solidified yet
Alright, going to be setting this down for a bit. In a pretty isolated MLP vs ETHOS bakeoff, error bars overlap after tuning both, but it's not consistently beating a dense baseline. Maybe/Probably has better performance in transfer learning, but no clue yet. Going to set it down for a while.
I'm going to be popping over to just do some more standard kaggle competitions + maybe try my hand at the gpu mode kernel writing stuff especially since gluon is a thing now if anyone wants to join
i did try a 256² experts version of a PEER implementation and got a gibberish generator. Was the original model able to produce meaningful text?
did anyone try a better dataset than C4 and wikitext, like FineWeb-Edu?
It wasn't a gibberish generator at any scale I tested with, just not as strong as expected performance. Router code can be bug prone, so I would make sure you implemented PKM correctly
btw @whole goblet didnt know if you've seen this https://arxiv.org/abs/2508.18756v1
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE mode...
someone seems to have gotten PEER to be useful in at least some context
Nah I've stepped away from research in general outside of work.
I'll give it a read though
But yea just been spending my time on some trading strats, been doing well on that so far, but not really anything you publish or do outside of your own money
o ok! just thought u might be interested since you were investigating PEER so heavily
i did test this one https://huggingface.co/ThomasTheMaker/PEER-v1
when i say gibberish, not exactly gibberish, otherwise it wouldn't be PPL 7
it's just like repetitive patterns or sequences that don't make much sense to humans, but if you analyze the output, it carries meaning
to be fair, I wouldn't expect anything different from any model trained on wikitext-103-raw
I can't pretrain or finetune this scale on my hardware, would be awesome to see how PEER behaves in high quality educational datasets, such as FineWeb-Edu
I found the GH200s from Lambda to be the sweet spot for this kind of research. The architecture of those machines is also good if you want to test out multi-tiered expert retrieval
If PEER/Grajewski pan out, I (sloppily) hypothesized that this kind of architecture would be the ideal one. https://github.com/wrmedford/moe-scaling
Keep the non-expert transformer elements of the model in HBM, expert portions in system RAM/Storage, and scale model to infinity. Has some other issues that I tried to get around by using hypernetworks to compress expert knowledge (https://gist.github.com/wrmedford/ef452a86bae0c7dd1201b5e4e265729a)
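A toy sketch of the tiering idea, assuming a simple LRU policy: a small fast tier (standing in for HBM) caches experts pulled from a large slow tier (standing in for system RAM or storage). The class name, the LRU choice, and the dict-based tiers are all illustrative assumptions, and the linked repo may handle placement quite differently.

```python
from collections import OrderedDict

class TieredExpertStore:
    """LRU cache of experts in a small fast tier, backed by a large slow tier."""

    def __init__(self, slow_tier, fast_capacity):
        self.slow = slow_tier          # stands in for system RAM / storage
        self.fast = OrderedDict()      # stands in for HBM
        self.capacity = fast_capacity  # must be at least 1
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.fast:
            self.fast.move_to_end(expert_id)       # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                       # would be a slow transfer
            self.fast[expert_id] = self.slow[expert_id]
            if len(self.fast) > self.capacity:
                self.fast.popitem(last=False)      # evict least recently used
        return self.fast[expert_id]

slow = {i: f"weights_{i}" for i in range(1000)}    # placeholder expert weights
store = TieredExpertStore(slow, fast_capacity=2)
for expert_id in [0, 1, 0, 2, 1]:
    store.get(expert_id)
print(store.hits, store.misses, list(store.fast))
```

How well this works in practice hinges on the hit rate, which depends on how skewed the router's expert usage is.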
Happy to go over what I learned, but I'm pretty sure that PEER is just a bitter lesson wrapped in some solid high level logic at this point. I think that if there is a path towards breaking past both memory bandwidth walls and increasing effective model size on this architecture, it'll have to be done through a hypernetwork. With that said, I'm not sure that a standard backprop will be the answer to better performance here.
Small experiments I did with different hypernetwork based models (where the hypernetwork creates the expert on the fly) never actually beat their dense equivalents.
also @spare flame you might find this interesting, might not. It didn't end up being better, but beam search to build a cartesian product as the input to a network worked reasonably well
It's basically where I dropped inquiry though