ETHOS | EleutherAI | Page 2

spare flame Oct 2, 2025, 11:39 AM

#

I recommend doing this by making a HF adapter for your model so you can run inference that way and also can run lm eval harness

#

but im sure nanogpt has code for the initial inference sanity check

whole goblet Oct 2, 2025, 12:53 PM

#

```
Generated text: The winner of the 2008 United States presidential election would be George W. Bush, who took office more than two decades ago.

In a paper released Thursday, the former Republican presidential candidate touted his administration’s economic record – including a successful push to promote American manufacturing through jobs-creation programs – as a turnaround, saying that the economy will be stronger this year.

“The economy is a strong force in the world,” said Bush, “as long as we’re not doing terrible things on
```

`step 55147/55148 (100.0%): train loss 3.1518, val loss 3.1519`
Small bump at the end, so will probably be interesting to see what happens in ablations. Still a very solid result.

#

spare flame Oct 2, 2025, 1:14 PM

#

whole goblet Generated text: The winner of the 2008 United States presidential election wo...

looks great, I don't think you should read too much into various bumps and wiggles

whole goblet Oct 2, 2025, 1:22 PM

#

Baseline running now

tranquil fiber Oct 2, 2025, 1:39 PM

#

Yeah I said the same thing too haha, I've been burned too many times 😅😭

tranquil fiber Oct 2, 2025, 1:40 PM

#

whole goblet Generated text: The winner of the 2008 United States presidential election wo...

Curious, did the PEER baseline use the same batch size? (Overall, i.e. # tokens per step, i.e. microbatch size changing to match)

#

It could be that this method is more variance sensitive

spare flame Oct 2, 2025, 1:40 PM

#

whole goblet Baseline running now

plz use same batchsize

tranquil fiber Oct 2, 2025, 1:41 PM

#

Yes def

#

Since batchsize impacts implicit lr by a lot (and other things as well)

spare flame Oct 2, 2025, 1:42 PM

#

yeah (sorry I keep being a downer about checking things) also just keep in mind that even with same hyperparams and bsz its possible that by reducing the bsz you will have kneecapped nanogpt, and that your method performs better at low bsz than it

tranquil fiber Oct 2, 2025, 1:42 PM

#

whole goblet

Yeah I think it might be that this method is more variance sensitive, if it's wiggly at 120M then scaling may be even moreso. Could also just be the smaller batchsize exacerbating it, something to keep in mind

#

(also ablating hyp effects/diffs vs the baseline/PEER baseline may be hard since different methods will be sensitive to different hyps, and not much compute for e.g. a full grid search (or the like))

tranquil fiber Oct 2, 2025, 1:44 PM

#

spare flame yeah (sorry I keep being a downer about checking things) also just keep in mind ...

Oh yeah def too

#

If there is enough compute to do final runs with the ~.5M tokens batchsize the baseline has then that would make the argument a lot more solid

#

(but I know there's limited compute here so you need to budget it)

spare flame Oct 2, 2025, 1:49 PM

#

yeah to be clear im not discouraging running the nanogpt at low bsz, you need to do that esp since its inexpensive

tranquil fiber Oct 2, 2025, 1:49 PM

#

Yeah accum steps are impl so it should be a config change I think

spare flame Oct 2, 2025, 1:49 PM

#

a cheap way to make sure you're not kneecapping nanogpt baseline is run that at both small and large bsz

#

that way you can see if nanogpt baseline is better at higher bsz

#

if so, then you will eventually need to run your test at higher bsz too

#

if not, great

tranquil fiber Oct 2, 2025, 1:52 PM

#

Good point yeah, hadn't thought of that

#

There is also a low-res chart from the nanogpt example run done in the repo but have no idea how much syncrot it's been under

#

That said this run does seem to be beating the LLM.c python baseline logs pretty handily

#

So that may bode well

#

But those may have been recorded before the LR change, which made it converge ~60% faster or so

#

So if that's the case, then the two runs would be roughly ~even or so, which bodes well (at least)

whole goblet Oct 2, 2025, 2:37 PM

#

Yea, baseline is pretty fast to run, so I can check both batchsizes. Right now it's approximately even at the lower batchsize

#

But I'm only at the first checkpoint, and the more interesting stuff will be in a few hours

whole goblet Oct 2, 2025, 2:37 PM

#

spare flame plz use same batchsize

Already done

whole goblet Oct 2, 2025, 2:38 PM

#

tranquil fiber Curious, did the PEER baseline use the same batch size? (Overall, i.e. # tokens ...

PEER baseline used a larger batchsize. Because it isn't creating weights on the fly the pytorch impl had less memory pressure

spare flame Oct 2, 2025, 2:38 PM

#

whole goblet But I'm only at the first checkpoint, and the more interesting stuff will be in ...

Yeah training dynamics will vary

tranquil fiber Oct 2, 2025, 2:38 PM

#

whole goblet PEER baseline used a larger batchsize. Because it isn't creating weights on the ...

Oh no, that does impact things a lot

whole goblet Oct 2, 2025, 2:39 PM

#

tranquil fiber Oh no, that does impact things a lot

I can rerun with smaller, or just rerun the new version with a fused kernel once I have that functional

tranquil fiber Oct 2, 2025, 2:39 PM

#

Yeah

#

Either way is fine I think

whole goblet Oct 2, 2025, 2:40 PM

#

tbh though if we are getting into hyperparam stuff, this is arguably comparing a run that has been tuned like crazy at all levels to a first real attempt

tranquil fiber Oct 2, 2025, 2:40 PM

#

Maybe actually bigger after kernel works is okay since that would give a good probe into how your new method performs at different batchsizes/effective lrs

whole goblet Oct 2, 2025, 2:40 PM

#

tranquil fiber Maybe actually bigger after kernel works is okay since that would give a good p...

That's my thought

tranquil fiber Oct 2, 2025, 2:40 PM

#

whole goblet tbh though if we are getting into hyperparam stuff, this is arguably comparing a...

For peer, or nanogpt

whole goblet Oct 2, 2025, 2:40 PM

#

tranquil fiber For peer, or nanogpt

nano

tranquil fiber Oct 2, 2025, 2:40 PM

#

Gotcha

#

Yeah I think they merged the lr change into it

whole goblet Oct 2, 2025, 2:40 PM

#

I'm just straight up materializing GBs of weights in the hypernetwork version that don't need to exist rn

tranquil fiber Oct 2, 2025, 2:41 PM

#

Nanogpt has some okay tuning but not like an absurd amount iirc

#

But still, more than just a few runs 😅😂

whole goblet Oct 2, 2025, 2:41 PM

#

Yea, I'm at n=1 😅

tranquil fiber Oct 2, 2025, 2:41 PM

#

Yeah it's a pretty good start!

#

Also sounds like a good argument for investing in the kernel work

#

Now that the big run looks really promising

#

(hard to tell the order of operations on that sorta thingie)

whole goblet Oct 2, 2025, 2:42 PM

#

But yea, the GPT-2 small version should be completely apples to apples. Identical config to the hypernet stuff (including LR which arguably should likely differ between architectures)

tranquil fiber Oct 2, 2025, 2:43 PM

#

Yeah ideally as a speedrunner I'm immediately thinking about how things are set up for the hypernetwork, not just LR but also inits and a few other things

#

Glad it's doing well vanilla

whole goblet Oct 2, 2025, 2:44 PM

#

Yea, I think even if on a first run for preprint, just showing that this works and is competitive without tons of optimization would be enough

#

But yea, the PEER baseline I think needs to be rerun

tranquil fiber Oct 2, 2025, 2:45 PM

#

whole goblet But yea, the PEER baseline I think needs to be rerun

For an extra datapoint, though a .5M batchsize is kinda "standard" at this size so that's more marketable if it's worth it to rerun after doing some kernel work

#

I believe that peer was run at the default batchsize IIRC? (It's a bit more than .5M so not exactly but in that ballpark)

whole goblet Oct 2, 2025, 2:46 PM

#

tranquil fiber For an extra datapoint, though a .5M batchsize is kinda "standard" at this size ...

I probably won't do kernel work for the baseline just so I can cite the implementation and cite PEER's author saying that it's Good Enough™

tranquil fiber Oct 2, 2025, 2:47 PM

#

whole goblet I probably won't do kernel work for the baseline just so I can cite the implemen...

Kernel work for your method I mean

whole goblet Oct 2, 2025, 2:47 PM

#

tranquil fiber I believe that peer was run at the default batchsize IIRC? (It's a bit more than...

491,520 tokens per batch, exactly
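A minimal sketch of the tokens-per-step arithmetic behind the "same batchsize" discussion, assuming tokens per optimizer step = micro-batch x sequence length x accumulation steps x GPU count. The factorization 12 x 1024 x 5 x 8 is one way to reach 491,520 and is an assumption here, as are the smaller micro-batch and single GPU in the second half; none of these were confirmed in the thread.

```python
def tokens_per_step(micro_batch: int, seq_len: int, accum_steps: int, n_gpus: int = 1) -> int:
    """Tokens consumed per optimizer step."""
    return micro_batch * seq_len * accum_steps * n_gpus


def accum_steps_for(target_tokens: int, micro_batch: int, seq_len: int, n_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to hit target_tokens exactly."""
    per_micro_step = micro_batch * seq_len * n_gpus
    if target_tokens % per_micro_step != 0:
        raise ValueError("target_tokens is not a multiple of one micro-step")
    return target_tokens // per_micro_step


TARGET = 491_520  # tokens per batch quoted in the thread

# Assumed reference factorization (not confirmed in the thread).
reference = tokens_per_step(micro_batch=12, seq_len=1024, accum_steps=5, n_gpus=8)

# Hypothetical memory-constrained setup: smaller micro-batch on one GPU,
# compensated purely through gradient accumulation.
small_accum = accum_steps_for(TARGET, micro_batch=4, seq_len=1024, n_gpus=1)
matched = tokens_per_step(micro_batch=4, seq_len=1024, accum_steps=small_accum, n_gpus=1)

print(reference, small_accum, matched)
```

If accumulation is already implemented, matching the larger batch is then only a config change, at the cost of more micro-steps per optimizer step.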

tranquil fiber Oct 2, 2025, 2:47 PM

#

Since it would be cheaper to rerun

tranquil fiber Oct 2, 2025, 2:47 PM

#

whole goblet 491,520 tokens per batch, exactly

Gotcha yeah

#

And nanogpt?

whole goblet Oct 2, 2025, 2:47 PM

#

Oh, yea. I'm cool doing that

tranquil fiber Oct 2, 2025, 2:47 PM

#

Is it 491.5k as well?

whole goblet Oct 2, 2025, 2:47 PM

#

NanoGPT is using the small one

tranquil fiber Oct 2, 2025, 2:47 PM

#

I mean the normal baseline for it

whole goblet Oct 2, 2025, 2:47 PM

#

But this will be done in a few hours. It won't be an issue to rerun at larger batch

#

Oh, yea

tranquil fiber Oct 2, 2025, 2:47 PM

#

It is the same

tranquil fiber Oct 2, 2025, 2:48 PM

#

whole goblet But this will be done in a few hours. It won't be an issue to rerun at larger ba...

Nice awesome

#

Good datapoint as smerky said

#

Especially since it's cheap

whole goblet Oct 2, 2025, 2:48 PM

#

Yea, at a minimum gives me options depending on where memory pressure actually ends up after the kernel work

#

I have a prototype that seems to work, but just need to test a bit before I say for sure

#

And reduces memory from 35GB running to a bit under 8GB

tranquil fiber Oct 2, 2025, 2:49 PM

#

Nice!

whole goblet Oct 2, 2025, 2:49 PM

#

Which does also mean that if I have to hop off the GH200's I'm on now, I could move back over to my 3090's

tranquil fiber Oct 2, 2025, 2:49 PM

#

What's the baseline nanogpt network at that batchsize for memory use (also is this nvidia-smi or something a bit more fine-grained?)

whole goblet Oct 2, 2025, 2:50 PM

#

nvtop (so smi under the hood iirc), seeing 6.8GB

tranquil fiber Oct 2, 2025, 2:50 PM

#

Gotcha

#

That is an upper bound IIRC so not necessarily the most accurate

whole goblet Oct 2, 2025, 2:51 PM

#

yea there's for whatever reason just a GB or so of unfreed memory that shows up randomly and persists until I kick the node

tranquil fiber Oct 2, 2025, 2:51 PM

#

But decent maybe for sanity checking potential max use

#

Interesting

#

Yeah

#

6.8->8 is certainly pretty reasonable

whole goblet Oct 2, 2025, 2:51 PM

#

Yea, I could get it to be identical~ if I wrote a backwards kernel but I don't want to torture myself

#

fuck chain rule

#

all my homies hate chain rule

whole goblet Oct 2, 2025, 6:24 PM

#

Okay, yea I should have run baseline first

#

But still worthwhile for me to figure out the fused kernel and run ablations, since we're within spitting distance

#

I do think I might need to spend a minute to really implement what was found here https://arxiv.org/pdf/2502.17405

#

That way I can avoid a full grid search of hyperparams for this model

#

Does also make me question why PEER underperformed so hard, that's not something I expected

spare flame Oct 2, 2025, 6:45 PM

#

whole goblet Okay, yea I should have run baseline first

always

#

live and learn 🤷‍♂️

#

but also, the most annoying thing is that sometimes these lines can cross much later in training

whole goblet Oct 2, 2025, 6:47 PM

#

Yea, I need to see where I end up on throughput with the fused kernel

#

And that'll inform where to go from here

#

Same rough area to me suggests that there's still likely something here, just not obviously busted

spare flame Oct 2, 2025, 6:49 PM

#

definitely let the baseline run the whole way if u can

#

it may be closer than you think

#

also important to still run the large bsz baseline

whole goblet Oct 2, 2025, 6:54 PM

#

Oh I am, I just shared because it's beating where hypernet ended up 70% of the way through

#

Still would have expected PEER 5b to perform better than dense baseline

#

Weird that it didn't

spare flame Oct 2, 2025, 6:55 PM

#

your lora-like constructions may be able to tolerate 10x the learning rate

whole goblet Oct 2, 2025, 6:56 PM

#

Would that break comparisons?

spare flame Oct 2, 2025, 6:56 PM

#

yes but you dont have to care

#

you can assume nanogpt is optimal hyperparams for gpt2

#

https://thinkingmachines.ai/blog/lora/

Thinking Machines Lab

LoRA Without Regret

How LoRA matches full training performance more broadly than expected.

whole goblet Oct 2, 2025, 6:58 PM

#

So basically crank up the LR for PEER, but I might need to reduce LR for the hypernet

spare flame Oct 2, 2025, 6:58 PM

#

if you recall, i dont actually know how your code works 🤣
but if you use a lora-like construction with down then up projections then maybe you can up the LR successfully for that portion

#

the overall point is that your setup may require different hyperparameters to do well

whole goblet Oct 2, 2025, 6:59 PM

#

I can shoot you the notebook. Fern took a look too

spare flame Oct 2, 2025, 6:59 PM

#

ok, then she may have a better idea of things that could help

whole goblet Oct 2, 2025, 6:59 PM

#

spare flame the overall point is that your setup may require different hyperparameters to do...

Yea, this is the part that I'm confused on, is if I should attempt to compare ideal hyperparams to ideal hyperparams or keep them directly constant (or if there's an in-between)

spare flame Oct 2, 2025, 7:00 PM

#

whole goblet Yea, this is the part that I'm confused on, is if I should attempt to compare id...

it would have been convenient if the same hparams worked well for both, but now we simply dont know

#

so you might want to see if it learns faster with some changes

#

you can assume that nanogpt is reasonably optimal for itself

whole goblet Oct 2, 2025, 7:01 PM

#

Sounds good to me

spare flame Oct 2, 2025, 7:01 PM

#

also good to check if the peer paper says anything about this for it

whole goblet Oct 2, 2025, 7:01 PM

#

I also have contact with Owen He, so I might just ask him to double check the setup

#

Was hoping to avoid this by using an open source impl

#

Once this baseline is done I'll get the fused kernel working and then figure out budget for a hyperparam search. Other problem with a hyperparam search is I don't know if this exact shape of network is what will be ideal until I run ablations

#

And that might affect ideal hyperparams

spare flame Oct 2, 2025, 7:03 PM

#

there is no magic bullet here that I know of

#

try to find people who did similar things

#

and see if they had to compensate somehow specifically

whole goblet Oct 2, 2025, 7:03 PM

#

Is lucidrains in this server?

spare flame Oct 2, 2025, 7:03 PM

#

no

whole goblet Oct 2, 2025, 7:04 PM

#

I might just try to ping him on email and see if he has any insight

spare flame Oct 2, 2025, 7:05 PM

#

if you are going to contact him, an issue on the repo might be the best and most polite way

whole goblet Oct 2, 2025, 7:05 PM

#

You wouldn't just email?

#

Mostly ask because the repo hasn't been touched in a year

spare flame Oct 2, 2025, 7:06 PM

#

he has 10000 repos

whole goblet Oct 2, 2025, 7:06 PM

#

Yea, that's the other reason I figured email might be best

spare flame Oct 2, 2025, 7:06 PM

#

🤷‍♂️

whole goblet Oct 2, 2025, 7:06 PM

#

Oh, actually his website points to signal

#

Specifically for reaching out

#

Going to do that

tranquil fiber Oct 2, 2025, 8:44 PM

#

spare flame but also, the most annoying thing is that sometimes these lines can cross much l...

oddly too the posted loss charts from the repo don't quite match this, maybe something to do with the different effective lr (or the like)

whole goblet Oct 2, 2025, 9:01 PM

#

Trying a kind of dumb config while I test out the fused kernel. Dropped weight decay to 1e-2, and quadrupled learning rate

#

So far isn't exploding and is like a solid .5 nat ahead of the old run at this token count (5.5 vs 6.0), but still so early it doesn't mean anything other than it doesn't immediately blow up yet.
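To put a gap like "5.5 vs 6.0" in context, here is a small sketch converting a loss difference into a perplexity ratio and into bits. It assumes the reported loss is mean cross-entropy per token in nats, which the thread implies but does not state outright.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


def nats_to_bits(nats: float) -> float:
    """Convert a quantity in nats to bits."""
    return nats / math.log(2)


old_loss, new_loss = 6.0, 5.5  # values quoted in the message above
gap_nats = old_loss - new_loss

# A fixed gap in nats is a fixed *ratio* in perplexity, whatever the absolute loss.
ppl_ratio = perplexity(old_loss) / perplexity(new_loss)
gap_bits = nats_to_bits(gap_nats)

print(f"gap: {gap_nats:.2f} nats = {gap_bits:.3f} bits, perplexity ratio {ppl_ratio:.3f}")
```

Early in training both curves are still falling steeply, so the same gap can also be read as a small horizontal offset in tokens seen, which is why it says little this early.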

whole goblet Oct 3, 2025, 2:31 AM

#

@tranquil fiber Hacked a bit on the fused kernel and used it as an opportunity to run small experiments on hyperparams, and I think you might have been right on weight decay

spare flame Oct 3, 2025, 1:49 PM

#

whole goblet @tranquil fiber Hacked a bit on the fused kernel and used it as an opportu...

u r changing too many things at once
also, WD is something that is going to hurt short term performance but benefit longterm training which you arent doing, so if u wanna change it make sure u change the baseline too and rerun

#

its ok to change LR (because you can assume nanogpt is close to optimal on that for gpt2), but not ok to change WD without rerunning baseline

whole goblet Oct 3, 2025, 2:56 PM

#

spare flame u r changing too many things at once also, WD is something that is going to hurt...

These are short experiments when I’m just getting the kernel working. It’s helping short term performance, fwiw.

spare flame Oct 3, 2025, 3:08 PM

#

yeah im just letting u know that WD always specifically hurts short term performance

#

so you're not learning anything about your algorithm by removing/reducing it

whole goblet Oct 3, 2025, 3:19 PM

#

spare flame so you're not learning anything about your algorithm by removing/reducing it

The hypothesis is that the hypernetworks might be over-constrained, and they see more gradients than any other part of the network so could be arguably further along at the same step. So just trying to see if there’s anything to that thought. Obviously not a replacement for a full ablation.

#

Past that, when talking about late training, is there a standard definition for that?

spare flame Oct 3, 2025, 3:20 PM

#

Like trillions

whole goblet Oct 3, 2025, 3:20 PM

#

Since a 9B token run for models this size is already way past what chinchilla suggests

spare flame Oct 3, 2025, 3:20 PM

#

Or at least hundreds of billions

whole goblet Oct 3, 2025, 3:21 PM

#

So chinchilla is outdated?

#

Or not applicable for other reasons?

spare flame Oct 3, 2025, 3:41 PM

#

No one ever trains to chinchilla, it's only optimal in a kind of useless sense

whole goblet Oct 3, 2025, 3:48 PM

#

I think I’m just coming to the conclusion that I need to do a full hyperparameter sweep which quickly becomes compute prohibitive.

#

Is there a set of tests that would reach “good enough” for preprint so I can garner more compute?

spare flame Oct 3, 2025, 4:59 PM

#

You can put out a preprint anytime you like. As long as you're honest about things and do reasonable comparisons the worst case scenario is it will be ignored. So it just depends on what your goals are for the preprint.

#

But more generally, my feeling is that no one will be interested enough in any method that performs worse and is slower for the same parameter count for the preprint to specifically help get you compute.

#

You may still be able to get more compute tho! I'm only saying I don't know that a preprint will cause that to happen.

#

Also, the fact that your implementation and test gives bad results for PEER is problematic for trust in your current methodology.

#

No one is going to believe that your method dramatically beats PEER while simultaneously trusting that your result showing PEER is destroyed by GPT2 was done properly.

#

You've got some serious problems right now and this does not warrant a paper at the current time.

#

What it does warrant is further investigation, maybe including trying to replicate the PEER results.

#

I feel like I'm always having to give bad news here, but the reality is you are going to need to identify these kinds of problems without 3rd party input.

#

After all, you already know the facts of the situation as well or better than I do.

#

When things 'seem wrong' or you believe they don't show the right result, you gotta either figure out how to address those issues or decide that they are in fact the right result and your hunches were wrong. Then update your priors, learn from the experience, and try some new thing or approach.

#

Right now, you're faced with two opposing results: PEER was worse than yours, but PEER was also much worse than GPT2

#

Either PEER isn't actually good, your test/implementation isn't good, or it doesn't work well at this scale/hyperparameters.

#

Possibly all three!

#

None of these things imply that your method works well. But it leaves the door open that it might, especially if it works well in situations conducive to PEER.

#

It's also important to recognize that your hunches or ideas about what should work well can be just plain wrong. I think experienced researchers expect that the vast majority of the things they try will NOT work, regardless of their hypotheses with theoretical justifications.

#

This is certainly true for me personally. I give up on things or methods and/or have to try a very different tactic maybe 75-99% of the time. I'm sure @tranquil fiber can give more insight into her batting average, but I'd guess that it's similar.

#

You should be trying to do more tests quicker with less resources so that you can swing the bat more frequently. This is why I advocated for testing against the nanogpt baseline early rather than late. Because time spent per idea matters. You can also spend a lot of time drilling down on one idea, but it's only good to do this when you have a good sense of what may be salvageable and what can't.

#

And most importantly of all: you always need to be testing against a fair yardstick as early as possible. Science is about verification via either experimental evidence or mathematical proof. Without this, you are just hoping that your intuition is well enough developed while flying blind.

#

In summary:

Test early and often against a fair yardstick.
Try to use less compute and/or obtain more compute prior to having proven stuff.
Ask yourself every time what it is you've proven and what is uncertain. And how you can most efficiently add to the evidence in either direction.
Don't put out a paper until you have shown something useful and can prove it. [This can include negative results if you have strong evidence]
Get a lot of at-bats as you work on increasing your batting average.

whole goblet Oct 3, 2025, 8:00 PM

#

spare flame Either PEER isn't actually good, your test/implementation isn't good, or it does...

I think that there's still likely reasonable learnings here from being able to show that this architecture works at all, since it is a weird architecture. Hypernetworks are themselves difficult to train, so showing this architecture can rein some of that in I think could have signal for some folks

whole goblet Oct 3, 2025, 8:01 PM

#

spare flame I feel like I'm always having to give bad news here, but the reality is you are ...

Don't most researchers have faculty or colleagues to look at this?

whole goblet Oct 3, 2025, 8:02 PM

#

spare flame You should be trying to do more tests quicker with less resources so that you ca...

I'm not sure exactly how you would reduce here, and this feels like it goes directly against a handful of times that you've suggested I need to do more, not less.

whole goblet Oct 3, 2025, 8:03 PM

#

spare flame And most importantly of all: you always need to be testing against a fair yardst...

And that's most of what I've asked for input on, is if it's a fair yardstick. There's so many confounding variables, and most papers I've looked at don't mention basically anything I've been called on here.

spare flame Oct 3, 2025, 8:10 PM

#

whole goblet I'm not sure exactly how you would reduce here, and this feels like it goes dire...

sorry for not being clear enough on that... amount of time spent on each test is not directly related to scale
there are lots of ways to tackle this and there is no uniform answer, it's going to require creativity

#

but for example now that you have the baseline gpt2 run you can pretty quickly guess at how new runs of peer etc are doing
or you can make your code run faster (its a tradeoff of your coding time tho)

#

or you can spend more money per unit time to get things done faster, and/or you can stop runs early since u can compare to gpt2 base
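The "stop runs early since you can compare to gpt2 base" idea can be sketched as below: interpolate the finished baseline's validation curve at the new run's token count and flag runs that trail by more than a margin. The curve values, `margin`, and `min_tokens` are hypothetical placeholders, not numbers from these runs.

```python
from bisect import bisect_left


def baseline_loss_at(tokens: int, curve: list[tuple[int, float]]) -> float:
    """Linearly interpolate the baseline val loss at a token count.

    curve is a list of (tokens_seen, val_loss) sorted by tokens_seen.
    """
    xs = [t for t, _ in curve]
    if tokens <= xs[0]:
        return curve[0][1]
    if tokens >= xs[-1]:
        return curve[-1][1]
    i = bisect_left(xs, tokens)
    (t0, l0), (t1, l1) = curve[i - 1], curve[i]
    return l0 + (l1 - l0) * (tokens - t0) / (t1 - t0)


def should_stop_early(tokens: int, val_loss: float, curve: list[tuple[int, float]],
                      margin: float = 0.05, min_tokens: int = 200_000_000) -> bool:
    """True if the run trails the baseline by more than margin after min_tokens."""
    if tokens < min_tokens:
        return False  # too early to judge
    return val_loss > baseline_loss_at(tokens, curve) + margin


# Hypothetical baseline checkpoints: (tokens seen, val loss).
baseline_curve = [(100_000_000, 4.2), (500_000_000, 3.6), (1_000_000_000, 3.4)]

print(baseline_loss_at(300_000_000, baseline_curve))
print(should_stop_early(300_000_000, 4.0, baseline_curve))
print(should_stop_early(300_000_000, 3.92, baseline_curve))
```

Since curves can cross much later in training, a generous `margin` and `min_tokens` make this a triage tool rather than a verdict.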

whole goblet Oct 3, 2025, 8:11 PM

#

spare flame sorry for not being clear enough on that... amount of time spent on each test is...

I mean, I could train to chinchilla optimal and cut down these training runs by a bit over 3x, if that would still provide enough signal
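The "a bit over 3x" figure can be sanity-checked with a rough calculation. This sketch assumes the commonly cited rule of thumb of about 20 training tokens per parameter and a 124M-parameter GPT-2 small; both are assumptions, not values given in the thread.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb approximation, assumed here


def chinchilla_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Approximate compute-optimal token budget for a dense model."""
    return n_params * tokens_per_param


n_params = 124_000_000          # assumed GPT-2 small parameter count
planned_tokens = 9_000_000_000  # the 9B-token run mentioned in the thread

optimal_tokens = chinchilla_tokens(n_params)
reduction = planned_tokens / optimal_tokens

print(f"~{optimal_tokens / 1e9:.2f}B tokens, {reduction:.2f}x shorter than the 9B run")
```

Under these assumptions the compute-optimal budget is about 2.5B tokens, so stopping there would cut each run by roughly 3.6x.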

#

But I'm still not sure exactly why it's "useless"

spare flame Oct 3, 2025, 8:12 PM

#

you can also try an even smaller scale to get your bearings

spare flame Oct 3, 2025, 8:13 PM

#

whole goblet But I'm still not sure exactly why it's "useless"

chinchilla? oh its just useless for actual inference so no one does it

#

its fine to train to it as a test

whole goblet Oct 3, 2025, 8:13 PM

#

spare flame chinchilla? oh its just useless for actual inference so no one does it

Can you expand? Is it useless for architecture research? My goal isn't to pump out a SOTA model, just a finding on this to explore the space

#

It's not like PEER was ever released

spare flame Oct 3, 2025, 8:14 PM

#

yeah it makes sense to train to it here, but I also dont think anyone is going to care that much if you train to it specifically

whole goblet Oct 3, 2025, 8:14 PM

#

I don't want a pat on the back for training to chinchilla optimal, just to know if it's good enough for that to not be a sole reason the methodology is dismissed

spare flame Oct 3, 2025, 8:15 PM

#

you gotta keep in mind that your new FFN replacement is going to have different training dynamics than regular ones, so a study (chinchilla) of optimality per tokens trained that was done on traditional FFNs is kind of irrelevant to it

whole goblet Oct 3, 2025, 8:15 PM

#

Sure, but there's going to be so many differences in general, but for some reason I also have to keep LR/WD/etc. identical despite those likely having different dynamics as well?

#

Just trying to understand the line here

spare flame Oct 3, 2025, 8:15 PM

#

I did not tell you to keep LR the same, but I did tell you to keep WD the same

#

Specifically, you can and should change the LR for YOUR model

#

if feasible

#

but the original LR was a good starting point when you literally had zero runs done 🙂

whole goblet Oct 3, 2025, 8:16 PM

#

Yea, I'm just moreso saying that I might also have to take an approach where I use different LR's for different parts of the network

#

And will likely reduce WD just for the hypernetwork
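A minimal sketch of per-component learning rate and weight decay, assuming the hypernetwork's parameter names share a prefix. The prefix `hypernet.`, the multiplier, and the base values are hypothetical. The function returns a list of dicts with `params`, `lr`, and `weight_decay` keys, which is the per-group format PyTorch optimizers accept; strings stand in for tensors so the example runs without PyTorch.

```python
def build_param_groups(named_params, base_lr, base_wd, overrides):
    """Split (name, param) pairs into optimizer groups by name prefix.

    overrides maps a name prefix to {"lr_mult": float, "weight_decay": float};
    parameters matching no prefix get base_lr and base_wd.
    """
    buckets = {}
    for name, param in named_params:
        key = next((p for p in overrides if name.startswith(p)), None)
        buckets.setdefault(key, []).append(param)

    groups = []
    for key, params in buckets.items():
        cfg = overrides.get(key, {})
        groups.append({
            "params": params,
            "lr": base_lr * cfg.get("lr_mult", 1.0),
            "weight_decay": cfg.get("weight_decay", base_wd),
        })
    return groups


# Stand-in for model.named_parameters(); names and values are hypothetical.
fake_named_params = [
    ("hypernet.w1", "p0"),
    ("blocks.0.attn.w", "p1"),
    ("hypernet.w2", "p2"),
]
groups = build_param_groups(
    fake_named_params,
    base_lr=6e-4,
    base_wd=0.1,
    overrides={"hypernet.": {"lr_mult": 0.5, "weight_decay": 0.01}},
)
for g in groups:
    print(g)
```

With a real model the result could be passed to something like `torch.optim.AdamW(groups)`. Any weight decay change made this way only on the new method still runs into the fairness concern raised earlier about rerunning the baseline.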

#

And all of these make me question how "fair yardstick" is defined

#

Since they all clearly have different effects on different models

#

And that's without getting into other stuff like beta/clipping/etc.

spare flame Oct 3, 2025, 8:18 PM

#

there's no good answer

#

this stuff is hard

whole goblet Oct 3, 2025, 8:18 PM

#

I guess that's what I'm getting at, is there's no good answer, but you're telling me that the answers I'm coming up with are definitely wrong

#

So just trying to figure out where you're actually drawing these lines

#

Like are there papers I can read on this stuff? Because I've been pretty obsessively looking at how other arch papers present this stuff, and most of these topics aren't even touched on

spare flame Oct 3, 2025, 8:19 PM

#

yeah you're right that it's hard to know that stuff without a lot of detailed info about hyperparams, but I think generally changing LR is fine and that's all you should change

#

you can assume (for now) that nanogpt is somewhat near optimal hparams for itself

#

im just telling you to try to move as rapidly as possible

whole goblet Oct 3, 2025, 8:20 PM

#

I think from early tests and also following Fern's advice, there's likely some benefit from looking at weight decay on the hypernetworks, both because they receive more dense gradients than any other part of the network, and their ability to produce diverse outputs directly ties to the hypothetical capacity of the overall network

spare flame Oct 3, 2025, 8:21 PM

#

yeah she is better equipped to advise you on that than I am

#

but you could also just rerun the baseline with no WD

#

which is safer

#

WD isnt going to help on a short test but it will HURT

#

so removing it on only yours is unfair

whole goblet Oct 3, 2025, 8:21 PM

#

But couldn't that theoretically harm GPT-2 Small's results?

spare flame Oct 3, 2025, 8:22 PM

#

whole goblet But couldn't that theoretically harm GPT-2 Small's results?

theoretically sure, but in practice it will help its results (i think!)

whole goblet Oct 3, 2025, 8:22 PM

#

spare flame WD isnt going to help on a short test but it will HURT

It's showing positive delta against another baseline within like 100M tokens on this arch

whole goblet Oct 3, 2025, 8:22 PM

#

spare flame theoretically sure, but in practice it will help its results (i think!)

I don't think I understand what you're saying. Aren't these two statements mutually exclusive? #1395195891262029884 message

spare flame Oct 3, 2025, 8:23 PM

#

im saying that WD hurts short tests across the board, so remove it from nanogpt gpt2 first if you really want to remove it from yours

#

and show that it helps (or at least doesnt hurt) gpt2 to remove it there

whole goblet Oct 3, 2025, 8:23 PM

#

Sure, I just feel like removing WD from one that benefits, and another arch that hurts from it removes the fairness of the yardstick?

spare flame Oct 3, 2025, 8:24 PM

#

sorry, are you saying that WD will benefit yours?

whole goblet Oct 3, 2025, 8:24 PM

#

Yes

spare flame Oct 3, 2025, 8:24 PM

#

oh

whole goblet Oct 3, 2025, 8:24 PM

#

that's what I've been saying lol

spare flame Oct 3, 2025, 8:24 PM

#

lol

#

I thought you wanted to reduce it on the hypernetwork

whole goblet Oct 3, 2025, 8:24 PM

#

I do

#

Higher weight decay hurts the network

#

Basically because more diverse outputs seem to help

spare flame Oct 3, 2025, 8:25 PM

#

when I say 'wd benefits' I mean "MORE wd benefits"

whole goblet Oct 3, 2025, 8:25 PM

#

Oh, no, the hypernetwork wants as little weight decay as possible from what I can tell

spare flame Oct 3, 2025, 8:25 PM

#

right, so take it away from gpt2 first

#

show that it improves gpt2 to do so

#

then run it without wd on yours

whole goblet Oct 3, 2025, 8:26 PM

#

Oh, so you're saying that less weight decay also helps GPT-2

spare flame Oct 3, 2025, 8:26 PM

#

spare flame im saying that WD hurts short tests across the board, so remove it from nanogpt ...

as I said, (MORE) WD hurts short tests across the board

#

these are short tests

#

if you were doing 300B tokens I'd have a different opinion

whole goblet Oct 3, 2025, 8:27 PM

#

Okay, now I'm understanding

spare flame Oct 3, 2025, 8:27 PM

#

sorry if I wasnt clear enough about that

#

btw I could be wrong! but generally this is true

whole goblet Oct 3, 2025, 8:28 PM

#

I just thought you were saying that by reducing WD, I should expect GPT-2 to get worse, so it felt like it would be sullying tuned hyperparameters for GPT-2

spare flame Oct 3, 2025, 8:28 PM

#

nope, opposite for short tests

#

it should improve gpt2

whole goblet Oct 3, 2025, 8:28 PM

#

Okay cool, we're on the same page then

spare flame Oct 3, 2025, 8:28 PM

#

this is all just to chase the idea that reducing WD will make your hypernetwork better

#

I don't necessarily think that's the right thing to chase, but I'm no expert!

#

unless you think its also damaging PEER

#

to me the #1 problem here is that PEER looks awful

#

that's a big problem

whole goblet Oct 3, 2025, 8:29 PM

#

Yea, I think that there's some reproduction issues here. I'm going to go and compare my original implementation of it to lucidrains' to see if there's any obvious bugs

#

Because I was seeing better performance when I was working on something similar with ETHOS

#

PEER might just be bad

spare flame Oct 3, 2025, 8:30 PM

#

more importantly, check the PEER paper to see in what situations it actually (supposedly) worked well for them

#

if its a very different size regime etc.

#

and try to reproduce that however you can if possible even briefly

whole goblet Oct 3, 2025, 8:30 PM

#

They were doing IsoFLOP comparisons, didn't care about isoparam

spare flame Oct 3, 2025, 8:30 PM

#

and/or by talking to the author

whole goblet Oct 3, 2025, 8:30 PM

#

Yea, he's reviewing my reproduction hopefully in the next couple weeks

spare flame Oct 3, 2025, 8:31 PM

#

yeah its possible that PEER is horrible at isoparams

whole goblet Oct 3, 2025, 8:31 PM

#

Which is why I went for 5B, since it's IsoFLOP(ish) with GPT-2 Small

spare flame Oct 3, 2025, 8:31 PM

#

but also if PEER is horrible at isoparams and this is inspired by it, that might imply problems for yours

#

hard to know

whole goblet Oct 3, 2025, 8:32 PM

#

Which is fair, but I'm showing that with Iso"Capacity" that I'm outperforming, so it still feels like there's something there

#

Even if the base has some trouble

#

Like I think the most interesting thing here is that this model is effectively constructing rank-k experts neuron by neuron

#

And is like, spitting distance from dense

spare flame Oct 3, 2025, 8:34 PM

#

whole goblet They were doing IsoFLOP comparisons, didn't care about isoparam

well if you can even reproduce any part of their runs that would at least give you confidence that your implementation isnt wrong

#

no idea if they supplied or can supply enough data for you to do that

#

or its at any sort of feasible scale

whole goblet Oct 3, 2025, 8:34 PM

#

spare flame well if you can even reproduce any part of their runs that would at least give y...

I did with small variance back when I first started on ETHOS. I can likely drag that back up

#

It is, and I have a kernel that makes it pretty tractable

#

But that was with some other parts that I'm not sure how to match 1-1

#

I'd expect that harnesses within GDM at the time were using GQA

#

And my harness used MLA

#

And nanoGPT uses MHA

#

So there's a lot of potential variance just from attention mechanism used

#

PEER's paper itself is pretty light on hyperparam details. We don't even know the model dim or if it was held constant

#

Or if depth was constant

spare flame Oct 3, 2025, 8:36 PM

#

author can shed light on this much more easily than they can check your code

whole goblet Oct 3, 2025, 8:36 PM

#

yea, I'll shoot an email

tranquil fiber Oct 3, 2025, 8:40 PM

#

spare flame but you could also just rerun the baseline with no WD

(yes agreed almost generally definitely at this scale)

tranquil fiber Oct 3, 2025, 8:41 PM

#

spare flame btw I could be wrong! but generally this is true

yes agreed as well

spare flame Oct 3, 2025, 8:44 PM

#

since this is in the publishing help section, I have one more note about why and when to put out a preprint:
consider how many citations you expect to realistically get and base it on that (considering who and why they might cite your work and in what situation and other papers)
don't write or put out a preprint until that number is greater than zero

tranquil fiber Oct 3, 2025, 8:44 PM

#

spare flame This is certainly true for me personally. I give up on things or methods and/or ...

5-10% for methods I'm new(ish) to, ~30-40% for methods where I have a very strong guess that makes sense (a lot of this is just sense from last ~decade or so).

but that doesn't include e.g. bugfixes or the like as much, maybe a bit.

but agreed that maximizing the amount of times you can pull the lever is generally best, especially when starting out, it's how you gain the most intuition and information

#

(if completely new to something, e.g. an entirely new subfield, at the start, maybe a bit worse than 5-10%, e.g. ~3-5% or so)

whole goblet Oct 3, 2025, 8:49 PM

#

Yea, and I think I'm going to have to take another stab at this kernel. I got one working, but it's not beating easy autograd tricks

#

Can get batch size up considerably now, though with the autograd tricks

#

And did get about a 30% improvement in throughput

whole goblet Oct 3, 2025, 8:50 PM

#

spare flame since this is in the publishing help section, I have one more note about why and...

And this is solid advice

spare flame Oct 3, 2025, 8:52 PM

#

whole goblet And did get about a 30% improvement in throughput

sounds like way better bang for buck than kernel work, also see if various compile tricks can help

whole goblet Oct 3, 2025, 8:55 PM

#

I can just tell what's going wrong, but getting the tiling correct has been a pain

spare flame Oct 3, 2025, 8:55 PM

#

and run some much shorter tests varying that stuff if u want to and comparing to short run of baseline without wd etc.

#

that way you can get feedback in minutes or hours not days

#

shorter doesnt always extrapolate cleanly but at least its rapid signal acquisition

#

you can only really learn from an experiment, so running lots of those may help you learn stuff about what works in this regime and what doesnt

whole goblet Oct 3, 2025, 8:57 PM

#

Yea, I can get a complete chinchilla run done on this (assuming it is the first point I can extrapolate from) within a few hours

spare flame Oct 3, 2025, 8:57 PM

#

yeah but u can also NOT get a complete one done

#

lol

tranquil fiber Oct 3, 2025, 8:57 PM

#

whole goblet Yea, I can get a complete chinchilla run done on this (assuming it is the first ...

ideally shorter than that for testing stuff

#

it's kinda like rebuilding a whole car to see if a new piston fits

spare flame Oct 3, 2025, 8:58 PM

#

yeah man just jiggle the pistons a lot!!!

tranquil fiber Oct 3, 2025, 8:58 PM

#

you should do scheduling, hyperparameter tuning generally on longer runs

#

most other things (esp smoke/sanity testing if things are working okay) should happen on shorter ones

tranquil fiber Oct 3, 2025, 8:59 PM

#

spare flame yeah man just jiggle the pistons a lot!!!

yeah, proxy tests like compressing water in the chamber to check for leaks (you can probably tell i dont know cars), that kind of thing. in spirit

whole goblet Oct 3, 2025, 8:59 PM

#

I mean, this thing is training, so I'm not doing a ton of long runs checking if stuff is working. Usually if it's completely not functional I know within the first 10 iterations at most

tranquil fiber Oct 3, 2025, 8:59 PM

#

yep

#

tho for e.g. the peer baseline i think it should be pretty short to know how the repro does

#

you may need to see how their loss curves compare against baselines to see what you should expect

whole goblet Oct 3, 2025, 9:00 PM

#

But I'm definitely at the "how do I tune this" phase ahead of trying to get ablations started, but it seems like this would benefit from a hyperparameter sweep before those

tranquil fiber Oct 3, 2025, 9:00 PM

#

larger models usually are much more step efficient, time-wise they may take longer to converge, but stepwise should almost always be faster than smaller models

whole goblet Oct 3, 2025, 9:00 PM

#

tranquil fiber tho for e.g. the peer baseline i think it should be pretty short to know how the...

unfortunately they only published final PPL values on C4 after a specific flop budget 🙁

tranquil fiber Oct 3, 2025, 9:00 PM

#

whole goblet unfortunately they only published final PPL values on C4 after a specific flop b...

hm yeah

whole goblet Oct 3, 2025, 9:00 PM

#

tbh I think I might just want to dodge the PEER baseline question

tranquil fiber Oct 3, 2025, 9:00 PM

#

defaulting to "bigger model should be better stepwise almost always" then should be a good litmus test there

tranquil fiber Oct 3, 2025, 9:01 PM

#

whole goblet tbh I think I might just want to dodge the PEER baseline question

yeah maybe, that would simplify things a bit

#

it would be nice to have as a comparison

#

but in the spirit of minimization

#

I will say @whole goblet the unfun thing is that in all likelihood, after fresh-implementing things, there are usually 2-3 major bugs and a bunch of minor ones, sometimes maybe more depending on size of implementation

#

(and depends on implementer as well, you'll get a vibe for what yours are)

#

(usually these are ones too that are more silent)

#

so, doing very thorough tracethroughs of the code and what's happening oftentimes helps with these, esp on smaller toy examples

whole goblet Oct 3, 2025, 9:02 PM

#

Yea, I did try to bug bash before really touching too much

tranquil fiber Oct 3, 2025, 9:02 PM

#

yeah

#

just something to keep in mind

whole goblet Oct 3, 2025, 9:03 PM

#

I had a major bug early where I was just generating the same neuron k times

tranquil fiber Oct 3, 2025, 9:03 PM

#

yeah

#

it can be subtle

#

even in proper implementations there are things that can be hard to discover e.g. MoE router collapse

#

(which is also why having a good testbench w/ a 5-10 minute turnaround for basic tests is really really useful for in-the-loop debugging)

#

(but longer also works too, anything over an hour gets to be a bit more iffy w/ exploration stuff)

#

it may be good to also plot tons of statistics in tb (or the like) about what's happening in the routers, e.g. router weight distributions, weight similarities for generated values, etc

#

to see how that evolves over time

#

you generally want things to be gaussian, (or, if it's principled, log-normal -- but you really should be sure as to why here), if things collapse or spread out weirdly then there's a sign that something might be off in your training

#

(max and min also is super helpful for determining this)

#

basically, putting eyes everywhere on what's happening in your code

#

you gotta know intimately what's happening and evolving over time in training, to make sure that everything is healthy
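
A rough sketch of the kind of per-step router summary described above, assuming you can pull out a flat list of router scores and the selected expert indices for a batch. The function name, field names, and the toy inputs are all hypothetical; the resulting dict could be logged to tensorboard or similar:

```python
import math
import statistics

def router_stats(scores, selected, num_experts):
    """Summarize router health for one logging step.

    scores: flat list of router scores (floats).
    selected: flat list of chosen expert indices (ints).
    """
    counts = [0] * num_experts
    for idx in selected:
        counts[idx] += 1
    total = len(selected)
    probs = [c / total for c in counts if c > 0]
    # Entropy of expert usage, normalized to [0, 1]; near 0 suggests collapse.
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    max_entropy = math.log(num_experts)
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
        "usage_entropy": entropy / max_entropy if max_entropy > 0 else 0.0,
        "dead_fraction": counts.count(0) / num_experts,
    }

# Toy inputs: one batch spreads over all experts, the other picks only one.
healthy = router_stats([0.1, -0.2, 0.3, 0.0], [0, 1, 2, 3], num_experts=4)
collapsed = router_stats([5.0, 5.1, 4.9, 5.2], [2, 2, 2, 2], num_experts=4)
print(healthy["usage_entropy"], collapsed["usage_entropy"])
```

Tracking min, max, and usage entropy over training steps is what makes a collapse or a weird spread visible early, before it shows up in the loss.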

whole goblet Oct 3, 2025, 9:08 PM

#

Makes sense. I'll start collecting more of that

tranquil fiber Oct 3, 2025, 9:08 PM

#

yeah, keeping an eye on cheapness and what can you spot early that predicts other things later on as well

#

part of dealing with new methods is establishing cordons for good performance vs bad, and basically diagnosing what's going on in any given area

#

(which can be quite hard and expensive at first as you build up intuition for yourself, sometimes building out that toolbox can be a lot of work! but it is well worth it in the end)

tranquil fiber Oct 3, 2025, 9:10 PM

#

whole goblet Makes sense. I'll start collecting more of that

one thing, e.g.: the variance of your method's performance really does beg for a larger batchsize, that's an easier one (if you look at the wiggle of the train loss and the val loss, you can see that the train loss variance is leaking into the val loss variance, which is generally a big no-no, and it seems to be pretty strong too, so that's one avenue that the network performance can improve along i think)

whole goblet Oct 3, 2025, 9:12 PM

#

tranquil fiber one thing e.g. the variance of your method's performance really does beg for a l...

Yea, with the autograd tricks I can do a 64 batch size comfortably now

#

Seeing how much I can eke out with max-autotune rn

tranquil fiber Oct 3, 2025, 9:14 PM

#

whole goblet Yea, with the autograd tricks I can do a 64 batch size comfortably now

Yes, but microbatch size doesn't really matter as much as effective batch size, you can almost always increase the number of accumulation steps to (generally pretty efficiently) increase your overall effective batchsize
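
A toy illustration of that point, assuming equal-sized microbatches and a mean-reduced loss: averaging the microbatch gradients gives the same gradient as one big batch, so effective batch size is microbatch size times accumulation steps. The scalar model and data here are made up:

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average equal-sized microbatch gradients, as grad accumulation does."""
    assert len(xs) % micro_batch_size == 0
    accum_steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each microbatch contribution by 1/accum_steps before summing.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, accum)
```

Forgetting the 1/accum_steps scaling (or applying it twice) is a classic silent bug, since training still runs but the implicit learning rate changes.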

whole goblet Oct 3, 2025, 9:14 PM

#

Oh, yea 100%

tranquil fiber Oct 3, 2025, 9:14 PM

#

Ofc there's something to be said for some ops doing better with larger batchsizes so that's a good thing yeah

whole goblet Oct 3, 2025, 9:15 PM

#

Just means that I can ideally move this to more parallelizable systems once I get there

#

Without as many of the broadcast/collect portions

tranquil fiber Oct 3, 2025, 9:16 PM

#

Honestly staying on one GPU for a while is probably your best shot

#

Usually it takes...many many hours of experiments and dev, and many long times of debugging to get a truly new idea functional, very very rarely does it happen straight in the first shot

#

I think you'd need to have a win in one of the iso regimes to move forward towards a paper for it

#

(which is a bit annoying I know, but at least it does give you options to trade along)

#

And in the meantime you need (generally) as much simplicity as possible for it

whole goblet Oct 3, 2025, 9:20 PM

#

Yea, and to be clear not trying to say I've won in the iso regimes. Just moreso that if the PEER trends are accurate (which I guess is worth questioning) I can get a cheap win by just increasing k arbitrarily which doesn't increase parameters but drastically increases flop cost.

#

But that feels more useless

tranquil fiber Oct 3, 2025, 9:31 PM

#

Well PEER should be handily beating the nanogpt model iiuc

#

A lot of ML research is detective work debugging

#

Always being skeptical and assuming something is wrong

whole goblet Oct 3, 2025, 9:31 PM

#

Yea, I'm starting to wonder if there's a major problem in reproduction

tranquil fiber Oct 3, 2025, 9:31 PM

#

Yeah

whole goblet Oct 3, 2025, 9:31 PM

#

He said he had looked at lucidrains' impl and said it was good

#

But maybe didn't look that carefully or something?

#

Or I'm doing something wrong

tranquil fiber Oct 3, 2025, 9:32 PM

#

You can also try smaller # params to see if it's a quirk of having tons of PEER params (i.e. how does peer do with isoparams? Only 2-3x params? Gotta rule out something here. But only after a good bug sweep)

tranquil fiber Oct 3, 2025, 9:33 PM

#

whole goblet Or I'm doing something wrong

It's not always bad to assume that one is doing 2-3 things wrong at any given time, and even if the results beat the baseline being skeptical can be an enormous way to improve performance

whole goblet Oct 3, 2025, 9:34 PM

#

tranquil fiber You can also try smaller # params to see if it's a quirk of having tons of PEER ...

PEER with isoparams is pretty hard to do, since it grows quadratically
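
A back-of-the-envelope sketch of why matching params is awkward, under two assumptions that are mine and not confirmed here: the expert count is the square of the number of sub-keys per side (product-key retrieval), and each expert is a single neuron with one input vector and one output vector of size d_model. Function and argument names are hypothetical:

```python
def peer_expert_params(n_sub_keys, d_model):
    """Expert count and expert-pool parameter count for one layer.

    Assumes num_experts = n_sub_keys ** 2 and single-neuron experts
    (one input vector plus one output vector each). Key and query
    projection parameters are ignored.
    """
    num_experts = n_sub_keys ** 2
    return num_experts, 2 * d_model * num_experts

for n in (256, 512, 1024):
    experts, params = peer_expert_params(n, d_model=768)
    print(f"sub-keys={n} experts={experts} expert params={params}")
```

Under those assumptions, doubling the sub-key count quadruples the expert pool, so the available sizes jump in coarse steps rather than letting you dial in an exact parameter match.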

tranquil fiber Oct 3, 2025, 9:35 PM

#

Yeah, the general idea there I mean

#

"something in the same vicinity"

whole goblet Oct 3, 2025, 9:35 PM

#

tranquil fiber It's not always bad to assume that one is doing 2-3 things wrong at any given ti...

Yea, I typically do, but this is pretty straightforwardly just replacing the FFN block with PEER. It's basically an import and config

tranquil fiber Oct 3, 2025, 9:36 PM

#

There's so many things it could be 😭

#

Unfortunately

whole goblet Oct 3, 2025, 9:36 PM

#

Oh 100%. I'm just annoyed because I reimplemented it, but then decided to go with lucidrains so I can cite instead of proving that my baseline is valid

tranquil fiber Oct 3, 2025, 9:36 PM

#

It's a good idea to look at their code carefully line by line and seeing what's going on

#

‼️

#

^

whole goblet Oct 3, 2025, 9:36 PM

#

I even have an optimized kernel 😅

#

But it does leverage MLA for that impl

tranquil fiber Oct 3, 2025, 9:37 PM

#

whole goblet Oh 100%. I'm just annoyed because I reimplemented it, but then decided to go wit...

Well if your code and lucid's code agree in performance then that's a pretty good sign (if you didn't cross-reference the two)

whole goblet Oct 3, 2025, 9:39 PM

#

tranquil fiber Well if your code and lucid's code agree in performance then that's a pretty goo...

Nah I implemented it before I knew his existed

#

It was just easier to cite than use my own

tranquil fiber Oct 3, 2025, 9:40 PM

#

whole goblet Nah I implemented it before I knew his existed

(this is a hint you should probably run using both to see if they are equivalent!)

whole goblet Oct 3, 2025, 9:41 PM

#

tranquil fiber (this is a hint you should probably run using both to see if they are equivalent...

yea, it's just in a harness that was intended to replicate PEER directly, and the one I did with lucidrains was glommed onto nanogpt, so it'll require a not insignificant amount of work

#

I also haven't touched my own implementation in a few months lol

spare flame Oct 4, 2025, 12:11 AM

#

whole goblet Nah I implemented it before I knew his existed

Compare results

whole goblet Oct 4, 2025, 12:12 AM

#

spare flame Compare results

Yep, just need to do the reimplementation work. Looking back on it, we also integrated heavily with MLA so we could compute in a latent space

#

So it'll take some retooling

whole goblet Oct 4, 2025, 5:07 AM

#

80% faster, no kernel needed, just einsum bullshit

#

Just 2x slower than dense baseline now

#

For the hypernet version

#

Actually should say more than 80% faster, it's an 80% reduction in wall clock time. 178k tokens per second
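
For reference, the arithmetic behind that correction as a small sketch: a fractional cut in wall-clock time maps to a throughput multiplier of 1 / (1 - reduction). The "before" figure computed here is only implied, assuming 178k tokens per second is the after-change number:

```python
def speedup_from_time_reduction(reduction):
    """Throughput multiplier implied by a fractional cut in wall-clock time."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)

speedup = speedup_from_time_reduction(0.8)
# Implied earlier throughput, assuming 178k tok/s is the post-change figure.
implied_before = 178_000 / speedup
print(f"{speedup:.2f}x throughput, ~{implied_before:.0f} tok/s before")
```

So an 80% reduction in time is about 5x throughput, i.e. "400% faster" rather than "80% faster".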

#

It's late, need to make sure I didn't just bug the hell out of this, but it's training and seems to be identical

whole goblet Oct 4, 2025, 5:35 AM

#

Jk I messed up grad accumulation logic

whole goblet Oct 5, 2025, 3:17 PM

#

Have half a thought, that with the query being the primary input, and then router shenanigans producing conditioning coordinates and scaling factors, that I might be able to just use RoPE for positional encoding of the query. Relies on the hypernetwork being able to generate already scaled experts based on the query itself, but I don't see any reason that can't be the case. Also is massively more efficient at runtime (router right now is 15%~ of forward pass and makes backward pass pretty rough)

Going to give it a shot.

whole goblet Oct 5, 2025, 6:43 PM

#

In v1 of this, we did do a lot of computation in a latent space and it worked pretty well. It would constrain the size of the hypernetwork if we project down

whole goblet Oct 7, 2025, 3:52 PM

#

Does seem like the middle layer of the hypernetwork was not helping performance that much. Linear projection with no hidden layers is handling it just fine. My guess is that this is because the nonlinearity is captured by the generated weights, but I could be entirely wrong there.

I think this learning plus operating in a latent space would be able to get this within striking distance of throughput of a dense model, and then the question is just if performance can match. Latent space computation would also give me a lot more room to play with h x k values, which showed pretty consistent performance returns in PEER (assuming we work off of those results being reliable)

pastel linden Oct 9, 2025, 3:50 PM

#

@whole goblet How are the results coming along? Do you have anything particularly exciting to report? Or is there a write-up somewhere I can check out?

whole goblet Oct 9, 2025, 3:57 PM

#

pastel linden @whole goblet How are the results coming along? Do you have anything par...

Been ups and downs. Original idea might still have merit but don't have a clean baseline for that, and want to revisit, cc @mental plinth

Ended up chasing down eliminating expert weights altogether and I'm getting near baseline loss performance without tuning, but right now hitting some efficiency issues when using a pytorch implementation, and hitting throughput issues when trying to beat pytorch with triton. Trying a more hybrid approach today where I don't try to recompute everything on backward, and instead hand off the bulky matmuls to pytorch and let autograd handle everything else

#

It's just kind of a weird spot because I'm generating a lot of weights on the fly, so need to make sure they live as short as possible.

#

Otherwise you end up with such tiny microbatches that you can't get reasonable saturation

pastel linden Oct 9, 2025, 4:00 PM

#

Gotcha! Mostly curious / checking in about the general status. What compute resources do you have / how bottlenecked by compute are you?

whole goblet Oct 9, 2025, 4:01 PM

#

I have about 2k left in a Lambda grant, and currently trying to stretch that with a single GH200 until I have good enough throughput to justify broad ablations

#

Right now doing more single threaded stuff because of trying to get perf in a good spot, but will be bottlenecked once that's done

#

That said I have a lot of experience with k8s, so considering doing a self fund through something like sfcompute if I run out

pastel linden Oct 9, 2025, 4:06 PM

#

We have some 8xA40 machines which could be useful for testing scalability, but assuming you're looking for chonky GPUs to do actual runs that's not something we have sitting around. I'm happy to talk about working to help get you a grant or something, if the results are exciting enough.

whole goblet Oct 9, 2025, 4:07 PM

#

I'd appreciate it! Right now I'd want to make sure it's worth your time, and not beating a dense baseline and not having any throughput benefits yet is pretty hard to justify 😅

#

That said, I think there's a path where this will be beating dense baselines, but need to get at least within striking distance of appropriate throughput before I feel like I can justify another request

#

(bitter lesson and all that jazz 🙃 )

pastel linden Oct 9, 2025, 4:13 PM

#

What's the goal, in a couple sentences?

whole goblet Oct 9, 2025, 4:17 PM

#

See if we can solve the parameter explosion problem (and therefore susceptibility to hitting the memory bandwidth wall) in MoE by trading stored parameters for generated ones.

#

Like honestly the most interesting stuff at this point is that I think I've shown that you can generate a coherent set of experts on the fly when you construct them one neuron at a time, but that's mostly neat and not exactly useful right now.

#

I do also have some future work to see if I can replace what's currently a k^d operation which limits the generated expert's depth with something more efficient like faiss

whole goblet Oct 10, 2025, 3:09 PM

#

```
iter 60: loss 8.9143, trailing_100 9.6500, lr 1.80e-05, time 4567.63ms, 35870 tok/s, MFU 1.19%
iter 61: loss 8.8450, trailing_100 9.6370, lr 1.83e-05, time 4566.15ms, 35881 tok/s, MFU 1.19%
iter 62: loss 8.7013, trailing_100 9.6222, lr 1.86e-05, time 4567.74ms, 35869 tok/s, MFU 1.19%
```

#

Progress on performance through algo improvements. Now if I can just get reasonable saturation (pretty sure this isn't using tensor cores as much as it should be) we should be in spitting distance of a dense baseline
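
For context on the MFU number, a hedged sketch of how such a figure is typically estimated for a dense transformer: achieved model FLOPs per second over the hardware peak, with the common 6N plus attention-term estimate for FLOPs per token (the form nanoGPT's estimate uses). All values below are illustrative assumptions (a GPT-2-small-like shape, a made-up throughput, and an A100 bf16 peak), not the numbers from this run or its GH200, and the hypernetwork's real FLOPs per token would need its own count:

```python
def dense_flops_per_token(n_params, n_layer, n_head, head_dim, seq_len):
    """Common forward+backward estimate for a dense transformer:
    6 FLOPs per parameter plus an attention term."""
    return 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len

def mfu(tokens_per_sec, flops_per_token, peak_flops_per_sec):
    """Model FLOPs utilization: achieved model FLOPs/s over hardware peak."""
    return tokens_per_sec * flops_per_token / peak_flops_per_sec

# Hypothetical inputs for illustration only.
fpt = dense_flops_per_token(
    n_params=124_000_000, n_layer=12, n_head=12, head_dim=64, seq_len=1024
)
example = mfu(tokens_per_sec=100_000, flops_per_token=fpt,
              peak_flops_per_sec=312e12)
print(f"flops/token={fpt}, MFU={example:.2%}")
```

If the FLOPs-per-token estimate in the harness still assumes a dense FFN, the reported MFU for the generated-expert model may be off in either direction, which is worth checking before reading much into it.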

whole goblet Oct 10, 2025, 5:55 PM

#

And some tweaks to get saturation up, but does feel like I need to go get some outside advice on how to get this to be fast fast.

```
iter 512: loss 5.5459, trailing_100 5.7531, lr 1.54e-04, time 4754.85ms, 41349 tok/s, MFU 1.38%
```

whole goblet Oct 11, 2025, 3:57 PM

#

@spare flame Just a heads up, been playing with nsight a decent amount, and I'm finding that the memory bound nature of base PEER is also largely in the router because of how it materializes just a few bytes that are then extremely low intensity for future accesses. Weighing a persistent kernel for that

tranquil fiber Oct 11, 2025, 4:50 PM

#

whole goblet @spare flame Just a heads up, been playing with nsight a decent amount...

That's good, but I'd probably stick to the suggestions we gave re: baselines and comparisons before diving more deeply into technical efficiency stuff. I get that's a lot more tempting and "fun", but really if you're going for a paper you should probably work on some of the foundational comparison performance gaps and baseline stuff first

whole goblet Oct 11, 2025, 5:04 PM

#

tranquil fiber That's good, but I'd probably stick to the suggestions we gave re: baselines and...

It's pretty necessary for me to get throughput to a state where I can effectively utilize compute. I already know that it's modestly losing to a dense baseline on performance on first few passes. It's easier for me to spend time on the kernels than it is for me to spend more money on ablations right now

#

Basically nobody is going to care about if I match a dense baseline if it's 10x the wall clock time to train

#

And the PEER baseline I think I'm just going to drop. If a paper relies on a pure reproduction of PEER that's going to take significantly longer

#

And I haven't heard back from lucidrains on if he ever got his implementation to train

spare flame Oct 11, 2025, 5:34 PM

#

I was on his old discord discussing it with him and was probably the only person trying to get it to train at the time. And I decided it was too slow to bother with.

whole goblet Oct 11, 2025, 5:34 PM

#

Think it's reasonable to just drop it?

#

idk at this point it's probably better to compare to dense baseline even though it has some derivative parts

spare flame Oct 11, 2025, 5:34 PM

#

I agree with @tranquil fiber

#

I don't know whether or not peer is worth it bc I don't know if peer is good or not

whole goblet Oct 11, 2025, 5:35 PM

#

Really the calculus I'm running right now is I can improve performance dramatically with reasonable amounts of work, which stretches my compute budget further

spare flame Oct 11, 2025, 5:35 PM

#

Yeah there's a correct tradeoff wrt effort there but you'll have to decide where the line is

whole goblet Oct 11, 2025, 5:35 PM

#

It also helped me stumble on a better factorization of this

#

Yea, I just don't think I'm at that line yet. With MFU as low as it is, and getting throughput where I have, I think if I can hit reasonable MFU then I can show competitive wall clock time with dense baseline, which makes it more apples to apples.

#

Since I'm kind of competing against CuBLAS in pytorch for something as straightforward as a single hidden dense FFN

spare flame Oct 11, 2025, 5:37 PM

#

The only reason to optimize first imo is if you cant do the experiments otherwise so you can't work on improving the architecture

whole goblet Oct 11, 2025, 5:37 PM

#

That's basically where I'm at. I think I can squeeze another order of magnitude of throughput out of this

#

Which gets me a lot more experimentation

spare flame Oct 11, 2025, 5:38 PM

#

Probably

whole goblet Oct 11, 2025, 5:38 PM

#

I'm already beating baseline throughput by 35%~, so it hasn't been wasted work so far

spare flame Oct 11, 2025, 5:38 PM

#

But it won't necessarily lead to any useful result so you just gotta weigh the time cost

whole goblet Oct 11, 2025, 5:39 PM

#

Yea, from my perspective once this grant is over, if I don't have interesting results, I probably won't try to get more compute. So this should get me my best shot

#

And if it fails, then it fails

#

I'll open source the negative result and move on

#

Basically just trying to avoid "And I decided it was too slow to bother with." for this arch

spare flame Oct 11, 2025, 6:01 PM

#

btw the reason to follow up more on PEER is because it has implications for why yours is underperforming

#

I dont remember if they had an equiparameter study in their paper etc.

#

but if they were able to show PEER outperforming then it could be worth trying to figure out how to get to that regime

#

(they could also have just messed up somehow, who knows, so all of this is a big question mark - you can never trust any results that no one has replicated)

whole goblet Oct 11, 2025, 6:04 PM

#

spare flame I dont remember if they had an equiparameter study in their paper etc.

They only did IsoFLOP :/ I'd imagine isoparameter would be pretty bad

#

tbh v1 of ETHOS is likely a better baseline for understanding performance, despite the vocab size mismatch. It hit pretty reasonable loss with the latent expert approach when adjusted for vocab size

tranquil fiber Oct 11, 2025, 6:11 PM

#

whole goblet Basically nobody is going to care about if I match a dense baseline if it's 10x ...

Quite the opposite, you need to make it work first, then speed it up. You don't know what's causing the gap, so leaning in to write a specialized kernel will only make ablations harder.

You can compare first 100 steps and that should be enough roughly, if your variance is low enough (I've given advice for reducing that as well)

whole goblet Oct 11, 2025, 6:12 PM

#

tranquil fiber Quite the opposite, you need to make it work first, then speed it up. You don't ...

First hundred steps seem dominated by hyperparams more than anything else. I'm still at like loss 9 at that point

spare flame Oct 11, 2025, 6:12 PM

#

Wes, last time it took a long time to finally do the nanogpt run instead of doing it first like i had suggested
I recommend that this time you listen to fern and my suggestion

#

in order to save yourself a lot of time and effort

whole goblet Oct 11, 2025, 6:12 PM

#

But I have a nanogpt baseline now, so I'm not sure how this differs

tranquil fiber Oct 11, 2025, 6:13 PM

#

Having a strong sense of direction is okay but you're kind of shooting yourself in the foot with some of the research direction, it would be good to listen to the advice for it.

spare flame Oct 11, 2025, 6:13 PM

#

this differs in the sense that you're going to do things in the opposite order of what will make it go fastest for you

tranquil fiber Oct 11, 2025, 6:13 PM

#

Yes, agreed

#

We've both been doing this for quite a while!

whole goblet Oct 11, 2025, 6:13 PM

#

100%, but maybe I'm not understanding the advice then?

tranquil fiber Oct 11, 2025, 6:14 PM

#

I think that's my vibe

spare flame Oct 11, 2025, 6:14 PM

#

you can do it in any order, its only a question of how long it takes you to succeed/give up 🤣

tranquil fiber Oct 11, 2025, 6:14 PM

#

I'm not sure quite how to make it "click" however

tranquil fiber Oct 11, 2025, 6:14 PM

#

spare flame you can do it in any order, its only a question of how long it takes you to succ...

Yes definitely

spare flame Oct 11, 2025, 6:14 PM

#

whole goblet 100%, but maybe I'm not understanding the advice then?

our advice is to try NOT to speed it up a lot first

tranquil fiber Oct 11, 2025, 6:14 PM

#

Yep

#

Definitely

whole goblet Oct 11, 2025, 6:14 PM

#

Yea, I see this as less of time constraint and more of budget constraint. Costs me next to nothing to write/test kernels. Takes a lot more to actually run relevant tests since I've been told at different points that I'm undertraining, but now it sounds like I'm overtraining?

spare flame Oct 11, 2025, 6:15 PM

#

whole goblet Yea, I see this as less of time constraint and more of budget constraint. Costs ...

everything is relative to the specific needs at that point, unfortunately, which can be confusing

whole goblet Oct 11, 2025, 6:15 PM

#

Like I just won't have the ability to run N tests at a slow pace

tranquil fiber Oct 11, 2025, 6:15 PM

#

You're still in the stage where you likely have a ton of bugs/initial arch issues, you need to understand the dynamics before speeding things up

whole goblet Oct 11, 2025, 6:15 PM

#

tranquil fiber You're still in the stage where you likely have a ton of bugs/initial arch issue...

I really don't think I am. The pytorch baseline is pretty tight at this point

spare flame Oct 11, 2025, 6:15 PM

#

if its tight then maybe you should give up

#

the goal is to find something better than what people already use

tranquil fiber Oct 11, 2025, 6:16 PM

#

(bugs here being e.g. initialization preventing certain things from working as well, etc)

whole goblet Oct 11, 2025, 6:16 PM

#

spare flame if its tight then maybe you should give up

Can you explain why? I'm behind dense baseline only barely without hyperparameter sweeps or ablations

tranquil fiber Oct 11, 2025, 6:16 PM

#

It takes time (sometimes several months!) of close examination to find them

whole goblet Oct 11, 2025, 6:16 PM

#

tranquil fiber It takes time (sometimes several months!) of close examination to find them

Sure, but I guess I'm not understanding what the advice is then? Should I be looking somewhere else?

tranquil fiber Oct 11, 2025, 6:17 PM

#

whole goblet Can you explain why? I'm behind dense baseline only barely without hyperparamete...

.004-.007 would be barely, .01 is significant, .015 diff is clear difference

spare flame Oct 11, 2025, 6:18 PM

#

whole goblet Can you explain why? I'm behind dense baseline only barely without hyperparamete...

can't you just make changes/hyperparams/whatever until it starts being lower loss than nanogpt early on

#

you dont need a full run or even a partial run barely to try to find that

tranquil fiber Oct 11, 2025, 6:18 PM

#

whole goblet Sure, but I guess I'm not understanding what the advice is then? Should I be loo...

Yes, I'd take a look at #1395195891262029884 message

whole goblet Oct 11, 2025, 6:18 PM

#

tranquil fiber .004-.007 would be barely, .01 is significant, .015 diff is clear difference

I see that diff within the same arch just with different seeds even on dense baseline. Unless you mean .04-.07, etc?

tranquil fiber Oct 11, 2025, 6:19 PM

#

Especially #1395195891262029884 message

whole goblet Oct 11, 2025, 6:19 PM

#

spare flame can't you just make changes/hyperparams/whatever until it starts being lower los...

Sure, but that usually involves cranking LR to a point where it starts to plateau early. I have runs where that has happened

tranquil fiber Oct 11, 2025, 6:20 PM

#

whole goblet I see that diff within the same arch just with different seeds even on dense bas...

You may have to t-test when they are close if variance is that high, but I do mean .004-.007, .01, and .015. anything above the variance threshold is usually so far off you don't need a t test for it
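
For the close cases, a Welch t-test over final val losses from a few seeds per arm is one way to do that check. A minimal sketch using only the standard library; the loss values below are made up for illustration and are not from any run in this thread:

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom) for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical final val losses, four seeds per arm (illustrative numbers only).
baseline = [3.280, 3.277, 3.283, 3.279]
variant = [3.291, 3.288, 3.294, 3.286]

t, df = welch_t(variant, baseline)
print(f"mean gap={mean(variant) - mean(baseline):.4f}  t={t:.2f}  df={df:.1f}")
```

The t statistic and degrees of freedom can then be looked up against a t distribution; with only a handful of seeds the test has little power, so it is mainly useful for gaps near the seed-to-seed noise.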

whole goblet Oct 11, 2025, 6:20 PM

#

tranquil fiber (which is also why having a good testbench w/ a 5-10 minute turnaround for basic...

This is what I'm trying to get to, so I can identify the problems, but I'm also being told not to focus on speed :/ I just don't really know what you're asking for here

whole goblet Oct 11, 2025, 6:21 PM

#

tranquil fiber You may have to t-test when they are close if variance is that high, but I do me...

That level of differentiation usually doesn't occur until we're later in training

#

And I've also been warned this could have entirely different training dynamics?

tranquil fiber Oct 11, 2025, 6:21 PM

#

whole goblet This is what I'm trying to get to, so I can identify the problems, but I'm also ...

Yeah, you don't always need speeds for that, just pick the slice of what's happening at the beginning of training and watch that with your tensorboard logs to see if you can pick out some trends

whole goblet Oct 11, 2025, 6:22 PM

#

tranquil fiber Yeah, you don't always need speeds for that, just pick the slice of what's happe...

Usually I need to get at least 1000 iterations in to start to gauge performance of a hyperparam change

tranquil fiber Oct 11, 2025, 6:22 PM

#

whole goblet And I've also been warned this could have entirely different training dynamics?

Yes, definitely. If you're able to log a long-enough run to pick up on any differences, you can use that to try to differentiate what's different in terms of training dynamics for them

whole goblet Oct 11, 2025, 6:22 PM

#

usually if it early plateaus it's been in the 4-5 loss range

tranquil fiber Oct 11, 2025, 6:22 PM

#

tranquil fiber Yes, definitely. If you're able to log a long-enough run to pick up on any diff...

Then you can use that for the shorter runs

whole goblet Oct 11, 2025, 6:23 PM

#

tranquil fiber Yes, definitely. If you're able to log a long-enough run to pick up on any diff...

Yea, this just seems to happen pretty late. The router in this arch is closer to an inline encoder for the hypernetwork, so early early it can vary quite a bit just on hyperparams alone

tranquil fiber Oct 11, 2025, 6:23 PM

#

whole goblet Usually I need to get at least 1000 iterations in to start to gauge performance ...

That is a long time, yeah (by contrast that's about 70-80% of a modded-nanoGPT run, but, different beast and all)

whole goblet Oct 11, 2025, 6:23 PM

#

It's a joint optimization problem in each layer

whole goblet Oct 11, 2025, 6:24 PM

#

tranquil fiber That is a long time, yeah (by contrast that's about 70-80% of a modded-nanoGPT r...

Really? Base nanoGPT is only at 4.57 trailing 100 loss at that stage

tranquil fiber Oct 11, 2025, 6:24 PM

#

modded-nanoGPT is ~4.35 @ 125 steps

#

But it is a tough baseline to beat

whole goblet Oct 11, 2025, 6:24 PM

#

So am I using the wrong harness, then? I was told to just use default nanoGPT

#

Since it would be easier for reviewers to reproduce

spare flame Oct 11, 2025, 6:25 PM

#

its also bc its easier for you to use

#

and less complicated

tranquil fiber Oct 11, 2025, 6:25 PM

#

We had a pretty lengthy discussion where I suggested it but you wanted to stick with nanogpt since your previous runs were in it. And that's okay too, nanogpt is a pretty decent baseline that's well accepted so I think that's alright

#

Yeah

#

It's just more compute

#

Unfortunately

whole goblet Oct 11, 2025, 6:25 PM

#

spare flame its also bc its easier for you to use

tbh my earlier harness felt easier that was built on dsv3, but I was told that it would have reproducibility problems so I abandoned it

spare flame Oct 11, 2025, 6:26 PM

#

whole goblet tbh my earlier harness felt easier that was built on dsv3, but I was told that i...

using your own anything is not a good way to go

#

for the same reason that now we dont even know if PEER is real

tranquil fiber Oct 11, 2025, 6:26 PM

#

Nanogpt is definitely a step up from self coded harnesses since it's verified code

#

(tho if you verify your own harness with a baseline that should be okayish)

whole goblet Oct 11, 2025, 6:27 PM

#

Yea I mean the earlier harness was literally just DSv3's attention block

#

So I figured that was fine

#

But yea, maybe a port to nanogpt is a good idea at this point

tranquil fiber Oct 11, 2025, 6:28 PM

#

Wes if you're able to find some statistics over training and link those to network performance, and use that to ratchet down what you think is happening, that may set you up for a quick-loop harness to iterate over

spare flame Oct 11, 2025, 6:28 PM

#

yeah to be clear, I am not promoting the idea that hyperparameters are going to be the magic bullet here

whole goblet Oct 11, 2025, 6:28 PM

#

tranquil fiber Wes if you're able to find some statistics over training and link those to netwo...

I feel like I have a pretty good idea of what's happening in the network. That's part of why I jumped for triton kernels. Makes me explicitly engage with that

tranquil fiber Oct 11, 2025, 6:29 PM

#

whole goblet I feel like I have a pretty good idea of what's happening in the network. That's...

Yeah it's possible, just a risk

#

Like, running on paper what's happening with each number and their magnitudes throughout training is super super useful

#

Seeing if there are any outliers in activations, etc (and why)

whole goblet Oct 11, 2025, 6:29 PM

#

spare flame yeah to be clear, I am not promoting the idea that hyperparameters are going to ...

I don't think they'll be a magic bullet, but I do think that a single config that's not optimized can make up a .1 nat difference

tranquil fiber Oct 11, 2025, 6:29 PM

#

Also logging, extensive logging

#

That's pretty important

whole goblet Oct 11, 2025, 6:29 PM

#

tranquil fiber Also logging, extensive logging

Yea, I've got that now

tranquil fiber Oct 11, 2025, 6:29 PM

#

Nice

whole goblet Oct 11, 2025, 6:30 PM

#

tranquil fiber Seeing if there are any outliers in activations, etc (and why)

How would I check that?

tranquil fiber Oct 11, 2025, 6:30 PM

#

Are you looking at your histograms over time?

spare flame Oct 11, 2025, 6:30 PM

#

Wes is there some reason this is going to scale better than traditional ffn

tranquil fiber Oct 11, 2025, 6:30 PM

#

whole goblet How would I check that?

The logging I suggested from earlier

#

Histogram everything

#

In tensorboard

#

(dense logging can help catch changes as well)
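
Alongside the histograms, a couple of scalar summaries per tensor can make outliers easy to spot in a plain scalar plot. This is only an illustrative sketch (it is not the logging setup suggested earlier), and `outlier_stats` is a hypothetical helper name:

```python
import numpy as np


def outlier_stats(acts):
    """Summarize how heavy-tailed an activation tensor is."""
    flat = np.asarray(acts, dtype=np.float64).ravel()
    max_abs = float(np.abs(flat).max())
    rms = float(np.sqrt(np.mean(flat ** 2)))
    centered = flat - flat.mean()
    # Kurtosis is about 3 for Gaussian data and grows with heavy tails.
    kurtosis = float(np.mean(centered ** 4) / np.mean(centered ** 2) ** 2)
    return {"max_abs": max_abs, "rms": rms, "max_over_rms": max_abs / rms, "kurtosis": kurtosis}


rng = np.random.default_rng(0)
clean = rng.standard_normal(10_000)
spiked = clean.copy()
spiked[0] = 100.0  # inject one large outlier

print(outlier_stats(clean))
print(outlier_stats(spiked))
```

Logging `max_over_rms` and `kurtosis` per layer at each logging step would show when a single activation starts to dominate, which the histograms can then help explain.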

whole goblet Oct 11, 2025, 6:31 PM

#

spare flame Wes is there some reason this is going to scale better than traditional ffn

Yep. The PEER block is basically expert parallelism with deterministic behavior that doesn't require scatter gathers the way tensor parallelism in dense networks does

spare flame Oct 11, 2025, 6:31 PM

#

whole goblet Yep. The PEER block is basically expert parallelism with deterministic behavior ...

ok so can you just allow it to be a bit worse but scale it

whole goblet Oct 11, 2025, 6:31 PM

#

spare flame ok so can you just allow it to be a bit worse but scale it

Sure, just requires multi-gpu and some small rewrites

#

It also (if my assessment ends up being correct) should be faster in practice than a dense FFN in inference

#

Just slower during training

#

Which is another reason I'm getting this low level

spare flame Oct 11, 2025, 6:32 PM

#

I see

whole goblet Oct 11, 2025, 6:33 PM

#

But yea, basically I have it to where you never have to leave the chip (at most SMEM writes) after attention

spare flame Oct 11, 2025, 6:33 PM

#

ok I dunno then - maybe its worth kernel work if it could be better in practical ways even if its maybe a bit less great than a normal FFN in performance

whole goblet Oct 11, 2025, 6:33 PM

#

spare flame ok I dunno then - maybe its worth kernel work if it could be better in practical...

I think I can likely match it too with tuning, but it sounds like y'all aren't confident in that. This is really just my first config that's .1 nat behind once we're further into training

#

.13 to be exact

spare flame Oct 11, 2025, 6:34 PM

#

whole goblet But yea, basically I have it to where you never have to leave the chip (at most ...

I don't really know what this means, because if you are using a lot of parameters those parameters are going to have to get to SRAM somehow

whole goblet Oct 11, 2025, 6:34 PM

#

spare flame I don't really know what this means, because if you are using a lot of parameter...

Sure, I guess I should be specific that they're single load

#

I'm saying intermediates never leave at this point, so it's getting pretty optimized

#

No readbacks

#

Just need to get tiling correct

spare flame Oct 11, 2025, 6:35 PM

#

how is that better than dense matmul?

whole goblet Oct 11, 2025, 6:35 PM

#

Scales better than tensor parallelism because no scatter gathers between PEER blocks

#

Just a single reduce before the next layer

#

Basically you get the benefits of expert parallelism without the load balancing issues

spare flame Oct 11, 2025, 6:37 PM

#

that sounds good, how does that occur?

whole goblet Oct 11, 2025, 6:37 PM

#

Because each GPU would just receive a broadcast of the post attention token batch. Everything is pipelined between the router and output at that point, because the router and hypernetwork are tightly coupled.

#

In traditional PEER you'd be selecting the discrete experts from a massive pool. Same thing happens at lower scale for other MoE models

#

Here, because you're constructing the expert, you don't have the same problem

spare flame Oct 11, 2025, 6:39 PM

#

youre able to slice up the thing that constructs the experts somehow across gpus?

whole goblet Oct 11, 2025, 6:40 PM

#

More that as long as Heads % GPU_count == 0, each GPU can handle Heads / GPU_count heads, if that makes sense

spare flame Oct 11, 2025, 6:40 PM

#

sure

whole goblet Oct 11, 2025, 6:40 PM

#

And then once each GPU has finished its batch, you just have a single reduce for their outputs
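
The head split described here can be sketched as a static assignment of contiguous head indices to ranks. `heads_for_rank` is a hypothetical helper, and the final reduce is only noted in a comment (in a real multi-GPU setup it would presumably be an all-reduce sum, which is an assumption):

```python
def heads_for_rank(num_heads: int, world_size: int, rank: int) -> range:
    """Return the head indices a given rank is responsible for."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    per_rank = num_heads // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


# Example: 8 heads over 4 GPUs gives 2 heads per GPU.
for rank in range(4):
    print(rank, list(heads_for_rank(8, 4, rank)))

# Each rank would sum the outputs of its own heads locally; a single reduce
# across ranks then yields the layer output.
```

Because the assignment depends only on the head count and world size, every rank does the same amount of work regardless of which tokens arrive.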

spare flame Oct 11, 2025, 6:41 PM

#

multiple head FFN doesnt typically work great

#

why does this?

whole goblet Oct 11, 2025, 6:41 PM

#

Because you're determining what kind of computation each token needs on the fly

#

MoE does this, but it's in a very discrete way, which is why you get things like aux loss for load balancing in most architectures

spare flame Oct 11, 2025, 6:42 PM

#

hmm ok I think I see the outline of the idea generally

#

so basically you're making FFN worse (but more easily parallelizable) by using heads, but making it better again by hypernetwork somehow

whole goblet Oct 11, 2025, 6:42 PM

#

Give or take

#

Like if I break down PEER (assuming it was real when the paper was produced)

spare flame Oct 11, 2025, 6:43 PM

#

I'd like you to run another experiment where you just divide the FFN into heads

whole goblet Oct 11, 2025, 6:43 PM

#

I can do that

spare flame Oct 11, 2025, 6:43 PM

#

thats a kind of ablation for this concept

#

should be super fast to run

whole goblet Oct 11, 2025, 6:43 PM

#

Yea, agreed

spare flame Oct 11, 2025, 6:44 PM

#

because right now we don't know if you're just exactly matching that

#

due to that being a part of your change

whole goblet Oct 11, 2025, 6:45 PM

#

whole goblet Like if I break down PEER (assuming it was real when the paper was produced)

The router was basically doing two things:

1. Determining what kind of compute was needed for a given token. This was encoded as the expert query
2. Determining exactly how to do that compute, neuron by neuron. That's the coordinate used for retrieval + scaling each neuron

So here I'm just using that query to generate a neuron, and then using the coordinates to condition generation of each neuron, and the coordinate score to scale them

whole goblet Oct 11, 2025, 6:45 PM

#

spare flame because right now we don't know if you're just exactly matching that

Yea, 100%

spare flame Oct 11, 2025, 6:45 PM

#

also, if you are doing BETTER than head-ffn, that's interesting, since you're doing head-ffn and then more stuff on top

whole goblet Oct 11, 2025, 6:45 PM

#

whole goblet The router was basically doing two things: 1. Determining what kind of compute w...

This, at least if I did the math right, is identical to constructing a rank-k expert instead of k rank-1 experts. And then you get diversity via multiple heads
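
The algebra behind that claim can be checked numerically. A minimal sketch, assuming the PEER-style form where each selected expert is a single neuron with input weights `u_i`, output weights `v_i`, and a router score `g_i` (all shapes and the ReLU activation are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4
x = rng.standard_normal(d)
U = rng.standard_normal((k, d))  # row i: input weights u_i of expert i
V = rng.standard_normal((k, d))  # row i: output weights v_i of expert i
g = rng.random(k)                # router score per selected expert


def act(z):
    return np.maximum(z, 0.0)


# k separate rank-1 experts, scaled by their scores and summed.
y_loop = sum(g[i] * act(U[i] @ x) * V[i] for i in range(k))

# One expert with hidden width k built by stacking the same weights.
y_stacked = (g * act(U @ x)) @ V

assert np.allclose(y_loop, y_stacked)
```

The stacked form is a small two-layer MLP whose hidden width equals k, which is what "rank-k expert" refers to in this sketch.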

whole goblet Oct 11, 2025, 6:47 PM

#

spare flame also, if you are doing BETTER than head-ffn, that's interesting, since you're do...

Would you be willing to review the multi head FFN approach just to make sure it stands up to what you're thinking? Should be able to have a version by tomorrow. Would just be a single code block

spare flame Oct 11, 2025, 6:50 PM

#

I probably dont have time (actually I gotta go do some stuff right now) but btw I also don't really know exactly what headedness you're describing in the first place 🤣

whole goblet Oct 11, 2025, 6:51 PM

#

In this case, just a PEER head

#

PEER defines a head as a router plus the k experts that router selects

#

So you have H routers per FFN

#

And no worries. I'll just give it a shot. I need to do some doc review on if there have been parallel FFNs without routing before

spare flame Oct 11, 2025, 6:52 PM

#

Is the idea just changing
matmul(act(matmul(x, A)), B)
into
sum( bmm( act( bmm(x,As) ), Bs ) )
?

Like a bunch of smaller ffns summed?

whole goblet Oct 11, 2025, 6:52 PM

#

Yep exactly

#

Because that would be the exact dense baseline without any complexity from what I'm doing

#

So basically instead of the 768 -> 3072 expansion in GPT-2 small per FFN, you'd have 8 768 -> 384 bottlenecks that are summed after

#

Looks like closest might be GroupBERT?

spare flame Oct 11, 2025, 6:57 PM

#

but the sum of a bunch of ffns is mathematically equivalent to a single wide ffn
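
That equivalence is easy to verify numerically: concatenating the per-head input matrices along the hidden axis and stacking the output matrices gives one wide FFN with the same output. A small sketch with tiny stand-in sizes (the ReLU-squared activation is an arbitrary choice; any elementwise activation works):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, heads, bottleneck = 8, 4, 6  # tiny stand-ins for 768, 8, 384
x = rng.standard_normal((3, dim))
As = rng.standard_normal((heads, dim, bottleneck))
Bs = rng.standard_normal((heads, bottleneck, dim))


def act(z):
    return np.maximum(z, 0.0) ** 2


# Sum of several narrow FFNs.
y_heads = sum(act(x @ As[h]) @ Bs[h] for h in range(heads))

# One wide FFN built from the same weights.
A_wide = np.concatenate(list(As), axis=1)  # (dim, heads * bottleneck)
B_wide = np.concatenate(list(Bs), axis=0)  # (heads * bottleneck, dim)
y_wide = act(x @ A_wide) @ B_wide

assert np.allclose(y_heads, y_wide)
```

The identity only holds because the activation is elementwise and nothing mixes hidden units across heads before the output projection.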

whole goblet Oct 11, 2025, 6:57 PM

#

Yep

#

That's kind of the point, right?

spare flame Oct 11, 2025, 6:57 PM

#

so.. that means I don't understand the 'benefit' youre obtaining vs doing that for a FFN across gpus

#

I thought you said yours was more efficient bc of lack of tensor parallelism or something

whole goblet Oct 11, 2025, 6:58 PM

#

Fewer scatter gathers because you don't need to do that for each layer. Benefit is limited in GPT-2 style FFNs, but the second you add an additional hidden layer, benefit emerges

spare flame Oct 11, 2025, 6:59 PM

#

ok i dont understand but this is getting beyond what I can spend time on

#

sorry 🙁

whole goblet Oct 11, 2025, 6:59 PM

#

But the benefit still exists because it's a single scatter gather instead of 2. Would be the same with the multiheaded FFN approach

#

you're good

spare flame Oct 11, 2025, 7:06 PM

#

one last thought.. if the main benefit is this kind of practical speedup, maybe you should write up a clear explanation of what situations that is expected to occur

#

and what subset of the invention is required for that speedup

whole goblet Oct 11, 2025, 7:07 PM

#

Makes sense

#

I don't know if I've explored the architecture enough to know that its benefit is only a practical speedup on multi-node, but I can definitely try to isolate that aspect

#

Just going to switch to modded-nanogpt if the baseline needs to change again

tranquil fiber Oct 11, 2025, 7:12 PM

#

Yeah, modded-nanoGPT upside is faster experiments, downside is it will likely be very very hard to beat the baseline

#

Since it's a very highly tuned run

whole goblet Oct 11, 2025, 7:13 PM

#

Yea, I mean, I'm just swapping out the FFN, so if I get compute budget back, I'm fine with that

whole goblet Oct 11, 2025, 7:34 PM

#

```
import torch.nn.functional as F
from torch import Tensor, nn

# CastedLinear is assumed to be the linear layer from the modded-nanogpt training script.


class MultiHeadFFN(nn.Module):  # class line was cut off in the paste; this name is a placeholder
    """Multi-headed FFN: splits into parallel bottlenecks and sums outputs."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        total_intermediate = 4 * dim
        assert total_intermediate % num_heads == 0
        self.bottleneck_dim = total_intermediate // num_heads

        # Create parallel FFN heads
        self.heads = nn.ModuleList([
            nn.ModuleDict({
                'c_fc': CastedLinear(dim, self.bottleneck_dim),
                'c_proj': CastedLinear(self.bottleneck_dim, dim),
            })
            for _ in range(num_heads)
        ])

        # Zero init projections
        for head in self.heads:
            head['c_proj'].weight.detach().zero_()

    def forward(self, x: Tensor):
        outputs = []
        for head in self.heads:
            h = head['c_fc'](x)
            h = F.relu(h).square()  # ReLU squared activation
            h = head['c_proj'](h)
            outputs.append(h)
        return sum(outputs)
```

#

Simple implementation, will let y'all know how it does on modded nano

whole goblet Oct 11, 2025, 8:20 PM

#

tranquil fiber modded-nanoGPT is ~4.35 @ 125 steps

Not sure if there's something vastly different from running on a single GPU. Only disabled a world_size == 8 assertion and changed align_to_bos=True to false in the data loader. Everything else is just cloned directly from main.

Base (for reproduction on single GH200):

```
step:0/1750 val_loss:10.8258 train_time:0ms step_avg:0.01ms
step:125/1750 val_loss:5.5574 train_time:14117ms step_avg:112.93ms
step:250/1750 val_loss:4.9899 train_time:28250ms step_avg:113.00ms
step:375/1750 val_loss:4.7004 train_time:42454ms step_avg:113.21ms
step:500/1750 val_loss:4.5114 train_time:56892ms step_avg:113.78ms
step:625/1750 val_loss:4.3930 train_time:71403ms step_avg:114.25ms
step:750/1750 val_loss:4.3105 train_time:86070ms step_avg:114.76ms
step:875/1750 val_loss:4.2550 train_time:100809ms step_avg:115.21ms
step:1000/1750 val_loss:4.1766 train_time:115700ms step_avg:115.70ms
step:1125/1750 val_loss:4.1036 train_time:130682ms step_avg:116.16ms
step:1250/1750 val_loss:4.0288 train_time:145701ms step_avg:116.56ms
step:1375/1750 val_loss:3.9647 train_time:160736ms step_avg:116.90ms
step:1500/1750 val_loss:3.9090 train_time:175922ms step_avg:117.28ms
step:1625/1750 val_loss:3.8606 train_time:191158ms step_avg:117.64ms
step:1750/1750 val_loss:3.8198 train_time:206425ms step_avg:117.96ms
```

#

Multiheaded FFN:

```
step:125/1750 val_loss:5.5590 train_time:39192ms step_avg:313.54ms
step:250/1750 val_loss:5.0048 train_time:78624ms step_avg:314.49ms
step:375/1750 val_loss:4.6950 train_time:118188ms step_avg:315.17ms
step:500/1750 val_loss:4.5055 train_time:158052ms step_avg:316.10ms
step:625/1750 val_loss:4.3863 train_time:197880ms step_avg:316.61ms
step:750/1750 val_loss:4.3022 train_time:237781ms step_avg:317.04ms
step:875/1750 val_loss:4.2447 train_time:277795ms step_avg:317.48ms
step:1000/1750 val_loss:4.1717 train_time:317970ms step_avg:317.97ms
step:1125/1750 val_loss:4.0932 train_time:358107ms step_avg:318.32ms
step:1250/1750 val_loss:4.0169 train_time:398400ms step_avg:318.72ms
step:1375/1750 val_loss:3.9494 train_time:438691ms step_avg:319.05ms
step:1500/1750 val_loss:3.8911 train_time:478944ms step_avg:319.30ms
step:1625/1750 val_loss:3.8419 train_time:519242ms step_avg:319.53ms
step:1750/1750 val_loss:3.8008 train_time:559605ms step_avg:319.77ms
```

Testing my pytorch version next once I get it ported over.

tranquil fiber Oct 11, 2025, 8:24 PM

#

Don't forget to run your baseline!

#

Turning off align_bos does change things

whole goblet Oct 11, 2025, 8:25 PM

#

tranquil fiber Don't forget to run your baseline!

Baseline was the first one

tranquil fiber Oct 11, 2025, 8:25 PM

#

I'd recommend using kosarsky's January record, that is much easier to convert to 1 GPU

tranquil fiber Oct 11, 2025, 8:25 PM

#

whole goblet Baseline was the first one

It is an incorrect baseline, something is wildly off in the loss

#

Should have 3.28 very consistently (+/- a bit)

whole goblet Oct 11, 2025, 8:26 PM

#

tranquil fiber It is an incorrect baseline, something is wildly off in the loss

I mean, those are the only changes I made, but yea I'll go try an older version.

tranquil fiber Oct 11, 2025, 8:27 PM

#

It always should converge to ~3.28

whole goblet Oct 11, 2025, 8:27 PM

#

tranquil fiber I'd recommend using kosarsky's January record, that is much easier to convert to...

Which one is this?

tranquil fiber Oct 11, 2025, 8:28 PM

#

I can't remember the name off the top of my head, you can look in the records folder for it, I believe it's the tanh scaling upgrade iirc

whole goblet Oct 11, 2025, 8:28 PM

#

I'll scroll through

#

Does kosarsky go by a different name?

#

nvm found it

whole goblet Oct 11, 2025, 9:04 PM

#

Okay, yea. That's more normal then. Baseline, no changes needed from that version:

```
step:125/1390 val_loss:4.3667 train_time:119635ms step_avg:1040.31ms
step:250/1390 val_loss:3.9498 train_time:254696ms step_avg:1061.23ms
step:375/1390 val_loss:3.7707 train_time:392056ms step_avg:1074.13ms
step:500/1390 val_loss:3.6554 train_time:531675ms step_avg:1085.05ms
step:625/1390 val_loss:3.5748 train_time:671656ms step_avg:1092.12ms
step:750/1390 val_loss:3.5223 train_time:811987ms step_avg:1097.28ms
step:875/1390 val_loss:3.4717 train_time:954645ms step_avg:1103.64ms
step:1000/1390 val_loss:3.4056 train_time:1100304ms step_avg:1111.42ms
step:1125/1390 val_loss:3.3539 train_time:1246828ms step_avg:1118.23ms
step:1250/1390 val_loss:3.3070 train_time:1393206ms step_avg:1123.55ms
step:1375/1390 val_loss:3.2783 train_time:1538654ms step_avg:1127.22ms
step:1390/1390 val_loss:3.2775 train_time:1556107ms step_avg:1127.61ms
```

whole goblet Oct 11, 2025, 9:41 PM

#

whole goblet Multiheaded FFN: ```step:0/1750 val_loss:10.8258 train_time:0ms step_avg:0.01ms ...

Take 2 on older harness that's single GPU friendly for multiheaded FFN:

```
step:125/1390 val_loss:4.3925 train_time:155044ms step_avg:1348.21ms
step:250/1390 val_loss:3.9615 train_time:328091ms step_avg:1367.05ms
step:375/1390 val_loss:3.7857 train_time:504000ms step_avg:1380.82ms
step:500/1390 val_loss:3.6669 train_time:680342ms step_avg:1388.45ms
step:625/1390 val_loss:3.5861 train_time:857694ms step_avg:1394.62ms
step:750/1390 val_loss:3.5308 train_time:1038387ms step_avg:1403.23ms
step:875/1390 val_loss:3.4804 train_time:1220604ms step_avg:1411.10ms
step:1000/1390 val_loss:3.4140 train_time:1402886ms step_avg:1417.06ms
step:1125/1390 val_loss:3.3613 train_time:1584735ms step_avg:1421.29ms
step:1250/1390 val_loss:3.3141 train_time:1767800ms step_avg:1425.64ms
step:1375/1390 val_loss:3.2852 train_time:1952838ms step_avg:1430.65ms
step:1390/1390 val_loss:3.2844 train_time:1975155ms step_avg:1431.27ms
```
Little worse
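
To put numbers on "little worse", the two logs can be parsed and compared at the final step. A small sketch using the last two lines of each run pasted above; `parse_log` is a hypothetical helper written for this log format:

```python
import re

LINE = re.compile(
    r"step:(\d+)/(\d+) val_loss:([\d.]+) train_time:(\d+)ms step_avg:([\d.]+)ms"
)


def parse_log(text):
    """Return {step: (val_loss, step_avg_ms)} for every matching log line."""
    return {int(m[1]): (float(m[3]), float(m[5])) for m in LINE.finditer(text)}


baseline = parse_log("""
step:1375/1390 val_loss:3.2783 train_time:1538654ms step_avg:1127.22ms
step:1390/1390 val_loss:3.2775 train_time:1556107ms step_avg:1127.61ms
""")
multihead = parse_log("""
step:1375/1390 val_loss:3.2852 train_time:1952838ms step_avg:1430.65ms
step:1390/1390 val_loss:3.2844 train_time:1975155ms step_avg:1431.27ms
""")

last = max(baseline)
gap = multihead[last][0] - baseline[last][0]
slowdown = multihead[last][1] / baseline[last][1]
print(f"step {last}: val_loss gap {gap:+.4f}, step time x{slowdown:.2f}")
```

The final gap is 3.2844 - 3.2775 = 0.0069, which sits at the upper end of the ".004-.007 would be barely" range mentioned earlier, while the step time is roughly 1.27x the baseline.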

spare flame Oct 11, 2025, 10:54 PM

#

Oh after understanding what you meant by multi headed I didn't think you still needed to compare since it's mathematically identical

#

(tho I still don't really get the justification for why it's faster than normal FFNs since it's identical in that way)

whole goblet Oct 11, 2025, 11:30 PM

#

Well it's done anyways, and yea it's pretty much identical just slower

#

The speed up when on >1 GPU would be that if you have two hidden layers in those, you would be able to remove the scatter gather for that second layer

#

but with a single hidden layer it's almost strictly worse

#

But when you have so many ops that are pipelined in my arch, it's more pronounced

#

Alright, I have my architecture patched into modded-nanogpt

#

We'll see how perf is

#

Also trying muon on my layers but might swap them back to Adam if there are major issues. I'm doing all manual gradient handling at this point, so it should be pretty agnostic

whole goblet Oct 12, 2025, 2:35 AM

#

Alright, at a bit over 60k tokens per second, so double baseline, with the modded nanogpt repo and some kernel tuning. Still a lot slower than the 470k tokens per second that modded nanogpt can get on the same config, but I'm okay not beating a literal speedrunning config

#

Will run ablations from this. Should be able to get a decent chunk of configs out of this

#

util is still pretty low, so someone smarter than me could probably get further faster than I can at this point

whole goblet Oct 14, 2025, 6:40 PM

#

spare flame if its tight then maybe you should give up

I'm just about here, just going to try one last thing and then just open source the negative case. Got it to get within like .05 nats of the modded gpt baseline, but still like 7x the iteration time. I know there's a lot of room for speed based on nsight, but haven't been able to capture any of it.

Last hurrah is going to be trying to toss an expert choice router in front of each of the PEER routers. Would let each head specialize better, would reduce amount of overhead I'm eating from each head, and shouldn't require kernel rewrites the way I have them now, would just look like smaller batch sizes from that perspective.

if/when that fails I think I'll just give this up

#

I have the approach factorized to a point where it is lower FLOPs than a dense baseline, though, just less efficient ops currently.

spare flame Oct 14, 2025, 6:42 PM

#

sorry 🙁 thats interesting that its lower flops and almost equivalent performance but less efficient

whole goblet Oct 14, 2025, 6:43 PM

#

It feels like there's something here, and to some degree this shouldn't work at all, let alone nearly as well.

#

But yea, I just might not be the person to make it work if it does have merit

#

But yea, last hurrah is going to see if they specialize better if I go
Head Choice Router (ripped from expert choice, will reduce tokens each head needs to process) -> PEER Router -> Hypernetwork -> process token with generated experts
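
The first stage of that pipeline can be sketched as expert-choice-style selection, where each head picks its own top tokens instead of each token picking heads. This is a simplified illustration with raw scores and hypothetical sizes, not the actual router from this project:

```python
import numpy as np


def head_choice(scores: np.ndarray, capacity: int) -> np.ndarray:
    """scores: (num_tokens, num_heads). Returns (num_heads, capacity) token indices.

    Each head (column) selects the `capacity` tokens it scores highest, so
    every head processes exactly the same number of tokens.
    """
    order = np.argsort(-scores, axis=0)  # token indices, descending score, per head
    return order[:capacity].T


rng = np.random.default_rng(0)
scores = rng.standard_normal((16, 4))     # 16 tokens, 4 heads (hypothetical sizes)
picked = head_choice(scores, capacity=4)  # each head handles 4 of the 16 tokens
print(picked)
```

Since each head sees a fixed number of tokens, the downstream per-head work looks like a smaller batch, and a token may be picked by several heads or by none.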

#

learned how to write reasonably efficient backwards kernels at least

#

forward is just a huge bottleneck rn

spare flame Oct 14, 2025, 7:11 PM

#

whole goblet It feels like there's something here, and to some degree this shouldn't work at ...

to me, the biggest open question is still PEER and what situations, if any, PEER is good in

#

without knowing the answer to that it's hard to know anything much about these kinds of 'dynamically constructing FFNs' methods

#

it's fine if your method requires more parameters if those parameters are used much more sparsely

#

but you need some kind of specific situations in which your method would lead to dramatically reduced compute (like not just a constant multiplier on FLOPs)

#

that was sort of the promise of PEER

#

ultimately, we can't tell you what the proposed benefits of the methods you're pursuing are... you have to be the one to know and communicate that

#

and if you can find a clear benefit of this sort it's probably worth pursuing, and if not then yeah it might not be

whole goblet Oct 14, 2025, 7:53 PM

#

spare flame to me, the biggest open question is still PEER and what situations, if any, PEER...

So far, my instinct is no with at least the currently available reproduction code and my attempts, but I haven't explored the space entirely.

whole goblet Oct 14, 2025, 7:54 PM

#

spare flame it's fine if your method requires more parameters if those parameters are used m...

And yea, this architecture is the opposite. Parameters are reused, so it's more parameter efficient but not necessarily more FLOP efficient. That said, I've landed on a better factorization that's actually slightly lower in FLOPs than a dense FFN.

tranquil fiber Oct 14, 2025, 8:44 PM

#

whole goblet I'm just about here, just going to try one last thing and then just open source ...

I think there's a lot of room for twiddling on the MLP generating the weights, seems like there's a lot you could do there

whole goblet Oct 14, 2025, 8:46 PM

#

tranquil fiber I think there's a lot of room for twiddling on the MLP generating the weights, s...

Yea, just honestly getting burnt out

#

Doing this on limited budget without institutional support is fun when it's fun, but lately it's just been exhausting

#

hobbies are supposed to just mostly be fun lol

#

And it's hard to sift through "this advice is useful when you have a huge training budget" vs "this is useful at all scales"

#

Been good learning, just not really fun rn

tranquil fiber Oct 14, 2025, 8:48 PM

#

whole goblet Been good learning, just not really fun rn

yep, that's a good time to know when to not push it as much

whole goblet Oct 14, 2025, 8:51 PM

#

tranquil fiber yep, that's a good time to know when to not push it as much

Yea, I think I'd probably like this more as a job tbh, I just don't think I have the resources to really do this just for the fun of it, which sucks a lot of the fun out of it

whole goblet Oct 15, 2025, 5:09 PM

#

Okay, actually it looks like a good chunk of the gap can be made up by tuning LR just for PEER layers

#

Lowered to 3.35 by making coarse grained updates, can likely eke out a decent amount more by separating LR between attention and components of the FFN

#

(was at 3.40~)

spare flame Oct 15, 2025, 5:21 PM

#

whole goblet Okay, actually it looks like a good chunk of the gap can be made up by tuning LR...

Peer or ethos?

#

Cool

whole goblet Oct 15, 2025, 6:18 PM

#

spare flame Peer or ethos?

whatever we're calling the pure hypernetwork version

#

it's still peer in my codebase

spare flame Oct 15, 2025, 6:19 PM

#

o ok, was just not sure if u meant actual PEER

whole goblet Oct 15, 2025, 6:22 PM

#

Yea, no, the one I've been working on. I'm not going to spend any more compute trying to get a PEER baseline perfect

#

if GDM wants a repro study they can give me the compute for it

whole goblet Oct 16, 2025, 4:28 PM

#

Honestly starting to think that this might do better as a blog post on "how to use hypernetworks to make a worse transformer"

mental plinth Oct 17, 2025, 6:08 AM

#

Just getting caught up here. 😊

I’ve been messing with different decompositions on attention and the FFN, and this past week was trying to see if I could find a configuration that would actually achieve a lower perplexity than my dense baseline at the same parameter count.

I found a Google paper that discussed how best to allocate additional parameters, and it recommended adding more layers rather than making them bigger.

I tried applying that: any parameter savings I was getting from the decompositions, I’d put into additional layers when possible.

An approach that finally got lower perplexity than the dense baseline was to interweave dense and low-rank layers.

I went from 6 dense layers (“dddddd”) to ten layers with a pattern of: “dssdssdssd” where ‘s’ is for sparse / low rank.

The low rank layers are narrow—they have ~1/3 as many heads and ~1/3 as many neurons as the dense layers.
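
As a config sketch, the banding pattern can be expanded into per-layer sizes like this. The helper name and the dense sizes (12 heads, 3072 FFN neurons) are hypothetical placeholders, not the actual values used:

```python
def layer_plan(pattern: str, dense_heads: int, dense_ffn: int, shrink: int = 3):
    """Expand a pattern such as 'dssdssdssd' into (kind, heads, ffn_width) tuples."""
    plan = []
    for kind in pattern:
        if kind == "d":
            plan.append(("dense", dense_heads, dense_ffn))
        elif kind == "s":
            # Low-rank layers get roughly 1/shrink of the heads and neurons.
            plan.append(("low_rank", dense_heads // shrink, dense_ffn // shrink))
        else:
            raise ValueError(f"unknown layer kind {kind!r}")
    return plan


for layer in layer_plan("dssdssdssd", dense_heads=12, dense_ffn=3072):
    print(layer)
```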

My theory is that, since the low rank layers can’t access the full model space, it makes sense to make them more narrow / less expressive.

And also that it benefits from periodically having unrestricted read and write access to the full model space.

For Ethos and the variants, there is a lot of low rank-ness going on here with the PK Router and the hyper networks. I wonder if this architecture might benefit from a similar “banding” pattern?
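A rough way to sanity-check the "same parameter count" claim for the banded pattern is to count weights per layer. This is only a sketch: the model width, head size, and FFN sizes are assumed GPT-2-small-like values, the 's' layers are modeled purely as narrower layers (the low-rank factorization itself is not modeled), and biases, norms, and embeddings are ignored.

```python
# Rough per-layer weight counts, ignoring biases, norms, and embeddings.
# All sizes below are illustrative assumptions, not the actual config.
D_MODEL = 768
HEAD_DIM = 64


def layer_params(n_heads: int, d_ff: int) -> int:
    """Q, K, V, O projections plus the two FFN matrices."""
    attn = 4 * D_MODEL * (n_heads * HEAD_DIM)
    ffn = 2 * D_MODEL * d_ff
    return attn + ffn


DENSE = layer_params(n_heads=12, d_ff=3072)
NARROW = layer_params(n_heads=4, d_ff=1024)  # ~1/3 heads, ~1/3 neurons


def pattern_params(pattern: str) -> int:
    """Sum weights over a pattern string: 'd' = dense, 's' = narrow layer."""
    return sum(DENSE if c == "d" else NARROW for c in pattern)


baseline = pattern_params("dddddd")
banded = pattern_params("dssdssdssd")
print(baseline, banded)
```

Under these assumptions a narrow layer costs one third of a dense one, so four dense plus six narrow layers land on the same budget as six dense layers.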

mental plinth Oct 17, 2025, 5:36 PM

#

I think the same insight may apply to regular PEER as well. The PK router is a low-rank approximation of a standard MoE router. It can only see a subspace defined by the query projection.

I suppose it makes up for that, though, by having multiple heads (multiple low rank routers), similar to attention. So maybe nevermind.
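For reference, the product-key lookup behind the PK router can be sketched in a few lines of NumPy. This is a single-head toy version with made-up sizes and random keys, not anyone's training code; it only shows why scoring 2n sub-keys is enough to recover the top-k of n² product keys.

```python
import numpy as np


def product_key_topk(q, keys_a, keys_b, k):
    """Top-k over the n*n product keys while only scoring 2*n sub-keys.

    q: (d,) query; keys_a, keys_b: (n, d // 2) sub-key tables.
    Expert index (i, j) is flattened as i * n + j.
    """
    n = keys_a.shape[0]
    half = q.shape[0] // 2
    scores_a = keys_a @ q[:half]          # (n,)
    scores_b = keys_b @ q[half:]          # (n,)
    top_a = np.argsort(-scores_a)[:k]     # best k first-half keys
    top_b = np.argsort(-scores_b)[:k]     # best k second-half keys
    # Score only the k*k candidate pairs; the true top-k lies among them.
    cand = scores_a[top_a][:, None] + scores_b[top_b][None, :]
    flat = np.argsort(-cand.ravel())[:k]
    ia, ib = np.unravel_index(flat, cand.shape)
    return top_a[ia] * n + top_b[ib], cand.ravel()[flat]


rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
q = rng.normal(size=d)
keys_a = rng.normal(size=(n, d // 2))
keys_b = rng.normal(size=(n, d // 2))
idx, scores = product_key_topk(q, keys_a, keys_b, k)

# Brute force over all n*n concatenated keys for comparison.
full = (keys_a @ q[: d // 2])[:, None] + (keys_b @ q[d // 2 :])[None, :]
brute = np.argsort(-full.ravel())[:k]
```

The restriction mentioned above shows up in the structure of the scores: each product-key score is a sum of two half-query scores, so the router can only rank experts along directions the query projection exposes.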

mental plinth Oct 17, 2025, 5:55 PM

#

What I've always liked about PEER is the hope that it could learn the same kind of precise features that they find using SAEs in interpretability research.

I wonder if it could work to take a pre-trained model, do the SAE analysis, and then initialize a PEER layer from that.

Applied to GPT-2, an SAE setup might look like the below.

First, compute the activations for the FFN input neurons:

a = gelu(x @ W_in)   # with W_in.shape = (768, 3072)

Then the SAE has an encoder matrix E with, e.g., shape (3072, 1M), and a decoder matrix D with shape (1M, 3072).

So the output is something like:

y = gelu(a @ E) @ D @ W_out

The features they find are the rows of D. You can turn those into expert output neurons by multiplying with W_out. So for PEER expert neuron 'j':

w_vj = D[j, :] @ W_out

For the routing, the simplest but expensive approach would be to say that the input to the PEER layer is the activation vector a, which is length 3072.

So then you need to set up the PK router to approximate the operation a @ E to find those top features.

Maybe a better way would be to freeze all of those ideal output weights, and then use distillation to train the PK-router and the neuron inputs.
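The folding step described above (turning SAE decoder rows into expert output vectors) can be checked numerically on toy matrices. Everything here is an assumption for illustration: the sizes are tiny, the weights are random, and the SAE uses a ReLU rather than the GELU written above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_feat = 8, 32, 64   # toy sizes, not GPT-2's

W_out = rng.normal(size=(d_ff, d_model))   # FFN output projection
E = rng.normal(size=(d_ff, n_feat))        # SAE encoder
D = rng.normal(size=(n_feat, d_ff))        # SAE decoder; rows are features

a = np.maximum(rng.normal(size=d_ff), 0.0)  # stand-in for FFN hidden activations
f = np.maximum(a @ E, 0.0)                  # SAE feature activations (ReLU here)

# SAE path: reconstruct the hidden activations, then apply W_out.
y_sae = (f @ D) @ W_out

# Expert view: fold W_out into each feature to get one output vector per expert.
W_v = D @ W_out                             # row j is w_vj = D[j, :] @ W_out
y_experts = f @ W_v

# Keeping only the top-k features is what a PEER-style router would approximate.
k = 4
top = np.argsort(-f)[:k]
y_topk = f[top] @ W_v[top]
```

The two full paths agree exactly because matrix multiplication is associative; the approximation only enters once the sum is truncated to the top-k features.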

whole goblet Oct 17, 2025, 7:19 PM

#

Yea I thought the same thing, but I’m not sure you can hit ground truth on “ideal output weights”

#

You could hit a version of it, but PEER reproductions themselves don’t seem to be great from the versions I’ve tried

#

What I think could be more interesting is token + scaled output pairs on a trained network as a training set

#

So you attempt to learn the full path

whole goblet Oct 18, 2025, 5:14 PM

#

Wait I found a bug

#

I'll rerun training once it's fixed

whole goblet Oct 19, 2025, 8:47 PM

#

Would y’all have any thoughts on how to best test OOD data? I have half a hypothesis that because this is defining what compute you need on the fly, it might do better there.

whole goblet Oct 20, 2025, 5:23 PM

#

Also, am I potentially thinking of this wrong? The FFN is learning a materially more difficult task, so should I be expecting that it's as sample efficient as its dense counterpart?

#

It's learning given this input, what weights should I generate that would be useful and then immediately consuming those weights, which is significantly more difficult than the training task for the dense network
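To make the "generate weights, then immediately consume them" point concrete, here is a minimal NumPy sketch of a hypernetwork-built expert. The names, sizes, and the plain linear hypernetwork are all hypothetical; this is not the ETHOS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, d_hidden = 8, 4, 16   # toy sizes (assumptions)

# Hypernetwork parameters: map a latent code to the weights of a tiny expert.
H_up = rng.normal(size=(d_latent, d_model * d_hidden)) * 0.1
H_down = rng.normal(size=(d_latent, d_hidden * d_model)) * 0.1


def generated_expert_forward(x, z):
    """Generate expert weights from latent z, then immediately apply them to x."""
    W_up = (z @ H_up).reshape(d_model, d_hidden)
    W_down = (z @ H_down).reshape(d_hidden, d_model)
    return np.maximum(x @ W_up, 0.0) @ W_down


x = rng.normal(size=d_model)
z = rng.normal(size=d_latent)   # e.g. a latent selected by a router
y = generated_expert_forward(x, z)
```

In this toy version, doubling z quadruples the output, because z enters through both generated matrices. A dense FFN has no such multiplicative coupling between a shared generator and its own weights, which is one way to see why the optimization problem might be harder.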

whole goblet Oct 22, 2025, 2:00 PM

#

Going to swap to gpt2-medium baseline just because I’m wasting a lot of cycles making a non power of 2 model dim work. Everything is a lot easier when model dim is 1024.

whole goblet Oct 22, 2025, 3:28 PM

#

```
step:0/1390 val_loss:10.8258 train_time:0ms step_avg:nanms
step:125/1390 val_loss:4.3229 train_time:217484ms step_avg:1891.16ms
step:250/1390 val_loss:3.8955 train_time:460616ms step_avg:1919.23ms
step:375/1390 val_loss:3.7061 train_time:704883ms step_avg:1931.19ms
step:500/1390 val_loss:3.5879 train_time:953667ms step_avg:1946.26ms
step:625/1390 val_loss:3.5013 train_time:1206982ms step_avg:1962.57ms
step:750/1390 val_loss:3.4425 train_time:1459531ms step_avg:1972.34ms
step:875/1390 val_loss:3.3900 train_time:1716268ms step_avg:1984.12ms
step:1000/1390 val_loss:3.3211 train_time:1976056ms step_avg:1996.02ms
step:1125/1390 val_loss:3.2642 train_time:2233886ms step_avg:2003.49ms
step:1250/1390 val_loss:3.2146 train_time:2497299ms step_avg:2013.95ms
step:1375/1390 val_loss:3.1837 train_time:2760458ms step_avg:2022.31ms
step:1390/1390 val_loss:3.1829 train_time:2791743ms step_avg:2023.00ms
```
Baseline gpt2-medium

whole goblet Oct 22, 2025, 5:48 PM

#

@spare flame going back through original ETHOS, I found a bug that might actually be a minor discovery. Any shot you'd have 10 minutes to verify? Should be apparent in the code.

Basically, the hypernetwork and expert weights never actually received gradients. We hit reasonable PPL without training those components at all. Just random init.

spare flame Oct 22, 2025, 5:50 PM

#

whole goblet @spare flame going back through original ETHOS, I found a bug that mig...

They didn't receive grads?? Not sure how but sounds like a good find! I'm actually travelling until next week but I can take a peek then

whole goblet Oct 22, 2025, 5:51 PM

#

spare flame They didn't receive grads?? Not sure how but sounds like a good find! I'm actual...

There's literally no backward function. I had an autograd in there at some point that got deleted. Can even show original notebook we trained with, with original output, to show it's just not there

spare flame Oct 22, 2025, 5:51 PM

#

Lol!

whole goblet Oct 22, 2025, 5:51 PM

#

But we were using the forward triton kernel
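One quick way to catch this class of bug is to run a single backward pass and list the parameters whose `.grad` is still `None`. The sketch below uses a hypothetical toy module in which `detach()` stands in for a forward-only kernel; it is not the ETHOS code or its Triton kernel.

```python
import torch
import torch.nn as nn


class ToyLayer(nn.Module):
    """Hypothetical stand-in: one path keeps autograd, one path drops it."""

    def __init__(self):
        super().__init__()
        self.router = nn.Linear(4, 4)
        self.expert = nn.Linear(4, 4)

    def forward(self, x):
        routed = self.router(x)
        # detach() mimics a forward-only kernel: output values flow on,
        # but nothing upstream of this call is recorded for backward.
        expert_out = self.expert(x).detach()
        return routed + expert_out


def params_without_grad(model, loss):
    """Run backward once and list parameters that received no gradient."""
    model.zero_grad(set_to_none=True)
    loss.backward()
    return [name for name, p in model.named_parameters() if p.grad is None]


model = ToyLayer()
loss = model(torch.randn(2, 4)).sum()
missing = params_without_grad(model, loss)
print(missing)
```

Here only the expert parameters show up as missing, while the router still trains, which matches the shape of the bug described above: the loss goes down even though part of the model stays at its random init.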

whole goblet Oct 22, 2025, 6:07 PM

#

If this is true, I think I can just share latents across layers and never train them? Would just need to fix the hypernetwork to receive gradients because that's likely suboptimal

spare flame Oct 22, 2025, 6:18 PM

#

Heh what part do you need me to look at? Shouldn't you just make sure everything has a backward?

#

I don't really want to look at code to debug it

tranquil fiber Oct 22, 2025, 6:27 PM

#

whole goblet ```step:0/1390 val_loss:10.8258 train_time:0ms step_avg:nanms step:125/1390 val_...

Please check your baseline numbers against the raw reported numbers, this is the third time!

whole goblet Oct 22, 2025, 6:32 PM

#

spare flame Heh what part do you need me to look at? Shouldn't you just make sure everything...

Just that there is not in fact a backwards

whole goblet Oct 22, 2025, 6:35 PM

#

tranquil fiber Please check your baseline numbers against the raw reported numbers, this is the...

I did, was using https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_2_medium/2025-01-18/241dd7a7-3d76-4dce-85a4-7df60387f32a.txt and this technically beats it

#

Just used the same shorter run hyperparams from the gpt-2 small version

whole goblet Oct 22, 2025, 6:36 PM

#

spare flame Heh what part do you need me to look at? Shouldn't you just make sure everything...

Yea, it doesn't. Just validating absence, but it's no huge deal

whole goblet Oct 22, 2025, 6:39 PM

#

tranquil fiber Please check your baseline numbers against the raw reported numbers, this is the...

If you'd prefer I can hit the 2.92 target for the medium, just wanted to have a straightforward baseline with a 1024 model dim

tranquil fiber Oct 22, 2025, 7:01 PM

#

whole goblet If you'd prefer I can hit the 2.92 target for the medium, just wanted to have a ...

Yeah, I'm not entirely sure what that's getting you unless it's something specific to kernel writing.

But dynamics change when you change the numbers, baseline typically means "I didn't change anything and ran it as previously", changing hyperparams changes the result.

You can make a new baseline, but it's risky and takes longer because it loses the guarantees of the pre-existing baseline

#

(I probably won't comment too much more on ETHOS but I may come in occasionally to help)

whole goblet Oct 22, 2025, 7:03 PM

#

tranquil fiber Yeah, I'm not entirely sure what that's getting you unless it's something specif...

I'm looking to have a baseline architecture to compare my architecture to, so as long as I'm using the architecture provided in the repo, and then also holding everything else constant outside of my own independent variable, is that not scientifically valid?

#

E.g. I trained a GPT-2 Medium class model as my dense baseline on X tokens, and then also trained my IsoParam|IsoFLOP|IsoWallClock model on X tokens, is that not a valid comparison?

#

Goal isn't to get on leaderboards of modded-nanogpt, just have a scientifically valid baseline to compare to.

#

Especially when they're using identical hyperparams outside of my exact FFN change?

#

Maybe the part that wasn't clear is that baseline was run exactly from the repo just with the training duration changed?

whole goblet Oct 22, 2025, 7:37 PM

#

tranquil fiber Please check your baseline numbers against the raw reported numbers, this is the...

Actually maybe I caught the misunderstanding. This is not with any changes, just with the exact code from the above run I linked, with only num_iterations=1390. Maybe this came across as with changes to the FFN?

#

Otherwise not sure what the "Yeah, I'm not entirely sure what that's getting you unless it's something specific to kernel writing." part is about

whole goblet Oct 22, 2025, 8:00 PM

#

Was only offering to run it to 2.92 because that's what the speedrun requires, but this is just a short run baseline in order to keep sweeps reasonable cost.

whole goblet Oct 22, 2025, 8:38 PM

#

fwiw going to move forward with the shortened GPT-2 Medium baseline for ablations and hyperparam sweeps. I think that simply training for fewer tokens will just have to be a limitation for that portion, and I'm fine with that.

whole goblet Oct 23, 2025, 3:19 AM

#

Went through and just updated the most recent GPT-2 Medium speedrun to be friendly to single GPU and ran a full run. Final delta was +0.000934 (2.920684 vs 2.919750) Going to call that similar enough, use that as the harness and baseline, and just use early exit to disqualify candidates if they aren't training well.

Reference: https://github.com/KellerJordan/modded-nanogpt/blob/master/records/track_2_medium/2025-06-15_OptimizationLeaderboard/075_640429f2-e726-4e83-aa27-684626239ffc.txt

My changes and full run: https://gist.github.com/wrmedford/14893b6a4477b6d2ef114a3406d5aa87

#

Now back to the actual research

mental plinth Oct 23, 2025, 3:25 AM

#

How do you decide when to stop training? The goal is to keep going till you hit their perplexity mark, right? But you need to know ahead of time how many steps that’s going to take in order to schedule the learning rate 🤔

whole goblet Oct 23, 2025, 3:25 AM

#

Best I can tell it's a bit of trial and error on figuring out exactly which iteration to stop on, but I'm just going to be training to this token count to keep with the methodology
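The scheduling issue raised above is easy to see with a schedule that decays over the final part of training. This is a generic constant-then-linear-cooldown sketch; the 0.4 cooldown fraction and the 5000-step alternative are placeholder assumptions, not values taken from the speedrun repo.

```python
def lr_multiplier(step: int, num_iterations: int, cooldown_frac: float = 0.4) -> float:
    """Constant LR, then a linear decay to zero over the final cooldown_frac of training."""
    progress = step / num_iterations
    if progress < 1.0 - cooldown_frac:
        return 1.0
    return max(0.0, (1.0 - progress) / cooldown_frac)


# Same step, different planned run lengths -> different learning rates.
short_run = lr_multiplier(1000, num_iterations=1390)
long_run = lr_multiplier(1000, num_iterations=5000)
print(short_run, long_run)
```

Because the multiplier depends on `num_iterations`, the run length has to be fixed up front; stopping "when the target is hit" would mean stopping partway through a cooldown that was planned for a different length.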

mental plinth Oct 23, 2025, 3:30 AM

#

What’s the metric you’re evaluating here for Ethos? Comparing validation loss at the end of training?

whole goblet Oct 23, 2025, 3:30 AM

#

I will be yea. I have a sweep harness and a few other things I'll throw in a branch probably tomorrow

#

It's already on the jupyter instance if you want to check there

mental plinth Oct 23, 2025, 3:31 AM

#

Cool!

#

Do you have an idea of what success looks like here?
Are we hoping for a lower validation loss at the same token count vs. the baseline?

whole goblet Oct 23, 2025, 3:34 AM

#

That's the hope, or seeing if we can match it at a lower param/flop count

#

now that I'm moving back closer to the standard ETHOS arch

mental plinth Oct 23, 2025, 3:35 AM

#

Ah, I see now, that run you just shared was the baseline. If it took 3 hours, how long is ethos going to take? 😬

whole goblet Oct 23, 2025, 3:37 AM

#

Hopefully not all ablations will take that long and will early exit on bad configs. Once I post the sweep (it’s in ethosv2/grid_search.py on the Jupyter server) it should make more sense. But should be able to just run them back to back for a while and just check results.

#

And I’m looking at potentially just tooling this for 8xH100 and doing it there if wall clock becomes prohibitive. But I think a couple weeks waiting on results is fine considering budget
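An early-exit rule of the kind mentioned above could look roughly like this. It is a hypothetical sketch, not the contents of grid_search.py: the reference points are taken from the short GPT-2 Medium log posted earlier purely for illustration, and the tolerance is an arbitrary placeholder.

```python
# Baseline val_loss checkpoints copied from the short GPT-2 Medium run logged earlier.
BASELINE = {125: 4.3229, 250: 3.8955, 375: 3.7061, 500: 3.5879}


def should_exit_early(step: int, val_loss: float, tolerance: float = 0.10) -> bool:
    """True if a candidate trails the baseline by more than `tolerance` at a checkpoint."""
    if step not in BASELINE:
        return False  # no reference point at this step, keep training
    return val_loss > BASELINE[step] + tolerance


print(should_exit_early(250, 3.91))   # within tolerance
print(should_exit_early(250, 4.20))   # clearly behind
```

A fixed tolerance is the simplest choice. Given the observation elsewhere in this thread that this architecture can start slow and catch up later, a looser tolerance at early checkpoints may be worth considering.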

mental plinth Oct 23, 2025, 3:39 AM

#

Yeah, makes sense

Have you tried swapping in the ethos layer yet?

#

After I saw that you were using this as the test framework, I’m considering trying it for the attention subspace work, but it’s not immediately obvious to me how tricky it will be to customize or not.

The MLP looks more straightforward for ethos, at least.

whole goblet Oct 23, 2025, 3:40 AM

#

Yea it trains, just need to fix the bug I found. I think what I found also implies that since the routers are that expressive, each router needs its own hypernetwork. Lines up with what I saw when I was testing out the hypernetwork only approach as well

#

Yea MLP swap here is straightforward, but there’s a lot of tricks happening with attention you’d need to disentangle. Would recommend a previous run before flex attention was introduced

mental plinth Oct 23, 2025, 3:41 AM

#

whole goblet Yea it trains, just need to fix the bug I found. I think what I found also impli...

Yeah, I think that makes a lot of sense (having a network per head)

whole goblet Oct 23, 2025, 3:42 AM

#

Yea, conceptually I think it makes more sense because then you’d never have different heads feeding a shared hypernetwork conflicting gradients

#

And then each can truly specialize

mental plinth Oct 23, 2025, 3:43 AM

#

I’m itching to go read up on their different hacks to attention. Stuff like the smear gate, and skip connections, and value embeddings 😳

whole goblet Oct 23, 2025, 3:43 AM

#

Part of me wants to test throwing an expert choice router in front of each PEER router since it should work. Would allow for even better specialization.

#

And still allows for lossless load balancing

#

And consistent expert load

mental plinth Oct 23, 2025, 3:45 AM

#

Expert choice? Like making an MoE layer where each expert is an ethos layer?

whole goblet Oct 23, 2025, 3:45 AM

#

An ethos head, but yea

#

Expert choice -> PEER router(s) -> constructed expert -> gather token output from experts that selected it
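The first stage of that pipeline, expert-choice routing, can be sketched as below. Scores and sizes are random toy values, and the downstream PEER routers and constructed experts are left out; the point is only how each head picks its tokens and why the load is fixed.

```python
import numpy as np


def expert_choice(scores, capacity):
    """Each expert (column) picks its `capacity` highest-scoring tokens (rows).

    Returns an (n_experts, capacity) array of token indices.
    """
    order = np.argsort(-scores, axis=0)     # tokens sorted per expert, best first
    return order[:capacity].T


rng = np.random.default_rng(0)
n_tokens, n_heads = 8, 4                    # "experts" here would be ETHOS heads
scores = rng.normal(size=(n_tokens, n_heads))
chosen = expert_choice(scores, capacity=2)

# Load is balanced by construction: every head processes exactly `capacity` tokens.
load = [len(tokens) for tokens in chosen]
# A token may be picked by several heads, or by none.
picks_per_token = np.bincount(chosen.ravel(), minlength=n_tokens)
```

The final gather step would then sum, for each token, the outputs of whichever heads selected it, using `chosen` as the index map.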

mental plinth Oct 23, 2025, 3:47 AM

#

Well, good stuff. It sounds like a really nice, simple test harness here that ought to be easy to interpret.

whole goblet Oct 23, 2025, 3:47 AM

#

Yea. Going to try the normal stuff first but the expert choice router just feels like it’ll fit there

whole goblet Oct 23, 2025, 4:18 PM

#

@mental plinth , I had half a thought yesterday. Do you think that PEER's router might fit some (albeit looser) definition of an encoder?

mental plinth Oct 24, 2025, 11:12 PM

#

Well, it certainly seems close to Attention. It’s like it attends to the expert neurons instead of a sequence of tokens. It applies softmax just like attention, except only to the top-k experts. And since there’s no “causal masking” I suppose you could compare the whole thing to an Encoder Attention block in that regard?
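The attention analogy can be written out directly. The sketch below scores every expert densely (no product keys), uses a ReLU for the expert neuron, and uses random toy weights, so it is a simplified illustration under those assumptions and not the PEER reference implementation.

```python
import numpy as np


def peer_like_forward(x, keys, U, V, k):
    """Attend over single-neuron experts instead of over tokens.

    x: (d,) input; keys: (n, d) router keys;
    U, V: (n, d) per-expert input and output vectors.
    """
    scores = keys @ x                       # like q·k in attention
    top = np.argsort(-scores)[:k]           # keep only the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w = w / w.sum()                         # softmax restricted to the top-k
    h = np.maximum(U[top] @ x, 0.0)         # each expert is a single hidden neuron
    return (w * h) @ V[top], top, w


rng = np.random.default_rng(0)
n, d, k = 32, 8, 4
x = rng.normal(size=d)
keys, U, V = (rng.normal(size=(n, d)) for _ in range(3))
y, top, w = peer_like_forward(x, keys, U, V, k)
```

Read this way, the keys play the role of attention keys, the expert output vectors play the role of values, and the "sequence" being attended over is the fixed table of experts.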

What aspects of an encoder were you thinking about?

whole goblet Oct 25, 2025, 12:43 AM

#

Mostly that it’s encoding language at that exact point in time since it’s working off a point in time language with causality baked in, in a way that is consumable by a hypernetwork

#

It might be stretching the definition a bit, but does definitely feel like the IO matches the shape of what I do for natural language encoding in my job

whole goblet Oct 27, 2025, 2:44 PM

#

So I tried to reduce this to just comparing to a standard MLP without the complexities of the surrounding model. Chose a CIFAR10 flattened set (not 1:1 with language, I know) just to try to track relative performance on situations where MLPs can struggle. Gave me better signal (on at least this task) on what configs beat an isoparam dense baseline and which don't. Also gave me some better info on which hyperparams impact performance more than others

#

In isoparam (where this arch is actually technically fewer FLOPs now after some tweaking) I'm able to consistently match or beat the single layer FFN baseline. Deeper FFNs are consistently beating, though, so I'm not sure exactly how to compare since MoE arches tend to use a single layer MLP

#

But it reminded me of what Smerky said earlier, that this approach might just be giving the model more depth, which might just be an inductive bias for the CIFAR10 stuff

#

The most interesting part is that it performs way worse on early epochs, but starts to pull ahead pretty quick. It does plateau earlier than the dense baseline so I might play with lowering its LR

whole goblet Oct 27, 2025, 5:44 PM

#

Okay, so counterintuitive finding: lower LR on the router, higher LR on the hypernetwork is preferred.

Not as counterintuitive findings: router is massively overparameterized in some ways and underparameterized in others. Starting to get some amount of crystallization on where the design space is for this without having to do full on training runs
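In PyTorch, per-component learning rates like this are usually set with optimizer parameter groups. A minimal sketch follows; the module names, the AdamW choice, and all LR values are placeholders and not the settings from these runs.

```python
import torch
import torch.nn as nn

# Hypothetical module names; the real codebase may organize these differently.
model = nn.ModuleDict({
    "router": nn.Linear(16, 32),
    "hypernetwork": nn.Linear(32, 64),
    "attention": nn.Linear(16, 16),
})

base_lr = 3e-4  # placeholder value, not a tuned setting
optimizer = torch.optim.AdamW([
    {"params": model["router"].parameters(), "lr": base_lr * 0.5},        # lower
    {"params": model["hypernetwork"].parameters(), "lr": base_lr * 2.0},  # higher
    {"params": model["attention"].parameters()},                          # default
], lr=base_lr)

lrs = [group["lr"] for group in optimizer.param_groups]
```

Groups that do not set their own `lr` fall back to the optimizer-level default, so only the components being tuned need explicit entries.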

whole goblet Oct 28, 2025, 5:01 PM

#

In testing, multiple heads here actually perform worse than a single router and single hypernetwork. My guess is that if there was an actual positive benefit from multiple heads in the PEER paper (becoming less and less convinced that paper wasn't outright wrong or required very sensitive hyperparams), it likely was due to having discrete experts that needed to be combined into valid experts.

#

And that might look different from a hypernetwork that jointly learns what that valid expert is.

mental plinth Oct 28, 2025, 7:56 PM

#

Nice! I like the approach. I actually did my first round of experiments using a Vision Transformer (ViT) trained on CIFAR-10. It's nice because you can train that wicked fast.

I like the approach of holding the parameters constant, too. If an architecture is truly "better", I think you'd expect that it's able to score higher given the same parameters, right? I think too many of my early experiments were in the regime of "it performs slightly worse with fewer parameters", and I've started to see that as not a very useful observation.

#

Maybe putting the ethos / peer layer into a ViT with an established configuration would make for a good comparison, over just a straight MLP that you have to architect yourself?

#

Also, do you think there might be any merit to testing PEER with a dense router? The PK router makes it efficient to run for a large number of experts, but also introduces a low rank decomposition, and I'm wondering if removing that and isolating just the single-neuron-expert approach could be informative. I'm not really sure--it's an incomplete thought.

whole goblet Oct 29, 2025, 12:12 AM

#

mental plinth Maybe putting the ethos / peer layer into a ViT with an established configuratio...

Yea I think this is the path I'm going to use for now just to test in isolation. Technically not 1:1 for LLMs but should be enough to at least inform behavior

#

Going to use this repo since it has baselines and previous training runs https://github.com/kentaroy47/vision-transformers-cifar10

#

Going to set it up to run ETHOS and the baseline ViT side by side with an isoparam constraint on the MLP, which should get pretty well isolated behavior curves at least
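For an isoparam comparison, the dense MLP's hidden width can be solved from the parameter budget. A small sketch, assuming a plain two-layer MLP with biases; the 512 width and the 1M budget are hypothetical numbers, not the ViT repo's config.

```python
def mlp_params(d_model: int, hidden: int) -> int:
    """Weights and biases of a two-layer MLP: d_model -> hidden -> d_model."""
    return d_model * hidden + hidden + hidden * d_model + d_model


def matched_hidden(d_model: int, target_params: int) -> int:
    """Largest hidden width whose MLP stays within target_params."""
    return (target_params - d_model) // (2 * d_model + 1)


# Hypothetical budget standing in for the parameter count of the layer being replaced.
budget = 1_000_000
hidden = matched_hidden(512, budget)
print(hidden, mlp_params(512, hidden))
```

Rounding down keeps the dense baseline at or just under the budget, so any remaining gap slightly favors the layer under test rather than the baseline.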

whole goblet Oct 29, 2025, 1:45 PM

#

Ignore the mha misnomer since it transferred over from my toy example, is in fact using ViT. Up to date code, running the included sweep now. Includes FiLM which is an improvement on PEER to get an actual rank-k MLP to generate without needing to blow up the hypernetwork's output layer (in this arch it's now rank-m, with k being the depth of modulation).
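For readers unfamiliar with FiLM (feature-wise linear modulation), the generic idea is that the conditioning network emits a per-feature scale and shift instead of full weight matrices. The sketch below is that textbook form on toy sizes, with hypothetical names; it does not reproduce the variant in the gist, and the rank and modulation-depth details mentioned above are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, d_latent = 8, 16, 4     # toy sizes (assumptions)

# Shared base MLP weights, reused for every token.
W_up = rng.normal(size=(d_model, d_hidden)) * 0.1
W_down = rng.normal(size=(d_hidden, d_model)) * 0.1
# The hypernetwork only emits a per-feature scale and shift: 2 * d_hidden numbers,
# rather than the d_model * d_hidden entries of a full weight matrix.
H_gamma = rng.normal(size=(d_latent, d_hidden)) * 0.1
H_beta = rng.normal(size=(d_latent, d_hidden)) * 0.1


def film_mlp(x, z):
    """Base MLP whose hidden features are modulated by gamma(z) and beta(z)."""
    gamma = 1.0 + z @ H_gamma               # centered at 1 so z = 0 is a no-op
    beta = z @ H_beta
    h = np.maximum(x @ W_up, 0.0)
    return (gamma * h + beta) @ W_down


x = rng.normal(size=d_model)
z = rng.normal(size=d_latent)
y = film_mlp(x, z)
```

The saving is in the hypernetwork's output layer: it scales with the hidden width rather than with the full size of the generated weight matrices.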

Without doing an arch sweep (next phase once I have a decent idea of what hparams I should use), best baseline is getting 84.37, vs 83.27 for ETHOS. So consistently behind, but I think a decent chunk of that gap will be made up by better balancing of parameters once I do an arch sweep. Relatively certain current setup is overparameterized in some places, and under in others.

https://gist.github.com/wrmedford/ef452a86bae0c7dd1201b5e4e265729a

Gist

ethos.py

GitHub Gist: instantly share code, notes, and snippets.

#

Normal hedges of non-pretrained ViT on limited dataset, etc. etc. just seemed like the best way to isolate purely the MLP aspects of this without using a super contrived method

whole goblet Nov 3, 2025, 10:51 PM

#

Small update: ViT stuff is going reasonably well. Just still doing a lot of arch/hparam searches. Narrowing, but not exactly solidified yet

whole goblet Nov 5, 2025, 12:31 AM

#

Alright, going to be setting this down for a bit. In a pretty isolated MLP vs ETHOS bakeoff, error bars overlap after tuning both, but it's not consistently beating a dense baseline. Maybe/Probably has better performance in transfer learning, but no clue yet. Going to set it down for a while.

#

I'm going to be popping over to just do some more standard kaggle competitions + maybe try my hand at the gpu mode kernel writing stuff especially since gluon is a thing now if anyone wants to join

quartz kestrel Jan 28, 2026, 2:52 PM

#

i did try a 256² experts version of a PEER implementation and got a gibberish generator. Was the original model able to produce meaningful text?

#

did someone try a better dataset than C4 and wikitext, like FineWeb-Edu?

whole goblet Feb 1, 2026, 6:46 PM

#

quartz kestrel i did try a 256² experts version of a PEER implementation and got a gibberish ge...

It wasn't a gibberish generator at any scale I tested with, just not as strong as expected performance. Router code can be bug prone, so I would make sure you implemented PKM correctly

spare flame Feb 1, 2026, 9:17 PM

#

btw @whole goblet didnt know if you've seen this https://arxiv.org/abs/2508.18756v1

arXiv.org

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superio...

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE mode...

#

someone seems to have gotten PEER to be useful in at least some context

whole goblet Feb 2, 2026, 1:41 AM

#

Nah I've stepped away from research in general outside of work.

#

I'll give it a read though

#

But yea just been spending my time on some trading strats, been doing well on that so far, but not really anything you publish or do outside of your own money

spare flame Feb 2, 2026, 5:12 AM

#

o ok! just thought u might be interested since you were investigating PEER so heavily

quartz kestrel Feb 2, 2026, 5:32 AM

#

whole goblet It wasn't a gibberish generator at any scale I tested with, just not as strong a...

i did test this one https://huggingface.co/ThomasTheMaker/PEER-v1

#

when i say gibberish, not exactly gibberish, otherwise it wouldn't be PPL 7

#

it's just like repetitive patterns or sequences that don't make much sense to humans, but if you analyze the output, it carries meaning

#

to be fair, I wouldn't expect anything different from any model trained on wikitext-103-raw

#

I can't pretrain or finetune this scale on my hardware, would be awesome to see how PEER behaves on high quality educational datasets, such as FineWeb-Edu

whole goblet Feb 2, 2026, 6:24 PM

#

quartz kestrel I can't pretrain or finetune this scale on my hardware, would be awesome to see ...

I found the GH200's from Lambda to be the sweet spot for this kind of research. The architecture of those machines is also good if you want to test out multi-tiered expert retrieval

#

If PEER/Grajewski pan out, I (sloppily) hypothesized that this kind of architecture would be the ideal one. https://github.com/wrmedford/moe-scaling

#

Keep the non-expert transformer elements of the model in HBM, expert portions in system RAM/Storage, and scale model to infinity. Has some other issues that I tried to get around by using hypernetworks to compress expert knowledge (https://gist.github.com/wrmedford/ef452a86bae0c7dd1201b5e4e265729a)

#

Happy to go over what I learned, but I'm pretty sure that PEER is just a bitter lesson wrapped in some solid high level logic at this point. I think that if there is a path towards breaking past both memory bandwidth walls and increasing effective model size on this architecture, it'll have to be done through a hypernetwork. With that said, I'm not sure that a standard backprop will be the answer to better performance here.

Small experiments I did with different hypernetwork based models (where the hypernetwork creates the expert on the fly) never actually beat their dense equivalents.

whole goblet Feb 2, 2026, 6:56 PM

#

whole goblet Keep the non-expert transformer elements of the model in HBM, expert portions in...

also @spare flame you might find this interesting, might not. It didn't end up being better, but beam search to build a cartesian product as the input to a network worked reasonably well

#

It's basically where I dropped inquiry though

#ETHOS