#Main Subnet Routing


mossy wraith
#

For a long time I wondered whether using piece count really is the best way to do the routing. There are so many "free" features one could use that are calculated anyway (like the histories, the simple eval score, big pieces (queens / rooks), castling rights, etc.). Some simple combination of a couple of those should still be very cheap to calculate.

Has anyone tried something other than piece_count?

A way more interesting idea might be to try to learn those features: one could take inspiration from MoE Transformers to train the router smoothly and then approximate the learned router with decision tree stumps, which can be "unrolled" very efficiently on CPU.
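To make the comparison concrete, here is a minimal sketch of the two routing styles. The first function reflects my understanding of the current piece-count routing (8 buckets, 4 piece counts each); the second is purely hypothetical: the function name, the threshold of 16, and the choice of "queens present" as a feature are made up for illustration, not learned or tested.

```python
def piece_count_bucket(piece_count: int) -> int:
    """Piece-count routing: total pieces on the board (2..32) -> bucket 0..7."""
    return (piece_count - 1) // 4


def stump_bucket(piece_count: int, has_queens: bool) -> int:
    """Hypothetical router: two decision stumps unrolled into plain comparisons.

    Both the threshold and the feature choice are illustrative assumptions.
    """
    late = piece_count <= 16   # stump 1: material threshold (made up)
    queens = has_queens        # stump 2: a "free" feature that is known anyway
    return 2 * int(late) + int(queens)   # 4 buckets: 0..3


start_bucket = piece_count_bucket(32)    # full board
endgame_bucket = piece_count_bucket(5)   # e.g. K+R+P vs K+R
routed = stump_bucket(10, True)
```

Each stump is a single comparison, so the unrolled router costs a handful of integer operations per evaluation.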

loud elm
#

It's easy to say we want to try...

#

But it's not easy to try.

#

The easiest way to gain elo is to submit a few tests every once in a while...

#

And let fishtest handle the rest.

fossil frost
#

IMO the raw evaluation should be stateless

#

How else is the trainer supposed to work

fossil frost
mossy wraith
#

I'll try my best to implement it, but I have no GPU to run the work. Would anyone be willing to train?

mossy wraith
# loud elm It's easy to say we want to try...

That's true. But I still wonder if anyone has already tried, so I know where to start.

I noticed that there are a few things that seem a bit over-engineered. The PSQ is basically a skip connection, but in a weird way, just like the single value in the main subnet.

Skip connections are essential in modern deep learning, and that aligns with the NNUE framing of "the net has trouble learning high evals", because everything is clipped.

fossil frost
mossy wraith
#

Oh ok, that's great news! Thanks for the info 🔥🔥

fossil frost
#

Most experiments in the output buckets have been to change the specific piece count buckets

fossil frost
mossy wraith
fossil frost
#

Idk what @half gust last tried but farseer layout didn’t seem to gain

mossy wraith
half gust
mossy wraith
half gust
#

I just bought a GPU so now I can at least test correctness locally

#

(Not full runs of course, just make sure my logic is correct)

fossil frost
mossy wraith
#

Yeah I actually think doing small sub runs is essential to iron out the kinks initially

#

And to get a quick feel for different configurations

mossy wraith
#

One of my biggest criticisms of the current SFNN is actually the tiny bottleneck from L1 -> 16 in the main subnet. I think a "grouped linear" design would be able to not only save parameters but also keep much more information, e.g. L1/N -> 16, but N times.

fossil frost
#

this one is by far the most expensive in terms of arithmetic

#

actually now threat inputs are probably more expensive in arithmetic

#

but yeah

#

pretty much the way it works is

#

you have expensive things

#

then you optimize them

#

and when you optimize them you can gain a little bit extra by then pushing the sizes just a bit farther

mossy wraith
#

Yes, L1/N -> 16 N times has the same cost as L1 -> 16, but then you have 16 * N neurons in the next layer, which keeps a lot more information at low extra cost in the following layers. That's why I think a single bottleneck is not great.

I'd assume that's also the reason for diminishing returns in L1 and why it reverted back to 1024
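A rough NumPy sketch of the cost argument, assuming L1 = 1024, 16 outputs per group and N = 4 (all sizes are illustrative, and the function names are mine, not from the trainer). It only counts multiply-accumulates; it says nothing about playing strength or about how well either layout vectorizes.

```python
import numpy as np


def dense_bottleneck(x, W):
    """Single bottleneck: x is (L1,), W is (out, L1) -> (out,)."""
    return W @ x


def grouped_bottleneck(x, Wg):
    """Grouped linear: Wg is (N, out, L1 // N); each group sees its own slice of x."""
    n = Wg.shape[0]
    xg = x.reshape(n, -1)                                # (N, L1 / N)
    return np.einsum("noi,ni->no", Wg, xg).reshape(-1)   # (N * out,)


L1, OUT, N = 1024, 16, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(L1)
W = rng.standard_normal((OUT, L1))
Wg = rng.standard_normal((N, OUT, L1 // N))

dense_out = dense_bottleneck(x, W)        # 16 values
grouped_out = grouped_bottleneck(x, Wg)   # 16 * N = 64 values

# One multiply-accumulate per weight in both layouts.
dense_macs = W.size      # 16 * 1024
grouped_macs = Wg.size   # 4 * 16 * 256
```

Both layouts use the same number of weights and multiply-accumulates here, but the grouped one hands 64 values to the next layer instead of 16. The extra cost shows up in the following layer, whose input width grows by a factor of N.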

fossil frost
#

reverting back was because the input features are also much more

fossil frost
#

With this idea

jolly viper
#

Increasing L3 doesn't really do anything (for obvious reasons), and increasing L2 is incredibly expensive

#

I definitely think your idea has potential

mossy wraith
#

Does anyone know whether the current inference code is capable of handling heterogeneous experts? E.g. for small piece counts it's likely that a smaller expert will be good enough while being significantly faster. Or will it not improve in practice because of how the memory is managed?

jolly viper
#

stockfish already uses a completely separate small network, specifically trained to evaluate positions with high material imbalance (and used in those same cases to speed things up)

#

it's certainly possible to make some of the output buckets have more weights

#

since that sounds more like what you mean

fossil frost
#

You could theoretically load in output subnets of differing sizes yes

#

with some hacks

jolly viper
#

but i don't think this is worth it from a speed perspective (since the last layers are already very small)

#

and e.g. decreasing L2 to 8 would not even help on avx512

fossil frost
#

sf is special since the 32 -> 32 affine is also u8 * i8 -> i32
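For anyone unfamiliar with the notation, "u8 * i8 -> i32" means unsigned 8-bit activations times signed 8-bit weights, accumulated in 32-bit integers. A minimal NumPy sketch of that arithmetic follows; the function name and the tiny shapes are made up, and the real code uses SIMD rather than anything like this.

```python
import numpy as np


def affine_u8_i8(x_u8, w_i8, b_i32):
    """Integer affine layer: u8 activations, i8 weights, i32 accumulator and bias."""
    # Widen both operands first so the products cannot overflow 8 bits.
    acc = w_i8.astype(np.int32) @ x_u8.astype(np.int32)
    return acc + b_i32


x = np.array([0, 127, 255, 3], dtype=np.uint8)                    # clipped activations
w = np.array([[1, -1, 2, 0], [-128, 0, 0, 127]], dtype=np.int8)   # quantized weights
b = np.array([10, -5], dtype=np.int32)

out = affine_u8_i8(x, w, b)
```

The activations can be unsigned because they come out of a clipped activation and are never negative, while the weights need a sign.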

#

so idk generically maybe increasing L3 to 64

#

could be a thing