Main Subnet Routing | Stockfish | Page 1

mossy wraith Feb 4, 2026, 10:57 PM

#

For a long time I have wondered whether piece count really is the best way to do the routing. There are so many "free" features one could use that are calculated anyway (e.g. for the histories, the simple eval score, the big pieces (queens / rooks), castling rights, etc.). Some simple combination of a couple of those should still be very cheap to calculate.

Has anyone tried something other than piece_count?
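For reference, a minimal sketch of what piece-count routing amounts to. It assumes 8 output buckets of 4 pieces each, which is my understanding of the current layout; the function name and signature are made up for illustration and are not Stockfish code.

```python
def piece_count_bucket(piece_count: int, num_buckets: int = 8) -> int:
    """Map the total number of pieces on the board (2..32) to a bucket index.

    Assumption: buckets are equal-width slices of the piece count.
    """
    if not 2 <= piece_count <= 32:
        raise ValueError("piece_count must be between 2 and 32")
    pieces_per_bucket = 32 // num_buckets  # 4 when num_buckets == 8
    return (piece_count - 1) // pieces_per_bucket


print(piece_count_bucket(32))  # full board: last bucket
print(piece_count_bucket(3))   # bare endgame: first bucket
```

Any alternative router only has to produce the same kind of small integer from features that are already available.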

A way more interesting idea might be to try and learn those features: one could take inspiration from MoE Transformers to train them smoothly and then approximate the learned router with decision tree stumps, which can be "unrolled" very efficiently on CPU.
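A rough sketch of what an unrolled stump router could look like. Everything here is hypothetical: the feature order, the thresholds, and the choice of one stump per bit of the bucket index. In practice the stumps would be fitted to imitate a router learned during training.

```python
from typing import Sequence, Tuple

# Hypothetical stumps: (feature_index, threshold). The numbers are made up.
# Assumed feature layout: [piece_count, simple_eval, queen_count].
STUMPS: Tuple[Tuple[int, int], ...] = ((0, 16), (1, 0), (2, 1))


def stump_router(features: Sequence[int],
                 stumps: Sequence[Tuple[int, int]] = STUMPS) -> int:
    """Each stump is one comparison and contributes one bit of the bucket index.

    Three stumps give 2**3 = 8 buckets, and the loop unrolls into three
    compare-and-shift operations with no data-dependent branches.
    """
    bucket = 0
    for bit, (idx, threshold) in enumerate(stumps):
        bucket |= int(features[idx] > threshold) << bit
    return bucket


print(stump_router([32, 50, 2]))   # all three comparisons true
print(stump_router([10, -30, 0]))  # all three comparisons false
```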

loud elm Feb 5, 2026, 11:13 AM

#

It's easy to say we want to try...

#

But it's not easy to try.

#

The easiest way to gain elo is to submit a few tests every once in a while...

#

And let fishtest handle the rest.

fossil frost Feb 5, 2026, 12:08 PM

#

IMO the raw evaluation should be stateless

#

How else is the trainer supposed to work

fossil frost Feb 5, 2026, 12:10 PM

#

mossy wraith For a long time I wondered whether using piece count really is the best way to d...

The bottleneck is not ideas but implementation; this stuff is not easy to add.

mossy wraith Feb 5, 2026, 3:45 PM

#

I'll try my best to implement it, but I have no GPU to run the work. Would anyone be willing to train?

mossy wraith Feb 5, 2026, 3:47 PM

#

loud elm It's easy to say we want to try...

That's true. But I still wonder if anyone has already tried, so I know where to start.

I noticed that there are a few things that seem a bit over-engineered. The PSQ is basically a skip connection, but in a weird way, just like the single value in the main subnet.

Skip connections are essential in modern deep learning, and that aligns with the NNUE framing of "the net has trouble learning high evals", because everything is clipped.
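A toy illustration of the clipping argument, not Stockfish's actual arithmetic: with a clipped activation the layered path is bounded by the sum of the absolute output weights, so a separate linear term (called psqt here, by assumption) is what lets the total reach large values.

```python
def clipped_relu(x: float) -> float:
    """Clamp to [0, 1], as a stand-in for the clipped activations in the net."""
    return min(max(x, 0.0), 1.0)


def eval_with_skip(hidden_pre, out_weights, psqt):
    """Bounded layered output plus an unbounded skip term."""
    layer_out = sum(w * clipped_relu(h) for w, h in zip(out_weights, hidden_pre))
    return layer_out + psqt


# The layered part cannot exceed sum(|w|) = 1.0 however large the hidden values get.
print(eval_with_skip([5.0, -3.0], [0.5, 0.5], psqt=0.0))
# The skip term carries the large eval.
print(eval_with_skip([5.0, -3.0], [0.5, 0.5], psqt=9.0))
```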

fossil frost Feb 5, 2026, 3:48 PM

#

mossy wraith Ill try my best to implement, but I have no GPU to run the work. Would anyone be...

Just edit nnue-pytorch and get a free H200 run in a few days

mossy wraith Feb 5, 2026, 3:48 PM

#

Oh ok, that's great news! Thanks for the info 🔥🔥

fossil frost Feb 5, 2026, 3:49 PM

#

Most experiments in the output buckets have been to change the specific piece count buckets

fossil frost Feb 5, 2026, 3:50 PM

#

fossil frost IMO the raw evaluation should be stateless

In particular it needs to be intrinsic properties I think

mossy wraith Feb 5, 2026, 3:50 PM

#

fossil frost Most experiments in the output buckets have been to change the specific piece co...

That's quite a reasonable start since 4 pieces really is quite trivial. Any results so far?

fossil frost Feb 5, 2026, 3:51 PM

#

Idk what @half gust last tried but farseer layout didn’t seem to gain

mossy wraith Feb 5, 2026, 3:52 PM

#

fossil frost In particular it needs to be intrinsic properties I think

Agreed, otherwise the dataset doesn't work. It's a bummer that getting clean data is so difficult.

Getting new data would probably be the single most effective way to get a better net...

half gust Feb 5, 2026, 3:53 PM

#

fossil frost Idk what @half gust last tried but farseer layout didn’t seem to gain

I did it dumbly, literally just changed buckets and didn’t change the input data

mossy wraith Feb 5, 2026, 3:53 PM

#

fossil frost In particular it needs to be intrinsic properties I think

I'm thinking about using the psq + linear for routing 🤔

half gust Feb 5, 2026, 3:53 PM

#

I just bought a GPU so now I can at least test correctness locally

#

(Not full runs of course, just make sure my logic is correct)

fossil frost Feb 5, 2026, 3:54 PM

#

mossy wraith I'm thinking about using the psq + linear for routing 🤔

Yeah this extends the “material count” routing
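One way to read "extends the material count routing", as a simplified sketch: route on a linear score over piece counts, of which plain piece count is the special case where every weight is 1. The weights, step and bias below are made up, and a real PSQ-based router would also depend on squares, which this ignores.

```python
def linear_router(piece_counts, weights, step, bias=0, num_buckets=8):
    """Bucket index from a linear score over per-piece-type counts."""
    score = bias + sum(weights[p] * n for p, n in piece_counts.items())
    return max(0, min(num_buckets - 1, score // step))


start = {"P": 16, "N": 4, "B": 4, "R": 4, "Q": 2, "K": 2}

# Unit weights with bias -1 and step 4 give plain piece-count routing.
unit = {p: 1 for p in start}
print(linear_router(start, unit, step=4, bias=-1))

# Hypothetical material-style weights give a different partition of positions.
material = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}
print(linear_router(start, material, step=10))
```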

mossy wraith Feb 5, 2026, 3:57 PM

#

Yeah I actually think doing small sub runs is essential to iron out the kinks initially

#

And to get a quick feel for different configurations

mossy wraith Feb 5, 2026, 4:13 PM

#

One of my biggest criticisms of the current SFNN is actually the tiny bottleneck from L1 -> 16 in the main subnet. I think a "grouped linear" design would be able to not only save parameters but also keep much more information. E.g. L1/N -> 16, but N times.

fossil frost Feb 5, 2026, 5:50 PM

#

mossy wraith One of my biggest criticism with the current SFNN is actually the tiny bottlenec...

you need to be aware of the computational cost

#

this one is by far the most expensive in terms of arithmetic

#

actually now threat inputs are probably more expensive in arithmetic

#

but yeah

#

pretty much the way it works is

#

you have expensive things

#

then you optimize them

#

and when you optimize them you can gain a little bit extra by then pushing the sizes just a bit farther

mossy wraith Feb 5, 2026, 9:58 PM

#

Yes, L1/N -> 16 N times has the same cost as L1 -> 16, but then you have 16 * N neurons in the next layer, which keeps a lot more information at low extra cost in the following layers. That's why I think a single bottleneck is not great.

I'd assume that's also the reason for diminishing returns in L1 and why it reverted back to 1024
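The cost claim can be checked with a small sketch that counts multiply-accumulates (MACs) and implements a grouped layer naively. The function names are illustrative only; L1 = 1024 and N = 4 are example values.

```python
def dense_macs(n_in: int, n_out: int) -> int:
    """MACs for a fully connected layer."""
    return n_in * n_out


def grouped_macs(n_in: int, n_out_per_group: int, groups: int) -> int:
    """MACs when each of `groups` slices of the input feeds its own small layer."""
    if n_in % groups:
        raise ValueError("n_in must be divisible by groups")
    return groups * (n_in // groups) * n_out_per_group


def grouped_forward(x, weights):
    """weights[g][o] is the weight row for output o of group g (bias omitted)."""
    size = len(x) // len(weights)
    out = []
    for g, group_w in enumerate(weights):
        chunk = x[g * size:(g + 1) * size]  # this group's slice of the input
        for row in group_w:
            out.append(sum(w * v for w, v in zip(row, chunk)))
    return out


print(dense_macs(1024, 16))       # L1 -> 16
print(grouped_macs(1024, 16, 4))  # (L1/4 -> 16) x 4: same MACs, 64 outputs
print(grouped_forward([1, 2, 3, 4], [[[1, 1]], [[1, -1]]]))
```

The first layer costs the same either way, but the following layer now has 16 * N inputs, so its own cost grows by a factor of N (for example 64 * 32 instead of 16 * 32 MACs if the next layer has 32 outputs).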

fossil frost Feb 5, 2026, 10:05 PM

#

reverting back was because there are now also many more input features

fossil frost Feb 6, 2026, 2:34 AM

#

mossy wraith Yes L1/N -> 16 N times has the same cost as L1 -> 16 but then you have 16 * N ne...

Feel free to try something like 1024 -> 256 -> 16 -> 1

#

With this idea
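One possible reading of this suggestion is that the 1024 -> 256 layer is built from 16 groups of 64 -> 16. Under that assumption, a rough MAC count for the whole stack looks like this; `stack_macs` is a made-up helper, and the 1024 -> 16 -> 32 -> 1 comparison stack is only a stand-in whose sizes may not match the real net exactly.

```python
def stack_macs(sizes, first_layer_groups=1):
    """MACs for a chain of affine layers, e.g. [1024, 256, 16, 1].

    With first_layer_groups > 1 the first layer is block-diagonal: each group
    maps its own slice of the input to its own slice of the output.
    """
    n_in, n_out = sizes[0], sizes[1]
    if n_in % first_layer_groups or n_out % first_layer_groups:
        raise ValueError("layer sizes must be divisible by the group count")
    total = n_in * n_out // first_layer_groups
    for a, b in zip(sizes[1:], sizes[2:]):
        total += a * b
    return total


print(stack_macs([1024, 256, 16, 1]))                         # dense first layer
print(stack_macs([1024, 256, 16, 1], first_layer_groups=16))  # grouped first layer
print(stack_macs([1024, 16, 32, 1]))                          # stand-in baseline
```

Under these assumptions the grouped variant stays close to the baseline, while the fully dense 1024 -> 256 layer would be roughly 16 times more expensive.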

jolly viper Feb 6, 2026, 7:46 AM

#

mossy wraith Yes L1/N -> 16 N times has the same cost as L1 -> 16 but then you have 16 * N ne...

L1 is the layer that increases net strength the most when increased, actually (especially when compared to the corresponding slowdown)

#

Increasing L3 doesn't really do anything (for obvious reasons), and increasing L2 is incredibly expensive

#

I definitely think your idea has potential

mossy wraith Feb 6, 2026, 3:11 PM

#

Does anyone know whether the current inference code is capable of handling heterogeneous experts? E.g. for small piece counts it's likely that a smaller expert will be good enough while being significantly faster. Or will it not improve in practice because of how the memory is managed?

jolly viper Feb 6, 2026, 3:13 PM

#

stockfish already uses a completely separate small network, specifically trained to evaluate positions with high material imbalance (and used in those same cases to speed things up)
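A minimal sketch of that kind of selection, assuming the decision is a threshold on a cheap material-difference score. The threshold value and function names are hypothetical and not taken from the Stockfish source.

```python
SMALL_NET_THRESHOLD = 1000  # hypothetical value, in internal eval units


def simple_eval(own_material: int, opp_material: int) -> int:
    """Cheap material difference from the side to move's point of view."""
    return own_material - opp_material


def pick_network(own_material: int, opp_material: int,
                 threshold: int = SMALL_NET_THRESHOLD) -> str:
    """Use the small net when the material imbalance is large."""
    imbalance = abs(simple_eval(own_material, opp_material))
    return "small" if imbalance > threshold else "big"


print(pick_network(3000, 500))   # heavily imbalanced
print(pick_network(3000, 2900))  # roughly balanced
```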

#

it's certainly possible to make some of the output buckets have more weights

#

since that sounds more like what you mean

fossil frost Feb 6, 2026, 3:14 PM

#

You could theoretically load in output subnets of differing sizes yes

#

with some hacks

jolly viper Feb 6, 2026, 3:15 PM

#

but i don't think this is worth it from a speed perspective (since the last layers are already very small)

#

and e.g. decreasing L2 to 8 would not even help on avx512

fossil frost Feb 6, 2026, 3:17 PM

#

sf is special since the 32 -> 32 affine is also u8 * i8 -> i32
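For readers unfamiliar with the notation, a plain-Python sketch of a u8 * i8 -> i32 affine layer: unsigned 8-bit activations, signed 8-bit weights, and 32-bit accumulators. It only illustrates the integer ranges involved; the real code is SIMD, and the function name here is made up.

```python
def affine_u8_i8(inputs, weights, biases):
    """Integer affine layer: out[o] = bias[o] + sum_i inputs[i] * weights[o][i]."""
    if any(not 0 <= x <= 255 for x in inputs):
        raise ValueError("inputs must fit in u8")
    if any(not -128 <= w <= 127 for row in weights for w in row):
        raise ValueError("weights must fit in i8")
    out = []
    for row, bias in zip(weights, biases):
        acc = bias + sum(x * w for x, w in zip(inputs, row))
        if not -2**31 <= acc < 2**31:
            raise OverflowError("accumulator does not fit in i32")
        out.append(acc)
    return out


print(affine_u8_i8([255, 0, 10], [[127, 5, -1], [-128, 0, 2]], [100, -50]))
```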

#

so idk generically maybe increasing L3 to 64

#

could be a thing
