UE Threat Inputs for AB | Stockfish | Page 8

rare jacinth Oct 25, 2025, 8:01 AM

#

where can I find the latest branch with all changes merged?

#

and what percent of time is going to threat calculation right now?

prime mica Oct 25, 2025, 8:01 AM

#

https://github.com/xu-shawn/Stockfish/tree/threat_inputs

twilit oriole Oct 25, 2025, 8:02 AM

#

the profiling is posted somewhere. don't know the search terms to find it, but someone should

rare jacinth Oct 25, 2025, 8:03 AM

#

and are we still maintaining two accumulators, one for each side/using king buckets?

rocky vigil Oct 25, 2025, 8:23 AM

#

yes

#

two accumulators

#

well actually 4 now

#

2 for each side

#

one tracks king bucketed part

#

the other tracks threats

prime mica Oct 25, 2025, 8:23 AM

#

dumb question, could you combine them

rocky vigil Oct 25, 2025, 8:23 AM

#

rare jacinth and what percent of time is going to threat calculation right now?

around 5% last time profiling was run, maybe someone could run again with latest branch

prime mica Oct 25, 2025, 8:24 AM

#

actually where are they combined lol

rocky vigil Oct 25, 2025, 8:24 AM

#

prime mica dumb question, could you combine them

this actually loses, because then every time you move the king you need to refresh the entire accumulator

prime mica Oct 25, 2025, 8:24 AM

#

ohh good point

rocky vigil Oct 25, 2025, 8:24 AM

#

prime mica actually where are they combined lol

at the inference call

#

another idea, probably minor

#

since we never evaluate in check

#

we can ignore threats that would imply a check

#

but this is odd because it involves color

rare jacinth Oct 25, 2025, 8:28 AM

#

does that mean we have 4 accumulators of size 1024, each according to one perspective?

rocky vigil Oct 25, 2025, 8:28 AM

#

yeah, but really think of it as separating the big accumulator

#

e.g. threats + halfkav2_hm

#

so there is an accumulator which tracks all threat feature contributions

#

and there is an accumulator which tracks all halfkav2_hm feature contributions (this one also has the biases)

#

and the "true" accumulator is the sum of these two
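
A minimal sketch of the split described above, under stated assumptions: the names (`SplitAccumulator`, `add_feature`, `refresh_psq`, `combine`) and the toy width are hypothetical and are not taken from the `threat_inputs` branch. It only illustrates keeping two partial sums per perspective and adding them when the network is evaluated.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy width for illustration; the chat mentions accumulators of size 1024.
constexpr std::size_t kWidth = 8;
using Acc = std::array<std::int16_t, kWidth>;

// Hypothetical layout: one perspective keeps two partial sums.
struct SplitAccumulator {
    Acc psq{};      // king-bucketed (halfkav2_hm) contributions, plus the biases
    Acc threats{};  // threat feature contributions
};

// Incrementally add one feature's weight column to a partial accumulator.
inline void add_feature(Acc& acc, const Acc& weights) {
    for (std::size_t i = 0; i < kWidth; ++i)
        acc[i] = static_cast<std::int16_t>(acc[i] + weights[i]);
}

// When the king bucket changes, only the bucketed part is rebuilt here;
// the threat part is left untouched (the motivation given in the chat).
inline void refresh_psq(SplitAccumulator& a, const Acc& rebuilt) {
    a.psq = rebuilt;
}

// The "true" accumulator is formed only at the inference call.
inline Acc combine(const SplitAccumulator& a) {
    Acc out{};
    for (std::size_t i = 0; i < kWidth; ++i)
        out[i] = static_cast<std::int16_t>(a.psq[i] + a.threats[i]);
    return out;
}
```

If the two parts were stored as one array, the rebuild in `refresh_psq` would also have to recompute every threat contribution, which is the cost mentioned above.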

lofty cedar Oct 25, 2025, 8:29 AM

#

Okay... another one potential threat input speedup.

#

Apruvu sama.

rocky vigil Oct 25, 2025, 8:30 AM

#

rocky vigil we can ignore threats that would imply a check

very unsure how much this would save

#

requires a trainer change

#

so is much more involved

prime mica Oct 25, 2025, 8:30 AM

#

threats implying a check is what, king attacks?

rocky vigil Oct 25, 2025, 8:30 AM

#

yeah, like if a piece of opposite color attacks a king

lofty cedar Oct 25, 2025, 8:30 AM

#

I think so...

prime mica Oct 25, 2025, 8:31 AM

#

rocky vigil yeah, like if a piece of opposite color attacks a king

but also the king attacks that piece?

#

or just one way

rocky vigil Oct 25, 2025, 8:31 AM

#

no i think as long as the king is attacked by a piece of the opposite color

#

we are in check

#

applies to both kings i think

prime mica Oct 25, 2025, 8:31 AM

#

I see

rocky vigil Oct 25, 2025, 8:31 AM

#

rocky vigil requires a trainer change

actually it might not

#

now that I think of it

#

since those features are actually never used anyways

#

it is def minor though

prime mica Oct 25, 2025, 8:32 AM

#

😩

lofty cedar Oct 25, 2025, 8:37 AM

#

Looks like I have yet another cursed speedup idea.

prime mica Oct 25, 2025, 8:37 AM

#

lol

#

you have rly pivoted

lofty cedar Oct 25, 2025, 8:38 AM

#

I mean... I still do regular patches, but speedups are needed for threat inputs.

prime mica Oct 25, 2025, 8:38 AM

#

^_^

rocky vigil Oct 25, 2025, 8:38 AM

#

Curious to see how far the speedups can go

rare jacinth Oct 25, 2025, 8:38 AM

#

why are we always updating threats after do_move when a large part of the time we return immediately

prime mica Oct 25, 2025, 8:38 AM

#

here's somethin

#

will put on fishtest and see

lofty cedar Oct 25, 2025, 8:39 AM

#

Do you report a speedup on mine?

rocky vigil Oct 25, 2025, 8:39 AM

#

rare jacinth why are we always updating threats after do_move when a large part of the time w...

One potentially major improvement is to defer the threat calculation to eval time, but this is a rather involved change

prime mica Oct 25, 2025, 8:39 AM

#

lofty cedar Do you report a speedup on mine?

didn't I send u

#

was major

lofty cedar Oct 25, 2025, 8:39 AM

#

No, the new one.

prime mica Oct 25, 2025, 8:40 AM

#

oh

rocky vigil Oct 25, 2025, 8:40 AM

#

If we can bandaid easy things for now that would also be good in the short term

rare jacinth Oct 25, 2025, 8:40 AM

#

rocky vigil One potentially major improvement is to defer the threat calculation to eval tim...

can you show me the source that showed threat input calculations is 5% before I go about this

prime mica Oct 25, 2025, 8:40 AM

#

which new one

lofty cedar Oct 25, 2025, 8:40 AM

#

https://tests.stockfishchess.org/tests/view/68fc8a55637acd2a11e72da5

#

Basically, I loaded four cache lines to force the prefetcher to prefetch.

#

And then come back later to finish.

rocky vigil Oct 25, 2025, 8:41 AM

#

This was from a week ago, right is the newer one I think

prime mica Oct 25, 2025, 8:42 AM

#

lofty cedar Basically, I loaded four cache lines to force the prefetcher to prefetch.

powerful lmao

#

I've never gotten prefetching to help

lofty cedar Oct 25, 2025, 8:42 AM

#

Not sure if it helps either.

rocky vigil Oct 25, 2025, 8:44 AM

#

In other news stc smp indeed concluded at ~neutral with master this time in 10k games

#

If nothing else I guess wait for a week of speedups while the double length net training runs and see where we are then

prime mica Oct 25, 2025, 8:49 AM

#

@rare jacinth

#

usual caveat that my computer is weird applies, but here's threat_inputs at the moment

lofty cedar Oct 25, 2025, 8:50 AM

#

How's my prefetch patch?

#

Does it work?

lofty cedar Oct 25, 2025, 8:51 AM

#

prime mica usual caveat that my computer is weird applies, but here's `threat_inputs` at th...

Oh, your computer isn't weird... it's ahead of its time.

prime mica Oct 25, 2025, 8:51 AM

#

I'll try it out in a bit

violet badger Oct 25, 2025, 8:58 AM

#

what did you run? speedtest multithreaded or bench single threaded?

lofty cedar Oct 25, 2025, 8:58 AM

#

I run speedtest 16 threads.

#

And this latest one reported a slight speedup.

#

Though my PC ain't very reliable at speedtest. There are background processes and so on.

rare jacinth Oct 25, 2025, 9:01 AM

#

prime mica @rare jacinth

I'm not seeing set_check_info for some reason even though that's at least 1% slowdown from my understanding

prime mica Oct 25, 2025, 9:02 AM

#

it probably got inlined

violet badger Oct 25, 2025, 9:03 AM

#

that's my local result.. on speedtest

#

and here 32 concurrent single threaded speedtests

#

quite a difference..

prime mica Oct 25, 2025, 9:07 AM

#

shared-memory patch 😭

lofty cedar Oct 25, 2025, 9:15 AM

#

Well... here's a thing. I was desperately looking for speedups so we could push threat_inputs so maybe some ideas weren't working.

#

@prime mica
But well... here's something. If you can investigate why my monomorphization patch speeds up massively on some machines but not others, maybe we can find a way?

prime mica Oct 25, 2025, 9:19 AM

#

I’m really not sure

#

I think it might simply be the memory bandwidth issue

#

And threat inputs weights being big

#

It’s something that’s fundamentally impossible to tune for

lofty cedar Oct 25, 2025, 9:22 AM

#

Would SMP help?

#

Let's try SMP?

naive comet Oct 25, 2025, 9:24 AM

#

I think mmap

lofty cedar Oct 25, 2025, 9:25 AM

#

Hmm? WDYM?

naive comet Oct 25, 2025, 9:28 AM

#

sharing the net across instances

will be a gainer for master but a greater gainer for threat inputs cuz of fatter net

lofty cedar Oct 25, 2025, 9:29 AM

#

Oh, yeah...

#

But for this, let's try this first.

prime mica Oct 25, 2025, 9:34 AM

#

We could try doing threat inputs STC + shared memory vs master with shared memory

#

If it works then could catalyze the shared memory branch to be pushed over the finish line

violet badger Oct 25, 2025, 9:34 AM

#

I had a look, it is probably still fairly easy to rebase the mmap branch on the threats branch tbh.

prime mica Oct 25, 2025, 9:35 AM

#

Huzzah

violet badger Oct 25, 2025, 9:35 AM

#

so good enough for testing.

#

but the mmap branch still needs some work..

#

it is a bit of a beast in itself.

prime mica Oct 25, 2025, 9:35 AM

#

Lol

violet badger Oct 25, 2025, 10:01 AM

#

threats+mmap seems to beat master (without mmap) in a quick and dirty test...

#

(not entirely fair, obviously)

lofty cedar Oct 25, 2025, 11:53 AM

#

How's the new prefetch speedtest?

rocky vigil Oct 25, 2025, 6:44 PM

#

prime mica @rare jacinth

Interesting, seems like append_changed_indices disappeared from the hot spots

#

Actually this is strange, it used to take 5% of runtime, I don’t believe now it takes <1%

#

That would be too good to be true

violet badger Oct 25, 2025, 7:03 PM

#

maybe the compiler realized it could be inlined ...

prime mica Oct 25, 2025, 8:00 PM

#

yeah it probably got inlined

prime mica Oct 25, 2025, 8:01 PM

#

lofty cedar How's the new prefetch speedtest?

seems to make no difference 😩

prime mica Oct 25, 2025, 8:19 PM

#

https://tests.stockfishchess.org/tests/live_elo/68fc8e33637acd2a11e72dad wow this is failing hard

#

despite it being a 2% speedup locally

daring wren Oct 25, 2025, 8:47 PM

#

"failing hard"

prime mica Oct 25, 2025, 8:54 PM

#

yes

#

fail high, fail low, fail hard

#

I mean if it were a 2% speedup across the board it should pass STC quite quickly

rocky vigil Oct 25, 2025, 9:04 PM

#

Fishtest the dream crusher

prime mica Oct 25, 2025, 9:05 PM

#

I'ma put up shared-memory vs. threat inputs + shared-memory on fishtest

#

unless someone's done that already or has objections

rocky vigil Oct 25, 2025, 9:06 PM

#

feel free, if you can get it to work

lofty cedar Oct 25, 2025, 10:05 PM

#

prime mica despite it being a 2% speedup locally

But well, was that 2% even less than what I gave?

i-cache problem?

prime mica Oct 25, 2025, 10:09 PM

#

yeah potench

#

I was working off of the tip of threat_inputs rather than urs tho

#

curious how your approach would work when applied to the current

lofty cedar Oct 25, 2025, 10:14 PM

#

Well... specializing everything is not that good.

#

It's a niche optimization, not something to be broadly applied.

#

This ruins i-cache like crazy.

prime mica Oct 25, 2025, 10:23 PM

#

agreed

#

the aggressive unrolling doesn't help

#

I actually don't think unrolling the fused thing is ever helpful

amber fern Oct 25, 2025, 10:47 PM

#

violet badger threats+mmap seems to beat master (without mmap) in a quick and dirty test...

Have you got the fishtest link yet? 🙂

lofty cedar Oct 26, 2025, 1:49 AM

#

What? These two speedup patches seem to both not work on their own but combined they work together?

https://tests.stockfishchess.org/tests/view/68fcc468637acd2a11e72df2

naive comet Oct 26, 2025, 1:52 AM

#

is this your first time using fishtest

#

don't jinx

#

but also

#

please use pyshbench or smth to double check this

lofty cedar Oct 26, 2025, 1:55 AM

#

One of the patches works massively well on anematode's machine while the other doesn't work. Mine seems to report a marginal speedup on the prefetch one.

naive comet Oct 26, 2025, 1:55 AM

#

I mean the combined patch

lofty cedar Oct 26, 2025, 1:56 AM

#

Seems to have a marginal speedup.

violet badger Oct 26, 2025, 6:24 AM

#

amber fern Have you got the fishtest link yet? 🙂

No, but in local testing:

   1 mmap-master                       :    19.3    1.2  155153.5  294912    53
   2 mmap-1024-nn-26b0e5126117.nnue    :     9.1    1.2  148825.5  294912    50
   3 master                            :     0.0   ----  109242.0  221184    49
   4 1024-nn-26b0e5126117.nnue         :   -15.4    1.3  102875.0  221184    47

rocky vigil Oct 26, 2025, 6:30 AM

#

-10 with mmap on your machine is better than the normal -15 i guess

violet badger Oct 26, 2025, 6:31 AM

#

right

#

benefits threats somewhat more than master, as expected.

stray reef Oct 26, 2025, 9:36 PM

#

Alright here we go.

i8 feature weights (but only the threat weights)

Fixed nodes:

Elo   | -2.52 +- 4.09 (95%)
Conf  | N=20000 Threads=1 Hash=16MB
Games | N: 10084 W: 2811 L: 2884 D: 4389
Penta | [153, 1260, 2291, 1183, 155]

https://furybench.com/test/3531/

STC:

Elo   | 11.81 +- 4.29 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6212 W: 1627 L: 1416 D: 3169
Penta | [6, 630, 1632, 823, 15]

https://furybench.com/test/3530/
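
The LLR bounds printed above, (-2.25, 2.89), match Wald's SPRT thresholds for error rates α = 0.05 and β = 0.10. Those rates are not stated in the chat; they are an assumption that happens to reproduce the printed numbers. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cmath>

struct SprtBounds {
    double lower;  // stop and reject the patch when LLR falls below this
    double upper;  // stop and accept the patch when LLR rises above this
};

// Wald's thresholds for a sequential probability ratio test.
// alpha: accepted false-positive rate, beta: accepted false-negative rate.
inline SprtBounds sprt_bounds(double alpha, double beta) {
    return {std::log(beta / (1.0 - alpha)), std::log((1.0 - beta) / alpha)};
}
```

With `sprt_bounds(0.05, 0.10)` the bounds come out at roughly -2.25 and 2.89, so the STC run above stopped once its LLR reached the upper bound.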

#

(this is with QA=255, clamped into [-128, 127], about 3000 or so of 50M weights are affected by this, no QAT, no special clamping during training)
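
A minimal sketch of the post-training clamp described above, assuming the weights are already quantized to 16-bit integers at QA=255. `clamp_to_i8` and `ClampResult` are hypothetical names; the counter mirrors the "about 3000 of 50M weights" observation by reporting how many values had to be clipped.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ClampResult {
    std::vector<std::int8_t> weights;  // clamped copies of the input
    std::size_t clipped;               // inputs that were outside [-128, 127]
};

// Clamp already-quantized i16 weights into the i8 range, with no retraining.
inline ClampResult clamp_to_i8(const std::vector<std::int16_t>& w16) {
    ClampResult r{{}, 0};
    r.weights.reserve(w16.size());
    for (std::int16_t w : w16) {
        const int c = std::max(-128, std::min(127, static_cast<int>(w)));
        if (c != w)
            ++r.clipped;
        r.weights.push_back(static_cast<std::int8_t>(c));
    }
    return r;
}
```

Weights inside the range are stored unchanged, so only the clipped ones change the network's output.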

rocky vigil Oct 26, 2025, 9:40 PM

#

for sf one of the more pressing concerns is that weights are stored as x2 internally

#

i have not found a good way to bypass this

#

either we do two shifts, or perform a double add when combining the accumulators

twilit oriole Oct 26, 2025, 9:59 PM

#

buff_doge

rocky vigil Oct 26, 2025, 10:02 PM

#

anyways if someone who is better at simd would like to do this in sf

#

take the current net

#

read in the threat weights as normal i16s (important: don't multiply by x2)

#

clamp them to i8s

#

unpack i8 simd vectors into i16s

#

deal with the x2 somehow on inference
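
One reading of the last two steps, as a scalar sketch under stated assumptions: the king-bucketed accumulator already holds doubled values, the threat accumulator holds undoubled values widened from i8, and the factor of two is reapplied when the two are combined. All function names are hypothetical, and a real implementation would do this on SIMD vectors.

```cpp
#include <cstdint>

// Widen one stored i8 threat weight to the i16 accumulator type.
inline std::int16_t widen(std::int8_t w) {
    return static_cast<std::int16_t>(w);
}

// Option 1: "double add". Add the undoubled threat sum twice.
inline std::int16_t combine_double_add(std::int16_t psq2x, std::int16_t threat1x) {
    return static_cast<std::int16_t>(psq2x + threat1x + threat1x);
}

// Option 2: one extra doubling step (a shift left by one in SIMD code;
// written as a multiply here to keep negative values well defined).
inline std::int16_t combine_shift(std::int16_t psq2x, std::int16_t threat1x) {
    return static_cast<std::int16_t>(psq2x + threat1x * 2);
}
```

Both options give the same result; the trade-off discussed in the chat is the extra instruction either one adds to the combine step.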

naive comet Oct 26, 2025, 11:21 PM

#

https://github.com/official-stockfish/Stockfish/commit/c6a1e7fd4232ec151206fab16cb7daa23bfd7137 the x2 thing is barely a speedup anyways @rocky vigil

rocky vigil Oct 26, 2025, 11:21 PM

#

wait +1.7% is like

#

decent

#

would like to preserve it if possible

frosty imp Oct 26, 2025, 11:23 PM

#

what's the issue with x2 now?

rocky vigil Oct 26, 2025, 11:23 PM

#

clamping threat weights to i8

naive comet Oct 26, 2025, 11:23 PM

#

basically because we were never bounded by size of weights we can x2 freely

#

but now since we want to clamp it to i8 we cant do that

naive comet Oct 26, 2025, 11:23 PM

#

stray reef Alright here we go. i8 feature weights (but only the threat weights) Fixed nod...

this is actually gg btw

frosty imp Oct 26, 2025, 11:23 PM

#

clamp it to i8/2 Kappa

rocky vigil Oct 26, 2025, 11:24 PM

#

naively, x2 can be dealt with by either a double add on combining accumulators, or by introducing an additional shift

#

both seem not ideal

naive comet Oct 26, 2025, 11:25 PM

#

OH wait

#

yoshie

naive comet Oct 26, 2025, 11:25 PM

#

stray reef (this is with QA=255, clamped into [-128, 127], about 3000 or so of 50M weights ...

^^ this thing

#

we can just QA=127, clamped to [-64, 63]

#

or we can QA=255, clamped to [-128, 127] and elide the *2 during load altogether
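
The two options above can be checked with simple range arithmetic. This sketch assumes the existing convention stores `2 * w` for a trained weight `w`; the helper names are hypothetical.

```cpp
#include <cstdint>

// True if v can be stored in a signed 8-bit integer.
inline bool fits_i8(int v) {
    return v >= INT8_MIN && v <= INT8_MAX;
}

// Value stored for a trained weight w when the x2 convention is kept.
inline int stored_with_x2(int w) {
    return 2 * w;
}
```

With the x2 kept, weights clamped to [-64, 63] are stored as values in [-128, 126], which fit in i8. Weights clamped to [-128, 127] only fit if the doubling is skipped at load time, which is the second option.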

rocky vigil Oct 26, 2025, 11:27 PM

#

doesn't that require a training change?

naive comet Oct 26, 2025, 11:27 PM

#

yeah

#

dont we have vondele's gpu?

rocky vigil Oct 26, 2025, 11:27 PM

#

true

naive comet Oct 26, 2025, 11:28 PM

#

I'm sure 1 retrain for practically guaranteed passer is fine 😉

#

yoshie 🤝 sf

rocky vigil Oct 26, 2025, 11:28 PM

#

if we request a new training run let's also wait to figure out weight clipping then

#

i'm sure retraining plenty net with weight clipping is also worth some additional

rocky vigil Oct 26, 2025, 11:29 PM

#

naive comet or we can QA=255, clamped to [-128, 127] and elide the *2 during load altogether

is there a reason why 127 was done previously?

#

instead of 255

naive comet Oct 26, 2025, 11:29 PM

#

no retrain

rocky vigil Oct 26, 2025, 11:30 PM

#

oh

#

so it was an old choice

#

that just got locked in

#

over all the different stages

naive comet Oct 26, 2025, 11:31 PM

#

that patch was written right before linrock quit I think

#

one of his last experiments I told him to try 255

#

but it never saw the light of day

rocky vigil Oct 26, 2025, 11:31 PM

#

ah

green moat Oct 27, 2025, 12:02 AM

#

currently vondele is training a 1024 factorized net with 1600 epochs/stage.
It will be ready ~Friday, 31st October.
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2119342422

rocky vigil Oct 27, 2025, 12:02 AM

#

after recent developments it might even be outdated before it's done lol

green moat Oct 27, 2025, 12:03 AM

#

rocky vigil after recent developments it might even be outdated before it's done lol

Elo is Elo

rocky vigil Oct 27, 2025, 12:03 AM

#

no like with the i8 quant

#

eh

#

we'll see

#

it'll probably take a few days to get stuff sorted out

rocky vigil Oct 27, 2025, 12:04 AM

#

rocky vigil no like with the i8 quant

this could be an extra +10 stc/+5 ltc

prime mica Oct 27, 2025, 12:09 AM

#

huzzah

stray reef Oct 27, 2025, 5:02 AM

#

stray reef Alright here we go. i8 feature weights (but only the threat weights) Fixed nod...

LTC seems to struggle a bit. I think this is due to worker distribution, the GCP workers seem to benefit less from this (either due to avx2 or due to different memory behaviour)

I will repeat both STC+LTC on my own machine to get comparable numbers

prime mica Oct 27, 2025, 5:03 AM

#

what is GCP

#

happy to contribute some CPU time as well if u can show me how

stray reef Oct 27, 2025, 5:03 AM

#

google cloud platform
these are 8 cores of epyc 7b13 per worker

prime mica Oct 27, 2025, 5:03 AM

#

oh google cloud

stray reef Oct 27, 2025, 5:06 AM

#

prime mica happy to contribute some CPU time as well if u can show me how

that would be awesome.
first create an account on https://furybench.com/, then when @hollow crystal approves it clone https://github.com/aronpetko/OpenBench and run Client/client.py

violet badger Oct 27, 2025, 5:07 AM

#

rocky vigil doesn't that require a training change?

so if we have a branch for the threat net that can be used, somebody just PR the changed recipe to the nettest repo and I can start the CI run?

violet badger Oct 27, 2025, 5:55 AM

#

naive comet I'm sure 1 retrain for practically guaranteed passer is fine 😉

yes, a retrain of the threat net is easy at this point, see message above

rocky vigil Oct 27, 2025, 6:20 AM

#

stray reef LTC seems to struggle a bit. I think this is due to worker distribution, the GCP...

it's very strange, fury's 7950X3D is also mimicking the STC +11 LTC +3

stray reef Oct 27, 2025, 6:21 AM

#

possibly the +11 is a fluke

#

i guess putting more games into this to get smaller error bars is worth it

twilit oriole Oct 27, 2025, 6:23 AM

#

The +11 is definitely too high

#

But even then tbh. If it comes in lower how does it help? If the LTC passes there is no way to know it regresses at higher TCs

stray reef Oct 27, 2025, 6:25 AM

#

i'd say if it has the "standard" speedup scaling from STC to LTC it's pretty safe to say it's not gonna go negative at any higher TC

twilit oriole Oct 27, 2025, 6:26 AM

#

The fixed nodes loss would suggest it's pretty safe to say regardless I think

stray reef Oct 27, 2025, 6:28 AM

#

alright. yeah i guess if monty got +16 with L1=3072 and also doing psq weights as i8, then +11 is indeed unrealistic

twilit oriole Oct 27, 2025, 6:30 AM

#

Monty also doesn't have UE which I assume would inflate the Elo gain

rocky vigil Oct 27, 2025, 6:37 AM

#

stray reef possibly the +11 is a fluke

btw, +3 LTC is still very good

#

even if it's "only" like +6 STC

rocky vigil Oct 27, 2025, 6:41 AM

#

rocky vigil this could be an extra +10 stc/+5 ltc

update on this, overestimated, thinking +6 stc / +3 ltc now

stray reef Oct 27, 2025, 6:42 AM

#

yeah 6STC/3LTC is still amazing

prime mica Oct 27, 2025, 6:46 AM

#

with i8 threat weights it might make sense to lay the weights out slightly differently

#

by interleaving two accumulator registers worth of weights

#

so imagine

#

i8 weights[128];
->
i16 weights[64];  // weights[0], weights[64], weights[1], weights[65], ...

naive comet Oct 27, 2025, 6:48 AM

#

ehh but an index from black pov is diff from white pov

prime mica Oct 27, 2025, 6:48 AM

#

wdym?

naive comet Oct 27, 2025, 6:48 AM

#

prime mica ``` i8 weights[128]; -> i16 weights[64]; // weights[0], weights[64], weights[1]...

oh that kind

#

yeah

prime mica Oct 27, 2025, 6:48 AM

#

I meant for this

naive comet Oct 27, 2025, 6:48 AM

#

I have no idea how we currently impl the i8 stuff btw

rocky vigil Oct 27, 2025, 6:48 AM

#

we dont

naive comet Oct 27, 2025, 6:49 AM

#

prime mica I meant for this

yeah that is just extra inefficient lol

stray reef Oct 27, 2025, 6:51 AM

#

prime mica ``` i8 weights[128]; -> i16 weights[64]; // weights[0], weights[64], weights[1]...

ah so you want to fuse the updates for both POVs essentially?

prime mica Oct 27, 2025, 6:51 AM

#

oh no

#

like currently you have weightsVec which I assume is half the size of inputVec

#

but you could have something like

#

```
for (int i = 0; i < L1_ITERATIONS; i += 2) {
   VecI16 goose = weightsVecPacked[i];
   outputVec[i] = addEpi16(inputVec[i], shiftRightEpi16(goose, 8));
   outputVec[i + 1] = addEpi16(inputVec[i + 1], sextEpi8to16(goose));
}
```

#

idk exactly what you'd use for sextEpi8to16 but the point is you sign extend the lower 8 bits of each 16-bit pair to the full thingy

#

maybe our old friend maddubs or whatever

#

then it would be the same # of computational instructions but you could use full-width loads rather than two half-width loads

#

there might be something better ofc
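
A scalar sketch of the interleaved layout being proposed, assuming little-endian lanes so that 16-bit lane `k` carries `weights[k]` in its low byte and `weights[k + 64]` in its high byte. `interleave` and `add_packed` are hypothetical names, and the loop stands in for the vector code: one packed load feeds two accumulator halves.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One accumulator register's worth of weights, as in the 128 -> 64 example.
constexpr std::size_t kHalf = 64;

// Pack 128 i8 weights into 64 16-bit lanes: low byte = w[k], high byte = w[k + 64].
inline std::array<std::uint16_t, kHalf>
interleave(const std::array<std::int8_t, 2 * kHalf>& w) {
    std::array<std::uint16_t, kHalf> packed{};
    for (std::size_t k = 0; k < kHalf; ++k) {
        const unsigned lo = static_cast<std::uint8_t>(w[k]);
        const unsigned hi = static_cast<std::uint8_t>(w[k + kHalf]);
        packed[k] = static_cast<std::uint16_t>((hi << 8) | lo);
    }
    return packed;
}

// Add the packed weights to the accumulator, sign-extending each byte.
inline void add_packed(std::array<std::int16_t, 2 * kHalf>& acc,
                       const std::array<std::uint16_t, kHalf>& packed) {
    for (std::size_t k = 0; k < kHalf; ++k) {
        const std::int8_t lo = static_cast<std::int8_t>(packed[k] & 0xFF);
        const std::int8_t hi = static_cast<std::int8_t>(packed[k] >> 8);
        acc[k] = static_cast<std::int16_t>(acc[k] + lo);
        acc[k + kHalf] = static_cast<std::int16_t>(acc[k + kHalf] + hi);
    }
}
```

The packing would be done once when the net is loaded, so the per-update cost is only the two sign extensions per lane.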

stray reef Oct 27, 2025, 6:59 AM

#

I see, that sounds good, let's see if we find something good for extracting the lower i8

prime mica Oct 27, 2025, 6:59 AM

#

lol

#

lemme call Lisa Su rq

stray reef Oct 27, 2025, 7:00 AM

#

why don't we just & 0xFF

prime mica Oct 27, 2025, 7:00 AM

#

because then we'll be adding it as if it's unsigned 😩

stray reef Oct 27, 2025, 7:01 AM

#

right right i forgot the sign bit is in the wrong place

prime mica Oct 27, 2025, 7:03 AM

#

but yeah maddubs with set1_epi16(0x00_01) would work and it's one instruction

stray reef Oct 27, 2025, 7:05 AM

#

maddubs also seems to have lower throughput than cvtepi8_epi16 according to intel docs (i hope i understand throughput correctly in this context)

prime mica Oct 27, 2025, 7:05 AM

#

O

#

where does it say that?

#

do u mean this column?

#

CPI = cycles per instruction so lower is better

#

1 means 1 per clock cycle, 0.5 means 2 per clock cycle, etc.

stray reef Oct 27, 2025, 7:06 AM

#

_mm512_maddubs_epi16 has 0.5 except for the first row

#

not sure how this translates to the most recent architectures and amd but hey

prime mica Oct 27, 2025, 7:07 AM

#

yum

stray reef Oct 27, 2025, 9:11 AM

#

prime mica ``` i8 weights[128]; -> i16 weights[64]; // weights[0], weights[64], weights[1]...

got it to work with _mm512_srai_epi16(_mm512_slli_epi16(x, 8), 8) instead of maddubs now, with maddubs it'd require masking the upper 8 bits away still

#

unless i am again misunderstanding something

prime mica Oct 27, 2025, 9:12 AM

#

oh why

#

the semantics of maddubs are (iirc)

#

dst[i] = (u8)src1[2*i] * (i8)src2[2*i] + (u8)src1[2*i+1] * (i8)src2[2*i+1]

#

maybe u need to flip the order

#

the weights need to come second

#

_mm_maddubs_epi16(_mm_set1_epi16(1), weights)
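
A scalar model of one 16-bit lane, to check the claim that `maddubs` with a constant of 1 sign-extends the low byte. It follows the semantics quoted above (first operand bytes unsigned, second operand bytes signed, pair sum saturated); the function names are hypothetical and this is not SIMD code.

```cpp
#include <cstdint>

// One lane of maddubs: u8 bytes of a times i8 bytes of b, summed with saturation.
inline std::int16_t maddubs_lane(std::uint16_t a, std::uint16_t b) {
    const int a_lo = a & 0xFF;
    const int a_hi = a >> 8;
    const int b_lo = static_cast<std::int8_t>(b & 0xFF);
    const int b_hi = static_cast<std::int8_t>(b >> 8);
    int sum = a_lo * b_lo + a_hi * b_hi;
    if (sum > 32767) sum = 32767;
    if (sum < -32768) sum = -32768;
    return static_cast<std::int16_t>(sum);
}

// Pack two i8 values into one lane: lo in bits 0-7, hi in bits 8-15.
inline std::uint16_t pack(std::int8_t lo, std::int8_t hi) {
    return static_cast<std::uint16_t>((static_cast<std::uint8_t>(hi) << 8) |
                                      static_cast<std::uint8_t>(lo));
}

// Constant 0x0001 first, weights second: 1 * lo + 0 * hi.
inline std::int16_t low_via_maddubs(std::uint16_t packed) {
    return maddubs_lane(0x0001, packed);
}

// Shift left by 8, then arithmetic shift right by 8 (the slli + srai pair).
inline std::int16_t low_via_shifts(std::uint16_t packed) {
    const std::uint16_t up = static_cast<std::uint16_t>(packed << 8);
    return static_cast<std::int16_t>(static_cast<std::int16_t>(up) >> 8);
}

// Masking with 0xFF keeps the bits but loses the sign.
inline std::int16_t low_via_mask(std::uint16_t packed) {
    return static_cast<std::int16_t>(packed & 0xFF);
}

// The high byte only needs one arithmetic shift right.
inline std::int16_t high_via_shift(std::uint16_t packed) {
    return static_cast<std::int16_t>(static_cast<std::int16_t>(packed) >> 8);
}
```

For a lane packed from (-5, 7), both `low_via_maddubs` and `low_via_shifts` return -5, while `low_via_mask` returns 251, which is the unsigned-add problem raised earlier. Swapping the operand order would treat the weights as unsigned instead.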

stray reef Oct 27, 2025, 9:17 AM

#

hm not sure what i'm doing wrong

#

give me a minute

#

ah i was only working in add, forgot to modify sub and addsub kekw

#

got it now

prime mica Oct 27, 2025, 9:22 AM

#

lololol

#

happens

stray reef Oct 27, 2025, 9:29 AM

#

how much sense do you think it makes to test this on zen5?

prime mica Oct 27, 2025, 9:31 AM

#

I think it'll exaggerate the benefit compared to other architectures

#

but worth a shot

#

does the weight interleaving make any difference on ur computer?

stray reef Oct 27, 2025, 9:35 AM

#

when running bench 15x through hyperfine it's within error pretty much (this is a 7900X)

#

i'm also gonna run a VSTC on furybench

prime mica Oct 27, 2025, 9:39 AM

#

darn :/

#

I wonder whether the bottleneck is still memory then

stray reef Oct 27, 2025, 9:51 AM

#

yeah the VSTC failed

prime mica Oct 27, 2025, 9:51 AM

#

😩

lofty cedar Oct 27, 2025, 11:03 AM

#

But well, on second thought, wouldn't it be a good idea?

I mean... if it passes VVLTC ofc.

#

VVLTC is what Stockfish aims for no?

#

A few seconds per move on an 8-thread hardware sounds reasonable.

#

But 72t 10s is probably about VVLTC.

stray reef Oct 27, 2025, 11:07 AM

#

60+0.6 8th is probably less than what most people use SF for. 10+0.1 72th maybe not...

lofty cedar Oct 27, 2025, 11:07 AM

#

100MB seems like nothing nowadays. Games nowadays are in megabytes.

prime mica Oct 27, 2025, 11:07 AM

#

at this point most people probably use a webassembly-based stockfish lol

lofty cedar Oct 27, 2025, 11:07 AM

#

stray reef 60+0.6 8th is probably less than what most people use SF for. 10+0.1 72th maybe ...

Roughly equivalent.

stray reef Oct 27, 2025, 11:08 AM

#

true actually

rocky vigil Oct 27, 2025, 5:58 PM

#

stray reef yeah the VSTC failed

What is the current state? It seems like there was some jumping to 126rrr3-i8t-wide but I thought that didn’t work?

stray reef Oct 27, 2025, 6:01 PM

#

0126rrr3-i8t-wide == 0126rrr3-i8t, except the branch i8weights-threatonly-wideload processes the net differently during compilation, in order to use full loads (e.g. __m512i on avx512 instead of __m256i). but that was slower, at least in my impl

#

i have now merged this https://furybench.com/test/3545/

rocky vigil Oct 27, 2025, 6:03 PM

#

Ah I see

#

What’s the difference between rrr3 and rrr6 then

stray reef Oct 27, 2025, 6:06 PM

#

i sort of used rrr6 on accident initially, it was neutral against rrr3, which is why i re-tested later with rrr3 before merging that

round stone Oct 27, 2025, 7:04 PM

#

naive comet or we can QA=255, clamped to [-128, 127] and elide the *2 during load altogether

https://github.com/linrock/nnue-pytorch/commit/f131f3dade86c05e8a8f6a008eb550bca02ccc62

if there's interest in QA=255, i confirmed this nnue-pytorch commit with cj's patch works for training

#

and retraining the final stage at the time with QA=255 with the same dataset led to about +1 elo

#

so i'd expect it to be slightly stronger for the main net. using 255 for the smallnet was slight negative or neutral

#

https://tests.stockfishchess.org/tests/view/671476f686d5ee47d953c8e7
QA=255 inference diff, and QA=255 smallnet (nn-aa2736ae40b1.nnue)

rocky vigil Oct 27, 2025, 7:11 PM

#

stray reef `0126rrr3-i8t-wide == 0126rrr3-i8t`, except the branch `i8weights-threatonly-wid...

ah how is training with clipping / QAT

rocky vigil Oct 27, 2025, 7:12 PM

#

round stone https://github.com/linrock/nnue-pytorch/commit/f131f3dade86c05e8a8f6a008eb550bca...

I'll check it out, see if I can get it to work

stray reef Oct 27, 2025, 7:12 PM

#

rocky vigil ah how is training with clipping / QAT

haven't tested it yet

rocky vigil Oct 27, 2025, 7:17 PM

#

what we are trying to do is use QA=255 and also clamp certain weights to i8 range

#

so that way we don't have to use the x2 trick

round stone Oct 27, 2025, 7:17 PM

#

https://tests.stockfishchess.org/tests/view/68fc3184637acd2a11e72d4e

Elo: -0.45 ± 3.4 (95%) LOS: 39.8%
Total: 10000 W: 2599 L: 2612 D: 4789
Ptnml(0-2): 30, 1165, 2616, 1166, 23
nElo: -0.90 ± 6.8 (95%) PairsRatio: 0.99

is this the latest measurement of threats vs. master?
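
The headline Elo figure in a result like this can be recovered from the W/L/D counts with the usual logistic model. A minimal sketch follows; `elo_from_wld` is a hypothetical helper, and it does not reproduce the error bars or nElo, which depend on the pentanomial counts.

```cpp
#include <cmath>

// Logistic Elo difference implied by a match score.
// score = (wins + draws / 2) / games, elo = -400 * log10(1 / score - 1)
inline double elo_from_wld(long wins, long losses, long draws) {
    const double games = static_cast<double>(wins + losses + draws);
    const double score = (static_cast<double>(wins) + 0.5 * static_cast<double>(draws)) / games;
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For the counts above (W 2599, L 2612, D 4789) this gives about -0.45, matching the reported value.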

rocky vigil Oct 27, 2025, 7:17 PM

#

stc smp yes

#

stc is like -5

round stone Oct 27, 2025, 7:18 PM

#

if it's already neutral at 5+0.05 th 8, why not try 60+0.6 th 8?

#

it could already be better than master

rocky vigil Oct 27, 2025, 7:18 PM

#

still waiting on a couple of new tricks really

#

if nothing else there's a 2x length training run ongoing to see if we can squeeze 1 more elo

#

anyways I'm really happy that we can get it neutral at stc smp without spsa

round stone Oct 27, 2025, 7:21 PM

#

alright, i'm guessing it's already better than master in its current state at vltc smp

#

also QA=255 alone should be at least +1 elo

rocky vigil Oct 27, 2025, 7:22 PM

#

i8 clamp above in PlentyChess is like

#

very big as well

rocky vigil Oct 27, 2025, 7:29 PM

#

stray reef i have now merged this https://furybench.com/test/3545/

yeah it looks like the stc sprt elo is off, bc the stc progtest against obsidian has barely gone anywhere?
that or error bars

stray reef Oct 27, 2025, 7:30 PM

#

the progtest against plenty 7 is looking great tho. not sure why it doesn't translate against obsidian

rocky vigil Oct 27, 2025, 7:31 PM

#

strange

#

although funny it looks like the "scaling" of threat inputs is mostly because it's slower

twilit oriole Oct 28, 2025, 4:01 AM

#

https://tests.stockfishchess.org/tests/view/68fcc468637acd2a11e72df2 why not let the sprt finish

rocky vigil Oct 28, 2025, 4:10 AM

#

sss

#

in all seriousness it seems like big machines benefit way more from the i8 quant

split warren Oct 28, 2025, 4:31 AM

#

7742 also has 256mb L3 cache, dunno if that's relevant

violet badger Oct 28, 2025, 5:46 AM

#

yeah, if you move from not fitting into L3 to fully fitting into L3 such things could happen?

stray reef Oct 28, 2025, 6:05 AM

#

Damn, i hope this translates to CCC and TCEC kekw

stray reef Oct 28, 2025, 12:39 PM

#

Another big gainer found, this time a training improvement.

STC

Elo   | 5.31 +- 2.81 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.91 (-2.25, 2.89) [0.00, 2.50]
Games | N: 15308 W: 3911 L: 3677 D: 7720
Penta | [36, 1717, 3928, 1923, 50]

https://furybench.com/test/3557/
LTC

Elo   | 6.90 +- 3.89 (95%)
SPRT  | 40.0+0.40s Threads=1 Hash=64MB
LLR   | 1.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6850 W: 1747 L: 1611 D: 3492
Penta | [2, 688, 1913, 816, 6]

https://furybench.com/test/3558/

I currently have a 4 stage training setup, with SBs 300, 300, 400, 300, and WDL 0.15, 0.3, 0.6, 1.0.
I figured that for threat inputs, fewer and longer stages with less LR jumping might be beneficial, so I switched to 2 stages with SBs 1000, 400 and WDL 0.15->0.6, 1.0 for this new network

the 0.3 WDL stage used old and partially bugged 5ksn data, that's gone now, it's all just 20ksn, 20ksn adversarial and 5ksn adversarial data thrown together. and the score is not given by search, but by the chonker (768x12 -> 4096)x2 -> (96 -> 192 -> 192 -> 1)x8 net (which was previously done in the first stage as well)

#

i previously tested simpler ways of merging the early stages, as well as increasing the length of the last stage to 400 SBs, and that all failed on its own

rare jacinth Oct 28, 2025, 2:09 PM

#

What is SB here?

stray reef Oct 28, 2025, 2:09 PM

#

superbatch, i.e. 100M positions

rare jacinth Oct 28, 2025, 2:10 PM

#

And these values for WDL?

stray reef Oct 28, 2025, 2:11 PM

#

0.0 WDL means training purely on score, 1.0 WDL means training purely on game result. Everything else is a linear blend

#

And by 0.15->0.6 I mean that it scales up WDL linearly during the 1000 SBs, instead of keeping WDL constant

rare jacinth Oct 28, 2025, 2:27 PM

#

I can do some work on this if anyone has a machine with strong gpus I can ssh into, I’ve been asked not to use the leela gpus for stockfish related stuff

naive comet Oct 28, 2025, 2:28 PM

#

@violet badger

violet badger Oct 28, 2025, 2:49 PM

#

no ssh access, but one can specify this in a recipe at nettest, and the CI pipeline will execute? So fork the repo, adjust recipe according to your liking, PR, and wait and see.

#

now, I must say that I tried optimizing these values so there is probably no low hanging fruit. However, I'm not excluding that this is possible.

rocky vigil Oct 28, 2025, 2:59 PM

#

About the moves with no threat changes

#

I feel like they tend to be endgame positions?

#

But yeah it definitely feels like there should be a bit to gain from handling the empty case like a null move

#

I wonder if this (attkr ^ attkd) == 8 is much faster somehow

violet badger Oct 28, 2025, 4:18 PM

#

I think the most important thing right now is to look in the int8 threats..

#

If there is a branch with the updated trainer, the training should be kicked off asap, in parallel to the little experiment that is running on longer training

lofty cedar Oct 28, 2025, 4:19 PM

#

Should we try to pass a 1280 net at VVLTC?

violet badger Oct 28, 2025, 4:20 PM

#

no, like that's absolutely useless IMO

lofty cedar Oct 28, 2025, 4:20 PM

#

I mean, Stockfish optimizes for around VVLTC right?

violet badger Oct 28, 2025, 4:20 PM

#

why?

lofty cedar Oct 28, 2025, 4:20 PM

#

Because it's where real uses come.

violet badger Oct 28, 2025, 4:20 PM

#

no, that's somehow a misconception

lofty cedar Oct 28, 2025, 4:20 PM

#

And 100MB more is like a drop of water for modern rams.

violet badger Oct 28, 2025, 4:21 PM

#

100MB more is like a hell of a lot for modern L3 caches

lofty cedar Oct 28, 2025, 4:21 PM

#

violet badger no, that's somehow a misconception

Because it and longer TC (that aren't practical to test) are where real use cases come?

violet badger Oct 28, 2025, 4:21 PM

#

especially on your phone

lofty cedar Oct 28, 2025, 4:22 PM

#

Yeah... and then? I mean... if it passes VVLTC, it should be considered good for normal use cases right?

#

(It's rare that normal folks would use Stockfish at lower TC anyway.)

violet badger Oct 28, 2025, 4:23 PM

#

why do you think so?

#

average lichess game analysis is seconds..

#

on slower hardware

#

right now, the priority should be to innovate on these ideas...

#

this will be helped by keeping it a bit nimble and agile.

#

not by pushing a large net through some TC that can't be supported by the resources.

#

So, speedups, int8, mmap nets, smarter training processes.

lofty cedar Oct 28, 2025, 4:25 PM

#

Ahh! I see.

But well, Stockfish does indeed optimize for VVLTC right? At least with things like singular extension. Were Stockfish optimized for lichess analysis, we'd be pushing all sorts of anti-scalers through.

#

Though yeah... I agree it's not a priority for now. Maybe we can come back to it later when necessary.

rare jacinth Oct 28, 2025, 4:40 PM

#

@twilit oriole can you give me a full list of the techniques you've tried

eternal onyx Oct 28, 2025, 4:40 PM

#

Keeping the download small and keeping the net fitting in cache is helpful but Stockfish has been more than strong enough for basically any use case for a long time. Remember phones were beating grandmasters since the 2000s!

lofty cedar Oct 28, 2025, 4:43 PM

#

I think at this point the purpose of Stockfish development is pure entertainment.

#

The stronger the engine, the harder it is to learn from it.

#

The good things that come from them are rather seeing interesting matches.

rocky vigil Oct 28, 2025, 9:59 PM

#

@regal steeple for https://tests.stockfishchess.org/tests/view/68fde428637acd2a11e72f83 maybe (attkr ^ attkd) >> 3 is better than (attkr ^ attkd) & 8?

#

i feel like the == 8 instead of & 8 might be nontrivial speedup for whatever reason

twilit oriole Oct 28, 2025, 10:05 PM

#

rare jacinth @twilit oriole can you give me a full list of the techniques you've tried

The thread is 7k messages and some of the techniques are not from me (especially the more incremental ones like finding the speedups in threat gen etc)

#

So, a list is not possible

rocky vigil Oct 28, 2025, 10:13 PM

#

@rare jacinth what types of techniques are you looking for? Speedups? Arch changes? Quantization?

#

i am very surprised that removing king threats is not a minor speedup

twilit oriole Oct 28, 2025, 10:17 PM

#

There are known gainers that can be focused on also. Like lazy threats, i8 threats etc

rocky vigil Oct 28, 2025, 10:21 PM

#

yeah those are the two big (known) ones, mostly depend on effort

#

minor: at some point we should perform a cleanup of smallnet, so that it stops using an unnecessary 20MB

rocky vigil Oct 28, 2025, 10:25 PM

#

rocky vigil minor: at some point we should perform a cleanup of smallnet, so that it stops u...

afaik there's no way to cleanly do this under a single class, unless someone has a better suggestion

#

the best way I can think of right now is to declare a base feature transformer class with the psq parts and then have a separate threat feature transformer that inherits the functions and adds threat support

rare jacinth Oct 28, 2025, 10:30 PM

#

rocky vigil @rare jacinth what types of techniques are you looking for? Speedups? Ar...

broad architectural ideas like adding inputs of whether a pawn is a passed pawn/removing certain threats/etc.

rocky vigil Oct 28, 2025, 10:30 PM

#

i cannot speak for viren but what we currently do is remove duplicate threats

#

i.e. rook attacks queen

#

also a point to investigate is why removing (enemy) threats to a king seems to be a minor slowdown on fishtest

rocky vigil Oct 28, 2025, 10:39 PM

#

rocky vigil minor: at some point we should perform a cleanup of smallnet, so that it stops u...

right now we are using 90MB more than master, so if master is using ~140MB of net, we are using ~230MB

#

can cut it down to 210MB with this first

#

then i8 will make it around 110MB

naive comet Oct 29, 2025, 12:23 AM

#

rocky vigil @regal steeple for https://tests.stockfishchess.org/tests/view/68fde4286...

I'll try nobranch ver of this maybe

#

oh but I cant compile shit fml

prime mica Oct 29, 2025, 12:42 AM

#

Even on master I got significant differences from changing the implementation of make_index

#

So def worth a shot

twilit oriole Oct 29, 2025, 1:46 AM

#

https://furybench.com/test/3549/ @stray reef dark worker destroyed the gain of your test lol

#

There's probably no point running this with other workers ngl

#

This is due to exceeding the L3 cache. 128MB on dark worker

#

History tables and such have to fit in the L3 along with the net

#

Just 1 cache miss is bad because then you have a latency spike in the eval. 2 cache misses isn't that much slower than 1

naive comet Oct 29, 2025, 3:58 AM

#

https://tests.stockfishchess.org/tests/view/690190cb637acd2a11e737dc

stray reef Oct 29, 2025, 6:54 AM

#

twilit oriole https://furybench.com/test/3549/ @stray reef dark worker destroyed the...

doesn't really matter, we know/have good reason to believe that L3 influences the speedup, and that's awesome

regal steeple Oct 29, 2025, 7:04 AM

#

rocky vigil @regal steeple for https://tests.stockfishchess.org/tests/view/68fde4286...

Maybe, I dont really know, lets see how cjs attempt goes

naive comet Oct 29, 2025, 7:49 AM

#

that thing doesn't include your change btw

#

I managed to get it in branchless but it inflated table size

#

will try again maybe

twilit oriole Oct 29, 2025, 1:32 PM

#

@rocky vigil so now QA=255 is fine what's the blocker on the i8 threats. Since no x2 trick is needed

rocky vigil Oct 29, 2025, 1:32 PM

#

no blocker

#

uh

#

the blocker is I have a chem exam today

naive comet Oct 29, 2025, 1:32 PM

#

realshit

twilit oriole Oct 29, 2025, 1:33 PM

#

Cool

rocky vigil Oct 29, 2025, 1:33 PM

#

rocky vigil the blocker is I have a chem exam today

tmrw i will attempt to get started on this

#

also for then

twilit oriole Oct 29, 2025, 1:34 PM

#

And is anyone on the speedup side of things looking into lazy threats

rocky vigil Oct 29, 2025, 1:34 PM

#

I'll ask in advance @naive comet where is weight clipping on the trainer side located

twilit oriole Oct 29, 2025, 1:35 PM

#

twilit oriole And is anyone on the speedup side of things looking into lazy threats

I might but I think it is likely I get nowhere with it lol

#

What's the approx speedup percent for that

#

So I can see what's the expected result if it is working lol

rocky vigil Oct 29, 2025, 1:36 PM

#

threat tracking seems ~5% but you would need to know how much lazy tracking would save

#

over normal

twilit oriole Oct 29, 2025, 1:37 PM

#

Oh. I guess it will be within error bars with speedup tool which makes things harder

rocky vigil Oct 29, 2025, 1:37 PM

#

so like, I estimate 1-2% speedup

#

which is good still

#

but not that easy to see

naive comet Oct 29, 2025, 1:38 PM

#

rocky vigil I'll ask in advance @naive comet where is weight clipping on the train...

I have no clue

#

maybe ask sopel

stray reef Oct 29, 2025, 1:39 PM

#

twilit oriole And is anyone on the speedup side of things looking into lazy threats

maybe i can try smth in plenty today or tomorrow

rocky vigil Oct 29, 2025, 1:57 PM

#

stray reef doesn't really matter, we know/have good reason to believe that L3 influences th...

tcec has 256 MB L3 per CPU so should def be a big speedup

stray reef Oct 29, 2025, 1:57 PM

#

fingers crossed

green moat Oct 29, 2025, 6:08 PM

#

Was this "fuse transformer" patch merged in the Threat Inputs branch, whatever it is?
https://tests.stockfishchess.org/tests/view/690008ee637acd2a11e73441
Kappa

rocky vigil Oct 29, 2025, 6:10 PM

#

Ask @prime mica if it can be applied

#

It probably can tbh

prime mica Oct 29, 2025, 6:18 PM

#

Potench

#

The problem is it requires the final add/sub to be computed at the same time as transformed features

#

Fwiw I don’t think it’s a good patch bc it breaks encapsulation of the layers

#

and I’m not planning on trying to get it cleaned up or PRed any time soon

rocky vigil Oct 29, 2025, 6:23 PM

#

Oh actually it can’t be applied

#

Bc the accumulators are split

prime mica Oct 29, 2025, 6:26 PM

#

Right

#

It’s a horrible change for only +2 ELO STC

rocky vigil Oct 29, 2025, 6:27 PM

#

+2 ELO stc still pretty good

prime mica Oct 29, 2025, 6:27 PM

#

Eh

#

Every impediment to trying new net architectures is -1 bajillion

rocky vigil Oct 29, 2025, 6:28 PM

#

btw i still think trying to replace https://github.com/xu-shawn/Stockfish/blob/threat_inputs/src/nnue/features/full_threats.cpp#L71 with a lookup is worth a shot

prime mica Oct 29, 2025, 6:29 PM

#

Oh yes

rocky vigil Oct 29, 2025, 6:32 PM

#

like, removing attacks to king

#

is free

#

with a lookup table

lofty cedar Oct 29, 2025, 9:11 PM

#

Should be easy... replace the "map" with a table 4x the size. You can get rid of the logic of determining whether or not it's an enemy and so on.

Then, maybe you could replace the index calculation with simpler, less precise arithmetic. Pad unused slots. It would take a bit more memory but as long as they don't get inside the cache, should be fine.

naive comet Oct 30, 2025, 12:44 AM

#

I tried even a 2x size table and it tanks a ton

#

i think the first course of action is to shrink the perspective thing (same patch as ces32 (I might be misremembering his name) but without removing templates)

rocky vigil Oct 30, 2025, 1:41 AM

#

round stone https://github.com/linrock/nnue-pytorch/commit/f131f3dade86c05e8a8f6a008eb550bca...

@naive comet it seems like since this commit self.quantized_one has changed from 255 to 127

#

or is this commit like a fix to a previous QA=255 attempt

#

it also seems like self.hidden_quantized_one has disappeared

#

actually what is self.hidden_quantized_one

#

I cannot find a reference to it

#

oh there's a typo

naive comet Oct 30, 2025, 2:04 AM

#

rocky vigil @naive comet it seems like since this commit self.quantized_one has ch...

that is a commit on his branch...

naive comet Oct 30, 2025, 2:04 AM

#

rocky vigil or is this commit like a fix to a previous QA=255 attempt

not a fix

#

that was just our old attempt

rocky vigil Oct 30, 2025, 2:07 AM

#

I'll have smth up soon

#

for u to check

#

@naive comet can you check https://github.com/sscg13/nnue-pytorch/commit/6d5c50ac427eae851b5a02a3c721068ef85bace7

#

oh shoot I found a couple of typos

#

like it should be model.quantization.(etc)

#

instead of model.(etc)

#

does everything else seem ok

#

I still haven't figured out how to selectively clip weights

naive comet Oct 30, 2025, 2:12 AM

#

it's fucked up I think

#

rocky vigil Oct 30, 2025, 2:12 AM

#

i took this from linrock commit

#

ihni what it's doing

naive comet Oct 30, 2025, 2:13 AM

#

ehh

#

then nvm just leave it

#

ok yeah it's right

#

I'm stupid

rocky vigil Oct 30, 2025, 2:13 AM

#

wait how does it work

#

sorry

naive comet Oct 30, 2025, 2:13 AM

#

ok I think it's fine

#

now for the clipping

rocky vigil Oct 30, 2025, 2:14 AM

#

naive comet ok I think it's fine

alright I'll apply model.quantization. fix

rocky vigil Oct 30, 2025, 2:17 AM

#

naive comet now for the clipping

i mean the naive way is just to add some like bool is_threat_weight to quantize_feature_transformer

#

actually I find it strange

#

that the trainer basically entirely ignores weight limits

#

during training

#

?

naive comet Oct 30, 2025, 4:43 AM

#

it clips for non-ft I'm pretty sure

#

it has to

#

or the affine transforms will overflow

rocky vigil Oct 30, 2025, 4:46 AM

#

well yeah

#

but I'm looking for ft clipping

rocky vigil Oct 30, 2025, 4:47 AM

#

rocky vigil i mean the naive way is just to add some like `bool is_threat_weight` to `quanti...

also what do you think of this idea

naive comet Oct 30, 2025, 4:49 AM

#

ft doesn't clip

naive comet Oct 30, 2025, 4:50 AM

#

rocky vigil i mean the naive way is just to add some like `bool is_threat_weight` to `quanti...

just do that

rocky vigil Oct 30, 2025, 4:54 AM

#

naive comet ft doesn't clip

really?

#

i thought ft got clipped to 2 * (ONE)

#

or smth

#

huh

#

i guess it doesn't actually

#

very interesting

#

i still feel like it would be beneficial to make the trainer aware of the clipping

naive comet Oct 30, 2025, 5:23 AM

#

we can do that later surely

rocky vigil Oct 30, 2025, 5:28 AM

#

also like

#

it seems the entire tensor of FT weights

#

is being passed as a whole big chunk

#

so that would need to be split between threat part and psq part

stray reef Oct 30, 2025, 6:28 PM

#

twilit oriole And is anyone on the speedup side of things looking into lazy threats

started working on it, but it's much more annoying than i thought as it requires keeping track of basically all board updates that happen during the move

#

you'd have to duplicate like 90% of makemove and at that point i really don't think there is a speedup here

#

though possibly there is a smart way to do it better than i'm thinking

rocky vigil Oct 30, 2025, 10:09 PM

#

bruh what is going on with https://tests.stockfishchess.org/tests/view/68f4e178637acd2a11e72170

#

imo if it reaches the 800k limit we can just merge

prime mica Oct 30, 2025, 10:11 PM

#

looks like a microarchitectural oddity

#

I think as a rule of thumb auto purge should just be off for nfc patches...

twilit oriole Oct 31, 2025, 12:22 AM

#

rocky vigil imo if it reaches the 800k limit we can just merge

Can merge already, it already passed and then got auto purged

frosty imp Oct 31, 2025, 1:22 AM

#

rocky vigil imo if it reaches the 800k limit we can just merge

800k is just an arbitrary limit

#

You can raise that if you want

shell breach Oct 31, 2025, 2:16 AM

#

stray reef you'd have to duplicate like 90% of makemove and at that point i really don't th...

There are weird filters?

#

And weird stuff in set theory to work with long big numbers

#

Just keep it up we believe in you

frosty imp Oct 31, 2025, 2:18 AM

#

stray reef you'd have to duplicate like 90% of makemove and at that point i really don't th...

Can’t you do something with the same mechanism as lazy accumulator updates

rocky vigil Oct 31, 2025, 2:49 AM

#

naive comet we can do that later surely

If we’re not doing any clipping during training though it’s easier to just post-process the weights

naive comet Oct 31, 2025, 3:16 AM

#

we should clip during training

#

I think

#

idk how we currently do it for the other weights

rocky vigil Oct 31, 2025, 3:33 AM

#

naive comet we should clip during training

Yeah this is probably both easier and more effective than trying to do it at quantization level

stray reef Oct 31, 2025, 7:30 AM

#

frosty imp Can’t you do something with the same mechanism as lazy accumulator updates

that's what I'm trying

#

ideally i'd figure out from a static board + move what to update

rocky vigil Oct 31, 2025, 7:52 AM

#

Yeah smth like that would be ideal

#

So that way the positions are already given

#

So no need to duplicate makemove

stray reef Oct 31, 2025, 7:54 AM

#

How else would you figure out from a board+move what to update, without duplicating half of makemove?

rocky vigil Oct 31, 2025, 8:11 AM

#

stray reef How else would you figure out from a board+move what to update, without duplicat...

What if we use both this board, the move, and the next board

#

Then ideally we know info from both ends

green moat Oct 31, 2025, 5:20 PM

#

Long run of 1024 Factorized nets finished.
So far, 3rd stage net is the best, 5th stage net test is still initializing.
2nd stage net is defending, though

#

step 2: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795300
step 3: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795301
step 4: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795302
step 5: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795303

violet badger Oct 31, 2025, 5:25 PM

#

I'm locally playing step 3, comparing to the best 1024n so far (nn-962e10fb93ee.nnue vs nn-26b0e5126117.nnue), doesn't look like any gain.

green moat Oct 31, 2025, 5:26 PM

#

what about stage 5 net?
copium

violet badger Oct 31, 2025, 5:27 PM

#

stuck job, but I don't think it will be any different. Seems like the longer training plateaus at roughly the same value for all steps.

green moat Oct 31, 2025, 5:31 PM

#

Meanwhile stage 4 net closed the gap with stage 3 net, and we still don't know performance of stage 5 net
copium

violet badger Oct 31, 2025, 5:37 PM

#

   1 nn-26b0e5126117.nnue    :     9.7    1.5  74929.5  147456    51
   2 nn-962e10fb93ee.nnue    :     7.8    1.5  74340.0  147456    50

#

so, having established longer training as no better... i8 is next 🙂

rocky vigil Oct 31, 2025, 5:41 PM

#

Ouch

green moat Oct 31, 2025, 5:41 PM

#

not even testing nn-e5bcdd034264.nnue (stage 5 net), just in case?

rocky vigil Oct 31, 2025, 5:41 PM

#

Yeah i8 is next then

violet badger Oct 31, 2025, 5:42 PM

#

green moat not even testing nn-e5bcdd034264.nnue (stage 5 net), just in case?

I can add it.

violet badger Oct 31, 2025, 7:53 PM

#

No difference:

   1 nn-962e10fb93ee.nnue    :    12.9    1.8  38195.0   73728    52
   2 nn-e5bcdd034264.nnue    :    12.8    1.8  38189.5   73728    52

rocky vigil Oct 31, 2025, 7:55 PM

#

I guess 26b is the strongest we can get without i8 or spsa then

prime mica Oct 31, 2025, 7:59 PM

#

☹️

green moat Oct 31, 2025, 8:00 PM

#

748K tests and still not passing
https://tests.stockfishchess.org/tests/live_elo/68f4e178637acd2a11e72170
😐

prime mica Oct 31, 2025, 8:00 PM

#

lol

#

needs to be put out of its misery

rocky vigil Oct 31, 2025, 8:01 PM

#

It’s a speedup, just on the borderline to be enough elo

green moat Oct 31, 2025, 8:06 PM

#

So, what remains to be done?
Some speedup, i8 inference, i8 nets training, verbatim nets.....and eventually SPSA the net

#

Is it correct?
Is that the roadmap, more or less?

rocky vigil Oct 31, 2025, 8:40 PM

#

yep

#

afaik after i8 we'll probably look towards cleaning it up to be merged

green moat Oct 31, 2025, 9:40 PM

#

rocky vigil afaik after i8 we'll probably look towards cleaning it up to be merged

Barring Elo gain.....
🙏

green moat Oct 31, 2025, 11:26 PM

#

It succeeded
😭
https://tests.stockfishchess.org/tests/live_elo/68f4e178637acd2a11e72170

lofty cedar Nov 1, 2025, 12:22 AM

#

Anyway... if we're gonna do i8 quantization, then we could just like store the 1024 in i16?

rocky vigil Nov 1, 2025, 1:39 AM

#

ig @regal steeple can pr the maybemaybemaybe test

regal steeple Nov 1, 2025, 8:28 AM

#

rocky vigil ig @regal steeple can pr the maybemaybemaybe test

https://github.com/xu-shawn/Stockfish/pull/24

violet badger Nov 1, 2025, 8:31 AM

#

the branch should probably use the nn-26b0e5126117.nnue network?

rocky vigil Nov 1, 2025, 11:07 AM

#

@frosty imp looks like we got some rebasing to do now

#

i will attempt to do this rn

rocky vigil Nov 1, 2025, 11:32 AM

#

check https://github.com/sscg13/Stockfish/commit/bad931487a1c31701b791c2eedd192f3ba80b73a

#

i think everything should be fine

#

like it compiles and has a bench

#

I haven't touched anything

regal steeple Nov 1, 2025, 11:33 AM

#

violet badger the branch should probably use the nn-26b0e5126117.nnue network?

The test used an old net because it was running so long, I rebased the pr branch so the net in the pr should be correct

rocky vigil Nov 1, 2025, 11:35 AM

#

bruh I filled nps instead of nodes for bench again

#

hate this

#

aight https://github.com/sscg13/Stockfish/commit/7a41f56227aad2fae1d00166b2f6d6e281264756 has the correct bench

rocky vigil Nov 1, 2025, 11:44 AM

#

regal steeple The test used an old net because it was running so long, I rebased the pr branch...

you might need to check this out post-master rebase

frosty imp Nov 1, 2025, 7:06 PM

#

Merged and rebased

violet badger Nov 1, 2025, 7:12 PM

#

so what's the current guesstimate on Elo difference at STC?

rocky vigil Nov 1, 2025, 7:14 PM

#

Nice

#

Well not much new so still -5 or so stc

#

Unless the search behavior is different

#

Weight clipping for i8 will hopefully be solved today…

frosty imp Nov 2, 2025, 1:33 AM

#

Any blockers for i8 clipping?

daring wren Nov 2, 2025, 1:35 AM

#

frosty imp Any blockers for i8 clipping?

me

prime mica Nov 2, 2025, 1:45 AM

#

lol

#

https://tenor.com/view/you-shall-not-pass-lotr-do-not-enter-not-allowed-scream-gif-16729885

Tenor

rocky vigil Nov 2, 2025, 2:18 AM

#

frosty imp Any blockers for i8 clipping?

the fact that I don't know pytorch

#

what I think is supposed to happen is to get a tensor slice corresponding to the threat weights

#

and then clip them like that

#

but idk how to do it

frosty imp Nov 2, 2025, 2:22 AM

#

https://github.com/official-stockfish/nnue-pytorch/blob/5d18196172dcb181bb878922092ea45ec94aec29/model/quantize.py#L40

#

Add a new field here

#

https://github.com/official-stockfish/nnue-pytorch/blob/5d18196172dcb181bb878922092ea45ec94aec29/model/model.py#L232

#

Logic goes here

#

model.input.weight should be the tensor that gets clipped

frosty imp Nov 2, 2025, 2:30 AM

#

rocky vigil what I think it supposed to happen is to get a tensor slice corresponding to the...

Only threat weights get i8?

naive comet Nov 2, 2025, 2:41 AM

#

yeah for yoshie it was the most beneficial

prime mica Nov 2, 2025, 2:50 AM

#

If I ever buy a GPU I will definitely try quantizing the main net to i8

frosty imp Nov 2, 2025, 2:55 AM

#

I mean you don’t need a gpu to run just the quantization

prime mica Nov 2, 2025, 2:56 AM

#

sure but I think it needs to be re-trained

frosty imp Nov 2, 2025, 2:56 AM

#

naive comet yeah for yoshie it was the most beneficial

Honestly if we are doing that it’s probably better to just overhaul the feature system

#

But I guess for testing just hard code selecting the corresponding slice from model.input.weight

rocky vigil Nov 2, 2025, 4:16 AM

#

frosty imp Only threat weights get i8?

yeah, how would we make that slice

frosty imp Nov 2, 2025, 7:46 AM

#

rocky vigil yeah, how would we make that slice

something that looks like weight[0:12345]=weight[0:12345].clamp(...)

rocky vigil Nov 2, 2025, 7:50 AM

#

frosty imp <https://github.com/official-stockfish/nnue-pytorch/blob/5d18196172dcb181bb87892...

so like here, add

{
"params": [model.input.weight[0:79856]],
"min_weight": -self.max_threat_weight,
"max_weight": self.max_threat_weight,
}

#

?

#

if the slicing is that easy then I'm happy

frosty imp Nov 2, 2025, 7:57 AM

#

~~nah the slicing creates a new object~~

#

or wait

#

yeah I think you can do that
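
On the view-versus-copy question: as far as I know, basic slicing in PyTorch returns a view onto the same storage, so an in-place clamp on the slice lands in the original tensor. The sketch below mimics that behaviour with the standard library only (a `memoryview` slice over an `array`), so it is an analogy rather than trainer code. The helper name `clamp_view_in_place`, the sizes and the limit are toy assumptions.

```python
from array import array


def clamp_view_in_place(view, limit):
    """Clamp every element of a writable view to [-limit, limit], in place."""
    for i in range(len(view)):
        view[i] = max(-limit, min(limit, view[i]))


# Toy "feature transformer" weights: pretend the first 3 are threat weights.
weights = array("d", [0.9, -0.7, 0.1, 0.8, -0.95])

# Slicing a memoryview does not copy; it aliases weights[0:3].
threat_part = memoryview(weights)[0:3]
clamp_view_in_place(threat_part, 0.5)
threat_part.release()

result = list(weights)  # the first three entries are clamped, the rest untouched
```

If the slice were a copy instead (as with a plain Python list slice), the clamp would silently have no effect on the original, which is the failure mode the struck-out message was worried about.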

rocky vigil Nov 2, 2025, 7:57 AM

#

afaik it's used as like

#

.data()

#

so as long as it works

#

how would I define max_threat_weight

#

i know it should be smth like

#

ft_quantized_one / 2

#

but idk if there are any details I need to watch out for
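
One possible reading of "ft_quantized_one / 2", sketched under assumptions: with QA=255 (mentioned earlier in the thread) and a signed 8-bit target, the largest representable quantized value is 127, so the float-side limit would be 127/255, just under one half. The symmetric range (ignoring -128) and the round-to-nearest step are choices made for this sketch, not something confirmed for the trainer.

```python
I8_MAX = 127            # largest value of a signed 8-bit integer
FT_QUANTIZED_ONE = 255  # QA=255, as discussed in the thread

# Float-side clipping limit so that quantized threat weights fit in i8.
max_threat_weight = I8_MAX / FT_QUANTIZED_ONE


def quantize_to_i8(w):
    """Clip a float weight to the i8-safe range, then scale and round."""
    clipped = max(-max_threat_weight, min(max_threat_weight, w))
    return round(clipped * FT_QUANTIZED_ONE)


examples = [quantize_to_i8(w) for w in (0.0, 0.25, 1.0, -1.0)]
```

A detail to watch: a limit of exactly 0.5 would quantize to 127.5, which does not fit cleanly in i8, so the limit needs to come from 127 rather than from halving 255.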

stray reef Nov 2, 2025, 9:04 AM

#

i did not have to do any retraining for i8 threat weights to work

twilit oriole Nov 2, 2025, 9:04 AM

#

SF is different, they had to change the QA

stray reef Nov 2, 2025, 9:04 AM

#

right

#

do they have QAT? or what is the reason for the retraining then?

rocky vigil Nov 2, 2025, 9:10 AM

#

to avoid having a bunch of x2's

#

for mulhi

stray reef Nov 2, 2025, 9:11 AM

#

with bullet, if no QAT is used, one can just re-quantise a checkpoint

#

so i'm wondering why that's not possible for SF

naive comet Nov 2, 2025, 10:05 AM

#

cuz bullet > nnue-pytorch gigachad

rocky vigil Nov 2, 2025, 11:10 AM

#

anyways check

#

https://github.com/sscg13/nnue-pytorch/tree/threat-i8-QA-255

#

to make sure I haven't done anything wrong

#

and if it looks good we can start a test run

violet badger Nov 2, 2025, 1:00 PM

#

I'll get to giving that a test run later today, but obviously that shouldn't stop people from having a look now 😉

violet badger Nov 2, 2025, 3:18 PM

#

net sharing is merged, so probably makes sense to rebase the SF threats branch on master, and run another 10k test of the current state.

lofty cedar Nov 2, 2025, 3:36 PM

#

Do we re-try the 1280 net on STC?

#

Shared memory should disproportionately benefit larger nets I think.

stray reef Nov 2, 2025, 3:37 PM

#

1024 with i8 is the way to go rn

#

i don't think the benefit is that big, per position only a tiny fraction of inputs are used

violet badger Nov 2, 2025, 3:41 PM

#

so, let's see where I get with the above branch by @rocky vigil ....

twilit oriole Nov 2, 2025, 3:56 PM

#

buff_doge

violet badger Nov 2, 2025, 3:58 PM

#

https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11943014180#L271

GitLab

step_1_91eca72b4cf9_rep_0_train (#11943014180) · Jobs · CSCS-CI /...

Mirror of https://github.com/vondele/nettest

#

fails compiling right now

twilit oriole Nov 2, 2025, 3:58 PM

#

Interesting a 38% speedup is apparently only 25 Elo in SF now. Used to be far more IIRC

violet badger Nov 2, 2025, 4:00 PM

#

might be... this is not the conventional way of measuring this (timeodds).

#

(should be maybe similar, but not certain)

#

let me try to measure once with current master, interesting enough a question..

violet badger Nov 2, 2025, 4:18 PM

#

   1 shared_memoryPRtc138    :    49.7    1.8  41962.5   73728    57
   2 shared_memoryPRtc100    :     0.0   ----  31765.5   73728    43

#

tc adjustment (tc=13.8+0.138)

#

That was the node count in the match master vs master and sharedmem vs sharedmem

 ===== shared_memory =====-
771 seconds for 169406377858 nodes
nps:  2.19723e+08
 ===== master =====-
769 seconds for 122427206326 nodes
nps:  1.59203e+08
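
For reference, the 38% figure can be reproduced from the node counts pasted above. This is only arithmetic on the numbers in the log; the dictionary layout is an illustrative choice.

```python
# (seconds, nodes) as reported in the log above
runs = {
    "shared_memory": (771, 169406377858),
    "master": (769, 122427206326),
}

# Nodes per second for each build
nps = {name: nodes / seconds for name, (seconds, nodes) in runs.items()}

# Relative speed of shared_memory over master
speedup = nps["shared_memory"] / nps["master"]
```

The ratio comes out at about 1.38, which is why the time-odds match was run at tc=13.8+0.138 against a 10+0.1 baseline.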

twilit oriole Nov 2, 2025, 4:20 PM

#

Interesting. I wonder how the elo of a 20% time odds difference changes with SF version

#

Like if it gets less with recent versions or stays constant

violet badger Nov 2, 2025, 4:21 PM

#

I suspect it will get less.

frosty imp Nov 2, 2025, 6:11 PM

#

stray reef with bullet, if no QAT is used, one can just re-quantise a checkpoint

we can. idk if the checkpoint is kept

#

rebased

violet badger Nov 2, 2025, 6:30 PM

#

frosty imp we can. idk if the checkpoint is kept

not forever, but I do have those.

stray reef Nov 2, 2025, 6:37 PM

#

then why train again?

frosty imp Nov 2, 2025, 6:38 PM

#

prolly no reason to

violet badger Nov 2, 2025, 6:43 PM

#

to enable QAT eventually...

#

I can upload a checkpoint somewhere.

#

https://drive.google.com/file/d/1jO4wuEd2QbbK_JSihMnbDb2QKqaJwt29/view?usp=sharing

Google Docs

nn-26b0e5126117.ckpt.gz

rocky vigil Nov 2, 2025, 7:13 PM

#

stray reef then why train again?

Hoping that having the trainer aware of clipping is a slight gain

rocky vigil Nov 2, 2025, 7:44 PM

#

violet badger https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/292682...

huh when did that semicolon get removed ???
anyways that error is fixed now...

violet badger Nov 2, 2025, 7:44 PM

#

let me try again.

rocky vigil Nov 2, 2025, 7:56 PM

#

well segfault on that

#

gah

green moat Nov 2, 2025, 8:02 PM

#

Failed again
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11943538786
🤨

rocky vigil Nov 2, 2025, 8:02 PM

#

null pointer

#

amazing

violet badger Nov 2, 2025, 8:03 PM

#

no, some code got deleted

#

git diff 5bcb0036825206ad6a23df6ed1b07211e3a73f58

#

diff --git a/training_data_loader.cpp b/training_data_loader.cpp
index 9f04699..8c79b9b 100644
--- a/training_data_loader.cpp
+++ b/training_data_loader.cpp
@@ -1202,7 +1202,7 @@ extern "C" {
         {
             return new SparseBatch(FeatureSet<Full_Threats>{}, entries);
         }
-        else if (feature_set == "Full_Threats^")
+        else if (feature_set == "Full_Threats^") 
         {
             return new SparseBatch(FeatureSet<Full_ThreatsFactorized>{}, entries);
         }
@@ -1267,10 +1267,6 @@ extern "C" {
         {
             return new FeaturedBatchStream<FeatureSet<Full_Threats>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
         }
-        else if (feature_set == "Full_Threats^") 
-        {
-            return new FeaturedBatchStream<FeatureSet<Full_ThreatsFactorized>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
-        }
         fprintf(stderr, "Unknown feature_set %s\n", feature_set_c);
         return nullptr;
     }

rocky vigil Nov 2, 2025, 8:05 PM

#

???
i have no idea what happened
strange

violet badger Nov 2, 2025, 8:05 PM

#

maybe started from a different branch or so?

#

pretty sure that's a fix we made earlier or.

rocky vigil Nov 2, 2025, 8:06 PM

#

oh it's missing https://github.com/sscg13/nnue-pytorch/commit/6f7dd2ea5876190b49cab19cc4248d370f170db0

#

ok

#

ah i think my local branch was behind

#

since i just merged the pr on github

violet badger Nov 2, 2025, 8:07 PM

#

sounds plausible

rocky vigil Nov 2, 2025, 8:08 PM

#

alright that should be fixed now

#

so if there's any error it must be either with the different quantization or the clipping

#

as a side note I think the current method also clips the threat psqbucket weights

frosty imp Nov 2, 2025, 8:13 PM

#

rocky vigil Hoping that having the trainer aware of clipping is a slight gain

Against the i8 current net?

rocky vigil Nov 2, 2025, 8:14 PM

#

frosty imp Against the i8 current net?

yeah against just clipping the weights directly

#

i mean someone is free to try i8 post-processing the current net

#

just read the threat weights and then write them as i8
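
A rough sketch of that post-processing idea, assuming the threat weights are already available as a flat sequence of 16-bit integers. It does not model the real .nnue file layout, and since the thread mentions a QA change, a real conversion may also need rescaling; the function name `i16_to_i8_saturating` is made up for this example.

```python
from array import array


def i16_to_i8_saturating(weights_i16):
    """Convert i16 weights to i8, saturating out-of-range values.

    Returns the converted weights and how many values had to be clipped,
    which gives a feel for how lossy the conversion is.
    """
    out = array("b")
    n_clipped = 0
    for w in weights_i16:
        c = max(-128, min(127, w))
        if c != w:
            n_clipped += 1
        out.append(c)
    return out, n_clipped


# Toy data standing in for threat weights read from a net.
src = array("h", [3, -200, 127, 128, -128])
dst, n_clipped = i16_to_i8_saturating(src)
```

If `n_clipped` is a tiny fraction of the weights, post-processing an existing net is a plausible baseline for testing the i8 inference code before any retraining.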

violet badger Nov 2, 2025, 8:15 PM

#

I've posted the checkpoint above..

#

(a link to)

#

it would probably be a good baseline, and test of the inference code .

rocky vigil Nov 2, 2025, 8:19 PM

#

well apparently it is 8x slower

#

gah

#

i think there are too many weights clipped too frequently

violet badger Nov 2, 2025, 8:21 PM

#

that's a bit too slow to be practical.. I guess probably not quite right.

rocky vigil Nov 2, 2025, 8:22 PM

#

i think we need a more fine grained optimization to make this work

#

it's clipping every batch

#

it should be safe to reduce the threat clipping to every SB (epoch) instead

#

and that should end the slowdown

frosty imp Nov 2, 2025, 8:33 PM

#

8x slowdown sounds too much?

rocky vigil Nov 2, 2025, 8:35 PM

#

while we're at it I think the init could be better in threat inputs

#

afaik bullet has some stuff that improves the init

rocky vigil Nov 2, 2025, 8:36 PM

#

frosty imp 8x slowdown sounds too much?

yeah not sure why

#

maybe it's the slicing that's causing the issue?

violet badger Nov 2, 2025, 8:37 PM

#

is it maybe doing this cpu side, i.e. transferring stuff back and forth?

rocky vigil Nov 2, 2025, 8:37 PM

#

could be, maybe it's transferring all 320 MB of threat weights every time

#

what's the cpu-gpu bandwidth?

#

shouldn't be too hard to check if this is the case...

violet badger Nov 2, 2025, 8:40 PM

#

high, but obviously that would be bottleneck. (450GB/s)
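
A rough back-of-the-envelope check of the transfer hypothesis, using only the two figures quoted above (about 320 MB of threat weights, about 450 GB/s). It assumes one copy to the CPU and one copy back per batch, which is a guess about what a CPU-side clip would do rather than something measured here.

```python
# Figures quoted in the chat; both are approximate.
THREAT_WEIGHT_BYTES = 320e6  # ~320 MB of threat weights
LINK_BANDWIDTH = 450e9       # ~450 GB/s CPU-GPU bandwidth


def transfer_seconds(num_bytes: float, bandwidth: float, round_trips: int = 1) -> float:
    """Time to copy num_bytes across the link and back, once per round trip."""
    return 2 * round_trips * num_bytes / bandwidth


# Assumption: one device->host->device round trip per batch.
per_batch = transfer_seconds(THREAT_WEIGHT_BYTES, LINK_BANDWIDTH)
print(f"{per_batch * 1e3:.2f} ms of pure transfer per batch")
```

Under these assumptions the copy itself costs on the order of a millisecond per batch. Whether that is significant depends on the normal per-batch step time, which is not stated here.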

rocky vigil Nov 2, 2025, 8:42 PM

#

hmm

#

lemme split the clipping and only call threat clipping on batch 0

violet badger Nov 2, 2025, 8:43 PM

#

worth trying, but I can hardly imagine it being that slow in general... unless something is unexpected.

rocky vigil Nov 2, 2025, 8:50 PM

#

frosty imp 8x slowdown sounds too much?

can u check the changes see if anything besides clipping could possibly be the slowdown

#

here are the batches cyclic between 0-6103

frosty imp Nov 2, 2025, 9:00 PM

#

violet badger is it maybe doing this cpu side, i.e. transferring stuff back and forth?

Shouldn’t be

#

Everything is automatically moved to cpu via lightning

rocky vigil Nov 2, 2025, 9:00 PM

#

actually if it's doing the clipping on cpu

#

uh

#

clipping 80 million weights

#

takes a lot of time

rocky vigil Nov 2, 2025, 9:06 PM

#

rocky vigil here are the batches cyclic between 0-6103

is it fine to just replace ?? with 6103

#

@frosty imp

frosty imp Nov 2, 2025, 9:12 PM

#

Idk

#

Isn’t batch the data?

rocky vigil Nov 2, 2025, 9:12 PM

#

oh

#

bruh

#

how do I do it like at the end of every superbatch then

#

or is it batch_idx

frosty imp Nov 2, 2025, 9:13 PM

#

I mean there is a batch_idx

rocky vigil Nov 2, 2025, 9:13 PM

#

batch_idx == 6103 works?

frosty imp Nov 2, 2025, 9:13 PM

#

That I don’t know

rocky vigil Nov 2, 2025, 9:14 PM

#

actually let's do 0

#

since quantization clips it at the very end anyways
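
A minimal sketch of the gating idea just discussed: keep the cheap clipping on every batch but run the expensive threat-weight clip only when `batch_idx == 0`. The function names are hypothetical and the loop only counts calls. It is not the trainer's actual hook code.

```python
BATCHES_PER_EPOCH = 6104  # batch_idx cycles over 0..6103 in the setup described above


def should_clip_threats(batch_idx: int) -> bool:
    """Hypothetical gate: run the expensive threat clip once per epoch."""
    return batch_idx == 0


def simulate_epoch(num_batches: int = BATCHES_PER_EPOCH) -> dict:
    """Count how often each kind of clipping would run in one epoch."""
    calls = {"cheap_clip": 0, "threat_clip": 0}
    for batch_idx in range(num_batches):
        calls["cheap_clip"] += 1       # small layers: still clipped every batch
        if should_clip_threats(batch_idx):
            calls["threat_clip"] += 1  # ~80M threat weights: once per epoch
    return calls


calls = simulate_epoch()
print(calls)
```

Gating on the index rather than on the batch contents matters because `batch` is the data itself, as pointed out above.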

#

btw @frosty imp could you also run stc smp

green moat Nov 2, 2025, 9:31 PM

#

Still failing....
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11943610457
🤔

rocky vigil Nov 2, 2025, 9:51 PM

#

serialize hmm

#

issue with "deepcopy"

#

this is out of my depth

#

i cannot connect this to anything new introduced

frosty imp Nov 2, 2025, 10:11 PM

#

rocky vigil issue with "deepcopy"

try not doing the slice thing

#

I think serialize runs on the cpu so you can prolly test that

rocky vigil Nov 2, 2025, 11:29 PM

#

frosty imp try not doing the slice thing

but the slicing is only for clipping threat weights

#

I don't see why it would affect

frosty imp Nov 2, 2025, 11:52 PM

#

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001

#

So perhaps something is interacting to implicitly create a tensor

#

I think maybe the view was copied since the quant config is part of model

rocky vigil Nov 2, 2025, 11:54 PM

#

Huh

rocky vigil Nov 3, 2025, 3:08 AM

#

frosty imp I think maybe the view was copied since the quant config is part of model

not sure what you're referring to by this

frosty imp Nov 3, 2025, 6:33 AM

#

the view of model.input.weights is in the clipping dictionary

#

the model itself owns a copy of the clipping dictionary

#

when you deepcopy the model, you deepcopy the clipping dictionary and therefore the view itself
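
The explanation above can be reproduced with a toy module. This is a sketch under assumptions: the class and attribute names are made up, and the real clipping dictionary in the trainer may be structured differently. A slice of a `Parameter` is a non-leaf tensor, so storing it on the module makes `copy.deepcopy` fail, while computing the slice on demand does not.

```python
import copy

import torch


class WithStoredView(torch.nn.Module):
    """Toy module that keeps a slice of its own weight in a config object."""

    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(8, 4))
        # The slice has a grad_fn, so it is not a graph leaf.
        self.clip_config = [{"params": [self.weight[0:6]], "min": -1.0, "max": 1.0}]


class WithLazyView(torch.nn.Module):
    """Toy module that builds the slice only when asked."""

    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.zeros(8, 4))

    def threat_slice(self):
        return self.weight[0:6]


def can_deepcopy(module: torch.nn.Module) -> bool:
    try:
        copy.deepcopy(module)
        return True
    except RuntimeError:
        return False


print(can_deepcopy(WithStoredView()), can_deepcopy(WithLazyView()))
```

Moving the slice into a function, as suggested in the next messages, corresponds to the second class.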

twilit oriole Nov 3, 2025, 6:35 AM

#

python be like

frosty imp Nov 3, 2025, 6:36 AM

#

i mean isn't that like most languages that pass by reference

rocky vigil Nov 3, 2025, 7:56 AM

#

so locally I removed the slice from the weight clipping config

#

and instead this function exists

#

will this also solve the deepcopy issue?

frosty imp Nov 3, 2025, 7:56 AM

#

try it I guess

frosty imp Nov 3, 2025, 7:57 AM

#

frosty imp when you deepcopy the model, you deepcopy the clipping dictionary and therefore ...

if this is the issue then breaking it into a separate function would help

rocky vigil Nov 3, 2025, 8:00 AM

#

frosty imp try it I guess

like run serialize?

frosty imp Nov 3, 2025, 8:00 AM

#

yeah

rocky vigil Nov 3, 2025, 8:00 AM

#

don't I need a checkpoint for that

frosty imp Nov 3, 2025, 8:01 AM

#

you can do it on a .nnue as well

rocky vigil Nov 3, 2025, 8:02 AM

#

ok

#

how do I actually run serialize standalone

violet badger Nov 3, 2025, 8:24 AM

#

rocky vigil don't I need a checkpoint for that

there is a checkpoint

#

#1336647760388034610 message

rocky vigil Nov 3, 2025, 8:29 AM

#

frosty imp you can do it on a .nnue as well

i can also do it with this, the issue is that idk how to run it in the first place

violet badger Nov 3, 2025, 8:32 AM

#

https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11805606607#L2788

#

(that includes the permutation optimization, which you might want to skip)

rocky vigil Nov 3, 2025, 8:39 AM

#

i have such warnings as

```
RuntimeWarning: invalid value encountered in exp2
  epsneg_f128 = exp2(ld(-113))
RuntimeWarning: invalid value encountered in exp2
  tiny_f128 = exp2(ld(-16382))
RuntimeWarning: invalid value encountered in exp2
  eps=exp2(ld(-112)),
RuntimeWarning: invalid value encountered in nextafter
  self._smallest_subnormal = nextafter(
RuntimeWarning: invalid value encountered in log10
  self.precision = int(-log10(self.eps))
```
but no deepcopy crash

violet badger Nov 3, 2025, 8:41 AM

#

that's probably when you start from the checkpoint, not the nnue?

rocky vigil Nov 3, 2025, 8:41 AM

#

i also have no idea what that even did

#

py -u serialize.py ../Stockfish/src/nn-26b0e5126117.nnue test.nnue --device=0 --features=Full_Threats --l1=1024 --ft_compression=leb128

violet badger Nov 3, 2025, 8:42 AM

#

The crash was with:

python -u serialize.py /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/nonopt.nnue /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/last.nnue --ft_optimize_data=/workspace/data/official-stockfish/master-binpacks/fishpack32.binpack --device=0 --features=Full_Threats --l1=1024 --ft_optimize_count=100000 --ft_optimize --ft_compression=leb128

#

and with ft_optimize, so

#

you can try that with a small ft_optimize_count ?

rocky vigil Nov 3, 2025, 8:43 AM

#

do I really need to download 5.58 GB binpack for this...

violet badger Nov 3, 2025, 8:44 AM

#

any tiny binpack will do?

rocky vigil Nov 3, 2025, 8:44 AM

#

ok I'll try it with small.binpack

violet badger Nov 3, 2025, 8:44 AM

#

even just start the download and kill it ...

rocky vigil Nov 3, 2025, 8:45 AM

#

maybe was a crash

#

no error message though

violet badger Nov 3, 2025, 8:48 AM

#

if you use small.binpack, try to reduce optimize_count. Could also be that this only crashes when run on GPU?

rocky vigil Nov 3, 2025, 8:48 AM

#

rocky vigil maybe was a crash

same thing happens if I use the base threat_inputs branch

#

it gives the invalid value warnings

#

and then terminal looks like it indicates a crash

#

there is an additional warning of ```Warning: Numpy built with MINGW-W64 on Windows 64 bits is experimental, and only available for
testing. You are advised not to use it for production.

CRASHES ARE TO BE EXPECTED - PLEASE REPORT THEM TO NUMPY DEVELOPERS```

#

ok so i think something is wrong here

#

because it is not even getting to the python part

rocky vigil Nov 3, 2025, 8:57 AM

#

violet badger if you use small.binpack, try to reduce optimize_count. Could also be that this ...

in the meanwhile, could you check that https://github.com/sscg13/nnue-pytorch/commit/2f978cce16324358e541d22a87610500eee7ac1f fixes the slowdown?

violet badger Nov 3, 2025, 8:59 AM

#

running

rocky vigil Nov 3, 2025, 9:10 AM

#

ah

#

the syntax is not what I thought it was

#

hmm

#

oh i typoed

#

weight

#

not weights

#

apparently

violet badger Nov 3, 2025, 9:14 AM

#

restarted

rocky vigil Nov 3, 2025, 9:28 AM

#

alright runs now with basically no speed loss

#

in several minutes we'll know if serialize works

#

so it looks like it worked and a net was uploaded

#

now let me check that the threat weights are indeed in i8 range...

#

zip compresses better (48 MB instead of 65)

#

good sign hopefully

violet badger Nov 3, 2025, 9:41 AM

#

very early net, might also contribute

rocky vigil Nov 3, 2025, 9:45 AM

#

well, better

#

i might need to be more aggressive with the limits

prime mica Nov 3, 2025, 9:46 AM

#

by how much do they exceed the limits

rocky vigil Nov 3, 2025, 9:46 AM

#

now lemme recompile

prime mica Nov 3, 2025, 9:46 AM

#

if it's only a couple we could just clamp it... obviously not the cleanest solution ofc

rocky vigil Nov 3, 2025, 9:46 AM

#

this is only a test run

#

i can just fix this

prime mica Nov 3, 2025, 9:47 AM

#

O ok

#

huzzah

rocky vigil Nov 3, 2025, 9:51 AM

#

why does gcc say it invokes undefined behavior past 22528 * l1

#

I am pretty confident that weights has size 102384 * l1

prime mica Nov 3, 2025, 9:52 AM

#

what is ThreatInputDimensions * HalfDimensions

#

also shouldn't it be weights[i] >= 128 or weights[i] < -128?

rocky vigil Nov 3, 2025, 9:53 AM

#

prime mica what is `ThreatInputDimensions * HalfDimensions`

number of threat weights

#

79856 * 1024

#

weights[0:79856*1024] should be the threat weights

prime mica Nov 3, 2025, 9:54 AM

#

😩

#

maybe weights is too smol

rocky vigil Nov 3, 2025, 9:55 AM

#

here are the weights in question

#

it might be because the lr is too high

#

so clipping at the start of epoch

#

still gives them a full epoch to exceed the limits

#

nevertheless, 200 is still better than the 400k it says for the current net

violet badger Nov 3, 2025, 9:57 AM

#

can't you add something to the loss?

rocky vigil Nov 3, 2025, 9:58 AM

#

more complexity

#

can also enforce the clipping at final quantization step probably

violet badger Nov 3, 2025, 9:58 AM

#

but rather straightforward?

rocky vigil Nov 3, 2025, 9:59 AM

#

probably

#

maybe one of the things that counts as QAT

violet badger Nov 3, 2025, 9:59 AM

#

yeah, probably.

rocky vigil Nov 3, 2025, 10:08 AM

#

does weight[0:79856].clamp(-128, 127) do what I want it to do?

#

post-integer conversion

#

if that line does what I want it to do then https://github.com/sscg13/nnue-pytorch/commit/c718d1543a5ef4068b04770914a4385c5d874b6b can be used

violet badger Nov 3, 2025, 10:15 AM

#

seems reasonable, should I give it another go?

#

started

rocky vigil Nov 3, 2025, 10:19 AM

#

rocky vigil does `weight[0:79856].clamp(-128, 127)` do what I want it to do?

unsure whether this or
weight[0:79856] = weight[0:79856].clamp(-128, 127) is right

#

it is possible they both are

rocky vigil Nov 3, 2025, 10:23 AM

#

rocky vigil unsure whether this or `weight[0:79856] = weight[0:79856].clamp(-128, 127)` is ...

yeah I think this should be

#

not what I have

violet badger Nov 3, 2025, 10:23 AM

#

ah... yeah, so well, in a few minutes I can start the next run.

rocky vigil Nov 3, 2025, 10:25 AM

#

yeah clamp_() is the in-place version

prime mica Nov 3, 2025, 10:25 AM

#

rocky vigil yeah `clamp_()` is the in-place version

wtf is this naming

rocky vigil Nov 3, 2025, 10:25 AM

#

so either weight[0:79856] = weight[0:79856].clamp(-128, 127) or weight[0:79856].clamp_(-128, 127)
gonna go with the former bc yeah
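
A small sketch of the three variants discussed above, on a toy integer tensor. The row count `N = 2` stands in for 79856, and the values are made up. `clamp` returns a new tensor, so calling it without assigning the result changes nothing, while slice assignment and `clamp_` both update the original storage.

```python
import torch

N = 2  # stand-in for the 79856 threat rows
weight = torch.tensor([[-300, 50], [200, -10], [500, 1]], dtype=torch.int32)

# Variant 1: out-of-place clamp with the result discarded. No effect.
a = weight.clone()
a[0:N].clamp(-128, 127)

# Variant 2: out-of-place clamp assigned back into the slice.
b = weight.clone()
b[0:N] = b[0:N].clamp(-128, 127)

# Variant 3: in-place clamp on the slice, which is a view of the same storage.
c = weight.clone()
c[0:N].clamp_(-128, 127)

print(a.tolist())
print(b.tolist())
print(c.tolist())
```

Rows at index `N` and beyond are left untouched in all three variants, which is the intent if only the threat weights should be limited to the i8 range.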

prime mica Nov 3, 2025, 10:25 AM

#

🤯

#

they should have a [[nodiscard]] equivalent on weight[0:79856].clamp(-128, 127) lol

#

oh ig not a thing in Python

violet badger Nov 3, 2025, 10:45 AM

#

restarted with the latest syntax

rocky vigil Nov 3, 2025, 11:11 AM

#

still 133 exceeding limits

#

extremely strange

#

oh shoot

#

I've been reading the psq weights

#

actually why are so many of them small

#

0 out of 81772544 threat weights exceed i8 limits

#

looks like it works
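
The check reported above can be sketched as follows. The function and the sample values are illustrative only. The real check was done on the engine side against the loaded net, and as noted above it has to look at the threat weights rather than the psq weights.

```python
I8_MIN, I8_MAX = -128, 127


def count_outside_i8(weights) -> int:
    """Count values that would not fit in a signed 8-bit integer."""
    return sum(1 for w in weights if w < I8_MIN or w > I8_MAX)


# Made-up sample values, not taken from any net.
threat_weights = [-128, -1, 0, 64, 127]
other_weights = [-129, 128, 300, 5]

print(f"{count_outside_i8(threat_weights)} out of {len(threat_weights)} threat weights exceed i8 limits")
print(f"{count_outside_i8(other_weights)} out of {len(other_weights)} other weights exceed i8 limits")
```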

rocky vigil Nov 3, 2025, 11:16 AM

#

rocky vigil I've been reading the psq weights

that also explains the compiler warning

#

as it turns out inputdimensions should probably be named psqinputdimensions

#

oh well

violet badger Nov 3, 2025, 11:19 AM

#

so, things look better now?

rocky vigil Nov 3, 2025, 11:19 AM

#

yeah

#

very surprising

#

that only 200 of the psq weights have managed to exceed 128 in abs. value

#

i guess this is an artifact of only 8 epochs

#

now the harder part to verify is that the QA=255 is working

violet badger Nov 3, 2025, 11:21 AM

#

I suppose I could start a longer train, like one 800epoch run. Means we have a reasonable net by tomorrow?

rocky vigil Nov 3, 2025, 11:21 AM

#

yeah

#

curious if 100 epoch stages 1-5 produces a better net faster than just 800 epoch of stage 1

#

i suppose it doesn't really matter

#

keeping it to single stage might also be easier to set up

violet badger Nov 3, 2025, 11:23 AM

#

if this 1 stage has the advantage we can kind of compare with 1stage normal setup.

rocky vigil Nov 3, 2025, 11:23 AM

#

yeah

#

right

#

and if it works can save having to redo it

violet badger Nov 3, 2025, 11:23 AM

#

so, let me start this.

#

also

#

meanwhile would be interesting to see what happens just quantizing the existing checkpoint tbh.

rocky vigil Nov 3, 2025, 11:25 AM

#

on inference side requires an extra x2 somewhere (either as shift, or double add) but it is also interesting

#

or, the inference side is still not entirely worked out

#

i think it is possible actually to just post-process the latest nnue

violet badger Nov 3, 2025, 11:26 AM

#

in SF even, yeah, I guess so.

rocky vigil Nov 3, 2025, 11:30 AM

#

also @frosty imp is there a reason why all of this code is necessary instead of

read_leb_128(stream, threatWeights, HalfDimensions * ThreatInputDimensions)
read_leb_128(stream, weights, HalfDimensions * InputDimensions)
...same for psqt

#

oh

#

so the leb128 here actually includes a length counter for the slice

#

so it must all be read in one go
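
To illustrate why a length-prefixed LEB128 block has to be read in one go, here is a sketch of signed LEB128 with a single byte-count prefix. The layout (a 4-byte little-endian payload length followed by the encoded values) is an assumption for illustration, and `write_block`/`read_block` are hypothetical helpers, not the engine's functions.

```python
import struct


def encode_sleb128(value: int) -> bytes:
    """Encode one signed integer as LEB128 (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, keeps the sign
        sign_bit = byte & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)  # continuation bit


def write_block(values) -> bytes:
    payload = b"".join(encode_sleb128(v) for v in values)
    return struct.pack("<I", len(payload)) + payload  # one prefix for the whole block


def read_block(buf: bytes, count: int) -> list:
    (payload_len,) = struct.unpack_from("<I", buf, 0)
    payload = buf[4:4 + payload_len]
    values, pos = [], 0
    for _ in range(count):
        result, shift = 0, 0
        while True:
            byte = payload[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                if byte & 0x40:
                    result -= 1 << shift  # sign-extend
                break
        values.append(result)
    if pos != payload_len:
        raise ValueError("element count does not match the block length")
    return values


threat = [-128, 127, 0]
psq = [300, -5]
block = write_block(threat + psq)
decoded = read_block(block, len(threat) + len(psq))
threat_out, psq_out = decoded[:len(threat)], decoded[len(threat):]
print(threat_out, psq_out)
```

Because the prefix covers both arrays together, the reader decodes the whole block and splits it afterwards. Asking for only the first array's element count leaves unread bytes, which this sketch treats as an error.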

rocky vigil Nov 3, 2025, 10:08 PM

#

meanwhile still open to suggestions on how to avoid declaring threat weights when it's unnecessary

#

(saves 20 MB of memory)

prime mica Nov 3, 2025, 10:08 PM

#

explain?

rocky vigil Nov 3, 2025, 10:08 PM

#

for the smallnet

#

which doesn't use threat inputs

#

there is no reason to declare a 79856 * 128 array

#

for the threat weights

#

that will be unused

prime mica Nov 3, 2025, 10:09 PM

#

oh

#

why not just make the size depend

#

``` threatWeights[IsSmallNet ? 0 : ...]```

rocky vigil Nov 3, 2025, 10:09 PM

#

oh

#

right right

prime mica Nov 3, 2025, 10:09 PM

#

huzzah

rocky vigil Nov 3, 2025, 10:10 PM

#

is declaring a [0] array valid

prime mica Nov 3, 2025, 10:10 PM

#

idt so

#

but [1] ^_^

rocky vigil Nov 3, 2025, 10:10 PM

#

fair

#

yeah this is smart

#

no code duplication needed

prime mica Nov 3, 2025, 10:11 PM

#

DRY lover

rocky vigil Nov 3, 2025, 10:22 PM

#

@naive comet what needs to be done inference side for the QA=255

naive comet Nov 3, 2025, 10:26 PM

#

anywhere with 127*2, replace with 255, delete weight doubling, gg

rocky vigil Nov 3, 2025, 10:26 PM

#

ok

#

we'll see how it goes in like

#

6 hours or so

#

tbh I wonder if removing the unused threat weights in smallnet actually does anything

#

besides just shaving off 20 MB

#

surely the compiler knows they're unused

upbeat pewter Nov 3, 2025, 10:28 PM

#

rocky vigil is declaring a [0] array valid

it is, in fact, valid; allows you to declare alignment

rocky vigil Nov 3, 2025, 10:28 PM

#

huh stackoverflow says not

#

idk

#

can just use 1

#

who cares

rocky vigil Nov 3, 2025, 11:02 PM

#

prime mica ``` threatWeights[IsSmallNet ? 0 : ...]```

crashes on runtime

#

gah

prime mica Nov 3, 2025, 11:02 PM

#

wai

rocky vigil Nov 3, 2025, 11:02 PM

#

ok it's lldb time

#

can't figure out this templating mess

#

nvm it's in the scaling lmao

#

alright

#

6 MiB instead of 28

#

or smth

#

let's go

#

```info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))```

#

that's better

#

i like it

rocky vigil Nov 3, 2025, 11:10 PM

#

rocky vigil can just use 1

lmao this technically has to be 128 otherwise it'll fail some static assert

prime mica Nov 3, 2025, 11:11 PM

#

banger

rocky vigil Nov 3, 2025, 11:12 PM

#

@frosty imp https://github.com/xu-shawn/Stockfish/pull/25

naive comet Nov 3, 2025, 11:18 PM

#

lmao then surely it passes now right

#

even without i8

frosty imp Nov 3, 2025, 11:23 PM

#

eh lemme try refactoring it in a better way

rocky vigil Nov 3, 2025, 11:25 PM

#

sure

rocky vigil Nov 3, 2025, 11:25 PM

#

naive comet lmao then surely it passes now right

is 20 MB of memory that valuable

#

i thought since it was unused actually

#

it just sits in memory and does nothing

naive comet Nov 3, 2025, 11:26 PM

#

idk

#

I have no clue

rocky vigil Nov 4, 2025, 4:35 AM

#

frosty imp eh lemme try refactoring it in a better way

will u have this by the time the stage 1 i8 net is trained

frosty imp Nov 4, 2025, 4:36 AM

#

eh just merged your pr

#

spent the time going off a tangent switching everything to std::array because the way it is now is a pain to refactor

rocky vigil Nov 4, 2025, 4:38 AM

#

oh

rocky vigil Nov 4, 2025, 5:11 AM

#

naive comet anywhere with 127*2, replace with 255, delete weight doubling, gg

do the later layers just remain unchanged?

#

net is trained now

#

validation loss looks fine

#

```0 out of 81772544 threat weights exceed i8 limits```

#

good start

rocky vigil Nov 4, 2025, 5:43 AM

#

@naive comet https://github.com/sscg13/Stockfish/commit/83eb0e1d835e138194237c33cc968c48f42a6a68 look good?

naive comet Nov 4, 2025, 6:06 AM

#

lgtm I think

rocky vigil Nov 4, 2025, 6:17 AM

#

```info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))```

#

let's go?

#

bench matches

prime mica Nov 4, 2025, 6:18 AM

#

🥳

#

u are cooking

rocky vigil Nov 4, 2025, 6:19 AM

#

can someone check single/multithread speedtest of https://github.com/sscg13/Stockfish/commit/5a6633ad554f22ef1ad953bff2af74d0db3c0b79 vs right before this commit

prime mica Nov 4, 2025, 6:19 AM

#

yessir gimme a bit

rocky vigil Nov 4, 2025, 6:21 AM

#

cool cool

#

expecting single thread ~ neutral, more threads to be big speedup

prime mica Nov 4, 2025, 6:27 AM

#

huh

#

do I need to manually download it in this case

#

that's fine but just checking

rocky vigil Nov 4, 2025, 6:29 AM

#

💀

#

oh shoot

#

lemme in fact upload to fishtest

#

we're gonna be doing a test vs previous stage 1 anyways

prime mica Nov 4, 2025, 6:29 AM

#

O

#

Failed to download from https://tests.stockfishchess.org/api/nn/nn-81c52631cfec.nnue

#

maybe you need to upload

rocky vigil Nov 4, 2025, 6:31 AM

#

rocky vigil lemme in fact upload to fishtest

uploaded to fishtest now

prime mica Nov 4, 2025, 6:31 AM

#

danke

rocky vigil Nov 4, 2025, 6:32 AM

#

or

#

uploading

#

i should say

#

it might take a minute

#

to go through

prime mica Nov 4, 2025, 6:32 AM

#

yep

#

since we're doing i8 we might as well skip the leb128 nonsense for that section

#

later tho

rocky vigil Nov 4, 2025, 6:33 AM

#

504 gateway time out 💀

#

uh

#

lemme retry

prime mica Nov 4, 2025, 6:33 AM

#

story of my life

rocky vigil Nov 4, 2025, 6:33 AM

#

oh

#

it went through

#

ok

rocky vigil Nov 4, 2025, 6:35 AM

#

prime mica since we're doing i8 we might as well skip the leb128 nonsense for that section

yeah i want to

#

leb still compresses it somewhat though

#

compared to verbatim

prime mica Nov 4, 2025, 6:35 AM

#

bench is 2266138
