UE Threat Inputs for AB | Stockfish | Page 2

rocky vigil Mar 5, 2025, 2:57 AM

#

```
Engine::load_networks()
network::load(rootDirectory, evalfilePath)
network::load_internal()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()
```

#

I don't understand how it reads 3 times

#

https://github.com/sscg13/Stockfish/tree/threat-inputs if you want branch

frosty imp Mar 5, 2025, 2:59 AM

#

where did you add the cout

rocky vigil Mar 5, 2025, 3:00 AM

#

uh

#

latest commit I pushed just now

#

has all the debug couts

#

i know at least one issue is that it is expecting a non-empty header

#

which it never gets

frosty imp Mar 5, 2025, 3:12 AM

#

one call from load_internal

#

one call from load_user_net

rocky vigil Mar 5, 2025, 3:12 AM

#

oh huh

#

load_user_net

#

hmm

frosty imp Mar 5, 2025, 3:13 AM

#

another call from load_user_net

rocky vigil Mar 5, 2025, 3:14 AM

#

but shouldn't they all trigger this cout

#

unless there are 3 dirs

#

right

#

ok

#

so the load of L1 fails without telling me it fails

#

huh

#

stream.fail is false here apparently, according to this last cout

#

therefore this should return true

#

oh lmao @frosty imp inserting the std::cout here breaks the one line if statement
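
For readers following along, the failure mode described here can be reproduced in a few lines. This is a minimal sketch, not the actual Stockfish loader: `load_buggy` and `load_fixed` are hypothetical names, and the `stream.fail()` check only stands in for whatever the real one-line `if` guarded.

```cpp
#include <iostream>
#include <sstream>

// The intended (correct) guard was a braceless one-liner, roughly:
//     if (stream.fail()) return false;

// Buggy: a debug print was inserted between the `if` and its statement.
// The print becomes the guarded statement, so `return false` now runs on
// every call, even when the stream is fine.
bool load_buggy(std::istream& stream) {
    if (stream.fail())
        std::cerr << "debug: read failed\n";
    return false;  // no longer guarded by the if
    return true;   // unreachable
}

// Fixed: braces keep the debug print and the early return under the guard.
bool load_fixed(std::istream& stream) {
    if (stream.fail()) {
        std::cerr << "debug: read failed\n";
        return false;
    }
    return true;
}
```

With a healthy stream, `load_buggy` still reports failure while `load_fixed` returns true, which matches the "stream.fail is false, therefore this should return true" confusion above.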

#

gah

#

i hate my life

frosty imp Mar 5, 2025, 3:25 AM

#

nohope

rocky vigil Mar 5, 2025, 3:26 AM

#

```
eval
Engine::verify_networks()
network::verify()
Current path: nn-98b68b5a9455.nnue
info string NNUE evaluation using nn-98b68b5a9455.nnue (7MiB, (15776, 256, 1))


 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
| +631  | +614  | -43.2 | -62.5 |       | -37.0 | +616  | -56.6 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
| +619  | +621  | +573  | +343  | +231  | +506  | +623  | -76.1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
| +579  | +245  | +58.9 | +285  | +376  | +23.3 | +245  | +573  |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
| +53.8 | +255  | +46.8 | +362  |       | +304  | +254  | +53.3 |
+-------+-------+-------+-------+-------+-------+-------+-------+


NNUE evaluation        -271.85 (white side)
Final evaluation       +27.16 (white side) [with scaled NNUE, ...]
```

#

ah yes

#

we love to see it

#

ok I forgot to divide by QA*QB

#

that explains the hilariously high values

#

nvm I didn't

#

oh

#

I forgot to do CReLU
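
For context, the two steps mentioned here (clipped ReLU on the accumulator, then dividing out QA*QB) fit into a single-layer quantized output roughly as follows. This is a sketch under stated assumptions: the constants `QA = 255`, `QB = 64` and `SCALE = 400` are common bullet-style defaults and are not confirmed by the chat, `evaluate` is a hypothetical name, and only one perspective is shown.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed quantization constants (not taken from the branch under discussion).
constexpr int QA    = 255;  // scale of accumulator (feature transformer) values
constexpr int QB    = 64;   // scale of output-layer weights
constexpr int SCALE = 400;  // maps the network output to centipawn-like units

constexpr std::size_t L1 = 256;  // hidden size mentioned in the chat

// Clipped ReLU on a quantized accumulator value: clamp to [0, QA].
inline int crelu(std::int16_t x) { return std::clamp<int>(x, 0, QA); }

// Output layer for one perspective of a single-layer net.
// acc is at scale QA, weights at scale QB, bias at scale QA * QB.
inline int evaluate(const std::array<std::int16_t, L1>& acc,
                    const std::array<std::int16_t, L1>& weights,
                    std::int32_t bias) {
    std::int64_t sum = bias;
    for (std::size_t i = 0; i < L1; ++i)
        sum += static_cast<std::int64_t>(crelu(acc[i])) * weights[i];
    // Undo both quantization scales once, at the very end.
    return static_cast<int>(sum * SCALE / (QA * QB));
}
```

Skipping `crelu` lets negative or oversized accumulator entries leak straight into the dot product, which is one plausible way to get values like the ones in the table above.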

#

average x^2 activation

#

```
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
| -0.00 |  0.00 | -0.00 | -0.01 |       | -0.00 |  0.00 | -0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
|  0.00 |  0.00 |  0.00 | -0.00 | -0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
| -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
| +0.66 | +0.49 | +0.40 | -0.02 |       | -0.35 | -0.46 | -0.18 |
+-------+-------+-------+-------+-------+-------+-------+-------+


NNUE evaluation        +2.60 (white side)
Final evaluation       +9.55 (white side) [with scaled NNUE, ...]
```

#

aha

#

it still doesn't work

rocky vigil Mar 5, 2025, 6:41 AM

#

yay /s

#

evaling startpos

#

between different searches

#

gives different results

#

yeah I think I have UB somewhere

#

when typing 'eval' command twice in a row results in different outputs

#

there is nondeterminism happening here and I can't find it gah

#

only good news is I think active features are correctly computed

violet badger Mar 5, 2025, 7:10 AM

#

time for a run through valgrind ...

#

but that looks like progress

rocky vigil Mar 5, 2025, 7:17 AM

#

these look like normal accumulator values...

#

ok what the FT expects to output

#

does not remotely match what the second layer receives as input

#

huh I guess I'll debug this tmrw

rocky vigil Mar 5, 2025, 5:06 PM

#

Ok inference works now

#

As in it plays superhuman chess

#

Uh

#

Got a HalfKAv2hm net

#

Or so

#

(Also single layer)

rocky vigil Mar 5, 2025, 5:22 PM

#

rocky vigil Ok inference works now

Speedtest 510398 for (non-ue threats->256) vs 1246511 for master

#

I estimate an 8x gain with ue is reasonable

#

So overall at the same size perhaps 2-3x slower

#

In the midgame, significantly better than 8x should also be possible

#

Considering that I also compute the psq features from scratch

#

Also would it be fine to run a fixed nodes test on fishtest later (assuming I get also a halfkav2hm -> 256 net)

#

Or should we wait for more work on training side

rocky vigil Mar 5, 2025, 5:50 PM

#

rocky vigil Got a HalfKAv2hm net

@round stone do you have a single layer (HalfKAv2hm ->256) net (bullet format) that I can compare with for fixed nodes?

#

@formal smelt is the deduplication strategy to always take the lower index feature of a pair

round stone Mar 5, 2025, 5:59 PM

#

There’s an L1-256 multilayer net that’s reasonably strong in stockfish nnue format, used by lichess, that i can find later

rocky vigil Mar 5, 2025, 6:00 PM

#

Ok yeah that would work

round stone Mar 5, 2025, 6:00 PM

#

Alright i’ll find it later, afk now

rocky vigil Mar 5, 2025, 6:02 PM

#

When you do that can you also remove the “bulletbullet” padding at the end of skip-bm, rename it to nn-98b68b5a9455.nnue and upload it to fishtest

round stone Mar 5, 2025, 6:03 PM

#

You mean for future bullet nets? The L1-256 i was going to find later is already uploaded on fishtest somewhere

rocky vigil Mar 5, 2025, 6:03 PM

#

Yeah

#

To get it on fishtest

#

So maybe I can start a fixed nodes test

#

With more compute than just my laptop

round stone Mar 5, 2025, 6:04 PM

#

Sure, you got inference working with the arch of those nets?

rocky vigil Mar 5, 2025, 6:04 PM

#

round stone You mean for future bullet nets? The L1-256 i was going to find later is already...

Yeah future bullet nets need the padding removed because the parser expects eof after reading all the weights
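
The "expects eof after reading all the weights" behaviour can be illustrated with a small sketch. `read_exact` is a hypothetical helper, not the Stockfish parser; it only shows why trailing bytes such as padding make an otherwise valid file fail to load.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Reads exactly `count` bytes of "weights", then requires end-of-file.
// Returns false if the stream is too short or if anything trails the weights.
bool read_exact(std::istream& stream, std::size_t count, std::vector<char>& out) {
    out.resize(count);
    stream.read(out.data(), static_cast<std::streamsize>(count));
    if (!stream)
        return false;  // fewer than `count` bytes were available
    // Any remaining byte (e.g. padding appended by the trainer) is an error.
    return stream.peek() == std::istream::traits_type::eof();
}
```

A file with the right weights plus trailing padding fails the final check, so trimming the padding before upload is the simplest workaround on the file side.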

rocky vigil Mar 5, 2025, 6:04 PM

#

round stone Sure, you got inference working with the arch of those nets?

I have inference working with these nets yes

round stone Mar 5, 2025, 6:05 PM

#

Alright np

rocky vigil Mar 5, 2025, 6:05 PM

#

Non-ue

rocky vigil Mar 5, 2025, 6:05 PM

#

rocky vigil Speedtest 510398 for (non-ue threats->256) vs 1246511 for master

See here

round stone Mar 5, 2025, 6:05 PM

#

Ok i can upload those later too. Feel free to upload too if you want to test sooner

formal smelt Mar 5, 2025, 6:05 PM

#

rocky vigil Yeah future bullet nets need the padding removed because the parser expects eof ...

once output buckets are added this wont be an issue anyway

rocky vigil Mar 5, 2025, 6:06 PM

#

Wait I claim the (threat-256)-1x8 still has padding

rocky vigil Mar 5, 2025, 6:07 PM

#

round stone Ok i can upload those later too. Feel free to upload too if you want to test soo...

Doing this on my phone will be quite a pain so I’ll just wait

round stone Mar 5, 2025, 6:07 PM

#

rocky vigil Also would it be fine to run a fixed nodes test on fishtest later (assuming I ge...

It’s fine to test. I’d say testing sooner than later is good

rocky vigil Mar 5, 2025, 6:11 PM

#

Ok yeah I’ll just wait for the net(s) to be found

#

And then I can set up a test

foggy wind Mar 5, 2025, 6:19 PM

#

This should be the small network that Lichess uses: https://tests.stockfishchess.org/api/nn/nn-4fd273888b72.nnue

twilit oriole Mar 5, 2025, 6:21 PM

#

rocky vigil Also would it be fine to run a fixed nodes test on fishtest later (assuming I ge...

Fishtest doesn't support fixed nodes. You can post the stuff here and I can run. Or I think Stockfish works on OB also

foggy wind Mar 5, 2025, 6:25 PM

#

foggy wind This should be the small network that Lichess uses: https://tests.stockfishchess...

https://github.com/hi-ogawa/Stockfish/commit/dc8e726a4e5ea74ab2b8354f82c03878117a0819

rocky vigil Mar 5, 2025, 6:48 PM

#

twilit oriole Fishtest doesn't support fixed nodes. You can post the stuff here and I can run....

Uh ok

#

https://github.com/sscg13/Stockfish/tree/threat-inputs (you might get some warnings on unused parameters when compiling, ignore those)

rocky vigil Mar 5, 2025, 6:49 PM

#

rocky vigil When you do that can you also remove the “bulletbullet” padding at the end of sk...

You also need to do this

#

I don’t have a branch for base halfkav2hm yet though

#

Might need to wait a couple hours for that

round stone Mar 5, 2025, 6:51 PM

#

foggy wind This should be the small network that Lichess uses: https://tests.stockfishchess...

no, there's a newer one that's L1-256

#

this one is L1-256. forgot if there's an even newer one
https://tests.stockfishchess.org/tests/view/64b6b6abdc56e1650abab4e8
https://tests.stockfishchess.org/api/nn/nn-ecb35f70ff2a.nnue

round stone Mar 5, 2025, 7:06 PM

#

rocky vigil When you do that can you also remove the “bulletbullet” padding at the end of sk...

padding trimmed, uploaded here:
https://tests.stockfishchess.org/api/nn/nn-98b68b5a9455.nnue

rocky vigil Mar 5, 2025, 7:38 PM

#

Yes I will not be back at computer for an hour and a half

rocky vigil Mar 5, 2025, 7:38 PM

#

rocky vigil <https://github.com/sscg13/Stockfish/tree/threat-inputs> (you might get some war...

@twilit oriole let me know if you are able to compile

#

(With 98b68b5a9455)

twilit oriole Mar 5, 2025, 7:39 PM

#

Oh I can't test till tomorrow evening. You should be able to put it on one of the OB instances if you need a fixed nodes I think

rocky vigil Mar 5, 2025, 7:40 PM

#

Ah I see

#

Doesn’t OB need additional modifications

twilit oriole Mar 5, 2025, 7:43 PM

#

Don't think so

rocky vigil Mar 5, 2025, 7:44 PM

#

Oh right it works bc auto download net

#

So you don’t need to do any makefile shenanigans

round stone Mar 5, 2025, 8:56 PM

#

early fixed nodes results:

Results of ./sscg13-sf/src/stockfish vs ./Stockfish-256/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 42.92 +/- 8.13, nElo: 60.53 +/- 11.34
LOS: 100.00 %, DrawRatio: 42.01 %, PairsRatio: 1.79
Games: 3604, Wins: 1472, Losses: 1029, Draws: 1103, Points: 2023.5 (56.15 %)
Ptnml(0-2): [69, 306, 757, 453, 217], WL/DD Ratio: 3.40

#

Results of ./sscg13-sf/src/stockfish vs ./Stockfish/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -159.73 +/- 6.80, nElo: -260.30 +/- 9.65
LOS: 0.00 %, DrawRatio: 22.95 %, PairsRatio: 0.07
Games: 4976, Wins: 561, Losses: 2700, Draws: 1715, Points: 1418.5 (28.51 %)
Ptnml(0-2): [503, 1282, 571, 115, 17], WL/DD Ratio: 2.59

#

25k nodes per move: https://github.com/sscg13/Stockfish/tree/threat-inputs

about +40 vs. L1-256 as main net, smallnet disabled
about -160 vs. master
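
The Elo figures in these results are consistent with the usual logistic conversion from score fraction. Assuming that is the formula the match runner uses (the chat does not say), a sketch with the hypothetical helper `elo_from_score`:

```cpp
#include <cmath>

// Logistic Elo difference implied by a score fraction p, with 0 < p < 1.
// p = (wins + draws / 2) / games, i.e. Points / Games in the output above.
inline double elo_from_score(double p) {
    return -400.0 * std::log10(1.0 / p - 1.0);
}
```

Plugging in 2023.5 / 3604 gives roughly +42.9, and 1418.5 / 4976 gives roughly -159.7, in line with the reported values.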

rocky vigil Mar 5, 2025, 9:04 PM

#

Ok not bad for single layer

#

What is the approximate speed difference just curious?

#

I estimate that full threats are roughly the same speed (maybe even slightly faster) compared to simplified so this is promising

round stone Mar 5, 2025, 9:08 PM

#

Result of  10 runs
==================
base (...rc/stockfish) =     238162  +/- 123
test (...rc/stockfish) =    2636398  +/- 8043
diff                   =   +2398236  +/- 7955

speedup        = +10.0698
P(speedup > 0) =  1.0000

#

non-UE threats less than 1/10 the speed of L1-256
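
The `speedup` line in this output appears to be the relative difference, (test - base) / base, so +10.07 means the test binary runs at about 11x the base speed. A tiny sketch of that arithmetic, with `relative_speedup` as a hypothetical helper name:

```cpp
// Relative speedup of `test_nps` over `base_nps`:
// 0.0 means equal speed, 1.0 means twice as fast.
inline double relative_speedup(double base_nps, double test_nps) {
    return (test_nps - base_nps) / base_nps;
}
```

For base = 238162 and test = 2636398 this gives about 10.07, hence "less than 1/10 the speed".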

rocky vigil Mar 5, 2025, 9:11 PM

#

Yeah ok

#

non-ue doesn’t ue the psq inputs either

round stone Mar 5, 2025, 9:11 PM

#

Result of  10 runs
==================
base (...rc/stockfish) =     238051  +/- 159
test (...rc/stockfish) =    1114922  +/- 4646
diff                   =    +876871  +/- 4576

speedup        = +3.6835
P(speedup > 0) =  1.0000

this is speed vs. master

rocky vigil Mar 5, 2025, 9:12 PM

#

What positions primarily consist of

#

In this test do you know

#

I think ue can be anywhere between 4 - 16x faster

#

Depending on the stage of the game

round stone Mar 5, 2025, 9:13 PM

#

these are speeds based on whatever is in the stockfish bench position list

rocky vigil Mar 5, 2025, 9:13 PM

#

Ah ok

round stone Mar 5, 2025, 9:13 PM

#

https://github.com/official-stockfish/Stockfish/blob/master/src/benchmark.cpp

rocky vigil Mar 5, 2025, 9:14 PM

#

Yeah mostly midgame

#

Then probably over 10x speedup from good ue

#

And properly optimized vector operations

#

If you have time I’ll also try and implement full threat inputs later

#

According to the fixed montytrain

formal smelt Mar 5, 2025, 9:18 PM

#

rocky vigil Then probably over 10x speedup from good ue

hm

#

i think that might be a bit optimistic

rocky vigil Mar 5, 2025, 9:20 PM

#

Well you are going from ~70 avg features processed to ~8 in the midgame

#

And you are also going from compiler autovec to proper SIMD kernels

formal smelt Mar 5, 2025, 9:21 PM

#

the compiler autovec for the updates alone is going to be basically perfect

#

its addition/subtraction in a loop
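
To make the "addition/subtraction in a loop" point concrete, here is a minimal sketch of a full refresh versus an incremental accumulator update. The names (`Accumulator`, `Weights`, `refresh`, `update`) and the layout are assumptions for illustration, not the code in the branch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t L1 = 256;  // hidden size discussed in the chat
using Accumulator = std::array<std::int16_t, L1>;
using Weights     = std::vector<Accumulator>;  // one row of L1 weights per feature

// Full refresh: start from the bias and add the row of every active feature.
Accumulator refresh(const Weights& w, const Accumulator& bias,
                    const std::vector<std::size_t>& active) {
    Accumulator acc = bias;
    for (std::size_t f : active)
        for (std::size_t i = 0; i < L1; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + w[f][i]);
    return acc;
}

// Incremental update: only touch rows of features that changed.
void update(Accumulator& acc, const Weights& w,
            const std::vector<std::size_t>& added,
            const std::vector<std::size_t>& removed) {
    for (std::size_t f : removed)
        for (std::size_t i = 0; i < L1; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] - w[f][i]);
    for (std::size_t f : added)
        for (std::size_t i = 0; i < L1; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + w[f][i]);
}
```

The inner loops are plain element-wise adds and subtracts over contiguous arrays, which is the kind of code compilers usually vectorize well; the cost then scales with the number of rows touched (the ~70 vs ~8 figure above).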

rocky vigil Mar 5, 2025, 9:21 PM

#

Hmm I see

#

Btw how expensive do you think looping through all the pieces to get all active threats is

#

Compared with an accumulator update

formal smelt Mar 5, 2025, 9:22 PM

#

why dont you just time how long it currently takes to calculate all the indices

#

and compare it to the average time per accumulator update
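
A standalone timing harness along the lines suggested here could look like the following sketch. `average_ns` is a hypothetical helper; the callable passed in would wrap the index calculation or the accumulator update being compared.

```cpp
#include <cassert>
#include <chrono>

// Runs `fn` `iterations` times and returns the mean wall-clock time per call
// in nanoseconds. The work inside `fn` should have an observable side effect,
// otherwise the compiler may optimize it away.
template <typename F>
double average_ns(F&& fn, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Calling it once with the index computation and once with a single accumulator update gives the two averages to compare.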

rocky vigil Mar 5, 2025, 9:23 PM

#

formal smelt why dont you just time how long it currently takes to calculate all the indices

Touching anything in sf code takes a lot of effort hmm

rocky vigil Mar 5, 2025, 9:45 PM

#

Ok writing difference of vectors of size ~ 800k elements (of which ~750k shared) is 2-3 msec

#

Meaning that for vectors of size ~80 the time is negligible

rocky vigil Mar 5, 2025, 11:23 PM

#

@formal smelt a couple questions regarding full inputs

1. Pawn doesn’t distinguish between the exact type of piece it attacks, only enemy/friend? Why is this?
2. MAP[] appears to be color insensitive, I assume I’m missing something?
3. pawn->bishop, pawn->queen, pawn->king, bishop->queen, rook->queen, king->queen are the features excluded by deduplication right? And then pawn->enemy pawn or piece->piece is by whether from < to? But what about the cases of King->Rook and King->Bishop that are duplicated? I don’t see the code currently handling that

#

It is relatively simple for me to modify simplified input code to process full inputs after I know all the details

formal smelt Mar 6, 2025, 12:34 AM

#

rocky vigil @formal smelt a couple questions regarding full inputs 1. Pawn doesn’t d...

1. It does distinguish, look at the target, enemy bool is just used to cut out some inputs

#

2. The only case color would matter is for pawns and you can see that gets handled separately

#

3. You can see those are in fact handled in map_king_threat

#

https://github.com/official-monty/montytrain/blob/threat-inputs-nnue-fixed/value/src/threats.rs#L132

rocky vigil Mar 6, 2025, 1:09 AM

#

https://github.com/official-monty/montytrain/blob/threat-inputs-nnue-fixed/value/src/threats.rs#L33

#

what

rocky vigil Mar 6, 2025, 1:10 AM

#

formal smelt 1. It does distinguish, look at the target, enemy bool is just used to cut out s...

let threat = offsets::PAWN + usize::from(enemy) * indices::PAWN + (src / 8 - 1) * 14 + attack; it doesn't look like target is used to me

formal smelt Mar 6, 2025, 1:11 AM

#

Ah bruh

#

I copy pasted the fix from simplified threat inputs

rocky vigil Mar 6, 2025, 1:53 AM

#

twilit oriole Fishtest doesn't support fixed nodes. You can post the stuff here and I can run....

actually I think node tm exists though and should be good enough

#

or is that only for spsa.

#

actually I think it works but will be cursed

lofty cedar Mar 6, 2025, 2:16 AM

#

round stone 25k nodes per move: https://github.com/sscg13/Stockfish/tree/threat-inputs - abo...

So, not working?

#

-160 against master takes a lot to overcome.

rocky vigil Mar 6, 2025, 2:20 AM

#

stockfish expects [i16 L1 bias LEB128] [i16 L1 weights LEB128] [i32 L1 PSQT LEB128] [i16 L2 bias little-endian bucket 1] [i16 L2 weight little-endian bucket 1] [i16 L3 bias little-endian bucket 1] [i16 L3 weight little-endian bucket 1] ... (bucket 2) (bucket 3) ... (bucket 8)
but there is a thing where the input sizes for L2, L3, L4 are padded to a multiple of 32
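
For reference, signed LEB128 and the "pad to a multiple of 32" rule mentioned here can be sketched as below. This is the standard signed LEB128 scheme with hypothetical helper names (`leb128_encode`, `leb128_decode`, `pad32`); whether Stockfish's on-disk format adds anything around the compressed blocks (such as a marker string) is not covered and should be checked against the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the signed LEB128 encoding of `value` to `out`.
void leb128_encode(std::int32_t value, std::vector<std::uint8_t>& out) {
    bool more = true;
    while (more) {
        std::uint8_t byte = static_cast<std::uint8_t>(value & 0x7f);
        value >>= 7;  // arithmetic shift keeps the sign
        // Stop once the remaining bits are just sign extension of bit 6.
        if ((value == 0 && !(byte & 0x40)) || (value == -1 && (byte & 0x40)))
            more = false;
        else
            byte |= 0x80;  // continuation bit
        out.push_back(byte);
    }
}

// Decodes one signed LEB128 value starting at `pos` and advances `pos`.
// Assumes well-formed input of at most 5 bytes per value.
std::int32_t leb128_decode(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint32_t result = 0;
    unsigned shift = 0;
    std::uint8_t byte = 0;
    do {
        assert(pos < in.size() && shift < 32);
        byte = in[pos++];
        result |= static_cast<std::uint32_t>(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    if (shift < 32 && (byte & 0x40))
        result |= ~std::uint32_t{0} << shift;  // sign-extend
    return static_cast<std::int32_t>(result);
}

// Rounds a layer input size up to the next multiple of 32.
constexpr std::size_t pad32(std::size_t n) { return (n + 31) / 32 * 32; }
```

Small values take one byte each, which is why LEB128 shrinks feature-transformer weights that cluster near zero.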

rocky vigil Mar 6, 2025, 2:20 AM

#

lofty cedar So, not working?

keep in mind this is L1=256 and master is L1=3072

lofty cedar Mar 6, 2025, 2:21 AM

#

Oh, well, I see.

#

Also, would the later layer's bucket still be indexed the same way?

#

Or would we go directly to the indexing scheme where most buckets got 3 pieces count?

rocky vigil Mar 6, 2025, 2:23 AM

#

this might have to be tested

lofty cedar Mar 6, 2025, 2:24 AM

#

rocky vigil this might have to be tested

This is a chance to test our data pipeline and so on.

#

We're resetting the board.

#

If we wait, there would be more and more sunk costs into suboptimal data training and so on.

rocky vigil Mar 6, 2025, 2:25 AM

#

hmm

#

I'm not too familiar with training side so

lofty cedar Mar 6, 2025, 2:27 AM

#

Here was the recommended idea.

📎 SF-TrainingData.pdf

rocky vigil Mar 6, 2025, 2:27 AM

#

lofty cedar -160 against master takes a lot to overcome.

I'm actually pretty excited for this

#

the net is only single layer

#

no multilayer yet

#

and there are still other things we can test

#

like full inputs vs simplified

#

ultimately I think it comes down to how fast we can make UE

#

rn I compute entirely from scratch (including psq inputs, so in midgame there are an average of like 50-60 features per side)

rocky vigil Mar 6, 2025, 2:32 AM

#

rocky vigil stockfish expects [i16 L1 bias LEB128] [i16 L1 weights LEB128] [i32 L1 PSQT LEB1...

@formal smelt how feasible do you think this would be with current bullet

formal smelt Mar 6, 2025, 2:34 AM

#

rocky vigil @formal smelt how feasible do you think this would be with current bulle...

Writing the net to a specific file format has nothing to do with bullet

#

You can use a callback to run whatever code that writes it like that when you’d usually be saving a checkpoint

rocky vigil Mar 6, 2025, 2:35 AM

#

if the LEB128 is a problem I can hack sf code to load it in little-endian and write it in LEB128

formal smelt Mar 6, 2025, 2:35 AM

#

I’m sure there is a leb128 crate you can use

rocky vigil Mar 6, 2025, 2:36 AM

#

yeah ig linrock can go for a multilayer simplified net

#

using bullet

#

when he has time

#

and I'll figure out the formatting and whatever

formal smelt Mar 6, 2025, 2:36 AM

#

https://github.com/official-monty/montytrain/blob/main/policy/src/main.rs#L84 you can see using a callback here

rocky vigil Mar 6, 2025, 2:38 AM

#

should I attempt to use this net at all or should I wait for changes to get pushed

rocky vigil Mar 6, 2025, 3:10 AM

#

@frosty imp how do you think I could try and have accumulator caches go by ply, instead of by ksq-color

#

sort of like

#

to evaluate this position

#

we compute its active features

#

then we look backwards to find the latest computed ply

#

and try to ue the difference from that

frosty imp Mar 6, 2025, 3:34 AM

#

rocky vigil @frosty imp how do you think I could try and have accumulator caches g...

isn't that the same as efficient updates?

#

imo accumulator caches are not necessary for this kind of inputs

#

since there is no need for full refreshes

rocky vigil Mar 6, 2025, 3:35 AM

#

frosty imp isn't that the same as efficient updates?

ok sure I can just have make_move

#

compute all attacks

rocky vigil Mar 6, 2025, 3:37 AM

#

frosty imp imo accumulator caches are not necessary for this kind of inputs

but what kind of function can I construct then

frosty imp Mar 6, 2025, 3:38 AM

#

rocky vigil but what kind of function can I construct then

wdym

#

just use update_accumulator_incremental for everything

rocky vigil Mar 6, 2025, 3:39 AM

#

uh

#

then I need to pass a pos

#

into it

#

instead of whatever is currently passed

frosty imp Mar 6, 2025, 3:39 AM

#

i think you can just extend dirtypiece

#

to contain the threat indices too

#

or just pre-calculate every changed index on makemove

rocky vigil Mar 6, 2025, 3:39 AM

#

well my ue plan was to just compute every active threat index and then remove the ones that are already there
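
The plan described here (compute every active index, then drop the ones already present) amounts to a set difference in both directions. A minimal sketch, assuming both index lists are kept sorted and duplicate-free; `FeatureDelta` and `diff_features` are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

struct FeatureDelta {
    std::vector<std::uint32_t> added;    // in curr but not in prev
    std::vector<std::uint32_t> removed;  // in prev but not in curr
};

// Both inputs must be sorted ascending without duplicates.
FeatureDelta diff_features(const std::vector<std::uint32_t>& prev,
                           const std::vector<std::uint32_t>& curr) {
    assert(std::is_sorted(prev.begin(), prev.end()));
    assert(std::is_sorted(curr.begin(), curr.end()));
    FeatureDelta d;
    std::set_difference(curr.begin(), curr.end(), prev.begin(), prev.end(),
                        std::back_inserter(d.added));
    std::set_difference(prev.begin(), prev.end(), curr.begin(), curr.end(),
                        std::back_inserter(d.removed));
    return d;
}
```

The `added` and `removed` lists are then exactly the rows to add to and subtract from the accumulator. A real implementation would likely use fixed-capacity storage instead of `std::vector` to avoid heap allocation in the search.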

frosty imp Mar 6, 2025, 3:40 AM

#

oh ok

rocky vigil Mar 6, 2025, 3:40 AM

#

this is simplest

#

for now

frosty imp Mar 6, 2025, 3:40 AM

#

then just store all active indices in dirtyPiece

#

maybe rename it to something else

rocky vigil Mar 6, 2025, 3:41 AM

#

what is stateinfo btw

frosty imp Mar 6, 2025, 3:41 AM

#

it's the copy-make part of the board object

#

sf does partial copy-make

rocky vigil Mar 6, 2025, 3:41 AM

#

ok

#

also I have to ue both psq and threats

#

so uh

#

I guess I can just define a new structure

#

along with dirtypiece

rocky vigil Mar 6, 2025, 3:45 AM

#

rocky vigil I guess I can just define a new structure

actually this is unnecessary I can maybe just directly add a vector of threats in the stateinfo itself

frosty imp Mar 6, 2025, 3:45 AM

#

maybe not a vector

#

SF creates a new stateinfo object in the beginning of every search call

#

so gotta allocate on the stack

rocky vigil Mar 6, 2025, 3:46 AM

#

huh

#

ok

#

array of length 128 works as well

frosty imp Mar 6, 2025, 3:46 AM

#

use ValueList

#

defined in misc.h i think

rocky vigil Mar 6, 2025, 3:47 AM

#

bruh valuelist is like exactly what I've been using in nnue code

frosty imp Mar 6, 2025, 3:47 AM

#

ah ok

rocky vigil Mar 6, 2025, 3:48 AM

#

valuelist is basically like a vector right

frosty imp Mar 6, 2025, 3:48 AM

#

yeah

#

fixed-capacity vector
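
As an illustration of what "fixed-capacity vector" means here, the sketch below shows the general shape of such a container. `FixedList` is a hypothetical stand-in written for this example, not a copy of Stockfish's `ValueList`, whose exact interface should be checked in the source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A vector-like list with inline storage: no heap allocation, capacity fixed
// at compile time, so it can live inside a stack-allocated struct.
template <typename T, std::size_t MaxSize>
class FixedList {
public:
    std::size_t size() const { return size_; }
    void clear() { size_ = 0; }
    void push_back(const T& value) {
        assert(size_ < MaxSize);  // caller must respect the capacity
        values_[size_++] = value;
    }
    const T& operator[](std::size_t i) const {
        assert(i < size_);
        return values_[i];
    }
    const T* begin() const { return values_.data(); }
    const T* end() const { return values_.data() + size_; }

private:
    std::array<T, MaxSize> values_{};
    std::size_t size_ = 0;
};
```

Something like `FixedList<int, 128>` covers the "array of length 128" idea above while still tracking how many entries are actually in use.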

rocky vigil Mar 6, 2025, 3:49 AM

#

also the ue accumulator updates will probably just be for loops

#

jw says compiler autovec is essentially perfect for simple arithmetic like that anyways

#

someone else can rewrite the simd tilings and other things if this actually works

frosty imp Mar 6, 2025, 3:50 AM

#

sf nnue code is due for a refactor anyways

rocky vigil Mar 6, 2025, 3:55 AM

#

btw I am open to other ue ideas more advanced than just "compute the threat difference from the last state and apply it (to both colors)"

lofty cedar Mar 6, 2025, 5:31 AM

#

frosty imp sf nnue code is due for a refactor anyways

What refactor? It looks clean to me.

#

Granted, some code is hard to read, but that's out of necessity rather than because of bad engineering.

naive comet Mar 6, 2025, 5:32 AM

#

the NNUE code is overly abstracted

lofty cedar Mar 6, 2025, 5:33 AM

#

Hmm? What's the issue?

#

We need abstraction to contain the monstrosity that is the high-performance code.

candid ivy Mar 6, 2025, 5:33 AM

#

Again stop talking nonsense

frosty imp Mar 6, 2025, 5:35 AM

#

lofty cedar Granted, some code is hard to read, but that's out of necessity rather than be...

it is definitely not necessary

#

you can look at countless other engines for cleaner NNUE code

lofty cedar Mar 6, 2025, 5:36 AM

#

I understand that high-performance code can sometimes be hard to read because well, the logic had to be complicated. You had to do some fancy tricks to go fast. There is no other way. For a performance-critical program, this is sometimes needed.
So, the best we can do is to abstract away that unreadable code into easy-to-understand functions.

But you're saying there are better ways?

frosty imp Mar 6, 2025, 5:36 AM

#

not all abstractions are equal

#

we need abstractions != we need bad abstractions

lofty cedar Mar 6, 2025, 5:38 AM

#

I agree that unreadable code is usually bad, but sometimes it's a necessary evil, especially in performance-critical software. What I would do when there is no other option is wrap it in "untouchable" functions.

frosty imp Mar 6, 2025, 5:39 AM

#

well that hypothetical situation doesn't apply here

#

so maybe we can discuss that in another place

lofty cedar Mar 6, 2025, 5:40 AM

#

frosty imp you can look at countless other engines for cleaner NNUE code

I thought the other engines managed to do so because they were either

1) less optimized
2) not supporting as many architectures as Stockfish does.

frosty imp Mar 6, 2025, 5:41 AM

#

lofty cedar I thought the other engines managed to do so because they were either 1) less op...

well number 1 doesn't really describe what's going on

#

you can argue for no. 2, but parts that aren't architecture-dependent are not clean either

lofty cedar Mar 6, 2025, 5:45 AM

#

frosty imp well that hypothetical situation doesn't apply here

Is it actually possible to write readable high-performance code under hardcore performance optimization?

Sometimes it can happen, but you could end up in scenarios like comments longer than the code itself.

#

Though I guess that's a red herring. The NNUE logic isn't some sort of nightmarish code. It's just a straightforward SIMD code after all.

frosty imp Mar 6, 2025, 5:51 AM

#

NNUE inference is not very complicated

naive comet Mar 6, 2025, 5:51 AM

#

lofty cedar I understand that high-performance code can sometimes be hard to read because we...

there are literally 2 optimisations on sf nnue code that I know of but cannot implement because of the current state of code

frosty imp Mar 6, 2025, 5:52 AM

#

2?

naive comet Mar 6, 2025, 5:52 AM

#

yes

#

this has strayed off the topic of UE threat inputs

#

we should move off

lofty cedar Mar 6, 2025, 5:52 AM

#

Sure.

round stone Mar 6, 2025, 7:24 AM

#

rocky vigil yeah ig linrock can go for a multilayer simplified net

I’ll train multilayer bullet format nets later, along with fixed full threat input nets

#

leb128 only matters if we have something that can beat the current master

rocky vigil Mar 6, 2025, 7:27 AM

#

ok

#

well the issue is a full threat net

#

might be like 160 mb

#

for 80k -> 1024

round stone Mar 6, 2025, 7:28 AM

#

Getting any kind of UE working will be important for measuring baselines at TC. Currently still don’t know how far we actually are from master

#

Yea 160mb is huge, but fine for testing

#

Still unclear whether full threat 1024 UE will be strong enough vs. master at TC

rocky vigil Mar 6, 2025, 7:37 AM

#

yeah I hope to get UE working tmrw

twilit oriole Mar 6, 2025, 8:09 AM

#

formal smelt I copy pasted the fix from simplified threat inputs

So does the branch work? Training is ready to start once it is confirmed

twilit oriole Mar 6, 2025, 8:19 AM

#

round stone Getting any kind of UE working will be important for measuring baselines at TC. ...

It will be an important test but keep in mind it is still going to be missing some large optimisations.

The full threat input net is extremely sparse, so permuting it will yield 20+ Elo at STC

Shared net weights (e.g mmap) probably also gains 20+ at STC single threaded because memory accesses are less predictable with threat inputs than king buckets (need a larger portion of the net in cache)

formal smelt Mar 6, 2025, 2:17 PM

#

rocky vigil <https://github.com/official-monty/montytrain/blob/threat-inputs-nnue-fixed/valu...

clearly i was too tired last night

#

that is indeed wrong lol

#

at least i think it is

#

@twilit oriole I have pushed fixes

rocky vigil Mar 6, 2025, 3:56 PM

#

formal smelt 3. You can see those are in fact handled in map_king_threat

I still don’t see this btw

formal smelt Mar 6, 2025, 3:59 PM

#

rocky vigil I still don’t see this btw

https://github.com/official-monty/montytrain/blob/threat-inputs-nnue-fixed/value/src/threats.rs#L133

#

oh right

#

i mean its completely negligible

twilit oriole Mar 6, 2025, 4:01 PM

#

its added complexity for a tiny gain. but sf is about tiny gains so sure you can add them if you want lol

#

ah it is mentioned

#

hmm actually. duplicate inputs here would only occur in check. in which case eval is skipped anyways?

#

hm for enemy at least. i guess there is still friendly

#

and i guess it is super common for a friendly rook to be next to the king lol

rocky vigil Mar 6, 2025, 5:04 PM

#

Yeah king-rook would save a bit bc castling

#

It’s fine

#

If this case doesn’t have deduplication I can also make a small change to the indexing

#

Changing the indexing is pretty easy on my side

rocky vigil Mar 6, 2025, 7:09 PM

#

@frosty imp I cannot take only stateinfo into update_accumulator_incremental because I need access to the entire board

#

So how should I format this

#

Unless stateinfo will also store 8 bitboards for the pieces

rocky vigil Mar 6, 2025, 7:54 PM

#

Ok I think I might just add 8 bitboards to stateinfo

#

Gah code rewrite

rocky vigil Mar 6, 2025, 8:47 PM

#

Aight appending active features only needs color bb, piece bb, and piece array now

rocky vigil Mar 6, 2025, 9:20 PM

#

Untested incremental update function now up at my branch

rocky vigil Mar 6, 2025, 10:09 PM

#

Lmao ue is up at my branch but only ~2x faster

#

I probably messed smth up

#

Yeah can someone test

#

And also maybe give suggestions on my hacked ue

rocky vigil Mar 6, 2025, 11:22 PM

#

I think I am having some fundamental ue impl issue here

#

Not overhead

#

Because if my debug info is to be trusted

#

I have like 20 updates/color/position

#

After 1M nodes from startpos

#

This is approximately 3x better than no ue

#

But still seems very high

rocky vigil Mar 6, 2025, 11:26 PM

#

rocky vigil I think I am having some fundamental ue impl issue here

Also in very short searches speed is up to 5x faster

#

Before declining down to 3

#

@frosty imp hints?

frosty imp Mar 6, 2025, 11:29 PM

#

no idea

#

could you send diff

#

do some profiling?

rocky vigil Mar 6, 2025, 11:31 PM

#

frosty imp could you send diff

At my branch lol

rocky vigil Mar 6, 2025, 11:31 PM

#

frosty imp do some profiling?

How to profile

frosty imp Mar 6, 2025, 11:31 PM

#

just find a good profiler and follow the tutorials

rocky vigil Mar 6, 2025, 11:34 PM

#

rocky vigil I have like 20 updates/color/position

But this is abnormal no? It should be like less than 10

frosty imp Mar 6, 2025, 11:38 PM

#

does the ue output match the eval starting from scratch

rocky vigil Mar 6, 2025, 11:41 PM

#

Yes

#

Same bench

#

& everything

formal smelt Mar 6, 2025, 11:42 PM

#

a 2x speedup is not unreasonable from UE

rocky vigil Mar 6, 2025, 11:42 PM

#

What’s more concerning is why it declines from 5x to 3x

#

Over more time

formal smelt Mar 6, 2025, 11:43 PM

#

wdym?

#

like if you search longer?

rocky vigil Mar 6, 2025, 11:44 PM

#

Yes

formal smelt Mar 6, 2025, 11:45 PM

#

weird indeed

rocky vigil Mar 7, 2025, 12:06 AM

#

well might be getting trolled by laptop but there is still a noticeable slowdown

📎 message.txt

rocky vigil Mar 7, 2025, 12:10 AM

#

rocky vigil I think I am having some fundamental ue impl issue here

lmao I'm right
"update_accumulator_incremental" is never called

#

how on earth is it still faster than non-ue then

rocky vigil Mar 7, 2025, 12:18 AM

#

round stone ``` Result of 10 runs ================== base (...rc/stockfish) = 238051 +...

the speedup is real (I'm getting ue ~ 60% of master rn) but for the life of me I cannot figure out why update_accumulator_incremental is never called

#

I even have https://github.com/sscg13/Stockfish/blob/threat-inputs/src/nnue/nnue_feature_transformer.h#L126 here and this line is literally never triggered

frosty imp Mar 7, 2025, 12:24 AM

#

rocky vigil I even have <https://github.com/sscg13/Stockfish/blob/threat-inputs/src/nnue/nnu...

https://github.com/sscg13/Stockfish/blob/a2604d40b42b7f755cebe91ed5a41d2cd0ac30d9/src/nnue/nnue_feature_transformer.h#L216-L232

rocky vigil Mar 7, 2025, 12:25 AM

#

ok but what is the issue there

frosty imp Mar 7, 2025, 12:25 AM

#

maybe the update from scratch always got triggered?

#

like in the beginning nothing is computed

#

so i'd assume this always triggers

#

hmm yeah

rocky vigil Mar 7, 2025, 12:29 AM

#

yeah but after 10 million nodes

frosty imp Mar 7, 2025, 12:29 AM

#

i guess the fix is just to label the accumulator as computed

#

in update_accumulator_scratch
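
A minimal toy model of the effect being described, assuming a stripped-down `State` with only a `previous` link and a `computed` flag (both hypothetical stand-ins for the real `StateInfo` and accumulator). It only counts which path is taken; it is a sketch, not the branch's code.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for StateInfo: a link to the parent and a "computed" flag.
struct State {
    State* previous = nullptr;
    bool   computed = false;
};

struct Counters {
    int scratch     = 0;
    int incremental = 0;
};

// Same shape as the walk-back loop: search for a computed ancestor and
// fall back to a from-scratch refresh when the chain runs out.
inline void update(State* current, Counters& c, bool scratchSetsFlag) {
    if (current->computed)
        return;
    State* st = current;
    do {
        if (!st->previous) {
            ++c.scratch;
            if (scratchSetsFlag)
                current->computed = true;  // the suggested fix
            return;
        }
        st = st->previous;
    } while (!st->computed);
    ++c.incremental;
    current->computed = true;
}

// Evaluates a chain of `depth` states, root first, and reports the counters.
inline Counters run_chain(std::size_t depth, bool scratchSetsFlag) {
    assert(depth <= 64);
    State    states[64];
    Counters c;
    for (std::size_t i = 0; i < depth; ++i) {
        states[i].previous = i ? &states[i - 1] : nullptr;
        update(&states[i], c, scratchSetsFlag);
    }
    return c;
}
```

In this model, if the scratch path never sets the flag, no ancestor is ever marked computed, so every call falls through to scratch and the incremental path is never reached.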

rocky vigil Mar 7, 2025, 12:29 AM

#

I refuse to believe nothing is wrong

#

oh shoot

#

you may be right

rocky vigil Mar 7, 2025, 12:46 AM

#

bruh

#

using update_accumulator_incremental is actually slower

#

lemme do a little bit of optimize

#

is it possible to clear a valuelist

frosty imp Mar 7, 2025, 1:01 AM

#

yeah

#

write a function clear()

#

just set size to 0

rocky vigil Mar 7, 2025, 1:02 AM

#

bruh there's no prebuilt function

frosty imp Mar 7, 2025, 1:02 AM

#

pr it sabaping

rocky vigil Mar 7, 2025, 1:03 AM

#

frosty imp just set size to 0

wait but this doesn't erase the 1, 2, ... elements

frosty imp Mar 7, 2025, 1:03 AM

#

it doesn't

#

but you'd overwrite them anyway when pushing new elements

rocky vigil Mar 7, 2025, 1:04 AM

#

oh wait they don't matter

#

like we can just loop until size

#

and not further

rocky vigil Mar 7, 2025, 1:05 AM

#

rocky vigil like we can just loop until size

does the (for index : features) notation do this

frosty imp Mar 7, 2025, 1:05 AM

#

yeah

rocky vigil Mar 7, 2025, 1:05 AM

#

or will it also find the remaining elements as well

#

like if I push_back 1, 2, 3, 4, 5

#

then set size to 2

#

will the for loop only find the values 1, 2

frosty imp Mar 7, 2025, 1:05 AM

#

yeah

rocky vigil Mar 7, 2025, 1:05 AM

#

or all 5 values

frosty imp Mar 7, 2025, 1:05 AM

#

everything will work
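
A small sketch of why resetting the size is enough, assuming a simplified `ValueList` with a hypothetical `clear()` added. `sum_visible` and `demo_clear` are illustration-only helpers. A range-for loop only walks `[begin(), end())`, and `end()` is derived from the size, so stale slots past it are never visited.

```cpp
#include <cassert>
#include <cstddef>

// Simplified ValueList: a fixed array plus a logical size.
// clear() is the hypothetical addition under discussion.
template<typename T, std::size_t MaxSize>
class ValueList {
   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) {
        assert(size_ < MaxSize);
        values_[size_++] = value;
    }
    void     clear() { size_ = 0; }  // stale slots stay in memory but become unreachable
    const T* begin() const { return values_; }
    const T* end() const { return values_ + size_; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Sums whatever a range-for loop visits, i.e. only [begin(), end()).
template<typename T, std::size_t N>
T sum_visible(const ValueList<T, N>& list) {
    T total = 0;
    for (T v : list)
        total += v;
    return total;
}

// Push 1..5, clear, push two new values: only those two remain visible.
inline int demo_clear() {
    ValueList<int, 8> list;
    for (int v = 1; v <= 5; ++v)
        list.push_back(v);
    list.clear();
    list.push_back(10);
    list.push_back(20);
    return sum_visible(list);
}
```

The old values 3, 4, 5 are still sitting in the array after `clear()`, but nothing reads them until a later `push_back` overwrites them.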

rocky vigil Mar 7, 2025, 1:05 AM

#

yay

#

nvm I broke smth

#

https://github.com/sscg13/Stockfish/commit/ea86cca955d56c5c1520725caa441f1ac08f787d

#

smth in this is broken

#

i don't know how it's broken

rocky vigil Mar 7, 2025, 1:47 AM

#

i have no idea why this doesn't work

rocky vigil Mar 7, 2025, 2:07 AM

#

anyways I have been unable to get incremental update to be faster at all

#

so rn the "ue" optimization is probably literally accumulator reuse

#

i might be really borking because my statistics show that using my code, incremental vs scratch computation are ~ the same amount of compute

#

which I really don't believe

#

ok it might be that my write_difference isn't working

rocky vigil Mar 7, 2025, 3:09 AM

#

i actually need help

#

incremental doesn't work and I have no idea why

#

some comments:

#

(branch is https://github.com/sscg13/Stockfish/tree/threat-inputs once again)

#

the assertion that the threats stored in the computed stateinfo match the freshly computed ones passes

#

assertion that the features are sorted passes

#

different results are returned using dirtypiece vs write_difference of psq

#

well

#

it is very probable that append_active_psq and append_active_threats indeed compute the correct indices

#

given that I have used these functions to do non-ue

#

and the version which doesn't use update_accumulator_incremental is also sound

#

so the failure point is probably in write_difference

#

but idk how that is wrong
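
The branch's `write_difference` is not shown in this log. As a reference point for checking it, here is a hypothetical version built on `std::set_difference`, assuming both feature lists are sorted ascending and contain no duplicates. `Difference` and `feature_difference` are made-up names for this sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using IndexType = std::uint32_t;

struct Difference {
    std::vector<IndexType> removed;  // active before, inactive now
    std::vector<IndexType> added;    // inactive before, active now
};

// Hypothetical stand-in for write_difference.
// Precondition: both inputs are sorted ascending.
inline Difference feature_difference(const std::vector<IndexType>& before,
                                     const std::vector<IndexType>& after) {
    Difference d;
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::back_inserter(d.removed));
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::back_inserter(d.added));
    return d;
}
```

Running a hand-written difference and a reference like this on the same pair of lists, then asserting that the outputs agree, is one way to rule the difference step in or out.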

#

```
 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
|  0.00 |  0.00 |  0.00 |  0.00 |       |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
|  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
|  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
|  0.00 |  0.00 |  0.00 |  0.00 |       |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+

NNUE evaluation        +0.20 (white side)
Final evaluation       +0.26 (white side) [with scaled NNUE, ...]```

#

this is the other trademark of broken update_accumulator_incremental

#

gets the eval right but completely borks the piece value estimation

rocky vigil Mar 7, 2025, 3:51 AM

#

rocky vigil ``` NNUE derived piece values: +-------+-------+-------+-------+-------+-------...

this one actually persists, even if only using compute-from-scratch

#

which has the same bench

#

alright I'm going to wait for someone else to help me debug

#

if you really wanna try some "optimized non-ue" in the meanwhile

#

https://github.com/sscg13/Stockfish/commit/a2604d40b42b7f755cebe91ed5a41d2cd0ac30d9

#

^^also computes everything from scratch, but somehow is 2-3x faster

frosty imp Mar 7, 2025, 4:26 AM

#

what part of UE is wrong

#

does making, say, e2e4 on startpos give you the wrong result via ue?

rocky vigil Mar 7, 2025, 4:51 AM

#

stockfish code is so convoluted idk how to test this

frosty imp Mar 7, 2025, 5:10 AM

#

just make move on a pos object, then call evaluate

#

i think you can find some example on making move in the extend_pv function

rocky vigil Mar 7, 2025, 6:59 AM

#

rocky vigil ``` NNUE derived piece values: +-------+-------+-------+-------+-------+-------...

I think I understood the 0.00 phenomenon btw

#

if only removing a piece

#

it doesn't actually change the position or stateinfo

#

so the accumulator gets re-used

rocky vigil Mar 7, 2025, 7:45 AM

#

so far, with my current implementation, we have anywhere between 30 to 80 accumulator updates per node on average, depending on position

#

note that rn it's still just non-ue with accumulator re-use while I attempt to debug my incremental calculations

#

i think, if we can get this down to 20 avg.

#

then threat inputs should be quite good

rocky vigil Mar 7, 2025, 9:16 PM

#

Stockfish Threatnet (last working version) vs Stormphrax 6: 39 - 55 - 106

#

Very rough STC estimate

formal smelt Mar 7, 2025, 9:17 PM

#

what about against the smallnet

rocky vigil Mar 7, 2025, 9:17 PM

#

I can’t download it rn bc my school wifi blocks fishtest

#

Also the STC will suck even harder

#

Considering that I still haven’t gotten update_accumulator_incremental to work

rocky vigil Mar 7, 2025, 9:18 PM

#

rocky vigil `Stockfish Threatnet (last working version) vs Stormphrax 6: 39 - 55 - 106`

This version is using update_accumulator_scratch for everything

#

So it’s basically non-ue with accumulator reuse

#

Somehow still 2-3x faster than scratch computation at eval time

rocky vigil Mar 7, 2025, 9:19 PM

#

rocky vigil alright I'm going to wait for someone else to help me debug

I am still looking for help

#

I literally can’t see the issue

rocky vigil Mar 7, 2025, 11:13 PM

#

rocky vigil Very rough STC estimate

I estimate there is still close to 200 STC elo to be gained from speedup

rocky vigil Mar 9, 2025, 3:14 AM

#

average stockfish debug

daring wren Mar 9, 2025, 3:15 AM

#

engin

rocky vigil Mar 9, 2025, 3:39 AM

#

frosty imp just make move on a pos object, then call evaluate

i did this for startpos and move e2e4, and I found that update_accumulator_incremental was completely correct in this case

#

...

#

at least w.r.t. the difference calculation for added/removed features

#

@frosty imp
```c++
template<Color Perspective>
void update_accumulator(const Position& pos) {
    StateInfo* st = pos.state();
    if ((st->*accPtr).computed[Perspective])
        return;  // nothing to do

    // Look for a usable already computed accumulator of an earlier position.
    // Always try to do an incremental update as most accumulators will be reusable.
    do
    {
        if (!st->previous || st->previous->next != st)
        {
            // compute accumulator from scratch for this position
            update_accumulator_scratch<Perspective>(pos);
            /*
            if (st != pos.state())
                // when computing an accumulator from scratch we can use it to
                // efficiently compute the accumulator backwards, until we get to a king
                // move. We expect that we will need these accumulators later anyway, so
                // computing them now will save some work.
                update_accumulator_incremental<Perspective, BACKWARDS>(
                  pos.square<KING>(Perspective), st, pos.state());*/
            return;
        }
        st = st->previous;
    } while (!(st->*accPtr).computed[Perspective]);
    // Start from the oldest computed accumulator, update all the
    // accumulators up to the current position.
    update_accumulator_incremental<Perspective>(pos.square<KING>(Perspective), pos.state(), st);
}
```

#

would you know why this code, from a 100k node startpos search, would trigger update_accumulator_scratch only once, at the very beginning?

frosty imp Mar 9, 2025, 4:14 AM

#

you fixed the bug right

#

like the one with computed flag

rocky vigil Mar 9, 2025, 7:09 AM

#

frosty imp like the one with computed flag

yes?

#

both update_accumulator_scratch and update_accumulator_incremental

#

set computed = true at the end

frosty imp Mar 9, 2025, 7:16 AM

#

hmm

#

maybe try calling evaluate once on the root position before search?

rocky vigil Mar 9, 2025, 7:19 AM

#

uh

#

```c++
        else
        {
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = (computed->*accPtr).accumulation[Perspective][j];
            }
            acc_updates++;
            threat_loops += (int)removed.size();
            threat_loops += (int)added.size();
            // Difference calculation for the activated features
            for (auto index : added)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
                    (next->*accPtr).accumulation[Perspective][i] += weights[offset + i];
            }
            // Difference calculation for the deactivated features
            for (auto index : removed)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
                    (next->*accPtr).accumulation[Perspective][i] -= weights[offset + i];
            }
        }
```

#

there is an issue in this

#

piece of code
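
A self-contained toy version of the same idea, assuming a 4-wide accumulator, 6 input features, and made-up weights (none of these values come from a real net). It checks that applying added/removed columns to a parent accumulator gives the same result as a refresh from biases, and that this stops holding when the parent accumulator does not describe the expected feature set.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t Dims   = 4;  // toy accumulator width
constexpr std::size_t Inputs = 6;  // toy number of input features

using Accumulator = std::array<std::int32_t, Dims>;
using Weights     = std::array<std::int32_t, Dims * Inputs>;

// Deterministic toy weights (hypothetical values, one column of Dims per feature).
inline Weights make_weights() {
    Weights w{};
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] = static_cast<std::int32_t>(i * 7 % 11) - 5;
    return w;
}

// From-scratch refresh: biases plus one weight column per active feature.
inline Accumulator refresh(const std::vector<std::size_t>& active,
                           const Weights& w, const Accumulator& biases) {
    Accumulator acc = biases;
    for (std::size_t index : active)
        for (std::size_t i = 0; i < Dims; ++i)
            acc[i] += w[Dims * index + i];
    return acc;
}

// Incremental update: start from the parent, add and subtract the changed columns.
inline Accumulator apply_diff(Accumulator acc,
                              const std::vector<std::size_t>& added,
                              const std::vector<std::size_t>& removed,
                              const Weights& w) {
    for (std::size_t index : added)
        for (std::size_t i = 0; i < Dims; ++i)
            acc[i] += w[Dims * index + i];
    for (std::size_t index : removed)
        for (std::size_t i = 0; i < Dims; ++i)
            acc[i] -= w[Dims * index + i];
    return acc;
}

// Parent has features {0,2,3}; the move removes 2 and adds 5.
inline bool incremental_matches_scratch() {
    const Weights     w      = make_weights();
    const Accumulator biases = {1, 2, 3, 4};
    const Accumulator parent = refresh({0, 2, 3}, w, biases);
    return apply_diff(parent, {5}, {2}, w) == refresh({0, 3, 5}, w, biases);
}

// Same diff applied to a parent built from a different feature set: mismatch.
inline bool stale_parent_is_detected() {
    const Weights     w      = make_weights();
    const Accumulator biases = {1, 2, 3, 4};
    const Accumulator wrong  = refresh({1, 2, 3}, w, biases);
    return apply_diff(wrong, {5}, {2}, w) != refresh({0, 3, 5}, w, biases);
}
```

The update arithmetic itself is hard to get wrong; the result is only correct if the parent accumulator, `added`, and `removed` all describe the same position and perspective.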

frosty imp Mar 9, 2025, 7:21 AM

#

hmm I don't see anything wrong from a quick skim

rocky vigil Mar 9, 2025, 7:21 AM

#

yeah that's why this is so suspicious

frosty imp Mar 9, 2025, 7:22 AM

#

does the offset calculation match other parts of the code?

#

idr how it works

rocky vigil Mar 9, 2025, 7:23 AM

#

rocky vigil ```c++ else { for (IndexType j = 0; j < TransformedFeatureDi...

replacing the block with

else
        {
            acc_updates++;
            threat_loops += (int)newthreats.size();
            threat_loops += (int)newpsq.size();
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : newpsq)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (auto index : newthreats)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
        }

works but defeats the purpose of this function

#

i am equally perplexed

#

the (flawed) version of incremental halves avg. accumulator updates per node

rocky vigil Mar 9, 2025, 7:43 AM

#

rocky vigil the (flawed) version of incremental halves avg. accumulator updates per node

but my guess is also that crippling the search makes it search moves that are, on average, heavier in threat updates

#

like I think incremental on every move is not optimal with threat inputs

#

because e.g. if you have a long series of captures

#

you do not really need to evaluate how the threats change with each intermediate capture

#

just skip all the way to the end

rocky vigil Mar 9, 2025, 7:49 AM

#

frosty imp hmm I don't see anything wrong from a quick skim

you may also assume that removed and added are correct

#

at least, it has matched all the manual experiments

rocky vigil Mar 9, 2025, 8:24 AM

#

rocky vigil ```c++ else { for (IndexType j = 0; j < TransformedFeatureDi...

@frosty imp issue is in this line

for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = (computed->*accPtr).accumulation[Perspective][j];
            }

sigh

#

we can replace it with

for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : oldpsq)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (auto index : oldthreats)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }

and it will work

#

meaning

#

despite (computed->*accPtr).computed[Perspective] being true

#

the accumulator is not correct

#

ok my github is up to date

#

I would really like help examining where accumulator.accumulation is updated

#

and how it can become "outdated"

naive comet Mar 9, 2025, 8:59 AM

#

honestly if you want a proof of concept you can just remove the accumulator updates altogether and do everything via the finny refreshes

#

it is not much of a slowdown and not an absolute PITA to implement

violet badger Mar 9, 2025, 12:41 PM

#

rocky vigil I would really like help examining where accumulator.accumulation is updated

Just a quick look at the warnings, I'm seeing:

position.cpp:1022:16: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of a non-trivial type ‘struct Stockfish::StateInfo’ leaves 1088 bytes unchanged [-Wclass-memaccess]
 1022 |     std::memcpy(&newSt, st, offsetof(StateInfo, accumulatorBig));
      |     ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from position.cpp:19:

I'd be worried about this kind of warnings.

rocky vigil Mar 9, 2025, 7:20 PM

#

violet badger Just a quick look at the warnings, I'm seeing: ``` position.cpp:1022:16: warning...

I indeed added 1440 extra bytes to StateInfo...

#

also clang never gave me this warning

#

perhaps I should install gcc

#

what does this code do

desert tree Mar 9, 2025, 7:45 PM

#

that seems to be finding the byte offset of a member of a struct

rocky vigil Mar 9, 2025, 7:45 PM

#

ok so how would it be off

#

if I add extra things to StateInfo

frosty imp Mar 9, 2025, 7:46 PM

#

it's partially copying the stateinfo struct

#

(remember the partial copy-make sf uses)

desert tree Mar 9, 2025, 7:46 PM

#

aka

struct S {
u32 a;
u32 b;
}
-> offsetof(S, b) == 4 // probably
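
A compilable sketch of the partial copy, assuming a made-up `MiniState` whose last member stands in for `accumulatorBig`. `copy_prefix` is a hypothetical helper that copies everything located before that member and nothing after it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical miniature of the layout: cheap fields first, big buffer last.
struct MiniState {
    std::uint32_t key;
    std::uint32_t rule50;
    std::int16_t  accumulator[8];  // stands in for accumulatorBig
};

// Copies only the bytes that precede `accumulator`; the buffer itself is untouched.
inline void copy_prefix(MiniState& dst, const MiniState& src) {
    std::memcpy(&dst, &src, offsetof(MiniState, accumulator));
}
```

Any member declared before `accumulator` is included in the copy automatically, so adding fields ahead of it makes the memcpy larger rather than shifting what gets skipped.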

rocky vigil Mar 9, 2025, 7:46 PM

#

ok

rocky vigil Mar 9, 2025, 7:46 PM

#

violet badger Just a quick look at the warnings, I'm seeing: ``` position.cpp:1022:16: warning...

but according to this warning it is not working

frosty imp Mar 9, 2025, 7:49 PM

#

rocky vigil but according to this warning it is not working

what did you add to StateInfo

rocky vigil Mar 9, 2025, 7:51 PM

#

frosty imp what did you add to StateInfo

2x Eval::NNUE::FeatureSet::IndexList (520 bytes each), ColorBB[2], PieceBB[16] (I intended for it to be 8 but whatever), Piece[64]

#

This is all before accumulatorBig

candid ivy Mar 9, 2025, 8:57 PM

#

rocky vigil but according to this warning it is not working

according to the warning the object is no longer trivial, and the warning will be there for as long as StateInfo is non-trivial

#

it might work or it might not, depending a bit on the context

rocky vigil Mar 9, 2025, 10:01 PM

#

Is it because of adding a Eval::NNUE::FeatureSet::IndexList in StateInfo

#

I can store them in Accumulator instead I think

#

And that should make StateInfo trivial again

candid ivy Mar 9, 2025, 10:06 PM

#

Isn’t accumulator in state info too? It would have the same result since if a member is not trivially copyable then the struct itself isn’t either

rocky vigil Mar 9, 2025, 10:24 PM

#

but since we only memcpy the non-accumulator portion of StateInfo

#

perhaps that issue could be avoided

#

otherwise I could look into storing pointers instead

#

but idk how that would work

#

actually since valuelist is a very simple implementation

#

what if I just construct a trivial version of it

#

lemme try that

candid ivy Mar 9, 2025, 10:34 PM

#

First check if that’s actually the struct causing the warning

rocky vigil Mar 9, 2025, 10:37 PM

#

frosty imp Mar 9, 2025, 10:38 PM

#

template<typename T, std::size_t MaxSize>
class ValueList {

   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) { values_[size_++] = value; }
    const T*    begin() const { return values_; }
    const T*    end() const { return values_ + size_; }
    const T&    operator[](int index) const { return values_[index]; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

rocky vigil Mar 9, 2025, 10:38 PM

#

and if I comment out the IndexList declarations in stateinfo I get that it is trivial

#

valuelist should be easy to do in a trivial manner

#

it's essentially just an array

frosty imp Mar 9, 2025, 10:40 PM

#

idk why it isn't trivially copyable

rocky vigil Mar 9, 2025, 10:40 PM

#

it might be because of the [] operator

frosty imp Mar 9, 2025, 10:41 PM

#

well the member functions aren't copied anyway

rocky vigil Mar 9, 2025, 10:41 PM

#

yeah I also have no idea why it's nontrivial

desert tree Mar 9, 2025, 10:43 PM

#

maybe static_assert(std::is_trivially_copyable_v<T>)?

#

otherwise it will not be trivially copyable

rocky vigil Mar 9, 2025, 10:43 PM

#

#

well

#

idk what this means lol

rocky vigil Mar 9, 2025, 10:45 PM

#

desert tree maybe static_assert(std::is_trivially_copyable_v<T>)?

Eval::NNUE::IndexType is just uint32_t

desert tree Mar 9, 2025, 10:46 PM

#

can you send the error you get

frosty imp Mar 9, 2025, 10:46 PM

#

violet badger Just a quick look at the warnings, I'm seeing: ``` position.cpp:1022:16: warning...

here

desert tree Mar 9, 2025, 10:46 PM

#

for static_assert(std::is_trivially_copyable_v<ValueList<T, N>>)?

rocky vigil Mar 9, 2025, 10:46 PM

#

essentially memcpy doesn't work because adding valuelists makes stateinfo nontrivial

#

whereas I would like to have a valuelist store all active features corresponding to a given accumulator

desert tree Mar 9, 2025, 10:47 PM

#

compiler and language version?

rocky vigil Mar 9, 2025, 10:47 PM

#

so that I don't need to recompute them to take the difference

rocky vigil Mar 9, 2025, 10:47 PM

#

desert tree compiler and language version?

I personally have clang 19.1.4, and stockfish compiles with c++17

#

also since millions of these will be processed for accumulator updates

#

trivialness might actually impact the speed

#

in the meanwhile I'll just replace it with an actual array

#

and see if that fixes things

rocky vigil Mar 9, 2025, 10:57 PM

#

desert tree maybe static_assert(std::is_trivially_copyable_v<T>)?

this assert passes if I insert it into Valuelist

rocky vigil Mar 9, 2025, 10:59 PM

#

frosty imp ```c++ template<typename T, std::size_t MaxSize> class ValueList { public: ...

if vscode is to be trusted the issue is in std::size_t size_ = 0

#

maybe vscode is not to be trusted though

#

ok godbolt backs up this claim

frosty imp Mar 9, 2025, 11:08 PM

#

could you send godbolt link

rocky vigil Mar 9, 2025, 11:08 PM

#

how do I get a link

frosty imp Mar 9, 2025, 11:08 PM

#

"share"

#

at top right corner

rocky vigil Mar 9, 2025, 11:08 PM

#

https://godbolt.org/z/vee7vhY7E

desert tree Mar 9, 2025, 11:10 PM

#

nvm im dumb

rocky vigil Mar 9, 2025, 11:10 PM

#

anyways the compiler in godbolt (hopefully) doesn't lie

#

even though I have no idea what the issue could possibly be

desert tree Mar 9, 2025, 11:11 PM

#

also why cout

frosty imp Mar 9, 2025, 11:11 PM

#

i think it's just the custom constructor

desert tree Mar 9, 2025, 11:11 PM

#

instead of static_assert

frosty imp Mar 9, 2025, 11:11 PM

#

    std::cout << std::boolalpha << std::is_trivially_copyable_v<ValueList<std::uint32_t, 128ULL>> << std::endl;

#

but this is still true though

desert tree Mar 9, 2025, 11:11 PM

#

is_trivial vs is_trivially_copyable

#

are not the same
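
A compile-time check of that distinction, using two hypothetical structs with identical data members. The only difference is the default member initializer, which gives the first one a non-trivial default constructor.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Same data layout twice; only the default member initializer differs.
struct WithInit {
    std::uint32_t values[4];
    std::size_t   size = 0;  // makes the default constructor non-trivial
};

struct WithoutInit {
    std::uint32_t values[4];
    std::size_t   size;  // indeterminate until set explicitly
};

// Copying is trivial in both cases...
static_assert(std::is_trivially_copyable_v<WithInit>);
static_assert(std::is_trivially_copyable_v<WithoutInit>);

// ...but only the version without the initializer is a trivial type.
static_assert(!std::is_trivial_v<WithInit>);
static_assert(std::is_trivial_v<WithoutInit>);
```

A type can therefore pass an `is_trivially_copyable` check and still count as a "non-trivial type" once it has a default member initializer.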

frosty imp Mar 9, 2025, 11:12 PM

#

yeah

#

but didn't the warning say trivially copyable

#

ah

desert tree Mar 9, 2025, 11:12 PM

#

non-trivial type

rocky vigil Mar 9, 2025, 11:12 PM

#

vondele's warning just says 'writing to an object of a non-trivial type'

desert tree Mar 9, 2025, 11:12 PM

#

thats so strange

#

ill take a look tomorrow

#

gotta sleep

rocky vigil Mar 9, 2025, 11:13 PM

#

anyways I can really just like

#

manually replace the functionality

#

with only an array

#

but that's stupid

#

like just have arr[0] be the replacement for size_ or whatever

frosty imp Mar 9, 2025, 11:14 PM

#

i think it's the initialization of size_=0 that causes the problem

desert tree Mar 9, 2025, 11:14 PM

#

nahh please have a separate member no?

desert tree Mar 9, 2025, 11:14 PM

#

frosty imp i think it's the initialization of size_=0 that causes the problem

so remove it?

frosty imp Mar 9, 2025, 11:14 PM

#

idk how you can do that without breaking stuff

rocky vigil Mar 9, 2025, 11:15 PM

#

frosty imp i think it's the initialization of size_=0 that causes the problem

no initializing size_ to 1 also has issues

#

unless it's the initialization in general

desert tree Mar 9, 2025, 11:15 PM

#

maybe all members need initialization or none?

frosty imp Mar 9, 2025, 11:15 PM

#

you need to remove the initialization

desert tree Mar 9, 2025, 11:15 PM

#

rocky vigil unless it's the initialization in general

it wouldnt be any particular value

rocky vigil Mar 9, 2025, 11:15 PM

#

yes removing initialization makes it work

#

well then

frosty imp Mar 9, 2025, 11:16 PM

#

probably the initialization makes the constructor non-trivial?

rocky vigil Mar 9, 2025, 11:18 PM

#

ok but how do you get the functionality without the initialization of size_

frosty imp Mar 9, 2025, 11:19 PM

#

eh you probably can't

#

you know this would be a great time to refactor accumulator updates Kappa

rocky vigil Mar 9, 2025, 11:19 PM

#

what if we just leave size_ uninitialized

#

and then use clear() to set it to 0

#

does this bypass work

frosty imp Mar 9, 2025, 11:20 PM

#

that could solve it

#

but i don't like the extra step needed to use it elsewhere

#

feels error prone

rocky vigil Mar 9, 2025, 11:20 PM

#

yeah it's

#

annoying

#

this is basically doing a constructor without actually doing one

frosty imp Mar 9, 2025, 11:21 PM

#

how about you just do it that way for the prototype

rocky vigil Mar 9, 2025, 11:21 PM

#

yeah I'll just try and see if it works for ue now

frosty imp Mar 9, 2025, 11:21 PM

#

would you be open to rebasing this onto some other refactors

rocky vigil Mar 9, 2025, 11:21 PM

#

and if we get close to merging we can figure a better solution out

#

sure

#

yeah

frosty imp Mar 9, 2025, 11:21 PM

#

yeah I can try making it nice but it'll probably take a few days lol

rocky vigil Mar 9, 2025, 11:22 PM

#

i mean only if this will get merged

#

myself I'm pretty confident that threat inputs will work with sufficient optimization but

#

thankfully valuelist is only used once

#

'capturesSearched' and 'quietsSearched'

frosty imp Mar 9, 2025, 11:24 PM

#

i really hope it works lol

#

threat inputs might solve a whole class of fortress issues

rocky vigil Mar 9, 2025, 11:46 PM

#

welp not initializing size_ = 0 breaks the code somehow even though I thought I tracked down and added .clear() every time a valuelist is declared

frosty imp Mar 9, 2025, 11:50 PM

#

did you clear it in the movepicker constructor

#

nvm it doesn't use valuelist

rocky vigil Mar 9, 2025, 11:57 PM

#

```
Assertion failed: (next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j], file nnue/nnue_feature_transformer.h, line 264
```

#

lmao it still doesn't work

#

the accumulator still gets messed up somehow

#

```c++
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : oldfeatures)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                assert((next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j]);
            }
```
relevant code

#

uh @frosty imp I pushed commit to my branch so can you compile it with gcc and tell me all the warnings

#

because clang did not give me any memcpy warnings

frosty imp Mar 10, 2025, 12:02 AM

#

https://pastebin.com/cYwb8Bbh

g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -pedantic -Wextr...

rocky vigil Mar 10, 2025, 12:04 AM

#

ok so no more memcpy warnings

#

this is not inspiring

rocky vigil Mar 10, 2025, 12:25 AM

#

also for a startpos search at least

#

after update_accumulator_scratch being called on startpos

#

everything else will be update_accumulator_incremental

#

anyways since Stockfish::StateInfo is back to being trivial the memcpy should be fine

rocky vigil Mar 10, 2025, 1:15 AM

#

is it still possible that something in the memcpy is going wrong?

#

I don't know any other change that could affect the accumulation

#

since nowhere else in the code besides feature_transformer is it modified

rocky vigil Mar 10, 2025, 3:59 AM

#

I still have no idea where accumulator.accumulation is modified besides in the feature transformer functions

#

so I really don't know how it would stop matching

#

I don't think it's because of any memcpy because a 4096 byte buffer in the accumulator struct before the values doesn't work either

#

```
92 -60 -152 39 -71 -20 -113 31 -45 262 -76 -104 193 -98 72 -6 65 4 -2 -1025 88 -300 295 33 -270 -292 -442 37 72 -7 -184 252 -209 -70 -514 -235 130 -243 -256 -112 -57 140 -418 -121 -75 13 -906 110 77 -150 103 -231 -155 -1937 -2310 -98 -191 -522 -27 1 46 -2177 -185 135 -358 12 -131 -44 -81 -127 -87 -952 -430 -416 -6 -56 140 -22 85 -49 -131 10 -68 218 -1726 -630 -92 -23 -81 139 -459 30 -223 -53 13 -785 -32 -88 244 99 107 159 93 -272 -399 -551 -136 15 17 -731 -328 -116 -48 -294 8 54 -88 -609 -204 28 13 -739 36 -149 -213 -128 -201 -53 -417 -161 -180 -10 -2 25 -11 -175 56 -130 -90 -178 -3416 39 -656 -84 181 69 -5 -394 -137 -763 31 -205 46 11 -72 -71 -41 196 39 -1601 -521 -223 47 -3563 -384 -96 43 57 215 -195 -72 -12 -191 -153 -26 106 65 -2286 9 -192 190 -50 2 -256 170 -167 -43 50 70 71 -641 346 -2 72 -10 67 -107 -42 40 36 306 -476 -122 -188 33 -36 -604 119 -275 93 -917 2 -701 -43 -410 -128 -156 -946 -80 64 -446 40 6 38 75 -648 177 -27 -425 -20 -32 -1993 -174 40 -117 -7 60 -148 124 -226 -71 27 -358 -47 147 -168 -226 -601 5 7 -231 -273 -12 -9 -174 -121
```

#

```
Instead got:
101 -61 -132 28 -64 -10 -105 45 -51 267 -77 -99 180 -299 81 -17 72 33 -41 -975 88 -298 265 65 -251 -304 -402 -103 138 -101 -186 250 -221 -224 -502 -140 128 -168 -239 -114 -56 133 -424 -114 -81 17 -937 128 33 -135 101 -334 -151 -1822 -2369 -95 -189 -363 3 -31 28 -2126 -166 136 -382 2 -647 -19 -166 -126 -87 -998 -443 -606 -31 -49 169 -27 86 -60 -127 6 -75 198 -1701 -549 -107 -29 -109 145 -437 20 -213 -15 25 -753 -47 -80 237 138 101 143 123 -291 -395 -530 -118 -50 7 -706 -330 -101 -53 -448 34 40 -176 -606 -197 8 -11 -837 27 -140 -208 -100 -200 -58 -435 -181 -150 -17 38 15 -12 -162 -73 -136 -90 16 -3478 -108 -680 -509 163 65 -9 -393 -197 -729 39 -111 46 -4 -69 -75 -26 211 11 -1556 -523 -209 43 -3528 -265 -71 20 -8 211 -180 -81 2 -181 -131 -27 115 69 -2229 19 -219 208 -45 8 -243 159 -181 -15 56 43 73 -659 311 -9 80 4 75 -129 -38 25 12 303 -464 -115 -163 32 -32 -598 41 -283 111 -935 -14 -705 -83 -416 -60 -177 -939 -78 66 -455 48 1 -155 74 -740 166 -53 -358 -37 -7 -2134 -175 -45 -108 -3 67 -158 131 -232 -81 15 -345 -67 139 196 -187 -627 39 -4 -221 -270 -11 31 -169 -58
```

#

what

#

ok then

#

idk how useful this is though

#

8 9 10 11 12 13 14 15 65 70 130 133 192 199 260 323 429 432 433 434 435 436 438 439 505 510 570 573 632 639 700 763 938 957 1622 1623 1643 1644 2728 2735 2832 2833 4607 4608 4611 4612 4613 7445 7446 7447 7448 7449 8350 8353 8754 8771 8773 9635 9636 9656 9657 11022 11023 11120 11127 13407 13408 13413 13414 15333 15334 15335 15336 15337```

#

this should be the position after 1. c2c3

rocky vigil Mar 10, 2025, 4:34 AM

#

rocky vigil ```Instead got: 101 -61 -132 28 -64 -10 -105 45 -51 267 -77 -99 180 -299 81 -17 ...

ok I have discovered this is the white perspective accumulator after c2c3

#

so somehow it's getting passed the accumulator of the wrong perspective

#

wot

#

ok I am a clown

#

I have the same

#

IndexList

#

for both perspectives

#

because I forgot to split by perspective

#

so of course the features and stuff will cease to match...

rocky vigil Mar 10, 2025, 5:01 AM

#

UE reduces accumulator updates by more than 4x in 10M node search from startpos

#

but is not that much faster

#

maybe because of write_difference overhead

#

speedtest Nodes/second : 1809147 (non-ue)

#

Nodes/second : 1736032 (ue)

#

lmao

#

write_difference overhead is actually insane apparently

rocky vigil Mar 10, 2025, 5:32 AM

#

ok branch updated

#

uh

#

tbh ue being like

#

~same speed

#

with anywhere between 1/3 and 1/5 of the accumulator updates

#

is shocking

#

(ly bad overhead)

#

btw please compare current branch vs https://github.com/sscg13/Stockfish/commit/a2604d40b42b7f755cebe91ed5a41d2cd0ac30d9

#

for speedtesting purposes

#

(and/or profile it, the only major difference should be way less accumulator updates but many usages of write_difference)

#

i would appreciate it a lot

#

according to my debug statistics, which I can run now that I have 'real ue', this estimate is approximately correct (per side, so ~16 total per eval)

rocky vigil Mar 10, 2025, 4:04 PM

#

I am suspicious of the bench change when using dirtypiece to perform the psq feature updates

#

But the short STC I ran (50 - 31 - 119) doesn’t lie

#

It’s like a 10% speedup

#

@formal smelt are we still doing king-rook and king-bishop deduplication

#

In full inputs

formal smelt Mar 10, 2025, 4:09 PM

#

i thought we weren't doing that

rocky vigil Mar 10, 2025, 4:09 PM

#

Ok sure that’s fine

#

Yeah now that ue is in a better shape

#

(Overhead is constant but massively less accumulator updates scales much better with larger nets)

#

I can work on supporting full threats as well

rocky vigil Mar 10, 2025, 4:13 PM

#

rocky vigil with anywhere between 1/3 and 1/5 of the accumulator updates

Based on this statistic, it should only have 2x more accumulator updates vs halfkav2hm so disregarding overhead it should be very promising

rocky vigil Mar 10, 2025, 4:13 PM

#

rocky vigil It’s like a 10% speedup

Also I’m pretty sure this speedup is purely from write_difference on smaller vectors

round stone Mar 10, 2025, 7:54 PM

#

rocky vigil btw please compare current branch vs <https://github.com/sscg13/Stockfish/commit...

Result of  50 runs
==================
base (...rc/stockfish) =     275282  +/- 1164
test (...rc/stockfish) =     447672  +/- 1937
diff                   =    +172389  +/- 1315

speedup        = +0.6262
P(speedup > 0) =  1.0000

speedup of latest threat-inputs branch (46581e8) vs. a2604d4
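
For reference, the `speedup` line in these reports is consistent with the difference in mean nodes/second divided by the base mean. A minimal sketch of that arithmetic follows; the helper name `relative_speedup` is made up for illustration and is not part of any benchmarking script mentioned in this thread.

```cpp
#include <cassert>

// Relative speedup of `test` over `base`, both in nodes/second.
// Hypothetical helper for illustration only.
double relative_speedup(double base_nps, double test_nps) {
    assert(base_nps > 0.0);
    return (test_nps - base_nps) / base_nps;
}
```

With the means above, `relative_speedup(275282, 447672)` comes out at roughly 0.626, matching the reported `+0.6262`. A value of 1.0 would mean the test binary is twice as fast as the base.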

rocky vigil Mar 10, 2025, 8:19 PM

#

Ok looks like ue is much better on your machine compared to mine

#

On my laptop it’s barely like 10%

rocky vigil Mar 10, 2025, 8:51 PM

#

@formal smelt can I get active features for kiwipete, full inputs (including deduplication) when you have time

candid ivy Mar 10, 2025, 8:56 PM

#

~/bench_parallel.sh ./stockfish_a2604d40 ./stockfish_46581e8a 13 10
sf_base =   335055 +/-   1366 (95%)
sf_test =   739457 +/-   6135 (95%)
diff    =   404401 +/-   5258 (95%)
speedup = 120.69692% +/- 1.570% (95%)

twilit oriole Mar 10, 2025, 9:01 PM

#

Since net weights are not shared between instances

#

If you do single threaded the result will be far better

round stone Mar 10, 2025, 9:02 PM

#

is there at least another 2x speedup expected beyond this?

candid ivy Mar 10, 2025, 9:32 PM

#

diff --git a/src/nnue/nnue_feature_transformer.h b/src/nnue/nnue_feature_transformer.h
index 35027bf6..ae832a67 100644
--- a/src/nnue/nnue_feature_transformer.h
+++ b/src/nnue/nnue_feature_transformer.h
@@ -256,21 +256,25 @@ class FeatureTransformer {
             acc_updates++;
             threat_loops += (int)removed.size();
             threat_loops += (int)added.size();
+
+            auto* acc_ptr = &((next->*accPtr).accumulation[Perspective][0]);
+
             // Difference calculation for the activated features
             for (auto index : added)
             {
-                const IndexType offset = TransformedFeatureDimensions * index;
-                assert(offset < TransformedFeatureDimensions * InputDimensions);
+                const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
                 for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
-                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
+                    acc_ptr[j] += weight_ptr[j];
             }
+
             // Difference calculation for the deactivated features
             for (auto index : removed)
             {
-                const IndexType offset = TransformedFeatureDimensions * index;
-                assert(offset < TransformedFeatureDimensions * InputDimensions);
+                const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
                 for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
-                    (next->*accPtr).accumulation[Perspective][j] -= weights[offset + j];
+                    acc_ptr[j] -= weight_ptr[j];
             }
         }

unless i did something wrong i got a good speedup from this

sf_base =   731585 +/-   8007 (95%)
sf_test =  1162599 +/-  14972 (95%)
diff    =   431013 +/-  10956 (95%)
speedup = 58.91504% +/- 1.498% (95%)
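
The idea in the diff above (resolve the accumulator row and the weight row once, then run a tight loop over plain pointers) can be shown in isolation. This is a simplified sketch under assumed names and a row-major weight layout, not the branch's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Apply feature deltas to one accumulator row.
// `weights` is assumed row-major: one row of `dims` values per feature index.
// All names are illustrative stand-ins.
void apply_deltas(std::int16_t* acc, const std::int16_t* weights, std::size_t dims,
                  const std::vector<std::size_t>& added,
                  const std::vector<std::size_t>& removed) {
    assert(acc != nullptr && weights != nullptr);
    for (std::size_t index : added) {
        const std::int16_t* row = weights + dims * index;  // computed once per feature
        for (std::size_t j = 0; j < dims; ++j)
            acc[j] += row[j];
    }
    for (std::size_t index : removed) {
        const std::int16_t* row = weights + dims * index;
        for (std::size_t j = 0; j < dims; ++j)
            acc[j] -= row[j];
    }
}
```

The inner loops now touch only two raw pointers, which tends to be easier for a compiler to vectorize than repeated member-pointer and multi-dimensional indexing, though how much that helps will depend on compiler and flags.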

formal smelt Mar 10, 2025, 9:42 PM

#

rocky vigil @formal smelt can I get active features for kiwipete, full inputs (inclu...

you can use bullet main, compile with --no-default-features --features cpu

#

uh hang on

#

ill just push to the montytrain branch

#

simplified inputs ye?

#

branch print-features-simpler-inputs
run with cargo r -r --bin value
edit the fen here: https://github.com/official-monty/montytrain/blob/print-features-simpler-inputs/value/src/main.rs#L10

rocky vigil Mar 10, 2025, 10:18 PM

#

formal smelt simplified inputs ye?

Full inputs (including deduplication) but should be easy change

#

Simpler inputs is already worked out

rocky vigil Mar 10, 2025, 10:23 PM

#

round stone is there at least another 2x speedup expected beyond this?

Well we apparently already have a significant one so given I suck at optimization I expect this to be the case

rocky vigil Mar 10, 2025, 11:09 PM

#

candid ivy ```diff diff --git a/src/nnue/nnue_feature_transformer.h b/src/nnue/nnue_feature...

yeah I also get a decent speedup with this but not nearly as large (probably because I use a laptop and don't use the most aggressive optimization settings)

rocky vigil Mar 10, 2025, 11:31 PM

#

https://tests.stockfishchess.org/tests/view/67cf763b7be98c1ad9b0214c

#

well I expect this to be -100 elo so far

#

going to test this, and also with a +50% time odds to simulate further optimization

rocky vigil Mar 10, 2025, 11:56 PM

#

Uh

#

There may be something suspicious with https://github.com/sscg13/Stockfish/commit/a9cb57d5cb6c7c06b7bd7445fd1a6d04da333c96

#

Bench altering

twilit oriole Mar 10, 2025, 11:57 PM

#

I don't get how a 256 threat net is that slow. It shouldn't be much, 80%+ time is in search

rocky vigil Mar 10, 2025, 11:57 PM

#

I swear locally it was +30 elo so I looked past it

#

Idk though

twilit oriole Mar 10, 2025, 11:57 PM

#

Because vondele said the 3072 master net is only 50% of runtime in eval

rocky vigil Mar 10, 2025, 11:57 PM

#

Yeah I probably optimized smth poorly

#

Maybe the special SIMD kernels actually are meaningful

rocky vigil Mar 11, 2025, 12:40 AM

#

rocky vigil There may be something suspicious with <https://github.com/sscg13/Stockfish/comm...

Welp when I get back I’ll try undoing this change

lofty cedar Mar 11, 2025, 12:56 AM

#

Threat input tests went horribly.

#

200 elo is a long shot.

formal smelt Mar 11, 2025, 1:01 AM

#

lofty cedar Threat input tests went horribly.

read the thread

#

i wouldn't take that test as indicating anything

lofty cedar Mar 11, 2025, 1:02 AM

#

I see.

rocky vigil Mar 11, 2025, 1:25 AM

#

twilit oriole I don't get how a 256 threat net is that slow. It shouldn't be much, 80%+ time i...

If someone could profile like tmrw I would really appreciate it ^^

rocky vigil Mar 11, 2025, 1:39 AM

#

twilit oriole I don't get how a 256 threat net is that slow. It shouldn't be much, 80%+ time i...

I am pretty sure write_difference is much slower than I expected

#

nothing else can explain that ue is less than 2x faster than non-ue with around 1/4 the accumulator updates on average

rocky vigil Mar 11, 2025, 1:40 AM

#

rocky vigil I am pretty sure write_difference is much slower than I expected

I will look into hacking in a custom incremental difference for single move

rocky vigil Mar 11, 2025, 1:44 AM

#

lofty cedar I see.

and also +60% time only performing 50 elo better at STC UHO is way below what we would expect from scaling data

daring wren Mar 11, 2025, 1:51 AM

#

is it? I thought it was 2x time odds ~= 70-80 elo
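
To compare the two figures, one can convert a time multiplier into an expected Elo gain. The sketch below ASSUMES the gain is linear in log2(time), which is a common rule of thumb and not something established in this thread; the function name is made up.

```cpp
#include <cassert>
#include <cmath>

// Expected Elo gain from a time multiplier, ASSUMING gain is linear in
// log2(time). `elo_per_doubling` is the rule-of-thumb figure (e.g. 70-80).
double elo_from_time_odds(double time_ratio, double elo_per_doubling) {
    assert(time_ratio > 0.0);
    return elo_per_doubling * std::log2(time_ratio);
}
```

Under that assumption, +60% time (ratio 1.6) at 70 to 80 Elo per doubling works out to roughly 47 to 54 Elo, so a 50 Elo gain would be about what the rule of thumb predicts.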

rocky vigil Mar 11, 2025, 1:51 AM

#

ok well in that case

#

+40 fixed nodes 1/2 speed

#

in no way equals -270 stc elo

#

so either way something is wrong

daring wren Mar 11, 2025, 1:52 AM

#

maybe net has bad scaling

#

have you tried fixed nodes but with higher node count?

rocky vigil Mar 11, 2025, 1:54 AM

#

daring wren maybe net has bad scaling

well the bigger issue was I merged some ue change that affected bench bc I ran SSS and it was not worse

#

that was probably against my better judgement

#

so now I'm running a non-SSS test on fishtest

#

what the - (check back in on this in a couple hours)

#

I really got scammed by SSS

#

maybe I straight up loaded the wrong compile

#

anyways pls fix dirtypiece

#

it's worth another like 20% if done properly or smth

rocky vigil Mar 11, 2025, 2:05 AM

#

rocky vigil what the - (check back in on this in a couple hours)

@twilit oriole I think everything is explained lol

lofty cedar Mar 11, 2025, 3:04 AM

#

Interesting.

rocky vigil Mar 11, 2025, 6:28 AM

#

ok yeah tmrw I will try to get dirtypiece working again

#

and maybe also try and optimize incremental more

#

(in the difference calculation for features)

candid ivy Mar 11, 2025, 9:18 AM

#

rocky vigil Well we apparently already have a significant one so given I suck at optimizatio...

look at append_active_threats for optimizations, this eats most of the time, e.g. move the vector into the class so you don't recreate it constantly

#

the enemy bool in the make_index can be rewritten as bool enemy = (attkr ^ attkd) & 8; in my profile it was kinda slow i think, (~1.4-2% speedup)
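
A small sketch of why the XOR trick works, assuming `attkr` and `attkd` are Stockfish `Piece` values, where the piece type sits in the low three bits and the colour in bit 3. The function name `is_enemy` is made up for illustration.

```cpp
#include <cassert>

// Piece codes following Stockfish's Piece enum: type in bits 0-2, colour in bit 3.
enum Piece : int {
    W_PAWN = 1, W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
    B_PAWN = 9, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING
};

// True when attacker and attacked piece have different colours.
// XOR leaves bit 3 set only if the two colour bits differ.
constexpr bool is_enemy(Piece attacker, Piece attacked) {
    return ((attacker ^ attacked) & 8) != 0;
}
```

This avoids extracting each colour separately and comparing them; a single XOR and mask does the job.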

candid ivy Mar 11, 2025, 11:46 AM

#

@rocky vigil whats the purpose of sorting the features in the append_active_threats ? I get the same bench without them

candid ivy Mar 11, 2025, 2:04 PM

#

diff --git a/src/nnue/layers/screlu_affine.h b/src/nnue/layers/screlu_affine.h
index aeb7e951..06ad8249 100644
--- a/src/nnue/layers/screlu_affine.h
+++ b/src/nnue/layers/screlu_affine.h
@@ -56,24 +56,23 @@ class SCReLUAffine {
     }

     // Forward propagation
-    OutputType evaluate(InputType* input, IndexType bucket) {
+    OutputType evaluate(const InputType* input, IndexType bucket) {
         assert(bucket < OutputBuckets);
-        constexpr IndexType Start = 0;
-        OutputType output = 255*(std::int32_t)biases[bucket];
-        for (IndexType i = Start; i < InputDimensions; i++) {
-            input[i] = std::min((std::int16_t)255, std::max(input[i], (std::int16_t)0));
-        }
-        for (IndexType i = Start; i < InputDimensions; i++) {
-            intermediate[i] = input[i]*weights[bucket*InputDimensions+i];
-        }
-        for (IndexType i = Start; i < InputDimensions; ++i)
-        {
-            output += (std::int32_t)(input[i])*(std::int32_t)(intermediate[i]);
+
+        const IndexType weightOffset = bucket * InputDimensions;
+        const auto* weights_ptr = &(weights[weightOffset]);
+
+        std::int32_t output = 255 * static_cast<std::int32_t>(biases[bucket]);
+
+        for (IndexType i = 0; i < InputDimensions; i++) {
+            const std::int16_t clipped = std::clamp(input[i], static_cast<std::int16_t>(0), static_cast<std::int16_t>(255));
+
+            output += clipped * clipped * weights_ptr[i];
         }
+
         return output / 255;
     }

-    alignas(CacheLineSize) std::int16_t intermediate[InputDimensions];
     alignas(CacheLineSize) std::int16_t biases[OutputBuckets];
     alignas(CacheLineSize) std::int16_t weights[OutputBuckets * InputDimensions];
 };

this speeds up evaluate by like 5% for me. also, if you need the sort order for the active inputs, I wonder if you can avoid the sort by changing the loop so that the indices are generated in ascending order in the first place?
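
The fused loop from the diff above can be tried standalone. This sketch keeps the 255 scaling from the diff and simplifies the rest (no buckets, no class, a made-up function name).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// SCReLU followed by a dot product: clamp each input to [0, 255], square it,
// multiply by the weight, and accumulate in 32 bits.
std::int32_t screlu_dot(const std::int16_t* input, const std::int16_t* weights,
                        std::size_t n, std::int16_t bias) {
    assert(input != nullptr && weights != nullptr);
    std::int32_t output = 255 * static_cast<std::int32_t>(bias);
    for (std::size_t i = 0; i < n; ++i) {
        const std::int32_t clipped = std::clamp<std::int32_t>(input[i], 0, 255);
        output += clipped * clipped * weights[i];
    }
    return output / 255;
}
```

Doing the clamp, square and multiply in one pass removes the intermediate array and leaves the input untouched, which is what allows the `const` on the input pointer.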

candid ivy Mar 11, 2025, 2:20 PM

#

then I wonder if you can avoid the sort by changing the loop so that all indices which are generated next are already in ascending order?
maybe they already are?

rocky vigil Mar 11, 2025, 4:58 PM

#

candid ivy @rocky vigil whats the purpose of sorting the features in the `append_a...

This is so that write_difference works
Also I’m pretty sure they don’t automatically generate in ascending order: say you have a rook defending two pawns on a3 and h3, then if the king is on the e-h files, mirroring puts the h3 one first
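
Why sorting matters can be sketched like this: when the old and new active-feature lists are both sorted, the removed and added indices fall out of linear-time passes. This is only a guess at the general shape of what `write_difference` relies on; the function below is a made-up stand-in built on the standard library, not the branch's code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Given two SORTED index lists, collect indices only in `before` (removed)
// and indices only in `after` (added). std::set_difference requires sorted input.
void sorted_difference(const std::vector<int>& before, const std::vector<int>& after,
                       std::vector<int>& removed, std::vector<int>& added) {
    assert(std::is_sorted(before.begin(), before.end()));
    assert(std::is_sorted(after.begin(), after.end()));
    removed.clear();
    added.clear();
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::back_inserter(removed));
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::back_inserter(added));
}
```

With unsorted input the same call would silently produce wrong differences, which would explain why the sort cannot simply be dropped even if bench happens to stay the same.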

candid ivy Mar 11, 2025, 4:58 PM

#

yeye i checked for some reason bench was unchanged but they weren't in ascending order

rocky vigil Mar 11, 2025, 5:46 PM

#

candid ivy look at `append_active_threats` for optimizations, this eats most of the time, i...

Test latest commit or two

#

9594b46 appears to be a significant gain in bench

#

Though the school laptop’s antivirus sometimes screws the results

rocky vigil Mar 11, 2025, 5:49 PM

#

candid ivy the enemy bool in the make_index can be rewritten as `bool enemy = (attkr ^ attk...

In general how much do the various enum conversions cost? If this is actually significant then I can just go back to bitwise tricks

rocky vigil Mar 11, 2025, 6:13 PM

#

Also if someone wants to debug the dirtypiece

candid ivy Mar 11, 2025, 6:13 PM

#

rocky vigil In general how much do the various enum conversions cost? If this is actually si...

Well they are basically free, this just seems to do better than color_of, which uses bitshifts

rocky vigil Mar 11, 2025, 6:13 PM

#

I mostly copied it from master

#

make_index(piece, from, from, piece, ksq) does get you the psq index

#

As a hack

#

Figuring this out should also be a large gain

candid ivy Mar 11, 2025, 6:16 PM

#

rocky vigil 9594b46 appears to be a significant gain in bench

iirc some 15%

rocky vigil Mar 11, 2025, 6:17 PM

#

Yeah turns out declaring many small vectors is highly unnecessary

rocky vigil Mar 11, 2025, 6:23 PM

#

rocky vigil Also if someone wants to debug the dirtypiece

Oh bruh what happens to dirtypiece in case of a king move

#

I think this is what is breaking

#

Yeah I probably mess up the mirror refresh

rocky vigil Mar 11, 2025, 6:44 PM

#

Alright I fixed dirtypiece

#

In 286c9e6a

#

Seems to be significant gain as well

candid ivy Mar 11, 2025, 7:02 PM

#

~16.5%

rocky vigil Mar 11, 2025, 7:11 PM

#

candid ivy the enemy bool in the make_index can be rewritten as `bool enemy = (attkr ^ attk...

Yeah I added this and combined psq, threat loops when computing from scratch in d0605d9

#

Probably not as big of a speedup though

rocky vigil Mar 11, 2025, 7:37 PM

#

Multilayer inference will be better because we have proper code for it

#

Ok I have no new optimizations planned rn so I’ll run a test to see where we are at

#

we’re probably ready to test multilayer with simplified inputs @round stone

rocky vigil Mar 11, 2025, 7:41 PM

#

rocky vigil Ok I have no new optimizations planned rn so I’ll run a test to see where we are...

Ah shoot can’t for a few hours bc I don’t remember my fishtest password

candid ivy Mar 11, 2025, 7:57 PM

#

with some additional optimizations and simd you can get another 11%

twilit oriole Mar 11, 2025, 8:00 PM

#

this is not permuted as well right?

rocky vigil Mar 11, 2025, 8:05 PM

#

Nope

#

Just whatever bullet default order is

#

At this point it’s almost certain that at L1=256 simplified threats are superior to halfkav2hm at STC

rocky vigil Mar 11, 2025, 8:09 PM

#

rocky vigil we’re probably ready to test multilayer with simplified inputs <@450517669570150...

I want to switch to multilayer in part because we have optimized SIMD for that already

#

Besides if computing the threat indices is the bottleneck rn that implies favorable scaling to large L1

#

Speaking of computing threat indices bulk pawn attacks might be considerable speedup

rustic bough Mar 11, 2025, 8:16 PM

#

MultiThreading is broken in 9f21b44. Was ok in 2db74f4.

rocky vigil Mar 11, 2025, 8:16 PM

#

Since half the pieces being looped through will be pawns
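
A minimal sketch of what "bulk pawn attacks" usually means with bitboards: all pawn attacks for one side are computed with two shifts instead of a per-pawn loop. It assumes the usual a1 = bit 0 square mapping; the function names are illustrative and not taken from the branch.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

// Little-endian rank-file mapping: a1 = bit 0, h1 = bit 7, a8 = bit 56.
constexpr Bitboard FileA = 0x0101010101010101ULL;
constexpr Bitboard FileH = FileA << 7;

// All squares attacked by any white pawn in `pawns`.
// The file masks stop attacks wrapping around the board edge.
constexpr Bitboard white_pawn_attacks(Bitboard pawns) {
    return ((pawns & ~FileA) << 7) | ((pawns & ~FileH) << 9);
}

// Same for black pawns, which attack towards lower ranks.
constexpr Bitboard black_pawn_attacks(Bitboard pawns) {
    return ((pawns & ~FileA) >> 9) | ((pawns & ~FileH) >> 7);
}
```

For threat features the attacker square is needed as well as the target, so in practice the two shifted sets would probably be kept separate: each target in the "west" set then maps back to exactly one pawn, and likewise for the "east" set.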

#

Huh interesting

#

Yeah uh

#

I see

#

Lemme bisect it quickly

#

9594b46 breaks multithreading

#

sigh

#

Do the threads like access the same featuretransformer class or smth

#

As long as they access separate featuretransformers it shouldn’t break?

candid ivy Mar 11, 2025, 8:31 PM

#

rocky vigil As long as they access separate featuretransformers it shouldn’t break?

make it a static thread_local

rocky vigil Mar 11, 2025, 8:32 PM

#

Welp n*ram for n threads

candid ivy Mar 11, 2025, 8:32 PM

#

it's really not much ram?

rocky vigil Mar 11, 2025, 8:32 PM

#

I guess fishtest already has this problem though

candid ivy Mar 11, 2025, 8:33 PM

#

how many pieces are in that array, you are sorting and adding it constantly, it's like max 30 elements with 4 bytes or something?

rocky vigil Mar 11, 2025, 8:33 PM

#

Max 16 with 4 bytes

candid ivy Mar 11, 2025, 8:34 PM

#

64 bytes per thread kekgasm

rocky vigil Mar 11, 2025, 8:34 PM

#

Oh but the commit afterwards moves removed, added also into featuretransformer

#

Whatever that’s like 1KB

#

Uh adding static thread_local gives me a bunch of warnings on forward declaration

#

And then undefined symbol errors

candid ivy Mar 11, 2025, 8:45 PM

#

show me

#

also just move the vector into the function again if you do that

#

or if it is max 16, i mean use another valuelist not a vector

rocky vigil Mar 11, 2025, 8:51 PM

#

candid ivy or if it is max 16, i mean use another valuelist not a vector

std::sort doesn’t work on valuelist

rocky vigil Mar 11, 2025, 8:51 PM

#

candid ivy show me

uh in a couple hours

#

Still at school lol

candid ivy Mar 11, 2025, 9:03 PM

#

rocky vigil std::sort doesn’t work on valuelist

you just need to add

T* begin() { return values_; }
T* end() { return values_ + size_; }

rocky vigil Mar 11, 2025, 9:04 PM

#

Wait I claim valuelist already has these

candid ivy Mar 11, 2025, 9:04 PM

#

it has them with const

#

sorting is well non const
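
The point about const can be shown with a simplified stand-in for a fixed-capacity list. This is not Stockfish's actual `ValueList`, just a minimal class with the same shape, plus the two non-const overloads suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Simplified stand-in for a fixed-capacity list like Stockfish's ValueList.
template<typename T, std::size_t MaxSize>
class ValueList {
   public:
    std::size_t size() const { return size_; }
    void push_back(const T& value) {
        assert(size_ < MaxSize);
        values_[size_++] = value;
    }
    // Read-only iteration.
    const T* begin() const { return values_; }
    const T* end() const { return values_ + size_; }
    // Non-const overloads: std::sort has to write through its iterators.
    T* begin() { return values_; }
    T* end() { return values_ + size_; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Sort a list in place.
template<typename T, std::size_t N>
void sort_list(ValueList<T, N>& list) {
    std::sort(list.begin(), list.end());
}
```

With only the const overloads, `std::sort(list.begin(), list.end())` would fail to compile because it cannot swap elements through `const T*`.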

rocky vigil Mar 11, 2025, 9:04 PM

#

Ok well yeah I won’t be able to do changes for a couple hours

#

At that point I’ll also run another STC on fishtest

#

To see where we are single thread now

round stone Mar 11, 2025, 9:56 PM

#

rocky vigil we’re probably ready to test multilayer with simplified inputs <@450517669570150...

alright, something like this to start?

const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
    .quantisations(&[255, 64])
    .optimiser(optimiser::AdamW)
    .loss_fn(Loss::SigmoidMPE(2.6))
    .input(ThreatInputs)
    .output_buckets(outputs::MaterialCount::<8>)
    .feature_transformer(HIDDEN_SIZE)
    .activate(Activation::SCReLU)
    .add_layer(16)
    .activate(Activation::CReLU)
    .add_layer(32)
    .activate(Activation::CReLU)
    .add_layer(1)
    .build();

formal smelt Mar 11, 2025, 9:59 PM

#

You could just recreate the SF arch

#

It should be more or less what is in the advanced example

round stone Mar 11, 2025, 10:03 PM

#

sure, if the inference code for SF arch is reusable or easy to set up for this

#

down to use whatever arch. lmk what multi-layer arch is easiest to get inference working for @rocky vigil

frosty imp Mar 11, 2025, 10:09 PM

#

can bullet leb compress the weights

round stone Mar 11, 2025, 10:09 PM

#

also found that there's a newer L1-256 which is the one lichess is using: nn-9067e33176e8.nnue

#

https://tests.stockfishchess.org/tests/view/656fe47c6980e15f69c78dea

#

https://tests.stockfishchess.org/api/nn/nn-9067e33176e8.nnue

#

there's no leb compression implementation in bullet

formal smelt Mar 11, 2025, 10:12 PM

#

this is hardly something needed in bullet

#

just postprocess the file

rocky vigil Mar 11, 2025, 10:16 PM

#

frosty imp can bullet leb compress the weights

No but Stockfish can

rocky vigil Mar 11, 2025, 10:17 PM

#

round stone down to use whatever arch. lmk what multi-layer arch is easiest to get inference...

Anyways idk it doesn’t matter the multilayer arch

#

I’ll get it to work either way (including weight reading)

#

Whatever you think is best

round stone Mar 11, 2025, 10:19 PM

#

alright, then i'm inclined to start simple with the arch

#

less code to deal with

#

and ignore leb128 until the end, or if larger L1 gets annoying to deal with during testing

#

since it has no effect on strength, and strength is the important part now

rocky vigil Mar 11, 2025, 10:21 PM

#

Yeah that’s fine

rocky vigil Mar 11, 2025, 10:21 PM

#

round stone alright, something like this to start? ```rust const HIDDEN_SIZE: usize = 256; l...

This should be fine

rocky vigil Mar 12, 2025, 12:11 AM

#

candid ivy show me

nnue/nnue_feature_transformer.h:129:9: warning: instantiation of variable
      'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' required here, but
      no definition is available [-Wundefined-var-template]
  129 |         added.clear();
      |         ^
nnue/nnue_feature_transformer.h:291:17: note: in instantiation of function template specialization
      'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::update_accumulator_scratch<Stockfish::WHITE>' requested here
  291 |                 update_accumulator_scratch<Perspective>(pos);
      |                 ^
nnue/nnue_feature_transformer.h:309:9: note: in instantiation of function template specialization
      'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::update_accumulator<Stockfish::WHITE>' requested here
  309 |         update_accumulator<WHITE>(pos);
      |         ^
nnue/network.cpp:222:25: note: in instantiation of member function 'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::transform' requested here
  222 |     featureTransformer->transform(pos, acc);
      |                         ^
nnue/nnue_feature_transformer.h:59:47: note: forward declaration of template entity is here
   59 |     static thread_local FeatureSet::IndexList added;
      |                                               ^
nnue/nnue_feature_transformer.h:129:9: note: add an explicit instantiation declaration to suppress this warning if
      'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' is explicitly
      instantiated in another translation unit
  129 |         added.clear();```
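
The warning says a static data member of a class template was declared but never defined. Two standard ways to provide the definition are sketched below on a made-up `Transformer` template, assuming a C++17 build: mark the member `inline`, or add an out-of-class definition in the header.

```cpp
#include <cassert>
#include <vector>

// Stand-in for FeatureTransformer: a class template with per-thread scratch lists.
template<int Dims>
class Transformer {
   public:
    // Option 1 (C++17): `inline` makes this declaration also the definition.
    static inline thread_local std::vector<int> added;

    // Option 2: declare here, define after the class (see below).
    static thread_local std::vector<int> removed;

    static void reset() {
        added.clear();
        removed.clear();
    }
};

// Out-of-class definition for option 2; it lives in the header next to the template.
template<int Dims>
thread_local std::vector<int> Transformer<Dims>::removed;
```

Either way each thread gets its own copy of the lists, so the scratch buffers are no longer shared between search threads, at the cost of one set of buffers per thread.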

rocky vigil Mar 12, 2025, 1:03 AM

#

hmm we are not cooking with this one

rocky vigil Mar 12, 2025, 1:25 AM

#

uh can someone run bench of threat-inputs vs halfkav2hm-256-base

rocky vigil Mar 12, 2025, 1:51 AM

#

what is going on with the residuals in the test

#

btw

#

this is the first time I've seen red

round stone Mar 12, 2025, 1:51 AM

#

speed of latest threat-inputs vs. L1-256 nn-9067e33176e8.nnue as main net

Result of  20 runs
==================
base (...-256-s3-9067) =    2760160  +/- 6232
test (...g13-sf-mar11) =    1059262  +/- 3025
diff                   =   -1700898  +/- 5835

speedup        = -0.6162
P(speedup > 0) =  0.0000

rocky vigil Mar 12, 2025, 1:52 AM

#

ok we are not cooking very hard on the speed

#

that's probably why still -100 elo

#

i mean the current code will perform (relatively) better with large nets because I think calculating threat indices is still the bottleneck

#

i will try and bulk pawn threats later

#

once multithreading is resolved

round stone Mar 12, 2025, 2:02 AM

#

for multilayer, floats for later layers ok? based on bullet morelayers.rs

const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
    .advanced_quantisations(&[QuantTarget::I16(255), QuantTarget::I16(64), QuantTarget::Float, QuantTarget::Float])
    .optimiser(optimiser::AdamW)
    .loss_fn(Loss::SigmoidMPE(2.6))
    .input(ThreatInputs)
    .output_buckets(outputs::MaterialCount::<8>)
    .feature_transformer(HIDDEN_SIZE)
    .activate(Activation::SCReLU)
    .add_layer(16)
    .activate(Activation::CReLU)
    .add_layer(32)
    .activate(Activation::CReLU)
    .add_layer(1)
    .build();

lofty cedar Mar 12, 2025, 2:04 AM

#

Still like -100 elo.

round stone Mar 12, 2025, 2:04 AM

#

-100 elo vs. L1-256 main net means still a long ways though

#

yea, morelayers can be a baseline but isn't going to help that much

twilit oriole Mar 12, 2025, 2:07 AM

#

I did say I don't think simplified threat inputs are sufficient lol

#

But it's fine, if it's close enough then we know the full will be better

rocky vigil Mar 12, 2025, 2:09 AM

#

round stone for multilayer, floats for later layers ok? based on bullet morelayers.rs ```rus...

ok hopefully the inference layers already in sf work lol

rocky vigil Mar 12, 2025, 2:10 AM

#

round stone -100 elo vs. L1-256 main net means still a long ways though

well as disservin said most of the slowdown is from computing threat indices

rocky vigil Mar 12, 2025, 2:10 AM

#

candid ivy look at `append_active_threats` for optimizations, this eats most of the time, i...

like here

twilit oriole Mar 12, 2025, 2:11 AM

#

Interesting

#

So that slowdown actually decreases as net size increases

rocky vigil Mar 12, 2025, 2:12 AM

#

according to my statistics you have ~2x as many accumulator updates with threats compared to halfkav2hm

#

in midgame

#

now obviously because of hilariously poor overhead this doesn't match what we see at small sizes

#

and yeah I would appreciate help with ue impl

#

like if there's a way to compute threat difference without looping through all the pieces

round stone Mar 12, 2025, 2:16 AM

#

i'll try L1-512 simple threats for another data point then

rocky vigil Mar 12, 2025, 2:17 AM

#

ok

#

I expect it to be not much slower

round stone Mar 12, 2025, 2:17 AM

#

and i'll look into full threats soon too

rocky vigil Mar 12, 2025, 2:18 AM

#

if you want I can take the current net, permute the ft/l2 weights 7 more times, and "effectively" have a l1 = 2048 for speed testing purposes

round stone Mar 12, 2025, 2:19 AM

#

full threats i assume can be trained with this
https://github.com/official-monty/montytrain/commits/threat-inputs-nnue-fixed/

#

i can train an actual L1-2048 and use an early checkpoint for speedtest purposes

rocky vigil Mar 12, 2025, 2:20 AM

#

yes I believe everything there is ready

#

although inference side it's not quite ready yet

#

I'll work on it

round stone Mar 12, 2025, 2:24 AM

#

alright full threats would be more important than multilayer

rocky vigil Mar 12, 2025, 2:27 AM

#

yeah sure the fixed nodes test of full threats vs simplified would probably be more informative

rocky vigil Mar 12, 2025, 2:28 AM

#

rocky vigil ```nnue/nnue_feature_transformer.h:129:9: warning: instantiation of variable ...

btw @frosty imp would you be able to help with this

#

moving indexlists to classes so that we don't declare a bunch of temporary ones (a noticeable speed gain in single thread btw) currently breaks multithreading

round stone Mar 12, 2025, 2:34 AM

#

Result of  10 runs
==================
base (...g13-sf-mar11) =    1063281  +/- 1289
test (...1-sscg13-512) =     254596  +/- 485
diff                   =    -808684  +/- 1307

speedup        = -0.7606
P(speedup > 0) =  0.0000

assuming i did this right, this shows simple threats L1-512 being quite a lot slower than L1-256

#

early results after 10 superbatches of training on SF data

#

Architecture           : (15776 -> 512)x2 -> 1x8
Inputs                 : Threat inputs
Number of Weights      : 8.09m
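
The 8.09m figure can be reproduced from the architecture string. The sketch below assumes the count covers feature transformer weights and biases plus one output layer per bucket that reads both perspectives; how the trainer actually counts is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Parameter count for an (inputs -> l1)x2 -> 1 x buckets network:
// feature transformer weights and biases, then one output layer per bucket
// reading both perspectives' accumulators.
constexpr std::int64_t parameter_count(std::int64_t inputs, std::int64_t l1,
                                       std::int64_t buckets) {
    const std::int64_t ft     = inputs * l1 + l1;
    const std::int64_t output = buckets * (2 * l1 + 1);
    return ft + output;
}
```

`parameter_count(15776, 512, 8)` gives 8,086,024, i.e. about 8.09m. At two bytes per weight that would be roughly 16 MB on disk.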

rocky vigil Mar 12, 2025, 2:40 AM

#

hmm

#

wait 4x slower is a bit

#

suspicious

#

like doubling L1 should never make it 4x slower

round stone Mar 12, 2025, 2:51 AM

#

16mb .nnue file. only change in the engine code otherwise was setting L1 to 512

rocky vigil Mar 12, 2025, 2:51 AM

#

can you send it to me so I can test

round stone Mar 12, 2025, 2:53 AM

#

📎 nn-ff12e5c0b08b.nnue

rocky vigil Mar 12, 2025, 2:58 AM

#

I got ~900k for 256 vs ~800k for 512

#

10M node search from startpos with 256: info depth 33 seldepth 40 multipv 1 score cp 28 lowerbound nodes 10000621 nps 625625 hashfull 999 tbhits 0 time 15985 pv d2d4 bestmove d2d4 ponder d7d5 Number of accumulator updates: 15940910 Number of feature indices looped through: 189471834
10M node search from startpos with 512: info depth 30 seldepth 37 multipv 1 score cp 35 lowerbound nodes 10000298 nps 570467 hashfull 1000 tbhits 0 time 17530 pv e2e4 bestmove e2e4 ponder e7e5 Number of accumulator updates: 15923538 Number of feature indices looped through: 190573575

round stone Mar 12, 2025, 3:05 AM

#

what do the Nodes/second numbers show when you run stockfish bench with both?

rocky vigil Mar 12, 2025, 3:05 AM

#

rocky vigil I got ~900k for 256 vs ~800k for 512

~this

#

though I don't have the benchmarking script so I need to run it manually

round stone Mar 12, 2025, 3:08 AM

#

weird, we'll let this training finish and see how it fares on fishtest

#

this is all i changed:

-#define EvalFileDefaultNameBig "nn-98b68b5a9455.nnue"
+#define EvalFileDefaultNameBig "nn-ff12e5c0b08b.nnue"

 // Number of input feature dimensions after conversion
-constexpr IndexType TransformedFeatureDimensionsBig = 256;
+constexpr IndexType TransformedFeatureDimensionsBig = 512;

rocky vigil Mar 12, 2025, 3:09 AM

#

that is also all I changed

round stone Mar 12, 2025, 3:10 AM

#

oh wait, my branch wasn't updated with the latest speed updates

#

nm, looks less slow now

#

it should be on this commit right? 9f21b44 disservin screlu affine speedup

rocky vigil Mar 12, 2025, 3:12 AM

#

yeah

#

that one

#

unfortunately a couple of the single-thread speedups break multithread

#

I am hoping for someone to help me resolve those (I have no experience coding multithreaded)

#

otherwise it would be a shame to have to roll those back

round stone Mar 12, 2025, 3:14 AM

#

L1-512 on top of 9f21b44 looks a lot better

Result of  10 runs
==================
base (...g13-sf-mar11) =    1057315  +/- 5402
test (...rofile-build) =     916618  +/- 4834
diff                   =    -140697  +/- 6748

speedup        = -0.1331
P(speedup > 0) =  0.0000

rocky vigil Mar 12, 2025, 3:14 AM

#

mm

#

looks like those speedups were worth a lot

#

how far back was your old branch just curious

#

the speedups today should only be like +30-40% compared to yesterday

round stone Mar 12, 2025, 3:17 AM

#

Updating 46581e8a..9f21b44c

#

46581e8a enable backward incremental updates

rocky vigil Mar 12, 2025, 3:18 AM

#

round stone L1-512 on top of `9f21b44` looks a lot better ``` Result of 10 runs ===========...

anyways, based on this, napkin math suggests that scaling to 3072 like master would run at 40% of the speed

rocky vigil Mar 12, 2025, 3:19 AM

#

round stone `46581e8a` enable backward incremental updates

oh huh the total speedup should not be so large then, then again this branch had broken inference because I added dirtypiece and forgot that it doesn't work if the king moves from efgh to abcd

round stone Mar 12, 2025, 3:20 AM

#

how does it perform if you try it locally?

rocky vigil Mar 12, 2025, 3:23 AM

#

upper 700k (for L1=256)

#

then again my laptop is like really not great for speed testing

#

like fishtest stc https://tests.stockfishchess.org/tests/view/67d0ca72166a3e8781d84242 suggests today was a ~40% speedup for L1=256 on average, my laptop suggests it was half that lol

rocky vigil Mar 12, 2025, 3:40 AM

#

anyways better speed is better speed lol

round stone Mar 12, 2025, 3:54 AM

#

yea, any speed we can get is good

#

i wouldn't worry about multithreaded for now either

#

it's nice if it works of course. however the main blocker is getting anything on par with master

round stone Mar 12, 2025, 6:21 AM

#

simple threat inputs - L1-512 vs. L1-256
https://tests.stockfishchess.org/tests/view/67d123cf166a3e8781d842bf

candid ivy Mar 12, 2025, 9:21 AM

#

rocky vigil std::sort doesn’t work on valuelist

just do this https://github.com/Disservin/Stockfish/commit/a3427fca70623052136b4be9b08e3629ee5e4daa

naive comet Mar 12, 2025, 9:23 AM

#

yeah that's 100% a huge speedup

candid ivy Mar 12, 2025, 9:30 AM

#

naive comet yeah that's 100% a huge speedup

why?

#

that's barely a speedup, it just makes multi threaded work again

rustic bough Mar 12, 2025, 9:53 AM

#

With a3427fc multithreading works again without crash. But the analysis is totally weird.

naive comet Mar 12, 2025, 10:09 AM

#

https://github.com/cj5716/Alexandria/commit/bb20a5bb7c217d2a47caa24d72cf0e07c05885ab was a 10% speedup for me tho

#

I guess this is quite different

#

I didn't read it in detail

candid ivy Mar 12, 2025, 11:17 AM

#

naive comet <https://github.com/cj5716/Alexandria/commit/bb20a5bb7c217d2a47caa24d72cf0e07c05...

ah well the vector is created in the function so that was a speedup for you, but sscg already moved the instantiation out of the function so that speedup is gone
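A rough illustration of the difference being described, not code from either engine: one version builds a new vector on every call, the other reuses a buffer owned by the caller. The function names and the dummy index computation are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Version 1: a fresh vector is constructed (and heap-allocated) per call.
std::vector<int> collect_fresh(int n) {
    std::vector<int> out;
    for (int i = 0; i < n; ++i)
        out.push_back(i * 2);  // stand-in for a real feature index
    return out;
}

// Version 2: the caller owns the buffer. clear() drops the elements but
// keeps the allocated capacity, so repeated calls stop allocating once
// the buffer has grown large enough.
void collect_into(std::vector<int>& out, int n) {
    out.clear();
    for (int i = 0; i < n; ++i)
        out.push_back(i * 2);
}
```

Once the instantiation already lives outside the hot function, as in the second version, there is no per-call allocation left to remove.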

candid ivy Mar 12, 2025, 11:19 AM

#

rocky vigil i mean the current code will perform (relatively) better with large nets because...

{
    Color c = order[Perspective][i];
    PieceType pt = PAWN;
    Piece attkr = make_piece(c, pt);
    Bitboard bb  = colorBB[c] & pieceBB[pt];
    indices.clear();

    // Capture directions, mirrored for black
    auto right = c == WHITE ? NORTH_EAST : SOUTH_WEST;
    auto left = c == WHITE ? NORTH_WEST : SOUTH_EAST;
    // Occupied squares attacked by a pawn capturing in each direction
    auto attacks_right = (c == WHITE ? shift<NORTH_EAST>(bb) : shift<SOUTH_WEST>(bb)) & occupied;
    auto attacks_left = (c == WHITE ? shift<NORTH_WEST>(bb) : shift<SOUTH_EAST>(bb)) & occupied;

    while (attacks_right) {
        Square to = pop_lsb(attacks_right);
        Square from = to - right;  // step back to the attacking pawn
        Piece attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }

    while (attacks_left) {
        Square to = pop_lsb(attacks_left);
        Square from = to - left;  // step back to the attacking pawn
        Piece attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }

    // Sort this piece type's indices before appending them to the active list
    std::sort(indices.begin(), indices.end());

    for (auto threat : indices) {
        active.push_back(threat);
    }
}

here you go, ~6% for me
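For anyone without the Stockfish types at hand, the same idea (shift the whole pawn bitboard once per capture direction, then walk the set bits) can be sketched standalone. This is an illustration only: it handles white pawns, assumes a1 = 0 ... h8 = 63 square numbering, and uses a slow portable `pop_lsb` rather than the engine's intrinsic-based one.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Bitboard = std::uint64_t;

// Assumed square numbering: a1 = 0, b1 = 1, ..., h8 = 63.
constexpr Bitboard FileA = 0x0101010101010101ULL;
constexpr Bitboard FileH = FileA << 7;

// Remove and return the index of the lowest set bit (portable, unoptimized).
int pop_lsb(Bitboard& b) {
    assert(b != 0);
    int sq = 0;
    while (!((b >> sq) & 1ULL))
        ++sq;
    b &= b - 1;  // clear the lowest set bit
    return sq;
}

// List (from, to) pairs for white pawns attacking occupied squares.
std::vector<std::pair<int, int>> white_pawn_threats(Bitboard pawns, Bitboard occupied) {
    std::vector<std::pair<int, int>> out;
    // Mask the edge file before shifting so attacks do not wrap around the board.
    Bitboard ne = ((pawns & ~FileH) << 9) & occupied;  // captures toward the h-file
    Bitboard nw = ((pawns & ~FileA) << 7) & occupied;  // captures toward the a-file
    while (ne) {
        int to = pop_lsb(ne);
        out.emplace_back(to - 9, to);
    }
    while (nw) {
        int to = pop_lsb(nw);
        out.emplace_back(to - 7, to);
    }
    return out;
}
```

With white pawns on a2 and e4 and occupied squares b3, d5 and f5, this yields a2→b3, e4→f5 and e4→d5, in that order.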

naive comet Mar 12, 2025, 11:25 AM

#

try using emplace_back and see if it helps

rocky vigil Mar 12, 2025, 2:42 PM

#

round stone simple threat inputs - L1-512 vs. L1-256 https://tests.stockfishchess.org/tests/...

Hmm this is quite concerning
Do you have a fixed nodes estimate for the net?

round stone Mar 12, 2025, 4:30 PM

#

rocky vigil Hmm this is quite concerning Do you have a fixed nodes estimate for the net?

Results of threat-inputs/sscg13-sf-mar11 vs threat-inputs/mar11-sscg13-512 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 135.41 +/- 8.83, nElo: 191.47 +/- 11.30
LOS: 100.00 %, DrawRatio: 31.83 %, PairsRatio: 7.04
Games: 3632, Wins: 1989, Losses: 641, Draws: 1002, Points: 2490.0 (68.56 %)
Ptnml(0-2): [26, 128, 578, 640, 444], WL/DD Ratio: 3.94

#

+135 elo at 25k nodes per move: L1-512 vs. L1-256, simple threats
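The Elo figure in that output follows from the score fraction via the standard logistic formula, Elo = -400 · log10(1/s - 1). A small sketch of the arithmetic, with helper names invented here rather than taken from the match-running tool:

```cpp
#include <cassert>
#include <cmath>

// Score fraction from a win/draw/loss record (a draw counts as half a point).
double score_fraction(int wins, int draws, int losses) {
    int games = wins + draws + losses;
    assert(games > 0);
    return (wins + 0.5 * draws) / games;
}

// Logistic Elo difference implied by a score fraction strictly between 0 and 1.
double elo_from_score(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For 1989 wins, 1002 draws and 641 losses the score is 2490 / 3632 ≈ 68.56%, which maps to about +135.4 Elo, in line with the reported value.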

rocky vigil Mar 12, 2025, 4:34 PM

#

What the

#

+130 fixed nodes
-20% speed maybe
+16 STC elo

#

It does not add up

round stone Mar 12, 2025, 4:50 PM

#

the speed difference measured with bench positions may not be reflective of speed changes throughout actual games

#

simple threat inputs - L1-1024 vs. L1-512
https://tests.stockfishchess.org/tests/view/67d1be92166a3e8781d843aa

#

Result of  10 runs
==================
base (...rofile-build) =     913329  +/- 4211
test (...rofile-build) =     668868  +/- 2992
diff                   =    -244461  +/- 4790

speedup        = -0.2677
P(speedup > 0) =  0.0000

#

look who's hiding in the last bytes of the L1-1024

lapis parrot Mar 12, 2025, 5:24 PM

#

noA0

rocky vigil Mar 12, 2025, 5:47 PM

#

Hmm looks like we still need major optimization work

#

Fixed nodes results are really strong though

lapis parrot Mar 12, 2025, 5:53 PM

#

rocky vigil +130 fixed nodes -20% speed maybe +16 STC elo

why do you think it doesn't add up though

#

did we have any measurements of how much elo we actually get from doubling nowadays?

rocky vigil Mar 12, 2025, 5:58 PM

#

Surely -20% speed is not -100 elo

#

-100 elo is more like -50% speed
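That rule of thumb can be turned into a quick estimate, assuming Elo scales with the base-2 logarithm of speed and that a halving costs about 100 Elo. Both are assumptions taken from the message above rather than measured values, and fixed-nodes Elo at small node counts is not directly comparable to STC Elo.

```cpp
#include <cassert>
#include <cmath>

// Estimated Elo change for a given speed ratio (new speed / old speed),
// assuming a constant Elo gain per doubling of speed.
double elo_from_speed_ratio(double speed_ratio, double elo_per_doubling) {
    assert(speed_ratio > 0.0);
    return elo_per_doubling * std::log2(speed_ratio);
}
```

Under those assumptions `elo_from_speed_ratio(0.8, 100.0)` is about -32, so a 20% slowdown alone would not account for the gap between +130 at fixed nodes and +16 at STC.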

rocky vigil Mar 12, 2025, 6:11 PM

#

rustic bough With a3427fc multithreading works again without crash. But the analysis is total...

f84f2226 should fully restore it for hopefully minimal speed loss

rocky vigil Mar 12, 2025, 6:12 PM

#

round stone simple threat inputs - L1-1024 vs. L1-512 https://tests.stockfishchess.org/tests...

You linked the wrong test but this one’s going even worse than 512 vs 256
I’m curious what the fixed nodes are for this one as well

round stone Mar 12, 2025, 6:15 PM

#

oops, edited the original message with the correct test link:
https://tests.stockfishchess.org/tests/view/67d1be92166a3e8781d843aa

rocky vigil Mar 12, 2025, 6:16 PM

#

I actually just suck at optimization and I have no idea what is going wrong

#

Like all the data suggests that threat inputs should be significantly more accurate as evaluation but the STC never matches

violet badger Mar 12, 2025, 6:27 PM

#

round stone the speed difference measured with bench positions may not be reflective of spee...

for that the new speedtest should be much more representative.

rocky vigil Mar 12, 2025, 6:28 PM

#

Ok well I think I fixed multithread

#

So hopefully speedtest works again

violet badger Mar 12, 2025, 6:28 PM

#

you can run speedtest with 1 thread..

#

but great you fixed multithread

rocky vigil Mar 12, 2025, 6:28 PM

#

Oh I see

#

It is hopefully not a huge single thread speed loss

violet badger Mar 12, 2025, 6:30 PM

#

./stockfish speedtest 1 16 5 (speedtest [threads] [hash (MiB)] [runtime (s)])

#

(you might want to give it a bit more than 5s though)

rocky vigil Mar 12, 2025, 6:32 PM

#

Running a couple 150 sec 4 thread for 256, 512, 1024

#

Will have results in several minutes

round stone Mar 12, 2025, 6:34 PM

#

rocky vigil You linked the wrong test but this one’s going even worse than 512 vs 256 I’m cu...

Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs/mar11-sscg13-1024-profile-build (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -42.48 +/- 7.02, nElo: -67.50 +/- 11.04
LOS: 0.00 %, DrawRatio: 44.51 %, PairsRatio: 0.49
Games: 3806, Wins: 963, Losses: 1426, Draws: 1417, Points: 1671.5 (43.92 %)
Ptnml(0-2): [144, 563, 847, 310, 39], WL/DD Ratio: 2.11

rocky vigil Mar 12, 2025, 6:34 PM

#

rocky vigil Running a couple 150 sec 4 thread for 256, 512, 1024

Actually nvm I forgot school wifi blocks fishtest

#

So I can’t download new nets nohope

round stone Mar 12, 2025, 6:35 PM

#

those results somehow indicate 1024 is negative vs. 512 at fixed nodes

formal smelt Mar 12, 2025, 6:35 PM

#

isnt that 512 vs 1024

#

would be a very weird way to display it if not

rocky vigil Mar 12, 2025, 6:35 PM

#

Uh can you get a Google drive link to 512 and 1024 quickly

#

So I can download

round stone Mar 12, 2025, 6:35 PM

#

can you download if i upload them here directly?

rocky vigil Mar 12, 2025, 6:36 PM

#

Uh I need to transfer from phone to school laptop then

#

So Google drive might be easier

#

L1=256 4 thread 662978

#

(This computer is very slow disregard the absolute numbers)

round stone Mar 12, 2025, 6:38 PM

#

https://robotmoon.com/assets/nn-135ac6a72683.nnue
https://robotmoon.com/assets/nn-67beaa6ecf7f.nnue

#

uploaded there. too much work to figure out google drive

rocky vigil Mar 12, 2025, 6:38 PM

#

Bruh the school WiFi also manages to block that nohope

#

School WiFi is actually terrible

#

Fine I’ll figure out phone -> laptop transfer

#

But you might have to wait longer

round stone Mar 12, 2025, 6:39 PM

#

is it blocking based on the domain, or the filename extension or what?

candid ivy Mar 12, 2025, 6:39 PM

#

📎 nn-135ac6a72683.nnue

rocky vigil Mar 12, 2025, 6:39 PM

#

Because I have a class soon

rocky vigil Mar 12, 2025, 6:39 PM

#

round stone is it blocking based on the domain, or the filename extension or what?

I think it is domain of website

#

Some kind of filter

candid ivy Mar 12, 2025, 6:39 PM

#

📎 nn-67beaa6ecf7f.nnue

violet badger Mar 12, 2025, 6:42 PM

#

rocky vigil Fine I’ll figure out phone -> laptop transfer

phone == mobile hotspot ... connect computer to hotspot. No longer blocking domains?

#

you can also just have discord on the laptop...

rocky vigil Mar 12, 2025, 6:44 PM

#

School laptop as well

#

There are a lot of things I can do on this laptop but discord and fishtest are not in that category

#

the web filtering is partially built in as software on the laptop as well

rocky vigil Mar 12, 2025, 6:47 PM

#

round stone ``` Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs...

Anyways STC to fixed nodes difference is less for 512-1024 than 256-512 so I guess there is less slowdown

violet badger Mar 12, 2025, 6:47 PM

#

I see.

candid ivy Mar 12, 2025, 6:52 PM

#

btw that is the profile

rocky vigil Mar 12, 2025, 6:52 PM

#

rocky vigil L1=256 4 thread 662978

L1=512 4 thread 539842

#

Bruh overhead is like 4x actual accumulator updates
