#UE Threat Inputs for AB
spsa should be the last resort imo
it's absolutely detrimental to net training development
True
Stc -15 is not bad
If no spsa we are essentially relying on scaling
hopefully there's good scaling
Issue is master training run pre-spsa also is like -5 elo
well. just test vs that and if it beats it at LTC then spsa
its not good enough rn anyways. needs work
well we need factorizer anyways
fix the factoriser, scale the L1 on the training side. that 30 elo is too low
Yeah let's see how 1280 does w/o factorizer
Maybe it was actually worth a ton of elo
why not add the factorizer
And I am a fool for not figuring out how to add it
what you do is to add an extra 768 inputs at the end of the loaded indices. those are psq features that are active regardless of king position
Btw viren regarding like training longer
Since stages 4/5 don't appear to help much
Do we just try like 1200SB for stages 1, 2, 3?
ig
factoriser has benefits even with unlimited data and should be as simple as adding extra features?
Lemme check how halfkav2hm implements it
Yeah
only thing is to make sure it doesnt clip when it gets combined in main net weights
also modify get_feature_factors in full threats
I thought nnue-pytorch handled the coalescing
well yes but you need to define the mapping
yeah idk how it does things. but in bullet i would need to half the scale
nnue-pytorch doesn't clip ft weights iirc
Oh ok
basically
return [idx] when threats index (no virtual feature)
return [idx, virtual idx] when pst index
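A minimal sketch of that mapping, assuming a hypothetical index layout: threat features first, then king-bucketed psq features, then the 768 virtual (king-independent) psq features at the end. All sizes and the layout are placeholders for illustration and are not taken from nnue-pytorch.

```python
# Hypothetical sizes, chosen only for illustration.
NUM_THREATS = 1000        # stand-in for the real threat feature count
NUM_KING_BUCKETS = 32     # stand-in for the number of king buckets
NUM_PSQ = 768             # 12 piece kinds * 64 squares

PSQ_BASE = NUM_THREATS
VIRTUAL_BASE = PSQ_BASE + NUM_KING_BUCKETS * NUM_PSQ


def get_feature_factors(idx):
    """Return the real feature plus the virtual feature it factors into, if any."""
    if idx < 0 or idx >= VIRTUAL_BASE:
        raise ValueError("expected a real feature index")
    if idx < PSQ_BASE:
        # Threat feature: no virtual counterpart.
        return [idx]
    # Bucketed psq feature: dropping the king bucket gives the shared feature.
    psq = (idx - PSQ_BASE) % NUM_PSQ
    return [idx, VIRTUAL_BASE + psq]
```

With this layout every king bucket maps the same piece-square pair onto one shared virtual row, which is the whole point of the factorizer.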
could use a better system which would also make training faster
ouch
maybe i was too optimistic in how much things were worth
or like i thought we could get 80% speed of master
when it turns out it's closer to 70%
that 10% difference is indeed 20 elo
☹️
are 533-547 pasted from halfkav2hm? if so then lgtm
yeah
lmao
decided that indexing the features probably costs 0 speed
in comparison to actually training
bruh you were right about not being able to add multiple features
like this?
Lgtm
I'm a bit lost on where the test vs master is, and what has been tested since then
if it's -15 plus 5 elo from smallnet plus some other small speedups plus training improvements that's great honestly
-20 + 5 elo from smallnet + some other small speedups + training
actually on vondele's machines it's like -25 but we don't talk about that
https://tests.stockfishchess.org/tests/view/68eef18c28e6d77fcff9fe6e (test vs master)
btw what on earth is self.get_factor_base_feature("A")
this seems wrong
when used for threats
ah nvm
i read the def
it seems correct
extremely overengineered though
pushed to branch
@frosty imp can you give it a try (the factorized features)
https://www.sp-cc.de/ Unfortunately here the TI version is 0 +- 4 elo relative to non-TI
could have been better, but no regression at least, still a partial success
me when reading this:
so stage 5 finished as well, maybe another 1 Elo or so..
bring back the 13 stage training for threat inputs 
That's 8 more Elo it seems..
currently training are l1=128 and l1=1280. The former can be used if there would be a need for a threats smallnet (idk), and the latter might give a hint on larger sizes, even though it is a small increment.
now, at scale this might look already interesting..
--------------------------------------------------
Results of master vs patch (60+0.6, 288t, 16000MB, UHO_Lichess_4852_v1.epd):
Elo: -5.79 +/- 18.13, nElo: -16.23 +/- 50.76
LOS: 26.54 %, DrawRatio: 74.44 %, PairsRatio: 0.77
Games: 180, Wins: 45, Losses: 48, Draws: 87, Points: 88.5 (49.17 %)
Ptnml(0-2): [0, 13, 67, 10, 0], WL/DD Ratio: 1.09
--------------------------------------------------
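The Elo and nElo figures in that result block can be reproduced from the Ptnml counts alone. This is a sketch of the usual logistic Elo and pentanomial variance formulas, not the script that produced the output; the function names are made up here.

```python
import math


def elo(score):
    """Logistic Elo difference for an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def pentanomial_stats(ptnml):
    """Elo, 95% half-width and normalized Elo from game-pair counts.

    ptnml[i] counts the game pairs that scored i/2 points out of 2.
    """
    pairs = sum(ptnml)
    scores = [i / 4 for i in range(5)]           # per-game score of each bucket
    mean = sum(n * s for n, s in zip(ptnml, scores)) / pairs
    var = sum(n * (s - mean) ** 2 for n, s in zip(ptnml, scores)) / pairs
    half = 1.959964 * math.sqrt(var / pairs)      # 95% interval on the mean score
    err = (elo(mean + half) - elo(mean - half)) / 2
    nelo = (mean - 0.5) / math.sqrt(2 * var) * 800 / math.log(10)
    return elo(mean), err, nelo


e, err, nelo = pentanomial_stats([0, 13, 67, 10, 0])
print(f"Elo: {e:.2f} +/- {err:.2f}, nElo: {nelo:.2f}")
```

With only 90 game pairs the interval is several times wider than the measured difference, which is why this run only works as a teaser.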
hmmm
just a teaser I think ...
what kind of system has so many threads wtf
fitbit
right...
ouch
that's at c2396284 ... so shawn's nn-598188c9a702.nnue branch
i mean the stage 5 is only like 2 elo (maybe)
plus small net another 5..
hmm I won't have a gpu computer around soon
https://github.com/xu-shawn/Stockfish/pull/12 @frosty imp I suggest you test this on your hardware first before merging cuz I am not super confident in this one
maybe after this one we can reprofile
what be this :skull:
ok
idk what went wrong
I submitted from phone xd
xd
i can also give it a test in my
or hangon
i have -3% but uh yeah
big noise
ig
let's just see how fishtest works
it's over
guys i'm not getting something
when we add or remove a piece from a square, we have to recompute the attacks through / blocked by it
the idea here is, on captures, we remove from a square then add back to that same square
so we prevent any recomputation and also reduce the number of extra updates to do
but,
this does more updates than the current best TI branch
???
i call this operation a piece mutation, because this doesn't affect sliders
yeah exactly so taking into account piece mutations makes us have more updates somehow xd
so i have add-piece, remove-piece, mutate-piece
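A toy model of why mutate-piece can save updates, assuming a single rank where every piece is a rook-like slider and a "threat feature" is the tuple (attacker square, attacker, target square, target). It counts toggled features by diffing full threat sets, which is not how the engine does it incrementally; it only illustrates the counting argument.

```python
def threats(board):
    """Threat set on one rank: neighbouring occupied squares attack each other."""
    result = set()
    squares = sorted(board)
    for a, b in zip(squares, squares[1:]):
        result.add((a, board[a], b, board[b]))
        result.add((b, board[b], a, board[a]))
    return result


def count_updates(board, steps):
    """Apply steps in order; return (final board, threat features toggled)."""
    board = dict(board)
    total = 0
    for op, sq, piece in steps:
        before = threats(board)
        if op == "remove":
            del board[sq]
        else:                       # "add" and "mutate" both write the square
            board[sq] = piece
        total += len(before ^ threats(board))
    return board, total


start = {0: "wR", 3: "wR", 5: "bR", 7: "bR"}
# The white rook on square 3 captures the black rook on square 5.
naive = [("remove", 3, None), ("remove", 5, None), ("add", 5, "wR")]
mutated = [("remove", 3, None), ("mutate", 5, "wR")]

print(count_updates(start, naive)[1], count_updates(start, mutated)[1])
```

In this position the naive sequence toggles 18 features and the mutation sequence 14, because the captured square never becomes empty and the sliders looking through it never get re-traced.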
not sure, that doesn't make much sense
I think the idea works, I had that idea as well and only noticed you already had that idea and implemented it before me too late, https://github.com/rn5f107s2/Stockfish/compare/40e85bebee329ac27018bc0ca80e247df80235dd...rn5f107s2:Stockfish:7c47493cd258aa3ff18d092a2ec01e5418eb0cc2 this is my branch, I get an average of
6.74802 updates for your branch
6.11372 for threat_inputs and
5.88847 for mine,
so Im pretty sure it works in theory
I started a test https://tests.stockfishchess.org/tests/live_elo/68f091a228e6d77fcffa0128 but stopped it after I saw you already had this idea
Could someone benchmark this https://tests.stockfishchess.org/tests/live_elo/68f0cc5428e6d77fcffa0196 btw? Im not sure how trustworthy my hardware is, I saw a lot of discrepancy in nps when running the speedup test
that is actually hilarious
actually you could use attackers_to() function for that cant you
oh but you reduce computation
nvm ignore me
OHHHH wait I think I know why yours is better
you remove before swap
I swap before remove
Do you want to run yours then again? Im fine with stopping mine
I think you should keep yours
I got 6.51335 for threat_inputs_4
and 6.79114 on threat_inputs
this is on standard bench
yours: 6.51507
I mean mine is microscopically less cuz I implemented it for castling too
next step is promotions I guess
I tried that but for some reason it changes my bench
I'll look into it tmr
could run a simplification against the smallnet branch
yeah.
did we already run a stage 5 net test, with all improvements so far merged?
(e.g. fixed nodes test against master)
Lemme pr smallnet to Shawn's branch
where's the copy button
@frosty imp https://github.com/xu-shawn/Stockfish/pull/13
this one is also bound to pass.. https://tests.stockfishchess.org/tests/view/68f0cc5428e6d77fcffa0196
this reduction is still good even if it adds bit of extra overhead because it increases the chances that 1280 is better than 1024
you can try this if you get it to work ig
stage 5 not run yet since stage 4 test is ongoing
rn5 speedup not merged yet
ah, you mean on fishtest.
sure. We have fairly good estimates of the stages nevertheless.
let me paste them
oh cool
5: Elo: -27.08 +/- 1.82, nElo: -50.97 +/- 3.40
4: Elo: -25.29 +/- 1.82, nElo: -47.45 +/- 3.40
3: Elo: -30.98 +/- 1.82, nElo: -58.16 +/- 3.40
2: Elo: -33.88 +/- 1.84, nElo: -63.06 +/- 3.40
1: Elo: -74.37 +/- 1.83, nElo: -142.55 +/- 3.40
so that would suggest 4 is right now the strongest.. but within error of that test
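A quick way to check the "within error" remark, assuming the per-stage estimates are independent and the +/- values are all at the same confidence level (neither is stated in the thread):

```python
import math

# (Elo, error) pairs copied from the list above.
stages = {5: (-27.08, 1.82), 4: (-25.29, 1.82), 3: (-30.98, 1.82)}


def differ_significantly(a, b):
    """True if two estimates differ by more than their combined error."""
    (elo_a, err_a), (elo_b, err_b) = a, b
    return abs(elo_a - elo_b) > math.hypot(err_a, err_b)


print(differ_significantly(stages[4], stages[5]))   # stage 4 vs stage 5
print(differ_significantly(stages[4], stages[3]))   # stage 4 vs stage 3
```

Stages 4 and 5 are 1.79 Elo apart against a combined error of about 2.57, so the ordering between them is not resolved, while the gap to stage 3 is.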
testing is end of the pipeline btw https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11736875984
I see
segfault
Bruh
Knew smth like this would happen
Yeah idt that test includes smallnet either
Combined the two are probably worth 10 elo or so
Is it a memory error?
not sure
I'll be back in ~20 min or so
will run a threatnet test after rn5 passes
I mean smallnet
Oh I thought you meant stc estimation vs master
But progress check would be good as well
Which is probably close to -10 at fishtest rn
Yeah
Getting closer to -5
Which is what we can get w/o spsa
Just need factorizer to work…
ok i can start code staring rn
do you know which code the error comes from?
where is sanity checking?
like what does it check
ah it happens at the beginning of training
sanity checking dataloader
oh shoot @frosty imp i might be stupid
forgot to expose the factorized features
ok that should be fixed
didn't find any other errors during code staring
/workspace/nnue-pytorch/training_data_loader.cpp:548:2: error: expected ';' after struct definition
548 | }
| ^
| ;
also needs to expose on line 1269
Can you make the changes locally for now and test whether it works?
I won't be back for a few hours; misread time of event
i'll make a PR. works now as far as I can test
seems that my gpu doesn't have enough memory to run the training
not sure if the preloading goes to the gpu or cpu memory, you can try lowering that
or for testing just reduce l1 ...
or that is probably the easiest
for me l1=1024 seems to need about 6GB GPU mem
ah I merged it now
That also works
what is this speedup test script you have that gives a probability of significance
Merged now
So I guess for the new training run
Can we try 1200 SB for stages 1, 2, 3?
Also
Or will it take too long
https://github.com/hazzl/pyshbench or patched local version of this.
thx!
well we could, but I believe we still have factorizers left to try
yeah
factorizers + longer stage 1, 2, 3
since the threat inputs are sparser
and no good factorization scheme
Anyway... threat input could benefit a lot from fast incremental threat calculation.
Which was implemented in the clockwork HCE engine.
If someone wanna take a look.
we have this, reasonably fast, already
adopting clockwork's scheme is way too much change
though of course more speedups are welcome
Yeah...
not practical short to medium term
https://tests.stockfishchess.org/tests/view/68f15c8128e6d77fcffa030f this has all the improvements right
like if we get another rn5 level speedup we just win
yeah, so far
right
stage 5 net untested
so not in there
vondele says stage 5 is basically neutral (if not slightly worse) with stage 4
so what's next is factorizer
- maybe 50% longer stages 1, 2, 3
idk
It's a fixed nodes test right? not reliable exactly
this is stc
The stage 5 net was tested STC?
maybe better scaling from higher wdl
not on fishtest
vondele did 40k games stc locally
it's in the ci pipeline that the test was made
I see
#1336647760388034610 message
@twilit oriole are you expecting this to be worth the time
so move from 800 SB to 1200
is there a factorizer net in the training pipeline?
Oh u got factoriser working?
it's working now
yeah
did you check like when the loader runs it indeed loads the correct factorized psq
i mean i copy pasted this from halfka code
so it probably just works
no but let me check right now
well i can confirm popcount for sample 0 matches
probably some more effort to verify the psq
https://tests.stockfishchess.org/tests/view/68f0cc5428e6d77fcffa0196 does plentychess have this trick
@stray reef
threat smallnet not great btw https://tests.stockfishchess.org/tests/view/68f1719f28e6d77fcffa0353
gimme a bit, i have set up a framework in desmos that should let me verify each position by hand relatively quickly
I guess let's try LTC then. Perhaps a progression test capstone.
sample 0 correct
sample 1 correct
sample 2 correct
i think this should be pretty good yeah
if only the serializer is deterministic
like i'm pretty sure get_feature_factors works
is there any way for you to isolate it individually and call it
like as a standalone python function
it shouldn't be that hard?
to make
well you have the get_coalesced_ft
no i just wanna be sure get_feature_factors works
which merges virtual weights
this could be an individual function no?
replace the constants
well you already have a featureset object
in the model
so just call feature.get_feature_factors
hangon lemme just quickly test it in an isolated
python file
ok yeah it works on sample 0
i think that's good lol
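For reference, the coalescing step that merges virtual weights back into the real rows can be sketched like this. It is a plain-Python illustration with hypothetical names (`coalesce_ft`, `exceeds_range`), not the nnue-pytorch `get_coalesced_ft` implementation; the second helper covers the clipping worry raised earlier.

```python
def coalesce_ft(weights, num_real, factors_of):
    """Fold virtual-feature rows into the real rows that reference them.

    weights    : list of rows, real features first, virtual features after
    num_real   : number of real features at the start of `weights`
    factors_of : maps a real index to [idx] or [idx, virtual_idx]
    """
    coalesced = []
    for idx in range(num_real):
        row = [0.0] * len(weights[idx])
        for src in factors_of(idx):
            for j, w in enumerate(weights[src]):
                row[j] += w
        coalesced.append(row)
    return coalesced


def exceeds_range(rows, limit):
    """True if any coalesced weight would clip at +/-limit on export."""
    return any(abs(w) > limit for row in rows for w in row)


# Toy net: two threat rows, two bucketed psq rows, one shared virtual row.
weights = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0], [4.0, 0.0], [10.0, 10.0]]
merged = coalesce_ft(weights, 4, lambda i: [i] if i < 2 else [i, 4])
print(merged, exceeds_range(merged, 12.0))
```

Checking the merged rows against the export range is cheap, so it is worth doing after coalescing rather than assuming the sum of two in-range weights stays in range.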
@violet badger any chance to get a factorized run in the near future?
yep, did some final sanity checks just now
so am pretty confident it works
U have to remember the threats act as a pseudo factoriser it's not so simple
true
ah right
ngl i was definitely too optimistic about the gains
the actual results are a lot closer
speaking of scaling we should have 1280 in a couple days
or like, stage 3 of 1280 in like half a day
true.
we should port SF NNUE to clockwork
@stray reef do you think using your lookup scheme to index threats would be measurably faster?
I have a smol idea I will try later
also rn5 that speedup is :xdd:
sure, if it is working. Can start later this weekend, would need the sha to insert in the threats recipe. I'll also see if I can slightly improve the current recipe by varying its lr/gamma.
latest commit here should be working https://github.com/sscg13/nnue-pytorch/tree/threat-inputs
maybe we can try horizontal mirroring for the threat inputs? not sure how worth it would be
I believe we already do that
oh oopz
guys latest bench is wrong
You guys anywhere close to making Threat inputs stronger than main yet? How many elo off are we?
15ish I think?
not yet, will implement today
i think it was maybe 3 STC elo
worth trying
zack
https://furybench.com/test/3372/
not yet merged tho (2+0.02)
I see, nice
idk why i never thought about just using magics for that one part instead of keeping all threats, like in rn5s patch
I mean incremental is just intuitively faster lol
-10 yoo!
So what does UE Threat Inputs actually mean, and how is it better despite, to my understanding, being slower? Sorry I'm not a dev lol
The network explicitly receives all piece interactions (threats on enemies, defenses on own pieces) as input, so it has a lot more important information directly available
It is a lot stronger given equal nodes for this reason, but since keeping the threat information up to date is relatively expensive, it's slower overall
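To make "piece interactions as input" concrete, here is a small sketch that lists (attacker, from, target, to) tuples for a position. It only models knights and rooks, uses a made-up board representation, and says nothing about how Stockfish actually indexes these features.

```python
KNIGHT_STEPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
ROOK_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def on_board(sq):
    return 0 <= sq[0] < 8 and 0 <= sq[1] < 8


def threat_features(board):
    """List every piece-attacks-piece pair, own pieces (defenses) included.

    board maps (file, rank) -> piece string such as "wN" or "bR".
    """
    feats = []
    for src, piece in board.items():
        if piece[1] == "N":
            for df, dr in KNIGHT_STEPS:
                dst = (src[0] + df, src[1] + dr)
                if dst in board:
                    feats.append((piece, src, board[dst], dst))
        elif piece[1] == "R":
            for df, dr in ROOK_DIRS:
                dst = (src[0] + df, src[1] + dr)
                # Slide until the first occupied square or the board edge.
                while on_board(dst) and dst not in board:
                    dst = (dst[0] + df, dst[1] + dr)
                if on_board(dst):
                    feats.append((piece, src, board[dst], dst))
    return sorted(feats)


# White rook a1, black knight a5, white knight b3.
position = {(0, 0): "wR", (0, 4): "bN", (1, 2): "wN"}
for feat in threat_features(position):
    print(feat)
```

The slider loop is also where the cost comes from: moving one piece can change what every rook, bishop or queen looking through that square attacks.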
what's the distribution of the threat weights?
I'm wondering whether a light quantization on the threat weights would be less detrimental than quantizing the main NN
Well, quantization saves memory, but adds cost in interpretation.
So, not sure if it helps.
I'm pretty sure add/sub is grotesquely memory bandwidth bound
but idk I'll try it out at some point
Like... even if you imagined that you could losslessly compress the data... should you?
we'll find out
I tried running sf17.1 with this new nnue file, but I got this error in arena:
I mean... I tried clustering the weights with k-means. So, each weight would be represented by something else.
hm
So, instead of storing each weight, we store cluster weight and the cluster membership of each one.
the lookup table would be fairly expensive tho
But since my re-quantization code was botched, it got like -1000 elo at fixed node.
yeah you'll have to build the dev branch yourself currently. sf 17.1 does not implement the new architecture
I mean... instead of feature_length times the input, it's feature length times cluster plus one index per input.
What's expensive about it?
Though my quantization code is botched currently so I can't try.
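The storage argument for clustering can be written down directly. The sizes below are illustrative placeholders, and the sketch assumes whole weight rows are clustered, with one 16-bit cluster index stored per input feature.

```python
def dense_bytes(num_inputs, l1, bytes_per_weight=2):
    """One full weight row of length l1 per input feature."""
    return num_inputs * l1 * bytes_per_weight


def clustered_bytes(num_inputs, l1, num_clusters,
                    bytes_per_weight=2, bytes_per_index=2):
    """One row per cluster plus one cluster index per input feature."""
    return num_clusters * l1 * bytes_per_weight + num_inputs * bytes_per_index


# Illustrative sizes only.
print(dense_bytes(80_000, 1024))
print(clustered_bytes(80_000, 1024, 4096))
```

The table itself shrinks a lot, but every feature update now goes through an index lookup before it reaches a row, which is the extra cost mentioned above.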
Do I have to build sf17.1 using this repository? https://github.com/xu-shawn/Stockfish/tree/threat_inputs
https://tests.stockfishchess.org/tests/view/68f1f57728e6d77fcffa0416
@frosty imp @regal steeple trying this rn
Did I just get rate limited for checking out the website too much? LOL
happens when there's a test with a big diff
Considering threat input adds significant complexity, what kind of elo do we need to merge?
5 elo maybe?
Yay, I successfully built SF using the threat_inputs branch as suggested, but I was wondering why the default nnue file is different from the current 'best' that y'all told me to download: the default is nn-bf4519f857f4.nnue, and you got me to download nn-598188c9a702
ahh, it seems like the default hasnt been updated on the github yet, even though the new 598188 is +2 elo on fishtest: https://tests.stockfishchess.org/tests/live_elo/68efb98928e6d77fcff9ffd9
Does Stockfish TI training data include positions with check?
afaik vondele just uses the (almost)master net training pipeline, so no
doesn't matter anyway since the code hasn't been changed to call eval in check positions I guess?
https://furybench.com/test/3386/ rn5's speedup in plenty (needed to change back some stuff that already relied on threat information)
not 2+0.02 this time but 8+0.08
Precisely what mattered. I was thinking of testing if I could remove that if threat input became good enough at detecting random checks.
lel
something for later imo
bruh just wait till we run this on fishtest
9 games sssposting is crazy
It's current. Maybe things will change.
no way, it will end at 247.9 elo
it's not even an even number of games
Having some 100-games sanity test... not saying that they would be significant in the grand scheme of things.
but rather useless
Want to see it playing, except I don't even know how well it plays because Stockfish plays are inscrutable to mere mortals like me.
It is easier to quantise a threat net yes. Already we (monty) did this, with i8 weights in the FT
No. Already been tested. This is a waste of time
Oh... I see.
yeah I recall linrock trying that
Can we not fill the thread with noise like this. Quantising the threat net further than the main net absolutely is promising
The way we did it is just use i8 FT weights but keep the calculation in i16. This just halves the mem bandwidth used to fetch FT weights and works very well
Was only -5 fixed nodes for us
how much faster?
https://tests.montychess.org/tests/view/68b5d37756f229dd4390d7a1 and this is with a relatively small value net which took only a fraction of total time
I guess it helps even if the weights are in L3 cache
The value L1 there is 3072 so it's pretty good indication it will work at least for SF
Rounding the weights when quantising them is crucial
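A small illustration of why rounding matters, assuming float weights are scaled and clamped to the i8 range. The scale of 64 is a placeholder, and note that Python's `round` resolves exact ties to the even integer.

```python
def quantise(weights, scale, rounding=True, lo=-128, hi=127):
    """Quantise float weights to clamped integers with the given scale."""
    out = []
    for w in weights:
        x = w * scale
        q = round(x) if rounding else int(x)    # int() truncates toward zero
        out.append(max(lo, min(hi, q)))
    return out


def mean_abs_error(weights, quantised, scale):
    """Average reconstruction error after dequantising."""
    return sum(abs(w - q / scale)
               for w, q in zip(weights, quantised)) / len(weights)


ws = [1.75 / 64, -1.75 / 64]
print(mean_abs_error(ws, quantise(ws, 64), 64))
print(mean_abs_error(ws, quantise(ws, 64, rounding=False), 64))
```

Truncation also biases every weight toward zero, so the error does not average out across the many rows summed into an accumulator.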
okay cool
honestly memory bandwidth is not an issue with mmap is it
well the linked monty test is with mmap
halving the number of bytes loaded is beneficial regardless
passed with 3.26 +- 2.04 (95%)
I am pretty sure i8 works
At least a vast majority should work
Oh viren already said so
If we can get it to work and add sub is mem bandwidth bound as you said we should be able to shave off a significant amount of the 25%(?) runtime that it currently uses
btw shawn i think https://tests.stockfishchess.org/tests/view/68ef157828e6d77fcff9fead could also be merged
actually question
i8 quantizing threat weights means you should not do the x2 scaling
on loading the weights
@prime mica 81765782 weights losslessly quantizable to i8 out of 81772544
aka close to 99.99% of them
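A sketch of that kind of count, assuming "losslessly quantizable" means the stored integer already lies in the signed 8-bit range (the exact criterion used is not stated here):

```python
def count_i8_representable(weights):
    """Count integer weights that fit in a signed 8-bit value unchanged."""
    return sum(1 for w in weights if -128 <= w <= 127)


fits, total = 81765782, 81772544        # figures quoted in the thread
print(f"{100 * fits / total:.4f}% fit, {total - fits} would clip")
```

The absolute number that would clip is worth keeping in view next to the percentage, since those are likely to be the largest-magnitude weights.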
Yeah but what's getting clipped are the most important weights :p might need training scale change
Not really, you can always scalar multiply the eval
that works?
there is also this
i guess the solution is to do (psq + 2 * threat)
We just i8 quantised the entire FT
ah interesting
since our accumulators are separate we could keep i16 for psq
if needed
i.e. this is only out of the threat features
if i ran psq features probably would be nowhere near as good
There may be a problem because the multilayer is quantised so aggressively it makes it harder to quantise the FT. Because we have the inverse, can't do fast multilayer after quantising the FT to i8
i mean it is definitely worth a try to quantize threat only
how much of a bottleneck of memory bandwidth are we talking about?
in comparison to compute
Don't have comparable numbers because we don't have UE
would it be worth it to try and decompress 8 bit format live
No
ok
actually lemme check with the other net
this is bf4
it shouldn't make a difference
Well I would just test the inference speedup at least. I expect significant but single digit %
this includes psq quantization right?
Yes but multilayer far less quantised
i guess we'll see later
btw is there a way to disable the net sha check on comp
it takes like 5 seconds
which adds up :p
https://tests.stockfishchess.org/tests/view/68f23eac28e6d77fcffa04b3
I think this should be compatible with all pending patches, could someone approve?
81759318 weights losslessly quantizable to i8 out of 81772544 nn-598
viren could (approve)
i think
if he's here
gonna give this a quick test
actually leb makes this annoying
gimme a second to write it in (little) endian
I missed that updates can still be fused if the moved piece was not taken, submitted an improved version
https://tests.stockfishchess.org/tests/view/68f244ee28e6d77fcffa04bc
actually how do you get the vec_t to interpret it as i16
if it's i8 originally
to my understanding reinterpret_cast will just fill it with 2x the amount of i8 values
cvtepi8_epi16
ok
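The difference between reinterpreting the bytes and widening them can be shown without SIMD. This uses Python's `struct` module on four example bytes; the widening line mirrors what a cvtepi8_epi16-style conversion does to each lane.

```python
import struct

raw = bytes([0x01, 0xFF, 0x7F, 0x80])     # four i8 weights: 1, -1, 127, -128

# Reinterpreting packs two i8 values into each little-endian i16.
reinterpreted = list(struct.unpack("<2h", raw))

# Sign extension widens every i8 into its own i16 lane.
widened = list(struct.unpack("<4b", raw))

print(reinterpreted)
print(widened)
```

So a reinterpret gives half as many lanes with meaningless values, while the conversion keeps the lane count and the values but needs twice the register width for the result.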
https://tests.stockfishchess.org/tests/view/68f20dab28e6d77fcffa045a stopped this, read comment for info
bwoahohohohohoh
huge speedup
how do you do this in monty? I'm losing like 5% with all the mm_cvtepi8_epi16s and stuff
like the i8 to i16 simd conversion
just convert to i16 at startup to save some instructions

actually from startpos it seems to be a few %
but like
i am def doing it wrong
bc the pv is cooked
bahhh rust makes this so different
Nodes/second : 1029999 for current vs Nodes/second : 1021921 for i8 test
so um
i am not doing this correctly
anyways i leave it to simd experts to try this later
a lot of things are huge speedups if you accidentally confuse patch and master 
wait cj does this actually work
ah because enemy only matters when the piece types are the same
nice find
btw it appears that "real compression" performs better on threat inputs
i.e. zipping master net gives 60 MB, and zipping l1=1024 threat net gives 60 MB as well
asdadasds l1=1280 too large for fishtest
btw 1280 is like, 15% slower
than 1024
from bench
lemme try speedtest
ok bench is just slow with 1280
Nodes/second : 933288, ie 10% slower
```
... Stockfish TI-1280 playing White: 145 - 76 - 279 [0.569] 500
... Stockfish TI-1280 playing Black: 77 - 128 - 295 [0.449] 500
... White vs Black: 273 - 153 - 574 [0.560] 1000
Elo difference: 6.3 +/- 14.0, LOS: 80.8 %, DrawRatio: 57.4 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
1000 of 1000 games finished.
```
stage 4 or 5 (or factorizer) need to bring serious improvement if we want this to be viable
Fixed nodes?
how about 896 HL
let's wait for 1280 to finish
surely if 1024->1280 is bad then 1024->896 is good
and then see
maybe stage 4 is where the breakthrough happens
larger net being slower to train and all that
also https://tests.stockfishchess.org/tests/view/68f091a228e6d77fcffa0128 should help with this
if it adds overhead but decreases avg threat features updated
btw shawn are you going to merge removing fusing
merged
(fix bench)
Those error bars are too large. A 10 Elo fixed nodes improvement is fine
https://furybench.com/test/3274/ reminder this is what a 25% HL increase looked like for plenty
yeah bc i'm running this on my laptop only
not vondele's 288 core fitbit or whatever he uses to do the local tests
yes but your statement that a "serious improvement" is needed is incorrect
strange
also the ltc was neutral
i'm pretty sure
https://furybench.com/test/3273/ ye merged on scaling vibes
again let's wait for stage 4/5
It's a 6 +- 14 test. A serious improvement to the number of games is what's needed, is my point lol. There's no indication it is underperforming where it needs to be
sure
would've put it on fishtest to check stc but we hit the size limit
will it require me getting the net and all that crap. cos im doing through ssh takes time
.
no auto download
ah
well its fine if i can just wget it or smth
got 384 thread machine free in an hour, when will branch be done?
ig I can just do the TC tests also
link
cool
well changing the limit is just changing a number in the nginx config on fishtest server
its easy to do
can u make/link branch for 1024 net also
passed against main but didn't wanna waste games against L1=512
fair lol
cool
btw idea
like to reduce also i.e. lichess load
oh wait it's not compatible with mmap
maybe we do it for the releases only
factorizer is surprisingly not that much of a slowdown
so far at least
still holding in mid-80 its/sec
ah have you had a chance to run fixed nodes / tc
or are you going to wait for stage 4
Wait probably. Should ask @violet badger to increase the net size limit on fishtest server also
#1336647760388034610 message
Ah nice u already read lol
well not merged and not me.. but yes
doesn't adjust the memory estimates for the workers though so is not fully complete.
I think there's some nginx limit or smth to adjust also. Outside of PR
yeah ppigazzini is the fishtest maintainer right
yes, and indeed could need a bit more.
https://tests.stockfishchess.org/tests/live_elo/68f2499a28e6d77fcffa04c6 ... seems promising?
so that's on top of what the 10k STC test tested, right..
yes
would imply near parity with master..
once this finishes i plan to do both stc / ltc 10k games
Hmm? Why not SPRT against master right away? After it finishes?
If it gains now, then it ends here.
hmm
not expecting pass @ stc yet
ltc would be different story
idk what others think but i only wanted to do sprt vs master once we iron everything out and finalize it
since that will include a LTC SMP
I agree with finalizing before doing more advanced tests.
I expect also some net improvements could be found.
it is expected* to scale with both time and threads (*known in plentychess, we'll see how it goes in stockfish relatively soon, it seems?)
yeah patience is good
Yes... there's a good chance we could gain now at LTC.
But we could finalize things first.
things are moving very fast actually
I also still see warnings like:
position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
1060 | Bitboard threatened = ray & qAttacks & occupied;
| ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
1030 | Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
| ^~~~~~~~~~
position.cpp: In instantiation of 'void Stockfish::Position::update_piece_threats(Stockfish::Piece, Stockfish::Square, Stockfish::DirtyThreats*) [with bool put_piece = true]':
position.h:346:35: required from here
position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
1060 | Bitboard threatened = ray & qAttacks & occupied;
| ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
1030 | Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
| ^~~~~~~~~~
position.cpp: In instantiation of 'void Stockfish::Position::update_piece_threats(Stockfish::Piece, Stockfish::Square, Stockfish::DirtyThreats*) [with bool put_piece = false]':
position.h:353:36: required from here
position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
1060 | Bitboard threatened = ray & qAttacks & occupied;
| ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
1030 | Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
| ^~~~~~~~~~
probably easy fixes
in comparison to the 6 month hiatus before
Because once it passes, there would be more things to do like search tune and so on that could lock in the decisions.
Though I'd agree we should finalize things first.
So we don't waste compute on VVLTC search tunes on incomplete net.
our goal is to pass this (preferably, with a comfortable gain) without having to resort to spsa
which looks a lot more doable with the new speedups
https://tests.stockfishchess.org/tests/live_elo/68f091a228e6d77fcffa0128 has been flying under the radar
but it might also help with increasing L1
according to data above this reduces the avg. number of features updated which is more improvement with higher L1
hard to track what is actually the most up-to-date branch
i think shawn has been trying to keep updated as tests pass
Someone mentioned that like 99+% of the feature transformer weights losslessly compress to i8.
Though the rest could remain problematic.
either clamping it directly does not work, or I have done something wrong elsewhere in the inference
in any case I gave it a quick try and got ~~neutral in speedtest
though someone more knowledgeable with simd could of course do better I presume
anematode is probably the most knowledgeable we have at this.
it helps with memory bandwidth yes, but adds many extra cvtepi8_epi16 calls, unless I am missing something
and also an extra addition pass
because the x2 trick doesn't work anymore
But then... what about our ambition to deduplicate the net? Now, do we have to start over again?
maybe instead doing i8 * (i8 = 2) -> i16 multiplication instead of this works better
these things will get sorted eventually, it is more or less orthogonal to the threats arch.
I'm glad the net deduplication didn't contain a bunch of arch-specific hacks to get it working.
i think threats make it more possible
since you gain more speed ostensibly
and less quantization penalty
since the threat feature weights are much less in abs value
Oh, it already gained like 40 elo in fishtest condition.
But there were some problems that made it not viable to be merged right away.
Oh, I see.
this is not really "40 elo improvement" (in the normal sense) but rather "tests can be done X% faster"
I'm tracking who has contributed, we are gonna end up with a PR with 20 coauthors or smth lol
i think it's 10 so far right?
Nah it's way more
really
There's a lot of ppl that contributed earlier who are no longer active here
But I remembered
I have u, jw, (ravenslofty - since yoshie's impl is inspired from yukari), yoshie, disservin, linrock, vondele, me, shawn, rn5, cj
were there others?
your memory ig is better than mine lol, bc I definitely forgot over 6 months
Though... well... this is Stockfish. 20+ people on months of work for maybe like 10 elo gain.
More like 2 months work
so hyped
lol
And now that we have our training infrastructure back up, we could try new archs.
there are a lot of cool ideas, note that a full net still takes several days though...
I've calculated that a subnet made up of only pawn/minor piece/major piece/diagonal piece inputs all could be cached with like 80% hit rate if not more.
so right now what will happen is basically experimentation with L1 and training schedule
https://github.com/official-stockfish/nnue-pytorch/pull/352 needs debugging ..
not really big arch changes
would speedup testing/training 2x.
thought it required 2x the hardware :p
(I think the actual data loader is probably OK; but something else is not).
HW is there.
With 2048 cache entries, a subnet with only pawn or minor pieces could be cached with about 90% hit rate.
Well, not sure what can we make of it.
this probably belongs in a separate like thread
let's wait, there are more important things for now
anyways if we merge this into master this thread will be abandoned and further discussion will just be in nnue-dev
Yeah...
anyways there's not much to do now while waiting, if someone wants to try a potential improvement would be to update the threats lazily as well, since we don't use them for anything other than the nnue
but this is a lot of effort
should be
regarding the i8 quantization, it'll probably be pretty machine dependent whether it's faster or not
but ideally it can just be optional
while maintaining a consistent bench
this can only be the case if we force the i16 version to also abide by the i8 limits no?
which will incur some (hopefully minor) loss
right
would be a balancing act
although, if the exceeding elements are extremely rare, we can do a scalar cleanup for those rows
without much overhead
(I tried this with the main net but the exceeding elements are too common for it to work)
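The scalar-cleanup idea can be sketched as splitting each row into a clamped i8 part and a short list of corrections. Function names are made up, and the first loop in `add_row` stands in for what would be a vectorised pass.

```python
def split_row(row, lo=-128, hi=127):
    """Split an i16 row into a clamped i8 row plus sparse corrections."""
    clamped, fixes = [], []
    for j, w in enumerate(row):
        c = max(lo, min(hi, w))
        clamped.append(c)
        if c != w:
            fixes.append((j, w - c))      # residual lost by clamping
    return clamped, fixes


def add_row(acc, clamped, fixes):
    """Accumulate a row: bulk pass over i8 values, then scalar cleanup."""
    for j, c in enumerate(clamped):       # stands in for the SIMD pass
        acc[j] += c
    for j, residual in fixes:             # rare, so handled one at a time
        acc[j] += residual
    return acc


row = [5, 300, -200, 127]
clamped, fixes = split_row(row)
print(clamped, fixes, add_row([0, 0, 0, 0], clamped, fixes))
```

This keeps the result bit-identical to the i16 net, so the bench would not change; whether it pays off depends on how often the corrections list is non-empty.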
The i8 speedup is more so on high thread counts
And makes net smaller
Smaller than master even I think
on my machine, I got a speedup even single threaded
but my computer has proven very weird in terms of perf characteristics
so probably wouldn't generalize
zen5 is however modern, and we should prioritize moving forward IMO.
how modern are the majority of fishtest workers btw
Don't like this tbh, a big benefit is smaller net
less memory pressure on fishtest workers should help a lot as well yeah
especially with mmap
smaller as in the distributed binary, or smaller as in memory footprint?
Binary size
I see
I mean if you're willing to not use LEB128 then you could get it nearly the same size, e.g. use -127..=127 as literal i8 values and -128 as a prefix byte
but yeah there is an elegance about making it all i8
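A minimal sketch of that prefix-byte encoding, assuming the escape byte -128 is followed by the full value as a little-endian i16 (the message above does not say what follows the prefix, so that layout is a guess). `encode` and `decode` are hypothetical names.

```python
import struct

ESCAPE = -128  # prefix byte meaning "a full i16 follows" (assumed layout)

def encode(values):
    """Encode i16 values: one byte if in -127..=127, else ESCAPE + 2 bytes."""
    out = bytearray()
    for v in values:
        if -127 <= v <= 127:
            out += struct.pack("<b", v)
        else:
            out += struct.pack("<b", ESCAPE)
            out += struct.pack("<h", v)
    return bytes(out)

def decode(data):
    """Inverse of encode."""
    values = []
    i = 0
    while i < len(data):
        (b,) = struct.unpack_from("<b", data, i)
        i += 1
        if b == ESCAPE:
            (v,) = struct.unpack_from("<h", data, i)
            i += 2
            values.append(v)
        else:
            values.append(b)
    return values

weights = [0, 5, -127, 127, -128, 300, -32768, 32767]
blob = encode(weights)
```

Note that -128 itself has to take the escaped path, since the byte value is reserved as the prefix.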
Not a major concern I think, they already use a custom binary
how does stockfish wasm even work? I don't see anything webassembly-specific in the repository...
Ohhh ok
They zstd compress the net and that iirc
oh ok if they do it on their side it's good
it shouldn't impact them too much then
threat inputs compress better which will offset the raw size (assuming i16)
though i8 is obviously preferable
i16::from
It just works™
This seems like the kind of thing that definitely doesn't need manual simd
honestly to get around mulhi trick
i think if we wanted to do that
i8 * (i8 = 2) -> i16 mul is better
do we coauthor everyone or just code contributors?
because non-code contributors were not usually coauthored
but mentioned in the PR
sscg13, shawn_xu, cj5716, Yoshie2000, Viren, jw, vondele, anematode, rn5f107s2......who else?
for this I count sscg13, shawn_xu, cj5716, rn5, Yoshie2000, vondele
many more would need credits in the PR
yeah it might depend on how we do coauthor/credit split
a lot of ppl need credits yeah
I guess disservin actually contributed to the original threat-inputs branch
I didn't contribute lol, just kibitzing
Don't need to be posting lists lol. I already have it and none of the posted ones are complete anyways
yeah let's figure this out at the end
are we getting too ahead of ourselves here 
indeed
All we need to agree, I get to make the PR 
though i wouldn't think it is wrong to be feeling pretty good about it
the scary time was before rn5 speedup
dumb question, why isn't threat information (or something equivalent) already encoded somehow in the main network through training
can we clean up the different horizontal mirroring scheme btw
yeah i stuck to bullet mirroring bc i was too paranoid about it
well the network can't extract those info efficiently
gotcha ok
it's really hard to generalise that information just from the PST inputs without a lot of layers
this same argument can be used to "remove" king buckets
true
conversely, you cannot figure out psq information from threat information
the reason why this stuff doesn't work is like
you have to think of nnues as not really deep networks
so they are very contained by the additive structure
of the first layer
which carries most of the info
part of why I wanted to play with threat inputs is that I always wanted to wire the attack table information I already had into the eval
and up until yoshie cracked it, I was the only AB engine that could do so without being majorly crippled in performance (though I did need to quarter my net width)
yeah cj speedup was displaying insane sss numbers earlier, in actuality it's probably closer to 5 elo which is what is predicted from the speedup amount listed
we still have work to do
yeah 5-10 Elo still needed I think
how would you want me to do it
it is relatively easy for the already trained nets, by repermuting the FT
i'll do it when we get closer to passing i think
yeah and also on the trainer side ig
Maybe someone could also use something similar to splat_moves to update threats faster?
Maybe...
Though the ideal byteboard... would be pretty hard.
I mean, do threat updates still take a serious amount of time after cj/sscg/shawn's work?
the big gain still todo
is compute threat updates lazily
overall it should still be like high single digit % of runtime
Yeah... I know, but each threat update depends on the board state... and updating the entire thing means... well... wait... tracking every added and removed piece? It could be as much as recomputing the entire threat...
Could be faster... IDK.
update_accumulator_incremental
this is cj's branch btw
I'll try the i8 compression on ur branch later today
uh oh
info depth 39 seldepth 56 multipv 1 score cp -190 nodes 56649998 nps 1583021 hashfull 515 tbhits 0 time 35786 pv f8f5 b1c2 a6a5 d7d8 g8g7 d8a5 g7g8 a5a6 g8g7 c2c3 g7f7 c3b4 f7g7 b4a3 g7h7 a6a7 h7h8 a3a2 h8g8 a2b1 f5f1 b1c2 f1f5 a7b8 g8g7 c2c3 g7h7 b8c8 h7g7 c3b4 g7h7 b4a3 h7g7 c8b8 g7h7 a3a2 h7g7 b8d6 g7h7 a2a3 f5f7 a3b4 f7f5 d6g3
info depth 40 currmove f8f5 currmovenumber 1
stockfish: nnue/nnue_accumulator.cpp:115: void Stockfish::Eval::NNUE::AccumulatorStack::push(const Stockfish::DirtyBoardData&): Assertion `size + 1 < psq_accumulators.size()' failed.
position fen r4rk1/1b2bp2/p2p4/1p3pNp/4P2P/1P1Q1Pq1/1P6/1K1R2R1 b - - 1 25
go
Which branch is that
wait wrong fen
position fen 5rk1/3Q4/p5p1/1p5p/8/1P6/1P6/1K6 b - - 0 37
go
Huh how does it have anything to do with threats
What the heck kek
will bisect in a moment
https://tests.stockfishchess.org/tests/live_elo/68f2c1e828e6d77fcffa057d
Need to be quick and get my patch in as well!
how often does this trigger
About 12%.
Interesting
Could there be even more if we sorted them
Each eliminated pair is worth quite a lot
The data is quite structured.
As in, the first few almost always pair with the last few.
In order.
But I'm not sure if the cost of checking would outweigh it so I only check one pair.
gotcha
How did it go
which commit is this
cj ongoing test
oh ok
so it looks like tracking and indexing threats each take up ~5% of the runtime
after cj speedup
so lazy tracking could be a couple % gain
Though... well... there's this patch that needs approval. This one is low-hanging fruit.
actually why is it that we get duplicate features in both added and removed
Well, I think the feature got added in one move but then it didn't use the net so it got removed later on.
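For illustration, a small sketch of cancelling indices that show up in both lists before touching the accumulator. This is a full-cancellation version, not the one-pair check from the patch under test, and `cancel_common` is a hypothetical name.

```python
from collections import Counter

def cancel_common(added, removed):
    """Drop feature indices present in both lists, respecting multiplicity."""
    common = Counter(added) & Counter(removed)  # per-index min of both counts

    def strip(items):
        left = Counter(common)
        kept = []
        for x in items:
            if left[x] > 0:
                left[x] -= 1  # this occurrence cancels against the other list
            else:
                kept.append(x)
        return kept

    return strip(added), strip(removed)

added, removed = cancel_common([10, 42, 7], [42, 3])
```

Whether this is a net win depends on whether the scan costs less than the accumulator rows it saves.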
huh shouldn't incremental always be one-move updates
strange
stage 1 validation loss 0.00305 w/ factorizer compared to 0.0031 from old run
not sure how much this can be read into
It's one-move, but sometimes the eval is skipped.
Has anyone added the weight permutation or something to the system?
the only weight permutation to be done is re-indexing the threats to be efgh mirrored (so as to remain consistent with the psq mirroring)
this does not functionally change the evaluation, so I'm delaying it to finishing touches
hmm, I thought the incremental updates were done move-by-move though
... I won't question it
let's see how it fares on fishtest
how big are added / removed on average? if intersection is significant it might be worth it to search for cancellations
yeah
old profile
new profile
old and new together 
total delta averages ~7 I think
goat
when you add a piece then remove it then there's an extra + and - from the slider attacking the piece behind it
I wonder if you can do something about it then?
just hardcode threat updates for each movetype maybe?
https://tests.stockfishchess.org/tests/view/68f2c1e828e6d77fcffa057d so is this neutral?
it reduced updates by ~0.7 on avg I think
can we test this on top of the other speedups
that directly interacts with rn5's patch lol
@lofty cedar
Well a dbg on repeat would be useful ig
correction: 0.3
ok well gg I guess
@naive comet plz pr
What interaction?
It looks like my patch speeds up more than I expected.
It measured barely 1% on my machine.
btw while you wait for this I'm going to start both stc + ltc vs master
Are you merging my patch too?
lemme keep it to patches that have passed fishtest...
there's plenty of time to be patient
Oh, okay.
alright now we wait
nooo how could your cores do threat inputs dirty like that !!!
https://tests.stockfishchess.org/tests/view/68f15c8128e6d77fcffa030f this was the previous PT
elaborate?
Different game phases have widely varying speed differences and fixed-nodes differentials. The fixed-nodes gain occurs in the positions with the most slowdown
You just have to read the STC and LTC; fixed nodes does not tell you the expected scaling
gotcha
Well the new speedups don't appear to help in PT much
So that's a big issue
What a terrible result kek
Well it may be related to the fact you rebased on master
Which is optimising for master net
Gainer patches there may not necessarily translate
Either that or the branch is fucked in some way
@frosty imp https://github.com/xu-shawn/Stockfish/pull/16
@rocky vigil @regal steeple remember to rebase
ok nice
At best if the test got super unlucky in both STC and LTC it's equal to previous PT
We can just attribute it to that these new tests didn't have Shawn's blessings

I think it is actually likely the previous PT was just super lucky
Since the jump to -10 didn't add up from the previous approx -25

wasn't it -20 + (around 10) = -10
idk anymore
to be fair what I see in sprts
one side can apparently just randomly gain 5 elo
then lose it
so idk anymore
Well I think this PT was too early so not enough gainers to overcome error bars. There is still actual gain
It's just not visible enough yet
Adding sprt Elos is not valid lol
You have to assume on the low end for all of them
I don't think you can extend it
Fishtest "feature"
Try to if you want to see what I mean
eh whatever
oh yeah
"unable to modify number of games in a fixed game test" lmao
actual const int games = 10000
It's dumb, I disabled the check on our instance lol
undefined behavior
It's just a check in the code that throws that message when you try
We used to be able to till a few years ago
Like it's intentional design choice to not allow users to do it now
merged
bam
whoo
sanity check factorized vs unfactorized stage 1
What's this factorization?
just bringing back factorized weights for psq inputs
What's factorizing weight in the first place?
we add an extra bucket active regardless of ksq, then merging that bucket to all other buckets after training
help with convergence in rarer buckets
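A toy sketch of that idea, assuming 768 psq features per king bucket (the figure mentioned earlier in the thread) and, for brevity, a single scalar weight per feature instead of a full accumulator row. `train_indices` and `coalesce` are hypothetical names, not the nnue-pytorch API.

```python
NUM_PSQ = 768  # psq features per king bucket

def train_indices(bucket, psq, num_buckets):
    """During training a psq feature activates its real index plus a virtual
    index in an extra bucket that ignores the king square."""
    real = bucket * NUM_PSQ + psq
    virtual = num_buckets * NUM_PSQ + psq
    return [real, virtual]

def coalesce(weights, num_buckets):
    """After training, fold the virtual bucket into every real bucket."""
    virtual = weights[num_buckets * NUM_PSQ:(num_buckets + 1) * NUM_PSQ]
    merged = []
    for b in range(num_buckets):
        for p in range(NUM_PSQ):
            merged.append(weights[b * NUM_PSQ + p] + virtual[p])
    return merged

num_buckets = 2  # arbitrary small value for the example
weights = list(range((num_buckets + 1) * NUM_PSQ))
merged = coalesce(weights, num_buckets)
real, virtual = train_indices(1, 5, num_buckets)
```

The merged weight at the real index equals the sum the trainer saw, so the exported net evaluates the same without the virtual bucket.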
Oh, I see.
Though the elo in the test can be misleading.
I thought we were like only 10 elo away but after a few more patches it's further.
Though still within the error bar.
Well stage 1 ofc will be much better. It won't be that huge difference in the end
so, the equivalent test of https://tests.stockfishchess.org/tests/view/68f2f8d528e6d77fcffa05d6 run locally, but with 72t:
Results of master vs patch (10+0.1, 72t, 32000MB, UHO_Lichess_4852_v1.epd)
   # PLAYER  :  RATING  ERROR  POINTS  PLAYED  (%)
   1 master  :     0.0   ----  6518.5   12800   51
   2 patch   :    -6.6    4.4  6281.5   12800   49
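As a sanity check on the table, the standard logistic Elo formula applied to the patch's score gives roughly -6.4, close to the reported -6.6. The tool that produced the table may fit ratings differently, so a small gap is expected. `elo_from_score` is a hypothetical helper name.

```python
import math

def elo_from_score(points, played):
    """Elo difference implied by a score fraction under the logistic model."""
    s = points / played
    return -400.0 * math.log10(1.0 / s - 1.0)

patch_elo = elo_from_score(6281.5, 12800)
```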
yikes
well, not infinitely far from beating it.
3-4% speedup, or a clear improvement in the training.
if i checked correctly then apply sometimes removes and adds the same threats, can that be?
like the value in the added list also exists in the removed list? so we are doing some unnecessary ops, no?
yeah that's what people are now optimizing
ah
@violet badger
Unexpected EOF in the Factorizer pipeline
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11750723588
btw, I wonder what that 'factorized' pipeline is actually using, as it is setup to use just --features=Full_Threats .. not --features=Full_Threats^ ?
Huh
so, you agree that it should be using the latter?
