#UE Threat Inputs for AB
Kieren is Australian.
Yeah, I still don't get it why he should be upset about a branch named emu
🤷‍♂️
This I guess: https://en.wikipedia.org/wiki/Emu_War
Oh!
lololol
fighting anti-emu bigotry one PR at a time
my friend has a pet emu
haven't yet met her tho
(the emu)
GROUPED BY x86
x86 | Elo: 1.09 ± 1.46 | LOS: 92.8% | LLR: 1.07 | [269, 7246, 15053, 7370, 302]
ARM | Elo: 20.68 ± 6.52 | LOS: 100.0% | LLR: 1.89 | [7, 276, 744, 425, 20]
hm ok
lmao
zactly
well there you have the contributions
haven't you seen the memes?
ugh
without vondele fleet LLR printers. .... I object. I print Elo, not LLR
#NotAllEmus
there are a lot of this stuff, you can google yourself
despite only being 4.6% of games, ARM is responsible for 64% of LLR
we'll support RISC-V only for the future
even at my work we have a build that supports arm
but there are severe downsides though
finally the academic in you comes out
EPI for the win
true
well, what is this function called
to calculate 1/x for x being a float
fast but not precise
vrcpss
nah
lol
well in general this function doesn't exist in library of arm cpus we use
but exists in dsp
rsqrtss
well you should understand that we use controllers etc
relay protection
recipf
ofc
at least in what we use you can't really use this in arm because library doesn't exist, note that this is a big production cycles so you can't simply switch to newer stuff out of the blue
for sure
so in general I tend to exclude divisions unless absolutely necessary
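For reference, the "fast but not precise" 1/x mentioned above can be sketched on x86 with the SSE `rcpss` instruction (`vrcpss` is its AVX encoding). This is only an illustration: the function name is made up, the Newton-Raphson refinement step is an addition not mentioned in the chat, and targets without SSE fall back to a plain division.

```cpp
#include <cmath>
#if defined(__SSE__)
    #include <xmmintrin.h>
#endif

// Approximate 1/x.
// With SSE: rcpss gives a rough estimate (about 12 bits), and one
// Newton-Raphson step r' = r * (2 - x * r) roughly doubles the precision.
// Without SSE: ordinary division.
inline float approx_reciprocal(float x) {
#if defined(__SSE__)
    const float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
    return r * (2.0f - x * r);
#else
    return 1.0f / x;
#endif
}
```

Dropping the refinement line leaves the raw estimate, which is faster still but only good to about three decimal digits.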
ideal
I would say non functional with avx512icl and gcc 15.2.1
Result of 200 runs
==================
base (...fish.ostrich) = 2055743 +/- 4626
test (...tockfish.emu) = 2053770 +/- 4626
diff = -1973 +/- 2362
speedup = -0.0010
P(speedup > 0) = 0.0510
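The numbers in this summary are consistent with a normal approximation in which the "+/-" value is a 95% half-width (1.96 standard errors). That is an assumption about the bench script, not something stated in the chat; under it, the last two lines can be reproduced like this:

```cpp
#include <cmath>

// Relative speedup: difference in nodes/second divided by the base speed.
inline double relative_speedup(double diff, double base) { return diff / base; }

// P(speedup > 0) under a normal approximation.
// Assumption: `half_width_95` is 1.96 standard errors of `diff`.
inline double prob_speedup_positive(double diff, double half_width_95) {
    const double sigma = half_width_95 / 1.96;
    const double z     = diff / sigma;
    return 0.5 * std::erfc(-z / std::sqrt(2.0));  // standard normal CDF at z
}
```

For diff = -1973, half-width 2362 and base 2055743 this gives a speedup of about -0.0010 and a probability of about 0.051.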
yikes
we'll see fishtest then
might be arch dependent
could you also try out https://tests.stockfishchess.org/tests/live_elo/69108025ec1d00d2c195c5d6 when you have time
no bench change = non-functional
no bench change, slowdown = dysfunctional
There is a new warning for snowy-egret-2
position.cpp: In member function 'Stockfish::Position& Stockfish::Position::set(const std::string&, bool, Stockfish::StateInfo*)':
position.cpp:204:16: warning: 'void* memset(void*, int, size_t)' clearing an object of type 'class Stockfish::Position' with no trivial copy-assignment; use value-initialization instead [-Wclass-memaccess]
204 | std::memset(this, 0, sizeof(Position));
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from position.cpp:19:
position.h:80:7: note: 'class Stockfish::Position' declared here
80 | class Position {
| ^~~~~~~~
eh that's fine
it's because I added a dummy DirtyThreats to Position
we can silence it by casting to char*
then it will be sunday silence
:)
technically not UB
because wherever I use DirtyThreats I use placement new before it
well google "sunday silence"
Sunday Silence (March 25, 1986 – August 19, 2002) was an American-bred Thoroughbred racehorse and sire. In 1989, he won the Kentucky Derby and the Preakness Stakes but failed to complete the Triple Crown when he was defeated in the Belmont Stakes. Nevertheless, he won the Breeders' Cup Classic and was voted American Champion Three-Year-Old Col...
Result of 200 runs
==================
base (...fish.ti_base) = 1997515 +/- 4078
test (...nowy-egret-2) = 2009546 +/- 4115
diff = +12031 +/- 1905
speedup = +0.0060
P(speedup > 0) = 1.0000
But it matches my gcc 15 result.
GROUPED BY COMPILER VERSION
g++ 13 | Elo: 2.32 ± 3.55 | LOS: 90.0% | LLR: 0.55 | [23, 899, 2438, 949, 27]
g++ 15 | Elo: -0.39 ± 5.10 | LOS: 44.0% | LLR: -0.13 | [16, 500, 1190, 509, 9]
g++ 14 | Elo: 4.04 ± 5.30 | LOS: 93.3% | LLR: 0.48 | [11, 442, 1116, 478, 17]
g++ 11 | Elo: 1.24 ± 7.10 | LOS: 63.4% | LLR: 0.06 | [5, 243, 622, 239, 11]
clang++ 20 | Elo: 1.49 ± 8.25 | LOS: 63.8% | LLR: 0.06 | [1, 185, 440, 186, 4]
g++ 12 | Elo: -10.32 ± 13.22 | LOS: 6.3% | LLR: -0.23 | [3, 77, 177, 62, 1]
clang++ 22 | Elo: -43.66 ± 43.48 | LOS: 2.3% | LLR: -0.08 | [0, 13, 14, 5, 0]
interesting what is this
blackmail material
I think this is too SSS
gotcha
idk if it's only -0.1% on Zen 5 and decent on other architectures then I think it's an easy choice
but we'll see, might fail
I figured out a cool prefetch trick that seems to work ok...
Even if it is neutral on gcc 15 and works well with older versions, everything is fine.
do the psqt accumulation first and in those loops, prefetch the first chunk of the weights accumulation
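A rough sketch of that ordering, assuming GCC/Clang's `__builtin_prefetch` and made-up sizes and names (`kDimensions`, `kPsqtBuckets`, `add_feature`); the real accumulator code is vectorized and laid out differently:

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) || defined(__clang__)
    #define SKETCH_PREFETCH(addr) __builtin_prefetch(addr)
#else
    #define SKETCH_PREFETCH(addr) ((void) 0)  // no-op where the builtin is unavailable
#endif

// Hypothetical sizes, chosen only for this sketch.
constexpr std::size_t kDimensions  = 128;
constexpr std::size_t kPsqtBuckets = 8;

// Add one feature's weights to the accumulator.
void add_feature(std::int16_t*       acc,
                 std::int32_t*       psqtAcc,
                 const std::int8_t*  weights,
                 const std::int32_t* psqtWeights,
                 std::size_t         index) {
    const std::int8_t* row = weights + index * kDimensions;

    // PSQT part first. The prefetch asks for the first cache line of the
    // weight row so it is (hopefully) in cache when the loop below starts.
    SKETCH_PREFETCH(row);
    for (std::size_t k = 0; k < kPsqtBuckets; ++k)
        psqtAcc[k] += psqtWeights[index * kPsqtBuckets + k];

    // Main accumulation over the i8 weight row.
    for (std::size_t j = 0; j < kDimensions; ++j)
        acc[j] = static_cast<std::int16_t>(acc[j] + row[j]);
}
```

The prefetch is only a hint, so the result is identical with or without it; whether it helps depends on cache size and hardware prefetchers.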
finicky tho
when u have time if you could check out https://tests.stockfishchess.org/tests/live_elo/6910ec7cec1d00d2c195c6aa that'd be swell
I think because you have the X3D (?) it'll be neutral-to-negative
because so much cache
but maybe better on fishtest
if it goes well, thoughts on a VLTC test?
rather SMP
which I happen to run locally right now
x86
well, that never happens, but looking good SMP at 10+0.1
Torch shaking in its boots
oh interesting
is it bc of elo compression
at least 120+1.2 SPSA did scale way past 120+1.2
elo compression on uho books exists
'indefinitely' is poorly defined
but it's not big
fair enough haha
chess is O(1)
true
yeah at infinity it will play perfect chess anyway
rare professor who cares about the big-O constant
cosmic bit flip tho :)
just disable TT
we already documented one during SF development

really??
that's awesome
let me find this..
nice you had the tab still open
Result of 200 runs
==================
base (...fish.ti_base) = 2000882 +/- 4020
test (...apped-nunlet) = 1988707 +/- 3553
diff = -12175 +/- 1773
speedup = -0.0061
P(speedup > 0) = 0.0000
CPU: 16 x AMD Ryzen 9 9950X3D 16-Core Processor
Hyperthreading: on
meanwhile:
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 patch : 8.2 3.2 11799.0 23138 51
2 master : 0.0 ---- 11339.0 23138 49
very promising
let me see what I get singlethreaded on the same hardware.
same hardware, same TC, single threaded
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 master : 0.0 ---- 16984.0 32768 52
2 patch : -13.0 2.8 15784.0 32768 48
Looks like excellent scaling
well, with that kind of scaling this might not be needed.
do we just ignore STC
not 'just ignore'
and hope that search patches will bring it up
excellent
especially for speedups
but that kind of difference between single threaded and multithreaded is kind of insane.
indeed...
I'ma do a similar test locally to see how it looks on Zen 5
do you have a script you used?
not really, but can share the fastchess commandline.
threads=1
taskset --cpu-list $tasksetlow-$tasksethigh \
  ./fastchess -tournament roundrobin -concurrency $(($size/$threads)) -rounds 16 -games 2 -repeat -srand $RANDOM \
  -openings file=./UHO_Lichess_4852_v1.epd format=epd order=random \
  -engine name=master cmd=./stockfish.master.x86 tc=10+0.1 \
  -engine name=patch cmd=./stockfish.patch.x86 tc=10+0.1 \
  -config outname=config-foo \
  -pgnout file=games-foo.pgn \
  -each proto=uci option.Threads=$threads option.Hash=$((16*threads)) >& out-foo
yes.
look for a file named out-foo
if you think it's worth the data, I'd try running the STC tournament with no SMT...
I'm suspecting that the i8->i16 conversion spam doesn't play well with SMT
(not that that's a solvable problem)
Results of master vs patch (10+0.1, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -3.04 +/- 4.16, nElo: -5.90 +/- 8.08
LOS: 7.63 %, DrawRatio: 51.06 %, PairsRatio: 0.95
Games: 7094, Wins: 1827, Losses: 1889, Draws: 3378, Points: 3516.0 (49.56 %)
Ptnml(0-2): [32, 859, 1811, 829, 16], WL/DD Ratio: 1.14
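The Elo line in this report follows from the win/loss/draw counts via the usual logistic formula. A small sketch (the function name is made up):

```cpp
#include <cmath>

// Logistic Elo difference implied by a win/loss/draw record.
// score = (wins + draws / 2) / games,  elo = -400 * log10(1 / score - 1)
inline double elo_from_wld(int wins, int losses, int draws) {
    const double games = static_cast<double>(wins) + losses + draws;
    const double score = (wins + 0.5 * draws) / games;
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

For 1827 wins, 1889 losses and 3378 draws this gives about -3.04 from master's point of view, as printed. The error bar is not covered here; it needs the pentanomial counts.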
ST penalty not quite so bad over here so far
so that's quite good.
With more threads (10+0.1t256) still good..
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 patch : 13.8 11.0 1063.5 2048 52
2 master : 0.0 ---- 984.5 2048 48
numerous
ok, part of the cleanup effort..
I'll do it rn, why not
sure
so much threat inputs progress in the past few weeks 
There are decades where nothing happens; and there are weeks where decades happen.
So, the multithreaded cousin of this one looks like:
Results of master vs patch (10+0.1, 8t, 64MB, UHO_Lichess_4852_v1.epd):
Elo: -17.93 +/- 13.31, nElo: -36.33 +/- 26.92
LOS: 0.41 %, DrawRatio: 51.25 %, PairsRatio: 0.66
Games: 640, Wins: 140, Losses: 173, Draws: 327, Points: 303.5 (47.42 %)
Ptnml(0-2): [1, 93, 164, 62, 0], WL/DD Ratio: 0.91
mind the order (master vs patch)
so roughly 25Elo difference on the same machine between 1t and 8t at STC
yeah.
powerful
in this case, there are months where nothing happens, and there are weeks where months happen
lol
Lenin is displeased
hmph how to polyfill _mm_cvtepi8_epi16 for < SSE 4.1
on SSSE3 you could pshufb + srai
the fallback looks SSE2 compatible
oh interesting
I'll write three implementations, one for SSE4.1, one for SSSE3, and one for SSE2
then we should be good to go
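One common SSE2-only way to emulate `_mm_cvtepi8_epi16` is to interleave the register with itself and then shift arithmetically. The sketch below is not taken from the branch under discussion; the function name is made up, and non-SSE2 targets take a scalar loop.

```cpp
#include <cstdint>
#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Sign-extend 8 int8 values to 8 int16 values.
inline void widen_i8_to_i16(const std::int8_t* in, std::int16_t* out) {
#if defined(__SSE2__)
    // Load 8 bytes into the low half of the register.
    const __m128i x = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in));
    // unpacklo(x, x) puts each byte into both halves of a 16-bit lane;
    // an arithmetic shift right by 8 then leaves the sign-extended value.
    const __m128i wide = _mm_srai_epi16(_mm_unpacklo_epi8(x, x), 8);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), wide);
#else
    for (int i = 0; i < 8; ++i)
        out[i] = in[i];  // implicit conversion sign-extends
#endif
}
```

The SSSE3 variant mentioned above would replace the unpack with a `pshufb` that places each byte in the high half of its lane, followed by the same shift.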
Sf doesn't support non-sse2?
well we'll have a generic C fallback that's slow as molasses
not sure whether that's done yet
the generic fallback is literally for loops
doesn't it get implicitly casted
feel free to copy it, don't have to credit, it probably sucks anyway
Wrong bench for general-64: Nodes searched : 3117291
🤦
ok lemme fix
while ur around would you mind benching https://tests.stockfishchess.org/tests/live_elo/691101f3ec1d00d2c195c6fd vs threat-inputs-i8?
Does ARM already work without NEON? And 32-bit ARM?
non-NEON ARM will probably use the fallback
ngl I don't see why it's wrong...
huh, it's correct locally...
make -j build ARCH=general-64 right?
yea
maybe (after ur done benching) you can try my SSE port branch...
I did a profile-build, but it shouldn't matter
I tried the sse branch
btw the shifts are unnecessary in this version i think, i previously used _mm_set1_epi64 instead of _mm_cvtsi64_si128 and forgot to remove them
That still uses vector
what
idt so
when I make changes to the generic fallback it changes the bench
for (const auto index : removed)
{
    const IndexType offset = Dimensions * index;
    for (IndexType j = 0; j < Dimensions; ++j)
        toAcc[j] = fromAcc[j] - featureTransformer.threatWeights[offset + j];
    for (std::size_t k = 0; k < PSQTBuckets; ++k)
        toPsqtAcc[k] =
          fromPsqtAcc[k] - featureTransformer.threatPsqtWeights[index * PSQTBuckets + k];
}
for (const auto index : added)
{
    const IndexType offset = Dimensions * index;
    for (IndexType j = 0; j < Dimensions; ++j)
        toAcc[j] += featureTransformer.threatWeights[offset + j];
    for (std::size_t k = 0; k < PSQTBuckets; ++k)
        toPsqtAcc[k] += featureTransformer.threatPsqtWeights[index * PSQTBuckets + k];
}
I have to be missing something really obvious
Removed loop looks off
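One reading of "looks off": the removed loop assigns from `fromAcc` on every iteration, so with two or more removed indices only the last subtraction survives. The chat does not say what the actual fix was; the sketch below shows one way to avoid that pattern, with simplified hypothetical types and without the PSQT part.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the generic fallback (names are illustrative).
void update_generic(const std::int16_t*             fromAcc,
                    std::int16_t*                   toAcc,
                    std::size_t                     dims,
                    const std::int8_t*              weights,
                    const std::vector<std::size_t>& removed,
                    const std::vector<std::size_t>& added) {
    // Copy the source accumulator exactly once...
    for (std::size_t j = 0; j < dims; ++j)
        toAcc[j] = fromAcc[j];

    // ...then apply every removal and addition on top of the running value.
    for (const std::size_t index : removed)
        for (std::size_t j = 0; j < dims; ++j)
            toAcc[j] = static_cast<std::int16_t>(toAcc[j] - weights[index * dims + j]);

    for (const std::size_t index : added)
        for (std::size_t j = 0; j < dims; ++j)
            toAcc[j] = static_cast<std::int16_t>(toAcc[j] + weights[index * dims + j]);
}
```

With a single removed index the two versions agree, which could explain a bench that only goes wrong in some configurations.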
Result of 200 runs
==================
base (...fish.ti_base) = 1990667 +/- 3984
test (....emu-inlined) = 2048729 +/- 3966
diff = +58062 +/- 2201
speedup = +0.0292
P(speedup > 0) = 1.0000
ok not bad
better than ostrich which is what matters
rare force_inline W
thx as always <3
OK
sse2 inefficiency fixed, general-64 works again
so we should be good to go
@warm thistle if ur around I'd appreciate a bench on ur computer too
hm same for me
Result of 200 runs
==================
base (...fish.ti_base) = 1996106 +/- 4152
test (...fish.ostrich) = 2056736 +/- 4537
diff = +60631 +/- 2106
speedup = +0.0304
P(speedup > 0) = 1.0000
O nvm huh
it rly rips through the indexing on your computer lol
ok well we'll wait for fishtest then
on it
Result of 20 runs
==================
base (./sf-old ) = 1385104 +/- 7425
test (./stockfish ) = 1420016 +/- 9039
diff = +34912 +/- 3671
speedup = +0.0252
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
no surprise, but good:
Verify node counts:
g++-9 : 2324801
g++-10 : 2324801
g++-11 : 2324801
g++-12 : 2324801
g++-13 : 2324801
clang++-11 : 2324801
clang++-12 : 2324801
clang++-13 : 2324801
clang++-14 : 2324801
clang++-15 : 2324801
clang++-16 : 2324801
clang++-17 : 2324801
clang++-18 : 2324801
clang++-19 : 2324801
clang++-20 : 2324801
I should probably add a loop over our architectures..
time to call it a day. I suggest to start both SMP runs on fishtest once the LTC passes.
gn!
are we preparing for the PR now?
it's only at 2.6 LLR though
It passed
🥳🥳🥳🥳
I'm assuming the branch is threat-i8-QA-255?
shouldn't the smallnet also be updated with the QA=255 quantization
@violet badger would it be possible to look into merging nnue-pytorch#370? I have some refactors planned that should make the feature system easier to work with
Alright lemme send the ltc smp in
yep
this is a separate patch that we can test later
same can be said for all of those though
sure, how long will smallnet training take?
i think the threat-i8-QA-255 branch can also be used to train a smallnet
we can just requantize?
just use HalfKAv2_hm^ feature set
isn't the threat weight clipping hard coded
does anyone have the original checkpoint though
ig just requantizing from nnue
ok try to do this as a simpl i guess
replace x with x * 255 / 127
actually no
just replace x with x * 2
So, if this passes, then will it be merged? Or will you all try for more first?
a lot of cleanup work to do first...
@frosty imp you've already done a lot of cleaning up right
gotcha
I think some of the nicer inference cleanups need trainer side coordination
ah
but threat index calculation & co should be fine
anything I can help with?
I have i8 merged but not your speedup
kk
kk
this branch should be updated now
with clang-format
huzzah
will you apply the diff to the most recent SPRTs yourself or should I do that and PR it
oh these are already included
I'll clean them up a bit though
oops I broke the compile by removing the friend struct Position thing
huh I get a segfault with sanitize=undefined,address
oh well we'll figure it out later
seems to be a misaligned struct
it's segfaulting on a memcpy that expanded to vmovdqa instructions
hmm
https://github.com/xu-shawn/Stockfish/pull/29 anyway this PR fixes an OOB read in my LUTslop
merged
danke
Shawn have you clang formatted
yes
yeah
honestly the code isn't that bad
the only serious pain point imo is nnue_accumulator.cpp
which I gather u've been working on
will be traveling today. I assume that needs some light testing at least?
I would probably do that to be safe
although it's a simple reorganization. probably nothing will go wrong
yeah, so will be a bit later.
u've done smth wrong
ooh where are you going?
just work..
meanwhile, some results for 60+0.6t256.
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 patch : 6.3 10.7 1042.0 2048 51
2 master : 0.0 ---- 1006.0 2048 49
A bit sss, but looks good.
well, likely quite reasonable progress
@frosty imp how did you requantize the smallnet?
converted .nnue to .pt in master nnue-pytorch
then used https://github.com/xu-shawn/nnue-pytorch/tree/QA_255 to convert it back to NNUE
try just multiplying every weight by 2 in the nnue
hmm isn't that 254 quant tho
doesn't matter practically speaking
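To put a number on "doesn't matter": for values in [-127, 127], doubling and the rounded 255/127 rescale never differ by more than 1. A small check, assuming that input range and made-up function names (which weights actually get rescaled is not specified here):

```cpp
#include <cmath>
#include <cstdlib>

// Rescale from a 127-based to a 255-based quantization, rounding to nearest.
inline int requantize_exact(int x) {
    return static_cast<int>(std::lround(x * 255.0 / 127.0));
}

// The shortcut from the chat: multiply by 2 (effectively a 254-based scale).
inline int requantize_double(int x) { return 2 * x; }

// Largest disagreement between the two over the assumed input range.
inline int max_abs_difference() {
    int worst = 0;
    for (int x = -127; x <= 127; ++x) {
        const int diff = std::abs(requantize_exact(x) - requantize_double(x));
        if (diff > worst)
            worst = diff;
    }
    return worst;
}
```

The largest input shows the gap: 127 maps to 255 with the exact rescale and to 254 with doubling.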
i'll try that later
512
a new era of nnue just started, great job everyone 
so many good things coming from this at once. master net will be reproducible again, there will be new nets again after a long time, no spsa needed rn, probably some smart speedups incoming, SF 18 is coming
yessir I have a few ideas still...
and obviously others will find fruit
ok, gave it some testing, have a look at the PR for some copilot comments.
I assume that is a step towards getting threats into the main branch, right?
should allow refactoring feature transformers in the next PR, which will make getting threats in main easy
If the running test ends the way it is looking like, then threat inputs does indeed scale very well.
life is good
Thanks for your hard work, and everybody's in general.
The LTC looked much more x86 friendly.
GROUPED BY ARCH
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 3.24 Β± 2.91 | LOS: 98.6% | LLR: 1.11 | [9, 1429, 3509, 1528, 20]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 2.63 Β± 4.49 | LOS: 87.4% | LLR: 0.35 | [8, 590, 1469, 637, 5]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 2.47 Β± 6.17 | LOS: 78.4% | LLR: 0.17 | [2, 304, 774, 320, 4]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 10.76 Β± 6.57 | LOS: 99.9% | LLR: 0.86 | [0, 242, 669, 314, 2]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -1.66 Β± 8.16 | LOS: 34.5% | LLR: -0.13 | [1, 200, 444, 190, 2]
64bit VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -2.00 Β± 8.39 | LOS: 32.0% | LLR: -0.15 | [2, 183, 417, 178, 0]
64bit POPCNT NEON_DOTPROD | Elo: 23.63 Β± 10.79 | LOS: 100.0% | LLR: 0.71 | [1, 85, 248, 151, 1]
GROUPED BY x86
x86 | Elo: 3.11 Β± 2.01 | LOS: 99.9% | LLR: 2.19 | [22, 2948, 7282, 3167, 33]
ARM | Elo: 23.63 Β± 10.79 | LOS: 100.0% | LLR: 0.71 | [1, 85, 248, 151, 1]
And net SPSA as well...
eventually
me too btw
I believe it was agreed this won't happen?
sprt?
As I understand it will happen eventually, when no more Elo could be squeezed
SPSA, sorry, I fixed it
Are there any preliminary results on L2=31 TI nets?
Tomorrow the nets produced with this job will be available:
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2149181872
You can always train a better net
So this point will never come
What is the current elo vs master?
Over 10k posts in this thread, long battle
ltc by ISA type
so make a PR? test will pass soon I guess
what branch is it again
like is the final branch going to be the shawn one or the one sscg13 has on the test
For the PR message need to decide how much detail to go in. Like do I discuss the alternate schemes that failed before this one (or things that were tried and failed in general) or stick to just explaining the final product
I think you should write failed alternates in a separate doc
and link it in PR
otherwise it's too much text
yep
if someone wants to make that doc feel free. and then i can just add the section talking about the other failed input schemes
it would be good to summarise the findings of this 10k messages i think
Shawn one probably with extra cleanups and stuff
Yes this is important I think
I feel like sscg or shawn should make the pr if they wish to do so since they put in the vast majority of the work, not to discredit you but it feels a little weird that you are the one making the pr when you """only""" proposed the idea
I personally don't care, I think it would make most sense to have Shawn make the pr since his branch is being actively updated with cleanup work
There are also corresponding PRs Shawn and I need to make to nnue-PyTorch
I don't think that is true, there was months of prior work to land on this scheme and do a first test in Stockfish
But you did not do any stuff to get it to work in sf (which is where this pr is made)?
i agree one of shawn/sscg should open the pr
is the pr not gonna have like 10 quadrillion coauthors anyway?
Yeah
coolio dont forget me :p else i'll be briefly sad
Don't worry viren claims to have a big list
its just his name in a very large font size
Might be time to reveal :P
viren me lofty yoshie sscg shawn and then all the SF speedup gang?
in chronological order even
~~ in chronological order I think I come before yoshie~~
Disservin, vondele, linrock also need to be credited
this will be the holiest PR in existence
Tbf I think viren should just reveal the list
what are the current elo numbers at STC/LTC/SMP?
Can we make it larger than the original nnue pr
Yeah I will I'm on phone rn I don't have it on me lol
2 / 3.5 / 6 so far
How else will I farm this for the resume
tbh it probably doesn't matter all that much now
I'm happy now that when I say I'm a sf dev it doesn't mean I just made a one line simp
This is huge w/o spsa, the comparable master arch number is -5
Yeah if you can guarantee it'll last
Increased TP of the current VLTC test to 100%...
I don't trust my personal acc with important stuff like this
That'll get referenced many times in the future
It will not be referenced directly. It will be downloaded and attached through GitHub lol
It's only for the collab stage
Oh cool
^
Yeah just make one then
Mine is the same one that appears on my github
Just share it publicly
Well it has version history anyways
Finally snowy egret has a chance ugh
The memcpys were pissing me off
Maybe we should add a proper move assignment operator to ValueList
which won't fix the problem but at least it won't copy the whole thing
I think finding an upper bound for the threats list size no longer matters tho with egret
Speedups don't seem terribly important for the PR description right
maybe we briefly describe the most important ones?
I'm planning to write an in-depth blog post about it (bc some of the techniques are interesting imo) so we can also link that
ah nice
aura
wow writing this stuff is harder than I thought
just stream of consciousness it!
and then we can reorganize
@foggy wind would u mind benching https://tests.stockfishchess.org/tests/live_elo/6911b37fec1d00d2c195c8f8
works ok locally
trolled by a loongarch worker lmaoo
I should do a loong vsx port some time
Result of 200 runs
==================
base (...fish.ostrich) = 2056887 +/- 4406
test (...grine-falcon) = 2051799 +/- 4516
diff = -5087 +/- 2380
speedup = -0.0025
P(speedup > 0) = 0.0000
ugh
oh yeah meanwhile
I wonder if it gets inlined, are you using clang?
🥳
Congratulations to everyone who put in a lot of hard work
thank u for all the help
there is also still this warning:
position.cpp: In member function 'void Stockfish::Position::update_piece_threats(Stockfish::Piece, Stockfish::Square, Stockfish::DirtyThreats*)':
position.cpp:1104:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
1104 | Bitboard threatened = ray & qAttacks & occupied;
| ^~~~~~~~~~
position.cpp:1057:14: note: shadowed declaration is here
1057 | Bitboard threatened;
| ^~~~~~~~~~
actually it probably wouldn't hurt to take a look at it now
Meanwhile Stage 4 net with L2=31 available.
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/12028659745/artifacts/download
Tomorrow morning (CET) Stage 5 net will be available
Btw on net format, let's try to print i8 verbatim and then leb128 for i16
And update the trainer accordingly
is there a reason we use leb128 instead of something a bit simpler and more compact
the vast majority of weights are in [-127,127] so
the format can just be 0x80 + (2 bytes) for i16s that don't fit, and the literal value otherwise (which will be sign extended)
should be like 10% smaller
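The escape-byte idea above can be sketched as follows. The little-endian order of the two escaped bytes and all function names are assumptions for illustration; nothing here is an agreed net format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint8_t kEscape = 0x80;  // the i8 value -128, reserved as a marker

// Values in [-127, 127] take one byte; everything else takes 0x80 + 2 bytes.
std::vector<std::uint8_t> encode(const std::vector<std::int16_t>& values) {
    std::vector<std::uint8_t> out;
    for (const std::int16_t v : values) {
        if (v >= -127 && v <= 127) {
            out.push_back(static_cast<std::uint8_t>(static_cast<std::int8_t>(v)));
        } else {
            const auto u = static_cast<std::uint16_t>(v);
            out.push_back(kEscape);
            out.push_back(static_cast<std::uint8_t>(u & 0xFF));  // low byte first (assumed)
            out.push_back(static_cast<std::uint8_t>(u >> 8));
        }
    }
    return out;
}

// Inverse of encode(); stops early if an escape sequence is truncated.
std::vector<std::int16_t> decode(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::int16_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        if (bytes[i] != kEscape) {
            // Literal byte, sign-extended to 16 bits.
            out.push_back(static_cast<std::int8_t>(bytes[i]));
            i += 1;
        } else {
            if (i + 2 >= bytes.size())
                break;
            const auto u = static_cast<std::uint16_t>(bytes[i + 1] | (bytes[i + 2] << 8));
            out.push_back(static_cast<std::int16_t>(u));
            i += 3;
        }
    }
    return out;
}
```

Note that -128 itself has to be escaped, since its one-byte form is the marker.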
I mean we could unironically just write the weights verbatim
based
The sad part is because of packus preprocessing it doesn't simplify memory sharing
@frosty imp does this sound good?
when we agree on a new net format afterwards I can modify the bitcoin miner script
lol
snowy-egret-2 passed
https://tests.stockfishchess.org/tests/live_elo/69108025ec1d00d2c195c5d6
congrats... please go ahead with the PR.
i think there are still many cleanups to make
yes sure
- we haven't prepared the PR message
is the doc supposed to be the PR message
i think we could also include like the most important parts
also what shall we do with this
append it to the pr?
good
sure, up to you
we can also just merge it later into master
can just imitate this
seems nice and concise lol
personally I'd just link to an external doc or Wiki entry?
for a more extensive explanation
I think in the actual message we just put the SPRT results, a brief description of threat inputs, and the contributors
we are also waiting on Viren's list of contributors over time
viren also wanted to write down the exploration process here and other detailed stuff
so let's wait for that
i mean there's no rush
in the meanwhile what are opinions on this
guys, the PR description is not important. It can be modified/updated afterwards
I can try that after refactoring the feature modules
ok cool
i would also prefer that like we be able to make the prs to sf / nnue-pytorch at the same time
since they're companion prs
yeah that seems good
I'd avoid that, waste of compute, I also would really like the net to be the result of the ci pipeline, or fully captured by the nettest script.
yes
I would keep it, or we have to redo training?
I would say threat weights and psqt weights be stored separately
which we can of course
yeah
wouldn't need to redo training, is just a change in serializing net
i think
the weights themselves would remain the same
just formatted differently
well, I would at least need to fix the script ... a new trainer sha would need a rerun.
more after dinner
Is there any work/experiment to do on smallnet?
yes, attempt to re-quantize it to QA=255
meh
i think anematode can just tack it onto shawn's branch
it's not functional
I think we should hold off on all functional changes (i.e. requantize smallnet) until after merge
great
Also, tomorrow morning we will have Stage 5 net with L2=31.
Stage 4 net already available, should we test it?
^^
I think it's fine to merge if tested
we should now work towards a mergeable PR. There should probably still be a final test to verify the cleanup didn't introduce a regression or so. However, larger changes I would keep for afterwards and use the normal process to improve stuff.
Agree
The flurry of improvements will last a while probably
So might as well get it in
i think if possible try to avoid links in the text itself, i think take them outside to the top of the page. otherwise it is a bit much
like just list out monty, yukari, plentychess prs, nnue trainer prs etc. before going into the text
hm i think we can actually put that in the PR message itself instead. and attached through github. It seems to be the easiest way to understand the general concept quickly
A graphical version of an earlier scheme (with less refinement) that illustrates the core concepts can be found at: <Link the initial monty inputs v2 pdf> I copy it here for later
tbh I don't think the doc really needs to go into detail about speedups seeing as it's more to serve as an introduction into threat inputs
hmm well if someone does write it it's still better even if not strictly necessary. otherwise those concepts will be lost to time
it's not like you can git blame the speedups itself
yeah that's a good place to add it
and we can always link it later
great
what was the range of x86 speed loss again compared to master?
For the PR message itself I think the PR links of the 3 engines (Monty, Yukari, Plentychess) being included is also probably important. Maybe that PR link section can just move there itself
I think its just going to be a bunch of links and short summary. Only way to have it condense
yep
I would recommend to not overdo a PR
i will just give a range of 15 to 5% if no one digs up more specific numbers
since new arch and stuff is probably good to speedup development in other areas
so as soon as you make it the better it is imho
even if it's not complete, relevant info can be put on github after it
i think we can get everything done within 1-2 days if we speed it
which should be fairly fast
yeah just saying, my last project at work is more or less finished 2 weeks ago and I'm still making docs to close it
(:
ah
I think the "full" doc looks solid now
so we can work on preparing PR message
and after that shawn and I need to lock in on the other things
Well the PR message can be done in 1 hour, the branch being ready is the main thing lol
@frosty imp what is remaining before threat inputs can be PR'd to nnue-pytorch
and I'll try to set up a tracking issue for the main sf PR
ok so attempting to create a new issue now enforces that it follows the "typical issue" format
so idk if this is actually the best approach now
yknow what let's just try to do it here
one thing absolutely should try to do soon is fix this
most other stuff is optional and can be done after the initial PR
but this warning should definitely go
you can also make a PR and use the first PR comment to keep a list of items?
true, would need to wait for shawn to do that
yes, creating a PR would have the advantage of CI running.
let's see what it uncovers
oh dear
like uncovering a rock w/ a bajillion roaches and worms underneath it
(maybe)
or hopefully it's a nicely mowed lawn
alright give me a few minutes to set it up
don't worry
roaches are 2 supply
so maxing out on them is not good
speaking of vvltc results
I do wonder which pair master got double killed in...
Rest of message to be written...
Passed STC:
LLR: 2.93 (-2.94,2.94) <0.00,2.00>
Total: 63424 W: 16956 L: 16591 D: 29877
Ptnml(0-2): 276, 7522, 15797, 7795, 322
https://tests.stockfish...
you can search for it lol
with ctrl+f ,1
and downloading relevant pgn
but in general it's pretty meaningless
they can be downloaded by machine?
these pairs happen in dev vs sf 17 from both sides
you can open the test
do the positions tend to be very sharp or something
or does one side just make a blunder early on
(or both)
yeah PT has them
usually neither
actually in higher frequency
just some time trouble
where one side shows 0,00 and other shows +2
or some tactical miss
where losing side lacked like 1-2 plies of search to see it
first zombie identified
merging in a couple speedups
https://tests.stockfishchess.org/tests/live_elo/6911b37fec1d00d2c195c8f8 this will probably pass if I'm understanding its effect correctly
and then snowy egret ofc
both are pretty straightforward and non functional
task list, PR to nnue-repo
Merge them after or redo the overall sprts I think
Otherwise the listed Elos will be wrong I guess
Task: verify correct .yaml is mentioned/linked for the training recipe
is that a task for this pr or for nnue-pytorch pr though
which I need to get confirmation from shawn
that everything is ready
just somewhere we can keep track
vondele, given how PRs work, wouldn't it be better to merge threats_input only when it will be difficult even for anematode to find some speedup?
My fear is that, if threats_input is merged soon, some next speedups might get lost in merge waves or could interfere with other gains....
Just saying...
nahhh
it is better to merge this and have it become master
^
well all other pending ones will need to get redone
that's unavoidable
but the sooner we do it the less time is wasted
ok, ok
mfw
we also gotta add all the other contributors still unlisted here, once viren gets his list
Relatively little for a patch that changes 30+ files I'd say.

large number are due to not writing the network out correctly I think (from SF)
well that gives a new entry to task list
"remove unused code"
and "fix write NNUE"
this is much simpler if we change the nnue format
though we could also hack it
it still irks me that read_leb128(a), read_leb128(b) is not equivalent to read_leb128(a+b)
But it seems no UB right?
that is funny
there's definitely something... bc it crashes for me locally with sanitizers on (but without a message)
the write functions probably actually have UB
since nobody touched them
Oh... yeah... but other than that...
Though I thought Stockfish was going to be slower to adopt threat input than this. It's pretty fast. Only Monty, Yukari, and Plentychess adopted it faster?
Impressive considering the baseline net is much better in Stockfish.
so, the hack fix for this is to declare a combined array, write the threat weights and normal weights into the combined array, and then write_leb_128 the combined array
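Why two separate writes differ from one combined write can be illustrated with a framed block. The framing below (a 4-byte payload size in front of the LEB128 bytes) is a hypothetical stand-in, not Stockfish's actual header layout; the point is only that any per-call header makes write(a) followed by write(b) a different byte stream from write(a+b).

```cpp
#include <cstdint>
#include <vector>

// Append one value as signed LEB128: 7 payload bits per byte,
// high bit set when more bytes follow.
void leb128_append(std::vector<std::uint8_t>& out, std::int32_t value) {
    for (;;) {
        const auto byte = static_cast<std::uint8_t>(static_cast<std::uint32_t>(value) & 0x7F);
        value >>= 7;  // arithmetic shift on mainstream compilers (guaranteed since C++20)
        const bool signBit = (byte & 0x40) != 0;
        const bool done    = (value == 0 && !signBit) || (value == -1 && signBit);
        out.push_back(done ? byte : static_cast<std::uint8_t>(byte | 0x80));
        if (done)
            return;
    }
}

// Hypothetical framing: [payload size, 4 bytes little-endian][LEB128 payload].
std::vector<std::uint8_t> write_block(const std::vector<std::int16_t>& values) {
    std::vector<std::uint8_t> payload;
    for (const std::int16_t v : values)
        leb128_append(payload, v);

    const auto size = static_cast<std::uint32_t>(payload.size());
    std::vector<std::uint8_t> block;
    for (int shift = 0; shift < 32; shift += 8)
        block.push_back(static_cast<std::uint8_t>(size >> shift));
    block.insert(block.end(), payload.begin(), payload.end());
    return block;
}
```

Writing the threat weights and the regular weights into one combined array before calling the writer, as described above, keeps a single header and so stays compatible with a reader that expects one block.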
great
I am out of date on this, is the issue the i8 weights are too hard to compress or smth?
What is needed instead
basically I don't like the current format bc we have to declare these huge combined arrays
in order to read and write
The hack fix doesn't sound too bad to me tbh
yeah
ideally what I would prefer is:
(i8 threat weights) (leb128 psq weights) (rest of network)
instead of (leb128 combined weights) (rest of network)
Why verbatim?
Also, some poor guy (shawn_xu?) will have to update new SF NNUEv10 architecture scheme in nnue-pytorch....
basically just memcpy
this necessitates a leb128 read into a combined array (and similar for write)
when it shouldn't be necessary at all
yknow what I'll do hack fix for now
see if it fixes anything
Right now, hack fix, and work on a new format for another round... yeah
if anybody asks about the architectures that SF supports most solidly, refer to this picture please
btw @twilit oriole I've attached the PR links directly in the PR msg
so I think we can get rid of them in the other doc now
smallnet printing has been hacked correctly
still working on threatnet
alright
net printing fixed...
ok it seems like the next issue is the declaration shadows local variable
which turns into an error on some CI
-Werror
i think that can be easily fixed
let me try to do it as well...
what is this test checking?
I'll attempt to fix this as well if I understand what it wants
it tells you if you're including a header that's not needed for it to compile
or if you're not including a header that you actually need, and the program only compiles because another header happens to include it transitively
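A small illustration of what the check wants, assuming a toy file with hypothetical helpers `count_nonzero` and `first_byte`: every name used is covered by a header that this file includes directly, even if some other header might also have provided it on a given toolchain.

```cpp
#include <cassert>  // assert
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint8_t
#include <vector>   // std::vector

// On some standard libraries <vector> may drag in <cstddef> or <cstdint> as a
// side effect, so dropping those two includes could still compile locally.
// IWYU-style checks flag that, because another toolchain may not be so lucky.
inline std::size_t count_nonzero(const std::vector<std::uint8_t>& bytes) {
    std::size_t n = 0;
    for (std::uint8_t b : bytes)
        if (b != 0)
            ++n;
    return n;
}

inline std::uint8_t first_byte(const std::vector<std::uint8_t>& bytes) {
    assert(!bytes.empty());
    return bytes.front();
}
```

The reverse case is an include that nothing in the file uses, which the same check reports as removable.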
last one to fix, don't worry
but it is explicit on what it wants : https://github.com/official-stockfish/Stockfish/actions/runs/19247091247/job/55023637038?pr=6406#step:7:180
this cannot cause any performance regression right
nah
i'll chalk up the abnormally slow bench to my laptop being weird then
laptop regression confirmed by remote diagnosis.
The matetrack error is more interesting.
at least nothing else erroring out so far
the error was there before, so nothing random
ouch https://github.com/official-stockfish/Stockfish/actions/runs/19247561380/job/55025208765?pr=6406 also has an issue with the sse41 i8 conversion
yeah a lot of compilers having issues with that
interesting
ugh
_mm_cvtsi64x_si128 maybe
actually that's even worse
hm
we could do _mm_set_epi64x(0,x)
miserable
idk compiler diffs are weird
yeah prob not
tough
i mean I wouldn't know tho
somehow master has none of these compilation issues
sigh
lmao
average non portability
also i think someone who worked on the incremental threat can revisit this
@rocky vigil ok I think try replacing _mm_cvtsi64_si128(x) with _mm_loadu_si64(&x)
OH WAIT
it's because it's building on 32-bit
ughhhhhh
ok yeah then _mm_loadu_si64 should work
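A hedged sketch of the memory-load approach. `_mm_cvtsi64_si128` takes a 64-bit integer operand and is only available when targeting 64-bit x86, whereas a 64-bit load from memory works on 32-bit builds too. To keep the sketch compilable on older compilers and without `-msse4.1`, it swaps in `_mm_loadl_epi64` (the older SSE2 spelling of the same 64-bit load) for `_mm_loadu_si64`, and an SSE2-only sign extension for the SSE4.1 conversion. The function name `widen8` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Widen eight int8 values at `src` into eight int16 values at `dst`.
inline void widen8(const std::int8_t* src, std::int16_t* dst) {
    assert(src != nullptr && dst != nullptr);
#if defined(__SSE2__)
    // 64-bit load straight from memory: no 64-bit general-purpose register
    // is involved, so this also builds for 32-bit x86 targets.
    const __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    // SSE2-only sign extension: duplicate each byte into a 16-bit lane, then
    // arithmetic-shift right by 8 so the duplicated high byte supplies the sign.
    const __m128i w = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), w);
#else
    // Portable fallback for non-SSE2 builds.
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];
#endif
}
```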
@rocky vigil PR sent
i may have a speedup
```
Result of 20 runs
base (./sf-old ) = 1374133 +/- 10832
test (./stockfish ) = 1383722 +/- 11199
diff = +9589 +/- 2380
speedup = +0.0070
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
```
it's also a simp so we'll see
exciting
the bulk of the documentation can be added as a markdown file in the pytorch repo and the stockfish pr just quick summary and link to that
(if that hasn't been said yet)
huh @warm thistle it was added here for optimization
oh hm
idk removing it seems to do better on my machine..
ig we see what fishtest says
seems reasonable to get rid of it if it doesn't help
maybe it should be an assert instead of an assume
but I don't see how it could be faster
@rocky vigil made a few more PRs. will be back for more once those get merged
Great job everyone! This was a long journey, but now, after the cleanup finished, we could finally merge to master!
And then we can start search tune and so on to gain some more.
You're awesome!
Aight lemme check
aight they're all merged
pr made
lol yeah it can take a few iterations to make IWYU happy
merged btw
matetrack issue seems to be related to tb
i have no idea how that happened
we ostensibly didn't touch any part of tb probing
btw @frosty imp how is progress on nnue-pytorch going
should I make a PR now for that
and you see what needs to be changed
I'm working on a feature set refactor so best to wait after that
since there's prolly going to be heavy merge conflicts
i mean it won't be too hard to rebase stuff
manually
later
yeah i assume we are not gonna change the net format or anything
For the matetrack, I'm trying to find a reproducer. I haven't extracted it yet, but I've seen this error message, which is probably the reason:
stockfish: syzygy/tbprobe.cpp:1148: void Stockfish::{anonymous}::set(T&, uint8_t*) [with T = TBTable<Stockfish::<unnamed>::WDL>; uint8_t = unsigned char]: Assertion `e.hasPawns == bool(*data & HasPawns)' failed.
unless that rings an immediate bell, I'll try to extract the testcase
syzygy/tbprobe.cpp:1073: uint8_t* Stockfish::{anonymous}::set_sizes(PairsData*, uint8_t*): Assertion `d->base64[i] * 2 >= d->base64[i + 1]' failed.
something is fishy
maybe salting the fish helps
i went to bed when sscg was fixing still, i woke up and sscg is still going 
nah I also went to bed
and woke up and started fixing more
guess i just need more sleep than most people here
If this is true, u clearly need more sleep
or sscg needs to speed up
btw does anyone else have cleanups they would like to propose to the pr
if so just pr it to my branch
planned to read the diff in a lecture later (3-4h from now)
fair
one minor thing that comes to mind tho: we no longer require safe_destination() in bitboard.h, it can be moved back to make the diff simpler
ah
btw on the x86-32-sse41-popcnt comp failure
LLM is suggesting to use _mm_cvtsi32_si128 instead
idk how trustworthy that is
i get the general issue of attempting to manipulate 64 bit stuff on 32 bit comp
but do we have a way to distinguish between 32 bit sse41 and 64 bit sse41
could FullThreats::append_active_indices be simplified for pawns using one of the newly introduced attacks_bb() functions in bitboard.h?
just throwing some ideas here, not sure how much cleanup should be done now vs. afterwards
afaik for pawns they're done in bulk
it is faster for refreshing
though refreshing takes negligible amount of total time
$ cat test3.inp
setoption name syzygyPath value ../../syzygy/3-4-5/
position fen 8/8/8/8/6b1/1N1P4/5K1p/7k b - - 0 1
go nodes 100000
$ cat test3.inp - | ../Stockfish/src/stockfish
Stockfish dev-20251110-b5a26a84 by the Stockfish developers (see AUTHORS file)
info string Found 145 WDL and 145 DTZ tablebase files (up to 5-man).
info string Available processors: 0-31
info string Using 1 thread
info string NNUE evaluation using nn-49c1193b131c.nnue (125MiB, (102384, 1024, 15, 32, 1))
info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))
info string Network replica 1: Shared memory.
info depth 1 seldepth 3 multipv 1 score cp -40 nodes 11 nps 11000 hashfull 0 tbhits 0 time 1 pv g4f3
info depth 2 seldepth 3 multipv 1 score cp -33 nodes 26 nps 26000 hashfull 0 tbhits 0 time 1 pv g4f3
info depth 3 seldepth 4 multipv 1 score cp -27 nodes 138 nps 138000 hashfull 0 tbhits 0 time 1 pv g4e2
info depth 4 seldepth 5 multipv 1 score cp -97 nodes 811 nps 811000 hashfull 0 tbhits 0 time 1 pv g4h5 b3d2
info depth 5 seldepth 6 multipv 1 score cp -93 nodes 983 nps 491500 hashfull 0 tbhits 0 time 2 pv g4e2 d3d4 e2f3 b3d2 f3g2
info depth 6 seldepth 7 multipv 1 score cp -87 nodes 1024 nps 512000 hashfull 0 tbhits 0 time 2 pv g4e2 d3d4 e2f3 b3d2 f3g2
info depth 7 seldepth 9 multipv 1 score cp -96 nodes 1268 nps 634000 hashfull 0 tbhits 0 time 2 pv g4e2 d3d4 e2f3 b3d2 f3g2 d2c4 g2f3
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append
under valgrind
no idea what is going on there..
huh it looks like it's in the TB code
can you use n-ary searching to figure out which node the crash occurs on?
educate me ....
yeah to get an fen
essentially
how updated is probing code?
maybe some issue like https://github.com/syzygy1/probetool/commit/f3f8227dafdcd7039ad0da445fcf7bea20cf9bfe
I think it is more some corruption that just happens to trigger that.
in any case it's strange
if you happen to have TB around, can you test if you can reproduce?
i only have shatranj TB lol
ok dw
```
setoption name syzygyPath value ../../syzygy/3-4-5/
position fen 8/8/8/8/6b1/1N1P4/5K1p/7k b - - 0 1
go nodes 100000
```
but let me see if I get the fen
yeah this is 6 piece (root pos)
If I print out the FENs it probes, I get
...
Probe: 8/8/8/8/3P4/5KN1/8/6kr w - - 0 7
Probe: 8/8/8/8/3P4/5K2/8/6kN b - - 0 7
Probe: 8/8/8/8/3P4/5KN1/8/6kb w - - 0 7
==1805787== Thread 2:
==1805787== Invalid read of size 1
with that last fen triggering the error
what if you try a precursor position like 8/8/8/8/3P4/5KN1/7p/6k1 b - - 0 1
no problem
no something is strange..
underpromotion, it being a check, captures available in the position are all edge cases of TB idk
well, we've never had TB issues.
but why would it only crash when root pos is far away
can you also get the internal data being passed to the TB probing, see if that differs somehow?
If I compile with sanitize=undefined I get:
Probe: 8/8/8/8/3P4/5KN1/8/6kb w - - 0 7
syzygy/tbprobe.cpp:1081:22: runtime error: shift exponent 64 is too large for 64-bit type 'long unsigned int'
syzygy/tbprobe.cpp:1042:31: runtime error: shift exponent 151 is too large for 64-bit type 'long long unsigned int'
syzygy/tbprobe.cpp:1043:31: runtime error: shift exponent 209 is too large for 64-bit type 'long long unsigned int'
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append
that is very strange
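For context on the sanitizer output: shifting a 64-bit integer by 64 or more is undefined behaviour in C++, so any shift count that comes from external data needs a range check first. A minimal sketch, with hypothetical helper names that are not part of tbprobe.cpp:

```cpp
#include <cassert>
#include <cstdint>

// Shift left, treating out-of-range counts as "everything shifted out".
inline std::uint64_t shl_or_zero(std::uint64_t x, unsigned n) {
    return n < 64 ? (x << n) : 0;
}

// Shift left, treating an out-of-range count as a bug in the caller or its input.
inline std::uint64_t shl_strict(std::uint64_t x, unsigned n) {
    assert(n < 64 && "shift count out of range");
    return x << n;
}
```

Counts like 151 or 209 cannot be legitimate for a 64-bit shift, so seeing them suggests the bytes feeding the shift are already wrong rather than that the shift itself needs fixing.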
and I assume it doesn't occur when the position is probed directly?
it doesn't trigger on master... let me check on the branch
no, the position searches fine as rootpos
right
i guess the issue must be in the internal data being passed somehow
I think so, but have to stop debugging now.. later today I can look into it again.
Issue in make move maybe?
it is something rare, I'm currently playing games with syzygy enabled, and it is not triggering after a few 100 games.
but does trigger on that testcase.
OK, finally, have a setup where this triggers reliably while playing games (basically book of random 6men positions).
34 0-1 {Black mates}
12 0-1 {White disconnects}
9 1-0 {Black disconnects}
25 1-0 {White mates}
3 1/2-1/2 {Draw by fifty moves rule}
7 1/2-1/2 {Draw by insufficient mating material}
and it is specific to the branch, not happening for master.
probably this should be PRed to the branch?
there is something similar for armv7 neon https://github.com/official-stockfish/Stockfish/actions/runs/19256000717/job/55050601986?pr=6406#step:10:161
Yep I will in a bit
Ai ya
Is there a field of Position or StateInfo that only tbprobe reads
I'm surprised address sanitizer isn't catching anything
is there a reason there's double_inc_update for threats?
it seems to me like there's no optimization there?
this is a slight speedup for me
--- src/nnue/nnue_accumulator.cpp
+++ src/nnue/nnue_accumulator.cpp
@@ -212,17 +212,6 @@ void AccumulatorStack::forward_update_incremental(
DirtyPiece& dp1 = psq_accumulators[next].diff;
DirtyPiece& dp2 = psq_accumulators[next + 1].diff;
- if (std::is_same_v<FeatureSet, ThreatFeatureSet> && dp2.remove_sq != SQ_NONE
- && ((threat_accumulators[next].diff.threateningSqs & square_bb(dp2.remove_sq))
- || (threat_accumulators[next].diff.threatenedSqs & square_bb(dp2.remove_sq))))
- {
- double_inc_update<Perspective>(featureTransformer, ksq, threat_accumulators[next],
- threat_accumulators[next + 1],
- threat_accumulators[next - 1], dp2);
- next++;
- continue;
- }
-
if (std::is_same_v<FeatureSet, PSQFeatureSet> && dp1.to != SQ_NONE
&& dp1.to == dp2.remove_sq)
{
ok what's particularly demented is that these shift operands come from the TB file itself...
so I think data is getting misaligned somehow in the TB read logic
huh but the only usage of a possibly-bad pos in mapped is constructing the file name...
is there a consistency check utility function for Position anywhere?
yeah
just check whether __i386__ or __x86_64__ is defined
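A small sketch of that check. `__i386__` and `__x86_64__` are the GCC/Clang macros mentioned above; `_M_IX86` and `_M_X64` are the MSVC equivalents. The enum and function names are hypothetical.

```cpp
enum class X86Width { NotX86, Bits32, Bits64 };

// Compile-time classification of the target, based on predefined macros.
constexpr X86Width x86_width() {
#if defined(__x86_64__) || defined(_M_X64)
    return X86Width::Bits64;
#elif defined(__i386__) || defined(_M_IX86)
    return X86Width::Bits32;
#else
    return X86Width::NotX86;
#endif
}

// True when intrinsics that need a 64-bit integer operand
// (such as _mm_cvtsi64_si128) can be used.
constexpr bool has_64bit_int_intrinsics() { return x86_width() == X86Width::Bits64; }
```

An sse41 code path could then branch on `has_64bit_int_intrinsics()` and fall back to a memory load on 32-bit builds.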
I don't think the LLM's suggestion makes much sense
Is the threat inputs branch merged with the official SF branch yet? If not, when will that happen?
patience
lol
Yes I just read through 400+ messages on this thread, the entire history of the last ~4 days, y'all have a lot to say
it's a complex change!
By the way, the new L2=31 Stage 4-5 nets are now available. Has someone already tested them? Are they outdated/superseded by other nets?
Stage 5: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/12028659754/artifacts/download
Stage 4: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/12028659745/artifacts/download
Fish test time!
patience
we'll do it after the merge

