#UE Threat Inputs for AB
so, I'd better setup an additional run with that enabled...
makes me wonder why the first quick test of stage 1 seemed so good?
it did finish stage 2... so I guess we could test again.
Closer if you consider that we are around this close to master w/o spsa for a standard net
yes, I've been keeping that in mind 🙂
Though I hope there is still some 5-10 elo more we can squeeze in total
I don’t think the fishtest actually helps let us know the scaling
At least it shouldn’t scale poorly
well... #1336647760388034610 message is a test for scaling to some extent. Looks good but no cigar
Oh shoot I thought that was also stc just done faster locally
stc, but with a few more threads
Another 3 days waiting
😭
no need to ping ...
It remains to be seen whether 1280 can be worth it over 1024
Stage 4 should be done now right?
yes, currently Stage 5 is ongoing
@stray reef do you think lookup could still be faster than current indexing scheme post-cj speedup?
i think it's worth trying the full lookup table for sure
but i personally will try this way as well
Another potential speedup patch inbound!
ig it's a question of larger table vs. no branches/one pseudo_attacks lookup and popcount saved
I think larger tables are relatively free if their cache lines aren't used.
They just stay out of the caches and don't cause slowdown.
Can try it ig
is this with factoriser
nope
is just stage 4
also it's on fishtest now
so you can just pull the two branches involved
if you want to run locally
should test this also ig
https://tests.stockfishchess.org/tests/view/68f31c6c28e6d77fcffa0611 i started the stage 1 test again to see if its just a fluke
how much is factorized training worth?
this net isnt factorized. they thought it was but it wasnt
was the master net trained without factorizer?
the current best net is w/o factorizer
I still feel that https://tests.stockfishchess.org/tests/view/68f091a228e6d77fcffa0128 will help 1280
since this one reduced avg. number of threats updated
oh well
yeah both are active rn
it'll be really funny though
if just better training luck == elo
though bad overall in long term
thats not normal. bullet doesnt have this
idk how aggressive the skipping / filtering is
but each stage is much less than 1 epoch I think
I wonder if the branch had an additional fix?
i did not edit anything touching the original threat inputs definition so I wouldn't expect that to be the case
but who knows
what is going on at ltc rn
with the 1280 net
idk..
Can you do a training run getting rid of king buckets? at l1 1280 again
need to define new feature set
I tested it at L1 3072 and it was -20 fixed nodes at that L1 size. But others tried at smaller L1s and it gained a lot
it could be done though
let's wait to see
if this single machine is just very anomalous
diff --git a/model/features/full_threats.py b/model/features/full_threats.py
index 8219c47..6ff36fd 100644
--- a/model/features/full_threats.py
+++ b/model/features/full_threats.py
@@ -160,9 +160,11 @@ class FactorizedFeatures(FeatureBlock):
     def get_feature_factors(self, idx: int) -> list[int]:
         if idx >= self.num_real_features:
             raise Exception("Feature must be real")
-
-        a_idx = idx % NUM_PLANES_REAL
-        k_idx = idx // NUM_PLANES_REAL
+        if idx < 79856:
+            return [idx]
+
+        a_idx = (idx - 79856) % NUM_PLANES_REAL
+        k_idx = (idx - 79856) // NUM_PLANES_REAL
         if a_idx // NUM_SQ == 10 and k_idx != KingBuckets[a_idx % NUM_SQ]:
             a_idx += NUM_SQ
is that specific to the factorizer or a bug fix in general?
this is specific to factorizer
i think
isn't this code not called?
when factorizer is not used
Ok, looking at the diff between the trainers:
git diff 73696ad5f56e6ba216ba693bf5ad41a278004e36 5bcb0036825206ad6a23df6ed1b07211e3a73f58
which are the shas of the two versions used for training.
It also contains the change to the rng.. but I doubt that matters?
I have certainly never seen that.
At the end of the pipeline there is maybe 1-2 Elo variation
(certainly <5Elo)
approx how long will it take for stage 1 of the real factorized run?
we could also compare then
yeah so it won't take that long to find out...
I wonder if it makes sense to start training L1 = 768...
perhaps
stc of 1280 is holding steady so far
i guess really just wait a while for ltc
I think the better performance of that factorized net is probably because of the reference net used, I don't see it to be the stage 2 equivalent. The training run has these nets:
Step 1 : starting from None leading to 574f3061fd9e
--> step 1 is final already. Result: /workspace/scratch/574f3061fd9e/run/lightning_logs/version_1/checkpoints/nn-3e22bf1f564d.nnue
Step 2 : starting from 574f3061fd9e leading to e3109a97a662
--> step 2 is final already. Result: /workspace/scratch/e3109a97a662/run/lightning_logs/version_1/checkpoints/nn-a878500a97a8.nnue
Step 3 : starting from e3109a97a662 leading to 6d0eccfc51a2
--> step 3 is final already. Result: /workspace/scratch/6d0eccfc51a2/run/lightning_logs/version_1/checkpoints/nn-bf4519f857f4.nnue
Step 4 : starting from 6d0eccfc51a2 leading to bedc9e9b73fd
--> step 4 is final already. Result: /workspace/scratch/bedc9e9b73fd/run/lightning_logs/version_2/checkpoints/nn-598188c9a702.nnue
Step 5 : starting from bedc9e9b73fd leading to e919dd3ada1a
--> step 5 is final already. Result: /workspace/scratch/e919dd3ada1a/run/lightning_logs/version_1/checkpoints/nn-d1dc1ab9cb1c.nnue
while the test has a different base net, not sure what it is. https://tests.stockfishchess.org/tests/view/68f31c6c28e6d77fcffa0611
Step 1 : starting from None leading to fbfaa6b547c6
--> step 1 is final already. Result: /workspace/scratch/fbfaa6b547c6/run/lightning_logs/version_1/checkpoints/nn-020430fc567b.nnue
has to be, this sprt result is impossible lol
so the proper test would be nn-3e22bf1f564d.nnue vs nn-020430fc567b.nnue
lmao
nn-fd9f...
that base net
is a 100 SB net
so 1/8 of the real stage 1
yeah this test is meaningless
i kinda wanna see in a few days
if we are at the level of master replication attempt
(i.e. pre-spsa)
so, at least we know we can't stop training after 100 epochs.
Could someone test if this is a speedup https://tests.stockfishchess.org/tests/view/68f3d698637acd2a11e71ffe ? My local test says it is but I dont really trust my tests anymore
let me try..
probably no difference?
Result of 100 runs
==================
base (./stockfish.base ) = 977742 +/- 3024
test (./stockfish.new ) = 977027 +/- 3049
diff = -715 +/- 2094
speedup = -0.0007
P(speedup > 0) = 0.2520
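For reading the numbers above, here is a minimal sketch of how a mean difference and its error bar can be turned into a speedup fraction and a probability of being faster. It assumes the `+/-` values are 95% half-widths and that the difference is roughly normally distributed. The benchmark script's actual formula is not shown in this log, so `prob_speedup` is a hypothetical helper, not its implementation.

```python
import math

def prob_speedup(diff: float, half_width_95: float) -> float:
    """P(true difference > 0) under a normal approximation.

    half_width_95 is assumed to be the 95% half-width of the difference.
    """
    sigma = half_width_95 / 1.959964          # back out one standard error
    z = diff / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

# Values copied from the result block above (nodes per second).
base_nps = 977742
diff_nps = -715
half_width = 2094

speedup = diff_nps / base_nps
p = prob_speedup(diff_nps, half_width)
print(f"speedup        = {speedup:+.4f}")
print(f"P(speedup > 0) = {p:.2f}")
```

Under these assumptions the probability comes out at roughly 0.25, in line with the reported value, which is why the run reads as "probably no difference".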
But well, always tricky to measure small difference
it might also be that things now are a bit HW dependent
Thank you, thats quite interesting, my local testing showed a decent speedup, but im not sure, maybe I made some mistake in my test
I think it depends a bit on what dominates, and probably in my case slow memory access dominates.
but well, we will see what fishtest figures out..
the 1280 results are strange
i dont think they are. can be explained its just too slow, maybe undertrained etc
I think the current net size is well selected and should focus on optimising it fully first
in terms of training schedules etc
I already played quite a bit around with lr / alpha, but no gains so far.
yeah 1024 seems good, we could give 768 a try later just to confirm
nn-e0189470ae73.nnue available for "Use 1280" pipeline (based on threat_inputs branch)
kind of a wash though at least with current estimates
like it will be 1-2 elo stronger than stage 4
that'll put it at maybe -10 elo stc, -4 elo ltc
768 might be more interesting
probably antiscales but maybe the base stc gain will be high
hm
do u have an estimate on how much incremental threat updates would help the situation
I also wonder whether some of the threats are "low information" in the sense that they're already encoded somehow in the main net
like if you have a queen right next to the king lol
1-2% is my guess, idk, we won’t know until we try
Like threat tracking is ~5% of runtime rn
oh that's not much
true
Idk if that can be done faster though, unless the raw number of updates is decreased
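A rough way to bound this, sketched with Amdahl's law. The 5% share is the figure quoted above; the speedup factors for the threat-tracking part are made-up inputs for illustration, and `overall_speedup` is a hypothetical helper.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the
    runtime gets `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

share = 0.05  # threat tracking as a share of total runtime (from the chat)
for s in (1.25, 1.6, 2.0, float("inf")):
    gain = overall_speedup(share, s) - 1.0
    print(f"threat tracking {s}x faster -> engine {gain:.1%} faster")
```

With a 5% share, a 1-2% overall gain corresponds to making that part roughly 1.25x to 1.6x faster, and removing it entirely would cap out a little above 5%.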
my computer is weird because it has extremely high arithmetic throughput
Wait what machine do you have
it's a recent AMD EPYC machine
Oh yeah
Bc it's a very simple loop
so it can do 4x 512-bit vpaddw per cycle
er wait no
2x
but still quite a bit more than most computers on fishtest
Using many threads probably stresses the avx / memory more
true
That’s why I think viren said i8 would only be worth it at smp
ye
But fishtest conditions are also multithread basically in terms of memory pressure
Bc of concurrency
I also learned there is no simd i8 * i8 = i16 mul
threat inputs, shared memory vs. master, shared memory
According to viren / jw’s experience with Monty this indeed favors threat inputs
CPU mcts engine
oh! cool
The one where this idea originated from
Since they got it to work in Monty first
And then it worked in Yukari, then Plentychess
And soon hopefully sf
I see
it's cool that stockfish imports ideas from other engines!
like big ideas would probably be really hard to test and push through bc master is so carefully tuned
I mean that’s the purpose of all this stuff being open source
sure
And the collaborative nature

corrhist which was a big gain (like 6 elo, it’s literally a whole third of the progress from 17 to 17.1) shortly after SF17 was also originally done in other ab engines
Technically anything above -5 stc / ltc vs master is a win because we can’t get a new net above that without spsa either
But I’m hoping we can go the full way
And yeah big ideas like this require many many people
cool beans
oh this is the model this nice twitch streamer made: https://www.twitch.tv/raymarch
he has a very aesthetic stream
I have another idea
but the thing is that all my smart ideas fail and my dumb ideas tend to work
just look at my last speedup for example
like how tf does that give 2%
pogey
lol that's cute
5 stages of 1280:
1: Elo: -83.06 +/- 1.85, nElo: -158.79 +/- 3.40 nn-8f15e80a1212.nnue
2: Elo: -44.34 +/- 1.84, nElo: -82.88 +/- 3.40 nn-ee65bf2468c5.nnue
3: Elo: -41.99 +/- 1.84, nElo: -78.44 +/- 3.40 nn-da4726ad1062.nnue
4: Elo: -38.09 +/- 1.84, nElo: -71.15 +/- 3.40 nn-07f85ae62b17.nnue
5: Elo: -36.27 +/- 1.86, nElo: -67.03 +/- 3.40 nn-e0189470ae73.nnue
vs #1336647760388034610 message of 1024
(not the latest optimized SF playing of course)
yeah -11 stc to neutral ltc what is this
this gives impression of being undertrained
the convergence time increased
maybe stage 5 is enough to pass ltc
well 768 will probably be huge antiscaler at this rate
💀
oh well
might as well see
so, just replace last step with e.g. 1200 epochs, add a step of 1200 epochs, or redo all 🙂
i would assume safest bet is to increase length of all stages and redo. dunno how other stuff might affect things
but wait for factoriser first ig that should help a bit
right, maybe smarter to wait for the factorizer.
that one (the for real one) should finish step 1 soon (2h?) and I think we should run a sanity check against the corresponding step without factorizer.
Is there a reason why we don't push indices directly to active in append_active_indices?
I'm seeing ~1.5% speedup, also bench looks identical (to xu-shawn/threats_inputs)
Another free speedup...
yeah
wait shoot where'd that come from
i thought I removed that
it's tech debt back when I was doing "UE at home"
Idk what's the latest version, I just looked at shawn's branch and it seemed strange to have that.
also mineta ray is unused in the threats updates in Position
yeah i thought i removed it in the retry
you should prolly include that too
You mean ray &= BetweenBB[s][threatened_sq]; ?
I'd rather let it included directly because it's trivial. fishtest is already a bit under the strain atm.
I'm not really seeing that difference in fixed game tests (for 1024) tbh. Let's not forget this is just a sprt run.
maybe 1280 is different... who knows. Still some work to do, but we're making progress.
yeah
(60+0.6, 72t, 32000MB, UHO_Lichess_4852_v1.epd):
# PLAYER  :  RATING  ERROR  POINTS  PLAYED  (%)
1 master  :     0.0   ----  4173.0    8192   51
2 patch   :    -6.7    5.5  4019.0    8192   49
(patch is 3fc4b6a58c288001f929acc560cb8b28adf03125 (cj-latest-speedup branch))
relative to #1336647760388034610 message
this is 8192 games, each one having engines with 72 threads?
yes
gotcha
ok so 1024 is approximately neutral scaling with master
in threads and time
so probably it is a good spot
though of course 768 might pull a surprise...
well, I should probably compute the results with 1 thread... I think there is good scaling actually. One thing we're definitely seeing, as with all arch changes, is that the real result depends on the HW.
LTCs on fishtest do get a larger diversity of machines (at least for the same number of games), so that might play a role
yikes...
then would 1280 have more promise?
to be fair, all this test shows is that stc 72t and ltc 72t are around the same, as stc, at -7
Is there a test yet
was waiting for mineta to do so but I can make one
I don't see what is 'yikes' about -6 at this stage it's completely fine
Need a bit of patience...
There are gainers to come
If there are no pending improvements incoming I can make a test
I think sscg recently made one
oh i already have the branch set up
but have not made test
i'll make it
it'll run slightly faster
sure
here we go
btw smth strange
@regal steeple can this test be reconciled with upstream changes
so, thread scaling is real in this context. Relative to #1336647760388034610 message
(60+0.6, 1t, 64MB, UHO_Lichess_4852_v1.epd)
# PLAYER  :  RATING  ERROR  POINTS  PLAYED  (%)
1 master  :     0.0   ----  9367.0   17920   52
2 patch   :   -16.2    3.7  8553.0   17920   48
😲
I assume quite a bit comes from the memory pressure (i.e. the known thing where nets are not (yet) shared between processes, using 72t is an easy workaround).
was about to say that
So, for completeness... STC to follow.
if it ever gets added to fishtest I assume it benefits threat input net more than master yeah
btw factorized stage 1 is done
so I'll put that on fishtest
and hopefully avoid netgate...
what does this mean lol
shawn tested the last "fake factorized" stage 1 net against the test net which only had 1/8 of a stage
and surprise surprise +30 elo
Im not entirely sure, I can speedup test this in a bit and resubmit if that test seems promising
bruh forgot to change the fixed games preset
can just stop after 20k or so if the results are clear
ideally if the gain is still there it is amplified now that the engine is overall faster
I get
speedup = +0.0113
P(speedup > 0) = 1.0000
I submitted the test https://tests.stockfishchess.org/tests/view/68f4a2e3637acd2a11e72101
so, looks like the factorizer still works as advertised: https://tests.stockfishchess.org/tests/view/68f49c11637acd2a11e720f0 ... we'll have to see how much of this remains after 4-5 stages, but it is a good start
and finally the STC number in this set.
(10+0.1, 1t, 16MB, UHO_Lichess_4852_v1.epd)
# PLAYER  :  RATING  ERROR  POINTS  PLAYED  (%)
1 master  :     0.0   ----  9485.5   17920   53
2 patch   :   -20.8    3.7  8434.5   17920   47
So summarizing, at l1=1024, we have
STC, 1t: -20.8
LTC, 1t: -16.2
STC, 72t: -6.6
LTC, 72t: -6.7
Could you speedup test this https://github.com/rn5f107s2/Stockfish/tree/maybemaybemaybe vondele? 10 runs or so would suffice
against the shawn threat_inputs branch, I assume
Yes
speedup = +0.0135
P(speedup > 0) = 1.0000
Thank you that looks good I guess
yes, it does
I got around the same value, so hardware difference seems to not be an issue
Some data from pohls tests
PlentyChess 7 TI Test
STC (3min+1sec, ratinglist conditions, 512MB):
Torch 4 a512 : 1000 (+215,=472,-313), 45.1 %, -34 +- 15
Stockfish 17.1 250330 : 1000 (+122,=503,-375), 37.4 %, -89 +- 15
LTC (30min+10sec, 512MB):
PlentyChess 7.0.0 a512 : 1000 (+259,=497,-244), 50.8 %, 5 +- 15
Torch 4 a512 : 1000 (+221,=476,-303), 45.9 %, -28 +- 15
Stockfish 17.1 250330 : 1000 (+142,=491,-367), 38.8 %, -79 +- 15
the error bars are not great ofc, but the trend is there.
and keep in mind it's 512MB hash in both cases, he doesn't have RAM for more
no STC plentychess results?
and now l1=1280 (but won't repeat all tests for this one):
(10+0.1, 72t, 32000MB, UHO_Lichess_4852_v1.epd)
# PLAYER      :  RATING  ERROR  POINTS  PLAYED  (%)
1 master      :     0.0   ----  8304.5   16384   51
2 patch.1280  :    -4.9    3.9  8079.5   16384   49
those were not played actually, the "STC" is from his normal ratinglist run
I've also started a factorized 1280 training run.
maybemaybemaybe you can template put_piece
Oh yeah, I also just saw that im not initializing threatened and threatening square anywhere, so im gonna resubmit the test
what do you predict? what is the future of threat inputs?
threatened? More seriously: not quite stronger than master, but close. There are still inference patches that will speed things up, training sessions that will improve the net, and tests to be done, so quite some work. Even if it is not certain that this will be merged, it definitely helped to revamp some of our tools and processes.
https://tests.stockfishchess.org/tests/view/68f494ca637acd2a11e720d4 some of these patches are not running with the correct bounds @rocky vigil
this could be simplification yes
oh bruh
i thought it would've sped up at least by some nontrivial amount
who knows
eh I'll just recalculate the llr later
it's kind of a waste of games to just restart the test
i do feel like memory bottleneck is big
at 72t it should not be, the full net should fit in the socket's L3 cache.
for the stc 1t i mean
yeah, in that case most likely.
coolio
final net nn-6b685002b4b6.nnue available for "Factorized" pipeline:
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2107827770
The Elo difference to reference is much better in the tests that are executed in the end, but one would need to check if the shas of the playing engines are the same. If the engines are the same, would be quite an improvement.
Probably the playing engine that is different.
Previous:
75edbee01e6f8cb53a2555499192ccaddb883577 b7f553ee8b28a4abace6c1056dceb1d69169873a
Elo: -25.29 +/- 1.82, nElo: -47.45 +/- 3.40
Factorized:
75edbee01e6f8cb53a2555499192ccaddb883577 d5fad05e412e3118f94ab79aa5e03067ac86d204
Elo: -21.06 +/- 2.19, nElo: -39.16 +/- 4.06
So, 4 Elo progress in this test, but we can't attribute to net or playing engine.
quite a few moving targets.
Would it be better then to test nn-6b685002b4b6.nnue against nn-598188c9a702.nnue on Fishtest?
yes.
that's the test to run, or maybe once the testing at the end of the pipeline finishes, so we take the proper net.
but it is already rather clear that this time step 5 is better (nn-6b685002b4b6.nnue)
@twilit oriole I just tried i8 input weights, it's 13% faster but unfortunately -20 fixed nodes (https://furybench.com/test/3426/). I guess the difference to monty is the i8 l1 matmul. but maybe you (or someone else) have/has any ideas how to sneak more accuracy into my impl?
there is still a factor of 2 up for grabs technically, if i replace _mm512_cmpgt_epi32_mask with _mm512_cmpneq_epi32_mask for nnz calculation, though i've not found a nice way to do cmpneq on avx2 and below yet
another thing is, i saw you using 128 (or 127?) for input weight quantisation, whereas I am limited to 64/63 due to the factoriser ([-98, 125] is the range rn). my weights are clamped to [-0.99, 0.99] during training
or just cmpeq and invert the mask?
cmpeq_epi32_mask (also cmpneq) is all avx512 only
Where do I make a PR for threat inputs?
I think follow this example https://github.com/xu-shawn/Stockfish/pull/17
My test has a bit of a problem: one worker has a residual of 10... which may subject the passed test to additional tests.
don't worry, that one will be purged
you can just do what you currently do for avx2, just replace cmpgt with cmpeq and invert the mask??
ohh didn't see _mm256_cmpeq_epi32 is avx2, yes, thx
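A small sketch of why "cmpeq and invert" differs from "cmpgt" when finding non-zero 32-bit lanes. This is one reading of the "factor of 2" remark: if four unsigned 8-bit activations are packed into one signed 32-bit lane, a signed greater-than-zero test only works while the top byte stays below 128, whereas an equal-to-zero test with an inverted mask works for the full 0..255 range. The code emulates the lane logic in plain Python; it does not call any intrinsics, and the helper names are hypothetical.

```python
import struct

def lanes_i32(bytes_u8):
    """Reinterpret groups of four unsigned bytes as little-endian signed 32-bit lanes."""
    assert len(bytes_u8) % 4 == 0
    return list(struct.unpack(f"<{len(bytes_u8) // 4}i", bytes(bytes_u8)))

def mask_cmpgt(lanes):
    """Bit i set when lane i > 0 as a signed integer (cmpgt-against-zero style)."""
    return sum(1 << i for i, v in enumerate(lanes) if v > 0)

def mask_not_cmpeq(lanes):
    """Bit i set when lane i != 0, built as the inverse of an equal-to-zero mask."""
    eq_zero = sum(1 << i for i, v in enumerate(lanes) if v == 0)
    all_lanes = (1 << len(lanes)) - 1
    return ~eq_zero & all_lanes

# Lane 0 is empty, lane 1 has a small value, lane 2 has a byte >= 128 in the
# top position (so the lane is negative as a signed int), lane 3 is empty.
acts = [0, 0, 0, 0,   5, 0, 0, 0,   0, 0, 0, 200,   0, 0, 0, 0]
lanes = lanes_i32(acts)

print(bin(mask_cmpgt(lanes)))      # lane 2 is missed
print(bin(mask_not_cmpeq(lanes)))  # lanes 1 and 2 are both found
```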
doubling quantisation accuracy for the threat weights does a bit, but not much it seems https://furybench.com/test/3430/
@lofty cedar
Is your "Threat_input_speedup" orthogonal to rn5f107s2 PR (https://github.com/xu-shawn/Stockfish/pull/17) ?
Yes, it just changes the update_piece_threats to basically precompute the pawn attack bitboard for 1 square (which simply was not there because we didn't need it), and saved a few attack recalculations by re-using values we need anyway.
I think https://tests.stockfishchess.org/tests/view/68f3adba28e6d77fcffa0727 might need a retest after https://github.com/xu-shawn/Stockfish/pull/17 because with rn5's patch, Threat_input_speedup now does extra computation in the mutate_piece phase rather than less
@lofty cedar
@frosty imp @regal steeple
also aside from that patch, @stray reef do you have Finny tables for your threat inputs? I was thinking about it, do you do a large bitmask and calculate difference that way by any chance?
Why?
cuz your patch hinges on the fact that we always compute rAttacks, qAttacks and bAttacks
but with rn5's patch we dont do it on mutate_piece
so combining the 2 means you are computing rAttacks, qAttacks, bAttacks even when unnecessary
ofc this is a trivial specialcasing away but still
Oh... okay...
But well, I think the compiler won't miss special case. After all, it's a template argument.
well still good to at least do a speedtest using either speedtest or pyshbench before merging
And if one were to be pedantic, one could also say that if the compiler didn't recognize the fact that you can elide qAttack in the case, then you don't need to compute that...
I see.
I think its fine unless im missing something, slider attacks are still needed for the new attacks to the square the piece changed (https://github.com/xu-shawn/Stockfish/blob/293d3a673f8a7cc0983d48feb9b202f4286e9985/src/position.cpp#L1039C9-L1039C79), since threats-to-square are no longer getting incrementally updated. But I really dont like the new table, is it maybe possible to use attacks_bb<PAWN>(s, C) instead?
Sorry wrong link, I meant this line https://github.com/xu-shawn/Stockfish/blob/293d3a673f8a7cc0983d48feb9b202f4286e9985/src/position.cpp#L1072
no, the threat refreshes literally don't matter according to my profiling
oh
with some improvements, i8 input weights are -14 fixed nodes now, and seem to fail STC
https://furybench.com/test/3437/
https://furybench.com/test/3438/
I know QAT has been tried and declared neutral, but maybe with tighter quantisations there's some elo up for grabs? @formal smelt @twilit oriole
ohhh um actually you can put compute_rays on the outside
skip the while loop even
at least my impl did that
I think that was possible while threats to square were still getting incrementally tracked because threats by sliders were getting added in the third loop but now they are getting added in the second loop
surely you can still do that by:
<pseudocode>
threats_remove()
threats_add()
mutate_board()
or something in that order
needs checking at least
Im not sure im following, say we have a position like this 6k1/8/5n2/4p3/4P3/8/6B1/6K1 b - - 0 1 with f6e4 played, we still need to remove the bishop to pawn threat and add the bishop to knight threat
ill experiment after your pr gets merged
Okay! STC passed.
the goat
The net from the factorized-not-really-factorized pipeline is essentially the same strength as what we have, but maybe 1 Elo progress:
1 nn-6b685002b4b6.nnue : 2300.6 0.6 73981.5 147456 50
2 nn-598188c9a702.nnue : 2299.4 0.6 73474.5 147456 50
(I guess good enough to include in the branch..)
(and also uploaded to make that easy..)
Yeah I think it is worth trying
We were going to try it also
Relatively soon
awesome
i know nothing about QAT, is there something in some git repo that shows how it's done in bullet?
merged speedup and net
Has anyone tried doubling the size of the later layers in threat net?
long ago (at least for SF) people tried these things, at that point it didn't help much?
might be things have changed, but I somehow doubt.
Yeah... but neural networks show that a more detailed input scheme often requires a larger net to interpret.
Also, the slowdown in later layers is not that significant anyway if it really helps.
pure guess here, but I think for later layers to be really useful, they might need to be significantly wider. Somehow neither 32 nor 64 can present enough features to the later layers to be able to reason much about the board...
They might help a little bit introducing non-linearity or so, but not 'reasoning' or 'tactics'
I see.
You PR’d to wrong branch
Put it against xu-shawn threat_inputs not master
Hmm? Correct branch? I did put it against threat_inputs?
Oops...
I tried to put it against threat_input...
But for some reason, it sent to master.
what's avg. number of threat updates as compared to previous?
wanna see if 1280 might be better now
in terms of speed
https://github.com/jw1912/bullet/commit/f270e3ea72b35d1b3dfaec90be2d964ca18543a8
its there for you now :))
fucking hell jw you are completely unstoppable
@lofty cedar can you rebase https://tests.stockfishchess.org/tests/view/68f637eb637acd2a11e72348 ?
And are we merging the old version
Is it possible for you to try only quantizing threat weights to i8?
Is there anything speaking against l1=896 ?
was suggesting that a while ago
I think I better start that as well.
we'll need to squeeze a few more Elo I'm afraid.
Ah yeah maybe we need finer increments to test
Was unsure if 7*128 ran into problems with avx registers but I guess not
for sure better than 8*127
but yeah, I think 512bits is the unit of concern.
good old days of 512bytes word vectors are gone 😉
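The register arithmetic behind this can be checked quickly. A minimal sketch, assuming 16-bit accumulator lanes and 512-bit vector registers; `registers_needed` is a hypothetical helper for illustration.

```python
def registers_needed(l1: int, lane_bits: int = 16, reg_bits: int = 512):
    """Return (full registers, leftover bits) for an accumulator of l1 lanes."""
    return divmod(l1 * lane_bits, reg_bits)

for l1 in (768, 896, 1024, 1280, 8 * 127):
    full, leftover = registers_needed(l1)
    note = "fits exactly" if leftover == 0 else f"{leftover} bits spill over"
    print(f"L1={l1:5d}: {full} registers, {note}")
```

Under these assumptions 896 = 7*128 fills a whole number of 512-bit registers, while 8*127 = 1016 does not.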
Repeated speedups failing to get past error bars against master
Still kind of disheartening every time it happens though
they should be run until the end, the gains are small enough now that incomplete sprts are probably not very informative.
really dumb question, does it ever make sense to like, train 8 nets in parallel and then select the best one
or do they all end up having the same strength
(of course it's computationally annoying but just wondering the variance in training)
this is kinda what nets are doing internally already due to how subcircuits work
but you can also do model souping which is like a stronger version of this
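For reference, the simplest form of model souping is a uniform average of the weights of several nets that share one architecture. A minimal sketch, with each net represented as a flat list of floats; `uniform_soup` is a hypothetical helper, and nothing here says whether this helps for these nets.

```python
def uniform_soup(nets):
    """Average several weight vectors element by element (uniform soup)."""
    assert nets, "need at least one net"
    size = len(nets[0])
    assert all(len(w) == size for w in nets), "nets must share one shape"
    n = len(nets)
    return [sum(ws) / n for ws in zip(*nets)]

# Two toy "nets" with two weights each.
soup = uniform_soup([[1.0, 2.0], [3.0, 4.0]])
print(soup)
```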
i know, yeah
iirc linrock always did multiple runs
oh ok
idk what a subcircuit is
lol who is this legendary linrock and where did they go
did they move on from computer chess
linrock trained SF's network for a long while
good read: https://distill.pub/2020/circuits/zoom-in/
because he was always responsible for it, nobody cared to reproduce his training setup
so when he took an indefinite break progress literally stopped
He is still active somewhat
Approved one of the recent threat input tests
But overall I think he is mostly happy to move on now that vondele got sufficiently close with reproducing a pre-spsa net
I think you should rebase since Shawn already merged your passed version
He is a few times tagged here on SF Discord and often he responds
Oh... I see.
new progtest concluded with no difference to last one
it is missing the very slightly better net
btw @frosty imp if "speedups" reaches simp bounds threshold should I just stop there
as per suggestion
as per latest result it is actually 2.95 at simp bounds rn
alright well yeah it is simp bound passing so i feel slightly more at ease
cool
stopped
Oopsies. The new "speedup" doesn't interact well with the newer patches.
ofc, tho I will try QAT first
yes, but the variation between fully trained nets is small, 1-2 Elo.
Does weight permutation work on threat input?
Yes
and is being used AFAICT.
@formal smelt https://github.com/Yoshie2000/bullet/blob/plenty/examples/plenty/0126rrr4.rs#L870-L900
Does this look reasonable to you? L1 biases aren't quantised in-engine ofc, but I doubt that makes a huge difference. Loss looks reasonable (i'm fine-tuning an existing net with this new config, loss is about 5% higher than normal in the first SB of the fine-tune)
you can .faux_quantise(value, true);
lgtm, i would probably just have the function "quantise" the weights only rather than also doing the affine op
fn quantise<'a>(mut layer: Affine<'a, CudaMarker>, value: f32) -> Affine<'a, CudaMarker> {
    layer.weights = layer.weights.faux_quantise(value, true);
    layer.bias = layer.bias.faux_quantise(value, true);
    layer
}
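The general idea of fake quantisation during training, as a minimal sketch: weights stay floats but are snapped to the grid the engine will use, so the loss already sees the rounding error. This assumes a plain round-to-grid scheme. It is not bullet's implementation, the meaning of the boolean argument above is not covered, and `fake_quantise` and `to_i8` are hypothetical names.

```python
def fake_quantise(w: float, q: float) -> float:
    """Snap a float weight to the nearest multiple of 1/q (still a float)."""
    return round(w * q) / q

def to_i8(w: float, q: float) -> int:
    """What an engine-side export would store: round, then clamp to int8."""
    return max(-128, min(127, round(w * q)))

q = 64.0          # quantisation factor, as in the 64/63 discussion earlier
w = 0.3
w_q = fake_quantise(w, q)

print(w_q)                       # the value training actually uses
print(to_i8(w, q), to_i8(w_q, q))  # export is identical for both
```

Because the snapped weight already sits on the grid, exporting it to i8 loses nothing further, which is the point of training with it.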
@stray reef how feasible would it be to try and separate the threat tracking to only be done when accumulator update is required?
(on the flip side, if we keep the current structure, is it a sane idea to attempt to prefetch the corresponding weights when the indices are computed?)
actually i think you should also quantise after the pairwise?
I see, thank you!
prefetch corresponding weights when the indices are computed
I tried this with the main net and it didn't help
U should still try it but just a data point
because you do the shift?
I do (after concatting)
line 891
oh yeah i'm blind
Definitely doable, only downside I see is you need sort of duplicate logic of handling moves, as you need to figure out from the move what threats to update in hindsight, and in SF it might be more difficult since it uses make-unmake, not sure
i'm not sure if this is worth anything
ah
so right now for L1=1024 threat tracking and indexing each take about 5% of the overall runtime
idk
https://github.com/Yoshie2000/bullet/blob/plenty/examples/plenty/0126rrr4.rs#L851
@stray reef we have skipping at home :p
how much did that gain?
3-4 SPRT elo
Should I assume stage 5 is best net, or wait for local results
just test stage 5 i guess
alright let's see if the threat-input-psq patch passes soon first
What about a threat finny table? The idea is that when a piece moves to a square, instead of adding the threats of the entire board, we add the previous threat to that piece and the difference between the previous threat and the current threat.
that's exactly what's happening now?
Really?
I mean... isn't the current approach that when a piece moves to a square, we add
the threat of that piece to/from every piece?
nope
i think the biggest issue is fusing the add/sub like that massively inflates
there is a reason it is not done like that for standard psq either
oh you mean the threats of that piece
how is that different from the first message
I see. wouldn't moving a piece then require you to update multiple finny entries
@lofty cedar
^^
@frosty imp In this patch https://tests.stockfishchess.org/tests/view/68f67ce0637acd2a11e723d9 maybe it's better to replace pawn_attacks_bb<BLACK>(s) with attacks_bb<PAWN>(s, BLACK) instead of pawn_attacks_bb<BLACK>(square_bb(s)). attacks_bb<PAWN>(s, BLACK) already uses a table, so it's basically equivalent to the previous version without having to create a new table (using a table saves a few bit shifts; not sure if that's significant, but the test seems to struggle a little).
^^^^
Also did anyone measure whether this patch https://github.com/xu-shawn/Stockfish/commit/d9cbd59e29ea1cf9b04000c43cd971b275509dd7 is a slowdown? From my measurements it looks like a slowdown. I profiled it and the issue seems to be that the new pawn table gets calculated on every function call for some reason; maybe it's better to revert that one
honestly the pawn_attacks_bb thing is just unnecessary
for 768, step 5 is indeed the best, differences are not so large ( -63.62 +/- 1.82, -28.61 +/- 1.82, -27.43 +/- 1.82, -29.80 +/- 1.81, -25.85 +/- 1.81).
Meanwhile 1024 factorized is also ready (-58.02 +/- 1.82, -22.61 +/- 1.82, -20.32 +/- 1.82, -18.55 +/- 1.84, -18.67 +/- 1.82), also here step 5 seems just fine. Maybe somebody can also kick off a test, maybe against the 768 net? https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11805606614/artifacts/browse/step_7ecb955f82bf/
I assume tomorrow evening we will have 1280, and probably the day after 896
Elo | 9.47 +- 2.93 (95%)
Conf | N=20000 Threads=1 Hash=16MB
Games | N: 20634 W: 6151 L: 5589 D: 8894
Penta | [338, 2253, 4631, 2699, 396]
https://furybench.com/test/3472/
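For reference, the Elo and 95% error in a result like the one above can be reproduced from the pentanomial counts. A minimal sketch, assuming the usual logistic Elo model and a normal approximation over game pairs (the function name is made up for illustration, and this is not necessarily how the test site computes it):

```python
import math

def elo_from_penta(penta, z=1.96):
    """Logistic Elo and 95% half-width from pentanomial pair counts.

    penta[i] = number of game pairs that scored i/2 points out of 2.
    """
    pairs = sum(penta)
    # Per-game score of each pair outcome: 0, 0.25, 0.5, 0.75, 1.
    scores = [i / 4 for i in range(5)]
    mean = sum(s * n for s, n in zip(scores, penta)) / pairs
    var = sum(n * (s - mean) ** 2 for s, n in zip(scores, penta)) / pairs
    stderr = math.sqrt(var / pairs)
    elo = 400 * math.log10(mean / (1 - mean))
    # Delta method: slope of the logistic Elo curve at `mean`.
    slope = 400 / (math.log(10) * mean * (1 - mean))
    return elo, z * slope * stderr

elo, err = elo_from_penta([338, 2253, 4631, 2699, 396])
print(f"Elo {elo:.2f} +- {err:.2f}")
```

With the Penta line above as input this should land very close to the reported 9.47 +- 2.93.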
fine-tuning with QAT vs. fine-tuning without QAT (for i8 feature weights)
maybe around -10 fixed nodes to master
pogey
QAT not copium for this then? nice
sounds like the right thing to do ...
there is still some packus accuracy to gain in this impl, maybe 3-4 elo from that. testing STC vs main now
that is a great result tho
seems roughly neutral to master at STC. I'll run another fine-tune with more fine-grained quantisation, inference is a bit slower there but still faster than i16 feature weights
I suppose a positive STC+LTC result with -5 elo at fixed nodes is mergeable? if monty did it too
we also did an SMP test
how did that compare to STC?
Fishtest strangely doesn’t reflect the 5 elo loss against master in h2h against current 1024 net
I guess we’ll see in a few days
I guess careful testing needed, in these sequences we know the inference is always the same, between the runs the sha of the testing binary might not be equivalent. I think 768 and 1024 are essentially equivalent.
But that needs a test on fishtest with care on picking the right version of SF.
(or some analysis of the sha of the SF used for playing)
I started a test for this https://tests.stockfishchess.org/tests/view/68f9e323637acd2a11e7299a , it clashes with this https://tests.stockfishchess.org/tests/view/68f67ce0637acd2a11e723d9 @frosty imp patch but the shawn patch is before the cleanup commit so the base doesnt have the suspected slowdown, I hope thats fine
Can someone measure this? I can measure a significant slowdown but the fishtest test doesn't look too promising so far
Vs master?
This commit against the one prior to that one
https://github.com/xu-shawn/Stockfish/commits/threat_inputs/
so d9cbd59e29ea1cf9b04000c43cd971b275509dd7 vs d71b0865693593f5e9341bede4750a4cc4896ee5
```
Compiled by : g++ (GNUC) 15.2.0 on Linux
Compilation architecture : x86-64-avx512icl
Compilation settings : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.0
Large pages : yes
User invocation : speedtest
Filled invocation : speedtest 12 1536 150
Available processors : 0-11
Thread count : 12
Thread binding : none
TT size [MiB] : 1536
Hash max, avg [per mille] :
single search : 56, 30
single game : 798, 566
Total nodes searched : 2257313338
Total search time [s] : 153.514
Nodes/second : 14704283
```
```
Compiled by : g++ (GNUC) 15.2.0 on Linux
Compilation architecture : x86-64-avx512icl
Compilation settings : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.0
Large pages : yes
User invocation : speedtest
Filled invocation : speedtest 12 1536 150
Available processors : 0-11
Thread count : 12
Thread binding : none
TT size [MiB] : 1536
Hash max, avg [per mille] :
single search : 53, 30
single game : 794, 563
Total nodes searched : 2224392957
Total search time [s] : 153.514
Nodes/second : 14489837
```
Looks like a slowdown on zen 5
Hm thank you that looks like about the same speed difference I get
Maybe im worrying too much here but according to the speedup Elo estimate formula a ~1.5% speedup should be ~3.15 Elo which is (barely) out of error at the moment, but lets see how the test goes
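As a sanity check on the ~1.5% figure, the relative difference between the two speedtest runs above is just the ratio of their nodes/second. A quick sketch (variable names are made up; the Elo conversion is left out because the formula itself isn't given here):

```python
# Nodes/second from the two speedtest runs posted above.
nps_first = 14_704_283
nps_second = 14_489_837

# Relative speed difference of the first run over the second.
rel_diff = nps_first / nps_second - 1
print(f"{rel_diff:.2%}")
```

That comes out a little under 1.5%, in line with the estimate.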
I think that one test is fine, I just wanted to rule out that this is some weird case where the slowdown only exists for me
speedups ought to be transitive in any case.
anyway, tonight probably we have step 5 of the 1280 net, and maybe tomorrow the 896 net...
🙏
doesn't feel like there will be major breakthroughs though..
from local results factorized 1024 maybe another 2 or 3 elo
i just started stc on fishtest a bit ago
yes, seen that.
I tried another way of doing i8 quantisations, this time it was worse (last time STC vs main was neutral)
I think with larger L1s you will both see more benefit from it, as well as see less elo loss from stricter quantisations, it's 100% worth trying in SF
Maybe it's even worth for me to try passing L1=768 as i8 directly, to mitigate the fixed node loss, since it's pretty clear it won't pass my current LTC on its own
What was the different way
I also think threat only i8 is better
Since then you maybe don’t need to quantize so aggressively
less aggressive quantisations at the cost of tiny bit slower inference
ended up not mattering at fixed nodes which way I did it, but the tiny slowdown mattered at STC
(Both with QAT of course)
Can try that tomorrow
Bc I feel like the factorizer screws with psq extra hard
Unless you have a way to bypass
i know nothing about how psq works in the sf arch
I meant like
Quantize the threat feature weights to i8
And keep the psq features as i16
ohh not that psq i see
💀
bullet now has this way to clamp not only factoriser weights and psq weights individually, but also in combination, which is ofc extremely useful for this. but it requires a full training run, not just a fine-tune for me
will do that eventually after my new gpu is set up
the only downside to this is that it means you gotta quantise psq and threat weights differently, and have to scale them back somehow before adding and clamping them
which will make inference a bit slower
if they're different by a factor of 2 it's easy but it's essentially the same slowdown as in my second i8 test (~3 STC elo)
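To make the "different by a factor of 2" case concrete: a rough sketch, assuming psq weights are stored as i16 at some scale and threat weights as i8 at half that scale, so each threat weight is shifted left by one before being added to the accumulator. All names and numbers are invented for illustration; this is not the actual SF or PlentyChess code.

```python
I8_MIN, I8_MAX = -128, 127
I16_MIN, I16_MAX = -32768, 32767

def quantise(x, scale, lo, hi):
    """Round a float weight to an integer at the given scale, clamped to [lo, hi]."""
    return max(lo, min(hi, round(x * scale)))

PSQ_SCALE = 256                    # hypothetical i16 scale for psq weights
THREAT_SCALE = PSQ_SCALE // 2      # threat weights stored at half that scale

def accumulate(psq_weights, threat_weights):
    """Sum active psq (i16) and threat (i8) weights into one accumulator value."""
    acc = sum(psq_weights)
    # Rescale each i8 threat weight to the psq scale before adding.
    acc += sum(w << 1 for w in threat_weights)
    return max(I16_MIN, min(I16_MAX, acc))

psq = [quantise(w, PSQ_SCALE, I16_MIN, I16_MAX) for w in (0.5, -0.25)]
threats = [quantise(w, THREAT_SCALE, I8_MIN, I8_MAX) for w in (0.125, -0.0625, 2.0)]
print(psq, threats, accumulate(psq, threats))
```

The last threat weight (2.0) saturates at 127 in i8, which is the kind of clamping loss the discussion is about; the extra shift per threat weight is where the small inference slowdown would come from.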
Huh can you not just like clamp threat weights to 128
Most of them are small in absolute value anyways
dunno, if that's true then that'd work
Like in our nets the natural frequency of weights exceeding limit is close to 1%%
But the x2 trick for mulhi is the real issue
Actually how do you deal with mulhi
i'd say i do the standard stuff, what's the x2 thing in SF?
Wait you do the mulhi trick, it’s right in https://github.com/Yoshie2000/PlentyChess/blob/main/src/nnue.cpp#L356
Maybe quantization issue
In sf need to internally store 2x the weights
Otherwise one of the shifts will overflow
You can also get around the shift overflow by shifting both values, as in https://github.com/Yoshie2000/PlentyChess/blob/i8weights-3/src/nnue.cpp#L356
but for my master net it's currently not an issue, one shift works fine
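For anyone following along, the arithmetic behind the shift-then-mulhi trick can be checked in plain Python. This is only a model of a 16-bit signed high multiply, under the assumption that activations are non-negative and the total left shift is 7; it is not taken from either codebase.

```python
def fits_i16(x):
    """True if x is representable as a signed 16-bit integer."""
    return -32768 <= x <= 32767

def mulhi16(a, b):
    """Model of a signed 16-bit high multiply: top 16 bits of the 32-bit product."""
    assert fits_i16(a) and fits_i16(b), "operand overflows i16"
    return (a * b) >> 16

def pairwise_one_shift(a, b):
    # Shift one operand left by 7, so the result equals (a * b) >> 9.
    return mulhi16(a << 7, b)

def pairwise_two_shifts(a, b):
    # Split the shift (4 + 3) across both operands: same result, more headroom.
    return mulhi16(a << 4, b << 3)

print(pairwise_one_shift(254, 254), pairwise_two_shifts(254, 254))
print(fits_i16(255 << 7), fits_i16(256 << 7), fits_i16(256 << 4))
```

With a single shift of 7 the shifted operand has to stay at or below 255 to fit in i16, whereas splitting the shift leaves room for larger values, which matches the "shifting both values" workaround mentioned above.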
huh
So if this does end up being a problem for SF it's easy to solve
Anyone tried the new muon optimizer?
out of scope for this post
L1=1280 looking like it'll be -10 stc again compared to factorized L1=1024
i guess can try speculative ltc soon
actually i think maybe a speculative stc smp is better
to reduce issues around copying 260 MB of net per concurrency
if anything I agree smp might be more interesting. However, it looks quite a bit weaker indeed. The 896 net should also be fully trained later today.
Its not really a fail, the name was just chosen poorly
I meant... Muon optimizer for threat input...
But maybe let's make another thread instead?
Factorized stage 5 STC passed!
@rocky vigil
It would probably pass LTC... but would it need testing anyway?
Okay... I'll LTC.
no need for LTC... we're developing a branch. It is obviously stronger than the existing STC, only when changing scales is that useful.
(I've stopped it).
In other news 896 finished.. that's probably more interesting, but I guess it will be weaker than 1024. Will get back with some more data later today I think.
I’ll pr to Shawn’s branch soon
The play I think now is to figure out how “cleanup” should be reverted
Because it looks like it should be reverted somehow
Just use the attack_bb<pawn> thing.
I guess it means that the attack_bb<pawn> is faster.
@frosty imp pr made
somehow the testing shows it as being worse than both 768 and 1024
yes, that's what I see as well, later today I'll come up with a graph. Want to find some time to do fixed nodes test as well. I wonder if somebody could measure once nps for the 4 sizes we have now (with consistent versions of the code, just net size changes).
so, collected the data now..
so, at fixed nodes outperforming master, at tc, underperforming.
I found the dip in performance for 896 and 1280 interesting, as if these versions are for whatever reason slower than 768 and 1024 (like performance goes up smoothly at fixed nodes)
raw data
$ cat ..
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 1280-nn-71f4e3cc3782.nnue : 38.8 1.9 40878.0 73728 55
2 1024-nn-26b0e5126117.nnue : 25.3 1.8 39486.0 73728 54
3 0896-nn-7347b2877a12.nnue : 20.2 1.9 38958.0 73728 53
4 0768-nn-914a5c3a46dc.nnue : 11.5 1.9 38056.0 73728 52
5 master : 0.0 ---- 137534.0 294912 47
White advantage = 40.31 +/- 0.46
Draw rate (equal opponents) = 45.67 % +/- 0.09
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 master : 0.0 ---- 155112.0 294912 53
2 1024-nn-26b0e5126117.nnue : -15.1 1.8 35304.0 73728 48
3 0768-nn-914a5c3a46dc.nnue : -16.0 1.8 35207.0 73728 48
4 0896-nn-7347b2877a12.nnue : -19.8 1.8 34815.5 73728 47
5 1280-nn-71f4e3cc3782.nnue : -23.1 1.8 34473.5 73728 47
White advantage = 41.79 +/- 0.45
Are we also merging cleanup revert
Can code changes be accompanied with a test always. Nobody is skilled enough to do a cleanup in a hot path and know with 100% certainty it has no slowdown
Compiler behaviour can be unpredictable
After we do this I’ll spec stc smp the 1280
I submitted a pr for this https://github.com/xu-shawn/Stockfish/pull/23, I think the rest of the cleanup commit is just formatting changes so I dont think we need to revert the whole commit
I can also run a non regr against the pre cleanup version if thats needed
Is there a 768 factorised net
Most likely worse than 1024 factorized
like how hard is to read that table #1336647760388034610 message
On mobile it is pretty hard :p
nokia hitting back hard 😉
This data seems to suggest “magic numbers” like 768, 1024, 1536 might be optimal
For whatever reason
well, I'm suggesting it must be speed related.
(given performance at fixed nodes)
if asked to explain I'll mumble cache associativity effects ...
but I really have no idea what's causing this
Why?
Cache lines are only 64-bytes long so at most 64 elements.
This data suggests factorisation benefitted the 1024 more than the 768. Might give some signal that convergence time is still a factor
how many HL values fit in a register
tbh it's probably not related to that
It is more related to the fetching of the weights itself I assume
Yeah... though I thought that in modern caches this was an antique concern.
As in, in practice, shouldn't matter.
Huh
I mean... back in the day of direct mapped cache, it mattered a lot.
Nowadays, lots of people just assumed that approximately the last N lines accessed are in the cache.
But why would this matter? 1024 elements are only like 2kb and it's contiguous so even a direct-mapped cache could do.
happily exchanging this idea for the one shown to explain the effect on performance we measured 😉
Well, beyond 1024 elements, we run out of registers.
There are 32 registers in AVX512.
So, 2048 bytes or 1024 elements.
Though the trailing parts might be lagging.
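The register arithmetic here, spelled out (assuming 512-bit registers and 16-bit accumulator elements, which is what the 2048 bytes / 1024 elements figure implies):

```python
NUM_REGISTERS = 32        # vector registers available with AVX-512
REGISTER_BITS = 512
ELEMENT_BITS = 16         # assumed i16 accumulator elements

bytes_total = NUM_REGISTERS * REGISTER_BITS // 8
elements_total = NUM_REGISTERS * REGISTER_BITS // ELEMENT_BITS
print(bytes_total, elements_total)   # 2048 1024
```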
wait, we're looking for a reason why 896 is worse than 768 and 1024
(also 1280, but well)
I somehow suspect 896 is performing at same speed as 1024
I don't think 1280 is underperforming where it should be?
vs master at tc testing, it is the worst?
Like the fixed nodes looks fine
Yes that is what I expect
anyway, at TC testing the performance curve is not smooth, and that would need explanation, IMO
freelo
I suspect on these 72 core machines the net size is more harmful than at fishtest
And that would also explain 1280
possibly.
We really need deduplicate net!
but again, hard to explain the zigzag performance at tc testing
Or at least this hardware consistently gives results around -10 to fishtest
So we could finally test free from bias.
smp is easier
In ~30 min I can set up stc smp for 1280
Also, as a layman, I find it very difficult to follow threat-inputs development inside Shawn Xu branch....
😕
I'll have smp results soon, but they won't get us further before the other PR is fixed
If Shawn merges the pr in the middle
Can also do new stc vs master after this which I expect to be -5 or -6
maybe this time also stc smp
Why not do LTC Vs pre SPSA net and get a green finally
May be good to put things back into perspective
This also an option lol
What was the best pre-spsa net
I mean take the one vondele trained recently
That was -5.4 Elo without SPSA to master
Fixed games or real sprt
Sprt ig
Well it allows a sprt to be performed which gives a higher guarantee of pre SPSA superiority. Also I think doubling the training time of all the stages is still something to attempt
yeah, I think training for a bit longer is something that needs to be done.
but I suspect the gain is going to be small to be honest.
Yes. Maybe 1 Elo lol
We observed this in plenty testing. Big machines had to stay off the threat inputs tests otherwise they ruined the results. It does not seem to matter about mmap, it is about SMP
For whatever reason the big machines did not perform the same as other machines on STC or LTC tests (1 thread)
so this is not understood?
Yeah
shared memory is implemented in plentychess?
Well maybe it's something to do with only 32MB being real L3 cache. Just a speculation
Like the rest has to go through the infinity fabric
6 days for a net
😉
good things come to those who wait
korean saying
@frosty imp can we merge again
merged
1600 epochs
😮
(sorry for the ping 😭 )
@rocky vigil Your progtest still uses the pre merge version, is that intended?
before both merges, so same version as this test https://tests.stockfishchess.org/tests/view/68fa1682637acd2a11e729fd
Shoot
I might’ve forgotten to push origin
oof
Uh stop the stc
Leave the stc smp
I cannot fix it right now
Not at pc
If you want you can submit a progtest ig
Otherwise it’ll be ~2 hours
Im not an approver, I cant stop the STC either
I submitted a new test
https://tests.stockfishchess.org/tests/view/68fbefeb637acd2a11e72d2a @rocky vigil why not run the corrected STC SMP to match this
you can approve https://tests.stockfishchess.org/tests/view/68fc3184637acd2a11e72d4e lol
tbf the error bars mean it's not gonna say much
Can anyone run speedtest?
Locally, I found some improvement.
But not sure how it works on other machines.
at 72 threads:
# PLAYER : RATING ERROR POINTS PLAYED (%)
1 1280-nn-71f4e3cc3782.nnue : 0.1 2.7 15365.0 30720 50
2 master : 0.0 ---- 62236.0 122880 51
3 1024-nn-26b0e5126117.nnue : -2.9 2.9 15236.0 30720 50
4 0896-nn-7347b2877a12.nnue : -7.6 2.9 15032.5 30720 49
5 0768-nn-914a5c3a46dc.nnue : -8.1 2.7 15010.5 30720 49
Yeah...
Should we go with 1280 then?
Maybe we'd need VVLTC to get it passed on fishtest.
How does it go?
setting it up rn
trying to remember how to use git
I'm getting different benches between threat_inputs and your branch
what am I doing wrong
oh oops I'm using the wrong one
lololl
not really.
ok bench is the same now
I thought we were optimizing for the strongest one at TCEC condition.
But maybe 1024 could make tunes easier to test?
IDK.
Result of 100 runs
base (...ish_0553b61e) = 1388113 +/- 1838
test (./stockfish _af2e862 ) = 1440920 +/- 1520
diff = +52807 +/- 2231
speedup = +0.0380
P(speedup > 0) = 1.0000
hopefully I did that right
keep in mind my computer behaves way off the mean worker on fishtest so idk
Well, you monomorphize cases with small loop.
Yes.
ok now that I have it set up I'll take a look
it's always shawns unless said otherwise lol
no that's wrong... contrary to most other engines, where this could be a goal, we actually have a few million users that run on normal hardware. Asking a million people to download 100MB more, to have an actually weaker engine is not a good deal.
I think our target should be to not regress at the normal LTC and LTC SMP conditions.
(in fact be stronger at those).
I think it's maybe stronger at normal LTC already... if not for the fishtest condition.
i would be surprised actually if the current net caused a (significantly) larger download after compression
btw I think https://tests.stockfishchess.org/tests/view/68f67ce0637acd2a11e723d9 is outdated now
as I'm going through the code... is there a high-level description of the threats architecture anywhere?
TBH... I was thinking of maybe sending a different version to the TCEC than the version we have normal people download but having multiple versions would complicate development.
The rest of the arch should be the same for now.
The only change is in the threat... where each threat is an input.
Well, a potential where a piece can capture another piece iirc?
you are approximately tracking which pieces attack which other pieces
gotcha
so like
Though it doesn't include pinned piece logic iirc.
and the feature index is (square, square) or (square, piece, square) or what
can't account for pins bc that would take way too long
```
Square pc_sq, threatened_sq;
```
oh wow so it's the full (piece 1, piece 2, square 1, square 2)
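Just to get a feel for the size of that space, here is a naive dense index over (piece 1, piece 2, square 1, square 2), assuming 12 pieces (6 types per colour) and 64 squares. This is deliberately not the real mapping: the actual get_feature_index is more involved and de-duplicates threats, so the function below is purely hypothetical.

```python
NUM_PIECES = 12    # 6 piece types x 2 colours (assumption for this sketch)
NUM_SQUARES = 64

def naive_threat_index(attacker, attacker_sq, victim, victim_sq):
    """Dense row-major index over (attacker, victim, attacker_sq, victim_sq)."""
    assert 0 <= attacker < NUM_PIECES and 0 <= victim < NUM_PIECES
    assert 0 <= attacker_sq < NUM_SQUARES and 0 <= victim_sq < NUM_SQUARES
    pair = attacker * NUM_PIECES + victim
    return (pair * NUM_SQUARES + attacker_sq) * NUM_SQUARES + victim_sq

NAIVE_SIZE = NUM_PIECES * NUM_PIECES * NUM_SQUARES * NUM_SQUARES
print(NAIVE_SIZE)   # 589824
```

Most of those slots can never be active (a piece cannot attack every square from every square), which is why a compact mapping or a precomputed lookup table is attractive.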
exactly, no way.
The target is already reached anyways, moving net size unnecessary. The SMP test on fishtest is neutral without an spsa to master
one of the lower hanging fruits is to test if replacing get_feature_index with a lookup table is worth it
yeah I was thinking that the mapping from this to the feature index must be convoluted lol
oh interesting, so you do de-duplicate threats where one threat implies the other
yep!
smort
do you de-duplicate pawn->bishop, bishop->pawn or is that not possible
The pdf at the first post gives a lot of this info already
oh I didn't know there was one
it's worth trying i think
ok reading, thx
you replace a popcount and multiple lookups to small arrays
Elo: 91.39....
It's a L1 3072 net
that's -100 Elo on fishtest 😉
not quite, but we can extrapolate the graph.
#1336647760388034610 message
if we were really pushing it probably 1536 is optimal at TCEC conditions
again, tcec can't be the goal for us.
would depend on speed yes
it might also be okish at LTC SMP
but would definitely clock a double digit loss at stc
1024 is good
a nice number
Yes but obviously the test was not intended for any type of net size info... It's to demonstrate the concept only
Clearly it was adequate enough to do that given we are here
ok ithink this makes sense now
It's using WDL 1, trained on captures and checks etc. it's a monty net plugged into SF lol
i still think we have a couple of tricks to pull on 1024
definitely will require more effort to squeeze out last elo though
ah, I now see it is a fixed node test.... so well, it means virtually nothing.
It means the threat inputs are worth something over the regular net. It is an important basis to establish at the start
so, first, obviously, this is still nice work etc, all appreciated. but even at same L1, the net is bigger right? So it is quite logical it is better?
i think it only really got rolling once lofty got it to work in yukari
in SF 😉
but I think the discussion on history doesn't really matter to be honest.
I'm still most interested in getting a better SF out of this.
The magnitude of the fixed nodes test gain combined with the fact it is not at all optimised for usage in an AB engine. But it is not so relevant anyways, I just needed a big number to get ppl motivated to work on it
it does look like my prediction of -5 or -6 for stc progtest will be accurate
much of it probably from net
so, I think we'll probably still get 1-2Elo from net squeezing..
🍋
and we skip spsa 😉
lol
The spsa can be done later. It is inevitable anyways
like it makes sense to do it at the end
once the process is ironed out
in particular I would hope for i8 quantization tests before that
I think that's an example...
how many core hours were spent on the SPSA last time
I guess a few million games at VLTC?

and like it makes further testing so much more difficult.
each individual one is what 60k at ltc smp?
I don't think it is necessary that many. It was done in many stages because it had never been done before at that scale
take i8 as an example
yeah
if it comes after spsa, it is almost a lost case.
It makes incremental tweaks to training almost impossible.
the downside of spsa is that you need to compare Y + spsa vs X + spsa in every further test
i.e. if master net didn't have spsa we would already be beating it
I made these arguments before when the spsa stages started stacking up and didn't seem to matter too much then lol
i think since times have changed, linrock largely moved on
and then the net got stuck
so now opinion on that is different
But since Stockfish usually just accepts local improvements, it often means that SPSA gets accepted easily.
You can make the counter argument that Elo is now rarer. I don't think the conditions changed all that much
we have made those since spsa was used the first time like in 2022.
but hard to resist Elo ..
I mean... SPSA-ing the net is often a way to gain easy elo.
but easy to get into a dead-end.
I mean... it should be a final stage where nothing seems to be improving anymore.
Well I mean if you don't have a rule against it obviously ppl are going to do it lol
So, maybe we should set a period of say 6 months and if no new net comes out we SPSA.
Another thing we could do is mention it somewhere in the wiki or somewhere that newly trained nets should be compared to pre-SPSA nets.
Anyways this isn't actually threat net specific, can move to nnue dev. It is only coming up now because the regular arch had no new nets
I am really hoping that like we get this through and it boosts maybe morale or smth, since it must feel bad to have had the exact same net all the way for almost a year now
like, we show that master net is not invincible, and maybe then some floodgates will open
Do we try our chance with LTC SMP SPRT now? And then VLTC SMP (aka VVLTC).
If it gains, we merge the threat input.
No
What's left?
Have some patience
we still have speedup ideas left to try
while we try those in the meanwhile
we can wait about a week to see if double training time
squeezes out anything further
Oh, okay.
imo we should only do the (v)ltc smp as a formality
like only do it when we know it'll pass
just don't do it at all?
i think it has to be done before merging
I mean we do kinda want VVLTC as a progression test anyway.
so like eventually
A SMP STC and LTC SMP is all that is needed
I do not see where we need a VLTC SMP
That is not a normal test TC
i think is maintainer decision
whether we need non-smp ltc
vondele indicated he would prefer at least a nonreg on ltc
which I think we are also close to
maybe, -3 at ltc rn
This is not answering the question
on fishtest conditions
Where does that mean we need a VLTC SMP
oh wait i did not read, yeah there's no point in vltc smp
ltc is good enough
The SMP outperformance comes from similar threats being active across a search. So 1 multi threaded search benefits from this. I observed it with the regular net also, but not as severe
It's not a mystery really
so essentially that the memory is better able to optimize for hot indices in threats?
measure with perf ....
then why would i8 be worse at smp than normal
Do you know that is related to speed?
(And out of error also)
could you elaborate further on this
No because it is a question lol. Did u measure speed at all
what is the threat inputs vs master pair ratio right now
Look at the result again
You missed the huge overlapping error bars I suppose
to more directly answer, seems to be around 0.9 in stc and 1 in stc smp
tbh
the error bars on this are also more than I would like
i think this got buried
I've put it to prio -1, but will let @frosty imp stop it himself (or modify back to 0 if needed).
in any case i think we can let yoshie be the trailblazer for i8 quant in a/b
would be happy, doesn't the QAT need some changes to the trainer?
yeah there is major work involved in this
The i8 quant is already close in plenty and has more advantage in SF
Larger nets lose less at fixed nodes and have greater speedup
just like how lazy threat calculation should also be worth it
i need to pull up that discussion but i think the main bottleneck is duplicating some make/unmake logic
like the way it's structured right now is that the NNUE knows minimal about how the position works
it pretty much gets fed the differences per position
and runs that through
so if we wanted to do lazy threats we either couple NNUE tighter with position or duplicate some essential position logic for those calculations
both of these are nontrivial
is there a speedup from the fact that a lot of threats are bidirectional (say rooks on the same file) so say we can merge the weights for a pawn on a2 and bishop on b3? not sure how much has been explored already, please link me relevant material
see the pdf on the first post. rough outline of how things are
I see that you handle all non-pawn symmetries, are pawn-bishop and pawn-queen bidirectional encodings too rare to matter?
lol we think alike
if we did I would have found all of your speedups before lol
lazy SMP
improvement on threat_inputs from specializing at the top for different pairs (added.size(), removed.size()) with total less than 4
There are many further improvements, I think those are already handled. But undocumented, have to read code to see them


