#UE Threat Inputs for AB
ok so try these two commits vs. each other
yep
speedtest
1 thread and then n thread
1 thread is probably neutral
since no big mem pressure
but for stuff like concurrency or smp it should help
ooh it seems considerably faster on bench actually
lemme run that first then I'll run some speedtest
@torn lagoon no this is an LSS
Bench can be noisy
we need an LSS emoji
VVLSS when
lololol
sample size and tc are inversely proportional
simply run VVSTC to get VVLSS
I guess I can just replace the command huh
Think you also need to change the regex
==================
base (./stockfish ) = 1461604 +/- 1364
test (...sh.after.gcc) = 1527947 +/- 1821
diff = +66343 +/- 2014
speedup = +0.0454
P(speedup > 0) = 1.0000
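(Side note for anyone reading the summary above: the `speedup` line is consistent with the mean nps difference divided by the base mean. A minimal sketch under that assumption; the function name is made up for illustration and this is not pyshbench's actual code.)

```python
def relative_speedup(base_nps: float, test_nps: float) -> float:
    """Relative nps gain of test over base, e.g. 0.05 means 5% faster."""
    return (test_nps - base_nps) / base_nps


# Means copied from the summary above.
base, test = 1461604, 1527947
print(f"diff    = {test - base:+d}")
print(f"speedup = {relative_speedup(base, test):+.4f}")
```

The `+/-` columns are not modelled here; they depend on how the tool estimates the error of each mean.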
very promising
👀👀
this is with concurrency right
wait is it just 100 in a row
yep
or 100 distributed over N cores
just need to figure out how to modify the trusty pyshbench...
i think you would also need to set the speedtest invocation to be much less than 150 seconds
yeah...
this would also grant approximately similar speedup
since it's just
total number of threads running search
at any given point
i think
in terms of mem pressure
nevertheless it'll definitely show on fishtest
assuming the net doesn't die
in the meanwhile lemme set up old stage 1 net...
is this i8threats already working?
yes unless I bungled something
when fishtest 👀
I just make -j profile-builded consecutive commits from mr. sscg13
no full net
ah
stage 1
ok running speedtest 16 now, be back with results in ~10 minutes
have you tried this stuff locally?
sanity check this and full training should be done in 2 days
I'm currently training a nnue for Prolix lmao
Should try bt4 data lol
so my laptop is anything but reliable rn
in fact the regular bench is 20% slower than with no load
and highly inconsistent ofc
i will get the stage 1 fixed games test up on fishtest tho
average memory bandwidth bottleneck
Gordon Moore shaking in his boots
do the i8 -> i16 conversions show up anywhere
I am referring to cvtepi8_epi16
ye that compiles to the venerable vpmovsxbw
o
move sign extend byte to word
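(For readers who don't live in intrinsics: a scalar model of what the byte-to-word sign extension does per lane, and how it feeds an i16 accumulator. This is only a sketch of the semantics; the helper names are hypothetical, and it ignores i16 wraparound and the actual vector width.)

```python
def sign_extend_i8(byte: int) -> int:
    """Interpret the low 8 bits of `byte` as a signed two's-complement value."""
    byte &= 0xFF
    return byte - 0x100 if byte & 0x80 else byte


def widen_i8_to_i16(raw: bytes) -> list:
    """Scalar model of a per-lane byte -> word sign extension."""
    return [sign_extend_i8(b) for b in raw]


def accumulate(acc: list, column: bytes) -> list:
    """Add one i8 weight column into a wider accumulator, lane by lane."""
    return [a + w for a, w in zip(acc, widen_i8_to_i16(column))]


print(widen_i8_to_i16(bytes([0x00, 0x7F, 0x80, 0xFF])))
```

The point of storing weights as i8 is that the column occupies half the memory of an i16 column; the widening step is the extra work paid for that.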
first pair in, speedtest 16, +3.1%
so a bit more modest with more threads actually
hmm
but we'll see
wait I have an idea
pray tell
I need to double check first
yup then I'll move my king
I'm still hoping there's slightly more efficient ways to load the i8 to i16 than vpmovsxbw spam
my first idea didn't work when Yoshie tried it
but there might be some variant
will muck around later
yeah
if i8 to i16 could be sped up it would be good
though definitely it seems memory -> i8 -> i16 is faster than memory -> i16
ye
at least on some devices... we'll see on fishtest
make sure to turn off autopurge ^_^
curious if anyone has the 256 MB L3 cpus lying around
with this one it'll actually fit in L3
yum
and that might be very big
less mem usage than master net even
my computer has 512 MB but sadly it's split up across the CoRE ComPlExeS
so I don't think it would have ur intended effect
not sure tho
well i can dream
lol
good
1 21663457 22347071 +683614
2 21772336 22595945 +823609
3 21680315 22189962 +509647
4 22011912 22642714 +630802
Result of 4 runs
==================
base (./stockfish ) = 21782005 +/- 157128
test (...sh.after.gcc) = 22443923 +/- 208742
diff = +661918 +/- 127303
speedup = +0.0304
P(speedup > 0) = 1.0000
speedtest 16
would you like me to try more/less threads and/or other parameters?
@split warren has one, where we also tested it for plenty
ah nice
yessir
I think we can interleave the weights on load,
ie aabb -> abab
use srai to extract high bits
use slli+srai to extract low bits
or was this your idea
yeah...
i used mulhi instead of slli+srai when testing it, but yes that was the idea
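(Sketch of the extraction idea just described, modelled on a single 16-bit lane that holds two packed i8 weights: an arithmetic right shift gives the high byte, and a left shift followed by an arithmetic right shift gives the low byte. The helper names are made up, and the real weight layout and the mulhi variant are not modelled here.)

```python
MASK16 = 0xFFFF


def to_i16(x: int) -> int:
    """Wrap an integer to a signed 16-bit lane value."""
    x &= MASK16
    return x - 0x10000 if x & 0x8000 else x


def srai16(x: int, n: int) -> int:
    """Arithmetic shift right on a 16-bit lane (Python's >> is arithmetic on ints)."""
    return to_i16(x) >> n


def slli16(x: int, n: int) -> int:
    """Logical shift left on a 16-bit lane, dropping bits shifted out of the lane."""
    return to_i16(x << n)


def pack(lo: int, hi: int) -> int:
    """Store two i8 values in one 16-bit lane: `hi` in bits 8..15, `lo` in bits 0..7."""
    return to_i16(((hi & 0xFF) << 8) | (lo & 0xFF))


def unpack(lane: int) -> tuple:
    """Recover both sign-extended i8 values from one lane."""
    hi = srai16(lane, 8)             # srai extracts the high byte
    lo = srai16(slli16(lane, 8), 8)  # slli + srai extracts the low byte
    return lo, hi


print(unpack(pack(-3, 7)))
```

This costs one shift for half the weights and two for the other half, which is the trade-off against one widening conversion per half-vector.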
I'm still confused why it didn't work tbh
maybe mulhi latency is too high, with small L1 there's really not a lot of iterations
would slli+srai be better?
well first we should hold up and make sure https://tests.stockfishchess.org/tests/view/6909a35aea4b268f1fac2a61 doesn't die
can easily try that in about 45mins
if validation loss is anything to go by it should be fine in terms of eval quality
the absolute best case would be a double whammy of the QA=255 change making it both faster and better now
nah
new evaluation function is huge
the only real work I have contributed is impl
and the random suggestion to only apply i8 to threats
which turned out to be 🔥
that's like saying the people who built the Panama Canal only dug up 82 kilometers of soil, they didn't actually draw the line on a map
fair
😊
yeah it needed to be done, nn-1c000000000 has reigned as 👑 too long
👻nn-1c000000000👻
with this i hope ppl start exploring eval improvements again
yep
what is the L3 size on ncm? :P
once things are cleaned up I'ma try porting some of my layer combining work to threat inputs
I actually think it could work even better
"NCM uses Dell R7515 128-thread EPYC 7702 dedicated servers to perform its dev build tests. Each server plays 16 games concurrently with 30+0.3 time controls. Hash is set to 128MB, and Threads is set to 8."
256 MB shared
256MB l3
fancy schmancy
I wondered that too
bc with that the entire net could've fit in that 256 MB
what are your ideas here?
maybe they have IPC disabled
maybe it's split up across
https://tests.stockfishchess.org/tests/live_elo/690008ee637acd2a11e73441 one idea I've been musing about is combining add/sub with featureTransformer
maybe shuf instead of left shift is better
idk free up ports? idk
concurrent execution smth smth
even with a noobish implementation like that one ^^ it's a decent bump

random walk go
in Alex we completely elide UE (only keeping finny tables) but do that and it's a speedup
basically a small cache to store old accumulators and positions so we can diff the positions and UE that instead of refreshing from scratch
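(A toy version of that cache idea, assuming one entry per bucket and plain integer weight columns. The class and its fields are hypothetical and much simpler than any engine's real finny table: no bias term, no perspective handling, no SIMD.)

```python
class AccumulatorCache:
    """One cached (feature set, accumulator) pair per bucket."""

    def __init__(self, weights, width):
        self.weights = weights  # weights[feature] -> list of `width` ints
        self.width = width
        self.entries = {}       # bucket -> (frozenset of features, accumulator)

    def refresh(self, bucket, features):
        """Bring the cached accumulator for `bucket` up to date with `features`."""
        features = frozenset(features)
        empty = (frozenset(), [0] * self.width)
        old_features, acc = self.entries.get(bucket, empty)
        acc = list(acc)
        for f in features - old_features:  # appeared since the cached position
            for i, w in enumerate(self.weights[f]):
                acc[i] += w
        for f in old_features - features:  # disappeared since the cached position
            for i, w in enumerate(self.weights[f]):
                acc[i] -= w
        self.entries[bucket] = (features, acc)
        return list(acc)


weights = {0: [1, 2], 1: [10, 20], 2: [100, 200]}
cache = AccumulatorCache(weights, width=2)
cache.refresh(bucket=0, features={0, 1})
print(cache.refresh(bucket=0, features={1, 2}))
```

The second call only touches two columns (one add, one sub) instead of rebuilding from every active feature, which is where the saving would come from.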
I thought about combining it with UE but I didn't want to get depression coding it, I guess you did XD
lol
you anticipated the psychological effects quite precisely
debugging it took a couple hours
Result of 4 runs
==================
base (./stockfish ) = 70245225 +/- 1147585
test (...sh.after.gcc) = 72228060 +/- 894349
diff = +1982836 +/- 871094
speedup = +0.0282
P(speedup > 0) = 1.0000
64 threads
so seems to scale nicely enough (super noisy at these thread counts tho)
yea pretty consistent speedup
why does this guy have 64 threads
I do, do you guys want me to run something specific on it locally?
it has occurred to me that I maybe should've tested fixed nodes sanity check
trying to read SF inference on phone challenge (impossible)
will look later
Though it'll probably be tomorrow morning as I'm not gonna do that on my phone lol
np the code is shit tho
just probing to see whether it's a potentially good idea. Honestly if in the end it's <4 ELO STC I don't think it's worth it bc you completely break encapsulation of the NN layers
anematode you might want to combine this thing with refreshes like we do at alex
I think it gains more
and is less cancer to implement
but idk
yeah I will def try that!
damn what??
that works?
😎
bruh this nnue training for Prolix is not ideal rn
i can only use like 3 concurrency for fixed nodes
sanity check
instead of 12
prolix
So just to make sure I understood, I gotta clone shawns ti branch and then do speedup between the two branches tomorrow?
Sscg13 is the dev, what's the base?
two consecutive commits in his branch
yea
I'll post the results here tomorrow
i think just base & dev of this?
https://tests.stockfishchess.org/tests/view/6909a35aea4b268f1fac2a61
This is perfect
For speedup wouldnt matter as long as I'm using same net in both branches so we cool
nvm it just finished
right on time
lmao
Bruh I have been meaning to, I'll do a Prolix net train for you while you do this, i gotcha
I know you've sent me the info, I'll actually get to it
kyoot
Ngl, looking at ur net the last time, it's actually a pretty quick job 😉
If the benches between the branches are different a speedup test is not valid
Hash gate, thread gate n now benchgate?
scandal central
Still it is invalid
idk I was scammed for an 11% speedup but it was just because bench was bigger so it had higher nps in pyshbench
much more accurate tho
to avoid that
ok
story of my life
I have seen 2-3% variability between patches with that. I tried
about this the data I uploaded is actually outdated now, lemme see if I can get the newest dataset uploaded
@naive comet lol your neural network code is lowk 10x easier to read and understand than Stockfish's
I feel like it's just overabstracted...
maybe once threat inputs are in an overhaul is in order
btw fixed nodes is concerning so lemme look into inference again first
... Stockfish TI-i8 playing White: 144 - 105 - 251 [0.539] 500
... Stockfish TI-i8 playing Black: 69 - 156 - 275 [0.413] 500
... White vs Black: 300 - 174 - 526 [0.563] 1000
Elo difference: -16.7 +/- 14.8, LOS: 1.4 %, DrawRatio: 52.6 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
1000 of 1000 games finished.
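(The Elo line above can be checked from the win/loss/draw totals with the usual logistic formula. A minimal sketch; the function name is made up, and it does not reproduce the error bar or LOS.)

```python
import math


def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Logistic Elo difference implied by a score; undefined for 0% or 100%."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)


# TI-i8 totals from the two lines above: 144-105-251 as White, 69-156-275 as Black.
wins, losses, draws = 144 + 69, 105 + 156, 251 + 275
print(f"Elo difference: {elo_from_wld(wins, losses, draws):.1f}")
```

With those totals the score is 47.6%, which lands on the reported -16.7.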
yeah ik
this should be 255/256
it's also much simpler (although even sf arch in this won't be too complicated)
💀
could that be worth 15 elo
or however much stage 1 is losing at fixed nodes
nah
for me at least it's like minimal
in my old experience
you can honestly retrain from this xd
https://furybench.com/test/3598/ srai(slli(...)) vs. cvtepi8_epi16 (VSTC)
I love the naming
the remaining 5 stages?
yeah 4 fucking stages bro
tbf factorizer stage 1 was also 15 elo above non-factorizer stage 1
and then only 3.5 elo
in the end
just update the branch and retrain from there
The x86 ISA and its consequences have been a disaster for the human race
Ugh hopefully we can figure the regression out 🤞
💀
i think the smallnet just died
with this
actually shoot
yeah
i killed the smallnet
that might regain some portion of elo
so
I have not been able to do i8 on the existing net
in a manner that doesn't lose 300 elo at fixed nodes
if anyone would like to attempt to fix it, it's at https://github.com/sscg13/Stockfish/tree/threat-inputs-i8-originalnet
seems better than maddubs, that one was -5 at this TC, but also does not seem superior, and this is 2+0.02
i'll stop it so my tune finishes faster :P
@violet badger a minor fix in https://github.com/sscg13/nnue-pytorch/commit/68b56ad3bfa98a6433b3e37fd4b26ba9155fbf2c, doesn't require restarting training
stage 1 looks very impressive after fixes (so far, I am saying this as of 3000 games) so i think we should be good to go for the other 4 stages
The current i8 net loads half the vector at once and casts up.
But was that optimal?
Should we instead load the vector in full then split in half?
you prefer training the later stages with the fix, or we go with the current setup full length, or I train in parallel?
am I reading fishtest correctly that the i8 branch gains about 10Elo STC? That would be impressive of course.
yoshie had similar results, it compresses a lot at LTC because of the slight fixed nodes loss also
I've started a second training run with that fix integrated.
Ah either way works
It would’ve been fine to just go through with the original run
But either way the net will be done in ~2-3 days
Stage 1 of course is not an entirely accurate comparison, but at least it would seem that there is not a large fixed nodes loss
Apruvu sama.
I tried loading and then using vector extract instruction to split instead of loading individual vectors.
both runs are ongoing.. we'll see both.
Oops... fixed it not working on AVX2. Apruvu sama.
Run 1: 73598785 nps
Run 2: 73103840 nps
Run 3: 72480310 nps
Run 4: 72441752 nps
Run 5: 73022612 nps
Run 6: 72743447 nps
Run 7: 73217047 nps
Run 8: 73797795 nps
Run 9: 73452183 nps
Run 10: 73739312 nps
Benchmarking 83eb0e1...
Run 1: 67691034 nps
Run 2: 67220747 nps
Run 3: 67977216 nps
Run 4: 67682587 nps
Run 5: 68257834 nps
Run 6: 67943234 nps
Run 7: 67116181 nps
Run 8: 68663434 nps
Run 9: 67257098 nps
Run 10: 67305347 nps
Engine Average NPS Failures
------------------------- --------------- --------
5a6633ad 73159708 0
83eb0e1 67711471 0
Is this correct as of now?
"Some speedup, i8 inference, i8 nets training, verbatim nets.....and eventually SPSA the net"
What would be next steps before merging TI?
Gain to master I believe
gain to master until any Elo can be squeezed, and then SPSA the net, I'm assuming
It was agreed to not spsa
speedtest ratio for the two versions
18137844 / 14678640
1.23566243194192377495
18064633 / 14668557
1.23152079649007056385
(AMD Ryzen 9 3950X)
We are off to the races!!!
Interesting that my computer saw the least speed up this time…
really, this stuff is getting pretty HW dependent.
😩
in fact so HW dependent it is currently not compiling on ARM 😉
LOL true
nnue/nnue_accumulator.cpp:362:65: error: 'vec_convert_8_16' was not declared in this scope
362 | acc[k] = vec_sub_16(acc[k], vec_convert_8_16(column[k]));
| ~~~~~~~~~~~~~~~~^~~~~~~~~~~
nnue/layers/../simd.h:173:43: note: in definition of macro 'vec_sub_16'
173 | #define vec_sub_16(a, b) vsubq_s16(a, b)
| ^
Ill do an investigation about the best approach for ARM
I suspect we should have some improvement from vldq4_u8 or whatever it’s called which loads four vectors in one instruction
oh also, should we maybe merge shared memory into this and run another fishtest to see if modifies the situation?
What's different about it?
Other than it being stronger I mean.
I mean it's just exciting to have a different evaluation scheme
shawn's thesis is that Elo stagnation is in large part due to unchanging evaluation and I'm inclined to agree
@prime mica btw the cached updates I suggested would probably be even more performant for threat inputs
yeeee I will try them soon
Hoping for a new stockfish with thread inputs as master as my early christmas present 🙂
XD
ok we'll try to get it in before Dec 25 :)
@violet badger got it working on ARM...
will do some apple silicon speed tests in a bit
wow, fantastic on ARM
==================
base (./stockfish ) = 1070090 +/- 12943
test (./stockfish_i8 ) = 1196706 +/- 12269
diff = +126615 +/- 6814
speedup = +0.1183
P(speedup > 0) = 1.0000
CPU: 10 x arm
Hyperthreading: off
(Apple M1)
==================
base (./stockfish ) = 1389464 +/- 16853
test (./stockfish_i8 ) = 1539608 +/- 18562
diff = +150144 +/- 2732
speedup = +0.1081
P(speedup > 0) = 1.0000
CPU: 12 x arm
(Apple M4)
thread inputs 🥀
hm the arm64 codegen still looks suboptimal on Apple clang
I'll see if I can squeeze out a bit more with vld1q_s8_x4
@rocky vigil we no longer prescale weights?
Nope
It would force extra x2s elsewhere since i8 is restrictive
or in theory (suppose you could double them for free in add/sub) would it be nice?
the reason I'm asking is bc
ARM's i8 -> i16 conversion instructions have a shift operand
Also ^^
lol
I think just reduce the slli from 7 to 6 to compensate
yeah but does that even help
Or smth
for now I just have shift = 0
not sure
anyway we can flesh it out later, all that matters is it's already a huge win on ARM too
Yeah
Wait for the i8 net(s) to be trained, and after that it will probably be enough elo to pass the sprts against master
Code cleanup also needs to happen
what are your ideas for cleanup
Like just a generic statement
oh sure
I think for one it’s currently hacky how I redefine vec ONE depending on smallnet
Like vec(254 + use_threats)
Also need to add non-avx2 back
There’s a way that Plentychess uses but it was a little too complicated for me to bother copying
Ive wished this too many times to actually admit... But then there's always someone running a cpu from 2004 still
Most often it's these outdated xeon cores v2 or whatever pre avx2 was
sigh
In particular it did not fit in one line lol
I mean we'll just write the straightforward translation and then it'll be good enough right
somewhat crazy idea
could it potentially be profitable to use VNNI instructions with multipliers of ±1 to further improve threat input updates
the pain point is that it accumulates to 32 bits
I'll probably try it once it's merged
threat inputs whoops lol
i do this too fwiw
nice tests running... maybe @rocky vigil can merge this already in his branch.
yeah I can do that
I modified it inline so it won't compile on x86 anymore
but I can fix that
yep!
also didn't check this compiles on old arm...
anyway, progress..
==== 4a97c2ba244790c41bff09968d93430966ac5d48 ====
1 Nodes/second : 290009447
2 Nodes/second : 291033770
Average (over 2): 290521608
==== 83eb0e1d835e138194237c33cc968c48f42a6a68 ====
1 Nodes/second : 267842604
2 Nodes/second : 266942817
Average (over 2): 267392710
good 8% speedup
big ball of moss
nice. how long until the net is fully trained?
1-2 days
stage one being +10 elo at stc as expected
maybe +11
idk how much that minor additional smallnet fix is
vs. what?
vs last run stage 1
39h I would guess.
what is QAT
quantization aware training
ohhh
so far the only real change is it knows about the i8 limits
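(What "knows about the i8 limits" could look like in the forward pass: weights are snapped to the integer grid used at inference and clamped to the i8 range before being used. This is a generic fake-quantization sketch, not the trainer's code; the scale of 64 is an arbitrary example value.)

```python
def fake_quantize(weight: float, scale: int, lo: int = -128, hi: int = 127) -> float:
    """Round to the inference grid, clamp to the i8 range, and map back to float."""
    q = round(weight * scale)
    q = max(lo, min(hi, q))
    return q / scale


# Hypothetical scale; a weight outside the representable range gets clipped.
for w in (0.5, 1.0, 2.5, -3.0):
    print(w, "->", fake_quantize(w, scale=64))
```

In typical QAT setups the gradient is passed through the rounding step as if it were the identity, so the float weights keep training while the loss sees the quantized values.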
in my experiments it only helped when the quantisation was really tight (when all feature weights are i8)
right now quantisation isn't really different than before
but ofc maybe it's a way to squeeze another elo at the cost of training speed :P
oh you did try quantizing the main net?
what was the fixed-nodes loss
linrock claimed the quantization change is worth 1 elo or so
so it's likely we might not even see fixed nodes loss
if we're only losing 1Elo we're not quantizing hard enough 😉
lololol
int4 SF when?
ideal
if that's for real then we should strongly consider that
oh i meant like
the 127 -> 255 QA change
maybe cancels out the slight loss
from i8
ohh I see
i guess it's a "good antiscaler"
big gain at stc, moderate gain at ltc, neutral at vvltc
ok i checked back what i actually tested. QA=63 + QAT was about as strong as QA=127 (both -10 fixed nodes to master). but with QA=127, QAT did not help (same result against master)
this is all full i8 (except master, that was still i16 back then)
interesting ok
such stuff is technically antiscaling but ppl would be fine adding it
i guess it's only a bad antiscaler if it goes negative
but that's where I think QAT could help. See how much it reduces fixed node Elo loss.
i thought the fixed node loss was only -2 or smth
so, freelo 😉
hopefully the good results continue up to stage 4/5
I think there is also some loss on the other parts of the net.
we could probably soon test stage 3.
that's pretty close to a converged net.
i thought the later layers remained unchanged
I mean they are also quantized from float to int
I'm wondering if part of the SPSA gains are just related to cleaning up quantizing..
that would be cruel but hilarious
i remember viren saying a while ago that the quantization in later layers has a large effect
i think it is worth revisiting
though idt it's exclusive to threat inputs
(QAT on the later layers, that is)
can't test rn but I think it should work
maybe we can extend the quantization to weights later
can try to run this branch as well. I'm just somewhat surprised that this is the way it is done. I would expect some term added to the loss, that drives weights to be close to quantized values.
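(The alternative mentioned here, a loss term that pulls weights toward quantized values, could be sketched like this. It illustrates the idea only; nothing in the chat says any trainer implements it, and the function name and scale are assumptions.)

```python
def quantization_penalty(weights, scale: int) -> float:
    """Mean squared distance from each scaled weight to its nearest integer."""
    total = 0.0
    for w in weights:
        scaled = w * scale
        total += (scaled - round(scaled)) ** 2
    return total / len(weights)


# Weights already on the grid contribute nothing; off-grid weights are penalized.
print(quantization_penalty([0.5, 1.0], scale=64))
print(quantization_penalty([0.5 + 1 / 256], scale=64))
```

Such a term would be added to the training loss with some small coefficient, which is one more hyperparameter to tune compared with fake quantization in the forward pass.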
hmm I'm not sure if the quantization is applied to the weights or activations just from the bullet commit alone
quantized weight version
💦
so the latter version is the thing to run, I assume?
probably
step 3 training finished..
test is up, but I'm not sure what is being tested against what 🙂
I think against stage 3 of previous threat weights run
nope
O?
just stage 3 against stage 5
I see ok
so best threats setup, against current stage 3 i8
stages 4/5 are worth max like 3 elo anyways from what we've seen
yeah.
Against master or against previous threat inputs
I am giddy
Time to sip my morning coffee and watch the Elo while reading the news
you'd better sip Elo while watching the news.
lololol does it taste good
only one little sip and you're hooked, i've heard
Well shit I gotta find some then
let me trigger it a little bit
😍
gg sf 18 is here
what are we expecting Elo wise?
having no expectations is the safest, but some speedup O(10Elo) and some quantization error O(1Elo). As long as prefactors are not 1/3 and 3, all good.
Elo: 6.32 ± 3.6
comports with stage 4/5 being handful of points right?
big error bars tho
test vs master now?
I would wait for the stages 4/5 to finish.. we can then pick the best net
A few stage 4/5 runs are finishing in the next few days
Btw we can look into removing leb128 for the threat weights
It literally cannot perform better
And removing it would simplify the current parsing code
Yeah
Rn what is done is read the entire thing into a big array
And then move it into separate arrays
Because as it turns out our readleb128 also includes a length
So it actually cannot just be read directly
read_little_endian is our function for this
Which maybe also guards against the machine being big endian somehow
Ah nvm single byte
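(Why LEB128 can't beat verbatim storage for i8 weights: each LEB128 byte carries 7 payload bits, so only values in [-64, 63] fit in one byte and the rest of the i8 range needs two. A sketch assuming the standard signed LEB128 scheme; these helpers are illustrative and are not the engine's reader.)

```python
def encode_sleb128(value: int) -> bytes:
    """Encode a signed integer as standard signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift keeps the sign
        sign_bit = bool(byte & 0x40)
        done = (value == 0 and not sign_bit) or (value == -1 and sign_bit)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def decode_sleb128(data: bytes) -> int:
    """Decode one signed LEB128 value from the start of `data`."""
    result, shift = 0, 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:      # last byte of this value
            if byte & 0x40:      # sign bit set: extend the sign
                result -= 1 << shift
            return result
    raise ValueError("truncated LEB128 value")


one_byte = [v for v in range(-128, 128) if len(encode_sleb128(v)) == 1]
print(min(one_byte), max(one_byte), len(one_byte))
```

So a stream of i8 weights is at best the same size as the raw bytes and is larger whenever a weight falls outside [-64, 63].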
This actually produces a noticeable startup time
that's surprising..
unless you mean read_leb_128 is slow
which would make sense to me
It’s like half a second
Or smth
Idk
At the very least if I open the exe and type uci right away it doesn’t process instantly
So the threat-inputs-i8 is the new optimisation that to my understanding reduces weight precision slightly to improve the speed of the network? Does that make it a slight antiscaler?
what about vltc?
Neutral most likely
Well
The fixed nodes loss is probably maximum 1-2 elo
So the speedup is well worth it
Lots of variables here
has this been tried with the smallnet as well? I'm guessing it's also highly worth it
Easy in SF. Difficult on the trainer side
not yet, still waiting for nets to train
although stage 3 is already much better at stc
for now yeah
you have seen the light?
Might be possible to train smallnet with TI as well?
Or do you think it wouldn't gain?
Also, does vondele have the recipe for smallnet cooking?
🤔
The primary advantage of smallnet is speed, so I think keeping it without threat inputs is most beneficial
Probably the Elo gain on TI paradigm for smallnet would be compensated by the speed loss, so in the end smallnet with TI might actually be neutral at best
😐
we did train a smallnet with threats.. but it can't gain IMO.
Viren also had an idea to use a single net but either psq inputs only or psq + threats, although experimenting with that can wait for after merge
How long until the net is fully trained?
that's an interesting idea actually
ok quick question, aren't the threats which involve a piece attacking a king in a way that can't be blocked (e.g. slider directly adjacent, or knight attack) completely redundant?
because they are implied by the corresponding main net feature...
like random example
the threat "queen on d8 attacks king on c7" is active if and only if the main net feature "king on c7 and queen on d8" is active
if I'm not mistaken we could actually test this post-training... just need to add the threat weights to the right part of the main net weights then zero out the original
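(The post-training test proposed here, as a sketch: if a threat feature is active exactly when some main-net feature is active, its column can be added into that main-net column and then zeroed without changing the accumulator. The feature names below are made up, and this assumes the "if and only if" really holds for the bucket and side to move in question.)

```python
def fold_threat_into_psq(psq_weights, threat_weights, psq_index, threat_index):
    """Add one threat column into one main-net column, then zero the threat column."""
    psq_col = psq_weights[psq_index]
    threat_col = threat_weights[threat_index]
    for i in range(len(psq_col)):
        psq_col[i] += threat_col[i]
        threat_col[i] = 0


def accumulator(psq_weights, threat_weights, psq_index, threat_index):
    """Accumulator contribution when both (always co-active) features are on."""
    return [a + b for a, b in zip(psq_weights[psq_index], threat_weights[threat_index])]


psq = {"Kc7_Qd8": [1, 2, 3]}      # hypothetical main-net feature
thr = {"Qd8xKc7": [10, -20, 30]}  # hypothetical threat feature
before = accumulator(psq, thr, "Kc7_Qd8", "Qd8xKc7")
fold_threat_into_psq(psq, thr, "Kc7_Qd8", "Qd8xKc7")
print(before == accumulator(psq, thr, "Kc7_Qd8", "Qd8xKc7"))
```

After folding, the threat feature's add/sub could be skipped at inference, which is the only place a gain would come from.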
But with threats now the total is not the sum of the parts
Knight on e5 and knight on f7 isn’t the sum of their individual weights
there are 3 running ...
ah this works for certain bucket setups yeah
but only for the correct stm
yes sure
In general they are redundant since we never evaluate positions in check, though rn5 was not able to get it to gain
This one?
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2138356933
But I can't see anything....
😐
do we ever train on positions in check? if not then those features will probably be driven to ~0 anyway...
no.. skipped
surely it would be good to skip add/sub for them though
I'll take another look at rn5's work
Also some interesting thing is bullet initializes threats / psq separately according to their individual sparsities
Idk if it’s any good
But can certainly be tested
In general I wonder how much weight initialization matters
pipelines correspond to PR here https://github.com/vondele/nettest/pulls
right now it is a bit harder to see, as I pushed an additional commit to 2 of 3 PRs, and github doesn't show the pipeline on the previous commit to be active, despite it being active.
👀
seems real.
arm speedups for i8 done some big magic
as an aside, that is the strongest stage 2 I have ever seen
no fucking way
agree, that's sweet ...
not without reason.
impressively calm response
8640 messages later..
how can I download the net?
I'm curious how it'll be on my computer given the lesser speedup
(you can get there via the artifacts of the proper training step)
See if we can squeeze out a bit more..
excellent
I agree, though. It is quite spectacular.
let's triple check somehow 😉
-engine name=reference cmd=/workspace/scratch/packages/stockfish/69a01b88f35db2a5003d42116f573207ca5c275b-profile-build/Stockfish/src/stockfish
undeniable...
Maybe it doesn't scale at all and is therefore useless 
threats known antiscaler.
is it happening👀
looks safe enough to use stage 4 so I'll start a few progtests on fishtest
might be sprt against master time?
well, one at a time..
ok
don't forget to turn off auto purge ;)
Wait, no way we are about to get a fishtest of threat inputs vs master that is gaining?! 🥺
threat inputs
maybe I'll figure out a way to add thread count as a feature
then we'll have true thread inputs ;)
So is it on fishtest yet? 🙂
where do I check the error bars?
it'll slide down and settle at +1 </pessimism>
im betting +7 because thats what the SPRT is currently saying
Okay, guess ill middleground my guesses: 5.5 +-1 😂
i'm guessing 0 +/- 1199.99
imma bet 4, not 3, not 5, but 4
Its gonna be 1 guys...
vvltc smp banger
what is different compared to this in this test?
different net... but not sure if that can explain the delta
and different arch ofc but I don't think that's entirely it either
#bet fail red -0.69
Though I don't think threat input is an anti-scaler.
It's quite well-established that bigger neural networks are good scalers, not meaning that it would scale well to have a bloated net size, but that if the bigger net is already good at STC, it would probably be good at LTC and above.
However, I think it might have to do with search.
I think it was a joke
I think it may be an apparent anti-scaler if the search tuning was so heavily done on the old net that the search now adjusts for all the quirks of the old net.
And since the search-wide tuning was done at VVLTC, as you approach longer TC, you're fighting an increasingly uphill battle.
maybe it becomes anti scaler if the speedup was too good
in that case we just increase L1
yes it was
it looks like a lot more stuff got inlined into evaluate in master than in threat inputs
I wonder if forcing inlining would be good or bad
partial_insertion_sort is still taking a disgusting amount of time 😩
rn the fishtest isn't going great... but I believe that better nets will come! And tuning for it will help a ton
Guys, which net is next? Like to be tested on fishtest, I assume the current threat-inputs-i8 (update net) isn't the strongest one coming?
not sure
Any good ideas that weren't put into that net?
if it's really close between master and threat inputs that's pretty great bc I'm positive there are more speedups to be found
yeah its -0.13 rn
spsa i guess
massive error bars ofc haha
yeah but only +- like 2 elo at this point right?
no I meant elo haha
oh ok
ehhh the trajectories are very variable lol
especially with this where there's probably large inter-computer differences
actually I haven't looked at the residuals yet let's see
it feels like a sport event watching the dials update live lol
nerdiest sporting event in history that is
lololol
yeah just eyeballing the residuals, AVX2 machines are suffering while AVX512 machines are doing swell
probably the vpmovsxbw spam 😩
where are the avx2 matches???
this is the only link I use: https://tests.stockfishchess.org/tests/live_elo/690d2514ec1d00d2c195beb5
where?
yeah is it here?
yep, you see the big table
one thing I've been meaning to add to fishtest is an ability to aggregate by some property
how do I read the avx2 vs 512 differences? The residuals?
yes but if I'm not mistaken the residual doesn't actually tell you whether the worker is significantly lower or higher than the mean
so you have to look at the pentanomial
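(The "aggregate by some property" idea, sketched outside fishtest: sum the pentanomial counts of all workers that share a key such as the arch string. The row format and the numbers below are invented for illustration and are not fishtest's data model.)

```python
from collections import defaultdict


def aggregate_by(rows, key):
    """Sum 5-bucket pentanomial counts over all rows sharing the same `key` value."""
    totals = defaultdict(lambda: [0, 0, 0, 0, 0])
    for row in rows:
        bucket = totals[row[key]]
        for i, count in enumerate(row["ptnml"]):
            bucket[i] += count
    return dict(totals)


# Made-up worker rows; a real table would have one row per worker task.
rows = [
    {"arch": "avx2", "ptnml": [1, 2, 3, 2, 1]},
    {"arch": "avx512", "ptnml": [0, 1, 5, 2, 0]},
    {"arch": "avx2", "ptnml": [0, 1, 1, 1, 0]},
]
for arch, ptnml in aggregate_by(rows, "arch").items():
    print(arch, ptnml)
```

Per-group Elo and LLR can then be computed from each summed pentanomial the same way as for the whole test.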
anyway dw about it
we'll see in 12 hours where we're at
haha, yeah wait till it gets to 50k games ig
why hasn't l2 been increased to 31 to compensate for the smaller accumulator?
Results of New vs Base (30+0.3, 8t, 256MB, UHO_4060_v4.epd):
Elo: 9.17 +/- 10.71, nElo: 21.75 +/- 25.38
LOS: 95.35 %, DrawRatio: 64.17 %, PairsRatio: 1.35
Games: 720, Wins: 201, Losses: 182, Draws: 337, Points: 369.5 (51.32 %)
Ptnml(0-2): [0, 55, 231, 74, 0], WL/DD Ratio: 1.22
LLR: 0.25 (8.4%) (-2.94, 2.94) [0.00, 2.00]
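(How the Elo, nElo and PairsRatio figures above relate to the pentanomial, as a sketch. It assumes the common definitions: logistic Elo from the mean score, and normalized Elo from the mean and the per-pair variance. It does not compute the error bars or the LLR, and it needs at least one losing pair.)

```python
import math


def pentanomial_stats(ptnml):
    """Score, Elo, nElo and pairs ratio from counts of pairs scoring 0, 0.5, 1, 1.5, 2."""
    pairs = sum(ptnml)
    values = [0.0, 0.25, 0.5, 0.75, 1.0]  # per-game score of each pair outcome
    mean = sum(v * n for v, n in zip(values, ptnml)) / pairs
    var = sum(n * (v - mean) ** 2 for v, n in zip(values, ptnml)) / pairs
    return {
        "score": mean,
        "elo": -400.0 * math.log10(1.0 / mean - 1.0),
        "nelo": (mean - 0.5) / math.sqrt(2.0 * var) * 800.0 / math.log(10.0),
        "pairs_ratio": (ptnml[3] + ptnml[4]) / (ptnml[0] + ptnml[1]),
    }


stats = pentanomial_stats([0, 55, 231, 74, 0])  # Ptnml(0-2) from the output above
print({name: round(value, 2) for name, value in stats.items()})
```

With that pentanomial the sketch lands on the reported 51.32% score, Elo 9.17, nElo of about 21.75 and PairsRatio 1.35.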
results trickling in from a VVLTC run I'm doing
probably equivalent to 80+0.8 8t or so on fishtest
not quite as dramatic as the STC on vondele's CI...
Looking into some QAT
We have a couple more training runs so far, might brute force a better net by sheer luck
Also like Daniel said could also test increasing L2 size
in that case elo diff vs master converges to some number > 0 as time -> infinity tho
is the SPRT that's running rn a stage 5 net?
both a stage 4 and stage 5 one are running rn
local testing had them ~~equal but the stage 5 one is performing better on fishtest so far
yeah! so far: stage 4 net = -0.26 elo (30k games) stage 5 net = 2.25 elo (17k games)
btw it's time to look into preparing the branch for PR
so, what would we like the format of the net to be
some minor notes I have
- change the mirroring of threat inputs to efgh (this can be done by permuting the weights, e.g.)
- change the net format to store the i8 weights verbatim
- ensure compilation works with all architectures
- clean up the code generally
- (optionally) do a lil bitcoin mining to rename the net
on the net side there are still a couple other things to try
check if L1=1280 works again after the i8 speedup
or check if L2=31(+1) works with the general L1 reduction
Looking forward to SF18! 🙂
heh would need to be like 10 elo gain for that
although
+X vvltc and another +Y from the (search) spsa that will happen after gets us closer
to sf 18
let's not confuse this thread with SF18, it is going to be complex enough without that aspect 😉
would require a full retrain
though
so 3 days or so
Do it!
need @violet badger to set it up
would also be helpful if someone had a profile of latest
branch
my general estimation is that l2=31 will end up being -5% speed or so
#1336647760388034610 message ?
I'll setup the training, ultimately, one needs to measure to get a real number
yeah
it would appear that l2 etc. take up 8% of total runtime
so I think -5% speed from doubling l2 seems reasonable
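(Back-of-the-envelope version of that estimate: if a part of the evaluation takes a fraction of total runtime and becomes some factor more expensive, the nps change follows directly. A sketch with a made-up function name; the 8% covers "l2 etc.", so treating all of it as doubling gives an upper bound on the loss.)

```python
def speed_change(fraction: float, cost_factor: float) -> float:
    """Relative nps change when a part taking `fraction` of runtime costs `cost_factor` times more."""
    new_runtime = 1.0 + fraction * (cost_factor - 1.0)
    return 1.0 / new_runtime - 1.0


# Whole 8% doubles: worst case. Only part of it doubling gives a smaller loss.
print(f"{speed_change(0.08, 2.0):+.1%}")
print(f"{speed_change(0.05, 2.0):+.1%}")
```

So -5% is plausible if only part of that 8% scales with L2, and as said above, measuring is the only way to get the real number.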
so you change both l1 and l2?
i just saw it and thought it should change to make it more accurate
for cosmetic purposes
so started
im going to predict that fails at TC lol
(if this figure is correct)
Giving a prediction does not suggest not giving it a try (obviously)
The vibe I got was there was unusually high optimism for this
nah I suspect 16 is optimal as well
in fact I am surprised 16 is better than 8 even
but indeed there is the possibility that since input -> l1 takes more portion of total time relative to l1 -> l2, l2 could be increased
can I have some clarity - will threat-inputs-i8 eventually be merged into threat_inputs branch?
and if i want to write a patch that applies for both - which one do I base on/test against?
yes we in fact can merge it rn if shawn wants
i think i8 has been proven to be much better so going forward it only matters to test on that
ok
rn anematode stage 5 branch performing slightly better on fishtest but still well within error bars
so idt it matters which of the i8 branches you use
Maybe someone could make Finny table work with threat input?
Though not sure if it would gain.
So what explains the chasm between the run on CI and on fishtest
😩
that's reasonable imo
Yeah we should
if it's antiscaling we have a slight issue
.
just requires positive STC in that case
which is doable
still
but not looking too crazy at very long TC:
Results of New vs Base (30+0.3, 8t, 256MB, UHO_4060_v4.epd):
Elo: 3.43 +/- 3.73, nElo: 8.24 +/- 8.96
LOS: 96.43 %, DrawRatio: 65.57 %, PairsRatio: 1.12
Games: 5780, Wins: 1610, Losses: 1553, Draws: 2617, Points: 2918.5 (50.49 %)
Ptnml(0-2): [0, 470, 1895, 523, 2], WL/DD Ratio: 1.33
LLR: 0.69 (23.5%) (-2.94, 2.94) [0.00, 2.00]
(this is master vs. the stage 5 net)
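For reference, the Elo figure in the result above can be reproduced from the Ptnml counts. A minimal sketch, assuming the usual logistic Elo formula and that pentanomial bucket i is a game pair worth i/2 points; the function names are made up for illustration and are not taken from fishtest or the match runner:

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Score fraction from pentanomial counts.
// Bucket i (0..4) counts game pairs that scored i/2 points out of 2 games.
double pentanomial_score(const std::array<std::int64_t, 5>& ptnml) {
    std::int64_t pairs = 0;
    double points = 0.0;
    for (int i = 0; i < 5; ++i) {
        pairs += ptnml[i];
        points += 0.5 * i * ptnml[i];
    }
    return points / (2.0 * pairs);  // two games per pair
}

// Logistic Elo difference implied by a score fraction strictly between 0 and 1.
double logistic_elo(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Feeding in [0, 470, 1895, 523, 2] gives 2918.5 points out of 5780 games, which is the 50.49 % shown in the Games line.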
we'll see on fishtest ofc
also I'm still confused, I thought we benchmarked some very nice speed gains on x86
but shouldn't that help quite a bit on the current SPRTs?
indeed we went from -9 to 0
at stc
ohhh
https://tests.stockfishchess.org/tests/view/690d2514ec1d00d2c195beb5
GROUPED BY ARCH
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -2.45 ± 2.62 | LOS: 3.3% | LLR: -1.77 | [89, 2356, 4773, 2172, 114]
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 2.85 ± 2.92 | LOS: 97.2% | LLR: 1.11 | [60, 1760, 3752, 1861, 71]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -4.12 ± 3.68 | LOS: 1.4% | LLR: -1.40 | [60, 1200, 2437, 1097, 54]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -2.55 ± 3.97 | LOS: 10.4% | LLR: -0.80 | [51, 1061, 2141, 970, 65]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 4.68 ± 5.74 | LOS: 94.5% | LLR: 0.51 | [20, 445, 985, 498, 20]
64bit POPCNT NEON_DOTPROD | Elo: 45.23 ± 23.07 | LOS: 100.0% | LLR: 0.33 | [0, 15, 55, 40, 2]
ok well pending scaling tests... does that mean that if we get like a 3% speedup across the board on x86 specific to threat inputs (not saying this is easy!) then we should be ok?
Nice summary.. it basically is a question of having 'the right' HW on fishtest right now. VNNI seems to like this as well.
With fleed it would pass in a second xD
ik...
We simply have to market this as mobile first. People only use their cell phones for everything anyway. 
total neon penta over the 2 tests
what LLR for [0, 2] bounds?
0.5

still need 5000 more neon games
it is still somewhat surprising that neon does so well. Is there some particular instruction that works very well?
I think the i8 -> i16 conversions are much cheaper
maybe not the only cause but you can do one load of 128-bits and then unpack it in two instructions into two 128-bit registers fit for accumulation
whereas on x86 we're doing one load + one conversion per accumulation
Yoshie tried a technique I suggested to avoid that but it performed worse
#ifdef USE_NEON
    for (IndexType k = 0; k < Tiling::NumRegs; k += 2) {
        acc[k]     = vec_sub_16(acc[k], vmovl_s8(vget_low_s8(column[k / 2])));
        acc[k + 1] = vec_sub_16(acc[k + 1], vmovl_high_s8(column[k / 2]));
    }
#else
(vget_low_s8 is a no-op in assembly, just for casting porpoises)
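For anyone reading along without the intrinsics in their head, here is a portable scalar sketch of what that loop does per lane: sign-extend each i8 weight to i16, then subtract it from the i16 accumulator. The names (`kLanes`, `sub_widened`) are hypothetical and this is not the actual Stockfish code, just the arithmetic it performs for one 128-bit column load:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One 128-bit load holds 16 int8 weights; widened to int16 they fill
// two 128-bit accumulator registers (8 lanes each).
constexpr std::size_t kLanes = 16;

void sub_widened(std::array<std::int16_t, kLanes>& acc,
                 const std::array<std::int8_t, kLanes>& column) {
    for (std::size_t i = 0; i < kLanes; ++i) {
        // Sign-extend i8 -> i16 (the widening step), then subtract from the accumulator.
        const std::int16_t w = static_cast<std::int16_t>(column[i]);
        acc[i] = static_cast<std::int16_t>(acc[i] - w);
    }
}
```

The point made above is about where the widening happens: on NEON one load feeds two cheap unpack instructions, while the x86 path described here pays one load plus one conversion per accumulator register.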
I don't think that explains the full gap tho
yeah there's no way this explains so much elo
especially since it's probably still mostly memory bound on x86
honestly I'm not so sure ab that anymore...
I'll do some profiling later on my friend's older Intel box
I think that big gap can almost only be explained by the used data now somehow fitting in some cache?
or by magic a better access pattern?
hm yeah, maybe the VNNI trend is because of newer machines being more well-endowed
possibly.
I tried fusing load before. Didn't work.
😩
ye I saw
@violet badger someone really wanted this STC to pass huh? xD
oh, it passed 😮
basically just documenting how HW dependent this is ....
let's wait a bit for the second test to pass, and after that submit LTC (for which I will remove again the arm machines).
cool
it does seem the arm machines simply print llr on the tests
well given the Elo numbers we had in the pipeline, it is clear they can easily do that
I would say go ahead and submit one LTC.
ok
i suppose we just choose at random
since stage 4 and stage 5 are well within error bars at fishtest also
since I'm online rn i guess it'll just be stage 4 then
either is fine.
eventually we sprt things against each other.
in principle 4 is nicer, since it would establish a shorter training baseline.
I've also updated the reference in the pipelines to be your branch at the f3f net.
and switched it to sprt
so we will more easily see what is better in future tests.
in principle I'd have to stop and restart the pipeline for it to pick up that commit that changes the reference.
(right now I think the yaml it is testing is not yet with a suitable sf sha).
anematode machine is like solo killing this test kek
smth like -27 -39 +1 -29 pairs
anematode-128cores-7b133829 | Elo: -6.29 ± 4.66 | LOS: 0.4% | LLR: -1.19 | [4, 608, 1351, 516, 5]
Yaaaa the i8 speedup was so small on my computer
The break even point is probably even higher TC
Silly suggestion, are there any things in search which are known to depend strongly on evaluation accuracy
or is the tuning far too diffuse
well, any static eval based heuristic more or less
but it's like hmm
+90 elo from eval with major slowdown ~= +15-20 elo from search patches
at least this is what it was when the very first NNUE was introduced
also in general tuning should handle it nicely anyway
right
like I'm wondering whether it makes sense given that we're within shooting distance of master to try some basic search tuning and see if we can exceed it
and that effort can be done in parallel with trying to speed up x86
eh, not really imho
search tuning can and probably should be done on top of the net imho
which net
which passes
well there's no way to prove that those spsa results actually only work with the threat inputs net
SPSA?
maybe you are just tuning the search to be better regardless of what net is being used
in that case, you're just hacking the net in and using the tune as an excuse to get a passing SPRT
sure... but that can be validated by using the same parameters with the master net, no?
or running two SPSAs although that's expensive
I would prefer to not touch search with new arch
I think maintainers will also be like this )
kk
yeah would be nice to break even with just net, and my expectation is that would lead to search tweaks afterwards.
ARMed riot against x86 right now
lol
hmph about search I would guess that since it has threat inputs
we can be more aggressive with capture pruning and qsearch
maybe
ofc it's all pretty vague
sure
maybe finally history adjustment will work for captures?
or correction history adjustments ?
```cpp
#ifdef USE_NEON
constexpr bool threat_inputs = true;
#else
constexpr bool threat_inputs = false;
#endif
```
well this one can actually be a fluke
I'm wondering whether speedups matter more at LTC than generally believed...
@foggy wind would u mind running your Elo bucketing script on the PT https://tests.stockfishchess.org/tests/view/68ee711328e6d77fcff9fd63 to see the difference between VNNI and non-VNNI architectures
they are like 70% of what they are at STC I think?
GROUPED BY ARCH
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 30.50 ± 1.84 | LOS: 100.0% | LLR: 29.71 | [2, 1204, 7486, 3301, 4]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 25.27 ± 2.04 | LOS: 100.0% | LLR: 20.40 | [3, 1015, 6107, 2396, 4]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 24.71 ± 2.93 | LOS: 100.0% | LLR: 9.85 | [0, 515, 2989, 1178, 1]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 19.43 ± 3.43 | LOS: 100.0% | LLR: 5.70 | [0, 388, 2176, 755, 2]
64bit SSE41 SSSE3 SSE2 POPCNT | Elo: 21.28 ± 9.48 | LOS: 100.0% | LLR: 0.80 | [1, 57, 300, 115, 1]
because average patch is like ~ the same elo at LTC as it is at STC
and some even hyperscale
but it doesn't mean they are like 30% lol
O wow, +5 delta between VNNI and normal AVX512
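A rough way to put an error bar on that delta, as a sketch only: it assumes the two arch groups are independent samples and that both ± values are intervals at the same confidence level, so the errors add in quadrature. `EloEstimate` and `elo_difference` are illustrative names, not part of any bucketing script:

```cpp
#include <cmath>

// An Elo estimate with its symmetric error bar, as printed in the per-arch lines.
struct EloEstimate {
    double elo;
    double err;
};

// Difference a - b, with the error bars combined in quadrature
// (assumption: the two groups are independent).
EloEstimate elo_difference(const EloEstimate& a, const EloEstimate& b) {
    return {a.elo - b.elo, std::sqrt(a.err * a.err + b.err * b.err)};
}
```

With the VNNI line (30.50 ± 1.84) and the AVX512 line (25.27 ± 2.04) this comes out to about 5.2 ± 2.7.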
gotcha
is like slightly less than 1.5
➗
so speedups should be slightly above 66% stc -> ltc
just that usual stuff for sf releases is having this elo being 1:1
so logical patches in general scale better than speedups
how about for the previous PT, https://tests.stockfishchess.org/tests/view/68d98c39fa806e2e8393b7a1
a bit old data, I think that with the current book it is even more similar.
📏