GROUPED BY ARCH
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 20.85 ± 1.99 | LOS: 100.0% | LLR: 17.97 | [2, 1135, 6434, 2321, 2]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 21.13 ± 2.25 | LOS: 100.0% | LLR: 14.40 | [0, 944, 5114, 1908, 2]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 27.43 ± 2.50 | LOS: 100.0% | LLR: 14.88 | [1, 717, 4149, 1753, 5]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 21.21 ± 2.85 | LOS: 100.0% | LLR: 8.95 | [1, 598, 3197, 1204, 3]
64bit SSE41 SSSE3 SSE2 POPCNT | Elo: 22.17 ± 8.81 | LOS: 100.0% | LLR: 0.99 | [0, 57, 332, 120, 1]
#UE Threat Inputs for AB
you can have better game pair ratio but worse elo and worse nElo
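for anyone reading the per-arch table: a minimal sketch of how the Elo column and a game pair ratio can be recomputed from the five pentanomial counts. This assumes the counts are game pairs scoring 0, 0.5, 1, 1.5 and 2 points and that the pair ratio is won pairs over lost pairs; the function names are made up for illustration and this is not fishtest's code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Pentanomial counts: game pairs scoring 0, 0.5, 1, 1.5 and 2 points.
using Penta = std::array<long long, 5>;

// Mean score per game, in [0, 1].
inline double mean_score(const Penta& p) {
    double pairs = 0.0, points = 0.0;
    for (int i = 0; i < 5; ++i) {
        pairs += static_cast<double>(p[i]);
        points += 0.5 * i * static_cast<double>(p[i]);  // pair score 0, 0.5, ..., 2
    }
    return points / (2.0 * pairs);  // two games per pair
}

// Logistic Elo difference implied by a mean score.
inline double logistic_elo(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}

// Won pairs divided by lost pairs; drawn pairs are ignored (assumed definition).
inline double pair_ratio(const Penta& p) {
    return static_cast<double>(p[3] + p[4]) / static_cast<double>(p[0] + p[1]);
}
```

Feeding in `[2, 1135, 6434, 2321, 2]` from the BMI2 row gives roughly 20.85 Elo, matching the table. Since the pair ratio only looks at the outer buckets while Elo uses the full distribution, the two can rank a pair of tests differently.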
huh where did the avx512 + vnni combination go
oh wait I'm reading it wrong
hm ok so the delta was always there... but then the distribution of workers changed a lot
ok, found an avx2-specific speedup... ~1% locally but we'll see on fishtest
the problem is that NumRegistersSIMD is 16
before, the loads would get consistently folded into the memory operand of vpaddw and vpsubw
but now they need a temp register for vpmovsxbw
and thus at least one acc register needs to be spilled...
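rough illustration of the register problem, not the actual update code: `add_row_i8` is a made-up name, and the real loop keeps many accumulators live in ymm registers instead of reloading them. With i16 weights the weight row can be the memory operand of vpaddw directly; with i8 weights the widened row has to land in a register first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
    #include <immintrin.h>
#endif

// Adds one row of int8 weights into an int16 accumulator.
// n must be a multiple of 16.
inline void add_row_i8(std::int16_t* acc, const std::int8_t* w, std::size_t n) {
#if defined(__AVX2__)
    for (std::size_t i = 0; i < n; i += 16) {
        // vpmovsxbw: widen 16 x int8 to 16 x int16. The result needs its own
        // register, which is the extra temp mentioned above.
        const __m256i wide = _mm256_cvtepi8_epi16(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i)));
        const __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(acc + i));
        // vpaddw: with int16 weights this add could read the weights from memory.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(acc + i),
                            _mm256_add_epi16(a, wide));
    }
#else
    // Scalar fallback with the same result, for builds without AVX2.
    for (std::size_t i = 0; i < n; ++i)
        acc[i] = static_cast<std::int16_t>(acc[i] + w[i]);
#endif
}
```

With 16 ymm registers and all of them holding accumulators, that one temp is enough to force a spill.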
wait it passed?
would any AVX2 users be able to bench https://github.com/anematode/Stockfish/tree/threat-inputs-i8-st5 vs. the previous commit
STC had a lot of arm workers, and it seems to be doing better on arm.
Why not SMP BTW? Even if not willing to pass with VLTC. Shared memory is not as good because positions in a single engine are more similar.
BTW I get warning on clang (not new to this commit)
```
1104 | Bitboard threatened = ray & qAttacks & occupied;
| ^
position.cpp:1057:14: note: previous declaration is here
1057 | Bitboard threatened;
|
```
same with gcc
ARCH=x86-64-avx2
Result of 200 runs
==================
base (...sh_avx2.base) = 1739819 +/- 3149
test (...nputs-i8-st5) = 1778412 +/- 3041
diff = +38593 +/- 2346
speedup = +0.0222
P(speedup > 0) = 1.0000
ok, not terrible...
this is on Zen 5 though right? so AVX512 would be the default build
Yes, but this way the AVX2 path is used.
sure
I would say 2.2% is more than just not terrible.
14600k speedtest (avx2 arch, also not default)
a071eec7: 19320421
f9886de7: 19729029
+.0211
ok I'ma hunt for more speedups until I get +4% or so and then put up on fishtest filtered to AVX2
So, it seems a massive slowdown on AVX2?
Oh... your patch was like 3.7 on AVX2 for my machine.
Woopsies!
ok pushed a fix u can retest
Result of 200 runs
==================
base (...sh_avx2.base) = 1738472 +/- 3302
test (...nputs-i8-st5) = 1765637 +/- 2975
diff = +27165 +/- 1085
speedup = +0.0156
P(speedup > 0) = 1.0000
Nodes searched : 1902842
gcc 15.2.1
gotcha
also could u git log
just wanna make sure it's the right commit as I pushed something else a few minutes ago
its the fix commit
ooh ok
pull again and re-test?
I tried cramming DirtyThreat into 32 bits and I think it might help
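sketch of what "cramming into 32 bits" could look like. The field order and widths below are hypothetical, not the layout on the branch, and `PackedThreat` is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// One threat change packed into a single 32-bit word (hypothetical layout):
// bits 0-7 from square, 8-15 to square, 16-19 threatened piece,
// 20-23 attacking piece, bit 31 add/remove flag.
struct PackedThreat {
    std::uint32_t data;

    static constexpr PackedThreat make(unsigned piece, unsigned from,
                                       unsigned threatened, unsigned to, bool add) {
        return PackedThreat{from | (to << 8) | (threatened << 16) | (piece << 20)
                            | (static_cast<std::uint32_t>(add) << 31)};
    }

    constexpr unsigned from() const { return data & 0xFF; }
    constexpr unsigned to() const { return (data >> 8) & 0xFF; }
    constexpr unsigned threatened_piece() const { return (data >> 16) & 0xF; }
    constexpr unsigned piece() const { return (data >> 20) & 0xF; }
    constexpr bool add() const { return (data >> 31) != 0; }
};

static_assert(sizeof(PackedThreat) == 4, "fits in one 32-bit word");
```

The idea is that a list of these copies and compares as plain integers, at the cost of a shift and mask per field access.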
kk
ur the best
```
test (./stockfish ) = 837952 +/- 1405
diff = +7987 +/- 1581
speedup = +0.0096
P(speedup > 0) = 1.0000
CPU: 32 x AMD EPYC 7502P 32-Core Processor
```
underwhelming
I'ma try the idea of making indices an LUT, forgot who said that
Result of 200 runs
==================
base (...sh_avx2.base) = 1746542 +/- 3695
test (...nputs-i8-st5) = 1710156 +/- 2848
diff = -36387 +/- 1309
speedup = -0.0208
P(speedup > 0) = 0.0000
yeah definitely
😭
❯ ./stockfish_avx2.threat-inputs-i8-st5 compiler
Stockfish dev-20251107-cd5e513d by the Stockfish developers (see AUTHORS file)
Compiled by : g++ (GNUC) 15.2.1 on Linux
Compilation architecture : x86-64-avx2
Compilation settings : 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.1 20250813
it is
mmk
could you try comparing the commit "fix" vs the commit "cram"
if "cram" is rly bad then it should be -0.035
hmm something is weird. I compiled again and got a different md5sum
I'm pretty sure I did. But well
so something was wrong with the binary, but still
Result of 200 runs
==================
base (...sh_avx2.base) = 1746615 +/- 4120
test (...nputs-i8-st5) = 1756439 +/- 3361
diff = +9824 +/- 1416
speedup = +0.0056
P(speedup > 0) = 1.0000
yes
sigh
ok I'll have to look at the disassembly on GCC 15 later
it probably did something rly dumb
i wonder, if we stockpile the claimed minor speedups and then combine them, is this sound methodology
LUT seems to be working... but my computer has plenty of cache
I'ma see if I can compact it a little then I'll push
I think it's ok... long term we can verify that each of them is real or not
ah nice
Result of 100 runs
==================
base (....ti.avx2.gcc) = 1298550 +/- 1517
test (./stockfish ) = 1329132 +/- 2045
diff = +30582 +/- 1763
speedup = +0.0236
P(speedup > 0) = 1.0000
CPU: 128 x AMD EPYC 9755 128-Core Emb Processor
bench matches...
lol ok I crammed it to 17 bits but split across two arrays which is dumb
probably makes the most sense to have it either be 24 bits or 17 bits but contiguous
ok yeah that destroyed perf
hm now that I think about it, the LUT might actually be accessed in a fairly ok way
depending on how we index it
anyway, I pushed LUT if any of y'all wanna test
we might have to put the LUT in shared memory lmao
big boi
testing on Zen 2 now
speaking of this aren't the pext LUTs also somewhat large
so it looks like we're currently 2 elo away in ltc 1thread
i kinda guessed this
after i8 "antiscaling"
that's kinda rough
It isn't anti scaling. It didn't have the arm benefit
does i8 antiscaling mean the whole enterprise is antiscaling?
Yo guys, anyone else notice that the Threat input fishtest is going pretty good right now? For the STC stage 4 and 5 nets 😄
yes you can read the thread above for details haha
long story short, it's a (ar)mirage
Plz interpret the results correctly otherwise this gets annoying fast
how are you sure
Well at least look into it instead of just declaring it anti scaling
(not saying you're wrong, just trying to understand)
I ran some VVLTC games and it was at 3 +/- 2 ELO while I think at STC it's -5 ELO on my machine
so there's some hope there... but fishtest seems to have the opposite trend
base (...kfish.ti.gcc) = 830469 +/- 1142
test (./stockfish ) = 846138 +/- 1085
diff = +15670 +/- 1420
speedup = +0.0189
P(speedup > 0) = 1.0000
CPU: 32 x AMD EPYC 7502P 32-Core Processor
where is the LTC fishtest link?
yeah 8 threads tends to be +5 elo over 1 thread anyways
God everyone is going to just read these results wrong now
we forcibly passed the stc
relax lol
is that stage 5 net or stage 4?
with vondele machines
we are just experimenting
stc was -1 elo before that
oh its stage 4
stage 4 == stage 5 for all intents and purposes
we'll just have to test stage 5 and see
I think you should add some things in the info so it is less misleading tbh
who are we misleading tho
It is far too easy to just read the Elo off the sprt and draw invalid conclusions
which suggests actually maybe we drop it?
or
everyone working on threat inputs knows that the SPRT has many asterisks attached to it
i have consistently noticed stage 5 is not much better than stage 4
ye
there is a good chance tuning the search for the new net will fix the scaling problem, no?
.
per maintainer decision this is not to be done until after we win w/o changing search
So now we declared a scaling problem?
.
but SSS, I only ran 5000 VVLTC games
why do u have a bee in your bonnet about this lol
I am referring to this message as a response to "who are we misleading". This is not a difficult chain of reasoning to follow
the "scaling" btw is neutral within error bars
Yes
i wouldn't read too much into it
Finally a logical response
all the sprt shows is that it's not some +5 superscaler at least for 1 thread
but theoretically it could be better than neutral right?
I still think we should only be testing the stage 5 net, since its .5 elo higher on fishtest stc
performance is also really machine dependent
and not just the fact that arm machines have it at +15 elo
i would also caution against reading too much into this
jees, really!
.
too bad I don't have an arm machine
wait, does it also improve on phone hardware? Since that's arm?
presumably
Result of 100 runs
==================
base (...kfish.ti.gcc) = 1517185 +/- 2488
test (./stockfish ) = 1565808 +/- 2403
diff = +48623 +/- 2682
speedup = +0.0320
P(speedup > 0) = 1.0000
some improvements on Zen 5 as well...
ok I'ma put this up on fishtest
there was a little meming about releasing sf18 since for arm devices it's at that level
Hmm is there a way to contribute cores to a specific test on fishtest?
yeah no vondele just flooded fishtest with machines until some joined the threat input tests
There is also the 8 Elo of net spsa Elo that can be added. Which is always forgotten
Well not you lol. But I get messages why isn't threat inputs working well all the time
In DMs
💀
I look at this as just the beginning
do ppl just not comment here on public thread
who is DMing you lol
or is this just the dev thread
celebrity 😩
Idk why they don't message here. I guess they just see the first post is me and then DM
I don't count as a dev and I'm here lol
You need to be aware this exists and discussions are happening here, else discord ui is very good at hiding it from the rest of the world
Yeah they never messaged here so maybe they just don't see it. And only know it is from me from general engine dev
I mean I should make a copypasta for the response. It's like the same every time
You should just reply with 'and how's the alternative doing?'
Anyways yes it is true I am somewhat annoyed at getting messaged multiple times why isn't threat inputs "working" yet. When it is working perfectly fine when considering all factors
There, I put the link to here in the nnue channel
the most prudent thing for now is to wait for https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/12017736932 to finish in 2 hours
this offers 2 more chances at getting a slightly better net
anyways I've modified the fishtest description
I'm too easy to bait tbh, character flaw. Someone sends me a sprt and asks 'does this means threat inputs isn't promising' and I'll always respond kek
just in case ppl are lurking there
🎣
ok let's see...
I wish SF was better than shashchess 😩
maybe we should merge in some of his changes
nah
it literally is
by like 50 elo
at ltc
or smth
Baited lmao
lol
oh shoot
ez
🤡
more like 100, this is urgent guys, maybe shashchess changes will giga scale with threat inputs omg
💀
clean up the code generally
is the biggest one to me lmao
that's the one where other ppl are gonna have to suggest stuff
I'm not a C++ guru tbh so i don't know how to make things clean
but I'll think about it
did nobody ever bother to delete these 25 lines
this has actually been unused since like
what's that thing called in genetics
shawn's first dual-accumulator
Genetic hitchhiking, also called genetic draft or the hitchhiking effect, is when an allele changes frequency not because it itself is under natural selection, but because it is near another gene that is undergoing a selective sweep and that is on the same DNA chain. When one gene goes through a selective sweep, any other nearby polymorphisms th...
vestigial structures? 
true
it's a relic of when full_threats used to also include the halfkav2hm
wow that 40 core worker actually dumped negative llr on that test
omg DirtyBoardData is getting copied around for some reason
return value optimization isn't happening
👀
The stage 4 Threat inputs ltc test is steadily improving somehow btw, like it's -1.3 now, it was -2.3 before
is it cos of the arm testing hardware?
idk what ur looking at
oh are you looking at https://tests.stockfishchess.org/tests/live_elo/690e8be1ec1d00d2c195c37f ?
I think I bungled something here but I'm not sure, breaking it up into each change to find out what the problem is
oh you mean Elo
yeah who knows
when a test is hardware dependent it bounces around everywhere
yeah
just gotta sneak in some fleet for next progtest 
ah
kinda annoying, should really have different architecture separation
whats so ah?
is this one line using like 4% of total runtime?
no lol
if I interpret the vmovdqu correctly
oh
i see
hm we might want to add a template parameter to do_move which is whether or not to do dirty piece calculations
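something like this, as a toy sketch: `Board` and `DirtyInfo` are stand-ins and not the real Position / DirtyPiece types. The bool is a compile-time parameter, so the non-tracking instantiation contains no dirty-piece code and no runtime branch.

```cpp
#include <cassert>

// Hypothetical stand-in for the dirty piece / dirty threat bookkeeping.
struct DirtyInfo {
    int changed = 0;
};

struct Board {
    int pieces[64] = {};

    // TrackDirty selects at compile time whether bookkeeping is emitted.
    template<bool TrackDirty>
    void do_move(int from, int to, DirtyInfo* dirty = nullptr) {
        pieces[to] = pieces[from];
        pieces[from] = 0;
        if constexpr (TrackDirty) {
            // Only instantiated for do_move<true>; dirty must be non-null here.
            dirty->changed += 2;
        }
    }
};
```

Callers that need the bookkeeping use `do_move<true>(from, to, &dirty)` and everything else uses `do_move<false>(from, to)`. The price is two instantiations of the function, and every helper it calls may need the same parameter.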
most of the time... but still adds a branch everywhere
I tried it b4 I think
but worth a second try
ugh this is pissing me off
too many things to template
lol it gets memcpyed 3 times I think
i'm too lazy to make things nullable so here's what we're gonna do
Result of 100 runs
==================
base (...vx2.69a01b88) = 1261593 +/- 2401
test (./stockfish ) = 1274991 +/- 2447
diff = +13398 +/- 2445
speedup = +0.0106
P(speedup > 0) = 1.0000
CPU: 128 x AMD EPYC 9755 128-Core Emb Processor
ok let's hope this works on fishtest
that's just removing one copy lol
ok time to try to remove the other one
can someone point me to the latest current threats input branch? I'm going to spend the weekend understanding it
sscg's threat-inputs-i8 is what I'm working off of
and was the basis for the recent SPRTs
lol
I really like this line
just make sure it's really 0
they are two different things
oh wait
I can't READ
wait no wonder it's not working
thx babe
OK
time to put on fishtest
ok three speedups to try out on fishtest... let's see how they work
in other news the stage 2 nets are somehow now extremely strong relative to training time
oh that's exciting
!!!!!!!!!!
what's different...
just the 254/255 thing?
or is there something else
the new run is this yeah
255/256 instead of 127/128
is that in a fork somewhere?
yes
mr. sscg13's fork
ok going out but will look for more speedups when I get back
nice nice
It is 3am
where
uk
you are correct
could also be western americas
Go to bed dude
It’s not party time yet 
for indexing, there is also yoshie's bullet configs, that might be more readable
Are there any ways to make LUT more compact
tbh like [12][64][12][64] is good enough i think
does bullet support threat inputs natively, or you need to manually do it
no, read yoshie's configs
you can reduce that last [64] with a popcount and another lookup
actually
that kinda defeats the purpose
yeah idt there's a fast way to decrease the current 2^20 size
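quick arithmetic on the sizes being thrown around. A [piece][square][piece][square] table with 12 real pieces has 589,824 entries; the 2^20 figure fits a piece dimension padded to 16 slots. The 16-slot padding and the 4-byte entry width used in the usage line are my assumptions, not something stated above.

```cpp
#include <cassert>
#include <cstddef>

// Entry count of a 4-D table indexed [piece][square][piece][square].
constexpr std::size_t entries(std::size_t pieceSlots) {
    return pieceSlots * 64 * pieceSlots * 64;
}

static_assert(entries(16) == (std::size_t(1) << 20), "16 piece slots give 2^20 entries");
static_assert(entries(12) == 589824, "12 real pieces give 9 * 2^16 entries");

// Bytes needed if each entry is entryBytes wide (the width is an assumption).
constexpr std::size_t table_bytes(std::size_t pieceSlots, std::size_t entryBytes) {
    return entries(pieceSlots) * entryBytes;
}
```

`table_bytes(16, 4)` comes out at 4 MiB, which is why cache size matters for whether the LUT helps on a given machine.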
so if I'm understanding correctly, the threat inputs are tuples of (from square, to square, piece on from square, piece threatened on to square), counting both 'threats' to capture my own pieces (i.e. defended pieces) and threats to capture the opponent's pieces. And one optimization is that you de-duplicate symmetric threats. So e.g. if you have a bishop on A1 threatening a bishop on B2, then obviously that implies the bishop on B2 threatens the bishop on A1, so you fold those inputs together into one. The idea is e.g. we don't need 'pawn threatens bishop' as an input because 'bishop threatens pawn' already contains that information based on the input squares. You also skip symmetric NvN or BvB etc. attacks with a from < to comparison.
So essentially the network knows where each piece is relative to the king, and also knows all the pieces that are defended/defending/attacked/attacking it.
yep
if you don't do eval in check you can also skip X -> K, but this so far hasn't gained at fishtest
the threat inputs also have horizontal mirroring, but no additional buckets bc that would blow up the size
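to make the from < to rule concrete, a tiny sketch. It only models same-type, non-pawn pairs (my assumption about where the rule applies); cross-type folding like bishop/pawn, king-relative orientation and mirroring are left out, and all names are made up.

```cpp
#include <cassert>

enum PieceType { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

// One candidate threat: the piece on `from` attacks the piece on `to`.
struct Threat {
    PieceType attacker;
    int from;
    PieceType attacked;
    int to;
};

// Returns true if the tuple is kept as a feature. For same-type non-pawn
// pairs the attack is mutual, so only the from < to direction survives.
constexpr bool keep_threat(const Threat& t) {
    const bool mutual = t.attacker == t.attacked && t.attacker != PAWN;
    return !mutual || t.from < t.to;
}
```

So a knight on b1 (square 1) attacking a knight on c3 (square 18) produces one feature, and the reverse direction is dropped.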
keep us updated on stage 3! 🙂
looking good.
u can also check out plenty main if you want simpler inference
yeah I have been using plenty more than SF to understand 😂
Dodo and mallard looking ok so far… both apply to avx2
Time to see what else there is
maybe optimizing make_index
updated profile (sscg's threat-inputs-i8)
ok so i guess increasing L2 to 31 will cause 3% slowdown is my estimate
we'll see in a few days
how good is i8 now
huh
evaluate?
oh I meant in the source code not the profile
ah yup
that loop is the bane of my existence
I finally realized why GCC bungles it in the old form and it's so dumb
it's because biases is right before weights
so because it loads from biases first, it thinks it's a good idea to replace weights[x] with biases[x + 1]
and then disaster ensues
what
in the assembly I mean
oh
idk what u are referring to
but
gl
hf
well basically after you find the nnz's
instead of adding a bunch of 16
vectors
u add a bunch of 32 vectors
idk how much slower that would be
hm ok
I actually think that'll be less than a 2x slowdown
because
the entire propagation takes like 7%
the loop is actually currently somewhat bottlenecked by index calculations rather than actual vector instruction spam
i assume this propagate takes most of that time
at least on VNNI, I think on AVX512 and AVX2 it's the vector instructions
so idk what the relative ratio of nnz / avx
nnz is very little time IME
although probably more time on AVX2
I have a complicated theory to speed it up on AVX2 that I'll get to at some point... but doesn't benefit threat inputs any more than master
but yeah in 2 days or so we'll find out
👍
you're welcome to try it, just change l2Big in the source code from 15 to 31
which net should I use?
perfect ok
ye by like 10 elo lmao
at stc
i8 is effectively the main branch rn
And QAT?
lol the branch predictor HATES this section
all the conditions are like 20 to 60% probability and unpredictable
idk how to reduce the # of branches tho, let's see
static exchange evaluation
u are a funny one
wait why is append_changed_indices only used in double_inc_update
oh FeatureSet is a template parameter 
@foggy wind (and others willing): could u do a bench on https://tests.stockfishchess.org/tests/view/690f03b0ec1d00d2c195c3da
```
==================
base (...vx2.69a01b88) = 1284931 +/- 1876
test (./stockfish ) = 1295177 +/- 1889
diff = +10246 +/- 2242
speedup = +0.0080
P(speedup > 0) = 1.0000
```
for me but who knows if it's real
ok time to revisit LUT...
sscg I like your idea of making the table 64x smaller by not including from in the index
attacks_bb with two arguments looks pretty fast
another really demented idea is to re-index the features in a deliberately inefficient way... using the fact that the unused indices won't use any cache space
but making the indexing very fast
however it'd use a lot of RAM for zeros :/
it's not 64x smaller
wait
no it can be 64x smaller
yeah
for the price of one attacks_bb lookup and some arithmetic
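my reading of "one attacks_bb lookup and some arithmetic", as a sketch: instead of storing something per (from, to) pair, compute the rank of `to` among the squares attacked from `from` with a mask and a popcount. `knight_attacks` is only a stand-in for a real attacks_bb table, and the actual index layout on the branch may differ.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

// Stand-in for an attacks_bb(piece, square) lookup: knight moves only.
inline Bitboard knight_attacks(int sq) {
    static const int df[8] = {1, 2, 2, 1, -1, -2, -2, -1};
    static const int dr[8] = {2, 1, -1, -2, -2, -1, 1, 2};
    const int f = sq % 8, r = sq / 8;
    Bitboard b = 0;
    for (int i = 0; i < 8; ++i) {
        const int nf = f + df[i], nr = r + dr[i];
        if (nf >= 0 && nf < 8 && nr >= 0 && nr < 8)
            b |= Bitboard(1) << (nr * 8 + nf);
    }
    return b;
}

// Portable popcount (clears the lowest set bit until none remain).
inline int popcount64(Bitboard b) {
    int n = 0;
    while (b) {
        b &= b - 1;
        ++n;
    }
    return n;
}

// Rank of `to` among the squares attacked from `from`, in ascending order.
inline int attack_rank(int from, int to) {
    const Bitboard below = (Bitboard(1) << to) - 1;  // squares lower than `to`
    return popcount64(knight_attacks(from) & below);
}
```

A table without `from` in the index would then supply the base offset and this rank gets added to it.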
yep
forgot the exact order of indexing myself briefly lol
```
==================
base (...at-inputs-i8) = 2027832 +/- 3402
test (../snowy-plover) = 2010917 +/- 3329
diff = -16915 +/- 1227
speedup = -0.0083
P(speedup > 0) = 0.0000
CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on
```
```
Result of 200 runs
==================
base (...ad-inputs-i8) = 807325 +/- 3379
test (../snowy-plover) = 814006 +/- 3549
diff = +6681 +/- 570
speedup = +0.0083
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
Hyperthreading: on
```
GCC 15.2.0 on AMD, GCC 15.2.1 on Intel
Yeah
although I only have Linux
Both boxes are linux
gn
Result of 200 runs
==================
base (...sh_avx2.base) = 1730842 +/- 3170
test (...snowy-plover) = 1750735 +/- 3277
diff = +19893 +/- 1960
speedup = +0.0115
P(speedup > 0) = 1.0000
Oh my benchmarks were with the default arch, so avx512icl
This may be the reason for those results
what if instead of slli+srai, we do shuf+srai? cuz idk free ports or smth smth @stray reef @prime mica
i got something to work with _mm512_shuffle_epi8, but it doesn't seem to be a speedup locally
mmm
p5 is usually quite constrained so i tend to avoid shuffles where possible as a rule of thumb
speaking of this should we set this as reference
for the QAT run testing
probably, do you have this already merged in a branch?
I've updated reference for the QAT run, as this has been integrated in sscg13 latest branch
on the arm nodes, threat-inputs-i8 is now just 2% slower than master...
==== master ====
1 Nodes/second : 298577711
2 Nodes/second : 297081100
Average (over 2): 297829405
==== threat-inputs-i8 ====
1 Nodes/second : 291537416
2 Nodes/second : 292647054
Average (over 2): 292092235
so QAT training run doesn't seem to yield anything stronger than what we have so far.
:(
@foggy wind can u run your bucketing script on https://tests.stockfishchess.org/tests/view/690e9695ec1d00d2c195c38f ? I bungled the arch filtering so I can’t see the effect on AVX2 lmao
The patch does nothing elsewhere
GROUPED BY ARCH
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -0.58 ± 1.67 | LOS: 24.7% | LLR: -1.46 | [131, 4300, 11127, 4261, 117]
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -0.51 ± 2.20 | LOS: 32.5% | LLR: -0.79 | [64, 2592, 6337, 2592, 47]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 5.24 ± 2.42 | LOS: 100.0% | LLR: 3.06 | [45, 1985, 5248, 2248, 58]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 0.38 ± 2.58 | LOS: 61.5% | LLR: -0.06 | [57, 1902, 4641, 1949, 43]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -1.18 ± 4.58 | LOS: 30.7% | LLR: -0.30 | [24, 626, 1515, 615, 20]
Do you think that's necessary? The bmi2 and avx2 values are relevant, right? Together, they look good.
@violet badger
The threats.yaml recipe
https://github.com/vondele/nettest/blob/main/threats.yaml
still contains duplicated binpacks that were removed from large.yaml recipe.
Do you think it might be worth to test the removal of those binpacks in threats.yaml recipe as well?
this is nuts
how is this even possible...
So, if I'm not mistaken, the AVX2 + BMI2 results have an Elo rating of 2.94 +/- 1.77 and an LLR of 2.98. So this is already a passed test in these architectures.
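the pooled number can be checked by summing the two pentanomial rows and recomputing. A sketch, assuming the ± is a 95% interval derived from the pentanomial variance; `pool` and `estimate` are made-up helper names and this is not the bucketing script.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Penta = std::array<double, 5>;  // pairs scoring 0, 0.5, 1, 1.5, 2 points

struct EloEstimate {
    double elo;
    double margin95;
};

// Element-wise sum of two pentanomial rows.
inline Penta pool(const Penta& a, const Penta& b) {
    Penta out{};
    for (int i = 0; i < 5; ++i)
        out[i] = a[i] + b[i];
    return out;
}

inline double to_elo(double score) {
    return -400.0 * std::log10(1.0 / score - 1.0);
}

// Logistic Elo plus a 95% margin from the per-pair score variance.
inline EloEstimate estimate(const Penta& p) {
    static const double x[5] = {0.0, 0.25, 0.5, 0.75, 1.0};  // per-game score of a pair
    double n = 0.0, mean = 0.0;
    for (int i = 0; i < 5; ++i) {
        n += p[i];
        mean += x[i] * p[i];
    }
    mean /= n;
    double var = 0.0;
    for (int i = 0; i < 5; ++i)
        var += p[i] / n * (x[i] - mean) * (x[i] - mean);
    const double half = 1.96 * std::sqrt(var / n);  // 95% half-width in score units
    return {to_elo(mean), (to_elo(mean + half) - to_elo(mean - half)) / 2.0};
}
```

Pooling the BMI2 row `[45, 1985, 5248, 2248, 58]` with the plain AVX2 row `[57, 1902, 4641, 1949, 43]` lands at about 2.94 ± 1.77, in line with the figures quoted above. The LLR is not reproduced here.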
You could also get machines on fishtest to run with a simpler architecture than the default. I did it this way in the past: https://github.com/official-stockfish/Stockfish/compare/master...Torom:Stockfish:NoAVX512
ye
the issue with this is that we optimize for the default build
there are plenty of changes which are amazing on my computer's avx2 but fail on my friend's machine etc.
🐟 🧪
📉 
maybe the optimizer realizes the computations are unused and can throw them away on the original
story of my life bro
seems good
kyoot
hmmm maybe instead of a big LUT we can just micro optimize the existing calculations
in particular not using LUTs and instead using constants...
how would you propose doing this
```
0xcafebabe30913928 >> (attkr & 7)
```
type beat
especially because the constants could probably be hoisted given that make_index is usually used in a loop
we can generate the constants via constexpr so it shouldn't be unreadable
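sketch of the "constant instead of LUT" idea with constexpr doing the packing. The table values are placeholders and `pack8` / `lookup` are made-up names; this variant uses byte-wide entries, so the shift is 8 * (index & 7) rather than the bare shift in the one-liner above.

```cpp
#include <cassert>
#include <cstdint>

// Packs eight byte-sized table entries into one 64-bit constant at compile time.
constexpr std::uint64_t pack8(const std::uint8_t (&v)[8]) {
    std::uint64_t k = 0;
    for (int i = 0; i < 8; ++i)
        k |= static_cast<std::uint64_t>(v[i]) << (8 * i);
    return k;
}

// Placeholder per-piece-type values, not taken from the real feature set.
constexpr std::uint8_t kOffsets[8] = {0, 3, 7, 11, 20, 31, 42, 0};
constexpr std::uint64_t kPacked = pack8(kOffsets);

// Reads entry `pieceType & 7` with a shift and a mask, no memory access.
constexpr unsigned lookup(unsigned pieceType) {
    return static_cast<unsigned>(kPacked >> (8 * (pieceType & 7))) & 0xFFu;
}

static_assert(lookup(4) == 20, "entry 4 round-trips through the packed constant");
```

Since `kPacked` is a compile-time constant, a compiler can keep it in a register across a loop over make_index calls, which is the hoisting point made above.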
but idk I'll try the medium-sized LUT first
and see how it does on fishtest
yeah
[16][64][16] or smth
this isn't even that bad
much smaller than pext lookups
^_^
I recall doing smth like this before
combining this into 1 mask was slower for me tho locally
i feel like lookup tables aren't actually that slow as long as they're small and fit in cache
agree for the most part
main issue is if a branch depends on the result
because they can have quite high latency
i mean for the threat indexing i would guess the CPU attempts to do all of them simultaneously
while it waits for the lookups
citation needed
i am nowhere near a cpu expert
sure
it might end up being profitable summing two LUTs one indexed with attkr/to/attkd and one attkr/from/to
the latter can be byte-sized
will try both
actually yeah this seems rly succulent
the latter will have quite good cache locality while the first one will be relatively smol
why tf is PIECE_TYPE_NB 8 when there are 6 piece types (or 7 if you include no piece)
ok this is the most demented codegen I've ever seen
i just know that of 0-7 two are no piece
yeah I think 0 and 7 are the no piece
mask registers are this obscure type of register introduced in AVX512
and it's very weird to be using them for integers...
sigh
can anyone bench https://tests.stockfishchess.org/tests/live_elo/69102050ec1d00d2c195c4f1 on their computer
```
Result of 20 runs
==================
base (./sf-old ) = 1380434 +/- 10321
test (./stockfish ) = 1406364 +/- 10545
diff = +25930 +/- 2466
speedup = +0.0188
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
```
on my cpu
what is this new optimization meta
O shit
ok let's hope this works on fishtest 🤞
👀
can't profile locally atm bc my friend is running some weird DFT simulation lol

could u double check that the benches are the same
just for my peace of mind lol
should be 24..
matches up
😊 thank u sir
https://tests.stockfishchess.org/tests/view/69102686ec1d00d2c195c4f7 @warm thistle when u have time if u could check this one that'd be swell
oh yeah later on u can update the net
forgot to mention
it's +3 elo on vondele local test
Result of 20 runs
==================
base (./sf-old ) = 1392913 +/- 11095
test (./stockfish ) = 1421577 +/- 10865
diff = +28663 +/- 3084
speedup = +0.0206
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
ok, so maybe slightly better
ooh cool, versus which?
crazy speedup 🤤
@prime mica see meme
[-1, 1] wtf lol
is this on ARM tho
reference = f3fa...
ah the stage 4 net we've been testing with?
yeah
ok that's exciting
new one is indeed very good
cj speaking misinfo 🗣️
that 255/256 instead of 127/128 is 3 elo
lololol
not negligible
maybe time for a new SPRT soon then?
ye after speedup
👍
@prime mica you should name at least one of your patches titmouse
funneh name will pass 100%
isnt this 2 changes in 1
I did this concept before previously and it ended up -0.5%...
if it's this... then that's not the same idea
bc here we're plopping it into a 64 bit integer
"combining this into 1 mask was slower for me tho locally"
oh I see sorry
ill see if i can find it cuz i didnt commit cuz it was slower
uhh
let me look in my recycle bin
also in https://tests.stockfishchess.org/tests/live_elo/69102686ec1d00d2c195c4f7 I tried putting the shift amount in the LUT
which I think should work quite nicely tbh
WAIT
we don't even need that do we
it can just be a bit in the LUT
😭
I missed the forest for the trees
that is funny
I will support this concept ^^
ok ostrich time
I think this should be nearly optimal index calculation at least with this LUT setup
checking whether the feature doesn't exist is four instructions on ARM and x86 🤓
you don't need to do "return dimensions" specifically, as long as you guarantee it's >= dimensions
yeah ik
my hope with the sf_assume(index != Dimensions) is that (some) compilers will be able jump thread the continue and/or elide one compare
GCC seems to do so, haven't tested clang
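for people who haven't seen the pattern: a sketch of a sentinel return plus an assume hint. `ASSUME` is a local stand-in written with `__builtin_unreachable`, not the project's sf_assume, and the indexing arithmetic is a placeholder. The caller only relies on "not a feature" being reported as some value >= Dimensions.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
    // Tells the optimizer that cond always holds at this point.
    #define ASSUME(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
    #define ASSUME(cond) ((void) 0)
#endif

constexpr unsigned Dimensions = 1000;  // hypothetical feature count

// Returns a value >= Dimensions when the feature does not exist.
inline unsigned make_index(bool exists, unsigned raw) {
    if (!exists)
        return Dimensions;             // sentinel path
    unsigned index = raw % Dimensions; // placeholder for the real arithmetic
    ASSUME(index != Dimensions);       // valid path never yields the sentinel
    return index;
}

// Copies the valid indices into out and returns how many were written.
inline std::size_t collect(const bool* exists, const unsigned* raw,
                           std::size_t n, unsigned* out) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned index = make_index(exists[i], raw[i]);
        if (index >= Dimensions)
            continue;                  // skip features that don't exist
        out[count++] = index;
    }
    return count;
}
```

After inlining, the hint gives the compiler a chance to route the sentinel path straight to the `continue` without re-testing the index on the valid path; whether it actually does so depends on the compiler.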
ah I see
ok so current state of affairs: we should be able to do plover + mallard + dodo + ostrich
dodo is AVX2 only but the rest are general
once all of them are resolved maybe we can SPRT against master with the new net
oh
interesting
@prime mica do you still need speedtests for this patch?
ostrich > cassowary > woodcock
they're just iterations on the same idea
but I'm leaving them up in case I made a mistake...
no need
but lemme try combining the four patches and we can see whether the speedups are ~additive
if u could benchmark that that'd be greatly appreciated
my computer is currently suffering my friend's DFT simulation
So ostrich?
oh yeah if you could do that one by itself that'd be great
👀
On it
Kevin got 2.88% speedup
which is roughly expected
but it'll probs be less on fishtest
```
==================
base (...at-inputs-i8) = 2023444 +/- 1829
test (../ostrich ) = 2050542 +/- 1935
diff = +27097 +/- 1232
speedup = +0.0134
P(speedup > 0) = 1.0000
CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on
```
```
Result of 200 runs
==================
base (...at-inputs-i8) = 813188 +/- 3192
test (../ostrich ) = 828548 +/- 3321
diff = +15360 +/- 588
speedup = +0.0189
P(speedup > 0) = 1.0000
CPU: 4 x Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
Hyperthreading: on
```
ok cool
so not amazing but not zero
could u also try https://github.com/anematode/Stockfish/tree/moa
that has the things combined
Result of 20 runs
==================
base (./sf-old ) = 1347420 +/- 8957
test (./stockfish ) = 1384998 +/- 8926
diff = +37578 +/- 2477
speedup = +0.0279
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
on mine
huh
that doesn't add up
should be more like +4% lol
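for reference, the "should add up to" expectation as a sketch: if the patches were independent, their speed factors would multiply, which for small values is close to just summing them. `combined_speedup` is a made-up helper and the inputs in the usage line are placeholders, since the per-patch numbers aren't all listed here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Expected overall speedup if each patch scales nodes/second independently.
// Inputs and output are fractions, e.g. 0.02 for +2%.
inline double combined_speedup(const std::vector<double>& speedups) {
    double factor = 1.0;
    for (double s : speedups)
        factor *= 1.0 + s;
    return factor - 1.0;
}
```

For example `combined_speedup({0.01, 0.02})` is 0.0302. A measured total well below this kind of estimate suggests two patches are fighting over the same bottleneck, or that one of the individual measurements was noise.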
could you try git revert de7fbe6b and re-compile/re-bench
possible that one of them doesn't like the other
yes i'll try
thx legend
helping in any way i can because i am not good at optim 💀
lol
I was not either until the pandemic when I was so bored I wrote https://github.com/anematode/high-perf-bogosort/tree/main
then I got pissed off at how inefficient 95% of modern software is
Result of 20 runs
==================
base (./sf-old ) = 1308364 +/- 14174
test (./stockfish ) = 1338017 +/- 14966
diff = +29652 +/- 2742
speedup = +0.0227
P(speedup > 0) = 1.0000
CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
hm ok
the only optim i know is rewriting things in asm ☠️
maybe the +2.88% measurement was a fluke then
could be natural variation i guess lol
oh well we'll see on fishtest etc.
also running speedtest here..
4c902c7c443b6b71d06895097ecd8d85aa3a2e99 vs caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3
lol what are those
moa vs no-bird
loll kk
somehow it helps often afterwards to understand what was really what. One reason to use shas in nettest to pin versions, not branches.
agree
these days I have a bins folder formatted as stockfish.<arch>.<compiler>.<first nibbles of hash>
makes sense... assuming you always use 'profile-build' 😉
yeah, easy enough.
virtually no gain on the arm nodes, but no damage done either:
==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
1 Nodes/second : 292248079
2 Nodes/second : 290685623
Average (over 2): 291466851
==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
1 Nodes/second : 292968214
2 Nodes/second : 292490843
Average (over 2): 292729528
that's a little surprising honestly
I'll investigate later, I'm guessing at least one of the changes is counterproductive on ARM
let me just run a few more of the shas and we'll see in a bit.
Maybe somebody else can run the following script:
$ cat speedtesting.sh
set -e
max_rep=2
# each entry is a commit sha to build and speedtest
for branch in 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 a227ff9916e98a33c198a1749f23f2783934be8a 064d09f40f0ef0e9daf19609faaf79aa471ec362 a6721c73eea0b11eaf9b574ed509265211be5738 d8eb3207dbf7f264035ee52ddfa1db9c39471841 638e1786bde4e1f3eb17be5a32da3650666318d9
do
    echo "==== $branch ===="
    # build once per sha, keep the binary around for reruns
    if [ ! -f stockfish.$branch ]; then
        git checkout $branch > compile.out.$branch 2>&1
        make -j profile-build >> compile.out.$branch 2>&1
        mv stockfish stockfish.$branch
    fi
    for iter in $(seq 1 ${max_rep})
    do
        # skip runs whose output file already exists
        if [ ! -f speedtest.out.$branch.$iter ]; then
            ./stockfish.$branch speedtest > speedtest.out.$branch.$iter 2>&1
        fi
        echo $iter $(grep "Nodes/second" speedtest.out.$branch.$iter)
    done
    echo "Average (over $max_rep): " $(grep "Nodes/second" speedtest.out.$branch.* | awk '{s=s+$NF; c++}END{printf("%16d\n",int(s/c))}')
done
not sure if all the commits are functional btw
but thx that's very useful
you are a bash god
nah, the syntax highlighting is done by discord
looking ok on Apple silicon
Result of 100 runs
==================
base (./stockfish.ti ) = 1566081 +/- 3859
test (./stockfish ) = 1587327 +/- 3990
diff = +21247 +/- 2395
speedup = +0.0136
P(speedup > 0) = 1.0000
not what I was hoping for tho
well, I think the main interest right now is anyway x86
yeah true
```
==================
base (...at-inputs-i8) = 2017676 +/- 2737
test (../moa ) = 2064445 +/- 2825
diff = +46769 +/- 1582
speedup = +0.0232
P(speedup > 0) = 1.0000
CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on
```
ok cool
underwhelming but decent
==================
base (...kfish.ti.gcc) = 830128 +/- 901
test (./stockfish ) = 844316 +/- 1042
diff = +14188 +/- 1326
speedup = +0.0171
P(speedup > 0) = 1.0000
on my friend's Zen 2 server
all that matters is beating master tho lol
i think that's looking plausible? 🤞
i mean i feel like if the individual ones pass
just merge them
no need to run an extra sprt
oh sure I meant threat inputs against master
ah
lemme look
yeah just wing it, add the net change and try an stc sprt vs master
alone it should be +1 elo, with speedups +4 or smth
kk do u wanna do it or should I
I can do it
ok great
more throughput
yes true
so pull moa right
ye
ok
you haven't made any changes to threat-inputs-i8 recently right
no, except for net change
kk perfect
oh wow with the ultimate SSS on my laptop bench is up 10%
ah gotcha
```
Nodes searched : 2324801
Nodes/second : 1043447
```
(baseline)
```
Total time (ms) : 1974
Nodes searched : 2324801
Nodes/second : 1177710
```
(new) lmao


I love this emoji so fucking much omg
I use it all the time elsewhere with no context
one person thought it was a reference to the Schutzstaffel for some reason?
need to get their mind out of the gutter
OK, for what it is worth:
==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
1 Nodes/second : 292248079
2 Nodes/second : 290685623
Average (over 2): 291466851
==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
1 Nodes/second : 292968214
2 Nodes/second : 292490843
Average (over 2): 292729528
==== de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 ====
1 Nodes/second : 292234932
2 Nodes/second : 293382923
Average (over 2): 292808927
==== a227ff9916e98a33c198a1749f23f2783934be8a ====
1 Nodes/second : 292874238
2 Nodes/second : 287634242
Average (over 2): 290254240
==== 064d09f40f0ef0e9daf19609faaf79aa471ec362 ====
1 Nodes/second : 287458645
2 Nodes/second : 289818521
Average (over 2): 288638583
==== a6721c73eea0b11eaf9b574ed509265211be5738 ====
1 Nodes/second : 292221507
2 Nodes/second : 293205229
Average (over 2): 292713368
==== d8eb3207dbf7f264035ee52ddfa1db9c39471841 ====
1 Nodes/second : 289783271
2 Nodes/second : 289119242
Average (over 2): 289451256
==== 638e1786bde4e1f3eb17be5a32da3650666318d9 ====
1 Nodes/second : 290801279
2 Nodes/second : 289798590
Average (over 2): 290299934
so this part of the code looks like
start:
vpaddw ymm0, ymm0, [rax]
vpaddw ymm1, ymm1, [rax + 32]
vpaddw ymm2, ymm2, [rax + 64]
...
vpaddw ymm15, ymm15, [rax + ...]
;; increment base pointer, loop...
jle start
on master, right?
Yeah
because the compilers are very smort and they're able to keep every accumulator in a register, and use CISC to enjoy
but on threat inputs, it looks like
start:
vpmovsxbw ymm15, [rax]
vpaddw ymm0, ymm0, ymm15
vpmovsxbw ymm15, [rax + 16]
vpaddw ymm1, ymm1, ymm15
...
;; uh oh...
;; increment base pointer, loop...
jle start
you need one temp register for the i8 -> i16 conversion
so the compiler has to spill at least one of the accumulator registers
Right that
and then it is sad
because it has to store/load across every iteration
in theory we could use 16 for the main net and 12 for threats but I was lazy
Does this not apply to avx512
no bc we have registers set to 16 there as well
and avx512 has an extra set of 16 to play with
Ah hah
so smallnet might be slowed down by this patch?
NEON is fine as well
yeah, main net too... but in my benchmarks the loss from 16 -> 12 was unmeasurable
definitely worth making it conditional on the feature set though
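A toy model of the register arithmetic described above, not compiler output. It assumes 16 vector registers under AVX2, 32 under AVX-512 and AArch64 NEON, and one temporary for the i8 -> i16 widening; the function name `spilled_accumulators` is made up for illustration.

```python
def spilled_accumulators(num_accumulators: int, num_temps: int, num_vector_regs: int) -> int:
    """Lower bound on accumulators that cannot stay in registers across the loop."""
    return max(0, num_accumulators + num_temps - num_vector_regs)


# master on AVX2: loads fold into vpaddw/vpsubw memory operands, so no temp is needed.
print(spilled_accumulators(16, 0, 16))  # 0
# threat inputs on AVX2: vpmovsxbw needs a destination register before the add.
print(spilled_accumulators(16, 1, 16))  # 1
# dropping to 12 accumulators leaves room for the temp on AVX2.
print(spilled_accumulators(12, 1, 16))  # 0
# AVX-512 and NEON expose 32 vector registers, so 16 accumulators plus a temp fit.
print(spilled_accumulators(16, 1, 32))  # 0
```

This only counts registers; whether a real compiler actually avoids the spill depends on its register allocator.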
well main net will be threats, but would be speedup of threats branch if the effect on smallnet is measurable.
there's also a chance that some old compilers aren't clever enough to do this w/o spilling, and we've been spilling this whole time. But I find that unlikely
something is really cooking on fishtest..
main net has accumulator refreshes for both the ksq components and the threats right?
idk in general, having written more x86 vector loops than I'd like to admit, I'm very scared to use all the registers unless you're writing in assembly
bc you are really relying on the compiler to be smort
ok I'm betting +2 against master
😊
so based on this... the branchless insertion into added/removed was pretty bad?
maybe try git revert a227ff and see if that helps?
I could definitely see that one being arch (and compiler) dependent
also are these nodes 70 cores, 140 cores, 280 cores or what
4 x 72
gotcha
will try later...
https://tests.stockfishchess.org/tests/live_elo/69105b3dec1d00d2c195c569
Test was doing so well early on. SSS, I know, but still... 😔
yep, there's hope...
oh I see
yeah probably arch dependence tbh
one of the first workers was a NEON :P
Is there any hope of bringing the performance on other archs to similar to ARM?
Btw, if this is the case, will a specific binary be preferred for competitions like TCEC or CCC? I think that happens even now for other reasons?
There is no need to
good question, these competitions have standardized hardware so it won't be the case
Wth? I glanced away for like 10 mins, and the LLR shot up.
you should assume < Dimensions instead cuz one of the checks uses that iirc
25000 games in and +2 elo now, looking good! 😄
Will LTC be tested soon? 🙂
presumably after the STC test, if it passes
@prime mica the more accurate measurements now:
==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
Average (over 10): 292442968
==== de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 ====
Average (over 10): 292813467
==== a227ff9916e98a33c198a1749f23f2783934be8a ====
Average (over 10): 290405859
==== 064d09f40f0ef0e9daf19609faaf79aa471ec362 ====
Average (over 10): 289685089
==== a6721c73eea0b11eaf9b574ed509265211be5738 ====
Average (over 10): 292753228
==== d8eb3207dbf7f264035ee52ddfa1db9c39471841 ====
Average (over 10): 289360047
==== 638e1786bde4e1f3eb17be5a32da3650666318d9 ====
Average (over 10): 290681425
==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
Average (over 10): 291312522
danke
very interesting
so branchless added/removed was bad
064d09f.. being bad is no surprise, that's cassowary
I think that's like the most clear one?
yes
should I try with a revert of just that one?
yeah no smoking gun unfortunately...
I think the general pattern though is that using LUTs is not as good on ARM
or at least the machine u are testing on
yeah, might be that right now it is in the memory sweet spot
the neoverse cores seem to have a very good amount of cache according to Wikipedia...
Package L#0
NUMANode L#0 (P#0 117GB)
L3 L#0 (114MB)
L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
sure.. maybe different branch to avoid confusion 😉
yep!
ptarmigan
good names are essential for passing sprt, I believe.
you can try that one vs. the previous commit (which is the best lookup table–based make_index function I've cooked thus far)
i.e.
LUT: 6018b8cd092665c7cb7c0bb943a7f2fc48de72d9
no LUT: 9fb3700ef1c7e8578b676ffddf9c644eb9cf0b4d
ok, started both shas.
danke
out of curiosity have u ever gotten to see the nodes in person
or are they locked away in some massive basement
exciting
It was holding like +13 for a little bit. SSS, but OK. Not looking so hot now, though I guess it's at least on par with Master.
patience my friend
it'll probably settle down around 1.5 methinks
still a huge win. I have a couple more speedups in the pipeline + we will have training tweaks I'm guessing
also it should scale well with TC
Yeah, yeah, I know. I'm just kind of impatient for a new net because honestly, search improvements have kind of tapered off as of recently, and maybe this will change that.
==== 6018b8cd092665c7cb7c0bb943a7f2fc48de72d9 ====
Average (over 10): 290043742
==== 9fb3700ef1c7e8578b676ffddf9c644eb9cf0b4d ====
Average (over 10): 281965813
it would appear LUT is still effective
i'm pretty sure this is more likely to restart net improvements rather than search improvements heh
hopefully both 🙂
can we have another run of the script on https://tests.stockfishchess.org/tests/view/69105b3dec1d00d2c195c569 ? Maybe the script would be a nice feature on fishtest itself. Also category of windows/linux for certain patches.
GROUPED BY ARCH
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 0.28 ± 3.11 | LOS: 57.1% | LLR: -0.10 | [58, 1664, 3319, 1663, 64]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -0.41 ± 3.43 | LOS: 40.7% | LLR: -0.32 | [54, 1325, 2762, 1306, 57]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -2.18 ± 4.37 | LOS: 16.4% | LLR: -0.57 | [27, 836, 1659, 796, 26]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 4.10 ± 4.82 | LOS: 95.2% | LLR: 0.62 | [22, 599, 1364, 668, 19]
64bit POPCNT NEON_DOTPROD | Elo: 16.20 ± 8.23 | LOS: 100.0% | LLR: 0.92 | [5, 163, 467, 235, 10]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -5.97 ± 10.42 | LOS: 13.1% | LLR: -0.24 | [11, 159, 322, 137, 11]
nice to see we're still winning on arm machines
these are now mostly macs
the rest is still a bit unclear, all LLRs still close to 0
I am pretty sure the rest total to negative LLR
(slightly)
the total LLR is relatively close to the sum of the group LLRs
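A quick check of the two claims above, using only the LLR column from the GROUPED BY ARCH table. The dictionary keys are shortened labels for the arch strings, chosen here for readability. Summing per-group LLRs is only an approximation of the test's overall LLR, as noted in the chat.

```python
# Per-arch LLR values copied from the table above (keys are shortened arch labels).
group_llr = {
    "AVX512ICL VNNI": -0.10,
    "AVX512": -0.32,
    "BMI2": -0.57,
    "AVX2": 0.62,
    "NEON_DOTPROD": 0.92,
    "VNNI": -0.24,
}

total = sum(group_llr.values())
# Everything except the ARM (NEON) group.
x86_only = sum(v for k, v in group_llr.items() if k != "NEON_DOTPROD")

print(f"sum of group LLRs: {total:+.2f}")    # +0.31
print(f"x86 groups only:   {x86_only:+.2f}")  # -0.61
```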
Does pext play any role here, or should AVX2 and BMI2 perform equally?
I would've thought BMI2 would be good because it makes threat tracking slightly faster but I guess not
btw the first != assume is redundant right
should be redundant.
sad that the elo gains from net and speedups seem not additive
Maybe selectively choose speedups once (if) they pass? Idk, how practical that is.
spikiness is really something
standard tests don't exactly have W-L casually spiking up/down by 100 in the span of only 5k games
Hmm, noticed that too.
that's just depending on which machines join the test I would say
I find that uniquely frustrating... But it is what it is.
i wonder what the individual machine results look like...
But usually it doesn't make much sense to go into such detail because the error bars simply become too large.
yep
you can click on the link?
but right there you have it.
fair
I suspect individual machine results on "normal" tests might also look like this
due to the variance
yes, individual machines hard to converge, might be possible in local tests though.
locally, for example, the branch is still much slower for me:
Result of 100 runs
==================
base (./stockfish.master ) = 1107456 +/- 3570
test (./stockfish.patch ) = 984159 +/- 3285
diff = -123297 +/- 4113
speedup = -0.1113
P(speedup > 0) = 0.0000
CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on
ah yeah an x86 machine
but I suspect that slowdown is more than average, idk.
i see.
the persistent speedup work has really paid off with patience
well amazing process I think.
I will see if I can get some progress with net training, but not sure there is low hanging fruit. We'll see.
right the ideal process maybe is different than master
i also suspect low hanging fruit is mostly gone though
i wonder if the original NNUE development project resulted in a similar feeling
of slowly reaching towards hce, then exceeding it, then exceeding it by a lot
vondele, don't forget to remove and test (if needed) the duplicated binpack lines in threats.yaml recipe
🙂
of course that would need testing.. so let's see. Maybe as part of future training.
am hopeful that we get passing sprts soon enough, the real barrier seems to be LTC single thread
I suspect that cleanup work is quite nontrivial
but it still seems preemptive to start doing it now
LTC single thread was -0.5Elo in the previous test... this ought to be positive now?
single thread
Result of 200 runs
==================
base (...es/stockfish) = 2295252 +/- 4815
test (...stockfish.ti) = 2048251 +/- 5352
diff = -247001 +/- 2504
speedup = -0.1076
P(speedup > 0) = 0.0000
CPU: 16 x AMD Ryzen 9 9950X3D 16-Core Processor
Hyperthreading: on
32 threads speedtest
sf_base = 42662044 +/- 111193 (95%)
sf_test = 41059945 +/- 108936 (95%)
diff = -1602099 +/- 112245 (95%)
speedup = -3.75533% +/- 0.263% (95%)
In multithreading, the difference becomes significantly smaller.
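For reference, the `speedup` figures in both pastes are just the relative difference in nodes per second. A minimal sketch using the numbers above; the function name `relative_speedup` is made up here, and the two measurements come from different tools, so the comparison is only indicative.

```python
def relative_speedup(base_nps: float, test_nps: float) -> float:
    """Relative change in nodes/second of test versus base (negative means slower)."""
    return (test_nps - base_nps) / base_nps


single = relative_speedup(2295252, 2048251)   # single-thread numbers from above
smp = relative_speedup(42662044, 41059945)    # 32-thread speedtest numbers from above

print(f"1 thread:   {single:+.4f}")  # -0.1076
print(f"32 threads: {smp:+.4f}")     # -0.0376
print(f"slowdown ratio: {single / smp:.1f}x")  # 2.9x
```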
I thought that this only meant that larger nets are better with longer TC. But it's great that speed is also a factor.
The suspected explanation is that smp searches similar positions and has better access patterns to the threat features
amicic carrying BlessRNG
https://tests.stockfishchess.org/tests/live_elo/69107881ec1d00d2c195c5c2
but emu doing ok... @foggy wind maybe you can benchmark?
pretty much every instruction saved off make_index is measurable locally lmao. It's kinda cooked
this is interesting but yeah makes sense...
so in a way threat inputs (hopefully) scales well in two ways ^_^
I found out that a fair # of compilers aren't inlining this which is not ideal
causes like ten extra instructions, converting Square int8_t to int for example (because the ABI says that if you pass an int8_t, the upper 56 bits are undefined)
so I tried putting up a test to force inlining it but it doesn't compile for some reason
will try again in a bit
arm cores literally llr printers with their +15 elo
I'm pretty sure the actual llr is negative on x86
locally, my x86 is at -10Elo
Results of master vs patch (10+0.1, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 10.98 +/- 5.65, nElo: 20.59 +/- 10.58
LOS: 99.99 %, DrawRatio: 48.67 %, PairsRatio: 1.27
Games: 4146, Wins: 1168, Losses: 1037, Draws: 1941, Points: 2138.5 (51.58 %)
Ptnml(0-2): [18, 451, 1009, 572, 23], WL/DD Ratio: 1.20
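The summary numbers in that paste can be recomputed from the Ptnml counts alone. The sketch below assumes the usual logistic Elo formula and a normalized Elo defined as the score advantage divided by the per-game standard deviation; the helper name `ptnml_stats` is made up here and is not the testing tool's own code.

```python
import math


def ptnml_stats(ptnml):
    """Return (elo, nelo, pairs_ratio) from pentanomial game-pair counts.

    ptnml[i] is the number of game pairs that scored i/2 points (0, 0.5, ..., 2).
    """
    pairs = sum(ptnml)
    scores = [i / 4 for i in range(5)]  # per-game score of each pair outcome
    mean = sum(s * n for s, n in zip(scores, ptnml)) / pairs
    var = sum(n * (s - mean) ** 2 for s, n in zip(scores, ptnml)) / pairs
    # Logistic Elo from the mean score.
    elo = -400 * math.log10(1 / mean - 1)
    # Normalized Elo: per-game variance is taken as twice the per-pair variance.
    nelo = (mean - 0.5) / math.sqrt(2 * var) * 800 / math.log(10)
    # Pairs won (1.5 or 2 points) over pairs lost (0 or 0.5 points).
    pairs_ratio = (ptnml[3] + ptnml[4]) / (ptnml[0] + ptnml[1])
    return elo, nelo, pairs_ratio


elo, nelo, pairs_ratio = ptnml_stats([18, 451, 1009, 572, 23])
print(f"Elo {elo:.2f}, nElo {nelo:.2f}, PairsRatio {pairs_ratio:.2f}")
```

Under these assumptions the result agrees with the reported Elo, nElo and PairsRatio to two decimals. Because pairs ratio ignores how lopsided each pair was, it can move in a different direction from Elo and nElo.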
hmm
oh in a good way
no
wait what
well, in favor of master
fishtest x86 seems to be -2 or so
I think I have a particularly slow guy 😉
hmmm so how do we interpret this, net and speedups not being additive? speedups themselves not being additive?
maybe worth sanity checking the new net by itself without the extra changes?
raspberry pi moment
don't think the net by itself is to doubt. Played quite a few games.
generally speedups and other improvements are being only applied at like 1/2 rate against master
or at least that's what it feels like
I think actually, there is an argument for that.
at least Naphthalin has suggested before that self-play tends to overestimate Elo differences. That would be the case in the speedup vs reference, while threats vs master is no longer (or less) self-play.
that would make sense actually
what's this junk in the header of update_accumulator_incremental
oh I guess this is reading the feature index from the added/removed hm
ok let's try specializing update_accumulator_incremental for the most common (added.size(), removed.size())
at least finally snowy egret isn't insta-failing on fishtest
ok another reasonable interpretation is that the speedups aren't actually that big
most of them have drifted down quite a bit
on fishtest
this is a general rule of thumb
you can't add up fishtest sprt elo to estimate a pt
probably it's the same with speedups
right they're all overestimates
this also applies to net training
yeah the net was vondele local fitbit
since nets are tested constantly almost every +8 +/-5 ends up being like at best 2
because you only submit ones that are showing good results, but if you test for like hundreds of them you are getting flukes
eh it was only like 5-10 that we tested
well, there is NCM
which tests every sf commit
and one commit that was like +11 +/- 5 elo
was a comment change
which is pretty obviously non-functional kek
I remember a HCE patch
passed fishtest
with double SPRT
only for me to notice that it can't even theoretically do anything at not FRC
lmaoo
and fishtest didn't run FRC book back then
so this can happen, even back in sf 10 times average elo / passer was like 0,5 elo
so definitely most of what you see is a big overshoot / lucky run, this is pretty normal and not smth you can really fix in general
one problem is that back then we didn't really know anything about scaling so we were like "+6 elo STC, 1,5 LTC, fine"
nowadays it's always a suspect of being a bad scaler since literally half of the search scales in a weird way duh
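The selection effect described above (only the best-looking candidates get submitted, so their measured Elo overshoots) can be illustrated with a small simulation. Every number here is an assumption chosen for illustration, not fishtest data: 100 candidates per round, true gains spread around 0 with a standard deviation of 1 Elo, and measurement noise with a standard deviation of 2.5 Elo (roughly a ±5 error bar). The function name `winners_curse` is also made up.

```python
import random


def winners_curse(num_candidates, true_sd, noise_sd, trials, seed=0):
    """Average measured and true Elo of the best-looking candidate per round."""
    rng = random.Random(seed)
    measured_sum = 0.0
    true_sum = 0.0
    for _ in range(trials):
        best_measured, best_true = float("-inf"), 0.0
        for _ in range(num_candidates):
            true_elo = rng.gauss(0.0, true_sd)
            measured = true_elo + rng.gauss(0.0, noise_sd)  # noisy test result
            if measured > best_measured:
                best_measured, best_true = measured, true_elo
        measured_sum += best_measured
        true_sum += best_true
    return measured_sum / trials, true_sum / trials


measured, true = winners_curse(num_candidates=100, true_sd=1.0, noise_sd=2.5, trials=2000)
print(f"best candidate: measured {measured:+.2f} Elo, true {true:+.2f} Elo")
```

With these made-up parameters the picked candidate's measured gain is several times its true gain, which is the same overshoot pattern the chat describes for nets and passers.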
STC passed
@foggy wind maybe u could check the x86 distribution?
wait when did it
ok so it went from 0 llr to pass in the span of 20000 games
👍
machine luck???
tfw
👀
idk
seems like amicic did 2 runs and both were done by 0 llr
can be just luck
I recall some of my SPRT sitting at 2,5 LLR LTC for like 60k games
so from 30 to 90k
and at 120k it failed with -2,95
this stuff just happens sometimes
@rocky vigil if you want you can merge emu as I'm pretty confident it's good
but up to you
o
why would you hurt feelings of @jolly tangle so much though
https://tests.stockfishchess.org/tests/view/69105b3dec1d00d2c195c569
GROUPED BY ARCH
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 1.66 ± 2.59 | LOS: 89.6% | LLR: 0.69 | [76, 2361, 4780, 2416, 95]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 0.35 ± 2.64 | LOS: 60.3% | LLR: -0.10 | [98, 2227, 4697, 2234, 104]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 0.93 ± 3.24 | LOS: 71.2% | LLR: 0.15 | [52, 1499, 3067, 1512, 62]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 3.01 ± 3.94 | LOS: 93.3% | LLR: 0.65 | [28, 918, 2024, 985, 29]
64bit POPCNT NEON_DOTPROD | Elo: 20.68 ± 6.52 | LOS: 100.0% | LLR: 1.89 | [7, 276, 744, 425, 20]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -4.27 ± 8.34 | LOS: 15.8% | LLR: -0.28 | [15, 241, 485, 223, 12]
