#UE Threat Inputs for AB


foggy wind
#
GROUPED BY ARCH

64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT        | Elo: 20.85 ± 1.99 | LOS: 100.0% | LLR: 17.97 | [2, 1135, 6434, 2321, 2]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: 21.13 ± 2.25 | LOS: 100.0% | LLR: 14.40 | [0, 944, 5114, 1908, 2]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT   | Elo: 27.43 ± 2.50 | LOS: 100.0% | LLR: 14.88 | [1, 717, 4149, 1753, 5]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT             | Elo: 21.21 ± 2.85 | LOS: 100.0% | LLR:  8.95 | [1, 598, 3197, 1204, 3]
64bit SSE41 SSSE3 SSE2 POPCNT                  | Elo: 22.17 ± 8.81 | LOS: 100.0% | LLR:  0.99 | [0, 57, 332, 120, 1]
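
For reference, the Elo column above appears to follow from the pentanomial counts via the usual logistic formula. A minimal sketch, assuming the five numbers are game pairs scoring 0, 0.5, 1, 1.5 and 2 points; the function name is made up for illustration and the error bar and LLR are not reproduced here:

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Pentanomial counts: number of game pairs scoring 0, 0.5, 1, 1.5 and 2 points.
// (Assumed interpretation of the bracketed lists in the table.)
double elo_from_pentanomial(const std::array<std::int64_t, 5>& pairs) {
    double total  = 0.0;  // number of game pairs
    double points = 0.0;  // points scored by the test side
    for (int i = 0; i < 5; ++i) {
        total  += static_cast<double>(pairs[i]);
        points += 0.5 * i * static_cast<double>(pairs[i]);
    }
    const double score = points / (2.0 * total);    // per-game score in (0, 1)
    return -400.0 * std::log10(1.0 / score - 1.0);  // logistic Elo difference
}
```

Feeding in the first row, `[2, 1135, 6434, 2321, 2]`, gives roughly 20.85, which matches the table.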
lapis parrot
#

you can have better game pair ratio but worse elo and worse nElo

prime mica
#

oh wait I'm reading it wrong

#

hm ok so the delta was always there... but then the distribution of workers changed a lot

prime mica
#

ok, found an avx2-specific speedup... ~1% locally but we'll see on fishtest

#

the problem is that NumRegistersSIMD is 16

#

before, the loads would get consistently folded into the memory operand of vpaddw and vpsubw

#

but now they need a temp register for vpmovsxbw

#

and thus at least one acc register needs to be spilled...
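
A rough sketch of the extra step being described, assuming int8 weights added into int16 accumulators. The function name and loop shape are hypothetical; the real update keeps many accumulator vectors live in registers across weight columns, which is where the pressure on the 16 registers comes from. This version loads and stores every iteration just to stay short, and falls back to scalar code when AVX2 is not enabled:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Adds a run of int8 weights to int16 accumulators. n must be a multiple of 16.
void add_i8_weights(std::int16_t* acc, const std::int8_t* w, std::size_t n) {
#if defined(__AVX2__)
    for (std::size_t i = 0; i < n; i += 16) {
        // int16 weights could be a memory operand of vpaddw directly.
        // int8 weights first need vpmovsxbw into a temporary vector register.
        const __m128i packed = _mm_loadu_si128(reinterpret_cast<const __m128i*>(w + i));
        const __m256i wide   = _mm256_cvtepi8_epi16(packed);  // vpmovsxbw
        const __m256i a      = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(acc + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(acc + i), _mm256_add_epi16(a, wide));
    }
#else
    // Portable fallback with the same result.
    for (std::size_t i = 0; i < n; ++i)
        acc[i] = static_cast<std::int16_t>(acc[i] + w[i]);
#endif
}
```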

frosty imp
#

wait it passed?

prime mica
summer swan
#

Why not SMP BTW? Even if not willing to pass with VLTC. Shared memory is not as good because positions in a single engine are more similar.

#

BTW I get a warning on clang (not new to this commit)

```
 1104 |         Bitboard threatened = ray & qAttacks & occupied;
      |                  ^
position.cpp:1057:14: note: previous declaration is here
 1057 |     Bitboard threatened;
      |
```
foggy wind
prime mica
#

ok, not terrible...

#

this is on Zen 5 though right? so AVX512 would be the default build

foggy wind
#

Yes, but this way the AVX2 path is used.

prime mica
#

sure

foggy wind
#

I would say 2.2% is more than just not terrible.

summer swan
prime mica
#

ok I'ma hunt for more speedups until I get +4% or so and then put up on fishtest filtered to AVX2

lofty cedar
#

So, it seems a massive slowdown on AVX2?

lofty cedar
#

Oh... your patch was like 3.7 on AVX2 for my machine.

prime mica
#

I made a mistake lol

#

wrong bench

#

will fix and we'll see

lofty cedar
#

Woopsies!

prime mica
#

ok pushed a fix u can retest

foggy wind
# prime mica ok pushed a fix u can retest
Result of 200 runs
==================
base (...sh_avx2.base) =    1738472  +/- 3302
test (...nputs-i8-st5) =    1765637  +/- 2975
diff                   =     +27165  +/- 1085

speedup        = +0.0156
P(speedup > 0) =  1.0000
prime mica
#

hm ok

#

can u verify that the binary gives 1902842 as the bench?

foggy wind
#

Nodes searched : 1902842

prime mica
#

ok that's good

#

testing on my friend's Zen 2 EPYC now

#

compiler btw?

foggy wind
#

gcc 15.2.1

prime mica
#

gotcha

#

also could u git log

#

just wanna make sure it's the right commit as I pushed something else a few minutes ago

foggy wind
#

its the fix commit

prime mica
#

ooh ok

#

pull again and re-test?

#

I tried cramming DirtyThreat into 32 bits and I think it might help
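
To make the idea concrete, here is one way a threat record could fit in 32 bits. The field set and bit widths are purely hypothetical (two 4-bit pieces, two 6-bit squares, one add/remove bit); the actual `DirtyThreat` layout in the branch may differ:

```cpp
#include <cstdint>

// Hypothetical 32-bit packing of one dirty threat; not the branch's real layout.
struct PackedThreat {
    std::uint32_t raw;

    static PackedThreat make(unsigned attacker, unsigned victim,
                             unsigned from, unsigned to, bool add) {
        return PackedThreat{(attacker & 0xFu)
                            | ((victim & 0xFu) << 4)
                            | ((from & 0x3Fu) << 8)
                            | ((to & 0x3Fu) << 14)
                            | (static_cast<std::uint32_t>(add) << 20)};
    }

    unsigned attacker() const { return raw & 0xFu; }          // bits 0-3
    unsigned victim()   const { return (raw >> 4) & 0xFu; }   // bits 4-7
    unsigned from()     const { return (raw >> 8) & 0x3Fu; }  // bits 8-13
    unsigned to()       const { return (raw >> 14) & 0x3Fu; } // bits 14-19
    bool     add()      const { return ((raw >> 20) & 1u) != 0; }
};

static_assert(sizeof(PackedThreat) == 4, "one threat fits in a single 32-bit word");
```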

foggy wind
#

kk

prime mica
#

ur the best

#
```
test (./stockfish    ) =     837952  +/- 1405
diff                   =      +7987  +/- 1581

speedup        = +0.0096
P(speedup > 0) =  1.0000

CPU: 32 x AMD EPYC 7502P 32-Core Processor
```
#

underwhelming

#

I'ma try the idea of making indices an LUT, forgot who said that

foggy wind
#
Result of 200 runs
==================
base (...sh_avx2.base) =    1746542  +/- 3695
test (...nputs-i8-st5) =    1710156  +/- 2848
diff                   =     -36387  +/- 1309

speedup        = -0.0208
P(speedup > 0) =  0.0000
prime mica
#

h u h

#

why??

#

it works fine on my end..

#

u sure u did profile-build right

foggy wind
#

yeah definitely

prime mica
#

😭

foggy wind
#
❯ ./stockfish_avx2.threat-inputs-i8-st5 compiler
Stockfish dev-20251107-cd5e513d by the Stockfish developers (see AUTHORS file)

Compiled by                : g++ (GNUC) 15.2.1 on Linux
Compilation architecture   : x86-64-avx2
Compilation settings       : 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.1 20250813
prime mica
#

that makes no sense

#

could you check that the bench is the same?

foggy wind
#

it is

prime mica
#

mmk

#

could you try comparing the commit "fix" vs the commit "cram"

#

if "cram" is rly bad then it should be -0.035

foggy wind
#

hmm something is weird. I compiled again and got a different md5sum

prime mica
#

that happens to me sometimes if I don't explicitly make clean

#

but not sure

foggy wind
#

I'm pretty sure I did. But well

prime mica
#

lol

#

cosmic ray /s

foggy wind
#

so something was wrong with the binary, but still

Result of 200 runs
==================
base (...sh_avx2.base) =    1746615  +/- 4120
test (...nputs-i8-st5) =    1756439  +/- 3361
diff                   =      +9824  +/- 1416

speedup        = +0.0056
P(speedup > 0) =  1.0000
prime mica
#

gross

#

that's between "use stage 5 net" and "cram" ?

foggy wind
#

yes

prime mica
#

sigh

#

ok I'll have to look at the disassembly on GCC 15 later

#

it probably did something rly dumb

lofty cedar
#

I got around 1% speedup.

#

But well, not that many rounds.

rocky vigil
#

i wonder, if we stockpile the claimed minor speedups and then combine them, is this sound methodology

prime mica
#

LUT seems to be working... but my computer has plenty of cache

#

I'ma see if I can compact it a little then I'll push

prime mica
prime mica
#
Result of 100 runs
==================
base (....ti.avx2.gcc) =    1298550  +/- 1517
test (./stockfish    ) =    1329132  +/- 2045
diff                   =     +30582  +/- 1763

speedup        = +0.0236
P(speedup > 0) =  1.0000

CPU: 128 x AMD EPYC 9755 128-Core Emb Processor

bench matches...

#

lol ok I crammed it to 17 bits but split across two arrays which is dumb

#

probably makes the most sense to have it either be 24 bits or 17 bits but contiguous

#

ok yeah that destroyed perf

#

hm now that I think about it, the LUT might actually be accessed in a fairly ok way

#

depending on how we index it

#

anyway, I pushed LUT if any of y'all wanna test

#

we might have to put the LUT in shared memory lmao

#

big boi

#

testing on Zen 2 now

rocky vigil
prime mica
#

yeah...

#

definitely worth a shot

rocky vigil
#

so it looks like we're currently 2 elo away in ltc 1thread

#

i kinda guessed this

#

after i8 "antiscaling"

prime mica
#

that's kinda rough

twilit oriole
#

It isn't anti scaling. It didn't have the arm benefit

prime mica
#

does i8 antiscaling mean the whole enterprise is antiscaling?

amber fern
#

Yo guys, anyone else notice that the Threat input fishtest is going pretty good right now? For the STC stage 4 and 5 nets 😄

prime mica
#

yes you can read the thread above for details haha

#

long story short, it's a (ar)mirage

twilit oriole
prime mica
#

how are you sure

twilit oriole
#

Well at least look into it instead of just declaring it anti scaling

prime mica
#

(not saying you're wrong, just trying to understand)

#

I ran some VVLTC games and it was at 3 +/- 2 ELO while I think at STC it's -5 ELO on my machine

#

so there's some hope there... but fishtest seems to have the opposite trend

#
base (...kfish.ti.gcc) =     830469  +/- 1142
test (./stockfish    ) =     846138  +/- 1085
diff                   =     +15670  +/- 1420

speedup        = +0.0189
P(speedup > 0) =  1.0000

CPU: 32 x AMD EPYC 7502P 32-Core Processor
amber fern
prime mica
#

ok I'ma put this up on fishtest, AVX2 only

rocky vigil
#

yeah 8 threads tends to be +5 elo over 1 thread anyways

twilit oriole
rocky vigil
#

we forcibly passed the stc

prime mica
#

relax lol

amber fern
rocky vigil
#

with vondele machines

prime mica
#

we are just experimenting

rocky vigil
#

stc was -1 elo before that

amber fern
#

oh its stage 4

prime mica
#

stage 4 == stage 5 for all intents and purposes

amber fern
twilit oriole
prime mica
#

who are we misleading tho

twilit oriole
#

It is far too easy to just read the Elo off the sprt and draw invalid conclusions

rocky vigil
#

or

prime mica
#

everyone working on threat inputs knows that the SPRT has many asterisks attached to it

rocky vigil
#

i have consistently noticed stage 5 is not much better than stage 4

prime mica
#

ye

amber fern
twilit oriole
rocky vigil
twilit oriole
#

So now we declared a scaling problem?

prime mica
#

no?

#

like I said my local tests even indicate it could scale well

rocky vigil
#

might be time to try 1280 later

#

tho

prime mica
#

but SSS, I only ran 5000 VVLTC games

#

why do u have a bee in your bonnet about this lol

twilit oriole
#

I am referring to this message as a response to "who are we misleading". This is not a difficult chain of reasoning to follow

rocky vigil
twilit oriole
#

Yes

rocky vigil
#

i wouldn't read too much into it

twilit oriole
#

Finally a logical response

rocky vigil
#

all the sprt shows is that it's not some +5 superscaler at least for 1 thread

amber fern
rocky vigil
#

yes?

#

the combined error bars are like 4 elo

amber fern
#

I still think we should only be testing the stage 5 net, since its .5 elo higher on fishtest stc

rocky vigil
#

performance is also really machine dependent

#

and not just the fact that arm machines have it at +15 elo

rocky vigil
rocky vigil
amber fern
#

too bad I don't have an arm machine

#

wait, does it also improve on phone hardware? Since that's arm?

rocky vigil
#

presumably

prime mica
#
Result of 100 runs
==================
base (...kfish.ti.gcc) =    1517185  +/- 2488
test (./stockfish    ) =    1565808  +/- 2403
diff                   =     +48623  +/- 2682

speedup        = +0.0320
P(speedup > 0) =  1.0000

some improvements on Zen 5 as well...

#

ok I'ma put this up on fishtest

rocky vigil
#

there was a little meming about releasing sf18 since for arm devices it's at that level

split warren
#

Hmm is there a way to contribute cores to a specific test on fishtest?

twilit oriole
#

Nope

#

Been a requested feature for years

rocky vigil
#

yeah no vondele just flooded fishtest with machines until some joined the threat input tests

twilit oriole
#

There is also the 8 Elo of net spsa Elo that can be added. Which is always forgotten

rocky vigil
#

not forgotten

#

we've been well over pre-spsa 5 stage

#

since i8

twilit oriole
#

Well not you lol. But I get messages why isn't threat inputs working well all the time

#

In DMs

rocky vigil
#

💀

split warren
#

I look at this as just the beginning

rocky vigil
prime mica
rocky vigil
#

or is this just the dev thread

prime mica
#

celebrity 😩

twilit oriole
#

Idk why they don't message here. I guess they just see the first post is me and then DM

amber fern
split warren
#

You need to be aware this exists and discussions are happening here, else discord ui is very good at hiding it from the rest of the world

rocky vigil
#

yeah

#

afterwards it would be best to just make it mainstream

#

nnue-dev

#

etc.

twilit oriole
#

Yeah they never messaged here so maybe they just don't see it. And only know it is from me from general engine dev

#

I mean I should make a copypasta for the response. It's like the same every time

prime mica
#

lol

#

Threat inputs copypasta incoming

split warren
#

You should just reply with 'and how's the alternative doing?'

twilit oriole
#

Anyways yes it is true I am somewhat annoyed at getting messaged multiple times why isn't threat inputs "working" yet. When it is working perfectly fine when considering all factors

amber fern
#

There, I put the link to here in the nnue channel

rocky vigil
#

this offers 2 more chances at getting a slightly better net

#

anyways I've modified the fishtest description

twilit oriole
#

I'm too easy to bait tbh, character flaw. Someone sends me a sprt and asks 'does this mean threat inputs isn't promising' and I'll always respond kek

rocky vigil
#

just in case ppl are lurking there

prime mica
#

ok let's see...

#

I wish SF was better than shashchess 😩

#

maybe we should merge in some of his changes

rocky vigil
rocky vigil
#

by like 50 elo

#

at ltc

#

or smth

twilit oriole
#

Baited lmao

prime mica
#

lol

rocky vigil
#

oh shoot

prime mica
#

ez

rocky vigil
#

🤡

amber fern
# rocky vigil or smth

more like 100, this is urgent guys, maybe shashchess changes will giga scale with threat inputs omg

rocky vigil
#

💀

rocky vigil
#

this got buried

#

what do others think

prime mica
#

clean up the code generally
is the biggest one to me lmao

rocky vigil
#

that's the one where other ppl are gonna have to suggest stuff

prime mica
#

I'm not a C++ guru tbh so i don't know how to make things clean

#

but I'll think about it

rocky vigil
#

did nobody ever bother to delete these 25 lines

#

this has actually been unused since like

prime mica
#

what's that thing called in genetics

rocky vigil
#

shawn's first dual-accumulator

prime mica
#

Genetic hitchhiking, also called genetic draft or the hitchhiking effect, is when an allele changes frequency not because it itself is under natural selection, but because it is near another gene that is undergoing a selective sweep and that is on the same DNA chain. When one gene goes through a selective sweep, any other nearby polymorphisms th...

frosty imp
prime mica
#

that too

#

I thought it was copied from the other features

frosty imp
#

true

rocky vigil
prime mica
#

why is there this random massive memcpy in do_move

#

are we passing something by value

rocky vigil
prime mica
#

omg DirtyBoardData is getting copied around for some reason

#

return value optimization isn't happening

rocky vigil
#

👀

prime mica
#

disgusting

#

ok I guess I'll make it an out parameter
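
A small illustration of the difference, using a made-up `DirtyData` struct as a stand-in for the large struct under discussion. The by-value version below is just one case where elision cannot happen (a conditional over two named locals); whether that is what actually happens in `do_move` is not known from this thread:

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a large per-move struct.
struct DirtyData {
    std::array<std::uint32_t, 96> threats{};
    int count = 0;
};

// By value: the conditional expression is an lvalue, so the result is
// copy-constructed and return value optimization does not apply.
DirtyData compute_by_value(int seed, bool alt) {
    DirtyData a, b;
    a.threats[a.count++] = static_cast<std::uint32_t>(seed);
    b.threats[b.count++] = static_cast<std::uint32_t>(seed + 1);
    return alt ? b : a;
}

// Out parameter: the callee writes straight into caller-owned storage,
// so there is no struct to copy on return.
void compute_into(int seed, bool alt, DirtyData& out) {
    out.count = 0;
    out.threats[out.count++] = static_cast<std::uint32_t>(alt ? seed + 1 : seed);
}
```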

amber fern
#

The stage 4 Threat inputs ltc test is steadily improving somehow btw, like it's -1.3 now, it was -2.3 before

#

is it cos of the arm testing hardware?

prime mica
#

I think I bungled something here but I'm not sure, breaking it up into each change to find out what the problem is

prime mica
#

oh you mean Elo

#

yeah who knows

#

when a test is hardware dependent it bounces around everywhere

amber fern
#

yeah

rocky vigil
#

just gotta sneak in some fleet for next progtest Kappa

prime mica
amber fern
#

kinda annoying, should really have different architecture separation

amber fern
rocky vigil
prime mica
#

no lol

rocky vigil
#

if I interpret the vmovdqu correctly

prime mica
#

but it's memcpying it around a lot

#

we'll see

rocky vigil
#

like there are 22 vmovdqus

#

and their average is like 0.3%

prime mica
#

oh

#

it's a fraction of the method's execution time, not the whole program

rocky vigil
#

oh

prime mica
#

and do_move is like 6%

#

but we'll see

rocky vigil
#

i see

prime mica
#

hm we might want to add a template parameter to do_move which is whether or not to do dirty piece calculations

rocky vigil
#

wouldn't it be true

#

most of the time tho

prime mica
#

most of the time... but still adds a branch everywhere
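
A minimal sketch of the template idea, assuming C++17 and using made-up `Board` and `do_move_sketch` names rather than the real `Position::do_move` signature. With `if constexpr`, the instantiation that does not need dirty-piece tracking contains no branch and no bookkeeping code at all:

```cpp
#include <vector>

// Hypothetical stand-in for the position state.
struct Board {
    std::vector<int> dirty;  // moves recorded for incremental updates
    int moves = 0;
};

template<bool TrackDirty>
void do_move_sketch(Board& b, int move) {
    ++b.moves;  // the work every caller needs
    if constexpr (TrackDirty) {
        // Compiled only into the TrackDirty = true instantiation.
        b.dirty.push_back(move);
    }
}
```

The cost is that every caller has to pick an instantiation, which is how the "too many things to template" problem arises.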

naive comet
#

but worth a second try

prime mica
#

ugh this is pissing me off

#

too many things to template

#

lol it gets memcpyed 3 times I think

#

i'm too lazy to make things nullable so here's what we're gonna do

#
Result of 100 runs
==================
base (...vx2.69a01b88) =    1261593  +/- 2401
test (./stockfish    ) =    1274991  +/- 2447
diff                   =     +13398  +/- 2445

speedup        = +0.0106
P(speedup > 0) =  1.0000

CPU: 128 x AMD EPYC 9755 128-Core Emb Processor
#

ok let's hope this works on fishtest

#

that's just removing one copy lol

#

ok time to try to remove the other one

jolly tangle
#

can someone point me to the latest current threats input branch? I'm going to spend the weekend understanding it

prime mica
#

sscg's threat-inputs-i8 is what I'm working off of

#

and was the basis for the recent SPRTs

#

lol

#

I really like this line

#

just make sure it's really 0

rocky vigil
#

they are two different things

prime mica
#

oh wait

#

I can't READ

#

wait no wonder it's not working

#

thx babe

#

OK

#

time to put on fishtest

#

ok three speedups to try out on fishtest... let's see how they work

rocky vigil
#

in other news the stage 2 nets are somehow now extremely strong relative to training time

prime mica
#

oh that's exciting

prime mica
#

just the 254/255 thing?

#

or is there something else

rocky vigil
#

255/256 instead of 127/128

jolly tangle
prime mica
#

yes

#

mr. sscg13's fork

#

ok going out but will look for more speedups when I get back

rocky vigil
#

nice nice

twilit oriole
#

It is 3am

prime mica
#

where

twilit oriole
#

uk

prime mica
#

you are correct

twilit oriole
#

going out = Asia?

#

Everywhere else is night lol

rocky vigil
#

could also be western americas

prime mica
#

Go to bed dude

frosty imp
prime mica
#

Lol

#

Just dinner with a friend

rocky vigil
prime mica
#

Are there any ways to make LUT more compact

rocky vigil
#

tbh like [12][64][12][64] is good enough i think

jolly tangle
#

does bullet support threat inputs natively, or you need to manually do it

rocky vigil
rocky vigil
#

actually

#

that kinda defeats the purpose

#

yeah idt there's a fast way to decrease the current 2^20 size

jolly tangle
#

so if I'm understanding correctly, the threat inputs are tuples of (from square, to square, piece on from square, piece threatened on to square) counting both 'threats' of capturing my own pieces (i.e. defended pieces) as well as threats to capture the opponent's pieces. And one optimization is you de-duplicate symmetric threats. So e.g. if you have a bishop on A1 threatening a bishop on B2, then obviously that implies the bishop on B2 threatens the bishop on A1, so you fold those inputs together into one. The idea is e.g. we don't need 'pawn threatens bishop' as an input because the 'bishop threatens pawn' already contains that information based on the input squares. You also skip symmetric NvN or BvB etc. attacks with a from < to comparison.

So essentially the network knows where each piece is relative to the king, and also knows all the pieces that are defended/defending/attacked/attacking it.
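
A simplified sketch of the `from < to` de-duplication only, assuming squares are numbered 0..63 with A1 = 0 and B2 = 9. The enum values, struct and function name are invented for illustration. The cross-type folding (e.g. dropping 'pawn threatens bishop') is not modeled, and pawns are left alone here because their attacks are directional:

```cpp
// Hypothetical piece numbering, not the engine's.
enum PieceType { PAWN = 1, KNIGHT, BISHOP, ROOK, QUEEN, KING };

struct Threat {
    PieceType attacker;
    int       from;    // square of the attacking piece, 0..63
    PieceType victim;
    int       to;      // square of the threatened piece, 0..63
};

// For two pieces of the same non-pawn type, "A attacks B" implies
// "B attacks A", so only the direction with from < to is kept.
bool keep_threat(const Threat& t) {
    const bool mutual = t.attacker == t.victim && t.attacker != PAWN;
    return !mutual || t.from < t.to;
}
```

For the bishop example above, the A1 to B2 threat is kept and the B2 to A1 threat is dropped.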

rocky vigil
#

yep

#

if you don't do eval in check you can also skip X -> K, but this so far hasn't gained at fishtest

#

the threat inputs also have horizontal mirroring, but no additional buckets bc that would blow up the size

amber fern
violet badger
stray reef
jolly tangle
#

yeah I have been using plenty more than SF to understand 😂

prime mica
#

Dodo and mallard looking ok so far… both apply to avx2

#

Time to see what else there is

#

maybe optimizing make_index

#

updated profile (sscg's threat-inputs-i8)

rocky vigil
#

we'll see in a few days

prime mica
#

gotcha, that puts more work on which function exactly?

#

the vpdpbusd spam?

rocky vigil
#

on eval::nnue::network<...>::evaluate()

#

I think

frosty imp
#

how good is i8 now

prime mica
#

evaluate?

#

oh I meant in the source code not the profile

rocky vigil
#

6.97%/1.55%

#

for 1024 / 128

#

oh

#

uh

#

it'll put more pressure

prime mica
#

ah yup

#

that loop is the bane of my existence

#

I finally realized why GCC bungles it in the old form and it's so dumb

#

it's because biases is right before weights

#

so because it loads from biases first, it thinks it's a good idea to replace weights[x] with biases[x + 1]

#

and then disaster ensues

rocky vigil
#

what

prime mica
#

in the assembly I mean

rocky vigil
#

oh

#

idk what u are referring to

#

but

#

gl

#

hf

#

well basically after you find the nnz's

#

instead of adding a bunch of 16

#

vectors

#

u add a bunch of 32 vectors

prime mica
rocky vigil
#

idk how much slower that would be

prime mica
#

I actually think that'll be less than a 2x slowdown

#

because

rocky vigil
#

the entire propagation takes like 7%

prime mica
#

the loop is actually currently somewhat bottlenecked by index calculations rather than actual vector instruction spam

rocky vigil
#

i assume this propagate takes most of that time

prime mica
#

at least on VNNI, I think on AVX512 and AVX2 it's the vector instructions

rocky vigil
#

so idk what the relative ratio of nnz / avx

prime mica
#

nnz is very little time IME

#

although probably more time on AVX2

#

I have a complicated theory to speed it up on AVX2 that I'll get to at some point... but doesn't benefit threat inputs any more than master

rocky vigil
#

but yeah in 2 days or so we'll find out

prime mica
#

👍

rocky vigil
#

you're welcome to try it, just change l2Big in the source code from 15 to 31

prime mica
#

which net should I use?

rocky vigil
#

for speed testing purposes

prime mica
#

perfect ok

frosty imp
#

What’s the current progress?

#

Did QAT work? Has i8 beaten the baseline

rocky vigil
#

at stc

#

i8 is effectively the main branch rn

frosty imp
#

And QAT?

rocky vigil
#

give it 12 hours

prime mica
#

lol the branch predictor HATES this section

#

all the conditions are like 20 to 60% probability and unpredictable

#

idk how to reduce the # of branches tho, let's see

lapis parrot
#

static exchange evaluation

prime mica
#

u are a funny one

#

wait why is append_changed_indices only used in double_inc_update

#

oh FeatureSet is a template parameter blob_facepalm

prime mica
#
```
==================
base (...vx2.69a01b88) =    1284931  +/- 1876
test (./stockfish    ) =    1295177  +/- 1889
diff                   =     +10246  +/- 2242

speedup        = +0.0080
P(speedup > 0) =  1.0000
```
for me but who knows if it's real
#

ok time to revisit LUT...

#

sscg I like your idea of making the table 64x smaller by not including from in the index

#

attacks_bb with two arguments looks pretty fast

#

another really demented idea is to re-index the features in a deliberately inefficient way... using the fact that the unused indices won't use any cache space

#

but making the indexing very fast

#

however it'd use a lot of RAM for zeros :/

rocky vigil
#

wait

#

no it can be 64x smaller

#

yeah

#

for the price of one attacks_bb lookup and some arithmetic
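
The sizes being discussed can be checked with a few lines of compile-time arithmetic. This assumes the 2^20 figure mentioned earlier corresponds to a `[16][64][16][64]` table (piece, from, piece, to); the helper name is made up, and the size in bytes depends on the entry width, which is not stated here:

```cpp
#include <cstddef>

// Number of entries in a [pieces][squares][pieces] table, optionally with
// a second square dimension for the 'from' square.
constexpr std::size_t lut_entries(std::size_t pieces, std::size_t squares,
                                  bool with_from) {
    return pieces * squares * pieces * (with_from ? squares : 1);
}

static_assert(lut_entries(16, 64, true) == (std::size_t{1} << 20),
              "[16][64][16][64] has 2^20 entries");
static_assert(lut_entries(16, 64, false) == (std::size_t{1} << 14),
              "dropping 'from' leaves 2^14 entries, 64x fewer");
static_assert(lut_entries(12, 64, true) == 589824,
              "[12][64][12][64] is a bit over half of 2^20");
```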

prime mica
#

yep

rocky vigil
#

forgot the exact order of indexing myself briefly lol

prime mica
#

index luvr 9000

#

anyway signing off for tn we'll see if any of the speedups make it

rocky vigil
#

ye

#

fair

torn lagoon
# prime mica <@398510765910523904> (and others willing): could u do a bench on https://tests....
```
==================
base (...at-inputs-i8) =    2027832  +/- 3402
test (../snowy-plover) =    2010917  +/- 3329
diff                   =     -16915  +/- 1227

speedup        = -0.0083
P(speedup > 0) =  0.0000

CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on
```

```
Result of 200 runs
==================
base (...ad-inputs-i8) =     807325  +/- 3379
test (../snowy-plover) =     814006  +/- 3549
diff                   =      +6681  +/- 570

speedup        = +0.0083
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
Hyperthreading: on
```
prime mica
#

wut

#

compiler?

#

that is so weird this one shouldn't be arch dependent

torn lagoon
prime mica
#

😭

#

can I just send u two binaries and u try

#

that way I can know the assembly

torn lagoon
#

Yeah

prime mica
#

although I only have Linux

torn lagoon
prime mica
#

dunno how to cross compile to windows

#

Oh! perfect ok

#

I'll send tmrw

#

gn

torn lagoon
#

gn

foggy wind
torn lagoon
#

Oh my benchmarks were with the default arch, so avx512icl

torn lagoon
naive comet
#

what if instead of slli+srai, we do shuf+srai? cuz idk free ports or smth smth @stray reef @prime mica

stray reef
naive comet
#

mmm

plain flower
#

p5 is usually quite constrained so i tend to avoid shuffles where possible as a rule of thumb

rocky vigil
#

for the QAT run testing

violet badger
#

probably, do you have this already merged in a branch?

violet badger
#

I've updated reference for the QAT run, as this has been integrated in sscg13 latest branch

#

on the arm nodes, threat-inputs-i8 is now just 2% slower than master...

==== master ====
1 Nodes/second : 298577711
2 Nodes/second : 297081100
Average (over 2):  297829405
==== threat-inputs-i8 ====
1 Nodes/second : 291537416
2 Nodes/second : 292647054
Average (over 2):  292092235
violet badger
#

so QAT training run doesn't seem to yield anything stronger than what we have so far.

prime mica
#

:(

#

The patch does nothing elsewhere

foggy wind
# prime mica <@398510765910523904> can u run your bucketing script on https://tests.stockfish...
GROUPED BY ARCH

64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                | Elo: -0.58 ± 1.67 | LOS:  24.7% | LLR: -1.46 | [131, 4300, 11127, 4261, 117]
64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo: -0.51 ± 2.20 | LOS:  32.5% | LLR: -0.79 | [64, 2592, 6337, 2592, 47]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                       | Elo:  5.24 ± 2.42 | LOS: 100.0% | LLR:  3.06 | [45, 1985, 5248, 2248, 58]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT                            | Elo:  0.38 ± 2.58 | LOS:  61.5% | LLR: -0.06 | [57, 1902, 4641, 1949, 43]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                  | Elo: -1.18 ± 4.58 | LOS:  30.7% | LLR: -0.30 | [24, 626, 1515, 615, 20]
prime mica
#

Ok that’s good

#

I’ll restart it with the proper filtering ig

#

Thx legend

foggy wind
#

Do you think that's necessary? The bmi2 and avx2 values are relevant, right? Together, they look good.

prime mica
#

ya but it'd be nice to get an official estimate of the magnitude

#

idk

green moat
prime mica
#

how is this even possible...

foggy wind
prime mica
#

ye

prime mica
#

there are plenty of changes which are amazing on my computer's avx2 but fail on my friend's machine etc.

prime mica
#

ugh why did snowy-egret fail

#

it must be the dumb optional dirty threats stuff

rocky vigil
prime mica
#

maybe the optimizer realizes the computations are unused and can throw them away on the original

green moat
prime mica
#

kyoot

prime mica
#

hmmm maybe instead of a big LUT we can just micro optimize the existing calculations

#

in particular not using LUTs and instead using constants...

rocky vigil
prime mica
#
```
0xcafebabe30913928 >> (attkr & 7)
```
#

type beat

#

especially because the constants could probably be hoisted given that make_index is usually used in a loop

#

we can generate the constants via constexpr so it shouldn't be unreadable
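
A sketch of what generating such a constant via `constexpr` could look like: eight small values packed into one 64-bit word and pulled out with a shift and mask instead of a memory load. The table contents and names are made up, and the byte-wide `8 * i` shift is an assumption about the intended scheme:

```cpp
#include <array>
#include <cstdint>

// Packs eight byte-sized values into one 64-bit constant at compile time.
constexpr std::uint64_t pack_bytes(const std::array<std::uint8_t, 8>& v) {
    std::uint64_t out = 0;
    for (int i = 0; i < 8; ++i)
        out |= static_cast<std::uint64_t>(v[i]) << (8 * i);
    return out;
}

// Made-up values standing in for a small per-piece table.
constexpr std::array<std::uint8_t, 8> kOffsets = {0, 3, 9, 20, 41, 77, 130, 255};
constexpr std::uint64_t kPacked = pack_bytes(kOffsets);

// Shift-and-mask lookup; the constant lives in a register, not in memory.
constexpr unsigned packed_lookup(unsigned i) {
    return static_cast<unsigned>((kPacked >> (8 * (i & 7))) & 0xFF);
}

static_assert(packed_lookup(3) == 20, "matches kOffsets[3]");
```

Because the readable array stays in the source and the magic number is derived from it, the hex constant never has to be written by hand.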

#

but idk I'll try the medium-sized LUT first

#

and see how it does on fishtest

rocky vigil
#

yeah

#

[16][64][16] or smth

#

this isn't even that bad

#

much smaller than pext lookups

prime mica
#

^_^

naive comet
#

combining this into 1 mask was slower for me tho locally

rocky vigil
#

i feel like lookup tables aren't actually that slow as long as they're small and fit in cache

prime mica
#

agree for the most part

#

main issue is if a branch depends on the result

#

because they can have quite high latency

rocky vigil
#

i mean for the threat indexing i would guess the CPU attempts to do all of them simultaneously

#

while it waits for the lookups

#

citation needed

#

i am nowhere near a cpu expert

prime mica
#

sure

#

it might end up being profitable summing two LUTs one indexed with attkr/to/attkd and one attkr/from/to

#

the latter can be byte-sized

#

will try both

#

actually yeah this seems rly succulent

#

the latter will have quite good cache locality while the first one will be relatively smol

prime mica
#

why tf is PIECE_TYPE_NB 8 when there are 6 piece types (or 7 if you include no piece)

rocky vigil
#

that would be because

#

0 and 7 mod 8

#

are no-piece

prime mica
#

advanced

rocky vigil
#

wait

#

i forgor

prime mica
#

ok this is the most demented codegen I've ever seen

rocky vigil
#

maybe it's PAWN = 2

#

idk

prime mica
#

GCC uses a mask register as a temp register

rocky vigil
#

i just know that of 0-7 two are no piece

prime mica
#

yeah I think 0 and 7 are the no piece

rocky vigil
#

idk what that means

#

(using mask register as temp register)

#

lmao

prime mica
#

mask registers are this obscure type of register introduced in AVX512

#

and it's very weird to be using them for integers...

#

sigh

warm thistle
#
```
Result of  20 runs
==================
base (./sf-old       ) =    1380434  +/- 10321
test (./stockfish    ) =    1406364  +/- 10545
diff                   =     +25930  +/- 2466

speedup        = +0.0188
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
```
on my cpu
frosty imp
prime mica
#

ok let's hope this works on fishtest 🤞

warm thistle
#

👀

prime mica
#

can't profile locally atm bc my friend is running some weird DFT simulation lol

prime mica
prime mica
#

just for my peace of mind lol

#

should be 24..

warm thistle
#

matches up

prime mica
#

😊 thank u sir

warm thistle
#

on it

#

(oh i'm compiling without pgo btw idk if that will affect anything)

prime mica
#

that's ok

#

as long as both are w/o PGO

rocky vigil
#

oh yeah later on u can update the net

#

forgot to mention

#

it's +3 elo on vondele local test

warm thistle
#
Result of  20 runs
==================
base (./sf-old       ) =    1392913  +/- 11095
test (./stockfish    ) =    1421577  +/- 10865
diff                   =     +28663  +/- 3084

speedup        = +0.0206
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
prime mica
#

ok, so maybe slightly better

prime mica
prime mica
#

I'm trying 😭

#

threat inputs my beloved

rocky vigil
prime mica
#

lollll

#

reference = master?

warm thistle
#

[-1, 1] wtf lol

prime mica
#

is this on ARM tho

rocky vigil
prime mica
#

ah the stage 4 net we've been testing with?

rocky vigil
#

yeah

prime mica
#

ok that's exciting

rocky vigil
#

new one is indeed very good

#

cj speaking misinfo 🗣️

#

that 255/256 instead of 127/128 is 3 elo

prime mica
#

lololol

rocky vigil
#

not negligible

prime mica
#

maybe time for a new SPRT soon then?

rocky vigil
#

ye after speedup

prime mica
#

👍

lapis parrot
#

@prime mica you should name at least one of your patches titmouse

#

funneh name will pass 100%

prime mica
#

lol

#

bushtit

#

blue footed booby

#

etc.

#

African wild ass

naive comet
#

I tried this concept previously and it ended up -0.5%...

prime mica
#

can I see what you did exactly

#

screenshot maybe

prime mica
#

bc here we're plopping it into a 64 bit integer

naive comet
#

"combining this into 1 mask was slower for me tho locally"

prime mica
#

oh I see sorry

naive comet
#

ill see if i can find it cuz i didnt commit cuz it was slower

#

uhh

#

let me look in my recycle bin

prime mica
#

which I think should work quite nicely tbh

#

WAIT

#

we don't even need that do we

#

it can just be a bit in the LUT

#

😭

#

I missed the forest for the trees

#

that is funny

naive comet
prime mica
#

ok ostrich time

#

I think this should be nearly optimal index calculation at least with this LUT setup

#

checking whether the feature doesn't exist is four instructions on ARM and x86 🤓

rocky vigil
prime mica
#

yeah ik

#

my hope with the sf_assume(index != Dimensions) is that (some) compilers will be able to jump thread the continue and/or elide one compare

#

GCC seems to do so, haven't tested clang
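
A hedged sketch of the pattern, with a stand-in `ASSUME_SKETCH` macro rather than the real `sf_assume`, and made-up table contents. The index producer returns a sentinel for non-existent features and promises the optimizer that the valid path never yields that sentinel, so after inlining the caller's `continue` check can potentially be merged with the earlier branch. Whether a given compiler actually does this is not guaranteed:

```cpp
#include <vector>

#if defined(__GNUC__) || defined(__clang__)
    #define ASSUME_SKETCH(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
    #define ASSUME_SKETCH(cond) ((void) 0)
#endif

constexpr unsigned kDimensions = 1000;  // hypothetical "no such feature" sentinel

// Made-up table; no entry equals kDimensions, so the assumption below holds.
static const unsigned kTable[8] = {10, 11, 12, 13, 14, 15, 16, 17};

inline unsigned make_index_sketch(unsigned key) {
    if (key % 3 == 0)
        return kDimensions;               // feature does not exist
    const unsigned index = kTable[key & 7];
    ASSUME_SKETCH(index != kDimensions);  // a loaded value is otherwise opaque to the compiler
    return index;
}

std::vector<unsigned> collect_indices(const std::vector<unsigned>& keys) {
    std::vector<unsigned> out;
    for (unsigned k : keys) {
        const unsigned index = make_index_sketch(k);
        if (index == kDimensions)
            continue;  // candidate for jump threading once inlined
        out.push_back(index);
    }
    return out;
}
```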

rocky vigil
#

ah I see

prime mica
#

ok so current state of affairs: we should be able to do plover + mallard + dodo + ostrich

#

dodo is AVX2 only but the rest are general

#

once all of them are resolved maybe we can SPRT against master with the new net

violet badger
#

sounds good.

#

but why are we excluding Scolopax minor from the list?

torn lagoon
prime mica
#

ostrich > cassowary > woodcock

#

they're just iterations on the same idea

#

but I'm leaving them up in case I made a mistake...

prime mica
#

but lemme try combining the four patches and we can see whether the speedups are ~additive

#

if u could benchmark that that'd be greatly appreciated

#

my computer is currently suffering my friend's DFT simulation

torn lagoon
prime mica
#

oh yeah if you could bench that one by itself that'd be great

warm thistle
torn lagoon
#

On it

prime mica
#

Kevin got 2.88% speedup

#

which is roughly expected

#

but it'll probs be less on fishtest

torn lagoon
#
==================
base (...at-inputs-i8) =    2023444  +/- 1829
test (../ostrich     ) =    2050542  +/- 1935
diff                   =     +27097  +/- 1232

speedup        = +0.0134
P(speedup > 0) =  1.0000

CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on

Result of 200 runs

base (...at-inputs-i8) =     813188  +/- 3192
test (../ostrich     ) =     828548  +/- 3321
diff                   =     +15360  +/- 588

speedup        = +0.0189
P(speedup > 0) =  1.0000

CPU: 4 x Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
Hyperthreading: on

prime mica
#

ok cool

#

so not amazing but not zero

#

that has the things combined

rocky vigil
#

lmao

#

everything everywhere all at once

prime mica
#

yes

#

moas are an enormous extinct bird

warm thistle
#
Result of  20 runs
==================
base (./sf-old       ) =    1347420  +/- 8957
test (./stockfish    ) =    1384998  +/- 8926
diff                   =     +37578  +/- 2477

speedup        = +0.0279
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
#

on mine

prime mica
#

huh

#

that doesn't add up

#

should be more like +4% lol

#

could you try git revert de7fbe6b and re-compile/re-bench

#

possible that one of them doesn't like the other

warm thistle
#

yes i'll try

prime mica
#

thx legend

warm thistle
#

helping in any way i can because i am not good at optim 💀

prime mica
#

lol

#

then I got pissed off at how inefficient 95% of modern software is

warm thistle
#
Result of  20 runs
==================
base (./sf-old       ) =    1308364  +/- 14174
test (./stockfish    ) =    1338017  +/- 14966
diff                   =     +29652  +/- 2742

speedup        = +0.0227
P(speedup > 0) =  1.0000

CPU: 8 x AMD Ryzen 7 7700X 8-Core Processor
Hyperthreading: on
prime mica
#

hm ok

warm thistle
#

the only optim i know is rewriting things in asm ☠️

prime mica
#

maybe the +2.88% measurement was a fluke then

warm thistle
#

could be natural variation i guess lol
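
For reference, the `speedup` line in these summaries is consistent with `diff / base`. A minimal sketch using the two 20-run results above; `relative_speedup` is a made-up helper, and treating the `+/-` on `diff` as the error on the difference is an assumption about the tool's output:

```python
def relative_speedup(base_nps: float, diff_nps: float, diff_err: float) -> tuple[float, float]:
    """Return (speedup, error), both as fractions of the base nodes/second."""
    return diff_nps / base_nps, diff_err / base_nps


# (base, diff, error on diff) copied from the two 20-run results above
runs = {
    "first run": (1347420, 37578, 2477),
    "second run": (1308364, 29652, 2742),
}

for name, (base, diff, err) in runs.items():
    speedup, speedup_err = relative_speedup(base, diff, err)
    print(f"{name}: {speedup:+.2%} +/- {speedup_err:.2%}")
```

Putting the error on the same scale as the speedup makes it easier to judge whether two measurements really disagree or are just noisy.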

prime mica
#

oh well we'll see on fishtest etc.

violet badger
#

also running speedtest here..

#

4c902c7c443b6b71d06895097ecd8d85aa3a2e99 vs caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3

prime mica
#

lol what are those

violet badger
#

moa vs no-bird

prime mica
#

loll kk

violet badger
#

somehow it helps often afterwards to understand what was really what. One reason to use shas in nettest to pin versions, not branches.

prime mica
#

agree

#

these days I have a bins folder formatted as stockfish.<arch>.<compiler>.<first nibbles of hash>

violet badger
#

makes sense... assuming you always use 'profile-build' 😉

prime mica
#

lololol

#

that has bitten me before
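
A small sketch of that naming scheme with the build target added as an extra field, so a plain `build` can't be mistaken for a `profile-build`. `binary_name` and the extra field are hypothetical, not something either of us actually uses:

```python
def binary_name(arch: str, compiler: str, sha: str,
                target: str = "profile-build", nibbles: int = 8) -> str:
    """Build a file name that pins arch, compiler, build target and commit."""
    if len(sha) < nibbles:
        raise ValueError("sha is shorter than the requested prefix")
    return f"stockfish.{arch}.{compiler}.{target}.{sha[:nibbles]}"


print(binary_name("x86-64-avx2", "gcc", "4c902c7c443b6b71d06895097ecd8d85aa3a2e99"))
```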

violet badger
#

yeah, easy enough.

#

virtually no gain on the arm nodes, but no damage done either:

==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
1 Nodes/second : 292248079
2 Nodes/second : 290685623
Average (over 2):  291466851
==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
1 Nodes/second : 292968214
2 Nodes/second : 292490843
Average (over 2):  292729528
prime mica
#

that's a little surprising honestly

#

I'll investigate later, I'm guessing at least one of the changes is counterproductive on ARM

violet badger
#

let me just run a few more of the shas and we'll see in a bit.

prime mica
#

cool thxx

#

testing on Apple silicon as well

violet badger
#

Maybe somebody else can run the following script:

$ cat speedtesting.sh 
set -e
max_rep=2
for branch in 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 a227ff9916e98a33c198a1749f23f2783934be8a 064d09f40f0ef0e9daf19609faaf79aa471ec362 a6721c73eea0b11eaf9b574ed509265211be5738 d8eb3207dbf7f264035ee52ddfa1db9c39471841 638e1786bde4e1f3eb17be5a32da3650666318d9
do
 echo "==== $branch ===="
 if [ ! -f stockfish.$branch ]; then
   git checkout $branch > compile.out.$branch 2>&1
   make -j profile-build >> compile.out.$branch 2>&1
   mv stockfish stockfish.$branch
 fi
 for iter in $(seq 1 ${max_rep})
 do
   if [ ! -f speedtest.out.$branch.$iter ]; then
     ./stockfish.$branch speedtest > speedtest.out.$branch.$iter 2>&1
   fi
   echo $iter $(grep "Nodes/second" speedtest.out.$branch.$iter)
 done
 echo "Average (over $max_rep): " $(grep "Nodes/second" speedtest.out.$branch.* | awk '{s=s+$NF; c++}END{printf("%16d\n",int(s/c))}')
done
prime mica
#

not sure if all the commits are functional btw

#

but thx that's very useful

#

you are a bash god

violet badger
#

nah, the syntax highlighting is done by discord

prime mica
#

looking ok on Apple silicon

Result of 100 runs
==================
base (./stockfish.ti ) =    1566081  +/- 3859
test (./stockfish    ) =    1587327  +/- 3990
diff                   =     +21247  +/- 2395

speedup        = +0.0136
P(speedup > 0) =  1.0000
#

not what I was hoping for tho

violet badger
#

well, I think the main interest right now is anyway x86

prime mica
#

yeah true

torn lagoon
#
==================
base (...at-inputs-i8) =    2017676  +/- 2737
test (../moa         ) =    2064445  +/- 2825
diff                   =     +46769  +/- 1582

speedup        = +0.0232
P(speedup > 0) =  1.0000

CPU: 6 x AMD Ryzen 5 9600X 6-Core Processor
Hyperthreading: on
prime mica
#

ok cool

#

underwhelming but decent

#
==================
base (...kfish.ti.gcc) =     830128  +/- 901
test (./stockfish    ) =     844316  +/- 1042
diff                   =     +14188  +/- 1326

speedup        = +0.0171
P(speedup > 0) =  1.0000

on my friend's Zen 2 server

#

all that matters is beating master tho lol

#

i think that's looking plausible? 🤞

rocky vigil
#

2% is good, it's like 4 elo

#

lol

#

at stc

prime mica
#

kk

#

should we wait for my patches to finish or just SPRT early

rocky vigil
#

i mean i feel like if the individual ones pass

#

just merge them

#

no need to run an extra sprt

prime mica
#

oh sure I meant threat inputs against master

rocky vigil
#

ah

#

lemme look

#

yeah just wing it, add the net change and try an stc sprt vs master

#

alone it should be +1 elo, with speedups +4 or smth

prime mica
#

kk do u wanna do it or should I

rocky vigil
#

I can do it

prime mica
#

ok great

rocky vigil
#

more throughput

prime mica
#

yes true

rocky vigil
#

so pull moa right

prime mica
#

ye

rocky vigil
#

ok

prime mica
#

you haven't made any changes to threat-inputs-i8 recently right

rocky vigil
#

no, except for net change

prime mica
#

kk perfect

rocky vigil
#

oh wow with the ultimate SSS on my laptop bench is up 10%

prime mica
#

lol

#

SSS lover

#

probably bc different net

#

so different node count

rocky vigil
#

no like baseline

#

vs the stuff I just pulled

prime mica
#

ah gotcha

rocky vigil
#
Nodes searched  : 2324801
Nodes/second    : 1043447
(baseline)

Total time (ms) : 1974
Nodes searched  : 2324801
Nodes/second    : 1177710
(new) lmao
prime mica
#

I love this emoji so fucking much omg

#

I use it all the time elsewhere with no context

#

one person thought it was a reference to the Schutzstaffel for some reason?

#

need to get their mind out of the gutter

rocky vigil
#

oh wow that was fast

prime mica
#

wait what about the new net

#

oh you changed it too ok

#

misleading name 😅

rocky vigil
#

Oh yeah

#

Btw I’m curious what does reducing register count do

#

(In dodo)

violet badger
#

OK, for what it is worth:

==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
1 Nodes/second : 292248079
2 Nodes/second : 290685623
Average (over 2):  291466851
==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
1 Nodes/second : 292968214
2 Nodes/second : 292490843
Average (over 2):  292729528
==== de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 ====
1 Nodes/second : 292234932
2 Nodes/second : 293382923
Average (over 2):  292808927
==== a227ff9916e98a33c198a1749f23f2783934be8a ====
1 Nodes/second : 292874238
2 Nodes/second : 287634242
Average (over 2):  290254240
==== 064d09f40f0ef0e9daf19609faaf79aa471ec362 ====
1 Nodes/second : 287458645
2 Nodes/second : 289818521
Average (over 2):  288638583
==== a6721c73eea0b11eaf9b574ed509265211be5738 ====
1 Nodes/second : 292221507
2 Nodes/second : 293205229
Average (over 2):  292713368
==== d8eb3207dbf7f264035ee52ddfa1db9c39471841 ====
1 Nodes/second : 289783271
2 Nodes/second : 289119242
Average (over 2):  289451256
==== 638e1786bde4e1f3eb17be5a32da3650666318d9 ====
1 Nodes/second : 290801279
2 Nodes/second : 289798590
Average (over 2):  290299934
prime mica
#
start:
vpaddw ymm0, ymm0, [rax]
vpaddw ymm1, ymm1, [rax + 32]
vpaddw ymm2, ymm2, [rax + 64]
...
vpaddw ymm15, ymm15, [rax + ...]
;; increment base pointer, loop...
jle start

on master, right?

rocky vigil
#

Yeah

prime mica
#

because the compilers are very smort and they're able to keep every accumulator in a register, and use CISC to enjoy

#

but on threat inputs, it looks like

#
start:
vpmovsxbw ymm15, [rax]
vpaddw ymm0, ymm0, ymm15
vpmovsxbw ymm15, [rax + 16]
vpaddw ymm1, ymm1, ymm15
...
;; uh oh...
;; increment base pointer, loop...
jle start
#

you need one temp register for the i8 -> i16 conversion

#

so the compiler has to spill at least one of the accumulator registers

rocky vigil
#

Right that

prime mica
#

and then it is sad

#

because it has to store/load across every iteration

#

in theory we could use 16 for the main net and 12 for threats but I was lazy

rocky vigil
#

Does this not apply to avx512

prime mica
#

no bc we have registers set to 16 there as well

#

and avx512 has an extra set of 16 to play with

rocky vigil
#

Ah hah

violet badger
prime mica
#

NEON is fine as well
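
The register budget behind this, as a back-of-the-envelope sketch. The register counts are the architectural ones (16 ymm on AVX2, 32 zmm on AVX-512, 32 vector registers on AArch64 NEON). The `spills` helper is made up, the NEON accumulator count of 16 is an assumption, and a real register allocator is more subtle than this:

```python
# Architectural vector registers per ISA in 64-bit mode
VECTOR_REGISTERS = {"avx2": 16, "avx512": 32, "neon": 32}


def spills(isa: str, accumulators: int, temps: int) -> int:
    """Number of live vector values that do not fit in registers (0 = no spill)."""
    return max(0, accumulators + temps - VECTOR_REGISTERS[isa])


# master: loads fold into vpaddw's memory operand, so no temp register is needed
print("avx2, 16 acc, no temp:", spills("avx2", 16, 0))
# threat inputs: vpmovsxbw needs a destination register for the i8 -> i16 widening
print("avx2, 16 acc, 1 temp: ", spills("avx2", 16, 1))
# the 12-accumulator option mentioned above leaves room for the temp
print("avx2, 12 acc, 1 temp: ", spills("avx2", 12, 1))
print("avx512, 16 acc, 1 temp:", spills("avx512", 16, 1))
print("neon, 16 acc, 1 temp:  ", spills("neon", 16, 1))
```

Only the AVX2 case with 16 accumulators plus a temp goes over budget, which matches the spill seen in the second loop above.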

prime mica
#

definitely worth making it conditional on the feature set though

violet badger
#

well, the main net will be threats, but it would be a speedup for the threats branch if the effect on the smallnet is measurable.

prime mica
violet badger
#

something is really cooking on fishtest..

prime mica
#

idk in general, having written more x86 vector loops than I'd like to admit, I'm very scared to use all the registers unless you're writing in assembly

#

bc you are really relying on the compiler to be smort

#

ok I'm betting +2 against master

#

😊

prime mica
#

maybe try git revert a227ff and see if that helps?

#

I could definitely see that one being arch (and compiler) dependent

#

also are these nodes 70 cores, 140 cores, 280 cores or what

violet badger
#

4 x 72

prime mica
#

gotcha

violet badger
#

will try later...

prime mica
#

cool!

#

ostrich is really ~~flying~~ running away with it

dark stream
prime mica
#

yep, there's hope...

#

oh I see

#

yeah probably arch dependence tbh

#

one of the first workers was a NEON :P

dark stream
prime mica
#

ehhh

#

definitely needs more research

#

I don't think it'll ever be closed fully

dark stream
#

Btw, if this is the case, then will a specific binary be preferred to be sent to competitions like TCEC or CCC? I think that happens even now due to other reasons?

rocky vigil
#

There is no need to

prime mica
#

good question, these competitions have standardized hardware so it won't be the case

dark stream
#

Wth? I glanced away for like 10 mins, and the LLR shot up.

naive comet
amber fern
#

Will LTC be tested soon? 🙂

desert tree
#

presumably after the STC test, if it passes

violet badger
#

@prime mica the more accurate measurements now:

==== caee28c4e8fe2ea52d191ee56e27b1cb9cebabf3 ====
Average (over 10):  292442968
==== de7fbe6ba5c98b585aadb0a81ea7c3cb12fc4d54 ====
Average (over 10):  292813467
==== a227ff9916e98a33c198a1749f23f2783934be8a ====
Average (over 10):  290405859
==== 064d09f40f0ef0e9daf19609faaf79aa471ec362 ====
Average (over 10):  289685089
==== a6721c73eea0b11eaf9b574ed509265211be5738 ====
Average (over 10):  292753228
==== d8eb3207dbf7f264035ee52ddfa1db9c39471841 ====
Average (over 10):  289360047
==== 638e1786bde4e1f3eb17be5a32da3650666318d9 ====
Average (over 10):  290681425
==== 4c902c7c443b6b71d06895097ecd8d85aa3a2e99 ====
Average (over 10):  291312522
prime mica
#

danke

#

very interesting

#

so branchless added/removed was bad

#

064d09f.. being bad is no surprise, that's cassowary

violet badger
#

I think that's like the most clear one?

prime mica
#

yes

violet badger
#

should I try with a revert of just that one?

prime mica
#

oh that one was superseded

#

by the following few commits

violet badger
#

I see.

#

so any particular one to revert, or just fine as is?

prime mica
#

yeah no smoking gun unfortunately...

#

I think the general pattern though is that using LUTs is not as good on ARM

#

or at least the machine u are testing on

violet badger
#

yeah, might be that right now it is in the memory sweet spot

prime mica
#

the neoverse cores seem to have a very good amount of cache according to Wikipedia...

violet badger
#
  Package L#0
    NUMANode L#0 (P#0 117GB)
    L3 L#0 (114MB)
      L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
prime mica
#

delicious

#

ok lemme try reverting just the LUT and you can retest?

violet badger
#

sure.. maybe different branch to avoid confusion 😉

prime mica
#

yep!

violet badger
#

ptarmigan

prime mica
#

lol

#

sure why not

violet badger
#

good names are essential for passing sprt, I believe.

prime mica
#

you can try that one vs. the previous commit (which is the best lookup table–based make_index function I've cooked thus far)

#

i.e.
LUT: 6018b8cd092665c7cb7c0bb943a7f2fc48de72d9
no LUT: 9fb3700ef1c7e8578b676ffddf9c644eb9cf0b4d

violet badger
#

ok, started both shas.

prime mica
#

danke

#

out of curiosity have u ever gotten to see the nodes in person

#

or are they locked away in some massive basement

#

exciting

dark stream
prime mica
#

patience my friend

#

it'll probably settle down around 1.5 methinks

#

still a huge win. I have a couple more speedups in the pipeline + we will have training tweaks I'm guessing

#

also it should scale well with TC

dark stream
#

Yeah, yeah, I know. I'm just kind of impatient for a new net because honestly, search improvements have kind of tapered off as of recently, and maybe this will change that.

violet badger
rocky vigil
#

it would appear LUT is still effective

rocky vigil
violet badger
#

hopefully both 🙂

violet badger
foggy wind
# violet badger can we have another run of the script on https://tests.stockfishchess.org/tests/...
GROUPED BY ARCH

64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo:     0.28 ±    3.11 | LOS:  57.1% | LLR: -0.10 | [58, 1664, 3319, 1663, 64]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                | Elo:    -0.41 ±    3.43 | LOS:  40.7% | LLR: -0.32 | [54, 1325, 2762, 1306, 57]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                       | Elo:    -2.18 ±    4.37 | LOS:  16.4% | LLR: -0.57 | [27, 836, 1659, 796, 26]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT                            | Elo:     4.10 ±    4.82 | LOS:  95.2% | LLR:  0.62 | [22, 599, 1364, 668, 19]
64bit POPCNT NEON_DOTPROD                                     | Elo:    16.20 ±    8.23 | LOS: 100.0% | LLR:  0.92 | [5, 163, 467, 235, 10]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                  | Elo:    -5.97 ±   10.42 | LOS:  13.1% | LLR: -0.24 | [11, 159, 322, 137, 11]
rocky vigil
#

nice to see we're still winning on arm machines

violet badger
#

these are now mostly macs

#

the rest is still a bit unclear, all LLRs still close to 0

rocky vigil
#

I am pretty sure the rest total to negative LLR

#

(slightly)

#

the total LLR is relatively close to the sum of the group LLRs
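
A quick check of that with the per-arch LLRs from the table above. The numbers are copied from the table; treating LLRs as additive across groups is only the approximation described here, not how fishtest computes the overall value:

```python
# Per-arch LLRs copied from the "GROUPED BY ARCH" table above,
# keyed by the first feature flag of each row
group_llr = {
    "AVX512ICL": -0.10,
    "AVX512": -0.32,
    "BMI2": -0.57,
    "AVX2": 0.62,
    "NEON_DOTPROD": 0.92,
    "VNNI": -0.24,
}

x86_llr = sum(v for k, v in group_llr.items() if k != "NEON_DOTPROD")
total_llr = sum(group_llr.values())

print(f"x86 only: {x86_llr:+.2f}, all groups: {total_llr:+.2f}")
```

So the x86 groups do sum to a slightly negative LLR, and the NEON group is what pulls the sum positive.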

foggy wind
rocky vigil
#

I would've thought BMI2 would be good because it makes threat tracking slightly faster but I guess not

#

btw the first != assume is redundant right

violet badger
#

should be redundant.

rocky vigil
#

sad that the elo gains from net and speedups seem not additive

dark stream
#

Maybe selectively choose speedups once (if) they pass? Idk, how practical that is.

rocky vigil
#

spikiness is really something

#

standard tests don't exactly have W-L casually spiking up/down by 100 in the span of only 5k games

violet badger
#

that's just depending on which machines join the test I would say

dark stream
rocky vigil
foggy wind
#

But usually it doesn't make much sense to go into such detail because the error bars simply become too large.

rocky vigil
#

yep

violet badger
#

but right there you have it.

rocky vigil
#

fair

#

I suspect individual machine results on "normal" tests might also look like this

#

due to the variance

violet badger
#

yes, individual machines hard to converge, might be possible in local tests though.

#

locally, for example, the branch is still much slower for me:

Result of 100 runs
==================
base (./stockfish.master       ) =    1107456  +/- 3570
test (./stockfish.patch        ) =     984159  +/- 3285
diff                             =    -123297  +/- 4113

speedup        = -0.1113
P(speedup > 0) =  0.0000

CPU: 16 x AMD Ryzen 9 3950X 16-Core Processor
Hyperthreading: on 
rocky vigil
#

ah yeah an x86 machine

violet badger
#

but I suspect that slowdown is more than average, idk.

rocky vigil
#

10% slowdown is actually really good

#

it used to be -30%

#

and then -20%

violet badger
#

i see.

rocky vigil
#

the persistent speedup work has really paid off with patience

violet badger
#

well amazing process I think.

#

I will see if I can get some progress with net training, but not sure there is low hanging fruit. We'll see.

rocky vigil
#

right the ideal process maybe is different than master

#

i also suspect low hanging fruit is mostly gone though

#

i wonder if the original NNUE development project resulted in a similar feeling

#

of slowly reaching towards hce, then exceeding it, then exceeding it by a lot

green moat
violet badger
#

of course that would need testing.. so let's see. Maybe as part of future training.

rocky vigil
#

am hopeful that we get passing sprts soon enough, the real barrier seems to be LTC single thread

#

I suspect that cleanup work is quite nontrivial

#

but it still seems preemptive to start doing it now

violet badger
#

LTC single thread was -0.5Elo in the previous test... this ought to be positive now?

rocky vigil
#

Surely

#

1 elo maybe

foggy wind
# violet badger locally, for example, the branch is still much slower for me: ``` Result of 100 ...

single thread

Result of 200 runs
==================
base (...es/stockfish) =    2295252  +/- 4815
test (...stockfish.ti) =    2048251  +/- 5352
diff                   =    -247001  +/- 2504

speedup        = -0.1076
P(speedup > 0) =  0.0000

CPU: 16 x AMD Ryzen 9 9950X3D 16-Core Processor
Hyperthreading: on

32 threads speedtest

sf_base = 42662044 +/- 111193 (95%)
sf_test = 41059945 +/- 108936 (95%)
diff    = -1602099 +/- 112245 (95%)
speedup = -3.75533% +/- 0.263% (95%)

In multithreading, the difference becomes significantly smaller.

rocky vigil
#

Yep

#

SMP favors threat inputs

foggy wind
#

I thought that this only meant that larger nets are better with longer TC. But it's great that speed is also a factor.

rocky vigil
#

The suspected explanation is that smp searches similar positions and has better access patterns to the threat features

lapis parrot
#

amicic carrying BlessRNG

prime mica
#

pretty much every instruction saved off make_index is measurable locally lmao. It's kinda cooked

prime mica
#

so in a way threat inputs (hopefully) scales well in two ways ^_^

prime mica
#

causes like ten extra instructions, converting Square int8_t to int for example (because the ABI says that if you pass an int8_t, the upper 56 bits are undefined)

#

so I tried putting up a test to force inlining it but it doesn't compile for some reason

#

will try again in a bit

rocky vigil
#

I'm pretty sure the actual llr is negative on x86

violet badger
#

locally, my x86 is at -10Elo

prime mica
#

ugh yeah

#

LOL what

#

why is it soooo bad

violet badger
#
Results of master vs patch (10+0.1, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 10.98 +/- 5.65, nElo: 20.59 +/- 10.58
LOS: 99.99 %, DrawRatio: 48.67 %, PairsRatio: 1.27
Games: 4146, Wins: 1168, Losses: 1037, Draws: 1941, Points: 2138.5 (51.58 %)
Ptnml(0-2): [18, 451, 1009, 572, 23], WL/DD Ratio: 1.20

rocky vigil
#

hmm

prime mica
#

oh in a good way

violet badger
#

no

prime mica
#

wait what

violet badger
#

well, in favor of master
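
For anyone wanting to re-derive those numbers from the Ptnml line, a minimal sketch. `summarize` is a made-up helper. It assumes the usual logistic Elo formula and that pairs ratio means (WD + WW) / (LL + LD). The sign is from the point of view of the first-named engine, here master:

```python
import math


def logistic_elo(score: float) -> float:
    """Convert an expected score in (0, 1) to a logistic Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))


def summarize(ptnml):
    """ptnml = game-pair counts for pair scores 0, 0.5, 1, 1.5, 2."""
    ll, ld, dd_wl, wd, ww = ptnml
    pairs = sum(ptnml)
    points = 0.5 * ld + 1.0 * dd_wl + 1.5 * wd + 2.0 * ww
    score = points / (2 * pairs)          # two games per pair
    pairs_ratio = (wd + ww) / (ll + ld)   # winning pairs / losing pairs
    return logistic_elo(score), score, pairs_ratio


elo, score, pairs_ratio = summarize([18, 451, 1009, 572, 23])
print(f"Elo {elo:+.2f}, score {score:.2%}, pairs ratio {pairs_ratio:.2f}")
```

This should land close to the reported Elo, score percentage and PairsRatio.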

prime mica
#

oh I see

#

grotesque

rocky vigil
#

fishtest x86 seems to be -2 or so

violet badger
#

I think I have a particularly slow guy 😉

prime mica
#

hmmm so how do we interpret this, net and speedups not being additive? speedups themselves not being additive?

#

maybe worth sanity checking the new net by itself without the extra changes?

lapis parrot
#

raspberry pi moment

violet badger
#

don't think the net by itself is to doubt. Played quite a few games.

prime mica
#

👍

#

ok well I'll keep hunting

violet badger
#

yeah, best approach ...

#

if emu adds up, that's another fair bit.

rocky vigil
#

or at least that's what it feels like

violet badger
#

I think actually, there is an argument for that.

#

at least Naphthalin has suggested before that self-play tends to overestimate Elo differences. That would be the case in the speedup vs reference, while threats vs master is no longer (or less) self-play.

prime mica
#

that would make sense actually

#

what's this junk in the header of update_accumulator_incremental

#

oh I guess this is reading the feature index from the added/removed hm

#

ok let's try specializing update_accumulator_incremental for the most common (added.size(), removed.size())

#

at least finally snowy egret isn't insta-failing on fishtest

#

ok another reasonable interpretation is that the speedups aren't actually that big

#

most of them have drifted down quite a bit

#

on fishtest

lapis parrot
#

this is a general rule of thumb

#

you can't add up fishtest sprt elo to estimate a pt

#

probably it's the same with speedups

prime mica
#

right they're all overestimates

lapis parrot
#

this also applies to net training

rocky vigil
#

yeah the net was vondele local fitbit

lapis parrot
#

since nets are tested constantly almost every +8 +/-5 ends up being like at best 2

rocky vigil
#

3 +- 2

#

who knows how much of it translated to fishtest

lapis parrot
#

because you only submit ones that are showing good results, but if you test for like hundreds of them you are getting flukes
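
A toy simulation of that selection effect. Everything here is an assumption for illustration: every candidate has a true gain of exactly 0 Elo, measurement noise is Gaussian with sd 2.5 (roughly a +/- 5 interval), and only the best-looking of 10 candidates gets submitted:

```python
import random


def mean_measured_gain_of_winner(n_candidates: int = 10, noise_sd: float = 2.5,
                                 trials: int = 2000, seed: int = 1) -> float:
    """Average measured Elo of the best-looking candidate when all true gains are 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # each candidate's measurement is pure noise around a true gain of 0
        measured = [rng.gauss(0.0, noise_sd) for _ in range(n_candidates)]
        total += max(measured)  # only the best measurement gets submitted
    return total / trials


print(f"measured gain of the submitted candidate: {mean_measured_gain_of_winner():+.2f} Elo (true gain: 0)")
```

Even with only 10 candidates, the one that looks best is clearly positive on average while being worth nothing, so a drop on retest is what you'd expect.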

rocky vigil
#

eh it was only like 5-10 that we tested

lapis parrot
#

well, there is NCM

#

which tests every sf commit

#

and one commit that was like +11 +/- 5 elo

#

was a comment change

#

which is pretty obviously non-functional kek

prime mica
#

lolololol

#

comment optimization meta

lapis parrot
#

I remember a HCE patch

#

passed fishtest

#

with double SPRT

#

only for me to notice that it can't even theoretically do anything outside FRC

prime mica
#

lmaoo

lapis parrot
#

and fishtest didn't run FRC book back then

#

so this can happen, even back in sf 10 times average elo / passer was like 0,5 elo

#

so definitely most of what you see is a big overshoot / lucky run, this is pretty normal and not smth you can really fix in general

#

one problem is that back then we didn't really know anything about scaling so we were like "+6 elo STC, 1,5 LTC, fine"

#

nowadays it's always a suspect of being a bad scaler since literally half of the search scales in a weird way duh

lapis parrot
#

STC passed

prime mica
#

@foggy wind maybe u could check the x86 distribution?

rocky vigil
#

ok so it went from 0 llr to pass in the span of 20000 games

#

👍

#

machine luck???

prime mica
#

tfw

warm thistle
#

👀

lapis parrot
#

seems like amicic did 2 runs and both were done by 0 llr

#

can be just luck

#

I recall some of my SPRT sitting at 2,5 LLR LTC for like 60k games

#

so from 30 to 90k

#

and at 120k it failed with -2,95

#

this stuff just happens sometimes

prime mica
#

@rocky vigil if you want you can merge emu as I'm pretty confident it's good

#

but up to you

lapis parrot
foggy wind
# prime mica @foggy wind maybe u could check the x86 distribution?

https://tests.stockfishchess.org/tests/view/69105b3dec1d00d2c195c569

GROUPED BY ARCH

64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT | Elo:     1.66 ±    2.59 | LOS:  89.6% | LLR:  0.69 | [76, 2361, 4780, 2416, 95]
64bit AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                | Elo:     0.35 ±    2.64 | LOS:  60.3% | LLR: -0.10 | [98, 2227, 4697, 2234, 104]
64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                       | Elo:     0.93 ±    3.24 | LOS:  71.2% | LLR:  0.15 | [52, 1499, 3067, 1512, 62]
64bit AVX2 SSE41 SSSE3 SSE2 POPCNT                            | Elo:     3.01 ±    3.94 | LOS:  93.3% | LLR:  0.65 | [28, 918, 2024, 985, 29]
64bit POPCNT NEON_DOTPROD                                     | Elo:    20.68 ±    6.52 | LOS: 100.0% | LLR:  1.89 | [7, 276, 744, 425, 20]
64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT                  | Elo:    -4.27 ±    8.34 | LOS:  15.8% | LLR: -0.28 | [15, 241, 485, 223, 12]
prime mica
#

hey we're positive on all of them but one

#

sss but still

#

can you try just grouping x86 into one pot and NEON into another
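
A rough way to do that by hand from the table above: add up the pentanomial counts of the x86 rows and convert the pooled score to Elo. `pool` and `elo` are made-up helpers, and this uses the plain logistic Elo formula, so it may differ slightly from whatever the grouping script reports:

```python
import math

# Pentanomial counts copied from the "GROUPED BY ARCH" table above,
# keyed by the first feature flag of each row
groups = {
    "AVX512ICL": [76, 2361, 4780, 2416, 95],
    "AVX512": [98, 2227, 4697, 2234, 104],
    "BMI2": [52, 1499, 3067, 1512, 62],
    "AVX2": [28, 918, 2024, 985, 29],
    "NEON_DOTPROD": [7, 276, 744, 425, 20],
    "VNNI": [15, 241, 485, 223, 12],
}


def pool(rows):
    """Element-wise sum of several pentanomial count lists."""
    return [sum(col) for col in zip(*rows)]


def elo(ptnml):
    """Logistic Elo from game-pair counts for pair scores 0, 0.5, 1, 1.5, 2."""
    points = sum(count * pair_score for count, pair_score in zip(ptnml, (0, 0.5, 1, 1.5, 2)))
    score = points / (2 * sum(ptnml))
    return 400.0 * math.log10(score / (1.0 - score))


x86 = pool([v for k, v in groups.items() if k != "NEON_DOTPROD"])
neon = groups["NEON_DOTPROD"]

print(f"x86 : {x86} -> Elo {elo(x86):+.2f}")
print(f"NEON: {neon} -> Elo {elo(neon):+.2f}")
```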