#UE Threat Inputs for AB

rocky vigil
#

idk how much a net spsa is worth

frosty imp
#

spsa should be the last resort imo

#

it's absolutely detrimental to net training development

rocky vigil
#

True

naive comet
#

Stc -15 is not bad

rocky vigil
#

If no spsa we are essentially relying on scaling

frosty imp
#

hopefully there's good scaling

twilit oriole
#

was measured

rocky vigil
#

Issue is master training run pre-spsa also is like -5 elo

twilit oriole
#

well. just test vs that and if it beats it at LTC then spsa

#

its not good enough rn anyways. needs work

frosty imp
#

well we need factorizer anyways

twilit oriole
#

fix the factoriser, scale the L1 on the training side. that 30 elo is too low

rocky vigil
#

Yeah let's see how 1280 does w/o factorizer

#

Maybe it was actually worth a ton of elo

frosty imp
#

why not add the factorizer

rocky vigil
#

And I am a fool for not figuring out how to add it

frosty imp
#

what you do is to add an extra 768 inputs at the end of the loaded indices. those are psq features that are active regardless of king position
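
The idea above can be sketched in a few lines. This is a toy, assuming a HalfKA-style layout where a real feature index is `king_bucket * 768 + psq` (768 = 12 piece kinds × 64 squares). `NUM_KING_BUCKETS` and `with_factor_indices` are hypothetical names for illustration, not nnue-pytorch API.

```python
# Toy factorizer indexing: every king-dependent feature also activates one
# king-independent "virtual" feature appended after the real feature block.
# NUM_KING_BUCKETS and the index layout are illustrative assumptions.
NUM_PSQ = 768            # 12 piece kinds * 64 squares
NUM_KING_BUCKETS = 32    # assumed bucket count, for illustration only
NUM_REAL = NUM_KING_BUCKETS * NUM_PSQ


def with_factor_indices(real_indices):
    """Return the real indices followed by their virtual psq indices."""
    virtual = [NUM_REAL + (idx % NUM_PSQ) for idx in real_indices]
    return list(real_indices) + virtual


active = [5 * NUM_PSQ + 100, 5 * NUM_PSQ + 431]
print(with_factor_indices(active))
```

The virtual index depends only on the piece and square, so it receives gradient from every king bucket during training.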

rocky vigil
#

Btw viren regarding like training longer

#

Since stages 4/5 don't appear to help much

#

Do we just try like 1200SB for stages 1, 2, 3?

twilit oriole
#

ig

#

factoriser has benefits even with unlimited data and should be as simple as adding extra features?

rocky vigil
#

Yeah

twilit oriole
#

only thing is to make sure it doesnt clip when it gets combined in main net weights

frosty imp
rocky vigil
#

I thought nnue-pytorch handled the coalescing

frosty imp
#

well yes but you need to define the mapping

twilit oriole
#

yeah idk how it does things. but in bullet i would need to halve the scale
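
A rough sketch of the clipping concern, under simplifying assumptions: one scalar weight per feature (real nets hold a vector per feature), int16 storage, and made-up values. `coalesce` is a hypothetical helper, not the nnue-pytorch or bullet implementation.

```python
# Fold virtual (factorizer) weights into the real weights and check that the
# quantized sum still fits in int16. Sizes and the scale are assumptions.
NUM_PSQ = 768
I16_MIN, I16_MAX = -32768, 32767


def coalesce(real_w, virtual_w, scale):
    """real_w: one float per real feature; virtual_w: NUM_PSQ floats."""
    merged, clipped = [], 0
    for idx, w in enumerate(real_w):
        q = round((w + virtual_w[idx % NUM_PSQ]) * scale)
        if not I16_MIN <= q <= I16_MAX:
            clipped += 1
            q = max(I16_MIN, min(I16_MAX, q))
        merged.append(q)
    return merged, clipped


real = [70.0] * (2 * NUM_PSQ)
virtual = [70.0] * NUM_PSQ
print(coalesce(real, virtual, 255.0)[1])    # number of weights that clipped
print(coalesce(real, virtual, 127.5)[1])    # same weights at half the scale
```

Because the merged weight is a sum of two trained weights, its range can be up to twice as wide, which is one way to read the remark about halving the scale.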

frosty imp
rocky vigil
#

Oh ok

frosty imp
#

could use a better system which would also make training faster

rocky vigil
#

ouch

#

maybe i was too optimistic in how much things were worth

#

or like i thought we could get 80% speed of master

#

when it turns out it's closer to 70%

#

that 10% difference is indeed 20 elo

prime mica
#

☹️

rocky vigil
#

@frosty imp this look good?

#

on the data loader side

frosty imp
rocky vigil
#

yeah

#

lmao

#

decided that indexing the features probably costs 0 speed

#

in comparison to actually training

#

bruh you were right about not being able to add multiple features

frosty imp
#

Lgtm

rocky vigil
#

is there anything else

#

to change

stray reef
#

I'm a bit lost on where the test vs master is, and what has been tested since then

if it's -15 plus 5 elo from smallnet plus some other small speedups plus training improvements that's great honestly

rocky vigil
#

-20 + 5 elo from smallnet + some other small speedups + training

#

actually on vondele's machines it's like -25 but we don't talk about that

stray reef
#

big error bars

#

SF needs verbatim nets

#

I think it's essential for this

rocky vigil
#

this seems wrong

#

when used for threats

#

ah nvm

#

i read the def

#

it seems correct

#

extremely overengineered though

#

pushed to branch

#

@frosty imp can you give it a try (the factorized features)

stray reef
#

could have been better, but no regression at least, still a partial success

violet badger
#

so stage 5 finished as well, maybe another 1 Elo or so..

stray reef
#

bring back the 13 stage training for threat inputs Kappa

violet badger
#

That's 8 more Elo it seems.. 😉

#

currently training are l1=128 and l1=1280. The former can be used if there would be a need for a threats smallnet (idk), and the latter might give a hint on larger sizes, even though it is a small increment.

#

now, at scale this might look already interesting..

--------------------------------------------------
Results of master vs patch (60+0.6, 288t, 16000MB, UHO_Lichess_4852_v1.epd):
Elo: -5.79 +/- 18.13, nElo: -16.23 +/- 50.76
LOS: 26.54 %, DrawRatio: 74.44 %, PairsRatio: 0.77
Games: 180, Wins: 45, Losses: 48, Draws: 87, Points: 88.5 (49.17 %)
Ptnml(0-2): [0, 13, 67, 10, 0], WL/DD Ratio: 1.09
--------------------------------------------------
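
For readers unfamiliar with these result lines, here is a sketch of how the Elo, error bar and nElo figures can be reproduced from the pentanomial counts, assuming the standard logistic Elo model and a 95% normal interval. The function names are made up for this note; the testing tool's own code may differ in detail.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference for an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def pentanomial_stats(ptnml, z=1.959964):
    """ptnml: counts of game pairs scoring 0, 0.5, 1, 1.5, 2 points."""
    pairs = sum(ptnml)
    values = [0.0, 0.25, 0.5, 0.75, 1.0]   # pair score normalised to [0, 1]
    mean = sum(n * v for n, v in zip(ptnml, values)) / pairs
    var = sum(n * (v - mean) ** 2 for n, v in zip(ptnml, values)) / pairs
    stderr = math.sqrt(var / pairs)
    elo = elo_from_score(mean)
    margin = (elo_from_score(mean + z * stderr)
              - elo_from_score(mean - z * stderr)) / 2.0
    # Normalised Elo: score offset in units of per-game standard deviation.
    nelo = (mean - 0.5) / math.sqrt(2.0 * var) * (800.0 / math.log(10))
    return elo, margin, nelo


elo, margin, nelo = pentanomial_stats([0, 13, 67, 10, 0])
print(f"Elo: {elo:.2f} +/- {margin:.2f}, nElo: {nelo:.2f}")
```

With only 90 game pairs the interval is wide, which fits the "just a teaser" comment that follows.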
rocky vigil
#

hmmm

violet badger
#

just a teaser I think ...

prime mica
#

what kind of system has so many threads wtf

violet badger
#

fitbit

prime mica
#

right...

violet badger
#

that's at c2396284 ... so shawn's nn-598188c9a702.nnue branch

rocky vigil
#

i mean the stage 5 is only like 2 elo (maybe)

violet badger
#

plus small net another 5..

rocky vigil
#

ah but does it scale

#

i guess it increases nps by a few %

frosty imp
naive comet
#

maybe after this one we can reprofile

frosty imp
#

maybe put it on fishtest?

#

incremental threat was 0 on my speedtest and +10 sprt

rocky vigil
naive comet
#

ok

frosty imp
naive comet
#

I submitted from phone xd

rocky vigil
#

xd

#

i can also give it a test in my

#

or hangon

#

i have -3% but uh yeah

#

big noise

#

ig

#

let's just see how fishtest works

naive comet
#

it's over 😔

rocky vigil
#

💀

naive comet
#

guys i'm not getting something

#

when we add or remove a piece from a square, we have to recompute the attacks through / blocked by it

#

the idea here is, on captures, we remove from a square then add back to that same square

#

so we prevent any recomputation and also reduce the number of extra updates to do

#

but,

#

this does more updates than the current best TI branch

#

???

plain flower
naive comet
#

yeah exactly so taking into account piece mutations makes us have more updates somehow xd

plain flower
#

so i have add-piece, remove-piece, mutate-piece

#

not sure, that doesn't make much sense

regal steeple
# naive comet yeah exactly so taking into account piece mutations makes us have more updates s...

I think the idea works, I had that idea as well and only noticed you already had that idea and implemented it before me too late, https://github.com/rn5f107s2/Stockfish/compare/40e85bebee329ac27018bc0ca80e247df80235dd...rn5f107s2:Stockfish:7c47493cd258aa3ff18d092a2ec01e5418eb0cc2 this is my branch, I get an average of
6.74802 updates for your branch
6.11372 for threat_inputs and
5.88847 for mine,
so Im pretty sure it works in theory

naive comet
#

hmm so the issue is I implemented it poorly

#

I'd suggest you continue your test

regal steeple
naive comet
#

that is actually hilarious

#

actually you could use attackers_to() function for that cant you

#

oh but you reduce computation

#

nvm ignore me

#

OHHHH wait I think I know why yours is better

#

you remove before swap

#

I swap before remove
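
Why the order could matter can be shown with a toy model. This is not Stockfish code: the board is a single rank, every piece moves like a rook, and a threat feature is the tuple `(attacker, attacker_sq, victim, victim_sq)`. All names here are invented for the illustration, and the counts do not correspond to the averages quoted above.

```python
# Toy 1-D model: every piece is a rook on a single rank and "threatens" the
# nearest occupied square on each side. A threat feature is the tuple
# (attacker, attacker_sq, victim, victim_sq). Not Stockfish code.

def threats(board):
    """board: dict square -> piece letter. Returns the set of threat features."""
    feats = set()
    squares = sorted(board)
    for i, sq in enumerate(squares):
        for j in (i - 1, i + 1):
            if 0 <= j < len(squares):
                feats.add((board[sq], sq, board[squares[j]], squares[j]))
    return feats


def count_updates(states):
    """Total features added or removed while stepping through board states."""
    return sum(len(threats(a) ^ threats(b)) for a, b in zip(states, states[1:]))


start = {0: "R", 3: "r", 6: "R"}   # the rook on 0 captures the rook on 3
after = {3: "R", 6: "R"}

remove_add = [start, {0: "R", 6: "R"}, {6: "R"}, after]   # remove, remove, add
swap_first = [start, {0: "R", 3: "R", 6: "R"}, after]     # mutate, then remove
remove_first = [start, {3: "r", 6: "R"}, after]           # remove, then mutate

for name, seq in [("remove+add", remove_add), ("swap first", swap_first),
                  ("remove first", remove_first)]:
    print(name, count_updates(seq))
```

In this toy, mutating the captured piece while the capturing piece is still on its origin square rewrites the threat between the two, only for it to be removed one step later. Removing the mover first avoids that round trip.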

regal steeple
#

Do you want to run yours then again? Im fine with stopping mine

naive comet
#

I think you should keep yours

naive comet
#

this is on standard bench

#

yours: 6.51507

#

I mean mine is microscopically less cuz I implemented it for castling too

#

next step is promotions I guess

#

I tried that but for some reason it changes my bench

#

I'll look into it tmr

violet badger
#

Not sure it is useful, but better have it available.

frosty imp
#

could run a simplification against the smallnet branch

violet badger
#

yeah.

#

did we already run a stage 5 net test, with all improvements so far merged?

#

(e.g. fixed nodes test against master)

rocky vigil
#

Lemme pr smallnet to Shawn's branch

#

where's the copy button

violet badger
rocky vigil
rocky vigil
frosty imp
#

rn5 speedup not merged yet

violet badger
#

ah, you mean on fishtest.

#

sure. We have fairly good estimates of the stages nevertheless.

#

let me paste them

frosty imp
#

oh cool

violet badger
#

5: Elo: -27.08 +/- 1.82, nElo: -50.97 +/- 3.40

#

4: Elo: -25.29 +/- 1.82, nElo: -47.45 +/- 3.40

#

3: Elo: -30.98 +/- 1.82, nElo: -58.16 +/- 3.40

#

2: Elo: -33.88 +/- 1.84, nElo: -63.06 +/- 3.40

#

1: Elo: -74.37 +/- 1.83, nElo: -142.55 +/- 3.40

#

so that would suggest 4 is right now the strongest.. but within error of that test

frosty imp
#

I see

rocky vigil
#

Bruh

#

Knew smth like this would happen

#

Yeah idt that test includes smallnet either

#

Combined the two are probably worth 10 elo or so

rocky vigil
frosty imp
#

not sure

rocky vigil
#

I'll be back in ~20 min or so

frosty imp
#

will run a threatnet test after rn5 passes

rocky vigil
#

Yeah fair

#

With stage 4?

#

Or 3

frosty imp
#

I mean smallnet

rocky vigil
#

Oh I thought you meant stc estimation vs master

frosty imp
#

But progress check would be good as well

rocky vigil
#

Which is probably close to -10 at fishtest rn

frosty imp
#

Yeah

rocky vigil
#

Getting closer to -5

#

Which is what we can get w/o spsa

#

Just need factorizer to work…

rocky vigil
#

do you know which code the error comes from?

frosty imp
#

Not sure

#

Happens during sanity checking

rocky vigil
#

where is sanity checking?

#

like what does it check

#

ah it happens at the beginning of training

#

sanity checking dataloader

#

oh shoot @frosty imp i might be stupid

#

forgot to expose the factorized features

#

ok that should be fixed

#

didn't find any other errors during code staring

frosty imp
#

also needs to expose on line 1269

rocky vigil
#

Bruh

#

I just left my laptop

#

Oop

rocky vigil
#

I won't be back for a few hours; misread time of event

frosty imp
#

i'll make a PR. works now as far as I can test

#

seems that my gpu doesn't have enough memory to run the training

candid ivy
#

not sure if the preloading goes to the gpu or cpu memory, you can try lowering that

violet badger
#

or for testing just reduce l1 ...

candid ivy
#

or that is probably the easiest 😄

frosty imp
#

ah right

#

smallnet trains fine

violet badger
#

for me l1=1024 seems to need about 6GB GPU mem

rocky vigil
#

gg

#

Should stack with smallnet

#

ig wait for @regal steeple pr

frosty imp
#

ah I merged it now

rocky vigil
#

That also works

prime mica
rocky vigil
#

So I guess for the new training run

#

Can we try 1200 SB for stages 1, 2, 3?

#

Also

#

Or will it take too long

prime mica
#

thx!

lofty cedar
#

Do we start LTC now?

#

Is it finished?

frosty imp
#

well we could, but I believe we still have factorizers left to try

rocky vigil
#

yeah

#

factorizers + longer stage 1, 2, 3

#

since the threat inputs are sparser

#

and no good factorization scheme

lofty cedar
#

Anyway... threat input could benefit a lot from fast incremental threat calculation.

#

Which was implemented in the clockwork HCE engine.

#

If someone wanna take a look.

rocky vigil
frosty imp
#

adopting clockwork's scheme is way too much change

rocky vigil
#

though of course more speedups are welcome

lofty cedar
#

Yeah...

frosty imp
#

not practical short to medium term

twilit oriole
rocky vigil
#

like if we get another rn5 level speedup we just win

frosty imp
#

stage 5 net untested

#

so not in there

rocky vigil
#

vondele says stage 5 is basically neutral (if not slightly worse) with stage 4

#

so what's next is factorizer

#
  • maybe 50% longer stages 1, 2, 3
#

idk

twilit oriole
#

It's a fixed nodes test right? not reliable exactly

rocky vigil
twilit oriole
#

The stage 5 net was tested STC?

frosty imp
#

maybe better scaling from higher wdl

frosty imp
rocky vigil
#

vondele did 40k games stc locally

frosty imp
#

it's in the ci pipeline that the test was made

twilit oriole
#

I see

frosty imp
#

#1336647760388034610 message

rocky vigil
#

so move from 800 SB to 1200

frosty imp
#

is there a factorizer net in the training pipeline?

rocky vigil
#

idt vondele has started one yet

#

lemme check

twilit oriole
#

Oh u got factoriser working?

frosty imp
#

it's working now

rocky vigil
#

yeah

rocky vigil
#

i mean i copy pasted this from halfka code

#

so it probably just works

frosty imp
#

no but let me check right now

rocky vigil
#

yeah before we start real run

#

nice to do

frosty imp
rocky vigil
#

eh

#

did u load correct feature set?

frosty imp
#

oops

rocky vigil
#

well i can confirm popcount for sample 0 matches

#

probably some more effort to verify the psq

frosty imp
#

cool

#

well that's probably more training time

twilit oriole
frosty imp
#

@stray reef

rocky vigil
#

oh yea

#

should be good as well

frosty imp
rocky vigil
#

💀

rocky vigil
# frosty imp

gimme a bit, i have set up a framework in desmos that should let me verify each position by hand relatively quickly

lofty cedar
#

I guess let's try LTC then. Perhaps a progression test capstone.

rocky vigil
#

sample 0 correct

#

sample 1 correct

#

sample 2 correct

#

i think this should be pretty good yeah

rocky vigil
#

for the coalescer

frosty imp
#

if only the serializer is deterministic 💀

rocky vigil
#

like i'm pretty sure get_feature_factors works

#

is there any way for you to isolate it individually and call it

#

like as a standalone python function

#

it shouldn't be that hard?

#

to make

frosty imp
#

well you have the get_coalesced_ft

rocky vigil
#

no i just wanna be sure get_feature_factors works

frosty imp
#

which merges virtual weights

rocky vigil
#

i think

#

i'm gonna assume the actual coalescer works

rocky vigil
frosty imp
#

i don't think there is

#

yeah

rocky vigil
#

replace the constants

frosty imp
#

well you already have a featureset object

#

in the model

#

so just call feature.get_feature_factors

rocky vigil
#

hangon lemme just quickly test it in an isolated

#

python file

#

ok yeah it works on sample 0

#

i think that's good lol

frosty imp
#

@violet badger any chance to get a factorized run in the near future?

rocky vigil
#

so am pretty confident it works

rocky vigil
#

i guess the question reduces to: is factorizer worth 5 elo?

#

i would hope so...

frosty imp
#

it's worth a lot in #engines-dev engines

#

not sure about stockfish

rocky vigil
#

viren claims factorizer still helps even with unlimited data

#

so we'll see

twilit oriole
#

U have to remember the threats act as a pseudo factoriser it's not so simple

frosty imp
#

true

rocky vigil
#

ah right

#

ngl i was definitely too optimistic about the gains

#

the actual results are a lot more close

#

speaking of scaling we should have 1280 in a couple days

#

or like, stage 3 of 1280 in like half a day

sharp sail
rocky vigil
#

@stray reef do you think using your lookup scheme to index threats would be measurably faster?

naive comet
#

I have a smol idea I will try later

naive comet
#

also rn5 that speedup is :xdd:

violet badger
frosty imp
naive comet
#

maybe we can try horizontal mirroring for the threat inputs? not sure how worth it would be

frosty imp
#

I believe we already do that

naive comet
#

oh oopz

naive comet
#

guys latest bench is wrong

frosty imp
#

Oops

#

Pushed empty bench commit

amber fern
#

You guys anywhere close to making Threat Inputs stronger than main yet? How many Elo off are we? 🙂

naive comet
#

15ish I think?

stray reef
#

worth trying

naive comet
#

zack

naive comet
#

I see, nice

stray reef
#

idk why i never thought about just using magics for that one part instead of keeping all threats, like in rn5s patch

naive comet
#

I mean incremental is just intuitively faster lol

amber fern
#

So what does UE Threat Inputs actually mean, and how is it better despite, to my understanding, being slower? Sorry I'm not a dev lol

stray reef
#

The network explicitly receives all piece interactions (threats on enemies, defenses on own pieces) as input, so it has a lot more important information directly available

#

It is a lot stronger given equal nodes for this reason, but since keeping the threat information up to date is relatively expensive, it's slower overall
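
To make "all piece interactions" concrete, here is a small sketch that enumerates such features for knights only. The tuple layout `(attacker, from, target, to)` and the function name are assumptions for illustration, and the real feature set covers every piece type and uses a packed index.

```python
# Toy enumeration of "threat" inputs for knights only: one feature per
# (attacker, from-square, target piece, to-square) pair, covering both attacks
# on enemy pieces and defenses of friendly ones. The tuple layout is assumed.
KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def knight_threats(board):
    """board: dict (file, rank) -> piece letter, uppercase = white."""
    feats = []
    for (f, r), piece in sorted(board.items()):
        if piece.upper() != "N":
            continue
        for df, dr in KNIGHT_JUMPS:
            target = (f + df, r + dr)
            if target in board:
                feats.append((piece, (f, r), board[target], target))
    return feats


# White knight on f3 (5, 2) attacks a black pawn on e5 and defends a pawn on d2.
position = {(5, 2): "N", (4, 4): "p", (3, 1): "P", (0, 0): "R"}
for feat in knight_threats(position):
    print(feat)
```

Each such pair is an extra input on top of the usual piece-square features. Keeping that list current after every move is where the extra cost comes from.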

prime mica
#

what's the distribution of the threat weights?

#

I'm wondering whether a light quantization on the threat weights would be less detrimental than quantizing the main NN

lofty cedar
#

Well, quantization saves memory, but adds cost in interpretation.

#

So, not sure if it helps.

prime mica
#

I'm pretty sure add/sub is grotesquely memory bandwidth bound

#

but idk I'll try it out at some point

lofty cedar
#

Like... even if you imagined that you could losslessly compress the data... should you?

prime mica
#

we'll find out

amber fern
#

I tried running sf17.1 with this new nnue file, but I got this error in arena:

lofty cedar
prime mica
#

hm

lofty cedar
#

So, instead of storing each weight, we store cluster weight and the cluster membership of each one.
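
A minimal sketch of the cluster idea, assuming uniform binning as a stand-in for a real clustering step. `build_codebook` and `decode` are hypothetical names and this is not the packing code being discussed.

```python
# Toy codebook ("cluster") storage: keep a short table of representative
# values plus one small index per weight. Uniform binning stands in for a
# real clustering step here; that choice is an assumption.

def build_codebook(weights, num_clusters):
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / num_clusters or 1.0
    indices = [min(int((w - lo) / step), num_clusters - 1) for w in weights]
    sums, counts = [0.0] * num_clusters, [0] * num_clusters
    for w, i in zip(weights, indices):
        sums[i] += w
        counts[i] += 1
    # Each cluster is represented by the mean of its members.
    codebook = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return codebook, indices


def decode(codebook, indices):
    return [codebook[i] for i in indices]


codebook, indices = build_codebook([-4, -3, 0, 1, 3, 4], 2)
print(codebook, indices, decode(codebook, indices))
```

Decoding costs one table lookup per weight, which is the overhead raised in the reply that follows.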

prime mica
#

the lookup table would be fairly expensive tho

lofty cedar
#

But since my re-quantization code was botched, it got like -1000 elo at fixed node.

prime mica
#

lol

#

beautiful

stray reef
lofty cedar
lofty cedar
#

Though my quantization code is botched currently so I can't try.

amber fern
naive comet
lofty cedar
#

Oh, I see what was wrong.

#

The packing code contained the idea that didn't work.

amber fern
#

Did I just get rate limited for checking out the website too much? LOL

candid ivy
#

happens when there's a test with a big diff

lofty cedar
#

Considering threat input adds significant complexity, what kind of elo do we need to merge?

#

5 elo maybe?

amber fern
#

Yay, I successfully built SF using the threats branch as suggested, but I was wondering why the default nnue file was different from the current 'best' that y'all told me to download: nn-bf4519f857f4.nnue, you guys got me to download: nn-598188c9a702

naive comet
#

@frosty imp @rocky vigil if this one is real it's gg

stray reef
#

nice idea

#

tracking threats_by_square is essentially free in your impl

lofty cedar
#

Does Stockfish TI training data include positions with check?

stray reef
#

afaik vondele just uses the (almost)master net training pipeline, so no

candid ivy
stray reef
#

https://furybench.com/test/3386/ rn5's speedup in plenty (needed to change back some stuff that already relied on threat information)
not 2+0.02 this time but 8+0.08

lofty cedar
candid ivy
lofty cedar
#

Doesn't look good on my machine... at least currently... at VVLTC.

candid ivy
#

bruh just wait till we run this on fishtest

formal smelt
lofty cedar
#

It's current. Maybe things will change.

stray reef
#

no way, it will end at 247.9 elo

naive comet
lofty cedar
#

Having some 100-games sanity test... not saying that they would be significant in the grand scheme of things.

candid ivy
#

but rather useless

lofty cedar
#

Want to see it playing, except I don't even know how well it plays because Stockfish plays are inscrutable to mere mortals like me.

twilit oriole
twilit oriole
lofty cedar
#

Oh... I see.

naive comet
#

yeah I recall linrock trying that

twilit oriole
#

The way we did it is just use i8 FT weights but keep the calculation in i16. This just halves the mem bandwidth used to fetch FT weights and works very well

#

Was only -5 fixed nodes for us

stray reef
#

how much faster?

twilit oriole
#

I guess it helps even if the weights are in L3 cache

#

The value L1 there is 3072 so it's pretty good indication it will work at least for SF

#

Rounding the weights when quantising them is crucial
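
A small sketch of the scheme described above: weights stored as int8, sums kept at int16 width. The scale of 64 and the helper names are arbitrary choices for illustration, not taken from any engine.

```python
# Toy feature-transformer column: weights stored as int8, accumulator kept at
# int16 width. The scale of 64 is an arbitrary illustrative choice.
I8_MIN, I8_MAX = -128, 127


def quantize_i8(weights, scale, use_rounding=True):
    out = []
    for w in weights:
        x = w * scale
        q = round(x) if use_rounding else int(x)   # int() truncates toward 0
        out.append(max(I8_MIN, min(I8_MAX, q)))
    return out


def accumulate_i16(bias, active_weights):
    """Sum int8 weights into an accumulator that must stay within int16."""
    acc = bias
    for w in active_weights:
        acc += w            # the int8 value is widened before the add
        assert -32768 <= acc <= 32767, "accumulator overflowed int16"
    return acc


weights = [0.37, -0.37, 0.3, 2.5]
rounded = quantize_i8(weights, 64)
truncated = quantize_i8(weights, 64, use_rounding=False)
print(rounded, truncated, accumulate_i16(100, rounded))
```

Truncation pulls every weight toward zero, so its errors all point the same way and add up across active features. That is one plausible reason rounding matters here, though the chat does not spell it out.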

stray reef
#

okay cool

naive comet
#

honestly memory bandwidth is not an issue with mmap is it

formal smelt
#

well the linked monty test is with mmap

#

halving the number of bytes loaded is beneficial regardless

stray reef
rocky vigil
#

At least a vast majority should work

#

Oh viren already said so

#

If we can get it to work and add sub is mem bandwidth bound as you said we should be able to shave off a significant amount of the 25%(?) runtime that it currently uses

#

actually question

#

i8 quantizing threat weights means you should not do the x2 scaling

#

on loading the weights

#

@prime mica 81765782 weights losslessly quantizable to i8 out of 81772544

#

aka close to 99.99% of them
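
The count quoted above boils down to a range check. A sketch, assuming the weights are already integers at their stored scale; the helper names are made up for this note.

```python
def percent(fits, total):
    """Share of weights that fit, as a percentage."""
    return 100.0 * fits / total


def i8_lossless_stats(weights):
    """Count integer weights that already fit in int8 without clamping."""
    fits = sum(1 for w in weights if -128 <= w <= 127)
    return fits, len(weights), percent(fits, len(weights))


print(i8_lossless_stats([3, -128, 127, 128, -4000, 12]))
print(f"{percent(81765782, 81772544):.4f}%")
```

With the numbers from the chat, 6762 weights fall outside the int8 range.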

twilit oriole
#

Yeah but what's getting clipped are the most important weights :p might need training scale change

rocky vigil
#

true

#

but training scale change = re-spsa search

#

which is annoying

twilit oriole
#

Not really, you can always scalar-multiply the eval

rocky vigil
#

that works?

rocky vigil
#

i guess the solution is to do (psq + 2 * threat)

twilit oriole
#

We just i8 quantised the entire FT

rocky vigil
#

ah interesting

#

since our accumulators are separate we could keep i16 for psq

#

if needed

rocky vigil
#

if i ran psq features probably would be nowhere near as good

twilit oriole
#

There may be a problem because the multilayer is quantised so aggressively it makes it harder to quantise the FT. Because we have the inverse, can't do fast multilayer after quantising the FT to i8

rocky vigil
#

i mean it is definitely worth a try to quantize threat only

#

how much of a bottleneck of memory bandwidth are we talking about?

#

in comparison to compute

twilit oriole
#

Don't have comparable numbers because we don't have UE

rocky vigil
#

would it be worth it to try and decompress 8 bit format live

twilit oriole
#

No

rocky vigil
#

ok

#

actually lemme check with the other net

#

this is bf4

#

it shouldn't make a difference

twilit oriole
#

Well I would just test the inference speedup at least. I expect significant but single digit %

rocky vigil
twilit oriole
#

Yes but multilayer far less quantised

rocky vigil
#

i guess we'll see later

#

btw is there a way to disable the net sha check on comp

#

it takes like 5 seconds

#

which adds up :p

regal steeple
rocky vigil
#

81759318 weights losslessly quantizable to i8 out of 81772544 nn-598

#

viren could (approve)

#

i think

#

if he's here

rocky vigil
#

actually leb makes this annoying

#

gimme a second to write it in (little) endian
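
Assuming "leb" here refers to signed LEB128, the variable-length integer encoding, the sketch below shows why it is awkward for a quick experiment: values take one to three bytes each, so a weight's position cannot be computed without decoding everything before it. The function names are illustrative.

```python
def encode_sleb128(value):
    """Encode one signed integer as LEB128 bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7                      # arithmetic shift keeps the sign
        done = bool((value == 0 and not byte & 0x40)
                    or (value == -1 and byte & 0x40))
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def decode_sleb128(data):
    """Decode one signed LEB128 integer from the start of data."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:              # sign bit of the final group
                result -= 1 << shift
            return result
    raise ValueError("truncated LEB128 value")


for v in (5, -1, 64, -123456):
    print(v, encode_sleb128(v).hex(), decode_sleb128(encode_sleb128(v)))
```

A fixed-width little-endian dump avoids this, at the cost of a larger file.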

regal steeple
rocky vigil
#

actually how do you get the vec_t to interpret it as i16

#

if it's i8 originally

#

to my understanding reinterpret_cast will just fill it with 2x the amount of i8 values

#

cvtepi8_epi16

#

ok

naive comet
rocky vigil
#

like the i8 to i16 simd conversion
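
The difference between reinterpreting and widening can be shown without SIMD. This sketch uses Python's `struct` module on four made-up int8 values. In AVX2 code the widening step would be an intrinsic such as `_mm256_cvtepi8_epi16`, which is presumably what `cvtepi8_epi16` refers to above.

```python
import struct

raw = struct.pack("4b", -1, 2, -3, 4)        # four int8 weights as stored

# Reinterpreting the same bytes as little-endian int16 glues byte pairs
# together, so two unrelated values come out.
reinterpreted = list(struct.unpack("<2h", raw))

# Widening converts each int8 to its own int16, preserving value and sign.
widened = list(struct.unpack("4b", raw))
widened_bytes = struct.pack("<4h", *widened)  # twice as many bytes

print(reinterpreted, widened, len(raw), len(widened_bytes))
```

So a register loaded with int8 weights holds twice as many values as an int16 accumulator register, and each half has to be widened before it can be added.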

stray reef
#

just convert to i16 at startup to save some instructions Kappa Kappa Kappa

rocky vigil
#

actually from startpos it seems to be a few %

#

but like

#

i am def doing it wrong

#

bc the pv is cooked

rocky vigil
#

bahhh rust makes this so different

#

Nodes/second : 1029999 for current vs Nodes/second : 1021921 for i8 test

#

so um

#

i am not doing this correctly

#

anyways i leave it to simd experts to try this later

rocky vigil
#

wait cj does this actually work

#

ah because enemy only matters when the piece types are the same

#

nice find

#

btw it appears that "real compression" performs better on threat inputs

#

i.e. zipping master net gives 60 MB, and zipping l1=1024 threat net gives 60 MB as well

#

asdadasds l1=1280 too large for fishtest

#

btw 1280 is like, 15% slower

#

than 1024

#

from bench

#

lemme try speedtest

#

ok bench is just slow with 1280

rocky vigil
rocky vigil
#

```
...      Stockfish TI-1280 playing White: 145 - 76 - 279  [0.569] 500
...      Stockfish TI-1280 playing Black: 77 - 128 - 295  [0.449] 500
...      White vs Black: 273 - 153 - 574  [0.560] 1000
Elo difference: 6.3 +/- 14.0, LOS: 80.8 %, DrawRatio: 57.4 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
1000 of 1000 games finished.
```

stage 4 or 5 (or factorizer) need to bring serious improvement if we want this to be viable

frosty imp
#

Fixed nodes?

rocky vigil
#

yep

#

20k

#

ofc stc would be better

#

but it exceeds file size limit on fishtest

frosty imp
#

how about 896 HL

rocky vigil
#

let's wait for 1280 to finish

frosty imp
#

surely if 1024->1280 is bad then 1024->896 is good

rocky vigil
#

and then see

#

maybe stage 4 is where the breakthrough happens

#

larger net being slower to train and all that

rocky vigil
#

if it adds overhead but decreases avg threat features updated

rocky vigil
frosty imp
#

merged

rocky vigil
#

oh nice

#

yeah i guess yoshie was right that fusing is not a gain

rocky vigil
frosty imp
#

oh bruh

#

fixed

twilit oriole
#

Those error bars are too large. A 10 Elo fixed nodes improvement is fine

rocky vigil
#

not vondele's 288 core fitbit or whatever he uses to do the local tests

twilit oriole
#

yes but your statement a "serious improvement" is needed is incorrect

rocky vigil
#

strange

rocky vigil
#

i'm pretty sure

#

again let's wait for stage 4/5

twilit oriole
#

It's a 6 +- 14 test. serious improvement to number of games is what is needed is my point lol there's no indication it is underperforming where it needs to be
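
To put a number on "more games": the error bar shrinks roughly with the square root of the game count, so a measured margin can be scaled to a target one. A back-of-the-envelope sketch, assuming the per-game variance stays the same; `games_for_margin` is a made-up helper.

```python
import math


def games_for_margin(known_games, known_margin, target_margin):
    """Scale a measured error bar to a target one, assuming margin ~ 1/sqrt(N)."""
    return math.ceil(known_games * known_margin ** 2 / target_margin ** 2)


print(games_for_margin(1000, 14.0, 5.0))
print(games_for_margin(1000, 14.0, 2.0))
```

Starting from 1000 games at ±14, narrowing the bar to a few Elo takes tens of thousands of games.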

rocky vigil
#

do u have hardware free to test

#

i can set up a branch if that's what you want

twilit oriole
#

sure

rocky vigil
#

would've put it on fishtest to check stc but we hit the size limit

twilit oriole
#

will it require me getting the net and all that crap. cos im doing through ssh takes time

rocky vigil
#

no auto download

twilit oriole
#

ah

#

well its fine if i can just wget it or smth

#

got 384 thread machine free in an hour, when will branch be done?

#

ig I can just do the TC tests also

rocky vigil
#

for getting a zipped version

#

branch is up now

twilit oriole
#

link

rocky vigil
twilit oriole
#

cool

rocky vigil
#

not even exceeding size limit by much even

#

130 MB vs 128 MB limit I'm pretty sure

twilit oriole
#

well changing the limit is just changing a number in the nginx config on fishtest server

#

its easy to do

rocky vigil
#

lmao

#

if local tc looks promising prob can get it done on fishtest

twilit oriole
#

can u make/link branch for 1024 net also

stray reef
twilit oriole
#

i do this test after my meeting ig lol

#

when will stage 4 complete

rocky vigil
#

uh

#

half a day?

twilit oriole
#

cool

rocky vigil
#

like to reduce also i.e. lichess load

#

oh wait it's not compatible with mmap

#

maybe we do it for the releases only

#

factorizer is surprisingly not that much of a slowdown

#

so far at least

#

still holding in mid-80 its/sec

rocky vigil
#

aha

#

that also works

#

might take a couple days

#

who knows

rocky vigil
#

or are you going to wait for stage 4

twilit oriole
#

Wait probably. Should ask @violet badger to increase the net size limit on fishtest server also

twilit oriole
#

Ah nice u already read lol

violet badger
#

well not merged and not me.. but yes

#

doesn't adjust the memory estimates for the workers though so is not fully complete.

twilit oriole
#

I think there's some nginx limit or smth to adjust also. Outside of PR

rocky vigil
#

yeah ppigazzini is the fishtest maintainer right

violet badger
#

yes, and indeed could need a bit more.

rocky vigil
#

higher than what the listed speedup suggests

#

but indeed should be very good

violet badger
#

so that's on top of what the 10k STC test tested, right..

rocky vigil
#

yes

violet badger
#

would imply near parity with master..

rocky vigil
#

once this finishes i plan to do both stc / ltc 10k games

lofty cedar
#

Hmm? Why not SPRT against master right away? After it finishes?

#

If it gains now, then it ends here.

rocky vigil
#

hmm

#

not expecting pass @ stc yet

#

ltc would be different story

#

idk what others think but i only wanted to do sprt vs master once we iron everything out and finalize it

#

since that will include a LTC SMP

lofty cedar
#

Maybe...

#

I mean, a change this drastic needs VVLTC gainer.

violet badger
#

I agree with finalizing before doing more advanced tests.

#

I expect also some net improvements could be found.

rocky vigil
#

it is expected* to scale with both time and threads (*known in plentychess, we'll see how it goes in stockfish relatively soon, it seems?)

#

yeah patience is good

lofty cedar
#

Yes... there's a good chance we could gain now at LTC.

But we could finalize things first.

rocky vigil
#

things are moving very fast actually

violet badger
#

I also still see warnings like:

position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
 1060 |         Bitboard threatened = ray & qAttacks & occupied;
      |                  ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
 1030 |     Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
      |              ^~~~~~~~~~
position.cpp: In instantiation of 'void Stockfish::Position::update_piece_threats(Stockfish::Piece, Stockfish::Square, Stockfish::DirtyThreats*) [with bool put_piece = true]':
position.h:346:35:   required from here
position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
 1060 |         Bitboard threatened = ray & qAttacks & occupied;
      |                  ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
 1030 |     Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
      |              ^~~~~~~~~~
position.cpp: In instantiation of 'void Stockfish::Position::update_piece_threats(Stockfish::Piece, Stockfish::Square, Stockfish::DirtyThreats*) [with bool put_piece = false]':
position.h:353:36:   required from here
position.cpp:1060:18: warning: declaration of 'threatened' shadows a previous local [-Wshadow]
 1060 |         Bitboard threatened = ray & qAttacks & occupied;
      |                  ^~~~~~~~~~
position.cpp:1030:14: note: shadowed declaration is here
 1030 |     Bitboard threatened = attacks_bb(pc, s, occupied) & occupied;
      |              ^~~~~~~~~~
#

probably easy fixes

rocky vigil
#

in comparison to the 6 month hiatus before

lofty cedar
#

Because once it passes, there would be more things to do like search tune and so on that could lock in the decisions.

#

Though I'd agree we should finalize things first.

#

So we don't waste compute on VVLTC search tunes on incomplete net.

rocky vigil
#

our goal is to pass this (preferably, with a comfortable gain) without having to resort to spsa

#

which looks a lot more doable with the new speedups

#

but it might also help with increasing L1

#

according to data above this reduces the avg. number of features updated which is more improvement with higher L1

violet badger
#

hard to track what is actually the most up-to-date branch 😉

rocky vigil
#

i think shawn has been trying to keep updated as tests pass

lofty cedar
#

Someone mentioned that like 99+% of the feature transformer weights losslessly compress to i8.

#

Though the rest could remain problematic.

rocky vigil
#

either clamping it directly does not work, or I have done something wrong elsewhere in the inference

#

in any case I gave it a quick try and got ~~neutral in speedtest

#

though someone more knowledgeable with simd could of course do better I presume

lofty cedar
#

anematode is probably the most knowledgeable we have at this.

rocky vigil
#

it helps with memory bandwidth yes, but adds many extra cvtepi8_epi16 calls, unless I am missing something

#

and also an extra addition pass

#

because the x2 trick doesn't work anymore

lofty cedar
#

But then... what about our ambition to deduplicate the net? Now, do we have to start over again?

rocky vigil
violet badger
#

these things will get sorted eventually, it is more or less orthogonal to the threats arch.

lofty cedar
#

I'm glad the net deduplication didn't contain a bunch of arch-specific hacks to get it working.

rocky vigil
#

i think threats make it more possible

#

since you gain more speed ostensibly

#

and less quantization penalty

#

since the threat feature weights are much less in abs value

lofty cedar
#

Oh, it already gained like 40 elo in fishtest condition.

#

But there were some problems that made it not viable to be merged right away.

rocky vigil
#

oh i was referring to i8 quantization

#

sorry

lofty cedar
#

Oh, I see.

rocky vigil
twilit oriole
#

I'm tracking who has contributed, we are gonna end up with a PR with 20 coauthors or smth lol

rocky vigil
#

i think it's 10 so far right?

twilit oriole
#

Nah it's way more

rocky vigil
#

really

twilit oriole
#

There's a lot of ppl that contributed earlier who are no longer active here

#

But I remembered

rocky vigil
#

I have u, jw, (ravenslofty - since yoshie's impl is inspired from yukari), yoshie, disservin, linrock, vondele, me, shawn, rn5, cj

#

were there others?

twilit oriole
#

Yes

#

But anyways not critical rn

rocky vigil
#

will look funny if it happens

#

20 ppl authored and (vondele/disservin) merged

rocky vigil
lofty cedar
#

Though... well... this is Stockfish. 20+ people on months of work for maybe like 10 elo gain.

rocky vigil
#

ehhhh

#

most of the work happened in march

twilit oriole
#

More like 2 months work

rocky vigil
#

and in the last month

#

so that would be ~2 months

#

yeah

prime mica
#

so hyped

violet badger
#

and we're still working on it...

#

could be another two months for all I know 😉

prime mica
#

lol

violet badger
#

but it is fun..

#

and actually nice it is many people contributing.

lofty cedar
#

And now that we have our training infrastructure back up, we could try new archs.

rocky vigil
#

there are a lot of cool ideas, note that a full net still takes several days though...

lofty cedar
#

I've calculated that a subnet made up of only pawn/minor piece/major piece/diagonal piece inputs all could be cached with like 80% hit rate if not more.

rocky vigil
#

so right now what will happen is basically experimentation with L1 and training schedule

violet badger
rocky vigil
#

not really big arch changes

violet badger
#

would speedup testing/training 2x.

rocky vigil
#

thought it required 2x the hardware :p

violet badger
#

(I think the actual data loader is probably OK; but something else is not).

#

HW is there.

lofty cedar
#

With 2048 cache entries, a subnet with only pawn or minor pieces could be cached with about 90% hit rate.

#

Well, not sure what can we make of it.

rocky vigil
#

this probably belongs in a separate like thread

lofty cedar
#

Yeah...

#

Though mentioning it because... well... we could now try...

rocky vigil
#

let's wait, there are more important things for now

#

anyways if we merge this into master this thread will be abandoned and further discussion will just be in nnue-dev

lofty cedar
#

Yeah...

rocky vigil
#

anyways there's not much to do now while waiting. if someone wants to try, a potential improvement would be to update the threats lazily as well, since we don't use them for anything other than the nnue

#

but this is a lot of effort

twilit oriole
#

Oh damn they aren't even lazy updated kekw

#

Isn't that like a decent speed boost

rocky vigil
#

should be

prime mica
#

regarding the i8 quantization, it'll probably be pretty machine dependent whether it's faster or not

#

but ideally it can just be optional

#

while maintaining a consistent bench

rocky vigil
#

which will incur some (hopefully minor) loss

prime mica
#

right

#

would be a balancing act

#

although, if the exceeding elements are extremely rare, we can do a scalar cleanup for those rows

#

without much overhead

#

(I tried this with the main net but the exceeding elements are too common for it to work)
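
A minimal sketch of the scalar-cleanup idea above, in plain Python rather than SIMD. It assumes the weights are stored row-per-feature as i16 values; `split_i8`, `add_row` and `overflow_fraction` are hypothetical names for illustration, not functions from the Stockfish source. Each weight is clamped to the i8 range, and the part that did not fit is kept in a sparse per-row side table that gets applied with scalar adds.

```python
I8_MIN, I8_MAX = -128, 127


def split_i8(weights):
    """Split rows of i16-range ints into clamped i8 rows plus sparse residuals.

    Returns (clamped_rows, overflow) where overflow maps a row index to a list
    of (column, residual) pairs for the entries that did not fit in i8.
    """
    clamped, overflow = [], {}
    for r, row in enumerate(weights):
        out = []
        for c, w in enumerate(row):
            q = max(I8_MIN, min(I8_MAX, w))
            out.append(q)
            if q != w:
                overflow.setdefault(r, []).append((c, w - q))
        clamped.append(out)
    return clamped, overflow


def add_row(acc, clamped, overflow, r):
    """Add feature row r to the accumulator in place."""
    # Bulk part: this is the loop that would be vectorised (i8 widened to i16).
    for c, q in enumerate(clamped[r]):
        acc[c] += q
    # Scalar cleanup: only touches the rare entries that exceeded the i8 range.
    for c, extra in overflow.get(r, ()):
        acc[c] += extra


def overflow_fraction(weights):
    """Fraction of weights that fall outside the i8 range."""
    total = sum(len(row) for row in weights)
    over = sum(1 for row in weights for w in row if not I8_MIN <= w <= I8_MAX)
    return over / total
```

Whether this pays off depends on `overflow_fraction` being small, which is the point made above: the cleanup loop is cheap only if rows with exceeding elements are rare.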

twilit oriole
#

The i8 speedup is more so on high thread counts

#

And makes net smaller

#

Smaller than master even I think

prime mica
#

on my machine, I got a speedup even single threaded

#

but my computer has proven very weird in terms of perf characteristics

#

so probably wouldn't generalize

violet badger
#

zen5 is however modern, and we should prioritize moving forward IMO.

rocky vigil
#

how modern are the majority of fishtest workers btw

twilit oriole
rocky vigil
#

less memory pressure on fishtest workers should help a lot as well yeah

#

especially with mmap

prime mica
twilit oriole
#

Binary size

prime mica
#

I see

rocky vigil
#

note that lichess / chess.com would also like it

#

if our nets were smaller

prime mica
#

I mean if you're willing to not use LEB128 then you could get it nearly the same size, e.g. use -127..=127 as literal i8 values and -128 as a prefix byte

#

but yeah there is an elegance about making it all i8
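
A rough sketch of the prefix-byte scheme described above, as an alternative to LEB128. The chat only specifies that -127..=127 are literal i8 values and -128 is a prefix byte; what follows the prefix is an assumption here (a little-endian i16), and `encode` / `decode` are hypothetical names.

```python
import struct

ESCAPE = -128  # prefix byte: the next two bytes hold a full i16


def encode(values):
    """Encode i16-range ints: one byte if in -127..=127, else escape + i16."""
    out = bytearray()
    for v in values:
        if -127 <= v <= 127:
            out += struct.pack("<b", v)
        else:
            out += struct.pack("<b", ESCAPE)
            out += struct.pack("<h", v)  # assumed layout: little-endian i16
    return bytes(out)


def decode(data):
    """Inverse of encode."""
    values, i = [], 0
    while i < len(data):
        (b,) = struct.unpack_from("<b", data, i)
        i += 1
        if b == ESCAPE:
            (v,) = struct.unpack_from("<h", data, i)
            i += 2
            values.append(v)
        else:
            values.append(b)
    return values
```

With this layout a literal costs one byte and an escaped value costs three, so the file size approaches the pure-i8 size when almost all weights fit in -127..=127.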

rocky vigil
#

actually curious how we should continue to have the 7mb smallnet

#

like for websites

twilit oriole
#

Not a major concern I think, they already use a custom binary

prime mica
#

how does stockfish wasm even work? I don't see anything webassembly-specific in the repository...

rocky vigil
#

lichess has a dedicated stockfish wasm repo?

#

idk what they do there

prime mica
#

Ohhh ok

twilit oriole
#

They zstd compress the net and that iirc

rocky vigil
#

oh ok if they do it on their side it's good

#

it shouldn't impact them too much then

#

threat inputs compress better which will offset the raw size (assuming i16)

#

though i8 is obviously preferable

formal smelt
#

It just works™

rocky vigil
#

yeah i saw and realized

#

it wouldn't be helpful lol

formal smelt
#

This seems like the kind of thing that definitely doesn't need manual simd

rocky vigil
#

honestly to get around mulhi trick

#

i think if we wanted to do that

#

i8 * (i8 = 2) -> i16 mul is better

frosty imp
#

because non-code contributors were not usually coauthored

#

but mentioned in the PR

green moat
frosty imp
#

many more would need credits in the PR

rocky vigil
#

yeah it might depend on how we do coauthor/credit split

#

a lot of ppl need credits yeah

frosty imp
#

I guess disservin actually contributed to the original threat-inputs branch

prime mica
twilit oriole
#

Don't need to be posting lists lol. I already have it and none of the posted ones are complete anyways

rocky vigil
#

yeah let's figure this out at the end

frosty imp
#

are we getting too ahead of ourselves here Kappa

rocky vigil
#

indeed

twilit oriole
#

All we need to agree, I get to make the PR Kappa

rocky vigil
#

though i wouldn't think it is wrong to be feeling pretty good about it

#

the scary time was before rn5 speedup

prime mica
#

dumb question, why isn't threat information (or something equivalent) already encoded somehow in the main network through training

frosty imp
#

can we clean up the different horizontal mirroring scheme btw

rocky vigil
frosty imp
prime mica
#

gotcha ok

upbeat pewter
#

it's really hard to generalise that information just from the PST inputs without a lot of layers

daring wren
prime mica
#

true

rocky vigil
#

conversely, you cannot figure out psq information from threat information

#

the reason why this stuff doesn't work is like

#

you have to think of nnues as not really deep networks

#

so they are very constrained by the additive structure

#

of the first layer

#

which carries most of the info

upbeat pewter
#

part of why I wanted to play with threat inputs is that I always wanted to wire the attack table information I already had into the eval

#

and up until yoshie cracked it, I was the only AB engine that could do so without being majorly crippled in performance (though I did need to quarter my net width)

rocky vigil
prime mica
#

I love this emoji btw

rocky vigil
#

we still have work to do

violet badger
#

yeah 5-10 Elo still needed I think

rocky vigil
#

i'll do it when we get closer to pass i think

frosty imp
#

yeah and also on the trainer side ig

lofty cedar
#

Maybe someone could also use something similar to splat_moves to update threats faster?

#

Maybe...

#

Though the ideal byteboard... would be pretty hard.

prime mica
#

I mean do threat updates still take a serious amount of time after cj/sscg/shawn's work?

rocky vigil
#

the big gain still todo

#

is compute threat updates lazily

#

overall it should still be like high single digit % of runtime

lofty cedar
#

How do we update lazily?

#

When the threat update depends on the board state.

rocky vigil
#

we postpone the threat updates to when we need them

#

(i.e. on eval)
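
A toy sketch of what "postpone until eval" could look like. It is not how Stockfish's accumulator stack is written; `LazyStack` and the thunk interface are hypothetical. The key assumption is that each pending entry is a callable that already captures whatever board state it needs to produce `(added, removed)`, which is exactly the hard part raised in this thread.

```python
class LazyStack:
    """Feature sets per ply, with deltas resolved only when evaluate() is called."""

    def __init__(self, root_features):
        self.features = [frozenset(root_features)]  # resolved plies
        self.pending = []                           # unresolved delta thunks

    def push(self, delta_thunk):
        # Making a move only records how to compute the delta later.
        self.pending.append(delta_thunk)

    def pop(self):
        if self.pending:
            self.pending.pop()    # never evaluated: the delta is never computed
        else:
            self.features.pop()   # undo an already resolved ply

    def evaluate(self):
        # Resolve pending plies in order, oldest first.
        for thunk in self.pending:
            added, removed = thunk()
            self.features.append((self.features[-1] - set(removed)) | set(added))
        self.pending.clear()
        return self.features[-1]
```

The saving comes from nodes that are pushed and popped without ever being evaluated; every node that does get evaluated still pays for all unresolved plies above it.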

lofty cedar
#

Yeah... I know, but each threat update depends on the board state... and updating the entire thing means... well... wait... tracking every added and removed piece? It could be as much as recomputing the entire threat...

#

Could be faster... IDK.

frosty imp
prime mica
#

update_accumulator_incremental 😩

frosty imp
prime mica
#

I'll try the i8 compression on ur branch later today

frosty imp
#

🙏

frosty imp
#

uh oh

#
info depth 39 seldepth 56 multipv 1 score cp -190 nodes 56649998 nps 1583021 hashfull 515 tbhits 0 time 35786 pv f8f5 b1c2 a6a5 d7d8 g8g7 d8a5 g7g8 a5a6 g8g7 c2c3 g7f7 c3b4 f7g7 b4a3 g7h7 a6a7 h7h8 a3a2 h8g8 a2b1 f5f1 b1c2 f1f5 a7b8 g8g7 c2c3 g7h7 b8c8 h7g7 c3b4 g7h7 b4a3 h7g7 c8b8 g7h7 a3a2 h7g7 b8d6 g7h7 a2a3 f5f7 a3b4 f7f5 d6g3
info depth 40 currmove f8f5 currmovenumber 1
stockfish: nnue/nnue_accumulator.cpp:115: void Stockfish::Eval::NNUE::AccumulatorStack::push(const Stockfish::DirtyBoardData&): Assertion `size + 1 < psq_accumulators.size()' failed.
#
position fen r4rk1/1b2bp2/p2p4/1p3pNp/4P2P/1P1Q1Pq1/1P6/1K1R2R1 b - - 1 25
go
twilit oriole
#

Which branch is that

frosty imp
#

cj branch

frosty imp
#
position fen 5rk1/3Q4/p5p1/1p5p/8/1P6/1P6/1K6 b - - 0 37
go
rocky vigil
#

Huh how does it have anything to do with threats

frosty imp
#

it's broken on master actually

#

submitting issue rn

twilit oriole
#

What the heck kek

frosty imp
#

will bisect in a moment 🙃

lofty cedar
lofty cedar
#

About 12%.

prime mica
#

Interesting

#

Could there be even more if we sorted them

#

Each eliminated pair is worth quite a lot

lofty cedar
#

The data is quite structured.

#

As in, the first few almost always pair with the last few.

#

In order.

#

But I'm not sure if the cost of checking would outweigh it so I only check one pair.

prime mica
#

gotcha

rocky vigil
frosty imp
#

the assert is wrong

#

the logic is correct

prime mica
#

ok that makes sense

#

otherwise it would have crashed by now

rocky vigil
frosty imp
rocky vigil
#

oh ok

#

so it looks like tracking and indexing threats each take up ~5% of the runtime

#

after cj speedup

#

so lazy tracking could be a couple % gain

lofty cedar
#

Though... well... there's this patch that needs approval. This one is a low-hanging fruit.

rocky vigil
#

actually why is it that we get duplicate features in both added and removed

lofty cedar
#

Well, I think the feature got added in one move but then it didn't use the net so it got removed later on.

rocky vigil
#

huh shouldn't incremental always be one-move updates

#

strange

#

stage 1 validation loss 0.00305 w/ factorizer compared to 0.0031 from old run

#

not sure how much this can be read into

lofty cedar
#

Has anyone added the weight permutation or something to the system?

rocky vigil
#

the only weight permutation to be done is re-indexing the threats to be efgh mirrored (so as to remain consistent with the psq mirroring)

#

this does not functionally change the evaluation, so I'm delaying it to finishing touches

rocky vigil
#

let's see how it fares on fishtest

#

how big are added / removed on average? if intersection is significant it might be worth it to search for cancellations
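
For illustration, a sketch of a full search for cancellations between the two lists, assuming they are plain lists of feature indices. `cancel_common` is a hypothetical helper, not the patch discussed here (which checks only one pair); it removes every feature that appears in both lists, respecting multiplicity.

```python
from collections import Counter


def cancel_common(added, removed):
    """Drop features present in both lists (multiset intersection)."""
    common = Counter(added) & Counter(removed)

    def strip(items):
        budget = Counter(common)  # how many copies of each feature to drop
        out = []
        for f in items:
            if budget[f] > 0:
                budget[f] -= 1
            else:
                out.append(f)
        return out

    return strip(added), strip(removed)
```

Each cancelled pair saves two accumulator row operations (one add, one subtract), so the question is whether building the counters costs less than the rows it eliminates.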

naive comet
#

old profile

naive comet
frosty imp
naive comet
#

goat

frosty imp
#

why is there an intersection

#

a slider piece moving along its ray or something?

naive comet
#

when you add a piece then remove it then there's an extra + and - from the slider attacking the piece behind it

frosty imp
#

I wonder if you can do something about it then?

#

just hardcode threat updates for each movetype maybe?

naive comet
#

I mean we already had a threat deduplication patch going

#

from rn5

twilit oriole
naive comet
#

it reduced updates by ~0.7 on avg I think

frosty imp
naive comet
#

@lofty cedar

twilit oriole
#

Well a dbg on repeat would be useful ig

naive comet
twilit oriole
frosty imp
#

two updates for 12% of the time

#

that's 0.24

naive comet
#

ok well gg I guess

frosty imp
#

@naive comet plz pr

lofty cedar
lofty cedar
#

It looks like my patch speeds up more than I expected.

#

It measured barely 1% on my machine.

rocky vigil
#

interesting

#

well

#

any and all elo is good 😄

prime mica
#

🚀

rocky vigil
lofty cedar
#

Are you merging my patch too?

rocky vigil
#

lemme keep it to patches that have passed fishtest...

#

there's plenty of time to be patient

lofty cedar
#

Oh, okay.

rocky vigil
#

alright now we wait

rocky vigil
prime mica
#

lololol

#

the truth hurts

twilit oriole
prime mica
#

😩

#

what's the ELO gain fixed nodes again?

twilit oriole
#

30 but it is misleading

#

Because of how threat inputs work

prime mica
#

elaborate?

twilit oriole
#

Different game phases have widely varying speed diffs and fixed-nodes differentials. The fixed-nodes gain occurs in the positions with the most slowdown

#

You have to just read the STC and LTC; the fixed-nodes result does not tell you about expected scaling

prime mica
#

gotcha

twilit oriole
#

Well the new speedups don't appear to help in PT much

#

So that's a big issue

#

What a terrible result kek

rocky vigil
#

this is so cooked

#

what

prime mica
#

😭

twilit oriole
#

Well it may be related to the fact you rebased on master

#

Which is optimising for master net

#

Gainer patches there may not necessarily translate

#

Either that or the branch is fucked in some way

naive comet
#

@rocky vigil @regal steeple remember to rebase

twilit oriole
#

It is rebased. That's the PT diff

naive comet
#

ok nice

twilit oriole
#

At best if the test got super unlucky in both STC and LTC it's equal to previous PT

#

We can just attribute it to that these new tests didn't have Shawn's blessings

#

I think it is actually likely the previous PT was just super lucky

#

Since the jump to -10 didn't add up from the previous approx -25

prime mica
naive comet
#

I have a good idea for more speed, I will :prayge: this works

#

once I get back home

rocky vigil
#

to be fair what I see in sprts

#

one side can apparently just randomly gain 5 elo

#

then lose it

#

so idk anymore

twilit oriole
#

Well I think this PT was too early so not enough gainers to overcome error bars. There is still actual gain

#

It's just not visible enough yet

#

Adding sprt Elos is not valid lol

#

You have to assume on the low end for all of them

rocky vigil
#

is 10k games even enough to check scaling

#

ig I can extend

#

if we want more

twilit oriole
#

I don't think you can extend it

#

Fishtest "feature"

#

Try to if you want to see what I mean

rocky vigil
#

eh whatever

#

oh yeah

#

"unable to modify number of games in a fixed game test" lmao

#

actual const int games = 10000

twilit oriole
#

It's dumb, I disabled the check on our instance lol

prime mica
twilit oriole
#

It's just a check in the code that throws that message when you try

#

We used to be able to till a few years ago

#

Like it's intentional design choice to not allow users to do it now

frosty imp
#

merged

naive comet
#

bam

frosty imp
#

💥

naive comet
#

whoo

frosty imp
#

sanity check factorized vs unfactorized stage 1

frosty imp
#

hmm seems alright

#

stopping the test

lofty cedar
#

What's this factorization?

frosty imp
#

just bringing back factorized weights for psq inputs

lofty cedar
#

What's factorizing weight in the first place?

frosty imp
#

we add an extra bucket that is active regardless of ksq, then merge that bucket into all other buckets after training

#

helps with convergence in rarer buckets
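
A small sketch of the merge step, assuming weights are nested lists indexed as `bucket_weights[bucket][feature][neuron]` and `factor_weights[feature][neuron]` for the extra king-independent bucket. `coalesce` is a hypothetical name and the i16 bounds are an assumption about the target quantized range; the actual trainer code may scale or lay things out differently.

```python
def coalesce(bucket_weights, factor_weights, lo=-32768, hi=32767):
    """Fold the king-independent factor bucket into every real bucket.

    During training a position activates both the real bucket row and the
    factor row, so after training their sum is the effective weight.
    """
    merged = []
    for bucket in bucket_weights:
        rows = []
        for real_row, factor_row in zip(bucket, factor_weights):
            row = []
            for w, v in zip(real_row, factor_row):
                s = w + v
                if not lo <= s <= hi:
                    # The combined weight must still fit the storage type.
                    raise OverflowError(f"merged weight {s} outside [{lo}, {hi}]")
                row.append(s)
            rows.append(row)
        merged.append(rows)
    return merged
```

The range check reflects the concern raised earlier in the thread that the combined weights must not clip once they are folded into the main net.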

lofty cedar
#

Oh, I see.

#

Though the elo in the test can be misleading.

#

I thought we were like only 10 elo away but after a few more patches it's further.

#

Though still within the error bar.

twilit oriole
#

Well stage 1 ofc will be much better. It won't be that huge difference in the end

violet badger
naive comet
#

yikes

violet badger
#

well, not infinitely far from beating it.

#

3-4% speedup, or a clear improvement in the training.

candid ivy
#

if i checked correctly, apply sometimes removes and adds the same threats, can that be?
like a value in the added list also exists in the removed list? so we are doing some unnecessary ops, no?

frosty imp
#

yeah that's what people are now optimizing

candid ivy
#

ah 👍

green moat
violet badger
#

btw, I wonder what that 'factorized' pipeline is actually using, as it is setup to use just --features=Full_Threats .. not --features=Full_Threats^ ?

rocky vigil
#

Huh

violet badger
#

so, you agree that it should be using the latter?