#UE Threat Inputs for AB

rocky vigil
#

Yes it should be using ^ for the factorized features

violet badger
#

so, I'd better set up an additional run with that enabled...

#

makes me wonder why the first quick test of stage 1 seemed so good?

#

it did finish stage 2... so I guess we could test again.

rocky vigil
#

Closer if you consider that we are around this close to master w/o spsa for a standard net

violet badger
#

yes, I've been keeping that in mind 🙂

rocky vigil
#

Though I hope there is still some 5-10 elo more we can squeeze in total

#

I don’t think the fishtest actually helps let us know the scaling

#

At least it shouldn’t scale poorly

violet badger
#

well... #1336647760388034610 message is a test for scaling to some extent. Looks good but no cigar

rocky vigil
#

Oh shoot I thought that was also stc just done faster locally

violet badger
#

stc, but with a few more threads

green moat
violet badger
#

no need to ping ...

rocky vigil
#

It remains to be seen whether 1280 can be worth it over 1024

#

Stage 4 should be done now right?

green moat
rocky vigil
#

@stray reef do you think lookup could still be faster than current indexing scheme post-cj speedup?

stray reef
#

i think it's worth trying the full lookup table for sure

#

but i personally will try this way as well

lofty cedar
#

Another potential speedup patch inbound!

stray reef
#

ig it's a question of larger table vs. no branches/one pseudo_attacks lookup and popcount saved

lofty cedar
#

I think larger tables are relatively free if their cache lines aren't used.

#

They just stay out of the caches and don't cause slowdown.

rocky vigil
#

Can try it ig

rocky vigil
#

oh

#

size limit increased

#

gonna try 1280 stc / ltc ig

twilit oriole
#

is this with factoriser

rocky vigil
#

nope

#

is just stage 4

#

also it's on fishtest now

#

so you can just pull the two branches involved

#

if you want to run locally

twilit oriole
rocky vigil
#

yeah idk why "factorized" stage 1 is so much better this time

#

training luck?

twilit oriole
sharp sail
#

how much is factorized training worth?

twilit oriole
#

this net isnt factorized. they thought it was but it wasnt

rocky vigil
#

we will find out for threat inputs in ~ 3 days

#

real factorization

sharp sail
#

was the master net trained without factorizer?

rocky vigil
#

the current best net is w/o factorizer

twilit oriole
#

the master net is with factorizer

#

ofc

rocky vigil
#

since this one reduced avg. number of threats updated

#

oh well

twilit oriole
#

continue the fake factoriser run

#

the sprt became even better after resuming

rocky vigil
#

yeah both are active rn

#

it'll be really funny though

#

if just better training luck == elo

#

though bad overall in long term

twilit oriole
#

thats not normal. bullet doesnt have this

rocky vigil
#

idk how aggressive the skipping / filtering is

#

but each stage is much less than 1 epoch I think

violet badger
#

I wonder if the branch had an additional fix?

rocky vigil
#

i did not edit anything touching the original threat inputs definition so I wouldn't expect that to be the case

#

but who knows

#

what is going on at ltc rn

#

with the 1280 net

violet badger
#

idk..

twilit oriole
#

Can you do a training run getting rid of king buckets? at l1 1280 again

violet badger
#

no..

#

I think I've seen what the issue is.

twilit oriole
#

yeah i mean in general lol

#

not related to this

rocky vigil
#

need to define new feature set

twilit oriole
#

I tested it at L1 3072 and it was -20 fixed nodes at that L1 size. But others tried at smaller L1s and it gained a lot

rocky vigil
#

it could be done though

#

let's wait to see

#

if this single machine is just very anomalous

violet badger
#
diff --git a/model/features/full_threats.py b/model/features/full_threats.py
index 8219c47..6ff36fd 100644
--- a/model/features/full_threats.py
+++ b/model/features/full_threats.py
@@ -160,9 +160,11 @@ class FactorizedFeatures(FeatureBlock):
     def get_feature_factors(self, idx: int) -> list[int]:
         if idx >= self.num_real_features:
             raise Exception("Feature must be real")
-
-        a_idx = idx % NUM_PLANES_REAL
-        k_idx = idx // NUM_PLANES_REAL
+        if idx < 79856:
+            return [idx]
+        
+        a_idx = (idx - 79856) % NUM_PLANES_REAL
+        k_idx = (idx - 79856) // NUM_PLANES_REAL
 
         if a_idx // NUM_SQ == 10 and k_idx != KingBuckets[a_idx % NUM_SQ]:
             a_idx += NUM_SQ
#

is that specific to the factorizer or a bug fix in general?

rocky vigil
#

this is specific to factorizer

#

i think

#

isn't this code not called?

#

when factorizer is not used

violet badger
#

Ok, looking at the diff between the trainers:
git diff 73696ad5f56e6ba216ba693bf5ad41a278004e36 5bcb0036825206ad6a23df6ed1b07211e3a73f58

#

which are the shas of the two versions used for training.

#

It also contains the change to the rng.. but I doubt that matters?

rocky vigil
#

probably not

#

i really don't know how training luck is 30 elo though

violet badger
#

I have certainly never seen that.

#

At the end of the pipeline there is maybe 1-2 Elo variation

#

(certainly <5Elo)

rocky vigil
#

approx how long will it take for stage 1 of the real factorized run?

#

we could also compare then

violet badger
#

The usual <24h.

#

Prob 14h.

rocky vigil
#

yeah so it won't take that long to find out...

violet badger
#

I wonder if it makes sense to start training L1 = 768...

rocky vigil
#

perhaps

#

stc of 1280 is holding steady so far

#

i guess really just wait a while for ltc

violet badger
#

I think the better performance of that factorized net is probably because of the reference net used, I don't see it to be the stage 2 equivalent. The training run has these nets:

Step 1 : starting from None leading to 574f3061fd9e
--> step 1 is final already. Result: /workspace/scratch/574f3061fd9e/run/lightning_logs/version_1/checkpoints/nn-3e22bf1f564d.nnue
Step 2 : starting from 574f3061fd9e leading to e3109a97a662
--> step 2 is final already. Result: /workspace/scratch/e3109a97a662/run/lightning_logs/version_1/checkpoints/nn-a878500a97a8.nnue
Step 3 : starting from e3109a97a662 leading to 6d0eccfc51a2
--> step 3 is final already. Result: /workspace/scratch/6d0eccfc51a2/run/lightning_logs/version_1/checkpoints/nn-bf4519f857f4.nnue
Step 4 : starting from 6d0eccfc51a2 leading to bedc9e9b73fd
--> step 4 is final already. Result: /workspace/scratch/bedc9e9b73fd/run/lightning_logs/version_2/checkpoints/nn-598188c9a702.nnue
Step 5 : starting from bedc9e9b73fd leading to e919dd3ada1a
--> step 5 is final already. Result: /workspace/scratch/e919dd3ada1a/run/lightning_logs/version_1/checkpoints/nn-d1dc1ab9cb1c.nnue
#

Step 1 : starting from None leading to fbfaa6b547c6
--> step 1 is final already. Result: /workspace/scratch/fbfaa6b547c6/run/lightning_logs/version_1/checkpoints/nn-020430fc567b.nnue

twilit oriole
#

has to be, this sprt result is impossible lol

violet badger
#

so the proper test would be nn-3e22bf1f564d.nnue vs nn-020430fc567b.nnue

rocky vigil
#

lmao

#

nn-fd9f...

#

that base net

#

is a 100 SB net

#

so 1/8 of the real stage 1

#

yeah this test is meaningless

twilit oriole
#

first hashgate now netgate Kappa

rocky vigil
#

i kinda wanna see in a few days

#

if we are at the level of master replication attempt

#

(i.e. pre-spsa)

violet badger
#

so, at least we know we can't stop training after 100 epochs.

regal steeple
violet badger
#

let me try..

#

probably no difference?

Result of 100 runs
==================
base (./stockfish.base         ) =     977742  +/- 3024
test (./stockfish.new          ) =     977027  +/- 3049
diff                             =       -715  +/- 2094

speedup        = -0.0007
P(speedup > 0) =  0.2520

But well, always tricky to measure small differences
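
For reference, the two summary lines can be approximately reproduced from the `diff` row. This is a minimal sketch, assuming the `+/-` value is a 95% interval (1.96 standard deviations) of a normally distributed mean difference. That interpretation and the helper name `normal_cdf` are assumptions, not taken from the benchmarking tool itself.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Values copied from the result block above.
base = 977742
diff = -715
diff_err = 2094          # assumed to be a 95% half-width, i.e. 1.96 sigma

speedup = diff / base
sigma = diff_err / 1.96
p_speedup_positive = normal_cdf(diff / sigma)

print(f"speedup        = {speedup:+.4f}")
print(f"P(speedup > 0) = {p_speedup_positive:.4f}")
```

Under these assumptions the numbers land close to the quoted -0.0007 and 0.25, which is consistent with a difference well inside the noise.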

#

it might also be that things now are a bit HW dependent

rocky vigil
#

it might? be time to try 768

#

with the factorizer

regal steeple
#

Thank you, thats quite interesting, my local testing showed a decent speedup, but im not sure, maybe I made some mistake in my test

violet badger
#

I think it depends a bit on what dominates, and probably in my case slow memory access dominates.

#

but well, we will see what fishtest figures out..

rocky vigil
#

the 1280 results are strange

twilit oriole
#

i dont think they are. can be explained its just too slow, maybe undertrained etc

#

I think the current net size is well selected and we should focus on optimising it fully first

#

in terms of training schedules etc

violet badger
#

I already played quite a bit around with lr / alpha, but no gains so far.

rocky vigil
#

yeah 1024 seems good, we could give 768 a try later just to confirm

green moat
#

nn-e0189470ae73.nnue available for "Use 1280" pipeline (based on threat_inputs branch)

rocky vigil
#

kind of a wash though at least with current estimates

#

like it will be 1-2 elo stronger than stage 4

#

that'll put it at maybe -10 elo stc, -4 elo ltc

#

768 might be more interesting

#

probably antiscales but maybe the base stc gain will be high

prime mica
#

hm

#

do u have an estimate on how much incremental threat updates would help the situation

#

I also wonder whether some of the threats are "low information" in the sense that they're already encoded somehow in the main net

#

like if you have a queen right next to the king lol

rocky vigil
prime mica
#

😩

#

piddly

rocky vigil
#

Like threat tracking is ~5% of runtime rn

prime mica
#

oh that's not much

rocky vigil
#

Threat indexing is another 5% still

#

It adds up

prime mica
#

true

rocky vigil
#

The biggest time sink is accumulating

#

Which is 20%
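
A rough way to bound what speeding up any one of these costs could buy is Amdahl's law. The sketch below plugs in the fractions quoted above (about 5% tracking, 5% indexing, 20% accumulating). The function name is illustrative, and the fractions are only the rough estimates from this conversation.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of runtime is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

costs = {"threat tracking": 0.05, "threat indexing": 0.05, "accumulating": 0.20}

for name, fraction in costs.items():
    # factor=inf models the (unreachable) best case where the cost vanishes.
    best = amdahl_speedup(fraction, float("inf"))
    twice = amdahl_speedup(fraction, 2.0)
    print(f"{name:16s} 2x faster: {twice:.3f}x   eliminated: {best:.3f}x")
```

Even a perfect removal of a 5% cost caps out at roughly a 5% overall speedup, which is why the small items only matter in aggregate.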

prime mica
#

🤮

#

I need to gather more data on whether it's truly memory bound on other computers

rocky vigil
#

Idk if that can be done faster though, unless the raw number of updates is decreased

prime mica
#

my computer is weird because it has extremely high arithmetic throughput

rocky vigil
#

Wait what machine do you have

prime mica
#

it's a recent AMD EPYC machine

rocky vigil
#

Oh yeah

prime mica
#

so it can do 4x 512-bit vpaddw per cycle

#

er wait no

#

2x

#

but still quite a bit more than most computers on fishtest

rocky vigil
#

Using many threads probably stresses the avx / memory more

prime mica
#

true

rocky vigil
#

That’s why I think viren said i8 would only be worth it at smp

prime mica
#

ye

rocky vigil
#

But fishtest conditions are also multithread basically in terms of memory pressure

#

Bc of concurrency

prime mica
#

right

#

although hm

#

what if we tried combining the shared memory branch

#

like

rocky vigil
#

I also learned there is no simd i8 * i8 = i16 mul

prime mica
#

threat inputs, shared memory vs. master, shared memory

rocky vigil
#

So i8 requires double add

#

Or we drop mulhi trick

prime mica
#

a tragedy

#

let's call up the CPU manufacturers and have them add vpsfaddsubw

rocky vigil
prime mica
#

gotcha

#

what is Monty?

rocky vigil
#

CPU mcts engine

prime mica
#

oh! cool

rocky vigil
#

The one where this idea originated from

#

Since they got it to work in Monty first

#

And then it worked in Yukari, then Plentychess

#

And soon hopefully sf

prime mica
#

I see

#

it's cool that stockfish imports ideas from other engines!

#

like big ideas would probably be really hard to test and push through bc master is so carefully tuned

rocky vigil
#

I mean that’s the purpose of all this stuff being open source

prime mica
#

sure

rocky vigil
#

And the collaborative nature

prime mica
rocky vigil
#

corrhist which was a big gain (like 6 elo, it’s literally a whole third of the progress from 17 to 17.1) shortly after SF17 was also originally done in other ab engines

rocky vigil
#

But I’m hoping we can go the full way

#

And yeah big ideas like this require many many people

rocky vigil
#

@prime mica one of these is the current version

#

the one on the right is

prime mica
#

cool beans

sharp sail
naive comet
#

I have another idea

#

but the thing is that all my smart ideas fail and my dumb ideas tend to work

#

just look at my last speedup for example

#

like how tf does that give 2%

twilit oriole
naive comet
#

pogey

violet badger
#

5 stages of 1280:
1: Elo: -83.06 +/- 1.85, nElo: -158.79 +/- 3.40 nn-8f15e80a1212.nnue
2: Elo: -44.34 +/- 1.84, nElo: -82.88 +/- 3.40 nn-ee65bf2468c5.nnue
3: Elo: -41.99 +/- 1.84, nElo: -78.44 +/- 3.40 nn-da4726ad1062.nnue
4: Elo: -38.09 +/- 1.84, nElo: -71.15 +/- 3.40 nn-07f85ae62b17.nnue
5: Elo: -36.27 +/- 1.86, nElo: -67.03 +/- 3.40 nn-e0189470ae73.nnue

#

vs #1336647760388034610 message of 1024

#

(not the latest optimized SF playing of course)

rocky vigil
#

yeah -11 stc to neutral ltc what is this

twilit oriole
#

the convergence time increased

twilit oriole
rocky vigil
#

well 768 will probably be huge antiscaler at this rate

#

💀

#

oh well

#

might as well see

violet badger
twilit oriole
#

i would assume safest bet is to increase length of all stages and redo. dunno how other stuff might affect things

#

but wait for factoriser first ig that should help a bit

violet badger
#

right, maybe smarter to wait for the factorizer.

#

that one (the for real one) should finish step 1 soon (2h?) and I think we should run a sanity check against the corresponding step without factorizer.

stray gyro
#

Is there a reason why we don't push indices directly to active in append_active_indices?

#

I'm seeing ~1.5% speedup, also bench looks identical (to xu-shawn/threats_inputs)

naive comet
#

yeah

#

it's just a useless intermediate list

stray gyro
#

Another free speedup...

naive comet
#

yeah

rocky vigil
#

wait shoot where'd that come from

#

i thought I removed that

#

it's tech debt back when I was doing "UE at home"

stray gyro
#

Idk what's the latest version, I just looked at shawn's branch and it seemed strange to have that.

naive comet
#

also mineta ray is unused in the threats updates in Position

rocky vigil
#

yeah i thought i removed it in the retry

naive comet
#

you should prolly include that too

rocky vigil
#

i guess i didn't, and then shawn copied it over

#

i mean free speedups always nice

stray gyro
rocky vigil
#

yeah i guess just put it up on fishtest

#

should be free 2-3 elo

stray gyro
#

I'd rather have it included directly because it's trivial. fishtest is already under a bit of strain atm.

violet badger
#

maybe 1280 is different... who knows. Still some work to do, but we're making progress.

violet badger
#
(60+0.6, 72t, 32000MB, UHO_Lichess_4852_v1.epd):
   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)
   1 master    :     0.0   ----  4173.0    8192    51
   2 patch     :    -6.7    5.5  4019.0    8192    49

(patch is 3fc4b6a58c288001f929acc560cb8b28adf03125 (cj-latest-speedup branch))

#

relative to #1336647760388034610 message

prime mica
#

this is 8192 games, each one having engines with 72 threads?

violet badger
#

yes

prime mica
#

gotcha

rocky vigil
#

ok so 1024 is approximately neutral scaling with master

#

in threads and time

#

so probably it is a good spot

#

though of course 768 might pull a surprise...

violet badger
#

well, I should probably compute the results with 1 thread... I think there is good scaling actually. One thing we're definitely seeing, as with all arch changes, is that the real result depends on the HW.

rocky vigil
#

LTCs on fishtest do get a larger diversity of machines (at least for the same number of games), so that might play a role

naive comet
#

then would 1280 have more promise?

rocky vigil
#

i still think 1024 is good

#

but yeah a longer trained 1280 is definitely promising

rocky vigil
# naive comet yikes...

to be fair, all this test shows is that stc 72t and ltc 72t are around the same, as stc, at -7

twilit oriole
rocky vigil
#

was waiting for mineta to do so but I can make one

twilit oriole
#

I don't see what is 'yikes' about -6 at this stage it's completely fine

#

Need a bit of patience...

#

There are gainers to come

stray gyro
#

If there are no pending improvements incoming I can make a test

naive comet
#

I think sscg recently made one

stray gyro
#

You mean vs master or vs latest threat?

#

oh good

rocky vigil
#

oh i already have the branch set up

#

but have not made test

#

i'll make it

#

it'll run slightly faster

stray gyro
#

sure

rocky vigil
#

here we go

#

btw smth strange

#

@regal steeple can this test be reconciled with upstream changes

violet badger
#

so, thread scaling is real in this context. Relative to #1336647760388034610 message

(60+0.6, 1t, 64MB, UHO_Lichess_4852_v1.epd)
   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)
   1 master    :     0.0   ----  9367.0   17920    52
   2 patch     :   -16.2    3.7  8553.0   17920    48
prime mica
#

😲

violet badger
#

I assume quite a bit comes from the memory pressure (i.e. the known thing where nets are not (yet) shared between processes, using 72t is an easy workaround).

prime mica
#

was about to say that

violet badger
#

So, for completeness... STC to follow.

rocky vigil
#

btw factorized stage 1 is done

#

so I'll put that on fishtest

#

and hopefully avoid netgate...

prime mica
rocky vigil
#

shawn tested the last "fake factorized" stage 1 net against the test net which only had 1/8 of a stage

#

and surprise surprise +30 elo

violet badger
#

(chess equivalent, obviously)

prime mica
#

lol

#

that's actually rly funny

#

defrauded by threat inputs

regal steeple
rocky vigil
#

bruh forgot to change the fixed games preset

#

can just stop after 20k or so if the results are clear

rocky vigil
violet badger
violet badger
regal steeple
violet badger
#

against the shawn threat_inputs branch, I assume

violet badger
#
speedup        = +0.0135
P(speedup > 0) =  1.0000
regal steeple
#

Thank you that looks good I guess

violet badger
#

yes, it does

regal steeple
#

I got around the same value, so hardware difference seems to not be an issue

stray reef
#

Some data from pohls tests

PlentyChess 7 TI Test

STC (3min+1sec, ratinglist conditions, 512MB):

Torch 4 a512             : 1000 (+215,=472,-313), 45.1 %, -34 +- 15
Stockfish 17.1 250330    : 1000 (+122,=503,-375), 37.4 %, -89 +- 15


LTC (30min+10sec, 512MB):

PlentyChess 7.0.0 a512   : 1000 (+259,=497,-244), 50.8 %, 5 +- 15
Torch 4 a512             : 1000 (+221,=476,-303), 45.9 %, -28 +- 15
Stockfish 17.1 250330    : 1000 (+142,=491,-367), 38.8 %, -79 +- 15

the error bars are not great ofc, but the trend is there.
and keep in mind it's 512MB hash in both cases, he doesn't have RAM for more
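
The percentage and rating columns above can be cross-checked from the W/D/L counts. This is a minimal sketch using the standard logistic Elo formula. The function name is illustrative, and the remark about truncation below is an observation about these particular rows, not a statement about how the list is actually generated.

```python
import math

def score_and_elo(wins, draws, losses):
    """Score fraction and logistic Elo difference from a W/D/L record."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    return score, elo

# W/D/L records copied from the STC table above.
records = {
    "Torch 4 a512": (215, 472, 313),
    "Stockfish 17.1 250330": (122, 503, 375),
}

for name, (w, d, l) in records.items():
    score, elo = score_and_elo(w, d, l)
    print(f"{name:24s} score={score:.4f}  elo={elo:+.2f}")
```

The computed values seem to match the quoted integers when truncated toward zero rather than rounded.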

plain flower
#

no STC plentychess results?

violet badger
stray reef
violet badger
#

I've also started a factorized 1280 training run.

naive comet
regal steeple
finite wind
#

what do you predict? what is the future of threat inputs?

violet badger
#

threatened ? More seriously, not quite stronger than master, but close. There are still inference patches that will speed things up, training sessions that will improve, and tests to be done. So quite some work. Even if it is not certain that it will be merged, it definitely helped to revamp some of our tools and processes.

twilit oriole
rocky vigil
#

this could be simplification yes

#

oh bruh

#

i thought it would've sped up at least by some nontrivial amount

#

who knows

#

eh I'll just recalculate the llr later

#

it's kind of a waste of games to just restart the test

rocky vigil
violet badger
#

at 72t it should not be, the full net should fit in the socket's L3 cache.

rocky vigil
#

for the stc 1t i mean

violet badger
#

yeah, in that case most likely.

rocky vigil
#

gg speedup

#

first one in a little bit

naive comet
#

coolio

green moat
violet badger
#

The Elo difference to reference is much better in the tests that are executed in the end, but one would need to check if the shas of the playing engines are the same. If the engines are the same, would be quite an improvement.

#

Probably the playing engine that is different.

Previous:
75edbee01e6f8cb53a2555499192ccaddb883577  b7f553ee8b28a4abace6c1056dceb1d69169873a
Elo: -25.29 +/- 1.82, nElo: -47.45 +/- 3.40
Factorized:
75edbee01e6f8cb53a2555499192ccaddb883577  d5fad05e412e3118f94ab79aa5e03067ac86d204
Elo: -21.06 +/- 2.19, nElo: -39.16 +/- 4.06

So, 4 Elo progress in this test, but we can't attribute to net or playing engine.
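
The size of that difference relative to the noise can be estimated in a few lines. This is a sketch that assumes the two measurements are independent and that both `+/-` values are the same kind of interval, so the errors add in quadrature. Neither assumption is confirmed in the conversation.

```python
import math

# Elo and error values copied from the two results above.
previous = (-25.29, 1.82)
factorized = (-21.06, 2.19)

gain = factorized[0] - previous[0]
# Assumes independent measurements, so the errors combine in quadrature.
gain_err = math.sqrt(previous[1] ** 2 + factorized[1] ** 2)

print(f"gain = {gain:+.2f} +/- {gain_err:.2f} Elo")
```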

#

quite a few moving targets.

green moat
#

Would it be better then to test nn-6b685002b4b6.nnue against nn-598188c9a702.nnue on Fishtest?

violet badger
#

yes.

#

that's the test to run, or maybe once the testing at the end of the pipeline finishes, so we take the proper net.

#

but it is already rather clear that this time step 5 is better (nn-6b685002b4b6.nnue)

stray reef
#

@twilit oriole I just tried i8 input weights, it's 13% faster but unfortunately -20 fixed nodes (https://furybench.com/test/3426/). I guess the difference to monty is the i8 l1 matmul. but maybe you (or someone else) have/has any ideas how to sneak more accuracy into my impl?

#

there is still a factor of 2 up for grabs technically, if i replace _mm512_cmpgt_epi32_mask with _mm512_cmpneq_epi32_mask for nnz calculation, though i've not found a nice way to do cmpneq on avx2 and below yet

#

another thing is, i saw you using 128 (or 127?) for input weight quantisation, whereas I am limited to 64/63 due to the factoriser ([-98, 125] is the range rn). my weights are clamped to [-0.99, 0.99] during training
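
The headroom argument can be checked with a few lines. This sketch assumes the quoted range [-98, 125] is the range of the merged, quantised weights at a scale of 64, and that the target type is a signed 8-bit integer. Both are readings of the message rather than confirmed facts.

```python
INT8_MIN, INT8_MAX = -128, 127

def fits_int8(lo, hi):
    """True if every value in [lo, hi] is representable as a signed 8-bit int."""
    return INT8_MIN <= lo and hi <= INT8_MAX

# Range quoted above, assumed to be measured at quantisation scale 64.
lo_q64, hi_q64 = -98, 125

# Doubling the scale to 128 doubles every quantised weight.
lo_q128, hi_q128 = 2 * lo_q64, 2 * hi_q64

print("scale  64:", (lo_q64, hi_q64), "fits" if fits_int8(lo_q64, hi_q64) else "overflows")
print("scale 128:", (lo_q128, hi_q128), "fits" if fits_int8(lo_q128, hi_q128) else "overflows")
```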

naive comet
stray reef
#

cmpeq_epi32_mask (also cmpneq) is all avx512 only

lofty cedar
#

Where do I make a PR for threat inputs?

violet badger
lofty cedar
#

My test got a bit of a problem due to one worker having 10 residual... which may subject the passed test to additional tests.

violet badger
#

don't worry, that one will be purged

naive comet
stray reef
#

ohh didn't see _mm256_cmpeq_epi32 is avx2, yes, thx

green moat
lofty cedar
#

Yes, it just changes the update_piece_threats to basically precompute the pawn attack bitboard for 1 square (which simply was not there because we didn't need it), and saved a few attack recalculations by re-using values we need anyway.

naive comet
#

@lofty cedar

#

@frosty imp @regal steeple

#

also aside from that patch, @stray reef do you have Finny tables for your threat inputs? I was thinking about it, do you do a large bitmask and calculate difference that way by any chance?

naive comet
#

cuz your patch hinges on the fact that we always compute rAttacks, qAttacks and bAttacks

#

but with rn5's patch we dont do it on mutate_piece

#

so combining the 2 means you are computing rAttacks, qAttacks, bAttacks even when unnecessary

#

ofc this is a trivial specialcasing away but still

lofty cedar
#

Oh... okay...

#

But well, I think the compiler won't miss special case. After all, it's a template argument.

naive comet
#

well still good to at least do a speedtest using either speedtest or pyshbench before merging

lofty cedar
#

And if one were to be pedantic, one could also say that if the compiler didn't recognize the fact that you can elide qAttack in the case, then you don't need to compute that...

#

I see.

regal steeple
stray reef
naive comet
#

oh

stray reef
naive comet
#

skip the while loop even

#

at least my impl did that

regal steeple
#

I think that was possible while threats to square were still getting incrementally tracked because threats by sliders were getting added in the third loop but now they are getting added in the second loop

naive comet
#

surely you can still do that by:

<pseudocode>
threats_remove()
threats_add()
mutate_board()

or something in that order

#

needs checking at least

regal steeple
#

Im not sure im following, say we have a position like this 6k1/8/5n2/4p3/4P3/8/6B1/6K1 b - - 0 1 with f6e4 played, we still need to remove the bishop to pawn threat and add the bishop to knight threat
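
The example position can be checked mechanically. The sketch below is not engine code: it uses a plain dictionary board and hypothetical helper names to find the first piece hit on each diagonal from g2 before and after f6e4, and reports which slider threats disappear and which appear.

```python
FILES = "abcdefgh"

def parse_fen_board(fen):
    """Return {square_name: piece_char} from the board field of a FEN."""
    board = {}
    rows = fen.split()[0].split("/")
    for r, row in enumerate(rows):          # rows[0] is rank 8
        f = 0
        for ch in row:
            if ch.isdigit():
                f += int(ch)
            else:
                board[FILES[f] + str(8 - r)] = ch
                f += 1
    return board

def bishop_threats(board, sq):
    """First piece hit on each diagonal from sq, as (target_square, piece)."""
    f0, r0 = FILES.index(sq[0]), int(sq[1])
    hits = set()
    for df, dr in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
        f, r = f0 + df, r0 + dr
        while 0 <= f < 8 and 1 <= r <= 8:
            name = FILES[f] + str(r)
            if name in board:
                hits.add((name, board[name]))
                break
            f, r = f + df, r + dr
    return hits

def make_move(board, move):
    """Apply a simple from-to move (captures included) on a copy of the board."""
    new = dict(board)
    new[move[2:]] = new.pop(move[:2])
    return new

fen = "6k1/8/5n2/4p3/4P3/8/6B1/6K1 b - - 0 1"
before = parse_fen_board(fen)
after = make_move(before, "f6e4")

old = bishop_threats(before, "g2")
new = bishop_threats(after, "g2")
removed = old - new
added = new - old
print("removed:", removed)
print("added:  ", added)
```

The bishop itself never moves, yet its threat set changes because the occupant of e4 changes, which is the case the message describes.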

prisma hatchBOT
naive comet
#

ill experiment after your pr gets merged

lofty cedar
#

Okay! STC passed.

prime mica
#

the goat

violet badger
#

The net from the factorized-not-really-factorized pipeline is essentially the same strength as what we have, but maybe 1 Elo progress:

   1 nn-6b685002b4b6.nnue    :  2300.6    0.6  73981.5  147456    50
   2 nn-598188c9a702.nnue    :  2299.4    0.6  73474.5  147456    50
#

(I guess good enough to include in the branch..)

#

(and also uploaded to make that easy..)

formal smelt
#

We were going to try it also

#

Relatively soon

stray reef
#

awesome

#

i know nothing about QAT, is there something in some git repo that shows how it's done in bullet?

frosty imp
#

merged speedup and net

lofty cedar
#

Has anyone tried doubling the size of the later layers in threat net?

violet badger
#

long ago (at least for SF) people tried these things, at that point it didn't help much?

#

might be things have changed, but I somehow doubt.

lofty cedar
#

Yeah... but neural networks show that a more detailed input scheme often requires a larger net to interpret.

#

Also, the slowdown in later layers is not that significant anyway if it really helps.

violet badger
#

pure guess here, but I think for later layers to be really useful, they might need to be significantly wider. Somehow neither 32 nor 64 can present enough features to the later layers to be able to reason much about the board...

#

They might help a little bit introducing non-linearity or so, but not 'reasoning' or 'tactics'

lofty cedar
#

I see.

rocky vigil
#

Put it against xu-shawn threat_inputs not master

lofty cedar
#

Hmm? Correct branch? I did put it against threat_inputs?

#

Oops...

#

I tried to put it against threat_input...

#

But for some reason, it sent to master.

rocky vigil
#

wanna see if 1280 might be better now

#

in terms of speed

tulip gust
#

fucking hell jw you are completely unstoppable

rocky vigil
#

And are we merging the old version

rocky vigil
violet badger
#

Is there anything speaking against l1=896 ?

frosty imp
#

was suggesting that a while ago

violet badger
#

I think I better start that as well.

#

we'll need to squeeze a few more Elo I'm afraid.

rocky vigil
#

Ah yeah maybe we need finer increments to test

#

Was unsure if 7*128 ran into problems with avx registers but I guess not

violet badger
#

for sure better than 8*127

#

but yeah, I think 512bits is the unit of concern.

#

good old days of 512bytes word vectors are gone 😉
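
The register-width point can be made concrete with a small check. This sketch assumes 16-bit accumulator entries and 512-bit registers. The element width is an assumption for illustration, and the helper name is hypothetical.

```python
REGISTER_BITS = 512      # AVX-512 register width
ELEMENT_BITS = 16        # assumption: int16 accumulator entries

def registers_needed(l1_size):
    """Number of 512-bit registers needed, as a possibly fractional value."""
    return l1_size * ELEMENT_BITS / REGISTER_BITS

for l1 in (768, 896, 1024, 1280, 8 * 127):
    regs = registers_needed(l1)
    status = "whole registers" if regs.is_integer() else "partial register"
    print(f"L1={l1:5d}: {regs:6.2f} x 512-bit ({status})")
```

With these assumptions 896 = 7*128 fills a whole number of registers, while 8*127 = 1016 would leave a partially used one.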

rocky vigil
violet badger
#

they should be run until the end, the gains are small enough now that incomplete sprts are probably not very informative.

prime mica
#

really dumb question, does it ever make sense to like, train 8 nets in parallel and then select the best one

#

or do they all end up having the same strength

#

(of course it's computationally annoying but just wondering the variance in training)

tulip gust
#

but you can also do model souping which is like a stronger version of this

prime mica
#

O

#

to be clear I don't mean selecting the best one at runtime

tulip gust
#

i know, yeah

frosty imp
#

iirc linrock always did multiple runs

prime mica
#

oh ok

#

idk what a subcircuit is

#

lol who is this legendary linrock and where did they go

#

did they move on from computer chess

frosty imp
#

linrock trained SF's network for a long while

frosty imp
#

because he was always responsible for it, nobody cared to reproduce his training setup

#

so when he took an indefinite break progress literally stopped

prime mica
#

😩

#

does he still respond to questions

rocky vigil
#

He is still active somewhat

#

Approved one of the recent threat input tests

#

But overall I think he is mostly happy to move on now that vondele got sufficiently close with reproducing a pre-spsa net

rocky vigil
frosty imp
#

@lofty cedar

#

eh gonna set TP to 25

green moat
lofty cedar
#

Oh... I see.

rocky vigil
#

new progtest concluded with no difference to last one

#

it is missing the very slightly better net

#

btw @frosty imp if "speedups" reaches simp bounds threshold should I just stop there

rocky vigil
frosty imp
#

just pr now

#

idt that it even needs testing

rocky vigil
#

alright well yeah it is simp bound passing so i feel slightly more at ease

#

cool

#

stopped

lofty cedar
#

Oopsies. The new "speedup" doesn't interact well with the newer patches.

stray reef
violet badger
lofty cedar
#

Does weight permutation work on threat input?

stray reef
#

Yes

violet badger
#

and is being used AFAICT.

stray reef
formal smelt
#

lgtm, i would probably just have the function "quantise" the weights only rather than also doing the affine op

#
fn quantise<'a>(mut layer: Affine<'a, CudaMarker>, value: f32) -> Affine<'a, CudaMarker> {
    layer.weights = layer.weights.faux_quantise(value, true);
    layer.bias = layer.bias.faux_quantise(value, true);
    layer
}
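
For readers unfamiliar with the idea, here is a generic illustration of fake quantisation as used in quantisation-aware training. It is not bullet's `faux_quantise` implementation, whose exact semantics (including the meaning of the boolean argument) are not shown in this conversation. The function name and the scale of 64 are chosen only for illustration.

```python
import math

def fake_quantise(x, scale):
    """Snap x to the nearest multiple of 1/scale but keep it as a float."""
    return math.floor(x * scale + 0.5) / scale

scale = 64
for w in (0.3, -0.99, 0.5):
    q = fake_quantise(w, scale)
    print(f"{w:+.4f} -> {q:+.6f} (integer level {round(q * scale)})")
```

During training the forward pass sees the snapped value, so the network learns weights that survive rounding. The gradient is typically passed through as if no rounding had happened.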
rocky vigil
#

@stray reef how feasible would it be to try and separate the threat tracking to only be done when accumulator update is required?
(on the flip side, if we keep the current structure, is it a sane idea to attempt to prefetch the corresponding weights when the indices are computed?)

formal smelt
prime mica
#

prefetch corresponding weights when the indices are computed
I tried this with the main net and it didn't help

#

U should still try it but just a data point

formal smelt
stray reef
#

line 891

formal smelt
#

oh yeah i'm blind

stray reef
#

i'm not sure if this is worth anything

rocky vigil
#

ah

#

so right now for L1=1024 threat tracking and indexing each take about 5% of the overall runtime

#

idk

formal smelt
stray reef
#

hehe yeah :P

#

still haven't gotten around to switching to binpacks

formal smelt
#

how much did that gain?

stray reef
#

3-4 SPRT elo

rocky vigil
#

768 seems to have finished training

#

1024 to follow shortly

rocky vigil
frosty imp
#

just test stage 5 i guess

rocky vigil
#

alright let's see if the threat-input-psq patch passes soon first

frosty imp
#

Cursed kekgasm

#

I’d say that one needs LTC so it’s probably not getting in soon

rocky vigil
#

eh fine

#

I'll just start STC + LTC then

#

for 768

lofty cedar
#

What about a threat finny table? The idea is that when a piece moves to a square, instead of adding the threats of the entire board, we add the previous threat to that piece and the difference between the previous threat and the current threat.

frosty imp
#

that's exactly what's happening now?

lofty cedar
#

Really?

#

I mean... isn't the current approach that when a piece moves to a square, we add
the threat of that piece to/from every piece?

frosty imp
#

nope

rocky vigil
#

i think the biggest issue is fusing the add/sub like that massively inflates

#

there is a reason it is not done like that for standard psq either

frosty imp
#

oh you mean the threats of that piece

#

how is that different from the first message

#

I see. wouldn't moving a piece then require you to update multiple finny entries

lofty cedar
#

Uggh...

#

I see.

regal steeple
#

@frosty imp In this patch https://tests.stockfishchess.org/tests/view/68f67ce0637acd2a11e723d9 maybe its better to replace pawn_attacks_bb<BLACK>(s) with attacks_bb<PAWN>(s, BLACK) instead of pawn_attacks_bb<BLACK>(square_bb(s)), attacks_bb<PAWN>(s, BLACK) uses a table already so its basically equivalent to the previous version without having to create a new table (using a table saves a few bit shifts, not sure if thats significant but the test seems to struggle a little).

naive comet
#

^^^^

regal steeple
naive comet
#

honestly the pawn_attacks_bb thing is just unnecessary

violet badger
#

I assume tomorrow evening we will have 1280, and probably the day after 896

stray reef
#
Elo   | 9.47 +- 2.93 (95%)
Conf  | N=20000 Threads=1 Hash=16MB
Games | N: 20634 W: 6151 L: 5589 D: 8894
Penta | [338, 2253, 4631, 2699, 396]

https://furybench.com/test/3472/
fine-tuning with QAT vs. fine-tuning without QAT (for i8 feature weights)
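
As a sanity check on the numbers above, a small sketch that converts a win/loss/draw count into a logistic Elo difference. The function name `elo_from_wdl` is made up for illustration, and the testing framework may compute its figure differently (for example from the pentanomial counts); the plain logistic formula simply happens to land on the same value here.

```cpp
#include <cassert>
#include <cmath>

// Logistic Elo difference implied by a win/loss/draw record.
// score = (W + D/2) / N, elo = -400 * log10(1/score - 1).
double elo_from_wdl(long wins, long losses, long draws) {
    const long n = wins + losses + draws;
    assert(n > 0);
    const double score = (wins + 0.5 * draws) / static_cast<double>(n);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Calling `elo_from_wdl(6151, 5589, 8894)` gives roughly 9.47, in line with the reported result.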

#

maybe around -10 fixed nodes to master

naive comet
#

pogey

desert tree
#

QAT not copium for this then? nice

violet badger
#

sounds like the right thing to do ...

stray reef
#

there is still some packus accuracy to gain in this impl, maybe 3-4 elo from that. testing STC vs main now

formal smelt
stray reef
#

seems roughly neutral to master at STC. I'll run another fine-tune with more fine-grained quantisation, inference is a bit slower there but still faster than i16 feature weights

#

I suppose a positive STC+LTC result with -5 elo at fixed nodes is mergable? if monty did it too

formal smelt
#

we also did an SMP test

stray reef
#

how did that compare to STC?

formal smelt
#

+11 rather than +16 or something

#

+- error

rocky vigil
violet badger
#

I guess careful testing needed, in these sequences we know the inference is always the same, between the runs the sha of the testing binary might not be equivalent. I think 768 and 1024 are essentially equivalent.

#

But that needs a test on fishtest with care on picking the right version of SF.

#

(or some analysis of the sha of the SF used for playing)

regal steeple
lofty cedar
#

I suspect we're at about -5 elo.

#

Which is about the pre-tune net.

prime mica
#

💪

#

so close

regal steeple
torn lagoon
#

Vs master?

regal steeple
#

This commit against the one prior to that one

torn lagoon
#
```
Compiled by                : g++ (GNUC) 15.2.0 on Linux
Compilation architecture   : x86-64-avx512icl
Compilation settings       : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 12 1536 150
Available processors       : 0-11
Thread count               : 12
Thread binding             : none
TT size [MiB]              : 1536
Hash max, avg [per mille]  :
    single search          : 56, 30
    single game            : 798, 566
Total nodes searched       : 2257313338
Total search time [s]      : 153.514
Nodes/second               : 14704283
```
#
```
Compiled by                : g++ (GNUC) 15.2.0 on Linux
Compilation architecture   : x86-64-avx512icl
Compilation settings       : 64bit AVX512ICL VNNI AVX512 BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.2.0
Large pages                : yes
User invocation            : speedtest
Filled invocation          : speedtest 12 1536 150
Available processors       : 0-11
Thread count               : 12
Thread binding             : none
TT size [MiB]              : 1536
Hash max, avg [per mille]  :
    single search          : 53, 30
    single game            : 794, 563
Total nodes searched       : 2224392957
Total search time [s]      : 153.514
Nodes/second               : 14489837
```
#

Looks like a slowdown on zen 5
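
To put a number on the two speedtest outputs, a tiny sketch that computes the relative nodes-per-second difference. It assumes the first output is the baseline and the second is the patched build, which the messages do not state explicitly; `relative_nps_change` is a hypothetical helper name.

```cpp
#include <cassert>

// Relative change in nodes/second going from base to test.
// Negative values mean the test build is slower.
double relative_nps_change(double base_nps, double test_nps) {
    assert(base_nps > 0.0);
    return (test_nps - base_nps) / base_nps;
}
```

With the figures above, `relative_nps_change(14704283, 14489837)` is about -0.0146, i.e. roughly a 1.5% slowdown under that ordering assumption.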

regal steeple
#

Hm thank you that looks like about the same speed difference I get

#

Maybe I'm worrying too much here, but according to the speedup Elo estimate formula a ~1.5% speedup should be ~3.15 Elo, which is (barely) out of error at the moment. Let's see how the test goes

torn lagoon
#

I can test on a pre-avx Intel if it helps

#

To check for variance

regal steeple
#

I think that one test is fine, I just wanted to rule out that this is some weird case where the slowdown only exists for me

rocky vigil
#

ahhhh

#

why does pt never show progress

prime mica
#

😭

#

nontransitivity

violet badger
#

speedups ought to be transitive in any case.

#

anyway, tonight probably we have step 5 of the 1280 net, and maybe tomorrow the 896 net...

rocky vigil
#

🙏

violet badger
#

doesn't feel like there will be major breakthroughs though..

rocky vigil
#

from local results factorized 1024 maybe another 2 or 3 elo

#

i just started stc on fishtest a bit ago

violet badger
#

yes, seen that.

stray reef
#

I tried another way of doing i8 quantisations, this time it was worse (last time STC vs main was neutral)
I think with larger L1s you will both see more benefit from it, as well as see less elo loss from stricter quantisations, it's 100% worth trying in SF

Maybe it's even worth for me to try passing L1=768 as i8 directly, to mitigate the fixed node loss, since it's pretty clear it won't pass my current LTC on its own

rocky vigil
#

What was the different way

#

I also think threat only i8 is better

#

Since then you maybe don’t need to quantize so aggressively

stray reef
#

ended up not mattering at fixed nodes which way I did it, but the tiny slowdown mattered at STC

rocky vigil
#

Interesting

#

Have you gotten threat only i8 to work

stray reef
#

(Both with QAT of course)

stray reef
rocky vigil
#

Bc I feel like the factorizer screws with psq extra hard

#

Unless you have a way to bypass

stray reef
#

i know nothing about how psq works in the sf arch

rocky vigil
#

I meant like

#

Quantize the threat feature weights to i8

#

And keep the psq features as i16

stray reef
#

ohh not that psq i see

rocky vigil
#

💀

stray reef
#

bullet now has this way to clamp not only factoriser weights and psq weights individually, but also in combination, which is ofc extremely useful for this. but it requires a full training run, not just a fine-tune for me

#

will do that eventually after my new gpu is set up

stray reef
#

which will make inference a bit slower

#

if they're different by a factor of 2 it's easy but it's essentially the same slowdown as in my second i8 test (~3 STC elo)

rocky vigil
#

Most of them are small in absolute value anyways

stray reef
#

dunno, if that's true then that'd work

rocky vigil
#

Like in our nets the natural frequency of weights exceeding limit is close to 1%%
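
A rough sketch of what "weights exceeding limit" means for an i8 feature-weight quantisation: scale each float weight, round, saturate to the int8 range, and count how many had to be clipped. The struct and function names and the scale value are assumptions for illustration; the actual trainers (and the QAT clamping discussed above) may use a different range or rounding rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct QuantResult {
    std::vector<std::int8_t> q;  // quantised weights
    std::size_t clipped = 0;     // how many weights fell outside [-128, 127]
};

// Quantise float weights to int8 with a fixed scale, saturating on overflow.
QuantResult quantize_i8(const std::vector<float>& weights, float scale) {
    assert(scale > 0.0f);
    QuantResult r;
    r.q.reserve(weights.size());
    for (float w : weights) {
        const float v = std::round(w * scale);
        if (v > 127.0f || v < -128.0f)
            ++r.clipped;
        r.q.push_back(static_cast<std::int8_t>(std::clamp(v, -128.0f, 127.0f)));
    }
    return r;
}
```

Dividing `clipped` by the number of weights gives the clipping frequency; keeping the psq weights in i16 sidesteps the question for that part of the net.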

#

But the x2 trick for mulhi is the real issue

rocky vigil
stray reef
rocky vigil
#

Uh x2 to use some mulhi trick

#

Because mulhi preserves sign

rocky vigil
#

Maybe quantization issue

#

In sf need to internally store 2x the weights

#

Otherwise one of the shifts will overflow

stray reef
#

but for my master net it's currently not an issue, one shift works fine

rocky vigil
#

huh

stray reef
#

So if this does end up being a problem for SF it's easy to solve

lofty cedar
#

Anyone tried the new muon optimizer?

frosty imp
#

out of scope for this post

rocky vigil
#

L1=1280 looking like it'll be -10 stc again compared to factorized L1=1024

#

i guess can try speculative ltc soon

rocky vigil
#

to reduce issues around copying 260 MB of net per concurrency

violet badger
#

if anything I agree smp might be more interesting. However, it looks quite a bit weaker indeed. The 896 net should also be fully trained later today.

prime mica
#

(sry I read smth wrong)

#

I thought it was the other way around

regal steeple
#

Its not really a fail, the name was just chosen poorly

prime mica
#

that it was actually a good change

#

okok

#

sry for ping

lofty cedar
#

But maybe let's make another thread instead?

#

Factorized stage 5 STC passed!

@rocky vigil

#

It would probably pass LTC... but would it need testing anyway?

#

Okay... I'll LTC.

violet badger
#

no need for LTC... we're developing a branch. It is obviously stronger than the existing STC, only when changing scales is that useful.

#

(I've stopped it).

#

In other news 896 finished.. that's probably more interesting, but I guess it will be weaker than 1024. Will get with some more data later today I think.

rocky vigil
#

I’ll pr to Shawn’s branch soon

#

The play I think now is to figure out how “cleanup” should be reverted

#

Because it looks like it should be reverted somehow

lofty cedar
#

Just use the attack_bb<pawn> thing.

#

I guess it means that the attack_bb<pawn> is faster.

rocky vigil
#

@frosty imp pr made

rocky vigil
violet badger
#

yes, that's what I see as well, later today I'll come up with a graph. Want to find some time to do fixed nodes test as well. I wonder if somebody could measure once nps for the 4 sizes we have now (with consistent versions of the code, just net size changes).

violet badger
#

so, collected the data now..

#

so, at fixed nodes outperforming master, at tc, underperforming.

#

I found the dip in performance for 896 and 1280 interesting, as if these versions are for whatever reason slower than 768 and 1024 (like performance goes up smoothly at fixed nodes)

#

raw data

$ cat ..

   # PLAYER                       :  RATING  ERROR    POINTS  PLAYED   (%)
   1 1280-nn-71f4e3cc3782.nnue    :    38.8    1.9   40878.0   73728    55
   2 1024-nn-26b0e5126117.nnue    :    25.3    1.8   39486.0   73728    54
   3 0896-nn-7347b2877a12.nnue    :    20.2    1.9   38958.0   73728    53
   4 0768-nn-914a5c3a46dc.nnue    :    11.5    1.9   38056.0   73728    52
   5 master                       :     0.0   ----  137534.0  294912    47

White advantage = 40.31 +/- 0.46
Draw rate (equal opponents) = 45.67 % +/- 0.09


   # PLAYER                       :  RATING  ERROR    POINTS  PLAYED   (%)
   1 master                       :     0.0   ----  155112.0  294912    53
   2 1024-nn-26b0e5126117.nnue    :   -15.1    1.8   35304.0   73728    48
   3 0768-nn-914a5c3a46dc.nnue    :   -16.0    1.8   35207.0   73728    48
   4 0896-nn-7347b2877a12.nnue    :   -19.8    1.8   34815.5   73728    47
   5 1280-nn-71f4e3cc3782.nnue    :   -23.1    1.8   34473.5   73728    47

White advantage = 41.79 +/- 0.45
rocky vigil
#

Are we also merging cleanup revert

twilit oriole
#

Can code changes be accompanied with a test always. Nobody is skilled enough to do a cleanup in a hot path and know with 100% certainty it has no slowdown

#

Compiler behaviour can be unpredictable

rocky vigil
regal steeple
#

I can also run a non regr against the pre cleanup version if thats needed

rocky vigil
#

Anyways this + net should take us to -5 stc

#

Or smth

twilit oriole
#

Is there a 768 factorised net

rocky vigil
violet badger
#

like how hard is to read that table #1336647760388034610 message

twilit oriole
#

On mobile it is pretty hard :p

violet badger
#

nokia hitting back hard 😉

rocky vigil
#

This data seems to suggest “magic numbers” like 768, 1024, 1536 might be optimal

#

For whatever reason

violet badger
#

well, I'm suggesting it must be speed related.

#

(given performance at fixed nodes)

#

if asked to explain I'll mumble cache associativity effects ...

#

but I really have no idea what's causing this

lofty cedar
#

Why?
Cache lines are only 64-bytes long so at most 64 elements.

twilit oriole
#

This data suggests factorisation benefitted the 1024 more than the 768. Might give some signal that convergence time is still a factor

daring wren
#

tbh it's probably not related to that

lofty cedar
#

There are normally 8 or 16 AVX registers I guess?

#

Each 64 bytes.

twilit oriole
#

It is more related to the fetching of the weights itself I assume

lofty cedar
#

As in, in practice, shouldn't matter.

twilit oriole
#

Huh

lofty cedar
#

I mean... back in the day of direct mapped cache, it mattered a lot.

#

Nowadays, lots of people just assumed that approximately the last N lines accessed are in the cache.

lofty cedar
violet badger
#

happily exchanging this idea for the one shown to explain the effect on performance we measured 😉

lofty cedar
#

Well, beyond 1024 elements, we run out of registers.

#

There are 32 registers in AVX512.

#

So, 2048 bytes or 1024 elements.
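
The arithmetic behind that estimate, written out as a compile-time check. It assumes 2-byte (int16) accumulator elements, which is what the 2048 bytes to 1024 elements conversion implies; `elements_in_registers` is a hypothetical helper.

```cpp
#include <cstddef>

// How many elements fit if every vector register holds accumulator data.
constexpr std::size_t elements_in_registers(std::size_t num_regs,
                                            std::size_t reg_bytes,
                                            std::size_t elem_bytes) {
    return num_regs * reg_bytes / elem_bytes;
}

// AVX-512: 32 registers of 64 bytes, 2-byte elements.
static_assert(elements_in_registers(32, 64, 2) == 1024);
```

In practice some registers are needed for the weights being added, so the usable number is lower than this upper bound.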

#

Though the trailing parts might be lagging.

violet badger
#

wait, we're looking for a reason why 896 is worse than 768 and 1024

#

(also 1280, but well)

lofty cedar
#

Oh... well... that gets even more confusing.

#

Has anyone inspected the assembly?

rocky vigil
#

I somehow suspect 896 is performing at same speed as 1024

twilit oriole
#

I don't think 1280 is underperforming where it should be?

violet badger
#

vs master at tc testing, it is the worst?

rocky vigil
#

Like the fixed nodes looks fine

twilit oriole
#

Yes that is what I expect

violet badger
#

anyway, at TC testing the performance curve is not smooth, and that would need explanation, IMO

#

freelo

rocky vigil
#

I suspect on these 72 core machines the net size is more harmful than at fishtest

#

And that would also explain 1280

violet badger
#

possibly.

lofty cedar
#

We really need deduplicate net!

violet badger
#

but again, hard to explain the zigzag performance at tc testing

rocky vigil
#

Or at least this hardware consistently gives results around -10 to fishtest

lofty cedar
#

So we could finally test free from bias.

rocky vigil
#

In ~30 min I can set up stc smp for 1280

green moat
violet badger
#

I'll have smp results soon, but they won't get us further before the other PR is fixed

rocky vigil
#

If Shawn merges the pr in the middle

rocky vigil
#

maybe this time also stc smp

twilit oriole
#

Why not do LTC Vs pre SPSA net and get a green finally

#

May be good to put things back into perspective

rocky vigil
#

What was the best pre-spsa net

twilit oriole
#

I mean take the one vondele trained recently

#

That was -5.4 Elo without SPSA to master

rocky vigil
twilit oriole
#

Sprt ig

violet badger
#

I think that's not particularly useful

#

just add 5 Elo to that result.

twilit oriole
#

Well it allows a sprt to be performed which gives a higher guarantee of pre SPSA superiority. Also I think doubling the training time of all the stages is still something to attempt

violet badger
#

yeah, I think training for a bit longer is something that needs to be done.

#

but I suspect the gain is going to be small to be honest.

twilit oriole
#

Yes. Maybe 1 Elo lol

twilit oriole
#

For whatever reason the big machines did not perform the same as other machines on STC or LTC tests (1 thread)

violet badger
#

so this is not understood?

twilit oriole
#

Yeah

violet badger
#

(as in mmap not solving this)..

#

instruction cache?

prime mica
#

shared memory is implemented in plentychess?

twilit oriole
#

Well maybe it's something to do with only 32MB being real L3 cache. Just a speculation

#

Like the rest has to go through the infinity fabric

violet badger
#

one big shared L3 on the 72 core testing..

#

(afaik)

prime mica
#

good things come to those who wait

violet badger
#

korean saying

rocky vigil
#

@frosty imp can we merge again

frosty imp
#

merged

rocky vigil
#

cool

#

lemme set up some progtests ig

#

and the 1280

green moat
#

(sorry for the ping 😭 )

regal steeple
#

@rocky vigil Your progtest still uses the pre merge version, is that intended?

rocky vigil
#

I might’ve forgotten to push origin

prime mica
#

oof

rocky vigil
#

Uh stop the stc

#

Leave the stc smp

#

I cannot fix it right now

#

Not at pc

#

If you want you can submit a progtest ig

#

Otherwise it’ll be ~2 hours

regal steeple
#

Im not an approver, I cant stop the STC either

regal steeple
twilit oriole
lofty cedar
#

Can anyone run speedtest?

#

Locally, I found some improvement.

#

But not sure how it works on other machines.

prime mica
#

sure!

#

same comparison as in the fishtest?

violet badger
#

at 72 threads:

#
   # PLAYER                       :  RATING  ERROR   POINTS  PLAYED   (%)
   1 1280-nn-71f4e3cc3782.nnue    :     0.1    2.7  15365.0   30720    50
   2 master                       :     0.0   ----  62236.0  122880    51
   3 1024-nn-26b0e5126117.nnue    :    -2.9    2.9  15236.0   30720    50
   4 0896-nn-7347b2877a12.nnue    :    -7.6    2.9  15032.5   30720    49
   5 0768-nn-914a5c3a46dc.nnue    :    -8.1    2.7  15010.5   30720    49
lofty cedar
lofty cedar
#

Maybe we'd need VVLTC to get it passed on fishtest.

lofty cedar
prime mica
#

setting it up rn

#

trying to remember how to use git

#

I'm getting different benches between threat_inputs and your branch

#

what am I doing wrong

#

oh oops I'm using the wrong one

#

lololl

lofty cedar
#

This one was without the two last PRs.

#

Threat input moves quickly.

prime mica
#

I feel threatened

#

ok let's see, checking out the older commit

violet badger
prime mica
#

ok bench is the same now

lofty cedar
#

But maybe 1024 could make tunes easier to test?

#

IDK.

prime mica
#

Result of 100 runs

base (...ish_0553b61e) = 1388113 +/- 1838
test (./stockfish _af2e862 ) = 1440920 +/- 1520
diff = +52807 +/- 2231

speedup = +0.0380
P(speedup > 0) = 1.0000

#

hopefully I did that right

#

keep in mind my computer behaves way off the mean worker on fishtest so idk

lofty cedar
#

Ooh... quite a nice speedup. Almost 4%!

#

Mine was somewhere around 2% I think.

prime mica
#

huzzah

#

what is the change?

lofty cedar
#

Well, you monomorphize cases with small loop.
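
One way to read "monomorphize cases with small loop", as a hedged sketch rather than the actual patch: make the row count a template parameter for the common small cases so the compiler can unroll them, and fall back to a generic loop otherwise. The function names, the int16 accumulator type, and the cutoff of 3 are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Row count N is known at compile time, so the outer loop can be unrolled.
template <std::size_t N>
void add_rows_fixed(std::int16_t* acc, const std::int16_t* const* rows, std::size_t width) {
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t i = 0; i < width; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + rows[r][i]);
}

// Fallback for larger, runtime-sized updates.
void add_rows_generic(std::int16_t* acc, const std::int16_t* const* rows,
                      std::size_t count, std::size_t width) {
    for (std::size_t r = 0; r < count; ++r)
        for (std::size_t i = 0; i < width; ++i)
            acc[i] = static_cast<std::int16_t>(acc[i] + rows[r][i]);
}

// Dispatch once at the top: small counts get their own specialised instance.
void add_rows(std::int16_t* acc, const std::int16_t* const* rows,
              std::size_t count, std::size_t width) {
    assert(acc != nullptr);
    switch (count) {
    case 0: return;
    case 1: return add_rows_fixed<1>(acc, rows, width);
    case 2: return add_rows_fixed<2>(acc, rows, width);
    case 3: return add_rows_fixed<3>(acc, rows, width);
    default: return add_rows_generic(acc, rows, count, width);
    }
}
```

The results are identical on every path; only the generated code differs, which is why this kind of change has to be judged by speedtest and fishtest rather than by inspection.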

prime mica
#

oh that is smort

#

what's the current threat_inputs branch? Still shawn's?

lofty cedar
#

Yes.

prime mica
#

ok now that I have it set up I'll take a look

rocky vigil
#

it's always shawns unless said otherwise lol

prime mica
#

lol

#

benevolent dictator

violet badger
#

I think our target should be to not regress at the normal LTC and LTC SMP conditions.

#

(in fact be stronger at those).

rocky vigil
#

smp is looking decent

#

i think single thread will be most of the issue going forward

lofty cedar
#

I think it's maybe stronger at normal LTC already... if not for the fishtest condition.

rocky vigil
#

i would be surprised actually if the current net caused a (significantly) larger download after compression

prime mica
#

as I'm going through the code... is there a high-level description of the threats architecture anywhere?

lofty cedar
lofty cedar
prime mica
#

sure but like

#

what is a "threat" hahaha

lofty cedar
#

The only change is in the threat... where each threat is an input.

#

Well, a potential where a piece can capture another piece iirc?

rocky vigil
prime mica
#

gotcha

rocky vigil
#

so like

lofty cedar
#

Though it doesn't include pinned piece logic iirc.

rocky vigil
#

white pawn on b2 attacks white pawn on c3

#

yeah

prime mica
#

and the feature index is (square, square) or (square, piece, square) or what

rocky vigil
#

can't account for pins bc that would take way too long

prime mica
#
```
    Square pc_sq, threatened_sq;
```
#

oh wow so it's the full (piece 1, piece 2, square 1, square 2)
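
To make the (piece 1, piece 2, square 1, square 2) idea concrete, here is a deliberately naive dense index over all four components. This is only an illustration of the feature space: the name `dense_threat_index` and the 12-piece by 64-square layout are assumptions, and the real index in the branch is more compact since impossible and de-duplicated threats are not given slots.

```cpp
#include <cstddef>

constexpr std::size_t NumPieces  = 12;  // 6 piece types x 2 colours (assumed)
constexpr std::size_t NumSquares = 64;

// Row-major index over (attacker piece, attacker square, victim piece, victim square).
constexpr std::size_t dense_threat_index(std::size_t attacker, std::size_t from,
                                         std::size_t victim, std::size_t to) {
    return ((attacker * NumSquares + from) * NumPieces + victim) * NumSquares + to;
}

// Upper bound on the number of features under this naive layout.
constexpr std::size_t DenseThreatFeatures = NumPieces * NumSquares * NumPieces * NumSquares;
```

Under this layout there would be 589,824 slots, most of which can never be active, which gives a feel for why a tighter index function or lookup table is used instead.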

twilit oriole
#

The target is already reached anyways, moving net size unnecessary. The SMP test on fishtest is neutral without an spsa to master

rocky vigil
#

one of the lower hanging fruits is to test if replacing get_feature_index with a lookup table is worth it

prime mica
#

oh interesting, so you do de-duplicate threats where one threat implies the other

rocky vigil
#

yep!

prime mica
#

smort

rocky vigil
#

it's prob worth a few % in speed

#

not that it's ever been tested

prime mica
#

do you de-duplicate pawn->bishop, bishop->pawn or is that not possible

twilit oriole
#

The pdf at the first post gives a lot of this info already

prime mica
#

oh I didn't know there was one

prime mica
#

ok reading, thx

rocky vigil
#

you replace a popcount and multiple lookups to small arrays

violet badger
twilit oriole
#

It's a L1 3072 net

violet badger
#

that's -100 Elo on fishtest 😉

#

not quite, but we can extrapolate the graph.

#

#1336647760388034610 message

rocky vigil
#

if we were really pushing it probably 1536 is optimal at TCEC conditions

violet badger
#

again, tcec can't be the goal for us.

rocky vigil
#

would depend on speed yes

#

it might also be okish at LTC SMP

#

but would definitely clock a double digit loss at stc

#

1024 is good

#

a nice number

twilit oriole
#

Yes but obviously the test was not intended for any type of net size info... It's to demonstrate the concept only

#

Clearly it was adequate enough to do that given we are here

prime mica
#

ok I think this makes sense now

twilit oriole
#

It's using WDL 1, trained on captures and checks etc. it's a monty net plugged into SF lol

rocky vigil
#

i still think we have a couple of tricks to pull on 1024

#

definitely will require more effort to squeeze out last elo though

violet badger
#

ah, I now see it is a fixed node test.... so well, it means virtually nothing.

twilit oriole
#

It means the threat inputs are worth something over the regular net. It is an important basis to establish at the start

violet badger
#

so, first, obviously, this is still nice work etc, all appreciated. but even at same L1, the net is bigger right? So it is quite logical it is better?

rocky vigil
twilit oriole
#

I disagree tbh

#

There was work on it well before that (obviously yukari helped)

violet badger
#

in SF 😉

#

but I think the discussion on history doesn't really matter to be honest.

#

I'm still most interested in getting a better SF out of this.

twilit oriole
rocky vigil
#

it does look like my prediction of -5 or -6 for stc progtest will be accurate

#

much of it probably from net

violet badger
#

so, I think we'll probably still get 1-2Elo from net squeezing..

prime mica
#

🍋

rocky vigil
#

pre-spsa i guess

#

i want spsa to be a final step though

violet badger
#

and we skip spsa 😉

prime mica
#

lol

twilit oriole
#

The spsa can be done later. It is inevitable anyways

rocky vigil
#

like it makes sense to do it at the end

#

once the process is ironed out

#

in particular I would hope for i8 quantization tests before that

violet badger
#

I think that's an example...

prime mica
#

how many core hours were spent on the SPSA last time

violet badger
#

I guess a few million games at VLTC?

prime mica
violet badger
#

and like it makes further testing so much more difficult.

rocky vigil
#

each individual one is what 60k at ltc smp?

twilit oriole
#

I don't think it is necessary that many. It was done in many stages because it had never been done before at that scale

violet badger
#

take i8 as an example

rocky vigil
#

yeah

violet badger
#

if it comes after spsa, it is almost a lost case.

#

It makes incremental tweaks to training almost impossible.

rocky vigil
#

the downside of spsa is that you need to compare Y + spsa vs X + spsa in every further test

#

i.e. if master net didn't have spsa we would already be beating it

twilit oriole
#

I made these arguments before when the spsa stages started stacking up and didn't seem to matter too much then lol

rocky vigil
#

i think since times have changed, linrock largely moved on

#

and then the net got stuck

#

so now opinion on that is different

lofty cedar
#

But since Stockfish usually just accepts local improvements, it often means that SPSA gets accepted easily.

twilit oriole
#

You can make the counter argument that Elo is now rarer. I don't think the conditions changed all that much

violet badger
#

but hard to resist Elo ..

lofty cedar
#

I mean... SPSA-ing the net is often a way to gain easy elo.

violet badger
#

but easy to get into a dead-end.

lofty cedar
#

I mean... it should be a final stage where nothing seems to be improving anymore.

prime mica
#

lol

#

there is something grotesque about spsa

twilit oriole
#

Well I mean if you don't have a rule against it obviously ppl are going to do it lol

lofty cedar
#

So, maybe we should set a period of say 6 months and if no new net comes out we SPSA.

#

Another thing we could do is mention it somewhere in the wiki or somewhere that newly trained nets should be compared to pre-SPSA nets.

twilit oriole
#

Anyways this isn't actually threat net specific, can move to nnue dev. It is only coming up now because the regular arch had no new nets

rocky vigil
#

I am really hoping that like we get this through and it boosts maybe morale or smth, since it must feel bad to have had the exact same net all the way for almost a year now

#

like, we show that master net is not invincible, and maybe then some floodgates will open

lofty cedar
#

Do we try our chance with LTC SMP SPRT now? And then VLTC SMP (aka VVLTC).

If it gains, we merge the threat input.

twilit oriole
#

No

lofty cedar
#

What's left?

twilit oriole
#

Have some patience

rocky vigil
#

we still have speedup ideas left to try

#

while we try those in the meanwhile

#

we can wait about a week to see if double training time

#

squeezes out anything further

lofty cedar
#

Oh, okay.

rocky vigil
#

imo we should only do the (v)ltc smp as a formality

#

like only do it when we know it'll pass

twilit oriole
#

just don't do it at all?

rocky vigil
#

i think it has to be done before merging

lofty cedar
#

I mean we do kinda want VVLTC as a progression test anyway.

rocky vigil
#

so like eventually

twilit oriole
#

A SMP STC and LTC SMP is all that is needed

#

I do not see where we need a VLTC SMP

#

That is not a normal test TC

rocky vigil
#

i think is maintainer decision

#

whether we need non-smp ltc

#

vondele indicated he would prefer at least a nonreg on ltc

#

which I think we are also close to

#

maybe, -3 at ltc rn

twilit oriole
#

This is not answering the question

rocky vigil
#

on fishtest conditions

twilit oriole
#

Where does that mean we need a VLTC SMP

rocky vigil
#

ltc is good enough

twilit oriole
#

The SMP outperformance is from that similar threats are active across a search. So 1 multi threaded search benefits from this. I observed with the regular net also but not as severe

#

It's not a mystery really

rocky vigil
#

so essentially that the memory is better able to optimize for hot indices in threats?

twilit oriole
#

Yes

#

Less cache misses

violet badger
#

measure with perf ....

rocky vigil
#

then why would i8 be worse at smp than normal

twilit oriole
#

(And out of error also)

rocky vigil
twilit oriole
#

No because it is a question lol. Did u measure speed at all

rocky vigil
#

oh i was referring to monty results for this

#

i ofc have not measured speed

rare jacinth
#

what is the threat inputs vs master pair ratio right now

twilit oriole
#

You missed the huge overlapping error bars I suppose

rocky vigil
#

oh right

#

average megagainer in [0, 4] sprt

rocky vigil
#

tbh

#

the error bars on this are also more than I would like

violet badger
#

I've put it to prio -1, but will let @frosty imp stop it himself (or modify back to 0 if needed).

rocky vigil
#

in any case i think we can let yoshie be the trailblazer for i8 quant in a/b

violet badger
#

would be happy, doesn't the QAT need some changes to the trainer?

rocky vigil
#

yeah there is major work involved in this

twilit oriole
#

The i8 quant is already close in plenty and has more advantage in SF

rocky vigil
#

ofc will probably be worth it

#

at some point

twilit oriole
#

Larger nets lose less at fixed nodes and have greater speedup

rocky vigil
#

just like how lazy threat calculation should also be worth it

#

i need to pull up that discussion but i think the main bottleneck is duplicating some make/unmake logic

#

like the way it's structured right now is that the NNUE knows minimal about how the position works

#

it pretty much gets fed the differences per position

#

and runs that through

#

so if we wanted to do lazy threats we either couple NNUE tighter with position or duplicate some essential position logic for those calculations

#

both of these are nontrivial

rare jacinth
#

is there a speedup from the fact that a lot of threats are bidirectional (say rooks on the same file) so say we can merge the weights for a pawn on a2 and bishop on b3? not sure how much has been explored already, please link me relevant material

twilit oriole
#

see the pdf on the first post. rough outline of how things are

rare jacinth
#

I see that you handle all non-pawn symmetries, are pawn-bishop and pawn-queen bidirectional encodings too rare to matter?

rare jacinth
#

if we did I would have found all of your speedups before lol

prime mica
#

lazy SMP

#

improvement on threat_inputs from specializing at the top for different pairs (added.size(), removed.size()) with total less than 4

twilit oriole
prime mica
#

not sure whether it's independent from what Alice did tho

#

oh I see, it's similar but it's pulling it out of the loop

#

interesting