#UE Threat Inputs for AB

rocky vigil
#
```
Engine::load_networks()
network::load(rootDirectory, evalfilePath)
network::load_internal()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()
network::load()
Read L1 network::read_parameters()
FT::read_parameters()```
#

I don't understand how it reads 3 times

frosty imp
#

where did you add the cout

rocky vigil
#

uh

#

latest commit I pushed just now

#

has all the debug couts

#

i know at least one issue is that it is expecting a non-empty header

#

which it never gets

frosty imp
#

one call from load_internal

#

one call from load_user_net

rocky vigil
#

oh huh

#

load_user_net

#

hmm

frosty imp
#

another call from load_user_net

rocky vigil
#

but shouldn't they all trigger this cout

#

unless there are 3 dirs

#

right

#

ok

#

so the load of L1 fails without telling me it fails

#

huh

#

stream.fail is false here apparently, according to this last cout

#

therefore this should return true

#

oh lmao @frosty imp inserting the std::cout here breaks the one line if statement
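The pitfall described here is easy to reproduce. A minimal sketch, assuming the original check was a brace-less one-liner shaped like `if (stream.fail()) return false;` (the function names below are illustrative and not taken from the Stockfish source):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// As typed, the indentation suggests both lines are guarded by the if:
//     if (stream.fail())
//         std::cerr << "debug: stream failed\n";
//         return false;
// As parsed, only the first statement belongs to the if:
bool read_ok_buggy(std::istream& stream) {
    if (stream.fail())
        std::cerr << "debug: stream failed\n";
    return false;  // runs unconditionally, even for a healthy stream
}

// With braces, the debug print and the return stay together.
bool read_ok_fixed(std::istream& stream) {
    if (stream.fail()) {
        std::cerr << "debug: stream failed\n";
        return false;
    }
    return true;
}

// Both functions see a healthy stream; only the fixed one reports success.
inline void self_check() {
    std::istringstream good("payload");
    assert(!read_ok_buggy(good));
    assert(read_ok_fixed(good));
}
```

Bracing every `if` body before adding temporary debug output avoids this class of silent failure.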

#

gah

#

i hate my life

frosty imp
rocky vigil
#
```
eval
Engine::verify_networks()
network::verify()
Current path: nn-98b68b5a9455.nnue
info string NNUE evaluation using nn-98b68b5a9455.nnue (7MiB, (15776, 256, 1))


 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
| +631  | +614  | -43.2 | -62.5 |       | -37.0 | +616  | -56.6 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
| +619  | +621  | +573  | +343  | +231  | +506  | +623  | -76.1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
| +579  | +245  | +58.9 | +285  | +376  | +23.3 | +245  | +573  |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
| +53.8 | +255  | +46.8 | +362  |       | +304  | +254  | +53.3 |
+-------+-------+-------+-------+-------+-------+-------+-------+


NNUE evaluation        -271.85 (white side)
Final evaluation       +27.16 (white side) [with scaled NNUE, ...]```
#

ah yes

#

we love to see it

#

ok I forgot to divide by QA*QB

#

that explains the hilariously high values

#

nvm I didn't

#

oh

#

I forgot to do CReLU

#

average x^2 activation
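For context, a rough sketch of what a single-layer quantized output with CReLU and the QA*QB division might look like. The constants (`QA = 255`, `QB = 64`, `SCALE = 400`), the toy width, and the function names are assumptions in the style of common bullet-trained nets, not values confirmed by this thread, and the squared-activation variant mentioned above is not shown:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed quantization constants; not taken from the log.
constexpr std::int32_t QA    = 255;  // accumulator (feature transformer) scale
constexpr std::int32_t QB    = 64;   // output weight scale
constexpr std::int32_t SCALE = 400;  // network output to centipawn-like units
constexpr std::size_t  HIDDEN = 4;   // toy width for illustration

// Clipped ReLU on a quantized accumulator entry: clamp to [0, QA].
inline std::int32_t crelu(std::int16_t x) {
    return std::clamp<std::int32_t>(x, 0, QA);
}

// Dot product of activated accumulator with output weights, then undo
// both quantization scales. The bias is assumed to be stored at QA*QB.
std::int32_t output_cp(const std::array<std::int16_t, HIDDEN>& acc,
                       const std::array<std::int16_t, HIDDEN>& weights,
                       std::int32_t bias) {
    std::int32_t sum = bias;
    for (std::size_t i = 0; i < HIDDEN; ++i)
        sum += crelu(acc[i]) * weights[i];
    return sum * SCALE / (QA * QB);
}
```

Skipping the clamp lets raw accumulator values such as -20337 reach the dot product, and skipping the division leaves the result inflated by QA*QB, which matches the kind of symptoms reported above.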

#
```
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
| -0.00 |  0.00 | -0.00 | -0.01 |       | -0.00 |  0.00 | -0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
|  0.00 |  0.00 |  0.00 | -0.00 | -0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
| -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
| +0.66 | +0.49 | +0.40 | -0.02 |       | -0.35 | -0.46 | -0.18 |
+-------+-------+-------+-------+-------+-------+-------+-------+


NNUE evaluation        +2.60 (white side)
Final evaluation       +9.55 (white side) [with scaled NNUE, ...]```
#

aha

#

it still doesn't work

rocky vigil
#

yay /s

#

evaling startpos

#

between different searches

#

gives different results

#

yeah I think I have UB somewhere

#

when typing 'eval' command twice in a row results in different outputs

#

there is nondeterminism happening here and I can't find it gah

#

only good news is I think active features are correctly computed

violet badger
#

time for a run through valgrind ...

#

but that looks like progress

rocky vigil
#
```
-20337
25
0
-5344
-20337
25
0
-5312
-20337
25
0
-8224
-20337
8
0
-5344
-20337
25
0
-8256
-20337
9
0
9
0
0
0
8
0
7
0
-5308
-20337
25
0
-5344
-20337
25
0
-5308
-20337
25
0
-5344
-20337
25
0
2
0
9
0
9
0
0
0
-5476
-20337
25
0
-5680
-20337
25
0
-5476
-20337
25
0
1
0
51
0
-5352
-20337
25
0
-5360
-20337
25
0
-5352
-20337
25
0
1
0
51
0
-5360
-20337
25
0
-6992```
#

these look like normal accumulator values...

#

ok what the FT expects to output

#

does not remotely match what the second layer receives as input

#

huh I guess I'll debug this tmrw

rocky vigil
#

Ok inference works now

#

As in it plays superhuman chess

#

Uh

#

Got a HalfKAv2hm net

#

Or so

#

(Also single layer)

rocky vigil
#

I estimate an 8x gain with ue is reasonable

#

So overall at the same size perhaps 2-3x slower

#

In the midgame, significantly better than 8x should also be possible

#

Considering that I also compute the psq features from scratch

#

Also would it be fine to run a fixed nodes test on fishtest later (assuming I get also a halfkav2hm -> 256 net)

#

Or should we wait for more work on training side

rocky vigil
#

@formal smelt is the deduplication strategy to always take the lower index feature of a pair

round stone
#

There’s an L1-256 multilayer net that’s reasonably strong in stockfish nnue format, used by lichess, that i can find later

rocky vigil
#

Ok yeah that would work

round stone
#

Alright i’ll find it later, afk now

rocky vigil
#

When you do that can you also remove the “bulletbullet” padding at the end of skip-bm, rename it to nn-98b68b5a9455.nnue and upload it to fishtest

round stone
#

You mean for future bullet nets? The L1-256 i was going to find later is already uploaded on fishtest somewhere

rocky vigil
#

Yeah

#

To get it on fishtest

#

So maybe I can start a fixed nodes test

#

With more compute than just my laptop

round stone
#

Sure, you got inference working with the arch of those nets?

rocky vigil
rocky vigil
round stone
#

Alright np

rocky vigil
#

Non-ue

round stone
#

Ok i can upload those later too. Feel free to upload too if you want to test sooner

formal smelt
rocky vigil
#

Wait I claim the (threat-256)-1x8 still has padding

rocky vigil
round stone
rocky vigil
#

Ok yeah I’ll just wait for the net(s) to be found

#

And then I can set up a test

foggy wind
twilit oriole
rocky vigil
rocky vigil
#

I don’t have a branch for base halfkav2hm yet though

#

Might need to wait a couple hours for that

rocky vigil
#

Yes I will not be back at computer for an hour and a half

rocky vigil
#

(With 98b68b5a9455)

twilit oriole
#

Oh I can't test till tomorrow evening. You should be able to put it on one of the OB instances if you need a fixed nodes I think

rocky vigil
#

Ah I see

#

Doesn’t OB need additional modifications

twilit oriole
#

Don't think so

rocky vigil
#

Oh right it works bc auto download net

#

So you don’t need to do any makefile shenanigans

round stone
#

early fixed nodes results:

Results of ./sscg13-sf/src/stockfish vs ./Stockfish-256/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 42.92 +/- 8.13, nElo: 60.53 +/- 11.34
LOS: 100.00 %, DrawRatio: 42.01 %, PairsRatio: 1.79
Games: 3604, Wins: 1472, Losses: 1029, Draws: 1103, Points: 2023.5 (56.15 %)
Ptnml(0-2): [69, 306, 757, 453, 217], WL/DD Ratio: 3.40
#
Results of ./sscg13-sf/src/stockfish vs ./Stockfish/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -159.73 +/- 6.80, nElo: -260.30 +/- 9.65
LOS: 0.00 %, DrawRatio: 22.95 %, PairsRatio: 0.07
Games: 4976, Wins: 561, Losses: 2700, Draws: 1715, Points: 1418.5 (28.51 %)
Ptnml(0-2): [503, 1282, 571, 115, 17], WL/DD Ratio: 2.59
rocky vigil
#

Ok not bad for single layer

#

What is the approximate speed difference just curious?

#

I estimate that full threats are roughly the same speed (maybe even slightly faster) compared to simplified so this is promising

round stone
#
Result of  10 runs
==================
base (...rc/stockfish) =     238162  +/- 123
test (...rc/stockfish) =    2636398  +/- 8043
diff                   =   +2398236  +/- 7955

speedup        = +10.0698
P(speedup > 0) =  1.0000
#

non-UE threats less than 1/10 the speed of L1-256

rocky vigil
#

Yeah ok

#

non-ue doesn’t ue the psq inputs either

round stone
#
Result of  10 runs
==================
base (...rc/stockfish) =     238051  +/- 159
test (...rc/stockfish) =    1114922  +/- 4646
diff                   =    +876871  +/- 4576

speedup        = +3.6835
P(speedup > 0) =  1.0000

this is speed vs. master

rocky vigil
#

What positions primarily consist of

#

In this test do you know

#

I think ue can be anywhere between 4 - 16x faster

#

Depending on the stage of the game

round stone
#

these are speeds based on whatever is in the stockfish bench position list

rocky vigil
#

Ah ok

rocky vigil
#

Yeah mostly midgame

#

Then probably over 10x speedup from good ue

#

And properly optimized vector operations

#

If you have time I’ll also try and implement full threat inputs later

#

According to the fixed montytrain

formal smelt
#

i think that might be a bit optimistic

rocky vigil
#

Well you are going from ~70 avg features processed to ~8 in the midgame

#

And you are also going from compiler autovec to proper SIMD kernels

formal smelt
#

the compiler autovec for the updates alone is going to be basically perfect

#

its addition/subtraction in a loop

rocky vigil
#

Hmm I see

#

Btw how expensive do you think looping through all the pieces to get all active threats is

#

Compared with an accumulator update

formal smelt
#

why dont you just time how long it currently takes to calculate all the indices

#

and compare it to the average time per accumulator update

rocky vigil
rocky vigil
#

Ok writing difference of vectors of size ~ 800k elements (of which ~750k shared) is 2-3 msec

#

Meaning that for vectors of size ~80 the time is negligible

rocky vigil
#

@formal smelt a couple questions regarding full inputs

  1. Pawn doesn’t distinguish between the exact type of piece it attacks, only enemy/friend? why is this
  2. MAP[] appears to be color insensitive, I assume I’m missing something?
  3. pawn->bishop, pawn->queen, pawn->king, bishop->queen, rook->queen, king->queen are the features excluded by deduplication right? And then pawn->enemy pawn or piece->piece is by whether from < to? But what about the cases of King->Rook and King->Bishop that are duplicated? I don’t see the code currently handling that
#

It is relatively simple for me to modify simplified input code to process full inputs after I know all the details

formal smelt
#
  1. The only case color would matter is for pawns and you can see that gets handled separately
#
  1. You can see those are in fact handled in map_king_threat
rocky vigil
formal smelt
#

Ah bruh

#

I copy pasted the fix from simplified threat inputs

rocky vigil
#

or is that only for spsa.

#

actually I think it works but will be cursed

lofty cedar
#

-160 against master takes a lot to overcome.

rocky vigil
#

stockfish expects [i16 L1 bias LEB128] [i16 L1 weights LEB128] [i32 L1 PSQT LEB128] [i16 L2 bias little-endian bucket 1] [i16 L2 weight little-endian bucket 1] [i16 L3 bias little-endian bucket 1] [i16 L3 weight little-endian bucket 1] ... (bucket 2) (bucket 3) ... (bucket 8)
but there is a thing where the input sizes for L2, L3, L4 are padded to a multiple of 32
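A small sketch of the padding rule, assuming "padded to a multiple of 32" means rounding the input count up to the next multiple of 32. The helper name and the example sizes are illustrative only:

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of base (base = 32 for the layer inputs
// discussed above). Hypothetical helper, not a Stockfish function.
constexpr std::size_t pad_to_multiple(std::size_t n, std::size_t base = 32) {
    return (n + base - 1) / base * base;
}

// Illustrative sizes: anything from 1 to 32 occupies 32 slots, 33 needs 64.
static_assert(pad_to_multiple(30) == 32);
static_assert(pad_to_multiple(32) == 32);
static_assert(pad_to_multiple(33) == 64);
```

A writer that emits unpadded rows would shift every later weight, so the export has to emit the padded row length even when the trained layer is narrower.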

rocky vigil
lofty cedar
#

Oh, well, I see.

#

Also, would the later layer's bucket still be indexed the same way?

#

Or would we go directly to the indexing scheme where most buckets got 3 pieces count?

rocky vigil
#

this might have to be tested

lofty cedar
#

We're resetting the board.

#

If we wait, there would be more and more sunk costs into suboptimal data training and so on.

rocky vigil
#

hmm

#

I'm not too familiar with training side so

lofty cedar
rocky vigil
#

the net is only single layer

#

no multilayer yet

#

and there are still other things we can test

#

like full inputs vs simplified

#

ultimately I think it comes down to how fast we can make UE

#

rn I compute entirely from scratch (including psq inputs, so in midgame there are an average of like 50-60 features per side)

rocky vigil
formal smelt
#

You can use a callback to run whatever code that writes it like that when you’d usually be saving a checkpoint

rocky vigil
#

if the LEB128 is a problem I can hack sf code to load it in little-endian and write it in LEB128

formal smelt
#

I’m sure there is a leb128 crate you can use
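For reference, signed LEB128 stores an integer in 7-bit groups, least significant group first, with the high bit of each byte marking continuation. A minimal sketch of encoding and decoding one value, written in C++ rather than Rust; the function names are illustrative, and any header or framing Stockfish places around the compressed block is deliberately not reproduced here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode one signed value as LEB128 bytes.
// Relies on >> being an arithmetic shift for negative values
// (guaranteed since C++20, universal in practice before that).
std::vector<std::uint8_t> encode_sleb128(std::int32_t value) {
    std::vector<std::uint8_t> out;
    bool more = true;
    while (more) {
        std::uint8_t byte = static_cast<std::uint8_t>(value & 0x7F);
        value >>= 7;
        const bool sign_bit = (byte & 0x40) != 0;
        // Stop once the remaining bits are pure sign extension.
        if ((value == 0 && !sign_bit) || (value == -1 && sign_bit))
            more = false;
        else
            byte |= 0x80;  // continuation flag
        out.push_back(byte);
    }
    return out;
}

// Decode one signed LEB128 value from the start of `bytes`.
std::int32_t decode_sleb128(const std::vector<std::uint8_t>& bytes) {
    assert(!bytes.empty());
    std::uint32_t result = 0;
    unsigned shift = 0;
    std::size_t i = 0;
    std::uint8_t byte = 0;
    do {
        assert(i < bytes.size());
        byte = bytes[i++];
        result |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    // Sign-extend if the last group had its sign bit set.
    if (shift < 32 && (byte & 0x40))
        result |= ~static_cast<std::uint32_t>(0) << shift;
    return static_cast<std::int32_t>(result);
}
```

Small weights take one byte instead of two under this scheme, which is where the size saving on mostly-small i16 parameters comes from.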

rocky vigil
#

yeah ig linrock can go for a multilayer simplified net

#

using bullet

#

when he has time

#

and I'll figure out the formatting and whatever

formal smelt
rocky vigil
#

should I attempt to use this net at all or should I wait for changes to get pushed

rocky vigil
#

@frosty imp how do you think I could try and have accumulator caches go by ply, instead of by ksq-color

#

sort of like

#

to evaluate this position

#

we compute its active features

#

then we look backwards to find the latest computed ply

#

and try to ue the difference from that

frosty imp
#

imo accumulator caches are not necessary for this kind of inputs

#

since there is no need for full refreshes

rocky vigil
#

compute all attacks

rocky vigil
frosty imp
#

just use update_accumulator_incremental for everything

rocky vigil
#

uh

#

then I need to pass a pos

#

into it

#

instead of whatever is currently passed

frosty imp
#

i think you can just extend dirtypiece

#

to contain the threat indices too

#

or just pre-calculate every changed index on makemove

rocky vigil
#

well my ue plan was to just compute every active threat index and then remove the ones that are already there
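The plan described here can be sketched as a two-pointer walk over two sorted index lists. This is only an illustration under stated assumptions: both lists are sorted and duplicate-free, `std::vector` stands in for whatever container the branch actually uses, and `diff_sorted` is a hypothetical name rather than the author's function:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using IndexList = std::vector<std::uint32_t>;

// Given the active feature indices of the previous and current position
// (both sorted ascending, no duplicates), write the indices that were
// deactivated into `removed` and the newly activated ones into `added`.
void diff_sorted(const IndexList& prev, const IndexList& curr,
                 IndexList& removed, IndexList& added) {
    assert(std::is_sorted(prev.begin(), prev.end()));
    assert(std::is_sorted(curr.begin(), curr.end()));
    removed.clear();
    added.clear();
    std::size_t i = 0, j = 0;
    while (i < prev.size() && j < curr.size()) {
        if (prev[i] == curr[j]) {          // shared feature: no update needed
            ++i;
            ++j;
        } else if (prev[i] < curr[j]) {    // only in prev: deactivated
            removed.push_back(prev[i++]);
        } else {                           // only in curr: activated
            added.push_back(curr[j++]);
        }
    }
    while (i < prev.size()) removed.push_back(prev[i++]);
    while (j < curr.size()) added.push_back(curr[j++]);
}
```

The walk is linear in the combined list length, so for lists of a few dozen indices its cost should be small next to the accumulator row additions it saves.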

frosty imp
#

oh ok

rocky vigil
#

this is simplest

#

for now

frosty imp
#

then just store all active indices in dirtyPiece

#

maybe rename it to something else

rocky vigil
#

what is stateinfo btw

frosty imp
#

it's the copy-make part of the board object

#

sf does partial copy-make

rocky vigil
#

ok

#

also I have to ue both psq and threats

#

so uh

#

I guess I can just define a new structure

#

along with dirtypiece

rocky vigil
frosty imp
#

maybe not a vector

#

SF creates a new stateinfo object in the beginning of every search call

#

so gotta allocate on the stack

rocky vigil
#

huh

#

ok

#

array of length 128 works as well

frosty imp
#

use ValueList

#

defined in misc.h i think

rocky vigil
#

bruh valuelist is like exactly what I've been using in nnue code

frosty imp
#

ah ok

rocky vigil
#

valuelist is basically like a vector right

frosty imp
#

yeah

#

fixed-capacity vector

rocky vigil
#

also the ue accumulator updates will probably just be for loops

#

jw says compiler autovec is essentially perfect for simple arithmetic like that anyways

#

someone else can rewrite the simd tilings and other things if this actually works

frosty imp
#

sf nnue code is due for a refactor anyways

rocky vigil
#

btw I am open to other ue ideas more advanced than just "compute the threat difference from the last state and apply it (to both colors)"

lofty cedar
#

Granted, some code are hard to read, but they're out of necessity rather than because of bad engineering.

naive comet
#

the NNUE code is overly abstracted

lofty cedar
#

Hmm? What's the issue?

#

We need abstraction to contain the monstrosity that is the high-performance code.

candid ivy
#

Again stop talking nonsense

frosty imp
#

you can look at countless other engines for cleaner NNUE code

lofty cedar
#

I understand that high-performance code can sometimes be hard to read because well, the logic had to be complicated. You had to do some fancy tricks to go fast. There is no other way. For a performance-critical program, this is sometimes needed.
So, the best we can do is to abstract away those unreadable code into easy-to-understand functions.

But you're saying there are better ways?

frosty imp
#

not all abstractions are equal

#

we need abstractions != we need bad abstractions

lofty cedar
#

I agree that unreadable code are usually bad, but sometimes they're necessary evils, especially in a performance-critical software. What I would do when there is no other option is wrap them in "untouchable" functions.

frosty imp
#

well that hypothetical situation doesn't apply here

#

so maybe we can discuss that in another place

lofty cedar
frosty imp
#

you can argue for no. 2, but parts that aren't architecture-dependent is not clean either

lofty cedar
#

Though I guess that's a red herring. The NNUE logic isn't some sort of nightmarish code. It's just a straightforward SIMD code after all.

frosty imp
#

NNUE inference is not very complicated

naive comet
frosty imp
#

2?

naive comet
#

yes

#

this has strayed off the topic of UE threat inputs

#

we should move off

lofty cedar
#

Sure.

round stone
#

leb128 only matters if we have something that can beat the current master

rocky vigil
#

ok

#

well the issue is a full threat net

#

might be like 160 mb

#

for 80k -> 1024
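The 160 MB figure follows from simple arithmetic, sketched here under the assumptions that "80k" means exactly 80,000 inputs and that feature transformer weights are stored as uncompressed 16-bit integers (biases and later layers are ignored as negligible):

```cpp
#include <cassert>
#include <cstdint>

// Assumed figures: 80,000 inputs, 1024 hidden units, i16 weights.
constexpr std::uint64_t inputs           = 80'000;
constexpr std::uint64_t hidden           = 1024;
constexpr std::uint64_t bytes_per_weight = 2;

constexpr std::uint64_t ft_weight_bytes = inputs * hidden * bytes_per_weight;

// 163,840,000 bytes: about 164 MB, or 156.25 MiB.
static_assert(ft_weight_bytes == 163'840'000);
static_assert(ft_weight_bytes / (1024 * 1024) == 156);
```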

round stone
#

Getting any kind of UE working will be important for measuring baselines at TC. Currently still don’t know how far we actually are from master

#

Yea 160mb is huge, but fine for testing

#

Still unclear whether full threat 1024 UE will be strong enough vs. master at TC

rocky vigil
#

yeah I hope to get UE working tmrw

twilit oriole
twilit oriole
# round stone Getting any kind of UE working will be important for measuring baselines at TC. ...

It will be an important test but keep in mind it is still going to be missing some large optimisations.

The full threat input net is extremely sparse, so permuting it will yield 20+ Elo at STC

Shared net weights (e.g mmap) probably also gains 20+ at STC single threaded because memory accesses are less predictable with threat inputs than king buckets (need a larger portion of the net in cache)

formal smelt
#

that is indeed wrong lol

#

at least i think it is

#

@twilit oriole I have pushed fixes

rocky vigil
twilit oriole
#

its added complexity for a tiny gain. but sf is about tiny gains so sure you can add them if you want lol

#

ah it is mentioned

#

hmm actually. duplicate inputs here would only occur in check. in which case eval is skipped anyways?

#

hm for enemy at least. i guess there is still friendly

#

and i guess it is super common for a friendly rook to be next to the king lol

rocky vigil
#

Yeah king-rook would save a bit bc castling

#

It’s fine

#

If this case doesn’t have deduplication I can also make a small change to the indexing

#

Changing the indexing is pretty easy on my side

rocky vigil
#

@frosty imp I cannot take only stateinfo into update_accumulator_incremental because I need access to the entire board

#

So how should I format this

#

Unless stateinfo will also store 8 bitboards for the pieces

rocky vigil
#

Ok I think I might just add 8 bitboards to stateinfo

#

Gah code rewrite

rocky vigil
#

Aight appending active features only needs color bb, piece bb, and piece array now

rocky vigil
#

Untested incremental update function now up at my branch

rocky vigil
#

Lmao ue is up at my branch but only ~2x faster

#

I probably messed smth up

#

Yeah can someone test

#

And also maybe give suggestions on my hacked ue

rocky vigil
#

I think I am having some fundamental ue impl issue here

#

Not overhead

#

Because if my debug info is to be trusted

#

I have like 20 updates/color/position

#

After 1M nodes from startpos

#

This is approximately 3x better than no ue

#

But still seems very high

rocky vigil
#

Before declining down to 3

#

@frosty imp hints?

frosty imp
#

no idea

#

could you send diff

#

do some profiling?

rocky vigil
rocky vigil
frosty imp
#

just find a good profiler and follow the tutorials

rocky vigil
frosty imp
#

does the ue output match the eval starting from scratch

rocky vigil
#

Yes

#

Same bench

#

& everything

formal smelt
#

a 2x speedup is not unreasonable from UE

rocky vigil
#

What’s more concerning is why it declines from 5x to 3x

#

Over more time

formal smelt
#

wdym?

#

like if you search longer?

rocky vigil
#

Yes

formal smelt
#

weird indeed

rocky vigil
#

well might be getting trolled by laptop but there is still a noticeable slowdown

rocky vigil
#

how on earth is it still faster than non-ue then

rocky vigil
rocky vigil
#

ok but what is the issue there

frosty imp
#

maybe the update from scratch always got triggered?

#

like in the beginning nothing is computed

#

so i'd assume this always triggers

#

hmm yeah

rocky vigil
#

yeah but after 10 million nodes

frosty imp
#

i guess the fix is just to label the accumulator as computed

#

in update_accumulator_scratch

rocky vigil
#

I refuse to believe nothing is wrong

#

oh shoot

#

you may be right

rocky vigil
#

bruh

#

using update_accumulator_incremental is actually slower

#

lemme do a little bit of optimize

#

is it possible to clear a valuelist

frosty imp
#

yeah

#

write a function clear()

#

just set size to 0

rocky vigil
#

bruh there's no prebuilt function

frosty imp
#

pr it sabaping

rocky vigil
frosty imp
#

it doesn't

#

but you'd overwrite them anyway when pushing new elements

rocky vigil
#

oh wait they don't matter

#

like we can just loop until size

#

and not further

rocky vigil
frosty imp
#

yeah

rocky vigil
#

or will it also find the remaining elements as well

#

like if I push_back 1, 2, 3, 4, 5

#

then set size to 2

#

will the for loop only find the values 1, 2

frosty imp
#

yeah

rocky vigil
#

or all 5 values

frosty imp
#

everything will work
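To make the answer concrete, here is a minimal fixed-capacity list in the spirit of what is being discussed. It is a hypothetical stand-in (`FixedList`, `truncate`, and the member names are invented for illustration) and not a copy of Stockfish's `ValueList`:

```cpp
#include <cassert>
#include <cstddef>

// Fixed-capacity list living entirely on the stack: no heap allocation.
template<typename T, std::size_t Capacity>
class FixedList {
   public:
    std::size_t size() const { return size_; }

    void push_back(const T& value) {
        assert(size_ < Capacity);
        values_[size_++] = value;
    }

    // "Clearing" only resets the logical size. Old elements stay in the
    // array but are never visited and get overwritten by later pushes.
    void clear() { size_ = 0; }

    void truncate(std::size_t n) {
        assert(n <= size_);
        size_ = n;
    }

    // Range-for iterates over [begin, end), i.e. only the first size_ items.
    const T* begin() const { return values_; }
    const T* end() const { return values_ + size_; }

   private:
    T           values_[Capacity];
    std::size_t size_ = 0;
};
```

Because `end()` is derived from the logical size, pushing 1, 2, 3, 4, 5 and then shrinking the size to 2 makes a range-for loop visit only 1 and 2.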

rocky vigil
#

yay

#

nvm I broke smth

#

smth in this is broken

#

i don't know how it's broken

rocky vigil
#

i have no idea why this doesn't work

rocky vigil
#

anyways I have been unable to get incremental update to be faster at all

#

so rn the "ue" optimization is probably literally accumulator reuse

#

i might be really borking because my statistics show that using my code, incremental vs scratch computation are ~ the same amount of compute

#

which I really don't believe

#

ok it might be that my write_difference isn't working

rocky vigil
#

i actually need help

#

incremental doesn't work and I have no idea why

#

some comments:

#

assertion that the threats in computed stateinfo match the ones computed passes

#

assertion that the features are sorted passes

#

different results are returned using dirtypiece vs write_difference of psq

#

well

#

it is very probable that append_active_psq and append_active_threats indeed compute the correct indices

#

given that I have used these functions to do non-ue

#

and the version which doesn't use update_accumulator_incremental is also sound

#

so the failure point is probably in write_difference

#

but idk how that is wrong

#
```
 NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
|   r   |   n   |   b   |   q   |   k   |   b   |   n   |   r   |
|  0.00 |  0.00 |  0.00 |  0.00 |       |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   p   |   p   |   p   |   p   |   p   |   p   |   p   |   p   |
|  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|       |       |       |       |       |       |       |       |
|       |       |       |       |       |       |       |       |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   P   |   P   |   P   |   P   |   P   |   P   |   P   |   P   |
|  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
|   R   |   N   |   B   |   Q   |   K   |   B   |   N   |   R   |
|  0.00 |  0.00 |  0.00 |  0.00 |       |  0.00 |  0.00 |  0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+

NNUE evaluation        +0.20 (white side)
Final evaluation       +0.26 (white side) [with scaled NNUE, ...]```
#

this is the other trademark of broken update_accumulator_incremental

#

gets the eval right but completely borks the piece value estimation

rocky vigil
#

which has the same bench

#

alright I'm going to wait for someone else to help me debug

#

if you really wanna try some "optimized non-ue" in the meanwhile

#

^^also computes everything from scratch, but somehow is 2-3x faster

frosty imp
#

what part of UE is wrong

#

does making, say, e2e4 on startpos give you the wrong result via ue?
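One way to run this kind of check in isolation is to compare an incremental update against a from-scratch rebuild on a toy model, outside the engine. The sketch below uses made-up sizes, made-up weights, and hypothetical names throughout; it shows the shape of the test and makes no claim about where the bug in the branch is:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t HIDDEN = 4;  // toy accumulator width
constexpr std::size_t INPUTS = 6;  // toy number of input features

using Acc = std::array<std::int16_t, HIDDEN>;

struct ToyNet {
    Acc                                       biases;
    std::array<std::int16_t, INPUTS * HIDDEN> weights;  // row per feature
};

// Deterministic filler values, purely for illustration.
ToyNet make_toy_net() {
    ToyNet net{};
    for (std::size_t i = 0; i < HIDDEN; ++i)
        net.biases[i] = static_cast<std::int16_t>(i + 1);
    for (std::size_t k = 0; k < net.weights.size(); ++k)
        net.weights[k] = static_cast<std::int16_t>(static_cast<int>(k * 7 % 11) - 5);
    return net;
}

// Rebuild: biases plus the weight row of every active feature.
Acc from_scratch(const ToyNet& net, const std::vector<std::size_t>& active) {
    Acc acc = net.biases;
    for (std::size_t index : active) {
        assert(index < INPUTS);
        for (std::size_t i = 0; i < HIDDEN; ++i)
            acc[i] += net.weights[index * HIDDEN + i];
    }
    return acc;
}

// Incremental: copy the previous accumulator, subtract removed rows,
// add newly activated rows.
Acc incremental(const ToyNet& net, const Acc& prev,
                const std::vector<std::size_t>& removed,
                const std::vector<std::size_t>& added) {
    Acc acc = prev;
    for (std::size_t index : removed)
        for (std::size_t i = 0; i < HIDDEN; ++i)
            acc[i] -= net.weights[index * HIDDEN + i];
    for (std::size_t index : added)
        for (std::size_t i = 0; i < HIDDEN; ++i)
            acc[i] += net.weights[index * HIDDEN + i];
    return acc;
}
```

If the two results agree on the toy model but disagree in the engine, the difference lists or the choice of source accumulator are the more likely suspects than the add/subtract loops themselves.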

rocky vigil
#

stockfish code is so convoluted idk how to test this

frosty imp
#

just make move on a pos object, then call evaluate

#

i think you can find some example on making move in the extend_pv function

rocky vigil
#

if only removing a piece

#

it doesn't actually change the position or stateinfo

#

so the accumulator gets re-used

rocky vigil
#

so far, with my current implementation, we have anywhere between 30 to 80 accumulator updates per node on average, depending on position

#

note that rn it's still just non-ue with accumulator re-use while I attempt to debug my incremental calculations

#

i think, if we can get this down to 20 avg.

#

then threat inputs should be quite good

rocky vigil
#

Stockfish Threatnet (last working version) vs Stormphrax 6: 39 - 55 - 106

#

Very rough STC estimate

formal smelt
#

what about against the smallnet

rocky vigil
#

I can’t download it rn bc my school wifi blocks fishtest

#

Also the STC will suck even harder

#

Considering that I still haven’t gotten update_accumulator_incremental to work

rocky vigil
#

So it’s basically non-ue with accumulator reuse

#

Somehow still 2-3x faster than scratch computation at eval time

rocky vigil
#

I literally can’t see the issue

rocky vigil
rocky vigil
#

average stockfish debug

daring wren
#

engin

rocky vigil
#

...

#

at least w.r.t. the difference calculation for added/removed features

#

@frosty imp ```c++
template<Color Perspective>
void update_accumulator(const Position& pos) {
StateInfo* st = pos.state();
if ((st->*accPtr).computed[Perspective])
return; // nothing to do

    // Look for a usable already computed accumulator of an earlier position.
    // Always try to do an incremental update as most accumulators will be reusable.
    do
    {
        if (!st->previous || st->previous->next != st)
        {
            // compute accumulator from scratch for this position
            update_accumulator_scratch<Perspective>(pos);/*
            if (st != pos.state())
                // when computing an accumulator from scratch we can use it to
                // efficiently compute the accumulator backwards, until we get to a king
                // move. We expect that we will need these accumulators later anyway, so
                // computing them now will save some work.
                update_accumulator_incremental<Perspective, BACKWARDS>(
                  pos.square<KING>(Perspective), st, pos.state());*/
            return;
        }
        st = st->previous;
    } while (!(st->*accPtr).computed[Perspective]);
    // Start from the oldest computed accumulator, update all the
    // accumulators up to the current position.
    update_accumulator_incremental<Perspective>(pos.square<KING>(Perspective), pos.state(), st);
}```
#

would you know why this code, from a 100k node startpos search, would only trigger update_accumulator_scratch once only, at the very beginning?

frosty imp
#

you fixed the bug right

#

like the one with computed flag

rocky vigil
#

both update_accumulator_scratch and update_accumulator_incremental

#

set computed = true at the end

frosty imp
#

hmm

#

maybe try calling evaluate once on the root position before search?

rocky vigil
#

uh

#
```c++
else
        {
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = (computed->*accPtr).accumulation[Perspective][j];
            }
            acc_updates++;
            threat_loops += (int)removed.size();
            threat_loops += (int)added.size();
            // Difference calculation for the activated features
            for (auto index : added)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
                    (next->*accPtr).accumulation[Perspective][i] += weights[offset + i];
            }
            // Difference calculation for the deactivated features
            for (auto index : removed)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
                    (next->*accPtr).accumulation[Perspective][i] -= weights[offset + i];
            }
        }```
#

there is an issue in this

#

piece of code

frosty imp
#

hmm I don't see anything wrong from a quick skim

rocky vigil
#

yeah that's why this is so suspicious

frosty imp
#

does the offset calculation match other parts of the code?

#

idr how it works

rocky vigil
# rocky vigil ```c++ else { for (IndexType j = 0; j < TransformedFeatureDi...

replacing the block with

else
        {
            acc_updates++;
            threat_loops += (int)newthreats.size();
            threat_loops += (int)newpsq.size();
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : newpsq)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (auto index : newthreats)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
        }

works but defeats the purpose of this function

#

i am equally perplexed

#

the (flawed) version of incremental halves avg. accumulator updates per node

rocky vigil
#

like I think incremental on every move is not optimal with threat inputs

#

because e.g. if you have a long series of captures

#

you do not really need to evaluate how the threats change with each intermediate capture

#

just skip all the way to the end
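A rough sketch of the "skip to the end" idea: if a sorted active-feature list is kept for the last computed position and for the position being evaluated, the net change can be taken in one pass with `std::set_difference`, without replaying each capture in between. `IndexList`, `FeatureDiff` and `feature_diff` are made-up names for illustration, not the branch's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for the engine's feature index list.
using IndexList = std::vector<std::uint32_t>;

struct FeatureDiff {
    IndexList removed;  // active before, inactive now
    IndexList added;    // inactive before, active now
};

// Net change between two positions, however many moves lie between them.
// Both inputs must be sorted ascending; std::set_difference requires that.
inline FeatureDiff feature_diff(const IndexList& before, const IndexList& after) {
    assert(std::is_sorted(before.begin(), before.end()));
    assert(std::is_sorted(after.begin(), after.end()));

    FeatureDiff d;
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::back_inserter(d.removed));
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::back_inserter(d.added));
    return d;
}
```

Features that appear and vanish again along the way never show up in either list, so they cost nothing. The price is that both lists have to be sorted.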

rocky vigil
#

at least, it has matched all the manual experiments

rocky vigil
#

we can replace it with

for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : oldpsq)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (auto index : oldthreats)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }

and it will work

#

meaning

#

despite (computed->*accPtr).computed[Perspective] being true

#

the accumulator is not correct

#

ok my github is up to date

#

I would really like help examining where accumulator.accumulation is updated

#

and how it can become "outdated"
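One way to pin down where the accumulator goes stale is to compare every incremental result against a from-scratch rebuild, as in this toy model. It assumes a tiny 4-wide accumulator and six features; `ToyTransformer`, `refresh`, `update` and `make_toy` are illustrative names only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t Dims        = 4;  // toy width; the net discussed here uses 256
constexpr std::size_t NumFeatures = 6;

using Acc = std::array<std::int16_t, Dims>;

struct ToyTransformer {
    Acc biases{};
    std::array<std::int16_t, Dims * NumFeatures> weights{};

    // Full rebuild: biases plus the weight column of every active feature.
    Acc refresh(const std::vector<std::size_t>& active) const {
        Acc acc = biases;
        for (std::size_t index : active) {
            assert(index < NumFeatures);
            for (std::size_t j = 0; j < Dims; ++j)
                acc[j] = static_cast<std::int16_t>(acc[j] + weights[Dims * index + j]);
        }
        return acc;
    }

    // Incremental step: start from the parent and apply only the changes.
    Acc update(const Acc& parent, const std::vector<std::size_t>& added,
               const std::vector<std::size_t>& removed) const {
        Acc acc = parent;
        for (std::size_t index : added)
            for (std::size_t j = 0; j < Dims; ++j)
                acc[j] = static_cast<std::int16_t>(acc[j] + weights[Dims * index + j]);
        for (std::size_t index : removed)
            for (std::size_t j = 0; j < Dims; ++j)
                acc[j] = static_cast<std::int16_t>(acc[j] - weights[Dims * index + j]);
        return acc;
    }
};

// Fills the toy net with small deterministic values.
inline ToyTransformer make_toy() {
    ToyTransformer t;
    for (std::size_t j = 0; j < Dims; ++j)
        t.biases[j] = static_cast<std::int16_t>(j + 1);
    for (std::size_t i = 0; i < Dims * NumFeatures; ++i)
        t.weights[i] = static_cast<std::int16_t>(static_cast<int>(i * 7 % 11) - 5);
    return t;
}
```

If the parent was built from a different feature set than the one the diff was taken against, the two results disagree even though the update arithmetic itself is right, which is the symptom described above.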

naive comet
#

honestly if you want a proof of concept you can just remove the accumulator updates altogether and do everything via the finny refreshes

#

it is not much of a slowdown and not an absolute PITA to implement

violet badger
# rocky vigil I would really like help examining where accumulator.accumulation is updated

Just a quick look at the warnings, I'm seeing:

position.cpp:1022:16: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of a non-trivial type ‘struct Stockfish::StateInfo’ leaves 1088 bytes unchanged [-Wclass-memaccess]
 1022 |     std::memcpy(&newSt, st, offsetof(StateInfo, accumulatorBig));
      |     ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from position.cpp:19:

I'd be worried about this kind of warnings.

rocky vigil
#

also clang never gave me this warning

#

perhaps I should install gcc

#

what does this code do

desert tree
#

that seems to be finding the byte offset of a member of a struct

rocky vigil
#

ok so how would it be off

#

if I add extra things to StateInfo

frosty imp
#

it's partially copying the stateinfo struct

#

(remember the partial copy-make sf uses)

desert tree
#

aka

struct S {
    u32 a;
    u32 b;
};
-> offsetof(S, b) == 4 // probably
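To make the partial copy concrete, here is a minimal sketch with a made-up `ToyState` (not the real `StateInfo` layout). Members declared before `accumulatorBig` are copied by the `memcpy`; `accumulatorBig` and anything after it are left alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

struct ToyAccumulator {
    std::int16_t accumulation[8];
    bool         computed;
};

// Hypothetical layout, only meant to mirror the "copy the prefix" idea.
struct ToyState {
    // Copied on every move.
    std::uint64_t key;
    int           rule50;
    // Not copied: this member and everything after it.
    ToyAccumulator accumulatorBig;
    int            plyCounter;
};

// offsetof needs a standard-layout type, memcpy needs a trivially copyable one.
static_assert(std::is_standard_layout_v<ToyState>);
static_assert(std::is_trivially_copyable_v<ToyState>);

// Copies only the bytes that precede accumulatorBig.
inline void copy_prefix(ToyState& dst, const ToyState& src) {
    std::memcpy(&dst, &src, offsetof(ToyState, accumulatorBig));
}
```

So a new member changes what gets copied purely by where it is declared, and the `static_assert`s catch a member that is not trivially copyable.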
rocky vigil
#

ok

rocky vigil
frosty imp
rocky vigil
#

This is all before accumulatorBig

candid ivy
#

it might work, might not; depends a bit on the context

rocky vigil
#

Is it because of adding a Eval::NNUE::FeatureSet::IndexList in StateInfo

#

I can store them in Accumulator instead I think

#

And that should make StateInfo trivial again

candid ivy
#

Isn’t the accumulator in StateInfo too? It would have the same result, since if a member is not trivially copyable then the struct itself isn’t either

rocky vigil
#

but since we only memcpy the non-accumulator portion of StateInfo

#

perhaps that issue could be avoided

#

otherwise I could look into storing pointers instead

#

but idk how that would work

#

actually since valuelist is very simple implementation

#

what if I just construct a trivial version of it

#

lemme try that

candid ivy
#

First check if that’s actually the struct causing the warning

rocky vigil
frosty imp
#
template<typename T, std::size_t MaxSize>
class ValueList {

   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) { values_[size_++] = value; }
    const T*    begin() const { return values_; }
    const T*    end() const { return values_ + size_; }
    const T&    operator[](int index) const { return values_[index]; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};
rocky vigil
#

and if I comment out the IndexList declarations in StateInfo, I get that it is trivial

#

valuelist should be easy to do in a trivial manner

#

it's essentially just an array

frosty imp
#

idk why it isn't trivially copyable

rocky vigil
#

it might be because of the [] operator

frosty imp
#

well the member functions aren't copied anyway

rocky vigil
#

yeah I also have no idea why it's nontrivial

desert tree
#

maybe static_assert(std::is_trivially_copyable_v<T>)?

#

otherwise it will not be trivially copyable

rocky vigil
#

well

#

idk what this means lol

rocky vigil
desert tree
#

can you send the error you get

desert tree
#

for static_assert(std::is_trivially_copyable_v<ValueList<T, N>>)?

rocky vigil
#

essentially memcpy doesn't work because adding valuelists makes stateinfo nontrivial

#

whereas I would like to have a valuelist store all active features corresponding to a given accumulator

desert tree
#

compiler and language version?

rocky vigil
#

so that I don't need to recompute them to take the difference

rocky vigil
#

also since millions of these will be processed for accumulator updates

#

trivialness might actually impact the speed

#

in the meanwhile I'll just replace it with an actual array

#

and see if that fixes things

rocky vigil
#

maybe vscode is not to be trusted though

#

ok godbolt backs up this claim

frosty imp
#

could you send godbolt link

rocky vigil
#

how do I get a link

frosty imp
#

"share"

#

at top right corner

rocky vigil
desert tree
#

nvm im dumb

rocky vigil
#

anyways the compiler in godbolt (hopefully) doesn't lie

#

even though I have no idea what the issue could possibly be

desert tree
#

also why cout

frosty imp
#

i think it's just the custom constructor

desert tree
#

instead of static_assert

frosty imp
#
    std::cout << std::boolalpha << std::is_trivially_copyable_v<ValueList<std::uint32_t, 128ULL>> << std::endl;
#

but this is still true though

desert tree
#

is_trivial vs is_trivially_copyable

#

are not the same

frosty imp
#

yeah

#

but didn't the warning say trivially copyable

#

ah

desert tree
#

non-trivial type

rocky vigil
#

vondele's warning just says 'writing to an object of a non-trivial type'

desert tree
#

thats so strange

#

ill take a look tomorrow

#

gotta sleep

rocky vigil
#

anyways I can really just like

#

manually replace the functionality

#

with only an array

#

but that's stupid

#

like just have arr[0] be the replacement for size_ or whatever

frosty imp
#

i think it's the initialization of size_=0 that causes the problem

desert tree
#

nahh please have a separate member no?

frosty imp
#

idk how you can do that without breaking stuff

rocky vigil
#

unless it's the initialization in general

desert tree
#

maybe all members need initialization or none?

frosty imp
#

you need to remove the initialization

desert tree
rocky vigil
#

yes removing initialization makes it work

#

well then

frosty imp
#

probably the initialization makes the constructor non-trivial?
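The difference between the two traits can be checked directly. In this sketch (simplified stand-ins for `ValueList`, names made up) the default member initializer keeps the type trivially copyable but gives it a non-trivial default constructor, so it is no longer a trivial type. That would be consistent with the `cout` test printing true while GCC still calls the type non-trivial, though exactly which property GCC's warning keys on is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Variant 1: size_ has a default member initializer.
template<typename T, std::size_t MaxSize>
struct ListWithInit {
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Variant 2: no initializer; the caller must call clear() before first use.
template<typename T, std::size_t MaxSize>
struct ListNoInit {
    T           values_[MaxSize];
    std::size_t size_;  // indeterminate until clear() is called

    void clear() { size_ = 0; }
    void push_back(const T& value) {
        assert(size_ < MaxSize);
        values_[size_++] = value;
    }
    std::size_t size() const { return size_; }
};

using WithInit = ListWithInit<std::uint32_t, 128>;
using NoInit   = ListNoInit<std::uint32_t, 128>;

// Copying is trivial either way...
static_assert(std::is_trivially_copyable_v<WithInit>);
static_assert(std::is_trivially_copyable_v<NoInit>);
// ...but the initializer makes default construction non-trivial.
static_assert(!std::is_trivially_default_constructible_v<WithInit>);
static_assert(!std::is_trivial_v<WithInit>);
static_assert(std::is_trivial_v<NoInit>);
```

The `clear()` variant is trivial, at the cost that `size_` is indeterminate until `clear()` has been called.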

rocky vigil
#

ok but how do you get the functionality without the initialization of size_

frosty imp
#

eh you probably can't

#

you know this would be a great time to refactor accumulator updates Kappa

rocky vigil
#

what if we just leave size_ uninitialized

#

and then use clear() to set it to 0

#

does this bypass work

frosty imp
#

that could solve it

#

but i don't like the extra step needed to use it elsewhere

#

feels error prone

rocky vigil
#

yeah it's

#

annoying

#

this is basically doing a constructor without actually doing one

frosty imp
#

how about you just do it that way for the prototype

rocky vigil
#

yeah I'll just try and see if it works for ue now

frosty imp
#

would you be open to rebasing this onto some other refactors

rocky vigil
#

and if we get close to merging we can figure a better solution out

#

sure

#

yeah

frosty imp
#

yeah I can try making it nice but it'll probably take a few days lol

rocky vigil
#

i mean only if this will get merged

#

myself I'm pretty confident that threat inputs will work with sufficient optimization but

#

thankfully valuelist is only used once

#

'capturesSearched' and 'quietsSearched'

frosty imp
#

i really hope it works lol

#

threat inputs might solve a whole class of fortress issues

rocky vigil
#

welp not initializing size_ = 0 breaks the code somehow even though I thought I tracked down and added .clear() every time a valuelist is declared

frosty imp
#

did you clear it in the movepicker constructor

#

nvm it doesn't use valuelist

rocky vigil
#
```
Assertion failed: (next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j], file nnue/nnue_feature_transformer.h, line 264
```
#

lmao it still doesn't work

#

the accumulator still gets messed up somehow

#
```c++
for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                (next->*accPtr).accumulation[Perspective][j] = biases[j];
            }
            for (auto index : oldfeatures)
            {
                const IndexType offset = TransformedFeatureDimensions * index;
                assert(offset < TransformedFeatureDimensions * InputDimensions);
                for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
            }
            for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
                assert((next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j]);
            }
```
relevant code
#

uh @frosty imp I pushed commit to my branch so can you compile it with gcc and tell me all the warnings

#

because clang did not give me any memcpy warnings

frosty imp
rocky vigil
#

ok so no more memcpy warnings

#

this is not inspiring

rocky vigil
#

also for a startpos search at least

#

after update_accumulator_scratch being called on startpos

#

everything else will be update_accumulator_incremental

#

anyways since Stockfish::StateInfo is back to being trivial the memcpy should be fine

rocky vigil
#

is it still possible that something in the memcpy is going wrong?

#

I don't know any other change that could affect the accumulation

#

since nowhere else in the code besides feature_transformer is it modified

rocky vigil
#

I still have no idea where accumulator.accumulation is modified besides in the feature transformer functions

#

so I really don't know how it would stop matching

#

I don't think it's the memcpy, because adding a 4096-byte padding buffer in the accumulator struct ahead of the values doesn't fix it either

#
92 -60 -152 39 -71 -20 -113 31 -45 262 -76 -104 193 -98 72 -6 65 4 -2 -1025 88 -300 295 33 -270 -292 -442 37 72 -7 -184 252 -209 -70 -514 -235 130 -243 -256 -112 -57 140 -418 -121 -75 13 -906 110 77 -150 103 -231 -155 -1937 -2310 -98 -191 -522 -27 1 46 -2177 -185 135 -358 12 -131 -44 -81 -127 -87 -952 -430 -416 -6 -56 140 -22 85 -49 -131 10 -68 218 -1726 -630 -92 -23 -81 139 -459 30 -223 -53 13 -785 -32 -88 244 99 107 159 93 -272 -399 -551 -136 15 17 -731 -328 -116 -48 -294 8 54 -88 -609 -204 28 13 -739 36 -149 -213 -128 -201 -53 -417 -161 -180 -10 -2 25 -11 -175 56 -130 -90 -178 -3416 39 -656 -84 181 69 -5 -394 -137 -763 31 -205 46 11 -72 -71 -41 196 39 -1601 -521 -223 47 -3563 -384 -96 43 57 215 -195 -72 -12 -191 -153 -26 106 65 -2286 9 -192 190 -50 2 -256 170 -167 -43 50 70 71 -641 346 -2 72 -10 67 -107 -42 40 36 306 -476 -122 -188 33 -36 -604 119 -275 93 -917 2 -701 -43 -410 -128 -156 -946 -80 64 -446 40 6 38 75 -648 177 -27 -425 -20 -32 -1993 -174 40 -117 -7 60 -148 124 -226 -71 27 -358 -47 147 -168 -226 -601 5 7 -231 -273 -12 -9 -174 -121```
#
101 -61 -132 28 -64 -10 -105 45 -51 267 -77 -99 180 -299 81 -17 72 33 -41 -975 88 -298 265 65 -251 -304 -402 -103 138 -101 -186 250 -221 -224 -502 -140 128 -168 -239 -114 -56 133 -424 -114 -81 17 -937 128 33 -135 101 -334 -151 -1822 -2369 -95 -189 -363 3 -31 28 -2126 -166 136 -382 2 -647 -19 -166 -126 -87 -998 -443 -606 -31 -49 169 -27 86 -60 -127 6 -75 198 -1701 -549 -107 -29 -109 145 -437 20 -213 -15 25 -753 -47 -80 237 138 101 143 123 -291 -395 -530 -118 -50 7 -706 -330 -101 -53 -448 34 40 -176 -606 -197 8 -11 -837 27 -140 -208 -100 -200 -58 -435 -181 -150 -17 38 15 -12 -162 -73 -136 -90 16 -3478 -108 -680 -509 163 65 -9 -393 -197 -729 39 -111 46 -4 -69 -75 -26 211 11 -1556 -523 -209 43 -3528 -265 -71 20 -8 211 -180 -81 2 -181 -131 -27 115 69 -2229 19 -219 208 -45 8 -243 159 -181 -15 56 43 73 -659 311 -9 80 4 75 -129 -38 25 12 303 -464 -115 -163 32 -32 -598 41 -283 111 -935 -14 -705 -83 -416 -60 -177 -939 -78 66 -455 48 1 -155 74 -740 166 -53 -358 -37 -7 -2134 -175 -45 -108 -3 67 -158 131 -232 -81 15 -345 -67 139 196 -187 -627 39 -4 -221 -270 -11 31 -169 -58```
#

what

#

ok then

#

idk how useful this is though

#
8 9 10 11 12 13 14 15 65 70 130 133 192 199 260 323 429 432 433 434 435 436 438 439 505 510 570 573 632 639 700 763 938 957 1622 1623 1643 1644 2728 2735 2832 2833 4607 4608 4611 4612 4613 7445 7446 7447 7448 7449 8350 8353 8754 8771 8773 9635 9636 9656 9657 11022 11023 11120 11127 13407 13408 13413 13414 15333 15334 15335 15336 15337```
#

this should be the position after 1. c2c3

rocky vigil
#

so somehow it's getting passed the accumulator of the wrong perspective

#

wot

#

ok I am a clown

#

I have the same

#

IndexList

#

for both perspectives

#

because I forgot to split by perspective

#

so of course the features and stuff will cease to match...
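A minimal sketch of the fix, assuming the cached features live next to the accumulator: keep one list per perspective and select it with the same `Perspective` template parameter that indexes the accumulation array. `CachedFeatures` and `list` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

enum Color : int { WHITE = 0, BLACK = 1, COLOR_NB = 2 };

// Hypothetical stand-in for the engine's feature index list.
using IndexList = std::vector<std::uint32_t>;

struct CachedFeatures {
    // One list per perspective, mirroring accumulation[Perspective][...].
    std::array<IndexList, COLOR_NB> active;

    template<Color Perspective>
    IndexList& list() {
        return active[Perspective];
    }
};
```

With a single shared list, whichever perspective wrote last would overwrite the other side's features, and the next diff would be taken against the wrong set.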

rocky vigil
#

UE reduces accumulator updates by more than 4x in 10M node search from startpos

#

but is not that much faster

#

maybe because of write_difference overhead

#

speedtest Nodes/second : 1809147 (non-ue)

#

Nodes/second : 1736032 (ue)

#

lmao

#

write_difference overhead is actually insane apparently

rocky vigil
#

ok branch updated

#

uh

#

tbh ue being like

#

~same speed

#

with anywhere between 1/3 and 1/5 of the accumulator updates

#

is shocking

#

(ly bad overhead)

#

for speedtesting purposes

#

(and/or profile it, the only major difference should be way less accumulator updates but many usages of write_difference)

#

i would appreciate it a lot

#

according to my debug statistics, which I can run now that I have 'real ue', this estimate is approximately correct (per side, so ~16 total per eval)

rocky vigil
#

I am suspicious of the bench change when using dirtypiece to perform the psq feature updates

#

But the short STC I ran (50 - 31 - 119) doesn’t lie

#

It’s like a 10% speedup

#

@formal smelt are we still doing king-rook and king-bishop deduplication

#

In full inputs

formal smelt
#

i thought we weren't doing that

rocky vigil
#

Ok sure that’s fine

#

Yeah now that ue is in a better shape

#

(Overhead is constant, but massively fewer accumulator updates scale much better with larger nets)

#

I can work on supporting full threats as well

rocky vigil
round stone
rocky vigil
#

Ok looks like ue is much better on your machine compared to mine

#

On my laptop it’s barely like 10%

rocky vigil
#

@formal smelt can I get active features for kiwipete, full inputs (including deduplication) when you have time

candid ivy
#
~/bench_parallel.sh ./stockfish_a2604d40 ./stockfish_46581e8a 13 10
sf_base =   335055 +/-   1366 (95%)
sf_test =   739457 +/-   6135 (95%)
diff    =   404401 +/-   5258 (95%)
speedup = 120.69692% +/- 1.570% (95%)
twilit oriole
#

Since net weights are not shared between instances

#

If you do single threaded the result will be far better

round stone
#

is there at least another 2x speedup expected beyond this?

candid ivy
#
diff --git a/src/nnue/nnue_feature_transformer.h b/src/nnue/nnue_feature_transformer.h
index 35027bf6..ae832a67 100644
--- a/src/nnue/nnue_feature_transformer.h
+++ b/src/nnue/nnue_feature_transformer.h
@@ -256,21 +256,25 @@ class FeatureTransformer {
             acc_updates++;
             threat_loops += (int)removed.size();
             threat_loops += (int)added.size();
+
+            auto* acc_ptr = &((next->*accPtr).accumulation[Perspective][0]);
+
             // Difference calculation for the activated features
             for (auto index : added)
             {
-                const IndexType offset = TransformedFeatureDimensions * index;
-                assert(offset < TransformedFeatureDimensions * InputDimensions);
+                const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
                 for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
-                    (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
+                    acc_ptr[j] += weight_ptr[j];
             }
+
             // Difference calculation for the deactivated features
             for (auto index : removed)
             {
-                const IndexType offset = TransformedFeatureDimensions * index;
-                assert(offset < TransformedFeatureDimensions * InputDimensions);
+                const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
                 for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
-                    (next->*accPtr).accumulation[Perspective][j] -= weights[offset + j];
+                    acc_ptr[j] -= weight_ptr[j];
             }
         }

unless i did something wrong i got a good speedup from this

sf_base =   731585 +/-   8007 (95%)
sf_test =  1162599 +/-  14972 (95%)
diff    =   431013 +/-  10956 (95%)
speedup = 58.91504% +/- 1.498% (95%)
formal smelt
#

uh hang on

#

ill just push to the montytrain branch

#

simplified inputs ye?

rocky vigil
#

Simpler inputs is already worked out

rocky vigil
#

well I expect this to be -100 elo so far

#

going to test this, and also with a +50% time odds to simulate further optimization

rocky vigil
#

Uh

#

Bench altering

twilit oriole
#

I don't get how a 256 threat net is that slow. It shouldn't be much, 80%+ time is in search

rocky vigil
#

I swear locally it was +30 elo so I looked past it

#

Idk though

twilit oriole
#

Because vondele said the 3072 master net is only 50% of runtime in eval

rocky vigil
#

Yeah I probably optimized smth poorly

#

Maybe the special SIMD kernels actually are meaningful

rocky vigil
lofty cedar
#

Threat input tests went horribly.

#

200 elo is a long shot.

formal smelt
#

i wouldn't take that test as indicating anything

lofty cedar
#

I see.

rocky vigil
#

nothing else can explain that ue is less than 2x faster than non-ue with around 1/4 the accumulator updates on average

rocky vigil
# lofty cedar I see.

and also +60% time only performing 50 elo better at STC UHO is way below what we would expect from scaling data

daring wren
#

is it? I thought it was 2x time odds ~= 70-80 elo

rocky vigil
#

ok well in that case

#

+40 fixed nodes 1/2 speed

#

in no way equals -270 stc elo

#

so either way something is wrong

daring wren
#

maybe net has bad scaling

#

have you tried fixed nodes but with higher node count?

rocky vigil
#

that was probably against my better judgement

#

so now I'm running a non-SSS test on fishtest

#

what the - (check back in on this in a couple hours)

#

I really got scammed by SSS

#

maybe I straight up loaded the wrong compile

#

anyways pls fix dirtypiece

#

it's worth another like 20% if done properly or smth

rocky vigil
lofty cedar
#

Interesting.

rocky vigil
#

ok yeah tmrw I will try to get dirtypiece working again

#

and maybe also try and optimize incremental more

#

(in the difference calculation for features)

candid ivy
#

the enemy bool in the make_index can be rewritten as bool enemy = (attkr ^ attkd) & 8; in my profile it was kinda slow i think, (~1.4-2% speedup)
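Why the XOR works, assuming Stockfish's usual piece encoding where bit 3 carries the colour (white pieces 1..6, black pieces 9..14): XOR-ing two piece codes leaves bit 3 set exactly when the colours differ. The colour-comparison version below is only a guess at what the slower code looked like.

```cpp
#include <cassert>

// Piece codes with the colour in bit 3: white 1..6, black 9..14.
enum Piece : int {
    W_PAWN = 1, W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
    B_PAWN = 9, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING
};

constexpr int color_of(Piece pc) { return pc >> 3; }

// Straightforward version: extract both colours and compare them.
constexpr bool enemy_by_color(Piece attkr, Piece attkd) {
    return color_of(attkr) != color_of(attkd);
}

// Bit trick: after the XOR, bit 3 is set only when the colour bits differ.
constexpr bool enemy_by_xor(Piece attkr, Piece attkd) {
    return ((attkr ^ attkd) & 8) != 0;
}

static_assert(enemy_by_xor(W_ROOK, B_QUEEN));
static_assert(!enemy_by_xor(B_PAWN, B_KING));
```

The trick relies on the lower three bits never spilling into bit 3, which holds for piece codes 1..6 and 9..14.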

candid ivy
#

@rocky vigil what's the purpose of sorting the features in append_active_threats? I get the same bench without it

candid ivy
#
diff --git a/src/nnue/layers/screlu_affine.h b/src/nnue/layers/screlu_affine.h
index aeb7e951..06ad8249 100644
--- a/src/nnue/layers/screlu_affine.h
+++ b/src/nnue/layers/screlu_affine.h
@@ -56,24 +56,23 @@ class SCReLUAffine {
     }

     // Forward propagation
-    OutputType evaluate(InputType* input, IndexType bucket) {
+    OutputType evaluate(const InputType* input, IndexType bucket) {
         assert(bucket < OutputBuckets);
-        constexpr IndexType Start = 0;
-        OutputType output = 255*(std::int32_t)biases[bucket];
-        for (IndexType i = Start; i < InputDimensions; i++) {
-            input[i] = std::min((std::int16_t)255, std::max(input[i], (std::int16_t)0));
-        }
-        for (IndexType i = Start; i < InputDimensions; i++) {
-            intermediate[i] = input[i]*weights[bucket*InputDimensions+i];
-        }
-        for (IndexType i = Start; i < InputDimensions; ++i)
-        {
-            output += (std::int32_t)(input[i])*(std::int32_t)(intermediate[i]);
+
+        const IndexType weightOffset = bucket * InputDimensions;
+        const auto* weights_ptr = &(weights[weightOffset]);
+
+        std::int32_t output = 255 * static_cast<std::int32_t>(biases[bucket]);
+
+        for (IndexType i = 0; i < InputDimensions; i++) {
+            const std::int16_t clipped = std::clamp(input[i], static_cast<std::int16_t>(0), static_cast<std::int16_t>(255));
+
+            output += clipped * clipped * weights_ptr[i];
         }
+
         return output / 255;
     }

-    alignas(CacheLineSize) std::int16_t intermediate[InputDimensions];
     alignas(CacheLineSize) std::int16_t biases[OutputBuckets];
     alignas(CacheLineSize) std::int16_t weights[OutputBuckets * InputDimensions];
 };

this speeds up evaluate by like 5% for me. If you need the sort order for the active inputs, then I wonder if you can avoid the sort by changing the loop so that all indices which are generated next are already in ascending order?

candid ivy
#

then I wonder if you can avoid the sort by changing the loop so that all indices which are generated next are already in ascending order?
maybe they already are?

rocky vigil
candid ivy
#

yeye i checked: for some reason bench was unchanged even though they weren't in ascending order

rocky vigil
#

9594b46 appears to be a significant gain in bench

#

Though the school laptop’s antivirus sometimes screws the results

rocky vigil
#

Also if someone wants to debug the dirtypiece

candid ivy
rocky vigil
#

I mostly copied it from master

#

make_index(piece, from, from, piece, ksq) does get you the psq index

#

As a hack

#

Figuring this out should also be a large gain

candid ivy
rocky vigil
#

Yeah turns out declaring many small vectors is highly unnecessary

rocky vigil
#

I think this is what is breaking

#

Yeah I probably mess up the mirror refresh

rocky vigil
#

Alright I fixed dirtypiece

#

In 286c9e6a

#

Seems to be significant gain as well

candid ivy
#

~16.5%

rocky vigil
#

Probably not as big of a speedup though

rocky vigil
#

Multilayer inference will be better because we have proper code for it

#

Ok I have no new optimizations planned rn so I’ll run a test to see where we are at

#

we’re probably ready to test multilayer with simplified inputs @round stone

rocky vigil
candid ivy
#

with some additional optimizations and simd you can get another 11%

twilit oriole
#

this is not permuted as well right?

rocky vigil
#

Nope

#

Just whatever bullet default order is

#

At this point it’s almost certain that at L1=256 simplified threats are superior to halfkav2hm at STC

rocky vigil
#

Besides if computing the threat indices is the bottleneck rn that implies favorable scaling to large L1

#

Speaking of computing threat indices bulk pawn attacks might be considerable speedup

rustic bough
#

MultiThreading is broken in 9f21b44. Was ok in 2db74f4.

rocky vigil
#

Since half the pieces being looped through will be pawns
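For reference, the standard set-wise pawn attack computation looks roughly like this, assuming the usual a1 = 0 ... h8 = 63 square mapping. It yields all attacked squares for every pawn of one colour in two shifts; how to turn that into threat indices is left open here.

```cpp
#include <cassert>
#include <cstdint>

using Bitboard = std::uint64_t;

constexpr Bitboard FileA = 0x0101010101010101ULL;
constexpr Bitboard FileH = FileA << 7;

// Square numbering assumed: a1 = 0, b1 = 1, ..., h8 = 63.
constexpr Bitboard square_bb(int sq) { return Bitboard(1) << sq; }

// All squares attacked by the given set of white pawns.
// The file masks stop attacks from wrapping around the board edge.
constexpr Bitboard white_pawn_attacks(Bitboard pawns) {
    return ((pawns & ~FileA) << 7) | ((pawns & ~FileH) << 9);
}

// Same for black pawns, which attack towards rank 1.
constexpr Bitboard black_pawn_attacks(Bitboard pawns) {
    return ((pawns & ~FileA) >> 9) | ((pawns & ~FileH) >> 7);
}
```

If the two shifted boards are kept separate instead of OR-ed together, the attacking pawn's square can be recovered from a target square by undoing the fixed offset (7 or 9), which is what a per-pair threat index would need.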

#

Huh interesting

#

Yeah uh

#

I see

#

Lemme bisect it quickly

#

9594b46 breaks multithreading

#

sigh

#

Do the threads like access the same featuretransformer class or smth

#

As long as they access separate featuretransformers it shouldn’t break?

candid ivy
rocky vigil
#

Welp n*ram for n threads

candid ivy
#

it's really not much ram?

rocky vigil
#

I guess fishtest already has this problem though

candid ivy
#

how many pieces are in that array, you are sorting and adding it constantly, it's like max 30 elements with 4 bytes or something?

rocky vigil
#

Max 16 with 4 bytes

candid ivy
#

64 bytes per thread kekgasm

rocky vigil
#

Oh but the commit afterwards moves removed, added also into featuretransformer

#

Whatever that’s like 1KB

#

Uh adding static thread_local gives me a bunch of warnings on forward declaration

#

And then undefined symbol errors

candid ivy
#

show me

#

also just move the vector into the function again if you do that

#

or if it is max 16, i mean use another valuelist not a vector

rocky vigil
#

Still at school lol

candid ivy
rocky vigil
#

Wait I claim valuelist already has these

candid ivy
#

it has them with const

#

sorting is well non const

rocky vigil
#

Ok well yeah I won’t be able to do changes for a couple hours

#

At that point I’ll also run another STC on fishtest

#

To see where we are single thread now

round stone
# rocky vigil we’re probably ready to test multilayer with simplified inputs <@450517669570150...

alright, something like this to start?

const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
    .quantisations(&[255, 64])
    .optimiser(optimiser::AdamW)
    .loss_fn(Loss::SigmoidMPE(2.6))
    .input(ThreatInputs)
    .output_buckets(outputs::MaterialCount::<8>)
    .feature_transformer(HIDDEN_SIZE)
    .activate(Activation::SCReLU)
    .add_layer(16)
    .activate(Activation::CReLU)
    .add_layer(32)
    .activate(Activation::CReLU)
    .add_layer(1)
    .build();
formal smelt
#

You could just recreate the SF arch

#

It should be more or less what is in the advanced example

round stone
#

sure, if the inference code for SF arch is reusable or easy to set up for this

#

down to use whatever arch. lmk what multi-layer arch is easiest to get inference working for @rocky vigil

frosty imp
#

can bullet leb compress the weights

round stone
#

also found that there's a newer L1-256 which is the one lichess is using: nn-9067e33176e8.nnue

#

there's no leb compression implementation in bullet

formal smelt
#

this is hardly something needed in bullet

#

just postprocess the file

rocky vigil
#

I’ll get it to work either way (including weight reading)

#

Whatever you think is best

round stone
#

alright, then i'm inclined to start simple with the arch

#

less code to deal with

#

and ignore leb128 until the end, or if larger L1 gets annoying to deal with during testing

#

since it has no effect on strength, and strength is the important part now

rocky vigil
#

Yeah that’s fine

rocky vigil
# candid ivy show me
```
      'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' required here, but
      no definition is available [-Wundefined-var-template]
  129 |         added.clear();
      |         ^
nnue/nnue_feature_transformer.h:291:17: note: in instantiation of function template specialization
      'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::update_accumulator_scratch<Stockfish::WHITE>' requested here
  291 |                 update_accumulator_scratch<Perspective>(pos);
      |                 ^
nnue/nnue_feature_transformer.h:309:9: note: in instantiation of function template specialization
      'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::update_accumulator<Stockfish::WHITE>' requested here
  309 |         update_accumulator<WHITE>(pos);
      |         ^
nnue/network.cpp:222:25: note: in instantiation of member function 'Stockfish::Eval::NNUE::FeatureTransformer<256,
      &Stockfish::StateInfo::accumulatorBig>::transform' requested here
  222 |     featureTransformer->transform(pos, acc);
      |                         ^
nnue/nnue_feature_transformer.h:59:47: note: forward declaration of template entity is here
   59 |     static thread_local FeatureSet::IndexList added;
      |                                               ^
nnue/nnue_feature_transformer.h:129:9: note: add an explicit instantiation declaration to suppress this warning if
      'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' is explicitly
      instantiated in another translation unit
  129 |         added.clear();
```
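The warning is about a `static` data member of a class template that is declared but never defined, which also explains undefined symbol errors at link time. A sketch of the two usual ways to provide the definition, using a made-up `SmallList` in place of `FeatureSet::IndexList` and a toy transformer class:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for FeatureSet::IndexList.
struct SmallList {
    std::uint32_t values[16];
    std::size_t   size;

    void clear() { size = 0; }
    void push_back(std::uint32_t v) {
        assert(size < 16);
        values[size++] = v;
    }
};

template<int Dims>
class ToyTransformer {
   public:
    // Option 1: declaration only; needs the out-of-class definition below.
    static thread_local SmallList added;

    // Option 2 (C++17): 'inline' turns the declaration into a definition.
    static inline thread_local SmallList removed{};

    // Uses the per-thread scratch list and returns its size afterwards.
    std::size_t record(std::uint32_t index) {
        added.clear();
        added.push_back(index);
        return added.size;
    }
};

// Definition for option 1. It sits in the header next to the template so
// that every instantiation can see it; thread_local has to be repeated.
template<int Dims>
thread_local SmallList ToyTransformer<Dims>::added{};
```

Either way each thread gets its own copy of the list, so threads no longer share scratch storage through one transformer object.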
rocky vigil
#

hmm we are not cooking with this one

rocky vigil
#

uh can someone run bench of threat-inputs vs halfkav2hm-256-base

rocky vigil
#

what is going on with the residuals in the test

#

btw

#

this is the first time I've seen red

round stone
#

speed of latest threat-inputs vs. L1-256 nn-9067e33176e8.nnue as main net

Result of  20 runs
==================
base (...-256-s3-9067) =    2760160  +/- 6232
test (...g13-sf-mar11) =    1059262  +/- 3025
diff                   =   -1700898  +/- 5835

speedup        = -0.6162
P(speedup > 0) =  0.0000
rocky vigil
#

ok we are not cooking very hard on the speed

#

that's probably why still -100 elo

#

i mean the current code will perform (relatively) better with large nets because I think calculating threat indices is still the bottleneck

#

i will try and bulk pawn threats later

#

once multithreading is resolved

round stone
#

for multilayer, floats for later layers ok? based on bullet morelayers.rs

const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
    .advanced_quantisations(&[QuantTarget::I16(255), QuantTarget::I16(64), QuantTarget::Float, QuantTarget::Float])
    .optimiser(optimiser::AdamW)
    .loss_fn(Loss::SigmoidMPE(2.6))
    .input(ThreatInputs)
    .output_buckets(outputs::MaterialCount::<8>)
    .feature_transformer(HIDDEN_SIZE)
    .activate(Activation::SCReLU)
    .add_layer(16)
    .activate(Activation::CReLU)
    .add_layer(32)
    .activate(Activation::CReLU)
    .add_layer(1)
    .build();
lofty cedar
#

Still like -100 elo.

round stone
#

-100 elo vs. L1-256 main net means still a long ways though

#

yea, morelayers can be a baseline but isn't going to help that much

twilit oriole
#

I did say I don't think simplified threat inputs are sufficient lol

#

But it's fine, if it's close enough then we know the full will be better

rocky vigil
twilit oriole
#

Interesting

#

So that slowdown actually decreases as net size increases

rocky vigil
#

according to my statistics you have ~2x as many accumulator updates with threats compared to halfkav2hm

#

in midgame

#

now obviously because of hilariously poor overhead this doesn't match what we see at small sizes

#

and yeah I would appreciate help with ue impl

#

like if there's a way to compute threat difference without looping through all the pieces
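
For reference, a minimal sketch of the baseline this question is trying to beat: build the full sorted threat index list before and after the move, then diff the two lists to get the features to remove and add. The names `ThreatDelta` and `diff_sorted` are hypothetical, and this is not necessarily how the branch does it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Features to subtract from / add to the accumulator.
struct ThreatDelta {
    std::vector<std::uint32_t> removed;
    std::vector<std::uint32_t> added;
};

// Diff two sorted lists of active threat indices.
// Both inputs must be sorted ascending (checked in debug builds).
ThreatDelta diff_sorted(const std::vector<std::uint32_t>& before,
                        const std::vector<std::uint32_t>& after) {
    assert(std::is_sorted(before.begin(), before.end()));
    assert(std::is_sorted(after.begin(), after.end()));
    ThreatDelta d;
    // In `before` but not in `after`: threats that disappeared.
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::back_inserter(d.removed));
    // In `after` but not in `before`: threats that appeared.
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::back_inserter(d.added));
    return d;
}
```

The diff itself is linear in the list lengths, so the cost being complained about here is generating the two full lists, which is what a true incremental scheme would have to avoid.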

round stone
#

i'll try L1-512 simple threats for another data point then

rocky vigil
#

ok

#

I expect it to be not much slower

round stone
#

and i'll look into full threats soon too

rocky vigil
#

if you want I can take the current net, permute the ft/l2 weights 7 more times, and "effectively" have a l1 = 2048 for speed testing purposes
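
A rough sketch of what such a widening could look like for the feature transformer weights only, assuming a row-major `[inputs x l1]` layout and using a column rotation as the permutation. Both the layout and the choice of permutation are assumptions, `widen_ft` is a made-up name, and the following layer's weights would need matching treatment. The resulting net is only useful for timing, not for evaluation quality.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Widen row-major FT weights from [inputs x l1] to [inputs x (l1 * copies)].
// Copy k of each row is the original row rotated left by k columns.
std::vector<std::int16_t> widen_ft(const std::vector<std::int16_t>& w,
                                   std::size_t inputs, std::size_t l1,
                                   std::size_t copies) {
    assert(l1 > 0 && copies > 0 && w.size() == inputs * l1);
    std::vector<std::int16_t> out(inputs * l1 * copies);
    for (std::size_t r = 0; r < inputs; ++r)
        for (std::size_t k = 0; k < copies; ++k)
            for (std::size_t c = 0; c < l1; ++c)
                out[r * l1 * copies + k * l1 + c] = w[r * l1 + (c + k) % l1];
    return out;
}
```

With `l1 = 256` and `copies = 8` (the original plus 7 permuted copies) this yields the 2048-wide layer mentioned above.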

round stone
#

i can train an actual L1-2048 and use an early checkpoint for speedtest purposes

rocky vigil
#

yes I believe everything there is ready

#

although inference side it's not quite ready yet

#

I'll work on it

round stone
#

alright full threats would be more important than multilayer

rocky vigil
#

yeah sure the fixed nodes test of full threats vs simplified would probably be more informative

rocky vigil
#

moving indexlists to classes so that we don't declare a bunch of temporary ones (a noticeable speed gain in single thread btw) currently breaks multithreading

round stone
#
Result of  10 runs
==================
base (...g13-sf-mar11) =    1063281  +/- 1289
test (...1-sscg13-512) =     254596  +/- 485
diff                   =    -808684  +/- 1307

speedup        = -0.7606
P(speedup > 0) =  0.0000

assuming i did this right, this shows simple threats L1-512 being quite a lot slower than L1-256

#

early results after 10 superbatches of training on SF data

#
Architecture           : (15776 -> 512)x2 -> 1x8
Inputs                 : Threat inputs
Number of Weights      : 8.09m
rocky vigil
#

hmm

#

wait 4x slower is a bit

#

suspicious

#

like doubling L1 should never make it 4x slower

round stone
#

16mb .nnue file. only change in the engine code otherwise was setting L1 to 512

rocky vigil
#

can you send it to me so I can test

round stone
rocky vigil
#

I got ~900k for 256 vs ~800k for 512

#

10M node search from startpos with 256:
info depth 33 seldepth 40 multipv 1 score cp 28 lowerbound nodes 10000621 nps 625625 hashfull 999 tbhits 0 time 15985 pv d2d4
bestmove d2d4 ponder d7d5
Number of accumulator updates: 15940910
Number of feature indices looped through: 189471834

10M node search from startpos with 512:
info depth 30 seldepth 37 multipv 1 score cp 35 lowerbound nodes 10000298 nps 570467 hashfull 1000 tbhits 0 time 17530 pv e2e4
bestmove e2e4 ponder e7e5
Number of accumulator updates: 15923538
Number of feature indices looped through: 190573575

round stone
#

what do the Nodes/second numbers show when you run stockfish bench with both?

rocky vigil
#

though I don't have the benchmarking script so I need to run it manually

round stone
#

weird, we'll let this training finish and see how it fares on fishtest

#

this is all i changed:

-#define EvalFileDefaultNameBig "nn-98b68b5a9455.nnue"
+#define EvalFileDefaultNameBig "nn-ff12e5c0b08b.nnue"

 // Number of input feature dimensions after conversion
-constexpr IndexType TransformedFeatureDimensionsBig = 256;
+constexpr IndexType TransformedFeatureDimensionsBig = 512;
rocky vigil
#

that is also all I changed

round stone
#

oh wait, my branch wasn't updated with the latest speed updates

#

nm, looks less slow now

#

it should be on this commit right? 9f21b44 disservin screlu affine speedup

rocky vigil
#

yeah

#

that one

#

unfortunately a couple of the single-thread speedups break multithread

#

I am hoping for someone to help me resolve those (I have no experience coding multithreaded)

#

otherwise it would be a shame to have to roll those back

round stone
#

L1-512 on top of 9f21b44 looks a lot better

Result of  10 runs
==================
base (...g13-sf-mar11) =    1057315  +/- 5402
test (...rofile-build) =     916618  +/- 4834
diff                   =    -140697  +/- 6748

speedup        = -0.1331
P(speedup > 0) =  0.0000
rocky vigil
#

mm

#

looks like those speedups were worth a lot

#

how far back was your old branch just curious

#

the speedups today should only be like +30-40% compared to yesterday

round stone
#

Updating 46581e8a..9f21b44c

#

46581e8a enable backward incremental updates

rocky vigil
rocky vigil
round stone
#

how does it perform if you try it locally?

rocky vigil
#

upper 700k (for L1=256)

#

then again my laptop is like really not great for speed testing

rocky vigil
#

anyways better speed is better speed lol

round stone
#

yea, any speed we can get is good

#

i wouldn't worry about multithreaded for now either

#

it's nice if it works of course. however the main blocker is getting anything on par with master

round stone
naive comet
#

yeah that's 100% a huge speedup

candid ivy
#

that's barely a speedup, it just makes multi threaded work again

rustic bough
#

With a3427fc multithreading works again without crash. But the analysis is totally weird.

naive comet
#

I guess this is quite different

#

I didn't read it in detail

candid ivy
candid ivy
# rocky vigil i mean the current code will perform (relatively) better with large nets because...
{
    // Bulk pawn threats: shift the whole pawn bitboard once per attack
    // direction instead of looping over individual pawns.
    Color c = order[Perspective][i];
    PieceType pt = PAWN;
    Piece attkr = make_piece(c, pt);
    Bitboard bb  = colorBB[c] & pieceBB[pt];
    indices.clear();

    // Square offsets of the two capture directions for this colour.
    auto right = c == WHITE ? NORTH_EAST : SOUTH_WEST;
    auto left = c == WHITE ? NORTH_WEST : SOUTH_EAST;
    // Occupied squares attacked in each direction (named to match the offsets above).
    auto attacks_right = (c == WHITE ? shift<NORTH_EAST>(bb) : shift<SOUTH_WEST>(bb)) & occupied;
    auto attacks_left = (c == WHITE ? shift<NORTH_WEST>(bb) : shift<SOUTH_EAST>(bb)) & occupied;

    while (attacks_right) {
        Square to = pop_lsb(attacks_right);
        Square from = to - right;  // undo the shift to recover the attacking pawn
        Piece attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }

    while (attacks_left) {
        Square to = pop_lsb(attacks_left);
        Square from = to - left;
        Piece attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }

    // Keep the pawn threat indices in ascending order before appending.
    std::sort(indices.begin(), indices.end());

    for (auto threat : indices) {
        active.push_back(threat);
    }
}

here you go, ~6% for me
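
A standalone sketch of the same bulk-shift idea, white pawns only, for anyone who wants to try it outside the engine. It assumes the usual a1 = 0, h8 = 63 square numbering and returns plain `(from, to)` pairs instead of feature indices. `white_pawn_threats` and `lsb` are illustrative helpers, not Stockfish functions.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Bitboard = std::uint64_t;

constexpr Bitboard FileA = 0x0101010101010101ULL;
constexpr Bitboard FileH = 0x8080808080808080ULL;

// Index of the lowest set bit; b must be non-zero.
int lsb(Bitboard b) {
    int i = 0;
    while (!(b & 1)) { b >>= 1; ++i; }
    return i;
}

// All (from, to) pairs where a white pawn attacks an occupied square.
// North-east attacks are listed first, then north-west.
std::vector<std::pair<int, int>> white_pawn_threats(Bitboard pawns, Bitboard occupied) {
    assert((pawns & ~occupied) == 0);  // pawns are themselves on occupied squares
    std::vector<std::pair<int, int>> threats;
    // Mask the edge file before shifting so attacks do not wrap around the board.
    Bitboard ne = ((pawns & ~FileH) << 9) & occupied;
    Bitboard nw = ((pawns & ~FileA) << 7) & occupied;
    for (; ne; ne &= ne - 1) threats.emplace_back(lsb(ne) - 9, lsb(ne));
    for (; nw; nw &= nw - 1) threats.emplace_back(lsb(nw) - 7, lsb(nw));
    return threats;
}
```

For a pawn on e4 (square 28) with pieces on d5 (35) and f5 (37), this returns `(28, 37)` followed by `(28, 35)`.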

naive comet
#

try using emplace_back and see if it helps

rocky vigil
round stone
#

+135 elo at 25k nodes per move: L1-512 vs. L1-256, simple threats

rocky vigil
#

What the

#

+130 fixed nodes
-20% speed maybe
+16 STC elo

#

It does not add up

round stone
#

the speed difference measured with bench positions may not be reflective of speed changes throughout actual games

#
Result of  10 runs
==================
base (...rofile-build) =     913329  +/- 4211
test (...rofile-build) =     668868  +/- 2992
diff                   =    -244461  +/- 4790

speedup        = -0.2677
P(speedup > 0) =  0.0000
#

look who's hiding in the last bytes of the L1-1024

lapis parrot
rocky vigil
#

Hmm looks like we still need major optimization work

#

Fixed nodes results are really strong though

lapis parrot
#

did we have any measurements of how much elo we actually get from doubling nowadays?

rocky vigil
#

Surely -20% speed is not -100 elo

#

-100 elo is more like -50% speed
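
To make that rule of thumb concrete, here is a small sketch that assumes Elo scales linearly with the log of the speed ratio. The log-linear model and the 100 Elo per doubling figure are assumptions taken from the "-50% speed is about -100 elo" estimate above, not measurements, and `elo_from_speed` is a made-up name.

```cpp
#include <cassert>
#include <cmath>

// Estimated Elo change for running at `speed_ratio` times the base speed,
// assuming a fixed Elo gain per doubling of speed (log-linear model).
double elo_from_speed(double speed_ratio, double elo_per_doubling) {
    assert(speed_ratio > 0.0);
    return elo_per_doubling * std::log2(speed_ratio);
}
```

Under that assumption a 20% slowdown (`speed_ratio = 0.8`) costs only about 32 Elo, which is the gap being pointed out here.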

rocky vigil
rocky vigil
round stone
rocky vigil
#

I actually just suck at optimization and I have no idea what is going wrong

#

Like all the data suggests that threat inputs should be significantly more accurate as evaluation but the STC never matches

violet badger
rocky vigil
#

Ok well I think I fixed multithread

#

So hopefully speedtest works again

violet badger
#

you can run speedtest with 1 thread..

#

but great you fixed multithread

rocky vigil
#

Oh I see

#

Hopefully it's not a huge single-thread speed loss

violet badger
#

./stockfish speedtest 1 16 5 (speedtest [threads] [hash (MiB)] [runtime (s)])

#

(you might want to give it a bit more than 5s though)

rocky vigil
#

Running a couple 150 sec 4 thread for 256, 512, 1024

#

Will have results in several minutes

round stone
rocky vigil
#

So I can’t download new nets nohope

round stone
#

those results somehow indicate 1024 is negative vs. 512 at fixed nodes

formal smelt
#

isnt that 512 vs 1024

#

would be a very weird way to display it if not

rocky vigil
#

Uh can you get a Google drive link to 512 and 1024 quickly

#

So I can download

round stone
#

can you download if i upload them here directly?

rocky vigil
#

Uh I need to transfer from phone to school laptop then

#

So Google drive might be easier

#

L1=256 4 thread 662978

#

(This computer is very slow disregard the absolute numbers)

round stone
#

uploaded there. too much work to figure out google drive

rocky vigil
#

Bruh the school WiFi also manages to block that nohope

#

School WiFi is actually terrible

#

Fine I’ll figure out phone -> laptop transfer

#

But you might have to wait longer

round stone
#

is it blocking based on the domain, or the filename extension or what?

candid ivy
rocky vigil
#

Because I have a class soon

rocky vigil
#

Some kind of filter

candid ivy
violet badger
#

you can also just have discord on the laptop...

rocky vigil
#

School laptop as well

#

There are a lot of things I can do on this laptop but discord and fishtest are not in that category

#

the web filtering is partially built in as software on the laptop as well

rocky vigil
violet badger
#

I see.

candid ivy
#

btw that is the profile

rocky vigil
#

Bruh overhead is like 4x actual accumulator updates