#UE Threat Inputs for AB
I don't understand how it reads 3 times
https://github.com/sscg13/Stockfish/tree/threat-inputs if you want branch
where did you add the cout
uh
latest commit I pushed just now
has all the debug couts
i know at least one issue is that it is expecting a non-empty header
which it never gets
another call from load_user_net
but shouldn't they all trigger this cout
unless there are 3 dirs
right
ok
so the load of L1 fails without telling me it fails
huh
stream.fail is false here apparently, according to this last cout
therefore this should return true
oh lmao @frosty imp inserting the std::cout here breaks the one line if statement
gah
i hate my life
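For anyone reading along, this is the usual brace-less `if` trap: once a debug print is inserted in front of the controlled statement, the `if` guards only the print. A minimal sketch with made-up function names (these are hypothetical stand-ins, not the actual loader code in the branch):

```cpp
#include <cassert>
#include <iostream>

// Hypothetical stand-ins for the loader check; not the real Stockfish code.

// Original intent: return false only when the stream failed.
bool load_ok_original(bool stream_failed) {
    if (stream_failed)
        return false;
    return true;
}

// After inserting a debug print without braces. Written the way the compiler
// parses it: the `if` guards only the print, `return false` always runs.
bool load_ok_with_debug_print(bool stream_failed) {
    if (stream_failed)
        std::cout << "stream failed\n";
    return false;  // no longer conditional
    return true;   // unreachable
}

// Fix: brace the body so both statements stay under the condition.
bool load_ok_fixed(bool stream_failed) {
    if (stream_failed)
    {
        std::cout << "stream failed\n";
        return false;
    }
    return true;
}
```

That would match the symptom above: `stream.fail()` is false, yet the load still reports failure.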

```
eval
Engine::verify_networks()
network::verify()
Current path: nn-98b68b5a9455.nnue
info string NNUE evaluation using nn-98b68b5a9455.nnue (7MiB, (15776, 256, 1))
NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
| r | n | b | q | k | b | n | r |
| +631 | +614 | -43.2 | -62.5 | | -37.0 | +616 | -56.6 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| p | p | p | p | p | p | p | p |
| +619 | +621 | +573 | +343 | +231 | +506 | +623 | -76.1 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| P | P | P | P | P | P | P | P |
| +579 | +245 | +58.9 | +285 | +376 | +23.3 | +245 | +573 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| R | N | B | Q | K | B | N | R |
| +53.8 | +255 | +46.8 | +362 | | +304 | +254 | +53.3 |
+-------+-------+-------+-------+-------+-------+-------+-------+
NNUE evaluation -271.85 (white side)
Final evaluation +27.16 (white side) [with scaled NNUE, ...]```
ah yes
we love to see it
ok I forgot to divide by QA*QB
that explains the hilariously high values
nvm I didn't
oh
I forgot to do CReLU
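Rough sketch of what the two steps mentioned here look like for a single-layer net: clamp each accumulator value (CReLU), take the dot product with the output weights, then divide by QA*QB to undo the quantization. The constants `QA = 255`, `QB = 64`, `SCALE = 400` and the function names are assumptions for illustration, not values confirmed for this net:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed quantization constants (illustrative, not confirmed for this net).
constexpr int QA    = 255;  // accumulator scale
constexpr int QB    = 64;   // output-weight scale
constexpr int SCALE = 400;  // eval scale

// Clipped ReLU: clamp the accumulator value into [0, QA].
inline int crelu(std::int16_t x) { return std::clamp<int>(x, 0, QA); }

// Single-layer output for one perspective: bias + sum(crelu(acc) * w),
// rescaled and divided by QA*QB to undo the quantization.
int evaluate_output(const std::int16_t* acc, const std::int16_t* weights,
                    std::size_t n, int bias) {
    assert(acc != nullptr && weights != nullptr);
    std::int64_t sum = bias;
    for (std::size_t i = 0; i < n; ++i)
        sum += crelu(acc[i]) * weights[i];
    return static_cast<int>(sum * SCALE / (QA * QB));
}
```

Without the clamp, a raw accumulator value like -20337 goes straight into the dot product, which is the kind of thing that produces absurd evals.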
average x^2 activation
```
+-------+-------+-------+-------+-------+-------+-------+-------+
| r | n | b | q | k | b | n | r |
| -0.00 | 0.00 | -0.00 | -0.01 | | -0.00 | 0.00 | -0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| p | p | p | p | p | p | p | p |
| 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| P | P | P | P | P | P | P | P |
| -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 | -0.42 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| R | N | B | Q | K | B | N | R |
| +0.66 | +0.49 | +0.40 | -0.02 | | -0.35 | -0.46 | -0.18 |
+-------+-------+-------+-------+-------+-------+-------+-------+
NNUE evaluation +2.60 (white side)
Final evaluation +9.55 (white side) [with scaled NNUE, ...]```
aha
it still doesn't work
yay /s
evaling startpos
between different searches
gives different results
yeah I think I have UB somewhere
when typing 'eval' command twice in a row results in different outputs
there is nondeterminism happening here and I can't find it gah
only good news is I think active features are correctly computed
```
-20337
25
0
-5344
-20337
25
0
-5312
-20337
25
0
-8224
-20337
8
0
-5344
-20337
25
0
-8256
-20337
9
0
9
0
0
0
8
0
7
0
-5308
-20337
25
0
-5344
-20337
25
0
-5308
-20337
25
0
-5344
-20337
25
0
2
0
9
0
9
0
0
0
-5476
-20337
25
0
-5680
-20337
25
0
-5476
-20337
25
0
1
0
51
0
-5352
-20337
25
0
-5360
-20337
25
0
-5352
-20337
25
0
1
0
51
0
-5360
-20337
25
0
-6992```
these look like normal accumulator values...
ok what the FT expects to output
does not remotely match what the second layer receives as input
huh I guess I'll debug this tmrw
Ok inference works now
As in it plays superhuman chess
Uh
Got a HalfKAv2hm net
Or so
(Also single layer)
Speedtest 510398 for (non-ue threats->256) vs 1246511 for master
I estimate an 8x gain with ue is reasonable
So overall at the same size perhaps 2-3x slower
In the midgame, significantly better than 8x should also be possible
Considering that I also compute the psq features from scratch
Also would it be fine to run a fixed nodes test on fishtest later (assuming I get also a halfkav2hm -> 256 net)
Or should we wait for more work on training side
@round stone do you have a single layer (HalfKAv2hm ->256) net (bullet format) that I can compare with for fixed nodes?
@formal smelt is the deduplication strategy to always take the lower index feature of a pair
There’s an L1-256 multilayer net that’s reasonably strong in stockfish nnue format, used by lichess, that i can find later
Ok yeah that would work
Alright i’ll find it later, afk now
When you do that can you also remove the “bulletbullet” padding at the end of skip-bm, rename it to nn-98b68b5a9455.nnue and upload it to fishtest
You mean for future bullet nets? The L1-256 i was going to find later is already uploaded on fishtest somewhere
Yeah
To get it on fishtest
So maybe I can start a fixed nodes test
With more compute than just my laptop
Sure, you got inference working with the arch of those nets?
Yeah future bullet nets need the padding removed because the parser expects eof after reading all the weights
I have inference working with these nets yes
Alright np
Non-ue
See here
Ok i can upload those later too. Feel free to upload too if you want to test sooner
once output buckets are added this wont be an issue anyway
Wait I claim the (threat-256)-1x8 still has padding
Doing this on my phone will be quite a pain so I’ll just wait
It’s fine to test. I’d say testing sooner than later is good
This should be the small network that Lichess uses: https://tests.stockfishchess.org/api/nn/nn-4fd273888b72.nnue
Fishtest doesn't support fixed nodes. You can post the stuff here and I can run. Or I think Stockfish works on OB also
Uh ok
https://github.com/sscg13/Stockfish/tree/threat-inputs (you might get some warnings on unused parameters when compiling, ignore those)
You also need to do this
I don’t have a branch for base halfkav2hm yet though
Might need to wait a couple hours for that
no, there's a newer one that's L1-256
this one is L1-256. forgot if there's an even newer one
https://tests.stockfishchess.org/tests/view/64b6b6abdc56e1650abab4e8
https://tests.stockfishchess.org/api/nn/nn-ecb35f70ff2a.nnue
padding trimmed, uploaded here:
https://tests.stockfishchess.org/api/nn/nn-98b68b5a9455.nnue
Yes I will not be back at computer for an hour and a half
@twilit oriole let me know if you are able to compile
(With 98b68b5a9455)
Oh I can't test till tomorrow evening. You should be able to put it on one of the OB instances if you need a fixed nodes I think
Don't think so
Oh right it works bc auto download net
So you don’t need to do any makefile shenanigans
early fixed nodes results:
Results of ./sscg13-sf/src/stockfish vs ./Stockfish-256/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 42.92 +/- 8.13, nElo: 60.53 +/- 11.34
LOS: 100.00 %, DrawRatio: 42.01 %, PairsRatio: 1.79
Games: 3604, Wins: 1472, Losses: 1029, Draws: 1103, Points: 2023.5 (56.15 %)
Ptnml(0-2): [69, 306, 757, 453, 217], WL/DD Ratio: 3.40
Results of ./sscg13-sf/src/stockfish vs ./Stockfish/src/stockfish (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -159.73 +/- 6.80, nElo: -260.30 +/- 9.65
LOS: 0.00 %, DrawRatio: 22.95 %, PairsRatio: 0.07
Games: 4976, Wins: 561, Losses: 2700, Draws: 1715, Points: 1418.5 (28.51 %)
Ptnml(0-2): [503, 1282, 571, 115, 17], WL/DD Ratio: 2.59
25k nodes per move: https://github.com/sscg13/Stockfish/tree/threat-inputs
- about +40 vs. L1-256 as main net, smallnet disabled
- about -160 vs. master
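For reference, the Elo figures in those result blocks follow from the score percentage via the standard logistic formula. A small sketch (function names are mine):

```cpp
#include <cassert>
#include <cmath>

// Score fraction from a win/draw/loss record (a draw counts as half a point).
double score_fraction(int wins, int draws, int losses) {
    return (wins + 0.5 * draws) / (wins + draws + losses);
}

// Logistic Elo difference implied by a score fraction in (0, 1).
double elo_from_score(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Plugging in 1472/1103/1029 (W/D/L) gives roughly +42.9, and 561/1715/2700 gives roughly -159.7, matching the reported numbers.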
Ok not bad for single layer
What is the approximate speed difference just curious?
I estimate that full threats are roughly the same speed (maybe even slightly faster) compared to simplified so this is promising
Result of 10 runs
==================
base (...rc/stockfish) = 238162 +/- 123
test (...rc/stockfish) = 2636398 +/- 8043
diff = +2398236 +/- 7955
speedup = +10.0698
P(speedup > 0) = 1.0000
non-UE threats less than 1/10 the speed of L1-256
Result of 10 runs
==================
base (...rc/stockfish) = 238051 +/- 159
test (...rc/stockfish) = 1114922 +/- 4646
diff = +876871 +/- 4576
speedup = +3.6835
P(speedup > 0) = 1.0000
this is speed vs. master
What do the positions primarily consist of
In this test, do you know?
I think ue can be anywhere between 4 - 16x faster
Depending on the stage of the game
these are speeds based on whatever is in the stockfish bench position list
Ah ok
Yeah mostly midgame
Then probably over 10x speedup from good ue
And properly optimized vector operations
If you have time I’ll also try and implement full threat inputs later
According to the fixed montytrain
hm
i think that might be a bit optimistic
Well you are going from ~70 avg features processed to ~8 in the midgame
And you are also going from compiler autovec to proper SIMD kernels
the compiler autovec for the updates alone is going to be basically perfect
its addition/subtraction in a loop
Hmm I see
Btw how expensive do you think looping through all the pieces to get all active threats is
Compared with an accumulator update
why dont you just time how long it currently takes to calculate all the indices
and compare it to the average time per accumulator update
Touching anything in sf code takes a lot of effort hmm
Ok writing difference of vectors of size ~ 800k elements (of which ~750k shared) is 2-3 msec
Meaning that for vectors of size ~80 the time is negligible
@formal smelt a couple questions regarding full inputs
- Pawn doesn’t distinguish between the exact type of piece it attacks, only enemy/friend? why is this
- MAP[] appears to be color insensitive, I assume I’m missing something?
- pawn->bishop, pawn->queen, pawn->king, bishop->queen, rook->queen, king->queen are the features excluded by deduplication right? And then pawn->enemy pawn or piece->piece is by whether from < to? But what about the cases of King->Rook and King->Bishop that are duplicated? I don’t see the code currently handling that
It is relatively simple for me to modify simplified input code to process full inputs after I know all the details
- It does distinguish, look at the target, enemy bool is just used to cut out some inputs
- The only case color would matter is for pawns and you can see that gets handled separately
- You can see those are in fact handled in map_king_threat
let threat = offsets::PAWN + usize::from(enemy) * indices::PAWN + (src / 8 - 1) * 14 + attack; it doesn't look like target is used to me
actually I think node tm exists though and should be good enough
or is that only for spsa.
actually I think it works but will be cursed
So, not working?
-160 against master takes a lot to overcome.
stockfish expects [i16 L1 bias LEB128] [i16 L1 weights LEB128] [i32 L1 PSQT LEB128] [i16 L2 bias little-endian bucket 1] [i16 L2 weight little-endian bucket 1] [i16 L3 bias little-endian bucket 1] [i16 L3 bias little-endian bucket 1] [i16 L3 weight little-endian bucket 1] ... (bucket 2) (bucket 3) ... (bucket 8)
but there is a thing where the input sizes for L2, L3, L4 are padded to a multiple of 32
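A sketch of the two mechanical pieces mentioned here: decoding a signed LEB128 integer, and rounding a layer's input size up to a multiple of 32. This is generic LEB128, not Stockfish's actual reader (which may wrap the data in its own framing); `read_sleb128` and `pad_to_32` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode one signed LEB128 value starting at `pos`; advances `pos` past it.
// Each byte carries 7 payload bits, and the high bit means "more bytes follow".
std::int32_t read_sleb128(const std::vector<std::uint8_t>& buf, std::size_t& pos) {
    std::uint32_t result = 0;
    unsigned      shift  = 0;
    std::uint8_t  byte   = 0;
    do
    {
        assert(pos < buf.size());
        assert(shift < 32);
        byte = buf[pos++];
        result |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    // Sign-extend when the last byte's sign bit (0x40) is set.
    if (shift < 32 && (byte & 0x40))
        result |= ~std::uint32_t{0} << shift;
    return static_cast<std::int32_t>(result);
}

// Round a layer input size up to the next multiple of 32.
constexpr std::size_t pad_to_32(std::size_t n) { return (n + 31) / 32 * 32; }
```

So a 15-wide input to a later layer would occupy 32 slots in the file, with the tail presumably zero-filled.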
keep in mind this is L1=256 and master is L1=3072
Oh, well, I see.
Also, would the later layer's bucket still be indexed the same way?
Or would we go directly to the indexing scheme where most buckets got 3 pieces count?
this might have to be tested
This is a chance to test our data pipeline and so on.
We're resetting the board.
If we wait, there would be more and more sunk costs into suboptimal data training and so on.
Here was the recommended idea.
I'm actually pretty excited for this
the net is only single layer
no multilayer yet
and there are still other things we can test
like full inputs vs simplified
ultimately I think it comes down to how fast we can make UE
rn I compute entirely from scratch (including psq inputs, so in midgame there are an average of like 50-60 features per side)
@formal smelt how feasible do you think this would be with current bullet
Writing the net to a specific file format has nothing to do with bullet
You can use a callback to run whatever code that writes it like that when you’d usually be saving a checkpoint
if the LEB128 is a problem I can hack sf code to load it in little-endian and write it in LEB128
I’m sure there is a leb128 crate you can use
yeah ig linrock can go for a multilayer simplified net
using bullet
when he has time
and I'll figure out the formatting and whatever
https://github.com/official-monty/montytrain/blob/main/policy/src/main.rs#L84 you can see using a callback here
should I attempt to use this net at all or should I wait for changes to get pushed
@frosty imp how do you think I could try and have accumulator caches go by ply, instead of by ksq-color
sort of like
to evaluate this position
we compute its active features
then we look backwards to find the latest computed ply
and try to ue the difference from that
isn't that the same as efficient updates?
imo accumulator caches are not necessary for this kind of inputs
since there is no need for full refreshes
ok sure I can just have make_move
compute all attacks
but what kind of function can I construct then
wdym
just use update_accumulator_incremental for everything
i think you can just extend dirtypiece
to contain the threat indices too
or just pre-calculate every changed index on makemove
well my ue plan was to just compute every active threat index and then remove the ones that are already there
oh ok
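The "compute every active index, then drop the ones already there" plan boils down to a sorted-list difference. A sketch assuming both lists are ascending and duplicate-free, using `std::vector` instead of the engine's list type; this is an illustration, not the `write_difference` in the branch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using IndexType = std::uint32_t;

// Given two ascending, duplicate-free lists of active feature indices,
// append indices found only in `now` to `added` and indices found only
// in `prev` to `removed`. Shared indices need no accumulator work.
void write_difference(const std::vector<IndexType>& prev,
                      const std::vector<IndexType>& now,
                      std::vector<IndexType>&       added,
                      std::vector<IndexType>&       removed) {
    std::size_t i = 0, j = 0;
    while (i < prev.size() && j < now.size())
    {
        if (prev[i] == now[j])
        {
            ++i;
            ++j;
        }
        else if (prev[i] < now[j])
            removed.push_back(prev[i++]);
        else
            added.push_back(now[j++]);
    }
    while (i < prev.size())
        removed.push_back(prev[i++]);
    while (j < now.size())
        added.push_back(now[j++]);
}
```

The walk is linear in the two list sizes, which is why it should be cheap next to the weight-row additions it saves.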
what is stateinfo btw
ok
also I have to ue both psq and threats
so uh
I guess I can just define a new structure
along with dirtypiece
actually this is unnecessary I can maybe just directly add a vector of threats in the stateinfo itself
maybe not a vector
SF creates a new stateinfo object in the beginning of every search call
so gotta allocate on the stack
bruh valuelist is like exactly what I've been using in nnue code
ah ok
valuelist is basically like a vector right
also the ue accumulator updates will probably just be for loops
jw says compiler autovec is essentially perfect for simple arithmetic like that anyways
someone else can rewrite the simd tilings and other things if this actually works
sf nnue code is due for a refactor anyways
btw I am open to other ue ideas more advanced than just "compute the threat difference from the last state and apply it (to both colors)"
What refactor? It looks clean to me.
Granted, some code is hard to read, but that's out of necessity rather than because of bad engineering.
the NNUE code is overly abstracted
Hmm? What's the issue?
We need abstraction to contain the monstrosity that is the high-performance code.
Again stop talking nonsense
it is definitely not necessary
you can look at countless other engines for cleaner NNUE code
I understand that high-performance code can sometimes be hard to read because well, the logic had to be complicated. You had to do some fancy tricks to go fast. There is no other way. For a performance-critical program, this is sometimes needed.
So, the best we can do is to abstract away that unreadable code into easy-to-understand functions.
But you're saying there are better ways?
I agree that unreadable code is usually bad, but sometimes it's a necessary evil, especially in performance-critical software. What I would do when there is no other option is wrap it in "untouchable" functions.
well that hypothetical situation doesn't apply here
so maybe we can discuss that in another place
I thought the other engines managed to do so because they were either
- less optimized
- not supporting as many architectures as Stockfish does.
well number 1 doesn't really describe what's going on
you can argue for no. 2, but the parts that aren't architecture-dependent are not clean either
Is it actually possible to write readable high-performance code under hardcore performance optimization?
Sometimes it can happen, but you could end up in scenarios like comments longer than the code itself.
Though I guess that's a red herring. The NNUE logic isn't some sort of nightmarish code. It's just straightforward SIMD code after all.
NNUE inference is not very complicated
there are literally 2 optimisations on sf nnue code that I know of but cannot implement because of the current state of code
2?
Sure.
I’ll train multilayer bullet format nets later, along with fixed full threat input nets
leb128 only matters if we have something that can beat the current master
Getting any kind of UE working will be important for measuring baselines at TC. Currently still don’t know how far we actually are from master
Yea 160mb is huge, but fine for testing
Still unclear whether full threat 1024 UE will be strong enough vs. master at TC
yeah I hope to get UE working tmrw
So does the branch work? Training is ready to start once it is confirmed
It will be an important test but keep in mind it is still going to be missing some large optimisations.
The full threat input net is extremely sparse, so permuting it will yield 20+ Elo at STC
Shared net weights (e.g mmap) probably also gains 20+ at STC single threaded because memory accesses are less predictable with threat inputs than king buckets (need a larger portion of the net in cache)
clearly i was too tired last night
that is indeed wrong lol
at least i think it is
@twilit oriole I have pushed fixes
I still don’t see this btw
its added complexity for a tiny gain. but sf is about tiny gains so sure you can add them if you want lol
ah it is mentioned
hmm actually. duplicate inputs here would only occur in check. in which case eval is skipped anyways?
hm for enemy at least. i guess there is still friendly
and i guess it is super common for a friendly rook to be next to the king lol
Yeah king-rook would save a bit bc castling
It’s fine
If this case doesn’t have deduplication I can also make a small change to the indexing
Changing the indexing is pretty easy on my side
@frosty imp I cannot take only stateinfo into update_accumulator_incremental because I need access to the entire board
So how should I format this
Unless stateinfo will also store 8 bitboards for the pieces
Aight appending active features only needs color bb, piece bb, and piece array now
Untested incremental update function now up at my branch
Lmao ue is up at my branch but only ~2x faster
I probably messed smth up
Yeah can someone test
And also maybe give suggestions on my hacked ue
I think I am having some fundamental ue impl issue here
Not overhead
Because if my debug info is to be trusted
I have like 20 updates/color/position
After 1M nodes from startpos
This is approximately 3x better than no ue
But still seems very high
Also in very short searches speed is up to 5x faster
Before declining down to 3
@frosty imp hints?
At my branch lol
How to profile
just find a good profiler and follow the tutorials
But this is abnormal no? It should be like less than 10
does the ue output match the eval starting from scratch
a 2x speedup is not unreasonable from UE
Yes
weird indeed
well might be getting trolled by laptop but there is still a noticeable slowdown
lmao I'm right
"update_accumulator_incremental" is never called
how on earth is it still faster than non-ue then
the speedup is real (I'm getting ue ~ 60% of master rn) but for the life of me I cannot figure out why update_accumulator_incremental is never called
I even have https://github.com/sscg13/Stockfish/blob/threat-inputs/src/nnue/nnue_feature_transformer.h#L126 here and this line is literally never triggered
ok but what is the issue there
maybe the update from scratch always got triggered?
like in the beginning nothing is computed
so i'd assume this always triggers
hmm yeah
yeah but after 10 million nodes
i guess the fix is just to label the accumulator as computed
in update_accumulator_scratch
bruh
using update_accumulator_incremental is actually slower
lemme do a little bit of optimize
is it possible to clear a valuelist
bruh there's no prebuilt function
pr it 
wait but this doesn't erase the 1, 2, ... elements
does the (for index : features) notation do this
yeah
or will it also find the remaining elements as well
like if I push_back 1, 2, 3, 4, 5
then set size to 2
will the for loop only find the values 1, 2
yeah
or all 5 values
everything will work
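To spell out the answer: a range-for only walks `begin()` to `end()`, and `end()` is `values_ + size_`, so shrinking the size hides the leftover elements even though they are still physically in the array. A sketch with a stand-in class; `truncate` and `clear` are hypothetical members, not part of the existing ValueList:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a fixed-capacity list; truncate()/clear() are hypothetical
// additions for illustration.
template<typename T, std::size_t MaxSize>
class SmallList {
   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) {
        assert(size_ < MaxSize);
        values_[size_++] = value;
    }
    // Shrink the visible range; stale elements stay in the array but are
    // never reached through begin()/end().
    void truncate(std::size_t n) {
        assert(n <= size_);
        size_ = n;
    }
    void     clear() { size_ = 0; }
    const T* begin() const { return values_; }
    const T* end() const { return values_ + size_; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Sums exactly the elements a range-for can see.
int sum_visible(const SmallList<int, 8>& list) {
    int total = 0;
    for (int v : list)
        total += v;
    return total;
}
```

After pushing 1..5 and truncating to 2, the loop sees only 1 and 2.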
i have no idea why this doesn't work
anyways I have been unable to get incremental update to be faster at all
so rn the "ue" optimization is probably literally accumulator reuse
i might be really borking because my statistics show that using my code, incremental vs scratch computation are ~ the same amount of compute
which I really don't believe
ok it might be that my write_difference isn't working
i actually need help
incremental doesn't work and I have no idea why
some comments:
(branch is https://github.com/sscg13/Stockfish/tree/threat-inputs once again)
assertion that the threats in computed stateinfo match the ones computed passes
assertion that the features are sorted passes
different results are returned using dirtypiece vs write_difference of psq
well
it is very probable that append_active_psq and append_active_threats indeed compute the correct indices
given that I have used these functions to do non-ue
and the version which doesn't use update_accumulator_incremental is also sound
so the failure point is probably in write_difference
but idk how that is wrong
```
NNUE derived piece values:
+-------+-------+-------+-------+-------+-------+-------+-------+
| r | n | b | q | k | b | n | r |
| 0.00 | 0.00 | 0.00 | 0.00 | | 0.00 | 0.00 | 0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| p | p | p | p | p | p | p | p |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| | | | | | | | |
| | | | | | | | |
+-------+-------+-------+-------+-------+-------+-------+-------+
| P | P | P | P | P | P | P | P |
| 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
| R | N | B | Q | K | B | N | R |
| 0.00 | 0.00 | 0.00 | 0.00 | | 0.00 | 0.00 | 0.00 |
+-------+-------+-------+-------+-------+-------+-------+-------+
NNUE evaluation +0.20 (white side)
Final evaluation +0.26 (white side) [with scaled NNUE, ...]```
this is the other trademark of broken update_accumulator_incremental
gets the eval right but completely borks the piece value estimation
this one actually persists, even if only using compute-from-scratch
which has the same bench
alright I'm going to wait for someone else to help me debug
if you really wanna try some "optimized non-ue" in the meanwhile
^^also computes everything from scratch, but somehow is 2-3x faster
what part of UE is wrong
does making, say, e2e4 on startpos give you the wrong result via ue?
stockfish code is so convoluted idk how to test this
just make move on a pos object, then call evaluate
i think you can find some example on making move in the extend_pv function
I think I understood the 0.00 phenomenon btw
if only removing a piece
it doesn't actually change the position or stateinfo
so the accumulator gets re-used
so far, with my current implementation, we have anywhere between 30 to 80 accumulator updates per node on average, depending on position
note that rn it's still just non-ue with accumulator re-use while I attempt to debug my incremental calculations
i think, if we can get this down to 20 avg.
then threat inputs should be quite good
Stockfish Threatnet (last working version) vs Stormphrax 6: 39 - 55 - 106
Very rough STC estimate
what about against the smallnet
I can’t download it rn bc my school wifi blocks fishtest
Also the STC will suck even harder
Considering that I still haven’t gotten update_accumulator_incremental to work
This version is using update_accumulator_scratch for everything
So it’s basically non-ue with accumulator reuse
Somehow still 2-3x faster than scratch computation at eval time
I am still looking for help
I literally can’t see the issue
I estimate there is still close to 200 STC elo to be gained from speedup
average stockfish debug
engin
i did this for startpos and move e2e4, and I found that update_accumulator_incremental was completely correct in this case
...
at least w.r.t. the difference calculation for added/removed features
@frosty imp ```c++
template<Color Perspective>
void update_accumulator(const Position& pos) {
    StateInfo* st = pos.state();
    if ((st->*accPtr).computed[Perspective])
        return;  // nothing to do
    // Look for a usable already computed accumulator of an earlier position.
    // Always try to do an incremental update as most accumulators will be reusable.
    do
    {
        if (!st->previous || st->previous->next != st)
        {
            // compute accumulator from scratch for this position
            update_accumulator_scratch<Perspective>(pos);/*
            if (st != pos.state())
                // when computing an accumulator from scratch we can use it to
                // efficiently compute the accumulator backwards, until we get to a king
                // move. We expect that we will need these accumulators later anyway, so
                // computing them now will save some work.
                update_accumulator_incremental<Perspective, BACKWARDS>(
                  pos.square<KING>(Perspective), st, pos.state());*/
            return;
        }
        st = st->previous;
    } while (!(st->*accPtr).computed[Perspective]);
    // Start from the oldest computed accumulator, update all the
    // accumulators up to the current position.
    update_accumulator_incremental<Perspective>(pos.square<KING>(Perspective), pos.state(), st);
}```
would you know why this code, from a 100k node startpos search, would trigger update_accumulator_scratch only once, at the very beginning?
yes?
both update_accumulator_scratch and update_accumulator_incremental
set computed = true at the end
uh
```c++
else
{
    for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
        (next->*accPtr).accumulation[Perspective][j] = (computed->*accPtr).accumulation[Perspective][j];
    }
    acc_updates++;
    threat_loops += (int)removed.size();
    threat_loops += (int)added.size();
    // Difference calculation for the activated features
    for (auto index : added)
    {
        const IndexType offset = TransformedFeatureDimensions * index;
        for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
            (next->*accPtr).accumulation[Perspective][i] += weights[offset + i];
    }
    // Difference calculation for the deactivated features
    for (auto index : removed)
    {
        const IndexType offset = TransformedFeatureDimensions * index;
        for (IndexType i = 0; i < TransformedFeatureDimensions; ++i)
            (next->*accPtr).accumulation[Perspective][i] -= weights[offset + i];
    }
}```
there is an issue in this
piece of code
hmm I don't see anything wrong from a quick skim
yeah that's why this is so suspicious
replacing the block with
```c++
else
{
    acc_updates++;
    threat_loops += (int)newthreats.size();
    threat_loops += (int)newpsq.size();
    for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
        (next->*accPtr).accumulation[Perspective][j] = biases[j];
    }
    for (auto index : newpsq)
    {
        const IndexType offset = TransformedFeatureDimensions * index;
        assert(offset < TransformedFeatureDimensions * InputDimensions);
        for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
            (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
    }
    for (auto index : newthreats)
    {
        const IndexType offset = TransformedFeatureDimensions * index;
        assert(offset < TransformedFeatureDimensions * InputDimensions);
        for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
            (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
    }
}
```
works but defeats the purpose of this function
i am equally perplexed
the (flawed) version of incremental halves avg. accumulator updates per node
but also my guess is that crippling the search also causes it to search on avg moves that are more heavy in threat updates
like I think incremental on every move is not optimal with threat inputs
because e.g. if you have a long series of captures
you do not really need to evaluate how the threats change with each intermediate capture
just skip all the way to the end
you may also assume that removed and added are correct
at least, it has matched all the manual experiments
@frosty imp issue is in this line
```c++
for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
    (next->*accPtr).accumulation[Perspective][j] = (computed->*accPtr).accumulation[Perspective][j];
}
```
sigh
we can replace it with
```c++
for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
    (next->*accPtr).accumulation[Perspective][j] = biases[j];
}
for (auto index : oldpsq)
{
    const IndexType offset = TransformedFeatureDimensions * index;
    assert(offset < TransformedFeatureDimensions * InputDimensions);
    for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
        (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
}
for (auto index : oldthreats)
{
    const IndexType offset = TransformedFeatureDimensions * index;
    assert(offset < TransformedFeatureDimensions * InputDimensions);
    for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
        (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
}
```
and it will work
meaning
despite (computed->*accPtr).computed[Perspective] being true
the accumulator is not correct
ok my github is up to date
I would really like help examining where accumulator.accumulation is updated
and how it can become "outdated"
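One way to hunt for where an accumulator goes stale is a debug check that rebuilds it from biases plus active feature rows and compares against what is stored, called wherever `computed` is trusted. A toy-sized sketch; every name here (`HiddenSize`, `refresh`, `accumulator_is_consistent`) is hypothetical and the layout is simplified relative to the engine:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t HiddenSize = 4;  // toy size; the net discussed here uses 256
using Accumulator = std::array<std::int16_t, HiddenSize>;

// Rebuild an accumulator from scratch: biases plus the weight row of every
// active feature. `weights` is row-major, one row of HiddenSize per feature.
Accumulator refresh(const std::vector<std::int16_t>&  weights,
                    const Accumulator&                biases,
                    const std::vector<std::uint32_t>& active) {
    Accumulator acc = biases;
    for (auto index : active)
    {
        const std::size_t offset = HiddenSize * index;
        assert(offset + HiddenSize <= weights.size());
        for (std::size_t j = 0; j < HiddenSize; ++j)
            acc[j] = static_cast<std::int16_t>(acc[j] + weights[offset + j]);
    }
    return acc;
}

// True when a stored accumulator matches a from-scratch rebuild.
bool accumulator_is_consistent(const Accumulator&                stored,
                               const std::vector<std::int16_t>&  weights,
                               const Accumulator&                biases,
                               const std::vector<std::uint32_t>& active) {
    return stored == refresh(weights, biases, active);
}
```

Asserting this on the source accumulator right before the copy in the incremental path would show whether it is already wrong when `computed` says it is valid.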
honestly if you want a proof of concept you can just remove the accumulator updates altogether and do everything via the finny refreshes
it is not much of a slowdown and not an absolute PITA to implement
Just a quick look at the warnings, I'm seeing:
position.cpp:1022:16: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of a non-trivial type ‘struct Stockfish::StateInfo’ leaves 1088 bytes unchanged [-Wclass-memaccess]
1022 | std::memcpy(&newSt, st, offsetof(StateInfo, accumulatorBig));
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from position.cpp:19:
I'd be worried about this kind of warnings.
I indeed added 1440 extra bytes to StateInfo...
also clang never gave me this warning
perhaps I should install gcc
what does this code do
that seems to be finding the byte offset of a member of a struct
aka
```c++
struct S {
    u32 a;
    u32 b;
};
// offsetof(S, b) == 4 (probably)
```
ok
but according to this warning it is not working
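For context, the `memcpy(&newSt, st, offsetof(StateInfo, accumulatorBig))` pattern copies only the leading members of the struct and leaves the accumulator part alone. A toy sketch of that prefix copy; `ToyState` and `copy_prefix` are made up, and the static_asserts state the assumptions that make a raw byte copy safe:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Toy stand-in for a state struct: a small "copied" prefix followed by a
// large member that is deliberately left untouched by the copy.
struct ToyState {
    std::uint64_t key;
    int           rule50;
    int           pliesFromNull;
    // Everything from here on is NOT copied.
    std::int16_t accumulator[8];
};

// Assumptions that make a raw byte copy well-defined for this type.
static_assert(std::is_trivially_copyable_v<ToyState>);
static_assert(std::is_standard_layout_v<ToyState>);

// Copy only the members that precede `accumulator`.
void copy_prefix(ToyState& dst, const ToyState& src) {
    std::memcpy(&dst, &src, offsetof(ToyState, accumulator));
}
```

Anything added to the struct before the cut-off member becomes part of the copied prefix, so it has to be safe to copy as raw bytes too.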
what did you add to StateInfo
2x Eval::NNUE::FeatureSet::IndexList (520 bytes each), ColorBB[2], PieceBB[16] (I intended for it to be 8 but whatever), Piece[64]
This is all before accumulatorBig
according to the warning the object is non trivial anymore, this will be there for as long as StateInfo is non trivial
it might work might not work depends a bit on the context
Is it because of adding a Eval::NNUE::FeatureSet::IndexList in StateInfo
I can store them in Accumulator instead I think
And that should make StateInfo trivial again
Isn’t accumulator in state info too ? It would have the same result since if a member is not trivially copyable then the struct itself isn’t either
but since we only memcpy the non-accumulator portion of StateInfo
perhaps that issue could be avoided
otherwise I could look into storing pointers instead
but idk how that would work
actually since valuelist is very simple implementation
what if I just construct a trivial version of it
lemme try that
First check if that’s actually the struct causing the warning
template<typename T, std::size_t MaxSize>
class ValueList {
   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) { values_[size_++] = value; }
    const T*    begin() const { return values_; }
    const T*    end() const { return values_ + size_; }
    const T&    operator[](int index) const { return values_[index]; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};
and if I comment out the IndexList declarations in stateinfo I get it is trivial
valuelist should be easy to do in a trivial manner
it's essentially just an array
idk why it isn't trivially copyable
it might be because of the [] operator
well the member functions aren't copied anyway
yeah I also have no idea why it's nontrivial
maybe static_assert(std::is_trivially_copyable_v<T>)?
otherwise it will not be trivially copyable
Eval::NNUE::IndexType is just uint32_t
can you send the error you get
here
for static_assert(std::is_trivially_copyable<ValueList<T, N>>)?
essentially memcpy doesn't work because adding valuelists makes stateinfo nontrivial
whereas I would like to have a valuelist store all active features corresponding to a given accumulator
compiler and language version?
so that I don't need to recompute them to take the difference
I personally have clang 19.1.4, and stockfish compiles with c++17
also since millions of these will be processed for accumulator updates
trivialness might actually impact the speed
in the meanwhile I'll just replace it with an actual array
and see if that fixes things
this assert passes if I insert it into Valuelist
if vscode is to be trusted the issue is in std::size_t size_ = 0
maybe vscode is not to be trusted though
ok godbolt backs up this claim
could you send godbolt link
how do I get a link
nvm im dumb
anyways the compiler in godbolt (hopefully) doesn't lie
even though I have no idea what the issue could possibly be
also why cout
i think it's just the custom constructor
instead of static_assert
std::cout << std::boolalpha << std::is_trivially_copyable_v<ValueList<std::uint32_t, 128ULL>> << std::endl;
but this is still true though
non-trivial type
vondele's warning just says 'writing to an object of a non-trivial type'
anyways I can really just like
manually replace the functionality
with only an array
but that's stupid
like just have arr[0] be the replacement for size_ or whatever
i think it's the initialization of size_=0 that causes the problem
nahh please have a separate member no?
so remove it?
idk how you can do that without breaking stuff
no initializing size_ to 1 also has issues
unless it's the initialization in general
maybe all members need initialization or none?
you need to remove the initialization
it wouldnt be any particular value
probably the initialization makes the constructor non-trivial?
ok but how do you get the functionality without the initialization of size_
eh you probably can't
you know this would be a great time to refactor accumulator updates 
what if we just leave size_ uninitialized
and then use clear() to set it to 0
does this bypass work
that could solve it
but i don't like the extra step needed to use it elsewhere
feels error prone
yeah it's
annoying
this is basically doing a constructor without actually doing one
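A small sketch of the distinction being circled here, using the standard type traits. `InitList` and `RawList` are hypothetical cut-down copies of `ValueList`, not the real class. A default member initializer leaves the class trivially copyable (which matches the `true` printed earlier) but makes its default constructor non-trivial; that this is exactly what GCC means by "non-trivial type" in its warning is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Variant with a default member initializer, like `std::size_t size_ = 0;`.
template<typename T, std::size_t MaxSize>
class InitList {
   public:
    std::size_t size() const { return size_; }
    void        push_back(const T& value) { values_[size_++] = value; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Variant without the initializer: the caller must clear() before first use.
template<typename T, std::size_t MaxSize>
class RawList {
   public:
    void        clear() { size_ = 0; }
    std::size_t size() const { return size_; }
    void        push_back(const T& value) { values_[size_++] = value; }

   private:
    T           values_[MaxSize];
    std::size_t size_;
};

using Init = InitList<std::uint32_t, 128>;
using Raw  = RawList<std::uint32_t, 128>;

// Both variants are trivially copyable...
static_assert(std::is_trivially_copyable_v<Init>);
static_assert(std::is_trivially_copyable_v<Raw>);
// ...but the initializer makes the default constructor non-trivial,
// so only the uninitialized variant is trivially default constructible.
static_assert(!std::is_trivially_default_constructible_v<Init>);
static_assert(std::is_trivially_default_constructible_v<Raw>);
```

The cost of `RawList` is the one raised above: every declaration needs an explicit `clear()` before the first `push_back`, and forgetting one reads an indeterminate `size_`.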
how about you just do it that way for the prototype
yeah I'll just try and see if it works for ue now
would you be open to rebasing this onto some other refactors
yeah I can try making it nice but it'll probably take a few days lol
i mean only if this will get merged
myself I'm pretty confident that threat inputs will work with sufficient optimization but
thankfully valuelist is only used once
'capturesSearched' and 'quietsSearched'
i really hope it works lol
threat inputs might solve a whole class of fortress issues
welp not initializing size_ = 0 breaks the code somehow even though I thought I tracked down and added .clear() every time a valuelist is declared
```
Assertion failed: (next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j], file nnue/nnue_feature_transformer.h, line 264
```
lmao it still doesn't work
the accumulator still gets messed up somehow
```
for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
    (next->*accPtr).accumulation[Perspective][j] = biases[j];
}
for (auto index : oldfeatures)
{
    const IndexType offset = TransformedFeatureDimensions * index;
    assert(offset < TransformedFeatureDimensions * InputDimensions);
    for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
        (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
}
for (IndexType j = 0; j < TransformedFeatureDimensions; j++) {
    assert((next->*accPtr).accumulation[Perspective][j] == (computed->*accPtr).accumulation[Perspective][j]);
}
```
relevant code
uh @frosty imp I pushed commit to my branch so can you compile it with gcc and tell me all the warnings
because clang did not give me any memcpy warnings
also for a startpos search at least
after update_accumulator_scratch being called on startpos
everything else will be update_accumulator_incremental
anyways since Stockfish::StateInfo is back to being trivial the memcpy should be fine
is it still possible that something in the memcpy is going wrong?
I don't know any other change that could affect the accumulation
since nowhere else in the code besides feature_transformer is it modified
I still have no idea where accumulator.accumulation is modified besides in the feature transformer functions
so I really don't know how it would stop matching
I don't think it's because of any memcpy because a 4096 byte buffer in the accumulator struct before the values doesn't work either
92 -60 -152 39 -71 -20 -113 31 -45 262 -76 -104 193 -98 72 -6 65 4 -2 -1025 88 -300 295 33 -270 -292 -442 37 72 -7 -184 252 -209 -70 -514 -235 130 -243 -256 -112 -57 140 -418 -121 -75 13 -906 110 77 -150 103 -231 -155 -1937 -2310 -98 -191 -522 -27 1 46 -2177 -185 135 -358 12 -131 -44 -81 -127 -87 -952 -430 -416 -6 -56 140 -22 85 -49 -131 10 -68 218 -1726 -630 -92 -23 -81 139 -459 30 -223 -53 13 -785 -32 -88 244 99 107 159 93 -272 -399 -551 -136 15 17 -731 -328 -116 -48 -294 8 54 -88 -609 -204 28 13 -739 36 -149 -213 -128 -201 -53 -417 -161 -180 -10 -2 25 -11 -175 56 -130 -90 -178 -3416 39 -656 -84 181 69 -5 -394 -137 -763 31 -205 46 11 -72 -71 -41 196 39 -1601 -521 -223 47 -3563 -384 -96 43 57 215 -195 -72 -12 -191 -153 -26 106 65 -2286 9 -192 190 -50 2 -256 170 -167 -43 50 70 71 -641 346 -2 72 -10 67 -107 -42 40 36 306 -476 -122 -188 33 -36 -604 119 -275 93 -917 2 -701 -43 -410 -128 -156 -946 -80 64 -446 40 6 38 75 -648 177 -27 -425 -20 -32 -1993 -174 40 -117 -7 60 -148 124 -226 -71 27 -358 -47 147 -168 -226 -601 5 7 -231 -273 -12 -9 -174 -121```
101 -61 -132 28 -64 -10 -105 45 -51 267 -77 -99 180 -299 81 -17 72 33 -41 -975 88 -298 265 65 -251 -304 -402 -103 138 -101 -186 250 -221 -224 -502 -140 128 -168 -239 -114 -56 133 -424 -114 -81 17 -937 128 33 -135 101 -334 -151 -1822 -2369 -95 -189 -363 3 -31 28 -2126 -166 136 -382 2 -647 -19 -166 -126 -87 -998 -443 -606 -31 -49 169 -27 86 -60 -127 6 -75 198 -1701 -549 -107 -29 -109 145 -437 20 -213 -15 25 -753 -47 -80 237 138 101 143 123 -291 -395 -530 -118 -50 7 -706 -330 -101 -53 -448 34 40 -176 -606 -197 8 -11 -837 27 -140 -208 -100 -200 -58 -435 -181 -150 -17 38 15 -12 -162 -73 -136 -90 16 -3478 -108 -680 -509 163 65 -9 -393 -197 -729 39 -111 46 -4 -69 -75 -26 211 11 -1556 -523 -209 43 -3528 -265 -71 20 -8 211 -180 -81 2 -181 -131 -27 115 69 -2229 19 -219 208 -45 8 -243 159 -181 -15 56 43 73 -659 311 -9 80 4 75 -129 -38 25 12 303 -464 -115 -163 32 -32 -598 41 -283 111 -935 -14 -705 -83 -416 -60 -177 -939 -78 66 -455 48 1 -155 74 -740 166 -53 -358 -37 -7 -2134 -175 -45 -108 -3 67 -158 131 -232 -81 15 -345 -67 139 196 -187 -627 39 -4 -221 -270 -11 31 -169 -58```
what
ok then
idk how useful this is though
8 9 10 11 12 13 14 15 65 70 130 133 192 199 260 323 429 432 433 434 435 436 438 439 505 510 570 573 632 639 700 763 938 957 1622 1623 1643 1644 2728 2735 2832 2833 4607 4608 4611 4612 4613 7445 7446 7447 7448 7449 8350 8353 8754 8771 8773 9635 9636 9656 9657 11022 11023 11120 11127 13407 13408 13413 13414 15333 15334 15335 15336 15337```
this should be the position after 1. c2c3
ok I have discovered this is the white perspective accumulator after c2c3
so somehow it's getting passed the accumulator of the wrong perspective
wot
ok I am a clown
I have the same
IndexList
for both perspectives
because I forgot to split by perspective
so of course the features and stuff will cease to match...
UE reduces accumulator updates by more than 4x in 10M node search from startpos
but is not that much faster
maybe because of write_difference overhead
speedtest Nodes/second : 1809147 (non-ue)
Nodes/second : 1736032 (ue)
lmao
write_difference overhead is actually insane apparently
ok branch updated
uh
tbh ue being like
~same speed
with anywhere between 1/3 and 1/5 of the accumulator updates
is shocking
(ly bad overhead)
btw please compare current branch vs https://github.com/sscg13/Stockfish/commit/a2604d40b42b7f755cebe91ed5a41d2cd0ac30d9
for speedtesting purposes
(and/or profile it, the only major difference should be way less accumulator updates but many usages of write_difference)
i would appreciate it a lot
according to my debug statistics, which I can run now that I have 'real ue', this estimate is approximately correct (per side, so ~16 total per eval)
I am suspicious of the bench change when using dirtypiece to perform the psq feature updates
But the short STC I ran (50 - 31 - 119) doesn’t lie
It’s like a 10% speedup
@formal smelt are we still doing king-rook and king-bishop deduplication
In full inputs
i thought we weren't doing that
Ok sure that’s fine
Yeah now that ue is in a better shape
(Overhead is constant, but massively fewer accumulator updates scale much better with larger nets)
I can work on supporting full threats as well
Based on this statistic, it should only have 2x more accumulator updates vs halfkav2hm so disregarding overhead it should be very promising
Also I’m pretty sure this speedup is purely from write_difference on smaller vectors
Result of 50 runs
==================
base (...rc/stockfish) = 275282 +/- 1164
test (...rc/stockfish) = 447672 +/- 1937
diff = +172389 +/- 1315
speedup = +0.6262
P(speedup > 0) = 1.0000
speedup of latest threat-inputs branch (46581e8) vs. a2604d4
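The `speedup` line appears to be the mean difference divided by the base mean (172389 / 275282 ≈ 0.6262). A sketch of that arithmetic, assuming this is how the benchmarking script computes it; `relative_speedup` is a made-up name.

```cpp
#include <cassert>
#include <cmath>

// Relative speedup of `test` over `base`, both in nodes per second.
// 0.0 means equal speed, 1.0 means twice as fast, negative means slower.
inline double relative_speedup(double base_nps, double test_nps) {
    return (test_nps - base_nps) / base_nps;
}
```

Read this way, +0.6262 means the test build searches roughly 1.63x as many nodes per second as the base build.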
Ok looks like ue is much better on your machine compared to mine
On my laptop it’s barely like 10%
@formal smelt can I get active features for kiwipete, full inputs (including deduplication) when you have time
~/bench_parallel.sh ./stockfish_a2604d40 ./stockfish_46581e8a 13 10
sf_base = 335055 +/- 1366 (95%)
sf_test = 739457 +/- 6135 (95%)
diff = 404401 +/- 5258 (95%)
speedup = 120.69692% +/- 1.570% (95%)
Since net weights are not shared between instances
If you do single threaded the result will be far better
is there at least another 2x speedup expected beyond this?
diff --git a/src/nnue/nnue_feature_transformer.h b/src/nnue/nnue_feature_transformer.h
index 35027bf6..ae832a67 100644
--- a/src/nnue/nnue_feature_transformer.h
+++ b/src/nnue/nnue_feature_transformer.h
@@ -256,21 +256,25 @@ class FeatureTransformer {
acc_updates++;
threat_loops += (int)removed.size();
threat_loops += (int)added.size();
+
+ auto* acc_ptr = &((next->*accPtr).accumulation[Perspective][0]);
+
// Difference calculation for the activated features
for (auto index : added)
{
- const IndexType offset = TransformedFeatureDimensions * index;
- assert(offset < TransformedFeatureDimensions * InputDimensions);
+ const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
- (next->*accPtr).accumulation[Perspective][j] += weights[offset + j];
+ acc_ptr[j] += weight_ptr[j];
}
+
// Difference calculation for the deactivated features
for (auto index : removed)
{
- const IndexType offset = TransformedFeatureDimensions * index;
- assert(offset < TransformedFeatureDimensions * InputDimensions);
+ const auto* weight_ptr = &weights[TransformedFeatureDimensions * index];
+
for (IndexType j = 0; j < TransformedFeatureDimensions; j++)
- (next->*accPtr).accumulation[Perspective][j] -= weights[offset + j];
+ acc_ptr[j] -= weight_ptr[j];
}
}
unless i did something wrong i got a good speedup from this
sf_base = 731585 +/- 8007 (95%)
sf_test = 1162599 +/- 14972 (95%)
diff = 431013 +/- 10956 (95%)
speedup = 58.91504% +/- 1.498% (95%)
you can use bullet main, compile with --no-default-features --features cpu
uh hang on
ill just push to the montytrain branch
simplified inputs ye?
branch print-features-simpler-inputs
run with cargo r -r --bin value
edit the fen here: https://github.com/official-monty/montytrain/blob/print-features-simpler-inputs/value/src/main.rs#L10
Full inputs (including deduplication) but should be easy change
Simpler inputs is already worked out
Well we apparently already have a significant one so given I suck at optimization I expect this to be the case
yeah I also get a decent speedup with this but not nearly as large (probably because I use a laptop and don't use the most aggressive optimization settings)
well I expect this to be -100 elo so far
going to test this, and also with a +50% time odds to simulate further optimization
Uh
There may be something suspicious with https://github.com/sscg13/Stockfish/commit/a9cb57d5cb6c7c06b7bd7445fd1a6d04da333c96
Bench altering
I don't get how a 256 threat net is that slow. It shouldn't be much, 80%+ time is in search
Because vondele said the 3072 master net is only 50% of runtime in eval
Yeah I probably optimized smth poorly
Maybe the special SIMD kernels actually are meaningful
Welp when I get back I’ll try undoing this change
read the thread
i wouldn't take that test as indicating anything
I see.
If someone could profile like tmrw I would really appreciate it ^^
I am pretty sure write_difference is much slower than I expected
nothing else can explain that ue is less than 2x faster than non-ue with around 1/4 the accumulator updates on average
I will look into hacking in a custom incremental difference for single move
and also +60% time only performing 50 elo better at STC UHO is way below what we would expect from scaling data
is it? I thought it was 2x time odds ~= 70-80 elo
ok well in that case
+40 fixed nodes 1/2 speed
in no way equals -270 stc elo
so either way something is wrong
well the bigger issue was I merged some ue change that affected bench bc I ran SSS and it was not worse
that was probably against my better judgement
so now I'm running a non-SSS test on fishtest
what the - (check back in on this in a couple hours)
I really got scammed by SSS
maybe I straight up loaded the wrong compile
anyways pls fix dirtypiece
it's worth another like 20% if done properly or smth
@twilit oriole I think everything is explained lol
Interesting.
ok yeah tmrw I will try to get dirtypiece working again
and maybe also try and optimize incremental more
(in the difference calculation for features)
look at append_active_threats for optimizations, this eats most of the time, i.e. move the vector into the class don't recreate it constantly
the enemy bool in the make_index can be rewritten as bool enemy = (attkr ^ attkd) & 8; in my profile it was kinda slow i think, (~1.4-2% speedup)
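A sketch of why the `& 8` trick works, assuming the piece encoding matches Stockfish's usual layout (piece type in the low three bits, colour in bit 3). The enum below is a local copy for illustration and `is_enemy` is a hypothetical helper, not a function from the branch.

```cpp
#include <cassert>

// Piece codes assumed to follow Stockfish's layout: white pieces are 1..6,
// black pieces are the same values plus 8, so bit 3 carries the colour.
enum Piece : int {
    NO_PIECE,
    W_PAWN = 1, W_KNIGHT, W_BISHOP, W_ROOK, W_QUEEN, W_KING,
    B_PAWN = 9, B_KNIGHT, B_BISHOP, B_ROOK, B_QUEEN, B_KING
};

// XOR leaves bit 3 set exactly when the two pieces differ in colour,
// so no colour extraction or comparison of two enums is needed.
inline bool is_enemy(Piece attacker, Piece attacked) {
    return ((attacker ^ attacked) & 8) != 0;
}
```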
@rocky vigil whats the purpose of sorting the features in the append_active_threats ? I get the same bench without them
diff --git a/src/nnue/layers/screlu_affine.h b/src/nnue/layers/screlu_affine.h
index aeb7e951..06ad8249 100644
--- a/src/nnue/layers/screlu_affine.h
+++ b/src/nnue/layers/screlu_affine.h
@@ -56,24 +56,23 @@ class SCReLUAffine {
}
// Forward propagation
- OutputType evaluate(InputType* input, IndexType bucket) {
+ OutputType evaluate(const InputType* input, IndexType bucket) {
assert(bucket < OutputBuckets);
- constexpr IndexType Start = 0;
- OutputType output = 255*(std::int32_t)biases[bucket];
- for (IndexType i = Start; i < InputDimensions; i++) {
- input[i] = std::min((std::int16_t)255, std::max(input[i], (std::int16_t)0));
- }
- for (IndexType i = Start; i < InputDimensions; i++) {
- intermediate[i] = input[i]*weights[bucket*InputDimensions+i];
- }
- for (IndexType i = Start; i < InputDimensions; ++i)
- {
- output += (std::int32_t)(input[i])*(std::int32_t)(intermediate[i]);
+
+ const IndexType weightOffset = bucket * InputDimensions;
+ const auto* weights_ptr = &(weights[weightOffset]);
+
+ std::int32_t output = 255 * static_cast<std::int32_t>(biases[bucket]);
+
+ for (IndexType i = 0; i < InputDimensions; i++) {
+ const std::int16_t clipped = std::clamp(input[i], static_cast<std::int16_t>(0), static_cast<std::int16_t>(255));
+
+ output += clipped * clipped * weights_ptr[i];
}
+
return output / 255;
}
- alignas(CacheLineSize) std::int16_t intermediate[InputDimensions];
alignas(CacheLineSize) std::int16_t biases[OutputBuckets];
alignas(CacheLineSize) std::int16_t weights[OutputBuckets * InputDimensions];
};
you can speedup evaluate by like 5% for me, if you need the sort order for the active inputs, then I wonder if you can avoid the sort by changing the loop so that all indices which are generated next are already in ascending order?
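A standalone sketch of the fused loop from the diff above, for one output bucket: clamp, square and multiply in a single pass, with no intermediate buffer. The free function `screlu_affine` and its parameters are made up for illustration; the quantisation constant 255 is taken from the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fused SCReLU + dot product for one output bucket.
inline std::int32_t screlu_affine(const std::int16_t* input,
                                  const std::int16_t* weights,
                                  std::int16_t        bias,
                                  std::size_t         n) {
    constexpr std::int32_t Q = 255;

    std::int32_t output = Q * static_cast<std::int32_t>(bias);

    for (std::size_t i = 0; i < n; ++i) {
        // Clamp to [0, Q], then square; the arithmetic is done in 32 bits.
        const std::int32_t clipped = std::clamp<std::int32_t>(input[i], 0, Q);
        output += clipped * clipped * weights[i];
    }

    return output / Q;
}
```

Doing the arithmetic in 32 bits also avoids the truncation that storing `input[i] * weights[...]` into a 16-bit `intermediate` array could introduce.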
maybe they already are?
This is so that write_difference works
Also I’m pretty sure that they don’t automatically generate in ascending order, say you have a rook defending two pawns on a3 and h3, then if the king is on efgh by mirroring the h3 one is first
yeye i checked for some reason bench was unchanged but they weren't in ascending order
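A sketch of why sorted input matters for a difference step like this: with both lists in ascending order, the added and removed features fall out of a single linear merge. This is not the branch's actual `write_difference`; `sorted_difference` and the `IndexList` alias are stand-ins for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using IndexList = std::vector<std::uint32_t>;

// Given two ascending lists of active feature indices, collect the indices
// that disappeared (removed) and the ones that appeared (added).
inline void sorted_difference(const IndexList& before,
                              const IndexList& after,
                              IndexList&       removed,
                              IndexList&       added) {
    removed.clear();
    added.clear();

    std::size_t i = 0, j = 0;
    while (i < before.size() && j < after.size()) {
        if (before[i] == after[j]) {  // present in both: nothing to update
            ++i;
            ++j;
        } else if (before[i] < after[j]) {
            removed.push_back(before[i++]);
        } else {
            added.push_back(after[j++]);
        }
    }
    // Whatever is left on either side has no counterpart.
    while (i < before.size()) removed.push_back(before[i++]);
    while (j < after.size()) added.push_back(after[j++]);
}
```

If either list is out of order the merge silently reports features that did not change, so the accumulator can drift while a bench on a few positions still happens to match.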
Test latest commit or two
9594b46 appears to be a significant gain in bench
Though the school laptop’s antivirus sometimes screws the results
In general how much do the various enum conversions cost? If this is actually significant then I can just go back to bitwise tricks
Also if someone wants to debug the dirtypiece
Well, they are basically free; this just seems to do better than color_of, which uses bitshifts
I mostly copied it from master
make_index(piece, from, from, piece, ksq) does get you the psq index
As a hack
Figuring this out should also be a large gain
iirc some 15%
Yeah turns out declaring many small vectors is highly unnecessary
Oh bruh what happens to dirtypiece in case of a king move
I think this is what is breaking
Yeah I probably mess up the mirror refresh
~16.5%
Yeah I added this and combined psq, threat loops when computing from scratch in d0605d9
Probably not as big of a speedup though
Multilayer inference will be better because we have proper code for it
Ok I have no new optimizations planned rn so I’ll run a test to see where we are at
we’re probably ready to test multilayer with simplified inputs @round stone
Ah shoot can’t for a few hours bc I don’t remember my fishtest password
with some additional optimizations and simd you can get another 11%
this is not permuted as well right?
Nope
Just whatever bullet default order is
At this point it’s almost certain that at L1=256 simplified threats are superior to halfkav2hm at STC
I want to switch to multilayer in part because we have optimized SIMD for that already
Besides if computing the threat indices is the bottleneck rn that implies favorable scaling to large L1
Speaking of computing threat indices bulk pawn attacks might be considerable speedup
MultiThreading is broken in 9f21b44. Was ok in 2db74f4.
Since half the pieces being looped through will be pawns
Huh interesting
Yeah uh
I see
Lemme bisect it quickly
9594b46 breaks multithreading
sigh
Do the threads like access the same featuretransformer class or smth
As long as they access separate featuretransformers it shouldn’t break?
make it a static thread_local
Welp n*ram for n threads
it's really not much ram?
I guess fishtest already has this problem though
how many pieces are in that array, you are sorting and adding it constantly, it's like max 30 elements with 4 bytes or something?
Max 16 with 4 bytes
64 bytes per thread 
Oh but the commit afterwards moves removed, added also into featuretransformer
Whatever that’s like 1KB
Uh adding static thread_local gives me a bunch of warnings on forward declaration
And then undefined symbol errors
show me
also just move the vector into the function again if you do that
or if it is max 16, i mean use another valuelist not a vector
std::sort doesn’t work on valuelist
you just need to add
T* begin() { return values_; }
T* end() { return values_ + size_; }
Wait I claim valuelist already has these
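A sketch of the point about iterators: `std::sort` needs mutable iterators, so a list that only exposes `const T* begin() const` cannot be sorted in place. The class below is a local copy of the `ValueList` pasted earlier with the two non-const overloads added; whether the real class already has them is exactly what is disputed above, so treat this as illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

template<typename T, std::size_t MaxSize>
class ValueList {
   public:
    std::size_t size() const { return size_; }
    void        clear() { size_ = 0; }
    void        push_back(const T& value) { values_[size_++] = value; }

    // Mutable iterators: required by std::sort.
    T* begin() { return values_; }
    T* end() { return values_ + size_; }

    // Read-only iterators: enough for range-for over a const list.
    const T* begin() const { return values_; }
    const T* end() const { return values_ + size_; }

    const T& operator[](std::size_t index) const { return values_[index]; }

   private:
    T           values_[MaxSize];
    std::size_t size_ = 0;
};

// Sorts the stored indices in ascending order, in place.
inline void sort_indices(ValueList<std::uint32_t, 16>& list) {
    std::sort(list.begin(), list.end());
}
```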
Ok well yeah I won’t be able to do changes for a couple hours
At that point I’ll also run another STC on fishtest
To see where we are single-threaded now
alright, something like this to start?
const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
.quantisations(&[255, 64])
.optimiser(optimiser::AdamW)
.loss_fn(Loss::SigmoidMPE(2.6))
.input(ThreatInputs)
.output_buckets(outputs::MaterialCount::<8>)
.feature_transformer(HIDDEN_SIZE)
.activate(Activation::SCReLU)
.add_layer(16)
.activate(Activation::CReLU)
.add_layer(32)
.activate(Activation::CReLU)
.add_layer(1)
.build();
You could just recreate the SF arch
It should be more or less what is in the advanced example
sure, if the inference code for SF arch is reusable or easy to set up for this
down to use whatever arch. lmk what multi-layer arch is easiest to get inference working for @rocky vigil
can bullet leb compress the weights
also found that there's a newer L1-256 which is the one lichess is using: nn-9067e33176e8.nnue
there's no leb compression implementation in bullet
No but Stockfish can
Anyways idk it doesn’t matter the multilayer arch
I’ll get it to work either way (including weight reading)
Whatever you think is best
alright, then i'm inclined to start simple with the arch
less code to deal with
and ignore leb128 until the end, or if larger L1 gets annoying to deal with during testing
since it has no effect on strength, and strength is the important part now
Yeah that’s fine
This should be fine
```
'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' required here, but
no definition is available [-Wundefined-var-template]
129 | added.clear();
| ^
nnue/nnue_feature_transformer.h:291:17: note: in instantiation of function template specialization
'Stockfish::Eval::NNUE::FeatureTransformer<256,
&Stockfish::StateInfo::accumulatorBig>::update_accumulator_scratch<Stockfish::WHITE>' requested here
291 | update_accumulator_scratch<Perspective>(pos);
| ^
nnue/nnue_feature_transformer.h:309:9: note: in instantiation of function template specialization
'Stockfish::Eval::NNUE::FeatureTransformer<256,
&Stockfish::StateInfo::accumulatorBig>::update_accumulator<Stockfish::WHITE>' requested here
309 | update_accumulator<WHITE>(pos);
| ^
nnue/network.cpp:222:25: note: in instantiation of member function 'Stockfish::Eval::NNUE::FeatureTransformer<256,
&Stockfish::StateInfo::accumulatorBig>::transform' requested here
222 | featureTransformer->transform(pos, acc);
| ^
nnue/nnue_feature_transformer.h:59:47: note: forward declaration of template entity is here
59 | static thread_local FeatureSet::IndexList added;
| ^
nnue/nnue_feature_transformer.h:129:9: note: add an explicit instantiation declaration to suppress this warning if
'Stockfish::Eval::NNUE::FeatureTransformer<256, &Stockfish::StateInfo::accumulatorBig>::added' is explicitly
instantiated in another translation unit
129 | added.clear();
```
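The warning says the static member is declared inside the class template but never defined. One way to address that in C++17 is to mark the member `inline`, which turns the in-class declaration into a definition. A minimal sketch with a hypothetical `Transformer` class and a `std::vector` standing in for `FeatureSet::IndexList`:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

template<int Dimensions>
class Transformer {
   public:
    // `inline` makes this a definition, so no separate out-of-class
    // definition is required; `thread_local` gives each thread its own copy.
    static inline thread_local std::vector<std::uint32_t> added;

    // Appends an index to this thread's scratch list and returns its new size.
    static std::size_t record(std::uint32_t index) {
        added.push_back(index);
        return added.size();
    }
};
```

The pre-C++17 alternative is an out-of-class definition next to the template, of the form `template<int D> thread_local std::vector<std::uint32_t> Transformer<D>::added;`. Either way each instantiation (256, 512, ...) gets its own list per thread.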
hmm we are not cooking with this one
uh can someone run bench of threat-inputs vs halfkav2hm-256-base
what is going on with the residuals in the test
btw
this is the first time I've seen red
speed of latest threat-inputs vs. L1-256 nn-9067e33176e8.nnue as main net
Result of 20 runs
==================
base (...-256-s3-9067) = 2760160 +/- 6232
test (...g13-sf-mar11) = 1059262 +/- 3025
diff = -1700898 +/- 5835
speedup = -0.6162
P(speedup > 0) = 0.0000
ok we are not cooking very hard on the speed
that's probably why still -100 elo
i mean the current code will perform (relatively) better with large nets because I think calculating threat indices is still the bottleneck
i will try and bulk pawn threats later
once multithreading is resolved
for multilayer, floats for later layers ok? based on bullet morelayers.rs
const HIDDEN_SIZE: usize = 256;
let mut trainer = TrainerBuilder::default()
.advanced_quantisations(&[QuantTarget::I16(255), QuantTarget::I16(64), QuantTarget::Float, QuantTarget::Float])
.optimiser(optimiser::AdamW)
.loss_fn(Loss::SigmoidMPE(2.6))
.input(ThreatInputs)
.output_buckets(outputs::MaterialCount::<8>)
.feature_transformer(HIDDEN_SIZE)
.activate(Activation::SCReLU)
.add_layer(16)
.activate(Activation::CReLU)
.add_layer(32)
.activate(Activation::CReLU)
.add_layer(1)
.build();
Still like -100 elo.
-100 elo vs. L1-256 main net means still a long ways though
yea, morelayers can be a baseline but isn't going to help that much
I did say I don't think simplified threat inputs are sufficient lol
But it's fine, if it's close enough then we know the full will be better
ok hopefully the inference layers already in sf work lol
well as disservin said most of the slowdown is from computing threat indices
like here
according to my statistics you have ~2x as many accumulator updates with threats compared to halfkav2hm
in midgame
now obviously because of hilariously poor overhead this doesn't match what we see at small sizes
and yeah I would appreciate help with ue impl
like if there's a way to compute threat difference without looping through all the pieces
i'll try L1-512 simple threats for another data point then
and i'll look into full threats soon too
if you want I can take the current net, permute the ft/l2 weights 7 more times, and "effectively" have a l1 = 2048 for speed testing purposes
full threats i assume can be trained with this
https://github.com/official-monty/montytrain/commits/threat-inputs-nnue-fixed/
i can train an actual L1-2048 and use an early checkpoint for speedtest purposes
yes I believe everything there is ready
although inference side it's not quite ready yet
I'll work on it
alright full threats would be more important than multilayer
yeah sure the fixed nodes test of full threats vs simplified would probably be more informative
btw @frosty imp would you be able to help with this
moving indexlists to classes so that we don't declare a bunch of temporary ones (a noticeable speed gain in single thread btw) currently breaks multithreading
Result of 10 runs
==================
base (...g13-sf-mar11) = 1063281 +/- 1289
test (...1-sscg13-512) = 254596 +/- 485
diff = -808684 +/- 1307
speedup = -0.7606
P(speedup > 0) = 0.0000
assuming i did this right, this shows simple threats L1-512 being quite a lot slower than L1-256
early results after 10 superbatches of training on SF data
Architecture : (15776 -> 512)x2 -> 1x8
Inputs : Threat inputs
Number of Weights : 8.09m
hmm
wait 4x slower is a bit
suspicious
like doubling L1 should never make it 4x slower
16mb .nnue file. only change in the engine code otherwise was setting L1 to 512
can you send it to me so I can test
I got ~900k for 256 vs ~800k for 512
10M node search from startpos with 256: info depth 33 seldepth 40 multipv 1 score cp 28 lowerbound nodes 10000621 nps 625625 hashfull 999 tbhits 0 time 15985 pv d2d4 bestmove d2d4 ponder d7d5 Number of accumulator updates: 15940910 Number of feature indices looped through: 189471834
10M node search from startpos with 512: info depth 30 seldepth 37 multipv 1 score cp 35 lowerbound nodes 10000298 nps 570467 hashfull 1000 tbhits 0 time 17530 pv e2e4 bestmove e2e4 ponder e7e5 Number of accumulator updates: 15923538 Number of feature indices looped through: 190573575
what do the Nodes/second numbers show when you run stockfish bench with both?
~this
though I don't have the benchmarking script so I need to run it manually
weird, we'll let this training finish and see how it fares on fishtest
this is all i changed:
-#define EvalFileDefaultNameBig "nn-98b68b5a9455.nnue"
+#define EvalFileDefaultNameBig "nn-ff12e5c0b08b.nnue"
// Number of input feature dimensions after conversion
-constexpr IndexType TransformedFeatureDimensionsBig = 256;
+constexpr IndexType TransformedFeatureDimensionsBig = 512;
that is also all I changed
oh wait, my branch wasn't updated with the latest speed updates
nm, looks less slow now
it should be on this commit right? 9f21b44 disservin screlu affine speedup
yeah
that one
unfortunately a couple of the single-thread speedups break multithread
I am hoping for someone to help me resolve those (I have no experience coding multithreaded)
otherwise it would be a shame to have to roll those back
L1-512 on top of 9f21b44 looks a lot better
Result of 10 runs
==================
base (...g13-sf-mar11) = 1057315 +/- 5402
test (...rofile-build) = 916618 +/- 4834
diff = -140697 +/- 6748
speedup = -0.1331
P(speedup > 0) = 0.0000
mm
looks like those speedups were worth a lot
how far back was your old branch just curious
the speedups today should only be like +30-40% compared to yesterday
anyways, based on this, napkin math suggests scaling to 3072 as master would be 40% of the speed
oh huh the total speedup should not be so large then, then again this branch was broken inference because I added dirtypiece and forgot that it doesn't work if the king moves from efgh to abcd
how does it perform if you try it locally?
upper 700k (for L1=256)
then again my laptop is like really not great for speed testing
like fishtest stc https://tests.stockfishchess.org/tests/view/67d0ca72166a3e8781d84242 suggests today was a ~40% speedup for L1=256 on average, my laptop suggests it was half that lol
anyways better speed is better speed lol
yea, any speed we can get is good
i wouldn't worry about multithreaded for now either
it's nice if it works of course. however the main blocker is getting anything on par with master
simple threat inputs - L1-512 vs. L1-256
https://tests.stockfishchess.org/tests/view/67d123cf166a3e8781d842bf
yeah that's 100% a huge speedup
why?
that's barely a speedup, it just makes multi threaded work again
With a3427fc multithreading works again without crash. But the analysis is totally weird.
https://github.com/cj5716/Alexandria/commit/bb20a5bb7c217d2a47caa24d72cf0e07c05885ab was a 10% speedup for me tho
I guess this is quite different
I didn't read it in detail
ah well the vector is created in the function so that was a speedup for you, but sscg already moved the instantiation out of the function so that speedup is gone
{
    Color     c     = order[Perspective][i];
    PieceType pt    = PAWN;
    Piece     attkr = make_piece(c, pt);
    Bitboard  bb    = colorBB[c] & pieceBB[pt];
    indices.clear();
    auto right = c == WHITE ? NORTH_EAST : SOUTH_WEST;
    auto left  = c == WHITE ? NORTH_WEST : SOUTH_EAST;
    auto attacks_left  = (c == WHITE ? shift<NORTH_EAST>(bb) : shift<SOUTH_WEST>(bb)) & occupied;
    auto attacks_right = (c == WHITE ? shift<NORTH_WEST>(bb) : shift<SOUTH_EAST>(bb)) & occupied;
    while (attacks_left) {
        Square to    = pop_lsb(attacks_left);
        Square from  = to - right;
        Piece  attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }
    while (attacks_right) {
        Square to    = pop_lsb(attacks_right);
        Square from  = to - left;
        Piece  attkd = board[to];
        indices.push_back(make_index<Perspective>(attkr, from, to, attkd, ksq));
    }
    std::sort(indices.begin(), indices.end());
    for (auto threat : indices) {
        active.push_back(threat);
    }
}
here you go, ~6% for me
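A self-contained sketch of the same bulk idea for white pawns only: shift the whole pawn bitboard once per capture direction, intersect with the occupied squares, and recover each `from` square from the shift amount. It assumes the usual a1 = 0, h8 = 63 square numbering; `white_pawn_threats` and the portable `pop_lsb` are written for this example and are not the engine's versions.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Bitboard = std::uint64_t;

constexpr Bitboard FileA = 0x0101010101010101ULL;
constexpr Bitboard FileH = FileA << 7;

// Returns the index of the least significant set bit and clears it.
// The bitboard must be non-zero.
inline int pop_lsb(Bitboard& b) {
    int sq = 0;
    while (!((b >> sq) & 1))
        ++sq;
    b &= b - 1;
    return sq;
}

// (from, to) pairs for every white pawn attack that lands on an occupied square.
inline std::vector<std::pair<int, int>> white_pawn_threats(Bitboard pawns, Bitboard occupied) {
    std::vector<std::pair<int, int>> threats;

    // Mask the edge files first so a shift cannot wrap around the board.
    Bitboard northEast = ((pawns & ~FileH) << 9) & occupied;
    Bitboard northWest = ((pawns & ~FileA) << 7) & occupied;

    while (northEast) {
        const int to = pop_lsb(northEast);
        threats.emplace_back(to - 9, to);
    }
    while (northWest) {
        const int to = pop_lsb(northWest);
        threats.emplace_back(to - 7, to);
    }
    return threats;
}
```

Each direction comes out in ascending `to` order, but the two directions interleave, so any consumer that needs globally sorted feature indices still has to sort or merge the result.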
try using emplace_back and see if it helps
Hmm this is quite concerning
Do you have a fixed nodes estimate for the net?
Results of threat-inputs/sscg13-sf-mar11 vs threat-inputs/mar11-sscg13-512 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 135.41 +/- 8.83, nElo: 191.47 +/- 11.30
LOS: 100.00 %, DrawRatio: 31.83 %, PairsRatio: 7.04
Games: 3632, Wins: 1989, Losses: 641, Draws: 1002, Points: 2490.0 (68.56 %)
Ptnml(0-2): [26, 128, 578, 640, 444], WL/DD Ratio: 3.94
+135 elo at 25k nodes per move: L1-512 vs. L1-256, simple threats
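For anyone checking the numbers: the Elo line appears to be the usual logistic conversion of the score percentage, and Points can be recovered from the Ptnml counts (each bucket is a game pair worth 0, 0.5, 1, 1.5 or 2 points). A small sketch, with elo_from_score and points_from_ptnml as made-up helper names, not functions from the match runner:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Logistic Elo difference implied by a score fraction strictly between 0 and 1.
inline double elo_from_score(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}

// Total points from a pentanomial distribution of game pairs,
// where bucket i is worth i / 2 points (0, 0.5, 1, 1.5, 2).
inline double points_from_ptnml(const std::array<int, 5>& ptnml) {
    double points = 0.0;
    for (int i = 0; i < 5; ++i)
        points += 0.5 * i * ptnml[i];
    return points;
}
```

With Ptnml [26, 128, 578, 640, 444] this gives 2490 points over 3632 games, and the score 2490 / 3632 converts to roughly +135.4 Elo, which lines up with the report above.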
the speed difference measured with bench positions may not be reflective of speed changes throughout actual games
simple threat inputs - L1-1024 vs. L1-512
https://tests.stockfishchess.org/tests/view/67d1be92166a3e8781d843aa
Result of 10 runs
==================
base (...rofile-build) = 913329 +/- 4211
test (...rofile-build) = 668868 +/- 2992
diff = -244461 +/- 4790
speedup = -0.2677
P(speedup > 0) = 0.0000
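The speedup figure here looks like plain (test - base) / base on the mean nps, so about a 27% slowdown for L1-1024 relative to the base build. A one-function sketch (relative_speedup is a hypothetical name, and this ignores the error bars the script also reports):

```cpp
#include <cassert>

// Relative change in nodes per second of `test` versus `base`.
// Negative values mean the test build is slower.
inline double relative_speedup(double base_nps, double test_nps) {
    assert(base_nps > 0.0);
    return (test_nps - base_nps) / base_nps;
}
```

Plugging in 913329 and 668868 gives about -0.2677, matching the reported value.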
look who's hiding in the last bytes of the L1-1024

Hmm looks like we still need major optimization work
Fixed nodes results are really strong though
why do you think it doesn't add up though
did we have any measurements of how much elo we actually get from doubling nowadays?
f84f2226 should fully restore it for hopefully minimal speed loss
You linked the wrong test but this one’s going even worse than 512 vs 256
I’m curious what the fixed nodes are for this one as well
oops, edited the original message with the correct test link:
https://tests.stockfishchess.org/tests/view/67d1be92166a3e8781d843aa
I actually just suck at optimization and I have no idea what is going wrong
Like all the data suggests that threat inputs should be significantly more accurate as evaluation but the STC never matches
for that the new speedtest should be much more representative.
./stockfish speedtest 1 16 5 (speedtest [threads] [hash (MiB)] [runtime (s)])
(you might want to give it a bit more than 5s though)
Running a couple 150 sec 4 thread for 256, 512, 1024
Will have results in several minutes
Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs/mar11-sscg13-1024-profile-build (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -42.48 +/- 7.02, nElo: -67.50 +/- 11.04
LOS: 0.00 %, DrawRatio: 44.51 %, PairsRatio: 0.49
Games: 3806, Wins: 963, Losses: 1426, Draws: 1417, Points: 1671.5 (43.92 %)
Ptnml(0-2): [144, 563, 847, 310, 39], WL/DD Ratio: 2.11
Actually nvm I forgot school wifi blocks fishtest
So I can’t download new nets 
those results somehow indicate 1024 is negative vs. 512 at fixed nodes
can you download if i upload them here directly?
Uh I need to transfer from phone to school laptop then
So Google drive might be easier
L1=256 4 thread 662978
(This computer is very slow disregard the absolute numbers)
Bruh the school WiFi also manages to block that 
School WiFi is actually terrible
Fine I’ll figure out phone -> laptop transfer
But you might have to wait longer
is it blocking based on the domain, or the filename extension or what?
Because I have a class soon
I think it is domain of website
Some kind of filter
phone == mobile hotspot ... connect computer to hotspot. No longer blocking domains?
you can also just have discord on the laptop...
School laptop as well
There are a lot of things I can do on this laptop but discord and fishtest are not in that category
- the web filtering is partially built in as software on the laptop as well
Anyways STC to fixed nodes difference is less for 512-1024 than 256-512 so I guess there is less slowdown
I see.
btw that is the profile
L1=512 4 thread 539842
Bruh overhead is like 4x actual accumulator updates