#UE Threat Inputs for AB
with my current arch (including the factoriser) training speeds are basically identical to my current master net
there is also a possibility that the numbers are different due to vondele running debug builds
at least one of the numbers was ~equal to sf master, one was 1/2 sf master, and one was 1/4
yeah. no idea how to interpret the 80-90it/s either, but from what i've seen in the past it seems pretty good
same speed as master net (l1=3072) training.
so fairly straightforward to train.
certainly if we train just 1 or 2 stages.
actually even a bit faster to train.
Elo | -3.77 +- 2.98 (95%)
SPRT | 40.0+0.40s Threads=1 Hash=64MB
LLR | -2.25 (-2.25, 2.89) [0.00, 2.50]
Games | N: 11898 W: 2877 L: 3006 D: 6015
Penta | [6, 1394, 3281, 1259, 9]
https://furybench.com/test/3100/
LTC vs main. slowly getting there. i'm hopeful that a factoriser is all that's needed now
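For reference, the headline Elo in a result like the one above can be reproduced from the W/L/D counts with the usual logistic formula. This is only a sketch of the point estimate: the error bar comes from the pentanomial counts and is not computed here, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Score fraction from win/loss/draw counts (a draw counts as half a point).
double score_from_wld(int wins, int losses, int draws) {
    const int games = wins + losses + draws;
    assert(games > 0);
    return (wins + 0.5 * draws) / games;
}

// Logistic Elo difference implied by a match score strictly between 0 and 1.
double elo_from_score(double score) {
    assert(score > 0.0 && score < 1.0);
    return -400.0 * std::log10(1.0 / score - 1.0);
}
```

Feeding in W: 2877, L: 3006, D: 6015 gives a score just under 49.5% and an Elo of about -3.77, matching the reported value.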
First factoriser not looking good (loss also sucked)
--------------------------------------------------
Results of ThreatsFactorised vs Threats (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: -13.87 +/- 9.37, nElo: -21.63 +/- 14.58
LOS: 0.18 %, DrawRatio: 43.30 %, PairsRatio: 0.81
Games: 2180, Wins: 602, Losses: 689, Draws: 889, Points: 1046.5 (48.00 %)
Ptnml(0-2): [58, 284, 472, 239, 37], WL/DD Ratio: 1.58
--------------------------------------------------
Results of ThreatsFactorised vs Threats (5+0.05, 1t, 16MB, UHO_4060_v2.epd):
Elo: -13.70 +/- 11.09, nElo: -26.13 +/- 21.12
LOS: 0.76 %, DrawRatio: 48.46 %, PairsRatio: 0.75
Games: 1040, Wins: 245, Losses: 286, Draws: 509, Points: 499.5 (48.03 %)
Ptnml(0-2): [5, 148, 252, 113, 2], WL/DD Ratio: 1.03
--------------------------------------------------
I mean i've never experimented with factoriser schemes, there is the possibility that smth is still bugged ofc even though i double checked the things i could think of, but it also doesn't look bad enough for it to be bugged
i'll try coding up 768x4 next, when i have time
yeah that looks cooked rip
i tried to describe the information encoded in various threat schemes, in the hope of getting some collective opinion on what a factoriser might need most.
large threat inputs: [src][src_pc][src_pc_col][dest][dest_pc][dest_pc_rel_col]
small threat inputs: [src][src_pc][src_pc_col][dest][dest_pc_rel_col] -> leave out attacked piece type
what i tried: [src][src_pc][src_pc_col][dest][dest_pc_worth_more_than_src_pc][dest_pc_rel_col]
768x4: [dest][dest_pc][dest_pc_col][dest_attacked][dest_defended]
alternative idea 1: [src_pc][src_pc_col][dest][dest_pc][dest_pc_rel_col] -> leave out source square
alternative idea 2: [src][src_pc][src_pc_col][dest_pc][dest_pc_rel_col] -> leave out destination square
i'm actually thinking alternative idea 1 might be best, the source square should not be super important for the factoriser. but i wanna hear some opinions
i think leaving out src square seems reasonable yea
Why not use small threat inputs as the factoriser?
Because it is known to not be terrible even as standalone
trying without encoding the source square now. for pawns, since source/destination are so closely tied, i encode source file+threat direction, but not rank. 6824 features
eta 17-22h from now, depends on when i'm home
Still not great
--------------------------------------------------
Results of ThreatsFactorised vs Threats (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: -7.04 +/- 6.13, nElo: -11.26 +/- 9.79
LOS: 1.21 %, DrawRatio: 45.16 %, PairsRatio: 0.86
Games: 4836, Wins: 1329, Losses: 1427, Draws: 2080, Points: 2369.0 (48.99 %)
Ptnml(0-2): [95, 617, 1092, 519, 95], WL/DD Ratio: 1.31
--------------------------------------------------
I don't know, maybe it's not a thing that can be factorised well, at least in the ways i've tried so far? I.e. the weights of each "bucket" are too different, what i'm doing seems to be doing more harm than good
Yeah
ucinewgame position startpos eval (x2) gives two wildly different results
this is so bad
surprise surprise doing a no-ue inference hack on sf nnue vector code by treating the biases as accumulator caches breaks
because the biases themselves get updated
ok well this looks more like chess
idk how good this chess is
This is the early checkpoint right
yes
... Frolic (stable) playing White: 0 - 46 - 4 [0.040] 50
... Frolic (stable) playing Black: 0 - 48 - 2 [0.020] 50
... White vs Black: 48 - 46 - 6 [0.510] 100
Elo difference: -603.9 +/- 183.4, LOS: 0.0 %, DrawRatio: 6.0 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
100 of 100 games finished.
well seeing it can still destroy Frolic (~3080 CCRL blitz) at stc without ue
i think there shouldn't be any major issues with training/inference at this stage
```
... Stockfish TI-experimental playing White: 6 - 29 - 15 [0.270] 50
... Stockfish TI-experimental playing Black: 6 - 34 - 10 [0.220] 50
... White vs Black: 40 - 35 - 25 [0.525] 100
Elo difference: -195.5 +/- 65.7, LOS: 0.0 %, DrawRatio: 25.0 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
100 of 100 games finished.
```
10k nodes
idk how much the rest of training is worth
what should be the plan
start a full training run and compare fixed nodes?
200 superbatches was it?
i honestly have no idea how undertrained that is
besides "very"
@twilit oriole @stray reef opinions?
well i think you should get rid of king buckets for the baseline lol
then we can compare to plenty results easier
Plenty has a L1 512 TI vs L1 1536 regular, SF would be L1 1024 TI vs L1 3072. So fixed nodes should be very similar
hm -190 sounds almost like something's broken honestly, or the end LR is still extremely high
for my training setup there is no way any stage can be so bad, assuming a reasonable LR schedule
plenty L1 is 1792 btw
#nnue-dev message
given this, the elo diff seems fine
or just different levels of optimism 😉
Anyway, worthwhile training something stronger.
Yeah still cannot guarantee everything is perfectly fine
But at least this is a lower bound
right, but it is likely not outrageously wrong, which is good enough to put some more resources on this.
do you have some correlation plot, e.g. TI vs master net evals in a scatter plot?
Ah I can make that later if you tell me how
just take a random source of fens (e.g. a binpack), and evaluate once 1000 fens with your net and once with master net, and plot x,y..
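Once the two lists of evals exist, a single number to go with the scatter plot is the Pearson correlation. A minimal sketch, assuming the evals for the same FENs have already been collected into two vectors (how they are extracted from the engine is not shown, and `pearson` is a name made up here):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between two equally long lists of evals.
// Returns NaN if either list is constant (zero variance).
double pearson(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() > 1);
    const double n = static_cast<double>(x.size());
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

A value close to 1 would suggest the two nets broadly agree; a low value on a sane net usually points at an inference or indexing mismatch rather than a training difference.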
ok
Btw @twilit oriole do you have any data on how much data and how many SBs/epochs a threat input net of a certain L1 size needs?
i'm wondering if mine is massively undertrained (not only wrt data, but also SBs)
hm not really. we are using 12k SBs and 160B positions for an L1 8192
and that seems slightly undertrained but not by much
though mcts might have higher data requirements
how many SBs would you do for L1 512, given enough data (whatever that may be)
difficult to say because you can nearly always squeeze a few more elo out
probably something like 1k minimum, 2k to be sure
hm ok
```
... Stockfish TI-experimental playing White: 19 - 141 - 90 [0.256] 250
... Stockfish TI-experimental playing Black: 18 - 165 - 67 [0.206] 250
... White vs Black: 184 - 159 - 157 [0.525] 500
Elo difference: -208.9 +/- 27.1, LOS: 0.0 %, DrawRatio: 31.4 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
500 of 500 games finished.
```
20k nodes but idt there's really much more substantial things to learn atm
I think the important check is to see if the inference is consistent with the trainer...
(though the script might need verifying it still works)
actually my question is why isn't there a command that just returns the unnormalized eval
btw @twilit oriole sparsity on the threat net L1 -> L2 seems trashed
have you measured this before
wdym trashed
i found threat nets compress much better for us which would lead me to believe the opposite
combined zeros here seems much lower
than in the halfka
(master arch at this checkpoint has like 78 which is double the amount)
oh right L1 issue
@violet badger I'm measuring a large fixed nodes loss between the first checkpoint of the full run (nn-42b0b08a207a.nnue) and the net trained from the short run (nn-cc78fa7e0258.nnue) despite a lower validation loss (0.00405 vs 0.00425), is there a meaningful difference between the two in the first stage besides training time?
```
... Stockfish TI-experimental playing White: 24 - 45 - 31 [0.395] 100
... Stockfish TI-experimental playing Black: 20 - 48 - 32 [0.360] 100
... White vs Black: 72 - 65 - 63 [0.517] 200
Elo difference: -86.9 +/- 40.7, LOS: 0.0 %, DrawRatio: 31.5 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
200 of 200 games finished.
```
full first stage (nn-42b....) should be better than the previous (nn-cc78...), there is no difference except increasing the training from 200 to 800 epochs. In this sense nn-42b can now also be compared to similarly trained nets of the master arch (which are roughly -50 Elo compared to the fully trained master).
do I understand your measurement as showing it is worse?
yeah, I think this needs some checking from the implementation point of view.
note on the loss during training, we adjust lambda (mix between eval and game outcome) during training (if lambda start and lambda end is not the same), so the loss doesn't mean the same at the same epoch if the max_epoch is different.
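To make the lambda point concrete, here is a sketch assuming lambda is interpolated linearly from its start to its end value over the run (the linear form and the example values are assumptions, not taken from the trainer source):

```cpp
#include <cassert>

// Assumed schedule: lambda moves linearly from start_lambda to end_lambda over max_epoch epochs.
double lambda_at(int epoch, int max_epoch, double start_lambda, double end_lambda) {
    assert(max_epoch > 0 && epoch >= 0 && epoch <= max_epoch);
    return start_lambda + (end_lambda - start_lambda) * (static_cast<double>(epoch) / max_epoch);
}

// Training target as a mix of the eval-derived and the game-outcome win probability.
double blended_target(double eval_wp, double outcome_wp, double lambda) {
    return lambda * eval_wp + (1.0 - lambda) * outcome_wp;
}
```

With a hypothetical start of 1.0 and end of 0.75, epoch 200 of a 200-epoch run sits at lambda 0.75, while epoch 200 of an 800-epoch run is still at 0.9375, so the two losses are measured against different targets. Only the final epochs of both runs share the same lambda.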
yeah i mean the 0.00405 vs 0.00425 comparison is from final epoch from both
but yeah something is strange
possible the final epoch can indeed be compared.
still strange.
even if painful, I think the thing to do right now is to ensure trainer and SF have the same inference result.
do all of the nnue-pytorch functions really need a gpu to run
i'll try to enlist a friend's help if that is the case
most likely, at least I don't think non-gpu runs are still supported. It would add a new dimension to testing..
ah shoot
```
(lldb) (int) 625
```
```
(lldb) print Eval::NNUE::Features::debug_threat_index(Stockfish::Color::BLACK, Stockfish::Piece::W_PAWN, Stockfish::Square::SQ_C5, Stockfish::Square::SQ_D6, Stockfish::Piece::B_KNIGHT, Stockfish::Square::SQ_E1)
(lldb) (int) 40049
```
i should have guessed something was sketchy
in stockfish piece enum color is msb, in nnue-pytorch it is lsb...
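A small sketch of the mismatch. The colour-as-high-bit layout is the Stockfish one (W_PAWN = 1, B_PAWN = 9). The colour-as-low-bit layout below is only one plausible numbering consistent with the remark above, so treat its exact values as an assumption; the function names are made up.

```cpp
#include <cassert>

enum Color { WHITE = 0, BLACK = 1 };
enum PieceType { PAWN = 1, KNIGHT, BISHOP, ROOK, QUEEN, KING };

// Stockfish-style: colour in bit 3 (most significant), piece type in the low three bits.
constexpr int piece_color_msb(Color c, PieceType pt) { return (c << 3) | pt; }

// Assumed trainer-style: colour in bit 0, zero-based piece type in the bits above it.
constexpr int piece_color_lsb(Color c, PieceType pt) { return ((pt - 1) << 1) | c; }

// Translate a colour-msb piece code into the assumed colour-lsb code.
int msb_to_lsb(int piece) {
    assert((piece & 7) >= PAWN && (piece & 7) <= KING);
    return piece_color_lsb(Color(piece >> 3), PieceType(piece & 7));
}
```

Under the first layout all white pieces come before all black ones; under the second, white and black alternate. Any table indexed by piece therefore ends up in a different order unless one side converts.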
ok well it turns out changing the threat indexing does not affect bench at all
um
something is highly wrong in my inference then
```
void init_threat_offsets() {
    int pieceoffset = 0;
    for (int c = WHITE; c <= BLACK; c++) {
        for (int pt = PAWN; pt <= KING; pt++) {
            Piece piece = make_piece(Color(c), PieceType(pt));
            // Slot 65: where this piece's block starts within the threat feature range.
            threatoffsets[piece][65] = pieceoffset;
            int squareoffset = 0;
            for (int from = SQ_A1; from <= SQ_H8; from++) {
                // Running count of attacked squares over all earlier source squares.
                threatoffsets[piece][from] = squareoffset;
                if (pt != PAWN) {
                    Bitboard attacks = attacks_bb(PieceType(pt), Square(from), 0ULL);
                    squareoffset += popcount(attacks);
                }
                else if (from >= SQ_A2 && from <= SQ_H7) {
                    // piece < 8 selects the white pieces (colour is the high bit of Piece).
                    Bitboard attacks = (piece < 8) ? pawn_attacks_bb<WHITE>(square_bb(Square(from)))
                                                   : pawn_attacks_bb<BLACK>(square_bb(Square(from)));
                    squareoffset += popcount(attacks);
                }
            }
            // Slot 64: total number of (from, to) attack pairs for this piece.
            threatoffsets[piece][64] = squareoffset;
            pieceoffset += numvalidtargets[piece]*squareoffset;
        }
    }
}
```
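As a sanity check on the per-piece totals that end up in slot 64, the same counts can be computed without any engine code. This is a standalone sketch with made-up names, using plain square arithmetic (a1 = 0, h8 = 63) instead of bitboards and treating pawns as white (black gives the same total by symmetry).

```cpp
#include <cassert>
#include <cstdlib>

enum PieceType { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING };

// True if a piece of type pt on `from` attacks `to` on an otherwise empty board.
// Pawns are white and only counted on ranks 2 to 7, as in the snippet above.
bool attacks_on_empty_board(PieceType pt, int from, int to) {
    assert(from >= 0 && from < 64 && to >= 0 && to < 64);
    if (from == to) return false;
    const int df = to % 8 - from % 8;   // file delta
    const int dr = to / 8 - from / 8;   // rank delta
    const int adf = std::abs(df), adr = std::abs(dr);
    switch (pt) {
    case PAWN:   return from / 8 >= 1 && from / 8 <= 6 && dr == 1 && adf == 1;
    case KNIGHT: return adf * adr == 2;
    case BISHOP: return adf == adr;
    case ROOK:   return df == 0 || dr == 0;
    case QUEEN:  return adf == adr || df == 0 || dr == 0;
    case KING:   return adf <= 1 && adr <= 1;
    }
    return false;
}

// Sum over all source squares of the number of attacked squares.
int total_attacks(PieceType pt) {
    int total = 0;
    for (int from = 0; from < 64; ++from)
        for (int to = 0; to < 64; ++to)
            total += attacks_on_empty_board(pt, from, to) ? 1 : 0;
    return total;
}
```

The totals come out as 84 for pawns, 336 for knights, 560 for bishops, 896 for rooks, 1456 for queens and 420 for kings, which is what `squareoffset` should hold after the inner loop for each piece type.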
no matter how I swap the order of the top for loops (either way), I get the same bench
idk what is going wrong...
I legitimately do not know how changing the threat indexing does not affect bench at all
the battle begins again
nvm this is it packing two ints and interpreting it as a u64
more inclined to believe the issue is in the trainer now
so this is basically just a L1=1024 halfka net
no wonder it's -200 to master
doesn't explain how the 800 sb one is worse than the 200 sb one at fixed nodes
oh well
@violet badger it looks like something is wrong right now so there isn't much point in continuing the run
I'll have to take a look into trainer again
okay, just let me know if there are fixes to the trainer to test out and we can restart.
yeah it's hard to work with nnue-pytorch w/o a gpu but hopefully my friend can help in the next few days
@violet badger is it safe to rebase against master
added 300M 5ksn-adversarial positions to the last training stage, it's probably passing LTC which is awesome
https://furybench.com/test/3149/
https://furybench.com/test/3155/
especially since including 5ksn-adversarial did not pass LTC in master, and it's super quick to generate compared to 20ksn-adversarial
oh nice
@twilit oriole threat inputs don't allow duplicate encoding of the same interaction, e.g. two queens attacking each other. did you ever measure the elo of this?
i realised there's quite a few unused features due to this (only 73360 are used)
i think it's possible to change the encoding itself
to reduce some of the stuff like that
but it's annoying because you then have to treat it separately
not like indexing is the bottleneck anyways
it takes like < 1% of runtime
well actually, this loop which calculates feature indices takes 12% of the entire runtime atm
https://github.com/Yoshie2000/PlentyChess/blob/threat-inputs/src/nnue.cpp#L249-L272
and it's already sped up by using a precomputed index lookup table (around 2.4MB)
the elo was not measured no. you need to be careful about unused features, sometimes it is an illusion due to rare underpromos that would for example allow u to have two own bishops of same square complex etc
it was faster than the usual calculation
hmmm
though i'm 100% sure there must be a different encoding to make this faster
and also to figure out if a feature is unused or not
back when diss ran profile the actual indexing portion was only 1% or so of runtime and generating the threats was like 20% over both sides
idk maybe stuff changes
well maybe i did smth really stupid but i didn't really get very far with profiling
do you actually have a profile of latest version
the time taken in this loop is roughly 1/3 unpacking DirtyThreat and calculating relative squares, 1/3 table lookup, 1/3 adding into the arrays
I can't do it bc windows sucks
always?
ok cool
yoshie have you tried to split into 2 DirtyThreat lists, one with add and one with subtract, to remove branching in the loop? i think it could be a minor speedup
i tried it in combination with smth else, can try it standalone as well
yeah i don't really see how to speed this up rn
this loop basically takes more time than all of addsub
it's crazy
i think going back and forth between indexing the table and the dirty threat lists is awful for the cache, especially if there's like 10 threat updates to process
though i've not managed to find a way to improve it yet
also random idea maybe don't use max capacity 128 indexlists
for add/remove
like 32 should do just fine
tried that, was not a speedup
oh well
yeah I tried once not to like create entirely new lists every time but that screwed with multithreading
@naive comet maybe you have some idea on how to improve the cache situation? to not jump back and forth between dirtyThreats and the lookup table?
how big is dirtythreats
struct DirtyThreat {
    Piece piece;
    Piece attackedPiece;
    Square square;
    Square attackedSquare;
    Color pieceColor;
    Color attackedColor;
    bool add;
};

struct Accumulator {
    alignas(ALIGNMENT) int16_t threatState[2][L1_SIZE];
    alignas(ALIGNMENT) int16_t pieceState[2][L1_SIZE];
    DirtyPiece dirtyPieces[4];
    int numDirtyPieces;
    DirtyThreat dirtyThreats[256];
    int numDirtyThreats;
    KingBucketInfo kingBucketInfo[2];
    Board* board;
};
lmao the 256 can definitely be made smaller
but it's not like that's an issue here, we're staying in the same accumulator
something else to try is measure threat activity per index over a long search. i think ultra rare threats could be combined
oh god
like the threats that only activate in underpromo situations etc
i expect the distribution has an extreme skew in general
yeah i mean it looks small
idk about cache but i wouldn't see how it's a big issue
if anything the lookup table looks much larger of an issue
but if you measured that it gains over using less
the lookup table is ofc way bigger than theoretically necessary
then idk either
but doing the calculations to reduce size (e.g. compressing the [64][64]) are more expensive apparently
cant attackedColor/pieceColor be inferred from attackedSquare/square?
yes, that would work
if you're willing to do a bunch of mailbox lookups you only need the two squares
i don't have colored pieces so it'd have to be bitboard lookups but yeah
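A sketch of what the "only the two squares" variant could look like. `PackedThreat` and its helpers are hypothetical names, not code from either engine. Pieces and colours would be recovered by a board lookup at the two squares, which has to run against the board state in which the threat actually exists (for removals that is the position before the move).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical compact record: bits 0-5 attacker square, bits 6-11 attacked square, bit 12 add flag.
struct PackedThreat {
    std::uint16_t bits;
};

PackedThreat pack_threat(int square, int attackedSquare, bool add) {
    assert(square >= 0 && square < 64 && attackedSquare >= 0 && attackedSquare < 64);
    const int raw = square | (attackedSquare << 6) | ((add ? 1 : 0) << 12);
    return PackedThreat{static_cast<std::uint16_t>(raw)};
}

int threat_square(PackedThreat t) { return t.bits & 63; }
int threat_attacked_square(PackedThreat t) { return (t.bits >> 6) & 63; }
bool threat_is_add(PackedThreat t) { return ((t.bits >> 12) & 1) != 0; }
```

This shrinks each entry to 2 bytes, but it trades memory for extra lookups per threat, so whether it is a speedup would need measuring.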
oh interesting
we just ditched them early on, you can see your example here https://github.com/official-monty/Monty/blob/master/src/networks/value/threats.rs#L197
just free space saving
i wouldn't expect it to make a notable difference in the resulting net, though the training would be slightly different
I tried a bunch more stuff to optimise the index calculation. Even tried unpacking the network like this
struct NetworkData {
    alignas(ALIGNMENT) int16_t inputWeightsPawn[ThreatInputs::LookupSizes::PAWN * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsKnight[ThreatInputs::LookupSizes::KNIGHT * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsBishop[ThreatInputs::LookupSizes::BISHOP * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsRook[ThreatInputs::LookupSizes::ROOK * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsQueen[ThreatInputs::LookupSizes::QUEEN * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsKing[ThreatInputs::LookupSizes::KING * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputWeightsPsq[768 * KING_BUCKETS * L1_SIZE];
    alignas(ALIGNMENT) int16_t inputBiases[L1_SIZE];
    alignas(ALIGNMENT) int8_t l1Weights[OUTPUT_BUCKETS][L1_SIZE * L2_SIZE];
    alignas(ALIGNMENT) float l1Biases[OUTPUT_BUCKETS][L2_SIZE];
    alignas(ALIGNMENT) float l2Weights[OUTPUT_BUCKETS][2 * L2_SIZE * L3_SIZE];
    alignas(ALIGNMENT) float l2Biases[OUTPUT_BUCKETS][L3_SIZE];
    alignas(ALIGNMENT) float l3Weights[OUTPUT_BUCKETS][L3_SIZE + 2 * L2_SIZE];
    alignas(ALIGNMENT) float l3Biases[OUTPUT_BUCKETS];
};
where the threat feature weights for each attacking piece are encoded as [64][64][6][2][2]. was equally fast. ofc there would be way too much unused space but i was hoping to at least achieve faster calculation, cache pressure was roughly similar still
i think i'll give up on speedups for now, and just generate some more data
Hm. Something else to try is have the L1 for piece square inputs be larger than that of the threat inputs
How are you inferencing that then
ah I see
asymmetric like that requires more extensive trainer modifications and stuff
In bullet should be easy
honestly I think right now increasing L1 would easily pass LTC. 384 -> 512 passed STC with 4 elo or so, 640 should definitely be doable
have 160B positions on offer for the price of $0.0
Yeah I know. It wasn't suggested as an alternative to that
what was the difference in 123rrr4 btw
this is cool bc it should hopefully mean ltc is neutral now
so very close
Yeah wanted to post about this. 0123rrr4 is the last stage with 600M more positions (5ksn adversarial) compared to 0123rrr. Gained 2 elo at STC + LTC
The game plan is generate 600M more positions while I'm on holiday, and then train a 640 L1
nice it's looking very promising
hopefully you are rewarded for all of the effort soon enough
@violet badger we discovered an error in the threat offsets initializer not being run. That should be resolved now, so the threat features should actually train
let's try a short test run first, and I'll verify the fix works
super exciting
OK, will try to set that up today. Need to recall where we did our first experiment 😉
@rocky vigil do you happen to have a repo + sha of an SF that can use your net already? If I have it, I should be able to add this to the training pipeline already. Not urgent.
ok, think I found it threat-inputs-rebase last commit.
ooh cool
how long is it expected to take?
short test only, 1h for 'a bit of a net'
Full training schedule would be about 4 days
let me think..
(wondering how doable it is to experiment at home)
fancy schmancy
But I don't think this is way faster than some fancy home GPU.
It is merged
while threat inputs in SF won its first games against master...
[129, 722, 283, 29, 0]
@rocky vigil https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11687793252/artifacts/browse/step_e06216ffc4a2/ is a net for download. Still young, 100 epochs only. Match here https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11687793253 (not fixed nodes)
Elo: -150.87 +/- 7.78, nElo: -309.70 +/- 14.12
So, time to further increase epochs.
Ah I need to re-index the threats in inference as well
Lemme do that quickly
And update against master as well
But it’s already looking better
oh, that's going to make a difference, but sure.
Gimme a bit to sanity check run through lldb etc.
Would have done this yesterday if I knew it would’ve been a very fast response
it is exciting, so got bumped in priority 😉
Wait actually the current inference already seems to use the right indexing
Ah it was always the bullet indexing
Still lemme sanity check
oh wait this isn't fixed nodes, that's insane, it's like 1/3-1/2 the speed of master since it's basic non-ue inference
so that sounds promising..
it is still a very early net as well, I wouldn't expect a master net to be better than -100 Elo at this point.
this is much better and looks proper
quick sanity 20k nodes (on 8moves, balanced book)
```
... Stockfish TI-experimental playing White: 27 - 25 - 48 [0.510] 100
... Stockfish TI-experimental playing Black: 17 - 36 - 47 [0.405] 100
... White vs Black: 63 - 42 - 95 [0.552] 200
Elo difference: -29.6 +/- 35.0, LOS: 4.9 %, DrawRatio: 47.5 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
200 of 200 games finished.
```
so unless some recent search development is extremely good at low node counts, the net is already quite good
no magic search recently.
that's really rather strong already.
should definitely gain > 30 Elo from training.
There should be an updated net in like 12h or so, that should be equivalent to master -50Elo.
the estimate of -20% speed with optimization still seems accurate
but am hopeful 50 elo more from the full training can be gotten
especially now that the threats seem to actually work properly
pretty certain that 50Elo is still quite easy with training..
unless these nets train much faster
i wouldn't expect it, due to parameter count
monty used a similar training schedule of 3000 * (100M pos) I think at L1=3072 or so, and plentychess was 1200 * (100M pos) at L1=512
@stray reef do you have lofty's resource regarding incremental threat tracking?
so, the quick practical conclusion is that the inference code is fine for testing right now. No need for me to change things urgently.
yeah
switching to nnue-pytorch is good on my end
as I can just use the real inference code and it just "works"
not sure I fully followed that remark, but yes, SF inference code is working, though will need the speedup work that we kind of know how to start.
If nnue-pytorch is working fine, we'll have a next net in about 12h
And could have a fully trained net in 3-4d
ah basically I don't have to hack in the later layers of inference also
needed to write the entire inference from scratch last attempt with bullet
I understand now... one day would still be nice to have a bullet compatible setup, but that's a different story.
Not sure if there is a dedicated resource other than yukari code
though it is using bitlists instead of bitboards
Stefan Pohl is going to do some tests with the new net as well, against the latest release (net being the only diff to latest release). Will be interesting to have those results as well
it was 4k SB at L1 3072
oh, I guess I should refer to your code then?
how are the (expanded) threat inputs indexed btw
I know your current input set is just that but squished right?
are you referring to my attempt to simplify indexing for faster index calculations?
the current indexing setup stems from an old montytrain branch
okay I should probably upread this thread
https://github.com/Yoshie2000/PlentyChess/blob/main/src/board.cpp#L404-L565 this is what i have rn
thank you
For other NN applications I had the experience that often (maybe counterintuitively?) large NNs train faster initially
The way I explained it for myself was because with each step you update more parameters than for a small NN
like about the large nets being faster initially but much slower to squeeze out maximum performance from
So it has more potential to learn in a single step
But without systematic analysis I'll be careful to make a definite claim, it could also just be that the hyperparameters were optimized for large NNs
@rocky vigil https://pastebin.com/q5T0zfFE
ok
sample 0 looks correct by manual inspection
i mean the fact that it's so close fixed nodes means it hopefully works
how close?
one of the reasons why NNs are so hard to debug is because even when they're buggy, they often perform pretty well
30 +- 30
Interesting. And that's like -20% speed?
should be according to plenty data
The plenty measurement didn't have pairwise?
I would have thought it's less than 20% slowdown
I think anything above 30 fixed nodes should be passing easily at SF VVLTC for around 15% slowdown
i should clarify this is distance to master
so we still need ~60 more ish
should be quite straightforward to measure nps?
no need to speculate what it is right now?
It isn't. Very position dependent
right now the inference is not intended to optimize nps
it is intended to optimize for correctness
sure
Not really. The elo dependence of speed is dependent on position
non-ue on my laptop is like 1/3 - 1/2 the speed of master
right so that's a number
but i think my laptop is not representative
do stuff wrong and it'll send processes between the P / E cores etc.
average intel laptop experience
let me measure
Where do u get that number
the target is +30 (fixed nodes), and the 100 SB one was -30 +- 30
at STC the difference is 150 Elo
Lol
Well I don't think we "need" 60 Elo kek. We need a measurement with lower error bars lol
this is true
you can run this locally if you have hardware for it, just pull my branch
threat-inputs-rebase
Where's the net
up here
And I just set evalfile?
How many SBs are there in total
the full run should have 800 * num stages
Oh early days then. Might as well wait till at least first stage concludes
I assume there is some numbers on how close it is after first stage?
In a regular master run
theoretically we surpassed -50 elo with 1/8 of the first stage so hopefully the good stuff continues 🙏
I think it looks promising indeed.
56% of speed is a lot of Elo STC.
(consistent with your fixed nodes number and my STC number)
I think people should start looking at a faster inference now, full trained net will be there before the end of the week.
yeah I'll start working with yoshie and let's see how we should approach the incremental threat tracking
Pretty sure we'll get some more people to look at this as it makes progress.
there is
ah yeah your upstream optimizations have also made it here (:
lol
what we need to do next for improving NPS is like set up the foundation basically
our UE framework, etc.
and after we do that it's minor optimizations go go
I agree..
for sure
though some pondering can go in parallel 😉
bestmove sleep ponder speedup_ideas
I have a strange plot atm to fuse FC0 with add/sub
bestmove do dishes
ooh interesting
working fusing would be quite good
since average threat update has multiple add/sub
ye
my hope™ is that if add/sub is really memory bandwidth limited, then we should be able to do useful work (like the dot products) at the same time
but there are complications ofc
the other foundational thing is to set up dual accumulator, which I have been procrastinating on
can probably do a lot by fusing threat updates if done right
can't have good ue without dual accumulator, as otherwise every king move is suddenly gonna be 4x as expensive
so i guess that might be the priority
i've been trying to come up with something similar to finny tables, that fuses threat updates on a per move basis (for frequent moves), but no good idea yet
the good news is, full refreshes from mirroring changing are negligible
yeah i expected as much
elaborate?
what is a "dual accumulator"
should track the contribution from threat features and psq features separately
bc like, the refresh patterns are different
psq needs a full refresh every king move
but threats only need a full refresh when the king crosses d/e (due to horizontal mirroring)
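The two refresh conditions described above can be written down as a rough sketch. Squares are numbered a1 = 0 to h8 = 63 and the function names are made up. The psq rule follows the simplified "every king move" statement; an engine with king buckets might only refresh when the bucket or the mirroring changes, which is not modelled here.

```cpp
#include <cassert>

constexpr int file_of(int sq) { return sq & 7; }  // 0 = a-file, 7 = h-file

// Threat part: only horizontal mirroring depends on the king, so a full refresh is
// needed only when the king crosses between the a-d half and the e-h half.
bool threats_need_refresh(int kingFrom, int kingTo) {
    assert(kingFrom >= 0 && kingFrom < 64 && kingTo >= 0 && kingTo < 64);
    return (file_of(kingFrom) >= 4) != (file_of(kingTo) >= 4);
}

// Psq part, per the simplified rule above: any move of the own king forces a refresh.
bool psq_needs_refresh(int kingFrom, int kingTo) {
    assert(kingFrom >= 0 && kingFrom < 64 && kingTo >= 0 && kingTo < 64);
    return kingFrom != kingTo;
}
```

For example, Ke1-f1 refreshes only the psq accumulator, while Ke1-d1 refreshes both. Keeping the two contributions in separate accumulators is what lets the threat half skip most king moves.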
interesting
what's an up-to-date threat inputs branch/net
hmm the net is not on fishtest?
stage 1
in ~ a few hours
hopefully that'll be equal fixed nodes to master
at least
is there a net I can use to just get it running
yep
this is also the one named in my branch
oh lol
do threat inputs apply to psqt?
noob question, why are there both piece square table and positional factors
I see
like why not just the latter
well it gains
(theoretically)
but practically it gains to use the difference between psqt and positional as information
It seems that capturing simpler features first makes it much easier for the rest of the net to focus on the nonlinear ones.
is this like
what is this difference intuitively
how sharp the position is?
interesting
misread the code
Well, for some reason... the psqt and the positional factors are 125/128 and 131/128... but actually, they were trained on both being 1.
And somehow it gained.
also are we planning to use psqt biases?
I think the 125, 131 are tuned...
no i just set it to 0 to make inference easier for me
@twilit oriole stage 1 net is at https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11688159305/artifacts/file/step_574f3061fd9e/nn-3e22bf1f564d.nnue, a brief test indicates it is at least on par with master at fixed (20k) nodes
```
... Stockfish TI-experimental playing White: 33 - 12 - 55 [0.605] 100
... Stockfish TI-experimental playing Black: 19 - 30 - 51 [0.445] 100
... White vs Black: 63 - 31 - 106 [0.580] 200
Elo difference: 17.4 +/- 33.1, LOS: 84.9 %, DrawRatio: 53.0 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
200 of 200 games finished.
```
now pushed a bench to my branch
corresponding stc:
```
Elo: -116.83 +/- 6.88, nElo: -237.30 +/- 12.98
LOS: 0.00 %, DrawRatio: 30.60 %, PairsRatio: 0.08
Games: 2752, Wins: 300, Losses: 1192, Draws: 1260, Points: 930.0 (33.79 %)
Ptnml(0-2): [81, 802, 421, 72, 0], WL/DD Ratio: 1.18
LLR: -2.95 (-100.0%) (-2.94, 2.94) [-101.00, -99.00]
```
update:
Score of Stockfish TI-experimental vs Stockfish 10/07/25: 255 - 251 - 494 [0.502]
... Stockfish TI-experimental playing White: 148 - 99 - 253 [0.549] 500
... Stockfish TI-experimental playing Black: 107 - 152 - 241 [0.455] 500
... White vs Black: 300 - 206 - 494 [0.547] 1000
Elo difference: 1.4 +/- 15.3, LOS: 57.1 %, DrawRatio: 49.4 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
1000 of 1000 games finished.
there are cases where a piece is neither threatened nor attacked - the net still needs to know about it
(you may be talking about the integrated psq of the sf arch, not threats, in that case nvm)
oh btw yoshie
we should probably also concurrently start working on setting up the ue
actually lemme start by figuring out how to do dual accumulator
alright, i can start with incremental threat tracking today
like add on to my branch?
that would be welcome yeah
how much do you estimate you'd have to overhaul sf stuff
this was the main concern i had when trying to think of this
ostensibly you need to add stuff to the position structure etc.
not sure how complex the sf position structure is, hopefully less than you think
well good luck with it
i sleep soon
i don't know if the rest of the training stages are happening but that would also be interesting to see
The simplest way I see for this is to have two accumulator(stack) classes one for threats and one for psq
But this is a lot of code duplication
cc @rocky vigil
updated branch with dual acc. Just FYI I refactored some stuff with the input features so it's probably best to write incremental threats on top of this
Ok cool
There is also a new net (see above) just to note
Bench looks right
For the older net
nice, that worked well, so roughly 30 Elo progress and parity at fixed nodes.
With some luck adding the other training stages adds another 30+ Elo. So I'll start those soon
I’m wondering if some pairwise multiplication-ish architecture is possible with threat inputs
Since dual accumulators is already a thing
so, final 4 stages added here https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2095621268
Wait I thought if you only wanted to update threats from scratch I have a different function append_active_threats for this
Lemme look more carefully
seems to have crashed
Ah you updated it to exclude the psq parts
Fair enough
Does it achieve any speedup
Now that the halfkav2hm part is being ue’d normally
```
Stockfish dev-20251012-536051bf by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251012-536051bf
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-bmi2
Compilation settings : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 43, 25
single game : 732, 453
Total nodes searched : 122156946
Total search time [s] : 153.585
Nodes/second : 795370```
```./stockfish speedtest 1
Stockfish dev-20251012-3a5c355e by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251012-3a5c355e
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-bmi2
Compilation settings : 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 47, 29
single game : 798, 525
Total nodes searched : 137345907
Total search time [s] : 153.564
Nodes/second : 894388```
nice initial speedup
so now that takes care of the psq part
so we can focus on incremental threats
restarted, some network issue can cause that (somewhere between the gitlab runner reading the output and the actual calculation).
no worries.
relatively transparent.
so, looks like we already made progress with the inference code... nice!
how is this going? i can attempt to help if you want
Some other stuff got in the way, should get somewhere tomorrow
ah fair
stage 2, (nn-a878500a97a8.nnue), 8moves_v3.epd, 20k nodes
```
... Stockfish TI-experimental playing White: 151 - 72 - 277 [0.579] 500
... Stockfish TI-experimental playing Black: 104 - 116 - 280 [0.488] 500
... White vs Black: 267 - 176 - 557 [0.545] 1000
Elo difference: 23.3 +/- 14.3, LOS: 99.9 %, DrawRatio: 55.7 %
SPRT: llr 0 (0.0%), lbound -inf, ubound inf
1000 of 1000 games finished.```
🙏
2b2r2/p7/r1p1p3/P1p1P3/2P3R1/1P3kP1/2KB4/8 w - - 0 1
threat net solves this while master can't 👀
updated to latest net btw https://github.com/xu-shawn/Stockfish/tree/threat_inputs
significant static eval diff for 8/p6b/r1p1p3/P1p1P3/2P2P2/1P6/3Bk3/2K5 w - - 15 10
alright i got something written up, getting to the debugging part now
(just incremental threat tracking, no UE yet)
https://github.com/xu-shawn/Stockfish/pull/9 ig i'll PR it to shawns branch for now
gonna start working on UE now (though i might get stuck in SF inference hell there, we'll see)
yeah i rebased
ah forgot to update the bench in the PR, but the commit has the right bench
it's definitely not a good way to just do what the branch currently has:
std::vector<AccumulatorState> accumulators;
std::vector<AccumulatorState> threat_accumulators;
since AccumulatorState has a dirty piece, but now also needs a dirty threats list, which we don't want to duplicate
nnue_accumulator.h/.cpp looks awful to work with lmao
fwiw master does solve this after a while
just not quickly
yeah it like fits halfkav2hm very well but is very hard to extend
i think the easiest way to do this is probably to distinguish AccumulatorState with ThreatAccumulatorState
so maybe if that is done everything will still work nicely
I think I won't produce anything reasonable here. Adding more abstraction is going to make this code even worse, making what's there fit is ugly, I'd want to simplify it if anything
that would also be nice if it can be done in a good way
how would you want to simplify?
we should probably also get @frosty imp's opinion since he probably understands this code the best
i have no idea :P
If i understand correctly the "problem" is that the AccumulatorState
Accumulator<TransformedFeatureDimensionsBig> accumulatorBig;
Accumulator<TransformedFeatureDimensionsSmall> accumulatorSmall;
DirtyPiece
has this but it actually only needs one accumulator? and no dirty pieces?
are we removing smallnet support?
we could
i think shawn wanted to keep it in case it was still useful
but right now bool use_smallnet is just false
imo this is a maintainer decision
it makes no sense to remove it now if we need to re-implement it in 2 weeks
if we still want to support smallnet, then ideally we split between PsqAccumulatorState and ThreatAccumulatorState, where PsqAccumulatorState has smallnet, bignet, dirtypiece while ThreatAccumulatorState has bignet and dirtythreats
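A rough sketch of what that split could look like. Every type, member and size here is a placeholder chosen for illustration (in particular the `BigDims` value and the contents of `DirtyPiece` / `DirtyThreats`); none of it mirrors the real definitions in nnue_accumulator.h:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder sizes, not the real network dimensions.
constexpr std::size_t BigDims   = 1024;
constexpr std::size_t SmallDims = 128;

template<std::size_t Dims>
struct Accumulator {
    std::array<std::int16_t, Dims> values[2]{};  // one array per perspective
    bool computed[2] = {false, false};
};

struct DirtyPiece {};  // placeholder for the moved/captured piece record
struct DirtyThreats {  // placeholder: threat indices toggled by the last move
    std::vector<std::uint32_t> added, removed;
};

// psq side: big net + small net, driven by the dirty piece
struct PsqAccumulatorState {
    Accumulator<BigDims>   accumulatorBig;
    Accumulator<SmallDims> accumulatorSmall;
    DirtyPiece             dirtyPiece;
};

// threat side: big net only, driven by the dirty threats list
struct ThreatAccumulatorState {
    Accumulator<BigDims> accumulatorBig;
    DirtyThreats         dirtyThreats;
};
```

The point of the split is that neither state carries the other's dirty data, so nothing gets duplicated across the two stacks.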
i'd be fine with removing it if the threat inputs itself is strong enough to compensate the loss obviously
We should be able to remove it then
I think if we remove it it will reappear... threat net doesn't solve what smallnet provides (i.e. speed at decided positions)
it's probably worth testing later
if it turns out to be a big gain many small things can be masked underneath it
@frosty imp would you mind setting a low throughput stc vs master as well
my guess is around the range of -50 to -40 elo
can you pass the dirty type through a template?
i personally feel this would be more clear
eh sounds like a lot of code duplication
that would add templates with accumulatorStack operations anyway
I would say add a variable length version of AccumulatorUpdateContext::apply(IndexList added, removed)
ngl this file has caused me great pain over the last few days haha
honestly what would be ideal to me is some sort of simple DSL to describe the network layout
and a Python (or whatever) script to generate nice C++ code
that way you don't have to futz around with template metaprogramming
it'd also make performance improvements easier by allowing layers to be fused together
nice
i think stage 4 should be the big gain (according to master results) but we'll see
I think that's a bit unpredictable, I've seen it jump or not at that point or earlier.
anyways this is probably beyond what I can run reasonably fast locally fixed nodes so I'll just put a reduced throughput stc up on fishtest vs stage 2
at the end of the full training run there will be accurate results on the testing, so patience will also get us there.
i.e. will be clear which net to pick
fair
oh nice
shawn impatient 😉
now, I'm much more curious to see the inference speedup patch being tested like that... seems like this was another good improvement though.
@violet badger
Did you check if removing those duplicated lines actually improves "master" nets?
#nnue-dev message
#nnue-dev message ... wrong thread to ask though.
ok, sorry
the dual accumulator patch or the incremental threats patch
all steps needed to get to full speed inference 😉
but I meant the net test you did (with good improvement)
if I'm not mistaken that suggests another 10+ Elo from stage 2 to stage 3?
You will want to try doubling length of each stage after this run completes. Convergence time goes up a lot because some threats are very rare
Also u can ditch small net, try later disabling threats for decided positions (will need a new training run as well obviously)
It should do a similar thing with benefit of that regular part of the accumulator always being up to date if it switches back to regular eval
@stray reef have you debugged the incremental threats calculation? my bench isn't matching and I'm not sure where the problem is
The pain begins
suspecting something is wrong when capturing a piece
oh crap yeah
https://github.com/xu-shawn/Stockfish/blob/threat_incremental_updates/src/nnue/features/full_threats.cpp#L197 you should also guard against this being Dimensions, because that indicates deduplication
i.e. smth like
let index = make_index(...)
if (index < Dimensions) { append(index) }
never really figured out a better way to handle deduplication
afaik plentychess does same thing
the psqdifftype is actually more useful for this as you might guess
wait wdym
in short some threats imply the existence of the corresponding ones in the opposite direction
i.e. rook attacking queen implies queen attacking rook
so in that case we filter so that only one of the two is active
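The guard described above can be sketched like this. It is a toy model only: the `Threat` struct, the index formula, the `Dimensions` value and the tie-break rule (lower attacker square wins) are all assumptions made for illustration, and the real `make_index` in full_threats.cpp takes different arguments:

```cpp
#include <cstdint>
#include <vector>

using IndexType = std::uint32_t;

// Placeholder feature count (64 * 64 in this toy model).
constexpr IndexType Dimensions = 4096;

struct Threat {
    int  attackerSq;
    int  victimSq;
    bool impliesReverse;  // e.g. rook attacking queen: the queen attacks back
};

// Returns Dimensions as a sentinel when the threat is the redundant half of a
// mutual pair, so that only one direction stays active.
IndexType make_index(const Threat& t) {
    if (t.impliesReverse && t.attackerSq > t.victimSq)
        return Dimensions;
    return static_cast<IndexType>(t.attackerSq * 64 + t.victimSq);
}

// Appends only real indices; the sentinel is filtered by the bounds check.
void append_active(const std::vector<Threat>& threats, std::vector<IndexType>& out) {
    for (const Threat& t : threats) {
        const IndexType index = make_index(t);
        if (index < Dimensions)
            out.push_back(index);
    }
}
```

The same check has to run on both the from-scratch path and the incremental path, otherwise the two will disagree and the bench will drift.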
ah I see
yeah besides this most of the failure points would come from the incremental threat calculation
but yoshie claims he tested this thoroughly against from scratch
so I'm hoping it just works after these fixes
around 20%
yeah
hmm
vs threat tracking but no incr update
ok ok i see
so that moves -52 to -15 or so?
this gonna be a close one at stc
but should scale
nice
oh huh no difference on speedtest
```
Stockfish dev-20251014-895f63de by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251014-895f63de
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 56, 31
single game : 821, 583
Total nodes searched : 141977379
Total search time [s] : 153.54
Nodes/second : 924693```
```./stockfish speedtest 1
Stockfish dev-20251014-75edbee0 by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251014-75edbee0
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 78, 40
single game : 914, 712
Total nodes searched : 190559772
Total search time [s] : 153.52
Nodes/second : 1241270```
-25% maybe?
rip
@stray reef can you get similar numbers for plentychess L1=1024?
Well. It's -10 at STC ofc it's not too bad kek
Finish training + SPSA is already enough to just pass at higher TCs ig
finish training already might be enough
assuming LTC scales by +5 or so
am curious how https://tests.stockfishchess.org/tests/live_elo/68eee5fd28e6d77fcff9fe4f will affect the speeds though
still SSS 
actually that one probably benefits us as well
since it's related to overhead of finny tables
and the websocket thing happened in stage 4 again
oof
also this is so confusing
-52 elo
speedup 10 elo
net 6 elo
= -14 elo?
it does not add
btw does not fusing make it faster
yoshie also said that fusing the threat updates never worked well
oh fishtest
ig we'll see
speedtest looks the same, so I put it on fishtest
something is strange here
ig just error bars
i also like did not add factorizer to psq part for this run, that might also play a role
the next run should also have this once I figure out how to do it
@stray reef seems plentychess with 640 is typically 30-40% faster than current branch (based on manual inspection of nps in two LTC games), are these reasonable numbers?
also removal of smallnet, let's say 10 Elo .. or even more in this case.
huh? you mean the bench of the PR with no further changes?
yes i did debug it, everything was 100% identical
Later today, sure
"fixed"
btw we fixed it lmao
everything was indeed good
it was not in the threat calculation
there are some tests up on fishtest rn
around -16 stc
30-40% faster than 1024... hm
might be reasonable, if closer to the 30 side ig
is where we are at
damn nice
prayer for scaling
i think we wait until stage 5 for that
since i strongly suspect lack of factorizer + threat inputs itself means it benefits more from more stages
am I correct that this is the current branch to be used in testing https://github.com/xu-shawn/Stockfish/commits/threat_inputs/
b7f553ee8b28a4abace6c1056dceb1d69169873a
yeah
bruh it dropped to -20 i hate error bars
To think that some obscure monty led to more than a thousand nontrivial LOCs, the rewrite of Stockfish training infrastructure, etc... that could finally be gaining.
"some obscure monty"?
~50 stars on github vs 14000.
it's not obscure in the chess engine sphere
idk otherwise you could call practically any other engine obscure
Ethereal is about 400. Even stormphrax is like 100.
Koivisto is 150.
And given that github stars are already skewed toward programmers who are familiar with chess engines, the popularity of monty compared to Stockfish in the wider chess world is probably even lower.
But let's look at a more objective metric: TCEC. Monty isn't even in TCEC.
Monty should've been in tcec if not for some small issues
anyways that's beside the point
@frosty imp I might try some speed stuff later
do I speedtest or start a test on fishtest?
speedtest probably fine
am very curious how uh
ue only managed to gain like 5% speed
or smth
both are good ig
prolly cuz of Finny being cracked
Though it shows one thing...
A chess engine doesn't have to be even remotely close in strength to be able to improve another engine.
Which is wild.
i think you underestimate monty in many ways, including strength
Isn't Monty like 700 elo behind?
no but like prior to this it was literally compute every threat and add it up
i guess lazy eval probably screws around with the threats
hasn't led to an improvement yet ... you're often a bit too speculative, let's stay close to the facts..
sad
This is ongoing work.... let's not forget that something similar was tried years ago by sopel, and at that time it didn't gain either.
things have changed, not the least the amount of data available, improved trainer, etc etc... so worthwhile trying again.
as usual a lot of work has to come together to replace sota stuff..
2 more hours until stage 4 or so i presume?
something like that.
monty dev is like 3500 or 3600 afaik
and under tcec conditions a lot better than whatever ccrl or so would show
What? I see... the info might be outdated.
yeah tcec conditions a lot better
since gigantic net reduces contention
let's also not forget that PlentyChess is also #1 at ccrl 40/15 rn
measurements where?
e.g. things like https://tests.montychess.org/tests/view/68d5e6bd56f229dd4390f2b4 compared with https://tests.montychess.org/tests/view/68d5e99756f229dd4390f2b8
i suppose like
200+2 5thread
is similar to CCRL Blitz 8CPU
I said it could... not that it did...
Though I often use the back-of-the-envelope calculation that if the fixed-node Elo gain is more than two thirds of the Elo loss from the slowdown, it should gain at LTC.
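Written out, that rule of thumb is a one-line comparison. A minimal sketch, with a hypothetical function name and both inputs taken as positive Elo magnitudes; it encodes the heuristic as stated, not a measured relationship:

```cpp
// Heuristic only: expect an LTC gain when the Elo gained at fixed nodes
// exceeds two thirds of the Elo lost to the slowdown.
// Hypothetical helper; both arguments are positive magnitudes.
bool likely_gains_at_ltc(double fixedNodeGain, double slowdownLoss) {
    return fixedNodeGain > (2.0 / 3.0) * slowdownLoss;
}
```

For example, with made-up inputs, a 30 Elo fixed-node gain against a 40 Elo slowdown loss passes the check (threshold about 26.7), while a 20 Elo gain does not.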
Yeah, especially since the current top engine in the world PlentyChess has already adopted this.
guys stop distracting cj from coming up with bangers 
Oh... welp... I guess yeah...
Though Stockfish is often a bit more conservative in adopting ideas than in other engines because it often has to be done well to gain.
I understand, I did not do it well in plenty 
Oh, not that... I meant that in Stockfish, since the baseline is higher, it's much harder to gain with new ideas.
just joking ofc
@naive comet here's the profile if you haven't seen #1336647760388034610 message
sf is a much bigger entity
maybe there's opportunities in incremental threat tracking? idk
the refresh scheme might also be improvable
well threat specific stuff
would be in tracking i think
or like the actual accumulator updates
like we should see
if backwards updates are still worth it for threats
considering that refreshing from scratch is not as heavy as expected
seems so when I measured with speedtest
huh
strange
why are full refreshes so op
then
or like
how is it possible to come so close
with literally most basic strat
on average is what, 8 or so?
compared to a full refresh probably being like at least 20
well the percentage reduction from full refresh to incremental isn't as good as halfkav2
ig maybe that's where the problem is
could be interesting to try alternative update schemes based on that
okay I'll think about it
am curious how it's only 1/2 as slow in standard psq when mathematically ue is like 20% the work of full refresh
Maybe it takes a lot of work to compute what needs to get updated?
You can try byteboard technology if it helps.
Yeah that's something worth investigating... lots of simd stuff possible
how do I clone Shawn's branch and only that branch?
nvm I got it thanks to my friend chatgpt
3600 or so, and they’re focusing on strength under TCEC conditions afaik
is there a potential reason why this might trigger at a much higher rate with the threat inputs?
since it has happened again
no independent of what runs in CI, really just somehow timeout or dropped connection somewhere, needs some more robust polling mechanism in the CI infrastructure. Not our concern right now, just restart and wait a bit.
ah i see
for restart, I also updated the SF used in the final testing, so we'll get info on all steps with the current best inference in 24h or so.
step 5 should be running now.
I'm not expecting these training steps to gain miraculously, but we'll see.
fingers crossed, we'll see.
Trained a 1SB L1=1024 net quickly.
Since plenty does not have a speedtest command, the most comparable thing I think I can easily do is a single-threaded d=20 bench, since plenty uses the same bench positions as SF.
2100060 nps ao3 for Stockfish latest dev
1486695 nps ao3 for Plenty
1553660 nps ao3 for shawns TI branch
looking pretty good i'd say :P
i think it is possible that a fully trained net is a bit sparser, so maybe the "real" number for plenty would be a bit higher. but SF speeds are looking nice
@frosty imp @rocky vigil small speedup https://github.com/cj5716/Stockfish/tree/threat_inputs_3
1014074 vs 907784 but idk my hardware is noisy
I used speedtest btw
what should I do? pr to your branch?
also ideally I'd need someone with stable hardware to test this maybe
since when would 10% be small 😉
I don't trust that it's 10% to be honest cuz i was typing in word during the speedtest
I can rerun without and see I guess
like at 1200 words per minute 😉
Anyway, I think PR to the branch of shawn, and he can integrate.. ?
Can be tested on fishtest, but I think this is not essential for speedups right now.
Oh that was already resolved
I see the new pipeline now
Version : Stockfish dev-20251014-b7f553ee
Compiled by : g++ (GNUC) 14.2.0 on Linux
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 14.2.0
Large pages : yes
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-3
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 41, 22
single game : 668, 439
Total nodes searched : 97642350
Total search time [s] : 153.564
Nodes/second : 635841
Version : Stockfish dev-20251015-40e85beb
Compiled by : g++ (GNUC) 14.2.0 on Linux
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 14.2.0
Large pages : yes
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-3
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 42, 23
single game : 685, 459
Total nodes searched : 100965370
Total search time [s] : 153.551
Nodes/second : 657536
local speedtest on cj speedup
OK, so closer to 3% than 10%
oh I understand all the noise now
it was using all the threads for speedtest
I dropped speedtest down to single thread
should be more accurate now
```
Stockfish dev-20251014-895f63de by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251014-895f63de
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 58, 31
single game : 832, 590
Total nodes searched : 143169711
Total search time [s] : 153.545
Nodes/second : 932428```
```./stockfish speedtest 1
Stockfish dev-20251015-40e85beb by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251015-40e85beb
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 52, 31
single game : 820, 582
Total nodes searched : 142798016
Total search time [s] : 153.549
Nodes/second : 929983```
yeah neutral on my laptop
but my laptop might be noisy
I hope this ran on the P cores the entire time
I think you can handle that with priority https://github.com/official-stockfish/Stockfish/issues/6213
yeah 900k+ suggests it used P core at least a majority of the time
just do fishtest test?
so, what's the speed-ups we could reasonably still expect?
(relative to shawn's nn-598188c9a702.nnue branch, which has most of it already).
he'll make master faster faster than branch 😉
anyway, doing a quick test of your 598188c9a702 branch against master..
seems like we need another 30Elo or so..
Depends on if smallnet is counted as a speedup, probably
Maybe we can do a ralph wiggum approach with this
How much does smallnet speed up master?
I think that could be 5-10 Elo, but I don't know the exact number.
is it possible to start a smallnet run now?
you mean a threat smallnet?
yeah
have it
--------------------------------------------------
Results of master vs patch (10+0.1, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 28.66 +/- 3.69, nElo: 53.68 +/- 6.88
LOS: 100.00 %, DrawRatio: 47.71 %, PairsRatio: 1.86
Games: 9806, Wins: 3046, Losses: 2239, Draws: 4521, Points: 5306.5 (54.11 %)
Ptnml(0-2): [39, 859, 2339, 1588, 78], WL/DD Ratio: 1.26
--------------------------------------------------
we don't have it on fishtest right?
Wouldn’t a regular smallnet be better
dunno
We had -20 +- smth on fishtest
From the sprt
I see.
might depend quite a bit on HW.
(i.e. different memory architecture and so on).
Though arguably, ranges still overlap more or less.
Fair
I mean I thought smallnet purpose was speed
So it would be better to have it not be threats
we might be mixing conversations, but yeah, if possible regular small net would be faster, unless there is something sharable between the two.
I think we could get the existing small net to work first and give it a try
I think that's probably better
Maybe some template bool use_threats or whatever
I will do a check at larger TC and more threads, just to have a reference.
Will start working on smallnet in ~2 hours
The standard or the threats one? I did start a threats net optimization at 128 as well, just one stage, so probably ready in like 8h or so.
Standard
yeah dont use threats for small net
eh you need more than that
Huh
because threat accumulators are a class field
How I was gonna hack it in was just keep threat accumulators for smallnet but never touch them
I guess maybe for a temporary hack
Yeah
then you can check with constexpr bool UseThreats = Dimensions != TransformedFeatureDimensionsSmall
Ohhh indeed
don't even need templates
This works
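A sketch of how such a compile-time switch could be used, assuming the intent is threats on for the big net and off for the small net. The struct name, the method and both dimension values are placeholders, not the real feature transformer:

```cpp
#include <cstddef>

// Placeholder sizes; only the comparison pattern matters here.
constexpr std::size_t TransformedFeatureDimensionsBig   = 1024;
constexpr std::size_t TransformedFeatureDimensionsSmall = 128;

template<std::size_t Dimensions>
struct FeatureTransformerSketch {
    // Threat features are used by the big net only.
    static constexpr bool UseThreats = Dimensions != TransformedFeatureDimensionsSmall;

    // Number of accumulators an evaluation would touch in this toy model.
    static constexpr int accumulators_used() {
        if constexpr (UseThreats)
            return 2;  // psq + threats
        else
            return 1;  // psq only; the threat accumulator is left untouched
    }
};
```

Because the flag is derived from the existing `Dimensions` template parameter, no extra template argument is needed, and the threat path is compiled out entirely for the small net.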
Btw if I call eval
Will it use smallnet when applicable
Or always big net?
bignet always it seems
bruh
If I give it like KQQk
Will that default to smallnet then
In a real search
Like 8/8/8/3k4/8/8/6K1/6QQ b - - 0 1 for instance
well just disable bignet in evaluate.cpp
also remove the re-eval
What is the tc/thread count of this test?
60+0.6, 288t, 16000MB, UHO_Lichess_4852_v1.epd
Crazy
funny, 11 drawn game pairs in a row for now.
you were the one talking about TCEC style dev 😉
hehe
but I must say that if it doesn't gain at LTC I would have quite strong reservations...
Can you do an updated fixed nodes with smaller error bars
Do you have hardware available
Yes but vondele already has it set up lol
Oh
Yeah I think LTC smp is our target
LTC if possible
when you set back the concurrency but forget to set back the hash ...
@frosty imp which branch is preferable for me to test smallnet against
just the threat_inputs branch
yep
and stage 4 net or still stage 3
uh let's keep stage 3 net
--------------------------------------------------
Results of master vs patch (20000 nodes, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: -29.60 +/- 2.28, nElo: -44.34 +/- 3.40
LOS: 0.00 %, DrawRatio: 41.82 %, PairsRatio: 0.64
Games: 40000, Wins: 10775, Losses: 14175, Draws: 15050, Points: 18300.0 (45.75 %)
Ptnml(0-2): [1551, 5531, 8363, 3877, 678], WL/DD Ratio: 1.96
--------------------------------------------------
so net is not bad per se, but speed matters
@twilit oriole
btw is 3231282 the correct bench for master smallnet only?
Cool
actually @frosty imp how is the threat input branch able to read the smallnet without dying
doesn't read it?
at least there are warnings related to that..
(compile time warnings that is)
meanwhile, master and patch battling it out at scale, and deciding to break the UHO_Lichess_4852_v1.epd while doing so.
not saved I'm afraid
nah.
is it still all drawn pairs
ok
that's pretty insane IMO.
but 90% drawn game pairs
like by construction the book is aiming for 50% of those.
it seems if I just do it, it reads bignet only and dies
I'd assume the format of the data structures in memory is changed?
it doesn't
I commented out the smallnet read + verification
so.
ok
checks out
shouldn't be too hard
bignet works fine
it isn't too bad
just frankenstein master and threat input code together
git checkout -b frankenstein ?
```
* frame #0: 0x00007fff7fd5b212 msvcrt.dll`memcpy + 146
frame #1: 0x00007ff74b8d5216 stockfish.exe`Stockfish::Eval::NNUE::AccumulatorCaches::Cache<128u>::Entry::clear(this=0x0000015a4ed26ac0, biases=0x0000000000000000)
frame #2: 0x00007ff74b8d52cf stockfish.exe`void Stockfish::Eval::NNUE::AccumulatorCaches::Cache<128u>::clear<Stockfish::Eval::NNUE::Network<Stockfish::Eval::NNUE::NetworkArchitecture<128u, 15, 32>, Stockfish::Eval::NNUE::FeatureTransformer<128u>>>(this=0x0000015a4ed26ac0, network=0x0000015a40626108)
frame #3: 0x00007ff74b8d536a stockfish.exe`void Stockfish::Eval::NNUE::AccumulatorCaches::clear<Stockfish::Eval::NNUE::Networks>(this=0x0000015a4ece2ac0, networks=0x0000015a40626090)
frame #4: 0x00007ff74b8d53a0 stockfish.exe`Stockfish::Eval::NNUE::AccumulatorCaches::AccumulatorCaches<Stockfish::Eval::NNUE::Networks>(this=0x0000015a4ece2ac0, networks=0x0000015a40626090)```we love to see it
how are the smallnet biases null pointer
branch?
oh
3231282 now
gg?
bruh this smallnet is 1.8M vs 2.3M in master
threat tracking maybe?
real weakness SHOWEN
try benchmark
oh wait
right
forgot it still did that
oh threat tracking pretty fast ngl
~7% runtime with big threat net
```
Stockfish dev-20251015-4c91a5c9 by the Stockfish developers (see AUTHORS file)
info string Using 1 thread
Warmup position 3/3
Position 258/258
===========================
Version : Stockfish dev-20251015-4c91a5c9
Compiled by : g++ (GNUC) 15.1.0 on MinGW64
Compilation architecture : x86-64-avxvnni
Compilation settings : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : 15.1.0
Large pages : no
User invocation : speedtest 1
Filled invocation : speedtest 1 128 150
Available processors : 0-15
Thread count : 1
Thread binding : none
TT size [MiB] : 128
Hash max, avg [per mille] :
single search : 56, 32
single game : 852, 602
Total nodes searched : 150141109
Total search time [s] : 153.543
Nodes/second : 977844```
(with smallnet)
maybe 5% faster
we'll see how much elo this is
update_piece_threats and append_changed_indices can definitely be optimized to a tiny fraction of the runtime
unless I'm misunderstanding what threats are
just what was needed
what the sprt gods grant they taketh away
Append changed indices mostly calculates threat indices, idk if there is faster way for this
that benchmark is old (before my speedup)
@twilit oriole expecting that at fishtest conditions rn at STC without any major breakthroughs we can get it to -15 +- 5 or so
