#UE Threat Inputs for AB
You can adjust the init settings
You should try adjusting the ft init because by default they’ll be tiny
I don’t think it’s an issue in bullet, there’s always a risk of dead init and that the default method yields dead init for this net arch is just unlucky
@stray reef
Score of stockfish-plentychess-1024 vs stockfish-linrock-512: 313 - 320 - 367 [0.496] 1000
... stockfish-plentychess-1024 playing White: 262 - 62 - 176 [0.700] 500
... stockfish-plentychess-1024 playing Black: 51 - 258 - 191 [0.293] 500
... White vs Black: 520 - 113 - 367 [0.704] 1000
Elo difference: -2.4 +/- 17.1, LOS: 39.0 %, DrawRatio: 36.7 %
(25k nodes, UHO4060v2)
btw since your impl is more advanced than mine rn maybe consider working with linrock/viren to test multilayer
in the meanwhile I'll be trying to optimize my impl in sf
wow that is a terrible result lmao
for me i mean
yeah I'm down to test or code stuff, or train some nets
you can check https://github.com/sscg13/Stockfish/commit/0140236ea10a9b5557cd45438218602a7e5a3533 but I think I got everything right
if there was an inference issue it would be far worse. can't check today anymore but I trust it's correct
yeah I think viren really wanted to test an L1=1536 net at fixed nodes vs SF master (plentychess can't do this directly but it'll be very helpful in experiments, since linrock suggests there has to be a lot of data tweaking/etc. for large nets)
and I also if possible want to test threats+king buckets (my plan is to separately UE the two accumulators then combine them on evaltime, idk how much slowdown there is)
There's 4x4090 if you want to do parallel experiments. I guess we can train a new baseline and then threat nets using Leela data to get around the data bottleneck
yeah we need some input from linrock on this (whether bullet supports all the data parsing options now)
plentychess also has verbatim / mmap right
so hopefully local stc will be more accurate as well
I started doing it but am blocked until #nnue-dev message is resolved
512 L1 on the threat-inputs-full branch is between 1.7M and 1.75M nps on my machine, so faster than the 1.6M nps of SF master
for some reason 256 L1 nets only produce nonsense right now... not sure what's wrong. so I don't have data on 256 -> 512 (you could send me a net though).
But going from 512 to 1024 decreases speed by roughly 21% in this impl
With pairwise and some minor optimisations, I can match my master speed with a (80624 -> 256)x2 -> (16 -> 32 -> 1)x8 net
--------------------------------------------------
Results of Threats256PWLayers vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -17.94 +/- 6.51, nElo: -26.61 +/- 9.63
LOS: 0.00 %, DrawRatio: 41.52 %, PairsRatio: 0.74
Games: 5000, Wins: 1402, Losses: 1660, Draws: 1938, Points: 2371.0 (47.42 %)
Ptnml(0-2): [161, 677, 1038, 507, 117], WL/DD Ratio: 1.75
--------------------------------------------------
Not a bad try I would say... I did 1000 superbatches with 13B positions for this one. This is all the data I have, but it must be possible to squeeze some 30-ish fixed nodes Elo from better training procedures. Speed is about 5% slower than main, but can probably be still improved a bit
Is main (16x768 -> 1536) for you? If so this is a really good result
The 256/512 ones in my branches should also be available to download from fishtest, if you still need them
They’re already transposed
340 scale, 255/64 quant
9 king buckets only, but yes, L1 1536, and multilayer
I'll try those too then.
Currently I'm giving that net a second train for 1000 SBs with a lower LR
oh you got pairwise init working?
I did merge a small fix in the default kaiming initialisation recently but im not sure if it would have made a noticeable difference
yep. for the sake of staying with TrainerBuilder I modified new_affine_custom to
```rust
pub fn new_affine_custom(&self, id: &str, input_size: usize, output_size: usize, bias_cols: usize) -> Affine {
    let wid = format!("{}w", id);
    let stdev = (1.0 / (input_size as f32 * bias_cols as f32).sqrt()).max(0.05);
    let init = InitSettings::Normal { mean: 0.0, stdev };
    let weights = self.new_weights(&wid, Shape::new(output_size, input_size), init);
    let bias = self.new_weights(&format!("{}b", id), Shape::new(output_size, bias_cols), InitSettings::Zeroed);
    Affine { weights, bias }
}
```
you can seed the weights without doing that btw
trainer.optimiser_mut().graph.get_weights_mut("l0w").seed_random(0.0, 0.05, true).unwrap();
I see, that's good to know
Not sure if there's something better than 0.05, but that's what the formula works out to for my master net (at least that's what I remember), so I just tried it and loss was fine
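For reference, a minimal C++ sketch of the same init formula, just to show where the 0.05 floor kicks in. `initStdev` and `minStdev` are illustrative names, not anything from bullet. With this formula any layer with more than 400 inputs lands on the floor, since 1/sqrt(400) = 0.05.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Same formula as the new_affine_custom snippet above:
// stdev = max(1 / sqrt(input_size * bias_cols), 0.05).
// minStdev is the floor that keeps very wide input layers from starting near zero.
inline float initStdev(std::size_t inputSize, std::size_t biasCols, float minStdev = 0.05f) {
    const float fanIn = static_cast<float>(inputSize) * static_cast<float>(biasCols);
    return std::max(1.0f / std::sqrt(fanIn), minStdev);
}
```

For an 80624-input feature transformer the unclamped value would be roughly 0.0035, so the floor is what actually determines the init there.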
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: 1.11 +/- 6.46, nElo: 1.66 +/- 9.63
LOS: 63.20 %, DrawRatio: 43.16 %, PairsRatio: 0.99
Games: 5000, Wins: 1525, Losses: 1509, Draws: 1966, Points: 2508.0 (50.16 %)
Ptnml(0-2): [127, 587, 1079, 557, 150], WL/DD Ratio: 1.63
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (6+0.06, 1t, 16MB, Pohl.epd):
Elo: 9.34 +/- 5.48, nElo: 17.46 +/- 10.23
LOS: 99.96 %, DrawRatio: 50.09 %, PairsRatio: 1.20
Games: 4428, Wins: 1218, Losses: 1099, Draws: 2111, Points: 2273.5 (51.34 %)
Ptnml(0-2): [18, 485, 1109, 564, 38], WL/DD Ratio: 1.09
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (30+0.3, 1t, 64MB, Pohl.epd):
Elo: 5.87 +/- 7.75, nElo: 12.44 +/- 16.43
LOS: 93.11 %, DrawRatio: 56.46 %, PairsRatio: 1.17
Games: 1718, Wins: 463, Losses: 434, Draws: 821, Points: 873.5 (50.84 %)
Ptnml(0-2): [2, 170, 485, 201, 1], WL/DD Ratio: 1.16
--------------------------------------------------
Seems like multilayer just about balances out linrock's improved training setup and data at fixed nodes.
STC and LTC looking similar too. Tomorrow I will test against plenty main
One thing that could be worth trying is factorising the threat inputs,
E.g. for each threat input you could add a factoriser for just the target square and piece
Obviously it would be a rather significant training speed hit
How's it even going? Looks like the idea takes pretty long to implement...
for me at least, it's too slow currently... it'd either need to be much stronger for the speed difference, or much faster with no strength loss
I see. It's quite hard to implement as an idea.
I think next week I'll try a much simpler threat input set, essentially 768x2x2, which just encodes for every piece if it's attacked and if it's protected. That should be much faster with regards to UE. Though it will also be a lot worse at fixed nodes compared to even simplified threat inputs
That is what Monty used to do
Before big threat inputs
do you have any data on the fixed nodes strength difference?
An advantage of the 768x2x2 feature set is that it should be doable to king bucket...
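A rough sketch of how a 768x2x2 index could be computed. The block ordering (attacked/defended flags selecting one of four 768-wide blocks) and all names here are assumptions for illustration, not the layout Monty or any engine in this thread actually uses.

```cpp
#include <cassert>

constexpr int kPieceSquareFeatures = 768;  // 2 colours * 6 piece types * 64 squares
constexpr int kTotalFeatures = kPieceSquareFeatures * 2 * 2;

// Hypothetical layout: block = attacked * 2 + defended, then the usual piece-square index.
inline int attackedDefendedIndex(int colour, int pieceType, int square, bool attacked, bool defended) {
    assert(colour >= 0 && colour < 2);
    assert(pieceType >= 0 && pieceType < 6);
    assert(square >= 0 && square < 64);
    const int psq = (colour * 6 + pieceType) * 64 + square;
    const int block = (attacked ? 2 : 0) + (defended ? 1 : 0);
    return block * kPieceSquareFeatures + psq;
}
```

Each piece still activates exactly one feature, so the number of active inputs per position stays the same as plain 768; only the number of changes per move goes up when attack or defence status flips.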
Large
Like 50 elo or something
Actually it was 50 elo at stc
The new threat input had halved L1 compared to the old threat input net
Hm alright. It'll definitely be stronger than plain 768 :P
I had another idea for reducing the number of updates.
Basically, I feel like the net should be able to figure out everything from all the threat features, so we only need to activate each standard 768 feature if the corresponding piece is not attacked or defended at all
Since especially in the middlegame, pieces pretty much always move between squares that some piece already has vision on, that should mostly get rid of the updates required for the 768 features, at hopefully a very minor fixed nodes loss
This would significantly reduce the number of input changes
Should scale better with L1 increases (if it works)
that was the original idea that Viren outlined and it was way worse when we tested in Monty
i think you might be underestimating the difficulty of having to deduce piece value from some combination of threats given/received
I was hoping the net might figure it out even though it sounds hard
But if you already tested it then nevermind
(i.e. borking psq terms causes evals to be complete nonsense, but borking threat terms will still get smth less than 1000cp away)
I'm implementing the 79856+768xK arch in bullet rn but I'm not sure I'm using Factorised / Factorises quite correctly. I plan to merge myself, so I didn't implement merge_factoriser. Loss looks alright definitely, but before I waste hours or days of compute, @formal smelt could you take a look at https://pastebin.com/9YWK3xp9 if that looks reasonable?
lgtm
perfect, thanks
i'm not 100% sure about the layout of the input weights in raw.bin though. are the factorised weights at the very beginning (before the threat feature weights)? surely they must be, because i didn't tell bullet they should start at 79856
yeah they're put at the beginning
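A sketch of what merging the factoriser by hand could look like, given that the factoriser rows come first. Everything else is an assumption: float weights, one row of `l1` values per input feature, and a caller-supplied `bucketStart` saying where the bucketed features begin after the factoriser block. `mergeFactoriser` is an illustrative name, not bullet's `merge_factoriser`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// raw: row-major input weights with the factRows factoriser rows stored first.
// bucketStart: row index (counted after the factoriser block is dropped) of bucket 0.
// Returns the weights without the factoriser block, with each factoriser row
// added into the matching row of every bucket.
std::vector<float> mergeFactoriser(const std::vector<float>& raw, std::size_t l1,
                                   std::size_t factRows, std::size_t bucketStart,
                                   std::size_t buckets) {
    assert(l1 > 0 && raw.size() % l1 == 0);
    const std::size_t totalRows = raw.size() / l1;
    assert(totalRows >= factRows);
    assert(bucketStart + buckets * factRows <= totalRows - factRows);

    // Drop the factoriser block, keep everything else in its original order.
    std::vector<float> merged(raw.begin() + static_cast<std::ptrdiff_t>(factRows * l1), raw.end());

    for (std::size_t b = 0; b < buckets; ++b)
        for (std::size_t f = 0; f < factRows; ++f)
            for (std::size_t i = 0; i < l1; ++i)
                merged[(bucketStart + b * factRows + f) * l1 + i] += raw[f * l1 + i];
    return merged;
}
```

For the arch discussed here that would mean `factRows = 768` and `buckets = K`, with `bucketStart` depending on whether the threat features or the bucketed piece features come first in the file.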
#top-dev-chill message
Comparing the (79856+768x12 -> 2048)x2 -> (32 -> 64 -> 1)x8 against the (79856+768x1 -> 2048)x2 -> (16 -> 32 -> 1)x8 I trained a few weeks ago, there's at least a 50 elo fixed nodes difference here. Of course I don't know how much of it comes from the king buckets vs. the larger later layers. But I do think that UEing the king buckets together with threats is the way to go to make this work
How big is this net lol
Like in mb
quantised it's 365.4MB
💀
It compresses well though
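Quick sanity check on that size, assuming 2-byte (int16) feature transformer weights and decimal megabytes. Both are assumptions; the chat only gives the arch and the total.

```cpp
#include <cstdint>

// Bytes taken by the feature transformer weights alone.
// bytesPerWeight = 2 assumes int16 quantisation of the input layer.
constexpr std::uint64_t ftWeightBytes(std::uint64_t inputs, std::uint64_t l1,
                                      std::uint64_t bytesPerWeight = 2) {
    return inputs * l1 * bytesPerWeight;
}

constexpr std::uint64_t kInputs = 79856 + 768 * 12;             // 89072 input features
constexpr std::uint64_t kBytes = ftWeightBytes(kInputs, 2048);  // 364,838,912 bytes
static_assert(kBytes == 364838912ULL, "89072 * 2048 * 2");
```

That is about 364.8 MB for the input weights, so nearly all of the quoted 365.4MB would be the feature transformer, with the remainder being biases and the later layers.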
finished UE for threat inputs + king buckets
my GPU is busy for another 2 days but then i'll try to find some arch that has chances at real TCs
does anyone know how much king buckets can gain at fixed nodes, against an already mirrored net?
Btw are you still doing pairwise with a tiny HL?
When I was messing about with more layers nets using pairwise with a HL of 256 lost a lot of elo compared to not
Presumably because 128->256 is a lot more elo than 768->1536 or whatever most people have now
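For context, a minimal sketch of pairwise multiplication, which shows why it hurts more on a tiny hidden layer: the output has half as many values as the accumulator. The clamp range `qa = 255` follows the "255/64 quant" mentioned earlier; the rescale by `qa` and the function name are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp both halves of the accumulator to [0, qa], multiply element i of the
// first half with element i of the second half, and rescale back by qa.
std::vector<std::int32_t> pairwise(const std::vector<std::int16_t>& acc, std::int32_t qa = 255) {
    assert(acc.size() % 2 == 0);
    const std::size_t half = acc.size() / 2;
    std::vector<std::int32_t> out(half);
    for (std::size_t i = 0; i < half; ++i) {
        const std::int32_t a = std::clamp<std::int32_t>(acc[i], 0, qa);
        const std::int32_t b = std::clamp<std::int32_t>(acc[i + half], 0, qa);
        out[i] = a * b / qa;
    }
    return out;
}
```

With L1 = 256 this leaves 128 values per perspective going into L2, which is one way to read the 128->256 vs 768->1536 comparison above.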
ohhh good point, yes i am
gonna try 256 L1 without pairwise then, it's gonna be slower than master for sure but should be stronger at fixed nodes
simd should allow for steps of 64, but 192 may be too weak
@formal smelt do you think a threat inputs net of that size can be trained on capture positions too?
🤷♂️
afaik linrock tried it and it didn't go too well (in Yukari at least)
up here
alright thx
This arch trains almost 4x faster than my master arch 
eta?
probably it won't finish before I sleep. but i can test an almost-fully trained version in like 4-8h
Comparing not against the master net rn, since they have different training schedules. Instead comparing against a master arch net 0102 that uses the same training schedule as the threat inputs net 0103, at the same point in training (after stage 2 finished)
--------------------------------------------------
Results of 0103r vs Main-0102r (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -38.04 +/- 8.17, nElo: -58.73 +/- 12.51
LOS: 0.00 %, DrawRatio: 42.00 %, PairsRatio: 0.54
Games: 2962, Wins: 725, Losses: 1048, Draws: 1189, Points: 1319.5 (44.55 %)
Ptnml(0-2): [108, 449, 622, 262, 40], WL/DD Ratio: 1.60
--------------------------------------------------
Not looking great. STC is running
I've actually used dual activation for L2 -> L3 here without thinking about it. But I doubt that'll make it any weaker, even with a small L1
--------------------------------------------------
Results of 0103r vs Main-0102r (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -32.15 +/- 8.48, nElo: -63.80 +/- 16.72
LOS: 0.00 %, DrawRatio: 51.87 %, PairsRatio: 0.47
Games: 1658, Wins: 345, Losses: 498, Draws: 815, Points: 752.5 (45.39 %)
Ptnml(0-2): [13, 258, 430, 125, 3], WL/DD Ratio: 0.99
--------------------------------------------------
STC is holding up though!
this is ( -> 256 (no pairwise))x2 -> (16 (dual activation) -> 32 -> 1)x8?
There is a good chance this training schedule is absolute trash for threat input nets. So no matter how this holds up when fully trained, I'll give it another attempt
yes
ah interesting the scaling looks decent
yeah the speed is very good also
just a matter of making this arch strong, I think
"just"
🚀
I think L1 and threat tracking probably take comparatively more time compared to L2 so the loss of pairwise speed probably doesn't hit as hard
are the king buckets factorised?
perf for depth = 20 bench of main (left) vs new (right)
yes
overhead of evaluate() goes down (since smaller L1 -> L2), but incremental updates of threats, + makemove and related threat tracking is slower
yes it looks like the threat tracking is more expensive than the actual evaluation
I had smth similar in SF though that was without incremental threat tracking
Final results vs main
--------------------------------------------------
Results of 0103rr vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -13.28 +/- 6.35, nElo: -20.14 +/- 9.63
LOS: 0.00 %, DrawRatio: 42.84 %, PairsRatio: 0.81
Games: 5000, Wins: 1427, Losses: 1618, Draws: 1955, Points: 2404.5 (48.09 %)
Ptnml(0-2): [146, 644, 1071, 533, 106], WL/DD Ratio: 1.75
--------------------------------------------------
Results of 0103rr vs Main (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -37.02 +/- 9.19, nElo: -71.43 +/- 17.59
LOS: 0.00 %, DrawRatio: 50.20 %, PairsRatio: 0.44
Games: 1498, Wins: 313, Losses: 472, Draws: 713, Points: 669.5 (44.69 %)
Ptnml(0-2): [17, 242, 376, 111, 3], WL/DD Ratio: 1.09
--------------------------------------------------
Scaling is worse against main, maybe due to the training schedule, not sure.
Second attempt is underway, ETA 24h
Another idea would be to go for L1=192, but a bigger L2? like 32 or 64? no idea if that would be stronger at a similar speed
Next attempt
--------------------------------------------------
Results of 0104r vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -7.30 +/- 6.30, nElo: -11.16 +/- 9.63
LOS: 1.16 %, DrawRatio: 43.08 %, PairsRatio: 0.88
Games: 5000, Wins: 1451, Losses: 1556, Draws: 1993, Points: 2447.5 (48.95 %)
Ptnml(0-2): [128, 628, 1077, 555, 112], WL/DD Ratio: 1.66
--------------------------------------------------
Results of 0104r vs Main (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -33.55 +/- 9.50, nElo: -63.68 +/- 17.92
LOS: 0.00 %, DrawRatio: 49.03 %, PairsRatio: 0.47
Games: 1444, Wins: 312, Losses: 451, Draws: 681, Points: 652.5 (45.19 %)
Ptnml(0-2): [12, 239, 354, 110, 7], WL/DD Ratio: 1.13
--------------------------------------------------
Still not enough. At STC the slowdown kicks in (idk why it didn't here: #1336647760388034610 message), but even fixed nodes barely isn't good enough... at least with this training setup
Is the UE threat input still being tried in Stockfish?
Or have the Stockfish devs moved past this idea?
presumably if Yoshie, as the most serious attempt at threat inputs in a/b engines thus far, gets a gainer net, then it will encourage people to try it seriously in SF
SF has not even had a properly trained threat input net tried yet afaik
also this is not the only alternative to people not currently trying something in SF
I think the underestimated drawback was the threat tracking overhead which has ended up much higher than initial expectations
@stray reef How does your threat tracking work?
And what branch is it on
Where do I get the net also
which net do you want exactly? I've uploaded some past nets to my net repo but not these recent ones
yeah that branch you just linked is the correct one, the threat updating etc is done like in yukari
and the feature calculation in the file you linked obv
This one I guess?
if you want a verbatim version of the net, run make normally and it'll be put at processed.bin
is that clang or gcc?
compiler not supported? wait
what is your compiler / os setup
mmm
i haven't compiled on mingw in a while maybe i broke smth
xD
gcc should work tho
g++.exe (Rev3, Built by MSYS2 project) 14.1.0
```
g++ -std=c++17 -Wall -pedantic -Wextra -fcommon -pthread -O3 -g -ggdb -DARCH_X86 -march=native -lstdc++ -static -Wl,--no-as-needed -DEVALFILE=\"processed.bin\" -c src/engine.cpp -o src/engine.o
In file included from src/uci.h:5,
                 from src/engine.cpp:2:
src/nnue.h:472:3: error: '__attribute_noinline__' does not name a type
  472 |   __attribute_noinline__ void resetAccumulator(Board* board, Accumulator* acc);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:476:3: error: '__attribute_noinline__' does not name a type
  476 |   __attribute_noinline__ void calculateAccumulators();
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:479:3: error: '__attribute_noinline__' does not name a type
  479 |   __attribute_noinline__ void refreshPieceFeatures(Accumulator* acc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:481:3: error: '__attribute_noinline__' does not name a type
  481 |   __attribute_noinline__ void refreshThreatFeatures(Accumulator* acc);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:484:3: error: '__attribute_noinline__' does not name a type
  484 |   __attribute_noinline__ void incrementallyUpdatePieceFeatures(Accumulator* inputAcc, Accumulator* outputAcc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:486:3: error: '__attribute_noinline__' does not name a type
  486 |   __attribute_noinline__ void incrementallyUpdateThreatFeatures(Accumulator* inputAcc, Accumulator* outputAcc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [Makefile:164: src/engine.o] Error 1
make[1]: Leaving directory '/c/Users/Viren/Documents/Github/PlentyChess-0104r/PlentyChess-0104r'
make: *** [Makefile:153: all] Error 2
```
```makefile
CXXFLAGS = -std=c++17 -Wall -pedantic -Wextra -fcommon -pthread -O3 \
           -D'__attribute_noinline__=__attribute__((noinline))'
CXXFLAGS_EXTRA =
```
I put this at the top of my Makefile to fix g++ for now
right. those aren't necessary anyway, just put them there for profiling
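If the noinline markers are worth keeping for profiling, a portable macro avoids the build break. As far as I can tell `__attribute_noinline__` comes from glibc's headers, which would explain why it is missing under MinGW. `PROFILE_NOINLINE` and `sumFirst` are made-up names for this sketch.

```cpp
// Portable stand-in for __attribute_noinline__.
#if defined(_MSC_VER) && !defined(__clang__)
#define PROFILE_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#define PROFILE_NOINLINE __attribute__((noinline))
#else
#define PROFILE_NOINLINE
#endif

// Hypothetical function, only here to show the macro in use.
PROFILE_NOINLINE int sumFirst(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) total += i;
    return total;
}
```

The same macro could replace the attribute on the `nnue.h` declarations, so the functions still show up as separate frames in the profiler on compilers that support it.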
there is an impasse because linrock wants to see sufficient speed optimization before seriously trying to make a net and we want to see sufficient fixed nodes before seriously trying to speed optimize
Hm i think this fixed nodes is good enough? Im starting to speed optimize from it
we'd need to increase L1 to 320 I think. Not sure. With this exact arch I don't think I can squeeze much more than 10 elo fixed nodes without extreme effort
yeah sure
320 L1 would easily pass fixed nodes then ofc
I dont know about inference tricks but i think there are some tricks with the threats themselves. Like I think there are situations where you know you can terminate the calculation of threats early because there cant be further threats
yeah that's the stuff I didn't really put much thought into
there's also probably many moves that add and remove the same index
and i don't check for that rn either
well yeah especially if you do capture sequence
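A small sketch of cancelling indices that show up in both the add and remove lists before touching the accumulator. This is not taken from any engine here; `cancelMatching` is an illustrative name and the linear search assumes the lists stay short (a handful of entries per move).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Removes every index that appears in both lists (once per matching pair), so an
// add and a sub of the same feature never reach the accumulator. Order is not kept.
void cancelMatching(std::vector<int>& added, std::vector<int>& removed) {
    for (std::size_t i = 0; i < added.size();) {
        auto it = std::find(removed.begin(), removed.end(), added[i]);
        if (it != removed.end()) {
            // Swap-and-pop on both sides; do not advance i, the slot has a new value.
            *it = removed.back();
            removed.pop_back();
            added[i] = added.back();
            added.pop_back();
        } else {
            ++i;
        }
    }
}
```

Whether this pays off depends on how often such pairs occur relative to the cost of the search, which would need measuring.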
if you write me a bullet config for a "serious" net together with some datasets I can also bake something
see you would have to consult linrock on that
the issue is you are running say L1 = N but with a speed of N+1024 or smth
SF has L1 3072. So a pretty large threat input net should be possible at equal speed
Maybe an L1 1024 threat input net even
well with my last attempt we only got 256 to barely be faster (without incremental threat computation)
so we need a major overhaul in sf
the second issue is that we never got the bullet -> sf arch working
something goes wrong in the transpose or whatever
the custom kernels are almost certainly significantly better than my autovec'd for loops
(even for single layer)
upstream has a major NNUE code refactor since the last time I worked on threat inputs in sf btw
so basically the next attempt will be almost from scratch
I think what we will have to do is forget about SF for now. Train two leeler nets for use in a plentychess branch, L1 3072 regular net and L1 1024 threat input net, and then show how much better the threat input net is in this closest representation of what it would be like in SF. Then the idea will be fully proven finally
yeah that makes sense
and once everything is known (for speed optimization) I can work on adding it
this arch should be equivalent in speed with like a 2048 L1 net, in the current impl
ah
@stray reef Could you lazy update the threat generation itself? Like only walk through the threat index updates when evaluate is needed
you mean what currently happens in addPiece, removePiece or movePiece?
sounds like it could save some time yeah
Bumping back this thread... byteboard representation might actually speed up NNUE with threat input.
The one @plain flower is working on? I guess so, currently incremental threat updates take up 5+% of the total runtime
What I am more concerned about is how going from L1=256 to L1=512 was basically neutral at stc and only +6 elo LTC, despite a big fixed nodes gain, but that result may have something to do with either speed optimization or the fact that output buckets were messed up at the time of that test (using (pieces - 1)/4 instead of -2)
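To make the output bucket mismatch concrete: the two formulas disagree whenever the piece count is 1 mod 4, so a net trained with one and run with the other picks the wrong bucket for those positions. The function names are illustrative, and 8 buckets is assumed from the archs quoted in this thread.

```cpp
// Bucketing the net is meant to use, per the message above.
constexpr int outputBucket(int pieceCount) { return (pieceCount - 2) / 4; }

// The variant that was used by mistake at the time of that test.
constexpr int shiftedBucket(int pieceCount) { return (pieceCount - 1) / 4; }

static_assert(outputBucket(2) == 0 && outputBucket(32) == 7, "8 buckets over 2..32 pieces");
static_assert(outputBucket(5) == 0 && shiftedBucket(5) == 1, "mismatch at 5 pieces");
static_assert(outputBucket(6) == shiftedBucket(6), "agreement at 6 pieces");
```

Roughly a quarter of piece counts are affected, which makes it plausible that the bug muddied that L1 comparison without making the net obviously broken.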
Data starvation seems more likely to me
idk the details of linrock's training
Well, one possibility: I don't know how much data and training time it had
Oh it's Linrock. Then it won't be that lol
there is definitely still a significant amount that linrock could probably gain with training routine
iirc he did a second stage of the L1=256 and gained 6 elo on top
@plain flower can I learn more about incremental threat tracking?
the simplest working method for threat inputs, would be, given a position and a move, compute the added and removed threats
we can ignore pins, making this simpler
I don't think we necessarily need full attack table knowledge (in particular, we may be able to save some computation), but I am not the expert on this
Maybe we can try several versions, with pins incorporated or ignored.
the network is trained ignoring pins, so it would be best if we also inference ignoring pins
we do not need all the functions necessary for movegen, we only need enough functionality to know what is attacked
Yes... I mean each version would need its own network as well.
But might as well extract as much information as possible to feed into the net if it helps.
For now though, let's just ignore pins.
At least if you compare my impl with yukari you see that despite having L1=384 Yukari is way faster in midgame positions
(well Yukari also uses simplified threat inputs but afaik it should be around the same speed as with full threat inputs?)
tbh doing superpiece rays from src/dest and updating relevant sliders is pretty much it
hmm so basically
do superpiece from src
update all sliders attacking src
do superpiece from dest
update all sliders attacking dest?
yeah
in attack tables this would be called slider extension / slider retraction respectively
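A minimal mailbox sketch of the "superpiece" step: walk the eight rays from a square and collect the sliders whose attack passes through it. Those are the sliders that need extending when the square empties, or retracting when it gets occupied. The board type, piece enum and function name are all made up for this example, and colour is ignored.

```cpp
#include <array>
#include <cstddef>
#include <vector>

enum PieceType : int { Empty, Pawn, Knight, Bishop, Rook, Queen, King };
using Board = std::array<PieceType, 64>;  // a1 = 0, h8 = 63

// Returns the squares of all sliders (either colour) that see `sq` along a ray.
std::vector<int> slidersSeeing(const Board& board, int sq) {
    static constexpr int dFile[8] = {1, -1, 0, 0, 1, 1, -1, -1};
    static constexpr int dRank[8] = {0, 0, 1, -1, 1, -1, 1, -1};
    std::vector<int> result;
    for (int d = 0; d < 8; ++d) {
        const bool orthogonal = d < 4;  // first four directions are rook-like
        int f = sq % 8 + dFile[d];
        int r = sq / 8 + dRank[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            const PieceType p = board[static_cast<std::size_t>(r * 8 + f)];
            if (p != Empty) {
                // The first piece on the ray blocks it; it counts only if it slides this way.
                const bool slides = p == Queen || (orthogonal ? p == Rook : p == Bishop);
                if (slides) result.push_back(r * 8 + f);
                break;
            }
            f += dFile[d];
            r += dRank[d];
        }
    }
    return result;
}
```

A bitboard or vectorised version would replace the ray walk, but the set of sliders it has to find is the same.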
I was thinking of this but it looked painful to implement
esp. since we don't have a way of only doing a single ray-direction
it would be more convenient if we had file-only attacks, for instance
what about e.g. castling (especially frc castling?)
code from Clockwork:
```cpp
switch (m.flags()) {
case MoveFlags::Normal:
    new_pos.incrementally_move_piece(color, from, to, src);
    // ...
    break;
case MoveFlags::CaptureBit:
    new_pos.incrementally_remove_piece(color, src.id(), from);
    new_pos.incrementally_mutate_piece(!color, dst.id(), to, color, src);
    // ...
    break;
case MoveFlags::Castle: {
    // ...
    // TODO: Optimize further (slider updates can be elided in some cases).
    new_pos.incrementally_remove_piece(color, king_id, king_from);
    new_pos.incrementally_remove_piece(color, rook_id, rook_from);
    new_pos.incrementally_add_piece(color, king_place, king_to);
    new_pos.incrementally_add_piece(color, rook_place, rook_to);
    // ...
    break;
}
case MoveFlags::EnPassant: {
    // ...
    new_pos.incrementally_remove_piece(!color, victim.id(), victim_sq);
    new_pos.incrementally_move_piece(color, from, to, src);
    // ...
    break;
}
case MoveFlags::PromoKnight:
case MoveFlags::PromoBishop:
case MoveFlags::PromoRook:
case MoveFlags::PromoQueen: {
    // ...
    new_pos.incrementally_move_piece(color, from, to, new_place);
    // ...
    break;
}
case MoveFlags::PromoKnightCapture:
case MoveFlags::PromoBishopCapture:
case MoveFlags::PromoRookCapture:
case MoveFlags::PromoQueenCapture: {
    // ...
    new_pos.incrementally_remove_piece(color, src.id(), from);
    new_pos.incrementally_mutate_piece(!color, dst.id(), to, color, new_place);
    // ...
    break;
}
}
```
where move does extension at src and retraction at dst
add_piece just does retraction
and remove_piece just does extension
mutate doesn't do any slider updates
hmm for threat inputs we don't want to have to mutate separately since that means changing all the corresponding inputs of attackers of that piece
btw can someone independently verify this
at least on my laptop yukari seems 2x faster in a typical midgame position
which is honestly just sad
yukari doesn't even use bitboards or byteboards lol
yeah but afaik for L1=256 threat tracking takes up like half the total runtime
I would like to reduce that significantly
since I know it should be possible lol
i had a patch that does bitrays for SEE implemented for AVX2 and AVX512 which could be adapted for threat updates
yeah I believe that vector stuff can make this faster
reviewing this again it seems concerning that updating the threat feature accumulator is 4x more expensive than piece features, since from my measurements the average number of changed features should be comparable, not 4x as many
btw I also was not able to compile PlentyChess-0104r because of these weird issues:
```
src/nnue.h:472:3: error: unknown type name '__attribute_noinline__'
  472 |   __attribute_noinline__ void resetAccumulator(Board* board, Accumulator* acc);
      |   ^
src/nnue.h:476:3: error: unknown type name '__attribute_noinline__'
  476 |   __attribute_noinline__ void calculateAccumulators();
      |   ^
src/nnue.h:479:3: error: unknown type name '__attribute_noinline__'
  479 |   __attribute_noinline__ void refreshPieceFeatures(Accumulator* acc, KingBucketInfo* kingBucket);
      |   ^
src/nnue.h:481:3: error: unknown type name '__attribute_noinline__'
  481 |   __attribute_noinline__ void refreshThreatFeatures(Accumulator* acc);
      |   ^
src/nnue.h:484:3: error: unknown type name '__attribute_noinline__'
  484 |   __attribute_noinline__ void incrementallyUpdatePieceFeatures(Accumulator* inputAcc, Accumulator* outputAcc, Ki...
      |   ^
src/nnue.h:486:3: error: unknown type name '__attribute_noinline__'
  486 |   __attribute_noinline__ void incrementallyUpdateThreatFeatures(Accumulator* inputAcc, Accumulator* outputAcc, K...
```
is this a compiler issue on my end
ah just remove the __attribute_noinline__, i think it doesn't work on all compilers, it's just there so the function is forced to show up in the profiler
not measurable with these functions
ah
so I do that and just standard make right
am trying to speed compare on my laptop
arch is (threats + 12x768) -> 256 -> (16 -> 32 -> 1)?
i think there should be more threat updates than piece updates, don't have my numbers anymore, but iirc the average total update was like 7.X, whereas without threat inputs it's 2.X
hmm
still not a 4x increase of course
yeah
am still very curious about how Yukari can be so much faster with L1=384
I am pretty sure there is a negligible speed difference between simplified vs full threat inputs
yeah but my single-layer impl in sf is like,
:((( slow
can you try running yukari release vs https://github.com/sscg13/Stockfish/tree/threat-inputs and let me know the nps's of a few positions?
can hopefully do it in 30min
thanks a lot
i mean for testing I can always just set time odds but I think linrock thought that wasn't sound
after 100M node search from startpos:
Number of accumulator updates: 168218824
Number of positions looped through: 342768578
Number of feature indices looped through: 1840654489
but iirc feature indices counts both psq + threat
I think the number here is (feature indices) / (2 * acc updates)
which comes out to be around 5.5?
at least much less than 7.X
the inference seems broken on my compiled exe but it looks to be 50% faster than my single layer threats -> 256 -> 1 lmao
broken? that's not good
does it work locally on your computer?
yeah that's broken... on my machine it matches the number in the commit
what CPU arch, and what platform/ compiler are you using?
all I did was remove the __attribute_noinline__ and run make
1165-G7 (Intel, Tiger Lake (11th gen, AVX512) mobile), Windows, clang
wow i have the exact same config on my laptop, lemme try there
I did get lld: error: unknown argument: --no-as-needed so I executed the final link without -Wl,--no-as-needed
(I have the same issue compiling sf though, and this workaround has never messed up sf compilation for me)
i added some prints in incrementallyUpdateThreatFeatures / incrementallyUpdatePieceFeatures to see how many add/sub/addsub calls there are per incremental update. result:
1.38653 for piece features
8.76375 for threat features
(10M nodes from startpos)
piece features are mostly fused in addsub, therefore < 2
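For anyone following along, a scalar sketch of what a fused addsub does: a quiet move removes one piece feature and adds another, and both get applied in a single pass over the accumulator, so it counts as one call instead of two. The names and the fixed width of 256 are assumptions for the example; a real implementation would use SIMD.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kL1 = 256;
using Row = std::array<std::int16_t, kL1>;

// One pass over the accumulator: out = in + add - sub.
inline void addSub(const Row& in, Row& out, const Row& add, const Row& sub) {
    for (std::size_t i = 0; i < kL1; ++i)
        out[i] = static_cast<std::int16_t>(in[i] + add[i] - sub[i]);
}
```

The threat features average far more calls partly because a single move can change many threats at once, and in the numbers above they are not being paired up this way.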
this is uhhh
pretty bad
well yeah every feature is applied to both accumulators
yeah for this just 1840654489/168218824 should be correct
but the number doesn't multiply by 2
so it is 10.x
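Spelling out that arithmetic with the counters quoted above. The helper name is made up; the two inputs are the "feature indices looped through" and "accumulator updates" totals from the 100M node search.

```cpp
// Average number of feature indices touched per accumulator update.
constexpr double indicesPerUpdate(double featureIndices, double accUpdates) {
    return featureIndices / accUpdates;
}

constexpr double kPerUpdate = indicesPerUpdate(1840654489.0, 168218824.0);  // about 10.94
constexpr double kHalved = kPerUpdate / 2.0;                                // about 5.47
```

So the "around 5.5" figure is the halved value, and without the division by 2 it comes out just under 11 per update, psq and threat indices combined.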
mingw clang or "native" clang?
(the psq is not fused since I literally did the simplest for loop autovec)
mingw clang I think
or
actually for me clang lives in msys64/clang64/bin
idk
that's mingw clang then
ok
what do you want me to run on your sf branch and yukari?
uh
can you just pull up the nps values for a single (LTC) game from startpos
between the two
or is that too complicated
maybe cutechess-ob works for this
i have no idea how cutechess cli works
oh uh
does fastchess output nps in the pgns?
maybe try .\cutechess-ob.exe -engine cmd="engine1-path" tc=60+1 name=engine1-name proto=uci -engine cmd="engine2-path" tc=60+1 name=engine2-name proto=uci -games 1 -rounds 1 -pgnout "pgn-file"
I think all file paths need to be absolute
damn, yukari doesn't output nps...
doing the calculations from time and nodes searched, for a 10s think from startpos, yukari is roughly 3x faster (7900X, 1 thread)
i'll see if some llm can quickly make a script to calculate this for the PGN...
oh shoot yukari does output nps (e.g. run game in cutechess GUI, or maybe cutechess auto-calculates it???) but I think cutechess-ob only prints time and nodes searched
which is good enough theoretically
lmao 3x faster with L1=384 vs L1=256
is comparable to result on my laptop
i think yukari doesn't report final nodes during hard cutoffs, at least i think so since the nps are very inconsistent
huh
it's only roughly 50% faster on average during the game, for this reason
alright i gtg, ping me if there's anything else
aight thanks for your help
the "vibe" way to compare nps is to just load up a game in cutechess gui and eyeball the nps ratios lol
that requires having some chess gui installed :P
I see :P
@formal smelt @hollow crystal turns out I actually did stc with fixed output buckets, but no ltc, anyways it's still basically neutral: https://tests.stockfishchess.org/tests/view/67e1f1d38888403457d87680
I'll rebase everything and run some again after mmap I think
can estimate the current ltc diff to be in the range of 7-9
but it will change a lot with speed optimization and mmap
can you not addsub threat changes as well or is there a limitation to this
iirc i tried variations of this without success, but looking at it now I'm not sure i did it right... i'll put it on my todo list
i wonder if there's any eta on mmap
once that is merged I'll rebase the basic ue to see how it affects the L1 scaling
How much data did you train these nets with? I don't seem able to replicate this result and I think it may have something to do with not being data starved so the full threat inputs can saturate
Since the additions in the latter net are much less sparse than full threat inputs
For context we are at 50B+ positions with L1 3072 and still not fully saturating full threat inputs
This must have been around 7B positions each
what kind of result are you getting (the later layers of the first net are also twice as large)
So far I just checked increasing L2 size and regular piece output buckets, not much was going on there. The king buckets test will happen soon
ah in monty?
Ye
surprising that increasing L2 isn't that good
This is some short writeup on how threat inputs progressed in monty also (and what the performance is like there)
I think the full threats sucks the Elo out the later layers basically
Linrock didn't gain with output buckets either with full threats
In SF
i thought it was like +5 elo or smth
idk my branch had them so...
single layer + small L1 maybe makes it different
tbh when the commit message says "what is going wrong" i might've borked smth
was there any later test
No but there is a diff. If you screwed smth it is wrong in both sides
The diff is very simple
nvm it's basically the end of the branch
he gained like 15 elo with more involved training right after this test lol
Yeah but output buckets shouldn't really benefit much from that
yeah i guess we actually should just not have them whoops
I mean it's not the only thing there's probably a lot of small things to tweak. It just wasn't the focus
if increasing L2 doesn't gain much then it might also be worth testing decreasing L2 to 8
or smth
I put this summary of how threats progressed in monty
although idk if it would screw with the nnz or anything
Just to get an idea of the value of full threat inputs
They are great tbh, just too bad fast threat gen is so hard lol
sorry what is being compared to a standard 768 -> 3072
The 80624 -> 3072
Full threats Vs none at all
Is about 300 UHO (if you set midpoint anchors so you don't hit book limit)
is it not in monty main yet
No. Since it is training lol
i guess that's smth exciting to be looking forward to
I took the +40 measurement midway through the run
since it should be much better than the current value net right
as far as i can tell linrock did 220 sb training so there's probably some big gain there as well
Nah. He trained a much smaller net right
remember to coauthor me and sscg (he found one of the bugs)
Yeah
🚀
idk what the training time scaling laws are
hmm so if you are doing 4000 with 3072 then I guess 350 for 256 is good enough
MCTS has longer training because we take LR much lower
btw if vondele can get within 10 elo to master net then resuming threat input training is feasible right
I mean I thought about what about just temporarily shoving threat inputs into NNUE pytorch kek
But it doesn't solve the issue of not having fast threat gen
I don't even know how incremental threat gen works
yeah i am decently convinced having acceptable speed requires a major change to sf position framework
anyways I don't really want to rebase until either that or mmap is worked out
but looks like it will be quite a wait
Yeah. There's the plenty branch, if someone sends me some configs for bigger nets I can train that. We can simulate it with L1 3072 base Vs L1 1024 threats and Leela data for both or smth
I mean I think a L1 3072 base Vs L1 1024 threats in plenty will already work tbh without additional optimization
Like the threats will already be superior in that comparison
yeah I am pretty convinced as well but somehow yoshie never found the speed / data to make it work selfgen
Because his base net is L1 1536 and the threats have fixed overhead is what I think
Like in the 3072 Vs 1024 that's a 2048 delta already
SF is unique in that it's somehow managed to work out how to allow eval taking a large fraction of total time already
fixed overhead is identical to increasing l1 by 512 I think (or, 1024 in my impl lmao)
personally am more concerned why L1 = 512 to L1 = 256 didn't work in stc (in fact, slower threat impl should make this more favorable to the larger net)
tbf i borked the output buckets initially and only realized later, see https://tests.stockfishchess.org/tests/view/67df73348888403457d874df
so it needs to be redone eventually
Yeah perhaps. But the Elo delta is very small regardless, output buckets usually yields more I thought
yeah +10-20 is normal whereas here suggests it's +3 or smth
https://github.com/official-monty/montytrain/tree/fixed-threat-inputs-out-buc soon we will attempt output buckets with piece and threat count in monty, will see how that goes
might be an artifact of training tbh
Might as well if we have the threat count already is what I'm thinking
There's no overhead then really
It's segmented like this
Counts were checked also to make sure it never gets too low
So all buckets get trained properly for sure
Just waiting on some new montytrain operations impls to do it
ah interesting
like the L1=256 has +6 training advantage as far as i can tell
The NN inference might be slow also
true I autovec'd it
Which would have close to twice the impact at double L1
dunno how to write simd since I've actually not done it in Prolix
the biggest impact is probably in the screlu affine
there is probably also some nontrivial gain from fusing addsub
that's smth that someone else needs to do
unless we get sf arch to work in bullet anyhow and I can go back to the already written code
@stray reef how did you do this with bullet? So I can try in monty adding the king bucketed piece square inputs to our full threat net also
If you have the config would be useful
Perfect thx that's very useful
How long does it take Monty nets to train
Currently around 4 days on a 4090
Oh that’s quite long…
actually not too different from a SF master net on H100.
Actually is a 4090 more effective
Since vram not a concern
Might depend on dataloader speed as well
Huh
Interesting
How effective is a 5090 vs a 4090 then
Since the 5090 is supposed to have way more bandwidth
A lot more. Depends on your exact arch and how sparse it is
@rocky vigil yukari is not multilayer
i'm gonna try a net of yukaris arch rq (training for 1 superbatch) to compare speeds
Alright
Yep I’m aware
But my single layer speed sucks as well
Because I thought the progression was going to be fixed nodes then speed optimization
Wait what the hell?
I think bench speeds aren't that comparable because of different positions, but even from startpos, yukari gets
3.7M nps during a 10s search, plenty gets
2.3M nps during a 10s search...
granted, this plenty arch still has king buckets. let me get rid of those rq
i need to take a deep dive in yukari again it seems...
yeah it's not that much better without king buckets
Wait how bad is the king bucket slowdown again?
10%-ish it seems
Hmm it seems Yukari maybe counts moves that get see pruned/lmp/whatever
Idk though
Rust is not my specialty
Oh interesting
Yeah we already know about that issue where the Nps isnt comparable
Ok so that inflated nps quite a bit
Since the counting is different
Ngl wasn’t aware
Am attempting to compile Plentychess on new laptop
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full-layers-pairwise-kingbuckets is still the right one?
it feels like we stumbled upon this already and i forgot yeah
best one is 0118 currently
though the net there isn't being downloaded correctly, the one you have should work out of the box
ah
why does it never work
is the net not processed correctly
let me check the branch rq
Did the experiment to add more data to training do anything?
For monty master net i measured king buckets + factoriser was -20 and L2 16 to 128 was +25. fixed nodes elo
those are the final values
yes
styx wanted to do the training, not yet done unfortunately
very very strange
i would do but your format is too big :p
the king buckets was on just the psq
oh it's broken for me too. whoops
well it thought things were dependent on king position when they were not
threats is a more useful signal
hmmm
yeah maybe it requires like more fancy training setups
like start with psq
and then do king buckets on much lower lr
this one has L1=2048. so it's probably the wrong branch / net combo. what exactly did you want to test?
essentially plentychess speed
vs main
well the loss came in lower. it did fit the data better. its just king position isnt that useful for our net so it confused it or whatever
is L1=384 multilayer fine? that would be easiest for me to push rn
well it is L1=3072. could be different for smaller L1s
our net is too clever even output buckets and stuff are rubbish
ok nice
```
Nodes searched : 2104883
Nodes/second : 1390279
```
nice it works
lemme get plentychess main as well
```
Nodes searched : 1855539
Nodes/second : 1511025
```
- main
so maybe 384 threat inputs competitive with 1792 standard
it's really strange, the speed of the 384 and 256 seem to be almost the same
do you get similar results?
yeah more or less, faster overall but similar speed loss
i mean it's good if scaling L1 incurs less speed loss :p
i'm just gonna train a full L1=384 net
wdym by "full" sorry
like the full training schedule, not just a few SBs for fun
ohhhh
yeah good idea
btw do you know how L1=3072 would compare with 1792
i assume something on the order of -30%
maybe it's because with very small L1s the overhead of doing lots of updates is comparably high to actually doing the updates, so if the update itself is 50% longer wrt. cpu instructions, the memory overhead etc. is much lower now in comparison
I do not unfortunately
I did recommend scaling L1 lol. We found same in monty, scaling L1 of the full threat net has less speed loss than expected
yeah i think the ultimate dream is to have L1=1024 be competitive
okay then hopefully this will yield good results...
i saw some -8 i think?
wait that's really good
yeah so he can replicate the tech pretty much
so maybe new sf nets are back on the menu
Yeah need to shove threat inputs into NNUE pytorch
need to shove bullet nets into SF :P
Threat inputs in NNUE pytorch ofc
shoving bullet nets in sf just went wrong when we tried so idk...
Yeah but it failed after many months
So best is just work with NNUE pytorch I think
idk how nnue-pytorch works, is it as simple as defining new features.py
Also Bruno tried with a simple single layer to get bullet net on par with NNUE pytorch and failed by some 40 Elo
So there's that aspect also
with leela data all that filtering is worth a lot ofc
No I mean both single
oh
And using Leela data
wait
Etc etc
strange
It was made to test if bullet nets are on par with NNUE pytorch
So everything was constant if it could be
Only trainer change was the idea
Anyways it failed terribly lmao
And nobody knows why
if this, it isn't that much of a stretch to port it
Yeah it's easy. Speed might be shit but oh well
surely vondele H200 or whatever cancels out the effect
I mean it's easy now. Shove the threat inputs into features.py, train a L1 1024 net using same schedule as SF master net, yoink the plenty threat UE stuff
That's all the steps
You can probably just ask vondele to do all the training even. Since he already did it once
yeah gimme a bit to figure out how nnue-pytorch works
idk how to interpret this line https://github.com/official-stockfish/nnue-pytorch/blob/master/features/halfka_v2_hm.py#L78
So you would be adding it there. It's actually easier since it is c++ already lol
that's true
yeah that's very good
and I'll share the one-liner needed to train that this weekend.
Only 10 elo? We're getting pretty close!
Wait... how much of the net is training vs post-training SPSA?
I mean... there already is a significant possibility that with enough SPSA tune and search tune tailored for this net, it could even vs master.
The question, however, is whether or not we should do that right now.
If it would be even, there's no point
Oops... I meant beat.
But I mean tuning and all takes a lot of resources, and once you do that, you're kinda partially locked in.
So, it's probably better to train the net as best as you can first before tuning.
If you read I have already mentioned this
Oh, I see. Thanks.
also, that's both the master and small net combined, and at LTC.
Hopefully we won’t need tuning
how did the experiment go (if it's concluded)?
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: 12.32 +/- 4.51, nElo: 18.87 +/- 6.90
LOS: 100.00 %, DrawRatio: 43.56 %, PairsRatio: 1.19
Games: 9734, Wins: 3126, Losses: 2781, Draws: 3827, Points: 5039.5 (51.77 %)
Ptnml(0-2): [183, 1072, 2120, 1201, 291], WL/DD Ratio: 1.73
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (5+0.05, 1t, 16MB, UHO_4060_v2.epd):
Elo: -17.03 +/- 5.12, nElo: -32.84 +/- 9.85
LOS: 0.00 %, DrawRatio: 50.36 %, PairsRatio: 0.68
Games: 4778, Wins: 1086, Losses: 1320, Draws: 2372, Points: 2272.0 (47.55 %)
Ptnml(0-2): [23, 684, 1203, 462, 17], WL/DD Ratio: 0.96
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (30+0.3, 1t, 64MB, UHO_4060_v2.epd):
Elo: -10.10 +/- 5.22, nElo: -20.93 +/- 10.82
LOS: 0.01 %, DrawRatio: 54.52 %, PairsRatio: 0.78
Games: 3958, Wins: 932, Losses: 1047, Draws: 1979, Points: 1921.5 (48.55 %)
Ptnml(0-2): [5, 502, 1079, 389, 4], WL/DD Ratio: 0.98
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (5+0.05, 12t, 192MB, UHO_4060_v2.epd):
Elo: -2.42 +/- 5.16, nElo: -5.04 +/- 10.74
LOS: 17.90 %, DrawRatio: 55.02 %, PairsRatio: 0.94
Games: 4020, Wins: 984, Losses: 1012, Draws: 2024, Points: 1996.0 (49.65 %)
Ptnml(0-2): [4, 462, 1106, 434, 4], WL/DD Ratio: 0.96
--------------------------------------------------
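For reference, the Elo and nElo figures in result blocks like these can be reproduced from the raw counts. The sketch below assumes the usual logistic Elo formula and the pentanomial normalised-Elo definition (per-pair mean and variance); the function names are made up for illustration.

```rust
/// Logistic Elo difference from a score fraction in (0, 1).
pub fn elo_from_score(score: f64) -> f64 {
    -400.0 * (1.0 / score - 1.0).log10()
}

/// Elo from win/loss/draw counts (a draw counts as half a point).
pub fn elo_from_wld(wins: u64, losses: u64, draws: u64) -> f64 {
    let games = (wins + losses + draws) as f64;
    let points = wins as f64 + 0.5 * draws as f64;
    elo_from_score(points / games)
}

/// Normalised Elo from pentanomial counts, i.e. game pairs that scored
/// 0, 0.5, 1, 1.5 and 2 points. Pair scores are rescaled to [0, 1].
pub fn nelo_from_penta(penta: [u64; 5]) -> f64 {
    let pairs = penta.iter().sum::<u64>() as f64;
    let scores = [0.0, 0.25, 0.5, 0.75, 1.0];
    let mean = penta
        .iter()
        .zip(scores)
        .map(|(&n, s)| n as f64 * s)
        .sum::<f64>()
        / pairs;
    let var = penta
        .iter()
        .zip(scores)
        .map(|(&n, s)| n as f64 * (s - mean).powi(2))
        .sum::<f64>()
        / pairs;
    // The factor 2 converts per-pair variance to per-game variance.
    (mean - 0.5) / (2.0 * var).sqrt() * (800.0 / std::f64::consts::LN_10)
}
```

With the fixed-nodes counts above (W 3126, L 2781, D 3827 and Ptnml [183, 1072, 2120, 1201, 291]) this gives about 12.3 Elo and 18.9 nElo, matching the reported values.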
it is crazy close...
and in fact, i forgot to rebase on main, so it's missing a couple search gainers
wait it actually scales better ??
probably due to the slowdown, but yes!
wondering if i should try L1=512 next
it just hints at better scaling imo
i ran this test because
- it's a higher TC than 30+0.3
- i ran it with concurrency 1, so there's no memory bottleneck from multiple processes (i do have verbatim but eh, it's closer to tournament conditions this way)
the latter may be part of the good performance
its also within error of being neutral scaling
hm its not lol
not if you include the smp test
do u want a green. u can send rebased 16 thread 10+0.1 and i put worker on it
not yet. i want it to pass under my normal (V)LTC conditions
hm ok. smp stc seems just as valuable to me tbh
yes ofc, if i would care less about the spcc performance i'd do it
this will definitely work at ccc/tcec, that's for sure
and with some tweaking under shorter conditions as well
it's definitely worth trying i think
gonna train the first stage i think, and then compare at fixed nodes & stc
So if it is working here imagine what it would be like in SF with L1 1024...
I think these results indicate it would be both stronger at fixed nodes and faster lol
i can train 1 SB of an L1=1024 and compare speeds with SF master
oh, and i forgot to tell, this is without pairwise, at this size it probably makes sense to use it again
actually not super uncompetitive
Yeah I was assuming pairwise
FYI, we now have a fully described pipeline to train the SF net, to near master strength #nnue-dev message ... I hope we can use that to facilitate developing and testing some of the ideas discussed here, and e.g. compare bullet to nnue-pytorch.
Results of stage 1 of 4 of the L1=512 net against stage 1 of 4 of the L1=384 net
--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: 18.54 +/- 7.02, nElo: 28.49 +/- 10.77
LOS: 100.00 %, DrawRatio: 42.89 %, PairsRatio: 1.32
Games: 3996, Wins: 1274, Losses: 1061, Draws: 1661, Points: 2104.5 (52.67 %)
Ptnml(0-2): [67, 425, 857, 526, 123], WL/DD Ratio: 1.41
--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (5+0.05, 1t, 16MB, UHO_4060_v2.epd):
Elo: -1.91 +/- 7.62, nElo: -3.67 +/- 14.59
LOS: 31.12 %, DrawRatio: 50.41 %, PairsRatio: 0.97
Games: 2178, Wins: 534, Losses: 546, Draws: 1098, Points: 1083.0 (49.72 %)
Ptnml(0-2): [11, 263, 549, 259, 7], WL/DD Ratio: 0.91
--------------------------------------------------
running LTC over night while continuing training. if it's ready in time i'll send it to tcec, else i'll send L1=384
when is the tcec deadline?
i am lil busy now but will try to make progress on nnue pytorch etc. over this weekend
Updates will be run when the current bonus is over, so in ~21h. Unfortunately that means I'll have to send the smaller net, unless it's extended last minute
not finished training or lacking hw to test
--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (30+0.3, 1t, 64MB, UHO_4060_v2.epd):
Elo: 1.02 +/- 3.41, nElo: 2.11 +/- 7.09
LOS: 72.05 %, DrawRatio: 54.27 %, PairsRatio: 1.02
Games: 9220, Wins: 2256, Losses: 2229, Draws: 4735, Points: 4623.5 (50.15 %)
Ptnml(0-2): [3, 1039, 2502, 1060, 6], WL/DD Ratio: 0.90
--------------------------------------------------
not finished
oof
cant help then
rip
maybe you can ask for an extension since this is Really Cool™?
can try :P
🙏
yeah within error margin oh well
i guess like this is first checkpoint only out of 4
512 over 384 is an instamerge tbh
yes fs, ty, will let you know when it's ready for ob
the net is gigantic (184MB), i'll have to implement leb compression
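A minimal sketch of what LEB compression of net weights can look like, assuming signed LEB128 applied to `i16` weights (the scheme Stockfish uses for its nets; whether PlentyChess uses the same layout is an assumption). Function names are hypothetical.

```rust
/// Encode i16 values as signed LEB128: 7 payload bits per byte,
/// high bit set on every byte except the last one of a value.
pub fn encode_sleb128(values: &[i16]) -> Vec<u8> {
    let mut out = Vec::new();
    for &value in values {
        let mut v = i32::from(value);
        loop {
            let byte = (v & 0x7f) as u8;
            v >>= 7; // arithmetic shift keeps the sign
            let sign_bit_set = byte & 0x40 != 0;
            // Stop once the remaining bits are pure sign extension.
            if (v == 0 && !sign_bit_set) || (v == -1 && sign_bit_set) {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }
    out
}

/// Decode `count` values written by `encode_sleb128`.
/// Panics on truncated or malformed input; a real loader would return an error.
pub fn decode_sleb128(bytes: &[u8], count: usize) -> Vec<i16> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0;
    for _ in 0..count {
        let mut result: i32 = 0;
        let mut shift = 0;
        loop {
            let byte = bytes[pos];
            pos += 1;
            result |= i32::from(byte & 0x7f) << shift;
            shift += 7;
            if byte & 0x80 == 0 {
                if byte & 0x40 != 0 {
                    result |= -1i32 << shift; // sign-extend
                }
                break;
            }
        }
        out.push(result as i16);
    }
    out
}
```

Values in [-64, 63] take a single byte, which is why a feature transformer with many small weights shrinks so much.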
alright
I already offered testing HW so it's not an issue kek. He wants it to be within his normal conditions
ah okok
yeah note that the FT size is like
gigantic ik
4x that of 32 bucket
well some 32 or 64 thread data would also be nice for tcec. but to merge into main it needs to pass at least VLTC
It compresses very well though
Binary size isn't too bad
true
is probably necessary to uh
get past the 128 MB limit or whatever
on fishtest
It's easy to bypass that limit. I do it all the time
oh for montytest right
With an L1 1024 for SF. Binary size will actually go down. So like I said before this isn't a real issue
#503163384875974656 message
what channel is this?
Compression has been implemented, so network downloads are now a lot smaller. That means we are OB-ready
@twilit oriole @desert tree Would you be interested in a high-concurrency L1=384 test (the one being sent to TCEC) against main, or do you prefer waiting for the larger L1 to be done?
Ah sorry, it's the dev channel, forgot it's not public. I asked aloril & kan to use the threat inputs branch
FRD4 engine submission deadline Friday 2025-09-19T12:00 UTC
Deadline passed so I guess it is now the end of Altsufi Kibitzer Bonus
id like to see the 512hl net
if u get the required extension of course
which i rly hope u do
i think the rules are pretty clear unfortunately
oof
https://wiki.chessdom.org/TCEC_FRD_rules
Under no circumstances are updates and fixes to engines allowed once the FRD tournament has started.
welp
384 is gonna play at least as well as main so
how long is left for the large one
i mean can just do both ngl
around 15h
L1=512 is up on furybench now. First running fixed nodes & STC against L1=384, to confirm the results of the tests I ran of the first stage.
Then I'll run some tests against main, including SMP, though I'm not sure yet what conditions are best
ayy
I'm thinking something like 8th 60+0.6, potentially more threads and less TC
(not using smt fwiw)
mh actually, the fact that you're using 16 cute chess sockets might be biasing the test a little in favor of the smaller net. but not sure if this is significant, just something to potentially keep in mind
i can drop it if you want
8?
and why would it favor either net?
ah nevermind
i was for some reason imagining that verbatim nets don't work between cutechess instances
which is of course wrong
i think nets should be shared the same way regardless of what number of cutechess instances is running
ah ok
ill just leave it as is
lmk if theres any issue
the STC (https://furybench.com/test/3001/, which is carried by your worker) is definitely producing worse results than what I ran locally after the first training stage (#1336647760388034610 message)
i don't know if such a large machine still has more problems with memory contention, even with verbatim nets?
given that the fixed nodes test is similar, speed seems to be the main thing that could cause this
in terms of memory contention this should be close-ish to tcec conditions
cause its 2 sockets with 128c each
idk how many memory channels
ill check after what memory speeds im getting
ngl i'm gonna repeat this test without your worker. -23.84 +- 2.77 vs -1.91 +/- 7.62 is way too big of a difference.
for now i'll let it run the SMP test
uhm @desert tree your worker now has 0.11M nps, that's a bit strange
wtf
its consistent too what the hell
yeah somethings wrong with it
i sure hope it didnt poison the other result
it probably did, but it was only that one STC, i'll just re-run it
can't really poison fixed nodes, and hasn't played any SMP games yet
and it disconnected
i think the host fucked something up
it went completely offline now
oh damn
ill see if i can get another worker
best i can find are zen3 workers
which wont be representative for TCEC
that's fine, we'll wait with the SMP test then
you mean finding another worker is quick?
true, @split warren i'm running some threat input tests on OB right now, mind helping out with the SMP LTC test?
2x 7Y83, should be up in a sec
nice
I am scrambling with the baby atm, I will come back and do my best
@stray reef it is on
awesome tysm
Hm i think @desert tree your worker is still giving different results. maybe it's just due to high concurrency. but if you look at the finished STC https://furybench.com/test/3003/ and the currently running LTC (-16.81 +- 6.22) https://furybench.com/test/3004/ and look at the individual elo of the worker (-26.92 +- 12.32)... it doesn't seem right
looking at the bench numbers, it matches the small workers (75% speed of main roughly)
i think it's not your fault this time, it must be due to concurrency
yeah idk maybe there is still some effect we aren't thinking about rn. i'm not knowledgeable enough in that regard
my worker loving the threat net kek
i think i want an LTC with just the small workers. it seems too far off
ok i'll let everything do the SMP test then
this stuff is due to threads of test : threads of worker ratio i observed before. if u want favourable results especially on larger worker u should keep STC and crank the threads
so you're saying i should be running something like... 8+0.08 32th?
yep
alright
so ill put it back up then
unfortunately there is no good way to prevent one of the big workers to jump back to the LTC
i'll try my best by starting/stopping it if happens
the workers already started diverging on the SMP
big worker is on the ltc
and instantly lost kek
tbh i can just take my pgns at the end and run results through elo tool or smth. simpler
increased workload size and moved it back
yeah can easily filter the few games out at the end
```
  File "/home/neural/FuryBench/Client/worker.py", line 1282, in run_openbench_worker
    if config.workload: complete_workload(config)
  File "/home/neural/FuryBench/Client/worker.py", line 1023, in complete_workload
    rr.send_errors(timestamp, cutechess_cnt)
  File "/home/neural/FuryBench/Client/worker.py", line 698, in send_errors
    for header, moves in PGNHelper.slice_pgn_file(fname):
  File "/home/neural/FuryBench/Client/worker.py", line 567, in slice_pgn_file
    raise utils.OpenBenchMisssingPGNException(reason)
utils.OpenBenchMisssingPGNException: Unable to find PGNs/3007.35002.1758483719.0.pgn. Cutechess exited with no finished games.
```
Lol u somehow managed to error the worker stopping that task


started again
How much did the extra data help btw. Was there a measurement old 384 to new 384
i don't remember if i trained a 384 net before. but i don't think so
the best thing would be to simply add some more data now and see
the new LTC definitely looks better but won't be positive yet it seems
has pairwise been tested with L1=512 (I recall it was tested at 256 and was negative, maybe?)
this seems maybe logical next step
although my suspicion is that the time taken for L1 -> L2 is not that big relative to the whole network
but who knows
actually yeah pairwise both halves L1 and doubles sparsity count
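A rough sketch of pairwise multiplication, to make the "halves L1" point concrete: the accumulator is split into two halves, each element is clamped, and matching elements are multiplied, so L1 accumulator values become L1/2 inputs to the next layer. The clamp range `QA` is an assumed value, and the rescaling shift that engines apply after the product is left out.

```rust
/// Assumed quantisation range for the clamp; real nets may use another value.
pub const QA: i16 = 255;

/// Pairwise activation: out[i] = clamp(acc[i]) * clamp(acc[i + L1/2]).
/// An output is zero whenever either of its two inputs is <= 0.
pub fn pairwise_mul(acc: &[i16]) -> Vec<i32> {
    assert!(acc.len() % 2 == 0, "accumulator width must be even");
    let (lo, hi) = acc.split_at(acc.len() / 2);
    lo.iter()
        .zip(hi)
        .map(|(&a, &b)| {
            let a = i32::from(a.clamp(0, QA));
            let b = i32::from(b.clamp(0, QA));
            a * b
        })
        .collect()
}
```

Since one non-positive input is enough to zero an output, the result tends to be sparser than a plain clamped activation over the same accumulator.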
@stray reef do you have a bullet feature input set for (factorized) threat inputs + king buckets?
I am going to try (simplified threats + 2x768) -> 64 in shatranj because I think it'll actually finish training in a reasonable time with bullet-main single thread
I expect shatranj speed to be more favorable
since most of the pieces are leapers etc.
(rook is only slider)
and no special cases like castling/en passant
i would recommend connecting with -T 120 -N 8 for a 128c, the cutechess overhead is significant with that concurrency, and it does consume quite a bit of CPU
maybe that was the issue?
god damn it @stray reef , can u lower the prio of ur other test or something? I put my machine for the test and it picked the other one
i can lower it but i think the threat input test is basically done? lol
cool i will just go back to Reckless datagen then
ok so i tried running plentychess datagen. it failed for some reason and then the focusing got ignored and it just started another task when i have explicitly inputted I do not want to run other tasks. so i had enough kek
/usr/bin/ld: src/fathom/src/tbprobe.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
Also i feel like running an ob worker shouldnt require u to be a dev and do debugging
not yet
wtf
https://github.com/Yoshie2000/bullet/blob/plenty/examples/plenty/0120.rs
this is the config for the L1=384 net
but it's not factorised yet, i'm still working on that / need to see if what i did produces reasonable results elo-wise
so yeah pairwise and factorisation need to be tested next, hopefully they'll gain 5-6 LTC elo
and since L1=512 gained even at STC over 384, it's a no-brainer to go even bigger once everything else is figured out imo
What is threat input exactly ?
Like, I understood that we put the threat in the input layer, but I haven't found anything on what are the threats
read the first messages of this thread
alright I'm gonna try out simplified threats + 2 buckets for shatranj soon™ and see how it goes
will be heavily reduced L1(=64) since 1) i only get to use single CPU thread on bullet main and 2) only 840M pos of data
vs main currently 2 buckets L1=512
random question: this is just using crelu, no pairwise right? if I read the config correctly
yes
it gained a handful elo
not with threat inputs explicitly, this is just my main training schedule
bullet legacy user attempts to parse bullet main:
wait is this by like superbatch
or idk
per datapoint
but ofc every epoch the discarded data will be different
otherwise it'd just be 5% less data which would be bad
ah interesting
```
for side in [Side::WHITE, Side::BLACK] {
    for piece in Piece::PAWN..=Piece::KING {
        let pc = 6 * side + piece - 2;
        map_bb(bbs[side] & bbs[piece], |sq| pieces[sq] = pc);
    }
}
```
what does this code do?
in
map_features
because I just copied the old simplified threat input code
and used that instead
and that doesn't have this
looks like it builds a mailbox from the bitboards
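A standalone sketch of that mailbox construction, assuming the layout the snippet implies: `bbs[0]` and `bbs[1]` hold white and black occupancy, `bbs[2..8]` hold pawn through king for both colours, and piece codes run 0..=11. `EMPTY` and the function name are hypothetical.

```rust
/// Hypothetical marker for an empty square.
pub const EMPTY: usize = 12;

/// Build a 64-entry mailbox (square -> piece code) from colour and
/// piece-type bitboards. Piece code = 6 * side + piece - 2.
pub fn mailbox_from_bitboards(bbs: &[u64; 8]) -> [usize; 64] {
    let mut pieces = [EMPTY; 64];
    for side in 0..2 {
        for piece in 2..8 {
            let mut bb = bbs[side] & bbs[piece];
            while bb != 0 {
                let sq = bb.trailing_zeros() as usize;
                pieces[sq] = 6 * side + piece - 2;
                bb &= bb - 1; // clear the lowest set bit
            }
        }
    }
    pieces
}
```

The full threat code needs this lookup because it passes the type of the piece on the destination square (`pieces[dest]`) into the feature index.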
all this is mostly taken from the montytrain branch, and adapted as necessary fwiw
ok I shouldn't need it then yea
```
let occ = bbs[0] | bbs[1];
for side in [Side::WHITE, Side::BLACK] {
    let side_offset = offsets::END * side;
    let opps = bbs[side ^ 1];
    for piece in Piece::PAWN..=Piece::KING {
        map_bb(bbs[side] & bbs[piece], |sq| {
            let threats = match piece {
                Piece::PAWN => Attacks::pawn(sq, side),
                Piece::KNIGHT => Attacks::knight(sq),
                Piece::BISHOP => Attacks::bishop(sq, occ),
                Piece::ROOK => Attacks::rook(sq, occ),
                Piece::QUEEN => Attacks::queen(sq, occ),
                Piece::KING => Attacks::king(sq),
                _ => unreachable!(),
            } & occ;
            count += 1;
            map_bb(threats, |dest| {
                let enemy = (1 << dest) & opps > 0;
                if let Some(idx) = map_piece_threat(piece, sq, dest, pieces[dest], enemy) {
                    f(side_offset + idx);
                    count += 1;
                }
            });
        });
    }
}
```
wait where's the psq feature in this then
pieces[dest] is passed to map_piece_threat? not sure, i'm on mobile rn
hangon lemme attempt to figure this stuff out
```
let occ = bbs[0] | bbs[1];
for side in [Side::WHITE, Side::BLACK] {
    let side_offset = offsets::END * side;
    let opps = bbs[side ^ 1];
    for piece in Piece::PAWN..=Piece::KING {
        map_bb(bbs[side] & bbs[piece], |sq| {
            let threats = match piece {
                Piece::PAWN => Attacks::pawn(sq, side),
                Piece::KNIGHT => Attacks::knight(sq),
                Piece::BISHOP => Attacks::bishop(sq),
                Piece::ROOK => Attacks::rook(sq, occ),
                Piece::QUEEN => Attacks::queen(sq),
                Piece::KING => Attacks::king(sq),
                _ => unreachable!(),
            } & occ;
            f(TOTAL_THREATS + [0, 384][side] + 64 * (piece - 2) + sq);
            count += 1;
            map_bb(threats, |dest| {
                let enemy = (1 << dest) & opps > 0;
                if let Some(idx) = map_piece_threat(piece, sq, dest, enemy) {
                    f(side_offset + idx);
                    count += 1;
                }
            });
        });
    }
}
```
this is what montytrain simplified threat inputs has
yes it's not needed here as simple threat inputs only distinguish if the threatened piece is an enemy or not, it doesn't care about the type
yeah but this has a f([0, 384][side] + 64 * (piece - 2) + sq) that corresponds to the psq feature
idk where the other code has that
oh that's what you mean. sorry
i moved that into the main method i think, as this input type has factorised king buckets, there must be something like Chess768::map_features
ah this?
```
fn map_features<F: FnMut(usize, usize)>(&self, pos: &Self::RequiredDataType, mut f: F) {
    let get = |ksq| (if ksq % 8 > 3 { 7 } else { 0 }, 768 * self.buckets[usize::from(ksq)]);
    let (stm_flip, stm_bucket) = get(pos.our_ksq());
    let (ntm_flip, ntm_bucket) = get(pos.opp_ksq());

    Chess768.map_features(pos, |stm, ntm| {
        let bucketed_offset = 768 + TOTAL_THREATS;
        f(bucketed_offset + stm_bucket + (stm ^ stm_flip), bucketed_offset + ntm_bucket + (ntm ^ ntm_flip)); // bucketed feature
        f(stm ^ stm_flip, ntm ^ ntm_flip) // factorised feature
    });
```
yes exactly
so does that mean I don't have to add psq features in the map_features function
you can do it either way, but imo for factorised king buckets this makes it a lot easier
you need to have it somewhere, and exactly once
ok
so I'm just gonna go with whatever your code has in main method
meaning I should remove that from the map_features function I think
also what is
```
impl ThreatInputsBucketsMirrored {
    pub fn new(buckets: [usize; 32]) -> Self {
        let num_buckets = get_num_buckets(&buckets);
        let mut expanded = [0; 64];
        for (idx, elem) in expanded.iter_mut().enumerate() {
            *elem = buckets[(idx / 8) * 4 + [0, 1, 2, 3, 3, 2, 1, 0][idx % 8]];
        }
        Self { buckets: expanded, num_buckets }
    }
}
```
just some code that mirrors the bucket layout from a 32 element array into a 64 element array
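To see the mirroring concretely, here is the same index mapping pulled out into a free function (the name `expand_mirrored` is hypothetical). The 32-entry layout covers files a to d of each rank; files e to h reuse those entries in reverse order.

```rust
/// Expand a half-board bucket layout (8 ranks x 4 files) to all 64 squares
/// by mirroring files e-h onto d-a.
pub fn expand_mirrored(buckets: [usize; 32]) -> [usize; 64] {
    let mut expanded = [0; 64];
    for (idx, elem) in expanded.iter_mut().enumerate() {
        let rank = idx / 8;
        let mirrored_file = [0, 1, 2, 3, 3, 2, 1, 0][idx % 8];
        *elem = buckets[rank * 4 + mirrored_file];
    }
    expanded
}
```

With this mapping a king on a1 and a king on h1 land in the same bucket, which is what the horizontal flip in `map_features` relies on.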
oh i see
ok the last thing I think I need to fiddle with is the settings
yay
how do I uh
run
bullet
with CPU backend?
--features cpu maybe? idk
If you have a recent bullet commit it should tell you to add this
what
ah yes i added the rand crate to the bullet_lib Cargo.toml iirc (forgive me jw :P)
idk about the mismatched types
wait i don't see it in https://github.com/Yoshie2000/bullet/blob/plenty/Cargo.toml
or am I blind
It should be an &str
mm hmm
well then
@formal smelt i assume this means it can't be done on cpu backend
Yes
what a shame
time to ask kevlu to do it ig
btw here's the actual edited config
@stray reef @formal smelt does it look good
How many threat updates could a single move cause at most? (preferably even split into add+sub counts) Has anyone put thought into this yet?
At most 32 add (8 from the moving piece, 8 from uncovering sliders, 16 from attacks to the dest) and same for subtract
Actually if deduplication is taken into account that 32 is lower
Maybe 20
Because you get at most 4 from uncovering sliders
And the other 8+16 is reduced to 16
threat inputs weakness at king safety? 
gonna run a DFRC test of threat inputs actually, just out of curiosity
yeah it's about the same strength diff to master as in normal chess
which means it scales well but doesn't actually play much better DFRC than clover and stormphrax
Pairwise tests are now up on furybench (fixed nodes + STC)
Bench is anywhere from 0 to 10% faster
Fixed nodes
Elo | 0.03 +- 3.07 (95%)
Conf | N=20000 Threads=1 Hash=16MB
Games | N: 20140 W: 5862 L: 5860 D: 8418
Penta | [439, 2372, 4427, 2412, 420]
https://furybench.com/test/3080/
STC
Elo | 4.28 +- 2.48 (95%)
SPRT | 8.0+0.08s Threads=1 Hash=16MB
LLR | 2.90 (-2.25, 2.89) [0.00, 2.50]
Games | N: 19550 W: 4940 L: 4699 D: 9911
Penta | [50, 2200, 5042, 2425, 58]
https://furybench.com/test/3083/
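For anyone wanting to sanity-check the numbers above, the Elo point estimates can be reproduced from the W/L/D counts with the usual logistic formula. This is a rough sketch with hypothetical function names; it does not compute the error bars, which come from the pentanomial variance.

```rust
/// Score fraction from win/loss/draw counts (a draw counts as half a point).
pub fn score(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = (wins + losses + draws) as f64;
    (wins as f64 + 0.5 * draws as f64) / games
}

/// Logistic Elo difference implied by a score fraction strictly between 0 and 1.
pub fn elo_from_score(s: f64) -> f64 {
    -400.0 * (1.0 / s - 1.0).log10()
}

/// Number of games represented by a pentanomial (game-pair) histogram.
pub fn games_from_penta(penta: [u32; 5]) -> u32 {
    2 * penta.iter().sum::<u32>()
}
```

Feeding in the STC counts (W: 4940, L: 4699, D: 9911) gives roughly 4.28 Elo, and the fixed-nodes counts (W: 5862, L: 5860, D: 8418) give roughly 0.03 Elo, matching the reported values. The pentanomial rows also sum to half the game counts, as expected for game pairs.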
Nice result, and should make it even easier to increase L1 for an LTC gain
i would have expected it to be a bit worse at fixed nodes, ngl
I am training stage 1 of the factorised threat features now, it's very slow, but it might be worth it
Same but I guess
Threat inputs is a little different
not sure if there are any other results of people trying pairwise with such small L1s
(without threat inputs)
If you have time could you try screlu multilayer as well
now that i have pairwise, screlu is no longer usable due to quantisation (probably, haven't thought a lot about it)
no, but it would be slower and scale worse with L1 size, i don't think that's worth trying, seeing this pairwise result
Seems like higher quantization is quite effective
That should put it only, what, 5 STC Elo away?
probably neutral at LTC
Eh test that. Threat inputs are more resistant to quantisation. So much so that we now i8 quantise ours
no i mean with gain it's now probably neutral to master at LTC
was expecting quantisation to scale linearly
holy fuck
vltc gainer then?
surprisingly 8192 works over 3072
in Monty at least
i guess the future for cpu mcts is just in big net
that's an option, but i first want to test if my factoriser impl is worth any elo
rip i have a bug in the training script. guess i'll test LTC again then
oof
Factorisation is now fully working. We'll have the results of stage 1 tomorrow
🙏
(I am factorising similarly to small threat inputs, except I also encode if the threatened/protected piece is of higher value)
it's still a ton of features. potentially i'll have to cut it down more
Ah
Yeah it’s difficult to factor
It’s a subset of PP essentially
So the factorings are pretty much also just factorings of PP
it would be possible (but more complicated probably) to use what chef tried in vine recently as a factoriser, e.g. [colored_piece][sq][sq_attacked][sq_defended]
aka. 768x4
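A sketch of how such a `[colored_piece][sq][sq_attacked][sq_defended]` factoriser could be indexed, assuming `sq_attacked` and `sq_defended` are booleans (which is what the 768x4 count implies: 12 x 64 x 2 x 2 = 3072). The nesting order and all names are assumptions for illustration, not taken from vine or bullet.

```rust
pub const NUM_COLORED_PIECES: usize = 12; // 6 piece types x 2 colours
pub const NUM_SQUARES: usize = 64;
pub const NUM_FACTORISER_FEATURES: usize = NUM_COLORED_PIECES * NUM_SQUARES * 2 * 2;

/// Flattened index into [colored_piece][sq][attacked][defended].
/// Hypothetical layout: the two flags are the fastest-varying dimensions.
pub fn factoriser_index(
    colored_piece: usize,
    sq: usize,
    attacked: bool,
    defended: bool,
) -> usize {
    debug_assert!(colored_piece < NUM_COLORED_PIECES);
    debug_assert!(sq < NUM_SQUARES);
    ((colored_piece * NUM_SQUARES + sq) * 2 + attacked as usize) * 2 + defended as usize
}
```

Each piece on the board then activates exactly one of these features, so the factoriser is as sparse as a plain 768-input layer, just with four variants per (piece, square) pair.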
fwiw it seems to train pretty quickly
going from 600->800SBs is completely neutral
for our 1024hl net
Depends on amount of data
right yeah
with not that much data you don't need that many SBs
don't mind me forgetting basic stuff about training nets
#1220867251763286207 message this would be potentially a big issue with that scheme though (though practically probably not)
New LTC not looking terribly hot rn oof
Looks fine to me. There is still scaling the L1 and adding more data left
Yeah maybe neutral LTC was optimistic
btw
I got a threat inputs branch of nnue-pytorch up
On my fork
In case anyone with gpu wants to try and see if it works
I largely copied the existing impl and just changed the function calls etc. to match the library
So here’s hoping nothing goes terribly wrong
notwithstanding the errors with nnue-pytorch, training seems to be quite fast