#UE Threat Inputs for AB


rocky vigil
#

I'll try to get transpose to work in reading

#

and see what fixed nodes results are

formal smelt
#

You can adjust the init settings

#

You should try adjusting the ft init because by default they’ll be tiny

#

I don’t think it’s an issue in bullet, there’s always a risk of dead init and that the default method yields dead init for this net arch is just unlucky

rocky vigil
#

@stray reef
Score of stockfish-plentychess-1024 vs stockfish-linrock-512: 313 - 320 - 367 [0.496] 1000
... stockfish-plentychess-1024 playing White: 262 - 62 - 176 [0.700] 500
... stockfish-plentychess-1024 playing Black: 51 - 258 - 191 [0.293] 500
... White vs Black: 520 - 113 - 367 [0.704] 1000
Elo difference: -2.4 +/- 17.1, LOS: 39.0 %, DrawRatio: 36.7 %
(25k nodes, UHO4060v2)
btw since your impl is more advanced than mine rn maybe consider working with linrock/viren to test multilayer

#

in the meanwhile I'll be trying to optimize my impl in sf

stray reef
#

wow that is a terrible result lmao

#

for me i mean

#

yeah I'm down to test or code stuff, or train some nets

stray reef
#

if there was an inference issue it would be far worse. can't check today anymore but I trust it's correct

rocky vigil
# stray reef yeah I'm down to test or code stuff, or train some nets

yeah I think viren really wanted to test an L1=1536 net at fixed nodes vs SF master (plentychess can't do this directly but it'll be very helpful in experiments, since linrock suggests there has to be a lot of data tweaking/etc. for large nets)
and I also if possible want to test threats+king buckets (my plan is to separately UE the two accumulators then combine them on evaltime, idk how much slowdown there is)

twilit oriole
#

There's 4x4090 if you want to do parallel experiments. I guess we can train a new baseline and then threat nets using Leela data to get around the data bottleneck

rocky vigil
#

yeah we need some input from linrock on this (whether bullet supports all the data parsing options now)

#

plentychess also has verbatim / mmap right

#

so hopefully local stc will be more accurate as well

stray reef
#

512 L1 on the threat-inputs-full branch is between 1.7M and 1.75M nps on my machine, so faster than the 1.6M nps of SF master

#

for some reason 256 L1 nets only produce nonsense right now... not sure what's wrong. so I don't have data on 256 -> 512 (you could send me a net though).
But going from 512 to 1024 decreases speed by roughly 21% in this impl

stray reef
#

With pairwise and some minor optimisations, I can match my master speed with a (80624 -> 256)x2 -> (16 -> 32 -> 1)x8 net

stray reef
#
--------------------------------------------------
Results of Threats256PWLayers vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -17.94 +/- 6.51, nElo: -26.61 +/- 9.63
LOS: 0.00 %, DrawRatio: 41.52 %, PairsRatio: 0.74
Games: 5000, Wins: 1402, Losses: 1660, Draws: 1938, Points: 2371.0 (47.42 %)
Ptnml(0-2): [161, 677, 1038, 507, 117], WL/DD Ratio: 1.75
--------------------------------------------------

Not a bad try I would say... I did 1000 superbatches with 13B positions for this one. This is all the data I have, but it must be possible to squeeze some 30-ish fixed nodes Elo from better training procedures. Speed is about 5% slower than main, but can probably be still improved a bit

rocky vigil
#

Is main (16x768 -> 1536) for you? If so this is a really good result

rocky vigil
#

They’re already transposed

#

340 scale, 255/64 quant

stray reef
#

Currently I'm giving that net a second train for 1000 SBs with a lower LR

formal smelt
#

I did merge a small fix in the default kaiming initialisation recently but im not sure if it would have made a noticeable difference

stray reef
#

yep. for the sake of staying with TrainerBuilder I modified new_affine_custom to

```rust
pub fn new_affine_custom(&self, id: &str, input_size: usize, output_size: usize, bias_cols: usize) -> Affine {
  let wid = format!("{}w", id);
  let stdev = (1.0 / (input_size as f32 * bias_cols as f32).sqrt()).max(0.05);
  let init = InitSettings::Normal { mean: 0.0, stdev: stdev };
  let weights = self.new_weights(&wid, Shape::new(output_size, input_size), init);
  let bias = self.new_weights(&format!("{}b", id), Shape::new(output_size, bias_cols), InitSettings::Zeroed);

  Affine { weights, bias }
}
```
formal smelt
#

you can seed the weights without doing that btw

#

trainer.optimiser_mut().graph.get_weights_mut("l0w").seed_random(0.0, 0.05, true).unwrap();

stray reef
#

I see, that's good to know

#

Not sure if there's something better than 0.05, but that's what the formula works out to for my master net (at least that's what I remember), so I just tried it and loss was fine
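For reference, the clamp in that formula can be sketched on its own. This is only an illustration in C++ of the arithmetic in the `new_affine_custom` snippet above; `init_stdev` and `min_stdev` are made-up names, and 0.05 is simply the value quoted here, not a tuned constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Standard deviation for a normal weight init: 1 / sqrt(input_size * bias_cols),
// clamped from below so that very wide input layers do not start out tiny.
// min_stdev defaults to the 0.05 mentioned in the thread (an assumption, not a tuned value).
inline float init_stdev(std::size_t input_size, std::size_t bias_cols, float min_stdev = 0.05f) {
    assert(input_size > 0 && bias_cols > 0);
    const float fan = static_cast<float>(input_size) * static_cast<float>(bias_cols);
    return std::max(1.0f / std::sqrt(fan), min_stdev);
}
```

With roughly 80k inputs the unclamped value is about 0.0035, so the floor is what actually applies; for a 16-wide layer the formula gives 0.25 and the floor does nothing.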

stray reef
#
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: 1.11 +/- 6.46, nElo: 1.66 +/- 9.63
LOS: 63.20 %, DrawRatio: 43.16 %, PairsRatio: 0.99
Games: 5000, Wins: 1525, Losses: 1509, Draws: 1966, Points: 2508.0 (50.16 %)
Ptnml(0-2): [127, 587, 1079, 557, 150], WL/DD Ratio: 1.63
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (6+0.06, 1t, 16MB, Pohl.epd):
Elo: 9.34 +/- 5.48, nElo: 17.46 +/- 10.23
LOS: 99.96 %, DrawRatio: 50.09 %, PairsRatio: 1.20
Games: 4428, Wins: 1218, Losses: 1099, Draws: 2111, Points: 2273.5 (51.34 %)
Ptnml(0-2): [18, 485, 1109, 564, 38], WL/DD Ratio: 1.09
--------------------------------------------------
Results of PlentyThreats256PWLayers-0091 vs PlentyLinrock256SingleLayer-nn-23507ff7848b.nnue (30+0.3, 1t, 64MB, Pohl.epd):
Elo: 5.87 +/- 7.75, nElo: 12.44 +/- 16.43
LOS: 93.11 %, DrawRatio: 56.46 %, PairsRatio: 1.17
Games: 1718, Wins: 463, Losses: 434, Draws: 821, Points: 873.5 (50.84 %)
Ptnml(0-2): [2, 170, 485, 201, 1], WL/DD Ratio: 1.16
--------------------------------------------------

Seems like multilayer just about balances out linrocks improved training setup and data at fixed nodes.
STC and LTC looking similar too. Tomorrow I will test against plenty main

formal smelt
#

Obviously it would be a rather significant training speed hit

lofty cedar
#

How's it even going? Looks like the idea takes pretty long to implement...

stray reef
#

for me at least, it's too slow currently... it'd either need to be much stronger for the speed difference, or much faster with no strength loss

lofty cedar
#

I see. It's quite hard to implement as an idea.

stray reef
#

I think next week I'll try a much simpler threat input set, essentially 768x2x2, which just encodes for every piece if it's attacked and if it's protected. That should be much faster with regards to UE. Though it will also be a lot worse at fixed nodes compared to even simplified threat inputs
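A rough sketch of how such a 768x2x2 index could be laid out, assuming one block of 768 piece-square features per (attacked, defended) combination. The ordering, the `feature_index` name and the colour/piece numbering are all hypothetical; the thread does not specify them.

```cpp
#include <cassert>

// 2 colours x 6 piece types x 64 squares = 768 plain piece-square features.
constexpr int kPieceSquare = 2 * 6 * 64;

// Hypothetical layout: [attacked][defended][colour][piece type][square].
inline int feature_index(int colour, int piece_type, int square, bool attacked, bool defended) {
    assert(colour >= 0 && colour < 2);
    assert(piece_type >= 0 && piece_type < 6);
    assert(square >= 0 && square < 64);
    const int psq = (colour * 6 + piece_type) * 64 + square;     // 0..767
    const int status = (attacked ? 2 : 0) + (defended ? 1 : 0);  // 0..3
    return status * kPieceSquare + psq;                          // 0..3071
}
```

Each piece still activates exactly one feature, so the number of accumulator updates per move stays close to plain 768; only the status bits of pieces whose attacked/defended state flips add extra updates.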

formal smelt
#

Before big threat inputs

stray reef
#

do you have any data on the fixed nodes strength difference?

#

An advantage of the 768x2x2 feature set is that it should be doable to king bucket...

formal smelt
#

Like 50 elo or something

#

Actually it was 50 elo at stc

#

The new threat input had halved L1 compared to the old threat input net

stray reef
#

Hm alright. It'll definitely be stronger than plain 768 :P

stray reef
#

I had another idea for reducing the number of updates.

Basically, I feel like the net should be able to figure out everything from all the threat features, so we only need to activate each standard 768 feature if the corresponding piece is not attacked or defended at all

#

Since especially in the middlegame, pieces pretty much always move between squares that some piece already has vision on, that should mostly get rid of the updates required for the 768 features, at hopefully a very minor fixed nodes loss

rocky vigil
#

This would significantly reduce the number of input changes

#

Should scale better with L1 increases (if it works)

formal smelt
#

i think you might be underestimating the difficulty of having to deduce piece value from some combination of threats given/received

stray reef
#

I was hoping the net might figure it out even though it sounds hard

#

But if you already tested it then nevermind

rocky vigil
#

i think on average the psq terms have much bigger influence

#

which makes sense

stray reef
#

I'm implementing the 79856+768xK arch in bullet rn but I'm not sure I'm using Factorised / Factorises quite correctly. I plan to merge myself, so I didn't implement merge_factoriser. Loss looks alright definitely, but before I waste hours or days of compute, @formal smelt could you take a look at https://pastebin.com/9YWK3xp9 if that looks reasonable?

stray reef
#

perfect, thanks

#

i'm not 100% sure about the layout of the input weights in raw.bin though. are the factorised weights at the very beginning (before the threat feature weights)? surely they must be, because i didn't tell bullet they should start at 79856

formal smelt
#

yeah they're put at the beginning

stray reef
#

#top-dev-chill message

stray reef
#

Comparing the (79856+768x12 -> 2048)x2 -> (32 -> 64 -> 1)x8 against the (79856+768x1 -> 2048)x2 -> (16 -> 32 -> 1)x8 I trained a few weeks ago, there's at least a 50 elo fixed nodes difference here. Of course I don't know how much of it comes from the king buckets vs. the larger later layers. But I do think that UEing the king buckets together with threats is the way to go to make this work

formal smelt
#

Like in mb

stray reef
#

quantised it's 365.4MB
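A back-of-the-envelope check of that size, assuming the figure refers to the (79856+768x12 -> 2048)x2 net mentioned above and that the input weights are stored as 16-bit integers (both assumptions; the thread does not state the storage type). `ft_weight_bytes` is an illustrative helper.

```cpp
#include <cassert>
#include <cstdint>

// Bytes taken by the input-layer weight matrix alone: inputs x L1 x bytes per weight.
inline std::uint64_t ft_weight_bytes(std::uint64_t inputs, std::uint64_t l1, std::uint64_t bytes_per_weight) {
    assert(bytes_per_weight > 0);
    return inputs * l1 * bytes_per_weight;
}
```

`ft_weight_bytes(79856 + 768 * 12, 2048, 2)` comes out to 364,838,912 bytes, i.e. about 364.8 MB, so under these assumptions nearly all of the quoted size is the input layer and the later layers are a rounding error.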

daring wren
#

💀

twilit oriole
#

It compresses well though

stray reef
#

finished UE for threat inputs + king buckets

#

my GPU is busy for another 2 days but then i'll try to find some arch that has chances at real TCs

#

anyone knows how much king buckets can gain at fixed nodes, against an already mirrored net?

formal smelt
#

4 buckets was +20 in akimbo iirc

#

Over HM

formal smelt
#

When I was messing about with more layers, nets using pairwise with a HL of 256 lost a lot of elo compared to not

#

Presumably because 128->256 is a lot more elo than 768->1536 or whatever most people have now

stray reef
#

ohhh good point, yes i am

#

gonna try 256 L1 without pairwise then, it's gonna be slower than master for sure but should be stronger at fixed nodes

#

simd should allow for steps of 64, but 192 may be too weak

stray reef
#

@formal smelt do you think a threat inputs net of that size can be trained on capture positions too?

rocky vigil
#

up here

stray reef
#

alright thx

stray reef
#

This arch trains almost 4x faster than my master arch kekgasm

stray reef
#

probably it won't finish before I sleep. but i can test an almost-fully trained version in like 4-8h

stray reef
#

Comparing not against the master net rn, since they have different training schedules. Instead comparing against a master arch net 0102 that uses the same training schedule as the threat inputs net 0103, at the same point in training (after stage 2 finished)

--------------------------------------------------
Results of 0103r vs Main-0102r (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -38.04 +/- 8.17, nElo: -58.73 +/- 12.51
LOS: 0.00 %, DrawRatio: 42.00 %, PairsRatio: 0.54
Games: 2962, Wins: 725, Losses: 1048, Draws: 1189, Points: 1319.5 (44.55 %)
Ptnml(0-2): [108, 449, 622, 262, 40], WL/DD Ratio: 1.60
--------------------------------------------------

Not looking great. STC is running

#

I've actually used dual activation for L2 -> L3 here without thinking about it. But I doubt that'll make it any weaker, even with a small L1

stray reef
#
--------------------------------------------------
Results of 0103r vs Main-0102r (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -32.15 +/- 8.48, nElo: -63.80 +/- 16.72
LOS: 0.00 %, DrawRatio: 51.87 %, PairsRatio: 0.47
Games: 1658, Wins: 345, Losses: 498, Draws: 815, Points: 752.5 (45.39 %)
Ptnml(0-2): [13, 258, 430, 125, 3], WL/DD Ratio: 0.99
--------------------------------------------------

STC is holding up though!

rocky vigil
#

this is ( -> 256 (no pairwise))x2 -> (16 (dual activation) -> 32 -> 1)x8?

stray reef
#

There is a good chance this training schedule is absolute trash for threat input nets. So no matter how this holds up when fully trained, I'll give it another attempt

rocky vigil
#

ah interesting the scaling looks decent

stray reef
#

yeah the speed is very good also

#

just a matter of making this arch strong, I think

#

"just"

daring wren
#

🚀

rocky vigil
#

yes it looks like the threat tracking is more expensive than the actual evaluation

#

I had smth similar in SF though that was without incremental threat tracking

stray reef
#

Final results vs main

--------------------------------------------------
Results of 0103rr vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -13.28 +/- 6.35, nElo: -20.14 +/- 9.63
LOS: 0.00 %, DrawRatio: 42.84 %, PairsRatio: 0.81
Games: 5000, Wins: 1427, Losses: 1618, Draws: 1955, Points: 2404.5 (48.09 %)
Ptnml(0-2): [146, 644, 1071, 533, 106], WL/DD Ratio: 1.75
--------------------------------------------------
Results of 0103rr vs Main (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -37.02 +/- 9.19, nElo: -71.43 +/- 17.59
LOS: 0.00 %, DrawRatio: 50.20 %, PairsRatio: 0.44
Games: 1498, Wins: 313, Losses: 472, Draws: 713, Points: 669.5 (44.69 %)
Ptnml(0-2): [17, 242, 376, 111, 3], WL/DD Ratio: 1.09
--------------------------------------------------

Scaling is worse against main, maybe due to the training schedule, not sure.

Second attempt is underway, ETA 24h

Another idea would be to go for L1=192, but a bigger L2? like 32 or 64? no idea if that would be stronger at a similar speed

stray reef
#

Next attempt

--------------------------------------------------
Results of 0104r vs Main (20000 nodes, 1t, 16MB, Pohl.epd):
Elo: -7.30 +/- 6.30, nElo: -11.16 +/- 9.63
LOS: 1.16 %, DrawRatio: 43.08 %, PairsRatio: 0.88
Games: 5000, Wins: 1451, Losses: 1556, Draws: 1993, Points: 2447.5 (48.95 %)
Ptnml(0-2): [128, 628, 1077, 555, 112], WL/DD Ratio: 1.66
--------------------------------------------------
Results of 0104r vs Main (5+0.05, 1t, 16MB, Pohl.epd):
Elo: -33.55 +/- 9.50, nElo: -63.68 +/- 17.92
LOS: 0.00 %, DrawRatio: 49.03 %, PairsRatio: 0.47
Games: 1444, Wins: 312, Losses: 451, Draws: 681, Points: 652.5 (45.19 %)
Ptnml(0-2): [12, 239, 354, 110, 7], WL/DD Ratio: 1.13
--------------------------------------------------

Still not enough. At STC the slowdown kicks in (idk why it didn't here: #1336647760388034610 message), but even fixed nodes barely isn't good enough... at least with this training setup

lofty cedar
#

Is the UE threat input still being tried in Stockfish?

#

Or have the Stockfish devs moved past this idea?

formal smelt
#

presumably if Yoshie, as the most serious attempt at threat inputs in a/b engines thus far, gets a gainer net, then it will encourage people to try it seriously in SF
SF has not even had a properly trained threat input net tried yet afaik

twilit oriole
#

I think the underestimated drawback was the threat tracking overhead which has ended up much higher than initial expectations

#

@stray reef How does your threat tracking work?

#

And what branch is it on

twilit oriole
#

Where do I get the net also

stray reef
#

which net do you want exactly? I've uploaded some past nets to my net repo but not these recent ones

stray reef
#

and the feature calculation in the file you linked obv

stray reef
#

if you want a verbatim version of the net, run make normally and it'll be put at processed.bin

stray reef
#

is that clang or gcc?

#

compiler not supported? wait

#

what is your compiler / os setup

twilit oriole
#

Thats mingw

#

oh it uses clang lmao

stray reef
#

mmm

#

i haven't compiled on mingw in a while maybe i broke smth

#

xD

#

gcc should work tho

twilit oriole
#

g++.exe (Rev3, Built by MSYS2 project) 14.1.0

g++ -std=c++17 -Wall -pedantic -Wextra -fcommon -pthread -O3 -g -ggdb -DARCH_X86 -march=native -lstdc++ -static -Wl,--no-as-needed -DEVALFILE=\"processed.bin\"  -c src/engine.cpp -o src/engine.o
In file included from src/uci.h:5,
                 from src/engine.cpp:2:
src/nnue.h:472:3: error: '__attribute_noinline__' does not name a type
  472 |   __attribute_noinline__ void resetAccumulator(Board* board, Accumulator* acc);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:476:3: error: '__attribute_noinline__' does not name a type
  476 |   __attribute_noinline__ void calculateAccumulators();
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:479:3: error: '__attribute_noinline__' does not name a type
  479 |   __attribute_noinline__ void refreshPieceFeatures(Accumulator* acc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:481:3: error: '__attribute_noinline__' does not name a type
  481 |   __attribute_noinline__ void refreshThreatFeatures(Accumulator* acc);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:484:3: error: '__attribute_noinline__' does not name a type
  484 |   __attribute_noinline__ void incrementallyUpdatePieceFeatures(Accumulator* inputAcc, Accumulator* outputAcc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
src/nnue.h:486:3: error: '__attribute_noinline__' does not name a type
  486 |   __attribute_noinline__ void incrementallyUpdateThreatFeatures(Accumulator* inputAcc, Accumulator* outputAcc, KingBucketInfo* kingBucket);
      |   ^~~~~~~~~~~~~~~~~~~~~~
make[1]: *** [Makefile:164: src/engine.o] Error 1
make[1]: Leaving directory '/c/Users/Viren/Documents/Github/PlentyChess-0104r/PlentyChess-0104r'
make: *** [Makefile:153: all] Error 2
#
```
CXXFLAGS = -std=c++17 -Wall -pedantic -Wextra -fcommon -pthread -O3 \
           -D'__attribute_noinline__=__attribute__((noinline))'
CXXFLAGS_EXTRA =
```
I put this at the top of my Makefile to fix g++ for now
stray reef
#

right. those aren't necessary anyway, just put them there for profiling

stray reef
#

we'd need to increase L1 to 320 I think. Not sure. With this exact arch I don't think I can squeeze much more than 10 elo fixed nodes without extreme effort

twilit oriole
#

yeah sure

stray reef
#

320 L1 would easily pass fixed nodes then ofc

twilit oriole
#

I dont know about inference tricks but i think there are some tricks with the threats themselves. Like I think there are situations where you know you can terminate the calculation of threats early because there cant be further threats

stray reef
#

yeah that's the stuff I didn't really put much thought into

#

there's also probably many moves that add and remove the same index

#

and i don't check for that rn either
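A minimal sketch of what such a check could look like, assuming the threat updates for a move are collected into two index lists before being applied to the accumulator. `cancel_common` is a hypothetical helper, not code from the branch, and a real implementation would likely want something cheaper than a linear search.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Drop every index that appears in both lists; each match cancels one add against one remove,
// so the accumulator never has to add and then subtract the same weight row.
// Element order is not preserved (swap-and-pop).
inline void cancel_common(std::vector<int>& added, std::vector<int>& removed) {
    assert(&added != &removed);  // the two lists must be distinct objects
    for (std::size_t i = 0; i < added.size();) {
        auto it = std::find(removed.begin(), removed.end(), added[i]);
        if (it == removed.end()) {
            ++i;  // no matching removal, keep this add
            continue;
        }
        *it = removed.back();
        removed.pop_back();
        added[i] = added.back();  // re-examine slot i on the next iteration
        added.pop_back();
    }
}
```

For example, adds {5, 9, 12} against removes {12, 3, 5} leave a single add (9) and a single remove (3), so four of the six row operations disappear.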

rocky vigil
#

well yeah especially if you do capture sequence

rocky vigil
#

see you would have to consult linrock on that

#

the issue is you are running say L1 = N but with a speed of N+1024 or smth

twilit oriole
#

SF has L1 3072. So a pretty large threat input net should be possible at equal speed

#

Maybe an L1 1024 threat input net even

rocky vigil
#

well with my last attempt we only got 256 to barely be faster (without incremental threat computation)

#

so we need a major overhaul in sf

#

the second issue is that we never got the bullet -> sf arch working

#

something goes wrong in the transpose or whatever

#

the custom kernels are almost certainly significantly better than my autovec'd for loops

#

(even for single layer)

#

upstream has a major NNUE code refactor since the last time I worked on threat inputs in sf btw

#

so basically the next attempt will be almost from scratch

twilit oriole
#

I think what we will have to do is forget about SF for now. Train two leeler nets for use in a plentychess branch, L1 3072 regular net and L1 1024 threat input net, and then show how much better the threat input net is in this closest representation of what it would be like in SF. Then the idea will be fully proven finally

rocky vigil
#

yeah that makes sense

#

and once everything is known (for speed optimization) I can work on adding it

rocky vigil
#

ah

twilit oriole
#

@stray reef Could you lazy update the threat generation itself? Like only walk through the threat index updates when evaluate is needed

stray reef
#

you mean what currently happens in addPiece, removePiece or movePiece?

#

sounds like it could save some time yeah

lofty cedar
#

Bumping back this thread... byteboard representation might actually speed up NNUE with threat input.

stray reef
#

The one @plain flower is working on? I guess so, currently incremental threat updates take up 5+% of the total runtime

rocky vigil
#

What I am more concerned about is how going from L1=256 to L1=512 was basically neutral at stc and only +6 elo LTC, despite a big fixed nodes gain, but that result may have something to do with either speed optimization or the fact that output buckets were messed up at the time of that test (using (pieces - 1)/4 instead of -2)
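To make the output bucket mismatch concrete, here is a small sketch of the two formulas with 8 buckets over 2..32 pieces. Which one is "right" depends on what the trainer used for that net; the function names are illustrative only.

```cpp
#include <cassert>

// Output bucket chosen from the total piece count (2..32), 8 buckets.
inline int bucket_minus2(int pieces) {
    assert(pieces >= 2 && pieces <= 32);
    return (pieces - 2) / 4;
}

inline int bucket_minus1(int pieces) {
    assert(pieces >= 2 && pieces <= 32);
    return (pieces - 1) / 4;
}

// Number of piece counts for which the two formulas pick different buckets.
inline int count_mismatches() {
    int n = 0;
    for (int p = 2; p <= 32; ++p)
        n += bucket_minus2(p) != bucket_minus1(p);
    return n;
}
```

The two agree on most piece counts and differ only at 5, 9, 13, ..., 29, where inference would read the weights of the neighbouring bucket. That kind of bug weakens the net without producing obviously broken evals.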

twilit oriole
#

Data starvation seems more likely to me

rocky vigil
#

idk the details of linrock's training

twilit oriole
#

Well, one possibility: I don't know how much data and training time it had

#

Oh it's Linrock. Then it won't be that lol

rocky vigil
#

there is definitely still a significant amount that linrock could probably gain with training routine

#

iirc he did a second stage of the L1=256 and gained 6 elo on top

#

@plain flower can I learn more about incremental threat tracking?

#

the simplest working method for threat inputs, would be, given a position and a move, compute the added and removed threats

#

we can ignore pins, making this simpler

#

I don't think we necessarily need full attack table knowledge (in particular, we may be able to save some computation), but I am not the expert on this

lofty cedar
#

Maybe we can try several versions, with pins incorporated or ignored.

rocky vigil
#

the network is trained ignoring pins, so it would be best if we also inference ignoring pins

#

we do not need all the functions necessary for movegen, we only need enough functionality to know what is attacked

lofty cedar
#

Yes... I mean each version would need its own network as well.

#

But might as well extract as much information as possible to feed into the net if it helps.

#

For now though, let's just ignore pins.

rocky vigil
#

At least if you compare my impl with yukari you see that despite having L1=384 Yukari is way faster in midgame positions

#

(well Yukari also uses simplified threat inputs but afaik it should be around the same speed as with full threat inputs?)

rocky vigil
#

hmm so basically

#

do superpiece from src

#

update all sliders attacking src

#

do superpiece from dest

#

update all sliders attacking dest?

plain flower
#

yeah

#

in attack tables this would be called slider extension / slider retraction respectively
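The first step of both extension and retraction is finding the sliders whose rays reach the square in question. A toy sketch of that "superpiece" lookup on a plain 64-char mailbox follows; the board encoding (`'.'` empty, `'R'`/`'B'`/`'Q'` sliders, colour ignored) and `sliders_seeing` are assumptions for illustration, not code from Clockwork, Yukari or PlentyChess.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Walk queen-like rays outward from sq; the first piece met on each ray "sees" sq
// if it is a slider that moves along that ray's direction.
inline std::vector<int> sliders_seeing(const std::array<char, 64>& board, int sq) {
    assert(sq >= 0 && sq < 64);
    static constexpr int dfile[8] = {1, -1, 0, 0, 1, 1, -1, -1};
    static constexpr int drank[8] = {0, 0, 1, -1, 1, -1, 1, -1};
    std::vector<int> out;
    for (int d = 0; d < 8; ++d) {
        int f = sq % 8 + dfile[d];
        int r = sq / 8 + drank[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            const char p = board[r * 8 + f];
            if (p != '.') {
                const bool orthogonal = d < 4;
                if (p == 'Q' || (orthogonal && p == 'R') || (!orthogonal && p == 'B'))
                    out.push_back(r * 8 + f);
                break;  // any piece ends the ray, slider or not
            }
            f += dfile[d];
            r += drank[d];
        }
    }
    return out;
}
```

When a piece leaves src, each slider found this way gains the threats further along its ray (extension); when a piece lands on dest, each one loses the threats behind dest (retraction).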

rocky vigil
#

I was thinking of this but it looked painful to implement

#

esp. since we don't have a way of only doing a single ray-direction

#

it would be more convenient if we had file-only attacks, for instance

plain flower
# rocky vigil what about e.g. castling (especially frc castling?)

code from Clockwork:
```cpp
switch (m.flags()) {
case MoveFlags::Normal:
    new_pos.incrementally_move_piece(color, from, to, src);
    // ...
    break;
case MoveFlags::CaptureBit:
    new_pos.incrementally_remove_piece(color, src.id(), from);
    new_pos.incrementally_mutate_piece(!color, dst.id(), to, color, src);

    // ...
    break;
case MoveFlags::Castle: {
    // ...

    // TODO: Optimize further (slider updates can be elided in some cases).
    new_pos.incrementally_remove_piece(color, king_id, king_from);
    new_pos.incrementally_remove_piece(color, rook_id, rook_from);
    new_pos.incrementally_add_piece(color, king_place, king_to);
    new_pos.incrementally_add_piece(color, rook_place, rook_to);

    // ...
    break;
}
case MoveFlags::EnPassant: {
    // ...

    new_pos.incrementally_remove_piece(!color, victim.id(), victim_sq);
    new_pos.incrementally_move_piece(color, from, to, src);

    // ...
    break;
}
case MoveFlags::PromoKnight:
case MoveFlags::PromoBishop:
case MoveFlags::PromoRook:
case MoveFlags::PromoQueen: {
    // ...

    new_pos.incrementally_move_piece(color, from, to, new_place);

    // ...
    break;
}
case MoveFlags::PromoKnightCapture:
case MoveFlags::PromoBishopCapture:
case MoveFlags::PromoRookCapture:
case MoveFlags::PromoQueenCapture: {
    // ...

    new_pos.incrementally_remove_piece(color, src.id(), from);
    new_pos.incrementally_mutate_piece(!color, dst.id(), to, color, new_place);

    // ...
    break;
}
}
```
#

where move does extension at src and retraction at dst
add_piece just does retraction
and remove_piece just does extension

#

mutate doesn't do any slider updates

rocky vigil
#

hmm for threat inputs we don't want to have to mutate separately since that means changing all the corresponding inputs of attackers of that piece

plain flower
#

yukari doesn't have mutate

#

it just does a remove then add

rocky vigil
#

at least on my laptop yukari seems 2x faster in a typical midgame position

#

which is honestly just sad

plain flower
#

yukari doesn't even use bitboards or byteboards lol

rocky vigil
#

yeah but afaik for L1=256 threat tracking takes up like half the total runtime

#

I would like to reduce that significantly

#

since I know it should be possible lol

plain flower
#

i had a patch that does bitrays for SEE implemented for AVX2 and AVX512 which could be adapted for threat updates

rocky vigil
#

yeah I believe that vector stuff can make this faster

rocky vigil
#

btw I also was not able to compile PlentyChess-0104r because of these weird issues:
src/nnue.h:472:3: error: unknown type name '__attribute_noinline__'
  472 |   __attribute_noinline__ void resetAccumulator(Board* board, Accumulator* acc);
      |   ^
src/nnue.h:476:3: error: unknown type name '__attribute_noinline__'
  476 |   __attribute_noinline__ void calculateAccumulators();
      |   ^
src/nnue.h:479:3: error: unknown type name '__attribute_noinline__'
  479 |   __attribute_noinline__ void refreshPieceFeatures(Accumulator* acc, KingBucketInfo* kingBucket);
      |   ^
src/nnue.h:481:3: error: unknown type name '__attribute_noinline__'
  481 |   __attribute_noinline__ void refreshThreatFeatures(Accumulator* acc);
      |   ^
src/nnue.h:484:3: error: unknown type name '__attribute_noinline__'
  484 |   __attribute_noinline__ void incrementallyUpdatePieceFeatures(Accumulator* inputAcc, Accumulator* outputAcc, Ki...
      |   ^
src/nnue.h:486:3: error: unknown type name '__attribute_noinline__'
  486 |   __attribute_noinline__ void incrementallyUpdateThreatFeatures(Accumulator* inputAcc, Accumulator* outputAcc, K...
      |   ^

#

is this a compiler issue on my end

stray reef
#

ah just remove the __attribute_noinline__, i think it doesn't work on all compilers, it's just there so the function is forced to show up in the profiler
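If the profiling markers should stay, one option is a small portability macro instead of `__attribute_noinline__`, which as far as I know is a helper macro from glibc's headers and so is not defined under MinGW. A sketch, with `PROFILE_NOINLINE` and `sum_to` as made-up names:

```cpp
#include <cassert>

// Pick the compiler's own spelling of "never inline this function".
#if defined(_MSC_VER) && !defined(__clang__)
#  define PROFILE_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define PROFILE_NOINLINE __attribute__((noinline))
#else
#  define PROFILE_NOINLINE  // unknown compiler: fall back to a plain function
#endif

// Example function that stays visible as its own frame in a profiler.
PROFILE_NOINLINE int sum_to(int n) {
    assert(n >= 0);
    int total = 0;
    for (int i = 1; i <= n; ++i)
        total += i;
    return total;
}
```

This has the same effect as the `-D` workaround in the Makefile above, but lives in the source so every toolchain gets it without extra flags.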

rocky vigil
#

oh I see

#

is there a noticeable speed diff

stray reef
#

not measurable with these functions

rocky vigil
#

ah

#

so I do that and just standard make right

#

am trying to speed compare on my laptop

#

arch is (threats + 12x768) -> 256 -> (16 -> 32 -> 1)?

rocky vigil
#

hmm

stray reef
#

still not a 4x increase of course

rocky vigil
#

am still very curious about how Yukari can be so much faster with L1=384

#

I am pretty sure there is a negligible speed difference between simplified vs full threat inputs

stray reef
#

yep

#

well yukari doesn't have multilayer

rocky vigil
#

yeah but my single-layer impl in sf is like,

#

:((( slow

stray reef
#

can hopefully do it in 30min

rocky vigil
#

thanks a lot

#

i mean for testing I can always just set time odds but I think linrock thought that wasn't sound

rocky vigil
#

I think the number here is (feature indices) / (2 * acc updates)

#

which comes out to be around 5.5?

#

at least much less than 7.X

rocky vigil
# stray reef yeah

the inference seems broken on my compiled exe but it looks to be 50% faster than my single layer threats -> 256 -> 1 lmao

stray reef
#

broken? that's not good

rocky vigil
#

does it work locally on your computer?

stray reef
#

yep

#

what's your bench output?

rocky vigil
#

huh

#
```
Nodes searched  : 1078738
Nodes/second    : 1112101
```
stray reef
#

yeah that's broken... on my machine it matches the number in the commit

#

what CPU arch, and what platform/ compiler are you using?

rocky vigil
#

all I did was remove the __attribute_noinline__ and run make

stray reef
#

wow i have the exact same config on my laptop, lemme try there

rocky vigil
#

I did get lld: error: unknown argument: --no-as-needed so I executed the final link without -Wl,--no-as-needed

#

(I have the same issue compiling sf though, and this workaround has never messed up sf compilation for me)

stray reef
#

piece features are mostly fused in addsub, therefore < 2
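For context, "fused" here presumably means a quiet move's add and remove are applied in one pass over the accumulator rather than two, which is why piece features can cost fewer than two accumulator updates per move. A scalar sketch of that idea (real implementations use SIMD; `add_sub` and its parameter names are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One pass over the accumulator: out = in + weights of the added feature
// minus weights of the removed feature, instead of a separate add pass and sub pass.
inline void add_sub(const std::vector<std::int16_t>& in, std::vector<std::int16_t>& out,
                    const std::vector<std::int16_t>& add_row,
                    const std::vector<std::int16_t>& sub_row) {
    assert(in.size() == out.size() && in.size() == add_row.size() && in.size() == sub_row.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = static_cast<std::int16_t>(in[i] + add_row[i] - sub_row[i]);
}
```

The accumulator is read and written once instead of twice, so the memory traffic for a simple piece move is roughly halved compared with an unfused add followed by a sub.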

#

this is uhhh

#

pretty bad

rocky vigil
#

does this multiply by 2 since two accumulators per position

#

actually I'm trolling

stray reef
#

well yeah every feature is applied to both accumulators

stray reef
#

but the number doesn't multiply by 2

rocky vigil
#

so it is 10.x

rocky vigil
#

(the psq is not fused since I literally did the simplest for loop autovec)

#

mingw clang I think

#

or

#

actually for me clang lives in msys64/clang64/bin

#

idk

stray reef
#

that's mingw clang then

rocky vigil
#

ok

stray reef
#

what do you want me to run on your sf branch and yukari?

rocky vigil
#

uh

#

can you just pull up the nps values for a single (LTC) game from startpos

#

between the two

#

or is that too complicated

#

maybe cutechess-ob works for this

stray reef
#

i have no idea how cutechess cli works

rocky vigil
#

oh uh

stray reef
#

does fastchess output nps in the pgns?

rocky vigil
#

good question

#

I don't have fastchess

#

I stick with cutechess bc shatranj

rocky vigil
#

I think all file paths need to be absolute

stray reef
#

damn, yukari doesn't output nps...

doing the calculations from time and nodes searched, for a 10s think from startpos, yukari is roughly 3x faster (7900X, 1 thread)

#

i'll see if some llm can quickly make a script to calculate this for the PGN...

rocky vigil
#

oh shoot yukari does output nps (e.g. run game in cutechess GUI, or maybe cutechess auto-calculates it???) but I think cutechess-ob only prints time and nodes searched

#

which is good enough theoretically

rocky vigil
#

is comparable to result on my laptop

stray reef
#

i think yukari doesn't report final nodes during hard cutoffs, at least i think so since the nps are very inconsistent

rocky vigil
#

huh

stray reef
#

it's only roughly 50% faster on average during the game, for this reason

rocky vigil
#

we regain our speed in the endgame I know this

#

but in the midgame the speed tanks

stray reef
#

alright i gtg, ping me if there's anything else

rocky vigil
#

aight thanks for your help

#

the "vibe" way to compare nps is to just load up a game in cutechess gui and eyeball the nps ratios lol

stray reef
#

that requires having some chess gui installed :P

rocky vigil
#

I see :P

rocky vigil
#

can estimate the current ltc diff to be in the range of 7-9

#

but it will change a lot with speed optimization and mmap

stray reef
#

iirc i tried variations of this without success, but looking at it now I'm not sure i did it right... i'll put it on my todo list

rocky vigil
#

i wonder if there's any eta on mmap

#

once that is merged I'll rebase the basic ue to see how it affects the L1 scaling

violet badger
#

the branch is in a usable state..

#

but not in a mergeable state yet

twilit oriole
#

Since the additions in the latter net are much less sparse than full threat inputs

#

For context we are at 50B+ positions with L1 3072 and still not fully saturating full threat inputs

stray reef
rocky vigil
twilit oriole
#

So far I just checked increasing L2 size and regular piece output buckets, not much was going on there. The king buckets test will happen soon

rocky vigil
#

ah in monty?

twilit oriole
#

Ye

rocky vigil
#

surprising that increasing L2 isn't that good

twilit oriole
#

This is some short writeup on how threat inputs progressed in monty also (and what the performance is like there)

twilit oriole
#

Linrock didn't gain with output buckets either with full threats

#

In SF

rocky vigil
#

i thought it was like +5 elo or smth

#

idk my branch had them so...

#

single layer + small L1 maybe makes it different

twilit oriole
#

They don't actually work with threat inputs as far as I can tell

rocky vigil
#

tbh when the commit message says "what is going wrong" i might've borked smth

#

was there any later test

twilit oriole
#

No but there is a diff. If you screwed smth it is wrong in both sides

#

The diff is very simple

rocky vigil
#

nvm it's basically the end of the branch

rocky vigil
twilit oriole
#

Yeah but output buckets shouldn't really benefit much from that

rocky vigil
#

yeah i guess we actually should just not have them whoops

twilit oriole
#

I mean it's not the only thing there's probably a lot of small things to tweak. It just wasn't the focus

rocky vigil
#

if increasing L2 doesn't gain much then it might also be worth testing decreasing L2 to 8

#

or smth

twilit oriole
#

I put this summary of how threats progressed in monty

rocky vigil
#

although idk if it would screw with the nnz or anything

twilit oriole
#

Just to get an idea of the value of full threat inputs

#

They are great tbh, just too bad fast threat gen is so hard lol

rocky vigil
#

sorry what is being compared to a standard 768 -> 3072

twilit oriole
#

The 80624 -> 3072

#

Full threats Vs none at all

#

Is about 300 UHO (if you set midpoint anchors so you don't hit book limit)

rocky vigil
#

oh wait fixing the indexing bug was worth that much

#

interesting

twilit oriole
#

At least

#

It's still training

rocky vigil
#

is it not in monty main yet

twilit oriole
#

No. Since it is training lol

rocky vigil
#

i guess that's smth exciting to be looking forward to

twilit oriole
#

I took the +40 measurement midway through the run

rocky vigil
#

since it should be much better than the current value net right

twilit oriole
#

Yeah

rocky vigil
twilit oriole
#

Nah. He trained a much smaller net right

formal smelt
twilit oriole
#

Yeah

rocky vigil
#

wait i become monty contributor from this

#

lezgo

twilit oriole
#

🚀

rocky vigil
#

idk what the training time scaling laws are

twilit oriole
#

Linear with L1 size is what I found

#

Assuming same arch ofc

rocky vigil
#

hmm so if you are doing 4000 with 3072 then I guess 350 for 256 is good enough
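
The arithmetic behind that estimate, as a sketch. It reads "4000" and "350" as superbatch counts (an assumption) and applies the linear-in-L1 rule from above with nothing else in the arch changing:

```python
def scaled_superbatches(base_superbatches, base_l1, new_l1):
    """Scale a training length in direct proportion to L1 size."""
    return base_superbatches * new_l1 / base_l1


estimate = scaled_superbatches(4000, 3072, 256)
print(round(estimate))  # 333, so 350 leaves a small margin
```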

twilit oriole
#

MCTS has longer training because we take LR much lower

rocky vigil
#

btw if vondele can get within 10 elo to master net then resuming threat input training is feasible right

twilit oriole
#

I mean I thought about what about just temporarily shoving threat inputs into NNUE pytorch kek

#

But it doesn't solve the issue of not having fast threat gen

#

I don't even know how incremental threat gen works

rocky vigil
#

yeah i am decently convinced having acceptable speed requires a major change to sf position framework

#

anyways I don't really want to rebase until either that or mmap is worked out

#

but looks like it will be quite a wait

twilit oriole
#

Yeah. There's the plenty branch, if someone sends me some configs for bigger nets I can train that. We can simulate it with L1 3072 base Vs L1 1024 threats and Leela data for both or smth

rocky vigil
#

ah true

#

plenty is decently optimized (at least +50% including multilayer)

twilit oriole
#

I mean I think a L1 3072 base Vs L1 1024 threats in plenty will already work tbh without additional optimization

#

Like the threats will already be superior in that comparison

rocky vigil
#

yeah I am pretty convinced as well but somehow yoshie never found the speed / data to make it work selfgen

twilit oriole
#

Because his base net is L1 1536 and the threats have fixed overhead is what I think

#

Like in the 3072 Vs 1024 that's a 2048 delta already

#

SF is unique in that it's somehow managed to work out how to allow eval taking a large fraction of total time already

rocky vigil
#

fixed overhead is identical to increasing l1 by 512 I think (or, 1024 in my impl lmao)

#

personally am more concerned why L1 = 512 to L1 = 256 didn't work in stc (in fact, slower threat impl should make this more favorable to the larger net)

rocky vigil
#

so it needs to be redone eventually

twilit oriole
#

Yeah perhaps. But the Elo delta is very small regardless, output buckets usually yields more I thought

rocky vigil
#

yeah +10-20 is normal whereas here suggests it's +3 or smth

twilit oriole
rocky vigil
twilit oriole
#

There's no overhead then really

#

It's segmented like this

#

Counts were checked also to make sure it never gets too low

#

So all buckets get trained properly for sure

#

Just waiting on some new montytrain operations impls to do it

rocky vigil
#

ah interesting

rocky vigil
twilit oriole
#

The NN inference might be slow also

rocky vigil
#

true I autovec'd it

twilit oriole
#

Which would have close to twice the impact at double L1

rocky vigil
#

the biggest impact is probably in the screlu affine

#

there is probably also some nontrivial gain from fusing addsub

#

that's smth that someone else needs to do

#

unless we get sf arch to work in bullet anyhow and I can go back to the already written code

twilit oriole
#

If you have the config would be useful

twilit oriole
#

Perfect thx that's very useful

rocky vigil
twilit oriole
#

Currently around 4 days on a 4090

rocky vigil
#

Oh that’s quite long…

violet badger
#

actually not too different from a SF master net on H100.

rocky vigil
#

Actually is a 4090 more effective

#

Since vram not a concern

#

Might depend on dataloader speed as well

twilit oriole
#

No

#

vram bandwidth is definitely a concern

rocky vigil
#

Huh

#

Interesting

#

How effective is a 5090 vs a 4090 then

#

Since the 5090 is supposed to have way more bandwidth

twilit oriole
#

A lot more. Depends on your exact arch and how sparse it is

stray reef
#

@rocky vigil yukari is not multilayer

#

i'm gonna try a net of yukaris arch rq (training for 1 superbatch) to compare speeds

rocky vigil
#

Alright

rocky vigil
#

But my single layer speed sucks as well

#

Because I thought the progression was going to be fixed nodes then speed optimization

stray reef
#

Wait what the hell?

I think bench speeds aren't that comparable because of different positions, but even from startpos, yukari gets
3.7M nps during a 10s search, plenty gets
2.3M nps during a 10s search...

#

granted, this plenty arch still has king buckets. let me get rid of those rq

#

i need to take a deep dive in yukari again it seems...

#

yeah it's not that much better without king buckets

rocky vigil
#

Wait how bad is the king bucket slowdown again?

stray reef
#

10%-ish it seems

rocky vigil
#

Hmm it seems Yukari maybe counts moves that get see pruned/lmp/whatever

#

Idk though

#

Rust is not my specialty

stray reef
#

ah good point. lemme change that

#

ah! :P 1.7M nps now. plenty is faster
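
A toy illustration of why the two nps figures were not comparable. This is not yukari's or plenty's actual search code, just the two counting conventions side by side on a made-up move list:

```python
def count_nodes(moves, count_pruned):
    """Walk one move list and return the node count under either convention.

    `moves` is a list of booleans: True means the move was pruned
    (SEE pruning, LMP, ...) before it was ever made on the board.
    """
    nodes = 0
    for pruned in moves:
        if pruned:
            if count_pruned:
                nodes += 1  # inflated convention: skipped moves still count
            continue
        nodes += 1  # the move was actually searched
    return nodes


moves = [False, False, True, True, True, False, True, True]
searched = count_nodes(moves, count_pruned=False)
inflated = count_nodes(moves, count_pruned=True)
print(searched, inflated)
```

Same work done, very different node totals, so the reported nps depends heavily on which convention the engine uses.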

rocky vigil
#

Oh interesting

twilit oriole
#

Yeah we already know about that issue where the Nps isnt comparable

rocky vigil
#

Ok so that inflated nps quite a bit

twilit oriole
#

Since the counting is different

rocky vigil
#

Ngl wasn’t aware

#

Am attempting to compile Plentychess on new laptop

stray reef
stray reef
#

though the net there isn't being downloaded correctly, the one you have should work out of the box

rocky vigil
#

why does it never work

#

is the net not processed correctly

stray reef
#

let me check the branch rq

twilit oriole
#

Did the experiment to add more data to training do anything?

#

For monty master net i measured king buckets + factoriser was -20 and L2 16 to 128 was +25. fixed nodes elo

#

those are the final values

rocky vigil
#

wait what

#

king buckets lost elo

#

fixed nodes?

#

in monty?

twilit oriole
#

yes

stray reef
rocky vigil
#

very very strange

twilit oriole
#

i would do but your format is too big :p

stray reef
#

yes ik it's bad

#

i could also do it myself, it's pretty quick

rocky vigil
#

considering it should be a strict generalization

#

of the threats + psq

twilit oriole
#

the king buckets was on just the psq

rocky vigil
#

yeah

#

but it's still more representative power

stray reef
twilit oriole
#

threats is a more useful signal

rocky vigil
#

hmmm

#

yeah maybe it requires like more fancy training setups

#

like start with psq

#

and then do king buckets on much lower lr

stray reef
rocky vigil
#

vs main

twilit oriole
#

well the loss came in lower. it did fit the data better. its just king position isnt that useful for our net so it confused it or whatever

stray reef
#

is L1=384 multilayer fine? that would be easiest for me to push rn

rocky vigil
#

sure

#

yeah

#

that's nice

#

huh

#

i guess less loss really doesn't mean better net 💀

twilit oriole
#

well it is L1=3072. could be different for smaller L1s

#

our net is too clever even output buckets and stuff are rubbish

rocky vigil
#

ok nice

#
```
Nodes searched  : 2104883
Nodes/second    : 1390279
```
nice it works
#

lemme get plentychess main as well

#
```
Nodes searched  : 1855539
Nodes/second    : 1511025
```
#
  • main
#

so maybe 384 threat inputs competitive with 1792 standard

stray reef
#

it's really strange, the speed of the 384 and 256 seem to be almost the same

rocky vigil
#

do you get similar results?

stray reef
#

yeah more or less, faster overall but similar speed loss

rocky vigil
#

i mean it's good if scaling L1 incurs less speed loss :p

stray reef
#

i'm just gonna train a full L1=384 net

rocky vigil
#

wdym by "full" sorry

stray reef
#

like the full training schedule, not just a few SBs for fun

rocky vigil
#

ohhhh

#

yeah good idea

#

btw do you know how L1=3072 would compare with 1792

#

i assume something on the order of -30%

stray reef
stray reef
twilit oriole
#

I did recommend scaling L1 lol. We found same in monty, scaling L1 of the full threat net has less speed loss than expected

rocky vigil
stray reef
#

okay then hopefully this will yield good results...

rocky vigil
#

how close is vondele to master?

#

might not be bad to give it a try again

stray reef
#

i saw some -8 i think?

rocky vigil
#

wait that's really good

twilit oriole
#

Yeah but SPSA gives 8

#

So he is about pre SPSA net level

stray reef
#

#nnue-dev message

#

this is all i read

rocky vigil
#

yeah so he can replicate the tech pretty much

#

so maybe new sf nets are back on the menu

twilit oriole
#

Yeah need to shove threat inputs into NNUE pytorch

rocky vigil
#

which would be easier

#

unironically

stray reef
#

need to shove bullet nets into SF :P

twilit oriole
rocky vigil
#

shoving bullet nets in sf just went wrong when we tried so idk...

stray reef
#

yeah ik

#

long term it would be so much better

twilit oriole
#

Yeah but it failed after many months

#

So best is just work with NNUE pytorch I think

rocky vigil
#

idk how nnue-pytorch works, is it as simple as defining new features.py

twilit oriole
#

Also Bruno tried with a simple single layer to get bullet net on par with NNUE pytorch and failed by some 40 Elo

#

So there's that aspect also

rocky vigil
#

single layer surely would be -20 elo

#

but -40 elo is anomalous

stray reef
#

with leela data all that filtering is worth a lot ofc

twilit oriole
#

No I mean both single

rocky vigil
#

oh

twilit oriole
#

And using Leela data

rocky vigil
#

wait

twilit oriole
#

Etc etc

rocky vigil
#

strange

twilit oriole
#

It was made to test if bullet nets are on par with NNUE pytorch

#

So everything was constant if it could be

#

Only trainer change was the idea

#

Anyways it failed terribly lmao

#

And nobody knows why

rocky vigil
twilit oriole
#

Yeah it's easy. Speed might be shit but oh well

rocky vigil
#

surely vondele H200 or whatever cancels out the effect

twilit oriole
#

I mean it's easy now. Shove the threat inputs into features.py, train a L1 1024 net using same schedule as SF master net, yoink the plenty threat UE stuff

#

That's all the steps

#

You can probably just ask vondele to do all the training even. Since he already did it once

rocky vigil
#

yeah gimme a bit to figure out how nnue-pytorch works

twilit oriole
#

There's where the features actually are

rocky vigil
#

oh cmon

#

you mean this python stuff is like

#

red herring

#

bruh

twilit oriole
#

So you would be adding it there. It's actually easier since it is c++ already lol

rocky vigil
#

that's true

rocky vigil
#

yeah that's very good

violet badger
#

and I'll share the one-liner needed to train that this weekend.

lofty cedar
#

Only 10 elo? We're getting pretty close!

#

Wait... how much of the net is training vs post-training SPSA?

#

I mean... there already is a significant possibility that with enough SPSA tune and search tune tailored for this net, it could be even vs master.

#

The question, however, is whether or not we should do that right now.

torn lagoon
lofty cedar
#

But I mean tuning and all takes a lot of resources, and once you do that, you're kinda partially locked in.

#

So, it's probably better to train the net as best as you can first before tuning.

twilit oriole
lofty cedar
#

Oh, I see. Thanks.

violet badger
#

also, that's both the master and small net combined, and at LTC.

lofty cedar
#

Do we start tuning?

#

Or do we have some more training to do first?

frosty imp
#

Hopefully we won’t need tuning

rocky vigil
stray reef
# rocky vigil how did the experiment go (if it's concluded)?
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: 12.32 +/- 4.51, nElo: 18.87 +/- 6.90
LOS: 100.00 %, DrawRatio: 43.56 %, PairsRatio: 1.19
Games: 9734, Wins: 3126, Losses: 2781, Draws: 3827, Points: 5039.5 (51.77 %)
Ptnml(0-2): [183, 1072, 2120, 1201, 291], WL/DD Ratio: 1.73
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (5+0.05, 1t, 16MB, UHO_4060_v2.epd):
Elo: -17.03 +/- 5.12, nElo: -32.84 +/- 9.85
LOS: 0.00 %, DrawRatio: 50.36 %, PairsRatio: 0.68
Games: 4778, Wins: 1086, Losses: 1320, Draws: 2372, Points: 2272.0 (47.55 %)
Ptnml(0-2): [23, 684, 1203, 462, 17], WL/DD Ratio: 0.96
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (30+0.3, 1t, 64MB, UHO_4060_v2.epd):
Elo: -10.10 +/- 5.22, nElo: -20.93 +/- 10.82
LOS: 0.01 %, DrawRatio: 54.52 %, PairsRatio: 0.78
Games: 3958, Wins: 932, Losses: 1047, Draws: 1979, Points: 1921.5 (48.55 %)
Ptnml(0-2): [5, 502, 1079, 389, 4], WL/DD Ratio: 0.98
--------------------------------------------------
Results of Threats-384-0120rrr vs Main-0119rr (5+0.05, 12t, 192MB, UHO_4060_v2.epd):
Elo: -2.42 +/- 5.16, nElo: -5.04 +/- 10.74
LOS: 17.90 %, DrawRatio: 55.02 %, PairsRatio: 0.94
Games: 4020, Wins: 984, Losses: 1012, Draws: 2024, Points: 1996.0 (49.65 %)
Ptnml(0-2): [4, 462, 1106, 434, 4], WL/DD Ratio: 0.96
--------------------------------------------------
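
For reference, a sketch of how the Elo line can be rebuilt from the `Ptnml` counts. It assumes the usual logistic Elo formula and a 95% interval taken from the game-pair variance; that is a guess about what the test tool does, but it lands close to the 12.32 +/- 4.51 of the fixed-nodes run above:

```python
import math


def elo_from_score(score):
    """Logistic Elo difference for an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def pentanomial_elo(ptnml, z=1.96):
    """Elo and error margin from game-pair counts [0, 0.5, 1, 1.5, 2 points]."""
    pairs = sum(ptnml)
    per_game = [0.0, 0.25, 0.5, 0.75, 1.0]  # pair outcome as a per-game score
    mean = sum(n * s for n, s in zip(ptnml, per_game)) / pairs
    var = sum(n * (s - mean) ** 2 for n, s in zip(ptnml, per_game)) / pairs
    half_width = z * math.sqrt(var / pairs)  # interval on the mean score
    low = elo_from_score(mean - half_width)
    high = elo_from_score(mean + half_width)
    return elo_from_score(mean), (high - low) / 2.0


elo, err = pentanomial_elo([183, 1072, 2120, 1201, 291])
print(f"{elo:.2f} +/- {err:.2f}")
```

nElo and LOS are not reproduced here.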
#

it is crazy close...

#

and in fact, i forgot to rebase on main, so it's missing a couple search gainers

rocky vigil
#

wait it actually scales better ??

stray reef
#

probably due to the slowdown, but yes!

rocky vigil
#

unexpected

#

interesting

stray reef
#

wondering if i should try L1=512 next

rocky vigil
#

stc smp result is also interesting

#

do you know what could be going on with that

stray reef
#

it just hints at better scaling imo

#

i ran this test because

  • it's a higher TC than 30+0.3
  • i ran it with concurrency 1, so there's no memory bottleneck from multiple processes (i do have verbatim but eh, it's closer to tournament conditions this way)
#

the latter may be part of the good performance

desert tree
#

its also within error of being neutral scaling

twilit oriole
#

hm its not lol

stray reef
#

not if you include the smp test

desert tree
#

oh i missed one test

#

woops yeah

twilit oriole
#

do u want a green. u can send rebased 16 thread 10+0.1 and i put worker on it

stray reef
#

not yet. i want it to pass under my normal (V)LTC conditions

twilit oriole
#

hm ok. smp stc seems just as valuable to me tbh

stray reef
#

yes ofc, if i would care less about the spcc performance i'd do it

#

this will definitely work at ccc/tcec, that's for sure

#

and with some tweaking under shorter conditions as well

rocky vigil
stray reef
#

gonna train the first stage i think, and then compare at fixed nodes & stc

twilit oriole
#

So if it is working here imagine what it would be like in SF with L1 1024...

#

I think these results indicate it would be both stronger at fixed nodes and faster lol

stray reef
#

i can train 1 SB of an L1=1024 and compare speeds with SF master

stray reef
#

Plenty with L1=1024: 1.2M nps
SF Master: 1.5M nps

#

(single core bench, ao5)

stray reef
#

oh, and i forgot to tell, this is without pairwise, at this size it probably makes sense to use it again

rocky vigil
stray reef
#

it might well be stronger at fixed nodes

#

so yeah

#
  • pairwise speedup
rocky vigil
#

yep it's looking exciting

#

the fun part is always dreaming

twilit oriole
#

Yeah I was assuming pairwise

violet badger
#

FYI, we now have a fully described pipeline to train the SF net, to near master strength #nnue-dev message ... I hope we can use that to facilitate developing and testing some of the ideas discussed here, and e.g. compare bullet to nnue-pytorch.

stray reef
#

Results of stage 1 of 4 of the L1=512 net against stage 1 of 4 of the L1=384 net

--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (20000 nodes, 1t, 16MB, UHO_4060_v2.epd):
Elo: 18.54 +/- 7.02, nElo: 28.49 +/- 10.77
LOS: 100.00 %, DrawRatio: 42.89 %, PairsRatio: 1.32
Games: 3996, Wins: 1274, Losses: 1061, Draws: 1661, Points: 2104.5 (52.67 %)
Ptnml(0-2): [67, 425, 857, 526, 123], WL/DD Ratio: 1.41
--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (5+0.05, 1t, 16MB, UHO_4060_v2.epd):
Elo: -1.91 +/- 7.62, nElo: -3.67 +/- 14.59
LOS: 31.12 %, DrawRatio: 50.41 %, PairsRatio: 0.97
Games: 2178, Wins: 534, Losses: 546, Draws: 1098, Points: 1083.0 (49.72 %)
Ptnml(0-2): [11, 263, 549, 259, 7], WL/DD Ratio: 0.91
--------------------------------------------------

running LTC over night while continuing training. if it's ready in time i'll send it to tcec, else i'll send L1=384

rocky vigil
#

when is the tcec deadline?

#

i am lil busy now but will try to make progress on nnue pytorch etc. over this weekend

stray reef
#

Updates will be run when the current bonus is over, so in ~21h. Unfortunately that means I'll have to send the smaller net, unless it's extended last minute

rocky vigil
#

ah

#

that's a shame

#

how did the LTC go?

#

if it's done

desert tree
stray reef
#
--------------------------------------------------
Results of Threats-0122 vs Threats-0120 (30+0.3, 1t, 64MB, UHO_4060_v2.epd):
Elo: 1.02 +/- 3.41, nElo: 2.11 +/- 7.09
LOS: 72.05 %, DrawRatio: 54.27 %, PairsRatio: 1.02
Games: 9220, Wins: 2256, Losses: 2229, Draws: 4735, Points: 4623.5 (50.15 %)
Ptnml(0-2): [3, 1039, 2502, 1060, 6], WL/DD Ratio: 0.90
--------------------------------------------------
stray reef
desert tree
#

oof

#

cant help then

#

rip

#

maybe you can ask for an extension since this is Really Cool™?

stray reef
#

can try :P

desert tree
#

🙏

rocky vigil
#

yeah within error margin oh well

#

i guess like this is first checkpoint only out of 4

stray reef
#

512 over 384 is an instamerge tbh

desert tree
#

if u need hw to test it ill gladly help on that front

#

but itd have to be on OB

stray reef
#

yes fs, ty, will let you know when it's ready for ob

#

the net is gigantic (184MB), i'll have to implement leb compression
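
A sketch of the idea, assuming "leb" means signed LEB128 (the variable-length integer encoding). This is not plenty's implementation and the function names are made up; it only shows why small quantised weights shrink:

```python
def encode_sleb128(value):
    """Encode one signed integer as LEB128 bytes (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, keeps the sign
        sign_bit_set = bool(byte & 0x40)
        if (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set):
            out.append(byte)  # last byte: continuation bit left clear
            return bytes(out)
        out.append(byte | 0x80)  # more bytes follow


def decode_sleb128(data, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:  # sign-extend negative values
                result -= 1 << shift
            return result, pos


weights = [0, -1, 3, -64, 63, 64, -300, 1234]
blob = b"".join(encode_sleb128(w) for w in weights)
decoded, pos = [], 0
while pos < len(blob):
    value, pos = decode_sleb128(blob, pos)
    decoded.append(value)
print(len(blob), "bytes instead of", 2 * len(weights))
```

Weights in [-64, 63] take one byte instead of two, which is where the saving comes from when most weights are small.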

desert tree
#

alright

twilit oriole
#

I already offered testing HW so it's not an issue kek. He wants it to be within his normal conditions

desert tree
#

ah okok

rocky vigil
#

yeah note that the FT size is like

desert tree
#

gigantic ik

rocky vigil
#

4x that of 32 bucket

stray reef
#

well some 32 or 64 thread data would also be nice for tcec. but to merge into main it needs to pass at least VLTC

twilit oriole
#

Binary size isn't too bad

rocky vigil
#

true

#

is probably necessary to uh

#

get past the 128 MB limit or whatever

#

on fishtest

twilit oriole
#

It's easy to bypass that limit. I do it all the time

rocky vigil
#

oh for montytest right

twilit oriole
#

With an L1 1024 for SF. Binary size will actually go down. So like I said before this isn't a real issue

stray reef
#

#503163384875974656 message

stray reef
#

Compression has been implemented, so network downloads are now a lot smaller. That means we are OB-ready
@twilit oriole @desert tree Would you be interested in a high-concurrency L1=384 test (the one being sent to TCEC) against main, or do you prefer waiting for the larger L1 to be done?

stray reef
desert tree
#

ah yeah nw

#

i have no roles in the tcec disc

green moat
#

Deadline passed so I guess it is now the end of Altsufi Kibitzer Bonus

desert tree
#

if u get the required extension of course

#

which i rly hope u do

stray reef
#

i think the rules are pretty clear unfortunately

desert tree
#

oof

stray reef
desert tree
#

welp

stray reef
#

384 is gonna play at least as well as main so

twilit oriole
#

i mean can just do both ngl

stray reef
stray reef
#

L1=512 is up on furybench now. First running fixed nodes & STC against L1=384, to confirm the results of the tests I ran of the first stage.
Then I'll run some tests against main, including SMP, though I'm not sure yet what conditions are best

stray reef
#

I'm thinking something like 8th 60+0.6, potentially more threads and less TC

desert tree
#

sgtm

#

0.8 mnps / core seems reasonable right

#

for zen4

stray reef
#

yeah lgtm (and thanks a lot!)

#

@twilit oriole also paging you

desert tree
#

(not using smt fwiw)

stray reef
#

mh actually, the fact that you're using 16 cute chess sockets might be biasing the test a little in favor of the smaller net. but not sure if this is significant, just something to potentially keep in mind

desert tree
#

8?

#

and why would it favor either net?

stray reef
#

ah nevermind

#

i was for some reason imagining that verbatim nets don't work between cutechess instances

#

which is of course wrong

desert tree
#

i think nets should be shared the same way regardless of what number of cutechess instances is running

#

ah ok

#

ill just leave it as is

#

lmk if theres any issue

stray reef
#

the STC (https://furybench.com/test/3001/, which is carried by your worker) is definitely producing worse results than what I ran locally after the first training stage (#1336647760388034610 message)

#

i don't know if such a large machine still has more problems with memory contention, even with verbatim nets?

#

given that the fixed nodes test is similar, speed seems to be the main thing that could cause this

desert tree
#

in terms of memory contention this should be close-ish to tcec conditions

#

cause its 2 sockets with 128c each

#

idk how many memory channels

#

ill check after what memory speeds im getting

stray reef
#

ngl i'm gonna repeat this test without your worker. -23.84 +- 2.77 vs -1.91 +/- 7.62 is way too big of a difference.
for now i'll let it run the SMP test
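
A rough way to put a number on "way too big", assuming both margins are 95% half-widths and the two runs are independent (both assumptions). The nets are also not identical, since the local run used the first training stage, so this is only a sanity check:

```python
import math


def z_of_difference(elo_a, err_a, elo_b, err_b, z95=1.96):
    """z-score for the gap between two independent Elo estimates.

    err_a and err_b are taken to be 95% half-widths, so dividing by 1.96
    turns them back into standard errors.
    """
    sigma = math.hypot(err_a / z95, err_b / z95)
    return abs(elo_a - elo_b) / sigma


z = z_of_difference(-23.84, 2.77, -1.91, 7.62)
print(round(z, 1))
```

Anything far beyond 2 to 3 is hard to blame on noise alone.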

#

uhm @desert tree your worker now has 0.11M nps, that's a bit strange

desert tree
#

wtf

#

its consistent too what the hell

#

yeah somethings wrong with it

#

i sure hope it didnt poison the other result

stray reef
#

it probably did, but it was only that one STC, i'll just re-run it

#

can't really poison fixed nodes, and hasn't played any SMP games yet

desert tree
#

and it disconnected

#

i think the host fucked something up

#

it went completely offline now

stray reef
#

oh damn

desert tree
#

ill see if i can get another worker

#

best i can find are zen3 workers

#

which wont be representative for TCEC

stray reef
#

that's fine, we'll wait with the SMP test then

desert tree
#

alr

#

nah its quick no waiting

#

few minutes at most

#

?

stray reef
#

you mean finding another worker is quick?

desert tree
#

ah i see lol

#

yeah

#

im sure if u ask styx he'll help with this, too

stray reef
#

true, @split warren i'm running some threat input tests on OB right now, mind helping out with the SMP LTC test?

desert tree
#

2x 7Y83, should be up in a sec

stray reef
#

cool

#

yeah STC is looking a lot nicer now

desert tree
#

nice

split warren
#

I am scrambling with the baby atm, I will come back and do my best

twilit oriole
#

Mine will be on within an hour

#

(2x EPYC 9654)

twilit oriole
#

@stray reef it is on

stray reef
#

awesome tysm

#

Hm i think @desert tree your worker is still giving different results. maybe it's just due to high concurrency. but if you look at the finished STC https://furybench.com/test/3003/ and the currently running LTC (-16.81 +- 6.22) https://furybench.com/test/3004/ and look at the individual elo of the worker (-26.92 +- 12.32)... it doesn't seem right

desert tree
#

damn

#

i can take it off

#

idk what im doing wrong

#

:(

stray reef
#

looking at the bench numbers, it matches the small workers (75% speed of main roughly)

#

i think it's not your fault this time, it must be due to concurrency

desert tree
#

alright

#

i set concurrency to equal physical core count

#

aka no smt

stray reef
#

yeah idk maybe there is still some effect we aren't thinking about rn. i'm not knowledgeable enough in that regard

twilit oriole
#

my worker loving the threat net kek

desert tree
#

im thinking its just sss

#

if youd prefer i can turn it off

twilit oriole
#

hm

#

well idk maybe my worker will hate it at ltc also

stray reef
#

i think i want an LTC with just the small workers. it seems too far off

desert tree
#

fairs

#

ill kill it for now

#

lmk if u want it back

twilit oriole
#

nah keep it

#

kill the ltc for now

#

cos my worker is also there

stray reef
#

ok i'll let everything do the SMP test then

twilit oriole
#

this stuff is due to threads of test : threads of worker ratio i observed before. if u want favourable results especially on larger worker u should keep STC and crank the threads

stray reef
#

so you're saying i should be running something like... 8+0.08 32th?

twilit oriole
#

yep

stray reef
#

alright

desert tree
#

so ill put it back up then

stray reef
#

unfortunately there is no good way to prevent one of the big workers from jumping back to the LTC

#

i'll try my best by starting/stopping it if it happens

twilit oriole
#

the workers already started diverging on the SMP

twilit oriole
#

and instantly lost kek

#

tbh i can just take my pgns at the end and run results through elo tool or smth. simpler

stray reef
#

yeah can easily filter the few games out at the end

twilit oriole
#
```
  File "/home/neural/FuryBench/Client/worker.py", line 1282, in run_openbench_worker
    if config.workload: complete_workload(config)
  File "/home/neural/FuryBench/Client/worker.py", line 1023, in complete_workload
    rr.send_errors(timestamp, cutechess_cnt)
  File "/home/neural/FuryBench/Client/worker.py", line 698, in send_errors
    for header, moves in PGNHelper.slice_pgn_file(fname):
  File "/home/neural/FuryBench/Client/worker.py", line 567, in slice_pgn_file
    raise utils.OpenBenchMisssingPGNException(reason)
utils.OpenBenchMisssingPGNException: Unable to find PGNs/3007.35002.1758483719.0.pgn. Cutechess exited with no finished games.
```
Lol u somehow managed to error the worker stopping that task
stray reef
twilit oriole
#

started again

twilit oriole
#

How much did the extra data help btw. Was there a measurement old 384 to new 384

stray reef
#

i don't remember if i trained a 384 net before. but i don't think so

#

the best thing would be to simply add some more data now and see

rocky vigil
#

the new LTC definitely looks better but won't be positive yet it seems

twilit oriole
#

Yeah. More data + More L1 (768) and maybe pairwise

#

Might do it

rocky vigil
#

has pairwise been tested with L1=512 (I recall it was tested at 256 and was negative, maybe?)

#

this seems maybe logical next step

#

although my suspicion is that the time taken for L1 -> L2 is not that big relative to the whole network

#

but who knows

#

actually yeah pairwise both halves L1 and doubles sparsity count
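
A small sketch of what that means, assuming the usual pairwise-multiplication layout where element i is multiplied with element i + L1/2 after clamping. The [0, 1] clamp range and the function name are illustrative, not taken from any engine here:

```python
def pairwise_activation(acc, lo=0.0, hi=1.0):
    """Clamp the accumulator and multiply its two halves elementwise."""
    half = len(acc) // 2

    def clamp(x):
        return max(lo, min(hi, x))

    return [clamp(acc[i]) * clamp(acc[i + half]) for i in range(half)]


acc = [0.5, -0.2, 1.5, 0.8, 0.4, 0.9, 0.0, 0.25]
out = pairwise_activation(acc)
print(len(acc), "->", len(out), out)
```

The next layer sees half as many inputs, and an output is zero whenever either partner clamps to zero: here 2 of 8 accumulator values are zero after clamping, but 2 of 4 outputs are.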

#

@stray reef do you have a bullet feature input set for (factorized) threat inputs + king buckets?
I am going to try (simplified threats + 2x768) -> 64 in shatranj because I think it'll actually finish training in a reasonable time with bullet-main single thread

#

I expect shatranj speed to be more favorable

#

since most of the pieces are leapers etc.

#

(rook is only slider)

#

and no special cases like castling/en passant

split warren
#

i would recommend connecting with -T 120 -N 8 for a 128c, the cutechess overhead is significant with that concurrency, and it does consume quite a bit of CPU

#

maybe that was the issue?

#

god damn it @stray reef , can u lower the prio of ur other test or something? I put my machine for the test and it picked the other one

twilit oriole
split warren
#

cool i will just go back to Reckless datagen then

twilit oriole
#

ok so i tried running plentychess datagen. it failed for some reason and then the focusing got ignored and it just started another task when i have explicitly inputted I do not want to run other tasks. so i had enough kek

#
/usr/bin/ld: src/fathom/src/tbprobe.o: relocation R_X86_64_32 against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: failed to set dynamic section sizes: bad value
#

Also i feel like running an ob worker shouldnt require u to be a dev and do debugging

stray reef
#

but it's not factorised yet, i'm still working on that / need to see if what i did produces reasonable results elo-wise

#

so yeah pairwise and factorisation need to be tested next, hopefully they'll gain 5-6 LTC elo

#

and since L1=512 gained even at STC over 384, it's a no-brainer to go even bigger once everything else is figured out imo

tender fractal
#

What is threat input exactly ?
Like, I understood that we put the threat in the input layer, but I haven't found anything on what are the threats

candid ivy
#

read the first messages of this thread

rocky vigil
#

alright I'm gonna try out simplified threats + 2 buckets for shatranj soon™ and see how it goes

rocky vigil
#

will be heavily reduced L1(=64) since 1) i only get to use single CPU thread on bullet main and 2) only 840M pos of data

#

vs main currently 2 buckets L1=512

rocky vigil
stray reef
#

yes

rocky vigil
#

ah interesting

#

is crelu multilayer or screlu multilayer better

stray reef
#

no idea tbh

#

that's worth testing if pairwise does not work

rocky vigil
#

wait also why do you just randomly discard 5% of data

#

is this like some tech

stray reef
#

it gained a handful elo

#

not with threat inputs explicitly, this is just my main training schedule

rocky vigil
#

bullet legacy user attempts to parse bullet main:

rocky vigil
#

or idk

stray reef
#

per datapoint

#

but ofc every epoch the discarded data will be different

#

otherwise it'd just be 5% less data which would be bad
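
The scheme described above, as a sketch in plain Python rather than bullet's dataloader. `epoch_subset` is a hypothetical helper, and seeding by epoch is just one way to get a different 5% each time:

```python
import random


def epoch_subset(dataset, epoch, keep_prob=0.95, seed=0):
    """Keep each datapoint independently with probability keep_prob.

    Seeding with the epoch number means a different 5% is dropped each
    epoch, so no position is lost for the whole run.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    return [x for x in dataset if rng.random() < keep_prob]


data = list(range(10_000))
first = epoch_subset(data, epoch=0)
second = epoch_subset(data, epoch=1)
print(len(first), len(second), first == second)
```

Across two epochs almost every position is seen at least once, which is the difference from simply training on 5% less data.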

rocky vigil
#

ah interesting

#
```
    for side in [Side::WHITE, Side::BLACK] {
        for piece in Piece::PAWN..=Piece::KING {
            let pc = 6 * side + piece - 2;
            map_bb(bbs[side] & bbs[piece], |sq| pieces[sq] = pc);
        }
    }
```
#

what does this code do?

#

in

#

map_features

#

because I just copied the old simplified threat input code

#

and used that instead

#

and that doesn't have this

stray reef
#

looks like it builds a mailbox from the bitboards
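
A self-contained sketch of that mailbox construction, assuming the bitboards are laid out as `[white, black, pawn, knight, bishop, rook, queen, king]` (so pawn is index 2, which matches the `piece - 2` in the snippet). `EMPTY` and `build_mailbox` are hypothetical names, and this `map_bb` is a stand-in for the helper used in the pasted code.

```rust
/// Calls `f` with the index of every set bit in `bb`, lowest first.
fn map_bb<F: FnMut(usize)>(mut bb: u64, mut f: F) {
    while bb != 0 {
        f(bb.trailing_zeros() as usize);
        bb &= bb - 1; // clear the lowest set bit
    }
}

/// Hypothetical marker for an empty square.
const EMPTY: usize = 12;

/// Builds a 64-entry mailbox: each square holds a coloured piece id
/// in 0..12 (white pawn = 0, ..., black king = 11) or `EMPTY`.
fn build_mailbox(bbs: &[u64; 8]) -> [usize; 64] {
    let mut pieces = [EMPTY; 64];
    for side in 0..2 {
        for piece in 2..=7 {
            let pc = 6 * side + piece - 2;
            map_bb(bbs[side] & bbs[piece], |sq| pieces[sq] = pc);
        }
    }
    pieces
}
```

With the mailbox in place, `pieces[dest]` gives the type of the threatened piece in O(1), which is what full threat inputs need.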

rocky vigil
#

ohhh right

#

for

#

full threats

stray reef
#

all this is mostly taken from the montytrain branch, and adapted as necessary fwiw

rocky vigil
#

ok I shouldn't need it then yea

#

```
    let occ = bbs[0] | bbs[1];

    for side in [Side::WHITE, Side::BLACK] {
        let side_offset = offsets::END * side;
        let opps = bbs[side ^ 1];

        for piece in Piece::PAWN..=Piece::KING {
            map_bb(bbs[side] & bbs[piece], |sq| {
                let threats = match piece {
                    Piece::PAWN => Attacks::pawn(sq, side),
                    Piece::KNIGHT => Attacks::knight(sq),
                    Piece::BISHOP => Attacks::bishop(sq, occ),
                    Piece::ROOK => Attacks::rook(sq, occ),
                    Piece::QUEEN => Attacks::queen(sq, occ),
                    Piece::KING => Attacks::king(sq),
                    _ => unreachable!(),
                } & occ;

                count += 1;
                map_bb(threats, |dest| {
                    let enemy = (1 << dest) & opps > 0;
                    if let Some(idx) = map_piece_threat(piece, sq, dest, pieces[dest], enemy) {
                        f(side_offset + idx);
                        count += 1;
                    }
                });
            });
        }
    }
```
wait where's the psq feature in this then
stray reef
#

pieces[dest] is passed to map_piece_threat? not sure, i'm on mobile rn

rocky vigil
#

hangon lemme attempt to figure this stuff out

#

```
    let occ = bbs[0] | bbs[1];

    for side in [Side::WHITE, Side::BLACK] {
        let side_offset = offsets::END * side;
        let opps = bbs[side ^ 1];

        for piece in Piece::PAWN..=Piece::KING {
            map_bb(bbs[side] & bbs[piece], |sq| {
                let threats = match piece {
                    Piece::PAWN => Attacks::pawn(sq, side),
                    Piece::KNIGHT => Attacks::knight(sq),
                    Piece::BISHOP => Attacks::bishop(sq),
                    Piece::ROOK => Attacks::rook(sq, occ),
                    Piece::QUEEN => Attacks::queen(sq),
                    Piece::KING => Attacks::king(sq),
                    _ => unreachable!(),
                } & occ;

                f(TOTAL_THREATS + [0, 384][side] + 64 * (piece - 2) + sq);
                count += 1;
                map_bb(threats, |dest| {
                    let enemy = (1 << dest) & opps > 0;
                    if let Some(idx) = map_piece_threat(piece, sq, dest, enemy) {
                        f(side_offset + idx);
                        count += 1;
                    }
                });
            });
        }
    }
```
this is what montytrain simplified threat inputs has
stray reef
#

yes it's not needed here as simple threat inputs only distinguish if the threatened piece is an enemy or not, it doesn't care about the type

rocky vigil
#

yeah but this has a f([0, 384][side] + 64 * (piece - 2) + sq) that corresponds to the psq feature

#

idk where the other code has that
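
For reference, a small sketch of the psq index from that expression, assuming `side` is 0 or 1 and `piece` runs from 2 (pawn) to 7 (king) as in the loops above. `psq_feature` is a hypothetical name; in the montytrain snippet the result is additionally offset by `TOTAL_THREATS`.

```rust
/// Plain 768-style piece-square index: 384 slots per side,
/// 64 slots per piece type, one slot per square.
fn psq_feature(side: usize, piece: usize, sq: usize) -> usize {
    [0, 384][side] + 64 * (piece - 2) + sq
}
```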

stray reef
#

oh that's what you mean. sorry

#

i moved that into the main method i think, as this input type has factorised king buckets, there must be something like Chess768::map_features

rocky vigil
#

ah this?

    fn map_features<F: FnMut(usize, usize)>(&self, pos: &Self::RequiredDataType, mut f: F) {
        let get = |ksq| (if ksq % 8 > 3 { 7 } else { 0 }, 768 * self.buckets[usize::from(ksq)]);
        let (stm_flip, stm_bucket) = get(pos.our_ksq());
        let (ntm_flip, ntm_bucket) = get(pos.opp_ksq());

        Chess768.map_features(pos, |stm, ntm| {
            let bucketed_offset = 768 + TOTAL_THREATS;
            f(bucketed_offset + stm_bucket + (stm ^ stm_flip), bucketed_offset + ntm_bucket + (ntm ^ ntm_flip)); // bucketed feature
            f(stm ^ stm_flip, ntm ^ ntm_flip) // factorised feature
        });

stray reef
#

yes exactly

rocky vigil
stray reef
#

you can do it either way, but imo for factorised king buckets this makes it a lot easier

#

you need to have it somewhere, and exactly once
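
A rough sketch of the index arithmetic in that `map_features`, for one perspective only. It assumes the input layout is `[factorised psq | threats | bucketed psq]`, which is what `bucketed_offset = 768 + TOTAL_THREATS` suggests. The value of `TOTAL_THREATS` below is a made-up placeholder, and `king_info` / `psq_indices` are hypothetical names.

```rust
/// Placeholder only: the real threat feature count is not given here.
const TOTAL_THREATS: usize = 1000;

/// Returns (horizontal flip mask, bucket offset) for a king square.
/// Kings on files e-h are mirrored onto files a-d via `sq ^ 7`.
fn king_info(ksq: usize, buckets: &[usize; 64]) -> (usize, usize) {
    let flip = if ksq % 8 > 3 { 7 } else { 0 };
    (flip, 768 * buckets[ksq])
}

/// Each 768-style feature is emitted twice: once in the king-bucketed
/// block and once in the shared factoriser block.
fn psq_indices(feat: usize, ksq: usize, buckets: &[usize; 64]) -> [usize; 2] {
    let (flip, bucket) = king_info(ksq, buckets);
    let bucketed_offset = 768 + TOTAL_THREATS;
    [bucketed_offset + bucket + (feat ^ flip), feat ^ flip]
}
```

Because both psq features come from this one place, the threat loop itself must not emit a psq feature again, which is the "exactly once" point.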

rocky vigil
#

ok

#

so I'm just gonna go with whatever your code has in main method

#

meaning I should remove that from the map_features function I think

#

also what is
```
impl ThreatInputsBucketsMirrored {
    pub fn new(buckets: [usize; 32]) -> Self {
        let num_buckets = get_num_buckets(&buckets);

        let mut expanded = [0; 64];
        for (idx, elem) in expanded.iter_mut().enumerate() {
            *elem = buckets[(idx / 8) * 4 + [0, 1, 2, 3, 3, 2, 1, 0][idx % 8]];
        }

        Self { buckets: expanded, num_buckets }
    }
}
```

stray reef
#

just some code that mirrors the bucket layout from a 32 element array into a 64 element array
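
A standalone sketch of just the expansion step, with the same indexing as the constructor above. `expand_mirrored` is a hypothetical free function for illustration; the 32-entry input describes files a-d of each rank.

```rust
/// Expands a half-board bucket layout (8 ranks x 4 files) into a full
/// 64-square table by mirroring files e-h onto d-a.
fn expand_mirrored(buckets: [usize; 32]) -> [usize; 64] {
    let mut expanded = [0; 64];
    for (idx, elem) in expanded.iter_mut().enumerate() {
        let rank = idx / 8;
        let file = idx % 8;
        *elem = buckets[rank * 4 + [0, 1, 2, 3, 3, 2, 1, 0][file]];
    }
    expanded
}
```

So a1 and h1 share a bucket, as do d1 and e1, which fits the horizontal king flip used in `map_features`.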

rocky vigil
#

oh i see

#

ok the last thing I think I need to fiddle with is the settings

#

yay

#

how do I uh

#

run

#

bullet

#

with CPU backend?

stray reef
#

--features cpu maybe? idk

rocky vigil
#

nope that attempts to compile cudarc and fails

#

ah maybe --no-default-features

formal smelt
rocky vigil
#

i am using main latest

#

well

#

no gpu anyways

formal smelt
#

Like the error should be “bro disable default features”

rocky vigil
stray reef
#

ah yes i added the rand crate to the bullet_lib Cargo.toml iirc (forgive me jw :P)

#

idk about the mismatched types

rocky vigil
#

or am I blind

rocky vigil
#

ohhh

#

yeah idk about mismatched types either

formal smelt
#

It should be an &str

rocky vigil
#

mm hmm

#

well then

#

@formal smelt i assume this means it can't be done on cpu backend

formal smelt
#

Yes

rocky vigil
#

what a shame

#

time to ask kevlu to do it ig

#

@stray reef @formal smelt does it look good

twilit oriole
rocky vigil
#

or that

#

my parents might mald bout it tho

stray reef
#

How many threat updates could a single move cause at most? (preferably even split into add+sub counts) Has anyone put thought into this yet?

rocky vigil
#

At most 32 add (8 from the moving piece, 8 from uncovering sliders, 16 from attacks to the dest) and same for subtract

#

Actually if deduplication is taken into account that 32 is lower

#

Maybe 20

#

Because you get at most 4 from uncovering sliders

#

And the other 8+16 is reduced to 16
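
The arithmetic behind those two estimates, written out as constants. These are only the rough bounds quoted above (the deduplicated figure was given as "maybe 20"), not measured values, and all names are hypothetical.

```rust
// Naive upper bound on threat features added by one move.
const MOVING_PIECE: usize = 8;
const UNCOVERED_SLIDERS: usize = 8;
const ATTACKS_ON_DEST: usize = 16;
const NAIVE_ADDS: usize = MOVING_PIECE + UNCOVERED_SLIDERS + ATTACKS_ON_DEST;

// After deduplication: at most 4 from uncovered sliders, and the
// moving-piece and destination terms together shrink from 8 + 16 to 16.
const DEDUP_UNCOVERED_SLIDERS: usize = 4;
const DEDUP_PIECE_AND_DEST: usize = 16;
const DEDUP_ADDS: usize = DEDUP_UNCOVERED_SLIDERS + DEDUP_PIECE_AND_DEST;
```

The same bounds are assumed for subtractions, so an update buffer sized for `NAIVE_ADDS` adds and as many subs would be on the safe side under these estimates.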

rocky vigil
#

similar cost as a refresh

rocky vigil
#

we dreamed

#

for 2 moves

frosty imp
#

threat inputs weakness at king safety? Kappa

stray reef
#

gonna run a DFRC test of threat inputs actually, just out of curiosity

stray reef
#

yeah it's about the same strength diff to master as in normal chess

#

which means it scales well but doesn't actually play much better DFRC than clover and stormphrax

stray reef
#

Pairwise tests are now up on furybench (fixed nodes + STC)

rocky vigil
#

lmao neutral fixed nodes

#

and +4 stc

rocky vigil
#

Bench is anywhere from 0 to 10% faster

stray reef
#

Fixed nodes

Elo   | 0.03 +- 3.07 (95%)
Conf  | N=20000 Threads=1 Hash=16MB
Games | N: 20140 W: 5862 L: 5860 D: 8418
Penta | [439, 2372, 4427, 2412, 420]

https://furybench.com/test/3080/
STC

Elo   | 4.28 +- 2.48 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.90 (-2.25, 2.89) [0.00, 2.50]
Games | N: 19550 W: 4940 L: 4699 D: 9911
Penta | [50, 2200, 5042, 2425, 58]

https://furybench.com/test/3083/

Nice result, and should make it even easier to increase L1 for an LTC gain
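
As a sanity check, the Elo figures above can be reproduced from the W/L/D counts with the usual logistic formula. This is a sketch with a hypothetical function name; the error bars are not reproduced here, since they depend on the pentanomial counts.

```rust
/// Logistic Elo difference from win/loss/draw counts.
fn elo_from_wld(wins: u64, losses: u64, draws: u64) -> f64 {
    let games = (wins + losses + draws) as f64;
    let score = (wins as f64 + 0.5 * draws as f64) / games;
    -400.0 * (1.0 / score - 1.0).log10()
}
```

`elo_from_wld(5862, 5860, 8418)` comes out at roughly 0.03 and `elo_from_wld(4940, 4699, 9911)` at roughly 4.28, matching the two tests.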

#

i would have expected it to be a bit worse at fixed nodes, ngl

#

I am training stage 1 of the factorised threat features now, it's very slow, but it might be worth it

rocky vigil
#

Threat inputs is a little different

stray reef
#

not sure if there are any other results of people trying pairwise with such small L1s

#

(without threat inputs)

rocky vigil
#

If you have time could you try screlu multilayer as well

stray reef
#

now that i have pairwise, screlu is no longer usable due to quantisation (probably, haven't thought a lot about it)

rocky vigil
#

Ah

#

I meant screlu w/o pairwise

#

Maybe that still affects quantization though

stray reef
#

no, but it would be slower and scale worse with L1 size, i don't think that's worth trying, seeing this pairwise result

rocky vigil
#

Ok

#

Fair enough

rocky vigil
#

Seems like higher quantization is quite effective

#

That should put it only what 5 STC elo away?

stray reef
#

probably neutral at LTC

twilit oriole
#

Eh test that. Threat inputs are more resistant to quantisation. So much so that we now i8 quantise ours

stray reef
#

no i mean with gain it's now probably neutral to master at LTC

#

was expecting quantisation to scale linearly

naive comet
#

vltc gainer then?

rocky vigil
#

surprisingly 8192 works over 3072

#

in Monty at least

#

i guess the future for cpu mcts is just in big net

stray reef
stray reef
#

rip i have a bug in the training script. guess i'll test LTC again then

rocky vigil
#

oof

stray reef
#

Factorisation is now fully working. We'll have the results of stage 1 tomorrow

desert tree
#

🙏

stray reef
#

(I am factorising similarly to small threat inputs, except I also encode if the threatened/protected piece is of higher value)

#

it's still a ton of features. potentially i'll have to cut it down more

rocky vigil
#

Ah

#

Yeah it’s difficult to factor

#

It’s a subset of PP essentially

#

So the factorings are pretty much also just factorings of PP

stray reef
#

it would be possible (but more complicated probably) to use what chef tried in vine recently as a factoriser, e.g. [colored_piece][sq][sq_attacked][sq_defended]

#

aka. 768x4
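
One possible way to flatten `[colored_piece][sq][sq_attacked][sq_defended]` into a single index, assuming `sq_attacked` and `sq_defended` are booleans for the piece's own square. The ordering used in vine is not known from this thread, so treat the layout and the names as assumptions.

```rust
/// 12 coloured pieces x 64 squares x attacked? x defended?
const NUM_FACTORISER_FEATURES: usize = 12 * 64 * 2 * 2;

/// Row-major flattening of [colored_piece][sq][attacked][defended].
fn factoriser_index(colored_piece: usize, sq: usize, attacked: bool, defended: bool) -> usize {
    ((colored_piece * 64 + sq) * 2 + attacked as usize) * 2 + defended as usize
}
```

That gives 3072 features in total, i.e. the 768x4 mentioned above.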

desert tree
#

fwiw it seems to train pretty quickly

#

going from 600->800SBs is completely neutral

#

for our 1024hl net

twilit oriole
#

Depends on amount of data

desert tree
#

right yeah

#

with not that much data you dont need that many SBs
dont mind me forgetting basic stuff about training nets

stray reef
#

#1220867251763286207 message this would be potentially a big issue with that scheme though (though practically probably not)

rocky vigil
#

New LTC not looking terribly hot rn oof

twilit oriole
#

Looks fine to me. There is still scaling the L1 and adding more data left

rocky vigil
#

Yeah maybe neutral LTC was optimistic

#

btw

#

I got a threat inputs branch of nnue-pytorch up

#

On my fork

#

In case anyone with gpu wants to try and see if it works

#

I largely copied the existing impl and just changed the function calls etc. to match the library

#

So here’s hoping nothing goes terribly wrong

rocky vigil
#

notwithstanding the errors with nnue-pytorch, training seems to be quite fast