#UE Threat Inputs for AB


rocky vigil
#

I should’ve guessed

#

What is memmove avx unaligned

desert tree
#

memmove is memcpy but it works for overlapping regions
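To make the overlap point concrete, here is a minimal Rust sketch. It is not taken from the Stockfish branch: `shift_right` is a hypothetical helper, and it relies on the standard library's `slice::copy_within`, which has memmove-like semantics (source and destination ranges may overlap).

```rust
/// Shift the contents of `buf` right by `k` bytes in place.
/// The source range (0..n-k) and the destination (k..n) overlap,
/// which is the case memmove handles and memcpy does not.
fn shift_right(buf: &mut [u8], k: usize) {
    let n = buf.len();
    if k < n {
        // copy_within is allowed to copy between overlapping ranges
        buf.copy_within(0..n - k, k);
    }
}
```

The first `k` bytes are left untouched, so the result is the old prefix followed by the shifted data.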

rocky vigil
#

L1=1024 4 thread 397434

#

Anyways these speed losses shouldn’t be on the order of 60-100 STC elo

#

Welp

candid ivy
#

this one is from a speedtest profile, with the 1024 net

#

(ah but this already with some simd which nets an additional ~8%)

twilit oriole
#

Well it's not so simple. The speed characteristics depend on game phase, and a speed loss in each phase is likely worth a different amount of Elo

#

Only way here is to measure Elo really

#

Like a L1 1536 threat net is way slower in openings and way faster in endgames

#

Is that speed figure with separate SF instances running on each thread?

#

That's very important...

rocky vigil
#

Btw can we compare 1024 to master

#

It should get quite close in fixed nodes…

twilit oriole
#

I don't think threat inputs pass even in monty without shared net weights between the instances. It's quite possible that is the discrepancy

rocky vigil
#

It shouldn’t matter too much for 256 and simplified threats but if we move to full threats and make the net 3x as large as master then it will be an issue…

twilit oriole
#

Eh I think it definitely could matter still

#

Need to test

candid ivy
rocky vigil
#

Well I think proper simd should gain significantly over autovec

#

Especially since screlu-affine seems a lot slower than I hoped

#

As well as memmove
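For reference, a scalar sketch of what the SCReLU step computes. Reading "screlu-affine" as SCReLU followed by the output affine layer is an assumption, and `screlu` / `screlu_affine` are hypothetical names, not the functions in the branch. A SIMD version would do the same arithmetic on many lanes at once.

```rust
/// Squared clipped ReLU on a quantised activation: clamp(x, 0, qa)^2.
fn screlu(x: i16, qa: i16) -> i32 {
    let c = i32::from(x.clamp(0, qa));
    c * c
}

/// Dot product of SCReLU-activated accumulator values with output weights.
/// Scalar reference only: it assumes the sum fits in an i32.
fn screlu_affine(acc: &[i16], weights: &[i16], qa: i16) -> i32 {
    acc.iter()
        .zip(weights)
        .map(|(&a, &w)| screlu(a, qa) * i32::from(w))
        .sum()
}
```

With `qa = 255` (the first quantisation factor quoted in this thread) each activated value lies in 0..=65025.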

#

can I get smth like this for full inputs instead
Simplified inference already has worked for a long time

formal smelt
#

I will but I’m not available for at least a few hours

rocky vigil
#

Ok yeah I won’t be able to do things with rust for a while as well

#

Until I get out of school

round stone
# rocky vigil Btw can we compare 1024 to master
Results of ./threat-inputs/stockfish-master-mar12 vs ./threat-inputs/mar11-sscg13-1024-profile-build (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 80.69 +/- 7.18, nElo: 129.42 +/- 11.12
LOS: 100.00 %, DrawRatio: 36.30 %, PairsRatio: 4.17
Games: 3752, Wins: 1581, Losses: 725, Draws: 1446, Points: 2304.0 (61.41 %)
Ptnml(0-2): [39, 192, 681, 802, 162], WL/DD Ratio: 2.01
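The Elo line can be checked from the win/loss/draw counts with the usual logistic formula. This is a sketch with hypothetical helper names; it reproduces only the point estimate, not the error bars or nElo.

```rust
/// Score fraction from wins, losses, and draws (a draw counts as half a point).
fn score_fraction(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = f64::from(wins + losses + draws);
    (f64::from(wins) + 0.5 * f64::from(draws)) / games
}

/// Logistic Elo difference implied by a score fraction `p` in (0, 1).
fn elo_from_score(p: f64) -> f64 {
    -400.0 * (1.0 / p - 1.0).log10()
}
```

Plugging in 1581 wins, 725 losses, and 1446 draws gives a score of about 61.41 % and roughly +80.7 Elo, counted from the point of view of the first-listed engine.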
rocky vigil
#

Ok maybe not that close lol

#

I think this STC will end up being -200 or so

round stone
#

i think you're right, it should be positive elo if the engine on the left is stronger than the one on the right (stronger vs. weaker)

#

that's the only way for fishtest STC results to be consistent

twilit oriole
#

Are these nets multilayer yet?

round stone
#

these are all single-layer so far

#

i do have a multi-layer L1-256 sitting around

twilit oriole
#

So a 1536 full threats with multilayer should beat master in fixed nodes by a lot... Which is good in theory

rocky vigil
twilit oriole
rocky vigil
#

I can easily make multilayer inference if and only if it matches what already exists in sf code

twilit oriole
#

It should. i8 quant is possible in bullet I thought

rocky vigil
#

Yeah I think more pressing concern is get full threats to work

round stone
rocky vigil
#

Optimistically this should happen in a few days

round stone
#

it also has floats for inference in later layers

twilit oriole
rocky vigil
#

I think it’s worth testing if quantization for later layers loses elo

twilit oriole
#

Oh that's just an example lol

#

It's not made for this

round stone
#

full threats would be a good baseline

rocky vigil
#

But I think the greater issue is floating point arithmetic not being associative
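A small illustration of the non-associativity concern, using plain `f32` additions. The values are picked to make the effect obvious and are not taken from any net.

```rust
/// (a + b) + c, evaluated left to right.
fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// a + (b + c), same operands in a different grouping.
fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With `a = 1e8`, `b = -1e8`, `c = 1.0` the first grouping gives 1.0 and the second gives 0.0, because `-1e8 + 1.0` rounds back to `-1e8` in `f32`. Integer accumulation does not have this problem, which is one argument for quantised later layers when results must be identical regardless of summation order.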

round stone
#

is there a simple multi-layer training config with i8 quant for later layers somewhere?

twilit oriole
rocky vigil
#

Ah hmm

#

Do we know any devs who have multilayer with quantization

#

They might know

formal smelt
#

I’m pretty sure the arch matches unless I’ve misinterpreted the diagrams in NNUE.md

twilit oriole
#

Hm well can you make one to match the quantization also?

formal smelt
#

It’s a few line change, someone else can do it

#

I don’t know what SF does anyway

round stone
#
vec![
    SavedFormat::new("pst", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0w", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0b", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l1w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l1b", QuantTarget::I16(64 * 255), Layout::Normal),

    SavedFormat::new("l2w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l2b", QuantTarget::I16(64 * 255), Layout::Normal),

    SavedFormat::new("l3w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l3b", QuantTarget::I16(64 * 255), Layout::Normal),
],
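As a rough picture of what a quantisation target like `I16(64)` implies, here is a sketch that scales a float by the factor, rounds, and saturates to `i16`. `quantise_i16` is a hypothetical helper and is not bullet's implementation.

```rust
/// Quantise a float parameter to i16: scale by `q`, round, saturate.
fn quantise_i16(value: f32, q: i32) -> i16 {
    let scaled = (value * q as f32).round();
    scaled.clamp(f32::from(i16::MIN), f32::from(i16::MAX)) as i16
}
```

Under this scheme a bias scale of 64 * 255 = 16320 saturates an `i16` once the bias magnitude exceeds about 2, which may be relevant to the choice of wider types for later-layer biases.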
#

this doesn't crash, so i assume it's usable

#

don't know if there's a strict need for int8 vs. int16 for l2w and l3w

twilit oriole
#

I would just use i8

#

Like it matches the current, dunno if changing it is good

round stone
#

currently i think it's i8 for l2w, l3w and i32 for l2b, l3b

round stone
#

QuantTarget::I32(_) => unimplemented!("i32 quant is not implemented for TrainerBuilder!")

#

i don't think it matters to use i16 for everything

rocky vigil
#

Yeah I’ll have a look at the layers soon

rocky vigil
#

SSS still but threatnet seems to be way better at endgames as expected

round stone
lofty cedar
#

I think we do need to move to multi-layer.

#

That would be a more accurate nnue.

naive comet
#

multilayer would only give 10ish gain, we can do that as a last step once we actually reach that range

lofty cedar
#

Really? I thought multilayer was a big deal.

#

But it is a big deal performance-wise.

round stone
#

-150 + 10 = -140

#

looking for elo gainers more like +100

twilit oriole
#

It would give far more than 10 for threats, the speed hit is far less % wise

naive comet
#

at most 20-30 fixed-nodes

#

though

lofty cedar
#

Was this measured using size 256?

#

It might be the case that the quality of a single-layer net nearly saturates at some point and so you need to go multi-layer.

#

Like... larger l1 scales better with more layers.

#

So, if l1=256, 1 layer vs l1=256 multilayer was 10 elo, maybe l1=512 multilayer would gain more?

rocky vigil
#

LTC for 768 vs 512 much better than STC

#

We do have speed issue

lofty cedar
#

Do you have further plan?

#

It's still a long way vs master.

rocky vigil
#

full inputs

#

multilayer

#

I think full input 1536 with multilayer should be good in fixed nodes vs master

#

and then a grind for optimizing

lofty cedar
#

Full input? WDYM?

rocky vigil
#

we are using simplified inputs right now, which are ~15k

#

full inputs add ~80k instead

lofty cedar
#

Oh, I see.

#

Good luck...

#

Lots of work.

rocky vigil
#

also a vague extra idea I had is to append the threats to halfkav2hm instead

#

not just append to psq

#

we can also support this theoretically

#

by having separate accumulators for them, and combining on eval time

#

tmrw if I have time I'll try to merge append_active_threats and write_difference into compute_threats_write_difference

rocky vigil
#

hmm we're getting outscaled hard by master

#

I guess not so surprising

stray reef
#

which montytrain branch produces the net that was +100-ish fixed nodes?
https://github.com/official-monty/montytrain/tree/threat-inputs-nnue-fixed sounded like the "correct" one that's not the simplified inputs from what i caught yesterday, but from looking at the code it looks like some threats are not incorporated (e.g. map_pawn_threat only maps to pawns, knights and rooks)

rocky vigil
#

Yes because pawn -> bishop threat implies corresponding bishop -> pawn threat

#

So you know any pawn -> bishop is a duplicate

stray reef
#

Oh right! I didn't think of it that way. So the branch is correct?

rocky vigil
#

Yes

#

I believe

stray reef
#

Great, thank you

rocky vigil
#

Btw if you can run it can you print me the active feature indices

#

For kiwipete

#

And startpos

rocky vigil
#

Note that pawn - pawn are duplicates only when they are opposite colors
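The duplicate argument can be checked mechanically on an empty-board geometry. This sketch uses hypothetical helpers (`pawn_attacks`, `bishop_attacks_adjacent`) and square numbering 0 = a1 .. 63 = h8; it is not the montytrain indexing code.

```rust
/// Squares attacked by a pawn on `sq`; `white` selects the direction.
fn pawn_attacks(sq: u8, white: bool) -> Vec<u8> {
    let file = (sq % 8) as i8;
    let rank = (sq / 8) as i8;
    let dr = if white { 1 } else { -1 };
    let mut out = Vec::new();
    for df in [-1i8, 1] {
        let (f, r) = (file + df, rank + dr);
        if (0..8).contains(&f) && (0..8).contains(&r) {
            out.push((r * 8 + f) as u8);
        }
    }
    out
}

/// True if a bishop on `from` attacks the diagonally adjacent square `to`.
/// Adjacent squares cannot be blocked, so occupancy does not matter.
fn bishop_attacks_adjacent(from: u8, to: u8) -> bool {
    let df = (from % 8) as i8 - (to % 8) as i8;
    let dr = (from / 8) as i8 - (to / 8) as i8;
    df.abs() == 1 && dr.abs() == 1
}
```

For every pawn attack `sq -> t`, a bishop on `t` attacks `sq`, and an opposite-colored pawn on `t` attacks `sq` as well, so those threats come in pairs. A same-colored pawn on `t` does not attack back, which is why pawn-pawn pairs are duplicates only across colors.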

stray reef
#

startpos

79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416 
79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416

kiwipete

79864 79865 79866 79869 79870 95 79871 79883 34 79892 79941 1276 605 606 608 79955 2034 2710 2712 2713 79995 79996 6866 5189 80048 13722 10143 80055 13821 10241 80130 19491 19492 22405 28230 20954 19503 29699 80179 36583 37424 37425 80256 39942 80270 40051 80281 80283 39990 80290 40253 40254 80292 40257 80293 40344 80295 40347 80346 40663 40665 42683 44365 80350 40696 42713 43723 80415 46014 80417 48273 49394 80488 55330 58921 80495 55432 59020 80547 68941 70401 68946 68950 68951 76236 80619 78573 78575 
79866 3 4 79868 7 79869 94 79871 97 79873 79875 79894 389 79896 79938 2259 581 2599 2601 79942 1619 612 2629 79993 6279 5162 80007 80048 13722 10147 80055 13821 10241 80123 26612 19336 19337 20797 19342 19349 80179 36583 36585 80268 39963 80275 40228 80288 80289 39999 80290 80293 80294 40345 80295 80331 40566 40567 40568 43932 80349 42702 42704 43378 42707 80419 46045 80420 48300 49981 80488 55334 58921 80495 55432 59020 80538 61448 68735 60000 70196 68743 68744 71657 80619 78573 79414 79415
#

code used (i hope it's correct)

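fn main() {
    let inputs = ThreatInputs::default();
    // startpos as "<FEN> | <score> | <result>" text, parsed into a ChessBoard
    let pos = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 | 0 | 0.0"
        .parse::<ChessBoard>()
        .unwrap();
    // first line of output: side-to-move feature indices
    inputs.map_features(&pos, |stm, _nstm| print!("{stm} "));
    println!();
    // second line of output: non-side-to-move feature indices
    inputs.map_features(&pos, |_stm, nstm| print!("{nstm} "));
    println!();
}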
rocky vigil
#

Also 139 startpos features is way off

formal smelt
#

they're the standard 768 inputs

#

you're seeing them contiguously because the pawns aren't attacking or defending anything in startpos

#

i get startpos has 88 features

rocky vigil
#

Ok will check later

rocky vigil
formal smelt
#

I changed it for the simpler inputs to put them before

rocky vigil
#

Full threat indexing should be ready now

#

At least it’s correct for startpos and kiwipete

round stone
#

alright, does that mean full threats inference is ready for testing soon?

rocky vigil
#

Yes

formal smelt
round stone
#
Architecture           : (80624 -> 256)x2 -> 1
Inputs                 : Threat inputs
Number of Weights      : 20.64m
Quantisations          : [255, 64]
Eval Scale             : 340
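The reported weight count can be reproduced with a simple tally. The breakdown below (one bias per feature-transformer neuron, an output layer fed by both perspectives, one output bias per bucket) is an assumption about how the trainer counts, and `weight_count` is a hypothetical helper.

```rust
/// Parameter count for an (inputs -> l1)x2 -> 1 x buckets network.
fn weight_count(inputs: usize, l1: usize, buckets: usize) -> usize {
    let feature_transformer = inputs * l1 + l1; // weights + biases
    let output_layer = buckets * (2 * l1 + 1); // both perspectives + bias
    feature_transformer + output_layer
}
```

For 80624 inputs and L1 = 256 with a single bucket this gives 20,640,513, which rounds to the 20.64m shown above. Almost all of it is the feature transformer.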
rocky vigil
#

I wrote approximately the same functions

round stone
#

that's a partially-trained one. the bulletbullet footer needs to be trimmed still

rocky vigil
#

We do need a better incremental at some point though

formal smelt
#

hehe

round stone
#

gotta run for now. i'll check back in later

rocky vigil
#

Considering how Yukari is like 2x as fast

formal smelt
#

as long as its working first

rocky vigil
#

As current impl

formal smelt
#

ah lol

rocky vigil
#

The psq feature offset for full threats is 79858 right

#

Ah shoot 79856

formal smelt
#

something like that

#

you can check with the montytrain branch i linked

rocky vigil
round stone
rocky vigil
#

alright i'll try to get it to work in ~2 hours or so

rocky vigil
#

wait no output buckets?

#

ok then

rocky vigil
#

nvm apparently I have an access violation trying to read 0x00000000

#

and I am a bit too tired to debug this rn

#

commit is up for debug help

round stone
#

oh yea, output buckets would be good

#

seems like output buckets doesn't work with montytrain yet

round stone
#

need to do a new set of measurements for full threats later

formal smelt
#

It looks like you’ve done .inputs(output buckets) or something

rocky vigil
#

Can someone test if latest commit compiles and runs

#

Because it tries to read null pointer on one machine

#

But works on the other

#

Anyways unbucketed full threats are 40 +- 35 elo at 25k nodes, according to an sss test I ran

rocky vigil
#

Simplified 256, with 8 output buckets

twilit oriole
#

nice

rocky vigil
twilit oriole
#

not home rn

rustic bough
#

e27d31f compiled with Clang 20/MSYS2 under Win10 on Intel i7 7700HQ runs.

round stone
# formal smelt It looks like you’ve done `.inputs(output buckets)` or something
diff --git a/value/src/arch.rs b/value/src/arch.rs
index 0438bef..289766f 100644
--- a/value/src/arch.rs
+++ b/value/src/arch.rs
@@ -13,9 +13,9 @@ pub fn make_trainer<T: Default + SparseInputType>(
     TrainerBuilder::default()
         .quantisations(&[255, 64])
         .optimiser(AdamW)
-        .loss_fn(Loss::SigmoidMSE)
+        .loss_fn(Loss::SigmoidMPE(2.6))
         .input(inputs)
-        .output_buckets(outputs::Single)
+        .output_buckets(outputs::MaterialCount::<8>)
         .feature_transformer(l1)
         .activate(Activation::SCReLU)
         .add_layer(1)
formal smelt
#

huh i can reproduce it

#

thats really weird

round stone
#
use bullet::{
    nn::{
        optimiser::{AdamW, AdamWOptimiser},
        Activation,
    },
    trainer::default::{inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder},
};

#[rustfmt::skip]
pub fn make_trainer<T: Default + SparseInputType>(
    inputs: T, l1: usize,
) -> Trainer<AdamWOptimiser, T, outputs::Single> {
    TrainerBuilder::default()
        .quantisations(&[255, 64])
        .optimiser(AdamW)
        .loss_fn(Loss::SigmoidMPE(2.6))
        .input(inputs)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(l1)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build()
}
formal smelt
#

) -> Trainer<AdamWOptimiser, T, outputs::Single> {
you'd need to change this line too

#

but when i do that it still errors

rocky vigil
#

Ok maybe my home laptop just has an issue with std::optional

#

Well that kinda sucks if true

formal smelt
round stone
#

cool works now, after also adding the import:

trainer::default::{
    formats::bulletformat::ChessBoard,
    inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder,
},
round stone
rocky vigil
#

Does it run

#

And not crash

round stone
#

yea it's +30 fixed nodes so far vs. simplified threats 512

rocky vigil
#

Home laptop is getting an access violation at 0x00000000 and I’m trying to figure out if this is an issue with my code or with my device

#

So it’s probably an issue with my compiler on that laptop then

#

Ok good to know

#

Hmm surpassing double L1 is really good

round stone
#

or maybe -30? i keep mixing up the order of these

#
Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 30.14 +/- 6.19, nElo: 45.18 +/- 9.24
LOS: 100.00 %, DrawRatio: 43.04 %, PairsRatio: 1.57
Games: 5432, Wins: 1978, Losses: 1508, Draws: 1946, Points: 2951.0 (54.33 %)
Ptnml(0-2): [93, 509, 1169, 725, 220], WL/DD Ratio: 2.28
main prawn
#

that's -30

round stone
#

womp womp

rocky vigil
#

Well if 512 was really +130 to 256 for simplified (this result seems off, btw) that is still a good gain

#

+output buckets are maybe 10 or so

formal smelt
#

output buckets can be neutral

#

very variable

round stone
#

an output buckets 256 is training now and will be ready later to test

formal smelt
#

nice

rocky vigil
#

Ok

#

Nice

#

In other news for ue optimization once again

#

Since Yukari obviously does it a lot better rn

#

I think at some point we need to switch from trying to compute every threat

formal smelt
#

do we have a guesstimate of what perf is left on the table

#

vs yukari?

rocky vigil
#

To only incrementally updating threats

rocky vigil
# formal smelt vs yukari?

Startpos 20 second search for Yukari vs current branch had Yukari like 2x faster for (simplified->256) on my machine

#

You can also test this out

formal smelt
rocky vigil
#

I think Yukari has incremental attack tables

#

Which is integrated into movegen etc

#

I am wondering surely sf had fast attack tables back in the HCE era

#

So I am wondering if those still exist or we could bring them back

rocky vigil
#

You can see it here

formal smelt
#

crazy

#

well that bodes well at least for the potential once optimised

rocky vigil
#

yeah the current method is kind of a dead end eventually

#

We can mask the overhead more as L1 gets large but it’s still significant

rocky vigil
#

So first of all castling never opens any discoveries right

#

And castling is the only move involving more than 2 squares

rocky vigil
#

So if castling we know that we only need to deal with attacks involving those (up to 4) squares

#

Otherwise for two-square moves we can loop through the leapers

#

And discoveries are only present if both sides of a file/rank/diagonal are attacked

#

Do we have methods for only getting attacks on one line

#

Not both

#

Welp I guess not

round stone
#

full threats L1-256 with 8 output buckets:

Architecture           : (80624 -> 256)x2 -> 1x8
Inputs                 : Threat inputs
Number of Weights      : 20.64m
Output Buckets         : Will be transposed in quantised network for you, output bucketed layers will
                       : have weights in form [[[T; layer input size]; layer output size]; buckets]
Quantisations          : [255, 64]
Eval Scale             : 340

https://tests.stockfishchess.org/api/nn/nn-a660a82f6a81.nnue

rocky vigil
#

It would allow better optimization of computing discovered attacks

candid ivy
#

vertical/horizontal/diagonal ?

#

there's line_bb and between_bb

rocky vigil
#

Vertical, both directions (so up and down)

#

Or horizontal, both directions (so left and right)

rocky vigil
#

Because incremental from bitboards alone is very annoying, at least to optimize

#

For threats what you would want to do is:

#

Loop through the attackers of both squares and also the attacks

#

But then optimizing around deduplication and computing less attacks is hard

candid ivy
#

mh you can use hyperbola approach for just horizontal/vert if you don't want to use a lookup

rocky vigil
#

Hyperbola doesn’t work for rank right?

#

What I was planning to do was loop through the attacks for the leapers (including both colors of pawns) and then consider rank, file, both diags individually to add discovered attacks

#

Combining this way means you don’t have to compute the bitboard of piece attacks separately since you already need it to compute the attackers of the square

#

But after thinking about it for a while I conclude that this will be a very messy hack

candid ivy
# rocky vigil Hyperbola doesn’t work for rank right?

mh it does, had a look on github and one impl is just

template<uint32_t sq>
static constexpr uint64_t rook_attack(uint64_t occ) {
       return vertical_attack<sq>(occ) | horizontal_atkL<sq>(~occ) | horizontal_atkR<sq>(~occ);
}

so you can split that up into just vert/horizontal
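A sketch of the split into file-only and rank-only attacks. To keep it short and checkable it uses plain ray walking rather than hyperbola quintessence or a lookup table, so it shows the interface, not the fast implementation. All function names are hypothetical; squares are 0 = a1 .. 63 = h8.

```rust
/// Walk from `sq` in direction (df, dr), collecting squares up to and
/// including the first occupied one.
fn ray(sq: u8, occ: u64, df: i8, dr: i8) -> u64 {
    let mut attacks = 0u64;
    let mut f = (sq % 8) as i8 + df;
    let mut r = (sq / 8) as i8 + dr;
    while (0..8).contains(&f) && (0..8).contains(&r) {
        let bit = 1u64 << (r * 8 + f) as u32;
        attacks |= bit;
        if occ & bit != 0 {
            break; // blocker reached, stop after including it
        }
        f += df;
        r += dr;
    }
    attacks
}

/// Vertical attacks only (up and down the file).
pub fn file_attacks(sq: u8, occ: u64) -> u64 {
    ray(sq, occ, 0, 1) | ray(sq, occ, 0, -1)
}

/// Horizontal attacks only (left and right along the rank).
pub fn rank_attacks(sq: u8, occ: u64) -> u64 {
    ray(sq, occ, 1, 0) | ray(sq, occ, -1, 0)
}

/// Full rook attacks as the union of the two lines.
pub fn rook_attacks(sq: u8, occ: u64) -> u64 {
    file_attacks(sq, occ) | rank_attacks(sq, occ)
}
```

Having the two lines available separately is what lets a discovered-attack check look at one line at a time.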

rocky vigil
#

Do you know if sf had attack tables in the HCE days

#

I think a broader issue is that currently we try to decouple nnue updates from the position

#

But threat inputs have a much larger dependence on the position

#

So trying to do threats separately without integrating it into the position and stuff like make_move naturally leads to awkward code

#

Eventually I think we do need to figure out some way to compute threat changes incrementally and I would appreciate help from more experienced people regarding this

round stone
#
Results of ./threat-inputs/mar14-sscg13-full-threats-256-8-output-buckets vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 3.50 +/- 4.74, nElo: 5.29 +/- 7.16
LOS: 92.60 %, DrawRatio: 44.70 %, PairsRatio: 1.03
Games: 9034, Wins: 2949, Losses: 2858, Draws: 3227, Points: 4562.5 (50.50 %)
Ptnml(0-2): [218, 1011, 2019, 1000, 269], WL/DD Ratio: 2.32
#

going to try larger L1, then different datasets later, including monty binpacks

rocky vigil
#

Yeah seeing the gains at larger L1 will also be very helpful

rocky vigil
#

And I would appreciate any suggestions

rocky vigil
#

Most important right now though is to scale the nets up to large size and verify gain over master at fixed nodes

#

Theoretically, with the current implementation, you compute all attacks of pieces, looping through ~40 threats and writing them, then write_difference loops through these and writes maybe 10 more on average

#

Whereas with real incremental you only need to loop through 12 attack bitboards and way less threats, and you can write the differences directly
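For comparison, this is roughly what the "compute everything, then diff" baseline amounts to once both feature lists exist. It is a generic sorted-list diff under the assumption that the lists are sorted and duplicate-free; `feature_diff` is a hypothetical name and not the branch's `write_difference`.

```rust
use std::cmp::Ordering;

/// Given sorted, duplicate-free feature lists before and after a move,
/// return (added, removed).
fn feature_diff(before: &[u32], after: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let (mut added, mut removed) = (Vec::new(), Vec::new());
    let (mut i, mut j) = (0, 0);
    while i < before.len() && j < after.len() {
        match before[i].cmp(&after[j]) {
            Ordering::Less => {
                removed.push(before[i]);
                i += 1;
            }
            Ordering::Greater => {
                added.push(after[j]);
                j += 1;
            }
            Ordering::Equal => {
                i += 1;
                j += 1;
            }
        }
    }
    // whatever is left on either side has no counterpart
    removed.extend_from_slice(&before[i..]);
    added.extend_from_slice(&after[j..]);
    (added, removed)
}
```

A truly incremental scheme would produce `added` and `removed` directly from the squares touched by the move, without materialising either full list.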

#

Also: usually the accumulator updates are for both colors at once right, so we can try to optimize further by only computing the attacks once but writing the difference for both colors

#

This all needs more work to develop though

rocky vigil
#

We’ll still have to see for multilayer later

twilit oriole
#

Eh that's cache stuff I'm pretty sure. Like if the net was shared between instances it would gain

rocky vigil
#

8 output buckets vs 1 bucket is hardly cache right

#

does nodestime work as a tc version of fixed nodes

twilit oriole
rocky vigil
#

the new dataset appears quite superior to the old one

round stone
#

The new one is the same training sequence as the one used for simple threats. Old one was training directly on a single leela binpack from scratch

round stone
#

fishtest not a fan of L1-768 full threats being huge

lofty cedar
#

Need to update fishtest.

upbeat pewter
#

NNUE that's just a rickroll loaded as an eval

twilit oriole
#

@round stone What's the fixed nodes result of 512 Vs 256 full threats?

Don't really care about TC tests anyway, the speed is not representative of anything final so it's not telling anything

What we really are still needing is L1 1536 multilayer full threats with output buckets. To compare fixed nodes to master. I have access to 4x4090 u can use for training.

This speed stuff was supposed to be on the side, not the main focus at this stage. The concept has not yet been proven enough to justify spending so much time trying to optimise speed

twilit oriole
#

This is getting out of hand (2.3k messages) because the process is not being understood I think. Like multilayer is being delayed as it is 'only 20 Elo whereas at TC the net is currently -100 Elo' but the TC result is irrelevant at this stage. We need a solid fixed nodes result first, so it is relevant there

rocky vigil
#

although nowhere close to fully optimized, the speed is definitely sufficient for fixed nodes at a reasonable pace

round stone
#
Results of ./threat-inputs/mar15-sscg13-full-threats-512-S2 vs ./threat-inputs/mar15-sscg13-full-threats-256-S2 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 41.66 +/- 5.02, nElo: 64.55 +/- 7.71
LOS: 100.00 %, DrawRatio: 41.32 %, PairsRatio: 1.98
Games: 7802, Wins: 2882, Losses: 1951, Draws: 2969, Points: 4366.5 (55.97 %)
Ptnml(0-2): [104, 663, 1612, 1242, 280], WL/DD Ratio: 2.03
#

TC tests are important for establishing a baseline and quantifying improvement

#

optimizing speed is part of proving the concept, especially if speed is the biggest bottleneck

#

at larger L1, the training process and data becomes much more important

#

if an L1-512 full threats net is -50 elo STC vs. a far smaller L1-256 HalfKA v2 hm, then that's a problem

#

can you elaborate on why you think the process you have in mind is a better approach to being competitive vs. master at TC?

rocky vigil
#

i think they more want to say

#

that we should get data about fixed nodes

#

to see if it can be competitive assuming good optimization

#

rn I think if you want to simulate an "optimized" implementation you can give 1.6x time odds, as this is approximately how much faster Yukari is than current stockfish in the midgame at L1=256

round stone
#

is a 1.6x speedup within reach?

#

much larger L1 is going to need significant improvements to the training process + data to have any decent indication of fixed nodes strength

rocky vigil
#

there is no reason it shouldn't

#

well the 1.6x applies mostly to midgame

#

in endgame where there are few threats it isn't as large

#

I'll see where we are (full threats -> 256 vs smallnet) with +50% time odds

upbeat pewter
#

and I think my implementation is actively less efficient than a lazy approach

formal smelt
#

is this the right way round?

rocky vigil
#

wait huh

twilit oriole
#

Ideally we would have the unoptimized version exactly how we want in terms of net and then optimise after

rocky vigil
#

I think L1 1536 should be close to master

#

if we take simplified 1024 is -80, add +40 for full threats, add +20 for larger L1, and +20 for multilayer

twilit oriole
#

Sure you can think but it needs to be proven. I have the GPUs let's just do it before investing a fuckton of time

rocky vigil
#

right

#

yeah 4x4090 should massively speed up the training

#

as long as you have the data and know what to do

twilit oriole
#

Linrock can provide the script and data to me

#

Ideally it would be complete in terms of arch so then we know this is exactly what we are even optimizing

round stone
#

nnue-pytorch is far ahead of bullet as far as maximizing strength of data, and this is only amplified with larger L1

twilit oriole
#

That's fine. This is work that has to be done anyways, I still think it's an easier path than doing UE threat gen and stuff

#

With 4 GPUs it means 4 experiments can go in parallel

round stone
#

i think it saves time to prove/disprove at smaller L1

rocky vigil
round stone
#

if it can never be fast enough to become a master net with optimal UE, then there's no need for a long and complicated training process

rocky vigil
#

in Disservin's profile for L1=256 up to 45% of the time is spent on append_active_threats, write_difference, and memmove_avx_unaligned_erms

#

memmove is probably because I did smth wrong with alignment

#

and if we can reduce those things to 1/4 of their current runtime with incremental threat computation that gives our +50% speedup approximately
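The +50% estimate follows from Amdahl's law. A minimal sketch, with a hypothetical helper name:

```rust
/// Overall speedup when a `fraction` of the runtime is made `factor` times faster.
fn overall_speedup(fraction: f64, factor: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / factor)
}
```

With 45% of the time in those three functions and a 4x reduction there, the new runtime is 0.55 + 0.45 / 4 = 0.6625 of the old one, so the overall speedup is about 1.51x.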

candid ivy
rocky vigil
formal smelt
#

I’ll just add them in the next week sometime

round stone
#

biggest two are:

  • piece count probability distribution skipping
  • WDL skipping
formal smelt
#

got any digestible code references for them?

formal smelt
#

cheers

#

hopefully will have time in a few days

rocky vigil
#

it should still be superior even at the current version

#

i might make a test later

formal smelt
#

so if you grab 50% speedups from the threat calculation and indexing, we're gaming?

violet badger
#

out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorch), also, does bullet have multi-GPU training?

rocky vigil
#

sadly being like 4x larger than current smallnet it doesn't fulfill the smallnet purpose

round stone
#

the smallnet primary purpose was to gain elo

#

so if a threats net can be used as a smallnet as a gainer, it's another path to production

rocky vigil
#

i meant for replacing the lichess net

#

like it's gonna be 30MB

#

and that will be a bit bad

round stone
#

oh yea probably not as a lichess smallnet replacement, where the purpose was to be small

round stone
upbeat pewter
violet badger
#

ok, nice..

#

(if it can be made to work well 😉 )

rocky vigil
round stone
#

training speed is hard to compare, since it varies due to many differences between them: net architectures, skipping, dataset format

rocky vigil
#

it seems like it performs better against lichess smallnet with the original book though

#

compared to endgames.epd

#

for whatever reason

round stone
#

right now i'm getting around 2M pos/sec for these L1-512 threat nets, loading binpacks without any skipping

violet badger
#

so gut feeling comparison, same order of magnitude, or rather slower or rather faster.

round stone
#

also config options i haven't closely looked into optimizing

rocky vigil
#

also 120MB limit for net means that tc testing for larger L1 is out for now

round stone
#

i'd say training definitely seems faster with bullet, but that's also because piece count + WDL skipping makes it much slower

#

similar order of magnitude of training speed

#

using the sequential dataloader maybe 2-3x faster for small nets, vs. binpacks even without skipping

#

the data disk size usage vs. training speed tradeoff of bullet format vs. binpacks is very noticeable

rocky vigil
round stone
rocky vigil
#

which will hopefully let us know how threat inputs scales with more time

round stone
#

in the meantime i'm hoping L1-256 and L1-512 experiments can give an indicator of how it scales

rocky vigil
#

full threats -> 512 is like 80 mb right?

round stone
#

yea
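A back-of-envelope check of that size, assuming the feature transformer weights dominate and are stored as 2-byte `i16` values. Both points are assumptions, and `ft_bytes` is a hypothetical helper.

```rust
/// Approximate feature-transformer size in bytes for i16 weights.
fn ft_bytes(inputs: usize, l1: usize) -> usize {
    inputs * l1 * 2
}
```

For 80624 inputs this gives about 82.6 MB at L1 = 512 and about 123.8 MB at L1 = 768.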

rocky vigil
#

does the memory issue become significant

#

viren might know better

#

ik viren advocates for not duplicating ram for nnue weights on multiple sf processes

#

huh +50% speed is 150 elo

#

this is a bit tyrannical scaling

round stone
#

i was expecting less than +100 elo with a +50% speedup but didn't really know what was reasonable

#

worth testing it vs. master to see where it's at there

rocky vigil
#

it's only 1.5x time odds not 2x

formal smelt
# violet badger out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorc...

i'm using the command as follows on nnue-pytorch main branch

python3 train.py ../bullet/data/test80-2024-02-feb-2tb7p.min-v2.v6.binpack --batch-size 16384 --features=HalfKAv2_hm --lambda=1.0 --gpus "0," --threads 8 --num-workers 8 --default_root_dir ./ --no-smart-fen-skipping

does this command look okay?
This gives me 92/6166 [00:17<19:01, 5.32it/s, loss=0.0209, v_num=1] which is ~90k pos/sec
I believe I'm training SF master net with L1=3072, bucket=8, having touched nothing else
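The pos/sec figure is just iterations per second times batch size. A tiny sketch with a hypothetical helper name:

```rust
/// Training throughput in positions per second.
fn positions_per_sec(iters_per_sec: f64, batch_size: u32) -> f64 {
    iters_per_sec * f64::from(batch_size)
}
```

At 5.32 it/s with a batch size of 16384 this comes to about 87k pos/sec, in line with the ~90k quoted.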

#

Num virtual features: 0 i take it this is not using a factoriser as well

round stone
formal smelt
#

superbatch 1 [12.5% (128/1024 batches, 171237 pos/sec)]

formal smelt
#

its ofc not a proper comparison

#

need to guarantee everything is equal enough

rocky vigil
#

how long is the average fishtest game

formal smelt
#

tested on gtx 1660 super btw

daring wren
rocky vigil
#

ok it's more like 2.5x time odds then

#

oopsies

#

yeah the real 1.5x time odds should be close to neutral then

formal smelt
#

ah the psqt subnet is output bucketed

#

superbatch 1 [37.5% (384/1024 batches, 165194 pos/sec)] have updated

#

superbatch 1 [37.5% (384/1024 batches, 154846 pos/sec)] + ranger optimiser

candid ivy
#

makes me wonder how much elo is from optimizing hyperparameters and the data filter changes

rocky vigil
#

hmm looks like with the real +50% it's still better than smallnet

#

but significantly worse than master as expected

violet badger
#

(which made me realize that my local backup was not adjusted to the recent server changes... need to fix that... fixed).

candid ivy
formal smelt
violet badger
#

interesting, so getting the exact SF arch and training features in bullet would be quite nice for training.

formal smelt
#

Apart from the output bucket formula I’m now fairly certain the arch is exactly replicated in the branch I linked

#

For required extra features other than the data loading stuff I’m not sure
The NNUE-PyTorch ranger impl has more stuff than the bullet one but all the extra features seem unused by default

candid ivy
#
#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount<const N: usize>;
impl<const N: usize> OutputBuckets<ChessBoard> for SfMaterialCount<N> {
    const BUCKETS: usize = N;

    fn bucket(&self, pos: &ChessBoard) -> u8 {
        let piece_count = pos.occ().count_ones() as u8 - 1;
        (piece_count / 4) as u8
    }
}

one can just define that for the output buckets right?

formal smelt
#

Yeah

#

And for the optimiser stuff you can implement a custom optimiser that wraps the existing ranger one with any extra stuff you need

#

As long as it doesn’t require any funky gpu operations that aren’t yet supported

#
#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount;
impl OutputBuckets<ChessBoard> for SfMaterialCount {
    const BUCKETS: usize = 8;

    fn bucket(&self, pos: &ChessBoard) -> u8 {
        let piece_count = pos.occ().count_ones() as u8 - 1;
        (piece_count / 4) as u8
    }
}

don't need the generic param also
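
For readers without bullet installed, the bucket formula above can be checked on its own. This is only a sketch: `material_bucket` is a hypothetical free function, and it takes a raw 64-bit occupancy bitboard instead of bullet's `ChessBoard`.

```rust
/// Output bucket from an occupancy bitboard, using the formula from the chat:
/// (piece_count - 1) / 4. `material_bucket` is an illustrative name, not bullet API.
fn material_bucket(occupancy: u64) -> u8 {
    // count_ones() is at most 64, so the cast to u8 cannot truncate.
    let piece_count = occupancy.count_ones() as u8;
    // saturating_sub avoids an underflow panic on an (illegal) empty board.
    piece_count.saturating_sub(1) / 4
}
```

With 2 to 32 pieces on the board this yields buckets 0 to 7, which matches `BUCKETS = 8`.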

candid ivy
#

yeah just realised, im getting 670345 pos/sec for that arch

formal smelt
#

damn my gpu sucks lol

violet badger
#

hmm, I do have a GPU I'd like to test, guess I have to figure out what to run. Is there a TL;DR / list of commands I could run. Worth mentioning I've never run anything rusty 🙂

round stone
#

after installing rust, take a look at the simple.rs example and train with cargo r -r --example simple

formal smelt
#

That is somewhat out of date now, the CUDA requirement is lower than 12.2 and you can also compile for CPU
The instructions in there should work fine tho

candid ivy
violet badger
#

ok, great, just got the simple to work..

candid ivy
violet badger
#

working 'out-of-the-box'

#
Params: 72205464
Beginning Training
Net Name               : test
Batch Size             : 16384
Batches / Superbatch   : 1024
Positions / Superbatch : 16777216
Start Superbatch       : 1
End Superbatch         : 10
Eval Scale             : 400
Save Rate              : 150
WDL Scheduler          : constant 0
LR Scheduler           : start 0.001 gamma 0.3 drop every 60 superbatches
Threads                : 4
Output Path            : checkpoints
superbatch 1 | time 2.6s | running loss 0.000000 | 6363127 pos/sec | total time 4.4s
Estimated time remaining in training: 0h 0m 39s
superbatch 2 | time 2.6s | running loss 0.000000 | 6463666 pos/sec | total time 6.9s
Estimated time remaining in training: 0h 0m 27s
superbatch 3 | time 2.6s | running loss 0.000000 | 6474294 pos/sec | total time 9.5s
Estimated time remaining in training: 0h 0m 22s
superbatch 4 | time 2.6s | running loss 0.000000 | 6477194 pos/sec | total time 12.1s
Estimated time remaining in training: 0h 0m 18s
superbatch 5 | time 2.6s | running loss 0.000000 | 6475190 pos/sec | total time 14.7s
Estimated time remaining in training: 0h 0m 14s
superbatch 6 | time 2.6s | running loss 0.000000 | 6475920 pos/sec | total time 17.3s
Estimated time remaining in training: 0h 0m 11s
superbatch 7 | time 2.6s | running loss 0.000000 | 6478756 pos/sec | total time 19.9s
Estimated time remaining in training: 0h 0m 8s
superbatch 8 | time 2.6s | running loss 0.000000 | 6475852 pos/sec | total time 22.5s
Estimated time remaining in training: 0h 0m 5s
superbatch 9 | time 2.6s | running loss 0.000000 | 6480538 pos/sec | total time 25.1s
Estimated time remaining in training: 0h 0m 2s
superbatch 10 | time 2.6s | running loss 0.000000 | 6494769 pos/sec | total time 27.7s
Estimated time remaining in training: 0h 0m 0s
Saved [test-10]
Total Training Time: 0h 0m 34s
Eval: 0.000cp
#

now that loss / eval is maybe a bit suspicious?

candid ivy
#

yeah deffo

violet badger
#

hmm, I'm using the CUDA_PATH setting, not HIP.

#

seems happy building things, though

#

(but no changes to your repo, except for dropping bmi2 and specifying the path to the .binpack)

candid ivy
#
superbatch 1 | time 26.0s | running loss 0.011861 | 645176 pos/sec | total time 28.5s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 41s
Eval: 23.493cp

thats how it should end, just changing the superbatches from 10 to 1 here so that it finishes quickly

#

did the simple example do something reasonable?

violet badger
#

well, probably not:

Output Path            : checkpoints
superbatch 1 | time 8.7s | running loss 0.000000 | 11501668 pos/sec | total time 11.7s
Estimated time remaining in training: 0h 7m 37s
#

similarly 0.0 loss

upbeat pewter
violet badger
formal smelt
#

What GPU is this?

violet badger
#

GH200

#

I suspect the issue might be host is arm (aarch64)

#

but idk.

formal smelt
#

Try running cargo test --package bullet_core

violet badger
#

all green

#
running 11 tests
test backend::cpu::crelu ... ok
test backend::cpu::matmul ... ok
test backend::cpu::relu ... ok
test backend::cpu::matmul2 ... ok
test backend::cpu::concat ... ok
test backend::cpu::screlu ... ok
test backend::cpu::sparse_affine ... ok
test backend::cpu::sparse_affine_check_not_batched ... ok
test backend::cpu::sparse_affine_batched_biases ... ok
test backend::cpu::sqrrelu ... ok
test backend::cpu::sparse_affine_dual ... ok

test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
#

though that's the CPU backend?

formal smelt
#

Yeah

#

cargo test bullet_hip_backend

#

For gpu backend

formal smelt
upbeat pewter
#

wouldn't it just flat-out fail to link?

violet badger
#

I guess no? That test however does:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_core-a748fb7f74ceb385)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_hip_backend-f99c5dab2957960e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_lib-21ef2babe27415ff)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
formal smelt
#

Oh i sent wrong command

#

cargo test --package bullet_hip_backend

violet badger
#

ok..

#

errors and passes

formal smelt
#

I’ll get home in a minute, will link you a branch to try

violet badger
#

happy to try out (probably after dinner, let's see)

candid ivy
#

the typical vondele dinner break, i ran into many of those 😄

violet badger
#

good stuff coming 😉

upbeat pewter
violet badger
#

it can run crysis

formal smelt
#

okay on branch debug-gh200, run cargo test --package bullet_hip_backend followed by cargo r -r --example simple

violet badger
#

test still fails

#
running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... FAILED
test tests::sqrrelu ... FAILED
test tests::screlu ... FAILED
test tests::sparse_affine ... FAILED
test tests::sparse_affine_dual ... FAILED
test tests::sparse_affine_batched_biases ... FAILED
test tests::crelu ... FAILED
test tests::matmul ... ok
test tests::concat ... ok
test tests::matmul2 ... ok
formal smelt
#

i pushed another change just now that'll affect running the simple example

#

it looks like the kernels are not running but not raising any error about it

#

baffling

violet badger
#

Ah, wait: called `Result::unwrap()` on an `Err` value: Cuda(cudaErrorUnsupportedPtxVersion)

formal smelt
#

oh nice

#

what's the output of nvidia-smi?

upbeat pewter
#

driver issues

formal smelt
#

indeed the most common cause of this error is having mismatched driver for your cuda version

violet badger
#
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   16C    P0             95W /  900W |     290MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
...
candid ivy
#

cudaErrorUnsupportedPtxVersion
This indicates that the provided PTX was compiled with an unsupported toolchain. The most common reason for this, is the PTX was generated by a compiler newer than what is supported by the CUDA driver and PTX JIT compiler.

#

a very "cool" gpu :p

upbeat pewter
#

unfortunately nvidia only list minimum requirements for x86-64 drivers

violet badger
#

driver and cuda version definitely are compatible..

upbeat pewter
#

...did you update your drivers since last rebooting?

#

wondering if it's a kernel/userland mismatch

violet badger
#

me no... but certainly all good on that front.

candid ivy
violet badger
#

well, let me test something...

upbeat pewter
violet badger
#

I tried a different rust install, it said this:

error: rustc 1.81.0-nightly is not supported by the following package:
  [email protected] requires rustc 1.83

so I guess that's indeed a requirement for bullet_core?

violet badger
#

ok, so back to the previous install.

formal smelt
#

The rust install won’t make a difference here

#

The issue is with compiling the cuda kernels and rust has nothing to do with that other than build script invoking nvcc

violet badger
#

is there a way to get output of that compilation (in particular invocation of nvcc, or similar?)

formal smelt
#

Yeah one sec I’m cooking dinner

upbeat pewter
#

CXX=/bin/false cargo run ...

#

:p

violet badger
#

elegant ..

#

is there some equivalent of 'make clean' ?

upbeat pewter
#

cargo clean

violet badger
#

Removed 1499 files, 597.5MiB total yeah

#

though no difference

#

so with the CXX=/bin/false I don't get much more than:
error occurred in cc-rs: Command LC_ALL="C" "nvcc" "-ccbin=/bin/false" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu" with args nvcc did not execute successfully (status code exit status: 1).

#

Also

  --- stdout
  cargo:rerun-if-changed=./kernels
  rerun-if-env-changed=CUDA_PATH
  Path CUDA_PATH="/user-environment/env/default"
  cargo:rustc-link-lib=dylib=cublas
  cargo:rustc-link-search=native=/user-environment/env/default/lib64
  cargo:rerun-if-changed=/user-environment/env/default/include
  TARGET = Some(aarch64-unknown-linux-gnu)
  HOST = Some(aarch64-unknown-linux-gnu)
  cargo:rerun-if-env-changed=CXX_aarch64-unknown-linux-gnu
  CXX_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_aarch64_unknown_linux_gnu
  CXX_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = Some(/bin/false)
  RUSTC_WRAPPER = None
  cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
  cargo:rerun-if-env-changed=NVCC_aarch64-unknown-linux-gnu
  NVCC_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=NVCC_aarch64_unknown_linux_gnu
  NVCC_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_NVCC
  HOST_NVCC = None
  cargo:rerun-if-env-changed=NVCC
  NVCC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some(neon)
  cargo:rerun-if-env-changed=CXXFLAGS_aarch64-unknown-linux-gnu
  CXXFLAGS_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_aarch64_unknown_linux_gnu
  CXXFLAGS_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = None
  CARGO_ENCODED_RUSTFLAGS = Some()
candid ivy
formal smelt
#

the issue is just setting CXX to something invalid doesn't actually compile anything

#

ideally would like to see what gets emitted

violet badger
#

full output?

formal smelt
#

yeah

#

i suspect it will be rather large

violet badger
#

it is 🙂

#

That's the strange bit?

  running: "nvcc" "-?"
  cargo:warning=nvcc fatal   : Unknown option '-?'
  exit status: 1
formal smelt
#

nah that's just the cc crate being weird

#

this is doing some crazy stuff after the initial nvcc invocation

#

i assume due to using gcc

#
  running: LC_ALL="C" "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-G" "-Xcompiler" "-gdwarf-4" "-Xcompiler" "-fno-omit-frame-pointer" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu"
  exit status: 0
#

the actual command seems to just work

violet badger
#

yeah c++ is definitely gcc

formal smelt
#

maybe just need to work out what ptx version your system wants and tell it to emit that

violet badger
#

so right now no clang available...

#

though it must be somewhere ..

#

would be easier if we could restrict the ptx version or so.

formal smelt
#

doesn't seem possible but can check the ptx version emitted

violet badger
#

I don't see any option that specifies the gpu type

formal smelt
#

ptx version != target sm

#

if this works fine then it would be some issue with the nvcc commands that bullet invokes

violet badger
#

compiles into a main.exe

candid ivy
#

mv main.exe main

violet badger
#

nvprof is executed but not installed

formal smelt
#

try running main

violet badger
#

fails.

#

interesting

#

Running naive
Average Time: 0.625ms
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.

formal smelt
#

well that is a relief to me at least

#

so its almost surely not bullet-side issue

violet badger
#

good let me figure this out.

formal smelt
#

sorry, im not really qualified to debug general cuda toolchain issues

violet badger
#

might take some time.

formal smelt
#

sounds good

formal smelt
#

nahhhh

#

that looks like the naive kernel runs

upbeat pewter
#

proposal: bullet should just always run the sanity checks on startup

formal smelt
#

good thinking batman

#

i'll PR that later

upbeat pewter
#

I'm like robin at best

formal smelt
violet badger
#
Running naive
Average Time: 0.597ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error
#

so strange. Let me ask around

formal smelt
#

ah so it just takes 0.6ms to realise the ptx is the wrong version

violet badger
#
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:27:38_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
formal smelt
#

aha

violet badger
#

so compiler newer than the driver?

upbeat pewter
#

it really was driver issues :p

#

yes

formal smelt
#

yeah looks like it
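
The mismatch found here (nvcc release 12.6 against a driver reporting CUDA 12.4) can be turned into a quick pre-flight check. A minimal sketch, assuming the two version strings have already been read from `nvcc --version` and `nvidia-smi`; both function names are hypothetical, and the comparison is only a heuristic for when JIT-compiling PTX might be rejected.

```rust
/// Parses "major.minor" (for example "12.6") into a comparable tuple.
fn parse_cuda_version(text: &str) -> Option<(u32, u32)> {
    let (major, minor) = text.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// Heuristic only: returns Some(true) when the toolkit is newer than the CUDA
/// version the driver reports, which is the situation seen in this thread.
/// Returns None if either string is not in "major.minor" form.
fn ptx_jit_may_fail(toolkit: &str, driver: &str) -> Option<bool> {
    Some(parse_cuda_version(toolkit)? > parse_cuda_version(driver)?)
}
```

For the versions in this thread, `ptx_jit_may_fail("12.6", "12.4")` reports the mismatch.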

violet badger
#

so compiler issues..

#

let me ask

upbeat pewter
#

do you have 12.4 installed somewhere on your FS?

violet badger
#

things are a bit more complicated in this setup... somewhat different approach to deploy software

violet badger
#
Running naive
Average Time: 0.586ms
Error Status: no error
Error Status: no error
Running vectorised
Average Time: 0.198ms
Error Status: no error
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.190ms
Error Status: no error
Error Status: no error
Error Status: no error
#

any way to pass -arch=native to the rust compilation args?

#

I think without specifying anything it compiles for the highest possible arch

upbeat pewter
#

RUSTFLAGS="-C target-cpu=native" cargo run ...

violet badger
#

which is more than GH200 already

#

target-cpu ?

upbeat pewter
#

though for nvcc, hm

violet badger
#

bingo!

#
running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... ok
test tests::matmul ... ok
test tests::crelu ... ok
test tests::sparse_affine ... ok
test tests::sqrrelu ... ok
test tests::screlu ... ok
test tests::sparse_affine_batched_biases ... ok
test tests::concat ... ok
test tests::sparse_affine_dual ... ok
test tests::matmul2 ... ok
upbeat pewter
#

\o/

formal smelt
#

yay

violet badger
#

so, moving to the simple example, I guess

#

or to @candid ivy code, but I assume that needs a bullet upstream fix?

formal smelt
#

what fix was needed?

formal smelt
#

i dont think that's needed but might be faster

violet badger
#
diff --git a/crates/bullet_hip_backend/build.rs b/crates/bullet_hip_backend/build.rs
index 71f7989..1c9cc2e 100644
--- a/crates/bullet_hip_backend/build.rs
+++ b/crates/bullet_hip_backend/build.rs
@@ -62,6 +62,7 @@ fn build_cuda(out_path: &Path) {
         .cudart("shared")
         .debug(false)
         .opt_level(3)
+        .flag("-arch=native")
         .files(&[KERNELS])
         .out_dir(out_path)
         .compile("libkernels.a");
candid ivy
violet badger
#

I think without this option, it takes the highest sm that the compiler supports..

#

which might include unknown ptx

formal smelt
#

it defaults to sm_52

violet badger
#

well that is probably ancient enough as well 😉

candid ivy
#

i think my 4090 is 80 or something?

formal smelt
#

it would be good to get data points on if this is a measurable speedup

candid ivy
#

let me test this

formal smelt
#

im not seeing a speedup locally but i have a crap gpu

upbeat pewter
#

I feel like "necessary for building on GH200 [even if obscure] and not a slowdown" is sufficiently convincing, tbh

formal smelt
#

yeah i'll merge it in

violet badger
#

superbatch 1 | time 8.8s | running loss 0.037813 | 11413254 pos/sec | total time 10.7s

#

simple working..

formal smelt
#

looks dataloader bottlenecked

#

nice

violet badger
#

you expect that bottleneck to be the filesystem or something of the code?

formal smelt
#

if you're loading bulletformat then its filesystem

violet badger
#

I guess this binpack is small enough to store in RAM.

#

no binpack

formal smelt
#

for binpack loading it's the shuffling step

candid ivy
formal smelt
#

which would be replaced by fen skipping aggressively

candid ivy
violet badger
#

not yet, can do now

formal smelt
#

okay i've merged the native thing and rebased sf-arch-i-think branch

violet badger
upbeat pewter
#

then yeah, probably bottlenecked on the data loader

formal smelt
#

or even preparing the data

#

you can try increasing the threads for each

upbeat pewter
#

(it was a missed opportunity to not call loaders magazines /j)

formal smelt
#

but simple is, well, very simple and tiny arch

violet badger
#

increased both to 16

#

Worse superbatch 1 | time 12.4s | running loss 0.037861 | 8092996 pos/sec | total time 14.2

formal smelt
#

probably increasing the loader threads doing that

#

i'd guess the bottleneck is the shuffling step

#

which is single threaded

violet badger
#

16/4 also worse superbatch 1 | time 14.1s | running loss 0.037859 | 7109966 pos/sec | total time 15.7s

formal smelt
#

damn

violet badger
#

4/16 roughly equal superbatch 1 | time 9.2s | running loss 0.037878 | 10833312 pos/sec | total time 11.0s

formal smelt
#

interesting

violet badger
#

4/4 best superbatch 1 | time 8.6s | running loss 0.037861 | 11631975 pos/sec | total time 10.3s

formal smelt
#

this is just a 768->128x2->1 network tbf
could you try the advanced example on branch sf-arch-i-think

violet badger
#

ok, let me swap branches.

formal smelt
#

nice

upbeat pewter
formal smelt
#

luckily as long as you aren't actually hitting the data loader bottleneck, it doesn't matter how fast/slow it is

#

so as long as less than 11m pos/sec on a real arch, all should be good

violet badger
#
superbatch 1 [75.0% (768/1024 batches, 1638909 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }
#

some panic...

#

hmm, but not every time

#
superbatch 1 | time 10.2s | running loss 0.018962 | 1647798 pos/sec | total time 11.8s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 19s
Eval: 68.779cp
formal smelt
#

yeesh

formal smelt
#

i would guess its something like if the binpack can't fill the shuffle buffer

violet badger
#

it is a big binpack..

candid ivy
violet badger
#

and another:

superbatch 1 [62.5% (640/1024 batches, 1635149 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }
formal smelt
#

oh i see the bug

violet badger
#

ready to pull when you are 😉

formal smelt
#

its sending the message to stop one of the loader threads twice

#

and on the second time it errors because the thread isn't around anymore to receive it
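
The failure mode described here (a second stop message sent after the receiving thread has gone) is easy to reproduce with the standard library alone. This is a sketch of the general mechanism, not bullet's loader code; `send_stop_twice` is a hypothetical name.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends a stop signal twice to a worker that exits after the first one.
/// Returns whether each send succeeded.
fn send_stop_twice() -> (bool, bool) {
    let (tx, rx) = mpsc::channel::<()>();
    let worker = thread::spawn(move || {
        // The worker stops on the first message; `rx` is dropped on return.
        let _ = rx.recv();
    });
    let first = tx.send(()).is_ok();
    // After join() the receiver no longer exists.
    worker.join().unwrap();
    // This send returns Err(SendError), so calling unwrap() on it would panic.
    let second = tx.send(()).is_ok();
    (first, second)
}
```

Handling the `Err` case, or sending the stop message only once, avoids the panic.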

#

oh wait

formal smelt
violet badger
#

trying...

#

ok, 3x without errors

formal smelt
#

yay

#

should give some profiling info

violet badger
#

sure, I see the GPUs 75% idle 😉

upbeat pewter
#

we need bigger batches /lh

violet badger
#

no, just kidding, it is using 1/4 GPUs

candid ivy
#

how much ram do you have?

#

& gpu ram?

formal smelt
violet badger
#

confused, but I think 4x (117GB + 94GB) usable

#

855 GB total

candid ivy
#

that 211GB probably is able to fit the entire dataset of people into ram 😛

violet badger
#

well Sf datasets are large

#

and it is seemingly not the bottleneck

#

profiling reduces a little bit the performance

formal smelt
#

interesting ratio
| Node 21 = SparseAffineDualActivate 1367 4208 | ~3.08
compared to mine
| Node 21 = SparseAffineDualActivate 37160 87121 | ~2.34

#

i suspect the aligned loads on the forward pass are going extremely hard

#

well this has been interesting

#

GH200 is insane

upbeat pewter
#

the chip's worth like a year's wages for me >.>

violet badger
#

it is quite nice indeed... if there is anything I can still measure let me know, thanks for the help getting it to run.

formal smelt
#

i have some upcoming optimisations that will hopefully help this arch a lot in particular

violet badger
#

I'd be more than happy to test a multi-gpu implementation if it appears 🙂

#

ok, let me know.

formal smelt
#

might just stick with the current cudarc version and not touch the unsound part

#

too many things to do, not enough free time

violet badger
#

relatable ..

violet badger
#

so, I think I found back the old nnue-pytorch benchmarks I did #nnue-dev message

#

now, can we match that to what we just ran with bullet?

upbeat pewter
#

I have no idea what a pytorch iteration is :p

violet badger
#

I think it is the same 16384 batch ...

upbeat pewter
#

39.66 × 16384 = 649789.44

#

so that's 650k positions/s for 2560 L1
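
The conversion used above is just iterations per second times batch size. A small sketch, assuming (as the chat does) that one nnue-pytorch iteration is one batch of 16384 positions; `positions_per_second` is a hypothetical helper.

```rust
/// Converts a progress-bar rate (iterations per second) into positions per
/// second, assuming each iteration processes exactly one batch.
fn positions_per_second(iterations_per_second: f64, batch_size: u32) -> f64 {
    iterations_per_second * f64::from(batch_size)
}
```

The same helper applies to the 5.32it/s figure quoted earlier, which comes out at roughly 87k pos/sec.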

violet badger
#

I think so as well.

candid ivy
#

that's also the math jw did earlier

violet badger
#

so, probably could run bullet for that size to compare better

upbeat pewter
#

which means bullet trains 2.4x faster on an even larger L1 :p

violet badger
#

though the dataskipping will be different... and that might be limiting. idk

#

2560 superbatch 1 | time 8.6s | running loss 0.019224 | 1960857 pos/sec | total time 10.3s

upbeat pewter
#

It's also probably not entirely equivalent because bullet uses superbatches

violet badger
#

so roughly 3x..

#

but yes, comparison might be slightly off.

#

Still makes a big difference.

formal smelt
violet badger
#

ok, so would be interesting to see a bullet trained net match SF master net.. obviously a non-trivial exercise.

#

though a 3x speedup would help doing that.

formal smelt
#

as a fun additional note, if you decreased the number of the buckets the gap between the two would most likely grow quite significantly

#

it would take ages to recreate an entire run yeah

#

one stage probably reasonable

violet badger
#

well, with this kind of speedups goes probably rather quick, less than a day for a stage.

formal smelt
#

oh i didn't realise individual stages were so short

violet badger
#

I think so, but I've a bit forgotten how long we train one stage.

#

@round stone will know

candid ivy
#

well at first id like to see someone successfully load such a net into sf and get somewhere close like 200 elo range or something

violet badger
#

sure..

candid ivy
#

and when that is "public" then im sure people will naturally play around and try it

violet badger
#

I'm assuming that if the net arch is the same that might be more or less doable?

round stone
#

1st stage - 400 superbatches
2nd-11th stage ~ 800 superbatches each

violet badger
#

so 10s per superbatch right now..

candid ivy
#

the binary format is slightly different, like no header, (anything else?) and then leb128 and permutation i guess
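
For reference, generic signed LEB128 looks like the sketch below. It illustrates the variable-length integer encoding only; the exact framing Stockfish uses around its compressed sections is not shown in this chat and is not reproduced here. Both function names are hypothetical.

```rust
/// Encodes a signed integer as signed LEB128 (7 payload bits per byte,
/// high bit set on every byte except the last).
fn encode_sleb128(mut value: i64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift keeps the sign
        let sign_bit_clear = byte & 0x40 == 0;
        // Stop once the remaining bits are pure sign extension.
        if (value == 0 && sign_bit_clear) || (value == -1 && !sign_bit_clear) {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Decodes one signed LEB128 value; returns the value and bytes consumed,
/// or None if the input is truncated or too long for an i64.
fn decode_sleb128(bytes: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in bytes.iter().enumerate() {
        if shift >= 64 {
            return None;
        }
        result |= i64::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Sign-extend when the last payload byte has its sign bit set.
            if shift < 64 && byte & 0x40 != 0 {
                result |= -1i64 << shift;
            }
            return Some((result, i + 1));
        }
    }
    None
}
```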

formal smelt
#

i reduced batches/superbatch because otherwise it would take forever for me to run it lol

round stone
#

1 bullet superbatch = 1 nnue-pytorch epoch

formal smelt
#

right, ye

round stone
#

both are ~100 million samples

formal smelt
#

the sf-arch-i-think advanced example is doing ~16.78m samples per superbatch

round stone
#

nnue-pytorch training speed is much slower due to all the skipping going on

#

it's a tradeoff for strength

formal smelt
#

im confused as to how it has an effect
in bullet you are either hard limited by data loading speed, or not limited at all

#

because all that is done asynchronously from training

#

so you're either waiting on batches to be sent or drawing from a pre-prepared queue with no delay
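
The "all or nothing" behaviour described here falls out of a bounded producer/consumer queue. A minimal sketch using only the standard library; it is not bullet's actual loader, and `train_with_prefetch`, `queue_depth` and the dummy batch contents are assumptions for illustration.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Runs a loader thread that fills a bounded queue while the caller consumes
/// batches. Returns the number of batches consumed.
fn train_with_prefetch(num_batches: usize, queue_depth: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u32>>(queue_depth);
    let loader = thread::spawn(move || {
        for i in 0..num_batches {
            let batch = vec![i as u32; 4]; // stand-in for a prepared batch
            // send() blocks while the queue is full, so a fast loader simply
            // idles and never slows the consumer down.
            if tx.send(batch).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the consumer's loop below.
    });
    let mut consumed = 0;
    // The iterator blocks only when the queue is empty, i.e. when the loader
    // is the bottleneck; otherwise batches are available immediately.
    for _batch in rx {
        consumed += 1;
    }
    loader.join().unwrap();
    consumed
}
```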

round stone
#

all i know is disabling nnue-pytorch piece count probability skipping makes training a lot faster, but also weaker

violet badger
#

yes, skipping in nnue-pytorch is definitely slowing it down.

formal smelt
#

what cli args do i need to pass to disable fen skipping entirely?

round stone
#

can't disable piece count skipping with args. have to modify the source code

#

otherwise maybe --no-wld-fen-skipping and --random-fen-skipping 0 in addition

formal smelt
violet badger
#

yes

#

It is a while ago, but we're skipping a lot of fens in these runs.

formal smelt
#

i'm mostly trying to understand how the data loader would be implemented if its effect on speed isn't all or nothing

violet badger
formal smelt
#

some of those filters are also being done in bullet, and the random skipping and stuff can be disabled with cli args i think

#

188/6166 [00:54<28:45, 3.47it/s, loss=0. with defaults
187/6166 [00:56<30:19, 3.29it/s, loss=0.0128, v_num=7 with the piece count skipping removed and additional --no-smart-fen-skipping --no-wld-fen-skipping --random-fen-skipping 0 args

#

there's some variance between runs it seems

candid ivy
#

Needs a bit to warmup too

formal smelt
#

let me try a smaller net

#

on a very small net i can observe the training getting stuck cool

#

so i think it works how i expect it to

rocky vigil
#

on my laptop it's like ~5% for 4 thread speedtest

round stone
rocky vigil
#

Hmm I guess it’s ~neutral then

round stone
#

you can also try running an STC on fishtest to see if a speedup is detected there

rocky vigil
#

I mean all it does is use a pre-existing array to write the accumulator to and pass to L2

#

Instead of declaring a new one

#

So I’m not too keen on doing a whole STC

lofty cedar
#

How's it going?

#

Never heard an idea taking this long...

naive comet
#

not even NNUE?

lofty cedar
#

NNUE was long ago... not sure what happened back then. It was something like someone bolted NNUE to Stockfish and it won, then Stockfish decided to slap NNUE onto Stockfish.

#

But the time to first gain was probably really quick given that classical evaluation was comparatively weak compared to NNUE.

naive comet
#

thats not even close

lofty cedar
#

Really? I thought the NNUE vs classical was a cakewalk.

lofty cedar
naive comet
#

it is not that simple

#

you must keep in mind that sf hce is still 100s of elo stronger than any other hce around

twilit oriole
#

There's many things going on at once here (like bullet transition also) and the pipeline is complex after 5 years of optimisation

lofty cedar
#

Speaking of it, maybe this is the time to try something?

#

Might as well try KAN?

formal smelt
#

KANs are generally overrated

#

but yes, it could be tried

#

the advantage of this idea is that there already exists a strong network that indicated it could work (the monty value network)

#

and the implementation just requires training an NNUE version and working out how to best optimise the efficient updates

#

on the other hand there does not exist any KAN chess network of notable strength

lofty cedar
#

That being said, I tried KAN training and it failed to even beat master at loss.

formal smelt
#

yes but obviously we would need a serious attempt at it to actually rule it out

#

your naivety is shown by comparing loss to master on the first networks you try to train...

#

secondly do you have a plan for UE or some ~equivalent with a KAN

#

a suggested architecture

#

etc

lofty cedar
#

It was despite my network having a higher compute budget. I didn't completely rule it out, but I didn't deem it worth it.

lofty cedar
#

Only the second layer onward would be KAN.

#

That being said, I only used the old feature transformer, not trained a whole new net.

lofty cedar
#

What I found out is that most didn't even stay in the range where nontrivial behavior would occur.

#

So, it was approximately an MLP anyway.

lofty cedar
# naive comet u sure about this?

Yeah... it was an amateur-ish attempt. I didn't clean the training data and so on, so don't take this as anything more than 2 cents.

rocky vigil
#

btw when shawn's nnue refactor is merged I'll also have to do a major rebase on that (I might also try to figure out incremental attack tables during this, so it might take a while)

#

but I'll leave the current branch for testing

rocky vigil
rocky vigil
#

Interesting

#

LTC ~ STC

rocky vigil
#

I thought it would scale a bit more

#

Maybe not being multilayer plays a part in this idk

rocky vigil
candid ivy
naive comet
#

bullet ordering is pnbrqkpnbrq but SF ordering is pnbrqpnbrqk for halfkav2_hm

#

is that it?

rocky vigil
#

ok now that refactor is here in the next week or so I'll try to rebase everything

#

if multilayer works though and a new net is trained I can still make quick updates on old branch

rocky vigil
candid ivy
rocky vigil
#

oh huh

rocky vigil
candid ivy
#

well that is only needed for one layer right now.. and i do that otherwise it wouldnt even load into sf

rocky vigil
#

wait so it loads correctly but doesn't inference

#

is that the issue

candid ivy
#

yeah

#

maybe i padded it wrong but dont think so

#

the psqt subnet is already correct

candid ivy
#

yeah well ik

rocky vigil
#

btw what are the hash values in the net intended to do

candid ivy
#

just some verification things that the loaded net is supported in this arch, not really important

rocky vigil
#

btw have you tried checking if write_parameters returns the same file as what you inputted

candid ivy
#

no but i don't see how that wouldn't be the case

rocky vigil
#

but if everything is being read correctly and inference is still not working that means it's more likely to be an issue with bullet no?

candid ivy
#

well reading it into an array is one thing, but the more important thing is that the layout of the weights for example is correct

rocky vigil
#

oh wait are you saying the array sizes could match but the layout might not

candid ivy
#

yeah well the array sizes currently definitely match

#

so must be something in the layout which isn't in the way sf expects it

rocky vigil
#

btw I find it funny how read_parameters is called on the activation functions even though they just return true

#

affine transform weights are written row major I think
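To illustrate the point that array sizes can match while the layout does not, here is a small sketch that converts a row-major weight matrix to column-major. It makes no claim about which layout bullet or SF actually uses for a given layer; `row_major_to_col_major` is a hypothetical helper.

```rust
/// Reorders a `rows x cols` matrix stored row-major (element (r, c) at
/// `r * cols + c`) into column-major order (element (r, c) at `c * rows + r`).
/// The output has exactly the same length as the input.
fn row_major_to_col_major<T: Copy>(src: &[T], rows: usize, cols: usize) -> Vec<T> {
    assert_eq!(src.len(), rows * cols, "buffer size must match dimensions");
    let mut dst = Vec::with_capacity(src.len());
    for c in 0..cols {
        for r in 0..rows {
            dst.push(src[r * cols + c]);
        }
    }
    dst
}
```

Both buffers pass a size check, so a loader that only validates lengths cannot tell them apart; only inference results reveal the mismatch.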

candid ivy
#

well if you want to try and fix it, you can just give this repo a try, i can send you a bullet checkpoint as well if needed

rocky vigil
#

ok sure

#

do you still have the bullet header info that describes how bullet outputs the weights

candid ivy
#

there's some info here #1351682122162634796 message and in the normal channel too i think

rocky vigil
#

eh why is it 1 GB

#

oh I see

#

can I also get the bullet config used to train this net

candid ivy
#

clone the linked repo

rocky vigil
#

ah it has it ok

candid ivy
#

and checkout the quant-pst-old-correct-pst-values branch

#

and the quantised nets need to be converted using the python script: python convert_quantised_to_pytorch.py ./checkpoints/halfkav2_hm-stm/test-80/quantised.bin bullet.nnue

rocky vigil
#

@stray reef how did you end up inferencing threat inputs? Did you add incremental attack tables?

#

Also if you have the net somewhere I’ll try to get it to work so fixed nodes can be run

stray reef
#

I have one version with and one version without incremental attack tables. The speedup was around 15-20%

rocky vigil
#

Speedup at which L1 btw

stray reef
#

I do have the net, it's not a SF arch though, I can send it when I'm at my pc again

rocky vigil
#

Alright that’s fine I’ll try to get it to work

stray reef
rocky vigil
#

Wait 15% speedup at that size is really good

stray reef
#

It's still a lot slower than my master net though (1.4M nps vs 2.3M nps), and "only" +30 fixed nodes

rocky vigil
stray reef
#

i haven't yet looked at small threat inputs, maybe that has less updates, that could help

rocky vigil
#

Simplified and full are ~ the same speed

stray reef
#

rip

rocky vigil
#

And we know the +30 fixed nodes elo approximately translates into similar STC diff

stray reef
rocky vigil
#

Yes single layer with 8 output buckets

#

Btw linrock’s simplified threats (15776 -> 1024)x2 -> 1x8 is known to be around -80 fixed nodes to sf master

#

But there is still a lot of experimentation remaining

stray reef
#

My current feeling is that the amount of extra feature updates makes it sort of infeasible. But I will try some multilayer architectures with small threat inputs, with an L1 that's similar speed to my master net, and see what happens.

rocky vigil
#

Btw have you measured the number of feature updates

#

I get it’s ~ 8 per color per node in sf in the midgame

stray reef
#

Though on the other hand, linrock's (80624->256)x2 -> 1x8 net matched my net at fixed nodes 😅

daring wren
stray reef
daring wren
#

that's insane

stray reef
#

Maybe small hidden layers and multilayer are the way to go here

#

i also had a 384 HL net that was very half-assed and matched my master net too

rocky vigil
#

If you run a long search with my branch and type eval it’ll tell you the total number of accumulator updates

stray reef
#

already transposed and quantised (255, 64)

#

ah and the output bucket formula is the one from bullet, not (piececount - 1) / 4

stray reef
upbeat pewter
#

(I will just continue memeing that attack tables and mailboxes are slow)

finite wind
stray reef
#

i trained it for plentychess, but it could be used with any engine

rocky vigil
finite wind
stray reef
#

it won't work with any official stockfish version or development build. I'm trying to figure out myself how to use it with stockfish

finite wind
#

Ok. I will try with other strong engines

stray reef
#

probably worded that previous message badly. this is an experimental network architecture. given the right modifications to the code, it could be used with any engine

#

but you won't find any engine in which that net will work with no modifications

finite wind
#

I'm always looking for huge nets for AB engines

stray reef
#

one good thing about these types of nets though, the lack of input buckets makes them incredibly fast to train

rocky vigil
#

btw is bullet just (piececount - 2) / 4

#

I actually have (piececount - 1) / 4 rn lemme check if we're missing some elo bc of that
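A quick sketch comparing the two bucket formulas mentioned here for 8 output buckets. It assumes integer division and a piece count between 2 and 32, and takes the `(piececount - 2) / 4` form for bullet as asked above rather than as a confirmed fact. The function names are hypothetical.

```rust
/// Output bucket using `(piececount - 1) / 4`. Requires `piece_count >= 2`.
fn bucket_minus_one(piece_count: usize) -> usize {
    (piece_count - 1) / 4
}

/// Output bucket using `(piececount - 2) / 4`. Requires `piece_count >= 2`.
fn bucket_minus_two(piece_count: usize) -> usize {
    (piece_count - 2) / 4
}

/// Piece counts in 2..=32 for which the two formulas pick different buckets.
fn disagreeing_counts() -> Vec<usize> {
    (2..=32)
        .filter(|&n| bucket_minus_one(n) != bucket_minus_two(n))
        .collect()
}
```

The two formulas disagree for piece counts 5, 9, 13, 17, 21, 25 and 29, so a net trained with one and run with the other would use the wrong output bucket in roughly a quarter of material configurations.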

violet badger
rocky vigil
#

I guess that explains why output buckets weren't so great compared to single layer initially

rocky vigil
stray reef
upbeat pewter
stray reef
#

just trained a net with these settings

const HIDDEN_SIZE: usize = 512;
const SCALE: f32 = 400.0;

fn main() {

    let mut trainer = TrainerBuilder::default()
        .optimiser(optimiser::AdamW)
        .loss_fn(Loss::SigmoidMSE)
        .input(ThreatInputsSimple)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(HIDDEN_SIZE)
        .activate(Activation::CReLU)
        .add_layer(16)
        .activate(Activation::SCReLU)
        .add_layer(32)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build();

    let start_epoch = 1;
    let experiment_name = "0087";

    let output_path = format!("/mnt/d/Chess Data/Selfgen/Training/{}", experiment_name);
    let settings = LocalSettings {
        threads: 4,
        output_directory: &output_path.as_str(),
        test_set: None,
        batch_queue_size: 512,
    };
    let data_loader: loader::DirectSequentialDataLoader = loader::DirectSequentialDataLoader::new(&["/mnt/d/Chess Data/Selfgen/interleaved.data"]);
    let schedule = TrainingSchedule {
        net_id: format!("net-{}", experiment_name).to_string(),
        eval_scale: SCALE,
        steps: TrainingSteps {
            batch_size: 16_384,
            batches_per_superbatch: 6104,
            start_superbatch: start_epoch,
            end_superbatch: 500,
        },
        wdl_scheduler: wdl::ConstantWDL { value: 0.5 },
        lr_scheduler:  lr::CosineDecayLR { initial_lr: 0.001, final_lr: 0.001 * 0.3 * 0.3 * 0.3 * 0.3, final_superbatch: 500 },
        save_rate: 10,
    };

    trainer.set_optimiser_params(optimiser::AdamWParams {
        decay: 0.01,
        beta1: 0.9,
        beta2: 0.999,
        min_weight: -0.99,
        max_weight: 0.99,
    });

    trainer.run(&schedule, &settings, &data_loader);
}
#

apart from the arch, that pretty much matches the first stage of my master net (except 420 SBs instead of 500)

upbeat pewter
#
        min_weight: -0.99,
        max_weight: 0.99,

...huh.

stray reef
# stray reef i also had a 384 HL net that was very half-assed and matched my master net too

scratch this. i must have messed up this test. maybe I accidentally tested my 384 vs linrock's 256. Or it's the difference between full and simple threat inputs.

Trained two simple threat inputs 512 L1 nets today (one single layer, one multilayer, aka -> 16 -> 32 -> 1) to see what L1 is needed to beat my master net at fixed nodes. The results were a bit disappointing.
512 with layers vs. main: Elo: -16.13 +/- 6.58, nElo: -23.63 +/- 9.63
512 single layer vs. main: Elo: -33.53 +/- 6.49, nElo: -50.06 +/- 9.63

I doubt that a 768 L1 net would be noticeably stronger than master at fixed nodes, so 896 or 1024 would be necessary at least (for my case, obviously). And for that I am really not sure if I can make it fast enough...
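For reading the Elo figures above, a small sketch of the standard logistic Elo formula that converts an Elo difference into an expected score. This is the textbook relation, not the normalised Elo (nElo) calculation, and `expected_score` is a hypothetical helper name.

```rust
/// Expected score (between 0 and 1) for a side whose rating differs from the
/// opponent's by `elo_diff` (negative means weaker), using the standard
/// logistic Elo model with a 400-point scale.
fn expected_score(elo_diff: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf(-elo_diff / 400.0))
}
```

Under this model, -16.13 Elo corresponds to an expected score of about 47.7% and -33.53 Elo to about 45.2%.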

rocky vigil
#

fixed nodes single layer full inputs should be ~30 elo better than simplified at the same L1 so a full threat input net could be noticeably stronger at L1=768

rocky vigil
#

yeah it looks like threat inputs need more careful data work as L1 gets larger

frosty imp
#

maybe it's the slowdown?

rocky vigil
#

slowdown is like 15%

frosty imp
#

that'd be around -30 Elo STC

#

so decent chance it passes LTC

rocky vigil
#

i mean the last time we did this it was 0 stc and 7 ltc

#

so like

#

yeah

#

it's negative stc now because the new L1=256 net is slightly better

#

which is more evidence that data work will be very important

rocky vigil
#

technologov machine dropped a massive diff

#

idk it feels like threat inputs somehow always attract high residuals

twilit oriole
#

I've already said why multiple times lol. It's cache effects because the net is being duplicated for each instance instead of shared

#

Memory accesses are far less predictable with threat inputs compared to king buckets, having a large portion of the net in cache is important

#

512 Vs 256 is not a 15% slowdown when using multiple instances with no sharing, this is all this is showing

#

In monty mmap was worth 40 Elo at SPRT conditions. And I wasn't even using hyperthreading like fishtest machines. And it wasn't even a threat input net (but it was non UE so still had cache issues). This stuff really matters a lot
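A rough sketch of the memory arithmetic behind the sharing argument. It assumes 2 bytes per feature-transformer weight (an assumption, though it reproduces the 157MiB reported later in this log for the (80624, 1024, 1) net) and counts only the feature-transformer weights. All function names are hypothetical.

```rust
/// Bytes in one mebibyte.
const MIB: usize = 1024 * 1024;

/// Size in bytes of a feature transformer with `inputs` features and `l1`
/// accumulator entries, at `bytes_per_weight` bytes per weight.
fn ft_weight_bytes(inputs: usize, l1: usize, bytes_per_weight: usize) -> usize {
    inputs * l1 * bytes_per_weight
}

/// Total weight memory competing for cache when `instances` engines run on
/// one machine: one copy if the net is shared, one copy each otherwise.
fn resident_bytes(per_net: usize, instances: usize, shared: bool) -> usize {
    if shared {
        per_net
    } else {
        per_net * instances
    }
}
```

For example, `ft_weight_bytes(80624, 1024, 2) / MIB` gives 157, and running several unshared instances multiplies that footprint by the instance count while a shared mapping keeps it constant.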

#

I always thought sharing would be absolutely necessary for this to pass, which is part of why testing at TC is a huge pain in the ass for now (and the results are useless, because the TC Elo diff from scaling L1 depends entirely on how optimized the speed is). I still want to see L1 scaled up and tested at fixed nodes. It's a far easier path to validate the idea: if an L1 1536 full threat net (with multilayer and output buckets) surpasses the master SF net, the idea is finally validated after all this time, and the speed can be fully sorted out afterwards.

rocky vigil
twilit oriole
#

@wide oasis is also up for helping with it I think

wide oasis
#

sure

twilit oriole
#

Yeah. I mean all the code for it already exists in the branches (training and inference), it's mostly a copy-paste job

wide oasis
#

what is needed from my end?

rocky vigil
#

Note that if you copy paste my inference code you lose quite a bit of speed over incremental attack tables

#

But it’ll be good for fixed nodes

rocky vigil
wide oasis
#

idk lol

#

i guess you'd replace nnue.cpp entirely

twilit oriole
#

Only the inputs need to be changed

#

From regular to threat

rocky vigil
#

note that there is no multilayer threat input net yet right now

twilit oriole
rocky vigil
#

I meant the actual net

twilit oriole
#

I got the GPUs if any training needs to take place after swapping the inputs in its config

rocky vigil
#

gabe gonna have to talk with jw about how to do it

#

or you i guess

#

obsidian uses float in later layers though right?

twilit oriole
#

Well it already trains multilayer with bullet. So whatever it does will work

#

@wide oasis what's your bullet training config?

#

And the Leela binpacks used

#

I'll just swap in the full threats and start training some nets I guess

rocky vigil
#

ok I think I should be able to get inference (inputs -> accumulator) to work but it might take a few days

wide oasis
#

isn't it quicker if I train the net with all the data ready?

twilit oriole
#

Think we need a new baseline anyways, might as well simplify to one stage. Will need to be using more data I think for both for fair comparison

stray reef
# rocky vigil btw are the branches available now since it looks like upstream sf nnue refactor...

Sorry, didn't get around to it until now.
Uploaded the branches and nets for a single layer ((80624 -> 1024)x2 -> 1x8) and multi layer ((80624 -> 2048)x2 -> (16 -> 32 -> 1)x8) net now. They are +30 and +55 to my master net at fixed nodes, respectively, but far too slow even with the current UE impl.
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full-layers

#

Changing to simplified threat inputs is super easily done in threat-inputs.h/cpp, though it seems I lost my small threat inputs nets and therefore can't test this with a working network at the moment

rocky vigil
#

Hmm 1024 to 2048 and multilayer fixed nodes gain seems surprisingly low

#

How much data do you have

#

If we assume that a (768->N) net needs around N million positions of data then you would need around 50B positions to have a similar input saturation for (80624 -> 2048)

stray reef
#

I could take some older positions that are generated with fewer nodes, maybe that still works together on such a big net

#

that would put me at around 13B altogether

rocky vigil
#

yeah I'm using smth like N*M*16K / (average number of features per position) to estimate the order of magnitude of data required
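A sketch of that order-of-magnitude estimate, taking N as the L1 size, M as the number of inputs and "16K" as 16,000. The average number of active features per position is a free parameter here, and any value plugged in is a placeholder rather than a measurement. Both function names are hypothetical.

```rust
/// Estimated positions needed: `N * M * k / avg_active_features`, where
/// `inputs` is M, `l1` is N and `k` is the constant quoted as 16K.
fn estimated_positions(inputs: u64, l1: u64, k: f64, avg_active_features: f64) -> f64 {
    (inputs * l1) as f64 * k / avg_active_features
}

/// Inverse of the estimate: the average active feature count that would make
/// the formula yield `positions` for the given net shape.
fn implied_avg_features(inputs: u64, l1: u64, k: f64, positions: f64) -> f64 {
    (inputs * l1) as f64 * k / positions
}
```

With these assumptions, `implied_avg_features(80624, 2048, 16_000.0, 50e9)` comes out near 53, so the 50B figure quoted above corresponds to roughly 53 active features per position under this formula.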

stray reef
#

50B is definitely not feasible for me, that's for sure. And I think that goes for all selfgenners (apart from maybe Jay)

rocky vigil
#

I'll test (fixed nodes) the 1024 -> 1x8 net vs linrock's 512 which I still have somewhere

rocky vigil
stray reef
#

Small threat inputs are less data starved, but they are not faster so they're not really useful

rocky vigil
#

for your impl approximately when do you start to see significant slowdown

stray reef
#

what L1?

rocky vigil
#

yeah

#

in mine 256 -> 512 is around 15% slowdown

#

but it was quite slow to begin with

#

btw does 0084.bin have the bullet padding trimmed at the end

stray reef
#

mh I tested a 512 L1 morelayers net that was slower than master and worse at fixed nodes. I haven't tested the incremental speed differences yet

rocky vigil
#

how fast is sf master (or 17.1), your master and 512 L1 (in nps)

stray reef
#

this makes generating verbatim nets also very easy

rocky vigil
#

ok I wasn't sure if the 0000 bytes at the end were padding or part of the net

#

thanks

rocky vigil
#

I think they are? from reading process_net.cpp

#

and what is the quantization

#

bc I'm getting
info string NNUE evaluation using nn-239c9dddf51e.nnue (157MiB, (80624, 1024, 1))
info depth 1 seldepth 2 multipv 1 score cp 931 nodes 20 nps 20000 hashfull 0 tbhits 0 time 1 pv h2h4
info depth 2 seldepth 3 multipv 1 score cp 330 nodes 103 nps 51500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 3 seldepth 4 multipv 1 score cp 586 nodes 127 nps 63500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 4 seldepth 5 multipv 1 score cp 679 nodes 239 nps 79666 hashfull 0 tbhits 0 time 3 pv b1a3 a7a6
info depth 5 seldepth 6 multipv 1 score cp 757 nodes 341 nps 113666 hashfull 0 tbhits 0 time 3 pv b1a3 b7b6 h2h4 a7a6
info depth 6 seldepth 7 multipv 1 score cp 420 nodes 457 nps 114250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3
info depth 7 seldepth 8 multipv 1 score cp 396 nodes 485 nps 121250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 b7b6 f2f3
info depth 8 seldepth 8 multipv 1 score cp 22 nodes 614 nps 153500 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 c6c5 f2f3
info depth 9 seldepth 9 multipv 1 score cp 115 nodes 670 nps 134000 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6
info depth 10 seldepth 12 multipv 1 score cp 225 nodes 711 nps 142200 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6 b1c3 c6c5
info depth 11 seldepth 22 multipv 1 score cp 216 nodes 3113 nps 283000 hashfull 2 tbhits 0 time 11 pv d2d3 a7a6 f2f4 c7c6 h2h4 d8a5 d1d2 h7h5 d2a5 g8h6 a5h5
info depth 12 seldepth 18 multipv 1 score cp 251 nodes 14188 nps 405371 hashfull 9 tbhits 0 time 35 pv e2e3 a7a6 d2d4 d7d6 f1a6 b8a6 c2c3 g7g6 d1d3 a6b4 d3b5 c7c6
info depth 13 seldepth 15 multipv 1 score cp 128 nodes 21150 nps 406730 hashfull 10 tbhits 0 time 52 pv e2e3 a7a6 d2d4 c7c6 b1c3 e7e6 d1d3 g8f6 d3c4 h7h5
info depth 14 seldepth 26 multipv 1 score cp 45 nodes 175351 nps 475205 hashfull 79 tbhits 0 time 369 pv c2c3 d7d5 e2e4 e7e5 a2a3 f8b4 g1e2 c8e6 e2f4 e5f4 a3b4 b8d7

#

which seems very wrong

stray reef
# rocky vigil btw are the output buckets transposed

they are not transposed in 0084.bin, but they will be transposed before being baked into the engine in transposePermuteNetwork(). If you compile normally, there will be a temporary processed.bin which is the file that's baked into the engine

rocky vigil
#

ohh

stray reef
rocky vigil
#

oh so QA = 510, QB = 64

#

I see

#

ok and scale is 400 right

#

(linrock uses 340)
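A minimal sketch of why these constants matter for the reported score. It assumes the common convention that the final integer output is in units of 1 / (QA * QB) and is multiplied by the eval scale to get centipawns. Any extra division some activations need is left out, and `raw_to_cp` is a hypothetical name.

```rust
/// Converts a raw integer network output to centipawns, assuming the output
/// is quantised by `qa * qb` and `scale` is the training eval scale.
fn raw_to_cp(raw: i64, qa: i64, qb: i64, scale: i64) -> i64 {
    raw * scale / (qa * qb)
}
```

Under this convention a raw output of `510 * 64` maps to 400 cp with QA = 510, QB = 64 and scale 400, to 340 cp with scale 340, and to 800 cp if QA is mistakenly taken as 255, so a wrong constant rescales every score by a fixed factor.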

stray reef
rocky vigil
#

hmm -10% to master

stray reef
#

plenty is 2.1M nps on this machine. so that's quite hard to get close to

rocky vigil
#

for comparison on my machine sf 1024 is -33% to master

stray reef
#

ok sounds like my impl is faster then. that adds up, since I gained quite a lot of speed from the incremental threat calculations

#

I guess an L1 512, full threat inputs morelayers with 13B positions should beat my master net at fixed nodes. But it'll still be relatively slow. And there's also an issue in bullet where threat input nets with morelayers and pairwise produces only dead nets on init... perhaps @formal smelt has some idea what would have to be done to improve that, since pairwise is quite an important speedup

#

i'm gonna sign off for today though

rocky vigil
#

ok yeah