UE Threat Inputs for AB | Stockfish | Page 3

rocky vigil Mar 12, 2025, 6:53 PM

#

I should’ve guessed

#

What is memmove avx unaligned

desert tree Mar 12, 2025, 6:57 PM

#

memmove is memcpy but it works for overlapping regions
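
To make the overlap point concrete, here is a small Rust sketch. It is an illustration only, not code from the Stockfish branch, and the helper name `shift_right_by_two` is made up. `slice::copy_within` is documented to have memmove semantics, so the source and destination ranges may overlap.

```rust
/// Shifts the first `len - 2` elements two slots to the right, in place.
/// The source range [0, len-2) and the destination range [2, len) overlap,
/// which is the case memmove handles and memcpy does not.
fn shift_right_by_two(buf: &mut [i16]) {
    let n = buf.len();
    if n > 2 {
        // copy_within allows overlapping ranges (memmove-like behaviour).
        buf.copy_within(0..n - 2, 2);
    }
}
```

Calling it on `[1, 2, 3, 4, 5, 6]` leaves `[1, 2, 1, 2, 3, 4]`: the old prefix is moved intact even though it overlaps its destination.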

rocky vigil Mar 12, 2025, 6:57 PM

#

L1=1024 4 thread 397434

#

Anyways these speed losses shouldn’t be on the order of 60-100 STC elo

#

Welp

candid ivy Mar 12, 2025, 7:11 PM

#

this one is from a speedtest profile, with the 1024 net

#

(ah but this already with some simd which nets an additional ~8%)

twilit oriole Mar 12, 2025, 7:38 PM

#

Well it's not so simple. The speed characteristics depend on game phase and a speed loss in each phase is likely worth different Elo amount in each

#

Only way here is to measure Elo really

#

Like a L1 1536 threat net is way slower in openings and way faster in endgames

#

Is that speed figure with separate SF instances running on each thread?

#

That's very important...

rocky vigil Mar 12, 2025, 7:56 PM

#

Btw can we compare 1024 to master

#

It should get quite close in fixed nodes…

twilit oriole Mar 12, 2025, 7:58 PM

#

I don't think threat inputs pass even in monty without shared net weights between the instances. It's quite possible that is the discrepancy

rocky vigil Mar 12, 2025, 8:03 PM

#

candid ivy (ah but this already with some simd which nets an additional ~8%)

btw if you have SIMD could you share it so that I can pull

rocky vigil Mar 12, 2025, 8:03 PM

#

twilit oriole Is that speed figure with separate SF instances running on each thread?

Yeah this is 4 thread 1 instance

#

It shouldn’t matter too much for 256 and simplified threats but if we move to full threats and make the net 3x as large as master then it will be an issue…

twilit oriole Mar 12, 2025, 8:06 PM

#

Eh I think it definitely could matter still

#

Need to test

candid ivy Mar 12, 2025, 8:06 PM

#

rocky vigil btw if you have SIMD could you share it so that I can pull

Eh it’s only messed up avx2

rocky vigil Mar 12, 2025, 8:06 PM

#

Well I think proper simd should gain significantly over autovec

#

Especially since screlu-affine seems a lot slower than I hoped

#

As well as memmove
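
For reference, a scalar sketch of what a "screlu-affine" output layer computes for a single-layer net. The constants 255, 64 and 340 are the quantisations and eval scale reported for these nets later in the thread. The function names and the "us then them" weight layout are assumptions for illustration, not the branch's actual code.

```rust
const QA: i32 = 255; // accumulator quantisation factor
const QB: i32 = 64; // output-weight quantisation factor
const SCALE: i32 = 340; // eval scale

/// Squared clipped ReLU on a quantised accumulator value.
fn screlu(x: i16) -> i32 {
    let c = i32::from(x).clamp(0, QA);
    c * c
}

/// Output layer over both perspectives; `weights` holds the "us" half
/// followed by the "them" half (assumed layout).
fn screlu_affine(us: &[i16], them: &[i16], weights: &[i16], bias: i16) -> i32 {
    assert_eq!(weights.len(), us.len() + them.len());
    let mut sum: i32 = 0;
    for (&x, &w) in us.iter().chain(them.iter()).zip(weights) {
        sum += screlu(x) * i32::from(w);
    }
    // Squaring introduces one extra factor of QA, removed before the bias.
    (sum / QA + i32::from(bias)) * SCALE / (QA * QB)
}
```

The per-element multiply of a squared value is what makes this loop heavier than a plain clipped ReLU, and it is the part a hand-written SIMD version would target.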

#

can I get smth like this for full inputs instead
Simplified inference already has worked for a long time

formal smelt Mar 12, 2025, 8:16 PM

#

I will but I’m not available for at least a few hours

rocky vigil Mar 12, 2025, 8:16 PM

#

Ok yeah I won’t be able to do things with rust for a while as well

#

Until I get out of school

round stone Mar 12, 2025, 8:41 PM

#

rocky vigil Btw can we compare 1024 to master

Results of ./threat-inputs/stockfish-master-mar12 vs ./threat-inputs/mar11-sscg13-1024-profile-build (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 80.69 +/- 7.18, nElo: 129.42 +/- 11.12
LOS: 100.00 %, DrawRatio: 36.30 %, PairsRatio: 4.17
Games: 3752, Wins: 1581, Losses: 725, Draws: 1446, Points: 2304.0 (61.41 %)
Ptnml(0-2): [39, 192, 681, 802, 162], WL/DD Ratio: 2.01
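
The Elo figure in a report like this follows from the score percentage through the usual logistic formula, taken from the point of view of the first engine listed. A minimal sketch (the function name is made up, and it ignores the pentanomial statistics used for the error bars):

```rust
/// Logistic Elo difference from the first engine's point of view.
fn elo_from_wdl(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = f64::from(wins + losses + draws);
    let score = (f64::from(wins) + 0.5 * f64::from(draws)) / games;
    -400.0 * (1.0 / score - 1.0).log10()
}
```

With the numbers above (1581 wins, 725 losses, 1446 draws) the score is 61.41 % and the formula gives roughly +80.7, so a positive value means the first-listed engine (master here) is ahead.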

#

STC of 1024 vs. master here:
https://tests.stockfishchess.org/tests/view/67d1f0a4166a3e8781d843cf

rocky vigil Mar 12, 2025, 8:44 PM

#

Ok maybe not that close lol

#

I think this STC will end up being -200 or so

round stone Mar 12, 2025, 8:48 PM

#

i think you're right, it should be positive elo if the engine on the left is stronger than the one on the right (stronger vs. weaker)

#

that's the only way for fishtest STC results to be consistent

twilit oriole Mar 12, 2025, 8:48 PM

#

Are these nets multilayer yet?

round stone Mar 12, 2025, 8:49 PM

#

these are all single-layer so far

#

i do have a multi-layer L1-256 sitting around

twilit oriole Mar 12, 2025, 8:50 PM

#

So a 1536 full threats with multilayer should beat master in fixed nodes by a lot... Which is good in theory

rocky vigil Mar 12, 2025, 8:51 PM

#

round stone i do have a multi-layer L1-256 sitting around

Oops I realized float inference might be hard

twilit oriole Mar 12, 2025, 8:51 PM

#

twilit oriole So a 1536 full threats with multilayer should beat master in fixed nodes by a lo...

Ngl I think reaching just this point would be great progress

rocky vigil Mar 12, 2025, 8:51 PM

#

I can easily make multilayer inference if and only if it matches what already exists in sf code

twilit oriole Mar 12, 2025, 8:52 PM

#

It should. i8 quant is possible in bullet I thought

rocky vigil Mar 12, 2025, 8:52 PM

#

Yeah I think more pressing concern is get full threats to work

round stone Mar 12, 2025, 8:52 PM

#

here's a training config jw says is similar to SF multilayer:
https://github.com/jw1912/bullet/blob/main/examples/advanced.rs

rocky vigil Mar 12, 2025, 8:53 PM

#

Optimistically this should happen in a few days

round stone Mar 12, 2025, 8:53 PM

#

it also has floats for inference in later layers

twilit oriole Mar 12, 2025, 8:53 PM

#

round stone here's a training config jw says is similar to SF multilayer: https://github.com...

@formal smelt why didn't u i8 quant the later layers?

rocky vigil Mar 12, 2025, 8:54 PM

#

I think it’s worth testing if quantization for later layers loses elo

twilit oriole Mar 12, 2025, 8:54 PM

#

Oh that's just an example lol

#

It's not made for this

round stone Mar 12, 2025, 8:54 PM

#

full threats would be a good baseline

rocky vigil Mar 12, 2025, 8:54 PM

#

But I think the greater issue is floating point arithmetic not being associative
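
A tiny demonstration of the associativity problem, as a sketch: the same three `f32` values summed in two different orders give different results. A SIMD float sum adds in a different order than a scalar loop, so this is how two builds can disagree, whereas integer (quantised) sums do not have this issue.

```rust
/// Sums left to right.
fn sum_forward(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0, |acc, &x| acc + x)
}

/// Sums right to left.
fn sum_backward(xs: &[f32]) -> f32 {
    xs.iter().rev().fold(0.0, |acc, &x| acc + x)
}
```

For `[1.0e8, -1.0e8, 1.0]` the forward sum is `1.0`, while the backward sum is `0.0` because `1.0 - 1.0e8` rounds back to `-1.0e8` in `f32`.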

round stone Mar 12, 2025, 8:54 PM

#

is there a simple multi-layer training config with i8 quant for later layers somewhere?

twilit oriole Mar 12, 2025, 8:54 PM

#

rocky vigil I think it’s worth testing if quantization for later layers loses elo

It's already been tested. That's why SF uses i8. Let's not make so many changes at once heh

rocky vigil Mar 12, 2025, 8:54 PM

#

Ah hmm

#

Do we know any devs who have multilayer with quantization

#

They might know

formal smelt Mar 12, 2025, 9:06 PM

#

twilit oriole @formal smelt why didn't u i8 quant the later layers?

Because I only made the arch similar to SF

#

I’m pretty sure the arch matches unless I’ve misinterpreted the diagrams in NNUE.md

twilit oriole Mar 12, 2025, 9:18 PM

#

Hm well can you make one to match the quantization also?

formal smelt Mar 12, 2025, 9:24 PM

#

It’s a few line change, someone else can do it

#

I don’t know what SF does anyway

round stone Mar 12, 2025, 9:38 PM

#

vec![
    SavedFormat::new("pst", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0w", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0b", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l1w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l1b", QuantTarget::I16(64 * 255), Layout::Normal),

    SavedFormat::new("l2w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l2b", QuantTarget::I16(64 * 255), Layout::Normal),

    SavedFormat::new("l3w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l3b", QuantTarget::I16(64 * 255), Layout::Normal),
],

#

this doesn't crash, so i assume it's usable

#

don't know if there's a strict need for int8 vs. int16 for l2w and l3w
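
On the int8 vs. int16 question, a rough sketch of what the choice changes. It assumes a quantisation target with factor `q` means "multiply the float weight by `q` and round"; these helpers are hypothetical and are not bullet's code.

```rust
/// Scales a float weight by `q` and rounds; None if it does not fit in i16.
fn quantise_i16(w: f32, q: i32) -> Option<i16> {
    let scaled = (w * q as f32).round();
    if scaled >= f32::from(i16::MIN) && scaled <= f32::from(i16::MAX) {
        Some(scaled as i16)
    } else {
        None
    }
}

/// Same idea for i8, which has a much smaller range.
fn quantise_i8(w: f32, q: i32) -> Option<i8> {
    let scaled = (w * q as f32).round();
    if scaled >= f32::from(i8::MIN) && scaled <= f32::from(i8::MAX) {
        Some(scaled as i8)
    } else {
        None
    }
}

/// Largest weight magnitude representable for a given integer maximum.
fn max_weight(int_max: i32, q: i32) -> f32 {
    int_max as f32 / q as f32
}
```

With a factor of 64, i8 only covers weights up to about ±1.98 (127/64), while i16 covers up to about ±512. So the storage type mostly matters for weight clipping during training and for which SIMD instructions inference can use, not for whether the file loads.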

twilit oriole Mar 12, 2025, 9:47 PM

#

I would just use i8

#

Like it matches the current, dunno if changing it is good

round stone Mar 12, 2025, 9:48 PM

#

currently i think it's i8 for l2w, l3w and i32 for l2b, l3b

round stone Mar 12, 2025, 10:05 PM

#

QuantTarget::I32(_) => unimplemented!("i32 quant is not implemented for TrainerBuilder!")

#

i don't think it matters to use i16 for everything

rocky vigil Mar 12, 2025, 10:21 PM

#

Yeah I’ll have a look at the layers soon

rocky vigil Mar 13, 2025, 12:10 AM

#

round stone STC of 1024 vs. master here: https://tests.stockfishchess.org/tests/view/67d1f0a...

slightly better than I expected given it's supposed to be like over 2x slower

rocky vigil Mar 13, 2025, 12:36 AM

#

SSS still but threatnet seems to be way better at endgames as expected

round stone Mar 13, 2025, 12:43 AM

#

yea, endgames book expected to show better results from the speedup:
https://tests.stockfishchess.org/tests/view/67d220af166a3e8781d843fd

lofty cedar Mar 13, 2025, 2:48 AM

#

I think we do need to move to multi-layer.

#

That would be a more accurate nnue.

naive comet Mar 13, 2025, 4:20 AM

#

multilayer would only give 10ish gain, we can do that as a last step once we actually reach that range

lofty cedar Mar 13, 2025, 4:28 AM

#

Really? I thought multilayer was a big deal.

#

But it is a big deal performance-wise.

round stone Mar 13, 2025, 4:32 AM

#

-150 + 10 = -140

#

looking for elo gainers more like +100

twilit oriole Mar 13, 2025, 4:34 AM

#

It would give far more than 10 for threats, the speed hit is far less % wise

naive comet Mar 13, 2025, 4:51 AM

#

at most 20-30 fixed-nodes

#

though

lofty cedar Mar 13, 2025, 5:06 AM

#

Was this measured using size 256?

#

It might be the case that the quality of a single-layer net nearly saturates at some point and so you need to go multi-layer.

#

Like... larger l1 scales better with more layers.

#

So, if l1=256, 1 layer vs l1=256 multilayer was 10 elo, maybe l1=512 multilayer would gain more?

rocky vigil Mar 13, 2025, 5:24 AM

#

LTC for 768 vs 512 much better than STC

#

We do have speed issue

lofty cedar Mar 13, 2025, 5:58 AM

#

Do you have further plan?

#

It's still a long way vs master.

rocky vigil Mar 13, 2025, 5:59 AM

#

full inputs

#

multilayer

#

I think full input 1536 with multilayer should be good in fixed nodes vs master

#

and then a grind for optimizing

lofty cedar Mar 13, 2025, 6:00 AM

#

Full input? WDYM?

rocky vigil Mar 13, 2025, 6:00 AM

#

we are using simplified inputs right now, which are ~15k

#

full inputs add ~80k instead

lofty cedar Mar 13, 2025, 6:01 AM

#

Oh, I see.

#

Good luck...

#

Lots of work.

rocky vigil Mar 13, 2025, 6:01 AM

#

also a vague extra idea I had is to append the threats to halfkav2hm instead

#

not just append to psq

#

we can also support this theoretically

#

by having separate accumulators for them, and combining on eval time
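
A minimal sketch of the "separate accumulators, combined at eval time" idea. All names here are hypothetical. The point is only that each half can be updated or refreshed independently (for example, a king move could force a refresh of the piece half without touching the threat half), and the two are summed right before the activation.

```rust
/// Hypothetical split accumulator for one perspective.
struct SplitAccumulator {
    piece: Vec<i16>,  // fed by piece-square style features
    threat: Vec<i16>, // fed by threat features
}

impl SplitAccumulator {
    fn new(bias: &[i16]) -> Self {
        // The bias lives in one half only, so it is counted once after combining.
        Self { piece: bias.to_vec(), threat: vec![0; bias.len()] }
    }

    fn add_piece(&mut self, column: &[i16]) {
        for (a, &w) in self.piece.iter_mut().zip(column) {
            *a += w;
        }
    }

    fn add_threat(&mut self, column: &[i16]) {
        for (a, &w) in self.threat.iter_mut().zip(column) {
            *a += w;
        }
    }

    /// Element-wise sum of both halves, computed only when an eval is needed.
    fn combined(&self) -> Vec<i16> {
        self.piece.iter().zip(&self.threat).map(|(&p, &t)| p + t).collect()
    }
}
```

The cost is one extra vector add per evaluation, which would have to be weighed against the refreshes it avoids.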

#

tmrw if I have time I'll try to merge append_active_threats and write_difference into compute_threats_write_difference

rocky vigil Mar 13, 2025, 7:33 AM

#

hmm we're getting outscaled hard by master

#

I guess not so surprising

stray reef Mar 13, 2025, 9:31 AM

#

which montytrain branch produces the net that was +100-ish fixed nodes?
https://github.com/official-monty/montytrain/tree/threat-inputs-nnue-fixed sounded like the "correct" one that's not the simplified inputs from what i caught yesterday, but from looking at the code it looks like some threats are not incorporated (e.g. map_pawn_threat only maps to pawns, knights and rooks)

rocky vigil Mar 13, 2025, 10:02 AM

#

Yes because pawn -> bishop threat implies corresponding bishop -> pawn threat

#

So you know any pawn -> bishop is a duplicate

stray reef Mar 13, 2025, 10:02 AM

#

Oh right! I didn't think of it that way. So the branch is correct?

rocky vigil Mar 13, 2025, 10:02 AM

#

Yes

#

I believe

stray reef Mar 13, 2025, 10:03 AM

#

Great, thank you

rocky vigil Mar 13, 2025, 10:03 AM

#

Btw if you can run it can you print me the active feature indices

#

For kiwipete

#

And startpos

rocky vigil Mar 13, 2025, 10:05 AM

#

rocky vigil So you know any pawn -> bishop is a duplicate

The other deduplication is when they are the same type, in this case I believe the one with src < dest is used

#

Note that pawn - pawn are duplicates only when they are opposite colors
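
The deduplication rules described above can be written down as a small predicate. This is a sketch of the rules as stated in this exchange, not the montytrain implementation: the names are made up, and the pawn case follows the observation that `map_pawn_threat` only maps to pawns, knights and rooks.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum Piece { Pawn, Knight, Bishop, Rook, Queen, King }

/// True when the threat `attacker@src -> victim@dest` is skipped because
/// the reverse threat already encodes the same information.
fn is_duplicate_threat(
    attacker: Piece,
    victim: Piece,
    src: u8,
    dest: u8,
    opposite_colors: bool,
) -> bool {
    use Piece::*;
    if attacker == victim {
        // Same-colour pawns do not attack each other mutually, so keep both.
        if attacker == Pawn && !opposite_colors {
            return false;
        }
        // Mutual threat between equal piece types: keep the src < dest one.
        return src > dest;
    }
    // A pawn attacking a bishop, queen or king is always attacked back by it
    // along the same diagonal step, so the pawn-side threat is redundant.
    attacker == Pawn && matches!(victim, Bishop | Queen | King)
}
```

For example, pawn to bishop is reported as a duplicate, while pawn to knight is kept because the knight does not attack the pawn back.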

stray reef Mar 13, 2025, 10:25 AM

#

startpos

79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416 
79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416

kiwipete

79864 79865 79866 79869 79870 95 79871 79883 34 79892 79941 1276 605 606 608 79955 2034 2710 2712 2713 79995 79996 6866 5189 80048 13722 10143 80055 13821 10241 80130 19491 19492 22405 28230 20954 19503 29699 80179 36583 37424 37425 80256 39942 80270 40051 80281 80283 39990 80290 40253 40254 80292 40257 80293 40344 80295 40347 80346 40663 40665 42683 44365 80350 40696 42713 43723 80415 46014 80417 48273 49394 80488 55330 58921 80495 55432 59020 80547 68941 70401 68946 68950 68951 76236 80619 78573 78575 
79866 3 4 79868 7 79869 94 79871 97 79873 79875 79894 389 79896 79938 2259 581 2599 2601 79942 1619 612 2629 79993 6279 5162 80007 80048 13722 10147 80055 13821 10241 80123 26612 19336 19337 20797 19342 19349 80179 36583 36585 80268 39963 80275 40228 80288 80289 39999 80290 80293 80294 40345 80295 80331 40566 40567 40568 43932 80349 42702 42704 43378 42707 80419 46045 80420 48300 49981 80488 55334 58921 80495 55432 59020 80538 61448 68735 60000 70196 68743 68744 71657 80619 78573 79414 79415

#

code used (i hope it's correct)

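fn main() {
    let inputs = ThreatInputs::default();
    // Startpos in bulletformat text form: "<FEN> | <score> | <result>".
    let pos = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 | 0 | 0.0".parse::<ChessBoard>().unwrap();
    // First line: feature indices from the side-to-move perspective.
    inputs.map_features(&pos, |stm, _nstm| print!("{stm} "));
    println!();
    // Second line: feature indices from the other perspective.
    inputs.map_features(&pos, |_stm, nstm| print!("{nstm} "));
    println!();
}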

rocky vigil Mar 13, 2025, 2:52 PM

#

stray reef startpos ``` 79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525...

This seems wrong because we shouldn’t be seeing large runs of consecutive threat inputs like 79864…79871

#

Also 139 startpos features is way off

formal smelt Mar 13, 2025, 3:06 PM

#

rocky vigil This seems wrong because we shouldn’t be seeing large runs of consecutive threat...

those aren't threats

#

they're the standard 768 inputs

#

you're seeing them contiguously because the pawns aren't attacking or defending anything in startpos

#

https://github.com/official-monty/montytrain/tree/full-threat-inputs-print anyway here's your branch to check yourself

#

i get startpos has 88 features

rocky vigil Mar 13, 2025, 3:12 PM

#

Ok will check later

rocky vigil Mar 13, 2025, 3:26 PM

#

formal smelt they're the standard 768 inputs

Weren’t these supposed to be from 0 to 767 though or is it different for full inputs

formal smelt Mar 13, 2025, 3:27 PM

#

I changed it for the simpler inputs to put them before

rocky vigil Mar 13, 2025, 7:50 PM

#

Full threat indexing should be ready now

#

At least it’s correct for startpos and kiwipete

round stone Mar 13, 2025, 8:07 PM

#

alright, does that mean full threats inference is ready for testing soon?

rocky vigil Mar 13, 2025, 8:18 PM

#

Yes

formal smelt Mar 13, 2025, 8:31 PM

#

rocky vigil Full threat indexing should be ready now

with UE or no?

round stone Mar 13, 2025, 8:36 PM

#

📎 L1-256-full-threat-inputs-30.bin

#

Architecture           : (80624 -> 256)x2 -> 1
Inputs                 : Threat inputs
Number of Weights      : 20.64m
Quantisations          : [255, 64]
Eval Scale             : 340
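
As a sanity check on the "20.64m" figure, the parameter count of a `(80624 -> 256)x2 -> 1` net can be reproduced with simple arithmetic. This sketch assumes one shared feature transformer with biases plus an output layer that sees both perspectives; how the trainer itself counts is not confirmed here.

```rust
/// Parameter count for (inputs -> l1)x2 -> 1 per output bucket.
fn param_count(inputs: usize, l1: usize, output_buckets: usize) -> usize {
    let feature_transformer = inputs * l1 + l1; // weights + biases
    let output_layer = output_buckets * (2 * l1 + 1); // both perspectives + bias
    feature_transformer + output_layer
}
```

`param_count(80624, 256, 1)` gives 20,640,513, which rounds to 20.64m. Almost all of it is the 80624 x 256 feature transformer, so adding output buckets barely changes the total.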

rocky vigil Mar 13, 2025, 8:37 PM

#

formal smelt with UE or no?

UE strategy is the same as with simplified

#

I wrote approximately the same functions

round stone Mar 13, 2025, 8:37 PM

#

that's a partially-trained one. the bulletbullet footer needs to be trimmed still

rocky vigil Mar 13, 2025, 8:37 PM

#

We do need a better incremental at some point though

formal smelt Mar 13, 2025, 8:37 PM

#

hehe

round stone Mar 13, 2025, 8:37 PM

#

gotta run for now. i'll check back in later

rocky vigil Mar 13, 2025, 8:37 PM

#

Considering how Yukari is like 2x as fast

formal smelt Mar 13, 2025, 8:37 PM

#

as long as its working first

rocky vigil Mar 13, 2025, 8:37 PM

#

As current impl

formal smelt Mar 13, 2025, 8:37 PM

#

ah lol

rocky vigil Mar 13, 2025, 8:38 PM

#

The psq feature offset for full threats is 79858 right

#

Ah shoot 79856

formal smelt Mar 13, 2025, 8:39 PM

#

something like that

#

you can check with the montytrain branch i linked
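
The 79856 offset is consistent with the 80624 total in the architecture dump (79856 + 768) and with the startpos dump earlier in the thread. A sketch of the piece-square part of the indexing, with the layout (own pieces first, 64 slots per piece type, a1 = 0) inferred from that dump rather than taken from the code, and ignoring any board mirroring:

```rust
const THREAT_FEATURES: usize = 79_856; // psq features start after the threats
const PSQ_FEATURES: usize = 768; // 2 sides x 6 piece types x 64 squares

/// Index of a piece-square feature in the full-threats input layout (assumed).
/// `piece` is 0..6 (pawn = 0, rook = 3), `square` is 0..64 with a1 = 0.
fn psq_feature_index(is_opponent: bool, piece: usize, square: usize) -> usize {
    assert!(piece < 6 && square < 64);
    THREAT_FEATURES + 384 * usize::from(is_opponent) + 64 * piece + square
}
```

Under these assumptions the own pawn on a2 maps to 79864 and the opposing pawn on a7 to 80288, which are the first entries of the two contiguous pawn runs in the startpos dump.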

rocky vigil Mar 13, 2025, 9:14 PM

#

round stone that's a partially-trained one. the bulletbullet footer needs to be trimmed stil...

Alright I’ll wait for a fully trained one and hopefully I can try to get it running later

round stone Mar 14, 2025, 3:53 AM

#

rocky vigil Alright I’ll wait for a fully trained one and hopefully I can try to get it runn...

full threats input net that finished training on some leela data, with the footer bytes trimmed:
https://tests.stockfishchess.org/api/nn/nn-2f2b7f959ee5.nnue

rocky vigil Mar 14, 2025, 3:54 AM

#

alright i'll try to get it to work in ~2 hours or so

rocky vigil Mar 14, 2025, 6:02 AM

#

wait no output buckets?

#

ok then

rocky vigil Mar 14, 2025, 6:03 AM

#

round stone ``` Architecture : (80624 -> 256)x2 -> 1 Inputs : Thre...

I can still force the inference through w/o buckets but it might not be a completely accurate comparison

rocky vigil Mar 14, 2025, 6:53 AM

#

nvm apparently I have access violation trying to read 0x00000000

#

and I am a bit too tired to debug this rn

#

commit is up for debug help

round stone Mar 14, 2025, 7:07 AM

#

oh yea, output buckets would be good

#

seems like output buckets doesn't work with montytrain yet

round stone Mar 14, 2025, 7:09 AM

#

rocky vigil I can still force the inference through w/o buckets but it might not be a completely...

without buckets is fine for testing. won't be a completely accurate comparison anyways, since this training data is simpler than before

#

need to do a new set of measurements for full threats later

formal smelt Mar 14, 2025, 10:32 AM

#

round stone seems like output buckets doesn't work with montytrain yet

They do

#

It looks like you’ve done .inputs(output buckets) or something

rocky vigil Mar 14, 2025, 4:00 PM

#

Can someone test if latest commit compiles and runs

#

Because it tries to read null pointer on one machine

#

But works on the other

#

Anyways unbucketed full threats are 40 +- 35 elo at 25k nodes, according to an sss test I ran

twilit oriole Mar 14, 2025, 4:02 PM

#

rocky vigil Anyways unbucketed full threats are 40 +- 35 elo at 25k nodes, according to an s...

compared to what

rocky vigil Mar 14, 2025, 4:02 PM

#

Simplified 256, with 8 output buckets

twilit oriole Mar 14, 2025, 4:02 PM

#

nice

rocky vigil Mar 14, 2025, 4:02 PM

#

rocky vigil Can someone test if latest commit compiles and runs

^^ if you can

twilit oriole Mar 14, 2025, 4:03 PM

#

not home rn

rustic bough Mar 14, 2025, 4:16 PM

#

e27d31f compiled with Clang 20/MSYS2 under Win10 on Intel i7 7700HQ runs.

round stone Mar 14, 2025, 4:51 PM

#

formal smelt It looks like you’ve done `.inputs(output buckets)` or something

diff --git a/value/src/arch.rs b/value/src/arch.rs
index 0438bef..289766f 100644
--- a/value/src/arch.rs
+++ b/value/src/arch.rs
@@ -13,9 +13,9 @@ pub fn make_trainer<T: Default + SparseInputType>(
     TrainerBuilder::default()
         .quantisations(&[255, 64])
         .optimiser(AdamW)
-        .loss_fn(Loss::SigmoidMSE)
+        .loss_fn(Loss::SigmoidMPE(2.6))
         .input(inputs)
-        .output_buckets(outputs::Single)
+        .output_buckets(outputs::MaterialCount::<8>)
         .feature_transformer(l1)
         .activate(Activation::SCReLU)
         .add_layer(1)

formal smelt Mar 14, 2025, 4:53 PM

#

round stone ```rust diff --git a/value/src/arch.rs b/value/src/arch.rs index 0438bef..289766...

can you send the whole make_trainer function?

#

huh i can reproduce it

#

thats really weird

round stone Mar 14, 2025, 4:57 PM

#

use bullet::{
    nn::{
        optimiser::{AdamW, AdamWOptimiser},
        Activation,
    },
    trainer::default::{inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder},
};

#[rustfmt::skip]
pub fn make_trainer<T: Default + SparseInputType>(
    inputs: T, l1: usize,
) -> Trainer<AdamWOptimiser, T, outputs::Single> {
    TrainerBuilder::default()
        .quantisations(&[255, 64])
        .optimiser(AdamW)
        .loss_fn(Loss::SigmoidMPE(2.6))
        .input(inputs)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(l1)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build()
}

formal smelt Mar 14, 2025, 4:57 PM

#

) -> Trainer<AdamWOptimiser, T, outputs::Single> {
you'd need to change this line too

#

but when i do that it still errors

rocky vigil Mar 14, 2025, 4:58 PM

#

Ok maybe my home laptop just has an issue with std::optional

#

Well that kinda sucks if true

formal smelt Mar 14, 2025, 5:01 PM

#

round stone ```rust use bullet::{ nn::{ optimiser::{AdamW, AdamWOptimiser}, ...

AH i know what it is

- pub fn make_trainer<T: Default + SparseInputType>(
+ pub fn make_trainer<T: Default + SparseInputType<RequiredDataType = ChessBoard>>(
    inputs: T, l1: usize,
- ) -> Trainer<AdamWOptimiser, T, outputs::Single> {
+ ) -> Trainer<AdamWOptimiser, T, outputs::MaterialCount<8>> {

round stone Mar 14, 2025, 5:09 PM

#

cool works now, after also adding the import:

trainer::default::{
       formats::bulletformat::{ChessBoard},
       inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder
},

round stone Mar 14, 2025, 5:16 PM

#

rocky vigil Can someone test if latest commit compiles and runs

latest full threats commit e27d31f1a compiles for me too

rocky vigil Mar 14, 2025, 5:16 PM

#

Does it run

#

And not crash

round stone Mar 14, 2025, 5:17 PM

#

yea it's +30 fixed nodes so far vs. simplified threats 512

rocky vigil Mar 14, 2025, 5:17 PM

#

Home laptop is getting an access violation at 0x00000000 and I’m trying to figure out if this is an issue with my code or with my device

#

So it’s probably an issue with my compiler on that laptop then

#

Ok good to know

#

Hmm surpassing double L1 is really good

round stone Mar 14, 2025, 5:25 PM

#

or maybe -30? i keep mixing up the order of these

#

Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 30.14 +/- 6.19, nElo: 45.18 +/- 9.24
LOS: 100.00 %, DrawRatio: 43.04 %, PairsRatio: 1.57
Games: 5432, Wins: 1978, Losses: 1508, Draws: 1946, Points: 2951.0 (54.33 %)
Ptnml(0-2): [93, 509, 1169, 725, 220], WL/DD Ratio: 2.28

main prawn Mar 14, 2025, 5:25 PM

#

that's -30

round stone Mar 14, 2025, 5:25 PM

#

womp womp

rocky vigil Mar 14, 2025, 5:26 PM

#

Well if 512 was really +130 to 256 for simplified (this result seems off, btw) that is still a good gain

#

+output buckets are maybe 10 or so

formal smelt Mar 14, 2025, 5:27 PM

#

output buckets can be neutral

#

very variable

round stone Mar 14, 2025, 5:28 PM

#

an output buckets 256 is training now and will be ready later to test

formal smelt Mar 14, 2025, 5:28 PM

#

nice

rocky vigil Mar 14, 2025, 5:28 PM

#

Ok

#

Nice

#

In other news for ue optimization once again

#

Since Yukari obviously does it a lot better rn

#

I think at some point we need to switch from trying to compute every threat

formal smelt Mar 14, 2025, 5:29 PM

#

do we have a guestimate of what perf is left on the table

#

vs yukari?

rocky vigil Mar 14, 2025, 5:29 PM

#

To only incrementally updating threats

rocky vigil Mar 14, 2025, 5:29 PM

#

formal smelt vs yukari?

Startpos 20 second search for Yukari vs current branch had Yukari like 2x faster for (simplified->256) on my machine

#

You can also test this out

formal smelt Mar 14, 2025, 5:30 PM

#

formal smelt do we have a guestimate of what perf is left on the table

to clarify, not just the amount but also where it comes from

rocky vigil Mar 14, 2025, 5:30 PM

#

I think Yukari has incremental attack tables

#

Which is integrated into movegen etc

#

I am wondering surely sf had fast attack tables back in the HCE era

#

So I am wondering if those still exist or we could bring them back

rocky vigil Mar 14, 2025, 5:32 PM

#

formal smelt do we have a guestimate of what perf is left on the table

Like based off Disservin’s profile for L1=256 nearly half of the runtime is dedicated to computing and indexing threats

formal smelt Mar 14, 2025, 5:32 PM

#

rocky vigil Like based off Disservin’s profile for L1=256 nearly half of the runtime is dedi...

oh sheesh

rocky vigil Mar 14, 2025, 5:33 PM

#

You can see it here

formal smelt Mar 14, 2025, 5:33 PM

#

crazy

#

well that bodes well at least for the potential once optimised

rocky vigil Mar 14, 2025, 5:34 PM

#

yeah the current method is kind of a dead end eventually

#

We can mask the overhead more as L1 gets large but it’s still significant

rocky vigil Mar 14, 2025, 5:36 PM

#

rocky vigil yeah the current method is kind of a dead end eventually

In the meanwhile I can try an incremental threat hack from bitboards alone

#

So first of all castling never opens any discoveries right

#

And castling is the only move involving more than 2 squares

rocky vigil Mar 14, 2025, 5:37 PM

#

rocky vigil So first of all castling never opens any discoveries right

Even for frc right (please check this logic)

#

So if castling we know that we only need to deal with attacks involving those (up to 4) squares

#

Otherwise for two-square moves we can loop through the leapers

#

And discoveries are only present if both sides of a file/rank/diagonal are attacked

#

Do we have methods for only getting attacks on one line

#

Not both

#

Welp I guess not

round stone Mar 14, 2025, 7:09 PM

#

full threats L1-256 with 8 output buckets:

Architecture           : (80624 -> 256)x2 -> 1x8
Inputs                 : Threat inputs
Number of Weights      : 20.64m
Output Buckets         : Will be transposed in quantised network for you, output bucketed layers will
                       : have weights in form [[[T; layer input size]; layer output size]; buckets]
Quantisations          : [255, 64]
Eval Scale             : 340

https://tests.stockfishchess.org/api/nn/nn-a660a82f6a81.nnue
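
For readers unfamiliar with material-count output buckets: the bucket is picked from the number of pieces on the board. The sketch below is what I believe a `MaterialCount<8>`-style selector does, but treat the exact formula as an assumption rather than bullet's definition.

```rust
/// Picks an output bucket from the total piece count (2..=32, kings included).
fn material_bucket(piece_count: u32, buckets: u32) -> u32 {
    assert!((2..=32).contains(&piece_count) && buckets > 0);
    // Ceiling division, so 8 buckets gives 4 piece counts per bucket.
    let divisor = (32 + buckets - 1) / buckets;
    (piece_count - 2) / divisor
}
```

With 8 buckets, the start position (32 pieces) lands in bucket 7 and bare kings in bucket 0, so each bucket's output weights specialise on a slice of the game phase.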

rocky vigil Mar 14, 2025, 8:15 PM

#

rocky vigil Do we have methods for only getting attacks on one line

Can someone point me how to add these

#

It would allow better optimization of computing discovered attacks

candid ivy Mar 14, 2025, 8:18 PM

#

rocky vigil Can someone point me how to add these

what is one line in your case?

#

vertical/horizontal/diagonal ?

#

there's line_bb and between_bb

rocky vigil Mar 14, 2025, 8:26 PM

#

Vertical, both directions (so up and down)

#

Or horizontal, both directions (so left and right)

rocky vigil Mar 14, 2025, 8:29 PM

#

round stone full threats L1-256 with 8 output buckets: ``` Architecture : (80624 -...

Inference is up

rocky vigil Mar 14, 2025, 8:34 PM

#

rocky vigil In the meanwhile I can try an incremental threat hack from bitboards alone

I would prefer re-introducing attack tables and then building threats on top of that

#

Because incremental from bitboards alone is very annoying, at least to optimize

#

For threats what you would want to do is:

#

Loop through the attackers of both squares and also the attacks

#

But then optimizing around deduplication and computing less attacks is hard

candid ivy Mar 14, 2025, 8:36 PM

#

mh you can use hyperbola approach for just horizontal/vert if you don't want to use a lookup

rocky vigil Mar 14, 2025, 8:37 PM

#

Hyperbola doesn’t work for rank right?

#

What I was planning to do was loop through the attacks for the leapers (including both colors of pawns) and then consider rank, file, both diags individually to add discovered attacks

#

Combining this way means you don’t have to compute the bitboard of piece attacks separately since you already need it to compute the attackers of the square

#

But after thinking about it for a while I conclude that this will be a very messy hack

candid ivy Mar 14, 2025, 8:41 PM

#

rocky vigil Hyperbola doesn’t work for rank right?

mh it does, had a look on github and one impl is just

template<uint32_t sq>
static constexpr uint64_t rook_attack(uint64_t occ) {
       return vertical_attack<sq>(occ) | horizontal_atkL<sq>(~occ) | horizontal_atkR<sq>(~occ);
}

so you can split that up into just vert/horizontal
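
To show what "attacks along one line only" can look like, here is a sketch of the subtraction trick behind hyperbola quintessence. It uses a full 64-bit bit reversal for the negative direction instead of the usual byte swap, so the same routine also works for ranks. The function names are made up. Bit reversal is not a cheap instruction on x86, so read this as an illustration of splitting by line, not as a speed recommendation.

```rust
/// Rank through `sq`, excluding `sq` itself.
fn rank_mask_ex(sq: u32) -> u64 {
    (0xFFu64 << (8 * (sq / 8))) & !(1u64 << sq)
}

/// File through `sq`, excluding `sq` itself.
fn file_mask_ex(sq: u32) -> u64 {
    (0x0101_0101_0101_0101u64 << (sq % 8)) & !(1u64 << sq)
}

/// Sliding attacks from `sq` along one line, in both directions.
/// `mask_ex` is that line without the slider's own square.
fn line_attacks(occ: u64, sq: u32, mask_ex: u64) -> u64 {
    let s = 1u64 << sq;
    let o = occ & mask_ex;
    // Borrow propagation flips every bit from the slider up to the first blocker.
    let forward = o.wrapping_sub(s);
    // Same trick in the mirrored bit order covers the other direction.
    let reverse = o
        .reverse_bits()
        .wrapping_sub(s.reverse_bits())
        .reverse_bits();
    (forward ^ reverse) & mask_ex
}
```

Horizontal-only attacks are then `line_attacks(occ, sq, rank_mask_ex(sq))` and vertical-only attacks use `file_mask_ex(sq)`. The returned set includes the first blocker in each direction.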

rocky vigil Mar 14, 2025, 8:42 PM

#

Do you know if sf had attack tables in the HCE days

#

I think a broader issue is that currently we try to decouple nnue updates from the position

#

But threat inputs have a much larger dependence on the position

#

So trying to do threats separately without integrating it into the position and stuff like make_move naturally leads to awkward code

#

Eventually I think we do need to figure out some way to compute threat changes incrementally and I would appreciate help from more experienced people regarding this

round stone Mar 14, 2025, 8:47 PM

#

Results of ./threat-inputs/mar14-sscg13-full-threats-256-8-output-buckets vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 3.50 +/- 4.74, nElo: 5.29 +/- 7.16
LOS: 92.60 %, DrawRatio: 44.70 %, PairsRatio: 1.03
Games: 9034, Wins: 2949, Losses: 2858, Draws: 3227, Points: 4562.5 (50.50 %)
Ptnml(0-2): [218, 1011, 2019, 1000, 269], WL/DD Ratio: 2.32

#

STC of 8 output buckets vs. 1 bucket:
https://tests.stockfishchess.org/tests/view/67d4935e517865b4a2dfcf8d

#

going to try larger L1, then different datasets later, including monty binpacks

rocky vigil Mar 14, 2025, 8:55 PM

#

Yeah seeing the gains at larger L1 will also be very helpful

rocky vigil Mar 14, 2025, 9:03 PM

#

rocky vigil Eventually I think we do need to figure out some way to compute threat changes i...

In particular since inference is not in a horrible state right now I think it’s fine to take more time and plan out something better long-term

#

And I would appreciate any suggestions

rocky vigil Mar 14, 2025, 9:28 PM

#

Most important right now though is to scale the nets up to large size and verify gain over master at fixed nodes

#

Theoretically, with the current implementation, you compute all attacks of pieces, looping through ~40 threats and writing them, then write_difference loops through these and writes maybe 10 more on average

#

Whereas with real incremental you only need to loop through 12 attack bitboards and way less threats, and you can write the differences directly

#

Also: usually the accumulator updates are for both colors at once right, so we can try to optimize further by only computing the attacks once but writing the difference for both colors

#

This all needs more work to develop though
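
The "recompute everything, then diff" approach described above boils down to a set difference between the threat features active before and after a move. A minimal sketch, assuming both lists are sorted and free of duplicates. The function name is made up and the real `write_difference` may be organised differently.

```rust
use std::cmp::Ordering;

/// Returns (added, removed) feature indices between two sorted lists.
fn threat_difference(before: &[u32], after: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let (mut added, mut removed) = (Vec::new(), Vec::new());
    let (mut i, mut j) = (0, 0);
    while i < before.len() && j < after.len() {
        match before[i].cmp(&after[j]) {
            // Only in the old list: the accumulator must subtract it.
            Ordering::Less => {
                removed.push(before[i]);
                i += 1;
            }
            // Only in the new list: the accumulator must add it.
            Ordering::Greater => {
                added.push(after[j]);
                j += 1;
            }
            // Present in both: no accumulator work needed.
            Ordering::Equal => {
                i += 1;
                j += 1;
            }
        }
    }
    removed.extend_from_slice(&before[i..]);
    added.extend_from_slice(&after[j..]);
    (added, removed)
}
```

The work here scales with the total number of active threats, while a truly incremental scheme would only touch the threats involving the squares that changed. That is the gap being discussed.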

rocky vigil Mar 14, 2025, 10:02 PM

#

round stone STC of 8 output buckets vs. 1 bucket: https://tests.stockfishchess.org/tests/vie...

Looks like output buckets aren’t so great for full threats single layer

#

We’ll still have to see for multilayer later

twilit oriole Mar 15, 2025, 12:34 AM

#

Eh that's cache stuff I'm pretty sure. Like if the net was shared between instances it would gain

rocky vigil Mar 15, 2025, 12:41 AM

#

8 output buckets vs 1 bucket is hardly cache right

#

does nodestime work as a tc version of fixed nodes

twilit oriole Mar 15, 2025, 12:52 AM

#

rocky vigil 8 output buckets vs 1 bucket is hardly cache right

It is, because the later layers are constantly used, so they probably sit in L1/L2. This is probably where the huge variation in elo gain comes from also

rocky vigil Mar 15, 2025, 5:24 AM

#

the new dataset appears quite superior to the old one

round stone Mar 15, 2025, 6:32 AM

#

The new one is the same training sequence as the one used for simple threats. Old one was training directly on a single leela binpack from scratch

round stone Mar 15, 2025, 7:51 AM

#

fishtest not a fan of L1-768 full threats being huge

lofty cedar Mar 15, 2025, 9:52 AM

#

Need to update fishtest.

upbeat pewter Mar 15, 2025, 9:53 AM

#

NNUE that's just a rickroll loaded as an eval

twilit oriole Mar 15, 2025, 10:10 AM

#

@round stone What's the fixed nodes result of 512 Vs 256 full threats?

Don't really care about TC tests anyway, the speed is not representative of anything final so it's not telling anything

What we really are still needing is L1 1536 multilayer full threats with output buckets. To compare fixed nodes to master. I have access to 4x4090 u can use for training.

This speed stuff was supposed to be on the side, not the main focus at this stage; the concept has not yet been proven enough to justify spending so much time trying to optimise speed

twilit oriole Mar 15, 2025, 10:41 AM

#

This is getting out of hand (2.3k messages) because the process is not being understood, I think. Multilayer is being delayed as it is 'only 20 Elo whereas at TC the net is currently -100 Elo', but the TC result is irrelevant at this stage. We need a solid fixed nodes result first, and multilayer's 20 Elo is relevant there

rocky vigil Mar 15, 2025, 5:42 PM

#

although nowhere close to fully optimized, the speed is definitely sufficient for fixed nodes at a reasonable pace

round stone Mar 15, 2025, 8:02 PM

#

Results of ./threat-inputs/mar15-sscg13-full-threats-512-S2 vs ./threat-inputs/mar15-sscg13-full-threats-256-S2 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 41.66 +/- 5.02, nElo: 64.55 +/- 7.71
LOS: 100.00 %, DrawRatio: 41.32 %, PairsRatio: 1.98
Games: 7802, Wins: 2882, Losses: 1951, Draws: 2969, Points: 4366.5 (55.97 %)
Ptnml(0-2): [104, 663, 1612, 1242, 280], WL/DD Ratio: 2.03
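
For reference, the headline Elo figure above can be reproduced from the win/loss/draw counts with the standard logistic Elo formula. This is a minimal sketch with hypothetical helper names; the error bars and nElo depend on the pentanomial statistics and are not reproduced here.

```rust
/// Average score from win/loss/draw counts (a draw counts as half a point).
fn score_from_wld(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = (wins + losses + draws) as f64;
    (wins as f64 + 0.5 * draws as f64) / games
}

/// Logistic Elo difference implied by an average score strictly between 0 and 1.
fn elo_from_score(score: f64) -> f64 {
    -400.0 * (1.0 / score - 1.0).log10()
}
```

With the counts from the result above, `elo_from_score(score_from_wld(2882, 1951, 2969))` comes out at roughly 41.66, matching the reported value.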

#

TC tests are important for establishing a baseline and quantifying improvement

#

optimizing speed is part of proving the concept, especially if speed is the biggest bottleneck

#

at larger L1, the training process and data becomes much more important

#

if an L1-512 full threats net is -50 elo STC vs. a far smaller L1-256 HalfKA v2 hm, then that's a problem

#

can you elaborate on why you think the process you have in mind is a better approach to being competitive vs. master at TC?

rocky vigil Mar 15, 2025, 8:36 PM

#

i think they more want to say

#

that we should get data about fixed nodes

#

to see if it can be competitive assuming good optimization

#

rn I think if you want to simulate an "optimized" implementation you can give 1.6x time odds, as this is approximately how much faster Yukari is in the midgame than current stockfish at L1=256

round stone Mar 15, 2025, 8:41 PM

#

is a 1.6x speedup within reach?

#

much larger L1 is going to need significant improvements to the training process + data to have any decent indication of fixed nodes strength

rocky vigil Mar 15, 2025, 8:43 PM

#

round stone is a 1.6x speedup within reach?

considering yukari's done it

#

there is no reason it shouldn't be

#

well the 1.6x applies mostly to midgame

#

in endgame where there are few threats it isn't as large

#

https://tests.stockfishchess.org/tests/view/67d5e6a9517865b4a2dfd2d2

#

I'll see where we are (full threats -> 256 vs smallnet) with +50% time odds

upbeat pewter Mar 15, 2025, 8:46 PM

#

and I think my implementation is actively less efficient than a lazy approach

formal smelt Mar 15, 2025, 8:46 PM

#

is this the right way round?

rocky vigil Mar 15, 2025, 8:46 PM

#

wait huh

twilit oriole Mar 15, 2025, 8:46 PM

#

round stone can you elaborate on why you think the process you have in mind is a better appr...

If a L1 1536 is not remotely close to master at fixed nodes no further time needs to be invested

#

Ideally we would have the unoptimized version exactly how we want in terms of net and then optimise after

rocky vigil Mar 15, 2025, 8:47 PM

#

I think L1 1536 should be close to master

#

if we take simplified 1024 at -80, add +40 for full threats, add +20 for larger L1, and +20 for multilayer

twilit oriole Mar 15, 2025, 8:48 PM

#

Sure you can think but it needs to be proven. I have the GPUs let's just do it before investing a fuckton of time

rocky vigil Mar 15, 2025, 8:48 PM

#

right

#

yeah 4x4090 should massively speed up the training

#

as long as you have the data and know what to do

twilit oriole Mar 15, 2025, 8:49 PM

#

Linrock can provide the script and data to me

#

Ideally it would be complete in terms of arch so then we know this is exactly what we are even optimizing

round stone Mar 15, 2025, 8:50 PM

#

nnue-pytorch is far ahead of bullet as far as maximizing strength of data, and this is only amplified with larger L1

twilit oriole Mar 15, 2025, 8:51 PM

#

That's fine. This is work that has to be done anyways, I still think it's an easier path than doing UE threat gen and stuff

#

With 4 GPUs it means 4 experiments can go in parallel

round stone Mar 15, 2025, 8:53 PM

#

i think it saves time to prove/disprove at smaller L1

rocky vigil Mar 15, 2025, 8:53 PM

#

ok https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da +50% time odds should be actually correct now

round stone Mar 15, 2025, 8:54 PM

#

if it can never be fast enough to become a master net with optimal UE, then there's no need for a long and complicated training process

rocky vigil Mar 15, 2025, 8:55 PM

#

in Disservin's profile for L1=256 up to 45% of the time is spent on append_active_threats, write_difference, and memmove_avx_unaligned_erms

#

memmove is probably because I did smth wrong with alignment

#

and if we can reduce those things to 1/4 of their current runtime with incremental threat computation that gives our +50% speedup approximately
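
The "+50% approximately" estimate follows from Amdahl's law. A minimal sketch, assuming the 45% figure from the profile and a 4x speedup of just that portion (the function name is made up for illustration):

```rust
/// Overall speedup when a `fraction` of total runtime is made
/// `local_speedup` times faster and the rest is left unchanged.
fn overall_speedup(fraction: f64, local_speedup: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / local_speedup)
}
```

`overall_speedup(0.45, 4.0)` is 1 / (0.55 + 0.1125), about 1.51x. Even if those three functions cost nothing at all, the ceiling would be 1 / 0.55, about 1.82x.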

candid ivy Mar 15, 2025, 8:56 PM

#

rocky vigil memmove is probably because I did smth wrong with alignment

I'm not sure if it was there before; I can check later

rocky vigil Mar 15, 2025, 9:06 PM

#

rocky vigil ok <https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da> +50% t...

SSS but it seems that it's far superior to smallnet with +50% time?

formal smelt Mar 15, 2025, 9:12 PM

#

round stone nnue-pytorch is far ahead of bullet as far as maximizing strength of data, and t...

Can you give a list of data loader features that are needed

#

I’ll just add them in the next week sometime

round stone Mar 15, 2025, 9:20 PM

#

biggest two are:

piece count probability distribution skipping
WDL skipping

formal smelt Mar 15, 2025, 9:21 PM

#

got any digestible code references for them?

round stone Mar 15, 2025, 9:22 PM

#

piece count probability: https://github.com/official-stockfish/nnue-pytorch/pull/173
WDL skipping: https://github.com/official-stockfish/nnue-pytorch/pull/155

formal smelt Mar 15, 2025, 9:22 PM

#

cheers

#

hopefully will have time in a few days

rocky vigil Mar 15, 2025, 9:25 PM

#

rocky vigil SSS but it seems that it's far superior to smallnet with +50% time?

btw +50% time odds is not 100 elo right

#

it should still be superior even at the current version

#

i might make a test later

formal smelt Mar 15, 2025, 9:26 PM

#

so if you grab 50% speedups from the threat calculation and indexing, we're gaming?

violet badger Mar 15, 2025, 9:29 PM

#

out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorch), also, does bullet have multi-GPU training?

rocky vigil Mar 15, 2025, 9:30 PM

#

formal smelt so if you grab 50% speedups from the threat calculation and indexing, we're gami...

yeah at least at L1=256

#

sadly being like 4x larger than current smallnet it doesn't fulfill the smallnet purpose

round stone Mar 15, 2025, 9:32 PM

#

the smallnet primary purpose was to gain elo

#

so if a threats net can be used as a smallnet as a gainer, it's another path to production

rocky vigil Mar 15, 2025, 9:33 PM

#

i meant for replacing the lichess net

#

like it's gonna be 30MB

#

and that will be a bit bad

round stone Mar 15, 2025, 9:34 PM

#

oh yea probably not as a lichess smallnet replacement, where the purpose was to be small

round stone Mar 15, 2025, 9:35 PM

#

formal smelt so if you grab 50% speedups from the threat calculation and indexing, we're gami...

if a 1.5x speedup from L1-256 threats = +150 STC elo (from -50 to +100) vs the L1-256 smallnet then it looks promising to explore larger L1

upbeat pewter Mar 15, 2025, 9:35 PM

#

violet badger out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorc...

bullet does not have multi-GPU training, but it's on the todo list, AIUI

violet badger Mar 15, 2025, 9:36 PM

#

ok, nice..

#

(if it can be made to work well 😉 )

rocky vigil Mar 15, 2025, 9:37 PM

#

round stone if a 1.5x speedup from L1-256 threats = +150 STC elo (from -50 to +100) vs the L...

well I assume full threats would've been like ~neutral at tc given it was +40 or so compared to simplified

round stone Mar 15, 2025, 9:37 PM

#

training speed is hard to compare, since it varies due to many differences between them: net architectures, skipping, dataset format

rocky vigil Mar 15, 2025, 9:37 PM

#

it seems like it performs better against lichess smallnet with the original book though

#

compared to endgames.epd

#

for whatever reason

round stone Mar 15, 2025, 9:37 PM

#

right now i'm getting around 2M pos/sec for these L1-512 threat nets, loading binpacks without any skipping

violet badger Mar 15, 2025, 9:38 PM

#

so gut feeling comparison, same order of magnitude, or rather slower or rather faster.

round stone Mar 15, 2025, 9:38 PM

#

also config options i haven't closely looked into optimizing

rocky vigil Mar 15, 2025, 9:38 PM

#

also 120MB limit for net means that tc testing for larger L1 is out for now

round stone Mar 15, 2025, 9:39 PM

#

i'd say training definitely seems faster with bullet, but that's also because piece count + WDL skipping makes it much slower

#

similar order of magnitude of training speed

#

using the sequential dataloader maybe 2-3x faster for small nets, vs. binpacks even without skipping

#

the data disk size usage vs. training speed tradeoff of bullet format vs. binpacks is very noticeable

rocky vigil Mar 15, 2025, 9:42 PM

#

I've done https://tests.stockfishchess.org/tests/view/67d5f445517865b4a2dfd2e6 as well which is equal tc

round stone Mar 15, 2025, 9:42 PM

#

rocky vigil also 120MB limit for net means that tc testing for larger L1 is out for now

yea, need compression (leb128) and/or fishtest limit increase to go further than full threats L1-512 on fishtest
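
For readers unfamiliar with LEB128: it is a variable-length integer encoding in which small-magnitude values take a single byte. Below is a minimal sketch of signed LEB128 over `i16` weights. The function names are hypothetical and this is not the actual net-file code, only an illustration of why nets with mostly small weights shrink.

```rust
/// Append the signed LEB128 encoding of `value` to `out`.
fn encode_sleb128(mut value: i64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift keeps the sign
        let sign_bit_set = byte & 0x40 != 0;
        // Stop once the remaining bits are pure sign extension.
        if (value == 0 && !sign_bit_set) || (value == -1 && sign_bit_set) {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

/// Decode one signed LEB128 value; returns (value, bytes consumed).
fn decode_sleb128(bytes: &[u8]) -> Option<(i64, usize)> {
    let mut result: i64 = 0;
    let mut shift: u32 = 0;
    for (i, &b) in bytes.iter().enumerate() {
        if shift >= 64 {
            return None; // too long for an i64
        }
        result |= ((b & 0x7f) as i64) << shift;
        shift += 7;
        if b & 0x80 == 0 {
            if shift < 64 && (b & 0x40) != 0 {
                result |= (!0i64) << shift; // sign-extend
            }
            return Some((result, i + 1));
        }
    }
    None // ran out of input before the final byte
}

/// Encode a slice of i16 weights back to back.
fn compress_weights(weights: &[i16]) -> Vec<u8> {
    let mut out = Vec::new();
    for &w in weights {
        encode_sleb128(w as i64, &mut out);
    }
    out
}
```

Weights in the range -64..=63 take one byte instead of two, so the achievable ratio depends on how many weights are that small.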

rocky vigil Mar 15, 2025, 9:43 PM

#

which will hopefully let us know how threat inputs scales with more time

round stone Mar 15, 2025, 9:44 PM

#

in the meantime i'm hoping L1-256 and L1-512 experiments can give an indicator of how it scales

rocky vigil Mar 15, 2025, 9:44 PM

#

full threats -> 512 is like 80 mb right?

round stone Mar 15, 2025, 9:44 PM

#

yea

rocky vigil Mar 15, 2025, 9:44 PM

#

does the memory issue become significant

#

viren might know better

#

ik viren advocates for not duplicating ram for nnue weights on multiple sf processes

#

huh +50% speed is 150 elo

#

this is a bit tyrannical scaling

violet badger Mar 15, 2025, 9:54 PM

#

but consistent with what we know https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-from-speedups

#

(more or less)

#

Also https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-gain-with-time-odds

#

And just FYI https://github.com/official-stockfish/Stockfish/wiki/Useful-data#the-impact-of-efficient-incremental-updates-nnue

round stone Mar 15, 2025, 9:56 PM

#

i was expecting less than +100 elo with a +50% speedup but didn't really know what was reasonable

#

worth testing it vs. master to see where it's at there

#

since the L1-256 lichess smallnet is around -86 vs. master at STC
https://tests.stockfishchess.org/tests/view/67d33095517865b4a2dfc73a

rocky vigil Mar 15, 2025, 10:12 PM

#

it's only 1.5x time odds not 2x

formal smelt Mar 15, 2025, 10:19 PM

#

violet badger out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorc...

i'm using the command as follows on nnue-pytorch main branch

python3 train.py ../bullet/data/test80-2024-02-feb-2tb7p.min-v2.v6.binpack --batch-size 16384 --features=HalfKAv2_hm --lambda=1.0 --gpus "0," --threads 8 --num-workers 8 --default_root_dir ./ --no-smart-fen-skipping

does this command look okay?
This gives me 92/6166 [00:17<19:01, 5.32it/s, loss=0.0209, v_num=1] which is ~90k pos/sec
I believe I'm training SF master net with L1=3072, bucket=8, having touched nothing else

#

Num virtual features: 0, so I take it this is not using a factoriser either
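
The "~90k pos/sec" figure is just the progress-bar rate times the batch size. A tiny sketch of that conversion, with a made-up helper name and the `--batch-size 16384` value from the command above:

```rust
/// Convert a trainer's iterations-per-second readout into positions per second.
fn positions_per_second(iters_per_sec: f64, batch_size: u32) -> f64 {
    iters_per_sec * batch_size as f64
}
```

`positions_per_second(5.32, 16384)` is about 87,163, which rounds to the quoted ~90k.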

round stone Mar 15, 2025, 10:21 PM

#

rocky vigil it's only 1.5x time odds not 2x

shouldn't it be 15+0.15 vs. 10+0.1?

#

https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da
right now it's 15+0.5 which is 5x the base increment

formal smelt Mar 15, 2025, 10:22 PM

#

https://github.com/jw1912/bullet/blob/sf-arch-i-think/examples/advanced.rs here is more or less the equivalent in bullet, there's probably some minor differences that have no effect on performance (like output buckets being calculated slightly differently)

#

superbatch 1 [12.5% (128/1024 batches, 171237 pos/sec)]

rocky vigil Mar 15, 2025, 10:22 PM

#

round stone https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da right now i...

oh shoot-

formal smelt Mar 15, 2025, 10:23 PM

#

its ofc not a proper comparison

#

need to guarantee everything is equal enough

rocky vigil Mar 15, 2025, 10:23 PM

#

how long is the average fishtest game

formal smelt Mar 15, 2025, 10:23 PM

#

tested on gtx 1660 super btw

daring wren Mar 15, 2025, 10:23 PM

#

rocky vigil how long is the average fishtest game

probably more than 80 plies

rocky vigil Mar 15, 2025, 10:26 PM

#

ok it's more like 2.5x time odds then

#

oopsies

#

yeah the real 1.5x time odds should be close to neutral then
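
The 2.5x figure can be checked with a back-of-the-envelope model: total clock time per side is the base time plus one increment per move. A minimal sketch, assuming an 80-ply game (the lower bound guessed above) and hypothetical helper names:

```rust
/// Expected clock time per side, in seconds, for a `base + inc` time
/// control over a game of `plies` half-moves.
fn expected_time(base_s: f64, inc_s: f64, plies: u32) -> f64 {
    let moves_per_side = (plies / 2) as f64;
    base_s + inc_s * moves_per_side
}

/// Ratio of expected clock time between two (base, increment) time controls.
fn time_odds(a: (f64, f64), b: (f64, f64), plies: u32) -> f64 {
    expected_time(a.0, a.1, plies) / expected_time(b.0, b.1, plies)
}
```

At 80 plies, 15+0.5 vs. 10+0.1 gives 35 s vs. 14 s, i.e. 2.5x, while the intended 15+0.15 vs. 10+0.1 gives 21 s vs. 14 s, i.e. 1.5x. Longer games push the 15+0.5 ratio higher still, since the increment then dominates.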

formal smelt Mar 15, 2025, 10:33 PM

#

formal smelt <https://github.com/jw1912/bullet/blob/sf-arch-i-think/examples/advanced.rs> her...

updated with added L1 factoriser as seen in the LayerStacks class in NNUE pytorch
superbatch 1 [62.5% (640/1024 batches, 166802 pos/sec)]

#

ah the psqt subnet is output bucketed

#

superbatch 1 [37.5% (384/1024 batches, 165194 pos/sec)] have updated

#

superbatch 1 [37.5% (384/1024 batches, 154846 pos/sec)] + ranger optimiser

candid ivy Mar 15, 2025, 10:48 PM

#

makes me wonder how much elo is from optimizing hyperparameters and the data filter changes

rocky vigil Mar 15, 2025, 11:18 PM

#

hmm looks like with the real +50% it's still better than smallnet

#

but significantly worse than master as expected

violet badger Mar 16, 2025, 6:52 AM

#

formal smelt `superbatch 1 [37.5% (384/1024 batches, 154846 pos/sec)]` + ranger optimiser

thanks for testing, so that would be 1.5 - 2.0x speedup or so, which is very relevant.

violet badger Mar 16, 2025, 6:54 AM

#

candid ivy makes me wonder how much elo is from optimizing hyperparameters and the data fil...

quite a bit I think, I don't dare to put a number to it, but people might start to forget how many iterations went in the current settings, we've quite literally tested over 5000 nets on fishtest.

#

(which made me realize that my local backup was not adjusted to the recent server changes... need to fix that... fixed).

candid ivy Mar 16, 2025, 1:16 PM

#

formal smelt i'm using the command as follows on nnue-pytorch main branch ``` python3 train.p...

the default net uses factorizing, so the features should be HalfKAv2_hm^ (I think)

formal smelt Mar 16, 2025, 3:38 PM

#

candid ivy the default net uses factorizing, so the features should be `HalfKAv2_hm^` (I th...

good catch
309/6166 [01:35<30:14, 3.23it/s, loss=0.0147, v_num=3] with factoriser ~53k pos/sec
superbatch 1 [50.0% (512/1024 batches, 98099 pos/sec)] pushed to the branch i linked earlier

violet badger Mar 16, 2025, 3:48 PM

#

interesting, so getting the exact SF arch and training features in bullet would be quite nice for training.

formal smelt Mar 16, 2025, 3:53 PM

#

Apart from the output bucket formula I’m now fairly certain the arch is exactly replicated in the branch I linked

#

For required extra features other than the data loading stuff I’m not sure
The NNUE-PyTorch ranger impl has more stuff than the bullet one but all the extra features seem unused by default

candid ivy Mar 16, 2025, 3:59 PM

#

#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount<const N: usize>;
impl<const N: usize> OutputBuckets<ChessBoard> for SfMaterialCount<N> {
    const BUCKETS: usize = N;

    fn bucket(&self, pos: &ChessBoard) -> u8 {
        let piece_count = pos.occ().count_ones() as u8 - 1;
        (piece_count / 4) as u8
    }
}

one can just define that for the output buckets right?

formal smelt Mar 16, 2025, 4:00 PM

#

Yeah

#

And for the optimiser stuff you can implement a custom optimiser that wraps the existing ranger one with any extra stuff you need

#

As long as it doesn’t require any funky gpu operations that aren’t yet supported

#

#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount;
impl OutputBuckets<ChessBoard> for SfMaterialCount {
    // 8 material buckets of 4 pieces each, so the bucket count is fixed.
    const BUCKETS: usize = 8;

    fn bucket(&self, pos: &ChessBoard) -> u8 {
        // 2..=32 pieces on the board map to buckets 0..=7.
        let piece_count = pos.occ().count_ones() as u8 - 1;
        piece_count / 4
    }
}

don't need the generic param also

candid ivy Mar 16, 2025, 4:18 PM

#

yeah just realised, im getting 670345 pos/sec for that arch

formal smelt Mar 16, 2025, 4:18 PM

#

damn my gpu sucks lol

violet badger Mar 16, 2025, 5:17 PM

#

hmm, I do have a GPU I'd like to test, guess I have to figure out what to run. Is there a TL;DR / list of commands I could run. Worth mentioning I've never run anything rusty 🙂

round stone Mar 16, 2025, 5:26 PM

#

here's a good place to start with getting bullet working for training:
https://github.com/jw1912/bullet/blob/main/docs/2-getting-started.md

#

after installing rust, take a look at the simple.rs example and train with cargo r -r --example simple

formal smelt Mar 16, 2025, 5:37 PM

#

That is somewhat out of date now, the CUDA requirement is lower than 12.2 and you can also compile for CPU
The instructions in there should work fine tho

candid ivy Mar 16, 2025, 5:48 PM

#

violet badger hmm, I do have a GPU I'd like to test, guess I have to figure out what to run. I...

clone this https://github.com/Disservin/sf-bullet-train.git
replace the binpack path

let file_path = "G:\\stockfish-data\\leela96-filt-v2.min.binpack";

and run
cargo run --release .

violet badger Mar 16, 2025, 5:49 PM

#

ok, great, just got the simple to work..

candid ivy Mar 16, 2025, 5:51 PM

#

candid ivy clone this <https://github.com/Disservin/sf-bullet-train.git> replace the binpac...

if you run this on your arm cpu you need to drop the bmi2 in the Cargo.toml
sfbinpack = { package = "binpack", git = "https://github.com/Disservin/binpack-rust", rev = "483e9aac028b4c3e0671af6b28ff50f64d696558", features = ["bmi2"]}

violet badger Mar 16, 2025, 5:54 PM

#

working 'out-of-the-box'

#

Params: 72205464
Beginning Training
Net Name               : test
Batch Size             : 16384
Batches / Superbatch   : 1024
Positions / Superbatch : 16777216
Start Superbatch       : 1
End Superbatch         : 10
Eval Scale             : 400
Save Rate              : 150
WDL Scheduler          : constant 0
LR Scheduler           : start 0.001 gamma 0.3 drop every 60 superbatches
Threads                : 4
Output Path            : checkpoints
superbatch 1 | time 2.6s | running loss 0.000000 | 6363127 pos/sec | total time 4.4s
Estimated time remaining in training: 0h 0m 39s
superbatch 2 | time 2.6s | running loss 0.000000 | 6463666 pos/sec | total time 6.9s
Estimated time remaining in training: 0h 0m 27s
superbatch 3 | time 2.6s | running loss 0.000000 | 6474294 pos/sec | total time 9.5s
Estimated time remaining in training: 0h 0m 22s
superbatch 4 | time 2.6s | running loss 0.000000 | 6477194 pos/sec | total time 12.1s
Estimated time remaining in training: 0h 0m 18s
superbatch 5 | time 2.6s | running loss 0.000000 | 6475190 pos/sec | total time 14.7s
Estimated time remaining in training: 0h 0m 14s
superbatch 6 | time 2.6s | running loss 0.000000 | 6475920 pos/sec | total time 17.3s
Estimated time remaining in training: 0h 0m 11s
superbatch 7 | time 2.6s | running loss 0.000000 | 6478756 pos/sec | total time 19.9s
Estimated time remaining in training: 0h 0m 8s
superbatch 8 | time 2.6s | running loss 0.000000 | 6475852 pos/sec | total time 22.5s
Estimated time remaining in training: 0h 0m 5s
superbatch 9 | time 2.6s | running loss 0.000000 | 6480538 pos/sec | total time 25.1s
Estimated time remaining in training: 0h 0m 2s
superbatch 10 | time 2.6s | running loss 0.000000 | 6494769 pos/sec | total time 27.7s
Estimated time remaining in training: 0h 0m 0s
Saved [test-10]
Total Training Time: 0h 0m 34s
Eval: 0.000cp

#

now that loss / eval is maybe a bit suspicious?

candid ivy Mar 16, 2025, 5:56 PM

#

yeah deffo

violet badger Mar 16, 2025, 5:57 PM

#

hmm, I'm using the CUDA_PATH setting, not HIP.

#

seems happy building things, though

#

(but no changes to your repo, except for dropping bmi2 and specifying the path to the .binpack)

candid ivy Mar 16, 2025, 5:59 PM

#

superbatch 1 | time 26.0s | running loss 0.011861 | 645176 pos/sec | total time 28.5s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 41s
Eval: 23.493cp

thats how it should end, just changing the superbatches from 10 to 1 here so that it finishes quickly

#

did the simple example do something reasonable?

violet badger Mar 16, 2025, 6:00 PM

#

well, probably not:

Output Path            : checkpoints
superbatch 1 | time 8.7s | running loss 0.000000 | 11501668 pos/sec | total time 11.7s
Estimated time remaining in training: 0h 7m 37s

#

similarly 0.0 loss

upbeat pewter Mar 16, 2025, 6:02 PM

#

violet badger well, probably not: ``` Output Path : checkpoints superbatch 1 | time...

what happens if you add the line trainer.sanity_check(); before trainer.run() in the simple example?

violet badger Mar 16, 2025, 6:04 PM

#

lots of red ..

📎 message.txt

formal smelt Mar 16, 2025, 6:04 PM

#

What GPU is this?

violet badger Mar 16, 2025, 6:04 PM

#

GH200

#

I suspect the issue might be host is arm (aarch64)

#

but idk.

formal smelt Mar 16, 2025, 6:06 PM

#

Try running cargo test --package bullet_core

violet badger Mar 16, 2025, 6:07 PM

#

all green

#

running 11 tests
test backend::cpu::crelu ... ok
test backend::cpu::matmul ... ok
test backend::cpu::relu ... ok
test backend::cpu::matmul2 ... ok
test backend::cpu::concat ... ok
test backend::cpu::screlu ... ok
test backend::cpu::sparse_affine ... ok
test backend::cpu::sparse_affine_check_not_batched ... ok
test backend::cpu::sparse_affine_batched_biases ... ok
test backend::cpu::sqrrelu ... ok
test backend::cpu::sparse_affine_dual ... ok

test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

#

though that's the CPU backend?

formal smelt Mar 16, 2025, 6:10 PM

#

Yeah

#

cargo test bullet_hip_backend

#

For gpu backend

formal smelt Mar 16, 2025, 6:10 PM

#

violet badger hmm, I'm using the CUDA_PATH setting, not HIP.

You aren’t passing the HIP feature are you?

upbeat pewter Mar 16, 2025, 6:10 PM

#

wouldn't it just flat-out fail to link?

violet badger Mar 16, 2025, 6:11 PM

#

I guess no? That test however does:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_core-a748fb7f74ceb385)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_hip_backend-f99c5dab2957960e)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/debug/deps/bullet_lib-21ef2babe27415ff)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

formal smelt Mar 16, 2025, 6:11 PM

#

Oh i sent wrong command

#

cargo test --package bullet_hip_backend

violet badger Mar 16, 2025, 6:12 PM

#

ok..

#

errors and passes

#

results

📎 message.txt

formal smelt Mar 16, 2025, 6:13 PM

#

I’ll get home in a minute, will link you a branch to try

violet badger Mar 16, 2025, 6:14 PM

#

happy to try out (probably after dinner, let's see)

candid ivy Mar 16, 2025, 6:14 PM

#

the typical vondele dinner break, i ran into many of those 😄

violet badger Mar 16, 2025, 6:14 PM

#

good stuff coming 😉

upbeat pewter Mar 16, 2025, 6:20 PM

#

violet badger GH200

I guess this is why people joke about vondele's laptop; jeez that's some kit :p

violet badger Mar 16, 2025, 6:23 PM

#

it can run crysis

formal smelt Mar 16, 2025, 6:29 PM

#

okay on branch debug-gh200, run cargo test --package bullet_hip_backend followed by cargo r -r --example simple

violet badger Mar 16, 2025, 6:31 PM

#

test still fails

#

running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... FAILED
test tests::sqrrelu ... FAILED
test tests::screlu ... FAILED
test tests::sparse_affine ... FAILED
test tests::sparse_affine_dual ... FAILED
test tests::sparse_affine_batched_biases ... FAILED
test tests::crelu ... FAILED
test tests::matmul ... ok
test tests::concat ... ok
test tests::matmul2 ... ok

formal smelt Mar 16, 2025, 6:31 PM

#

i pushed another change just now that'll affect running the simple example

#

it looks like the kernels are not running but not raising any error about it

#

baffling

violet badger Mar 16, 2025, 6:33 PM

#

Ah, wait: called `Result::unwrap()` on an `Err` value: Cuda(cudaErrorUnsupportedPtxVersion)

formal smelt Mar 16, 2025, 6:33 PM

#

oh nice

#

what's the output of nvidia-smi?

upbeat pewter Mar 16, 2025, 6:33 PM

#

driver issues

formal smelt Mar 16, 2025, 6:34 PM

#

indeed the most common cause of this error is having a mismatched driver for your cuda version

violet badger Mar 16, 2025, 6:34 PM

#

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000009:01:00.0 Off |                    0 |
| N/A   16C    P0             95W /  900W |     290MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
...

candid ivy Mar 16, 2025, 6:34 PM

#

cudaErrorUnsupportedPtxVersion
This indicates that the provided PTX was compiled with an unsupported toolchain. The most common reason for this, is the PTX was generated by a compiler newer than what is supported by the CUDA driver and PTX JIT compiler.

#

a very "cool" gpu :p

upbeat pewter Mar 16, 2025, 6:35 PM

#

unfortunately nvidia only list minimum requirements for x86-64 drivers

violet badger Mar 16, 2025, 6:36 PM

#

driver and cuda version definitely are compatible..

upbeat pewter Mar 16, 2025, 6:36 PM

#

https://fieldprogrammable.gay/files/f660acb0-505b-43f0-ab3c-9edc6066c95b.png
this would match up though

#

...did you update your drivers since last rebooting?

#

wondering if it's a kernel/userland mismatch

violet badger Mar 16, 2025, 6:37 PM

#

me no... but certainly all good on that front.

candid ivy Mar 16, 2025, 6:37 PM

#

the guy here apparently had to export PATH and LD_LIBRARY_PATH and do a reboot
https://forums.developer.nvidia.com/t/agx-orin-error-cudaerrorunsupportedptxversion-the-provided-ptx-was-compiled-with-an-unsupported-toolchain/294408/7

violet badger Mar 16, 2025, 6:38 PM

#

well, let me test something...

upbeat pewter Mar 16, 2025, 6:40 PM

#

candid ivy the guy here apparently add to export, PATH and LD_LIBRARY_PATH and do a reboot ...

I think that's required for CUDA in general

violet badger Mar 16, 2025, 7:08 PM

#

I tried a different rust install, it said this:

error: rustc 1.81.0-nightly is not supported by the following package:
  [email protected] requires rustc 1.83

so I guess that's indeed a requirement for bullet_core?

upbeat pewter Mar 16, 2025, 7:08 PM

#

violet badger I tried a different rust install, it said this: ``` error: rustc 1.81.0-nightly ...

yes

violet badger Mar 16, 2025, 7:09 PM

#

ok, so back to the previous install.

formal smelt Mar 16, 2025, 7:09 PM

#

The rust install won’t make a difference here

#

The issue is with compiling the cuda kernels and rust has nothing to do with that other than build script invoking nvcc

violet badger Mar 16, 2025, 7:11 PM

#

is there a way to get output of that compilation (in particular invocation of nvcc, or similar?)

formal smelt Mar 16, 2025, 7:11 PM

#

Yeah one sec I’m cooking dinner

upbeat pewter Mar 16, 2025, 7:11 PM

#

CXX=/bin/false cargo run ...

#

:p

violet badger Mar 16, 2025, 7:12 PM

#

elegant ..

#

is there some equivalent of 'make clean' ?

upbeat pewter Mar 16, 2025, 7:13 PM

#

cargo clean

violet badger Mar 16, 2025, 7:13 PM

#

Removed 1499 files, 597.5MiB total yeah

#

though no difference

#

so with the CXX=/bin/false I don't get much more than:
error occurred in cc-rs: Command LC_ALL="C" "nvcc" "-ccbin=/bin/false" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu" with args nvcc did not execute successfully (status code exit status: 1).

#

Also

  --- stdout
  cargo:rerun-if-changed=./kernels
  rerun-if-env-changed=CUDA_PATH
  Path CUDA_PATH="/user-environment/env/default"
  cargo:rustc-link-lib=dylib=cublas
  cargo:rustc-link-search=native=/user-environment/env/default/lib64
  cargo:rerun-if-changed=/user-environment/env/default/include
  TARGET = Some(aarch64-unknown-linux-gnu)
  HOST = Some(aarch64-unknown-linux-gnu)
  cargo:rerun-if-env-changed=CXX_aarch64-unknown-linux-gnu
  CXX_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXX_aarch64_unknown_linux_gnu
  CXX_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXX
  HOST_CXX = None
  cargo:rerun-if-env-changed=CXX
  CXX = Some(/bin/false)
  RUSTC_WRAPPER = None
  cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
  cargo:rerun-if-env-changed=NVCC_aarch64-unknown-linux-gnu
  NVCC_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=NVCC_aarch64_unknown_linux_gnu
  NVCC_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_NVCC
  HOST_NVCC = None
  cargo:rerun-if-env-changed=NVCC
  NVCC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some(neon)
  cargo:rerun-if-env-changed=CXXFLAGS_aarch64-unknown-linux-gnu
  CXXFLAGS_aarch64-unknown-linux-gnu = None
  cargo:rerun-if-env-changed=CXXFLAGS_aarch64_unknown_linux_gnu
  CXXFLAGS_aarch64_unknown_linux_gnu = None
  cargo:rerun-if-env-changed=HOST_CXXFLAGS
  HOST_CXXFLAGS = None
  cargo:rerun-if-env-changed=CXXFLAGS
  CXXFLAGS = None
  CARGO_ENCODED_RUSTFLAGS = Some()

candid ivy Mar 16, 2025, 7:22 PM

#

violet badger so with the `CXX=/bin/false` I don't get much more than: ` error occurred in cc...

i think there isn't much more expected since bullet just compiles that one file

formal smelt Mar 16, 2025, 7:22 PM

#

https://github.com/jw1912/bullet/blob/main/crates/bullet_hip_backend/build.rs#L59 replace this block with

    cc::Build::new()
        .cargo_warnings(false)
        .cuda(true)
        .cudart("shared")
        .cargo_debug(true)
        .debug(true)
        .opt_level(3)
        .files(&[KERNELS])
        .out_dir(out_path)
        .compile("libkernels.a");

    panic!();

#

the issue is just setting CXX to something invalid doesn't actually compile anything

#

ideally would like to see what gets emitted

violet badger Mar 16, 2025, 7:24 PM

#

full output?

formal smelt Mar 16, 2025, 7:25 PM

#

yeah

#

i suspect it will be rather large

violet badger Mar 16, 2025, 7:25 PM

#

it is 🙂

#

output

📎 message.txt

#

That's the strange bit?

  running: "nvcc" "-?"
  cargo:warning=nvcc fatal   : Unknown option '-?'
  exit status: 1

formal smelt Mar 16, 2025, 7:30 PM

#

nah that's just the cc crate being weird

#

this is doing some crazy stuff after the initial nvcc invocation

#

i assume due to using gcc

#

  running: LC_ALL="C" "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-G" "-Xcompiler" "-gdwarf-4" "-Xcompiler" "-fno-omit-frame-pointer" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu"
  exit status: 0

#

the actual command seems to just work

violet badger Mar 16, 2025, 7:32 PM

#

yeah c++ is definitely gcc

formal smelt Mar 16, 2025, 7:32 PM

#

maybe just need to work out what ptx version your system wants and tell it to emit that

#

https://github.com/jw1912/bullet/blob/main/docs/2-getting-started.md#general i recommend clang but i dont believe thats causing the issue here

violet badger Mar 16, 2025, 7:34 PM

#

so right now no clang available...

#

though it must be somewhere ..

#

would be easier if we could restrict the ptx version or so.

formal smelt Mar 16, 2025, 7:35 PM

#

doesn't seem possible but can check the ptx version emitted

violet badger Mar 16, 2025, 7:35 PM

#

I don't see any option that specifies the gpu type

formal smelt Mar 16, 2025, 7:35 PM

#

ptx version != target sm

#

https://github.com/jw1912/cuda-stuff can you try cloning this repo and running make TARGET=sparse_fwd?

#

if this works fine then it would be some issue with the nvcc commands that bullet invokes

violet badger Mar 16, 2025, 7:40 PM

#

compiles into a main.exe

candid ivy Mar 16, 2025, 7:40 PM

#

mv main.exe main

violet badger Mar 16, 2025, 7:40 PM

#

nvprof is executed but not installed

formal smelt Mar 16, 2025, 7:40 PM

#

try running main

violet badger Mar 16, 2025, 7:40 PM

#

fails.

#

interesting

#

Running naive
Average Time: 0.625ms
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.

formal smelt Mar 16, 2025, 7:40 PM

#

well that is a relief to me at least

#

so its almost surely not bullet-side issue

violet badger Mar 16, 2025, 7:41 PM

#

good let me figure this out.

formal smelt Mar 16, 2025, 7:41 PM

#

sorry, im not really qualified to debug general cuda toolchain issues

violet badger Mar 16, 2025, 7:41 PM

#

might take some time.

formal smelt Mar 16, 2025, 7:41 PM

#

sounds good

formal smelt Mar 16, 2025, 7:41 PM

#

violet badger Running naive Average Time: 0.625ms Running vectorised Average Time: 0.000ms Err...

wait holy shit

#

nahhhh

#

that looks like the naive kernel runs

upbeat pewter Mar 16, 2025, 7:41 PM

#

proposal: bullet should just always run the sanity checks on startup

formal smelt Mar 16, 2025, 7:42 PM

#

upbeat pewter proposal: bullet should just always run the sanity checks on startup

yeah in silent mode

#

good thinking batman

#

i'll PR that later

upbeat pewter Mar 16, 2025, 7:42 PM

#

I'm like robin at best

formal smelt Mar 16, 2025, 7:45 PM

#

violet badger Running naive Average Time: 0.625ms Running vectorised Average Time: 0.000ms Err...

could you checkout check-errors-everywhere and do the same process again?

violet badger Mar 16, 2025, 7:46 PM

#

Running naive
Average Time: 0.597ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error

#

so strange. Let me ask around

formal smelt Mar 16, 2025, 7:47 PM

#

ah so it just takes 0.6ms to realise the ptx is the wrong version

violet badger Mar 16, 2025, 7:47 PM

#

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:27:38_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0

formal smelt Mar 16, 2025, 7:47 PM

#

aha

upbeat pewter Mar 16, 2025, 7:47 PM

#

violet badger ``` +---------------------------------------------------------------------------...

that...that would do it

formal smelt Mar 16, 2025, 7:47 PM

#

violet badger ``` +---------------------------------------------------------------------------...

^^

violet badger Mar 16, 2025, 7:47 PM

#

so compiler newer than the driver?

upbeat pewter Mar 16, 2025, 7:47 PM

#

it really was driver issues :p

#

yes

formal smelt Mar 16, 2025, 7:48 PM

#

yeah looks like it

violet badger Mar 16, 2025, 7:48 PM

#

so compiler issues..

#

let me ask

upbeat pewter Mar 16, 2025, 7:50 PM

#

do you have 12.4 installed somewhere on your FS?

violet badger Mar 16, 2025, 7:51 PM

#

things are a bit more complicated in this setup... somewhat different approach to deploy software

violet badger Mar 16, 2025, 8:09 PM

#

formal smelt <https://github.com/jw1912/cuda-stuff> can you try cloning this repo and running...

OK, this is fixed if I pass nvcc -arch=native src/sparse_fwd/main.cu -o main.exe -lcublas

#

Running naive
Average Time: 0.586ms
Error Status: no error
Error Status: no error
Running vectorised
Average Time: 0.198ms
Error Status: no error
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.190ms
Error Status: no error
Error Status: no error
Error Status: no error

#

any way to pass -arch=native to the rust compilation args?

#

I think without specifying anything it compiles for the highest possible arch

upbeat pewter Mar 16, 2025, 8:10 PM

#

RUSTFLAGS="-C target-cpu=native" cargo run ...

violet badger Mar 16, 2025, 8:10 PM

#

which is more than GH200 already

#

target-cpu ?

upbeat pewter Mar 16, 2025, 8:11 PM

#

though for nvcc, hm

#

somewhere in this section of crates/bullet_hip_backend/build.rs (jw1912/bullet, main), try adding .flag("-arch=native")

violet badger Mar 16, 2025, 8:14 PM

#

bingo!

#

running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... ok
test tests::matmul ... ok
test tests::crelu ... ok
test tests::sparse_affine ... ok
test tests::sqrrelu ... ok
test tests::screlu ... ok
test tests::sparse_affine_batched_biases ... ok
test tests::concat ... ok
test tests::sparse_affine_dual ... ok
test tests::matmul2 ... ok

upbeat pewter Mar 16, 2025, 8:14 PM

#

\o/

formal smelt Mar 16, 2025, 8:15 PM

#

yay

violet badger Mar 16, 2025, 8:16 PM

#

so, moving to the simple example, I guess

#

or to @candid ivy code, but I assume that needs a bullet upstream fix?

formal smelt Mar 16, 2025, 8:17 PM

#

what fix was needed?

candid ivy Mar 16, 2025, 8:17 PM

#

upbeat pewter somewhere in [this section](https://github.com/jw1912/bullet/blob/main/crates/bu...

this

formal smelt Mar 16, 2025, 8:17 PM

#

i dont think that's needed but might be faster

violet badger Mar 16, 2025, 8:17 PM

#

diff --git a/crates/bullet_hip_backend/build.rs b/crates/bullet_hip_backend/build.rs
index 71f7989..1c9cc2e 100644
--- a/crates/bullet_hip_backend/build.rs
+++ b/crates/bullet_hip_backend/build.rs
@@ -62,6 +62,7 @@ fn build_cuda(out_path: &Path) {
         .cudart("shared")
         .debug(false)
         .opt_level(3)
+        .flag("-arch=native")
         .files(&[KERNELS])
         .out_dir(out_path)
         .compile("libkernels.a");

candid ivy Mar 16, 2025, 8:17 PM

#

violet badger or to @candid ivy code, but I assume that needs a bullet upstream fix?

if you don't want to hack then yes, just copy the main.rs into the sample.rs
or just checkout to sf-arch-i-think on the upstream bullet repo and run the advanced file

violet badger Mar 16, 2025, 8:17 PM

#

I think without this option, it takes the highest sm that the compiler supports..

#

which might include unknown ptx

formal smelt Mar 16, 2025, 8:19 PM

#

it defaults to sm_52

violet badger Mar 16, 2025, 8:19 PM

#

well that is probably ancient enough as well 😉

candid ivy Mar 16, 2025, 8:19 PM

#

i think my 4090 is 80 or something?

formal smelt Mar 16, 2025, 8:19 PM

#

it would be good to get data points on if this is a measurable speedup

candid ivy Mar 16, 2025, 8:19 PM

#

let me test this

formal smelt Mar 16, 2025, 8:21 PM

#

im not seeing a speedup locally but i have a crap gpu

upbeat pewter Mar 16, 2025, 8:22 PM

#

I feel like "necessary for building on GH200 [even if obscure] and not a slowdown" is sufficiently convincing, tbh

formal smelt Mar 16, 2025, 8:22 PM

#

yeah i'll merge it in

violet badger Mar 16, 2025, 8:22 PM

#

superbatch 1 | time 8.8s | running loss 0.037813 | 11413254 pos/sec | total time 10.7s

#

simple working..

formal smelt Mar 16, 2025, 8:22 PM

#

looks dataloader bottlenecked

#

nice

violet badger Mar 16, 2025, 8:24 PM

#

you expect that bottleneck to be the filesystem or something of the code?

formal smelt Mar 16, 2025, 8:24 PM

#

if you're loading bulletformat then its filesystem

violet badger Mar 16, 2025, 8:24 PM

#

I guess this binpack is small enough to store in RAM.

#

no binpack

formal smelt Mar 16, 2025, 8:24 PM

#

for binpack loading it's the shuffling step

candid ivy Mar 16, 2025, 8:24 PM

#

formal smelt it would be good to get data points on if this is a measurable speedup

eye balling the change id say a 3% performance increase, but id estimate the error to be 5%

formal smelt Mar 16, 2025, 8:24 PM

#

which would be replaced by fen skipping aggressively

formal smelt Mar 16, 2025, 8:25 PM

#

candid ivy eye balling the change id say a 3% performance increase, but id estimate the err...

good enough for me

candid ivy Mar 16, 2025, 8:26 PM

#

violet badger I guess this binpack is small enough to store in RAM.

i guess you are already trying /dev/shm ?

violet badger Mar 16, 2025, 8:26 PM

#

not yet, can do now

formal smelt Mar 16, 2025, 8:27 PM

#

okay i've merged the native thing and rebased sf-arch-i-think branch

violet badger Mar 16, 2025, 8:29 PM

#

candid ivy i guess you are already trying /dev/shm ?

no real difference superbatch 1 | time 8.6s | running loss 0.037825 | 11637627 pos/sec | total time 10.2s (on simple)

upbeat pewter Mar 16, 2025, 8:29 PM

#

then yeah, probably bottlenecked on the data loader

formal smelt Mar 16, 2025, 8:29 PM

#

or even preparing the data

#

you can try increasing the threads for each

upbeat pewter Mar 16, 2025, 8:29 PM

#

(it was a missed opportunity to not call loaders magazines /j)

formal smelt Mar 16, 2025, 8:29 PM

#

but simple is, well, very simple and tiny arch

formal smelt Mar 16, 2025, 8:29 PM

#

upbeat pewter (it was a missed opportunity to not call loaders magazines /j)

lol

#

https://github.com/jw1912/bullet/blob/main/examples/simple.rs#L56 for data preparing threads

#

https://github.com/jw1912/bullet/blob/main/examples/simple.rs#L62 data loading threads

violet badger Mar 16, 2025, 8:30 PM

#

increased both to 16

#

Worse superbatch 1 | time 12.4s | running loss 0.037861 | 8092996 pos/sec | total time 14.2

formal smelt Mar 16, 2025, 8:31 PM

#

probably increasing the loader threads doing that

#

i'd guess the bottleneck is the shuffling step

#

which is single threaded

violet badger Mar 16, 2025, 8:33 PM

#

16/4 also worse superbatch 1 | time 14.1s | running loss 0.037859 | 7109966 pos/sec | total time 15.7s

formal smelt Mar 16, 2025, 8:33 PM

#

damn

violet badger Mar 16, 2025, 8:34 PM

#

4/16 roughly equal superbatch 1 | time 9.2s | running loss 0.037878 | 10833312 pos/sec | total time 11.0s

formal smelt Mar 16, 2025, 8:34 PM

#

interesting

violet badger Mar 16, 2025, 8:35 PM

#

4/4 best superbatch 1 | time 8.6s | running loss 0.037861 | 11631975 pos/sec | total time 10.3s

formal smelt Mar 16, 2025, 8:36 PM

#

this is just a 768->128x2->1 network tbf
could you try the advanced example on branch sf-arch-i-think

violet badger Mar 16, 2025, 8:36 PM

#

ok, let me swap branches.

formal smelt Mar 16, 2025, 8:36 PM

#

nice

upbeat pewter Mar 16, 2025, 8:37 PM

#

violet badger 4/4 best `superbatch 1 | time 8.6s | running loss 0.037861 | 11631975 pos/sec | ...

I think I top out at like 7 million positions per second, so this is pretty huge :p

formal smelt Mar 16, 2025, 8:38 PM

#

luckily as long as you aren't actually hitting the data loader bottleneck, it doesn't matter how fast/slow it is

#

so as long as less than 11m pos/sec on a real arch, all should be good

violet badger Mar 16, 2025, 8:38 PM

#

superbatch 1 [75.0% (768/1024 batches, 1638909 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }

#

some panic...

#

hmm, but not every time

#

superbatch 1 | time 10.2s | running loss 0.018962 | 1647798 pos/sec | total time 11.8s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 19s
Eval: 68.779cp

formal smelt Mar 16, 2025, 8:41 PM

#

yeesh

formal smelt Mar 16, 2025, 8:41 PM

#

violet badger ``` superbatch 1 [75.0% (768/1024 batches, 1638909 pos/sec)] thread '<unnamed>' ...

binpack loading code has an edge case it seems

#

i would guess its something like if the binpack can't fill the shuffle buffer

violet badger Mar 16, 2025, 8:42 PM

#

it is a big binpack..

candid ivy Mar 16, 2025, 8:42 PM

#

candid ivy yeah just realised, im getting `670345 pos/sec` for that arch

+1m pos/sec than my gpu damn

violet badger Mar 16, 2025, 8:42 PM

#

and another:

superbatch 1 [62.5% (640/1024 batches, 1635149 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }

formal smelt Mar 16, 2025, 8:43 PM

#

oh i see the bug

violet badger Mar 16, 2025, 8:43 PM

#

ready to pull when you are 😉

formal smelt Mar 16, 2025, 8:43 PM

#

its sending the message to stop one of the loader threads twice

#

and on the second time it errors because the thread isn't around anymore to receive it
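
A minimal sketch of the failure mode described here, using only `std::sync::mpsc`. The `Control` type and `stop_worker_twice` helper are hypothetical and are not bullet's actual loader code; the point is only that a second `send` fails once the receiving thread has exited, so an `.unwrap()` on it panics with a `SendError`.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical control message for a loader thread.
enum Control {
    Stop,
}

/// Spawns a worker that exits after its first message, then sends `Stop`
/// twice. Returns whether each of the two sends succeeded.
fn stop_worker_twice() -> (bool, bool) {
    let (tx, rx) = mpsc::channel::<Control>();

    let worker = thread::spawn(move || {
        // The worker returns after one message; `rx` is dropped on exit.
        let _ = rx.recv();
    });

    // The receiver is still alive (it is blocked in `recv`), so this succeeds.
    let first = tx.send(Control::Stop).is_ok();

    // Wait for the worker to finish so the receiver is definitely gone.
    worker.join().unwrap();

    // No receiver is left: `send` returns Err(SendError(..)).
    // Calling `.unwrap()` on this result would panic.
    let second = tx.send(Control::Stop).is_ok();

    (first, second)
}
```

Checking the result (or ignoring it with `let _ = tx.send(..)`) instead of unwrapping is one way to make a repeated stop request harmless.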

#

oh wait

formal smelt Mar 16, 2025, 8:49 PM

#

violet badger ready to pull when your are 😉

have merged

violet badger Mar 16, 2025, 8:50 PM

#

trying...

#

ok, 3x without errors

formal smelt Mar 16, 2025, 8:52 PM

#

yay

#

https://github.com/jw1912/bullet/blob/sf-arch-i-think/examples/advanced.rs#L97 could you uncomment the two lines on either side of trainer.run and run it?

#

should give some profiling info

violet badger Mar 16, 2025, 8:53 PM

#

sure, I see the GPUs 75% idle 😉

upbeat pewter Mar 16, 2025, 8:53 PM

#

we need bigger batches /lh

violet badger Mar 16, 2025, 8:53 PM

#

no, just kidding, it is using 1/4 GPUs

candid ivy Mar 16, 2025, 8:54 PM

#

how much ram do you have?

#

& gpu ram?

formal smelt Mar 16, 2025, 8:56 PM

#

upbeat pewter we need bigger batches /lh

i think HL=3072 might be so large that the pairwise mul might trip a cuda error for launching kernel with too large grid size if you increase batch size too much

violet badger Mar 16, 2025, 8:56 PM

#

confused, but I think 4x (117GB + 94GB) usable

#

855 GB total

candid ivy Mar 16, 2025, 8:57 PM

#

that 211GB probably is able to fit the entire dataset of people into ram 😛

violet badger Mar 16, 2025, 8:58 PM

#

well Sf datasets are large

#

and it is seemingly not the bottleneck

#

profiling reduces the performance a little bit

#

profile

📎 message.txt

formal smelt Mar 16, 2025, 9:02 PM

#

interesting ratio
| Node 21 = SparseAffineDualActivate 1367 4208 | ~3.08
compared to mine
| Node 21 = SparseAffineDualActivate 37160 87121 | ~2.34

#

i suspect the aligned loads on the forward pass are going extremely hard

#

well this has been interesting

#

GH200 is insane

upbeat pewter Mar 16, 2025, 9:07 PM

#

the chip's worth like a year's wages for me >.>

violet badger Mar 16, 2025, 9:07 PM

#

it is quite nice indeed... if there is anything I can still measure let me know, thanks for the help getting it to run.

formal smelt Mar 16, 2025, 9:08 PM

#

i have some upcoming optimisations that will hopefully help this arch a lot in particular

violet badger Mar 16, 2025, 9:08 PM

#

I'd be more than happy to test a multi-gpu implementation if it appears 🙂

#

ok, let me know.

formal smelt Mar 16, 2025, 9:10 PM

#

violet badger I'd be more than happy to test a multi-gpu implementation if it appears 🙂

I wanted to switch to using cudarc (rather than custom code as is at the moment) before doing that because then I'd be offloading writing tons of device handling boilerplate (for which the safety isn't trivial)
but then some soundness issue was pointed out and now its getting loads of breaking changes lol https://github.com/coreylowman/cudarc/issues/340

#

might just stick with the current cudarc version and not touch the unsound part

#

too many things to do, not enough free time

violet badger Mar 16, 2025, 9:11 PM

#

relatable ..

violet badger Mar 16, 2025, 9:34 PM

#

so, I think I found back the old nnue-pytorch benchmarks I did #nnue-dev message

#

now, can we match that to what we just ran with bullet?

upbeat pewter Mar 16, 2025, 9:36 PM

#

I have no idea what a pytorch iteration is :p

violet badger Mar 16, 2025, 9:37 PM

#

I think it is the same 16384 batch ...

upbeat pewter Mar 16, 2025, 9:38 PM

#

39.66 × 16384 = 649789.44

#

so that's 650k positions/s for 2560 L1

violet badger Mar 16, 2025, 9:38 PM

#

I think so as well.

candid ivy Mar 16, 2025, 9:39 PM

#

that's also the math jw did earlier

violet badger Mar 16, 2025, 9:39 PM

#

so, probably could run bullet for that size to compare better

upbeat pewter Mar 16, 2025, 9:39 PM

#

which means bullet trains 2.4x faster on an even larger L1 :p

violet badger Mar 16, 2025, 9:40 PM

#

though the dataskipping will be different... and that might be limiting. idk

#

2560 superbatch 1 | time 8.6s | running loss 0.019224 | 1960857 pos/sec | total time 10.3s

upbeat pewter Mar 16, 2025, 9:40 PM

#

It's also probably not entirely equivalent because bullet uses superbatches

violet badger Mar 16, 2025, 9:41 PM

#

so roughly 3x..
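
The arithmetic behind the "roughly 3x" figure, written out as a small sketch. The 39.66 it/s value is the nnue-pytorch number quoted in the thread, the batch size of 16384 is the assumption stated there ("I think it is the same 16384 batch"), and the function names are made up for illustration.

```rust
/// Positions per second from optimiser iterations per second and batch size.
fn positions_per_sec(iters_per_sec: f64, batch_size: u32) -> f64 {
    iters_per_sec * f64::from(batch_size)
}

/// Throughput ratio of `new` over `base`.
fn speedup(new: f64, base: f64) -> f64 {
    new / base
}

/// Ratio of the bullet run at L1 = 2560 (1960857 pos/sec in the log)
/// to the nnue-pytorch figure (39.66 it/s at an assumed batch of 16384).
fn bullet_vs_pytorch_2560() -> f64 {
    let pytorch = positions_per_sec(39.66, 16_384); // about 650k pos/sec
    speedup(1_960_857.0, pytorch)
}
```

As noted in the surrounding messages, the two trainers skip data differently, so this ratio is an indication rather than a like-for-like benchmark.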

#

but yes, comparison might be slightly off.

#

Still makes a big difference.

formal smelt Mar 16, 2025, 9:41 PM

#

violet badger though the dataskipping will be different... and that might be limiting. idk

not on bullet side at least

violet badger Mar 16, 2025, 9:42 PM

#

ok, so would be interesting to see a bullet trained net match SF master net.. obviously a non-trivial exercise.

#

though a 3x speedup would help doing that.

formal smelt Mar 16, 2025, 9:43 PM

#

as a fun additional note, if you decreased the number of the buckets the gap between the two would most likely grow quite significantly

#

it would take ages to recreate an entire run yeah

#

one stage probably reasonable

violet badger Mar 16, 2025, 9:46 PM

#

well, with this kind of speedup it probably goes rather quickly, less than a day for a stage.

formal smelt Mar 16, 2025, 9:46 PM

#

oh i didn't realise individual stages were so short

violet badger Mar 16, 2025, 9:47 PM

#

I think so, but I've a bit forgotten how long we train one stage.

#

@round stone will know

candid ivy Mar 16, 2025, 9:47 PM

#

well at first id like to see someone successfully load such a net into sf and get somewhere close like 200 elo range or something

violet badger Mar 16, 2025, 9:47 PM

#

sure..

candid ivy Mar 16, 2025, 9:48 PM

#

and when that is "public" then im sure people will naturally play around and try it

violet badger Mar 16, 2025, 9:48 PM

#

I'm assuming that if the net arch is the same that might be more or less doable?

round stone Mar 16, 2025, 9:48 PM

#

1st stage - 400 superbatches
2nd-11th stage ~ 800 superbatches each

violet badger Mar 16, 2025, 9:49 PM

#

so 10s per superbatch right now..

candid ivy Mar 16, 2025, 9:49 PM

#

the binary format is slightly different, like no header, (anything else?) and then leb128 and permutation i guess
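
For reference, a generic signed LEB128 encoder and decoder, sketched with `std` only. This shows the variable-length integer encoding itself; it is not Stockfish's reader or bullet's writer, and it leaves out any header or framing the real net format adds around the encoded values.

```rust
/// Appends the signed LEB128 encoding of `value` to `out`.
fn encode_sleb128(value: i32, out: &mut Vec<u8>) {
    let mut v = value;
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7; // arithmetic shift keeps the sign
        let sign_bit_set = byte & 0x40 != 0;
        // Stop once the remaining bits are pure sign extension.
        if (v == 0 && !sign_bit_set) || (v == -1 && sign_bit_set) {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit
    }
}

/// Decodes one signed LEB128 value from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or longer than an i32 allows.
fn decode_sleb128(bytes: &[u8]) -> Option<(i32, usize)> {
    let mut result: i32 = 0;
    let mut shift: u32 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        if shift >= 32 {
            return None; // too many bytes for a 32-bit value
        }
        result |= i32::from(byte & 0x7f) << shift;
        shift += 7;
        if byte & 0x80 == 0 {
            // Sign-extend if the last byte carries the sign bit.
            if shift < 32 && byte & 0x40 != 0 {
                result |= -1i32 << shift;
            }
            return Some((result, i + 1));
        }
    }
    None // ran out of input before the final byte
}
```

Small magnitudes take one byte each, which is why this kind of encoding suits weight arrays whose values cluster near zero.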

formal smelt Mar 16, 2025, 9:49 PM

#

violet badger so 10s per superbatch right now..

i think linrock is using the normal superbatch definition, the one in that sf arch example is different

#

i reduced batches/superbatch because otherwise it would take forever for me to run it lol

round stone Mar 16, 2025, 9:50 PM

#

1 bullet superbatch = 1 nnue-pytorch epoch

formal smelt Mar 16, 2025, 9:50 PM

#

right, ye

round stone Mar 16, 2025, 9:50 PM

#

both are ~100 million samples

formal smelt Mar 16, 2025, 9:51 PM

#

the sf-arch-i-think advanced example is doing ~16.78m samples per superbatch
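
The two superbatch sizes can be related with a little arithmetic. This is a sketch under the thread's own numbers: 1024 batches per superbatch (from the progress line in the log), an assumed batch size of 16384, and the 1960857 pos/sec measured for L1 = 2560; the helper names are hypothetical.

```rust
/// Batch size assumed in the thread ("I think it is the same 16384 batch").
const BATCH_SIZE: u64 = 16_384;

/// Samples in a superbatch made of `batches` batches.
fn samples_per_superbatch(batches: u64) -> u64 {
    batches * BATCH_SIZE
}

/// Batches needed to cover `samples` samples, rounded up.
fn batches_needed(samples: u64) -> u64 {
    (samples + BATCH_SIZE - 1) / BATCH_SIZE
}

/// Seconds to process `samples` samples at `pos_per_sec` positions per second.
fn seconds_for(samples: u64, pos_per_sec: f64) -> f64 {
    samples as f64 / pos_per_sec
}
```

With these, 1024 batches come to 16,777,216 samples (the "~16.78m" above), a 100 million sample superbatch needs 6104 batches, and at 1960857 pos/sec it would take about 51 s. An 800-superbatch stage would then be a bit over 11 hours, assuming that throughput held with the full skipping setup, which the thread does not establish.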

round stone Mar 16, 2025, 9:52 PM

#

nnue-pytorch training speed is much slower due to all the skipping going on

#

it's a tradeoff for strength

formal smelt Mar 16, 2025, 9:52 PM

#

im confused as to how it has an effect
in bullet you are either hard limited by data loading speed, or not limited at all

#

because all that is done asynchronously from training

#

so you're either waiting on batches to be sent or drawing from a pre-prepared queue with no delay
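
The "all or nothing" behaviour follows from putting a bounded queue between the loader and the trainer. A minimal sketch with `std::sync::mpsc::sync_channel`; the `Batch` type and `run_pipeline` function are stand-ins for illustration and are not bullet's implementation.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a prepared batch.
type Batch = Vec<u32>;

/// Runs a loader thread that prepares `num_batches` batches ahead of the
/// consumer, keeping at most `queue_depth` batches queued.
/// Returns the number of batches the "trainer" consumed.
fn run_pipeline(num_batches: usize, queue_depth: usize) -> usize {
    let (tx, rx) = sync_channel::<Batch>(queue_depth);

    let loader = thread::spawn(move || {
        for i in 0..num_batches {
            let batch = vec![i as u32; 4]; // stand-in for real data preparation
            // Blocks only when the queue is full, i.e. the trainer is slower.
            if tx.send(batch).is_err() {
                break; // trainer went away; stop quietly
            }
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });

    let mut consumed = 0;
    // Blocks only when the queue is empty, i.e. the loader is the bottleneck.
    for _batch in rx {
        consumed += 1; // stand-in for a training step
    }

    loader.join().unwrap();
    consumed
}
```

If the loader is faster than the training step, the queue stays full and the trainer never waits; if it is slower, every step waits on `recv` and loader speed sets the throughput.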

round stone Mar 16, 2025, 9:54 PM

#

all i know is disabling nnue-pytorch piece count probability skipping makes training a lot faster, but also weaker

violet badger Mar 16, 2025, 9:54 PM

#

yes, skipping in nnue-pytorch is definitely slowing it down.

formal smelt Mar 16, 2025, 9:55 PM

#

what cli args do i need to pass to disable fen skipping entirely?

round stone Mar 16, 2025, 9:55 PM

#

can't disable piece count skipping with args. have to modify the source code

#

otherwise maybe --no-wld-fen-skipping and --random-fen-skipping 0 in addition

formal smelt Mar 16, 2025, 9:57 PM

#

violet badger yes, skipping in nnue-pytorch is definitely slowing it down.

even on large nets?

violet badger Mar 16, 2025, 9:57 PM

#

yes

#

It is a while ago, but we're skipping a lot of fens in these runs.

formal smelt Mar 16, 2025, 9:59 PM

#

i'm mostly trying to understand how the data loader would be implemented if its effect on speed isn't all or nothing

#

https://github.com/official-stockfish/nnue-pytorch/blob/master/training_data_loader.cpp#L925 i can just remove this i think

violet badger Mar 16, 2025, 10:05 PM

#

and maybe more of those https://github.com/official-stockfish/nnue-pytorch/blob/14124a0c9c6d70b25f46e5bbe443c1c97fd55fee/training_data_loader.cpp#L873-L887 ?

I'm trying to reconstruct from old messages here #nnue-dev message but I think we were using only 1 out of 15 fens or so.

formal smelt Mar 16, 2025, 10:06 PM

#

some of those filters are also being done in bullet, and the random skipping and stuff can be disabled with cli args i think

#

188/6166 [00:54<28:45, 3.47it/s, loss=0. with defaults
187/6166 [00:56<30:19, 3.29it/s, loss=0.0128, v_num=7 with the piece count skipping removed and additional --no-smart-fen-skipping --no-wld-fen-skipping --random-fen-skipping 0 args

#

there's some variance between runs it seems

candid ivy Mar 16, 2025, 10:09 PM

#

Needs a bit to warmup too

formal smelt Mar 16, 2025, 10:12 PM

#

let me try a smaller net

#

on a very small net i can observe the training getting stuck cool

#

so i think it works how i expect it to

rocky vigil Mar 16, 2025, 11:11 PM

#

btw can someone test if https://github.com/sscg13/Stockfish/commit/6eafac4204bbbb06f47fb1bfadd3f912ce0a1c0e is a speed gain

#

on my laptop it's like ~5% for 4 thread speedtest

round stone Mar 17, 2025, 4:26 PM

#

rocky vigil btw can someone test if <https://github.com/sscg13/Stockfish/commit/6eafac4204bb...

Result of  10 runs
==================
base (...wld025-sb120) =     977351  +/- 9978
test (...peedup-maybe) =     975058  +/- 10170
diff                   =      -2292  +/- 13205

speedup        = -0.0023
P(speedup > 0) =  0.3670

rocky vigil Mar 17, 2025, 4:40 PM

#

Hmm I guess it’s ~neutral then

round stone Mar 17, 2025, 4:44 PM

#

you can also try running an STC on fishtest to see if a speedup is detected there

rocky vigil Mar 17, 2025, 4:46 PM

#

I mean all it does is use a pre-existing array to write the accumulator to and pass to L2

#

Instead of declaring a new one

#

So I’m not too keen on doing a whole STC

lofty cedar Mar 18, 2025, 8:24 AM

#

How's it going?

#

Never heard an idea taking this long...

naive comet Mar 18, 2025, 9:18 AM

#

not even NNUE?

lofty cedar Mar 18, 2025, 1:03 PM

#

NNUE was long ago... not sure what happened back then. It was something like someone bolted NNUE to Stockfish and it won, then Stockfish decided to slap NNUE onto Stockfish.

#

But the time to first gain was probably really quick given that classical evaluation was comparatively weak compared to NNUE.

naive comet Mar 18, 2025, 1:04 PM

#

thats not even close

lofty cedar Mar 18, 2025, 1:05 PM

#

Really? I thought the NNUE vs classical was a cakewalk.

lofty cedar Mar 18, 2025, 1:08 PM

#

naive comet thats not even close

It was not as simple as slapping an NNUE onto Stockfish to get an instant elo gain?

naive comet Mar 18, 2025, 1:12 PM

#

it is not that simple

#

you must keep in mind that sf hce is still 100s of elo stronger than any other hce around

twilit oriole Mar 18, 2025, 1:15 PM

#

There's many things going on at once here (like bullet transition also) and the pipeline is complex after 5 years of optimisation

lofty cedar Mar 18, 2025, 1:17 PM

#

Speaking of it, maybe this is the time to try something?

#

https://arxiv.org/pdf/2502.07176

#

Might as well try KAN?

formal smelt Mar 18, 2025, 1:21 PM

#

KANs are generally overrated

#

but yes, it could be tried

#

the advantage of this idea is that there already exists a strong network that indicated it could work (the monty value network)

#

and the implementation just requires training an NNUE version and working out how to best optimise the efficient updates

#

on the other hand there does not exist any KAN chess network of notable strength

lofty cedar Mar 18, 2025, 1:23 PM

#

formal smelt KANs are generally overrated

KANs are generally unscalable and prone to overfit, but Stockfish has an abundance of data for the model size and so on.

#

That being said, I tried KAN training and it failed to even beat master at loss.

formal smelt Mar 18, 2025, 1:24 PM

#

yes but obviously we would need a serious attempt at it to actually rule it out

#

your naivety is shown by comparing loss to master on the first networks you try to train...

#

secondly do you have a plan for UE or some ~equivalent with a KAN

#

a suggested architecture

#

etc

lofty cedar Mar 18, 2025, 1:25 PM

#

It was despite my network having a higher compute budget. I didn't completely rule it out, but I didn't deem it worth it.

lofty cedar Mar 18, 2025, 1:26 PM

#

formal smelt secondly do you have a plan for UE or some ~equivalent with a KAN

Well, the first layer would be the same.

#

Only the second layer onward would be KAN.

#

That being said, I only used the old feature transformer, not trained a whole new net.

naive comet Mar 18, 2025, 1:28 PM

#

lofty cedar It was despite my network having a higher compute budget. I didn't completely ru...

u sure about this?

lofty cedar Mar 18, 2025, 1:29 PM

#

naive comet u sure about this?

I tried several epochs and it seemed to stagnate. I tried a bunch of pruning, re-initializing, etc trick.

#

What I found out is that most didn't even stay in the range where nontrivial behavior would occur.

#

So, it was approximately an MLP anyway.

lofty cedar Mar 18, 2025, 1:35 PM

#

naive comet u sure about this?

Yeah... it was an amateur-ish attempt. I didn't clean the training data and so on, so don't take this as anything more than 2 cents.

rocky vigil Mar 19, 2025, 12:57 AM

#

btw when shawn's nnue refactor is merged I'll also have to do a major rebase on that (I might also try to figure out incremental attack tables during this, so it might take a while)

#

but I'll leave the current branch for testing

rocky vigil Mar 19, 2025, 5:57 AM

#

https://tests.stockfishchess.org/tests/view/67da5c508c7f315cc372a9d6 in the meanwhile we'll check LTC vs smallnet bc why not

rocky vigil Mar 19, 2025, 2:59 PM

#

Interesting

#

LTC ~ STC

rocky vigil Mar 19, 2025, 8:58 PM

#

I thought it would scale a bit more

#

Maybe not being multilayer plays a part in this idk

rocky vigil Mar 19, 2025, 11:30 PM

#

rocky vigil btw when shawn's nnue refactor is merged I'll also have to do a major rebase on ...

btw @formal smelt is there any estimate on how long it'll take to get bullet all sorted out for training sf nets

candid ivy Mar 21, 2025, 10:32 AM

#

rocky vigil btw @formal smelt is there any estimate on how long it'll take to get bu...

i have a repo here, https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/src/main.rs
loading the psqt values works i think but something else isn't converted entirely correct, i think the layer stack weights are in a different format but still couldnt figure out what else/maybe i did it wrong

naive comet Mar 21, 2025, 11:10 AM

#

bullet ordering is pnbrqkpnbrq but SF ordering is pnbrqpnbrqk for halfkav2_hm

#

is that it?

candid ivy Mar 21, 2025, 11:29 AM

#

naive comet bullet ordering is pnbrqkpnbrq but SF ordering is pnbrqpnbrqk for halfkav2_hm

that shouldn't matter cause the indices are calculated from the type and color, https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/src/halfkav2_hm.rs#L59

rocky vigil Mar 21, 2025, 10:30 PM

#

ok now that refactor is here in the next week or so I'll try to rebase everything

#

if multilayer works though and a new net is trained I can still make quick updates on old branch

rocky vigil Mar 21, 2025, 10:32 PM

#

candid ivy i have a repo here, <https://github.com/Disservin/sf-bullet-train/blob/quant-pst...

i think the affine transforms all expect i8 weights and i32 biases that might be a bit tricky

candid ivy Mar 21, 2025, 10:32 PM

#

rocky vigil i think the affine transforms all expect i8 weights and i32 biases that might be...

why would that be tricky ? bullet is already outputting that

rocky vigil Mar 21, 2025, 10:33 PM

#

oh huh

rocky vigil Mar 21, 2025, 10:35 PM

#

candid ivy i have a repo here, <https://github.com/Disservin/sf-bullet-train/blob/quant-pst...

the other thing I can think of rn is that you also have to pad all the input dimensions to the next multiple of 32 for the affine transforms
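
A sketch of that padding step, assuming a row-major weight matrix with one row per output and one column per input, padded with zeros. The function names are hypothetical, and the multiple of 32 is the figure given in the message above.

```rust
/// Rounds `n` up to the next multiple of `multiple`.
fn pad_to_multiple(n: usize, multiple: usize) -> usize {
    assert!(multiple > 0);
    (n + multiple - 1) / multiple * multiple
}

/// Zero-pads each row of a row-major `rows x cols` matrix so that the row
/// length (the input dimension) becomes a multiple of `multiple`.
fn pad_rows(weights: &[i8], rows: usize, cols: usize, multiple: usize) -> Vec<i8> {
    assert_eq!(weights.len(), rows * cols);
    let padded = pad_to_multiple(cols, multiple);
    let mut out = vec![0i8; rows * padded];
    for r in 0..rows {
        // Copy the real inputs; the tail of each row stays zero.
        out[r * padded..r * padded + cols]
            .copy_from_slice(&weights[r * cols..(r + 1) * cols]);
    }
    out
}
```

Zero padding leaves the affine output unchanged, since the extra inputs contribute nothing to each dot product.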

candid ivy Mar 21, 2025, 10:35 PM

#

well that is only needed for one layer right now.. and i do that otherwise it wouldnt even load into sf

rocky vigil Mar 21, 2025, 10:36 PM

#

wait so it loads correctly but doesn't inference

#

is that the issue

candid ivy Mar 21, 2025, 10:36 PM

#

yeah

#

maybe i padded it wrong but dont think so

#

the psqt subnet is already correct

rocky vigil Mar 21, 2025, 10:39 PM

#

ah nice

#

https://github.com/official-stockfish/Stockfish/blob/master/src/nnue/network.cpp#L431 https://github.com/official-stockfish/Stockfish/blob/master/src/nnue/nnue_architecture.h#L81 this should be how the layerstack is read

candid ivy Mar 21, 2025, 10:40 PM

#

yeah well ik

rocky vigil Mar 21, 2025, 10:41 PM

#

btw what are the hash values in the net intended to do

candid ivy Mar 21, 2025, 10:41 PM

#

just some verification things that the loaded net is supported in this arch, not really important

rocky vigil Mar 21, 2025, 10:42 PM

#

btw have you tried checking if write_parameters returns the same file as what you inputted
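A round-trip check like this boils down to a byte comparison of the two files. A minimal sketch, assuming both files have already been read into memory (for example with `std::fs::read`); `first_mismatch` is a hypothetical helper name:

```rust
/// Return the offset of the first differing byte, or `None` if the buffers are identical.
/// If one buffer is a strict prefix of the other, the shorter length is reported.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    for i in 0..common {
        if a[i] != b[i] {
            return Some(i);
        }
    }
    if a.len() != b.len() {
        Some(common)
    } else {
        None
    }
}
```

A `None` result only shows that every byte survived the read and write path. It says nothing about whether the values were in the layout the engine expects.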

candid ivy Mar 21, 2025, 10:42 PM

#

no but i don't see how that wouldn't be the case

#

https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/convert_quantised_to_pytorch.py#L219
here the fc hashes are just hard coded

rocky vigil Mar 21, 2025, 10:43 PM

#

but if everything is being read correctly and inference is still not working that means it's more likely to be an issue with bullet no?

candid ivy Mar 21, 2025, 10:44 PM

#

well reading it into an array is one thing, but the more important thing is that the layout of the weights for example is correct

rocky vigil Mar 21, 2025, 10:44 PM

#

oh wait are you saying the array sizes could match but the layout might not

candid ivy Mar 21, 2025, 10:45 PM

#

yeah well the array sizes currently definitely match

#

so must be something in the layout which isn't in the way sf expects it

rocky vigil Mar 21, 2025, 10:45 PM

#

btw I find it funny how read_parameters is called on the activation functions even though they just return true

#

affine transform weights are written row major I think
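To illustrate how array sizes can match while the layout does not: a column-major and a row-major copy of the same matrix have identical length but different element order. A sketch of the conversion, with no claim about which layout bullet or the converter actually emits:

```rust
/// Convert a column-major `rows x cols` matrix into row-major order.
/// Element (r, c) sits at `c * rows + r` in the source and `r * cols + c` in the result.
fn col_major_to_row_major(src: &[i8], rows: usize, cols: usize) -> Vec<i8> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0i8; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            dst[r * cols + c] = src[c * rows + r];
        }
    }
    dst
}
```

For the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6), the column-major buffer is 1, 4, 2, 5, 3, 6 and the row-major buffer is 1, 2, 3, 4, 5, 6. Both have six entries, so a size check alone cannot tell them apart.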

candid ivy Mar 21, 2025, 10:52 PM

#

well if you want to try and fix it, you can just give this repo a try, i can send you a bullet checkpoint as well if needed

rocky vigil Mar 21, 2025, 10:54 PM

#

ok sure

#

do you still have the bullet header info that describes how bullet outputs the weights

candid ivy Mar 21, 2025, 10:57 PM

#

there's some info here #1351682122162634796 message and in the normal channel too i think

#

https://drive.google.com/file/d/1Rl5eLPIZlL31spzWMtGv8YQBwvyyyf9l/view?usp=sharing
here is the checkpoint you can load

test-80.zip

rocky vigil Mar 21, 2025, 11:04 PM

#

eh why is it 1 GB

#

oh I see

#

can I also get the bullet config used to train this net

candid ivy Mar 21, 2025, 11:13 PM

#

clone the linked repo

rocky vigil Mar 21, 2025, 11:13 PM

#

ah it has it ok

candid ivy Mar 21, 2025, 11:13 PM

#

and checkout the quant-pst-old-correct-pst-values branch

#

and the quantised nets need to be converted using the python script
python convert_quantised_to_pytorch.py ./checkpoints/halfkav2_hm-stm/test-80/quantised.bin bullet.nnue

rocky vigil Mar 22, 2025, 9:55 PM

#

@stray reef how did you end up inferencing threat inputs? Did you add incremental attack tables?

#

Also if you have the net somewhere I’ll try to get it to work so fixed nodes can be run

stray reef Mar 22, 2025, 10:09 PM

#

I have one version with and one version without incremental attack tables. The speedup was around 15-20%

rocky vigil Mar 22, 2025, 10:09 PM

#

Speedup at which L1 btw

stray reef Mar 22, 2025, 10:10 PM

#

I do have the net, it's not a SF arch though, I can send it when I'm at my pc again

rocky vigil Mar 22, 2025, 10:11 PM

#

Alright that’s fine I’ll try to get it to work

stray reef Mar 22, 2025, 10:11 PM

#

rocky vigil Speedup at which L1 btw

that number is from a (80624 -> 1024)x2 -> 1x8 net

rocky vigil Mar 22, 2025, 10:11 PM

#

Wait 15% speedup at that size is really good

stray reef Mar 22, 2025, 10:12 PM

#

It's still a lot slower than my master net though (1.4M nps vs 2.3M nps), and "only" +30 fixed nodes

rocky vigil Mar 22, 2025, 10:12 PM

#

stray reef that number is from a `(80624 -> 1024)x2 -> 1x8` net

Btw can you run your version vs my Stockfish branch (change the L1 to 1024) and lemme know the nps difference

stray reef Mar 22, 2025, 10:12 PM

#

i haven't yet looked at small threat inputs, maybe that has less updates, that could help

rocky vigil Mar 22, 2025, 10:13 PM

#

Simplified and full are ~ the same speed

stray reef Mar 22, 2025, 10:13 PM

#

rip

rocky vigil Mar 22, 2025, 10:13 PM

#

And we know the +30 fixed nodes elo approximately translates into similar STC diff

stray reef Mar 22, 2025, 10:14 PM

#

rocky vigil Btw can you run your version vs my Stockfish branch (change the L1 to 1024) and ...

does that branch have inference for a single layer net already? and can you send the link?

rocky vigil Mar 22, 2025, 10:14 PM

#

Yes single layer with 8 output buckets

#

https://github.com/sscg13/Stockfish/tree/threat-inputs

#

Btw linrock’s simplified threats (15776 -> 1024)x2 -> 1x8 is known to be around -80 fixed nodes to sf master

#

But there is still a lot of experimentation remaining

stray reef Mar 22, 2025, 10:17 PM

#

My current feeling is that the amount of extra feature updates makes it sort of infeasible. But I will try some multilayer architectures with small threat inputs, with an L1 that's similar speed to my master net, and see what happens.

rocky vigil Mar 22, 2025, 10:18 PM

#

Btw have you measured the number of feature updates

#

I get it’s ~ 8 per color per node in sf in the midgame

stray reef Mar 22, 2025, 10:18 PM

#

Though on the other hand, linrocks (80624->256)x2 -> 1x8 net matched my net at fixed nodes 😅

daring wren Mar 22, 2025, 10:19 PM

#

stray reef Though on the other hand, linrocks (80624->256)x2 -> 1x8 net matched my net at f...

matched your 2048hl multilayer net?

stray reef Mar 22, 2025, 10:20 PM

#

rocky vigil Btw have you measured the number of feature updates

dont have concrete numbers right now, only looked at some print statements where it was mostly 5-6 threat updates, but sometimes 20+
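Why the update count matters: each changed feature costs one pass over an L1-sized weight row, so the work per node grows with (number of updates) x L1. A generic sketch of an incremental accumulator update, assuming `i16` weights stored one row per feature; this is the textbook NNUE scheme, not the code of either engine discussed here:

```rust
/// Apply one incremental update to an accumulator of length L1:
/// add the weight row of every newly active feature and subtract
/// the row of every feature that is no longer active.
fn update_accumulator(acc: &mut [i16], weights: &[i16], added: &[usize], removed: &[usize]) {
    let l1 = acc.len();
    for &f in added {
        let row = &weights[f * l1..(f + 1) * l1];
        for (a, w) in acc.iter_mut().zip(row) {
            *a += *w;
        }
    }
    for &f in removed {
        let row = &weights[f * l1..(f + 1) * l1];
        for (a, w) in acc.iter_mut().zip(row) {
            *a -= *w;
        }
    }
}
```

With around 8 updates per color per node at L1 = 1024, that is roughly 8 x 1024 additions per perspective, and a node with 20+ threat updates costs proportionally more.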

stray reef Mar 22, 2025, 10:20 PM

#

daring wren matched your 2048hl multilayer net?

1536, but yes

daring wren Mar 22, 2025, 10:20 PM

#

that's insane

stray reef Mar 22, 2025, 10:20 PM

#

Maybe small hidden layers and multilayer are the way to go here

#

i also had a 384 HL net that was very half-assed and matched my master net too

rocky vigil Mar 22, 2025, 10:23 PM

#

stray reef dont have concrete numbers right now, only looked at some print statements where...

Yeah 8 is an average count

#

If you run a long search with my branch and type eval it’ll tell you the total number of accumulator updates

stray reef Mar 22, 2025, 10:28 PM

#

rocky vigil Also if you have the net somewhere I’ll try to get it to work so fixed nodes can...

https://1drv.ms/u/s!AoYl_69Zm9N0-sEKHQcbjUJWO3R3FQ?e=goWVnM

#

already transposed and quantised (255, 64)

#

ah and the output bucket formula is the one from bullet, not (piececount - 1) / 4

stray reef Mar 22, 2025, 10:39 PM

#

rocky vigil <https://github.com/sscg13/Stockfish/tree/threat-inputs>

looks like changing EvalFileDefaultNameBig and the output bucket formula isn't enough to get this working. i'll try again tomorrow i suppose

upbeat pewter Mar 22, 2025, 10:39 PM

#

stray reef I have one version with and one version without incremental attack tables. The s...

does this make plentychess a yukari clone? /j

#

(I will just continue memeing that attack tables and mailboxes are slow)

finite wind Mar 22, 2025, 10:42 PM

#

stray reef already transposed and quantised (255, 64)

what engine is this network for?

stray reef Mar 22, 2025, 10:43 PM

#

i trained it for plentychess, but it could be used with any engine

rocky vigil Mar 22, 2025, 10:46 PM

#

stray reef looks like changing EvalFileDefaultNameBig and the output bucket formula isn't e...

You need to name it nn-12 digits of sha hash like the sf nets
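The naming rule can be sketched like this, assuming the hash is a hex SHA-256 digest of the net file computed elsewhere (for example with a command-line tool); `sf_net_name` is a hypothetical helper:

```rust
/// Build a Stockfish-style net file name from a hex digest of the net file.
/// Returns `None` if the digest has fewer than 12 characters or the prefix is not hex.
fn sf_net_name(sha_hex: &str) -> Option<String> {
    let prefix = sha_hex.get(..12)?;
    if prefix.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(format!("nn-{}.nnue", prefix.to_ascii_lowercase()))
    } else {
        None
    }
}
```

A digest starting with `239c9dddf51e` would give `nn-239c9dddf51e.nnue`.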

finite wind Mar 22, 2025, 10:51 PM

#

stray reef i trained it for plentychess, but it could be used with any engine

How to set it for SF? It crashes when I just set the file location. Does it need some conversion or a rename?

stray reef Mar 22, 2025, 10:52 PM

#

it won't work with any official stockfish version or development build. I'm trying to figure out myself how to use it with stockfish

finite wind Mar 22, 2025, 10:53 PM

#

Ok. I will try with other strong engines

stray reef Mar 22, 2025, 10:54 PM

#

probably worded that previous message badly. this is an experimental network architecture. given the right modifications to the code, it could be used with any engine

#

but you won't find any engine in which that net will work with no modifications

finite wind Mar 22, 2025, 11:07 PM

#

I'm always looking for huge nets for ab engines

stray reef Mar 22, 2025, 11:22 PM

#

one good thing about these types of nets though, the lack of input buckets makes them incredibly fast to train

rocky vigil Mar 23, 2025, 2:18 AM

#

stray reef ah and the output bucket formula is the one from bullet, not (piececount - 1) / ...

so I realized that for various reasons (eg quantization and weight types) I can't just compose existing nnue layers, meaning supporting this in sf is not as easy as I hoped. uhh since I already plan to rebase everything on the NNUE refactor and add incremental attack tables, I probably will just wait for sf style multilayer to make its way to bullet

#

btw is bullet just (piececount - 2) / 4

#

I actually have (piececount - 1) / 4 rn lemme check if we're missing some elo bc of that
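The two candidate formulas agree for most piece counts and differ only at a few boundaries. A small sketch that lists where they disagree, assuming 8 buckets and piece counts from 2 to 32 (function names are illustrative):

```rust
/// Output bucket using the (piececount - 1) / 4 formula.
fn bucket_minus_one(piece_count: u32) -> u32 {
    (piece_count - 1) / 4
}

/// Output bucket using the (piececount - 2) / 4 formula.
fn bucket_minus_two(piece_count: u32) -> u32 {
    (piece_count - 2) / 4
}

/// Piece counts in 2..=32 for which the two formulas pick different buckets.
fn disagreeing_counts() -> Vec<u32> {
    (2..=32)
        .filter(|&n| bucket_minus_one(n) != bucket_minus_two(n))
        .collect()
}
```

The formulas disagree whenever `piececount - 1` is a multiple of 4, so for those positions a net trained with one formula is evaluated with the neighbouring bucket's weights under the other.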

rocky vigil Mar 23, 2025, 2:34 AM

#

https://tests.stockfishchess.org/tests/view/67df73348888403457d874df

violet badger Mar 23, 2025, 7:36 AM

#

rocky vigil <https://tests.stockfishchess.org/tests/view/67df73348888403457d874df>

that looks like a fix.

rocky vigil Mar 23, 2025, 7:47 AM

#

I guess that explains why output buckets weren't so great compared to single layer initially

rocky vigil Mar 23, 2025, 7:52 AM

#

stray reef I have one version with and one version without incremental attack tables. The s...

btw are these publicly accessible? I checked github and didn't find anything
(hoping to get some information about how to do incremental attack tables and also if there are any other optimizations I'm missing)

stray reef Mar 23, 2025, 9:29 AM

#

rocky vigil btw are these publicly accessible? I checked github and didn't find anything (ho...

I will push my branches once they've been cleaned up properly, it's still a bit of a mess (maybe later today?). I looked at Yukari for the logic and transferred it to bitboards

upbeat pewter Mar 23, 2025, 9:48 AM

#

stray reef one good thing about these types of nets though, the lack of input buckets makes...

I'm curious, what are your training settings?

stray reef Mar 23, 2025, 9:55 AM

#

just trained a net with these settings

const HIDDEN_SIZE: usize = 512;
const SCALE: f32 = 400.0;

fn main() {

    let mut trainer = TrainerBuilder::default()
        .optimiser(optimiser::AdamW)
        .loss_fn(Loss::SigmoidMSE)
        .input(ThreatInputsSimple)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(HIDDEN_SIZE)
        .activate(Activation::CReLU)
        .add_layer(16)
        .activate(Activation::SCReLU)
        .add_layer(32)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build();

    let start_epoch = 1;
    let experiment_name = "0087";

    let output_path = format!("/mnt/d/Chess Data/Selfgen/Training/{}", experiment_name);
    let settings = LocalSettings {
        threads: 4,
        output_directory: &output_path.as_str(),
        test_set: None,
        batch_queue_size: 512,
    };
    let data_loader: loader::DirectSequentialDataLoader = loader::DirectSequentialDataLoader::new(&["/mnt/d/Chess Data/Selfgen/interleaved.data"]);
    let schedule = TrainingSchedule {
        net_id: format!("net-{}", experiment_name).to_string(),
        eval_scale: SCALE,
        steps: TrainingSteps {
            batch_size: 16_384,
            batches_per_superbatch: 6104,
            start_superbatch: start_epoch,
            end_superbatch: 500,
        },
        wdl_scheduler: wdl::ConstantWDL { value: 0.5 },
        lr_scheduler:  lr::CosineDecayLR { initial_lr: 0.001, final_lr: 0.001 * 0.3 * 0.3 * 0.3 * 0.3, final_superbatch: 500 },
        save_rate: 10,
    };

    trainer.set_optimiser_params(optimiser::AdamWParams {
        decay: 0.01,
        beta1: 0.9,
        beta2: 0.999,
        min_weight: -0.99,
        max_weight: 0.99,
    });

    trainer.run(&schedule, &settings, &data_loader);
}

#

apart from the arch, that pretty much matches the first stage of my master net (except 420 SBs instead of 500)
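For readers unfamiliar with the schedule in that config, here is a standalone sketch of a cosine decay from `initial_lr` to `final_lr`. It uses the standard half-cosine formula; bullet's own `CosineDecayLR` may index superbatches slightly differently, so treat the exact offsets as an assumption:

```rust
/// Learning rate at `superbatch` for a half-cosine decay that starts at
/// `initial_lr` and reaches `final_lr` at `final_superbatch`.
fn cosine_decay_lr(initial_lr: f32, final_lr: f32, final_superbatch: usize, superbatch: usize) -> f32 {
    if superbatch >= final_superbatch {
        return final_lr;
    }
    let progress = superbatch as f32 / final_superbatch as f32;
    // Goes from 1.0 at progress 0 down to 0.0 at progress 1.
    let cosine = 0.5 * (1.0 + (std::f32::consts::PI * progress).cos());
    final_lr + (initial_lr - final_lr) * cosine
}
```

With the values above, the rate falls from 0.001 to 0.001 * 0.3^4 = 0.0000081 over 500 superbatches.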

upbeat pewter Mar 23, 2025, 10:00 AM

#

        min_weight: -0.99,
        max_weight: 0.99,

...huh.

stray reef Mar 23, 2025, 11:48 PM

#

stray reef i also had a 384 HL net that was very half-assed and matched my master net too

scratch this. i must have messed up this test. maybe I accidentally tested my 384 vs linrocks 256. Or it's the difference between full and simple threat inputs.

Trained two simple threat inputs 512 L1 nets today (one single layer, one multilayer, aka -> 16 -> 32 -> 1) to see what L1 is needed to beat my master net at fixed nodes. The results were a bit disappointing.
512 with layers vs. main: Elo: -16.13 +/- 6.58, nElo: -23.63 +/- 9.63
512 single layer vs. main: Elo: -33.53 +/- 6.49, nElo: -50.06 +/- 9.63

I doubt that a 768 L1 net would be noticeably stronger than master at fixed nodes, so 896 or 1024 would be necessary at least (for my case, obviously). And for that I am really not sure if I can make it fast enough...

rocky vigil Mar 24, 2025, 1:06 AM

#

fixed nodes single layer full inputs should be ~30 elo better than simplified at the same L1 so a full threat input net could be noticeably stronger at L1=768

rocky vigil Mar 25, 2025, 4:11 AM

#

https://tests.stockfishchess.org/tests/view/67e1f1d38888403457d87680

#

yeah it looks like threat inputs need more careful data work as L1 gets larger

frosty imp Mar 25, 2025, 4:18 AM

#

maybe it's the slowdown?

rocky vigil Mar 25, 2025, 4:24 AM

#

slowdown is like 15%

frosty imp Mar 25, 2025, 4:28 AM

#

that'd be around -30 Elo STC

#

so decent chance it passes LTC

rocky vigil Mar 25, 2025, 4:56 AM

#

i mean the last time we did this it was 0 stc and 7 ltc

#

so like

#

yeah

#

it's negative stc now because the new L1=256 net is slightly better

#

which is more evidence that data work will be very important

rocky vigil Mar 25, 2025, 7:03 AM

#

technologov machine dropped a massive diff

#

idk it feels like threat inputs somehow always attract high residuals

twilit oriole Mar 25, 2025, 7:19 AM

#

I've already said why multiple times lol. It's cache effects because the net is being duplicated for each instance instead of shared

#

Memory accesses are far less predictable with threat inputs compared to king buckets, having a large portion of the net in cache is important

#

512 Vs 256 is not a 15% slowdown when using multiple instances with no sharing, this is all this is showing

#

In monty mmap was worth 40 Elo at SPRT conditions. And I wasn't even using hyperthreading like fishtest machines. And it wasn't even a threat input net (but it was non UE so still had cache issues). This stuff really matters a lot

#

I always thought sharing would be absolutely necessary for this to pass, which is part of why testing at TC is a huge pain in the ass for now (and the results are useless because the TC Elo diff of scaling L1 is totally dependent on how optimized the speed is). I'm still wanting to see L1 scaled up and tested at fixed nodes. It's a far easier path to validate the idea: if a L1 1536 full threat net (with multilayer, output buckets) surpasses master SF net, the idea is finally validated after all this time, and the speed can then be fully sorted after

rocky vigil Apr 3, 2025, 4:48 PM

#

stray reef I will push my branches once they've been cleaned up properly, it's still a bit ...

btw are the branches available now
since it looks like upstream sf nnue refactor will conclude soon I’m also going to start rebasing threat inputs again and hopefully end up with a much faster version

twilit oriole Apr 3, 2025, 5:47 PM

#

rocky vigil btw are the branches available now since it looks like upstream sf nnue refactor...

Have you considered trying this in obsidian instead? Would be a lot easier since it's literally just swapping the regular inputs with the threat ones. It already uses bullet and lc0 data, no other variables are changing

#

@wide oasis is also up for helping with it I think

wide oasis Apr 3, 2025, 5:49 PM

#

sure

twilit oriole Apr 3, 2025, 5:49 PM

#

Yeah. I mean all the code for it already exists in the branches (training and inference) it's mostly copy paste job

wide oasis Apr 3, 2025, 5:53 PM

#

what is needed from my end?

rocky vigil Apr 3, 2025, 6:34 PM

#

twilit oriole Have you considered trying this in obsidian instead? Would be a lot easier ...

Sure if gabe is willing to help

#

Note that if you copy paste my inference code you lose quite a bit of speed over incremental attack tables

#

But it’ll be good for fixed nodes

rocky vigil Apr 3, 2025, 6:36 PM

#

wide oasis what is needed from my end?

I’ll try and convert threat input code into some form suitable for Obsidian use, if you tell me what is needed

wide oasis Apr 3, 2025, 6:37 PM

#

idk lol

#

i guess you'd replace nnue.cpp entirely

twilit oriole Apr 3, 2025, 6:38 PM

#

Only the inputs need to be changed

#

From regular to threat

rocky vigil Apr 4, 2025, 1:46 AM

#

note that there is no multilayer threat input net yet right now

twilit oriole Apr 5, 2025, 12:03 AM

#

rocky vigil note that there is no multilayer threat input net yet right now

Hm? Can't it just use obsidian multilayer it's already there. I thought it's just an input swap

rocky vigil Apr 5, 2025, 12:04 AM

#

I meant the actual net

twilit oriole Apr 5, 2025, 12:04 AM

#

I got the GPUs if any training needs to take place after swapping the inputs in its config

rocky vigil Apr 5, 2025, 12:04 AM

#

gabe gonna have to talk with jw about how to do it

#

or you i guess

#

obsidian uses float in later layers though right?

twilit oriole Apr 5, 2025, 12:04 AM

#

Well it already trains multilayer with bullet. So whatever it does will work

#

@wide oasis what's your bullet training config?

#

And the Leela binpacks used

#

I'll just swap in the full threats and start training some nets I guess

rocky vigil Apr 5, 2025, 12:10 AM

#

ok I think I should be able to get inference (inputs -> accumulator) to work but it might take a few days

wide oasis Apr 5, 2025, 7:14 AM

#

twilit oriole @wide oasis what's your bullet training config?

it's 4 files (im from phone rn)

#

isn't it quicker if I train the net with all the data ready?

twilit oriole Apr 5, 2025, 11:08 AM

#

Think we need a new baseline anyways, might as well simplify to one stage. Will need to be using more data I think for both for fair comparison

stray reef Apr 5, 2025, 1:38 PM

#

rocky vigil btw are the branches available now since it looks like upstream sf nnue refactor...

Sorry, didn't get around to it until now.
Uploaded the branches and nets for a single layer ((80624 -> 1024)x2 -> 1x8) and multi layer ((80624 -> 2048)x2 -> (16 -> 32 -> 1)x8) net now. They are +30 and +55 to my master net at fixed nodes, respectively, but far too slow even with the current UE impl.
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full-layers

#

Changing to simplified threat inputs is super easily done in threat-inputs.h/cpp, though it seems I lost my small threat inputs nets and therefore can't test this with a working network at the moment

rocky vigil Apr 5, 2025, 6:01 PM

#

Hmm 1024 to 2048 and multilayer fixed nodes gain seems surprisingly low

#

How much data do you have

#

If we assume that a (768->N) net needs around N million positions of data then you would need around 50B positions to have a similar input saturation for (80624 -> 2048)

stray reef Apr 5, 2025, 6:06 PM

#

rocky vigil How much data do you have

6B positions

#

I could take some older positions that are generated with less nodes, maybe that still works together on such a big net

#

that would put me at around 13B altogether

rocky vigil Apr 5, 2025, 6:10 PM

#

yeah I'm using smth like N*M*16K / (average number of features per position) to estimate the order of magnitude of data required
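That rule of thumb can be written out as a small function. This is only a sketch: "16K" is read here as 16,384, and the average number of active features per position is an input the caller must measure, since the thread does not state it. The function name is hypothetical:

```rust
/// "16K" from the message, read as 16,384 (an assumption; 16,000 is equally plausible).
const SCALE_FACTOR: f64 = 16_384.0;

/// Order-of-magnitude estimate of the positions needed for an (inputs -> l1) net:
/// inputs * l1 * 16K / (average number of active features per position).
fn estimate_positions(inputs: f64, l1: f64, avg_active_features: f64) -> f64 {
    inputs * l1 * SCALE_FACTOR / avg_active_features
}
```

As a sanity check, plugging in 768 inputs with an illustrative average of 12 active features gives about 1.05 million positions per unit of L1, which is consistent with the "N million positions for (768->N)" assumption. The 12 is chosen only to make that check come out and is not a measured value.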

stray reef Apr 5, 2025, 6:11 PM

#

50B is definitely not feasible for me, that's for sure. And I think that goes for all selfgenners (apart from maybe Jay)

rocky vigil Apr 5, 2025, 6:11 PM

#

I'll test (fixed nodes) the 1024 -> 1x8 net vs linrock's 512 which I still have somewhere

rocky vigil Apr 5, 2025, 6:12 PM

#

stray reef 50B is definitely not feasible for me, that's for sure. And I think that goes fo...

yeah the thing about threat inputs is if it works it becomes another advantage to leelers

stray reef Apr 5, 2025, 6:13 PM

#

Small threat inputs are less data starved, but they are not faster so they're not really useful

rocky vigil Apr 5, 2025, 6:13 PM

#

for your impl approximately when do you start to see significant slowdown

stray reef Apr 5, 2025, 6:13 PM

#

what L1?

rocky vigil Apr 5, 2025, 6:14 PM

#

yeah

#

in mine 256 -> 512 is around 15% slowdown

#

but it was quite slow to begin with

#

btw does 0084.bin have the bullet padding trimmed at the end

stray reef Apr 5, 2025, 6:15 PM

#

mh I tested a 512 L1 morelayers net that was slower than master and worse at fixed nodes. I haven't tested the incremental speed differences yet

rocky vigil Apr 5, 2025, 6:16 PM

#

how fast is sf master (or 17.1), your master and 512 L1 (in nps)

stray reef Apr 5, 2025, 6:18 PM

#

yes, I process bullets raw.bin in a way so just the weights are included in 0084.bin, no padding, and already aligned to 64 bytes. See https://github.com/Yoshie2000/PlentyChess/blob/threat-inputs-full/tools/process_net.cpp

#

this makes generating verbatim nets also very easy

rocky vigil Apr 5, 2025, 6:20 PM

#

ok I wasn't sure if the 0000 bytes at the end were padding or part of the net

#

thanks

rocky vigil Apr 5, 2025, 6:30 PM

#

rocky vigil how fast is sf master (or 17.1), your master and 512 L1 (in nps)

in my impl 512 L1 is around 15% slower than sf master but my impl has over 1/4 of the runtime lost to overhead with threat inputs

rocky vigil Apr 5, 2025, 6:49 PM

#

stray reef yes, I process bullets raw.bin in a way so just the weights are included in 0084...

btw are the output buckets transposed

#

I think they are? from reading process_net.cpp

#

and what is the quantization

#

bc I'm getting
info string NNUE evaluation using nn-239c9dddf51e.nnue (157MiB, (80624, 1024, 1))
info depth 1 seldepth 2 multipv 1 score cp 931 nodes 20 nps 20000 hashfull 0 tbhits 0 time 1 pv h2h4
info depth 2 seldepth 3 multipv 1 score cp 330 nodes 103 nps 51500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 3 seldepth 4 multipv 1 score cp 586 nodes 127 nps 63500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 4 seldepth 5 multipv 1 score cp 679 nodes 239 nps 79666 hashfull 0 tbhits 0 time 3 pv b1a3 a7a6
info depth 5 seldepth 6 multipv 1 score cp 757 nodes 341 nps 113666 hashfull 0 tbhits 0 time 3 pv b1a3 b7b6 h2h4 a7a6
info depth 6 seldepth 7 multipv 1 score cp 420 nodes 457 nps 114250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3
info depth 7 seldepth 8 multipv 1 score cp 396 nodes 485 nps 121250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 b7b6 f2f3
info depth 8 seldepth 8 multipv 1 score cp 22 nodes 614 nps 153500 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 c6c5 f2f3
info depth 9 seldepth 9 multipv 1 score cp 115 nodes 670 nps 134000 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6
info depth 10 seldepth 12 multipv 1 score cp 225 nodes 711 nps 142200 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6 b1c3 c6c5
info depth 11 seldepth 22 multipv 1 score cp 216 nodes 3113 nps 283000 hashfull 2 tbhits 0 time 11 pv d2d3 a7a6 f2f4 c7c6 h2h4 d8a5 d1d2 h7h5 d2a5 g8h6 a5h5
info depth 12 seldepth 18 multipv 1 score cp 251 nodes 14188 nps 405371 hashfull 9 tbhits 0 time 35 pv e2e3 a7a6 d2d4 d7d6 f1a6 b8a6 c2c3 g7g6 d1d3 a6b4 d3b5 c7c6
info depth 13 seldepth 15 multipv 1 score cp 128 nodes 21150 nps 406730 hashfull 10 tbhits 0 time 52 pv e2e3 a7a6 d2d4 c7c6 b1c3 e7e6 d1d3 g8f6 d3c4 h7h5
info depth 14 seldepth 26 multipv 1 score cp 45 nodes 175351 nps 475205 hashfull 79 tbhits 0 time 369 pv c2c3 d7d5 e2e4 e7e5 a2a3 f8b4 g1e2 c8e6 e2f4 e5f4 a3b4 b8d7

#

which seems very wrong

stray reef Apr 5, 2025, 6:55 PM

#

rocky vigil btw are the output buckets transposed

they are not transposed in 0084.bin, but they will be transposed before being baked into the engine in transposePermuteNetwork(). If you compile normally, there will be a temporary processed.bin which is the file that's baked into the engine

rocky vigil Apr 5, 2025, 6:55 PM

#

ohh

stray reef Apr 5, 2025, 6:56 PM

#

rocky vigil and what is the quantization

input quant 510, l1 quant 64

rocky vigil Apr 5, 2025, 6:56 PM

#

oh so QA = 510, QB = 64

#

I see

#

ok and scale is 400 right

#

(linrock uses 340)
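How those three constants fit together, as a sketch: weights are multiplied by the quantisation factor and rounded, and the raw integer output is scaled back to centipawns at the end. This assumes a CReLU-style single layer where the output carries a factor of QA * QB. A squared activation would need one extra division by QA, and the real inference code may order the operations differently:

```rust
const QA: i32 = 510; // input (feature transformer) quantisation
const QB: i32 = 64; // output layer quantisation
const SCALE: i32 = 400; // eval scale used in training

/// Quantise a floating-point weight with factor `q`, rounding to nearest.
fn quantise(w: f32, q: i32) -> i32 {
    (w * q as f32).round() as i32
}

/// Convert a raw integer network output (carrying a factor of QA * QB) to centipawns.
fn to_centipawns(raw: i64) -> i64 {
    raw * SCALE as i64 / (QA as i64 * QB as i64)
}
```

Reading the net with the wrong QA, QB, or scale rescales every evaluation by a constant factor, which would be one way to end up with inflated scores like the ones in the search output.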

stray reef Apr 5, 2025, 6:57 PM

#

rocky vigil in my impl 512 L1 is around 15% slower than sf master but my impl has over 1/4 o...

I don't have a 512 net right now, it got lost somehow. 0084.bin is 1.45M nps, SF master is 1.6M nps on my machine

rocky vigil Apr 5, 2025, 6:58 PM

#

hmm -10% to master
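The percentage is just the relative nps difference against the baseline. A tiny sketch using the figures quoted in the thread (`relative_speed` is a hypothetical helper):

```rust
/// Speed of `candidate_nps` relative to `baseline_nps`, in percent.
/// Negative values mean the candidate is slower.
fn relative_speed(candidate_nps: f64, baseline_nps: f64) -> f64 {
    (candidate_nps - baseline_nps) / baseline_nps * 100.0
}
```

For 1.45M nps against 1.6M nps this gives -9.375%, which rounds to the -10% quoted. The earlier 1.4M against 2.3M comparison works out to about -39%.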

stray reef Apr 5, 2025, 6:58 PM

#

plenty is 2.1M nps on this machine. so that's quite hard to get close to

rocky vigil Apr 5, 2025, 6:59 PM

#

for comparison on my machine sf 1024 is -33% to master

stray reef Apr 5, 2025, 6:59 PM

#

ok sounds like my impl is faster then. that adds up, since I gained quite a lot of speed from the incremental threat calculations

#

I guess an L1 512, full threat inputs morelayers with 13B positions should beat my master net at fixed nodes. But it'll still be relatively slow. And there's also an issue in bullet where threat input nets with morelayers and pairwise produce only dead nets on init... perhaps @formal smelt has some idea what would have to be done to improve that, since pairwise is quite an important speedup

#

i'm gonna sign off for today though

rocky vigil Apr 5, 2025, 7:01 PM

#

ok yeah
