#UE Threat Inputs for AB
memmove is memcpy but it works for overlapping regions
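For reference, Rust draws the same line: `std::ptr::copy` has memmove semantics (overlap allowed) and `std::ptr::copy_nonoverlapping` has memcpy semantics. A minimal sketch using the safe slice method `copy_within`, which is documented to allow overlapping ranges; the helper name `shift_right` is made up for illustration and is not from the branch.

```rust
/// Shift the first `len` elements of `buf` right by `by` slots, in place.
/// Source and destination overlap whenever `by < len`, so this needs
/// memmove semantics; `copy_within` provides them. Hypothetical helper.
fn shift_right(buf: &mut [i16], len: usize, by: usize) {
    buf.copy_within(0..len, by);
}
```

With a plain memcpy the overlapping case would be undefined behaviour in C; here the overlap is handled and out-of-range arguments panic instead.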
L1=1024 4 thread 397434
Anyways these speed losses shouldn’t be on the order of 60-100 STC elo
Welp
this one is from a speedtest profile, with the 1024 net
(ah but this already with some simd which nets an additional ~8%)
Well it's not so simple. The speed characteristics depend on game phase and a speed loss in each phase is likely worth different Elo amount in each
Only way here is to measure Elo really
Like a L1 1536 threat net is way slower in openings and way faster in endgames
Is that speed figure with separate SF instances running on each thread?
That's very important...
I don't think threat inputs pass even in monty without shared net weights between the instances. It's quite possible that is the discrepancy
btw if you have SIMD could you share it so that I can pull
Yeah this is 4 thread 1 instance
It shouldn’t matter too much for 256 and simplified threats but if we move to full threats and make the net 3x as large as master then it will be an issue…
Eh it’s only messed up avx2
Well I think proper simd should gain significantly over autovec
Especially since screlu-affine seems a lot slower than I hoped
As well as memmove
can I get smth like this for full inputs instead
Simplified inference already has worked for a long time
I will but I’m not available for at least a few hours
Ok yeah I won’t be able to do things with rust for a while as well
Until I get out of school
Results of ./threat-inputs/stockfish-master-mar12 vs ./threat-inputs/mar11-sscg13-1024-profile-build (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 80.69 +/- 7.18, nElo: 129.42 +/- 11.12
LOS: 100.00 %, DrawRatio: 36.30 %, PairsRatio: 4.17
Games: 3752, Wins: 1581, Losses: 725, Draws: 1446, Points: 2304.0 (61.41 %)
Ptnml(0-2): [39, 192, 681, 802, 162], WL/DD Ratio: 2.01
STC of 1024 vs. master here:
https://tests.stockfishchess.org/tests/view/67d1f0a4166a3e8781d843cf
i think you're right, it should be positive elo if the engine on the left is stronger than the one on the right (stronger vs. weaker)
that's the only way for fishtest STC results to be consistent
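The sign convention can be checked against the numbers in the result above with the usual logistic Elo formula. This is a sketch under the assumption that the reported Elo is the plain logistic value computed from the score of the first-listed engine; the function name is made up.

```rust
/// Logistic Elo difference implied by a match score, seen from the side
/// whose wins/losses are passed in (here: the engine listed first).
fn elo_from_score(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = (wins + losses + draws) as f64;
    let score = (wins as f64 + 0.5 * draws as f64) / games;
    -400.0 * (1.0 / score - 1.0).log10()
}
```

Feeding in Wins 1581, Losses 725, Draws 1446 gives about +80.7, matching the printed Elo, so a positive number means the engine on the left scored better. Swapping wins and losses flips the sign.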
Are these nets multilayer yet?
So a 1536 full threats with multilayer should beat master in fixed nodes by a lot... Which is good in theory
Oops I realized float inference might be hard
Ngl I think reaching just this point would be great progress
I can easily make multilayer inference if and only if it matches what already exists in sf code
It should. i8 quant is possible in bullet I thought
Yeah I think more pressing concern is get full threats to work
here's a training config jw says is similar to SF multilayer:
https://github.com/jw1912/bullet/blob/main/examples/advanced.rs
Optimistically this should happen in a few days
it also has floats for inference in later layers
@formal smelt why didn't u i8 quant the later layers?
I think it’s worth testing if quantization for later layers loses elo
full threats would be a good baseline
But I think the greater issue is floating point arithmetic not being associative
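A tiny illustration of the non-associativity point, as a sketch: the same three f32 values summed in two different orders give different results, which is why float inference can differ between SIMD widths or reduction orders. Integer accumulation does not have this problem as long as it does not overflow.

```rust
/// Sum left to right: (a + b) + c.
fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Sum right to left: a + (b + c).
fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

With a = 1e20, b = -1e20, c = 1.0 the first order yields 1.0, while in the second the 1.0 is absorbed into -1e20 before the cancellation and the result is 0.0.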
is there a simple multi-layer training config with i8 quant for later layers somewhere?
It's already been tested. That's why SF uses i8. Let's not make so many changes at once heh
Because I only made the arch similar to SF
I’m pretty sure the arch matches unless I’ve misinterpreted the diagrams in NNUE.md
Hm well can you make one to match the quantization also?
vec![
    SavedFormat::new("pst", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0w", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l0b", QuantTarget::I16(255), Layout::Normal),
    SavedFormat::new("l1w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l1b", QuantTarget::I16(64 * 255), Layout::Normal),
    SavedFormat::new("l2w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l2b", QuantTarget::I16(64 * 255), Layout::Normal),
    SavedFormat::new("l3w", QuantTarget::I16(64), Layout::Normal),
    SavedFormat::new("l3b", QuantTarget::I16(64 * 255), Layout::Normal),
],
this doesn't crash, so i assume it's usable
don't know if there's a strict need for int8 vs. int16 for l2w and l3w
currently i think it's i8 for l2w, l3w and i32 for l2b, l3b
QuantTarget::I32(_) => unimplemented!("i32 quant is not implemented for TrainerBuilder!")
i don't think it matters to use i16 for everything
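To make the i8 vs. i16 question concrete, here is a rough sketch of what a quantisation target does, assuming it simply multiplies the float weight by the scale and rounds. The helpers below are hypothetical and are not bullet's code; they only show which weights fit in which integer type.

```rust
/// Scale a float weight by `q`, round to nearest, and return it only if it
/// fits in [min, max]. Hypothetical helper, not bullet's implementation.
fn quantise(w: f32, q: i32, min: i32, max: i32) -> Option<i32> {
    let scaled = (w * q as f32).round() as i32;
    (min..=max).contains(&scaled).then_some(scaled)
}

/// Quantise to the i8 range.
fn quantise_i8(w: f32, q: i32) -> Option<i32> {
    quantise(w, q, i8::MIN as i32, i8::MAX as i32)
}

/// Quantise to the i16 range.
fn quantise_i16(w: f32, q: i32) -> Option<i32> {
    quantise(w, q, i16::MIN as i32, i16::MAX as i32)
}
```

With a scale of 64, i8 can only hold weights between -2.0 and about 1.98, so the training side would need to clip later-layer weights to that range; i16 with the same scale has plenty of headroom but doubles the storage and halves the SIMD lanes.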
Yeah I’ll have a look at the layers soon
slightly better than I expected given it's supposed to be like over 2x slower
SSS still but threatnet seems to be way better at endgames as expected
yea, endgames book expected to show better results from the speedup:
https://tests.stockfishchess.org/tests/view/67d220af166a3e8781d843fd
multilayer would only give 10ish gain, we can do that as a last step once we actually reach that range
It would give far more than 10 for threats, the speed hit is far less % wise
Was this measured using size 256?
It might be the case that the quality of a single-layer net nearly saturates at some point and so you need to go multi-layer.
Like... larger l1 scales better with more layers.
So, if l1=256, 1 layer vs l1=256 multilayer was 10 elo, maybe l1=512 multilayer would gain more?
full inputs
multilayer
I think full input 1536 with multilayer should be good in fixed nodes vs master
and then a grind for optimizing
Full input? WDYM?
we are using simplified inputs right now, which are ~15k
full inputs add ~80k instead
also a vague extra idea I had is to append the threats to halfkav2hm instead
not just append to psq
we can also support this theoretically
by having separate accumulators for them, and combining on eval time
tmrw if I have time I'll try to merge append_active_threats and write_difference into compute_threats_write_difference
which montytrain branch produces the net that was +100-ish fixed nodes?
https://github.com/official-monty/montytrain/tree/threat-inputs-nnue-fixed sounded like the "correct" one that's not the simplified inputs from what i caught yesterday, but from looking at the code it looks like some threats are not incorporated (e.g. map_pawn_threat only maps to pawns, knights and rooks)
Yes because pawn -> bishop threat implies corresponding bishop -> pawn threat
So you know any pawn -> bishop is a duplicate
Oh right! I didn't think of it that way. So the branch is correct?
Great, thank you
Btw if you can run it can you print me the active feature indices
For kiwipete
And startpos
The other deduplication is when they are the same type, in this case I believe the one with src < dest is used
Note that pawn - pawn are duplicates only when they are opposite colors
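The deduplication rules described above can be written down as a small predicate. This is only a sketch of the rules as stated in chat, not the montytrain code: the `Threat` type and `keep_threat` are made up, and dropping pawn -> queen and pawn -> king is an assumption inferred from map_pawn_threat only mapping to pawns, knights and rooks.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Piece { Pawn, Knight, Bishop, Rook, Queen, King }

/// One attacker -> victim relation; squares are 0..64. Hypothetical type.
#[derive(Clone, Copy, Debug)]
struct Threat {
    attacker: Piece,
    victim: Piece,
    same_color: bool,
    src: u8,
    dest: u8,
}

/// Returns true if the threat should get its own input feature.
fn keep_threat(t: Threat) -> bool {
    use Piece::*;
    match (t.attacker, t.victim) {
        // Implied by the reverse threat: the victim also attacks the pawn.
        (Pawn, Bishop) | (Pawn, Queen) | (Pawn, King) => false,
        // Same-coloured pawns: "a defends b" does not imply "b defends a".
        (Pawn, Pawn) if t.same_color => true,
        // Symmetric same-type pairs: keep only the src < dest direction.
        (a, v) if a == v => t.src < t.dest,
        _ => true,
    }
}
```

So a knight pair attacking each other contributes one feature rather than two, and a pawn attacking a bishop contributes only the bishop -> pawn feature.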
startpos
79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416
79864 79865 79866 79867 79868 79869 79870 79871 79921 506 79926 525 79986 4550 4551 79989 4571 4572 80048 11032 10143 80055 11136 10241 80116 26463 22096 19187 19188 19189 80179 37421 36583 36584 36585 80288 80289 80290 80291 80292 80293 80294 80295 80361 42762 80366 42781 80426 47787 47788 80429 47808 47809 80488 55334 56231 80495 55432 56335 80556 69143 69144 69145 76429 72062 80619 78573 78574 78575 79416
kiwipete
79864 79865 79866 79869 79870 95 79871 79883 34 79892 79941 1276 605 606 608 79955 2034 2710 2712 2713 79995 79996 6866 5189 80048 13722 10143 80055 13821 10241 80130 19491 19492 22405 28230 20954 19503 29699 80179 36583 37424 37425 80256 39942 80270 40051 80281 80283 39990 80290 40253 40254 80292 40257 80293 40344 80295 40347 80346 40663 40665 42683 44365 80350 40696 42713 43723 80415 46014 80417 48273 49394 80488 55330 58921 80495 55432 59020 80547 68941 70401 68946 68950 68951 76236 80619 78573 78575
79866 3 4 79868 7 79869 94 79871 97 79873 79875 79894 389 79896 79938 2259 581 2599 2601 79942 1619 612 2629 79993 6279 5162 80007 80048 13722 10147 80055 13821 10241 80123 26612 19336 19337 20797 19342 19349 80179 36583 36585 80268 39963 80275 40228 80288 80289 39999 80290 80293 80294 40345 80295 80331 40566 40567 40568 43932 80349 42702 42704 43378 42707 80419 46045 80420 48300 49981 80488 55334 58921 80495 55432 59020 80538 61448 68735 60000 70196 68743 68744 71657 80619 78573 79414 79415
code used (i hope it's correct)
fn main() {
    let inputs = ThreatInputs::default();
    let pos = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 | 0 | 0.0"
        .parse::<ChessBoard>()
        .unwrap();
    // First line: side-to-move feature indices.
    inputs.map_features(&pos, |stm, _nstm| print!("{stm} "));
    println!();
    // Second line: non-side-to-move feature indices.
    inputs.map_features(&pos, |_stm, nstm| print!("{nstm} "));
    println!();
}
This seems wrong because we shouldn’t be seeing large runs of consecutive threat inputs like 79864…79871
Also 139 startpos features is way off
those aren't threats
they're the standard 768 inputs
you're seeing them contiguously because the pawns aren't attacking or defending anything in startpos
https://github.com/official-monty/montytrain/tree/full-threat-inputs-print anyway here's your branch to check yourself
i get startpos has 88 features
Ok will check later
Weren’t these supposed to be from 0 to 767 though or is it different for full inputs
I changed it for the simpler inputs to put them before
Full threat indexing should be ready now
At least it’s correct for startpos and kiwipete
alright, does that mean full threats inference is ready for testing soon?
Yes
with UE or no?
Architecture : (80624 -> 256)x2 -> 1
Inputs : Threat inputs
Number of Weights : 20.64m
Quantisations : [255, 64]
Eval Scale : 340
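The 20.64m figure is consistent with a simple count of the layer sizes. A sketch of that arithmetic, assuming one bias per neuron and that the trainer counts parameters this way (the function is made up):

```rust
/// Parameter count of a perspective net (inputs -> l1)x2 -> 1 per bucket,
/// assuming one bias per neuron. Hypothetical helper.
fn param_count(inputs: u64, l1: u64, buckets: u64) -> u64 {
    // Feature transformer: shared between both perspectives.
    let feature_transformer = inputs * l1 + l1;
    // Output layer: sees both perspectives' accumulators, one copy per bucket.
    let output = (2 * l1 + 1) * buckets;
    feature_transformer + output
}
```

For 80624 inputs and L1=256 this gives 20,640,513 with a single output, and 20,644,104 with 8 output buckets, so practically all of the weights sit in the feature transformer.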
UE strategy is the same as with simplified
I wrote approximately the same functions
that's a partially-trained one. the bulletbullet footer needs to be trimmed still
We do need a better incremental at some point though
hehe
gotta run for now. i'll check back in later
Considering how Yukari is like 2x as fast
as long as its working first
As current impl
ah lol
Alright I’ll wait for a fully trained one and hopefully I can try to get it running later
full threats input net that finished training on some leela data, with the footer bytes trimmed:
https://tests.stockfishchess.org/api/nn/nn-2f2b7f959ee5.nnue
alright i'll try to get it to work in ~2 hours or so
I can still force the inference through w/o buckets but it might not be a completely accurate comparison
nvm apparently I have access violation trying to read 0x00000000
and I am a bit too tired to debug this rn
commit is up for debug help
oh yea, output buckets would be good
seems like output buckets doesn't work with montytrain yet
without buckets is fine for testing. won't be a completely accurate comparison anyways, since this training data is simpler than before
need to do a new set of measurements for full threats later
They do
It looks like you’ve done .inputs(output buckets) or something
Can someone test if latest commit compiles and runs
Because it tries to read null pointer on one machine
But works on the other
Anyways unbucketed full threats are 40 +- 35 elo at 25k nodes, according to an sss test I ran
compared to what
Simplified 256, with 8 output buckets
nice
^^ if you can
not home rn
e27d31f compiled with Clang 20/MSYS2 under Win10 on Intel i7 7700HQ runs.
diff --git a/value/src/arch.rs b/value/src/arch.rs
index 0438bef..289766f 100644
--- a/value/src/arch.rs
+++ b/value/src/arch.rs
@@ -13,9 +13,9 @@ pub fn make_trainer<T: Default + SparseInputType>(
TrainerBuilder::default()
.quantisations(&[255, 64])
.optimiser(AdamW)
- .loss_fn(Loss::SigmoidMSE)
+ .loss_fn(Loss::SigmoidMPE(2.6))
.input(inputs)
- .output_buckets(outputs::Single)
+ .output_buckets(outputs::MaterialCount::<8>)
.feature_transformer(l1)
.activate(Activation::SCReLU)
.add_layer(1)
can you send the whole make_trainer function?
huh i can reproduce it
thats really weird
use bullet::{
    nn::{
        optimiser::{AdamW, AdamWOptimiser},
        Activation,
    },
    trainer::default::{inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder},
};
#[rustfmt::skip]
pub fn make_trainer<T: Default + SparseInputType>(
    inputs: T, l1: usize,
) -> Trainer<AdamWOptimiser, T, outputs::Single> {
    TrainerBuilder::default()
        .quantisations(&[255, 64])
        .optimiser(AdamW)
        .loss_fn(Loss::SigmoidMPE(2.6))
        .input(inputs)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(l1)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build()
}
) -> Trainer<AdamWOptimiser, T, outputs::Single> {
you'd need to change this line too
but when i do that it still errors
Ok maybe my home laptop just has an issue with std::optional
Well that kinda sucks if true
AH i know what it is
- pub fn make_trainer<T: Default + SparseInputType>(
+ pub fn make_trainer<T: Default + SparseInputType<RequiredDataType = ChessBoard>>(
inputs: T, l1: usize,
- ) -> Trainer<AdamWOptimiser, T, outputs::Single> {
+ ) -> Trainer<AdamWOptimiser, T, outputs::MaterialCount<8>> {
cool works now, after also adding the import:
trainer::default::{
formats::bulletformat::{ChessBoard},
inputs::SparseInputType, outputs, Loss, Trainer, TrainerBuilder
},
latest full threats commit e27d31f1a compiles for me too
yea it's +30 fixed nodes so far vs. simplified threats 512
Home laptop is getting an access violation at 0x00000000 and I’m trying to figure out if this is an issue with my code or with my device
So it’s probably an issue with my compiler on that laptop then
Ok good to know
Hmm surpassing double L1 is really good
or maybe -30? i keep mixing up the order of these
Results of ./threat-inputs/mar11-sscg13-512-profile-build vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 30.14 +/- 6.19, nElo: 45.18 +/- 9.24
LOS: 100.00 %, DrawRatio: 43.04 %, PairsRatio: 1.57
Games: 5432, Wins: 1978, Losses: 1508, Draws: 1946, Points: 2951.0 (54.33 %)
Ptnml(0-2): [93, 509, 1169, 725, 220], WL/DD Ratio: 2.28
that's -30
womp womp
Well if 512 was really +130 to 256 for simplified (this result seems off, btw) that is still a good gain
+output buckets are maybe 10 or so
an output buckets 256 is training now and will be ready later to test
nice
Ok
Nice
In other news for ue optimization once again
Since Yukari obviously does it a lot better rn
I think at some point we need to switch from trying to compute every threat
To only incrementally updating threats
Startpos 20 second search for Yukari vs current branch had Yukari like 2x faster for (simplified->256) on my machine
You can also test this out
to clarify not just the amount but also where it comes from
I think Yukari has incremental attack tables
Which is integrated into movegen etc
I am wondering surely sf had fast attack tables back in the HCE era
So I am wondering if those still exist or we could bring them back
Like based off Disservin’s profile for L1=256 nearly half of the runtime is dedicated to computing and indexing threats
oh sheesh
You can see it here
yeah the current method is kind of a dead end eventually
We can mask the overhead more as L1 gets large but it’s still significant
In the meanwhile I can try an incremental threat hack from bitboards alone
So first of all castling never opens any discoveries right
And castling is the only move involving more than 2 squares
Even for frc right (please check this logic)
So if castling we know that we only need to deal with attacks involving those (up to 4) squares
Otherwise for two-square moves we can loop through the leapers
And discoveries are only present if both sides of a file/rank/diagonal are attacked
Do we have methods for only getting attacks on one line
Not both
Welp I guess not
full threats L1-256 with 8 output buckets:
Architecture : (80624 -> 256)x2 -> 1x8
Inputs : Threat inputs
Number of Weights : 20.64m
Output Buckets : Will be transposed in quantised network for you, output bucketed layers will
: have weights in form [[[T; layer input size]; layer output size]; buckets]
Quantisations : [255, 64]
Eval Scale : 340
https://tests.stockfishchess.org/api/nn/nn-a660a82f6a81.nnue
Can someone point me how to add these
It would allow better optimization of computing discovered attacks
what is one line in your case?
vertical/horizontal/diagonal ?
there's line_bb and between_bb
Vertical, both directions (so up and down)
Or horizontal, both directions (so left and right)
Inference is up
I would prefer re-introducing attack tables and then building threats on top of that
Because incremental from bitboards alone is very annoying, at least to optimize
For threats what you would want to do is:
Loop through the attackers of both squares and also the attacks
But then optimizing around deduplication and computing less attacks is hard
mh you can use hyperbola approach for just horizontal/vert if you don't want to use a lookup
Hyperbola doesn’t work for rank right?
What I was planning to do was loop through the attacks for the leapers (including both colors of pawns) and then consider rank, file, both diags individually to add discovered attacks
Combining this way means you don’t have to compute the bitboard of piece attacks separately since you already need it to compute the attackers of the square
But after thinking about it for a while I conclude that this will be a very messy hack
mh it does, had a look on github and one impl is just
template<uint32_t sq>
static constexpr uint64_t rook_attack(uint64_t occ) {
    return vertical_attack<sq>(occ) | horizontal_atkL<sq>(~occ) | horizontal_atkR<sq>(~occ);
}
so you can split that up into just vert/horizontal
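To show what "attacks on one line only" means, here is a sketch that uses a plain ray walk instead of hyperbola quintessence or a lookup table. It is far slower than either and only illustrates the split into file-only and rank-only attacks. The square mapping (a1 = 0, h8 = 63) and all function names are assumptions.

```rust
/// Squares attacked from `sq` along one direction (file step `df`, rank
/// step `dr`), stopping at and including the first occupied square.
fn ray(sq: u32, occ: u64, df: i32, dr: i32) -> u64 {
    let mut attacks = 0u64;
    let mut file = (sq % 8) as i32 + df;
    let mut rank = (sq / 8) as i32 + dr;
    while (0..8).contains(&file) && (0..8).contains(&rank) {
        let bit = 1u64 << (rank * 8 + file);
        attacks |= bit;
        if occ & bit != 0 {
            break; // blocker reached: it is attacked, nothing behind it is
        }
        file += df;
        rank += dr;
    }
    attacks
}

/// Vertical line only (up and down).
fn file_attacks(sq: u32, occ: u64) -> u64 {
    ray(sq, occ, 0, 1) | ray(sq, occ, 0, -1)
}

/// Horizontal line only (left and right).
fn rank_attacks(sq: u32, occ: u64) -> u64 {
    ray(sq, occ, 1, 0) | ray(sq, occ, -1, 0)
}

/// Full rook attacks are just the union of the two lines.
fn rook_attacks(sq: u32, occ: u64) -> u64 {
    file_attacks(sq, occ) | rank_attacks(sq, occ)
}
```

A discovered-attack check after a move would then only need the one line that passes through the vacated square and the slider, rather than the full rook or bishop attack set.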
Do you know if sf had attack tables in the HCE days
I think a broader issue is that currently we try to decouple nnue updates from the position
But threat inputs have a much larger dependence on the position
So trying to do threats separately without integrating it into the position and stuff like make_move naturally leads to awkward code
Eventually I think we do need to figure out some way to compute threat changes incrementally and I would appreciate help from more experienced people regarding this
Results of ./threat-inputs/mar14-sscg13-full-threats-256-8-output-buckets vs ./threat-inputs/mar14-sscg13-full-threats-256 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 3.50 +/- 4.74, nElo: 5.29 +/- 7.16
LOS: 92.60 %, DrawRatio: 44.70 %, PairsRatio: 1.03
Games: 9034, Wins: 2949, Losses: 2858, Draws: 3227, Points: 4562.5 (50.50 %)
Ptnml(0-2): [218, 1011, 2019, 1000, 269], WL/DD Ratio: 2.32
STC of 8 output buckets vs. 1 bucket:
https://tests.stockfishchess.org/tests/view/67d4935e517865b4a2dfcf8d
going to try larger L1, then different datasets later, including monty binpacks
Yeah seeing the gains at larger L1 will also be very helpful
In particular since inference is not in a horrible state right now I think it’s fine to take more time and plan out something better long-term
And I would appreciate any suggestions
Most important right now though is to scale the nets up to large size and verify gain over master at fixed nodes
Theoretically, with the current implementation, you compute all attacks of pieces, looping through ~40 threats and writing them, then write_difference loops through these and writes maybe 10 more on average
Whereas with real incremental you only need to loop through 12 attack bitboards and way less threats, and you can write the differences directly
Also: usually the accumulator updates are for both colors at once right, so we can try to optimize further by only computing the attacks once but writing the difference for both colors
This all needs more work to develop though
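Roughly, the current approach described above amounts to building the full threat list before and after the move and diffing the two. A minimal sketch of that diff step, assuming both lists are sorted and free of duplicates; `feature_diff` is a made-up name and not the branch's write_difference.

```rust
use std::cmp::Ordering;

/// Given sorted, duplicate-free active feature indices before and after a
/// move, return (added, removed). Hypothetical helper.
fn feature_diff(before: &[u32], after: &[u32]) -> (Vec<u32>, Vec<u32>) {
    let (mut added, mut removed) = (Vec::new(), Vec::new());
    let (mut i, mut j) = (0, 0);
    while i < before.len() && j < after.len() {
        match before[i].cmp(&after[j]) {
            Ordering::Less => {
                removed.push(before[i]);
                i += 1;
            }
            Ordering::Greater => {
                added.push(after[j]);
                j += 1;
            }
            Ordering::Equal => {
                i += 1;
                j += 1;
            }
        }
    }
    // Whatever is left over exists on one side only.
    removed.extend_from_slice(&before[i..]);
    added.extend_from_slice(&after[j..]);
    (added, removed)
}
```

The cost is linear in the total number of active threats even when only a handful change, which is the overhead a real incremental scheme would avoid by generating the added and removed features directly from the squares touched by the move.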
Looks like output buckets aren’t so great for full threats single layer
We’ll still have to see for multilayer later
Eh that's cache stuff I'm pretty sure. Like if the net was shared between instances it would gain
8 output buckets vs 1 bucket is hardly cache right
does nodestime work as a tc version of fixed nodes
is it because the later layers are constantly used, they probably sit in L1/L2. This is probably where the huge variation in elo gain comes from also
the new dataset appears quite superior to the old one
The new one is the same training sequence as the one used for simple threats. Old one was training directly on a single leela binpack from scratch
fishtest not a fan of L1-768 full threats being huge
Need to update fishtest.
NNUE that's just a rickroll loaded as an eval
@round stone What's the fixed nodes result of 512 Vs 256 full threats?
Don't really care about TC tests anyway, the speed is not representative of anything final so it's not telling anything
What we really are still needing is L1 1536 multilayer full threats with output buckets. To compare fixed nodes to master. I have access to 4x4090 u can use for training.
This speed stuff was supposed to be on the side, not the main focus at this stage. The concept has not yet been proven enough to justify spending so much time trying to optimise speed
This is getting out of hand (2.3k messages) because the process is not being understood I think. Like multilayer is being delayed as it is 'only 20 Elo whereas at TC the net is currently -100 Elo' but the TC result is irrelevant at this stage. We need a solid fixed nodes result first, so it is relevant there
although nowhere close to fully optimized, the speed is definitely sufficient for fixed nodes at a reasonable pace
Results of ./threat-inputs/mar15-sscg13-full-threats-512-S2 vs ./threat-inputs/mar15-sscg13-full-threats-256-S2 (10000+1e+04, 1t, 16MB, UHO_Lichess_4852_v1.epd):
Elo: 41.66 +/- 5.02, nElo: 64.55 +/- 7.71
LOS: 100.00 %, DrawRatio: 41.32 %, PairsRatio: 1.98
Games: 7802, Wins: 2882, Losses: 1951, Draws: 2969, Points: 4366.5 (55.97 %)
Ptnml(0-2): [104, 663, 1612, 1242, 280], WL/DD Ratio: 2.03
TC tests are important for establishing a baseline and quantifying improvement
optimizing speed is part of proving the concept, especially if speed is the biggest bottleneck
at larger L1, the training process and data becomes much more important
if an L1-512 full threats net is -50 elo STC vs. a far smaller L1-256 HalfKA v2 hm, then that's a problem
can you elaborate on why you think the process you have in mind is a better approach to being competitive vs. master at TC?
i think they more want to say
that we should get data about fixed nodes
to see if it can be competitive assuming good optimization
rn I think if you want to simulate an "optimized" implementation you can give 1.6x time odds as this is approximately how much faster Yukari is in the midgame to current stockfish at L1=256
is a 1.6x speedup within reach?
much larger L1 is going to need significant improvements to the training process + data to have any decent indication of fixed nodes strength
considering yukari's done it
there is no reason it shouldn't
well the 1.6x applies mostly to midgame
in endgame where there are few threats it isn't as large
I'll see where we are (full threats -> 256 vs smallnet) with +50% time odds
and I think my implementation is actively less efficient than a lazy approach
is this the right way round?
wait huh
If a L1 1536 is not remotely close to master at fixed nodes no further time needs to be invested
Ideally we would have the unoptimized version exactly how we want in terms of net and then optimise after
I think L1 1536 should be close to master
if we take simplified 1024 is -80, add +40 for full threats, add +20 for larger L1, and +20 for multilayer
Sure you can think but it needs to be proven. I have the GPUs let's just do it before investing a fuckton of time
right
yeah 4x4090 should massively speed up the training
as long as you have the data and know what to do
Linrock can provide the script and data to me
Ideally it would be complete in terms of arch so then we know this is exactly what we are even optimizing
nnue-pytorch is far ahead of bullet as far as maximizing strength of data, and this is only amplified with larger L1
That's fine. This is work that has to be done anyways, I still think it's an easier path than doing UE threat gen and stuff
With 4 GPUs it means 4 experiments can go in parallel
i think it saves time to prove/disprove at smaller L1
ok https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da +50% time odds should be actually correct now
if it can never be fast enough to become a master net with optimal UE, then there's no need for a long and complicated training process
in Disservin's profile for L1=256 up to 45% of the time is spent on append_active_threats, write_difference, and memmove_avx_unaligned_erms
memmove is probably because I did smth wrong with alignment
and if we can reduce those things to 1/4 of their current runtime with incremental threat computation that gives our +50% speedup approximately
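The +50% estimate follows from Amdahl's law applied to the profile numbers quoted above. A small sketch of the arithmetic (the function name is made up):

```rust
/// Overall speedup when a `fraction` of the runtime becomes `factor` times
/// faster and the rest is unchanged (Amdahl's law).
fn overall_speedup(fraction: f64, factor: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / factor)
}
```

With 45% of the runtime cut to a quarter, the remaining time is 0.55 + 0.45 / 4 = 0.6625 of the original, which is a speedup of about 1.51x.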
Im not sure if it was there before I can check later
SSS but it seems that it's far superior to smallnet with +50% time?
Can you give a list of data loader features that are needed
I’ll just add them in the next week sometime
biggest two are:
- piece count probability distribution skipping
- WDL skipping
got any digestible code references for them?
piece count probability: https://github.com/official-stockfish/nnue-pytorch/pull/173
WDL skipping: https://github.com/official-stockfish/nnue-pytorch/pull/155
btw +50% time odds is not 100 elo right
it should still be superior even at the current version
i might make a test later
so if you grab 50% speedups from the threat calculation and indexing, we're gaming?
out of curiosity how is the relative speed of the trainer (bullet vs nnue-pytorch), also, does bullet have multi-GPU training?
yeah at least at L1=256
sadly being like 4x larger than current smallnet it doesn't fulfill the smallnet purpose
the smallnet primary purpose was to gain elo
so if a threats net can be used as a smallnet as a gainer, it's another path to production
i meant for replacing the lichess net
like it's gonna be 30MB
and that will be a bit bad
oh yea probably not as a lichess smallnet replacement, where the purpose was to be small
if a 1.5x speedup from L1-256 threats = +150 STC elo (from -50 to +100) vs the L1-256 smallnet then it looks promising to explore larger L1
bullet does not have multi-GPU training, but it's on the todo list, AIUI
well I assume full threats would've been like ~neutral at tc given it was +40 or so compared to simplified
training speed is hard to compare, since it varies due to many differences between them: net architectures, skipping, dataset format
it seems like it performs better against lichess smallnet with the original book though
compared to endgames.epd
for whatever reason
right now i'm getting around 2M pos/sec for these L1-512 threat nets, loading binpacks without any skipping
so gut feeling comparison, same order of magnitude, or rather slower or rather faster.
also config options i haven't closely looked into optimizing
also 120MB limit for net means that tc testing for larger L1 is out for now
i'd say training definitely seems faster with bullet, but that's also because piece count + WDL skipping makes it much slower
similar order of magnitude of training speed
using the sequential dataloader maybe 2-3x faster for small nets, vs. binpacks even without skipping
the data disk size usage vs. training speed tradeoff of bullet format vs. binpacks is very noticeable
I've done https://tests.stockfishchess.org/tests/view/67d5f445517865b4a2dfd2e6 as well which is equal tc
yea, need compression (leb128) and/or fishtest limit increase to go further than full threats L1-512 on fishtest
which will hopefully let us know how threat inputs scales with more time
in the meantime i'm hoping L1-256 and L1-512 experiments can give an indicator of how it scales
full threats -> 512 is like 80 mb right?
yea
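The size estimate is easy to reproduce, as a sketch that assumes the feature transformer weights are stored as uncompressed i16 and ignores the much smaller remaining layers:

```rust
/// Bytes used by the feature transformer weights alone, assuming 2 bytes
/// (i16) per weight and no compression. Hypothetical helper.
fn ft_weight_bytes(inputs: u64, l1: u64) -> u64 {
    inputs * l1 * 2
}
```

For 80624 inputs this comes to about 82.6 MB at L1=512 and about 123.8 MB at L1=768, which is why L1=768 runs into the 120MB limit while L1=512 still fits.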
does the memory issue become significant
viren might know better
ik viren advocates for not duplicating ram for nnue weights on multiple sf processes
huh +50% speed is 150 elo
this is a bit tyrannical scaling
but consistent with what we know https://github.com/official-stockfish/Stockfish/wiki/Useful-data#elo-from-speedups
(more or less)
i was expecting less than +100 elo with a +50% speedup but didn't really know what was reasonable
worth testing it vs. master to see where it's at there
since the L1-256 lichess smallnet is around -86 vs. master at STC
https://tests.stockfishchess.org/tests/view/67d33095517865b4a2dfc73a
it's only 1.5x time odds not 2x
i'm using the command as follows on nnue-pytorch main branch
python3 train.py ../bullet/data/test80-2024-02-feb-2tb7p.min-v2.v6.binpack --batch-size 16384 --features=HalfKAv2_hm --lambda=1.0 --gpus "0," --threads 8 --num-workers 8 --default_root_dir ./ --no-smart-fen-skipping
does this command look okay?
This gives me 92/6166 [00:17<19:01, 5.32it/s, loss=0.0209, v_num=1] which is ~90k pos/sec
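The pos/sec figure is just the progress-bar rate multiplied by the batch size from the command line. As a sketch (function name made up):

```rust
/// Training throughput in positions per second from the progress-bar
/// iteration rate and the batch size.
fn positions_per_second(iters_per_sec: f64, batch_size: u32) -> f64 {
    iters_per_sec * batch_size as f64
}
```

5.32 it/s at batch size 16384 is about 87k positions per second, so the ~90k quoted above is a slight round-up.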
I believe I'm training SF master net with L1=3072, bucket=8, having touched nothing else
Num virtual features: 0 i take it this is not using a factoriser as well
shouldn't it be 15+0.15 vs. 10+0.1?
https://tests.stockfishchess.org/tests/view/67d5e893517865b4a2dfd2da
right now it's 15+0.5 which is 5x the base increment
https://github.com/jw1912/bullet/blob/sf-arch-i-think/examples/advanced.rs here is more or less the equivalent in bullet, there's probably some minor differences that have no effect on performance (like output buckets being calculated slightly differently)
superbatch 1 [12.5% (128/1024 batches, 171237 pos/sec)]
oh shoot-
how long is the average fishtest game
tested on gtx 1660 super btw
probably more than 80 plies
ok it's more like 2.5x time odds then
oopsies
yeah the real 1.5x time odds should be close to neutral then
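The 2.5x figure comes from adding up base time and increments over a whole game. A sketch of that arithmetic, assuming 40 moves per side (80 plies, the low end of the guess above); the helper names are made up.

```rust
/// Total thinking time for one side over a game of `moves` moves.
fn game_time(base: f64, inc: f64, moves: f64) -> f64 {
    base + inc * moves
}

/// Effective time odds of time control `a` over `b`, each given as
/// (base seconds, increment seconds).
fn time_odds(a: (f64, f64), b: (f64, f64), moves: f64) -> f64 {
    game_time(a.0, a.1, moves) / game_time(b.0, b.1, moves)
}
```

15+0.5 against 10+0.1 gives 35s vs. 14s, i.e. 2.5x, while 15+0.15 against 10+0.1 gives 21s vs. 14s, the intended 1.5x. Longer games push the mistaken setup even further towards the 5x increment ratio.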
updated with added L1 factoriser as seen in the LayerStacks class in NNUE pytorch
superbatch 1 [62.5% (640/1024 batches, 166802 pos/sec)]
ah the psqt subnet is output bucketed
superbatch 1 [37.5% (384/1024 batches, 165194 pos/sec)] have updated
superbatch 1 [37.5% (384/1024 batches, 154846 pos/sec)] + ranger optimiser
makes me wonder how much elo is from optimizing hyperparameters and the data filter changes
hmm looks like with the real +50% it's still better than smallnet
but significantly worse than master as expected
thanks for testing, so that would be 1.5 - 2.0x speedup or so, which is very relevant.
quite a bit I think, I don't dare to put a number to it, but people might start to forget how many iterations went in the current settings, we've quite literally tested over 5000 nets on fishtest.
(which made me realize that my local backup was not adjusted to the recent server changes... need to fix that... fixed).
the default net uses factorizing, so the features should be HalfKAv2_hm^ (I think)
good catch
309/6166 [01:35<30:14, 3.23it/s, loss=0.0147, v_num=3] with factoriser ~53k pos/sec
superbatch 1 [50.0% (512/1024 batches, 98099 pos/sec)] pushed to the branch i linked earlier
interesting, so getting the exact SF arch and training features in bullet would be quite nice for training.
Apart from the output bucket formula I’m now fairly certain the arch is exactly replicated in the branch I linked
For required extra features other than the data loading stuff I’m not sure
The NNUE-PyTorch ranger impl has more stuff than the bullet one but all the extra features seem unused by default
#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount<const N: usize>;
impl<const N: usize> OutputBuckets<ChessBoard> for SfMaterialCount<N> {
    const BUCKETS: usize = N;
    fn bucket(&self, pos: &ChessBoard) -> u8 {
        let piece_count = pos.occ().count_ones() as u8 - 1;
        (piece_count / 4) as u8
    }
}
one can just define that for the output buckets right?
Yeah
And for the optimiser stuff you can implement a custom optimiser that wraps the existing ranger one with any extra stuff you need
As long as it doesn’t require any funky gpu operations that aren’t yet supported
#[derive(Clone, Copy, Default)]
pub struct SfMaterialCount;
impl OutputBuckets<ChessBoard> for SfMaterialCount {
    const BUCKETS: usize = 8;
    fn bucket(&self, pos: &ChessBoard) -> u8 {
        let piece_count = pos.occ().count_ones() as u8 - 1;
        (piece_count / 4) as u8
    }
}
don't need the generic param also
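To make the bucket formula concrete without pulling in bullet, here is a standalone sketch that takes a raw occupancy bitboard instead of a `ChessBoard`. The free function `material_bucket` is hypothetical; only the arithmetic is taken from the snippet above.

```rust
/// Output bucket from an occupancy bitboard: (piece_count - 1) / 4.
/// With 2..=32 pieces on the board this yields buckets 0..=7.
fn material_bucket(occupancy: u64) -> u8 {
    let piece_count = occupancy.count_ones() as u8;
    // saturating_sub guards the (illegal) empty-board case against underflow
    piece_count.saturating_sub(1) / 4
}
```

The starting position has 32 pieces and lands in bucket 7; bare kings land in bucket 0.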
yeah just realised, im getting 670345 pos/sec for that arch
damn my gpu sucks lol
hmm, I do have a GPU I'd like to test, guess I have to figure out what to run. Is there a TL;DR / list of commands I could run. Worth mentioning I've never run anything rusty 🙂
here's a good place to start with getting bullet working for training:
https://github.com/jw1912/bullet/blob/main/docs/2-getting-started.md
after installing rust, take a look at the simple.rs example and train with cargo r -r --example simple
That is somewhat out of date now, the CUDA requirement is lower than 12.2 and you can also compile for CPU
The instructions in there should work fine tho
clone this https://github.com/Disservin/sf-bullet-train.git
replace the binpack path
let file_path = "G:\\stockfish-data\\leela96-filt-v2.min.binpack";
and run
cargo run --release .
ok, great, just got the simple to work..
if you run this on your arm cpu you need to drop the bmi2 in the Cargo.toml
sfbinpack = { package = "binpack", git = "https://github.com/Disservin/binpack-rust", rev = "483e9aac028b4c3e0671af6b28ff50f64d696558", features = ["bmi2"]}
working 'out-of-the-box'
Params: 72205464
Beginning Training
Net Name : test
Batch Size : 16384
Batches / Superbatch : 1024
Positions / Superbatch : 16777216
Start Superbatch : 1
End Superbatch : 10
Eval Scale : 400
Save Rate : 150
WDL Scheduler : constant 0
LR Scheduler : start 0.001 gamma 0.3 drop every 60 superbatches
Threads : 4
Output Path : checkpoints
superbatch 1 | time 2.6s | running loss 0.000000 | 6363127 pos/sec | total time 4.4s
Estimated time remaining in training: 0h 0m 39s
superbatch 2 | time 2.6s | running loss 0.000000 | 6463666 pos/sec | total time 6.9s
Estimated time remaining in training: 0h 0m 27s
superbatch 3 | time 2.6s | running loss 0.000000 | 6474294 pos/sec | total time 9.5s
Estimated time remaining in training: 0h 0m 22s
superbatch 4 | time 2.6s | running loss 0.000000 | 6477194 pos/sec | total time 12.1s
Estimated time remaining in training: 0h 0m 18s
superbatch 5 | time 2.6s | running loss 0.000000 | 6475190 pos/sec | total time 14.7s
Estimated time remaining in training: 0h 0m 14s
superbatch 6 | time 2.6s | running loss 0.000000 | 6475920 pos/sec | total time 17.3s
Estimated time remaining in training: 0h 0m 11s
superbatch 7 | time 2.6s | running loss 0.000000 | 6478756 pos/sec | total time 19.9s
Estimated time remaining in training: 0h 0m 8s
superbatch 8 | time 2.6s | running loss 0.000000 | 6475852 pos/sec | total time 22.5s
Estimated time remaining in training: 0h 0m 5s
superbatch 9 | time 2.6s | running loss 0.000000 | 6480538 pos/sec | total time 25.1s
Estimated time remaining in training: 0h 0m 2s
superbatch 10 | time 2.6s | running loss 0.000000 | 6494769 pos/sec | total time 27.7s
Estimated time remaining in training: 0h 0m 0s
Saved [test-10]
Total Training Time: 0h 0m 34s
Eval: 0.000cp
now that loss / eval is maybe a bit suspicious?
yeah deffo
hmm, I'm using the CUDA_PATH setting, not HIP.
seems happy building things, though
(but no changes to your repo, except for dropping bmi2 and specifying the path to the .binpack)
superbatch 1 | time 26.0s | running loss 0.011861 | 645176 pos/sec | total time 28.5s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 41s
Eval: 23.493cp
thats how it should end, just changing the superbatches from 10 to 1 here so that it finishes quickly
did the simple example do something reasonable?
well, probably not:
Output Path : checkpoints
superbatch 1 | time 8.7s | running loss 0.000000 | 11501668 pos/sec | total time 11.7s
Estimated time remaining in training: 0h 7m 37s
similarly 0.0 loss
what happens if you add the line trainer.sanity_check(); before trainer.run() in the simple example?
lots of red ..
What GPU is this?
Try running cargo test --package bullet_core
all green
running 11 tests
test backend::cpu::crelu ... ok
test backend::cpu::matmul ... ok
test backend::cpu::relu ... ok
test backend::cpu::matmul2 ... ok
test backend::cpu::concat ... ok
test backend::cpu::screlu ... ok
test backend::cpu::sparse_affine ... ok
test backend::cpu::sparse_affine_check_not_batched ... ok
test backend::cpu::sparse_affine_batched_biases ... ok
test backend::cpu::sqrrelu ... ok
test backend::cpu::sparse_affine_dual ... ok
test result: ok. 11 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
though that's the CPU backend?
You aren’t passing the HIP feature are you?
wouldn't it just flat-out fail to link?
I guess no? That test however does:
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
Running unittests src/lib.rs (target/debug/deps/bullet_core-a748fb7f74ceb385)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s
Running unittests src/lib.rs (target/debug/deps/bullet_hip_backend-f99c5dab2957960e)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 11 filtered out; finished in 0.00s
Running unittests src/lib.rs (target/debug/deps/bullet_lib-21ef2babe27415ff)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
I’ll get home in a minute, will link you a branch to try
happy to try out (probably after dinner, let's see)
the typical vondele dinner break, i ran into many of those 😄
good stuff coming 😉
I guess this is why people joke about vondele's laptop; jeez that's some kit :p
it can run crysis
okay on branch debug-gh200, run cargo test --package bullet_hip_backend followed by cargo r -r --example simple
test still fails
running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... FAILED
test tests::sqrrelu ... FAILED
test tests::screlu ... FAILED
test tests::sparse_affine ... FAILED
test tests::sparse_affine_dual ... FAILED
test tests::sparse_affine_batched_biases ... FAILED
test tests::crelu ... FAILED
test tests::matmul ... ok
test tests::concat ... ok
test tests::matmul2 ... ok
i pushed another change just now that'll affect running the simple example
it looks like the kernels are not running but not raising any error about it
baffling
Ah, wait: called `Result::unwrap()` on an `Err` value: Cuda(cudaErrorUnsupportedPtxVersion)
driver issues
indeed the most common cause of this error is having mismatched driver for your cuda version
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GH200 120GB On | 00000009:01:00.0 Off | 0 |
| N/A 16C P0 95W / 900W | 290MiB / 97871MiB | 0% Default |
| | | Disabled |
...
cudaErrorUnsupportedPtxVersion
This indicates that the provided PTX was compiled with an unsupported toolchain. The most common reason for this, is the PTX was generated by a compiler newer than what is supported by the CUDA driver and PTX JIT compiler.
a very "cool" gpu :p
unfortunately nvidia only list minimum requirements for x86-64 drivers
driver and cuda version definitely are compatible..
https://fieldprogrammable.gay/files/f660acb0-505b-43f0-ab3c-9edc6066c95b.png
this would match up though
...did you update your drivers since last rebooting?
wondering if it's a kernel/userland mismatch
me no... but certainly all good on that front.
the guy here apparently had to export PATH and LD_LIBRARY_PATH and do a reboot
https://forums.developer.nvidia.com/t/agx-orin-error-cudaerrorunsupportedptxversion-the-provided-ptx-was-compiled-with-an-unsupported-toolchain/294408/7
well, let me test something...
I think that's required for CUDA in general
I tried a different rust install, it said this:
error: rustc 1.81.0-nightly is not supported by the following package:
[email protected] requires rustc 1.83
so I guess that's indeed a requirement for bullet_core?
yes
ok, so back to the previous install.
The rust install won’t make a difference here
The issue is with compiling the cuda kernels and rust has nothing to do with that other than build script invoking nvcc
is there a way to get output of that compilation (in particular invocation of nvcc, or similar?)
Yeah one sec I’m cooking dinner
cargo clean
Removed 1499 files, 597.5MiB total yeah
though no difference
so with the CXX=/bin/false I don't get much more than:
error occurred in cc-rs: Command LC_ALL="C" "nvcc" "-ccbin=/bin/false" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu" with args nvcc did not execute successfully (status code exit status: 1).
Also
--- stdout
cargo:rerun-if-changed=./kernels
rerun-if-env-changed=CUDA_PATH
Path CUDA_PATH="/user-environment/env/default"
cargo:rustc-link-lib=dylib=cublas
cargo:rustc-link-search=native=/user-environment/env/default/lib64
cargo:rerun-if-changed=/user-environment/env/default/include
TARGET = Some(aarch64-unknown-linux-gnu)
HOST = Some(aarch64-unknown-linux-gnu)
cargo:rerun-if-env-changed=CXX_aarch64-unknown-linux-gnu
CXX_aarch64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CXX_aarch64_unknown_linux_gnu
CXX_aarch64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CXX
HOST_CXX = None
cargo:rerun-if-env-changed=CXX
CXX = Some(/bin/false)
RUSTC_WRAPPER = None
cargo:rerun-if-env-changed=CC_ENABLE_DEBUG_OUTPUT
cargo:rerun-if-env-changed=NVCC_aarch64-unknown-linux-gnu
NVCC_aarch64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=NVCC_aarch64_unknown_linux_gnu
NVCC_aarch64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_NVCC
HOST_NVCC = None
cargo:rerun-if-env-changed=NVCC
NVCC = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
CARGO_CFG_TARGET_FEATURE = Some(neon)
cargo:rerun-if-env-changed=CXXFLAGS_aarch64-unknown-linux-gnu
CXXFLAGS_aarch64-unknown-linux-gnu = None
cargo:rerun-if-env-changed=CXXFLAGS_aarch64_unknown_linux_gnu
CXXFLAGS_aarch64_unknown_linux_gnu = None
cargo:rerun-if-env-changed=HOST_CXXFLAGS
HOST_CXXFLAGS = None
cargo:rerun-if-env-changed=CXXFLAGS
CXXFLAGS = None
CARGO_ENCODED_RUSTFLAGS = Some()
i think there isn't much more expected since bullet just compiles that one file
https://github.com/jw1912/bullet/blob/main/crates/bullet_hip_backend/build.rs#L59 replace this block with
cc::Build::new()
    .cargo_warnings(false)
    .cuda(true)
    .cudart("shared")
    .cargo_debug(true)
    .debug(true)
    .opt_level(3)
    .files(&[KERNELS])
    .out_dir(out_path)
    .compile("libkernels.a");
panic!();
the issue is just setting CXX to something invalid doesn't actually compile anything
ideally would like to see what gets emitted
full output?
it is 🙂
output
That's the strange bit?
running: "nvcc" "-?"
cargo:warning=nvcc fatal : Unknown option '-?'
exit status: 1
nah that's just the cc crate being weird
this is doing some crazy stuff after the initial nvcc invocation
i assume due to using gcc
running: LC_ALL="C" "nvcc" "-ccbin=c++" "-Xcompiler" "-O3" "-Xcompiler" "-ffunction-sections" "-Xcompiler" "-fdata-sections" "-Xcompiler" "-fPIC" "-G" "-Xcompiler" "-gdwarf-4" "-Xcompiler" "-fno-omit-frame-pointer" "-Xcompiler" "-Wall" "-Xcompiler" "-Wextra" "-o" "/users/vjoost/fish/bullet/target/debug/build/bullet_hip_backend-44a0a83a10a89e6d/out/9080fc8993de408a-include.o" "-c" "./kernels/include.cu"
exit status: 0
the actual command seems to just work
yeah c++ is definitely gcc
maybe just need to work out what ptx version your system wants and tell it to emit that
https://github.com/jw1912/bullet/blob/main/docs/2-getting-started.md#general i recommend clang but i dont believe thats causing the issue here
so right now no clang available...
though it must be somewhere ..
would be easier if we could restrict the ptx version or so.
doesn't seem possible but can check the ptx version emitted
I don't see any option that specifies the gpu type
ptx version != target sm
https://github.com/jw1912/cuda-stuff can you try cloning this repo and running make TARGET=sparse_fwd?
if this works fine then it would be some issue with the nvcc commands that bullet invokes
compiles into a main.exe
mv main.exe main
nvprof is executed but not installed
try running main
fails.
interesting
Running naive
Average Time: 0.625ms
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
good let me figure this out.
sorry, im not really qualified to debug general cuda toolchain issues
might take some time.
sounds good
wait holy shit
nahhhh
that looks like the naive kernel runs
proposal: bullet should just always run the sanity checks on startup
yeah in silent mode
good thinking batman
i'll PR that later
I'm like robin at best
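A rough sketch of what such a startup check could look like, assuming the backend can hand back a kernel's output as a host-side slice. Nothing here is bullet's actual API: `screlu_reference` and `check_against_reference` are hypothetical names, and SCReLU is taken to be clamp(x, 0, 1) squared.

```rust
/// CPU reference for SCReLU: clamp to [0, 1], then square.
fn screlu_reference(x: f32) -> f32 {
    let c = x.clamp(0.0, 1.0);
    c * c
}

/// Compare device output against the CPU reference element by element.
/// Returns an error naming the first mismatching index, so a kernel that
/// silently did nothing (output left at zero) is caught before training.
fn check_against_reference(inputs: &[f32], device_out: &[f32], tol: f32) -> Result<(), String> {
    if inputs.len() != device_out.len() {
        return Err(format!("length mismatch: {} vs {}", inputs.len(), device_out.len()));
    }
    for (i, (&x, &got)) in inputs.iter().zip(device_out).enumerate() {
        let want = screlu_reference(x);
        if (want - got).abs() > tol {
            return Err(format!("index {i}: expected {want}, got {got}"));
        }
    }
    Ok(())
}
```

The point of including inputs above zero is that an all-zero output buffer then cannot pass by accident, which is the situation the 0.000000 loss hinted at.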
could you checkout check-errors-everywhere and do the same process again?
Running naive
Average Time: 0.597ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Running vectorised
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.000ms
Error Status: the provided PTX was compiled with an unsupported toolchain.
Error Status: no error
Error Status: no error
so strange. Let me ask around
ah so it just takes 0.6ms to realise the ptx is the wrong version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:27:38_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
aha
that...that would do it
^^
so compiler newer than the driver?
yeah looks like it
do you have 12.4 installed somewhere on your FS?
things are a bit more complicated in this setup... somewhat different approach to deploy software
OK, this is fixed if I pass nvcc -arch=native src/sparse_fwd/main.cu -o main.exe -lcublas
Running naive
Average Time: 0.586ms
Error Status: no error
Error Status: no error
Running vectorised
Average Time: 0.198ms
Error Status: no error
Error Status: no error
Error Status: no error
Running blocktiled
Average Time: 0.190ms
Error Status: no error
Error Status: no error
Error Status: no error
any way to pass -arch=native to the rust compilation args?
I think without specifying anything it compiles for the highest possible arch
RUSTFLAGS="-C target-cpu=native" cargo run ...
though for nvcc, hm
somewhere in this section, try adding .flag("-arch=native")
bingo!
running 11 tests
test tests::sparse_affine_check_not_batched ... ok
test tests::relu ... ok
test tests::matmul ... ok
test tests::crelu ... ok
test tests::sparse_affine ... ok
test tests::sqrrelu ... ok
test tests::screlu ... ok
test tests::sparse_affine_batched_biases ... ok
test tests::concat ... ok
test tests::sparse_affine_dual ... ok
test tests::matmul2 ... ok
\o/
yay
so, moving to the simple example, I guess
or to @candid ivy code, but I assume that needs a bullet upstream fix?
what fix was needed?
this
i dont think that's needed but might be faster
diff --git a/crates/bullet_hip_backend/build.rs b/crates/bullet_hip_backend/build.rs
index 71f7989..1c9cc2e 100644
--- a/crates/bullet_hip_backend/build.rs
+++ b/crates/bullet_hip_backend/build.rs
@@ -62,6 +62,7 @@ fn build_cuda(out_path: &Path) {
.cudart("shared")
.debug(false)
.opt_level(3)
+ .flag("-arch=native")
.files(&[KERNELS])
.out_dir(out_path)
.compile("libkernels.a");
I think without this option, it takes the highest sm that the compiler supports..
which might include unknown ptx
it defaults to sm_52
well that is probably ancient enough as well 😉
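As far as I understand the failure mode: without an explicit arch, nvcc builds for its old default target and embeds PTX, which the driver then has to JIT for the actual GPU, and a 12.6 compiler emits a PTX version that a 12.4 driver does not accept. `-arch=native` asks nvcc to generate code for the GPUs it finds on the build machine instead. A small sketch of making the arch configurable from a build script; the `BULLET_NVCC_ARCH` variable and the helper are hypothetical, not something bullet provides:

```rust
/// Build the nvcc `-arch=...` flag. `override_arch` would typically come
/// from an environment variable (hypothetical: BULLET_NVCC_ARCH), e.g.
/// "sm_90"; when absent or blank we fall back to "native".
fn nvcc_arch_flag(override_arch: Option<&str>) -> String {
    let arch = override_arch
        .map(str::trim)
        .filter(|a| !a.is_empty())
        .unwrap_or("native");
    format!("-arch={arch}")
}

// In build.rs this would be used roughly as:
//   let arch = std::env::var("BULLET_NVCC_ARCH").ok();
//   build.flag(&nvcc_arch_flag(arch.as_deref()));
```

One trade-off worth keeping in mind: a `native` build is tied to the GPUs of the machine that compiled it, so an override is handy when building on one box and running on another.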
i think my 4090 is 80 or something?
it would be good to get data points on if this is a measurable speedup
let me test this
im not seeing a speedup locally but i have a crap gpu
I feel like "necessary for building on GH200 [even if obscure] and not a slowdown" is sufficiently convincing, tbh
yeah i'll merge it in
superbatch 1 | time 8.8s | running loss 0.037813 | 11413254 pos/sec | total time 10.7s
simple working..
you expect that bottleneck to be the filesystem or something of the code?
if you're loading bulletformat then its filesystem
binpack loading its the shuffling step
eyeballing the change i'd say a 3% performance increase, but i'd estimate the error to be 5%
which would be replaced by fen skipping aggressively
good enough for me
i guess you are already trying /dev/shm ?
not yet, can do now
okay i've merged the native thing and rebased sf-arch-i-think branch
no real difference superbatch 1 | time 8.6s | running loss 0.037825 | 11637627 pos/sec | total time 10.2s (on simple)
then yeah, probably bottlenecked on the data loader
(it was a missed opportunity to not call loaders magazines /j)
but simple is, well, very simple and tiny arch
lol
https://github.com/jw1912/bullet/blob/main/examples/simple.rs#L56 for data preparing threads
https://github.com/jw1912/bullet/blob/main/examples/simple.rs#L62 data loading threads
increased both to 16
Worse superbatch 1 | time 12.4s | running loss 0.037861 | 8092996 pos/sec | total time 14.2
probably increasing the loader threads doing that
i'd guess the bottleneck is the shuffling step
which is single threaded
16/4 also worse superbatch 1 | time 14.1s | running loss 0.037859 | 7109966 pos/sec | total time 15.7s
damn
4/16 roughly equal superbatch 1 | time 9.2s | running loss 0.037878 | 10833312 pos/sec | total time 11.0s
interesting
4/4 best superbatch 1 | time 8.6s | running loss 0.037861 | 11631975 pos/sec | total time 10.3s
this is just a 768->128x2->1 network tbf
could you try the advanced example on branch sf-arch-i-think
ok, let me swap branches.
nice
I think I top out at like 7 million positions per second, so this is pretty huge :p
luckily as long as you aren't actually hitting the data loader bottleneck, it doesn't matter how fast/slow it is
so as long as less than 11m pos/sec on a real arch, all should be good
superbatch 1 [75.0% (768/1024 batches, 1638909 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }
some panic...
hmm, but not every time
superbatch 1 | time 10.2s | running loss 0.018962 | 1647798 pos/sec | total time 11.8s
Estimated time remaining in training: 0h 0m 0s
Saved [test-1]
Total Training Time: 0h 0m 19s
Eval: 68.779cp
yeesh
binpack loading code has an edge case it seems
i would guess its something like if the binpack can't fill the shuffle buffer
it is a big binpack..
+1m pos/sec than my gpu damn
and another:
superbatch 1 [62.5% (640/1024 batches, 1635149 pos/sec)]
thread '<unnamed>' panicked at /users/vjoost/fish/bullet/crates/bullet_lib/src/trainer/default/loader/sfbinpack.rs:108:50:
called `Result::unwrap()` on an `Err` value: SendError { .. }
oh i see the bug
ready to pull when you are 😉
its sending the message to stop one of the loader threads twice
and on the second time it errors because the thread isn't around anymore to receive it
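That failure mode is easy to reproduce with plain `std::sync::mpsc`: once the receiving thread has exited, a second `send` returns `Err(SendError)`, and calling `unwrap()` on it panics. A minimal sketch, not bullet's loader code; `Msg` and `stop_twice` are made up for illustration:

```rust
use std::sync::mpsc;
use std::thread;

enum Msg {
    Stop,
}

/// Send Stop twice to a worker that exits after the first message.
/// Returns whether each send succeeded.
fn stop_twice() -> (bool, bool) {
    let (tx, rx) = mpsc::channel::<Msg>();
    let worker = thread::spawn(move || {
        // Exit on the first message; `rx` is dropped when the thread ends.
        let _ = rx.recv();
    });

    let first = tx.send(Msg::Stop).is_ok();
    worker.join().unwrap();
    // The receiver is gone now, so this send fails instead of queueing.
    let second = tx.send(Msg::Stop).is_ok();
    (first, second)
}
```

`stop_twice()` returns `(true, false)`. Checking the result like this, rather than unwrapping it, is one way to make a redundant stop message harmless.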
oh wait
have merged
yay
https://github.com/jw1912/bullet/blob/sf-arch-i-think/examples/advanced.rs#L97 could you uncomment the two lines on either side of trainer.run and run it?
should give some profiling info
sure, I see the GPUs 75% idle 😉
we need bigger batches /lh
no, just kidding, it is using 1/4 GPUs
i think HL=3072 might be so large that the pairwise mul might trip a cuda error for launching kernel with too large grid size if you increase batch size too much
that 211GB probably is able to fit the entire dataset of people into ram 😛
well Sf datasets are large
and it is seemingly not the bottleneck
profiling reduces a little bit the performance
profile
interesting ratio
| Node 21 = SparseAffineDualActivate 1367 4208 | ~3.08
compared to mine
| Node 21 = SparseAffineDualActivate 37160 87121 | ~2.34
i suspect the aligned loads on the forward pass are going extremely hard
well this has been interesting
GH200 is insane
the chip's worth like a year's wages for me >.>
it is quite nice indeed... if there is anything I can still measure let me know, thanks for the help getting it to run.
i have some upcoming optimisations that will hopefully help this arch a lot in particular
I'd be more than happy to test a multi-gpu implementation if it appears 🙂
ok, let me know.
I wanted to switch to using cudarc (rather than custom code as is at the moment) before doing that because then I'd be offloading writing tons of device handling boilerplate (for which the safety isn't trivial)
but then some soundness issue was pointed out and now its getting loads of breaking changes lol https://github.com/coreylowman/cudarc/issues/340
might just stick with the current cudarc version and not touch the unsound part
too many things to do, not enough free time
relatable ..
so, I think I found back the old nnue-pytorch benchmarks I did #nnue-dev message
now, can we match that to what we just ran with bullet?
I have no idea what a pytorch iteration is :p
I think it is the same 16384 batch ...
I think so as well.
that's also the math jw did earlier
so, probably could run bullet for that size to compare better
which means bullet trains 2.4x faster on an even larger L1 :p
though the dataskipping will be different... and that might be limiting. idk
2560 superbatch 1 | time 8.6s | running loss 0.019224 | 1960857 pos/sec | total time 10.3s
It's also probably not entirely equivalent because bullet uses superbatches
so roughly 3x..
but yes, comparison might be slightly off.
Still makes a big difference.
not on bullet side at least
ok, so would be interesting to see a bullet trained net match SF master net.. obviously a non-trivial exercise.
though a 3x speedup would help doing that.
as a fun additional note, if you decreased the number of the buckets the gap between the two would most likely grow quite significantly
it would take ages to recreate an entire run yeah
one stage probably reasonable
well, with this kind of speedup it probably goes rather quick, less than a day for a stage.
oh i didn't realise individual stages were so short
I think so, but I've a bit forgotten how long we train one stage.
@round stone will know
well at first id like to see someone successfully load such a net into sf and get somewhere close like 200 elo range or something
sure..
and when that is "public" then im sure people will naturally play around and try it
I'm assuming that if the net arch is the same that might be more or less doable?
1st stage - 400 superbatches
2nd-11th stage ~ 800 superbatches each
so 10s per superbatch right now..
the binary format is slightly different, like no header, (anything else?) and then leb128 and permutation i guess
i think linrock is using the normal superbatch definition, the one in that sf arch example is different
i reduced batches/superbatch because otherwise it would take forever for me to run it lol
1 bullet superbatch = 1 nnue-pytorch epoch
right, ye
both are ~100 million samples
the sf-arch-i-think advanced example is doing ~16.78m samples per superbatch
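Putting the numbers from the last few messages together as a back-of-the-envelope sketch. The 100 million samples per superbatch is the rounded figure quoted above, and the helper names are made up:

```rust
/// Positions in one superbatch for a given batch size and batch count.
fn positions_per_superbatch(batch_size: u64, batches: u64) -> u64 {
    batch_size * batches
}

/// Wall-clock hours for `superbatches` superbatches of `positions` each
/// at a sustained throughput of `pos_per_sec`.
fn stage_hours(superbatches: u64, positions: u64, pos_per_sec: f64) -> f64 {
    (superbatches * positions) as f64 / pos_per_sec / 3600.0
}
```

16384 x 1024 = 16,777,216 positions, which at ~1.65M pos/sec is the ~10 s per superbatch seen on the GH200. With ~100M-sample superbatches instead, an 800-superbatch stage at the same speed would come out at roughly 13 to 14 hours, so still under a day, assuming the throughput holds and data loading does not become the limit.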
nnue-pytorch training speed is much slower due to all the skipping going on
it's a tradeoff for strength
im confused as to how it has an effect
in bullet you are either hard limited by data loading speed, or not limited at all
because all that is done asynchronously from training
so you're either waiting on batches to be sent or drawing from a pre-prepared queue with no delay
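A toy model of that setup using a bounded channel: the loader thread runs ahead of the trainer until the queue is full, so throughput ends up being whichever side is slower. This is a std-only sketch, not how bullet's loader is written; `run_pipeline` and `effective_rate` are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Producer/consumer with a bounded queue of `queue_depth` batches.
/// Returns the total number of samples the consumer saw.
fn run_pipeline(num_batches: usize, queue_depth: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u32>>(queue_depth);

    let loader = thread::spawn(move || {
        for i in 0..num_batches {
            let batch = vec![i as u32; 4]; // stand-in for a prepared batch
            // Blocks while the queue is full; stops if the trainer went away.
            if tx.send(batch).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the consumer's loop below.
    });

    let mut consumed = 0;
    for batch in rx {
        consumed += batch.len(); // stand-in for a training step
    }
    loader.join().unwrap();
    consumed
}

/// With fully asynchronous loading, the slower side sets the pace.
fn effective_rate(loader_pos_per_sec: f64, trainer_pos_per_sec: f64) -> f64 {
    loader_pos_per_sec.min(trainer_pos_per_sec)
}
```

In this model a loader that manages ~11.6M pos/sec does not slow down a trainer doing ~1.65M pos/sec at all, and filtering only costs training speed once it pushes the loader below the trainer.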
all i know is disabling nnue-pytorch piece count probability skipping makes training a lot faster, but also weaker
yes, skipping in nnue-pytorch is definitely slowing it down.
what cli args do i need to pass to disable fen skipping entirely?
can't disable piece count skipping with args. have to modify the source code
otherwise maybe --no-wld-fen-skipping and --random-fen-skipping 0 in addition
even on large nets?
i'm mostly trying to understand how the data loader would be implemented if its effect on speed isn't all or nothing
https://github.com/official-stockfish/nnue-pytorch/blob/master/training_data_loader.cpp#L925 i can just remove this i think
and maybe more of those https://github.com/official-stockfish/nnue-pytorch/blob/14124a0c9c6d70b25f46e5bbe443c1c97fd55fee/training_data_loader.cpp#L873-L887 ?
I'm trying to reconstruct from old messages here #nnue-dev message but I think we'd be using only 1 out of 15 fens or so.
some of those filters are also being done in bullet, and the random skipping and stuff can be disabled with cli args i think
188/6166 [00:54<28:45, 3.47it/s, loss=0. with defaults
187/6166 [00:56<30:19, 3.29it/s, loss=0.0128, v_num=7 with the piece count skipping removed and additional --no-smart-fen-skipping --no-wld-fen-skipping --random-fen-skipping 0 args
there's some variance between runs it seems
Needs a bit to warmup too
let me try a smaller net
on a very small net i can observe the training getting stuck cool
so i think it works how i expect it to
btw can someone test if https://github.com/sscg13/Stockfish/commit/6eafac4204bbbb06f47fb1bfadd3f912ce0a1c0e is a speed gain
on my laptop it's like ~5% for 4 thread speedtest
Result of 10 runs
==================
base (...wld025-sb120) = 977351 +/- 9978
test (...peedup-maybe) = 975058 +/- 10170
diff = -2292 +/- 13205
speedup = -0.0023
P(speedup > 0) = 0.3670
Hmm I guess it’s ~neutral then
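For reference, the summary lines follow from the two means. A small sketch, assuming the reported interval on the difference can be compared directly against zero; `SpeedSummary` and `summarise` are made-up names:

```rust
struct SpeedSummary {
    /// test minus base, in nodes per second
    diff: f64,
    /// diff relative to base
    speedup: f64,
    /// true when zero lies inside diff +/- diff_err
    neutral: bool,
}

fn summarise(base_nps: f64, test_nps: f64, diff_err: f64) -> SpeedSummary {
    let diff = test_nps - base_nps;
    SpeedSummary {
        diff,
        speedup: diff / base_nps,
        neutral: diff.abs() <= diff_err,
    }
}
```

Feeding in the rounded means from the output gives a diff of -2293 (the log's -2292 presumably comes from unrounded means) and a speedup of about -0.0023, well inside the +/- 13205 interval.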
you can also try running an STC on fishtest to see if a speedup is detected there
I mean all it does is use a pre-existing array to write the accumulator to and pass to L2
Instead of declaring a new one
So I’m not too keen on doing a whole STC
not even NNUE?
NNUE was long ago... not sure what happened back then. It was something like someone bolted NNUE to Stockfish and it won, then Stockfish decided to slap NNUE onto Stockfish.
But the time to first gain was probably really quick given that classical evaluation was comparatively weak compared to NNUE.
thats not even close
Really? I thought the NNUE vs classical was a cakewalk.
It was not as simple as slapping an NNUE onto Stockfish to get an instant elo gain?
it is not that simple
you must keep in mind that sf hce is still 100s of elo stronger than any other hce around
There's many things going on at once here (like bullet transition also) and the pipeline is complex after 5 years of optimisation
KANs are generally overrated
but yes, it could be tried
the advantage of this idea is that there already exists a strong network that indicated it could work (the monty value network)
and the implementation just requires training an NNUE version and working out how to best optimise the efficient updates
on the other hand there does not exist any KAN chess network of notable strength
KANs are generally unscalable and prone to overfit, but Stockfish has an abundance of data for the model size and so on.
That being said, I tried KAN training and it failed to even beat master at loss.
yes but obviously we would need a serious attempt at it to actually rule it out
your naivety is shown by comparing loss to master on the first networks you try to train...
secondly do you have a plan for UE or some ~equivalent with a KAN
a suggested architecture
etc
It was despite my network having a higher compute budget. I didn't completely rule it out, but I didn't deem it worth it.
Well, the first layer would be the same.
Only the second layer onward would be KAN.
That being said, I only used the old feature transformer, not trained a whole new net.
u sure about this?
I tried several epochs and it seemed to stagnate. I tried a bunch of pruning, re-initializing, etc. tricks.
What I found out is that most didn't even stay in the range where nontrivial behavior would occur.
So, it was approximately an MLP anyway.
Yeah... it was an amateur-ish attempt. I didn't clean the training data and so on, so don't take this as anything more than 2 cents.
btw when shawn's nnue refactor is merged I'll also have to do a major rebase on that (I might also try to figure out incremental attack tables during this, so it might take a while)
but I'll leave the current branch for testing
https://tests.stockfishchess.org/tests/view/67da5c508c7f315cc372a9d6 in the meanwhile we'll check LTC vs smallnet bc why not
I thought it would scale a bit more
Maybe not being multilayer plays a part in this idk
btw @formal smelt is there any estimate on how long it'll take to get bullet all sorted out for training sf nets
i have a repo here, https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/src/main.rs
loading the psqt values works i think but something else isn't converted entirely correct, i think the layer stack weights are in a different format but still couldnt figure out what else/maybe i did it wrong
bullet ordering is pnbrqkpnbrq but SF ordering is pnbrqpnbrqk for halfkav2_hm
is that it?
that shouldn't matter cause the indices are calculated from the type and color, https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/src/halfkav2_hm.rs#L59
ok now that refactor is here in the next week or so I'll try to rebase everything
if multilayer works though and a new net is trained I can still make quick updates on old branch
i think the affine transforms all expect i8 weights and i32 biases that might be a bit tricky
why would that be tricky ? bullet is already outputting that
oh huh
the other thing I can think of rn is that you also have to pad all the input dimensions to the next multiple of 32 for the affine transforms
well that is only needed for one layer right now.. and i do that otherwise it wouldnt even load into sf
yeah well ik
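The padding step can be sketched like this, assuming row-major weights with one row per output neuron (that layout is an assumption, as are the helper names):

```rust
/// Round `n` up to the next multiple of `multiple` (must be non-zero).
fn pad_to_multiple(n: usize, multiple: usize) -> usize {
    (n + multiple - 1) / multiple * multiple
}

/// Copy a `rows x cols` row-major i8 matrix into one whose row width is
/// padded to a multiple of 32, filling the extra columns with zeros.
fn pad_rows(weights: &[i8], rows: usize, cols: usize) -> Vec<i8> {
    assert_eq!(weights.len(), rows * cols);
    let padded = pad_to_multiple(cols, 32);
    let mut out = vec![0i8; rows * padded];
    for r in 0..rows {
        out[r * padded..r * padded + cols]
            .copy_from_slice(&weights[r * cols..(r + 1) * cols]);
    }
    out
}
```

Zero padding leaves the dot products unchanged, so only a layer whose input width is not already a multiple of 32 is affected.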
btw what are the hash values in the net intended to do
just some verification things that the loaded net is supported in this arch, not really important
btw have you tried checking if write_parameters returns the same file as what you inputted
no but i don't see how that wouldn't be the case
https://github.com/Disservin/sf-bullet-train/blob/quant-pst-old-correct-pst-values/convert_quantised_to_pytorch.py#L219
here the fc hashes are just hard coded
but if everything is being read correctly and inference is still not working that means it's more likely to be an issue with bullet no?
well reading it into an array is one thing, but the more important thing is that the layout of the weights for example is correct
oh wait are you saying the array sizes could match but the layout might not
yeah well the array sizes currently definitely match
so must be something in the layout which isn't in the way sf expects it
btw I find it funny how read_parameters is called on the activation functions even though they just return true
affine transform weights are written row major I think
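This is the kind of mismatch where the array sizes agree but the contents are in the wrong places. A minimal sketch of converting between the two layouts, purely illustrative since which layout each side actually uses is exactly the open question here (`row_major_to_col_major` is a made-up helper):

```rust
/// Reorder a row-major `rows x cols` matrix into column-major order
/// (equivalently: transpose it). The element count stays the same.
fn row_major_to_col_major<T: Copy>(data: &[T], rows: usize, cols: usize) -> Vec<T> {
    assert_eq!(data.len(), rows * cols);
    let mut out = Vec::with_capacity(data.len());
    for c in 0..cols {
        for r in 0..rows {
            out.push(data[r * cols + c]);
        }
    }
    out
}
```

A 2x3 matrix stored as `[1, 2, 3, 4, 5, 6]` becomes `[1, 4, 2, 5, 3, 6]`: same length, so a loader that only checks sizes would accept either one.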
well if you want to try and fix it, you can just give this repo a try, i can send you a bullet checkpoint as well if needed
ok sure
do you still have the bullet header info that describes how bullet outputs the weights
there's some info here #1351682122162634796 message and in the normal channel too i think
https://drive.google.com/file/d/1Rl5eLPIZlL31spzWMtGv8YQBwvyyyf9l/view?usp=sharing
here is the checkpoint you can load
eh why is it 1 GB
oh I see
can I also get the bullet config used to train this net
clone the linked repo
ah it has it ok
and checkout the quant-pst-old-correct-pst-values branch
and the quantised nets need to be converted using the python script python convert_quantised_to_pytorch.py ./checkpoints/halfkav2_hm-stm/test-80/quantised.bin bullet.nnue
@stray reef how did you end up inferencing threat inputs? Did you add incremental attack tables?
Also if you have the net somewhere I’ll try to get it to work so fixed nodes can be run
I have one version with and one version without incremental attack tables. The speedup was around 15-20%
Speedup at which L1 btw
I do have the net, it's not a SF arch though, I can send it when I'm at my pc again
Alright that’s fine I’ll try to get it to work
that number is from a (80624 -> 1024)x2 -> 1x8 net
Wait 15% speedup at that size is really good
It's still a lot slower than my master net though (1.4M nps vs 2.3M nps), and "only" +30 fixed nodes
Btw can you run your version vs my Stockfish branch (change the L1 to 1024) and lemme know the nps difference
i haven't yet looked at small threat inputs, maybe that has less updates, that could help
Simplified and full are ~ the same speed
rip
And we know the +30 fixed nodes elo approximately translates into similar STC diff
does that branch have inference for a single layer net already? and can you send the link?
Yes single layer with 8 output buckets
Btw linrock’s simplified threats (15776 -> 1024)x2 -> 1x8 is known to be around -80 fixed nodes to sf master
But there is still a lot of experimentation remaining
My current feeling is that the amount of extra feature updates makes it sort of infeasible. But I will try some multilayer architectures with small threat inputs, with an L1 that's similar speed to my master net, and see what happens.
Btw have you measured the number of feature updates
I get it’s ~ 8 per color per node in sf in the midgame
Though on the other hand, linrocks (80624->256)x2 -> 1x8 net matched my net at fixed nodes 😅
matched your 2048hl multilayer net?
dont have concrete numbers right now, only looked at some print statements where it was mostly 5-6 threat updates, but sometimes 20+
1536, but yes
that's insane
Maybe small hidden layers and multilayer are the way to go here
i also had a 384 HL net that was very half-assed and matched my master net too
Yeah 8 is an average count
If you run a long search with my branch and type eval it’ll tell you the total number of accumulator updates
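For context on why the number of feature updates per node matters, here is a minimal sketch of an incremental accumulator update versus a full refresh. The sizes, weights, and feature indices are hypothetical toy values; this is the generic NNUE idea, not the code from any of the branches mentioned here.

```python
# Toy accumulator: hidden size and feature count are hypothetical, not real net sizes.
HIDDEN = 4
NUM_FEATURES = 8
WEIGHTS = [[(f + 1) * (i + 1) for i in range(HIDDEN)] for f in range(NUM_FEATURES)]
BIAS = [0] * HIDDEN


def refresh(active):
    """Recompute the accumulator from scratch: one column add per active feature."""
    acc = list(BIAS)
    for f in active:
        for i in range(HIDDEN):
            acc[i] += WEIGHTS[f][i]
    return acc


def update(acc, added, removed):
    """Incremental update: cost scales with the number of changed features only."""
    new_acc = list(acc)
    for f in added:
        for i in range(HIDDEN):
            new_acc[i] += WEIGHTS[f][i]
    for f in removed:
        for i in range(HIDDEN):
            new_acc[i] -= WEIGHTS[f][i]
    return new_acc


before = {0, 3, 5}
after = {0, 5, 6, 7}
added, removed = after - before, before - after

incremental = update(refresh(before), added, removed)
from_scratch = refresh(after)
```

Each changed feature costs one pass over the L1 row, so an average of ~8 updates per color per node (the figure quoted above) is several times the work of a typical quiet move with plain piece-square inputs.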
already transposed and quantised (255, 64)
ah and the output bucket formula is the one from bullet, not (piececount - 1) / 4
looks like changing EvalFileDefaultNameBig and the output bucket formula isn't enough to get this working. i'll try again tomorrow i suppose
does this make plentychess a yukari clone? /j
(I will just continue memeing that attack tables and mailboxes are slow)
what engine is this network for?
i trained it for plentychess, but it could be used with any engine
You need to name it nn-12 digits of sha hash like the sf nets
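A small sketch of that naming scheme, assuming the hash is SHA-256 and the name uses the first 12 hex digits, as with the official SF nets. The helper name is made up for illustration.

```python
import hashlib


def net_file_name(data: bytes) -> str:
    """Build an SF-style net name from the file contents.

    Assumption: SHA-256 of the whole file, first 12 hex digits.
    """
    digest = hashlib.sha256(data).hexdigest()
    return "nn-" + digest[:12] + ".nnue"


# Usage with an in-memory placeholder instead of a real network file.
example_name = net_file_name(b"")
```

Renaming alone only satisfies the file name check; the net still has to match the architecture the engine was compiled for.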
How do I set it up for SF? It crashes when I just set the file location. Does it need some conversion or a rename?
it won't work with any official stockfish version or development build. I'm trying to figure out myself how to use it with stockfish
Ok. I will try with other strong engines
probably worded that previous message badly. this is an experimental network architecture. given the right modifications to the code, it could be used with any engine
but you won't find any engine in which that net will work with no modifications
I'm always looking for huge nets for AB engines
one good thing about these types of nets though, the lack of input buckets makes them incredibly fast to train
so I realized that for various reasons (e.g. quantization and weight types) I can't just compose existing nnue layers, meaning supporting this in sf is not as easy as I hoped. Since I already plan to rebase everything on the NNUE refactor and add incremental attack tables, I'll probably just wait for sf style multilayer to make its way to bullet
btw is bullet just (piececount - 2) / 4
I actually have (piececount - 1) / 4 rn lemme check if we're missing some elo bc of that
that looks like a fix.
I guess that explains why output buckets weren't so great compared to single layer initially
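To make the bucket mismatch concrete, a quick sketch comparing the two formulas mentioned above over all legal piece counts. The function names are made up; the formulas are the ones quoted in the chat.

```python
def bucket_minus_two(piece_count: int) -> int:
    """(piececount - 2) / 4, the formula attributed to bullet above."""
    return (piece_count - 2) // 4


def bucket_minus_one(piece_count: int) -> int:
    """(piececount - 1) / 4, the formula the engine side was using."""
    return (piece_count - 1) // 4


# Piece counts where training and inference would pick different output buckets.
mismatches = [pc for pc in range(2, 33)
              if bucket_minus_two(pc) != bucket_minus_one(pc)]
```

Both formulas map 2..32 pieces onto buckets 0..7, so nothing crashes; the engine just evaluates with a neighbouring bucket at 7 of the 31 possible piece counts.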
btw are these publicly accessible? I checked github and didn't find anything
(hoping to get some information about how to do incremental attack tables and also if there are any other optimizations I'm missing)
I will push my branches once they've been cleaned up properly, it's still a bit of a mess (maybe later today?). I looked at Yukari for the logic and transferred it to bitboards
I'm curious, what are your training settings?
just trained a net with these settings
const HIDDEN_SIZE: usize = 512;
const SCALE: f32 = 400.0;
fn main() {
    let mut trainer = TrainerBuilder::default()
        .optimiser(optimiser::AdamW)
        .loss_fn(Loss::SigmoidMSE)
        .input(ThreatInputsSimple)
        .output_buckets(outputs::MaterialCount::<8>)
        .feature_transformer(HIDDEN_SIZE)
        .activate(Activation::CReLU)
        .add_layer(16)
        .activate(Activation::SCReLU)
        .add_layer(32)
        .activate(Activation::SCReLU)
        .add_layer(1)
        .build();
    let start_epoch = 1;
    let experiment_name = "0087";
    let output_path = format!("/mnt/d/Chess Data/Selfgen/Training/{}", experiment_name);
    let settings = LocalSettings {
        threads: 4,
        output_directory: &output_path.as_str(),
        test_set: None,
        batch_queue_size: 512,
    };
    let data_loader: loader::DirectSequentialDataLoader =
        loader::DirectSequentialDataLoader::new(&["/mnt/d/Chess Data/Selfgen/interleaved.data"]);
    let schedule = TrainingSchedule {
        net_id: format!("net-{}", experiment_name).to_string(),
        eval_scale: SCALE,
        steps: TrainingSteps {
            batch_size: 16_384,
            batches_per_superbatch: 6104,
            start_superbatch: start_epoch,
            end_superbatch: 500,
        },
        wdl_scheduler: wdl::ConstantWDL { value: 0.5 },
        lr_scheduler: lr::CosineDecayLR {
            initial_lr: 0.001,
            final_lr: 0.001 * 0.3 * 0.3 * 0.3 * 0.3,
            final_superbatch: 500,
        },
        save_rate: 10,
    };
    trainer.set_optimiser_params(optimiser::AdamWParams {
        decay: 0.01,
        beta1: 0.9,
        beta2: 0.999,
        min_weight: -0.99,
        max_weight: 0.99,
    });
    trainer.run(&schedule, &settings, &data_loader);
}
apart from the arch, that pretty much matches the first stage of my master net (except 420 SBs instead of 500)
min_weight: -0.99,
max_weight: 0.99,
...huh.
scratch this. i must have messed up this test. maybe I accidentally tested my 384 vs linrocks 256. Or it's the difference between full and simple threat inputs.
Trained two simple threat inputs 512 L1 nets today (one single layer, one multilayer, aka -> 16 -> 32 -> 1) to see what L1 is needed to beat my master net at fixed nodes. The results were a bit disappointing.
512 with layers vs. main: Elo: -16.13 +/- 6.58, nElo: -23.63 +/- 9.63
512 single layer vs. main: Elo: -33.53 +/- 6.49, nElo: -50.06 +/- 9.63
I doubt that a 768 L1 net would be noticeably stronger than master at fixed nodes, so 896 or 1024 would be necessary at least (for my case, obviously). And for that I am really not sure if I can make it fast enough...
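For a sense of scale, a small sketch converting the Elo differences above into expected score with the standard logistic Elo model. The function names are made up, and this ignores nElo and the error bars.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for a given Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse of expected_score, valid for 0 < score < 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


score_layers = expected_score(-16.13)   # 512 with layers vs. main
score_single = expected_score(-33.53)   # 512 single layer vs. main
```

So the multilayer 512 net scores roughly 47.7% against main at fixed nodes and the single layer one roughly 45.2%.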
fixed nodes single layer full inputs should be ~30 elo better than simplified at the same L1 so a full threat input net could be noticeably stronger at L1=768
maybe it's the slowdown?
slowdown is like 15%
i mean the last time we did this it was 0 stc and 7 ltc
so like
yeah
it's negative stc now because the new L1=256 net is slightly better
which is more evidence that data work will be very important
technologov machine dropped a massive diff
idk it feels like threat inputs somehow always attract high residuals
I've already said why multiple times lol. It's cache effects because the net is being duplicated for each instance instead of shared
Memory accesses are far less predictable with threat inputs compared to king buckets, having a large portion of the net in cache is important
512 Vs 256 is not a 15% slowdown when using multiple instances with no sharing, this is all this is showing
In monty mmap was worth 40 Elo at SPRT conditions. And I wasn't even using hyperthreading like fishtest machines. And it wasn't even a threat input net (but it was non UE so still had cache issues). This stuff really matters a lot
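A minimal sketch of the sharing idea, assuming the weights live in a file that each engine instance maps read-only instead of copying into its own heap. The dummy file, its int16 layout, and the helper names are hypothetical; this is not how monty or SF actually load nets.

```python
import mmap
import os
import struct
import tempfile


def write_dummy_net(path, values):
    """Write little-endian int16 'weights' to a file (stand-in for a real net)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}h", *values))


def map_net(path):
    """Map the file read-only; the OS can back all such mappings with the same pages."""
    f = open(path, "rb")
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mapped


def read_weight(mapped, index):
    """Read one int16 weight straight out of the mapping."""
    return struct.unpack_from("<h", mapped, index * 2)[0]


with tempfile.TemporaryDirectory() as tmp:
    net_path = os.path.join(tmp, "dummy.nnue")
    write_dummy_net(net_path, [10, -20, 30, -40])

    # Two mappings stand in for two engine instances opening the same net.
    file_a, view_a = map_net(net_path)
    file_b, view_b = map_net(net_path)
    values_a = [read_weight(view_a, i) for i in range(4)]
    values_b = [read_weight(view_b, i) for i in range(4)]

    for handle in (view_a, view_b, file_a, file_b):
        handle.close()
```

With separate processes doing the same thing, the net occupies physical memory (and competes for cache) once rather than once per instance, which is the effect being described above.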
I always thought sharing would be absolutely necessary for this to pass. That's part of why testing at TC is a huge pain in the ass for now (and the results are useless, because the TC Elo diff from scaling L1 depends entirely on how optimized the speed is). I still want to see L1 scaling tested at fixed nodes. It's a far easier path to validate the idea: if an L1 1536 full threat net (with multilayer, output buckets) surpasses the master SF net, the idea is finally validated after all this time, and the speed can then be fully sorted out after
btw are the branches available now
since it looks like upstream sf nnue refactor will conclude soon I’m also going to start rebasing threat inputs again and hopefully end up with a much faster version
Have you considered trying this in Obsidian instead? Would be a lot easier since it's literally just swapping the regular inputs with the threat ones. It already uses bullet and lc0 data, no other variables are changing
@wide oasis is also up for helping with it I think
sure
Yeah. I mean all the code for it already exists in the branches (training and inference) it's mostly copy paste job
what is needed from my end?
Sure if gabe is willing to help
Note that if you copy paste my inference code you lose quite a bit of speed over incremental attack tables
But it’ll be good for fixed nodes
I’ll try and convert threat input code into some form suitable for Obsidian use, if you tell me what is needed
note that there is no multilayer threat input net yet right now
Hm? Can't it just use obsidian multilayer it's already there. I thought it's just an input swap
I meant the actual net
I got the GPUs if any training needs to take place after swapping the inputs in its config
gabe gonna have to talk with jw about how to do it
or you i guess
obsidian uses float in later layers though right?
Well it already trains multilayer with bullet. So whatever it does will work
@wide oasis what's your bullet training config?
And the Leela binpacks used
I'll just swap in the full threats and start training some nets I guess
ok I think I should be able to get inference (inputs -> accumulator) to work but it might take a few days
it's 4 files (im from phone rn)
isn't it quicker if I train the net with all the data ready?
Think we need a new baseline anyways, might as well simplify to one stage. Will need to be using more data I think for both for fair comparison
Sorry, didn't get around to it until now.
Uploaded the branches and nets for a single layer ((80624 -> 1024)x2 -> 1x8) and multi layer ((80624 -> 2048)x2 -> (16 -> 32 -> 1)x8) net now. They are +30 and +55 to my master net at fixed nodes, respectively, but far too slow even with the current UE impl.
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full
https://github.com/Yoshie2000/PlentyChess/tree/threat-inputs-full-layers
Changing to simplified threat inputs is super easily done in threat-inputs.h/cpp, though it seems I lost my small threat inputs nets and therefore can't test this with a working network at the moment
Hmm 1024 to 2048 and multilayer fixed nodes gain seems surprisingly low
How much data do you have
If we assume that a (768->N) net needs around N million positions of data then you would need around 50B positions to have a similar input saturation for (80624 -> 2048)
6B positions
I could take some older positions that are generated with less nodes, maybe that still works together on such a big net
that would put me at around 13B altogether
yeah I'm using smth like N*M*16K / (average number of features per position) to estimate the order of magnitude of data required
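A rough sketch of that heuristic, N*M*16K / (average number of features per position). Assumptions: "16K" is taken as 16,000, and since the average active feature count is not stated here, it is back-solved from the 50B figure rather than measured.

```python
K = 16_000  # assumption: "16K" in the heuristic read as 16,000


def positions_needed(num_inputs, hidden, avg_active_features):
    """Order-of-magnitude data estimate: inputs * hidden * 16K / avg active features."""
    return num_inputs * hidden * K / avg_active_features


def implied_avg_active(num_inputs, hidden, positions):
    """Back-solve the average active feature count implied by a data estimate."""
    return num_inputs * hidden * K / positions


# Which average feature count makes (80624 -> 2048) come out at ~50B positions?
implied = implied_avg_active(80624, 2048, 50e9)

# Same feature count, smaller L1: the estimate scales linearly with hidden size.
estimate_512 = positions_needed(80624, 512, implied)
```

Under these assumptions the 50B figure corresponds to roughly 53 active features per position, and an L1 of 512 would land around 12.5B, which is in the range of the ~13B mentioned above.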
50B is definitely not feasible for me, that's for sure. And I think that goes for all selfgenners (apart from maybe Jay)
I'll test (fixed nodes) the 1024 -> 1x8 net vs linrock's 512 which I still have somewhere
yeah the thing about threat inputs is if it works it becomes another advantage to leelers
Small threat inputs are less data starved, but they are not faster so they're not really useful
for your impl approximately when do you start to see significant slowdown
what L1?
yeah
in mine 256 -> 512 is around 15% slowdown
but it was quite slow to begin with
btw does 0084.bin have the bullet padding trimmed at the end
mh I tested a 512 L1 morelayers net that was slower than master and worse at fixed nodes. I haven't tested the incremental speed differences yet
how fast is sf master (or 17.1), your master and 512 L1 (in nps)
yes, I process bullets raw.bin in a way so just the weights are included in 0084.bin, no padding, and already aligned to 64 bytes. See https://github.com/Yoshie2000/PlentyChess/blob/threat-inputs-full/tools/process_net.cpp
this makes generating verbatim nets also very easy
ok I wasn't sure if the 0000 bytes at the end were padding or part of the net
thanks
in my impl 512 L1 is around 15% slower than sf master but my impl has over 1/4 of the runtime lost to overhead with threat inputs
btw are the output buckets transposed
I think they are? from reading process_net.cpp
and what is the quantization
bc I'm getting
info string NNUE evaluation using nn-239c9dddf51e.nnue (157MiB, (80624, 1024, 1))
info depth 1 seldepth 2 multipv 1 score cp 931 nodes 20 nps 20000 hashfull 0 tbhits 0 time 1 pv h2h4
info depth 2 seldepth 3 multipv 1 score cp 330 nodes 103 nps 51500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 3 seldepth 4 multipv 1 score cp 586 nodes 127 nps 63500 hashfull 0 tbhits 0 time 2 pv b1c3 a7a6
info depth 4 seldepth 5 multipv 1 score cp 679 nodes 239 nps 79666 hashfull 0 tbhits 0 time 3 pv b1a3 a7a6
info depth 5 seldepth 6 multipv 1 score cp 757 nodes 341 nps 113666 hashfull 0 tbhits 0 time 3 pv b1a3 b7b6 h2h4 a7a6
info depth 6 seldepth 7 multipv 1 score cp 420 nodes 457 nps 114250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3
info depth 7 seldepth 8 multipv 1 score cp 396 nodes 485 nps 121250 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 b7b6 f2f3
info depth 8 seldepth 8 multipv 1 score cp 22 nodes 614 nps 153500 hashfull 0 tbhits 0 time 4 pv g1h3 a7a6 b1a3 c7c6 b2b3 c6c5 f2f3
info depth 9 seldepth 9 multipv 1 score cp 115 nodes 670 nps 134000 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6
info depth 10 seldepth 12 multipv 1 score cp 225 nodes 711 nps 142200 hashfull 0 tbhits 0 time 5 pv d2d3 a7a6 h2h4 a6a5 g1h3 c7c6 b1c3 c6c5
info depth 11 seldepth 22 multipv 1 score cp 216 nodes 3113 nps 283000 hashfull 2 tbhits 0 time 11 pv d2d3 a7a6 f2f4 c7c6 h2h4 d8a5 d1d2 h7h5 d2a5 g8h6 a5h5
info depth 12 seldepth 18 multipv 1 score cp 251 nodes 14188 nps 405371 hashfull 9 tbhits 0 time 35 pv e2e3 a7a6 d2d4 d7d6 f1a6 b8a6 c2c3 g7g6 d1d3 a6b4 d3b5 c7c6
info depth 13 seldepth 15 multipv 1 score cp 128 nodes 21150 nps 406730 hashfull 10 tbhits 0 time 52 pv e2e3 a7a6 d2d4 c7c6 b1c3 e7e6 d1d3 g8f6 d3c4 h7h5
info depth 14 seldepth 26 multipv 1 score cp 45 nodes 175351 nps 475205 hashfull 79 tbhits 0 time 369 pv c2c3 d7d5 e2e4 e7e5 a2a3 f8b4 g1e2 c8e6 e2f4 e5f4 a3b4 b8d7
which seems very wrong
they are not transposed in 0084.bin, but they will be transposed before being baked into the engine in transposePermuteNetwork(). If you compile normally, there will be a temporary processed.bin which is the file that's baked into the engine
ohh
input quant 510, l1 quant 64
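A minimal sketch of why the quantisation constants matter when porting a net between engines. The factors 510 and 64 are the ones quoted above; the 16-bit storage width and the helper names are assumptions for illustration only, not the actual PlentyChess or SF code.

```python
def quantise(value: float, q: int, bits: int = 16) -> int:
    """Scale a float weight by q and round; bits=16 is an assumed storage width."""
    limit = 2 ** (bits - 1) - 1
    scaled = round(value * q)
    if abs(scaled) > limit:
        raise OverflowError(f"{value} * {q} does not fit in {bits} bits")
    return scaled


def dequantise(stored: int, q: int) -> float:
    """What a reader recovers when it assumes quantisation factor q."""
    return stored / q


INPUT_Q = 510   # "input quant 510" from the chat
L1_Q = 64       # "l1 quant 64" from the chat

stored_input = quantise(0.5, INPUT_Q)        # written with the real factor
right = dequantise(stored_input, INPUT_Q)    # reader uses the same factor
wrong = dequantise(stored_input, 255)        # reader assumes 255 instead
stored_l1 = quantise(0.99, L1_Q)             # the clipped max weight from the config
```

A reader that assumes 255 where the file was written with 510 sees every accumulator value doubled, which is one plausible way to end up with startpos scores in the hundreds of centipawns.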
I don't have a 512 net right now, it got lost somehow. 0084.bin is 1.45M nps, SF master is 1.6M nps on my machine
hmm -10% to master
plenty is 2.1M nps on this machine. so that's quite hard to get close to
for comparison on my machine sf 1024 is -33% to master
ok sounds like my impl is faster then. that adds up, since I gained quite a lot of speed from the incremental threat calculations
I guess an L1 512, full threat inputs morelayers with 13B positions should beat my master net at fixed nodes. But it'll still be relatively slow. And there's also an issue in bullet where threat input nets with morelayers and pairwise produce only dead nets on init... perhaps @formal smelt has some idea what would have to be done to improve that, since pairwise is quite an important speedup
i'm gonna sign off for today though
ok yeah