#UE Threat Inputs for AB
the profiling is posted somewhere. dont know the search terms to find but someone should
and are we still maintaining two accumulators, one for each side/using king buckets?
yes
two accumulators
well actually 4 now
2 for each side
one tracks king bucketed part
the other tracks threats
dumb question, could you combine them
around 5% last time profiling was run, maybe someone could run again with latest branch
actually where are they combined lol
this actually loses, because then every time you move the king you need to refresh the entire accumulator
ohh good point
at the inference call
another idea, probably minor
since we never evaluate in check
we can ignore threats that would imply a check
but this is odd because it involves color
does that mean we have 4 accumulators of size 1024, each according to one perspective?
yeah, but really think of it as separating the big accumulator
e.g. threats + halfkav2_hm
so there is an accumulator which tracks all threat feature contributions
and there is an accumulator which tracks all halfkav2_hm feature contributions (this one also has the biases)
and the "true" accumulator is the sum of these two
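A minimal sketch of the split described above, using tiny made-up accumulators rather than the real 1024-wide ones. The names `refresh` and `combine` and all weight values are hypothetical; the only things taken from the thread are that the king-bucketed part carries the biases, that the "true" accumulator is the sum of the two parts, and that a king move forces a full refresh of the bucketed part.

```python
# Hypothetical sketch: tiny accumulators instead of the real 1024-wide ones.
L1 = 4

def refresh(weights, active, bias=None):
    """Rebuild an accumulator from scratch from a set of active feature indices."""
    acc = list(bias) if bias is not None else [0] * L1
    for f in active:
        for i in range(L1):
            acc[i] += weights[f][i]
    return acc

def combine(psq_acc, threat_acc):
    """The "true" accumulator is the element-wise sum of the two parts."""
    return [p + t for p, t in zip(psq_acc, threat_acc)]

# Made-up weights: 3 king-bucketed (halfkav2_hm-style) features, 2 threat features.
psq_w = [[1, 2, 3, 4], [10, 20, 30, 40], [5, 5, 5, 5]]
threat_w = [[100, 0, 0, 0], [0, 100, 0, 0]]
bias = [7, 7, 7, 7]

psq_acc = refresh(psq_w, {0, 1}, bias)   # biases live in the bucketed part only
threat_acc = refresh(threat_w, {1})
combined = combine(psq_acc, threat_acc)

# A king move changes the bucket, so the bucketed part is rebuilt from scratch.
# The threat part is not bucketed, so it does not need a full refresh here
# (a real engine would still update any threats that actually changed).
psq_acc_after_king_move = refresh(psq_w, {2}, bias)
combined_after = combine(psq_acc_after_king_move, threat_acc)
```

If both parts lived in one bucketed accumulator, the king move would force the threat contributions to be rebuilt as well, which is the cost mentioned earlier.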
very unsure how much this would save
requires a trainer change
so is much more involved
threats implying a check is what, king attacks?
yeah, like if a piece of opposite color attacks a king
I think so...
but also the king attacks that piece?
or just one way
no i think as long as the king is attacked by a piece of the opposite color
we are in check
applies to both kings i think
I see
actually it might not
now that I think of it
since those features are actually never used anyways
it is def minor though
😩
Looks like I have yet another cursed speedup idea.
I mean... I still do regular patches, but speedups are needed for threat inputs.
^_^
Curious to see how far the speedups can go
why are we always updating threats after do_move when a large part of the time we return immediately
Do you report a speedup on mine?
One potentially major improvement is to defer the threat calculation to eval time, but this is a rather involved change
No, the new one.
oh
If we can bandaid easy things for now that would also be good in the short term
can you show me the source that showed threat input calculations is 5% before I go about this
which new one
Basically, I loaded four cache lines to force the prefetcher to prefetch.
And then come back later to finish.
This was from a week ago, right is the newer one I think
powerful lmao
I've never gotten prefetching to help
Not sure if it helps either.
In other news stc smp indeed concluded at ~neutral with master this time in 10k games
If nothing else I guess wait for a week of speedups while the double length net training runs and see where we are then
@rare jacinth
usual caveat that my computer is weird applies, but here's threat_inputs at the moment
Oh, your computer isn't weird... it's ahead of its time.
I'll try it out in a bit
what did you run? speedtest multithreaded or bench single threaded?
I run speedtest 16 threads.
And this latest one reported a slight speedup.
Though my PC ain't very reliable at speedtest. There are background processes and so on.
I'm not seeing set_check_info for some reason even though that's at least 1% slowdown from my understanding
it probably got inlined
that's my local result.. on speedtest
and here 32 concurrent single threaded speedtests
quite a difference..
shared-memory patch 😭
Well... here's a thing. I was desperately looking for speedups so we could push threat_inputs so maybe some ideas weren't working.
@prime mica
But well... here's something. If you can investigate why my monomorphization patch speeds up massively on some machines but not others, maybe we can find a way?
I’m really not sure
I think it might simply be the memory bandwidth issue
And threat inputs weights being big
It’s something that’s fundamentally impossible to tune for
I think mmap
Hmm? WDYM?
sharing the net across instances
will be a gainer for master but a greater gainer for threat inputs cuz of fatter net
We could try doing threat inputs STC + shared memory vs master with shared memory
If it works then could catalyze the shared memory branch to be pushed over the finish line
I had a look, it is probably still fairly easy to rebase the mmap branch on the threats branch tbh.
Huzzah
so good enough for testing.
but the mmap branch still needs some work..
it is a bit of a beast in itself.
Lol
threats+mmap seems to beat master (without mmap) in a quick and dirty test...
(not entirely fair, obviously)
How's the new prefetch speedtest?
Interesting, seems like append_changed_indices disappeared from the hot spots
Actually this is strange, it used to take 5% of runtime, I don’t believe now it takes <1%
That would be too good to be true
maybe the compiler realized it could be inlined ...
yeah it probably got inlined
seems to make no difference 😩
https://tests.stockfishchess.org/tests/live_elo/68fc8e33637acd2a11e72dad wow this is failing hard
despite it being a 2% speedup locally
"failing hard"
yes
fail high, fail low, fail hard
I mean if it were a 2% speedup across the board it should pass STC quite quickly
Fishtest the dream crusher
I'ma put up shared-memory vs. threat inputs + shared-memory on fishtest
unless someone's done that already or has objections
feel free, if you can get it to work
But well, was that 2% even less than what I gave?
i-cache problem?
yeah potench
I was working off of the tip of threat_inputs rather than urs tho
curious how your approach would work when applied to the current
Well... specializing everything is not that good.
It's a niche optimization, not something to be broadly applied.
This ruins i-cache like crazy.
agreed
the aggressive unrolling doesn't help
I actually don't think unrolling the fused thing is ever helpful
Have you got the fishtest link yet? 🙂
What? These two speedup patches seem to both not work on their own but combined they work together?
https://tests.stockfishchess.org/tests/view/68fcc468637acd2a11e72df2
is this your first time using fishtest
don't jinx
but also
please use pyshbench or smth to double check this
One of the patches works massively well on anematode's machine while the other doesn't work. Mine seems to report a marginal speedup on the prefetch one.
I mean the combined patch
Seems to have a marginal speedup.
No, but in local testing:
   # PLAYER                            :  RATING  ERROR    POINTS  PLAYED  (%)
   1 mmap-master                       :    19.3    1.2  155153.5  294912   53
   2 mmap-1024-nn-26b0e5126117.nnue    :     9.1    1.2  148825.5  294912   50
   3 master                            :     0.0   ----  109242.0  221184   49
   4 1024-nn-26b0e5126117.nnue         :   -15.4    1.3  102875.0  221184   47
-10 with mmap in your machine is better than the normal -15 i guess
Alright here we go.
i8 feature weights (but only the threat weights)
Fixed nodes:
Elo | -2.52 +- 4.09 (95%)
Conf | N=20000 Threads=1 Hash=16MB
Games | N: 10084 W: 2811 L: 2884 D: 4389
Penta | [153, 1260, 2291, 1183, 155]
https://furybench.com/test/3531/
STC:
Elo | 11.81 +- 4.29 (95%)
SPRT | 8.0+0.08s Threads=1 Hash=16MB
LLR | 2.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6212 W: 1627 L: 1416 D: 3169
Penta | [6, 630, 1632, 823, 15]
(this is with QA=255, clamped into [-128, 127], about 3000 or so of 50M weights are affected by this, no QAT, no special clamping during training)
for sf one of the more pressing concerns is that weights are stored as x2 internally
i have not found a good way to bypass this
either we do two shifts, or perform a double add when combining the accumulators

anyways if someone who is better at simd would like to do this in sf
- take the current net
- read in the threat weights as normal i16s (important: don't multiply by x2)
- clamp them to i8s
- unpack i8 simd vectors into i16s
- deal with the x2 somehow on inference
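The steps above can be sketched in scalar form. This is only an illustration under stated assumptions: the weight values are made up, `clamp_to_i8` is a hypothetical helper, and the real code would do this with SIMD vectors. It shows the clamp, the count of affected weights, and the two options mentioned for the x2 problem (a double add, or one extra shift) giving the same result.

```python
I8_MIN, I8_MAX = -128, 127

def clamp_to_i8(weights_i16):
    """Clamp i16 threat weights into i8 range and count how many changed."""
    clamped = [max(I8_MIN, min(I8_MAX, w)) for w in weights_i16]
    affected = sum(1 for a, b in zip(weights_i16, clamped) if a != b)
    return clamped, affected

# Made-up threat weights, read as plain i16 (not multiplied by 2).
threat_w_i16 = [-300, -128, -5, 0, 90, 127, 200]
threat_w_i8, affected = clamp_to_i8(threat_w_i16)

# Assumption: the psq accumulator holds x2 values, the threat accumulator x1 values.
psq_acc_x2 = [2 * v for v in [10, -20, 30]]
threat_acc_x1 = [4, -7, 127]

# Option A: add the threat accumulator twice when combining.
double_add = [p + t + t for p, t in zip(psq_acc_x2, threat_acc_x1)]

# Option B: one extra left shift on the threat accumulator, then a single add.
shifted = [p + (t << 1) for p, t in zip(psq_acc_x2, threat_acc_x1)]
```

Both options cost one extra operation per vector on the combine path, which is why neither looks ideal.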
https://github.com/official-stockfish/Stockfish/commit/c6a1e7fd4232ec151206fab16cb7daa23bfd7137 the x2 thing is barely a speedup anyways @rocky vigil
what's the issue with x2 now?
clamping threat weights to i8
basically because we were never bounded by size of weights we can x2 freely
but now since we want to clamp it to i8 we cant do that
this is actually gg btw
clamp it to i8/2 
naively, x2 can be dealt with by either a double add on combining accumulators, or by introducing an additional shift
both seem not ideal
^^ this thing
we can just QA=127, clamped to [-64, 63]
or we can QA=255, clamped to [-128, 127] and elide the *2 during load altogether
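A small worked comparison of the two options, as a sketch. `i8_budget` is a hypothetical helper; `load_scale` stands for the x2 applied at load time (2 when it is kept, 1 when it is elided). It checks that the stored integers fit in i8 either way and that both choices cover roughly the same float range of about ±0.5.

```python
def i8_budget(qa, lo, hi, load_scale):
    """Return (stored integer range, fits in i8?, float range represented)."""
    stored = (lo * load_scale, hi * load_scale)
    fits_i8 = -128 <= stored[0] and stored[1] <= 127
    float_range = (lo / qa, hi / qa)
    return stored, fits_i8, float_range

# QA=127, clamp to [-64, 63], keep the x2 at load.
option_a = i8_budget(127, -64, 63, 2)

# QA=255, clamp to [-128, 127], elide the x2 at load.
option_b = i8_budget(255, -128, 127, 1)
```

The second option has twice the resolution within the same float range, at the price of a different QA on the training side.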
doesn't that require a training change?
true
if we request a new training run let's also wait to figure out weight clipping then
i'm sure retraining plenty net with weight clipping is also worth some additional Elo
is there a reason why 127 was done previously?
instead of 255
no retrain
that patch was written right before linrock quit I think
one of his last experiments I told him to try 255
but it never saw the light of day
ah
currently vondele is training a 1024 factorized net with 1600 epochs/stage.
It will be ready ~Friday 31st October.
https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/pipelines/2119342422
after recent developments it might even be outdated before it's done lol
Elo is Elo
no like with the i8 quant
eh
we'll see
it'll probably take a few days to get stuff sorted out
this could be an extra +10 stc/+5 ltc
huzzah
LTC seems to struggle a bit. I think this is due to worker distribution, the GCP workers seem to benefit less from this (either due to avx2 or due to different memory behaviour)
I will repeat both STC+LTC on my own machine to get comparable numbers
google cloud platform
these are 8 cores of epyc 7b13 per worker
oh google cloud
that would be awesome.
first create an account on https://furybench.com/, then when @hollow crystal approves it clone https://github.com/aronpetko/OpenBench and run Client/client.py
so if we have a branch for the threat net that can be used, somebody just PR the changed recipe to the nettest repo and I can start the CI run?
yes, a retrain of the threat net is easy at this point, see message above
it's very strange, fury's 7950X3D is also mimicking the STC +11 LTC +3
possibly the +11 is a fluke
i guess putting more games into this to get smaller error bars is worth it
The +11 is definitely too high
But even then tbh. If it comes in lower how does it help? If the LTC passes there is no way to know it regresses at higher TCs
i'd say if it has the "standard" speedup scaling from STC to LTC it's pretty safe to say it's not gonna go negative at any higher TC
The fixed nodes loss would suggest it's pretty safe to say regardless I think
alright. yeah i guess if monty got +16 with L1=3072 and also doing psq weights as i8, then +11 is indeed unrealistic
Monty also doesn't have UE which I assume would inflate the Elo gain
btw, +3 LTC is still very good
even if it's "only" like +6 STC
update on this, overestimated, thinking +6 stc / +3 ltc now
yeah 6STC/3LTC is still amazing
with i8 threat weights it might make sense to lay the weights out slightly differently
by interleaving two accumulator registers worth of weights
so imagine
i8 weights[128];
->
i16 weights[64]; // weights[0], weights[64], weights[1], weights[65], ...
ehh but an index from black pov is diff from white pov
wdym?
oh that kind
yeah
I meant for this
I have no idea how we currently impl the i8 stuff btw
we dont
yeah that is just extra inefficient lol
ah so you want to fuse the updates for both POVs essentially?
oh no
like currently you have weightsVec which I assume is half the size of inputVec
but you could have something like
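```
for (int i = 0; i < L1_ITERATIONS; i += 2) {
    // one full-width load covers the i8 weights for two accumulator registers,
    // so the packed array is indexed at half the rate of inputVec/outputVec
    VecI16 goose = weightsVecPacked[i / 2];
    // upper byte of each 16-bit lane (arithmetic shift keeps the sign)
    outputVec[i] = addEpi16(inputVec[i], shiftRightEpi16(goose, 8));
    // lower byte of each 16-bit lane, sign-extended to 16 bits
    outputVec[i + 1] = addEpi16(inputVec[i + 1], sextEpi8to16(goose));
}
```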
idk exactly what you'd use for sextEpi8to16 but the point is you sign extend the lower 8 bits of each 16-bit pair to the full thingy
maybe our old friend maddubs or whatever
then it would be the same # of computational instructions but you could use full-width loads rather than two half-width loads
there might be something better ofc
I see, that sounds good, let's see if we find something good for extracting the lower i8
why don't we just & 0xFF
because then we'll be adding it as if it's unsigned 😩
right right i forgot the sign bit is in the wrong place
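A scalar sketch of why the mask fails, emulating a single 16-bit lane in plain Python. The helper names (`pack`, `high_byte`, `low_byte_masked`, `low_byte_sext`) and the byte order inside the lane are assumptions for illustration, not the branch's actual layout.

```python
def to_i16(x):
    """Interpret the low 16 bits of x as a signed 16-bit lane."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pack(lo_i8, hi_i8):
    """Pack two signed bytes into one lane: lo in bits 0-7, hi in bits 8-15."""
    return to_i16(((hi_i8 & 0xFF) << 8) | (lo_i8 & 0xFF))

def high_byte(lane):
    """Arithmetic shift right by 8: the upper byte, sign included."""
    return lane >> 8  # Python's >> on negative ints is arithmetic

def low_byte_masked(lane):
    """The tempting '& 0xFF': yields the lower byte as if it were unsigned."""
    return lane & 0xFF

def low_byte_sext(lane):
    """Shift left by 8 (wrapping to 16 bits), then arithmetic shift right by 8."""
    return to_i16(lane << 8) >> 8

lane = pack(-3, 5)
masked = low_byte_masked(lane)   # 253, not -3
correct = low_byte_sext(lane)    # -3
```

The mask only agrees with the sign-extended value when the lower weight is non-negative.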
but yeah maddubs with set1_epi16(0x00_01) would work and it's one instruction
maddubs also seems to have lower throughput than cvtepi8_epi16 according to intel docs (i hope i understand throughput correctly in this context)
O
where does it say that?
do u mean this column?
CPI = cycles per instruction so lower is better
1 means 1 per clock cycle, 0.5 means 2 per clock cycle, etc.
_mm512_maddubs_epi16 has 0.5 except for the first row
not sure how this translates to the most recent architectures and amd but hey
yum
got it to work with _mm512_srai_epi16(_mm512_slli_epi16(x, 8), 8) instead of maddubs now, with maddubs it'd require masking the upper 8 bits away still
unless i am again misunderstanding something
oh why
the semantics of maddubs are (iirc)
dst[i] = (u8)src1[2*i] * (i8)src2[2*i] + (u8)src1[2*i+1] * (i8)src2[2*i+1]
maybe u need to flip the order
the weights need to come second
_mm_maddubs_epi16(_mm_set1_epi16(1), weights)
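A scalar emulation of one output lane, as a sketch of why the operand order matters. `maddubs_lane` is a hypothetical helper that follows the documented behaviour of the instruction (first operand read as unsigned bytes, second as signed bytes, pairwise sum saturated to i16); `ONE` is the byte pair of a 16-bit lane holding 1, assuming little-endian byte order.

```python
def s8(v):
    """Interpret the low 8 bits of v as a signed byte."""
    v &= 0xFF
    return v - 256 if v & 0x80 else v

def sat16(x):
    """Saturate to the signed 16-bit range."""
    return max(-32768, min(32767, x))

def maddubs_lane(a_bytes, b_bytes):
    """One 16-bit lane of maddubs: a is unsigned bytes, b is signed bytes."""
    a0, a1 = a_bytes[0] & 0xFF, a_bytes[1] & 0xFF
    return sat16(a0 * s8(b_bytes[0]) + a1 * s8(b_bytes[1]))

ONE = (1, 0)  # low byte 1, high byte 0

weights = (-3, 5)  # (lower i8, upper i8) of one packed lane, made-up values
weights_second = maddubs_lane(ONE, weights)  # 1 * -3 + 0 * 5
weights_first = maddubs_lane(weights, ONE)   # 253 * 1 + 5 * 0
```

With the weights as the second operand the upper byte is multiplied by zero, so no extra mask is needed to drop it.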
hm not sure what i'm doing wrong
give me a minute
ah i was only working in add, forgot to modify sub and addsub 
got it now
how much sense do you think it makes to test this on zen5?
I think it'll exaggerate the benefit compared to other architectures
but worth a shot
does the weight interleaving make any difference on ur computer?
when running bench 15x through hyperfine it's within error pretty much (this is a 7900X)
i'm also gonna run a VSTC on furybench
yeah the VSTC failed
😩
But well, on the second thought, wouldn't it be a good idea?
I mean... if it passes VVLTC ofc.
VVLTC is what Stockfish aims for no?
A few seconds per move on an 8-thread hardware sounds reasonable.
But 72t 10s is probably about VVLTC.
60+0.6 8th is probably less than what most people use SF for. 10+0.1 72th maybe not...
100MB seems like nothing nowadays. Games nowadays are in megabytes.
at this point most people probably use a webassembly-based stockfish lol
Roughly equivalent.
true actually
What is the current state? It seems like there was some jumping to 126rrr3-i8t-wide but I thought that didn’t work?
0126rrr3-i8t-wide == 0126rrr3-i8t, except the branch i8weights-threatonly-wideload processes the net differently during compilation, in order to use full loads (e.g. __m512i on avx512 instead of __m256i). but that was slower, at least in my impl
i have now merged this https://furybench.com/test/3545/
i sort of used rrr6 on accident initially, it was neutral against rrr3, which is why i re-tested later with rrr3 before merging that
https://github.com/linrock/nnue-pytorch/commit/f131f3dade86c05e8a8f6a008eb550bca02ccc62
if there's interest in QA=255, i confirmed this nnue-pytorch commit with cj's patch works for training
and retraining the final stage at the time with QA=255 with the same dataset led to about +1 elo
so i'd expect it to be slightly stronger for the main net. using 255 for the smallnet was slight negative or neutral
https://tests.stockfishchess.org/tests/view/671476f686d5ee47d953c8e7
QA=255 inference diff, and QA=255 smallnet (nn-aa2736ae40b1.nnue)
ah how is training with clipping / QAT
I'll check it out, see if I can get it to work
haven't tested it yet
what we are trying to do is use QA=255 and also clamp certain weights to i8 range
so that way we don't have to use the x2 trick
https://tests.stockfishchess.org/tests/view/68fc3184637acd2a11e72d4e
Elo: -0.45 ± 3.4 (95%) LOS: 39.8%
Total: 10000 W: 2599 L: 2612 D: 4789
Ptnml(0-2): 30, 1165, 2616, 1166, 23
nElo: -0.90 ± 6.8 (95%) PairsRatio: 0.99
is this the latest measurement of threats vs. master?
if it's already neutral at 5+0.05 th 8, why not try 60+0.6 th 8?
it could already be better than master
still waiting on a couple of new tricks really
if nothing else there's a 2x length training run ongoing to see if we can squeeze 1 more elo
anyways I'm really happy that we can get it neutral at stc smp without spsa
alright, i'm guessing it's already better than master in its current state at vltc smp
also QA=255 alone should be at least +1 elo
yeah it looks like the stc sprt elo is off, bc the stc progtest against obsidian has barely gone anywhere?
that or error bars
the progtest against plenty 7 is looking great tho. not sure why it doesn't translate against obsidian
strange
although funny it looks like the "scaling" of threat inputs is mostly because it's slower
https://tests.stockfishchess.org/tests/view/68fcc468637acd2a11e72df2 why not let the sprt finish
7742 also has 256mb L3 cache, dunno if that's relevant
yeah, if you move from not fitting into L3 to fully fitting into L3 such things could happen?
Damn, i hope this translates to CCC and TCEC 
Another big gainer found, this time a training improvement.
STC
Elo | 5.31 +- 2.81 (95%)
SPRT | 8.0+0.08s Threads=1 Hash=16MB
LLR | 2.91 (-2.25, 2.89) [0.00, 2.50]
Games | N: 15308 W: 3911 L: 3677 D: 7720
Penta | [36, 1717, 3928, 1923, 50]
https://furybench.com/test/3557/
LTC
Elo | 6.90 +- 3.89 (95%)
SPRT | 40.0+0.40s Threads=1 Hash=64MB
LLR | 1.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6850 W: 1747 L: 1611 D: 3492
Penta | [2, 688, 1913, 816, 6]
https://furybench.com/test/3558/
I currently have a 4 stage training setup, with SBs 300, 300, 400, 300, and WDL 0.15, 0.3, 0.6, 1.0.
I figured for threat inputs less/longer stages with less LR jumping might be beneficial, so I switched to 2 stages with SBs 1000, 400 and WDL 0.15->0.6, 1.0 for this new network
the 0.3 WDL stage used old and partially bugged 5ksn data, that's gone now, it's all just 20ksn, 20ksn adversarial and 5ksn adversarial data thrown together. and the score is not given by search, but by the chonker (768x12 -> 4096)x2 -> (96 -> 192 -> 192 -> 1)x8 net (which was previously done in the first stage as well)
i previously tested simpler ways of merging the early stages, as well as increasing the length of the last stage to 400 SBs, and that all failed on its own
What is SB here?
superbatch, i.e. 100M positions
And these values for WDL?
0.0 WDL means training purely on score, 1.0 WDL means training purely on game result. Everything else is a linear blend
And by 0.15->0.6 I mean that it scales up WDL linearly during the 1000 SBs, instead of keeping WDL constant
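A minimal sketch of the blend and the linear ramp just described. The function names are hypothetical, and two details are assumptions: that both targets are expressed on the same 0..1 scale, and that the ramp reaches its end value exactly on the last superbatch.

```python
SB_POSITIONS = 100_000_000  # one superbatch, as stated above

def wdl_at(sb, total_sbs, start, end):
    """WDL weight at 0-based superbatch index sb, interpolated linearly."""
    if total_sbs <= 1:
        return end
    t = sb / (total_sbs - 1)
    return start + (end - start) * t

def blended_target(score_target, game_result, wdl):
    """0.0 trains purely on score, 1.0 purely on game result, linear in between."""
    return (1.0 - wdl) * score_target + wdl * game_result

# Two-stage schedule from the thread: 1000 SBs ramping 0.15 -> 0.6, then 400 SBs at 1.0.
first_stage_start = wdl_at(0, 1000, 0.15, 0.6)
first_stage_end = wdl_at(999, 1000, 0.15, 0.6)
total_positions = (1000 + 400) * SB_POSITIONS
```

For example, `blended_target(0.7, 1.0, 0.15)` weights the score-based target at 85% and the game result at 15%.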
I can do some work on this if anyone has a machine with strong gpus I can ssh into, I’ve been asked not to use the leela gpus for stockfish related stuff
@violet badger
no ssh access, but one can specify this in a recipe at nettest, and the CI pipeline will execute? So fork the repo, adjust recipe according to your liking, PR, and wait and see.
now, I must say that I tried optimizing these values, so there is probably no low-hanging fruit. However, I'm not ruling it out.
About the moves with no threat changes
I feel like they tend to be endgame positions?
But yeah it definitely feels like there should be a bit to gain from handling the empty case like a null move
I wonder if this (attkr ^ attkd) == 8 is much faster somehow
I think the most important thing right now is to look in the int8 threats..
If there is a branch with the updated trainer, the training should be kicked off asap, in parallel to the little experiment that is running on longer training
Should we try to pass a 1280 net at VVLTC?
no, like that's absolutely useless IMO
I mean, Stockfish optimizes for around VVLTC right?
why?
Because it's where real uses come.
no, that's somehow a misconception
And 100MB more is like a drop of water for modern rams.
100MB more is like a hell of a lot for modern L3 caches
Because it and longer TC (that aren't practical to test) are where real use cases come?
especially on your phone
Yeah... and then? I mean... if it passes VVLTC, it should be considered good for normal use cases right?
(It's rare that normal folks would use Stockfish at lower TC anyway.)
why do you think so?
average lichess game analysis is seconds..
on slower hardware
right now, the priority should be to innovate on these ideas...
this will be helped by keeping it a bit nimble and agile.
not by pushing a large net through some TC that can't be supported by the resources.
So, speedups, int8, mmap nets, smarter training processes.
Ahh! I see.
But well, Stockfish does indeed optimize for VVLTC right? At least with things like singular extension. Were Stockfish optimized for lichess analysis, we'd be pushing all sorts of anti-scalers through.
Though yeah... I agree it's not a priority for now. Maybe we can come back to it later when necessary.
@twilit oriole can you give me a full list of the techniques you've tried
Keeping the download small and keeping the net fitting in cache is helpful but Stockfish has been more than strong enough for basically any use case for a long time. Remember phones were beating grandmasters since the 2000s!
I think at this point the purpose of Stockfish development is pure entertainment.
The stronger the engine, the harder it is to learn from it.
The good things that come from them are rather seeing interesting matches.
@regal steeple for https://tests.stockfishchess.org/tests/view/68fde428637acd2a11e72f83 maybe (attkr ^ attkd) >> 3 is better than (attkr ^ attkd) & 8?
i feel like the == 8 instead of & 8 might be nontrivial speedup for whatever reason
The thread is 7k messages and some of the techniques are not from me (especially the more incremental ones like finding the speedups in threat gen etc)
So, a list is not possible
@rare jacinth what types of techniques are you looking for? Speedups? Arch changes? Quantization?
i am very surprised that removing king threats is not a minor speedup
There are known gainers that can be focused on also. Like lazy threats, i8 threats etc
yeah those are the two big (known) ones, mostly depend on effort
minor: at some point we should perform a cleanup of smallnet, so that it stops using an unnecessary 20MB
afaik there's no way to cleanly do this under a single class, unless someone has a better suggestion
the best way I can think of right now is to declare a base feature transformer class with the psq parts and then have a separate threat feature transformer that inherits the functions and adds threat support
broad architectural ideas like adding inputs of whether a pawn is a passed pawn/removing certain threats/etc.
i cannot speak for viren but what we currently do is remove duplicate threats
i.e. rook attacks queen
also a point to investigate is why removing (enemy) threats to a king seems to be a minor slowdown on fishtest
right now we are using 90MB more than master, so if master is using ~140MB of net, we are using ~230MB
can cut it down to 210MB with this first
then i8 will make it around 110MB
I'll try nobranch ver of this maybe
oh but I cant compile shit fml
Even on master I got significant differences from changing the implementation of make_index
So def worth a shot
https://furybench.com/test/3549/ @stray reef dark worker destroyed the gain of your test lol
There's probably no point running this with other workers ngl
This is due to exceeding the L3 cache. 128MB on dark worker
Along with history tables and such has to fit in the L3 with the net
Just 1 cache miss is bad because then you have a latency spike in the eval. 2 cache misses isn't that much slower than 1
doesn't really matter, we know/have good reason to believe that L3 influences the speedup, and that's awesome
Maybe, I dont really know, lets see how cjs attempt goes
that thing doesn't include your change btw
I managed to get it in branchless but it inflated table size
will try again maybe
@rocky vigil so now QA=255 is fine what's the blocker on the i8 threats. Since no x2 trick is needed

Cool
tmrw i will attempt to get started on this
also for then
And is anyone on the speedup side of things looking into lazy threats
I'll ask in advance @naive comet where is weight clipping on the trainer side located
I might but I think it is likely I get nowhere with it lol
What's the approx speedup percent for that
So I can see what's the expected result if it is working lol
threat tracking seems ~5% but you would need to know how much lazy tracking would save
over normal
Oh. I guess it will be within error bars with speedup tool which makes things harder
I have no clue
maybe ask sopel
maybe i can try smth in plenty today or tomorrow
tcec has 256 MB L3 per CPU so should def be a big speedup
fingers crossed
Was this "fuse transformer" patch merged in the Threat Inputs branch, whatever it is?
https://tests.stockfishchess.org/tests/view/690008ee637acd2a11e73441

Potench
The problem is it requires the final add/sub to be computed at the same time as transformed features
Fwiw I don’t think it’s a good patch bc it breaks encapsulation of the layers
and I’m not planning on trying to get it cleaned up or PRed any time soon
+2 ELO stc still pretty good
btw i still think trying to replace https://github.com/xu-shawn/Stockfish/blob/threat_inputs/src/nnue/features/full_threats.cpp#L71 with a lookup is worth a shot
Oh yes
Should be easy... replace the "map" with a table 4x the size. You can get rid of the logic of determining whether or not it's an enemy and so on.
Then, maybe you could like replace the index calculation with simpler, less precise arithmetic. Pad unused slots. It would take a bit more memory but as long as they don't get inside the cache, should be fine.
I tried even a 2x size table and it tanks a ton
i think the first course of action is to shrink the perspective thing (same patch as ces32 (I might be misremembering his name) but without removing templates)
@naive comet it seems like since this commit self.quantized_one has changed from 255 to 127
or is this commit like a fix to a previous QA=255 attempt
it also seems like self.hidden_quantized_one has disappeared
actually what is self.hidden_quantized_one
I cannot find a reference to it
oh there's a typo
that is a commit on his branch...
not a fix
that was just our old attempt
I'll have smth up soon
for u to check
@naive comet can you check https://github.com/sscg13/nnue-pytorch/commit/6d5c50ac427eae851b5a02a3c721068ef85bace7
oh shoot I found a couple of typos
like it should be model.quantization.(etc)
instead of model.(etc)
does everything else seem ok
I still haven't figured out how to selectively clip weights
alright I'll apply model.quantization. fix
i mean the naive way is just to add some like bool is_threat_weight to quantize_feature_transformer
actually I find it strange
that the trainer basically entirely ignores weight limits
during training
?
it clips for non-ft I'm pretty sure
it has to
or the affine transforms will overflow
also what do you think of this idea
ft doesn't clip
just do that
really?
i thought ft got clipped to 2 * (ONE)
or smth
huh
i guess it doesn't actually
very interesting
i still feel like it would be beneficial to make the trainer aware of the clipping
we can do that later surely
also like
it seems the entire tensor of FT weights
is being passed as a whole big chunk
so that would need to be split between threat part and psq part
started working on it, but it's much more annoying than i thought as it requires keeping track of basically all board updates that happen during the move
you'd have to duplicate like 90% of makemove and at that point i really don't think there is a speedup here
though possibly there is a smart way to do it better than i'm thinking
bruh what is going on with https://tests.stockfishchess.org/tests/view/68f4e178637acd2a11e72170
imo if it reaches the 800k limit we can just merge
looks like a microarchitectural oddity
I think as a rule of thumb auto purge should just be off for nfc patches...
Can merge already, it already passed and then got auto purged
800k is just an arbitrary limit
You can raise that if you want
There are weird filters?
And weird stuff in set theory to work with long big numbers
Just keep it up we believe in you
Can’t you do something with the same mechanism as lazy accumulator updates
If we’re not doing any clipping during training though it’s easier to just post-process the weights
we should clip during training
I think
idk how we currently do it for the other weights
Yeah this is probably both easier and more effective than trying to do it at quantization level
that's what I'm trying
ideally i'd figure out from a static board + move what to update
Yeah smth like that would be ideal
So that way the positions are already given
So no need to duplicate makemove
How else would you figure out from a board+move what to update, without duplicating half of makemove?
What if we use both this board, the move, and the next board
Then ideally we know info from both ends
Long run of 1024 Factorized nets finished.
So far, 3rd stage net is the best, 5th stage net test is still initializing.
2nd stage net is defending, though
step 2: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795300
step 3: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795301
step 4: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795302
step 5: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/5137461961076608/2926829081096545/-/jobs/11918795303
I'm locally playing step 3, comparing to the best 1024n so far (nn-962e10fb93ee.nnue vs nn-26b0e5126117.nnue), doesn't look like any gain.
what about stage 5 net?

stuck job, but I don't think it will be any different. Seems like the longer training plateaus at roughly the same value for all steps.
Meanwhile stage 4 net closed the gap with stage 3 net, and we still don't know performance of stage 5 net

   # PLAYER                  :  RATING  ERROR   POINTS  PLAYED  (%)
   1 nn-26b0e5126117.nnue    :     9.7    1.5  74929.5  147456   51
   2 nn-962e10fb93ee.nnue    :     7.8    1.5  74340.0  147456   50
so, having established longer training as no better... i8 is next 🙂
Ouch
not even testing nn-e5bcdd034264.nnue (stage 5 net), just in case?
Yeah i8 is next then
I can add it.
No difference:
   # PLAYER                  :  RATING  ERROR   POINTS  PLAYED  (%)
   1 nn-962e10fb93ee.nnue    :    12.9    1.8  38195.0   73728   52
   2 nn-e5bcdd034264.nnue    :    12.8    1.8  38189.5   73728   52
I guess 26b is the strongest we can get without i8 or spsa then
☹️
748K tests and still not passing
https://tests.stockfishchess.org/tests/live_elo/68f4e178637acd2a11e72170
😐
It’s a speedup, just on the borderline to be enough elo
So, what remains to be done?
Some speedup, i8 inference, i8 nets training, verbatim nets.....and eventually SPSA the net
Is it correct?
Is that the roadmap, more or less?
Barring Elo gain.....
🙏
Anyway... if we're gonna do i8 quantization, then we could just like store the 1024 in i16?
ig @regal steeple can pr the maybemaybemaybe test
the branch should probably use the nn-26b0e5126117.nnue network?
i think everything should be fine
like it compiles and has a bench
I haven't touched anything
The test used an old net because it was running so long, I rebased the pr branch so the net in the pr should be correct
bruh I filled nps instead of nodes for bench again
hate this
aight https://github.com/sscg13/Stockfish/commit/7a41f56227aad2fae1d00166b2f6d6e281264756 has the correct bench
you might need to check this out post-master rebase
Merged and rebased
so what's the current guesstimate on Elo difference at STC?
Nice
Well not much new so still -5 or so stc
Unless the search behavior is different
Weight clipping for i8 will hopefully be solved today…
Any blockers for i8 clipping?
me
the fact that I don't know pytorch
what I think is supposed to happen is to get a tensor slice corresponding to the threat weights
and then clip them like that
but idk how to do it
Add a new field here
Logic goes here
model.input.weight should be the tensor that gets clipped
Only threat weights get i8?
yeah for yoshie it was the most beneficial
If I ever buy a GPU I will definitely try quantizing the main net to i8
I mean you don’t need a gpu to run just the quantization
sure but I think it needs to be re-trained
Honestly if we are doing that it’s probably better to just overhaul the feature system
But I guess for testing just hard code selecting the corresponding slice from model.input.weight
yeah, how would we make that slice
something that looks like weight[0:12345]=weight[0:12345].clamp(...)
so like here, add
{
"params": [model.input.weight[0:79856]],
"min_weight": -self.max_threat_weight,
"max_weight": self.max_threat_weight,
}
?
if the slicing is that easy then I'm happy
afaik it's used as like
.data()
so as long as it works
how would I define max_threat_weight
i know it should be smth like
ft_quantized_one / 2
but idk if there are any details I need to watch out for
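One way to pin the bound down, as a rough sketch: if the trainer quantizes a float weight `w` as `round(w * ft_quantized_one)`, then the largest float that still lands inside i8 is `127 / ft_quantized_one`, which is just under "quantized one / 2". The quantization rule, the helper names, and the example scale of 255 are all assumptions for illustration, not read from the trainer.

```python
I8_MIN, I8_MAX = -128, 127

def max_threat_weight(ft_quantized_one: float) -> float:
    """Largest |w| (float) whose quantized value still fits in i8.

    Assumes the quantized value is round(w * ft_quantized_one).
    """
    return I8_MAX / ft_quantized_one

def quantize(w: float, ft_quantized_one: float) -> int:
    # Assumed quantization rule: scale, then round to the nearest integer.
    return round(w * ft_quantized_one)

scale = 255.0  # assumed example value for ft_quantized_one
bound = max_threat_weight(scale)

print(f"float clip bound: +/-{bound:.6f}")
print("quantized extremes:", quantize(-bound, scale), quantize(bound, scale))
```

A symmetric bound like this never uses -128, which keeps the clipping config simple at the cost of one unused code point.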
i did not have to do any retraining for i8 threat weights to work
SF is different, they had to change the QA
with bullet, if no QAT is used, one can just re-quantise a checkpoint
so i'm wondering why that's not possible for SF
cuz bullet > nnue-pytorch 
anyways check
to make sure I haven't done anything wrong
and if it looks good we can start a test run
I'll get to giving that a test run later today, but obviously that shouldn't stop people from having a look now 😉
net sharing is merged, so probably makes sense to rebase the SF threats branch on master, and run another 10k test of the current state.
Do we re-try the 1280 net on STC?
Shared memory should disproportionately benefit larger nets I think.
1024 with i8 is the way to go rn
i don't think the benefit is that big, per position only a tiny fraction of inputs are used
so, let's see where I get with the above branch by @rocky vigil ....

Interesting a 38% speedup is apparently only 25 Elo in SF now. Used to be far more IIRC
might be... this is not the conventional way of measuring this (timeodds).
(should be maybe similar, but not certain)
let me try to measure once with current master, interesting enough a question..
1 shared_memoryPRtc138 : 49.7 1.8 41962.5 73728 57
2 shared_memoryPRtc100 : 0.0 ---- 31765.5 73728 43
tc adjustment (tc=13.8+0.138)
That was the node count in the match master vs master and sharedmem vs sharedmem
===== shared_memory =====-
771 seconds for 169406377858 nodes
nps: 2.19723e+08
===== master =====-
769 seconds for 122427206326 nodes
nps: 1.59203e+08
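For reference, the 38% figure falls straight out of the node counts and timings above. A quick sketch (the variable names are only for illustration):

```python
# Node counts and wall-clock seconds copied from the two runs above.
runs = {
    "shared_memory": (169406377858, 771),
    "master": (122427206326, 769),
}

# Nodes per second for each build.
nps = {name: nodes / seconds for name, (nodes, seconds) in runs.items()}

# Relative speedup of shared_memory over master.
speedup = nps["shared_memory"] / nps["master"] - 1.0

for name, value in nps.items():
    print(f"{name}: {value:.5e} nps")
print(f"speedup: {speedup:.1%}")
```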
Interesting. I wonder how the elo of a 20% time odds difference changes with SF version
Like if it gets less with recent versions or stays constant
I suspect it will get less.
we can. idk if the checkpoint is kept
rebased
not forever, but I do have those.
then why train again?
prolly no reason to
Hoping that having the trainer aware of clipping is a slight gain
huh when did that semicolon get removed ???
anyways that error is fixed now...
let me try again.
no, so code got deleted
git diff 5bcb0036825206ad6a23df6ed1b07211e3a73f58
diff --git a/training_data_loader.cpp b/training_data_loader.cpp
index 9f04699..8c79b9b 100644
--- a/training_data_loader.cpp
+++ b/training_data_loader.cpp
@@ -1202,7 +1202,7 @@ extern "C" {
{
return new SparseBatch(FeatureSet<Full_Threats>{}, entries);
}
- else if (feature_set == "Full_Threats^")
+ else if (feature_set == "Full_Threats^")
{
return new SparseBatch(FeatureSet<Full_ThreatsFactorized>{}, entries);
}
@@ -1267,10 +1267,6 @@ extern "C" {
{
return new FeaturedBatchStream<FeatureSet<Full_Threats>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
}
- else if (feature_set == "Full_Threats^")
- {
- return new FeaturedBatchStream<FeatureSet<Full_ThreatsFactorized>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
- }
fprintf(stderr, "Unknown feature_set %s\n", feature_set_c);
return nullptr;
}
???
i have no idea what happened
strange
maybe started from a different branch or so?
pretty sure that's a fix we made earlier or.
sounds plausible
alright that should be fixed now
so if there's any error it must be either with the different quantization or the clipping
as a side note I think the current method also clips the threat psqbucket weights
Against the i8 current net?
yeah against just clipping the weights directly
i mean someone is free to try i8 post-processing the current net
just read the threat weights and then write them as i8
I've posted the checkpoint above..
(a link to)
it would probably be a good baseline, and test of the inference code .
well apparently it is 8x slower
gah
i think there are too many weights clipped too frequently
that's a bit too slow to be practical.. I guess probably not quite right.
i think we need a more fine grained optimization to make this work
it's clipping every batch
it should be safe to reduce the threat clipping to every SB (epoch) instead
and that should end the slowdown
8x slowdown sounds too much?
while we're at it I think the init could be better in threat inputs
afaik bullet has some stuff that improves the init
yeah not sure why
maybe it's the slicing that's causing the issue?
is it maybe doing this cpu side, i.e. transferring stuff back and forth?
could be, maybe it's transferring all 320 MB of threat weights every time
what's the cpu-gpu bandwidth?
shouldn't be too hard to check if this is the case...
high, but obviously that would be bottleneck. (450GB/s)
worth trying, but I can hardly imagine it being that slow in general... unless something is unexpected.
can u check the changes see if anything besides clipping could possibly be the slowdown
here are the batches cyclic between 0-6103
Shouldn’t be
Everything is automatically moved to cpu via lightning
actually if it's doing the clipping on cpu
uh
clipping 80 million weights
takes a lot of time
is it fine to just replace ?? with 6103
@frosty imp
oh
bruh
how do I do it like at the end of every superbatch then
or is it batch_idx
I mean there is a batch_idx
batch_idx == 6103 works?
That I don’t know
actually let's do 0
since quantization clips it at the very end anyways
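The "clip once per epoch instead of every batch" idea, as a minimal stand-alone sketch. `on_batch_start` is a made-up name here, not the actual trainer or Lightning hook, and plain lists stand in for the weight tensor; the 6104 batches per epoch follow from batch_idx cycling through 0-6103.

```python
def clip_in_place(weights, lo, hi):
    """Clamp every value of the list into [lo, hi], modifying the list itself."""
    for i, w in enumerate(weights):
        weights[i] = min(max(w, lo), hi)

def on_batch_start(batch_idx, weights, lo, hi):
    """Hypothetical hook: clip only on the first batch of each epoch.

    Returns True when a clip actually happened.
    """
    if batch_idx == 0:
        clip_in_place(weights, lo, hi)
        return True
    return False

weights = [0.7, -0.9, 0.1]
batches_per_epoch = 6104  # batch_idx runs 0..6103

# Count how many times the (expensive) clip runs over one epoch.
clips = sum(on_batch_start(i, weights, -0.5, 0.5) for i in range(batches_per_epoch))
print(clips, weights)
```

The trade-off is the one noted in the chat: weights get a whole epoch to drift past the limit before the next clip, so a final clamp at quantization time is still needed.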
btw @frosty imp could you also run stc smp
serialize hmm
issue with "deepcopy"
this is out of my depth
i cannot connect this to anything new introduced
try not doing the slice thing
I think serialize runs on the cpu so you can prolly test that
but the slicing is only for clipping threat weights
I don't see why it would affect
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001
So perhaps something is interacting to implicitly create a tensor
I think maybe the view was copied since the quant config is part of model
Huh
not sure what you're referring to by this
the view of model.input.weights is in the clipping dictionary
the model itself owns a copy of the clipping dictionary
when you deepcopy the model, you deepcopy the clipping dictionary and therefore the view itself
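A toy reproduction of that ownership chain, with no torch involved. `FakeView` is a stand-in that refuses deepcopy the way a non-leaf tensor does; all class and field names are hypothetical. The second model stores only the slice bounds and builds the slice on demand, which is the "separate function" approach.

```python
import copy

class FakeView:
    """Stand-in for a tensor view that cannot be deep-copied (hypothetical)."""
    def __deepcopy__(self, memo):
        raise RuntimeError("only graph leaves support deepcopy")

class ModelWithStoredView:
    def __init__(self):
        self.weight = list(range(10))
        # The clipping config holds the view itself, so deepcopy reaches it.
        self.clip_config = [{"params": [FakeView()], "min": -1, "max": 1}]

class ModelWithLazySlice:
    def __init__(self):
        self.weight = list(range(10))
        # Only the slice bounds are stored; the view is never owned by the model.
        self.clip_config = [{"slice": (0, 4), "min": -1, "max": 1}]

    def threat_weights(self):
        # Build the slice at the moment it is needed.
        start, stop = self.clip_config[0]["slice"]
        return self.weight[start:stop]

try:
    copy.deepcopy(ModelWithStoredView())
    stored_view_copied = True
except RuntimeError:
    stored_view_copied = False

lazy_copy = copy.deepcopy(ModelWithLazySlice())
print(stored_view_copied, lazy_copy.threat_weights())
```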
python be like
i mean isn't that like most languages that pass by reference
so locally I removed the slice from the weight clipping config
and instead this function exists
will this also solve the deepcopy issue?
try it I guess
if this is the issue then breaking it into a separate function would help
like run serialize?
yeah
don't I need a checkpoint for that
you can do it on a .nnue as well
there is a checkpoint
#1336647760388034610 message
i can also do it with this, the issue is that idk how to run it in the first place
i have such warnings as
```
RuntimeWarning: invalid value encountered in exp2
epsneg_f128 = exp2(ld(-113))
RuntimeWarning: invalid value encountered in exp2
tiny_f128 = exp2(ld(-16382))
RuntimeWarning: invalid value encountered in exp2
eps=exp2(ld(-112)),
RuntimeWarning: invalid value encountered in nextafter
self._smallest_subnormal = nextafter(
RuntimeWarning: invalid value encountered in log10
self.precision = int(-log10(self.eps))
```
but no deepcopy crash
that's probably when you start from the checkpoint, not the nnue?
i also have no idea what that even did
py -u serialize.py ../Stockfish/src/nn-26b0e5126117.nnue test.nnue --device=0 --features=Full_Threats --l1=1024 --ft_compression=leb128
The crash was with:
python -u serialize.py /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/nonopt.nnue /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/last.nnue --ft_optimize_data=/workspace/data/official-stockfish/master-binpacks/fishpack32.binpack --device=0 --features=Full_Threats --l1=1024 --ft_optimize_count=100000 --ft_optimize --ft_compression=leb128
and if ft_optimize so
you can try that with a small ft_optimize_count ?
do I really need to download 5.58 GB binpack for this...
any tiny binpack will do?
ok I'll try it with small.binpack
even just start the download and kill it ...
if you use small.binpack, try to reduce optimize_count. Could also be that this only crashes when run on GPU?
same thing happens if I use the base threat_inputs branch
it gives the invalid value warnings
and then terminal looks like it indicates a crash
there is an additional warning of ```Warning: Numpy built with MINGW-W64 on Windows 64 bits is experimental, and only available for
testing. You are advised not to use it for production.
CRASHES ARE TO BE EXPECTED - PLEASE REPORT THEM TO NUMPY DEVELOPERS```
ok so i think something is wrong here
because it is not even getting to the python part
in the meanwhile, could you check that https://github.com/sscg13/nnue-pytorch/commit/2f978cce16324358e541d22a87610500eee7ac1f fixes the slowdown?
running
ah
the syntax is not what I thought it was
hmm
oh i typoed
weight
not weights
apparently
restarted
alright runs now with basically no speed loss
in several minutes we'll know if serialize works
so it looks like it worked and a net was uploaded
now let me check that the threat weights are indeed in i8 range...
zip compresses better (48 MB instead of 65)
good sign hopefully
very early net, might also contribute
by how much do they exceed the limits
now lemme recompile
if it's only a couple we could just clamp it... obviously not the cleanest solution ofc
why does gcc say it invokes undefined behavior past 22528 * l1
I am pretty confident that weights has size 102384 * l1
what is ThreatInputDimensions * HalfDimensions
also shouldn't it be weights[i] >= 128 or weights[i] < -128?
number of threat weights
79856 * 1024
weights[0:79856*1024] should be the threat weights
129
133
132
137
136
137
132
137
128
145
128
140
134
131
143
128
129
138
129
131
143
128
129
148
128
129
128
143
143
139
139
128
128
128
145
132
147
139
129
135
131
131
150
142
175
147
137
129
128
130
150
151
153
147
140
131
130
134
129
131
131
136
133
147
132
155
141
143
140
137
138
136
153
135
131
162
143
154
151
-129
144
138
143
138
146
138
133
164
160
148
149
138
135
129
133
151
149
138
141
155
139
134
155
143
141
-130
-134
150
146
154
144
129
161
147
140
132
157
146
153
150
147
159
140
128
167
160
133
128
142
153
146
-134
153
143
142
140
165
145
128
142
154
-133
145
131
147
143
131
156
143
141
-138
136
138
-139
-135
-132
-133
-137
-130
-144
-131
-134
165
159
160
131
133
152
143
146
166
149
-130
152
-149
139
147
-133
-136
148
139
155
145
132
159
139
134
136
133
138
-130
-136
-132
-143
-130
140
134
144
143
165
142
132
here are the weights in question
it might be because the lr is too high
so clipping at the start of epoch
still gives them a full epoch to exceed the limits
nevertheless, 200 is still better than the 400k it says for the current net
can't you add something to the loss?
but rather straightforward?
yeah, probably.
does weight[0:79856].clamp(-128, 127) do what I want it to do?
post-integer conversion
if that line does what I want it to do then https://github.com/sscg13/nnue-pytorch/commit/c718d1543a5ef4068b04770914a4385c5d874b6b can be used
unsure whether this or
weight[0:79856] = weight[0:79856].clamp(-128, 127) is right
it is possible they both are
yeah I think this should be
not what I have
ah... yeah, so well, in a few minutes I can start the next run.
yeah clamp_() is the in-place version
wtf is this naming
so either weight[0:79856] = weight[0:79856].clamp(-128, 127) or weight[0:79856].clamp_(-128, 127)
gonna go with the former bc yeah
🤯
they should have a [[nodiscard]] equivalent on weight[0:79856].clamp(-128, 127) lol
oh ig not a thing in Python
restarted with the latest syntax
still 133 exceeding limits
extremely strange
oh shoot
I've been reading the psq weights
actually why are so many of them small
0 out of 81772544 threat weights exceed i8 limits
looks like it works
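The range check itself is tiny. A sketch of what the counter does, assuming the weights are already converted to integers (the function name and the sample values are illustrative only):

```python
I8_MIN, I8_MAX = -128, 127

def count_outside_i8(weights):
    """Count integer weights that do not fit in a signed 8-bit value."""
    return sum(1 for w in weights if w < I8_MIN or w > I8_MAX)

# 128 is already out of range, while -128 still fits.
sample = [129, 133, -129, 128, 127, -128, 0]
print(f"{count_outside_i8(sample)} out of {len(sample)} weights exceed i8 limits")
```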
that also explains the compiler warning
as it turns out inputdimensions should probably be named psqinputdimensions
oh well
so, things look better now?
yeah
very surprising
that only 200 of the psq weights have managed to exceed 128 in abs. value
i guess this is an artifact of only 8 epochs
now the harder part to verify is that the QA=255 is working
I suppose I could start a longer train, like one 800epoch run. Means we have a reasonable net by tomorrow?
yeah
curious if 100 epoch stages 1-5 produces a better net faster than just 800 epoch of stage 1
i suppose it doesn't really matter
keeping it to single stage might also be easier to set up
if this 1 stage has the advantage we can kind of compare with 1stage normal setup.
so, let me start this.
also
meanwhile would be interesting to see what happens just quantizing the existing checkpoint tbh.
on inference side requires an extra x2 somewhere (either as shift, or double add) but it is also interesting
or, the inference side is still not entirely worked out
i think it is possible actually to just post-process the latest nnue
in SF even, yeah, I guess so.
also @frosty imp is there a reason why all of this code is necessary instead of
read_leb_128(stream, threatWeights, HalfDimensions * ThreatInputDimensions)
read_leb_128(stream, weights, HalfDimensions * InputDimensions)
...same for psqt
oh
so the leb128 here actually includes a length counter for the slice
so it must all be read in one go
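To illustrate why a length-prefixed block has to be consumed as one unit, here is a generic signed LEB128 sketch. The 4-byte little-endian byte count in front is an assumption made for the example, not the actual .nnue layout, and none of these helpers are Stockfish functions.

```python
import struct

def encode_sleb128(value: int) -> bytes:
    """Encode one signed integer as LEB128 (7 payload bits per byte)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, keeps the sign
        sign_bit = bool(byte & 0x40)
        done = (value == 0 and not sign_bit) or (value == -1 and sign_bit)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)

def decode_sleb128(data: bytes, pos: int):
    """Decode one signed LEB128 integer starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            if byte & 0x40:  # sign-extend negative values
                result -= 1 << shift
            return result, pos

def write_block(values) -> bytes:
    """Assumed layout: 4-byte little-endian payload size, then the payload."""
    payload = b"".join(encode_sleb128(v) for v in values)
    return struct.pack("<I", len(payload)) + payload

def read_block(data: bytes, count: int):
    """Read exactly `count` values; the size prefix covers the whole block."""
    (nbytes,) = struct.unpack_from("<I", data, 0)
    pos, end = 4, 4 + nbytes
    values = []
    for _ in range(count):
        value, pos = decode_sleb128(data, pos)
        values.append(value)
    if pos != end:
        raise ValueError("block size does not match the number of values read")
    return values

values = [0, 1, -1, 63, 64, -64, -65, 127, -128, 300, -300]
print(read_block(write_block(values), len(values)) == values)
```

With this layout, reading only the first part of the block and the rest in a second call would need a second size prefix, which is the constraint described above.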
meanwhile still open to suggestions on how to avoid declaring threat weights when it's unnecessary
(saves 20 MB of memory)
explain?
for the smallnet
which doesn't use threat inputs
there is no reason to declare a 79856 * 128 array
for the threat weights
that will be unused
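Back-of-the-envelope for the 20 MB figure, assuming 2 bytes per weight (16-bit storage is an assumption here; the dimensions are the ones quoted above):

```python
THREAT_INPUTS = 79856   # number of threat features
SMALLNET_L1 = 128       # smallnet accumulator width
BYTES_PER_WEIGHT = 2    # assumption: 16-bit weights

unused_bytes = THREAT_INPUTS * SMALLNET_L1 * BYTES_PER_WEIGHT
print(f"{unused_bytes} bytes = {unused_bytes / 1e6:.1f} MB")
```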
huzzah
is declaring a [0] array valid
DRY lover
@naive comet what needs to be done inference side for the QA=255
anywhere with 127*2, replace with 255, delete weight doubling, gg
ok
we'll see how it goes in like
6 hours or so
tbh I wonder if removing the unused threat weights in smallnet actually does anything
besides just shaving off 20 MB
surely the compiler knows they're unused
it is, in fact, valid; allows you to declare alignment
wai
ok it's lldb time
can't figure out this templating mess
nvm it's in the scaling lmao
alright
6 MiB instead of 28
or smth
let's go
info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))
that's better
i like it
lmao this technically has to be 128 otherwise it'll fail some static assert
banger
@frosty imp https://github.com/xu-shawn/Stockfish/pull/25
eh lemme try refactoring it in a better way
sure
is 20 MB of memory that valuable
i thought since it was unused actually
it just sits in memory and does nothing
will u have this by the time the stage 1 i8 net is trained
eh just merged your pr
spent the time going off a tangent switching everything to std::array because the way it is now is a pain to refactor
oh
do the later layers just remain unchanged?
net is trained now
validation loss looks fine
0 out of 81772544 threat weights exceed i8 limits
good start
@naive comet https://github.com/sscg13/Stockfish/commit/83eb0e1d835e138194237c33cc968c48f42a6a68 look good?
lgtm I think
info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))
let's go?
bench matches
can someone check single/multithread speedtest of https://github.com/sscg13/Stockfish/commit/5a6633ad554f22ef1ad953bff2af74d0db3c0b79 vs right before this commit
yessir gimme a bit
💀
oh shoot
lemme in fact upload to fishtest
we're gonna be doing a test vs previous stage 1 anyways
O
Failed to download from https://tests.stockfishchess.org/api/nn/nn-81c52631cfec.nnue
maybe you need to upload
uploaded to fishtest now
danke
yep
since we're doing i8 we might as well skip the leb128 nonsense for that section
later tho
story of my lief
yeah i want to
leb still compresses it somewhat though
compared to verbatim
bench is 2266138
