#UE Threat Inputs for AB

rare jacinth
#

where can I find the latest branch with all changes merged?

#

and what percent of time is going to threat calculation right now?

prime mica
twilit oriole
#

the profiling is posted somewhere. don't know the search terms to find it, but someone should

rare jacinth
#

and are we still maintaining two accumulators, one for each side/using king buckets?

rocky vigil
#

yes

#

two accumulators

#

well actually 4 now

#

2 for each side

#

one tracks king bucketed part

#

the other tracks threats

prime mica
#

dumb question, could you combine them

rocky vigil
prime mica
#

actually where are they combined lol

rocky vigil
prime mica
#

ohh good point

rocky vigil
#

another idea, probably minor

#

since we never evaluate in check

#

we can ignore threats that would imply a check

#

but this is odd because it involves color

rare jacinth
#

does that mean we have 4 accumulators of size 1024, each according to one perspective?

rocky vigil
#

yeah, but really think of it as separating the big accumulator

#

e.g. threats + halfkav2_hm

#

so there is an accumulator which tracks all threat feature contributions

#

and there is an accumulator which tracks all halfkav2_hm feature contributions (this one also has the biases)

#

and the "true" accumulator is the sum of these two
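
A minimal sketch of the split described above, assuming hypothetical names (`SplitAccumulator`, `combined`) and a plain scalar loop. The width of 1024 is taken from the chat; the actual Stockfish layout and SIMD code may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Width mentioned in the chat; treated here as an assumption.
constexpr std::size_t kWidth = 1024;

// One perspective's accumulator, kept as two separately updated parts.
struct SplitAccumulator {
    std::array<std::int16_t, kWidth> psq{};      // halfkav2_hm (king-bucketed) part, includes the biases
    std::array<std::int16_t, kWidth> threats{};  // threat feature contributions only
};

// The "true" accumulator is the element-wise sum of the two parts.
inline std::array<std::int16_t, kWidth> combined(const SplitAccumulator& acc) {
    std::array<std::int16_t, kWidth> out{};
    for (std::size_t i = 0; i < kWidth; ++i)
        out[i] = static_cast<std::int16_t>(acc.psq[i] + acc.threats[i]);
    return out;
}
```

With two perspectives this gives the four accumulators mentioned earlier: one `SplitAccumulator` per side.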

lofty cedar
#

Okay... another one potential threat input speedup.

#

Apruvu sama.

rocky vigil
#

requires a trainer change

#

so is much more involved

prime mica
#

threats implying a check is what, king attacks?

rocky vigil
#

yeah, like if a piece of opposite color attacks a king

lofty cedar
#

I think so...

prime mica
#

or just one way

rocky vigil
#

no i think as long as the king is attacked by a piece of the opposite color

#

we are in check

#

applies to both kings i think
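
The condition discussed here can be sketched as a small predicate. The types and the name `implies_check` are hypothetical and only illustrate the rule "a king attacked by a piece of the opposite color"; they are not taken from the threat_inputs branch.

```cpp
// Hypothetical minimal types for illustration.
enum class Color { White, Black };
enum class PieceType { Pawn, Knight, Bishop, Rook, Queen, King };

struct Threat {
    Color attacker_color;
    PieceType attacker;
    Color victim_color;
    PieceType victim;
};

// True if the threat is an enemy piece attacking a king. Such a feature can
// only be active in positions that are never evaluated, so it could be skipped.
inline bool implies_check(const Threat& t) {
    return t.victim == PieceType::King && t.attacker_color != t.victim_color;
}
```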

prime mica
#

I see

rocky vigil
#

now that I think of it

#

since those features are actually never used anyways

#

it is def minor though

prime mica
#

😩

lofty cedar
#

Looks like I have yet another cursed speedup idea.

prime mica
#

lol

#

you have rly pivoted

lofty cedar
#

I mean... I still do regular patches, but speedups are needed for threat inputs.

prime mica
#

^_^

rocky vigil
#

Curious to see how far the speedups can go

rare jacinth
#

why are we always updating threats after do_move when a large part of the time we return immediately

prime mica
#

here's somethin

#

will put on fishtest and see

lofty cedar
#

Do you report a speedup on mine?

rocky vigil
prime mica
#

was major

lofty cedar
#

No, the new one.

prime mica
#

oh

rocky vigil
#

If we can bandaid easy things for now that would also be good in the short term

rare jacinth
prime mica
#

which new one

lofty cedar
#

Basically, I loaded four cache lines to force the prefetcher to prefetch.

#

And then come back later to finish.

rocky vigil
#

This was from a week ago, right is the newer one I think

prime mica
#

I've never gotten prefetching to help

lofty cedar
#

Not sure if it helps either.

rocky vigil
#

In other news stc smp indeed concluded at ~neutral with master this time in 10k games

#

If nothing else I guess wait for a week of speedups while the double length net training runs and see where we are then

prime mica
#

@rare jacinth

#

usual caveat that my computer is weird applies, but here's threat_inputs at the moment

lofty cedar
#

How's my prefetch patch?

#

Does it work?

lofty cedar
prime mica
#

I'll try it out in a bit

violet badger
#

what did you run? speedtest multithreaded or bench single threaded?

lofty cedar
#

I run speedtest 16 threads.

#

And this latest one reported a slight speedup.

#

Though my PC ain't very reliable at speedtest. There are background processes and so on.

rare jacinth
prime mica
#

it probably got inlined

violet badger
#

that's my local result.. on speedtest

#

and here 32 concurrent single threaded speedtests

#

quite a difference..

prime mica
#

shared-memory patch 😭

lofty cedar
#

Well... here's a thing. I was desperately looking for speedups so we could push threat_inputs so maybe some ideas weren't working.

#

@prime mica
But well... here's something. If you can investigate why my monomorphization patch speeds up massively on some machines but not others, maybe we can find a way?

prime mica
#

I’m really not sure

#

I think it might simply be the memory bandwidth issue

#

And threat inputs weights being big

#

It’s something that’s fundamentally impossible to tune for

lofty cedar
#

Would SMP help?

#

Let's try SMP?

naive comet
#

I think mmap

lofty cedar
#

Hmm? WDYM?

naive comet
#

sharing the net across instances

will be a gainer for master but a greater gainer for threat inputs cuz of fatter net

lofty cedar
#

Oh, yeah...

#

But for this, let's try this first.

prime mica
#

We could try doing threat inputs STC + shared memory vs master with shared memory

#

If it works then could catalyze the shared memory branch to be pushed over the finish line

violet badger
#

I had a look, it is probably still fairly easy to rebase the mmap branch on the threats branch tbh.

prime mica
#

Huzzah

violet badger
#

so good enough for testing.

#

but the mmap branch still needs some work..

#

it is a bit a beast in itself.

prime mica
#

Lol

violet badger
#

threats+mmap seems to beat master (without mmap) in a quick and dirty test...

#

(not entirely fair, obviously)

lofty cedar
#

How's the new prefetch speedtest?

rocky vigil
#

Actually this is strange, it used to take 5% of runtime, I don’t believe now it takes <1%

#

That would be too good to be true

violet badger
#

maybe the compiler realized it could be inlined ...

prime mica
#

yeah it probably got inlined

prime mica
prime mica
#

despite it being a 2% speedup locally

daring wren
#

"failing hard"

prime mica
#

yes

#

fail high, fail low, fail hard

#

I mean if it were a 2% speedup across the board it should pass STC quite quickly

rocky vigil
#

Fishtest the dream crusher

prime mica
#

I'ma put up shared-memory vs. threat inputs + shared-memory on fishtest

#

unless someone's done that already or has objections

rocky vigil
#

feel free, if you can get it to work

lofty cedar
prime mica
#

yeah potench

#

I was working off of the tip of threat_inputs rather than urs tho

#

curious how your approach would work when applied to the current

lofty cedar
#

Well... specializing everything is not that good.

#

It's a niche optimization, not something to be broadly applied.

#

This ruins i-cache like crazy.

prime mica
#

agreed

#

the aggressive unrolling doesn't help

#

I actually don't think unrolling the fused thing is ever helpful

amber fern
lofty cedar
naive comet
#

is this your first time using fishtest

#

don't jinx

#

but also

#

please use pyshbench or smth to double check this

lofty cedar
#

One of the patches works massively well on anematode's machine while the other doesn't work. Mine seems to report a marginal speedup on the prefetch one.

naive comet
#

I mean the combined patch

lofty cedar
#

Seems to have a marginal speedup.

violet badger
# amber fern Have you got the fishtest link yet? 🙂

No, but in local testing:

   # PLAYER                            :  RATING  ERROR    POINTS  PLAYED   (%)
   1 mmap-master                       :    19.3    1.2  155153.5  294912    53
   2 mmap-1024-nn-26b0e5126117.nnue    :     9.1    1.2  148825.5  294912    50
   3 master                            :     0.0   ----  109242.0  221184    49
   4 1024-nn-26b0e5126117.nnue         :   -15.4    1.3  102875.0  221184    47
rocky vigil
#

-10 with mmap in your machine is better than the normal -15 i guess

violet badger
#

right

#

benefits threats somewhat more than master, as expected.

stray reef
#

Alright here we go.

i8 feature weights (but only the threat weights)

Fixed nodes:

Elo   | -2.52 +- 4.09 (95%)
Conf  | N=20000 Threads=1 Hash=16MB
Games | N: 10084 W: 2811 L: 2884 D: 4389
Penta | [153, 1260, 2291, 1183, 155]

https://furybench.com/test/3531/

STC:

Elo   | 11.81 +- 4.29 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6212 W: 1627 L: 1416 D: 3169
Penta | [6, 630, 1632, 823, 15]

https://furybench.com/test/3530/

#

(this is with QA=255, clamped into [-128, 127], about 3000 or so of 50M weights are affected by this, no QAT, no special clamping during training)

rocky vigil
#

for sf one of the more pressing concerns is that weights are stored as x2 internally

#

i have not found a good way to bypass this

#

either we do two shifts, or perform a double add when combining the accumulators

twilit oriole
rocky vigil
#

anyways if someone who is better at simd would like to do this in sf

#
  1. take the current net
  2. read in the threat weights as normal i16s (important: don't multiply by 2)
  3. clamp them to i8s
  4. unpack i8 simd vectors into i16s
  5. deal with the x2 somehow on inference
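
The steps above can be sketched in scalar C++ as follows. The function names are hypothetical, the loops stand in for SIMD code, and the x2 is handled with the "double add when combining the accumulators" option mentioned in the chat (an extra shift would be the alternative). This is an illustration under those assumptions, not the Stockfish implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp threat weights (read as plain i16, not pre-multiplied by 2) into the
// i8 range. Returns how many weights were changed by the clamp.
inline std::size_t clamp_to_i8(const std::vector<std::int16_t>& in,
                               std::vector<std::int8_t>& out) {
    std::size_t clipped = 0;
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const int c = std::clamp<int>(in[i], -128, 127);
        if (c != in[i]) ++clipped;
        out[i] = static_cast<std::int8_t>(c);
    }
    return clipped;
}

// Add one feature's i8 weights to the threat accumulator, widening to i16.
inline void add_threat_feature(std::vector<std::int16_t>& threat_acc,
                               const std::vector<std::int8_t>& weights) {
    assert(threat_acc.size() == weights.size());
    for (std::size_t i = 0; i < threat_acc.size(); ++i)
        threat_acc[i] = static_cast<std::int16_t>(threat_acc[i] + weights[i]);
}

// Combine the parts: the psq part already carries the x2, so the threat part
// is added twice ("double add").
inline std::vector<std::int16_t> combine(const std::vector<std::int16_t>& psq,
                                         const std::vector<std::int16_t>& threat) {
    assert(psq.size() == threat.size());
    std::vector<std::int16_t> out(psq.size());
    for (std::size_t i = 0; i < psq.size(); ++i)
        out[i] = static_cast<std::int16_t>(psq[i] + threat[i] + threat[i]);
    return out;
}
```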
naive comet
rocky vigil
#

wait +1.7% is like

#

decent

#

would like to preserve it if possible

frosty imp
#

what's the issue with x2 now?

rocky vigil
#

clamping threat weights to i8

naive comet
#

basically because we were never bounded by size of weights we can x2 freely

#

but now since we want to clamp it to i8 we cant do that

frosty imp
#

clamp it to i8/2 Kappa

rocky vigil
#

naively, x2 can be dealt with by either a double add on combining accumulators, or by introducing an additional shift

#

both seem not ideal

naive comet
#

OH wait

#

yoshie

naive comet
#

we can just QA=127, clamped to [-64, 63]

#

or we can QA=255, clamped to [-128, 127] and elide the *2 during load altogether

rocky vigil
#

doesn't that require a training change?

naive comet
#

yeah

#

dont we have vondele's gpu?

rocky vigil
#

true

naive comet
#

I'm sure 1 retrain for practically guaranteed passer is fine 😉

#

yoshie 🤝 sf

rocky vigil
#

if we request a new training run let's also wait to figure out weight clipping then

#

i'm sure retraining plenty net with weight clipping is also worth some additional

rocky vigil
#

instead of 255

naive comet
#

no retrain

rocky vigil
#

oh

#

so it was an old choice

#

that just got locked in

#

over all the different stages

naive comet
#

that patch was written right before linrock quit I think

#

one of his last experiments I told him to try 255

#

but it never made the light of day

rocky vigil
#

ah

green moat
rocky vigil
#

after recent developments it might even be outdated before it's done lol

rocky vigil
#

no like with the i8 quant

#

eh

#

we'll see

#

it'll probably take a few days to get stuff sorted out

rocky vigil
prime mica
#

huzzah

stray reef
prime mica
#

what is GCP

#

happy to contribute some CPU time as well if u can show me how

stray reef
#

google cloud platform
these are 8 cores of epyc 7b13 per worker

prime mica
#

oh google cloud

stray reef
violet badger
violet badger
rocky vigil
stray reef
#

possibly the +11 is a fluke

#

i guess putting more games into this to get smaller error bars is worth it

twilit oriole
#

The +11 is definitely too high

#

But even then tbh. If it comes in lower how does it help? If the LTC passes there is no way to know it regresses at higher TCs

stray reef
#

i'd say if it has the "standard" speedup scaling from STC to LTC it's pretty safe to say it's not gonna go negative at any higher TC

twilit oriole
#

The fixed nodes loss would suggest it's pretty safe to say regardless I think

stray reef
#

alright. yeah i guess if monty got +16 with L1=3072 and also doing psq weights as i8, then +11 is indeed unrealistic

twilit oriole
#

Monty also doesn't have UE which I assume would inflate the Elo gain

rocky vigil
#

even if it's "only" like +6 STC

rocky vigil
stray reef
#

yeah 6STC/3LTC is still amazing

prime mica
#

with i8 threat weights it might make sense to lay the weights out slightly differently

#

by interleaving two accumulator registers worth of weights

#

so imagine

#
i8 weights[128];
->
i16 weights[64];  // weights[0], weights[64], weights[1], weights[65], ...
naive comet
#

ehh but an index from black pov is diff from white pov

prime mica
#

wdym?

prime mica
#

I meant for this

naive comet
#

I have no idea how we currently impl the i8 stuff btw

rocky vigil
#

we dont

naive comet
stray reef
prime mica
#

oh no

#

like currently you have weightsVec which I assume is half the size of inputVec

#

but you could have something like

#
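```cpp
for (int i = 0; i < L1_ITERATIONS; i += 2) {
    // One full-width load covers the weights for two accumulator registers.
    VecI16 goose = weightsVecPacked[i / 2];
    // High byte of each 16-bit lane: the arithmetic shift keeps the sign.
    outputVec[i] = addEpi16(inputVec[i], shiftRightEpi16(goose, 8));
    // Low byte of each 16-bit lane: needs a sign extension to 16 bits.
    outputVec[i + 1] = addEpi16(inputVec[i + 1], sextEpi8to16(goose));
}
```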
#

idk exactly what you'd use for sextEpi8to16 but the point is you sign extend the lower 8 bits of each 16-bit pair to the full thingy

#

maybe our old friend maddubs or whatever

#

then it would be the same # of computational instructions but you could use full-width loads rather than two half-width loads

#

there might be something better ofc

stray reef
#

I see, that sounds good, let's see if we find something good for extracting the lower i8

prime mica
#

lol

#

lemme call Lisa Su rq

stray reef
#

why don't we just & 0xFF

prime mica
#

because then we'll be adding it as if it's unsigned 😩

stray reef
#

right right i forgot the sign bit is in the wrong place

prime mica
#

but yeah maddubs with set1_epi16(0x00_01) would work and it's one instruction

stray reef
#

maddubs also seems to have lower throughput than cvtepi8_epi16 according to intel docs (i hope i understand throughput correctly in this context)

prime mica
#

O

#

where does it say that?

#

do u mean this column?

#

CPI = cycles per instruction so lower is better

#

1 means 1 per clock cycle, 0.5 means 2 per clock cycle, etc.

stray reef
#

_mm512_maddubs_epi16 has 0.5 except for the first row

#

not sure how this translates to the most recent architectures and amd but hey

prime mica
#

yum

stray reef
#

unless i am again misunderstanding something

prime mica
#

oh why

#

the semantics of maddubs are (iirc)

#
dst[i] = (u8)src1[2*i] * (i8)src2[2*i] + (u8)src1[2*i+1] * (i8)src2[2*i+1]
#

maybe u need to flip the order

#

the weights need to come second

#

_mm_maddubs_epi16(_mm_set1_epi16(1), weights)
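
A scalar model of one 16-bit lane can make the trick easier to check. The helpers below (`maddubs_lane`, `pack`, `low_sext`, `high_sext`, `low_masked`) are hypothetical names that emulate the per-lane behaviour described above; they are not intrinsics. The sketch shows that multiplying by the byte pattern (1, 0) sign-extends the low byte, while masking with 0xFF does not.

```cpp
#include <cstdint>

// Scalar model of one 16-bit lane of maddubs:
// dst = saturate16(u8(a_lo) * i8(b_lo) + u8(a_hi) * i8(b_hi)).
inline std::int16_t maddubs_lane(std::uint16_t a, std::uint16_t b) {
    const int a_lo = a & 0xFF;                             // first operand: unsigned bytes
    const int a_hi = a >> 8;
    const int b_lo = static_cast<std::int8_t>(b & 0xFF);   // second operand: signed bytes
    const int b_hi = static_cast<std::int8_t>(b >> 8);
    int sum = a_lo * b_lo + a_hi * b_hi;
    if (sum > INT16_MAX) sum = INT16_MAX;                  // signed saturation
    if (sum < INT16_MIN) sum = INT16_MIN;
    return static_cast<std::int16_t>(sum);
}

// Pack two i8 weights into one lane: `lo` in bits 0-7, `hi` in bits 8-15.
inline std::uint16_t pack(std::int8_t lo, std::int8_t hi) {
    return static_cast<std::uint16_t>(static_cast<std::uint8_t>(lo) |
                                      (static_cast<std::uint8_t>(hi) << 8));
}

// Low byte, sign-extended: the ones operand (bytes 1, 0) comes first, weights second.
inline std::int16_t low_sext(std::uint16_t packed) {
    return maddubs_lane(0x0001, packed);
}

// High byte, sign-extended: models an arithmetic shift right by 8.
inline std::int16_t high_sext(std::uint16_t packed) {
    return static_cast<std::int8_t>(packed >> 8);
}

// Masking zero-extends instead, so negative weights would be added as unsigned.
inline std::int16_t low_masked(std::uint16_t packed) {
    return static_cast<std::int16_t>(packed & 0xFF);
}
```

For a lane holding the weights -5 (low) and 7 (high), `low_sext` yields -5 and `high_sext` yields 7, whereas `low_masked` yields 251.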

stray reef
#

hm not sure what i'm doing wrong

#

give me a minute

#

ah i was only working in add, forgot to modify sub and addsub kekw

#

got it now

prime mica
#

lololol

#

happens

stray reef
#

how much sense do you think it makes to test this on zen5?

prime mica
#

I think it'll exaggerate the benefit compared to other architectures

#

but worth a shot

#

does the weight interleaving make any difference on ur computer?

stray reef
#

when running bench 15x through hyperfine it's within error pretty much (this is a 7900X)

#

i'm also gonna run a VSTC on furybench

prime mica
#

darn :/

#

I wonder whether the bottleneck is still memory then

stray reef
#

yeah the VSTC failed

prime mica
#

😩

lofty cedar
#

But well, on second thought, wouldn't it be a good idea?

I mean... if it passes VVLTC ofc.

#

VVLTC is what Stockfish aims for no?

#

A few seconds per move on 8-thread hardware sounds reasonable.

#

But 72t 10s is probably about VVLTC.

stray reef
#

60+0.6 8th is probably less than what most people use SF for. 10+0.1 72th maybe not...

lofty cedar
#

100MB seems like nothing nowadays. Games nowadays are in gigabytes.

prime mica
#

at this point most people probably use a webassembly-based stockfish lol

stray reef
#

true actually

rocky vigil
stray reef
#

0126rrr3-i8t-wide == 0126rrr3-i8t, except the branch i8weights-threatonly-wideload processes the net differently during compilation, in order to use full loads (e.g. __m512i on avx512 instead of __m256i). but that was slower, at least in my impl

rocky vigil
#

Ah I see

#

What’s the difference between rrr3 and rrr6 then

stray reef
#

i sort of used rrr6 on accident initially, it was neutral against rrr3, which is why i re-tested later with rrr3 before merging that

round stone
#

and retraining the final stage at the time with QA=255 with the same dataset led to about +1 elo

#

so i'd expect it to be slightly stronger for the main net. using 255 for the smallnet was slight negative or neutral

rocky vigil
rocky vigil
stray reef
rocky vigil
#

what we are trying to do is use QA=255 and also clamp certain weights to i8 range

#

so that way we don't have to use the x2 trick

round stone
rocky vigil
#

stc smp yes

#

stc is like -5

round stone
#

if it's already neutral at 5+0.05 th 8, why not try 60+0.6 th 8?

#

it could already be better than master

rocky vigil
#

still waiting on a couple of new tricks really

#

if nothing else there's a 2x length training run ongoing to see if we can squeeze 1 more elo

#

anyways I'm really happy that we can get it neutral at stc smp without spsa

round stone
#

alright, i'm guessing it's already better than master in its current state at vltc smp

#

also QA=255 alone should be at least +1 elo

rocky vigil
#

i8 clamp above in PlentyChess is like

#

very big as well

rocky vigil
stray reef
#

the progtest against plenty 7 is looking great tho. not sure why it doesn't translate against obsidian

rocky vigil
#

strange

#

although funny it looks like the "scaling" of threat inputs is mostly because it's slower

twilit oriole
rocky vigil
#

in all seriousness it seems like big machines benefit way more from the i8 quant

split warren
#

7742 also has 256mb L3 cache, dunno if that's relevant

violet badger
#

yeah, if you move from not fitting into L3 to fully fitting into L3 such things could happen?

stray reef
#

Damn, i hope this translates to CCC and TCEC kekw

stray reef
#

Another big gainer found, this time a training improvement.

STC

Elo   | 5.31 +- 2.81 (95%)
SPRT  | 8.0+0.08s Threads=1 Hash=16MB
LLR   | 2.91 (-2.25, 2.89) [0.00, 2.50]
Games | N: 15308 W: 3911 L: 3677 D: 7720
Penta | [36, 1717, 3928, 1923, 50]

https://furybench.com/test/3557/
LTC

Elo   | 6.90 +- 3.89 (95%)
SPRT  | 40.0+0.40s Threads=1 Hash=64MB
LLR   | 1.89 (-2.25, 2.89) [0.00, 2.50]
Games | N: 6850 W: 1747 L: 1611 D: 3492
Penta | [2, 688, 1913, 816, 6]

https://furybench.com/test/3558/

I currently have a 4 stage training setup, with SBs 300, 300, 400, 300, and WDL 0.15, 0.3, 0.6, 1.0.
I figured for threat inputs fewer, longer stages with less LR jumping might be beneficial, so I switched to 2 stages with SBs 1000, 400 and WDL 0.15->0.6, 1.0 for this new network

the 0.3 WDL stage used old and partially bugged 5ksn data, that's gone now, it's all just 20ksn, 20ksn adversarial and 5ksn adversarial data thrown together. and the score is not given by search, but by the chonker (768x12 -> 4096)x2 -> (96 -> 192 -> 192 -> 1)x8 net (which was previously done in the first stage as well)

#

i previously tested simpler ways of merging the early stages, as well as increasing the length of the last stage to 400 SBs, and that all failed on its own

rare jacinth
#

What is SB here?

stray reef
#

superbatch, i.e. 100M positions

rare jacinth
#

And these values for WDL?

stray reef
#

0.0 WDL means training purely on score, 1.0 WDL means training purely on game result. Everything else is a linear blend

#

And by 0.15->0.6 I mean that it scales up WDL linearly during the 1000 SBs, instead of keeping WDL constant
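
As a sketch of the two definitions above: the function names are hypothetical, both targets are assumed to be on the same [0, 1] scale, and it is an assumption that the ramp reaches its end value exactly on the last superbatch of the stage.

```cpp
#include <cassert>

// Linear blend of the two training targets: wdl = 0.0 trains purely on the
// score, wdl = 1.0 trains purely on the game result.
inline double blended_target(double score_target, double game_result, double wdl) {
    assert(wdl >= 0.0 && wdl <= 1.0);
    return (1.0 - wdl) * score_target + wdl * game_result;
}

// Linear WDL schedule within one stage, e.g. 0.15 -> 0.6 over 1000 superbatches.
inline double wdl_at(int superbatch, int total_superbatches,
                     double wdl_start, double wdl_end) {
    assert(total_superbatches > 1);
    assert(superbatch >= 0 && superbatch < total_superbatches);
    const double t = static_cast<double>(superbatch) / (total_superbatches - 1);
    return wdl_start + t * (wdl_end - wdl_start);
}
```

A constant-WDL stage is the special case `wdl_start == wdl_end`.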

rare jacinth
#

I can do some work on this if anyone has a machine with strong gpus I can ssh into, I’ve been asked not to use the leela gpus for stockfish related stuff

naive comet
#

@violet badger

violet badger
#

no ssh access, but one can specify this in a recipe at nettest, and the CI pipeline will execute? So fork the repo, adjust recipe according to your liking, PR, and wait and see.

#

now, I must say that I tried optimizing these values so there is probably no low hanging fruit. However, I'm not excluding that it is possible.

rocky vigil
#

About the moves with no threat changes

#

I feel like they tend to be endgame positions?

#

But yeah it definitely feels like there should be a bit to gain from handling the empty case like a null move

#

I wonder if this (attkr ^ attkd) == 8 is much faster somehow

violet badger
#

I think the most important thing right now is to look into the int8 threats.

#

If there is a branch with the updated trainer, the training should be kicked off asap, in parallel to the little experiment that is running on longer training

lofty cedar
#

Should we try to pass a 1280 net at VVLTC?

violet badger
#

no, like that's absolutely useless IMO

lofty cedar
#

I mean, Stockfish optimizes for around VVLTC right?

violet badger
#

why?

lofty cedar
#

Because it's where real uses come.

violet badger
#

no, that's somehow a misconception

lofty cedar
#

And 100MB more is like a drop of water for modern rams.

violet badger
#

100MB more is like a hell of a lot for modern L3 caches

lofty cedar
violet badger
#

especially on your phone

lofty cedar
#

Yeah... and then? I mean... if it passes VVLTC, it should be considered good for normal use cases right?

#

(It's rare that normal folks would use Stockfish at lower TC anyway.)

violet badger
#

why do you think so?

#

average lichess game analysis is seconds..

#

on slower hardware

#

right now, the priority should be to innovate on these ideas...

#

this will be helped by keeping it a bit nimble and agile.

#

not by pushing a large net through some TC that can't be supported by the resources.

#

So, speedups, int8, mmap nets, smarter training processes.

lofty cedar
#

Ahh! I see.

But well, Stockfish does indeed optimize for VVLTC right? At least with things like singular extension. Were Stockfish optimized for lichess analysis, we'd be pushing all sorts of anti-scalers through.

#

Though yeah... I agree it's not a priority for now. Maybe we can come back to it later when necessary.

rare jacinth
#

@twilit oriole can you give me a full list of the techniques you've tried

eternal onyx
#

Keeping the download small and keeping the net fitting in cache is helpful but Stockfish has been more than strong enough for basically any use case for a long time. Remember phones were beating grandmasters since the 2000s!

lofty cedar
#

I think at this point the purpose of Stockfish development is pure entertainment.

#

The stronger the engine, the harder it is to learn from it.

#

The good that comes from it is rather in seeing interesting matches.

rocky vigil
#

i feel like the == 8 instead of & 8 might be nontrivial speedup for whatever reason

twilit oriole
#

So, a list is not possible

rocky vigil
#

@rare jacinth what types of techniques are you looking for? Speedups? Arch changes? Quantization?

#

i am very surprised that removing king threats is not a minor speedup

twilit oriole
#

There are known gainers that can be focused on also. Like lazy threats, i8 threats etc

rocky vigil
#

yeah those are the two big (known) ones, mostly depend on effort

#

minor: at some point we should perform a cleanup of smallnet, so that it stops using an unnecessary 20MB

rocky vigil
#

the best way I can think of right now is to declare a base feature transformer class with the psq parts and then have a separate threat feature transformer that inherits the functions and adds threat support

rare jacinth
rocky vigil
#

i cannot speak for viren but what we currently do is remove duplicate threats

#

i.e. rook attacks queen

#

also a point to investigate is why removing (enemy) threats to a king seems to be a minor slowdown on fishtest

rocky vigil
#

can cut it down to 210MB with this first

#

then i8 will make it around 110MB

naive comet
#

oh but I cant compile shit fml

prime mica
#

Even on master I got significant differences from changing the implementation of make_index

#

So def worth a shot

twilit oriole
#

There's probably no point running this with other workers ngl

#

This is due to exceeding the L3 cache. 128MB on dark worker

#

Along with history tables and such has to fit in the L3 with the net

#

Just 1 cache miss is bad because then you have a latency spike in the eval. 2 cache misses isn't that much slower than 1

stray reef
regal steeple
naive comet
#

that thing doesn't include your change btw

#

I managed to get it in branchless but it inflated table size

#

will try again maybe

twilit oriole
#

@rocky vigil so now QA=255 is fine what's the blocker on the i8 threats. Since no x2 trick is needed

rocky vigil
#

no blocker

#

uh

#

the blocker is I have a chem exam today

naive comet
twilit oriole
#

Cool

rocky vigil
#

also for then

twilit oriole
#

And is anyone on the speedup side of things looking into lazy threats

rocky vigil
#

I'll ask in advance @naive comet where is weight clipping on the trainer side located

twilit oriole
#

What's the approx speedup percent for that

#

So I can see what's the expected result if it is working lol

rocky vigil
#

threat tracking seems ~5% but you would need to know how much lazy tracking would save

#

over normal

twilit oriole
#

Oh. I guess it will be within error bars with speedup tool which makes things harder

rocky vigil
#

so like, I estimate 1-2% speedup

#

which is good still

#

but not that easy to see

naive comet
#

maybe ask sopel

stray reef
rocky vigil
stray reef
#

fingers crossed

green moat
rocky vigil
#

Ask @prime mica if it can be applied

#

It probably can tbh

prime mica
#

Potench

#

The problem is it requires the final add/sub to be computed at the same time as transformed features

#

Fwiw I don’t think it’s a good patch bc it breaks encapsulation of the layers

#

and I’m not planning on trying to get it cleaned up or PRed any time soon

rocky vigil
#

Oh actually it can’t be applied

#

Bc the accumulators are split

prime mica
#

Right

#

It’s a horrible change for only +2 ELO STC

rocky vigil
#

+2 ELO stc still pretty good

prime mica
#

Eh

#

Every impediment to trying new net architectures is -1 bajillion

rocky vigil
prime mica
#

Oh yes

rocky vigil
#

like, removing attacks to king

#

is free

#

with a lookup table

lofty cedar
#

Should be easy... replace the "map" with a table 4x the size. You can get rid of the logic of determining whether or not it's an enemy and so on.

Then, maybe you could like replace the index calculation with simpler, less precise arithmetic. Pad unused slots. It would take a bit more memory but as long as they don't get inside the cache, should be fine.

naive comet
#

I tried even a 2x size table and it tanks a ton

#

i think the first course of action is to shrink the perspective thing (same patch as ces32 (I might be misremembering his name) but without removing templates)

rocky vigil
#

or is this commit like a fix to a previous QA=255 attempt

#

it also seems like self.hidden_quantized_one has disappeared

#

actually what is self.hidden_quantized_one

#

I cannot find a reference to it

#

oh there's a typo

naive comet
naive comet
#

that was just our old attempt

rocky vigil
#

I'll have smth up soon

#

for u to check

#

oh shoot I found a couple of typos

#

like it should be model.quantization.(etc)

#

instead of model.(etc)

#

does everything else seem ok

#

I still haven't figured out how to selectively clip weights

naive comet
#

it's fucked up I think

rocky vigil
#

i took this from linrock commit

#

ihni what it's doing

naive comet
#

ehh

#

then nvm just leave it

#

ok yeah it's right

#

I'm stupid

rocky vigil
#

wait how does it work

#

sorry

naive comet
#

ok I think it's fine

#

now for the clipping

rocky vigil
rocky vigil
#

actually I find it strange

#

that the trainer basically entirely ignores weight limits

#

during training

#

?

naive comet
#

it clips for non-ft I'm pretty sure

#

it has to

#

or the affine transforms will overflow

rocky vigil
#

well yeah

#

but I'm looking for ft clipping

rocky vigil
naive comet
#

ft doesn't clip

rocky vigil
#

i thought ft got clipped to 2 * (ONE)

#

or smth

#

huh

#

i guess it doesn't actually

#

very interesting

#

i still feel like it would be beneficial to make the trainer aware of the clipping

naive comet
#

we can do that later surely

rocky vigil
#

also like

#

it seems the entire tensor of FT weights

#

is being passed as a whole big chunk

#

so that would need to be split between threat part and psq part

stray reef
#

you'd have to duplicate like 90% of makemove and at that point i really don't think there is a speedup here

#

though possibly there is a smart way to do it better than i'm thinking

rocky vigil
#

imo if it reaches the 800k limit we can just merge

prime mica
#

looks like a microarchitectural oddity

#

I think as a rule of thumb auto purge should just be off for nfc patches...

twilit oriole
frosty imp
#

You can raise that if you want

shell breach
#

And weird stuff in set theory to work with long big numbers

#

Just keep it up we believe in you

frosty imp
rocky vigil
naive comet
#

we should clip during training

#

I think

#

idk how we currently do it for the other weights

rocky vigil
stray reef
#

ideally i'd figure out from a static board + move what to update

rocky vigil
#

Yeah smth like that would be ideal

#

So that way the positions are already given

#

So no need to duplicate makemove

stray reef
#

How else would you figure out from a board+move what to update, without duplicating half of makemove?

rocky vigil
#

Then ideally we know info from both ends

violet badger
#

I'm locally playing step 3, comparing to the best 1024n so far (nn-962e10fb93ee.nnue vs nn-26b0e5126117.nnue), doesn't look like any gain.

green moat
#

what about stage 5 net?
copium

violet badger
#

stuck job, but I don't think it will be any different. Seems like the longer training plateaus at roughly the same value for all steps.

green moat
#

Meanwhile stage 4 net closed the gap with stage 3 net, and we still don't know performance of stage 5 net
copium

violet badger
#
   # PLAYER                  :  RATING  ERROR   POINTS  PLAYED   (%)
   1 nn-26b0e5126117.nnue    :     9.7    1.5  74929.5  147456    51
   2 nn-962e10fb93ee.nnue    :     7.8    1.5  74340.0  147456    50
#

so, having established longer training as no better... i8 is next 🙂

rocky vigil
#

Ouch

green moat
#

not even testing nn-e5bcdd034264.nnue (stage 5 net), just in case?

rocky vigil
#

Yeah i8 is next then

violet badger
#

No difference:

   # PLAYER                  :  RATING  ERROR   POINTS  PLAYED   (%)
   1 nn-962e10fb93ee.nnue    :    12.9    1.8  38195.0   73728    52
   2 nn-e5bcdd034264.nnue    :    12.8    1.8  38189.5   73728    52
rocky vigil
#

I guess 26b is the strongest we can get without i8 or spsa then

prime mica
#

☹️

green moat
prime mica
#

lol

#

needs to be put out of its misery

rocky vigil
#

It’s a speedup, just on the borderline to be enough elo

green moat
#

So, what remains to be done?
Some speedup, i8 inference, i8 nets training, verbatim nets.....and eventually SPSA the net

#

Is it correct?
Is that the roadmap, more or less?

rocky vigil
#

yep

#

afaik after i8 we'll probably look towards cleaning it up to be merged

green moat
green moat
lofty cedar
#

Anyway... if we're gonna do i8 quantization, then we could just like store the 1024 in i16?

rocky vigil
#

ig @regal steeple can pr the maybemaybemaybe test

violet badger
#

the branch should probably use the nn-26b0e5126117.nnue network?

rocky vigil
#

@frosty imp looks like we got some rebasing to do now

#

i will attempt to do this rn

rocky vigil
#

i think everything should be fine

#

like it compiles and has a bench

#

I haven't touched anything

regal steeple
rocky vigil
#

bruh I filled nps instead of nodes for bench again

#

hate this

rocky vigil
frosty imp
#

Merged and rebased

violet badger
#

so what's the current guesstimate on Elo difference at STC?

rocky vigil
#

Nice

#

Well not much new so still -5 or so stc

#

Unless the search behavior is different

#

Weight clipping for i8 will hopefully be solved today…

frosty imp
#

Any blockers for i8 clipping?

daring wren
rocky vigil
#

what I think is supposed to happen is to get a tensor slice corresponding to the threat weights

#

and then clip them like that

#

but idk how to do it

naive comet
#

yeah for yoshie it was the most beneficial

prime mica
#

If I ever buy a GPU I will definitely try quantizing the main net to i8

frosty imp
#

I mean you don’t need a gpu to run just the quantization

prime mica
#

sure but I think it needs to be re-trained

frosty imp
#

But I guess for testing just hard code selecting the corresponding slice from model.input.weight

rocky vigil
frosty imp
rocky vigil
#

?

#

if the slicing is that easy then I'm happy

frosty imp
#

nah the slicing creates a new object

#

or wait

#

yeah I think you can do that

rocky vigil
#

afaik it's used as like

#

.data()

#

so as long as it works

#

how would I define max_threat_weight

#

i know it should be smth like

#

ft_quantized_one / 2

#

but idk if there are any details I need to watch out for
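
A minimal sketch of how such a bound could be derived, assuming the trainer quantizes a feature-transformer weight as `round(w * ft_quantized_one)` and the result has to fit in i8, i.e. [-128, 127]. The value 255 is the QA mentioned later in this thread; the function names are hypothetical and not taken from the trainer.

```python
I8_MIN, I8_MAX = -128, 127

def max_threat_weight(ft_quantized_one):
    """Largest float magnitude whose quantized value still fits in i8."""
    return I8_MAX / ft_quantized_one

def quantize(w, ft_quantized_one):
    """Assumed quantization rule: scale, then round to the nearest integer."""
    return round(w * ft_quantized_one)

limit = max_threat_weight(255)
print(f"float clip limit: {limit:.4f}")
print("quantized at the limit:", quantize(limit, 255), quantize(-limit, 255))
```

Under these assumptions the limit is about 0.498, so in quantized units it sits just below `ft_quantized_one / 2` (127 versus 127.5). Clipping symmetrically to this limit leaves -128 unused, which is a small loss of range in exchange for a simpler rule.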

stray reef
#

i did not have to do any retraining for i8 threat weights to work

twilit oriole
#

SF is different, they had to change the QA

stray reef
#

right

#

do they have QAT? or what is the reason for the retraining then?

rocky vigil
#

to avoid having a bunch of x2's

#

for mulhi

stray reef
#

with bullet, if no QAT is used, one can just re-quantise a checkpoint

#

so i'm wondering why that's not possible for SF

naive comet
#

cuz bullet > nnue-pytorch gigachad

rocky vigil
#

anyways check

#

to make sure I haven't done anything wrong

#

and if it looks good we can start a test run

violet badger
#

I'll get to giving that a test run later today, but obviously that shouldn't stop people from having a look now 😉

violet badger
#

net sharing is merged, so probably makes sense to rebase the SF threats branch on master, and run another 10k test of the current state.

lofty cedar
#

Do we re-try the 1280 net on STC?

#

Shared memory should disproportionately benefit larger nets I think.

stray reef
#

1024 with i8 is the way to go rn

#

i don't think the benefit is that big, per position only a tiny fraction of inputs are used

violet badger
#

so, let's see where I get with the above branch by @rocky vigil ....

twilit oriole
twilit oriole
#

Interesting a 38% speedup is apparently only 25 Elo in SF now. Used to be far more IIRC

violet badger
#

might be... this is not the conventional way of measuring this (timeodds).

#

(should be maybe similar, but not certain)

#

let me try to measure once with current master, interesting enough a question..

violet badger
#
   1 shared_memoryPRtc138    :    49.7    1.8  41962.5   73728    57
   2 shared_memoryPRtc100    :     0.0   ----  31765.5   73728    43
#

tc adjustment (tc=13.8+0.138)

#

That was the node count in the match master vs master and sharedmem vs sharedmem

   ===== shared_memory =====-
   771 seconds for 169406377858 nodes
   nps:  2.19723e+08
   ===== master =====-
   769 seconds for 122427206326 nodes
   nps:  1.59203e+08
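
The speedup implied by these node counts can be checked with a few lines. This only redoes the arithmetic on the figures quoted above; the reading that the `tc100` side ran at 10+0.1 is an assumption.

```python
def nps(nodes, seconds):
    """Nodes per second for one side of the match."""
    return nodes / seconds

shared_memory = nps(169_406_377_858, 771)
master = nps(122_427_206_326, 769)
speedup = shared_memory / master

print(f"shared_memory: {shared_memory:.5e} nps")
print(f"master:        {master:.5e} nps")
print(f"speedup:       {speedup:.3f}x")

# Time odds used in the match, assuming tc100 means 10+0.1.
time_odds = 13.8 / 10.0
print(f"time odds:     {time_odds:.3f}x")
```

The ratio comes out at roughly 1.38, which is where the 13.8+0.138 time control comes from.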
twilit oriole
#

Interesting. I wonder how the elo of a 20% time odds difference changes with SF version

#

Like if it gets less with recent versions or stays constant

violet badger
#

I suspect it will get less.

frosty imp
#

rebased

violet badger
stray reef
#

then why train again?

frosty imp
#

prolly no reason to

violet badger
#

to enable QAT eventually...

#

I can upload a checkpoint somewhere.

rocky vigil
rocky vigil
violet badger
#

let me try again.

rocky vigil
#

well segfault on that

#

gah

rocky vigil
#

null pointer

#

amazing

violet badger
#

no, so code got deleted

#

git diff 5bcb0036825206ad6a23df6ed1b07211e3a73f58

#
diff --git a/training_data_loader.cpp b/training_data_loader.cpp
index 9f04699..8c79b9b 100644
--- a/training_data_loader.cpp
+++ b/training_data_loader.cpp
@@ -1202,7 +1202,7 @@ extern "C" {
         {
             return new SparseBatch(FeatureSet<Full_Threats>{}, entries);
         }
-        else if (feature_set == "Full_Threats^")
+        else if (feature_set == "Full_Threats^") 
         {
             return new SparseBatch(FeatureSet<Full_ThreatsFactorized>{}, entries);
         }
@@ -1267,10 +1267,6 @@ extern "C" {
         {
             return new FeaturedBatchStream<FeatureSet<Full_Threats>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
         }
-        else if (feature_set == "Full_Threats^") 
-        {
-            return new FeaturedBatchStream<FeatureSet<Full_ThreatsFactorized>, SparseBatch>(concurrency, filenames_vec, batch_size, cyclic, skipPredicate);
-        }
         fprintf(stderr, "Unknown feature_set %s\n", feature_set_c);
         return nullptr;
     }
rocky vigil
#

???
i have no idea what happened
strange

violet badger
#

maybe started from a different branch or so?

#

pretty sure that's a fix we made earlier on.

rocky vigil
#

ok

#

ah i think my local branch was behind

#

since i just merged the pr on github

violet badger
#

sounds plausible

rocky vigil
#

alright that should be fixed now

#

so if there's any error it must be either with the different quantization or the clipping

#

as a side note I think the current method also clips the threat psqbucket weights

frosty imp
rocky vigil
#

i mean someone is free to try i8 post-processing the current net

#

just read the threat weights and then write them as i8

violet badger
#

I've posted the checkpoint above..

#

(a link to)

#

it would probably be a good baseline, and test of the inference code .

rocky vigil
#

well apparently it is 8x slower

#

gah

#

i think there are too many weights clipped too frequently

violet badger
#

that's a bit too slow to be practical.. I guess probably not quite right.

rocky vigil
#

i think we need a more fine grained optimization to make this work

#

it's clipping every batch

#

it should be safe to reduce the threat clipping to every SB (epoch) instead

#

and that should end the slowdown

frosty imp
#

8x slowdown sounds too much?

rocky vigil
#

while we're at it I think the init could be better in threat inputs

#

afaik bullet has some stuff that improves the init

rocky vigil
#

maybe it's the slicing that's causing the issue?

violet badger
#

is it maybe doing this cpu side, i.e. transferring stuff back and forth?

rocky vigil
#

could be, maybe it's transferring all 320 MB of threat weights every time

#

what's the cpu-gpu bandwidth?

#

shouldn't be too hard to check if this is the case...

violet badger
#

high, but obviously that would be a bottleneck. (450GB/s)
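
A rough back-of-the-envelope check of that transfer cost, assuming the threat weights are float32 during training and have the 79856 x 1024 shape quoted later in this thread. Both are assumptions; the bandwidth is the figure given above.

```python
def transfer_seconds(n_weights, bytes_per_weight, bandwidth_bytes_per_s):
    """Time to move a weight block once at the given bandwidth."""
    return n_weights * bytes_per_weight / bandwidth_bytes_per_s

n_threat_weights = 79856 * 1024       # threat features x L1 size
size_bytes = n_threat_weights * 4     # float32 assumption
seconds = transfer_seconds(n_threat_weights, 4, 450e9)

print(f"threat weights: {size_bytes / 1e6:.0f} MB")
print(f"one-way transfer at 450 GB/s: {seconds * 1e3:.2f} ms")
```

This gives about 327 MB, in line with the "320 MB" estimate, and well under a millisecond per one-way copy at the quoted peak bandwidth. Real transfers rarely reach peak, so this is a lower bound rather than a measurement.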

rocky vigil
#

hmm

#

lemme split the clipping and only call threat clipping on batch 0

violet badger
#

worth trying, but I can hardly imagine it being that slow in general... unless something is unexpected.

rocky vigil
#

here are the batches cyclic between 0-6103

frosty imp
#

Everything is automatically moved to cpu via lightning

rocky vigil
#

actually if it's doing the clipping on cpu

#

uh

#

clipping 80 million weights

#

takes a lot of time

rocky vigil
#

@frosty imp

frosty imp
#

Idk

#

Isn’t batch the data?

rocky vigil
#

oh

#

bruh

#

how do I do it like at the end of every superbatch then

#

or is it batch_idx

frosty imp
#

I mean there is a batch_idx

rocky vigil
#

batch_idx == 6103 works?

frosty imp
#

That I don’t know

rocky vigil
#

actually let's do 0

#

since quantization clips it at the very end anyways
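
The schedule being discussed, clipping the threat weights once per pass over the batches instead of on every batch, can be sketched in plain Python. This is not the trainer's code: `clip_in_place` and `run_epoch` are hypothetical names, and the weights are a small list standing in for the real tensor.

```python
BATCHES_PER_EPOCH = 6104  # batch_idx runs 0..6103 in the log above

def clip_in_place(weights, lo, hi):
    """Clamp every entry of `weights` to [lo, hi], modifying the list."""
    for i, w in enumerate(weights):
        weights[i] = min(max(w, lo), hi)

def run_epoch(weights, lo, hi, n_batches=BATCHES_PER_EPOCH):
    """Walk one epoch and return how many times the threat clip ran."""
    clip_calls = 0
    for batch_idx in range(n_batches):
        # ... the optimizer step for this batch would happen here ...
        if batch_idx == 0:  # once per epoch instead of once per batch
            clip_in_place(weights, lo, hi)
            clip_calls += 1
    return clip_calls

threat_weights = [-1.0, 0.2, 0.9]
calls = run_epoch(threat_weights, -0.5, 0.5)
print(calls, threat_weights)
```

The trade-off is that weights can drift outside the limits for the rest of the epoch, so a final clamp at quantization time is still needed.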

#

btw @frosty imp could you also run stc smp

rocky vigil
#

serialize hmm

#

issue with "deepcopy"

#

this is out of my depth

#

i cannot connect this to anything new introduced

frosty imp
#

I think serialize runs on the cpu so you can prolly test that

rocky vigil
#

I don't see why it would affect

frosty imp
#

RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001

#

So perhaps something is interacting to implicitly create a tensor

#

I think maybe the view was copied since the quant config is part of model

rocky vigil
#

Huh

rocky vigil
frosty imp
#

the view of model.input.weights is in the clipping dictionary

#

the model itself owns a copy of the clipping dictionary

#

when you deepcopy the model, you deepcopy the clipping dictionary and therefore the view itself

twilit oriole
#

python be like

frosty imp
#

i mean isn't that like most languages that pass by reference

rocky vigil
#

so locally I removed the slice from the weight clipping config

#

and instead this function exists

#

will this also solve the deepcopy issue?
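
One way to keep a tensor view out of the config, sketched in plain Python rather than PyTorch: store only the integer bounds of the slice and take the slice at clipping time. This does not reproduce the PyTorch error, and `TinyModel`, `clip_config` and `clip_threat_weights` are hypothetical names, not the function referred to above.

```python
import copy

class TinyModel:
    """Stand-in for a model that owns its weights and a clipping config."""

    def __init__(self, n_threat, n_psq, limit):
        self.weight = [0.0] * (n_threat + n_psq)
        # Only plain numbers live in the config: no view into self.weight.
        self.clip_config = {"threat_rows": (0, n_threat), "limit": limit}

    def clip_threat_weights(self):
        """Resolve the slice lazily and clamp it in place."""
        start, end = self.clip_config["threat_rows"]
        limit = self.clip_config["limit"]
        for i in range(start, end):
            self.weight[i] = min(max(self.weight[i], -limit), limit)

model = TinyModel(n_threat=2, n_psq=2, limit=0.5)
model.weight = [1.0, -1.0, 1.0, -1.0]

clone = copy.deepcopy(model)   # the config copies as ordinary data
clone.clip_threat_weights()
print(model.weight, clone.weight)
```

Because the config holds nothing but integers and a float, copying the model cannot drag a non-leaf tensor along with it.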

frosty imp
#

try it I guess

frosty imp
rocky vigil
frosty imp
#

yeah

rocky vigil
#

don't I need a checkpoint for that

frosty imp
#

you can do it on a .nnue as well

rocky vigil
#

ok

#

how do I actually run serialize standalone

violet badger
#

#1336647760388034610 message

rocky vigil
violet badger
#

(that includes the permutation optimization, which you might want to skip)

rocky vigil
#

i have such warnings as

```
RuntimeWarning: invalid value encountered in exp2
  epsneg_f128 = exp2(ld(-113))
RuntimeWarning: invalid value encountered in exp2
  tiny_f128 = exp2(ld(-16382))
RuntimeWarning: invalid value encountered in exp2
  eps=exp2(ld(-112)),
RuntimeWarning: invalid value encountered in nextafter
  self._smallest_subnormal = nextafter(
RuntimeWarning: invalid value encountered in log10
  self.precision = int(-log10(self.eps))
```
but no deepcopy crash
violet badger
#

that's probably when you start from the checkpoint, not the nnue?

rocky vigil
#

i also have no idea what that even did

#

py -u serialize.py ../Stockfish/src/nn-26b0e5126117.nnue test.nnue --device=0 --features=Full_Threats --l1=1024 --ft_compression=leb128

violet badger
#

The crash was with:

python -u serialize.py /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/nonopt.nnue /workspace/scratch/c82d8c15bf8b/run/lightning_logs/version_0/checkpoints/last.nnue --ft_optimize_data=/workspace/data/official-stockfish/master-binpacks/fishpack32.binpack --device=0 --features=Full_Threats --l1=1024 --ft_optimize_count=100000 --ft_optimize --ft_compression=leb128
#

and if ft_optimize so

#

you can try that with a small ft_optimize_count ?

rocky vigil
#

do I really need to download 5.58 GB binpack for this...

violet badger
#

any tiny binpack will do?

rocky vigil
#

ok I'll try it with small.binpack

violet badger
#

even just start the download and kill it ...

rocky vigil
#

maybe was a crash

#

no error message though

violet badger
#

if you use small.binpack, try to reduce optimize_count. Could also be that this only crashes when run on GPU?

rocky vigil
#

it gives the invalid value warnings

#

and then terminal looks like it indicates a crash

#

there is an additional warning of ```Warning: Numpy built with MINGW-W64 on Windows 64 bits is experimental, and only available for
testing. You are advised not to use it for production.

CRASHES ARE TO BE EXPECTED - PLEASE REPORT THEM TO NUMPY DEVELOPERS```

#

ok so i think something is wrong here

#

because it is not even getting to the python part

violet badger
#

running

rocky vigil
#

ah

#

the syntax is not what I thought it was

#

hmm

#

oh i typoed

#

weight

#

not weights

#

apparently

violet badger
#

restarted

rocky vigil
#

alright runs now with basically no speed loss

#

in several minutes we'll know if serialize works

#

so it looks like it worked and a net was uploaded

#

now let me check that the threat weights are indeed in i8 range...

#

zip compresses better (48 MB instead of 65)

#

good sign hopefully

violet badger
#

very early net, might also contribute

rocky vigil
#

well, better

#

i might need to be more aggressive with the limits

prime mica
#

by how much do they exceed the limits

rocky vigil
#

now lemme recompile

prime mica
#

if it's only a couple we could just clamp it... obviously not the cleanest solution ofc

rocky vigil
#

this is only a test run

#

i can just fix this

prime mica
#

O ok

#

huzzah

rocky vigil
#

why does gcc say it invokes undefined behavior past 22528 * l1

#

I am pretty confident that weights has size 102384 * l1

prime mica
#

what is ThreatInputDimensions * HalfDimensions

#

also shouldn't it be weights[i] >= 128 or weights[i] < -128?

rocky vigil
#

79856 * 1024

#

weights[0:79856*1024] should be the threat weights
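
A small sketch of the range check under discussion, assuming the threat weights occupy the first `79856 * 1024` entries of a flat array. As the thread goes on to show, that layout assumption is exactly the part worth double-checking. `count_outside_i8` is a hypothetical helper.

```python
I8_MIN, I8_MAX = -128, 127

def count_outside_i8(weights, start, end):
    """Count entries of weights[start:end] that do not fit in a signed byte."""
    return sum(1 for w in weights[start:end] if w < I8_MIN or w > I8_MAX)

n_threat_weights = 79856 * 1024
print("threat weights to check:", n_threat_weights)

sample = [0, 127, 128, -128, -129, 500]
print(count_outside_i8(sample, 0, 5), "of the first 5 exceed i8 limits")
```

Note that 127 and -128 are both in range, so the out-of-range test is `w > 127 or w < -128`.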

prime mica
#

😩

#

maybe weights is too smol

rocky vigil
#
129
133
132
137
136
137
132
137
128
145
128
140
134
131
143
128
129
138
129
131
143
128
129
148
128
129
128
143
143
139
139
128
128
128
145
132
147
139
129
135
131
131
150
142
175
147
137
129
128
130
150
151
153
147
140
131
130
134
129
131
131
136
133
147
132
155
141
143
140
137
138
136
153
135
131
162
143
154
151
-129
144
138
143
138
146
138
133
164
160
148
149
138
135
129
133
151
149
138
141
155
139
134
155
143
141
-130
-134
150
146
154
144
129
161
147
140
132
157
146
153
150
147
159
140
128
167
160
133
128
142
153
146
-134
153
143
142
140
165
145
128
142
154
-133
145
131
147
143
131
156
143
141
-138
136
138
-139
-135
-132
-133
-137
-130
-144
-131
-134
165
159
160
131
133
152
143
146
166
149
-130
152
-149
139
147
-133
-136
148
139
155
145
132
159
139
134
136
133
138
-130
-136
-132
-143
-130
140
134
144
143
165
142
132

here are the weights in question

#

it might be because the lr is too high

#

so clipping at the start of epoch

#

still gives them a full epoch to exceed the limits

#

nevertheless, 200 is still better than the 400k it says for the current net

violet badger
#

can't you add something to the loss?

rocky vigil
#

more complexity

#

can also enforce the clipping at final quantization step probably

violet badger
#

but rather straightforward?

rocky vigil
#

probably

#

maybe one of the things that counts as QAT

violet badger
#

yeah, probably.

rocky vigil
#

does weight[0:79856].clamp(-128, 127) do what I want it to do?

#

post-integer conversion

violet badger
#

seems reasonable, should I give it another go?

#

started

rocky vigil
#

it is possible they both are

rocky vigil
#

not what I have

violet badger
#

ah... yeah, so well, in a few minutes I can start the next run.

rocky vigil
#

yeah clamp_() is the in-place version

prime mica
rocky vigil
#

so either weight[0:79856] = weight[0:79856].clamp(-128, 127) or weight[0:79856].clamp_(-128, 127)
gonna go with the former bc yeah
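
The difference between the two forms, shown with a plain-Python analogue rather than PyTorch. In PyTorch, `clamp` returns a new tensor and `clamp_` modifies its input; here `clamp` is a hypothetical list helper with the same out-of-place behaviour. One caveat: Python list slices are copies, whereas basic tensor slices are views, so only the assign-back form carries over directly.

```python
def clamp(values, lo, hi):
    """Out-of-place: return a new list and leave `values` untouched."""
    return [min(max(v, lo), hi) for v in values]

weight = [150, -3, -140, 90, 200]
n_threat = 3  # pretend the first 3 entries are the threat weights

# Result discarded: nothing changes, which is the bug being fixed.
clamp(weight[0:n_threat], -128, 127)
after_discard = list(weight)

# Assign the clamped values back into the same slice.
weight[0:n_threat] = clamp(weight[0:n_threat], -128, 127)

print(after_discard)
print(weight)
```

Entries outside the slice (90 and 200 here) are left alone, which matters when the same array also holds the psq weights.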

prime mica
#

🤯

#

they should have a [[nodiscard]] equivalent on weight[0:79856].clamp(-128, 127) lol

#

oh ig not a thing in Python

violet badger
#

restarted with the latest syntax

rocky vigil
#

still 133 exceeding limits

#

extremely strange

#

oh shoot

#

I've been reading the psq weights

#

actually why are so many of them small

#

0 out of 81772544 threat weights exceed i8 limits

#

looks like it works

rocky vigil
#

as it turns out inputdimensions should probably be named psqinputdimensions

#

oh well

violet badger
#

so, things look better now?

rocky vigil
#

yeah

#

very surprising

#

that only 200 of the psq weights have managed to exceed 128 in abs. value

#

i guess this is an artifact of only 8 epochs

#

now the harder part to verify is that the QA=255 is working

violet badger
#

I suppose I could start a longer train, like one 800epoch run. Means we have a reasonable net by tomorrow?

rocky vigil
#

yeah

#

curious if 100 epoch stages 1-5 produces a better net faster than just 800 epoch of stage 1

#

i suppose it doesn't really matter

#

keeping it to single stage might also be easier to set up

violet badger
#

if this 1 stage has the advantage we can kind of compare with 1stage normal setup.

rocky vigil
#

yeah

#

right

#

and if it works can save having to redo it

violet badger
#

so, let me start this.

#

also

#

meanwhile would be interesting to see what happens just quantizing the existing checkpoint tbh.

rocky vigil
#

on inference side requires an extra x2 somewhere (either as shift, or double add) but it is also interesting

#

or, the inference side is still not entirely worked out

#

i think it is possible actually to just post-process the latest nnue

violet badger
#

in SF even, yeah, I guess so.

rocky vigil
#

also @frosty imp is there a reason why all of this code is necessary instead of

read_leb_128(stream, threatWeights, HalfDimensions * ThreatInputDimensions)
read_leb_128(stream, weights, HalfDimensions * InputDimensions)
...same for psqt
#

oh

#

so the leb128 here actually includes a length counter for the slice

#

so it must all be read in one go
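
To illustrate why a length-prefixed LEB128 block has to be consumed as a unit, here is a sketch of signed LEB128 with a byte-count header. The 4-byte little-endian header is an assumption for illustration only and the real file layout may differ; `encode_block` and `decode_block` are hypothetical names.

```python
import struct

def encode_sleb128(value):
    """Encode one signed integer as LEB128 bytes."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        sign_bit = byte & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            out.append(byte)
            return bytes(out)
        out.append(byte | 0x80)  # more bytes follow

def encode_block(values):
    """Prefix the encoded payload with its length in bytes."""
    payload = b"".join(encode_sleb128(v) for v in values)
    return struct.pack("<I", len(payload)) + payload

def decode_block(data, count):
    """Read `count` integers; the header covers the whole block."""
    (n_bytes,) = struct.unpack_from("<I", data, 0)
    pos, end, values = 4, 4 + n_bytes, []
    for _ in range(count):
        result, shift = 0, 0
        while True:
            byte = data[pos]
            pos += 1
            result |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                if byte & 0x40:          # sign-extend negative values
                    result -= 1 << shift
                break
        values.append(result)
    if pos != end:
        raise ValueError("count does not match the block's length header")
    return values

values = [0, 63, -64, 127, -128, -129, 300]
assert decode_block(encode_block(values), len(values)) == values
```

With this encoding, values in [-64, 63] take one byte and the rest of the i8 range takes two, which is why a net with mostly small weights still shrinks compared to a verbatim dump of wider integers.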

rocky vigil
#

meanwhile still open to suggestions on how to avoid declaring threat weights when it's unnecessary

#

(saves 20 MB of memory)

prime mica
#

explain?

rocky vigil
#

for the smallnet

#

which doesn't use threat inputs

#

there is no reason to declare a 79856 * 128 array

#

for the threat weights

#

that will be unused
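
The memory figure can be checked with a short calculation, assuming 2 bytes per weight (i16). The storage type is an assumption; the dimensions are the ones quoted in this thread.

```python
def array_bytes(rows, cols, bytes_per_weight=2):
    """Size of a dense rows x cols weight array."""
    return rows * cols * bytes_per_weight

unused_threat = array_bytes(79856, 128)   # threat inputs x smallnet L1
psq_part = array_bytes(22528, 128)        # 22528 inputs x smallnet L1

print(f"unused threat array: {unused_threat / 1e6:.1f} MB "
      f"({unused_threat / 2**20:.1f} MiB)")
print(f"psq weights:         {psq_part / 2**20:.1f} MiB")
```

Under that assumption the unused array is about 20.4 MB (19.5 MiB), consistent with the "saves 20 MB" estimate.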

prime mica
#

oh

#

why not just make the size depend

#
```threatWeights[IsSmallNet ? 0 : ...]```
rocky vigil
#

oh

#

right right

prime mica
#

huzzah

rocky vigil
#

is declaring a [0] array valid

prime mica
#

idt so

#

but [1] ^_^

rocky vigil
#

fair

#

yeah this is smart

#

no code duplication needed

prime mica
#

DRY lover

rocky vigil
#

@naive comet what needs to be done inference side for the QA=255

naive comet
#

anywhere with 127*2, replace with 255, delete weight doubling, gg

rocky vigil
#

ok

#

we'll see how it goes in like

#

6 hours or so

#

tbh I wonder if removing the unused threat weights in smallnet actually does anything

#

besides just shaving off 20 MB

#

surely the compiler knows they're unused

upbeat pewter
rocky vigil
#

huh stackoverflow says not

#

idk

#

can just use 1

#

who cares

rocky vigil
#

gah

prime mica
#

wai

rocky vigil
#

ok it's lldb time

#

can't figure out this templating mess

#

nvm it's in the scaling lmao

#

alright

#

6 MiB instead of 28

#

or smth

#

let's go

#
```info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))```
#

that's better

#

i like it

rocky vigil
prime mica
#

banger

rocky vigil
naive comet
#

lmao then surely it passes now right

#

even without i8

frosty imp
#

eh lemme try refactoring it in a better way

rocky vigil
#

sure

rocky vigil
#

i thought since it was unused actually

#

it just sits in memory and does nothing

naive comet
#

idk

#

I have no clue

rocky vigil
frosty imp
#

eh just merged your pr

#

spent the time going off on a tangent switching everything to std::array because the way it is now is a pain to refactor

rocky vigil
#

oh

rocky vigil
#

net is trained now

#

validation loss looks fine

#
```0 out of 81772544 threat weights exceed i8 limits```
#

good start

rocky vigil
naive comet
#

lgtm I think

rocky vigil
#
```info string NNUE evaluation using nn-37f18f62d772.nnue (6MiB, (22528, 128, 15, 32, 1))```
#

let's go?

#

bench matches

prime mica
#

🥳

#

u are cooking

rocky vigil
prime mica
#

yessir gimme a bit

rocky vigil
#

cool cool

#

expecting single thread ~ neutral, more threads to be big speedup

prime mica
#

do I need to manually download it in this case

#

that's fine but just checking

rocky vigil
#

💀

#

oh shoot

#

lemme in fact upload to fishtest

#

we're gonna be doing a test vs previous stage 1 anyways

prime mica
#

O

#

maybe you need to upload

rocky vigil
prime mica
#

danke

rocky vigil
#

or

#

uploading

#

i should say

#

it might take a minute

#

to go through

prime mica
#

yep

#

since we're doing i8 we might as well skip the leb128 nonsense for that section

#

later tho

rocky vigil
#

504 gateway time out 💀

#

uh

#

lemme retry

prime mica
#

story of my lief

rocky vigil
#

oh

#

it went through

#

ok

rocky vigil
#

leb still compresses it somewhat though

#

compared to verbatim

prime mica
#

bench is 2266138