#Optimize search parameters for LTC and VVLTC


onyx blade
#

My attempt to use log depth in the reduction parameters failed. Now I will try a more direct approach that uses the log of the thread count instead. Then we can generate optimal parameters for both TCs via a formula like A + B * msb(threads.size()). I am now running a big search tune (like the VVLTC tunes) at LTC (see https://tests.stockfishchess.org/tests/view/68b4c53578ed7a752a9e8fe2) which tunes the A values (I omitted the log term there because it is zero single-threaded). Later, using the tuned values and the master values, which were tuned at VVLTC, we can calculate B without changing the parameters for VVLTC.

If this works, we can tune at either TC without affecting the other.

  1. If we tune at LTC, we tune only A (because the second term is zero). After the tune, recalculate B = old master values / 3 - A to maintain the VVLTC values.
  2. If we tune at VVLTC, we tune only B (A is kept fixed so as not to change LTC performance).
#

Correction: the recalculation formula is B = (old VVLTC master value - A) / 3
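
A minimal Python sketch of the scheme, using the corrected formula. The function names and the numbers in the usage lines are made up for illustration; `msb` is assumed to return the index of the highest set bit, so it is 0 for one thread and 3 for eight threads.

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    if n < 1:
        raise ValueError("n must be positive")
    return n.bit_length() - 1


def search_param(a: float, b: float, threads: int) -> float:
    """Thread-dependent parameter: A + B * msb(threads)."""
    return a + b * msb(threads)


def recalc_b(old_vvltc_value: float, a: float, vvltc_threads: int = 8) -> float:
    """Pick B so that the parameter at VVLTC keeps its old master value."""
    return (old_vvltc_value - a) / msb(vvltc_threads)


# Hypothetical numbers: master value 1000 (tuned at VVLTC), LTC tune yields A = 940.
a = 940.0
b = recalc_b(1000.0, a)  # (1000 - 940) / 3
```

With these made-up numbers, `search_param(a, b, 1)` gives back A (the LTC value) and `search_param(a, b, 8)` gives back the old VVLTC master value, which is the decoupling the two tuning steps rely on.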

mystic raptor
#

That’s a very interesting direction you’re exploring — I really like the idea of separating the parameters into an A-term (single-thread LTC tuning) and a B-term (multi-thread scaling), so that we can leverage both STC/LTC data and VVLTC tuning without the two interfering with each other. It feels like a more principled way of handling the different time controls and thread counts than just pushing everything into one bucket.

That said, I wonder if this doesn’t tie us down a bit too closely to Fishtest’s chosen benchmarks. The formulas are built specifically on STC (10+0.1), LTC (60+0.6 single thread), and VVLTC (60+0.6 8 threads). But in practice, users play and analyze under a huge variety of conditions: some run fast blitz on multiple threads, some use single-thread at long time controls for analysis, and others may have thread counts far beyond 8 (or on asymmetric hardware).

Wouldn’t there be a risk that by tuning so precisely to the Fishtest triplet of [STC, LTC, VVLTC], we’re optimizing very well for those test conditions, but potentially “locking in” parameters that aren’t quite as optimal in the broader landscape of how Stockfish is actually used?

Do you think the A + B * msb(threads.size()) model will generalize well outside of the Fishtest environment, or is it more of a pragmatic compromise to fit into the current testing infrastructure?

onyx blade
#

That is the question. But the current situation is worse, because the search parameters are tuned at VVLTC (60+0.6, 8 threads) and so are optimized exactly for that. What should probably be checked is how it behaves at higher thread counts, like TCEC with 512 threads. That is the reason I chose log(threads) instead of a linear term: it doesn't rise as fast. With log(threads), going from 8 to 512 threads increases the B term by a factor of 3; with a linear term it would increase by a factor of 64. But later we should investigate this at higher thread counts if possible. Perhaps a test at VLTC should also be done to see whether it is hurt badly, since in future it would use the LTC weights and not the VVLTC weights as it does now. But my impression from the past is that VVLTC tunes often do not work really well at VLTC (though there are not many VLTC tests to go on). So I think using more threads to increase the effective TC is not only a TC matter but has its own thread/core dimension (that is one of the reasons I am trying this).
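
The two growth factors quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; `growth` is a hypothetical helper and `msb` is assumed to be floor(log2(n)).

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    return n.bit_length() - 1


def growth(term, low_threads: int = 8, high_threads: int = 512) -> float:
    """How much a thread-dependent term grows between two thread counts."""
    return term(high_threads) / term(low_threads)


log_growth = growth(msb)                # msb(512) / msb(8) = 9 / 3
linear_growth = growth(lambda t: t)     # 512 / 8
```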

onyx blade
#

I am now testing the values after 88K of the 200K LTC tuning.
STC already passed: https://tests.stockfishchess.org/tests/view/68bf28c159efc3c96b610a6f.
LTC is currently running here: https://tests.stockfishchess.org/tests/view/68bf710259efc3c96b610a96.
So far it seems good (damn, I probably jinxed it!).
If it passes, I plan the following tests:

  • VLTC, but which would be the best test: SPRT non-regression, SPRT gainer, or 60K fixed games?
  • VVVLTC with as many threads as possible, to check whether the msb(#threads) term scales badly (though I would expect a gain). Hopefully the fleet can be on at the weekend; then I would start a 60+0.6 test with 70 threads (hash 512*70/8 = 4480 MB), which would at least double the weight of the msb term (at TCEC conditions it would be tripled), probably with higher throughput or even prio 1 for a short time. Here too the question is which test is best.

I personally prefer fixed-games testing in both cases, so we get a better picture of what the effects are. What are your opinions (especially @little nymph @barren briar)?
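
The numbers for the proposed 70-thread test can be reproduced as follows. This is a sketch only: the baseline of 512 MB hash at 8 threads is read off the expression 512*70/8 in the message, and the helper names are hypothetical.

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    return n.bit_length() - 1


def scaled_hash_mb(threads: int, base_hash_mb: int = 512, base_threads: int = 8) -> int:
    """Scale the hash size linearly with the thread count."""
    return base_hash_mb * threads // base_threads


def msb_weight_ratio(threads: int, base_threads: int = 8) -> float:
    """Weight of the msb term relative to the 8-thread VVLTC setup."""
    return msb(threads) / msb(base_threads)


hash_70 = scaled_hash_mb(70)        # 512 * 70 / 8
ratio_70 = msb_weight_ratio(70)     # msb(70) = 6, so 6 / 3
ratio_512 = msb_weight_ratio(512)   # msb(512) = 9, so 9 / 3
```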

little nymph
#

IMO these kinds of massive changes would need to be clear gainers. My concern with this is that it is hardly sustainable to train/test for very long TC or very many threads. Generally, I'm in favor of experiments and new ideas, but we're spending disproportionate resources on VLTC already. If we now start targeting 100s of threads, I think we're on the wrong track.

onyx blade
#

Expanding testing to more threads is NOT the point of this idea (and not intended). But since we also consider VVLTC tuning and performance, we have the unsatisfying situation that LTC and VVLTC conflict regarding search, so the results of progression tests at the two TCs hurt each other. That is the consequence of the weird mixture of following two targets (it would be best to concentrate on one; VVLTC would be the more important in practice, but it is not feasible with the average core count on Fishtest). So this is a practical attempt to simply decouple search for LTC and VVLTC by using thread-dependent parameters. Many people, including me, have tried to model the TC dependency using nodes, depth and so on, but so far nothing has worked (probably because that gives no clear separation of the TCs). Luckily, we can also distinguish these two TCs by thread count, so with this patch we can optimize the search parameters independently. And as I already said in my initial statement, thread count seems to be an important dimension in itself and not only a matter of effective TC, which is also a big motivation for this idea. If we compare searches with the same number of nodes but different numbers of threads, they probably have similar strength but the search trees are very different (especially if we compare single vs. multiple threads). So it seems logical that even in this case they have different optimal search parameters.

The VLTC and VVVLTC tests I propose here are only one-time tests to observe the general behavior of this patch in these scenarios and to verify that they are not completely off. For example, if 100 Elo were lost at 70 threads, I think everyone would say that is not a good change, but a loss of 2-3 Elo (a one-time loss) would be worth it for decoupling LTC and VVLTC. My personal opinion (or better, hope) is that it also scales, but that would be a bonus.

#

That is also the reason I prefer a fixed-games test: it allows a better estimate of how things have changed.

onyx blade
#

One variant we could use instead is to simply cap the thread count in the interpolation formula at 8, so that it behaves the same for higher thread counts. Then a VVVLTC test is definitely not necessary. But perhaps the original formula scales well, which would give us a nice bonus there.
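
The capped variant could look roughly like this in Python. The function name and the A/B values in the usage lines are hypothetical; only the cap at 8 threads comes from the message.

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    return n.bit_length() - 1


def capped_param(a: float, b: float, threads: int, cap: int = 8) -> float:
    """A + B * msb(min(threads, cap)): identical for every thread count >= cap."""
    return a + b * msb(min(threads, cap))


# Hypothetical A and B: the value at 512 threads equals the value at 8 threads.
at_1 = capped_param(940.0, 20.0, 1)
at_8 = capped_param(940.0, 20.0, 8)
at_512 = capped_param(940.0, 20.0, 512)
```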

onyx blade
#

VLTC at the halfway point seems to lose a little Elo, but let's see. As the fleet is on, I have another idea in this direction, with the difference that it targets the effective TC better. The thinking is that a reasonable estimate of effective TC could be nodes per worker * number of threads. But the effect seems to follow the log of this term. So log(nodes * #threads) = log(nodes) + log(#threads), together with a constant term and the usual msb approximation, gives the formula A + B * (msb(nodes + 1) + msb(threads.size())). I am trying to tune this at VVLTC here: https://tests.stockfishchess.org/tests/view/68c1d72a59efc3c96b610e51. The hope is that this captures the TC/search relation better (but we lose the advantage of decoupling LTC and VVLTC).
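
A small Python sketch of this effective-TC formula. The function name and the sample values are hypothetical; `msb` is assumed to be floor(log2(n)), and the `+ 1` keeps the argument positive when no nodes have been searched yet.

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    return n.bit_length() - 1


def effective_tc_param(a: float, b: float, nodes: int, threads: int) -> float:
    """A + B * (msb(nodes + 1) + msb(threads)), approximating log(nodes * threads)."""
    return a + b * (msb(nodes + 1) + msb(threads))


# Hypothetical values: A = 100, B = 2, one million nodes per worker, 8 threads.
start = effective_tc_param(100.0, 2.0, 0, 1)            # both msb terms are 0
later = effective_tc_param(100.0, 2.0, 1_000_000, 8)    # msb terms are 19 and 3
```

Because `msb` rounds down, the sum of the two msb terms can be one less than `msb(nodes * threads)`, so the split is an approximation of the log identity rather than an exact one.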

#

For the tuning itself I use some scaling of the B term to balance it with the A term.

onyx blade
#

After 177K of tuning I am making two measurements, at STC and LTC, to check whether my hope holds that the TC scaling is somehow better modeled with the new formula:

#

For the standard test at VVLTC I will wait for the fleet.

#

But the bench is really low. Let's see.

onyx blade
#

I stopped the measurements because I used the already scaled B factors here but forgot to change the P function accordingly, so the factors were scaled twice. Now the bench also seems more reasonable. Starting two new measurements:

onyx blade
#

So far both measurements are negative: STC = -9.51 Elo, LTC = -8.26 Elo. Interestingly, in contrast to classical VVLTC tunings, the STC and LTC Elo are not far apart. So perhaps it does scale somehow, but at a low level because of some other problem (if the fleet is on I will do the VVLTC testing). What comes to mind is that we use the searched nodes directly, which means that, especially at nodes near the root, the first few moves get a lower factor than the last moves. If, say, the nodes at most double per iteration, then msb(nodes) increases by at most 1. That means at most one additional B is added to the parameter. Perhaps that is not too much, but I don't know if this behavior hurts. So one thing to try would be to save the searched nodes after each iteration and use that value in the formula. Then the search parameters are constant during an iteration. But if we do this, then approximately the equation power(branching factor, rootDepth) = nodes holds, so rootDepth = constant * log(nodes). So for simplification we could replace the log(nodes) term with rootDepth. So I will try A + B * rootDepth as the search parameter formula. I am using LTC tuning because there is no fleet, but will check later at VVLTC whether it scales: https://tests.stockfishchess.org/tests/view/68c59f8b6d91bee0e315c8a1
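
The step from log(nodes) to rootDepth can be sketched in Python. This assumes a constant effective branching factor, which is an idealization; the function names and the sample numbers are hypothetical.

```python
import math


def depth_from_nodes(nodes: int, branching_factor: float) -> float:
    """Solve branching_factor ** depth == nodes for depth."""
    return math.log2(nodes) / math.log2(branching_factor)


def root_depth_param(a: float, b: float, root_depth: int) -> float:
    """Parameter that stays constant within one iteration: A + B * rootDepth."""
    return a + b * root_depth


# With an assumed branching factor of 2, doubling the nodes adds one ply,
# so rootDepth and log2(nodes) differ only by a constant factor.
d1 = depth_from_nodes(2**20, 2.0)
d2 = depth_from_nodes(2**21, 2.0)
```

Under this assumption the constant in rootDepth = constant * log(nodes) is 1 / log(branching factor), so a different branching factor only rescales B.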

onyx blade
#

@little nymph Thanks for the crash info. I have looked at it and there seem to be only a few crashes. It probably happens for the smaller parameters when we reach some awkward positions where rootDepth rises fast (and only a few nodes per iteration are searched). Then the B term can get big. So some clamping should be added later if the values are tested. Perhaps I should also try my other idea of using the log of the node count saved at each iteration. Then, if rootDepth explodes, the node term will be more constrained: if we assume 100 GNodes per search is possible, as in the TCEC superfinal, we would get a term of at most ~ B * 36 instead of B * 254 (= MAX_PLY) or more.
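
A rough Python sketch of the two safeguards mentioned here: clamping the computed parameter, and the bound on the node-based term. The clamp limits and the A/B values are made up for illustration, and `msb` is assumed to be floor(log2(n)).

```python
def msb(n: int) -> int:
    """Index of the most significant set bit, i.e. floor(log2(n)), for n >= 1."""
    return n.bit_length() - 1


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def clamped_param(a: float, b: float, root_depth: int, low: float, high: float) -> float:
    """A + B * rootDepth, kept inside hypothetical safe limits."""
    return clamp(a + b * root_depth, low, high)


# Bound on the node-based term for 100 GNodes (1e11 nodes) per search.
node_term_bound = msb(100_000_000_000)
```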