Measure an engine's strength by seeing whether other engines agree with it. We let engines look at each other's (different) PVs, and the weaker engine should eventually agree with the stronger engine in terms of best moves and/or eval. Of course there's chess's drawish tendency and such, but I wonder if this metric would be useful in any way.
#Agreeability (a way to measure engine strength)
You mean measuring engine strength by observing the move correlation between a weaker engine and a stronger one, so we get an idea of the weaker engine's playing strength? I don't think it's useful. I think making engines play each other is the best way to go.
well, with sth like agreeability, it's a function that clarifies things like 'oh this is a missed win by sf' or 'leela misevaled this OCB ending'. although the ending may be a draw, it would be nice to know the process of it, and how both engines 'thought' or 'felt' during the game, sort of like a collaborative post-match analysis for both engines. it was never meant to replace elo as a measure of raw strength, just sth extra and possibly useful
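A rough sketch of what such an agreeability function could look like, assuming each engine's output for a list of positions has already been collected as (best move, eval) pairs. The `Analysis` type, the `agreeability` name, and the 50 cp tolerance are all made up for illustration, not taken from any engine or tool:

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    best_move: str  # engine's preferred move, e.g. "e2e4"
    eval_cp: int    # eval in centipawns, from the side to move's view


def agreeability(a, b, eval_tolerance_cp=50):
    """Return (move_agreement, eval_agreement) as fractions in [0, 1].

    a and b are per-position analyses from two engines, in the same order.
    The tolerance is an arbitrary assumption; tune it per engine pair.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty analysis lists of equal length")
    pairs = list(zip(a, b))
    move_hits = sum(x.best_move == y.best_move for x, y in pairs)
    eval_hits = sum(abs(x.eval_cp - y.eval_cp) <= eval_tolerance_cp
                    for x, y in pairs)
    return move_hits / len(pairs), eval_hits / len(pairs)


engine_a = [Analysis("e2e4", 30), Analysis("g1f3", 25),
            Analysis("d2d4", 300), Analysis("c2c4", 0)]
engine_b = [Analysis("e2e4", 20), Analysis("b1c3", 140),
            Analysis("d2d4", 10), Analysis("c2c4", 5)]

print(agreeability(engine_a, engine_b))  # (0.75, 0.5)
```

Keeping move agreement and eval agreement as two separate numbers matters here: in the third position both engines pick the same move but disagree by almost three pawns on the eval, which is exactly the 'misevaled' case worth flagging.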
Yeah it's useful but not to measure raw strength
And it could be a nice feature to compare the success of different engine choices and determine which choices were better and what the thought process behind the scenes was
Prob determining choice success by comparing outcomes from the same pos, and evaluating stuff like missed opportunities or suboptimal moves
But that's outside Stockfish's realm
I really need sth like this metric rn, for more general engine/position testing purposes. rn i have 2 nets fundamentally disagreeing with each other on whether an (xiangqi) endgame is decisive. in this case both material and tempo are of the utmost importance, and in so many cases it's getting really hard to tell whether a simplification is correct, or whether a move that's more akin to tactical repositioning is correct. sometimes they agree on moves, so i can just make those moves and ease both analyses forward, but when they diverge, ig all i can do is run a ton of engine games for many positions, which also isn't ideal.
and even then i need to pay close attention to every graph to see how the games evolve, because sometimes there are just one-move blunders
especially in the case where the engines disagree on decisiveness.
for example, if engine A thinks the initial position is lost, it may play what engine B considers 'one-move blunders', as it thinks it's losing regardless. but whether those are actually 'one-move blunders' depends on whether the initial position is decisive, something we don't know at the time of testing. vice versa, if engine B thinks the initial position is drawn, it may play what engine A considers 'missed wins', as it thinks it's a draw regardless. but whether those are actually 'missed wins' that actually lose the advantage is again determined by whether the initial position is decisive.
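To make that concrete, here's a small sketch of how the same move gets a different label depending on which engine's belief about the position you judge it by. Everything in it is an assumption for illustration: the `outcome_class` and `label_move` names, the 300 cp cutoff for calling a position decisive, and the convention that evals are from the mover's point of view:

```python
def outcome_class(eval_cp, threshold_cp=300):
    """Map an eval (mover's view, centipawns) to a coarse expected result.

    The 300 cp cutoff is an arbitrary assumption, not an engine constant.
    """
    if eval_cp >= threshold_cp:
        return "win"
    if eval_cp <= -threshold_cp:
        return "loss"
    return "draw"


def label_move(before_cp, after_cp, threshold_cp=300):
    """Label a move by whether it changes the judging engine's expected result."""
    rank = {"loss": 0, "draw": 1, "win": 2}
    before = outcome_class(before_cp, threshold_cp)
    after = outcome_class(after_cp, threshold_cp)
    if rank[after] >= rank[before]:
        return "ok"          # expected result unchanged or improved
    if before == "win" and after == "draw":
        return "missed win"
    return "blunder"         # draw -> loss or win -> loss


# Engine A plays a move it considers harmless because it already sees a loss.
print(label_move(-600, -650))  # judged by A: ok
print(label_move(0, -400))     # same move judged by B: blunder

# Engine B plays a move it considers harmless because it sees a dead draw.
print(label_move(0, 0))        # judged by B: ok
print(label_move(600, 0))      # same move judged by A: missed win
```

The labels only differ because the two engines disagree on the eval before the move, so none of them can be trusted until the decisiveness of the starting position is settled some other way.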