Strange evaluation behaviour | Stockfish | Page 1

charred ibex Jul 8, 2024, 3:39 PM

#

There are some opening positions (moves 1-4) that are evaluated at around -1.5 at, e.g., depth 41.
Analyzing even deeper, e.g. to depth 51 or further, doesn't change the evaluation much: it may drop to -1.65 or stay around -1.5.
The PV also becomes increasingly stable.

From my experience, an initial evaluation of -1.5 is on "the brink of defeat", but when grinding through to the endgame, there is sometimes a miraculous save and the evaluation gradually returns to 0.

Yet there is also this phenomenon: I follow a -1.5 position (always playing the #1 move) through to, e.g., move 40, always using the same search depth or an even higher one to see where it's going. Here the evaluation keeps decreasing the further the game progresses, towards -2, -3, -4, etc. until mate (#), while the first couple of moves of the initially determined line basically never change.

(The evaluation fluctuates a bit and the decline is not linear when following the game, e.g. -1.5, -1.53, -1.51, -1.49, -1.58, -1.54, -1.61, etc., but in the end it goes all the way down. I've never seen a position that could still be held, as in going back to -1.5, once it crossed -2.)

It's as if Stockfish suggests the best, and maybe only, move in a given starting position and then calculates itself into its own demise.
But although I know what's coming in general, based on this pattern, Stockfish never understands from the start that the game will be lost in the end, even if I go back to the starting position, where "the hash" should have gained some knowledge about the wrong path.

Any suggestions as to what I should make of this?

Do I need to calculate such a line from the starting point to depth 90+ and hope for some magic change of heart about the first move?
(Often there is no alternative first move, because the others are already evaluated at -1.8 or worse, which only speeds up the above-mentioned phenomenon.)

Is there maybe some way to "backpropagate" knowledge gathered from later positions into the evaluation of earlier positions of the line, so that Stockfish might understand that the line won't work and prune its failing parts immediately?

I highly appreciate any advice or explanation of this.

astral spear Jul 8, 2024, 4:29 PM

#

-1.5 is essentially already lost, with sf eval normalisation

#

+1 is 50% chance to win and -1 is 50% chance to lose

#

(at move 32)

charred ibex Jul 8, 2024, 4:39 PM

#

OK, thank you. The position I actually have in mind is 1.g4 ("the Grob"), so move 1, not move 32.
Stockfish doesn't find any means of survival after 1...d5.
I have been looking for months now, following line after line at ever-increasing search depths.
I find it really hard to believe that White should be lost right after the first move.
I know it's really bad, but a straight "forced loss"? 😅

astral spear Jul 8, 2024, 4:44 PM

#

"is the grob lost" is an open question

#

that progress is being made on, I think?

charred ibex Jul 8, 2024, 4:46 PM

#

That's what I'm trying to answer, because I really like playing it.

astral spear Jul 8, 2024, 4:46 PM

#

chessdb has it at -1.33 currently

astral spear Jul 8, 2024, 4:46 PM

#

charred ibex That's what I'm trying to answer, because I really like playing it.

u wot

charred ibex Jul 8, 2024, 4:48 PM

#

Yes, evaluations begin at -1.33 to -1.5-ish, depending on the search depth.
But the evaluation keeps decreasing no matter which of the "promising" PVs you follow.
(That's what I described in my original post.)
Every line after 1...d5 ends badly for White.

astral spear Jul 8, 2024, 4:48 PM

#

(chessdb is a gigantic distributed analysis project)

charred ibex Jul 8, 2024, 5:04 PM

#

Thanks for pointing me to that project!
But it actually shows the same effect.
1.g4 has a score of -148 (in centipawns, i.e. -1.48).
Following the top moves (... g5 e5 Bg2 Nc6 d4 ...), 28.Bxe3 already has a score of -187.
That matches my experience, because the "decline towards -2" happens gradually up to around move 40.

lavish compass Jul 8, 2024, 6:39 PM

#

In addition to what happens with SF, cdb has an eval decay, so the leaves being higher than the root is expected. Also, for cdb you can right-click to get the PV that results in that eval.
But for SF it is probably just that the later you are in the game, the fewer variations there are to calculate.
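To make the "leaves higher than the root" point concrete, here is a minimal sketch of backing leaf scores up a small explored tree with negamax plus a per-ply decay. How cdb's decay actually works is not stated in this thread, so the `decay_cp` rule below (shrink the score by a fixed number of centipawns per ply) and the tree layout are purely hypothetical assumptions for illustration.

```python
def backup(node, decay_cp=1):
    """Back up scores through a tree with negamax and a per-ply decay.

    A leaf is an int: the score in centipawns from the point of view of
    the side to move at that leaf. An inner node is a list of children.
    The decay rule is a hypothetical assumption, not cdb's documented one.
    """
    if isinstance(node, int):
        return node
    # Negamax: the best child score, seen from the current side to move.
    best = max(-backup(child, decay_cp) for child in node)
    # Hypothetical decay: pull the score towards 0 by decay_cp per ply.
    if best > 0:
        return max(0, best - decay_cp)
    return min(0, best + decay_cp)


# Two plies below the root, White (to move at the leaves) is at -187 or worse.
tree = [[-187, -150], [-300]]
root_score = backup(tree, decay_cp=1)
```

Under these assumptions the root score is smaller in magnitude than the deepest leaf on the best line, which is the effect described above: the score at move 1 looks milder than the score 28 moves down the same line.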

celest briar Jul 8, 2024, 7:01 PM

#

I think your question stems from a misunderstanding of how Stockfish works. Stockfish's evaluation is a number that represents the probabilities of a win, a draw and a loss. A value of -1.5 indicates that Black is almost (and the "almost" is important here) certain to win, assuming the correct moves are played. As you advance along a line of play, it is normal for the evaluation to change, since you are moving the reference point to a later stage of the game, which lets Stockfish start its analysis more moves into the game. If you take a look at the Stockfish conversion table, you will see that at around 2 or -2, White (2) or Black (-2) is estimated to win for certain, with no possibility of a draw or the opposite result. Values greater than 2 or less than -2 do not necessarily describe a change in the probability of the outcome, but rather how close the game is to completion. Keep in mind that this explanation is only approximate and is not intended as a rigorous technical exposition. I hope this helps.

charred ibex Jul 8, 2024, 7:38 PM

#

Thanks for explaining, also for the hint to the technical term _"evaluation decay"_, which I can research now, and which then seems to be normal and expected.

So you all are indeed really helpful!

(And yes, I read that the evaluation nowadays is WDL-based and is then converted back to CP for convenience, constantly re-aligned so that 1 stays equal to a winning chance of around 50%, as I understand it.)

severe quartz Jul 8, 2024, 7:40 PM

#

fun fact: in the ccc bullet event a couple of months ago, stockfish actually failed to convert a +4.8 position against berserk, instead drawing. so misevaluations do happen. they're just exceedingly rare, and they get ever more rare as you get farther from +-1

#

essentially, what you saw is just proof that stockfish is not perfect, not even particularly close to perfect, despite the fact that it is the best. best != perfect, and best != "can't be beaten by worse engines", it certainly can be so beaten.

#

"best" only means "beats them more often than they beat us"

charred ibex Jul 8, 2024, 7:43 PM

#

Yes, I got that.
And I think I also understand that if I were to _explore_ a refutation for a specific line, I cannot simply stop at, say, -2 or -3 and state "That's it", but rather have to prove it to the end.
(Because of the error margin.)

severe quartz Jul 8, 2024, 7:44 PM

#

charred ibex Yes, I got that. And I think I also understand that if I were to _explore_ a ref...

yes, ultimately evals are probabilistic. what sort of probability/confidence you think is sufficient to "cutoff" is up to you.

#

this is a reason that correspondence players use many rapid-TC games to evaluate positions, in addition to a long search. playing hundreds or thousands of playouts from a position will often give better insight than simply one long search and pv
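As a rough sketch of how such playout results could be summarised, the function below turns win/draw/loss counts into an expected score with a normal-approximation confidence interval. The function name and the counts in the usage lines are hypothetical, and the interval assumes the games are independent, which playouts from a single position only approximately are.

```python
import math


def playout_summary(wins, draws, losses, z=1.96):
    """Expected score per game and an approximate confidence interval.

    A win counts 1, a draw 0.5, a loss 0. z=1.96 corresponds to roughly
    95% confidence under a normal approximation.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("need at least one game")
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result around the mean score.
    var = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    half_width = z * math.sqrt(var / n)
    return score, max(0.0, score - half_width), min(1.0, score + half_width)


# Hypothetical counts from White's point of view after 1000 playouts.
score, low, high = playout_summary(wins=20, draws=180, losses=800)
```

How narrow the interval has to be before you "cut off" a line is the confidence threshold mentioned above, and it remains a personal choice.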

celest briar Jul 8, 2024, 7:44 PM

#

charred ibex Thanks for explaining, also for the hint to the technical term_"evaluation decay...

After a certain point, the growth or decrease of the evaluation values no longer indicates a substantive change in the WDL, but rather where the game falls between "okay, I know that one of the sides wins, but there is still plenty of work to do" and "ok, this is practically done"

charred ibex Jul 8, 2024, 7:46 PM

#

That was my basic feeling about it, yes.

celest briar Jul 8, 2024, 7:48 PM

#

So basically you know that you are going to die and the next question is how much time you have left or how advanced is your condition

astral spear Jul 8, 2024, 7:55 PM

#

charred ibex Thanks for explaining, also for the hint to the technical term_"evaluation decay...

the italics are wrong

#

scores are "cp" internally and wdl is estimated for display

#

at the same time as the internal score is normalised to be +1 = 50% win

#

that model drifts out as patches and nets are merged, so it is recomputed every so often
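A minimal sketch of what "+1 = 50% win" means in practice, assuming a logistic win-rate model over the normalised centipawn score. The logistic shape is my understanding of how the display WDL is estimated; the `spread_cp` value is a hypothetical placeholder rather than a fitted Stockfish parameter, and the real parameters also vary with the stage of the game, which this sketch ignores.

```python
import math


def wdl_from_cp(cp, spread_cp=25.0):
    """Estimate (win, draw, loss) probabilities from a normalised score.

    cp is in centipawns from the side to move's point of view. By the
    normalisation described above, cp = +100 gives a 50% win chance.
    spread_cp is a hypothetical steepness value, not a real fitted one.
    """
    win = 1.0 / (1.0 + math.exp((100.0 - cp) / spread_cp))
    # Losing at cp is treated as the opponent winning at -cp.
    loss = 1.0 / (1.0 + math.exp((100.0 + cp) / spread_cp))
    draw = 1.0 - win - loss
    return win, draw, loss


win, draw, loss = wdl_from_cp(-150)
```

With any steepness of this kind, a score of -1.5 already puts the loss probability well above 50%, and scores beyond about -2 barely move the probabilities any further, which fits the observation that larger numbers mostly reflect how far along the conversion is.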

charred ibex Jul 8, 2024, 8:09 PM

#

ah ok, thanks for clarifying!
