
comparison between quality of evaluations

PostPosted: 14 Dec 2005, 06:48
by Uri Blass
I wonder if there is an objective test to compare the quality of the evaluations of chess programs.

The idea is that all programs use the same search algorithm and search to a fixed depth; the program with the better evaluation is then simply the program that wins.

Programmers could modify their programs to use the same search algorithm as TSCP, and a fixed-depth match would then show which program has the better evaluation.

It is possible that the better evaluation at small depths is not the better evaluation at large depths, and it would be interesting to see whether that happens.
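A minimal sketch of the intended setup (names are illustrative, not TSCP's or Movei's actual code; the usual engine primitives are assumed rather than shown):

Code: Select all
/* Fixed-depth match harness: both sides call the SAME search routine,
 * each plugged with its own evaluation function. */
typedef struct Board Board;
typedef int Move;
typedef int (*eval_fn)(const Board *);

extern Move search_root(Board *b, int depth, eval_fn eval); /* shared search */
extern void make_move(Board *b, Move m);
extern int  game_over(const Board *b);
extern int  eval_tscp(const Board *b);     /* TSCP's evaluation (assumed hook) */
extern int  eval_movei(const Board *b);    /* the evaluation under test */

void play_fixed_depth_game(Board *b, int depth)
{
    eval_fn side[2] = { eval_tscp, eval_movei };
    int stm = 0;                           /* index of the side to move */

    while (!game_over(b)) {
        Move m = search_root(b, depth, side[stm]); /* same depth for both */
        make_move(b, m);
        stm ^= 1;
    }
}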

If programmers are interested in modifying their programs to use the same search algorithm as TSCP, it may be interesting to compare evaluations in Noomen matches.

It would also be interesting to see whether some programs can beat TSCP with a 1- or 2-ply handicap at some depth under these conditions.

First testing suggests that Movei is better than TSCP under these conditions, but the difference is less than 1 ply. (The difference between Movei and TSCP is more than 1 ply with Movei's default algorithm, because checks in the qsearch and some extensions give Movei more of an advantage than the damage it suffers from pruning.)



I think that speeding up TSCP for this testing, through better move ordering from killer moves and hash tables and other speed improvements, may be productive: I think it is not going to change TSCP's level at fixed depth, but it will make it reach that depth faster.
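A minimal sketch of the killer-move part of such ordering (illustrative only, not TSCP's actual code; the hash move would come from a transposition table):

Code: Select all
#define MAX_PLY 64

static int killer1[MAX_PLY], killer2[MAX_PLY]; /* quiet moves that caused cutoffs */

/* Call when a quiet move causes a beta cutoff at this ply. */
void store_killer(int ply, int move)
{
    if (move != killer1[ply]) {
        killer2[ply] = killer1[ply];
        killer1[ply] = move;
    }
}

/* Ordering score: hash move first, then captures, then killers, then the
 * rest. Sorting by this score changes the speed of a plain fixed-depth
 * alpha-beta search, not the score it returns. */
int order_score(int ply, int move, int hash_move, int is_capture)
{
    if (move == hash_move)    return 1000000;
    if (is_capture)           return 100000;  /* refine with MVV/LVA in practice */
    if (move == killer1[ply]) return 90000;
    if (move == killer2[ply]) return 80000;
    return 0;
}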

Uri

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 11:27
by H.G.Muller
This is an interesting idea, and definitely worth trying. I am afraid, however, that the results will not be clear-cut: I expect a very strong interaction between search algorithm and evaluation.

For instance, a program whose quiescence search automatically considers captures over squares that were evacuated on the previous ply might not benefit much from awarding score points for pins, because it resolves the pins before doing the evaluation. For a program that does not resolve pins in its QS, a good static scoring of pin situations might be crucial: it lets the program see them coming while the actual loss or gain is still beyond the horizon, providing a smooth transition between statically determined and search-based scores, and hence a stable search during iterative deepening.
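For concreteness, a static pin term of the kind meant here might look like this (a sketch on a 0x88 board; the piece codes, the is_white helper and the penalty value are all illustrative assumptions):

Code: Select all
enum { EMPTY = 0, BB = 9, BR = 10, BQ = 11 };  /* illustrative piece codes */

static const int diag_dir[4]  = { 15, -15, 17, -17 }; /* 0x88 bishop rays */
static const int ortho_dir[4] = { 16, -16,  1,  -1 }; /* 0x88 rook rays  */

extern int is_white(int piece);                /* assumed helper */

/* Scan from the white king along one ray: exactly one white piece followed
 * by an enemy slider of the right kind means that piece is pinned. */
static int pinned_on_ray(const int bd[128], int ksq, int d, int s1, int s2)
{
    int sq = ksq + d, shield = -1;
    while (!(sq & 0x88)) {                     /* 0x88 on-board test */
        if (bd[sq] != EMPTY) {
            if (shield < 0) {
                if (!is_white(bd[sq]))
                    return -1;                 /* enemy piece first: no pin */
                shield = sq;
            } else {
                return (bd[sq] == s1 || bd[sq] == s2) ? shield : -1;
            }
        }
        sq += d;
    }
    return -1;
}

/* Small static penalty per pinned white piece (value purely illustrative). */
int pin_penalty(const int bd[128], int ksq)
{
    int i, p = 0;
    for (i = 0; i < 4; i++) {
        if (pinned_on_ray(bd, ksq, diag_dir[i],  BB, BQ) >= 0) p += 15;
        if (pinned_on_ray(bd, ksq, ortho_dir[i], BR, BQ) >= 0) p += 15;
    }
    return p;
}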

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 12:02
by Reinhard Scharnagl
Theoretically, chess is a zero-sum game. But in practice nobody possesses the necessary full information, so computer chess is de facto not a zero-sum game.

Because of that, there are a lot of situations where there is no single 'optimal' move. The choice of a 'best' move for a given playing program depends on its preferred playing style and also on the identity of its opponent.

Regards, Reinhard.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 12:56
by H.G.Muller
I think you mean to say: "game of perfect information".

Chess is a zero-sum game: the gains of one opponent are the losses of the other. Only if we awarded tournament points for, e.g., 'beauty of playing style' rather than for checkmating the opponent would it become a non-zero-sum game.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 13:52
by Reinhard Scharnagl
Not at all. I mean what I said. Chess as a game IDEA might be a zero-sum game, but nobody is able to play such an IDEAL game, because of the lack of information.

What a chess engine does, for example, is merely approximate that ideal game. So if you watch two different chess programs playing a game against each other, their evaluations will mostly not sum to zero.

Reinhard.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 14:30
by H.G.Muller
Apparently we differ in our definition of 'zero-sum game'. In my definition, how you play the game (perfectly or otherwise) does not have the slightest effect on whether the game is zero-sum: this is purely determined by the pay-off matrix for both players. I am willing to concede that chess is a 'one-sum game' rather than a 'zero-sum game', though, since on average the players receive half a point per game... :wink:

(In a tournament where everyone plays an equal number of games this difference is immaterial, which is why game theorists still consider chess a zero-sum game. It would become a different game altogether if you were rewarded by the total score of all games you managed to play within a certain time interval; in particular, programs that agree to a draw as quickly as possible would perform very well in such an environment, even if they did not know how to win a game at all, provided they knew how to lose very slowly...)

I do agree with your statement that two chess players need not agree about who is ahead, and by how much. But this makes it all the more interesting to investigate the merits of their evaluations, to check how well their opinions correlate with eventually winning the game.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 14:33
by Alessandro Scotti
Reinhard Scharnagl wrote:What a chess engine does, for example, is merely approximate that ideal game. So if you watch two different chess programs playing a game against each other, their evaluations will mostly not sum to zero.


It is not necessary that the two evaluations be identical. However, in a typical game each player evaluates the position assuming the game is zero-sum, i.e. if I believe I'm winning by amount x, then I also believe you are losing by the same amount. Without this assumption, minimax-based algorithms cannot work.
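A sketch of where that assumption lives in the search (pruning omitted; the engine primitives are assumed, not shown):

Code: Select all
typedef struct Board Board;
typedef int Move;

extern int  gen_moves(Board *b, Move *list);  /* assumed engine primitives */
extern void make_move(Board *b, Move m);
extern void unmake_move(Board *b, Move m);
extern int  evaluate(const Board *b);         /* score from mover's view */

/* Plain negamax: the single negation below IS the zero-sum assumption,
 * i.e. whatever I am winning by, you are losing by. */
int negamax(Board *b, int depth)
{
    Move list[256];
    int i, n, best = -1000000;

    if (depth == 0)
        return evaluate(b);

    n = gen_moves(b, list);
    for (i = 0; i < n; i++) {
        int score;
        make_move(b, list[i]);
        score = -negamax(b, depth - 1);  /* zero-sum: his score is minus mine */
        unmake_move(b, list[i]);
        if (score > best)
            best = score;
    }
    return best;
}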

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 14:47
by Reinhard Scharnagl
Hi Alessandro,

what you have written makes me say the following: as long as minimax remains the basis of a chess engine, human beings will have a chance.

We have experienced over the centuries that chess players have more success if they adapt their playing style to the actual opponent's personality (which would make no sense in a zero-sum game at all).

In chess programming this has given me the idea not to sum up the evaluation of a given position into one single scalar, but into a vector of several aspects. This vector is then transformed into a scalar by multiplying it with a (colour-dependent) personality weight vector, which should be found and optimized for each individual player or program.

That leads to some interesting conclusions: a) there can be different 'optimal' moves depending on the players' personalities; b) minimax no longer works as-is; the values have to be recalculated from the position vectors each time the ply changes.
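A sketch of that vector scheme (the number of aspects, their names and eval_aspects are all illustrative assumptions):

Code: Select all
#define N_ASPECTS 4  /* e.g. material, mobility, king safety, pawn structure */

typedef struct Board Board;
typedef struct { double v[N_ASPECTS]; } EvalVector;
typedef struct { double w[N_ASPECTS]; } Personality; /* per player/program */

extern EvalVector eval_aspects(const Board *b);  /* assumed: fills the vector */

/* Collapse the aspect vector to a scalar for ONE player's point of view.
 * With different personalities, White's scalar is in general NOT minus
 * Black's scalar, so plain negamax no longer applies. */
double eval_for(const Board *b, const Personality *p)
{
    EvalVector e = eval_aspects(b);
    double s = 0.0;
    int i;
    for (i = 0; i < N_ASPECTS; i++)
        s += p->w[i] * e.v[i];
    return s;
}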

Reinhard.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 15:18
by H.G.Muller
Oh, I think knowledge of your opponent's capabilities (which you might define as part of his personality) is of very great importance, even in cases where perfect knowledge of the game is available, as with endgame tablebases.

For instance, look at the following position from the KBPPKB end-game:
[diag]1b6/k7/P1P5/1B6/K7/8/8/8[/diag]

I don't know if this actually is a draw; for argument's sake let's assume it is. Looking in the tablebase then reveals to White that despite being two pawns up, he still cannot force a win. If the opponent also has a tablebase for this endgame, he might as well agree to a draw immediately.

But if the opponent does not have the tablebase, how should we continue the game to maximize our winning chances? Unlike the won positions in the TB, which are distinguished by their DTM, all drawn positions are strictly equivalent. We could include counters to measure the distance to the legal end of the game under the 50-move rule, stretching out the game as long as possible in the hope that he makes a mistake. But this assumes the probability of such a mistake is the same in every position.

What is to prevent the program from sacrificing a pawn on the first move, another pawn on the second, and its bishop on the third? It will still be a draw after that... Non-perfect players (as opposed to tablebases) distinguish a draw that you are happy to salvage from a draw that you unfortunately just could not win.

If you play c6-c7, only a single move (BxP) maintains the draw; all others lose. Does that mean you have the maximum probability of a mistake? Well, if the opponent is an idiot who fails to take hanging pieces and moves randomly, yes; but otherwise it is of course inconceivable that he will not find BxP...

So you can only meaningfully decide what to do if you know something about the capabilities of your opponent. Which possibilities can you offer him to step out of the drawn sector into the lost sector that he is likely to fall for? If you know he thinks 20 ply ahead, you should present him with positions in which as many moves as possible that still look good after 20 ply are actually lost after 21 ply. But if he thinks 22 ply ahead, that doesn't work at all...
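As a sketch of how one might measure this 'trappiness' (everything here is an illustrative assumption: the margins, and a search() returning the score from the side to move's point of view):

Code: Select all
#define DRAW_MARGIN   50   /* 'we do not seem to be winning' band, centipawns */
#define WIN_THRESHOLD 300  /* 'we are clearly winning' threshold, centipawns */

typedef struct Board Board;
typedef int Move;

extern int  gen_moves(Board *b, Move *list);  /* assumed engine primitives */
extern void make_move(Board *b, Move m);
extern void unmake_move(Board *b, Move m);
extern int  search(Board *b, int depth);      /* side to move's score */

/* Count the opponent's replies that look tenable at (roughly) his horizon d
 * but are lost one ply deeper. The more such replies a drawing move leaves
 * him, the 'trappier' it is. */
int trap_count(Board *b, int d)
{
    Move list[256];
    int i, n, traps = 0;

    n = gen_moves(b, list);                    /* opponent's replies */
    for (i = 0; i < n; i++) {
        make_move(b, list[i]);                 /* now it is our turn again */
        if (search(b, d) <= DRAW_MARGIN &&     /* at his horizon we don't win */
            search(b, d + 1) >= WIN_THRESHOLD) /* but one ply deeper we do */
            traps++;
        unmake_move(b, list[i]);
    }
    return traps;
}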

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 15:31
by Uri Blass
The position that you posted is a simple draw.

After Kb6 Kc7 I see nothing for White to do, because White cannot force the bishop out of b8-a7.

Uri

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 15:40
by H.G.Muller
That is what my evaluation function as a not-too-good human chess-player told me, but I did not bother to actually look it up in the TB, hence the disclaimer. :wink:

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 15:52
by Alvaro Begue
Reinhard Scharnagl wrote:Hi Alessandro,

what you have written makes me say the following: as long as minimax remains the basis of a chess engine, human beings will have a chance.

What makes you think that? Are you thinking of a particularly simple version of minimax, or what?

We have experienced over the centuries that chess players have more success if they adapt their playing style to the actual opponent's personality

And why can't you incorporate opponent modelling in minimax?

(which would make no sense in a zero-sum game at all).

Why all the confusion about chess being a zero-sum game? It doesn't mean that it's fair, or that opponent modelling doesn't matter, or any of the weird things that have been said in this thread. Rock-paper-scissors is clearly a zero-sum game, and the only thing that matters there is opponent modelling.

You say that adapting the playing style to the actual opponent's personality wouldn't make sense in a zero-sum game, and then you go on advocating opponent modelling in chess. That is what makes no sense at all.

In case you need a definition (although it has been explained here already): http://en.wikipedia.org/wiki/Zero-sum

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 17:11
by Reinhard Scharnagl
Hi Alvaro,

no need to explain a zero-sum game; chess (as an idea) is one. But any game actually being played has (at least in its opening stages) incomplete information. Thus there are evaluations differing from the values the game theoretically has: -1, 0, +1. This reflects that chess, as played de facto, is not a zero-sum game. In a zero-sum game the opponent's evaluations would not matter; because chess is not PLAYED as a zero-sum game, they do matter, and hence minimax is not appropriate for the approach I intend: the evaluations from one engine's point of view are not simply the negatives of the evaluations from the opponent's point of view. If you change your engine's playing style, you actually change it for BOTH sides simultaneously; but the idea is to establish a style matching reality individually for each of the two sides.

Reinhard.

Re: comparison between quality of evaluations

PostPosted: 14 Dec 2005, 17:12
by H.G.Muller
Alvaro Begue wrote:And why can't you incorporate opponent modelling in minimax?

Actually, minimax is perfectly suited to non-zero-sum games as well. The only requirement is that the participants take turns optimizing their own evaluation score; there is no reason at all why the score of one should be the exact opposite of the score of the other, as in a zero-sum game. More important is that all players can calculate the outcome of their own moves and those of their opponents in advance (perfect information). Minimax is of little use in card games like Bridge, even though they are zero-sum games.

Consider, for instance, a game where you alternately move the same king on a chess board, one side being rewarded for moving north and the other for moving east, with the game made interesting by declaring certain squares, among them a8, inaccessible. Both sides have orthogonal, rather than opposite, goals, but a minimax-type search works fine.

For very cooperative games the name minimax might not be very applicable, though; maximax would describe it better.
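A sketch of such a search for two players with independent goals (a two-player 'max-n'; the primitives and evaluate2 are illustrative assumptions):

Code: Select all
typedef struct Board Board;
typedef int Move;
typedef struct { int score[2]; } Pair;        /* one score per player */

extern int  gen_moves(Board *b, Move *list);  /* assumed engine primitives */
extern void make_move(Board *b, Move m);
extern void unmake_move(Board *b, Move m);
extern int  side_to_move(const Board *b);     /* 0 or 1 */
extern Pair evaluate2(const Board *b);        /* each player's own score */

/* Each node maximizes the mover's OWN component; nothing requires the two
 * scores to be each other's negatives, as negamax does. */
Pair maxn(Board *b, int depth)
{
    Move list[256];
    int i, n, stm = side_to_move(b);
    Pair best;

    if (depth == 0)
        return evaluate2(b);

    best.score[0] = best.score[1] = 0;
    best.score[stm] = -1000000;               /* worst case for the mover */

    n = gen_moves(b, list);
    for (i = 0; i < n; i++) {
        Pair p;
        make_move(b, list[i]);
        p = maxn(b, depth - 1);
        unmake_move(b, list[i]);
        if (p.score[stm] > best.score[stm])
            best = p;
    }
    return best;
}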

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 01:05
by Uri Blass
I am running an experiment to compare the quality of the evaluations of Movei and TSCP when both use the same search algorithm.

Results so far in the Noomen match:

1) TSCP (2 plies) - Movei (1 ply):    72-28
2) TSCP (3 plies) - Movei (2 plies):  75-25
3) TSCP (4 plies) - Movei (3 plies):  56.5-43.5
4) TSCP (5 plies) - Movei (4 plies):  56.5-43.5
5) TSCP (6 plies) - Movei (5 plies):  20-24

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 04:34
by Uri Blass
Unfortunately I had to stop the match because there seems to be a bug in the ChessBase GUI.

I saw Movei get depth 1 instead of depth 5; based on the logfile it simply did not play, and Fritz is cheating.

The result was 26-21 for Movei when this seems to have first happened, but maybe it happened earlier and I did not pay attention to the problem.

Uri

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 10:18
by Ryan Benitez
I do not think fixed depth is a fair way to test an eval, as some engines calculate hanging pieces, threatened pieces, entombed pieces, trapped pieces, trajectories, and more, all of which can be costly to compute. If having a lot of knowledge of the position does not outweigh the extra time spent calculating it, the eval is not a good one. If the search and time control can be made equal, normal games of chess at whatever time control you prefer seem like a fair way of testing.

Ryan

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 11:35
by Uri Blass
Ryan Benitez wrote:I do not think fixed depth is a fair way to test an eval, as some engines calculate hanging pieces, threatened pieces, entombed pieces, trapped pieces, trajectories, and more, all of which can be costly to compute. If having a lot of knowledge of the position does not outweigh the extra time spent calculating it, the eval is not a good one. If the search and time control can be made equal, normal games of chess at whatever time control you prefer seem like a fair way of testing.

Ryan


I do not agree.

1) I never claimed that a better evaluation means a better engine, and I think that calculating hanging pieces or trapped pieces can certainly be part of a good evaluation.

2) Using normal games of chess is not an acceptable way to test the quality of an evaluation. For example, by that definition, if you improve the move ordering you may get a significantly 'better evaluation' even though the evaluation remains the same.

Even by my definition, move ordering may change which move is chosen when several moves have equal scores, but I believe this effect is usually small (it can be big when the probability of two moves having the same score is high, but I believe that is a significant factor only with a very simple evaluation, such as material-only).

Note that I believe that not caring too much about speed, and taking care to detect trapped pieces, may be productive even if the results at normal time controls are not good, because the knowledge may later be used for better pruning decisions, and you can always decide later not to calculate the full evaluation when the remaining depth is too small.
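One common way to 'decide later' is sketched below (a lazy-evaluation sketch; the margin and the split into cheap and expensive parts are illustrative, not Movei's actual code):

Code: Select all
#define LAZY_MARGIN 300                    /* illustrative, in centipawns */

typedef struct Board Board;

extern int eval_cheap(const Board *b);     /* material and simple terms */
extern int eval_expensive(const Board *b); /* trapped pieces and the like */

int evaluate(const Board *b, int alpha, int beta)
{
    int s = eval_cheap(b);

    /* If even a generous bound on the slow terms cannot bring the score
     * back inside the alpha-beta window, skip them and save the work. */
    if (s + LAZY_MARGIN < alpha || s - LAZY_MARGIN > beta)
        return s;
    return s + eval_expensive(b);
}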

Uri

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 12:18
by Daniel Shawul
Uri Blass wrote:1) I never claimed that a better evaluation means a better engine, and I think that calculating hanging pieces or trapped pieces can certainly be part of a good evaluation.


Just an idea on evaluating hanging pieces: I don't think it is a good idea to add or subtract score just because you know a piece hangs. It tells us there is a threat, which we may use to guide the search better; besides, a normal quiescence search resolves all capture/non-capture sequences. Evaluating pinned and trapped pieces may be a good idea, as this is related to mobility; though if we generate pinning moves in qsearch, evaluating pinned pieces might not be that important either.
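For reference, a minimal quiescence search of the kind meant here (a sketch with assumed engine primitives; per the correction below, check sequences can be generated as well):

Code: Select all
typedef struct Board Board;
typedef int Move;

extern int  evaluate(const Board *b);           /* static eval, mover's view */
extern int  gen_captures(Board *b, Move *list); /* assumed engine primitives */
extern void make_move(Board *b, Move m);
extern void unmake_move(Board *b, Move m);

int qsearch(Board *b, int alpha, int beta)
{
    Move list[256];
    int i, n, score, stand_pat = evaluate(b);

    if (stand_pat >= beta)                  /* standing pat (doing nothing) */
        return beta;                        /* already refutes this line */
    if (stand_pat > alpha)
        alpha = stand_pat;

    n = gen_captures(b, list);
    for (i = 0; i < n; i++) {
        make_move(b, list[i]);
        score = -qsearch(b, -beta, -alpha);
        unmake_move(b, list[i]);
        if (score >= beta)
            return beta;
        if (score > alpha)
            alpha = score;
    }
    return alpha;                           /* all capture chains resolved */
}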
Daniel

Re: comparison between quality of evaluations

PostPosted: 16 Dec 2005, 12:21
by Daniel Shawul
Correction:
...all capture and check sequences...