comparison between quality of evaluations

Programming Topics (Computer Chess) and technical aspects as test techniques, book building, program tuning etc

Moderator: Andres Valverde

Re: comparison between quality of evaluations

Postby H.G.Muller » 16 Dec 2005, 13:42

If you look at it in a broader perspective, evaluation falls in he class of 'convergence-accelerating techniques'. A program that searches all the way upto checkmate (e.g. tablebases) would not need evaluation, it just counts DTM. Evaluation is an (attempted) approximate way to predict what further search would uncover.

(Convergence-accelerating techniques are often used in numerical analysis. For instance, if one has to calculate the infinite series 1-1/2+1/3-1/4+1/5-1/6+...., one can make a computer simply do the addition brute force (the chess equivalent would be fixed-depth minimax), adding terms (increasing depth) until the result is stable to the desired precision. A mathematician can tell you that the eventual result equals ln(2)=0.693147, and the computer gets there by brute force only after a million terms, and not even a supercomputer could get you to 12 digits this way. One can try to be less dumb with the additions, and not stop abruptly (which makes the result alternately too high and to low) but take the average of to consecutive terminations, which amount to adding the last term not fully, but only half (the equivalent of QS, where you don't consider all moves in the last ply of the search, but only a fraction). The number of good digits grows much faster then. (Try it in your favourite spreadsheet!) You can push this approach further by, e.g. including the last 2 terms you consider only with weights 0.75 and 0.25, respectively (a smarter and deeper 'QS'). On the other hand, if you are a mathematician, (but don't have a logarithm table at hand), you would realize that you can guess the remaining part of the sum, e.g. by approximating it as an integral, and that everyting beyond the last term 1/N you did take into account explicitly adds up to ~ -0.5/(N+0.5+0.25/N). Correcting every partial sum with this simple formula (the 'evaluation') produces a result that is correct to 6 digits already after 20 terms, rather than a million! The correction is only applicable, though, to the abrupt termination of the series. If you took the last term into account only by half, you would need another correction. So you cannot speak of a 'better correction' in general, the correction has to be tailored to the way you terminate the brute-force calculation. A comparison is only meaningful between corrections in their own context.
This example might not appeal at all to non numerical analistst.... :? )

Thus, as any convergence accelerator, the evaluation is highly dependent on the conditions under which you terminate your search. For instance, a search algorithm without quescence search would need a very good static exchange evaluator. In a program that exhaustively search all captures in QS, like TSCP, an SEE would be completely useless, since it would always contribute zero at evaluation time, since eval is only called when there is nothing left to (profitably) capture. If the SEE is not perfect, it might actualy hurt: in QS nodes where all captures are tried and found to be no good (i.e. lose material) we fall back on the evaluation, and if the SEE erroneously thinks that there is a good capture (because of subtle tactics exceeding its capablilities), the evaluation will be totally wrong.

Another example that I actually encontered in my minimalistic chess code posted in another thread, is end-game evaluation. Before I included a has-table, it could only do 12 plies in the KQK end-game, which in general is not enough to see the mate. Without any help from the evaluation the program than would run with one of its pieces to a corner and keep sulking there, never finding the mate, and hardy winning any games. A simple evaluation that was helpfull draws the king through the center of the board though piece-square tables, but even that would not bring the mate within the horizon, certainly not in KRK. It needed to include an evaluation rule that awarded the winning king for being 'close' to the losing king, e.g. in the same 4x4 corner of the board. This very rough ad-hoc measure for the DTM makes it find the mate always. But now that I implemented the hash table even the KRK mate is found easily in the search (the hash table is large enough to build the full tablebase), and the rule only hurts. Evaluation points for taking the 'opposition' or pushing passed pawns forward is good for searches that are not able to find the queening within their horizon (e.g. by extensions for moving passed pawns). For programs that do search deep enough, these evaluation terms are just noise that might put it off track: pushing the pawn too fast in the KPK end game is the classical beginners mistake that gives away the win.

So the more I think of it, the more important I believe it is that the evalution is tuned to the way the search terminates. Other aspects of the seach, such as extension and move sorting, probably interact with the evaluation much less. (With the caveat that when internal iterative deepening is used to sort the moves on the score from the previous iteration, the efficiency of the alpha-beta might be strongly dependent on the evaluation. But if you go to fixed depth irrespective of time or node count, you implicitly corrected for that.)
User avatar
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Re: comparison between quality of evaluations

Postby Uri Blass » 16 Dec 2005, 14:04

My experience is that even without hash piece square table is enough for
KRK.

distance between the kings was needed only for positions like KQKR

I agree that the search of tscp is not best search for testing evaluations and I also thought that searching fixed depth with no qsearch may be better for that purpose but in that case it is obvious that bonus for side to move is not going to help so
I think that maybe it is best to search to depth x or x+1 randomely in order to do good comparison of evaluations.

The idea is that at every node after x plies it may choose randomely if to return the evaluation or return the result of one ply search.

Uri
User avatar
Uri Blass
 
Posts: 727
Joined: 09 Oct 2004, 05:59
Location: Tel-Aviv

Re: comparison between quality of evaluations

Postby H.G.Muller » 16 Dec 2005, 15:11

Well, you have seen my minimalistic program, it is very stupid and had also no move sorting. So perhaps it just did not get deep enough, or the piece-square tables were suboptimal. With KRK it just positioned its own king in the center (e4-d4-e5-d5) and was chasing the enemy king around it with the rooks by checking it...

Anyway, the point it illustrates is that the less depth you have, the more burden you put on the evaluation to guess what is beyond the horizon. Such in agreeent with the assumption that even a bare minimax search (i.e. no QS) will eventually converge with increasing depth. When the horizon is sufficiently far away, both sides will be able to steer away from the troubled spots. But this is an unproven conjecture, and convergence will be exceedingly slow at best.

Strictly fixed depth is at least well-defined environment, but it has a danger: evaluation functions from programs that have a good QS might do very poorly in such a test, although they have very good evaluation for the truly long-term aspects of the position (say, pawn structure, passed pawns, kings safety). If they regularly blunder away material because they lack any insight in tactics and purely relied on QS to reach 'tactics-free nodes', this will completely dominate their performance. This is rather unfair, because it is no crime to rely on QS if you know that your QS is good.

For your idea to work, I think one should first separate the long-term components of the evaluation from those aimed at resolving short-term tactics. But people could do that of course, for their own program they will know what to disable.

So perhaps TSCP is not such a bad choice after all: since it searches all captures in QS, the nodes at which evaluation is done are all quiet. (Well... they are actually only quiet from one side, the QS does not incorporate null-move threat detection, I think, although it tests for check-threats.) That means that tactical components of the evaluation will simply not be exercized, unless (as in the example a gave before) they actually misjudge the situation. This would result in an unnecesary error, that would have been easily avoided by disabling the relevant code. But the resulting performance penalty seems less unfair: after all, the test did uncover an imperfection in the evaluation in a judgement that would have been important in the context of the original search algorithm (or the author would not have bothered to put it in his evaluation).
User avatar
H.G.Muller
 
Posts: 3453
Joined: 16 Nov 2005, 12:02
Location: Diemen, NL

Previous

Return to Programming and Technical Discussions

Who is online

Users browsing this forum: No registered users and 55 guests