Winboard Forum

by **H.G.Muller** » 12 Aug 2009, 10:37

Currently Polyglot derives the weights of its books from a collection of PGN games purely from the frequencies that move has been played from the given position with various results, as 2*wins + 1*draws + 0*losses. This suffers from several problems. For instance, it is debatable if losses should really not be counted at all. If a position occured 10 times, all in lost games, and grandmasters played move A 8 times and move B 2 times, it is still likely that move A is better than move B. Some moves lose faster than others, and a good move might turn a lost position against a weaker opponent into a win, while a poor move would lose even to a patzer. A second problem is that the current algorithm does not take into account that the position to which a move leads might be reachable through other paths, also contributing to the accuracy of the WDL statistics. E.g. position A might have been played only 5 times, because it is an unusual transposition, and a move to position B rom it might have been played only two times, one draw, one loss. Not really enough to conclude that B is a poor move with any degree of certainty. But poition B might occur 1000 times in the PGN set through other paths, with 500 wins, 200 draws and 300 losses (for the side that had to move in position A). So the information would be there to conclude that position B is almost certainly pretty favorable for us, with ~60% score. Polyglot would currently fail to spot that, and consider only the two times the move was played from A.

If weights are supposed to represent (relative) winning probabilities, or more accurately, estimates for the expected result, they should be derived from PGN statistics in an entirely different way. (The problem of how to convert such winning probabilities in a move choice, by making the trade-off between unpredictability and quality, is an independent one, which can be easily addressed at probe time, and which I won't discuss now.)

Chess strategy is based on minimax. Look at the following hypothetical situation: A position A occurred 1000 times in the PGN set, and 900 times a move AB was played to position B, 100 times a move AC to position C. Position B occurred 950 times, and had an average result of 45%. Position C occurred 105 times, with an average result of 60%. The average result of position A was thus 0.9*45%+0.1*60% = 46.5%. Does that mean you should avoid position B? Of course not! When your intention is to play AC from A, position A will give you an average result of 60%! Is this now a far-fetched example? If move AC is so good compared to AB, why would AB have been played 9 times as much as AC? People would prefer the best move, not? But there are very plausible reasons why this situation could occur in practice. Move AC might be a novelty, a recently discvered refutation of the line. Before it was known people always played AB here, with poor results. Then move AC was discovered, with good results. After having been tested 105 times, everyone beleived AC refuted the line, so the opponents started to avoid position A, so AC never got the opportunity to statistically surpass AB.

So a logical way to build the book is to not simply look at the gross average result of a certain position, but in stead look at the result of the daughter position individual moves leading from it, and then using only using the statistics of that best move. This pure minimax approach would only work if the average result (whatever form of "averaging" was used) for each position was precisely know. In practice you would of course run quickl into positions with such a small occurrence that this is no longer the case, and you have to compromise. If a position occurred only twice, two different moves were played, one winning and the other losing, it makes no sense to say "from here I would play that move which had a 100% result, so this position will give me a 100% result". More reasonable to say would be: "This position occurred two times, and I have not enough statistics to decide which move was better, so I assume they are equally good, and together they scored 50%". So depending on the statistical error in the average result of each of the daughter nodes, you woud have to combine them to get an average result and statistical uncertainty in it for the current position. In the limit of infinitely large occurrence, where the uncertainties vanish, you would use minimax, in the other limit, of very low occurrence, you would use plain averaging. For intermediate occurence frequencies you would have to use some transition between the two. I am pretty sure a mathematically sound formula could be derived from this based on Bayesian statistics (although I have not yet done so). I was told that thi is essentially how Monte-Carlo search works as well, so perhaps the required formula is already known from that area of research.

Of course the ratings could also be taken into account. If a position occurred 10 times, with 60% result (for white), but the average rating of the white players was 300 higher than that of the black players, this would indicate that the position is actually very favorable for black (as based on the ratings an 85% result should have been expected). This is a bit dangerous, as it assumes the rating differences are not based on uperior opening knowledge. So perhaps the rating difference should be discounted somewhat, and more so as the position occurred deeper into the game, where the players already have "used up" the opening component of their rating. (This in itself would be interesting to investigate: how does a rating difference translate into an average result for positions other than the standard opening position.)

I think it would be very worthwile to implement such an algorithm in the book-building mode of Polyglot.

by **Michel** » 12 Aug 2009, 14:30

It is confusing to think about this. Applying minimax seems to discard information.

In your example move AC seems better but this is perhaps since it was played less and people have not yet learned to refute it. Whereas line AB is clearly a well understood solid line.

Basing the rating of A on the value of AC discards a lot of information.

I think one should aim for a (weight,uncertainty) pair (in a Bayesian sense). I am not sure minimax is the best way to deal with this (how do you incorporate uncertainty?) but there are other search algorithms that might be investigated like conspiracy numbers.

by **H.G.Muller** » 12 Aug 2009, 16:59

I agree. Minimax could ony be what you get when the widths are negligible, and when you owuld alas go for the best move.

But the weights should be consistent. If we really think that B is 45% and C is 60%, then A would only be 46.5% if we play the moves with probabilities proportional to the frequencies with which they were played, totally disregarding the average results. (And if we did that, the average result we calculate from it would of course not be very relevant, as they are apparently not used in the move selection process anway.) In practice we will bias the move choice in some way towards moves with high scores. This bias should then be taken into account when backing up the statistics towards the root.

E.g. say that we would reward every 1% of expected result by a 1.2 times larger probability to be played, the move AC would be upgraded by a factor 1.1^15 = 15.4. So in stead of 9:1 we would play AB and AC in the ratio 9:15.4, or 37% AB and 63% AC. The expected result of position A should then be 0.37*45% + 0.63*60% = 54.45%, in stead of 46.5%.

by **Guenther Simon** » 12 Aug 2009, 18:12

H.G.Muller wrote:I agree. Minimax could ony be what you get when the widths are negligible, and when you owuld alas go for the best move.

But the weights should be consistent. If we really think that B is 45% and C is 60%, then A would only be 46.5% if we play the moves with probabilities proportional to the frequencies with which they were played, totally disregarding the average results. (And if we did that, the average result we calculate from it would of course not be very relevant, as they are apparently not used in the move selection process anway.) In practice we will bias the move choice in some way towards moves with high scores. This bias should then be taken into account when backing up the statistics towards the root.

E.g. say that we would reward every 1% of expected result by a 1.2 times larger probability to be played, the move AC would be upgraded by a factor 1.1^15 = 15.4. So in stead of 9:1 we would play AB and AC in the ratio 9:15.4, or 37% AB and 63% AC. The expected result of position A should then be 0.37*45% + 0.63*60% = 54.45%, in stead of 46.5%.

There is much more to think about from your original example. Only calculating the probability is not enough.
You also need to look at the date the move was played(of course also at how strong the players were and how their
elo difference was!), if a move lost 6 times, drew 6 times and won only 3 times,
but all losses were in 2000-2006, when since 2007 only wins and draws happened it is clear that the move later
was rehabilitated by better moves in a distant ply. Of course this all is a bit academic, because I don't think
any software exists which really takes all into account and IMO a lot of work is already done _before_ making
the book itself by bookmaking software. Good books are made out of already prefiltered and pretuned special pgn files
and still they need to be aftertuned manually and so on, as Mark already said.

Guenther

by **H.G.Muller** » 13 Aug 2009, 16:44

OK, I have given this problem some more thought. At first I was wrestling with the Bayesian approach. The usual prior, assuming independent homogeneously distributed probabilities for the winning chance of each move, was leading to conclusions that seemed intuitively wrong. E.g. if 10 moves would have been played one time each, 5 winning, 5 losing, Bayeisan estimates for the winningprobabilit would be 66% and 33%, respectvely, and a stratgy that would only play the moves that had been winning would assign the entire position a 66% winning probability. Yet this would be the typical result for 5 equally strong moves, each with 50% winning chance (meaning that the position also had 50% winning probability).

The crux seems to be the assumption that the winning probabilities are independent. This assmption is false, and makes a very poor prior. The winning probabilities are very much dependent, as these are all moves from the same position, which are so close in evaluation that expert players do not prefer one of them over the other. This constitutes evidence that the moves in fact should have equal winning probabilities.

So in calculating the likelihood of wnning probabilities for each move, a prior is needed that penalizes a difference in winning probabilities between equally played moves. Only stronger evidence that the moves have unequal winning probabilities (like a 60 out of 100 vs a 30 out of 100 score) should be able to overrule this prior, while weak evidence (like 1 out of 1 vs 0 out of 1) should hardly dent it.

In case of unequal frequencies, the playing frequencies itself (but not scores) should be used to derive a prior which assigns the best likelihood to a winning probability with a certain difference between that for the most-played move and the lesser-played move. The prior winning probablities themselves could be estimated in a Bayesian manner by adding one prior game to each move, so that moves that have been played 2 vs 1 times would not assume as much difference in strength as 200 vs 100. (Namely 3:2 and 201:101.) Perhaps this prior number of games should be made an adjustble parameter, so you could set it to, say, 10. (Which would give 12:11 vs 210:110 for the estmates of the relative playing frequencies.) It would be necesary to assume some relation between playing-frequency ratio and winning-probability differerence to then extract the most-likely winning probabilities for each of the moves based on their average scores.

Given these wnning probabilities, and the strategy for selecting moves based on winning probabilities, the "average result" could then be backed up towards the root of the opening tree in a consistent manner.

The point that Guenther makes is valid, but I suspect that backing up "average results" in a consistent way would automatically tae care of it. If a move gets rehabilitated by a later move in an unambiguous way, the node where that happens would get assigned a new "average result" which is very different from its over-all statistics, beause it gets dominated by the new and much better move. This altered score would then backpropagate all along the line.

by **Daniel Uranga** » 14 Aug 2009, 18:42

Is a very bad idea using strong engines to weight leaf nodes? The results could be then back-propagated to the rest of the book.
Rybka is stronger than most GMs even without book...

by **H.G.Muller** » 30 Aug 2009, 14:09

OK, my final conclusions about the weights:

The process can be split into three parts:
1) deriving estimates (value + error bar) for the score% of the individual moves from a position
2) using the estimates to assign each move a probability to be played
3) using move score% and playing probabilities to derive an estimate for the position score%

The result of (3) is then recursively fed into (1) of the parent node, negamax fashion (i.e. flipping the score%).

About (2):
This seems to require an adjustable parameter tuning the book between strength and variation. With which frequency are you willing to play a move that gives a 1% worse result? It seems logically to suppress moves with the same difference in score% by the same frequency factor. E.g a factor 2 per %. So that a move that leeds to a 1% lower score (in your estimate) would be played twice less frequent, and a move that scores 3% lower 8 times less frequent. (I.e. the playing probability should be an exponential function of the estimated score%.) So when you would have 3 moves, scoring 54%, 53% and 51%, respectively, they would be played in the ratio 8:4:1 for this particular setting of the variability control. These would then be the ratio of the weights that go into the Polyglot book. That still leaves a scaling factor undetermined; This could be used to hide a clue as to the reliability of the weights in the book. If the weights were based on etimates with a low standard error you could make them all very large, if the errors were large, the weigths could all be made small.

About (1):
This is a tricky Bayesian problem. The key to a solution is that the fact that expert players choose the moves is strong prior evidence for the fact that the moves should lead to very similar score%. E.g. if the position was played 10 times, with 10 different moves, 5 wins and 5 losses, it is very likely that all moves lead to about 50% score, and very unlikely that five of them would lead on average to 100% score and five of them to 0% score. The Bayesian estimate in case of completely independent move expectations for a 1-out-of-1 result would be 67%, btw, ((k+1)/(n+2) k=n=1), but even this awards the moves that happened to end in a win way too much. The fact that there were many moves that lost, and that were played equally often, really should be held strongly against the moves that did win as well.

So there are two pieces of evidence that should go into the estimates for the score% of the moves: the probablilyy with which they were played, and the score% they themselves yielded. The latter are obtained from the statistics, (at the leaves), or recursive application of the propagation algorithm. In both cases this will give us a socre% r_i with a standard error s_i for each move i. The question is how to incorporate the evidence from the playing frequency.

The basic assumption is that good moves are played more often thn poor moves, by strong players. The stronger the player, the more he will discriminate against poorer moves. I.e. a moves that looks like it will score 1% less might be totally rejected by grand masters, played in 10% of the case by club players, and nearly equally often by coffee-house players. So some adjustable parameter is needed here that tells us how much confidence we have in the players of the PGN collection from which we derive the book. We assume that the (relative) playing probabilities can be converted to a (relative) perceived score% by some multiplicative factor. E.g if it is played 2 times as much, it is likely perceived as 1% stronger, (so for 8 times as much, 3% stronger). The playing probabilities have to be estimated from the playing frequencies, which will be subject to a statistical error. This error translates to an error bar for the estimated percieved score%.

Now (and this is a very important point) even GMs are not infallible. So the perceived score% need not be the actual score%. No matter how small the error bar in the perceived score% deduced from the playing statistics, it can always be off compared to the true score%. E.g. if move A ws played a milion times with 40% result, and move B 10,000 times with 60% result, (so from the statistics you woud derive playing frequency of 91% +/- 0.1% and 9% +/- 0.1%, respectively, making it very, very certain that GMs prefer A over B, the score% of the moves equally certainly shows that the GMs were wrong and B actually is the better move. So the standard error in the prior score% estimated from the playing frequencies alone (and not the results) is the sum of the statistical error in the playing frequencies (which is roughly 1/sqrt(N) for a move played N times) and the "judgement error" J. Added according to Pythagoras (i.e. sqrt(J*J + 1/N)) because these are independent errors.

So we end up with priors R_i for the score% given as V*log(N_i) with standard errors S_i = V*sqrt(J*J + 1/N_i). As the priors are relative, they only predict something for the difference between the score% of the moves, not the score% itself. (e.g. all moves could be lousy, but you have to play something.) So they must be shifted by a completely undetermined score% to best coincide with the actually observed score% r_i. This is done by shifting them over the average discrepacy D = <r_i - R_i> In this average we have to take account of the relative reliabilities of these differences, i.e. it has to be a weighted average, where (as usual) the weights are inversely proportional to the squares of the standard errors. If s_i is the standard error in r_i, this would give weights of 1/(s_i*s_i + V*V*(J*J + 1/N_i))^2. Let's call R'_i = R_i + D these shifted prior score%.

Now the most-likely true score% of a move i, in view of the prior, would be a weighted average Q_i = (w_i*r_i + W_i*R'_i)/(w_i + W_i), where the weights are the inverse squares of the standard errors s_i and S_i in these quantities. This would cause the intuitively correct behavior: for positions with very poor statistics (N_i ~ 1) we would have S_i ~ V << 1 (as J << 1). Due to the small number of games the score% would also have an error s_i ~ 0.5. So the estmate R'_i based on playing frequency and total win percentage of all moves together would be far more accurate than tht of the observed score%, and get a correspondingly larger weight in the average, to the point where it i mainly the over-all statistics of all moves together that determines the score% estimate of every individual move. OTOH, if N_i gets asymptotically large, S_i would tend to the fixed value V*J, while s_i ~ 0.5/sqrt(N_i) would tend to zero. So the observed score% would dominate over the estimate determined from the playing frequency. A move with a very certain bad score record would be estimated worse than one with a certain good score record, even when it had be played a lot more.

About (3):
Finally the score% estimates Q_i would have to be converted to playing probabilities p_i of the moves, in accordance with the setting of the quality / variability trade-off. (e.g. p_i = exp(b*Q_i)/SUM_i{exp(b*Q_i)}.) Once these probabilities are decided, it becomes possible to calculate the expected score% of the position, as SUM_i{p_i * Q_i}. Note that if one move has a much better performance Q_i then all the others, it basically gets p_i = 1, as the exponential function kills all others. So this single move then determines the score% of the node (minimax). When all moves (sa M of them) have the same Q_i = Q, they each get equal probability p_i = 1/M, and the total node gets the same score% Q. The standard error in this position score% will be smaller than that in the individul Q_i, because it is an average. In principle (if the score% of the various move are independent) the error in the position score% would be sqrt(SUM_i{p_i*p_i * e_i*e_i}), where e_i is the error Q_i. So when one move has p_i = 1, this would be equal to the e_i of this move. But with M equivalent moves it would be sqrt(M*{1/M*1/M * e*e}) = e/sqrt(M).

Note that the error e_i in the score% estimate Q_i includes a contribution from r_i (weighted by w_i) and R_i (weighted by W_i), but also from the average discrepancy D (wich was included in R'_i). This latter contains the error in the averaged r_i, based on all movees, and is thus smaller than the error s_i in an indvidual move. But under conditions where S_i is very small (poor statstics) this error would dominate the over-all error. We will have to apply the usual formulae for error propagation based on how the position score% is derived from the individual r_i and R_i, (for given p_i), including the contributions hidden in D. And the assumption that all observed score% of the daughter positions are independent could be violated if two such move indirectly lead to the sme position due to move transposition. (But it seem best to ignore that for now.)

by **Jon** » 19 Feb 2013, 10:54

Assuming you are working from an archive of games between strong human players, there are a few other trends that might have an influence on the results you see. When a GM faces a lower rated player, you often see them choose offbeat or ultrasharp moves in order to get their opponent away from what he knows, and thinking for himself, thus capitalizing on the GM's superior knowledge. When the same GM is faced with a stronger opponent, he may go with solid mainstream lines in order to avoid being thrown into an unfamiliar situation.

The impact of this is that offbeat moves can develop quite good statistics because they are most often played against weaker players. The raw WDL stats give the offbeat move too much weight, but you can mitigate this in part by using the performance Elo of the moves.

Minimax is an excellent principle to incorporate, but one of the problems is that games especially at lower levels are riddled with blunders, so the game result itself is not such a useful stat to look at. Book makers try to get around this by focusing only on games between very high rated players, but then you run into the problem of there being too few games in some of the lines to generalize their results.

After Chess Openings Wizard came out, many books were produced backsolved/minimaxed from engine evaluations as Daniel Uranga was suggesting, but the quality of these books is noticeably lower than books based on high quality GM games. It seems to be the case that engines when they play without books in the opening lose quite a few Elo points, and thus it becomes apparent that the evaluation of most engines was not designed with the opening in mind. In the middlegame however, we do not have several centuries of theory/games to fall back on, so there is no choice but to rely on strategic principles and/or calculation/evaluation.

Of the systems mentioned, I do think a weighting algorithm that references performance Elo minimaxed in some way would probably be better than many existing bookmaker's algorithms.

A better algorithm for Polyglot would be most welcome. It would also be interesting if WinBoard could minimax/backsolve user or engine evaluation scores for the developing tree of analysis in middlegames and endgames. This kind of feature would be of great use to correspondence players. These new WinBoard features displaying multipv's and allowing one to exclude moves from the analysis with UCI are two great steps in that direction and much appreciated.

by **crystalclear** » 22 Feb 2013, 05:21

Another factor to consider with an opening book is its intended purpose.

One opening book might be useful for engines in competition trying to win, where it doesn't matter if an engine plays a Bobby Fischer line repeatedly and wins with it every time. That same opening book might bore someone to death as they want some varied games and the computer only ever plays the same lines.

The Glass opening book format has a sort of move quality field eg normal, tournament, etc.
While that is quite useful in a single book supplied with an engine, I think it'd be better to have two books if you really wanted to experiment, and enter a tournament.

I looked at Fischers Ruy Lopez games once. He won with it so regularly that people became afraid to even try it when Fischer had white.
So I asked myself what would Fischer have played in a Ruy Lopez as Black. Mostly he played Sicilian defence, so there wasn't much to look at, but I did find one game
and he played Bc5 quite early on - a variation that was rarely played by his opponents.

So now here is where that is relevant to opening books? What's likely to be a good move for black? The Bc5 that Fischer played and (if my memory is right) used to win, or a popular move with a reasonably good percentage record, but that Fischer seemed to be able to beat every time?

So for tournament book creation, I don't think weights are that important. Books will have a sort of branching factor for your moves and a branching factor for opponents moves. If your branching factor is one - a single best move stored, then I guess you will be able to store about double the depth in a book compared to having a branching factor matching your opponents.

And what is the likely outcome when you emerge from your book?
Let's assume you emerge from book at depth 2N instead of depth N.
When you reach depth N, you should I think have a better expected score than if you have a large branching factor in the book for your moves. This is because not all moves will be equally good. By putting 5 moves from a position into a book, you are conciously deciding that you will play inferior moves some of the time, unless all the moves happen to be equally good, and I think that is unlikely. Even if the game theoretical value (win,draw,loss) is the same, one position might require greater opponent search depth or more accurate move play in order to succeed. So why damage the opening outcome by including apparently inferior lines? The reason is to protect yourself from repeated defeat by the same game if you get it wrong. But like a human player, if you get your opening book wrong then its time to improve. Like in real life - I should stop playing the elephant gambit.

But not only does a book with no branching for your moves (only for theirs) give a better score at depth N, but by continuing to depth 2N say, it gives hopefully a better score at depth 2N than you would get using the engine. This is because the book should be including accumulated knowledge from many games with long time controls and quality players. So we ought to get the advantage of the superior player, and the longer thinking time, and multiple games, all contributing to the book quality and improving over our one-off over the board play. Then when we come out of book, we should also have a time advantage over an opponent that has branching in his opening book both for his own moves and his opponents.

I have used this approach to building my own opening book and it can deliver at least 2000 different checkmates straight out of the book.
Really, it would make more sense to stop the book once there is a sufficient advantage, or when even when a large enough advantage can be seen by a quick search.
It might also make sense to remove moves from a book where even the most limited search finds the correct continuations.
Those are things I haven't done.

I just ran a little test on FICS. I started a 5 minute game against Singularity (Critter 1.6a) and Singularity was down to 4:18 when my engine came out of book.
Against other engines they are often down to less than 4 minutes. That gives my engine 25% more thinking time than its opponent. Next game Singularity out of book at around move 4 it seemed, mine out of book around move 8, 4:03 on Singularities clock. And my eval was at 0.38 if I remember (playing white).

==

On the original question of the weights in an opening book, I see it as a numerical field that people can interpret pretty much how they feel. Maybe as a winning probability, or maybe as an indication of how often should that move be played. I simply add up the sequence of weights of the possible moves to get a series of weights (sigma 1 to i) and than pick a random number in the range up to the final total. Any complex calculations of probabilities can be done to give the probability of making the move instead of the probability of winning. These probabilities of making the move are then effectively the move weights. And then using the book is then back to being simply based on move probability being proportional to move weight - which is how I think it should be.

==

My mind is open on the question of whether it would be more logical to have a white book and a black book than everything in one book.

Winboard Forum

Polyglot book building: deriving weights

Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Re: Polyglot book building: deriving weights

Who is online