BayesElo and the prior

BayesElo and the prior

Postby H.G.Muller » 30 Jul 2007, 10:24

I am starting this new thread on BayesElo because there seems to be something disastrously wrong with the way the prior is applied if you are not using BayesElo on the results of a round robin. This can lead to an unrealistically compressed rating list, for instance in the case of a Swiss tournament with a very wide rating spread. In the Promo Division of Chess War, the ratings calculated by BayesElo with the default setting for the prior seem to be, on average, at least a factor of 2 too low. Setting prior=0 gives more realistic ratings, but leaves the result very sensitive to statistical fluctuations (which tends to exaggerate the placement of individual players on the scale).

So in summary:
* Without prior, the ratings are spread out over a realistic range, but individual programs are not placed accurately on the scale, because the analysis ascribes extreme scores entirely to skill, even when they are very likely due to luck.
* With prior, the ranking of the players is more realistic, but the overall rating scale gets strongly compressed.

It would be nice if a better procedure for applying the prior could be developed, one that gets both the ranking and the scale correct at the same time.

Re: BayesElo and the prior

Postby H.G.Muller » 30 Jul 2007, 11:52

Let me first reiterate the prior procedure (and its justification), which consists of supplementing the set of games with some virtual draws.

In a coin-flip experiment, where the probability for heads is p, and for tails thus q = 1 - p, the probability for having K times heads in N flips is proportional to

Q = p^K * q^(N-K)

In the Bayesian treatment, this probability is 'inverted' and interpreted as the likelihood that the probability for heads in a single event is p, given that K heads were observed. (Originally it was the probability of K heads, given that the probability for heads in a single event is p.)

Now we can calculate the value of p for which Q is maximal (the maximum-likelihood estimate of p). This is most easily done by maximizing Z = log Q:

Z = K*log(p) + (N-K)*log(1-p)

Taking the derivative and setting it to zero:

Z' = K/p - (N-K)/(1-p) = 0
K*(1-p) = (N-K)*p
K = N*p
p = K/N

This is what you would normally (= in a non-Bayesian approach) guess: the observed fraction of heads events is the estimate of the probability.
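
To make this concrete, here is a minimal numerical sketch (the values K = 7 and N = 10 are my own illustrative choice, not real data) confirming that Q(p) = p^K * (1-p)^(N-K) peaks at p = K/N:

Code:
# Sketch: confirm numerically that the likelihood Q(p) = p^K * (1-p)^(N-K)
# is maximal at p = K/N. K and N are illustrative values.
import numpy as np

K, N = 7, 10                        # say, 7 heads in 10 flips
p = np.linspace(0.001, 0.999, 9999)
Q = p**K * (1 - p)**(N - K)         # likelihood of each candidate p

print(p[np.argmax(Q)])              # ~0.7
print(K / N)                        # 0.7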

Now it turns out that, although this estimate gives you the maximum likelihood of your p being exactly right, on average your value for p will be wrong! K/N systematically overestimates the difference of p from 1/2. This is clearest for 100% (or 0%) scores: with K=N you get p=1. Indeed, p=1 makes a 100% score maximally likely (namely a certainty). But there are many p slightly below 1 for which the score is also quite likely, while by definition p can never be larger than 1. So on average (testing all kinds of coins, and passing only those that scored 100% heads) p will be below 1.
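
This selection effect is easy to see in a quick Monte-Carlo sketch (entirely my own illustration, assuming a uniform prior on p and N = 10 flips): among the coins that happen to score 100% heads, the average true p is clearly below 1.

Code:
# Sketch: draw coins with p uniform on [0,1], flip each N times, keep only
# the coins that scored 100% heads, and average their true p.
import numpy as np

rng = np.random.default_rng(1)
N = 10
p_true = rng.uniform(0.0, 1.0, 1_000_000)   # one coin per sample
heads = rng.binomial(N, p_true)              # heads scored by each coin

survivors = p_true[heads == N]               # coins with a 100% score
print(survivors.mean())                      # ~(N+1)/(N+2) = 0.917, not 1.0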

The Bayesian formula for Q(p) allows you to calculate this average exactly, as a normal expectation value:

E(p) = INTEGRAL p*Q(p) dp / INTEGRAL Q(p) dp

Doing some integration by parts (sorry about that...) then gives

E(p) = (K+1)/(N+2)

This is the same as what you get with the normal estimate for p when you supplement your sample of coin flips with two 'virtual' flips, one heads and one tails, before calculating the heads fraction. (In Chess, where there is also a draw possibility, you could add two virtual draws to the match result.)

The Bayesian expected p (which is the basis for the awarded rating) can thus be obtained through the 'trick' of adding virtual draws before doing the naive (non-Bayesian) calculation. This trick gives you the exact result for any value of K and N, and is thus very useful. It is not completely obvious, though, whether it can be generalized to having more than one match, between more than two players, while retaining these properties of simplicity and exactness.
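
As a sanity check, here is a small numerical sketch (with my own illustrative values, a 100% score of K = N = 10) verifying that the expectation integral indeed equals (K+1)/(N+2), i.e. the score fraction after adding one virtual win and one virtual loss (two virtual draws in chess scoring):

Code:
# Sketch: E(p) = INTEGRAL p*Q(p) dp / INTEGRAL Q(p) dp  versus  (K+1)/(N+2).
import numpy as np

K, N = 10, 10                        # a 100% score
p = np.linspace(0.0, 1.0, 200001)
Q = p**K * (1 - p)**(N - K)          # unnormalized likelihood

E_p = (p * Q).sum() / Q.sum()        # uniform grid, so the dp factor cancels
print(E_p)                           # ~0.9167
print((K + 1) / (N + 2))             # 0.9167, the 'two virtual draws' score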

Re: BayesElo and the prior

Postby H.G.Muller » 30 Jul 2007, 12:54

Now the problem is that, in the multiplayer case, it does not seem possible to keep the nice property that the number of virtual draws you have to add is independent of the number of games played (and of who played whom). This is easy to see from the cases where the 'pairing network' has a tree topology (i.e. no loops).

In such a case, the only win probabilities that occur in the expression for Q are those between players i and j that actually played games against each other (i.e. for which N_ij != 0). These probabilities can all be chosen independently, and the likelihood Q of a certain set of probabilities p_ij is simply the product of a number of factors like the Q(p) in the binary-encounter case of the previous post.

If we apply the same procedure of calculating the Bayesian expectation of the win probabilities, we can simply calculate it for each p_ij separately, because Q factorizes. This leads to the outcome

p_ij = (K_ij + 1)/(N_ij + 2).

Again, the only thing we have to do to get the exact result is to supplement the data set with some virtual draws, two on each link of the pairing network.

It might thus seem that the virtual draws can still be applied independently of N_ij. But in fact this is not true, as no virtual draws should be applied to 'links' with N_ij = 0. Adding virtual draws on links that had no games before does, in general, change the likelihood and expectation of the other win probabilities. So at the very least we lose the convenient property that we can add the virtual draws without looking at the pairing network. For tree-like pairings this is not yet too bad, as it is easy enough to see which N_ij are zero and which are not. But for other topologies of the pairing network, it is not even obvious that no virtual draws would be needed on links without games, or how many per link would be needed in general.
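
For the tree case, the per-link recipe is easy to sketch in code (the player names and results below are made up for illustration; this only covers a tree-shaped pairing network, where each p_ij can be treated independently):

Code:
# Sketch: per-link Bayesian expectation on a tree-shaped pairing network.
# results[(i, j)] = (points scored by i against j, number of games N_ij).
# Only links that were actually played get the two virtual draws; nothing
# is added on pairs that never met (e.g. "A" vs "C" here).
results = {
    ("A", "B"): (3.0, 4),   # A scored 3 out of 4 against B
    ("B", "C"): (1.5, 2),   # B scored 1.5 out of 2 against C
}

def expected_score(k_ij, n_ij):
    # (K+1)/(N+2): add one virtual win and one virtual loss, i.e. two
    # virtual draws, then take the score fraction.
    return (k_ij + 1.0) / (n_ij + 2.0)

for (i, j), (k, n) in results.items():
    print(i, j, round(expected_score(k, n), 3))   # A-B: 0.667, B-C: 0.625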

