ELOStat algorithm ?

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

new version

Postby Rémi Coulom » 15 Jan 2005, 13:58

Hi,

I have made a new version that fixes the problems reported by Dann (Thanks, Dann):
http://remi.coulom.free.fr/Bayesian-Elo/

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

New version of ELOStat

Postby Rémi Coulom » 22 Jan 2005, 15:00

A new version of ELOStat is available (on the Arena web site):
Changes from version 1.2 to 1.3:
  • maximum number of different players/programs increased to 1500
  • algorithm for calculating the confidence intervals completely changed (now uses the so-called nonparametric ABC method (approximated bootstrap confidence) by Efron and Tibshirani). Many thanks to Dr. Jeff Lischer (US) for drawing my attention to this fantastic method, and to all users who pointed out the insufficiencies of the old method.
  • some minor bugs in individual statistics output removed

Here is a link that explains bootstrap:
http://en.wikipedia.org/wiki/Bootstrap#Statistics
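
For readers who prefer code to prose, here is a minimal sketch of the plain percentile bootstrap (not the ABC variant used by ELOStat 1.3): resample the game results with replacement, recompute the performance rating each time, and read the interval off the empirical percentiles. The game results and the logistic conversion below are illustrative assumptions, not ELOStat's actual code.
Code: Select all
// Plain percentile bootstrap for one player's performance rating.
// Illustrative sketch only; ELOStat 1.3 uses the more refined ABC method.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Convert an average score (0..1) into an Elo difference (logistic model).
double EloFromScore(double score) {
    const double eps = 1e-6;  // avoid infinities at 0% and 100%
    score = std::min(std::max(score, eps), 1.0 - eps);
    return 400.0 * std::log10(score / (1.0 - score));
}

int main() {
    // Example results against equally rated opposition: 1 = win, 0.5 = draw, 0 = loss.
    std::vector<double> results = {1, 1, 0.5, 0, 1, 0.5, 1, 0, 1, 1, 0.5, 1};
    const int n = static_cast<int>(results.size());

    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> pick(0, n - 1);

    const int B = 10000;                  // number of bootstrap replicates
    std::vector<double> replicates(B);
    for (int b = 0; b < B; b++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += results[pick(rng)];    // resample games with replacement
        replicates[b] = EloFromScore(sum / n);
    }

    std::sort(replicates.begin(), replicates.end());
    std::printf("95%% bootstrap interval: [%+.0f, %+.0f] Elo\n",
                replicates[B / 40], replicates[B - B / 40]);
}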

I believe the Bayesian approach is still much better. I have no time to do it right now, but I will probably start implementing the "inverse of the Hessian" approach in one or two weeks.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: New version of ELOStat

Postby Peter Fendrich » 22 Jan 2005, 20:00

Rémi Coulom wrote: I believe the Bayesian approach is still much better.
Rémi
Absolutely,
I think that you are on the right track, even if I have some doubts about your maximum-likelihood approach...
In a way, Bayesian methods are bootstrapping as well.
/Peter
Peter Fendrich
 
Posts: 193
Joined: 26 Sep 2004, 20:28
Location: Sweden

Re: ELOStat algorithm ?

Postby Robert Allgeuer » 23 Jan 2005, 17:17

A question to the statistics experts:

If I have understood correctly, EloStat makes various assumptions and simplifications about the distribution of the error when calculating the error margins.

What are these assumptions?
One is that the error margin against the whole set of opponents is assumed to be the same as against a single opponent with the average rating.
But what kind of distribution function does it assume? Any other assumptions?
Also, is there a paper on its method available somewhere?

Thanks
Robert
Robert Allgeuer
 
Posts: 124
Joined: 28 Sep 2004, 19:09
Location: Konz / Germany

Re: ELOStat algorithm ?

Postby Rémi Coulom » 23 Jan 2005, 18:07

Robert Allgeuer wrote:A question to the statistics experts:

If I have understood correctly, EloStat makes various assumptions and simplifications about the distribution of the error when calculating the error margins.

What are these assumptions?
One is that the error margin against the whole set of opponents is assumed to be the same as against a single opponent with the average rating.
But what kind of distribution function does it assume? Any other assumptions?
Also, is there a paper on its method available somewhere?

Thanks
Robert
This was discussed earlier in this thread. There is a CSS paper explaining the past versions of ELOStat. Dieter gave a summary (first page, dated Thu Dec 16, 2004 2:43 pm, I do not know how to link to a message).

The strange thing about ELOStat is that it computes two variances, so the assumed distribution is two half-Gaussians. That seems wrong to me, and it often generates strange asymmetric intervals. Bayeselo also generates asymmetric intervals, but usually the asymmetry is the reverse of ELOStat's. For instance, using the games of the second division of WBEC, with bayeselo:
Code: Select all
Rank Name                            Elo   +    -  games score ratio
   1 Jonny 2.70                     2654  103   90    56    45   80%
   2 Spike 0.8a                     2577   94   86    56    40   71%
   3 DanChess 1.07                  2576   92   86    56    39   70%
   4 TRACE 1.33                     2482   87   85    52    30   58%
   5 E.T.Chess 071204               2471   84   80    56    37   66%
   6 Pseudo 0.6h                    2453   84   82    56    32   57%
   7 Petir 2.0                      2443   88   86    52  29.5   57%
   8 Snitch 1.0.8                   2416   82   81    56  29.5   53%
   9 Muse 0.899b uci                2408   83   81    55  30.5   55%
  10 TheCrazyBishop 0052            2408   83   83    56    29   52%
  11 Terra 3.3b11                   2402   87   87    56    27   48%
  12 Leila 0.53h                    2396   83   83    56    27   48%
  13 Cerebro 1.30                   2391   92   93    52    25   48%
  14 Frenzee 199                    2386   86   87    56    26   46%
  15 Amateur 2.80                   2375   81   83    56  24.5   44%
  16 Scidlet 3.6                    2359   84   86    52  22.5   43%
  17 Bruja 1.8                      2356   83   85    52  22.5   43%
  18 Queen 2.45                     2338   85   86    52    23   44%
  19 Chezzz 1.0.3                   2329   85   89    56  21.5   38%
  20 Resp 0.19                      2307   87   91    52    20   38%
  21 Knightx 1.86                   2301   84   87    56    22   39%
  22 Djinn 0.870f                   2281   84   88    56    21   38%
  23 Butcher 1.53                   2263   87   94    55  16.5   30%
  24 Esc 1.16                       2224   93  102    52    15   29%

with ELOStat (1.2):
Code: Select all
    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Jonny 2.70                     : 2628   66 124    56    80.4 %   2383   21.4 %
  2 Spike 0.8a                     : 2551   72 105    56    71.4 %   2391   21.4 %
  3 DanChess 1.07                  : 2549   74 102    56    69.6 %   2405   21.4 %
  4 E.T.Chess 071204               : 2473   77  82    56    66.1 %   2358   32.1 %
  5 TRACE 1.33                     : 2469   90  82    52    57.7 %   2415   26.9 %
  6 Pseudo 0.6h                    : 2440   87  80    56    57.1 %   2390   25.0 %
  7 Petir 2.0                      : 2434   91  87    52    56.7 %   2387   21.2 %
  8 Snitch 1.0.8                   : 2415   92  71    56    52.7 %   2396   30.4 %
  9 Muse 0.899b uci                : 2407   90  74    55    55.5 %   2369   30.9 %
 10 TheCrazyBishop 0052            : 2402   94  75    56    51.8 %   2390   25.0 %
 11 Terra 3.3b11                   : 2396   81  94    56    48.2 %   2408   17.9 %
 12 Cerebro 1.30                   : 2394   91  97    52    48.1 %   2407   11.5 %
 13 Leila 0.53h                    : 2393   69  94    56    48.2 %   2405   32.1 %
 14 Frenzee 199                    : 2390   83  91    56    46.4 %   2414   17.9 %
 15 Amateur 2.80                   : 2372   74  88    56    43.8 %   2415   30.4 %
 16 Bruja 1.8                      : 2364   75  91    52    43.3 %   2411   32.7 %
 17 Scidlet 3.6                    : 2363   79  91    52    43.3 %   2410   28.8 %
 18 Queen 2.45                     : 2342   80  92    52    44.2 %   2382   26.9 %
 19 Chezzz 1.0.3                   : 2332   91  82    56    38.4 %   2415   19.6 %
 20 Resp 0.19                      : 2320   91  85    52    38.5 %   2402   23.1 %
 21 Knightx 1.86                   : 2318   88  83    56    39.3 %   2394   21.4 %
 22 Djinn 0.870f                   : 2297   91  81    56    37.5 %   2385   21.4 %
 23 Butcher 1.53                   : 2280  105  74    55    30.0 %   2427   20.0 %
 24 Esc 1.16                       : 2252  125  76    52    28.8 %   2409   11.5 %

You can notice that, for players near the top, the asymmetry in bayeselo is the reverse of ELOStat's. Those players have won more than they have lost, so it seems normal that the lower bound of the confidence interval should be tighter than the upper bound: every win contributes to narrowing the lower bound, and every loss contributes to narrowing the upper bound. For instance, if you consider the extreme case where a player has only wins, then you know a lower bound on its rating, but no upper bound. ELOStat gives the reverse asymmetry.
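
This asymmetry argument can be checked with a few lines of code: scan candidate Elo differences and keep those whose likelihood for the observed results stays within the usual chi-square cutoff of the maximum. The sketch below is a simplification (wins and losses only, no draw model, opponents treated as a single pooled rating), so its numbers only indicate the direction of the asymmetry, not bayeselo's exact margins.
Code: Select all
// Likelihood-based interval for an Elo difference, given wins and losses only.
// Simplified sketch (no draw model, single pooled opponent); it illustrates why
// a winning record gives a tighter lower margin than upper margin.
#include <cmath>
#include <cstdio>

// Probability of winning one game at Elo difference d (logistic model).
double WinProb(double d) { return 1.0 / (1.0 + std::pow(10.0, -d / 400.0)); }

// Log-likelihood of w wins and l losses at Elo difference d.
double LogLik(double d, int w, int l) {
    return w * std::log(WinProb(d)) + l * std::log(1.0 - WinProb(d));
}

int main() {
    // Roughly Jonny's 80% score over 56 games, treated as wins and losses only.
    const int wins = 45, losses = 11;

    // Maximum-likelihood Elo difference for this score.
    double p = double(wins) / (wins + losses);
    double dMax = 400.0 * std::log10(p / (1.0 - p));
    double bestLL = LogLik(dMax, wins, losses);

    // Keep every d whose log-likelihood is within 1.92 of the maximum
    // (the usual chi-square cutoff for a roughly 95% likelihood interval).
    double lo = dMax, hi = dMax;
    for (double d = dMax - 500.0; d <= dMax + 500.0; d += 0.5) {
        if (bestLL - LogLik(d, wins, losses) <= 1.92) {
            if (d < lo) lo = d;
            if (d > hi) hi = d;
        }
    }
    std::printf("ML difference %+.0f, interval [%+.0f, %+.0f]\n", dMax, lo, hi);
    std::printf("lower side: %.0f Elo, upper side: %.0f Elo\n", dMax - lo, hi - dMax);
}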

ELOStat 1.3 is an improvement in this respect:
Code: Select all
  1 Jonny 2.70                     : 2628   98  93    56    80.4 %   2383   21.4 %
  2 Spike 0.8a                     : 2551   90  87    56    71.4 %   2391   21.4 %
  3 DanChess 1.07                  : 2549   89  86    56    69.6 %   2405   21.4 %
  4 E.T.Chess 071204               : 2473   79  77    56    66.1 %   2358   32.1 %
  5 TRACE 1.33                     : 2469   83  82    52    57.7 %   2415   26.9 %
  6 Pseudo 0.6h                    : 2440   81  80    56    57.1 %   2390   25.0 %
  7 Petir 2.0                      : 2434   87  86    52    56.7 %   2387   21.2 %
  8 Snitch 1.0.8                   : 2415   77  77    56    52.7 %   2396   30.4 %
  9 Muse 0.899b uci                : 2407   78  78    55    55.5 %   2369   30.9 %
 10 TheCrazyBishop 0052            : 2402   80  80    56    51.8 %   2390   25.0 %
 11 Terra 3.3b11                   : 2396   84  84    56    48.2 %   2408   17.9 %
 12 Cerebro 1.30                   : 2394   91  91    52    48.1 %   2407   11.5 %
 13 Leila 0.53h                    : 2393   76  76    56    48.2 %   2405   32.1 %
 14 Frenzee 199                    : 2390   84  85    56    46.4 %   2414   17.9 %
 15 Amateur 2.80                   : 2372   77  78    56    43.8 %   2415   30.4 %
 16 Bruja 1.8                      : 2364   79  80    52    43.3 %   2411   32.7 %
 17 Scidlet 3.6                    : 2363   81  82    52    43.3 %   2410   28.8 %
 18 Queen 2.45                     : 2342   82  83    52    44.2 %   2382   26.9 %
 19 Chezzz 1.0.3                   : 2332   84  86    56    38.4 %   2415   19.6 %
 20 Resp 0.19                      : 2320   85  87    52    38.5 %   2402   23.1 %
 21 Knightx 1.86                   : 2318   83  84    56    39.3 %   2394   21.4 %
 22 Djinn 0.870f                   : 2297   83  85    56    37.5 %   2385   21.4 %
 23 Butcher 1.53                   : 2280   88  91    55    30.0 %   2427   20.0 %
 24 Esc 1.16                       : 2252   97 102    52    28.8 %   2409   11.5 %

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: ELOStat algorithm ?

Postby Dann Corbit » 24 Jan 2005, 19:42

Interesting is the reversal of trace and e.t.chess.
Dann Corbit
 

New version of bayeselo

Postby Rémi Coulom » 13 Feb 2005, 17:36

Hi,

I have made a new version of my program:
http://remi.coulom.free.fr/Bayesian-Elo/

This version can compute uncertainty margins assuming a Gaussian likelihood distribution, by estimating its covariance with the inverse of the negative of the Hessian. It can also use this matrix to compute a table of "likelihood that A is stronger than B".

To show how cool this feature is, I'll give a small example. Suppose that A has played 32 games against B, scoring 75%. Also, C has played 32 games against D, scoring 75%. So, we are fairly sure that A is stronger than B, and that C is stronger than D. Suppose now that B plays a game against C, and draws. Estimating confidence intervals from those results now becomes tricky. Inferring the likelihood that one player is stronger than another from those intervals is impossible.

Here is the output of bayeselo:
Code: Select all
Rank Name   Elo    +    - games score draws
   1 A      146  267  267    32   75%   25%
   2 C       16  253  253    33   74%   27%
   3 B      -16  253  253    33   25%   27%
   4 D     -146  267  267    32   25%   25%

All the ratings are very uncertain, because of that single draw between B and C. The matrix of "likelihood that i is stronger than j", computed by bayeselo (multiplied by 1000 and rounded) is:
Code: Select all
     A   C   B   D
A      689 996 861
C  310     551 996
B    3 448     689
D  138   3 310   

Here is the output of ELOStat on the same set of game results:
Code: Select all
  Program   Elo    +   -   Games   Score   Av.Op.  Draws

1 A       :  210  121 114    32    75.0 %     19   25.0 %
2 B       :   19  109 115    33    25.8 %    203   27.3 %
3 C       :  -19  115 109    33    74.2 %   -203   27.3 %
4 D       : -210  114 121    32    25.0 %    -19   25.0 %

Because ELOStat does not take covariance into consideration, and assumes that the ratings of opponents are their true ratings, it obtains very narrow confidence intervals.

As Dann would write, interesting is the reversal of B and C (Dann, is this correct English grammar? It sounds more German than English). In bayeselo, B obtains a lower rating than C because it drew against C playing white. The advantage of playing white is about 32 points in bayeselo, so this is the rating difference we get between B and C. ELOStat does not distinguish between white and black, so it would have made sense for it to give the same rating to B and C. The 38-point difference comes from the fact that ELOStat considers the average of the opponents' ratings. This particular example demonstrates that this is not a good idea.

These new features of bayeselo are cool, but not perfect. The main problem is that the likelihood distribution is not really a Gaussian. As the number of games becomes large, it gets close to a Gaussian, but the approximation is a little wrong when there are few games. For instance, if we consider the extreme case of a single draw between A and B, the inverse-of-the-Hessian approach gives a +/- 250 uncertainty (which you can see in the example above), while the true distribution has +/- 322. This, of course, is the most extreme situation; the difference between the true distribution and the Gaussian approximation is usually much smaller.
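
To make the "likelihood that A is stronger than B" computation concrete: under the Gaussian approximation, the rating difference is itself Gaussian, and the probability follows from erfc. The sketch below assumes the means, variances, and covariance have already been extracted from the inverse of the negative Hessian; the numbers are made up, and modern compilers do provide std::erfc in <cmath> (unlike the 2005 situation described further down).
Code: Select all
// Probability that player A is stronger than player B, under the Gaussian
// approximation of the likelihood: the difference eloA - eloB is then itself
// Gaussian, with variance varA + varB - 2*covAB.
#include <cmath>
#include <cstdio>

double ProbStronger(double eloA, double varA,
                    double eloB, double varB, double covAB) {
    double mean = eloA - eloB;
    double var  = varA + varB - 2.0 * covAB;
    // P(difference > 0) for a Gaussian, expressed with erfc.
    return 0.5 * std::erfc(-mean / std::sqrt(2.0 * var));
}

int main() {
    // Illustrative numbers only (not taken from the tables above):
    // ratings 2450 and 2400, standard deviations 80 and 75, positive covariance.
    double p = ProbStronger(2450.0, 80.0 * 80.0, 2400.0, 75.0 * 75.0, 1500.0);
    std::printf("P(A stronger than B) = %.3f\n", p);
}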

In order to get even more accurate results, it might be possible to use the Metropolis-Hastings algorithm, using the Gaussian as the proposal density:
http://en.wikipedia.org/wiki/Metropolis-Hastings_algorithm
I will try to implement this in the next version.
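
For reference, the Metropolis-Hastings loop itself is short to write down. The sketch below is a plain random-walk variant with Gaussian steps and a placeholder likelihood, not bayeselo's model or the multivariate-Gaussian proposal discussed above; it only shows how expected (posterior mean) ratings would be accumulated from samples.
Code: Select all
// Generic Metropolis-Hastings sampling loop over a vector of ratings.
// Random-walk variant with independent Gaussian steps; the likelihood below is
// a small placeholder, not bayeselo's actual model.
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Placeholder log-likelihood: pretend each player i beat player i+1 twice and
// lost once. A real rating program would loop over all recorded games.
double LogLikelihood(const std::vector<double>& elo) {
    double ll = 0.0;
    for (std::size_t i = 0; i + 1 < elo.size(); i++) {
        double p = 1.0 / (1.0 + std::pow(10.0, -(elo[i] - elo[i + 1]) / 400.0));
        ll += 2.0 * std::log(p) + std::log(1.0 - p);
    }
    return ll;
}

int main() {
    const int players = 4, samples = 100000;
    std::vector<double> elo(players, 0.0);   // player 0 stays fixed as the anchor
    double ll = LogLikelihood(elo);

    std::mt19937 rng(1);
    std::normal_distribution<double> step(0.0, 30.0);         // proposal width in Elo
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> sum(players, 0.0);

    for (int s = 0; s < samples; s++) {
        std::vector<double> proposal = elo;
        for (int i = 1; i < players; i++) proposal[i] += step(rng);  // symmetric proposal
        double llNew = LogLikelihood(proposal);
        if (std::log(unif(rng)) < llNew - ll) {                      // accept or reject
            elo = proposal;
            ll = llNew;
        }
        for (int i = 0; i < players; i++) sum[i] += elo[i];  // accumulate for posterior means
    }
    for (int i = 0; i < players; i++)
        std::printf("player %d: expected rating %+.1f\n", i, sum[i] / samples);
}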

This new binary of bayeselo had to be compiled with mingw, because msvc does not have the erfc function in <math.h>. There was also a problem with gcc: erfc is in <math.h>, but there is no std::erfc in <cmath>. The g++ guys should fix this, if they read the Winboard forum.

Compiling with mingw makes calculations 10% slower on my P4, but parsing the file is almost twice as fast. So, all in all, it is a speed gain. I am not sure of the reason for the speed difference when parsing the file. Maybe std::map<std::string,int> is much faster with gcc.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Source code

Postby Rémi Coulom » 14 Feb 2005, 13:33

Hi,

I have quickly packed the source code (with GPL licence), for those who are interested:
http://remi.coulom.free.fr/Bayesian-Elo ... lo.tar.bz2

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Thank you for this

Postby Dann Corbit » 14 Feb 2005, 23:40

I am sure that I will enjoy looking at what you did very much.

I am hoping that eventually, I will even understand it!
;-)
Dann Corbit
 

The erfc function

Postby Dann Corbit » 14 Feb 2005, 23:51

You can find an open source version of erfc() in the Cephes collection by Moshier at Netlib.
Dann Corbit
 

Re: The erfc function

Postby Anonymous » 15 Feb 2005, 00:42

Dann Corbit wrote:You can find an open source version of erfc() in the Cephes collection by Moshier at Netlib.


Yes, one of excellent quality, but not without problems. The Cephes routines need quite a few support routines. When I studied them (years ago), I think there could also be some conflicts with symbols already defined globally in typical Standard C math libraries. I guess it is not trivial (a matter of a few minutes) to just use Stephen Moshier's erfc; it will probably not be easy to extract it from the library.

I guess erfc is used in Rémi's program to calculate some error margins. Accuracy will probably not matter at all; 2-3 significant digits will be enough. One could use 1-erf, and in the extreme cases where the result is close to zero and reasonably good relative accuracy is needed (1-erf won't work then), a Taylor expansion.
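
For illustration, a standalone erfc along these lines can be built from the classic Abramowitz and Stegun approximation of erf (maximum error around 1.5e-7, plenty for error margins), without pulling in Cephes. This is a sketch, not Moshier's code:
Code: Select all
// Standalone erfc approximation (Abramowitz & Stegun 7.1.26), accurate to
// roughly 1.5e-7 -- more than enough for confidence-interval work.
#include <cmath>
#include <cstdio>

double my_erfc(double x) {
    double ax = std::fabs(x);
    double t = 1.0 / (1.0 + 0.3275911 * ax);
    double poly = t * (0.254829592 +
                  t * (-0.284496736 +
                  t * (1.421413741 +
                  t * (-1.453152027 +
                  t * 1.061405429))));
    double erfc_pos = poly * std::exp(-ax * ax);   // erfc(|x|)
    return x >= 0.0 ? erfc_pos : 2.0 - erfc_pos;   // erfc(-x) = 2 - erfc(x)
}

int main() {
    for (double x : {0.0, 0.5, 1.0, 2.0, -1.0})
        std::printf("erfc(%+.1f) ~ %.7f\n", x, my_erfc(x));
}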

Regards,
Dieter
Anonymous
 


Re: Thank you for this

Postby Rémi Coulom » 15 Feb 2005, 10:51

Dann Corbit wrote:I am sure that I will enjoy looking at what you did very much.

I am hoping that eventually, I will even understand it!
;-)

I have started to write a short tutorial paper that explains the theory. Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". It will also make it possible to compute the expected ratings, instead of the maximum-likelihood ratings, and make Peter happy.
  2. Run some cross-validation tests, to compare the predictive powers of bayeselo and ELOStat.
  3. Add functionality to the program so that it can handle go games, with handicap and komi.
  4. Finish this tutorial paper about the statistical analysis of game results, and submit it to the ICGA journal.
So, it may take some time before you get this tutorial, but it will probably be there in a few months.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: ELOStat algorithm ?

Postby Dann Corbit » 15 Feb 2005, 15:13

This is really a fantastic and valuable effort.

There is really no other open source project that performs these calculations effectively.

I am sure that it will become an important factor for a number of projects in the future.

I also think that the tables it generates feel intuitively more sensible, so it may be a better model than the alternatives.
Dann Corbit
 

Re: ELOStat algorithm ?

Postby Dann Corbit » 18 Feb 2005, 22:32

Minimal erfc stuff from Cephes
ftp://cap.connx.com/pub/chess-engines/n ... rferfc.zip

I built a BayesElo using MS VC++ with it and it works fine.

To use it, add the C and H files to your project and #include "cephes.h" wherever you plan to call erf() or erfc().
Dann Corbit
 

Metropolis

Postby Rémi Coulom » 19 Feb 2005, 19:45

Rémi Coulom wrote: Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". It will also make it possible to compute the expected ratings, instead of the maximum-likelihood ratings, and make Peter happy.

I have spent this afternoon implementing this, just to find out that it does not work. Well, it works for a small number of players, but not for a large number. The cost is still exponential with the number of players, so it is not practical for more than 4-5. I naively thought that using the Gaussian as a helper would be the magic trick that allows good high-dimensional sampling. I could have saved an afternoon if I had thought about this a little more. I am feeling a little stupid now.

Thanks Dann for the erfc code. I will probably not use it because I am happy with gcc, but I may find a use for it, sometime.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: Metropolis

Postby Dann Corbit » 21 Feb 2005, 20:01

Rémi Coulom wrote:
Rémi Coulom wrote: Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". It will also make it possible to compute the expected ratings, instead of the maximum-likelihood ratings, and make Peter happy.

I have spent this afternoon implementing this, just to find out that it does not work. Well, it works for a small number of players, but not for a large number. The cost is still exponential with the number of players, so it is not practical for more than 4-5. I naively thought that using the Gaussian as a helper would be the magic trick that allows good high-dimensional sampling. I could have saved an afternoon if I had thought about this a little more. I am feeling a little stupid now.

Thanks Dann for the erfc code. I will probably not use it because I am happy with gcc, but I may find a use for it, sometime.

Rémi


It is useful for me because I like the MS VC++ IDE better than GCC (but I am coming around to Eclipse). At any rate, I can also make faster binaries with MS VC++.

So some other people might also find it useful.
Dann Corbit
 

Bayeselo now with a proper prior

Postby Rémi Coulom » 21 Sep 2005, 20:29

Hi,

I am reviving this huge old thread because I have spent some time working on bayeselo recently. As I had planned, I ran some more statistical tests to measure the predictive power of bayeselo. This led to a new version that I believe is much better.

Here is what I first did: from the WBEC data, I selected one game out of two at random. From these extracted games, I computed Elo ratings for every player. Then I compared the results of these players in the second half of the WBEC games with the expected results according to the computed ratings. This is what I got:
[Image: observed winning rate as a function of rating difference, original bayeselo]
The red plot indicates the proportion of wins as a function of rating difference. The green curve is the ideal model; the purple and blue curves are the 95% confidence boundaries around the green curve.

This shows a phenomenon that I had intuitively observed: bayeselo tends to overestimate the differences in ratings between players.
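
The comparison behind this plot is straightforward to reproduce in outline: predict each second-half game from the first-half ratings with the logistic expectation, bin the games by rating difference, and compare observed with predicted scores. The sketch below uses made-up placeholder games, not the WBEC data:
Code: Select all
// Outline of the cross-validation check: compare observed scores with the
// scores predicted by the logistic Elo model, binned by rating difference.
// The games below are a made-up placeholder, not the WBEC data.
#include <cmath>
#include <cstdio>
#include <vector>

struct Game {
    double eloDiff;  // first-half rating of the player minus that of the opponent
    double score;    // observed result in the second half: 1, 0.5 or 0
};

double Expected(double eloDiff) {
    return 1.0 / (1.0 + std::pow(10.0, -eloDiff / 400.0));
}

int main() {
    std::vector<Game> secondHalf = {
        {120, 1.0}, {80, 0.5}, {-40, 0.0}, {200, 1.0}, {-150, 0.5}, {30, 0.0},
    };

    const int bins = 10;
    const double width = 100.0;               // bins cover -500..+500 Elo
    std::vector<double> obs(bins, 0.0), pred(bins, 0.0), count(bins, 0.0);

    for (const Game& g : secondHalf) {
        int b = static_cast<int>((g.eloDiff + 500.0) / width);
        if (b < 0 || b >= bins) continue;
        obs[b]   += g.score;
        pred[b]  += Expected(g.eloDiff);
        count[b] += 1.0;
    }
    for (int b = 0; b < bins; b++)
        if (count[b] > 0)
            std::printf("diff ~%+.0f: observed %.2f, predicted %.2f (%.0f games)\n",
                        -500.0 + (b + 0.5) * width,
                        obs[b] / count[b], pred[b] / count[b], count[b]);
}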

The fundamental reason for this overestimation is the lack of a proper prior. As it stands, the maximum-likelihood estimation assumes a uniform a priori distribution over all possible ratings. This is bad for two reasons. First, from a mathematical point of view, there is no such thing as a uniform distribution over an infinite range; assuming such a prior may lead to big fallacies in Bayesian reasoning, such as the two-envelope paradox. Second, we know for sure that small rating differences are much more likely than huge ones.

So, I have produced a new version of bayeselo that uses a more reasonable prior. And here is what I get now on WBEC data:

[Image: the same plot, with the new prior]

It is far from perfect, but it is better.

The prior is set by automatically adding "virtual draws" between players. The default value is set to 2. I found that 2.6 is what maximizes the likelihood on the WBEC data, but I preferred to leave the default at a smaller value. A value of 0 gives exactly the same behaviour as the previous version. You can set the prior manually with the new "prior" command in bayeselo. When using this command, it will tell you its consequences like this (a simplified sketch of the virtual-draw idea follows the output below):
Code: Select all
ResultSet-EloRating>prior 0.1
0.1
With this prior, you will get the following Elo differences:
1-0 : +475.016
2-0 : +590.27
3-0 : +658.945
4-0 : +708.031
5-0 : +746.258
ResultSet-EloRating>prior 1
1
With this prior, you will get the following Elo differences:
1-0 : +150.156
2-0 : +232.215
3-0 : +288.12
4-0 : +330.435
5-0 : +364.46
ResultSet-EloRating>prior 2
2
With this prior, you will get the following Elo differences:
1-0 : +89.2563
2-0 : +150.156
3-0 : +195.835
4-0 : +232.215
5-0 : +262.379
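
Here is a simplified sketch of the virtual-draw idea: add the prior draws to the real results and take the maximum-likelihood rating difference as usual. It treats a draw as half a win plus half a loss and ignores bayeselo's dedicated draw model, so its numbers do not match the "prior" command's output above exactly; it only shows how the prior pulls extreme scores toward zero.
Code: Select all
// Effect of "virtual draws" on the maximum-likelihood Elo difference between
// two players. Simplified: a draw counts as half a win plus half a loss, and
// bayeselo's dedicated draw model is ignored, so the numbers differ from the
// prior command's output above.
#include <cmath>
#include <cstdio>

double EloDiff(double wins, double losses, double draws, double virtualDraws) {
    double score = wins + 0.5 * (draws + virtualDraws);
    double games = wins + losses + draws + virtualDraws;
    double p = score / games;                 // regularized score ratio
    return 400.0 * std::log10(p / (1.0 - p));
}

int main() {
    for (double prior : {0.1, 1.0, 2.0}) {
        std::printf("prior %.1f:", prior);
        for (int w = 1; w <= 5; w++)          // results w-0
            std::printf("  %d-0: %+.0f", w, EloDiff(w, 0, 0, prior));
        std::printf("\n");
    }
}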


I am very convinced that this new version produces much better Elo ratings, especially when the number of games is small. You can get it there:

http://remi.coulom.free.fr/Bayesian-Elo/

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: Bayeselo now with a proper prior

Postby Dann Corbit » 22 Sep 2005, 01:19

Rémi Coulom wrote: {snip}
I am very convinced that this new version produces much better Elo ratings, especially when the number of games is small. You can get it there:

http://remi.coulom.free.fr/Bayesian-Elo/

Rémi


Fantastic!
Can I get a source tarball?
Dann Corbit
 

Re: ELOStat algorithm ?

Postby Kirill Kryukov » 22 Sep 2005, 01:42

Great! Thanks, Rémi!

Is there a chance a future bayeselo may run in a fully non-interactive mode? I would like to run it from another program to automatically process the PGN collections from my tournaments. So it would be very nice if one could run "bayeselo -readpgn ... -elo -mm -exactdist -ratings" (or whatever we have to type now) and it would just print its output to stdout.
Kirill Kryukov
 
Posts: 127
Joined: 21 Sep 2005, 09:56
