ELOStat algorithm ?

Programming Topics (Computer Chess) and technical aspects such as test techniques, book building, program tuning, etc.

Moderator: Andres Valverde

new version

Postby Rémi Coulom » 15 Jan 2005, 13:58

Hi,

I have made a new version that fixes the problems reported by Dann (Thanks, Dann):
http://remi.coulom.free.fr/Bayesian-Elo/

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

New version of ELOStat

Postby Rémi Coulom » 22 Jan 2005, 15:00

A new version of ELOStat is available (on the Arena web site):
Changes from version 1.2 to 1.3:
  • maximum number of different players/programs increased to 1500
  • the algorithm for calculating the confidence intervals has been completely changed (it now uses the so-called nonparametric ABC method (approximate bootstrap confidence) by Efron and Tibshirani). Many thanks to Dr. Jeff Lischer (US) for drawing my attention to this fantastic method and to all users who pointed out the insufficiencies of the old method.
  • some minor bugs in individual statistics output removed

Here is a link that explains bootstrap:
http://en.wikipedia.org/wiki/Bootstrap#Statistics
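
For those who prefer code to prose, here is a minimal sketch of the plain percentile bootstrap in C++. It is only an illustration of the resampling idea behind the link above: ELOStat 1.3 actually uses the more refined ABC variant, and the game results and the simple score-to-Elo conversion below are made up, not taken from ELOStat.
Code: Select all
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Convert an average score (strictly between 0 and 1) into an Elo difference.
double scoreToElo(double p) { return 400.0 * std::log10(p / (1.0 - p)); }

int main()
{
    // Illustrative results of one player against one opponent:
    // 1 = win, 0.5 = draw, 0 = loss.
    std::vector<double> games = {1, 1, 0.5, 0, 1, 0.5, 1, 0, 1, 1,
                                 0.5, 1, 0, 1, 1, 0.5, 1, 0, 1, 1};

    std::mt19937 rng(12345);
    std::uniform_int_distribution<std::size_t> pick(0, games.size() - 1);

    const int B = 10000;                      // number of bootstrap resamples
    std::vector<double> elo(B);
    for (int b = 0; b < B; ++b) {
        double sum = 0.0;
        for (std::size_t i = 0; i < games.size(); ++i)
            sum += games[pick(rng)];          // resample the games with replacement
        double p = sum / games.size();
        p = std::min(std::max(p, 1e-3), 1.0 - 1e-3);  // avoid an infinite Elo difference
        elo[b] = scoreToElo(p);
    }

    std::sort(elo.begin(), elo.end());
    std::printf("95%% bootstrap interval: [%+.0f, %+.0f] Elo\n",
                elo[static_cast<int>(0.025 * B)], elo[static_cast<int>(0.975 * B)]);
    return 0;
}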

I believe the Bayesian approach is still much better. I have no time to do it right now, but I will probably start implementing the "inverse of the Hessian" approach in one or two weeks.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: New version of ELOStat

Postby Peter Fendrich » 22 Jan 2005, 20:00

Rémi Coulom wrote:I believe the Bayesian approach is still much better.
Rémi
Absolutely.
I think you are on the right track, even if I have some doubts about your maximum-likelihood approach...
In a way, Bayesian methods are bootstrapping as well.
/Peter
Peter Fendrich
 
Posts: 193
Joined: 26 Sep 2004, 20:28
Location: Sweden

Re: ELOStat algorithm ?

Postby Robert Allgeuer » 23 Jan 2005, 17:17

A question to the statistics experts:

If I have understood correctly, EloStat makes various assumptions and simplifications about the distribution of the error when calculating the error margins.

What are these assumptions?
One is that it assumes that the error margin against the whole set of opponents is the same as against a single opponent with the average rating.
But what kind of distribution function does it assume? Any other assumptions?
Also, is there a paper on its method available somewhere?

Thanks
Robert
Robert Allgeuer
 
Posts: 124
Joined: 28 Sep 2004, 19:09
Location: Konz / Germany

Re: ELOStat algorithm ?

Postby Rémi Coulom » 23 Jan 2005, 18:07

Robert Allgeuer wrote:A question to the statistics experts:

If I have understood correctly, EloStat makes various assumptions and simplifications about the distribution of the error when calculating the error margins.

What are these assumptions?
One is that it assumes that the error margin against the whole set of opponents is the same as against a single opponent with the average rating.
But what kind of distribution function does it assume? Any other assumptions?
Also, is there a paper on its method available somewhere?

Thanks
Robert
This was discussed earlier in this thread. There is a CSS paper explaining the past versions of ELOStat, and Dieter gave a summary (first page, posted Thu Dec 16, 2004, 2:43 pm; I do not know how to link to a message).

The strange thing about ELOStat is that it computes two variances, so the assumed distribution is two half-Gaussians. That seems wrong to me, and it often generates strange asymmetric intervals. bayeselo also generates asymmetric intervals, but usually the asymmetry is the reverse of ELOStat's. For instance, using the games of the second division of WBEC, with bayeselo:
Code: Select all
Rank Name                            Elo   +    -  games score ratio
   1 Jonny 2.70                     2654  103   90    56    45   80%
   2 Spike 0.8a                     2577   94   86    56    40   71%
   3 DanChess 1.07                  2576   92   86    56    39   70%
   4 TRACE 1.33                     2482   87   85    52    30   58%
   5 E.T.Chess 071204               2471   84   80    56    37   66%
   6 Pseudo 0.6h                    2453   84   82    56    32   57%
   7 Petir 2.0                      2443   88   86    52  29.5   57%
   8 Snitch 1.0.8                   2416   82   81    56  29.5   53%
   9 Muse 0.899b uci                2408   83   81    55  30.5   55%
  10 TheCrazyBishop 0052            2408   83   83    56    29   52%
  11 Terra 3.3b11                   2402   87   87    56    27   48%
  12 Leila 0.53h                    2396   83   83    56    27   48%
  13 Cerebro 1.30                   2391   92   93    52    25   48%
  14 Frenzee 199                    2386   86   87    56    26   46%
  15 Amateur 2.80                   2375   81   83    56  24.5   44%
  16 Scidlet 3.6                    2359   84   86    52  22.5   43%
  17 Bruja 1.8                      2356   83   85    52  22.5   43%
  18 Queen 2.45                     2338   85   86    52    23   44%
  19 Chezzz 1.0.3                   2329   85   89    56  21.5   38%
  20 Resp 0.19                      2307   87   91    52    20   38%
  21 Knightx 1.86                   2301   84   87    56    22   39%
  22 Djinn 0.870f                   2281   84   88    56    21   38%
  23 Butcher 1.53                   2263   87   94    55  16.5   30%
  24 Esc 1.16                       2224   93  102    52    15   29%

with ELOStat (1.2):
Code: Select all
    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Jonny 2.70                     : 2628   66 124    56    80.4 %   2383   21.4 %
  2 Spike 0.8a                     : 2551   72 105    56    71.4 %   2391   21.4 %
  3 DanChess 1.07                  : 2549   74 102    56    69.6 %   2405   21.4 %
  4 E.T.Chess 071204               : 2473   77  82    56    66.1 %   2358   32.1 %
  5 TRACE 1.33                     : 2469   90  82    52    57.7 %   2415   26.9 %
  6 Pseudo 0.6h                    : 2440   87  80    56    57.1 %   2390   25.0 %
  7 Petir 2.0                      : 2434   91  87    52    56.7 %   2387   21.2 %
  8 Snitch 1.0.8                   : 2415   92  71    56    52.7 %   2396   30.4 %
  9 Muse 0.899b uci                : 2407   90  74    55    55.5 %   2369   30.9 %
 10 TheCrazyBishop 0052            : 2402   94  75    56    51.8 %   2390   25.0 %
 11 Terra 3.3b11                   : 2396   81  94    56    48.2 %   2408   17.9 %
 12 Cerebro 1.30                   : 2394   91  97    52    48.1 %   2407   11.5 %
 13 Leila 0.53h                    : 2393   69  94    56    48.2 %   2405   32.1 %
 14 Frenzee 199                    : 2390   83  91    56    46.4 %   2414   17.9 %
 15 Amateur 2.80                   : 2372   74  88    56    43.8 %   2415   30.4 %
 16 Bruja 1.8                      : 2364   75  91    52    43.3 %   2411   32.7 %
 17 Scidlet 3.6                    : 2363   79  91    52    43.3 %   2410   28.8 %
 18 Queen 2.45                     : 2342   80  92    52    44.2 %   2382   26.9 %
 19 Chezzz 1.0.3                   : 2332   91  82    56    38.4 %   2415   19.6 %
 20 Resp 0.19                      : 2320   91  85    52    38.5 %   2402   23.1 %
 21 Knightx 1.86                   : 2318   88  83    56    39.3 %   2394   21.4 %
 22 Djinn 0.870f                   : 2297   91  81    56    37.5 %   2385   21.4 %
 23 Butcher 1.53                   : 2280  105  74    55    30.0 %   2427   20.0 %
 24 Esc 1.16                       : 2252  125  76    52    28.8 %   2409   11.5 %

Notice that for players near the top, the asymmetry in bayeselo is the reverse of ELOStat's. Those players have won more than they have lost, so it seems natural that the lower bound of the confidence interval should be tighter than the upper bound: every win contributes to narrowing the lower bound, and every loss contributes to narrowing the upper bound. For instance, in the extreme case where a player has only wins, you know a lower bound on its rating, but no upper bound. ELOStat gives the reverse asymmetry.
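
This is not how bayeselo computes its intervals, but one way to see why the asymmetry should point this way is to map a symmetric uncertainty on the score onto the Elo scale. A small sketch, with made-up numbers:
Code: Select all
#include <cmath>
#include <cstdio>

// Standard logistic Elo mapping: expected score -> rating difference.
double scoreToElo(double p) { return 400.0 * std::log10(p / (1.0 - p)); }

int main()
{
    // A player scoring 80%, with a made-up +/- 6% uncertainty on the score.
    double p = 0.80, dp = 0.06;
    double mid  = scoreToElo(p);
    double up   = scoreToElo(p + dp) - mid;   // distance to the upper Elo bound
    double down = mid - scoreToElo(p - dp);   // distance to the lower Elo bound
    std::printf("score %.0f%% -> %+.0f Elo, interval roughly +%.0f / -%.0f\n",
                100 * p, mid, up, down);
    // The same score uncertainty stretches further on the winning side: as the
    // score approaches 100%, the Elo difference grows without bound, which is
    // why a player with only wins has a lower rating bound but no upper bound.
    return 0;
}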

ELOStat 1.3 is an improvement in this respect:
Code: Select all
  1 Jonny 2.70                     : 2628   98  93    56    80.4 %   2383   21.4 %
  2 Spike 0.8a                     : 2551   90  87    56    71.4 %   2391   21.4 %
  3 DanChess 1.07                  : 2549   89  86    56    69.6 %   2405   21.4 %
  4 E.T.Chess 071204               : 2473   79  77    56    66.1 %   2358   32.1 %
  5 TRACE 1.33                     : 2469   83  82    52    57.7 %   2415   26.9 %
  6 Pseudo 0.6h                    : 2440   81  80    56    57.1 %   2390   25.0 %
  7 Petir 2.0                      : 2434   87  86    52    56.7 %   2387   21.2 %
  8 Snitch 1.0.8                   : 2415   77  77    56    52.7 %   2396   30.4 %
  9 Muse 0.899b uci                : 2407   78  78    55    55.5 %   2369   30.9 %
 10 TheCrazyBishop 0052            : 2402   80  80    56    51.8 %   2390   25.0 %
 11 Terra 3.3b11                   : 2396   84  84    56    48.2 %   2408   17.9 %
 12 Cerebro 1.30                   : 2394   91  91    52    48.1 %   2407   11.5 %
 13 Leila 0.53h                    : 2393   76  76    56    48.2 %   2405   32.1 %
 14 Frenzee 199                    : 2390   84  85    56    46.4 %   2414   17.9 %
 15 Amateur 2.80                   : 2372   77  78    56    43.8 %   2415   30.4 %
 16 Bruja 1.8                      : 2364   79  80    52    43.3 %   2411   32.7 %
 17 Scidlet 3.6                    : 2363   81  82    52    43.3 %   2410   28.8 %
 18 Queen 2.45                     : 2342   82  83    52    44.2 %   2382   26.9 %
 19 Chezzz 1.0.3                   : 2332   84  86    56    38.4 %   2415   19.6 %
 20 Resp 0.19                      : 2320   85  87    52    38.5 %   2402   23.1 %
 21 Knightx 1.86                   : 2318   83  84    56    39.3 %   2394   21.4 %
 22 Djinn 0.870f                   : 2297   83  85    56    37.5 %   2385   21.4 %
 23 Butcher 1.53                   : 2280   88  91    55    30.0 %   2427   20.0 %
 24 Esc 1.16                       : 2252   97 102    52    28.8 %   2409   11.5 %

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: ELOStat algorithm ?

Postby Dann Corbit » 24 Jan 2005, 19:42

Interesting is the reversal of TRACE and E.T.Chess.
Dann Corbit
 

New version of bayeselo

Postby Rémi Coulom » 13 Feb 2005, 17:36

Hi,

I have made a new version of my program:
http://remi.coulom.free.fr/Bayesian-Elo/

This version can compute uncertainty margins assuming a Gaussian likelihood distribution, estimating its covariance with the inverse of the negative of the Hessian. It can also use this matrix to compute a table of "likelihood that A is stronger than B".

To show how cool this feature is, I'll give a small example. Suppose that A has played 32 games against B, scoring 75%, and that C has played 32 games against D, also scoring 75%. We are then fairly sure that A is stronger than B, and that C is stronger than D. Suppose now that B plays a single game against C and draws. Estimating confidence intervals from those results becomes tricky, and inferring the likelihood that one player is stronger than another from those intervals alone is impossible.

Here is the output of bayeselo:
Code: Select all
Rank Name   Elo    +    - games score draws
   1 A      146  267  267    32   75%   25%
   2 C       16  253  253    33   74%   27%
   3 B      -16  253  253    33   25%   27%
   4 D     -146  267  267    32   25%   25%

All the ratings are very uncertain, because of that single draw between B and C. The matrix of "likelihood that i is stronger than j", computed by bayeselo (multiplied by 1000 and rounded) is:
Code: Select all
     A   C   B   D
A      689 996 861
C  310     551 996
B    3 448     689
D  138   3 310   
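
Under the Gaussian approximation (covariance taken as the inverse of the negative of the Hessian of the log-likelihood), each entry of this matrix is just a normal tail probability. A minimal sketch in C++, using std::erfc from C++11; the means and covariance entries are placeholders, not bayeselo's actual values:
Code: Select all
#include <cmath>
#include <cstdio>

// P(rating_A > rating_B) when the joint rating estimate is Gaussian:
// the difference rA - rB is normal with mean muA - muB and variance
// varA + varB - 2*covAB, and P(difference > 0) = 0.5 * erfc(-m / sqrt(2*v)).
double probStronger(double muA, double muB, double varA, double varB, double covAB)
{
    double m = muA - muB;
    double v = varA + varB - 2.0 * covAB;
    return 0.5 * std::erfc(-m / std::sqrt(2.0 * v));
}

int main()
{
    // Placeholder numbers: a 100-point rating difference, large individual
    // variances, and a positive covariance between the two estimates.
    double p = probStronger(100.0, 0.0, 130.0 * 130.0, 130.0 * 130.0, 5000.0);
    std::printf("P(A stronger than B) ~ %.3f\n", p);
    return 0;
}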

Here is the output of ELOStat on the same set of game results:
Code: Select all
  Program   Elo    +   -   Games   Score   Av.Op.  Draws

1 A       :  210  121 114    32    75.0 %     19   25.0 %
2 B       :   19  109 115    33    25.8 %    203   27.3 %
3 C       :  -19  115 109    33    74.2 %   -203   27.3 %
4 D       : -210  114 121    32    25.0 %    -19   25.0 %

Because ELOStat does not take covariance into account and treats the opponents' estimated ratings as their true ratings, it obtains very narrow confidence intervals.

As Dann would write, interesting is the reversal of B and C (Dann, is this correct English grammar? It sounds more German than English). In bayeselo, B obtains a lower rating than C because it drew against C while playing white. The advantage of playing white is about 32 points in bayeselo, so this is the rating difference we get between B and C. ELOStat does not distinguish between white and black, so it would have made sense for it to give the same rating to B and C. The 38-point difference it produces comes from the fact that ELOStat uses the average of the opponents' ratings; this particular example demonstrates that that is not a good idea.

These new features of bayeselo are cool, but not perfect. The main problem is that the likelihood distribution is not really Gaussian. It becomes close to a Gaussian as the number of games grows, but the approximation is a little off when there are few games. For instance, in the extreme case of a single draw between A and B, the inverse-of-the-Hessian approach gives a +/- 250 uncertainty (which you can see in the example above), while the true distribution gives +/- 322. This, of course, is the most extreme situation; the difference between the true distribution and the Gaussian approximation is usually much smaller.

In order to get even more accurate results, it might be possible to use the Metropolis-Hastings algorithm, using the Gaussian as the proposal density:
http://en.wikipedia.org/wiki/Metropolis-Hastings_algorithm
I will try to implement this in the next version.
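
For readers who have not seen the algorithm before, here is a minimal, generic random-walk Metropolis sketch in C++. The log-posterior below is a one-dimensional placeholder (a standard normal), not bayeselo's rating model:
Code: Select all
#include <cmath>
#include <cstdio>
#include <random>

// Placeholder log-posterior: a standard normal. In bayeselo this would be the
// log-likelihood of the game results as a function of the rating vector.
double logPosterior(double x) { return -0.5 * x * x; }

int main()
{
    std::mt19937 rng(42);
    std::normal_distribution<double> step(0.0, 1.0);        // Gaussian random-walk proposal
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double x = 0.0;                  // current sample
    double sum = 0.0, sumSq = 0.0;
    const int N = 100000;

    for (int i = 0; i < N; ++i) {
        double candidate = x + step(rng);
        // Accept with probability min(1, posterior(candidate) / posterior(x)).
        if (std::log(unif(rng)) < logPosterior(candidate) - logPosterior(x))
            x = candidate;
        sum += x;
        sumSq += x * x;
    }

    double mean = sum / N;
    std::printf("sample mean %.3f, sample variance %.3f\n", mean, sumSq / N - mean * mean);
    return 0;
}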

This new binary of bayeselo had to be compiled with MinGW, because MSVC does not provide the erfc function in <math.h>. There was also a problem with gcc: erfc is in <math.h>, but there is no std::erfc in <cmath>. The g++ guys should fix this, if they read the WinBoard forum.

Compiling with MinGW makes the calculations 10% slower on my P4, but parsing the file is almost twice as fast, so all in all it is a speed gain. I am not sure of the reason for the parsing speed difference; maybe std::map<std::string,int> is much faster with gcc.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Source code

Postby Rémi Coulom » 14 Feb 2005, 13:33

Hi,

I have quickly packaged the source code (GPL licence) for those who are interested:
http://remi.coulom.free.fr/Bayesian-Elo ... lo.tar.bz2

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Thank you for this

Postby Dann Corbit » 14 Feb 2005, 23:40

I am sure that I will enjoy looking at what you did very much.

I am hoping that eventually, I will even understand it!
;-)
Dann Corbit
 

The erfc function

Postby Dann Corbit » 14 Feb 2005, 23:51

You can find an open source version of erfc() in the Cephes collection by Moshier at Netlib.
Dann Corbit
 

Re: The erfc function

Postby Anonymous » 15 Feb 2005, 00:42

Dann Corbit wrote:You can find an open source version of erfc() in the Cephes collection by Moshier at Netlib.


Yes, one of excellent quality, but not without problems. The Cephes routines need quite a few support routines. When I studied them (years ago), I think there could also be conflicts with global symbols already defined in typical Standard C math libraries. I guess it is not trivial (a matter of a few minutes) to just use Stephen Moshier's erfc; it will probably not be easy to extract it from the library.

I guess erfc is used in Rémi's program to calculate some error margins. Accuracy will probably not matter much, and 2-3 significant digits will be enough. One could use 1 - erf, and in the extreme cases where the result is close to zero and reasonably good relative accuracy is needed (1 - erf won't work then), a Taylor expansion.
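
Another self-contained option, if only a few digits are needed, is the classic rational approximation of erf from Abramowitz and Stegun (formula 7.1.26, absolute error around 1.5e-7). A sketch:
Code: Select all
#include <cmath>
#include <cstdio>

// Abramowitz & Stegun 7.1.26: rational approximation of erf, extended to
// negative arguments via erf(-x) = -erf(x). Absolute error about 1.5e-7,
// which is plenty for error-margin calculations.
double erfcApprox(double x)
{
    double sign = 1.0;
    if (x < 0.0) { sign = -1.0; x = -x; }
    const double p  = 0.3275911;
    const double a1 = 0.254829592, a2 = -0.284496736, a3 = 1.421413741;
    const double a4 = -1.453152027, a5 = 1.061405429;
    double t    = 1.0 / (1.0 + p * x);
    double poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t;
    double erfValue = sign * (1.0 - poly * std::exp(-x * x));
    return 1.0 - erfValue;                    // erfc(x) = 1 - erf(x)
}

int main()
{
    for (double x = -2.0; x <= 2.01; x += 1.0)
        std::printf("erfc(%+.1f) ~ %.6f\n", x, erfcApprox(x));
    return 0;
}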

Regards,
Dieter
Anonymous
 


Re: Thank you for this

Postby Rémi Coulom » 15 Feb 2005, 10:51

Dann Corbit wrote:I am sure that I will enjoy looking at what you did very much.

I am hoping that eventually, I will even understand it!
;-)

I have started to write a short tutorial paper that explains the theory. Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". This will also make it possible to compute expected ratings instead of maximum-likelihood ratings, and make Peter happy.
  2. Run some cross-validation tests to compare the predictive power of bayeselo and ELOStat.
  3. Add functionality to the program so that it can handle Go games, with handicap and komi.
  4. Finish this tutorial paper about the statistical analysis of game results, and submit it to the ICGA journal.
So, it may take some time before you get this tutorial, but it will probably be ready in a few months.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: ELOStat algorithm ?

Postby Dann Corbit » 15 Feb 2005, 15:13

This is really a fantastic and valuable effort.

There is really no other open source project that performs these calculations effectively.

I am sure that it will become an important factor for a number of projects in the future.

I also think that the tables it generates feel more sensible intuitively, so it may well be a better model than the alternatives.
Dann Corbit
 

Re: ELOStat algorithm ?

Postby Dann Corbit » 18 Feb 2005, 22:32

Minimal erfc stuff from Cephes
ftp://cap.connx.com/pub/chess-engines/n ... rferfc.zip

I built a BayesElo using MS VC++ with it and it works fine.

To use it, add the C and H files to your project and #include "cephes.h" wherever you plan to call erf() or erfc().
Dann Corbit
 

Metropolis

Postby Rémi Coulom » 19 Feb 2005, 19:45

Rémi Coulom wrote:Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". This will also make it possible to compute expected ratings instead of maximum-likelihood ratings, and make Peter happy.

I spent this afternoon implementing this, only to find out that it does not work. Well, it works for a small number of players, but not for a large one: the cost is still exponential in the number of players, so it is not practical for more than 4-5. I naively thought that using the Gaussian as a proposal would be the magic trick that allows good high-dimensional sampling. I could have saved an afternoon if I had thought about it a little more; I am feeling a little stupid now.

Thanks Dann for the erfc code. I will probably not use it because I am happy with gcc, but I may find a use for it, sometime.

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: Metropolis

Postby Dann Corbit » 21 Feb 2005, 20:01

Rémi Coulom wrote:
Rémi Coulom wrote:Right now, my plans are:
  1. Implement Metropolis-Hastings. This will provide a very accurate matrix of "likelihood that i is stronger than j". This will also make it possible to compute expected ratings instead of maximum-likelihood ratings, and make Peter happy.

I spent this afternoon implementing this, only to find out that it does not work. Well, it works for a small number of players, but not for a large one: the cost is still exponential in the number of players, so it is not practical for more than 4-5. I naively thought that using the Gaussian as a proposal would be the magic trick that allows good high-dimensional sampling. I could have saved an afternoon if I had thought about it a little more; I am feeling a little stupid now.

Thanks Dann for the erfc code. I will probably not use it because I am happy with gcc, but I may find a use for it, sometime.

Rémi


It is useful for me because I like the MS VC++ IDE better than GCC (but I am coming around to Eclipse). At any rate, I can make faster binaries with MS VC++ also.

So some other people might also find it useful.
Dann Corbit
 

Bayeselo now with a proper prior

Postby Rémi Coulom » 21 Sep 2005, 20:29

Hi,

I revive this huge old thread because I have spent some time working on bayeselo recently. As I had planned, I ran some more statistical tests to measure the predictive power of bayeselo. This led to a new version that I believe is much better.

Here is what I did first: from the WBEC data, I selected one game out of two at random. From these extracted games, I computed Elo ratings for every player. Then I compared the actual results of these players in the second half of the WBEC games with the results expected according to the computed Elo ratings. This is what I got:
[image: plot comparing observed results with the predictions of the fitted ratings]
The red plot shows the proportion of wins as a function of rating difference. The green curve is the ideal model; the purple and blue curves are 95% confidence boundaries around the green curve.
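
For the curious, here is a minimal sketch of this kind of holdout check in C++ (without the confidence bands). The held-out games and the bin settings are placeholders; in the real test they come from the second half of the WBEC data and the ratings fitted on the first half:
Code: Select all
#include <cmath>
#include <cstdio>
#include <vector>

// One held-out game: the rating difference (white minus black, taken from the
// ratings fitted on the other half of the data) and the result for white.
struct HeldOutGame { double eloDiff; double result; };   // result: 1, 0.5 or 0

// Expected score under the logistic Elo model.
double expectedScore(double eloDiff) { return 1.0 / (1.0 + std::pow(10.0, -eloDiff / 400.0)); }

int main()
{
    // Placeholder data standing in for the held-out half of the games.
    std::vector<HeldOutGame> games = {
        {+120, 1}, {+120, 0.5}, {+40, 0}, {-80, 0}, {+250, 1},
        {-30, 0.5}, {-200, 0}, {+10, 1}, {+300, 1}, {-150, 0.5},
    };

    const double binWidth = 100.0;     // width of the rating-difference bins
    const int nBins = 9;               // covers roughly -400 to +500 Elo
    std::vector<double> sumScore(nBins, 0), sumExpected(nBins, 0), count(nBins, 0);

    for (const HeldOutGame& g : games) {
        int bin = static_cast<int>(std::floor(g.eloDiff / binWidth)) + nBins / 2;
        if (bin < 0 || bin >= nBins) continue;
        sumScore[bin]    += g.result;                  // observed score
        sumExpected[bin] += expectedScore(g.eloDiff);  // predicted score
        count[bin]       += 1;
    }

    for (int b = 0; b < nBins; ++b)
        if (count[b] > 0)
            std::printf("bin %+4.0f Elo: observed %.2f, predicted %.2f (%g games)\n",
                        (b - nBins / 2 + 0.5) * binWidth,
                        sumScore[b] / count[b], sumExpected[b] / count[b], count[b]);
    return 0;
}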

This shows a phenomenon that I had intuitively observed: bayeselo tends to overestimate the differences in ratings between players.

The fundamental reason for this overestimation is the lack of a proper prior. As it is computed, the maximum-likelihood estimation assumes an a priori uniform distribution over all possible ratings. This is bad for two reasons. First, from a mathematical point of view, there is no such thing as a uniform distribution over an infinite range; assuming such a prior can lead to big fallacies in Bayesian reasoning, such as the two-envelope paradox. Second, we know for sure that small rating differences are much more likely than huge ones.

So, I have produced a new version of bayeselo that uses a more reasonable prior. And here is what I get now on WBEC data:

[image: the same comparison plot with the new prior]

It is far from perfect, but it is better.

The prior is set by automatically adding "virtual draws" between players. The default value is 2. I found that 2.6 is what maximizes the likelihood on the WBEC data, but I preferred to leave the default at a smaller value. A value of 0 gives exactly the same behaviour as the previous version. You can set the prior manually with the new "prior" command in bayeselo. When you use this command, it tells you the consequences like this:
Code: Select all
ResultSet-EloRating>prior 0.1
0.1
With this prior, you will get the following Elo differences:
1-0 : +475.016
2-0 : +590.27
3-0 : +658.945
4-0 : +708.031
5-0 : +746.258
ResultSet-EloRating>prior 1
1
With this prior, you will get the following Elo differences:
1-0 : +150.156
2-0 : +232.215
3-0 : +288.12
4-0 : +330.435
5-0 : +364.46
ResultSet-EloRating>prior 2
2
With this prior, you will get the following Elo differences:
1-0 : +89.2563
2-0 : +150.156
3-0 : +195.835
4-0 : +232.215
5-0 : +262.379
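
To get a feeling for what the virtual draws do, here is a back-of-the-envelope sketch. The simple score-to-Elo conversion below is my own shortcut, not bayeselo's full model (which also accounts for the white advantage and the draw rate), so it does not reproduce the numbers above exactly; only the shrinking effect is the point:
Code: Select all
#include <cmath>
#include <cstdio>

// Illustration of "virtual draws": shrink a head-to-head result toward
// equality before converting the score into an Elo difference.
double eloDiff(double wins, double draws, double losses, double prior)
{
    draws += prior;                                  // add the virtual draws
    double games = wins + draws + losses;
    double score = (wins + 0.5 * draws) / games;
    return 400.0 * std::log10(score / (1.0 - score));
}

int main()
{
    // A single real win and no losses: without a prior the estimated
    // difference is infinite; the prior pulls it back to something finite.
    const double priors[] = {0.1, 1.0, 2.0};
    for (double prior : priors)
        std::printf("prior %.1f : 1-0 -> %+.0f Elo\n", prior, eloDiff(1, 0, 0, prior));
    return 0;
}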


I am quite convinced that this new version produces much better Elo ratings, especially when the number of games is small. You can get it here:

http://remi.coulom.free.fr/Bayesian-Elo/

Rémi
Rémi Coulom
 
Posts: 96
Joined: 12 Nov 2004, 13:47
Location: Lille, France

Re: Bayeselo now with a proper prior

Postby Dann Corbit » 22 Sep 2005, 01:19

Rémi Coulom wrote:{snip}
I am quite convinced that this new version produces much better Elo ratings, especially when the number of games is small. You can get it here:

http://remi.coulom.free.fr/Bayesian-Elo/

Rémi


Fantastic!
Can I get a source tarball?
Dann Corbit
 

Re: ELOStat algorithm ?

Postby Kirill Kryukov » 22 Sep 2005, 01:42

Great! Thanks, Rémi!

Is there a chance a future bayeselo may run in a fully non-interactive mode? I like to run it from another program to automatically process the PGN collections from my tournaments. It would be very nice if you could run "bayeselo -readpgn ... -elo -mm -exactdist -ratings" (whatever we have to type now) and have it just print its output to stdout.
Kirill Kryukov
 
Posts: 127
Joined: 21 Sep 2005, 09:56
