Some statistical considerations for gauntlets

Some statistical considerations for gauntlets

Postby H.G.Muller » 19 Jun 2007, 15:53

Statistically speaking, the most efficient way to measure the strength difference between two engines is to have them play each other. Unfortunately this method does not work, as it is sensitive to systematic errors caused by playing styles. It is quite normal for two engines A and B that have equal Elo ratings, based on results against a large variety of opponents, to score a consistent 60-40 against each other, because the style of A happens to be particularly effective against the style of B.

To avoid this systematic error, you have to play both A and B against a variety of opponents C, and derive their rating difference from the score difference of these gauntlets. If A and B are close in strength, and you play them against opponents C that are also close to them, this requires 4 times as many games to get the same reliability: not only do you have to play the games for A and B separately, but the result is now calculated as the difference of two statistical quantities. When taking such a difference the variances add, so each individual result must have half the variance to give the difference the same accuracy. That takes twice as many games in each of the two gauntlets as you would have needed in a direct match between A and B.
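
A minimal Python sketch of this variance bookkeeping (assuming evenly matched engines and ignoring draws, so each game result has variance 0.25; the numbers are just an illustration):

Code: Select all
# Variance bookkeeping: direct match vs. difference of two gauntlets.
v = 0.25              # per-game variance of one result (equal engines, no draws)
N = 1000              # games in a direct A-B match
var_match = v / N     # variance of the match score fraction

# In the gauntlets, the A-vs-C and B-vs-C scores are independent, so the
# variance of their difference is the sum of their variances.  To match
# var_match, each gauntlet needs twice the games, 4*N in total:
Ng = 2 * N
var_diff = v / Ng + v / Ng
print(var_match, var_diff)   # equal, so the difference is just as accurate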

Now this conclusion is valid if the opponents C have the same strength as A and B. If A and B are close to each other, the reliability drops as C gets farther away from them in rating. I calculated how bad this effect is. Ignoring the draw possibility, the variance in the result of a game is PHI(x)*(1-PHI(x)), where PHI(x) is the win probability at an Elo difference x. In the Elo model PHI(x) is the cumulative normal distribution. The statistical standard error is the square root of this variance. The expected difference in the result is PHI(A-C) - PHI(B-C) ~ (A-B) * PHI'(A-C), where PHI'(x) = exp(-x*x/(2*s*s)) / (s*sqrt(2*PI)) is the normal density function itself. The standard deviation in the Elo model is s = 280 points.

Calculating the relative accuracy of the result, i.e. the expected score difference divided by the standard error, as a function of the ratings of A (or B) and C, indeed confirms that the accuracy is largest when x = A-C = 0. If C is 280 Elo points away, the relative accuracy drops to 83% of the ideal situation. To compensate for this, you would need 1/(0.83*0.83) = 1.45 times as many games, i.e. 45% more games in each gauntlet. This is an inconvenience, but not yet a disaster.

(You might be forced to use opponents with a larger than desirable Elo difference to the engines under test, simply because not enough engines of equal strength are available. But testing only at equal strength is not recommended in any case, as the Elo rating is just as much determined by how efficiently an engine finishes off opponents weaker than itself, and by how well it can hold its ground against engines stronger than itself. If the engine under test has an asymmetry there, e.g. getting totally whipped by engines that are only a little stronger, but bungling many draws against engines that are significantly weaker (e.g. because it does not recognize repetition draws), testing only against equal opponents would lead to overestimating its rating.)

At a difference of 2*s = 560 points, you would need nearly 5 times as many games to reach the same reliability. So opponents in your gauntlet that are 560 rating points away from the engines under test extract only 1/5 as much information per game as equal opponents would. The score against such opponents would be only 2.3% (or 97.7%). For opponents 280 points away the score would be 16% (or 84%), and the information per game 0.69. The full table is:

Code: Select all
ELO    win     relative  nr of
diff.  prob.   accuracy  games
  -     50%    1.000     1.00
  28    54%    0.998     1.00
  56    58%    0.993     1.01
  84    62%    0.984     1.03
 112    66%    0.971     1.06
 140    69%    0.955     1.10
 168    73%    0.936     1.14
 196    76%    0.914     1.20
 224    79%    0.888     1.27
 252    82%    0.860     1.35
 280    84%    0.830     1.45
 308    86%    0.797     1.57
 336    88%    0.762     1.72
 364    90%    0.726     1.90
 392    92%    0.688     2.11
 420    93%    0.649     2.37
 448    95%    0.610     2.69
 476    96%    0.570     3.07
 504    96%    0.531     3.55
 532    97%    0.492     4.14
 560    98%    0.453     4.87
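
The table can be reproduced with a short Python script; a sketch under the same assumptions as above (cumulative-normal win probability, s = 280, draws ignored):

Code: Select all
# Recompute win probability, relative accuracy and game factor
# for Elo differences 0..560 in steps of 28 (s/10).
from math import erf, exp, sqrt, pi

s = 280.0                                 # Elo model standard deviation

def win_prob(x):                          # PHI(x): cumulative normal
    return 0.5 * (1.0 + erf(x / (s * sqrt(2.0))))

def density(x):                           # PHI'(x): normal density
    return exp(-x * x / (2.0 * s * s)) / (s * sqrt(2.0 * pi))

acc0 = density(0.0) / sqrt(0.25)          # accuracy against an equal opponent
for d in range(0, 561, 28):
    p = win_prob(d)
    acc = density(d) / sqrt(p * (1.0 - p)) / acc0
    print("%4d  %3.0f%%  %5.3f  %5.2f" % (d, 100.0 * p, acc, 1.0 / (acc * acc)))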


Re: Some statistical considerations for gauntlets

Postby Marc Lacrosse » 19 Jun 2007, 16:24

Hi HG

In my own experience, when I compare two differently tuned versions of the same engine (which is similar to comparing two different engines of close strength), the most efficient approach is to have both play a series of Nunn matches against the same panel of well-known engines, and to analyse the combined set of games with Rémi Coulom's Bayeselo utility (I mix the games from both series of matches with those of a round-robin tournament among the reference engines that was played once and for all).

One year ago I demonstrated how closely the results of this fast testing approach can match those of the large rating lists (see http://users.skynet.be/mlcc/chessbazaar/mlmfl.html ).

Since then I slightly modified my scheme.

For example, I am currently busy tuning an experimental version of a well-known program.
For each new set of parameters I let it play 256 games at 1+1 (4 opponents * 32 positions * 2 colours) against the four best free engines from the March 2007 CCRL blitz list: Rybka 1.0b, Toga 1.3.1, Spike 1.2 and Naum 2.0.

With Bayeselo I get an estimate with a +/- 19 Elo error margin at the 95% confidence level from each such 256-game test.
Once some parameter modification leads to a clear improvement under this scheme, I begin longer testing at a slower pace.

So far I am happy with this way of proceeding.

Of course, it would be necessary to carefully select other reference engines if you are to test an engine or engines in a very different rating range.

By the way, Rémi Coulom has interesting theoretical material on precisely the fact that his analytical approach is especially powerful for engines of very different strength.

Did you have a look at the Bayeselo pages?

http://remi.coulom.free.fr/Bayesian-Elo/#theory

Regards

Marc

Re: Some statistical considerations for gauntlets

Postby H.G.Muller » 19 Jun 2007, 17:42

Yes, I did. I think the Bayesian approach is great. Indeed one can always get the error bars from applying BayesElo to the results.

The reason I did the calculation above was to get some feel for the number of games one has to play to reach a certain accuracy, even before having any results. For matches between approximately equal engines the standard error in the score fraction is easily calculated as 0.4/sqrt(N), where N is the number of games. I always used this formula as an estimate of the number of games I would need to resolve a certain score difference.
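
A minimal sketch of that rule of thumb (the 0.4 per-game standard deviation for roughly equal engines is from the text above; the example numbers are arbitrary):

Code: Select all
# Rule of thumb: standard error of the score fraction ~ 0.4/sqrt(N).
from math import sqrt, ceil

def score_std_err(n):
    # standard error of the score fraction after n games
    return 0.4 / sqrt(n)

def games_needed(score_diff, sigmas=2.0):
    # smallest n for which score_diff equals `sigmas` standard errors
    return ceil((sigmas * 0.4 / score_diff) ** 2)

print(score_std_err(100))      # 0.04: +/- 4% after 100 games
print(games_needed(0.05))      # 256 games to resolve a 5% score difference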

But it is reassuring to know that even against opponents that are ~300 Elo points away, each game still counts as heavily as 0.69 games against an equal opponent, so that you hardly lose any accuracy when playing a gauntlet spread over +/- 300 Elo.

The Nunn positions are indeed very useful. I also used them, for testing micro-Max improvements, as the number of acceptable opponents in that Elo range was very limited (most engines were buggy and unsuitable for automatic testing, as they continually hung the system), and the ones that played without problems usually played reproducibly (just like uMax).

Up to now I used the rule that I kicked opponents out of the gauntlet if they did not score within the 25%-75% interval. But the calculation shows that this was a little too conservative, and that one can still learn quite a lot from playing engines that score 16% or 84%.
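
Translated into Elo distances under the same normal model (s = 280), those score cut-offs look as follows; a small sketch using the inverse normal CDF from Python's standard library:

Code: Select all
# Convert score cut-offs to Elo distances via the inverse normal CDF.
from statistics import NormalDist

elo = NormalDist(mu=0.0, sigma=280.0)   # the normal Elo model, s = 280
print(elo.inv_cdf(0.75))   # ~189 Elo: the old 25%-75% cut-off
print(elo.inv_cdf(0.84))   # ~278 Elo: an 84% (or 16%) score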

Re: Some statistical considerations for gauntlets

Postby Yannik Snoeckx » 24 Jun 2007, 17:01

H.G.Muller wrote:But it is reassuring to know that even against opponents that are ~300 Elo points away, each game still counts as heavily as 0.69 games against an equal opponent, so that you hardly lose any accuracy when playing a gauntlet spread over +/- 300 Elo.


Hi HG,

About a year ago I asked in the tournament forum what rating differences would be acceptable, and the empirical answers from the experts were all around 300-400 points. Thanks for the theory that proves it.

If I remember correctly, there is such a limit of 300 points in the calculation of some official rating lists, like that of the French federation for example.

And if one wants to verify all that, one can fiddle with pgnscanner by G. Guillory. His program has an option to include only engines within a limited rating range in the Elo calculation.

Best wishes.
Y. Snoeckx

