Some statistical considerations for gauntlets
Posted: 19 Jun 2007, 15:53
Statistically speaking, the most efficient way to measure the strength difference between two engines is to have them play each other. Unfortunately this method does not work, as it is sensitive to systematic errors caused by playing styles. It is quite common for two engines A and B that have equal Elo ratings, based on results against a large variety of opponents, to consistently score 60-40 against each other, because the style of A happens to be particularly effective against the style of B.
To avoid this systematic error, you have to play both A and B against a variety of opponents C, and derive their rating difference from the score difference of the two gauntlets. If A and B are close in strength, and you play them against opponents C that are also close to them, this requires 4 times as many games to get the same reliability: not only do you have to play the games for A and B separately, but the result is now calculated as the difference of two statistical quantities. In such a case the statistical variances add, so each individual result must have half the variance to reach the same accuracy in the difference. And that takes twice as many games in the gauntlet of each of A and B as you would have needed in a direct match between A and B (see the sketch below).
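To make the factor of 4 concrete, here is a minimal sketch in Python; the per-game standard deviation sigma and the game count n are just illustrative placeholders:

Code:
import math

sigma = 0.5    # assumed per-game standard deviation of the result (win/loss game)
n     = 1000   # hypothetical number of games per match

# Direct match A vs B: standard error of the mean score.
se_direct = sigma / math.sqrt(n)

# Two gauntlets: result = score(A vs C) - score(B vs C).
# The variances of the two independent gauntlet scores add.
se_gauntlet = math.sqrt(sigma**2 / n + sigma**2 / n)   # = sigma * sqrt(2/n)

print(se_gauntlet / se_direct)   # sqrt(2): each gauntlet needs 2*n games,
                                 # i.e. 4*n games in total, to match se_direct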
Now this conclusion is valid if the opponents C have the same strength as A and B. If A and B are close, the reliability drops as C gets farther away in rating from A and B. I calculated how bad this effect is. Ignoring the possibility of draws, the variance in the result of a single game is PHI(x)*(1-PHI(x)), where PHI(x) is the win probability at an Elo difference x. In the Elo model PHI(x) is the cumulative normal distribution, and the statistical standard error is the square root of the variance. The expected score difference is PHI(A-C) - PHI(B-C) ~ (A-B) * PHI'(A-C) for small A-B, where PHI'(x) = exp(-0.5*x*x/(s*s)) / (s*sqrt(2*PI)) is the normal density itself. The standard deviation of the Elo model is s = 280 points.
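In code the model reads as follows; this is just a sketch, and the names phi_win, phi_density and game_variance are mine:

Code:
import math

S = 280.0  # standard deviation of the Elo model, in points

def phi_win(x):
    """Win probability PHI(x) for an Elo difference x (cumulative normal)."""
    return 0.5 * (1.0 + math.erf(x / (S * math.sqrt(2.0))))

def phi_density(x):
    """Normal density PHI'(x), the derivative of phi_win."""
    return math.exp(-0.5 * (x / S) ** 2) / (S * math.sqrt(2.0 * math.pi))

def game_variance(x):
    """Variance of a single game result, draws ignored: PHI(x)*(1-PHI(x))."""
    p = phi_win(x)
    return p * (1.0 - p)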
Calculating the relative accuracy of the result, i.e. the expected score difference divided by the standard deviation, as a function of x = A-C indeed confirms that the accuracy is largest when x = 0. If C is 280 Elo points away, the relative accuracy drops to 83% of the ideal situation. To compensate for this, you would need 1/(0.83*0.83) = 1.45 times as many games, i.e. 45% more games in each gauntlet. This is an inconvenience, but not yet a disaster.
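That calculation can be checked with a few lines of Python (self-contained; normalizing against x = 0 makes the constant in front of the density drop out):

Code:
import math

S = 280.0  # Elo-model standard deviation

def rel_accuracy(x):
    """Accuracy of the measured score difference, relative to x = 0."""
    p = 0.5 * (1.0 + math.erf(x / (S * math.sqrt(2.0))))  # win probability
    dens = math.exp(-0.5 * (x / S) ** 2)                  # density, up to a constant
    return 0.5 * dens / math.sqrt(p * (1.0 - p))          # at x = 0 this is exactly 1

print(round(rel_accuracy(280.0), 2))            # 0.83
print(round(1.0 / rel_accuracy(280.0)**2, 2))   # 1.45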
(You might be forced to use opponents with a larger than desirable Elo difference from the engines under test, simply because not enough engines of equal strength are available. But testing only at equal strength is not recommended in any case, as the Elo rating is determined just as much by how efficiently an engine finishes off opponents weaker than itself, and by how well it can hold its ground against engines stronger than itself. If the engine under test has an asymmetry there, e.g. getting totally whipped by engines that are only a little stronger, while bungling many draws against engines that are significantly weaker (say, because it does not recognize repetition draws), testing only against equal opponents would lead to overestimating its rating.)
At a difference of 2*s = 560 points, you would need nearly 5 times as many games to reach the same reliability. So having opponents in your gauntlet that are 560 rating points away from the engines under test extracts only 1/5 as much information per game as having equal opponents would. The score against such opponents would be 2.3% (or 97.7%). For opponents 280 points away the score would be 16% (or 84%), and the information per game would be 0.83*0.83 = 0.69. The full table is:
Code:
  Elo    win    relative  nr of
  diff.  prob.  accuracy  games
    0    50%     1.000     1.00
   28    54%     0.998     1.00
   56    58%     0.993     1.01
   84    62%     0.984     1.03
  112    66%     0.971     1.06
  140    69%     0.955     1.10
  168    73%     0.936     1.14
  196    76%     0.914     1.20
  224    79%     0.888     1.27
  252    82%     0.860     1.35
  280    84%     0.830     1.45
  308    86%     0.797     1.57
  336    88%     0.762     1.72
  364    90%     0.726     1.90
  392    92%     0.688     2.11
  420    93%     0.649     2.37
  448    95%     0.610     2.69
  476    96%     0.570     3.07
  504    96%     0.531     3.55
  532    97%     0.492     4.14
  560    98%     0.453     4.87
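For completeness, a short script (same model assumptions as above) that regenerates this table; the printed values can differ from the table by one unit in the last digit due to rounding:

Code:
import math

S = 280.0  # Elo-model standard deviation

print(" Elo    win    relative  nr of")
print(" diff.  prob.  accuracy  games")
for diff in range(0, 561, 28):
    p   = 0.5 * (1.0 + math.erf(diff / (S * math.sqrt(2.0))))  # win probability
    acc = 0.5 * math.exp(-0.5 * (diff / S) ** 2) / math.sqrt(p * (1.0 - p))
    print(f"{diff:4d}    {round(100 * p):2d}%    {acc:5.3f}     {1.0 / acc**2:4.2f}")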