Some statistical considerations for gauntlets
Posted: 19 Jun 2007, 15:53
Statistically speaking, the most efficient way to measure the strength difference between two engines is to have them play each other. Unfortunately this method does not work, as it is sensitive to systematic errors caused by playing styles. It is quite common for two engines A and B that have equal Elo ratings, based on results against a large variety of opponents, to consistently score 60-40 against each other, because the style of A happens to be particularly effective against the style of B.
To avoid this systematic error, you have to play both A and B against a variety of opponents C, and derive their rating difference from the score difference of the two gauntlets. If A and B are close in strength, and you play them against opponents C that are also close to them, this requires 4 times as many games to get the same reliability: not only do you have to play the games for A and B separately, but the result is now calculated as the difference of two statistical quantities. In such a case the statistical variances add, so each individual result must have half the variance to reach the same accuracy in the difference. And that takes twice as many games in the gauntlet of each of A and B as you would have needed in a direct match between A and B (see the sketch below).
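To make the factor of 4 concrete, here is a minimal sketch in Python; the per-game standard deviation sigma and the game count n are just illustrative placeholders:

Code:
import math

sigma = 0.5    # assumed per-game standard deviation of the result (win/loss game)
n     = 1000   # hypothetical number of games per match

# Direct match A vs B: standard error of the mean score.
se_direct = sigma / math.sqrt(n)

# Two gauntlets: result = score(A vs C) - score(B vs C).
# The variances of the two independent gauntlet scores add.
se_gauntlet = math.sqrt(sigma**2 / n + sigma**2 / n)   # = sigma * sqrt(2/n)

print(se_gauntlet / se_direct)   # sqrt(2): each gauntlet needs 2*n games,
                                 # i.e. 4*n games in total, to match se_direct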
Now this conclusion is valid if the opponents C have the same strength as A and B. If A and B are close, the reliability drops as C gets farther away in rating from A and B. I calculated how bad this effect is. Ignoring the possibility of draws, the variance in the result of a single game is PHI(x)*(1-PHI(x)), where PHI(x) is the win probability at an Elo difference x. In the Elo model PHI(x) is the cumulative normal distribution, and the statistical standard error is the square root of the variance. The expected score difference is PHI(A-C) - PHI(B-C) ~ (A-B) * PHI'(A-C) for small A-B, where PHI'(x) = exp(-0.5*x*x/(s*s)) / (s*sqrt(2*PI)) is the normal density itself. The standard deviation of the Elo model is s = 280 points.
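In code the model reads as follows; this is just a sketch, and the names phi_win, phi_density and game_variance are mine:

Code:
import math

S = 280.0  # standard deviation of the Elo model, in points

def phi_win(x):
    """Win probability PHI(x) for an Elo difference x (cumulative normal)."""
    return 0.5 * (1.0 + math.erf(x / (S * math.sqrt(2.0))))

def phi_density(x):
    """Normal density PHI'(x), the derivative of phi_win."""
    return math.exp(-0.5 * (x / S) ** 2) / (S * math.sqrt(2.0 * math.pi))

def game_variance(x):
    """Variance of a single game result, draws ignored: PHI(x)*(1-PHI(x))."""
    p = phi_win(x)
    return p * (1.0 - p)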
Calculating the relative accuracy of the result, i.e. the expected score difference divided by the standard deviation, as a function of x = A-C indeed confirms that the accuracy is largest when x = 0. If C is 280 Elo points away, the relative accuracy drops to 83% of the ideal situation. To compensate for this, you would need 1/(0.83*0.83) = 1.45 times as many games, i.e. 45% more games in each gauntlet. This is an inconvenience, but not yet a disaster.
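That calculation can be checked with a few lines of Python (self-contained; normalizing against x = 0 makes the constant in front of the density drop out):

Code:
import math

S = 280.0  # Elo-model standard deviation

def rel_accuracy(x):
    """Accuracy of the measured score difference, relative to x = 0."""
    p = 0.5 * (1.0 + math.erf(x / (S * math.sqrt(2.0))))  # win probability
    dens = math.exp(-0.5 * (x / S) ** 2)                  # density, up to a constant
    return 0.5 * dens / math.sqrt(p * (1.0 - p))          # at x = 0 this is exactly 1

print(round(rel_accuracy(280.0), 2))            # 0.83
print(round(1.0 / rel_accuracy(280.0)**2, 2))   # 1.45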
(You might be forced to use opponents with a larger than desirable Elo difference from the engines under test, simply because not enough engines of equal strength are available. But testing only at equal strength is not recommended in any case, as the Elo rating is determined just as much by how efficiently an engine finishes off opponents weaker than itself, and by how well it can hold its ground against engines stronger than itself. If the engine under test has an asymmetry there, e.g. getting totally whipped by engines that are only a little stronger, while bungling many draws against engines that are significantly weaker (say, because it does not recognize repetition draws), testing only against equal opponents would lead to overestimating its rating.)
At a difference of 2*s = 560 points, you would need nearly 5 times as many games to reach the same reliability. So having opponents in your gauntlet that are 560 rating points away from the engines under test extracts only 1/5 as much information per game as having equal opponents would. The score against such opponents would be 2.3% (or 97.7%). For opponents 280 points away the score would be 16% (or 84%), and the information per game would be 0.83*0.83 = 0.69. The full table is:
Code:
  Elo    win    relative  nr of
  diff.  prob.  accuracy  games
    0    50%     1.000     1.00
   28    54%     0.998     1.00
   56    58%     0.993     1.01
   84    62%     0.984     1.03
  112    66%     0.971     1.06
  140    69%     0.955     1.10
  168    73%     0.936     1.14
  196    76%     0.914     1.20
  224    79%     0.888     1.27
  252    82%     0.860     1.35
  280    84%     0.830     1.45
  308    86%     0.797     1.57
  336    88%     0.762     1.72
  364    90%     0.726     1.90
  392    92%     0.688     2.11
  420    93%     0.649     2.37
  448    95%     0.610     2.69
  476    96%     0.570     3.07
  504    96%     0.531     3.55
  532    97%     0.492     4.14
  560    98%     0.453     4.87
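For completeness, a short script (same model assumptions as above) that regenerates this table; the printed values can differ from the table by one unit in the last digit due to rounding:

Code:
import math

S = 280.0  # Elo-model standard deviation

print(" Elo    win    relative  nr of")
print(" diff.  prob.  accuracy  games")
for diff in range(0, 561, 28):
    p   = 0.5 * (1.0 + math.erf(diff / (S * math.sqrt(2.0))))  # win probability
    acc = 0.5 * math.exp(-0.5 * (diff / S) ** 2) / math.sqrt(p * (1.0 - p))
    print(f"{diff:4d}    {round(100 * p):2d}%    {acc:5.3f}     {1.0 / acc**2:4.2f}")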