Winboard Forum

by **Volker Pittlik** » 26 Jul 2007, 16:47

I ran a test with the same engines and the same playing conditions, but with different time controls. I compared games at one minute initially with 1 second increment to games at 10 minutes initially with 10 seconds increment, to see whether there are large differences. The results seem more or less the same, although there are some differences. I'm a bit surprised that the rating can be so different when the rank is identical. I expected bigger differences in the ranks (if any).

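1-1

| Rank | Name | Elo | + | - | games | maxElo | minElo | Rank diff | Rating out of error margin? |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Fruit (Toga) 1.2.1a | 193 | 42 | 40 | 220 | 235 | 153 | 0 | no |
| 2 | Glaurung2 -e4 perf | 125 | 40 | 39 | 220 | 165 | 86 | 1 | no |
| 3 | Spike 1.2 Turin | 97 | 39 | 38 | 220 | 136 | 59 | -1 | no |
| 4 | Ruffian 2.1.0 | 75 | 37 | 36 | 220 | 112 | 39 | 2 | yes |
| 5 | Shredder Classic 1.3 | 58 | 38 | 38 | 220 | 96 | 20 | 0 | no |
| 6 | Scorpio 1.91 | 49 | 37 | 37 | 220 | 86 | 12 | -2 | yes |
| 7 | Crafty-21.5 | -46 | 37 | 37 | 220 | -9 | -83 | 2 | no |
| 8 | Jonny 2.83 | -47 | 37 | 38 | 220 | -10 | -85 | -1 | no |
| 9 | Zappa 1.1 | -62 | 37 | 38 | 220 | -25 | -100 | -1 | no |
| 10 | Yace Paderborn | -65 | 37 | 38 | 220 | -28 | -103 | 0 | no |
| 11 | Arasan 9.5 | -98 | 37 | 38 | 220 | -61 | -136 | 0 | yes |
| 12 | Hermann 2.0 | -280 | 46 | 50 | 220 | -234 | -330 | 0 | yes |

10-10

| Rank | Name | Elo | + | - | games | maxElo | minElo |
|---|---|---|---|---|---|---|---|
| 1 | Fruit (Toga) 1.2.1a | 187 | 41 | 39 | 220 | 228 | 148 |
| 2 | Spike 1.2 Turin | 143 | 39 | 38 | 220 | 182 | 105 |
| 3 | Glaurung2 e4 perf | 105 | 38 | 37 | 220 | 143 | 68 |
| 4 | Scorpio 1.91 | 92 | 38 | 37 | 220 | 130 | 55 |
| 5 | Shredder Classic 1.3 | 39 | 37 | 37 | 220 | 76 | 2 |
| 6 | Ruffian 2.1.0 | 21 | 37 | 37 | 220 | 58 | -16 |
| 7 | Jonny 2.83 | -33 | 37 | 37 | 220 | 4 | -70 |
| 8 | Zappa 1.1 | -52 | 37 | 37 | 220 | -15 | -89 |
| 9 | Crafty-21.5 | -60 | 37 | 37 | 220 | -23 | -97 |
| 10 | Yace Paderborn | -81 | 37 | 38 | 220 | -44 | -119 |
| 11 | Arasan 9.5 | -152 | 39 | 40 | 220 | -113 | -192 |
| 12 | Hermann 2.0 | -210 | 41 | 44 | 220 | -169 | -254 |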

Volker

P.S. Other conditions:

Processor: Intel Core2Duo E6300
OS: Linux 2.6.18.8-0.5 SMP i686 (SuSE 10.2, 32 bit)
Xboard: 4.2.7
Polyglot: 1.4
Ponder: off
Learning: off
Hash: Approximately 32 MB if adjustable else defaults, swapping not tolerated
Books: Own books if available, else self created generic books, no manual tuning
TBs: and other endgame stuff up to 4 pieces in RAM disks
RAM: 1 GB

by **H.G.Muller** » 30 Jul 2007, 09:26

Nothing unusual here.

In the first place, the error bars on the rating tell you where the true rating is supposed to lie (with 68% confidence). If you re-measure the rating, that new measurement will have its own error bars, and will thus on average differ more from the first measurement than the true rating would.

The error bars should be added, in root-mean-square fashion, and for equal ranges that means they get about 40% larger.
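As a minimal sketch of that root-mean-square addition (the function name and the 40-Elo figure are illustrative assumptions, not values taken from any rating program):

```python
import math

def combine_error_bars(sigma_a: float, sigma_b: float) -> float:
    """Error bar on the difference of two independent measurements,
    obtained by adding the individual error bars in quadrature."""
    return math.hypot(sigma_a, sigma_b)

# Two measurements with equal (hypothetical) error bars of 40 Elo each.
combined = combine_error_bars(40.0, 40.0)
print(round(combined, 1))         # 56.6
print(round(combined / 40.0, 3))  # 1.414, i.e. about 40% larger
```

For equal error bars the factor is the square root of 2, which is where the "about 40%" comes from.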

Then there is the second effect: the re-measurement will only lie within these enlarged error bars in 68% of the cases. That means it is expected to lie outside of them in 32% of the cases. As you test 12 engines here, it is thus quite normal that about 4 of them differ from the first measurement by more than 1.4 times the given error bars, even if you had tested under exactly the same conditions. (This assumes the randomness in the engines is enough to consider these independent tests.)
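The arithmetic behind that estimate can be sketched as follows. This assumes normally distributed errors and 12 independent engines; the helper names are made up for illustration.

```python
import math

def fraction_outside(z: float) -> float:
    """Two-sided normal tail probability P(|Z| > z)."""
    return math.erfc(z / math.sqrt(2.0))

def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability of k or more 'outside' results among n."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(k, n + 1))

n_engines = 12
p_out = fraction_outside(1.0)   # chance of landing outside a 1-sigma band
expected = n_engines * p_out    # expected number of engines outside

print(round(p_out, 3))     # 0.317
print(round(expected, 1))  # 3.8
# Chance that 4 or more of the 12 engines land outside by luck alone:
print(prob_at_least(4, n_engines, p_out))
```

Under these assumptions, seeing four engines outside the enlarged error bars is close to the expected count rather than an outlier.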

by **Greg Simpson** » 01 Aug 2007, 10:58

I know the default confidence in Bayeselo is 95% (it can be changed). I thought that was the standard in all the ratings programs. Am I wrong?

by **H.G.Muller** » 02 Aug 2007, 08:03

Oh, you might be right. I just assumed it was the standard error, as it is usual in statistics to quote that. The 95% confidence interval is 1.96 times the standard error.

So then the fraction of programs that would be expected to lie outside of the confidence interval would be somewhat smaller. But you still expect on average several to lie outside it, and often you will observe a number larger than the average.
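A rough sketch of how the numbers change if the quoted error bars are 95% intervals, again assuming normal errors and two independent measurements with equal standard errors. The 42-Elo half-width is Fruit's "+" value from the 1-1 list, used here only as an example.

```python
import math

def fraction_outside(z: float) -> float:
    """Two-sided normal tail probability P(|Z| > z)."""
    return math.erfc(z / math.sqrt(2.0))

Z95 = 1.96          # 95% interval half-width in units of the standard error
half_width = 42.0   # example error bar in Elo, assumed to be a 95% bound
standard_error = half_width / Z95
print(round(standard_error, 1))  # 21.4

n_engines = 12
# A second measurement compared against the first one's 95% interval:
# the difference has standard error sqrt(2) times larger.
p_raw = fraction_outside(Z95 / math.sqrt(2.0))
# The same comparison against error bars enlarged by sqrt(2):
p_enlarged = fraction_outside(Z95)

print(round(n_engines * p_raw, 1))       # 2.0
print(round(n_engines * p_enlarged, 1))  # 0.6
```

So with 95% intervals, roughly two of twelve re-measured ratings would still be expected to fall outside the original bounds by chance.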

This is not really a strong indication that the ratings are actually different. (Although they might of course be, as this is a different time control.)
