
How much is enough? (or Probability, part 2)

Posted: 28 Oct 2004, 15:11
by Igor Gorelikov

My previous post dealt with cross-tables and winners. Now let's look at the
rating lists after each round-robin event. I am trying to find the minimum
number of games needed for a proper rating calculation.

Conditions are the same:
"Six engines that are close in strength (AEGT King Class) played two
round robins in a row. Hardware is a Celeron 567MHz with 128MB; the time
control is the shortest possible for decent chess: 1 min + 3 sec per game
(i.e. each game lasts about 4 minutes on average)."

Note the first column, which I added: it shows each engine's change in place
since the previous event (plus means up, minus means down).
Code:
1st event (each event is 2-round robin with 60 games in total)

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Delfi 4.5                      : 2622  235 175    10    70.0 %   2475   40.0 %
  2 Thinker 4.6c                   : 2589  244 142    10    65.0 %   2482   50.0 %
  3 Ruffian 1.0.5                  : 2528  266 141    10    55.0 %   2494   50.0 %
  4 AnMon 5.50                     : 2441  168 255    10    40.0 %   2511   40.0 %
  5 Pharaon 3.1                    : 2410  279 244    10    35.0 %   2517   10.0 %
  6 Pro Deo 1.0                    : 2410  204 244    10    35.0 %   2517   30.0 %

2nd event

Chng   Program                          Elo    +   -   Games   Score   Av.Op.  Draws
in pl

+1  1 Thinker 4.6c                   : 2605  141 110    20    67.5 %   2478   45.0 %
-1  2 Delfi 4.5                      : 2558  153 133    20    60.0 %   2488   30.0 %
 0  3 Ruffian 1.0.5                  : 2500  141 141    20    50.0 %   2500   30.0 %
+1  4 Pharaon 3.1                    : 2470  144 162    20    45.0 %   2505   20.0 %
-1  5 AnMon 5.50                     : 2441  133 153    20    40.0 %   2511   30.0 %
 0  6 Pro Deo 1.0                    : 2426  126 149    20    37.5 %   2514   35.0 %

3rd event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Thinker 4.6c                   : 2579  114  96    30    63.3 %   2484   40.0 %
 0  2 Delfi 4.5                      : 2549  122 105    30    58.3 %   2490   30.0 %
 0  3 Ruffian 1.0.5                  : 2549  122 115    30    58.3 %   2490   23.3 %
+1  4 AnMon 5.50                     : 2461  108 124    30    43.3 %   2508   26.7 %
-1  5 Pharaon 3.1                    : 2431  131 117    30    38.3 %   2514   16.7 %
 0  6 Pro Deo 1.0                    : 2431  109 117    30    38.3 %   2514   30.0 %
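
The "Chng in pl" column can be derived mechanically from two consecutive lists. A minimal sketch, assuming each list is simply the engine names in ranked order (the function name `place_changes` is made up for illustration; the names are copied from the 1st and 2nd event lists):

```python
def place_changes(previous, current):
    """Return {engine: places gained} between two ranked lists (best first).

    Positive means the engine moved up, negative means it moved down.
    """
    prev_place = {name: i for i, name in enumerate(previous, start=1)}
    return {name: prev_place[name] - i for i, name in enumerate(current, start=1)}


# Ranking order copied from the 1st and 2nd event lists
event1 = ["Delfi 4.5", "Thinker 4.6c", "Ruffian 1.0.5",
          "AnMon 5.50", "Pharaon 3.1", "Pro Deo 1.0"]
event2 = ["Thinker 4.6c", "Delfi 4.5", "Ruffian 1.0.5",
          "Pharaon 3.1", "AnMon 5.50", "Pro Deo 1.0"]

changes = place_changes(event1, event2)
for name in event2:
    print(f"{changes[name]:+d}  {name}")
```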



These first three events show the usual shuffling from pillar to post.

Code:

4th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

+2  1 Ruffian 1.0.5                  : 2544  104  96    40    57.5 %   2491   25.0 %
-1  2 Thinker 4.6c                   : 2529  107  82    40    55.0 %   2494   35.0 %
-1  3 Delfi 4.5                      : 2507  113  87    40    51.2 %   2499   27.5 %
 0  4 AnMon 5.50                     : 2500   99  99    40    50.0 %   2500   25.0 %
+1  5 Pro Deo 1.0                    : 2471   87 107    40    45.0 %   2506   30.0 %
-1  6 Pharaon 3.1                    : 2449  107 102    40    41.2 %   2510   17.5 %



The first important moment: from now on (i.e. until the final event) three
engines keep constant places: 1, 2 and 6.

Code:
5th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2553   90  86    50    59.0 %   2489   26.0 %
 0  2 Thinker 4.6c                   : 2529   95  70    50    55.0 %   2494   38.0 %
+2  3 Pro Deo 1.0                    : 2494   75 101    50    49.0 %   2501   30.0 %
-1  4 Delfi 4.5                      : 2488   77  99    50    48.0 %   2502   28.0 %
-1  5 AnMon 5.50                     : 2477   83  96    50    46.0 %   2505   24.0 %
 0  6 Pharaon 3.1                    : 2459   92  92    50    43.0 %   2508   18.0 %

6th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2564   79  78    60    60.8 %   2487   28.3 %
 0  2 Thinker 4.6c                   : 2539   84  66    60    56.7 %   2492   36.7 %
+2  3 AnMon 5.50                     : 2485   73  89    60    47.5 %   2503   25.0 %
 0  4 Delfi 4.5                      : 2485   68  89    60    47.5 %   2503   31.7 %
-2  5 Pro Deo 1.0                    : 2481   70  88    60    46.7 %   2504   30.0 %
 0  6 Pharaon 3.1                    : 2446   83  81    60    40.8 %   2511   21.7 %


The second important moment: three engines are in a tight fight for places
3-5. From now on they keep swapping places!

Code:
7th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2572   72  75    70    62.1 %   2485   27.1 %
 0  2 Thinker 4.6c                   : 2525   80  59    70    54.3 %   2495   37.1 %
+2  3 Pro Deo 1.0                    : 2487   63  83    70    47.9 %   2502   30.0 %
 0  4 Delfi 4.5                      : 2487   61  83    70    47.9 %   2502   32.9 %
-2  5 AnMon 5.50                     : 2471   70  79    70    45.0 %   2506   24.3 %
 0  6 Pharaon 3.1                    : 2458   71  77    70    42.9 %   2508   25.7 %


8th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2555   69  68    80    59.4 %   2489   26.2 %
 0  2 Thinker 4.6c                   : 2536   72  58    80    56.2 %   2493   35.0 %
 0  3 Pro Deo 1.0                    : 2489   60  77    80    48.1 %   2502   28.8 %
 0  4 Delfi 4.5                      : 2485   58  76    80    47.5 %   2503   32.5 %
 0  5 AnMon 5.50                     : 2478   64  75    80    46.2 %   2504   25.0 %
 0  6 Pharaon 3.1                    : 2456   67  71    80    42.5 %   2509   25.0 %


Hurray! Now we have the absolute truth. All engines are in their rightful
places and don't want to change positions. Nevertheless, let's check further...

Code:
9th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2555   65  63    90    59.4 %   2489   27.8 %
 0  2 Thinker 4.6c                   : 2542   67  56    90    57.2 %   2491   34.4 %
+2  3 AnMon 5.50                     : 2490   58  73    90    48.3 %   2502   25.6 %
-1  4 Pro Deo 1.0                    : 2481   56  71    90    46.7 %   2504   31.1 %
-1  5 Delfi 4.5                      : 2477   57  70    90    46.1 %   2504   30.0 %
 0  6 Pharaon 3.1                    : 2455   61  66    90    42.2 %   2509   28.9 %


Maybe it's not so absolute after all?
The three middle engines continue their silly dance ;-(

Code:
10th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2547   63  58   100    58.0 %   2491   28.0 %
 0  2 Thinker 4.6c                   : 2520   67  52   100    53.5 %   2496   33.0 %
+1  3 Pro Deo 1.0                    : 2494   52  70   100    49.0 %   2501   30.0 %
-1  4 AnMon 5.50                     : 2494   55  70   100    49.0 %   2501   26.0 %
 0  5 Delfi 4.5                      : 2477   54  66   100    46.0 %   2505   30.0 %
 0  6 Pharaon 3.1                    : 2468   56  65   100    44.5 %   2506   29.0 %

11th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2540   61  56   110    56.8 %   2492   26.4 %
 0  2 Thinker 4.6c                   : 2524   63  49   110    54.1 %   2495   33.6 %
+1  3 AnMon 5.50                     : 2503   67  51   110    50.5 %   2499   26.4 %
+1  4 Delfi 4.5                      : 2492   51  66   110    48.6 %   2502   28.2 %
-2  5 Pro Deo 1.0                    : 2489   51  65   110    48.2 %   2502   29.1 %
 0  6 Pharaon 3.1                    : 2452   56  59   110    41.8 %   2510   27.3 %


12th event

      Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2532   59  51   120    55.4 %   2494   29.2 %
 0  2 Thinker 4.6c                   : 2522   61  48   120    53.8 %   2496   32.5 %
+2  3 Pro Deo 1.0                    : 2498   49  64   120    49.6 %   2500   27.5 %
-1  4 AnMon 5.50                     : 2498   49  64   120    49.6 %   2500   27.5 %
-1  5 Delfi 4.5                      : 2490   49  63   120    48.3 %   2502   28.3 %
 0  6 Pharaon 3.1                    : 2461   52  58   120    43.3 %   2508   28.3 %


Note that the rating lists say practically the same thing after 40-60 games
and after 120 games. That is:
- number one is Ruffian
- number two is Thinker
- number six is Pharaon
- the other three engines are very close, and telling them apart needs
many more games (hundreds? thousands?)

Conclusions:
1) The minimum number of games per engine for a rough rating estimate is
about 40, although this needs more tests with a larger number of engines.
2) To differentiate between some engines/versions you need your whole life
(or more?)
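
For reference, the Elo column in the lists above appears consistent with the usual logistic performance formula applied to the Score and Av.Op. columns. The sketch below is only a check of that observation, not a claim about how the rating tool works internally; `performance_elo` is a name chosen here for illustration, and the input values are copied from the 12th event list.

```python
import math


def performance_elo(avg_opponent, score):
    """Elo implied by a score fraction (0 < score < 1) against avg_opponent.

    Uses the logistic Elo model: diff = 400 * log10(score / (1 - score)).
    """
    return avg_opponent + 400 * math.log10(score / (1 - score))


# (Av.Op., Score) pairs copied from the 12th event list
rows = {
    "Ruffian 1.0.5": (2494, 0.554),
    "Thinker 4.6c": (2496, 0.538),
    "Pharaon 3.1": (2508, 0.433),
}

for name, (av_op, score) in rows.items():
    print(f"{name:15s} {performance_elo(av_op, score):.0f}")
```

The rounded results can be compared directly with the Elo column of the 12th event (2532, 2522 and 2461 for these three engines).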

Igor

Re: How much is enough? (or Probability, part 2)

Posted: 28 Oct 2004, 15:32
by fierz
hi igor,

as you can see in your experiment the error bars are proportional to 1/sqrt(N) where N is the number of games; as usual in statistics. as you can also see in your experiment, your error bars are roughly +-60 at 100 games.

=> let's make a small table

N      delta (Elo)
25     120
100    60
400    30
1600   15
6400   8

this answers your question as to how many games you would need to find the "truth". personally, i couldn't care less whether an engine is 10 or 20 points stronger than another, they are just of very similar strength. 50 or 100 points is relevant, and therefore one should play at least something like 150 games to test an engine. when i want to test a new version of muse, i use 240 games. most of the time, the results are not significant (i.e. not 50 elo more or less), and i just go with the flow :-)
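
The table above follows from the 1/sqrt(N) rule once one reference point is fixed. A small sketch, assuming the +-60 at 100 games quoted in this post as the reference (the constant and function names are invented for illustration):

```python
import math

REF_GAMES = 100   # reference point quoted in the post
REF_DELTA = 60.0  # error bar in Elo at REF_GAMES (assumed from the post)


def error_bar(games):
    """Approximate error bar in Elo after `games` games, scaling as 1/sqrt(N)."""
    return REF_DELTA * math.sqrt(REF_GAMES / games)


def games_needed(delta):
    """Smallest number of games whose error bar is at most `delta` Elo."""
    # The tiny epsilon guards against floating-point noise just above an integer.
    return math.ceil(REF_GAMES * (REF_DELTA / delta) ** 2 - 1e-9)


for n in (25, 100, 400, 1600, 6400):
    print(f"{n:5d} games -> +-{error_bar(n):.1f}")

print("games for +-50:", games_needed(50))
```

Under these assumptions, 150 games gives roughly +-49 and 240 games roughly +-39, which is in line with the numbers mentioned in the post.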

cheers
martin

Re: How much is enough? (or Probability, part 2)

Posted: 28 Oct 2004, 15:47
by Igor Gorelikov
Hi Martin!

Thanks for the clarification. Your table looks nice and convincing.
Of course, the OPTIMUM number of games is greater than 40.
But I was trying to find the "minimum minimorum".

Igor

Re: How much is enough? (or Probability, part 2)

Posted: 29 Oct 2004, 13:58
by Igor Gorelikov
Two more round robins didn't change the situation, so I am stopping this test and will try events with more participants.

Code:
13th event

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2525   58  48   130    54.2 %   2495   28.5 %
 0  2 Thinker 4.6c                   : 2520   58  45   130    53.5 %   2496   33.1 %
+1  3 AnMon 5.50                     : 2500   52  52   130    50.0 %   2500   29.2 %
-1  4 Pro Deo 1.0                    : 2496   47  61   130    49.2 %   2501   27.7 %
 0  5 Delfi 4.5                      : 2491   46  60   130    48.5 %   2502   29.2 %
 0  6 Pharaon 3.1                    : 2469   50  57   130    44.6 %   2506   27.7 %

14th event

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

 0  1 Ruffian 1.0.5                  : 2537   54  49   140    56.4 %   2493   27.1 %
 0  2 Thinker 4.6c                   : 2525   55  45   140    54.3 %   2495   31.4 %
 0  3 AnMon 5.50                     : 2506   58  45   140    51.1 %   2499   27.9 %
+1  4 Delfi 4.5                      : 2494   45  58   140    48.9 %   2501   27.9 %
-1  5 Pro Deo 1.0                    : 2481   47  56   140    46.8 %   2504   26.4 %
 0  6 Pharaon 3.1                    : 2456   50  53   140    42.5 %   2509   26.4 %



Igor

Something to give good numbers

Posted: 30 Oct 2004, 00:47
by Dann Corbit
I find that including one program about 100-150 Elo stronger than the others, along with one about 100-150 Elo weaker, is very helpful.

When all the programs in a group are very close in strength, you get very near to a pure random walk. Since they really are close, every game more closely resembles a coin toss. So the closer they are in strength, the more games are needed to separate them clearly. Therefore, adding one much weaker and one much stronger program is helpful. However, 200-300 Elo is too much: such a program walks over the opposition or gets crushed by it, and that does not yield much useful data.
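
To put rough numbers on the "coin toss" remark, here is a sketch of the expected score under the standard logistic Elo model. That model is an assumption here (rating tools may use a slightly different curve), and `expected_score` is a name chosen for illustration:

```python
def expected_score(elo_diff):
    """Expected score (0..1) of the stronger side under the logistic Elo model."""
    return 1 / (1 + 10 ** (-elo_diff / 400))


# Rating gaps discussed in this thread
for diff in (10, 50, 100, 150, 300):
    print(f"{diff:>3} Elo -> {expected_score(diff):.3f}")
```

A gap of 10 Elo gives an expected score of about 0.51, barely distinguishable from an even game, while 100 Elo gives about 0.64 and 300 Elo about 0.85.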

Re: How much is enough? (or Probability, part 2)

Posted: 30 Oct 2004, 01:43
by Heinz van Kempen
Hi Dann, Igor and all :D ,

I think what Dann wrote makes a lot of sense. Maybe even include two programs that are clearly stronger (by 100 Elo) and two that are considerably weaker (by 100 points). As soon as the two stronger ones have a comfortable lead and the two weaker ones are well behind the main field, you will have at least the number of games necessary to verify a difference of 100 points. It would be interesting to see if this can be done with fewer than 80 games per engine. The problem is to find two engines that are exactly 100 points stronger or weaker than the main field :-).

For engines closer together I have no hope that anything decisive can be tested. I think you could ask 100 testers to run a 100-game match of Ruffian 1.0.5 versus ProDeo and you would get every result from at least 70:30 for Ruffian to 70:30 for ProDeo, even if they all use the same hardware, time control and GUI.
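
One rough way to put a number on the spread of such 100-game matches is a simple model of independent games. The sketch below assumes two evenly matched engines and a 30 % draw rate (both are assumptions, loosely based on the draw percentages in Igor's lists); `match_score_sd` is a name invented here.

```python
import math


def match_score_sd(games, win_prob, draw_prob):
    """Standard deviation of the total match score (in points).

    Each game is treated as independent: 1 point for a win, 0.5 for a draw.
    """
    mean = win_prob + 0.5 * draw_prob               # expected score per game
    var = win_prob + 0.25 * draw_prob - mean ** 2   # E[x^2] - mean^2 per game
    return math.sqrt(games * var)


# Assumptions: evenly matched engines (35 % wins each), 30 % draws, 100 games
sd = match_score_sd(games=100, win_prob=0.35, draw_prob=0.30)
print(f"one sigma: +- {sd:.1f} points out of 100")
```

This model ignores any extra variation from hardware, GUI or opening choice.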

Best Regards
Heinz

Re: How much is enough? (or Probability, part 2)

Posted: 01 Nov 2004, 10:15
by Igor Gorelikov
Hi Dann,
Just one remark.
I think if you take more engines (for instance, 12) then you get a larger rating spread, because it's hard to find 12 TOP engines of similar strength.
The more engines, the more likely they are to vary in strength.

Igor