Winboard Forum

by **Igor Gorelikov** » 28 Oct 2004, 15:11

How much is enough? (or Probability, part 2)

My previous post dealt with cross-tables and winners. Now we look at the
rating lists after each round-robin event. I want to find the minimum
number of games needed for a proper rating calculation.

Conditions are the same:
"Six engines that are close in strength (AEGT King Class) played two
round robins in a row. Hardware is a Celeron 567MHz with 128MB; the time
control is the shortest possible for decent chess: 1 min + 3 sec per game
(i.e. each game lasts 4 minutes on average)."

Note the first column (Chng), which I added: it shows each engine's change
in place since the previous event (plus means up, minus means down).

1st event (each event is a 2-round robin with 60 games in total)

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
         1 Delfi 4.5     : 2622  235  175     10  70.0 %    2475  40.0 %
         2 Thinker 4.6c  : 2589  244  142     10  65.0 %    2482  50.0 %
         3 Ruffian 1.0.5 : 2528  266  141     10  55.0 %    2494  50.0 %
         4 AnMon 5.50    : 2441  168  255     10  40.0 %    2511  40.0 %
         5 Pharaon 3.1   : 2410  279  244     10  35.0 %    2517  10.0 %
         6 Pro Deo 1.0   : 2410  204  244     10  35.0 %    2517  30.0 %

2nd event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
      +1 1 Thinker 4.6c  : 2605  141  110     20  67.5 %    2478  45.0 %
      -1 2 Delfi 4.5     : 2558  153  133     20  60.0 %    2488  30.0 %
       0 3 Ruffian 1.0.5 : 2500  141  141     20  50.0 %    2500  30.0 %
      +1 4 Pharaon 3.1   : 2470  144  162     20  45.0 %    2505  20.0 %
      -1 5 AnMon 5.50    : 2441  133  153     20  40.0 %    2511  30.0 %
       0 6 Pro Deo 1.0   : 2426  126  149     20  37.5 %    2514  35.0 %

3rd event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Thinker 4.6c  : 2579  114   96     30  63.3 %    2484  40.0 %
       0 2 Delfi 4.5     : 2549  122  105     30  58.3 %    2490  30.0 %
       0 3 Ruffian 1.0.5 : 2549  122  115     30  58.3 %    2490  23.3 %
      +1 4 AnMon 5.50    : 2461  108  124     30  43.3 %    2508  26.7 %
      -1 5 Pharaon 3.1   : 2431  131  117     30  38.3 %    2514  16.7 %
       0 6 Pro Deo 1.0   : 2431  109  117     30  38.3 %    2514  30.0 %

Those three events show the usual shifting from pillar to post.
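The Elo column in the tables above looks consistent with the standard logistic Elo relation between score and rating difference. The sketch below is only a guess at how the rating tool derives it, not something stated in this thread; `performance_rating` is a hypothetical helper name. For Delfi 4.5 after the 1st event (70.0 % against an average opponent of 2475) it gives a value close to the 2622 shown in the table.

```python
import math

def performance_rating(avg_opponent: float, score_fraction: float) -> float:
    """Rating implied by a score fraction against a given average opponent.

    Assumption: the standard logistic Elo model, where a rating
    difference d corresponds to an expected score 1 / (1 + 10 ** (-d / 400)).
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score_fraction must be strictly between 0 and 1")
    # Invert the logistic curve: d = 400 * log10(s / (1 - s))
    return avg_opponent + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# Delfi 4.5 after the 1st event: 70.0 % against an average opponent of 2475
print(round(performance_rating(2475, 0.70)))
```

Other rows may differ by a point or so, since the Score and Av.Op. columns are themselves rounded.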

4th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
      +2 1 Ruffian 1.0.5 : 2544  104   96     40  57.5 %    2491  25.0 %
      -1 2 Thinker 4.6c  : 2529  107   82     40  55.0 %    2494  35.0 %
      -1 3 Delfi 4.5     : 2507  113   87     40  51.2 %    2499  27.5 %
       0 4 AnMon 5.50    : 2500   99   99     40  50.0 %    2500  25.0 %
      +1 5 Pro Deo 1.0   : 2471   87  107     40  45.0 %    2506  30.0 %
      -1 6 Pharaon 3.1   : 2449  107  102     40  41.2 %    2510  17.5 %

The first important moment: from now on (i.e. until the final event) three
engines hold constant places: 1, 2 and 6.

5th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2553   90   86     50  59.0 %    2489  26.0 %
       0 2 Thinker 4.6c  : 2529   95   70     50  55.0 %    2494  38.0 %
      +2 3 Pro Deo 1.0   : 2494   75  101     50  49.0 %    2501  30.0 %
      -1 4 Delfi 4.5     : 2488   77   99     50  48.0 %    2502  28.0 %
      -1 5 AnMon 5.50    : 2477   83   96     50  46.0 %    2505  24.0 %
       0 6 Pharaon 3.1   : 2459   92   92     50  43.0 %    2508  18.0 %

6th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2564   79   78     60  60.8 %    2487  28.3 %
       0 2 Thinker 4.6c  : 2539   84   66     60  56.7 %    2492  36.7 %
      +2 3 AnMon 5.50    : 2485   73   89     60  47.5 %    2503  25.0 %
       0 4 Delfi 4.5     : 2485   68   89     60  47.5 %    2503  31.7 %
      -2 5 Pro Deo 1.0   : 2481   70   88     60  46.7 %    2504  30.0 %
       0 6 Pharaon 3.1   : 2446   83   81     60  40.8 %    2511  21.7 %

The second important moment: three engines are tightly bunched in places
3-5. From now on they will keep swapping places!

7th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2572   72   75     70  62.1 %    2485  27.1 %
       0 2 Thinker 4.6c  : 2525   80   59     70  54.3 %    2495  37.1 %
      +2 3 Pro Deo 1.0   : 2487   63   83     70  47.9 %    2502  30.0 %
       0 4 Delfi 4.5     : 2487   61   83     70  47.9 %    2502  32.9 %
      -2 5 AnMon 5.50    : 2471   70   79     70  45.0 %    2506  24.3 %
       0 6 Pharaon 3.1   : 2458   71   77     70  42.9 %    2508  25.7 %

8th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2555   69   68     80  59.4 %    2489  26.2 %
       0 2 Thinker 4.6c  : 2536   72   58     80  56.2 %    2493  35.0 %
       0 3 Pro Deo 1.0   : 2489   60   77     80  48.1 %    2502  28.8 %
       0 4 Delfi 4.5     : 2485   58   76     80  47.5 %    2503  32.5 %
       0 5 AnMon 5.50    : 2478   64   75     80  46.2 %    2504  25.0 %
       0 6 Pharaon 3.1   : 2456   67   71     80  42.5 %    2509  25.0 %

Hurray! Now we have the absolute truth. All engines are in their right
places and don't want to change positions. Nevertheless, let's check further...

9th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2555   65   63     90  59.4 %    2489  27.8 %
       0 2 Thinker 4.6c  : 2542   67   56     90  57.2 %    2491  34.4 %
      +2 3 AnMon 5.50    : 2490   58   73     90  48.3 %    2502  25.6 %
      -1 4 Pro Deo 1.0   : 2481   56   71     90  46.7 %    2504  31.1 %
      -1 5 Delfi 4.5     : 2477   57   70     90  46.1 %    2504  30.0 %
       0 6 Pharaon 3.1   : 2455   61   66     90  42.2 %    2509  28.9 %

Maybe it's not so absolute?
The three other engines continue their stupid dances ;-(

10th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2547   63   58    100  58.0 %    2491  28.0 %
       0 2 Thinker 4.6c  : 2520   67   52    100  53.5 %    2496  33.0 %
      +1 3 Pro Deo 1.0   : 2494   52   70    100  49.0 %    2501  30.0 %
      -1 4 AnMon 5.50    : 2494   55   70    100  49.0 %    2501  26.0 %
       0 5 Delfi 4.5     : 2477   54   66    100  46.0 %    2505  30.0 %
       0 6 Pharaon 3.1   : 2468   56   65    100  44.5 %    2506  29.0 %

11th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2540   61   56    110  56.8 %    2492  26.4 %
       0 2 Thinker 4.6c  : 2524   63   49    110  54.1 %    2495  33.6 %
      +1 3 AnMon 5.50    : 2503   67   51    110  50.5 %    2499  26.4 %
      +1 4 Delfi 4.5     : 2492   51   66    110  48.6 %    2502  28.2 %
      -2 5 Pro Deo 1.0   : 2489   51   65    110  48.2 %    2502  29.1 %
       0 6 Pharaon 3.1   : 2452   56   59    110  41.8 %    2510  27.3 %

12th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2532   59   51    120  55.4 %    2494  29.2 %
       0 2 Thinker 4.6c  : 2522   61   48    120  53.8 %    2496  32.5 %
      +2 3 Pro Deo 1.0   : 2498   49   64    120  49.6 %    2500  27.5 %
      -1 4 AnMon 5.50    : 2498   49   64    120  49.6 %    2500  27.5 %
      -1 5 Delfi 4.5     : 2490   49   63    120  48.3 %    2502  28.3 %
       0 6 Pharaon 3.1   : 2461   52   58    120  43.3 %    2508  28.3 %

Note that the rating lists say practically the same after 40-60 games and
after 120 games. That is
- the number one is Ruffian
- the number two is Thinker
- the number six is Pharaon
- the other three engines are very close, and telling them apart needs
many more games (hundreds? thousands?)

Conclusions:
1) The minimum number of games per engine for a rough rating estimate is
40, although this needs more tests with a greater number of engines.
2) To differentiate between some engines/versions you need your whole life
(or more?)

Igor

by **fierz** » 28 Oct 2004, 15:32

hi igor,

as you can see in your experiment the error bars are proportional to 1/sqrt(N) where N is the number of games; as usual in statistics. as you can also see in your experiment, your error bars are roughly +-60 at 100 games.

=> let's make a small table

N delta
25 120
100 60
400 30
1600 15
6400 8

this answers your question as to how many games you would need to find the "truth". personally, i couldn't care less whether an engine is 10 or 20 points stronger than another, they are just of very similar strength. 50 or 100 points is relevant, and therefore one should play at least something like 150 games to test an engine. when i want to test a new version of muse, i use 240 games. most of the time, the results are not significant (i.e. not 50 elo more or less), and i just go with the flow :-)
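The small table above can be reproduced with a few lines. This is a minimal sketch assuming only what the post states: error bars scale as 1/sqrt(N) and are about ±60 Elo at 100 games. The function names `error_bar` and `games_needed` are made up for illustration.

```python
import math

def error_bar(n_games: int, ref_games: int = 100, ref_delta: float = 60.0) -> float:
    """Approximate error bar in Elo after n_games, scaled as 1/sqrt(N).

    Calibrated on the rough figure from the thread: about +-60 at 100 games.
    """
    return ref_delta * math.sqrt(ref_games / n_games)

def games_needed(target_delta: float, ref_games: int = 100, ref_delta: float = 60.0) -> float:
    """Approximate number of games needed to shrink the error bar to target_delta."""
    return ref_games * (ref_delta / target_delta) ** 2

# Reproduce the table: N, then the approximate error bar
for n in (25, 100, 400, 1600, 6400):
    print(n, round(error_bar(n), 1))

# Games needed for a +-50 Elo error bar
print(round(games_needed(50)))
```

Under the same scaling, an error bar of ±50 works out to roughly 144 games, which is in line with the "something like 150 games" mentioned above.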

cheers
martin

by **Igor Gorelikov** » 28 Oct 2004, 15:47

Hi Martin!

Thanks for the clarification. Your table looks nice and convincing.
Of course, the OPTIMUM number of games is greater than 40.
But I was trying to find the "minimum minimorum".

Igor

by **Igor Gorelikov** » 29 Oct 2004, 13:58

Two more round robins don't change the situation so I stop this test and will try events with more participants.

13th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2525   58   48    130  54.2 %    2495  28.5 %
       0 2 Thinker 4.6c  : 2520   58   45    130  53.5 %    2496  33.1 %
      +1 3 AnMon 5.50    : 2500   52   52    130  50.0 %    2500  29.2 %
      -1 4 Pro Deo 1.0   : 2496   47   61    130  49.2 %    2501  27.7 %
       0 5 Delfi 4.5     : 2491   46   60    130  48.5 %    2502  29.2 %
       0 6 Pharaon 3.1   : 2469   50   57    130  44.6 %    2506  27.7 %

14th event

    Chng   Program          Elo    +    -  Games   Score  Av.Op.   Draws
       0 1 Ruffian 1.0.5 : 2537   54   49    140  56.4 %    2493  27.1 %
       0 2 Thinker 4.6c  : 2525   55   45    140  54.3 %    2495  31.4 %
       0 3 AnMon 5.50    : 2506   58   45    140  51.1 %    2499  27.9 %
      +1 4 Delfi 4.5     : 2494   45   58    140  48.9 %    2501  27.9 %
      -1 5 Pro Deo 1.0   : 2481   47   56    140  46.8 %    2504  26.4 %
       0 6 Pharaon 3.1   : 2456   50   53    140  42.5 %    2509  26.4 %

Igor

by **Dann Corbit** » 30 Oct 2004, 00:47

I find that including one program about 100-150 Elo above the others, along with one about 100-150 Elo below them, is very helpful.

When all the programs in a group are very close in strength, you get very near to a pure random walk. Since they really are close, every game more closely resembles a coin toss. So the closer in strength, the more games are needed to clearly divide them. Therefore, the addition of a much weaker and a much stronger program is helpful. However, 200-300 Elo is too much: such a program walks over the opposition or gets crushed by it, and that does not impart much useful data.
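To put rough numbers on the coin-toss point, here is a small sketch using the standard logistic Elo expectation. The formula is the usual textbook one and is an assumption here; the thread does not say which model the rating tool uses, and `expected_score` is just an illustrative name.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for the stronger side under the logistic Elo model.

    Assumption: expected score = 1 / (1 + 10 ** (-elo_diff / 400)).
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# Expected score for a range of rating differences
for diff in (10, 50, 100, 150, 300):
    print(diff, round(expected_score(diff), 3))
```

A 10-point gap gives an expected score only slightly above 50 %, while 100 points gives about 64 %, which is much easier to detect in a limited number of games.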

by **Heinz van Kempen** » 30 Oct 2004, 01:43

Hi Dann, Igor and all,

I think what Dann wrote makes a lot of sense. Maybe even include two programs that are clearly stronger (by 100 Elo) and two that are considerably weaker (by 100 Elo). As soon as the two stronger ones have a comfortable lead and the two weaker ones are well behind the main field, you will at least know the number of games necessary to verify a difference of 100 points. It would be interesting to see if this can be done with fewer than 80 games per engine. The problem is to find two engines that are exactly 100 points stronger or weaker than the main field :-)

For engines closer together I have no hope that anything decisive can be tested. I think you could tell 100 testers to run a match of Ruffian 1.0.5 versus ProDeo over 100 games and you would get every result from at least 70:30 for Ruffian to 70:30 for ProDeo, even if they all use the same hardware, time control and GUI.
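As a rough check on how wide that spread might be, here is a sketch that treats the games as independent, assumes two equally strong engines, and assumes a draw rate of 28 % (picked to resemble the Draws column in Igor's tables). All of these are simplifying assumptions, and `match_score_sd` is a hypothetical helper.

```python
import math

def match_score_sd(n_games: int, expected: float = 0.5, draw_rate: float = 0.28) -> float:
    """Standard deviation of the total points scored over n_games.

    Assumptions: games are independent, each game is worth 1, 0.5 or 0
    points, and the draw rate is the same in every game.
    """
    p_win = expected - draw_rate / 2.0
    p_loss = 1.0 - draw_rate - p_win
    if p_win < 0.0 or p_loss < 0.0:
        raise ValueError("expected score and draw rate are inconsistent")
    # Per-game variance: E[x^2] - E[x]^2 with x in {1, 0.5, 0}
    second_moment = p_win * 1.0 + draw_rate * 0.25
    variance = second_moment - expected ** 2
    return math.sqrt(n_games * variance)

# Spread of the final score in a 100-game match between equal engines
print(round(match_score_sd(100), 1))
```

Under these assumptions one standard deviation is a little over 4 points in a 100-game match, so a 70:30 result would sit far out in the tail. Real matches can spread wider if games are not independent, for example because of opening choice or learning files.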

Best Regards
Heinz

by **Igor Gorelikov** » 01 Nov 2004, 10:15

Hi Dann,
Just one remark.
I think if you take more engines (for instance, 12) then you get a bigger rating spread, because it's hard to find 12 TOP engines of similar strength.
The more engines, the more likely it is that their strengths vary.

Igor
