Probability and computer chess.

Post by Igor Gorelikov » 27 Oct 2004, 13:49

Probability and computer chess.

If you take several engines of similar strength and run several tournaments in
a row, can you guess the winners?
My assumption was that you cannot.
So I performed a small experiment. Six engines of similar strength
(of AEGT King Class) played six double round robins in a row.
Hardware is a Celeron 567MHz with 128MB, and the time control is the shortest
possible for decent chess: 1 min + 3 sec increment (i.e. each game lasts about
4 minutes on average).

Here are the results. Each time there is a new winner! Only after five events
does a winner repeat (Ruffian).
   Engine        Score  De Th Ru An Ph Pr    S-B
1: Delfi 4.5     7,0/10 ?? 01 == 1= =1 11   30,25
2: Thinker 4.6c  6,5/10 10 ?? =1 == 11 ==   29,75
3: Ruffian 1.0.5 5,5/10 == =0 ?? 1= 01 1=   25,00
4: AnMon 5.50    4,0/10 0= == 0= ?? 00 11   19,75
5: Pharaon 3.1   3,5/10 =0 00 10 11 ?? 00   17,00
6: Pro Deo 1.0   3,5/10 00 == 0= 00 11 ??   16,25

   Engine        Score  Th Ph De Ru Pr An    S-B
1: Thinker 4.6c  7,0/10 ?? == 11 01 == 11   32,00
2: Pharaon 3.1   5,5/10 == ?? 00 1= 11 01   25,75
3: Delfi 4.5     5,0/10 00 11 ?? 10 =0 =1   23,50
4: Ruffian 1.0.5 4,5/10 10 0= 01 ?? 01 10   22,75
5: Pro Deo 1.0   4,0/10 == 00 =1 10 ?? 0=   21,00
6: AnMon 5.50    4,0/10 00 10 =0 01 1= ??   18,50

   Engine        Score  Ru Th De An Pr Ph    S-B
1: Ruffian 1.0.5 7,5/10 ?? 01 11 0= 11 11   32,00
2: Thinker 4.6c  5,5/10 10 ?? == 11 1= 00   29,00
3: Delfi 4.5     5,5/10 00 == ?? 01 11 =1   22,25
4: AnMon 5.50    5,0/10 1= 00 10 ?? =0 11   23,75
5: Pro Deo 1.0   4,0/10 00 0= 00 =1 ?? 11   15,25
6: Pharaon 3.1   2,5/10 00 11 =0 00 00 ??   13,75

   Engine        Score  An Pr Ru Ph De Th    S-B
1: AnMon 5.50    7,0/10 ?? 11 =0 =1 01 11   32,25
2: Pro Deo 1.0   6,5/10 00 ?? =1 =1 11 1=   26,25
3: Ruffian 1.0.5 5,5/10 =1 =0 ?? 00 1= 11   24,25
4: Pharaon 3.1   5,0/10 =0 =0 11 ?? 10 10   23,75
5: Delfi 4.5     3,0/10 10 00 0= 01 ?? 0=   16,25
6: Thinker 4.6c  3,0/10 00 0= 00 01 1= ??   12,75

   Engine        Score  Pr Ru Th Ph De An    S-B
1: Pro Deo 1.0   6,5/10 ?? 10 1= 01 1= 1=   29,50
2: Ruffian 1.0.5 6,5/10 01 ?? =0 1= =1 11   28,00
3: Thinker 4.6c  5,5/10 0= =1 ?? =1 0= 1=   26,75
4: Pharaon 3.1   5,0/10 10 0= =0 ?? 11 01   22,50
5: Delfi 4.5     3,5/10 0= =0 1= 00 ?? 01   17,75
6: AnMon 5.50    3,0/10 0= 00 0= 10 10 ??   14,50

   Engine        Score  Ru Th An De Pr Ph    S-B
1: Ruffian 1.0.5 7,0/10 ?? 10 1= 11 1= ==   32,00
2: Thinker 4.6c  6,5/10 01 ?? 0= == 11 11   27,25
3: AnMon 5.50    5,5/10 0= 1= ?? 01 0= 11   25,50
4: Delfi 4.5     4,5/10 00 == 10 ?? 1= ==   20,25
5: Pro Deo 1.0   3,5/10 0= 00 1= 0= ?? 01   17,00
6: Pharaon 3.1   3,0/10 == 00 00 == 10 ??   15,00

The conclusions are obvious and commonplace:
1) The engines really are close in strength.
2) Any engine can win an event held among similar engines if the number of
games is small.
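
As a rough illustration of conclusion 2, here is a minimal Python sketch that replays the experiment with six engines of exactly equal strength. Everything in it is an assumption made for illustration: the function names are hypothetical, each game is an independent random draw, and the draw rate of 0.29 is simply the 52 draws out of 180 games in the combined table.

```python
import random
from collections import Counter

def simulate_event(n_engines=6, rounds=2, draw_rate=0.29, rng=random):
    """One multi-round robin among engines of exactly equal strength (assumption)."""
    scores = [0.0] * n_engines
    for _ in range(rounds):
        for a in range(n_engines):
            for b in range(a + 1, n_engines):
                r = rng.random()
                if r < draw_rate:                          # draw
                    scores[a] += 0.5
                    scores[b] += 0.5
                elif r < draw_rate + (1 - draw_rate) / 2:  # engine a wins
                    scores[a] += 1.0
                else:                                      # engine b wins
                    scores[b] += 1.0
    return scores

def winner_counts(n_events=1000, seed=0):
    """Count how often each engine index finishes first (ties broken at random)."""
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(n_events):
        scores = simulate_event(rng=rng)
        best = max(scores)
        tied = [i for i, s in enumerate(scores) if s == best]
        winners[rng.choice(tied)] += 1
    return winners
```

Calling `sorted(winner_counts().items())` shows how the first places spread over the six engines when nothing but chance separates them.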

Just to complete the overall picture, I add the combined cross-table and ratings.
2004.10.23 - 2004.10.25
                     Score                1            2            3            4            5            6
-------------------------------------------------------------------------------------------------------------
 1: Ruffian 1.0.5  36.5 / 60   XXXXXXXXXXXX =0100111=010 1=100==1111= ==01111==111 1=0111=0011= 010=11001===
 2: Thinker 4.6c   34.0 / 60   =1011000=101 XXXXXXXXXXXX ==1111001=0= 1011==1=0=== ====1=0=0=11 11==0001=111
 3: AnMon 5.50     28.5 / 60   0=011==0000= ==0000110=1= XXXXXXXXXXXX 0==010011001 111==0110=0= 001011=11011
 4: Delfi 4.5      28.5 / 60   ==10000==000 0100==0=1=== 1==101100110 XXXXXXXXXXXX 11=011000=1= =111=10100==
 5: Pro Deo 1.0    28.0 / 60   0=1000=1100= ====0=1=1=00 000==1001=1= 00=100111=0= XXXXXXXXXXXX 110011=10101
 6: Pharaon 3.1    24.5 / 60   101=00110=== 00==1110=000 110100=00100 =000=01011== 001100=01010 XXXXXXXXXXXX
-------------------------------------------------------------------------------------------------------------
180 games: +67 =52 -61


    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Ruffian 1.0.5                  : 2564   79  78    60    60.8 %   2487   28.3 %
  2 Thinker 4.6c                   : 2539   84  66    60    56.7 %   2492   36.7 %
  3 AnMon 5.50                     : 2485   73  89    60    47.5 %   2503   25.0 %
  4 Delfi 4.5                      : 2485   68  89    60    47.5 %   2503   31.7 %
  5 Pro Deo 1.0                    : 2481   70  88    60    46.7 %   2504   30.0 %
  6 Pharaon 3.1                    : 2446   83  81    60    40.8 %   2511   21.7 %


But how many games are enough? That's the question.
To be continued...

Igor

Re: Probability and computer chess.

Post by Anonymous » 27 Oct 2004, 14:16

In my opinion, this test is flawed because of the unreasonable time control.

Re: Probability and computer chess.

Post by Igor Gorelikov » 27 Oct 2004, 14:19

Here time control is of no importance.

Igor

Re: Probability and computer chess.

Post by Volker Pittlik » 27 Oct 2004, 14:22

Igor Gorelikov wrote:Probability and computer chess.

...
    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Ruffian 1.0.5                  : 2564   79  78    60    60.8 %   2487   28.3 %
  2 Thinker 4.6c                   : 2539   84  66    60    56.7 %   2492   36.7 %
  3 AnMon 5.50                     : 2485   73  89    60    47.5 %   2503   25.0 %
  4 Delfi 4.5                      : 2485   68  89    60    47.5 %   2503   31.7 %
  5 Pro Deo 1.0                    : 2481   70  88    60    46.7 %   2504   30.0 %
  6 Pharaon 3.1                    : 2446   83  81    60    40.8 %   2511   21.7 %


But how many games are enough? That's the question.
To be continued...

Igor


Thanks, Igor, for this interesting experiment. I'm afraid many more games are needed. Look at the Elo ratings and the error margins. Assuming Elostat is calculating correctly, Ruffian's "true" rating is somewhere between 2486 and 2643 (with an error probability of 5%). Pharaon's "true" rating is somewhere between 2365 and 2529. Therefore Pharaon is possibly the "true" number one and Ruffian may be "only" number 3.
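
To see where margins of this size come from, here is a small Python sketch that derives a 95% interval from a win/draw/loss record. It is not Elostat's actual algorithm, which the thread does not describe; it assumes a plain normal approximation on the mean score and the standard logistic Elo formula, and the function names are hypothetical.

```python
import math

def elo_from_score(p):
    """Elo difference implied by an expected score p, with 0 < p < 1."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Return (low, mid, high) Elo difference versus the average opponent."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score, using the observed results.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    half = z * math.sqrt(var / n)   # half-width of the interval on the score
    return (elo_from_score(mean - half),
            elo_from_score(mean),
            elo_from_score(mean + half))
```

Ruffian's 36.5/60 with 28.3% draws corresponds to 28 wins, 17 draws and 15 losses, so `elo_interval(28, 17, 15)` gives its offset from the 2487 average opponent. On these assumptions the interval is about as wide as the margins Elostat reports.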

To save test time you can just copy the PGNs of all games and paste them into the same file again and again, and watch how the error margin gets narrower. I once did that for different Ruffian versions. I don't remember exactly, but thousands of games would have been needed to distinguish the different versions.

Regards

Volker

Re: Probability and computer chess.

Post by José Carlos » 27 Oct 2004, 14:28

Igor Gorelikov wrote:Here time control is of no importance.

Igor


I agree. The problem is too few games. Another test that shows the same behaviour: pick two engines of similar strength and run a 100-game match. Now write down the results and pick series of 10 games. You'll find 7-3, 2-8, 5.5-4.5, etc. Only the final result is significant.
So yes, any tournament of so few games is basically random with regard to the final standings, unless you use engines of very different strength: if you match Shredder vs Averno and get 9.5-0.5, that's significant because you know in advance that Shredder is much stronger.
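
The 100-game test is easy to imitate with random numbers. The sketch below is only an illustration under stated assumptions: two engines of exactly equal strength, independent games, and an assumed draw rate of 0.3. The function names are hypothetical.

```python
import random

def play_match(n_games=100, draw_rate=0.3, seed=1):
    """Results from the first engine's view: 1.0 win, 0.5 draw, 0.0 loss."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_games):
        r = rng.random()
        if r < draw_rate:
            results.append(0.5)
        elif r < draw_rate + (1 - draw_rate) / 2:
            results.append(1.0)
        else:
            results.append(0.0)
    return results

def chunk_scores(results, size=10):
    """Score of the first engine in each consecutive series of `size` games."""
    return [sum(results[i:i + size]) for i in range(0, len(results), size)]
```

Printing `chunk_scores(play_match())` next to `sum(play_match())` for a few different seeds shows the 10-game series swinging widely while the 100-game total stays much closer to 50.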
_____________________________
José Carlos Martínez Galán

Re: Probability and computer chess.

Post by Anonymous » 27 Oct 2004, 15:20

Igor Gorelikov wrote:Here time control is of no importance.

Igor


I think your test is interesting. But it is a test to compare closeness of engine strength, and to say time control is of no importance in such a test is total nonsense.

Re: Probability and computer chess.

Post by Volker Pittlik » 27 Oct 2004, 15:26

David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.


Because...?

Volker

Re: Probability and computer chess.

Post by Uri Blass » 27 Oct 2004, 15:34

David Dahlem wrote:
Igor Gorelikov wrote:Here time control is of no importance.

Igor


I think your test is interesting. But it is a test to compare closeness of engine strength, and to say time control is of no importance in such a test is total nonsense.


I agree that time control is of no importance because the test is not to find which engine is better at long time control.

You can expect similar behaviour at longer time controls, except that the order of engines of similar strength may be different.

Uri

Re: Probability and computer chess.

Post by Heinz van Kempen » 27 Oct 2004, 15:40

Hello Igor and all :D ,

thanks for the test. I agree with Volker. If the rating difference is 20 points or less you will need thousands of games, if such a difference can be proven at all. This is somewhat frustrating for testers and also a reason why people sometimes refuse to believe results from other testers that differ from their own. When you compare the many Elo lists after several hundred games you will find similarities but also differences for some engines, and those differences need not come from different GUIs, hardware or slightly different time controls but are often accidental.
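
A back-of-the-envelope Python sketch of how the number of games grows as the rating gap shrinks. It rests on assumptions not stated in the thread: a head-to-head match, independent games, an assumed draw rate of 0.3, a per-game variance approximated as for nearly equal engines, and "proven" taken to mean that a 95% interval around the expected score no longer includes 50%. The function names are hypothetical.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger engine under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_rate=0.3, z=1.96):
    """Games until z standard errors of the mean score fit inside the edge over 50%."""
    edge = expected_score(elo_diff) - 0.5
    # Per-game standard deviation for nearly equal engines (approximation).
    sigma = math.sqrt((1.0 - draw_rate) / 4.0)
    return math.ceil((z * sigma / edge) ** 2)
```

Under these assumptions `games_needed(20)` is roughly 800 and `games_needed(10)` is over 3000. Halving the gap roughly quadruples the games, and a round robin in which every engine carries its own margin needs more still.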

What Dave wrote about time controls is also interesting. I do not refer to the fact that some engines are considerably better or weaker with more or less time (Quark is a known example here), but to the idea that fewer games are needed at longer time controls than in blitz to prove differences.
For example, let us assume that Delfi and Tao are only 30 points apart in rating and that this difference is the same in blitz and at 40/40. Will we then see this difference after fewer games at the longer time control than in blitz, because both usually play more accurately with more time? No one has proven this as far as I know, but it is a riddle to solve.

Best Regards
Heinz

Re: Probability and computer chess.

Post by Anonymous » 27 Oct 2004, 15:54

Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.


Because...?

Volker


I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.

Re: Probability and computer chess.

Post by José Carlos » 27 Oct 2004, 17:17

David Dahlem wrote:
Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.


Because...?

Volker


I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.


Game in 5 minutes on today's computers is the same as game in 50 minutes in the past, on computers 10 times slower. So saying that blitz is irrelevant regarding playing strength is the same as saying that all tests done in the past are useless, no matter the time control.
...And today's game in 50 minutes will be considered crap when computers are 10 times faster than today.
All of this is incorrect, David.
You have to decide what you're testing: for example, I want to test strength on my computer (Athlon MP 2400) with ponder on, own books, 3-4-5 EGTBs and games in 40 minutes plus 10 seconds increment.
If you have a faster computer, all else being equal, your results will probably be slightly different from mine after 10000 games. It's that simple. Because we have different testing conditions. But both results are useful, as long as you know exactly what you are measuring.
_____________________________
José Carlos Martínez Galán

Re: Probability and computer chess.

Post by Anonymous » 27 Oct 2004, 17:41

José Carlos wrote:
David Dahlem wrote:
Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.


Because...?

Volker


I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.


Game in 5 minutes on today's computers is the same as game in 50 minutes in the past, on computers 10 times slower. So saying that blitz is irrelevant regarding playing strength is the same as saying that all tests done in the past are useless, no matter the time control.
...And today's game in 50 minutes will be considered crap when computers are 10 times faster than today.
All of this is incorrect, David.
You have to decide what you're testing: for example, I want to test strength on my computer (Athlon MP 2400) with ponder on, own books, 3-4-5 EGTBs and games in 40 minutes plus 10 seconds increment.
If you have a faster computer, all else being equal, your results will probably be slightly different from mine after 10000 games. It's that simple. Because we have different testing conditions. But both results are useful, as long as you know exactly what you are measuring.


Exactly right. And this thread is about a test that measures the reliability of tournament results between engines of approximately equal strength. No test is 100% reliable, since there will always be an error margin. And common sense should show that a time control of 1 minute + 3 second increment will increase the error margin. Longer time controls will decrease the error margin. It's as simple as that. :-)

Re: Probability and computer chess.

Post by Uri Blass » 27 Oct 2004, 18:09

I see no reason to think that fast time control will increase the error margin.

Uri

Re: Probability and computer chess.

Post by fierz » 28 Oct 2004, 10:52

i see absolutely no reason the time control would have anything to do with this.

for an example of the exact same phenomenon at long time controls (90 min + 30sec/move) see kurt utzinger's current tournament at http://www.utzingerk.com/at_2004.htm - he has results for every round robin cycle on that page, and you can see how most engines won at least once.

cheers
martin

my impression

Post by mike schoonover » 28 Oct 2004, 12:48

hi igor,
although i am not a tester my impression is much the same as your results.
i've been playing around with engine v engine games and matches since 1999 and have noticed on any given day how one engine will
win against a field and then another will win against the same field.
i consider ruffian probably the strongest freeware i have.
i recall though sos arena4 and greenlight 3 amongst others
that have had their day against ruffian.
all this is relying on my memory and general overall impression
and not scientific testing.
if one were to take all the commercial engines each with its own pc (identical of course),
and ran a tournament x 100 iterations the results would
probably be similar to yours.
but alas i couldn't do this due to a technical glitch. (lack of money) :)
regards
mike
by the time i get there,i'll be there

Re: Probability and computer chess.

Post by Rudolf Posch » 28 Oct 2004, 19:07

It's all a question of statistics.
I have been developing RDChess, and when releasing a new version I played a number of test games against the older RDChess version and a few other engines in order to see whether the new version is stronger.
I used shorter time controls (usually 5 min for a whole game) in order to get a reasonable number of games.
It was really frustrating. Once RDChessA won against RDChessB (or engine XXXXX) 30:10:10, the next time 20:18:12 and the time after that e.g. 40:8:2.
But even worse is the following: I tuned a single parameter in RDChess, say a single value in the evaluation function, and wanted to get a fast response on whether the change is "good or bad" by playing a few (5 or 10) games.
One cannot trust the result of so few games when the strength has not changed much.

In Austria there is a licensed lottery, "6 from 45": you have to hit 6 numbers out of 1 ... 45 (in Germany it is the same game as "6 out of 49", as far as I remember).
The chance of guessing the right 6 numbers at "6 from 45" is roughly 1 in 8 million. If you play one row each week, you have to wait statistically 160,000 years to get 6 right numbers. But if you are lucky, you may win the fortune next week.
But if you are unlucky, you will have to play 320,000 years until you win a 6, or even longer ...
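
The lottery arithmetic can be checked with a few lines of Python. This is a sketch assuming one row per week, 52 draws per year and independent draws; the function names are hypothetical.

```python
import math

def combinations(numbers=45, picks=6):
    """Number of possible rows in a 'picks from numbers' lottery."""
    return math.comb(numbers, picks)

def expected_years(numbers=45, picks=6, plays_per_year=52):
    """Average waiting time for a jackpot when playing one row per draw."""
    return combinations(numbers, picks) / plays_per_year

def p_no_win(years, numbers=45, picks=6, plays_per_year=52):
    """Probability of still having no jackpot after the given number of years."""
    p = 1.0 / combinations(numbers, picks)
    return (1.0 - p) ** (years * plays_per_year)
```

`combinations()` is 8,145,060, so the expected wait is a little under 157,000 years, in line with the "roughly 160,000" above. `p_no_win(320000)` is about 0.13, so roughly one player in eight would still be waiting after 320,000 years.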

So I am sure that in the next 100,000 years RDChess will, with great probability, win at least one game against Ruffian :?

Rudolf

