Winboard Forum

by **Igor Gorelikov** » 27 Oct 2004, 13:49

Probability and computer chess.

If you take several engines of similar level and run several tournaments in
a row, can you guess the winners?
My assumption was that you cannot.
So I performed a small experiment. Six engines that are close in strength
(of AEGT King Class) played six double round robins in a row.
Hardware is a Celeron 567MHz with 128MB; the time control is the shortest possible for
decent chess: 1 min + 3 sec increment (i.e. each game lasts about 4 minutes on
average).

Here are the results. Each time there is a new winner! Only in the sixth event
does a winner repeat (Ruffian).

Code:

       Engine         Score   De Th Ru An Ph Pr  S-B
    1: Delfi 4.5      7,0/10  ·· 01 == 1= =1 11  30,25
    2: Thinker 4.6c   6,5/10  10 ·· =1 == 11 ==  29,75
    3: Ruffian 1.0.5  5,5/10  == =0 ·· 1= 01 1=  25,00
    4: AnMon 5.50     4,0/10  0= == 0= ·· 00 11  19,75
    5: Pharaon 3.1    3,5/10  =0 00 10 11 ·· 00  17,00
    6: Pro Deo 1.0    3,5/10  00 == 0= 00 11 ··  16,25

       Engine         Score   Th Ph De Ru Pr An  S-B
    1: Thinker 4.6c   7,0/10  ·· == 11 01 == 11  32,00
    2: Pharaon 3.1    5,5/10  == ·· 00 1= 11 01  25,75
    3: Delfi 4.5      5,0/10  00 11 ·· 10 =0 =1  23,50
    4: Ruffian 1.0.5  4,5/10  10 0= 01 ·· 01 10  22,75
    5: Pro Deo 1.0    4,0/10  == 00 =1 10 ·· 0=  21,00
    6: AnMon 5.50     4,0/10  00 10 =0 01 1= ··  18,50

       Engine         Score   Ru Th De An Pr Ph  S-B
    1: Ruffian 1.0.5  7,5/10  ·· 01 11 0= 11 11  32,00
    2: Thinker 4.6c   5,5/10  10 ·· == 11 1= 00  29,00
    3: Delfi 4.5      5,5/10  00 == ·· 01 11 =1  22,25
    4: AnMon 5.50     5,0/10  1= 00 10 ·· =0 11  23,75
    5: Pro Deo 1.0    4,0/10  00 0= 00 =1 ·· 11  15,25
    6: Pharaon 3.1    2,5/10  00 11 =0 00 00 ··  13,75

       Engine         Score   An Pr Ru Ph De Th  S-B
    1: AnMon 5.50     7,0/10  ·· 11 =0 =1 01 11  32,25
    2: Pro Deo 1.0    6,5/10  00 ·· =1 =1 11 1=  26,25
    3: Ruffian 1.0.5  5,5/10  =1 =0 ·· 00 1= 11  24,25
    4: Pharaon 3.1    5,0/10  =0 =0 11 ·· 10 10  23,75
    5: Delfi 4.5      3,0/10  10 00 0= 01 ·· 0=  16,25
    6: Thinker 4.6c   3,0/10  00 0= 00 01 1= ··  12,75

       Engine         Score   Pr Ru Th Ph De An  S-B
    1: Pro Deo 1.0    6,5/10  ·· 10 1= 01 1= 1=  29,50
    2: Ruffian 1.0.5  6,5/10  01 ·· =0 1= =1 11  28,00
    3: Thinker 4.6c   5,5/10  0= =1 ·· =1 0= 1=  26,75
    4: Pharaon 3.1    5,0/10  10 0= =0 ·· 11 01  22,50
    5: Delfi 4.5      3,5/10  0= =0 1= 00 ·· 01  17,75
    6: AnMon 5.50     3,0/10  0= 00 0= 10 10 ··  14,50

       Engine         Score   Ru Th An De Pr Ph  S-B
    1: Ruffian 1.0.5  7,0/10  ·· 10 1= 11 1= ==  32,00
    2: Thinker 4.6c   6,5/10  01 ·· 0= == 11 11  27,25
    3: AnMon 5.50     5,5/10  0= 1= ·· 01 0= 11  25,50
    4: Delfi 4.5      4,5/10  00 == 10 ·· 1= ==  20,25
    5: Pro Deo 1.0    3,5/10  0= 00 1= 0= ·· 01  17,00
    6: Pharaon 3.1    3,0/10  == 00 00 == 10 ··  15,00

Conclusions are obvious and commonplace:
1) The engines are indeed close in strength.
2) Any engine can win the event held for similar engines if the number of
games is small.
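
A minimal sketch of the same experiment as a simulation, assuming six engines of exactly equal strength and a draw rate of about 29% (52 draws in 180 games, as in the combined table of this post). The function names are made up for illustration, colours are ignored, and ties for first place are broken at random rather than by S-B.

```python
import random
from collections import Counter

def play_double_round_robin(n_engines, draw_rate, rng):
    """Return each engine's score after a double round robin between equal engines."""
    scores = [0.0] * n_engines
    for a in range(n_engines):
        for b in range(a + 1, n_engines):
            for _ in range(2):  # two games per pairing
                r = rng.random()
                if r < draw_rate:
                    scores[a] += 0.5
                    scores[b] += 0.5
                elif r < draw_rate + (1.0 - draw_rate) / 2.0:
                    scores[a] += 1.0
                else:
                    scores[b] += 1.0
    return scores

def count_winners(n_events, n_engines=6, draw_rate=0.29, seed=1):
    """Count how often each engine finishes first over n_events tournaments."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_events):
        scores = play_double_round_robin(n_engines, draw_rate, rng)
        best = max(scores)
        # Assumption: shared first places are decided by lot, not by S-B.
        leaders = [i for i, s in enumerate(scores) if s == best]
        wins[rng.choice(leaders)] += 1
    return wins

if __name__ == "__main__":
    wins = count_winners(6000)
    for engine in sorted(wins):
        print(f"engine {engine}: won {wins[engine]} of 6000 events")
```

With equal engines every one of them should win a similar share of the events, which is the pattern seen in the six real tournaments.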

Just to complete the overall picture, I add the combined cross-table and ratings.

Code:

    2004.10.23 - 2004.10.25  Score      1            2            3            4            5            6
    -----------------------------------------------------------------------------------------------------------------
    1: Ruffian 1.0.5         36.5 / 60  XXXXXXXXXXXX =0100111=010 1=100==1111= ==01111==111 1=0111=0011= 010=11001===
    2: Thinker 4.6c          34.0 / 60  =1011000=101 XXXXXXXXXXXX ==1111001=0= 1011==1=0=== ====1=0=0=11 11==0001=111
    3: AnMon 5.50            28.5 / 60  0=011==0000= ==0000110=1= XXXXXXXXXXXX 0==010011001 111==0110=0= 001011=11011
    4: Delfi 4.5             28.5 / 60  ==10000==000 0100==0=1=== 1==101100110 XXXXXXXXXXXX 11=011000=1= =111=10100==
    5: Pro Deo 1.0           28.0 / 60  0=1000=1100= ====0=1=1=00 000==1001=1= 00=100111=0= XXXXXXXXXXXX 110011=10101
    6: Pharaon 3.1           24.5 / 60  101=00110=== 00==1110=000 110100=00100 =000=01011== 001100=01010 XXXXXXXXXXXX
    -----------------------------------------------------------------------------------------------------------------
    180 games: +67 =52 -61

      Program           Elo    +    -  Games  Score   Av.Op.  Draws
    1 Ruffian 1.0.5  : 2564   79   78    60   60.8 %   2487   28.3 %
    2 Thinker 4.6c   : 2539   84   66    60   56.7 %   2492   36.7 %
    3 AnMon 5.50     : 2485   73   89    60   47.5 %   2503   25.0 %
    4 Delfi 4.5      : 2485   68   89    60   47.5 %   2503   31.7 %
    5 Pro Deo 1.0    : 2481   70   88    60   46.7 %   2504   30.0 %
    6 Pharaon 3.1    : 2446   83   81    60   40.8 %   2511   21.7 %

But how many games are enough? That's the question.
To be continued...

Igor

by **Anonymous** » 27 Oct 2004, 14:16

In my opinion, this test is flawed because of the unreasonable time control.

by **Igor Gorelikov** » 27 Oct 2004, 14:19

Here time control is of no importance.

Igor

by **Volker Pittlik** » 27 Oct 2004, 14:22

Igor Gorelikov wrote:Probability and computer chess.

...
Code:

      Program           Elo    +    -  Games  Score   Av.Op.  Draws
    1 Ruffian 1.0.5  : 2564   79   78    60   60.8 %   2487   28.3 %
    2 Thinker 4.6c   : 2539   84   66    60   56.7 %   2492   36.7 %
    3 AnMon 5.50     : 2485   73   89    60   47.5 %   2503   25.0 %
    4 Delfi 4.5      : 2485   68   89    60   47.5 %   2503   31.7 %
    5 Pro Deo 1.0    : 2481   70   88    60   46.7 %   2504   30.0 %
    6 Pharaon 3.1    : 2446   83   81    60   40.8 %   2511   21.7 %

But how many games are enough? That's the question.
To be continued...

Igor

Thanks, Igor, for this interesting experiment. I'm afraid many more games are needed. Look at the Elo ratings and the error margins. Assuming Elostat is calculating correctly, Ruffian's "true" rating is somewhere between 2486 and 2643 (with an error probability of 5%). Pharaon's "true" rating is somewhere between 2365 and 2529. Therefore Pharaon is possibly the "true" number one and Ruffian may be "only" number 3.
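
The interval arithmetic above can be repeated for all six engines. This is only a rough sketch: the numbers are copied from the Elostat table in this thread, the helper names are hypothetical, and checking whether two intervals overlap is a quick sanity check rather than a formal significance test.

```python
from itertools import combinations

# (Elo, plus, minus) copied from the Elostat table posted in this thread.
RATINGS = {
    "Ruffian 1.0.5": (2564, 79, 78),
    "Thinker 4.6c": (2539, 84, 66),
    "AnMon 5.50": (2485, 73, 89),
    "Delfi 4.5": (2485, 68, 89),
    "Pro Deo 1.0": (2481, 70, 88),
    "Pharaon 3.1": (2446, 83, 81),
}

def interval(elo, plus, minus):
    """Lower and upper bound of the rating range given by the error margins."""
    return (elo - minus, elo + plus)

def overlaps(a, b):
    """True if the two closed intervals share at least one rating value."""
    return a[0] <= b[1] and b[0] <= a[1]

if __name__ == "__main__":
    ranges = {name: interval(*r) for name, r in RATINGS.items()}
    for name, (low, high) in ranges.items():
        print(f"{name:14s} {low} .. {high}")
    separated = [
        (a, b)
        for a, b in combinations(ranges, 2)
        if not overlaps(ranges[a], ranges[b])
    ]
    print("pairs with non-overlapping intervals:", separated)
```

Every pair of intervals overlaps, so after 60 games per engine none of the six can be separated from any other on this basis.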

To save test time you can just copy the PGNs of all games and paste them into the same file again and again and watch how the error margin gets narrower. I once did that for different Ruffian versions. I don't remember exactly, but thousands of games would have been needed to distinguish the different versions.
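
A sketch of what the copy-and-paste trick does to the margin, assuming a plain normal approximation on the score and the usual logistic Elo curve. This is not necessarily the formula Elostat uses, so the figures will differ somewhat from its output. Ruffian's +28 =17 -15 is derived from the table (60 games, 60.8 % score, 28.3 % draws); the function names are made up.

```python
import math

def elo_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1), logistic model."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, draws, losses, copies=1, z=1.96):
    """Return (elo, low, high) relative to the average opponent.

    `copies` imitates pasting the same games into the PGN file that many times.
    """
    w, d, l = wins * copies, draws * copies, losses * copies
    n = w + d + l
    mean = (w + 0.5 * d) / n
    # Per-game variance of the score (1, 0.5 or 0 points).
    var = (w * (1.0 - mean) ** 2 + d * (0.5 - mean) ** 2 + l * mean ** 2) / n
    half = z * math.sqrt(var / n)
    return elo_from_score(mean), elo_from_score(mean - half), elo_from_score(mean + half)

if __name__ == "__main__":
    for copies in (1, 4, 16, 64):
        elo, low, high = elo_interval(28, 17, 15, copies)
        print(f"{60 * copies:5d} games: {elo:+.0f} Elo, interval {low:+.0f} .. {high:+.0f}")
```

The pasted copies are of course not new games, so the narrower margin only indicates how many genuinely independent games would be needed; on the score scale the margin shrinks with the square root of the number of games.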

Regards

Volker

by **José Carlos** » 27 Oct 2004, 14:28

Igor Gorelikov wrote:Here time control is of no importance.

Igor

I agree. The problem is too few games. Another test that shows the same behaviour: pick two engines of similar strength and run a 100-game match. Now write down the results and pick series of 10 games. You'll find 7-3, 2-8, 5.5-4.5, etc. Only the final result is significant.
So yes, any tournament of so few games is basically random with regard to the final standings, unless you use engines of very different strength: if you match Shredder vs Averno and get 9.5-0.5, that's significant because you know in advance that Shredder is much stronger.
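
The 100-game test can be tried without playing a single game. A minimal sketch, assuming two engines of exactly equal strength and a 30% draw rate (both are assumptions, as are the function names):

```python
import random

def simulate_match(n_games, draw_rate, rng):
    """Points scored by engine A in each game against an equally strong engine B."""
    results = []
    for _ in range(n_games):
        r = rng.random()
        if r < draw_rate:
            results.append(0.5)
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            results.append(1.0)
        else:
            results.append(0.0)
    return results

def series_scores(results, size=10):
    """Split the match into consecutive series and return A's score in each."""
    return [sum(results[i:i + size]) for i in range(0, len(results), size)]

if __name__ == "__main__":
    rng = random.Random(2004)
    results = simulate_match(100, 0.30, rng)
    for number, score in enumerate(series_scores(results), start=1):
        print(f"series {number:2d}: {score:4.1f} - {10 - score:4.1f}")
    print(f"whole match: {sum(results):.1f} - {100 - sum(results):.1f}")
```

Changing the seed gives a different set of 10-game results each time, while the 100-game total stays much closer to 50%.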

by **Anonymous** » 27 Oct 2004, 15:20

Igor Gorelikov wrote:Here time control is of no importance.

Igor

I think your test is interesting. But it is a test to compare closeness of engine strength, and to say time control is of no importance in such a test is total nonsense.

by **Volker Pittlik** » 27 Oct 2004, 15:26

David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.

Because...?

Volker

by **Uri Blass** » 27 Oct 2004, 15:34

David Dahlem wrote:
Igor Gorelikov wrote:Here time control is of no importance.

Igor

I think your test is interesting. But it is a test to compare closeness of engine strength, and to say time control is of no importance in such a test is total nonsense.

I agree that time control is of no importance, because the test is not meant to find which engine is better at long time control.

You can expect similar behaviour at longer time control, except that the order of engines of similar strength may be different.

Uri

by **Heinz van Kempen** » 27 Oct 2004, 15:40

Hello Igor and all,

thanks for the test. I agree with Volker. If the rating difference is 20 points or less you will need thousands of games, if such a difference can be proven at all. This is somewhat frustrating for testers and also a reason why people sometimes refuse to believe results from other testers that differ from their own. When you compare the many Elo lists after several hundred games you will find similarities but also differences for some engines, and those differences need not come from different GUIs, hardware or slightly different time controls but are often accidental.
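
A back-of-the-envelope estimate of how many games a head-to-head match needs before a given rating difference stands out from an even score. It is a sketch under simplifying assumptions: logistic Elo curve, normal approximation, a 30% draw rate, and a 95% margin that merely has to exclude 50%. Rating lists with many opponents need more games than this, and the function names are hypothetical.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger engine under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_rate=0.30, z=1.96):
    """Games after which z standard errors are smaller than the score edge."""
    edge = expected_score(elo_diff) - 0.5
    # Per-game score variance, approximated as if both engines were equal:
    # wins and losses each occur with probability (1 - draw_rate) / 2.
    variance = (1.0 - draw_rate) * 0.25
    return math.ceil((z * math.sqrt(variance) / edge) ** 2)

if __name__ == "__main__":
    for diff in (50, 30, 20, 10, 5):
        print(f"{diff:3d} Elo difference: about {games_needed(diff)} games")
```

Halving the rating difference roughly quadruples the number of games, which is why small differences are so expensive to prove.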

What Dave wrote about time controls is also interesting. I am not referring to the fact that some engines are considerably better or weaker with more or less time (Quark is a known example here), but to the idea that fewer games are needed at longer time controls to prove differences than in blitz.
For example, let us assume that Delfi and Tao are only 30 points apart in rating and that this difference is the same in blitz and at 40/40. Will we then see this difference after fewer games at the longer time control than in blitz, because both will usually play more accurately with more time? No one has proven this as far as I know, but it is a riddle to solve.

Best Regards
Heinz

by **Anonymous** » 27 Oct 2004, 15:54

Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.

Because...?

Volker

I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.

by **José Carlos** » 27 Oct 2004, 17:17

David Dahlem wrote:
Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.

Because...?

Volker

I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.

Game in 5 minutes with today's computers is the same as game in 50 minutes in the past, with computers 10 times slower. So saying that blitz is irrelevant regarding playing strength is the same as saying that all tests done in the past are useless, no matter the time control.
...And today's game in 50 minutes will be considered crap when computers are 10 times faster than today.
All of this is incorrect, David.
You have to decide what you're testing: for example, I want to test strength on my computer (Athlon MP 2400) with ponder on, own books, 3-4-5 EGTBs and games in 40 minutes plus 10 seconds increment.
If you have a faster computer, all else being equal, your results will probably be slightly different from mine after 10000 games. It's that simple, because we have different testing conditions. But both results are useful, as long as you know exactly what you are measuring.

by **Anonymous** » 27 Oct 2004, 17:41

José Carlos wrote:
David Dahlem wrote:
Volker Pittlik wrote:
David Dahlem wrote:... and to say time control is of no importance in such a test is total nonsense.

Because...?

Volker

I already gave the reason, (which you snipped). Again, this test is about comparing engine strengths, and, obviously, time controls are closely related to engine strength.

Game in 5 minutes with today's computers is the same as game in 50 minutes in the past, with computers 10 times slower. So saying that blitz is irrelevant regarding playing strength is the same as saying that all tests done in the past are useless, no matter the time control.
...And today's game in 50 minutes will be considered crap when computers are 10 times faster than today.
All of this is incorrect, David.
You have to decide what you're testing: for example, I want to test strength on my computer (Athlon MP 2400) with ponder on, own books, 3-4-5 EGTBs and games in 40 minutes plus 10 seconds increment.
If you have a faster computer, all else being equal, your results will probably be slightly different from mine after 10000 games. It's that simple, because we have different testing conditions. But both results are useful, as long as you know exactly what you are measuring.

Exactly right. And this thread is about a test that measures the reliability of tournament results between engines of approximately equal strength. No test is 100% reliable, since there will always be an error margin. And common sense should show that a time control of 1 minute + 3 second increment will increase the error margin. Longer time controls will decrease the error margin. It's as simple as that. :-)

by **Uri Blass** » 27 Oct 2004, 18:09

I see no reason to think that fast time control will increase the error margin.

Uri

by **fierz** » 28 Oct 2004, 10:52

i see absolutely no reason the time control would have anything to do with this.

for an example of the exact same phenomenon at long time controls (90 min + 30sec/move) see kurt utzinger's current tournament at http://www.utzingerk.com/at_2004.htm - he has results for every round robin cycle on that page, and you can see how most engines won at least once.

cheers
martin

by **mike schoonover** » 28 Oct 2004, 12:48

hi igor,
although i am not a tester, my impression is much the same as your results.
i've been playing around with engine v engine games and matches since 1999 and have noticed on any given day how one engine will
win against a field and then another will win against the same field.
i consider ruffian probably the strongest freeware i have.
i recall though sos, arena4 and greenlight 3, amongst others,
that have had their day against ruffian.
all this is relying on my memory and general overall impression
and not scientific testing.
if one were to take all the commercial engines, each with its own pc (identical of course),
and ran a tournament x 100 iterations, the results would
probably be similar to yours.
but alas i couldn't do this due to a technical glitch (lack of money) :)
regards
mike

by **Rudolf Posch** » 28 Oct 2004, 19:07

It's all a question of statistics.
I have been developing RDChess, and when releasing a new version I played a number of test games against the older RDChess version and a few other engines in order to see if the new version is stronger.
I used shorter time controls (usually 5 min for a whole game) in order to get a reasonable number of games.
It was really frustrating. Once RDChessA won against RDChessB (or engine XXXXX) 30:10:10, the next time 20:18:12 and the time after that e.g. 40:8:2.
But even worse is the following: I tuned a single parameter in RDChess, say some single value in the evaluation function, and wanted to get a fast response on whether the change is "good or bad" by playing a few (5 or 10) games.
One cannot trust the result of so few games when the strength has not changed much.

In Austria there is a licensed lottery, "6 from 45": you have to hit 6 numbers out of 1 ... 45 (in Germany it is the same game with "6 out of 49", as far as I remember).
The chance of guessing the right 6 numbers at "6 from 45" is roughly 1 in 8 million. If you play one row each week, you have to wait on average about 160.000 years to get 6 right numbers. But if you are lucky, you may win the fortune next week.
But if you are unlucky, you will have to play 320.000 years until you win a 6, or even longer ...
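
The lottery figures can be checked with a few lines. A small sketch, assuming one row per weekly draw and 52 draws a year; the helper name `chance_within` is made up.

```python
import math

# Number of ways to choose 6 numbers out of 45.
combos = math.comb(45, 6)

# Average waiting time with one row per weekly draw, 52 draws a year.
expected_years = combos / 52

def chance_within(years, draws_per_year=52):
    """Probability of at least one jackpot when playing one row per draw."""
    p = 1.0 / combos
    return 1.0 - (1.0 - p) ** (years * draws_per_year)

if __name__ == "__main__":
    print(f"combinations: {combos}")
    print(f"average wait: {expected_years:.0f} years")
    for years in (1, 160_000, 320_000):
        print(f"chance of a win within {years} years: {chance_within(years):.2%}")
```

The average wait is only an average: even after twice that time a noticeable share of players would still be waiting.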

So I am sure that in the next 100.000 years RDChess will, with great probability, win at least one game against Ruffian.

Rudolf
