Winboard Forum

by **Georg** » 27 Sep 1999, 00:41

I was just trying to make some changes to crafty (time-function) and before
doing that I wanted to test its strength (so to be able to compare later).
I didn't want to use the autoplayer (people tend to get strange results) or the
cb-adaptor (where I get _very_ strange results) so I did let it play against the
(IMHO) second best freeware engine "Comet" under WBoard.
I did make sure they got the same HT (checked with memory tool).
I did make sure they got the same processor time (checked with some system
tool), of course no other processes running and restart before every match and I
have a very stable system.
I did delete the learning files after every match.
I did let them play from the Nunn positions, one time with white, one time with
black, cause I wanted to test the engine, not the opening book.
I did make sure both got 4man TB access. (btw: I'd still like to know what you
think are the most useful 5man TB)
k6II/400,15min/game. Result : newest Crafty 16.9 : 15,5 newest CometB06 : 4,5
(!).
Hey, I thought, this can't be: Comet isn't _that_ weak! So I did another test,
with _exactly_ the same configuration, only 5 0 games. Result 9,5 : 10,5. Comet
won (!).
Hey, I thought, this can't be: there is too much of a difference between those
matches. So I did another test with the same configuration, only 14 0 (!) games.
Result: 14,0 : 6,0.
This is 1,5 more points for Comet only because 14 0 instead of 15 0 !
What does this teach us ?
Forget about any serious testing (hello SSDF ! ;-) ) if you don't play at
least 200 matches between every engine. If not you just get garbage results.

Best regards,
Tec--

by **Inmann Werner** » 27 Sep 1999, 09:03

The only thing I can see is that Comet seems better at Blitz (5 0).
15,5:4,5 and 14,0:6,0 is simple statistics and normal variation.
Where is your real problem?
And you do not need 200 games of every engine against every engine. But you are right that you need 200 games of one engine against others to get a "really good" result with only a small error.
One thing IMHO is absolutely right. You cannot compare 5 0 games with tournament games.
Werner

by **Dann Corbit** » 29 Sep 1999, 05:18

[snip]

What does this teach us ?
Forget about any serious testing (hello SSDF ! ) if you don't play at
least 200 matches between every engine. If not you just get garbage results.
Was bringt dieses uns bei?
Vergessen Sie über die irgendwie ernste Prüfung (hallo SSDF! wenn Sie nicht am
spielen; wenige 200 Übereinstimmungen zwischen jeder Maschine. Wenn nicht Sie gerade Abfallresultate erhalten.

That's why the SSDF plays at least 100 matches before they report anything at all.
The games are at long time control, and performed under very careful conditions.
Most people don't seem to have enough math background to even understand what the list means, let alone to make any useful judgements from it. So (as far as that goes) I agree that most people should not try to use it to make purchase decisions.
[Schnitzel] ;-)

Das ist, warum das SSDF mindestens 100 Übereinstimmungen spielt, bevor sie über alles an allen berichten.
Die Spiele sind an der langen Zeitsteuerung, und durchgeführt unter sehr vorsichtigen Bedingungen.
Die meisten Leute scheinen, zu haben genügend Mathehintergrund zum Verstehen sogar was die Liste bedeutet, geschweige denn, um keine nützlichen Urteile von ihr zu bilden. So (insoweit das geht), stimme ich darin überein, daß die meisten Leute nicht versuchen sollten, es zu verwenden, um Erwerb Entscheidungen zu treffen.


by **Georg** » 29 Sep 1999, 11:18

haha babelfish rules.
blub.

by **Pete Galati** » 29 Sep 1999, 16:06

haha babelfish rules.
blub.

Oddly enough, Babelfish can't really translate its own translation. This is what Dann said before it was translated into German:
"Most people don't seem to have enough math background to even understand what the list means, let alone to make any useful judgements from it. So (as far as that goes) I agree that most people should not try to use it to make purchase decisions."
and then it was translated to German, and here I've translated the German part back into English:
"Most people seem to have sufficient Mathehintergrund for understanding even which the list meant, let alone, in order to form no useful judgements of it. So (it goes to that extent), I correspond therein that most people should not try to use it in order acquisition decisions to meet."
So obviously I don't have any idea what is involved in doing these translations, but this is a bit strange.
Pete

by **Dann Corbit** » 29 Sep 1999, 20:30

Here is babelfish:
http://babelfish.altavista.com/cgi-bin/translate?
I have the babelfish icon in my browser.
For some real fun, take some innocent phrase like:
"the duck went down the river" and translate to and from any chosen language pair until it reaches a homeostasis. E.G. "the duck went down the river" using
the duck went down the river
Le canard a descendu le fleuve
the duck went down the to rivet
le canard a descendu pour riveter
the duck went down to rivet
Ahh... We have come to a conclusion.
Now, English German:
I love to play winboard chess and eat a sandwitch

Ich liebe, winboardschach zu spielen und ein sandwitch zu essen

I love to play and sandwitch eat winboardschach
Ich liebe zu spielen und sandwitch essen winboardschach
I love to play and sandwitch eat winboardschach
Ahh... We have come to a conclusion.
Try it sometime. It's a barrel of laughs.
Essayez-l'autrefois. C'est un baril de rires.
Versuchen Sie es einmal. Es ist ein Faß Lachen.
Provarlo un momento. È un barilotto delle risate.
Tente-o sometime. É um tambor dos risos.
Inténtelo alguna vez. Es un barril de risas.


by **José Carlos** » 04 Oct 1999, 17:37

First, the difference between 14 0 and 5 0 is perfectly reasonable. No program (or person) plays at the same level at every time control. The difference between 15 and 14 is only 1.5 points. It can be due to just 2 different moves (for example, from reaching one more ply at 15 than at 14, or even from different values in the hash table), or maybe there is a random factor in the eval function (you could try to repeat exactly the same games at, say, 15 0, and see if they come out equal move by move to test this).
I see your results as absolutely normal.
José C.

by **Franz** » 08 Oct 1999, 12:45

Forget about any serious testing if you don't play at
least 200 matches between every engine. If not you just get garbage results.

You just discovered statistical fluctuations!
Any "measurement" should be given as a value plus an estimate of the error of the measurement itself; it is never a pure number. So if you say the temperature is 75F, you should say it is 75 +/- 1 F if your error is 1F.
If I make another measurement and say the temperature is 73 +/- 2F, you can't say that these two numbers are really different, because you estimate a temperature between 74 and 76 and I estimate between 71 and 75. So the pure numbers 75 and 73 look different but are not enough to describe the real observations. More precisely, if you use one error bar and the two measurements don't overlap within the error bars, they are different with ~76% probability. If you want to be sure to >99% you need to take the error bar and multiply it by 3!
So with 15 games won, how do you estimate the statistical error? A good guess is to take the square root of the number, so you would have to say 15+/-4 (the error is always rounded to a single figure, so 3,8 is 4, 38 is 40, 220 is 200...).
If the other program scored 5 you should get 5+/-2.
Well, these are distinguishable "within one error bar" but not "within 2 error bars", so there is some residual probability (~5%) that another match with so few games gives you exactly the opposite score. (As happened to you.)
With 200 games and a score of 120-80 the numbers would read 120+/-10 and 80+/-9, which again are hardly distinguishable "within 3 error bars" [the ranges here would be 90-150 versus 53-107, which overlap in the region 90-107]. A score of 150+/-10 to 50+/-7, on the other hand, would establish a real difference between the two programs with more than 99% probability.
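
For anyone who wants to play with these numbers, here is a minimal Python sketch of the square-root rule of thumb from this post. The function names (`round_one_figure`, `error_bar`, `interval`, `distinguishable`) are made up for illustration, and the square-root error is only the rough guess described above, not an exact statistical test.

```python
import math

def round_one_figure(x):
    """Round a non-negative number to one significant figure (3.8 -> 4, 38 -> 40)."""
    if x == 0:
        return 0.0
    factor = 10 ** math.floor(math.log10(x))
    return round(x / factor) * factor

def error_bar(score):
    """Rule-of-thumb error: square root of the score, rounded to one figure."""
    return round_one_figure(math.sqrt(score))

def interval(score, bars=1):
    """Range covered by `bars` error bars on either side of the score."""
    err = error_bar(score)
    return (score - bars * err, score + bars * err)

def distinguishable(a, b, bars=1):
    """True if the two ranges do not overlap at the given number of error bars."""
    lo_a, hi_a = interval(a, bars)
    lo_b, hi_b = interval(b, bars)
    return hi_a < lo_b or hi_b < lo_a

# The examples from the post: (score A, score B, number of error bars)
for a, b, bars in [(15, 5, 1), (15, 5, 2), (120, 80, 3), (150, 50, 3)]:
    print(a, interval(a, bars), b, interval(b, bars), distinguishable(a, b, bars))
```

With the inputs above, 15 vs 5 separates at one error bar but not at two, 120 vs 80 overlaps at three error bars, and 150 vs 50 does not.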

In general the tests done by the Swedish guys have a different meaning. Even if you test two programs with only a 20-game match, once a program has reached some hundreds (or even 1000) games, the statistical effects tend to balance each other out.

regards
Franz
