Alessandro Scotti wrote: Hi,
I wonder what you consider a good method to test an engine. I have only very limited resources for testing and would like to get the most out of them. So far I have run many games with the Noomen positions at 40/5, but sometimes it takes more than 24 hours to finish the test, which is a bit too much for me!
It seems there are two easy ways to shorten the test: play at faster time controls or play fewer games. So the question is: which would be most effective?
I've found that even 100 games are not that many and cannot detect small changes: I usually measure a variation of 4-5% between the worst and best performance of the same version. On the other hand, if after 20 games it has scored 0 points instead of, say, the usual 4-5, is it correct to stop the test?
Just to report a peculiar experience with Kiwi... 0.5a scored 18.5 in the Gauntlet test by Karl-Heinz S?ntges, and also 15-18 in my tests against Yace. Version 0.5b had a very small change in the stage calculation and the value of pawns in the endgame; it scored 56 against 0.5a and the same against Yace, yet 14.5 in the Gauntlet. Version 0.5c also beat version 0.5b and scored 19.5 against Yace, Kiwi's best result. Yet it was down to 13.5 in the Gauntlet. Version 0.5d scored 21.5 against Yace... it will probably drop to the bottom of the Gauntlet list if Karl-Heinz tests it again! All of this is very confusing... isn't it?
Oh well, the best way to test is the compute-intensive way.
A few years ago I had the privilege of having Jan Louwman as a tester.
He was the best tester on the surface of the planet. When he had to stop testing for health reasons, diep deteriorated for a few years.
Right now testing is the weakest spot of diep, as I own just 4 computers.
Which means 1-computer-versus-1-computer matches.
That's slow.
The testing method of Jan Louwman was the simplest testing method there is.
I designed a certain type of forward pruning. 6 months of work.
I created 2 versions. One version had the forward pruning, searching 2-3 ply deeper, 2 ply on average. It used the unpublished 'Ed Schroeder type forward pruning'. Ed has not published very clearly how his very special search works, though the tables used are posted online.
It's a tactical god.
So I knew from home testing that it was 600 points stronger tactically.
But while getting average search depths of 13-14 ply, positionally it played more like 8-9 ply.
So there was a big need for testing.
Jan Louwman played 3000 games at a time control of 40 moves in 2 hours. On slower hardware it was a bit slower; on slow laptops about 6 hours a game.
About 500 games were with the forward pruning and 500 games were without the forward pruning but WITH singular extensions.
The latter version scored 20-25% more against Fritz, Shredder, Junior and a wide variety of opponents.
a) it's statistically significant (a rough check of that below)
b) it's the only possible test there is
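To give an idea of why roughly 500 games per version settles a gap of that size, here is a minimal back-of-the-envelope sketch. This is my own illustration, not Jan's bookkeeping: the 30% vs 50% scores are made-up numbers standing in for a gap of about 20%.

# Rough significance check for the gap between two match scores.
# Illustrative numbers only: 30% vs 50% over ~500 games each.
from math import sqrt

def z_score(p1, n1, p2, n2):
    # z-statistic for the difference between two score fractions;
    # treating each game as win/lose overstates the per-game variance
    # when there are draws, so if anything this test is conservative.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se

z = z_score(0.30, 500, 0.50, 500)
print(f"z = {z:.1f}")   # about 6.6, far beyond any reasonable significance threshold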
My own experience is that playing your own program against itself is a very bad way to test things. Test sets are too. I already *knew* this version was tactically 600 points above the normal diep version. Ed Schroder is not a beginner, to put it politely, and he had designed this method in a time (around the 80s to early 90s) when only tactics mattered. So I knew it would be superior in that respect.
But this test showed very clearly that good testing is what you need.
The result was the opposite of what I thought it would be. Of course, after the test is long over, you can explain it. The thing is, accurate testing and keeping statistics are important.
The real strong point of Jan is that, despite being 75 years old,
he never made a single mistake in testing, in the sense that he always annotated very accurately which logfiles belonged to which test match. And more importantly, at 75 he got the job done.
A single 25-game match shows *nothing* statistically relevant, assuming your engine doesn't have huge bugs. If you are already pretty strong, play strong opponents, and want to measure whether a change improved you by, say, 30 rating points, then 25 games just won't do at all. The influence of learning, opening books, hardware, time control and perhaps the position of the sun: the variables are just too big.
The standard deviation is too big.
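To put a number on that, here is a small sketch (my own arithmetic, nothing Jan computed) of how much noise sits in an N-game match and what Elo resolution that leaves, using the standard logistic Elo model and treating each game as an independent result around the true strength:

# How precisely does an N-game match pin down strength?
from math import sqrt, log10

def score_stddev(p, n):
    # Std dev of the observed score fraction over n games with expected score p.
    # Uses the Bernoulli bound p*(1-p); draws make the real spread a bit smaller.
    return sqrt(p * (1 - p) / n)

def elo_from_score(p):
    # Standard logistic Elo model: expected score p <-> rating difference.
    return -400 * log10(1 / p - 1)

# A ~30 Elo improvement corresponds to an expected score of about 54% instead of 50%.
print(f"54.3% score is about {elo_from_score(0.543):+.0f} Elo")
for n in (25, 100, 500, 3000):
    sd = score_stddev(0.5, n)
    print(f"{n:5d} games: score noise +/- {sd:.1%}, roughly +/- {elo_from_score(0.5 + sd):.0f} Elo")
# 25 games: +/- 10% score, i.e. around +/- 70 Elo of noise, so a 30 Elo change is invisible.
# 3000 games: +/- ~0.9% score, i.e. around +/- 6 Elo, so a 30 Elo change stands out clearly.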
A good example is the current diep version. I modified some things in the evaluation function. I got a positive report back from a tester who played just 10 games, and I myself discovered some horrors in some other tests. But is it better or not?
I don't know. I GUESS it is.
Only the Jan Louwman method, which counts test games by the hundreds and always plays on until 3000 games, is a really good way to test things in this sport.
A big problem with this is that at the world championship you play on far superior hardware, and a version that sucks at blitz might be great at a 60-in-2 time control. So a slow time control is a necessity.
I hope that answers your question.