Two Prophet versions in the Nunn test
Posted: 21 Aug 2006, 10:50
Inspired by Marc Lacrosse's MLmfl tests, I thought about a way to let even a lazy person like me answer the question "which version is stronger" based on a meaningful number of games.
Marc's test, or others with even more positions, takes too much time (for me) to wait for the results. The question should be answered after an overnight test. Therefore I have chosen the Nunn-1 test suite with these 10 positions:
r2qk2r/5pbp/p1np4/1p1Npb2/8/N1P5/PP3PPP/R2QKB1R w KQkq - 0 14
r1b2rk1/1pq1bppp/p1nppn2/8/3NP3/1BN1B3/PPP1QPPP/2KR3R w - - 6 11
r3k2r/p1qbnppp/1pn1p3/2ppP3/P2P4/2PB1N2/2P2PPP/R1BQ1RK1 b kq - 5 11
r1b2rk1/2q1bppp/p2p1n2/npp1p3/3PP3/2P2N1P/PPBN1PP1/R1BQR1K1 b - - 2 12
r1bqrnk1/pp2bppp/2p2n2/3p2B1/3P4/2NBPN2/PPQ2PPP/R4RK1 w - - 5 11
rnbq1rk1/pp2ppbp/6p1/8/3PP3/5N2/P3BPPP/1RBQK2R b K - 0 10
2rq1rk1/p2nbppp/bpp1p3/3p4/2PPP3/1PB3P1/P2N1PBP/R2Q1RK1 b - - 0 13
r1bqnrk1/ppp1npbp/3p2p1/3Pp3/2P1P3/2N1B3/PP2BPPP/R2QNRK1 b - - 4 10
r1bq1rk1/ppp1npbp/2np2p1/4p3/2P5/2NPP1P1/PP2NPBP/R1BQ1RK1 b - - 0 8
rnb1kb1r/1p3ppp/p2ppn2/6B1/3NPP2/q1N5/P1PQ2PP/1R2KB1R w Kkq - 2 10
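Before feeding such a list to a tournament manager it can be worth checking that no FEN got mangled by copy and paste. The following is only a minimal sketch: the function name `fen_board_is_well_formed` is made up for illustration, and it checks just the basic shape (six fields, eight ranks of eight squares, a valid side to move), not full legality of the position.

```python
def fen_board_is_well_formed(fen: str) -> bool:
    """Shape check only: 6 fields, 8 ranks of 8 squares, side to move w/b."""
    fields = fen.split()
    if len(fields) != 6:
        return False
    placement, side = fields[0], fields[1]
    ranks = placement.split("/")
    if len(ranks) != 8 or side not in ("w", "b"):
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)      # a digit means that many empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1            # a letter is one piece
            else:
                return False
        if squares != 8:
            return False
    return True


# First position of the list above
first = "r2qk2r/5pbp/p1np4/1p1Npb2/8/N1P5/PP3PPP/R2QKB1R w KQkq - 0 14"
print(fen_board_is_well_formed(first))
```

A position that fails this check is certainly broken; one that passes may still be illegal, so this does not replace a real chess library.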
Bullet time controls of one minute initially plus 1 second per move were used. All games were played with ponder on, all books and learning disabled, and 4-piece TBs on a RAM disk. The test system was my XP2000+, OS: Linux (Ubuntu 6.06).
First I chose 10 engines of very different playing strength. It is important that these engines run without any bugs (especially without losing on time). I tested a lot of engines and ended up with the selection in the following table:
Rank Name                        Elo    +    - games score oppo. draws
   1 Spike 1.2 Turin            2830   65   58   180   89%  2349    9%
   2 Fruit 2.1                  2734   56   53   180   81%  2360   11%
   3 Aristarch 4.50             2613   53   51   180   69%  2373    8%
   4 Pepito v1.59 (Conservador) 2530   50   50   180   61%  2382   10%
   5 Yace Paderborn             2520   50   50   180   60%  2383    9%
   6 The Baron 1.7.0            2466   49   49   180   54%  2389   13%
   7 Scidlet 3.6                2324   51   52   180   40%  2405    9%
   8 Natwarlal v0.12            2197   56   59   180   28%  2419    6%
   9 Small Potato 0.6.1         2009   65   71   180   15%  2440    8%
  10 Needle 0.53.1              1747   89  113   180    3%  2469    4%
These engines played a round-robin tourney using the test positions. I anchored the ratings so that Pepito is rated 2530, which is close to its average rating in WBEC, RWBC, YABRL and CEGT.
If you compare the ratings of the other programs with these rating lists, you will find The Baron a bit too low and Needle too high. The error margins of the other programs overlap. Therefore I think these ratings are not too bad. (I have no idea why The Baron is relatively weak here. Maybe it has to do with the fact that I'm the only one who uses the Linux version?)
But establishing the most accurate list in the world is not my goal. I just want to answer the simpler question of which version of a program is stronger.
Therefore two versions of Prophet were my first test candidates. Prophet-2.0-beta.4 and Prophet-2.0-delta.1 ran the test against the engines listed above, which took around 16 hours. Here are the results:
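The 180 games per engine in the table can be reproduced with a little arithmetic. This sketch assumes that every pairing plays each of the 10 positions once with each color; the post does not say so explicitly, but it is the assumption that matches the games column.

```python
from itertools import combinations

ENGINES = 10      # engines in the round robin
POSITIONS = 10    # Nunn-1 start positions
COLORS = 2        # assumption: each position is played with both colors

# Every unordered pair of engines meets once in a round robin.
pairings = list(combinations(range(ENGINES), 2))

games_per_pairing = POSITIONS * COLORS
total_games = len(pairings) * games_per_pairing
games_per_engine = (ENGINES - 1) * games_per_pairing

print(len(pairings), games_per_pairing, total_games, games_per_engine)
```

With these numbers there are 45 pairings of 20 games each, so 900 games in total and 180 per engine.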
Rank Name                        Elo    +    - games score oppo. draws
   1 Spike 1.2 Turin            2828   65   58   220   91%  2290    8%
   2 Fruit 2.1                  2740   57   53   220   85%  2298    9%
   3 Aristarch 4.50             2618   52   51   220   75%  2309    7%
   4 Pepito v1.59 (Conservador) 2530   50   49   220   67%  2317    9%
   5 Yace Paderborn             2526   49   49   220   67%  2318    8%
   6 The Baron 1.7.0            2465   47   47   220   61%  2323   12%
   7 Scidlet 3.6                2314   48   48   220   46%  2337    8%
   8 Natwarlal v0.12            2181   51   52   220   34%  2349    5%
   9 Prophet-2.0-delta.1        2144   55   57   200   27%  2400    6%
  10 Small Potato 0.6.1         2039   52   54   220   24%  2362   12%
  11 Prophet 2.0-beta.4         1880   64   69   200   13%  2400    8%
  12 Needle 0.53.1              1756   67   77   220    6%  2388    6%
The newer version performed clearly better in this test. Not only is the raw rating difference 264 points; even if you subtract the full error margin from the newer version's rating and add it to the older version's, the two ranges still do not overlap. I guess this correlates to something, possibly to playing strength.
I'll continue to test some engines (unless you convince me that this way of testing is a waste of time) and post the results in the tournament forum.
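The comparison can be written down as a small check. This is just a sketch using the Elo, + and - columns of the two Prophet rows; treating them as plain interval bounds is a simplification on my part, and the helper names `interval` and `intervals_overlap` are invented for illustration.

```python
def interval(elo, plus, minus):
    """Return (lower, upper) from a rating and its +/- error margins."""
    return (elo - minus, elo + plus)


def intervals_overlap(a, b):
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


delta = interval(2144, 55, 57)   # Prophet-2.0-delta.1
beta = interval(1880, 64, 69)    # Prophet 2.0-beta.4

raw_difference = 2144 - 1880
gap = delta[0] - beta[1]         # worst case for delta vs. best case for beta

print(delta, beta, raw_difference, gap, intervals_overlap(delta, beta))
```

Even in the least favorable case for the newer version (2087 against 1944), it stays 143 points ahead.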
Regards
Volker