Two Prophet versions in the Nunn test

Post by Volker Pittlik » 21 Aug 2006, 10:50

Inspired by Marc Lacrosse's MLmfl test, I thought about a way to enable even a lazy person like me to answer the question "which version is stronger?" based on a meaningful number of games.

Marc's test, or others with even more positions, take too much time (for me) to wait for the results. The question should be answered after an overnight test. Therefore I have chosen the Nunn-1 test suite with these 10 positions:

r2qk2r/5pbp/p1np4/1p1Npb2/8/N1P5/PP3PPP/R2QKB1R w KQkq - 0 14
r1b2rk1/1pq1bppp/p1nppn2/8/3NP3/1BN1B3/PPP1QPPP/2KR3R w - - 6 11
r3k2r/p1qbnppp/1pn1p3/2ppP3/P2P4/2PB1N2/2P2PPP/R1BQ1RK1 b kq - 5 11
r1b2rk1/2q1bppp/p2p1n2/npp1p3/3PP3/2P2N1P/PPBN1PP1/R1BQR1K1 b - - 2 12
r1bqrnk1/pp2bppp/2p2n2/3p2B1/3P4/2NBPN2/PPQ2PPP/R4RK1 w - - 5 11
rnbq1rk1/pp2ppbp/6p1/8/3PP3/5N2/P3BPPP/1RBQK2R b K - 0 10
2rq1rk1/p2nbppp/bpp1p3/3p4/2PPP3/1PB3P1/P2N1PBP/R2Q1RK1 b - - 0 13
r1bqnrk1/ppp1npbp/3p2p1/3Pp3/2P1P3/2N1B3/PP2BPPP/R2QNRK1 b - - 4 10
r1bq1rk1/ppp1npbp/2np2p1/4p3/2P5/2NPP1P1/PP2NPBP/R1BQ1RK1 b - - 0 8
rnb1kb1r/1p3ppp/p2ppn2/6B1/3NPP2/q1N5/P1PQ2PP/1R2KB1R w Kkq - 2 10

Bullet time controls of one minute initially plus 1 second per move were used. All games were played with ponder on, all books and learning disabled, and 4-piece TBs on a RAM disk. The test system was my XP 2000+, OS: Linux (Ubuntu 6.06).

First I chose 10 engines of very different playing strength. It is important that these engines run without any bugs (especially that they do not lose on time). I tested a lot of engines and settled on those in the following table:

Rank Name                        Elo    +    - games score oppo. draws
   1 Spike 1.2 Turin            2830   65   58   180   89%  2349    9%
   2 Fruit 2.1                  2734   56   53   180   81%  2360   11%
   3 Aristarch 4.50             2613   53   51   180   69%  2373    8%
   4 Pepito v1.59 (Conservador) 2530   50   50   180   61%  2382   10%
   5 Yace Paderborn             2520   50   50   180   60%  2383    9%
   6 The Baron 1.7.0            2466   49   49   180   54%  2389   13%
   7 Scidlet 3.6                2324   51   52   180   40%  2405    9%
   8 Natwarlal v0.12            2197   56   59   180   28%  2419    6%
   9 Small Potato 0.6.1         2009   65   71   180   15%  2440    8%
  10 Needle 0.53.1              1747   89  113   180    3%  2469    4%


These engines played a round-robin tourney using the test positions. I have shifted the ratings so that Pepito gets a rating of 2530, which is close to its average rating in WBEC, RWBC, YABRL and CEGT.
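
The games column in the table can be checked with a little arithmetic. This is a minimal sketch, assuming that every pairing plays each test position once with each color (an assumption, but one that matches the 180 games per engine shown above). The function name is made up for illustration.

```python
from itertools import combinations

def round_robin_games(n_engines, n_positions, colors=2):
    """Return (total games, games per engine) for a round robin in which
    every pairing plays every start position once with each color."""
    pairings = len(list(combinations(range(n_engines), 2)))
    total = pairings * n_positions * colors
    per_engine = (n_engines - 1) * n_positions * colors
    return total, per_engine

total_games, games_per_engine = round_robin_games(n_engines=10, n_positions=10)
print(total_games, games_per_engine)
```

With 10 engines and 10 positions this gives 45 pairings and 20 games per pairing.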

If you compare the ratings of the other programs with these rating lists you will find The Baron a bit too low and Needle too high. The error margins of the other programs overlap. Therefore I think these ratings are not too bad. (I have no idea why The Baron is relatively weak here. Maybe this has to do with the fact that I'm the only one who uses the Linux version?)

But establishing the most accurate list in the world is not my goal. I just want to answer the simpler question of which version of a program is stronger.

Therefore two versions of Prophet were my first test candidates. Prophet-2.0-beta.4 and Prophet-2.0-delta.1 ran the test against the engines listed above, which took around 16 hours. Here are the results:

Rank Name                         Elo    +    - games score oppo. draws
   1 Spike 1.2 Turin             2828   65   58   220   91%  2290    8%
   2 Fruit 2.1                   2740   57   53   220   85%  2298    9%
   3 Aristarch 4.50              2618   52   51   220   75%  2309    7%
   4 Pepito v1.59 (Conservador)  2530   50   49   220   67%  2317    9%
   5 Yace Paderborn              2526   49   49   220   67%  2318    8%
   6 The Baron 1.7.0             2465   47   47   220   61%  2323   12%
   7 Scidlet 3.6                 2314   48   48   220   46%  2337    8%
   8 Natwarlal v0.12             2181   51   52   220   34%  2349    5%
   9 Prophet-2.0-delta.1         2144   55   57   200   27%  2400    6%
  10 Small Potato 0.6.1          2039   52   54   220   24%  2362   12%
  11 Prophet 2.0-beta.4          1880   64   69   200   13%  2400    8%
  12 Needle 0.53.1               1756   67   77   220    6%  2388    6%
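
For a rough idea of the cost of such a run, the number of new games and the average time per game can be worked out from the figures given. This is a sketch under the assumption that each Prophet version played every position with both colors against each of the 10 reference engines and that the two versions did not play each other, which is consistent with the 200 games listed for each of them. The "around 16 hours" is taken at face value, so the per-game figure is only approximate.

```python
def gauntlet_games(n_candidates, n_opponents, n_positions, colors=2):
    """Games needed when each candidate plays every opponent from every
    start position with each color (candidates do not meet each other)."""
    return n_candidates * n_opponents * n_positions * colors

new_games = gauntlet_games(n_candidates=2, n_opponents=10, n_positions=10)
hours = 16  # "around 16 hours" as reported, so treat the result as a rough value
seconds_per_game = hours * 3600 / new_games
print(new_games, seconds_per_game)
```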


The newer version performed clearly better in this test. Not only is the raw rating difference 264 points; even if the maximum addition and subtraction according to the error margins are applied, the ratings do not overlap. I guess this correlates to something, possibly to playing strength.

I'll continue to test some engines (unless you convince me this way of testing is a waste of time) and post the results in the tournament forum.
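
The error-margin comparison can be written out explicitly. This is only a sketch of the simple check described here (do the two rating intervals overlap?), not a full significance test. The function names are made up for illustration, and the numbers are taken from the table above.

```python
def rating_interval(elo, plus, minus):
    """Lower and upper rating bounds from the '+' and '-' columns."""
    return elo - minus, elo + plus

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

delta_1 = rating_interval(2144, plus=55, minus=57)  # Prophet-2.0-delta.1
beta_4 = rating_interval(1880, plus=64, minus=69)   # Prophet 2.0-beta.4

print(delta_1, beta_4)
print(intervals_overlap(delta_1, beta_4))
print(delta_1[0] - beta_4[1])  # gap between the two intervals in Elo points
```

The lower bound for delta.1 is 2087 and the upper bound for beta.4 is 1944, so the intervals are still 143 points apart.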

Regards

Volker

Re: Two Prophet versions in the Nunn test

Post by Tony Thomas » 22 Aug 2006, 13:17

I was convinced after Marc's post that this is the future way of testing. I also started doing something similar; I just run the tournaments at 1/1 time control without pondering, with own books and learning enabled, and no TBs. I did run into problems with a few programs that perform very weakly at that time control compared to long time controls. Then there are some, like Danasah, Romichess, and Twisted, that perform exceptionally well at that time control.