Winboard Forum

by **peterhughes** » 30 Mar 2005, 12:05

Say that I make a small change to SharpChess's evaluation function, and I want to quickly, scientifically test whether that change increases or decreases its playing strength. What is the shortest/fastest/most accurate way of finding this out? Are there any testing tools for automating this process?

Thanks
Peter Hughes

by **Rémi Coulom** » 30 Mar 2005, 13:25

peterhughes wrote:Say that I make a small change to SharpChess's evaluation function, and I want to quickly, scientifically test whether that change increases or decreases its playing strength. What is the shortest/fastest/most accurate way of finding this out? Are there any testing tools for automating this process?

Thanks
Peter Hughes

This is an important and difficult question.

Some simple changes that mainly affect speed (hash-table algorithm, move ordering, low-level optimizations) can be estimated by measuring average time-to-depth on a database of positions, for instance.
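As a rough illustration of the time-to-depth idea, here is a minimal Python sketch. The `search(fen, depth)` callable is a hypothetical wrapper around your own engine's fixed-depth search and is not part of any real library; the arithmetic mean is just one reasonable choice of average.

```python
import time
from statistics import mean


def average_time_to_depth(positions, search, depth):
    """Return the mean wall-clock seconds `search` needs to reach `depth`.

    `search(fen, depth)` is a hypothetical callable supplied by the caller;
    it should run the engine's search on the position to the given depth.
    """
    timings = []
    for fen in positions:
        start = time.perf_counter()
        search(fen, depth)
        timings.append(time.perf_counter() - start)
    return mean(timings)
```

Running this on the same positions before and after a speed-related change gives two numbers that can be compared directly, provided the machine is otherwise idle.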

Changes in evaluation or selectivity are more difficult to assess. The only good way I know is to play a lot of games against a variety of opponents of similar strength. I have had many bad experiences evaluating changes on databases of test positions. According to discussions I have had with some pros, many of them have a collection of computers that they use to run several matches in parallel all the time, to evaluate their changes. So I guess it is not a bad approach.

Now one question is: how many games are required to obtain a reliable estimate of the strength difference? The problem is that, usually, many are required, especially if the difference in strength is small. I have some statistical tools on my webpage that you may find useful:
http://remi.coulom.free.fr/WhoIsBest.zip can estimate the likelihood that one version is stronger than another, based on the outcome of games of these two programs against the same set of sparring partners
http://remi.coulom.free.fr/Bayesian-Elo/ describes some ongoing work on elo-rating estimation. It also provides tools that estimate the likelihood that one program is stronger than another.
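To give a feel for why so many games are needed, here is a small Python sketch for the simplest case: a direct match between two versions. It is not the method used by the tools linked above. It is the common normal-approximation "likelihood of superiority" formula, which ignores draws, and the function name is chosen for illustration only.

```python
import math


def likelihood_of_superiority(wins, losses):
    """Approximate probability that the side with `wins` is the stronger one.

    Draws are ignored. Uses the normal approximation
    0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))).
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))
```

Under this approximation, 60 wins against 40 losses gives about 0.977, while 6 wins against 4 losses (the same ratio) gives only about 0.74, which is why short matches say little about small changes.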

Another important question is the choice of time control. If you wish to obtain an answer within a reasonable amount of time, you'll have to play fast games. Which time control is best is not obvious.

In practice, very small changes are very difficult to assess in a short time, and you'll often have to rely on your intuition. I think it is a good idea to watch and analyze the games that your program plays. Your understanding of what really happens in the games that you play provides a lot more information than merely counting the number of wins and losses. For instance, if you change your code for King safety, and the program loses because of a blunder in the endgame, then it does not mean the same as a loss because of a bad King attack. This kind of analysis is difficult to quantify in a purely statistical and scientific manner.

Of course, if you wish to present your results in a scientific paper, you should not rely on this kind of intuition. But in practice, it may be a faster way to make progress, if you have the humility not to rely too much on it.

Rémi

by **Rémi Coulom** » 30 Mar 2005, 13:49

Rémi Coulom wrote:
peterhughes wrote:Say that I make a small change to SharpChess's evaluation function, and I want to quickly, scientifically test whether that change increases or decreases its playing strength. What is the shortest/fastest/most accurate way of finding this out? Are there any testing tools for automating this process?

Thanks
Peter Hughes

Of course, if you wish to present your results in a scientific paper, you should not rely on this kind of intuition. But in practice, it may be a faster way to make progress, if you have the humility not to rely too much on it.

I would also add that analyzing games is good not only because it gives you an intuitive indication of whether the new version plays better or not, but also because it indicates why. This helps a lot in understanding how the evaluation should be modified to get an improvement.

Rémi

by **peterhughes** » 01 Apr 2005, 13:15

I suppose what I'm looking for is a test program that:

Has a list of test positions
For each position, knows what the "best" move is that can be made from that position, and how many plies (search depth) a computer would have to search in order to find that best move.
The test program would then start the WinBoard engine to be tested, and send it the FEN code for the first test position.
The test program would then start the engine thinking and record the following information:
- Whether, or not, the correct move was found.
- How many plies (depth) the engine required to find the best move.
- Whether that depth was higher or lower than the expected depth to find the best move.
- How long the engine took (in seconds) to find the best move.
- The test program would stop the engine thinking if the engine:
  - a) Found the best move.
  - b) "Thought" for longer than x configurable seconds, without finding the best move.
  - c) Searched x configurable depth deeper than the best move was known to exist at.
The test program would then move onto the next test position, and start the process again, until all positions have been tested.
After all positions had been tested, the test program would output the following information:
- Number of positions tested.
- Number of correct moves/solutions found.
- Number of incorrect moves/solutions found.
- Number of failures due to spending more time thinking than allowed.
- Number of failures due to searching to a depth greater than set threshold for position.
- Total time of test.
- Total time spent finding correct positions.
- Total nodes searched finding correct positions.
- Average nodes/second across all positions in the entire test.
- Deepest (most complex) position solved.
- A list of the positions, with FEN, that the engine failed on, with details of the type of failure.
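The final report described above could be assembled along these lines. This is only a sketch in Python: the `PositionResult` record, its field names, and the failure labels ("timeout", "too_deep", "wrong_move") are all assumptions made for illustration, and the code that actually drives the engine is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PositionResult:
    """One test position's outcome (hypothetical record layout)."""
    fen: str
    solved: bool
    depth: int       # depth reached when the search stopped
    seconds: float   # time spent on this position
    nodes: int       # nodes searched on this position
    failure: Optional[str] = None  # "timeout", "too_deep" or "wrong_move"


def summarize(results):
    """Aggregate per-position results into the end-of-run report."""
    solved = [r for r in results if r.solved]
    total_seconds = sum(r.seconds for r in results)
    total_nodes = sum(r.nodes for r in results)
    return {
        "tested": len(results),
        "solved": len(solved),
        "wrong_moves": sum(1 for r in results if r.failure == "wrong_move"),
        "timeouts": sum(1 for r in results if r.failure == "timeout"),
        "too_deep": sum(1 for r in results if r.failure == "too_deep"),
        "total_seconds": total_seconds,
        "seconds_on_solved": sum(r.seconds for r in solved),
        "nodes_on_solved": sum(r.nodes for r in solved),
        "nodes_per_second": total_nodes / total_seconds if total_seconds else 0.0,
        "deepest_solved": max((r.depth for r in solved), default=0),
        "failed": [(r.fen, r.failure) for r in results if not r.solved],
    }
```

Keeping the per-position records separate from the summary makes it easy to compare two runs position by position, not just by totals.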

Does anyone know if a program like this already exists? Where can I get it?

Do you think it would be of any use/ do you find it useful?

by **Rémi Coulom** » 01 Apr 2005, 14:10

Maybe epd2wb is what you want:
http://www.seanet.com/~brucemo/gerbil/gerbil.htm

I do not think you'll find a database of test positions that is good enough to test changes in the evaluation function.

Rémi

by **Pallav Nawani** » 01 Apr 2005, 14:13

Try arena.
http://www.playwitharena.com
No program has a list of chess positions, but you can download a lot of EPD files from Dann Corbit's FTP. Alternatively, some EPD files come bundled with Arasan source code. http://www.arasanchess.org.
Dann Corbit's FTP is a useful resource: it has many papers/publications on chess programming, and it also has the sources of many programs.
Also, there is a program, epd2wb, that is bundled with Bruce Moreland's chess program, Gerbil. It is also capable of running EPD test suites.

Finally, a word of warning. Arena's EPD test suite support is buggy; sometimes it fails to detect that a program has correctly solved the position.

Best regards,
Pallav

by **Charles Roberson** » 01 Apr 2005, 18:34

Peter,

I did all this myself in NoonianChess and I suggest you do as well,
because you'll find adding the specific functions will gain more benefit
than what you listed.

What to do:
1) Code a simple function to parse a FEN line.
2) Add a feature for fixed search time (you probably have it).
3) Code a simple function (test suite) that reads a file one line at a time until EOF.
   a) Set the search time via #2.
   b) For each line, the function will call function #1 and then call your search routines.
   c) Record data.
4) Add a command-line option to tell the program to go into test suite mode and accept the filename from the command line.

This should do all that you ask for. Plus adding these features will
set you up for analysis modes.
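Steps 1 and 3 above might look roughly like this in Python. It is a sketch under assumptions, not NoonianChess's code: `search(position, seconds)` is a hypothetical stand-in for your engine's fixed-time search, and the board is stored as a plain list of ranks purely for illustration.

```python
def parse_fen(line):
    """Parse a six-field FEN line into a simple dictionary (step 1)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 FEN fields, got %d" % len(fields))
    placement, side, castling, en_passant, halfmove, fullmove = fields
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("expected 8 ranks in piece placement")
    board = []
    for rank in ranks:  # FEN lists rank 8 first, rank 1 last
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend([None] * int(ch))  # a digit is a run of empty squares
            else:
                row.append(ch)  # a piece letter, upper case for White
        if len(row) != 8:
            raise ValueError("rank %r does not describe 8 squares" % rank)
        board.append(row)
    return {
        "board": board,
        "white_to_move": side == "w",
        "castling": castling,
        "en_passant": None if en_passant == "-" else en_passant,
        "halfmove": int(halfmove),
        "fullmove": int(fullmove),
    }


def run_suite(lines, search, seconds):
    """Run `search` on every non-empty FEN line and collect the results (step 3).

    `search(position, seconds)` is a hypothetical callable provided by the engine.
    """
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        results.append((line, search(parse_fen(line), seconds)))
    return results
```

Because `run_suite` only iterates over `lines`, it accepts an open file object directly, so step 4 reduces to opening the filename given on the command line and passing the file in.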

by **peterhughes** » 01 Apr 2005, 20:55

Rémi Coulom wrote:Maybe epd2wb is what you want:
http://www.seanet.com/~brucemo/gerbil/gerbil.htm

I do not think you'll find a database of test positions that is good enough to test changes in the evaluation function.

Rémi

Yes, this is pretty much what I'm looking for. I was sure that somebody, somewhere would have done this kind of thing before.

Cheers
Pete

by **peterhughes** » 01 Apr 2005, 22:18

Rémi Coulom wrote:
I would also add that analyzing games is good not only because it gives you an intuitive indication of whether the new version plays better or not, but it also indicates why. So this helps a lot to understand how the evaluation should be modified to get improvement.

Rémi

Remi

Thank you for your detailed reply.

Because SharpChess only got its WinBoard interface last week, all of its development up till now has been measured by playing games against other chess programs that have their own GUI chess boards. This was achieved by me loading up SharpChess and the "enemy" program at the same time, setting them to play opposite colours, and then manually moving the pieces between the two programs' chess boards! This is why it's taken me 16 months to develop an engine that plays at around 1600-1800 ELO!

Of course, by making the moves manually, I've been able to observe the games in detail, think while they think, and spot what I think are SharpChess's errors, then gradually hone the evaluation function until it makes the moves that make sense to me. Consequently, I've been able to slowly watch it go from losing every time to winning nearly all the time!

My first goal was for SharpChess to beat a chess-playing friend of mine. After I'd first shown him the program, when it was searching around 3-4 ply in 30 seconds, we agreed a challenge where SharpChess would get 30 seconds a move, and he would get as long as he liked. I figured with just a few weeks' work, I'd be beating him in no time! Sadly, he got better at chess too! It was only last month that SharpChess actually beat him for the first time, and only after I had programmed pondering in order to take advantage of his "long" thinking time. It really has been a titanic 16-month battle of wins, losses, crashes etc... Great stuff!

Anyway, back to testing. My first great computer adversary, which will always hold a warm place in my heart, was:

Little Chess Partner
http://www.lokasoft.nl/uk/jchess/chessgame.htm

I started off setting it to 5 seconds a move, and SharpChess to 30 seconds. It was kicking my ass for a good while, but I gradually started winning, and increasing its time in 5 second increments, eventually to the point where, given equal time, SharpChess wins every game! Woot!

After that, I pitted it against: HotBabe chess

http://www.stauffercom.com/hotbabe/

If you haven't downloaded this yet, then try it at least once. It's a great laugh. It actually plays a stronger game than Little Chess Partner, and it's written in Eiffel. Mad eh?! I applied the same process with HotBabe, starting it at 5 seconds and going up in small increments, until now SharpChess and HotBabe play at about the same level.

So, this has been a great and fun way of improving playing strength, but a very slow one.

Because I can save games in SharpChess, I also have around 40 saved positions that I use for testing. Some are middle-game positions with lots "going on", which I use just for speed/node-count tests; some are positions where I know a fixed "best" move that is only found past a certain depth (say 8 ply); and some are other positions of interest, like endgames, three-move repetition positions and the 50-move rule: all the types of things that you need to test for.

The hardest ones to test, I feel, are modifications that involve forward pruning: null-move, futility etc. Although these tend to result in both fewer nodes and faster searches, it is very hard to tell whether they actually improve playing strength.

A great example of this was when I was fiddling with verified null-move forward pruning. I tried setting "verify=false" at the root of my alpha-beta search, instead of the recommended "verify=true". This resulted in an instant increase in search depth of a whole 2 ply (from 8 to 10 in my test positions). "Woot!", says I. However, when testing this on an 8-ply test position, I found that the correct move wasn't then actually found until SharpChess reached ply 10. So, you can't be too careful! As it happens, I left the change in, 'cause it still seemed to increase playing strength slightly. It'd be nice to be able to "prove" this though, hence my questions on here.

by **peterhughes** » 01 Apr 2005, 22:30

Pallav Nawani wrote:Try arena.
http://www.playwitharena.com
No program has a list of chess positions, but you can download a lot of EPD files from Dann Corbit's FTP. Alternatively, some EPD files come bundled with Arasan source code. http://www.arasanchess.org.
Dann Corbit's FTP is a useful resource: it has many papers/publications on chess programming, and it also has the sources of many programs.
Also, there is a program, epd2wb, that is bundled with Bruce Moreland's chess program, Gerbil. It is also capable of running EPD test suites.

Finally, a word of warning. Arena's EPD test suite support is buggy; sometimes it fails to detect that a program has correctly solved the position.

Best regards,
Pallav

Yes. I've just got hold of Arena. What a fantastic program. Pure class. I've set up a few Gauntlet-style matches against engines rated 1600-1700 ELO, with SharpChess as the first engine. This is then a pretty good automated test of playing strength. I notice it has an "Analyse" mode also; I haven't programmed support for this yet, but I hope to take advantage of it soon. Also, the ELOstat program looks like it could be useful for my purposes too.
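For a quick sanity check on a gauntlet result, the overall score can be turned into a rating difference with the standard logistic Elo formula. The Python sketch below is not ELOstat's algorithm and gives no error margin; it only converts a win/draw/loss tally into the implied difference against the average opponent, and the function name is made up for illustration.

```python
import math


def elo_difference(wins, draws, losses):
    """Elo difference implied by a match score, via the logistic Elo curve.

    A positive result means the scoring side is stronger. Raises ValueError
    for an empty match or a 0% / 100% score, where the estimate is unbounded.
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("score of 0% or 100% gives no finite estimate")
    return -400.0 * math.log10(1.0 / score - 1.0)
```

A 75% score works out to roughly +191 Elo and a 50% score to 0, but with only a handful of games the uncertainty around such a figure is large.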

I hadn't realised that it had EPD testing. I'll get on to that asap.

Cheers
Pete

Fastest way to test engine playing strength?