Rémi Coulom wrote:
I would also add that analyzing games is good not only because it gives you an intuitive indication of whether the new version plays better or not, but it also indicates why. So this helps a lot to understand how the evaluation should be modified to get improvement.
Rémi
Rémi,
Thank you for your detailed reply.
Because SharpChess only got its WinBoard interface last week, all of its development up till now has been measured by playing games against other chess programs that have their own GUI chess boards. This was achieved by me loading up SharpChess and the "enemy" program at the same time, setting them to play opposite colours, and then manually moving the pieces between the two programs' chess boards! This is why it's taken me 16 months to develop an engine that plays at around 1600-1800 Elo!
Of course, by making the moves manually, I've been able to observe the games in detail, think while they think, and spot what I think are SharpChess's errors, then gradually hone the evaluation function until it makes the moves that make sense to me. Consequently, I've been able to slowly watch it go from losing every time to winning nearly all the time!
My first goal was for SharpChess to beat a chess-playing friend of mine. After I'd first shown him the program, when it was playing at around 3-4 ply in 30 seconds, we agreed a challenge where SharpChess would get 30 seconds a move, and he would get as long as he liked. I figured with just a few weeks' work, I'd be beating him in no time! Sadly, he got better at chess too! It was actually only last month that SharpChess beat him for the first time, and only after programming pondering in order to take advantage of his "long" thinking time. It really has been a titanic 16-month battle of wins, losses, crashes etc... Great stuff!
Anyway, back to testing. My first great computer adversary, one that will always hold a warm place in my heart, was:
Little Chess Partner
http://www.lokasoft.nl/uk/jchess/chessgame.htm
I started off setting it to 5 seconds a move, and SharpChess to 30 seconds. It was kicking my ass for a good while, but I gradually started winning and increased its time in 5-second increments, eventually to the point where, given equal time, SharpChess wins every game! Woot!
After that, I pitted it against HotBabe Chess:
http://www.stauffercom.com/hotbabe/
If you haven't downloaded this yet, then try it at least once. It's a great laugh. It actually plays a stronger game than Little Chess Partner, and it's written in Eiffel. Mad eh?! I applied the same process with HotBabe, starting it at 5 seconds and going up in small increments, until now SharpChess and HotBabe play at about the same level.
So, this has been a great and fun way of improving playing strength, but a very slow one.
Because I can save games in SharpChess, I also have around 40 saved positions that I use for testing. Some I use just for speed/node-count tests - middle-game positions with lots "going on"; some are positions where I know a fixed "best" move that is only found past a certain depth (say 8 ply); and some are other positions of interest, like endgames, three-move repetition, the 50-move rule - all the types of things that you need to test for.
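For what it's worth, here's a minimal sketch of the kind of "known best move by a given depth" harness I mean. It's C++ rather than SharpChess's actual C# code, and everything in it - TestPosition, searchBestMove, the placeholder FEN entries - is a hypothetical stand-in for the engine's real interface and my saved positions, just to show the shape of the test loop:

```cpp
#include <iostream>
#include <string>
#include <vector>

// One saved test position: a FEN string, the known best move, and the
// depth by which the search is expected to find it.
struct TestPosition {
    std::string fen;
    std::string bestMove;   // e.g. "e2e4"
    int minDepth;
};

// Stub standing in for a call into the engine; a real harness would ask
// SharpChess (or any engine) to search `fen` to `depth` and return its move.
std::string searchBestMove(const std::string& /*fen*/, int /*depth*/) {
    return "";
}

int main() {
    // Fill this with saved positions; the commented entry is a placeholder.
    std::vector<TestPosition> suite = {
        // { "<fen of a saved position>", "d4d5", 8 },
    };

    std::size_t passed = 0;
    for (const TestPosition& tp : suite) {
        std::string move = searchBestMove(tp.fen, tp.minDepth);
        bool ok = (move == tp.bestMove);
        std::cout << tp.fen << "  expected " << tp.bestMove
                  << ", got " << move << (ok ? "  PASS" : "  FAIL") << "\n";
        if (ok) ++passed;
    }
    std::cout << passed << "/" << suite.size() << " positions solved\n";
    return 0;
}
```

Running that over the whole suite after each change at least tells me quickly whether I've broken anything the engine could already solve.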
The hardest ones to test, I feel, are modifications that involve forward pruning: null move, futility etc. Although these tend to result in both fewer nodes and faster searches, it is very hard to tell whether they actually improve playing strength.
A great example of this was when I was fiddling with verified null-move forward pruning. I tried setting "verify=false" at the root of my alpha-beta search, instead of the recommended "verify=true". This resulted in an instant increase in search depth of a whole 2 ply (from 8 to 10 in my test positions). "Woot!", says I. However, when testing this on an 8-ply test position, I found that the correct move wasn't actually found until SharpChess reached ply 10. So, you can't be too careful! As it happens, I left the change in, 'cause it still seemed to increase playing strength slightly. It'd be nice to be able to "prove" this though, hence my questions on here.
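For anyone who hasn't played with it, here's a rough sketch of the verified null-move idea I'm describing, again in C++ rather than SharpChess's C#. The Position/Move types and evaluate() are hypothetical stubs so it compiles, the exact conditions are one common formulation rather than my actual code, and the `verify` flag passed in at the root is the setting I was fiddling with above:

```cpp
#include <vector>

// Hypothetical stand-ins for the engine's real board and move code; these
// are placeholders so the sketch compiles, not SharpChess's actual API.
struct Move {};
struct Position {
    bool inCheck() const { return false; }
    std::vector<Move> legalMoves() const { return {}; }
    void makeNullMove() {}
    void unmakeNullMove() {}
    void make(const Move&) {}
    void unmake(const Move&) {}
};
int evaluate(const Position&) { return 0; }

const int INF = 1000000;
const int R   = 3;   // null-move depth reduction

// Alpha-beta with verified null-move pruning. `verify` controls whether a
// null-move fail-high is trusted immediately (false) or checked first by a
// reduced-depth search of the remaining subtree (true).
int search(Position& pos, int depth, int alpha, int beta, bool verify) {
    if (depth <= 0)
        return evaluate(pos);   // a real engine would drop into quiescence

    bool failHigh = false;

    // Try the null move (skipped when in check).
    if (!pos.inCheck() && depth >= 2) {
        pos.makeNullMove();
        int value = -search(pos, depth - R - 1, -beta, -beta + 1, verify);
        pos.unmakeNullMove();
        if (value >= beta) {
            if (!verify)
                return beta;    // unverified cutoff, as in plain null move
            // Verified mode: don't cut off yet; reduce the depth and search
            // this subtree with verification switched off.
            depth--;
            verify = false;
            failHigh = true;
        }
    }

research:
    int best = -INF;
    for (const Move& m : pos.legalMoves()) {
        pos.make(m);
        int value = -search(pos, depth - 1, -beta, -alpha, verify);
        pos.unmake(m);
        if (value > best) best = value;
        if (best > alpha) alpha = best;
        if (alpha >= beta) return beta;   // a real move confirms the cutoff
    }

    // The null move promised a fail-high but no real move delivered one
    // (possible zugzwang): re-search at the original depth with verification.
    if (failHigh && best < beta) {
        depth++;
        failHigh = false;
        verify = true;
        goto research;
    }
    return best;
}

int main() {
    Position root;
    // Root call; my experiment above was passing verify=false here instead
    // of the recommended verify=true.
    int score = search(root, 8, -INF, INF, /*verify=*/true);
    (void)score;
    return 0;
}
```

With verify=false at the root, the whole tree behaves like plain (unverified) null-move pruning, which is exactly why the depth jumps but the tactical reliability can quietly drop.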