Alessandro Scotti wrote: Hi,
I wonder what you consider a good method to test an engine. I have only very limited resources for testing and would like to get the most out of them. So far I have run many games with the Noomen positions at 40/5, but sometimes it takes more than 24 hours to finish the test, which is a bit too much for me!
It seems there are two easy ways to shorten the test: play at faster time controls or play fewer games. So the question is: which would be most effective?
I've found that even 100 games are not that many and cannot detect small changes: I usually measure a variation of 4-5% between the worst and best performance of the same version. On the other hand, if after 20 games it has scored 0 points instead of, say, the usual 4-5, is it correct to stop the test?
Just to report a peculiar experience with Kiwi... 0.5a scored 18.5 in the Gauntlet test by Karl-Heinz S?ntges, and also 15-18 in my tests against Yace. Version 0.5b had a very small change in the stage calculation and the value of pawns in the endgame; it scored 56 against 0.5a and the same against Yace, yet 14.5 in the Gauntlet. Version 0.5c also beat version 0.5b and scored 19.5 against Yace, Kiwi's best result. Yet it was down to 13.5 in the Gauntlet. Version 0.5d scored 21.5 against Yace... it will probably drop to the bottom of the Gauntlet list if Karl-Heinz tests it again! All of this is very confusing... isn't it?
Oh well, the best way to test is the compute-intensive way.
A few years ago I had the privilege of having Jan Louwman as a tester.
He was the best tester on the surface of the planet. When he had to stop testing for health reasons, diep deteriorated for a few years.
Right now testing is the weakest spot of diep, as I own just 4 computers.
Which means 1-computer-versus-1-computer matches.
That's slow.
The testing method of Jan Louwman was the simplest testing method there is.
I designed a certain type of forward pruning. 6 months of work.
I created 2 versions. One version had the forward pruning, searching 2-3 ply deeper, 2 ply on average. It used the unpublished 'Ed Schroeder type forward pruning'. Ed has not published very clearly how his very special search works, though the tables used are posted online.
It's a tactical god.
So I knew from home testing that it was 600 points stronger tactically.
But while getting average search depths of 13-14 ply, positionally it played more like 8-9 ply.
So there was a big need for testing.
Jan Louwman played 3000 games at a time control of 40 moves in 2 hours. On slower hardware it was a bit slower; on slow laptops about 6 hours a game.
About 500 games were with the forward pruning and 500 games were without the forward pruning but WITH singular extensions.
The latter version scored 20-25% more against Fritz, Shredder, Junior and a wide variety of opponents.
a) it's statistically significant (a rough check of that below)
b) it's the only possible test there is
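To give an idea of why roughly 500 games per version settles a gap of that size, here is a minimal back-of-the-envelope sketch. This is my own illustration, not Jan's bookkeeping: the 30% vs 50% scores are made-up numbers standing in for a gap of about 20%.

# Rough significance check for the gap between two match scores.
# Illustrative numbers only: 30% vs 50% over ~500 games each.
from math import sqrt

def z_score(p1, n1, p2, n2):
    # z-statistic for the difference between two score fractions;
    # treating each game as win/lose overstates the per-game variance
    # when there are draws, so if anything this test is conservative.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) / se

z = z_score(0.30, 500, 0.50, 500)
print(f"z = {z:.1f}")   # about 6.6, far beyond any reasonable significance threshold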
My own experience is that playing your own program against itself is a very bad way to test things. Test sets are too. I already *knew* this version was tactically 600 points above the normal diep version. Ed Schroder is not a beginner, to put it politely, and he had designed this method in a time (around the 80s to early 90s) when only tactics mattered. So I knew it would be superior in that respect.
But this test showed very clearly that good testing is what you need.
The result was the opposite of what I thought it would be. Of course, after the test is long over, you can explain it. The thing is, accurate testing and keeping statistics are important.
The real strong point of Jan is that, despite being 75 years old,
he never made a single mistake in testing, in the sense that he always annotated very accurately which logfiles belonged to which test match. And more importantly, at 75 he got the job done.
A single 25-game match shows *nothing* statistically relevant, assuming your engine doesn't have huge bugs. If you are already pretty strong, play strong opponents, and want to measure whether a change improved you by, say, 30 rating points, then 25 games just won't do at all. The influence of learning, opening books, hardware, time control and perhaps the position of the sun: the variables are just too big.
The standard deviation is too big.
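To put a number on that, here is a small sketch (my own arithmetic, nothing Jan computed) of how much noise sits in an N-game match and what Elo resolution that leaves, using the standard logistic Elo model and treating each game as an independent result around the true strength:

# How precisely does an N-game match pin down strength?
from math import sqrt, log10

def score_stddev(p, n):
    # Std dev of the observed score fraction over n games with expected score p.
    # Uses the Bernoulli bound p*(1-p); draws make the real spread a bit smaller.
    return sqrt(p * (1 - p) / n)

def elo_from_score(p):
    # Standard logistic Elo model: expected score p <-> rating difference.
    return -400 * log10(1 / p - 1)

# A ~30 Elo improvement corresponds to an expected score of about 54% instead of 50%.
print(f"54.3% score is about {elo_from_score(0.543):+.0f} Elo")
for n in (25, 100, 500, 3000):
    sd = score_stddev(0.5, n)
    print(f"{n:5d} games: score noise +/- {sd:.1%}, roughly +/- {elo_from_score(0.5 + sd):.0f} Elo")
# 25 games: +/- 10% score, i.e. around +/- 70 Elo of noise, so a 30 Elo change is invisible.
# 3000 games: +/- ~0.9% score, i.e. around +/- 6 Elo, so a 30 Elo change stands out clearly.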
A good example is the current diep version. I modified some things in the evaluation function. I got a positive report back from a tester who played just 10 games, and I myself discovered some horrors in some other tests. But is it better or not?
I don't know. I GUESS it is.
Only the Jan Louwman method, which counts test games by the hundreds and always plays on until 3000 games, is a really good way to test things in this sport.
A big problem with this is that at the world championship you play on far superior hardware, and a version that sucks at blitz might be great at a 60-in-2 time control. So a slow time control is a necessity.
I hope that answers your question.