Moderator: Andres Valverde
Alessandro Scotti wrote:Hi,
I wonder what you consider a good method to test an engine. I have only very limited resources for testing, and would like to get the most out of it. So far, I have run many games with the Noomen positions at 40/5, but sometimes it takes more than 24 hours to finish the test, which is a bit too much for me!
It seems there are two easy ways to shorten the test: play at faster time controls or play fewer games. So the question is: which would be more effective?
I've found that even 100 games are not many, and cannot detect small changes: I usually measure a variation of 4-5% between the worst and best performance of the same version. On the other hand, if after 20 games it has scored 0 points instead of, say, the usual 4-5, is it correct to stop the test?
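A rough sanity check on that early-stopping question (a sketch only: it ignores draws and treats each game as an independent win/loss at an assumed ~20% scoring rate):

```python
# Sketch: if the usual score is about 4-5 points out of 20 (~20-25%),
# how surprising is 0 points from 20 games?  Crudely ignoring draws
# and treating games as independent Bernoulli trials:
p_usual = 0.20                      # assumed per-game scoring rate
p_zero = (1.0 - p_usual) ** 20      # chance of a 0/20 result by luck alone
print(f"P(0/20 | p={p_usual}) = {p_zero:.3f}")  # about 0.012
```

So a 0/20 start where you normally expect 4-5 points is already a roughly 1-in-100 event under this crude model, and stopping early looks defensible; the caveat is that draws and the opponent mix make the real distribution somewhat different from a pure coin-flip model.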
Just to report a peculiar experience with Kiwi... 0.5a scored 18.5 in the Gauntlet test by Karl-Heinz Süntges, and also 15-18 in my tests against Yace. Version 0.5b had a very small change in the stage calculation and the value of pawns in the endgame; it scored 56 against 0.5a and the same against Yace, yet 14.5 in the Gauntlet. Version 0.5c also beat 0.5b and scored 19.5 against Yace, Kiwi's best result, yet it was down to 13.5 in the Gauntlet. Version 0.5d scored 21.5 against Yace... it will probably drop to the bottom of the Gauntlet list if Karl-Heinz tests it again! All of this is very confusing, isn't it?
Stan Arts wrote:Hi Alessandro,
Well, unless you have some spare computers, getting an enormous number of games after each change is almost impossible for an amateur. But I tend to think that watching 10-20 games and following the engine's thinking as the programmer is worth as much as having hundreds of games with a cold strength indication.
I think that's especially true for evaluation changes, but also for search. Besides that, for both it's handy to have a collection of favourite test positions that address different parts of the chess program (for instance, positions for each type of extension, pruning-sensitive positions, zugzwang, positions that tend to cause instability, some quiet positional ones, etc.). They are a quick way of seeing what's going on, finding bugs, and so on. (But you have to be careful not to tune your program specifically on them.) I'll post my favourite positions if anyone's interested.
Stan
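Such collections are commonly kept as EPD lines: the first four FEN fields plus opcodes such as bm (best move) and id. A minimal parser sketch (a hypothetical helper, not something from this thread; the sample position is Fine #70, the classic zugzwang test where White wins with Kb1):

```python
# A tiny EPD reader sketch.  Each EPD line is the first four FEN fields
# plus ";"-separated opcodes, e.g. bm (best move) and id.  The split
# below is naive and assumes no semicolons inside quoted strings.
def parse_epd(line):
    parts = line.split(None, 4)
    position = " ".join(parts[:4])      # placement, side, castling, en passant
    opcodes = {}
    for op in parts[4].rstrip(";").split(";"):
        name, _, value = op.strip().partition(" ")
        opcodes[name] = value.strip().strip('"')
    return position, opcodes

# Fine #70, a standard zugzwang/triangulation test position:
pos, ops = parse_epd('8/k7/3p4/p2P1p2/P2P1P2/8/8/K7 w - - bm Kb1; id "Fine 70";')
print(pos, ops["bm"], ops["id"])
```

Feeding a file of such lines to the engine and checking whether it finds each bm within a time limit makes exactly the kind of quick, targeted check Stan describes.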
Alessandro Scotti wrote:I've been running a few games between Yace (PB) and Glaurung (0.2.3) at 40/4 and the results are quite interesting IMO. Each match includes 100 games from the 50 Noomen positions, Yace scores:
- match 1 = 44.5%
- match 2 = 48.0%
- match 3 = 52.5%
The average so far is 48.3% with a +/- 4% error, corresponding to +/- 30 elo. I'll try to reach 500 games total, then run the same test at 40/2 and see what happens.
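The conversion behind numbers like "±4% ≈ ±30 Elo" can be sketched as follows (assuming the usual logistic Elo model and a crude binomial error bar; the real per-game variance is a bit lower because of draws):

```python
import math

def elo_diff(score):
    """Elo difference implied by an overall score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_stderr(score, games):
    """Crude binomial standard error of the score fraction (ignores draws)."""
    return math.sqrt(score * (1.0 - score) / games)

p, n = 0.483, 300                      # Alessandro's running total
se = score_stderr(p, n)
print(f"score {p:.1%} +/- {1.96 * se:.1%} (95% interval)")
print(f"Elo   {elo_diff(p):+.1f} "
      f"(band {elo_diff(p - 1.96 * se):+.1f} to {elo_diff(p + 1.96 * se):+.1f})")
```

Near a 50% score, each percentage point of score is worth roughly 7 Elo, which is where "±4% is about ±30 Elo" comes from.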
Pallav Nawani wrote:Yace's score is steadily increasing. Is learning on?
Robert Allgeuer wrote:My view with respect to testing:
- Nothing can beat real games (although some aspects are of course better tested with test-suites, e.g. move ordering)
- 400 to 500 games is the minimum for getting a reasonable estimation of strength (+/- 30 ELO)
- Test with well-defined start positions (e.g. Nunn), not opening books
- Do not rely on self-play or games against the previous version: This is often misleading ...
- Test against a set of standard opponents, the average strength of them should be around the strength of the tested engine (score should be around 50%)
- Short time controls are OK. A shorter time control with more games and more opponents is preferable to a longer time control with fewer of both. Best, of course, is testing at several time controls, including long ones.
- When using short time controls, use a time control with increment, so that results are not dominated by losses on time
- Test features in isolation
Robert
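Robert's 400-500 game figure can be sanity-checked with a back-of-the-envelope calculation (a sketch; the ~0.4 per-game standard deviation is an assumption that depends on the draw rate):

```python
import math

def games_for_elo_margin(margin_elo, per_game_sd=0.4, z=1.96):
    """Games needed so the 95% Elo error bar stays within +/- margin_elo.
    per_game_sd ~0.4 assumes a typical draw rate; 0.5 is the no-draw worst case."""
    elo_per_point = 400.0 / math.log(10) * 4.0   # Elo-vs-score slope near 50%
    n = (z * elo_per_point * per_game_sd / margin_elo) ** 2
    return math.ceil(n)

print(games_for_elo_margin(30))        # -> 330
print(games_for_elo_margin(30, 0.5))   # -> 516 (no draws at all)
```

With a typical draw rate the answer comes out around 330 games, and with no draws just over 500, which brackets the 400-500 figure quite nicely.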
Uri Blass wrote:
I have never tested against many engines.
One of the problems I have with doing it is that I need to install the relevant engines, verify details such as making sure learning is switched off, and edit batch files.
I prefer to use WinBoard rather than other interfaces that may have bugs. I read in the past that Arena has bugs, such as hiding the time problems of some engines, and I prefer a stable interface to one that changes frequently, where the programmers fix one bug and introduce another.
If I could download WinBoard together with some engines and a batch file for testing, that would be productive, because I do not like to spend time installing new engines and writing a specific batch file for every engine I test against.
Uri
Alessandro Scotti wrote:I've now run two more matches at this time control:
- match 1 = 44.5%
- match 2 = 48.0%
- match 3 = 52.5%
- match 4 = 42.0%
- match 5 = 51.5%
(Yace/Glaurung playing both colors from the 50 Noomen positions at 40/4 on a fast P4).
I am a bit "disappointed" by the wide error range: the minimum score is 42% and the maximum 52.5%, which is a really large gap! So it seems that running "only" 100 games is not very helpful unless you get an extreme score, say 30%.
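That 42-52.5% spread is actually about what the statistics predict for 100-game matches (a sketch; the ~0.4 per-game standard deviation is an assumed typical value with draws):

```python
import math

games = 100
per_game_sd = 0.4                          # assumed; 0.5 with no draws at all
match_sd = per_game_sd / math.sqrt(games)  # sd of one match's score fraction
print(f"expected sd of a 100-game match score: {match_sd:.1%}")  # 4.0%
# Across 5 matches, a max-minus-min range of 2-3 standard deviations
# (8-12 percentage points) is entirely normal luck, not a bug.
```

In other words, single 100-game matches can only reliably flag changes much larger than a few percent, which matches the experience reported above.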
Alessandro Scotti wrote:Uri Blass wrote:If I can download winboard together with some engines and some batch file in order to test then it can be productive because I do not like to spend time on installing new engines and on writing a specific batch file for every engine that I test against it.
Hi Uri,
I run all these tests under Linux, if that is ok I can send you the (simple) scripts I use to run matches, generate statistics and so on.
Joachim Rang wrote:Switch to ShredderClassic and UCI, then it is as easy as creating different Word documents.
regards Joachim