Moderator: Andres Valverde
Alessandro Scotti wrote:Hi,
I wonder what you consider a good method to test an engine. I have only very limited resources for testing, and would like to get the most out of it. So far, I have run many games with the Noomen positions at 40/5, but sometimes it takes more than 24 hours to finish the test, which is a bit too much for me!
It seems there are two easy ways to shorten the test: play at faster time controls or play fewer games. So the question is: which would be more effective?
I've found that even 100 games are not many, and cannot detect small changes: I usually measure a variation of 4-5% between the worst and best performance of the same version. On the other hand, if after 20 games it has scored 0 points instead of, say, the usual 4-5, is it correct to stop the test?
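A rough sanity check on that early-stopping question (a sketch only: it ignores draws and treats each game as an independent win/loss at an assumed ~20% scoring rate):

```python
# Sketch: if the usual score is about 4-5 points out of 20 (~20-25%),
# how surprising is 0 points from 20 games?  Crudely ignoring draws
# and treating games as independent Bernoulli trials:
p_usual = 0.20                      # assumed per-game scoring rate
p_zero = (1.0 - p_usual) ** 20      # chance of a 0/20 result by luck alone
print(f"P(0/20 | p={p_usual}) = {p_zero:.3f}")  # about 0.012
```

So a 0/20 start where you normally expect 4-5 points is already a roughly 1-in-100 event under this crude model, and stopping early looks defensible; the caveat is that draws and the opponent mix make the real distribution somewhat different from a pure coin-flip model.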
Just to report a peculiar experience with Kiwi... 0.5a scored 18.5 in the Gauntlet test by Karl-Heinz Süntges, and also 15-18 in my tests against Yace. Version 0.5b had a very small change in the stage calculation and the value of pawns in the endgame; it scored 56 against 0.5a and the same against Yace, yet 14.5 in the Gauntlet. Version 0.5c also beat 0.5b and scored 19.5 against Yace, Kiwi's best result, yet it was down to 13.5 in the Gauntlet. Version 0.5d scored 21.5 against Yace... it will probably drop to the bottom of the Gauntlet list if Karl-Heinz tests it again! All of this is very confusing, isn't it?
Stan Arts wrote:Hi Alessandro,
Well, unless you have some spare computers, getting an enormous number of games after each change is almost impossible for an amateur. But I tend to think that watching 10-20 games and following the engine's thinking as the programmer is worth as much as having hundreds of games with a cold strength indication.
I think that's especially true for evaluation changes, but also for search. Besides that, for both it's handy to have a collection of favourite test positions that address different parts of the chess program (for instance, positions for each type of extension, pruning-sensitive positions, zugzwang, positions that tend to cause instability, some quiet positional ones, etc.). They are a quick way of seeing what's going on, finding bugs, and so on. (But you have to be careful not to tune your program specifically on them.) I'll post my favourite positions if anyone's interested.
Stan
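Such collections are commonly kept as EPD lines: the first four FEN fields plus opcodes such as bm (best move) and id. A minimal parser sketch (a hypothetical helper, not something from this thread; the sample position is Fine #70, the classic zugzwang test where White wins with Kb1):

```python
# A tiny EPD reader sketch.  Each EPD line is the first four FEN fields
# plus ";"-separated opcodes, e.g. bm (best move) and id.  The split
# below is naive and assumes no semicolons inside quoted strings.
def parse_epd(line):
    parts = line.split(None, 4)
    position = " ".join(parts[:4])      # placement, side, castling, en passant
    opcodes = {}
    for op in parts[4].rstrip(";").split(";"):
        name, _, value = op.strip().partition(" ")
        opcodes[name] = value.strip().strip('"')
    return position, opcodes

# Fine #70, a standard zugzwang/triangulation test position:
pos, ops = parse_epd('8/k7/3p4/p2P1p2/P2P1P2/8/8/K7 w - - bm Kb1; id "Fine 70";')
print(pos, ops["bm"], ops["id"])
```

Feeding a file of such lines to the engine and checking whether it finds each bm within a time limit makes exactly the kind of quick, targeted check Stan describes.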
Alessandro Scotti wrote:I've been running a few games between Yace (PB) and Glaurung (0.2.3) at 40/4 and the results are quite interesting IMO. Each match includes 100 games from the 50 Noomen positions, Yace scores:
- match 1 = 44.5%
- match 2 = 48.0%
- match 3 = 52.5%
The average so far is 48.3% with a +/- 4% error, corresponding to +/- 30 elo. I'll try to reach 500 games total, then run the same test at 40/2 and see what happens.
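The conversion behind numbers like "±4% ≈ ±30 Elo" can be sketched as follows (assuming the usual logistic Elo model and a crude binomial error bar; the real per-game variance is a bit lower because of draws):

```python
import math

def elo_diff(score):
    """Elo difference implied by an overall score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_stderr(score, games):
    """Crude binomial standard error of the score fraction (ignores draws)."""
    return math.sqrt(score * (1.0 - score) / games)

p, n = 0.483, 300                      # Alessandro's running total
se = score_stderr(p, n)
print(f"score {p:.1%} +/- {1.96 * se:.1%} (95% interval)")
print(f"Elo   {elo_diff(p):+.1f} "
      f"(band {elo_diff(p - 1.96 * se):+.1f} to {elo_diff(p + 1.96 * se):+.1f})")
```

Near a 50% score, each percentage point of score is worth roughly 7 Elo, which is where "±4% is about ±30 Elo" comes from.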
Pallav Nawani wrote:Yace's score is steadily increasing. Is learning on?
Robert Allgeuer wrote:My view with respect to testing:
- Nothing can beat real games (although some aspects are of course better tested with test-suites, e.g. move ordering)
- 400 to 500 games is the minimum for getting a reasonable estimation of strength (+/- 30 ELO)
- Test with well-defined start positions (e.g. Nunn), not opening books
- Do not rely on self-play or games against the previous version: This is often misleading ...
- Test against a set of standard opponents, the average strength of them should be around the strength of the tested engine (score should be around 50%)
- Short time controls are OK. A shorter time control with more games and more opponents is preferable to a longer time control with fewer of both. Best, of course, is testing at several time controls, including long ones.
- When using short time controls, use a time control with increment, so that results are not dominated by losses on time
- Test features in isolation
Robert
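Robert's 400-500 game figure can be sanity-checked with a back-of-the-envelope calculation (a sketch; the ~0.4 per-game standard deviation is an assumption that depends on the draw rate):

```python
import math

def games_for_elo_margin(margin_elo, per_game_sd=0.4, z=1.96):
    """Games needed so the 95% Elo error bar stays within +/- margin_elo.
    per_game_sd ~0.4 assumes a typical draw rate; 0.5 is the no-draw worst case."""
    elo_per_point = 400.0 / math.log(10) * 4.0   # Elo-vs-score slope near 50%
    n = (z * elo_per_point * per_game_sd / margin_elo) ** 2
    return math.ceil(n)

print(games_for_elo_margin(30))        # -> 330
print(games_for_elo_margin(30, 0.5))   # -> 516 (no draws at all)
```

With a typical draw rate the answer comes out around 330 games, and with no draws just over 500, which brackets the 400-500 figure quite nicely.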
Uri Blass wrote:
I have never tested against many engines.
One of the problems I have with doing it is that I need to install the relevant engines, verify details such as making sure learning is switched off, and edit batch files.
I prefer to use WinBoard rather than other interfaces that may have bugs. I read in the past that Arena has bugs, such as hiding the time problems of some engines, and I prefer a stable interface to one that changes frequently, where the programmers fix one bug and introduce another.
If I could download WinBoard together with some engines and a batch file for testing, that would be productive, because I do not like to spend time installing new engines and writing a specific batch file for every engine I test against.
Uri
Alessandro Scotti wrote:I've now run two more matches at this time control:
- match 1 = 44.5%
- match 2 = 48.0%
- match 3 = 52.5%
- match 4 = 42.0%
- match 5 = 51.5%
(Yace/Glaurung playing both colors from the 50 Noomen positions at 40/4 on a fast P4).
I am a bit "disappointed" by the wide error range: the minimum score is 42% and the maximum 52.5%, which is a really large gap! So it seems that running "only" 100 games is not very helpful unless you get an extreme score, say 30%.
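That 42-52.5% spread is actually about what the statistics predict for 100-game matches (a sketch; the ~0.4 per-game standard deviation is an assumed typical value with draws):

```python
import math

games = 100
per_game_sd = 0.4                          # assumed; 0.5 with no draws at all
match_sd = per_game_sd / math.sqrt(games)  # sd of one match's score fraction
print(f"expected sd of a 100-game match score: {match_sd:.1%}")  # 4.0%
# Across 5 matches, a max-minus-min range of 2-3 standard deviations
# (8-12 percentage points) is entirely normal luck, not a bug.
```

In other words, single 100-game matches can only reliably flag changes much larger than a few percent, which matches the experience reported above.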
Alessandro Scotti wrote:Uri Blass wrote:If I can download winboard together with some engines and some batch file in order to test then it can be productive because I do not like to spend time on installing new engines and on writing a specific batch file for every engine that I test against it.
Hi Uri,
I run all these tests under Linux, if that is ok I can send you the (simple) scripts I use to run matches, generate statistics and so on.
Joachim Rang wrote:Switch to ShredderClassic and UCI, then it is as easy as creating different Word documents.
regards Joachim