Why are opening suites like the Nunn, Noomen, and now also the Silver suite not bigger?
A statistical test on proportions gives a maximum error of about 14% on the score ratio over 50 games at 95% confidence. If you are comparing evaluation functions, I believe this is a large number.
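For reference, the 14% figure can be reproduced with the standard normal approximation for a binomial proportion; the variance p(1-p)/n is largest at p = 0.5, which gives the worst-case (maximum) error. This is a sketch of that calculation, not anything engine-specific:

```python
import math

def max_margin_of_error(n_games: int, z: float = 1.96) -> float:
    """Worst-case half-width of the 95% confidence interval on the
    score ratio over n_games independent games. The binomial variance
    p*(1-p)/n is maximized at p = 0.5."""
    p = 0.5
    return z * math.sqrt(p * (1 - p) / n_games)

print(round(max_margin_of_error(50), 3))  # 0.139, i.e. about 14%
```

Note this treats each game as an independent win/loss trial; draws reduce the per-game variance somewhat, so the true error is a bit smaller.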
I have done some tests myself and found great differences in results across different test suites.
I would also like to know if anyone has tried to measure the variance of results between the same two engines as a function of search depth or time control (either on the same set of positions or on different ones).
My guess is that the variance will be higher at low search depths, due to the greater number of tactical errors.
But the crudeness of the evaluation functions used should perhaps also be considered here.
I find this interesting because a precise estimate of the variance would give a better (and smaller) estimate of the maximum error, but longer searches take more time to compute. What is the quickest way to get precise results when matching engines' evaluation functions: short time controls and many games, or longer time controls and fewer games?
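To make the tradeoff concrete, inverting the formula for the confidence interval gives the number of games needed to reach a target margin of error. This is a sketch under the same worst-case normal approximation as above (it ignores the draw rate and any depth-dependent variance, which is exactly what my question is about):

```python
import math

def games_needed(target_margin: float, z: float = 1.96,
                 p: float = 0.5) -> int:
    """Number of games so that the 95% confidence interval half-width
    on the score ratio is at most target_margin (worst case at p = 0.5)."""
    return math.ceil((z / target_margin) ** 2 * p * (1 - p))

# Halving the margin of error roughly quadruples the number of games:
print(games_needed(0.10))  # 97
print(games_needed(0.05))  # 385
```

The quadratic growth is the crux: if short time controls only double the per-game variance, many fast games still win; if tactical noise at low depth inflates the variance much more than that, fewer long games may be the quicker route.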