Winboard Forum

by **Uri Blass** » 18 Feb 2006, 08:28

I know that testing version x+1 against version x may be misleading, and I have read, based on the experience of testers, that the Chessmaster personality that scores best against other Chessmaster personalities is not the best against other programs.

My question is the following.

Suppose that you have two Chessmaster personalities and the result of a 100-game Noomen match between them.

What is the minimal result at which you can be practically sure that the winner is also better against other programs? (It seems to me that if you get 90-10, you can be sure that the winner is also better against other programs.)

Another question:

What is the maximal result that you have seen between A and B in a 100-game match where A won the match but was not better than B against other programs? (You can of course use examples of programs that are not clones, like Fritz and Toga1.0.)

Uri

by **Steve Maughan** » 18 Feb 2006, 14:51

Have a look at a little utility that I wrote many years ago:

http://www.stevemaughan.com/whoisbetter.htm

It uses the binomial distribution to test significance. Others have expanded upon this, e.g. Remi Coulom.
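To illustrate the kind of binomial significance test mentioned here, the following is a minimal sketch in Python. It is not the code of the linked utility, and the function names `p_value_at_least` and `min_wins_for_significance` are made up for this example. It assumes that draws are simply discarded and that, under the null hypothesis, each decisive game is an independent coin flip between two equally strong programs.

```python
from math import comb


def p_value_at_least(wins: int, losses: int) -> float:
    """One-sided p-value of a match result under the 'equal strength' hypothesis.

    Assumption: draws are ignored, and each decisive game is an independent
    50/50 event. Returns the probability of seeing at least `wins` wins
    in wins + losses decisive games.
    """
    n = wins + losses
    # Sum the upper tail of Binomial(n, 0.5) exactly with integer arithmetic.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


def min_wins_for_significance(decisive_games: int, alpha: float = 0.05) -> int:
    """Smallest number of wins whose one-sided p-value is at most `alpha`.

    Returns decisive_games + 1 if even a clean sweep is not significant.
    """
    for wins in range(decisive_games + 1):
        if p_value_at_least(wins, decisive_games - wins) <= alpha:
            return wins
    return decisive_games + 1
```

A call such as `p_value_at_least(90, 10)` gives the chance that a 90-10 score in decisive games arises from pure noise between equal programs. Note that this only addresses statistical noise in the head-to-head match; it says nothing about how the winner performs against other opponents.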

Steve

by **Uri Blass** » 18 Feb 2006, 15:39

Steve Maughan wrote:Have a look at a little utility that I wrote many years ago:

http://www.stevemaughan.com/whoisbetter.htm

It uses the binomial distribution to test significance. Others have expanded upon this, e.g. Remi Coulom.

Steve

The problem is that a program may be stronger against its previous version but not stronger against other programs.

You need to be sure not only that statistical noise did not affect the result, but also that the program is not better only against itself, so you need a higher result.

My question is basically: what is the best result that A got against B in a 100-game match while still being weaker than B based on results against other programs?

Uri

by **Casper W. Berg** » 21 Feb 2006, 13:43

Unfortunately you can't test this.

If program A is really better than B, but not better at playing chess in general (i.e. not better against other engines), no amount of games between A and B will reveal this fact.

But the chance of finding a local optimum against variants of the same program (like Chessmaster) is probably larger than between two completely different engines, because the searches/evals will differ more in the latter case.

To answer your questions, you would need to know the chance of hitting a local optimum plus the statistical distribution of how much better these local optima are, which I dare say is impossible to find in general.

To get reliable results you need to test against different opponents...
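As a rough illustration of testing against different opponents, here is a minimal Python sketch that compares A and B by their pooled scores against the same set of other engines instead of by their head-to-head result. The function names `score_stats` and `gauntlet_z` are hypothetical, and the approach rests on simplifying assumptions: all games are treated as independent, results against all opponents are pooled into one total, and a normal approximation is used for the difference.

```python
from math import sqrt


def score_stats(wins: int, draws: int, losses: int) -> tuple[float, float]:
    """Mean score per game (win=1, draw=0.5, loss=0) and its standard error."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance: E[score^2] - mean^2, with score^2 = 1, 0.25 or 0.
    variance = (wins + 0.25 * draws) / n - mean ** 2
    return mean, sqrt(max(variance, 0.0) / n)


def gauntlet_z(results_a, results_b) -> float:
    """z-score for 'A scores better than B' against a common set of opponents.

    results_a, results_b: lists of (wins, draws, losses) tuples, one tuple
    per opponent. Assumption: games are independent and simply pooled.
    """
    mean_a, se_a = score_stats(*(sum(col) for col in zip(*results_a)))
    mean_b, se_b = score_stats(*(sum(col) for col in zip(*results_b)))
    spread = sqrt(se_a ** 2 + se_b ** 2)
    if spread == 0.0:
        # Both pooled results have no variation (e.g. all wins or all losses).
        diff = mean_a - mean_b
        return 0.0 if diff == 0 else float("inf") * (1 if diff > 0 else -1)
    return (mean_a - mean_b) / spread
```

A z-score around 2 or more would suggest that A's advantage against the gauntlet is unlikely to be noise alone, whatever the A-versus-B match showed.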

Casper

What result is significant?
