
Testing changes

PostPosted: 04 Oct 2005, 19:17
by Scott Gasch
I'm curious about what other programmers do to test code changes to their engines. Do you run test suites? Do you do self-play experiments? If so, how many games?

I always run ECM at 20 sec/move. This is a metric Bruce Moreland once talked with me about, and something I adopted as a good idea. However, as time has gone by, I have become less convinced of the merit of this test. It is certainly good for some things, but maybe not for every change.

So, if you added an experiment to your code, what would you do to convince yourself it was a good thing to keep?

Scott

Re: Testing changes

PostPosted: 04 Oct 2005, 19:23
by mathmoi
Hi Scott,

If it's a speed improvement that is not supposed to change the engine's choice at ply N, I test with test suites. First I run at 60 seconds per position to verify that nothing is broken and the engine still solves enough positions; then I run to a fixed depth N per position to see whether the change is faster and by how much.

If it's a tactical/positional improvement that should make the engine smarter, I test it on FICS. It plays there all night, which gives me nearly 1000 games a month, and I then look at the rating differences.

Mathieu Pagé

Re: Testing changes

PostPosted: 04 Oct 2005, 21:07
by Dann Corbit
First:
Run WAC at high speed as a "sanity check".
Run 500 self-play games at high speed as "sanity check level 2" (e.g. G/1 minute is fine for this). If the result is super-lopsided you can stop before 500 games, but fewer than 200 is very risky.

Additional levels of testing are a function of how many machines you can put into play.

Suggested minimum:
500 games at 2' + 2" time control. Obviously, the more games you can play, and the slower the TC, the better.
Run at least 100 test positions that check the goal of your change (quiescence? king safety? pawn structure? etc.). Does it look like the fix corrected what you wanted to solve?

If you do not see measured improvement but you like the change, then leave it in. If it gets worse, then reexamine it. If it gets better, then keep it.

Re: Testing changes

PostPosted: 05 Oct 2005, 04:50
by Daniel Shawul
Hi Scott,
I use test suites only when I need to improve tactical strength.
I usually run test games against two opponents with very different playing styles: one tactical and one positional.
I usually use 40/5 for tests; that produces at most 40 games in a 6 hour run, which I think is enough to see if there is a change.
If there is not a big difference, I keep the one I think is better.
I don't usually do self-tests because they are boring and misleading most of the time.
best
daniel

Re: Testing changes

PostPosted: 05 Oct 2005, 09:33
by Alessandro Scotti
It's amazing how many games you need to detect a small change. In CEGT, Shredder has played more than 3300 games and it still has a +/- 10 Elo error margin!
I usually run 200-300 games with 20 or so good opponents and try to get an idea of the change. If it's really bad I throw it away; if it's within the error bars I let it play another 200 games, and so on. The "reference" version I'm trying to improve upon is usually rated with at least 700 games.
Up to a few versions ago I used to test *all* changes with WAC only, for lack of a test machine. Now I use test suites as a "screening" test only if I suspect I may have broken something.
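
As a rough illustration of why so many games are needed, here is a minimal sketch. It is only a back-of-the-envelope estimate and rests on two assumed numbers that are not in the post above: a per-game score standard deviation of about 0.4 (typical when a fair share of games are drawn) and the standard logistic Elo model, whose slope at a 50% score is about 695 Elo per unit of score.

Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Assumptions (not from the post): per-game score standard deviation
       sigma ~= 0.4, and the logistic Elo model, whose slope at a 50%
       score is 400 / (ln(10) * 0.25) ~= 695 Elo per unit of score. */
    double sigma = 0.4;
    double elo_per_score = 400.0 / (log(10.0) * 0.25);
    double games[3] = { 200.0, 700.0, 3300.0 };
    int i;

    for (i = 0; i < 3; i++) {
        /* 95% two-sided error margin on the measured Elo difference. */
        double margin = 1.96 * sigma / sqrt(games[i]) * elo_per_score;
        printf("%5.0f games -> about +/- %2.0f Elo (95%% margin)\n",
               games[i], margin);
    }
    return 0;
}

With those assumptions the margin is still around +/- 39 Elo after 200 games and +/- 21 after 700, and only drops to roughly +/- 10 after about 3300, which is in the same ballpark as the CEGT figure above.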

Re: Testing changes

PostPosted: 05 Oct 2005, 16:27
by Dann Corbit
Daniel Shawul wrote:Hi Scott,
I use test suites only when I need to improve tactical strength.
I usually run test games against two opponents with very different playing styles: one tactical and one positional.
I usually use 40/5 for tests; that produces at most 40 games in a 6 hour run, which I think is enough to see if there is a change.
If there is not a big difference, I keep the one I think is better.
I don't usually do self-tests because they are boring and misleading most of the time.
best
daniel


Statistically, 40 games is a huge risk unless the result is very, very lopsided.
See the analysis in Ernst A. Heinz's book "Scalable Search in Computer Chess".

Re: Testing changes

PostPosted: 07 Oct 2005, 04:08
by Scott Gasch
Dann Corbit wrote:
Daniel Shawul wrote:Hi Scott,
I use test suites only when I need to improve tactical strength.
I usually run test games against two opponents with very different playing styles: one tactical and one positional.
I usually use 40/5 for tests; that produces at most 40 games in a 6 hour run, which I think is enough to see if there is a change.
If there is not a big difference, I keep the one I think is better.
I don't usually do self-tests because they are boring and misleading most of the time.
best
daniel


Statistically, 40 games is a huge risk unless the result is very, very lopsided.
See the analysis in Ernst A. Heinz's book "Scalable Search in Computer Chess".


Sometimes I wish I had paid more attention in my stat class in college. Of all the subjects, it's the one I keep wishing I knew more about. There must be some tradeoff between the number of games and the statistical significance of the results, but I don't know anywhere near enough to even guess what it would be.

Scott

Re: Testing changes

PostPosted: 07 Oct 2005, 07:50
by Volker Annuss
Scott Gasch wrote:Sometimes I wish I had paid more attention in my stat class in college. Of all the subjects, it's the one I keep wishing I knew more about. There must be some tradeoff between the number of games and the statistical significance of the results, but I don't know anywhere near enough to even guess what it would be.


Hi Scott,

when your engine plays n games against an opponent of equal strength,
the result can be expected to fall between n/2 - sqrt(n/2) and n/2 + sqrt(n/2), but this is only a quick guess.
To get a better estimate, look at the confidence interval reported by bayeselo or ELOstat.
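
As a quick illustration, here is a minimal sketch that simply tabulates this quick-guess band for a few match lengths. Under a pure win/loss (binomial) model the standard deviation of the result is sqrt(n)/2, so the sqrt(n/2) half-width above corresponds to roughly 1.4 standard deviations.

Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    int n;

    /* Quick-guess band n/2 +/- sqrt(n/2) for a few match lengths.
       Under a binomial model the standard deviation of the win count
       is sqrt(n)/2, so sqrt(n/2) is roughly 1.4 standard deviations. */
    for (n = 40; n <= 1000; n *= 5) {      /* 40, 200, 1000 games */
        printf("%4d games: expected result %5.1f +/- %4.1f\n",
               n, n / 2.0, sqrt(n / 2.0));
    }
    return 0;
}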

Greetings,
Volker

Edit: changed n to n/2

Re: Testing changes

PostPosted: 07 Oct 2005, 19:11
by Dann Corbit
Code:
/*
Here is a coin toss simulator that uses the Mersenne Twister.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html
If you look at the experiments at the end, consider the possible error as a percentage function of the number of games in a set.

This will produce a result set that is just about as close to perfectly fair as is possible to get.

The 30 trial result should strike fear into your heart.
*/
#include <stdio.h>
#include <stdlib.h>             /* atoi */
#include <math.h>
#include "mt19937ar.h"

static char string[32767];
int win[1000];
int loss[1000];

int main(void)
{
    int i, j;
    int n;
    double maxdisparity = 0;
    unsigned long init[4] = {0x123, 0x234, 0x345, 0x456}, length = 4;

    puts("How many trials?");
    if (fgets(string, sizeof string, stdin)) {
        init_by_array(init, length);
        n = atoi(string);       /* coin flips per trial; always 1000 trials */
        printf("1000 trials with %d experiments\n", n);
        for (i = 0; i < 1000; i++) {
            int wins = 0;
            int losses = 0;
            /* Flip n fair coins. */
            for (j = 0; j < n; j++)
                if (genrand_int32() % 2)
                    wins++;
                else
                    losses++;

            win[i] = wins;
            loss[i] = losses;
            /* Track the largest |wins - losses| over all 1000 trials. */
            if (fabs(wins - losses) > maxdisparity)
                maxdisparity = fabs(wins - losses);
        }
        printf("Maximum disparity was %.0f\n", maxdisparity);
    } else {
        puts("Error reading count of trials from standard input.");
    }
    return 0;
}
/*
U:\mt>mtsim
How many trials?
30
1000 trials with 30 experiments
Maximum disparity was 18

U:\mt>mtsim
How many trials?
1000
1000 trials with 1000 experiments
Maximum disparity was 102

U:\mt>mtsim
How many trials?
10000
1000 trials with 10000 experiments
Maximum disparity was 328

*/

Re: Testing changes

PostPosted: 07 Oct 2005, 19:23
by Dann Corbit
Some extremes:

Big experiment totals:
----------------------------
U:\mt>mtsim
How many trials?
1000000
1000 trials with 1000000 experiments
Maximum disparity was 3604

U:\mt>mtsim
How many trials?
100000
1000 trials with 100000 experiments
Maximum disparity was 1224

Small experiment totals:
----------------------------
U:\mt>mtsim
How many trials?
8
1000 trials with 8 experiments
Maximum disparity was 8

U:\mt>mtsim
How many trials?
9
1000 trials with 9 experiments
Maximum disparity was 9

U:\mt>mtsim
How many trials?
10
1000 trials with 10 experiments
Maximum disparity was 10

U:\mt>mtsim
How many trials?
11
1000 trials with 11 experiments
Maximum disparity was 9

U:\mt>mtsim
How many trials?
12
1000 trials with 12 experiments
Maximum disparity was 10

U:\mt>mtsim
How many trials?
13
1000 trials with 13 experiments
Maximum disparity was 11

U:\mt>mtsim
How many trials?
14
1000 trials with 14 experiments
Maximum disparity was 12

U:\mt>mtsim
How many trials?
15
1000 trials with 15 experiments
Maximum disparity was 13

U:\mt>mtsim
How many trials?
16
1000 trials with 16 experiments
Maximum disparity was 14

U:\mt>mtsim
How many trials?
17
1000 trials with 17 experiments
Maximum disparity was 15

U:\mt>mtsim
How many trials?
18
1000 trials with 18 experiments
Maximum disparity was 14

U:\mt>mtsim
How many trials?
19
1000 trials with 19 experiments
Maximum disparity was 15

U:\mt>mtsim
How many trials?
20
1000 trials with 20 experiments
Maximum disparity was 16

Re: Testing changes

PostPosted: 08 Oct 2005, 17:48
by Steve Maughan
Scott & Volker

Volker Annuss wrote:...between n/2 - sqrt(n/2) and n/2 + sqrt(n/2)


If you assume the distribution is binomial (it's actually trinomial, but binomial is a close approximation), then the variance is given by

Variance = p(1 - p)n

where p = probability of a win and n = number of games.

If the machines are the same strength then p = 0.5, so the variance is n/4 and the standard deviation is sqrt(n)/2. The 95% confidence limits are roughly 1.96 standard deviations from the mean, which gives:

Mean = n/2 +/- 1.96 * sqrt(n)/2

For a 40 game match this is a mean of 20 with a +/- of about 6.2. So the new version would need to win roughly 27 v 13 for you to be statistically sure it's an improvement.
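
The same calculation as a small sketch, using the binomial approximation described above (the 1.96 multiplier is the standard two-sided 95% value):

Code:
#include <stdio.h>
#include <math.h>

int main(void)
{
    double n = 40.0;                      /* games in the match */
    double p = 0.5;                       /* equal strength assumed */
    double sd = sqrt(p * (1.0 - p) * n);  /* standard deviation of the win count */
    double half = 1.96 * sd;              /* 95% two-sided band */

    printf("mean %.1f, 95%% band +/- %.1f, so roughly %.0f wins needed out of %.0f\n",
           n * p, half, ceil(n * p + half), n);
    return 0;
}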

Steve